At Forage AI, we’ve spent years wrestling with a problem that has plagued the document processing industry: the impossible choice between security and capability. Do you go with ML-based extraction that’s secure but inflexible? Or do you embrace AI that’s versatile but can’t handle the nuance of complex document structures?
We refused to accept this compromise. And what we built has changed how organizations process documents.
Our integration combines the precision of ML-trained annotations with the intelligence of AI-driven prompts. This hybrid approach has changed how we and our clients handle everything from confidential finance documents like Schedules of Investments (SOIs) and Form 5500s to everyday public receipts and invoices. The results: month-end processing that wraps up in hours instead of days, with confidential data never leaving the client’s environment.
Why We Built Something Different
Let’s be honest about traditional ML-based document processing. Yes, it works. Yes, your data stays protected within your servers. But it comes with significant limitations that we saw our clients struggling with daily:
- The Versatility Problem: ML models trained on specific document types struggle when you throw them a curveball. A model trained on standard invoices might fumble when presented with a receipt from a mom-and-pop shop with a completely different layout.
- The Standardization Nightmare: After extraction, you’re left with data in various formats that need to be manually standardized. We watched our clients spend hours on tedious work — clicking through rows, reformatting dates, normalizing currency formats. We knew there had to be a better way.
- The Speed Issue: When processing diverse document sets, traditional ML models slow to a crawl. Each document variation requires additional processing time, creating bottlenecks in workflows.
And here’s the thing: while ML models are precise at extraction, they’re terrible at post-extraction manipulation. Ask them to reformat 1,000 rows of data into a different structure? You’re looking at manual work or complex custom scripting.
So why not just use LLMs instead?
Two reasons. First, for confidential documents, ML-based extraction that runs entirely inside your own environment is non-negotiable. When you’re dealing with sensitive financial information, employee data, or proprietary business details, you can’t afford to send that data to external AI services. Your data security cannot be compromised.
Second, anyone who’s attempted full-scale Form 5500 extraction using just an LLM knows the frustration. The AI might understand the context, but it struggles with structural precision. It misaligns data. It confuses similar-looking characters. It fails at complex table extraction.
So if pure ML is too rigid and pure LLM is too imprecise, what’s the answer? Both — applied strategically.
This insight became the foundation of everything we built next.
Our Breakthrough: A Customized Multi-Stage Approach
You can think of our approach like visiting an ophthalmologist. They don’t just hand you glasses and call it a day. They use multiple lenses — different powers, different angles — to understand exactly how your eye works. Our ML-trained models work the same way for documents.
For example, we use specialized annotations for columns, cells, headers, tables, and rows that work together to understand document structure at a granular level. This is how we distinguish between a zero and the letter O. This is how we understand that a value in one column relates to a header three rows above it. This is how we handle tables that span multiple pages or receipts where items are listed without clear table structures.
Try doing that with a plain AI prompt. Go ahead, we’ll wait.
How our three-stage pipeline works
That’s the ML foundation — structural precision no plain LLM can match. But extraction is only stage one.
The full pipeline has three stages: extraction, standardization, and automated QA. Each handles confidential and public documents differently. For confidential documents, AI generates the logic but never touches your data — your infrastructure executes everything locally. For public documents, AI handles it end-to-end.
Here’s how it comes together.
Stage 1: Extraction
Confidential Documents (SOIs, Form 5500s, Tax Documents)
We use ML-based extraction as the foundation for all confidential document processing. Our pre-trained models, with their multi-layered annotations, handle the heavy lifting:
- Column annotations identify vertical data structures.
- Cell annotations pinpoint exact value locations.
- Header annotations establish context and relationships.
- Table annotations map complex multi-page structures.
- Row annotations ensure data integrity across related fields.
This extraction happens entirely within your secure environment. Your compliance team can sleep soundly, knowing that no data leaves your servers.
Our ML models read documents like expert analysts who’ve seen thousands of Form 5500s. They know where to look for plan information, participant counts, and financial data — even when the layout varies slightly between plan years or administrators.
Public Documents (Receipts, Invoices, Purchase Orders)
For public documents, we unleash the full power of AI-based extraction. Public documents don’t carry the same security constraints, so we leverage the contextual understanding and adaptability of large language models to their fullest extent.
Our AI excels at:
- Understanding merchant names regardless of font or formatting.
- Interpreting line items even when layouts are unconventional.
- Handling multi-language receipts.
- Pulling key-value pairs from unstructured text blocks with remarkable accuracy.
- Extracting numbers embedded within paragraphs or descriptions that traditional OCR systems miss.
Our AI doesn’t just extract text — it understands context. It knows that “$12.99” next to “Coffee Beans – Premium Blend” represents a purchase, even if there’s no clear table structure. It can parse through dense text blocks and identify critical information like “Invoice #12345” or “Due Date: 30 days from receipt” without requiring explicit template mapping.
Stage 2: Standardization
This is where the real magic happens — and where this approach fundamentally departs from traditional solutions.
Confidential Documents
After extraction, our AI-powered system handles standardization without ever exposing your sensitive data. Here’s how it works:
Instead of sending your extracted data to an LLM, we send only the data transformation requirement. Our AI agent understands your prompt and writes Python code to perform the required data manipulation. That code then executes locally on your data within your secure environment.
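A minimal sketch of that separation, assuming the request to the LLM contains the requirement plus column names and types (the schema part is an assumption; the text above only says the requirement is sent). The function names `describe_schema` and `build_codegen_prompt` and the sample row are hypothetical, made up for illustration.

```python
import json


def describe_schema(rows: list[dict]) -> dict[str, str]:
    """Derive column names and Python type names locally, without copying any values."""
    schema: dict[str, str] = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema


def build_codegen_prompt(requirement: str, schema: dict[str, str]) -> str:
    """Build the text sent to the LLM: the requirement and the schema, never the rows."""
    return (
        "Write a Python function transform(rows) that takes a list of dicts.\n"
        f"Columns and types: {json.dumps(schema, sort_keys=True)}\n"
        f"Requirement: {requirement}"
    )


# Made-up extracted row that stays on the local machine.
rows = [{"plan_name": "Acme 401(k)", "plan_year_end": "12/31/2023", "total_assets": "1,250,000.5"}]

prompt = build_codegen_prompt(
    "Standardize all date fields to YYYY-MM-DD format.",
    describe_schema(rows),
)
print(prompt)
```

The prompt contains field names such as `plan_year_end` but none of the values. If column names are themselves sensitive, they would need to be aliased before this step.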
Let’s say you’ve extracted data from 50 different Form 5500s, each with slightly different date formats, currency notations, or categorical labels. Previously, you’d need team members to manually standardize each entry or developers to write custom scripts.
Now? One prompt: “Standardize all date fields to YYYY-MM-DD format, convert all currency values to USD with two decimal places, and calculate year-over-year growth percentages.”
The AI generates the transformation code. Your system executes it. Thousands of rows, standardized in seconds. The LLM handles the logic; your infrastructure handles the data.
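For a sense of what generated transformation code for that prompt could look like, here is a hedged sketch using only the standard library. It is an illustration, not actual system output: the helper names and the list of accepted date formats are assumptions, amounts are assumed to already be in USD (real currency conversion would need an exchange-rate source), and `%m/%d/%Y` assumes US-style month-first dates.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

# Assumed set of input formats; a real run would list the formats seen in the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")
CENTS = Decimal("0.01")


def to_iso_date(text: str) -> str:
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def to_amount(text: str) -> Decimal:
    """Strip currency symbols and thousands separators, then round to two decimals."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned).quantize(CENTS, rounding=ROUND_HALF_UP)


def yoy_growth_pct(previous: Decimal, current: Decimal) -> Optional[Decimal]:
    """Year-over-year growth in percent; None when the prior value is zero."""
    if previous == 0:
        return None
    growth = (current - previous) / previous * 100
    return growth.quantize(CENTS, rounding=ROUND_HALF_UP)


print(to_iso_date("December 31, 2023"))
print(to_amount("$1,250,000.5"))
print(yoy_growth_pct(Decimal("1000000"), Decimal("1250000")))
```

Using `Decimal` rather than `float` keeps two-decimal rounding exact, which matters when the standardized values later feed reconciliation checks.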
Public Documents
For public documents, AI handles both extraction and standardization end-to-end. The same AI that understood the document structure immediately transforms the data to your specifications:
- Normalize merchant names (Starbucks, SBUX, Starbucks Coffee → Starbucks).
- Categorize expenses automatically (Coffee Beans → Beverages).
- Convert currencies using current exchange rates.
- Map vendor information to your existing database.
- Extract and structure key-value pairs from unstructured invoice notes.
- Parse complex pricing structures with discounts, taxes, and fees.
Our AI doesn’t just follow rules — it understands intent. It handles edge cases and ambiguities that would require constant ML model updates.
Stage 3: Automated QA
Quality assurance is where most intelligent document processing (IDP) implementations fall apart. You can’t afford to trust automated extraction blindly, but manual QA creates bottlenecks. We’ve built an automated QA system that catches errors with mathematical precision while adapting to your business rules.
Confidential Documents
The same pattern applies: you describe your validation requirements, the AI generates the code, and your system runs it locally.
Here’s what those prompts might look like:
- “Flag any entries where total compensation falls outside the 5th-95th percentile for similar plan sizes.”
- “Identify any forms where reported participant counts decreased by more than 20% year-over-year.”
- “Check for missing required fields based on plan type.”
- “Validate that the sum of individual contribution amounts equals total contributions reported.”
This goes beyond simple null checks or format validation. We’re talking complex cross-field validations, statistical anomaly detection, and business rule enforcement — all without exposing your underlying data to external services.
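Two of the prompts above can be sketched as locally executed checks. This is an illustration under assumptions: the field names (`contributions`, `total_contributions`) and function names are hypothetical, and the 20% threshold comes from the example prompt.

```python
from decimal import Decimal


def check_contribution_total(form: dict) -> list[str]:
    """Flag forms whose individual contributions do not add up to the reported total."""
    computed = sum(form["contributions"], Decimal("0"))
    if computed != form["total_contributions"]:
        return [f"contributions sum to {computed}, form reports {form['total_contributions']}"]
    return []


def check_participant_drop(previous_count: int, current_count: int,
                           max_drop: float = 0.20) -> list[str]:
    """Flag a year-over-year decrease in participants larger than max_drop."""
    if previous_count > 0:
        drop = (previous_count - current_count) / previous_count
        if drop > max_drop:
            return [f"participant count fell {drop:.0%} year-over-year"]
    return []


# Made-up form values for illustration.
form = {
    "contributions": [Decimal("100.00"), Decimal("250.50")],
    "total_contributions": Decimal("350.00"),
}
print(check_contribution_total(form))
print(check_participant_drop(1000, 790))
```

Each check returns a list of issue strings, so results from many checks can be concatenated into one report per form.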
Public Documents
For public documents, AI-based QA becomes even more powerful. Let’s take invoices as a concrete example.
Mathematical Validation: You’ve extracted product name, quantity, unit price, and line total. The automated QA immediately validates: quantity × unit_price = line_total. If there’s a discrepancy, the document gets flagged for review.
This might sound simple, but it catches a remarkable number of errors:
- OCR misreads that turn “3” into “8”
- Calculation errors from poorly designed invoice templates
- Data entry mistakes from manual invoice creation
- Potential fraud indicators
We’ve implemented gating based on error types. Critical mathematical discrepancies? Those documents get held in a review queue. Minor formatting inconsistencies? Auto-corrected and processed. Suspicious patterns that might indicate fraud? Escalated to a specialist queue.
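The line-total check and the gating described above could be sketched as follows. The `Route` names, the one-cent tolerance, and the rule that a fraud signal takes precedence over other outcomes are assumptions made for this illustration.

```python
from decimal import Decimal
from enum import Enum


class Route(Enum):
    AUTO_PROCESS = "auto_process"
    REVIEW_QUEUE = "review_queue"
    SPECIALIST_QUEUE = "specialist_queue"


def line_total_matches(line: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check quantity x unit_price = line_total, allowing for rounding."""
    expected = line["quantity"] * line["unit_price"]
    return abs(expected - line["line_total"]) <= tolerance


def route_invoice(lines: list[dict], fraud_suspected: bool = False) -> Route:
    """Pick a queue: fraud signals first, then math discrepancies, else auto-process."""
    if fraud_suspected:
        return Route.SPECIALIST_QUEUE
    if not all(line_total_matches(line) for line in lines):
        return Route.REVIEW_QUEUE
    return Route.AUTO_PROCESS


good = {"quantity": Decimal("3"), "unit_price": Decimal("12.99"), "line_total": Decimal("38.97")}
# Same line with the quantity misread by OCR as 8 instead of 3.
misread = {**good, "quantity": Decimal("8")}

print(route_invoice([good]))
print(route_invoice([good, misread]))
```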
Additional AI-powered QA capabilities include:
- Cross-referencing expenses against historical patterns
- Detecting potential duplicate submissions with fuzzy matching
- Flagging unusual pricing or category mismatches
- Verifying that subtotals, taxes, and discounts reconcile to grand totals
- Validating that vendor details match your approved supplier list
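As one illustration of the duplicate-detection item, here is a small sketch that uses Python's standard `difflib.SequenceMatcher` for fuzzy string comparison. The post does not say which matching technique is used, so this is a stand-in; the field names, the 0.85 threshold, and the sample invoices are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_possible_duplicates(invoices: list[dict], threshold: float = 0.85) -> list[tuple]:
    """Return id pairs with equal totals and near-identical vendor and invoice number."""
    pairs = []
    for first, second in combinations(invoices, 2):
        if (
            first["total"] == second["total"]
            and similarity(first["vendor"], second["vendor"]) >= threshold
            and similarity(first["number"], second["number"]) >= threshold
        ):
            pairs.append((first["id"], second["id"]))
    return pairs


invoices = [
    {"id": "a", "vendor": "Starbucks Coffee", "number": "INV-12345", "total": "38.97"},
    {"id": "b", "vendor": "Starbucks Cofee", "number": "INV 12345", "total": "38.97"},
    {"id": "c", "vendor": "Office Depot", "number": "INV-99881", "total": "38.97"},
]
print(find_possible_duplicates(invoices))
```

Comparing every pair is quadratic in the number of invoices, so a larger batch would first group candidates by total or vendor before running the fuzzy comparison.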
The Learning Loop
Here’s what ties it all together: the system learns from corrections and becomes increasingly accurate at identifying anomalies specific to your business context.
Each time a flagged document is reviewed and either confirmed as an error or marked as a false positive, the models learn. We’re not just executing static rules — we’re developing an understanding of what “normal” looks like for your specific business.
This self-improving capability means the system gets smarter with every document processed, every correction made, every validation completed. The more you use it, the better it gets.
So that’s the full pipeline: extraction, standardization, and QA that actually learns. But not every IDP solution on the market can pull this off.
Tips for Choosing an IDP Solution
If you’re shopping around for an IDP solution (or just wondering if your current setup is outdated), here are three capabilities that separate modern solutions from legacy tools.
Parallel Processing Over Sequential Pipelines
Traditional IDP systems process documents one page at a time — page 1, then page 2, then page 3. For a 100-page contract, that’s painfully slow. For 1,000 invoices at month-end, it’s a bottleneck.
Look for systems that process pages in parallel across distributed resources. All pages of a 100-page contract should be processed simultaneously. Volume spikes (tax season, Black Friday, quarter-end) shouldn’t slow you down.
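On a single machine, the difference between sequential and parallel page handling can be sketched with the standard library's `concurrent.futures`. Here `extract_page` is a hypothetical stand-in for a real per-page extraction call, and a thread pool is only one possible design; a distributed system would fan pages out to separate workers instead.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_text: str) -> dict:
    """Hypothetical stand-in for a per-page extraction call."""
    return {"word_count": len(page_text.split())}


def process_document(pages: list[str], max_workers: int = 8) -> list[dict]:
    """Submit all pages at once; results come back in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_page, pages))


pages = ["Invoice #12345", "Due Date: 30 days from receipt", ""]
print(process_document(pages))
```

Threads suit I/O-bound work such as calls to an extraction service. CPU-bound extraction would more likely use `ProcessPoolExecutor` or separate machines.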
Also consider adaptability. Traditional ML systems require retraining when document formats change, which can take days or weeks. Prompt-based systems let you adapt by updating instructions, not rebuilding models. New vendor template? New compliance requirement? You should be able to adjust in hours, not weeks.
Unified Pipelines Over Extract-Export-Reformat-Import Workflows
You know the drill: extract in one system, export to Excel, manually reformat, import to another system. Each handoff introduces errors and delays.
Modern IDP should give you one continuous pipeline — extraction, standardization, and validation in a single flow. Document in, clean data out, no manual steps in between.
Code Generation for Custom Formats and Logic
Every organization has specific requirements: custom output formats, business-specific validation rules, integrations with proprietary systems.
Traditional solutions require weeks of custom development for each new requirement. Better systems let you describe what you need in plain language and generate the transformation code for you.
Need your invoice data in JSON? CSV? Direct database insertion? Custom API payload? You should be able to specify this in a prompt and get production-ready code — often same-day. For confidential documents, that code runs in your environment. For public documents, the system can execute it directly.
No more brittle transformation scripts that break with every slight change in requirements.
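To make the output-format point concrete, here is a small sketch of the kind of code such a prompt might yield for JSON and CSV, using only the standard library. The function names and the sample row are hypothetical.

```python
import csv
import io
import json


def to_json(rows: list[dict]) -> str:
    """Serialize extracted rows as a JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows: list[dict]) -> str:
    """Serialize extracted rows as CSV, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"invoice_number": "12345", "total": "38.97"}]
print(to_json(rows))
print(to_csv(rows))
```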
What This Looks Like in Practice
Enough theory. Here’s what happens when it actually works.
Scenario 1: Month-End Invoice Processing
One of our clients receives 5,000 invoices from 200+ vendors every month. Half come from repeat vendors with familiar formats; the other half come from new vendors or one-time suppliers, or arrive in unusual layouts.
Here’s how it played out:
- Familiar invoices: Our system recognized patterns from continuous learning and processed them rapidly with high confidence.
- New formats: Our AI adapted without retraining and extracted key-value pairs from unstructured sections.
- All invoices: Automated QA validated every calculation — quantities, prices, totals, taxes, discounts.
- Flagged documents: Only 3% required human review due to mathematical discrepancies or missing required fields.
- Processing time: Under 2 hours vs. 3-5 days with their previous solution.
Scenario 2: Annual Form 5500 Compliance
A benefits administration firm came to us with 300 Form 5500s for their clients. Each form had slight variations based on plan type, administrator, and year.
Here’s what we delivered:
- Extraction: Our ML models handled complex table structures, multi-page schedules, and plan-specific sections.
- Standardization: Our system generated Python code to normalize dates, currencies, and categorical data — executed locally on their servers.
- QA: AI-generated validation code checked required fields, cross-schedule consistency, statistical outliers, and calculations — all while running on their infrastructure.
- Data confidentiality: Maintained throughout. Our LLMs never saw actual plan data, participant information, or financial details.
- Processing time: 300 forms fully processed, validated, and standardized in under 4 hours.
Scenario 3: Mixed Document Workflows
An operations team uses our hybrid approach to process both confidential employment agreements and public vendor receipts.
The setup:
- Employment agreements: ML extraction + AI-generated manipulation code for confidential data handling.
- Vendor receipts: Full AI pipeline with continuous learning improving accuracy.
- Parallel processing: Both document types processed simultaneously without security compromise.
- Unified QA: Consistent validation approaches adapted to document sensitivity levels.
- Scalability: Volume spikes handled seamlessly.
Conclusion
When we first started working on this, it genuinely felt like you had to choose — ML for security, or AI for flexibility. That was just how document processing worked. You picked one and lived with the gaps.
But the more we dug in, the more we realized these technologies weren’t opposites. They were complements. ML could handle the structural precision. AI could handle the adaptability. The trick was knowing when to use which — and building a pipeline that could do both without compromising on either.
That’s what we’ve built. And seeing it work for our clients — compliance deadlines met without the fire drill, month-end processing wrapped up in hours instead of days, confidential data staying exactly where it should — that’s what makes this work worth it.
If you’re navigating the same tradeoffs we once were, we’d love to hear from you. Get in touch!