“Data extraction” sounds like one task. It is really a family of them. Pulling rows from a database, parsing a JSON feed, reading a scanned contract, and scraping a live web page are all data extraction, yet each needs a different tool, runs on a different cadence, and breaks in a different way.
The way to make sense of it is to separate two questions. First, what type of data are you extracting? Second, what type of extraction does that data call for? This guide maps both: it starts with the types of data, then categorizes the types of data extraction by source structure, technical approach, frequency, and method.
Quick Digest
- Two lenses: the type of data you extract, and the type of extraction you use to get it.
- Data by structure: structured, semi-structured, and unstructured, in rising order of difficulty.
- Data by format: text, tabular, image, and audio/video/social.
- Extraction by source structure: databases, file feeds, documents, and the web each demand a different entry point.
- Extraction by technical approach: parsing, web scraping, OCR, and NLP/ML (IDP is one AI-based flavor, not the whole field).
- Extraction by frequency: one-time/batch, scheduled/incremental, or real-time/streaming.
- Extraction by method: manual, scripted/DIY, automated pipeline, or managed service.
Part 1: The types of data you extract
Before you pick a method, you have to name the data. It varies on two axes: how structured it is, and what format it arrives in.
By structure: structured, semi-structured, unstructured
This is the most important distinction, because it predicts how hard the extraction will be. The less structure the data has, the more intelligence it takes to pull it cleanly.
| Structure | What it is | Examples | Extraction implication |
|---|---|---|---|
| Structured | Rigid, predefined schema | Relational databases, spreadsheets, fixed-schema logs | Easiest: query or connect directly |
| Semi-structured | Partial structure with tags or markers | JSON, XML, emails, NoSQL documents | Parse the markers; format is flexible but readable |
| Unstructured | No predefined format | PDFs, scans, images, audio, video, free text | Hardest: needs OCR, NLP, or computer vision |
By commonly cited industry estimates, 80–90% of enterprise data is unstructured, which is why it tends to hold the most untapped value and demand the most capable extraction. Converting it into structured, schema-aligned output is the real work. Legal teams, for example, lean on dedicated contract data extraction to turn dense agreements into validated fields.
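The difficulty gradient in the table can be seen in a minimal sketch that pulls the same invoice total from each structure. The table name, JSON keys, and sample sentence are all hypothetical, and the regex on the free text is only a brittle stand-in for the OCR or NLP a real unstructured source would need.

```python
import json
import re
import sqlite3

# Structured: the schema is known up front, so extraction is a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, vendor TEXT, total REAL)")
conn.execute("INSERT INTO invoices VALUES (1, 'Acme', 120.50)")
structured_rows = conn.execute("SELECT vendor, total FROM invoices").fetchall()
conn.close()

# Semi-structured: no rigid schema, but keys mark where each value lives.
feed = '{"invoice": {"vendor": "Acme", "lines": [{"amount": 100.0}, {"amount": 20.5}]}}'
doc = json.loads(feed)
semi_total = sum(line["amount"] for line in doc["invoice"]["lines"])

# Unstructured: free text has neither schema nor markers.
# This pattern only works for this exact phrasing (illustrative assumption).
note = "Please remit the total of $120.50 to Acme by Friday."
match = re.search(r"\$(\d+(?:\.\d{2})?)", note)
unstructured_total = float(match.group(1)) if match else None
```

The first two cases stay correct when the wording around the data changes. The third breaks as soon as the sentence is phrased differently, which is the gap that OCR and NLP are meant to close.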
By format: text, tables, images, and beyond
Within those structures, data shows up in distinct formats, and each format needs its own handling.
| Format | Common forms | What it takes to extract |
|---|---|---|
| Text | Plain text and logs, Word/RTF, PDFs, emails, HTML | Parsing plus NLP; OCR first if the text is scanned |
| Tabular | Excel, CSV/TSV, Google Sheets, tables inside PDFs | Table detection and cell-boundary mapping |
| Image | Scanned documents, photos, diagrams, handwriting | OCR plus computer vision |
| Audio, video & social | Call recordings, video frames, social posts, chat logs | Speech-to-text, frame analysis, and NLP |
Tables deserve a special mention: they are among the hardest formats to extract reliably because structure and meaning are encoded in position. A merged cell or a table spanning two pages defeats naive parsers, which is why tabular extraction from PDFs is treated as its own discipline.
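As a small illustration of meaning encoded in position, the sketch below repairs a vertically merged cell. It assumes a hypothetical upstream table detector that returns rows as lists and reports the covered part of a merge as `None`; a genuinely empty cell would be indistinguishable under that assumption, which is one reason real tabular extraction needs more than this.

```python
def fill_merged_cells(rows):
    """Copy a value down into the empty cells that a vertical merge leaves behind."""
    filled = []
    previous = None
    for row in rows:
        if previous is None:
            # First row (the header): nothing above to inherit from.
            current = list(row)
        else:
            # Inherit the value from the row above wherever the cell is None.
            current = [prev if cell is None else cell
                       for cell, prev in zip(row, previous)]
        filled.append(current)
        previous = current
    return filled


raw = [
    ["Region", "Quarter", "Revenue"],
    ["EMEA", "Q1", 100],
    [None, "Q2", 120],   # "EMEA" spans two rows in the source table
    ["APAC", "Q1", 90],
]
table = fill_merged_cells(raw)
```

A naive parser would emit the third row with no region at all; the repaired grid restores the value that the layout implied.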
Quick Summary
Q: What are the main types of data in data extraction?
A: Two ways to slice it. By structure: structured (databases, spreadsheets), semi-structured (JSON, XML, emails), and unstructured (PDFs, images, audio). By format: text, tabular, image, and audio/video/social. The less structured the data, the more intelligence the extraction requires.
Expert Insight
Teams almost always underestimate the structure axis. They scope a project around “documents” and discover halfway in that 70% of the value is locked in unstructured scans and handwriting, not the clean forms they planned for. Name the structure mix first, and the right method picks itself.
Forage AI data team
Part 2: The types of data extraction
Once you know the data, you choose the extraction. It can be categorized four ways, and a real project usually combines several.
By source structure: where the data lives
The source dictates the entry point. The same field is trivial to pull from a database and hard to pull from a scanned PDF.
| Source | Example | Typical extraction |
|---|---|---|
| Structured store | SQL database, data warehouse | Query or a direct API/connector |
| Semi-structured feed | JSON/XML/CSV exports | Parse against the schema: tags, keys, or delimiters |
| Unstructured documents | PDFs, scans, contracts | OCR plus NLP (this is where IDP fits) |
| The web | HTML pages, dynamic sites | Web scraping and crawling |
By technical approach: the method that does the reading
This is the layer people usually mean by “how.” Each approach is suited to a structure and format.
- Template / rule-based parsing: fast and exact for known, stable formats (a fixed CSV, a consistent form).
- Web scraping & crawling: for data published on websites, at scale. See web data extraction techniques and tools.
- OCR: turns images and scans into machine-readable text. A first step, not a finish line.
- NLP / ML / AI: understands unstructured content, including entities, context, and intent. Intelligent Document Processing (IDP) is one AI-based approach in this family, specialized for documents. See what IDP is and how it works.
- API-based: the cleanest path when a source offers structured programmatic access.
The point worth holding onto: IDP is one method, not the category. It is the right tool for unstructured documents, and the wrong frame for a database query or a web crawl.
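To make one of the approaches listed above concrete, here is a minimal sketch of the parsing step inside web scraping, using only Python's standard-library `html.parser`. The page markup and the `price` class name are hypothetical, and fetching, crawling, and rendering dynamic pages are left out entirely.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


# Hypothetical page fragment standing in for a fetched response body.
page = (
    '<ul>'
    '<li>Widget <span class="price">9.99</span></li>'
    '<li>Gadget <span class="price">24.00</span></li>'
    '</ul>'
)
parser = PriceParser()
parser.feed(page)
prices = parser.prices
```

The extractor is tied to one tag and one class name, so a redesign of the page silently breaks it. That fragility is the main way this approach fails, and it is why scraping is usually paired with monitoring.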
By frequency: how often you pull it
| Frequency | When to use it | Example |
|---|---|---|
| One-time / batch | Migrations, audits, research | A single bulk pull of a back catalog |
| Scheduled / incremental | Keeping a dataset fresh | Nightly syncs and weekly delta updates |
| Real-time / streaming | Live decisions | Event feeds, IoT sensors, live pricing |
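The difference between the first two rows can be sketched with a watermark: a stored timestamp that marks how far the last run got. The record shape and the `updated_at` field are assumptions for illustration, and a real source would apply the filter in its own query rather than in memory.

```python
from datetime import datetime


def extract_incremental(records, watermark):
    """Return records changed after the watermark, plus the new watermark."""
    fresh = [r for r in records if r["updated_at"] > watermark]
    # If nothing changed, keep the old watermark.
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark


source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
    {"id": 3, "updated_at": datetime(2024, 1, 5)},
]

# One-time / batch: start from the earliest possible watermark and take everything.
batch, mark = extract_incremental(source, datetime.min)

# Scheduled / incremental: a later run picks up only what changed since then.
source.append({"id": 4, "updated_at": datetime(2024, 1, 7)})
delta, mark = extract_incremental(source, mark)
```

Real-time extraction does not fit this shape at all: instead of asking the source what changed, the pipeline subscribes to events as they happen.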
By method: who runs it, and how automated it is
| Method | What it means | Best fit |
|---|---|---|
| Manual | People read and key the data | Tiny, one-off volumes |
| Scripted / DIY | In-house scripts or tools | Stable, low-maintenance sources |
| Automated pipeline | Orchestrated, monitored, self-healing | Ongoing, multi-source extraction |
| Managed service | A partner runs it to an SLA | High-volume, high-drift, business-critical data |
Quick Summary
Q: What are the types of data extraction?
A: Four overlapping categories. By source structure (database, file feed, document, web), by technical approach (parsing, scraping, OCR, NLP/ML including IDP, API), by frequency (one-time, scheduled, real-time), and by method (manual, DIY, automated, managed). Most real projects mix several.
Expert Insight
These four axes are not a menu where you pick one. A single healthcare project might pull structured policy data by API, unstructured claims by OCR and NLP, on a scheduled cadence, run as a managed service. The skill is matching each data type to the right combination, not forcing everything through one tool.
Forage AI data team
How to match the data type to the extraction type
The whole point of the taxonomy is to make the choice obvious. Work in this order:
- Name the structure. Structured, semi-structured, or unstructured? This rules most approaches in or out immediately.
- Name the source. Database, file feed, document, or the web? That sets the entry point.
- Name the cadence. One-time, scheduled, or real-time? Streaming and batch are different architectures, not a setting.
- Pick the method to match the volume and stability. Stable and small leans DIY; high-volume and high-drift leans managed.
Get those four answers and the right extraction type is no longer a guess. For the operational side of running it reliably once it is automated, see how to automate large-scale extraction.
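The checklist above can be sketched as a small lookup. The names `plan_extraction` and `Plan` are hypothetical, the source-to-approach mapping follows the source table in Part 2 (so naming the source stands in for naming the structure), and the "automated pipeline" middle case is an assumption of this sketch, since the guide only names the two extremes.

```python
from dataclasses import dataclass

# Entry point per source, taken from the source-structure table.
APPROACH_BY_SOURCE = {
    "database": "query or direct API/connector",
    "file feed": "schema parsing",
    "document": "OCR plus NLP (IDP)",
    "web": "web scraping and crawling",
}


@dataclass
class Plan:
    approach: str
    architecture: str
    method: str


def plan_extraction(source, cadence, high_volume, high_drift):
    """Turn the checklist answers into a rough extraction plan."""
    approach = APPROACH_BY_SOURCE[source]
    # Streaming and batch are different architectures, not a setting.
    architecture = "streaming" if cadence == "real-time" else "batch"
    if high_volume and high_drift:
        method = "managed service"
    elif not high_volume and not high_drift:
        method = "scripted / DIY"
    else:
        # Assumption: mixed cases land on an in-house automated pipeline.
        method = "automated pipeline"
    return Plan(approach, architecture, method)


plan = plan_extraction("document", "scheduled", high_volume=True, high_drift=True)
```

A real project would run this per data type rather than once, since most projects combine several sources and cadences.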
Frequently asked questions
What are the types of data in data extraction?
By structure: structured (databases, spreadsheets), semi-structured (JSON, XML, emails), and unstructured (PDFs, images, audio, video). By format: text, tabular, image, and audio/video/social. Structure matters most because it predicts how hard the data is to extract.
What are the types of data extraction?
They group into four categories: by source structure (database, file, document, web), by technical approach (parsing, web scraping, OCR, NLP/ML, API), by frequency (one-time, scheduled, real-time), and by method (manual, DIY, automated, managed). Most projects combine several.
Is intelligent document processing a type of data extraction?
Yes, but a specific one. IDP is an AI-based technical approach specialized for unstructured documents. It is one method among many, not a synonym for data extraction, which also covers databases, APIs, and the web.
What is the hardest type of data to extract?
Unstructured data such as scanned documents, handwriting, and free text, because it has no predefined schema. Tables sit close behind, since their meaning is encoded in layout. Both usually need AI-based approaches rather than simple parsing.
How do I choose the right type of data extraction?
Name four things in order: the data’s structure, the source it lives in, how often you need it, and the volume and stability. Those answers point to the right approach and the right method, from a simple API pull to a fully managed pipeline.
Data extraction is never a single decision. It is the data type meeting the right source, approach, cadence, and method. Get the taxonomy straight, and the messy question of “how do we extract this?” turns into a short, answerable checklist.