Every health-data team hits the same wall eventually. The data they need to build a provider directory, price a procedure, or score a hospital on quality is public. It is also scattered across dozens of government systems and hospital websites, published in clashing formats, on unrelated schedules, with no shared key tying any of it together.
Healthcare data extraction is the work of turning that mess into clean, structured, queryable records. At Forage, we run these pipelines every day: pulling provider identity, pricing, quality, and capability data out of public healthcare sources, then validating and normalizing it until a downstream team can actually trust it.
One distinction makes the rest of this guide click. Healthcare data splits into two worlds. There is the inside-the-firewall world of electronic health records and patient data, governed by strict privacy rules, extracted over HL7 and FHIR. Then there is the outside-the-firewall world: the public, regulator-mandated, and hospital-published data that powers directories, price-transparency products, and AI grounding layers. This guide is the second world, the operational one, where the work is web-scale extraction and the hard part is trust, not access.
Quick Digest: The Storyline
- What it means: Healthcare data extraction turns scattered public healthcare data into clean, structured records. The big fork is inside-the-firewall PHI/EHR data versus outside-the-firewall public data. This guide is the public, operational side.
- How the pipeline is built: Every source runs the same five-stage spine: discover, extract, validate, normalize, monitor. The expensive stage is validation, because a file can download cleanly and still be wrong.
- Provider identity and affiliation: NPPES gives you identity for 8M+ providers, but not hospital affiliation. Real directory accuracy needs a cross-walk against state license boards, Care Compare, and hospital-published directories.
- Hospital price transparency: Every U.S. hospital must publish a machine-readable file of charges. It is the highest-volume, messiest source, and roughly a quarter of hospitals still need a custom parser.
- Quality and cost data: CMS Care Compare carries 100+ quality measures; HCRIS carries Medicare cost reports as multi-table flat files. Both need careful entity cross-walks or they roll up to the wrong hospital.
- Clinical capability signals: ClinicalTrials.gov, service-line pages, and formularies are the closest public proxy for what a hospital actually does, but none of it is standardized.
- Compliance and the 2026 shifts: Public hospital data sits outside the HIPAA boundary, so the posture is terms-of-service and accurate sourcing, not PHI safeguards. A cluster of 2026 rule changes resets the schemas.
- Build, buy, or blend: Most teams land in different cells for different sources. Build the stable ones, blend the high-drift ones, buy a feed where coverage maps cleanly.
- Why pipelines fail: The dangerous failures are silent. Files validate, schemas pass, and the data is still wrong. Only diff-watching against the prior release catches it.
What healthcare data extraction actually means
Healthcare data extraction is pulling structured, usable data out of healthcare’s sources and making it trustworthy. The work splits cleanly along where the data lives.
The dominant framing treats it as a document-AI problem: OCR a chart, parse an HL7 message, map fields to FHIR. That is the right frame inside the firewall, where the inputs are clinical records and the constraint is patient privacy. It is the wrong frame when your inputs are a public provider registry, a hospital’s published price file, or a government quality dataset.
Outside-the-firewall data is non-PHI by design. The HIPAA boundary at 45 CFR §160–164 governs covered entities handling protected health information. Public price files, the national provider registry, Care Compare, and cost reports sit outside that boundary. That single fact changes everything downstream: the compliance posture shifts from PHI safeguards to terms-of-service hygiene and accurate sourcing, and the engineering problem shifts from secure clinical integration to web-scale extraction and reconciliation.
Across that public world, the data falls into four practical buckets, each with its own publisher, format, cadence, and failure mode:
- Identity and affiliation: who the provider is and where they practice (the national provider registry, No Surprises Act directories).
- Pricing and cost: what care costs (hospital price-transparency files, payer rate files, Medicare cost reports).
- Quality and outcomes: how well a hospital performs (CMS Care Compare measures).
- Clinical capability signals: what a hospital actually does (trial listings, service lines, formularies).
For teams feeding RAG pipelines and AI grounding layers, these four buckets are the external evidence base for any question about a hospital’s identity, price, quality, or capability. None of them ships over FHIR. They ship as bulk CSV, JSON, PDF, and HTML, on cadences that have nothing to do with clinical encounters.
Quick Summary
Q: What does healthcare data extraction actually mean?
A: It is the work of turning healthcare’s scattered sources into clean, structured, trustworthy records. The decisive split is inside-the-firewall clinical data, which is PHI governed by HIPAA, versus outside-the-firewall public data, which is non-PHI and falls into four buckets: identity, pricing, quality, and capability. This guide covers the public world, where the hard part is reconciliation, not access.
Expert Insight
The most expensive mistake we see is treating public hospital data as EHR adjacency. It is a different engineering problem, and teams that miss the distinction burn six months building the wrong pipeline before the data ever lands clean. Forage AI data engineering team
How health-data teams build a healthcare extraction pipeline
The bucket changes, but the pipeline shape does not. Every source we run, from an 8 GB provider registry to a single hospital’s price file, moves through the same five stages. Naming them is what separates a pipeline that reports healthy from one that is actually correct.
- Discover. Resolve the real file location. Hospitals rotate URLs and CMS publishes through manifests, so the address you used last quarter is not the address today. Version every URL.
- Extract. Pull the data with the method the source demands: bulk file for analytics, API for lookups, per-site templates for hospital web pages. One method does not fit all four buckets.
- Validate and flag. Schema-validate, then run semantic checks. A file can pass schema validation and still be wrong. Flag suspect values, do not silently drop them.
- Normalize. Map every source to a canonical schema and resolve entities across them, so “ABC Hospital” and “ABC Medical Center” reconcile to one record.
- Monitor. Diff each new release against the last. Schema stays stable while semantics drift, and only active diff-watching catches that.
The stage that consumes the most engineering time is validation, not extraction. “The file downloaded” is the weakest possible success signal. Automation handles the volume happily; it is the semantic correctness that needs human-in-the-loop review and continuous observability. See human-in-the-loop validation for healthcare data.
Quick Summary
Q: How do you build a healthcare data extraction pipeline?
A: Run every source through five stages: discover, extract, validate, normalize, monitor. Match the extraction method to the source, spend your effort on validation rather than download, and diff every new release against the last. The pipeline that reports healthy is not the same as the pipeline that is correct.
Expert Insight
We budget more engineering hours for validation and monitoring than for extraction itself. Pulling the data is the easy 20%. Proving it is correct, release after release, is the 80% that keeps a healthcare data product trustworthy. Forage AI data engineering team
The four source buckets you extract from
With the pipeline shape in place, here is what actually flows through it. Healthcare’s public data sorts into four source buckets, drawing on the public sources mapped below. Almost every health-data product is built by combining them, and each bucket has its own publisher, format, cadence, and failure mode. Here is the shape of all four before we walk each one in turn.

| Bucket | What it answers | Key public sources | Main extraction risk |
|---|---|---|---|
| 1. Identity & affiliation | Who a provider is, where they practice | NPPES / NPI, No Surprises Act directories | NPPES has no affiliation; needs a cross-walk |
| 2. Pricing & cost | What care costs | HPT MRFs, Transparency in Coverage, HCRIS | Three schemas, ~25% need custom parsers, semantic drift |
| 3. Quality & outcomes | How well a hospital performs | CMS Care Compare | Wrong entity roll-up; 12–18 month reporting lag |
| 4. Clinical capability | What a hospital actually does | ClinicalTrials.gov, service lines, formularies | Unstandardized; publish-and-forget pages |
Bucket 1: Provider identity and affiliation (NPPES, NPI, directories)
The first bucket answers who a provider is and where they practice. It starts with the national provider registry and gets hard the moment you need affiliation.
NPPES is the canonical NPI registry under 45 CFR §162.408: Type 1 individual, Type 2 organizational, more than 8 million active NPI records on a monthly full file plus weekly deltas, with the dissemination file exceeding 4 GB. The No Surprises Act requires providers to verify directory information at least every 90 days.
API versus bulk file. The NPPES API is right for lookups: a single NPI, roster validation, real-time enrichment. The bulk file is right for analytics: weekly delta diffs and monthly fulls for affiliation modeling. Building real-time analytics on the API is how you get rate-limited at 3 a.m. the night before a payer audit.
What NPPES does not tell you. NPI carries identity, not affiliation. Employment and admitting-privilege truth lives in payer directories, hospital pages, and state license boards. Hospital affiliations are not in NPPES. That gap is why a directory product needs a four-layer ground-truth stack:
- NPPES: identity (NPI, taxonomy, practice address)
- State DOH license board: licensure, disciplinary history
- CMS Care Compare: affiliation signal via CCN
- Hospital-published directory: intent (the hospital’s own claim)

Why the cross-walk matters: a CMS Medicare Advantage directory review found 48.74% of directory locations had at least one inaccuracy (CMS, 2018). A 2025 follow-up across 1,802 providers found 40.3% of identified inaccuracies remained uncorrected after a mean of 540 days, and only 13.3% were corrected (AJMC, 2025). Verification without a ground-truth cross-check is paperwork.
Entity matching is where this gets hard. “ABC Hospital,” “ABC Medical Center,” and “ABC Health System” may be one facility, four facilities, or a partial overlap. Resolving that reliably is its own discipline. See entity matching across NPI and directories.
The REAL Health Providers Act (expected effective in 2026) shifts directory accuracy from market discipline to joint MA-plan and provider regulatory liability. This article is for informational purposes only and does not constitute legal advice.
Quick Summary
Q: Is NPPES enough to build a provider directory?
A: No. NPPES gives you identity for 8M+ providers, but not hospital affiliation. A trustworthy directory needs a four-layer cross-walk against state license boards, Care Compare, and hospital-published directories, plus real entity matching. Without it, you inherit the kind of error rate that leaves 40.3% of inaccuracies uncorrected at day 540.
Expert Insight
The failure mode that hurts is not bad data on day one. It is bad data on day 540. The metric we actually track is correction half-life: how fast a known inaccuracy gets fixed across the source stack. Forage AI data engineering team
Bucket 2: Pricing and cost (hospital price transparency and MRFs)
Identity tells you who. The second bucket tells you what care costs, and it is the highest-volume, highest-mess source in the entire public stack.
Hospital Price Transparency under 45 CFR Part 180 requires every U.S. hospital to publish (1) a machine-readable file (MRF) of standard charges and (2) a 300-shoppable-services display. The rule has been effective since January 1, 2021; the v2 template took effect July 1, 2024; CY 2026 enforcement begins April 1, 2026.

The v2 schema is really three schemas: tall CSV, wide CSV, and JSON, each needing its own parser. It mandates new fields, including median allowed amount and 10th and 90th percentiles, sourced from remittance data and attested over a 12–15 month lookback. A single hospital system with 30+ CMS Certification Numbers may publish one MRF or thirty, so the domain-to-hospital-to-MRF mapping is many-to-many.
Discovery comes before parsing. The path runs homepage → price-transparency page → manifest → MRF URL. Independent telemetry across 2,771 unique HPT URLs found only 60.1% returned a successful file, and roughly 25% of hospitals still require a custom parser even under v2. That long tail is where the manual labor hides.
Validate, but do not trust. A file can pass schema validation and still be wrong. The filler values that should never reach a normalized table without a missing-flag: 5555, N/A, null, empty string, and 0 in negotiated-rate columns where a real contract rate exists. The template standardizes structure, not semantics, and downstream claims processing automation inherits every silent gap unless it is flagged at ingest. Parsing these files reliably is a tabular extraction problem at scale.
Compliance context: a November 2023 audit found only 21.1% of hospitals fully compliant with the price-transparency rule (Patient Rights Advocate, 2023). CMS penalties cap at $300/day for small hospitals and roughly $5,500/day for larger ones, with annual caps scaling into the low millions, and state HPT laws in NC, CA, and TX add further requirements.
Quick Summary
Q: Why are hospital price transparency files so hard to extract?
A: The v2 rule is three schemas, the file location keeps moving, and about 25% of hospitals still need a custom parser. Worst of all, files pass schema validation while their semantics quietly shift between quarters. You need discovery, filler-flagging, and quarter-over-quarter diffing, not just a download job.
Expert Insight
Schema drift in the quarterly republishes is the most common cause of downstream corruption, and the least likely to trigger an alert. Our MRF stack diffs every republish against the prior distribution and runs roughly 3x the industry-average QA ratio on each one. Forage AI data engineering team
Bucket 3: Quality and cost reports (Care Compare and HCRIS)
Identity and price establish who and how much. The third bucket assesses how well a hospital performs and what it costs to run, and these surfaces are routinely treated as afterthoughts when they should be foundational.
CMS Care Compare publishes 100+ quality measures on 4,000+ Medicare-certified hospitals (CMS Provider Data Catalog): HCAHPS patient-experience scores, readmission rates, mortality, and timely care, refreshed quarterly. Each dataset has its own API endpoint plus CSV download. HCRIS is the quarterly public release of Medicare cost reports on Form 2552-10, distributed as multi-table flat files: numeric, report-header, and alphanumeric tables, keyed by report record number.

When to use which. Care Compare for quality claims. HCRIS for financial profile. Hospital marketing sites for capability claims. The break point: a single system can file 30+ cost reports under separate CCNs that roll up to one reporting entity. Joining HCRIS to facility identity without a CCN-to-NPI-to-Care-Compare cross-walk produces silently wrong financial roll-ups. See hospital financial data extraction with audit-ready accuracy.
One trap worth naming: Care Compare measure refreshes lag the reporting period by 12–18 months. Using last-quarter Care Compare to score current-quarter performance is a category error, not a freshness nuance.
Quick Summary
Q: What do Care Compare and HCRIS give you, and where do they break?
A: Care Compare gives 100+ quality measures; HCRIS gives Medicare cost reports as multi-table flat files. Both break on entity roll-up: without a CCN-to-NPI cross-walk, a hospital chain lands in the wrong financial bucket. Care Compare also lags 12–18 months, so it scores the past, not the present.
Expert Insight
HCRIS is underused because its multi-table structure is hostile to a first-pass ETL job. The teams that get value from it treat the numeric-to-header-to-alpha join as a first-class pipeline, not a one-off import script. Forage AI data engineering team
Bucket 4: Clinical capability signals (trials, service lines, formularies)
The fourth bucket is the closest public proxy for what a hospital actually does. It does not mean chart data. It means hospital-published signals on the open web: ClinicalTrials.gov site listings, formulary references, service-line and Centers-of-Excellence pages, and state DOH facility licensure data.
ClinicalTrials.gov API v2 launched in 2024 and the classic API is deprecated. Any pipeline still pointing at the legacy endpoint needs a rebuild, and it is one of the most-overlooked maintenance tasks in the space.

Formularies and service lines are not standardized. Hospitals do not publish standardized formulary feeds, and marketing references are decorative more often than authoritative. Service-line listings appear inconsistently across marketing sites, “find a doctor” pages, and state DOH files. Extraction requires per-website templates plus cross-validation against NPI Type 2 and Care Compare. See large-scale extraction for inconsistent hospital websites.
Quick Summary
Q: How do you extract hospital capability signals?
A: Combine ClinicalTrials.gov (on the v2 API), service-line pages, and formularies, then cross-validate against NPI Type 2 and Care Compare. None of it is standardized, so it needs per-website templates. Treat a published signal as positive evidence and its absence as inconclusive, never as a “no.”
Expert Insight
Capability signals follow a publish-and-forget pattern. A service line that went live three years ago may still sit on the page long after the program closed, so we weight presence as a positive signal and treat absence as inconclusive, not as a negative. Forage AI data engineering team
Compliance, HIPAA, and the 2026 regulatory shifts
Because every source in this guide is public and non-PHI, the compliance posture is unusual: the legal question is terms-of-service and accurate attribution, not HIPAA. Regulator-mandated public data is materially different from speculative scraping. See the compliance posture for public hospital data. General guidance, not legal advice.
That said, the schemas themselves are about to move. A cluster of 2026 changes compresses years of evolution into a few months:
- January 1, 2026: HPT v2 schema effective for all hospitals.
- March 3, 2026: NPPES V1 dissemination endpoint sunsets; V2 required. This breaks every pipeline still pointing at V1, with no grace runway.
- April 1, 2026: CY 2026 HPT enforcement begins, with median and 10th/90th percentile attestation.
- 2026 (expected): REAL Health Providers Act effective, adding joint MA-plan and provider directory liability.

Why “wait and see” is the wrong call. A pipeline built today needs roughly 90 days of test data before it is trustworthy. If enforcement lands on April 1, code-complete is January. CMS enforcement historically lags 6–12 months, but the audit trail starts immediately, and that documentation burden is the real cost when an auditor shows up.
Quick Summary
Q: What is the compliance posture for healthcare data extraction?
A: For public hospital data, HIPAA does not apply; the posture is terms-of-service and accurate sourcing. The near-term risk is operational, not legal: 2026 schema changes, especially the March 3 NPPES V1 sunset, break pipelines that have not migrated. Build and test before the deadline, not after.
Expert Insight
If a team does only one thing before March 3, it should be moving NPPES ingestion to V2 in parallel and reconciling deltas weekly. The HPT cutover has a 90-day grace runway. The NPPES V1 sunset does not. Forage AI data engineering team
Build, buy, or blend: how should you run extraction?
Once the sources and the pipeline are clear, the real decision is who runs it. There are three options, and the honest answer is that most teams land in different ones for different sources.

| Axis | Build | Buy | Blend |
|---|---|---|---|
| Use case | One-off analysis | Single-source feed | Continuous multi-source |
| Data scope | 1–2 sources | Aggregator’s choice | All four buckets |
| Schema-drift tolerance | You absorb it | Aggregator absorbs | Partner absorbs |
| Team capacity | 1+ FTE permanent | Lean | Lean ops + strategic |
| Required SLA | Best-effort | Aggregator’s terms | Contractual |
Build is right when the scope is one or two stable sources and you have a permanent FTE to own it. Buy is right for a single-source feed with a known schema where an aggregator’s coverage maps to your use case. Blend is right when all four buckets run continuously, schema drift will eat you alive, and your team’s value is downstream of clean data. See custom extraction versus pre-built tools.
The hidden costs decide it. Build carries a maintenance tail, the recurring manual fallback on that long tail of awkward sources. Buy carries schema lock-in and opaque coverage. Blend carries vendor due-diligence, which matters more in healthcare than anywhere else: the average data-breach cost in healthcare reached $9.77M in 2024, the highest of any industry (IBM Cost of a Data Breach Report, 2024).
Forage AI runs managed external healthcare data pipelines with 12+ years of operational history, 500M+ websites crawled, a 3x industry-average QA ratio, no-resell data governance, and HIPAA-compliant workflows. See healthcare data extraction services compared.
Quick Summary
Q: Should you build, buy, or blend healthcare data extraction?
A: Usually all three, by source. Build the stable, low-drift sources you can staff. Buy a single-source feed where coverage fits. Blend the high-drift, multi-source work with a partner on a contractual SLA. The deciding factor is the hidden cost: maintenance tail, schema lock-in, or vendor due-diligence.
Expert Insight
The teams that get this right admit they are in different cells for different sources. Build NPPES, blend the MRFs, buy a firmographic feed. Blend is not buy with a markup: buy gives you the aggregator’s coverage, blend gives you your own, run by a partner accountable to your SLA. Forage AI data engineering team
Why do healthcare data pipelines silently fail?
The failure that costs you is never the loud one. A pipeline that crashes gets fixed in minutes. The dangerous failure is silent: ingestion succeeds, schema validates, and the data is still wrong. Four patterns recur across every bucket.
- MRF validates but column semantics shift between republishes. The negotiated-rate column now reports an allowed amount.
- NPPES weekly delta-keys identify incorrectly after a migration, dropping or duplicating providers.
- Care Compare quality-score columns get renamed, breaking the join without an error.
- HCRIS multi-table joins on CCN drop a hospital chain into the wrong roll-up.
All four pass schema validation. All four fail semantic validation. Only observability with diff-watching catches them: diff every republish against the prior distribution, and if the median negotiated rate moves more than ~15% quarter-over-quarter on a stable cohort, that is a semantic shift, not real-world pricing. This is the deepest failure surface in the whole practice, and we treat it in detail in our companion guide to the challenges of healthcare data extraction and in why enterprise data pipelines silently fail.
Quick Summary
Q: Why do healthcare data pipelines fail silently?
A: Because ingestion success is not correctness. Files download, schemas validate, and semantics still shift underneath, dropping providers, breaking joins, or mis-rolling a hospital chain. Hard crashes get fixed fast; silent failures get fixed after a customer reports them. Diff-watching every release is the only reliable catch.
Expert Insight
Silent failures account for most of the data-quality incidents that ever reach a customer. A hard crash pages someone; a renamed column does not. That asymmetry is why we treat diff-watching as a first-class part of the pipeline, not a nice-to-have. Forage AI data engineering team
Frequently asked questions
What is healthcare data extraction?
It is the work of turning healthcare’s scattered sources into clean, structured, trustworthy records. It splits into two worlds: inside-the-firewall clinical data (PHI, governed by HIPAA, extracted over HL7/FHIR) and outside-the-firewall public data (provider identity, pricing, quality, and capability). This guide covers the public, operational world.
How is this different from EHR or EMR extraction?
External hospital data is non-PHI by design, so the HIPAA boundary at 45 CFR §160–164 does not apply to public price files, NPPES, or Care Compare. EHR extraction is inside-the-firewall HL7/FHIR work with full PHI safeguards. The two are different engineering and compliance problems.
How do you automate healthcare data extraction at scale?
Map the sources into buckets, set per-source cadences, and run every source through one pipeline: discover, extract, validate, normalize, monitor. Use multi-method extraction (templates, APIs, ML) with multi-layer QA. The real cost driver is “validates but semantics shift,” so active diff-watching is the only reliable catch.
How accurate are provider directories?
Less than they appear. A CMS review found 48.74% of Medicare Advantage directory locations had at least one inaccuracy (2018), and a 2025 follow-up found 40.3% of identified inaccuracies still uncorrected after a mean of 540 days. Accuracy requires a cross-walk against state license boards, Care Compare, and hospital directories, not just verification paperwork.
How big are hospital price transparency files?
HPT machine-readable files run from a few MB to multiple GB per hospital, and payer-side Transparency in Coverage files can exceed 1 TB each. For the largest files, streaming rather than batch processing is the correct approach.
The decision was never “do we extract healthcare data?” It is “which sources do we build, which do we blend, and do our 2026 deadlines line up?” The teams that treat public healthcare data as a real pipeline, with discovery, validation, and diff-watching built in from day one, are the ones whose data still holds up when an auditor, or a customer, comes asking. Worth having that conversation before March 3.
Related Articles
- Harnessing Professional Data with AI in Healthcare: how AI is reshaping healthcare data workflows.
- AI-Powered Document Processing Builds a Defensive Moat in Healthcare: IDP for hospital financial filings.

