Is Web Scraping Legal? A Legal Compliance Guide

April 20, 2024

Subhasis Patnaik

Every data team that runs extraction at scale eventually hits the same meeting. Legal asks whether the pipeline that feeds the analytics dashboard is going to get the company sued. Engineering says the data is public. Someone forwards a news story about a multi-million-euro fine, and the room goes quiet.

That meeting is the reason this guide exists. The web-scraping market sat at USD 1.03 billion in 2025 and is projected to reach USD 2.23 billion by 2031, which tells you scraping is now core infrastructure for pricing intelligence, market research, and AI training sets. It also tells you that regulators and courts have been paying attention.

USD 1.03B in 2025, projected to USD 2.23B by 2031 (13.78% CAGR) for the web-scraping market. Source: Mordor Intelligence, 2026.

Here is the verdict we will defend across this guide: web scraping is legal in itself. What decides whether a specific scrape is lawful comes down to three things. What you scrape. How you access it. What you do with it afterward. Get those three right and most of the legal exposure disappears. Get them wrong and the fact that the data was “public” will not save you.

This guide is general information for practitioners, not legal advice. For a specific scraping program, confirm your position with qualified counsel in the relevant jurisdiction.

Quick Digest

  • Is it legal: Web scraping is legal in itself. Lawfulness turns on three factors: what you scrape, how you access it, and what you do with the data.
  • Legality by data type: Public non-personal data is the safest. Personal data, copyrighted content, and anything behind a login or paywall each carry their own triggers that flip a scrape from legal to illegal.
  • The core laws: Terms of Service, copyright, the CFAA in the US, and GDPR/CCPA on the privacy side are the four bodies of law you operate against.
  • By country: The US is permissive for public business data and asks “was access authorized?” The EU and UK ask “did you have a lawful basis to process personal data?” That difference drives most cross-border risk.
  • Case law: hiQ, Van Buren, Meta v. Bright Data, and X Corp v. Bright Data show US courts narrowing the routes to call public-data scraping illegal. The hinge in 2026 is logged-out versus logged-in access.
  • Ethics: Six practical principles, led by a three-check gut test, keep you defensible beyond the bare legal minimum.
  • The checklist: An operational before/during/after list that turns all of the above into something an engineer can actually run.

Is web scraping legal, or illegal?

Web scraping is not illegal. Collecting publicly available information from websites is a lawful activity in the US and across most of the world. What turns a given scrape into a legal problem is never the act of scraping by itself. It is the combination of three factors.

What you scrape. Public product prices and listings sit in a different legal category than personal data about identifiable people or content protected by copyright. The data type sets the baseline risk before you write a single line of code.

How you access it. Reaching a public page that anyone can load is one thing. Logging into an account, bypassing a paywall, or defeating a technical barrier such as a CAPTCHA is another. The access method is where most of the serious US exposure lives, because that is what the Computer Fraud and Abuse Act actually targets.

What you do with it. Internal analytics, republishing whole articles, reselling a dataset, and training an AI model are not the same use. Copyright and contract claims tend to land on the downstream use, not the collection.

We all know teams that treated “the data is public” as a complete legal answer and got a cease-and-desist anyway. Public is the start of the analysis, not the end of it. The three-factor frame is what lets you reason about a specific source instead of guessing.

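“Gates-up-or-down.” The US Supreme Court framed computer-access liability as a “gates-up-or-down” inquiry: either you are entitled to access a system or you are not. Applying that test to scraping, the Ninth Circuit observed that a computer hosting public webpages “has erected no gates to lift or lower in the first place.” Source: Van Buren v. United States, 2021; hiQ Labs v. LinkedIn, 2022.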

Quick Summary

Q: Is web scraping legal or illegal?

A: Web scraping is legal in itself. A specific scrape becomes a legal problem based on three things: the type of data you collect, how you access it (public page versus logged-in or bypassed barrier), and what you do with the data afterward. “It was public” is the beginning of the analysis, not a defense on its own.

Expert Insights

“Where access is open to the general public, the CFAA ‘without authorization’ concept is inapplicable. Van Buren therefore reinforces our conclusion that the concept of ‘without authorization’ does not apply to public websites.” – the Ninth Circuit panel, hiQ Labs v. LinkedIn, 2022

What actually decides legality: the type of data

The single best predictor of whether a scrape is lawful is the type of data you are collecting. Four buckets cover almost everything practitioners touch, and each one has a specific trigger that flips it from legal to illegal.

Public non-personal data is the safest ground. Prices, product listings, stock levels, and aggregate reviews stripped of author identity sit largely outside privacy law in both the US and the EU. Personal data is where the jurisdictions split hardest. Copyrighted content is about the downstream use. Anything behind a login or paywall is the highest-risk bucket because reaching it usually means crossing a barrier.

| Data type | What flips it illegal | US posture | EU / UK posture |
|---|---|---|---|
| Public, non-personal (prices, listings, specs) | Republishing protected expression; breaching an accepted ToS | Most permissive. Largely outside the CFAA and privacy law | Largely outside GDPR. Database rights can still apply to wholesale copying |
| Personal data (names, emails, profiles) | Processing without a lawful basis; no transparency to data subjects | CCPA applies; “publicly available” carve-out is narrow | GDPR applies even when the data is public. A lawful basis is required |
| Copyrighted content (articles, images, databases) | Copying and republishing substantial or whole works without permission | Copyright Act governs the use; facts and raw data are not copyrightable | Copyright plus EU database rights protect curated collections |
| Behind login or paywall | Bypassing authentication or a technical barrier to reach it | Classic CFAA and breach-of-contract trigger | Same, plus GDPR if the data is personal |

The line that trips most teams up is not the prices column. It is the jump from public to personal. In the EU, a name and a job title scraped from a public profile is still personal data, and “it was public” is not a lawful basis to process it.
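
A minimal sketch of the four-bucket mapping in Python, assuming a source can be described by three yes/no flags. The `Source` fields and the `classify` helper are hypothetical names introduced for illustration, and the result is a triage aid rather than a legal conclusion. Because one page can sit in several buckets at once (a public profile is both public and personal), the sketch returns every bucket that applies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Bucket(Enum):
    PUBLIC_NON_PERSONAL = "public, non-personal"
    PERSONAL = "personal data"
    COPYRIGHTED = "copyrighted content"
    BEHIND_LOGIN = "behind login or paywall"


# "What flips it illegal" column of the table, keyed by bucket.
FLIP_TRIGGER = {
    Bucket.PUBLIC_NON_PERSONAL: "republishing protected expression; breaching an accepted ToS",
    Bucket.PERSONAL: "processing without a lawful basis; no transparency to data subjects",
    Bucket.COPYRIGHTED: "copying and republishing substantial or whole works without permission",
    Bucket.BEHIND_LOGIN: "bypassing authentication or a technical barrier to reach it",
}


@dataclass(frozen=True)
class Source:
    """Hypothetical description of a scrape target."""
    requires_login_or_paywall: bool
    contains_personal_data: bool
    contains_original_expression: bool


def classify(source: Source) -> List[Bucket]:
    """Return every bucket that applies.

    Behind-login comes first because the text ranks it highest risk;
    the order of the remaining buckets is an assumption of this sketch.
    """
    buckets = []
    if source.requires_login_or_paywall:
        buckets.append(Bucket.BEHIND_LOGIN)
    if source.contains_personal_data:
        buckets.append(Bucket.PERSONAL)
    if source.contains_original_expression:
        buckets.append(Bucket.COPYRIGHTED)
    # With no flag set, the source is plain public, non-personal data.
    return buckets or [Bucket.PUBLIC_NON_PERSONAL]


# A logged-out public profile: no barrier, but it names identifiable people.
public_profile = Source(False, True, False)
for bucket in classify(public_profile):
    print(bucket.value, "->", FLIP_TRIGGER[bucket])
```

Returning a list rather than a single label keeps the public-to-personal jump visible: a source that looks like "just public data" still surfaces the personal-data trigger.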

Figure: Matrix comparing the legality of four web-scraping data types across US and EU jurisdictions.

“Found on the internet” does not mean “publicly available.” Under CCPA/CPRA (Cal. Civ. Code §1798.140), data restricted to a limited audience, such as a private or limited-access account, does not qualify for the publicly-available carve-out. Source: TrueVault / California AG, 2024.

Quick Summary

Q: How does the type of data decide whether a scrape is legal?

A: Public non-personal data such as prices and listings is the lowest risk. Personal data triggers GDPR in the EU and CCPA in the US even when it is public. Copyrighted content is about whether you republish it. Anything behind a login or paywall is highest risk because reaching it means crossing a barrier. Map your target source to one of these four buckets before you build.

Expert Insights

“The Facebook and Instagram Terms do not bar logged-off scraping of public data; perforce it does not prohibit the sale of such public data.” – Judge Edward Chen, Meta Platforms v. Bright Data, 2024

The core laws, briefly

Four bodies of law do most of the work in web-scraping disputes. You do not need to be a lawyer to operate against them, but you do need to know which one each risk lives in. Here is the compact version.

| Law | Where | What it actually governs |
|---|---|---|
| Terms of Service | Everywhere (contract) | A site’s terms may prohibit or conditionally allow scraping. They bind you only if you accepted them, for example by logging into an account. A logged-out visitor may not be bound at all. |
| Copyright | US, EU, UK | Copyright protects a site’s original content. Scraping and republishing substantial or whole works, such as full news articles, without permission can infringe. Facts and raw data themselves are generally not copyrightable. |
| CFAA | US | Prohibits accessing a computer “without authorization.” After Van Buren, this generally requires breaching a technical barrier, not merely violating a ToS. Scraping public pages falls outside it. |
| GDPR / CCPA | EU+UK / California | Govern the processing of personal data. GDPR requires a lawful basis even for public personal data. CCPA grants California residents rights over their personal information. |

The copyright row is the one the original version of this guide got wrong, so it is worth stating plainly. You can scrape facts. Stock prices, sports scores, and product specifications are not protected expression. What you cannot do is lift a publisher’s original articles wholesale and republish them as your own. The expression is protected; the underlying facts are not.

The CFAA needs a breached barrier, not a broken promise. After Van Buren, violating a website’s terms is generally a contract matter, not a federal computer crime. Source: Van Buren v. United States, 2021; applied to scraping in hiQ v. LinkedIn, 2022.

Quick Summary

Q: Which laws govern web scraping?

A: Four. Terms of Service (contract, binding only if you accepted them), copyright (protects original content, not facts), the CFAA in the US (now requires a breached technical barrier), and GDPR/CCPA on the privacy side (govern processing of personal data). Most scraping disputes trace back to one of these four.

Expert Insights

“The extent to which public data may be freely copied from social media platforms, even under the banner of scraping, should generally be governed by the Copyright Act, not by conflicting, ubiquitous terms.” – Jeremy Goldman, Partner, Frankfurt Kurnit Klein & Selz, on X Corp v. Bright Data, 2024

Web scraping laws by country

The same scrape can be lawful in the US and a regulatory problem in the EU. The reason is that each jurisdiction asks a different core question. In the US, the question is whether your access was authorized. In the EU and UK, the question is whether you had a lawful basis to process personal data.

| Jurisdiction | Governing law | Core question | Public data posture |
|---|---|---|---|
| US | CFAA, contract, copyright | “Was access authorized?” (gates-up-or-down) | Most permissive for public business data. See hiQ (2022), Meta v. Bright Data (2024), X Corp v. Bright Data (2024) |
| EU | GDPR, database rights | “Did you have a lawful basis to process personal data?” | “It was public” is not a lawful basis by itself. robots.txt is now weighed by regulators. The Clearview fine reached €30.5M |
| UK | UK GDPR, DPA 2018, Computer Misuse Act | Mirrors the EU on personal data | Near-identical to the EU. Personal data is the line |

One 2026 note for EU-facing programs. The EU Data Act became binding in September 2025, and the EU AI Act brings general-purpose AI obligations into force from August 2026, both of which sit on top of GDPR for anyone scraping at scale to build data products or training sets. If your scraping feeds an AI training set, treat these as an additional layer to clear, not a replacement for the GDPR analysis above.

Figure: Comparison of web-scraping laws across the US, EU, and UK jurisdictions.

€30.5 million fine imposed on Clearview AI for building a facial-recognition database from more than 30 billion photos scraped without consent. Source: Dutch Data Protection Authority, 2024.

Quick Summary

Q: Is web scraping legal by country?

A: It depends on the jurisdiction’s core question. The US asks whether access was authorized and is permissive for public business data. The EU and UK ask whether you had a lawful basis to process personal data, and “it was public” does not count as one. For EU-facing scraping, GDPR plus the 2025 Data Act and 2026 AI Act set the bar.

Expert Insights

The CNIL’s January 2026 focus sheet states that web scraping “cannot fall within the reasonable expectations of data subjects if the controller does not exclude from collection websites that explicitly object to scraping through robots.txt or CAPTCHAs.” – CNIL (French Data Protection Authority), 2026

How have the courts actually ruled?

US courts have spent the last several years narrowing the routes available to call public-data scraping illegal. The direction of travel is consistent across four rulings, and the practical hinge in the three scraping cases among them is the same: logged-out versus logged-in access.

hiQ Labs v. LinkedIn is where the modern line starts. hiQ, a workforce-analytics company, scraped public LinkedIn profiles to build its product. LinkedIn sent a cease-and-desist alleging CFAA and ToS violations and tried to block the access. hiQ sued. In 2022 the Ninth Circuit reaffirmed that scraping public data does not violate the CFAA, because the “without authorization” concept does not apply to a public website. The original version of this guide called the case “over after a six-year battle,” and on the CFAA question for public data that is right: the point is resolved. What that summary left out is that hiQ still lost on a separate contract track over LinkedIn’s terms. That is the distinction that matters: scraping public data is not a CFAA violation, but a ToS breach is a separate claim.

Van Buren v. United States (2021) is the Supreme Court decision that drove the hiQ outcome. It narrowed the CFAA’s “exceeds authorized access” language to a gates-up-or-down test. If a site has erected no technical gate, there is nothing to breach. A bare terms violation is not a federal computer crime.

Meta v. Bright Data (January 2024) applied that logic to social platforms. Judge Edward Chen held that Facebook and Instagram terms do not bar logged-off scraping of public data, because the terms bind users who are logged in, not mere visitors. This is the cleanest statement of the logged-out hinge to date.

X Corp v. Bright Data (May 2024) went a step further. Judge William Alsup held that the Copyright Act preempts X’s ToS-based scraping claims, and warned that letting platforms control public data too tightly could create “information monopolies.” Across both Bright Data rulings, the trend is courts limiting a platform’s ability to weaponize its terms against scrapers of logged-out public data.

One caution worth stating plainly. None of these US rulings touch GDPR. A scrape that is bulletproof under the CFAA can still be a €30 million problem in Europe if it sweeps up personal data without a lawful basis.

Figure: Timeline of four US web-scraping court rulings from 2021 to 2024.

The logged-out hinge. Across hiQ, Meta v. Bright Data, and X Corp v. Bright Data, the deciding factor was whether the scraper was logged into an account. Logged-out access to public data kept winning. Source: N.D. Cal. rulings, 2024.

Quick Summary

Q: How have the courts actually ruled on web scraping?

A: US courts have steadily narrowed the routes to call public-data scraping illegal. hiQ and Van Buren established that scraping public data is not a CFAA crime. Meta v. Bright Data held that platform terms do not bind logged-off visitors. X Corp v. Bright Data held that copyright preempts ToS scraping claims. The hinge in the scraping cases is logged-out versus logged-in access. None of it overrides GDPR.

Expert Insights

“This case is an earthquake in the web-scraping world. It has huge ramifications for both web scrapers and those who are looking to stop it. For web scrapers, there is now a new precedent to argue that websites cannot stop scraping of ‘logged out’ data.” – Kieran McCarthy, Attorney, McCarthy Law Group, on Meta v. Bright Data, 2024

Ethical best practices for web scraping

Legal compliance is the floor, not the ceiling. The teams whose scraping programs survive contact with a legal review tend to operate by a set of ethics that go a step beyond what any single statute requires. Start with a fast gut-check before any new source.

Three quick checks before you scrape. First, is the data public and reachable without logging in or bypassing a barrier? If you have to defeat authentication or a CAPTCHA, stop. Second, is it personal data or copyrighted content? If yes, you need a lawful basis or you need to limit yourself to facts. Third, did you accept a ToS, and what will you actually do with the data? Internal analytics is a different risk profile than resale or republication.
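
The three checks can be sketched as a small triage function. This is an illustrative Python sketch, not a legal test: the `GutCheck` field names, the `run_gut_check` helper, and the `HIGHER_RISK_USES` set are all hypothetical, and the answers fed into it still have to come from a person who has looked at the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Uses the text singles out as a different risk profile from internal analytics.
HIGHER_RISK_USES = {"resale", "republication"}


@dataclass(frozen=True)
class GutCheck:
    """Answers to the three checks; field names are illustrative."""
    public_without_login_or_bypass: bool   # check 1
    personal_or_copyrighted: bool          # check 2
    lawful_basis_or_facts_only: bool       # check 2 follow-up
    accepted_tos: bool                     # check 3
    intended_use: str                      # check 3, e.g. "internal analytics"


def run_gut_check(check: GutCheck) -> Tuple[bool, List[str]]:
    """Return (proceed, notes). A triage aid, not a legal determination."""
    if not check.public_without_login_or_bypass:
        return False, ["Stop: reaching the data needs a login or a bypassed barrier."]
    if check.personal_or_copyrighted and not check.lawful_basis_or_facts_only:
        return False, ["Hold: establish a lawful basis or limit collection to facts."]
    notes = []
    if check.accepted_tos:
        notes.append("Review the accepted ToS against the planned collection.")
    if check.intended_use in HIGHER_RISK_USES:
        notes.append("Planned use is higher risk than internal analytics.")
    return True, notes


# Logged-out price data used for internal analytics: nothing flagged.
proceed, notes = run_gut_check(
    GutCheck(True, False, False, accepted_tos=False, intended_use="internal analytics")
)
print(proceed, notes)
```

The first two checks short-circuit because the text treats them as stop conditions, while the third only attaches notes, since an accepted ToS or a resale plan changes the risk profile without automatically ruling the scrape out.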

From there, six principles do the rest of the work. They map directly to the failure modes we see most often in compliance reviews.

1. Respect website owners. Where a site offers an API or a data-licensing path, prefer it. Where the terms clearly prohibit collection and bind you, honor that. Seeking permission is slower than scraping, but it is the cheapest insurance you can buy against a contract claim.

2. Protect data privacy and security. If your scrape touches personal or sensitive data, treat it the way you would treat data you collected directly. Minimize what you keep, anonymize where you can, encrypt at rest and in transit, and document your lawful basis. A research institution scraping public-health data for legitimate study still owes the same duty of care it would owe in a clinical setting.

3. Be transparent and honest. Be straight about who you are, why you are collecting, and what you will do with the data. In the EU this is not optional courtesy. GDPR’s transparency duty can require notifying the people whose data you process, even when that data was already public.

4. Scrape only what you need. Collect the fields your use case actually requires and no more. Data minimization shrinks both your storage cost and your liability surface. It also keeps your request volume down, which leads to the next point.

5. Respect robots.txt and rate limits. Honor the robots exclusion standard and throttle your requests so you do not degrade the site you are collecting from. This used to be pure etiquette. As of 2026 it is a compliance signal in the EU: a major regulator now treats ignoring a robots.txt scraping refusal as evidence against your legitimate-interest basis.

robots.txt is now a compliance gate, not just etiquette. A site-wide scraping refusal looks like this, and ignoring it now weakens your GDPR legitimate-interest basis in the EU:

# Per CNIL focus sheet, January 2026
User-agent: *
Disallow: /
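
A crawler can evaluate rules like these before it requests anything. The sketch below uses Python's standard-library `urllib.robotparser`; the `is_allowed` helper, the `example-bot` user agent, and the example.com URLs are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The site-wide refusal shown above.
ROBOTS_TXT = """\
# Per CNIL focus sheet, January 2026
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/products"))  # False
```

In a live crawler the rules would be fetched from the target site rather than pasted in (`RobotFileParser.set_url` followed by `read` does that), and re-fetched periodically, since a site can change its robots.txt at any time.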

6. Avoid deceptive scraping practices. Rotating IPs to dodge blocks, spoofing user agents, and mimicking human behavior to defeat detection move you from “collecting public data” toward “evading a barrier the site put up on purpose.” That shift is exactly the kind of conduct that turns a defensible scrape into a contested one.
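
Principles 2 and 4 (minimize what you keep, collect only the fields you need) can be sketched at the point of collection. This is a minimal Python illustration under stated assumptions: the field names in `ALLOWED_FIELDS`, the sample record, and the key are hypothetical. The keyed hash shown is pseudonymization, not anonymization: anyone holding the key can re-link the values, and pseudonymized data is still personal data under GDPR.

```python
import hashlib
import hmac

# Hypothetical allow-list: the only fields this use case needs.
ALLOWED_FIELDS = {"product_id", "price", "currency", "in_stock"}


def minimize(record: dict, allowed=frozenset(ALLOWED_FIELDS)) -> dict:
    """Keep only allow-listed fields; everything else is dropped at collection."""
    return {key: value for key, value in record.items() if key in allowed}


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same input and key always give the same token, so records can still
    be joined, but the raw identifier is not stored.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


raw = {
    "product_id": "A1",
    "price": 9.99,
    "currency": "USD",
    "in_stock": True,
    "reviewer_email": "x@example.com",  # not needed, so it never gets stored
}
clean = minimize(raw)
print(clean)

# Only where an identifier is truly required: store a token, not the raw value.
token = pseudonymize("x@example.com", secret_key=b"replace-with-a-managed-secret")
print(len(token))
```

An allow-list is used instead of a deny-list so that a new field appearing on the source site is dropped by default rather than collected by accident.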

Figure: Six ethical web-scraping principles led by a three-check pre-scrape test.

The cost of ignoring these principles is not theoretical. The clearest cautionary example is Cambridge Analytica. In the 2018 scandal, the firm harvested the personal data of millions of Facebook users without consent and used it for political ad targeting, exploiting lax platform privacy controls. The fallout included regulatory scrutiny, investigations, hearings, penalties for Facebook, and the eventual shutdown and bankruptcy of Cambridge Analytica itself. The data was reachable. Collecting and using it the way they did was the failure.

Quick Summary

Q: What are the ethical best practices for web scraping?

A: Start with three checks: is it public and barrier-free, is it personal or copyrighted, and what will you do with it. Then follow six principles: respect site owners, protect privacy and security, be transparent, scrape only what you need, respect robots.txt and rate limits, and avoid deceptive evasion. Cambridge Analytica is the example of what happens when those principles are ignored.

Expert Insights

“Scraping violates the fairness principle because it is hidden and harmful. Because people are not notified when their data is scraped, they are often left unaware of data processing that exposes them to risk.” – Daniel J. Solove and Woodrow Hartzog, privacy-law professors, “The Great Scrape,” California Law Review, 2025

Your web-scraping compliance checklist

Everything above turns into one operational list. We organize it the way a data team actually runs a pipeline: what you settle before you collect, what you enforce while you collect, and what you govern after the data lands.

Before you scrape.

  • Check for an official API or data-licensing path first, and prefer it over scraping where one exists.
  • Confirm the data is public and reachable without logging in or bypassing a technical barrier.
  • Classify the data into one of the four buckets: public non-personal, personal, copyrighted, or behind-login.
  • Read robots.txt and the ToS. Note whether the terms bind you, which generally means whether you have to log in.
  • If the data is personal and any EU or UK data subjects are involved, document your lawful basis before you collect.
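
One way to make the pre-scrape items auditable is to record them per source before any collection runs. The sketch below is illustrative Python: the `SourceAssessment` class, its field names, and the bucket label strings are assumptions, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class SourceAssessment:
    """Hypothetical pre-scrape record; one per source."""
    source_url: str
    official_api_available: bool
    public_without_login: bool
    data_bucket: str                 # one of the four buckets
    robots_and_tos_reviewed: bool
    tos_binds_us: bool               # generally: do we have to log in?
    eu_or_uk_data_subjects: bool
    lawful_basis: Optional[str] = None
    assessed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def blockers(self) -> List[str]:
        """List the checklist items that are still open."""
        open_items = []
        if self.official_api_available:
            open_items.append("Official API or licensing path exists: prefer it.")
        if not self.public_without_login:
            open_items.append("Data is not reachable without a login or bypass.")
        if not self.robots_and_tos_reviewed:
            open_items.append("robots.txt and ToS not yet reviewed.")
        if (self.data_bucket == "personal" and self.eu_or_uk_data_subjects
                and not self.lawful_basis):
            open_items.append("Lawful basis not documented for EU/UK personal data.")
        return open_items

    def to_json(self) -> str:
        """Serialize the record so it can be stored with the pipeline config."""
        return json.dumps(asdict(self), sort_keys=True)


assessment = SourceAssessment(
    source_url="https://example.com/catalog",
    official_api_available=False,
    public_without_login=True,
    data_bucket="public non-personal",
    robots_and_tos_reviewed=True,
    tos_binds_us=False,
    eu_or_uk_data_subjects=False,
)
print(assessment.blockers())
print(assessment.to_json())
```

A pipeline could refuse to start while `blockers()` is non-empty, which turns the checklist from a document into a gate.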

While you scrape.

  • Honor robots.txt directives and the rate limits the site signals.
  • Throttle request volume so you do not degrade the source site.
  • Collect only the fields your use case requires. Drop the rest at the point of collection.
  • Do not spoof user agents or rotate IPs to defeat a barrier the site deliberately put up.
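
The throttling item can be sketched as a minimum-interval limiter wrapped around every request. This is a simple Python illustration; the `Throttle` class is a hypothetical helper, and the two-second interval is an assumed value, not a recommendation. Where a site signals its own limit, that value is the natural input.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval_s: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self) -> None:
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval_s=2.0)  # assumed interval
throttle.wait()  # the first call returns immediately
# fetch(url) would go here; each later wait() pauses only as long as needed
```

Using `time.monotonic` rather than wall-clock time keeps the spacing correct even if the system clock is adjusted while the crawl is running.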

After collection, and governance.

  • Encrypt personal data at rest and in transit, and restrict access to it.
  • Set and enforce a retention period. Delete what you no longer need.
  • Keep records of your lawful basis, your sources, and your collection decisions, so a compliance review has something to read.
  • Re-check the position when you change use, for example moving from internal analytics to resale or AI training, because the use is what copyright and contract claims attach to.
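
The retention item can be sketched as a periodic job that splits stored records into those still inside the retention window and those due for deletion. This Python sketch assumes each record carries a timezone-aware `collected_at` timestamp; the `apply_retention` helper, the record shape, and the 90-day period are illustrative choices, not values from any regulation.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple


def apply_retention(
    records: List[Dict],
    retention: timedelta,
    now: Optional[datetime] = None,
) -> Tuple[List[Dict], List[Dict]]:
    """Split records into (kept, expired) based on their collected_at timestamp."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    kept = [r for r in records if r["collected_at"] >= cutoff]
    expired = [r for r in records if r["collected_at"] < cutoff]
    return kept, expired


records = [
    {"id": 1, "collected_at": datetime(2025, 12, 15, tzinfo=timezone.utc)},
    {"id": 2, "collected_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},
]
kept, expired = apply_retention(
    records,
    retention=timedelta(days=90),  # assumed retention period
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print([r["id"] for r in kept], [r["id"] for r in expired])
```

Returning the expired records instead of silently dropping them lets the job log what was deleted and when, which feeds the record-keeping item in the same list.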

This is the part most teams underestimate. The hard work of compliant extraction is not a one-time legal sign-off. It is the ongoing discipline of honoring robots.txt as targets change it, re-scoping collection as sites restructure, and documenting decisions so the program stays defensible. That ongoing load is exactly what a managed-extraction partner absorbs. Forage AI scopes collection to public, non-personal data by default, honors robots.txt and rate limits as part of the service, and handles selector drift, anti-bot evolution, and schema changes so compliance is built into the pipeline rather than bolted on after.

If you are building the engineering side of this in code, the companion guide on web-scraping legal compliance covers the patterns, schemas, and checks at the implementation level. When the personal data arrives from a vendor or data broker rather than your own crawler, the downstream CCPA obligations are covered in our enterprise guide to CCPA implications for external data use.

Quick Summary

Q: What belongs on a web-scraping compliance checklist?

A: Before: prefer an API, confirm the data is public and barrier-free, classify the data type, read robots.txt and ToS, and document a lawful basis for personal data. During: honor robots.txt and rate limits, minimize fields, and avoid evasion. After: encrypt personal data, set retention, keep records, and re-check the position whenever your use of the data changes.

Expert Insights

“These technologies introduce new privacy risks for which systems of internal and external accountability must be in place.” – Müge Fazlioglu, Principal Researcher, Privacy Law and Policy, IAPP, 2023

Frequently asked questions

Is web scraping legal in 2026?

Yes. As of June 2026, web scraping is legal in itself. Whether a specific scrape is lawful depends on the type of data, how you access it, and what you do with it. US courts have continued to narrow the routes to call public-data scraping illegal, while the EU continues to treat personal data as the line you cannot cross without a lawful basis.

Is web scraping legal by country?

The legality varies because each jurisdiction asks a different question. The US asks whether your access was authorized and is permissive for public business data. The EU and UK ask whether you had a lawful basis to process personal data, and “it was public” is not a basis on its own. A scrape can be lawful in the US and a regulatory problem in the EU at the same time.

Does robots.txt give me legal cover?

Honoring robots.txt does not by itself make a scrape legal, but ignoring it now works against you. As of January 2026, the French data protection authority treats a disregarded robots.txt scraping refusal as evidence against your GDPR legitimate-interest basis. Respecting it is a low-cost control that strengthens your position.

Is it legal to scrape personal data?

Generally not without a lawful basis. In the EU and UK, personal data is covered by GDPR even when it is public, so you need a lawful basis such as legitimate interest and you may owe transparency to the people involved. In the US, the CCPA “publicly available” carve-out is narrower than scrapers assume, and data restricted to a limited audience does not qualify. If you are buying or ingesting that personal data from a vendor rather than collecting it yourself, the same statute governs how you handle it downstream, which is the focus of our enterprise guide to CCPA implications for external data use.

Is it legal to scrape data behind a login or CAPTCHA?

This is the highest-risk category. Logging into an account usually binds you to the site’s terms, and bypassing authentication or a CAPTCHA is the classic trigger for CFAA and breach-of-contract claims in the US. The court rulings that favored scrapers all turned on logged-out access to public data. Once you cross a login or a technical barrier, those protections fall away.

Can I get sued for scraping even if it is “legal”?

Yes. A scrape can fall outside the CFAA and still draw a contract claim over a ToS you accepted, a copyright claim over republished content, or a privacy complaint over personal data. hiQ won the CFAA question and still faced a contract claim. Legality is multi-axis, which is why the three-factor frame and the compliance checklist matter.

Sources

  • Mordor Intelligence, “Web Scraping Market Size, Share & Trends,” 2026, at mordorintelligence.com
  • Van Buren v. United States, 593 U.S. 374 (2021); Congressional Research Service summary, at congress.gov
  • Proskauer Rose, “Ninth Circuit Holds Scraping of Publicly Available Website Data Falls Outside CFAA,” 2022, at newmedialaw.proskauer.com
  • Quinn Emanuel, “What Does the Meta v. Bright Data Summary Judgment Ruling Mean for Web Scraping?,” 2024, at quinnemanuel.com
  • Kieran McCarthy, guest post on Eric Goldman’s Technology & Marketing Law Blog, 2024, at blog.ericgoldman.org
  • Jeremy Goldman, Frankfurt Kurnit Klein & Selz, “Copyright Act Preempts X’s Web Scraping Claims,” 2024, at ipandmedialaw.fkks.com
  • Skadden, “District Court Adopts Broad View of Copyright Preemption,” 2024, at skadden.com
  • CNIL, “Legitimate interest: focus sheet on web scraping,” 2026, at cnil.fr
  • Autoriteit Persoonsgegevens (Dutch DPA), “Dutch DPA imposes a fine on Clearview,” 2024, at autoriteitpersoonsgegevens.nl
  • TrueVault, “What Is ‘Publicly Available Information’ Under the CCPA?,” 2024; California AG CCPA page, at oag.ca.gov
  • IAPP, “The state of web scraping in the EU,” 2024, at iapp.org
  • Daniel J. Solove & Woodrow Hartzog, “The Great Scrape: The Clash Between Scraping and Privacy,” California Law Review, 2025, at californialawreview.org
  • Müge Fazlioglu, “Training AI on personal data scraped from the web,” IAPP, 2023, at iapp.org
