Web Scraping and the Rise of AI
Data has driven the digital revolution. It is the asset responsible for the birth of humanity's most transformative invention, and potentially its last: artificial intelligence.
Artificial intelligence now permeates nearly every aspect of human life: social, personal, and professional. Most of today's AI models are Large Language Models (LLMs), trained on vast datasets so that they can recognize patterns in language and learn to predict and complete text.
This is the technology behind the chatbots we rely on every day. It raises a critical question: how are such massive datasets collected?
The answer lies in the process of web scraping.
Web scraping[1] is the automated process by which a software bot collects information from a website, converts it into a readable, structured format, and stores it.
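To make the process concrete, the short sketch below shows what such a bot can look like in practice. It is a minimal illustration, assuming the widely used Python libraries requests and BeautifulSoup; the URL, the CSS selector, and the output file are placeholders and do not refer to any particular website or commercial tool.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page; the URL is an illustrative placeholder.
response = requests.get("https://www.example.com/articles", timeout=10)
response.raise_for_status()

# Parse the HTML and pull out the items of interest
# (here, hypothetical headline elements).
soup = BeautifulSoup(response.text, "html.parser")
headlines = [tag.get_text(strip=True) for tag in soup.select("h2.headline")]

# Store the result in a readable, structured format.
with open("headlines.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(headlines))
```

AI developers run pipelines of this kind at a vastly larger scale, harvesting text from millions of pages to assemble training datasets.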
Web scraping serves many purposes and has supported technologies such as market intelligence tools, SEO analyzers, data aggregators, academic research platforms, e-commerce automation, and compliance monitoring systems.
The legality of web scraping, however, has long been contested. The question has acquired new significance today, when web scraping has enabled the development of large AI models that have been monetized and have generated enormous value.
The Battle Over Data – Reddit versus Anthropic
Reddit, a popular social media company that hosts a multitude of threads containing user comments and information on various topics, recently sued Anthropic, accusing it of scraping data from its forums without prior consent in violation of Reddit's user policy and terms of service.[2] The complaint was filed in the San Francisco Superior Court and is one of many lawsuits over the unauthorized use of data for training AI models.
The complaint states that Anthropic refused to enter into a licensing agreement with Reddit and continued to use Reddit's data to develop its Claude AI model, despite assurances that it had blocked its bots from doing so. Reddit alleges breach of contract, digital trespass, unjust enrichment, and unfair competition, claiming that Anthropic ignored its Terms of Service and scraped data without authorization.
According to the complaint, Anthropic deliberately ignored the robots.txt file on Reddit's site. This file serves as a set of instructions for scraper bots, indicating which areas of a site are open to scraping and which are restricted. Although companies can technically choose to ignore the file, it is widely accepted as an internet standard.
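For illustration, the sketch below shows a hypothetical robots.txt file and how a compliant scraper can consult it before fetching a page, using Python's standard urllib.robotparser module. The domain and the user-agent string are assumptions made for the example and are not drawn from the complaint.

```python
from urllib import robotparser

# A hypothetical robots.txt, illustrating the convention only:
#
#   User-agent: *
#   Disallow: /private/
#   Allow: /

parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # illustrative URL
parser.read()  # fetch and parse the file

# A compliant scraper checks permission before requesting a page.
if parser.can_fetch("ExampleBot/1.0", "https://www.example.com/private/page"):
    print("Allowed to fetch this page")
else:
    print("Disallowed by robots.txt; a compliant bot skips this page")
```

Reddit's contention, in essence, is that Anthropic's bots did not honour instructions of this kind.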
This case is poised to become a landmark in the evolving landscape of AI development, as it is likely to provide much-needed legal clarity on the permissibility and regulation of web scraping for training AI models. The outcome could set a critical precedent for how data governance, consent, and licensing are interpreted in the context of generative AI.
India’s Position – A Legal Grey Area
Scraping data may give rise to copyright and trademark violations, and it may also breach a website's terms of service. The ambiguity, however, arises in the case of “publicly available personal data”.
The Minister of State for Electronics and Information Technology stated in the Rajya Sabha that web scraping for the purpose of training AI models attracts liability under Section 43 of the Information Technology (IT) Act, 2000,[3] which provides for penalty and compensation for damage to a computer, computer system, and related resources.
Section 43 of the IT Act provides that if any person, without the consent of the owner or person in charge of a computer, computer system, or network, accesses, tampers with, damages, or interferes with the data or functionality of that system, they shall be liable to pay compensation by way of damages to the affected person.
The Minister also cited the Information Technology (Intermediary Guidelines and Digital Media Ethics Code) Rules, 2021[4] as a safeguard: the Rules mandate that users must not host, display, upload, or share any content that violates any law, and also require the implementation of data protection measures and the prevention of “unauthorised data access”.
However, there is a glaring gap in the legal framework which the Minister has failed to address. This chasm lies in Section 3(c)(ii) of the Digital Personal Data Protection Act (DPDPA), 2023.[5] This clause specifically states that the protection offered by this Act shall not extend to:
“Personal data that is made or caused to be made publicly available by –
(A) the Data Principal to whom such personal data relates; or
(B) any other person who is under an obligation under any law for the time being in force in India to make such personal data publicly available.”
In essence, once an individual has made their personal data public on the internet, that data is treated as public property and falls entirely outside the scope of the Act. It can therefore be freely used by companies to develop generative AI models.
Two key problems, however, emerge from this framework.
The Consent Problem in the DPDPA
This exemption fails to take into account that web scrapers do not possess reliable mechanisms to distinguish whether publicly available data has been made public by the data principals themselves or by some third party without their consent. If personal data made available by a third party is treated as exempt, it effectively undermines the data principal’s autonomy, violating the Act’s very objective—consensual, fair, and lawful processing.
Businesses may therefore have to employ manual means to ensure that only exempted data is scraped. This poses an additional challenge for small startups and can hinder AI development in India. The absence of clear guidelines or technical tools to trace the data's provenance leaves scrapers vulnerable to inadvertently processing non-exempt data, risking non-compliance with the Act's stringent requirements for consent and accountability.
This ambiguity not only complicates ethical and legal web scraping but also undermines the Act’s goal of protecting individual privacy, as it potentially enables unchecked data collection under the guise of the exemption, exposing data principals to unintended privacy violations.
Moreover, it is unclear how the law will operate if an individual deletes previously public information, or if a publicly available profile is subsequently converted into a private one.
Data Leakage
If personal data has been scraped and used to train an AI model, it may later be retrievable through the model itself. AI models can inadvertently leak sensitive personal information in their output, and a third party can elicit such data through carefully constructed prompts.
A study involving researchers from Stanford and other institutions found that GPT-2 could memorize and regurgitate sensitive personal information, including social security numbers, full names, and email addresses, when appropriately prompted.[6]
This is a serious privacy concern as personal data can be accessed easily just by a few prompts and can be used for nefarious purposes, especially in the absence of clear guardrails around the collection, processing, and reuse of personal data for AI development.
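To illustrate the mechanism, the sketch below shows a simplified memorization probe of the kind used in such research, assuming the Hugging Face transformers library and the publicly released GPT-2 model. The prompt and the email pattern are illustrative assumptions and do not reproduce the methodology of the cited study.

```python
import re
from transformers import pipeline

# Load the publicly released GPT-2 model for text generation.
generator = pipeline("text-generation", model="gpt2")

# A prefix likely to precede contact details in web text
# (illustrative; real extraction attacks use many such prompts).
prompt = "For further information, please contact me at"

samples = generator(
    prompt,
    max_new_tokens=30,
    num_return_sequences=5,
    do_sample=True,
)

# Flag continuations that contain something shaped like an email
# address, i.e. text the model may have memorized during training.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
for sample in samples:
    match = email_pattern.search(sample["generated_text"])
    if match:
        print("Possible memorized contact detail:", match.group())
```

If a continuation reproduces a real address that appeared in the training data, the model has leaked personal information that was never meant to be retrievable in this way.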
Thus, the potential for data leakage must be seriously considered in any legal or regulatory framework governing AI training and web scraping. Ignoring this risk could inadvertently facilitate mass privacy violations through emerging technologies.
Other Jurisdictions
No jurisdiction has enacted a law that specifically prohibits web scraping. In the EU, the GDPR contains no exemption for publicly available personal data: every kind of personal data is protected equally and the same rules apply to all. Article 14 of the GDPR[7] requires that data subjects be notified when their personal data is not collected from them directly, although exceptions exist for processing in the public interest and for scientific or historical research or statistical purposes.
In the USA, the Computer Fraud and Abuse Act (CFAA)[8] prohibits unauthorized access to protected computer systems and networks. Meanwhile, the California Consumer Privacy Act (CCPA)[9] specifically regulates how businesses may collect and process the personal information of California residents.
Article 13 of China’s Personal Information Protection Law (PIPL)[10] outlines the legal bases for processing personal data. It permits data processing without an individual’s consent when the personal information has been voluntarily made public by the individual or has been otherwise lawfully disclosed. However, such processing must remain within a reasonable scope and should ensure a balance between the individual’s rights and the broader public interest.
In contrast, Canada’s Personal Information Protection and Electronic Documents Act (PIPEDA)[11] allows the use of publicly available personal information without consent only in specific situations detailed in the Regulations Specifying Publicly Available Information (SOR/2001-7, dated December 13, 2000). Moreover, Canada’s data protection authority offers interpretive guidance on what qualifies as publicly available information under these rules.
One of the earliest coordinated global responses to large-scale data scraping came through the ‘Joint Statement on Data Scraping and the Protection of Privacy’ (August 2023)[12], issued by data protection authorities from countries including the U.K., Canada, Australia, and Switzerland. This Initial Joint Statement raised alarms about the privacy risks of data scraping and emphasized that even publicly available personal data remains protected under privacy laws. It warned that widespread scraping could amount to a reportable data breach in several jurisdictions. The statement urged organizations—especially social media platforms—to adopt safeguards against unlawful scraping, such as monitoring for bot activity, restricting excessive profile views, and pursuing legal remedies when needed. It also recommended keeping these safeguards updated in light of the evolving nature of scraping techniques.
A follow-up document, the ‘Concluding Joint Statement on Data Scraping and the Protection of Privacy’ (October 2024)[13], built on these concerns and, for the first time, addressed the role of generative AI (GenAI). It acknowledged that when companies use scraped data to train GenAI models or extract data from their own platforms for this purpose, they are required to comply with existing data protection laws and any AI-specific legal frameworks. However, the statement stopped short of offering detailed guidance on how such compliance should be implemented.
Conclusion
Technological advancement has been unprecedented and will continue on the same trajectory. The Government of India must develop flexible guidelines regulating the process of scraping, particularly for the training and creation of generative AI models. This will not only protect internet users but will also forestall a wave of litigation in the years ahead.
Clarity must be provided regarding the contours of protection of publicly available personal data. The legal framework must be technologically adaptive, allowing for evolution in response to new scraping techniques and AI capabilities, and should ideally include guidelines for consent verification, proportionality in data usage, audit mechanisms, and ethical obligations for AI model developers.
Ultimately, regulatory clarity will strike a necessary balance—fostering innovation while ensuring accountability and allowing India to remain competitive in the global AI landscape without compromising the fundamental right to privacy.
End Notes:
1. Web Scraping (National Library of Medicine, 25 May 2022) <https://www.nnlm.gov/guides/data-glossary/web-scraping> accessed 1 June 2025.
2. William S Galkin, Reddit vs. Anthropic: A Defining Moment in the AI Data Race (Lexology, 17 June 2025) <https://www.lexology.com/library/detail.aspx?g=dfd6f12d-8ad4-4725-8e35-cb1cfca7acd7#:~:text=Reddit’s%20core%20allegation%20is%20that,being%20explicitly%20told%20to%20stop.> accessed 19 June 2025.
3. Information Technology Act 2000, s 43.
4. Information Technology (Intermediary Guidelines and Digital Media Ethics Code) Rules, 2021.
5. The Digital Personal Data Protection Act 2023, s 3(c)(ii).
6. Michael Epelboim, Why Your AI Model Might Be Leaking Sensitive Data (and How to Stop It) (NeuralTrust, 7 April 2025) <https://neuraltrust.ai/blog/ai-model-data-leakage-prevention> accessed 1 June 2025.
7. Regulation (EU) 2016/679, art 14.
8. Computer Fraud and Abuse Act 1986, 18 USC s 1030.
9. California Consumer Privacy Act 2018, Cal Civ Code ss 1798.100–1798.199.100.
10. Personal Information Protection Law 2021 (China), art 13.
11. Personal Information Protection and Electronic Documents Act, SC 2000, c 5 (Canada).
12. Joint Statement on data scraping and data protection (Information Commissioner's Office, 24 August 2023) <https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2023/08/joint-statement-on-data-scraping-and-data-protection/> accessed 2 June 2025.
13. Global privacy authorities issue follow-up joint statement on data scraping after industry engagement (Information Commissioner's Office, 28 October 2024) <https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2024/10/global-privacy-authorities-issue-follow-up-joint-statement-on-data-scraping-after-industry-engagement/> accessed 20 June 2025.