OECD Report on Data Scraping and AI – What Companies Can Do Now as Policymakers Consider the Issues
Monday, March 24, 2025

The large language models (LLMs) that power generative AI derive their capabilities from vast quantities of data. Much of this data comes from scraping all forms of content from the internet. Despite its benefits, this practice raises numerous legal issues, some of which involve intellectual property (IP). Dozens of pending lawsuits in the US alone include IP claims arising from data scraping. The recent OECD report titled “Intellectual Property Issues in AI Trained on Scraped Data” (Report) explores the intricate relationship between AI and IP rights, focusing in particular on data scraping practices used in AI training. It aims to provide policymakers with insights into the legal challenges posed by data scraping and potential policy approaches to address them. The following is an overview of the Report.

Data scraping involves the automated extraction of data from websites, databases, or social media platforms without coordination with the data host. Techniques include web scraping, web crawling, and screen scraping.

Scraping raises concerns about IP rights infringement, especially when copyrighted materials are involved. The issues can include copyright infringement, right of publicity/misuse of name, image or likeness (NIL), database rights, trademarks, and trade secrets, among others. Legal disputes over data scraping to train AI are increasing worldwide. The legal issues are complicated by the variation in the relevant laws across jurisdictions. As one example, the US has a “fair use” exception to copyright law, while the European Union (EU) has a “text and data mining” (TDM) exception. Some jurisdictions recognize sui generis (unique) rights to protect specific types of materials under IP law. For example, the EU provides sui generis database rights. Some jurisdictions, such as Japan, have proclaimed that training AI on copyrighted material will not be deemed an infringement. In the US, whether such training infringes is a critical issue pending in many lawsuits, with no clear answer yet. Other legal issues arise with the output of AI, including whether the output infringes; if so, who is liable (the tool provider or the user); and whether the output is protectable by copyright.

The Report further notes that contracts, including website terms of service (TOS) and end user license agreements (EULAs), can help govern how data can be scraped or used as between the parties to the agreement, including through specific provisions on permitted uses, attribution requirements, and liability allocations. However, it also notes that the enforceability of TOS and EULAs and their interaction with IP laws can vary significantly across jurisdictions. This further complicates matters.

As the Report notes, the term data scraping is often conflated with “data mining.” The latter refers to computational processes for identifying patterns, trends, and correlations in data. The Report highlights the inconsistencies in definitions and proposes a broad working definition for data scraping.

The Report further notes that the data scraping ecosystem includes research institutions and academia, AI data aggregators, technology companies and platform operators. AI data aggregators reportedly make scraped data available to third parties, often without clear licensing terms or clear disclosure of data provenance. This exacerbates the IP and other legal issues. Different legal issues may apply to different entities, depending on their role in the AI ecosystem.

The Report indicates that policymakers are increasingly considering codes of conduct and other forms of voluntary commitments by businesses to address challenges such as data scraping. It discusses various issues that should be addressed in these codes for AI data aggregators and users. Voluntary commitments are easy to adopt, but they work only if uniformly adopted. Bad actors are unlikely to comply.

Another part of a potential solution includes policymakers encouraging the development of standard and widely accessible technical tools that protect IP rights, enable rights holders to control access to their data more easily, and support licensing mechanisms. Such tools can help with data access control and rights management to streamline compliance and enhance transparency.
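
As a rough illustration of what such an access-control tool can look like in practice, the sketch below uses the long-standing robots.txt convention, which lets a site publish machine-readable crawling preferences. This overview of the Report does not name robots.txt, so treat it as an assumed example only; the crawler name "ExampleAIBot", the rules, and the URLs are hypothetical. The code uses Python's standard-library robots.txt parser to show how a compliant crawler could check a site's stated preferences before collecting a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a rights holder might publish.
# "ExampleAIBot" is an invented crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text so the example needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for agent in ("ExampleAIBot", "OtherBot"):
        verdict = "allowed" if may_fetch(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict} for {page}")
```

A signal of this kind is honored only by crawlers that choose to check it, which mirrors the limitation noted above for voluntary commitments.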

The Report also suggests developing appropriate standard contract terms, including common terminology. However, given the cross-border nature of AI and the territorial nature of IP rights, this process would need to consider technical, legal and/or other terms that may already exist in different jurisdictions.

Lastly, the Report advocates raising awareness of data scraping legal issues by educating stakeholders about their rights and responsibilities in the AI data ecosystem. This includes informing stakeholders about legal implications and best practices for data usage in AI. This is an area where many companies that are just getting up to speed can benefit from the assistance of knowledgeable legal counsel.

In summary, the Report underscores the need for coordinated international policy approaches to address the challenges posed by AI data scraping, balancing innovation with the protection of IP rights. It notes that by adopting voluntary codes of conduct, developing technical tools, and implementing standard contract terms, policymakers can promote responsible AI development while safeguarding intellectual property. However, this will not happen overnight, and companies are using AI now.

In the interim, there are things that companies can do now to ensure responsible use of AI and minimize legal liability. Many companies are turning to legal counsel for in-house legal training and help developing responsible AI policies. For some of the issues involved in AI legal training and policy development, see Why Companies Need AI Legal Training and Must Develop AI Policies and 5 Things Corporate Boards Need to Know About Generative AI Risk Management.
