AI Training Data Via Copyrighted Works

(Un)fair Use? Copyrighted Works as AI Training Data — AI: The Washington Report

by: Bruce D. Sokler, Alexander Hecht, Christian Tamotsu Fjeld of Mintz - AI: The Washington Report

Wednesday, January 10, 2024

Welcome to this week’s issue of AI: The Washington Report, a joint undertaking of Mintz and its government affairs affiliate, ML Strategies.

This week, we discuss the continuing and contentious matter of the use of copyrighted works as training data for AI models, the potential issues this practice raises, and parallel efforts currently underway to resolve these dilemmas. Not only is the status of such use an unresolved question in most of the world, but it is also an open question how and by whom it will be resolved. Possibilities include regulation in parts of the world dictated by a uniform outcome, legislation, litigation, or a commercial agreement among interested parties.

As we start 2024, our current thoughts on the subject are:

To create AI tools, developers often engage in a “training” process whereby large amounts of data are fed into models. Often, copyrighted work is included in this “training data.” As generative AI tools have become commercialized and lucrative, some creatives have taken issue with this practice, arguing that it violates their rights as copyright holders for their works to be used to sharpen generative AI tools.
Various authorities have recognized the tension in this issue, including the US Copyright Office, the White House, and Congress. Major court cases, including the New York Times’s December 2023 suit against generative AI companies, may also influence how AI developers utilize copyrighted works in the training process.
Other authorities worldwide face the same issue. Japan has addressed the issue head-on, although its position may be changing. Meanwhile, the European Union (EU) seeks to require developers of general-purpose AI models to publish transparency reports regarding the use of all data used to train said models, regardless of its copyright status.
The longer it takes regulators and lawmakers to implement guidelines on the use of copyrighted works to train AI models, the more likely it will be that developers and creatives attempt to reach commercial deals on the issue, opting for self-regulation as the way to resolve this matter. The New York Times lawsuit revealed that such commercial negotiations have occurred.

Copyrighted Works as AI Training Data

Generative AI tools have raised a variety of complex questions for regulators on domains as diverse as data privacy, labor force disruptions, and consumer protection. But one issue of particular concern for regulators around the world is that of AI’s status vis-à-vis copyright law.

Many have been amazed by the capacity of generative AI tools to answer questions, crack jokes, and compose poetry. Often, AI tools are able to achieve these results thanks to a process known as “training” in which scientists feed massive amounts of data into computer models, instructing the models to “find patterns or make predictions” based on the data. To create data sets for training AI models, researchers have “scraped” various data sources including internet forums, book corpuses, and online code repositories.

Prior to the commercialization of generative AI tools, few were concerned about the copyright implications of data scraping for model training. But with the emergence of commercialized and lucrative generative AI tools, the issue has become contentious, spawning multiple lawsuits, congressional hearings, and Copyright Office inquiries on the matter.

In this week’s issue, we investigate emerging solutions to the pressing question of how to best govern the use of copyrighted content by AI models.[1]

Japan: Power to the AI Developers?

One approach is to give developers free rein to utilize copyrighted works in model development. Many interpret this to be the approach to AI and copyright taken by Japan.

Under the 2018 revisions to the nation’s copyright law, it is “permissible to exploit a work, in any way and to the extent considered necessary…if it is done for use in data analysis…” Some have interpreted this provision to mean that AI developers can use copyrighted works to train AI models without requesting the permission of copyright holders. This interpretation has led some to describe Japan’s laws as “more favorable to AI learning than those of other countries.”

However, Japan’s approach to AI and copyright may soon evolve, in part due to complaints that the current interpretation of the law harms the interests of creators. In December 2023, Japan’s Agency for Cultural Affairs released a draft report on AI and copyright. The report supports an interpretation of the copyright laws under which certain unauthorized uses of protected content to train AI models would constitute a copyright violation. The Agency for Cultural Affairs is expected to release guidelines on the matter in March 2024.

EU: Transparency Requirements for Copyrighted Content

In December 2023, negotiators from the European Parliament and the Council of the EU reached a provisional deal on the AI Act. First proposed in April 2021, the AI Act regulates uses of AI differently according to their “risk to the health and safety or fundamental rights of natural persons.”

To address potential harms caused by general-purpose AI models, the December 2023 text of the AI Act requires providers of such models to “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office.” Whereas earlier versions of the AI Act had required detailed summaries regarding the use of “copyrighted training data” from the developers of such models, this most recent version of the AI Act mandates disclosure for all training data regardless of its copyright status.

The AI Act uses transparency requirements regarding training data to allow copyright holders to exercise their right to prevent their work from being data mined. According to commentary accompanying the December 2023 text of the AI Act, the EU rules state that “reproductions and extractions of works or other subject matter, for the purposes of text and data mining, under certain conditions” are permitted. Under this regime, it is incumbent on the copyright holder to opt out of having their work data mined in such a fashion. Once this right “to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightholders if they want to carry out text and data mining over such works.”

US: Efforts in the Three Branches and the Marketplace to Resolve the AI and Copyright Dilemma

Executive Branch and Copyright Office Efforts on AI and Copyright

Here in the United States, in August 2023, the US Copyright Office published a request for comments on the question of how the office should address governance challenges posed by the emergence of generative AI. Among other issues, the Copyright Office requested comment on “whether legislative or regulatory steps” are needed to address “the use of copyrighted works to train AI models.” The comment period for this inquiry closed on December 6, 2023.

The August 2023 inquiry, along with previous efforts by the Copyright Office, is intended to provide the material for future legislative reforms. Policy proposals on AI and copyright made on the basis of the Copyright Office’s inquiries may come as early as summer 2024.

President Joe Biden’s October 2023 “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence” directs the Copyright Office to “issue recommendations to the President on potential executive actions relating to copyright and AI… including the scope of protection for works produced using AI and the treatment of copyrighted works in AI training.” These recommendations are to be issued to the President by the end of July 2024 at the latest.

Congressional Proposals on AI and Copyright

In parallel with the executive branch and the Copyright Office, lawmakers in Congress are wrestling with the copyright issues posed by AI.

On December 22, 2023, Representatives Don Beyer (D-VA-8) and Anna Eshoo (D-CA-16), respectively the Vice Chair and Co-Chair of the Congressional AI Caucus, introduced the AI Foundation Model Transparency Act. Under the act, developers of certain AI models may be required to disclose the sources of data used to train their models, including “information necessary to assist copyright owners or data license holders with enforcing their copyright or data license protections...”

As members and committees propose piecemeal AI bills, Congressional leadership has been working towards the goal of comprehensive AI legislation. From September to December 2023, a bipartisan group of senators led by Majority Leader Chuck Schumer (D-NY) hosted a series of Senate-wide “AI Insight Forums,” or closed-door meetings with a variety of stakeholders on pertinent governance issues raised by AI. The purpose of these forums is to develop the expertise within the Senate needed to produce effective bipartisan AI legislation within “months.”

On November 29, 2023, Senator Schumer hosted a forum on “Transparency, Explainability, Intellectual Property, and Copyright.” In his opening statement for the forum, the senator asserted that “there is a role for Congress to play to promote transparency in AI systems and protect the rights of creators.”

Will the Courts Decide the Issue Under Current Law?

On December 27, 2023, the New York Times sued leading generative AI companies for “unlawful use of The Times’s work to create artificial intelligence products that compete with” the publication. The complaint alleges that the defendants used “Times content without permission to develop their models and tools.” According to the complaint, after a months-long attempt to reach a negotiated agreement stalled, the publication filed suit.

The New York Times argues that the defendants’ use of “The Times’s content without payment to create products that… closely mimic the inputs used to train them” is not protected by fair use and therefore constitutes copyright infringement.

Given the high profile of both the plaintiff and the defendants, some experts have predicted that this case, if tried, will eventually be appealed to the Supreme Court. Whether or not that happens, the outcome of this suit could have significant implications not just for the news industry but for all creators whose content has been used to train AI models. In the absence of regulation from the executive or legislative branch, a decisive ruling in this or a related case may set industry standards regarding the use of copyrighted works as training data for AI models, or be the catalyst for codifying a set of standards.

Self-Regulation as the Solution?

Aside from government agencies, Congress, and the courts, the ultimate solution to the issues posed by the use of copyrighted works as AI training data may come from the marketplace.

Prior to filing its suit, the New York Times engaged with generative AI companies in months of negotiations over the use of its content. Had these negotiations succeeded, the resulting arrangement could have been used as a model for other publishers and content creators. As the New York Times suit progresses, nothing prevents marketplace agreements from being reached, including among the litigants.

AI is developing at warp speed. Regulatory, legislative, and litigation-based solutions all move at a slower pace. The longer it takes for regulators to create and implement public policy solutions to this issue, the more likely it is that creators and AI developers will come to their own arrangements.

We will continue to monitor, analyze, and issue reports on these developments.

Endnotes

[1] Along with the use of copyrighted works as training data, the commercialization of generative AI tools has brought the issue of the copyrightability of AI-generated works to the fore.