AI Training and Copyrighted Data: The Fair Use Conundrum

Artificial Intelligence (AI) is no longer a futuristic concept. It is already part of our daily lives, in fields ranging from healthcare to entertainment, and arguably more so than we are prepared for. With the growing use of AI in creative industries, important legal questions need to be answered. AI models are trained on large amounts of data, including works protected by copyright such as music, books, and images, so the potential for legal conflict increases significantly, and the ethics of training such models also come into question. AI is likely to reshape copyright law in the near future. The key question is: “Can the use of copyrighted work for training AI models be considered Fair Use?”

The Fair Use doctrine: A brief overview

The Fair Use doctrine in copyright law allows the use of copyrighted material without the consent of the copyright holder, provided that the use meets certain criteria. Under Section 107 of the U.S. Copyright Act, four factors are considered when determining fair use:

  1. Purpose and character of use: Whether the use is transformative, non-commercial, or educational.
  2. Nature of the copyrighted work: The more factual and less creative the work, the more likely its use is to be considered fair use.
  3. Amount and substantiality of the portion used: Whether the portion used is appropriate in relation to the purpose of the use.
  4. Effect of the use on the market for the original work: Whether the new use harms the market for the original copyrighted work or its potential market.

The Intersection of AI and Copyrighted Data

AI systems require large datasets to train their algorithms. These datasets often include many copyrighted works such as books, articles, music, art, images, and videos. Training an AI model involves feeding it enormous amounts of data so that it can identify patterns, make predictions, distinguish correct from incorrect outputs, and improve over time. Since the model does not directly reproduce the copyrighted data fed to it, the critical question is whether the use of copyrighted data for AI training can be considered Fair Use.

The Fair Use Argument for AI Training

Many proponents argue that the use of data for training purposes could be considered transformative. In Campbell v. Acuff-Rose Music, Inc. (1994), the Supreme Court described transformative use as one that adds new expression, meaning, or message to the original work. In the AI context, the argument is that the system does not directly reproduce the data fed to it; rather, it uses the data to identify patterns and generate new output.

In Authors Guild v. Google, Google created a searchable database by scanning millions of books. The court held that the scanning was a transformative use, despite Google’s commercial motivation, since it added a new function by making the books searchable for users. Therefore, it was considered fair use.

AI research and development often serves educational and non-commercial purposes. In Google LLC v. Oracle America, Inc. (2021), the Supreme Court found that Google’s use of Java API code in Android was fair use, noting the transformative nature of the use, which aided in creating a new platform with new and different functionality. For AI developers, the distinction between simply using the data and using it to create something new is crucial.

The GitHub Copilot Case: A direct application of AI and Fair Use (2022-Present)

This case is one of the first legal battles arising from AI systems. GitHub Copilot is an AI-based coding product made in cooperation with OpenAI. It is trained on billions of lines of publicly available code, which has left open-source programmers with serious concerns about violations of their rights. A class action lawsuit was filed against GitHub, arguing that Copilot uses copyrighted code as part of its training data and that this infringes the programmers’ copyright, since Copilot generates code that resembles the original copyrighted work. The defendants (GitHub, Microsoft, and OpenAI) argue that the AI’s use of code is transformative because it generates new code rather than reproducing the original. The plaintiffs, however, contend that Copilot harms the market for the original code by potentially replacing the need for human programmers or licensing agreements. The outcome of this case could set a precedent for how courts view fair use arguments in AI.

The Copyright Holder’s Perspective

Despite these arguments for fair use, copyright holders are entitled to contend that AI developers are using their works without authorization and diminishing the market for those works, thereby potentially depriving them of their income and rights.

The central concern for copyright holders is whether AI’s use of copyrighted works constitutes an infringement of their exclusive rights. If an AI system generates new work based on copyrighted data, the question arises whether that work is a derivative work, or whether the original work is being exploited in a manner that infringes the holder’s rights. This is a valid concern.

The latest Supreme Court decision on fair use, the 2023 case of Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, supports the holder’s rights. There, the Court ruled against the Andy Warhol Foundation over its unlicensed use of a photograph originally taken by Lynn Goldsmith. The Foundation argued that Warhol’s image was a transformative use, yet the Court disagreed and decided that this was not a fair use.

AI developers can be seen as “free-riding” on the creative efforts of others without providing compensation. The courts have time and again reinforced the importance of protecting creators’ rights, so that creators can control their work and benefit from its use.

What does this mean for the future of Copyright law?

The use of copyrighted data for AI training is creating a rift between copyright holders’ rights and the need to foster technological innovation. However much we may resist it, AI is going to touch everything we do and create. The GitHub Copilot case could be a milestone in setting the ground rules. If the court rules in favor of fair use, it could encourage innovation by allowing AI developers to access and use copyrighted works without the fear of litigation. However, this could undermine the rights of creators.

On the other hand, if courts decide against fair use, AI developers may face significant barriers to accessing the data needed to train their systems. But this would help secure creators’ rights. AI developers may still use copyrighted data, but only by compensating the original authors and appropriately licensing their work for use in training.

The United States Copyright Office (USCO) has recently released reports on the legal and policy issues related to AI and copyright law. One of these reports confirms that the use of AI to assist in the process of creation, or the inclusion of AI-generated material in a larger human-generated work, does not bar copyrightability.

The USCO is also set to release a part of its report addressing the legal implications of training AI models on copyrighted works, which will probably help set the ground rules for answering the question of whether the use of copyrighted data for AI training can be considered fair use.

Conclusion

Copyrightability, fair use, and infringement in the context of AI are going to remain evolving issues. The answers will keep changing with time and with advances in technology. As AI systems become more complex with each passing day, it is important that the law also evolves. While AI training may align with some of the factors favoring fair use, such as transformative use, creators’ rights remain a major concern. Courts, lawmakers, and industry stakeholders are already actively discussing changes to the law to accommodate AI; it is only a matter of time before these changes are adopted.

Author: Adv. Vidhi Gandhi