How the voracity with which Generative AI consumes works without authorisation is sparking anger among copyright holders.
The rise of Artificial Intelligence (an unhelpfully broad term, encompassing many different types of system) has undoubtedly ushered in a time of profound and universal change. Generative AI, a subset of AI able to dissect vast amounts of data, ‘learn’ from the pieces and generate content of its own, has featured prominently in headlines warning of legal and ethical challenges kicked up by its rapid development – both in terms of content generated and method of ‘learning’.
Generative AI systems handle data in three main stages: i) ingestion – by which a piece of content is found and dismantled, ii) incorporation – by which recurring trends between pieces of content are identified and remembered and iii) fabrication – by which output is generated along the lines of those trends. The problem is that each stage could involve the unauthorised copying of copyrighted material, contravening the Copyright, Designs and Patents Act 1988.
Generative AI systems require vast amounts of data to ‘learn’ from, and thereby imitate. In the case of text-generating Large Language Models, such as ChatGPT and Meta’s LLaMA, the AI ingests hundreds of thousands of texts to identify trends – grammatical, stylistic, literary, referential etc. With the right user-prompt, the AI generates new text based on the prevalence of these trends, using probabilities to calculate what the user is most likely to recognise as natural-looking or engaging. As a basic example, it might recognise that certain nouns are frequently preceded by the same adjectives (tall tree, red apple, fast car), and therefore deduce what the most natural sentence involving that noun would likely be. A similar logic is applied by text-to-image systems such as OpenAI’s DALL-E and Stability AI’s Stable Diffusion. Here an object may be identified as frequently associated with the same general colours, or approximate size relative to another object. The AI’s understanding (and subsequent mimicry) becomes more nuanced as it is exposed to probabilities concerning complex stylistic trends, such as recurring metaphors and character tropes, or shading and perspective. The creators of these Generative AI systems argue this process is no different to a student imitating their peers in order to learn their artistic or literary techniques and, eventually, developing to be on a par with them.
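The ‘tall tree, red apple’ example above can be sketched in a few lines of Python. This is a deliberately minimal illustration, not a description of how ChatGPT or any real model is built: the six-sentence corpus and the function names (`preceding_word_counts`, `most_likely_preceding`) are invented for the purpose, and real systems learn far richer statistics than simple word-pair counts.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus standing in for the ingested texts (assumption).
corpus = [
    "the tall tree swayed",
    "a tall tree fell",
    "the red apple fell",
    "a red apple rolled",
    "a green apple rolled",
    "the fast car stopped",
]

def preceding_word_counts(sentences):
    """For every word, count which words appear immediately before it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for previous, word in zip(words, words[1:]):
            counts[word][previous] += 1
    return counts

def most_likely_preceding(counts, word):
    """Return the most frequent preceding word and its observed probability.

    Only meaningful for words that actually occur in the corpus.
    """
    total = sum(counts[word].values())
    previous, seen = counts[word].most_common(1)[0]
    return previous, seen / total

counts = preceding_word_counts(corpus)
best, probability = most_likely_preceding(counts, "apple")
print(f"Most likely word before 'apple': {best} ({probability:.0%})")
```

Note that the counts retain no sentence in full, yet they could not exist without every sentence having been read. That is the tension between ‘learning’ and ‘copying’ on which the claims discussed below turn.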
The end of September 2023 saw George RR Martin, along with 16 other authors including Michael Connelly, John Grisham and Roxana Robinson, file a class action complaint in New York’s Southern District Court against OpenAI. Their claim alleged that ChatGPT, the Generative AI created by OpenAI, has infringed upon their copyright. Three weeks ago, nonfiction authors Nicholas Basbanes and Nicholas Gage jointly filed (again in the US) a similar class action of their own against OpenAI. This time, however, the class of claimant is far broader, in an effort to encourage as many other would-be claimants as possible to join them.
The selection pool is a big one. Last August saw the Atlantic release a database of authors whose works had been copied as part of various AI (including ChatGPT) ‘learning’ how to be more nuanced and natural-sounding writers, and many authors were outraged to find their works and names on the list. More specifically, their works had been incorporated into the now infamous Books3 training dataset, a bundle of well over 100,000 pirated books which is itself one of 22 similar datasets comprising ‘The Pile’ – a gargantuan ‘corpus’ of copied works being torrented from a plethora of host websites.
Most of these cases fall under US jurisdiction. This may come as little surprise; the fact that, in US litigation, each party usually bears its own costs and that many lawyers will act on contingency fees allows claimants to more easily ‘have a go’ against large corporate entities, without as much fear that an award of costs, should they fail, might drown them in legal fees.
However, authors are far from the only copyright holders with claims against these (for the moment) still fledgling intellects. More and more claims are being launched by artists, programmers and other copyright holders against the companies behind these AI. The creators of training datasets such as The Pile have themselves come under fire, yet have cited US copyright law’s ‘fair use exemption’ – designed to protect transformative use of the work (in the case of The Pile, from readable text into extremely condensed data files which are not able to easily replace the books themselves in the market).
The various claims brought against the AI companies to date have broadly alleged that the product generated by the AI following a user prompt, be it an image or a piece of text, has a sliver of the claimant’s work stitched within it, proving that the work must have been copied by the system at the ingestion stage. AI defenders have also invoked ‘fair use’, asserting that the process is every bit as transformative and creative as an artist merely seeking inspiration from others. Critics argue that the AI’s product is only a Shelley-esque gestalt of stolen intellectual property and that work assembled in this way does not meet the copyright threshold of originality. In the ongoing case of Getty v Stability AI this criticism was particularly bolstered by the fact that some of the infringing works in question (in this case, images generated by Stability AI’s Stable Diffusion system) bore recognisable fragments of the distinctive ‘Getty Images’ watermark. Getty alleged this could only come about by the AI having ingested so many watermarked images that, when prompted to create an image of its own, it assumed any image must logically contain a Getty watermark. In addition to its US claim, Getty has brought a claim against Stability AI in the UK as well, where it alleges that in accessing its portfolio of images, Stable Diffusion has infringed upon a whole second set of database rights established under the EU’s Database Directive.
The New York Times, whose New York claim was filed on 27 December 2023, has accused OpenAI of feeding its journalists’ publications to ChatGPT to create AI-generated news stories and has similarly demonstrated that (with the right user-prompt) it was able to get ChatGPT to generate verbatim excerpts from existing NYT news articles.
The aim of each of the claimants in these cases is to prove that the Generative AI could only have learnt the trends it has from their own pirated material. In so doing, they will have evidenced that an unauthorised copying of their work (at the first ‘ingestion’ stage) must have occurred, putting the AI company in breach of copyright even before AI content has been generated. In the absence of a watermark (as in the case of Getty), however, establishing this represents a significant challenge.
The argument that an AI’s product represents a straight-cut infringement of copyrighted material has been met with limited success, both in Getty’s case (watermark notwithstanding) and elsewhere. One notable example is Doe v GitHub Inc, a 2023 US claim made by a group of software developers against GitHub, a code repository and hosting platform promoting the development and sharing of open-source code between developers. The plaintiffs, who had published licensed code via the platform, alleged that GitHub allowed its code-writing AI Copilot (jointly developed with OpenAI) to train on their code, thereby exceeding the terms of their licence. This claim faltered due, in part, to a failure to specifically identify the licensed code within Copilot’s generated product, with the court ruling that the injury must be ‘particularised’ to the claimant.
There is also a growing argument that the method by which the AI stores and recalls the ingested data is so essential to its operation, and so far removed from traditional data storage (where a like-for-like copy occupies a set amount of digital storage space), that the copied work actually becomes a part of that AI. In other words, as it ‘grows’ and allows each work to influence the way in which it performs tasks and the styles which it may mimic (exactly like a student), the AI indelibly binds itself to each work it ingests. As such, the very architecture (or ‘DNA’, in a sense) of the AI can be said to be composed of fundamentally stolen works, which would make the mere existence of the AI an ongoing infringement of potentially millions of individual copyrights.
This process of learning by ingestion has been seen by some as an exploitable weakness. Systems have been developed to mask image content from AI by making changes that are imperceptible to the human eye but which confuse the AI into thinking the image falls into a category it does not. This year, this same technology has been taken a step further with the development of SAND Lab’s Nightshade, which has turned this defensive mask into a weapon, so much so that it has been described as ‘AI poison’. Essentially, if an AI encounters too many images which have been altered (almost imperceptibly, this time) by the Nightshade system, it learns deliberately misleading lessons and generates bizarre returns to certain prompts. These lessons are costly to unlearn, and the rationale behind these kinds of ‘poison’ systems is to make the use of pirated material (which may or may not be poisoned) too risky for AI companies to rely on, thereby making licensing with creators the more cost-effective choice. Nightshade has, however, been criticised by AI supporters as going too far, threatening to fundamentally (and potentially irrevocably) damage AI systems in an effort to protect copyright. Some might counter, however, that this risk may be avoided through the use of licences with content creators, by which access to data for training purposes might be permitted for a fee.
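The general idea behind such ‘masks’ (a change too small for a person to notice, yet enough to change what a system concludes) can be shown with a toy sketch. To be clear, this is not how Nightshade or any real masking tool works: the hundred-‘pixel’ image, the hand-picked weights and the ‘cat’/‘dog’ labels are all invented assumptions, and real tools target far more complex models and, in Nightshade’s case, the training process itself.

```python
def classify(pixels, weights, bias):
    """Toy linear classifier: a positive score means 'cat', otherwise 'dog'."""
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return "cat" if score > 0 else "dog"

def sign(value):
    """Return 1 for positive values, -1 for negative values, 0 for zero."""
    return (value > 0) - (value < 0)

def mask(pixels, weights, bias, epsilon):
    """Nudge every pixel by at most epsilon, pushing the score
    towards the opposite label, and keep pixels within [0, 1]."""
    direction = -1 if classify(pixels, weights, bias) == "cat" else 1
    return [
        min(1.0, max(0.0, p + direction * epsilon * sign(w)))
        for p, w in zip(pixels, weights)
    ]

# Hypothetical model and image: 100 mid-grey pixels, alternating weights.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
bias = 0.2
image = [0.5] * 100

masked = mask(image, weights, bias, epsilon=0.01)
largest_change = max(abs(a - b) for a, b in zip(image, masked))

print("Original image classified as:", classify(image, weights, bias))
print("Masked image classified as:  ", classify(masked, weights, bias))
print("Largest change to any pixel: ", round(largest_change, 4))
```

No single pixel moves by more than one per cent of its range, yet because every pixel is nudged in the direction the classifier is most sensitive to, the small changes add up and the label flips.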
Ultimately, the infringement itself is secondary to the concern felt by many that the market for creative products such as books and artwork is on the cusp of being drowned by AI-generated content, with some predicting that as much as 90% of all online content will be AI-generated by 2026. In the face of this brave new world, AI companies have argued that the use of training datasets is essential and that such copyright claims as these are only stymying progress and delaying the inevitable. Getty Images have rebuffed these claims, whilst at the same time offering a ray of hope to human creators: one of licence-based co-operation with generative AI. To illustrate this, Getty released their own generative AI in September 2023, imaginatively christened… Generative AI by Getty Images. Described by the company as ‘commercially safe’, the AI is reportedly trained exclusively on Getty’s own content and with the consent of its creators.
Whether or not such licence agreements can become normalised remains to be seen, especially as many governments (including Mr Sunak’s) appear more enamoured with the economic promises of unregulated AI, and less concerned with the rights of creators. There is much to be enamoured with: job augmentation and creation, acceleration of large-scale climate action, personalised needs-sensitive teaching and early cancer detection, to name only a few. The creative industry can and does benefit also, with error and plagiarism detection, metrics analysis for user-specific content, enhancing photo/film image quality and 3D design modelling all utilising AI tools. Greater transparency from AI companies as to how their models are trained would go a long way to stabilising the relationship between two industries that can greatly benefit from one another.
There is a need to strike a legislative balance too between the goals of rightsholders and AI companies, and it is important to remember the relative sizes of these industries in this respect; the UK’s creative industry sector (which includes IT and technology) accounted for nearly £116 billion in gross value added (GVA) in 2019 alone, while the total contribution to GVA from AI-related companies and professions in the six years between 2016 and 2022 was £3.7 billion. It needs to be said that the AI industry is widely projected to dramatically increase in value, but for now it remains just one small component of the creative sector. It is important therefore not to allow the limitless and futuristic potential of AI to distract lawmakers from the immediate issues of the present.
After all, these issues are unlikely to go anywhere. We are already a half-step through the looking-glass, and the time when an AI appearing in court would be the stuff of fiction is, as quoth the raven, Nevermore.