Grand theft author

I asked “Ernest Hemingway” about artificial intelligence (AI): “It is a simple thing but it can be made complex. The simple thing is to make it do what you want it to do. The complex thing is to make sure it doesn’t do what you don’t want it to do.”

“Charles Dickens” jotted this down: “The AI was a creature of both light and dark, a force of both creation and destruction. It was up to humanity to decide how the AI would be used, for good or for evil.”

While “Sylvia Plath” dashed off this stanza:

The machine is a marvel, a miracle,
A gift from the gods, or so they say.
But what if the gods are cruel?
What if the machine is our undoing?

This time – after the zombie unpleasantness – I didn’t perform any acts of necromancy or even pull out the ouija board. Instead, I turned to Bard, Google’s AI assistant, to generate the quotes.

They are… okay, I suppose. A human parodist would offer up something more refined, but they’d expect to be paid and would waste all sorts of time on tea breaks and staring out of the window wistfully. Bard vomited up its results in less than a second, running its Sauron eye across a vast corpus of online information.

With each passing version of these generative AI tools, they get better at mimicking the styles of writers, artists, and musicians. And if I’d had access to Meta’s large language model (LLM), the quotes might already have been much better. A lawsuit filed in July 2023 by the writers Sarah Silverman, Richard Kadrey, and Christopher Golden alleges that OpenAI and Meta – the company formerly known as Facebook – have pirated thousands of books to train the AIs.

As Alex Reisner wrote for The Atlantic: “The future promised by AI is written with stolen words.” The LLMs need high-quality text to get really good, and the general content of the internet – angry comment sections, unhinged Reddit threads, the complete back catalogue of Richard Littlejohn columns – isn’t good enough. Hence the suggestion that nicked novels are fuelling the rapidly improving AI output.

The lawsuit says that upwards of 170,000 books are part of the AI’s training model. They range from literary fiction by authors such as George Saunders and Zadie Smith to commercial thrillers by James Patterson and Stephen King. The dataset, called Books3, is used by a range of other projects, including models from Bloomberg and EleutherAI (a popular open-source project). While the world’s authors are sweating over their keyboards, AI is silently stealing their voices.

Books3 is huge but it’s just a small part of an even larger set of training data called “the Pile”. It also throws in YouTube subtitles, European Parliament documents and transcripts of speeches, the contents of the English version of Wikipedia, a record of hundreds of thousands of emails sent and received by employees of the late and unlamented Enron Corporation, and millions upon millions of other words scooped up to feed the digital beasts.

In Neal Stephenson’s 1992 novel Snow Crash, the main character Hiro Protagonist – ha ha ha – is a hacker with a side-hustle as a stringer for the now privatised CIA (“the Central Intelligence Corporation of Langley, Virginia”), grabbing information for its vast central database:

The business is a simple one. Hiro gets information. It may be gossip… it can even be a joke based on the latest highly publicised disaster. He uploads it to the CIC database – the Library, formerly the Library of Congress, but no one calls it that anymore… It used to be a place full of books, mostly old ones… then all the information got converted into machine-readable form…
… CIC clients, mostly large corporations and Sovereigns, rifle through the Library looking for useful information, and if they find a use for something that Hiro put into it, Hiro gets paid.

Stephenson was writing a dystopian future, but at least Hiro gets paid, unlike the dystopian present in which authors’ stories are chewed up into bloody chum for the endlessly hungry jaws of the generative AI models – just as Hollywood is hoping that actors’ performances can be input and reused in a thousand new films at no extra cost.

Piracy has existed forever – perhaps stretching all the way back to when one cave dweller cribbed another’s drawing from a nearby wall – but this new iteration is worse. We have moved from minor plagiarism to a world in which writers’ voices can be taken and replicated at will. The companies in control of AI are like so many versions of Ursula the Sea Witch from The Little Mermaid, making every author into Ariel, their voices up for grabs.

Does it matter if a machine writes “new” Hemingway sentences or spits out verse in the style of Sylvia Plath? Not to me, but it will matter a great deal to the estates of those authors still in copyright. Yes, there have been new Agatha Christie novels and James Bond books from officially sanctioned ghostwriters, but they are not deceptions; the reader knows these are continuations from another’s pen. The “literary” output of AIs will be different, and it will steal from the living as often as the dead.

Mic Wright is a journalist based in London. He writes about technology, culture and politics

Life, October 2023, Tech Talk
