Published Date: 07-26-23

Since OpenAI released ChatGPT in late 2022, it seems like everyone has been talking, pontificating, or simply ranting about artificial intelligence. Many people believe AI is poised to change the planet in the next several years – maybe, some say, by eliminating our species. Even some of the AI developers see the risks – for example, OpenAI’s CEO is calling on Congress to regulate the technology.

Fortunately, we had already packed go bags in anticipation of the robot apocalypse. That means we’ve had ample time to consider how the next wave of technological “innovation” will affect creative livelihoods. We’re not software engineers, lawyers, or fortune tellers, so we can’t say for certain … but it certainly smells like something is rotten in cyberspace.

Why are we concerned? It’s because the Big Tech companies that have repeatedly failed to prevent piracy on their search engines, social media platforms, and other online services are the same ones that are now at the forefront of “generative” AI.

These headline-dominating “generative” AI systems are complex algorithms. They train on datasets that include immense amounts of copyrighted material – often without the permission of authors and artists.

One prominent example: Google trained T5, which stands for Text-To-Text Transfer Transformer, on a dataset called C4 (Colossal Clean Crawled Corpus). Google engineers assembled it from Common Crawl, a collection scraped from the web.

A Washington Post investigation of C4 found a massive quantity and variety of copyrighted materials, some of which sat behind paywalls intended to preserve their value. In fact, five of the 10 most extensively plundered domains belong to newspapers. Google didn’t stop there – it also harvested text from creators on Kickstarter and Patreon, as well as from bloggers on Medium, WordPress, and similar platforms.

Apparently, when Google engineers “cleaned” the Common Crawl dataset, they were unconcerned about sweeping in other people’s intellectual property. The Washington Post’s analysis of C4 found 200 MILLION instances of the copyright symbol (©) and 28 domains flagged by the Office of the U.S. Trade Representative in an annual Review of Notorious Markets for Counterfeiting and Piracy.

Ah, the willful disregard of the rights of creatives. Takes us back to earlier times, those halcyon days when “Don’t Be Evil” Google first acquired YouTube, fully aware that it was primarily used for film and television piracy.

Then there’s Meta’s BlenderBot, which we’ve previously ridiculed (rightfully). Now, Meta has released another text generator called LLaMA, which stands for Large Language Model Meta AI (obviously). To train it, Meta used Google’s C4 dataset and augmented it with a larger set of English-language webpages from Common Crawl. As a result, we suspect (as does The Washington Post) that LLaMA’s piracy issues are at least as bad as T5’s – if not worse.

Again, who’s surprised? These are the same companies that have been profiting from pirated content for decades.

We have watched Facebook roll out a second-rate content protection tool nine years after introducing video to its platforms, a delay that allowed piracy to flourish unfettered. We have watched Google receive hundreds of millions of takedown notices per year, only to sit on its hands while pirate sites continue to proliferate in search results. And, more recently, we have watched Twitter ignore the posting of full-length, high-definition pirated copies of movies on its platform.

So now, as we watch Silicon Valley hurtle toward the next phase of the internet and, perhaps, the end of humanity, we can’t help but quote the great Taylor Swift: We’ve seen this film before – and we didn’t like the ending.

This time, let’s write a different ending – one where humans win, and creatives are able to put food on their tables.