Fracking for Data and The Copy Problem.

A brief history of AI breakthroughs and badly plumbed information flows.

Jan 30, 2026

The performance of AI systems is driven by three factors: compute, data and algorithms. But among them, one reigns supreme as the driving force behind how AI got smarter: Data. In the first section of Andrew Trask’s excellent talk, he runs us through the history of AI breakthroughs, summarised here:

2009

Raina, Madhavan and Ng publish an explosively influential paper that kicks off a revolution in machine learning, where they train a model on a GPU for the first time. More compute, vastly better performance.
ImageNet. A dataset containing millions of human-labelled images depicting thousands of object categories is released by Fei-Fei Li and colleagues. An open competition is started, inviting ML researchers to test their algorithms on object recognition and classification at scale.

2012

At the third annual competition, Ilya Sutskever and Alex Krizhevsky, under the supervision of Krizhevsky’s advisor at the University of Toronto, Geoffrey Hinton, use the GPU-training idea to build a powerful algorithm for the ImageNet competition. As big as the ImageNet dataset is, over 15 million labelled images with 22,000 object categories by the time they take on the challenge, they try making it even bigger. To do this, Krizhevsky plays around with the images, adjusting their colour, tilt, cropping and mirroring. They train their model, called “SuperVision”, and submit it. It blows the other competitors out of the water. The jump in performance? An artificially enlarged dataset and better algorithmic techniques.

2013

The arrival of Word2Vec. By turning words into vectors, you could do maths with words. The core idea of Mikolov, Sutskever et al.’s paper is this: simplify the architecture so the model can train on orders-of-magnitude larger datasets. It looked like a smarter algorithm, but it was actually the sheer volume of data that made Word2Vec so impressive.
DeepMind Atari. What made this moment different was that the model wasn’t trained on static data: it was trained on a game, and games are an infinite source of data. You can play them again, and again, and again. The model can train against itself. Again, it looks like a smarter algorithm; it was actually more data.

2016

DeepMind AlphaGo. Go is a very complex game. The size of the game tree (the space of all possible game states) is estimated to be 10^700 – that’s a 1 followed by 700 zeroes. To put that into perspective, the number of atoms in the observable universe is around 10^82. There is, practically, an infinitely large dataset here. By training against itself, DeepMind’s AlphaGo cooked all other Go programs and beat the European champion 5 games to nil.

2017

‘Attention Is All You Need’. Another highly influential paper. The innovation was the same as Word2Vec’s: simplify the model, train it on more data.

2018

GPT-1. By this time, researchers knew that more data = more capabilities, but labelled data was scarce. Instead, the team at OpenAI tried a different approach, in two steps: (1) unsupervised pre-training on a big, unlabelled set of text, then (2) supervised fine-tuning. Giving much more data to the transformer architecture resulted in a huge jump in capability. The algorithm was much the same, the data was bigger, the performance got better.

2019

GPT-2. What made GPT-2 better than its predecessor? Sure, there were some improvements in architecture and compute, but again, the real reason? Training the model on WebText, a dataset of millions of pages from the internet.

2020

GPT-3. The innovation here: more compute to handle more data.

2022

InstructGPT. A new kind of contributor, the human labeller, annotates new data showing how you want the model to behave. The strategy was the same: higher quality (and more) data.

… We could go on and on. At each breakthrough, it looked like compute or algorithmic innovations, and yes, it’s true that was often going on, but the real factor behind these leaps was data: more of it, more complexity, higher quality.

This Past Year

Most of us are familiar with Large Language Models (LLMs) and Diffusion Models (for image and video), but 2025 was a huge year for a new kind of AI system: World Models. Humans and animals can quite naturally construct internal, mental representations of the world. Take the moment you embark on your morning commute to work or a walk to your local café – most of us naturally generate miniature simulations of the journey, mentally planning our route before actually interacting with our environment. World Models aim to give AI systems the same kind of internal simulation.

These World Models are starting to be deployed in consumer technology, too. This past year, we saw some big steps in consumer robotics, such as the XPENG Iron and 1X’s semi-autonomous NEO robot. Although the NEO looks cool, it still seemed a little wobbly and prototype-y to me, requiring humans to teleoperate it so it can learn new tasks. So why release it so early? Data, data, data. If 1X can ship first, they can start collecting umpteen data points from real-world environments, rapidly improving the 1X World Model that powers NEO. In testament to the (not so) secret history, more data means more capable models, and this advantage compounds quickly.

Given I’ve just laid out the case that data is the driving factor behind model improvements, it might surprise you to learn that AI models are trained on a relatively small amount of data. The biggest of them all, Meta’s Llama 4 Behemoth, used approximately 180TB of training data. Hang on, that sounds like a lot, right? Nope.

For a little over a decade, we’ve been in what’s known as the “zettabyte era”. There are various benchmarks to indicate when this happened, but the way I prefer to think about it is when the total global volume of data was estimated to have hit 1 zettabyte – and that was in 2012. To put that into perspective, one single, measly zettabyte is the data equivalent of 1 billion 1TB hard drives. As of last year, we’d likely hit 180ZB – and this volume is predicted to double every two years. A lot of this future exponential growth will be driven by AI-generated “synthetic data”, some of it good, but most of it total slop. The good stuff? The valuable data that global business and research depend on? It’s privately held, making much of it inaccessible to the web crawlers that gobbled up the world’s publicly accessible information to train AI models.
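If you like to see the arithmetic spelled out, here is a rough sketch in Python. It assumes decimal (SI) units, where 1TB is 10^12 bytes and 1ZB is 10^21 bytes, and it takes the 180TB and 180ZB figures above at face value. The function names are mine, made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: decimal (SI) units, 1 TB = 10**12 bytes and 1 ZB = 10**21 bytes.
TB = 10**12
ZB = 10**21

TRAINING_SET_BYTES = 180 * TB  # largest training set, as quoted above
GLOBAL_DATA_BYTES = 180 * ZB   # estimated global data volume, as quoted above


def one_tb_drives_per_zettabyte() -> int:
    """How many 1TB hard drives it takes to hold a single zettabyte."""
    return ZB // TB


def training_share_of_global() -> float:
    """Fraction of the world's data represented by the training set."""
    return TRAINING_SET_BYTES / GLOBAL_DATA_BYTES


def projected_volume_zb(start_zb: float, years: float,
                        doubling_period_years: float = 2.0) -> float:
    """Volume after `years`, if it doubles every `doubling_period_years`."""
    return start_zb * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    print(f"1TB drives per zettabyte: {one_tb_drives_per_zettabyte():,}")
    print(f"Training share of global data: {training_share_of_global():.0e}")
    print(f"Projected volume in 4 years: {projected_volume_zb(180, 4):.0f}ZB")
```

On these numbers, the biggest training set is roughly one billionth of the data that exists, and two more doublings would take the global total from 180ZB to 720ZB.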

The Copy Problem

‘In a flourishing online ecology… information is the supreme currency’
(Nissenbaum, 2011)

Since the 1970s, large oil reserves have been getting increasingly difficult to find. The oil that’s left for us to extract is located in more remote areas (think sub-arctic and offshore) or requires more complex and dangerous methods of extraction, such as fracking. It’s true that tech companies, like oil companies turning to fracking, have resorted to increasingly dangerous methods of resource extraction, and they now face some 70 ongoing lawsuits from copyright owners. But data is a very different economic resource to oil for one reason:

Oil is finite, data is infinite – because it can be used again, and again, and again.

Unlike the Wu-Tang Clan’s Once Upon a Time in Shaolin, each piece of data has limitless economic reuse: when data is shared or sold, it duplicates, but in regenerating itself, it creates what’s known as the “Copy Problem”:

Once data is copied, shared, sold or stolen, the original owner can no longer control how the recipient uses it.

In academia, the copy problem is mitigated by robust protocols around attribution: it’s a cardinal sin to engage in plagiarism, passing off someone’s work as your own, no matter how valuable this data might be to you.

But AI is taking a wrecking ball to attribution controls.

Hanan Maayan framed this change perfectly: ‘we are transitioning from the era of Search, where people ask questions and get information, to the era of LLMs, where people ask questions and get answers’. Results are the curation of attributable sources; search engines simply present information based on relevance (and sneak in plenty of Ads, too). The direct source of the information is right there, just a click away.

LLMs don’t work like that: they’re trained on information, and the provenance of their answers is often opaque or, worse, hallucinated entirely. LLMs are the copy problem ad absurdum. During training, sources get recycled again and again in a soup of training data – until the model produces answers whose provenance it can no longer determine itself. In this way, answers become derivative work, and attribution goes missing. The original, human-generated information the AI model trains on undergoes a “spectacularly transformative” metamorphosis: the very words from the judge presiding over the monumental Bartz et al. v. Anthropic lawsuit.

‘Ideas are all as much articles of merchandise as bread’
(Spooner, 1855)

If you want a perfect example of the copy problem on steroids, you’ve found it here. The TLDR is this: Anthropic needed more data to train their models. Like oil companies turning to fracking, they turned to a very dubious method of resource extraction. Its internal, secret title was Project Panama. It involved spending tens of millions of dollars buying as many books as possible, slicing off their spines and scanning their pages – then destroying them. They then used these texts for model training. This part, they got away with. Using these for training was considered within the remit of Fair Use, since the process of model training was considered transformative, likened by the judge to training children how to write well. But there was one part they didn’t get away with. Before Project Panama, Anthropic had downloaded some 7 million books from so-called “shadow libraries”, illegal sites which non-consensually scraped millions of human-authored texts. Even if training on those texts was transformative Fair Use, the way they were obtained wasn’t: it was theft. The class action is now undergoing mediation with Anthropic for settlement.

Grand Theft Authors

A single lazy student reading and regurgitating an original work without proper citation is a contained, if nontrivial, harm. At the scale of AI – some 1.7 billion global users – the copy problem becomes existential.

A little while ago, I tested out some models for academic work, prompting them with questions about my research area. The results were generic, formulaic and, in more than a few cases, frustratingly hallucinated. But every now and then, I’d test a model on something I knew plenty about already, and on a few occasions it surprised me with seemingly crisp, accurate answers. Interesting. How did it nail such a niche area of research? A few prompts later, it revealed its source: a relic blog post hosted on some far-flung corner of the web. That tidy, sharp answer? A summary of a summary of an original article I’d read – one you can’t read without a bought-and-paid-for academic license. Triple-distilled derivative work. Along the way, the model had played the Telephone Game with the core ideas: the result was structurally sound but still a lacunose patchwork of subtly missing context.

We need to fix the information flow.

To do this, we must turn to Helen Nissenbaum’s work on informational norms. Informational norms involve actors (subject, sender, recipient), attributes (information types) and transmission principles (the constraints that govern how information flows between actors). When the flow of information adheres to the norms of a shared moral framework, things are okay. But under the uses and misuses of AI, the copy problem has gone from a leaky tap to an erupting fire hydrant of failing information flows.
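To make those three ingredients concrete, here is a minimal sketch in Python. It is my own toy reading of the framework, not Nissenbaum’s formalism: the `Flow` and `Norm` classes, the role names and the example norm are all hypothetical, and real norms are far richer than an exact-match lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One movement of information between actors."""
    subject: str     # actor: whose information it is (e.g. the author)
    sender: str      # actor: role passing the information on
    recipient: str   # actor: role receiving it
    attribute: str   # information type
    principle: str   # transmission principle governing the flow


@dataclass(frozen=True)
class Norm:
    """A flow pattern that the shared framework accepts."""
    sender: str
    recipient: str
    attribute: str
    principle: str


def is_appropriate(flow: Flow, norms: frozenset) -> bool:
    """A flow is okay only if it matches one of the accepted norms."""
    pattern = Norm(flow.sender, flow.recipient, flow.attribute, flow.principle)
    return pattern in norms


# Hypothetical norm: publishers share articles with licensed researchers,
# on the condition that the source is cited.
NORMS = frozenset({
    Norm("publisher", "licensed researcher", "journal article", "cite the source"),
})

licensed_read = Flow("author", "publisher", "licensed researcher",
                     "journal article", "cite the source")
scraped_copy = Flow("author", "shadow library", "model trainer",
                    "journal article", "no attribution")

if __name__ == "__main__":
    print(is_appropriate(licensed_read, NORMS))
    print(is_appropriate(scraped_copy, NORMS))
```

The licensed read matches an accepted norm and the scraped copy does not. The point of the toy is that appropriateness depends on who passes what to whom, and under which conditions, not on the information alone.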

To plug the leak, we must introduce strong input and output verifications to the information flows of AI models. Input verification means that recipients get valuable information from trusted sources and that, once received, this information is anonymously and securely associated with their rights. Similarly, output verification means that the computation performed on that information can be verified by its original custodian.

That’s exactly what we’re building at Kanonic. Our platform integrates researchers’ academic licenses with AI web search, modifying the way these models search by filtering out synthetic content and untrustworthy, unciteable sources. Researchers get answers, sure, but these answers look and feel like results: every claim has a direct attribution – straight to the source. Each retrieval is grounded by an anonymised identifier, confirming the access was initiated by a fully paid-up, license-granted human researcher. And then? The recipient’s information disappears when their research session ends or their SSO credentials reset (every 24 hours), whichever is sooner, with nothing used for model training and no risk of derivative work generation. These are our input verification controls. For publishers and authors, we’re building transparent output verifications, allowing them to audit the information flow and providing them with a suite of AI impact metrics.
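The “whichever is sooner” rule is easy to pin down in code. What follows is an illustrative sketch only, not Kanonic’s implementation: the `Session` class and function names are hypothetical, and I’m assuming the 24-hour SSO reset is counted from the start of the session, which the description above doesn’t specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the SSO reset falls 24 hours after the session starts.
SSO_RESET_INTERVAL = timedelta(hours=24)


@dataclass(frozen=True)
class Session:
    """Hypothetical research session for a licensed user."""
    started_at: datetime
    ended_at: Optional[datetime] = None  # None while the session is open


def purge_deadline(session: Session) -> datetime:
    """Session end or SSO reset, whichever comes sooner."""
    sso_reset = session.started_at + SSO_RESET_INTERVAL
    if session.ended_at is None:
        return sso_reset
    return min(session.ended_at, sso_reset)


def must_purge(session: Session, now: datetime) -> bool:
    """True once the recipient's retrieved information should be gone."""
    return now >= purge_deadline(session)


if __name__ == "__main__":
    start = datetime(2026, 1, 30, 9, 0)
    short = Session(start, ended_at=start + timedelta(hours=2))
    still_open = Session(start)
    print(purge_deadline(short))       # the session end, two hours in
    print(purge_deadline(still_open))  # the SSO reset, 24 hours in
```

A session closed after two hours is purged at that point, while one left open is purged at the 24-hour reset.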

Today, millions of texts are going through AI models without a morsel of information-sharing with the copyright holders. By fixing the information flow, we can change that, and build a better platform for learning and work.

If you’re interested in building with us or want to be part of our demo group, please check out www.kanonic.ai

elianmccarron.substack.com
