Emerald Pages
The Information Wall: AI Has Officially Run Out of Data
The era of making AI smarter by feeding it more internet data is over. Tech giants have officially hit the "data wall," forcing a fundamental shift in how artificial intelligence is built.
Photo: Harrison Thébaud | Medium
For years, the path to building smarter artificial intelligence was brutally simple: feed it more data. The bigger the dataset, the smarter the model. That era is over. Between 2024 and 2026, the AI industry hit a wall that researchers had been warning about for years. The public internet (all the blogs, articles, books, and forums that fueled the AI revolution) has been scraped dry. AI has run out of fresh, high-quality, human-written text to train on.
This isn't a future prediction. It's a present-day reality. Tech leaders have publicly acknowledged that the old trick of making AI smarter simply by feeding it more internet data has maxed out. According to research groups like Epoch AI, the stock of open, high-quality human text is effectively tapped out. The "data wall," or "peak data," as experts call it, has fundamentally changed the trajectory of AI development.
The timeline of this collapse is a stark reminder of how fast the technology has grown. In 2022 and 2023, research institutions warned of a shortage arriving between 2026 and 2032, but those warnings were largely ignored. By 2024, websites started fighting back. Major platforms, news outlets, and social media sites changed their site settings, such as robots.txt rules, to block AI web crawlers. The free flow of data that had built ChatGPT and its rivals suddenly became a scarce commodity.
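To make "blocking AI web crawlers" concrete, here is a minimal sketch of one common mechanism: a robots.txt rule that turns away a named crawler while leaving the site open to everyone else. The robots.txt content and the example.com URL are invented for illustration and are not taken from any real publisher. GPTBot is used only as an example of a crawler name, and robots.txt is a request that well-behaved crawlers honor, not a technical barrier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given crawler is permitted to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The AI crawler is turned away; an ordinary visitor is not.
print(can_crawl("GPTBot", "https://example.com/articles/1"))
print(can_crawl("SomeBrowser", "https://example.com/articles/1"))
```

A few lines like these, repeated across thousands of sites, are enough to shrink the pool of text a rule-following crawler can collect.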
The Insatiable Appetite of AI
The simple truth is that humans cannot write fast enough to keep up with an AI's appetite. To put it in perspective, the amount of data used to train these models has grown roughly 100-fold since 2020. An AI can "read" the entire history of human literature and the public internet in a matter of weeks, leaving nothing new for the next model to learn.
Furthermore, the internet is now filling up with its own echo. By 2025, studies showed that roughly half of all new articles published online were already AI-generated, making the open web an unsafe place to harvest clean, original human data. Training an AI on its own output leads to a phenomenon known as "model collapse"—like photocopying a photocopy, the quality degrades, and the AI loses its grip on real-world nuance.
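The photocopy analogy can be shown with a toy simulation. This is a deliberately simplified sketch, not how a language model is trained: the "model" here is just a bell curve described by a mean and a spread, and each generation is fitted only to a small sample drawn from the previous generation. The sample size, generation count, and seed are arbitrary choices made for illustration.

```python
import random
import statistics


def simulate_collapse(generations: int = 200, sample_size: int = 10,
                      seed: int = 0) -> list[float]:
    """Repeatedly refit a bell curve to samples drawn from its predecessor.

    Returns the spread (standard deviation) recorded at every generation.
    """
    rng = random.Random(seed)
    mean, spread = 0.0, 1.0  # generation 0 stands in for original human data
    spreads = [spread]
    for _ in range(generations):
        # The new generation only ever sees output from the previous one.
        samples = [rng.gauss(mean, spread) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        spread = statistics.pstdev(samples)
        spreads.append(spread)
    return spreads


history = simulate_collapse()
print(f"spread at generation 0:   {history[0]:.3f}")
print(f"spread at generation 200: {history[-1]:.3e}")
```

Each refit loses a little of the original variety, and the losses compound, so the final spread ends up a tiny fraction of where it started. Real model collapse is more complicated, but the direction is the same: rare and unusual material disappears first.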
The Iceberg of Information
The public internet is just the tip of the iceberg. While AI has consumed the entirety of Wikipedia, digitized books, and public forums, this represents a minuscule fraction of the information that exists in the world. The vast majority of human knowledge remains locked away, out of reach.
That hidden bulk is out of reach for different reasons. Some of it is trapped inside human heads (tacit knowledge like how to ride a bike or recognize a voice). Some is locked behind corporate firewalls and privacy laws. And some is simply too vast to measure: the way wind moves through trees, the mutations of bacteria, the movement of ocean waves. We do not have enough sensors, hard drives, or electricity on Earth to capture even a fraction of a percent of physical reality.
The Truth About AI
The most fundamental truth about this technology is that AI is based on guesses, not facts. At its core, a Large Language Model is just a massive prediction machine. It does not have a database of true facts that it looks up. Instead, it calculates a statistical probability: "Based on everything I've read, what is the most likely next word?" Because it is always playing a game of probability, there is always a chance it will guess the wrong word.
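The "most likely next word" idea can be sketched in a few lines. The scores below are invented for an imaginary prompt such as "The capital of France is"; a real model would compute scores for tens of thousands of candidate words. The function names are hypothetical, and the softmax step shown is the standard way to turn raw scores into probabilities.

```python
import math
import random

# Made-up scores for candidate next words (an assumption for this sketch).
LOGITS = {"Paris": 5.0, "Lyon": 2.0, "London": 1.0, "banana": -2.0}


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


def sample_next_word(probs: dict[str, float], rng: random.Random) -> str:
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


probs = softmax(LOGITS)
rng = random.Random(42)
draws = [sample_next_word(probs, rng) for _ in range(10_000)]
wrong = sum(word != "Paris" for word in draws)

for word, p in probs.items():
    print(f"{word:>7}: {p:.4f}")
print(f"non-'Paris' picks in 10,000 draws: {wrong}")
```

Even when one answer dominates, every other word keeps a small, nonzero probability. Sample often enough and the unlikely word will eventually be chosen.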
Expecting an AI to be 100% accurate is like expecting a weather forecaster to never be caught in the rain: the math inherently allows for errors. Even if AI companies somehow obtained all the data in the universe, the model would still never be perfectly accurate. The data needed for perfection does not exist, because human knowledge is constantly changing, messy, and full of contradictions.
The world changes (Who is the Prime Minister? What is the stock price?). Humans disagree on medical, political, and philosophical questions. And the internet is full of lies, jokes, sarcasm, and mistakes. Since AI learns from us, it learns our mistakes. To make it perfect, we would have to clean all the human errors out of the internet first—an impossibility.
So, while the tech industry talks about building "God-like" intelligence, the reality is much more grounded. They have built a very powerful calculator that has memorized a tiny, messy, human-written subset of the internet. Because we can never truly control or capture the infinite amount of information in the real world, AI will always be operating with a massive blind spot.