October, 2025 ∙ 5 minute read
Models of reality vs. models of what people say about reality.
This was originally part of an internal discussions post, surfacing on my blog because it is interesting to see how far we’ve come since October 2025 and how much my thinking has changed. For context, Claude Opus 4.5 was released November 24, 2025. Read part 1.
One hypothesis of the LLM true believers is that mastery of language is the foundation on which all other intelligent capabilities can be built. As LLMs scale, as more and more data and compute are poured into foundational models, this mastery of human language results in a system that can be an expert in any domain. Language means being able to write code; it means being able to “think out loud” and “reason”; it means being able to predict what humans say and what they want you to say in any given situation, whether in a bar exam, as a therapist, or as a Disney character; it means being able to provide iterative outputs based on feedback; it even means being able to take real-world actions: run this program, book this flight.
Honestly, it is amazing that we’ve been able to scale up modeling language like this! And the flexibility and general applicability of LLMs are interesting and surprising.
However, I’m not convinced that language is the essence of intelligence: I think the arrow points the other way. I don’t believe that the mastery of predicting language actually equates to understanding, or learning, or a falsifiable modeling of reality. This doesn’t mean that LLMs aren’t useful, and yes, there are a number of other ML technologies and techniques that DO model or attempt to model reality. But the distinction does clarify where to use LLMs (and where not to) and what to expect from them.
The trap I see individuals and companies falling into is this: they truly believe that the foundational LLMs model the real world (if not today, then soon). This isn’t true at all. They are trained on language (and sure, pictures and videos too) that humans have written about the real world, but they are stuck in a simulation playing a one-way game of telephone. They can’t learn, they can’t test hypotheses, and therefore they can’t build knowledge; they can only simulate how humans talk about knowledge. This means that they are fundamentally untrustworthy for certain applications. They might be excellent retrieval systems, but they are doomed in any generative endeavor that must meet the harsh reality of how the world actually works.
This is actually a very helpful delineation!
When we think about writing code, there is a broad spectrum of software and jobs-to-be-done by software out there in the world. Let’s classify it broadly as Type 1 and Type 2 software.
Type 1: Prototyping? Building yet another CRUD web app? Doing a task with code that many have done before you? LLMs are going to be amazing. They don’t understand, but they don’t need to, and there’s not a lot of “reality” to crash into. The quality of the output will increase with how often examples of solving this particular sort of thing appear in the training set. And honestly, for many of these applications, low quality and one-off use is fine. For many people, this will be the difference between having a software app or not. This in itself changes the world dramatically. I expect a collapse of 90% of software into a few languages (Python, JavaScript). There will be SO much of this type of software that it’ll overwhelm what we have now, but it’ll be like cheap shit from Amazon: inexpensive, useful, disposable, of varying quality, and challenging to manage because it’ll clog up everything else. We’ll need waste management solutions. It’ll super-charge spam and abuse and bots and bad actors on the internet. It’ll also let a billion people who are unable to do so today “code” apps.
Type 2: Software Engineering. Systems of consequence. Internet platform providers. Financial systems. Novel ideas. Health care. Governmental data. Fundamental building blocks. Open source stalwarts. Novel algorithms and data structures and architectures. Performance, correctness. Anywhere that software’s job is to model reality. No matter how good they get, LLMs are going to be frustrating to use for Type 2 software. They will require extra effort from human experts to check their work, they will require complex online and offline eval setups to validate their performance, and they will still do a bad job even with lots of iterations. I predict the juice won’t be worth the squeeze. Completions, Q&A, etc. will still provide some value in Type 2 development, but agentic coding by LLMs will be a dead end here. (NOTE: other ML techniques may be developed and used successfully here, but LLMs writing this sort of code will be hard and frustrating.)
⚠️ Some people will be confused about what type of software they are writing and use LLMs poorly, and there will be direct fallout and consequences.
I think there will be a variety of tasks, common to all software development, where LLMs will provide value and productivity benefits. Hopefully we’ll get some assistance with the toil of repetitive tasks (LLMs are good at following patterns without necessarily understanding them). I expect LLMs to continue to augment and work with traditional information retrieval systems to provide fast access to common/general programming knowledge. Hopefully we can elevate/expose the patterns that produce high quality, performant, and maintainable code. This will be true for completion products and “ask/search” workflows.
This post was handwritten, but in porting it to my public blog, I did use a model (Claude Opus 4.7) to copyedit; the changes were minor spelling and grammar fixes. See my AI attribution page for more.