
AI Should Help Us Produce Better Code, Not More — Lessons from Simon Willison and 77 Lobsters Comments

Simon Willison argues that shipping lower-quality code with AI agents is a choice. The Lobsters community had 77 comments debating this. Here's what both sides got right.

March 12, 2026 · 8 min read · By Claude World

There is a narrative forming around AI coding tools that goes like this: AI makes developers faster, faster means more code, more code means more bugs, therefore AI makes code worse.

Simon Willison disagrees. And after 77 comments on Lobste.rs — the highest-engagement thread on the site today — it turns out the developer community is deeply split on why.


The Core Argument: Quality Is a Choice

Willison’s central claim is blunt: shipping lower-quality code with AI assistance is a deliberate choice, not an inevitable outcome.

Teams are choosing to use agents for volume. They could just as easily choose to use agents for rigor. The technology does not pick sides. The humans wielding it do.

His most quotable line frames the opportunity:

“The cost of code improvements has dropped so low that we can afford zero tolerance to minor code smells.”

Read that again. He is not saying AI makes code better automatically. He is saying AI makes the cost of making code better almost negligible. That is a fundamentally different argument than “AI writes great code.” It is an argument about economics.

When a rename-and-refactor that used to take a developer two focused hours now takes an agent three minutes, the calculus changes. You stop tolerating the workaround. You stop living with the TODO comment. You actually fix things.


Three Use Cases That Make the Argument Concrete

Willison does not leave this as philosophy. He identifies three specific areas where agents excel at improving quality rather than inflating volume.

1. Technical Debt Mitigation

API redesigns. Nomenclature cleanup. Consolidating three functions that do almost the same thing. Splitting a 2,000-line file that everyone is afraid to touch.

This is the category of work that every team knows they should do, that every sprint planning meeting deprioritizes, and that quietly makes the codebase worse every quarter. It is simple-but-tedious — exactly the kind of task where agents outperform humans not in cleverness, but in patience.

An agent will cheerfully rename 347 references to a function across 89 files without getting bored, making a typo on file 64, or deciding to “also quickly refactor this other thing while I’m here.”

2. Exploratory Prototyping

Should you use Redis for the activity feed? Will a materialized view handle the query load? Can WebSockets scale to your projected user count?

The old approach: a three-hour meeting where senior engineers debate based on experience and intuition, followed by a spike that takes a sprint.

The agent approach: build a simulation, run load tests, generate performance data, present results. If Redis is wrong, you find out in 20 minutes instead of two weeks. If it is right, you have the proof to skip the debate entirely.
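To make the "build a simulation" step concrete, here is a minimal sketch of the kind of throwaway prototype an agent might produce in those 20 minutes: a toy fan-out-on-read activity feed in pure Python, measuring read latency under synthetic load. The design and numbers are illustrative assumptions, not from Willison's article.

```python
import random
import statistics
import time

def simulate_feed_reads(num_users, posts_per_user, reads):
    """Toy model: each user's feed is the merged, time-sorted posts of
    the accounts they follow. Measures read latency for a naive
    fan-out-on-read design to compare against alternatives."""
    random.seed(42)  # deterministic synthetic data
    posts = {
        u: sorted(random.random() for _ in range(posts_per_user))
        for u in range(num_users)
    }
    follows = {u: random.sample(range(num_users), k=10) for u in range(num_users)}

    latencies = []
    for _ in range(reads):
        u = random.randrange(num_users)
        start = time.perf_counter()
        # Merge followed users' posts and take the 50 most recent.
        sorted((t for f in follows[u] for t in posts[f]), reverse=True)[:50]
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

median = simulate_feed_reads(num_users=1000, posts_per_user=100, reads=500)
print(f"median read latency: {median * 1e6:.0f} µs")
```

A second sketch swapping in a fan-out-on-write design would give you comparable numbers, and the debate becomes a chart instead of a meeting.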

Willison frames this as eliminating expensive guesswork. In practice, it also eliminates expensive arguments.

3. Expanding Solution Spaces

Here is an underappreciated capability of LLMs: they have seen a lot of boring, proven solutions. When your team is deep in a domain, you develop tunnel vision. You reach for the tools and patterns you know. An LLM, drawing from its training data, will often suggest the most common approach to a problem — and that common approach exists for a reason.

Willison connects this to the “Boring Technology” philosophy. The best solution to many engineering problems is not the clever one. It is the obvious one that thousands of teams have already battle-tested. LLMs surface those solutions naturally because they are overrepresented in the training data.


Compound Engineering: The Multiplier

Willison highlights a technique from Every’s engineering methodology called Compound Engineering: after each project, conduct a retrospective to document which agent patterns worked, what prompts produced good results, and which approaches failed.

This creates a flywheel. Each project produces not just code but also reusable knowledge about how to use agents effectively. The next project starts from a higher baseline. Quality compounds.

This is not a new idea — it is essentially the engineering postmortem applied to AI collaboration. But it matters because most teams treat each agent interaction as a one-off. They do not build institutional memory around what works.


What 77 Lobsters Think

The Lobste.rs discussion is worth reading in full because it captures the real tension in the developer community around AI coding. Here is the landscape.

Where the Community Agrees with Willison

Tedious, checkable tasks are the sweet spot. Multiple commenters confirmed that TypeScript migrations, large-scale refactoring, and boilerplate generation are areas where agents genuinely produce better outcomes than manual work. The key word is checkable — you can verify the result mechanically.

Technical debt paydown is cheaper than ever. Several developers noted that refactoring work they had been postponing for months was suddenly tractable. The economic argument resonated.

Agents handle plumbing well. Threading data through multiple layers of an application — services, controllers, repositories, serializers — is exactly the kind of boring, error-prone work where humans introduce silly bugs through inattention. Agents do not get bored at layer four.

Where the Community Pushes Back

The learning problem. Multiple commenters raised a pointed question: if agents write the code, how do junior developers learn? Debugging, refactoring, and reading other people’s code are how developers build mental models. Outsourcing that work to agents could create a generation of developers who can prompt but cannot reason about systems.

This is a legitimate concern with no clean answer yet.

Incentive structures favor volume. Several commenters drew a parallel to the Industrial Revolution: when machines made production cheaper, we did not produce less and better. We produced more and cheaper. One commenter wrote that there is “nothing in the technology that promotes less volume and more deliberation.” The technology is neutral, but the economic incentives are not.

Amazon’s retreat from LLM-generated code. Multiple commenters cited reports that Amazon had backpedaled on using LLM-written code internally, interpreting it as a signal that even well-resourced teams are finding the quality problems real.

The “expanding cloud of slop.” A memorable phrase from the discussion. The concern is not just bad code in your repository but bad code in your documentation, your tests, your commit messages, your pull request descriptions — an expanding surface area of text that looks right but is subtly wrong.

The Interesting Middle Ground

The most thoughtful comments occupied a nuanced position:

Writing did not kill oral tradition skills — it transformed them. One commenter drew a historical analogy: the invention of writing was feared to destroy memory and rhetorical skill. It did change those skills, but it did not eliminate the need for them. AI coding may do the same — change what “being a good developer” means without making the underlying skills obsolete.

Most technical debt comes from indecision, not mistakes. A sharp observation from the thread: the worst code quality problems are not bugs. They are unresolved design decisions — the “we’ll figure this out later” that never gets figured out. Agents cannot help with that because it is a human organizational problem.

The real definition of quality. One commenter offered a definition that stuck: quality code is code that is understandable, navigable, extensible, and deletable. That last criterion — deletable — is underrated. Good code is code you can remove cleanly when you no longer need it.


The Research: Anthropic’s Own Data Says It Depends on How You Use AI

While the Lobsters debate is about opinions, Anthropic researchers Judy Hanwen Shen and Alex Tamkin published hard data on this exact question. Their paper, “How AI Impacts Skill Formation” (February 2026), ran randomized controlled experiments with 52 developers learning a new Python library.

The headline result: AI users scored 17% lower on comprehension tests (p=0.010), with no significant improvement in task completion time (p=0.391). The AI did not make them faster or more knowledgeable. It just made them feel more productive.

But the real finding is more nuanced. The researchers identified six distinct patterns of how developers interact with AI — and three of them actually preserved learning:

| Pattern | Quiz Score | Time | What They Did |
|---|---|---|---|
| AI Delegation | 39% | 19.5 min | Only asked AI to generate code, pasted as answer |
| Progressive AI Reliance | 35% | 22 min | Asked questions for task 1, delegated entirely for task 2 |
| Iterative AI Debugging | 24% | 31 min | Repeated AI-assisted troubleshooting (5-15 queries) |
| Conceptual Inquiry | 65% | 22 min | Only asked conceptual questions; resolved errors independently |
| Hybrid Code-Explanation | 68% | 24 min | Asked for code generation with explanations |
| Generation-Then-Comprehension | 86% | 24 min | Generated code first, then asked follow-up “why” questions |

The gap is stark: 24-39% quiz scores for low-engagement patterns vs 65-86% for high-engagement ones. The difference is not whether you use AI — it is whether you engage your brain while doing so.

The worst pattern? Iterative AI Debugging — asking AI to fix errors over and over without trying to understand the root cause. These developers were the slowest and learned the least. The best pattern? Generation-Then-Comprehension — letting AI write code, then asking it to explain what it wrote and why. These developers scored almost as well as the no-AI control group.

Two additional findings that should concern anyone managing a team:

  1. The biggest skill gap was in debugging. AI users encountered fewer errors (median 1 vs 3 in control group), which means they got less practice with the most important supervision skill.
  2. Participants in the AI group self-reported feeling “lazy” and having “gaps in understanding.” They knew they were learning less. They just could not stop.

This research from inside Anthropic effectively confirms what the Lobsters community debated: AI assistance can either build or erode skills, depending entirely on the interaction pattern. The technology is neutral. The workflow is not.


What This Means in Practice

Willison’s argument is correct but incomplete. AI agents can improve code quality. But they will not do so by default, any more than a table saw produces good furniture by default.

Here is what actually makes the difference:

Quality gates, not quality hopes

The developers who report improved code quality with AI are not the ones who trust the output. They are the ones who built verification into their workflow. Code review agents that run after every generation pass. Test suites that execute before any merge. Linting that catches the patterns the LLM tends to get wrong.

In Claude Code, this looks like using /self-review before committing, running a dedicated code-reviewer agent on every significant change, and treating agent output with the same skepticism you would apply to a pull request from a new team member.

Tests first, implementation second

The single most effective pattern for getting quality code from agents is TDD: write the tests first, then let the agent implement to pass them. This works because tests are a specification. You are telling the agent what “correct” means before it starts generating code.

Without tests, the agent is guessing what you want. With tests, the agent is solving a constrained problem. The difference in output quality is dramatic.
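A minimal sketch of the pattern, using a hypothetical `slugify` task (the function and its spec are invented for illustration): the tests come first and define what "correct" means; the implementation below is what the agent would produce to satisfy them.

```python
import re
import unittest

def slugify(title: str) -> str:
    """Agent-written implementation, constrained by the tests below."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

# These tests were written first; they are the specification.
class TestSlugify(unittest.TestCase):
    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_strips_punctuation(self):
        self.assertEqual(slugify("AI: Better Code, Not More!"),
                         "ai-better-code-not-more")

    def test_trims_edge_hyphens(self):
        self.assertEqual(slugify("  --spaces--  "), "spaces")

# Run with: python -m unittest this_module
```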

Compound your patterns

Willison and Every are right about compound engineering, but it needs infrastructure to work. Document your successful agent interactions. Build prompt libraries. Create CLAUDE.md files that encode your team’s standards so every agent session starts with the right constraints. Use MCP servers to give agents access to your team’s accumulated knowledge.
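As a hedged sketch of what such a CLAUDE.md might contain (the specific rules below are invented for illustration; adapt them to your team's actual standards):

```markdown
# CLAUDE.md (illustrative example)

## Code standards
- Run the full test suite before proposing any commit.
- No new function without a corresponding unit test.
- Prefer existing utilities in `lib/` over adding dependencies.

## Agent patterns that have worked for this team
- Write failing tests first, then implement to pass them.
- For refactors, list every call site before changing a signature.
```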

The teams that get better code from agents are the ones that treat agent configuration as seriously as they treat their CI/CD pipeline.

Use agents for review, not just generation

The most underutilized pattern: using AI to read code, not just write it. Point an agent at a pull request and ask it to find problems. Point it at a module and ask what would break if you changed the interface. Point it at a test suite and ask what scenarios are missing.

Agents as reviewers have a different failure mode than agents as generators. When generating, a confident-sounding wrong answer is dangerous. When reviewing, a confident-sounding wrong suggestion is just noise — you were going to evaluate the feedback anyway.


The Bottom Line

Simon Willison is right that quality is a choice. The Lobsters community is right that incentive structures push against that choice. Both things are true simultaneously.

The developers who will produce better code with AI are the ones who consciously choose quality over volume, who build verification into their workflows, and who invest in compounding their agent patterns over time.

The developers who will produce worse code with AI are the ones who use it on autopilot, skip review, and optimize for lines shipped per hour.

The technology is the same. The outcomes are not. That is the uncomfortable truth that neither the optimists nor the pessimists want to fully accept.


The original article by Simon Willison is available on simonwillison.net. The Lobste.rs discussion had 66 points and 77 comments at the time of writing. The Anthropic research paper “How AI Impacts Skill Formation” by Shen & Tamkin (2026) is available on arXiv.