Agents of productivity and chaos

May, 2026 ∙ 11 minute read

It’s been over a year since I published Creative Flow vs. Critical Review, because much of my writing on AI has been internal. I’m hoping to publish a bit more here and wanted to start by sharing a little bit of my agent setup (as of May 2026). A version of this was originally published on an internal discussion board titled “Agents of productivity and chaos - multi-repo, multi-agent learnings and workflows”. For context, I’m mostly working in the GitHub Copilot App and the GitHub Copilot CLI, across a few dozen repos. The thing I wanted to share is that I’ve got a couple of skills that have been doing a lot of heavy lifting lately: one for planning multi-agent projects and another for delegating work. They show up a few times below; if you don’t read any further, check them out!

Agents multiply effort, but if you’re going in the wrong direction they don’t help at all; they make it worse! Nobody wants to go faster in the wrong direction. Agents are also prone to making messes. They tend towards more code and more chaos. Entropy and all that. If you want to code with agents, one of your primary tasks is to counter both of those behaviors. You need to:

Understand what you want to build (direction) and why and how it intersects with reality. Understand this deeply.
Leverage the tools for what they excel at and don’t fight the quirks of their particular brands of “intelligence”. Know when to lean in and when to push back.
Go back to your computer science and engineering fundamentals and then focus on thinking at a systems level. You’re not a worker on the factory floor of Toyota, you’re designing the Lean manufacturing process itself.

The key is setting direction, steering, validating, and iterating on your processes. You’ve got to a) set the agents up for success, b) validate their outputs, and c) feed back what works and what doesn’t. I’m NOT perfect at this (see the failure modes section below), but I have learned a lot in the last few months. Here’s how I’m thinking about all this right now:

Personalize — Start with your own agent harness setup. Dial in an AGENTS.md and/or copilot-instructions.md with your taste, values, preferences. Next, curate a small set of user-level skills and wire up a few critical MCPs. Be careful to keep things concise and focused: don’t spend your entire context window budget here. Personalization is only the first layer of context design. This is also a good place to put things that are unique to how you like to work. For example, I have this fun little signature appended to all comments that Copilot leaves on my behalf:

Generated via Copilot (Claude Opus 4.7) on behalf of @tclem

It’s enforced by a rule in my global Copilot instructions, the model name swaps in automatically, and the linked attribution post explains what this is.
Provide great context — Next you want to make sure each project you work on also has agent building blocks: project-specific skills, design documentation, ADRs. You want this to be indexed by your search engine of choice, and you want to ensure that the agents reach for the right skills and tools without getting overwhelmed and polluting their context. This is not a totally solved problem, but tool search and meta skills (e.g. a choosing-skill skill that helps the agent pick from the project’s skill library) can help. Before you start any work, make sure you’ve either directly supplied high quality, relevant context or have a system in place for agents to find that context themselves.
Plan — The tools encourage this step (see plan mode) but I take it even further. When you plan, don’t blindly accept the plan: ask for the opinion of a different model, read the plan, ask questions, try to poke holes in it. Ask the agent to debate with another agent. Eventually, you’ll start to converge on a plan that’s actually reasonable. This is your chance to steer with fundamental principles. Use your design and engineering expertise. Think like an architect. This is where you’re picking an initial direction. Don’t run fast until you feel relatively confident it’s the right direction. I’ve got a fun skill for planning multi-agent projects.
Implement — This part’s fun. As part of planning you can ask the agent to evaluate which parts of the work can happen in parallel and which parts need to be sequential. Regardless, the agents are pretty good here now — check out this delegating work skill for ways to make them even better. I always advise having another agent review the primary agent’s code, then fixing all CI issues and addressing all code review comments. Only at this point is it worth getting a human involved for code review. I’m still manually reviewing a lot of code written by my agents. Look for patterns; this is how you’re going to tune your skills and your processes. Push back on design decisions. Open the code in your editor. Read the diff. Force the agent to solve the root cause and make fixes that will last and scale. Don’t let it get away with band-aids or workarounds. Just this week, I caught the agents happily disabling parts of CI to make the builds pass instead of fixing the actual failures — teaching our project’s merge skill to refuse those workarounds was a small but important fix.
Verify — You do still need to verify the work for most things. Reality is unforgiving. Until you run the code, you don’t know if it actually works — as Knuth put it, “beware of bugs in the above code; I have only proved it correct, not tried it.” This is another area that needs more investment from the industry at large: How do you scale verification? How do you take the tedium out of manual testing? How do you know that the agent has done what you asked it to do? And how do you know that what you asked it to do actually solves whatever root problem or job-to-be-done you had in mind? Be careful of pushing the burden of this to your end users — we’re all going to get really sick of being beta testers for vibe coded apps.
Feed back — Chances are, all sorts of things are going to go wrong in this process. Observe, take notes, let themes emerge, and don’t be afraid of doing manual stuff until you actually understand the patterns and can articulate a good abstraction. Then: feed that back. Update and tune skills, add new ones, delete broken ones. Make some things deterministic: scripts, programs, etc. (see Examples below for a couple of real instances of this). Because the models themselves are also improving, you want to have a process that flexes and absorbs those changes instead of fighting them. One trick is to interrogate the models themselves. Ask, “why did/didn’t you use X skill?”, “what’s in the context that made you decide Y?”.
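As an illustration of making something deterministic, here is a minimal sketch of a check that flags added diff lines that look like CI being switched off rather than fixed. It is not the merge skill mentioned above (that is a skill, not a program); the function name find_ci_workarounds and the pattern list are assumptions made up for this example, and a real guard would need patterns tuned to the project.

```rust
/// Substrings that often signal "make CI pass by turning it off".
/// Illustrative assumption only: not the rules of any real skill.
const SUSPICIOUS_PATTERNS: &[&str] = &[
    "continue-on-error: true", // GitHub Actions: a failing step no longer fails the job
    "if: false",               // GitHub Actions: the job or step never runs
    "#[ignore]",               // Rust: the test is skipped by default
];

/// Scans a unified diff and returns (1-based line number within the diff,
/// line text) for every added line that contains a suspicious pattern.
pub fn find_ci_workarounds(diff: &str) -> Vec<(usize, &str)> {
    diff.lines()
        .enumerate()
        // Only added lines matter; "+++" is the file header, not content.
        .filter(|(_, line)| line.starts_with('+') && !line.starts_with("+++"))
        .filter(|(_, line)| SUSPICIOUS_PATTERNS.iter().any(|p| line.contains(*p)))
        .map(|(idx, line)| (idx + 1, line))
        .collect()
}
```

A non-empty result could fail a CI job or simply be handed back to the agent as a reason to stop and fix the underlying failure. Substring matching is deliberately crude; it trades false positives for being cheap and predictable.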

Failure modes

Not everything is golden in the world of agentic coding. Just a few of the things I’m struggling with right now for discussion:

Focus — With agents, it is so easy to spin up sessions that you end up with fractured attention. Too many things going on. Lots of context switching. It is easy to make mistakes, misread or forget something, etc. If you work with a team of people who are also each running their own agents, this problem multiplies.
Code review — Good code review is already hard. Now, how do you do good code review for hundreds of thousands of lines of code? What if you have 5 teammates who are also producing a huge volume of agentic code?
Verification — Manual validation is slow and tedious and automated validation is a hard problem, but not doing this just multiplies work: fixes on fixes on fixes.
Good foundations — It’s hard to get good foundations in place because everyone is moving fast, but for some reason we keep skipping known computer science fundamentals and then we pay the price later in quality issues, regressions, and tech debt. There has to be a way to move fast on solid fundamentals.

In some ways, our code and coding process have always had these problems. GitHub itself was coded by thousands of engineers over almost two decades: it is the very definition of legacy code. Agents mean you get legacy code like that in a few weeks, days, hours (?). There’s no silver bullet, but I expect there are better abstractions yet to be discovered.

Examples

A few real examples of this process in action…

Decomposing a 1,800-line function

A vibe-coded Rust service I work on had a match in handle_client_message that had grown to ~1,800 lines and 228 message-type arms — the next likely candidate in a string of tokio worker stack-overflow incidents we’d been dealing with. I turned on clippy::large_futures at 16 KiB during diagnosis; the function blew past it immediately.

The fix was relatively mechanical, but this code base moves forward so quickly that it was important to sequence out a small series of patches.

Introduce the futures size lint as a hard backstop so that we get CI failures, not application crashes.
Land some repo-level skill changes so that everyone else’s agents know how to avoid the bad pattern going forward.
Refactor the large match in a series of 7 sequential PRs designed to minimize merge conflicts, immediately reduce the stack sizes in this code path, and be authored largely in parallel by agents.

handle_client_message went from 16,296 bytes to 664 bytes, and along the way I refined the planning-multi-agent-projects and delegating-plan-work skills (the same skills that helped plan and execute the work to begin with).
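To make the mechanics concrete, here is a minimal sketch of why a large async match produces a large future and how splitting it up can shrink it. This is not the service’s code: ClientMessage, handle_upload, the 8 KiB scratch buffer, and the choice to box the heavy arm with Box::pin are all assumptions for illustration, and the real refactor may have used a different technique. It relies only on the standard library and needs the 2018 edition or later for async fn.

```rust
use std::mem::size_of_val;

// Hypothetical message type standing in for the real 228-arm enum.
pub enum ClientMessage {
    Ping,
    Upload(Vec<u8>),
}

async fn handle_ping() -> usize {
    std::future::ready(()).await;
    0
}

async fn handle_upload(data: Vec<u8>) -> usize {
    // A large local that is used after an await point must be stored
    // inside the future, so this future is at least 8 KiB.
    let mut scratch = [0u8; 8192];
    let n = data.len().min(scratch.len());
    scratch[..n].copy_from_slice(&data[..n]);
    std::future::ready(()).await;
    scratch[..n].iter().map(|b| *b as usize).sum()
}

/// Awaiting the handler directly embeds its state in the caller's future.
pub async fn dispatch_inline(msg: ClientMessage) -> usize {
    match msg {
        ClientMessage::Ping => handle_ping().await,
        ClientMessage::Upload(data) => handle_upload(data).await,
    }
}

/// Boxing the heavy arm moves its state to the heap; the caller's future
/// only stores a pointer to it.
pub async fn dispatch_boxed(msg: ClientMessage) -> usize {
    match msg {
        ClientMessage::Ping => handle_ping().await,
        ClientMessage::Upload(data) => Box::pin(handle_upload(data)).await,
    }
}

/// Returns (inline future size, boxed future size) in bytes.
/// Futures are lazy, so nothing runs here; only the types are measured.
pub fn future_sizes() -> (usize, usize) {
    let inline = dispatch_inline(ClientMessage::Upload(vec![1, 2, 3]));
    let boxed = dispatch_boxed(ClientMessage::Upload(vec![1, 2, 3]));
    (size_of_val(&inline), size_of_val(&boxed))
}
```

Calling future_sizes() shows the inline dispatcher carrying the full scratch buffer while the boxed one stays small. The trade-off is one heap allocation per boxed call, which is usually a reasonable price for keeping worker stacks shallow.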

Running Copilot CLI inside GitHub Actions

The GitHub Copilot App vendors the Rust copilot-sdk and the vendored copy needs to track upstream multiple times a day. We do this with a mix of deterministic and agentic Actions workflows.

A standard Actions workflow just syncs any upstream changes into our vendored copy (obeying some rules about additional code we have that we haven’t upstreamed yet).
That job pushes and opens a PR, assigns some humans as reviewers, and Copilot code review (CCR) starts reviewing as well.
If everything is green and there are no review comments: we’re done.
If anything fails, another Actions workflow running the Copilot CLI kicks off and does the work to fix any breaking changes. It addresses CI failures, review comments, etc., and pushes to the same PR when it’s done.
Everything still requires a human review and approval before merging.
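The steps above boil down to a small decision rule. Here is a minimal sketch of that rule as plain Rust, purely to make the branching explicit. The types SyncPr, CiStatus, and NextStep are hypothetical names for this example, the Pending state is an assumption, and nothing here calls the GitHub API or the Copilot CLI.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CiStatus {
    Pending, // assumption: checks still running
    Green,
    Failed,
}

/// Hypothetical snapshot of the sync PR's state.
pub struct SyncPr {
    pub ci: CiStatus,
    pub open_review_comments: usize,
    pub human_approved: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    Wait,               // nothing to decide until CI finishes
    RunAgentFix,        // hand the PR to the agentic workflow
    AwaitHumanApproval, // green and clean, but not yet approved
    ReadyToMerge,
}

/// Decides what should happen next for a sync PR.
/// The agent only runs when CI failed or review comments are outstanding,
/// and a human approval is always required before merging.
pub fn next_step(pr: &SyncPr) -> NextStep {
    match pr.ci {
        CiStatus::Pending => NextStep::Wait,
        CiStatus::Failed => NextStep::RunAgentFix,
        CiStatus::Green if pr.open_review_comments > 0 => NextStep::RunAgentFix,
        CiStatus::Green if pr.human_approved => NextStep::ReadyToMerge,
        CiStatus::Green => NextStep::AwaitHumanApproval,
    }
}
```

Keeping the gate deterministic and leaving only the fixing to the agent means the expensive, less predictable step runs only when there is something to fix.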

A complexity-first rewrite

The vibe-coded diff viewing feature in the GitHub Copilot App had performance costs that grew with the size of the underlying diffs. To fix this, we didn’t need a faster diff algorithm or more caching (there were already too many layers of unnecessary caching from overeager agents trying to make things better). We needed some computer science fundamentals around complexity and the use of appropriate data structures.

We’re still untangling this, but so far the process has been: a multi-agent plan was written from the output of /research as a series of markdown documents that laid out a phased approach for refactoring the full-stack feature. We wanted the complexity contract to be O(V) where V is the size of the viewport (the visible diff), not the size of the entire patch. In addition, we developed a skill that would hold agents to that contract and would pair with some deterministic tests to verify behavior. The skill and associated design documentation now live in the repo as markdown files for humans and agents to reference. Scrolling large diffs is much improved, but we’re still removing some of the misguided caching and unnecessary code complexity, moving carefully to not break end users.
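To show what an O(V) contract paired with a deterministic test can look like, here is a minimal sketch, assuming the diff is already available as an indexable slice of lines. It is not the app’s implementation: Viewport, Rendered, render_viewport, and the lines_touched counter are hypothetical names introduced for this example.

```rust
/// Hypothetical viewport: the first visible line (0-based) and how many
/// lines fit on screen.
pub struct Viewport {
    pub first_line: usize,
    pub height: usize,
}

/// Rendered rows plus a counter a test can use to check the contract.
pub struct Rendered {
    pub rows: Vec<String>,
    pub lines_touched: usize,
}

/// Renders only the visible window of a diff.
/// Work is proportional to the viewport height, not to `lines.len()`,
/// because the slice is indexed directly instead of scanned from the top.
pub fn render_viewport(lines: &[String], vp: &Viewport) -> Rendered {
    let start = vp.first_line.min(lines.len());
    let end = start.saturating_add(vp.height).min(lines.len());
    let mut lines_touched = 0;
    let rows: Vec<String> = lines[start..end]
        .iter()
        .enumerate()
        .map(|(offset, text)| {
            lines_touched += 1;
            // Prefix each row with its 1-based line number.
            format!("{:>6} {}", start + offset + 1, text)
        })
        .collect();
    Rendered { rows, lines_touched }
}
```

A deterministic test can then render the same viewport over a 100-line patch and a 100,000-line patch and assert that lines_touched is identical in both cases. Counting work instead of timing it keeps the test stable across machines.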

This post was written with the help of AI (Claude Opus 4.7). The vast majority of the text was hand written, but I had an agent copy edit, fill in links, and resolve placeholders (e.g. TODO: grab this stat from datadog). See my ai attribution page for more.

Building GitHub since 2011, programming language connoisseur, Marin resident, aspiring surfer, father of two, life partner to @ktkates—all words by me, Tim Clem.