Taking advantage of LLMs

Where to focus investments to best leverage AI tooling

How do we scale LLMs to larger codebases? First, we must understand how LLMs contribute to engineering. That understanding points us toward two investments: guidance and oversight.

  • Guidance: The context, the environment.
  • Oversight: The skill set needed to validate and verify the implementor's1 choices.

Investing in guidance

When an LLM can generate a working, high-quality implementation in a single try, that is called one-shotting. This is the most efficient form of LLM programming.

[Image: An arrow hitting the center of a dart board.]

The opposite of one-shotting is rework. This is when you fail to get a usable output from the LLM and must manually intervene.2 This often takes longer than just doing the work yourself.

[Image: Multiple arrows on a dart board, all of which have missed the center.]

So how do we create more opportunities for one-shotting? Better guidance.

Better guidance

LLMs are choice generators. Every set of tokens is a choice added to your codebase: how a variable is named, where to organize a function, whether to reuse, extend, or duplicate functionality to solve a problem, whether Postgres should be chosen over Redis, and so on.

Often, these choices are best left up to the designer (e.g., via the prompt). However, it's not efficient to exhaustively list all of these choices in a prompt. It's also not efficient to rework an LLM output whenever it gets these choices wrong.

In an ideal world, the prompt captures only the business requirements of a feature. The rest of the choices are either inferable or encoded.

Write a prompt library

A prompt library is a set of documentation that can be included as context for an LLM.

Writing this is simple: collate documentation, best practices, a general map of the codebase, and any other context an engineer needs to be productive.3

Making a prompt library useful requires iteration. Every time the LLM is slightly off target, ask yourself, "What could've been clarified?" Then, add that answer back into the prompt library.

A prompt library needs to strike the right balance between comprehensive and lean.
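
One way to keep it lean is to make the library a directory of plain files and assemble only the docs relevant to the task at hand. A minimal sketch, assuming a hypothetical docs/prompts directory of markdown files:

```python
from pathlib import Path

# Hypothetical layout: docs/prompts/ holds the prompt library as
# plain markdown files (conventions.md, codebase-map.md, testing.md, ...).
PROMPT_LIBRARY = Path("docs/prompts")

def build_context(topics: list[str]) -> str:
    """Concatenate the requested prompt-library docs into a single
    string to prepend to an LLM prompt."""
    sections = []
    for topic in topics:
        doc = PROMPT_LIBRARY / f"{topic}.md"
        if doc.exists():
            sections.append(f"## {topic}\n\n{doc.read_text()}")
    return "\n\n".join(sections)

# Include only the docs relevant to this task: comprehensive
# where it matters, lean everywhere else.
context = build_context(["conventions", "codebase-map", "testing"])
```

Because the library is just files, engineers can curate it in code review like anything else, and the iteration loop above becomes a normal documentation workflow.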

The environment is your context

A peer at Meta told me that they weren't in a position to make Zuckerberg's engineering-automation claims a reality. The reason: their codebase is riddled with technical debt. He wasn't surprised by this; Meta has (apparently) historically not prioritized paying down its debts.

Compare this to the mentality from the Cursor team:

I think ultimately the principles of clean software are not that different when you want it to be read by people and by models. When you are trying to write clean code you want to not repeat yourself, not make things more complicated than they need to be.

I think taste in code... is actually gonna become even more important as these models get better because it will be easier to write more and more code and so it'll be more and more important to structure it in a tasteful way.

This is the garbage in, garbage out principle in action. The utility of a model is bottlenecked by its inputs. The more garbage you have, the more likely the model is to hallucinate.

Here's an LLM-literacy dipstick: ask a peer engineer to read some code they're unfamiliar with. Do they understand it? Do they struggle to navigate it? If it's a module, can they quickly see what the module exposes? Do they know the implications of using a certain function, the side effects they must be aware of? No? Then the LLM won't either.

Here's another dipstick: ask an LLM agent to tell you how certain functionality works. You should know the answer before asking. Is its answer right? More importantly, how did it go about answering your question? Follow the LLM's trail and document its snags. You'll notice it tends to grep, ls, and cat to search. How can you give it a map so it isn't left to rediscover the codebase on each new prompt? And when a map can't be given, how do you make the codebase easier for it to navigate?
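
One way to build such a map, assuming a Python codebase under a hypothetical src/ directory, is to list every module alongside the first line of its docstring:

```python
import ast
from pathlib import Path

def codebase_map(root: str) -> str:
    """Emit one line per module: its path plus the first line of its
    docstring. Checked into the prompt library, this gives the agent
    a table of contents instead of a blank slate."""
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        doc = ast.get_docstring(tree) or "(no module docstring)"
        lines.append(f"{path}: {doc.splitlines()[0]}")
    return "\n".join(lines)

print(codebase_map("src"))
```

A nice side effect: the map is only as good as your docstrings, so generating it regularly surfaces exactly the modules a newcomer (human or model) would struggle with.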

How you make the environment better suited for LLM literacy depends on the tech stack and domain. But general principles apply: modularity, simplicity, good naming, encapsulated logic. Be consistent, and encode these conventions in your prompt library.

Investing in oversight

We need guidance and oversight. A 3-ton truck with a middle-schooler behind the wheel puts people in the hospital (and in jail). This is why the mentality of automating away engineers is objectionable. We should be fostering our teams, not discarding them.

Remember, engineers operate on two timelines. As overseers of implementation, we must plan for the future of the codebase. If an LLM makes a choice, the overseer should be able to discern whether it was a good one or a bad one. For example, let's say the LLM opted to use Redis over Postgres to store some metadata. Was that a good choice? The overseer should know.

An investment in oversight is an investment in team, alignment, and workflows.

On the team front, it's worth investing in elevating everyone's design capabilities.

Design produces architecture. Architecture is a bet on the future. It's a bet that by setting up a program in a certain way, future feature development will be easier.

Architects are often created through experience. A career of shooting yourself in the foot builds intuition, and that intuition keeps new software from repeating the same mistakes.

Oversight is not only about architecture, but also about temperament, alignment to values, and workflows. Overseers need to be both technical and product experts. Without a deep understanding of the product, it's easy to accidentally build the wrong solution.

Automating oversight

Some design concerns can be checked programmatically.

Moving more implementation feedback from human to computer helps us improve the chance of one-shotting. Agents can get feedback directly from their environment (e.g., type errors).

Think of these as bumper rails. You can increase the likelihood of an LLM reaching the bowling pins by making it impossible to land in the gutter.
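
As a sketch, a feedback harness might run the project's checks and hand any diagnostics straight back to the agent. The specific tools here (mypy, ruff, pytest) are assumptions; substitute whatever your stack uses:

```python
import subprocess

# Each command is a bumper rail: type errors, lint violations, and
# failing tests all count as landing in the gutter.
CHECKS = [
    ["mypy", "src"],
    ["ruff", "check", "src"],
    ["pytest", "-q"],
]

def run_checks() -> str | None:
    """Run every check; return combined diagnostics, or None if all pass."""
    failures = []
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return "\n\n".join(failures) or None

if (feedback := run_checks()) is not None:
    # Hand the diagnostics back to the agent as its next prompt,
    # instead of waiting for a human to spot the same problems.
    print(f"Fix the following issues before resubmitting:\n\n{feedback}")
```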

One way to do this is by writing safety checks. But what is safety? Safety is protecting your abstractions. Pierce's Types and Programming Languages contains my favorite definition of safety:

Informally, though, safe languages can be defined as ones that make it impossible to shoot yourself in the foot while programming.

Refining this intuition a little, we could say that a safe language is one that protects its own abstractions.

Safety refers to the language's ability to guarantee the integrity of these abstractions and of higher-level abstractions introduced by the programmer using the definitional facilities of the language. For example, a language may provide arrays, with access and update operations, as an abstraction of the underlying memory. A programmer using this language then expects that an array can be changed only by using the update operation on it explicitly—and not, for example, by writing past the end of some other data structure.
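
In application code, the same principle shows up as encapsulation: an invariant survives only if every mutation goes through the sanctioned operation. A small Python illustration (where protection is by convention, guarded by linters and review rather than the runtime):

```python
class Account:
    """An abstraction over a balance. The invariant (never negative)
    holds only if callers change the balance through withdraw()."""

    def __init__(self, balance: int) -> None:
        self._balance = balance  # underscore: internal, not part of the abstraction

    @property
    def balance(self) -> int:
        return self._balance

    def withdraw(self, amount: int) -> None:
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

acct = Account(100)
acct.withdraw(30)      # the sanctioned update operation
# acct._balance = -1   # bypasses the abstraction; the kind of write
#                      # that safety checks should make impossible
```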

We tend to write tests for business logic but don't always write tests for architecture logic. Some programming languages have facilities for this built in.
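
Where the language doesn't, you can write them yourself. Here's a sketch of an architecture test, assuming a hypothetical layering of src/domain and src/infrastructure: it fails the build whenever domain code imports from the infrastructure layer. (In the Python ecosystem, tools like import-linter package up the same idea.)

```python
import ast
from pathlib import Path

def test_domain_does_not_import_infrastructure():
    """Architecture-logic test: domain code must stay storage-agnostic,
    so it may never import from the infrastructure layer."""
    for path in Path("src/domain").rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            modules = []
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules = [node.module]
            assert not any(m.startswith("infrastructure") for m in modules), (
                f"{path} imports infrastructure code"
            )
```

Run under pytest in CI, a test like this gives an LLM that wires a Redis client into the domain layer immediate, mechanical pushback.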


That's it, for now

This was the third part of a series on LLMs in software engineering.

First, we learned what LLMs and genetics have in common (part 1). LLMs don't simply improve all facets of engineering; understanding which areas they do improve (part 2) is important for knowing how to focus our investments (part 3).


  1. Or, in today's age, the generator's

  2. Not being able to one-shot prevents adoption by many programmers. Programmers are disposed to seeing a worse solution and wanting to build their own. Oh, I can either pay $10/mo for a subscription to this SaaS tool, or I can build my own? I choose to build my own, of course! (I am guilty of this.) I think this mentality partially explains the disparity between LLM skeptics and advocates.

  3. Technical strategy is another form of context you can include in a prompt library. Though you do risk bloating the context with words that aren't directly applicable.

  4. This structure is motivated by this post and this video. If you are in the Django ecosystem, I recommend reviewing those.