My tips for using LLM agents to create software (efitz-thoughts.blogspot.com)
148 points by efitz 16 hours ago | 56 comments

• kvnhn 3 hours ago

IMO, a key passage that's buried:

"You can ask the agent for advice on ways to improve your application, but be really careful; it loves to “improve” things, and is quick to suggest adding abstraction layers, etc. Every single idea it gives you will seem valid, and most of them will seem like things that you should really consider doing. RESIST THE URGE..."

A thousand times this. LLMs love to over-engineer things. I often wonder how much of this is attributable to the training data...

• brookst 2 hours ago

They’re not dissimilar to human devs, who also often feel the need to replatform, refactor, over-generalize, etc.

The key thing in both cases, human and AI, is to be super clear about goals. Don’t say “how can this be improved”, say “what can we do to improve maintainability without major architectural changes” or “what changes would be required to scale to 100x volume” or whatever.

Open-ended, poorly-defined asks are bad news in any planning/execution based project.

• exitb 13 minutes ago

There are, however, human developers who have built up enough general and project-specific expertise to answer these open-ended, poorly-defined requests. In fact, given how often that happens, maybe that’s at the core of what we’re being paid for.

• iguessthislldo 21 minutes ago

This is something I experienced first hand a few weeks ago when I first used Claude. I have a recursive-descent parser library I haven't touched in a few years that I want to continue developing but always procrastinate on. It has always been kinda slow, so I wanted to see if Claude could improve the speed. It made very reasonable suggestions, the main one being caching parsing rules based on the leading token kind. It produced code that looked fine and didn't break tests, but when I ran a simple timed-loop performance comparison, Claude's changes were slightly slower. Digging through the code, I discovered I was already caching rules in a similar way and had forgotten about it, so the slight performance loss came from doing the caching twice.

• manmal 11 hours ago

One weird trick is to tell the LLM to ask you questions about anything that’s unclear at this point. I tell it, e.g., to ask up to 10 questions. Often I do multiple rounds of this Q&A, and I’m always surprised at the quality of the questions (with Opus). I get better results that way, just because it reduces the degrees of freedom in which the agent can go off in a totally wrong direction.

• jackphilson 5 hours ago

I often like to just talk it out. Stream of thought. It gives the model the full context of your mental model. Talk through an excalidraw diagram.

• deadbabe 9 hours ago

This is a little anthropomorphic. The faster option is to tell it to give you the full content of an ideal context for what you’re doing and adjust or expand as necessary. Less back and forth.

• 7thpower 5 hours ago

It’s not though. One of the key gaps right now is that people do not provide enough direction on the tradeoffs they want to make. Generally LLMs will not ask you about them; they will just go off and build. But if you have them ask, they will often come back with important questions about things you did not specify.

• MrDunham 2 hours ago

This is the correct answer. I like to go one step further than the root comment:

Nearly all of my "agents" are required to ask at least three clarifying questions before they're allowed to do anything (code, write a PRD, write an email newsletter, etc)

Force it to ask them one at a time and it's even better, though the gain isn't as much of a step function as asking questions at all vs. letting it run off your initial ask.

I think the reason is exactly what you state, @7thpower: it takes a lot of thinking to really provide enough context and direction to an LLM, especially (in my opinion) because they're so cheap and carry no social-capital cost (unlike asking a colleague or employee, where having them work for a week just to throw away all their work is a very non-zero cost).
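
For what it's worth, a minimal sketch of how this rule can be phrased in an agent's instruction file; the exact wording here is hypothetical, the pattern is the one described above:

    Before you produce anything (code, a PRD, a newsletter draft),
    ask me at least three clarifying questions about goals,
    constraints, and tradeoffs. Ask them one at a time and wait
    for my answer before asking the next. Start work only after
    I confirm.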

• iaw an hour ago

My routine is:

Prompt 1: <define task> Do not write any code yet. Ask any questions you need for clarification now.

Prompt 2: <answer questions> Do not write any code yet. What additional questions do you have?

Reiterate until questions become unimportant.

• deadbabe 2 hours ago

They don’t know what to ask. They only assemble questions according to training data.

• fuzzzerd 2 hours ago

While true, the questions are all points where the LLM would otherwise have "assumed" an answer; by asking, you get to point it in the right direction instead.

• manmal 9 hours ago

Can you give me the full content of the ideal context of what you mean here?

• rzzzt 9 hours ago

Certainly!

• bbarnett 9 hours ago

Oh great.

LLM -> I've read 1000x Stack Overflow posts on this. The way coding works is: I produce sub-standard code and then show it to others on Stack Overflow! Others chime in with fixes!

You -> Get the LLM to simulate this process by asking it to post its broken code, then asking for "help" on "stackoverflow" (i.e., the questions it asks), and then pasting in the fix responses.

Hands down, you've discovered why LLM code is so junky all the time. Every time it's seen code on SO and other places, it's been "Here's my broken code", then Q&A, followed by final code. Statistically, symbolically, that's how (from an LLM's perspective) coding tends to work.

Because of course many code examples it's seen are derived from this process.

So just go through the simulated exchange, and success.

And the best part is, you get to go through this process every time, to get the final fixed code.

• elif 3 hours ago

Please keep comments in this thread non-fiction.

• manmal 9 hours ago

The questions it asks are usually domain-specific and pertain to the problem, like modeling, or "where do I get this data from, ideally?"

• bbarnett 8 hours ago

Not blaming you, it's actually genius. You're simulating what it's seen, and therefore getting the end result -- peer discussed and reviewed SO code.

• ModernMech 34 minutes ago

Although you're probably being voted down for tone, this is a very interesting point.

• theptip an hour ago

> Finally it occurred to me to put context where it was needed - directly in the test files.

Probably CLAUDE.md is a better place?

> Too much context

Claude’s sub-agents[1] seem to be a promising way of getting around this, though I haven’t had time to play with the feature much. E.g., when you need to take a context-busting action like debugging dependencies, spin up a new agent to read the output and summarize it. Then your top-level context doesn’t get polluted.

[1]: https://docs.anthropic.com/en/docs/claude-code/sub-agents
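
Per the linked docs, a sub-agent is just a Markdown file with YAML frontmatter under .claude/agents/. A minimal sketch; the specific agent and its wording here are hypothetical:

    ---
    name: dependency-debugger
    description: Investigates dependency and build failures, then reports back a short summary.
    tools: Bash, Read, Grep
    ---
    You debug dependency and build problems. Dig through whatever logs
    and lockfiles you need, but reply to the main agent with only a
    short summary: root cause and suggested fix. Keep the noisy output
    out of your final answer.

The main conversation only ever sees that final summary, which is the context-pollution win described above.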

• pmxi 10 hours ago

> If you are a heavy user, you should use pay-as-you go pricing

If you’re a heavy user you should pay for a monthly subscription to Claude Code, which is significantly cheaper than API costs.

• Terretta 2 hours ago

Define heavy... There's a band where the max subscription makes the most sense. The thread here talks about $1000/month; the plan beats that. But there's a larger band beyond that where you're back to using the API or buying credits.

A full day of Opus 4.1 or GPT-5 high reasoning doing pair programming or guided code review across multiple issues or PRs in parallel will burn through the Max monthly limits and then stop you, or cost $1500 in top-up credits for a 15-hour day. Wait, WTF, that's $300k/year! OK, while true, that misses that it's accomplishing 6-8 workstreams in parallel, all day, with no drop in efficacy.

At enterprise procurement rates, hiring a {{specific_tech}} expert can run $240/hr or $3500/day, and that expert is (a) less knowledgeable about the 3+ year old tech the enterprise is using, and (b) wants to advise instead of type.

So the question isn't what it costs; it's what's the cost of being blocked, and in turn blocking committers waiting for reviews? Similarly, what's the cost of a Max seat for a dev that doesn't believe in using it?

TL;DR: At the team level, for guided experts and disbelievers, the API likely ends up cheaper again.

• Nizoss 5 hours ago

Yeah, this is a no-brainer for certain use cases.

• ramesh31 10 hours ago

Am I alone in spending $1k+/month on tokens? It feels like the most useful dollars I've ever spent in my life. The software I've been able to build on a whim over the last 6 months is beyond my wildest dreams from a year or two ago.

• fainpul 10 hours ago

> The software I've been able to build on a whim over the last 6 months is beyond my wildest dreams from a year or two ago.

If you don't mind sharing, I'm really curious - what kind of things do you build and what is your skillset?

• OtherShrezzing 8 hours ago

I’m unclear how you’re hitting $1k/mo in personal usage. GitHub Copilot charges $0.04 per task with a frontier model in agent mode - and it’s considered expensive. That’s 850 coding tasks per day for $1k/mo, or around 1 per minute in a 16hr day.
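
Spelled out, that arithmetic is:

    $1,000 / $0.04  = 25,000 tasks per month
    25,000 / 30     ≈ 833 tasks per day
    833 / (16 × 60) ≈ 0.9 tasks per minute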

I’m not sure a single human could audit & review the output of $1k/mo in tokens from frontier models at the current market rate. I’m not sure they could even audit half that.

• Wowfunhappy an hour ago

You don't audit and review all $1k worth of tokens!

The AI might write ten versions. Versions 1-9 don't compile, but it automatically makes changes and gets further each time. Version 10 actually builds and seems to pass your test suite. That is the version you review!

—and you might not review the whole thing! 20 lines in, you realize the AI has taken a stupid approach that will obviously break, so you stop reading and tell the AI it messed up. This triggers another ~5 rounds of producing code before something compiles, which you can then review, hopefully in full this time if it did a good job.

• F7F7F7 6 hours ago

Audit and review? Sounds like a vibe killer.

• 7thpower 5 hours ago

Do people actually use GitHub Copilot?

At any rate, I could easily go through that much with Opus because it’s expensive and often I’m loading the context window to do discovery, this may include not only parts of a codebase but also large schemas along with samples of inputs and outputs.

When I’m done with that, I spend a bunch of turns defining exactly what I want.

Now that MCP tools work well, there is also a ton of back and forth that happens there (this is time efficient, not cost efficient). It all adds up.

I have Claude Code Max, which helps, but one of the reasons it's so cheap is all of the truncation it does, so I use a different tool that lets me feed in exactly the parts of a codebase I want, which can get incredibly expensive.

This is all before the expenses associated with testing and evals.

I’m currently consulting, a lot of the code is ultimately written by me, and everything gets validated by me (if the LLM tells me about how something works, I don’t just take its word for it, I go look myself), but a lot of the work for me happens before any code is actually written.

My ability (usually clarity of mind and patience) to review an LLM's output is still a gating factor, but the costs can add up quickly.

• adithyassekhar 3 hours ago

> Do people actually use GitHub Copilot?

I use it all the time. I'm not into Claude Code-style agentic coding; I'm more of the "change the relevant lines and let me review" type.

I work in web dev. With VS Code I can easily select a line of code that's wrong, which I know how to fix but am honestly too tired to type, press Ctrl+I, and tell it to fix it. I know the fix, so I can easily review it.

GPT-4.1 agent mode is unlimited in the Pro tier. It's half the cost of Claude, Gemini, and ChatGPT. The VS Code integration alone is worth it.

Now, that is not the kind of "AI does everything" coding these companies are marketing and want you to do; I treat it almost like an assistant. For me it's perfect.

• ModernMech 30 minutes ago

I trust Copilot way more than any agentic coder. First time I used Claude it went through my working codebase and tried to tell me it was broken in all these places it wasn't. It suggested all these wrong changes that if applied would have ruined my life. Given that first impression, it's going to take a lot to convince me agentic coding is a worthwhile tool. So I prefer Copilot because it's a much more conservative approach to adding AI to my workflow.

• zppln 10 hours ago

Care to show what you've built?

• kergonath 8 hours ago

> Am I alone in spending $1k+/month on tokens?

I would if there were any positive ROI on those $12k/year, or if it were a small enough fraction of my income. For me, neither is true, so I don’t :).

Like the siblings, I would be interested in your perspective on what kind of thing you do with so many tokens.

• mewpmewp2 8 hours ago

If you're freelancing and doing 2x as much as before in the same amount of time, it makes sense that you can make 2x as much. But honestly, on many projects I feel like I was able to scale my output far more than 2x. It's a different story, of course, if you only have a main job. But I have been doing a main job plus freelancing on the side forever now.

I do freelancing mostly for fun though, picking projects I like, not directly for the money, but this is where I definitely see multiples of difference on what you can charge.

• sothatsit 8 hours ago

You're not alone in using $1k+/month in tokens. But if you are spending that much, you should definitely be on something like Anthropic's Max plan instead of going full API, since it is a fraction of the cost.

• tovej 9 hours ago

I would personally never. Do I want to spend all my time reviewing AI code instead of writing? Not really. I also don't like having a worse mental model of the software.

What kind of software are you building that you couldn't before?

• athrowaway3z 10 hours ago

> One of the weird things I found out about agents is that they actually give up on fixing test failures and just disable tests. They’ll try once or twice and then give up.

It's important not to think in terms of generalities like this. How they approach it depends on your test framework, and even on the language you use. If disabling tests is easy and common in that language/framework, it's more likely to do it.

For testing a CLI, I currently use the run_tests.sh below, and not once has it tried to disable a test. Though that can become its own problem when it hits one it can't debug.

#!/usr/bin/env bash
# run_tests.sh - run every example script as a test.
# Usage: ./run_tests.sh [name.sh ...]  (defaults to all ./examples/*.sh)
# Set LOUD=1 to print each script name as it runs.

# Prefix each argument with ./examples/; with no arguments, run them all.
scripts=("${@/#/./examples/}")
[ $# -eq 0 ] && scripts=(./examples/*.sh)

for script in "${scripts[@]}"; do

    [ -n "$LOUD" ] && echo "$script"

    # Run traced (-x); on failure, dump the captured output and stop.
    output=$(bash -x "$script" 2>&1) || {
        echo ""
        echo "Error in $script:"
        echo "$output"
        exit 1
    }
done

echo " OK"

----

Another tip: for a specific task, don't bother with "please read file x.md"; Claude Code (and others) accept the @file syntax, which puts that file into context right away.
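
For example, instead of "please read x.md first", you might write (file names hypothetical):

    Using the conventions described in @x.md, add a --verbose flag to @run_tests.sh.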

• JeremyNT 3 hours ago

Some of these sample prompts in this blog post are extremely verbose:

> If you are considering leveraging any of the documentation or examples, you need to validate that the documentation or example actually matches what is currently in the code.

I have better luck being more concise and avoiding anthropomorphizing. Something like:

"validate documentation against existing code before implementation"

Should accomplish the same thing!

• brookst 2 hours ago

I have the best luck with RFC speak. "You MUST validate that the documentation matches existing code before implementation. You MAY update documentation to correct any mismatches."

But I also use a more casual style when investigating. "See what you think about the existing inheritance model, and propose any improvements that would make it easier to maintain. I was thinking that creating a new base class for tree and flower to inherit from might make sense, but maybe that's overcomplicating things."

(Expressing uncertainty seems to help avoid the model latching on to every idea with “you’re absolutely right!”)

• Wowfunhappy 3 hours ago

I've had both experiences. On some projects concise instructions seem to work better. Other times, the LLM seems to benefit from verbosity.

This is definitely a way in which working with LLMs is frustrating. I find them helpful, but I don't know that I'm getting "better" at using them. Every time I feel like I've discovered something, it seems to be situation specific.

• enraged_camel an hour ago

Not my experience at all. I find that the shorter my prompts, the more garbage the results. But if I give it a lot of detail and elaborate on my thought process, it performs very well, and often one-shots the solution.

• sothatsit 8 hours ago

> If you are a heavy user, you should use pay-as-you go pricing; TANSTAAFL.

This is very, very wrong. Anthropic's Max plan is like 10% of the cost of paying for tokens directly if you are a heavy user. And if you still hit the rate limits, Claude Code can roll over into paying for tokens through API credits. Although, I have never hit the rate limits since I upgraded to the $200/month plan.

• yifanl 8 hours ago

The blogpost is transparently an advertisement, which is ironic considering the author's last blogpost was https://blog.efitz.net/blog/modern-advertising-is-litter/

• alex-moon 9 hours ago

As a human dev, can I humbly ask you to separate your LLM "readme" from your human README.md? If I see a README.md in a directory, I assume the directory is a separate module that could be split out into a separate repo or indeed stored elsewhere. If you're putting copy in your codebase that's instructions for a bot, it isn't a README.md. By all means come up with a new convention, e.g. BOTS.md, for this. As a human dev I know I can safely ignore such a file unless I am working with a bot.

• kergonath 9 hours ago

I think things are moving towards using AGENTS.md files: https://agents.md/ . I’d like something like this to become the consensus for most commonly used tools at some point.

There was a discussion here 3 days ago: https://news.ycombinator.com/item?id=44957443 .

• mattmanser 8 hours ago

While I agree we should keep READMEs for humans, "readme" literally means "read me".

Not "this is a separate project". Not "project documentation file".

You can have readmes dotted all over a project if that's necessary.

It's simply a file that a previous developer is asking you to read before you start mucking around in that directory.

• CuriouslyC 12 hours ago

If I paid for my API usage directly instead of the plan it'd be like a second mortgage.

• 3abiton 10 hours ago

To be fair, allocating some tokens for planning (recursively) helps a lot. It requires more hands-on work but produces much better results. Clarifying the tasks and breaking them down is very helpful too; you just end up spending lots of time on it. On the bright side, Qwen3 30B is quite decent, and best of all, "free".

• Lucasoato 9 hours ago

I’ve had great success using both Codex with GPT-5 and Claude Code with Opus. You develop a solution with one, then validate it with the other. I’ve fixed many bugs by passing the context between them, saying something like: “my other colleague suggested that…”. Bonus tip: I’ve started using symlinks so my CLAUDE.md files point at AGENTS.md; now I don’t even have to maintain two different context files.

• dizhn 5 hours ago

Other symlinks one can do: QWEN.md, GEMINI.md, CONVENTIONS.md (for aider).
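
Concretely, the setup from these two comments is just a few symlinks at the repo root:

    ln -s AGENTS.md CLAUDE.md        # Claude Code
    ln -s AGENTS.md QWEN.md          # Qwen
    ln -s AGENTS.md GEMINI.md        # Gemini
    ln -s AGENTS.md CONVENTIONS.md   # aider

Each tool reads the filename it expects, but there's only one context file to maintain.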

• xwowsersx 13 hours ago

This lines up with my own experience of learning how to succeed with LLMs. What really makes them work isn't so different from what leads to success in any setting: being careful up front, measuring twice and cutting once.

• efitz 16 hours ago

I spent much of the last several months using LLM agents to create software. I've written two blog posts about my experience; this is the second post that includes all the things I've learned along the way to get better results, or at least waste less money.

• afeezaziz 13 hours ago

You should write more about your experience using LLMs. Was this built solely with LLMs?

• bgwalter 4 hours ago

His profile says: "I'm a technology geek and do information security for a living."

The blog post starts with: "I’m not a professional developer, just a hobbyist with aspirations."

Is this a vibe blog promoting Misanthropic Claude Vibe? It is hard to tell, since all "AI" promotion blogs are unstructured and chaotic.

• chrisweekly 2 hours ago

Hmm, to my eye those descriptors (profile and blog post intro) aren't contradictory.