AI-Assisted Code Generation Workflow
Here’s how I get high-quality output from LLM assistants
I have a specific process for generating code with AI that has allowed me to double my output while maintaining quality standards. Although my job provides a lot of infrastructure that enables this process, the workflow can be transferred to independent development today using open-source or easily available tools. For example, an independent developer can set up continuous integration (CI) with automated linting and test running. Here’s what I currently see as a high-quality process for developing code with AI.
Generation
Start with a short definition of the problem to be solved.
Use a custom, permanent prompt that outlines the code standards you’d like the model to follow. In Windsurf, this is a workflow; in Claude Code, it goes in your CLAUDE.md file; other tools have equivalents. In Python, I specify things like following local conventions, keeping all imports at the top of the file, and avoiding comment clutter. I insist on unit tests, but neither too many nor too few. No tests, or only happy-path tests, are dangerous: I need to know the code actually works, guard against regressions, and document expected behavior. But I’ve also seen large language models (LLMs) write redundant tests, tests without purpose, or tests that don’t actually exercise the relevant code. So I demand no duplication in unit tests, and I make sure each unit test exercises a real use case.
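To make that testing standard concrete, here is a minimal sketch of the kind of unit test I accept; the module and function names are hypothetical. Each test exercises one real use case, and neither duplicates the other.
    import pytest

    from myshop.pricing import apply_discount  # hypothetical module under test

    def test_discount_reduces_order_total():
        # Real use case: a customer with a 10% coupon pays 90 on a 100 order.
        assert apply_discount(total=100.0, percent=10) == pytest.approx(90.0)

    def test_discount_rejects_negative_percent():
        # Guards against a regression where invalid input silently slips through.
        with pytest.raises(ValueError):
            apply_discount(total=100.0, percent=-5)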
I don’t really vibe code (have the AI generate code without looking at the output). Instead, I monitor the code as the LLM produces it and keep going back and forth with it until I’m satisfied with what I see. Tests have to pass locally, the model needs to fix all linter errors, and I have to be satisfied that once I push the code, it will pass CI.
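The local checks are the same gate CI will run. Here is a minimal sketch of what that gate could look like for an independent developer; it assumes ruff as the linter and pytest as the test runner, and those tool choices are illustrative rather than prescriptive.
    #!/usr/bin/env python3
    """Run the same lint-and-test gate locally before pushing and in CI."""
    import subprocess
    import sys

    CHECKS = [
        ["ruff", "check", "."],            # lint the whole repository
        ["python", "-m", "pytest", "-q"],  # run the unit test suite
    ]

    def main() -> int:
        for cmd in CHECKS:
            print("$", " ".join(cmd))
            result = subprocess.run(cmd)
            if result.returncode != 0:
                # Fail fast so the first broken check is what gets reported.
                return result.returncode
        return 0

    if __name__ == "__main__":
        sys.exit(main())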
Integration
In my process, I don’t allow the LLM to push directly to the shared code repo, because I don’t trust it to do the right thing with this complex operation. Instead, I handle all the source control (git) operations myself. I open the pull request (PR) as a draft so I can review it before asking human colleagues to review it. This ensures that I waste as little of their time as possible and get the most out of the easily accessible automated tools first.
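The draft step itself is just the GitHub CLI, and I invoke it myself rather than letting the LLM do it. Here is a minimal sketch, assuming gh is installed and the branch has already been pushed; the title and body are placeholders.
    import subprocess

    def open_draft_pr(title: str, body: str) -> None:
        # A draft PR lets me review my own changes before requesting human reviewers.
        subprocess.run(
            ["gh", "pr", "create", "--draft", "--title", title, "--body", body],
            check=True,
        )

    if __name__ == "__main__":
        open_draft_pr(
            "Add retry policy to the sync job",  # placeholder title
            "Generated with LLM assistance; self-reviewed before requesting review.",
        )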
There might be CI issues that were missed during local linting and testing, because I haven’t run the entire test suite locally. My changes could have broken tests or code I didn’t touch directly. I make sure all of those failures get fixed.
Review
I review the PR myself. This lets me pick up anything I missed while the code was being generated and gives me the critical perspective of a reviewer rather than a pair programmer.
I have another LLM instance review the PR. I use a permanent code review prompt to ensure the LLM’s review is thorough but not overly critical; I’ve seen both failure modes in LLM reviews, which is why human reviews remain essential. Because LLM reviewers tend to be overly meticulous, I tell them to ignore nits and focus only on issues of minor severity or higher. For any issue the LLM reviewer finds, I ask it to write a test to prove to me that it’s actually a problem, because I’ve seen LLM reviewers hallucinate problems that weren’t there. It might insist that initialization defaults are missing when they already exist, or claim that the else branch of a conditional is missing when it’s actually present. Sometimes it will also erroneously inflate a minor issue into a blocker.
Once the LLM code review is generated, I give it to the original code-writing LLM and have it implement the recommended changes “if they are correct.” I tell it that if it believes an issue is not real, it should explain why with specific reasoning. I then evaluate that defense myself before allowing the LLM to skip the issue. In my process, the LLM code reviewer does not post directly on the shared code repo; it merely communicates the review to the generator LLM. I iterate back and forth between the reviewer LLM and the generator LLM until all problems are resolved. In the future, I would like to have multiple LLM code reviewers each generate a review independently, then confer on all the reviews and reach consensus on which issues are real problems and how severe they are.
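For example, if the reviewer LLM claims an initialization default is missing, I ask for a proof along these lines (the class and module are hypothetical). If the test passes, the claim was a hallucination and is dropped; if it fails, the issue goes back to the generator LLM to fix.
    from myservice.retry import RetryPolicy  # hypothetical class the reviewer flagged

    def test_retry_policy_has_a_default_max_attempts():
        # Constructing with no arguments would raise TypeError if the default were truly missing.
        policy = RetryPolicy()
        assert policy.max_attempts > 0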
After all LLM code review issues have been resolved, I open the PR for human colleague review. For every issue a colleague finds, I ask the code-generator LLM to fix it or comment on it, just as it does with issues from the LLM review.
Merge
Once the PR has passed human review, I do a final pass of my own before merging it into main.
What I’d add next
I would also like to incorporate more extensive end-to-end testing into my process. AI is very good at writing this kind of test, which helps ensure the software’s behavior is predictable. Even backend code can benefit from end-to-end testing, since what happens there affects what the user sees.
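As an illustration, here is a minimal sketch of the kind of end-to-end test I mean, written against a hypothetical backend HTTP service; it assumes the service is running locally and that httpx is installed. It exercises the full user-visible flow rather than internal functions.
    import httpx

    BASE_URL = "http://localhost:8000"  # hypothetical locally running service

    def test_report_flow_end_to_end():
        # Full user flow: create a report, then fetch it back and check the contents.
        created = httpx.post(f"{BASE_URL}/reports", json={"name": "monthly"}, timeout=10)
        assert created.status_code == 201
        report_id = created.json()["id"]
        fetched = httpx.get(f"{BASE_URL}/reports/{report_id}", timeout=10)
        assert fetched.status_code == 200
        assert fetched.json()["name"] == "monthly"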
I’d love to get feedback from others using AI for coding, either on this process or on how you guarantee quality output.
┌─────────────────────────────────────────────────────────────────┐
│ GENERATION │
│ Problem Definition → Generator LLM → Local Tests/Lint → Loop │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ INTEGRATION │
│ Manual Git Operations → Draft PR → Fix CI │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ REVIEW │
│ │
│ ┌──────────┐ issues ┌──────────┐ │
│ │ Reviewer │─────────────▶│Generator │ │
│ │ LLM │ │ LLM │ │
│ └──────────┘◀─────────────└──────────┘ │
│ │ fix or defend │
│ │ │ │
│ └────────┬───────────────┘ │
│ ▼ │
│ YOU (arbiter) │
│ │ │
│ ▼ │
│ Human Review → Generator fixes/comments │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ MERGE │
│ Final Self-Review → Merge to Main │
└─────────────────────────────────────────────────────────────────┘

