Your old PR comments can now review code

Thanks Sidero for supporting today’s article!

That request you asked the platform team for? Yeah... Two months now, and it's still in their backlog. You're starting to take it personally.

Please don't. I managed a platform team back in 2017, and I can tell you firsthand: they are not slow or unresponsive, they are buried. Infra upgrades alone used to eat my team's whole week.

And now, with the AI craze, it's even worse. Every new AI tool and integration falls onto the same people, and they were already the bottleneck.

Talos Linux is the open-source OS that takes that weight off them. Built only for Kubernetes, it kills the upgrade pain and the 2am reboots (read here how the magic happens). Less firefighting for them means your request finally moves.

See how to make your DevOps friend smile again

Code review became a pain in the ass.

You have engineers who just delegate decisions to Claude. You have non-engineers vibe-coding and expecting you to review and fix their mess. And even without them, every capable engineer produces more code now.

More code means more reviews, more context switches, less patience, and more shitty code slipping through. This results in more bugs, conflicting standards, confused LLMs, slower dev speed, and finally even more pressure from above.

Last month, Yaniv (a staff engineer at my company) took a genius but simple approach to improve the situation in his team. Here’s the open-source repo - clone it and try it yourself.

He wrote a detailed article on Medium, and today I brought him to share the TL;DR version:

Mic to Yaniv 🎤

Last month, our team had a session about our code review process. It was slow, caused lots of context switches, and became our single biggest pain. We brainstormed some options, until our tech lead said: “What can we build to reduce this?”

So I took it on. I started with a generic pre-PR checklist and ended up with an AI reviewer that mines three years of our PR comments and writes feedback in each engineer's voice.

The version we have now reads PRs the way the team does. It knows that Iggy will veto any MobX dependency, and that Amit will calculate exactly how many DB queries your loop fires. It takes everyone's taste into account on every PR.

Here’s how we got to it, and the steps you can take to do it too:

V0: A generic checklist

The first version took a couple of hours. I built a Claude Code skill that anyone could run before sending a PR for review. It checked the diff against a markdown checklist specific to our team, mixed with company conventions and generic React best practices.

I got some positive comments, but two problems came up: maintaining a checklist by hand would go stale fast, and our EM said "It's too generic, it's not really dedicated to our team." He was right. I had built what basically amounts to ESLint with a chat interface.

V1: Let’s mine the history

I sat with it for a couple of hours - between meetings, at lunch, always in the back of my head.

"Dedicated to our team" sounded like a lot of work. Hand-writing a custom checklist for each engineer meant interviewing everyone, codifying their preferences, and expecting them to keep those documents current - the exact maintenance problem we'd already identified.

Then I had a real 'aha' moment. Every reviewer is different. Some obsess over tests, some spot architecture problems, some have strong opinions on readability or accessibility. And every one of them had left thousands of real PR comments over the years, capturing exactly what they push back on.

Afterward, the implementation was pretty simple:

Pull every PR comment via the GitHub API
Have Claude find each person's recurring patterns
Generate an .md profile per engineer with real PR citations as evidence
When you ran the skill, it read every profile and applied them all to your diff
The result was a review that included all patterns from all team members - in one pass, before the human reviewers even opened the PR
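To make the first step concrete, here is a minimal sketch of pulling review comments and grouping them by author. It is not the code from Yaniv's repo. It assumes GitHub's REST endpoint for listing a repository's pull request review comments, a personal access token, and the helper names `fetch_review_comments` and `group_by_reviewer`, which are made up for illustration.

```python
import json
import urllib.request
from collections import defaultdict

API = "https://api.github.com"


def fetch_review_comments(owner, repo, token, per_page=100):
    """Yield every PR review comment in a repo, one page at a time."""
    page = 1
    while True:
        url = (f"{API}/repos/{owner}/{repo}/pulls/comments"
               f"?per_page={per_page}&page={page}")
        req = urllib.request.Request(url, headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        })
        with urllib.request.urlopen(req) as resp:
            batch = json.load(resp)
        if not batch:  # an empty page means we have read everything
            return
        yield from batch
        page += 1


def group_by_reviewer(comments):
    """Map each GitHub login to the comments they wrote.

    The comment URL is kept so a generated profile can cite real PRs.
    """
    grouped = defaultdict(list)
    for c in comments:
        login = (c.get("user") or {}).get("login")
        if not login or not c.get("body"):
            continue  # skip deleted users and empty comments
        grouped[login].append({
            "body": c["body"],
            "path": c.get("path"),
            "url": c.get("html_url"),
        })
    return dict(grouped)
```

Something like `group_by_reviewer(fetch_review_comments("my-org", "my-repo", token))` would give you one bucket of comments per engineer, ready to hand to Claude for pattern mining. Note that this endpoint covers inline review comments only. General PR conversation comments come from a separate issues endpoint.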

The team loved it. The feedback was specific, attributed - each finding cited which profile flagged it and which historical PR the pattern came from - and recognizably us. Engineers reading the output were saying "yep, that's exactly what Amit would have said."

By the end of that first week, every engineer was running the skill before pushing PRs.

V2: We got greedy

An engineer on the team suggested that instead of one Claude call reading all the profiles together, we spin up a separate agent for each reviewer, all running at the same time. Then add a final "skeptic" agent that reads everyone's findings and throws out the false positives and duplicates.
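The fan-out can be sketched in a few lines. This is an illustration of the shape only, not the team's orchestrator: `run_reviewer` and `run_skeptic` are hypothetical callables that wrap whatever model API you use, and findings are assumed to be plain dicts.

```python
from concurrent.futures import ThreadPoolExecutor


def review_in_parallel(diff, profiles, run_reviewer, run_skeptic, max_workers=8):
    """Run one reviewer per profile concurrently, then let a skeptic prune.

    run_reviewer(name, profile, diff) -> list of finding dicts (hypothetical)
    run_skeptic(diff, findings)       -> filtered list of findings (hypothetical)
    """
    names = list(profiles)
    findings = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as `names`
        results = pool.map(lambda n: run_reviewer(n, profiles[n], diff), names)
        for name, found in zip(names, results):
            for finding in found:
                # tag each finding with the profile that raised it
                findings.append({**finding, "reviewer": name})
    # the skeptic sees everything at once, so it can drop duplicates
    # and false positives across reviewers
    return run_skeptic(diff, findings)
```

Threads are enough here because each reviewer spends its time waiting on a network call, not on the CPU.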

It worked, but oh boy, the cost. On a 469-line PR, the thirteen agents ran in parallel on Opus and burned through 1 million tokens, costing roughly $20. The quality was great, but $20 per run was too expensive for us.

V2.1: Tiered intelligence

To address the issue, we broke down the flow into multiple steps:

V2.1 starts with a fully deterministic step (yeah, there’s still a lot of room for those!) - plain code looks at which files changed and labels the PR as frontend, backend, or mixed. Then it drops the parts of each reviewer's profile that don't apply - on a frontend-only PR, the backend rules get stripped out before any agent sees them. That step is basically free and shrinks how much the models have to read.
After that, three reviewer agents run in parallel, in a combination of Haiku (for simple reading-heavy passes) and Sonnet (for the part that needs real reasoning).
Last, Opus runs as the skeptic at the end.
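The deterministic first step might look something like this. It is a sketch under assumptions, not the classifier from the template repo: the file extensions are guesses, and it assumes each profile is markdown with `## Frontend`, `## Backend` and `## General` sections, which your profiles may not follow.

```python
# Assumed extension lists - adjust them to your own stack.
FRONTEND_EXT = (".tsx", ".jsx", ".css", ".scss")
BACKEND_EXT = (".py", ".go", ".sql")


def classify_pr(changed_files):
    """Label a PR as 'frontend', 'backend' or 'mixed' from its file paths."""
    has_fe = any(f.endswith(FRONTEND_EXT) for f in changed_files)
    has_be = any(f.endswith(BACKEND_EXT) for f in changed_files)
    if has_fe and not has_be:
        return "frontend"
    if has_be and not has_fe:
        return "backend"
    # both or neither: stay conservative and keep every rule
    return "mixed"


def filter_profile(profile_md, label):
    """Drop the '## Frontend' or '## Backend' section that does not apply."""
    if label == "mixed":
        return profile_md
    drop = "backend" if label == "frontend" else "frontend"
    kept, skipping = [], False
    for line in profile_md.splitlines():
        if line.startswith("## "):
            # a new section starts: skip it only if it is the irrelevant one
            skipping = line[3:].strip().lower() == drop
        if not skipping:
            kept.append(line)
    return "\n".join(kept)
```

No model is involved, so this costs nothing per run, and every line it strips is a line the paid agents never have to read.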

The redesign that made the system ~7x cheaper.

On the same 469-line PR that cost $20.79 before, V2.1 cost $2.99 - about 7× cheaper. And it actually found more: it caught 10 of the 12 V2 issues and added 7 new ones.
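If you want to run the same comparison on your own numbers, the arithmetic is short. The figures below are the ones quoted above. Treating the 12 V2 issues as the full V2 result is an assumption made here for the overlap calculation.

```python
v2_cost, v21_cost = 20.79, 2.99

ratio = v2_cost / v21_cost                          # about 6.95, i.e. ~7x cheaper
savings_pct = (v2_cost - v21_cost) / v2_cost * 100  # about 86% saved per run

overlap = 10 / 12         # share of V2 issues that V2.1 also caught (~83%)
v21_findings = 10 + 7     # 17 findings in total, versus 12 in V2

print(f"{ratio:.2f}x cheaper, {savings_pct:.0f}% saved, "
      f"{overlap:.0%} overlap, {v21_findings} findings")
```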

3 takeaways for your own implementation:

Your team’s PR history is a gold mine. The result is very different from anything a generic prompt produces.
Tier your intelligence by job. Haiku for breadth and reading-heavy work, Sonnet for synthesis, Opus only as the skeptic at the end.
An Opus skeptic pass solves hallucination better than tighter prompting. It’s easier to throw away five wrong findings than to make ten findings perfect. Let the cheaper agents over-produce and the smartest agent prune.

You can check out the public template repo here. It includes the directory structure, the pre-phase classifier, the orchestrator, and the prompt for generating a new reviewer profile from a GitHub handle.

Building your team’s version should take an afternoon 🙂

Thanks Yaniv for a great and super useful idea!

This approach of course won’t solve 100% of your code review problem. LLMs are good at basic patterns, but the most useful code reviews challenge the decisions taken, not just implementation details.

Still, even catching basic patterns reduces the cognitive load on reviewers, leaving them more attention for the bigger issues.

Discover weekly

Why is Meta destroying its engineering organization? Another superb journalistic article by Gergely on what happens inside Meta.
Revised rules of engineering leadership. Will Larson with 5 changes from the last few years. A must-read if you are a senior engineering leader thinking about how to rebuild your org.
Big crocs vs little crocs. On speed and crocodiles.
