How to make a code-review skill actually work?

The official code-review plugin didn’t work for me. Its reasoning looked flawless, its steps looked right — but the review it produced was not on par with my manual reviews.

What I want to share with you is how I built a customized code review skill that can mimic your own reviews.

A mental model for skills

If you already use skills, you can skip this section. Otherwise, here is a quick mental model.

In simple words: a skill is a reusable, polished prompt that you or Claude Code can invoke whenever it’s needed. You store this polished prompt in a markdown file named SKILL.md.

Here, for example, is the most-starred skill on GitHub: a distillation of Andrej Karpathy’s 🐐 coding guidelines into a single SKILL.md:

Fig. 01: The most-starred Claude skill on GitHub. A GitHub preview of the karpathy-guidelines SKILL.md, showing its frontmatter table and opening sections.

You may have noticed that this SKILL.md looks a little long and not very human-written. Don’t worry: you mostly won’t write a skill by hand. There’s a skill for writing a skill, LOL!

/skill-creator, enabled in Claude Code by default, helps you create a skill.

One more important thing about a skill: its directory can hold more than one file. That keeps SKILL.md focused on the essentials while letting the model pull in detailed reference material only when it’s actually needed. From the official docs:

my-skill/
├── SKILL.md        (required — overview and navigation)
├── reference.md    (detailed API docs — loaded when needed)
├── examples.md     (usage examples — loaded when needed)
└── scripts/
    └── helper.py   (utility script — executed, not loaded)

So keep SKILL.md thin, and let progressive disclosure do its magic. The model decides at runtime which reference to read or which script to execute.
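To make the layout concrete, here is a minimal Python sketch that scaffolds the tree above. The skill name, the description text, and the file contents are placeholders made up for illustration; the one thing assumed from the skill format is that SKILL.md opens with frontmatter carrying a name and a description.

```python
from pathlib import Path
import tempfile

# Placeholder SKILL.md: short overview plus pointers to the other files.
# The name and description values are hypothetical.
SKILL_MD = """\
---
name: my-skill
description: One-line summary of when this skill should be used.
---

# My skill

Keep this file short: an overview plus pointers to the files below.

- For detailed API docs, read reference.md.
- For usage examples, read examples.md.
- To run the helper, execute scripts/helper.py.
"""


def scaffold_skill(root: Path) -> Path:
    """Create the my-skill/ layout under root and return the skill directory."""
    skill_dir = root / "my-skill"
    (skill_dir / "scripts").mkdir(parents=True, exist_ok=True)
    (skill_dir / "SKILL.md").write_text(SKILL_MD, encoding="utf-8")
    (skill_dir / "reference.md").write_text("# Reference\n", encoding="utf-8")
    (skill_dir / "examples.md").write_text("# Examples\n", encoding="utf-8")
    (skill_dir / "scripts" / "helper.py").write_text(
        'print("helper ran")\n', encoding="utf-8"
    )
    return skill_dir


if __name__ == "__main__":
    # Build the tree in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_skill(Path(tmp))
        for path in sorted(created.rglob("*")):
            print(path.relative_to(created))
```

The point of the split is visible in SKILL_MD itself: it only names the other files and says when to open them, so the detailed material stays out of context until the model asks for it.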

There’s more to skills than fits here — the official guide covers the rest.

The official code-review plugin

Now that skills are clear, let me take you through my journey with code review skills. I realised I was spending roughly 30% of my week reviewing code, and that share has been growing every week as engineers churn out more and more PRs using AI tools. So reviews were the first thing I wanted to automate. Searching around, I found Anthropic’s official code-review plugin. As the docs describe it, it performs the following steps:

1. Summarizes the pull request changes.
2. Launches four parallel agents to review independently:
   - **Agents #1 & #2** — audit for `CLAUDE.md` compliance.
   - **Agent #3** — scan for obvious bugs in the changes.
   - **Agent #4** — analyze git blame/history for context-based issues.
3. Scores each issue 0–100 for confidence.
4. Filters out anything below the 80-confidence threshold.
5. Outputs the review — to the terminal by default, or as a PR comment with
   the `--comment` flag.
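Steps 3 and 4 are the part that cuts noise, and the idea fits in a few lines. This is only a sketch of the score-then-filter technique, not the plugin’s actual implementation: the Issue fields, the agent labels, and the sample findings are all hypothetical.

```python
from dataclasses import dataclass

# From the steps above: issues scoring below 80 are filtered out.
CONFIDENCE_THRESHOLD = 80


@dataclass(frozen=True)
class Issue:
    """A candidate review finding. Field names are hypothetical."""
    source: str      # which agent raised it, e.g. "bugs" or "git-history"
    message: str
    confidence: int  # 0-100, as in step 3


def filter_issues(issues: list[Issue], threshold: int = CONFIDENCE_THRESHOLD) -> list[Issue]:
    """Keep issues at or above the threshold, highest confidence first."""
    kept = [issue for issue in issues if issue.confidence >= threshold]
    return sorted(kept, key=lambda issue: issue.confidence, reverse=True)


# Made-up findings, as if merged from the four parallel agents.
candidates = [
    Issue("claude-md", "Handler skips the documented logging wrapper", 85),
    Issue("bugs", "Possible None dereference in parse_config", 92),
    Issue("git-history", "Variable name could be clearer", 40),
    Issue("bugs", "Loop may be off by one", 79),
]

for issue in filter_issues(candidates):
    print(f"[{issue.confidence}] {issue.source}: {issue.message}")
```

Note that the finding scored 79 is dropped even though it might be real. A hard threshold trades recall for fewer false positives.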

My first impression: exhaustive coverage, a cool subagents idea, and filtering that should lead to fewer false positives (FPs).

Here are the other skills Anthropic ships out of the box, if you’re curious.

But does it actually work?

My happiness was short-lived.

I installed it and ran it on a few PRs, but it fell short in the following ways:

- It takes a ton of tokens. Four parallel agents hit my usage limit hard.
- Even then, the review wasn’t close to mine. It had no clue about the coding conventions we follow, the context around our features, our commons module, or any other tribal knowledge.
- Follow-up reviews didn’t need such a heavy workflow. In a few instances, it took longer than reviewing manually would have.
- Comments seemed robotic. Not a big one, but my review style is built around asking the author questions and letting them reach the correct answer through their own research. The comments lacked that human touch.

Don’t get me wrong: it found security bugs better than I ever would have. It just didn’t review like someone who knows the codebase.

How I made it work

The missing ingredient: Company Context.

The idea: what if I create my own code review skill from the corpus of past reviews done by me and my colleagues?

So I took our 50+ most-commented PRs with architectural changes and asked Claude Code to build a review skill from them. My prompt went along these lines:

Using /skill-creator, I want you to write a code review skill named simbian-review that reviews the way senior engineers would: architecturally aware, and grounded in the project's actual conventions.
We will use this skill to review any pull request and publish comments in draft mode on GitHub.
In SKILL.md, I want you to write the standard practices that are required for all code reviews.
Create a references folder, and in it create a reference file for each particular app, to be used when reviewing just that app.
SKILL.md should describe how to navigate to the right references based on which app folder the changes are in.
Data Corpus: Here are the past pull requests for you to fetch comments from and understand the context:
app_xxx -> 123, 156, ...
app_yyy -> 1345, 1456 ...
...
...

Don't skip any PRs. Ask me questions if you have any, and don't make assumptions. Your output should be comments, prefixed with [claude-code], published in draft mode for me to do the final review.
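The navigation rule the prompt asks for (pick references based on which app folder changed) is something the model follows by reading SKILL.md, not code that ships with the skill. Still, a small Python sketch shows the rule itself. It assumes each app lives in a top-level folder of the repo and has one reference file at references/<app>.md; the app names are the placeholders from the prompt, and the changed paths are made up.

```python
from pathlib import PurePosixPath

# Assumption: one top-level folder per app, one reference file per app.
# The names below are the placeholders used in the prompt above.
KNOWN_APPS = {"app_xxx", "app_yyy"}


def references_for(changed_files: list[str]) -> list[str]:
    """Return the reference files relevant to a PR, given its changed paths."""
    apps = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only the first path component decides which app was touched.
        if parts and parts[0] in KNOWN_APPS:
            apps.add(parts[0])
    return [f"references/{app}.md" for app in sorted(apps)]


# Hypothetical changed paths from a pull request.
changed = [
    "app_xxx/api/views.py",
    "app_xxx/models.py",
    "docs/README.md",
]
print(references_for(changed))  # ['references/app_xxx.md']
```

A PR that touches only app_xxx loads only that app’s reference, which is why the review stays light on tokens.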

Results

Now it was time to test it. I ran the skill on a few PRs that were pending my review, so I could compare apples to apples.

After a few back-and-forths and some burning of the midnight oil, Claude produced a skill with the following directory tree:

simbian-review/
├── evals/
│   └── evals.json
├── references/
│   ├── <app>.md
│   ├── abc.md
│   └── ...
└── SKILL.md

I was happy with how it turned out:

- It didn’t drain my usage limit. It ran and completed in a few minutes: light on tokens, faster than light.
- The comments were deeply contextual and architectural, and written in my own reviewing style: asking the author questions that matter.

I can’t share examples here, since that could leak details of our code. As evidence of its credibility, though, no one on my team has complained about the comments. Frankly, it still produces false positives sometimes, but the rate is below 3%.

Can it auto-evolve, though?

For the first few weeks the skill was great, and my job was reviewing the review comments published by Claude Code. Then reality hit. As a startup, we churn out new architecture every week, and the skill didn’t know about any of the new code.

It wasn’t evolving with me.

I realized it’s a perfect use case for a Claude Code routine — a cron that runs a task on a schedule you set. But I didn’t want the routine editing the skill directly, so I created a private repo for all my skills.

Now the routine’s job is to open a pull request with changes to my code review skill. I review or edit the changes, merge, then pull the updated skill into my local environment for Claude Code to pick up next time.

I used /schedule in Claude Code to create a weekly routine and described the task in free form; after a few questions, it produced a polished version. Here’s my routine’s task description, which can help you build something of your own:

You are running a weekly skill-improvement job.

Goal: improve the `simbian-review` skill at `skills/simbian-review/SKILL.md` based on patterns from the 5 most-commented PRs in **<simbian repo>** that were reviewed by @vedang122 OR @<teammate> in the last 7 days.

Steps:
1. In the **<simbian repo>** clone, use `gh` to query PRs:
   - Merged to main in the last 7 days
   - Reviewed by vedang122 OR <teammate> (look at PR reviews, not just comments)
   - Sort by total comment count (issue comments + review comments)
   - Take the top 5
2. For each of those 5 PRs, pull:
   - The diff (focus on what changed)
   - All review comments with file:line context
   - **Skip all comments prefixed with [claude-code], since those were not written by a human**
   - Any back-and-forth discussion threads
3. Read the current `skills/simbian-review/SKILL.md` and any files under `skills/simbian-review/references/` in the vedang122/skills clone.
4. Synthesize: what recurring review patterns / heuristics / red flags do vedang122 and <teammate> apply that the current skill does NOT yet capture? Focus on architecture, correctness, DB, async — light on style nits.
5. Use the `/skill-creator` skill (skill-creator:skill-creator) to update `simbian-review` — add new references in case new components were added, sharpen existing ones, add concrete examples (anonymized if needed, but PR numbers are fine to cite).
6. In the vedang122/skills clone:
   - Create a branch: `weekly-skill-update-YYYY-MM-DD`
   - Commit the changes with a clear message listing which PRs informed each new heuristic
   - Open a PR titled `Weekly simbian-review update: <date>` with a body summarizing: (a) which 5 PRs were analyzed, (b) what new heuristics were added, (c) what was sharpened, (d) anything intentionally skipped.

If you find no meaningful patterns worth adding this week, DO NOT open a noisy PR. Instead, close out by reporting "no update needed this week" with a one-paragraph explanation of what you reviewed.

Tools allowed: Bash, Read, Write, Edit, Glob, Grep — plus the skill-creator skill.
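The selection logic in steps 1 and 2 is easy to get subtly wrong, so here is a minimal Python sketch of it over data that has already been fetched. Everything about the data shape is an assumption: the dictionary keys, the PR numbers, the comment text, and the "teammate" handle (standing in for the elided one) are hypothetical, and the sketch does not call `gh` itself.

```python
from datetime import date, datetime, timedelta, timezone

BOT_PREFIX = "[claude-code]"           # prefix named in the task description
REVIEWERS = {"vedang122", "teammate"}  # "teammate" is a stand-in handle


def select_prs(prs: list[dict], now: datetime, limit: int = 5, days: int = 7) -> list[dict]:
    """Step 1: merged in the window, reviewed by a target reviewer, top N by comments."""
    cutoff = now - timedelta(days=days)
    eligible = [
        pr for pr in prs
        if pr["merged_at"] >= cutoff and REVIEWERS & set(pr["reviewers"])
    ]
    # Sort by total comment count, as the task description states.
    return sorted(eligible, key=lambda pr: len(pr["comments"]), reverse=True)[:limit]


def human_comments(pr: dict) -> list[dict]:
    """Step 2: drop comments the skill itself published."""
    return [c for c in pr["comments"] if not c["body"].lstrip().startswith(BOT_PREFIX)]


def branch_name(today: date) -> str:
    """Step 6: weekly-skill-update-YYYY-MM-DD."""
    return f"weekly-skill-update-{today.isoformat()}"


# Made-up sample data; the date is arbitrary.
now = datetime(2026, 6, 8, tzinfo=timezone.utc)
prs = [
    {"number": 1, "merged_at": now - timedelta(days=2), "reviewers": ["vedang122"],
     "comments": [{"body": "[claude-code] Is this index needed?"},
                  {"body": "Why not reuse the commons helper?"}]},
    {"number": 2, "merged_at": now - timedelta(days=10), "reviewers": ["vedang122"],
     "comments": [{"body": "old thread"}] * 5},
    {"number": 3, "merged_at": now - timedelta(days=1), "reviewers": ["someone-else"],
     "comments": [{"body": "unrelated"}] * 9},
]

picked = select_prs(prs, now)
print([pr["number"] for pr in picked])
print([c["body"] for c in human_comments(picked[0])])
print(branch_name(now.date()))
```

Filtering out the [claude-code] comments matters more than it looks: without it, the routine would learn from the skill’s own output and reinforce whatever it already does.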

Every Monday morning, I review the routine’s PR to my skills repo, edit it if need be, merge it, and pull it into my local environment. I start the week with an updated code-review skill. That’s how the code review skill evolves with me.

Conclusion

We started with the skill concept, explored the officially published skills, and then built one of our own. One thing I have learnt: skills aren’t plug-and-play.

A skill is only as good as the context you feed it — and the leverage is yours to supply. So be creative!

// Cite this post

Karwa, Vedang. How to make a code-review skill actually work? vedangkarwa.com, 06 Jun 2026. https://vedangkarwa.com/posts/code-review-skill