I Made AI Review PRs By Scoring Risk

I review a lot of PRs.

Some weeks it feels like my job is half building things and half asking, “Wait, what else does this change touch?”

That question got more important once AI coding tools became part of the daily workflow. The team can produce more code. I can produce more code. Everyone can move faster. But the code still has to pass through review, and review does not magically become easier just because the code was written faster.

In some ways, it becomes harder.

There is more code to read, more branches to keep fresh, more generated confidence, and more moments where a PR looks fine locally but has some weird side effect hiding three modules away.

So I started using a simple metric when reviewing PRs with AI:

What is the blast radius?
How safe is this to merge into production?

That is it. Nothing revolutionary. But it changed how useful AI review became for me.

In a hurry? Jump to the AI summary at the bottom.

The Review Bottleneck Got Worse

Code review was already a bottleneck for many teams.

Even before AI coding agents, PRs could sit stale because nobody had enough context. Or they would get merged too quickly because the change looked small. Or a reviewer would leave some good comments but miss the actual risky part because the diff was boring until it suddenly was not.

AI adds a new version of the same problem.

It helps us write more code, but it also increases the amount of code that needs judgment. And code review is mostly judgment.

CI can tell you if the tests pass. TypeScript can tell you if the types line up. A linter can tell you if the formatting is off.

But those are not the questions I usually worry about.

I worry about things like:

Does this break an existing user flow?
Did we accidentally remove something unrelated?
Is this change safe for all organizations, roles, and permissions?
Does this invalidate the right data after mutation?
Did we make a shared component behave differently in 80 places?
If this fails in production, who notices first?

Those are review questions.

And if AI is going to help with review, I want it to help with those questions first.

The Two Questions I Care About

The first question is blast radius.

By blast radius, I mean all the places a change can realistically affect. Not only the files changed in the diff, but the product paths behind them.

A small diff can have a huge blast radius if it touches a shared hook, a permission check, a route wrapper, a cache key, or a component used everywhere.

A large diff can have a smaller blast radius if it is isolated inside one feature, behind a flag, with clear tests and no shared behavior changed.

So I want the review to list:

changed modules and shared packages
routes, dialogs, pages, and user flows
API calls, schemas, query keys, and cache invalidation
auth, permissions, organization scope, and feature flags
user-visible loading, empty, error, and permission-denied states
tests and manual QA paths that actually match the risk
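To make "a small diff can have a huge blast radius" concrete, here is a minimal sketch that walks an import graph backwards from the changed modules. The graph, the module names, and the `blast_radius` function are all hypothetical; in a real review the AI builds this picture by searching callers, not by running this code.

```python
from collections import defaultdict, deque


def blast_radius(imports: dict[str, set[str]], changed: set[str]) -> set[str]:
    """Return the changed modules plus everything that transitively imports them."""
    # Invert the graph: module -> modules that import it.
    dependents: dict[str, set[str]] = defaultdict(set)
    for module, deps in imports.items():
        for dep in deps:
            dependents[dep].add(module)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical import graph: each key imports the modules in its set.
IMPORTS = {
    "pages/settings": {"hooks/use_permissions", "ui/button"},
    "pages/billing": {"hooks/use_permissions"},
    "pages/about": {"ui/button"},
    "hooks/use_permissions": {"api/client"},
    "ui/button": set(),
    "api/client": set(),
}

# One changed file in a shared hook reaches two pages.
shared_hook_change = blast_radius(IMPORTS, {"hooks/use_permissions"})
# One changed file in a leaf page reaches only itself.
leaf_page_change = blast_radius(IMPORTS, {"pages/about"})
```

Both diffs touch a single file, but the shared hook pulls in every page that depends on it. An import graph only covers code-level reach; flags, permissions, and cache keys still need a human or AI reading the behavior.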

The second question is production safety.

This is the part where I force the AI to give the PR a score from 1 to 5.

My rough version looks like this:

1/5 - Do not merge. Critical production, security, auth, or data-isolation risk.
2/5 - High risk. Likely regression or broad unverified behavior.
3/5 - Moderate risk. Merge only after listed fixes or manual QA.
4/5 - Low risk. Minor issues only, but still review the checklist.
5/5 - Ready. No blocking findings and low blast radius.

The score is not magic, but it is useful because it forces the review to take a position.

“Looks good overall” is easy to write.

“4/5 because the code is focused, but the UI touch surface is wide and needs targeted QA” is much more useful.
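In practice the model assigns the score from the rubric in the prompt, but it can help to see the rubric as a plain function. This is a rough sketch under my own assumptions: the `score_pr` name, the P0 to P3 severity labels as input, and the exact thresholds are illustrative, not part of the command.

```python
def score_pr(open_severities: list[str], wide_blast_radius: bool) -> int:
    """Map open findings and blast radius to a 1-5 score (assumed thresholds)."""
    if "P0" in open_severities:
        return 1  # critical production, security, auth, or data-isolation risk
    if "P1" in open_severities:
        # A high-severity finding is worse when the change reaches many places.
        return 2 if wide_blast_radius else 3
    if open_severities or wide_blast_radius:
        return 4  # minor issues only, or clean code with a wide touch surface
    return 5  # no findings and low blast radius


focused_but_wide = score_pr([], wide_blast_radius=True)
unrelated_removal = score_pr(["P1"], wide_blast_radius=False)
missing_org_check = score_pr(["P0", "P2"], wide_blast_radius=True)
```

The point of writing it down is that every branch forces a position: a clean diff with a wide touch surface still lands at 4, not 5.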

The Command

I turned this into a Claude slash command I can run against a branch.

If you want to try the same shape, create this file in your repo:

.claude/commands/safe-code-review.md

Then open the boilerplate linked below, paste it into that file, and adjust the project-specific review standards for your codebase.

The usage is simple:

/safe-code-review feature/my-branch
/safe-code-review feature/my-branch --base main
/safe-code-review feature/my-branch --re-review

The first command reviews the branch against main. The second lets you choose a different base. The third is the important one after fixes: it checks the saved checklist instead of inventing a brand new review.

The command writes review memory into .ai-reviews/safe-code-review/<branch-slug>.md. You can change that path, but I like keeping it boring and predictable.
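The command decides how a branch name becomes `<branch-slug>`, so the rule below is only an assumption: lowercase the name and collapse anything that is not a letter, digit, dot, underscore, or hyphen into a single hyphen. `branch_slug` and `review_memory_path` are hypothetical helper names.

```python
import re
from pathlib import PurePosixPath

REVIEW_DIR = PurePosixPath(".ai-reviews/safe-code-review")


def branch_slug(branch: str) -> str:
    """Turn a branch name into a filesystem-safe slug (assumed rule)."""
    slug = re.sub(r"[^a-z0-9._-]+", "-", branch.lower())
    return slug.strip("-")


def review_memory_path(branch: str) -> PurePosixPath:
    """Path of the saved review checklist for a branch."""
    return REVIEW_DIR / f"{branch_slug(branch)}.md"


memory_file = review_memory_path("feature/my-branch")
```

Whatever rule you pick, keep it deterministic. Re-review only works if the second run finds the file the first run wrote.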

The full copyable boilerplate is here: safe-code-review-command.md.

Here is a preview of the command skeleton:

---
description: Risk-first pull request review for a branch, with production safety score, blast radius, and re-review checklist memory
allowed-tools: Read, Grep, Glob, Write, Bash
---

# Safe Code Review

Inputs:
- /safe-code-review feature/my-branch
- /safe-code-review feature/my-branch --base main
- /safe-code-review feature/my-branch --re-review

Workflow:
1. Read repo context and local guidance.
2. Inspect git diff, stat, commits, and touched files.
3. Search callers, routes, query keys, permissions, flags, and tests.
4. Build a blast-radius map.
5. Assess production safety.
6. Score the PR from 1 to 5.
7. Produce stable CR-* checklist items.
8. Save review memory for re-review.

Score:
1 - do not merge; critical production/security/data-isolation risk.
2 - high risk; likely regression or broad unverified behavior.
3 - moderate risk; merge only after fixes or manual QA.
4 - low risk; minor issues only.
5 - ready; no blocking findings, low blast radius.

Output:
- Production Safety
- PR Score
- Blast Radius
- Checklist by P0/P1/P2/P3 severity
- Verification
- Re-review history when applicable

The exact wording is less important than the shape: context first, blast radius, production safety, score, stable checklist, and re-review.
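For workflow step 2, these are the kinds of read-only git commands the review ends up running. This is a sketch, not the command's actual implementation: `git_inspection_commands` and `run` are hypothetical helpers, and the exact set of commands is my assumption.

```python
import subprocess


def git_inspection_commands(branch: str, base: str = "main") -> list[list[str]]:
    """Build read-only git commands for inspecting a branch against its base."""
    # Three dots: changes on the branch since it forked from the base.
    diff_range = f"{base}...{branch}"
    # Two dots: commits reachable from the branch but not from the base.
    log_range = f"{base}..{branch}"
    return [
        ["git", "diff", "--stat", diff_range],
        ["git", "diff", "--name-only", diff_range],
        ["git", "log", "--oneline", log_range],
        ["git", "diff", diff_range],
    ]


def run(command: list[str]) -> str:
    """Run one command inside a git repository and return its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


commands = git_inspection_commands("feature/my-branch")
```

The three-dot range matters for stale branches: it compares against the merge base, so changes that landed on main after the branch forked do not show up as if the PR had made them.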

Using The Same Approach Elsewhere

Claude is just the version I use most right now.

For example, in Cursor, the same idea can live in .cursor/commands/safe-code-review.md. Cursor custom commands are Markdown files in .cursor/commands, and you run them from the Agent input with /. I would keep the same score rubric and checklist format, then adjust the repo-specific rules for Cursor’s context model.

Examples From Real Reviews

Here are some anonymized examples of how the score changes depending on risk.

The exact codebase details do not matter. The pattern does.

3/5: Good Feature, Unrelated Risk

One PR added a new UI state for an assistant response.

The main feature was mostly fine. But the diff also removed internal admin controls from a header. That change was unrelated to the feature and looked accidental.

The review came back closer to this:

PR Score:
3/5 - Core UI work is coherent, but an unrelated header change should be
reverted or justified before merge.

P1 High:

- [ ] CR-001 Unrelated admin controls removed
  - Evidence: Header diff removes two existing toggles.
  - Why it matters: Internal users lose controls without a replacement.
  - Suggested fix: Restore the removed controls or move the cleanup to a
    separate PR with explicit approval.

This is where the score helps.

If you only review the feature, the PR looks pretty reasonable. If you review the blast radius, the unrelated change becomes the most important part.

That is a very common PR review failure. We focus on the title of the PR, not everything the branch actually changes.

1/5: The Imaginary PR I Do Not Want To Merge

Here is a fake example, but it is the kind of thing I want the review command to catch immediately.

Imagine a PR adds a new endpoint for fetching project activity.

The UI works. The endpoint returns data. The tests pass for the happy path.

But the backend query only checks that the user is authenticated. It does not verify that the user belongs to the organization that owns the project.

That review should not be polite about the score:

Production Safety:
This PR introduces a data-isolation bug. Any authenticated user who can guess or
obtain a project id may be able to read activity from another organization.
This affects user data, audit visibility, and workspace boundaries.

PR Score:
1/5 - Do not merge; critical auth and data-isolation risk.

Blast Radius:

- API route returns cross-organization data
- Frontend activity panel displays unauthorized records
- Audit/logging may record access after the leak already happened
- Tenant isolation contract is broken

P0 Critical:

- [ ] CR-001 Missing organization-scope permission check
  - Evidence: Query filters by project id but not by organization membership.
  - Why it matters: Users can read data outside their workspace.
  - Suggested fix: Scope the query through the current user's organization and
    return forbidden/not found when access is missing.
  - Acceptance criteria: A user from another organization cannot fetch the
    project activity by id; add a regression test for this exact case.

This is why I like the score.

It makes the merge decision obvious. Nobody has to pretend this is just another review comment.
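Since the PR is imaginary, here is an equally imaginary sketch of the bug and the suggested fix. Everything in it is hypothetical: the `User` and `Project` types, the in-memory tables standing in for a database, and the function names.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the caller has no access to the requested project."""


@dataclass(frozen=True)
class User:
    id: str
    org_id: str


@dataclass(frozen=True)
class Project:
    id: str
    org_id: str


# Hypothetical in-memory stand-ins for database tables.
PROJECTS = {"p1": Project(id="p1", org_id="org-a")}
ACTIVITY = {"p1": ["created", "renamed"]}


def get_activity_unscoped(user: User, project_id: str) -> list[str]:
    # The bug: being authenticated is the only requirement.
    return ACTIVITY[project_id]


def get_activity_scoped(user: User, project_id: str) -> list[str]:
    project = PROJECTS.get(project_id)
    # Treat a missing project and a foreign project the same way,
    # so the response does not reveal that the id exists.
    if project is None or project.org_id != user.org_id:
        raise Forbidden(project_id)
    return ACTIVITY[project_id]
```

The regression test from the acceptance criteria is then one case: a user from `org-b` asks for `p1` and gets `Forbidden`. The happy-path tests in the imaginary PR would never have exercised that.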

Why Re-Review Is Underrated

AI review is much less useful if every run starts from zero.

The first review might find three issues. The author fixes two. Then the AI runs again and finds four different things, including two style comments nobody asked for.

That is annoying for the author and useless for the reviewer.

So I want review memory.

The command saves the checklist, and re-review has a different job:

check every existing CR-* item
mark it fixed, still open, partially fixed, or not verifiable
keep ids stable
add new findings only if the new diff creates new risk

That turns review into a smaller loop.

The question becomes:

Did we close the known risks?

That is also easier to share in Slack. Instead of pasting a wall of new comments, I can send a short summary:

Re-review Summary:
CR-001 fixed. CR-002 still open because the fallback path still skips the
permission check. No new findings.
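The re-review loop is mostly bookkeeping, which is why it is easy to sketch. The code below assumes the saved checklist uses the `- [ ] CR-001 Title` line format from the examples above; the parser, the `Item` type, and the CR-002 title are hypothetical, and the verdicts would come from the AI re-checking each item against the new diff.

```python
import re
from dataclasses import dataclass

ITEM_RE = re.compile(r"^- \[(?P<done>[ x])\] (?P<id>CR-\d{3}) (?P<title>.+)$")


@dataclass
class Item:
    id: str
    title: str
    status: str  # "fixed", "still open", "partially fixed", or "not verifiable"


def parse_checklist(markdown: str) -> list[Item]:
    """Read CR-* items from saved review memory, ignoring everything else."""
    items = []
    for line in markdown.splitlines():
        match = ITEM_RE.match(line.strip())
        if match:
            status = "fixed" if match["done"] == "x" else "still open"
            items.append(Item(match["id"], match["title"], status))
    return items


def re_review(items: list[Item], verdicts: dict[str, str]) -> list[Item]:
    """Apply verdicts to existing items; ids and order never change."""
    return [Item(i.id, i.title, verdicts.get(i.id, i.status)) for i in items]


def summary(items: list[Item]) -> str:
    return " ".join(f"{i.id} {i.status}." for i in items)


SAVED = """P1 High:

- [ ] CR-001 Unrelated admin controls removed
  - Evidence: Header diff removes two existing toggles.
- [ ] CR-002 Fallback path skips permission check
"""

updated = re_review(parse_checklist(SAVED), {"CR-001": "fixed"})
slack_line = summary(updated)
```

Note what `re_review` cannot do: it never renumbers or drops an item. New findings would be appended with fresh ids, which keeps "CR-002" meaning the same thing in every Slack thread.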

This is not exciting tooling. It is just useful.

How This Compares To Claude, Cursor, And CodeRabbit

There are managed tools for this now.

Claude Code Review can review GitHub pull requests, post inline findings, use multiple agents, and read repo guidance from files like CLAUDE.md and REVIEW.md.

Cursor Bugbot reviews PR diffs in GitHub, can run automatically or by comment trigger, and supports project-specific rules through .cursor/BUGBOT.md.

CodeRabbit is a dedicated AI review platform with PR review, IDE feedback, CLI workflows, and team customization.

I think tools like this are useful. I am not trying to argue that a custom command is universally better.

But for my current workflow, the custom command has a few advantages:

I can run it locally against branches before or during review.
I can make the output match how my team talks about risk.
I can force the review to lead with blast radius and production safety.
I can keep re-review tied to stable checklist ids.
I can redact, adjust, and evolve the rubric without waiting for a product feature.

So my opinion is somewhere in the middle.

Managed reviewers are probably the right default for many teams. My custom command is better for how I currently review PRs. But the transferable idea is not the command.

The transferable idea is the rubric.

Whatever tool you use, make it answer the review questions you actually care about.

The Architecture Lesson

There is also a bigger lesson here.

AI review gets much harder when the codebase has no boundaries.

If every module imports every other module, if shared helpers hide product behavior, if permission checks are scattered around the app, then blast radius becomes guesswork.

The AI might still find bugs. A human might still find bugs. But both are working through fog.

The more AI-generated code we review, the more I care about boring architecture:

clear module boundaries
shared behavior in obvious places
typed API layers
predictable permission checks
feature flags around risky or unreleased behavior
tests around the paths that can actually hurt users

This is not about making the codebase beautiful.

It is about making change easier to reason about.

If a PR touches one isolated flow, I can review it differently than a PR that touches auth, routing, shared UI, and cache invalidation at the same time.

That difference should show up in the score.
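When the boring architecture is in place, even a crude path check can tell the two kinds of PR apart. A minimal sketch, assuming a repository layout I made up: the patterns in `RISK_AREAS` and the file paths are hypothetical and only work if your codebase really keeps these concerns in predictable places.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; adjust to the repository's real layout.
RISK_AREAS = {
    "auth": ["src/auth/*", "*/permissions/*"],
    "routing": ["src/routes/*"],
    "shared-ui": ["src/components/shared/*"],
    "cache": ["*/queryKeys.*"],
}


def risk_areas(changed_paths: list[str]) -> set[str]:
    """Return the cross-cutting areas touched by a list of changed file paths."""
    touched = set()
    for path in changed_paths:
        for area, patterns in RISK_AREAS.items():
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                touched.add(area)
    return touched


isolated_pr = risk_areas(["src/features/notes/NoteList.tsx"])
broad_pr = risk_areas([
    "src/auth/session.ts",
    "src/routes/index.tsx",
    "src/components/shared/Button.tsx",
    "src/api/queryKeys.ts",
])
```

If permission checks are scattered around the app, no pattern list like this can exist, and that is exactly the fog described above.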

If You Asked AI To Summarize This Post

AI coding tools make it easier to ship more code, which makes PR review more important, not less important.
The metric I care about is simple: blast radius plus a production safety score.
To use the Claude command, create .claude/commands/safe-code-review.md, paste the boilerplate, replace the project-specific rules, and run /safe-code-review feature/my-branch.
Re-review is where this becomes useful: the AI checks whether CR-001, CR-002, and the rest were fixed instead of starting from zero.
Claude Code Review, Cursor Bugbot, and CodeRabbit are all useful in different ways. My custom command works for me because the rubric is explicit, local, branch-based, and easy to change.

The Takeaway

I used to ask AI to review PRs and give me feedback.

Now I ask it to tell me what can break.

That is the whole shift. The useful part is not the tool. It is forcing the review to talk about risk before it talks about taste.