An AI worker can easily produce a sentence that sounds complete: “Fixed. Tests pass.”

For a reviewer, that sentence is almost useless.

Which tests? In what environment? Which files changed? Did the checks cover the user problem? Which checks were skipped? What risk remains? If the change affects UI, are there screenshots? If it affects a data flow, is there data proof?

That is why AI pull requests need an evidence gate.

The worker should not deliver only code. It should deliver a reviewable evidence bundle.

Evidence is review input, not self-certification

The evidence gate does not prove that the AI is correct.

Its goal is more practical: show the reviewer what the worker did, what it verified, and what it did not verify.

A minimum evidence bundle should include:

  • linked issue;
  • change summary;
  • files changed;
  • test commands and results;
  • static check commands and results;
  • manual verification;
  • skipped checks and reasons;
  • remaining risk;
  • stop conditions encountered;
  • whether release or production verification still requires a human.

This can live in the PR body or in a separate evidence manifest. The important part is durability. It must be reviewable, linkable, and recoverable outside the chat session.
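One way to make the gate mechanical is to check the bundle for missing fields before the PR is opened. Here is a minimal sketch in Python; the field names are illustrative, not a fixed schema:

```python
# Illustrative required fields for an evidence bundle (names are assumptions).
REQUIRED_FIELDS = [
    "linked_issue", "change_summary", "files_changed",
    "tests_run", "static_checks", "manual_checks",
    "skipped_checks", "remaining_risk", "stop_conditions",
    "human_verification_required",
]

def missing_evidence(bundle: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not bundle.get(f)]

# A hypothetical bundle with one field deliberately left out.
bundle = {
    "linked_issue": "#123",
    "change_summary": "Tighten backend validation messages",
    "files_changed": ["api/validation.py"],
    "tests_run": "pytest tests/test_validation.py -> 12 passed",
    "static_checks": "ruff check . -> clean",
    "manual_checks": "curl smoke test against local server",
    "skipped_checks": "e2e suite (not available locally)",
    "remaining_risk": "production data shape not verified",
    "stop_conditions": "none",
    # "human_verification_required" omitted: the gate should catch this.
}

print(missing_evidence(bundle))  # -> ['human_verification_required']
```

An empty result means the bundle is structurally complete; it says nothing about whether the evidence inside is convincing. That judgment stays with the reviewer.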

Different changes need different evidence

Not every PR needs the same evidence.

A documentation-only change may need only scope notes and link checks.

A frontend behavior change usually needs screenshots or a short recording, at least for important states and viewports.

An API or service logic change needs test output, curl examples, log snippets, or contract notes.

A database, queue, cache, or event-flow change needs data-link evidence, not only unit tests.

Security, permissions, billing, financial records, compliance, or other high-risk changes usually should not move toward merge on AI evidence alone. They should escalate to mandatory human review.

The point is not to maximize evidence volume. The point is to match evidence type to risk type.
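The risk-to-evidence matching above can be expressed as a lookup table. A minimal sketch, assuming hypothetical change-type and evidence-type names:

```python
# Hypothetical mapping from change category to the evidence it requires.
EVIDENCE_BY_CHANGE_TYPE = {
    "docs": {"scope_notes", "link_check"},
    "frontend": {"tests", "screenshots"},
    "api": {"tests", "request_examples", "logs"},
    "data_flow": {"tests", "data_link_evidence"},
    "high_risk": {"human_review_required"},
}

def required_evidence(change_types: set[str]) -> set[str]:
    """Union the evidence requirements across every change type a PR touches."""
    needed: set[str] = set()
    for change_type in change_types:
        needed |= EVIDENCE_BY_CHANGE_TYPE.get(change_type, set())
    return needed

# A PR that touches both an API and a data flow inherits both requirement sets.
print(sorted(required_evidence({"api", "data_flow"})))
# -> ['data_link_evidence', 'logs', 'request_examples', 'tests']
```

The table itself is the policy; the code only enforces that a PR touching several risk types carries the union of their evidence, never the minimum.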

Evidence should include what was not done

Good AI evidence should not list only passing checks.

It should also list skipped checks.

Example:

Skipped: browser screenshot
Reason: this change only updates backend validation copy

Skipped: e2e suite
Reason: not available locally; unit coverage added instead

Remaining risk: production data shape not verified
Next human action: verify against staging data before release

This is valuable to reviewers.

A skipped check does not automatically make a PR unmergeable. But if skipped checks are hidden, reviewers may assume the change was verified more thoroughly than it was.

Evidence changes worker behavior

If the worker knows it must produce evidence at the end, it behaves differently during implementation.

It tends to make smaller patches because small patches are easier to verify. It is more likely to add tests because evidence needs reproducible commands. It is more likely to stop when risk expands because it cannot provide sufficient evidence.

That is the real value of the gate.

It is not just a review checklist. It shapes how the AI worker writes code.

A minimal manifest

The first evidence manifest can stay simple:

issue:
  number:
  automation_mode:
  branch:

changes:
  summary:
  files_changed:

verification:
  tests_run:
  static_checks:
  manual_checks:
  skipped_checks:

risk:
  risk_notes:
  release_required:
  production_verification_required:

review:
  pr_url:
  final_state:

The template is not complicated. It forces the worker to turn “I fixed it” into auditable facts.
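A simple gate can walk the parsed manifest and flag any key left blank. A sketch in Python, with the manifest shown as an already-parsed dict and all values hypothetical:

```python
def manifest_gaps(manifest: dict) -> list[str]:
    """Return dotted paths for any manifest key left blank."""
    gaps = []
    for section, keys in manifest.items():
        for key, value in keys.items():
            if value in (None, "", []):
                gaps.append(f"{section}.{key}")
    return gaps

# Hypothetical filled-in manifest; skipped_checks is deliberately left blank.
manifest = {
    "issue": {"number": 123, "automation_mode": "supervised",
              "branch": "fix/validation-copy"},
    "changes": {"summary": "Tighten validation messages",
                "files_changed": ["api/validation.py"]},
    "verification": {"tests_run": "pytest -> 12 passed",
                     "static_checks": "ruff -> clean",
                     "manual_checks": "curl smoke test",
                     "skipped_checks": ""},  # blank: the gate should flag it
    "risk": {"risk_notes": "prod data shape unverified",
             "release_required": True,
             "production_verification_required": True},
    "review": {"pr_url": "https://example.com/pr/1",
               "final_state": "awaiting human review"},
}

print(manifest_gaps(manifest))  # -> ['verification.skipped_checks']
```

Flagging a blank skipped_checks is deliberate: "nothing skipped" should be stated explicitly, not inferred from an empty field.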

The quality of an AI PR should not be judged only by the diff. It should also be judged by whether the evidence is strong enough for a human to decide whether the issue contract was actually satisfied, and where human ownership must resume.