Many agent evaluations still look like answer evaluations: was the final answer correct, was the task completed, did the benchmark score improve?

That is not enough for agent workflows.

An agent does more than produce an answer. It plans, calls tools, reads sources, mutates state, encounters failures, retries, asks humans for input, packages evidence, and hands results off to a downstream workflow. If we judge only the final result, we miss most of the operational quality.

Evaluate the trajectory

An agent workflow has at least three layers of result.

The first is outcome: did it produce a usable result?

The second is artifact: does the deliverable satisfy structure, policy, evidence, and handoff requirements?

The third is trajectory: how did the agent get there?

Trajectory includes which sources it read, which tools it called, where it inferred, where uncertainty appeared, whether it asked for help, how it recovered from failure, whether it respected authority boundaries, and whether evidence reached the final artifact.

If we evaluate only the outcome, we cannot tell whether the agent succeeded by luck. If we also evaluate the artifact, we catch more issues. If we also evaluate the trajectory, we can judge whether the workflow is repeatable, auditable, and improvable.
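
The three layers can be recorded and scored separately. The following is a minimal sketch, not a reference implementation: the names `StepKind`, `Step`, `RunRecord`, `evaluate`, and the `evidence_id` field are all hypothetical, and the trajectory check covers only two of the properties listed above (authority boundaries, and whether collected evidence reached the artifact).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class StepKind(Enum):
    READ_SOURCE = "read_source"
    TOOL_CALL = "tool_call"
    INFERENCE = "inference"
    ASK_HUMAN = "ask_human"
    RECOVERY = "recovery"


@dataclass
class Step:
    kind: StepKind
    detail: str
    within_authority: bool = True      # did the step respect authority boundaries?
    evidence_id: Optional[str] = None  # hypothetical pointer to captured evidence


@dataclass
class RunRecord:
    outcome_usable: bool                              # layer 1: outcome
    artifact_checks: dict[str, bool]                  # layer 2: artifact
    steps: list[Step] = field(default_factory=list)   # layer 3: trajectory
    evidence_in_artifact: set[str] = field(default_factory=set)


def evaluate(run: RunRecord) -> dict[str, bool]:
    """Score each layer on its own so a lucky outcome cannot hide a bad path."""
    collected = {s.evidence_id for s in run.steps if s.evidence_id is not None}
    trajectory_ok = (
        all(s.within_authority for s in run.steps)
        and collected <= run.evidence_in_artifact  # all evidence reached the artifact
    )
    return {
        "outcome": run.outcome_usable,
        "artifact": all(run.artifact_checks.values()),
        "trajectory": trajectory_ok,
    }


# Example: a usable result whose path crossed an authority boundary.
run = RunRecord(
    outcome_usable=True,
    artifact_checks={"structure": True, "policy": True},
    steps=[
        Step(StepKind.READ_SOURCE, "read spec", evidence_id="ev-1"),
        Step(StepKind.TOOL_CALL, "wrote to production", within_authority=False),
    ],
    evidence_in_artifact={"ev-1"},
)
print(evaluate(run))  # {'outcome': True, 'artifact': True, 'trajectory': False}
```

An outcome-only score would mark this run as a success. The separate trajectory flag is what surfaces the unauthorized write.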

Cost should be measured per successful goal

Agent cost should not be measured only by model usage, per-call price, or single-step latency.

One workflow may use cheap calls but wander, retry, read the wrong context, and produce unusable artifacts. Another may use more expensive calls but complete once, carry evidence, and reduce review time.

The better unit is goal-level cost:

goal-level cost = (model cost + tool cost + runtime cost + human review cost + failure/rework cost) / number of accepted completed goals

This matters for autorun systems, domain batches, publication pipelines, social operations, and professional tool agents. The real optimization target is not cheap calls. It is affordable, trustworthy completed goals.
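
As an illustration of goal-level cost, the sketch below compares the two kinds of workflow described above. The `RunCost` type, the function name, and every number are made up for the example; the only thing taken from the text is the formula itself: sum all cost components across all runs, then divide by the number of accepted goals.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    model: float
    tool: float
    runtime: float
    human_review: float
    rework: float
    accepted: bool  # was the completed goal accepted?

    @property
    def total(self) -> float:
        return self.model + self.tool + self.runtime + self.human_review + self.rework


def cost_per_accepted_goal(runs: list[RunCost]) -> float:
    """Total cost of all runs (including failed ones) per accepted goal."""
    accepted = sum(1 for r in runs if r.accepted)
    if accepted == 0:
        return float("inf")  # money was spent but nothing was accepted
    return sum(r.total for r in runs) / accepted


# Illustrative numbers only. Cheap calls, but heavy review and rework,
# and only one of four attempts is accepted.
cheap = [RunCost(1.0, 1.0, 1.0, 20.0, 10.0, accepted=(i == 0)) for i in range(4)]
# More expensive calls, lighter review, every attempt accepted.
expensive = [RunCost(10.0, 1.0, 1.0, 5.0, 0.0, accepted=True) for _ in range(4)]

print(cost_per_accepted_goal(cheap))      # 132.0
print(cost_per_accepted_goal(expensive))  # 17.0
```

Failed runs stay in the numerator on purpose: their cost was real, and it has to be carried by the goals that were accepted.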

Recovery is more than retry

Many automation systems treat recovery as retry or rollback. For tool-using agents, that is not enough.

After failure, the system needs to know whether the recovered state is semantically valid.

If code is rolled back, do documents, test data, and generated files still match?
If publication sync fails, do the central publication record and site _generated files still align?
If a modeling bridge fails halfway, does the structured design state still match the SketchUp scene?
If multiple issue-driven changes merge into a train branch, do the domain invariants still hold?

This is semantic recoverability. It is not only returning to a previous state. It is verifying that downstream dependencies and business meaning still make sense.
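
One way to make this checkable is to express cross-artifact expectations as named invariants and run them after every recovery step. The sketch below assumes a toy state dictionary with hypothetical keys (`code_version`, `docs_version`, `generated_from`); real invariants would be domain-specific and usually more expensive to evaluate.

```python
from typing import Callable

State = dict[str, str]
Invariant = Callable[[State], bool]

# Hypothetical invariants: documents and generated files must match the code.
INVARIANTS: dict[str, Invariant] = {
    "docs_match_code": lambda s: s["docs_version"] == s["code_version"],
    "generated_match_code": lambda s: s["generated_from"] == s["code_version"],
}


def violated_invariants(state: State, invariants: dict[str, Invariant]) -> list[str]:
    """Return the names of invariants the given state does not satisfy."""
    return [name for name, holds in invariants.items() if not holds(state)]


def rollback_code(state: State, version: str) -> State:
    """A naive rollback: restores the code but leaves dependents untouched."""
    return {**state, "code_version": version}


before = {"code_version": "v2", "docs_version": "v2", "generated_from": "v2"}
after = rollback_code(before, "v1")

print(violated_invariants(before, INVARIANTS))  # []
print(violated_invariants(after, INVARIANTS))   # ['docs_match_code', 'generated_match_code']
```

The rollback itself succeeds, yet the recovered state is not semantically valid until the documents and generated files are brought back in line with the code.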

Calibration matters more than confidence

Agent evaluation also needs uncertainty handling.

A good agent should not always push forward confidently. It should mark uncertainty when evidence is missing, halt when authority is unclear, ask when sources conflict, block when risk is too high, and explain when it lacks a way to verify.

These capabilities are hard to capture with final-answer scoring. But they decide whether an agent can enter real workflows.

In high-trust work, confident wrong action is more dangerous than a clear “I do not know.”
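
The behaviors listed above can be written down as an explicit policy that an evaluator checks the agent against. This is a sketch under stated assumptions: the `Situation` fields, the numeric `risk` score, the `risk_limit` threshold, and the priority order of the checks are all illustrative choices, not something the text prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    MARK_UNCERTAIN = "mark_uncertain"
    HALT = "halt"
    ASK = "ask"
    BLOCK = "block"
    EXPLAIN_UNVERIFIABLE = "explain_unverifiable"


@dataclass
class Situation:
    evidence_present: bool
    authority_clear: bool
    sources_agree: bool
    verifiable: bool
    risk: float  # hypothetical score in [0, 1]


def expected_action(s: Situation, risk_limit: float = 0.8) -> Action:
    """Map a situation to the calibrated action (assumed priority order)."""
    if s.risk > risk_limit:
        return Action.BLOCK                 # risk is too high
    if not s.authority_clear:
        return Action.HALT                  # authority is unclear
    if not s.sources_agree:
        return Action.ASK                   # sources conflict
    if not s.evidence_present:
        return Action.MARK_UNCERTAIN        # evidence is missing
    if not s.verifiable:
        return Action.EXPLAIN_UNVERIFIABLE  # no way to verify
    return Action.PROCEED


def is_calibrated(s: Situation, agent_action: Action) -> bool:
    """An agent that proceeds when it should have stopped fails this check."""
    return agent_action == expected_action(s)


conflict = Situation(evidence_present=True, authority_clear=True,
                     sources_agree=False, verifiable=True, risk=0.2)
print(expected_action(conflict))                # Action.ASK
print(is_calibrated(conflict, Action.PROCEED))  # False
```

Scoring the agent's chosen action against the expected one turns calibration into something final-answer scoring cannot express: a run can be penalized for proceeding even if the answer happened to be right.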

A minimal evaluation panel

A project can start with a simple panel:

completion: was the goal completed and accepted?
evidence: did critical claims and side effects carry evidence?
authority: did high-risk actions pass the right gates?
trajectory: are tool calls, state changes, and human interventions traceable?
cost: what was the total cost per completed goal, including review?
recovery: did failure recovery restore semantic validity?
calibration: did the agent handle uncertainty and refusal boundaries correctly?
learning: did failures improve rules, tests, skills, or playbooks?

This is closer to real engineering practice than reporting a single success rate.
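
The panel can start as a plain record per goal. The sketch below mirrors the eight dimensions listed above; the `PanelResult` type, the `failing_dimensions` helper, and the `cost_budget` parameter are hypothetical names chosen for the example, and reducing each dimension to a boolean is a simplification.

```python
from dataclasses import dataclass, fields


@dataclass
class PanelResult:
    completion: bool     # goal completed and accepted
    evidence: bool       # critical claims and side effects carry evidence
    authority: bool      # high-risk actions passed the right gates
    trajectory: bool     # tool calls, state changes, interventions traceable
    cost_per_goal: float # total cost per completed goal, including review
    recovery: bool       # failure recovery restored semantic validity
    calibration: bool    # uncertainty and refusal boundaries handled correctly
    learning: bool       # failures improved rules, tests, skills, or playbooks


def failing_dimensions(p: PanelResult, cost_budget: float) -> list[str]:
    """List the panel dimensions that did not pass for this goal."""
    failed = []
    for f in fields(p):
        value = getattr(p, f.name)
        if isinstance(value, bool) and not value:
            failed.append(f.name)
    if p.cost_per_goal > cost_budget:
        failed.append("cost")
    return failed


result = PanelResult(
    completion=True, evidence=True, authority=False, trajectory=True,
    cost_per_goal=12.0, recovery=True, calibration=False, learning=True,
)
print(failing_dimensions(result, cost_budget=10.0))
# ['authority', 'calibration', 'cost']
```

A single success rate would count this goal as completed. The panel shows it was completed over budget, past an authority gate, and with poor calibration.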

Core judgment

Agent workflow quality is not the same thing as final answer quality.

What matters is how the agent acted, what the real cost was, whether evidence is reviewable, whether failure can recover, and whether the agent knows when it should not continue.