Evaluating Agent Workflows: Traces, Cost, Recovery
Connects trajectory evaluation, goal-level cost, semantic recovery, and confidence calibration into a practical evaluation frame for agent workflows.
22 public items tagged agent-workflow.
Defines AI agent operational governance: the layer of authority, evidence, tools, traces, costs, and human gates that makes agents usable in real work.
Explains residual drift in long-running agent sessions and why serious workflows need commitment tracking, not just memory or contradiction checks.
Shows how evidence-carrying actions and source authority rules prevent fluent text from overriding structured data, tool output, or source evidence.
Explains why agent actions need runtime authority checks and autonomy gates, instead of relying only on plan-time approval or a confident task plan.
Treats MCP servers, plugins, tool descriptions, and dependencies as part of the agent control surface that needs governance and supply-chain review.
Shows how an agent CLI, MCP server, SketchUp Ruby bridge, runtime skills, and a structured design model turn design intent into verifiable project state.
Explains why Codex and similar AI coding tools should remain bounded workers, not long-running schedulers or release controllers.
Many AI task failures do not happen because the model cannot modify code. They happen because the model reads the wrong context.
A product and workflow essay arguing that AI should reduce low-level drafting work so designers can focus on intent, judgment, constraints, and tradeoffs.
Explains why AI delivery must include verifiable proof: tests, logs, screenshots, risk notes, and a review path, not only a claim that work is done.
Shows how repeated AI mistakes should become project memory, updated rules, and regression checks so the delivery pipeline improves after each failure.
Shows how GitHub labels, branches, and comments can become an AI engineering state machine that keeps issue repair work observable and controllable.
Shows where humans should stand in an AI delivery pipeline: requirements, risk boundaries, release decisions, rollback choices, and final acceptance.
Explains why real projects should put AI work in isolated branches or workspaces, then move changes through explicit gates before they reach the main codebase.
Argues that an issue should be an executable contract for AI work, with scope, context, gates, evidence, and release boundaries instead of a loose todo.
Explains how to safely extract public methodology, templates, and toy examples from private production AI workflow experience.
Explains why a merged PR is not the end of AI engineering work; release verification, rollout checks, and post-merge evidence still define done.
Defines a project-specific AI delivery pipeline: AI acts as a worker while the project owns task intake, context, gates, evidence, and release boundaries.
The most common risk in AI-assisted development is not that the model cannot write code. It is that the model starts writing code too early.
Uses SketchUp Agent Harness to explain why AI design tools need an editable, verifiable, repairable source of truth instead of only generating finished-looking output.
An open-source project connecting agent CLIs, an MCP server, a SketchUp Ruby bridge, structured design models, and runtime skills into a verifiable design workflow.