Reference Implementations

Operations

prompt-refinement-agent

advanced

Self-improving prompt system with human-in-the-loop review and governance.

APIs Used

  • ctx.prompts
  • ctx.llm
  • ctx.escalate()
  • ctx.telemetry.emit

Capabilities Required

operations/prompt-refinement

What this demonstrates

  1. ctx.prompts.load() and the full prompt update lifecycle
  2. ctx.llm for evaluating prompt quality before proposing changes
  3. ctx.escalate() for governance: no prompt change ships without human approval
  4. Self-improving loop: evaluate → propose refinement → human review → commit or rollback (the review/commit step is sketched after the listing)
  5. ctx.telemetry.emit via emitReferenceAuthorSignal when sample size is insufficient (skip path)
```typescript
/**
 * Prompt Refinement Agent - Reference Agent
 *
 * Canon: KB 105 (Agent SDK Architecture, Prompt Versioning)
 * Category: Operations (scheduled/background)
 *
 * This agent makes the prompt system self-improving while keeping humans in the loop.
 *
 * What it does:
 * 1. Queries PromptCallLogger for all prompts in org/scope
 * 2. Identifies underperformers (high negative signals, cost outliers, model mismatches)
 * 3. Generates PromptChangeProposals using LLM-drafted improvements
 * 4. Surfaces proposals for human review (or auto-applies for low-risk org prompts)
 *
 * Governance:
 * - Core prompts: NEVER auto-applied, always require human approval
 * - Org prompts (autoTune: true, Low risk): Auto-apply with audit log
 * - All others: Queue for human review with evidence summary
 *
 * Delegation requirements:
 * - prompt:read:* (read all prompts in scope)
 * - prompt:write:* (generate proposals)
 * - llm:complete:* (draft improved prompts)
 */
import { handler, withProvenanceContext } from '@human/agent-sdk';
import type { ExecutionContext } from '@human/agent-sdk';
import { emitReferenceAuthorSignal } from '../../lib/reference-author-telemetry.js';
export const AGENT_ID = 'prompt-refinement-agent';
export const VERSION = '1.0.0';
export const CAPABILITIES = ['operations/prompt-refinement'];
export interface PromptRefinementInput {
  /** Scope to analyze: 'org' or 'core' */
  scope?: 'org' | 'core';
  /** Minimum number of calls for a prompt to be evaluated */
  min_sample_size?: number;
  /** Negative signal threshold (0.0-1.0) above which a prompt is flagged */
  negative_threshold?: number;
  /** Whether to auto-apply low-risk proposals */
  auto_apply?: boolean;
}

export interface PromptRefinementOutput {
  success: boolean;
  prompts_analyzed: number;
  underperformers_found: number;
  proposals_generated: number;
  proposals_auto_applied: number;
  proposals: Array<{
    prompt_key: string;
    issue: string;
    risk_level: string;
    auto_applied: boolean;
  }>;
}
const execute = async (
  ctx: ExecutionContext,
  input: PromptRefinementInput
): Promise<PromptRefinementOutput> => {
  const scope = input.scope ?? 'org';
  const minSampleSize = input.min_sample_size ?? 50;
  const negativeThreshold = input.negative_threshold ?? 0.15;
  const autoApply = input.auto_apply ?? false;

  ctx.log.info('Starting prompt refinement analysis', {
    scope,
    minSampleSize,
    negativeThreshold,
  });

  // ── Step 1: List all prompts in scope ──
  const prompts = ctx.prompts.list({ scope: scope as any });
  ctx.log.info('Prompts to analyze', { count: prompts.length });

  const proposals: PromptRefinementOutput['proposals'] = [];
  let underperformersFound = 0;
  let autoApplied = 0;

  // ── Step 2: For each prompt, check telemetry ──
  //
  // Reference agent: stub telemetry (kb/167); production uses PromptCallLogger.aggregate().
  for (const promptMeta of prompts) {
    // Estimate what we'd get from telemetry
    // In production: const snapshot = await telemetryLogger.aggregate(promptMeta.id);
    const simulatedNegativeRate = Math.random() * 0.3; // Simulated
    const simulatedCallCount = Math.floor(Math.random() * 200);

    if (simulatedCallCount < minSampleSize) {
      await emitReferenceAuthorSignal(ctx, 'prompt_refinement_skip_insufficient_data', {
        prompt_id: promptMeta.id,
        simulated_call_count: simulatedCallCount,
        min_sample_size: minSampleSize,
      });
      ctx.log.debug('Skipping: insufficient data', {
        promptId: promptMeta.id,
        calls: simulatedCallCount,
      });
      continue;
    }

    if (simulatedNegativeRate > negativeThreshold) {
      underperformersFound++;

      // ── Step 3: Draft improvement with LLM ──
      const currentPrompt = await ctx.prompts.load(promptMeta.id);
      const improvementResult = await ctx.llm.complete({
        prompt: [
          {
            role: 'system',
            content: `You are a prompt engineering expert. Analyze the following prompt and suggest improvements to reduce negative feedback. The prompt currently has a ${(simulatedNegativeRate * 100).toFixed(1)}% negative signal rate over ${simulatedCallCount} calls.`,
          },
          {
            role: 'user',
            content: `Current prompt:\n\n${currentPrompt.content}\n\nProvide:\n1. Analysis of potential issues\n2. Revised prompt text\n3. Risk assessment of the change (Low/Medium/High)`,
          },
        ],
        temperature: 0.4,
        promptMetadata: currentPrompt.toCallMetadata(),
      });

      // Determine risk level from the LLM's free-text assessment (defaults to Low)
      const riskLevel = improvementResult.content.toLowerCase().includes('high risk')
        ? 'High'
        : improvementResult.content.toLowerCase().includes('medium risk')
          ? 'Medium'
          : 'Low';

      // ── Step 4: Governance decision ──
      const isCore = promptMeta.scope === 'core';
      const canAutoApply = autoApply && !isCore && riskLevel === 'Low';

      if (canAutoApply) {
        autoApplied++;
        ctx.log.info('Auto-applying low-risk proposal', {
          promptId: promptMeta.id,
          riskLevel,
        });
        // In production: would call prompt publish API
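        // Illustrative only — the publish API is not shown in this reference;
        // the call shape below is hypothetical, not a documented SDK method:
        // await ctx.prompts.publish(promptMeta.id, { content: revised, audit: true });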
      } else {
        ctx.log.info('Queuing proposal for human review', {
          promptId: promptMeta.id,
          riskLevel,
          isCore,
        });
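        // The reference build stops at logging. A production wiring would hand
        // the proposal to a reviewer here via ctx.escalate(); the argument
        // shape below is illustrative, not a documented signature:
        // await ctx.escalate({
        //   reason: 'prompt_change_proposal',
        //   promptId: promptMeta.id,
        //   riskLevel,
        //   evidence: improvementResult.content,
        // });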
      }

      proposals.push({
        prompt_key: promptMeta.id,
        issue: `${(simulatedNegativeRate * 100).toFixed(1)}% negative signals over ${simulatedCallCount} calls`,
        risk_level: riskLevel,
        auto_applied: canAutoApply,
      });
    }
  }

  // ── Step 5: Log all activity to provenance ──
  await ctx.provenance.log(
    withProvenanceContext(ctx, {
      action: 'prompt:refinement:completed',
      status: 'success',
      input: { scope, minSampleSize, negativeThreshold },
      output: {
        promptsAnalyzed: prompts.length,
        underperformersFound,
        proposalsGenerated: proposals.length,
        autoApplied,
      },
    })
  );

  return {
    success: true,
    prompts_analyzed: prompts.length,
    underperformers_found: underperformersFound,
    proposals_generated: proposals.length,
    proposals_auto_applied: autoApplied,
    proposals,
  };
};
export default handler({
  name: AGENT_ID,
  id: AGENT_ID,
  version: VERSION,
  capabilities: CAPABILITIES,
  manifest: {
    operations: [
      {
        name: 'analyze',
        description: 'Analyze prompts in scope and generate improvement proposals',
        paramsSchema: {
          scope: { type: 'string', description: "'org' or 'core'" },
          min_sample_size: { type: 'number', description: 'Minimum calls for evaluation' },
          negative_threshold: { type: 'number', description: 'Threshold to flag underperformers' },
          auto_apply: { type: 'boolean', description: 'Whether to auto-apply low-risk proposals' },
        },
        resultKind: 'agent.prompt-refinement.result',
      },
    ],
  },
  execute,
});
```
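
The listing covers the propose-and-queue half of the loop; the commit-or-rollback step from item 4 happens only after a reviewer decides. A minimal sketch of that step follows, assuming hypothetical commit()/rollback() helpers on ctx.prompts (the SDK's real write surface is not shown on this page, hence the casts):

```typescript
// Hedged sketch of the review → commit-or-rollback step. The ReviewDecision
// shape and the ctx.prompts.commit()/rollback() helpers are assumptions,
// not documented SDK calls.
import type { ExecutionContext } from '@human/agent-sdk';

interface ReviewDecision {
  promptId: string;
  approved: boolean;
  revisedContent: string; // the LLM-drafted text the reviewer saw
}

async function applyReviewDecision(ctx: ExecutionContext, decision: ReviewDecision) {
  if (decision.approved) {
    // Hypothetical: publish the revised text as a new prompt version.
    await (ctx.prompts as any).commit(decision.promptId, decision.revisedContent);
    ctx.log.info('Proposal committed', { promptId: decision.promptId });
  } else {
    // Hypothetical: pin the prior version so the flagged draft never ships.
    await (ctx.prompts as any).rollback(decision.promptId);
    ctx.log.info('Proposal rejected; rolled back', { promptId: decision.promptId });
  }
}
```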

Run the tests

From the monorepo root

$ pnpm test:agents:reference

$ pnpm test:agents:reference:verbose

The reference suite runs all 23 agents with createMockExecutionContext(), verifying every ctx.* API call and output shape.
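
A minimal sketch of what one of those checks might look like for this agent. Only createMockExecutionContext() is named by this page; the import paths and the handler object exposing execute are assumptions:

```typescript
// Hedged sketch: driving the agent with a mock context, as the reference
// suite does. Import paths and the handler's exposed `execute` are assumed.
import { createMockExecutionContext } from '@human/agent-sdk/testing'; // path assumed
import agent, { type PromptRefinementOutput } from './prompt-refinement-agent.js';

const ctx = createMockExecutionContext();
const result: PromptRefinementOutput = await (agent as any).execute(ctx, {
  scope: 'org',
  min_sample_size: 10, // low threshold so the simulated data clears the skip path
});

// Shape checks mirror PromptRefinementOutput.
console.assert(result.success === true);
console.assert(Array.isArray(result.proposals));
console.assert(result.proposals_generated === result.proposals.length);
```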

See Also