prompt-refinement-agent
Self-improving prompt system with human-in-loop review and governance.
APIs Used
- ctx.prompts
- ctx.llm
- ctx.escalate()
- ctx.telemetry.emit

Capabilities Required

- operations/prompt-refinement

What this demonstrates
1. ctx.prompts.load() and the full prompt update lifecycle
2. ctx.llm for evaluating prompt quality before proposing changes
3. ctx.escalate() for governance: no prompt change ships without human approval
4. Self-improving loop: evaluate → propose refinement → human review → commit or rollback
5. ctx.telemetry.emit via emitReferenceAuthorSignal when sample size is insufficient (skip path)
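The governance rules above reduce to a small routing decision. Sketched here as a standalone function for illustration — in the actual agent this logic is inlined in Step 4, and the human-review branch is where ctx.escalate() would come into play:

```typescript
type Risk = 'Low' | 'Medium' | 'High';
type Route = 'auto-apply' | 'human-review';

// Core prompts never auto-apply; org prompts auto-apply only when the
// operator opted in AND the LLM-assessed risk is Low.
function routeProposal(scope: 'core' | 'org', risk: Risk, autoApply: boolean): Route {
  if (scope === 'core') return 'human-review'; // core: always a human
  if (risk !== 'Low') return 'human-review';   // Medium/High: always a human
  return autoApply ? 'auto-apply' : 'human-review';
}
```

Note that `routeProposal('core', 'Low', true)` still returns `'human-review'`: opting into auto-apply never overrides the core-prompt rule.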
Source
View on GitHub

```typescript
/**
 * Prompt Refinement Agent - Reference Agent
 *
 * Canon: KB 105 (Agent SDK Architecture, Prompt Versioning)
 * Category: Operations (scheduled/background)
 *
 * This agent makes the prompt system self-improving while keeping humans in the loop.
 *
 * What it does:
 * 1. Queries PromptCallLogger for all prompts in org/scope
 * 2. Identifies underperformers (high negative signals, cost outliers, model mismatches)
 * 3. Generates PromptChangeProposals using LLM-drafted improvements
 * 4. Surfaces proposals for human review (or auto-applies for low-risk org prompts)
 *
 * Governance:
 * - Core prompts: NEVER auto-applied, always requires human approval
 * - Org prompts (autoTune: true, Low risk): Auto-apply with audit log
 * - All others: Queue for human review with evidence summary
 *
 * Delegation requirements:
 * - prompt:read:* (read all prompts in scope)
 * - prompt:write:* (generate proposals)
 * - llm:complete:* (draft improved prompts)
 */
import { handler, withProvenanceContext } from '@human/agent-sdk';
import type { ExecutionContext } from '@human/agent-sdk';
import { emitReferenceAuthorSignal } from '../../lib/reference-author-telemetry.js';

export const AGENT_ID = 'prompt-refinement-agent';
export const VERSION = '1.0.0';
export const CAPABILITIES = ['operations/prompt-refinement'];

export interface PromptRefinementInput {
  /** Scope to analyze: 'org' or 'core' */
  scope?: 'org' | 'core';
  /** Minimum number of calls for a prompt to be evaluated */
  min_sample_size?: number;
  /** Negative signal threshold (0.0-1.0) above which a prompt is flagged */
  negative_threshold?: number;
  /** Whether to auto-apply low-risk proposals */
  auto_apply?: boolean;
}

export interface PromptRefinementOutput {
  success: boolean;
  prompts_analyzed: number;
  underperformers_found: number;
  proposals_generated: number;
  proposals_auto_applied: number;
  proposals: Array<{
    prompt_key: string;
    issue: string;
    risk_level: string;
    auto_applied: boolean;
  }>;
}

const execute = async (
  ctx: ExecutionContext,
  input: PromptRefinementInput
): Promise<PromptRefinementOutput> => {
  const scope = input.scope ?? 'org';
  const minSampleSize = input.min_sample_size ?? 50;
  const negativeThreshold = input.negative_threshold ?? 0.15;
  const autoApply = input.auto_apply ?? false;

  ctx.log.info('Starting prompt refinement analysis', {
    scope,
    minSampleSize,
    negativeThreshold,
  });

  // ── Step 1: List all prompts in scope ──
  const prompts = ctx.prompts.list({ scope: scope as any });
  ctx.log.info('Prompts to analyze', { count: prompts.length });

  const proposals: PromptRefinementOutput['proposals'] = [];
  let underperformersFound = 0;
  let autoApplied = 0;

  // ── Step 2: For each prompt, check telemetry ──
  //
  // Reference agent: stub telemetry (kb/167); production uses PromptCallLogger.aggregate().
  for (const promptMeta of prompts) {
    // Estimate what we'd get from telemetry
    // In production: const snapshot = await telemetryLogger.aggregate(promptMeta.id);
    const simulatedNegativeRate = Math.random() * 0.3; // Simulated
    const simulatedCallCount = Math.floor(Math.random() * 200);

    if (simulatedCallCount < minSampleSize) {
      await emitReferenceAuthorSignal(ctx, 'prompt_refinement_skip_insufficient_data', {
        prompt_id: promptMeta.id,
        simulated_call_count: simulatedCallCount,
        min_sample_size: minSampleSize,
      });
      ctx.log.debug('Skipping: insufficient data', {
        promptId: promptMeta.id,
        calls: simulatedCallCount,
      });
      continue;
    }

    if (simulatedNegativeRate > negativeThreshold) {
      underperformersFound++;

      // ── Step 3: Draft improvement with LLM ──
      const currentPrompt = await ctx.prompts.load(promptMeta.id);

      const improvementResult = await ctx.llm.complete({
        prompt: [
          {
            role: 'system',
            content: `You are a prompt engineering expert. Analyze the following prompt and suggest improvements to reduce negative feedback. The prompt currently has a ${(simulatedNegativeRate * 100).toFixed(1)}% negative signal rate over ${simulatedCallCount} calls.`,
          },
          {
            role: 'user',
            content: `Current prompt:\n\n${currentPrompt.content}\n\nProvide:\n1. Analysis of potential issues\n2. Revised prompt text\n3. Risk assessment of the change (Low/Medium/High)`,
          },
        ],
        temperature: 0.4,
        promptMetadata: currentPrompt.toCallMetadata(),
      });

      // Determine risk level from the analysis
      const riskLevel = improvementResult.content.toLowerCase().includes('high risk')
        ? 'High'
        : improvementResult.content.toLowerCase().includes('medium risk')
          ? 'Medium'
          : 'Low';

      // ── Step 4: Governance decision ──
      const isCore = promptMeta.scope === 'core';
      const canAutoApply = autoApply && !isCore && riskLevel === 'Low';

      if (canAutoApply) {
        autoApplied++;
        ctx.log.info('Auto-applying low-risk proposal', {
          promptId: promptMeta.id,
          riskLevel,
        });
        // In production: would call prompt publish API
      } else {
        ctx.log.info('Queuing proposal for human review', {
          promptId: promptMeta.id,
          riskLevel,
          isCore,
        });
      }

      proposals.push({
        prompt_key: promptMeta.id,
        issue: `${(simulatedNegativeRate * 100).toFixed(1)}% negative signals over ${simulatedCallCount} calls`,
        risk_level: riskLevel,
        auto_applied: canAutoApply,
      });
    }
  }

  // ── Step 5: Log all activity to provenance ──
  await ctx.provenance.log(
    withProvenanceContext(ctx, {
      action: 'prompt:refinement:completed',
      status: 'success',
      input: { scope, minSampleSize, negativeThreshold },
      output: {
        promptsAnalyzed: prompts.length,
        underperformersFound,
        proposalsGenerated: proposals.length,
        autoApplied,
      },
    })
  );

  return {
    success: true,
    prompts_analyzed: prompts.length,
    underperformers_found: underperformersFound,
    proposals_generated: proposals.length,
    proposals_auto_applied: autoApplied,
    proposals,
  };
};

export default handler({
  name: AGENT_ID,
  id: AGENT_ID,
  version: VERSION,
  capabilities: CAPABILITIES,
  manifest: {
    operations: [
      {
        name: 'analyze',
        description: 'Analyze prompts in scope and generate improvement proposals',
        paramsSchema: {
          scope: { type: 'string', description: "'org' or 'core'" },
          min_sample_size: { type: 'number', description: 'Minimum calls for evaluation' },
          negative_threshold: { type: 'number', description: 'Threshold to flag underperformers' },
          auto_apply: { type: 'boolean', description: 'Whether to auto-apply low-risk proposals' },
        },
        resultKind: 'agent.prompt-refinement.result',
      },
    ],
  },
  execute,
});
```

Run the tests
From the monorepo root:

```shell
$ pnpm test:agents:reference
$ pnpm test:agents:reference:verbose
```
The reference suite runs all 23 agents with createMockExecutionContext(), verifying every ctx.* API call and output shape.
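In miniature, that harness pattern looks like the following sketch — an inline stub standing in for createMockExecutionContext(), recording which ctx.* calls a simplified agent body makes. The stub and the run() body here are illustrative, not the suite's real internals:

```typescript
// Record every ctx.* call so the test can assert on them afterwards.
const calls: string[] = [];
const mockCtx = {
  prompts: {
    list: (_opts: { scope: string }) => {
      calls.push('prompts.list');
      return [] as Array<{ id: string }>;
    },
  },
  log: {
    info: (_msg: string, _meta?: object) => {
      calls.push('log.info');
    },
  },
  provenance: {
    log: (_entry: object) => {
      calls.push('provenance.log');
    },
  },
};

// Simplified agent body: list prompts, log, record provenance, return a shaped result.
function run(ctx: typeof mockCtx) {
  const prompts = ctx.prompts.list({ scope: 'org' });
  ctx.log.info('Prompts to analyze', { count: prompts.length });
  ctx.provenance.log({ action: 'prompt:refinement:completed' });
  return { success: true, prompts_analyzed: prompts.length, proposals: [] };
}

const result = run(mockCtx);
```

After the run, a test can assert both on the output shape (`result.success`, `result.prompts_analyzed`) and on the recorded call sequence — which is the essence of what the reference suite verifies for each agent.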
See Also
- SDK Reference
- Patterns