Reference Implementations

Operations

prompt-refinement-agent

advanced

Self-improving prompt system with human-in-the-loop review and governance.

APIs Used

  • ctx.prompts
  • ctx.llm
  • ctx.escalate()
  • ctx.telemetry.emit

Capabilities Required

operations/prompt-refinement

What this demonstrates

  1. ctx.prompts.load() and the full prompt update lifecycle
  2. ctx.llm for evaluating prompt quality before proposing changes
  3. ctx.escalate() for governance: no prompt change ships without human approval
  4. Self-improving loop: evaluate → propose refinement → human review → commit or rollback (the review/commit step is sketched after the listing)
  5. ctx.telemetry.emit via emitReferenceAuthorSignal when sample size is insufficient (skip path)
```typescript
/**
 * Prompt Refinement Agent - Reference Agent
 *
 * Canon: KB 105 (Agent SDK Architecture, Prompt Versioning)
 * Category: Operations (scheduled/background)
 *
 * This agent makes the prompt system self-improving while keeping humans in the loop.
 *
 * What it does:
 * 1. Queries PromptCallLogger for all prompts in org/scope
 * 2. Identifies underperformers (high negative signals, cost outliers, model mismatches)
 * 3. Generates PromptChangeProposals using LLM-drafted improvements
 * 4. Surfaces proposals for human review (or auto-applies for low-risk org prompts)
 *
 * Governance:
 * - Core prompts: NEVER auto-applied, always require human approval
 * - Org prompts (autoTune: true, Low risk): Auto-apply with audit log
 * - All others: Queue for human review with evidence summary
 *
 * Delegation requirements:
 * - prompt:read:* (read all prompts in scope)
 * - prompt:write:* (generate proposals)
 * - llm:complete:* (draft improved prompts)
 */
import { handler, withProvenanceContext } from '@human/agent-sdk';
import type { ExecutionContext } from '@human/agent-sdk';
import { emitReferenceAuthorSignal } from '../../lib/reference-author-telemetry.js';
export const AGENT_ID = 'prompt-refinement-agent';
export const VERSION = '1.0.0';
export const CAPABILITIES = ['operations/prompt-refinement'];
export interface PromptRefinementInput {
  /** Scope to analyze: 'org' or 'core' */
  scope?: 'org' | 'core';
  /** Minimum number of calls for a prompt to be evaluated */
  min_sample_size?: number;
  /** Negative signal threshold (0.0-1.0) above which a prompt is flagged */
  negative_threshold?: number;
  /** Whether to auto-apply low-risk proposals */
  auto_apply?: boolean;
}

export interface PromptRefinementOutput {
  success: boolean;
  prompts_analyzed: number;
  underperformers_found: number;
  proposals_generated: number;
  proposals_auto_applied: number;
  proposals: Array<{
    prompt_key: string;
    issue: string;
    risk_level: string;
    auto_applied: boolean;
  }>;
}
const execute = async (
  ctx: ExecutionContext,
  input: PromptRefinementInput
): Promise<PromptRefinementOutput> => {
  const scope = input.scope ?? 'org';
  const minSampleSize = input.min_sample_size ?? 50;
  const negativeThreshold = input.negative_threshold ?? 0.15;
  const autoApply = input.auto_apply ?? false;

  ctx.log.info('Starting prompt refinement analysis', {
    scope,
    minSampleSize,
    negativeThreshold,
  });

  // ── Step 1: List all prompts in scope ──
  const prompts = ctx.prompts.list({ scope: scope as any });
  ctx.log.info('Prompts to analyze', { count: prompts.length });

  const proposals: PromptRefinementOutput['proposals'] = [];
  let underperformersFound = 0;
  let autoApplied = 0;

  // ── Step 2: For each prompt, check telemetry ──
  //
  // Reference agent: stub telemetry (kb/167); production uses PromptCallLogger.aggregate().
  for (const promptMeta of prompts) {
    // Estimate what we'd get from telemetry
    // In production: const snapshot = await telemetryLogger.aggregate(promptMeta.id);
    const simulatedNegativeRate = Math.random() * 0.3; // Simulated
    const simulatedCallCount = Math.floor(Math.random() * 200);

    if (simulatedCallCount < minSampleSize) {
      await emitReferenceAuthorSignal(ctx, 'prompt_refinement_skip_insufficient_data', {
        prompt_id: promptMeta.id,
        simulated_call_count: simulatedCallCount,
        min_sample_size: minSampleSize,
      });
      ctx.log.debug('Skipping: insufficient data', {
        promptId: promptMeta.id,
        calls: simulatedCallCount,
      });
      continue;
    }

    if (simulatedNegativeRate > negativeThreshold) {
      underperformersFound++;

      // ── Step 3: Draft improvement with LLM ──
      const currentPrompt = await ctx.prompts.load(promptMeta.id);
      const improvementResult = await ctx.llm.complete({
        prompt: [
          {
            role: 'system',
            content: `You are a prompt engineering expert. Analyze the following prompt and suggest improvements to reduce negative feedback. The prompt currently has a ${(simulatedNegativeRate * 100).toFixed(1)}% negative signal rate over ${simulatedCallCount} calls.`,
          },
          {
            role: 'user',
            content: `Current prompt:\n\n${currentPrompt.content}\n\nProvide:\n1. Analysis of potential issues\n2. Revised prompt text\n3. Risk assessment of the change (Low/Medium/High)`,
          },
        ],
        temperature: 0.4,
        promptMetadata: currentPrompt.toCallMetadata(),
      });

      // Determine risk level from the LLM's free-text assessment (defaults to Low)
      const riskLevel = improvementResult.content.toLowerCase().includes('high risk')
        ? 'High'
        : improvementResult.content.toLowerCase().includes('medium risk')
          ? 'Medium'
          : 'Low';

      // ── Step 4: Governance decision ──
      const isCore = promptMeta.scope === 'core';
      const canAutoApply = autoApply && !isCore && riskLevel === 'Low';

      if (canAutoApply) {
        autoApplied++;
        ctx.log.info('Auto-applying low-risk proposal', {
          promptId: promptMeta.id,
          riskLevel,
        });
        // In production: would call prompt publish API
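        // Illustrative only — the publish API is not shown in this reference;
        // the call shape below is hypothetical, not a documented SDK method:
        // await ctx.prompts.publish(promptMeta.id, { content: revised, audit: true });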
      } else {
        ctx.log.info('Queuing proposal for human review', {
          promptId: promptMeta.id,
          riskLevel,
          isCore,
        });
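        // The reference build stops at logging. A production wiring would hand
        // the proposal to a reviewer here via ctx.escalate(); the argument
        // shape below is illustrative, not a documented signature:
        // await ctx.escalate({
        //   reason: 'prompt_change_proposal',
        //   promptId: promptMeta.id,
        //   riskLevel,
        //   evidence: improvementResult.content,
        // });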
      }

      proposals.push({
        prompt_key: promptMeta.id,
        issue: `${(simulatedNegativeRate * 100).toFixed(1)}% negative signals over ${simulatedCallCount} calls`,
        risk_level: riskLevel,
        auto_applied: canAutoApply,
      });
    }
  }

  // ── Step 5: Log all activity to provenance ──
  await ctx.provenance.log(
    withProvenanceContext(ctx, {
      action: 'prompt:refinement:completed',
      status: 'success',
      input: { scope, minSampleSize, negativeThreshold },
      output: {
        promptsAnalyzed: prompts.length,
        underperformersFound,
        proposalsGenerated: proposals.length,
        autoApplied,
      },
    })
  );

  return {
    success: true,
    prompts_analyzed: prompts.length,
    underperformers_found: underperformersFound,
    proposals_generated: proposals.length,
    proposals_auto_applied: autoApplied,
    proposals,
  };
};
export default handler({
  name: AGENT_ID,
  id: AGENT_ID,
  version: VERSION,
  capabilities: CAPABILITIES,
  manifest: {
    operations: [
      {
        name: 'analyze',
        description: 'Analyze prompts in scope and generate improvement proposals',
        paramsSchema: {
          scope: { type: 'string', description: "'org' or 'core'" },
          min_sample_size: { type: 'number', description: 'Minimum calls for evaluation' },
          negative_threshold: { type: 'number', description: 'Threshold to flag underperformers' },
          auto_apply: { type: 'boolean', description: 'Whether to auto-apply low-risk proposals' },
        },
        resultKind: 'agent.prompt-refinement.result',
      },
    ],
  },
  execute,
});
```
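
The listing covers the propose-and-queue half of the loop; the commit-or-rollback step from item 4 happens only after a reviewer decides. A minimal sketch of that step follows, assuming hypothetical commit()/rollback() helpers on ctx.prompts (the SDK's real write surface is not shown on this page, hence the casts):

```typescript
// Hedged sketch of the review → commit-or-rollback step. The ReviewDecision
// shape and the ctx.prompts.commit()/rollback() helpers are assumptions,
// not documented SDK calls.
import type { ExecutionContext } from '@human/agent-sdk';

interface ReviewDecision {
  promptId: string;
  approved: boolean;
  revisedContent: string; // the LLM-drafted text the reviewer saw
}

async function applyReviewDecision(ctx: ExecutionContext, decision: ReviewDecision) {
  if (decision.approved) {
    // Hypothetical: publish the revised text as a new prompt version.
    await (ctx.prompts as any).commit(decision.promptId, decision.revisedContent);
    ctx.log.info('Proposal committed', { promptId: decision.promptId });
  } else {
    // Hypothetical: pin the prior version so the flagged draft never ships.
    await (ctx.prompts as any).rollback(decision.promptId);
    ctx.log.info('Proposal rejected; rolled back', { promptId: decision.promptId });
  }
}
```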

Run the tests

From the monorepo root

$ pnpm test:agents:reference

$ pnpm test:agents:reference:verbose

The reference suite runs all 23 agents with createMockExecutionContext(), verifying every ctx.* API call and output shape.
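
A minimal sketch of what one of those checks might look like for this agent. Only createMockExecutionContext() is named by this page; the import paths and the handler object exposing execute are assumptions:

```typescript
// Hedged sketch: driving the agent with a mock context, as the reference
// suite does. Import paths and the handler's exposed `execute` are assumed.
import { createMockExecutionContext } from '@human/agent-sdk/testing'; // path assumed
import agent, { type PromptRefinementOutput } from './prompt-refinement-agent.js';

const ctx = createMockExecutionContext();
const result: PromptRefinementOutput = await (agent as any).execute(ctx, {
  scope: 'org',
  min_sample_size: 10, // low threshold so the simulated data clears the skip path
});

// Shape checks mirror PromptRefinementOutput.
console.assert(result.success === true);
console.assert(Array.isArray(result.proposals));
console.assert(result.proposals_generated === result.proposals.length);
```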

See Also