ctx.llm
Language model completions, embeddings, and streaming — with automatic cost tracking, provenance logging, and budget enforcement on every call.
Why this exists: Raw LLM provider calls give you text back. `ctx.llm` gives you text back plus the cost, token counts, a provenance ID for the audit trail, and automatic budget enforcement from the delegation. When your agent exceeds its budget, the call throws `BudgetExceededError` — not a surprise bill at the end of the month.
ctx.llm.complete(options)
Generate a text completion. Returns a fully typed response with cost and provenance.
```typescript
const response = await ctx.llm.complete({
  prompt: 'Summarize the following document in 3 bullet points.',
  temperature: 0.3,
  maxTokens: 500,
});
```
```typescript
console.log(response.content);           // Generated text
console.log(response.cost.usd);          // e.g. 0.0023
console.log(response.usage.totalTokens); // e.g. 847
console.log(response.provenanceId);      // Audit trail reference
```
With message arrays
```typescript
// Use message arrays for multi-turn conversations
const response = await ctx.llm.complete({
  system: 'You are a financial analyst. Be precise and cite sources.',
  prompt: [
    { role: 'user', content: 'What are the key risks in this contract?' },
    { role: 'assistant', content: 'I see three main areas of concern...' },
    { role: 'user', content: 'Focus on the indemnification clause.' },
  ],
  temperature: 0.2,
  maxTokens: 1000,
});
```
Options
| Option | Type | Default | Description |
|---|---|---|---|
| `prompt` | `string` or `Message[]` | — | The prompt text or conversation messages. Required. |
| `system` | `string` | `undefined` | System message. Sets the model's persona/behavior. |
| `temperature` | `number` | `0.7` | Sampling temperature, 0.0–1.0. Lower = more deterministic. |
| `maxTokens` | `number` | `undefined` | Maximum output tokens. Capped by the model's context window. |
| `model` | `string` | delegation default | Override the model for this call, e.g. `"gpt-4o"`, `"claude-3-5-sonnet"`. |
| `stop` | `string[]` | `undefined` | Stop sequences that end generation. |
| `topP` | `number` | `undefined` | Nucleus sampling. Alternative to `temperature`. |
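As a quick illustration of the defaults in the table, here is a hypothetical helper (not part of the SDK; `withDefaults` and the local `CompleteOptions` interface are names invented for this sketch) that applies the documented `temperature` default of `0.7` and clamps it to the documented 0.0–1.0 range before a call:

```typescript
// Hypothetical helper (not part of the SDK). Applies the documented
// temperature default (0.7) and clamps it to the documented 0.0-1.0 range.
interface CompleteOptions {
  prompt: string;
  temperature?: number;
  maxTokens?: number;
  stop?: string[];
}

function withDefaults(options: CompleteOptions): CompleteOptions {
  const temperature = Math.min(1, Math.max(0, options.temperature ?? 0.7));
  return { ...options, temperature };
}
```

You could then pass the result straight through, e.g. `ctx.llm.complete(withDefaults({ prompt }))`.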
Response
| Field | Type | Description |
|---|---|---|
| `content` | `string` | Generated text. |
| `cost` | `Cost` | Cost in USD with breakdown (llm, embedding, etc.). |
| `usage` | `TokenUsage` | `promptTokens`, `completionTokens`, `totalTokens`. |
| `model` | `string` | Actual model used (may differ from the requested model if delegated). |
| `finishReason` | `'stop'` \| `'length'` \| `'content_filter'` | Why generation stopped. |
| `provenanceId` | `string` | Audit trail reference. Queryable via `ctx.provenance`. |
| `confidence` | `number` or `undefined` | Confidence score, if the model provides one. |
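One practical use of `finishReason` is detecting truncated output. Below is a sketch of a retry wrapper (hypothetical; `completeUntruncated` and `CompletionResponse` are names invented here, not SDK exports) that doubles `maxTokens` whenever a response stops with `'length'`:

```typescript
// Hypothetical wrapper (not part of the SDK): retry a completion with a
// larger output budget when the response was cut off by the token limit.
type CompletionResponse = {
  content: string;
  finishReason: 'stop' | 'length' | 'content_filter';
};

async function completeUntruncated(
  complete: (maxTokens: number) => Promise<CompletionResponse>,
  initialMaxTokens = 500,
  maxRetries = 2,
): Promise<CompletionResponse> {
  let maxTokens = initialMaxTokens;
  let response = await complete(maxTokens);
  for (let i = 0; i < maxRetries && response.finishReason === 'length'; i++) {
    maxTokens *= 2; // Double the output budget and try again.
    response = await complete(maxTokens);
  }
  return response;
}
```

You might call it as `completeUntruncated((maxTokens) => ctx.llm.complete({ prompt, maxTokens }))`. Note that each retry is a fresh billed call, so the budget enforcement described above still applies.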
ctx.llm.stream(options)
Streaming completion. Returns an `AsyncIterableIterator`. Same options as `complete()`. Cost is reported in the final chunk.
```typescript
// Stream tokens in real-time
for await (const chunk of ctx.llm.stream({ prompt: userQuery })) {
  process.stdout.write(chunk.delta); // Print new tokens as they arrive

  if (chunk.finishReason) {
    console.log('\nDone. Tokens used:', chunk.usage?.totalTokens);
  }
}
```
ctx.llm.embed(text)
Generate a vector embedding for semantic search. Embedding costs are tracked separately from completion costs in the `Cost.breakdown` field.
```typescript
// Generate embeddings for semantic search
const embedding = await ctx.llm.embed('How do I process an invoice?');

// embedding.vector is a float array ready for vector search
const results = await ctx.memory.persistent.search(embedding.vector, {
  limit: 5,
  minScore: 0.7,
});

console.log(embedding.dimensions); // e.g. 1536
console.log(embedding.cost.usd);   // Embedding costs are tracked too
```
Cost Tracking
Every call is tracked. Use `getLastCost()` and `getTotalCost()` to monitor spend within an execution.
```typescript
// Track costs across multiple LLM calls
const step1 = await ctx.llm.complete({ prompt: 'Extract entities from: ' + text });
const step2 = await ctx.llm.complete({ prompt: 'Classify entities: ' + step1.content });

// Cost of the last call
const lastCost = ctx.llm.getLastCost();

// Total cost of ALL llm calls in this execution
const totalCost = ctx.llm.getTotalCost();
console.log(`Total: $${totalCost.usd.toFixed(4)}`); // e.g. Total: $0.0047
```
Real-World Example
From the invoice-processor reference implementation:
```typescript
// Real-world: invoice extraction (from packages/agents-reference)
const extractionResult = await ctx.llm.complete({
  prompt: [
    {
      role: 'user',
      content: `Extract invoice data as JSON with fields:
vendor, date, total, line_items (array with description, qty, unit_price, total).

Invoice content:
${invoiceContent}`,
    },
  ],
  temperature: 0.2, // Low temp for structured extraction
  maxTokens: 2000,
});

const invoiceData = JSON.parse(extractionResult.content);
```
Errors
| Error | Thrown when |
|---|---|
| `BudgetExceededError` | The delegation budget would be exceeded by this call. |
| `AccessDeniedError` | The delegation does not include LLM access scope. |
| `TimeoutError` | The model took longer than the configured timeout. |

See Error Reference for all error types.
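A defensive sketch of handling these errors around a completion call. Assumptions: each error's `name` property matches the class names in the table (checking `name` avoids depending on a specific import path), and `safeComplete` is a name invented for this example; the fallback behavior is a policy decision for your agent, not SDK behavior.

```typescript
// Sketch only: handle the documented error types around a completion call.
// Assumes each thrown error's `name` matches the class names listed above.
async function safeComplete(
  llm: { complete(options: { prompt: string; maxTokens?: number }): Promise<{ content: string }> },
  prompt: string,
): Promise<string | null> {
  try {
    const response = await llm.complete({ prompt, maxTokens: 500 });
    return response.content;
  } catch (error) {
    const name = error instanceof Error ? error.name : '';
    if (name === 'BudgetExceededError') {
      return null; // Budget exhausted: degrade gracefully instead of crashing.
    }
    if (name === 'TimeoutError') {
      return null; // Could also retry with backoff.
    }
    throw error; // AccessDeniedError and unknown errors should surface.
  }
}
```

You would call it as `await safeComplete(ctx.llm, prompt)`; whether to swallow, retry, or escalate each error type depends on your agent's requirements.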
In the wild
Reference agents that demonstrate ctx.llm in production.
`ctx.llm.stream()` · streaming-analyzer
Real-time streaming analysis with token accumulation and async iteration.
`ctx.llm.embed()` · semantic-search
Generate vector embeddings for documents and queries. The RAG foundation.
`ctx.llm.complete()` · invoice-processor
JSON extraction prompt with low temperature for structured, deterministic output.
Deep Dives
Prompt Management · Part 1 of 3
From Inline Strings to ctx.prompts: A Developer's Guide to HUMΛN Prompt Management
A hands-on walkthrough of HUMΛN's prompt SDK: authoring prompts, validating schemas, composing layers, publishing versions, and wiring telemetry — with code examples from real agents.
Prompt Management · Part 3 of 3
The Self-Improving Prompt Loop: How Telemetry Closes the Gap Between Good and Great
Most AI platforms ship prompts and forget them. HUMΛN's protocol-level telemetry, model affinity tracking, and Prompt Refinement Agent create a virtuous cycle of continuous improvement — with humans always in the loop.