End-to-End Auto-Learn Demo
A complete walkthrough showing how an agent learns from one incident and applies that knowledge to the next
This guide walks through the full Acontext auto-learn cycle using a DevOps incident resolution scenario. By the end, you'll see an agent that:
- Diagnoses a 502 gateway error using multiple tool calls
- Has its session automatically distilled into a reusable SOP (skill)
- Applies that SOP to a new incident — resolving it faster with fewer tool calls
ACT 1: Agent investigates 502 errors (no prior knowledge)
└► 7+ tool calls: check status, read logs, read config, update, restart, verify
└► Task extracted → Skill distilled → SOP written
ACT 2: Similar 502 incident (with learned skill)
└► Agent reads skill first → goes straight to the fix
  └► 4-6 tool calls: targeted diagnosis + fix

Prerequisites
- ACONTEXT_API_KEY — get one at acontext.io
- OPENAI_API_KEY — for the OpenAI agent loop
- Python SDK: pip install acontext openai
- TypeScript SDK: npm install @acontext/acontext openai
The Mock Environment
The demo uses five mock DevOps tools that operate against an in-memory state — no real infrastructure needed:
| Tool | Purpose |
|---|---|
| check_service_status | Health, error rate, latency |
| check_logs | Recent log entries |
| read_config | Service configuration |
| update_config | Mutate a config value |
| restart_service | Restart and clear errors |
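Such an environment can be as small as a shared dict plus a handful of functions. Here is a minimal Python sketch covering the five tools (illustrative only; the demo's actual state, service names, and fields may differ):

```python
# In-memory mock infrastructure: each "tool" reads or mutates this dict.
STATE = {
    "api-gateway": {
        "status": "unhealthy",
        "error_rate": 0.34,
        "config": {"upstream_timeout_ms": 5000},
        "logs": ["502 Bad Gateway", "upstream timed out (backend-api)"],
    },
    "backend-api": {
        "status": "healthy",
        "error_rate": 0.0,
        "config": {},
        "logs": ["request took 4300ms (slow DB query)"],
    },
}

def check_service_status(service: str) -> dict:
    s = STATE[service]
    return {"status": s["status"], "error_rate": s["error_rate"]}

def check_logs(service: str) -> list:
    return STATE[service]["logs"]

def read_config(service: str) -> dict:
    return STATE[service]["config"]

def update_config(service: str, key: str, value) -> str:
    STATE[service]["config"][key] = value
    return f"updated {service}.{key} = {value}"

def restart_service(service: str) -> str:
    # Restarting clears the error state, simulating a successful fix.
    STATE[service].update(status="healthy", error_rate=0.0)
    return f"{service} restarted"
```

Because everything is in-memory, the agent's fix (update the timeout, restart, re-check status) is fully observable without touching real infrastructure.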
Act 1 — The Learning Run
The agent has no prior knowledge. It must explore, diagnose, and fix a 502 gateway error from scratch.
Scenario: The API gateway connects to five services (auth, backend-api, notifications, redis, database). The upstream_timeout is set to 5 seconds, but the backend-api's p99 latency is 4.2 seconds, so tail requests exceed the timeout. Gateway logs show 502 errors and slow upstream warnings — but don't reveal the configured timeout value. The notification service is also degraded with SMTP failures (a convincing red herring).
Step 1: Set up learning space and session
space = ac.learning_spaces.create()
session = ac.sessions.create()
ac.learning_spaces.learn(space.id, session_id=session.id)

const space = await ac.learningSpaces.create();
const session = await ac.sessions.create();
await ac.learningSpaces.learn({ spaceId: space.id, sessionId: session.id });

Step 2: Run the agent and record messages
Run your OpenAI agent loop as usual. The key is storing every message (user, assistant, tool calls, tool results) to Acontext via store_message:
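The loop itself is ordinary OpenAI tool calling; the Acontext-specific part is mirroring each message into the session as it is produced. A minimal Python sketch of such a loop follows (the function name, parameters, and `dispatch` mapping are illustrative, not part of either SDK):

```python
import json

def run_agent_loop(client, ac, session_id, messages, tools, dispatch, model="gpt-4o"):
    """Run an OpenAI tool-calling loop, mirroring every message to Acontext.

    `dispatch` maps a tool name to a Python callable. Illustrative sketch only.
    """
    while True:
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        assistant = {"role": "assistant", "content": msg.content}
        if msg.tool_calls:
            assistant["tool_calls"] = [tc.model_dump() for tc in msg.tool_calls]
        messages.append(assistant)
        ac.sessions.store_message(session_id, blob=assistant)
        if not msg.tool_calls:
            return msg.content  # final answer: loop ends
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = dispatch[tc.function.name](**args)
            tool_msg = {"role": "tool", "tool_call_id": tc.id, "content": str(result)}
            messages.append(tool_msg)
            ac.sessions.store_message(session_id, blob=tool_msg)
```

The essential calls are the two `store_message` mirrors, shown in isolation below.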
ac.sessions.store_message(session.id, blob={"role": "user", "content": "..."})
# ... agent calls tools, produces responses ...
ac.sessions.store_message(session.id, blob={"role": "assistant", "tool_calls": [...]})
ac.sessions.store_message(session.id, blob={"role": "tool", "tool_call_id": "...", "content": "..."})

await ac.sessions.storeMessage(session.id, { role: "user", content: "..." });
// ... agent calls tools, produces responses ...
await ac.sessions.storeMessage(session.id, { role: "assistant", tool_calls: [...] });
await ac.sessions.storeMessage(session.id, { role: "tool", tool_call_id: "...", content: "..." });

What the agent typically does (9-12 tool calls — investigating + fixing):
1. check_service_status("api-gateway") → unhealthy, 34% error rate
2. check_logs("api-gateway") → 502s, slow upstream, notification-service failures
3. check_service_status("notification-service") → degraded, 5% errors (investigates red herring)
4. check_logs("notification-service") → SMTP connection refused (unrelated to 502s)
5. check_service_status("backend-api") → healthy, but p99 = 4200ms
6. check_logs("backend-api") → requests taking 4000-4900ms, slow DB queries
7. read_config("api-gateway") → upstream_timeout_ms: 5000 — aha!
8. update_config("api-gateway", "upstream_timeout_ms", "10000")
9. restart_service("api-gateway")
10. check_service_status("api-gateway") → healthy, 0% error rate ✓

Step 3: Flush and inspect extracted tasks
import time

ac.sessions.flush(session.id)

# Poll until all messages are processed
while True:
    status = ac.sessions.messages_observing_status(session.id)
    if status.pending == 0 and status.in_process == 0:
        break
    time.sleep(1)

result = ac.sessions.get_tasks(session.id)
# result.items → list of Task objects
# result.items[0].status → "success"
# result.items[0].data.task_description → "Investigate and fix 502 errors..."

await ac.sessions.flush(session.id);
const status = await ac.sessions.messagesObservingStatus(session.id);
// status.pending / status.in_process → wait until both are 0
const result = await ac.sessions.getTasks(session.id);
// result.items → list of Task objects
// result.items[0].status → "success"

Example extracted task:
[success] Diagnose and fix 502 errors on the API gateway
> Gateway logs show 502s from backend-api, also notification-service health failures
> Investigated notification-service: SMTP issue, unrelated to 502 errors
> backend-api healthy but p99 latency 4200ms; gateway config upstream_timeout_ms=5000
> Root cause: backend-api latency exceeds gateway timeout threshold
> Increased upstream_timeout_ms to 10000, restarted gateway → 0% error rate

Step 4: Wait for skill learning and inspect results
result = ac.learning_spaces.wait_for_learning(space.id, session_id=session.id)
# result.status → "completed"
skills = ac.learning_spaces.list_skills(space.id)
for s in skills:
    for f in s.file_index:
        content = ac.skills.get_file(skill_id=s.id, file_path=f.path)

const result = await ac.learningSpaces.waitForLearning({
spaceId: space.id, sessionId: session.id,
});
const skills = await ac.learningSpaces.listSkills(space.id);
for (const s of skills) {
  for (const f of s.fileIndex) {
    const content = await ac.skills.getFile({ skillId: s.id, filePath: f.path });
  }
}

Example generated skill (auto-created by the learning pipeline):
---
name: "devops-incident-patterns"
description: "Reusable patterns for diagnosing and resolving infrastructure incidents"
---
# DevOps Incident Patterns
Collected SOPs and warnings from past incident resolutions.
## File Structure
One file per incident category (e.g., gateway-errors.md, database-issues.md).
## Guidelines
- One category per file
- Include date, symptoms, and numbered resolution steps
- Update existing entries when better approaches are discovered

## API Gateway 502 — Upstream Timeout Mismatch (date: 2026-03-26)
- Principle: 502 errors from a reverse proxy often indicate the upstream
timeout is lower than the backend's actual response time.
- When to Apply: API gateway returning 502s with upstream timeout errors in logs.
- Steps:
1. Check gateway logs for "upstream timeout" errors — note which backend service
2. Compare gateway's upstream_timeout_ms config against the backend's p99 latency
3. If p99 > timeout, increase timeout to at least 3-5x the p99 latency
4. Restart the gateway to apply the new config
  5. Verify error rate drops to 0%

The exact skill output varies between runs since the learning pipeline uses LLM-based distillation. The structure and content will be similar, but wording may differ.
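The numeric check at the heart of the SOP (compare the backend's p99 against the gateway timeout, then size a new timeout) is simple enough to express as code. A hedged sketch, with a function name and default factor that are illustrative, not part of any SDK:

```python
def recommend_upstream_timeout(p99_ms: int, current_timeout_ms: int, factor: int = 3):
    """SOP rule of thumb: if the backend's p99 latency exceeds the gateway's
    upstream timeout, propose a new timeout of at least `factor` x the p99.
    Returns None when the timeout is not implicated."""
    if p99_ms <= current_timeout_ms:
        return None  # timeout is not the culprit; keep investigating
    return factor * p99_ms
```

With Act 2's numbers (p99 of 45 000 ms against a 30 000 ms timeout) the mismatch is flagged immediately, which is exactly the shortcut the learned skill gives the agent.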
Act 2 — The Recall Run
Now a new incident occurs after a deployment. The freshly added payment-service has a p99 latency of 45 seconds, exceeding the gateway's 30-second timeout. The user doesn't know which service is causing it — they just see 502s again.
The agent has access to the skill learned in Act 1 via Skill Content Tools.
Step 5: Run the agent with learned skills
from acontext.agent.skill import SKILL_TOOLS
session2 = ac.sessions.create()
ac.learning_spaces.learn(space.id, session_id=session2.id)
skill_ids = [s.id for s in skills]
skill_ctx = SKILL_TOOLS.format_context(ac, skill_ids)
# Add skill tools alongside your existing agent tools
tools = your_tools + SKILL_TOOLS.to_openai_tool_schema()
system = "..." + skill_ctx.get_context_prompt()

import { SKILL_TOOLS } from '@acontext/acontext';
const session2 = await ac.sessions.create();
await ac.learningSpaces.learn({ spaceId: space.id, sessionId: session2.id });
const skillIds = skills.map(s => s.id);
const skillCtx = await SKILL_TOOLS.formatContext(ac, skillIds);
const tools = [...yourTools, ...SKILL_TOOLS.toOpenAIToolSchema()];
const system = "..." + skillCtx.getContextPrompt();

What the agent typically does (4-6 DevOps tool calls — skips exploration):
1. get_skill(...) → reads the SOP from Act 1
2. get_skill_file(...) → learns the upstream-timeout diagnostic pattern
3. check_logs("api-gateway") → "upstream timeout: payment-service...within 30000 ms" — pattern match!
4. read_config("api-gateway") → upstream_timeout_ms: 30000
5. update_config("api-gateway", "upstream_timeout_ms", "60000")
6. restart_service("api-gateway")
7. check_service_status("api-gateway") → healthy, 0% error rate ✓

The agent skips the exploratory phase entirely. In Act 1 it had to investigate notification-service (red herring), check backend-api logs, and discover the timeout mismatch. With the SOP, it recognizes the upstream timeout pattern in the logs and goes straight to gateway config → fix.
Result
Act 1 (no prior knowledge): 9-12 DevOps tool calls (investigates multiple services)
Act 2 (with learned skills): 4-6 DevOps tool calls (targeted fix)
Improvement: ~40-55% fewer tool calls

Full Runnable Scripts
How It Works Under the Hood
The auto-learn pipeline that runs between Act 1 and Act 2:
Session messages
└► Task Agent extracts structured tasks (goal, progress, status)
└► Distillation LLM analyzes the completed task:
- Was it trivial? → skip
- Multi-step procedure? → report_success_analysis (SOP)
- Factual content? → report_factual_content
- Failed task? → report_failure_analysis (anti-pattern)
└► Skill Learner Agent reads distilled context + existing skills
    → Creates/updates skill files (SKILL.md + data files)

The DevOps scenario works well because it triggers report_success_analysis — the multi-step debugging process with clear decisions produces a rich SOP that the skill learner formats into actionable steps.
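The distillation branching above can be summarized as a small routing function. This is a sketch for intuition only: the real pipeline is LLM-driven, and the field names here are made up, while the report names come from the pipeline description:

```python
def route_distillation(task: dict):
    """Decide which distillation report a finished task should produce.
    Hypothetical field names ('trivial', 'status', 'multi_step')."""
    if task.get("trivial"):
        return None  # nothing worth learning
    if task.get("status") == "failed":
        return "report_failure_analysis"   # anti-pattern
    if task.get("multi_step"):
        return "report_success_analysis"   # SOP
    return "report_factual_content"
```

The Act 1 session is a successful multi-step task, so it lands on the SOP branch.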
Next Steps