What if your infrastructure could detect problems, diagnose root causes, and fix itself -- all before an engineer even gets paged? This isn't science fiction. It's what I've been building using Claude Code, AWS CloudWatch, Lambda, and AI agents.
In this post, I'll share the architecture and implementation patterns for building self-healing infrastructure that uses AI to move from reactive incident response to proactive, autonomous remediation.
The Problem with Traditional Monitoring
Traditional monitoring follows a predictable (and exhausting) pattern:
- Alert fires at 3am
- On-call engineer wakes up, opens laptop
- Checks dashboard, reads alert details
- Searches runbooks (if they exist and are up to date)
- SSH into servers, check logs, correlate events
- Apply fix, verify recovery
- Write postmortem
Average MTTR: 30-90 minutes. Engineer happiness: low.
The AI-native approach:
- Alert fires
- AI agent automatically diagnoses root cause
- AI applies pre-approved remediation
- Engineer gets notified of resolved incident with full context
- Postmortem is auto-generated
Average MTTR: 2-5 minutes. Engineer sleep: uninterrupted.
Architecture Overview
Here's the architecture I've built:
CloudWatch Alarm
|
v
EventBridge Rule
|
v
Lambda: AI Orchestrator
|
+---> CloudWatch Logs Insights (gather context)
+---> Claude API (diagnose root cause)
+---> RAG Pipeline (search runbooks)
+---> SSM Automation (execute remediation)
+---> Slack/PagerDuty (notify team)
+---> DynamoDB (log incident record)
Step 1: Intelligent Alert Enrichment
When a CloudWatch alarm triggers, the first Lambda function gathers context:
def enrich_alert(alarm_event):
# Get the metric that triggered the alarm
metric = alarm_event['detail']['configuration']['metrics'][0]
# Query CloudWatch Logs for the affected service
logs = query_cloudwatch_logs(
log_group=f"/ecs/{service_name}",
query="fields @timestamp, @message | filter @message like /ERROR|WARN|Exception/",
time_range="15m"
)
# Get recent deployments from CodeDeploy
deployments = get_recent_deployments(service_name, hours=2)
# Get current resource utilization
metrics = get_service_metrics(service_name, period="5m")
return {
"alarm": alarm_event,
"recent_logs": logs,
"recent_deployments": deployments,
"current_metrics": metrics
}
This contextual data is what makes AI diagnosis possible. Instead of just saying "CPU is high," we give the AI agent the full picture.
Step 2: AI-Powered Root Cause Analysis
The enriched context is sent to Claude for diagnosis:
def diagnose_with_claude(enriched_context):
prompt = f"""You are an expert SRE analyzing an infrastructure incident.
ALARM: {enriched_context['alarm']['detail']['alarmName']}
STATE: {enriched_context['alarm']['detail']['state']['value']}
RECENT ERROR LOGS:
{enriched_context['recent_logs']}
RECENT DEPLOYMENTS:
{enriched_context['recent_deployments']}
CURRENT METRICS:
CPU: {enriched_context['current_metrics']['cpu']}%
Memory: {enriched_context['current_metrics']['memory']}%
Request Rate: {enriched_context['current_metrics']['request_rate']}/s
Error Rate: {enriched_context['current_metrics']['error_rate']}%
Analyze this incident and provide:
1. Most likely root cause (with confidence level)
2. Supporting evidence from the logs and metrics
3. Recommended remediation steps
4. Risk level of automated remediation (low/medium/high)
5. Whether this correlates with any recent deployment"""
response = claude_client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2000,
messages=[{"role": "user", "content": prompt}]
)
return parse_diagnosis(response.content[0].text)
Claude returns a structured diagnosis that typically identifies the root cause with remarkable accuracy -- especially for common patterns like memory leaks, connection pool exhaustion, deployment regressions, and traffic spikes.
Step 3: RAG-Powered Runbook Search
For additional context, the agent queries a RAG pipeline built on team runbooks:
def search_runbooks(diagnosis):
# Generate embedding for the diagnosis
query = f"remediation for {diagnosis['root_cause']}"
# Query Pinecone/OpenSearch for relevant runbook sections
results = vector_db.query(
vector=embed(query),
top_k=3,
include_metadata=True
)
return [r.metadata['content'] for r in results.matches]
This ensures the AI agent has access to team-specific knowledge -- not just generic AWS documentation, but your actual runbooks, past postmortems, and operational procedures.
Step 4: Automated Remediation
Based on the diagnosis and risk level, the agent executes remediation:
def auto_remediate(diagnosis, runbook_context):
# Only auto-remediate low-risk actions
if diagnosis['risk_level'] == 'low':
action = diagnosis['remediation_steps'][0]
if action['type'] == 'scale_up':
execute_ssm_automation('ScaleECSService', {
'cluster': action['cluster'],
'service': action['service'],
'desired_count': action['target_count']
})
elif action['type'] == 'restart_service':
execute_ssm_automation('ForceNewDeployment', {
'cluster': action['cluster'],
'service': action['service']
})
elif action['type'] == 'rollback_deployment':
execute_ssm_automation('RollbackCodeDeploy', {
'deployment_group': action['deployment_group']
})
return {"status": "remediated", "action": action}
else:
# For medium/high risk, notify and await approval
send_approval_request(diagnosis)
return {"status": "awaiting_approval", "diagnosis": diagnosis}
Key principle: low-risk remediations happen automatically; high-risk ones require human approval. The AI determines risk level based on the blast radius, time of day, and confidence in the diagnosis.
Step 5: Notification and Documentation
After remediation, the agent sends a rich notification:
The Slack message includes the alarm name, root cause analysis, confidence level, action taken, current status, deployment correlation, and a link to the full incident record -- all generated by AI.
The agent also auto-generates a postmortem draft in Confluence/Notion with timeline, root cause analysis, remediation steps, and prevention recommendations.
Building This with Claude Code
Here's the meta-level: I used Claude Code to build this entire system.
The prompts I used:
1. "Generate a Terraform module for the Lambda functions, EventBridge rules,
IAM roles, and DynamoDB table needed for an AI-powered incident response
pipeline on AWS."
2. "Write a Python Lambda handler that receives CloudWatch alarm events via
EventBridge, enriches them with logs and metrics context, sends to Claude
API for diagnosis, and executes SSM automations for remediation."
3. "Create the SSM Automation documents for: scaling ECS services, forcing
new deployments, rolling back CodeDeploy deployments, and isolating EC2
instances by swapping security groups."
4. "Generate the Slack notification Lambda that formats Claude's diagnosis
into a rich Block Kit message with action buttons for approval."
Claude Code accelerated the implementation significantly -- generating boilerplate, suggesting architecture patterns, and drafting the Lambda handlers. I reviewed, tested, and refined everything to production standards.
Expected Outcomes
Based on this architecture pattern, here's what teams can realistically expect:
- Significant MTTR reduction for common, well-understood incident types that match auto-remediation rules
- Automated handling of routine incidents like scaling events, service restarts, and deployment rollbacks
- Reduced on-call fatigue as low-risk incidents resolve without paging engineers
- Better postmortem quality because the AI agent captures context in real-time rather than relying on memory
- Gradual improvement as the system learns from your specific infrastructure patterns over time
Note: Actual results vary significantly depending on infrastructure complexity, incident types, and the quality of your runbook documentation.
Getting Started
To build your own self-healing infrastructure:
- Start with alert enrichment: Add context to your CloudWatch alarms before involving AI
- Build a runbook RAG pipeline: Index your team's documentation for AI-powered search
- Implement Claude diagnosis: Start with read-only diagnosis before enabling auto-remediation
- Gate remediations by risk: Only automate low-risk, well-understood actions initially
- Measure and iterate: Track MTTR, auto-remediation rate, and false positive rate
Conclusion
Self-healing infrastructure isn't about replacing SREs -- it's about giving them superpowers. When AI handles the routine incidents, engineers can focus on architecture, reliability improvements, and the complex problems that truly need human judgment.
The combination of Claude Code for building the system and Claude API for running it creates a powerful feedback loop: AI helps you build the automation that AI then operates. Welcome to the future of cloud operations.
Ready to build self-healing infrastructure? Let's discuss your architecture.