February 5, 2025 · AI + Cloud · 6 min read

Building Self-Healing Infrastructure: Claude Code + CloudWatch + Lambda

Claude Code · AWS · AIOps · Self-Healing · Lambda

What if your infrastructure could detect problems, diagnose root causes, and fix itself -- all before an engineer even gets paged? This isn't science fiction. It's what I've been building using Claude Code, AWS CloudWatch, Lambda, and AI agents.

In this post, I'll share the architecture and implementation patterns for building self-healing infrastructure that uses AI to move from reactive incident response to proactive, autonomous remediation.

The Problem with Traditional Monitoring

Traditional monitoring follows a predictable (and exhausting) pattern:

  1. Alert fires at 3am
  2. On-call engineer wakes up, opens laptop
  3. Checks dashboard, reads alert details
  4. Searches runbooks (if they exist and are up to date)
  5. SSH into servers, check logs, correlate events
  6. Apply fix, verify recovery
  7. Write postmortem

Average MTTR: 30-90 minutes. Engineer happiness: low.

The AI-native approach:

  1. Alert fires
  2. AI agent automatically diagnoses root cause
  3. AI applies pre-approved remediation
  4. Engineer gets notified of resolved incident with full context
  5. Postmortem is auto-generated

Average MTTR: 2-5 minutes. Engineer sleep: uninterrupted.

Architecture Overview

Here's the architecture I've built:

CloudWatch Alarm
    |
    v
EventBridge Rule
    |
    v
Lambda: AI Orchestrator
    |
    +---> CloudWatch Logs Insights (gather context)
    +---> Claude API (diagnose root cause)
    +---> RAG Pipeline (search runbooks)
    +---> SSM Automation (execute remediation)
    +---> Slack/PagerDuty (notify team)
    +---> DynamoDB (log incident record)
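
As a concrete sketch of the event wiring, here's roughly how the alarm-to-Lambda hookup looks in boto3. The rule name and Lambda ARN are placeholders, and you'd also need to grant events.amazonaws.com permission to invoke the function:

import json
import boto3

events = boto3.client("events")

# Route CloudWatch alarm state changes into the orchestrator Lambda
events.put_rule(
    Name="ai-incident-orchestrator",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"state": {"value": ["ALARM"]}}
    })
)

events.put_targets(
    Rule="ai-incident-orchestrator",
    Targets=[{
        "Id": "orchestrator-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:ai-orchestrator"
    }]
)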

Step 1: Intelligent Alert Enrichment

When a CloudWatch alarm triggers, the first Lambda function gathers context:

def enrich_alert(alarm_event):
    # Get the metric that triggered the alarm and derive the affected
    # service from its dimensions (EventBridge alarm state change events
    # carry dimensions as a name -> value map)
    metric = alarm_event['detail']['configuration']['metrics'][0]
    service_name = metric['metricStat']['metric']['dimensions']['ServiceName']

    # Query CloudWatch Logs for the affected service
    logs = query_cloudwatch_logs(
        log_group=f"/ecs/{service_name}",
        query="fields @timestamp, @message | filter @message like /ERROR|WARN|Exception/",
        time_range="15m"
    )

    # Get recent deployments from CodeDeploy
    deployments = get_recent_deployments(service_name, hours=2)

    # Get current resource utilization
    metrics = get_service_metrics(service_name, period="5m")

    return {
        "alarm": alarm_event,
        "recent_logs": logs,
        "recent_deployments": deployments,
        "current_metrics": metrics
    }

This contextual data is what makes AI diagnosis possible. Instead of just saying "CPU is high," we give the AI agent the full picture.
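
For reference, here's a minimal sketch of the query_cloudwatch_logs helper using the Logs Insights API. The time_range parsing assumes a simple "<N>m" minutes string, matching the caller above:

import time
import boto3

logs_client = boto3.client("logs")

def query_cloudwatch_logs(log_group, query, time_range="15m"):
    # Logs Insights queries are asynchronous: start one, then poll
    minutes = int(time_range.rstrip("m"))
    end = int(time.time())
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=query
    )["queryId"]

    while True:
        result = logs_client.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(1)

    # Flatten each result row into a {field: value} dict
    return [{f["field"]: f["value"] for f in row}
            for row in result.get("results", [])]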

Step 2: AI-Powered Root Cause Analysis

The enriched context is sent to Claude for diagnosis:

# Assumes claude_client = anthropic.Anthropic() is initialized at module scope
def diagnose_with_claude(enriched_context):
    prompt = f"""You are an expert SRE analyzing an infrastructure incident.

ALARM: {enriched_context['alarm']['detail']['alarmName']}
STATE: {enriched_context['alarm']['detail']['state']['value']}

RECENT ERROR LOGS:
{enriched_context['recent_logs']}

RECENT DEPLOYMENTS:
{enriched_context['recent_deployments']}

CURRENT METRICS:
CPU: {enriched_context['current_metrics']['cpu']}%
Memory: {enriched_context['current_metrics']['memory']}%
Request Rate: {enriched_context['current_metrics']['request_rate']}/s
Error Rate: {enriched_context['current_metrics']['error_rate']}%

Analyze this incident and provide:
1. Most likely root cause (with confidence level)
2. Supporting evidence from the logs and metrics
3. Recommended remediation steps
4. Risk level of automated remediation (low/medium/high)
5. Whether this correlates with any recent deployment"""

    response = claude_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}]
    )

    return parse_diagnosis(response.content[0].text)

Claude returns a structured diagnosis that typically identifies the root cause with remarkable accuracy -- especially for common patterns like memory leaks, connection pool exhaustion, deployment regressions, and traffic spikes.
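
The parse_diagnosis helper is easiest to make reliable if the prompt is amended to ask for a single JSON object; a minimal sketch under that assumption:

import json

def parse_diagnosis(response_text):
    # Assumes the prompt asks Claude to answer with one JSON object
    # containing: root_cause, confidence, evidence, remediation_steps,
    # risk_level, deployment_correlation
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in diagnosis response")
    diagnosis = json.loads(response_text[start:end + 1])

    # Fail safe: anything malformed gets routed to human review
    if diagnosis.get("risk_level") not in ("low", "medium", "high"):
        diagnosis["risk_level"] = "high"
    return diagnosis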

Step 3: Runbook Retrieval with RAG

For additional context, the agent queries a RAG pipeline built on team runbooks:

def search_runbooks(diagnosis):
    # Generate embedding for the diagnosis
    query = f"remediation for {diagnosis['root_cause']}"

    # Query Pinecone/OpenSearch for relevant runbook sections
    results = vector_db.query(
        vector=embed(query),
        top_k=3,
        include_metadata=True
    )

    return [r.metadata['content'] for r in results.matches]

This ensures the AI agent has access to team-specific knowledge -- not just generic AWS documentation, but your actual runbooks, past postmortems, and operational procedures.
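
The embed helper can sit on any embeddings API. Here's a sketch using Amazon Bedrock's Titan embeddings model -- the model ID is an assumption, so swap in whichever embeddings provider your team uses:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text):
    # Generate an embedding vector for the runbook query via Bedrock Titan
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text})
    )
    return json.loads(response["body"].read())["embedding"]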

Step 4: Automated Remediation

Based on the diagnosis and risk level, the agent executes remediation:

def auto_remediate(diagnosis, runbook_context):
    # Only auto-remediate low-risk actions
    if diagnosis['risk_level'] == 'low':
        action = diagnosis['remediation_steps'][0]

        if action['type'] == 'scale_up':
            execute_ssm_automation('ScaleECSService', {
                'cluster': action['cluster'],
                'service': action['service'],
                'desired_count': action['target_count']
            })

        elif action['type'] == 'restart_service':
            execute_ssm_automation('ForceNewDeployment', {
                'cluster': action['cluster'],
                'service': action['service']
            })

        elif action['type'] == 'rollback_deployment':
            execute_ssm_automation('RollbackCodeDeploy', {
                'deployment_group': action['deployment_group']
            })

        return {"status": "remediated", "action": action}

    else:
        # For medium/high risk, notify and await human approval, attaching
        # the matched runbook sections for reviewer context
        send_approval_request(diagnosis, runbook_context)
        return {"status": "awaiting_approval", "diagnosis": diagnosis}

Key principle: low-risk remediations happen automatically; high-risk ones require human approval. The AI determines risk level based on the blast radius, time of day, and confidence in the diagnosis.
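
The execute_ssm_automation wrapper is a thin layer over the SSM API; a minimal sketch (note that SSM requires every parameter value to be a list of strings, hence the conversion):

import boto3

ssm = boto3.client("ssm")

def execute_ssm_automation(document_name, parameters):
    # Kick off an SSM Automation document with the given parameters
    response = ssm.start_automation_execution(
        DocumentName=document_name,
        Parameters={k: [str(v)] for k, v in parameters.items()}
    )
    return response["AutomationExecutionId"]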

Step 5: Notification and Documentation

After remediation, the agent sends a rich notification.

The Slack message includes the alarm name, root cause analysis, confidence level, action taken, current status, deployment correlation, and a link to the full incident record -- all generated by AI.
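
A sketch of what building that payload might look like -- the webhook URL and field names are placeholders, and a real Block Kit message would add action buttons for the approval flow:

import json
import urllib.request

def notify_slack(webhook_url, diagnosis, action):
    # Post a Block Kit summary of the resolved incident to Slack
    payload = {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": "Incident auto-resolved"}},
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": (f"*Root cause:* {diagnosis['root_cause']}\n"
                               f"*Confidence:* {diagnosis['confidence']}\n"
                               f"*Action taken:* {action['type']}")}}
        ]
    }
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)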

The agent also auto-generates a postmortem draft in Confluence/Notion with timeline, root cause analysis, remediation steps, and prevention recommendations.
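
Postmortem drafting can reuse the Claude client from Step 2. A sketch of the idea, where incident_record is whatever shape you logged to DynamoDB:

import json

def draft_postmortem(incident_record):
    # Ask Claude to draft the postmortem from the captured incident record
    response = claude_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": ("Draft an incident postmortem with sections for "
                        "Timeline, Root Cause, Remediation, and Prevention, "
                        "based on this incident record:\n"
                        + json.dumps(incident_record, default=str))
        }]
    )
    return response.content[0].text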

Building This with Claude Code

Here's the meta-level: I used Claude Code to build this entire system.

The prompts I used:

1. "Generate a Terraform module for the Lambda functions, EventBridge rules,
    IAM roles, and DynamoDB table needed for an AI-powered incident response
    pipeline on AWS."

2. "Write a Python Lambda handler that receives CloudWatch alarm events via
    EventBridge, enriches them with logs and metrics context, sends to Claude
    API for diagnosis, and executes SSM automations for remediation."

3. "Create the SSM Automation documents for: scaling ECS services, forcing
    new deployments, rolling back CodeDeploy deployments, and isolating EC2
    instances by swapping security groups."

4. "Generate the Slack notification Lambda that formats Claude's diagnosis
    into a rich Block Kit message with action buttons for approval."

Claude Code accelerated the implementation significantly -- generating boilerplate, suggesting architecture patterns, and drafting the Lambda handlers. I reviewed, tested, and refined everything to production standards.

Expected Outcomes

Based on this architecture pattern, here's what teams can realistically expect:

  • Significant MTTR reduction for common, well-understood incident types that match auto-remediation rules
  • Automated handling of routine incidents like scaling events, service restarts, and deployment rollbacks
  • Reduced on-call fatigue as low-risk incidents resolve without paging engineers
  • Better postmortem quality because the AI agent captures context in real-time rather than relying on memory
  • Gradual improvement as the system learns from your specific infrastructure patterns over time

Note: Actual results vary significantly depending on infrastructure complexity, incident types, and the quality of your runbook documentation.

Getting Started

To build your own self-healing infrastructure:

  1. Start with alert enrichment: Add context to your CloudWatch alarms before involving AI
  2. Build a runbook RAG pipeline: Index your team's documentation for AI-powered search
  3. Implement Claude diagnosis: Start with read-only diagnosis before enabling auto-remediation
  4. Gate remediations by risk: Only automate low-risk, well-understood actions initially
  5. Measure and iterate: Track MTTR, auto-remediation rate, and false positive rate
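
For the measurement piece, the DynamoDB incident record from the architecture diagram provides the raw data. A sketch of what each record might capture -- the table name and attributes are illustrative:

import time
import uuid
import boto3

incidents = boto3.resource("dynamodb").Table("incident-records")

def log_incident(alarm_name, diagnosis, outcome, opened_at):
    # Persist one incident record; MTTR, auto-remediation rate, and false
    # positive rate can all be aggregated from these later
    incidents.put_item(Item={
        "incident_id": str(uuid.uuid4()),
        "alarm_name": alarm_name,
        "root_cause": diagnosis["root_cause"],
        "risk_level": diagnosis["risk_level"],
        "outcome": outcome,  # e.g. "remediated" or "awaiting_approval"
        "mttr_seconds": int(time.time() - opened_at)
    })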

Conclusion

Self-healing infrastructure isn't about replacing SREs -- it's about giving them superpowers. When AI handles the routine incidents, engineers can focus on architecture, reliability improvements, and the complex problems that truly need human judgment.

The combination of Claude Code for building the system and Claude API for running it creates a powerful feedback loop: AI helps you build the automation that AI then operates. Welcome to the future of cloud operations.

Ready to build self-healing infrastructure? Let's discuss your architecture.