You've probably tried using ChatGPT or Claude to write Terraform. Maybe you pasted an error message and got back confident-looking code that made things worse. Or you asked it to "create an AWS VPC" and got a configuration that would work perfectly in a tutorial but fail spectacularly in production.
I've been using AI coding assistants for infrastructure work every day for the past 18 months. I've seen what works, what fails catastrophically, and what separates productive AI-assisted infrastructure engineering from expensive disasters.
This is the guide I wish existed when I started. No hype. No "AI will replace DevOps" nonsense. Just the practical patterns that let you ship infrastructure faster without breaking things.
The Problem: Why Most DevOps Engineers Give Up on AI Assistants
Three things happen when engineers first try AI for infrastructure code:
1. The code looks perfect but has invisible landmines
You ask Claude to write an EKS cluster configuration. It returns beautiful Terraform with detailed comments. You apply it. The cluster creates successfully. Three weeks later your pods can't talk to each other because the networking configuration assumed a default VPC that doesn't exist in production.
2. The AI confidently suggests deprecated patterns
ChatGPT recommends using AWS Classic Load Balancers when ALBs have been the standard for years. Or suggests Terraform syntax from version 0.11 when you're on 1.7. The code technically works but sets you back months in maintainability.
3. Zero understanding of your environment
You ask for a Kubernetes deployment manifest. The AI has no idea you're running on GKE with Workload Identity, need specific labels for your service mesh, or have naming conventions for every resource. The output is generic and useless.
Most engineers try it once, get burned, and go back to doing everything manually.
The problem isn't that AI assistants are bad at infrastructure code. The problem is that infrastructure code has invisible requirements that documentation-trained models can't infer from a vague prompt.
What Works: The Three-Layer Strategy
After 18 months of daily use across AWS, Azure, Terraform, Kubernetes, CloudFormation, and CDK, here's what actually works:
Layer 1: Context Transfer (The 80% Solution)
Before asking AI for anything infrastructure-related, you need to transfer three pieces of context:
Your existing patterns
Instead of:
"Write Terraform for an S3 bucket"
Do this:
"Write Terraform for an S3 bucket following the pattern in this existing module:
[paste your team's existing S3 module]
Match the same:
- Naming convention (project-environment-purpose)
- Tagging structure (Owner, Environment, CostCenter, ManagedBy)
- Encryption defaults (aws:kms with our standard key)
- Lifecycle rules (transition to IA after 30 days)
- Versioning and replication config
"
Your constraints
AI models default to "best practices" from the internet. Your production environment has constraints that trump best practices:
"Write a Kubernetes deployment manifest for our API service.
Constraints:
- Must use our base image (ecr.io/company/base-api:v2.3)
- Resource limits: 500m CPU, 1Gi memory (hard limit, finance approval required to exceed)
- Must have readiness and liveness probes (SRE requirement)
- Service mesh injection: enabled (label: istio-injection=enabled)
- Must use our naming convention: {team}-{service}-{environment}
- Anti-affinity rules: spread across 3 AZs
- Secret management: External Secrets Operator pulling from AWS Secrets Manager
"
Your current state
If you're modifying existing infrastructure, the AI needs to know what's already there:
"I need to add an egress rule to this security group:
[paste current terraform state or aws ec2 describe-security-groups output]
Add egress to 10.50.0.0/16 on port 5432 for PostgreSQL.
Important:
- Don't modify existing rules
- Follow the description pattern: 'Allow {protocol} to {destination} for {purpose}'
- Keep the rules in alphabetical order by description
"
This takes 60 seconds of setup but prevents hours of rework.
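If you do this setup several times a day, the three context pieces can be assembled mechanically. A minimal Python sketch (the helper name and section headers are my own illustration, not a standard API):

```python
def build_context_prompt(task, patterns="", constraints="", current_state=""):
    """Assemble an infrastructure prompt from the three context pieces.

    Empty pieces are omitted so the prompt stays as short as possible.
    """
    sections = [task]
    if patterns:
        sections.append("Existing patterns to match:\n" + patterns)
    if constraints:
        sections.append("Constraints:\n" + constraints)
    if current_state:
        sections.append("Current state (do not modify unrelated resources):\n"
                        + current_state)
    return "\n\n".join(sections)

prompt = build_context_prompt(
    "Write Terraform for an S3 bucket.",
    patterns="Naming: project-environment-purpose\nTags: Owner, Environment, CostCenter",
    constraints="Encryption: aws:kms with our standard key",
)
print(prompt)
```

Pipe in your module file and plan output instead of typing them; the structure is what matters, not the helper.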
Layer 2: Validation Before Apply (The Safety Net)
Never run AI-generated infrastructure code without these three checks:
1. Plan Review (Human + AI)
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
Then ask the AI to review its own output:
"Review this Terraform plan for issues:
[paste plan.json]
Check for:
- Unintended resource deletions or replacements
- Changes to production databases or stateful resources
- Security group rules that are too permissive
- Missing required tags
- Resources without explicit dependencies that might cause race conditions
"
I've caught 3 production-breaking changes in the past month this way. The AI is surprisingly good at spotting its own mistakes when shown the plan.
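Part of that review can be mechanized before the AI ever sees the plan. Terraform's JSON plan lists each resource change with an `actions` array (`["delete"]` for a destroy, `["delete", "create"]` for a replacement), so a short script can flag the dangerous ones. A hedged Python sketch:

```python
import json

def flag_risky_changes(plan_json):
    """Return (address, kind) pairs for planned destroys and replacements.

    plan_json is the dict parsed from `terraform show -json plan.tfplan`.
    """
    risky = []
    for change in plan_json.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            kind = "replace" if "create" in actions else "destroy"
            risky.append((change["address"], kind))
    return risky

# Tiny stand-in shaped like Terraform's real JSON plan output.
sample = {
    "resource_changes": [
        {"address": "aws_db_instance.main",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["update"]}},
    ]
}
print(flag_risky_changes(sample))  # [('aws_db_instance.main', 'replace')]
```

Run it in CI and fail the pipeline if anything stateful shows up; save the AI review for the judgment calls the script can't make.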
2. Compliance Scanning
Use tools like Checkov, tfsec, or Terrascan on AI-generated code:
checkov -f main.tf --framework terraform
Most AI models don't know your organization's compliance requirements. Automated scanning catches:
- Unencrypted storage
- Public S3 buckets
- Security groups with 0.0.0.0/0 ingress
- Missing backup configurations
- Overly permissive IAM policies
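To make the open-ingress check concrete, here is a hedged Python sketch of the kind of rule these scanners encode (real tools like Checkov do this far more thoroughly; the input shape mirrors `aws ec2 describe-security-groups` output):

```python
def open_ingress_rules(security_group):
    """Return the ports of ingress rules open to the whole internet."""
    flagged = []
    for rule in security_group.get("IpPermissions", []):
        ranges = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        ranges += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
        # 0.0.0.0/0 (IPv4) or ::/0 (IPv6) means "from anywhere"
        if "0.0.0.0/0" in ranges or "::/0" in ranges:
            flagged.append(rule.get("FromPort"))
    return flagged

sg = {"IpPermissions": [
    {"FromPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []},
    {"FromPort": 5432, "IpRanges": [{"CidrIp": "10.50.0.0/16"}], "Ipv6Ranges": []},
]}
print(open_ingress_rules(sg))  # [443] -- only the world-open rule is flagged
```

Don't reimplement scanners yourself; the point is that the rules are simple enough to automate, so there's no excuse to skip them.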
3. Targeted Questions
After getting code from an AI, ask follow-up questions about edge cases:
"This Terraform creates an RDS instance. What happens if:
1. The instance fails and needs to be replaced?
2. We need to restore from a snapshot?
3. We want to upgrade the engine version?
4. We need to change the instance class?
Will we lose data? Are there manual steps?"
This exposes assumptions the AI made that don't match your requirements.
Layer 3: Incremental Rollout (The Undo Button)
AI-generated infrastructure changes should follow the same rollout discipline as application code:
Development → Staging → Production
Apply AI-generated Terraform to dev first. Let it run for 24 hours. Check CloudWatch metrics, logs, costs. Only then promote to staging.
Feature Flags for Infrastructure
Use Terraform workspaces or separate state files:
resource "aws_instance" "app" {
  instance_type = var.use_ai_generated_config ? "t3.large" : "t3.medium"
  # AI suggested t3.large for better performance
}
You can toggle back instantly if something breaks.
Version Control Everything
Every AI-generated change goes through a PR with:
- Before/after plan output
- Explanation of what changed and why
- Screenshot of AI conversation (for context)
- Checkov scan results
This creates an audit trail and lets your team review before merging.
Prompt Patterns That Actually Work
Here are the prompt structures I use daily:
Pattern 1: The "Existing Module Extension"
I have this Terraform module for creating ECS services:
[paste module]
I need to extend it to support:
- Blue/green deployments
- Circuit breaker configuration
- CloudWatch Container Insights
Extend the module maintaining:
- Same variable naming convention (use_underscores)
- Same output structure (prefix with module name)
- Backward compatibility (new features off by default)
Show me:
1. Updated variables.tf
2. Updated main.tf with new resources
3. Updated outputs.tf
4. Example usage in a root module
Pattern 2: The "Debug This Error"
I'm getting this Terraform error:
[paste full error output]
Here's the relevant configuration:
[paste terraform code]
Here's the current state:
[paste terraform state show output for affected resource]
What's wrong and how do I fix it without destroying existing resources?
Pattern 3: The "Compliance Review"
Review this Terraform configuration for security and compliance issues:
[paste code]
Our requirements:
- All storage must be encrypted at rest
- No public internet access except through approved load balancers
- All resources must have Owner, Environment, and CostCenter tags
- IAM policies follow least privilege
- Secrets must use AWS Secrets Manager, never hardcoded
Flag anything that violates these rules and suggest fixes.
Pattern 4: The "Migration Plan"
I need to migrate this infrastructure from [old pattern] to [new pattern]:
Current:
[paste current terraform]
Target:
[describe desired end state]
Constraints:
- Zero downtime required
- Database cannot be recreated (contains production data)
- Must maintain existing DNS records
- Budget: can't exceed $500/month increase
Give me a step-by-step migration plan with:
1. What to create first
2. What to migrate
3. What to destroy last
4. Rollback plan if something fails
Real Examples: What I Built This Week
Example 1: EKS Cluster with AI Assistance
What I needed: Production-ready EKS cluster with VPC, IAM roles, node groups, and add-ons.
What I did:
- Context dump: Pasted our existing VPC module, naming conventions, and tagging requirements
- Specific ask: "Create EKS cluster Terraform following these patterns"
- Iteration: Asked AI to add Karpenter, AWS Load Balancer Controller, and metrics-server
- Validation: Ran checkov, caught missing encryption on EBS volumes, asked AI to fix
- Testing: Applied to dev, tested pod networking, service exposure, autoscaling
- Result: Deployed to production in 3 days instead of 2 weeks
Time saved: 8-10 days
Confidence level: High (because I validated every layer)
Example 2: Kubernetes YAML Debugging
What I needed: Fix a CrashLoopBackOff on a deployment that worked locally but failed in production.
What I did:
Prompt: "This deployment is CrashLoopBackOff in production but works locally:
[paste deployment YAML]
Pod logs show:
[paste kubectl logs output]
Environment differences:
- Production uses Workload Identity for AWS access
- Production has Istio sidecar injection
- Production has strict PodSecurityPolicy
What's likely wrong?"
AI Response: "The container is trying to access AWS Secrets Manager but the ServiceAccount doesn't have the correct annotation for Workload Identity..."
Fixed in 5 minutes. Would have taken 30-45 minutes of trial and error.
Example 3: Terraform State Surgery
What I needed: A resource was manually modified in AWS console, now Terraform wants to destroy and recreate it.
Prompt:
"This RDS instance was manually modified. Terraform plan shows:
[paste plan output showing replacement]
I need to update the Terraform state to match current AWS state without destroying the database. Walk me through the process."
AI gave me:
1. Pull current state: terraform state pull > backup.tfstate
2. Get actual AWS values: aws rds describe-db-instances ...
3. Remove the stale entry from state: terraform state rm aws_db_instance.main
4. Re-import the live resource: terraform import ...
5. Update the code to match the imported attributes
Saved the database from accidental destruction.
Common Mistakes (And How to Avoid Them)
Mistake 1: Trusting Default Suggestions
Problem: AI suggests internet-facing load balancer when you wanted internal.
Fix: Always specify "internal" or "internet-facing" explicitly in your prompt. Don't assume the AI will infer from context.
Mistake 2: Not Checking for Breaking Changes
Problem: AI refactors your Terraform and accidentally changes resource names, causing Terraform to want to destroy and recreate.
Fix: Before accepting any refactoring, run terraform plan and look for -/+ symbols (replace). Ask AI: "Will this force resource replacement?"
Mistake 3: Applying Directly to Production
Problem: You're in a rush, AI code looks good, you apply straight to prod. It breaks.
Fix: No exceptions. Always apply to dev first. Set up CI/CD that enforces this.
Mistake 4: Not Version Controlling the Conversation
Problem: Three months later you need to know why a particular configuration was chosen. You have no record.
Fix: Save the AI conversation as docs/decisions/YYYY-MM-DD-why-we-chose-X.md in your repo.
Mistake 5: Treating AI Like a Magic Button
Problem: You ask AI to "set up full AWS infrastructure for my app" and expect it to work perfectly.
Fix: Break large requests into small, testable pieces. Build incrementally. Validate each layer.
Tool Recommendations
For Writing Terraform
Best: Claude Opus 4.6 (better at understanding complex state and dependencies)
Alternative: GPT-4.1 Turbo (faster for simple modules)
Free option: Claude Sonnet 4.5 (surprisingly good)
For Kubernetes Manifests
Best: Claude Sonnet 4.5 (handles YAML indentation correctly)
Alternative: Gemini 2.5 Pro (good at Helm charts)
For Debugging
Best: Claude Opus 4.6 (best at reasoning through complex error chains)
For Learning/Explaining
Best: GPT-4.1 (clearest explanations)
Use my Token Counter to estimate costs before sending large Terraform files to expensive models. Claude Sonnet often works for 80% of tasks at 1/10th the cost of Opus.
Advanced: Teaching AI Your Environment
If you use AI regularly for infrastructure work, create a "context file" you paste at the start of every conversation:
# Our Infrastructure Context
## Naming Convention
{project}-{environment}-{resource}-{purpose}
Example: payments-prod-rds-primary
## Tagging Requirements (all resources)
- Owner: team-name
- Environment: dev | staging | prod
- CostCenter: 4-digit code
- ManagedBy: terraform | cloudformation | manual
## AWS Defaults
- Region: us-west-2 (primary), us-east-1 (DR)
- VPC CIDR: 10.100.0.0/16 (prod), 10.101.0.0/16 (staging)
- Encryption: Always enabled, use aws/kms or custom KMS keys
- Versioning: Enabled on all S3 buckets
- Lifecycle: S3 IA after 30 days, Glacier after 90 days
## Kubernetes Standards
- Base images: only from company ECR
- Resource limits: required on all deployments
- Probes: both liveness and readiness required
- Service mesh: Istio with automatic sidecar injection
- Secrets: External Secrets Operator + AWS Secrets Manager
## Security Requirements
- No 0.0.0.0/0 ingress except load balancers
- IAM: least privilege, no *:* permissions
- Secrets: never in code, always in Secrets Manager
- Public access: must be approved by security team
## Terraform Standards
- Version: >= 1.5
- State: S3 backend with DynamoDB locking
- Modules: internal registry only
- Variables: always have descriptions and validation
- Outputs: prefix with module name
Save this file, paste it at the start of infrastructure conversations. The AI will follow your patterns instead of generic best practices.
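A context file like this can double as an automated check on whatever the AI produces. A hedged Python sketch that validates names and tags against the conventions above (the regex and tag set mirror this example file, not any standard):

```python
import re

# {project}-{environment}-{resource}-{purpose}, per the context file above
NAME_PATTERN = re.compile(r"^[a-z0-9]+-(dev|staging|prod)-[a-z0-9]+-[a-z0-9]+$")
REQUIRED_TAGS = {"Owner", "Environment", "CostCenter", "ManagedBy"}

def check_resource(name, tags):
    """Return a list of convention violations for one resource."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(
            f"name '{name}' does not match "
            "{project}-{environment}-{resource}-{purpose}")
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        problems.append("missing tags: " + ", ".join(sorted(missing)))
    return problems

print(check_resource("payments-prod-rds-primary",
                     {"Owner": "payments", "Environment": "prod",
                      "CostCenter": "1234", "ManagedBy": "terraform"}))  # []
```

Feed it the names and tags from your plan JSON and you catch convention drift before review, whether a human or an AI wrote the code.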
The Future: Where This Is Going
Three things I'm seeing emerge:
1. AI-Powered Policy as Code
Instead of manually writing Sentinel or OPA policies, describe your requirements in English and let AI generate the policy code. I'm doing this now for AWS Service Control Policies.
2. Autonomous Infrastructure Refactoring
AI that can analyze your entire Terraform codebase, suggest improvements, and generate PRs. We're close to this. Claude Code is already capable of refactoring entire modules.
3. Self-Healing Infrastructure
AI agents that detect drift, generate the fix, test in a sandbox, and auto-apply if safe. This is what I'm building now with Claude + CloudWatch + Lambda.
The gap between "AI wrote it" and "production-ready" is shrinking fast.
Checklist: Before You Use AI for Infrastructure
- Have you provided your existing patterns and conventions?
- Have you specified all constraints explicitly?
- Will you run terraform plan before applying?
- Will you run compliance scanning (Checkov/tfsec)?
- Are you applying to dev first, not production?
- Have you tested rollback?
- Have you documented why you made this change?
- Will this change be reviewed by another human?
If you answered no to any of these, stop. Fix that first.
What to Do Next
If you're new to using AI for infrastructure:
- Start small: Use AI to explain existing Terraform you inherited
- Build templates: Ask AI to generate module templates following your conventions
- Debug faster: Paste errors and let AI suggest fixes
- Learn patterns: Ask AI to critique your code and suggest improvements
If you're already using AI:
- Create your context file (see Advanced section above)
- Set up validation pipelines (Checkov + automated plan review)
- Document your prompts (what works, what doesn't)
- Train your team (share the patterns that work)
The goal isn't to let AI write all your infrastructure code unsupervised. The goal is to move faster while maintaining the same level of quality and safety.
Tools I Built for This
I got tired of switching between tools, so I built a free suite that helps with AI-assisted infrastructure work:
- Token Counter: Estimate costs before pasting large Terraform files into expensive models
- AI Output Parser: Extract clean JSON or YAML from AI responses buried in markdown
- YAML Validator: Validate Kubernetes manifests AI generates before applying
- Diff Checker: Compare AI-suggested changes side-by-side with your existing code
- Prompt Eval Suite: Score your infrastructure prompts against 12 best practices
- JSON to TypeScript: Convert AWS API responses to TypeScript for CDK projects
All free, no signup, run entirely in your browser.
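The output-parsing problem in particular is small enough to sketch: AI replies usually bury the useful JSON or YAML inside a markdown fence. A minimal Python extractor (illustrative only, not the tool above; backticks are spelled \x60 so the snippet nests cleanly in this article):

```python
import re

# Matches ``` fences with an optional json/yaml language tag.
FENCE = re.compile("\x60{3}(?:json|yaml|yml)?[ \t]*\n(.*?)\x60{3}", re.DOTALL)

def extract_code_blocks(ai_response):
    """Pull the contents of fenced code blocks out of a markdown AI reply."""
    return [m.strip() for m in FENCE.findall(ai_response)]

reply = ("Here is your manifest:\n"
         "\x60\x60\x60yaml\nkind: Deployment\n\x60\x60\x60\n"
         "Let me know if you need anything else!")
print(extract_code_blocks(reply))  # ['kind: Deployment']
```

Pipe the result straight into a YAML validator instead of hand-copying from the chat window.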
AI coding assistants won't replace DevOps engineers. But DevOps engineers using AI assistants effectively will replace those who don't.
The difference is knowing what to ask, how to validate, and when to push back.
Further Reading:
- Building Self-Healing Infrastructure with Claude Code
- Prompt Engineering for Cloud Engineers
- Automating AWS Operations Controls with AI
- The Definitive Guide to Setting Up Claude Code
Questions? Email me at phaqqani@gmail.com or find me on LinkedIn.