AI-Assisted Infrastructure as Code: A Practical Refactoring Journey
How GitHub Copilot with Claude 3.5 transformed a manual VPC configuration into maintainable, cost-optimized infrastructure using established best practices
Infrastructure as Code (IaC) refactoring is often viewed as a risky, time-consuming endeavor. The fear of breaking existing infrastructure, coupled with the complexity of understanding interdependencies, frequently leads teams to postpone necessary improvements. However, my recent experience refactoring a Terraform network configuration demonstrates how AI-assisted development can transform this challenging process into a systematic, safe, and educational journey.
The Challenge: From Manual to Modular
The starting point was a typical scenario many infrastructure teams face: a functional but manually configured AWS VPC setup. The existing `network.tf` file contained dozens of individual resources—VPC, subnets, internet gateways, NAT gateways, route tables—all hand-crafted and tightly coupled. While it worked, it suffered from common issues:
- Cost inefficiency: Multiple NAT gateways where one would suffice
- Maintenance burden: 100+ lines of boilerplate infrastructure code
- Limited reusability: Hardcoded configurations difficult to adapt
- Best practice gaps: Missing modern Terraform module patterns
The goal was straightforward: refactor this manual configuration to use the well-established [`terraform-aws-modules/vpc/aws`](https://github.com/terraform-aws-modules/terraform-aws-vpc) module while maintaining all existing functionality.
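For context, the starting point looked roughly like the excerpt below. This is an illustrative reconstruction rather than the actual file (subnet names, availability zones, and CIDRs are hypothetical); multiply the pattern across every AZ, route table, and gateway and the maintenance burden becomes clear.

```hcl
# Illustrative excerpt of the hand-crafted network.tf (names, AZs, and CIDRs are hypothetical)
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.0.0/24"
  availability_zone = "eu-west-1a"
}

resource "aws_eip" "nat_a" {
  domain = "vpc"
}

resource "aws_nat_gateway" "a" {
  allocation_id = aws_eip.nat_a.id
  subnet_id     = aws_subnet.public_a.id
}

# ...repeated per AZ, plus the internet gateway, route tables, and associations
```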
The AI-Assisted Approach: Incremental and Systematic
Rather than attempting a wholesale replacement, I established a collaboration pattern with GitHub Copilot (configured with Claude 3.5) that emphasized safety and learning:
Setting the Foundation
The conversation began with establishing clear working principles:
I want you to generate code in a step-by-step incremental manner. For each request:
- Generate only the next logical small code snippet (5-10 lines maximum).
- Wait for my approval before proceeding.
- Always explain what the current snippet does before showing the code.
This approach immediately proved valuable. Instead of receiving a complete refactored solution, I got bite-sized changes I could understand, evaluate, and approve before moving forward.
The Refactoring Journey
The AI guided me through a logical progression:
Step 1: Module Integration
The first change replaced the manual VPC configuration with the module declaration:
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 5.0"
name = var.vpc_name
cidr = var.vpc_cidr
azs = data.aws_availability_zones.available.names
public_subnets = [for i, az in data.aws_availability_zones.available.names : cidrsubnet(var.vpc_cidr, 8, i)]
private_subnets = [for i, az in data.aws_availability_zones.available.names : cidrsubnet(var.vpc_cidr, 8, i + 10)]
single_nat_gateway = true # Cost optimization
enable_dns_hostnames = true
enable_dns_support = true
}
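The module block assumes a few supporting declarations that already existed in the project. For anyone reproducing the pattern from scratch, they look roughly like this (the variable defaults are illustrative assumptions):

```hcl
# Availability zones referenced by the module block
data "aws_availability_zones" "available" {
  state = "available"
}

# Inputs referenced as var.vpc_name and var.vpc_cidr (defaults are placeholders)
variable "vpc_name" {
  type    = string
  default = "app-vpc"
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}
```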
Step 2: Dependency Updates
The AI then systematically identified and updated all references across the codebase:
- `outputs.tf`: Updated to use `module.vpc.vpc_id` instead of `aws_vpc.main.id` (see the sketch after this list)
- `lb.tf`: Changed subnet references to `module.vpc.public_subnets`
- `ecs.tf`: Updated private subnet references for service deployment
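As a concrete illustration of those reference updates, here is a sketch of the reworked `outputs.tf` and a load balancer definition; the output names and the `aws_lb` resource shown are assumptions, not the project's actual code.

```hcl
# outputs.tf: expose the module's outputs instead of the removed hand-crafted resources
output "vpc_id" {
  value = module.vpc.vpc_id # previously aws_vpc.main.id
}

output "public_subnet_ids" {
  value = module.vpc.public_subnets
}

# lb.tf: the load balancer now consumes the module's public subnets
resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  subnets            = module.vpc.public_subnets
}
```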
Step 3: Resource Cleanup
Finally, it helped remove the now-redundant manual resources while preserving custom configurations (like secure subnets) not covered by the standard module.
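As a hypothetical example of a custom configuration that survived the cleanup, a dedicated secure subnet managed outside the module simply re-attaches to the module-created VPC:

```hcl
# Custom secure subnet kept outside the module (name and netnum are hypothetical)
resource "aws_subnet" "secure" {
  vpc_id                  = module.vpc.vpc_id # was aws_vpc.main.id before the refactor
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, 100)
  map_public_ip_on_launch = false

  tags = {
    Tier = "secure"
  }
}
```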
The Testing Protocol: Confidence Through Validation
The AI didn't just help with code changes—it guided me through a comprehensive testing workflow:
1. Backend Setup: The S3 backend Terraform configuration already existed, but I had initially forgotten to apply it before starting the main infrastructure work. The AI helped me recognize this dependency and guided me through applying the backend infrastructure first (a minimal backend block is sketched after this list).
2. Plan Review: While I could run `terraform plan` manually, the AI's ability to explain complex plan outputs within the full context of our refactoring was invaluable. Instead of searching Google for explanations of specific Terraform behaviors, I could ask contextual questions and get immediate, relevant answers.
3. Incremental Application: Applying changes and verifying outputs
4. Cleanup Testing: Safely destroying resources to validate the complete lifecycle
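For reference, the backend configuration from step 1 was an ordinary S3 backend block along the lines of the sketch below (bucket, key, region, and table names are placeholders). The dependency is simple: the state bucket and lock table must exist before `terraform init` of the main configuration can succeed.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # placeholder bucket name
    key            = "network/terraform.tfstate" # placeholder state key
    region         = "eu-west-1"                 # placeholder region
    dynamodb_table = "terraform-locks"           # placeholder lock table
    encrypt        = true
  }
}
```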
This systematic approach revealed important insights, such as the expected AWS behavior with default Network ACLs and the importance of S3 backend protection mechanisms.
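The protection mechanisms in question guard the state bucket itself. A common pattern, assuming the bucket is managed in a separate bootstrap configuration, is versioning plus a `prevent_destroy` lifecycle rule:

```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-terraform-state" # placeholder name

  lifecycle {
    prevent_destroy = true # Terraform refuses to destroy the bucket holding state
  }
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled" # keep prior state versions recoverable
  }
}
```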
The Results: Beyond Code Improvement
The refactoring delivered measurable improvements:
Technical Benefits:
- 90% code reduction: From 100+ lines to ~20 lines of module configuration
- Cost optimization: Single NAT gateway saving ~$45/month
- Maintainability: Leveraging a community-maintained, battle-tested module
- Best practices: Automatic implementation of AWS VPC patterns
Process Benefits:
- Educational value: Understanding each change rather than blind replacement
- Risk mitigation: Small, reviewable changes with validation at each step
- Documentation: Complete audit trail of decisions and rationale
Version Control: The Professional Touch
The AI also guided proper git practices, suggesting focused commits:
git commit -m "refactor: replace manual VPC resources with terraform-aws-vpc module"
git commit -m "refactor: update VPC outputs to use module references"
git commit -m "refactor: update load balancer to use VPC module outputs"
git commit -m "refactor: update ECS service to use VPC module outputs"
Each commit represented a logical unit of change, making the history readable and maintainable.
Lessons Learned: The Human-AI Partnership
This experience highlighted several key insights about AI-assisted infrastructure development:
What AI Excels At
- Pattern Recognition: Immediately identifying all code locations requiring updates
- Best Practice Application: Suggesting appropriate module configurations and parameters
- Systematic Approach: Breaking complex changes into manageable steps
- Error Prevention: Catching potential issues before they become problems
What Humans Bring
- Context Understanding: Knowing business requirements like cost optimization preferences
- Decision Making: Choosing when to approve changes and when to ask questions
- Verification: Validating that changes meet actual requirements
- Strategic Thinking: Understanding the broader impact of architectural decisions
The Collaboration Sweet Spot
The most effective pattern wasn't human OR AI, but human WITH AI. The AI provided systematic guidance and technical expertise, while I provided context, decision-making, and validation. This partnership approach resulted in better outcomes than either could achieve alone.
Implications for Infrastructure Teams
This experience suggests several practical applications for infrastructure teams:
For Individual Contributors:
- Use AI as a systematic guide for complex refactoring tasks
- Establish clear collaboration patterns that emphasize learning over speed
- Leverage AI expertise in areas where you lack deep knowledge
For Teams:
- Consider AI-assisted code reviews for infrastructure changes
- Use AI to help establish and document best practices
- Implement AI-guided testing protocols for infrastructure changes
- Explore team-wide AI adoption strategies (see my detailed analysis: AI in Infrastructure Teams: A Strategic Implementation Guide)
Looking Forward
As AI capabilities continue to evolve, the potential for infrastructure automation grows. However, this experience reinforces that the most valuable applications aren't about replacing human judgment, but about augmenting human capabilities with systematic guidance, comprehensive knowledge, and tireless attention to detail.
The future of Infrastructure as Code isn't about choosing between human expertise and AI assistance—it's about combining them in ways that make infrastructure development more systematic, educational, and reliable.
*This refactoring took approximately 2 hours from start to finish, including testing and validation. The same work done manually would likely have taken a full day and carried significantly higher risk of errors or missed dependencies.*