Infrastructure as Code: The Mistakes Nobody Warns You About
January 10, 2026
7 min read
So there I was at 2 AM, staring at my terminal as Terraform cheerfully destroyed our entire production database. Not "modified." Not "updated." Destroyed.
The command was simple: terraform apply. I'd run it a hundred times before. But this time, I'd forgotten one tiny detail: I was in the wrong AWS profile. Production instead of staging.
One command. Thirty seconds. Three years of customer data gone.
Okay, not really gone; we had backups. But those thirty seconds aged me about five years. And the four-hour recovery process? Let's just say I learned some things about Infrastructure as Code that no tutorial ever mentioned.
Let me save you from making the same mistakes.
The Big Lie: "Infrastructure as Code Makes Everything Safer"
Here's what the blog posts and conference talks tell you: IaC is version controlled, repeatable, and eliminates human error. It's the future! Everything should be code!
Here's what they don't tell you: IaC gives you the power to destroy everything with incredible efficiency. And you will. Multiple times. Usually at the worst possible moment.
Traditional infrastructure management was slow and manual. That was annoying, but it was also a safety feature. You couldn't accidentally delete your entire infrastructure because it took twenty clicks and three confirmation dialogs.
With IaC? One command. That's it. No "are you sure?" No safety net. Just instant, automated destruction.
I'm not saying don't use IaC; I use it for everything. But let's be honest about the risks.
Mistake #1: Treating State Files Like Regular Files
Terraform state files are not regular files. They're the single source of truth about your infrastructure. Lose them, corrupt them, or have two people modify them simultaneously, and you're in for a world of pain.
I once worked with a team that stored their state files in Git. "It's version controlled!" they said proudly.
Then two developers ran terraform apply at the same time on different branches. The state files diverged. Terraform got confused about what actually existed. It tried to "fix" things by deleting resources that were definitely still in use.
Production went down. Hard.
What You Should Actually Do:
First, use remote state backends. For AWS, that means S3 with DynamoDB locking. For Azure, use Azure Storage with its built-in locking mechanism. These services are designed specifically to handle concurrent access and prevent the exact scenario I described above.
Enable state locking so only one person can modify the state at a time. This is non-negotiable. When someone runs terraform apply, the state gets locked. Anyone else trying to run it gets a clear error message telling them to wait. Simple, effective, prevents disasters.
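For AWS, the whole setup is a few lines of backend configuration. A minimal sketch (the bucket, key, and table names below are placeholders; use your own):

```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"         # placeholder bucket name
    key            = "prod/networking/terraform.tfstate" # placeholder state key
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"              # enables state locking
    encrypt        = true                                # encrypt the state object at rest
  }
}
```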
Never, ever commit state files to Git. I don't care how convenient it seems. State files contain sensitive information: database passwords, API keys, internal IP addresses. They belong in secure, encrypted storage, not in version control where they're accessible to everyone with repo access.
Back up your state files regularly, and keep those backups separate from your main infrastructure backups. If your infrastructure goes down and takes your state file with it, you need an independent backup to recover from. I recommend automated daily backups to a different region or even a different cloud provider.
Use state file encryption. These files contain secrets, and they should be encrypted both at rest and in transit. Most remote state backends support this out of the boxâjust turn it on.
One client lost their state file completely. No backups. We had to manually import every single resource back into Terraform. It took three weeks. Don't be that person.
Mistake #2: The "I'll Just Manually Fix This One Thing" Trap
It's Friday at 4:30 PM. Something's broken in production. You know exactly how to fix it: just change one security group rule in the AWS console. Takes thirty seconds.
So you do it. Problem solved. You go home.
Monday morning, someone runs terraform apply. Terraform sees that the security group doesn't match the code. It "fixes" it by reverting your change. The problem comes back.
Now you're debugging the same issue again, except this time you've forgotten what you changed on Friday.
This is called configuration drift, and it's the silent killer of IaC projects.
The Hard Truth:
If you manage something with IaC, you can NEVER touch it manually. Not even once. Not even in an emergency.
Every change must go through code. Yes, even the urgent ones. Yes, even when it's faster to just click the button.
I know this is painful. I've been there. But the alternative is worse: infrastructure that randomly breaks because Terraform and reality disagree about what should exist.
Making This Actually Work:
Set up read-only access for most people. They can view resources in the cloud console, check configurations, investigate issues, but they can't modify anything. This removes the temptation to "just quickly fix" something manually. If they can't make changes, they have to go through the proper IaC process.
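On AWS, the simplest version of this is attaching the AWS-managed ReadOnlyAccess policy to a group. A minimal sketch (the group name is just an example):

```hcl
# Console visibility without write access, via the AWS-managed ReadOnlyAccess policy
resource "aws_iam_group" "engineers" {
  name = "engineers-read-only"
}

resource "aws_iam_group_policy_attachment" "read_only" {
  group      = aws_iam_group.engineers.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"   # view everything, change nothing
}
```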
Make your IaC pipeline fast enough that going through code isn't painful. If it takes 30 minutes to deploy a one-line change, people will find workarounds. But if your pipeline runs in under 5 minutes? Suddenly going through the proper process is easier than fighting with permissions to make manual changes.
Have a documented emergency process that still uses code. Yes, even for emergencies. This might mean having a fast-track approval process or a special emergency branch that deploys immediately. But it's still code, still tracked, still reproducible.
Use drift detection to catch manual changes. Run terraform plan regularly (I do it nightly) and alert when it detects differences between your code and actual infrastructure. This catches manual changes quickly, before they cause problems.
When you find drift, fix the code to match reality; don't just blindly revert. Sometimes the manual change was actually correct; maybe it fixed a critical bug. Understand why the drift happened, update your code to reflect the correct state, then apply it. Learn from the incident and prevent it from happening again.
Mistake #3: Putting Secrets in Your IaC Code
I've seen this so many times: database passwords, API keys, and access tokens hardcoded right there in the Terraform files.
"But it's in a private repo!" they say.
Great. So is your entire company's infrastructure configuration. If someone gets access to that repo, they have everything.
Also, Git history is forever. That secret you committed and then removed? Still there. Anyone who clones the repo can find it.
What Actually Works:
Use proper secret management services like AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault. These tools are built specifically for storing and managing secrets securely. They provide encryption, access control, audit logging, and automatic rotation.
Reference secrets in your IaC, don't define them. Your Terraform code should say "get the database password from Secrets Manager" not "the database password is abc123". The secret lives in the secret manager, your code just points to it.
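In Terraform terms, that looks roughly like this (the secret name prod/db/password is an assumption; point it at whatever your team actually stores):

```hcl
# Look up the password at apply time instead of hardcoding it
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"   # hypothetical secret name
}

resource "aws_db_instance" "main" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.r5.large"
  allocated_storage = 100
  username          = "app"
  password          = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

Note that the resolved password still ends up in your state file, which is one more reason that file needs to be encrypted and locked down.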
Use environment variables for sensitive values during terraform apply. If you need to pass credentials to Terraform itself, use environment variables that exist only in your CI/CD environment. Never hardcode them in your code or commit them to Git.
Rotate secrets regularly and automate this process. Secrets that never change are secrets that eventually get compromised. Set up automatic rotation; every 90 days is a good baseline. Your secret manager can handle this automatically.
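If you're on AWS, Secrets Manager can drive the rotation itself. A minimal sketch, assuming you already have a rotation Lambda (its ARN is passed in as a variable here):

```hcl
variable "rotation_lambda_arn" {
  type        = string
  description = "ARN of the Lambda that performs the actual rotation"
}

resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/db/password"   # placeholder secret name
}

resource "aws_secretsmanager_secret_rotation" "db_password" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = var.rotation_lambda_arn

  rotation_rules {
    automatically_after_days = 90
  }
}
```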
Scan your repos for accidentally committed secrets using tools like git-secrets or truffleHog. Run these in your CI pipeline so they catch secrets before code gets merged. Better to catch it in a pull request than in production.
If you've already committed secrets, rotating them isn't enough. You need to consider the entire Git history compromised. Yes, even if you force-pushed to remove them.
Mistake #4: Not Testing Your Infrastructure Code
You test your application code, right? Unit tests, integration tests, maybe even end-to-end tests?
So why don't you test your infrastructure code?
"Because it's just configuration," people say. Wrong. It's code that creates real resources that cost real money and run real workloads.
I learned this the hard way when a "simple" change to our Terraform code accidentally changed the instance type of our database from db.r5.2xlarge to db.t3.micro. In production. During peak traffic.
The database immediately fell over. Response times went from 50ms to 30 seconds. Customers couldn't use the product.
If we'd tested the change in a staging environment first, we would have caught it.
How to Actually Test IaC:
Use terraform plan and actually read the output. I know it's boring. I know it's sometimes hundreds of lines. Read it anyway. Every line. Look for unexpected changes, resources being destroyed when they shouldn't be, or modifications you didn't intend. This five-minute investment can save you hours of recovery time.
Maintain a staging environment that mirrors production as closely as possible. Same instance types, same database configurations, same network setup. The more similar staging is to production, the more confident you can be that changes tested in staging will work in production.
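One way to keep the two in lockstep is to drive both environments from the same module and only override what genuinely has to differ. A sketch, assuming a hypothetical shared app module:

```hcl
# staging/main.tf -- the same module production uses, same shapes where it matters
module "app" {
  source            = "../modules/app"   # hypothetical shared module
  environment       = "staging"
  instance_type     = "m5.2xlarge"       # match production
  db_instance_class = "db.r5.2xlarge"    # match production
  desired_capacity  = 2                  # a smaller fleet is the one concession to cost
}
```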
Test changes in staging before production, every single time. No exceptions. Not even for "small" changes. Not even when you're in a hurry. Not even when the CEO is breathing down your neck. Test in staging first.
Use policy-as-code tools like OPA or Sentinel to enforce rules automatically. These tools can prevent entire categories of mistakes. For example, you can write a policy that rejects any change that creates a publicly accessible S3 bucket. The policy runs automatically on every terraform plan.
Write automated tests with tools like Terratest. These are actual code tests that deploy your infrastructure, verify it works correctly, then tear it down. They're more work to set up, but they catch issues that manual testing misses.
Use terraform validate in your CI pipeline to catch syntax errors and basic configuration problems before code even gets reviewed. This is the fastest feedback loop: developers know within seconds if their code has basic problems.
Yes, this slows you down. That's the point. Slow down before you break production.
Mistake #5: Monolithic Infrastructure Code
When you start with IaC, it's tempting to put everything in one giant Terraform project. One state file, one terraform apply, everything together.
This seems convenient until:
- terraform plan takes 10 minutes to run
- One small change requires reviewing 500 lines of plan output
- A mistake in your VPC code somehow affects your database
- You can't give teams ownership of their own infrastructure
- State file conflicts happen constantly
I worked with a company that had a single Terraform project with 50,000 lines of code. Running terraform plan took 15 minutes. Making any change was terrifying because you had no idea what else might break.
The Better Approach:
Split your infrastructure into logical modules based on function and ownership. Create a networking module that handles your VPC, subnets, and routing tables. This is foundational infrastructure that rarely changes, so it gets its own isolated state file.
Security gets its own module for IAM roles, security groups, and policies. These are sensitive configurations that need careful review, and isolating them makes that review process more focused.
Data infrastructure (databases, caches, storage buckets) goes in another module. These are stateful resources that need special care during changes. Keeping them separate means you can't accidentally affect them when you're just deploying a new application version.
Compute resources like EC2 instances, ECS clusters, and Lambda functions change frequently as you deploy new code and scale your application. They get their own module so these frequent changes don't require reviewing your entire infrastructure.
Application-specific resources that are unique to particular services or teams should be in their own modules. This allows teams to own and manage their own infrastructure without stepping on each other's toes.
Each module has its own state file. Changes to networking don't affect databases. Teams can own their modules. Dependencies between modules are handled explicitly through remote state data sources or tools like Terragrunt.
Yes, this creates dependencies between modules that you now have to wire up deliberately, but that wiring stays visible instead of hiding inside one giant plan.
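Here's roughly what that wiring looks like with a remote state data source (the state bucket, key, and the private_subnet_id output are assumptions about how your networking module is set up):

```hcl
# In the compute module: read outputs from the networking module's state
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"          # placeholder bucket
    key    = "prod/networking/terraform.tfstate"  # placeholder key
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI
  instance_type = "m5.large"
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_id
}
```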
Mistake #6: Ignoring the Blast Radius
Every IaC change has a blast radius: how much damage it can do if something goes wrong.
Changing a tag on a single EC2 instance? Small blast radius. Worst case, you have to recreate one instance.
Changing your VPC configuration? Massive blast radius. You could take down your entire infrastructure.
Yet people treat these the same. Same review process, same testing, same deployment approach.
That's insane.
Risk-Based Approach:
Low Risk (small blast radius):
These are the changes that keep you sleeping soundly at night. Tags, labels, and descriptions are purely metadata; they don't affect how your infrastructure actually runs. You could tag every resource with "delete-me-please" and nothing would break (though your colleagues might have questions).
Monitoring and alerting configuration falls into this category too. Yes, you want your alerts working correctly, but if you mess up a CloudWatch alarm, the worst that happens is you miss a notification. Your infrastructure keeps running.
Non-production resources are your playground. Dev and staging environments exist precisely so you can experiment without consequences. Break them. Learn from it. Fix them. That's what they're for.
Process: Quick review from one teammate, maybe a glance at the terraform plan output, then deploy directly. Don't overthink these. The time you spend being overly cautious here could be better spent on actual high-risk changes.
Medium Risk:
This is where things get interesting. Application infrastructure like EC2 instances and containers are critical, but they're also usually designed with redundancy. If you misconfigure one instance in an auto-scaling group, the others keep serving traffic while you fix it.
Non-critical databases (think analytics databases, reporting systems, or internal tools) matter, but they're not keeping your core product running. A few minutes of downtime is annoying, not catastrophic. You have time to fix mistakes without the CEO breathing down your neck.
Security group rules deserve respect. Open the wrong port to the wrong IP range, and you've created a security vulnerability. But at least you can close it quickly. The blast radius is contained to whatever resources use that security group.
Process: Get a thorough review from someone who understands the implications. Always test in staging firstâno exceptions. Deploy during business hours when your team is around and alert. If something goes wrong, you want people available to fix it, not scrambling to wake up at 3 AM.
High Risk (large blast radius):
This is the "one wrong move and everything explodes" category. Networking changes are terrifying because everything depends on networking. Modify your VPC routing table incorrectly? Congratulations, nothing can talk to anything else. Your entire application is now a very expensive collection of isolated servers.
IAM policies control who can do what. Mess these up and you've either locked everyone out (including yourself) or accidentally given intern-level accounts admin access to production. Both scenarios end with uncomfortable conversations with your security team.
Production databases are your company's memory. They hold every customer, every transaction, every piece of data that makes your business run. A Terraform setting like skip_final_snapshot = true on a production database should trigger the same reaction as seeing someone juggling chainsaws: immediate intervention required.
Process: Multiple reviews, and I mean multiple. At least two senior engineers need to sign off. Extensive testing in staging that actually mirrors production. Deploy during a scheduled maintenance window when customers expect potential disruption. Have your entire team on standby. Most importantly, have a tested rollback plan ready to execute. Not a theoretical plan; an actual runbook you've practiced.
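One cheap guardrail for this category is Terraform's prevent_destroy lifecycle flag, plus AWS-side deletion protection, on anything a plan should never be allowed to remove. A sketch with placeholder names and sizes:

```hcl
resource "aws_db_instance" "production" {
  identifier                  = "prod-primary"
  engine                      = "postgres"
  instance_class              = "db.r5.2xlarge"
  allocated_storage           = 500
  username                    = "app"
  manage_master_user_password = true   # let RDS keep the password in Secrets Manager (recent AWS provider versions)

  deletion_protection       = true                  # AWS itself refuses the delete
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-primary-final"  # last-chance snapshot if it is ever destroyed

  lifecycle {
    prevent_destroy = true   # Terraform refuses any plan that would destroy this resource
  }
}
```

With prevent_destroy set, a plan that would delete the database fails outright; someone has to consciously remove the flag first, which is exactly the kind of friction you want here.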
I now require two approvals for high-risk changes. And we do them during scheduled maintenance windows with the entire team on standby.
Is this overkill? Maybe. But I haven't accidentally destroyed production in three years, so I'm calling it a win.
Mistake #7: Not Having a Rollback Plan
Here's a scenario: You apply an IaC change. Something breaks. You need to roll back.
How do you do it?
If your answer is "uh... revert the Git commit and run terraform apply again?" you're in trouble.
Because sometimes Terraform can't roll back. Sometimes the damage is done. Sometimes the rollback itself fails.
I've been in situations where:
- The rollback required resources that the failed deployment deleted
- The rollback failed because of API rate limits
- The rollback worked but took 45 minutes while production was down
- The rollback succeeded but left orphaned resources costing money
What You Actually Need:
Document rollback procedures for each type of change. Not generic "revert the commit" instructions; specific, step-by-step procedures. For a database change, what's the rollback process? For a networking change? For an IAM policy? Write it down before you need it.
Test your rollback procedures. Actually practice them. In staging, make a change, then roll it back. Time how long it takes. Note any issues. Fix them. When you need to roll back in production at 2 AM, you want a procedure you've practiced, not one you're figuring out under pressure.
Keep backups of critical resources before major changes. Taking a database snapshot before modifying the database configuration takes five minutes and could save you hours of recovery time. Same for any stateful resourceâback it up first.
Have manual override procedures for when IaC fails. Sometimes Terraform gets confused. Sometimes the API is down. Sometimes you just need to fix something manually right now. Have documented procedures for how to do this safely and how to sync your code back up afterward.
Know which changes can't be rolled back and avoid them when possible. Some resources can't be restored once deleted. Some configuration changes are one-way. Identify these ahead of time and treat them with extra caution.
Use Terraform state snapshots before risky operations. Run terraform state pull and save the output to a file. If something goes wrong, you have a known-good state to restore from.
For really critical changes, I create a complete backup of the state file and take snapshots of databases before running terraform apply. Paranoid? Maybe. But I sleep better.
Mistake #8: Forgetting About Cost
IaC makes it incredibly easy to create resources. Too easy.
One developer on a client's team wrote Terraform code to create "a few test instances." They ran it. It worked great.
What they didn't realize: their code created instances in a loop. And they'd set the loop count to 100.
One hundred m5.4xlarge instances. Running 24/7. For three weeks before anyone noticed.
The AWS bill: $47,000.
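A cheap guard against exactly this mistake is a validation rule on the count variable, so terraform plan fails loudly on absurd values. A minimal sketch (the names and the cap are just examples):

```hcl
variable "test_instance_count" {
  type    = number
  default = 2

  validation {
    condition     = var.test_instance_count <= 5
    error_message = "More than 5 test instances is almost certainly a mistake; raise this cap deliberately if you really need to."
  }
}

resource "aws_instance" "test" {
  count         = var.test_instance_count
  ami           = "ami-0123456789abcdef0"   # placeholder AMI
  instance_type = "t3.small"                # nobody needs 100 m5.4xlarge "test" boxes
}
```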
Protect Yourself:
Use Infracost to estimate costs before applying changes. This tool analyzes your Terraform code and tells you how much it will cost to run. It integrates with your CI/CD pipeline and comments on pull requests with cost estimates. Suddenly that "small change" that would cost $5,000/month becomes visible before you deploy it.
Set up billing alerts at multiple thresholds. Don't just set one alert at your total budget. Set alerts at 50%, 75%, 90%, and 100% of expected spending. This gives you early warning when costs are trending up, not just when you've already blown the budget.
Require cost review for changes that create expensive resources. Any change that creates compute instances, databases, or data transfer should get an extra set of eyes specifically looking at cost implications. Make this part of your review process.
Use tagging to track which IaC projects cost what. Tag every resource with the project name, team, and environment. Then you can see exactly which projects are expensive and make informed decisions about where to optimize.
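On AWS you can enforce this at the provider level with default_tags, so every resource the provider creates gets the tags without anyone having to remember them (tag values below are placeholders):

```hcl
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Project     = "checkout-service"   # placeholder values
      Team        = "payments"
      Environment = "production"
      ManagedBy   = "terraform"
    }
  }
}
```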
Implement auto-shutdown for non-production resources. Development and staging environments don't need to run 24/7. Schedule them to shut down at night and on weekends. This alone can cut non-production costs by 70%.
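If your non-production compute sits in auto-scaling groups, scheduled actions make this almost free. A sketch, assuming an existing ASG named staging-app (times are UTC):

```hcl
# Scale staging to zero overnight on weekdays and back up each morning
resource "aws_autoscaling_schedule" "staging_night_off" {
  scheduled_action_name  = "staging-night-off"
  autoscaling_group_name = "staging-app"   # assumed existing ASG
  recurrence             = "0 20 * * 1-5"  # 8 PM, weekdays
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
}

resource "aws_autoscaling_schedule" "staging_morning_on" {
  scheduled_action_name  = "staging-morning-on"
  autoscaling_group_name = "staging-app"
  recurrence             = "0 7 * * 1-5"   # 7 AM, weekdays
  min_size               = 2
  max_size               = 4
  desired_capacity       = 2
}
```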
Review your infrastructure costs monthly. Put it on the calendar. Look at trends, identify anomalies, find optimization opportunities. Make it a regular practice, not something you do when the bill is surprisingly high.
I now require cost estimates for any change that creates compute or database resources. If the cost increase is more than 20%, it needs additional approval.
Mistake #9: Poor Documentation
Six months from now, you won't remember why you made that weird workaround in your Terraform code.
Neither will your coworkers. Neither will the new hire who has to modify it.
I've spent hours trying to understand IaC code only to discover the "clever" solution was working around a bug that was fixed months ago. But nobody documented it, so nobody knew it was safe to remove.
Document These Things:
Why you made non-obvious decisions. When you do something weird or unconventional, explain why. "This looks backwards but it's working around a bug in the AWS API" is valuable information. Future you will thank present you.
Workarounds and what they're working around. That hacky solution you implemented? Document what problem it solves and what the proper solution would be once it's available. When the underlying issue gets fixed, someone needs to know they can remove the workaround.
Dependencies between modules. If module A must be deployed before module B, write that down. If changing module C requires updating module D, document it. These relationships aren't always obvious from the code.
Order of operations for complex deployments. Some infrastructure changes require specific sequences. Deploy the database first, then the application, then update DNS. Write down the order and why it matters.
Known issues and limitations. If something doesn't work quite right but you're living with it, document it. If there's a configuration that looks wrong but is actually correct, explain why. Save the next person from spending hours debugging something that's not actually broken.
Contact info for who owns what. When something breaks at 2 AM, who should be called? When someone has questions about a particular module, who has the answers? Make this information easy to find.
Use comments in your code. Use README files. Use a wiki. I don't care how you document, just document.
Future you will be grateful. Your coworkers will be grateful. Your 2 AM incident response will go much smoother.
The Reality: IaC Is Worth It (Despite Everything)
I've spent this entire post talking about mistakes and problems. You might be thinking "why would anyone use IaC?"
Because despite all these pitfalls, IaC is still better than the alternative.
Manual infrastructure management is:
- Inconsistent (every environment is slightly different)
- Undocumented (knowledge lives in people's heads)
- Slow (everything takes forever)
- Error-prone (humans make mistakes)
- Not scalable (you can't manage 100 servers manually)
IaC is:
- Consistent (same code = same infrastructure)
- Self-documenting (the code is the documentation)
- Fast (automation is faster than clicking)
- Repeatable (same results every time)
- Scalable (managing 10 servers or 1000 is the same effort)
You just need to respect the power you're wielding.
Your Action Plan: Start Smart
If you're just getting started with IaC:
Week 1:
- Start with non-production environments where mistakes are cheap learning opportunities. Get comfortable with the tools and processes before touching anything that matters.
- Use remote state from day one; don't start with local state and migrate later.
- Set up proper access controls so people can't accidentally modify production.
- Create a staging environment that mirrors production for testing changes safely.
Month 1:
- Establish code review processes. Every infrastructure change should be reviewed by at least one other person.
- Document your standards and practices: what's acceptable, what's not, how you handle secrets, how you structure modules.
- Set up automated testing in your CI/CD pipeline so basic issues get caught automatically.
- Train your team on the tools. Don't assume people will figure it out; invest time in proper training.
Quarter 1:
- Gradually migrate production workloads. Don't try to move everything at once. Pick one service, migrate it, learn from the experience, then move to the next.
- Implement drift detection to catch manual changes.
- Set up cost monitoring so you know what everything costs.
- Create runbooks for common operations: deployments, rollbacks, disaster recovery. Make these operations repeatable and documented.
Don't try to do everything at once. Start small, learn from mistakes (you will make them), and gradually expand.
And for the love of all that is holy, test in staging first.
Need help implementing IaC without learning every lesson the hard way? Let's talk. I've made all these mistakes so you don't have to. I can help you set up IaC properly from the start, or fix the mess you've already created. No judgment; I've been there.