Real-World Use Cases
We mentioned state drift as a key driver of Autoschematic's design. Let's look at a real example of how state drift causes problems.
All names and other details have been altered.
A software service company produces a special point-of-sale system for boutique retail customers. They have a dedicated team for customer-facing support, and those team members sometimes need to access certain databases to help their customers make corrections and respond to problems. When they need to access this, they log in and assume a dedicated AWS IAM role. This access is tracked by AWS, so team members only make use of it when they need to - they don't have blanket access to every database all of the time. In fact, there are a few IAM roles that grant access to databases for particular territories. This is pretty good practice - fine-grained, least-privilege, and delegating access to a role, not a user.
So, suppose we have 3 territories - West Europe, Switzerland, and North America. Each territory has multiple databases in multiple regions, and our support staff is usually assigned to one of these territories. To keep things consistent, our databases are named something like
customerdb-eu-west-<name>
customerdb-north-america-<name>
customerdb-switzerland-<name>
So, if we were using Terraform, the AWS IAM role for our West Europe support staff might look like:
Cool, not bad! But hold on - let's read the documentation for this aws_iam_role
provider:
If someone adds another inline policy out-of-band, on the next apply, Terraform will remove that policy. If someone deletes these policies out-of-band, Terraform will recreate them.
Here, "out-of-band" means via the AWS console, or command line, or any other method of modifying a resource other than through Terraform. In other words, this is where state drift will bite you. This is precisely what happened to the company in this example.
A change in the business's needs had occured - the staff needed to have access to databases in the eu-west-3
region, and they needed it urgently. Rather than wait for the infrastructure team to properly modify the Terraform code and apply it, the policy was added in an ad-hoc fashion by a systems administrator serving the support staff.
Then, responding to a separate change in the business's needs, the infrastructure team modified the policy above like so:
So, when this was applied, the manually-created policy was detached, and the support team lost access to their customer's databases in the eu-west-3
region. We could not describe this work environment as toxic in any sense, but the unfortunate team member who applied it took all of the heat for this incident, even though the real mistake occured a week earlier when the policy was manually created. The mistake occured because the Terraform code did not match the true current state of the infrastructure. When it was modified, it did not match the true desired state of the infrastructure.
We strongly believe that this problem is so pervasive and so damaging that mitigating it, even if it means a major engineering effort, is compelling. Autoschematic is the result of that engineering effort.
We're not alone in this opinion. Hear it from the authors of Terraform themselves:
See the next section for what Autoschematic does differently, and how it could have prevented this incident.