Although Site Reliability Engineering (SRE) is a full-time job, we aim to empower every engineer to propose and apply infrastructure changes. At Adevinta, every oncall engineer has extensive access to our AWS accounts so they can quickly remedy a production outage if required.
Although we trust our oncall engineers, we aren’t able to supervise them at all times. We need to ensure that no changes are made outside of Terraform, as any configuration change made via the AWS Console or AWS CLI would drift out of sync with the Infrastructure as Code (IaC).
Today’s objective is straightforward: we want to get an alert for every manual modification made on the account without Terraform. Throughout the article you’ll find the corresponding Terraform code so you can reproduce the same process in your environments.
Linking to CloudWatch
Firstly, we link CloudTrail trails to CloudWatch log groups. This allows us to create metrics based on log group entries and raise alarms based on those metrics. After that, we configure a topic on AWS SNS (Simple Notification Service) to receive email alerts.
Tracking manual changes, and only manual changes
Almost every AWS service running on your account generates write actions. Auto Scaling, user logins, instance metadata updates — the number of daily modification events can exceed 50,000 on large production accounts.
We don’t want to track write actions performed by AWS services themselves, as we don’t consider them to affect the IaC configuration.
We also don’t want to track individual users. Team membership changes on a regular basis, and updating the monitoring every time the engineers’ list changes is out of the question.
Instead, we track IAM roles: team members assume a limited, fixed set of roles when performing write actions, which is much easier to maintain in the long run.
Using CloudTrail to monitor and control the AWS account
CloudTrail allows you to monitor any API calls made to AWS for accessing and managing the account.
What’s a CloudTrail event, and what does it look like?
A CloudTrail event is a very detailed JSON object, allowing you to precisely track every action on your AWS account.
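Here is a trimmed, illustrative example, keeping only the fields we’ll use later (the identifiers and values are hypothetical placeholders; real events contain many more fields):

{
  "eventVersion": "1.08",
  "eventTime": "2023-01-01T12:00:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "TerminateInstances",
  "eventCategory": "Management",
  "managementEvent": true,
  "readOnly": false,
  "awsRegion": "eu-west-1",
  "userAgent": "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
  "sessionCredentialFromConsole": "true",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAXXXXXXXXXXXXXXXXX:john.doe",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "userName": "AWSReservedSSO_admin_xxxxxxxxxx"
      }
    }
  }
}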
Which CloudTrail events reveal manual changes on the account?
As a significant number of such events is generated, one of the challenges is identifying which ones relate to manual changes on the AWS account (made without Terraform).
There are several CloudTrail event properties we can use to help us:
– managementEvent: true > reflects a change on the account
– sessionCredentialFromConsole: true > reflects a change made via the Console (but not always present)
– userAgent > the signature (identification) of the software applying the modification
– readOnly: false > reflects a change on the account
– eventCategory: Management > reflects a change on the configuration
You can retrieve the details for those fields within the CloudTrail records documentation.
Here, we consider that any change not performed via Terraform is a manual change not in sync with Infrastructure as Code. Luckily, Terraform has a self-identifying user-agent, so any modifying action performed with a different user-agent can be considered suspicious.
Terraform’s user-agent:
userAgent = APN/1.0 HashiCorp/1.0 Terraform/1.2.0 (+https://www.terraform.io) terraform-provider-aws/dev (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.44.10 (go1.17.6; linux; amd64)
However, there is a multitude of user-agents performing operations on an AWS account, such as:
- A browser identity (like Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0)
- An AWS service name itself (cloudtrail.amazonaws.com, rds.amazonaws.com, trustedadvisor.amazonaws.com)
- The AWS SDK (embedded on EC2 hosts)
- The S3 internal client ([S3Console/0.4]) that performs the S3 operations requested in the web console
- AWS Internal (for some actions like StartQuery on CloudTrail)
Paying attention to false positives
Some operations are considered by AWS as “management changes”, despite not being proper configuration changes. Here are a few examples:
- ConsoleLogin (on the AWS Console)
- AssumeRole (via IAM and STS)
- UpdateInstanceInformation (being pushed by SSM Client)
- StartQuery (on CloudWatch)
Our filter criteria have to account for these operations to avoid false positives. Although this requires time and effort at the beginning, it reduces the number of unwanted alerts.
Before we start
We are going to create a dedicated infrastructure to retrieve your CloudTrail events. Even if you only operate in a given region (e.g. Ireland, eu-west-1), some CloudTrail events are always produced in the North Virginia region (us-east-1):
— IAM actions
— Support-related actions
— CloudFront operations
provider "aws" {
region = "eu-west-1"
default_tags {
tags = {
managed-by = "terraform"
}
}
}
provider "aws" {
region = "us-east-1"
alias = "us_east_1"
}
medium_iac_enforcement_project-terraform.tf

This is why you need to create this infrastructure in every region where you want to supervise operations, as well as in the us-east-1 region.
module "iac_enforcement_eu_west_1" {
source = "../../modules/iac_enforcement"
logs_retention_days = 30
monitored_role = "AWSReservedSSO_admin_xxxxxxxxxx"
}
module "iac_enforcement_us_east_1" {
providers = {
aws = aws.us_east_1
}
source = "../../modules/iac_enforcement"
logs_retention_days = 30
monitored_role = "AWSReservedSSO_admin_xxxxxxxxxx"
}
iac-enforcement.tf

This is easily achieved by declaring the suggested resources in a Terraform module and deploying that module to every concerned region.
Also, you can notice two important variables here. As we’ll monitor the actions performed by a given role, we need to specify which role to supervise (monitored_role) and how long to keep the ingested logs (logs_retention_days).
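For context, the snippets below reference a few objects defined elsewhere in the module. A minimal sketch of what they could look like (the log group name is an assumption, inferred from the dashboard query shown later):

variable "monitored_role" {
  description = "IAM role whose write actions are monitored"
  type        = string
}

variable "logs_retention_days" {
  description = "How long to retain the ingested logs, in days"
  type        = number
  default     = 30
}

# Caller identity and region, used to build unique resource names
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

# Log group receiving the CloudTrail events
resource "aws_cloudwatch_log_group" "iac_enforcement" {
  name              = "iac-enforcement"
  retention_in_days = var.logs_retention_days
}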
Setting up CloudTrail
Setting up CloudTrail is pretty straightforward. You first need to create a secure S3 bucket with a proper lifecycle policy (as you likely don’t want to keep your logs forever).
### S3 bucket linked to the CloudTrail trail
resource "aws_s3_bucket" "this" {
  bucket        = "cloudtrail-iac-enforcement-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"
  force_destroy = true
}

resource "aws_s3_bucket_policy" "this" {
  bucket = aws_s3_bucket.this.id
  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSCloudTrailAclCheck",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:GetBucketAcl",
      "Resource": "${aws_s3_bucket.this.arn}"
    },
    {
      "Sid": "AWSCloudTrailWrite",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "${aws_s3_bucket.this.arn}/AWSLogs/${data.aws_caller_identity.current.account_id}/*"
    }
  ]
}
POLICY
}

resource "aws_s3_bucket_lifecycle_configuration" "this" {
  bucket = aws_s3_bucket.this.bucket

  rule {
    id     = "logs"
    status = "Enabled"

    filter {
      prefix = "AWSLogs/"
    }

    expiration {
      days = var.logs_retention_days # expire the logs after the configured retention period
    }
  }
}
s3.tf

Then, you’ll need a dedicated IAM role allowing the CloudTrail service to create log streams and put events on those log streams. Pay attention to the wildcard in the statements’ Resource: it is required because CloudTrail creates several log streams, and the wildcard covers all of them.
resource "aws_iam_role" "cloudtrail_role" {
name = "cloudtrail-iac-enforcement-${data.aws_region.current.name}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Sid = ""
Principal = {
Service = "cloudtrail.amazonaws.com"
}
},
]
})
inline_policy {
name = "cloudtrail"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"logs:CreateLogStream"
],
Resource = [
"${aws_cloudwatch_log_group.iac_enforcement.arn}:log-stream:*"
]
},
{
Effect = "Allow"
Action = [
"logs:PutLogEvents"
],
Resource = [
"${aws_cloudwatch_log_group.iac_enforcement.arn}:log-stream:*"
]
}
]
})
}
}
iam.tf

Then, it’s time to set up your trail. A trail can forward multiple kinds of events: management, data, and Insights events. Events are also split into read-only and write events; today we focus only on write management events. However, if you need granularity for monitoring object-level activity inside S3 buckets, you may need to enable data events (a sketch follows at the end of this section).
resource "aws_cloudtrail" "iacenforcement" {
name = "iac-enforcement"
s3_bucket_name = aws_s3_bucket.this.id
include_global_service_events = true
cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.iac_enforcement.arn}:*" # CloudTrail requires the Log Stream wildcard
cloud_watch_logs_role_arn = aws_iam_role.cloudtrail_role.arn
event_selector {
read_write_type = "WriteOnly"
include_management_events = true
}
}
cloudtrail.tf

When your trail is created, AWS checks the CloudWatch IAM role permissions: incorrect IAM rights will make the creation attempt fail. If you struggle to create your trail, first enable only the S3 log output, then add the CloudWatch integration.
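If you do need object-level granularity on S3, the trail’s event selector can be extended with a data resource. A minimal sketch, assuming a placeholder bucket ARN:

event_selector {
  read_write_type           = "WriteOnly"
  include_management_events = true

  # Hypothetical bucket; the trailing slash covers every object in it
  data_resource {
    type   = "AWS::S3::Object"
    values = ["arn:aws:s3:::my-monitored-bucket/"]
  }
}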
Using CloudWatch to collect and track metrics
Ensure log messages arrive correctly
When the events generated by CloudTrail arrive in CloudWatch, they are stored across multiple log streams.
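To confirm that events are arriving, you can run a quick CloudWatch Logs Insights query against the log group (a minimal sketch, using the same query language as the dashboard shown later):

fields @timestamp, eventName, userAgent
| sort @timestamp desc
| limit 5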
Create a Log Filter metric
It’s now time to take advantage of CloudWatch by creating a metric with a proper filter pattern. We want a metric that only warns us about manual changes made by humans without Terraform. Every other case (non-human changes, or human changes via Terraform) should not increment the metric.
This metric will be used to trigger an alarm later on.
#################
# CloudWatch part
#################
resource "aws_cloudwatch_log_metric_filter" "manual_changes" {
  name = "Non IAC Changes"
  pattern = replace(<<EOT
{
($.userIdentity.sessionContext.sessionIssuer.userName = "${var.monitored_role}" ) &&
($.userAgent != "*Confidence*") &&
($.userAgent != "*Terraform*") &&
($.userAgent != "*ssm-agent*") &&
($.eventName != "AssumeRole") &&
($.eventName != "StartQuery") &&
($.eventName != "ConsoleLogin") &&
($.eventName != "StartSession") &&
($.eventName != "CreateSession") &&
($.eventName != "ResumeSession") &&
($.eventName != "SendSSHPublicKey") &&
($.eventName != "PutCredentials") &&
($.managementEvent is true) &&
($.readOnly is false)
}
EOT
  , "\n", " ") # This field cannot exceed 1024 characters
  log_group_name = aws_cloudwatch_log_group.iac_enforcement.name

  metric_transformation {
    name          = "Non IAC Changes"
    namespace     = "iac-enforcement"
    unit          = "Count"
    default_value = "0"
    value         = "1"
  }
}
cloudwatch_metricfilter.tf

The pattern assigned to the metric defines the conditions that must all be true for the metric to increase. If you need more, you can create additional metric filters accordingly. You can find more information in the Amazon CloudWatch API Reference.
The results can be seen in the CloudWatch interface, where each “1” marks a change performed by a human without Terraform.
Create a CloudWatch Alarm to get notified
To receive notifications of any manual changes made outside of Terraform, we attach a CloudWatch alarm to the metric we just created. The condition is simple: any spike on the metric fires the alarm. The alarm then publishes to an SNS topic that sends an email (look at the alarm_actions in the Terraform manifest). However, you can trigger a different response to the metric spike, such as a Lambda function.
resource "aws_cloudwatch_metric_alarm" "this" {
alarm_name = "Non-IAC changes detected"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "1"
metric_name = aws_cloudwatch_log_metric_filter.manual_changes.metric_transformation[0].name
namespace = aws_cloudwatch_log_metric_filter.manual_changes.metric_transformation[0].namespace
period = "30"
statistic = "Sum"
threshold = "1"
alarm_description = "Non-IAC changes detected on the AWS account"
alarm_actions = [aws_sns_topic.alert_sre.arn]
insufficient_data_actions = []
}
cloudwatch_alarm.tf

Creating a proper dashboard
This dashboard allows us to identify which action triggered the alarm. Designing it via Terraform is difficult (or almost impossible, let’s face it), so the trick is to build the dashboard the way you like it in the AWS Console, export it as JSON, and paste that JSON into the dashboard_body.
Proceed the same way when editing your dashboard: make the change manually, then immediately resync it with Terraform.
resource "aws_cloudwatch_dashboard" "iac_enforcement" {
dashboard_name = "iac-enforcement"
dashboard_body = jsonencode(
{
widgets = [
{
height = 7
properties = {
legend = { position = "right" }
metrics = [["iac-enforcement", "Non IAC Changes", { color = "#d62728" }]]
region = "eu-west-1"
setPeriodToTimeRange = true
stacked = true
view = "timeSeries"
period = 60
stat = "Sum"
}
type = "metric"
width = 24
x = 0
y = 0
},
{
height = 13
properties = {
query = <<-EOT
SOURCE 'iac-enforcement' | fields @timestamp, userIdentity.principalId, eventName, @message
| sort @timestamp desc
| filter userIdentity.sessionContext.sessionIssuer.userName like /${var.monitored_role}/
| filter userAgent not like /Confidence/
| filter userAgent not like /Terraform/
| filter userAgent not like /ssm-agent/
| filter eventName not like /AssumeRole/
| filter eventName not like /ConsoleLogin/
| filter eventName not like /StartSession/
| filter eventName not like /CreateSession/
| filter eventName not like /ResumeSession/
| filter eventName not like /SendSSHPublicKey/
| filter eventName not like /PutCredentials/
| filter eventName not like /StartQuery/
| filter managementEvent = 1
| filter readOnly = 0
| limit 100
EOT
region = "eu-west-1"
stacked = false
view = "table"
}
type = "log"
width = 24
x = 0
y = 7
},
]
}
)
}
cloudwatch_dashboard.tf

This gives you a dedicated dashboard that you can reproduce on all your accounts. The dashboard has two parts: a graph with the metric values, and a filtered log view showing the events behind the corresponding alarms.
SNS and email notification
Now that the alarm is set, we want to get alerts for metric changes. We’ll be using the SNS service with an email subscription.
The subscription has to be confirmed via an initial message that AWS sends to the recipient. When you apply the Terraform code, the topic and its subscription are created automatically, and you should receive a confirmation email.
resource "aws_sns_topic" "alert_sre" {
name = "alert-sre"
}
resource "aws_sns_topic_subscription" "sre_email_subscription" {
topic_arn = aws_sns_topic.alert_sre.arn
protocol = "email"
endpoint = "benjamin.riou@adevinta.com"
}
sns.tf

When your metric triggers the alarm, an email is automatically sent to your mailbox.
Alarm customisation
You cannot customise the text of the alarm email, but you can set the alarm description. For a fully customised email, you can use a Lambda function instead.
This will be discussed in part 2 of this article.
Cost estimation
Several services are used here: CloudTrail, S3, IAM, CloudWatch and SNS. Let’s explore how much these changes might cost. Your final pricing mostly depends on how many events you process.
To estimate the volume of stored events, go to your CloudWatch log group, check the Stored Bytes value, and divide it by the age of the log group (here, about 10 MB per day).
To estimate the number of events, search within your log group using Logs Insights: set a custom time range, run the query without any filter, and look at the number of records processed (here, about 12,000 events per day).
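As an example, a minimal Logs Insights sketch counting events per day over the selected time range:

stats count(*) as events by bin(1d) as day
| sort day desc

Multiplying the daily count by 31 gives the monthly volume assumed below (about 12,000 × 31 ≈ 372,000 events).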
CloudTrail:
- CloudTrail logs are free and enabled by default
- First trail is free of charge
- For each additional trail, the charge is 2 USD per 100,000 events (management events)
S3:
- Bucket creation is free
- Lifecycle expiration (deletion) is free
- Data ingestion is 0.023 USD / GB
IAM:
- Offered as no charge
CloudWatch:
- Log ingestion and storage is charged 0.60 USD/GB (with first 5 GB free)
- Events processed are charged 1 USD / million
- Alarms cost 0.10 USD each
- Dashboards are charged 3 USD (with the first 3 dashboards free)
SNS:
- Topic creation and subscription is free
- Sending an email via SNS is charged 2 USD per 100,000 notifications (with first 1,000 notifications free)
So enabling this alerting will cost about 10 USD per month, assuming the worst case that your free tier is exceeded:
- Ingesting 372,000 events: ~ 4 USD
- Storing 372 MB of data: ~ 0 USD
- One alarm: 0.10 USD
- One dashboard: 3 USD
- 100,000 notifications: 2 USD
Conclusion
Changing habits and enforcing new processes can be a difficult challenge, because it primarily affects humans rather than machines. However, enforcing IaC via active monitoring and alerting ensures manual changes are reimported into Terraform and reduces the risk of inconsistency.
Another advantage of this solution is that it requires no maintenance once set up. To improve the alerting system, we could use a Lambda function instead of a standard CloudWatch alarm action. With a Lambda, we could send the topic recipient a customised email with useful details about the metric spike, giving more context behind the change and supporting the decision on the best next steps.
This article has shown how you can be alerted to any manual change on your AWS accounts, so you can ensure that every manual action is reimported into Terraform.