Although Site Reliability Engineering (SRE) is a full-time job, we aim to empower every engineer to propose and apply infrastructure changes. At Adevinta, every oncall engineer has extensive access to our AWS accounts so they can quickly remedy a production outage if required.
Although we trust our oncall engineers, we aren’t able to supervise them at all times. We need to ensure that no changes are made outside of Terraform, as any configuration change made via the AWS Console or AWS CLI would drift out of sync with the Infrastructure as Code (IaC).
Today’s objective is straightforward: we want to get an alert for every manual modification made on the account without Terraform. Throughout the article you’ll find the corresponding Terraform code so you can reproduce the same process in your environments.
Linking to CloudWatch
Firstly, we link CloudTrail trails to CloudWatch log groups. This allows us to create metrics based on log group entries and raise alarms based on those metrics. After that, we configure a topic on AWS SNS (Simple Notification Service) to receive email alerts.
Tracking manual changes, and only manual changes
Almost every AWS service running on your account generates write actions. Auto Scaling, user logins, instance metadata updates — the number of daily modification events can exceed 50,000 on large production accounts.
We don’t want to track write actions performed by AWS services themselves, as we don’t consider them to affect the IaC configuration.
We also don’t want to track individual users. Team membership changes on a regular basis, and updating the monitoring every time the engineers’ list changes is out of the question.
Instead, we track IAM roles: team members assume a limited, fixed set of roles when performing write actions, which is much easier to maintain in the long run.
Using CloudTrail to monitor and control the AWS account
CloudTrail allows you to monitor any API calls made to AWS for accessing and managing the account.
What’s a CloudTrail event, and what does it look like?
A CloudTrail event is a very detailed JSON object, allowing you to precisely track every action on your AWS account.
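Here is a trimmed, illustrative example, keeping only the fields we’ll use later (the identifiers and values are hypothetical placeholders; real events contain many more fields):

{
  "eventVersion": "1.08",
  "eventTime": "2023-01-01T12:00:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "TerminateInstances",
  "eventCategory": "Management",
  "managementEvent": true,
  "readOnly": false,
  "awsRegion": "eu-west-1",
  "userAgent": "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
  "sessionCredentialFromConsole": "true",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAXXXXXXXXXXXXXXXXX:john.doe",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "userName": "AWSReservedSSO_admin_xxxxxxxxxx"
      }
    }
  }
}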
Which CloudTrail events reveal manual changes on the account?
As a significant number of such events is generated, one of the challenges is identifying which ones relate to manual changes on the AWS account (made without Terraform).
There are several CloudTrail event properties we can use to help us:
– managementEvent: true > reflects a change on the account
– sessionCredentialFromConsole: true > reflects a change made via the Console (but not always present)
– userAgent > the signature (identification) of the software applying the modification
– readOnly: false > reflects a change on the account
– eventCategory: Management > reflects a change on the configuration
You can retrieve the details for those fields within the CloudTrail records documentation.
Here, we consider that any change not performed via Terraform is a manual change not in sync with Infrastructure as Code. Luckily, Terraform has a self-identifying user-agent, so any modifying action performed with a different user-agent can be considered suspicious.
Terraform’s user-agent:
userAgent = APN/1.0 HashiCorp/1.0 Terraform/1.2.0 (+https://www.terraform.io) terraform-provider-aws/dev (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.44.10 (go1.17.6; linux; amd64)
However, there is a multitude of user-agents performing operations on an AWS account, such as:
- A browser identity (like Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0)
- An AWS service name itself (cloudtrail.amazonaws.com, rds.amazonaws.com, trustedadvisor.amazonaws.com)
- The AWS SDK (embedded on EC2 hosts)
- The S3 internal client ([S3Console/0.4]) that performs the S3 operations requested in the web console
- AWS Internal (for some actions like StartQuery on CloudTrail)
Paying attention to false positives
Some operations are considered by AWS as “management changes”, despite not being proper configuration changes. Here are a few examples:
- ConsoleLogin (on the AWS Console)
- AssumeRole (via IAM and STS)
- UpdateInstanceInformation (being pushed by SSM Client)
- StartQuery (on CloudWatch)
Our filter criteria have to account for these operations to avoid false positives. Although this requires time and effort at the beginning, it reduces the number of unwanted alerts.
Before we start
We are going to create a dedicated infrastructure to retrieve your CloudTrail events. Even if you only operate in a given region (e.g. Ireland, eu-west-1), some CloudTrail events are always produced in the North Virginia region (us-east-1):
— IAM actions
— Support-related actions
— CloudFront operations
provider "aws" {
region = "eu-west-1"
default_tags {
tags = {
managed-by = "terraform"
}
}
}
provider "aws" {
region = "us-east-1"
alias = "us_east_1"
}
medium_iac_enforcement_project-terraform.tf

This is why you need to create this infrastructure in every region where you want to supervise operations, as well as in the us-east-1 region.
module "iac_enforcement_eu_west_1" {
source = "../../modules/iac_enforcement"
logs_retention_days = 30
monitored_role = "AWSReservedSSO_admin_xxxxxxxxxx"
}
module "iac_enforcement_us_east_1" {
providers = {
aws = aws.us_east_1
}
source = "../../modules/iac_enforcement"
logs_retention_days = 30
monitored_role = "AWSReservedSSO_admin_xxxxxxxxxx"
}
iac-enforcement.tf

This is easily achieved by declaring the suggested resources in a Terraform module and deploying that module to every concerned region.
Also, you can notice two important variables here. As we’ll monitor the actions performed by a given role, we need to specify which role to supervise (monitored_role) and how long to keep the ingested logs (logs_retention_days).
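For context, the snippets below reference a few objects defined elsewhere in the module. A minimal sketch of what they could look like (the log group name is an assumption, inferred from the dashboard query shown later):

variable "monitored_role" {
  description = "IAM role whose write actions are monitored"
  type        = string
}

variable "logs_retention_days" {
  description = "How long to retain the ingested logs, in days"
  type        = number
  default     = 30
}

# Caller identity and region, used to build unique resource names
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

# Log group receiving the CloudTrail events
resource "aws_cloudwatch_log_group" "iac_enforcement" {
  name              = "iac-enforcement"
  retention_in_days = var.logs_retention_days
}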
Setting up CloudTrail
Setting up CloudTrail is pretty straightforward. You first need to create a secure S3 bucket with a proper lifecycle policy (as you likely don’t want to keep your logs forever).
### S3 bucket linked to the CloudTrail trail
resource "aws_s3_bucket" "this" {
  bucket        = "cloudtrail-iac-enforcement-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"
  force_destroy = true
}

resource "aws_s3_bucket_policy" "this" {
  bucket = aws_s3_bucket.this.id
  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSCloudTrailAclCheck",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:GetBucketAcl",
      "Resource": "${aws_s3_bucket.this.arn}"
    },
    {
      "Sid": "AWSCloudTrailWrite",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "${aws_s3_bucket.this.arn}/AWSLogs/${data.aws_caller_identity.current.account_id}/*"
    }
  ]
}
POLICY
}

resource "aws_s3_bucket_lifecycle_configuration" "this" {
  bucket = aws_s3_bucket.this.bucket

  rule {
    id     = "logs"
    status = "Enabled"

    filter {
      prefix = "AWSLogs/"
    }

    expiration {
      days = var.logs_retention_days # expire the logs after the configured retention period
    }
  }
}
s3.tf

Then, you’ll need a dedicated IAM role allowing the CloudTrail service to create log streams and put events on those log streams. Pay attention to the wildcard in the statements’ Resource: it is required because CloudTrail creates several log streams, and the wildcard covers all of them.
resource "aws_iam_role" "cloudtrail_role" {
name = "cloudtrail-iac-enforcement-${data.aws_region.current.name}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Sid = ""
Principal = {
Service = "cloudtrail.amazonaws.com"
}
},
]
})
inline_policy {
name = "cloudtrail"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"logs:CreateLogStream"
],
Resource = [
"${aws_cloudwatch_log_group.iac_enforcement.arn}:log-stream:*"
]
},
{
Effect = "Allow"
Action = [
"logs:PutLogEvents"
],
Resource = [
"${aws_cloudwatch_log_group.iac_enforcement.arn}:log-stream:*"
]
}
]
})
}
}
iam.tf

Then, it’s time to set up your trail. A trail can forward multiple kinds of events: management, data, and Insights events. Events are also split into read-only and write events; today we focus only on write management events. However, if you need granularity for monitoring object-level activity inside S3 buckets, you may need to enable data events (a sketch follows at the end of this section).
resource "aws_cloudtrail" "iacenforcement" {
name = "iac-enforcement"
s3_bucket_name = aws_s3_bucket.this.id
include_global_service_events = true
cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.iac_enforcement.arn}:*" # CloudTrail requires the Log Stream wildcard
cloud_watch_logs_role_arn = aws_iam_role.cloudtrail_role.arn
event_selector {
read_write_type = "WriteOnly"
include_management_events = true
}
}
cloudtrail.tf

When your trail is created, AWS checks the CloudWatch IAM role permissions: incorrect IAM rights will make the creation attempt fail. If you struggle to create your trail, first enable only the S3 log output, then add the CloudWatch integration.
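If you do need object-level granularity on S3, the trail’s event selector can be extended with a data resource. A minimal sketch, assuming a placeholder bucket ARN:

event_selector {
  read_write_type           = "WriteOnly"
  include_management_events = true

  # Hypothetical bucket; the trailing slash covers every object in it
  data_resource {
    type   = "AWS::S3::Object"
    values = ["arn:aws:s3:::my-monitored-bucket/"]
  }
}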
Using CloudWatch to collect and track metrics
Ensure log messages arrive correctly
When the events generated by CloudTrail arrive in CloudWatch, they are stored across multiple log streams.
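To confirm that events are arriving, you can run a quick CloudWatch Logs Insights query against the log group (a minimal sketch, using the same query language as the dashboard shown later):

fields @timestamp, eventName, userAgent
| sort @timestamp desc
| limit 5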
Create a Log Filter metric
It’s now time to take advantage of CloudWatch by creating a metric with a proper filter pattern. We want a metric that only warns us about manual changes made by humans without Terraform. Every other case (non-human changes, or human changes via Terraform) should not increment the metric.
This metric will be used to trigger an alarm later on.
#################
# CloudWatch part
#################
resource "aws_cloudwatch_log_metric_filter" "manual_changes" {
  name = "Non IAC Changes"
  pattern = replace(<<EOT
{
($.userIdentity.sessionContext.sessionIssuer.userName = "${var.monitored_role}" ) &&
($.userAgent != "*Confidence*") &&
($.userAgent != "*Terraform*") &&
($.userAgent != "*ssm-agent*") &&
($.eventName != "AssumeRole") &&
($.eventName != "StartQuery") &&
($.eventName != "ConsoleLogin") &&
($.eventName != "StartSession") &&
($.eventName != "CreateSession") &&
($.eventName != "ResumeSession") &&
($.eventName != "SendSSHPublicKey") &&
($.eventName != "PutCredentials") &&
($.managementEvent is true) &&
($.readOnly is false)
}
EOT
  , "\n", " ") # This field cannot exceed 1024 characters
  log_group_name = aws_cloudwatch_log_group.iac_enforcement.name

  metric_transformation {
    name          = "Non IAC Changes"
    namespace     = "iac-enforcement"
    unit          = "Count"
    default_value = "0"
    value         = "1"
  }
}
cloudwatch_metricfilter.tf

The pattern assigned to the metric defines the conditions that must all be true for the metric to increase. If you need more, you can create additional metric filters accordingly. You can find more information in the Amazon CloudWatch API Reference.
The results can be seen in the CloudWatch interface, where each “1” marks a change performed by a human without Terraform.
Create a CloudWatch Alarm to get notified
To receive notifications of any manual changes made outside of Terraform, we attach a CloudWatch alarm to the metric we just created. The condition is simple: any spike on the metric fires the alarm. The alarm then publishes to an SNS topic that sends an email (look at the alarm_actions in the Terraform manifest). However, you can trigger a different response to the metric spike, such as a Lambda function.
resource "aws_cloudwatch_metric_alarm" "this" {
alarm_name = "Non-IAC changes detected"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "1"
metric_name = aws_cloudwatch_log_metric_filter.manual_changes.metric_transformation[0].name
namespace = aws_cloudwatch_log_metric_filter.manual_changes.metric_transformation[0].namespace
period = "30"
statistic = "Sum"
threshold = "1"
alarm_description = "Non-IAC changes detected on the AWS account"
alarm_actions = [aws_sns_topic.alert_sre.arn]
insufficient_data_actions = []
}
cloudwatch_alarm.tf

Creating a proper dashboard
This dashboard allows us to identify which action triggered the alarm. Designing it via Terraform is difficult (or almost impossible, let’s face it), so the trick is to build the dashboard the way you like it in the AWS Console, export it as JSON, and paste that JSON into the dashboard_body.
Proceed the same way when editing your dashboard: make the change manually, then immediately resync it with Terraform.
resource "aws_cloudwatch_dashboard" "iac_enforcement" {
dashboard_name = "iac-enforcement"
dashboard_body = jsonencode(
{
widgets = [
{
height = 7
properties = {
legend = { position = "right" }
metrics = [["iac-enforcement", "Non IAC Changes", { color = "#d62728" }]]
region = "eu-west-1"
setPeriodToTimeRange = true
stacked = true
view = "timeSeries"
period = 60
stat = "Sum"
}
type = "metric"
width = 24
x = 0
y = 0
},
{
height = 13
properties = {
query = <<-EOT
SOURCE 'iac-enforcement' | fields @timestamp, userIdentity.principalId, eventName, @message
| sort @timestamp desc
| filter userIdentity.sessionContext.sessionIssuer.userName like /${var.monitored_role}/
| filter userAgent not like /Confidence/
| filter userAgent not like /Terraform/
| filter userAgent not like /ssm-agent/
| filter eventName not like /AssumeRole/
| filter eventName not like /ConsoleLogin/
| filter eventName not like /StartSession/
| filter eventName not like /CreateSession/
| filter eventName not like /ResumeSession/
| filter eventName not like /SendSSHPublicKey/
| filter eventName not like /PutCredentials/
| filter eventName not like /StartQuery/
| filter managementEvent = 1
| filter readOnly = 0
| limit 100
EOT
region = "eu-west-1"
stacked = false
view = "table"
}
type = "log"
width = 24
x = 0
y = 7
},
]
}
)
}
cloudwatch_dashboard.tf

This gives you a dedicated dashboard that you can reproduce on all your accounts. The dashboard has two parts: a graph with the metric values, and a filtered log view showing the events behind the corresponding alarms.
SNS and email notification
Now that the alarm is set, we want to get alerts for metric changes. We’ll be using the SNS service with an email subscription.
The subscription has to be confirmed via an initial message that AWS sends to the recipient. When you apply the Terraform code, the topic and its subscription are created automatically, and you should receive a confirmation email.
resource "aws_sns_topic" "alert_sre" {
name = "alert-sre"
}
resource "aws_sns_topic_subscription" "sre_email_subscription" {
topic_arn = aws_sns_topic.alert_sre.arn
protocol = "email"
endpoint = "benjamin.riou@adevinta.com"
}
sns.tf

When your metric triggers the alarm, an email is automatically sent to your mailbox.
Alarm customisation
You cannot customise the text of the alarm email, but you can set the alarm description. For a fully customised email, you can use a Lambda function instead.
This will be discussed in part 2 of this article.
Cost estimation
Several services are used here: CloudTrail, S3, IAM, CloudWatch and SNS. Let’s explore how much these changes might cost. Your final pricing mostly depends on how many events you process.
To estimate the volume of stored events, go to your CloudWatch log group, check the Stored Bytes value, and divide it by the age of the log group (here, about 10 MB per day).
To estimate the number of events, search within your log group using Logs Insights: set a custom time range, run the query without any filter, and look at the number of records processed (here, about 12,000 events per day).
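As an example, a minimal Logs Insights sketch counting events per day over the selected time range:

stats count(*) as events by bin(1d) as day
| sort day desc

Multiplying the daily count by 31 gives the monthly volume assumed below (about 12,000 × 31 ≈ 372,000 events).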
CloudTrail:
- CloudTrail logs are free and enabled by default
- First trail is free of charge
- For each additional trail, the charge is 2 USD per 100,000 events (management events)
S3:
- Bucket creation is free
- Lifecycle expiration (deletion) is free
- Data ingestion is 0.023 USD / GB
IAM:
- Offered as no charge
CloudWatch:
- Log ingestion and storage is charged 0.60 USD/GB (with first 5 GB free)
- Events processed are charged 1 USD / million
- Alarms cost 0.10 USD each
- Dashboards are charged 3 USD (with the first 3 dashboards free)
SNS:
- Topic creation and subscription is free
- Sending an email via SNS is charged 2 USD per 100,000 notifications (with first 1,000 notifications free)
So enabling this alerting will cost about 10 USD per month, assuming the worst case that your free tier is exceeded:
- Ingesting 372,000 events: ~ 4 USD
- Storing 372 MB of data: ~ 0 USD
- One alarm: 0.10 USD
- One dashboard: 3 USD
- 100,000 notifications: 2 USD
Conclusion
Changing habits and enforcing new processes can be a difficult challenge, because it primarily affects humans rather than machines. However, enforcing IaC via active monitoring and alerting ensures manual changes are reimported into Terraform and reduces the risk of inconsistency.
Another advantage of this solution is that it requires no maintenance once set up. To improve the alerting system, we could use a Lambda function instead of a standard CloudWatch alarm action. With a Lambda, we could send the topic recipient a customised email with useful details about the metric spike, giving more context behind the change and supporting the decision on the best next steps.
This article has shown how you can be alerted to any manual change on your AWS accounts, so you can ensure that every manual action is reimported into Terraform.