
Get meaningful email alerts from AWS CloudWatch

  • DevOps
The default AWS CloudWatch alerts are pretty basic. Here’s how to pump them up.

AWS CloudWatch is a powerful solution that allows you to ingest logs and monitor your AWS infrastructure. It also enables you to create custom alerts that trigger email notifications — a very useful feature. However, the emails from CloudWatch are quite poor in the information they carry. In this article, you’ll learn how to create meaningful CloudWatch notifications.

We’ll be following the approach of our previous article about Terraform enforcement at Adevinta, “Enforcing and controlling infrastructure as code”. If you’re not familiar with AWS CloudWatch, you might want to read the previous blog article first. However, the example here can be reused for any kind of CloudWatch alerting, as long as you know which query is the origin of the triggered alarm.

Is the default CloudWatch notification that bad? No… it’s worse

Here’s a sample email notification issued by CloudWatch via SNS, replicated from the previous article (all CloudWatch email alerts look alike).

Default CloudWatch Notification

CloudWatch Alerting only reports that an alarm is raised.

Looking at this email, you can see how little data is available: the name of the triggered alarm, the timestamp, the region, and the state. However, we have no clue which parsed event is responsible for firing the alarm. The only option is to:

  • Connect to the impacted AWS Account
  • Access the CloudWatch Service
  • Set the same time range as the alert timestamp
  • Perform the same log query as the metric-based alarm is doing

You cannot customise any field or add any piece of information that would give a relevant hint when the alarm fires.

You received an alarm, without any details.

As I’m sure you’ll agree, this isn’t very efficient. So, let’s create a more user-friendly email with the exact event responsible for the alarm being triggered, plus additional information for how to react.

Overall infrastructure schema

Our approach involves several AWS services, including IAM, CloudWatch, EventBridge, Lambda and SNS. As a starting point, we assume that an alarm based on log ingestion has already been set up.

For the Terraform file, we’ll re-use the existing alert. A Lambda will be fired from an EventBridge Event, based on the CloudWatch Alarm being triggered. The CloudWatch Alarm is based on a LogGroup Metric Filter that already exists. The Lambda will be smart enough to query CloudWatch for the latest events responsible for the alert; then an email will be sent via SNS.


The JSON event issued from CloudWatch to EventBridge? Just as horrible

The Lambda declares an entrypoint that receives an event. This event is the JSON representation of the triggered CloudWatch alarm, and it doesn’t contain much: the only meaningful information is the alarm name and the state (ALARM).

The JSON payload issued from CloudWatch/EventBridge.

We want the Lambda to retrieve the latest CloudWatch events that triggered the CloudWatch Alarm, which is why it needs to be able to gather CloudWatch query results itself.
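For reference, the alarm state-change event that EventBridge delivers to the Lambda follows a small, documented shape. Here is a minimal sketch of pulling the alarm name and state out of it; the sample payload is hand-made for illustration and abbreviated:

```python
# Hand-made sample mimicking an EventBridge "CloudWatch Alarm State Change"
# event (abbreviated for illustration).
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "region": "eu-west-1",
    "detail": {
        "alarmName": "iac-enforcement-alarm",
        "state": {"value": "ALARM"},
    },
}


def alarm_summary(event):
    """Return (alarm_name, state) from an EventBridge alarm event."""
    detail = event.get("detail", {})
    return detail.get("alarmName"), detail.get("state", {}).get("value")


name, state = alarm_summary(sample_event)
print(name, state)  # iac-enforcement-alarm ALARM
```

As you can see, nothing in this payload tells you *which log entries* tripped the alarm, which is exactly the gap the Lambda fills.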

Dissecting the Lambda

import os
import time
from datetime import datetime, timedelta

import boto3

sns_arn = os.environ["SNS_TOPIC_ARN"]
monitored_role = os.environ["MONITORED_ROLE"]


def account_alias():
    return boto3.client("iam").list_account_aliases()["AccountAliases"][0]


def retrieve_events():
    client = boto3.client("logs")

    query = f"""fields @timestamp, userIdentity.principalId, eventName, eventSource, @message
| sort @timestamp desc
| filter userIdentity.sessionContext.sessionIssuer.userName like /{monitored_role}/
| filter userAgent not like /Confidence/
| filter userAgent not like /Terraform/
| filter userAgent not like /ssm-agent/
| filter eventName not like /AssumeRole/
| filter eventName not like /ConsoleLogin/
| filter eventName not like /StartSession/
| filter eventName not like /CreateSession/
| filter eventName not like /ResumeSession/
| filter eventName not like /SendSSHPublicKey/
| filter eventName not like /PutCredentials/
| filter eventName not like /StartQuery/
| filter managementEvent = 1
| filter readOnly = 0
| limit 100"""

    print(query)

    log_group = "iac-enforcement"
    timeout = 300
    timeout_start = time.time()
    response = {}

    while len(response.get("results", {})) < 1 and (
        time.time() < timeout_start + timeout
    ):
        start_query_response = client.start_query(
            logGroupName=log_group,
            startTime=int((datetime.today() - timedelta(minutes=2)).timestamp()),
            endTime=int(datetime.now().timestamp()),
            queryString=query,
        )

        query_id = start_query_response["queryId"]

        response = {}

        while response == {} or response["status"] in ("Scheduled", "Running"):
            print("Waiting for query to complete ...")
            time.sleep(1)
            response = client.get_query_results(queryId=query_id)

        print("Gathered results from CloudWatch !")
        print(f"Found {len(response.get('results', {}))} results")

    parsed_response = []
    for result in range(len(response["results"])):
        element = {
            "timestamp": response["results"][result][0]["value"],
            "username": response["results"][result][1]["value"],
            "action": response["results"][result][2]["value"],
            "service": response["results"][result][3]["value"],
        }

        parsed_response.append(element)

    return parsed_response


def lambda_handler(event, context):
    print(event)
    print(context)

    client = boto3.client("sns")
    retrieved_events = retrieve_events()

    message = "Greetings from IAC_Enforcement Lambda ! \n \n"
    message += f"On the account : {account_alias()} \n"
    message += f"We have detected {len(retrieved_events)} custom manual modification change within last 2 minutes: \n"
    for event in retrieved_events:
        message += str(event)
        message += "\n"

    message += "\n \n"
    message += "Please take appropriate actions. You can also access the iac_enforcement for additional details. \n \n"
    message += f"Email Generated: {str(datetime.today())}"
    resp = client.publish(
        TargetArn=sns_arn,
        Message=message,
        Subject=f"IAC Enforcement Alert for account {account_alias()}",
    )


if __name__ == "__main__":
    lambda_handler(event="", context="")
lambda.py

A few remarks about the retrieve_events function

This is a request to CloudWatch Logs to retrieve the root cause of the alarm. One particular variable here is “query”, which contains multiple elements:

  • Fields: these are the CloudWatch results details that you want to retrieve in your message. Each field will be delivered as a part of the response[“result”] array.
  • Sort: useful to ensure that the latest results matching the query are delivered first.
  • Filter: this is the key to the query; it is how you retrieve the same metric details used for your CloudWatch metric (and alarm!) generation.

The “query” field is basically the same request you would have made to retrieve the results manually from the CloudWatch Logs Insights’ interface.

Querying CloudWatch is a multi-step process:

  1. First you need to place the query (client.start_query) on a given log_group, for a given time range. A QueryID will be returned by CloudWatch.
  2. Then you need to retrieve the query details and wait until the process is over.
  3. Finally, you can retrieve the query results.
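The results returned in the final step arrive as rows of {field, value} pairs, one pair per field requested in the query. A small standalone helper (sketched here against a hand-made response, not a live API call) shows how those rows map to plain dictionaries like the ones the Lambda emails out:

```python
def parse_query_rows(results):
    """Flatten CloudWatch Logs Insights rows ([{field, value}, ...])
    into plain dicts keyed by field name."""
    return [{col["field"]: col["value"] for col in row} for row in results]


# Hand-made sample mimicking a client.get_query_results() response.
sample = {
    "status": "Complete",
    "results": [
        [
            {"field": "@timestamp", "value": "2023-01-01 12:00:00.000"},
            {"field": "eventName", "value": "DeleteBucket"},
            {"field": "eventSource", "value": "s3.amazonaws.com"},
        ]
    ],
}

rows = parse_query_rows(sample["results"])
print(rows[0]["eventName"])  # DeleteBucket
```

Keying by field name rather than position also makes the parsing resilient if the order of the fields in the query ever changes.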

Putting the cart before the horse

We had to add another surprising condition to the function to make it work: a retry loop of up to five minutes until we finally get valid results from CloudWatch. We noticed this requirement while developing the Lambda. EventBridge was notified, and the Lambda started, before CloudWatch had finished indexing the matching log entries, so the query returned zero results. So, we added a five-minute grace delay to ensure the relevant results are found before inserting them into the email notification.
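Stripped of the AWS calls, that grace delay boils down to a generic “poll until non-empty or timeout” loop. In this sketch, fetch, clock and sleep are stand-ins for the CloudWatch query, time.time and time.sleep:

```python
import time


def poll_until_results(fetch, timeout=300, interval=1.0,
                       clock=time.time, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result
    or timeout (seconds) expires."""
    deadline = clock() + timeout
    results = fetch()
    while not results and clock() < deadline:
        sleep(interval)
        results = fetch()
    return results


# Simulated CloudWatch: empty answers at first, then results appear.
answers = iter([[], [], [{"eventName": "DeleteBucket"}]])
found = poll_until_results(lambda: next(answers), timeout=10,
                           sleep=lambda s: None)
print(found)  # [{'eventName': 'DeleteBucket'}]
```

Injecting clock and sleep as parameters keeps the loop testable without actually waiting.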

Final result received by email


The Lambda issues a nicely formatted email with the details of the alert (when, where, what, who). You just have to investigate the “why”, of course.

Let’s set up the Lambda!

Create IAM role

The Lambda requires a dedicated role to operate, with several allowed actions:

  • logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
    This is a requirement in order to get output logs from the Lambda.
  • SNS:Publish
    The Lambda will be sending an email via SNS, so it should be allowed to publish to a SNS Topic.
  • logs:StartQuery, logs:GetQueryResults
    To provide useful content for the email, it’s necessary to search for relevant content in CloudWatch.
  • iam:ListAccountAliases
    As we want to insert the user-friendly account name within the email message, this operation is also required.
resource "aws_iam_role" "iam_for_lambda" {
  name = "iam-for-lambda-iac-enforcement-${data.aws_region.current.name}"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}


resource "aws_iam_role_policy" "test_policy" {
  name = "sns-publish"
  role = aws_iam_role.iam_for_lambda.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "sns:Publish",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "logs:StartQuery",
          "logs:GetQueryResults",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "iam:ListAccountAliases",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}
lambda_iam.tf

Create SNS topic

We also need a simple SNS topic to exchange emails with the recipients. Recipients are linked to subscriptions, and deliveries can take multiple forms, including SMS or email. Pick the Email delivery type, and don’t forget to confirm the subscription once the topic has been configured.


resource "aws_sns_topic" "alert_sre" {
  name = "alert-sre"
}

resource "aws_sns_topic_subscription" "sre_email_subscription" {
  topic_arn = aws_sns_topic.alert_sre.arn
  protocol  = "email"
  endpoint  = "your.email@domain.com"
}
lambda_sns.tf

Create the LogGroup

It is preferable to manage the CloudWatch LogGroup that will receive the Lambda logs yourself, as the LogGroup created automatically has infinite log retention. Make sure you stick with the expected LogGroup name, or AWS will automatically create another one.

resource "aws_cloudwatch_log_group" "lambda_iac" {
  name              = "/aws/lambda/${aws_lambda_function.notification_lambda.function_name}"
  retention_in_days = var.logs_retention_days
}
lambda_cloudwatch_log_group.tf

Lambda set-up with Terraform

Upload Python Lambda

The Lambda requires parsable source code, submitted as a ZIP file, along with an entrypoint declaration. It is entirely possible to manage Lambda uploads from Terraform!

First, we need to generate a ZIP archive. It is possible to use a null-resource (local-exec) to do so, but this is slightly painful (the local-exec is normally rendered once, at null-resource creation). Instead, you can use the Archive Provider, which will automatically maintain the ZIP package for you.

data "archive_file" "zip_lambda" {
  type        = "zip"
  source_file = "${path.module}/lambda.py"
  output_path = "${path.module}/lambda_function_payload.zip"
}
lambda_zip.tf

While creating the aws_lambda_function, take note of the source_code_hash argument, linked to the ZIP file. This allows Terraform to detect any change to the package (using a hash) and only upload the function when it has been modified.

resource "aws_lambda_function" "notification_lambda" {
  filename      = data.archive_file.zip_lambda.output_path
  function_name = "lambda_iac_alerting"
  role          = aws_iam_role.iam_for_lambda.arn
  handler       = "lambda.lambda_handler"
  timeout       = 300

  source_code_hash = data.archive_file.zip_lambda.output_base64sha256

  runtime = "python3.9"


  environment {
    variables = {
      SNS_TOPIC_ARN  = "${aws_sns_topic.alert_sre.arn}"
      MONITORED_ROLE = var.monitored_role
    }
  }

  depends_on = [data.archive_file.zip_lambda]

}
lambda_function.tf

Two environment variables are defined here: the SNS topic to publish to, and another variable used in the CloudWatch Request.
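The source_code_hash mentioned above is, conceptually, the base64-encoded SHA-256 digest of the ZIP file, the same value the Archive Provider exposes as output_base64sha256. A quick Python illustration of the idea (a sketch, not part of the deployment):

```python
import base64
import hashlib
import os
import tempfile


def source_code_hash(path):
    """Base64-encoded SHA-256 of a file, mirroring the idea behind
    Terraform's output_base64sha256 attribute."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).digest()
    return base64.b64encode(digest).decode("ascii")


# Any byte change in the package yields a different hash,
# which is what lets Terraform decide whether to re-upload.
with tempfile.NamedTemporaryFile(delete=False, suffix=".zip") as f:
    f.write(b"payload-v1")
h1 = source_code_hash(f.name)
with open(f.name, "wb") as fh:
    fh.write(b"payload-v2")
h2 = source_code_hash(f.name)
print(h1 != h2)  # True
os.unlink(f.name)
```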

Testing the Lambda

Once the Lambda has been pushed, it’s easy to trigger it manually from the AWS Console. Take a moment to ensure that the Lambda is operating normally and that an email is correctly sent. There is no need to use the Lambda publish function here, as the versioning is directly managed via Terraform.

In the event that the Lambda isn’t able to run, we still want to be warned, which is why a failure destination is also configured. AWS will send an awful JSON payload in case of failure, but at least you’ll have an idea of what’s wrong.


resource "aws_lambda_function_event_invoke_config" "errors_notifications" {
  function_name = aws_lambda_function.notification_lambda.function_name
  qualifier     = "$LATEST"

  destination_config {
    on_failure {
      destination = aws_sns_topic.alert_sre.arn
    }

  }
}
lambda_invoke_function.tf

CloudWatch alerts

Set up EventBridge

A CloudWatch alert can trigger actions, but these are limited to sending the default message to an existing SNS topic. However, CloudWatch also sends its events to the EventBridge default bus; the details needed for the EventBridge Rule can be retrieved from the Alarm details.


EventBridge configuration

EventBridge drives events into a bus. Each account already has a predefined one (the Default Bus), so we don’t need any Terraform to create a bus.

resource "aws_cloudwatch_event_rule" "iac_alerting" {
  name          = "iac_enforcement_alarm"
  description   = "IAC enforcement alarm"
  event_pattern = <<EOF
{
  "source": [
    "aws.cloudwatch"
  ],
  "detail-type": [
    "CloudWatch Alarm State Change"
  ],
  "resources": [
    "${aws_cloudwatch_metric_alarm.this.arn}"
  ],
  "detail": {
    "state": {
      "value": [ "ALARM" ]
     }
  }
}
EOF
}
lambda_event_rule_cloudwatch.tf

The Rule by itself doesn’t perform any operation until it is linked to an event target. We’ll target our Lambda here. Several configuration details can be set, including the retry policy and dead letter queue. To keep things simple, we’ll stick with default values.

resource "aws_cloudwatch_event_target" "iac_alerting" {
  rule      = aws_cloudwatch_event_rule.iac_alerting.name
  target_id = "iac_alerting"
  arn       = aws_lambda_function.notification_lambda.arn
}
lambda_event_target_cloudwatch.tf

Connect the Lambda

Then we need to grant the EventBridge Target permission to run the Lambda, with an explicit permission at the Lambda level. Basically, we are setting the Lambda properties to allow executions initiated by the CloudWatch Event Rule that we have just defined.


resource "aws_lambda_permission" "cloudwatch_invokations" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.notification_lambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.iac_alerting.arn
}
lambda_permission.tf

A side note about the CloudWatch Query stored into the Lambda


No doubt your veteran eye spotted the CloudWatch LogGroup query stored directly in the Lambda. This is not ideal, because you have to update the query twice (once in the CloudWatch Log Metric Filter and again in the Lambda source code).


To avoid embedding the query within the Lambda, the only option is to retrieve it from an existing CloudWatch dashboard. However, this could have the unwanted side-effect of coupling the Lambda to the dashboard (a modification to the dashboard would affect the Lambda).

So as a compromise, I have left the query filter directly within the Lambda.

Another implementation to achieve the same purpose

Query or subscribe… that is the question.

For this alerting design, we’ve chosen to use a Lambda that performs queries to the CloudWatch Log Group. However, a similar pattern is also possible with CloudWatch Subscriptions.

Simpler architecture based on CloudWatch Subscription to Lambda

A Subscription is created with the same query filter we used in the Lambda (in CloudWatch Query Format), and each log entry matching the filter triggers a Lambda invocation.

Create Lambda

The advantage of using a CloudWatch Subscription Filter is that the payload sent to the Lambda already contains the desired CloudWatch event with all its details. There is no need to query CloudWatch and wait for the Lambda to retrieve results.

The drawback is that every single event matched by the filter triggers the Lambda, which can result in significantly more emails being sent. Also, you cannot group multiple events into a single email (as we do with the query-based Lambda).
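With the subscription approach, the Lambda receives the matching log events directly, base64-encoded and gzip-compressed under event["awslogs"]["data"]. A minimal decoding sketch follows; the payload here is hand-made to mimic the delivered format:

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode a CloudWatch Logs subscription payload into its logEvents."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return payload.get("logEvents", [])


# Hand-made sample mimicking what CloudWatch Logs delivers to the Lambda.
raw = {
    "logGroup": "iac-enforcement",
    "logEvents": [
        {"id": "1", "timestamp": 1700000000000,
         "message": '{"eventName": "DeleteBucket"}'},
    ],
}
sample_event = {"awslogs": {"data": base64.b64encode(
    gzip.compress(json.dumps(raw).encode("utf-8"))).decode("ascii")}}

events = decode_subscription_event(sample_event)
print(events[0]["message"])  # {"eventName": "DeleteBucket"}
```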

Conclusion

Thanks to a simple Lambda, you get a much better error message. The alerting mechanism is also reliable: the Lambda failure destination will warn you even if the Lambda itself fails.


It is much more comfortable to get the alarm details right in your mailbox, because you don’t have to connect to the AWS Console and search manually within CloudWatch.

The setup costs are minimal, as there is no fixed charge for EventBridge, Lambda or SNS, and the cost of each Lambda invocation is very low too.

Now you’ve seen how it’s done, I hope you’ll try using a Lambda to make your CloudWatch alerts more useful.
