Meet Amazon’s Wonder Woman – Athena

Meet Amazon’s Wonder Woman – Athena

I have always meant to experiment with Amazon Athena, similar to how I would explore different services in the AWS ecosystem for ways to enhance the plethora of workflows on which I work every day.  However, it is rare that my real-life exposure to a new service was under stress. Despite the risk of hyperbole, I would still say that Athena saved my day.

The Advent of Athena

Amazon Athena was first introduced during re:Invent 2016, with the tag line “an interactive query service that makes it easy to analyse data in S3.” Somehow I have always liked the name Athena, the guardian of wisdom and bravery. All in all, a good name is essential to good branding.

Athena coming to the Rescue

It was another day at work, another day in which I got to write some Terraform code to manage the infrastructure. I was pretty confident about the code, so I was expecting my terraform plan to be successful. My mind was already on the next task, just waiting for the formal process to complete such that I could commit the code.

Yet again, problems emerge when you least expect them. My minimal terraform code change led to a plan error. Worse, the error was RequestLimitExceeded. If I tried to run it again, it would likely succeed because (a) I could see the same errors appear time to time if I ran some innocuous AWS CLI commands such as aws ec2 describe-instances(b) the update I was trying to make was so straightforward that there was little room for ambiguous exceptions. Intermittent errors are always tricky to resolve because it is heavily dependent on the conditions during executions. In this case, an error related to an account-wide feature, such as the number of API requests (which is purported to be 5 per second), was near impossible to reproduce and troubleshoot because I only had insight into my workflows, while there were hundreds of instances running in a single region, not to mention the teams of developers that could be triggering workflow against one of the many services in AWS.

Gladly, based on AWS security best practices, AWS CloudTrail had been turned on in all regions in the account to capture API requests. We had until then never run into any kind of real issues in the account and the logs had been sitting quietly in the dedicated S3 bucket. Surely it was time to get the logs to help? The CloudTrail logs are but JSON files; I should be able to collate and find pertinent information to narrow down who/what had been issuing an excessive amount of calls to impact account-wide access.

I was then in a pinch -- it was true that I was first to discover the problem, but I could guarantee that other users of the account would complain in no time, especially for a problem that was as pervasive and obvious. Of course I could write a nice parsing function to go through the CloudTrail logs to give me an idea of what to look for, then iterate to identify the culprit. But what I needed then was something fast and ad-hoc. I just wanted to get some sense of what the logs could tell me, instead of spending time provisioning a cluster of instances to digest the logs in the bucket.

In short, I needed an analytic tool to get me some insights into the the CloudTrail log data in S3. That sounds exactly like what Athena -- the interactive query tool that makes it easy to analyse data in S3 -- was born to do.

Configuring Athena

Athena is an on-demand service built on Apache Presto (distributed SQL query) and supports ANSI SQL. As I learnt from my previous experience with EMR earlier, I needed to first identify the SerDe (Serialize/Deserializer) that understood CloudTrail logs. While Athena did not include a custom CloudTrail SerDe when it was made generally available (leading to my early attempts to raise EMR clusters just for analysing CloudTrail logs -- but that was another story), the CloudTrail SerDe has been introduced a few months ago. Eventually, that was what I did

CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
userIdentity STRUCT<
 type:STRING,
 principalid:STRING,
 arn:STRING,
 accountid:STRING,
 invokedby:STRING,
 accesskeyid:STRING,
 userName:STRING,
 sessioncontext:STRUCT<
  attributes:STRUCT< mfaauthenticated:STRING, creationdate:STRING>,
  sessionIssuer:STRUCT< type:STRING, principalId:STRING, arn:STRING, accountId:STRING, userName:STRING>>>,
eventTime STRING,
eventSource STRING,
eventName STRING,
awsRegion STRING,
sourceIpAddress STRING,
userAgent STRING,
errorCode STRING,
errorMessage STRING,
requestParameters STRING,
responseElements STRING,
additionalEventData STRING,
requestId STRING,
eventId STRING,
resources ARRAY<STRUCT< ARN:STRING,accountId: STRING,type:STRING>>,
eventType STRING,
apiVersion STRING,
readOnly STRING,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcEndpointId STRING
)
PARTITIONED BY (region string, year string, month string, day string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://_cloudtrail_bucket/_cloudtrail_prefixes/'

where s3://_cloudtrail_bucket/_cloudtrail_prefixes/ is the location at which the CloudTrail logs are stored. The above is based on the standard practice as advertised by the Athena-CloudTrail documentations, with the enhancement of partitioning by regions, year, time and day. Partitioning is especially useful when specific dates are queried. I was then only interested in data of very recent day(s); with partitioning, instead of querying all files in the S3 location used to set up the table even when particular dates are specified in the query, I could specify the partitions of interest and therefore get the results I needed a lot faster (and cheaper).

With partitioning being used, I did have to run a few extra commands to set up the partitions for each day.

ALTER TABLE cloudtrail_logs ADD IF NOT EXISTS PARTITION (region='us-east-1',year='2017',month='06',day='15') 
LOCATION
's3://_cloudtrail_bucket/_cloudtrail_prefixes/us-east-1/2017/06/15/'

ALTER TABLE cloudtrail_logs ADD IF NOT EXISTS PARTITION (region='us-east-1',year='2017',month='06',day='16') 
LOCATION
's3://_cloudtrail_bucket/_cloudtrail_prefixes/us-east-1/2017/06/16/'

Once the schema is set up, it is time to start querying against the data.

At facing the problem of API request rate exceeding maximum, the first thing I did was to query for number of activities. I was interested in seeing who did what where and at what frequency, which was captured in the following query

Select eventname, useridentity.arn as arn, sourceipaddress, count(*) as count from cloudtrail_logs 
where eventTime like '2017-06-15%' and region='us-east-1' and year='2017' and month='06' and (day='15' or day='16')
group by eventname, useridentity.arn, sourceipaddress order by count desc 
limit 20

This yielded

"eventname","arn","sourceipaddress","count"
"DescribeAutoScalingGroups","arn:aws:sts::XXXXXXXXXX:assumed-role/masters.apps-dev-green...","_some_ip_","217236"
"DescribeAutoScalingGroups","arn:aws:sts::XXXXXXXXXX:assumed-role/masters.apps-prod-blue...","_some_ip_","104248"
"DescribeAutoScalingGroups","arn:aws:sts::XXXXXXXXXX:assumed-role/masters.apps-dev-blue...","_some_ip_","95930"
"DescribeAutoScalingGroups","arn:aws:sts::XXXXXXXXXX:assumed-role/masters.apps-prod-green...","_some_ip_","93086"
"DescribeAutoScalingGroups","arn:aws:sts::XXXXXXXXXX:assumed-role/masters.apps-qa-green...","_some_ip_","92797"
...

The first two entries alone constituted almost 4 calls per second; no wonder with the various activities in the account, the number of API calls per second easily exceeded the recommended 5.

Naturally I have removed the sensitive information from the output above. Suffice to say, the result was bit of a surprise -- it identified a culprit that I was not expecting. With the output, it took only a few minutes to discover that it was a misconfigured cluster-autoscaler from a kubernetes cluster that led to the API rate issue, which most certainly was not related to either my terraform code change, the innocent ec2 describe-instances call or any of the monitoring tools.

Would I eventually be able to discern the same root cause directly from the S3 bucket, with my home-grown parsing functions, or by running the CloudTrail SerDe directly on an EMR cluster? Quite likely. Would I however be able to run the query as easily and iterate as conveniently to get the results? Under the current circumstances, I honestly do not think so. On top of that, when one accounts for the pricing of Athena ($5/TB) and the wait-time -- 2.51s with 121.71MB of data scanned -- Athena sounds almost too good to be true. The service takes next to no time to set up, and it poses absolutely neither management nor operational overhead of the computing resources. It is especially useful when one is in a pinch and needs something no fuss and quickly done. I was in such a case, and I was a happy customer.

(Footnote: even in the case of completely unpartitioned data, query against which is equivalent to scanning the entire S3 bucket of CloudTrail logs, it took me 5.5 minutes with ~12GB of data to get the same results -- I regrettably tend to spend a lot more time and money on a cup of coffee.)

Proactive Athena

So that was a convenient foray into Amazon Athena. It was a purely reactive use based on exceptions observed, but how can I get more value from Athena and incorporate it into my day-to-day workflow?

CloudTrail is an essential service providing assets for auditing, as within the CloudTrail logs are a wealth of information on API calls -- access and updates -- made to the account and the resources within. If DevOps -- and increasingly DevSecOps -- recommends proactive actions over reactive responses, a tool that automatically extracts interesting (i.e. valuable and actionable) information from the captured data can most certainly help toward the goal of timely and concise actions.

Athena can be used via (a) its interactive GUI on the AWS console (b) JDBC (c) and API. I am especially interested in the option of the API because it offers vast opportunities for automation.  The following is the skeleton program to programmatically trigger Athena, leveraging the python AWS (boto3) sdk, centering around the API call StartQueryExecution

import boto3
query='''Select … '''  # the query statement 
database='default'    # the Athena database where the table is located 
output_location='s3://…'
athena_client=boto3.client('athena')
response=athena_client.start_query_execution(
QueryString=query,
QueryExecutionContext={'Database': database},
ResultConfiguration={'OutputLocation': output_location}
)

The resultant data is saved in an S3 bucket available for further processing. For example, an output-processing Lambda function can be triggered whenever there is new Athena output data (in form of .csv files) added to the S3 bucket, allowing (a) notification and alerts sent whenever thresholds are reached (b) sending the data to a data warehouse (e.g. RedShift) should analyses against other structured data be required later (c) even immediately responding to anomalies such as updating networking ACL and S3 bucket policies to curb undesired behaviours. In fact, other than the processing function, even the Athena-triggering (and table partition creation) function can be a Lambda function scheduled with CloudWatch events, resulting in an end-to-end workflow consisting of largely serverless and managed services that requires no day-to-day maintenance. Some of the potential workflows on analytics, reporting, notifications and responses are illustrated below. Similar to other AWS workflow, the services are highly customisable to create bespoke workflows -- the possibility is limitless.

Figure 1: Fully automated Athena process from CloudWatch Events trigger to output analysis, notifications and reporting

Next Steps

The illustrated example of Athena analysing CloudTrail logs is but a beginning into the myriad of ways to leverage collected data for alerting, analytics and reporting purposes. As the entry barrier for analytics is constantly being lowered, a lot of the data previously seems to be insurmountable suddenly can be analysed with relative ease. Be it structured data, be it unstructured data, there will likely be tools that can extract values from it and serve your needs and it is a matter of being bold and giving it try. If you find you are getting stuck, have performance issues, or need a hand in building a complex pipeline, TriNimbus can help you out.

Comments are closed.