Amazon S3 (Simple Storage Service) is a very common and popular choice when it comes to a cheap, durable and highly available data lake. Quite often, teams want to interactively query big data sets directly from S3 for a variety of use cases. Amazon Athena helps users by providing a serverless option using standard SQL to run ad-hoc queries.
Athena charges by data scanned per query, which as of writing this post is $5 per TB rounded to 10 MB for us-east. For big data sets that are in CSV for JSON format, cost of Athena queries can go North. Converting into a columnar format to save cost is well documented out of which CTAS is one such option that doesn’t require creating and running Spark jobs in AWS EMR.
In this article, I share use cases and few limitations when using CTAS, short for CREATE TABLE AS SELECT to convert data set into a columnar format and save cost. The example used in this article is based on a publicly available Covid-19 data set.
To begin with, let us upload this raw data to an S3 bucket.
C:\> aws s3 cp covid19-eu.csv s3://techwithcloud/rawdata/ --sse upload: .\covid19-eu.csv to s3://techwithcloud/rawdata/covid19-eu.csv
I also created a CloudTrail to monitor the Read and Write API calls for this bucket to examine Athena calls to S3. Next, create a table in Athena for this raw data set.
CREATE EXTERNAL TABLE IF NOT EXISTS covid19_rawdata ( `date` DATE, day SMALLINT, month SMALLINT, year SMALLINT, cases INT, deaths INT, countryname STRING, geoId STRING, countrycode STRING, popData2019 BIGINT, continentExp STRING, cases_per_100K_for_14d DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://techwithcloud/rawdata/' TBLPROPERTIES ( 'skip.header.line.count'='1' );
Run a query to get new cases per month per continent from this data set.
Next, create a new table in Athena using CTAS pattern and configure the output as “Parquet with Snappy compression”. I prefer Parquet because it has a bigger block size as compared to ORC.
CREATE TABLE IF NOT EXISTS covid19_parquet WITH ( format='PARQUET', parquet_compression = 'SNAPPY', external_location='s3://techwithcloud/parquetdata' ) AS SELECT * FROM covid19_rawdata
Now running the same query on this table produces the following
As you can see the amount of data scanned with parquet is 567 KB as against 4 MB scanned in raw CSV data (saving of about 13%). Since Athena charges for data scanned, this a pretty decent saving. AWS has written a nice blog on Athena performance tuning which will help the case to use a columnar format (e.g. Parquet) for cost-saving.
Now, let us examine the Cloudtrail logs to see how many API calls were made to S3 by Athena (after all, these calls are chargeable too).
I had executed 2 queries, the first was “CREATE TABLE AS SELECT” and the second was a “SELECT”. After loading the Cloudtrail logs in Elasticsearch and visualizing them in Kibana, I saw total of 35 entries when filtered for
"userIdentity.invokedBy":"athena.amazonaws.com" with the majority of calls being GetObject followed by HeadObject.
CTAS use cases
- Short-term queries on huge data set e.g. Security team wants to audit logs every quarter and require access for 2 weeks.
- Repeated queries on the same data set for specific time e.g. QA team would like to run multiple queries on last month’s data.
- Queries involving specific columns from huge data set e.g. Data Science team is interested in only 8 columns and not interested in other 20 columns. Once the model is trained, they don’t require to query this data.
- Even though compressed, there will be two copies of data (one is source raw data and another is the parquet file created as a result of the CTAS query. As per this AWS blog, a 1 TB Elastic Load Balancer (ELB) logs data resulted in 130 GB compressed Parquet files. This looks good athena saving but if these parquet files are leftover then you will be charged an extra $3 per month in this example. This might become hundreds of dollars for production data.
- If new data is uploaded in S3, then you have to options
- Delete your existing parquet file from S3 and rerun the CTAS query
- Use “INSERT INTO” to add new data in your existing Parquet table
- Loading new partitions in the source rawdata table doesn’t effect CTAS table
- Not specific to CTAS but related to Athena in general. It doesn’t delete temp files in S3 on your behalf. You must create a bucket lifecycle policy to avoid being charged for Athena files, you don’t require any more.
CTAS pattern is a pretty good choice to convert raw data into a columnar format and save Athena cost that requires much less overhead. However, it is more suitable for use cases involving short term analysis of huge data set and not to be used as CSV/JSON to Parquet conversion step in data pipeline.