It is no surprise that data storage formats play a key role in modern data engineering. Selecting an appropriate format can reduce both query time and storage space. We are all familiar with the CSV, XML and JSON formats, but they have limitations when applied to big data. They offer little benefit for aggregation queries or for querying partitioned data, and storing them on disk gives no opportunity for parallel processing (each such file is processed sequentially).
In such scenarios, a columnar data format speeds up aggregation queries and saves space through better compression. It is fair to say that two Apache projects, Parquet and ORC, have won the race to become the most popular and widely used columnar formats. Both are well supported across the Hadoop ecosystem, and many cloud providers have added them to their services.
In this article, I will share how to transform CSV data into Parquet using AWS Glue. I prefer Glue because it is serverless and uses Spark as the compute engine for transforming the input data. AWS Glue provides templates for general-purpose transformations, and I will refer to the one that converts CSV to Parquet. I prefer Parquet over ORC due to its larger default block size (128 MB).
The data set used in this tutorial is from an online retail store and is publicly available here. I have uploaded the file to an S3 bucket, which serves as the input bucket.
Before we get into Glue, let's try this transformation locally using Spark and a Jupyter notebook. After reading the input file into a Spark DataFrame, let us observe a few lines.
Spark uses Snappy as the default compression codec for Parquet files (see the Spark configuration documentation: https://spark.apache.org/docs/2.4.7/sql-data-sources-parquet.html#configuration).
This looks easy, but it is not feasible for big data sets stored in a distributed file system such as HDFS or object storage such as S3. AWS Glue is one service we can use to automate such transformation steps.
(1) Create a table in the Glue Data Catalog
You can manually create a table representing the raw CSV data or, better, let a Glue crawler scan the input CSV data and create it for you. I used a crawler to scan the data and create the table that serves as the source for this tutorial.
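For readers who prefer scripting over the console, the crawler step can be sketched with boto3. This is an assumption-laden sketch: the function, crawler name, role and paths are illustrative, and the call only mirrors what the console does.

```python
def run_csv_crawler(crawler_name: str, role_arn: str, database: str, s3_path: str):
    """Create (if needed) and start a Glue crawler over the raw CSV bucket.

    All arguments are placeholders you must supply, e.g.
    s3_path = "s3://my-input-bucket/retail-csv/".
    """
    import boto3  # imported lazily so the sketch stays importable without AWS
    glue = boto3.client("glue")
    try:
        glue.create_crawler(
            Name=crawler_name,
            Role=role_arn,
            DatabaseName=database,
            Targets={"S3Targets": [{"Path": s3_path}]},
        )
    except glue.exceptions.AlreadyExistsException:
        pass  # crawler was created on a previous run; reuse it
    glue.start_crawler(Name=crawler_name)
```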
This is an external table (metadata only) integrated with Amazon Athena, so you can use Athena to query it (our data is in the S3 bucket and will remain there).
(2) Create a Glue job to transform CSV to Parquet
Next, we need to create a Glue job that reads from this source table and S3 bucket, transforms the data into Parquet, and stores the resulting Parquet files in an output S3 bucket. AWS has made it very easy to apply well-known transformations by providing templates. I will use one such template and run the job.
Rather than repeat the steps here, I will point you to this AWS blog, which shows how to use the Glue console to create a job that transforms CSV files to Parquet. After the job completes successfully, you should see Parquet files in the S3 location you provided.
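The core of the generated job script boils down to a few Glue API calls. The sketch below condenses that logic into a function; the database, table and output path are placeholders, and in a real Glue job this code runs inside the job script with the `awsglue` library available.

```python
def glue_csv_to_parquet(database: str, table: str, out_path: str):
    """Core of a Glue ETL job: read a catalog table, write Parquet to S3.

    Placeholders: database/table come from the crawler step,
    out_path is e.g. "s3://my-output-bucket/retail-parquet/".
    """
    # awsglue is only available inside the Glue job runtime,
    # so it is imported lazily here.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_ctx = GlueContext(SparkContext.getOrCreate())
    dyf = glue_ctx.create_dynamic_frame.from_catalog(
        database=database, table_name=table)
    glue_ctx.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": out_path},
        format="parquet",
    )
```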
In addition to writing Parquet files, Glue can also create a target table referencing the transformed data set, so you can start using Athena to query the Parquet data right away.
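Querying the target table from code can be sketched with the Athena API via boto3. The SQL, database and result-bucket names in the commented example are hypothetical stand-ins for your own.

```python
def query_parquet_table(sql: str, database: str, s3_results: str) -> str:
    """Submit a SQL query to Athena and return the query execution id.

    Athena writes result files to s3_results; poll get_query_execution
    with the returned id to check completion.
    """
    import boto3  # imported lazily so the sketch stays importable without AWS
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": s3_results},
    )
    return resp["QueryExecutionId"]

# Example (table and bucket names are placeholders):
# query_parquet_table(
#     "SELECT country, SUM(quantity) AS qty FROM retail_parquet GROUP BY country",
#     "retail_db",
#     "s3://my-athena-results/",
# )
```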
If you are facing issues running the Glue-generated job, please check the following:
- Does the IAM role used by Glue have permission to read from and write to the desired S3 buckets?
- Does the S3 bucket use encryption with a key the Glue job was not granted access to?
- Is your CSV file UTF-8 encoded? (this thread)
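The last item in the checklist is easy to verify locally before uploading. A small helper (my own addition, not part of the tutorial) can confirm a file decodes cleanly as UTF-8:

```python
def is_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

If this returns False, re-save the CSV with UTF-8 encoding (for example via `iconv` or your editor) before handing it to Glue.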
The following are excellent blog posts from AWS covering various techniques around Glue and Spark. I strongly suggest going through them.
A couple of other good blogs for reference: