athena

Visualize Amazon S3 data using Apache Superset

Exploratory Data Analysis (EDA) is an important process for many use cases when you are dealing with data. It is much helpful if you get an open-source data exploration and visualization tool that supports a variety of data sources, considering, in the Big Data ecosystem, you always have more than one database. On these lines, I found Apache Superset to be quite impressive and very helpful.

In this article, I am sharing how to use Superset to explore and visualize data stored in Amazon S3, which is a popular choice for data lake. It took me few iterations to get it to work so hopefully, this post will help people get started.

Installing Superset

To install Superset, I followed instructions as mentioned in the docker hub page of superset. However, when I tried to open it in the browser, I got a lot of “Access Denied” errors. After checking these steps and playing with them, I found that “Step 4: Setup roles”, should be done right after “Step 2”. When you load examples before role setting then Superset throws these errors. I have created an issue https://github.com/apache/superset/issues/13867 with the Superset community to fix it.

So the correct sequence is as follows

# download docker image and run superset on port 8080
docker run -d -p 8080:8088 --name superset apache/superset

# With above container running, execute below commands on new terminal
docker exec -it superset superset fab create-admin \
               --username admin \
               --firstname Superset \
               --lastname Admin \
               --email admin@superset.com \
               --password admin

docker exec -it superset superset db upgrade
docker exec -it superset superset init
# optional but helpful to get started
docker exec -it superset superset load_examples
# save the changes to the image
docker commit <superset-container-id> apache/superset

Adding Athena Driver in Superset

Tip – You must have created Athena tables referencing the data in S3. Superset will query Athena tables and use the output for visualization. Since Superset doesn’t access S3 directly, it is transparent to the file format of data stored in S3 (CSV, ORC, Parquet etc).

As of now, the easiest way to visualize data in S3 is to use Superset’s Athena driver. The complete list of supported DB drivers is published here. For Athena, I am using PyAthena and not the JDBC one. The minimum Dockerfile to pull and build superset image that will include PyAthena connector is shown below.

FROM apache/superset

# Switching to root to install the required packages
USER root

# Installing Athena driver
RUN pip install "PyAthena>1.2.0"

# Switching back to using the `superset` user
USER superset

Build and run the container again using this updated image. The data used in this article is of a ‘Top 50 Bestselling Books’ on Amazon from Kaggle. As a prerequisite for this tutorial, I created a table in Athena as shown below which points to my S3 bucket having this sample dataset.

CREATE EXTERNAL TABLE bestseller_books(
  name string,
  author string,
  rating double,
  reviews int,
  price double,
  year int,
  genre string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"'
)
LOCATION 's3://techwithcloud/rawdata/input/'
TBLPROPERTIES ('skip.header.line.count'='1')

Adding Athena as Database

Now, login to the Superset UI by opening http://localhost:8080 (default credential is admin/admin) and go to “Data” -> “Databases”. Then click on “+ DATABASE” button, to bring up below page

Add a DB connection in Superset

Provide a DB name and the SQLAlchemy URI (connection string) in the following format. Please note that below URI doesn’t contain Athena schema name. Superset will fetch all schema and allows you to choose while creating a dataset. You can find more options to use in the connection string here.

awsathena+rest://<your_aws_access_key_id>:<your_aws_secret_access_key>@athena.<aws_region>.amazonaws.com/?s3_staging_dir=<path_to_athena_query_output>

As a next step, I recommend going to “SQL Lab” -> “SQL Editor” and selecting the Athena table from the drop down. Superset will automatically fetch few rows (100 rows in my case). If you can see data in the preview then Superset and Athena (or S3) can communicate properly.

Superset: Preview of Athena Table

Let us try running a query directly in Superset to find out those authors who had the most number of books sold.

Superset shows the time taken to execute this query as well as an “Explore” option. Clicking on Explore and following instructions shown on the page brings us to the following very useful section of Superset. Note that till now, we haven’t created any dataset from the table.

Superset: Exploring SQL query output

Create Dataset

Now let us create a dataset from this table. Creating new dataset is very easy, navigate to “Data” -> “Datasets” -> “+ DATASET”. Then select the Athena table, add the dataset. You should see something like below

Superset: new dataset

Create Charts

Now that we have a dataset, let us create 2 charts and add these charts in a dashboard to conclude this tutorial. Navigate to “Charts” -> “+ CHART”, select the dataset name created in the above step, leave the visualization type as “Table”. This brings us to this page where count(*) is the default metric.

Edit the settings as shown below to get all rows in a table format and save it. This will be our first and simple chart.

Superset: Table chart having all rows

For the second chart, I want to see how the books were brought by Genre. Create a new chart and select “Bar Chart” as visualization type. Edit the settings as shown below to create our second chart. Also, in the “CUSTOMIZE” tab, select two check boxes called “SORT BARS” & “LEGEND” and provide labels for the ‘X’ and ‘Y’ axis.

Adding charts to Dashboard

So far, we have created 2 charts, and now time to add them to a dashboard.

  1. Navigate to “Dashboards” -> “+ DASHBOARD”
  2. From the “COMPONENTS” tab drag and drop “Row” twice.
  3. From the “CHARTS” tab select the two charts created and drop them in the two Empty Row spaces.
  4. Save the dashboard and you should have a similar page as shown below. (click on the “Draft” icon next to the dashboard name to Publish it to the home page)
Superset: Dashboard

This concludes the tutorial. Thanks for reading.

References

3 replies »

  1. Thanks for great article.did you ever faced this error : creating databases: (configuration_method) Missing data for required field.

    Like

Leave a Reply to Thulasi Kumar Cancel reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s