Exploratory Data Analysis (EDA) is an important process for many use cases when you are dealing with data. It is much helpful if you get an open-source data exploration and visualization tool that supports a variety of data sources, considering, in the Big Data ecosystem, you always have more than one database. On these lines, I found Apache Superset to be quite impressive and very helpful.
In this article, I am sharing how to use Superset to explore and visualize data stored in Amazon S3, which is a popular choice for data lake. It took me few iterations to get it to work so hopefully, this post will help people get started.
To install Superset, I followed instructions as mentioned in the docker hub page of superset. However, when I tried to open it in the browser, I got a lot of “Access Denied” errors. After checking these steps and playing with them, I found that “Step 4: Setup roles”, should be done right after “Step 2”. When you load examples before role setting then Superset throws these errors. I have created an issue https://github.com/apache/superset/issues/13867 with the Superset community to fix it.
So the correct sequence is as follows
# download docker image and run superset on port 8080 docker run -d -p 8080:8088 --name superset apache/superset # With above container running, execute below commands on new terminal docker exec -it superset superset fab create-admin \ --username admin \ --firstname Superset \ --lastname Admin \ --email email@example.com \ --password admin docker exec -it superset superset db upgrade docker exec -it superset superset init # optional but helpful to get started docker exec -it superset superset load_examples # save the changes to the image docker commit <superset-container-id> apache/superset
Adding Athena Driver in Superset
Tip – You must have created Athena tables referencing the data in S3. Superset will query Athena tables and use the output for visualization. Since Superset doesn’t access S3 directly, it is transparent to the file format of data stored in S3 (CSV, ORC, Parquet etc).
As of now, the easiest way to visualize data in S3 is to use Superset’s Athena driver. The complete list of supported DB drivers is published here. For Athena, I am using PyAthena and not the JDBC one. The minimum Dockerfile to pull and build superset image that will include PyAthena connector is shown below.
FROM apache/superset # Switching to root to install the required packages USER root # Installing Athena driver RUN pip install "PyAthena>1.2.0" # Switching back to using the `superset` user USER superset
Build and run the container again using this updated image. The data used in this article is of a ‘Top 50 Bestselling Books’ on Amazon from Kaggle. As a prerequisite for this tutorial, I created a table in Athena as shown below which points to my S3 bucket having this sample dataset.
CREATE EXTERNAL TABLE bestseller_books( name string, author string, rating double, reviews int, price double, year int, genre string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'separatorChar' = ',', 'quoteChar' = '"' ) LOCATION 's3://techwithcloud/rawdata/input/' TBLPROPERTIES ('skip.header.line.count'='1')
Adding Athena as Database
Now, login to the Superset UI by opening http://localhost:8080 (default credential is admin/admin) and go to “Data” -> “Databases”. Then click on “+ DATABASE” button, to bring up below page
Provide a DB name and the SQLAlchemy URI (connection string) in the following format. Please note that below URI doesn’t contain Athena schema name. Superset will fetch all schema and allows you to choose while creating a dataset. You can find more options to use in the connection string here.
As a next step, I recommend going to “SQL Lab” -> “SQL Editor” and selecting the Athena table from the drop down. Superset will automatically fetch few rows (100 rows in my case). If you can see data in the preview then Superset and Athena (or S3) can communicate properly.
Let us try running a query directly in Superset to find out those authors who had the most number of books sold.
Superset shows the time taken to execute this query as well as an “Explore” option. Clicking on Explore and following instructions shown on the page brings us to the following very useful section of Superset. Note that till now, we haven’t created any dataset from the table.
Now let us create a dataset from this table. Creating new dataset is very easy, navigate to “Data” -> “Datasets” -> “+ DATASET”. Then select the Athena table, add the dataset. You should see something like below
Now that we have a dataset, let us create 2 charts and add these charts in a dashboard to conclude this tutorial. Navigate to “Charts” -> “+ CHART”, select the dataset name created in the above step, leave the visualization type as “Table”. This brings us to this page where count(*) is the default metric.
Edit the settings as shown below to get all rows in a table format and save it. This will be our first and simple chart.
For the second chart, I want to see how the books were brought by Genre. Create a new chart and select “Bar Chart” as visualization type. Edit the settings as shown below to create our second chart. Also, in the “CUSTOMIZE” tab, select two check boxes called “SORT BARS” & “LEGEND” and provide labels for the ‘X’ and ‘Y’ axis.
Adding charts to Dashboard
So far, we have created 2 charts, and now time to add them to a dashboard.
- Navigate to “Dashboards” -> “+ DASHBOARD”
- From the “COMPONENTS” tab drag and drop “Row” twice.
- From the “CHARTS” tab select the two charts created and drop them in the two Empty Row spaces.
- Save the dashboard and you should have a similar page as shown below. (click on the “Draft” icon next to the dashboard name to Publish it to the home page)
This concludes the tutorial. Thanks for reading.