Automated exploratory data analysis

Introduction

One of the most exciting things about data science is when you get your hands on a new dataset. Oh, the possibilities!

Before I can really sink my teeth into the dataset, it's important to get an idea of its overall structure and potential problems. Because I can hardly wait to get started on the fun stuff, I have experimented quite a bit with packages that allow quick automated exploratory data analysis (EDA) as the first step. So, for part 1 of this series on EDA, here is a round-up of my (current) three favourite packages for getting acquainted with a dataset while writing minimal code.

A continuously updated database of resources and packages for automated EDA is curated at Intelligence Refinery.

If you want to try this yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Remember to change the R runtime to auto_EDA_R from this notebook, to ensure that you have all the installed packages and can start right away.

Let's get started!

A quick overview of the raw dataset

To quickly spot things like missing values, misclassified variables, and erroneous values, I prefer dataMaid for its straightforward combination of metrics and visualizations.

    dataMaid

dataMaid generates a summary report of your dataset in R Markdown format, which you can knit into a PDF or HTML report. For demonstration purposes, I will just show snippets of the interesting parts:

    ## Import library
    library(dataMaid)

    ## Import the raw Telco customer churn dataset
    raw_df <- read.csv("https://github.com/treselle-systems/customer_churn_analysis/raw/master/WA_Fn-UseC_-Telco-Customer-Churn.csv")

    ## Generate a summary report as an R Markdown file
    makeDataReport(raw_df, openResult = FALSE,
                   render = FALSE, file = "./results/testing.Rmd",
                   replace = TRUE)
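As a side note, if you would rather go straight to the finished document, makeDataReport() can render it for you, or you can knit the generated file yourself. A minimal sketch, assuming the rmarkdown package is installed:

    ## Option 1: let dataMaid render the report directly (HTML in this case)
    makeDataReport(raw_df, output = "html", render = TRUE, replace = TRUE)

    ## Option 2: knit the previously generated .Rmd file yourself
    rmarkdown::render("./results/testing.Rmd", output_format = "html_document")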

The first part of the generated report shows the types of checks performed:

    Then, we see a summary table of all variables, which provides a helpful quick overview of the data and any potential issues, like the 0.16% missing data in the TotalCharges column.
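To double-check a figure like this outside the report, a quick base R one-liner does the trick (how the blank entries are parsed depends on your read.csv defaults, so treat this as a sketch):

    ## Percentage of missing values per column, rounded to two decimals
    round(colMeans(is.na(raw_df)) * 100, 2)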

Scrolling down, there is more detailed information on each variable. We see potential problems, such as the customerID column being a key and the SeniorCitizen column being encoded in 0s and 1s.

We also see that the minimum value of the tenure column is 0, which is problematic; these rows should be removed.
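Based on these findings, a first cleaning pass might look something like the following. This is just one possible approach, and dropping the zero-tenure rows is my own choice here (the lowercase tenure column name is from the raw data):

    ## Drop the identifier column flagged as a key
    clean_df <- raw_df[, setdiff(names(raw_df), "customerID")]

    ## Recode SeniorCitizen from 0/1 to No/Yes, matching the other columns
    clean_df$SeniorCitizen <- factor(clean_df$SeniorCitizen,
                                     levels = c(0, 1), labels = c("No", "Yes"))

    ## Remove rows with zero tenure, i.e. customers with no billing history
    clean_df <- clean_df[clean_df$tenure > 0, ]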

    Of all the automated EDA packages in R and Python that I have tried so far, dataMaid provides the best once-over, quick-glance view of the whole dataset with a single function. These results are great for focusing the initial data cleaning process.

Explore cleaned data

Once I get a (reasonably) clean dataset, I want to be able to explore the variables and their relationships with minimal coding (at first). This is where the next two packages come in, providing varying degrees of flexibility and depth of insight.

    autoEDA

For the first quick overview, I use the autoEDA package to explore the relationship between all input variables and my target variable of interest, which is Churn in this case. For maximum convenience, this can be done in a single line of code:

    ## Import library
    library(autoEDA)
    
    ## Import the same dataset, but with basic cleaning
    cleaned_df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")
    
    ## Analyze data, output visualizations to PDF 
    autoEDA(cleaned_df, y = "Churn", returnPlotList = FALSE,
            outcomeType = "automatic", removeConstant = TRUE, 
            removeZeroSpread = TRUE, removeMajorityMissing = TRUE, 
            imputeMissing = TRUE, clipOutliers = FALSE, 
            minLevelPercentage = 0.025, predictivePower = TRUE, 
            outlierMethod = "tukey", lowPercentile = 0.01, 
            upPercentile = 0.99, plotCategorical = "groupedBar",
            plotContinuous = "histogram", bins = 30, 
            rotateLabels = TRUE, color = "#26A69A", 
            outputPath = '.', filename = "autoEDA_telco", verbose = FALSE) 

The output includes a dataframe summarizing each variable's type, the presence of outliers, and descriptive statistics. In the last column of this output dataframe, there is a handy PredictivePower metric for each input variable with respect to the specified target variable. For now, we can ignore this, as I will cover it in more detail in a later post examining variable importance.
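If you want to work with these numbers programmatically, the same call can be assigned to a variable. A minimal sketch, assuming (as the printed output suggests) that autoEDA() returns the summary table as a dataframe:

    ## Capture the summary dataframe rather than only printing it
    eda_summary <- autoEDA(cleaned_df, y = "Churn", verbose = FALSE)

    ## Inspect its structure, including the PredictivePower column
    str(eda_summary)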

The graphical outputs provided by autoEDA give really fast insights into how various aspects of customer demographics and behaviour relate to whether they churn or not. As Nextjournal is currently having some trouble rendering multiple subplots together, I have output the plots to a PDF file for those who are interested, and attached an image to show here:

    mv autoEDA_telco.pdf results

However, it is important to keep in mind that this type of bivariate analysis cannot detect combinatorial effects, in which multiple variables act together to affect churn. Therefore, just because a variable such as Gender does not appear to be distributed differently with respect to churn, it should not be excluded from analysis, as it may be significant when considered in combination with other variables. Nevertheless, this is a good start for seeing whether there are "learnable" signals in the dataset.
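One cheap way to probe for such combinatorial effects is to compare churn rates within subgroups rather than one variable at a time. A hedged sketch in base R (the gender and Contract column names are assumed from the cleaned Telco data):

    ## Churn rate by gender alone
    prop.table(table(cleaned_df$gender, cleaned_df$Churn), margin = 1)

    ## Churn rate by gender within each contract type; a difference that
    ## only appears here would be invisible to one-variable-at-a-time plots
    aggregate(Churn == "Yes" ~ gender + Contract, data = cleaned_df, FUN = mean)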

    ExPanDaR

    Saving the best for last, ExPanDaR provides a really nifty Shiny app for interactive explorations of your data set.

Originally designed for examining time-series data, the package requires the input dataframe to have 1) a time/date column and 2) a column that uniquely identifies each row. The time/date column is only needed if you want to visualize time-dependent trends, so for a dataset without a time dimension you can just add a new numeric column (ts) holding a constant and set it as the time dimension. An index column suffices for the second requirement. In the original Telco dataset, the customerID column would have worked fine; as I dropped it during data cleaning, I will just add a new index column (ID).

    ## Import library
    library(ExPanDaR)
    
    ## Add mock time column and new index to dataframe
    cleaned_df$ts <- rep(1, nrow(cleaned_df))
    
    cleaned_df$ID <- seq.int(nrow(cleaned_df))

    To start up the Shiny app for interactive exploration of the results:

    ExPanD(df = cleaned_df, cs_id = "ID", ts_id = "ts")

    Here are some snapshots of the features that I find most useful. The dropdown menus and sliders make it really easy and flexible to examine any combinations of variables.

    To go beyond bivariate relationships, the scatter plot can aggregate information from up to four variables and really give some interesting insights.

There are some other very cool features, like letting the user generate and explore new variables (from arithmetic combinations of existing variables) on the fly and perform regression analysis. Definitely give this package a try!
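The same kind of derived variable can of course be built outside the app too. A small sketch of the sort of arithmetic combination worth exploring (AvgChargePerMonth is a name I made up; TotalCharges and tenure are columns in the cleaned Telco data):

    ## Average charge per month of tenure, guarding against division by zero;
    ## this mimics the on-the-fly variables ExPanD lets you define
    cleaned_df$AvgChargePerMonth <- with(cleaned_df,
      ifelse(tenure > 0, TotalCharges / tenure, NA))

    summary(cleaned_df$AvgChargePerMonth)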

    Parting notes

    While these three packages make up a good starter kit for fast automated EDA in a data science workflow, there is a huge ecosystem of EDA packages that are waiting to be explored. I will keep updating this post as I refine my EDA pipeline.

    Til next time! :)