Conditional probability density plots

Conditional probability density plots are a great way to examine the relationship between a continuous and a categorical variable, as they show how the mix of categories changes across the range of the continuous variable. Unlike the commonly used bar graphs that show the mean or median of a continuous variable for each level of a categorical variable, which collapse the data into just a few numbers, these plots can reveal interesting dynamics over a range of values. As usual, we will use the Telco customer churn dataset as a toy example of using these techniques to gain actionable business insights from data.

If you want to try this yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to change the R runtime to conditional_prob_R from this notebook, so that you have all the required packages installed and can start right away.

To make these plots, we use the smoothed density estimate implemented in the ggplot2 package as geom_density(), which calculates and plots the kernel density estimate over a range of values. For ease of examination, we can turn the static ggplot object into an interactive one using the ggplotly() wrapper function from the plotly package.
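To make the "fill" idea concrete, here is a minimal base R sketch of what such a plot encodes, using simulated numbers rather than the Telco data. A kernel density estimate is computed for each group on a shared grid, and at every x value each group's density is divided by the sum across groups. The group names, sample sizes and distributions are made up for illustration, and the sketch uses base R's `density()` rather than reproducing ggplot2's internals exactly.

```r
set.seed(42)

# Simulated stand-in data (NOT the Telco dataset): one continuous variable, two groups
x_no  <- rnorm(700, mean = 60, sd = 15)   # hypothetical "No" group
x_yes <- rnorm(300, mean = 85, sd = 10)   # hypothetical "Yes" group

# Evaluate both kernel density estimates on the same grid of x values
rng   <- range(c(x_no, x_yes))
d_no  <- density(x_no,  from = rng[1], to = rng[2], n = 256)
d_yes <- density(x_yes, from = rng[1], to = rng[2], n = 256)

# Rescale so the two densities sum to 1 at every x: the height of the "Yes" band
share_yes <- d_yes$y / (d_no$y + d_yes$y)

# Draw the boundary between the two bands
plot(d_no$x, share_yes, type = "l", ylim = c(0, 1),
     xlab = "x", ylab = "Share of 'Yes' density")
abline(h = 0.5, lty = 3)
```

Where the curve sits above the dotted line at 0.5, the "Yes" group's density is the larger of the two at that value of x.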

## Import libraries
library(PCAmixdata)
library(ggplot2)
library(plotly)
library(gridExtra)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv")

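## Plot: map columns by bare name inside aes(); position = 'fill' rescales the stacked densities to sum to 1 at each x
ggplotly(ggplot(df, aes(x = MonthlyCharges, fill = Churn)) + 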
         geom_density(position='fill', alpha = 0.5) + 
         xlab("MonthlyCharges") + labs(fill='Churn') +
         theme(legend.text=element_text(size=12), 
               axis.title=element_text(size=14)))

Interesting! Looks like there are two "tiers" of monthly charges at which customers appear more likely to churn: ~40 dollars/month and ~75-100 dollars/month. It may be useful to see if the services and prices at these two "tiers" are less competitive than those offered by other companies. Conversely, perhaps the services offered at ~60 dollars/month are more appealing than those from competitors, and so may warrant more focus for advertising and promotions.
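A caveat on reading these bands as churn probabilities: by default `geom_density()` plots each group's density, which integrates to 1 whatever the group's size. With `position = 'fill'` the band heights are therefore ratios of densities and do not account for how many customers fall in each group. The ggplot2 documentation's own conditional density example maps `y` to `after_stat(count)` (density times the number of observations in the group) for this reason. Below is a minimal sketch of the two variants on simulated data. The data frame `sim`, its group sizes and its distributions are made up for illustration, and `after_stat()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in for the Telco data (NOT the real dataset); groups are unequal on purpose
sim <- data.frame(
  MonthlyCharges = c(rnorm(700, mean = 60, sd = 20), rnorm(300, mean = 80, sd = 15)),
  Churn = rep(c("No", "Yes"), times = c(700, 300))
)

# Default y (density): each group's curve integrates to 1, so group size is ignored
p_density <- ggplot(sim, aes(x = MonthlyCharges, fill = Churn)) +
  geom_density(position = "fill", alpha = 0.5) +
  ggtitle("y = density (default)")

# y = after_stat(count): each curve is scaled by its number of observations
p_count <- ggplot(sim, aes(x = MonthlyCharges, y = after_stat(count), fill = Churn)) +
  geom_density(position = "fill", alpha = 0.5) +
  ggtitle("y = after_stat(count)")

print(p_density)
print(p_count)
```

Both versions are increasing functions of the same ratio of group densities, so the peaks and dips fall at the same x values. What changes is the height of the bands, and the difference grows as the groups become more unbalanced.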

Next, as part of exploratory data analysis, it would be informative to make a conditional density plot for every combination of a categorical variable and a numerical variable in the dataset. We can do this using a loop:

## Create conditional probability density plots for each categorical variable against each of the two continuous variables
plots <- list()

split <- splitmix(df) ## Split the dataframe into categorical and continuous variables

i <- 1

for (cat_var in colnames(split$X.quali)) {
    for (num_var in c("Tenure", "MonthlyCharges")) {
        ## aes_string() takes column names as strings and resolves them right away,
        ## so each stored plot keeps its own pair of variables
        ## (it is deprecated in recent ggplot2 releases but still works)
        plots[[i]] <- ggplot(df, aes_string(x = num_var, fill = cat_var)) + 
                        geom_density(position='fill', alpha = 0.5) + 
                        xlab(num_var) + labs(fill=cat_var) +
                        theme(legend.text=element_text(size=12), 
                              axis.title=element_text(size=14))
        
        i <- i + 1
    }
}

## Plot
options(repr.plot.width=20, repr.plot.height=80)

grid.arrange(grobs=plots, ncol=2)
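If a panel this large is awkward to view inline, one option is to write it to an image file instead. The sketch below is only one possible way of doing that and is not necessarily how the image in this post was made. It uses stand-in plots built from the `mtcars` dataset that ships with R, and the output path and dimensions are arbitrary choices.

```r
library(ggplot2)
library(gridExtra)

# Stand-in plots from a built-in dataset (NOT the Telco data)
plots_demo <- lapply(c("mpg", "hp", "wt", "qsec"), function(num_var) {
  ggplot(mtcars, aes(x = .data[[num_var]], fill = factor(cyl))) +
    geom_density(position = "fill", alpha = 0.5) +
    xlab(num_var) + labs(fill = "cyl")
})

# arrangeGrob() builds the grid without drawing it, so it can be handed to ggsave()
panel <- arrangeGrob(grobs = plots_demo, ncol = 2)

# Hypothetical output location; swap in any path you like
out_file <- file.path(tempdir(), "conditional_density_panel.png")
ggsave(out_file, plot = panel, width = 10, height = 8, units = "in", dpi = 100)
```

Note that `ggsave()` refuses dimensions above 50 inches by default, so a panel as tall as the 20 x 80 one above would also need `limitsize = FALSE`.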

For now, Nextjournal is having a bit of trouble rendering a panel of plots, so the plots will be shown here as an image:

A quick glance reveals a few more interesting points:

  • Male and female customers have very similar tenures and monthly charges. Factor analyses showed that the Gender variable factors very little into the variations in this dataset (see post here).
  • Monthly charges for customers with fiber optic internet service are much higher than for those with DSL or no internet service at all. Preliminary EDA (see post here) showed that customers who churn mostly have fiber optic internet service and higher monthly charges than those who do not churn. So dissatisfaction with the fiber optic service, or with its price, may factor into a customer's decision to leave.

Of course, it is important to keep in mind that these correlations do not indicate causal relationships, or even the direction of the relationship, between the variables examined. These are just starting points for further investigation, but interesting ones nonetheless.

Til the next post! :)