Association rule mining

Introduction

One approach to discovering interesting relationships and patterns in a large dataset is to identify entities that frequently appear together. This type of pattern can be summarized by association rules. For example, a rule that may be found in a retailer database:

\left\{ \text{toilet paper}, \text{cheese}, \text{milk} \right\} \Rightarrow \left\{ \text{shampoo}, \text{apples}, \text{lettuce} \right\}

The rule can be interpreted in terms of conditional probability: a customer who bought toilet paper, cheese and milk in a transaction is more likely to have also bought shampoo, apples and lettuce in it. Note that an association rule does not imply causation between the items on its two sides. This type of information can then be used to inform store layout and marketing campaigns. Association rule mining is also widely used outside of the business domain, for example to characterize side effects of drugs and to analyze demographic variables in census data.
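To make the conditional-probability reading concrete, here is a minimal base R sketch on five made-up baskets. The items and counts are invented for illustration and do not come from any real dataset. It compares how often the right-hand side occurs overall with how often it occurs among baskets that contain the left-hand side.

```r
## Toy baskets, invented for illustration only
baskets <- list(
  c("milk", "cheese", "apples"),
  c("milk", "cheese", "apples", "shampoo"),
  c("milk", "bread"),
  c("cheese", "apples"),
  c("milk", "cheese")
)

lhs <- c("milk", "cheese")   # left-hand side of the rule
rhs <- "apples"              # right-hand side of the rule

## TRUE for each basket that contains every item in `items`
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

has_lhs <- contains(lhs)
has_rhs <- contains(rhs)

## P(rhs): share of all baskets that contain the right-hand side
p_rhs <- mean(has_rhs)

## P(rhs | lhs): share of baskets with the left-hand side that also contain the right-hand side
p_rhs_given_lhs <- sum(has_lhs & has_rhs) / sum(has_lhs)

print(c(p_rhs = p_rhs, p_rhs_given_lhs = p_rhs_given_lhs))
```

In this toy example the conditional share (2 of 3 baskets) is only slightly higher than the overall share (3 of 5), so the rule would be a weak one.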

In this post, we will establish a basic workflow for association rule mining in R using the arules package. As in our previous posts, we will use the Telco customer churn dataset, not only to illustrate the application of association rule mining in a business setting, but also to derive insights that can be compared with those from our other analyses, in order to gain a comprehensive and well-balanced understanding of the dataset.

If you want to try this yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to change the R runtime to association_rules_R from this notebook to ensure that you have all the installed packages and can start right away.

Let's get started!

Import data

First, we will import the version of the Telco dataset that we have cleaned up:

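## Read strings in as factors (no longer the default since R 4.0), as the later steps expect factor columns
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv", stringsAsFactors = TRUE)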

Preprocess data

Then, we have to preprocess the data a bit before we can start mining:

Discretize continuous variables

In this dataset, we have two continuous variables whose relationship to customer churn we are interested in: MonthlyCharges and Tenure. We will ignore the TotalCharges variable here, as it is essentially the product of these two. As the algorithm cannot use continuous variables, we need to discretize them into categorical variables.

Simply dividing the values into equal-sized bins would almost certainly result in information loss. To obtain a more informative binning, we will use a supervised discretization function from the arulesCBA package, which chooses bin breaks that retain as much predictive power as possible with respect to the target variable that we are interested in, Churn.
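To give a feel for what "supervised" means here, the following base R sketch picks a single cut point that minimizes the class entropy of the two resulting bins. It is a simplified illustration on invented data, not the arulesCBA implementation: the MDLP method builds on this kind of entropy-based split, but applies it recursively and uses a stopping rule to decide how many bins to keep. The variable names `charges` and `churn` are hypothetical.

```r
## Toy data, invented for illustration: a numeric feature and a binary label
charges <- c(20, 25, 30, 35, 70, 75, 80, 85)
churn   <- factor(c("No", "No", "No", "No", "Yes", "Yes", "No", "Yes"))

## Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                  # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

## Size-weighted entropy of the two groups created by splitting at `cut`
split_entropy <- function(cut) {
  left  <- churn[charges <  cut]
  right <- churn[charges >= cut]
  (length(left) * entropy(left) + length(right) * entropy(right)) / length(churn)
}

## Candidate cut points: midpoints between consecutive sorted values
v <- sort(unique(charges))
candidates <- (head(v, -1) + tail(v, -1)) / 2

## Score every candidate and keep the one with the lowest weighted entropy
scores   <- vapply(candidates, split_entropy, numeric(1))
best_cut <- candidates[which.min(scores)]
print(best_cut)
```

The chosen break falls where the labels change most sharply, which an equal-width binning would only hit by chance.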

MonthlyCharges

## Import library
library(plyr)
library(dplyr)
library(arulesCBA)

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., df[, c('MonthlyCharges', 'Churn')], method='mdlp')$MonthlyCharges

## Check levels of binned variable
print(unique(df$Binned_MonthlyCharges))

Here we see that the values in the MonthlyCharges column are binned into five levels. For ease of reading, we will rename the bins:

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges = revalue(df$Binned_MonthlyCharges, 
                                   c("[-Inf,29.4)"="$0-29.4", 
                                     "[29.4,56)"="$29.4-56", 
                                     "[56,68.8)"="$56-68.8", 
                                     "[68.8,107)"="$68.8-107", 
                                     "[107, Inf]" = "$107-118.75"))

Tenure

Now we will do the same for Tenure:

## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                            df[, c('Tenure', 'Churn')], 
                                            method='mdlp')$Tenure

## Check levels of binned variable
unique(df$Binned_Tenure)

## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure = revalue(df$Binned_Tenure, 
                           c("[-Inf,1.5)"="1-1.5m", 
                             "[1.5,5.5)"="1.5-5.5m",
                             "[5.5,17.5)"="5.5-17.5m",
                             "[17.5,43.5)"="17.5-43.5m",
                             "[43.5,59.5)"="43.5-59.5m",
                             "[59.5,70.5)"="59.5-70.5m",
                             "[70.5, Inf]"="70.5-72m"))

Convert dataframe to transaction format

Next, we have to get the data into the format accepted by the arules package, in which each row contains the comma-separated items that make up one transaction. While there are no transactions in this dataset, treating each row (a customer) as a transaction works just fine. However, we do need to replace the "No"s in each column with empty values and the "Yes"s with the column header, so that when the dataframe is exported to a CSV, each customer is represented by a comma-separated list of the characteristics that they possess and the products that they have purchased.
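As a small illustration of this reshaping, here is the same idea applied to a made-up three-customer dataframe. The column names and values are hypothetical stand-ins, and the final `apply()` step is only there to show what each row would look like as a basket.

```r
## Toy dataframe, invented for illustration (column names are hypothetical)
toy <- data.frame(
  Partner      = c("Yes", "No", "Yes"),
  PhoneService = c("Yes", "Yes", "No"),
  Contract     = c("Month-to-month", "Two year", "One year"),
  stringsAsFactors = FALSE
)

## Replace "No" with NA so that absent characteristics drop out
toy[toy == "No"] <- NA

## Replace "Yes" with the name of the column it sits in
w <- which(toy == "Yes", arr.ind = TRUE)
toy[w] <- names(toy)[w[, "col"]]

## Collapse each row into a comma-separated list of items, skipping NAs
items <- apply(toy, 1, function(r) paste(r[!is.na(r)], collapse = ","))
print(items)
```

Each customer ends up described only by what they have, which is what the basket format expects.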

## Replace "No"s with empty values
df[df=="No"]<-NA

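## Convert factor columns back to plain character vectors so that "Yes" can be overwritten with column names
df[] <- lapply(df, function(x) levels(x)[x])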

## Replace "Yes"s with the column name
w <- which(df == "Yes", arr.ind = TRUE)

df[w] <- names(df)[w[,"col"]]

## Output to CSV
write.csv(df, './results/final_df.csv', row.names=FALSE, na="")  ## na="" writes missing values as empty fields rather than "NA"

Finally, we will read in the processed dataset using the read.transactions() function from the arules package, which loads it in the form required for analysis:

## Import library
library(arules)

## Convert dataframe to transaction format (reads the CSV written in the previous step)
tData <- read.transactions("./results/final_df.csv", format = "basket", sep = ",", header = TRUE)

## Check
inspect(head(tData))

Inspect item frequency

Now that the data is ready, we can look at how frequently each "item" (in this case, customer characteristics and the products that they have bought) appears in the dataset:

library('ggplot2')

x <- data.frame(sort(table(unlist(LIST(tData))), decreasing=TRUE))

ggplot(data=x, aes(x=Var1, y=Freq)) + 
      geom_col() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      ylab('Frequency') +
      xlab('Items')

This gives us an idea of what characteristics or buying behaviour are the most prevalent among the customers.
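The same counts can also be obtained directly from a transactions object with the itemFrequency() function in arules. The sketch below uses three invented baskets rather than the Telco data, so the item names and counts are purely illustrative.

```r
library(arules)

## Toy baskets, invented for illustration only
toy_baskets <- list(
  c("Partner", "PhoneService"),
  c("PhoneService", "Churn"),
  c("Partner", "PhoneService", "Churn")
)

## Coerce the list into the transactions class used by arules
toy_trans <- as(toy_baskets, "transactions")

## Absolute count of each item, sorted from most to least frequent
freq <- sort(itemFrequency(toy_trans, type = "absolute"), decreasing = TRUE)
print(freq)
```

For a quick look without ggplot2, arules also provides itemFrequencyPlot(), whose topN argument limits the plot to the most frequent items.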

Mine for association rules

The Apriori algorithm is the one most commonly used for association rule mining. We can set various parameters to limit the number of rules created from the dataset, usually the minimum frequency with which a rule is observed ("support") and the minimum/maximum length of the rule. Unless told otherwise, apriori() also applies its default minimum confidence of 0.8. As an example, let's generate rules that appear in at least 0.1% of customers and contain 3-5 "items":

rules <- apriori(tData,
                 parameter = list(supp = 0.001,  ## Support (0.1%)
                                  minlen=3,      ## At least 3 items in rule
                                  maxlen=5))     ## At most 5 items in rule
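To show how a minimum support limits the search, here is a simplified base R sketch of the pruning idea behind Apriori: an itemset can only be frequent if all of its subsets are frequent, so candidate pairs are built only from single items that already pass the threshold. This is an illustration on invented baskets with a hypothetical threshold of 0.5, not how arules implements the algorithm.

```r
## Toy baskets, invented for illustration only
baskets <- list(
  c("PhoneService", "Partner"),
  c("PhoneService", "Churn"),
  c("PhoneService", "Partner", "Dependents"),
  c("PhoneService", "Partner", "Churn")
)
min_supp <- 0.5   # hypothetical threshold for this toy example

## Support: share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

## Level 1: keep single items that meet the minimum support
all_items <- sort(unique(unlist(baskets)))
supp1     <- vapply(all_items, support, numeric(1))
frequent1 <- all_items[supp1 >= min_supp]

## Level 2: build candidate pairs only from the frequent single items
pairs     <- combn(frequent1, 2, simplify = FALSE)
supp2     <- vapply(pairs, support, numeric(1))
frequent2 <- pairs[supp2 >= min_supp]

print(frequent1)
print(frequent2)
```

Because "Dependents" falls below the threshold at the first level, no pair containing it is ever counted, which is what keeps the search tractable on larger datasets.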

We can convert the association rules to a dataframe for easy inspection and take a look:

## Convert rules object to dataframe
rules_df <- DATAFRAME(rules, setStart='', setEnd='', separate = TRUE)

## Sort the rules by how often they occur, which is measured by "support", and preview the top few
rules_df <- rules_df[order(rules_df$support, decreasing = TRUE), ]
head(rules_df)

"LHS" and "RHS" refer to the items on the left- and right-hand side of each rule, respectively. Support, confidence and lift are measures of the "strength" of each rule, which we will discuss in further detail in the next notebook. Finally, "count" gives the number of instances, whether transactions or customers, in which the rule appears. In our case, the "count" of each rule divided by the total number of customers in the dataset equals its support.
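As a quick reference ahead of that discussion, these are the standard textbook definitions of the three measures for a rule X ⇒ Y. The notation is our own shorthand rather than anything printed by arules: N is the total number of customers, and count(X) is the number of customers whose item list contains every item in X.

```latex
\begin{aligned}
\mathrm{supp}(X) &= \frac{\mathrm{count}(X)}{N} \\
\mathrm{supp}(X \Rightarrow Y) &= \mathrm{supp}(X \cup Y) = \frac{\mathrm{count}(X \cup Y)}{N} \\
\mathrm{conf}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
\end{aligned}
```

A lift above 1 means that the right-hand side appears more often among customers matching the left-hand side than it does in the dataset as a whole.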

Parting notes

In this analysis, we uncovered 79,313 rules, which is almost certainly too many to be useful. In the next notebook, we will explore ways to filter the association rules identified here by the items involved, various measures of "interestingness", redundancy and statistical significance.