Competing on Kaggle with Nextjournal - Part 1
Predicting Housing Prices
In this article we'll take a swing at a Kaggle competition - predicting house prices - using Nextjournal with a Python environment. We'll explore the dataset, try our hand at feature engineering and eventually use advanced regression techniques to enter the Kaggle ranks.
For this we'll use the following packages:
- kaggle-cli (the unofficial Kaggle CLI to download competition datasets)
- Pandas to handle the dataset in a convenient way and perform manipulations
- matplotlib and seaborn for plotting and statistical data visualization
- missingno for missing data visualization
- scikit-learn for preprocessing, model building and selection
pip install kaggle-cli scikit-learn seaborn missingno
1. Getting the Data
First we need to download all data files of the competition's dataset. Since downloading Kaggle datasets requires a Kaggle account, we'll download the files outside of Nextjournal and upload them using Nextjournal's file upload.
1.1. Using the Unofficial Kaggle-CLI
To download competition datasets from the command line, we could use the unofficial Kaggle-CLI. For that we first need to set a config with our Kaggle account's username & password. The competition name can be obtained from its Kaggle URL.
kg config -c <competition-name> -u <username> -p <password>
Now we could simply download the dataset using the following command. If we only wanted to download specific files, we could do so with the -f <file name> option. We'll skip this for now and instead upload the data manually. In the upcoming Nextjournal release we can safely store private data using the vault integration. With that we can then easily use helpers like the Kaggle CLI without compromising our login information.
kg download -c house-prices-advanced-regression-techniques
1.2. Manual Data Upload
Manually uploading data works by pressing the 'plus' button and choosing the file we wish to upload. Below we already have all relevant competition data, uploaded manually.
However, this is not recommended for large files, since uploading a big dataset this way can take a very long time.
2. Data Exploration
The four files above comprise the dataset (already split into train and test sets for our convenience), a description of the data, and a sample submission showing what form a submission should take later on. Let's go ahead and open the file data_description.txt in our browser to inspect the 79 variables which make up the dataset.
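If we'd rather stay inside the notebook, we could also print the first lines of the description file directly. The path data_description.txt below is an assumption about where the uploaded file ends up; it may need adjusting.
# peek at the (assumed) local copy of the data description
with open('data_description.txt') as f:
    for line in f.readlines()[:20]:
        print(line.rstrip())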
We immediately see quite a few categorical variables, which we'll have to transform in order to use them as input for our model. But let's first perform a very basic data exploration.
2.1. Basic Dataset Information
Let's explore the data by loading the train.csv
file into a Pandas DataFrame. But first, let's import the packages we're going to use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Now we'll load the data using Pandas' .read_csv()
method and then display the head to check if it looks as expected.
# train_manual_upload.csv and test_manual_upload.csv refer to the manually uploaded files
df_train = pd.read_csv(train_manual_upload.csv)
df_test = pd.read_csv(test_manual_upload.csv)
print(df_train.head())
print(df_test.head())
As we can see, the last column is missing in the test set. After all, that's what we're trying to predict later. Let's now try to identify the individual variables using Pandas' df.info() method. This displays some information about the DataFrame, like the total number of values in each column, whether they are null or not null, and the data type.
df_train.info()
We see that for many columns not all 1460 examples contain meaningful values. We'll deal with them later. For now let's also display some statistics about the individual columns.
print(df_train.describe())
Pandas automatically takes the missing values into account (and doesn't just count them as 0 entries), as we can see from the count row. Since the dataset was collected for Ames, Iowa, we can already see that the mean house in Ames sells for about $180k. But there's also quite a large variance in the sale prices - maybe that's why we want to build a regressor for the sale price in the first place?
2.2. Inspecting Correlation
Since the SalePrice
variable is the target variable we want to predict, let's calculate the correlation between all the other variables and our target.
correlation_to_target = df_train.corr()["SalePrice"]
# print the correlations sorted in descending order
print(correlation_to_target[np.argsort(correlation_to_target, axis=0)[::-1]])
As expected, variables like the overall quality of the respective house (OverallQual) show a rather strong correlation with the sale price. Interestingly, enclosed porches don't seem to raise the price.
The correlations between the variables themselves are also interesting to look at. Let's turn to the variables with the highest correlations and look at their pair-wise correlations in all possible combinations. For that we'll set a threshold (e.g. at 0.6) for correlations we deem important.
correlations = df_train.corr()
# use correlations for all variables but the target
feature_correlations = correlations.iloc[:-1, :-1]
# threshold to cut away smaller correlations (optional)
threshold = 0.6
correlations_thresholded = (feature_correlations[
    abs(feature_correlations) > threshold][
    feature_correlations != 1.0]).unstack().dropna().to_dict()
unique_pairwise_correlations = pd.DataFrame(
    list(set([(tuple(sorted(key)), correlations_thresholded[key])
              for key in correlations_thresholded])),
    columns=['Variable Pair', 'Correlation'])
# sort list by absolute value in decreasing order
unique_pairwise_correlations = unique_pairwise_correlations.iloc[
    abs(unique_pairwise_correlations['Correlation']).argsort()[::-1]]
print(unique_pairwise_correlations)
As a sanity check, it seems plausible that the area of a garage is highly correlated with the number of cars which fit into it. But there's more to pair-wise correlations between the feature variables than this. Ideally we don't want our feature variables to be correlated, since correlations between them can cause our model to be less robust and yield unreliable information about important variables.
In order to deal with this we could do one or a combination of the following:
- remove some of the highly correlated features
- perform a PCA to reduce the correlation between features and create a smaller feature set (a quick sketch of this follows right after the list)
- create new features from the current ones which perhaps capture some underlying factor shared by a subset of features
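As a minimal sketch of the second option (not used further in this article), we could project a handful of strongly correlated numeric features onto a few uncorrelated principal components. The column list below is only an illustrative assumption based on the pairs printed above.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# assumed subset of strongly correlated numeric columns (for illustration only)
correlated_cols = ['GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF']
X_corr = StandardScaler().fit_transform(df_train[correlated_cols])
pca = PCA(n_components=2)
components = pca.fit_transform(X_corr)
print("Explained variance ratio:", pca.explained_variance_ratio_)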
We will return to these options later on. Let's now visualize the data.
2.3. Visualization
Let's start by looking at the distribution of the target.
fig = plt.figure()
plt.hist(df_train["SalePrice"], bins=80)
plt.title("Distribution of Sale Price (skew={0:.2f})"
          .format(df_train["SalePrice"].skew()))
plt.xlabel("Sale Price")
plt.ylabel("Number of Occurrences")
fig
We see that the target SalePrice
is right-skewed. For regression models it is often advised to apply a transformation which helps unskew the target distribution. One such transformation is the np.log1p()
method. Let's see what our target distribution looks like after applying this transformation.
fig = plt.figure()
plt.hist(np.log1p(df_train["SalePrice"]), bins=80)
plt.title("Transformed Distribution of Sale Price (skew={0:.3f})"
          .format(np.log1p(df_train["SalePrice"]).skew()))
plt.xlabel("log1p(Sale Price)")
plt.ylabel("Number of Occurrences")
fig
We'll test the effect such variable transformations can have on our regression models later in the article.
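One thing worth noting in passing (a hedged aside rather than part of the original workflow): if we train a model on np.log1p(SalePrice), its predictions live on the log scale and have to be mapped back with the inverse transformation np.expm1() before submitting.
# sketch: map log-scale predictions back to the original price scale
log_predictions = np.log1p(df_train["SalePrice"])  # stand-in for model output on the log scale
predictions = np.expm1(log_predictions)            # back to the original price scale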
Let's continue by looking at a box-and-whiskers-plot of the variable which showed the largest correlation to the target - OverallQual
.
fig = plt.figure() sns.boxplot(y="SalePrice", x="OverallQual", data=df_train) plt.title('Overall Quality vs SalePrice') fig
As one might expect, as the overall quality of a house goes up, so does the sale price. However, we see quite a substantial variance in the sale price for houses assessed with a high overall quality. For comparison, let's now take a variable which is (as it probably should be) less correlated with the sale price - the month in which the house was sold (MoSold).
fig = plt.figure() sns.boxplot(y="SalePrice", x="MoSold", data=df_train) plt.title('MoSold vs SalePrice') fig
We can't really tell how the month in which the house is sold alone affects its sale price. But sometimes even such variables can prove important for a model when combined with others. We'll see.
3. Preprocessing
It's now time to prepare the dataset in a way which helps create a regression model to predict the target. The scikit-learn package not only offers tools to build and train machine learning models, but also helps with preprocessing, model selection strategies and beyond.
Many of these tools come in a convenient form called Transformers. They can be glued together to form a pipeline, which helps keep your code clean so you don't lose track of what transformations you apply. We'll cover working with pipelines in another article.
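As a small teaser of what such a pipeline could look like (a sketch only, not used in the rest of this article; SimpleImputer is the newer name of the imputer class in scikit-learn):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # called Imputer in older scikit-learn versions
from sklearn.linear_model import LinearRegression

# impute medians, then fit a linear regression, in one object
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('regress', LinearRegression()),
])
# pipe.fit(X_train, y_train) would then run both steps in order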
For now let's import a few tools we can use for preprocessing. We're not going to use all of them here, but let's import them anyway - in case the reader wants to experiment with different tools.
from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder
Let's revisit the missing values in our training dataset. For that we'll import the missingno package which offers some helpful visualizations for missing values.
import missingno as msno
With that we can display where in the DataFrame values are missing and maybe recognize patterns of missing values in the dataset. For that we only look at the columns which contain missing values.
fig = msno.matrix(df_train[df_train.columns[df_train.isnull().any()]],
                  figsize=(10,10), inline=False, sparkline=False,
                  color=(0, 0, 0))
fig
Two patterns can easily be spotted here: Missing values for Bsmt
variables often occur in the same row, which suggests that the corresponding house doesn't have any basement and therefore all variables related to it can't contain any values. The same seems to be true for variables describing the Garage
.
For many datasets with missing values such visualizations can provide useful insights as to why this is the case. Let's also display some numbers on the matter.
print("Number of missing values per variable in the train dataset.") print(df_train[df_train.columns[df_train.isnull().any()]].isnull().sum())
Some variables like Alley
, PoolQC
or MiscFeature
only contain values for less than 8% of all examples both in train and test. It may be best to drop those variables altogether. But let's keep them for now and explore ways to impute their missing values.
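If we did decide to drop them, a quick sketch could look like the following; the 8% threshold is simply the figure mentioned above.
# sketch (not applied here): find columns with less than 8% non-missing values
sparse_cols = [col for col in df_train.columns
               if df_train[col].notnull().mean() < 0.08]
print("Candidates for dropping:", sparse_cols)
# df_train = df_train.drop(columns=sparse_cols) would then remove them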
Imputing numerical variables by setting missing values equal to the mean or median of the respective variable seems reasonable. If the variable contains outliers, we should probably go for the median, since it's less sensitive to them.
But what if a variable is categorical? Let's see how we can deal with those first.
3.1. Encoding Categorical Variables
Since not all machine learning models can handle categorical variables out of the box, we first have to encode them in a useful way. Different options here are Label-Encoding, One-Hot-Encoding or even Entity-Embedding for some models.
We'll stick to One-Hot-Encoding here because of its binary nature (rather than imposing some ordinality, as Label-Encoding does) and the benefit of having the categories embedded in an orthogonal vector space. Keep in mind that for variables with many different categories One-Hot-Encoding can blow up, since you get a new feature for every category. In some cases, e.g. scikit-learn's LabelEncoder can help by storing the categories using less space.
Scikit-learn also has its own implementation of One-Hot-Encoding (OneHotEncoder), but we'll stick to Pandas' .get_dummies() method. Another benefit of One-Hot-Encoding here is that the information about missing values is preserved: a row with a missing category simply gets all zeros in the corresponding dummy columns. So we don't need to worry about imputing the encoded categorical variables later on.
# create DataFrame copies on which we perform encoding
encoded_df_train = df_train.copy()
encoded_df_test = df_test.copy()

# keep track of categorical columns (just for clarity)
categorical_cols = df_train.select_dtypes(include=['object']).columns

# make sure categorical variables are the same for train and test
assert(set(df_train.select_dtypes(include=['object']).columns) ==
       set(df_test.select_dtypes(include=['object']).columns))
# just another way of doing the same thing
assert(df_train.select_dtypes(include=['object']).columns.all() ==
       df_test.select_dtypes(include=['object']).columns.all())

# perform one-hot-encoding
encoded_df_train = pd.get_dummies(encoded_df_train)
encoded_df_test = pd.get_dummies(encoded_df_test)

# account for 1-hots not present in the test data
for col in set(
        encoded_df_train.columns.drop("SalePrice")) - set(encoded_df_test.columns):
    encoded_df_test[col] = 0

# information about categorical columns and added columns
print(categorical_cols)
print("Number of categorical columns: ", len(categorical_cols))
print(encoded_df_train.columns)
print("Number of added columns by encoding: ",
      len(encoded_df_train.columns) - len(df_train.columns))

# make sure we have the same number of columns (apart from SalePrice)
assert(set(encoded_df_train.columns.drop("SalePrice")) ==
       set(encoded_df_test.columns))
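As a side note, a more compact way to achieve the same column bookkeeping (a sketch, not what the code above does) is to let Pandas align the two frames directly; with join='left' the test set receives any dummy columns it's missing, filled with 0.
# sketch: align test columns to the train columns after one-hot-encoding
aligned_train, aligned_test = encoded_df_train.align(
    encoded_df_test, join='left', axis=1, fill_value=0)
aligned_test = aligned_test.drop(columns=['SalePrice'])  # the target only exists in train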
3.2. Imputing Missing Values
As we saw already when inspecting the dataframe, there were quite a few variables with missing values for (sometimes many) examples. The same probably holds true for the test data, if not worse.
Let's see how many missing values we have left after encoding.
print(encoded_df_train[encoded_df_train.columns[
    encoded_df_train.isnull().any()]].isnull().sum())
print(encoded_df_test[encoded_df_test.columns[
    encoded_df_test.isnull().any()]].isnull().sum())
As we can see, there are not that many missing values left after encoding, so most of the missing values were contained in categorical variables.
We'll deal with the rest of the missing values using scikit-learn's Imputer
. There are different strategies we can use here. Let's create some different imputers for experimentation.
impute_mean = Imputer(missing_values='NaN', strategy='mean', axis=0)
impute_median = Imputer(missing_values='NaN', strategy='median', axis=0)
impute_majority = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
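A version note, in case the reader runs this in a newer environment than assumed here: in scikit-learn 0.22 and later the Imputer class has been removed. The equivalent lives in sklearn.impute and no longer takes an axis argument (it always imputes column-wise).
# equivalents for scikit-learn >= 0.20
from sklearn.impute import SimpleImputer
impute_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
impute_median = SimpleImputer(missing_values=np.nan, strategy='median')
impute_majority = SimpleImputer(missing_values=np.nan, strategy='most_frequent')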
Let's now create a strategy for imputing missing values. Sometimes the information about where in the data values were missing is itself valuable for our model. We'll keep track of this by introducing new boolean columns.
# create DataFrame copies on which we perform preprocessing
imp_df_train = encoded_df_train.copy()
imp_df_test = encoded_df_test.copy()

# get columns with missing values
cols_with_missing_train = [col for col in imp_df_train.columns
                           if imp_df_train[col].isnull().any()]
cols_with_missing_test = [col for col in imp_df_test.columns
                          if imp_df_test[col].isnull().any()]
# use a consistent set for train and test (same cols with missing values for this dataset)
cols_with_missing = list(
    set(cols_with_missing_train + cols_with_missing_test))

# add binary columns to keep track of where values have been missing
for col in cols_with_missing:
    imp_df_train[col + '_was_missing'] = imp_df_train[col].isnull()
    imp_df_test[col + '_was_missing'] = imp_df_test[col].isnull()

# print some effects of the imputation for sanity checks
print(imp_df_train['LotFrontage'][:15])

# make sure we impute with the values calculated on the train set
imp_df_train[cols_with_missing] = impute_median.fit_transform(
    imp_df_train[cols_with_missing])
imp_df_test[cols_with_missing] = impute_median.transform(
    imp_df_test[cols_with_missing])

print(imp_df_train['LotFrontage'][:15])
Now that we've prepared the data, we can train an initial regression model. This model will later serve as the baseline against which we compare more advanced models.
4. Baseline Regression Model
Before we train our first baseline model let's import some metrics, model selection tools and regression models.
# different metrics
from sklearn.metrics import make_scorer, r2_score, mean_squared_error
# k-fold cross-validation
from sklearn.model_selection import cross_val_score, KFold
# different choices of regressors
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn import svm
Now let's define the metric we're using to judge our model's performance. We'll use the r2_score
here, but we could also e.g. create custom scoring functions using make_scorer
.
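As a hedged illustration of the make_scorer route (not used below), the competition's own metric - the root mean squared error between the logarithms of predicted and actual prices - could be wrapped like this:
# sketch: custom scorer for RMSE on log-transformed prices
def rmse_log(y_true, y_pred):
    return np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

# greater_is_better=False because a lower RMSE is better
rmse_log_scorer = make_scorer(rmse_log, greater_is_better=False)
With that aside, here's a small helper that runs 5-fold cross-validation and returns the mean and standard deviation of the R2 scores: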
def train_model(X_train, y_train, model):
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    r2 = make_scorer(r2_score)
    r2_val_score = cross_val_score(model, X_train, y_train, cv=cv, scoring=r2)
    return (r2_val_score.mean(), r2_val_score.std())
It's time to train some initial models. We'll keep track of the results in the form of another DataFrame storing the mean and std of the respective 5-fold cross-validation.
X_train = imp_df_train.drop(['Id', 'SalePrice'], axis=1)
y_train = imp_df_train['SalePrice']

results = pd.DataFrame()

model_1 = linear_model.LinearRegression()
results["Linear Model"] = train_model(X_train, y_train, model_1)

model_2 = RandomForestRegressor()
results["Random Forest"] = train_model(X_train, y_train, model_2)

model_3 = DecisionTreeRegressor()
results["Decision Tree"] = train_model(X_train, y_train, model_3)

model_4 = AdaBoostRegressor()
results["Adaboost"] = train_model(X_train, y_train, model_4)

results.index = ["R2 Mean", "R2 Std"]
print(results)
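As a quick optional experiment (a sketch tying back to the transformation from section 2.3, not part of the baseline above), we could also score the linear model against the log-transformed target:
# sketch: same cross-validation, but with a log1p-transformed target
log_mean, log_std = train_model(X_train, np.log1p(y_train), linear_model.LinearRegression())
print("Linear Model on log1p(SalePrice): R2 mean={:.3f}, std={:.3f}".format(log_mean, log_std))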
A few closing remarks, to be expanded in a follow-up article:
- Train and test could be treated identically by concatenating them with pd.concat([train, test], ignore_index=True); a sketch of this follows below. But this has some pitfalls:
- Transformers like the imputer should be 'fit' on train only, not on both; otherwise test set information leaks in (especially important for real-world applications, to avoid introducing bias).
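A minimal sketch of that concat-then-split pattern, reusing the variable names from this article; note that the imputer is still fit on the training rows only:
# sketch: encode train and test together so the dummy columns match, then split again
n_train = len(df_train)
combined = pd.get_dummies(
    pd.concat([df_train.drop(columns=['SalePrice']), df_test], ignore_index=True))

enc_train = combined.iloc[:n_train].copy()
enc_test = combined.iloc[n_train:].copy()

missing_cols = combined.columns[combined.isnull().any()]
enc_train[missing_cols] = impute_median.fit_transform(enc_train[missing_cols])  # fit on train only
enc_test[missing_cols] = impute_median.transform(enc_test[missing_cols])        # reuse train statistics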
last publication: 2018-01-19, 22:11 Heidelberg, Germany