# Dataism: Supervised Learning

This notebook was authored by Alexandre Puttick and modified by me. The original is on GitHub.

Recall the "five steps of supervised learning" from class:

1. Start with labeled data: $(x_1,y_1), (x_2, y_2), ... , (x_n, y_n).$ Here $n$ is the number of training samples, the $x_i$ are the samples, and the $y_i$ are the labels.

2. Begin with an initial random model: A model is a map $f$ from inputs to outputs that takes a new sample $x$ and gives it a label $y = f(x).$

3. Define a "loss function": A number measuring how good/bad the predictions are.

4. Gradually adjust the model $f$ to decrease the loss: i.e., iteratively adjust $f$ so that its predictions become better and better. Do this with gradient descent.

5. The best model is the one for which the loss is minimized.

## Linear Regression

Let's go through the five steps in the context of linear regression, i.e., finding the line that best describes a 2D set of data points.

# Import the necessary libraries:
import numpy as np
import matplotlib.pyplot as plt # matplotlib.pyplot contains a large variety of
                                # visualization tools.

We start with the following data consisting of CO2-emissions taxes and the corresponding amount of CO2-emissions:

| Emissions tax | CO2-emissions |
|---|---|
| 5 | 5 |
| 10 | 4.3 |
| 15 | 3.7 |
| 25 | 2.9 |

The following code saves the first column into a numpy array named 'X.'

These are our samples $x_1 = 5,\, x_2 = 10, \,x_3 = 15,\, x_4 = 25.$

X = np.array([5,10,15,25])
print(X)

We save the second column into a numpy array named 'Y.' These are the labels $y_1 = 5,\, y_2 = 4.3,\, \ldots$

Y = np.array([5,4.3,3.7,2.9])
print(Y)
# Plot the labeled data
plt.clf()
# Specify that we're plotting X against Y. 'bo' stands for blue dots and
# describes the style of the plot.
plt.plot(X,Y, 'bo')
# label the axes (X holds the tax levels, Y the CO2-emissions)
plt.xlabel('Tax')
plt.ylabel('CO2-Emissions')
# start the plot at the point (0,0)
plt.xlim(xmin = 0)
plt.ylim(ymin = 0)
# display the plot
plt.gcf()

## 2. Begin with an initial random model:

We start with a randomly chosen line $f(x) = wx + b.$ Random means that we choose $w$ and $b$ randomly.

In class I didn't choose completely randomly. I made sure to start with a negative slope $(w<0)$ and chose $b =5.$

Why? Remember that our goal is to minimize the loss, or "to reach the valley from up in the mountains."

Actually, there might be multiple valleys. Depending on the initial random model you start with, you might end up in a different valley (see image).

We want the loss to be as small as possible, so we have to take care to avoid the shallower valley. Therefore, it always makes sense to use some knowledge to pick a starting point closer to the correct valley. From the picture, we can see that the best line should have negative slope $(w <0)$ and $b$ should be close to 5.
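To make the "multiple valleys" picture concrete, here is a minimal sketch on a made-up one-dimensional landscape. The functions `landscape`, `slope` and `descend`, the step size and the two starting points are all assumptions chosen for illustration; this is not the regression loss from this notebook.

```python
import numpy as np

# Hypothetical landscape with two valleys (NOT the regression loss).
def landscape(x):
    return x**4 - 3 * x**2 + x

# Derivative of the landscape: tells us which way is downhill.
def slope(x):
    return 4 * x**3 - 6 * x + 1

# Plain gradient descent: repeatedly take a small step downhill.
def descend(x, step_size=0.01, num_steps=1000):
    for _ in range(num_steps):
        x = x - step_size * slope(x)
    return x

# Two different starting points (assumed values).
x_left = descend(-2.0)
x_right = descend(2.0)

print(x_left, landscape(x_left))    # where we end up starting from the left
print(x_right, landscape(x_right))  # where we end up starting from the right
```

Starting on the left we should settle in the deeper valley, while starting on the right leaves us in the shallower one, even though both runs follow exactly the same rule.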

Exercise: Try choosing other values of $w$ and $b$ and see if you can make sense of what happens when you do gradient descent.

# randomly initialize the slope
w = -10 * np.random.random_sample() # random number between -10 and 0
# not so randomly initialize the y-intercept
b = 5
print(w,b) # Check values of w and b

Let's also visualize our initial model:

plt.clf()
plt.plot(X, Y, 'bo')
# make a list of one hundred evenly spaced x-values between 0 and 30
x = np.linspace(0,30,100)
# the corresponding list of predictions
y = w*x + b
# connect the points from these lists with a red line
plt.plot(x, y, '-r') # -r = 'red line'
plt.ylim(ymin = 0)
plt.xlim(xmin = 0)
plt.gcf()

### Initial weights

The general machine learning term for the $w$ and $b$ that we start with is *initial weights*.

Our choice of initial weights determines where in the mountains we start our descent from. In situations where there are many valleys (local minima), it is customary to choose several different starting points and compare the models we end up with.

One of the least understood phenomena of deep neural networks is that, even though there are many, many local minima, almost all of them seem to produce good models.
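As a rough sketch of "trying several starting points," the code below trains a line on our four data points from three random initial weights and keeps the run with the smallest loss. The helper `train`, the learning rate, the number of steps and the ranges for the random starts are assumptions made for this illustration and differ from the values used later in the notebook.

```python
import numpy as np

X = np.array([5, 10, 15, 25])
Y = np.array([5, 4.3, 3.7, 2.9])

# Hypothetical helper: gradient descent on the mean-squared error of a line.
def train(w, b, learning_rate=0.003, num_steps=20000):
    for _ in range(num_steps):
        error = w * X + b - Y
        # compute both updates from the same error before changing w and b
        w, b = (w - learning_rate * 2 * np.mean(error * X),
                b - learning_rate * 2 * np.mean(error))
    final_loss = np.mean((w * X + b - Y) ** 2)
    return w, b, final_loss

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
results = []
for _ in range(3):
    w0 = rng.uniform(-10, 0)    # random initial slope
    b0 = rng.uniform(0, 10)     # random initial y-intercept
    results.append(train(w0, b0))

for w, b, final_loss in results:
    print(w, b, final_loss)

# keep the run with the smallest final loss
best_w, best_b, best_loss = min(results, key=lambda r: r[2])
```

For this small problem the three runs should agree closely, because the mean-squared error of a line has only one valley. When there are several valleys the final losses differ, and comparing them is how we pick a model.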

### Note on models vs. algorithms

The words 'model' and 'algorithm' are often used interchangeably, but there is a standardized use in the context of machine learning. Usually we say that we use machine learning algorithms (like linear regression) to obtain a trained model (like the best-fit line).

The confusion is that the trained model is itself an algorithm. It takes in a certain input and performs a series of steps that were "learned" during training and produces a desired output. Hence the idea that machine learning is about writing "algorithms that learn algorithms."

In our example, the input of the model is an arbitrary emissions tax level and the output is the predicted level of CO2-emissions at that tax-level.
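Seen this way, the trained model is just a very short function. The sketch below uses the values of w and b that the training run further down in this notebook ended up with; the names `w_trained`, `b_trained` and `predict_emissions` are made up for this example, and your own run will give slightly different numbers.

```python
# Weights copied from this notebook's training output (yours will differ slightly).
w_trained = -0.10312393032426162
b_trained = 5.391747354738126

# Hypothetical name: the trained model as a plain function from tax to emissions.
def predict_emissions(tax):
    return w_trained * tax + b_trained

# Predicted CO2-emissions at a tax level of 20 (a value not in the training data).
print(predict_emissions(20.0))
```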

In other data science contexts, 'model' can have a completely different meaning. For instance, 'modeling the data' might mean creating some sort of visualization.

## 3. Define a Loss Function

For linear regression (and other 'regression problems' in machine learning), a typical choice of loss function is the mean-squared error:

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n} \big(f(x_i) - y_i\big)^2,$$

where $n$ is the number of training samples (in our example $n =4$) and $f(x_i) = wx_i + b.$

Remember, the loss should be interpreted as the average of the squared vertical distances between our data points and the current line. (Note: squaring keeps every term positive, so that errors above and below the line cannot cancel each other out; we could also use the absolute value.)

Exercise: Think about why it makes sense that the 'best line' would minimize this average.

First we define an array called 'Y_predict:'

Y_predict = w*X + b
Y_predict
array([ -30.19595855, -65.3919171 , -100.58787565, -170.97979276])

Now we compare the predictions and the labels:

# difference between predictions and labels
diff = Y_predict - Y
diff
array([ -35.19595855, -69.6919171 , -104.28787565, -173.87979276])

Then we square them: $\textrm{sq\_diff} = [(f(x_1)-y_1)^2,\, (f(x_2)-y_2)^2,\,(f(x_3)-y_3)^2, \,(f(x_4)-y_4)^2]$

# square differences
sq_diff = diff**2
sq_diff
array([ 1238.75549835, 4856.96330946, 10875.96100846, 30234.18232921])

Finally, we average the entries in sq_diff to obtain the loss:

loss = 1/len(X) * np.sum(sq_diff)
# len(X) = 4. The len() function returns the length of an array.
# We write len(X) instead of 4 so that our expression for the loss
# works for any number of data points.
loss
11801.46553637084
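The steps above can be bundled into a single function, which is handy if you want to compare different lines. This is only a sketch: the names `mse_loss` and `mae_loss` and the example line ($w = 0$, $b = 4$) are assumptions for illustration. The second function is the absolute-value alternative mentioned earlier.

```python
import numpy as np

X = np.array([5, 10, 15, 25])
Y = np.array([5, 4.3, 3.7, 2.9])

# Mean-squared error of the line f(x) = w*x + b on the data (X, Y).
def mse_loss(w, b, X, Y):
    Y_predict = w * X + b
    return np.mean((Y_predict - Y) ** 2)

# Same idea, but with absolute values instead of squares.
def mae_loss(w, b, X, Y):
    Y_predict = w * X + b
    return np.mean(np.abs(Y_predict - Y))

# Example: a flat line at height 4 (an arbitrary choice).
print(mse_loss(0.0, 4.0, X, Y))
print(mae_loss(0.0, 4.0, X, Y))
```

Note that `np.mean` does the same job as dividing `np.sum` by `len(X)`.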

### Different loss functions for different situations

It's important to know that the choice of loss function in general depends on the situation. There are standard choices for regression or (image) classification, but you can also modify the loss function and get interesting results!

For example, neural style transfer is based on supervised learning in which the label for an image is the image itself and the loss function includes two parts:

1. the ordinary mean-squared error and

2. a term that measures the distance between a given image and the painting whose style you want to transfer to that image.

If you try to make both terms small at once, you end up with a mixed image.
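Here is a toy sketch of why minimizing two terms at once gives a mixture. It is not real style transfer, which compares images through features computed by a neural network. Here the "images" are just four made-up numbers, and `content`, `style`, `style_weight` and `step_size` are all assumptions chosen for illustration.

```python
import numpy as np

content = np.array([0.0, 0.2, 0.4, 0.6])  # stand-in for the original image
style = np.array([1.0, 1.0, 0.0, 0.0])    # stand-in for the painting
style_weight = 1.0                        # how much the second term counts

# loss(image) = sum((image - content)**2) + style_weight * sum((image - style)**2)
image = content.copy()  # start from the original image
step_size = 0.1
for _ in range(200):
    # gradient of the two-part loss with respect to the image
    grad = 2 * (image - content) + 2 * style_weight * (image - style)
    image = image - step_size * grad

print(image)
# the minimizer is a weighted average of the two targets
print((content + style_weight * style) / (1 + style_weight))
```

Both printed arrays should agree: the result sits between the two targets, and increasing `style_weight` pulls it closer to `style`.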

In a more activist direction, there is research on adding constraints to the loss function to combat unfairness in the resulting model.

## 4. Gradually Adjust the Model to Decrease the Loss

Now we begin descending from our initial random model into the valley where the loss is minimal.

First we set the step size or 'learning rate,' which we called $\alpha$ in the slides.

learning_rate = .0001 # this is alpha, step size

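Taking a step in the direction of steepest descent means updating w and b using the following formulae:

$$w \leftarrow w - \alpha \cdot \frac{2}{n}\sum_{i=1}^{n} \big(f(x_i) - y_i\big)\,x_i, \qquad b \leftarrow b - \alpha \cdot \frac{2}{n}\sum_{i=1}^{n} \big(f(x_i) - y_i\big).$$

In case you know calculus and are curious where this came from, you can also write this as:

$$w \leftarrow w - \alpha\,\frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha\,\frac{\partial L}{\partial b},$$

where $L$ denotes the mean-squared error loss from step 3.

First we have to compute the terms that get multiplied by $\alpha$ in the above formula. We name these dw and db: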

dw = 2/len(X) * np.sum((Y_predict - Y) * X)
db = 2/len(X) * np.sum(Y_predict - Y)

Note: The 'step size' isn't exactly $\alpha$, but $\alpha\cdot db$ and $\alpha\cdot dw.$ Both $dw$ and $db$ become smaller and smaller as we near the bottom of the valley, resulting in smaller and smaller steps, as David pointed out in class.
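If you'd like to convince yourself that the expressions for dw and db really measure the slope of the loss, one option is to compare them with a numerical estimate: nudge w (or b) a tiny bit in both directions and see how much the loss changes. The sketch below assumes a fixed starting point ($w = -7$, $b = 5$) and a nudge size `eps`; the helper `loss_fn` is a name made up for this example.

```python
import numpy as np

X = np.array([5, 10, 15, 25])
Y = np.array([5, 4.3, 3.7, 2.9])

# Mean-squared error as a function of w and b.
def loss_fn(w, b):
    return np.mean((w * X + b - Y) ** 2)

w, b = -7.0, 5.0  # assumed starting point

# The formulas used in this notebook.
Y_predict = w * X + b
dw = 2/len(X) * np.sum((Y_predict - Y) * X)
db = 2/len(X) * np.sum(Y_predict - Y)

# Numerical estimate: change in loss divided by change in w (or b).
eps = 1e-4
dw_numeric = (loss_fn(w + eps, b) - loss_fn(w - eps, b)) / (2 * eps)
db_numeric = (loss_fn(w, b + eps) - loss_fn(w, b - eps)) / (2 * eps)

print(dw, dw_numeric)
print(db, db_numeric)
```

The two numbers on each line should agree to several decimal places.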

Now we 'take a step,' i.e., update w and b as in the above formulae:

w = w - learning_rate * dw
w
-6.699981114402784
b = b - learning_rate * db
b
5.019152777203267

That was just a single step. The following code repeats the whole process for a fixed number of steps, which should bring us close to the bottom of the valley:

## First specify the number of steps you want to take
num_steps = 500
## We want to keep track of the loss (how high up in the mountains we are)
## after every step. We do this so we can visualize how the loss changed as
## we descended.
## To do this, start with an empty list, which we call 'loss_hist' for
## 'loss history.' We will add the loss at our current location with every step.
loss_hist = [] # empty list
## The following for-loop repeats the process of taking a step in the
## direction of steepest descent. Each time the loop runs, it uses the
## w and b at our current location.
for step in range(num_steps): # range(num_steps) runs through 0, 1, ..., num_steps - 1 (here 0 to 499)
    Y_predict = w*X + b
    # Computing the loss at our location in the mountains
    diff = Y_predict - Y
    sq_diff = diff**2
    loss = 1/len(X) * np.sum(sq_diff)
    # Compute the 'gradient'
    dw = 2/len(X) * np.sum((Y_predict - Y) * X)
    db = 2/len(X) * np.sum(Y_predict - Y)
    # Take the step:
    w = w - learning_rate * dw
    b = b - learning_rate * db
    # record the loss at every step by adding to our loss_hist array.
    loss_hist.append(loss)

Let's look at the values of w and b that we ended up with:

w
-0.10312393032426162
b
5.391747354738126
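As a sanity check, we can compare these values with the best-fit line that numpy computes directly (without gradient descent). This is a sketch using `np.polyfit`, which fits a polynomial by least squares; degree 1 gives a line. The names `slope` and `intercept` are just chosen for this example.

```python
import numpy as np

X = np.array([5, 10, 15, 25])
Y = np.array([5, 4.3, 3.7, 2.9])

# Fit a degree-1 polynomial (a line); coefficients come back highest degree first.
slope, intercept = np.polyfit(X, Y, 1)
print(slope, intercept)
```

The slope and intercept should be close to the w and b found by gradient descent above. They won't match exactly, since we stopped after a finite number of steps.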

We can also plot the line we ended up with:

plt.clf()
# include the original data points in our plot
plt.plot(X, Y, 'bo')
# take a long list of 100 values on the x-axis between 0 and 30
x = np.linspace(0,30,100)
# compute the corresponding y-values for that long list
y = w*x + b # predictions for values in that long list
# connect all of these points with a red line
plt.plot(x, y, '-r') # -r = red line
# start the plot at the point (0,0)
plt.ylim(ymin = 0)
plt.xlim(xmin = 0)
# display the plot
plt.gcf()

And visualize how the loss changed over time:

plt.clf()
plt.plot(loss_hist)
plt.gcf()

As was noted during class, there's a trade-off between the size of the steps we take and the number of steps it takes us to reach the valley. Smaller steps mean we need to take more steps to get there.

But our steps can be too big. Imagine you have immensely long legs and step so far that you reach a point on the other side of the valley that's even higher up than where you started! If you do this over and over, you'll end up climbing to infinity.

If you really want to optimize and reach the valley in as few steps as possible, you should make $\alpha$ as big as you can without ending up in this climbing-to-infinity situation.

Exercise: Play around with different values of the 'learning rate' $\alpha$ and the number of steps num_steps. Try to find the best values, i.e., the largest $\alpha$ that works and the smallest number of steps.
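To get started with the exercise, it can help to wrap the training loop in a function so that different learning rates are easy to compare. The sketch below is one possible way to do that; the function name `gradient_descent`, the fixed starting point ($w = -7$, $b = 5$), the 50 steps and the three learning rates are assumptions chosen for illustration.

```python
import numpy as np

X = np.array([5, 10, 15, 25])
Y = np.array([5, 4.3, 3.7, 2.9])

# Hypothetical helper: the same loop as above, packaged as a function.
def gradient_descent(w, b, learning_rate, num_steps):
    loss_hist = []
    for _ in range(num_steps):
        error = w * X + b - Y
        loss_hist.append(np.mean(error ** 2))  # loss before taking the step
        dw = 2 * np.mean(error * X)
        db = 2 * np.mean(error)
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b, loss_hist

# loss at the (assumed) starting point w = -7, b = 5
initial_loss = np.mean((-7.0 * X + 5.0 - Y) ** 2)

final_loss = {}
for learning_rate in [0.0001, 0.001, 0.005]:
    w, b, loss_hist = gradient_descent(-7.0, 5.0, learning_rate, num_steps=50)
    final_loss[learning_rate] = loss_hist[-1]
    print(learning_rate, loss_hist[-1])
```

With these particular choices, the middle learning rate should get much lower in 50 steps than the smallest one, while the largest one overshoots the valley and the loss grows instead of shrinking.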

Note: As Aleks pointed out, the best thing to do would be to take huge steps in the beginning while you're still high up and gradually decrease $\alpha$ as you get close to the bottom. There is a bunch of research about how to do that. You can google "adaptive learning rates."
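One of the simplest versions of this idea is a fixed "decay schedule," where $\alpha$ shrinks a little with every step. The sketch below is only an illustration of that idea and not one of the adaptive methods you'll find when searching; the names `initial_rate` and `decay`, their values and the starting point are assumptions.

```python
import numpy as np

X = np.array([5, 10, 15, 25])
Y = np.array([5, 4.3, 3.7, 2.9])

w, b = -7.0, 5.0       # assumed starting point
initial_rate = 0.004   # big steps at the start (assumed value)
decay = 0.01           # how quickly the steps shrink (assumed value)

loss_hist = []
for step in range(500):
    # the learning rate gets smaller as the step count grows
    learning_rate = initial_rate / (1 + decay * step)
    error = w * X + b - Y
    loss_hist.append(np.mean(error ** 2))
    dw = 2 * np.mean(error * X)
    db = 2 * np.mean(error)
    w = w - learning_rate * dw
    b = b - learning_rate * db

print(w, b, loss_hist[-1])
```

Try plotting `loss_hist` for this run and for a run with a constant learning rate to compare how quickly each one descends.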