We start with a randomly chosen line f(x)=wx+b, i.e., we pick w and b at random.
In class I didn't choose completely randomly. I made sure to start with a negative slope (w<0) and chose b=5.
Why? Remember that our goal is to minimize the loss, or "reach the valley from up in the mountains."
Actually, there might be multiple valleys. Depending on the initial random model you start with, you might end up in a different valley (see image).
We want the loss to be as small as possible, so we have to take care to avoid the shallower valley. Therefore, it always makes sense to use some knowledge to pick a starting point closer to the correct valley. From the picture, we can see that the best line should have a negative slope (w<0) and b should be close to 5.
Exercise: Try choosing other values of w and b and see if you can make sense of what happens when you do gradient descent.
```python
import numpy as np

# randomly initialize the slope
w = -10 * np.random.random_sample()  # random number between -10 and 0
b = 5  # start the intercept close to the value suggested by the data
```
The general Machine Learning term for the w and b that we start with: initial weights.
Our choice of initial weights determines where in the mountains we start our descent. In situations where there are many valleys (local minima), it is customary to try several different starting points and compare the models we end up with.
One of the least understood phenomena of deep neural networks is that, even though there are many, many local minima, almost all of them seem to produce good models.
Note on models vs. algorithms
The words 'model' and 'algorithm' are often used interchangeably, but there is a standard usage in the context of machine learning: we use machine learning algorithms (like linear regression) to obtain a trained model (like the best-fit line).
The confusion is that the trained model is itself an algorithm. It takes in a certain input and performs a series of steps that were "learned" during training and produces a desired output. Hence the idea that machine learning is about writing "algorithms that learn algorithms."
In our example, the input of the model is an arbitrary emissions tax level and the output is the predicted level of CO2 emissions at that tax level.
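To make the "trained model is itself an algorithm" idea concrete, here is a minimal sketch. The weight and bias values below are hypothetical stand-ins, not values actually learned in class:

```python
# A trained linear model is itself a tiny algorithm: input -> output.
# w and b here are hypothetical values, for illustration only.
w, b = -0.4, 5.0

def predict(tax_level):
    """Predicted CO2 emissions at a given emissions-tax level."""
    return w * tax_level + b  # the 'learned' steps: multiply, then add

print(predict(3.0))
```

Once training is done, running the model is just executing these learned steps on new inputs.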
In other data science contexts, 'model' can have a completely different meaning. For instance, 'modeling the data' might mean creating some sort of visualization.
3. Define a Loss Function
For linear regression (and other 'regression problems' in machine learning), a typical choice of loss function is the mean-squared error:

loss = (1/n) ⋅ Σᵢ (f(xᵢ) − yᵢ)²

where n is the number of training samples (in our example n=4) and f(xᵢ) = wxᵢ + b.
Remember, the loss should be interpreted as the average of the (squared) distances between our data points and the current line (NOTE: the square function is just a way to keep the numbers positive when computing the loss; we could also use the absolute value).
Exercise: Think about why it makes sense that the 'best line' would minimize this average.
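The loss is straightforward to compute in NumPy. The data points below are made up for illustration, since the actual tax/emissions values from class aren't reproduced in this section:

```python
import numpy as np

# Hypothetical training data: emissions-tax levels x and observed CO2 levels y.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([5.1, 4.0, 3.2, 1.9])

def mse_loss(w, b):
    """Mean-squared error between the line f(x) = w*x + b and the data."""
    predictions = w * x + b
    return np.mean((predictions - y) ** 2)

print(mse_loss(-1.0, 5.0))  # loss for the line with slope -1 and intercept 5
```

A line that passes closer to the data points produces smaller squared distances, and hence a smaller average, which is exactly why the 'best line' minimizes this quantity.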
It's important to know that the choice of loss function in general depends on the situation. There are standard choices for regression or (image) classification, but you can also modify the loss function and get interesting results!
For example, neural style transfer is based on supervised learning in which the label for an image is the image itself, and the loss function includes two parts:

- the ordinary mean-squared error and
- a term that measures the distance between a given image and the painting whose style you want to transfer to that image.

If you try to make both terms small at once, you end up with a mixed image.
In a more activist direction, there is research on adding constraints to the loss function to combat unfairness in the resulting model.
4. Gradually Adjust the Model to Decrease Loss (Gradient Descent)
Now we begin descending from our initial random model into the valley where the loss is minimal.
First we set the step size or 'learning rate,' which we called α in the slides.
Note: The 'step size' isn't exactly α, but α⋅db and α⋅dw. Both dw and db become smaller and smaller as we near the bottom of the valley, resulting in smaller and smaller steps, as David pointed out in class.
Now we 'take a step,' i.e., update w and b as in the above formulae:

w ← w − α⋅dw and b ← b − α⋅db
As was noted during class, there's a trade-off between the size of the steps we take and the number of steps it takes us to reach the valley. Smaller steps mean we need more of them to get there.
But our steps can also be too big. Imagine you have immensely long legs and step so far that you land on the other side of the valley at a point even higher than where you started! If you do this over and over, you'll end up climbing to infinity.
If you really want to optimize and reach the valley in as few steps as possible, you should make α as big as you can without ending up in this climbing-to-infinity situation.
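Putting the pieces together, the whole descent can be sketched as below. The data, α, and num_steps are illustrative choices, and dw and db are the partial derivatives of the mean-squared error with respect to w and b:

```python
import numpy as np

# Hypothetical training data (stand-ins for the tax/emissions points from class).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([5.1, 4.0, 3.2, 1.9])

alpha = 0.05      # learning rate: the α from the slides
num_steps = 1000  # how many steps we descend

# initial weights: random negative slope, intercept near 5
w = -10 * np.random.random_sample()
b = 5.0

for _ in range(num_steps):
    errors = (w * x + b) - y
    dw = 2 * np.mean(errors * x)  # partial derivative of the loss w.r.t. w
    db = 2 * np.mean(errors)      # partial derivative of the loss w.r.t. b
    w -= alpha * dw               # the actual 'step' has size alpha*dw ...
    b -= alpha * db               # ... and alpha*db

print(w, b)  # should approach the best-fit slope and intercept
```

Setting alpha too large (try 0.3 here) makes the updates overshoot the valley and the values blow up, which is exactly the climbing-to-infinity situation described above.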
Exercise: Play around with different values of the 'learning rate' α and the number of steps num_steps. Try to find the best values, i.e., the largest α that works and the smallest number of steps.
Note: As Aleks pointed out, the best thing to do would be to take huge steps in the beginning while you're still high up and gradually decrease α as you get close to the bottom. There is a bunch of research about how to do that. You can google "adaptive learning rates."
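One very simple version of this idea is a decaying learning-rate schedule. The formula and constants below are one common illustrative choice, not the specific method from any particular paper:

```python
# A simple learning-rate decay schedule: big steps early, small steps later.
# alpha0 and decay are illustrative values, not tuned for any particular problem.
alpha0, decay = 1.0, 0.1

def alpha_at_step(t):
    """Learning rate at step t: equals alpha0 at t=0, shrinking as t grows."""
    return alpha0 / (1 + decay * t)

print([round(alpha_at_step(t), 3) for t in (0, 10, 100)])
```

Fancier adaptive methods adjust the step size per-parameter based on the gradients seen so far, but they all share this basic shape: large steps high up, small steps near the bottom.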