# Dataism: Poverty & Murder

This notebook was authored by Alexandre Puttick and modified by me. The original is on GitHub.

## Setup

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

Install the Excel reader used by `pd.read_excel`:

```shell
pip install xlrd
```

## Data

The data correlate poverty, unemployment, and murder rate: for a given percentage of families with incomes below $5000, what is the number of murders per 1,000,000 inhabitants per annum? If more people live in poverty, are there more murders? (source)

```python
POVERTY_MURDER_DATA = pd.read_excel('Income vs. Murder.xlsx')
X = POVERTY_MURDER_DATA['Percent with income below $5000']
Y = POVERTY_MURDER_DATA['Number of murders per 1,000,000 inhabitants']
POVERTY_MURDER_DATA.head(5)
```

|   | Percent with income below $5000 | Number of murders per 1,000,000 inhabitants |
|---|---------------------------------|---------------------------------------------|
| 0 | 16.5                            | 11.2                                        |
| 1 | 20.5                            | 13.4                                        |
| 2 | 26.3                            | 40.7                                        |
| 3 | 16.5                            | 5.3                                         |
| 4 | 19.2                            | 24.8                                        |

```python
plt.plot(X, Y, 'bo')
# label the axes
plt.xlabel('Percent with income below $5000')
plt.ylabel('No. murders per 10^6 residents')
# start the plot at the point (0, 0)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.gcf()
```

## Find the Best Fit Line

Use linear regression to generate a model that finds the best fit line. The best fit line offers two insights:

1. Test whether or not x has an influence on y, with an associated level of confidence.

2. The slope can be used to predict trends.

If the best fit line has a non-zero slope, then we can state with some level of confidence (e.g. 95%) that x has an influence on y, i.e. that murders and poverty are correlated.
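Before building the fit by hand, it helps to know roughly what answer to expect. As a sketch (using only the five preview rows shown in the data section as stand-in data, so the full dataset will give different numbers), NumPy's least-squares `polyfit` returns the slope and intercept of the best fit line directly:

```python
import numpy as np

# Five sample rows (percent of families below $5000, murders per 10^6 inhabitants).
x = np.array([16.5, 20.5, 26.3, 16.5, 19.2])
y = np.array([11.2, 13.4, 40.7, 5.3, 24.8])

# Degree-1 least-squares fit: returns [slope, intercept].
w, b = np.polyfit(x, y, 1)
print(f"slope w = {w:.3f}, intercept b = {b:.3f}")
```

Gradient descent on the same data should converge towards this closed-form answer.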

### Choose Initial Weights

Set reasonable initial weights w and b at random. Depending on where the random draw lands, the first guess line may or may not appear within the plot area.
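As an optional aside (not in the original notebook): to make the random first guess repeatable from run to run, seed NumPy's random number generator before drawing:

```python
import numpy as np

np.random.seed(42)  # make subsequent random draws repeatable
print(np.random.random_sample())  # same value on every run
```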

```python
def guess_weight(value):
  return np.random.random_sample() * value

def init_guess():
  return {'w': guess_weight(10), 'b': guess_weight(-100)}

def update_guess_data(weight_guess):
  min_line_x = np.amin(X)
  max_line_x = np.amax(X)

  # Draw a line across the range of the data.
  x_fit = np.linspace(min_line_x, max_line_x, 100)
  y_fit = weight_guess['w'] * x_fit + weight_guess['b']

  return {'x': x_fit, 'y': y_fit,
          'w': weight_guess['w'], 'b': weight_guess['b']}

guess = update_guess_data(init_guess())
print("w:", guess['w'], " b:", guess['b'])
plt.plot(guess['x'], guess['y'], '-r')
plt.gcf()
```

### Define the Loss Function

Build the array of per-point errors, $i = 1..n$, in three steps:

1. $f(x_i) = wx_i + b$ is the list of predictions based on the (randomly chosen) starting values of w and b.

2. $f(x_i)-y_i$: subtract the actual values from the list of predictions.

3. $(f(x_i)-y_i)^2$: square each value in the list.

Then sum the array and multiply by the inverse of its length (step 4), giving the mean squared error: $L = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i)-y_i\right)^2$

```python
def loss(guess_data):
  y_predict = guess_data['w'] * X + guess_data['b']  # 1
  diff = y_predict - Y                               # 2
  sq_diff = diff**2                                  # 3
  return 1/len(X) * np.sum(sq_diff)                  # 4

loss(guess)
```

```
1988.250003490549
```

### Take a Gradient Descent Step

Adjust the estimate to decrease the loss and move closer to the best fit line, recalculating both w and b.

Each update takes a step towards the valley of the loss surface. The size of the step is determined by $\alpha$, also known as the learning rate.
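The update rule implemented below comes from differentiating the mean squared error with respect to each weight:

$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\bigl((wx_i + b) - y_i\bigr)\,x_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\bigl((wx_i + b) - y_i\bigr)$$

$$w \leftarrow w - \alpha\,\frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha\,\frac{\partial L}{\partial b}$$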

```python
def gradient_descent(w, b):
    alpha = 0.0001

    dw = 2/len(X) * np.sum(((w*X + b) - Y) * X)
    db = 2/len(X) * np.sum((w*X + b) - Y)
    w_step = w - alpha * dw
    b_step = b - alpha * db

    return {'w': w_step, 'b': b_step}
```

Plot the single step in yellow.

```python
def take_step(guess):
  step = gradient_descent(guess['w'], guess['b'])
  guess['w'] = step['w']
  guess['b'] = step['b']
  return guess

guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-y')
plt.gcf()
```

Take another step in blue.

```python
guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-b')
plt.gcf()
```

### Find the Best Fit

Stop searching once the loss falls below a fixed threshold (46 works for this dataset). Plot the best fit in green, and keep a record of loss values in loss_hist for plotting later.

```python
loss_hist = []

def find_best_fit(best_guess):
  while True:
    if loss(best_guess) < 46:
      return best_guess
    best_guess = update_guess_data(take_step(best_guess))
    loss_hist.append(loss(best_guess))

best_fit = find_best_fit(guess)
plt.plot(best_fit['x'], best_fit['y'], '-g')
plt.gcf()
```
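A hard-coded loss threshold like 46 only works for this particular dataset. A common alternative, sketched here on the five preview rows as stand-in data (`fit_by_tolerance` and its parameters are illustrative names, not part of the original notebook), is to stop when the loss improves by less than a small tolerance between steps:

```python
import numpy as np

# Stand-in data: the five sample rows from the data preview.
x = np.array([16.5, 20.5, 26.3, 16.5, 19.2])
y = np.array([11.2, 13.4, 40.7, 5.3, 24.8])

def fit_by_tolerance(x, y, alpha=0.001, tol=1e-9, max_steps=500_000):
    """Gradient descent that stops when the loss stops improving."""
    w, b = 0.0, 0.0
    n = len(x)
    prev_loss = np.inf
    for _ in range(max_steps):
        resid = (w * x + b) - y
        cur_loss = np.mean(resid**2)
        if prev_loss - cur_loss < tol:  # improvement too small: converged
            break
        prev_loss = cur_loss
        w -= alpha * 2/n * np.sum(resid * x)
        b -= alpha * 2/n * np.sum(resid)
    return w, b

w, b = fit_by_tolerance(x, y)
print(w, b)
```

This stopping rule adapts to any dataset, at the cost of choosing a tolerance instead of a loss target.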

### Plot the Loss Values

```python
plt.clf()
plt.plot(loss_hist)
plt.gcf()
```