# Dataism: Poverty & Murder

This notebook was authored by Alexandre Puttick and modified by me. The original is on GitHub.

## Setup

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

The `xlrd` package is needed by pandas to read the Excel file; install it if it is missing:

```shell
pip install xlrd
```

## Data

Data that correlates poverty, unemployment, and the murder rate: for a given percentage of families with incomes below $5000, what is the number of murders per 1,000,000 inhabitants per annum? In other words, if more people live in poverty, are there more murders? (source)

```python
POVERTY_MURDER_DATA = pd.read_excel('Income vs. Murder.xlsx')

X = POVERTY_MURDER_DATA['Percent with income below $ 5000']
Y = POVERTY_MURDER_DATA['Number of murders per 1,000,000 inhabitants']

POVERTY_MURDER_DATA.head(5)
```

|   | Percent with income below $ 5000 | Number of murders per 1,000,000 inhabitants |
|---|---|---|
| 0 | 16.5 | 11.2 |
| 1 | 20.5 | 13.4 |
| 2 | 26.3 | 40.7 |
| 3 | 16.5 | 5.3 |
| 4 | 19.2 | 24.8 |

```python
plt.plot(X, Y, 'bo')

# label the axes
plt.xlabel('Percent with income below $5000')
plt.ylabel('No. murders per 10^6 residents')

# start the plot at the point (0, 0)
plt.xlim(left=0)
plt.ylim(bottom=0)

plt.gcf()
```

## Find the Best Fit Line

Use linear regression to generate a model that finds the best fit line. The best fit line offers two insights:

1. It tests whether or not `x` has an influence on `y`, with an associated level of confidence.
2. Its slope can be used to predict trends.

If the best fit line has a non-zero slope, then we can say that murders and poverty are correlated with some level of confidence (e.g. 95% confidence that `x` has an influence on `y`).
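As a quick sanity check of the idea that a non-zero slope signals a relationship, the sketch below generates synthetic data with a known positive slope and confirms that the sample correlation coefficient is strongly positive. The data and variable names here are illustrative only, not the notebook's poverty data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on x with slope 2, plus a little noise.
x_demo = rng.uniform(10, 30, size=200)
y_demo = 2 * x_demo + rng.normal(0, 1, size=200)

# The off-diagonal entry of the correlation matrix is Pearson's r.
r = np.corrcoef(x_demo, y_demo)[0, 1]
print(round(r, 3))  # close to 1 for a strong linear relationship
```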

### Choose Initial Weights

Set reasonable initial weights for `w` and `b`. The first guess may or may not appear on the graph, depending on where the random choice lands.

```python
def guess_weight(value):
    return np.random.random_sample() * value

def init_guess():
    return {'w': guess_weight(10), 'b': guess_weight(-100)}
```

```python
def update_guess_data(weight_guess):
    min_line_x = np.amin(X)
    max_line_x = np.amax(X)

    # Draw a line.
    x_fit = np.linspace(min_line_x, max_line_x, 100)
    y_fit = weight_guess['w'] * x_fit + weight_guess['b']

    return {'x': x_fit, 'y': y_fit,
            'w': weight_guess['w'], 'b': weight_guess['b']}
```

```python
guess = update_guess_data(init_guess())
print("w:", guess['w'], " b:", guess['b'])

plt.plot(guess['x'], guess['y'], '-r')
plt.gcf()
```

### Define the Loss Function

Build the array from $1..n$ in three steps:

1. $f(x_i)$: the list of predictions based on the (randomly chosen) starting values of `w` and `b`
2. $f(x_i)-y_i$: subtract the actual values from the predictions
3. $(f(x_i)-y_i)^2$: square each value in the list

Then sum the array and multiply by the inverse of its length (step 4):

$$L = \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i)-y_i\big)^2$$
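As a worked example of this formula with made-up numbers (not the notebook's data): if the predictions are $f(x) = [3, 5]$ and the actual values are $y = [1, 2]$, the squared differences are $[4, 9]$, so the loss is $(4 + 9)/2 = 6.5$.

```python
import numpy as np

f_x = np.array([3.0, 5.0])   # predictions
y = np.array([1.0, 2.0])     # actual values

sq_diff = (f_x - y) ** 2     # [4.0, 9.0]
mse = np.sum(sq_diff) / len(y)
print(mse)  # 6.5
```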

```python
def loss(guess_data):
    y_predict = guess_data['w'] * X + guess_data['b']  # 1
    diff = y_predict - Y                               # 2
    sq_diff = diff**2                                  # 3
    return 1/len(X) * np.sum(sq_diff)                  # 4

loss(guess)
```

### Gradient Descent

Adjust the estimate to decrease the loss and get closer to the best fit line. Recalculate both `w` and `b`.

Each update takes a step towards the valley of the loss surface. The size of the step is determined by the setting for $\alpha$ (also known as the *learning rate*).
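The update directions come from the partial derivatives of the loss with respect to `w` and `b`:

$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\big((wx_i + b) - y_i\big)\,x_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\big((wx_i + b) - y_i\big)$$

Each step then moves against the gradient: $w \leftarrow w - \alpha\,\frac{\partial L}{\partial w}$ and $b \leftarrow b - \alpha\,\frac{\partial L}{\partial b}$.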

```python
def gradient_descent(w, b):
    alpha = 0.0001

    dw = 2/len(X) * np.sum(((w*X + b) - Y) * X)
    db = 2/len(X) * np.sum((w*X + b) - Y)
    w_step = w - alpha * dw
    b_step = b - alpha * db

    return {'w': w_step, 'b': b_step}
```
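One way to sanity-check gradient formulas like these is to compare them against numerical (finite-difference) derivatives of the same loss. The sketch below is self-contained and uses small made-up arrays, not the notebook's `X` and `Y`.

```python
import numpy as np

x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([2.0, 4.1, 5.9, 8.2])

def demo_loss(w, b):
    return np.mean(((w * x_demo + b) - y_demo) ** 2)

w, b = 1.5, 0.3

# Analytic gradients (same formulas as in gradient_descent).
dw = 2 / len(x_demo) * np.sum(((w * x_demo + b) - y_demo) * x_demo)
db = 2 / len(x_demo) * np.sum((w * x_demo + b) - y_demo)

# Central finite-difference approximations.
eps = 1e-6
dw_num = (demo_loss(w + eps, b) - demo_loss(w - eps, b)) / (2 * eps)
db_num = (demo_loss(w, b + eps) - demo_loss(w, b - eps)) / (2 * eps)

print(abs(dw - dw_num) < 1e-4, abs(db - db_num) < 1e-4)  # True True
```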

Plot the single step in yellow.

```python
def take_step(guess):
    step = gradient_descent(guess['w'], guess['b'])
    guess['w'] = step['w']
    guess['b'] = step['b']
    return guess

guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))

plt.plot(guess['x'], guess['y'], '-y')
plt.gcf()
```

Take another step in blue.

```python
guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))

plt.plot(guess['x'], guess['y'], '-b')
plt.gcf()
```

### Find the Best Fit

Stop searching for the best fit once the loss drops below a chosen threshold. Plot the best fit in green. Keep a record of the loss values in `loss_hist` for plotting.

```python
loss_hist = []

def find_best_fit(best_guess):
    while True:
        if loss(best_guess) < 46:
            return best_guess
        best_guess = update_guess_data(take_step(best_guess))
        loss_hist.append(loss(best_guess))

best_fit = find_best_fit(guess)

plt.plot(best_fit['x'], best_fit['y'], '-g')
plt.gcf()
```
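For reference, least-squares regression also has a closed-form solution, so a line found by gradient descent can be cross-checked against `np.polyfit`. The sketch below is self-contained with synthetic data; the notebook's `best_fit['w']` and `best_fit['b']` could be compared to `np.polyfit(X, Y, 1)` the same way.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(10, 30, size=100)
y = 2.0 * x - 5.0 + rng.normal(0, 1, size=100)

# Closed-form least squares via a degree-1 polynomial fit.
w_fit, b_fit = np.polyfit(x, y, 1)

# Many small gradient-descent steps should approach the same line.
w, b = 0.0, 0.0
alpha = 0.0001
for _ in range(200000):
    dw = 2 / len(x) * np.sum(((w * x + b) - y) * x)
    db = 2 / len(x) * np.sum((w * x + b) - y)
    w -= alpha * dw
    b -= alpha * db

print(abs(w - w_fit) < 0.1)  # True: the slopes agree closely
```

A fixed loss threshold like `< 46` only works because we happen to know this dataset; comparing against the closed-form slope is a dataset-independent check.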

### Plot the Loss Values

```python
plt.clf()
plt.plot(loss_hist)
plt.gcf()
```