As I write this (May 30th), fires are burning in the streets across the United States. Protesters are smashing windows and clashing with the police. They chant "Hands up! Don't shoot" and "I can't breathe."
Protests against police brutality erupted after a video went viral in which an officer is seen using his knee to pin a black man, George Floyd, to the ground by the neck. Floyd says he can't breathe, says "please" and "mama." Then he is quiet.
The officer, Derek Chauvin, was fired and has been charged with third-degree murder. According to an affidavit released by the prosecutors, Floyd had already been arrested and placed in a squad car when Chauvin arrived at the scene. He pulled George Floyd out of the car, placed his knee onto his neck and held him down on the ground. After five minutes Floyd became unresponsive. Another officer couldn't find a pulse. Chauvin continued to hold his knee down on George Floyd's neck for another three minutes.
Meanwhile the governor of Minnesota has not ruled out the possibility of bringing in the U.S. military to bring protests under control.
The act of violence committed against an unarmed black civilian by a police officer in this case was particularly egregious and the protests have escalated commensurately, but a version of this story has played out over and over again, and will continue to if nothing changes.
## Code for embedding video.
HTML('<iframe title="New York Times Video - Embed Player" width="480" height="321" frameborder="0" scrolling="no" allowfullscreen="true" marginheight="0" marginwidth="0" id="nyt_video_player" src="https://www.nytimes.com/video/players/offsite/index.html?videoId=100000007164865"></iframe>')
There were 1,099 killings by police in 2019. This is close to the average over the last 10 years. For comparison, the total number of on record killings by law enforcement officers in Germany since 1952 is about half of that.
Excessive police brutality affects members of all racial groups, but victims are disproportionately black. The ratio is skewed even further among victims who were unarmed when killed.
Similar racial bias exists in cases of non-lethal excessive force and arrests for drug possession and misdemeanors.
So far the adoption of body cams and attempts to diversify the police force have led to very little change in officer behavior and the degree to which police are held accountable.
Campaign Zero created a metric for scoring California police departments based on three categories:
Police violence
Police accountability
Approach to policing
A separate score was created for each category, with the overall grade given by the average of the three scores.
“The scorecard is designed to help communities, researchers, police leaders and policy-makers take informed action to reduce police use of force and improve accountability and public safety in their jurisdictions.”
-CA Police Scorecard Website
Here's a video of Sam Sinyangwe, the head data scientist on the project, talking about using technology to help end police violence:
A Jupyter notebook of their methods for measuring racial bias in arrests and deadly force can be found here: https://github.com/campaignzero/ca-police-scorecard. We explain a part of their methods for measuring racial bias in a later section of this notebook.
Of course you can always donate, but you can go deeper by signing up here, where you can request to join a workgroup in
Policy research & advocacy
Data collection & analysis
or suggest your own idea.
3. The Data
All of the data used to compute scores (along with extra data for further study) is contained in this spreadsheet.
Where did the data come from?
Deadly force, civilian complaints and arrests, from official databases:
The bias data also includes data relating to racial bias in arrests for drug possession.
Let's take a moment to explore the concept of a metric, which, in this course, has the following meaning:
metric: A numerical quantity to measure the magnitude of a (qualitative) concept.
The police scorecard is a metric that evaluates police departments based on the aforementioned categories. Some other prominent examples:
GDP has become the overarching measure of economic and social progress.
University rankings purportedly measure the "quality of the education" that students receive.
Credit scores measure the "creditworthiness" of individuals.
Even when defined with the best intentions and care, essentially every metric (including the Police Scorecard) is subject to several pitfalls:
It is impossible for a metric to capture the full complexity of a situation. At best, the choice of which factors to include and exclude is extensively studied and justified. At worst, some old white dudes in a room decided on a whim.
It is an immense problem that we as a society tend to view numerical measurements as precise and objective, and don't question the choice of which factors are included in or excluded from a metric. Here's economist Simon Kuznets on GDP:
"The valuable capacity of the human mind to simplify a complex situation in a compact characterization becomes dangerous when not controlled in terms of definitely stated criteria. With quantitative measurements especially, the definiteness of the result suggests, often misleadingly, a precision and simplicity in the outlines of the object measured. Measurements of national income are subject to this type of illusion and resulting abuse, especially since they deal with matters that are the center of conflict of opposing social groups where the effectiveness of an argument is often contingent upon oversimplification. [...]"
It's less pretty, but it often makes much more sense not to rank entities along a single one-dimensional line; a lot of information gets lost in the process.
The Police Scorecard does a good job presenting all of the parts that contribute to their metric.
Optimizing a number
After WWII, GDP growth became the standard for measuring a country's development and the health of its economy. Now the leading prerogative of most countries is to sustainably maximize it. This can have strange side effects:
The world's "strongest" economy has crumbling roads, undrinkable water and generally poor infrastructure.
Many value GDP growth over saving lives, preserving nature, increasing access to basic human rights...
Volunteer work, household labor, contributions to (creative) commons, etc. apparently add no value to the economy.
Once we've replaced a complex goal with optimizing a number, we might cease to ask ourselves if the metric is in fact in line with our goals.
Instead we are likely to justify our actions by pointing to improvements in an "objective" (not really) number.
Metrics can be gamed
Once a concept has been reduced to a metric that is then widely adopted, the metric can be exploited to "win."
Take the college ranking example:
Some universities have confessed to paying the fees for students to retake their SATs, and to falsifying SAT scores, acceptance rates, graduation rates, etc.
Universities spend huge amounts of money on marketing and change their application procedures so that their acceptance rates go down.
Such tactics have paid off and significantly raised the rank of the university in question. But how much do selection rates and SAT scores have to do with the quality of the education and experience students receive at the institution?
What's excluded often matters!
What if GDP factored in things like environmental health, quality of infrastructure, wealth disparity, household labor, public commons etc.?
Moral judgements are often encoded in what we include/exclude from a model.
Can a single metric be useful?
It's easy to compare different values
Easier for humans to understand/tells a good story
If well-designed, can be a good way to motivate people/groups and to easily track the effects of new policies
5. Numerical Proxies
In building a metric to measure the combination of police violence, accountability and approach to policing, we need numerical proxies for each category. For us, this means
numerical proxy: quantitative data that we can use as a stand-in for the quality we want to factor into the metric.
The proxies determine the data we wish to use. On the other hand, the proxies we choose might depend on what data is obtainable.
Let's cover each of the numerical proxies in the Police Scorecard:
The proxies for police violence were:
percentage of violent non-lethal force used per arrest
percentage of deadly force used per arrest
number of unarmed civilians killed or seriously injured
racial bias in arrests and deadly force
Note: the project leaders devised a separate metric for measuring racial bias.
The proxies for accountability were:
percentage of civilian complaints sustained
percentage of discrimination and excessive force complaints sustained
percentage of complaints alleging police committed a criminal offense sustained
The proxies for approach to policing were:
percentage of misdemeanor arrests per population (as a proxy for "broken windows" or "zero tolerance" policing)
percent homicides cleared (as a proxy for effectiveness at solving serious crimes)
Note that some of the proxies overlap with each other.
What potential pitfalls are there in using these numbers?
6. Scoring racial bias in police use of deadly force
Studies show that relative to population, black men in the U.S. have twice the risk of being killed by a police officer as white men. (See the link for commentary on how to interpret this statistic.)
Campaign Zero's model for racial bias in use of deadly force is based on the following question:
What would deaths by police look like in a given city if the victims had been white?
The main idea is that they consider the chances a white person has of dying in an encounter with the police as 'normal' or 'baseline.'
Campaign Zero's method attempts to measure racial bias in deadly force by looking at how far from 'normal' the number of black victims is, where normal is the amount that would have died if they were white.
NOTE: Their model examines bias against both black and latinx populations. We focus on the black population for simplicity.
NOTE: You could do something very similar to measure other sorts of bias! For example,
Gender bias in job hiring
Race bias in dating
You could also do something like defining a baseline level of carbon emission (say for companies), and compare how far from the baseline different companies emissions are.
General population vs. Arrested individuals
There is a choice when comparing use of deadly force across different racial groups.
We can look at the number of victims of deadly force compared to the entire population, or
we can look at the number of victims of deadly force compared to the number of arrests.
Campaign Zero chooses the latter and gives a good explanation in their Jupyter Notebook:
"Imagine that every interaction with police is a single round of Russian roulette and white people are given two levels of advantage: first, they simply don't have to play the game as often, and when they do, they're given a different revolver with fewer chambers loaded. Both lead to disproportionate use of deadly force against communities of color. The first advantage (not having to interact with the police as much in the first place) is accounted for in the arrest disparity scores, and here we want to examine what happens given that an arrest did occur, and how that changes when a black person is arrested."
Probability of being killed for white residents
The probability p of a white person being killed by the police while being arrested is estimated as follows:

p = (# white victims of deadly force) / (# white arrests)
We want to calculate this probability separately for each police department. This will establish what is 'normal' for that city, and we want to compare that to the amount of black victims in the same city. This p might be relatively high in an urban area with lots of poor white people, but low in a rich, white community.
NOTE: The actual scorecard doesn't assume that p=0 for a department that has killed zero white residents. The assumption is that in such cases, too few white residents were arrested to get a good estimate of the probability. It's never impossible that a white resident will be killed by the police. For such departments, the model takes the statewide probability as a baseline instead. Is this a good solution?
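The per-department estimate can be sketched in pandas. The column names below are the ones the notebook's own filter call uses; the agencies and numbers are invented purely for illustration:

```python
import pandas as pd

# Toy data in the shape of the scorecard spreadsheet (column names taken
# from the notebook; the numbers here are made up for illustration)
df = pd.DataFrame({
    'Agency Name': ['Exampleton PD', 'Safetown PD'],
    'White Victims of Deadly Force': [3, 0],
    'White Arrests': [300, 500],
})

# Estimated probability that a white arrestee is killed, per department
df['p_white'] = df['White Victims of Deadly Force'] / df['White Arrests']
print(df[['Agency Name', 'p_white']])
```

Note that Safetown gets p = 0 here, which is exactly the situation the fallback to the statewide baseline is meant to handle.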
Assumption on Distribution of Deaths
Imagine that the above probability p of a white resident being killed by the police in a town called Exampleton is 1%, i.e., one person killed for every 100 arrests.
Suppose that 1,000 black people are arrested in Exampleton. If there is no racial bias, you would expect approximately 0.01×1,000=10 black victims of deadly violence. Maybe not exactly 10 though. There's a high chance that the number would be close to 10, and a very low chance that it would be far from 10, say 900 deaths, for example.
If there are indeed 900 black victims of deadly violence in Exampleton (this is a very evil town), our metric for bias should say that the police in Exampleton are very biased.
The Campaign Zero model makes an important assumption:
Assumption: The number of deadly force incidents is normally distributed.
What does this mean?
Normal distributions are best understood visually:
This is also called a Gaussian distribution or a bell curve. It appears all the time in nature when a number usually clusters around some average μ. For example, the heights of individual humans are normally distributed around an average of μ≈170cm.
The bump in the middle indicates that most people's heights are close to average. Really high or low values have a very low probability of occurring. These are the so-called outliers.
The width of the curve is determined by the standard deviation σ, which roughly says how much the data varies around the average. In the above picture you can see that if data follows a normal distribution, then most values (95%) fall within two standard deviations of the average.
Returning to the height example, you might expect σ≈10cm since most people are between 170−20=150cm and 170+20=190cm tall. (Here we computed μ−2σ and μ+2σ.)
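The "95% within two standard deviations" rule is easy to verify numerically. This sketch simulates heights with the μ≈170cm, σ≈10cm values from the example above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate one million heights from a normal distribution with
# mean 170 cm and standard deviation 10 cm
heights = rng.normal(loc=170, scale=10, size=1_000_000)

# Fraction of simulated heights within two standard deviations
# of the mean, i.e. between 150 cm and 190 cm
within = np.mean((heights > 150) & (heights < 190))
print(round(within, 3))  # close to 0.95
```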
Relation to racial bias score
The bias score in Campaign Zero's model is based on how many standard deviations away from average/normal the number of black victims of deadly police violence is.
One standard deviation larger than expected indicates slight bias.
Three standard deviations bigger indicates extreme bias.
Here is the general (arbitrary police department) description of the above Exampleton example:
Consider a department for which a white person has a probability p of being killed by a police officer during an arrest. If n = # of black arrests by the department, then we expect approximately μ = n⋅p cases of deadly violence against black arrestees if there is no bias.
Under the assumption that the distribution is normal, the standard deviation is given by

σ = √(n⋅p⋅(1−p))
If x = # black victims of deadly violence, then the distance z (in standard deviations) between x and μ is given by

z = (x−μ)/σ
This number, called the Z-score of x, is the basis for the racial bias score.
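Putting the pieces together, the Z-score computation can be written as a small function. This is a sketch of the calculation described above, not Campaign Zero's actual code; the function name is ours:

```python
import math

def bias_z_score(n_black_arrests, p_white, x_black_deaths):
    """Z-score of the observed black deaths against the white baseline.

    mu = n * p is the count expected if black arrestees faced the same
    risk as white arrestees; sigma = sqrt(n * p * (1 - p)) is the
    standard deviation under the normal approximation used in the text.
    """
    mu = n_black_arrests * p_white
    sigma = math.sqrt(n_black_arrests * p_white * (1 - p_white))
    return (x_black_deaths - mu) / sigma
```

For instance, `bias_z_score(10000, 0.01, 120)` returns roughly 2, matching the Exampleton calculation later in the text.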
NOTE: Here we have a problem! For departments where no white residents were killed, we have p=0. Then σ=0 and z is undefined, since you can't divide by zero.
To address this problem, Campaign Zero replaces the probability p for the department with the statewide probability

p_state = (# white residents killed statewide) / (# white residents arrested statewide)
Is this reasonable? The problem is that not enough white people were arrested by the department in question to determine the "true probability" that a white person would be killed during arrest by the police. Using the Russian roulette analogy, if the game is only played a few times, it's difficult to determine how many chambers are loaded. Just because no one died, doesn't mean there are no loaded chambers. However, using the statewide probability is likely an overestimate. Many places with zero white victims of deadly force are likely to be safer than the state average.
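The fallback logic can be sketched with `numpy.where`. Column names follow the scorecard spreadsheet; the agencies, counts and statewide probability below are all invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy departments; Safetown has no white victims, so its own estimate
# of p would be zero (the numbers are made up for illustration)
df = pd.DataFrame({
    'Agency Name': ['Exampleton PD', 'Safetown PD'],
    'White Victims of Deadly Force': [3, 0],
    'White Arrests': [300, 500],
})

p_state = 0.004  # hypothetical statewide probability

p_dept = df['White Victims of Deadly Force'] / df['White Arrests']

# Fall back to the statewide baseline wherever the department-level
# estimate is zero
df['p_white'] = np.where(p_dept > 0, p_dept, p_state)
print(df['p_white'].tolist())  # [0.01, 0.004]
```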
Return to Exampleton
If we go back to Exampleton, suppose now that p = 1% and n = 10,000 black arrests, and that the Exampleton police are responsible for the deaths of x = 120 black victims. This is somewhat higher than the expected number of 100 deaths if the victims were white. Here we have

σ = √(n⋅p⋅(1−p)) = √(10,000⋅0.01⋅0.99) ≈ 10
Then the Z-score is z = (x−μ)/σ = (120−100)/10 = 2. This indicates quite high bias.
From Z-Score to Percentile
After computing the Z-score for each police department, they are ranked from best to worst and assigned a percentile score from 0-100. If Exampleton gets a score of 75, then it had a better Z-score than 75% of police departments.
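One way to implement this ranking (a sketch, not Campaign Zero's code) is to score each department by the fraction of departments with a worse, i.e. higher, Z-score:

```python
import pandas as pd

# Hypothetical Z-scores for five departments (higher = more bias)
z = pd.Series([2.0, -0.5, 0.1, 3.2, 0.0],
              index=['A', 'B', 'C', 'D', 'E'])

# A department's score is the percentage of departments with a worse
# (higher) Z-score, so a score of 75 means it beat 75% of departments
score = z.apply(lambda zi: 100 * (z > zi).mean())
print(score.sort_values(ascending=False))
```

Here department B (Z = −0.5) scores 80, since it has a better Z-score than four of the five departments, while department D (Z = 3.2) scores 0.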
7. Code for the Racial Bias Score
In this final section, we walk through the code for carrying out the process described in section 6.
### import the python libraries we need
import pandas
import numpy

# Allow pandas to display many rows and columns from our data tables
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# NOTE: It is customary to 'import pandas as pd' and 'import numpy as np',
# meaning you add 'as pd' and 'as np' to the above lines of code.
# This just means that in the code you can write pd and np instead.
Interestingly, about 50 of the 100 departments (the height of the middle bar) have a bias score near zero. About 30 have a negative bias score (which would indicate that they are actually biased against whites). About 20 departments seem to exhibit clear bias.
Is this right? Probably not entirely: we used the statewide probability for departments with no white victims (which was most of them). This likely overestimates the chances that a white person is killed, and therefore overestimates the expected number of black deaths. This makes the real number of black deaths seem less biased.
In their work, Campaign Zero also estimated bias against latinx residents and took the maximum between black bias and latinx bias. This might help correct the problem.
f) Convert to percentile.
To convert Z-scores to 0-100 percentile scores, do the same procedure as in the Week 2 assignment.
NOTE: The actual scorecard doesn't assume that p=0 for a department that has killed zero white and black residents. The assumption is that in such cases, too few white residents were arrested to get a good estimate of the probability. It's never impossible that a white resident will be killed by the police. Rather than take the statewide average for white victims, I drop the agencies where there are no victims of deadly force (black or white).
Drop those agencies, and note that the following table will be missing several state agencies.
nonzero_data = bias_data.filter(['Agency Name', 'White Victims of Deadly Force',
                                 'Black Victims of Deadly Force',
                                 'White Arrests', 'Black Arrests'], axis=1)
nonzero_data = nonzero_data[nonzero_data['White Victims of Deadly Force'] != 0]
nonzero_data = nonzero_data[nonzero_data['Black Victims of Deadly Force'] != 0]
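With the zero-victim agencies dropped, the Z-score pipeline from section 6 can be run directly on the remaining columns. The sketch below uses a toy stand-in for `nonzero_data` (same column names as above; the agencies and numbers are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for nonzero_data (column names as in the scorecard
# spreadsheet; the numbers are made up for illustration)
nonzero_data = pd.DataFrame({
    'Agency Name': ['Exampleton PD', 'Riverville PD'],
    'White Victims of Deadly Force': [3, 2],
    'Black Victims of Deadly Force': [12, 2],
    'White Arrests': [300, 400],
    'Black Arrests': [600, 450],
})

# White baseline probability, expected black deaths, and spread
p = nonzero_data['White Victims of Deadly Force'] / nonzero_data['White Arrests']
n = nonzero_data['Black Arrests']
mu = n * p
sigma = np.sqrt(n * p * (1 - p))

# Z-score: distance of the observed black death count from the baseline
nonzero_data['Z-Score'] = (nonzero_data['Black Victims of Deadly Force'] - mu) / sigma
print(nonzero_data[['Agency Name', 'Z-Score']])
```

In this toy table Exampleton PD comes out far above its baseline (z ≈ 2.5), while Riverville PD sits slightly below it.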