Lab on confusion matrices and COMPAS

export to both csv and html


INFO370 Problem Set: Is the model fair?
March 5, 2023
Introduction
This problem set has the following goals:
1. Use confusion matrices to understand a recent controversy around racial equality and the criminal justice system.
2. Use your logistic regression skills to develop and validate a model analogous to the proprietary COMPAS model that caused the above-mentioned controversy.
3. Give you some hands-on experience with a typical machine learning workflow, in particular model selection with cross-validation.
4. Encourage you to think about the concept of fairness and the role of statistical tools in the policymaking process.
Background
The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) algorithm is a commercial risk assessment tool that attempts to estimate a criminal defendant's risk of recidivism (recidivism is when a criminal re-offends, i.e. commits another crime). COMPAS is reportedly one of the most widely used tools of its kind in the U.S. It is often used in the U.S. criminal justice system by judges to inform sentencing, although specific rules and regulations vary.
In 2016, ProPublica published an investigative report arguing that racial bias was evident in the COMPAS algorithm. ProPublica had constructed a dataset from Florida public records, and used logistic regression and confusion matrices in its analysis. COMPAS's owners disputed this analysis, and other academics noted that for people with the same COMPAS score but different races, the recidivism rates are effectively the same. Moreover, as Kleinberg et al. (2016) show, these two fairness concepts (individual and group fairness) are not compatible. There is also some discussion in the lecture notes, ch. 12.2.3 (admittedly, the text is rather raw).
The COMPAS algorithm is proprietary and not public. We know it includes 137 features, and
deliberately excludes race. However, another study showed that a logistic regression with only 7 of
those features was equally accurate!
Note: Links are optional but very helpful readings for this problem set!
Dataset
The dataset you will be working with, compas-score-data, is based on ProPublica's dataset, compiled from public records in Florida. However, it has been cleaned up for simplicity. You will only use a subset of the variables in the dataset for this exercise:
age Age in years
c_charge_degree Classifier for an individual's crime: F for felony, M for misdemeanor
race Classifier for the recorded race of each individual in this dataset. We will only consider Caucasian and African-American here.
age_cat Classifies individuals as under 25, between 25 and 45, and older than 45
sex "Male" or "Female".
priors_count Numeric, the number of previous crimes the individual has committed.
decile_score COMPAS classification of each individual's risk of recidivism (1 = low . . . 10 = high). This is the score computed by the proprietary model.
two_year_recid Binary variable, 1 if the individual recidivated within 2 years, 0 otherwise. This is the central outcome variable for our purposes.
Note that we limit the analysis to a time period of two years since the first crime; we do not consider re-offenses after two years.
1 Is COMPAS fair? (50pt)
The first task is to analyze the fairness of the COMPAS algorithm. As the algorithm is proprietary, you cannot use it to make your own predictions. But you do not need to predict anything anyway: the COMPAS predictions are already done and included as the decile_score variable!
1.1 Load and check (2pt)
Your first tasks are the following:
1. (1pt) Load the COMPAS data, and perform the basic checks.
2. (1pt) Filter the data to keep only Caucasians and African-Americans.
All the tasks below concern these two races only; there are just too few offenders of other races in the data.
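A minimal sketch of these steps in Python/pandas (the file name compas-score-data.csv is an assumption; adjust it to the file you were given):

```python
import pandas as pd

# Load the data; the file name is an assumption, adjust as needed
compas = pd.read_csv("compas-score-data.csv")

# Basic checks: size, variable types, missing values
print(compas.shape)
print(compas.dtypes)
print(compas.isna().sum())

# Keep only the two race groups used in the rest of the problem set
compas = compas[compas.race.isin(["Caucasian", "African-American"])]
print(compas.race.value_counts())
```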
1.2 Aggregate analysis (20pt)
COMPAS categorizes offenders into 10 different categories, from 1 (least likely to recidivate) to 10 (most likely). For simplicity, we collapse this into just two categories (low risk/high risk).
1. (2pt) Create a new dummy variable based on the COMPAS risk score (decile_score) that indicates whether an individual was classified as low risk (score 1-4) or high risk (score 5-10).
Hint: you can do this in different ways, but for technical reasons related to the tasks below, the best way is to create a variable high_score that takes value 1 (decile score 5 and above) and 0 (decile score 1-4). See the code sketch at the end of this section.
2. (4pt) Now analyze the offenders across this new risk category:
(a) What is the recidivism rate (percentage of offenders who re-commit a crime) for low-risk and high-risk individuals?
(b) What are the recidivism rates for African-Americans and Caucasians?
Hint: 39% for Caucasians.
3. (7pt) Now create a confusion matrix comparing the COMPAS predictions for recidivism (the low risk/high risk variable you created above) with the actual two-year recidivism, and interpret the results. To be on the same page, let's call recidivists "positives".
Note: you do not have to predict anything here. COMPAS has made the prediction for you; this is the high_score variable you created in point 1. See the referenced articles about the controversy around the COMPAS methodology.
Note 2: Do not just output a confusion matrix with accompanying text like "accuracy = x%, precision = y%". Interpret your results, e.g. "z% of recidivists were falsely classified as low-risk", "COMPAS accurately classified k% of individuals", etc.
4. (7pt) Find the accuracy of the COMPAS classification, and also how its errors (false negatives and false positives) are distributed: compute precision, recall, the false positive rate, and the false negative rate.
We did not talk about FPR and FNR in class, but you can consult the Lecture Notes, section 6.1.1 Confusion matrix and related concepts.
Would you feel comfortable having a judge use COMPAS to inform sentencing guidelines? How well do you think judges can perform the same task without COMPAS's help? At what point would the error/misclassification risk be acceptable for you? Do you think the acceptable error rate should be the same for human judges and for algorithms?
Remember: human judges are not perfect either!
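One possible way to set up points 1-4 in Python with pandas and scikit-learn, continuing from the compas data frame loaded in Section 1.1 (a sketch, not the required solution):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# 1. Dummy variable: 1 = high risk (decile score 5-10), 0 = low risk (1-4)
compas["high_score"] = (compas.decile_score >= 5).astype(int)

# 2. Recidivism rates by risk category and by race
print(compas.groupby("high_score").two_year_recid.mean())
print(compas.groupby("race").two_year_recid.mean())

# 3. Confusion matrix: actual two-year recidivism vs the COMPAS "prediction"
#    (recidivists are the positives)
print(pd.crosstab(compas.two_year_recid, compas.high_score, margins=True))
tn, fp, fn, tp = confusion_matrix(compas.two_year_recid, compas.high_score).ravel()

# 4. Accuracy and the distribution of errors
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)   # true positive rate = 1 - FNR
fpr       = fp / (fp + tn)   # false positive rate
fnr       = fn / (fn + tp)   # false negative rate
print(accuracy, precision, recall, fpr, fnr)
```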
1.3 Analysis by race (28pt)
1. (2pt) Compute the recidivism rate separately for high-risk and low-risk African-Americans and Caucasians (see the code sketch at the end of this section).
Hint: High risk AA = 65%.
2. (6pt) Comment on the results from the previous point. How similar are the rates for the two race groups for low-risk and high-risk individuals? Do you see a racial disparity here? If yes, which group does it favor? Based on these figures, do you think COMPAS is fair?
3. (6pt) Now repeat your confusion matrix calculation and analysis from point 3 of Section 1.2, but this time do it separately for African-Americans and for Caucasians:
(a) How accurate is the COMPAS classification for African-Americans and for Caucasians?
(b) What are the false positive rates (false recidivism rates) FPR?
(c) The false negative rates (false no-recidivism rates) FNR?
Hint: FPR for Caucasians is 0.22, FNR for African-Americans is 0.28.
4. (6pt) If you have done this correctly, you will find that COMPAS’s percentage of correctly
categorized individuals (accuracy) is fairly similar for African-Americans and Caucasians, but
that false positive rates and false negative rates are different. In your opinion, is the COMPAS
algorithm “fair”? Justify your answer.
5. (8pt) Does your answer in 4 align with your answer in 2? Explain!
Hint: This is not a trick question. If you read the first two recommended readings, you will find that people disagree about how to define fairness. Your answer will not be graded on which side you take, but on your justification.
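A sketch for the by-race computations, assuming the compas data frame and the high_score variable from Section 1.2:

```python
from sklearn.metrics import confusion_matrix

# Recidivism rate by race and risk category (point 1)
print(compas.groupby(["race", "high_score"]).two_year_recid.mean())

# Accuracy, FPR, and FNR separately by race (point 3)
for race, group in compas.groupby("race"):
    tn, fp, fn, tp = confusion_matrix(group.two_year_recid, group.high_score).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(race,
          "accuracy:", round(accuracy, 2),
          "FPR:", round(fp / (fp + tn), 2),
          "FNR:", round(fn / (fn + tp), 2))
```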
2 Can you beat COMPAS? (50pt)
The COMPAS model has created quite a bit of controversy. One issue frequently brought up is that it is "closed source", i.e. its inner workings are available neither to the public nor to the judges who are actually making the decisions. But is that a big problem? Maybe you can devise a model as good as COMPAS at predicting recidivism? Maybe you can do even better? Let's try!
2.1 Create the model (30pt)
Create such a model. To avoid explicit race and gender bias, do not include race and gender among the predictors. Finally, let's analyze the performance of the model by cross-validation.
More detailed tasks follow:
1. (6pt) Before we start: what do you think is an appropriate model performance measure here? Accuracy, precision, recall, F score, or something else? Maybe you want to report multiple measures? Explain!
2. (6pt) You should not use the variable decile_score, which originates from the COMPAS model. Why?
3. (8pt) Now it is time to do the modeling. Create a logistic regression model that includes all the explanatory variables you have in the data. (Some of these you have to convert to dummies.) Do not include decile_score (as discussed above), and do not include race and gender in this model either, to avoid explicit gender/racial bias.
Use 10-fold CV to compute the relevant performance measure(s) you discussed above (see the sketch after this task list).
4. (10pt) Experiment with different models to find the best one according to your performance indicator. Try trees and k-NN; you may also include other types of models. Include/exclude different variables. You may also do feature engineering, e.g. create a different set of age groups, include polynomial terms such as age², interaction effects, etc. But do not include race and gender.
Report what you tried (no need to report the full results of all of your unsuccessful attempts) and your best model's performance. Did you get better or worse results than COMPAS?
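A sketch of one possible setup with scikit-learn, continuing from the compas data frame above. The feature list, the accuracy scoring, and the hyperparameters are only starting points, not the required choices:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Features: everything except race, sex, decile_score and the outcome;
# categorical variables are converted to dummies
X = pd.get_dummies(compas[["age", "age_cat", "c_charge_degree", "priors_count"]],
                   drop_first=True)
y = compas.two_year_recid

# 10-fold CV for the logistic regression baseline; replace "accuracy" with the
# measure(s) you argued for in point 1
m = LogisticRegression(max_iter=1000)
print("logit:", np.mean(cross_val_score(m, X, y, cv=10, scoring="accuracy")))

# Other model families on the same features; hyperparameters are arbitrary starting points
for name, model in [("tree", DecisionTreeClassifier(max_depth=5)),
                    ("kNN", KNeighborsClassifier(n_neighbors=25))]:
    print(name, np.mean(cross_val_score(model, X, y, cv=10, scoring="accuracy")))
```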
2.2 Is your model more fair? (20pt)
Finally, is your model any better (or worse) than COMPAS in terms of fairness? Let's use your model to predict recidivism for everyone (i.e. all data, ignoring the training-testing split), and see whether the FPR and FNR for African-Americans and Caucasians are now more similar. (A code sketch follows the task list below.)
1. (6pt) Now use your model to compute the two-year recidivism rates by race and your risk
prediction (replicate 1.3-1). Is your model more or less fair than COMPAS?
2. (6pt) Compute FPR and FNR by race (replicate 1.3-3 the FNR/FPR question). Is your model
more or less fair than COMPAS?
3. (8pt) Explain what you get and why you get it.
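A sketch of the fairness check, continuing from the feature matrix X, outcome y, and a chosen model m from the Section 2.1 sketch (my_high_risk is just an illustrative column name):

```python
from sklearn.metrics import confusion_matrix

# Fit the chosen model on all data and predict for everyone
m.fit(X, y)
compas["my_high_risk"] = m.predict(X)

# Recidivism rates by race and predicted risk (replicates 1.3-1)
print(compas.groupby(["race", "my_high_risk"]).two_year_recid.mean())

# FPR and FNR by race for your own model (replicates 1.3-3)
for race, group in compas.groupby("race"):
    tn, fp, fn, tp = confusion_matrix(group.two_year_recid, group.my_high_risk).ravel()
    print(race, "FPR:", round(fp / (fp + tn), 2), "FNR:", round(fn / (fn + tp), 2))
```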
Finally, tell us how many hours you spent on this problem set.
References
Kleinberg, J., Mullainathan, S., and Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. Tech. rep., arXiv.
CHAPTER 6. ASSESSING MODEL GOODNESS
3. Finally, even if we define the difference ŷi − yi, e.g. as 0 if our prediction is correct and 1 if it is not, we have lost all information about how "far" off the prediction was from the correct one.
Solutions to the first two issues are based on the confusion matrix. In order to address the third issue, we have to look not just at the predicted categories but at the predicted probabilities of the categories.
6.1.1 Confusion matrix and related concepts
The confusion matrix is a popular way to assess the performance of categorical models. Instead of attempting to measure the distance between the predicted and true values, we just tabulate and count all types of classification errors. It turns out that this simple approach allows us to avoid the first and second problem listed above.
Confusion matrix
The confusion matrix is in essence just a cross-tabulation of the actual and predicted classes. It is a central concept in many categorization-related goodness measures. Here we discuss the confusion matrix in the case of two categories only, but it easily generalizes to a larger number of classes.
Assume we have in total T cases from two categories: P positives denoted by “+”,
and N negatives denoted by “−”. One can imagine we are working with a medical
diagnosis problem where the negative ones do not suffer from the disease while the
positive ones have the disease. These are “actual categories”, created either manually
or in another way, possibly through expensive testing or diagnosis, so we know these
are correct. Now we use a model to predict the category for each case. We would
like the model to predict every single case correctly as positive or negative but this
is rarely the case. Say that in total, the model predicts P̂ cases as positive and N̂
cases as negative. In order to get a good overview of our prediction results, we can
create a 2 × 2 cross-table where we present the counts for actual and predicted classes
(Table 6.1). The table indicates how many actual positive cases were predicted as positive, how many as negative, and so on. This is the confusion matrix.
Table 6.1: Example confusion matrix for two categories, labeled here as “−” and “+”. The
table entries are counts: TP, true positives, refers to positive cases that were also predicted
to be positive, P is the number of actual positive cases. See explanations in the text.
                      Predicted
                   −        +        Total
Actual    −        TN       FP       N
          +        FN       TP       P
          Total    N̂        P̂        T
In the case of two categories, the core of the confusion matrix contains four cells:
• True positives (TP) are cases that are actually positive, and are correctly predicted as positive. We like TP to be large.
• True negatives (TN) are actually negative and are predicted as negative. We
like TN to be large.
• False positives (FP), also type-I errors, are cases that are actually negative but
were predicted as positive. We would like FP to be zero.
• False negatives (FN), also type-II errors, are cases that are actually positive
but were predicted as negative. We would like FN to be zero.
In the case of the confusion matrix, these concepts often refer to the corresponding counts, e.g. FP is the number of cases we incorrectly predict as positive. However, they may also refer to probabilities or percentages, e.g. FP may be the probability that we predict a case incorrectly as positive, or the percentage of such cases. Obviously, in the case of a good model we have high values of TP and TN, while the counts of FP and FN
are small. Table 6.1 also includes one-way counts: P is the number of actual positives,
N is the number of actual negatives, P̂ is the number of predicted positives and N̂ is
that of predicted negatives. Finally, T denotes the total number of cases.
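In Python/pandas, such a cross-table can be produced with a single crosstab call; here is a tiny made-up illustration with eight cases (the labels are arbitrary):

```python
import pandas as pd

# Toy data: actual and predicted classes for eight cases ("+" = positive, "-" = negative)
actual    = pd.Series(["+", "+", "-", "-", "+", "-", "-", "+"])
predicted = pd.Series(["+", "-", "-", "+", "+", "-", "-", "-"])

# Cross-tabulation of actual (rows) versus predicted (columns) classes
print(pd.crosstab(actual, predicted,
                  rownames=["Actual"], colnames=["Predicted"], margins=True))
```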
Example 6.1: Confusion Matrix
Dataset Treatment contains information about individual participation in a
labor market training program, and various background information, such as age,
previous unemployment, and income. Here we use that information to estimate a
logistic regression model to predict the participation status based on age, previous
real income and previous unemployment:
Pr(Participated_i) = Λ(β_a · age_i + β_r75 · (re75_i > 0) + β_u75 · u75_i)
The original data has 185 participants out of 2675 individuals in total, while our
model predicts 134 as participants and 2541 as non-participants. When we create
the confusion matrix, a cross-table of actual and predicted values, the result looks like this:
                              Predicted
                              Non-Participants   Participants   Total
Actual   Non-Participants     2452               38             2490
         Participants         89                 96             185
         Total                2541               134            2675
Let's consider participants as positives below. So for TN = 2452 individuals, our model correctly predicts that they did not participate in the program. For an additional TP = 96 cases it correctly predicted that they participated. TN is rather large; this is good news for our model. But unfortunately TP is not much larger than FN = 89, the number of individuals who participated but were incorrectly predicted as non-participants. Finally, the count of type-I errors, false positives, is smaller, FP = 38, indicating that the model does not mis-categorize many non-participants as participants.
Although a 2 × 2 table seems simple, the confusion matrix is actually surprisingly confusing. So it is not surprising that it is called a confusion matrix! This is partly related to the notation and language. In particular, true positives refer to cases that are actually positive and are predicted as positive; not to the "ground truth", the cases that are actually positive, as one may think. This is why we introduce the "actual" status here, to distinguish between the actual positives P and "true" positives TP. In addition, N typically denotes the total number of cases, not just the number of actual negatives. Here we denote the total number of cases by T.
Moreover, you can see the confusion matrix defined in a slightly different way in the literature, e.g. putting actual values in columns and predictions in rows, or putting positives first and negatives second. Be aware of how exactly the matrix is defined in different sources; here we consistently use the definition above.
Exercise 6.1: Compute the confusion matrix
Consider a variable that can be of two categories: “0” and “1”. First, you ask
an expert for her opinion, and later the actual values also become evident. The
values are as follows:
case:     1   2   3   4   5   6   7   8   9   10
Actual    1   0   0   1   1   0   0   0   1   0
Expert    0   0   0   1   0   0   1   0   1   1
Construct the confusion matrix.
Solution on page 398.
Exercise 6.2: Confusion matrix for the naive model
Consider the data in Example 6.1. Let us construct a naive model that predicts every observation to be in the majority category, i.e. the category that is more common (non-participants in this case). What will the corresponding confusion matrix look like if you consider the participants as positives?
Solution on page 398.
Based on these numbers, we define a number of model goodness measures:
• Accuracy: the percentage of correct answers,
      Accuracy = (TP + TN)/T.
  Accuracy is an easy and intuitive summary measure: what percentage of our predictions turn out to be correct. However, it is not very informative in the case of very unequally sized categories, as even a naive model that always predicts the largest class can achieve high accuracy.
• Recall is the percentage of actual positives that are correctly identified (recalled) as positives,
      Recall = TP/P = TP/(TP + FN).
  If our main concern is to capture all positives, recall may be a good measure. However, it is easy to fool: if we predict positive for every case, we get Recall = 1, but the model is hardly of any use.
• Precision is a sort of mirror image of recall: the percentage of predicted positives that turn out to be correct,
      Precision = TP/P̂ = TP/(TP + FP).
  Precision may be a good measure if avoiding false positives is a major concern. Like the other measures, it can also be fooled easily: if we ensure that only the most likely cases are labelled as positive, we can ensure that precision is high.
• F score is an attempt to find a balance between recall and precision. It is just the harmonic mean of these measures,
      F = 2/(1/Precision + 1/Recall).
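As a worked illustration, plugging the counts from Example 6.1 (TP = 96, TN = 2452, FP = 38, FN = 89, T = 2675) into these definitions gives, rounded to two decimals:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{T}  = \frac{96 + 2452}{2675} \approx 0.95\\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{96}{96 + 89}     \approx 0.52\\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{96}{96 + 38}     \approx 0.72\\
F                &= \frac{2}{1/\text{Precision} + 1/\text{Recall}} \approx 0.60
\end{align*}
```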
CHAPTER 12. RESPONSIBLE DATA SCIENCE
12.2.2 Big Data, Big Inequality?
Boyd and Crawford (2012) discuss access to Big Data. Big Data is mainly collected
by players in the internet industry, such as social media or online retail companies.
These firms have both the data and the resources for analysis, and they decide who else has access to the data. This probably leads to inequality in research access, where those with resources (prestigious universities and rich private research labs) have access while the others cannot easily participate in the relevant debate. Neither can they evaluate the quality of published big data-based research.
The fact that the private gatekeepers do not follow similar transparency and public
access requirements as the public sector data collectors will hamper analysis of topics
that the data collectors find inconvenient.
12.2.3 Fairness: Different Measures are Incompatible
Prerequisites: Conditional probability: Section 1.3.3 Conditional probability, page 27
Angwin et al. (2016) analyze the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) algorithm, a profiling algorithm that is widely used in the U.S. criminal justice system to predict whether a defendant is likely to commit a new crime. Their analysis turned up "significant racial disparities". In particular, whites who were labeled "high risk" by the algorithm did not re-offend in 23.5% of cases, while African-Americans who were labeled "high risk" did not re-offend in 44.9% of cases. To put it differently, substantially more low-risk African-Americans were mis-categorized into the high-risk category than the corresponding whites. Unfortunately, this label was not only of interest for academic researchers: the perceived riskiness of recidivism influences sentences, the right to bail, probation, and other measures that have important real-world effects on individual lives. The proponents of the COMPAS score have countered the criticism by demonstrating that at a given score,¹ both whites and African-Americans have a similar probability of re-offending. So in this sense the COMPAS score is a fair measure.
The problem boils down to different concepts of fairness. The criticism of Angwin et al. (2016) centers on group fairness, i.e. the requirement that similar groups of people, here defined by race, should be treated similarly (Jacobs and Wallach, 2021). So whites who do not re-offend should have the same mis-classification rate as blacks who do not re-offend. This is clearly violated by the COMPAS score. However, its advocates rely on individual fairness, the requirement that similar individuals be treated equally: given that they receive an equal COMPAS score, the decision should be the same, independent of race and other personal characteristics. Unfortunately, these two concepts of fairness are not compatible in general (Kleinberg et al., 2016). Except in very specific cases, such as when we can perfectly predict re-offenses, it is only possible to be fair in one way or the other, but not in both ways at the same time. Next, we explain this both theoretically and with a numerical example.
¹ COMPAS assigns each individual a risk score between 1 and 10, with 1 meaning "very unlikely to re-offend" and 10 meaning "very likely to re-offend".
[Figure 12.1: It is hard to make unequal groups equally happy. An unbiased test, independent of color, may still bias our decision-making against the weaker group, here greens: an imperfect test that lets some low-skilled individuals through, combined with more low-skilled green candidates, results in discrimination against greens in hiring. Image: Yuemin Cao, CC0 1.0]
Consider a problem similar to that of COMPAS. There are two groups of people, Greens and Reds. For every person, we are interested in whether they re-offend (denote it by R): they may either re-offend (R = 1) or not re-offend (R = 0). We
also know their individual characteristics X. It has only two possible values: either
X = 0 or X = 1. You can imagine X measures whether they have committed
any crimes earlier, with X = 0 meaning no previous offenses and X = 1 meaning a previous criminal record. However, for whatever reason, there are more people with a criminal record among Greens than among Reds, so that Pr(X = 1|Green) > 0.5 and Pr(X = 1|Red) < 0.5.
Fortunately, we can construct a test, a model similar to COMPAS, that predicts someone's re-offending probability R based on the individual characteristics X. With only two possible categories and two possible X values, we can just compute the re-offending probability depending on X and color, Pr(R = 1|X, color). Assume that the probabilities we find do not depend on color:

    Pr(R = 1|X) = Pr(R = 1|X, Green) = Pr(R = 1|X, Red).    (12.2.1)

So in this sense the model is color-blind. The model only looks at X, not at the color, and makes the predictions based on that. Assume that Pr(R = 1|X = 0) < 0.5 and Pr(R = 1|X = 1) > 0.5, hence the test predicts that a person with no previous criminal record will not re-offend, but those who have a previous criminal record will.
Let us illustrate this with a numerical example (Figure 12.2). There are 24 Reds and 24 Greens in total. Out of those, 16 Reds and 8 Greens have never committed a crime (X = 0), while 8 Reds and 16 Greens have committed a crime earlier (X = 1). We also know that those with X = 0 have a probability of re-offending Pr(R = 1|X = 0) = 1/4, so 4 Reds out of 16 and 2 Greens out of 8 will re-offend. For those with a previous criminal record, the probability of re-offending is Pr(R = 1|X = 1) = 3/4, so out of 8 Reds, 6 will re-offend, while the same is true for 12 out of 16 Greens. See Figure 12.2, left panel. Based on these probabilities, we will predict that everyone with X = 0 will not re-offend and everyone with X = 1 will, and this does not depend on color. So our test is color-blind, and in this sense fair.
Figure 12.2: Left panel: the test is color blind: Pr(R = 1|X) does not depend on color.
However, now greens’ FPR is almost three times that of reds.
Right panel: a test that ensures the FPRs are equal. However, now color is an important predictor of re-offending.
However, if we compute the false positive rate, we come to a different conclusion. Here, FPR is the probability that a person who will not re-commit a crime, R = 0, will be mis-classified as a re-offender. (FPR = FP/N and FNR = FN/P; see more in Section 6.1.1 Confusion matrix and related concepts, page 232.) As we classify the re-offenses solely based on X, we mis-classify all those who will not re-offend (R = 0) but have a previous criminal record (X = 1). So FPR = Pr(X = 1|R = 0). We can easily compute this probability from Bayes' theorem:

    Pr(X = 1|R = 0) = Pr(R = 0|X = 1) · Pr(X = 1) / Pr(R = 0)
                    = Pr(R = 0|X = 1) · Pr(X = 1) / [Pr(R = 0|X = 1) · Pr(X = 1) + Pr(R = 0|X = 0) · Pr(X = 0)]    (12.2.2)
It is obvious that even if Pr(R = 0|X = 1) and Pr(R = 0|X = 0) are equal for both groups, these probabilities are not equal as long as Pr(X = 1) and Pr(X = 0) differ. Hence a color-blind model that estimates the re-offending probability cannot provide a similar FPR for both groups as long as the groups are not equal! The margin of the left panel shows that for Reds, FPR = 1/7, while for Greens the probability is 4/10. Hence low-risk Greens have an almost three times larger chance of being mis-classified as high-risk than the corresponding Reds.
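These rates can be verified directly from the counts above: among Reds, 12 of the 16 with X = 0 and 2 of the 8 with X = 1 do not re-offend; among Greens, 6 of 8 and 4 of 16, respectively. Only the X = 1 non-re-offenders are mis-classified, so

```latex
\begin{align*}
\text{Reds:}\quad   FPR &= \Pr(X = 1 \mid R = 0) = \frac{2}{12 + 2} = \frac{1}{7},\\
\text{Greens:}\quad FPR &= \Pr(X = 1 \mid R = 0) = \frac{4}{6 + 4} = \frac{4}{10}.
\end{align*}
```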
This example corresponds broadly to the COMPAS model. The model uses a set of
individual background variables to compute the re-offending probability, and finds
that given the background, the probability does not depend on race. However, the
FPR differs by race.
Now assume that instead of characteristic X, we observe feature Z, say the city where people live. Z is also related to re-offending, with Z = 1 being associated with a higher likelihood of re-offending. However, now it turns out that the FPR is equal for Reds and Greens. Figure 12.2, right panel, shows a numeric example where, based on Z, we find that FPR = 1/4 for both groups. Whatever your color, the low-risk non-offenders have a 25% probability of being mis-categorized as re-offenders.
However, the test based on Z is not color-blind:

    Pr(R = 1|Z = 0, red) = 1/16    Pr(R = 1|Z = 0, green) = 5/8
    Pr(R = 1|Z = 1, red) = 3/8     Pr(R = 1|Z = 1, green) = 15/16.    (12.2.3)
This test is unlikely to satisfy the fairness requirements either. While now the low-risk individuals have a similar probability of being mis-classified as high risk, we find that color is a very important predictor of re-offending. In particular, whatever Z is, we categorize all Reds as non-offenders and all Greens as offenders. This feels very unfair.
Obviously, in a real application we may find that our model is fair in neither one nor the other sense but gives results somewhere in between. It all depends on what kind of information we have access to, and how it is correlated with re-offending.
There are three separate issues that give us this unfortunate result. The first problem is purely technical: the test is imperfect, in particular Pr(R = 0|X = 1) > 0, so we are unable to perfectly tell who is low-risk. Unfortunately, there is no reason to believe that we will be able to design perfect tests in the future either.
The second problem is that the percentage of high-risk individuals depends on
color. Why is it like this? Is it because of some sort of historical discrimination?
Because of unequal access to education or other resources? Something else? It is
unlikely that we are able to completely eliminate such inequality in the future, but
measures to improve the matters are definitely possible.
The final problem here is the fact that these two fairness concepts—individual
fairness and group fairness—are incompatible. We use the same word, “fairness”, to
denote somewhat different concepts, and intuitively we feel that both are important.
But that does not make these two concepts compatible.
Part of the problem is that the group fairness concept is based on group labels
that are irrelevant as predictors, even more, that are supposed to be irrelevant as
predictors. If we believe that group labels should not be used for prediction, and
they do not carry any information (as in the first example), then why do we want the
fairness to be based on the “irrelevant” group labels? There are no good answers. It
just feels “fair”.
But whatever the fundamental problem is, policymakers are facing an inconvenient choice. They have to decide between:
• Ignoring the equal treatment principle
• Ignoring the score-balancing requirement
• Not using the test at all. However, in the example above this easily leads to
perfect color discrimination where only Reds are hired.
Obviously, one can also use a combination of these options.
12.3 Human Versus Algorithmic Decision-Making
Algorithms are often criticized as "obscure", in particular when their inner workings are not published.