Questions on Central Limit Theorem, statistical hypotheses and confidence intervals

CHAPTER 1. INTRODUCTION TO STATISTICS
1.3.4 Random Variable
A random variable (RV) is a central concept that connects probability theory to statistics. In particular, it makes numbers out of events, so that one can use the mathematical apparatus to analyze random processes. The concept of RV is somewhat
complex, so we start with the easy part: it is easy to remember what an RV is not.
First, a random variable is not random; and second, a random variable is not a variable.
But then what is RV? In essence, it is a rule that assigns a number to each event
in the sample space S. Most of the events we care about occur in our physical world
and are not numbers, but we need numbers to use the mathematical apparatus. So
we need a RV that links the outcomes of the stochastic phenomenon we are analyzing
with some sort of numbers.
RV-s are typically denoted by capital Latin letters, such as X or Z. Formally, a
RV X is X : S → R where S is the sample space of the phenomenon we are analyzing.
For instance, if our experiment is flipping a coin, then its sample space is {H, T }, i.e.
it contains just possible events, heads and tails. We can assign zero to tails and one
to heads, and define the RV X as
\[
X(E) = \begin{cases} 0 & \text{if } E = T\\ 1 & \text{if } E = H. \end{cases} \tag{1.3.9}
\]
Obviously, one can also define it in the opposite way X(H) = 0 and X(T ) = 1; and
in a myriad of other ways. For instance, if we want the expected value to be zero,
(see Section 1.3.5 for explanation) we can define X(T ) = −1 and X(H) = 1.
Now when we actually conduct the experiment and toss the coin, we will receive
either H or T . We use RV X to convert the realized outcome into a number and
hence we get either 1 or 0. These are two possible observed values or realizations of
the RV. Let’s repeat here: H and T are events. The corresponding numbers, 0 and 1,
are realizations (observed values). While RV-s are traditionally denoted by upper case
Latin letters, such as X or Y , their observed values (realizations) are denoted with
the corresponding lower case letters, such as x and y. If we observe many realizations
(e.g. toss the coin multiple times), we usually denote those using a subscript like
x_1, x_2, ..., x_N.
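To make the event/realization distinction concrete, here is a minimal sketch in Python (the language and names are our choice, not part of the text): the rule X is fixed and non-random, while the realizations vary from run to run.

```python
import random

random.seed(1)  # reproducible tosses

# The RV: a fixed, non-random rule that maps events to numbers
def X(event):
    return {"T": 0, "H": 1}[event]

# The random part: realized events drawn from the sample space {H, T}
events = [random.choice(["H", "T"]) for _ in range(10)]
realizations = [X(e) for e in events]  # x_1, x_2, ..., x_10

print(events)
print(realizations)
```

Re-running without the fixed seed gives different realizations, while the rule X never changes.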
So RV is not random. It is two things:
• a phenomenon or experiment with well-defined properties; and
• a rule how to assign numeric labels to the events in that phenomenon.
But the realizations of RV-s are random.
It is extremely important to be able to distinguish between a non-random RV
and its random realizations! Part of the confusion arises from how the word “random” is used: random in the concept random phenomenon refers to the fact that
this phenomenon can produce random realizations. However, the properties of the
phenomenon are not random, they are fixed and well defined. But random in random
outcomes refers to the fact that the outcomes are unpredictable, random.
1.3. BASICS OF PROBABILITY THEORY
35
Cheatsheet 1.3: Random variable and realization
• Random variable (RV) consists of two things:
– a phenomenon or experiment with well-defined properties; and
– a rule how to assign numeric labels to the events in that phenomenon.
It is a rule, and it is not random.
• Realizations are the numbers that result from a random experiment, obtained by converting the outcomes (events) to numbers using a RV. These are random.
• A number of realizations together form a sample.
Different ways to define RV-s are related to the questions about the physical world we
are interested in. Take the example of rolling two dice (see Section 1.3.1). For instance,
if we are just interested in different outcomes, we can enumerate the combinations by
defining

\[
Y(E) = \begin{cases} 1 & \text{if } E = (1,1)\\ 2 & \text{if } E = (1,2)\\ \;\vdots \\ 36 & \text{if } E = (6,6). \end{cases} \tag{1.3.10}
\]
Now the RV will tell us whether we got (1,5), (5,1) or (2,4): all these combinations correspond to different values. However, we may not be interested in the different combinations, but instead in the sum of the points, whichever sides come up. Now we can
define


\[
Z(E) = \begin{cases} 2 & \text{if } E = (1,1)\\ 3 & \text{if } E = (1,2) \text{ or } E = (2,1)\\ 4 & \text{if } E = (1,3) \text{ or } E = (2,2) \text{ or } E = (3,1)\\ \;\vdots \\ 12 & \text{if } E = (6,6). \end{cases} \tag{1.3.11}
\]
Exercise 1.8: Rolling two dice
Take the example of two dice. Construct a random variable that answers the
question: did we get any 6-s?
Solution on page 387.
The RV outcomes have, in general, different probabilities as they correspond to
different events in the sample space. In case of the coin-toss RV X, the value 0
corresponds only to event T and the value 1 to the event H. Both of these events
have probability 0.5 if it is a fair coin, and hence the values 0 and 1 will also have
equal probability. However, this is not the case for RV Z above that counts the points
on two dice. Although all the atomic events are equiprobable, the RV values are not
because those correspond to different compound events. Value 2 corresponds only to
a single atomic event (1,1) and hence has probability 1/36. Value 3 corresponds to
two atomic events, (1,2) and (2,1) and hence has probability 2/36. Probability of 4
is 3/36 and so on.
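The probabilities of the values of Z can be found mechanically by enumerating all 36 atomic events; a minimal Python sketch (our own illustration, not part of the text):

```python
from collections import Counter
from fractions import Fraction

# All 36 equiprobable atomic events (i, j) when rolling two dice
events = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# The RV Z from (1.3.11): the sum of the points
counts = Counter(i + j for (i, j) in events)

# Each atomic event has probability 1/36; a value of Z gets the
# total probability of its compound event
probs = {z: Fraction(n, 36) for z, n in counts.items()}

print(probs[2], probs[3], probs[4])
```

This reproduces the probabilities 1/36, 2/36 and 3/36 stated above.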
Exercise 1.9: Find Pr(Z = 6)
Consider the RV Z as defined above. Find Pr(Z = 6), the probability that rolling
two dice will give you sum 6.
Hint: consider drawing a 6 × 6 table of faces and marking the sum of the dots
in each table cell.
Obviously, when the sample space is discrete—there is only a limited number of
different events—then there is also only a finite number of possible different RV values.
We talk about discrete random variables. Discrete random variables can be presented
as a probability table. For instance, when tossing a fair coin and denoting heads by 1
(RV X on page 34) we can represent the values as a table:
Value        0     1
Probability  0.5   0.5
Such a table is very convenient when computing expectation, variance, and other
properties of the RV.
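As a preview of the computations discussed in Section 1.3.5, the table can be used directly: the expectation is the probability-weighted sum of the values. A minimal Python sketch (our own illustration):

```python
# Probability table for the fair-coin RV X (value -> probability)
table = {0: 0.5, 1: 0.5}

# Expected value: probability-weighted sum of the values
ev = sum(x * p for x, p in table.items())

# Variance: probability-weighted squared deviation from the expectation
var = sum((x - ev) ** 2 * p for x, p in table.items())

print(ev, var)  # 0.5 0.25
```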
A few more words about the notation. The random variable, the process of tossing
a coin and counting heads, is typically denoted by a capital letter like X. Individual
realizations, the actual number of heads that we get when we toss the coin, are typically
denoted with lower case letters x. As we usually consider many individual outcomes,
we denote those by subscripts x1 , x2 , and so on, i.e. x1 denotes the number of heads
in the first toss, x2 the second toss etc. But in different situations the subscripts may
mean different things. At other times we may want to enumerate the different possible
outcomes, the lines in the probability table. Then, for instance, x1 = 0 and x2 = 1. In
this case Pr(X = x1 ) means Pr(X = 0), “the probability to receive no heads when
tossing a coin”. This is what we do when talking about expectation below. The
notation can be quite confusing, and one has to understand what exactly xi means
in each case.
In case of a continuous sample space we may have an infinite number of possible
values, and we talk about continuous random variables. For example, flight delay in
minutes or temperature in degrees are continuous random variables (given we measure
not just in minutes and degrees but also include the corresponding fractions). There
are also different ways to define RV-s in case of continuous sample space like flight
delay. The first and most obvious case is just to use the length of the delay d in
minutes. Alternatively, if we are not interested in early arrivals, we may construct a
different RV that records zero for flights that were not delayed:

X = max(0, d)    (1.3.12)

where d is the delay in minutes.
TBD: independent random variables
1.3.5 Expected Value and Variance
Expected Value
Section 1.2.2 above discusses the mean as a way to characterize the central tendency
of a sample of data. Intuitively, one can easily see that as the sample grows, its
mean will converge to the “true mean”. For instance, one can immediately understand
that the “true mean” when tossing the coin should be 0.5. When thinking about
“true mean” we intuitively have in mind a more general sample, the “population”, or
perhaps a stochastic process, where the current data is sampled from. This population
or stochastic process is essentially a RV and the “true mean” is a certain property of
this RV. The property is called expected value or expectation. It is usually denoted by
capital “E”, e.g. E X means the expected value of random variable X.
1.5.2 Doing Statistical Inference
In the previous section we talked a lot about confidence intervals, confidence levels,
and p-values. But we did not discuss how we can actually compute those figures.
The central task in this section is to do exactly that, namely to devise methods to
compute the confidence intervals. We start with a somewhat trivial task, namely
making statistical claims about individual random variable realizations. This helps
us to build the necessary machinery for more complex problems later.
Consider a RV X. What would be a statistically sound claim about its realization
x? Imagine X is the temperature tomorrow, and tomorrow we will learn its actual
value (realization) x. Obviously, today we still do not know the correct value. But we
can still say something like “with 95% confidence the temperature tomorrow will be
between 14 and 17 degrees”. How can we find the boundaries 14 and 17, and where
is the 95% coming from?
The Simulation Approach
Confidence level is the
probability that rejecting H0 is
the correct decision. See
Section 1.5.1 Hypothesis testing
and confidence level, page 63,
and Cheatsheet 1.5.
Let’s start with the “95%” first. This is the confidence level, the mirror image of
the significance level. We have to decide it before the analysis. In the current case, the
confidence level means the probability that our prediction is correct: in 95% of
cases, the temperature falls in the [14, 17] interval. This is the 95% confidence interval
for our weather forecast. We can imagine an implicit H0 : Tomorrow we will have …
degrees. However, the prediction here is not just a single number but a possible range.
Figure 1.14: Temperature distribution from a hypothetical weather model. The expected
value is 15.1, the central 95% confidence interval is between the dotted vertical red lines.
1.5. STATISTICAL INFERENCE
69
All these values are uniquely determined by the distribution of X. In the case of a
weather forecast, the distribution originates from the weather model we are using.
Such models include random components to account for the fact that we cannot
predict weather perfectly. Assume we run the model 1000 times and receive 1000
temperature predictions as in Figure 1.14. The distribution is somewhat right-skewed,
and predicts the temperature between 13 and 19 degrees, with the expected value being
15. Other values than 15, such as 14 and 16, also look perfectly feasible. But which
values do we consider feasible? For instance, 18 degrees seems unlikely, and even
warmer weather seems even less plausible.
A common answer to this question is to look at the central 95% of the predictions.
Central 95% of the observations means that we consider the leftmost (the coldest)
2.5% and the rightmost (the warmest) 2.5% predictions to be unlikely. Given 1000
predictions, we can just remove the 25 coldest and 25 warmest predictions, and we
are left with the central 95% of the predictions. This is equivalent to computing the
0.025-th and 0.975-th sample quantiles. In case of this sample, the corresponding
quantiles are 13.55 and 17.18 degrees (dotted vertical lines on the figure). This interval, [13.55, 17.18], is called a confidence interval (CI), more precisely the 95% confidence
interval for the predicted temperature. So the 95% CI is the interval that contains the
actual value with 95% probability, given that our weather model is correct. The values in this range are considered likely, and those outside this range are considered
unlikely. So we may say that +16 degrees is quite likely but +18 is unlikely.
We can reject H0 : “temperature will be over 18C” at the 5% significance level.
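The whole simulation approach can be sketched in a few lines of Python. The skewed distribution below is a made-up stand-in for the weather model, not the actual data behind Figure 1.14:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for 1000 runs of the weather model: a right-skewed
# temperature distribution around 15 degrees (illustrative only)
predictions = 13 + rng.gamma(shape=2.0, scale=1.0, size=1000)

# Central 95% interval: cut 2.5% off each tail
lo, hi = np.quantile(predictions, [0.025, 0.975])
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```

By construction, roughly 950 of the 1000 model runs fall inside the reported interval.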
As the 95% CI is only correct 95% of the time, one may be tempted to improve on
the type-I error and report a 99% or 99.9% CI instead. This is fair, but unfortunately
the result will be less informative. If I say that the temperature tomorrow will be between
−100C and +100C, then I am correct 100% of the time (well, at least on Earth). However,
such a prediction does not help us to make any practical decisions, such as what
to wear tomorrow. This is the trade-off between providing precise estimates and
avoiding errors. 95% is often a good confidence level, but sometimes one may use
a much higher level. But information that is sometimes incorrect is better than a
claim that is always correct but devoid of any usable information. Sometimes one can
compute the optimal confidence level by considering the cost of type-I (false positives)
and type-II (false negatives) errors, and choosing a confidence level that leads to the
smallest overall loss.
But why did we select the central 95% interval as our confidence region? Why
not leave out the largest 5% and pick the smallest value till 0.95-th quantile as the
confidence region? Or the other way around? And what about picking both extremes
and leaving a narrow 5% gap in the middle? There are a few reasons for this, some
of those more theoretical, some more practical.
• First, we are usually interested in a confidence region that is as concentrated as
possible: we want the 95% of possible outcomes to have a small error margin.
The narrower my temperature forecast, the better idea you have of what to wear
tomorrow. The obvious choice is to pick the region on the histogram with
highest chances and not to leave a gap in the middle where the values are most
likely.
• Second, in typical applications the distribution is symmetric and unimodal (usually close to normal). Both tails are thin, and it makes sense to cut the 2.5% of
observations in both tails if we want to get the most concentrated confidence region. But if it is not symmetric, we may actually choose different percentages
in the two tails.
TBD: Example of asymmetric CI
• But what about cutting off the top 5% of temperature predictions instead of
the middle range? This is sometimes the desired approach. If you want to
know whether the temperature will be no warmer than 17 degrees, then you
may look at one-tailed confidence interval and compute the probability that the
temperature will be below that threshold. Or doing this the other way around,
you may find the threshold temperature that our predictions will not exceed
with 95% probability. But note that we do not care how cold the weather can be
in this case.
If the distribution is bimodal, we may actually want to leave a gap in the middle.
TBD: Example of bimodal CI
TBD: Example to compute the loss and CI based on the loss
Theoretical Confidence Intervals
Next, we discuss how to compute confidence intervals theoretically for certain important special cases. It typically boils down to the requirement that we have to
know the stochastic process that generates our data; based on that, we can often compute
the theoretical quantiles. Sometimes this can be calculated easily (e.g. for the uniform
distribution), sometimes one has to consult tables (e.g. for the normal and t-distribution).
X ∼ N (0,1) means that RV X is normally distributed with expected value 0 and variance 1. Y ∼ N (µ,σ²) means that RV Y is normally distributed with expected value µ and variance σ².
Confidence interval for normally distributed values For a standard normal RV X ∼
N (0,1), the lower 2.5% quantile is q0.025 = −1.96 and the upper 2.5% quantile is q0.975 =
1.96 (q1−α = −qα because the standard normal is symmetric).
If X follows a general normal distribution, X ∼ N (µ, σ 2 ), with expectation equal
to µ and variance σ 2 , the corresponding quantiles are
\[
q_{0.025} = \mu - 1.96\,\sigma \qquad \text{and} \qquad q_{0.975} = \mu + 1.96\,\sigma. \tag{1.5.1}
\]
Example 1.12: Confidence intervals for human height
Look at the fathers’ and sons’ height data, in particular sons’ heights. The distribution as a histogram is shown on the figure below, overlapped with approximated
normal density curve (red).
Figure 1.15: Distribution of son’s height in fathers-sons data. It is well approximated
by normal distribution (red line) with mean µ = 174.5 and standard deviation σ = 7.2,
this is common for measures of adult animals. Dotted vertical lines are the boundaries
of the 95% confidence intervals.
As evident from the figure, most of observations are concentrated around the
mean 174.5 cm, roughly in the interval of 160–190cm. More precisely, the lower
2.5% observations are shorter than 160.3 and the upper 2.5% of observations
are taller than 188 cm. This means that the middle 95% of the observations
fall into the interval [160.3, 188]. This is the 95%-confidence interval (CI) for
sons’ height. If the data were exactly normally distributed with a similar mean
and standard deviation, the corresponding theoretical quantiles would be q0.025 =
174.5 − 1.96 · 7.2 = 160.4 and q0.975 = 174.5 + 1.96 · 7.2 = 188.6. As one can see,
the deviations from the theoretical values are rather small. The normal distribution
is a good approximation for human height.
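If one prefers not to read the ±1.96 quantiles off a table, they can be computed directly; a sketch using scipy with the µ and σ from the example:

```python
from scipy.stats import norm

mu, sigma = 174.5, 7.2  # mean and sd of sons' heights

# Theoretical 2.5% and 97.5% quantiles of the normal distribution
q_lo = norm.ppf(0.025, loc=mu, scale=sigma)
q_hi = norm.ppf(0.975, loc=mu, scale=sigma)
print(f"[{q_lo:.1f}, {q_hi:.1f}]")  # approximately [160.4, 188.6]
```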
Confidence interval for sample mean Now we move to a more important task, namely
finding the confidence interval for the mean of N random variables. Why is this a
more important task?
Remember, we use a sample of N datapoints (realizations) to learn something
about the underlying process, the RV. If N is large, we can easily find the sample
quantiles—the CI. But the sample only has a single mean. True, we can easily compute
it, but our primary interest is to learn about the RV, not about the sample. Hence,
after all, we are not interested in the sample mean but in the expected value of
the underlying RV. For instance, imagine that you are conducting a poll of 1000
the underlying RV. For instance, imagine that you are conducting a poll of 1000
voters before the elections. You find that 520 of them prefer liberals and 480 prefer
conservatives. We are not interested in the sample average 0.52, we are interested in
who will win the elections. This is the expected value of the RV, the preferences of
the whole electorate. We want to use our sample to find the CI for its expected value.
Because the sample only has a single mean, we cannot compute its confidence interval
through quantiles. Consider a sample of size N of RV X. Denote its mean by µ
and variance by s². From the Central Limit Theorem we know that a) the mean X̄ of
RV-s tends to be normally distributed; and b) the variance of the sample mean is inversely
proportional to the sample size,

\[
\operatorname{Var} \bar X = \frac{1}{N} \operatorname{Var} X \tag{1.5.2}
\]
where X̄ denotes the mean of a sample of size N. We also know that in a large sample,
the sample average tends to be close to the expected value E X, and the sample variance
tends to be close to the variance Var X. If we put these two facts together, we have
that for the sample mean

\[
\bar X \sim N\!\left(\mu, \frac{s^2}{N}\right). \tag{1.5.3}
\]
Accordingly, from (1.5.1), the 95% confidence interval of a mean of normals is

\[
\left[\mu - 1.96\,\frac{s}{\sqrt N},\; \mu + 1.96\,\frac{s}{\sqrt N}\right]. \tag{1.5.4}
\]
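A sketch of this confidence interval for a sample mean in Python; the sample below is simulated, so the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=174.5, scale=7.2, size=100)  # hypothetical sample

n = len(x)
mean = x.mean()
s = x.std(ddof=1)  # sample standard deviation

# 95% CI for the expected value: mean +/- 1.96 * s / sqrt(n)
half = 1.96 * s / np.sqrt(n)
print(f"[{mean - half:.1f}, {mean + half:.1f}]")
```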
Example 1.13: Sample mean of sons’ height
Let’s look again at the sons’ height data. As a reminder, the mean sons’ height
is 174.5 and its standard deviation is 7.2. Now we take 1000 times a random
sample of 4 sons and calculate their mean height.a In this way we get 1000 sample
means, and we plot the results on a similar histogram as in Example 1.12. Note
that we have to take many samples in order to compute many sample means,
and be able to plot their distributions.
See Section 1.4.3 Formal
definition of CLT and
assumptions behind it, page 59.
Figure 1.16: Distribution of mean height of four sons. The distribution of means are
well approximated by normal (red line) with mean µ = 174.6 and standard deviation
σ = 3.6. Dotted vertical lines are the boundaries of the 95% confidence intervals.
The figure is plotted on a similar scale as Figure 1.15. It reveals that the result is well approximated with a normal density; however, this time the spread
is narrower. The sample of means has mean value µm = 174.6, almost exactly
the same as the sample of individual heights µ = 174.5. Its standard deviation
σm = 3.6 is only half of that for the individual heights, σ = 7.2. This is to be expected,
as the standard deviation of a mean of four should be 1/√4 = 1/2 of the standard deviation of individual values. Accordingly, both the empirical confidence
interval [167.9, 181.9] and the theoretical confidence interval [167.6, 181.6] are
only half as wide.
a The CLT says that the distribution of the average of independent random variables of all kinds tends to
be normal as N → ∞. However, if X is normally distributed, the result is exactly normal even
if N is small. So the average of just four heights here will result in a distribution that is very close
to normal.
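The construction in this example is easy to replicate; a Python sketch that draws 1000 samples of four from a normal with the same µ and σ (simulated, not the actual fathers-sons data):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 174.5, 7.2

# 1000 samples of 4 "sons" each; one mean per sample
samples = rng.normal(mu, sigma, size=(1000, 4))
means = samples.mean(axis=1)

print(means.mean())        # close to 174.5
print(means.std(ddof=1))   # close to sigma / sqrt(4) = 3.6
```

The standard deviation of the means comes out close to σ/√4 = 3.6, half that of individual heights, as the example states.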
Degrees of freedom and t-distribution
When adding two normal RV-s, the result is normally distributed. But here is a “but”:
it is normally distributed if we know its expected value. This is a nice theoretical result,
but unfortunately it has little practical value. It is rare that we know the expectation;
much more likely, we have to compute it from data.
Figure 1.17: Testing H0 : Z = X1 − X2 = 0 where X1 ∼ N (2, 1) and X2 ∼ N (0, 1). The
histogram displays the distribution of Z for 1000 replications.
74
CHAPTER 1. INTRODUCTION TO STATISTICS
Figure 1.17 exemplifies the problem. We create two random normals from different
distributions, X1 ∼ N (2, 1) and X2 ∼ N (0, 1), and analyze their difference Z = X1 −
X2.¹⁴ The histogram looks quite similar to normal, as confirmed by the orange normal
density curve with mean 1.94 and standard deviation 1.41, close to the theoretical
results E Z = 2 and √(Var Z) = √2 ≈ 1.414. All seems well.
But this image is somewhat misleading. It depicts the distribution of 1000 different
trials, while in practice we are normally left with the result of a single experiment. So
instead of testing the hypothesis on a sample of 1000 results, we need to test it
1000 times on a single result, and see how often we are correct.
But now it turns out the result is not normally distributed any more, but rather
t-distributed. t-distribution is pretty similar to normal. We typically use “standard t”
distribution which is similar to standard normal. Standard t has a single parameter:
degrees of freedom, usually denoted by df. Degrees of freedom is in a way a measure
of information content in data, how many data points we have that actually carry
information. It turns out that after computing the mean, we are left with N − 1 df,
instead of the original number of observations N . This is because if you know the
mean, you can always compute the value of the N -th data point, if you are given the
values of all the N − 1 data points.
So the result of adding normals when you do not know the expected value is not
normally distributed, but t-distributed with N − 1 degrees of freedom. As one may
guess, as the sample gets large, the extra information you lose because you have
to compute the mean from the sample matters less and less, and for large
degrees of freedom (large sample) the t-distribution converges to normal.
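The convergence of t to normal is easy to inspect numerically; a short sketch with scipy:

```python
from scipy.stats import t, norm

# 0.975-quantile of the standard t for increasing degrees of freedom:
# it shrinks toward the normal value 1.96 as df grows
for df in (3, 10, 30, 100):
    print(df, round(t.ppf(0.975, df), 3))

print("normal:", round(norm.ppf(0.975), 3))  # 1.96
```

With only a few degrees of freedom the t quantile is noticeably larger than 1.96, so t-based confidence intervals are wider for small samples.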
TBD: CLT, and the fact we don’t have to start with a normal
1.5.3 Comparing Distributions
One of the most common tasks that leverages statistical inference is to compare
distribution of certain variables across different groups, datasets or time periods.
For instance, can we say that the police treat African-Americans differently than white
Americans? We can easily compute, for instance, the probability that either group
is stopped by the police and compare those figures. But aren’t those numbers just a
statistical blip, just random noise that will go the other way the next day?
In order to answer this and other similar questions we have to approach it through
formal statistical hypothesis testing. We can assume H0 : both groups are treated
equally; and thereafter look at data and see if we can reject the hypothesis. Below,
we walk through such an analysis and test the sample means using a t-test. Note
that even if the averages of two samples are similar, that does not mean that all sample
properties are equal. For instance, one sample may have a different variance. But we
will just test for equality of sample averages for now.
14 Sum and difference are very similar: instead of analyzing the difference, we can look at the sum as
well. As X2 is symmetric around zero, the distribution of Z will be unaffected.
CHAPTER 2. GENERAL MACHINE LEARNING STRATEGIES
Table 2.1: Recent house sales

id   price ($ 1000)   m2    crime (per 1000)
a       800           200        2
b      1500           400        1
c         ?           200        1
2.2 Metric Distance: A Revisit
Section 3.2.2 Metric distance, page 108 introduced the concept of metric. We mainly
discussed Euclidean and other Lp -related metrics. However, in machine learning
applications it is often useful to let data decide the way we measure distance. This
gives rise to various data-based transformations, including feature normalization and
Mahalanobis distance.
Many ML applications also permit violating the strict assumptions behind the
distance metric. We may not be particularly concerned if triangle inequality holds,
or if the distance is zero only for identical vectors. Instead, we want a simple and
good-enough method to rank vectors. This is why we can use the popular cosine similarity
measure. Below we discuss a number of the more popular approaches.
2.2.1 Data-Driven Metrics
Imagine you are using the nearest neighbors method to predict house prices. Your
dataset contains two training examples (a and b) and you want to predict the price of
c (See Table 2.1). Nearest neighbors (see Section 7.1 k-Nearest Neighbors, page 249)
predicts the house c to have the same price as the more “similar” one of the training
examples a and b.
But which house is more similar? Clearly, house a is of the same size while b is
in a similar neighborhood. Obviously, we can choose a distance metric and compute
distance. For instance, the Euclidean distance dE (c, a) = 1 and dE (c, b) = 200 and
hence house c is more similar to house a than to house b. But does this way of
deciding similarity make sense? If we measure house size in km2 and crime rate per
million instead of per 1000 residents, we would come to the opposite conclusion. So
our similarity ranking depends on the measurement units. This looks like a false start
to begin with.
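The unit-dependence of the Euclidean ranking can be verified directly; a Python sketch using the numbers from Table 2.1:

```python
import numpy as np

def euclidean(u, v):
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

# Features: (size in m2, crime per 1000), from Table 2.1
a, b, c = (200, 2), (400, 1), (200, 1)
print(euclidean(c, a), euclidean(c, b))  # 1.0 vs 200.0: a is "closer"

# Same houses, with size in km2 and crime per million residents
a2, b2, c2 = (0.0002, 2000), (0.0004, 1000), (0.0002, 1000)
print(euclidean(c2, a2), euclidean(c2, b2))  # now b is "closer"
```

The same three houses, the same metric, yet the similarity ranking flips with the choice of units.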
We can identify two separate issues here:
1. Ranking according to Euclidean metric (as well as other Lp metrics) is not
robust with respect to measurement units.
2. We don’t know how we should weight difference in size relative to the difference
in crime rate.
The first problem is actually a specific manifestation of the second problem. If we
were able to address the weighting, the first problem would also vanish, as the weights
would presumably make the ranking unit-invariant.
One popular approach to address this problem is to use metrics that are derived
from data. There are several popular approaches; all of these normalize the units with
respect to certain variation in the data. However, even though these distance metrics are
“data driven”, they are not necessarily more correct than others. Sometimes the
preferred metric can be deduced from the nature of the problem; other times one
just has to experiment and find the best approach.
TBD: Example with an island
Feature normalization
Perhaps the most popular such data-driven metric is feature normalization: transforming the features into a mean-zero, variance-one version. This would constitute
an answer to the house-price problem above along these lines: “we think that customers value one standard deviation difference in house size about the same as one
standard deviation difference in the neighborhood crime rate”.
Technically, normalized features can be computed like this. Consider a feature
vector of length N, x = (x_1, x_2, ..., x_N). It can be transformed into the normalized vector x̃ by first subtracting its mean, and thereafter dividing by the standard
deviation:

\[
\tilde x_i = \frac{x_i - \bar x}{\operatorname{sd} x} \qquad \forall i \in 1 \ldots N \tag{2.2.1}
\]

where $\bar x = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the average value of x and $\operatorname{sd} x = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar x)^2}$ is
the standard deviation of x. The normalized vector x̃ has mean zero and standard
deviation 1. Note that normalization is done for each feature vector independently;
the other features do not play any role here. However, we must know the values of
all observations for that vector to compute x̄ and sd x. The result, obviously, does
not depend on the units any more because sd x in (2.2.1) is measured in the same units
as x. We can say that we are introducing a new unit of measurement, the standard
deviation of x.
Example 2.1: Data normalization
Consider the matrix X below. It contains three columns, X•1 , X•2 and X•3 , and
three rows a, b and c. The first two columns have similar spread (2) but different
means (2 and 12 respectively), while the third column also has a different scale.
              X•1   X•2   X•3
a              1     11    10
b              2     12    20
c              3     13    30

mean X̄•j       2     12    20
std.dev X•j    1      1    10
The table also lists the mean value for all columns X̄ •j for j = 1, 2, 3, and (sample
size corrected) standard deviations. When normalizing data, we simply subtract
the corresponding mean from each variable in data (i.e. “2” from values of X •1 ,
X•1 stresses that index “1” is the column index; the bullet • is a placeholder for rows.
See Section 0.1 Scalars, vectors,
matrices, page vii.
“12” from X •2 etc), and divide the resulting differences by the corresponding
standard deviation (i.e. “1” for X•1 and X•2, and “10” for X•3). We get

      X̃•1   X̃•2   X̃•3
a     −1    −1    −1
b      0     0     0
c      1     1     1
One can see that all three variables are now equal—they are all (−1, 0, 1). This
is rather intuitive: they all have one observation in the middle, and two other at
each side and equally far from it.
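Normalization of the matrix X from this example can be checked in a few lines of numpy (using the sample-size-corrected standard deviation, as in the example):

```python
import numpy as np

# Matrix X from Example 2.1 (rows a, b, c)
X = np.array([[1, 11, 10],
              [2, 12, 20],
              [3, 13, 30]], dtype=float)

# Normalize each column: subtract its mean, divide by its
# sample-size-corrected (ddof=1) standard deviation
X_tilde = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_tilde)  # every column becomes (-1, 0, 1)
```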
Figure 2.1 shows a graphical comparison of normalized and non-normalized features. The left panel depicts the data points in the original feature space where spread
of x2 is much larger than spread of x1 . The dotted circle denotes a set of equidistant
points from the dark blue point in its center (using Euclidean distance in R2 ). One
can see that the circle encompasses the green dot but not the yellow dot; hence the
green dot is closer to the dark blue dot than the yellow dot. On the right panel we see
the normalized version of the same data. The visual impression confirms that both
features are now spread roughly equally. The solid circle depicts a set of equidistant
points from the dark blue dot in this feature space. In this space, the yellow dot is
closer to the dark blue dot than to the green dot. Feature normalization reverses the
distance ranking. Both panels also show the circles in the other feature space, those
are now transformed to ellipses.
Figure 2.2 gives a similar example using Boston housing data. Both the left and
the right panel depict the same data, neighborhood crime rate versus the average
number of rooms. On the left panel we use the original features while on the right
panel we use the normalized features. Unlike in Figure 2.1, we do not force equal
aspect ratio here and hence both panels look exactly the same, only the values on the
axes differ. The Euclidean distances differ too. For instance, the distance between
the green and the orange dot is 1.349, and between the green and the blue dot 5.666
in the original features (left panel). These distances are 1.92 and 0.666 in normalized
features (right panel). Hence the closest colored neighbor to the green dot is orange
in the original features and blue in the normalized features.
From the technical point of view, normalization is a good option if the features
are roughly independent, and their distribution is roughly symmetric and does not
have fat tails. This assures that the variance is stable and the mean is in the middle
of the observations.
More conceptually, normalization is justified in such cases where standard deviation is a relevant scale unit. In case of the house price example above, this is true if
people consider both house size and neighborhood security a relevant measure, and
standard deviation of the respective variables is a good proxy for how people value
these two factors. However, if the customers never care about crime, except for the
worst few neighborhoods, feature normalization may not be a good approach.
Another common reason to use feature normalization is to transform values that
are measured in arbitrary and hard-to-understand units into more easily understandable
(and comparable) ones. For instance, we may survey the support for a government
policy on a scale from 1 (very much against it) to 5 (very much in favor of
it). One unit on this scale is hard to interpret, while a sentence like “those whose
support is one standard deviation above the mean…” carries more meaning.

2.2. METRIC DISTANCE: A REVISIT

Figure 2.1: Non-normalized features (left) and normalized features (right). Dark blue, green
and yellow mark the same three data points on both panels. The dotted line depicts a circle
in the original feature space, the solid line a circle in the normalized feature space. Note
how the relative distances between the dark blue and green, and the dark blue and yellow
dots differ in the original and in the normalized features.
There is one more technical reason why it is advisable to normalize features: if the
design matrix contains columns of very different scale, then its condition number will
be high and hence its numeric properties may suffer. (The condition number of a
matrix is the ratio of its largest and smallest eigenvalues, κ ≡ |λ|max / |λ|min; see
Section 3.3.4 Condition number, page 121.)

Min-max scaling  An easy alternative to normalization is min-max scaling. It is
conceptually similar to normalization; just instead of dividing the centered values by
the standard deviation, we divide those by the range. For example, we can create a
new set of features that are all in equal range as follows:

    x̃i = (xi − xmin) / (xmax − xmin),   ∀ i ∈ 1 … N                   (2.2.2)

where xmin and xmax are the corresponding minimum and maximum values of x.
In this example all the features will be converted into the [0,1] interval, but we can
shift and scale these into another interval instead, say [−0.5, 0.5]. Min-max scaling
works well if the features are independent and have a uniform-like distribution with no
tails, so that all values end abruptly at the boundary. This ensures that the minima and
maxima are stable.
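As a sketch, min-max scaling per equation (2.2.2) is a one-liner in numpy (the feature values below are made up for illustration):

```python
import numpy as np

x = np.array([3.0, 5.0, 9.0, 11.0])   # hypothetical feature values

# Min-max scaling into [0, 1], equation (2.2.2)
x01 = (x - x.min()) / (x.max() - x.min())
print(x01)            # values become 0, 0.25, 0.75, 1

# Shift into [-0.5, 0.5] instead
print(x01 - 0.5)
```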
CHAPTER 2. GENERAL MACHINE LEARNING STRATEGIES
Figure 2.2: Boston housing data: neighborhood crime rate (crim) versus average number
of rooms (rm). Non-normalized (left) versus normalized features (right). While the images
look exactly the same, the Euclidean distance rankings are different: the nearest (colored)
neighbor to the green dot is the orange dot on the left panel, and the blue dot on the right
panel.
As min-max scaling is very similar to feature normalization, its advantages and
disadvantages are similar too.
Mahalanobis distance  Prerequisites: Section 3.2.2 Norm and Distance, page 105;
eigenvalues and eigenvalue decomposition, Section 3.3.4; feature normalization, Section 2.2.1; covariance matrix.
This is a generalization of feature normalization to the case where the features may
be correlated. Consider Figure 2.3. Here the two features x1 and x2 do not just have
different variances, they are also clearly correlated. If we use the feature normalization
we discussed above, we will change the picture somewhat, but we cannot address the
fact that the data points are clearly clustered around the diagonal line.
Mahalanobis transformation, in contrast, stretches and rotates the data in a way
that is aligned with the axes of the data. In the left panel of Figure 2.3, the solid
ellipse depicts the points equidistant from the central dark blue dot. The ellipse is
elongated along the long axis of the correlated data, and compressed along its short
axis. So Mahalanobis distance measures distance with respect to the extent of the
point cloud in each particular direction, not just along the coordinate axes as is the
case with feature normalization.
Mahalanobis distance can be computed and understood easily using matrix notation
and eigenvalue decomposition. Consider X to be an N × K data matrix, and xi• and
Figure 2.3: Original features (left) and Mahalanobis-transformed features (right). The same
three cases are marked with different colors on both images. The dotted line depicts a circle
in the original feature space, the solid line is a circle in the Mahalanobis feature space.
xj• to be two rows (observations) from that data. The Mahalanobis distance between
the observations xi• and xj• is defined as

    dM(xi•, xj•) = √( (xi• − xj•)ᵀ Σ⁻¹ (xi• − xj•) )                  (2.2.3)

where Σ is the covariance matrix of X.
Mahalanobis distance is equivalent to transforming the data matrix into

    X̃ = (X − 1N · x̄ᵀ) Σ^(−1/2)                                       (2.2.4)

where x̄ is the vector of column means and 1N · x̄ᵀ is accordingly a matrix of column
means.
Mahalanobis transformation is essentially the same as transforming the data to principal components (see Section 10.3), and thereafter measuring Euclidean distance in
such a rotated and stretched feature space. In case the features are uncorrelated,
Mahalanobis distance is equivalent to Euclidean distance in normalized data.
Mahalanobis distance is a good measure when variation in the data is a meaningful
scale unit: not just along the individual features, as in the case of normalization, but
also along the axes of variation in the data.
Example 2.2: Mahalanobis transformation of iris data
We demonstrate the Mahalanobis transformation using the iris data. Figure 2.4
shows a transformation analogous to that in Figure 2.3 for a 2-D slice of the iris data.
The different species, denoted by different colors, are reasonably well separated
in both panels. However, in the original coordinates (petal length and width,
left panel) the data points form an elongated cloud where the different species
cluster at different locations in this cloud. In transformed coordinates, the points
are stretched out along the minor axis, increasing the distance between dots of
similar species.
[Figure 2.4 appears here: left panel axes are petal length (cm) and petal width (cm); right panel axes are PC1 (std.dev) and PC2 (std.dev); colors mark setosa, versicolor and virginica.]
Figure 2.4: Iris data: petal width versus petal length in the original coordinates (left
panel) and in the corresponding Mahalanobis-transformed coordinates (right panel).
In transformed coordinates the species no longer form clusters as tight as
in the original coordinates, making categorization more difficult. This fact is
also visible from the example circles: the circle centered on a red observation
in Mahalanobis coordinates (solid line) includes two green points, while the circle
in the original coordinates (dotted line) includes only a single green dot while
capturing most of the red ones. For k-NN to work well, it should be possible to
draw such circles around most data points that contain many dots of the correct
color and few of other colors. This is easier in the original coordinates.
Note that one of the major reasons for data transformation, features measured in incomparable units, does not hold here, as both features are measured
in centimeters.
2.2.2 Cosine similarity and angular distance
Prerequisites: Vector Norm 3.2.2
Cosine similarity is a similarity measure that is not based on Lp distance. It is
widely used when assessing similarity in features that are not numeric, such as when
comparing texts.
Cosine similarity between vectors x and y is defined as

    c(x, y) = (xᵀ · y) / (‖x‖ · ‖y‖),   x ≠ 0, y ≠ 0,                 (2.2.5)

where ‖x‖ = √(xᵀ · x) is the Euclidean norm. It is easy to see that c(x, x) = 1.
Its name, cosine similarity, originates from the fact that the inner product of two vectors
equals the product of their norms, multiplied by the cosine of the angle between
them:

    xᵀ · y = ‖x‖ · ‖y‖ · cos φ,                                       (2.2.6)

where φ is the angle between vectors x and y. Hence cosine similarity equals just
the cosine of the angle between the vectors. Note that it is solely the angle that
matters: cosine similarity is agnostic to the length (norm) of the vectors (as long
as it is positive). It is a measure of similarity in the direction the vectors point, and
not a measure of the length of the vectors. For instance, when analyzing texts using
bag-of-words (see Section 8.2.3), this amounts to comparing word frequencies in the
texts. The number of words (text size) is irrelevant. Such an approach may be very
well suited when we try to understand the topic of texts that may be of very different
size.
Example 2.3: Cosine similarity in R2
The easiest way to understand cosine similarity is to analyze it in R2 plane. Look
at the vectors x1 and x2 on the figure below. x1 = (1, 1) and hence it points
45◦ upward. x2 = (2, − 1) and accordingly points 27◦ downward, and hence the
angle between the two vectors is 45 + 27 = 72◦ .
[Sketch: vectors x1, x2 and x3 in the (x, y) plane; x1 at 45° above the x-axis, x2 at 27° below it.]
Now calculate cosine similarity. First, the Euclidean norms are

    ‖x1‖ = √((1, 1)ᵀ · (1, 1)) = √(1 + 1) ≈ 1.414,
    ‖x2‖ = √((2, −1)ᵀ · (2, −1)) = √(4 + 1) ≈ 2.236.                  (2.2.7)
Now we can plug the numbers into the cosine similarity definition (2.2.5):

    c(x1, x2) = (x1ᵀ · x2) / (‖x1‖ · ‖x2‖) = (2 − 1) / (1.414 · 2.236) = 1 / 3.162 ≈ 0.316.    (2.2.8)
We can easily check that 0.316 is the cosine of 71.6°. Hence the computed cosine
similarity is equal to the cosine of the angle between x1 and x2. (The small difference
is due to rounding.)
TBD: Example where two vectors of different size are at same similarity with a
third one
Cosine similarity has a few very favorable properties; in particular, it is easy to
compute, involving just multiplications, additions, and one division. In the case of sparse
matrices, only non-zero components need to be considered. All this makes it very
well suited for analyzing high-dimensional data, such as words in texts.
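As a sketch, the definition (2.2.5) in numpy, reproducing the numbers from Example 2.3:

```python
import numpy as np

def cosine_similarity(x, y):
    # Equation (2.2.5): inner product divided by the product of the norms
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x1 = np.array([1.0, 1.0])
x2 = np.array([2.0, -1.0])
print(round(cosine_similarity(x1, x2), 3))   # 0.316, as in Example 2.3

# Length does not matter: rescaling x2 leaves the similarity unchanged.
print(np.isclose(cosine_similarity(x1, x2), cosine_similarity(x1, 10 * x2)))
```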
Unlike the distance measures above, cosine similarity is not a metric distance: a
larger value means not more distant but more similar data vectors. The maximum
similarity, between vectors pointing in the same direction, is 1, while the minimum
similarity, between vectors pointing in opposite directions, is -1. This is sufficient to
order vectors according to their similarity, and often this is all we need.
In case one needs a difference measure instead of a similarity measure, one can use
cosine distance dcos(x, y) = 1 − c(x, y). Cosine distance is zero for vectors
that point in the same direction; the maximal possible distance is 2, when two vectors
point in exactly opposite directions. Another option is to use angular distance, defined
as

    da(x, y) = cos⁻¹(c(x, y)) / π,                                    (2.2.9)

instead of cosine distance. However, there is little gain from selecting a more computationally demanding measure if our task is just to rank vectors according to similarity.
Exercise 2.1: Cosine, angular distance are not proper metric distances
Show that neither cosine nor angular distance are proper metric distances as
defined in Section 3.2.2.
CLT and Hypothesis testing
January 29, 2023
Instructions
This problem set revolves around the Central Limit Theorem, statistical hypotheses and
confidence intervals.
Note that there is also an extra credit task; that one is quite different.
• Please write clearly! Answer questions in a way that, if the code chunks are removed
from your document, the result is still readable!
• A recommendation: this problem set contains a number of repetitive tasks. Consider
writing a function that does most of the job, and then just feed different data to this
function, and comment on what you see.
Good luck!
1  Explore Central Limit Theorem (60pt)
In this section you will see how the Central Limit Theorem (CLT) works. CLT states
two things:
a) The means of a sample of random numbers tend to be normally distributed as the sample
gets large.
b) The variance of the mean tends to be (1/S) · Var X, where S is the sample size and X is the random
variable we are analyzing.
(This is actually a property of expectation and independence, not really CLT. But CLT
is closely related to this result.)
CLT, and how variance and mean value change when sample size increases, plays a very
important role in computing confidence intervals later.
The task is structured in a way that you may want to create a function that takes in
sample size S and outputs all needed results, including the histogram. There will be quite a
bit of repetitive coding otherwise.
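A minimal sketch of such a helper function (the function name, seed, and defaults below are illustrative, not part of the assignment):

```python
import numpy as np

def simulate_means(S, R=1000, seed=1):
    """Draw R samples of size S from the fair +/-1 coin RV; return the R sample means."""
    rng = np.random.default_rng(seed)
    draws = rng.integers(0, 2, size=(R, S)) * 2 - 1   # values in {-1, 1}
    return draws.mean(axis=1)

means = simulate_means(S=25)
# By CLT, the variance of the mean should be about Var(X) / S = 1/25 = 0.04
print(means.mean(), means.var())
```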
We start with a distribution that does not look at all normal. We create a RV

    X = { −1   with probability 0.5
        {  1   with probability 0.5.
(You can imagine we flip a fair coin and label heads as 1 and tails as -1.) One way to sample
from such RV is something like this
import numpy as np
np.random.randint(0, 2, size=10)*2 - 1
## array([-1, -1,  1,  1, -1, -1, -1,  1,  1,  1])
Detailed tasks:
1. (5pt) What is Random Variable (RV)? What makes X a RV?
Hint: check out 1.3.4 Random Variable.
2. (6pt) Calculate the expected value and variance of this random variable. Explain what
is the difference between expected value and the sample mean.
Note: these are theoretical values and not related to any samples. If you use functions
like mean or var here then you have misunderstood the concepts!
Hint: read 1.3.4 (Expected Value and Variance), and OpenIntro Statistics 3.4
(Random variables), in particular 3.4.2 (Variability). I recommend using the shortcut
formula Var X = E[X²] − (E X)².
3. (1pt) Choose your number of repetitions R. 1000 is a good number but you can also
take 10,000 or 100,000 to get smoother histograms.
Note: the number of repetitions R is not the same as the sample size S here. You will create
samples of size S for R times below. For instance, you will create R = 1000 times a sample
of size S = 5. Please understand the difference, it is a frequent source of confusion!
4. (5pt) Create a vector of R random realizations of X. Make a histogram of those. Comment on the shape of the histogram.
Note: in this case we have R = 1000 repetitions and samples are of size S = 1 as we look
at individual realizations.
Hint: it takes some tweaking to get nice histograms of discrete distributions. The
simplest way is just to make many bars (most of which will be 0) by adding argument
bins=100 to plt.hist.
5. (3pt) Compute and report mean and variance of the sample you created (just use
np.mean and np.var). NB! Here we talk about sample mean and sample variance.
Compare these numbers with the theoretical values computed in question 2 above.
Hint: they should be fairly close.
6. (6pt) Now create R pairs of random realizations of X (i.e. sample size S = 2). For each pair,
compute its mean. You should have R mean values. Make the histogram. What does
it look like?
7. (6pt) Compute and report the mean of the R pair means, and the variance of the means. NB!
We talk about sample mean and sample variance again, where the sample is your sample of R
pair means.
8. (5pt) Compute the expected value and variance of the pair means, i.e. the theoretical
concepts. This mirrors what you did in 2.
Compare the theoretical values with the sample values above. Are those fairly similar?
Note that according to CLT, the variance of a pair mean should be just 1/2 of what you
got above as for pairs S = 2.
9. (4pt) Now instead of pairs of random numbers, repeat this with 5-tuples of random
numbers (i.e. S = 5 random numbers per one repetition, and still the same R = 1000 or
whatever you chose repetitions in total). Compare the theoretical and sample version of
mean and variance of 5-tuples. Are they similar? Do you spot any noticeable differences in
the histogram compared to your previous histogram?
10. (3pt) Repeat with 25-tuples…
(Also compute the expectation and theoretical variance, and compare those with sample
mean, sample variance)
11. (3pt) … and with 1000-tuples. Do not forget to compare with theoretical results.
12. (2pt) Comment on the tuple size, and how the shape of the histogram changes when
the tuple size increases.
13. (7pt) Explain why the histograms resemble the normal distribution as S grows.
In particular, explain what happens when we move from single values S = 1 to pairs
S = 2. Why did two equal peaks turn into a “山”-shaped histogram?
14. (4pt) Explain what is the difference between R and S. How does changing these values
affect the histograms?
2  Is poverty in Azraq refugee camp falling? (60pt)
A lot of refugees from the Syrian civil war are housed in camps in Jordan. A report https://reliefweb.int/attachments/32604cb5-1e86-49eb-8497-8f43a4721135/WFP-0000142211.pdf shows results of a survey of living conditions in the Zaatari and Azraq camps.
Page 5 of the report discusses poverty in the refugee camps, in particular “abject poverty”,
inability to afford the “survival minimum expenditure basket” (SMEB) of food and basic
hygiene. The survey finds that while the abject poverty has stayed fairly similar in Zaatari
camp, it has fallen substantially (from 66% to 51%) in Azraq camp.
But is it actually improving? Maybe the survey wave of 2022-Q2 just got “more lucky”
by randomly sampling fewer poor households? Or the other way around: the 2022-Q1 survey was just
“unlucky”? Let’s find out. We are going to compute the confidence interval for the 2022-Q2
survey, and check if the 2022-Q1 result falls into this confidence interval.
2.1  Background (2pt)
1. (1pt) What was the abject poverty rate in Azraq camp in Q1 and Q2 2022 (when including
all assistance)? Let’s call these values p1 and p2.
2. (1pt) How many households were surveyed in the camp? Call this sample size S.
2.2  Simulations (30pt)
Now let’s create a large number of such random surveys, and see how much variability
we see in the results. Note though that we create these assuming they were conducted using
independent uniform sampling; if the actual surveys were done differently, then our analysis
below may be incorrect.
Let’s describe all households in the Azraq camp as a RV X that can be either “1” if the
household is poor (abject poor) or “0” if it is not poor. This is a Bernoulli RV

    X = { 1   with probability p
        { 0   with probability 1 − p.                                 (1)
You can create a sample from this RV along the lines
S = 8
p = 0.6
np.random.binomial(1, p, size=S)
## array([0, 0, 1, 1, 1, 1, 0, 1])
1. (3pt) Create a random sample using the correct values of S and p2 you found in 2.1
above.
2. (2pt) Compute the sample mean and compare it with p1 and p2 above. How close is it
to these figures?
3. (1pt) Pick your number of replications R (something like 1000 or 10,000 are good numbers).
4. (6pt) Repeat the points 1 and 2 for R times: create the sample, compute the average,
but also store the average in an array. You should have R averages now.
5. (6pt) What is the average of the averages? Which probability from 1 does it resemble?
Why?
6. (5pt) Plot a histogram of the averages. Which distribution does it resemble? Just by
eyeballing the plot, what would you say are the largest and smallest values that are
“reasonably” common?
7. (6pt) Finally, compute the 2.5th and 97.5th percentiles, i.e. the 95% confidence interval.
Does the Q1 poverty value fall into this interval?
2.3  Theoretical CI (28pt)
But simulations are often hard to do. Fortunately, we can compute the confidence intervals by theoretical considerations.
1. (4pt) Compute variance of your sample of X. You can use the Bernoulli variance formula
Var X = p(1 − p). You can also use the sample you created in 2.2.1, or create a new sample,
and find the sample variance.
2. (8pt) But this was the variance of X (or the sample variance, if that was what you computed).
What we need is the variance of the sample mean. What does CLT tell us about the relationship
between the sample variance and the variance of the sample mean?
3. (7pt) Compute the standard deviation of sample mean using CLT.
4. (3pt) Compare the standard deviation you got here with the standard deviation of the
sample of averages you computed in 2.2.4.
5. (6pt) Use this standard deviation to compute the confidence interval. Compare it to
what you got in 2.2.7. Does p1 fall inside or outside of this interval?
You may want to check 1.5.2 “Doing Statistical Inference”.
Hint: they should be fairly similar, and your conclusion regarding p1 should be the
same.
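The steps above can be sketched as follows (p and S below are placeholder values, not the ones from the report; substitute your own):

```python
import numpy as np

p, S = 0.51, 700                      # placeholders: use your values from 2.1
var_X = p * (1 - p)                   # Bernoulli variance
se = np.sqrt(var_X / S)               # CLT: standard deviation of the sample mean
ci = (p - 1.96 * se, p + 1.96 * se)   # 95% confidence interval
print(ci)
```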
Finally, tell us how many hours you spent on this PS.
2.4  Challenge (extra credit, 1 EC point)
How long do you need to simulate the Q2 poverty in Azraq camp in order to get an
outlier as big as the Q1 poverty value? If you did Question 2 correctly, then you
noticed that the simulated Q2 poverty values are all much smaller than what was reported in
Q1. But if you simulate long enough, then you can get an arbitrarily large value. But how long
do you need to run the simulations? Will a few minutes suffice? Or do you need to keep the
computer running for years, or even millennia?
1. First, time your simulations. Run Q2.2.4 many times (do not store the values this time as
we may run out of memory) and measure how many seconds it takes. Your computer
should run at least five seconds before proceeding (this will help with accuracy). Based
on that figure, calculate how long it would take to run 10¹² or however many simulations
you need.
Hint: check out the %timeit and %time magic commands; alternatively you may check out the
“time” module and “time.time()” method.
2. Next, we have to find the t-value. Let’s do it in a slightly incorrect way by assuming the Q1
value is 0.66 without any errors. You computed the standard deviation of the mean in
Q2.2.. What is the t-value for this difference?
3. Next, what is the probability of receiving such t-values? You may need to calculate your
t-values yourself; they will not be in any tables. Assume we are dealing with the normal
distribution. (Not quite, but we are close.) You have to compute the probability of getting
a value larger than the t-value you just computed. This can be done along the lines:
from scipy import stats
norm = stats.norm()
norm.cdf(-1.96) # close to 0.025
## 0.024997895148220435
Except that you replace 1.96 with your actual t-value.
Explain: why does the example use norm.cdf(-1.96) instead of norm.cdf(1.96)?
4. How many iterations do you need? Let’s do a shortcut—if probability p is small, you
need roughly 3/p iterations. So if p = 0.001, you need 3000 iterations.
5. Based on the timings you did above, how many years do you have to run the simulations? Could you get it done by the assignment deadline? Or if one had started the
computer the year your grandfather was born, would it be done by now? Or if the first Seattle
inhabitants had started it when they moved here following the melting ice, 10,000 or so
years ago?