The data contained in the spreadsheet “Corner Store.xls” provide information relating to the gross monthly sales of a hypothetical corner store chain. Each observation in the data represents a corner store at a different location.
You have been approached by the owners of this corner store chain to conduct a statistical analysis. They are looking to open new corner stores in several areas where they are currently not operating. Hence, they are interested in the determinants of gross sales, and in predicting the characteristics of the areas they should be considering for the establishment of new corner stores.
Task Write a report for the corner store chain owners. The report should provide a brief summary of the data, justification of the variable(s) that you consider to be relevant, and analysis of your results (including your interpretation of the results and any diagnostic tests you have conducted). You should also highlight any limitations in your analysis and suggestions for improving this research that you feel to be appropriate.
FORECASTING AND BUSINESS ANALYSIS
Assignment Extract
Background
The data contained in the spreadsheet “Corner Store.xls” (located in the same folder as
this document) provide information relating to the gross monthly sales of a hypothetical
corner store chain. Each observation in the data represents a corner store at a different
location. The data incorporate observations from different cities and states (but all in
Australia). For each corner store (i = 1, 2, …, 50), the variables you have been given are:
1. Gross monthly sales ($)
2. The number of competitors within 10km
3. The population (in 1000’s) within 10km
4. The average income of the population within 10km ($)
5. The average number of cars owned by households within 10km
6. The median age of dwellings within 10km
You have been approached by the owners of this corner store chain to conduct a
statistical analysis. They are looking to open new corner stores in several areas where
they are currently not operating. Hence, they are interested in the determinants of gross
sales, and in predicting the characteristics of the areas they should be considering for the
establishment of new corner stores.
Task
Write a report for the corner store chain owners. The report should provide a brief
summary of the data, justification of the variable(s) that you consider to be relevant, and
analysis of your results (including your interpretation of the results and any diagnostic
tests you have conducted). You should also highlight any limitations in your analysis and
suggestions for improving this research that you feel to be appropriate.
The analysis needs to be carried out in Excel.
If you need further help, the course website is at http://learn.unisa.edu.au/course/view.php?id=130685
Forecasting and Business Analysis
Study Period 5, 2013
Assignment 1 – Report
Due Date: 13th September, 5pm
WORD LIMIT FOR REPORT: 1000 words including a 100 word executive
summary, but excluding references and appendix.
– An electronic copy of your assignment is to be submitted online using
Learnonline, while a hard copy of the identical version of your assignment is to
be dropped into the pigeon hole marked as “Assignment Box” located outside
the office of the School of Commerce (WL 2-57). For external students, the hard
copy is not required.
– There is a penalty of 10% of the total mark for each day of late submission. The
only exception to this is if your tutor approves an extension prior to the due date,
for legitimate reasons based on documented evidence. So you should collect your
documented evidence before you contact your tutor for an extension.
– Plagiarism is a specific form of academic misconduct. If found, both parties will
be penalised regardless of who copies from whom. The electronic version of your
submitted assignment will be used for a plagiarism check.
– For guidelines on report writing for an empirical project, see Appendix A of
Koop.
The following are some guidelines that we will use to mark your assignments. This may
help you in preparing your assignment.
Main article
Superb: include all main results; accompanying well-presented and meaningful graphs
and tables; accurate, consistent and logical arguments; well-written.
Good, solid report: it is not superb in all aspects above, but an overall impression of the
report suggests that it is a very good and solid report.
Acceptable, but with several weaknesses: it clearly contains some mistakes, but overall it
covers most important findings.
Quite inadequate: the report contains many mistakes. An overall impression suggests that
assignments were done with little effort and students do not understand most of the
concepts introduced in the course.
Appendix
Include all necessary Excel outputs. Your appendix may include some technical
materials. The appendix should be well sorted out and with brief descriptions, so that they
are easy to follow.
Please note: marks are deducted for excessive or irrelevant output, in particular, printouts
of the worksheet.
Data
Gross monthly sales
Number of competitors within 10km
Population within 10km (1000’s)
Average income of residents within 10km
Average number of cars owned
Median age of dwellings in area
12192
6
1
4
19
5
0.1
2.
3
13061
1
5.0
15000
0.2
1.6
19153
12.77
54785
1.3
1
1.1
20714
13.89
10357
1
4.0
21222
17.70
25645
0.6
3.5
22319
14.88
21545
20.0
32215
2
1.4
16859
2.4
33629
24.51
45215
4.3
33977
26.56
25845
1.2
8.8
34163
23.56
25985
36647
24.43
21548
40937
39.89
21542
0.5
45302
30.20
32546
0.9
49298
35.26
24649
9.2
49583
33.06
24792
0.3
14.5
57929
45.69
32545
59501
50.00
29751
2.6
62747
40.59
42151
1.5
63073
4
2.0
52545
6.9
63775
4
2.5
2.2
70985
58.75
45125
0.8
71616
12.59
35808
10.0
75019
50.01
35485
1.7
76374
100.00
32156
0.7
85372
56.91
44151
88057
6
1.8
41254
6.0
94150
62.80
31254
3.2
103683
71.26
58457
3.7
108781
77.00
58000
113330
75.53
52012
12.8
113987
77.54
65898
117454
70.11
61328
2.1
123245
89.00
61623
5.8
125584
83.72
62792
127114
88.00
1.0
132717
88.48
66358
5.3
134340
45.00
67125
1.9
4.1
139739
92.55
45875
141946
120.00
64885
3.1
144593
48.75
72296
145712
101.44
75551
156329
95.00
74584
160987
107.32
61454
6.6
161436
98.58
78545
14.0
171521
114.35
77854
201344
138.00
85784
9.3
205397
135.75
102565
207523
148.00
98475
2.3
211938
141.29
88545
7.9
233522
202.15
65854
Sheet3
Introduction
Forecasting and Business Analysis
Copyright UPmarket Software Services. This file must not be used without permission.
Correlation models tell you how strong the linear relationship is between two variables. The statistic used is the coefficient of correlation, denoted r (the sample counterpart of the population coefficient ρ, rho). It is used mainly for understanding, and takes values from –1 to +1 to measure the degree of association.
Correlation Analysis
The chart on the left shows a simple relationship between X and Y. Correlation analysis can help show the strength of the relationship and also if the relationship is positive or negative.
Using Excel for Correlation Analysis
Excel can be used to calculate an individual correlation or a correlation matrix. The matrix is a table of correlations between a number of variables. This worksheet shows how to calculate a single correlation coefficient using the CORREL function in Excel and also how to create a correlation matrix using the Analysis ToolPak.
Correlation Example
Value $
Land Area
Rooms
Building Area
$1
5
6
124
$160,000
465
134
$163,500
7
119
$172,000
696
120
$
175
715
133
$212,000
634
8
234
$218,000
918
164
$225,000
695
204
$250,000
922
181
$265,000
801
158
$275,000
348
$310,000
220
 | Value $ | Land Area | Rooms | Building Area
Value $ | 1 | | |
Land Area | 0.00045 | 1 | |
Rooms | 0.60722 | 0.19927 | 1 |
Building Area | 0.70607 | -0.04193 | 0.86599 | 1
CORREL result (Rooms and Building Area): 0.8659925633
You can automatically calculate a correlation matrix in Excel using the Correlation function in the Analysis ToolPak. Open the ToolPak using Tools – Data Analysis. When you open the ToolPak you will see a long list of statistical methods that you can access. Go down the list and click Correlation. A screen will appear a little like that below. (It may vary depending upon the version of Excel you are using.) The Input Range is the data array of all input variables. If this includes a label in the first row of your data, you should check the Labels in First Row box. In this case the data is organised by columns. You need to select an output range where the output will appear. When it is all entered, click OK and the correlation matrix will be calculated.
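For readers who want to check the ToolPak output outside Excel, the following is a minimal Python sketch of the same idea: a Pearson correlation for one pair of columns, then a matrix of all pairs. The function names and the sample numbers are hypothetical and chosen only for illustration; they are not the worksheet's data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlation of every column with every other column (dict of dicts)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Illustrative values only (an assumption, not the worksheet's data).
data = {
    "rooms": [5, 6, 7, 6, 8],
    "building_area": [120, 134, 160, 130, 200],
    "land_area": [700, 465, 650, 696, 500],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

The diagonal is always 1 (each variable against itself) and the matrix is symmetric, which is why Excel prints only the lower triangle.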
The Excel Correl Function
The Correl function is CORREL(array1,array2). This returns a single correlation between array1 and array2. Cell B21 above contains the function for calculating the correlation between Rooms and Building Area. Note that labels are not used – this is a mathematical function only. See HELP for details.
Introduction
This is a simple exercise in which two very simple naïve models are estimated and then a series of evaluation techniques are applied. The purpose of this spreadsheet is to let you see the various formulas that can be used for these applications.
– The naive forecast tab shows the two forecasting methods
– The evaluation tab shows the basic evaluation methods for these two naive forecasts
Two Naive Forecasts
Naïve Forecast 1
Actual
Forecast1
7.60
9.70
9.6
9.70
7.50
9.60
7.20
7.00
6.20
5.50
5.30
5.50 5.30
5.50
Naïve Forecast 2
Actual
Forecast2
9.70 9.60
10.8
7.50 9.6
7.20
6.5
7.00
7.1
6.20
6.9
5.50
5.8
5.30
5.2
5.50 5.2
5.6
Forecast1
Forecast2
Calculating the Errors
Abs Error | % Error | Abs % Error | Adj | MAPE | Squared Error
– | 0.10 | – | 0.01 | 0.00
– | 2.10 | – | 0.28 | 0.03 | 4.41
– | 0.30 | – | 0.04 | 0.09
– | 0.20 | -0.03
– | 0.80 | – | 0.13 | 0.02 | 0.64
– | 0.70 | -0.13 | 0.49
-0.20 | -0.04
– | 1.15 | – | 0.12 | 1.32
– | 2.05 | – | 0.27 | 4.20
0.75 | 0.56
– | 0.05 | -0.01
-0.70 | – | 0.11
-0.30 | -0.05
0.15
The simplest naïve forecast uses the last period's value as the forecast for the next period. In other words, “whatever happened last period will happen next period”.
This can be expressed as
$F_t = A_{t-1}$
where $F_t$ is the forecast at time t and $A_{t-1}$ is the actual value in the previous period.
The second naïve model uses the last value and the DIRECTION of the last change in values. The last value is adjusted up or down depending on whether the last change was positive or negative, and this change is weighted. This can be expressed as
$F_t = A_{t-1} + P(A_{t-1} - A_{t-2})$
where $A_{t-1} - A_{t-2}$ is the change and $P$ is the weight. In this example a 50% weight (P = 0.5) is used.
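The two naïve models can be sketched in a few lines of Python. The function names are hypothetical, and the `actual` series is an assumption: it is pieced together from the values scattered through the worksheet dump, in an order that matches the reported forecasts and errors.

```python
def naive_forecast_1(actual):
    """F_t = A_{t-1}. The first period has no forecast, so it is None."""
    if not actual:
        return []
    return [None] + list(actual[:-1])

def naive_forecast_2(actual, p=0.5):
    """F_t = A_{t-1} + P * (A_{t-1} - A_{t-2}). Needs two earlier actual values."""
    forecasts = [None] * min(2, len(actual))
    for t in range(2, len(actual)):
        change = actual[t - 1] - actual[t - 2]
        forecasts.append(actual[t - 1] + p * change)
    return forecasts

# Assumed series, reconstructed from the worksheet values.
actual = [7.6, 9.7, 9.6, 7.5, 7.2, 7.0, 6.2, 5.5, 5.3, 5.5]
f1 = naive_forecast_1(actual)
f2 = naive_forecast_2(actual, p=0.5)
for a, x, y in zip(actual, f1, f2):
    print(a, x, None if y is None else round(y, 2))
```

The second model cannot produce a forecast until the third period, because it needs both the last value and the change before it.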
CALCULATING THE ERRORS
The “error” for each data point is simply the difference between the observed value and the forecast. In other words, how wrong were you? The error is either positive or negative. In many cases the positive and negative errors roughly cancel each other out, so the Mean Error is close to zero even when individual errors are large. For this reason the “absolute error” is often more important: it measures the size of the error but ignores the sign, so all values are positive. Another common measure is the percentage error, which is simply the error expressed relative to the actual value. This helps when comparing errors for time series with different magnitudes. Maybe a 10% error is acceptable, and this has meaning regardless of the magnitude of the actual values.
The squared error is simply the error squared. This also removes the direction of the error (all values are positive) and highlights large errors, because squaring penalises them disproportionately: an error twice as large contributes four times as much.
Forecast Errors (forecast 1)
(At-At-1)^2 | ||||
-0.10 | ||||
-2.10 | -0.28 | |||
-0.80 | ||||
ME | -0.53 | |||
MAE | 0.58 | |||
MPE | -0.08 | |||
MSE | 0.72 | |||
RMSE | 0.85 | |||
Theil’s U | 1.00 |
The final forecast evaluation is based on the summary statistics of the errors. Most of these involve taking the mean of the various errors. The mean error is thus the mean of the errors, which will often be very small because of the positive and negative values. The Mean Absolute Percentage Error is a popular measure as it measures the average error (regardless of direction) in percentage terms. The root mean squared error is also popular as it highlights forecasts when there are a number of very large errors.
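The summary statistics described above can be sketched as follows. This is an illustration under stated assumptions: the `actual` series is reconstructed from the worksheet values, both forecasts are evaluated from period 3 onward, and Theil's U is taken as the ratio of a forecast's RMSE to the RMSE of the no-change forecast, which is one common form and is consistent with the value of 1.00 reported for forecast 1.

```python
from math import sqrt

def evaluate(actual, forecast):
    """Summary error statistics over the periods that have a forecast."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if f is not None]
    errors = [a - f for a, f in pairs]            # error = actual - forecast
    pct_errors = [(a - f) / a for a, f in pairs]  # error relative to the actual
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
        "MSE": mse,
        "RMSE": sqrt(mse),
    }

# Assumed series, reconstructed from the worksheet values.
actual = [7.6, 9.7, 9.6, 7.5, 7.2, 7.0, 6.2, 5.5, 5.3, 5.5]
forecast1 = [None, None] + actual[1:-1]
forecast2 = [None, None] + [
    actual[t - 1] + 0.5 * (actual[t - 1] - actual[t - 2])
    for t in range(2, len(actual))
]

stats1 = evaluate(actual, forecast1)
stats2 = evaluate(actual, forecast2)
# Theil's U taken here as RMSE relative to the no-change forecast (assumption).
theil_u2 = stats2["RMSE"] / stats1["RMSE"]
for name in stats1:
    print(f"{name:5s} {stats1[name]:8.4f} {stats2[name]:8.4f}")
print(f"U     {1.0:8.4f} {theil_u2:8.4f}")
```

Under these assumptions the printed figures can be compared line by line with the worksheet's summary values for the two forecasts.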
Forecast Evaluation
Forecast Errors (forecast 2)
10.75 | -1.15 | -0.12 |
9.55 | -2.05 | -0.27 |
6.45 | ||
7.05 | ||
6.90 | -0.11 | |
5.80 | ||
5.15 | ||
5.20 | ||
5.60 | ||
-0.38 | ||
0.68 | ||
0.92 | ||
1.09 |
Summary
 | Forecast1 | Forecast2
ME | -0.5250 | -0.3813
MAE | 0.5750 | 0.6813
MPE | -0.0773 | -0.0476
MAPE | 0.0864 | 0.0943
MSE | 0.7200 | 0.8478
RMSE | 0.8485 | 0.9208
Theil’s U | 1.0000 | 1.0851
Introduction
Follow the Example
Go to the sheet “Your Try” and use the Tools – Data Analysis menu to exponentially smooth the data in the “actual” column, using a smoothing constant of .5. The input dialog box on the left shows the inputs needed. You should forecast a value for the first month in 1996.
Remember that the Excel exponential smoothing function applies the damping factor to the previous forecast value, not the actual value. This means that the damping factor is equal to 1 minus alpha. For example, if you want an alpha value of .3 then you need a damping factor of 1 – .3 = .7.
You will find a solution and a fully worked example on other worksheet tabs.
The Excel Exponential Smoothing Function
You can automatically exponentially smooth data in Excel using the Exponential Smoothing function from the Data Analysis menu. First check that the Analysis ToolPak is turned on. Click on the Tools menu. The bottom item should be Data Analysis. If it isn’t, you will need to turn the ToolPak on: go to the Tools menu, click on Add-Ins, then check the box next to Analysis ToolPak. After a short period you should be able to access Data Analysis under the Tools menu.
When you click on Data Analysis you will see a long list of statistical methods that you can access. Go down the list and click Exponential Smoothing. A screen will appear a little like that below. (It may vary depending upon the version of Excel you are using.) The screen below shows the input for smoothing with a damping factor of .5. This is related to the alpha figure referred to in your text: it is 1 minus that alpha. You need to enter the cell range for the input data and also the destination of the output. If you include a label in the first row of your data, you should check the Labels in First Row box. When it is all entered, click OK and the smoothed data is calculated.
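The relationship between alpha and the damping factor can be sketched in Python. This is an illustration of the recursion only, not a reproduction of Excel's output layout; the function names are hypothetical, and seeding the first forecast with the first actual value is an assumption that matches the worked figures in this workbook. The four data points are the first four months of the "actual" column.

```python
def exponential_smoothing(actual, alpha):
    """One-step forecasts F_1 .. F_{n+1}, where
    F_{t+1} = alpha * A_t + (1 - alpha) * F_t and F_1 is seeded with A_1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    forecasts = [actual[0]]
    for a in actual:
        forecasts.append(alpha * a + (1 - alpha) * forecasts[-1])
    return forecasts

def smoothing_from_damping(actual, damping):
    """Same recursion, parameterised by the damping factor (1 - alpha)."""
    return exponential_smoothing(actual, 1.0 - damping)

actual = [94.3, 93.2, 91.5, 92.6]  # 1994M1 to 1994M4
with_alpha = exponential_smoothing(actual, alpha=0.6)
with_damping = smoothing_from_damping(actual, damping=0.7)  # alpha = 0.3
print([round(f, 3) for f in with_alpha])
print([round(f, 3) for f in with_damping])
```

The last element of each list is the forecast for the period after the data ends, which is the value the exercise asks for once the full series is supplied.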
Your Try
Period | Actual
1994M1 | 94.300
1994M2 | 93.200
1994M3 | 91.500
1994M4 | 92.600
1994M5 | 92.800
1994M6 | 91.200
1994M7 | 89.000
1994M8 | 91.700
1994M9 | 91.500
1994M10 | 92.700
1994M11 | 91.600
1994M12 | 95.100
1995M1 | 97.600
1995M2 | 95.100
1995M3 | 90.300
1995M4 | 92.500
1995M5 | 89.800
1995M6 | 92.700
1995M7 | 94.400
1995M8 | 96.200
1995M9 | 88.900
1995M10 | 90.200
1995M11 | 88.200
1995M12 | 91.000
1996M1 |
Created by Peter Rossini © 2000 UPmarket Software Services |
Peter Rossini:
Forecast this value
Exponential Smoothing Solution
Error | Pct Error | Sq Error | |||
0.000 | 0.000% | ||||
-1.100 | 1.180% | 1.210 | |||
93.970 | -2.470 | 2.699% | 6.101 | ||
93.229 | -0.629 | 0.679% | 0.396 | |
93.040 | -0.240 | 0.259% | 0.058 | ||
92.968 | -1.768 | 1.939% | 3.127 | ||
92.438 | -3.438 | 3.863% | 11.818 | ||
91.406 | 0.294 | 0.320% | 0.086 | ||
91.494 | 0.006 | 0.006% | |||
91.496 | 1.204 | 1.299% | 1.449 | ||
91.857 | -0.257 | 0.281% | 0.066 | ||
91.780 | 3.320 | 3.491% | 11.022 | ||
92.776 | 4.824 | 4.943% | 23.270 | ||
94.223 | 0.877 | 0.922% | 0.769 | ||
90.300 | 94.486 | -4.186 | 4.636% | 17.525 | |
93.230 | -0.730 | 0.790% | 0.533 | ||
93.011 | -3.211 | 3.576% | 10.312 | ||
92.048 | 0.652 | 0.703% | 0.425 | ||
92.244 | 2.156 | 2.284% | 4.650 | ||
92.890 | 3.310 | 3.440% | 10.953 | ||
93.883 | -4.983 | 5.606% | 24.834 | ||
92.388 | -2.188 | 2.426% | 4.789 | ||
91.732 | -3.532 | 4.004% | 12.474 | ||
90.672 | 0.328 | 0.360% | 0.107 | ||
MISSING | 90.771 | ||||
Smoothing Constant | (alpha) | ||||
RMS Error | 2.519 |
Exponential Smoothing Solution
Actual
Forecast
Year and Month
Index
Simple Exponential Smoothing Forecast of the Index of Consumer Sentiment
Fully Worked Example
Example of Exponential Smoothing | Calculation | ||
Last Forecast or Actual Value | |||
(0.6)(94.3) + ( 1 – 0.6)(94.3) | |||
(0.6)(93.2) + ( 0.4)(94.3) | 93.640 | ||
(0.6)(91.5) + ( 0.4)(93.64) | 92.356 | ||
(0.6)(92.6) + ( 0.4)(92.356) | 92.502 | ||
Alpha | Change this value | ||
Example of Exponential Smoothing Calculations – all periods | |||
-2.140 | 2.339% | 4.580 | |
0.244 | 0.263% | 0.060 | |
0.298 | 0.321% | 0.089 | |
92.681 | -1.481 | 1.624% | 2.193 |
91.792 | -2.792 | 3.138% | 7.797 |
90.117 | 1.583 | 1.726% | 2.506 |
91.067 | 0.433 | 0.473% | 0.188 |
91.327 | 1.373 | 1.481% | 1.886 |
92.151 | -0.551 | 0.601% | 0.303 |
91.820 | 3.280 | 3.449% | 10.757 |
93.788 | 3.812 | 3.906% | 14.531 |
96.075 | -0.975 | 1.025% | 0.951 |
95.490 | -5.190 | 5.748% | 26.937 |
92.376 | 0.124 | 0.134% | 0.015 |
92.450 | -2.650 | 2.951% | 7.025 |
90.860 | 1.840 | 1.985% | 3.385 |
91.964 | 2.436 | 2.580% | 5.934 |
93.426 | 2.774 | 2.884% | 7.697 |
95.090 | -6.190 | 6.963% | 38.319 |
91.376 | -1.176 | 1.304% | 1.383 |
90.670 | 2.801% | 6.103 | |
89.188 | 1.812 | 1.991% | 3.283 |
90.275 | |||
2.575 |
Finding the Optimal Value of Alpha using SOLVER
SORITEC and similar forecasting software will automatically find the optimal value for Alpha by finding the value which minimises the RMS Error. This can also be done using Solver in Excel. The Solver function enables the user to maximise or minimise a value (or function) by changing other cells until the optimal solution is found. This is the process used in optimising methods such as linear, non-linear, integer or dynamic programming.
For this example it is quite simple: minimise the value of the RMS Error by changing the value of Alpha. To do this click Tools, then Solver. The dialog box to your left should appear. In this case we set Solver to minimise the value in cell F28, which is the RMS Error, by changing the value of Alpha (cell D27). Click Solve and you will find the same solution that you would get through using SORITEC.
NOTE: In this spreadsheet the value for Alpha has been named rather than using a cell reference. To find out how to use names, consult the Excel help system.
Change the value of alpha and see what happens. Try to find the value for alpha that minimises the RMS error. Then go to the next sheet to find out how to do this easily.
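The search for the alpha that minimises the RMS error can be sketched with a simple grid search in Python. This is a stand-in for Solver, not the algorithm Solver uses, and the function names are hypothetical. The sketch assumes the first forecast is seeded with the first actual value and leaves that seed period out of the RMS error; the data are the first eight months of the "actual" column.

```python
from math import sqrt

def smoothing_rmse(actual, alpha):
    """RMS of the one-step errors for simple exponential smoothing."""
    forecast = actual[0]  # forecast for period 2, seeded with the first actual
    squared = []
    for a in actual[1:]:
        squared.append((a - forecast) ** 2)
        forecast = alpha * a + (1 - alpha) * forecast  # forecast for next period
    return sqrt(sum(squared) / len(squared))

def best_alpha(actual, steps=100):
    """Try alpha = 0, 1/steps, ..., 1 and return the one with the lowest RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda alpha: smoothing_rmse(actual, alpha))

actual = [94.3, 93.2, 91.5, 92.6, 92.8, 91.2, 89.0, 91.7]  # 1994M1 to 1994M8
alpha_star = best_alpha(actual)
print(alpha_star, round(smoothing_rmse(actual, alpha_star), 3))
```

A finer grid (a larger `steps`) narrows the answer further; Solver instead adjusts the cell continuously, so its result may differ slightly from a grid value.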
Fully Worked Example
Actual
Forecast
Peter Rossini:
Change this value to see the effect on the calculations, the forecast and the errors
Introduction
The Excel Function in Analysis Tools
You can automatically calculate moving averages in Excel using the Moving Average function in the Analysis ToolPak. First check that the Analysis ToolPak is turned on. Click on the Tools menu. The bottom item should be Data Analysis. If it isn’t then you will need to turn the ToolPak on. Go to the Tools Menu, click on Add-Ins then check the box next to Analysis ToolPak. After a short period you should now be able to access Data Analysis under the Tools menu.
Introduction
A Practice Session
You are currently looking at the Introduction tab, so it is highlighted. If you click on another tab, that sheet will open. To follow this session, simply work through the tabs in order.
Naïve Forecast 2
Ft is the forecast at time t.
Relative Cell References
This is the first period when we can make a forecast, since we need two preceding actual values.
Proportion (P) 0.5
In step 1 the proportion was included as a fixed value (.5). In this step we will add the proportion as a cell reference. The proportion is input into cell D2. The formula can be modified to point to this cell.
If you click on cell D2 you will see that the letter P appears in the Name Box as in the diagram on the right. Since the proportion is now named P we can use the formula below to calculate the naïve forecast.
Moving Average
When you open Data Analysis you will see a long list of statistical methods that you can access. Go down the list and click Moving Average. A screen will appear a little like that below. (It may vary depending upon the version of Excel you are using.)
The screen below shows the input for a three period (Interval) moving average.
You need to enter the cell range for the input data and also the destination of the output. If you include a label in the first row of your data, you should check the Labels in First Row box. When it is all entered – click OK and the moving averages will be calculated.
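The moving average and its one-period forecast can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes a trailing average (the average at a period uses that period and the ones before it), with the forecast for a period equal to the average available one period earlier. The values are the first six quarters of the series used in this workbook.

```python
def moving_average(values, window):
    """Trailing moving average; None until `window` observations are available."""
    result = []
    for t in range(len(values)):
        if t + 1 < window:
            result.append(None)
        else:
            result.append(sum(values[t - window + 1 : t + 1]) / window)
    return result

def one_step_forecast(averages):
    """Forecast for period t is the moving average available at period t-1."""
    return [None] + averages[:-1]

values = [243.529, 232.129, 219.844, 210.585, 205.624, 219.938]  # 1980Q1 to 1981Q2
ma3 = moving_average(values, 3)
ma5 = moving_average(values, 5)
forecast3 = one_step_forecast(ma3)
squared_errors = [
    None if f is None else (a - f) ** 2 for a, f in zip(values, forecast3)
]
print([None if m is None else round(m, 3) for m in ma3])
print([None if e is None else round(e, 3) for e in squared_errors])
```

Averaging the squared errors and taking the square root gives the RMS error used to compare the three-quarter and five-quarter versions.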
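Follow the Example
Go to the sheet “Your Try” and use the Excel Data Analysis, Moving Average Add-in to calculate 3 and 5 period moving averages. The input dialog box on the left shows the inputs needed for the 3 period moving average. The calculations, including a one period forecast, are shown on the “Moving Average Solution” sheet, together with a chart and Root Mean Square (RMS) errors indicating the “best” estimate. Try to follow the calculations for the RMS errors. The “Simple MA Example” sheet shows the concept of how the moving average is calculated.
Your Try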
Period
Actual
1980Q1
243.529
1980Q2
232.129
1980Q3
219.844
1980Q4
210.585
1981Q1
205.624
1981Q2
219.938
1981Q3
231.738
1981Q4
224.549
1982Q1
233.729
1982Q2
244.012
1982Q3
259.075
1982Q4
259.160
1983Q1
235.686
1983Q2
237.483
1983Q3
242.444
1983Q4
234.117
1984Q1
230.830
1984Q2
229.758
1984Q3
243.576
1984Q4
246.140
1985Q1
257.428
1985Q2
250.814
1985Q3
238.397
1985Q4
207.214
1986Q1
187.909
1986Q2
169.855
1986Q3
155.852
1986Q4
160.431
1987Q1
153.217
1987Q2
142.653
1987Q3
147.010
1987Q4
135.656
1988Q1
127.964
1988Q2
125.710
1988Q3
133.697
1988Q4
125.185
1989Q1
128.577
1989Q2
137.959
1989Q3
142.297
1989Q4
143.139
1990Q1
148.070
1990Q2
155.385
1990Q3
145.051
1990Q4
130.918
1991Q1
133.988
1991Q2
138.358
1991Q3
136.339
1991Q4
129.478
1992Q1
128.695
1992Q2
130.388
1992Q3
124.928
1992Q4
123.021
1993Q1
120.929
1993Q2
110.056
1993Q3
105.678
1993Q4
108.274
1994Q1
107.657
1994Q2
103.259
1994Q3
99.056
1994Q4
98.866
1995Q1
96.108
1995Q2
84.487
1995Q3
94.161
1995Q4
101.539
1996Q1
105.827
Created by Peter Rossini © 2000 UPmarket Software Services
Created by Peter Rossini © 1999 UniSA
This is the actual value for this quarter. Do NOT include it in your calculation; you can then test the quality of your one-period forecast.
Period Actual
Three Quarter Moving Average
Three Quarter Moving Average Forecast
Five-Quarter Moving Average
Five-Quarter Moving Average Forecast
Sq Error 3Q forecast
Sq Error 5Q forecast
1980Q1 243.529
0.000
1980Q2 232.129 0.000 0.000 0.000 0.000
1980Q3 219.844
231.834
1980Q4 210.585
220.853
451.520
1981Q1 205.624
212.018
222.342
231.912
1981Q2 219.938
212.049
217.624
62.732
5.780
1981Q3 231.738
219.100
217.546
387.657
199.205
1981Q4 224.549
225.408
218.487
29.692
49.045
1982Q1 233.729
230.005
223.116
69.233
232.325
1982Q2 244.012
234.097
230.793
196.187
436.660
1982Q3 259.075
245.605
238.621
623.917
799.860
1982Q4 259.160
254.082
244.105
183.729
421.867
1983Q1 235.686
251.307
246.332
338.425
70.880
1983Q2 237.483
244.110
247.083
191.103
78.312
1983Q3 242.444
238.538
246.770
2.774
21.522
1983Q4 234.117
238.015
241.778
19.542
160.088
1984Q1 230.830
235.797
236.112
51.619
119.859
1984Q2 229.758
231.568
234.926
36.470
40.373
1984Q3 243.576
234.721
236.145
144.184
74.816
1984Q4 246.140
239.825
236.884
130.386
99.900
1985Q1 257.428
249.048
241.546
309.877
422.048
1985Q2 250.814
251.461
245.543
3.119
85.888
1985Q3 238.397
248.880
247.271
170.659
51.068
1985Q4 207.214
232.142
239.999
1736.028
1604.563
1986Q1 187.909
211.173
228.352
1956.529
2713.326
1986Q2 169.855
188.326
210.838
1707.205
3421.946
1986Q3 155.852
171.205
191.845
1054.561
3023.438
1986Q4 160.431
162.046
176.252
116.086
986.865
1987Q1 153.217
156.500
165.453
77.951
530.620
1987Q2 142.653
152.100
156.402
191.739
519.831
1987Q3 147.010
147.627
151.833
25.911
88.202
1987Q4 135.656
141.773
147.793
143.297
261.682
1988Q1 127.964
136.877
141.300
190.688
393.205
1988Q2 125.710
129.777
135.799
124.694
243.048
1988Q3 133.697
129.124
134.007
15.369
4.417
1988Q4 125.185
128.197
129.642
15.513
77.835
1989Q1 128.577
129.153
128.227
0.144
1.135
1989Q2 137.959
130.574
130.226
77.546
94.720
1989Q3 142.297
136.278
133.543
137.437
145.719
1989Q4 143.139
141.132
135.431
47.078
92.083
1990Q1 148.070
144.502
140.008
48.140
159.734
1990Q2 155.385
148.865
145.370
118.440
236.440
1990Q3 145.051
149.502
146.788
14.544
0.102
1990Q4 130.918
143.785
144.513
345.365
251.870
1991Q1 133.988
136.652
142.682
95.975
110.767
1991Q2 138.358
134.421
140.740
2.909
18.700
1991Q3 136.339
136.228
136.931
3.677
19.369
1991Q4 129.478
134.725
133.816
45.567
55.544
1992Q1 128.695
131.504
133.372
36.361
26.227
1992Q2 130.388
129.520
132.652
1.245
8.902
1992Q3 124.928
128.004
129.966
21.090
59.654
1992Q4 123.021
126.112
127.302
24.827
48.227
1993Q1 120.929
122.959
125.592
26.867
40.615
1993Q2 110.056
118.002
121.864
166.496
241.374
1993Q3 105.678
112.221
116.922
151.881
262.000
1993Q4 108.274
108.003
113.592
15.579
74.795
1994Q1 107.657
107.203
110.519
0.119
35.219
1994Q2 103.259
106.397
106.985
15.555
52.705
1994Q3 99.056
103.324
104.785
53.880
62.860
1994Q4 98.866
100.394
103.422
19.879
35.039
1995Q1 96.108
98.010
100.989
18.368
53.502
1995Q2 84.487
93.153
96.355
182.872
272.325
1995Q3 94.161
91.585
94.536
1.016
4.813
1995Q4 101.539
93.396
95.032
99.075
1996Q1 105.827 93.396 95.032
RMS ERROR
14.464
18.297
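The rows above can be checked with a short script. The sketch below is only an illustration in Python (the workbook itself uses Excel); the function names are hypothetical. It assumes, as the figures in the table suggest, that each moving average is a trailing average ending in the current quarter and that each squared error compares the current actual value with the previous quarter's average. Only six quarters are used, so the RMS value it prints is for that sub-sample and not the full-table figure.

```python
from math import sqrt

# Actual values for 1984Q1 to 1985Q2, copied from the table above.
actual = [230.830, 229.758, 243.576, 246.140, 257.428, 250.814]


def trailing_moving_average(values, window):
    """Average of the latest `window` values; None until enough values exist."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window:i + 1]) / window)
    return averages


def squared_forecast_errors(values, averages):
    """Squared error when last period's average is used as this period's forecast."""
    errors = []
    for i in range(1, len(values)):
        if averages[i - 1] is not None:
            errors.append((values[i] - averages[i - 1]) ** 2)
    return errors


ma3 = trailing_moving_average(actual, 3)
errors3 = squared_forecast_errors(actual, ma3)
rms3 = sqrt(sum(errors3) / len(errors3))  # RMS error for this sub-sample only

print([None if m is None else round(m, 3) for m in ma3])
print([round(e, 3) for e in errors3])
print(round(rms3, 3))
```

Changing the window from 3 to 5 gives the five-quarter version of the same calculation.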
Three Quarter Moving Average Forecast
Five-Quarter Moving Average Forecast
Year
Actual Data and the Three-Quarter Moving Average Forecast
Period Actual Calculation
1980Q1 243.529 Missing
1980Q2 232.129 Missing
1980Q3 219.844 (243.529+232.129+219.844)/3
1980Q4 210.585 (232.129+219.844+210.585)/3
1981Q1 205.624 (219.844+210.585+205.624)/3
1981Q2 219.938 (210.585+205.624+219.938)/3
1981Q3 231.738 (205.624+219.938+231.738)/3
1981Q4 224.549 (219.938+231.738+224.549)/3
1982Q1 233.729 (231.738+224.549+233.729)/3
1982Q2 244.012 (224.549+233.729+244.012)/3 = 234.097
Centered Moving Average Example
Year Quarter
Time Index
Centred Moving Average
Seasonal Factor
Year 1
First Quarter
Year 1
Second Quarter
Year 1
Third Quarter
15.00
15.25
1.31
20/15.25=1.31
Year 1
Fourth Quarter
15.50
15.75
12/15.75=0.76
Year 2
16.00
Year 2 Second Quarter 6 20 Missing Missing
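The arithmetic in this example can be sketched in a few lines of Python. This is an illustration only: it assumes that 15.00, 15.50 and 16.00 are consecutive four-quarter moving averages, that each centred moving average is the mean of two neighbouring values, and that the sales figures 20 and 12 are those used in the calculations 20/15.25 and 12/15.75 shown above. The variable and function names are hypothetical.

```python
# Values read from the example above (see the stated assumptions).
four_quarter_ma = [15.00, 15.50, 16.00]
sales = {"Year 1 Third Quarter": 20, "Year 1 Fourth Quarter": 12}


def centre(averages):
    """Centred moving average: mean of each pair of neighbouring averages."""
    return [(a + b) / 2 for a, b in zip(averages, averages[1:])]


centred = centre(four_quarter_ma)

# Seasonal factor = actual value divided by the centred moving average.
seasonal_factor = {
    quarter: value / cma for (quarter, value), cma in zip(sales.items(), centred)
}

for quarter, factor in seasonal_factor.items():
    print(quarter, round(factor, 2))
```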
2
Forecasting and Business Analysis
Copyright Upmarket Software Services. This file must not be used without permission.
This practice session will provide a review of some of the basic skills needed in Excel. It is expected that most students will have covered this material before and that this is simply a refresher. If you have not covered all of this material before then you may also find it necessary to use a simple “how to” type Excel reference in order for you to get started with Excel.
What you will do in this Session?
This session will take you through the worked example that is used for a simple naive model that includes a proportion of the change in actual values. It is a very simple model so the calculations should be easy to follow. The example is built up in several steps so that you can become more familiar with Excel but hopefully not be lost in any of the steps. In particular you will learn
– how to input a function and formula
– the difference between relative and absolute referencing and how to use each
– how to copy a formula to make a single formula apply to a group of cells
– how to use cell naming
How do I follow this session?
It’s easy. At the bottom of this page you will see a number of “tabs”. They look like this.
I hope that you find this session to be useful.
The Naive Forecast
Proportion (P) 0.5
Naïve Forecast 2
Period Actual Forecast2
1 7.60
2 9.70
3 9.60 10.8
4 7.50 9.6
5 7.20 6.5
6 7.00 7.1
7 6.20 6.9
8 5.50 5.8
9 5.30 5.2
10 5.50 5.2
11 5.6
The naïve forecasting model used in this session uses the last actual value and the last CHANGE in actual values in order to make a forecast.
The preceding actual value is adjusted depending on the size and direction of the last change in actual values. This change is weighted using a proportion P.
This can be expressed as:
\[ F_t = A_{t-1} + P(A_{t-1} - A_{t-2}) \]
where
\(A_{t-1}\) is the actual value at time t-1, the value in the period before t;
\(A_{t-1} - A_{t-2}\) is the change in the actual values between t-2 and t-1;
\(P\) is the weight. In this example a 50% weight (P = 0.5) is used.
This is what we are aiming to produce.
It’s a simple spreadsheet to make a naïve forecast. The concept of the forecast is explained in the text box and the final worked example is shown. We will be working towards this outcome through the various steps.
Step 1
0.5
Period Actual Forecast2
1 7.60
2 9.70
3 9.60
4 7.50
5 7.20
6 7.00
7 6.20
8 5.50
9 5.30
10 5.50
11
We can start by entering the formula into the appropriate cell. The formula is:
\[ F_t = A_{t-1} + P(A_{t-1} - A_{t-2}) \]
In this example the first actual values are in cells C6 and C7. We will make our first forecast in period 3 and enter this into a new column headed Forecast2. We can enter the formula as shown below. Please enter this formula into cell D8.
=C7+0.5*(C7-C6)
You should get an answer of
10.8
Now move your cursor over cell D8. A large Ft will appear to show you that this is what we are calculating. I have added similar markers for At-1 and At-2. (This will not normally happen in your spreadsheets). Now click the left mouse button. The formula will appear in the formula bar. Move your cursor into the formula bar as in the diagram on the right and click the left hand mouse button when the cursor is in the formula bar. You should see the formula and relevant cells change colour. This should help you to match the cells in the formula with the original formula.
We can now copy this formula to apply to the full range of data. Move your cursor to cell D8 and click the left mouse button. Now move the cursor to the lower right hand corner of the cell and the cursor should change to the cursor as indicated on the diagram.
Now click and HOLD the left hand mouse button and move the mouse down the spreadsheet. This will “drag down” the formula. You can drag it down to cell D16.
This sheet uses a relative cell reference. This means that as you copy the formula it will “point” to a new set of cells. In the next step we examine an absolute cell reference.
At this time you should go to the Excel Help system and learn more about cell referencing. To do this open the Help System and search for relative cell reference in the index.
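For readers who want to check the dragged-down column outside Excel, the same calculation can be sketched in Python. This is only an illustration of the formula Ft = At-1 + P(At-1 - At-2), not part of the workbook; the function name naive_forecast is hypothetical, and the actual values are those listed in the Step 1 table.

```python
# Actual values for periods 1 to 10, as listed in the Step 1 table.
actual = [7.60, 9.70, 9.60, 7.50, 7.20, 7.00, 6.20, 5.50, 5.30, 5.50]


def naive_forecast(values, p):
    """Forecast = last actual + p * (last change in actual values).

    The first two periods have no forecast because two earlier
    actual values are needed.
    """
    forecasts = [None, None]
    for t in range(2, len(values) + 1):
        last = values[t - 1]
        previous = values[t - 2]
        forecasts.append(last + p * (last - previous))
    return forecasts


forecast2 = naive_forecast(actual, p=0.5)

# One line per period, mirroring the Period / Forecast2 columns.
for period, value in enumerate(forecast2, start=1):
    print(period, "" if value is None else f"{value:.2f}")
```

Because p is passed as an argument, changing the proportion for the whole forecast only needs a single edit, which is the same idea as the absolute reference to one proportion cell in the next step.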
Step 2
Naïve Forecast 2
Period Actual Forecast2
1 7.60
2 9.70
3 9.60 10.8
4 7.50 9.6
5 7.20 6.5
6 7.00 7.1
7 6.20 6.9
8 5.50 5.8
9 5.30 5.2
10 5.50 5.2
11 5.6
=C7+$D$2*(C7-C6)
Note that the reference to D2 is shown as $D$2. This is an absolute reference. Unlike the relative reference used before, this part of the formula will always point to exactly the cell D2 even when “dragged down”. You can easily make a cell reference absolute by using the F4 key.
If you “drag down” the formula now, you will find that the answer remains as before. The advantage of this method is that you can now change the proportion for the whole forecast by simply changing the value in D2.
Using the same method as in step one, drag the formula down until period 11. Now try changing this value and see the effect on the forecast values.
In this example the absolute reference refers to an absolute column ($D) and an absolute row ($2). It is also possible to keep one part of the reference absolute, while leaving the other section relative. So $D2 would mean that the reference is always to Column D but that as the formula is copied the row reference will change. There is a good discussion of this in the Excel help menu under the heading, the difference between relative and absolute references. You should read this as you will need to use all forms of relative and absolute references in later problems.
Absolute Cell References
This sheet uses a combination of relative and absolute cell references. This means that some references will always point to a specific or absolute cell regardless of how or where the formula is copied.
Step 3
Naïve Forecast 2
Period Actual Forecast2
1 7.60
2 9.70
3 9.60 10.8
4 7.50 9.6
5 7.20 6.5
6 7.00 7.1
7 6.20 6.9
8 5.50 5.8
9 5.30 5.2
10 5.50 5.2
11 5.6
To name a cell, simply click on a cell, then move the cursor to the name box and type in the name
Cell Names as a Reference
This sheet uses Cell Names. Cell names are very useful in complex formulas or sheets as they make it easier to keep track of variables. You should read about cell names in the Excel Help under the heading, about labels and names in formulas.
=C7+P*(C7-C6)
In step 2 the proportion was included by using an absolute cell reference. In more complex situations we may find that naming the cell is easier. A named cell can be referred to by name rather than by the cell reference. This is particularly useful if we are using a large number of variables or the same variable on multiple sheets. In this case we call the Proportion, P.
Try entering a cell name. Click on cell D16 and name it forecast by typing this in the name box. When you have typed it hit the return (enter) key. Now click on cell C16 and enter the formula =forecast. This should give you the value for the cell that you have named forecast.
Forecasting and Business Analysis
Estimating the Unknown Parameters in a Simple Model
This Excel template shows a simple example of how the unknown parameters are estimated in a simple regression. These parameters are b0 (intercept) and b1 (slope). Students are not expected to be able to calculate these by hand, and for most problems a computer would always be used to estimate the parameters. The calculations and demonstration here are simply to provide a greater level of understanding. Using the ToolPak and the LINEST function is also explained.
Ordinary Least Squares
The method used to estimate the unknown parameters is called ordinary least squares or OLS. The diagram illustrates the 6 observed values and the line of best fit for these points. The two unknown parameters for this line are b0 and b1. We can find these parameters by minimising the errors between each point and the line. These are labelled as e1 to e6. The errors are squared first.
So the parameters are found by minimising the squared errors.
This can be shown as
\[ \text{Minimise } \sum e_i^2 = \sum (Y_i - b_0 - b_1 X_i)^2 \]
And the values of b0 and b1 can be found from the partial derivatives, i.e.
\[ b_1 = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X} \]
The method is shown on the “Estimating the Model” tab.
Estimating the Model
Trial parameter values:
b0 3
b1 1
Estimate the relationship between advertisement responses and size
You are analysing the responses from your company's latest advertising campaign. You wish to estimate the number of responses that you receive on average for each column centimetre of advertisement. You hypothesise that there is a positive relationship between the size of the advertisement and the number of responses. The relationship is shown on the scattergram. The line of best fit is also shown. You can adjust the line by changing the b0 & b1 parameters below.
[Chart: Estimating the Model. x-axis: Advertisement Size (Column Centimetres); y-axis: Number of Responses]
OLS Calculations
Responses Size
7 4
6 6
10 6
9 7
10 8
11 9
36
Slope b1
=
Find the Line of best fit
Adjust the b0 and b1 values to find the best fit. The errors (e) above are the difference between the observed response and the model response. Try to minimise the squared error.
Y = b0 + b1(size)
Finding the line of best fit by minimising the sum of squared errors
While you can find the line of best fit by trial and error you can also minimise the squared errors by formula. This is shown on the “OLS Calculations” tab. To make the trial and error (iteration) method quicker, you can use solver to find the minimum. Use solver as shown below to minimise the squared errors. Check the results are the same as OLS.
Using Solver to minimise the error
You can open Solver from the Tools menu. If Solver does not appear in the menu you may have to click the appropriate tick box in Tools Add-ins. In the Solver screen notice that you are setting the target cell J17 to a minimum. Thus you minimise the sum of squared errors in J17. You do this by changing the values in H18 and H19. These are the two parameters b0 and b1. Use the solve button to find the parameter values. If you ask it to “Keep Solver Solution” the new parameter estimates will become the parameter values (the same as for OLS).
Finding the line of best fit using the OLS formulae
This shows the application of the OLS formulae for estimating the parameters
The formulae are applied to the data using the table and the formulae to the left. Note that these OLS results are the same as for the iterative (trial and error) approach.
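The OLS formulae can also be checked outside Excel. The sketch below is an illustration in Python and is not part of the workbook. It uses the six Responses and Size observations listed on the ToolPak sheet; the function names are hypothetical. It also compares the sum of squared errors at the trial values b0 = 3, b1 = 1 with the sum at the OLS estimates, which is what the trial and error and Solver approaches are trying to minimise.

```python
# Six observations: responses (Y) and advertisement size (X).
responses = [7, 6, 10, 9, 10, 11]
size = [4, 6, 6, 7, 8, 9]


def ols(x, y):
    """Slope and intercept from the closed-form OLS formulae."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi ** 2 for xi in x)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """Sum of squared differences between observed and fitted responses."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


b0, b1 = ols(size, responses)
sse_trial = sum_squared_errors(size, responses, 3, 1)
sse_ols = sum_squared_errors(size, responses, b0, b1)

print(round(b0, 4), round(b1, 4))
print(sse_trial, round(sse_ols, 4))
```

Any other pair of trial values should give a sum of squared errors at least as large as the OLS value, which is a quick way to check a Solver result.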
Using the Excel ToolPak
Responses Size
7 4
6 6
10 6
9 7
10 8
11 9
6
F
Regression 1
10.4637681159
0.088990
22
45
4
5
Standard Error
.7053175738
10.3574914868
-0.1995488055 1.8517227186
Estimate the relationship between advertisement responses and size
You are analysing the responses from your company's latest advertising campaign. You wish to estimate the number of responses that you receive on average for each column centimetre of advertisement. You hypothesise that there is a positive relationship between the size of the advertisement and the number of responses. This relationship can be tested using regression analysis. The analysis can be performed using the Analysis ToolPak.
Excel Regression Using Data Analysis
The graphic on the left shows the regression dialog box. To find this use the Tools – Data Analysis menu and select regression from the list of methods. The most important inputs to this dialog box are
Y Range – which is the cells that refer to the dependent variable
X Range – which is a contiguous set of cells that refer to one or more independent variable (up to 16 variables can be chosen)
Labels – which needs to be ticked if the first row of the data is a label
Output options – you need to select to put the output in a specified range of the same worksheet OR on a new worksheet ply (or tab) or in a new workbook.
Residuals – you may choose to select from a range of residual options.
Normal Probability – you may choose to have a normal probability plot.
The Output
On the left you can see the output from the data analysis. Notice that the Coefficients are the same as for the other methods of analysis. These are the parameter estimates of the intercept and the slope. All of the other details are relevant statistics. This will be discussed at a later stage.
Using Linest
Reg Coeff | |
Stdev | |
=LINEST(known_y’s,known_x’s,const,stats) | SEE |
DF | |
SSE | SSR |
Enter Linest Function here |
Estimate the relationship between advertisement responses and size
You are analysing the responses from your company's latest advertising campaign. You wish to estimate the number of responses that you receive on average for each column centimetre of advertisement. You hypothesise that there is a positive relationship between the size of the advertisement and the number of responses. This relationship can be tested using regression analysis. The analysis can be performed using the LINEST function.
Details of the LINEST Function from the Excel Help file
LINEST(known_y’s,known_x’s,const,stats)
Fits a straight line to your data and returns an array that describes that line. The accuracy of the line depends on the degree of scattering in the data you provide. The more linear the data, the more accurate the LINEST model. LINEST uses the method of least squares for determining the best fit for the data.
The known_x’s, const, and stats arguments are optional.
If the array known_y’s is in a single row, then each row of known_x’s is interpreted as a separate variable.
If the array known_y’s is in a single column, then each column of known_x’s is interpreted as a separate variable.
The array known_x’s can include one or more sets of variables.
If you use only one variable, known_y’s and known_x’s can be shaped differently.
If you use more than one variable, known_y’s must be a vector (a range with a height or width of 1).
If you omit known_x’s, LINEST uses the values {1,2,3,…} in an array the same size as known_y’s.
If const is FALSE, the constant term b equals zero.
If const is TRUE or omitted, the constant term will be estimated.
If stats is FALSE or omitted, LINEST returns only the slope and y-intercept.
If stats is TRUE, LINEST returns the additional values:
Standard error for each coefficient
Standard error for the constant b
Coefficient of determination (r-squared)
Standard error for the y-estimate
F-statistic
Degrees of freedom
Regression sum of squares
Residual sum of squares
LINEST is an array function in Excel that produces regression results. The advantage of this over the Analysis ToolPak approach is that, like all functions, the results will change as values in the data set change. The function is applied below and you can read about it at the bottom of the sheet. To use LINEST you must use an array. This is a group of cells which cannot change. In this case the arrays will be the X’s and Y’s and the results. To input an array you must use
Time for you to try this for yourself
Use the fx command to enter the LINEST function in cell O21 for the data in cells O3 to P8. The function should be LINEST(O3:O8,P3:P8,TRUE,TRUE). Now highlight cells O21 to P25. Click the cursor in the Equation Edit Bar while the cells are still highlighted. Now press
The yellow highlighted cells are the results from the LINEST function. In this case only the two unknown parameters are shown. The example to the right shows all of the statistics as well. Note the “squiggly” brackets around the formula. This indicates it is entered as an array.
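To see where the additional LINEST statistics come from, the sketch below computes them from first principles in Python for the six Responses and Size observations. It is an illustration only: it does not call Excel, the function name regression_statistics and the dictionary keys are hypothetical, and it covers only the single-variable case with a constant term.

```python
from math import sqrt

responses = [7, 6, 10, 9, 10, 11]   # known_y's
size = [4, 6, 6, 7, 8, 9]           # known_x's


def regression_statistics(y, x):
    """Simple-regression statistics of the kind listed in the LINEST help text."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_resid = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = ss_total - ss_resid
    df = n - 2                        # two estimated parameters
    se_y = sqrt(ss_resid / df)        # standard error for the y-estimate
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_y / sqrt(sxx),
        "se_intercept": se_y * sqrt(1 / n + x_bar ** 2 / sxx),
        "r_squared": ss_reg / ss_total,
        "se_y": se_y,
        "f_statistic": ss_reg / (ss_resid / df),
        "df": df,
        "ss_reg": ss_reg,
        "ss_resid": ss_resid,
    }


stats = regression_statistics(responses, size)
for name, value in stats.items():
    print(name, round(value, 6))
```

The regression sum of squares printed here can be compared with the Regression row of the ToolPak output on the previous sheet.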
\[ b_1 = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2} \]
\[ \text{Minimise } \sum e_i^2 = \sum (Y_i - b_0 - b_1 X_i)^2 \]
\[ b_0 = \bar{Y} - b_1 \bar{X} \]
\[ \hat{Y}_i = b_0 + b_1 X_i \]
29/07/13
Forecasting and Business Analysis
Introduction, Data Handling and Correlation
Lecture 1
• Lecturer: Dr Patricia Sourdin, patricia.sourdin@unisa.edu.au
• Tutor: Mr Minh Nguyen, HuuMinh.Nguyen@unisa.edu.au
Admin
• Lectures: 5 pm – 7 pm, Thursday
• Tutorials: you need to be enrolled in one of the tutorials
Admin
• Your textbook
Course website
Features:
• Course information booklet
• Lecture slides
• Online forum
• Assessment information
• Study guide and data sets to help you prepare for tutorials
Assessment
Three pieces of assessment:
– Assignment 1 (20%) due 13 September 2013
– Assignment 2 (20%) due 8 November 2013
– Final Exam (60%)
Please look at course outline for specific instructions and rules related to pass marks.
Prerequisites
This course REQUIRES successful completion of the following courses:
– Either Statistics for Business (MATH 1052) or Quantitative Methods for Business (Math 1053)
and
– Either Principles of Economics (ECON 1008), or Microeconomics (ECON 1006), or Macroeconomics (ECON 1007).
If you do not have these courses you will be de-enrolled automatically.
What you need.
• This course builds upon your previous study, and therefore we assume you have acquired knowledge of the following:
– Basic MS Excel
– Fundamental economic theory and logic
– Basic algebraic techniques
– Basic concepts from statistics, such as parameter estimation and statistical testing
• You will need regular access to a computer with MS Excel installed
What FBA will do for you
• Introduce you to a range of quantitative analysis techniques and their limitations
• Take you from being able to compile and summarise data in a very basic way towards more sophisticated analysis
– Provide you with practical forecasting and quantitative skills that can be applied in business, government, and academic research
• These tools are highly sought by employers
Examples of questions that can be answered using the techniques you will acquire
• Do lower speed limits save lives?
• How much is a university degree worth in the labour market?
• Does campaign spending influence election outcomes?
• Does economic development lead to less or more environmental damage?
• Why are some countries so poor and others so rich?
• What is the likely effect of a marketing campaign on sales?
This week
• Topic 1a – Brief Review
– Review some basic concepts from maths and statistics
• Topic 1b – Data Handling.
– Data types, graphical methods, descriptive stats – mean and variance
• Topic 1c – Probability distributions
• Topic 1d – Correlation
– Definition, correlation table
Topic 1a
Review and concepts from maths and statistics
Functional Notation
• Often we are interested in the relationship between 2 or more variables, which is often denoted using the concept of a function
\[ Y = f(X) \]
• Read: “Y is a function of X”
• X can be one variable, or it can be a vector of many variables
Equation of a Straight Line
• Any straight line can be expressed as:
\[ Y = \alpha + \beta X \]
• where α (y-intercept) and β (slope of the line) are values which determine the quantitative relationship between X and Y.
• β is the amount Y changes when X changes by one unit
• Explain the intuition of β=0.5, β=-0.75??
Logarithms
• The logarithm is a common way of transforming a variable in business analysis
• The logarithm of A is the power to which B (a base value) must be raised to give the value A
• For example, if B = 9 and A = 81 then the logarithm is 2, expressed as \(\log_9(81) = 2\)
• Often we use the natural logarithm, where B = e (e, a mathematical constant, is approximately equal to 2.718)
• For example, if GDP = 243 then \(\log_e(243) = \ln(243) = 5.493\)
Ln(A) in Excel
• To calculate the natural logarithm of a number in Excel, use “=LN(number)”
• To return that result back to the original number, use “=EXP(number)” (this is called “exponentiation”)
• Both operations can also be done using a calculator
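The same two operations can be tried outside Excel. The short Python sketch below is only an illustration; it uses the standard math module, where math.log plays the role of =LN(number) and math.exp the role of =EXP(number), and it reuses the numbers from the logarithm examples in these slides.

```python
import math

gdp = 243

ln_gdp = math.log(gdp)       # natural logarithm, like =LN(243)
back = math.exp(ln_gdp)      # exponentiation recovers the original number
log9_81 = math.log(81, 9)    # logarithm of 81 to the base 9

print(round(ln_gdp, 3))
print(round(back, 6))
print(round(log9_81, 6))
```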
Where we are headed
• This week we’ll discuss data types, handling, correlation analysis, and other simple procedures, and then move on to basic time-series forecasting techniques
• This will be followed by an introduction to cross-sectional regression, and then time-series regression – which together will take up the bulk of the course
• During the course, we’ll focus on how the techniques we have discussed can be used in real-world forecasting and data analysis
What does regression have to do with “Forecasting”???
• The course does have some maths in it. Don’t panic!
• We will focus on regression: regression is a statistical technique
• Despite many rumours, neither math nor statistics is an evil thing: they are TOOLS, which are used EVERYWHERE in business and elsewhere
What does regression have to do with “Forecasting”??? (continued)
• A forecast is a predicted outcome. The prediction may be:
– Of an aggregate “macro” variable (e.g., unemployment)
– Of a “micro” variable (e.g., sales)
– Based on opinion, theory, data analysis, or some combination of these three
• Regressions yield predictions – this is why they are used heavily in forecasting
More about forecasting
• Forecasting helps businesses and governments make decisions in the face of uncertainty
• Virtually all governments forecast macroeconomic indicators such as unemployment, consumer spending, population growth, and GDP
• Forecasts are constructed at national, regional, state and local levels, and inform public and private policy (resource allocation) at every level
• Some of the data used to develop such forecasts are confidential, and some come from government agencies and departments (like the ABS)
Examples
We have just seen a major economic event. The “global financial crisis”
– What will GDP growth be this year?
– How will sales of my company’s goods do this year?
– What will the level of unemployment be?
– What will happen to house prices?
– Will wages fall?
– What will happen to a particular stock price? What will happen to stock prices in general?
A note on cross-sectional versus time series predictions
• We interpret a “forecast” broadly as a “prediction” in this course
• The prediction can be for future time periods (using time series data)
• A prediction can also be constructed using cross-sectional data
Topic 1b
Data handling
Subscripts and Summation
• Subscripts are used to denote different observations of a variable
• Conventionally, we use subscript i for cross-sectional observations (i.e. states, individuals, etc), and t for time series observations (i.e. years, months, quarters)
• Say we had data for GDP (Y) over a 10 year period, with one observation for GDP (Y) in year 1, another in year 2, etc.
• The individual values of Y can be expressed as Y1 (= GDP in year one), Y2 (= GDP in year 2), all the way up to Y10, or {Yt} where t = 1 to 10
…continued
• We can then write Yt to denote any individual observation of GDP
• If you are interested in calculating the average of GDP over the ten-year period, then you first want to add up (or sum) over all the observations
• Use the summation operator, capital sigma:
\[ \sum_{t=1}^{10} Y_t = Y_1 + Y_2 + \dots + Y_{10} \]
Data Types
Types of Data:
– time series data (Yt for t=1,…,T)
– cross-sectional data (Yi for i=1,…,N)
– panel data (Yit for i=1,…,N and t=1,…,T)
Cross-sectional vs time-series data
• Cross-sectional data are observations on one or more unit-level variables collected at a single point in time
– Repeated cross-sections: cross-sectional data on variables that are roughly comparable and observed at successive time periods (but NOT for the same sample)
• A “time series” is a series of observations on one variable over successive periods of time
Cross-sectional example
Household number | Yearly household spending | Yearly household income
1 | 30,000 | 100,000
2 | 40,000 | 70,000
3 | 80,000 | 60,000
4 | 100,000 | 250,000
5 | 30,000 | 25,000
6 | 15,000 | 20,000
7 | 40,000 | 60,000
8 | 50,000 | 50,000
9 | 80,000 | 90,000
10 | 20,000 | 100,000
… | … | …
100 | 30,000 | 40,000
Time-series data
Time Period | Median house price
1 | 100,000
2 | 150,000
3 | 150,000
4 | 155,000
5 | 157,000
6 | 200,000
7 | 210,000
8 | 150,000
9 | 200,000
10 | 205,000
… | …
100 | 250,000
Panel data example
Time Period | State | # Road Accidents
1 | NSW | 46
1 | WA | 31
1 | Vic | 19
2 | NSW | 52
2 | WA | 47
2 | Vic | 17
3 | NSW | 49
3 | WA | 37
3 | Vic | 14
… | … | …
Variable types
Categorical
– Nominal scale: one-to-one or many-to-one mapping of categories into numerical “dummies”
– Ordinal scale: numbers assigned to categories reflect an inherent ranking
Numerical
– Discrete (binary or multinomial); often equivalent to ordinal categorical variables – e.g., number of bedrooms in a house
– Continuous
Most variables we wish to forecast are numerical
Commonly used transformations
• Levels versus Growth Rates
– Example: Might be more interested in growth of GDP as opposed to levels
Growth Rates:
\[ \frac{Y_t - Y_{t-1}}{Y_{t-1}} \times 100 \]
• Log transformations
• Proportions – e.g., the proportion of people in Australia with a university degree
• Index (e.g. CPI). See Sect 2.1 Koop.
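The growth rate transformation can be sketched in a few lines of Python. This is an illustration only; the function name growth_rates is hypothetical, and the levels are the first four median house prices from the time-series example table, reused here purely as sample numbers.

```python
def growth_rates(levels):
    """Percentage change from each period to the next: (Y_t - Y_{t-1}) / Y_{t-1} * 100."""
    return [
        (current - previous) / previous * 100
        for previous, current in zip(levels, levels[1:])
    ]


# Median house prices for periods 1 to 4 from the time-series example.
levels = [100_000, 150_000, 150_000, 155_000]
rates = growth_rates(levels)

for period, rate in enumerate(rates, start=2):
    print(period, round(rate, 2))
```

The first period has no growth rate because there is no earlier level to compare it with, so the transformed series is one observation shorter than the original.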
Graphical methods
Time series
• Retail trade over time
Histograms
• Example questions:
– “What is the distribution of income across countries?”
– “What is the extent of global inequality?”
• Related to the idea of a distribution.
• Data to use: real GDP per capita in 1992 for 90 countries measured in $US.
Constructing a Histogram: Step 1
• Construct “class intervals” (“bins”).
• Real GDP per capita in our data set varies from $408 in Chad to $17,945 in the U.S.
• Class intervals must include these extremes.
• One choice of class intervals (of many choices possible):
Step 2: Calculate frequencies.
• Count the number of countries whose GDP per capita falls into each bin.
Step 3: Make a bar chart
• Make a bar chart, with the bins on the x-axis, and frequency on the y-axis.
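The three steps can be sketched in Python as follows. This is an illustration only: apart from the two extremes quoted on the slide ($408 and $17,945), the GDP per capita values are made-up sample numbers, the class intervals of width 2,000 are just one possible choice, and the function name is hypothetical.

```python
def histogram_counts(values, edges):
    """Count how many values fall into each class interval [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for k in range(len(counts)):
            is_last_bin = k == len(counts) - 1
            in_bin = edges[k] <= value < edges[k + 1]
            # The last bin also includes its upper edge.
            if in_bin or (is_last_bin and value == edges[k + 1]):
                counts[k] += 1
                break
    return counts


# Hypothetical sample; only 408 and 17945 come from the slide.
gdp_per_capita = [408, 950, 1800, 2500, 4200, 6100, 9800, 12500, 17945]

# Step 1: class intervals that include both extremes.
edges = list(range(0, 18001, 2000))

# Step 2: frequencies.
counts = histogram_counts(gdp_per_capita, edges)

# Step 3: a text bar chart, bins down the page and frequency as bar length.
for k, count in enumerate(counts):
    print(f"{edges[k]:>6} to {edges[k + 1]:>6}: " + "#" * count)
```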
XY Graph (cross-sectional)
• Example: Deforestation and Population density for 70 tropical countries.
• Question of interest:
– “Do countries with high population density also tend to have high deforestation rates?”
• Plot of one variable versus another (e.g. deforestation on y-axis, population density is on x-axis).
• Each point on graph represents deforestation and population density for one country.
Scatterplot Example
XY plot of population density against deforestation
Interpretation of XY-plots
• There seems to be a positive relationship between deforestation and population density
• Countries with low population density also tend to have low deforestation rates (i.e. low-low)
• Countries with high population density also tend to have high deforestation rates (i.e. high-high)
• Outliers: countries which do not fit the “general pattern”.
Descriptive statistics
Descriptive Statistics
• Example (continued): real GDP per capita for 90 countries.
• A histogram graphically summarises the cross-country income distribution.
• Descriptive statistics are numbers which summarise properties of the income distribution.
1. Measures of Location
• Intuition: centre of distribution, average, “typical country” (careful!!). We can calculate the sample’s value – not the population value.
• Sample mean:
\[ \bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} \]
• Mean GDP per capita is $5,443.80 in this sample
• Median or mode is often useful for skewed data
2. Measures of dispersion
• Intuition: spread/variability/dispersion of distribution; inequality across observations. Again – calculated for the sample at hand.
• Standard deviation:
\[ s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{N - 1}} \]
• Variance = standard deviation squared
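These measures can be computed directly from their definitions. The Python sketch below is an illustration only: it uses the first ten yearly household spending values from the cross-sectional example table as sample data (not the GDP data set, which is not reproduced here), and it compares the hand-written formulas with the standard statistics module.

```python
import statistics
from math import sqrt

# Yearly household spending for households 1 to 10 from the cross-sectional example.
spending = [30_000, 40_000, 80_000, 100_000, 30_000, 15_000, 40_000, 50_000, 80_000, 20_000]

n = len(spending)

# Sample mean: sum of the observations divided by N.
mean = sum(spending) / n

# Sample standard deviation: squared deviations from the mean, divided by N - 1.
std_dev = sqrt(sum((y - mean) ** 2 for y in spending) / (n - 1))
variance = std_dev ** 2

median = statistics.median(spending)

print(mean, median)
print(round(std_dev, 2), round(variance, 2))

# The library versions use the same N - 1 definition.
print(round(statistics.stdev(spending), 2))
```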
Topic 1c
Probability distributions
Random variables
• A random variable is a variable whose value is unknown until it is observed; in other words it is a variable that is not perfectly predictable
– Each random variable has a set of possible values it can take
– A discrete random variable can take only a limited, or countable, number of values
• An indicator variable taking the values one if yes, or zero if no
• Indicator variables are discrete and are used to represent qualitative characteristics such as gender (male or female), or race (white or nonwhite)
– A random variable that can have any value is treated as a continuous random variable
• Probability is usually defined in terms of experiments
– If we were to select one cell from the table at random, that would constitute a random experiment
• We summarize the probabilities of possible outcomes using a probability density function (pdf)
– The pdf for a discrete random variable indicates the probability of each possible value occurring
– For a discrete random variable X the value of the probability density function f(x) is the probability that the random variable X takes the value x, f(x) = P(X = x)
• It must be true that 0 ≤ f(x) ≤ 1 and f(x1) + f(x2) + … + f(xn) = 1
PDF of a discrete random variable
PDF of continuous RV
Properties of PDF
• Two key features of a probability distribution are its center (location) and width (dispersion)
– A key measure of the center is the mean, or expected value
– Measures of dispersion are variance, and its square root, the standard deviation
Expected value
• The mean of a random variable is given by its mathematical expectation
– If X is a discrete random variable, then the mathematical expectation, or expected value, of X is:
\[ E(X) = x_1 P(X = x_1) + x_2 P(X = x_2) + \dots + x_n P(X = x_n) \]
• For the population in our table, the expected value of X is:
\[ E(X) = 1 \times P(X=1) + 2 \times P(X=2) + 3 \times P(X=3) + 4 \times P(X=4) \]
\[ = (1 \times 0.1) + (2 \times 0.2) + (3 \times 0.3) + (4 \times 0.4) = 3 \]
• The mean of a random variable is the population mean
– We use Greek letters for population parameters
• The expected value can be written equivalently as:
\[ \mu_X = E(X) = x_1 f(x_1) + x_2 f(x_2) + \dots + x_n f(x_n) = \sum_{i=1}^{n} x_i f(x_i) = \sum_{x} x f(x) \]
• For our example:
\[ \mu_X = E(X) = \sum_{x=1}^{4} x f(x) = (1 \times 0.1) + (2 \times 0.2) + (3 \times 0.3) + (4 \times 0.4) = 3 \]
Properties of expectations
• If a is a constant, then g(X) = aX is a function of X, and:
\[ E(aX) = E[g(X)] = \sum_{x} g(x) f(x) = \sum_{x} a x f(x) = a \sum_{x} x f(x) = aE(X) \]
• If a and b are constants, then:
\[ E(aX + b) = aE(X) + b \]
• The expected value of the random variable is the average value that occurs in many repeated trials of an experiment
Variance of a random variable
• The variance of a discrete or continuous random variable X is the expected value of:
\[ g(X) = [X - E(X)]^2 \]
Algebraically:
\[ \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2 \]
• For our problem, we know that E(X) = μ = 3
– Now:
$$E(X^2) = \sum_{i=1}^{4} g(x_i) f(x_i) = \sum_{i=1}^{4} x_i^2 f(x_i) = [1^2 \times 0.1] + [2^2 \times 0.2] + [3^2 \times 0.3] + [4^2 \times 0.4] = 10$$
– Then:
$$\operatorname{var}(X) = \sigma_X^2 = E(X^2) - \mu^2 = 10 - 3^2 = 1$$
– The square root of the variance is called the standard deviation
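The worked example above can be checked numerically. The following is a minimal sketch in Python (standard library only); the dictionary `pmf` simply encodes the slide's table, and the helper name `expected_value` is an illustrative choice, not something from the lecture.

```python
# Probability table from the worked example: P(X = x) for x = 1, 2, 3, 4.
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

def expected_value(pmf, g=lambda x: x):
    """Return E[g(X)] = sum over x of g(x) * f(x) for a discrete distribution."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expected_value(pmf)                              # E(X)
second_moment = expected_value(pmf, lambda x: x ** 2)   # E(X^2)
variance = second_moment - mean ** 2                    # var(X) = E(X^2) - mu^2
std_dev = variance ** 0.5                               # standard deviation

print(mean, second_moment, variance, std_dev)
```

Up to floating-point rounding, the printed values should agree with the slide: mean 3, E(X²) = 10, variance 1.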
2 PDFs with different variances
Property of variances
• A useful property of variances is the following
– Let a and b be constants, then:
$$\operatorname{var}(aX + b) = a^2 \operatorname{var}(X)$$
– To see this, let Y = aX + b. Then:
$$\begin{aligned}
\operatorname{var}(aX + b) = \operatorname{var}(Y) &= E[(Y - \mu_Y)^2] = E[(aX + b - (a\mu_X + b))^2] \\
&= E[(aX - a\mu_X)^2] = E[a^2 (X - \mu_X)^2] \\
&= a^2 E[(X - \mu_X)^2] \\
&= a^2 \operatorname{var}(X)
\end{aligned}$$
Rules of expected values
• E(c) = c
• E(X + c) = E(X) + c
• E(cX) = cE(X)
Rules of variance
• V(c) = 0
• V(X + c) = V(X)
• V(cX) = c²V(X)
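These rules can be verified on the same small probability table used in the worked example. The sketch below is illustrative only: the constants `a` and `b` are arbitrary values chosen for the check, and the helper names are not from the lecture.

```python
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}   # table from the worked example

def mean(pmf):
    """E(X) for a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    """var(X) = E[(X - mu)^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def transform(pmf, a, b):
    """Distribution of Y = a*X + b: same probabilities, rescaled values."""
    return {a * x + b: p for x, p in pmf.items()}

a, b = 2.0, 5.0                  # arbitrary constants for illustration
pmf_y = transform(pmf, a, b)

# E(aX + b) should equal aE(X) + b; var(aX + b) should equal a^2 var(X).
print(mean(pmf_y), a * mean(pmf) + b)
print(var(pmf_y), a ** 2 * var(pmf))
```

Each printed pair should match up to rounding, which is the content of the rules E(aX + b) = aE(X) + b and V(aX + b) = a²V(X).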
Histogram of a normally-distributed variable
PDF of the normal distribution
Topic 1d
Correlation
Definition of correlation
• Correlation measures numerically the relationship between two variables X and Y (e.g., population density and deforestation).
• Correlation between X and Y is symbolised by the correlation coefficient “r” or “rXY”.
$$r = \frac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum (Y_i - \bar{Y})^2 \sum (X_i - \bar{X})^2}}$$
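As a rough illustration of the formula for r, the sketch below computes the correlation coefficient directly from its definition. The data are made up for the example and the function name `pearson_r` is an assumption, not part of the course material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r computed term by term from its definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / sqrt(ss_y * ss_x)

# Hypothetical data: Y rises almost linearly with X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
print(r)
```

With data this close to a straight line, r should come out just below 1; reversing the sign of Y flips the sign of r.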
Coefficient of correlation values
Coefficient of correlation and XY plots
Why are variables correlated?
Correlation does not imply causality!!!!
Example:
• Correlation between education and wages is strong & positive
• Does this mean education “causes” higher wages?
• Possibility 1: Education improves skills, skilled workers get better paying jobs, and therefore education causes wages to increase
• Possibility 2: Some individuals are born with high innate ability, which makes it easier for such individuals to pursue more education and to be more productive on the job. Innate ability (not education) causes wages to increase.
• NEED AN UNDERLYING THEORY TO BRING TOGETHER!
Correlation with several variables
• Correlation relates precisely two variables
• What to do with three or more? Usually use regression.
• Or you can calculate the correlation between every possible pair of variables.
• Given three variables: X, Y and Z, we can calculate three correlations:
– rxy, rxz and ryz
Correlation matrix
A correlation matrix shows the correlation between each variable and each other variable in a sample.
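A correlation matrix for three variables could be built along the following lines. This is a sketch with invented data; the variable names X, Y and Z follow the slide, while the helper `pearson_r` and the nested-dictionary layout are illustrative assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient between two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical sample with three variables.
data = {
    "X": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Y": [2.1, 3.9, 6.2, 7.8, 10.1],
    "Z": [5.0, 3.0, 4.0, 1.0, 2.0],
}

names = list(data)
# matrix[a][b] holds the correlation between variable a and variable b.
matrix = {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 3) for b in names])
```

The diagonal entries equal 1 (each variable is perfectly correlated with itself) and the matrix is symmetric, so only rxy, rxz and ryz carry information.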
Conclusions
• Appropriate data description is a necessary first step before ANY modelling
• How you describe data depends on what sort of data you’re describing
• Correlation can be suggestive, but alone cannot establish causality
– Correlation + a sensible theory suggests (does not prove but provides evidence of) a causal relationship
Study
• Keep up with the readings,
• Prepare for your tutorials:
– There is independent work that must be completed by you prior to tutorials, if you wish to participate in this learning experience
– Please don’t come to tutorials if you have not prepared
Next topic…
• Begin looking at simple regression analysis
7/08/13
Topic 2
Simple Regression
Koop Chapters 4 and 5
Admin
• Assignment 1 to be posted week 3
• Due date 13 September – week 7
• Assign 1: Regression exercise and report
Last lecture
• Maths and stats review
– Data handling
– Data description
• XY plots
• Mean, Standard deviation
– Correlation
– Probability and probability distributions
This topic
• A discussion of the simple regression…
• ABSOLUTELY ESSENTIAL READING: Koop Chapters 4 and 5
Introduction to Simple Regression
• Regression is the most common tool of the applied economist.
• Used to help understand what factors (variables) account for the outcome of the variable of interest.
• We begin with simple regression to understand the relationship between two variables, X and Y.
Imagine a “best-fitting” line…
• XY-plot of population density against deforestation
The Regression Line IS the Line of Best Fit
• The process of (bivariate) regression is the process of fitting a line through the points in the XY-plot that best captures the relationship between deforestation and population density.
• What do we mean by “best fitting” line?
Assumed Model Structure
Assume a true linear relationship exists between Y and X:
Y = α + βX
α = intercept of line
β = slope of the line
Example:
Y = output of a good, X = labour input
α = ? (perhaps 0)
β = ? (perhaps 0.8 = marginal product of labour)
NOT the Line of Perfect Fit
1. Even if the straight line relationship were true on average, we would never get all points on an XY-plot lying precisely on it due to the fact that some of Y’s movement is not able to be explained using X.
2. Also – the true relationship is probably more complicated; a straight line is typically thought of as an approximation
3. Y or X may be measured with errors.
Due to 1, 2 and 3, we add an error term to the model.
Adding the error
Y = α + βX + e
where e is an error.
• What we know: X and Y.
• What we do not know: α, β and e.
• Regression analysis uses data (X and Y) to make an estimate of what α and β are.
• Notation: α̂ and β̂ are the estimates of α and β that the regression (line-fitting) process spits out.
Pre- versus Post-Estimation Model
• True regression model:
Y = α + βX + e
e = Y − α − βX
e = error
• Estimated regression model:
Y = α̂ + β̂X + u
u = Y − α̂ − β̂X
u = residual
How do we choose α̂ and β̂?
With more than two points, it’s usually not possible to find a line that fits perfectly through all points:
Essential Characteristic of Regression (or “Ordinary Least Squares”)
OLS regression chooses the line that minimizes the sum of squared residuals.
Expressing the OLS estimator
We observe data on two variables for i=1,..,N individuals. Each individual has a Yi and an Xi.
Any line we fit/choice of α̂ and β̂ will yield residuals ui.
OLS estimator chooses α̂ and β̂ to minimise SSR
$$\text{Sum of squared residuals} = SSR = \sum u_i^2$$
Mathematical expressions for the bivariate OLS estimators
Solution:
$$\hat{\beta} = \frac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sum (X_i - \bar{X})^2}$$
and
$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$
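The two formulas for the bivariate OLS estimators translate almost directly into code. The following sketch uses invented data; `ols_fit` and `ssr` are hypothetical helper names introduced for the illustration.

```python
def ols_fit(xs, ys):
    """Return (alpha_hat, beta_hat) from the bivariate OLS formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta_hat = (sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
                / sum((x - x_bar) ** 2 for x in xs))
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

def ssr(xs, ys, alpha, beta):
    """Sum of squared residuals for the line Y = alpha + beta * X."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data scattered around the line Y = 2X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 5.9, 8.1, 9.8]

alpha_hat, beta_hat = ols_fit(x, y)
print(alpha_hat, beta_hat, ssr(x, y, alpha_hat, beta_hat))
```

Any other choice of intercept and slope gives a sum of squared residuals at least as large as the one at (α̂, β̂), which is what "least squares" means.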
Regression and Causation
How do you choose which variables to use?
• Ideally, the explanatory variable should be the one which causes/influences the other (dependent) variable: so, X causes Y.
• If you can, only estimate models where this causality assumption makes sense.
• But, what guides this? Intuition, reasoning, rationale, theory
Examples
• Increases in X (= population density) cause Y (= deforestation) to increase (or vice versa? Make your argument)
• Increasing X (= the lot size of a house) causes Y (= its value) to increase (or vice versa? Make your argument)
• Increasing X (= advertising expenditures) causes Y (= company sales) to increase (or vice versa? Make your argument)
Causality (cont.)
• In practice, great care must be taken in interpreting regression results as reflecting causality. Why?
– your assumption that X causes Y may be wrong.
– X and Y may both be caused by some third factor, call it Z.
– X may cause Y but Y may also cause X (e.g. exchange rates and interest rates).
– the whole concept of causality may be inappropriate.
• Formally, one key question regression addresses is: “How much of the variability in Y can be explained by X?” (we will look at this shortly)
Interpretation of α̂
• Estimated value of Y if X = 0
• This is often not of interest
Example:
• X = lot size, Y = house price
• α̂ = estimated value of a house with lot size = 0
Interpretation of β̂
1. β̂ is the estimate of the marginal effect of X on Y
2. Using the regression model:
$$\hat{\beta} = \frac{dY}{dX} = \frac{\Delta Y}{\Delta X}$$
3. The OLS estimator – the estimated “slope” – is a measure of how much Y tends to change when you change X.
4. “If X changes by 1 unit then Y tends to change by β̂ units”, where “units” refers to what the variables are measured in (e.g. $, $billions, £, %, hectares, metres, etc.)
Deforestation example
Development economists have theories that imply that increasing population density should increase deforestation. Thus:
• Y = deforestation (annual percentage lost) = dependent variable
• X = population density (people per thousand hectares) = explanatory variable
• Using data on N = 70 tropical countries we find: β̂ = 0.000842
Interpretation and prediction
a) “If population density increases by 1 person per 1,000 hectares, then the average deforestation is estimated (or expected) to increase by 0.000842% per year”
b) “If population density increases by 100 people per 1,000 hectares, then deforestation is estimated to increase by 0.0842% per year on average”
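Statements (a) and (b) are the same slope applied to different changes in X. A small sketch of that arithmetic follows; the function name `predicted_change` is an illustrative choice.

```python
beta_hat = 0.000842   # estimated slope from the deforestation example

def predicted_change(beta_hat, delta_x):
    """Estimated change in Y for a change of delta_x units in X."""
    return beta_hat * delta_x

# delta_x is measured in people per 1,000 hectares.
for delta_x in (1, 100):
    print(delta_x, predicted_change(beta_hat, delta_x))
```

Scaling the change in X by 100 scales the predicted change in deforestation by 100, from 0.000842 to 0.0842 (percent per year).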
Basic evaluation statistics
• R-squared
• F-test
• Data evaluation
• t-test
R²: A Measure of Fit
Intuition:
• “Variability” = (e.g.) how deforestation rates vary across countries
Total variability in dependent variable Y = (1)+(2):
1. Variability explained by the explanatory variable (X) in the regression
+
2. Variability that cannot be explained and is left over in the residual.
Sums of squares
In mathematical terms,
TSS = RSS + SSR
where TSS = Total sum of squares:
$$TSS = \sum (Y_i - \bar{Y})^2$$
Note similarity to formula for variance.
More sums of squares
• RSS = Regression sum of squares
$$RSS = \sum (\hat{Y}_i - \bar{Y})^2$$
• SSR = Sum of squared residuals
$$SSR = \sum_{i=1}^{N} u_i^2$$
R-squared expressed as sums of squares
• R-squared is a measure of fit (i.e. how well does the regression line fit the data points – meaning how closely X and Y are related)
$$R^2 = 1 - \frac{SSR}{TSS} \quad \text{or equivalently:} \quad R^2 = \frac{RSS}{TSS} \quad \text{since } TSS - SSR = RSS$$
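The decomposition TSS = RSS + SSR and the two equivalent expressions for R² can be checked on a small data set. This is a sketch with made-up numbers; the function name `sums_of_squares` is hypothetical.

```python
def sums_of_squares(xs, ys):
    """Fit the bivariate OLS line and return (TSS, RSS, SSR)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta_hat = (sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
                / sum((x - x_bar) ** 2 for x in xs))
    alpha_hat = y_bar - beta_hat * x_bar
    fitted = [alpha_hat + beta_hat * x for x in xs]
    tss = sum((y - y_bar) ** 2 for y in ys)                   # total
    rss = sum((f - y_bar) ** 2 for f in fitted)               # explained by regression
    ssr = sum((y - f) ** 2 for y, f in zip(ys, fitted))       # left in residuals
    return tss, rss, ssr

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 3.9, 6.5, 6.8, 9.4, 9.9]

tss, rss, ssr = sums_of_squares(x, y)
r2_from_ssr = 1 - ssr / tss
r2_from_rss = rss / tss
print(tss, rss + ssr)
print(r2_from_ssr, r2_from_rss)
```

Both lines of output should show matching pairs up to rounding, which is why the two formulas for R² are interchangeable when the regression includes an intercept.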
Properties of R-squared
$$0 \le R^2 \le 1$$
• R² = 1 means perfect fit. All data points exactly on the regression line (i.e. SSR=0).
• R² = 0 means X does not have any explanatory power for Y whatsoever (i.e., X has no influence on Y).
• Bigger values of R² imply X has more explanatory power for Y.
• R² is equal to (the correlation between X and Y) squared (i.e. R² = r²XY)
R-squared example
• R² measures the proportion of the variability in Y that can be explained by X.
Example:
• In regression of Y = deforestation on X = population density, we obtain R² = 0.44
→ We can say that “44% of the cross-country variation in deforestation rates can be explained by the cross-country variation in population density”
F test of overall significance
The F test is often used to measure the explanatory power of the whole model (or, equivalently, the significance of the R-squared). The typical hypotheses in this context are:
• H0: there is no statistically significant relationship between Y and X
• H1: there is a statistically significant relationship between Y and X
• The F statistic is calculated as the ratio of the amount of variation in Y that is explained by the model to the amount of variation unexplained, corrected for degrees of freedom:
$$F_{k,\,n-k-1} = \frac{RSS / k}{SSR / (n - k - 1)}$$
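As an illustration of the F formula, the sketch below plugs in the deforestation figures quoted in these slides (R² = 0.44, N = 70, k = 1). Because only the ratio RSS/SSR matters, the example assumes TSS is normalised to 1, so that RSS = R² and SSR = 1 − R²; that normalisation is a convenience for the example, not something taken from the Excel output.

```python
def f_statistic(rss, ssr, n, k):
    """F = (RSS / k) / (SSR / (n - k - 1))."""
    return (rss / k) / (ssr / (n - k - 1))

r_squared = 0.44      # from the deforestation regression
n, k = 70, 1          # 70 countries, one explanatory variable

# With TSS normalised to 1: RSS = R^2 and SSR = 1 - R^2.
f_value = f_statistic(rss=r_squared, ssr=1 - r_squared, n=n, k=k)
print(f_value)
```

The resulting statistic would be compared with a critical value from the F distribution with (k, n − k − 1) degrees of freedom, or read off the regression output directly.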
Deforestation Excel output
Non-linearities
[Figure: average hourly earnings plotted against years of education]
[Figure: child mortality plotted against per capita GNP in 1980]
Estimates only!
As mentioned, α̂ and β̂ are estimates of the true population parameters only
• But how accurate are they?
• The t-test allows us to formally address this problem for each variable separately. It is based on the estimated standard deviation – or “standard error” – of β̂, which is estimated, along with the value itself, by the regression process
Standard error of β̂
The s.e. of the estimated slope varies:
– directly with SSR (the variability in the residuals)
– Inversely with N
– Inversely with Σ(X − X̄)², which relates to the variance/variability of X
$$se(\hat{\beta}) = \sqrt{\frac{SSR}{(n - 2)\sum (X - \bar{X})^2}}$$
What factors affect the accuracy of the estimate β̂?
Ceteris paribus,
• A large number of observations (more data points)
• Small errors (small SSR)
• A bigger spread of values of the explanatory variable X (X has a range of values)
will increase the accuracy of the estimate β̂
Very small sample size
Large sample size, large error variance
Large sample size, small error variance
Limited range of X values
Deforestation excel output
Test of a slope coefficient
The t-test in the context of linear regression tests whether there is a statistically significant linear relationship between X and Y.
Hypotheses:
H0 : β = 0 (No linear relationship)
H1 : β ≠ 0 (Linear relationship)
To perform the test, form the following statistic using the estimated coefficient and standard error:
$$t_{(n-k-1)} = \frac{\hat{\beta}}{se(\hat{\beta})}$$
n-k-1 represents the degrees of freedom associated with the test; when there is one independent variable, k = 1.
Concept check
Test your understanding of the t-test by doing the following:
1. Calculate the t-stat using the standard error of the estimate and estimated coefficient
2. Check that this is equal to the t-stat reported in excel
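The concept check can also be rehearsed outside Excel. The sketch below computes the slope, its standard error and the t-statistic for a small invented data set, using the bivariate formulas from these slides with k = 1; the function name `slope_se_and_t` is an illustrative assumption.

```python
from math import sqrt

def slope_se_and_t(xs, ys):
    """Return (beta_hat, se(beta_hat), t) for a simple regression of Y on X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys)) / ss_x
    alpha_hat = y_bar - beta_hat * x_bar
    ssr = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))
    se_beta = sqrt(ssr / ((n - 2) * ss_x))    # standard error of the slope
    return beta_hat, se_beta, beta_hat / se_beta

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.8, 4.3, 5.9, 8.4, 9.6, 12.2]

beta_hat, se_beta, t_stat = slope_se_and_t(x, y)
print(beta_hat, se_beta, t_stat)
```

The t-statistic is then compared with a critical value from the t-distribution with n − k − 1 = 4 degrees of freedom for this sample.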
Evaluating the model using the original data: The issues
• How well does the model ‘explain’ the variance in the dependent variable?
– ‘Goodness of fit’: the closer the points to the regression line, the better
• How strongly are the independent variables related to the dependent variable?
– Are the estimated effects economically meaningful?
– Are the estimated effects statistically significant?
• Determine whether the underlying assumptions of regression modelling have been met (more on this later…)
• Determine robustness of the model to outliers (‘unusual’ observations)
Confidence in our results
• Uncertainty about accuracy of the estimator can be summarised in a “confidence interval”.
• This will provide us with some more information about the accuracy of our results
Confidence Interval for β
• Confidence interval for β is given by:
$$\left[\hat{\beta} - t_{\hat{\beta}} \times se(\hat{\beta}),\; \hat{\beta} + t_{\hat{\beta}} \times se(\hat{\beta})\right]$$
Where:
t_β̂ = “critical value” from the t-distribution (note that Excel can provide the value)
And se(β̂) is the standard error of β̂
Constructing a Confidence Interval
• If we want a 95% CI on β, we need the s.e. (provided in excel output) and the relevant t-statistic*
• Therefore, the CI on our estimate is:
Lower bound: 0.000842 – 1.99*0.0001165 = 0.00061
Upper bound: 0.000842 + 1.99*0.0001165 = 0.001075
CI = [0.00061, 0.001075]
Interpretation (informal):
– There is a 95% probability that the true value of β lies between 0.00061 and 0.001075.
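The interval can be reproduced from the three numbers quoted on the slide. This is a small sketch; the function name `confidence_interval` is hypothetical. Note that with the rounded inputs shown here the upper bound comes out near 0.001074, so the slide's 0.001075 presumably reflects unrounded values in the Excel output.

```python
def confidence_interval(beta_hat, se_beta, t_crit):
    """Return (lower, upper) = beta_hat -/+ t_crit * se(beta_hat)."""
    margin = t_crit * se_beta
    return beta_hat - margin, beta_hat + margin

beta_hat = 0.000842     # estimated slope (deforestation example)
se_beta = 0.0001165     # standard error quoted on the slide
t_crit = 1.99           # 95% critical value quoted on the slide

lower, upper = confidence_interval(beta_hat, se_beta, t_crit)
print(lower, upper)
```

Because the interval excludes zero, it tells the same story as a t-test that rejects β = 0 at the 5% level.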
• OLS estimator has many nice statistical properties if certain conditions hold.
• These conditions are known as the Gauss-Markov Conditions
Y = α + βX + e
The Gauss-Markov Assumptions, In Brief
• These necessary conditions are:
– The linear model is correct
– We’ve got a random sample of data from the population whose behaviour we’re using the model to explain
– There’s some sample variance in X
– X and the unexplained part of Y (that is, e) aren’t related
Y = α + βX + e
Nonlinearity, e.g.: Y = α + βX² + e
[Figure 4.2: A quadratic relationship between X and Y]
[Figure 4.3: X and Y need to be logged]
[Figure 4.4: ln(X) versus ln(Y)]
lnY = α + β lnX + e
22/08/13
Yi = α + β1X1i + … + βkXki + ei
dY/dX = β
∂Y/∂Xj = βj
Ŷ = 7.31 − 0.029X1 + 0.000006X2
β̂1 = −0.029
β̂2 = 0.000006
Ŷ = 7.31 − 0.029X1 + 0.116 X2
β̂2 = 0.116
y = α + β1X1 + β2X2 + e
y = α + β1X1 + v
Omitted variable bias
            corr(X1,X2) > 0    corr(X1,X2) < 0
β2 > 0      Bias (+)           Bias (–)
The true model is: wage = α + β1educ + β2ability + e
But we omit ability: wage = β0 + β1educ + v, with v = (β2ability + e)
Since ability and education are most likely correlated (+ve),
Start with some data:
6302.0 Average Weekly Earnings, Australia
TABLE 3. Average Weekly Earnings Of Employees, Australia (Dollars) – Original

Date        Total earnings (male)
Nov.1983    362.00
Feb.1984    370.60
May.1984    383.80
Aug.1984    386.20
Nov.1984    389.50
Feb.1985    392.70
May.1985    397.20
Aug.1985    403.10
Nov.1985    413.90
Feb.1986    422.70
May.1986    425.50
Aug.1986    437.20
Nov.1986    446.30
Feb.1987    444.50
May.1987    450.90
Aug.1987    457.00
Nov.1987    470.00
Feb.1988    474.90
May.1988    481.70
Aug.1988    486.20

Set p = 0.5 and find your forecasts for the time period of interest.
Once you have found your RMSE or evaluation tool, go to Solver and impose the appropriate constraints.
Go back and check your value of p, which should have changed. You have now minimized the error terms by choice of p, which provides you with the best naive 2 forecast.
Why the G-M assumptions matter
• If any of these conditions DON’T hold, then you can’t run the regression and expect the OLS estimator to deliver parameter estimates that are reasonable guesses of the true relationship between Y and X in the population you care about.
– Bias: means β̂ is not a true estimate of β
– Lack of precision: means β̂ has a large standard error relative to itself
More on non-linear relationships
• So far, we’ve discussed estimating a LINEAR regression of Y on X, Y = α + βX + e, and we have seen briefly a little of non-linearity
• Now let’s say you have chosen your variables (say explaining birth weight (Y) using mother’s income (X))
Nonlinearity: choose functional form
• Is the relationship linear?
• We could perform a regression of Y (or ln(Y) or Y²) on X² (or 1/X or ln(X) or X³, etc.)… and the same estimation technique for the equation’s parameters would hold.
• How will we decide?
– Theory (would the marginal effect on birth weight of income be likely to be constant? ie. does $1 of extra income have the same effect for an unemployed person as a millionaire?)
– Graphical analysis: What does a plot of the two variables look like?
Nonlinearity - example
• e.g.: Y = α + βX² + e
• But how might you know if the TRUE relationship between X and Y is likely to be nonlinear?
• Answer: Careful examination of X-Y plots and/or theory.
Copyright © 2005 John Wiley & Sons, Ltd
Choosing functional form
• Common transformations are:
– Squared terms
– Taking natural logs (one side or both sides): implies elasticities, not slopes, are constant.
– Note: Need values > 0 to use log models!
• To find the proper data transformation, try the following:
– Plot out the data in X-Y space, as per following slides
– Scan relevant theory for any suggestions
Functional form
• So, data in previous slide suggest that double log model is ‘correct’ one: lnY = α + β lnX + e
• Interpretation of results will be different:
• Eg. A coefficient of 10 implies a 1% change in X yields a 10% change in Y
Next topic
• More on regression
– Multiple regression
– Dig deeper: What assumptions do we rely on to get unbiased and efficient estimates?
Forecasting and Business Analysis
Topic 3
This topic
• Koop Ch 6
– Multiple regression
• Last topic:
– Simple regression
Multiple Regression
Differences between simple and multiple regression
• Multiple regression is like simple regression, except that there are many explanatory variables: X1, X2,…, Xk
• The key differences are:
– You can perform multiple t-tests, achieve higher R-squared, and build a more theoretically complete model of Y
– The effect of each independent variable on Y is estimated CONDITIONAL on the other independent variables
OLS estimation
• Multiple regression model: Yi = α + β1X1i + … + βkXki + ei
• OLS estimates: α̂, β̂1,…, β̂k
• These estimates (still) minimise the sum of squared residuals
• Solution to minimisation problem: Messy
• Calculation of β̂ is harder for multiple OLS
• Excel will calculate the OLS estimates for you
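To give a feel for what the software does behind the scenes, the sketch below solves the least-squares "normal equations" for a model with an intercept and two explanatory variables. It is not the lecture's method (the slides leave the calculation to Excel); the data, the coefficients used to generate them, and the helper names `solve` and `ols_multiple` are all invented for the illustration.

```python
def solve(A, b):
    """Solve the small square system A x = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]             # partial pivoting
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back-substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols_multiple(X_rows, y):
    """Return [alpha_hat, beta1_hat, ..., betak_hat] minimising the SSR."""
    Z = [[1.0] + list(row) for row in X_rows]           # add intercept column
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * y_i for z, y_i in zip(Z, y)) for i in range(p)]
    return solve(ZtZ, Zty)

# Hypothetical, noise-free data generated from Y = 7.0 - 0.03*X1 + 0.1*X2.
X_rows = [[0, 9.5], [5, 10.0], [10, 10.5], [20, 9.0], [0, 11.0], [15, 10.2]]
y = [7.0 - 0.03 * x1 + 0.1 * x2 for x1, x2 in X_rows]

alpha_hat, beta1_hat, beta2_hat = ols_multiple(X_rows, y)
print(alpha_hat, beta1_hat, beta2_hat)
```

Because the example data contain no error term, the estimates recover the generating coefficients up to rounding; with real data each β̂j is an estimate conditional on the other included variables.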
Statistical Aspects and Evaluation
• Standard error of the estimate: largely the same as for simple regression, just with bigger ‘k’
• Confidence intervals can be calculated for each individual coefficient, as we did before for just the one coefficient.
• Can test βj=0 using a t-test for each individual coefficient (j=1,2,..,k), just as before
Multiple OLS statistics – cont’d
• R² is still a measure of fit, with the same interpretation (although now it is no longer simply the square of the correlation between Y and ‘X’).
• Can still test R²=0 using an F-test, but with bigger ‘k’.
• If you find R²>0, then you conclude that the explanatory variables together provide explanatory power (note: this does not necessarily mean that each individual explanatory variable [through t-stats] is significant).
Interpreting OLS Estimates in the Multiple Regression Model
Mathematical Intuition
Total vs. partial derivative
Simple regression: dY/dX = β
Multiple regression: ∂Y/∂Xj = βj
Interpretation of Multiple OLS Estimates, cont’d
• Verbal intuition
• βj is the marginal effect of Xj on Y, ceteris paribus
• βj is the effect on the dependent variable of a small change in the jth explanatory variable, holding all the other explanatory variables constant.
Example: Explaining Birth Weight
• Let’s take some time going over the following example:
• Data on N = 1388 individuals
• Dependent variable:
– Y = birth weight of child, in pounds
• Explanatory variables:
– X1 = number of cigarettes smoked per day by pregnant mum
– X2 = Family income, 1988$USD
• NOTE k=2!
Example: Excel Output
Explaining Birth Weight
• Fitted Regression Line: Ŷ = 7.31 − 0.029X1 + 0.000006X2
• Evaluate the following:
– Significance of coefficient estimates
– R² and its significance
– Do the results accord with common sense and/or formal theory in the areas of economics, general human behaviour, and health?
Interpretation: Birth Weight Results
• Since β̂1 = −0.029, then at the average number of cigarettes smoked:
– Mum having one extra cigarette per day is expected to reduce the baby’s birth weight by 0.029 pounds, ceteris paribus (i.e., holding income constant)
– If we compare individuals with the exact same income, mums who smoke 10 cigarettes per day are expected to have babies that weigh 0.29 pounds less than those of mums who do not smoke.
Interpretation: Birth Weight Results
• Since β̂2 = 0.000006, then at the average family income:
– The family’s having one extra dollar of annual income is expected to increase the baby’s birth weight by 0.000006 pounds, ceteris paribus (i.e., holding mum’s smoking behaviour constant)
– If we compare mums with the exact same number of cigarettes smoked per day, mums who have $10,000 more in family income are expected to have babies that weigh 0.06 pounds more (a seemingly “small” effect in the output, but economically meaningful and statistically significant!).
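The ceteris paribus comparisons above follow directly from the fitted line Ŷ = 7.31 − 0.029X1 + 0.000006X2. The sketch below reproduces them; the baseline income of $30,000 is an arbitrary value chosen for the illustration, and `predicted_birth_weight` is a hypothetical helper name.

```python
def predicted_birth_weight(cigs, income):
    """Fitted birth weight (pounds) from the slides' estimated equation."""
    return 7.31 - 0.029 * cigs + 0.000006 * income

base_income = 30000    # arbitrary baseline, not from the slides

non_smoker = predicted_birth_weight(0, base_income)
smoker_10 = predicted_birth_weight(10, base_income)             # same income, 10 cigarettes/day
richer = predicted_birth_weight(0, base_income + 10000)         # same smoking, +$10,000 income

print(smoker_10 - non_smoker)   # effect of 10 cigarettes per day, holding income fixed
print(richer - non_smoker)      # effect of $10,000 extra income, holding smoking fixed
```

Because the model is linear, these differences do not depend on the baseline chosen: they are 10 × β̂1 and 10,000 × β̂2 respectively.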
Example: Data transformation
• Is it reasonable to expect that birth weight will increase at a constant rate with income? Economic intuition would tell us that this is probably not the case.
• Can check X-Y plot of birth weight and income to confirm.
• Might make more sense to use ln(income)?
Excel Output
Interpretation of Results
• No massive changes in the explanatory power of the model, and no changes to significance
• New fitted model is: Ŷ = 7.31 − 0.029X1 + 0.116 X2, where X2 is now ln(income)
• The estimated coefficient on cigs (number of cigarettes) has not changed all that much, nor has the intercept.
Interpreting β̂2
• How do we interpret the new coefficient on our transformed variable?
• Recall, our dependent variable is in pounds, and our independent variable is ln(income); β̂2 = 0.116
• Interpretation: A 1% increase in income is expected to increase birth weight by about (0.116/100) = .00116 pounds, ceteris paribus (i.e., holding smoking behaviour constant).
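The "divide by 100" rule for a logged explanatory variable is an approximation, and a short sketch shows how close it is. The exact change in the fitted value when income rises by 1% is β̂2 × ln(1.01); the variable names below are illustrative.

```python
from math import log

beta2_hat = 0.116    # coefficient on ln(income) from the slides

approx_change = beta2_hat / 100         # rule of thumb for a 1% increase in income
exact_change = beta2_hat * log(1.01)    # change in beta2_hat * ln(income) when income rises 1%

print(approx_change, exact_change)
```

The two numbers agree to roughly five decimal places, which is why the approximation is normally used for small percentage changes; for large changes (say 50%) the exact form β̂2 × ln(1.5) is the safer calculation.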
Some pitfalls…?
• Consider the following output for a simple version of our birth weight model:
Answer
• These estimators come from two different regressions which control for different explanatory variables.
• This means that the two estimates come with different ‘ceteris paribus’ conditions.
– Specifically: In the simple model, we are not holding anything else constant.
Answer (cont’d)
• Simple Regression: 1% increase in income is expected to increase birth weight by 0.00147lbs
• One solution to increasing birth weight, according to this model, would be to give people more money.
• BUT: other factors will influence birth weight, such as smoking, diet, etc.
• People with higher incomes tend not to be smokers and therefore are “healthier”
• The simple model only shows that “richer” people tend to have higher birth weight babies – but gives no indication that this effect may be partly due to their increased health
Statistical Evidence
• The negative correlation between income and cigarette consumption suggests that people with lower incomes tend to consume more cigarettes, and vice versa. We are not examining the smoking and income effects separately when we only consider one variable in our model.
Multiple regression may provide a better approximation to reality
• If we evaluate the overall performance and fit of the model, we see that the multiple regression model has a lower standard error, higher F statistics, and a higher Adjusted R².
– Adjusted R squared accounts for the fact that more variables are included (greater k)
• Models which contain all or most of the drivers of Y will tend to look better “on paper,” as well as in terms of their theoretical coherence, than simple regression.
• This is not always the case, it depends on the marginal effect of each additional variable (the trade-off between an increase in R² and the increase in k)
Pitfall #1: Omitted Variables Bias
The technical term for what we just described is “OMITTED VARIABLES BIAS”
IF
• We exclude explanatory variable(s) that should be present in the model
AND
• These variable(s) are correlated with an included explanatory variable
THEN
• The OLS estimate of the coefficient on the included explanatory variable will be “biased” – that is, it won’t reflect the “pure” impact of that variable on Y
OVB cont’d
• In our simple regression, we only considered ln(income)
• Many important determinants of birth weight were omitted, such as smoking.
• This omitted variable was correlated with income, and therefore the estimate from our simple regression was biased.
Omitted variable bias
• True model is: y = α + β1X1 + β2X2 + e
• But we omit X2 and estimate: y = α + β1X1 + v, where v = β2X2 + e
• If X2 and X1 are correlated, OLS is biased
Omitted variable bias
            corr(X1,X2) > 0    corr(X1,X2) < 0
β2 > 0      Bias (+)           Bias (–)
β2 < 0      Bias (–)           Bias (+)
The true model is: wage = α + β1educ + β2ability + e
But we omit ability: wage = β0 + β1educ + v, where v = (β2ability + e)
Since ability and education are most likely correlated (+ve), our coefficient on education is likely biased upwards.
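The sign pattern in the bias table can be illustrated with a small simulation. Everything in this sketch is invented for the illustration: the true coefficient β1 = 1, the sample size, the way X1 is made to depend on X2 through the `link` parameter, and the function names. It generates data from the "true" two-variable model and then runs the "short" regression of y on X1 alone.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

def short_regression_slope(beta2, link, n=5000, seed=0):
    """Simulate y = 1.0*X1 + beta2*X2 + e, then regress y on X1 only.

    link > 0 makes X1 and X2 positively correlated; link < 0, negatively.
    """
    rng = random.Random(seed)
    beta1 = 1.0                                    # true effect of X1
    x2 = [rng.gauss(0, 1) for _ in range(n)]       # the variable we will omit
    x1 = [link * v + rng.gauss(0, 1) for v in x2]
    y = [beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return ols_slope(x1, y)

# True beta1 is 1.0 in every case; compare the short-regression estimates.
for beta2, link in [(2.0, 0.5), (2.0, -0.5), (-2.0, 0.5), (-2.0, -0.5)]:
    print(beta2, link, round(short_regression_slope(beta2, link), 3))
```

When β2 and the correlation have the same sign the estimate lands above the true value of 1 (upward bias), and when they have opposite signs it lands below, matching the table. The wage example corresponds to the first case: β2 > 0 and a positive correlation between education and ability.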
Pitfall #2: Irrelevant variables
• If we include any irrelevant variables as independent variables in our model, it will not cause bias if the true coefficient of the extra variable is zero (irrelevant).
• However, it will increase the variance of the estimated coefficients, which will tend to decrease the magnitude of their t-scores.
• Irrelevant variables also usually decrease the adjusted R squared.
• Hence irrelevant variables reduce the precision of regressions.
Pitfall #3: Multi-collinearity
• Explanatory variables may be HIGHLY correlated → the model has trouble differentiating between their effects on Y.
• High R², large F-stat, but insignificant t stats for coefficient estimates (or wrong sign).
• ie. Model overall fits well, but cannot pin down marginal effects of individual variables
• Diagnosing: Look at your correlation matrix for high levels of correlation between your explanatory variables. This will reveal the source and extent of the multicollinearity problem.
• NOTE: High correlation between your dependent variable and independent variable(s) is usually OK!
Remedies for multicollinearity
1. Do nothing
– If looking for overall prediction and not individual effects
– If theory suggests variables should be included
2. Transform variables
– If theoretically justified. Multicollinearity is a problem when there is a linear relationship between explanatory variables
3. Drop or combine explanatory variables
How do you select explanatory variables?
• Include (insofar as possible) all explanatory variables that you think might explain your dependent variable. This will reduce OVB.
• However, including irrelevant variables or ones that are highly multicollinear is also not advisable.
• Ideally, turn to theory, intuition, logic, and/or common sense for suggestions on what is appropriate to include.
Example: Forecasting Demand
• Back to first year microeconomics.
• Theory argues that market demand for a product is a function of the following:
– Price
– Tastes and preferences
– Disposable income
– Number of consumers in the market
– Prices of related goods
– Expectations
• This theory gives you an indication of what should be included in a regression model of market demand. However, you will also need to consider the availability of data (which is almost always the biggest constraint on model development).