Advanced Quantitative Techniques
Lab 6
October 13th 2016
• Agenda today
– DSSC presentation: available data & how they can
help you. Use their resources now
– Final project ideas activity
– Quick recap / midterm review
– Intro to regression in STATA (to be continued next week)
Final project peer brainstorming
– Divide into groups of 4
– Turn to person next to you & describe your ideas
thus far in 2 minutes. (5 mins)
– Regroup and describe your partner’s idea to the 4
person group (15 mins)
– 1 person from each group report back on sticky
points – Ideas? Data? Tools?
Midterm review, cont.
A. Big concepts: get them right
B. Problems: plug in carefully
C. Explaining findings: don’t forget. Use precise
language.
Logistics:
• Blue book, hand-written.
• worth 20% of semester grade
• Done in-class
Midterm review:
A: Big concepts: get them right
e.g.
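• Error types: Type I (rejecting a true null hypothesis), Type II (failing to reject a false null hypothesis)
• Confidence: 95% confident that this interval contains the
true value of the parameter (e.g. the population mean). If we repeated the experiment 100 times, in
about 95 of them the interval we build would contain the true value.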
• hypotheses
• Power
• Dependent & independent variables
• If something doesn’t make sense, look up other explanations.
(I like graphics and cartoons.)
B: Manipulating equations
Look @ homework problems + in-class examples
• Standard normal distribution, CI, t-test, z statistic
– Knowing mean + SD, what % of observations fall
below X value?
• Calculate the input that you’re missing (the sample value, the
population mean, or the SD). Plug into z = (sample value – population mean)/SD. Look up the value
on the z table. Remember to subtract from 1 if you need the area above z (the upper tail). Use normal
to approximate binomial if needed.
– Calculate a mean, build a CI around it: mean ± t_crit × SE. Usually you’ll have to calculate SE from SD (SE = SD/√n).
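The "what % of observations fall below X" recipe above can be sketched in Python using only the standard library. The mean, SD, and X value here are made-up numbers for illustration, not figures from the lab.

```python
from statistics import NormalDist

# Hypothetical numbers for illustration only (not from the lab)
mean, sd, x = 100.0, 15.0, 120.0

z = (x - mean) / sd           # standardize: (value - mean) / SD
below = NormalDist().cdf(z)   # area to the left of z, which is what a z table reports
above = 1 - below             # upper tail: subtract from 1

print(f"z = {z:.2f}, share below = {below:.3f}, share above = {above:.3f}")
```

`NormalDist().cdf` plays the role of the z table lookup, so this is a way to check a hand calculation rather than a replacement for knowing the steps.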
C: Explaining findings: don’t forget
• Make sure to write a concluding sentence.
Hint: look back at the question. What puzzle
are you trying to unravel?
Make a formula sheet in your own words, e.g.
standard error = SD / square root of sample size
[sample] [pop estimate] [sample]
t statistic = (sample mean – pop mean) / standard error
Confidence interval: mean plus or minus the z
(or t) stat multiplied by standard error:
$\bar{x} \pm z^{*} \dfrac{s}{\sqrt{n}}$
* Quick ref for the important Z scores (reference figure not reproduced in this transcript)
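As a rough stand-in for the quick-reference z scores, the sketch below computes the two-sided critical z values for common confidence levels and builds a confidence interval from them. The function names and the sample figures (mean 50, SD 10, n = 100) are assumptions made for illustration. The standard library has no t distribution, so this uses z, which is only appropriate for large samples.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical z value for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Mean plus or minus z* times the standard error (SD / sqrt(n))."""
    se = sd / sqrt(n)
    margin = z_critical(confidence) * se
    return sample_mean - margin, sample_mean + margin

# Critical values for the usual confidence levels
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z* = {z_critical(level):.3f}")

# Hypothetical sample: mean 50, SD 10, n = 100
low, high = confidence_interval(50.0, 10.0, 100)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

For small samples, swap the z critical value for the t critical value with n - 1 degrees of freedom from a t table.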
Coyotes & poison
Hypothesis (H1): <28% of coyotes will survive the winter.
Null hypothesis (H0): ≥28% of coyotes will survive the
winter.
We want to see where last year’s actual survival (51/214
= .238, about 24%) falls relative to the overall survival percentage
(~population p/mean).
s or σ = √(p*(1-p)) = √(.28*(1-.28)) = √.2016 = 0.45
s.e. = s/√n = .45/√214 = .031
t = (sample p – population p)/s.e.
= (.238 - .28)/.031 = -1.35
p = .0885 (one-tailed) … so yes to 90% but no to 95%
significance
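The coyote calculation can be reproduced in Python as a check on the hand arithmetic. This is a minimal sketch using the numbers from the slide (p = .28, 51 survivors out of 214) and the normal distribution for the one-tailed p-value. The variable names are my own. Because nothing is rounded along the way, the statistic and p-value differ slightly from the hand-rounded figures.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.28               # survival rate under the null hypothesis
survived, n = 51, 214   # last year's observed survival

p_hat = survived / n                 # sample proportion, about 0.238
sd = sqrt(p0 * (1 - p0))             # SD of a proportion under the null
se = sd / sqrt(n)                    # standard error
z = (p_hat - p0) / se                # test statistic
p_value = NormalDist().cdf(z)        # lower tail, because H1 says "less than"

reject_at_10 = p_value < 0.10        # significant at the 90% level?
reject_at_05 = p_value < 0.05        # significant at the 95% level?

print(f"z = {z:.2f}, one-tailed p = {p_value:.4f}")
print(f"reject at 10%: {reject_at_10}, reject at 5%: {reject_at_05}")
```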
Intro to regression in STATA - tbc
• Open the 311 data
Command: Scatterplot
Relation between 311 calls & vacancy rate?
calls_per_thousand & vacant
• generate vacant_rate = vacant / HSE_UNIT * 100
• twoway (scatter calls_per_thousand vacant_rate) (lfit calls_per_thousand vacant_rate)
Command: correlate (corr)
• corr calls_per_thousand vacant_rate
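To see what `corr` reports for two variables, here is a minimal Python sketch of the Pearson correlation coefficient. The five data points are invented for illustration and are not the 311 data. The list names simply mirror the Stata variable names.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical values, not the lab's data
vacant_rate = [2.0, 4.0, 5.0, 7.0, 9.0]
calls_per_thousand = [30.0, 42.0, 41.0, 55.0, 63.0]

r = pearson_r(vacant_rate, calls_per_thousand)
print(f"r = {r:.3f}")
```

A value near +1 or -1 means the points sit close to a straight line, and a value near 0 means little linear relationship.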
Linear Regression
• Describes a relationship between an explained
variable (y) and an explanatory variable (x). You
“regress y on x.”
• Attempts to explain this relationship with a
straight line fit.
• Simple linear regression has one input (x) and
one output (y).
• The ideal formula to approximate the regression:
$Y = \beta_0 + \beta_1 X + \varepsilon_i$
where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term.
What are ‘residuals’ (error terms)?
$Y = \beta_0 + \beta_1 X + \varepsilon_i$
• Residuals (or error terms) are the difference
between an observed value of the response
variable and the value predicted by the regression
line.
• Residual = observed y – predicted y
• Residuals represent the ‘leftover’ or ‘unexplained’
variation in the response variable after fitting the
regression line.
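The definition "residual = observed y – predicted y" can be illustrated with a small Python sketch that fits a least-squares line and then computes the residuals. The data points and the `fit_line` helper are hypothetical, written only for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"x={xi:.0f} observed={yi:.2f} predicted={pi:.2f} residual={ri:+.2f}")
```

With an intercept in the model, least-squares residuals always sum to zero (up to rounding), which is a handy sanity check.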
Command: regress (reg)
• reg calls_per_thousand vacant_rate
Interpreting the Output
1. Slope:
• The coefficient of the independent variable (β1)
is the slope of the regression line.
• Slope is the amount of change in the dependent
variable for every one-unit increase in the independent
variable.
2. Y-Intercept:
• The constant (β0): the predicted value of the dependent
variable when the independent variable equals zero.
Interpreting the Output
3. The p-value and CI of the Coefficients:
• The p-value corresponds to the coefficient of the
independent variable: it tests the null hypothesis that the coefficient is zero.
• If the p-value is less than alpha, you can conclude
there is a statistically significant relationship
between the independent variable and the
dependent variable.
• Or, you can examine whether zero is in the
confidence interval of the coefficient. If
zero is in the 95% interval, then the coefficient is not
statistically different from zero at 95% confidence.
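The "is zero inside the confidence interval?" check can be sketched in Python. The ten data points below are hypothetical, and the critical t value (2.306 for 8 degrees of freedom, two-sided 95%) is taken from a t table because the standard library has no t distribution. Stata reports the same quantities in the coefficient table.

```python
from math import sqrt

# Hypothetical data, n = 10 (not the lab's 311 data)
x = [float(i) for i in range(1, 11)]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 19.2, 20.9]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
b0 = mean_y - b1 * mean_x

# Standard error of the slope: sqrt(MS Residual / Sxx), with n - 2 df
ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = sqrt(ss_resid / (n - 2) / sxx)

T_CRIT = 2.306  # two-sided 95% critical t for 8 df, from a t table
t_stat = b1 / se_b1
low, high = b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1
significant = not (low <= 0 <= high)  # zero outside the CI means significant

print(f"slope = {b1:.3f}, t = {t_stat:.1f}, 95% CI = ({low:.3f}, {high:.3f})")
print(f"statistically different from zero at 95%: {significant}")
```

The two checks agree by construction: zero falls outside the 95% interval exactly when the absolute t statistic exceeds the critical value.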
How to Read Stata Output?
• SS – Sum of Squares associated with three sources of variance: Model,
Residual, and Total
• MS – Mean Squares, the SS divided by the respective degrees of
freedom. MS represents the model, error (residual), and total sample variance
respectively
• F-statistic – this is the MS Model divided by the MS Residual; the numbers
in brackets are the respective df
• Prob>F – this is the p-value associated with the F-statistic. It tests the
hypothesis that all the model coefficients (other than the constant) are 0
• R-squared – the proportion of variance in y explained by the independent
variables.
• Adjusted R-squared – R-squared with a penalty for adding extraneous variables to the
model. It is never greater than R-squared and increases only if
the addition of one more explanatory variable improves the model more
than what would be expected by chance
• Root MSE – the square root of the MS Residual. This is the standard deviation of
the residuals
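To connect the pieces of the output header, the sketch below recomputes SS, MS, F, R-squared, adjusted R-squared, and Root MSE by hand for a one-predictor regression. The six data points are made up for illustration, and the variable names are my own labels for the quantities Stata prints.

```python
from math import sqrt

# Hypothetical data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.8, 6.1, 8.3, 9.7, 12.4]
n, k = len(x), 1  # k = number of explanatory variables

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares: Total = Model + Residual
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_model = sum((fi - mean_y) ** 2 for fi in fitted)
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Mean squares: SS divided by the respective degrees of freedom
ms_model = ss_model / k
ms_resid = ss_resid / (n - k - 1)

f_stat = ms_model / ms_resid
r2 = ss_model / ss_total
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
root_mse = sqrt(ms_resid)

print(f"SS model={ss_model:.3f} resid={ss_resid:.3f} total={ss_total:.3f}")
print(f"F({k}, {n - k - 1}) = {f_stat:.1f}")
print(f"R2={r2:.4f} adj R2={adj_r2:.4f} Root MSE={root_mse:.4f}")
```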