Download Review of Part I

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Choice modelling wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Forecasting wikipedia , lookup

Regression analysis wikipedia , lookup

Data assimilation wikipedia , lookup

Regression toward the mean wikipedia , lookup

Robust statistics wikipedia , lookup

Linear regression wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
Review of Midterm
math2200
Data
• W’s
• Subjects
• Variables
– Categorical versus quantitative
One categorical variable
• Graphs:
– Bar chart
– Pie chart
• Numerical summary:
– Frequency table
– Relative frequency table
Two categorical variables
• Conditional and marginal distribution
• Graphs
– Segmented bar charts
– Side-by-side bar charts
– Side-by-side pie charts
• Numerical summary
– Contingency table
– table percentage, row percentage, column
percentage
Problems 28 (page 148)
• Birth order related to major?
–
–
–
–
What percent of these students are oldest or only children? (113/223)
What percent of Humanities majors are oldest children? (15/43)
What percent of oldest children are Humanities students? (15/113)
What percent of the students are oldest children majoring in the
Humanities? (15/223)
Major
Birth Order
1
2
3
4+
Total
Math/Science
34
14
6
3
57
Agriculture
52
27
5
9
93
Humanities
15
17
8
3
43
Other
12
11
1
6
30
total
113
69
20
21
223
Problems 30 (page 148)
•
•
What is the marginal distribution of majors?
Math/Science
Agriculture
Humanities
57 (25.6%)
93 (41.7%)
43 (19.3%) 30 (13.5%)
Total
Other
223
What is the conditional distribution of majors for the oldest children?
Math/Science
Agriculture
Humanities
34 (30.1%)
52 (46.0%)
15 (13.3%) 12 (10.6%)
Major
Total
Other
113
Birth Order
1
2
3
4+
Total
Math/Science
34
14
6
3
57
Agriculture
52
27
5
9
93
Humanities
15
17
8
3
43
Other
12
11
1
6
30
total
113
69
20
21
223
Simpson’s Paradox
• Problem 3.38: Two delivery services
Delivery
Service
Pack Rats
Boxes R
Us
Type of
Service
Number
of
deliveries
Number of
late
packages
Regular
400
12 (3%)
Overnight
100
16(16%)
Regular
100
2(2%)
Overnight
400
28 (7%)
Overall
percentage of
late deliveries
5.60%
6%
One quantitative variable
• Graphs
– Histogram
– Boxplot
• Qualitative summary
– # of modes
– Symmetric? Transformation?
– Outliers?
• Numerical summary
– Five-number summary
– Center: mean versus median
– Spread: sd versus IQR
Problem 32: Pay
• The 1999 National Occupational Employment
and Wage Estimates for management
Occupations
– For chief executives
• Mean = $48.67/hour
• Median = $52.08/hour
– For General and Operations Managers
• Mean = $31.69/hour
• Median = $27.23/hour
– Are these wage distributions likely to be symmetric,
skewed to the left or skewed to the right?
Shifting and rescaling
Location
shift
rescale
min
x
x
Q1
x
x
median
x
x
Q3
x
x
max
x
x
mean
x
x
spread
variance
x
Standard deviation
x
IQR
x
range
x
Problem 4.42: Job Growth
• 20 cities’ job growth rates predicted by Standard &
Poor’s DRI in 1996
• Are the mean and median very
•
Frequency
3
4
5
Histogram of Predicted Job Growth Rates
0
1
2
•
1
2
3
Growth rate (%)
4
5
•
•
different?
Which one is more
appropriate?
– Mean (2.37%) or median
(2.235%)?
– SD (0.425%) or IQR
(0.515%)?
If we subtract from these
growth rates the predicted U.S.
average growth rate of 1.20%,
how would this change the
above summary statistics?
If we omit Las Vegas (growth
rate=3.72%) from the data, how
would you expect the above
summary statistics to change?
How to summarize the
distribution of the data?
One quantitative variable and one
categorical variable
• Comparing groups
– with histogram, boxplot, stem-and-leaf plot
– Transformation when spread is too different
across groups
Normal model
• Z-score and standard normal
• Nearly normal condition
– Normal probability plot
• Four types of problems
– Given parameters and data values (or z-score), ask
for probabilities
– Given parameters and probabilities, ask for data
values (or z-score)
– Given probabilities and data values (or z-score), ask
for parameters
– Given probabilities, data values (or z-score) and one
parameter, ask for the other parameter
Problem 22: Winter Olympic 2002
speed skating
• Top 25 men’s and 25 women’s 500-m speed
skating times
– Mean = 73.46
– Sd = 3.33
• If the Normal model is appropriate, what percent
of the times should be within 1.67 seconds of
73.46?
– Solution 1: 1.67=0.5*sd, Normcdf(-0.5,0.5,0,1)
– Solution 2: Normcdf(72.19, 75.13, 73.46, 3.33)
• In the data, only 6% are within that range. Why
are the percentages so different?
Problem 39: assembly time
• Only 25% of the company’s customers succeeded in building the
desk under an hour
• 5% said it took them over 2 hours
• Assume that consumer assembly time follows a Normal model
• Mean = ? , SD= ?
– Z-score corresponding to 25%:
• (1- mean)/ SD = invNorm(0.25,0,1) = 0.6744897495
– Z-score corresponding to 95%:
• (2- mean)/ SD = invNorm(0.95,0,1) = 1.644853626
• Solve the two equations, we have
• mean = 1.29
• SD = 0.43
Problem 39: assembly time (cont.)
• Mean =1.29, sd=0.43
• What assembly time should the company
quote in order that 60% of customers
succeed in finishing the desk by then?
– invNorm(0.6,1.29,0.43)
Problem 39: assembly time (cont.)
• Mean =1.29, sd=0.43
• The company wishes to improve the onehour success rate to 60%. If the sd stays
the same, what new lower mean time does
the company need to achieve?
– Z-score = invnorm (0.6,0,1)
– Z-score = (1-mean)/sd
– Mean = 0.89
Correlation
•
•
•
•
•
•
•
•
Sign of r means?
The range of r?
X and Y are called uncorrelated if and only if r=0
r(x,y)=r(y,x)
No units
Effected by shifting or rescaling X, Y or both?
Uncorrelated does NOT imply no association
Sensitive to outliers (remove a point close to the
line fitted through the scatterplot increase or
decrease r?)
Correlation: Review II 13, Page 264
• What factor most explains differences in Fuel Efficiency
among cars? Here’s a correlation matrix exploring that
relationship for the car’s Weight, Horsepower, engine size
(Displacement), and number of Cylinders.
MPG
Weight
HorsePower
Displace
ment
MPG
1.000
Weight
-0.903
1.000
Horse-Power
-0.871
0.917
1.000
Displacement -0.786
0.951
0.872
1.000
Cylinders
0.917
0.864
0.940
-0.806
Cylinders
1.000
a) Which factor seems most strongly associated with Fuel Efficiency ?
b) What does the negative correlation indicate?
c) Explain the meaning of R^2 for that relationship.
Matching r and scatterplots
• Here are several scatterplots. The calculated correlations are 0.85, 0.87, 0.04 and 0.53. which is which?
Linear regression (least squares)
• How to calculate the slope?
• Given the slope, and standard deviations, how to
calculate the correlation?
• The line always goes through
• Residual =
– Overestimation
– Underestimation
• Causal relationship ?
• How to interpret
?
Diagnostics of a Linear Model
1. Visual inspection: scatter plot satisfies the “Straight
Enough Condition”? Looks okay,
2. Regression: calculate the regression equation, r and
R^2. (R^2=r*r gives the percentage of variation of the
data explained by the model). R^2 is tiny, say<0.2, a
linear model may not be a good choice.
3. Residuals: check the residual plot even when R^2 is
large. Bad sign if we see some pattern. The spread of
the residuals are supposed to about the same across
the X-axis if the linear model is appropriate. (you can
either put predicted value or x-variable on x-axis).
4. Re-expression: consider re-expressing the data. If a
linear model is not appropriate for the data, And
remember to repeat the diagnostics every time after
fitting a new linear model on the transformed data.
Randomness Simulation
Simulation Component ?
Response variable?
Trial?
Example: 11.20
Suppose the chance of passing the driver’s
test is 34% the first time and 72% for the
subsequent retests. Estimate the
percentage of those tested who still do not
have a driver’s license after two attemps.
Check list
• Graphs and plots: bar chart, pie chart, histogram,
boxplot (mod boxplot on ti-83), normal
probability plot, scatterplot, residual plot
How to make ? How to interpret ?
• Statistics : mean, medium, min, max, range,
quartiles, standard deviation, IQR, correlation
coefficient
How to calculate ? How to interpret?
• Model: normal distribution, linear regression.
How to get the parameters ?