AMS 7L is required for AMS 7. About 10 of you
haven’t registered for AMS 7L. You should do it ASAP.
Lab starts this week!!!
Random Sample vs. Simple Random Sample
Suppose 36 people are in a room, seated in 6 rows with 6 people in each row.
They are assigned “names” as follows:
A1 A2 A3 A4 A5 A6
B1 B2 B3 B4 B5 B6
C1 C2 C3 C4 C5 C6
D1 D2 D3 D4 D5 D6
E1 E2 E3 E4 E5 E6
F1 F2 F3 F4 F5 F6
An example (not the only way) of random sampling with sample size 6 is
to randomly pick one row:
A1 A2 A3 A4 A5 A6
B1 B2 B3 B4 B5 B6
C1 C2 C3 C4 C5 C6
D1 D2 D3 D4 D5 D6
E1 E2 E3 E4 E5 E6
F1 F2 F3 F4 F5 F6
Simple random sampling should give equal probability to all possible
samples of size 6. For example, the sample:
A1 A2 A3 A4 A5 A6
B1 B2 B3 B4 B5 B6
C1 C2 C3 C4 C5 C6
D1 D2 D3 D4 D5 D6
E1 E2 E3 E4 E5 E6
F1 F2 F3 F4 F5 F6
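The sampling method described on the previous page CANNOT give us this
kind of sample: picking a whole row can only ever produce 6 different
samples, so any sample that mixes people from different rows has
probability zero. That means it is not simple random sampling.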
Ways to do simple random sampling for this example:
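One possible way, sketched in Python: put all 36 names in a hat and draw 6 without replacement. The seat labels come from the example above; the function names and the seed are made up for illustration. The row-picking method is included only for contrast.

```python
import random

# Build the 36 seat labels A1..F6 from the example.
rows = "ABCDEF"
people = [f"{r}{c}" for r in rows for c in range(1, 7)]

def row_sample(rng):
    """Random sample by picking one whole row (NOT a simple random sample)."""
    r = rng.choice(rows)
    return [f"{r}{c}" for c in range(1, 7)]

def simple_random_sample(rng, k=6):
    """Draw k people without replacement; every subset of size k is equally likely."""
    return rng.sample(people, k)

rng = random.Random(7)  # arbitrary seed, only so the run is repeatable
print("one row:      ", row_sample(rng))
print("simple random:", simple_random_sample(rng))
```

The row method can only return 6 different samples, while the simple random sample can return any 6 of the 36 people.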
Describing, Exploring and Comparing Data
Data: 70, 40, 70, 150, 155, 70, 45, 65, 105, 50, 175, 40 (Storm wind
speed 2005)
Stem and leaf plot
Distribution, max/min, clustering
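A stem and leaf plot for the 2005 wind speeds can be sketched as follows. This assumes the stem is everything except the last digit and the leaf is the last digit; the function name is made up for illustration.

```python
from collections import defaultdict

speeds = [70, 40, 70, 150, 155, 70, 45, 65, 105, 50, 175, 40]

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

plot = stem_and_leaf(speeds)
# Print every stem from smallest to largest, including empty ones,
# so that gaps and clusters in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Printing the empty stems is a deliberate choice: it shows the cluster at 40 to 70 and the gap before the strongest storms.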
Histogram:
Frequency Distribution:
Speed Frequency
Relative frequency
Speed Relative frequency (%)
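The frequency tables did not survive in these notes. As a rough sketch, the two tables can be rebuilt from the 2005 data; the class width of 40 and the starting point of 40 are assumptions chosen for illustration, not the classes used in lecture.

```python
speeds = [70, 40, 70, 150, 155, 70, 45, 65, 105, 50, 175, 40]

# Assumed classes: 40-79, 80-119, 120-159, 160-199 (not from the notes).
start, width = 40, 40

counts = {}
for v in speeds:
    lower = start + ((v - start) // width) * width  # lower limit of v's class
    counts[lower] = counts.get(lower, 0) + 1

print("Speed     Frequency  Relative frequency (%)")
for lower in sorted(counts):
    rel = 100 * counts[lower] / len(speeds)
    print(f"{lower:>3}-{lower + width - 1:<3}  {counts[lower]:>9}  {rel:>22.1f}")
```

Relative frequency is just frequency divided by the number of observations, so the relative frequencies always add up to 100%.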
Pareto Chart (Bar)
Sorted bar chart (histogram for categorical data)
Pie Chart: shows relative frequencies; a Pareto chart is usually better
Measures of Center
What is a typical value of a dataset?
Aug 2005 storms: 65, 105, 50, 175, 40
Mean
Sample: $\bar{x}$
Population: µ
The mean is often used b/c it has good theoretical + practical behavior
$\bar{x}$ is usually the best possible estimator of µ.
The mean can be unreliable if there are outliers (extremely unusual
observations)
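To see how one extreme value moves the mean, here is a small sketch using the Aug 2005 storms listed above. Treating the 175 storm as the extreme value is an assumption made for illustration.

```python
aug_2005 = [65, 105, 50, 175, 40]

def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Drop the largest storm to see how much it pulls the mean up.
without_max = [v for v in aug_2005 if v != max(aug_2005)]

mean_all = mean(aug_2005)
mean_without = mean(without_max)
print("mean of all 5 storms:     ", mean_all)
print("mean without the 175 one: ", mean_without)
```

One storm out of five raises the mean from 65 to 87.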
E.g. At Duke University, the mean starting salary of the 10 math majors
who graduated in 1999 was over $100,000.
Why?
Trajan Langdon was a math major and one of the top picks in the NBA
draft. His starting salary was over $1 million. The other 9 math majors
had salaries less than $100,000.
So the mean was larger than all but one of the data points, b/c of the
outlier.
When there are extreme values, the median is a more stable measure of
the center.
Storm example: Aug 2004: 40, 45, 65, 70, 105, 120, 135, 145
Median:
When data are symmetric and w/o outliers, the mean and median will
be very similar.
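A quick check with the two storm datasets from these notes, using Python's standard `statistics` module. The variable names are made up for illustration.

```python
from statistics import mean, median

aug_2004 = [40, 45, 65, 70, 105, 120, 135, 145]
aug_2005 = [40, 50, 65, 105, 175]

# With an even number of values the median is the average of the two
# middle values; with an odd number it is the middle value itself.
for label, data in (("Aug 2004", aug_2004), ("Aug 2005", aug_2005)):
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

For Aug 2004 the mean (90.625) and median (87.5) are close. For Aug 2005 the mean (87) sits well above the median (65) because of the 175 storm.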
Mode:
The most common value
Storm 2005
If only one value occurs most often: Unimodal
If 2 values are tied for most often: Bimodal
If more than 2: Multimodal
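For the Storm 2005 data, the mode can be found with `statistics.multimode` (Python 3.8+), which returns every value tied for most frequent. The labelling by number of modes below is a sketch of the definitions above.

```python
from statistics import multimode

speeds = [70, 40, 70, 150, 155, 70, 45, 65, 105, 50, 175, 40]

modes = multimode(speeds)  # all values tied for the highest count
# Caveat: if every value occurs exactly once, multimode returns all of them,
# and many textbooks would then say there is no mode at all.
label = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
print("mode(s):", modes, "->", label)
```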
Skewness:
If the distribution of the data is the same on both sides of the mean, it is
symmetric; otherwise it is skewed.
Example: income (typically skewed to the right)
Measures of Variation
Range = Maximum – Minimum
= Length of an interval containing all of the data
Example: August 2005 tropical storms: 40, 50, 65, 105, 175
Range= 175-40=135
How far from the mean is a typical data point?
For a population, the standard deviation is defined as
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
and the variance is
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Variance is mathematically convenient, std. dev. is more interpretable.
Std. Dev. is a measure of the distance from the mean to a typical data
point, and is the most common measure of spread.
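Ex: If we are interested in the wind speed of the Aug 2005 storms, what is
our population? (Are we interested only in the 2005 storms?)
For the population, $\mu = \frac{\sum x_i}{N} = \frac{435}{5} = 87$.
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{12130}{5} = 2426$
$\sigma = \sqrt{2426} \approx 49.25$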
In most cases, we only have a sample, not a population.
Then we use $\bar{x}$ to estimate $\mu$, and $s$ to estimate $\sigma$.
The formula for the sample standard deviation is
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
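The only difference between the two formulas is the divisor (N vs. n − 1). A minimal sketch with the Aug 2005 storms, using `pstdev` (population) and `stdev` (sample) from Python's standard `statistics` module:

```python
from statistics import pstdev, stdev

aug_2005 = [40, 50, 65, 105, 175]

sigma = pstdev(aug_2005)  # divides the sum of squared deviations by N
s = stdev(aug_2005)       # divides the sum of squared deviations by n - 1

print(f"population s.d. = {sigma:.2f}")
print(f"sample s.d.     = {s:.2f}")
```

The sample version is always a bit larger, and the gap shrinks as n grows.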
Empirical rule for data with a Bell-shaped distribution:
About 68% of the data will be within 1 s.d. of the mean.
About 95% will be within 2 s.d. of the mean.
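The rule can be checked by simulation. This sketch assumes "bell-shaped" means a normal distribution and draws made-up values from one; the seed, the number of draws, and the function name are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)   # arbitrary seed, only so the run is repeatable
mu, sigma = 0.0, 1.0     # assumed bell-shaped (normal) population
draws = [rng.gauss(mu, sigma) for _ in range(100_000)]

def share_within(values, center, spread, k):
    """Fraction of values lying within k spreads of the center."""
    return sum(abs(v - center) <= k * spread for v in values) / len(values)

within_1 = share_within(draws, mu, sigma, 1)
within_2 = share_within(draws, mu, sigma, 2)
print(f"within 1 s.d.: {within_1:.3f}")
print(f"within 2 s.d.: {within_2:.3f}")
```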
Can the following be considered bell-shaped?
Is it symmetric? Skewed?
Textbook, Page 48, Figure 2-11 gives a good illustration of Mean,
Median, Mode and Skewness.
Is there anything we can say about non-bell-shaped distributions, or about all distributions?
Chebyshev’s Theorem
The proportion of any set of data lying within K s.d. of the mean is
always at least $1 - 1/K^2$, where K > 1.
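As a sanity check, the bound can be compared with the 2005 storm data, which is clearly not bell-shaped. This sketch treats the 12 storms as the whole population (so it uses `pstdev`); that choice and the function names are assumptions for illustration.

```python
from statistics import mean, pstdev

speeds = [70, 40, 70, 150, 155, 70, 45, 65, 105, 50, 175, 40]

def chebyshev_bound(k):
    """Minimum proportion of data within k s.d. of the mean (needs k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem needs K > 1")
    return 1 - 1 / k**2

center, spread = mean(speeds), pstdev(speeds)

def observed_share(k):
    """Actual proportion of the storm speeds within k s.d. of their mean."""
    return sum(abs(v - center) <= k * spread for v in speeds) / len(speeds)

for k in (2, 3):
    print(f"K={k}: at least {chebyshev_bound(k):.3f}, observed {observed_share(k):.3f}")
```

The bound is a guarantee, not an estimate, so the observed share is usually well above it.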
Standardized Scores (z Scores)
How extreme is an observation?
How do we compare observations from different datasets?
We standardize the data by subtracting its mean and dividing by its s.d.
(p. 69)
Sample: $z = \frac{x - \bar{x}}{s}$ vs. Population: $z = \frac{x - \mu}{\sigma}$
Ex: Hurricane Katrina (175-87)/49.25=1.79
Aug 2004’s biggest storm was Karl at 145. The mean & s.d. for Aug 04
were 90.6 and 38.28.
So Karl’s z score is (145-90.6)/38.28=1.42
Conclusion: Even after adjusting for the increased variability in 2005,
Katrina stands out as more extreme.
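The comparison can be written out as a short sketch, using the means and s.d.s quoted in these notes. The function name is made up for illustration.

```python
def z_score(x, center, spread):
    """How many standard deviations x lies from the center."""
    return (x - center) / spread

# Summary values as quoted in the notes.
z_katrina = z_score(175, 87, 49.25)   # Aug 2005
z_karl = z_score(145, 90.6, 38.28)    # Aug 2004

print(f"Katrina: z = {z_katrina:.2f}")
print(f"Karl:    z = {z_karl:.2f}")
```

Neither storm passes the |z| > 2 cutoff, but Katrina is further from its year's mean than Karl is.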
SAT scores (rescaled to 200-800) adjust for the varying difficulty of the
exams.
By the empirical rule, |z| > 2 means the observation is unusual.
Quartiles and Percentiles
The median separates the data into two equally sized groups – half of
the observations are above the median, half are below. Equivalently, 50%
are below, so median is the 50th percentile.
The 99 percentiles divide the data into 100 groups. 1% of the
observations are less than the 1st percentile, 2% less than the 2nd
percentile, and so on.
To find the kth percentile: (p 73)
1.
2.
3. a.
b.
Storm examples:
Aug04 40,45,65,70,105,120,135,145
Aug05 40,50,65,105,175
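The numbered steps did not survive in these notes. One common textbook procedure, assumed here and not confirmed by the notes, is: sort the data, compute the locator L = (k/100)·n, then if L is a whole number average the Lth and (L+1)th values, otherwise round L up and take that value. A sketch for the Aug05 data:

```python
def percentile(data, k):
    """kth percentile (integer k, 1..99) using the locator L = (k/100) * n.

    Assumed rule: if L is whole, average the Lth and (L+1)th sorted values;
    otherwise round L up and take that value.
    """
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    values = sorted(data)
    whole, remainder = divmod(k * len(values), 100)  # integer math avoids float error
    if remainder == 0:
        return (values[whole - 1] + values[whole]) / 2
    return values[whole]  # rounding L up gives position whole + 1, i.e. index whole

aug_2005 = [40, 50, 65, 105, 175]
for k in (25, 50, 75):
    print(f"{k}th percentile: {percentile(aug_2005, k)}")
```

Other books and software use slightly different rules (for example interpolation), so small differences between sources are normal.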
The 25th percentile is called the 1st quartile (Q1), and the 75th percentile is the
3rd quartile (Q3).
The Median = Q2 = 50th percentile.
Boxplot
Min, Q1, median, Q3, Max
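The five numbers behind a boxplot can be sketched for the 12 storm speeds from 2005. The quartile rule below (locator L = (k/100)·n, average two values when L is whole, otherwise round up) is an assumed textbook-style rule; software packages may give slightly different quartiles.

```python
def percentile(data, k):
    """kth percentile (integer k, 1..99) via the locator L = (k/100) * n (assumed rule)."""
    values = sorted(data)
    whole, remainder = divmod(k * len(values), 100)
    if remainder == 0:
        return (values[whole - 1] + values[whole]) / 2
    return values[whole]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a boxplot draws."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

speeds = [70, 40, 70, 150, 155, 70, 45, 65, 105, 50, 175, 40]
summary = five_number_summary(speeds)
for name, value in zip(("Min", "Q1", "Median", "Q3", "Max"), summary):
    print(f"{name:<6} {value}")
```

The box runs from Q1 to Q3 with a line at the median, and the whiskers reach out to the min and max.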