North Carolina State University
STAT 370: Probability and Statistics for
Engineers
[Section 002]
Instructor: Hua Zhou
Harrelson Hall 210
10:15AM-11:30AM, Jan 23, 2012
Announcement
• HW2 (graphical summary) due Friday, Jan 27 @ 6pm
• Any problem using StatCrunch?
Last class
• Graphical summary: bar plot (categorical variable), pie
chart (categorical variable), histograms (quantitative
variable), scatter plot (two quantitative variables, time
series)
• Shape of distribution: symmetric, bell-shaped, left or
right-skewed, unimodal, bimodal
Graphical Summary is Evolving!
• Suppose you have collected a data set of life expectancy
and income data for 200 countries over 200 years.
• How would you summarize this data?
• A video clip: http://www.open.ac.uk/openlearn/science-mathstechnology/mathematics-and-statistics/statistics/the-joy-stats-200countries-200-years-4-minutes
Today
• Numerical summary of data: mean/median/mode,
variance, standard deviation, outliers
• Graphical summary: Boxplot
Measuring center: the mean
• Mean = Average value
• Sample mean $\bar{x}$: for n observations $x_1, x_2, \ldots, x_n$,
their mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Measuring center: the median
• Median = middle value or center point
• Sample median: the number such that half of the
observations are smaller than it and the other half
are larger
– the midpoint of a distribution
Procedure to calculate the median: M
1. Arrange all observations in order of size, from smallest
to largest.
2. If the number of observations, n, is odd, the median M
is the center observation in the ordered list.
3. If n is even, then M is the average of the two center
observations in the ordered list.
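The three steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `median` and the two small data sets are made up for the example and are not part of the lecture.

```python
def median(values):
    """Return the sample median using the three-step procedure."""
    ordered = sorted(values)      # Step 1: order from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                # Step 2: n odd, take the center observation
        return ordered[mid]
    # Step 3: n even, average the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data, chosen only to exercise both branches.
print(median([13, 5, 10, 9, 8]))  # odd n:  prints 9
print(median([10, 5, 9, 8]))      # even n: prints 8.5
```

Sorting first is what makes the median more expensive to compute than the mean for large data sets.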
Mean vs Median
• Consider two data sets
{5, 8, 9, 10, 13} vs {5, 8, 9, 10, 68}
– Both have the median equal to 9
– First has mean of 9, second has mean of 20
– Median is more robust to the outlier 68
Note: (n+1)/2 is the location of the median in the ordered list, not the
median itself.
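The comparison of the two data sets can be checked with a short Python sketch. It assumes Python's standard `statistics` module; the variable names `clean` and `with_outlier` are chosen here for illustration.

```python
from statistics import mean, median

clean = [5, 8, 9, 10, 13]
with_outlier = [5, 8, 9, 10, 68]   # same data, but 13 replaced by the outlier 68

for data in (clean, with_outlier):
    n = len(data)
    location = (n + 1) / 2         # position of the median in the ordered list
    print(data,
          "mean =", mean(data),
          "median =", median(data),
          "median location =", location)
```

Replacing 13 with 68 moves the mean from 9 to 20, while the median stays at 9, the observation at position 3 of the ordered list.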
Mean vs Median: Salary Survey of UNC Graduates
• Survey a certain number of graduates from UNC.
• A lot of departments are surveyed.
• Question:
– Which department produces students that earn the
most on average 10 years after they got their
degrees?
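• Answer:
– Geography!
– Why? Michael Jordan was a geography major, and one extreme
salary is enough to pull the department's mean far above its median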
Mean vs. Median
• Mean:
– easy to calculate
– easy to work with algebraically
– highly affected by outliers
– not resistant to extreme observations
• Median:
– can be time consuming to calculate
– more resistant to a few extreme observations
(sometimes outliers)
– robust
Mean, Median and Mode
Mode: where is the peak
• Important for categorical data
• The most frequent value in the data
• Possible to have more than one mode
Mean – average
Median – the middle data point
Mode – where the peak(s) is (are)
• If a unimodal distribution is exactly symmetric, then
mean, median and mode are exactly the same.
• If the distribution is skewed, the three measures
differ.
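As a small illustration of the mode, the sketch below uses `statistics.multimode` (available in Python 3.8 and later), which returns every most frequent value. Both data sets are invented for the example.

```python
from statistics import multimode

# Made-up categorical data: letter grades.
grades = ["B", "A", "C", "B", "A", "B"]
print(multimode(grades))    # ['B']

# Made-up quantitative data with two equally frequent values.
bimodal = [1, 2, 2, 3, 4, 4]
print(multimode(bimodal))   # [2, 4]
```

The first call shows that the mode works for categorical data, where a mean or median would not make sense; the second shows that the mode need not be unique.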
Remarks
Which one to use?
– If the histogram is symmetric, the mean is
approximately equal to median.
– If the histogram is right skewed, the mean is likely
greater than the median.
– If the histogram is left skewed, the mean is likely less
than the median
– The difference between the mean and the median is a
rough measure of how severely skewed the data are.
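A quick numerical check of these remarks, as a sketch on invented data: `right_skewed` has a long right tail, and `left_skewed` is its mirror image (each value x replaced by 21 - x). Neither data set comes from the lecture.

```python
from statistics import mean, median

# Hypothetical data with a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]
# Mirror image of the same data, giving a long left tail.
left_skewed = [21 - x for x in right_skewed]

for name, data in (("right skewed", right_skewed), ("left skewed", left_skewed)):
    gap = mean(data) - median(data)   # rough indicator of direction and size of skew
    print(name, "mean =", mean(data), "median =", median(data), "mean - median =", gap)
```

The sign of `mean - median` follows the direction of the long tail, in line with the remarks above.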
[Figure: a distribution with its mode, median, and mean marked]
• Different by definition
– Mean and median are unique, and only for
quantitative variables.
– Mode is not unique.
– Mode is defined for categorical variables also.
• The choice depends on the shape of the distribution, the
type of data and the purpose of your study
– Skewed: median
– Categorical: mode
– Total quantity: mean
–…
Measure of Spread: Variance and Standard Deviation
• For a random sample $x_1, x_2, \ldots, x_n$ of size n, the
sample variance/standard deviation are measures of
spread
– Variance:
$$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$$
– Standard deviation = s = square root of variance (in same units
as data)
– If observed values are far apart, the variance and standard
deviation will be large
– Note variance (and standard deviation) are greater than or equal
to 0. Only 0 when all observed values are equal
– Variance and standard deviation are strongly affected by outliers
Exercise
• What’s the mode? Answer: 4
• What’s the median? Answer: 4
• What’s the mean? Answer: approx. 4.215
Standard deviation (cont’d)
• Why sample “st. dev.” (SD) rather than sample “variance”?
– S.D. is in the original scale
– S.D. is natural for measuring spread for “normal”
distributions
• Why “n-1” rather than “n”?
– Intuitively speaking, S.D. is not defined for n=1
– Sum of deviations is always 0, which means “if we
know (n-1) of them, we know the last one”
– Only (n-1) deviations can change freely
– “n-1” – is referred to as degrees of freedom
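The degrees-of-freedom argument can be illustrated with a short sketch. It reuses the data set {5, 8, 9, 10, 13} from the Mean vs Median slide; the variable names are chosen here for illustration.

```python
data = [5, 8, 9, 10, 13]
n = len(data)
xbar = sum(data) / n                      # sample mean, 9.0 for this data

deviations = [x - xbar for x in data]     # [-4.0, -1.0, 0.0, 1.0, 4.0]
print(sum(deviations))                    # deviations from the mean sum to 0

# Because the sum is 0, the last deviation is fixed by the first n-1.
last_from_others = -sum(deviations[:-1])
print(last_from_others, deviations[-1])   # both are 4.0
```

Only n-1 of the deviations carry independent information, which is the intuition behind dividing by n-1.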
Example
• Calculate the variance for
a) First data set {5, 8, 9, 10, 13}
$$s^2 = \left[ (5-9)^2 + (8-9)^2 + (9-9)^2 + (10-9)^2 + (13-9)^2 \right] / 4 = 8.5$$
s.d. = 2.9155
b) Second data set {5, 8, 9, 10, 68}
$$s^2 = \left[ (5-20)^2 + (8-20)^2 + (9-20)^2 + (10-20)^2 + (68-20)^2 \right] / 4 = (225 + 144 + 121 + 100 + 2304) / 4 = 723.5$$
s.d. = 26.8980
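The two hand calculations can be reproduced with a minimal Python sketch of the sample variance formula (divisor n-1). The function name `sample_variance` is an assumption made for this example.

```python
import math


def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


for data in ([5, 8, 9, 10, 13], [5, 8, 9, 10, 68]):
    s2 = sample_variance(data)
    s = math.sqrt(s2)                 # standard deviation, in the units of the data
    print(data, "variance =", s2, "s.d. =", round(s, 4))
```

A single outlier (68 in place of 13) raises the variance from 8.5 to 723.5, which shows how strongly both measures of spread react to extreme observations.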
Take Home Message
• Graphical tools for quantitative data
– Histograms
– Boxplot
• Examine distributions:
– Shape
– Symmetric or skewed
– How many modes?
– Bell-shaped
– Outliers
• Numerical summary: Mean, median, mode, variance,
standard deviation, Q1, Q3, IQR