Lecture 4
Dustin Lueker

The population distribution for a continuous
variable is usually represented by a smooth curve
◦ Like a histogram that gets finer and finer
 Similar to the idea of using smaller and smaller rectangles to
calculate the area under a curve when learning how to
integrate

Symmetric distributions
◦ Bell-shaped
◦ U-shaped
◦ Uniform

Not symmetric distributions
◦ Left-skewed
◦ Right-skewed
◦ Skewed
STA 291 Summer 2010 Lecture 4

Center of the data
◦ Mean
◦ Median
◦ Mode

Dispersion of the data
 Sometimes referred to as spread
◦ Variance, Standard deviation
◦ Interquartile range
◦ Range

Mean
◦ Arithmetic average

Median
◦ Midpoint of the observations when they are
arranged in order
 Smallest to largest

Mode
◦ Most frequently occurring value



Sample size n
Observations $x_1, x_2, \ldots, x_n$
Sample Mean “x-bar”
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$\Sigma$ = SUM



Population size N
Observations $x_1, x_2, \ldots, x_N$
Population Mean “mu”
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
$\Sigma$ = SUM
 Note: This is for a finite population of size N

Requires numerical values
◦ Only appropriate for quantitative data
◦ Does not make sense to compute the mean for
nominal variables
◦ Can be calculated for ordinal variables, but this does not
always make sense
 Should be careful when using the mean on ordinal variables
 Example “Weather” (on an ordinal scale)
Sun=1, Partly Cloudy=2, Cloudy=3,
Rain=4, Thunderstorm=5
Mean (average) weather=2.8
 Another example: “GPA = 3.8” is also a mean of observations
measured on an ordinal scale


Center of gravity for the data set
Sum of the differences from values above the
mean is equal to the sum of the differences
from values below the mean
◦ 3+2+2 = 3 + 4

Mean
◦ Sum of observations divided by the number of
observations

Example
◦ {7, 12, 11, 18}
◦ Mean =
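As a quick check of the example above, here is a minimal Python sketch that applies the definition directly. The helper name `sample_mean` is ours for illustration and does not come from the lecture.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


data = [7, 12, 11, 18]
print(sample_mean(data))  # 12.0
```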

Highly influenced by outliers
◦ Data points that are far from the rest of the data
◦ Example
 Monthly income for five people
1,000 2,000 3,000 4,000 100,000
 Average monthly income =
 What is the problem with using the average to describe
this data set?
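One way to see the effect of the outlier is to compute the average with and without the largest income. This is only a sketch; the helper name `average` is an illustrative choice.

```python
incomes = [1_000, 2_000, 3_000, 4_000, 100_000]


def average(values):
    """Arithmetic mean: sum divided by count."""
    return sum(values) / len(values)


mean_all = average(incomes)                   # all five people
mean_without_largest = average(incomes[:-1])  # drop the 100,000 income

print(mean_all)              # 22000.0
print(mean_without_largest)  # 2500.0
```

Four of the five people earn far less than the overall average, which is what makes the mean a poor summary of this data set.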


Measurement that falls in the middle of the
ordered sample
When the sample size n is odd, there is a
middle value
◦ It has the ordered index (n+1)/2
 Ordered index is where that value falls when the
sample is listed from smallest to largest
 An index of 2 means the second smallest value
◦ Example
 1.7, 4.6, 5.7, 6.1, 8.3
n=5, (n+1)/2=6/2=3, index = 3
Median = 3rd smallest observation = 5.7

When the sample size n is even, average the
two middle values
◦ Example
 3, 5, 6, 9, n=4
(n+1)/2=5/2=2.5, Index = 2.5
Median = midpoint between 2nd and 3rd smallest
observations = (5+6)/2 = 5.5
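The ordered-index rule for odd and even sample sizes can be sketched in Python as follows. The function name `median_by_index` is hypothetical; the standard library's `statistics.median` gives the same results.

```python
def median_by_index(values):
    """Median via the ordered index (n + 1) / 2, counting from 1."""
    ordered = sorted(values)  # smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number, so there is one middle value.
        index = (n + 1) // 2
        return ordered[index - 1]  # convert 1-based index to 0-based
    # Even n: (n + 1) / 2 ends in .5, so average the two neighbouring values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median_by_index([1.7, 4.6, 5.7, 6.1, 8.3]))  # 5.7
print(median_by_index([3, 5, 6, 9]))               # 5.5
```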



For skewed distributions, the median is often
a more appropriate measure of central
tendency than the mean
The median usually better describes a “typical
value” when the sample distribution is highly
skewed
Example
◦ Monthly income for five people
1,000 2,000 3,000 4,000 100,000
◦ Median monthly income:
 Why is the median better to use with this data than the
mean?
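A short sketch of why the median is the more stable summary here: replacing the extreme income with a modest one leaves the median unchanged while the mean moves a great deal. The second list, `capped`, is a hypothetical variation that is not in the lecture.

```python
from statistics import median

incomes = [1_000, 2_000, 3_000, 4_000, 100_000]
capped = [1_000, 2_000, 3_000, 4_000, 5_000]  # hypothetical: outlier replaced

median_incomes = median(incomes)
median_capped = median(capped)
mean_incomes = sum(incomes) / len(incomes)
mean_capped = sum(capped) / len(capped)

print(median_incomes, median_capped)  # 3000 3000
print(mean_incomes, mean_capped)      # 22000.0 3000.0
```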
Mean - Arithmetic average
 Mean of a sample - $\bar{x}$
 Mean of a population - μ
Median - Midpoint of the observations when they are arranged in increasing order
Mode - Most frequent value
Notation: Subscripted variables
 n = # of units in the sample
 N = # of units in the population
 x = Variable to be measured
 $x_i$ = Measurement of the ith unit

Example: Highest Degree Completed

Highest Degree                                          Frequency   Percentage
Not a high school graduate                                 38,012         21.4
High school only                                           65,291         36.8
Some college, no degree                                    33,191         18.7
Associate, Bachelor, Master, Doctorate, Professional       41,124         23.2
Total                                                     177,618        100



n = 177,618
(n+1)/2 = 88,809.5
Median = midpoint between the 88,809th
smallest and 88,810th smallest observations
◦ Both are in the category “High school only”
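For a frequency table, the median category can be located by accumulating the counts in order until the middle ordered indices are reached. The following is a minimal sketch under that approach; the function name `median_categories` and the tuple layout are our own choices.

```python
def median_categories(freq_table):
    """Return the categories holding the middle observation(s).

    freq_table: list of (category, count) pairs, listed in ordinal order.
    """
    n = sum(count for _, count in freq_table)
    # 1-based ordered indices around (n + 1) / 2; they coincide when n is odd.
    lower_index = (n + 1) // 2
    upper_index = n // 2 + 1

    def category_at(index):
        cumulative = 0
        for category, count in freq_table:
            cumulative += count
            if index <= cumulative:
                return category
        raise ValueError("index exceeds the total frequency")

    return category_at(lower_index), category_at(upper_index)


degrees = [
    ("Not a high school graduate", 38_012),
    ("High school only", 65_291),
    ("Some college, no degree", 33_191),
    ("Associate, Bachelor, Master, Doctorate, Professional", 41_124),
]
print(median_categories(degrees))
# ('High school only', 'High school only')
```

The cumulative count reaches 38,012 after the first category and 103,303 after the second, so both middle observations fall in the second category.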


Mean wouldn’t make sense here since the
variable is ordinal
Median
◦ Can be used for interval data and for ordinal data
◦ Cannot be used for nominal data because the
observations cannot be ordered on a scale

Mean
◦ Interval data with an approximately symmetric
distribution

Median
◦ Interval data
◦ Ordinal data

Mean is sensitive to outliers, median is not

Symmetric distribution
◦ Mean = Median

Skewed distribution
◦ Mean lies farther than the median in the direction in which the
distribution is skewed

While the median is often better than the mean for
skewed distributions, there is one large
disadvantage to using the median
◦ Insensitive to changes within the lower or upper
half of the data
◦ Example
 1, 2, 3, 4, 5
 1, 2, 3, 100, 100
◦ Sometimes, the mean is more informative even
when the distribution is skewed
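The two example data sets make the point concrete: they share a median even though their upper halves differ greatly, while the mean picks up the change. A minimal sketch:

```python
from statistics import median

first = [1, 2, 3, 4, 5]
second = [1, 2, 3, 100, 100]

median_first = median(first)
median_second = median(second)
mean_first = sum(first) / len(first)
mean_second = sum(second) / len(second)

print(median_first, median_second)  # 3 3
print(mean_first, mean_second)      # 3.0 41.2
```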

Keeneland Sales

The deviation of the ith observation $x_i$ from
the sample mean $\bar{x}$ is the difference between
them, $(x_i - \bar{x})$
◦ Sum of all deviations is zero
◦ Therefore, we use either the sum of the absolute
deviations or the sum of the squared deviations as
a measure of variation
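To illustrate why raw deviations cannot measure variation, the sketch below reuses the data set {7, 12, 11, 18} from the earlier mean example. The deviations cancel to zero, while the absolute and squared versions do not.

```python
data = [7, 12, 11, 18]
x_bar = sum(data) / len(data)           # sample mean, 12.0

deviations = [x - x_bar for x in data]  # [-5.0, 0.0, -1.0, 6.0]

sum_of_deviations = sum(deviations)
sum_of_absolute = sum(abs(d) for d in deviations)
sum_of_squared = sum(d ** 2 for d in deviations)

print(sum_of_deviations)  # 0.0
print(sum_of_absolute)    # 12.0
print(sum_of_squared)     # 62.0
```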

Variance of n observations is the sum of the
squared deviations, divided by n-1
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
Observation   Mean   Deviation   Squared Deviation
1
3
4
7
10
Sum of the Squared Deviations
n-1
Sum of the Squared Deviations / (n-1)
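The table can be filled in step by step. The sketch below builds each row (observation, mean, deviation, squared deviation) and then the three summary lines; the variable names are ours.

```python
observations = [1, 3, 4, 7, 10]
n = len(observations)
mean = sum(observations) / n  # 5.0

# One row per observation: (observation, mean, deviation, squared deviation)
rows = [(x, mean, x - mean, (x - mean) ** 2) for x in observations]
for row in rows:
    print(row)

sum_squared = sum(row[3] for row in rows)
variance = sum_squared / (n - 1)

print(sum_squared)  # 50.0
print(n - 1)        # 4
print(variance)     # 12.5
```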

Variance is approximately the average of the squared deviations
◦ “average squared distance from the mean”

Unit
◦ Square of the unit for the original data
 Difficult to interpret
◦ Solution
 Take the square root of the variance, and the unit is
the same as for the original data
 Standard Deviation

s≥0
◦ s = 0 only when all observations are the same


If data is collected for the whole population
instead of a sample, then n-1 is replaced by
N
s is sensitive to outliers
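These properties can be checked with Python's standard `statistics` module, where `variance` and `stdev` divide by n-1 and `pvariance` and `pstdev` divide by N. The data sets below are illustrative: `observations` reuses the five values from the lecture's variance table, while `identical` and `with_outlier` are hypothetical.

```python
from statistics import pstdev, pvariance, stdev, variance

observations = [1, 3, 4, 7, 10]
identical = [4, 4, 4, 4]          # hypothetical: all observations the same
with_outlier = [1, 3, 4, 7, 100]  # hypothetical: largest value replaced

sample_var = variance(observations)       # divides by n - 1
population_var = pvariance(observations)  # divides by N

print(sample_var, population_var)                   # 12.5 versus 10
print(stdev(observations), pstdev(observations))    # square roots of the above
print(stdev(identical))                             # zero spread
print(stdev(with_outlier) > stdev(observations))    # True
```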

Sample
◦ Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
◦ Standard Deviation
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$

Population
◦ Variance
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
◦ Standard Deviation
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

Population mean and population standard deviation
are denoted by the Greek letters μ (mu) and σ
(sigma)
◦ They are unknown constants that we would like to estimate

Sample mean and sample standard deviation are
denoted by $\bar{x}$ and s
◦ They are random variables, because their values vary
according to the random sample that has been selected

If the data is approximately symmetric and
bell-shaped then
◦ About 68% of the observations are within one
standard deviation from the mean
◦ About 95% of the observations are within two
standard deviations from the mean
◦ About 99.7% of the observations are within three
standard deviations from the mean
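The three percentages match what an exactly normal (idealized bell-shaped) curve gives. As a sketch, assuming the data follow a normal distribution exactly, the shares can be computed with `statistics.NormalDist` (Python 3.8 or later):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(f"within {k} SD: {share:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

Real data are only approximately bell-shaped, so the observed shares will usually be close to these values rather than equal to them.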

Scores on a standardized test are scaled so
they have a bell-shaped distribution with a
mean of 1000 and standard deviation of 150
◦ About 68% of the scores are between ___ and ___
◦ About 95% of the scores are between ___ and ___
◦ If you have a score above 1300, you are in the
top ___ %
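One way to work through this exercise is to apply the 68% and 95% rules directly to the stated mean and standard deviation. The sketch below does so; the helper name `interval` is ours, and the final step relies on the symmetry of the bell-shaped curve.

```python
mean_score = 1000
sd = 150


def interval(k):
    """Scores within k standard deviations of the mean."""
    return (mean_score - k * sd, mean_score + k * sd)


within_68 = interval(1)
within_95 = interval(2)

# 1300 is two standard deviations above the mean. About 5% of scores lie
# outside the 95% interval, and by symmetry half of those lie above it.
top_percent = (100 - 95) / 2

print(within_68)    # (850, 1150)
print(within_95)    # (700, 1300)
print(top_percent)  # 2.5
```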