Lecture 2
Dustin Lueker

Center of the data
◦ Mean
◦ Median
◦ Mode

Dispersion of the data
 Sometimes referred to as spread
◦ Variance, Standard deviation
◦ Interquartile range
◦ Range
STA 291 Winter 09/10 Lecture 2

Mean
◦ Arithmetic average

Median
◦ Midpoint of the observations when they are
arranged in order
 Smallest to largest

Mode
◦ Most frequently occurring value



Sample size n
Observations x1, x2, …, xn
Sample Mean “x-bar”
$\bar{x} = (x_1 + x_2 + \dots + x_n)/n = \frac{1}{n}\sum_{i=1}^{n} x_i$
◦ $\sum$ = SUM



Population size N
Observations x1 , x2 ,…, xN
Population Mean “mu”
$\mu = (x_1 + x_2 + \dots + x_N)/N = \frac{1}{N}\sum_{i=1}^{N} x_i$
◦ $\sum$ = SUM
◦ Note: This is for a finite population of size N

Requires numerical values
◦ Only appropriate for quantitative data
◦ Does not make sense to compute the mean for
nominal variables
◦ Can be calculated for ordinal variables, but this does not
always make sense
 Should be careful when using the mean on ordinal variables
 Example “Weather” (on an ordinal scale)
Sun=1, Partly Cloudy=2, Cloudy=3,
Rain=4, Thunderstorm=5
Mean (average) weather=2.8
 Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale


Center of gravity for the data set
Sum of the distances from the mean of the values above it is equal to
the sum of the distances from the mean of the values below it

Mean
◦ Sum of observations divided by the number of
observations

Example
◦ {7, 12, 11, 18}
◦ Mean =
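As a quick check of the example, the definition can be sketched in Python. The function name `sample_mean` is an illustrative choice, not something from the lecture.

```python
def sample_mean(observations):
    """Sum of the observations divided by the number of observations."""
    return sum(observations) / len(observations)


data = [7, 12, 11, 18]
mean = sample_mean(data)  # (7 + 12 + 11 + 18) / 4
print(mean)  # 12.0
```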

Highly influenced by outliers
◦ Data points that are far from the rest of the data

Not representative of a typical observation if
the distribution of the data is highly skewed
◦ Example
 Monthly income for five people
1,000 2,000 3,000 4,000 100,000
 Average monthly income =
 Not representative of a typical observation


Measurement that falls in the middle of the
ordered sample
When the sample size n is odd, there is a
middle value
◦ It has the ordered index (n+1)/2
 Ordered index is where that value falls when the
sample is listed from smallest to largest
 An index of 2 means the second smallest value
◦ Example
 1.7, 4.6, 5.7, 6.1, 8.3
n=5, (n+1)/2=6/2=3, index = 3
Median = 3rd smallest observation = 5.7

When the sample size n is even, average the
two middle values
◦ Example
 3, 5, 6, 9, n=4
(n+1)/2=5/2=2.5, Index = 2.5
Median = midpoint between 2nd and 3rd smallest
observations = (5+6)/2 = 5.5
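The ordered-index rule covers both the odd and the even case. A minimal Python sketch, assuming the hypothetical helper name `sample_median`, applied to the two examples:

```python
def sample_median(observations):
    """Median via the ordered index (n + 1) / 2, counted from 1."""
    ordered = sorted(observations)
    n = len(ordered)
    index = (n + 1) / 2
    lower = int(index)  # whole-number part of the ordered index
    if index == lower:
        # n is odd: the index points at a single middle value
        return ordered[lower - 1]
    # n is even: average the two values on either side of the index
    return (ordered[lower - 1] + ordered[lower]) / 2


median_odd = sample_median([1.7, 4.6, 5.7, 6.1, 8.3])
median_even = sample_median([3, 5, 6, 9])
print(median_odd)   # 5.7
print(median_even)  # 5.5
```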



For skewed distributions, the median is often
a more appropriate measure of central
tendency than the mean
The median usually better describes a “typical
value” when the sample distribution is highly
skewed
Example
◦ Monthly income for five people
1,000 2,000 3,000 4,000 100,000
◦ Median monthly income:
 Does this better describe a “typical value” in the data
set than the mean of 22,000?
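A short Python sketch, using the standard `statistics` module, that computes both measures for the income example so they can be compared side by side:

```python
import statistics

incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

mean_income = statistics.mean(incomes)      # pulled upward by the 100,000 value
median_income = statistics.median(incomes)  # 3rd smallest of the 5 values

print("mean:", mean_income)
print("median:", median_income)
```

Four of the five incomes lie below the mean, while the median has two values on each side of it.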
Mean - Arithmetic Average
◦ Mean of a Sample - $\bar{x}$
◦ Mean of a Population - μ
Median - Midpoint of the observations when they are arranged in increasing order
Mode - Most frequent value
Notation: Subscripted variables
◦ n = # of units in the sample
◦ N = # of units in the population
◦ x = Variable to be measured
◦ xi = Measurement of the ith unit

Trimmed mean is a compromise between the
median and mean
◦ Calculating the trimmed mean
 Order the data from smallest to largest
 Delete a selected number of values from each end of
the ordered list
 Find the mean of the remaining values
◦ The trimming percentage is the percentage of
values that have been deleted from each end of the
ordered list
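The three steps can be sketched in Python as follows. The function name `trimmed_mean` and its `trim_count` parameter are assumptions made for illustration, and the data reuse the monthly income example from earlier in the lecture.

```python
def trimmed_mean(observations, trim_count):
    """Mean after deleting trim_count values from each end of the ordered data."""
    ordered = sorted(observations)
    if trim_count < 0 or 2 * trim_count >= len(ordered):
        raise ValueError("trim_count must leave at least one value")
    kept = ordered[trim_count:len(ordered) - trim_count]
    return sum(kept) / len(kept)


incomes = [1_000, 2_000, 3_000, 4_000, 100_000]
trimmed = trimmed_mean(incomes, 1)  # delete 1 of 5 values (20%) from each end
print(trimmed)  # 3000.0
```

With a trimming percentage of 20%, the 100,000 value is deleted, so the result falls back to the center of the remaining three incomes.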

Example: Highest Degree Completed
Highest Degree                                          Frequency   Percentage
Not a high school graduate                                 38,012         21.4
High school only                                           65,291         36.8
Some college, no degree                                    33,191         18.7
Associate, Bachelor, Master, Doctorate, Professional       41,124         23.2
Total                                                     177,618          100



n = 177,618
(n+1)/2 = 88,809.5
Median = midpoint between the 88809th
smallest and 88810th smallest observations
◦ Both are in the category “High school only”
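One way to locate the median category is to walk through the cumulative frequencies of the ordered categories. The sketch below is illustrative only; the helper name `category_at` is an assumption, and the frequencies are those of the Highest Degree Completed example.

```python
from itertools import accumulate

# Categories listed in their ordinal order, with their frequencies
categories = [
    "Not a high school graduate",
    "High school only",
    "Some college, no degree",
    "Associate, Bachelor, Master, Doctorate, Professional",
]
frequencies = [38_012, 65_291, 33_191, 41_124]


def category_at(ordered_index, categories, frequencies):
    """Category containing the observation with the given ordered index (from 1)."""
    for category, cumulative in zip(categories, accumulate(frequencies)):
        if ordered_index <= cumulative:
            return category
    raise ValueError("ordered index exceeds the number of observations")


n = sum(frequencies)       # 177,618
lower_middle = n // 2      # 88,809th smallest observation
upper_middle = n // 2 + 1  # 88,810th smallest observation
median_low = category_at(lower_middle, categories, frequencies)
median_high = category_at(upper_middle, categories, frequencies)
print(median_low, "|", median_high)
```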


Mean wouldn’t make sense here since the
variable is only ordinal
Median
◦ Can be used for interval data and for ordinal data
◦ Cannot be used for nominal data because the
observations cannot be ordered on a scale

Mean
◦ Interval data with an approximately symmetric
distribution

Median
◦ Interval data
◦ Ordinal data

Mean is sensitive to outliers, median is not

Symmetric distribution
◦ Mean = Median

Skewed distribution
◦ Mean lies farther than the median in the direction in which the
distribution is skewed

Disadvantage
◦ Insensitive to changes within the lower or upper
half of the data
◦ Example
 1, 2, 3, 4, 5
 1, 2, 3, 100, 100
◦ Sometimes, the mean is more informative even
when the distribution is skewed
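A small Python sketch of the two example data sets, using the standard `statistics` module, shows the median staying put while the mean reacts to the change in the upper half:

```python
import statistics

first = [1, 2, 3, 4, 5]
second = [1, 2, 3, 100, 100]

for data in (first, second):
    print(data, "median:", statistics.median(data), "mean:", statistics.mean(data))
```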

Keeneland Sales

Difference between the largest and smallest
observation
◦ Very much affected by outliers
 A misrecorded observation may lead to an outlier, and
affect the range

The range does not always reveal different
variation about the mean

Sample 1
◦ Smallest Observation: 112
◦ Largest Observation: 797
◦ Range =

Sample 2
◦ Smallest Observation: 15033
◦ Largest Observation: 16125
◦ Range =
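The two ranges follow directly from the definition. A minimal Python sketch, with `sample_range` as an illustrative helper name:

```python
def sample_range(smallest, largest):
    """Difference between the largest and smallest observation."""
    return largest - smallest


range_1 = sample_range(112, 797)
range_2 = sample_range(15033, 16125)
print(range_1)  # 685
print(range_2)  # 1092
```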

The pth percentile (Lp) is a number such that
p% of the observations take values below it,
and (100-p)% take values above it
◦ 50th percentile = median
◦ 25th percentile = lower quartile
◦ 75th percentile = upper quartile

The index of Lp
◦ (n+1)p/100

25th percentile
◦ lower quartile
◦ Q1
◦ (approximately) median of the observations
below the median

75th percentile
◦ upper quartile
◦ Q3
◦ (approximately) median of the observations
above the median

Find the 25th percentile of this data set
◦ {3, 7, 12, 13, 15, 19, 24}



Use when the index is not a whole number
Go to the closest lower whole-number index, then move the
decimal fraction of the distance toward the next value
If the index is found to be 5.4, go to the 5th value, then
add 0.4 of the difference between the 5th value and the 6th value
◦ In essence we are going to the 5.4th value

Find the 40th percentile of the same data set
◦ {3, 7, 12, 13, 15, 19, 24}
 Must use interpolation
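The index rule (n+1)p/100 together with interpolation can be sketched in Python. The function name `percentile` is an illustrative choice, and returning the smallest or largest observation when the index falls outside 1..n is an assumption of this sketch, since the lecture does not cover that case.

```python
def percentile(observations, p):
    """pth percentile using the ordered index (n + 1) * p / 100 with interpolation."""
    ordered = sorted(observations)
    n = len(ordered)
    index = (n + 1) * p / 100
    # Assumption: clamp to the extremes when the index falls outside 1..n
    if index <= 1:
        return ordered[0]
    if index >= n:
        return ordered[-1]
    lower = int(index)        # closest lower whole-number index
    fraction = index - lower  # decimal part of the index
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)


data = [3, 7, 12, 13, 15, 19, 24]
p25 = percentile(data, 25)  # index = 8 * 25 / 100 = 2, the 2nd smallest value
p40 = percentile(data, 40)  # index = 3.2, so 0.2 of the way from 12 to 13
print(p25)
print(round(p40, 2))
```

Other textbooks and software packages use slightly different index rules, so their percentiles may not match these values exactly.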


Five Number Summary
◦ Minimum
◦ Lower Quartile
◦ Median
◦ Upper Quartile
◦ Maximum
Example
◦ minimum=4
◦ Q1=256
◦ median=530
◦ Q3=1105
◦ maximum=320,000
 What does this suggest about the shape of the distribution?

The Interquartile Range (IQR) is the difference
between upper and lower quartile
◦ IQR = Q3 – Q1
◦ IQR = Range of values that contains the middle 50%
of the data
◦ IQR increases as variability increases

Murder Rate Data
◦ Q1= 3.9
◦ Q3 = 10.3
◦ IQR =




Displays the five number summary (and
more) graphically
Consists of
◦ A box that contains the central 50% of the distribution
(from lower quartile to upper quartile)
◦ A line within the box that marks the median
◦ Whiskers that extend to the maximum and minimum values
 This is assuming there are no outliers in the data set

An observation is an outlier if it falls
◦ more than 1.5 IQR above the upper quartile
or
◦ more than 1.5 IQR below the lower quartile
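The 1.5 IQR rule can be sketched in Python as a pair of cutoffs. The names `outlier_fences` and `is_outlier` are illustrative choices (the word "fences" does not appear in the lecture), and the quartiles are taken from the five number summary example earlier in the lecture (Q1=256, Q3=1105).

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) cutoffs lying 1.5 IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if value is more than 1.5 IQR below Q1 or above Q3."""
    lower, upper = outlier_fences(q1, q3)
    return value < lower or value > upper


q1, q3 = 256, 1105
lower, upper = outlier_fences(q1, q3)
print(lower, upper)                 # -1017.5 2378.5
print(is_outlier(320_000, q1, q3))  # True: the maximum is an outlier
print(is_outlier(4, q1, q3))        # False: the minimum is not
```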


Whiskers only extend to the most extreme
observations within 1.5 IQR beyond the
quartiles
If an observation is an outlier, it is marked by
an x, +, or some other identifier

Values
◦ Min = 148
◦ Q1 = 158
◦ Median = Q2 = 162
◦ Q3 = 182
◦ Max = 204
Create a box plot



On right-skewed distributions, minimum, Q1,
and median will be “bunched up”, while Q3
and the maximum will be farther away.
For left-skewed distributions, the “mirror” is
true: the maximum, Q3, and the median will
be relatively close compared to the
corresponding distances to Q1 and the
minimum.
Symmetric distributions?

Statistics that describe variability
◦ Two distributions may have the same mean
and/or median but different variability
 Mean and Median only describe a typical value, but
not the spread of the data
◦ Range
◦ Variance
◦ Standard Deviation
◦ Interquartile Range
 All of these can be computed for the sample or
population

The deviation of the ith observation xi from
the sample mean $\bar{x}$ is the difference between
them, $(x_i - \bar{x})$
◦ Sum of all deviations is zero
◦ Therefore, we use either the sum of the absolute
deviations or the sum of the squared deviations as
a measure of variation

Variance of n observations is the sum of the
squared deviations, divided by n-1
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
1. Calculate the mean
2. For each observation, calculate the deviation
3. For each observation, calculate the squared deviation
4. Add up all the squared deviations
5. Divide the result by (n-1)
◦ Or N if you are finding the population variance
(To get the standard deviation, take the square root of the result)
Observation   Mean   Deviation   Squared Deviation
1
3
4
7
10
Sum of the Squared Deviations =
n-1 =
Sum of the Squared Deviations / (n-1) =
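The five-step procedure can be traced for these observations with a short Python sketch. Variable names are illustrative; the step numbers in the comments refer to the procedure for calculating the sample variance.

```python
observations = [1, 3, 4, 7, 10]
n = len(observations)

mean = sum(observations) / n                    # step 1: the mean
deviations = [x - mean for x in observations]   # step 2: deviation of each value
squared = [d ** 2 for d in deviations]          # step 3: squared deviations
sum_squared = sum(squared)                      # step 4: add them up
variance = sum_squared / (n - 1)                # step 5: divide by n - 1
std_dev = variance ** 0.5                       # square root gives s

# One row per observation: observation, mean, deviation, squared deviation
for x, d, sq in zip(observations, deviations, squared):
    print(x, mean, d, sq)
print("sum of squared deviations:", sum_squared)
print("n - 1:", n - 1)
print("sample variance:", variance)
print("sample standard deviation:", round(std_dev, 3))
```

The deviations in the third column add up to zero, which is why they are squared before being summed.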

About the average of the squared deviations
◦ “average squared distance from the mean”

Unit
◦ Square of the unit for the original data
 Difficult to interpret
◦ Solution
 Take the square root of the variance, and the unit is
the same as for the original data
 Standard Deviation

s≥0
◦ s = 0 only when all observations are the same


If data is collected for the whole population
instead of a sample, then n-1 is replaced by N
s is sensitive to outliers

Sample
◦ Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
◦ Standard Deviation
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$

Population
◦ Variance
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
◦ Standard Deviation
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

Population mean and population standard deviation
are denoted by the Greek letters μ (mu) and σ
(sigma)
◦ They are unknown constants that we would like to estimate

Sample mean and sample standard deviation are
denoted by $\bar{x}$ and s
◦ They are random variables, because their values vary
according to the random sample that has been selected

If the data is approximately symmetric and
bell-shaped then
◦ About 68% of the observations are within one
standard deviation from the mean
◦ About 95% of the observations are within two
standard deviations from the mean
◦ About 99.7% of the observations are within three
standard deviations from the mean

SAT scores are scaled so that they have an
approximate bell-shaped distribution with a
mean of 500 and standard deviation of 100
◦ About 68% of the scores are between ___ and ___
◦ About 95% of the scores are between ___ and ___
◦ If you have a score above 700, you are in the top ___%
 What percentile would this be?
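One way to work through these questions is to apply the 68% and 95% figures for bell-shaped data directly. In the Python sketch below the helper name `within_k_sd` is an assumption, and the last two lines rely on the symmetry of the bell shape to split the remaining 5% evenly between the two tails.

```python
def within_k_sd(mean, sd, k):
    """Interval reaching k standard deviations on either side of the mean."""
    return mean - k * sd, mean + k * sd


mean, sd = 500, 100
one_sd = within_k_sd(mean, sd, 1)  # about 68% of the scores
two_sd = within_k_sd(mean, sd, 2)  # about 95% of the scores

# About 5% of scores fall outside two standard deviations;
# by symmetry, roughly half of those lie above 700.
top_percent = (100 - 95) / 2
percentile_rank = 100 - top_percent

print(one_sd)           # (400, 600)
print(two_sd)           # (300, 700)
print(top_percent)      # 2.5
print(percentile_rank)  # 97.5
```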

According to the National Association of Home
Builders, the U.S. nationwide median selling price
of homes sold in 1995 was $118,000
◦ Would you expect the mean to be larger, smaller, or equal
to $118,000?
◦ Which of the following is the most plausible value for the
standard deviation?
(a) –15,000, (b) 1,000, (c) 45,000, (d) 1,000,000

Experiment
◦ Any activity from which an outcome, measurement, or other
such result is obtained

Random (or Chance) Experiment
◦ An experiment with the property that the outcome cannot
be predicted with certainty

Outcome
◦ Any possible result of an experiment

Sample Space
◦ Collection of all possible outcomes of an experiment

Event
◦ A specific collection of outcomes

Simple Event
◦ An event consisting of exactly one outcome
Examples:
Experiment                                     Sample Space     Event
1. Flip a coin
2. Flip a coin 3 times
3. Roll a die
4. Draw a SRS of size 50 from a population
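For the first three experiments the sample space is small enough to list. The Python sketch below enumerates them; the two events (`at_least_two_heads`, `even_roll`) are illustrative choices made for this sketch and are not taken from the lecture.

```python
from itertools import product

coin = ["H", "T"]

flip_once = list(coin)                                      # experiment 1
flip_three = ["".join(f) for f in product(coin, repeat=3)]  # experiment 2
roll_die = list(range(1, 7))                                # experiment 3

# Illustrative events: each is a specific collection of outcomes
at_least_two_heads = [o for o in flip_three if o.count("H") >= 2]
even_roll = [o for o in roll_die if o % 2 == 0]

print(flip_once)
print(flip_three)
print(roll_die)
print(at_least_two_heads)
print(even_roll)
```

For the fourth experiment the sample space is the collection of all possible samples of size 50, which is generally far too large to list.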


Let A denote an event
Complement of an event A
◦ Denoted by $A^C$, all the outcomes in the sample
space S that do not belong to the event A
◦ $P(A^C) = 1 - P(A)$

Example
◦ If someone completes 64% of his passes, then what
percentage is incomplete?
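The complement rule applied to the passing example, as a minimal Python sketch (the function name `complement` is an illustrative choice):

```python
def complement(p_a):
    """Probability that event A does not occur."""
    return 1 - p_a


p_complete = 0.64
p_incomplete = complement(p_complete)
print(round(p_incomplete, 2))  # 0.36, i.e. 36% of passes are incomplete
```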


Let A and B denote two events
Union of A and B
◦ A∪B
◦ All the outcomes in S that belong to at least one of
A or B

Intersection of A and B
◦ A∩B
◦ All the outcomes in S that belong to both A and B

Let A and B be two events in a sample space S
◦ P(A∪B)=P(A)+P(B)-P(A∩B)

Let A and B be two events in a sample space S
◦ P(A∪B)=P(A)+P(B)-P(A∩B)
 At State U, all first-year students must take chemistry
and math. Suppose 15% fail chemistry, 12% fail math,
and 5% fail both. Suppose a first-year student is
selected at random, what is the probability that the
student failed at least one course?
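A short Python sketch of the addition rule applied to this example. The function name `prob_union` is an illustrative choice.

```python
def prob_union(p_a, p_b, p_a_and_b):
    """P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


p_fail_chem = 0.15
p_fail_math = 0.12
p_fail_both = 0.05

p_fail_at_least_one = prob_union(p_fail_chem, p_fail_math, p_fail_both)
print(round(p_fail_at_least_one, 2))  # 0.22
```

Subtracting P(A∩B) corrects for the students who failed both courses, who would otherwise be counted twice.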


Let A and B denote two events
A and B are Disjoint (mutually exclusive)
events if there are no outcomes common to
both A and B
◦ A∩B=Ø
 Ø = empty set or null set

Let A and B be two disjoint
events in a sample space S
◦ P(A∪B)=P(A)+P(B)

The probability of an event occurring is
nothing more than a value between 0 and 1
◦ 0 implies the event will never occur
◦ 1 implies the event will always occur

How do we go about figuring out
probabilities?


Can be difficult
Different approaches to assigning probabilities to
events
◦ Subjective
◦ Objective
 Equally likely outcomes (classical approach)
 Relative frequency

Relies on a person to make a judgment on
how likely an event is to occur
◦ Events of interest are usually events that cannot be
replicated easily or cannot be modeled with the
equally likely outcomes approach
 As such, these values will most likely vary from person
to person

The only rule for a subjective probability is
that the probability of the event must be a
value in the interval [0,1]

The equally likely approach usually relies on
symmetry to assign probabilities to events
◦ As such, previous research or experiments are not
needed to determine the probabilities
 Suppose that an experiment has only n outcomes
 The equally likely approach to probability assigns a
probability of 1/n to each of the outcomes
 Further, if an event A is made up of m outcomes then
P(A) = m/n

Selecting a simple random sample of 2
individuals
◦ Each pair has an equal probability of being selected

Rolling a fair die
◦ Probability of rolling a “4” is 1/6
 This does not mean that whenever you roll the die 6
times, you always get exactly one “4”
◦ Probability of rolling an even number
 2, 4, and 6 are all even, so there are 3 possible outcomes in
the event we want to examine
 Thus the probability of rolling an even number is
3/6 = 1/2
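The die calculations can be sketched in Python with exact fractions, counting m outcomes in the event out of n equally likely outcomes. The function name `classical_probability` is an illustrative choice.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """P(A) = m/n when all n outcomes in the sample space are equally likely."""
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))


die = [1, 2, 3, 4, 5, 6]
p_four = classical_probability({4}, die)
p_even = classical_probability({2, 4, 6}, die)
print(p_four)  # 1/6
print(p_even)  # 1/2
```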

Borrows from calculus’ concept of the limit
$P(A) = \lim_{n \to \infty} \frac{a}{n}$
◦ We cannot repeat an experiment infinitely many
times so instead we use a ‘large’ n
 Process
 Repeat an experiment n times
 Record the number of times an event A occurs, denote this
value by a
 Calculate the value of a/n
$P(A) \approx \frac{a}{n}$

“large” n?
◦ Law of Large Numbers
 As the number of repetitions of a random experiment
increases, the chance that the relative frequency of
occurrence for an event will differ from the true
probability of the event by more than any small number
approaches 0
 Doing a large number of repetitions allows us to
accurately approximate the true probabilities using the
results of our repetitions
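A simulated die gives a feel for this behavior. The Python sketch below is illustrative only: it assumes a fair six-sided die simulated with the standard `random` module, the function name `relative_frequency` and the seed are arbitrary choices, and the printed values will differ with a different seed.

```python
import random


def relative_frequency(event, n, rng):
    """Roll a simulated fair die n times and return a/n for the given event."""
    a = sum(1 for _ in range(n) if rng.randint(1, 6) in event)
    return a / n


rng = random.Random(291)  # fixed seed so the run can be repeated
for n in (10, 100, 10_000, 100_000):
    # Relative frequency of rolling a "4"; the true probability is 1/6
    print(n, relative_frequency({4}, n, rng))
```

For small n the relative frequency can be far from 1/6, while for large n it tends to settle close to it.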