STAB22 Statistics I
Lecture 3

Describing Quantitative Data

Quantitative Data: variables measure numerical quantities (for which arithmetic operations make sense)

E.g. Breakfast cereal data (calories per serving)

name          calories
All-Bran      70
Apple Jacks   110
⁞             ⁞

Describe quantitative data using:

Plots: histogram, stem-&-leaf
Numerical summaries: mean, median, range, etc.

Histogram

Create artificial categories, called bins or classes, for values of quantitative variable

E.g. calorie classes: 40→60, 60→80, 80→100, ... (classes span all data values & don’t overlap)

Pretend variable is categorical (bins are categories) & create bar-plot

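The binning step can be sketched in Python as follows. This is only an illustration: apart from the two values the lecture lists (All-Bran = 70, Apple Jacks = 110), the calorie values are made-up placeholders, the helper name `bin_counts` is hypothetical, and treating each class as including its lower edge but not its upper edge is an assumption the slide does not state.

```python
from collections import Counter

def bin_counts(values, start, width):
    """Count how many values fall in each class [lo, lo + width)."""
    counts = Counter()
    for v in values:
        # Index of the class containing v; the lower edge is included (assumption).
        k = int((v - start) // width)
        lo = start + k * width
        counts[(lo, lo + width)] += 1
    return dict(sorted(counts.items()))

# Mostly hypothetical calorie values, for illustration only.
calories = [70, 110, 50, 90, 100, 110, 120, 110, 90, 100]
classes = bin_counts(calories, start=40, width=20)

# Text version of a relative-frequency histogram: one bar per class.
for (lo, hi), freq in classes.items():
    print(f"{lo}-{hi}: {'*' * freq} ({freq / len(calories):.0%})")
```

Each value lands in exactly one class, which is what "span all data values & don't overlap" requires.
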
Example

Cereal calories

StatCrunch: Graphics > Histogram

[Figure: Rel. freq. histogram]

Differences from bar-plots:

Categories (classes) are ordered
No gaps between bars (unless freq.=0)

Stem-and-Leaf Display

Split each numerical value in two parts (stem & leaf), along some level (0.1’s, 1’s, 10’s, 100’s etc)

E.g. 42.19, cut along 10’s

Stem (leading digits) = 4
Leaf (trailing digits) = 2.19 or 2

Put stems on the left column, and list leaves of every value to the right of their stem

Sort leaves within each stem from smallest to largest

Stems   Leaves
4       12445589
5       33566
6       3458

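The construction above can be sketched in a few lines of Python. The function name `stem_and_leaf` and its `leaf_unit` parameter are hypothetical, the sketch assumes non-negative values, and it keeps only one leaf digit per value (the "2" rather than "2.19" option). The data are the values read back off the small display above, deliberately shuffled.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Cut each value so the leaf is the leaf_unit digit and the stem is everything before it."""
    rows = defaultdict(list)
    for v in sorted(values):  # sorting the values sorts the leaves within each stem
        stem, leaf = divmod(int(v // leaf_unit), 10)  # extra trailing digits are dropped
        rows[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(rows.items())}

# Values recovered from the stems/leaves table, in scrambled order.
data = [56, 41, 63, 44, 53, 49, 45, 68, 42, 55, 44, 64, 53, 48, 65, 45, 56]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(stem, leaves)
```
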
Example

Cereal potassium

Stem-and-leaf gives info similar to histogram’s

S-&-L shows individual data values, but is not practical for very large data sets

E.g. What is maximum potassium content in data?

StatCrunch: Graphics > Stem and Leaf

Variable: potass
Decimal point is 2 digit(s) to the right of the colon.
0 : 002233333333444444444
0 : 55555666666778999999
1 : 00000001111111222233444
1 : 667799
2 : 034
2 : 68
3 : 23

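As a rough illustration of how to read the display above, the sketch below turns each stem/leaf pair back into an approximate value. It assumes that "decimal point is 2 digit(s) to the right of the colon" means each leaf is the tens digit, so values are recoverable only to the tens digit; the variable names are hypothetical.

```python
# Rows copied from the display above as (stem, leaves) pairs.
rows = [
    (0, "002233333333444444444"),
    (0, "55555666666778999999"),
    (1, "00000001111111222233444"),
    (1, "667799"),
    (2, "034"),
    (2, "68"),
    (3, "23"),
]

LEAF_UNIT = 10  # assumption: the leaf is the tens digit, the stem the hundreds digit

values = [(stem * 10 + int(leaf)) * LEAF_UNIT for stem, leaves in rows for leaf in leaves]
maximum = max(values)
print(maximum)  # 330
```

The maximum is simply the last leaf on the last row: stem 3, leaf 3, i.e. about 330.
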
Describing Quantitative Data
For any quantitative variable, want to describe overall pattern of its distribution. In particular, look at:

Shape: peaks (modes), symmetry, outliers
Centre: mean, median
Spread: range, standard deviation

Shape of Distribution
Modality: Check for modes (peaks), i.e. most frequently occurring values

[Figure: three histograms (Frequency vs. Variable): Unimodal (1 peak), Bimodal (2 peaks), Uniform (no peaks)]

Shape of Distribution

Symmetry: distribution is called symmetric if, when we draw a vertical line down its center, the two sides are similar in shape and size

[Figure: Symmetric Distributions, three histograms (Frequency vs. Variable)]

Shape of Distribution
Skewness: Unimodal distributions with one tail longer than the other are called skewed

[Figure: two histograms (Frequency vs. Variable): Skewed to the left (negatively skewed), Skewed to the right (positively skewed)]

Shape of Distribution
Outliers: Any data that lie far off the main body of the distribution (i.e. extreme values) are called outliers

[Figure: histogram (Frequency vs. Variable) with outliers marked]

Centre of Distribution

Mean: sum of all values of quantitative variable divided by the number of values

E.g. Amount customers spend at coffee shop; data: $3.5, $4.2, $6.7, $2.6, $5.1

\[ \text{mean} = \frac{3.5 + 4.2 + 6.7 + 2.6 + 5.1}{5} = \frac{22.1}{5} = 4.42 \]

Generally, for n sample values \(x_1, x_2, \ldots, x_n\), sample mean (denoted by \(\bar{x}\)) is given by

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n} \]

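The same calculation written as a short Python sketch, using the coffee shop data from the slide; the function name `sample_mean` is just an illustrative choice.

```python
def sample_mean(xs):
    """Sum of all values divided by the number of values."""
    return sum(xs) / len(xs)

spend = [3.5, 4.2, 6.7, 2.6, 5.1]  # coffee shop data from the slide
mean = sample_mean(spend)
print(round(mean, 2))  # 4.42
```
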
Mean

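Mean is “center of gravity” of data

[Figure: number line with data 2.6, 3.5, 4.2, 5.1, 6.7 balancing at mean = 4.42]

Good representative value
But, sensitive to outliers

E.g. data 2.6, 3.5, 4.2, 5.1, 15

\[ \text{mean} = \frac{2.6 + 3.5 + 4.2 + 5.1 + 15}{5} = \frac{30.4}{5} = 6.08 \]
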
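A small sketch of this sensitivity, using Python's standard `statistics` module. It assumes the second data set on the slide is the coffee shop data with 6.7 replaced by 15, which is what the listed values suggest.

```python
from statistics import mean

original = [2.6, 3.5, 4.2, 5.1, 6.7]
with_outlier = [2.6, 3.5, 4.2, 5.1, 15]  # assumed: 6.7 replaced by an extreme value

# Only one of the five values changed, yet the mean moves by (15 - 6.7) / 5.
shift = mean(with_outlier) - mean(original)
print(f"mean moves by {shift:.2f}")
```

With the outlier present, the mean sits above four of the five observations, so it no longer describes a typical customer.
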
Centre of Distribution
Median: midpoint of all values, after they are ordered from the smallest to the largest

E.g. Coffee shop data: 2.6, 3.5, 4.2, 5.1, 6.7
Median = 4.2

50% of values above & 50% below median

For even # of data (n=2,4,6, etc.), median is the mean of the two middle numbers

E.g. 2.6, 3.5, 4.2, 5.1

\[ \text{Median} = \frac{3.5 + 4.2}{2} = 3.85 \]

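Both cases (odd and even n) can be sketched in one small Python function. The name `median` is an illustrative choice; Python's standard library offers `statistics.median`, which follows the same rule.

```python
def median(xs):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd n: the single middle value
    return (s[mid - 1] + s[mid]) / 2  # even n: average the two middle values

odd_case = median([3.5, 4.2, 6.7, 2.6, 5.1])  # coffee shop data, n = 5
even_case = median([2.6, 3.5, 4.2, 5.1])      # n = 4
print(f"{odd_case:.2f} {even_case:.2f}")
```
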
Median
Median is robust to (less influenced by) extreme values

E.g.
2.6, 3.5, 4.2, 5.1, 6.7: Median = 4.2
2.6, 3.5, 4.2, 5.1, 15: Median = 4.2

Prefer median when data have outliers

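The comparison on this slide can be checked with a short sketch using Python's standard `statistics` module; the two lists are the data sets shown on the slide.

```python
from statistics import mean, median

original = [2.6, 3.5, 4.2, 5.1, 6.7]
with_outlier = [2.6, 3.5, 4.2, 5.1, 15]

# The middle value of the sorted data is the same in both sets,
# so the median does not move even though the largest value does.
same_median = median(original) == median(with_outlier)
mean_changed = mean(original) != mean(with_outlier)
print(same_median, mean_changed)  # True True
```
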
Mean, Median & Shape


For symmetric distributions: mean = median
For skewed distributions: mean ≠ median

One tail has more extreme observations than other

[Figure: three distributions: skewed to the left (Mean < Median), symmetric (Mean = Median), skewed to the right (Mean > Median)]

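A quick numerical illustration of these three cases. The data sets below are invented purely for illustration and do not come from the lecture.

```python
from statistics import mean, median

# Hypothetical samples chosen to have the three shapes.
left_skewed = [1, 7, 8, 8, 9, 9, 9, 10]   # long tail towards small values
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 10]  # long tail towards large values

for name, xs in [("left", left_skewed), ("symmetric", symmetric), ("right", right_skewed)]:
    print(f"{name}: mean = {mean(xs):.3f}, median = {median(xs):.3f}")
```

In each skewed sample the mean is pulled towards the long tail, while the median stays with the bulk of the data.
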
Questions
1. What measure of centre would you use on the following data: 0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4?
2. Changing the value of a single score in a data set will necessarily cause the mean to change. (T/F)
3. Changing the value of a single score in a data set will necessarily cause the median to change. (T/F)