Download Introduction - University of Toronto

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Categorical variable wikipedia , lookup

Misuse of statistics wikipedia , lookup

Time series wikipedia , lookup

Transcript
The Rise of Statistics
• Statistics is the science of collecting, organizing and interpreting
data. The goal of statistics is to gain information and
understanding from data.
• Statistics is the common bond supporting all other sciences.
• For more info on the history of statistics read:
A Very Brief History of Statistics, Howard W. Eves,
The College Mathematics Journal, Vol. 33, No. 4 (Sep., 2002),
pp. 306-308.
STA248 week 2
1
Elements of Statistics - Introduction
• Data are numerical facts with context and we need to understand
the context if we are to make sense of the numbers.
• A set of data contains some information about a group of
individuals.
• Individuals are the objects upon which we collect data.
Individuals can be people, animals, plots of land and many other
things.
• A population is a set of individuals that we are interested in
studying.
• A variable is any characteristic of an individual.
• A sample is a subset of the individuals of a population.
STA248 week 2
2
Questions to ask when planning a statistical study
• Why? What purpose do the data have? Do we hope to
answer some specific questions? Do we want to draw
conclusions about individuals other than the ones we actually
have data for?
• Who? What individuals do the data describe? How many
individuals appear in the data?
• What? How many variables do the data contain? Exact
definitions of these variables. What are the units of
measurements in which each variable is recorded? Weights
for example, might be recorded in pounds, or in kg.
STA248 week 2
3
Collecting Data
• Generally, data can be obtained in four different ways.
 Published source.
 Designed experiment.
 Survey.
 Observational study.
STA248 week 2
4
Simple Random Sample
• A simple random sample (SRS) of size n consists of n individuals
from the population chosen in such a way that every set of n
individuals has an equal chance to be the sample actually selected.
• How to select an SRS?
Hat method, Random number tables or software.
• SRS helps eliminate the problem of having the sample reflect a
different population than the one we are interested in studying.
STA248 week 2
5
Stratified Random Sampling
• Often, the sampling units are not homogeneous and naturally
divide themselves into nonoverlaping groups that are
homogeneous.
• These groups of similar individuals are called strata.
• A stratified random sample consist of separate SRS within
each stratum; these SRSs are combined to form the full
sample.
STA248 week 2
6
Data and Statistics
• Statistics is about data – collecting organizing, summarizing,
analyzing, using data to make decisions.
• A statistic is a number in context.
• Statistical inference is the connection between statistics and
probability.
• There is a random mechanism behind the data because only have
data on a random sample.
• Data: observations x1, …, xn are considered n realizations of random
variables X1,…, Xn or of 1 random variable X.
STA248 week 2
7
Displaying Distributions With Graphs
• Statistical tools and ideas help us examine data in order to
describe their main features. This examination is called
exploratory data analysis.
• Two basic strategies for exploration of data set:
 Begin by examining each variable by itself. Then move
on to study the relationships among the variables.
 Begin with graphs. Then add numerical summaries of
specified aspects of the data.
STA248 week 2
8
Describing Quantitative Data
• The pattern of variation of a variable is called its distribution.
• The distribution of a variable is best displayed graphically.
• There are three main graphical methods for describing
summarizing and detecting patterns in quantitative data:
 Dot plot
 Stem-and-leaf plot
 Histogram
STA248 week 2
9
Stemplots
To make a stemplot:
1.
Separate each observation into a stem consisting of all but
the final (rightmost) digit and a leaf, the final digit.
2.
Write the stems in a vertical column with the smallest at
the top, and draw a vertical line at the right of this column.
3.
Write each leaf in the row to the right of its stem, in
increasing order out from the stem.
STA248 week 2
10
Example
• Here are the scores of a basketball player (say player A)
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
Make a stemplot of these data. Describe the main
features of the distribution.
• Solution : Min. = 22, Max = 60 …
• R command: stem (scores, scale=2)
STA248 week 2
11
Examining a distribution
• In any graph of data, look for the overall pattern and for
striking deviations from that pattern.
• Overall pattern of a distribution can be described by its
shape, centre, and spread.
• An important kind of deviation is an outlier, an individual
value that falls outside the overall pattern.
• Some other things to look for in describing shape are:
 Does the distribution have one or several major peaks,
usually called modes? A distribution with one major
peak is called unimodal.
 Is it approximately symmetric or skewed in one direction.
STA248 week 2
12
Exercise
Describe the shape of the distributions summarized by the
following stemplot.
Stem-and-leaf of sta220 marks
Leaf Unit = 1.0
1
6 7
3
7 44
11
7 77888999
(11)
8 00011233444
20
8 555556666778
8
9 000001
2
9 7
1
10 0
STA248 week 2
N
= 42
13
Exercise
Describe the shape of the distributions summarized by the
following stemplots.
Stem-and-leaf of C1 N= 50
Leaf Unit = 0.10
18
(17)
15
8
4
2
2
1
1
0
0
1
1
2
2
3
3
4
000111122233334444
55555566667889999
0011444
5669
03
1
2
.
STA248 week 2
14
Histograms
•
A histogram breaks the range of values of a variable into
intervals and displays only the count or percent of the
observations that fall into each interval.
•
We can choose a convenient number of intervals.
•
Histograms do not display the actual values observed. (only
counts in each interval).
•
Example:
Here is some data on the number of days lost due to illness
of a group of employees:
47, 1, 55, 30, 1, 3, 7, 14, 7, 66, 34, 6, 10, 5, 12, 5, 3, 9, 18,
45, 5, 8, 44, 42, 46, 6, 4, 24, 24, 34, 11, 2, 3, 13, 5, 5, 3, 4,
4, 1
STA248 week 2
15
The main steps in constructing a histogram
1.
Determine the Range of the data (largest and smallest
values)
In our example the data ranges from a min.
of 1 day to a max. of 66 days.
2.
Decide on the number of intervals (or classes) , and the
width of each class (usually equal).
3.
Count the number of observations in each class. These
counts are called class frequencies.
4.
Draw the histogram.
STA248 week 2
16
Class
No. of employees
(Frequency)
0-10
10-20
20-30
30-40
40-50
50-60
60-70
23
5
3
2
5
1
1
Total
40
Cumulative. Frequency
23
28
31
33
38
39
40
Relative frequency
0.575
0.125
0.075
0.050
0.075
0.025
0.025
1.000
• A table with the first two columns above is called
frequency table or frequency distribution.
• A table with the first column and the third column is called
cumulative frequency distribution.
STA248 week 2
17
Frequency
20
10
0
0
10
20
30
40
50
60
70
days lost
• R command: hist(days)
STA248 week 2
18
Describing distributions with numbers
• A large number or numerical methods are available for
describing quantitative data sets. Most of these methods
measure one of two data characteristics:
 The central tendency or location of the set of
observations – it is the tendency of the data to cluster, or
center, about certain numerical values.
 The variability of the set of observation – it is the spread
of the data.
STA248 week 2
19
Measuring Center
• Two common measures of center are the mean and the median.
• These two measures behave differently.
• The mean is the “average value” and the median is the “middle
value”.
STA248 week 2
20
Measuring center: the median
•
The median, ~
x , is the midpoint of the distribution, the
number such that half the observations are smaller then it
and the other half are larger.
•
To find the median of a distribution:
1. Arrange the observations in order of size, from smallest
to largest.
2. If the number of observations n is odd, the median is the
center observation in the ordered list.
3. If the number of observations n is even, the median is the
average of the two center observations in the ordered list.
STA248 week 2
21
Example
The annual salaries (in thousands of $) of a random
sample of five employees of a company are:
40, 30, 25, 200, 28
Arranging the values in increasing order:
25 28 30 40 200
median = 30
Excluding 200 median = (28+30)/2=29.
STA248 week 2
22
Measuring center: mean
• If the n observations are x1,x2,…xn. The sample mean
given by
x1  x2      xn
mean  x 

n
x is
x
i
n
• Example
Find the mean of the following observations: 4, 5, 9, 3, 5.
Solution:
4  5  9  3  5 26
mean 

 5.2
5
5
STA248 week 2
23
Example
• The annual salaries (in thousands of $) of a random sample of
five employees of a company are: 40, 30, 25, 200, 28.
mean  40  30  25  200  28  323  64.6
5
5
If we exclude 200 as an outlier,
mean  40  30  25  28  123  30.75
4
4
• Mean is sensitive to the influence of a few extreme
observations. Because the mean cannot resist the influence of
extreme values, we say that it is NOT a resistant measure of
center.
STA248 week 2
24
Mean versus median
• The median and mean are the most common measures of the
center of a distribution.
• If the distribution is exactly symmetric, the mean and median
are exactly the same.
• Median is less influenced by extreme values.
• If the distribution is skewed to the right, then
mode < median < mean
• If the distribution is skewed to the left, then
mean < median < mode.
STA248 week 2
25
Trimmed mean
• Trimmed mean is a measure of the center that is more resistant
than the mean but uses more of the available information than
the median.
• To compute the 10% trimmed mean, discard the highest 10%
and the lowest 10% of the observations and compute the mean
of the remaining 80%. Similarly, we can compute 5%, 20%
etc. trimmed mean.
• Trimming eliminates the effect of a small number of outliers.
STA248 week 2
26
Example
• Compute the 10% trimmed mean of the data given below.
20 40 22 22 21 21 20 10 20 20
20 13 18 50 20 18 15
8 22 25
• Solution:
- Arrange the values in increasing order:
8 10 13 15 18 18 20
20 20 21 21 22 22 22
20
25
20
40
20
50
- There are 20 observations and 10% of 20 = 2. Hence, discard
the first 2 and the last 2 observations in the ordered data and
compute the mean of the remaining 16 values.
Variable
C2
N
16
Mean
19.812
STA248 week 2
27
Questions
1.
2.
3.
4.
You are asked to recommend a measure of center to
characterize the following data:
0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4.
What is your recommendation and why?
The mean is ____ sensitive to extreme values than the
median.
(a) more
(b) less
(c) equally
(d) can’t say without data
Changing the value of a single score in a data set will
necessarily cause the mean to change. (T/F)
Changing the value of a single score in a data set will
necessarily cause the median to change. (T/F)
STA248 week 2
28
Measuring Spread
• There are two main measures of spread that we will discuss;
the sample range and the sample standard deviation.
• The range (max-min) is a measure of spread but it is very
sensitive to the influence of extreme values.
• The range can be very useful in statistical quality control.
• The measure of spread that is used most often is the sample
standard deviation.
STA248 week 2
29
Measuring spread - Standard deviation
• The sample variance, s2, of a set of n observations
x1 , x2 ,..., xn is
1 n
 x i  x 2
s 

n  1 i 1
2
• The sample standard deviation, s, is the square root of the variance (s2).
i.e.
s  s2
• It can be shown that,
1  n 2
2
s 
xi  nx 


n  1  i 1

2
This formula is usually quicker.
STA248 week 2
30
• The deviations xi  x display the spread of the values xi about
their mean. Some of these deviations will be positive and some
negative because the observations fall on each side of the
mean.
• The sum of the deviations of the observations from their mean
will always be zero.
• Squaring the deviations makes them all positive, so that
observations far from the mean in either direction have large
positive squared deviations.
• The variance is the average of the squared deviations.
• The variance, s2, and the standard deviation, s, will be large if
the observations are widely spread about their mean, and small
if the observations are all close to the mean.
STA248 week 2
31
Example
• Find the standard deviation of the following data set: 4, 8, 2, 9, 7.
• Solution: n=5 ,
mean  x 
4  8  2  9  7 30

6
5
5
2 (86)2 (26)2 (96)2 (76)2 34
(4

6)
2
s 
  8.5
51
4
s  8.5  2.92
STA248 week 2
32
Properties of standard deviation (s)
• s measures the spread about the mean and should be used only
when the mean is chosen as the measure of center.
• s = 0 only when there is no spread. This happens only when all
observations have the same value. Otherwise, s > 0.
• s, like the mean , is not resistant to extreme values. A few
outliers can make s very large.
STA248 week 2
33
Ballpark approximation for s
• The ballpark approximation for the standard deviation s is
the Range/4 (divide by 3 if there are less then 10
observations, divide by 5 if there are more then 100
observations).
• For the data set 4, 8, 2, 9, 7, range = 9 – 2 = 7 and so
s  7  2.33
3
STA248 week 2
34
Percentiles
• The simplest useful numerical description of a distribution
consists of both a measure of center and a measure of spread.
• We can describe the spread or variability of a distribution by
giving several percentiles.
• The pth percentile of a distribution is the value such that p
percent of the observations are smaller or equal to it.
• The median is the 50th percentile.
• If a data set contains n observations, then the pth percentile is
p th value in the ordered data set.
the (n1)100
STA248 week 2
35
Example
• Find the 20th percentile of the data represented by the
following stem-and-leaf plot.
Stem-and-leaf of Rural
Leaf Unit = 1.0
1
2 1
5
3 3589
(12)
4 122333456788
12
5 112467
6
6 7
5
7 04
3
8 48
1
9
1
10 8
STA248 week 2
N = 29
N* = 7
36
Solution
Since n = 29 , p = 20 and (n1) p  (29 1) 20  6
100
100
then the 20th percentile is the 6th value from the top of
the stemplot, which is 41.
STA248 week 2
37
Quartiles
• The 25th percentile is called the first quartile (Q1).
• The first quartile (Q1) is the median of the observations whose
position in the ordered list is to the left of the location of the
overall median.
• The 75th percentile is called the third quartile (Q3).
• The third quartile (Q3) is the median of the observations whose
position in the ordered list is to the right of the location of the
overall median.
NOTE: The median is the second quartile Q2 .
STA248 week 2
38
Example
The highway mileages of 20 cars, arranged in increasing order are:
13 15 16 16 17 19 20 22 23 23 | 23 24 25 25 26 28 28 28
29 32.
The median is …
The first quartile Q1 is …
The third quartile Q3 is…
•
Exercise: Find
(a) the 10th percentile.
(b) the 90th percentile of the above data set.
STA248 week 2
39
Measuring Spread: IQR
• The range (max-min) is a measure of spread but it is very
sensitive to the influence of extreme values.
• The distance between the first and third quartiles is called the
Interquartile range (IQR) i.e. IQR =Q3 – Q1 .
• The IQR is another measure of spread that is less sensitive to
the influence of extreme values.
STA248 week 2
40
The five-number summary
• The five-number summary of a set of observations consists of
the smallest observation, the first quartile, the median, the
third quartile and the largest observation.
• These five numbers give a reasonably complete description of
both the center and the spread of the distribution.
• R commands: summary (data)
STA248 week 2
41
Example
• The highway mileages of 20 cars, arranged in increasing order are:
13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32.
Give the five number summary.
• Answer
Variable
mileage
N
20
Minimum
13.00
Q1
17.50
STA248 week 2
Median
23.00
Q3
27.50
Maximum
32.00
42
Box-plot
• A box-plot is a graph of the five-number summary.
• Example:
Make a box-plot for the data in the above example.
Boxplot of Mileages
Mileages
30
25
20
15
• R commands: Boxplot (data)
STA248 week 2
43
Outliers
An outlier is an observation that is usually large or small
relative to the other values in a data set. Outliers are typically
attributable to one of the following causes:
1. The observation is observed, recorded, or entered incorrectly.
2. The observation comes from a different population.
3. The observation is correct but represents a rare event.
STA248 week 2
44
The 1.5×IQR Criterion for outliers
• Call an observation a suspected outlier if it falls more than
1.5×IQR above the 3rd quartile or below the 1st quartile.
• Example
Consider the data given in the example on slide 42 (mileage
data with an extra observation of 66).
Variable
Mileages
N
21
Mean
24.67
Min
13
Q1
18
Median
23
Q3
28
Max
66
The IQR = 28-18 = 10 and the largest observation, 66, falls
more than 1.5×IQR above Q3 and therefore is an outlier.
STA248 week 2
45