STP 420 SUMMER 2002
STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 1 - DATA
CHAPTER 1
LOOKING AT DATA - DISTRIBUTIONS
Individuals – objects described by a set of data (people, animals, things)
- all the data for one individual make up a case
Variable – any characteristic of an individual (may take different values for different
individuals).
Categorical variable – places an individual into one of several groups/categories.
Quantitative variable – takes numerical values for which arithmetic operations
(adding/averaging) make sense.
Distribution – tells us what values a variable takes and how often these values are taken.
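As a minimal sketch of what "what values a variable takes and how often" means for a categorical variable, the following counts each category and converts the counts to percents. The data and the function name `distribution` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical categorical data: one value per individual.
majors = ["math", "biology", "math", "physics", "biology", "math"]

def distribution(values):
    """Return {value: (count, percent)} for a list of observations."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, 100 * c / n) for v, c in counts.items()}

dist = distribution(majors)
for value, (count, percent) in sorted(dist.items()):
    print(f"{value:10s} {count:3d} {percent:5.1f}%")
```

The counts are what a bar graph would display, and the percents are what a pie chart would display.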
1.1 Displaying Distributions with Graphs
Exploratory data analysis – use statistical tools (graphs and numerical summaries) and
ideas to help examine data and describe their main features
- examine each variable and the relationships among variables
- construct graphs and add numerical summaries
Graphs for categorical variables
Bar graph
- order of the bars is not important
Pie chart
- must have all parts that make up the whole
Measuring speed of light
Newcomb experiment
Measurement – depends on the instrument used to make the measurement
- appropriateness of measurement for purpose
Variation – difference in measurements may be due to many factors
Distribution - the pattern of variation of a variable
The distribution of a quantitative variable records its numerical values and how often
each value occurs
Stemplot – gives quick picture of a distribution while including the actual numerical
values in the graph
1. Separate each observation into a stem, consisting of all but the final (rightmost)
digit, and a leaf, the final digit. A stem may have one, two, or more digits, but each
leaf has only one digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a
vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the
stem.
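The three steps above can be sketched in a few lines of Python. This is an illustration only: it assumes non-negative integer observations, and the function name `stemplot` and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return the text lines of a stemplot for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        # Step 1: the stem is all but the last digit, the leaf is the last digit.
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Step 2: list every stem from smallest to largest, even stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: leaves are already in increasing order because values were sorted.
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {row}")
    return lines

# Hypothetical data for illustration only.
data = [24, 28, 31, 16, 36, 32, 24, 40, 52, 29]
for line in stemplot(data):
    print(line)
```

Keeping the empty stems in the column preserves gaps in the data, which is part of the shape the plot is meant to show.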
Back-to-back stemplot – uses one stem and two sets of leaves, one on either side of the
stem, to make it easier to compare two data sets.
The number of stems can be doubled by splitting the stem in two; one with leaves from 0
to 4 and the other with leaves 5 to 9.
Good idea to round off numbers to only a few digits before trying to make a stemplot
(lose some accuracy in measurements)
Examining a distribution
1. In any graph of data, look for the overall pattern and for striking deviations
from that pattern.
2. Describe the overall pattern of a distribution by its shape, center, and
spread.
3. An important kind of deviation is an outlier, an individual value that falls outside
the overall pattern.
Mode(s) – observation(s) that occurs most often
- shown by the major peak(s) in the graph
Unimodal – distribution with one major peak
Symmetric distribution – values smaller and larger than its midpoint are mirror images
of each other
Skewed to the right – right tail (larger values) longer than left tail (smaller values)
Skewed to the left – left tail (smaller values) longer than right tail (larger values)
Histogram – breaks the range of values of a variable into intervals (of equal width) and
displays only the count (frequency) or percent (relative frequency) of the observations
that fall into each interval
Frequency table – table showing the intervals with their respective frequencies/relative
frequencies
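A frequency table for equal-width intervals can be sketched as follows. The starting point, interval width, and data are assumptions made for illustration, and `frequency_table` is a hypothetical helper, not part of the course materials. Each interval includes its left endpoint and excludes its right endpoint.

```python
def frequency_table(values, start, width, n_classes):
    """Count observations in n_classes equal-width intervals [lo, hi)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)   # index of the interval containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    n = len(values)
    # Each row: (lower limit, upper limit, frequency, relative frequency in %).
    return [
        (start + k * width, start + (k + 1) * width, c, 100 * c / n)
        for k, c in enumerate(counts)
    ]

# Hypothetical data for illustration only.
data = [12, 15, 21, 22, 23, 27, 31, 34, 35, 48]
table = frequency_table(data, start=10, width=10, n_classes=4)
for lo, hi, count, percent in table:
    # The row of asterisks is a crude text histogram.
    print(f"{lo:3d} to <{hi:3d}  {count:2d}  {percent:5.1f}%  {'*' * count}")
```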
Roundoff error – may sometimes be significant
Looking at data: a histogram helps to show the shape, center, and spread of a
distribution, and to reveal outliers.
Time plots – plotting the measurements in the order that they are observed (over time).
Time series – measurements of a variable taken at regular intervals over time
- examples: economic/social data
Seasonal variation – a pattern in a time series that repeats itself at known regular
intervals of time
Trend – persistent long-term rise or fall
Monthly consumer price index for some product
Index number – nationwide average price (less variable than the price at any one store
that may from time to time offer special prices)
Seasonally adjusted – helps to avoid misinterpretation especially for short periods of
time.
Decomposing time series
Statistical software programs can help to examine a time series by decomposing the data
into systematic patterns, such as trends and seasonal variation, and the residuals that
remain after we remove these patterns.
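The notes do not say which decomposition method the software uses. As one possible illustration, the sketch below uses a classical additive decomposition (series = trend + seasonal + residual) with a centered moving average for the trend. The function name `decompose` and the artificial series are assumptions made for this example.

```python
def decompose(series, period):
    """Additive decomposition: series = trend + seasonal + residual."""
    n = len(series)
    half = period // 2
    trend = [None] * n   # undefined near the ends, where the window does not fit
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: the two end points get half weight each.
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[t] = total / period
    # Average detrended value for each position in the seasonal cycle.
    seasonal_means = []
    for k in range(period):
        detrended = [series[t] - trend[t]
                     for t in range(k, n, period) if trend[t] is not None]
        seasonal_means.append(sum(detrended) / len(detrended))
    center = sum(seasonal_means) / period   # force seasonal effects to average 0
    seasonal = [seasonal_means[t % period] - center for t in range(n)]
    residual = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                for t in range(n)]
    return trend, seasonal, residual

# Artificial quarterly series: linear trend plus a repeating seasonal pattern.
pattern = [2, -1, -2, 1]
series = [10 + t + pattern[t % 4] for t in range(12)]
trend, seasonal, residual = decompose(series, period=4)
print(trend)
print(seasonal[:4])
```

Because the artificial series contains no noise, the residuals here are zero wherever the trend is defined. With real data they would show whatever the trend and seasonal pattern fail to explain.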
1.2 Describing Distributions with Numbers
Measures of center
1. Mean = $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum x_i$
2. Median = M
The median is the midpoint of the distribution, the number such that half the
observations are smaller and the other half are larger.
To find the median:
1. Arrange the observations in increasing order.
2. If the number of observations n is odd, the median is the center observation, at
position (n+1)/2 in the ordered list.
3. If the number of observations n is even, the median is the mean of the two center
observations in the ordered list. The position (n+1)/2 then falls halfway between these
two observations.
The mean is affected by extreme observations whereas the median is not affected, hence
the median is called a resistant measure and the mean is not resistant.
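A short sketch of the mean, the median rule, and resistance. The data are made up for illustration: replacing the largest observation with an extreme value moves the mean a great deal and leaves the median unchanged.

```python
def mean(xs):
    """Sum of the observations divided by their number."""
    return sum(xs) / len(xs)

def median(xs):
    """Midpoint of the ordered observations."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the two center observations

# Hypothetical data; the second list replaces 8 with an extreme value.
data = [2, 3, 5, 7, 8]
with_outlier = [2, 3, 5, 7, 80]

print(mean(data), median(data))
print(mean(with_outlier), median(with_outlier))
```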
Measuring spread: Quartiles
Quartiles divide the distribution into 4 equal parts
To calculate the quartiles:
1. Arrange the observations in increasing order and find the median (the same as Q2, the
second quartile). 50% of the observations are to its left.
2. The first quartile (Q1) is the median of the observations to the left of the overall
median. 25% of the observations are to its left.
3. The third quartile (Q3) is the median of the observations to the right of the overall
median. 75% of the observations are to its left.
Percentiles divide the distribution into 100 equal parts
25%ile = Q1
50%ile = Q2 = M
75%ile = Q3
Range is the highest score minus the lowest score.
Interquartile range is the third quartile minus the first quartile.
IQR = Q3 – Q1
An observation is a suspected outlier if it falls more than 1.5 × IQR above Q3 or below
Q1.
The five-number summary includes
Minimum
Q1
M = Q2
Q3
Maximum
in the given order.
Boxplot – graph of the five number summary with suspected outliers plotted individually
- useful in comparing distributions
1. Central box spans the quartiles
2. A line in the box marks the median
3. Observations more than 1.5 × IQR above Q3 or below Q1 are plotted as individual
outliers
4. Lines extend from the box out to the smallest and largest observations that are not
suspected outliers.
The variance $s^2$ of a set of observations is the average of the squares of the deviations of
the observations from their mean.
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
Hence, the standard deviation is
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
$x_1$ to $x_n$ are the observations and
n - 1 is the degrees of freedom
Properties
1. s measures spread about the mean and should be used only when the mean is
chosen as the measure of center.
2. s = 0 only when there is no spread, that is, when all observations have the same
value. Otherwise s > 0, and s grows as the observations become more spread out about
the mean.
3. s, like the mean, is not resistant. A few outliers can make s very large.
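A minimal sketch of the variance and standard deviation with the n - 1 divisor, using made-up data. It also illustrates two of the properties above: s is 0 when all observations are equal, and a single outlier can inflate s.

```python
import math

def variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def std_dev(xs):
    """Standard deviation s: the square root of the variance."""
    return math.sqrt(variance(xs))

# Hypothetical data for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data), std_dev(data))

print(std_dev([3, 3, 3]))       # no spread at all
print(std_dev(data + [100]))    # one outlier makes s much larger
```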
A linear transformation changes the original variable x into a new variable
xnew = a + bx (equation of a straight line)
- the constant a shifts all the values of x upward or downward by a units
- the positive constant b changes the size of the unit of measurement
- linear transformations do not change the shape of a distribution
Effect of a linear transformation
To see the effects of a linear transformation on measures of center and spread, apply
these rules:
1. Multiplying each observation by a positive number b multiplies both measures of
center (mean and median) and measures of spread (interquartile range and
standard deviation) by b.
2. Adding the same number a (positive or negative) to each observation adds a to measures of
center and to quartiles and other percentiles but does not change measures of
spread.
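The two rules above can be checked numerically. The sketch below uses the conversion from degrees Celsius to degrees Fahrenheit (a = 32, b = 1.8) as an example of xnew = a + bx. The temperatures are made up, and `transform` is a hypothetical helper.

```python
import statistics

def transform(xs, a, b):
    """Apply the linear transformation xnew = a + b * x to every observation."""
    return [a + b * x for x in xs]

# Hypothetical temperatures in degrees Celsius.
celsius = [10.0, 12.0, 15.0, 20.0, 23.0]
fahrenheit = transform(celsius, a=32.0, b=1.8)

# Measures of center are multiplied by b and shifted by a.
print(statistics.mean(fahrenheit), 32.0 + 1.8 * statistics.mean(celsius))
print(statistics.median(fahrenheit), 32.0 + 1.8 * statistics.median(celsius))

# Measures of spread are multiplied by b; adding a does not change them.
print(statistics.stdev(fahrenheit), 1.8 * statistics.stdev(celsius))
```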
1.3 The Normal Distributions
Strategy for exploring data
1. Always plot the data (stemplot or histogram)
2. Look for the overall pattern and striking deviations (outliers)
3. Calculate a numerical summary to describe center and spread
4. Draw a smooth curve approximately through the tops of the bars in the histogram.
A density curve is a curve that
1. is always on or above the horizontal axis
2. has area exactly 1 underneath it
It describes the overall pattern of a distribution.
The area under the curve and above any range of values is the relative frequency of all
observations that fall in that range.
Measuring center and spread for density curves
If the curve is symmetric with a single peak, the mean, median, and mode all fall at the
same x value, under the highest point of the curve.
Median and mean of a density curve
1. The median has an area of 0.5 on each side
2. The mean is the balance point
3. If skewed to the right, the measures are in the order mode, median, mean (the
mean is pulled to the right). If skewed to the left, the measures are in the order
mean, median, mode (the mean is pulled to the left).
The mean of a population (idealized distribution) is μ.
The standard deviation of a population (idealized distribution) is σ.
The normal curve has equation:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$
The 68-95-99.7 rule
In the normal distribution with mean μ and standard deviation σ, approximately
1. 68% of the observations fall within σ of the mean μ
2. 95% of the observations fall within 2σ of the mean μ
3. 99.7% of the observations fall within 3σ of the mean μ
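The three percentages can be checked with Python's standard library, which provides the normal cumulative distribution through `statistics.NormalDist` (Python 3.8 or later). The helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2))
```

The result does not depend on the particular values of the mean and standard deviation, which is why one rule serves every normal distribution.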
Standardizing observations
If x is an observation from a distribution that has mean μ and standard deviation σ, the
standardized value of x is
$$z = \frac{x - \mu}{\sigma}$$
called a z-score.
Standard normal distribution - N(0, 1): mean 0 and standard deviation 1
If the variable X has any normal distribution N(μ, σ) with mean μ and standard deviation σ,
then the standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has a standard normal distribution.
The standard normal table gives the area under the curve to the left of the z-score value.
This is often interpreted as a probability.
It is important that all X variables are standardized in order to use the standard normal
tables to compute probabilities.
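A small sketch of standardizing an observation and then finding the area to its left, which is the number a standard normal table would give. The mean of 100, the standard deviation of 15, and the observation 130 are made-up values. `statistics.NormalDist` is part of the Python standard library (3.8 or later).

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardized value of x for a distribution with mean mu and std. dev. sigma."""
    return (x - mu) / sigma

def area_to_left(z):
    """Area under the standard normal curve to the left of z (the table entry)."""
    return NormalDist().cdf(z)

# Hypothetical example: X is N(100, 15) and we observe x = 130.
z = z_score(130, mu=100, sigma=15)
print(z, round(area_to_left(z), 4))

# The area to the right is 1 minus the area to the left.
print(round(1 - area_to_left(z), 4))
```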
Normal quantile plot
- very sensitive way to assess normality, however, not easily done by hand
- computer software programs allow us to construct a more accurate plot without
taking much time
If the points on a normal quantile plot lie close to a straight line, the plot indicates that
the data are normal. Systematic deviations from a straight line indicate a nonnormal
distribution. Outliers appear as points that are far away from the overall pattern of the
plot.
To construct the normal quantile plot
1. Arrange the observed data values from smallest to largest. Record what percentile
of the data each value occupies. E.g., for 20 observations, the first is at the 5%
point, the next is at the 10% point, and so on.
2. Find the z-score for each of the percentiles. E.g., z = -1.645 is the 5% point of the
standard normal distribution.
3. Plot each data point x against the corresponding z. If the data distribution is close
to standard normal, the plotted points will lie close to the 45° line x = z. If the data
distribution is close to any normal distribution, the plotted points will lie close to
some straight line.
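The construction can be sketched as follows. One detail is an assumption that departs from the notes: the sketch places the i-th ordered value at percentile (i - 0.5)/n rather than i/n, because the 100% point has no finite z-score. Software packages use various similar adjustments. The function name and data are hypothetical, and the code only computes the points to be plotted.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (z, x) pairs for a normal quantile plot."""
    ordered = sorted(data)                 # step 1: smallest to largest
    n = len(ordered)
    std_normal = NormalDist()              # standard normal, N(0, 1)
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n                  # assumed plotting position (avoids p = 1)
        z = std_normal.inv_cdf(p)          # step 2: z-score for that percentile
        points.append((z, x))              # step 3: plot x against z
    return points

# Hypothetical data for illustration only.
data = [4.1, 5.3, 3.8, 6.0, 4.9, 5.1, 4.4]
for z, x in normal_quantile_points(data):
    print(f"{z:6.3f}  {x}")
```

If these points fall close to a straight line when plotted, a normal distribution is a reasonable description of the data.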
Granularity – when plotted points appear to form horizontal segments in the normal quantile
plot, because several observations share the same recorded value.
This does not prevent us from adopting a normal distribution for the data.
- It could be avoided if the measurements were recorded more precisely.