STT200
Chapter 1-6
KK AM
Chapter 1 Intro to Stats
Definitions
Data are observations (such as measurements, genders, survey
responses) that have been collected.
Statistics is a collection of methods for planning experiments, obtaining
data, and then organizing, summarizing, presenting, analyzing,
interpreting, and drawing conclusions based on the data.
Population is the complete collection of all elements (scores, people,
measurements, and so on) to be studied. The collection is complete in the
sense that it includes all subjects to be studied.
Parameter is a numerical measurement describing some characteristic of
a population.
Census is the collection of data from every member of the population.
Sample is a subset of a population.
An important activity in this class is to demonstrate how we can use
sample data to form conclusions about populations.
It is extremely critical to obtain sample data that are representative of the
population from which the data are drawn.
A statistic is a numerical measurement describing some characteristic of
a sample.
A survey is one of many tools that can be used for collecting data.
A common goal of a survey is to collect data from a small part of a larger
group so that we can learn something about the larger group.
Example: Identify a sample and population, a statistic and a
parameter:
A Gallup Poll asked this of 1087 adults: “Do you have occasion to use
alcoholic beverages such as liquor, wine, or beer, or are you a total
abstainer?”
The 1087 survey subjects are a sample, while the population
consists of the entire collection of all 302,682,345 adult Americans
(or whatever is the exact number for now).
A parameter: the proportion of ALL adult Americans who use
alcoholic beverages.
A statistic: the proportion of those surveyed Americans who use
alcoholic beverages.
Sample data must be collected in an appropriate way, such as through a
process of random selection.
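The difference between a parameter and a statistic can be sketched in Python. This is only an illustration: the population below is simulated, and the 64% figure is a made-up assumption, not a Gallup result.

```python
import random

random.seed(200)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = uses alcoholic beverages, 0 = total abstainer.
# The 64% share is invented for illustration only.
population = [1] * 64_000 + [0] * 36_000
parameter = sum(population) / len(population)   # proportion in the whole population

# Simple random sample of 1087 individuals (the survey size in the example).
sample = random.sample(population, 1087)
statistic = sum(sample) / len(sample)           # proportion in the sample

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

With random selection the statistic usually lands close to the parameter, which is why a sample can tell us something about the population.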
Chapter 2 – Data
Data - recorded information together with its context.
Context - tells Who, What, When, Where, How and Why something is measured
1. Who – individuals about whom data are collected (participants,
respondents, subjects, experimental units, records, cases)
2. What – characteristics recorded about each individual (variables)
Variables:
quantitative or numerical (measured in units)
qualitative or categorical (labels)
3. When – time
4. Where – place
5. How – method of collecting data
6. Why – purpose of study
Classwork Ch 2
Homework tips:
First complete your lecture notes, marking everything that is unclear. Read
the textbook, paying attention to the examples, and get help with any parts
that remain unclear. Then do the assigned homework problems listed in the
Class Schedule.
Chapter 3 Categorical Data
Example: For each of 2201 people on the Titanic the following variables
were recorded: Ticket Class (First, Second, Third, Crew), Survival (Dead
or Alive), Age (Adult or Child) and Sex (Male or Female).
ONE VARIABLE
Who = people on Titanic
What = Ticket Class
Frequency Table = list of categories and counts or percentage of
observations of each category.
Class     Count (frequency)   Relative frequency
First     325                 325/2201 = 0.1477 (14.766%)
Second    285
Third     706
Crew      885
Total     2201                100% = 1
Distribution of a variable is the list of possible values of the variable and
corresponding relative frequencies.
Graphical displays of the distribution of categorical data:
1. Bar chart
2. Pie chart
The area of a bar or a slice should correspond to the frequency of a
category.
Class     Count (frequency)   % (relative frequency)
First     325                 14.8
Second    285                 12.9
Third     706                 32.1
Crew      885                 40.2
Total     2201                100
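The relative frequencies in the table can be checked with a few lines of Python. This is a minimal sketch; the variable names are chosen here for illustration.

```python
# Counts by ticket class, taken from the frequency table above.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}
total = sum(counts.values())

# Relative frequency = count / total, shown as a percentage rounded to tenths.
percents = {cls: round(100 * n / total, 1) for cls, n in counts.items()}

for cls, pct in percents.items():
    print(f"{cls:<7}{counts[cls]:>5}{pct:>7}%")
print(f"{'Total':<7}{total:>5}")
```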
Displaying categorical (qualitative) data:
The most common displays for categorical data are the bar graphs and
pie graphs. I used EXCEL to build the bar graphs and a pie graph for the
data above:
[Figure: two bar graphs by ticket class (First, Second, Third, Crew), one of relative frequency (axis 0.00 to 0.60) and one of count (axis 0 to 1000)]
Constructing the pie graph:
Convert each relative frequency into a central angle of the circle by
multiplying it by 360°:
Central angle = relative frequency x 360°
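As a quick check of the central-angle rule, here is a small Python sketch applied to the Titanic ticket-class counts (names are illustrative):

```python
# Counts by ticket class from the frequency table.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}
total = sum(counts.values())

# Central angle = relative frequency x 360 degrees.
angles = {cls: n / total * 360 for cls, n in counts.items()}

for cls, angle in angles.items():
    print(f"{cls:<7}{angle:6.1f} degrees")
print(f"Sum of angles: {sum(angles.values()):.1f} degrees")
```

The angles must add up to 360 degrees, because the relative frequencies add up to 1.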
[Figure: bar chart of counts by ticket class (First, Second, Third, Crew; vertical axis 0 to 1000) and pie chart of the same data (First 15%, Second 13%, Third 32%, Crew 40%)]
Classwork Chapter 3
Displaying TWO VARIABLES with a Contingency Table
Example: For each of 2201 people on the Titanic the following variables
were recorded: Ticket Class (First, Second, Third, Crew) and Survival
(Dead or Alive).
Who = people on Titanic
What = (1) Ticket Class and (2) Survival
Why = did the chance of surviving depend on ticket class?
Contingency Table
                  Survival
Ticket Class    Alive    Dead    Total
First           202      123     325
Second          118      167     285
Third           178      528     706
Crew            212      673     885
Total           710      1491    2201
The Total row and the Total column give the marginal distributions.
CONDITIONAL DISTRIBUTIONS
The distribution of one variable, given a value of another variable, is
called a conditional distribution. It can be expressed as percentages of a
column or percentages of a row.
Example: what percent of the survivors had First Class tickets?
What percent of the passengers with First Class tickets survived?
…
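The two example questions use the same count (202) with different totals. A minimal Python sketch, with the contingency table typed in as a dictionary (a layout chosen here for illustration):

```python
# Contingency table: ticket class by survival.
table = {
    "First":  {"Alive": 202, "Dead": 123},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
    "Crew":   {"Alive": 212, "Dead": 673},
}

alive_total = sum(row["Alive"] for row in table.values())   # column total
first_total = sum(table["First"].values())                  # row total

# Percentage of a column: among the survivors, how many held First Class tickets?
pct_first_among_survivors = 100 * table["First"]["Alive"] / alive_total
# Percentage of a row: among First Class passengers, how many survived?
pct_survived_among_first = 100 * table["First"]["Alive"] / first_total

print(f"{pct_first_among_survivors:.1f}% of survivors had First Class tickets")
print(f"{pct_survived_among_first:.1f}% of First Class passengers survived")
```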
Segmented Bar Chart
[Figure: segmented bar chart on a 0% to 100% scale; each bar, including those for Alive and Dead, is divided into First, Second, Third and Crew segments]
Two variables are independent if the conditional distribution of one
variable, for each category of the other, is the same as its marginal
distribution.
Is the variable SURVIVAL independent or dependent on TICKET CLASS?
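One way to explore this question is to compare the survival rate within each ticket class (a conditional distribution) with the overall survival rate (the marginal distribution). The sketch below only describes the data in the table; it is not a formal test.

```python
# Contingency table: ticket class by survival.
table = {
    "First":  {"Alive": 202, "Dead": 123},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
    "Crew":   {"Alive": 212, "Dead": 673},
}

grand_total = sum(sum(row.values()) for row in table.values())
overall_rate = sum(row["Alive"] for row in table.values()) / grand_total

# Conditional survival rate within each ticket class.
rates = {cls: row["Alive"] / sum(row.values()) for cls, row in table.items()}

print(f"Overall: {100 * overall_rate:.1f}% survived")
for cls, rate in rates.items():
    print(f"{cls:<7}{100 * rate:5.1f}% survived")
```

If the variables were independent, every class would show roughly the overall rate; here the rates differ widely from class to class.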
Classwork Ch 3: 32
Chapter 4 – Displaying and Summarizing Quantitative Data
Histogram
Example: Consider the following set of raw data which are the high
temperatures recorded for 30 consecutive days. Summarize this data by
creating a frequency distribution of the temperatures.
Data Set - High Temperatures for 30 Days
90 85 89 90 83 89 90 89 85 89
87 87 84 81 82 83 86 86 90 82
81 82 83 84 89 85 86 85 81 89
A histogram is a special kind of a bar graph in which the horizontal scale
represents the classes of data values and the vertical scale represents the
frequencies. There are no gaps between the bars, and the widths of the
bars are usually equal.
The frequency distribution table:
Frequency Distribution for High Temperatures
Temperature   Tally   Frequency   Relative Frequency   Cumulative Frequency
81            ///     3           0.10                 3
82            ///     3                                6
83            ///     3                                9
84            ////    4                                etc.
85            ///     3
86            ///     3
87            //      2
88                    0
89            /////   5
90            ////    4
Total:                30          1
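The remaining relative and cumulative frequencies follow the same pattern as the first rows. A short Python sketch, starting from the Frequency column of the table above:

```python
from itertools import accumulate

temps = list(range(81, 91))                 # temperatures 81 through 90
freqs = [3, 3, 3, 4, 3, 3, 2, 0, 5, 4]      # Frequency column of the table

n = sum(freqs)
rel = [f / n for f in freqs]                # relative frequency = frequency / n
cum = list(accumulate(freqs))               # running total of the frequencies

for t, f, r, c in zip(temps, freqs, rel, cum):
    print(f"{t}  {f}  {r:.2f}  {c}")
print(f"Total: {n}  {sum(rel):.2f}")
```

The last cumulative frequency always equals the number of observations.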
Displays of data are clearer when the data are first grouped into sets called
classes or bins.
Making a histogram:
1. First, make the frequency table by slicing up the entire span of values
covered by the quantitative variable into equal-width piles called classes or
bins.
2. The bins and the counts in each bin give the distribution of the
quantitative variable.
3. A histogram plots the bin counts as the heights of bars (like a bar chart).
4. A relative frequency histogram displays the percentage of cases in each
bin instead of the count. In this way, relative frequency histograms have
the total area of all bars equal to 100% (1 square unit, if you use fractions).
Class limits   Frequency   Relative Frequency
81-82          6           20.0%
83-84          7           23.3%
85-86          6           20.0%
87-88          …           …
89-90          …           …
Total:         30          100%
[Figure: frequency histogram of the high temperatures by class (81-82, 83-84, 85-86, 87-88, 89-90), vertical axis 0 to 10]
A histogram built from the relative frequency table has exactly the same
shape.
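The grouping step can be sketched in Python. The sketch assumes the per-temperature frequencies from the frequency distribution table and a bin width of 2; the text "histogram" drawn with # marks is only a rough stand-in for a real plot.

```python
# Frequencies per temperature, from the frequency distribution table.
freqs = {81: 3, 82: 3, 83: 3, 84: 4, 85: 3, 86: 3, 87: 2, 88: 0, 89: 5, 90: 4}
width = 2                       # equal-width classes: 81-82, 83-84, ...
low = min(freqs)

classes = {}
for temp, f in freqs.items():
    start = low + ((temp - low) // width) * width   # lower class limit
    label = f"{start}-{start + width - 1}"
    classes[label] = classes.get(label, 0) + f

n = sum(classes.values())
for label, f in classes.items():
    # bar length = count; the percentage is the relative frequency
    print(f"{label}  {'#' * f:<10} {f:2d}  {100 * f / n:.1f}%")
```

Changing `width` changes the classes and therefore the appearance of the histogram.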
Graphs convey information about the distribution of the data:
• shape,
• center,
• spread,
• possible outliers.
Example: More graphs
Given are test scores:
35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94.
Make a stem-and-leaf display, and a dot-plot.
a) Make a Stem-and-leaf
Stem: Leaves
3
4
5
6
7
8
9
1. First, cut each data value into leading digits (“stems”) and trailing digits
(“leaves”).
2. Use the stems to label the bins.
3. Use only one digit for each leaf—either round or truncate the data
values to one decimal place after the stem.
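The three steps can be sketched in Python for the test scores of this example. The tens digit is used as the stem and the ones digit as the leaf; the output layout is a choice made here, not a required format.

```python
scores = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
          62, 64, 68, 71, 72, 76, 80, 89, 94]

stems = {}
for s in sorted(scores):
    stem, leaf = divmod(s, 10)          # cut each value into stem and leaf
    stems.setdefault(stem, []).append(leaf)

# Print every stem in the range, even one that has no leaves.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

Turned on its side, the display looks like a histogram with bins of width 10.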
b) Make a Dotplot:
___________________________________
Shapes of distributions:
Normal distribution: a symmetric, bell-shaped graph that plays a
special role in statistics.
Other Shapes of distribution:
Symmetric:
Uniform distribution
Bimodal:
Skewed distribution:
Skewed to the left
Skewed to the right
The outliers are the data far away from the main group.
The shape of a histogram can be described by a smooth curve that roughly
follows the tops of the bars. This can be done by eye or by numerical
algorithms on the computer. The shape often follows certain patterns:
• symmetric (uniform, bell-shaped, etc.)
• skewed to the right or left
• outliers - unusual observations which do not fit the rest of the data
• unimodal (one major peak) or bimodal (two major peaks)
• large or small spread (is the histogram wide or narrow?)
Avoid Common Errors:
• Don’t make a histogram of a categorical variable; bar charts or pie
charts should be used for categorical data.
• Don’t look for shape, center, and spread
of a bar chart.
• Choose the number of bins (a bin width) appropriate to the data.
Changing the bin width changes the appearance of the histogram.
Describing Distribution Numerically
Measures of the center:
• mean
• median
Example: Find the measures of center for the data: 45 46 49 35 76 80 89
94 37 61 62 64 68 56 57 57 59 71 72
Sorted Data:
35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94
Min = 35, Max = 94, n = 19
• mean = the average value = sum/n: $\bar{y} = \frac{\sum y}{n}$
Mean = (35+37+...+94)/19 = 1178/19 = 62
• median = the middle value (or the average of two middle
values)
Median = 61
Note: For skewed distributions the median is a better measure of the
center than the mean.
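The mean and median above can be verified with Python's standard `statistics` module. The second data set is hypothetical: the largest score is replaced by 940 only to show how a single extreme value pulls the mean but not the median.

```python
from statistics import mean, median

scores = [45, 46, 49, 35, 76, 80, 89, 94, 37, 61,
          62, 64, 68, 56, 57, 57, 59, 71, 72]

print(len(scores), sum(scores))          # n and the sum of the values
print(mean(scores), median(scores))      # sorting is handled by median()

# Hypothetical variation: replace 94 by 940 to create a strong right skew.
skewed = [940 if y == 94 else y for y in scores]
print(round(mean(skewed), 1), median(skewed))
```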
Measures of the spread:
• range
• interquartile range (IQR)
• variance
• standard deviation
In a sample, the variance = "average" squared deviation from the
mean:
$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$, where $y - \bar{y}$ = deviation
Standard Deviation s = square root of the variance: $s = \sqrt{s^2}$
Example: Find the variance and standard deviation for the data below
3, 5, 6, 10
n = 4
First, find the mean: $\bar{y} = \frac{\sum y}{n} = \frac{3 + 5 + 6 + 10}{4}$ =
Data   Deviation   Deviation^2
3
5
6
10
Variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$ =
Standard Deviation: $s = \sqrt{s^2}$ =
Rounding rule: round to at least one unit farther than the data – in this
case, to the tenths.
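After filling in the table by hand, the result can be checked with a short Python sketch that follows the same steps (mean, deviations, squared deviations, division by n - 1). The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

data = [3, 5, 6, 10]
n = len(data)

ybar = mean(data)                          # the mean
deviations = [y - ybar for y in data]      # y - ybar for each value
squared = [d ** 2 for d in deviations]     # squared deviations

s2 = sum(squared) / (n - 1)                # sample variance
s = sqrt(s2)                               # sample standard deviation

print(deviations, squared)
print(round(s2, 1), round(s, 1))           # rounded to the tenths
```

The library function `statistics.variance` uses the same n - 1 divisor, so it should agree with `s2`.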
Measures of Position: Percentiles
The pth percentile in a ranked set of data is the value that separates
the lowest p% of the values (those smaller or not greater) from the remaining
(100 - p)% of the values, which are greater or at least not smaller than that value.
The median is the 50th percentile of a given set of data
A calculator or computer program might give somewhat different answers!
Quartiles: “special” percentiles
Q1 (First or Lower Quartile) separates the bottom 25% of sorted values
from the top 75%. It’s the 25th percentile.
Q2 (Median, Second Quartile) separates the bottom 50% of sorted values
from the top 50%. It’s the 50th percentile.
Q3 (Third or Upper Quartile) separates the bottom 75% of sorted values
from the top 25%. It’s the 75th percentile.
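Software really can give different quartiles for the same data, because several interpolation rules are in use. As an illustration, Python's `statistics.quantiles` offers two of them; the data are the 19 test scores used in these notes.

```python
from statistics import quantiles

scores = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
          62, 64, 68, 71, 72, 76, 80, 89, 94]

# n=4 asks for the three cut points Q1, Q2 (median), Q3.
inclusive = quantiles(scores, n=4, method="inclusive")
exclusive = quantiles(scores, n=4, method="exclusive")

print("inclusive:", inclusive)
print("exclusive:", exclusive)
```

Both rules agree on the median, but Q1 and Q3 differ slightly, which is the kind of minor difference mentioned above.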
CLASSWORK Ch 4
d) standard deviation (use calculator)
e) variance
g) determine whether there are outliers
h) make a box plot (with outliers indicated, if any)
Chapter 5 Understanding and Comparing Distributions
One more display: a Boxplot
Data (sorted!):
35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94
Five-Number-Summary:
Minimum, Lower Quartile, Median, Upper Quartile, Maximum
Min = 35, Median = 61, Max = 94
Q3 = Upper quartile = middle of upper half (include median if n is odd)
Q1 = Lower quartile = middle of lower half (include median if n is odd)
Upper half:
35 37 45 46 49 56 57 57 59 [61 62 64 68 71 72 76 80 89 94]
Q3 = (71 + 72)/2 = 71.5
Lower half:
[35 37 45 46 49 56 57 57 59 61] 62 64 68 71 72 76 80 89 94
Q1 = (49 + 56)/2 = 52.5
IQR = 71.5 - 52.5 = 19
Five number summary
Min = 35, Q1 = 52.5, Median = 61, Q3 = 71.5, Max = 94
Outliers = data values that lie beyond the fences
Interquartile range (IQR) = Q3 - Q1 = 19
Upper fence = Q3 + 1.5 IQR = 71.5 + 1.5 x 19 = 100 (no outliers here)
Lower fence = Q1 - 1.5 IQR = 52.5 - 1.5 x 19 = 24 (no outliers here)
(Extreme outliers are the values beyond fences that are twice as long: 3 IQR from the quartiles.)
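The five-number summary and the fences can be reproduced with a short Python sketch that follows the rule used in these notes (the median is included in both halves when n is odd). Other software may use a different quartile rule.

```python
from statistics import median

scores = sorted([35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
                 62, 64, 68, 71, 72, 76, 80, 89, 94])
n = len(scores)

half = (n + 1) // 2                  # includes the median when n is odd
lower, upper = scores[:half], scores[n - half:]

q1, q2, q3 = median(lower), median(scores), median(upper)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [y for y in scores if y < lower_fence or y > upper_fence]

print(scores[0], q1, q2, q3, scores[-1])     # five-number summary
print(iqr, lower_fence, upper_fence, outliers)
```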
Boxplot:
Exercise: draw the histogram for the data above
Try also the calculator:
Data: 35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94
Classwork Chapter 5
An answer to a classwork problem:
The Summary Chapters 4-6
Always report the shape of a distribution, along with a center and a
spread.
• If the shape is skewed, report the median and IQR.
• If the shape is symmetric, report the mean and standard deviation
and possibly the median and IQR as well.
• If there are multiple modes, try to understand why. If you identify a
reason for the separate modes, it may be good to split the data into
two groups.
• If there are any clear outliers and you are reporting the mean and
standard deviation, report them with the outliers present and with the
outliers removed. The differences may be quite revealing.
• Don’t report too many decimal places.
• Don’t round in the middle of a calculation.
Chapter 6 – Standardizing and Normal Model
We compare individual data values to their mean, relative to their
standard deviation, using the following formula:
For a sample: $z = \frac{y - \bar{y}}{s}$
For the population: $z = \frac{y - \mu}{\sigma}$
We call the resulting values standardized values, denoted as z, or z-scores.
• Standardized values have no units.
• z-scores measure the distance of each data value from the mean in
standard deviations.
• A negative z-score tells us that the data value is below the mean,
while a positive z-score tells us that the data value is above the
mean.
Arithmetic operations on the data:
1. Shifting data:
Adding (or subtracting) a constant to each value will increase
(or decrease) measures of position: mean, median,
percentiles, max or min by the same constant.
The shape and the measures of spread (range, IQR, standard deviation)
remain unchanged.
2. Rescaling data:
When each value is multiplied (or divided) by a positive constant,
all measures of position (mean, median, percentiles, max or min) and all
measures of spread (range, IQR, standard deviation) are
multiplied (or divided) by that constant.
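Both rules are easy to check numerically. The sketch below uses the small data set 3, 5, 6, 10 and two arbitrary constants (add 30, multiply by 16) chosen only for illustration.

```python
from statistics import mean, median, stdev

data = [3, 5, 6, 10]

shifted = [y + 30 for y in data]     # shifting: add a constant
scaled = [16 * y for y in data]      # rescaling: multiply by a positive constant

for name, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    value_range = max(values) - min(values)
    print(f"{name:<9} mean={mean(values)} median={median(values)} "
          f"range={value_range} sd={stdev(values):.3f}")
```

Shifting moves the mean and median by 30 but leaves the range and standard deviation alone; rescaling multiplies all four by 16.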
Standardizing data into z-scores shifts the data by subtracting the
mean and rescales the values by dividing by their standard deviation.
• Standardizing into z-scores does not change the shape of
the distribution.
• Standardizing into z-scores changes the center by making
the mean 0.
• Standardizing into z-scores changes the spread by making
the standard deviation 1.
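A small Python sketch of standardizing, using the sample formula z = (y - mean)/s on the data 3, 5, 6, 10 (a data set chosen here for illustration):

```python
from statistics import mean, stdev

data = [3, 5, 6, 10]
ybar, s = mean(data), stdev(data)

# Shift by the mean, then rescale by the standard deviation.
z = [(y - ybar) / s for y in data]

print([round(v, 2) for v in z])
print(round(mean(z), 10), round(stdev(z), 10))   # center 0, spread 1
```

Values below the mean get negative z-scores, the value equal to the mean gets 0, and values above the mean get positive z-scores.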
4. Hams. A specialty foods company sells “gourmet hams” by mail order.
The hams vary in size from 4.15 to 7.45 pounds, with a mean weight of 6
pounds and standard deviation of 0.65 pounds. The quartiles and median
weights are 5.6, 6.2, and 6.55 pounds.
a) Find the range and the IQR of the weights.
b) Do you think the distribution of the weights is symmetric or skewed?
If skewed, which way? Why?
c) If these weights were expressed in ounces (1 pound = 16 ounces)
what would the mean, standard deviation, quartiles, median, IQR, and
range be?
d) When the company ships these hams, the box and packing materials
add 30 ounces. What are the mean, standard deviation, quartiles,
median, IQR, and range of weights of boxes shipped (in ounces)?
e) One customer made a special order of a 10-pound ham. Which of
the summary statistics of part d might not change if that data value
were added to the distribution?
14. Placement exams. An incoming freshman took her college's
placement exams in French and mathematics. In French, she scored 82
and in math 86. The overall results on the French exam had a mean of 72
and a standard deviation of 8, while the mean math score was 68, with a
standard deviation of 12. On which exam did she do better compared with
the other freshmen?
68-95-99.7 Rule (so called Empirical Rule)
In a Normal model:
• about 68% of the values fall within one standard deviation of
the mean;
• about 95% of the values fall within two standard deviations of
the mean; and,
• about 99.7% (almost all!) of the values fall within three
standard deviations of the mean.
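The three percentages can be checked against the standard Normal model with `statistics.NormalDist` from Python's standard library (a software alternative to Table Z, used here as an illustration):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD: {100 * share:.1f}%")
```

The exact values are slightly different from 68, 95 and 99.7, which is why the rule says "about".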
Notation
Normal Distribution: $N(\mu, \sigma)$
where mu ($\mu$) = mean, sigma ($\sigma$) = standard deviation
Standard Normal Distribution: N(0,1)
Example:
Suppose that we model SAT scores Y by N(500, 100) distribution.
1. What percentage of SAT scores fall between 450 and 600?
z-scores of 450 and 600:
(450-500)/100 = -.50
(600-500)/100 = 1.00
So P(450 < Y < 600) is the area under the standard Normal curve
between z = -0.5 and z = 1, that is, by the tables,
0.8413 - 0.3085 = 0.5328
Answer: 53.28%
TI-83: [2nd DISTR 2]
normalcdf(450,600,500,100) =0.5328072
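For readers without a TI-83, the same area can be computed in Python with `statistics.NormalDist`. This is a sketch of the same calculation under the N(500, 100) model.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)      # the model N(500, 100)

# z-scores of the two endpoints
z_low = (450 - sat.mean) / sat.stdev
z_high = (600 - sat.mean) / sat.stdev

# Area between 450 and 600 = area to the left of 600 minus area to the left of 450.
p = sat.cdf(600) - sat.cdf(450)

print(z_low, z_high)
print(round(p, 4))
```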
2. For what value b are 10% of SAT scores greater than b?
Let z be the z-score of b. The area to the right of z must be 10%, so
the area to the left is 90% (0.90).
From Table Z, the z-score for which the area to the left is 0.90 is
z = 1.28.
So b is 1.28 standard deviations to the right of the mean $\mu$, that is,
b = 100 x 1.28 + 500 = 628
Answer: b = 628
TI-83: [2nd DISTR 3]
invNorm(.9,500,100) = 628.155
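The inverse lookup can also be done in Python: `NormalDist.inv_cdf` plays the role of invNorm. A sketch under the same N(500, 100) model:

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)      # the model N(500, 100)

# 10% of scores above b means 90% of scores below b.
b = sat.inv_cdf(0.90)
z = (b - sat.mean) / sat.stdev           # the corresponding z-score

print(round(z, 2), round(b, 3))
```

The small gap between 628 (from the table, z = 1.28) and the software value comes from rounding z to two decimal places.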
28. IQ. Some IQ tests are standardized to a Normal model, with a mean of
100 and a standard deviation of 16.
a) Draw the model for these IQ scores. Clearly label it, showing what
the 68–95–99.7 Rule predicts about the scores.
b) In what interval would you expect the central 95% of IQ scores to
be found?
c) About what percent of people should have IQ scores above 116?
d) About what percent of people should have IQ scores between 68
and 84?
e) About what percent of people should have IQ scores above 132?
End of Part 1 Ch. 1-6.
Do Homework!
REMEMBER:
• Don’t use a Normal model when the distribution is not unimodal
and symmetric.
• Don’t use the mean and standard deviation when outliers are
present: the mean and standard deviation can both be distorted
by outliers.
• Don’t round your results in the middle of a calculation.
• Don’t worry about minor differences in results.