Section 3.1 – Measures of Central Tendency
A measure of center is a value at the center (or middle) of a data set.
Notation:
x = variable representing data values
n = total number of values in the sample
∑ = sum or “add up”
N = total number of values in the population (difficult to know)
Arithmetic Mean – Add the data values and divide by the total number of values.
Sample mean = $\bar{x} = \frac{\sum x}{n}$ = sum of the data values divided by n (statistic) “x-bar”
Population mean = $\mu = \frac{\sum x}{N}$ (parameter) “mew”
Median – Value that lies in the middle of the data set when arranged from smallest to largest.
If you have an ODD number of data points, the median will be the middle data point, at position $\frac{n+1}{2}$.
If you have an EVEN number of data points, then average the middle two data points, at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 9 10
Mode – The data point that occurs most often.
The data set can be bimodal, multimodal, or have no mode.
(you can find the mode of qualitative data, too!)
Example: The following data represent the miles per gallon for a 2013 Ford Fusion for 6 randomly
selected vehicles. Calculate the mean, median, and mode miles per gallon.
34.0 33.2 37.0 29.4 23.6 25.9
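As a way to check the hand calculation, here is a minimal Python sketch using the standard `statistics` module. The helper name `modes` is made up for this illustration; it reports no mode when every value occurs only once.

```python
from collections import Counter
from statistics import mean, median

# Miles-per-gallon sample from the example above
mpg = [34.0, 33.2, 37.0, 29.4, 23.6, 25.9]

def modes(values):
    """Return every value tied for the highest count, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once, so there is no mode
    return [v for v, c in counts.items() if c == top]

x_bar = mean(mpg)       # sum of the data values divided by n
med = median(mpg)       # n is even, so the two middle sorted values are averaged
mode_list = modes(mpg)

print(f"mean   = {x_bar:.2f}")
print(f"median = {med:.2f}")
print(f"mode   = {mode_list if mode_list else 'no mode'}")
```

Because the six values are all different, the sketch reports that this data set has no mode.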
Finding Mean and Median on the calculator:
1. Enter the data into a list (L1)
2. Select STAT
3. Arrow over to the menu labeled “CALC”
4. Highlight “1-VAR STATS” and push enter.
5. Type the name of the list (L1 or L2 , etc.)
6. Push enter and scroll down.
Note: The default list for 1-VarStats is L1
Describing the Shape of a Distribution Based on Mean and Median
When data are skewed left or right, that means there are extreme values in the tail which are pulling
the mean in the direction of the tail. (pg. 123)
Skewed left: Mean < Median
Symmetric: Mean = Median
Skewed right: Mean > Median
Example: Match the histograms shown to the summary statistics: (#18 pg. 127)
Data Set    Mean    Median
I           42      42
II          31      36
III         31      26
IV          31      32
Resistance
A numerical summary of data is resistant if extreme values (very large or small) do not affect its value
substantially.
Which of the following is considered resistant:
Mean or Median?
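One way to explore this question is to change a single value to something extreme and see which summary moves. The sketch below reuses the miles-per-gallon sample from Section 3.1; the value 370.0 is a hypothetical data-entry error invented for this illustration, not part of the original data.

```python
from statistics import mean, median

original = [34.0, 33.2, 37.0, 29.4, 23.6, 25.9]
# HYPOTHETICAL: 37.0 mistyped as 370.0 to create one extreme value
with_extreme = [34.0, 33.2, 370.0, 29.4, 23.6, 25.9]

mean_shift = mean(with_extreme) - mean(original)
median_shift = median(with_extreme) - median(original)

print(f"mean:   {mean(original):.2f} -> {mean(with_extreme):.2f}")
print(f"median: {median(original):.2f} -> {median(with_extreme):.2f}")
```

The mean is pulled toward the extreme value, while the median stays where it was.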
Section 3.2 – Measures of Dispersion
Dispersion is the degree to which data points are spread out.
Range = highest value – lowest value
Variance is based on how the data points deviate from the mean.
Variance of a sample = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation of a sample = $s = \sqrt{s^2}$
Notation:
              Standard Deviation    Variance
Sample        $s$                   $s^2$
Population    $\sigma$              $\sigma^2$
Properties and Interpretations of Standard Deviation:
1. The standard deviation is a measure of variation of all values from the mean. We say the standard
deviation is the “typical deviation from the mean”. The larger the standard deviation, the more
dispersion the distribution has.
2. The value of the standard deviation is either zero or positive.
(Zero means that all of the data points are the same number)
3. Outliers cause the standard deviation to increase dramatically.
4. The units of the standard deviation are the same as the units of the original data points.
Example: The following data represent the miles per gallon for a 2013 Ford Fusion for 6 randomly
selected vehicles. Calculate the range, variance, and standard deviation of the miles per gallon.
34.0 33.2 37.0 29.4 23.6 25.9
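The following Python sketch is one way to check these three values. It uses `statistics.variance` and `statistics.stdev`, which divide by n - 1 (the sample formulas), and repeats the variance by hand to mirror the formula in the notes.

```python
from statistics import stdev, variance

# Miles-per-gallon sample from the example above
mpg = [34.0, 33.2, 37.0, 29.4, 23.6, 25.9]

data_range = max(mpg) - min(mpg)   # highest value minus lowest value
s2 = variance(mpg)                 # sample variance (divides by n - 1)
s = stdev(mpg)                     # sample standard deviation, sqrt of s2

# The same variance computed directly from the formula
x_bar = sum(mpg) / len(mpg)
s2_by_hand = sum((x - x_bar) ** 2 for x in mpg) / (len(mpg) - 1)

print(f"range    = {data_range:.1f}")
print(f"variance = {s2:.2f}")
print(f"std dev  = {s:.2f}")
```

The units of the range and standard deviation are miles per gallon; the variance is in squared units.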
Example: The following data represent exam scores in a statistics class taught using traditional lecture
and in a statistics class taught using a “flipped” classroom.
Use your calculator (1-VarStats) to find the standard deviation for each sample.
Which class has more dispersion in the exam scores?
Traditional: 70.8 69.1 79.4 67.6 85.3 78.2 56.2 81.3 80.9 71.5
Flipped:     76.4 71.6 63.4 72.4 77.9 91.8 78.9 76.8 82.1 70.2
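A possible software check of the calculator result is sketched below. The split of the twenty scores into the two classes is an assumption about how the two-column table reads (scores alternating Traditional, Flipped); if the table is read differently, the lists need to be changed.

```python
from statistics import stdev

# ASSUMPTION: grouping of the twenty scores into the two classes
traditional = [70.8, 69.1, 79.4, 67.6, 85.3, 78.2, 56.2, 81.3, 80.9, 71.5]
flipped = [76.4, 71.6, 63.4, 72.4, 77.9, 91.8, 78.9, 76.8, 82.1, 70.2]

s_traditional = stdev(traditional)   # sample standard deviation (n - 1)
s_flipped = stdev(flipped)

print(f"traditional s = {s_traditional:.2f}")
print(f"flipped     s = {s_flipped:.2f}")
more_dispersed = "traditional" if s_traditional > s_flipped else "flipped"
print(f"more dispersion: {more_dispersed}")
```

The class with the larger standard deviation has the more spread-out exam scores.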
Empirical Rule for Bell-shaped data: (pg. 139)
Approximately 68% of the data will lie within 1 standard deviation of the mean
Approximately 95% of the data will lie within 2 standard deviations of the mean
Approximately 99.7% of the data will lie within 3 standard deviations of the mean
Example: IQ scores of normal adults have a bell-shaped distribution with a mean of 100 and a
standard deviation of 15.
a) Approximately what percentage of adults have IQ scores between 55 and 145?
b) Approximately what percentage of adults have IQ scores greater than 145?
c) Approximately what percentage of adults have IQ scores between 85 and 130?
Example: SAT math scores have a bell-shaped distribution with a mean of 515 and a standard
deviation of 114.
a) Approximately what percentage of math scores are between 401 and 629?
b) Approximately what percentage of math scores are lower than 401?
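The Empirical Rule questions above all reduce to adding or subtracting halves of 68%, 95%, and 99.7%. The sketch below shows that bookkeeping in Python. The function names are invented for this illustration, and it only handles values that sit a whole number (0 to 3) of standard deviations from the mean.

```python
# Approximate percent of bell-shaped data within k standard deviations of the mean
WITHIN = {0: 0.0, 1: 68.0, 2: 95.0, 3: 99.7}

def signed_half(value, mean, sd):
    """Percent of data between the mean and `value` (negative if value is below the mean)."""
    k = (value - mean) / sd
    k_int = round(k)
    if abs(k - k_int) > 1e-9 or abs(k_int) > 3:
        raise ValueError("value must be 0 to 3 whole standard deviations from the mean")
    half = WITHIN[abs(k_int)] / 2
    return half if k_int >= 0 else -half

def percent_between(low, high, mean, sd):
    return signed_half(high, mean, sd) - signed_half(low, mean, sd)

def percent_above(value, mean, sd):
    return 50.0 - signed_half(value, mean, sd)

def percent_below(value, mean, sd):
    return 50.0 + signed_half(value, mean, sd)

# IQ example: mean 100, standard deviation 15
print(f"IQ 55 to 145:   {percent_between(55, 145, 100, 15):.2f}%")
print(f"IQ above 145:   {percent_above(145, 100, 15):.2f}%")
print(f"IQ 85 to 130:   {percent_between(85, 130, 100, 15):.2f}%")

# SAT math example: mean 515, standard deviation 114
print(f"SAT 401 to 629: {percent_between(401, 629, 515, 114):.2f}%")
print(f"SAT below 401:  {percent_below(401, 515, 114):.2f}%")
```

These are approximations that apply only to bell-shaped data, as the rule itself states.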
Section 3.3 – Measures of Central Tendency – Grouped Data
Sometimes data points are not equally weighted. Some data points might have a higher importance,
or more weight, to them.
The weighted mean of a variable is found by multiplying each value of the variable by its
corresponding weight, then adding these products, and dividing by the sum of the weights.
Weighted Average = $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$ = (sum of each value times its weight) / (sum of all weights)
Example: Marissa just completed her first semester in college. She earned an A in her 3-hour
statistics class, a B in her 3-hour sociology course, a C in her 6-hour biology class, and an A in her 1-hour PE class. Calculate Marissa’s GPA.
Course       Grade (points)    Weight (hours)
Stats
Sociology
Biology
PE
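A short Python sketch of the weighted mean applied to this example follows. The grade-point scale (A = 4, B = 3, C = 2, D = 1, F = 0) is an assumption, since the notes do not state it, and `weighted_mean` is a helper written for this illustration.

```python
def weighted_mean(values, weights):
    """Sum of each value times its weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# ASSUMPTION: the usual 4-point scale; the notes do not give the point values
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# (course, letter grade, credit hours) taken from the example
courses = [("Stats", "A", 3), ("Sociology", "B", 3), ("Biology", "C", 6), ("PE", "A", 1)]

points = [GRADE_POINTS[grade] for _, grade, _ in courses]
hours = [h for _, _, h in courses]
gpa = weighted_mean(points, hours)

print(f"GPA = {gpa:.2f}")
```

The 6-hour biology class carries the most weight, so the C pulls the GPA down more than a C in a 1-hour class would.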
Example: In Marissa’s statistics course, attendance counts for 5% of her grade, quizzes count 10%,
exams count 60%, and the final exam counts 25%. Her grades are as follows. Calculate Marissa’s
course average.
Attendance: 100%
Quizzes: 93%
Exams: 86%
Final Exam: 85%
Course        Grade    Weight (% as a decimal)
Attendance
Quizzes
Exams
Final Exam
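As a worked check of the weighted mean on this example, with each weight written as a decimal (the weights sum to 1):

```latex
\begin{align*}
\bar{x}_w &= \frac{\sum w_i x_i}{\sum w_i} \\
          &= \frac{0.05(100) + 0.10(93) + 0.60(86) + 0.25(85)}{0.05 + 0.10 + 0.60 + 0.25} \\
          &= \frac{5 + 9.3 + 51.6 + 21.25}{1.00} \\
          &= 87.15
\end{align*}
```

The exams contribute the largest term because they carry 60% of the weight.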
Section 3.4 – Measures of Position and Outliers
Z-score – The distance that a data value is from the mean in terms of the number of standard
deviations.
Negative z-scores indicate that the data point is below the mean.
Positive z-scores indicate that the data point is above the mean.
Sample z-score: $z = \frac{x - \bar{x}}{s}$
Population z-score: $z = \frac{x - \mu}{\sigma}$
ALWAYS ROUND Z-SCORES TO 2 DECIMAL PLACES!!
Example: Determine whether the Los Angeles Angels or the Colorado Rockies had a relatively better
run-producing season.
The Angels scored 773 runs and play in the American League, where µ = 677.4 runs and σ = 51.7 runs.
The Rockies scored 755 runs and play in the National League, where µ = 640.0 runs and σ = 55.9 runs.
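The comparison can be sketched in Python with the population z-score formula, rounding to 2 decimal places as the notes require. The function name `z_score` is chosen for this illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean, rounded to 2 places."""
    return round((x - mean) / sd, 2)

# Runs scored, league mean, league standard deviation (from the example)
angels_z = z_score(773, 677.4, 51.7)
rockies_z = z_score(755, 640.0, 55.9)

print(f"Angels  z = {angels_z}")
print(f"Rockies z = {rockies_z}")
better = "Angels" if angels_z > rockies_z else "Rockies"
print(f"relatively better run-producing season: {better}")
```

The team with the larger z-score scored more runs relative to its own league, even if its raw run total is lower.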
Example: The average man in his twenties is 69.6 inches tall with a standard deviation of 3.0 inches.
The average woman in her twenties is 64.1 inches tall with a standard deviation of 3.8 inches.
Who is relatively taller, a 75-inch man or a 70-inch woman? Explain your choice using a complete
sentence.
Quartiles and Percentiles
The kth percentile, denoted Pk, of a data set is a value such that k percent of the data points are less
than or equal to that value.
Percentiles – separate the sorted data into 100 equal parts with 1% of the data values in each group.
P13 separates the lower 13% from the upper 87%
P55 separates the lower 55% from the upper 45% etc…
Example: If someone’s SAT math score is 600 and that is in the 74th percentile, what does this mean?
It means that _______% of SAT math scores are less than or equal to 600, and _______% of SAT
math scores are greater than 600.
Quartiles – separate the sorted data into 4 equal parts with 25% of the data values in each group.
Q1 separates the lower 25% from the upper 75%
Q2 separates the lower 50% from the upper 50% (also called the median)
Q3 separates the lower 75% from the upper 25%
BEFORE CALCULATING PERCENTILES OR QUARTILES, THE DATA MUST BE SORTED LOW TO HIGH!!!
Example: The following 13 data points are the sorted final exam grades of a sample of students who
took the statistics final exam in 2005. Calculate Q1, Q2, and Q3.
53 58 64 66 71 74 75 77 83 84 87 92 93
Quartiles can also be found on your calculator using 1-VarStats (scroll down)
The Interquartile Range (IQR) – the range of the middle 50% of the values in a data set.
IQR = Q3 – Q1
Interpretation: The more spread out a data set is, the higher the IQR will be.
Example: Calculate the IQR for the last example (test grades).
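A minimal Python sketch for the quartiles and IQR of the test grades is given below. It assumes the "median of each half" convention: Q2 is the median, Q1 is the median of the values below it, and Q3 is the median of the values above it, leaving the median itself out of both halves when n is odd. Other textbooks and software use slightly different conventions, so results can differ a little.

```python
from statistics import median

def quartiles(sorted_data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    n = len(sorted_data)
    half = n // 2
    lower = sorted_data[:half]
    # Skip the middle value when n is odd so it belongs to neither half
    upper = sorted_data[half + 1:] if n % 2 else sorted_data[half:]
    return median(lower), median(sorted_data), median(upper)

# The data must be sorted low to high before finding quartiles
grades = sorted([53, 58, 64, 66, 71, 74, 75, 77, 83, 84, 87, 92, 93])

q1, q2, q3 = quartiles(grades)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}")
```

The IQR here is the width of the interval that holds the middle 50% of the grades.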
Outlier – An extreme observation (high or low)
How to check for outliers:
1. Calculate Q1 and Q3 then calculate the IQR.
2. Determine the “fences”.
Lower fence = Q1 – (1.5 × IQR)
Upper fence = Q3 + (1.5 × IQR)
3. If a data value is less than the lower fence or greater than the upper fence, it is an OUTLIER!
Example: The following data represent the hemoglobin levels for 20 randomly selected cats.
5.7 7.7 7.8 8.7 8.9 9.4 9.5 9.6 9.6 9.9
10.0 10.3 10.6 10.7 11.0 11.2 11.7 12.9 13.0 13.4
a) Calculate the mean and standard deviation of the hemoglobin level for this data set.
b) A cat named Daisy had a hemoglobin level of 7.8. Calculate her z-score and interpret.
c) Calculate the IQR and the fences.
d) Are there any outliers? If so, list them.
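Parts a) through d) can be checked with the Python sketch below. It treats the 20 cats as a sample (so the standard deviation divides by n - 1) and assumes the median-of-halves convention for Q1 and Q3; the helper `quartiles` is written for this illustration.

```python
from statistics import mean, median, stdev

hemoglobin = sorted([5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9,
                     10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4])

def quartiles(sorted_data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    n = len(sorted_data)
    half = n // 2
    lower = sorted_data[:half]
    upper = sorted_data[half + 1:] if n % 2 else sorted_data[half:]
    return median(lower), median(sorted_data), median(upper)

# a) mean and sample standard deviation
x_bar = mean(hemoglobin)
s = stdev(hemoglobin)

# b) Daisy's z-score, rounded to 2 decimal places
daisy_z = round((7.8 - x_bar) / s, 2)

# c) IQR and fences
q1, _, q3 = quartiles(hemoglobin)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# d) any value outside the fences is an outlier
outliers = [x for x in hemoglobin if x < lower_fence or x > upper_fence]

print(f"mean = {x_bar:.2f}, s = {s:.2f}")
print(f"Daisy's z-score = {daisy_z}")
print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.3f}, {upper_fence:.3f})")
print(f"outliers = {outliers}")
```

A negative z-score for Daisy means her hemoglobin level is below the sample mean.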
Which statistics should I use for my data set?
Shape of Distribution      Measure of Central Tendency    Measure of Dispersion
SYMMETRIC
SKEWED (left or right)
Section 3.5 – The Five-Number Summary and Boxplots
5-Number Summary:
1. Minimum value
2. Q1
3. Q2 (median)
4. Q3
5. Maximum value
How to Draw a Boxplot:
1. Draw a box with vertical lines at Q1, Q2 , and Q3 .
2. Draw brackets at the lower and upper fences.
3. Draw a horizontal line from Q1 to the smallest data point inside the lower fence.
4. Draw a horizontal line from Q3 to the largest data point inside the upper fence.
5. Label any outliers with asterisks (*).
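The numbers needed for these drawing steps can be collected with a short Python sketch. It reuses the cat hemoglobin data from Section 3.4 and assumes the median-of-halves convention for the quartiles; the function names are invented for this illustration, and the sketch only computes the values rather than drawing the plot.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using median-of-halves quartiles (an assumption)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return values[0], median(lower), median(values), median(upper), values[-1]

def boxplot_parts(data):
    """Fences, whisker ends, and outliers that the drawing steps call for."""
    values = sorted(data)
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),   # smallest and largest points inside the fences
        "outliers": [x for x in values if x < lower_fence or x > upper_fence],
    }

hemoglobin = [5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9,
              10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4]

summary = five_number_summary(hemoglobin)
parts = boxplot_parts(hemoglobin)

print("five-number summary:", summary)
print("whiskers end at:", parts["whiskers"])
print("outliers (mark with *):", parts["outliers"])
```

Note that the left whisker stops at the smallest value inside the lower fence, not at the minimum, whenever the minimum is an outlier.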
Example: Create a Box & Whisker Plot for the “birth month” data for both classes. Compare.
               10am Class    11am Class
Min
Q1
Median (Q2)
Q3
Max
[Number line for the boxplots: Birth Month, 1 to 12]
How to determine the distribution (shape) of the data set by looking at the boxplot: (pg. 167)
Example: #6 on pg. 170
a) To the nearest integer, what is the median of variable x?
b) To the nearest integer, what is the first quartile of variable y?
c) Which variable has more dispersion? Why?
d) Does the variable x have any outliers? If so, what is the value of the outlier(s)?
e) Describe the shape of variable y.