There are three kinds of lies - lies, damned lies and statistics.
~Popularized by Mark Twain, who attributed it to Benjamin Disraeli
APSTAT PART ONE
Exploring and Understanding Data
What is Statistics?
Chapters 1-3
What is Stat?
• Book Says:
– A way of reasoning
– Collection of tools and methods
– Helps us understand the world
– Statistics is about variation
Stat Basics
• Individuals
– Object described by a set of data
– People (#1), cars, animals, groups…
• Variables
– Categorical (Qualitative) – Usually involves words
• Examples: sex, advisor, social security #...
– Quantitative – Involves #’s
• Examples: age, height, income, test score…
Displaying Categorical Data
• Frequency tables:
Favorite Color | Count
Red | 15
Yellow | 78
Blue | 382
Burnt Sienna | 25
Displaying Categorical Data
• Relative Frequency tables:
– Just roll up the %’s
Favorite Color | Percent
Red | 3%
Yellow | 15.6%
Blue | 76.4%
Burnt Sienna | 5%
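A minimal sketch of how the percentages come from the counts. The pairing of each color with its count is an assumption read off the flattened slide table (it is consistent with a total of 500), and the function name is made up for illustration.

```python
# Counts from the frequency table; the color-to-count pairing is assumed.
counts = {"Red": 15, "Yellow": 78, "Blue": 382, "Burnt Sienna": 25}


def relative_frequencies(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


percents = relative_frequencies(counts)
for color, pct in percents.items():
    print(f"{color:<13}{pct:5.1f}%")
```

The percentages should add up to 100 (give or take rounding), which is a quick sanity check on any relative frequency table.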
Displaying Categorical Data
• Contingency Table
– Two Way table
Age at first “Real Kiss” (ahhhhhhhhhhhh…)
 | 10-14 | 15-19 | 20-24
Male | 12 | 13 | 5
Female | 18 | 8 | 2
Marginal Distribution
Age at first “Real Kiss” (ahhhhhhhhhhhh…)
 | 10-14 | 15-19 | 20-24 | Total
Male | 12 | 13 | 5 | 30
Female | 18 | 8 | 2 | 28
Total | 30 | 21 | 7 | 58
• Conditional Distribution:
– % of males whose first kiss came when they were 10-14
– % of 20-24 year old first kissers who were male
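Here is a rough sketch of the marginal totals and the two conditional percentages asked for above, using the counts from the two-way table. The variable names are just for illustration.

```python
# Two-way table of counts: sex by age at first kiss.
table = {
    "Male": {"10-14": 12, "15-19": 13, "20-24": 5},
    "Female": {"10-14": 18, "15-19": 8, "20-24": 2},
}

# Marginal distributions: the row totals and column totals.
row_totals = {sex: sum(ages.values()) for sex, ages in table.items()}
col_totals = {}
for ages in table.values():
    for age, n in ages.items():
        col_totals[age] = col_totals.get(age, 0) + n
grand_total = sum(row_totals.values())

# Conditional distributions: restrict to one row or column, then take a percent.
pct_males_10_14 = 100 * table["Male"]["10-14"] / row_totals["Male"]
pct_20_24_male = 100 * table["Male"]["20-24"] / col_totals["20-24"]

print(row_totals, col_totals, grand_total)
print(f"{pct_males_10_14:.1f}% of males were 10-14")
print(f"{pct_20_24_male:.1f}% of the 20-24 group were male")
```

Note the denominator changes with the condition: 30 males for the first question, 7 people in the 20-24 column for the second.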
The Rest of Chapters 1-3
• Displaying the data
– Pie Charts
– Bar Charts
– Blah Blah Blah….
• Simpson’s Paradox – AP MC
• Being Skeptical – Important for real life
– 5 W’s + 1H
• Ex: 4 out of 5 dentists….
– Displaying data
• Lies, Damned Lies, and Statistics
Showing Off Your Data
Chapters 4-5
Histograms
• Remember bar graphs? Same, but different.
• Think of sorting boxes…
– Same size boxes
• ON TI-83
– Enter Data into L1 (STAT>EDIT)
– Go to STAT PLOT (2ND Y=)
– Change Options
– Go to ZOOM and choose ZoomStat, OR go to WINDOW and change options
– Go to GRAPH
Histograms
• Make a histogram of the following data:
• Age of Teachers At WPS
25, 34, 37, 42, 51, 43, 49, 35, 37, 65
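If you want to check your histogram away from the calculator, here is a small sketch of the "same size boxes" idea. The starting point (20) and box width (10) are assumptions chosen for this data, not something the slides specify.

```python
ages = [25, 34, 37, 42, 51, 43, 49, 35, 37, 65]


def histogram_counts(data, start, width):
    """Count values in equal-width bins [start, start + width), and so on.

    Assumes integer data with every value >= start.
    """
    n_bins = (max(data) - start) // width + 1
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    return counts


bin_counts = histogram_counts(ages, start=20, width=10)
for i, n in enumerate(bin_counts):
    lo = 20 + 10 * i
    print(f"{lo}-{lo + 9}: {'*' * n}")
```

Empty boxes are kept in the list on purpose, so gaps in the data stay visible.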
Outliers
• An observation that is outside the pattern
– For example, ages in this classroom
16, 17, 16, 17, 18, 17, 17, 16, 18, 36
• Formula to determine (l8r, sk8r)
– For now “potential” or “possible” outlier
Describing a distribution
• Center
– Mean - Average
– Median - Middle
• Shape
– Symmetric
– Skewed
– Uniform
– Bell Shaped
– Bi- or Multi-modal
• Spread
– Standard Deviation
– Range
– IQR
• Weird-ness
– Outliers
– Gaps
Stemplots
• Basic
• Split Stems
• Back-To-Back
Basic Stemplot
Boys’ Weight in class (pounds)
10 |
11 |
12 |
13 |
14 | 34699
15 | 025788
16 | 001344589
17 | 1
18 | 9
KEY: 10 | 8 = 108 pounds
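A rough sketch of building a basic stemplot by splitting each value into a stem (all but the last digit) and a leaf (the last digit). The list of weights is an assumption: it is read back off the leaves in the stemplot above. The sketch starts at the smallest stem that has data.

```python
from collections import defaultdict

# Boys' weights reconstructed from the stems and leaves (assumed values).
weights = [143, 144, 146, 149, 149, 150, 152, 155, 157, 158, 158,
           160, 160, 161, 163, 164, 164, 165, 168, 169, 171, 189]


def stemplot(data):
    """Return one 'stem | leaves' line per stem, keeping empty stems."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)
    low, high = min(leaves), max(leaves)
    lines = []
    for stem in range(low, high + 1):
        digits = "".join(str(d) for d in leaves[stem])
        lines.append(f"{stem} | {digits}".rstrip())
    return lines


for line in stemplot(weights):
    print(line)
```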
Split Stem Stemplot
Boys’ Weight in class (pounds)
14 | 34
14 | 699
15 | 02
15 | 5788
16 | 001344
16 | 589
17 | 1
17 |
18 | 9
KEY: 10 | 8 = 108 pounds
Back to Back Stemplot
Girls vs. Boys Weight in class (pounds)
Girls | Stem | Boys
8 | 10 |
93 | 11 |
8773 | 12 |
940 | 13 |
2 | 14 | 34699
1 | 15 | 025788
 | 16 | 001344589
 | 17 | 1
 | 18 | 9
KEY: 10 | 8 (boys) or 8 | 10 (girls) = 108 pounds
Mean
• Average! Add ‘em up and divide by n
• Sample Mean denoted as x̄ (x-bar)
• Not Resistant to extreme values
– e.g. Ages in Mrs. Smith’s Kindergarten Class
– 4,5,4,4,4,5,5,4,4,4,5,5,4,4,5,39
Median
• Middle! Line ‘em up (in order) and find the middle. If two share it, find their mean.
• Resistant to extreme values
– e.g. Ages in Mrs. Smith’s Kindergarten Class
– 4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,39
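To see resistance in action, here is a quick sketch comparing the mean and median of the kindergarten ages with and without the 39-year-old (presumably Mrs. Smith herself). It uses Python's standard `statistics` module.

```python
from statistics import mean, median

ages = [4, 5, 4, 4, 4, 5, 5, 4, 4, 4, 5, 5, 4, 4, 5, 39]
kids_only = [a for a in ages if a != 39]  # drop the extreme value

mean_all, median_all = mean(ages), median(ages)
mean_kids, median_kids = mean(kids_only), median(kids_only)

print(f"with 39:    mean = {mean_all:.2f}, median = {median_all}")
print(f"without 39: mean = {mean_kids:.2f}, median = {median_kids}")
```

One extreme value drags the mean from 4.4 up to about 6.56, older than any child in the room, while the median stays at 4.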
Quartiles
• Median cuts data in half, Quartiles cut the Halves in Half!
Recall Teacher Ages:
25, 34, 35, 37, 37, 42, 43, 49, 51, 65
1st Quartile (Q1) = 35 (median of the lower half)
Median = 39.5
3rd Quartile (Q3) = 49 (median of the upper half)
5-Number Summary
• Low-Q1-Median-Q3-High
• Shows Spread of Data
Recall Teacher Ages:
25, 34, 35, 37, 37, 42, 43, 49, 51, 65
• 5-Number Summary:
25 35 39.5 49 65
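A minimal sketch of the 5-number summary, assuming the "median of each half" convention for quartiles (the overall median is left out of both halves when n is odd). That convention matches the numbers on the slide; other software may use a different quartile rule and give slightly different values.

```python
from statistics import median


def five_number_summary(data):
    """Return (low, Q1, median, Q3, high) using the median-of-halves rule."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]             # values below the middle
    upper = xs[len(xs) - half:]   # values above the middle
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


teacher_ages = [25, 34, 35, 37, 37, 42, 43, 49, 51, 65]
summary = five_number_summary(teacher_ages)
print(summary)
```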
Boxplot
• Graphical Representation of 5-Number Summary
• Shows Shape, Spread, and Center
• Always draw to scale:
[Boxplot of teacher ages on a number line: Low 25, Q1 35, Median 39.5, Q3 49, High 65]
Outliers
• First off, IQR – InterQuartile Range
– Distance between Quartiles…
Recall Teacher Ages:
25, 34, 35, 37, 37, 42, 43, 49, 51, 65
• IQR is 49-35=14
• Outlier is anything more than 1.5 times IQR below Q1 or above Q3
• Sooo…. 1.5 × 14 = 21, so an outlier would have to be more than 21 below 35 or more than 21 above 49… Below 14 or above 70.
Nothing in our data is an outlier!
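The 1.5 × IQR rule can be sketched as below. The quartiles are assumed to be the medians of the lower and upper halves, and the function names are made up for illustration. The second data set is the classroom ages from the earlier Outliers slide.

```python
from statistics import median


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (needs >= 2 values)."""
    xs = sorted(data)
    half = len(xs) // 2
    return median(xs[:half]), median(xs[len(xs) - half:])


def find_outliers(data):
    """Return (low fence, high fence, values outside the fences)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [x for x in data if x < low_fence or x > high_fence]
    return low_fence, high_fence, flagged


teacher_ages = [25, 34, 35, 37, 37, 42, 43, 49, 51, 65]
classroom_ages = [16, 17, 16, 17, 18, 17, 17, 16, 18, 36]

teacher_result = find_outliers(teacher_ages)
classroom_result = find_outliers(classroom_ages)
print(teacher_result)
print(classroom_result)
```

For the classroom ages, the fences work out to 13 and 21, so the 36 that looked like a "potential" outlier earlier does get flagged by the formula.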
Boxplot Using TI-83
Enter Teacher Ages into L1 (clear old stuff first):
25, 34, 35, 37, 37, 42, 43, 49, 51, 65
• ON TI-83
– Go to STAT PLOT (2ND Y=)
– Change Options
– Go to ZOOM and choose ZoomStat, OR go to WINDOW and change options
– Go to GRAPH
Variance & Standard Deviation
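• Variance - s²
– “Average” of squared distances from mean
– In example, dividing by n gives 26/5 = 5.2; the sample variance s² divides by n − 1, giving 26/4 = 6.5
• Standard Deviation – s
– Square Root of Variance
– In example, √5.2 ≈ 2.28 (σx on the TI-83); the sample value is s = √6.5 ≈ 2.55 (Sx on the TI-83)
• Standard Deviation
– Measure of Spread
– Use with Mean
– Non-Resistant
Raw Data | Distance From Mean | Distance Squared
3 | 3 | 9
5 | 1 | 1
6 | 0 | 0
6 | 0 | 0
10 | 4 | 16
Mean = 6; sum of squared distances = 26
• On TI-83 Now…..
– STAT>CALC>1-Var Stats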
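A small sketch of the table's arithmetic for the raw data 3, 5, 6, 6, 10. It computes the variance both ways: dividing by n (the 26/5 = 5.2 in the example) and dividing by n − 1 (the usual sample variance s²). Variable names are for illustration only.

```python
from math import sqrt

data = [3, 5, 6, 6, 10]
n = len(data)

mean = sum(data) / n                         # 6.0
squared = [(x - mean) ** 2 for x in data]    # 9, 1, 0, 0, 16
total = sum(squared)                         # 26.0

var_n = total / n               # divide by n
var_n_minus_1 = total / (n - 1)  # divide by n - 1 (sample variance)

sd_n = sqrt(var_n)
sd_n_minus_1 = sqrt(var_n_minus_1)

print(f"divide by n:     variance = {var_n:.2f}, sd = {sd_n:.2f}")
print(f"divide by n - 1: variance = {var_n_minus_1:.2f}, sd = {sd_n_minus_1:.2f}")
```

The two standard deviations (about 2.28 and 2.55) correspond to the two values that 1-Var Stats reports on the calculator, σx and Sx.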
It’s Normal to Deviate
Chapter 6 – The Normal Model
Density Curve
• Area under a density curve is always 1
• Symmetric density curve: Mean, Median and Mode coincide at the center (for a single-peaked curve)
Density Curve Continued
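• Density curves are often skewed
• Recall Median is “resistant” while Mean is not
– Skewed to the Left (tail trails to the left): Mean is pulled toward the tail, to the left of the Median
– Skewed to the Right (tail trails to the right): Mean is pulled toward the tail, to the right of the Median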
Histograms
• Median is “equal areas” point
• Mean is “balance point” – “think Physics”
[Figure: the median splits the area under the curve, with 50% of the population on each side]
Normal Distributions (bell shaped)
• Center is mean μ (population mean)
• Spread is Standard Deviation σ (population standard deviation)
– To find, look for inflection points: the curve is concave down between μ − σ and μ + σ, and concave up outside them
68 – 95 – 99.7 Rule
• Also called EMPIRICAL RULE
Probability = 99.7% within 3σ
Probability = 95% within 2σ
Probability = 68% within 1σ
Raw-Score (X) | z-Score (z)
μ − 3σ | −3
μ − 2σ | −2
μ − 1σ | −1
μ | 0
μ + 1σ | 1
μ + 2σ | 2
μ + 3σ | 3
Percentiles (and quartiles)
• Think standardized tests or class rankings
• Percent of observations to the LEFT of an
observation
• Quartiles:
– First is at 25th percentile
– Median is at 50th percentile
– Third is at 75th percentile
Z-SCORE
• Number of Standard Deviations (σ) away from the Mean (μ)
Raw-Score (X) | z-Score (z)
μ − 3σ | −3
μ − 2σ | −2
μ − 1σ | −1
μ | 0
μ + 1σ | 1
μ + 2σ | 2
μ + 3σ | 3
Z-SCORE Continued
z = (X − μ) / σ
μ = population mean
σ = population standard deviation
X = Raw-Score
z = z-Score
Normal Distribution Notation N(μ, σ)
• Example: You have an IQ of 148. The IQ test you took has a distribution N(105, 20). What is your z-Score? What does this mean?
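A minimal sketch of the z-score calculation for the IQ example, with the mean 105 and standard deviation 20 taken from N(105, 20). The function name is just for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


z = z_score(148, mu=105, sigma=20)
print(f"z = {z:.2f}")
```

An IQ of 148 sits 43 points above the mean, and 43/20 = 2.15 standard deviations.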
Using Tables
• Ex. – Your IQ z-Score was 2.15. What does it mean now?
Using Tables
• Ex. – If someone’s IQ was at the 10th percentile, what would their z-Score be?
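Both table lookups can be checked with a sketch like the one below. It swaps the printed z-table for `statistics.NormalDist` from the Python standard library (Python 3.8 or later), so the answers carry more decimals than a table would.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# z-score -> percentile: area to the LEFT of z = 2.15.
area_left = standard_normal.cdf(2.15)

# Percentile -> z-score: the z with 10% of the area to its left.
z_10th = standard_normal.inv_cdf(0.10)

# Convert that z back to a raw IQ score on N(105, 20).
iq_10th = 105 + 20 * z_10th

print(f"P(Z < 2.15) = {area_left:.4f}")
print(f"10th percentile: z = {z_10th:.2f}, IQ = {iq_10th:.1f}")
```

So a z-score of 2.15 puts you at roughly the 98th percentile, and the 10th percentile sits about 1.28 standard deviations below the mean.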
Using TI-83
• normalcdf(Xlower, Xupper, μ, σ): use to convert Raw-Score directly to probability.
• normalcdf(Zlower, Zupper): use to convert z-Score to probability
***For Graphics use ShadeNorm (GTANG notes)
Using TI-83
• Test Empirical Rule (68-95-99.7)
– Find normalcdf(-1,1), normalcdf(-2,2), normalcdf(-3,3)
• Ex. What percent of IQ Scores would fall between 100 and 110 using N(105, 20)? What percent would be above 150?
– normalcdf(100,110,105,20)
– normalcdf(150,1000000000,105,20)
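For checking the calculator answers, here is a rough Python stand-in for the calculator's normalcdf. The function below is a hypothetical helper written for this sketch (it is not the TI-83 routine itself); it just subtracts two areas from `statistics.NormalDist`.

```python
from statistics import NormalDist


def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under N(mu, sigma) between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


# Test the Empirical Rule on the standard normal curve.
within_1 = normalcdf(-1, 1)
within_2 = normalcdf(-2, 2)
within_3 = normalcdf(-3, 3)

# IQ examples on N(105, 20); a huge upper bound stands in for infinity.
between_100_110 = normalcdf(100, 110, 105, 20)
above_150 = normalcdf(150, 1_000_000_000, 105, 20)

print(f"{within_1:.4f} {within_2:.4f} {within_3:.4f}")
print(f"{between_100_110:.4f} {above_150:.4f}")
```

About 19.7% of scores fall between 100 and 110, and about 1.2% are above 150.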
Normality
• Just check Box and Whisker plot or Histogram on TI-83
• ALWAYS do this if raw data is given
– Sketch result and comment on it!
Distributions: Column 1
[Histogram of Column 1, axis running from 20 to 70]
Quantiles
100.0% | maximum | 65.000
99.5% | | 65.000
97.5% | | 65.000
90.0% | | 63.600
75.0% | quartile | 45.000
50.0% | median | 39.000
25.0% | quartile | 36.250
10.0% | | 25.900
2.5% | | 25.000
0.5% | | 25.000
0.0% | minimum | 25.000
Moments