Download Introduction to Stats

What is Statistics?
Statistics is a collection of procedures
and principles for gathering data and
analyzing information in order to
help people make decisions when
faced with uncertainty.
THINK—SHOW—TELL
1. Why?
2. Who?
3. What?
4. When?
5. Where?
6. How?
“Data are used to make a
judgment about a situation”
1) What question needs to be answered?
2) How should we collect data & how
much?
3) How can we summarize the data?
4) What decisions or generalizations can
be made about the question
based on the data collected?
Population Data vs. Sample Data
Population data:
• Everyone, everything
• Parameters: summary measurements (p, μ) of the population data.
Sample data:
• Representative smaller “subset” of population
• Statistics: summary measurements denoted by standard letters (p̂, x̄) of the sample data.
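A minimal sketch of the parameter versus statistic distinction. The population (the integers 1 to 100), the sample size of 10, and the seed are all hypothetical choices made only for illustration.

```python
import random
from statistics import mean

# Hypothetical population: every value we could possibly observe.
population = list(range(1, 101))   # the integers 1..100

# Parameter: a summary measurement of the whole population (mu).
mu = mean(population)

# Statistic: the same summary computed from a smaller sample (x-bar).
rng = random.Random(0)             # fixed seed so the run is repeatable
sample = rng.sample(population, 10)
x_bar = mean(sample)

print(f"population mean (parameter) mu = {mu}")
print(f"sample mean (statistic) x_bar = {x_bar}")
```

The parameter is a fixed number; the statistic changes from sample to sample.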
Data: Types of Variables
Categorical: Group of category names w/no order
Example: Eye Color (brown, blue, green)
Quantitative: Numerical values taken from an individual
Example: Weight (117 lbs, 170 lbs, 253 lbs)
Types of Quantitative (Numerical) Data
Discrete
Example: Number of
siblings, number of
pockets in a pair of
jeans, number of
free throws made in
a season,…
Continuous
Example: Time, Weight,
Height, …because of
our limitations of
measurement accuracy
we often round to the
nearest second, ounce,
inch,…
Summarizing Data
w/ Bar graph of Categorical Data
Data table: TX_betweenHoustonDallas (cases 386 to 391)
     sex  age  race      hisp  ancestry     marital  eduCode  eduText       income  industry      job
386  F    20   White     0     Czechosl...  Nev      11       Some coll...  2300    Construc...   Bookkee...
387  F    19   White     1     Mexican      Nev      10       High sch...   3000    Miscellan...  amuseme...
388  F    21   Filipino  0     Filipino     Nev      11       Some coll...  3084    Eating an...  Waiter/w...
389  M    19   White     0     French       Nev      11       Some coll...  2000    Construc...   Construc...
390  M    20   White     1     German       Nev      11       Some coll...  1500    Eating an...  Cashier
391  M    19   White     0                  Nev      11       Some coll...  3000    Air trans...  Weigher/...
[Bar chart: TX_betweenHoustonDallas. Count (vertical axis, ticks 20 to 80) for each ancestry category (horizontal axis), 45 categories from Acadian to Welsh.]
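A bar graph of categorical data is a picture of category counts. The sketch below tallies the race values that appear in the six sample cases (386 to 391) of the data table and prints a text-only bar chart; the helper name text_bar_chart is hypothetical.

```python
from collections import Counter

# race values from the six sample cases (386 to 391) in the data table
race = ["White", "White", "Filipino", "White", "White", "White"]

counts = Counter(race)  # category -> frequency

def text_bar_chart(counts):
    """Return one line per category: the name, a bar of '#', and the count."""
    width = max(len(name) for name in counts)
    return [f"{name.ljust(width)} | {'#' * n} ({n})"
            for name, n in counts.most_common()]

for line in text_bar_chart(counts):
    print(line)
```

Each bar's length is the count for that category, most frequent category first.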
Summarizing Data with Pie Chart
for Categorical Data
Dotplot for Univariate
Quantitative Data
[Dot plot: paneldat. Temperature on the horizontal axis, 0 to 160.]
Stemplot for Quantitative Data
Ages of Death of U.S. First Ladies
3 | 4 indicates 34 years old
3 | 4, 6
4 | 3
5 | 2, 4, 5, 7, 8
6 | 0, 0, 1, 2, 4, 4, 4, 5, 6, 9
7 | 0, 1, 3, 4, 6, 7, 8, 8
8 | 1, 1, 2, 3, 3, 6, 7, 8, 9, 9
9 | 7
Stem: the value left of the bar. Leaf: a single digit, right of the bar.
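A stemplot like this one can be built mechanically. The sketch below assumes non-negative whole-number data, with the ones digit as the leaf and everything to its left as the stem; the function name stemplot is hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = tens digit(s), leaf = ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Include every stem from smallest to largest so gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + ", ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

ages = [34, 36, 43, 52, 54, 55, 57, 58]  # the first three stems of the plot
for row in stemplot(ages):
    print(row)
```

Empty stems are printed on purpose, since a gap in the data is a feature worth seeing.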
Split Stemplot
Stem is split for every 2 leaves: (0, 1), (2, 3), (4, 5), (6, 7), and (8, 9)
1 | 7
1 | 8, 9, 9, 9, 9, 9
2 | 0, 0, 0, 0, 1, 1, 1, 1, 1, 1
2 | 2, 2, 2, 3, 3
2 | 4, 5
2 |
2 | 8
3 | 0, 1
Age of 27 students randomly selected from Stat 303 at A&M
Split Stemplot
Stem is split for every 5 leaves: (0 thru 4) and (5 thru 9)
1 |
1 | 7, 8, 9, 9, 9, 9, 9
2 | 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4
2 | 5, 8
3 | 0, 1
3 |
Age of 27 students randomly selected from Stat 303 at A&M
Back-to-back Stemplot
          Babe Ruth       Roger Maris
                    | 0 | 8
                    | 1 | 3, 4, 6
               5, 2 | 2 | 3, 6, 8
               5, 4 | 3 | 3, 9
9, 7, 6, 6, 6, 1, 1 | 4 |
            9, 4, 4 | 5 |
                  0 | 6 | 1
Number of home runs in a season
Histogram—Univariate Quantitative data
[Histogram: TX_betweenHoustonDallas. Univariate variable age (horizontal axis, 0 to 100) against frequency count (vertical axis, 0 to 120).]
Boxplot and Modified Boxplot
“Divides data into 4 quarters”: 25% of data in each section.
[Box plots: HusbandsAndWives. Age_Wife (axis 15 to 65) and Ht_Husband (axis 1550 to 1950).]
Comparative Parallel Boxplots—
Univariate quantitative data by category
[Box plot: TX_betweenHoustonDallas. age (axis 0 to 90) by sex (M, F), with outliers marked.]
Cumulative Frequency Plot
Scatterplot—Bivariate quantitative data
[Scatter plot: Olympics - Mens Field Trends. LongJump_m (6.0 to 9.0) versus year (1880 to 2000).]
Summary Features of Quantitative Variables
Center: Location
Spread: Variability
Shape: Distribution pattern of the data
Any unusual features? Explain in context.
Location—Center
Mean (μ, x̄): add up data values and divide by
number of data values
Median: list data values in order, locate middle
data value
Data Set: 19, 20, 20, 21, 22
Mean is $\bar{x} = \frac{19 + 20 + 20 + 21 + 22}{5} = \frac{102}{5} = 20.4$
Median is 20 since it is the middle number
of the ranked (ordered) data values.
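The two calculations can be checked in a few lines. This is a sketch for the five-value data set above; the shortcut used for the median assumes an odd number of values.

```python
data = [19, 20, 20, 21, 22]

# Mean: add up the data values and divide by how many there are.
x_bar = sum(data) / len(data)

# Median: sort the values and take the middle one.
middle = sorted(data)[len(data) // 2]   # valid because len(data) is odd

print(x_bar)    # 20.4
print(middle)   # 20
```

With an even number of values there is no single middle value; the usual convention is to average the two middle values.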
Robust (Resistant) Statistic
Median is resistant to extreme values (outliers) in
data set.
Mean is NOT robust against extreme values. Mean
is pulled away from the center of the distribution
toward the extreme value (“tails of graph”).
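A quick check of this claim, using two five-value data sets that also appear in the standard deviation table later in these slides. They differ only in one extreme value.

```python
from statistics import mean, median

typical = [90, 90, 100, 110, 110]
with_outlier = [90, 90, 100, 110, 320]   # one extreme value

for data in (typical, with_outlier):
    print(data, "mean =", mean(data), "median =", median(data))
```

The median is 100 in both cases, while the mean moves from 100 to 142, toward the extreme value.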
Of the 2 segments, where’s the Mean
with respect to the Median?
Remember the mean is pulled toward extreme values.
Where’s the Mean with respect to
the Median?
Mean or
Median?
Location: pth Percentile
The pth percentile of a distribution (set of
data) is the value such that p percent of the
observations fall at or below it.
Suppose your Math SAT score is at the 80th
percentile of all Math SAT scores. This means 80% of
all Math SAT scores were at or below your score.
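The "at or below" definition translates directly into a count. In the sketch below the ten scores are hypothetical, and the function name percent_at_or_below is made up for this example.

```python
def percent_at_or_below(data, value):
    """Percent of observations in data that are at or below value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)

# Hypothetical scores, for illustration only.
scores = [450, 480, 500, 520, 550, 580, 600, 640, 700, 760]
print(percent_at_or_below(scores, 640))   # 80.0
```

Eight of the ten scores are at or below 640, so 640 sits at the 80th percentile of this small set.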
Describing Location: Quartiles
Spread: Range and Interquartile Range
Range = Maximum – minimum
Q1 (Quartile 1) is the 25th percentile of ordered data
or median of lower half of ordered data
Median (Q2) is 50th percentile of ordered data
Q3 (Quartile 3) is the 75th percentile of ordered data
or median of upper half of ordered data
IQR(Interquartile Range) = Q3 – Q1
Any point that falls outside the interval from
Q1 – 1.5(IQR) to Q3 + 1.5(IQR) is considered an
outlier.
Summary Statistics
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
min = 1, Q1 = 3.5, median = 7, Q3 = 10.5, max = 13
Range = Max – min = 13 – 1 = 12
IQR = Q3 – Q1 = 10.5 – 3.5 = 7
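A sketch of the five-number summary using the "median of the lower half, median of the upper half" rule for the quartiles. Other quartile conventions exist and can give slightly different values; the function name five_number_summary is hypothetical, and the data must have at least two values.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the middle position
    upper = ordered[half + n % 2:]   # skip the middle value when n is odd
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

low, q1, med, q3, high = five_number_summary(range(1, 14))
print(low, q1, med, q3, high)   # 1 3.5 7 10.5 13
print("Range =", high - low)    # Range = 12
print("IQR =", q3 - q1)         # IQR = 7.0
```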
Boxplot—5 Number Summary
[Box plot: Computersx1000 (ComputerDensity). ThouComputers on the horizontal axis, 0 to 8000, with min, Q1, median, Q3, and max labeled. Values printed on the slide: 250, 1000, 2400, 2950, 3500, 5400, 8600.]
IQR = Q3 – Q1 = 5400 – 1000 = 4400
Calculating boundaries for potential outliers
1. Find Q1 and Q3. Example: Q1 = 10, Q3 = 20
2. Calculate IQR = Q3 – Q1. Example: IQR = 10
3. Multiply IQR by 1.5. Example: 1.5·IQR = 15
4. Subtract this from Q1. Example: Q1 – 1.5 IQR = 10 – 15 = -5
5. Add it to Q3. Example: Q3 + 1.5 IQR = 20 + 15 = 35
These are the boundaries: (-5, 35)
If any data value falls outside of this interval, the data
values are to be considered potential outliers.
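The boundary calculation above can be sketched as two small functions. The function names and the six data values in the second call are hypothetical; Q1 = 10 and Q3 = 20 are the values from the worked example.

```python
def outlier_boundaries(q1, q3):
    """Return (lower, upper) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step

def potential_outliers(data, q1, q3):
    """Return the values that fall outside the boundaries."""
    lower, upper = outlier_boundaries(q1, q3)
    return [x for x in data if x < lower or x > upper]

print(outlier_boundaries(10, 20))                            # (-5.0, 35.0)
# Hypothetical data values, for illustration only.
print(potential_outliers([-8, 3, 12, 18, 35, 40], 10, 20))   # [-8, 40]
```

A value exactly on a boundary (35 here) is not flagged, since it does not fall outside the interval.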
Describing Spread: Standard Deviation
Roughly speaking, standard deviation is the
average distance values fall from the mean
(center of graph).
Population and Sample
Standard Deviation
Population: $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{n}}$
Sample: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
$\sigma^2$ = population variance, $s^2$ = sample variance
What is Variance?
Variance = (Standard deviation)²
Calculated Standard Deviation
is a measure of Variation in data
Sample Data Set            Mean   Standard Deviation
100, 100, 100, 100, 100    100    0
90, 90, 100, 110, 110      100    10
30, 90, 100, 110, 170      100    50
90, 90, 100, 110, 320      142    99.85
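A sketch that computes the standard deviation from scratch for the four data sets in the table. The function names are hypothetical. The tabulated values appear to match the sample formula (dividing by n - 1); the population version (dividing by n) is printed alongside for comparison.

```python
from math import sqrt

def population_sd(data):
    """sigma: divide the summed squared deviations by n."""
    mu = sum(data) / len(data)
    return sqrt(sum((x - mu) ** 2 for x in data) / len(data))

def sample_sd(data):
    """s: divide the summed squared deviations by n - 1."""
    x_bar = sum(data) / len(data)
    return sqrt(sum((x - x_bar) ** 2 for x in data) / (len(data) - 1))

data_sets = [
    [100, 100, 100, 100, 100],
    [90, 90, 100, 110, 110],
    [30, 90, 100, 110, 170],
    [90, 90, 100, 110, 320],
]
for data in data_sets:
    print(data, "s =", round(sample_sd(data), 2),
          "sigma =", round(population_sd(data), 2))
```

Squaring either result gives the corresponding variance.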
Descriptive Terms
Trend
Descriptive Terms of Sampling Distribution
(Histogram) and Model (Red Curve)
Shape: bell-shaped curve, symmetric
Descriptive Terms
of Population Models
Skewed Right (or Skewed Left)
“Tail” points to right
Descriptive Terms
of Sampling Distribution
Cluster, Gaps, Potential Outliers
[Histogram: HusbandsAndWives. Count (vertical axis, 5 to 45) by Age_Husb_at_Marriage (horizontal axis, 20 to 60).]
Uniform Population Model
Total area under the curve (model)
will always equal 1.
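For a uniform model this is easy to verify, because the region under the curve is a rectangle. Assuming the model covers an interval from a to b (hypothetical endpoints, not given on the slide), the constant height has to be 1/(b - a):

```latex
\text{Area} = \text{width} \times \text{height} = (b - a) \cdot \frac{1}{b - a} = 1
```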
Various Population Models