Welcome to the Graduate Workshop in
Statistics
Instructor: Kam Hamidieh
Monday July 11, 2005
Today’s Agenda
• Workshop Introductions & Website Tour
• Plan Ahead
• Today: All About Descriptive Statistics
• Brief Demo of SPSS & R (if time permits)
2
Before we start…
When you see Bert on a
slide, either I will go over
the slide quickly or skip it
entirely. However, you will
need to read it on your own
since the subsequent
sessions will depend on
these.
When you see Sir Isaac
Newton, it means that the
slide will be of
technical/mathematical
nature. Read it if you wish.
I will not use the material in
the subsequent sessions.
3
Workshop Plan
• July 11: Descriptive Statistics - making graphical
and numerical summaries of data
• July 18: language of research studies in statistics
& a crash course in Probability and Random
Variables
• July 25: Hypothesis Testing, lots of t-tests, &
confidence intervals
• August 1: Categorical data & chi-squared tests
• August 8: Linear Regression
• August 15: ANOVA and catch up.
4
What is Statistics?
• Statistics is the art of learning from data. It is
concerned with the collection of data, their
subsequent description, and their analyses, which
often leads to the drawing of conclusions.
• Want to know more about what statistics is and
how its meaning has evolved? See Vic Barnett’s
Comparative Statistical Inference
5
Biostatistics
• Statistics applied to biological (life) problems,
including:
– Public health
– Medicine
– Ecological and environmental
• Much more statistics than biology; however,
biostatisticians must also learn the biology.
6
Some Additional Terms
• Bioinformatics - computerized and statistical
analyses of biological data to extract
biological information, particularly in studying the
nucleotide sequences of DNA.
• Microarray Data – Data with lots of variables and
a few observations (more variables than cases).
Mostly biological data.
7
Some Other Applications
• Finance: statistical models used for analysis of
stocks, bonds, and currencies to control risk or
make money
• Economics: statistical models used to forecast
economic trends
• Clinical trials: testing effectiveness of drugs
• Information technology: network traffic analysis,
pattern recognition, separation of noise from
data
• Business: fraud detection
• Government: analysis of current economic
situation, forecasting, opinion polling
8
Further Reading for Pleasure!
• For a detailed history of statistics see Stephen
M. Stigler’s History of Statistical Concepts and
Methods.
• For new approaches to statistics see Breiman’s
Statistical Modeling: The Two Cultures.
9
The Big Picture in Statistics
Use a small group of units to make some
conclusions (inference) about a larger group
[Diagram: a Sample drawn from a larger Population whose characteristics are unknown]
10
Populations and Parameters
• Population – a group of individuals (or things) that
we would like to know something about
• Parameter - a characteristic of the population in
which we have a particular interest
– Often denoted with Greek letters (µ, σ)
– Examples:
• The proportion of the population that would
respond to a certain drug
• The population average height of males in
Michigan
11
Samples and Statistics
• Sample – a subset of a population (hopefully
representative and random)
• Statistic – a characteristic of the sample (any
function of the sample data)
– Example:
• The observed proportion of the sample that
responds to treatment
• The observed average height of a sample of males in
Michigan
12
Example
• A sample of 1000 women between the ages of 30
and 39 is randomly chosen across the US for a
marketing study. The results are: 825 women
prefer product A over product B (or 175 prefer B
over A)
• Population?
All women between the ages of 30 and 39 living in the US
• Sample?
The 1000 women, 30-39, sampled in the survey
• Parameter?
Population proportion of women 30-39 in the US preferring A over B; this
is unknown
• Statistic?
Sample proportion of women 30-39 in the random sample of n=1000 who
preferred A over B. Here the value of this statistic is 825/1000 or 82.5%
13
Populations and Samples
• Studying populations is too expensive and time-consuming, and thus impractical
• If a sample is representative of the population,
then by observing the sample we can learn
something about the population
– And thus by looking at the characteristics of
the sample (statistics), we may learn something
about the characteristics of the population
(parameters).
14
Issues
• Samples are random
– If we had chosen a different sample, then we
would obtain different values for the statistics
(although we are trying to estimate the same
(unchanged) population parameters).
• Samples should represent the population
15
Explanatory and Response Variables
• Many questions in statistics are about the
relationship between two or more variables.
• It is useful to identify one variable as the
explanatory and the other variable as the
response variable.
• In general, the value of the explanatory variable
for an individual is thought to partially explain or
account for the value of the response variable.
16
Explanatory and Response Variable
• Other names:
– Explanatory: independent, factor, treatment,
input, x
– Response: dependent, y, output
17
Statistical Analyses
• Descriptive Statistics
– Describe the sample – use numerical and
graphical summaries to characterize a data set
• Inference
– Make inferences about the population
– Primarily performed in two ways:
• Hypothesis testing
• Estimation
– Point estimation
– Interval estimation
18
Descriptive Statistics - Data
• Pieces of information
• Types of Data
• Categorical Data:
– Nominal – unordered categories
– Ordinal – ordered categories
• Quantitative Data
– Discrete – only whole numbers are possible,
order and magnitude matters
– Continuous – any value is conceivable
19
Summary of Data Types
Types of Data
Categorical
Nominal
Quantitative
Ordinal
Discrete
Continuous
20
Examples of Data Types
• Age (years): quantitative, continuous
• Car Manufacturer (GM, Ford, etc.): categorical, nominal
• Starting Salary in Dollars: quantitative, continuous
• Starting Salary (Low, Med., High): categorical, ordinal
• Calcium Level (micrograms per liter): quantitative, continuous
• Current Smoker (yes or no): categorical, nominal
• Number on the roll of a die: quantitative, discrete
21
Data
• The vast majority of errors in research arise
from poor planning (e.g., data collection)
• Fancy statistical methods cannot rescue garbage
data
• Collect exact values whenever possible
22
On Descriptive Statistics
• It is ALWAYS a good idea to summarize your
data
– You become familiar with the data and the
characteristics of the people/things that you
are studying
– You can also identify problems or errors with
the data
• This is the first step in any statistical
analysis
23
Dataset Structure
• Think of data as a rectangular matrix of rows and
columns.
• Rows represent the “experimental unit” (e.g.,
person)
• Columns represent variables measured on the
experimental unit
24
Example Data Set
• Data are for 11 variables and n = 1,606
respondents in the 1993 General Social Survey, a
national survey done by the National Opinion
Research Center at the University of Chicago.
Some questions are only asked of about two-thirds
of the survey participants, so there is
quite a bit of missing data. (Source: SDA archive
at UC Berkeley website,
http://csa.berkeley.edu:7502)
• I will be using a smaller version of it with only
n=500.
25
Example Data Set
Column  Name      Description
C1      sex       Sex of respondent
C2      race      Race of respondent (White, African American, Other)
C3      degree    Highest educational degree received (Five categories)
C4      relig     Religious preference (Catholic, Protestant, Jewish, Other)
C5      polparty  Does respondent think of self as Democrat, Republican, Indep. or Other?
C6      cappun    Does the respondent favor or oppose the death penalty?
C7      tvhours   Hours of watching television on a typical day
C8      marijuan  Whether the respondent thinks marijuana should be legalized or not
C9      owngun    Whether respondent owns a gun or not (Yes or No)
C10     gunlaw    Does respondent favor or oppose a law requiring a permit to buy a gun?
C11     age       Age of the respondent
26
Example Data Set (DataSet_1)
27
Summarizing Categorical Data
• Numerical Summaries
– Frequency/Count tables
• Visual Summaries
– Pie Charts – good for summarizing a single
categorical variable
– Bar Charts – good for summarizing one or two
categorical variables and useful for making
comparisons when there are two categorical
variables
28
Numerical Summary of Categorical Data
• Count how many fall into each category
• Calculate the percent in each category
• If two variables, have the categories of the
explanatory variable define the rows and compute
the row percentages
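The counting and row-percentage steps above can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure used in the workshop. The counts are copied from the polparty * owngun crosstabulation in these slides, and the name `row_percentages` is made up for this example.

```python
# Counts copied from the polparty * owngun crosstabulation in these slides.
# Rows are the explanatory variable (party), columns the response (owns a gun).
counts = {
    "Democrat": {"No": 72, "Yes": 37},
    "Indpndnt": {"No": 61, "Yes": 44},
    "Other": {"No": 3, "Yes": 1},
    "Republcn": {"No": 40, "Yes": 62},
}


def row_percentages(table):
    """Return each cell as a percent of its own row total (one decimal)."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: round(100 * n / total, 1) for col, n in cells.items()}
    return result


row_pct = row_percentages(counts)
for party, pct in row_pct.items():
    print(party, pct)
```

Each row sums to 100%, so the rows can be compared directly even though the parties have different numbers of respondents.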
29
Example
Numerical Summary of the Sex Variable
sex      Frequency   Percent   Valid Percent   Cumulative Percent
Female   283         56.6      56.6            56.6
Male     217         43.4      43.4            100.0
Total    500         100.0     100.0
30
Example
Bar Chart of the Sex Variable
[Bar chart: Percent by sex. Female 56.6%, Male 43.4%]
31
Example
Numerical Summary of political party affiliation
polparty            Frequency   Percent   Valid Percent   Cumulative Percent
Valid    Democrat   157         31.4      31.7            31.7
         Indpndnt   178         35.6      35.9            67.5
         Other      5           1.0       1.0             68.5
         Republcn   156         31.2      31.5            100.0
         Total      496         99.2      100.0
Missing             4           .8
Total               500         100.0
Question: What percentage of people in the US identify
themselves as democrat/independent/republican/other?
At least we have some descriptive information from the
table above: independents form the largest group,
while the percentages of Democrats and
Republicans are very close.
32
Example
Visual Summary of the political party affiliation
[Bar chart: Percent by polparty. Democrat 31.7%, Indpndnt 35.9%, Other 1.0%, Republcn 31.5%]
33
Example
Numerical Summary of the Sex Variable vs.
Political Party Affiliation
sex * polparty Crosstabulation
                         polparty
sex                      Democrat   Indpndnt   Other   Republcn   Total
Female   Count           94         95         2       89         280
         % within sex    33.6%      33.9%      .7%     31.8%      100.0%
Male     Count           63         83         3       67         216
         % within sex    29.2%      38.4%      1.4%    31.0%      100.0%
Total    Count           157        178        5       156        496
         % within sex    31.7%      35.9%      1.0%    31.5%      100.0%
Question: Is there a difference in party affiliation (in %)
between the men and the women?
Again some descriptive information is available. There
does not seem to be a big difference.
34
Example
Numerical Summary of the political party
affiliation vs. own a gun
polparty * owngun Crosstabulation
                                 owngun
polparty                         No       Yes      Total
Democrat   Count                 72       37       109
           % within polparty     66.1%    33.9%    100.0%
Indpndnt   Count                 61       44       105
           % within polparty     58.1%    41.9%    100.0%
Other      Count                 3        1        4
           % within polparty     75.0%    25.0%    100.0%
Republcn   Count                 40       62       102
           % within polparty     39.2%    60.8%    100.0%
Total      Count                 176      144      320
           % within polparty     55.0%    45.0%    100.0%
35
Example
Look at these!
Question: Is there a
relationship between gun
ownership and party
affiliation?
Descriptively, there seems
to be a relationship. Most
republicans seem to be gun
owners.
36
Questions to Ask – 1 Categorical Variable
Question: How many and what percentage of
individuals fall into each category?
Example: What percentage of college students
favor legalization of marijuana?
Question: Are individuals equally divided across
categories or do the percentages across
categories follow some other interesting pattern?
Example: When individuals are asked to choose a
number from 1 to 10, are all numbers equally likely
to be chosen?
37
Questions to Ask – 2 Categorical Variables
Question: Is there a relationship between the two
categorical variables, so that the category into
which individuals fall for one variable seems to
depend on which category they are in for the
other variable?
Example: Is there a relationship between gun
ownership and party affiliation?
Another Example: The relationship between smoking
and lung cancer was detected, in part, because
someone noticed that the combination of being
a nonsmoker and having lung cancer is unusual.
38
Descriptive Statistics – Quantitative Data
• We will use a new data set from
http://www.infoplease.com/ipa/A0194030.html on the age of
presidents at inauguration
39
Interesting Features of Quantitative Variables
• Quick glance at the data values (Bloody Eyeball
Test!)
• Location: where most values lie or the value that
represents the data best e.g. mean or median
• Spread: variability in data
• Shape: a bit later…
• Five number summary: find the extremes (high, low),
the median, and the quartiles (medians of the lower
and upper halves of the values).
40
Location of a Data Set: Mean, Median, and Mode
• Mean: the numerical average, sum the data and then
divide by the number of data points
• Formula: $\bar{x} = \frac{\sum x_i}{n}$
• Median: the middle value (if n odd) or the average of the
middle two values (n even) once the data have been
ordered. 50% of data are above the median and 50% are
below the median.
• Mode: it is the measurement that occurs most often.
41
A Word about Notation
Notation for Data:
n = number of individuals in a data set
x1, x2 , x3,…, xn represent individual raw data values
Example:
A data set consists of the presidents’ ages
at inauguration;
the values are 57, 61, …, 46, and 54.
Then, n = 43
x1 = 57, x2 = 61, …, x42 = 46, and x43 = 54
42
Example of Mean
• What is the average age of the US
Presidents at inauguration?
Mean age =
(57 + 61 + … + 46 + 54)/43 = 54.81 ≈ 55
Statistics: age
N (Valid)         43
N (Missing)       0
Mean              54.81
Median            55.00
Std. Deviation    6.235
Range             27
Minimum           42
Maximum           69
25th percentile   51.00
50th percentile   55.00
75th percentile   58.00
43
Example of Median
• What is the median age of
the US Presidents at inauguration?
Note n=43, n is odd, so take the
(43+1)/2 = 22nd value of the sorted data, which is
55.
Note: Data has been sorted.
Statistics: age
N (Valid)         43
N (Missing)       0
Mean              54.81
Median            55.00
Std. Deviation    6.235
Range             27
Minimum           42
Maximum           69
25th percentile   51.00
50th percentile   55.00
75th percentile   58.00
44
A Bit More About Median
Median Calculations
If n is odd: M = middle of ordered values.
Count (n + 1)/2 down from top of ordered list.
If n is even: M = average of middle two ordered values.
Average values that are (n/2) and (n/2) + 1
down from top of ordered list.
Say you have the following list of numbers:
18,29,33,45,88,100
The median here is the average of 33 and 45 so
(33 + 45)/2 = 39.
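The odd/even rule above can be written out directly. The sketch below is a plain-Python illustration of that rule (the function name `median_by_rule` is invented for this example). In practice a library routine such as `statistics.median` does the same job.

```python
def median_by_rule(values):
    """Median via the slide's rule: middle value, or mean of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: count (n + 1)/2 down the ordered list (1-based position).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at positions n/2 and n/2 + 1 (1-based).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median_by_rule([18, 29, 33, 45, 88, 100]))  # 39.0
```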
45
Describing Spread/Variability in Data
• Range = highest/max value – lowest/min value
• Interquartile Range (IQR) = upper quartile –
lower quartile
• Standard Deviation: a bit later….
46
Describe the Spread - Quartiles
• Split the ordered values into the half that is below the median and
the half that is above the median.
• Q1 = lower quartile = median of data values that are below the
median
• Q3 = upper quartile = median of data values that are above the
median
• Q2 is just the median
• IQR, Interquartile Range = Q3 - Q1
• Min, Max, Median, Q1, and Q3 used in creation of boxplots
[Diagram: Min |-- 25% --| Q1 |-- 25% --| Med |-- 25% --| Q3 |-- 25% --| Max]
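A minimal Python sketch of the quartile rule above, applied to the six numbers from the median slide. One assumption is made here: the halves are split by position in the ordered list, and the middle value is left out of both halves when n is odd. Statistical packages often interpolate instead, so SPSS or R may report slightly different quartiles for the same data. The name `five_number_summary` is invented for this example.

```python
from statistics import median


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to form two halves")
    half = n // 2
    lower = ordered[:half]        # values below the median position
    upper = ordered[n - half:]    # values above the median position
    return {
        "min": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


summary = five_number_summary([18, 29, 33, 45, 88, 100])
iqr = summary["q3"] - summary["q1"]
print(summary, "IQR =", iqr)
```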
47
Example Using the Presidents Age Data
Statistics: age
N (Valid)         43
N (Missing)       0
Mean              54.81
Median            55.00
Std. Deviation    6.235
Range             27       (Max - Min)
Minimum           42       (Min)
Maximum           69       (Max)
25th percentile   51.00    (Q1)
50th percentile   55.00
75th percentile   58.00    (Q3)
• About 25% of the presidents were 51 years old or
younger.
• About 75% were 58 or less.
• About 50% (the middle 50%) were between the ages of
51 and 58. IQR = 58 - 51 = 7
• The oldest was 69 (Reagan) and the youngest 42
(T. Roosevelt). Range = 69 - 42 = 27.
• About 50% were 55 or less or equivalently about 50%
were 55 or older.
48
The Spread and Shape of Data are important!
• Suppose 20 people take
exams. Possible scores go
from 0 to 100. The average
score is 87. Bob got an 88.
How well do you think he
did?
Case I: Bob is hot!
Case II: Bob is not so hot!
Just knowing the mean or the
median is not enough. We need
to know something about the
spread and shape of data.
Case I scores: 80, 81, 85, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 87, 87, 87, 88, 99, 100
Case II scores: 0, 3, 88, 95, 95, 95, 95, 96, 96, 96, 96, 96, 96, 97, 98, 98, 100, 100, 100, 100
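To see why the mean alone is not enough, the two sets of scores can be compared in a short Python sketch. The scores are the Case I and Case II values from this slide. Counting how many scores fall below Bob's 88 is just one simple way to gauge his standing, and the helper name `scores_below` is invented for this example.

```python
from statistics import mean, median

# The 20 exam scores from each case on this slide.
case_1 = [80, 81, 85] + [86] * 11 + [87] * 3 + [88, 99, 100]
case_2 = [0, 3, 88] + [95] * 4 + [96] * 6 + [97] + [98] * 2 + [100] * 4
bob = 88


def scores_below(scores, score):
    """Number of scores strictly below the given score."""
    return sum(1 for s in scores if s < score)


for name, scores in (("Case I", case_1), ("Case II", case_2)):
    print(name, "mean:", mean(scores), "median:", median(scores),
          "scores below Bob:", scores_below(scores, bob))
```

Both cases have the same mean of 87, yet Bob beats 17 of 20 scores in Case I and only 2 of 20 in Case II.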
49
Graphical Summaries for Quantitative Data
• Histograms: similar to bar graphs, used for any
number of data values
• Stem and Leaf plot and dot plots: present all
the individual values, useful for small to moderate
sized data sets.
• Boxplots: useful summary for comparing two or
more groups.
• Scatter Plot: very useful for exploring
relationships between two variables
50
Creating a Histogram
1. The horizontal axis has your variable of interest.
2. Decide how many equally spaced intervals to use for the
horizontal. Between 6 and 15 intervals is a good number.
3. Decide whether to use frequencies (count) or relative frequencies
(proportion) on the vertical axis. Relative frequency is
usually a better choice.
4. Draw equally spaced intervals on the horizontal axis
covering the entire range of data values.
5. Determine frequency or relative frequency of data values
in each interval and draw a bar with corresponding height.
6. Decide on a rule to use for values that fall on the border
between two intervals.
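The counting behind a histogram (steps 2 to 6) can be sketched as follows. This is an illustration rather than what SPSS or R does internally. The data are the Case I exam scores from an earlier slide, only four intervals are used to keep the output short (fewer than the 6 to 15 suggested above), and the border rule is an assumption: a value on a border goes into the interval to its right. The name `histogram_counts` is invented for this example.

```python
def histogram_counts(values, low, width, n_bins):
    """Count values in n_bins equal-width intervals starting at `low`.

    Border rule (an assumption): a value on a border is counted in the
    interval to its right; the top edge is counted in the last interval.
    """
    high = low + width * n_bins
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"{v} is outside [{low}, {high}]")
        index = min(int((v - low) // width), n_bins - 1)
        counts[index] += 1
    return counts


# Case I exam scores from the earlier slide.
scores = [80, 81, 85] + [86] * 11 + [87] * 3 + [88, 99, 100]
counts = histogram_counts(scores, low=80, width=5, n_bins=4)
relative = [c / len(scores) for c in counts]
for i, (c, r) in enumerate(zip(counts, relative)):
    start = 80 + 5 * i
    print(f"{start}-{start + 5}: count={c}, relative frequency={r:.2f}")
```

Changing `width` or `low` changes the picture, which is why different software defaults can produce different-looking histograms of the same data.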
51
Histogram of Presidents Age Data
Bin Sizes = 2.5
12 intervals.
52
Various Histograms
(presidents age data)
53
Histograms and Software Dependency
Default Histogram Generated by R, Bin Size = 5, 6 Intervals
[Histogram: Frequency (0 to 15) on the vertical axis, Age (40 to 70) on the horizontal axis]
54
Describing Shape
(By using histograms)
55
How About the Presidents Age Data?
Seems approximately bell shaped.
56
Mean vs. Median
• Mean is sensitive to extreme values.
• Median is not sensitive to extreme values.
• Simple example:
Say you have the following set of data (n=7):
{2,4,6,8,10,12,14}
The mean and the median are both 8.
Now suppose you have {2,4,6,8,10,12,50}
The mean jumps to 13.14 but the median is still 8.
Which is a better measure for location?
Median in this case.
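The same comparison can be checked with Python's standard `statistics` module, using the two data sets from this slide:

```python
from statistics import mean, median

original = [2, 4, 6, 8, 10, 12, 14]
with_extreme = [2, 4, 6, 8, 10, 12, 50]  # largest value replaced by 50

print(mean(original), median(original))
print(round(mean(with_extreme), 2), median(with_extreme))
```

Only one value changed, yet the mean moved by more than 5 units while the median did not move at all.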
57
Mean vs. Median
• Note:
– Symmetric data: mean ≈ median (e.g. presidents age
data)
– Skewed Left: mean < median
– Skewed Right: mean > median
• If your data is approximately symmetric then
better to use mean
• Note extreme values can cause skewness
• With extreme skewness, median may be a better
measure
• How about “outliers”? Hang on….
58
Mean vs. Median (from DataSet1)
Statistics: tvhours
N (Valid)         497
N (Missing)       3
Mean              2.92
Median            2.00
Minimum           0
Maximum           16
25th percentile   1.50
50th percentile   2.00
75th percentile   4.00
The number of hours of TV watched per day is right skewed.
Here the mean of 2.92 hours is greater than the median value of
2.00 hours per day.
59
Outliers
• Outlier: a data point that does not seem to be
consistent with the bulk of the data.
• Remarks:
– Look for them via graphs. Recommend
boxplots.
– Can have a big influence on conclusions.
– Can cause complications in some statistical
analysis.
– Cannot be discarded without solid justification
60
More on Outliers
• An outlier is not necessarily a bad thing!
Examples: Credit card Fraud: very high activity
associated with stolen card.
• May sometimes be due to errors in data entry.
Example: You have height data for people and
the minimum height shows up as 2 inches! Can’t
be right!
• How do you detect them? I recommend graphical
methods such as boxplots.
61
More on Outliers
It is a BAD idea to exclude outliers in an automatic
manner:
NASA launched the Nimbus 7 satellite to record
atmospheric data. A few years later, in 1985,
scientists observed a large decrease in ozone
over Antarctica. It was found later that the
NASA data processors were automatically
throwing away very small ozone readings,
assuming them to be mistakes. Had this
been known earlier, perhaps the CFC phase-out would
have been implemented sooner!
62
Possible Reasons for Outliers and Reasonable
Actions
• Mistake made while taking measurement or entering it into
computer. If verified, should be discarded/corrected.
• Individual in question belongs to a different group than bulk
of individuals measured. Values may be discarded if
summary is desired and reported for the majority group
only.
• Outlier is legitimate data value and represents natural
variability for the group and variable(s) measured. Values
may not be discarded — they provide important information
about location and spread.
63
Boxplots – Presidents Age Data
[Boxplot of the presidents age data, with Max, Q3, Q2, Q1, and Min marked]
• Possible outliers are marked.
• Apart from outliers, lines extending from the box reach to the min and max.
• The box covers 50% of the data.
Boxplot gives you the visual summary of:
Statistics: age
N (Valid)         43
N (Missing)       0
Median            55.00
Minimum           42
Maximum           69
25th percentile   51.00
50th percentile   55.00
75th percentile   58.00
64
Boxplots – Presidents Age Data from R
> summary(p$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  42.00   51.00   55.00   54.81   57.50   69.00
[Boxplot: Presidents, with the Age axis running from 45 to 70]
65
Comparing Two Groups via Boxplots
Boxplots are a great graphical tool for comparing
numerical summaries across different categories.
66
Drawing Boxplots
• Step 1: Label either a vertical axis or a horizontal axis
with numbers from min to max of the data.
• Step 2: Draw box with lower end at Q1 and upper end at
Q3.
• Step 3: Draw a line through the box at the median.
• Step 4: Draw a line from Q1 end of box to smallest data
value that is not further than 1.5 × IQR from Q1.
Draw a line from Q3 end of box to largest data value
that is not further than 1.5 × IQR from Q3.
• Step 5: Mark data points further than 1.5 × IQR from
either edge of the box with an asterisk. Points
represented with asterisks are considered to be
outliers.
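The numbers needed for steps 2 to 5 can be computed before anything is drawn. The sketch below is an illustration under two assumptions: quartiles are found with the median-of-halves rule from the earlier quartile slide (software may interpolate differently), and the data are the Case II exam scores from the spread-and-shape slide. The name `boxplot_stats` is invented for this example.

```python
from statistics import median


def boxplot_stats(values, k=1.5):
    """Quartiles, whisker ends, and outliers using the k * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    q1 = median(ordered[:half])       # median of the lower half
    q3 = median(ordered[n - half:])   # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "whisker_low": inside[0],     # smallest value within 1.5 * IQR of Q1
        "whisker_high": inside[-1],   # largest value within 1.5 * IQR of Q3
        "outliers": outliers,         # points to mark with an asterisk
    }


case_2 = [0, 3, 88] + [95] * 4 + [96] * 6 + [97] + [98] * 2 + [100] * 4
stats = boxplot_stats(case_2)
print(stats)
```

Here the three lowest scores fall more than 1.5 × IQR below Q1, so they would be drawn as asterisks rather than being reached by the lower whisker.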
67
Percentiles
The kth percentile is a number that has
k% of the data values at or below it and (100 –
k)% of the data values at or above it.
• Lower quartile = 25th percentile
• Median = 50th percentile
• Upper quartile = 75th percentile
68
Scatterplots
• Scatterplots are two dimensional plots of data (quantitative
data/variables of course.)
• They can tell us something about the strength, the direction,
and the nature of the relationship between two variables.
– Direction:
• Two variables have a positive association when the values
of one variable tend to increase as the values of the
other variable increase.
• Two variables have a negative association when the values
of one variable tend to decrease as the values of the
other variable increase.
– Strength: how tightly the points are clustered around some
straight line or a curve.
– Nature of the relationship: linear or curved?
• More when we cover Regression
69
Example Scatterplot
• The data set sats98 contains the
average math and verbal SAT
scores in 1998 for the 50 states
and the District of Columbia.
• The pcttook variable is the
percent of graduating
seniors who took the test
that year.
• Is there a relationship between
the average math and the
average verbal scores?
70
Example Scatterplot
The nature of the
relationship: There seems to
be a linear relationship. See
the line!
Direction: The relationship
is positive. As the math
scores go up, the verbal
scores go up on average.
Strength: The relationship
seems to be strong. The
points are bunched very
close to the possible
underlying line.
WARNING: You can NOT conclude that
high math scores cause high verbal scores!
71
Bell-Shaped Data
• Many data or measurements follow a predictable
pattern:
– Most values are clumped around a center.
– The greater the distance a value is from the
center, the fewer individuals have that value.
• Variables that follow such a pattern are said
to be “bell-shaped”. A special case is called
a normal distribution or normal curve.
72
Example: Presidents Age Data!
Our presidential age data was
approximately bell shaped.
73
Describing Spread via Standard Deviation
• Standard deviation measures variability by
summarizing how far individual data values are
from the mean of the data.
• Think of the standard deviation as roughly the
average distance values fall from the mean.
• It will be in the same units as our data.
74
Computing Standard Deviation
Formula for the (sample) standard deviation:
$$ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} $$
The value of $s^2$ is called the (sample) variance.
An equivalent formula, easier to compute, is:
$$ s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}} $$
75
Computing Standard Deviation
Step 1: Calculate x̄, the sample mean.
Step 2: For each observation, calculate the
difference between the data value
and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared differences in step 3,
and then divide this sum by n – 1.
Step 5: Take the square root of the value
in step 4.
76
Simple Example
Consider just four numbers: 62, 68, 74, 76
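Step 1: $\bar{x} = \frac{62 + 68 + 74 + 76}{4} = \frac{280}{4} = 70$
Steps 2 and 3:
Value   Difference from mean   Squared difference
62      -8                     64
68      -2                     4
74      4                      16
76      6                      36
Sum                            120
Step 4: $s^2 = \frac{120}{4 - 1} = 40$
Step 5: $s = \sqrt{40} \approx 6.3$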
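The five steps, and the shortcut formula from the previous slide, can be checked on the same four numbers with a short Python sketch. The function names are invented for this example. In practice `statistics.stdev` gives the same result.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation following steps 1 to 5."""
    n = len(values)
    xbar = sum(values) / n                                  # step 1
    squared_diffs = [(x - xbar) ** 2 for x in values]       # steps 2 and 3
    variance = sum(squared_diffs) / (n - 1)                 # step 4
    return sqrt(variance)                                   # step 5


def sample_sd_shortcut(values):
    """Same quantity via the equivalent formula (sum of squares minus n * mean^2)."""
    n = len(values)
    xbar = sum(values) / n
    return sqrt((sum(x * x for x in values) - n * xbar ** 2) / (n - 1))


data = [62, 68, 74, 76]
print(round(sample_sd(data), 1))           # 6.3
print(round(sample_sd_shortcut(data), 1))  # 6.3
```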
77
Population Standard Deviation
Data sets usually represent a sample from a larger
population. If the data set includes measurements for an
entire population, the notations for the mean and
standard deviation are different, and the formula for the
standard deviation is also slightly different.
A population mean is represented by the symbol µ (“mu”),
and the population standard deviation is
$$ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}} $$
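The only difference between the two formulas is the divisor: n for a population, n – 1 for a sample. Python's standard `statistics` module provides both. In the sketch below, treating the four numbers from the earlier simple example as an entire population is an assumption made purely for illustration.

```python
from statistics import pstdev, stdev

data = [62, 68, 74, 76]

population_sd = pstdev(data)  # divides the sum of squared deviations by n
sample_sd = stdev(data)       # divides the sum of squared deviations by n - 1

print(round(population_sd, 3), round(sample_sd, 3))
```

The population version is always a little smaller for the same data, and the gap shrinks as n grows.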
78
Some Remarks on Estimation
• The population mean, µ, is most often an unknown
parameter.
• The sample mean, x̄, a statistic computed from the
sampled data, is our estimate of the unknown
population mean µ.
• The population standard deviation, σ, is most often
an unknown parameter.
• The sample standard deviation, s, a statistic
computed from the sampled data, is our estimate of
the unknown population standard deviation σ.
79
Standard Deviation and Bell Shaped Data
For bell-shaped (normal) data, approximately
• 68% of the values fall within 1 standard deviation of
the mean in either direction
• 95% of the values fall within 2 standard deviations of
the mean in either direction
• 99.7% of the values fall within 3 standard deviations
of the mean in either direction
The above approximation is sometimes
called the Empirical Rule.
80
Example – President Age Data
Descriptive Statistics
                     N    Minimum   Maximum   Mean    Std. Deviation
age                  43   42        69        54.81   6.235
Valid N (listwise)   43
• The (sample) standard deviation for the presidents data is 6.235 years.
• Remember that our data looked bell shaped.
• We would expect 68% of the presidents’ ages to be between 54.81 ±
6.235 years old (at inauguration) or 49 to 61 years old. The actual data
show about 72%. It is somewhat close.
• We would expect 95% of the presidents’ ages to be between 54.81 ±
2(6.235) years old or 42 to 67 years old. The actual data show about
95%.
• We would expect 99.7% of the presidents’ ages to be between 54.81 ±
3(6.235) years old or 36 to 74 years old. The actual data show about
100%.
• Interpretation of sample standard deviation: On average, the
inaugural ages of the US presidents have been roughly 6 years away
from their average age of 55.
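The three intervals can be reproduced from just the mean and standard deviation in the table (54.81 and 6.235). The sketch below is a small illustration of the Empirical Rule arithmetic. The function name `empirical_rule_intervals` is invented for this example, and the raw ages are not included, so it does not check the observed percentages.

```python
def empirical_rule_intervals(mean, sd):
    """Mean ± k standard deviations for k = 1, 2, 3, with expected coverage."""
    expected = {1: 68.0, 2: 95.0, 3: 99.7}  # percent, for bell-shaped data
    return {
        k: (mean - k * sd, mean + k * sd, pct)
        for k, pct in expected.items()
    }


intervals = empirical_rule_intervals(54.81, 6.235)
for k, (low, high, pct) in intervals.items():
    print(f"within {k} SD: {round(low)} to {round(high)} years (expect about {pct}%)")
```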
81
Important Remarks
• What does s = 0 mean?
No variability in your data! All values are the
same.
• Like the mean, the standard deviation is sensitive to
extreme observations.
• Use the mean and standard deviation for
reasonably symmetric bell shaped data.
• Five number summary: min, max, median, Q1, and
Q3, is better for skewed distributions or if
outliers are present.
82
Summary
Descriptive Tools
• Quantitative Variables: Histogram, Q-Q Plots, Time Plots, Boxplots, Scatter Plots
• Categorical Variables: Bar Charts, Pie Charts, Freq. Tables
83
Next Time
• Please read articles:
– Breiman’s article on two cultures of statistics (2001)
– Altman’s articles on
• Poor quality medical research (2002)
• Statistical reviewing for medical journals (1998)
• Some recent trends in statistics in medical journals
(2000)
– Goodman et al. on statistical reviewing policies of journals
(1998)
• Next Time:
– Crash Course in Probability
– Research Studies
84