Download Descriptive Statistics - University of Florida

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Categorical variable wikipedia , lookup

Transcript
Descriptive Statistics
• Tabular and Graphical Displays
– Frequency Distribution - List of intervals of
values for a variable, and the number of
occurrences per interval
– Relative Frequency - Proportion (often reported
as a percentage) of observations falling in the
interval
– Histogram/Bar Chart - Graphical representation
of a Relative Frequency distribution
– Stem and Leaf Plot - Horizontal tabular display
of data, based on 2 digits (stem/leaf)
Constructing Pie Charts
• Select a small number of categories (say 5 or 6 at
most) to avoid many narrow “slivers”
• If possible, arrange categories in ascending or
descending order for categorical variables
Monthly Philly Rainfall 1825-1869 (1/100 in)
Philly Monthy Rainfall 1825-1869 (1/100 inches)
Category
1
2
3
4
5
6
7
8
9
10
11
1
2
3
4
5
6
7
8
9
10
11
Range
<100
100-199
200-299
300-399
400-499
500-599
600-699
700-799
800-899
900-999
>1000
Count
17
78
132
115
86
55
27
17
6
3
4
Constructing Bar Charts
• Put frequencies on one axis (typically vertical, unless
many categories) and categories on other
• Draw rectangles over categories with height=frequency
• Leave spaces between categories
Constructing Histograms
• Used for numeric variables, so need Class Intervals
– Let Range = Largest - Smallest Measurement
– Break range into (say) 5-20 intervals depending on sample size
– Make the width of the subintervals a convenient unit, and make
“break points” so that no observations fall on them
– Obtain Class Frequencies, the number in each subinterval
– Obtain Relative Frequencies, proportion in each subinterval
• Construct Histogram
– Draw bars over each subinterval with height representing class
frequency or relative frequency (shape will be the same)
– Leave no space between bars to imply adjacency of class
intervals
Histogram
140
100
80
60
40
20
rain100
e
M
or
00
11
0
90
0
70
0
50
0
30
0
0
10
Frequency
120
100
200
300
400
500
600
700
800
900
1000
1100
1200
More
Interpreting Histograms
• Probability: Heights of bars over the class intervals
are proportional to the “chances” an individual
chosen at random would fall in the interval
• Unimodal: A histogram with a single major peak
• Bimodal: Histogram with two distinct peaks (often
evidence of two distinct groups of units)
• Uniform: Interval heights are approximately equal
• Symmetric: Right and Left portions are same shape
• Right-Skewed: Right-hand side extends further
• Left-Skewed: Left-hand side extends further
Stem-and-Leaf Plots
• Simple, crude approach to obtaining shape of
distribution without losing individual measurements to
class intervals. Procedure:
– Split each measurement into 2 sets of digits (stem and leaf)
– List stems from smallest to largest
– Line corresponding leaves aside stems from smallest to
largest
– If too cramped/narrow, break stems into two groups: low
with leaves 0-4 and high with leaves 5-9
– When numbers have many digits, trim off right-most (less
significant) digits. Leaves should always be a single digit.
Comparing Groups
•
•
•
•
Side-by-side bar charts
3 dimensional histograms
Back-to-back stem and leaf plots
Goal: Compare 2 (or more) groups wrt
variable(s) being measured
• Do measurements tend to differ among
groups?
Summarizing Data of More than One Variable
• Contingency Table: Cross-tabulation of units based on
measurements of two qualitative variables simultaneously
• Stacked Bar Graph: Bar chart with one variable
represented on the horizontal axis, second variable as
subcategories within bars
• Cluster Bar Graph: Bar chart with one variable forming
“major groupings” on horizontal axis, second variable
used to make side-by-side comparisons within major
groupings (displays all combinations in factorial expt)
• Scatterplot: Plot with quantitaive variables y and x
plotted against each other for each unit
• Side-by-Side Boxplot: Compares distributions by groups
Example - Ginkgo and Acetazolamide for Acute
Mountain Syndrome Among Himalayan Trekkers
Contingency
Table (Counts)
Percent
Outcome by
Treatment
Placebo
Acet
Ginkgo
Acc+Gi
Total
Placebo
Acet
Ginkgo
Acc+Gi
AMS
40
14
43
18
115
No AMS
79
104
81
108
372
Total
119
118
124
126
487
AMS
33.61
11.86
34.68
14.29
No AMS
66.39
88.14
65.32
85.71
Total
100
100
100
100
Stacked Bar Graph of AMS Incidence (Percent)
100%
90%
80%
70%
60%
No AMS
50%
AMS
40%
30%
20%
10%
0%
Placebo
Acet
Ginkgo
Treatment
Acc+Gi
Cluster Bar Graph of AMS Incidence (Counts)
120
100
Frequency
80
AMS
60
No AMS
40
20
0
Placebo
Acet
Ginkgo
Treatment
Acc+Gi
3-D Barchart of Incidence of AMS
100.00
90.00
80.00
70.00
60.00
Percent within Treatment
50.00
40.00
30.00
20.00
10.00
No AMS
0.00
Placebo
AMS
Acet
Ginkgo
Treatment
Acc+Gi
Outcome
Sample & Population Distributions
• Distributions of Samples and Populations- As
samples get larger, the sample distribution gets
smoother and looks more like the population
distribution
– U-shaped - Measurements tend to be large or small,
fewer in middle range of values
– Bell-shaped - Measurements tend to cluster around
the middle with few extremes (symmetric)
– Skewed Right - Few extreme large values
– Skewed Left - Few extreme small values
Measures of Central Tendency
• Mean - Sum of all measurements divided by
the number of observations (even
distribution of outcomes among cases). Can
be highly influenced by extreme values.
• Notation: Sample Measurements labeled
Y1,...,Yn
Y1    Yn  Yi
Y

n
n
Median, Percentiles, Mode
• Median - Middle measurement after data have
been ordered from smallest to largest.
Appropriate for interval and ordinal scales
• Pth percentile - Value where P% of
measurements fall below and (100-P)% lie
above. Lower quartile(25th), Median(50th),
Upper quartile(75th) often reported
• Mode - Most frequently occurring outcome.
Typically reported for ordinal and nominal data.
Measures of Variation
• Measures of how similar or different
individual’s measurements are
– Range -- Largest-Smallest observation
– Deviation -- Difference between ith individual’s
outcome and the sample mean: Yi  Y
– Variance of n observations Y1,...,Yn is the “average”
squared deviation:
s2 
2
(
Y

Y
)
 i
n 1
(Y1  Y ) 2  (Y2  Y ) 2    (Yn  Y ) 2

n 1
Measures of Variation
• Standard Deviation - Positive square root of
the variance (measure in original units):
s   s2 
2
(
Y

Y
)
 i
n 1
• Properties of the standard deviation:
• s  0, and only equals 0 if all observations are equal
• s increases with the amount of variation around the mean
• Division by n-1 (not n) is due to technical reasons (later)
• s depends on the units of the data (e.g. $1000s vs $)
Empirical Rule
• If the histogram of the data is approximately
bell-shaped, then:
– Approximately 68% of measurements lie within
1 standard deviation of the mean.
– Approximately 95% of measurements lie within
2 standard deviations of the mean.
– Virtually all of the measurements lie within 3
standard deviations of the mean.
Other Measures and Plots
• Interquartile Range (IQR)-- 75th%ile - 25th%ile
(measures the spread in the middle 50% of data)
• Box Plots - Display a box containing middle
50% of measurements with line at median and
lines extending from box. Breaks data into four
quartiles
• Outliers - Observations falling more than
1.5IQR above (below) upper (lower) quartile
Dependent and Independent Variables
• Dependent variables are outcomes of interest to
investigators. Also referred to as Responses or
Endpoints
• Independent variables are Factors that are often
hypothesized to effect the outcomes (levels of dependent
variables). Also referred to as Predictor or Explanatory
Variables
• Research ??? Does I.V.  D.V.
Example - Clinical Trials of Cialis
• Clinical trials conducted worldwide to study efficacy
and safety of Cialis (Tadalafil) for ED
• Patients randomized to Placebo, 10mg, and 20mg
• Co-Primary outcomes:
– Change from baseline in erectile dysfunction domain if the
International Index of Erectile Dysfunction (Numeric)
– Response to: “Were you able to insert your P… into your
partner’s V…?” (Nominal: Yes/No)
– Response to: “Did your erection last long enough for you to
have succesful intercourse?” (Nominal: Yes/No)
Source: Carson, et al. (2004).
Example - Clinical Trials of Cialis
• Population: All adult males suffering from erectile
dysfunction
• Sample: 2102 men with mild-to-severe ED in 11
randomized clinical trials
• Dependent Variable(s): Co-primary outcomes
listed on previous slide
• Independent Variable: Cialis Dose: (0, 10, 20 mg)
• Research Questions: Does use of Cialis improve
erectile function?
Contingency Tables
• Tables representing all combinations of
levels of explanatory and response variables
• Numbers in table represent Counts of the
number of cases in each cell
• Row and column totals are called Marginal
counts
2x2 Tables - Notation
Group 1
Outcome
Present
X1
Outcome
Absent
n1-X1
Group
Total
n1
Group 2
X2
n2-X2
n2
Outcome
Total
X1+X2
(n1+n2)(X1+X2)
n1+n2
Example - Firm Type/Product Quality
Not
Integrated
Vertically
Integrated
Outcome
Total
High
Quality
Low
Quality
Group
Total
33
55
88
5
79
84
38
134
172
• Groups: Not Integrated (Weave only) vs Vertically integrated
(Spin and Weave) Cotton Textile Producers
• Outcomes: High Quality (High Count) vs Low Quality (Count)
Source: Temin (1988)
Scatterplots
• Identify the explanatory and response variables of
interest, and label them as x and y
• Obtain a set of individuals and observe the pairs
(xi , yi) for each pair. There will be n pairs.
• Statistical convention has the response variable (y)
placed on the vertical (up/down) axis and the
explanatory variable (x) placed on the horizontal
(left/right) axis. (Note: economists reverse axes in
price/quantity demand plots)
• Plot the n pairs of points (x,y) on the graph
France August,2003 Heat Wave Deaths
•
•
•
•
Individuals: 13 cities in France
Response: Excess Deaths(%) Aug1/19,2003 vs 1999-2002
Explanatory Variable: Change in Mean Temp in period (C)
Data: City
Dth03
Dth9902 %chng (y)
Degchg(x)
Little
Marseilles
Grenoble
Rennes
Toulouse
Bordeaux
Strasbourg
Nice
Poitiers
Lyon
Le Mans
Dijon
Paris
200
571
148
156
315
318
253
341
184
447
204
168
1854
192.3
456.8
115.6
114.7
231.6
222.4
167.5
222.9
102.8
248.3
112.1
87
766.1
4
25
28
36
36
43
51
53
79
80
82
93
142
4
4.3
6.3
5.6
6.6
6.2
5.9
4.3
7.3
6.8
7
7.4
6.7
France August,2003 Heat Wave Deaths
2003 France Heat Wave Mortality
160
140
Excess Mortality (%)
120
100
80
60
40
20
0
3
3.5
4
4.5
5
5.5
6
Change in Mean Temp (Celsius)
6.5
7
7.5
8
Sample Statistics/Population
Parameters
• Sample Mean and Standard Deviations are
most commonly reported summaries of
sample data. They are random variables
since they will change from one sample to
another.
• Population Mean (m) and Standard
Deviation (s) computed from a population
of measurements are fixed (unknown in
practice) values called parameters.
Example 1.3 - Grapefruit Juice
Study
crcl
38
66
74
99
80
64
80
120
To import an EXCEL file, click on:
FILE  OPEN  DATA then change
FILES OF TYPE to EXCEL (.xls)
To import a TEXT or DATA file, click on:
FILE  OPEN  DATA then change
FILES OF TYPE to TEXT (.txt) or
DATA (.dat)
You will be prompted through a series of
dialog boxes to import dataset
Descriptive Statistics-Numeric Data
• After Importing your dataset, and providing names
to variables, click on:
• ANALYZE  DESCRIPTIVE STATISTICS
DESCRIPTIVES
• Choose any variables to be analyzed and place them in
n
box on right
yi

n
i

1
• Options include: Mean : y 
Sum :  yi
n
i 1
 y  y 
n
Std. deviation : S 
S.E. Mean :
S
n
i 1
2
i
n 1
Variance : S 2
Example 1.3 - Grapefruit Juice
Study
e
t
d
N
e
u
i
m
a
m
i
a
t
t
t
t
t
t
t
i
E
i
i
i
i
i
i
C
8
8
0
1
3
3
1
1
V
8
Descriptive Statistics-General Data
• After Importing your dataset, and providing names to variables,
click on:
• ANALYZE  DESCRIPTIVE STATISTICS FREQUENCIES
• Choose any variables to be analyzed and place them in box on
right
• Options include (For Categorical Variables):
– Frequency Tables
– Pie Charts, Bar Charts
• Options include (For Numeric Variables)
– Frequency Tables (Useful for discrete data)
– Measures of Central Tendency, Dispersion, Percentiles
– Pie Charts, Histograms
Example 1.4 - Smoking Status
S
u
P
r
u
c
c
V
N
0
9
9
9
Q
3
3
3
2
Q
9
6
6
8
C
2
4
4
2
O
3
8
8
0
T
7
0
0
Vertical Bar Charts and Pie
Charts
• After Importing your dataset, and providing names to
variables, click on:
• GRAPHS  BAR…  SIMPLE (Summaries for Groups
of Cases)  DEFINE
• Bars Represent N of Cases (or % of Cases)
• Put the variable of interest as the CATEGORY AXIS
• GRAPHS  PIE… (Summaries for Groups of Cases) 
DEFINE
• Slices Represent N of Cases (or % of Cases)
• Put the variable of interest as the DEFINE SLICES BY
Example 1.5 - Antibiotic Study
80
60
40
Count
20
5
4
0
1
OUTCOME
2
3
4
5
3
1
2
Histograms
• After Importing your dataset, and providing
names to variables, click on:
• GRAPHS  HISTOGRAM
• Select Variable to be plotted
• Click on DISPLAY NORMAL CURVE if you
want a normal curve superimposed (see Chapter 4).
Example 1.6 - Drug Approval
Times
30
20
10
Std. Dev = 20.97
Mean = 32.1
N = 175.00
0
0
0.
12
0
0.
11
0
0.
10
.0
90
.0
80
.0
70
.0
60
.0
50
.0
40
.0
30
.0
20
.0
10
0
0.
MONTHS
Side-by-Side Bar Charts
• After Importing your dataset, and providing
names to variables, click on:
• GRAPHS  BAR…  Clustered (Summaries for
Groups of Cases)  DEFINE
• Bars Represent N of Cases (or % of Cases)
• CATEGORY AXIS: Variable that represents
groups to be compared (independent variable)
• DEFINE CLUSTERS BY: Variable that represents
outcomes of interest (dependent variable)
Example 1.7 - Streptomycin
Study
30
20
OUTCOME
1
2
10
3
Count
4
5
0
6
1
TRT
2