Graphing and Summarizing Data
Osborn
First Thing: Look at your Data
Some Handy Graphics
First Thing: Look at your Data
• Scatter plots: plot any two variables against each other
[Figure: scatter plot of Na against RI]
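A minimal sketch of such a scatter plot in R. It assumes the mlbench package is installed and that the plotted variables are the RI and Na columns of its Glass data set; the axis labels are chosen here for illustration.

```r
library(mlbench)   # Package containing the Glass data (assumed installed)
data(Glass)        # Load the Glass data set

# Refractive index on the horizontal axis, sodium on the vertical axis
plot(Glass$RI, Glass$Na, xlab = "RI", ylab = "Na")
```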
First Thing: Look at your Data
• Pairs plots: do many scatter plots at once
[Figure: pairs plot of several variables; visible panel labels include Si, K, and Ca]
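A pairs plot can be sketched with base R's pairs() function. The column selection below (Na, Si, K, Ca from the mlbench Glass data) is an assumption made for illustration, not necessarily the set of variables shown on the slide.

```r
library(mlbench)   # Package containing the Glass data (assumed installed)
data(Glass)

# Assumed selection of columns; any numeric columns would do
elements <- Glass[, c("Na", "Si", "K", "Ca")]

# One scatter plot for every pair of the selected columns
pairs(elements)
```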
First Thing: Look at your Data
• Gasoline data:
[Figure: scatter plot of C4.alkylbenzene.unid.2 against o.Xylene, with points grouped by lbl.gas (1 to 20)]
First Thing: Look at your Data
• Histograms: “bin” a variable and plot frequencies
Each bar is a “bin” that contains a number of data points: counts
First Thing: Look at your Data
• Histograms: counts in each bin:
In R:
library(mlbench)   # Load a library containing some data
data(Glass)        # Load Glass data set
Glass              # Take a look at Glass
head(Glass)        # Just look at the top of Glass
RI <- Glass[,1]    # Pull out the RIs. They are in column 1
hist(RI)           # Make a histogram for the RIs
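To see the counts behind the bars, the histogram can be computed without drawing it. This is a small sketch using base R's hist() with plot = FALSE; the variable name h is chosen here for illustration.

```r
library(mlbench)   # Package containing the Glass data (assumed installed)
data(Glass)
RI <- Glass[, 1]   # Refractive indices, column 1

h <- hist(RI, plot = FALSE)   # Compute the bins without drawing the plot
h$breaks                      # Bin edges chosen by R
h$counts                      # Number of data points in each bin

# Every data point lands in exactly one bin
sum(h$counts) == length(RI)
```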
First Thing: Look at your Data
• Box and Whiskers plots:
[Figure: annotated box-and-whiskers plot of RI, marking the range, the possible outliers at either end, the 25th percentile (1st quartile), the median (50th percentile), and the 75th percentile (3rd quartile)]
Visualizing Data
• Note the relationship:
First Thing: Look at your Data
• Box-and-whiskers:
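In R:
# Box and whiskers plots
boxplot(RI)                                 # Default: points beyond the whiskers are drawn individually
boxplot(RI, horizontal = TRUE, range = 0)   # Horizontal; range = 0 extends the whiskers to the data extremes
Result: [Figure: box-and-whiskers plots of RI]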
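The numbers a box-and-whiskers plot is drawn from can be inspected directly. The sketch below uses base R's fivenum() and boxplot.stats(); the variable name b is chosen here for illustration.

```r
library(mlbench)   # Package containing the Glass data (assumed installed)
data(Glass)
RI <- Glass[, 1]

fivenum(RI)              # Minimum, lower hinge, median, upper hinge, maximum

b <- boxplot.stats(RI)   # Default coef = 1.5, the same rule boxplot() uses by default
b$stats                  # Lower whisker end, lower hinge, median, upper hinge, upper whisker end
b$out                    # Points beyond the whiskers: the possible outliers
```

With range = 0 in boxplot(), the whiskers run to the smallest and largest values, so no points are flagged as possible outliers.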
Measures of Central Tendency
• Given a sample from some population:
• What is a good “summary” value which well
describes the sample?
• We will look at:
• Average (arithmetic mean)
• Median
• Mode
For reference see (available on-line):
“The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures”
LA Mohammed, B Found, M Caligiuri and D Rogers
J Forensic Sci 56(1),S136-S141 (2011)
Histogram Points of Interest
• Velocity for the first segment of genuine signatures in the (soon to be classic) Mohammed et al. study.
• What is a good summary number? “Central Tendency”
• How spread out is the data?
[Figure: histogram of AvgAbsVelocity (horizontal axis, 0 to 30) against Percent of Total (vertical axis, 0 to 25)]
Measures of Central Tendency
• Arithmetic sample mean (average):
• The sum of data divided by number of observations:
Intuitive formula:
avg = (meas. 1 + meas. 2 + meas. 3 + ...) / (number of measurements)
Fancy formula:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Measures of Central Tendency
• Example from L.A.M. study:
• Compute the average absolute size of segment 1 for
the genuine signature of subject 2:
Subj. 2; Gen; Seg. 1
Measurement   Absolute Size (cm)
1             0.0548
2             0.2951
3             0.1026
4             0.1005
5             0.2491
6             0.1287
7             0.0496
8             0.2299
9             0.2560
10            0.0538
$\overline{\text{Abs.Size}} = \frac{1}{10} \sum_{i=1}^{10} \text{Abs.Size}_i = 0.152$
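The same average can be checked in R. This sketch types in the ten absolute sizes as listed above and applies the formula directly; the variable names are chosen here for illustration.

```r
# Absolute sizes (cm) for subject 2, genuine signature, segment 1
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.2560, 0.0538)

n <- length(abs_size)              # Number of measurements
avg_by_hand <- sum(abs_size) / n   # (1/n) times the sum of the x_i
avg_by_hand                        # About 0.152

mean(abs_size)                     # The built-in function gives the same value
```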
Measures of Central Tendency
• Example:
• More useful: Consider again Absolute Average Velocity for
Genuine Signatures across all writers in the LAM study:
92 subjects × 10 measurements/subject = 920 velocity measurements
Average Absolute Average Velocity:
$\frac{1}{920} \sum_{i=1}^{920} \text{Abs.Avg.Veloc}_i = 8.39$
[Figure: histogram of AvgAbsVelocity (horizontal axis, 0 to 30) against Percent of Total (vertical axis, 0 to 25)]
Measures of Central Tendency
• Sample median:
• Ordering the n pieces of data from smallest value to
largest value, the median is the “middle value”:
n 1
• If n is odd, median is
largest data point.
2
th
th
n
n
• If n is even, median is average of
and  1th largest data
2
2
points.
Measures of Central Tendency
• Example from L.A.M. study:
• Compute the median absolute size of segment 1 for
the genuine signature of subject 2:
Subj. 2; Gen; Seg. 1
Measurement   Absolute Size (cm)   Ordered (cm)
1             0.0548               0.0496
2             0.2951               0.0538
3             0.1026               0.0548
4             0.1005               0.1005
5             0.2491               0.1026
6             0.1287               0.1287
7             0.0496               0.2299
8             0.2299               0.2491
9             0.2560               0.2560
10            0.0538               0.2951
n = 10, so n/2 = 5 and n/2 + 1 = 6
median = (0.1026 + 0.1287)/2 ≈ 0.1157
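The odd/even rule for the median can be sketched in R as follows, using the ten absolute sizes as printed above. The variable names are chosen here for illustration.

```r
# Absolute sizes (cm) for subject 2, genuine signature, segment 1
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.2560, 0.0538)

x <- sort(abs_size)   # Order from smallest to largest
n <- length(x)

if (n %% 2 == 1) {
  med_by_hand <- x[(n + 1) / 2]                  # n odd: the single middle value
} else {
  med_by_hand <- (x[n / 2] + x[n / 2 + 1]) / 2   # n even: average the two middle values
}
med_by_hand        # About 0.1157

median(abs_size)   # The built-in function gives the same value
```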
Measures of Central Tendency
• Example:
• Median of Average Absolute Velocity for Genuine Signatures, LAM:
Abs.Avg.Veloc #460 = 7.1972
Abs.Avg.Veloc #461 = 7.2008
median = (7.1972 + 7.2008)/2 = 7.1990
[Figure: histogram of AvgAbsVelocity against Percent of Total, with the average (Avg) marked]
Measures of Central Tendency
• Sample mode:
• Needs careful definition but basically:
• The data value that occurs the most
• Tabulate the data and see which value(s) occur the
most:
Sample: [Figure: tabulated sample with the mode marked]
Measures of Central Tendency
• Sample mode:
• Computing modes can get tricky if there is more than one (multimodal)
Sample: [Figure: sample with several modes marked]
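Tabulating a sample and picking out the most frequent value(s) can be sketched in R with table(). The sample below is hypothetical, made up for illustration, and is not taken from the slides.

```r
# Hypothetical sample (not from the slides)
x <- c(2, 3, 3, 5, 5, 7)

counts <- table(x)   # How many times each distinct value occurs
counts

# Keep every value whose count equals the largest count
modes <- as.numeric(names(counts)[counts == max(counts)])
modes                # 3 and 5 each occur twice, so this sample has two modes
```

Returning every value tied for the highest count, rather than only the first, is what lets the same code handle multimodal samples.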
Measures of Central Tendency
• Sample mode:
• What’s the mode here?
Sample:
Measures of Central Tendency
• Sample mode:
• Mode of Average Absolute Velocity for Genuine
Signatures, LAM:
mode = 9.2541
[Figure: histogram of AvgAbsVelocity against Percent of Total, with the median (Med) and average (Avg) marked]
Measures of Central Tendency
• Some trivia:
• Nice and symmetric: Mean = Median = Mode
[Figure: distributions with the mean and the mode(s) marked]
Measures of Data Spread
• Sample variance:
• (Almost) the average of squared deviations from the
sample mean.
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
where there are n data points, $x_i$ is data point i, and $\bar{x}$ is the sample mean.
• Standard deviation is $s = \sqrt{s^2}$
• The sample average and standard dev. are the most
common measures of central tendency and spread
• Sample average and standard dev have the same units
Measures of Data Spread
• Standard deviation is “instructive” to do by hand a few
times:
Compute the standard deviation of the following blood
alcohol volumes assayed in 10 samples of 10 mL of blood
drawn from a drunk driving suspect:
7.97 nL, 7.80 nL, 7.79 nL, 8.12 nL, 8.12 nL, 8.22 nL,
8.03 nL, 7.97 nL, 7.88 nL, 8.08 nL
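After working it by hand, the result can be checked step by step in R. This sketch follows the sample variance formula with the n - 1 divisor; the variable names are chosen here for illustration.

```r
# Blood alcohol volumes (nL) from the ten samples listed above
alcohol_nl <- c(7.97, 7.80, 7.79, 8.12, 8.12, 8.22, 8.03, 7.97, 7.88, 8.08)

n      <- length(alcohol_nl)
x_bar  <- sum(alcohol_nl) / n       # Sample mean
sq_dev <- (alcohol_nl - x_bar)^2    # Squared deviations from the mean
s2     <- sum(sq_dev) / (n - 1)     # Sample variance (note the n - 1)
s      <- sqrt(s2)                  # Standard deviation, same units as the data (nL)

x_bar
s

sd(alcohol_nl)                      # The built-in function gives the same value as s
```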
Measures of Data Spread
• Sample range:
• The difference between the largest and smallest
value in the sample
• Very sensitive to outliers (extreme observations)
• Percentiles:
• The pth percentile data value, x, means that p percent of the data are smaller than or equal to x.
• Median = 50th percentile
Measures of Data Spread
• What is the sample range of deoxypyridinoline conc?
0.62 0.64 1.14 1.04 1.07 1.83 1.32 1.19 1.28 0.85 1.36
1.16 1.00 1.69 1.62 1.25 1.49 1.45 1.14 2.40 3.05 2.81
# Dr. James Curran's dafs (http://www.stat.auckland.ac.nz/~curran)
library(dafs)
data(dpd.df)                # Deoxypyridinoline data
range(dpd.df[,5])           # Look at column 5 for Deoxypyridinoline concentration and get its range
diff(range(dpd.df[,5]))     # Range as defined in the notes
# Box and whiskers plot:
boxplot(dpd.df[,5], horizontal = T, range = 0, xlab = "Deoxypyridinoline conc.")
sd(dpd.df[,5])              # standard dev of Deoxypyridinoline conc.
summary(dpd.df[,5])         # Common summary statistics
# Some percentiles
quantile(dpd.df[,5], probs = c(0.25)) # 25th percentile
quantile(dpd.df[,5], probs = c(0.50)) # 50th percentile
quantile(dpd.df[,5], probs = c(0.75)) # 75th percentile
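As a check that does not need the dafs package, the same summaries can be computed from the 22 concentrations typed in as printed above. Whether these values match dpd.df[,5] exactly is an assumption, and the vector name dpd is chosen here for illustration.

```r
# Deoxypyridinoline concentrations as printed on the slide
dpd <- c(0.62, 0.64, 1.14, 1.04, 1.07, 1.83, 1.32, 1.19, 1.28, 0.85, 1.36,
         1.16, 1.00, 1.69, 1.62, 1.25, 1.49, 1.45, 1.14, 2.40, 3.05, 2.81)

range(dpd)         # Smallest and largest values
diff(range(dpd))   # Sample range as defined in the notes: largest minus smallest

# 25th, 50th, and 75th percentiles in one call
# (R's default method interpolates between neighbouring data points)
quantile(dpd, probs = c(0.25, 0.50, 0.75))

# Fraction of the data at or below the median (the 50th percentile)
mean(dpd <= median(dpd))
```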
Measures of Data Spread
[Figure: RI values with the 1st and 99th percentiles marked (axis values shown: 1.52003, 1.52008). The first 1% of the data lies at or below the 1st percentile; the first 99% lies at or below the 99th percentile.]
Measures of Data Spread
• Box-and-whisker plot again for reference
• Deoxypyridinoline conc?
0.62 0.64 1.14 1.04 1.07 1.83 1.32 1.19 1.28 0.85 1.36
1.16 1.00 1.69 1.62 1.25 1.49 1.45 1.14 2.40 3.05 2.81
[Figure: box-and-whiskers plot annotated with the range, the 25th percentile (1st quartile), the median (50th percentile), and the 75th percentile (3rd quartile)]