Graphing and Summarizing Data
Osborn
First Thing: Look at your Data
Some Handy Graphics
First Thing: Look at your Data
• Scatter plots: plot any two variables against each other
[Figure: scatter plot of Na (y axis) against RI (x axis)]
First Thing: Look at your Data
• Pairs plots: do many scatter plots at once
[Figure: pairs plot; panels include Si, K, and Ca]
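A minimal sketch of both plot types in base R. It uses R's built-in iris data purely as a stand-in, an assumption made so the example runs without extra packages; the figures on the slides were drawn from other data.

```r
# Stand-in data: the four numeric columns of R's built-in iris data set
# (an assumption for illustration; not the data plotted on the slides).
measurements <- iris[, 1:4]

# Scatter plot: one variable against another
plot(measurements$Sepal.Length, measurements$Petal.Length,
     xlab = "Sepal.Length", ylab = "Petal.Length")

# Pairs plot: every pairwise scatter plot drawn in a single grid
pairs(measurements)
```

Any data frame of numeric columns can be passed to pairs() in the same way.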
First Thing: Look at your Data
• Gasoline data:
[Figure: scatter plot of C4.alkylbenzene.unid.2 (y axis) against o.Xylene (x axis), with points labeled by lbl.gas (1 to 20)]
First Thing: Look at your Data
• Histograms: “bin” a variable and plot frequencies
Each bar is a “bin” that contains a number of data points: counts
First Thing: Look at your Data
• Histograms: counts in each bin:
In R:
library(mlbench)    # Load a library containing some data
data(Glass)         # Load Glass data set
Glass               # Take a look at Glass
head(Glass)         # Just look at the top of Glass
RI <- Glass[,1]     # Pull out the RIs. They are in column 1
hist(RI)            # Make a histogram for the RIs
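The counts behind a histogram can also be inspected without drawing anything. The sketch below uses a small hypothetical vector and hand-picked bin edges (both are assumptions for illustration, not values from the slides).

```r
# Hypothetical data and bin edges (not from the slides)
x <- c(1.2, 1.7, 2.1, 2.4, 2.5, 3.3, 3.9, 4.8)
bin_edges <- 1:5

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(x, breaks = bin_edges, plot = FALSE)
h$breaks   # the bin edges
h$counts   # number of data points falling in each bin

# The same tabulation done explicitly: assign each point to a bin, then count
bin_table <- table(cut(x, breaks = bin_edges))
bin_table
```

Each entry of h$counts is the height of one bar when the histogram is drawn on the count scale.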
First Thing: Look at your Data
• Box and Whiskers plots:
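[Figure: annotated box-and-whiskers plot of RI (axis from 1.5188 to 1.5192). The box runs from the 25th percentile (1st quartile) to the 75th percentile (3rd quartile), the line inside the box is the median (50th percentile), the whiskers show the range, and points beyond the whiskers are possible outliers.]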
Visualizing Data
• Note the relationship:
First Thing: Look at your Data
• Box-and-whiskers:
In R:
# Box and whiskers plots
boxplot(RI)
boxplot(RI, horizontal = TRUE, range = 0)
Result:
[Figure: the resulting box-and-whiskers plots]
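The numbers a box-and-whiskers plot is drawn from can be retrieved with boxplot.stats(). The sketch below uses a hypothetical sample with one extreme value (an assumption for illustration) and compares the default whisker rule with the coef = 0 setting, which is what range = 0 in boxplot() passes along.

```r
# Hypothetical sample with one extreme value (not from the slides)
x <- c(2, 3, 4, 5, 6, 7, 8, 30)

# Default rule (coef = 1.5): points more than 1.5 box lengths beyond the box are flagged
default_box <- boxplot.stats(x)
default_box$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
default_box$out     # points flagged as possible outliers

# coef = 0: whiskers extend to the data extremes and nothing is flagged
full_range_box <- boxplot.stats(x, coef = 0)
full_range_box$stats
full_range_box$out
```

The hinges reported here are R's version of the 1st and 3rd quartiles; they can differ slightly from quantile() output for small samples.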
Measures of Central Tendency
• Given a sample from some population:
• What is a good “summary” value which well
describes the sample?
• We will look at:
• Average (arithmetic mean)
• Median
• Mode
For reference see (available on-line):
“The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures”
LA Mohammed, B Found, M Caligiuri and D Rogers
J Forensic Sci 56(1),S136-S141 (2011)
Histogram Points of Interest
• Velocity for the first segment of genuine signatures in (soon to be classic) Mohammed et al. study.
• What is a good summary number?
• “Central Tendency”
• How spread out is the data?
[Figure: histogram of AvgAbsVelocity (x axis, 0 to 30) against Percent of Total (y axis, 0 to 25)]
Measures of Central Tendency
• Arithmetic sample mean (average):
• The sum of data divided by number of observations:
Intuitive formula:
$$\text{avg} = \frac{\text{meas. 1} + \text{meas. 2} + \text{meas. 3} + \cdots}{\text{number of measurements}}$$
Fancy formula:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Measures of Central Tendency
• Example from L.A.M. study:
• Compute the average absolute size of segment 1 for
the genuine signature of subject 2:
Subj. 2; Gen; Seg. 1
Measurement   Absolute Size (cm)
1             0.0548
2             0.2951
3             0.1026
4             0.1005
5             0.2491
6             0.1287
7             0.0496
8             0.2299
9             0.2560
10            0.0538
$$\overline{\text{Abs.Size}} = \frac{1}{10}\sum_{i=1}^{10} \text{Abs.Size}_i = 0.152$$
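As a quick check, the same average can be computed in R from the ten values in the table, once with the formula and once with the built-in mean(). The variable names are illustrative only.

```r
# Absolute size (cm), subject 2, genuine signature, segment 1 (from the table)
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.2560, 0.0538)

n <- length(abs_size)              # number of measurements
avg_by_hand <- sum(abs_size) / n   # sum of the data divided by n
avg_builtin <- mean(abs_size)      # R's built-in arithmetic mean

round(avg_by_hand, 3)              # 0.152
```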
Measures of Central Tendency
• Example:
• More useful: Consider again Absolute Average Velocity for
Genuine Signatures across all writers in the LAM study:
92 subjects × 10 measurements/subject = 920 velocity measurements
Average Absolute Average Velocity:
$$\frac{1}{920}\sum_{i=1}^{920} \text{Abs.Avg.Veloc}_i = 8.39$$
[Figure: histogram of AvgAbsVelocity (x axis, 0 to 30) against Percent of Total (y axis, 0 to 25)]
Measures of Central Tendency
• Sample median:
• Ordering the n pieces of data from smallest value to
largest value, the median is the “middle value”:
n 1
• If n is odd, median is
largest data point.
2
th
th
n
n
• If n is even, median is average of
and  1th largest data
2
2
points.
Measures of Central Tendency
• Example from L.A.M. study:
• Compute the median absolute size of segment 1 for
the genuine signature of subject 2:
Subj. 2; Gen; Seg. 1
Measurement   Absolute Size (cm)
1             0.0548
2             0.2951
3             0.1026
4             0.1005
5             0.2491
6             0.1287
7             0.0496
8             0.2299
9             0.2560
10            0.0538
Ordered:
Position      Absolute Size (cm)
1             0.0496
2             0.0538
3             0.0548
4             0.1005
5             0.1026
6             0.1287
7             0.2299
8             0.2491
9             0.2560
10            0.2951
n = 10, so n/2 = 5 and n/2 + 1 = 6
median = (0.1026 + 0.1287)/2 = 0.11565
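The odd/even rule can be written out as a short R function and checked against the built-in median(). The function name median_by_hand is hypothetical, introduced here only for illustration.

```r
# Median by the "middle value" rule (hypothetical helper, for illustration)
median_by_hand <- function(x) {
  x_sorted <- sort(x)              # order from smallest to largest
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]          # n odd: the single middle value
  } else {
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2   # n even: average the two middle values
  }
}

# Absolute size (cm), subject 2, genuine signature, segment 1 (from the table)
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.2560, 0.0538)

med_by_hand <- median_by_hand(abs_size)   # averages the 5th and 6th ordered values
med_builtin <- median(abs_size)           # R's built-in median
```

Here n = 10 is even, so the function averages 0.1026 and 0.1287.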
Measures of Central Tendency
• Example:
• Median of Average Absolute Velocity for Genuine Signatures, LAM:
Abs.Avg.Veloc # 460 = 7.1972
Abs.Avg.Veloc # 461 = 7.2008
median = (7.1972+7.2008)/2 = 7.1990
[Figure: histogram of AvgAbsVelocity (x axis, 0 to 30) against Percent of Total (y axis, 0 to 25), with the average (Avg) marked]
Measures of Central Tendency
• Sample mode:
• Needs careful definition but basically:
• The data value that occurs the most
• Tabulate the data and see which value(s) occur the
most:
[Figure: example sample with its mode marked]
Measures of Central Tendency
• Sample mode:
• Computing modes can get tricky if there is more than one (multimodal)
[Figure: example sample with several modes marked]
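Tabulating the data and picking out the most frequent value(s) can be sketched in R as follows. The helper name sample_modes and both samples are hypothetical, chosen only to show one mode versus two.

```r
# Return every value that attains the highest count (hypothetical helper)
sample_modes <- function(x) {
  counts <- table(x)                                 # tabulate the data
  as.numeric(names(counts)[counts == max(counts)])   # keep the most frequent value(s)
}

unimodal <- c(1, 2, 2, 3, 2, 4)   # hypothetical sample: 2 occurs most often
bimodal  <- c(1, 1, 2, 3, 3, 4)   # hypothetical sample: 1 and 3 tie

sample_modes(unimodal)   # 2
sample_modes(bimodal)    # 1 3
```

When every value occurs exactly once, this helper returns all of them, which is one reason the mode needs a careful definition for continuous measurements.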
Measures of Central Tendency
• Sample mode:
• What’s the mode here?
[Figure: example sample]
Measures of Central Tendency
• Sample mode:
• Mode of Average Absolute Velocity for Genuine
Signatures, LAM:
mode = 9.2541
[Figure: histogram of AvgAbsVelocity (x axis, 0 to 30) against Percent of Total (y axis, 0 to 25), with the median (Med) and average (Avg) marked]
Measures of Central Tendency
• Some trivia:
Nice and symmetric: Mean = Median = Mode
[Figure: example distribution curves with the modes and the mean marked]
Measures of Data Spread
• Sample variance:
• (Almost) the average of squared deviations from the
sample mean.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
where there are n data points, $x_i$ is data point i, and $\bar{x}$ is the sample mean.
• Standard deviation is $s = \sqrt{s^2}$
• The sample average and standard dev. are the most
common measures of central tendency and spread
• Sample average and standard dev have the same units
Measures of Data Spread
• Standard deviation is “instructive” to do by hand a few
times:
Compute the standard deviation of the following blood
alcohol volumes assayed in 10 samples of 10 mL of blood
drawn from a drunk driving suspect:
7.97 nL, 7.80 nL, 7.79 nL, 8.12 nL, 8.12 nL, 8.22 nL,
8.03 nL, 7.97 nL, 7.88 nL, 8.08 nL
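One way to check the hand calculation is to follow the variance formula step by step in R and compare with the built-in sd(). The variable names below are illustrative only.

```r
# Blood alcohol volumes (nL) from the exercise
alcohol_nl <- c(7.97, 7.80, 7.79, 8.12, 8.12, 8.22, 8.03, 7.97, 7.88, 8.08)

n      <- length(alcohol_nl)          # number of data points
x_bar  <- sum(alcohol_nl) / n         # sample mean
sq_dev <- (alcohol_nl - x_bar)^2      # squared deviations from the mean
s2     <- sum(sq_dev) / (n - 1)       # sample variance (divide by n - 1, not n)
s      <- sqrt(s2)                    # sample standard deviation, in nL

round(x_bar, 3)   # 7.998
round(s, 3)       # 0.143
sd(alcohol_nl)    # built-in function, same value as s
```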
Measures of Data Spread
• Sample range:
• The difference between the largest and smallest
value in the sample
• Very sensitive to outliers (extreme observations)
• Percentiles:
• The pth percentile data value, x, means that p percent of the data are smaller than or equal to x.
• Median = 50th percentile
Measures of Data Spread
• What is the sample range of deoxypyridinoline conc?
0.62 0.64 1.14 1.04 1.07 1.83 1.32 1.19 1.28 0.85 1.36
1.16 1.00 1.69 1.62 1.25 1.49 1.45 1.14 2.40 3.05 2.81
# Dr. James Curran's dafs (http://www.stat.auckland.ac.nz/~curran)
library(dafs)
data(dpd.df)               # Deoxypyridinoline data
range(dpd.df[,5])          # Look at column 5 for Deoxypyridinoline concentration and get its range
diff(range(dpd.df[,5]))    # Range as defined in the notes
# Box and whiskers plot:
boxplot(dpd.df[,5], horizontal = TRUE, range = 0, xlab = "Deoxypyridinoline conc.")
sd(dpd.df[,5])             # standard dev of Deoxypyridinoline conc.
summary(dpd.df[,5])        # Common summary statistics
# Some percentiles
quantile(dpd.df[,5], probs = c(0.25)) # 25th percentile
quantile(dpd.df[,5], probs = c(0.50)) # 50th percentile
quantile(dpd.df[,5], probs = c(0.75)) # 75th percentile
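The same summaries can be sketched without the dafs package by typing in the 22 concentrations listed on the slide; this assumes the slide values are the ones being summarized. The last lines check the percentile definition by counting the fraction of data at or below a percentile.

```r
# Deoxypyridinoline concentrations typed in from the slide
# (assumption: these are the values being summarized)
dpd <- c(0.62, 0.64, 1.14, 1.04, 1.07, 1.83, 1.32, 1.19, 1.28, 0.85, 1.36,
         1.16, 1.00, 1.69, 1.62, 1.25, 1.49, 1.45, 1.14, 2.40, 3.05, 2.81)

sample_range <- max(dpd) - min(dpd)   # 3.05 - 0.62 = 2.43
dpd_median   <- median(dpd)           # average of the 11th and 12th ordered values

# Percentile definition: about p percent of the data are <= the pth percentile
q75 <- quantile(dpd, probs = 0.75)
frac_at_or_below <- mean(dpd <= q75)  # roughly 0.75; not exact for a small sample
```

quantile() interpolates between data points by default, so the fraction at or below a percentile only approximates p for small samples.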
Measures of Data Spread
[Figure: RI axis (tick labels 1.52003 and 1.52008) with the 1st and 99th percentiles marked; annotations indicate where the first 1% and the first 99% of the data fall]
Measures of Data Spread
• Box-and-whisker plot again for reference
• Deoxypyridinoline conc?
0.62 0.64 1.14 1.04 1.07 1.83 1.32 1.19 1.28 0.85 1.36
1.16 1.00 1.69 1.62 1.25 1.49 1.45 1.14 2.40 3.05 2.81
[Figure: box-and-whiskers plot annotated with the range, the 25th percentile (1st quartile), the median (50th percentile), and the 75th percentile (3rd quartile)]