Graphing and Summarizing Data
Osborn

First Thing: Look at your Data
• Some handy graphics
• Scatter plots: plot any two variables against each other
  [Figure: scatter plot of Na against RI]
• Pairs plots: do many scatter plots at once
  [Figure: pairs plot of Si, K and Ca]
• Gasoline data: lbl.gas
  [Figure: scatter plot of C4.alkylbenzene.unid.2 against o.Xylene, with labels numbered 1 to 20]
• Histograms: "bin" a variable and plot frequencies. Each bar is a "bin" that contains a number of data points (counts).
• Histograms, counts in each bin. In R:

    library(mlbench)   # Load a library containing some data
    data(Glass)        # Load Glass data set
    Glass              # Take a look at Glass
    head(Glass)        # Just look at the top of Glass
    RI <- Glass[,1]    # Pull out the RIs. They are in column 1
    hist(RI)           # Make a histogram for the RIs

• Box-and-whiskers plots:
  [Figure: box-and-whiskers plot of RI, annotated with the range, possible outliers, 25th percentile (1st quartile), median (50th percentile) and 75th percentile (3rd quartile)]

Visualizing Data
• Note the relationship:
  [Figure missing from the extracted text]

First Thing: Look at your Data
• Box-and-whiskers in R:

    # Box and whiskers plots
    boxplot(RI)
    boxplot(RI, horizontal = TRUE, range = 0)

  Result: [Figure: the resulting box plots]

Measures of Central Tendency
• Given a sample from some population:
• What is a good "summary" value that describes the sample well?
• We will look at:
  • Average (arithmetic mean)
  • Median
  • Mode

For reference see (available on-line): "The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures", LA Mohammed, B Found, M Caligiuri and D Rogers, J Forensic Sci 56(1), S136-S141 (2011)

Histogram Points of Interest
• Velocity for the first segment of genuine signatures in the (soon to be classic) Mohammed et al. study.
• What is a good summary number? ("Central tendency")
• How spread out is the data?
  [Figure: histogram of AvgAbsVelocity (x-axis) against Percent of Total (y-axis)]

Measures of Central Tendency
• Arithmetic sample mean (average): the sum of the data divided by the number of observations.
  Intuitive formula: avg = (meas. 1 + meas. 2 + meas. 3 + ...) / (number of measurements)
  Fancy formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Example from the L.A.M. study: compute the average absolute size of segment 1 for the genuine signature of subject 2.

    Subj. 2; Gen; Seg. 1
    Measurement   Absolute Size (cm)
    1             0.0548
    2             0.2951
    3             0.1026
    4             0.1005
    5             0.2491
    6             0.1287
    7             0.0496
    8             0.2299
    9             0.2560
    10            0.0538

  $\overline{\text{Abs.Size}} = \frac{1}{10} \sum_{i=1}^{10} \text{Abs.Size}_i = 0.152$

• Example (more useful): consider again Average Absolute Velocity for genuine signatures across all writers in the LAM study.
  92 subjects × 10 measurements/subject = 920 velocity measurements
  Average of the Average Absolute Velocity: $\frac{1}{920} \sum_{i=1}^{920} \text{Abs.Avg.Veloc}_i = 8.39$
  [Figure: histogram of AvgAbsVelocity (x-axis) against Percent of Total (y-axis)]

• Sample median: ordering the n pieces of data from smallest value to largest value, the median is the "middle value":
  • If n is odd, the median is the $\frac{n+1}{2}$th ordered data point.
  • If n is even, the median is the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th ordered data points.

• Example from the L.A.M. study: compute the median absolute size of segment 1 for the genuine signature of subject 2.

    Subj. 2; Gen; Seg. 1
    Measurement   Absolute Size (cm)   Ordered
    1             0.0548               0.0496
    2             0.2951               0.0538
    3             0.1026               0.0548
    4             0.1005               0.1005
    5             0.2491               0.1026
    6             0.1287               0.1287
    7             0.0496               0.2299
    8             0.2299               0.2491
    9             0.2560               0.2560
    10            0.0538               0.2951

  $n = 10$, so $\frac{n}{2} = 5$ and $\frac{n}{2} + 1 = 6$. The median is the average of the 5th and 6th ordered values:
  $\frac{0.1026 + 0.1287}{2} = 0.11565$

• Example: median of Average Absolute Velocity for genuine signatures, LAM:
  Abs.Avg.Veloc #460 = 7.1972
  Abs.Avg.Veloc #461 = 7.2008
  median = (7.1972 + 7.2008)/2 = 7.1990
  [Figure: histogram of AvgAbsVelocity against Percent of Total, with the average (Avg) marked]

• Sample mode:
  • Needs careful definition, but basically: the data value that occurs the most.
  • Tabulate the data and see which value(s) occur the most.
    [Figure: tabulated sample with the mode marked]
  • Computing modes can get tricky if there is more than one (multimodal).
    [Figure: sample with several modes marked]
  • What's the mode here?
    [Figure: sample]
  • Mode of Average Absolute Velocity for genuine signatures, LAM: mode = 9.2541
    [Figure: histogram of AvgAbsVelocity against Percent of Total, with the median (Med) and average (Avg) marked]

• Some trivia:
  [Figure: distribution with its modes marked]
  [Figure: symmetric distribution with the mean marked]
  Nice and symmetric: Mean = Median = Mode

Measures of Data Spread
• Sample variance: (almost) the average of squared deviations from the sample mean.
  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  where n is the number of data points, $x_i$ is data point i, and $\bar{x}$ is the sample mean.
• Standard deviation is $s = \sqrt{s^2}$
• The sample average and standard deviation are the most common measures of central tendency and spread.
• The sample average and standard deviation have the same units.

• Standard deviation is "instructive" to do by hand a few times. Compute the standard deviation of the following blood alcohol volumes assayed in 10 samples of 10 mL of blood drawn from a drunk driving suspect:
  7.97 nL, 7.80 nL, 7.79 nL, 8.12 nL, 8.12 nL, 8.22 nL, 8.03 nL, 7.97 nL, 7.88 nL, 8.08 nL

• Sample range:
  • The difference between the largest and smallest value in the sample
  • Very sensitive to outliers (extreme observations)
• Percentiles:
  • The pth percentile data value, x, means that p percent of the data are smaller than or equal to x.
  • Median = 50th percentile

• What is the sample range of deoxypyridinoline conc.?
  0.62 0.64 1.14 1.04 1.07 1.83 1.32 1.19 1.28 0.85 1.36 1.16 1.00 1.69 1.62 1.25 1.49 1.45 1.14 2.40 3.05 2.81

    # Dr. James Curran's dafs (http://www.stat.auckland.ac.nz/~curran)
    library(dafs)
    data(dpd.df)                 # Deoxypyridinoline data
    range(dpd.df[,5])            # Look at column 5 for Deoxypyridinoline concentration and get its range
    diff(range(dpd.df[,5]))      # Range as defined in the notes

    # Box and whiskers plot:
    boxplot(dpd.df[,5], horizontal = TRUE, range = 0, xlab = "Deoxypyridinoline conc.")

    sd(dpd.df[,5])               # standard dev of Deoxypyridinoline conc.
    summary(dpd.df[,5])          # Common summary statistics

    # Some percentiles
    quantile(dpd.df[,5], probs = c(0.25))   # 25th percentile
    quantile(dpd.df[,5], probs = c(0.50))   # 50th percentile
    quantile(dpd.df[,5], probs = c(0.75))   # 75th percentile

• Percentiles on a plot:
  [Figure: RI data with the 1st percentile (1.52003) and 99th percentile (1.52008) marked; captions: "First 1% of the data is between here" and "First 99% of the data is between here"]

Measures of Data Spread
• Box-and-whisker plot again for reference
• Deoxypyridinoline conc.:
[Figure: box-and-whiskers plot of deoxypyridinoline concentration, annotated with the range, 25th percentile (1st quartile), median (50th percentile) and 75th percentile (3rd quartile)]
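As a check on the hand calculations in these notes, the sketch below redoes the mean and median for subject 2's absolute sizes and works the blood alcohol standard deviation exercise step by step, following the formulas for the sample mean, median and variance. The variable names (abs_size, bac, mean_by_hand and so on) are made up for illustration; the data values are the ones listed in the notes, and only base R is assumed.

```r
# Absolute size (cm): subject 2, genuine signature, segment 1 (values from the notes)
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.2560, 0.0538)

n <- length(abs_size)

# Sample mean: sum of the data divided by the number of observations
mean_by_hand <- sum(abs_size) / n

# Sample median: n is even here, so average the (n/2)th and (n/2 + 1)th ordered values
ordered <- sort(abs_size)
median_by_hand <- (ordered[n / 2] + ordered[n / 2 + 1]) / 2

# Blood alcohol volumes (nL) from the standard deviation exercise
bac <- c(7.97, 7.80, 7.79, 8.12, 8.12, 8.22, 8.03, 7.97, 7.88, 8.08)

m    <- length(bac)
xbar <- sum(bac) / m                      # sample mean
s2   <- sum((bac - xbar)^2) / (m - 1)     # sample variance (divide by n - 1)
s    <- sqrt(s2)                          # sample standard deviation

# Compare the hand-rolled values with R's built-in functions
print(c(mean_by_hand, mean(abs_size)))
print(c(median_by_hand, median(abs_size)))
print(c(s, sd(bac)))
```

Each printed pair should agree to within floating-point rounding. The mean comes out at about 0.152 cm and the standard deviation at about 0.143 nL.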