Descriptive statistics for one variable

What to describe?
* What is the “location” or “center” of the data? (“measures of location”)
* How do the data vary? (“measures of variability”)

Types of statistics
* Descriptive statistics: numerical and graphic procedures to summarize a collection of data in a clear and understandable way
* Inferential statistics: procedures to draw inferences about a population from a sample

Reasons for using statistics
* aid in summarization
* aid in “getting at what’s going on”
* aid in extracting “information” from the data
* aid in communication

Frequency distribution
* The frequency with which observations are assigned to each category or point on a measurement scale
* The most basic form of descriptive statistic
* May be expressed as a percentage of the total sample found in each category
Source: Reasoning with Statistics, by Frederick Williams & Peter Monge, fifth edition, Harcourt College Publishers.

Frequency distribution
* The distribution is “read” differently depending upon the measurement level
* Nominal scales are read as discrete measurements at each level (no ordering)
* Ordinal measures show tendencies, but categories should not be compared (ordering exists, but not distance)
* Interval scales (distance exists, but no ratios) and ratio scales (ratios exist) allow for comparison among categories

Sex      N    Mean    Median  TrMean  StDev  SE Mean  Minimum  Maximum  Q1     Q3
female   126  91.23   90.00   90.83   11.32  1.01     65.00    120.00   85.00  98.25
male     100  96.79   110.00  105.62  17.39  1.74     75.00    162.00   95.00  118.75

[Figure: Fastest Ever Driving Speed, 226 Stat 100 students, Fall '98 (100 men, 126 women); x-axis: Speed]
[Figure: Fastest Ever Driving Speed, 226 Stat 100 students, Fall 1998; x-axis: Gender (female, male)]

Source: Protecting Children from Harmful Television: TV Ratings and the V-chip. Amy I. Nathanson, PhD, Lecturer, University of California at Santa Barbara; Joanne Cantor, PhD, Professor, Communication Arts, University of Wisconsin-Madison
Source: http://www.elonka.com/kryptos/ (web page on cryptography)

Ancestry of US residents
Source: UCLA International Institute
Source: Cornell University website
Source: www.cit.cornell.edu/computer/students/bandwidth/charts.html

Search engine use
The percentage of online searches done by US home and work web surfers in July 2006 (NY Times)
Source: Verisign

Old Faithful Geyser
Duration of 272 eruptions of the Old Faithful geyser (recorded in minutes in R's faithful dataset).

    > library(datasets)
    > faithful[1:10,]
       eruptions waiting
    1      3.600      79
    2      1.800      54
    3      3.333      74
    4      2.283      62
    5      4.533      85
    6      2.883      55
    7      4.700      88
    8      3.600      85
    9      1.950      51
    10     4.350      85
    > summary(faithful)
       eruptions        waiting
     Min.   :1.600   Min.   :43.0
     1st Qu.:2.163   1st Qu.:58.0
     Median :4.000   Median :76.0
     Mean   :3.488   Mean   :70.9
     3rd Qu.:4.454   3rd Qu.:82.0
     Max.   :5.100   Max.   :96.0

Normal distribution
* Many characteristics are distributed through the population in a ‘normal’ manner
* Normal curves have well-defined statistical properties
* Parametric statistics are based on the assumption that the variables are distributed normally; these are the most commonly used statistics
* This is the famous “Bell curve”, where many cases fall near the middle of the distribution and few fall very high or very low
[Figure: I.Q.]

Statistical properties of the normal distribution
[Figure: I.Q. distribution]

Measures of central tendency
* Mode (Mo): the most frequent score in a distribution; good for nominal data
* Median (Md): the midpoint or midscore in a distribution (50% of cases above, 50% of cases below); insensitive to extreme cases; interval or ratio
Source: Reasoning with Statistics, by Frederick Williams & Peter Monge, fifth edition, Harcourt College Publishers.
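The frequency distribution, mode, and median described above can be sketched in base R, the language of the faithful session. This is a minimal illustration, not part of the original slides: the `scores` vector is hypothetical, and because base R has no built-in function for the statistical mode, the sketch takes the most frequent entry of `table()`, which reports only the first value when several are tied.

```r
# Hypothetical scores, invented for illustration only
scores <- c(2, 3, 3, 4, 4, 4, 5, 9)

# Frequency distribution: count of observations at each value
freq <- table(scores)

# The same distribution expressed as a percentage of the total sample
pct <- 100 * prop.table(freq)

# Mode: the most frequent score (first one only, if there are ties)
mode_value <- as.numeric(names(freq)[which.max(freq)])

# Median: the midpoint of the sorted scores
median_value <- median(scores)

print(freq)
print(pct)
cat("mode:", mode_value, " median:", median_value, "\n")
```

Note that the extreme score 9 does not move the median, which is the sense in which the median is insensitive to extreme cases.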
Measures of central tendency
* Mean: the ‘average’ score, the total score divided by the number of scores
* Has a number of useful statistical properties; many statistics are based on the mean
* However, it can be sensitive to extreme scores
* Sensitive to ‘outliers’: extreme cases that just happened to end up in your sample by chance

Index of central tendency
Source: http://www.uwsp.edu/psych/stat/5/skewnone.gif
Source: Scianta.com
Source: www.wilderdom.com/.../L2-1UnderstandingIQ.html
Source: CSAP’s Data Pathways

Measures of dispersion
* Look at how widely scattered over the scale the scores are
* Groups with identical means can be more or less diverse
* To find out how the group is distributed, we need to know how far or close individual members are from the mean
* Like the mean, only meaningful for interval or ratio-level measures

Measures of dispersion
* Range: distance between the highest and lowest scores in a distribution
* Sensitive to extreme scores; compensate by calculating the interquartile range (distance between the 25th and 75th percentile points), which represents the range of scores for the middle half of a distribution
* Usually used in combination with other measures of dispersion

Range
Source: www.animatedsoftware.com/statglos/sgrange.htm
Source: http://pse.cs.vt.edu/SoSci/converted/Dispersion_I/box_n_hist.gif

Average Deviation (Mean Deviation)
Merits:
1. Easy to calculate and understand.
2. This can be calculated from any average.
3. It is less affected by extreme observations.
Demerits:
1. This is mathematically incomplete because it ignores negative signs.
2. As it can be calculated from any average, it does not have certainty (i.e., it is not a well-defined measure).
3. Its use is very limited in statistical work.

Measures of dispersion
* Variance (S²): average of squared distances of individual points from the mean
* High variance means that most scores are far away from the mean
* Low variance indicates that most scores cluster tightly about the mean
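The measures above can be computed in a few lines of base R. This is a sketch under stated assumptions rather than anything from the original slides: the vector `x` and the outlier value 50 are hypothetical, the mean deviation is taken about the mean (the slides note it can be taken about any average), and `IQR()` relies on R's default quantile method.

```r
# Hypothetical scores, invented for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Mean, and how one extreme score (50, also hypothetical) pulls it
mean_x <- mean(x)                         # 5
mean_with_outlier <- mean(c(x, 50))       # 10
median_x <- median(x)                     # 4.5
median_with_outlier <- median(c(x, 50))   # 5

# Range: distance between the highest and lowest scores
score_range <- diff(range(x))             # 7

# Interquartile range: distance between the 25th and 75th percentiles
iqr <- IQR(x)                             # 1.5 with R's default quantile type

# Mean (average) deviation, here taken about the mean
mean_dev <- mean(abs(x - mean_x))         # 1.5

# Variance as the average squared distance from the mean (divides by n)
pop_var <- mean((x - mean_x)^2)           # 4

# var() divides by n - 1 instead, giving the sample variance
samp_var <- var(x)                        # 32 / 7

cat("mean:", mean_x, "-> with outlier:", mean_with_outlier, "\n")
cat("median:", median_x, "-> with outlier:", median_with_outlier, "\n")
cat("range:", score_range, " IQR:", iqr, " mean deviation:", mean_dev, "\n")
cat("variance (n):", pop_var, " variance (n - 1):", samp_var, "\n")
```

A single extreme score doubles the mean here while the median moves by half a point, which is why the range is usually paired with the interquartile range.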
Standard Deviation (SD)
* A summary statistic of how much scores vary from the mean
* Square root of the variance, expressed in the original units of measurement
* Used in a number of inferential statistics

Variance vs. Standard Deviation
* Variance (population and sample)
* Standard deviation

Skewness of distributions
* Measures look at how lopsided distributions are: how far from the ideal of the normal curve they are
* When the median and the mean are different, the distribution is skewed. The greater the difference, the greater the skew.
* Distributions that trail away to the left are negatively skewed and those that trail away to the right are positively skewed
* If the skewness is extreme, the researcher should either transform the data to make them better resemble a normal curve or else use a different set of statistics (nonparametric statistics) to carry out the analysis

Different Shapes of Distributions
Source: http://faculty.vassar.edu/lowry/f0204.gif

Skewness of distributions
Source: http://www.polity.org.za/html/govdocs/reports/aids/images/image022.gif
[Figure: Distribution of posting frequency on Usenet]

Kurtosis
* Measures of kurtosis look at how sharply the distribution rises to a peak and then drops away
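A short base R sketch may make the standard deviation, skewness, and kurtosis more concrete. The slides do not give formulas, so the moment-based definitions used here are an assumption (one common convention among several): skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3, where mk is the average k-th power of the deviations from the mean. The vector `x` is hypothetical, and base R itself does not ship skewness or kurtosis functions, so both are computed by hand.

```r
# Hypothetical scores, invented for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
dev <- x - mean(x)                 # deviations from the mean

# Standard deviation: square root of the variance, in the original units
pop_sd  <- sqrt(mean(dev^2))       # population form (divides by n): 2
samp_sd <- sd(x)                   # sample form (divides by n - 1)

# Central moments: average k-th power of the deviations
m2 <- mean(dev^2)
m3 <- mean(dev^3)
m4 <- mean(dev^4)

# Moment-based skewness (assumed definition): > 0 trails right, < 0 trails left
skew <- m3 / m2^1.5                # 0.65625

# Excess kurtosis (assumed definition): 0 for a normal curve
excess_kurt <- m4 / m2^2 - 3       # -0.21875

cat("SD (n):", pop_sd, " SD (n - 1):", samp_sd, "\n")
cat("mean:", mean(x), " median:", median(x), "\n")
cat("skewness:", skew, " excess kurtosis:", excess_kurt, "\n")
```

In this example the mean (5) exceeds the median (4.5) and the skewness is positive, consistent with a distribution that trails away to the right.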