Department of Civil, Structural, and Environmental Engineering
University at Buffalo, State University of New York
2011 Summer ERIE/REU Program

Descriptive Statistics
Igor Jankovic

Content
• Statistics terminology
  1. Population vs. Sample
  2. Descriptive statistics vs. Inferential statistics
  3. Data Types
• Presentation of qualitative data
  1. Graphical method
  2. Numerical method
• Presentation of quantitative data
  1. Graphical method
  2. Numerical method
• Outliers in a data set

Population vs. Sample
• Population: an entire data set that is the target of our interest
• Sample: a subset of data selected from a population

Example: Electrical engineers recognize that high neutral current in computer power systems is a potential problem. To determine the extent of the problem, a survey of the computer power system load currents at 146 US sites was taken (IEEE Transactions on Industry Applications, July/August 1990). The survey revealed that less than 10% of the sites had high neutral to full-load current ratios.
• Identify the population of interest (power load status at all US sites with computer power systems)
• Identify the sample (power load status at 146 US sites with computer power systems)
• Use the sample information to make an inference about the population (less than 10% of the sites had high neutral to full-load current ratios)

Descriptive statistics vs. Inferential statistics
• Two major applications of statistics:
  - Summarizing, describing, and exploring data
  - Using sample data to infer the nature of the population data set
In other words,
• Descriptive statistics
  - The branch of statistics devoted to the organization, summarization, and description of data sets
• Inferential statistics
  - The branch of statistics concerned with using sample data to make an inference about populations

Data Types
Quantitative Data: data that represent the quantity or amount of something
Qualitative (categorical) Data: data that have no quantitative interpretation
Example:
• Length (in centimeters), weight (in grams), DDT concentration (in ppm): quantitative data
• Location and species: qualitative data

Qualitative Data

Graphical method for describing qualitative data
For qualitative data, we define the categories in such a way that each observation can fall in one and only one category.
Example: Student distribution in terms of year at college in EAS 308
[Figures: Horizontal Bar Graph, Pie Chart, and Pareto diagram. Axes: Number of students in EAS 308 vs. Year at College (Sophomore, Junior, Senior).]

Numerical method for describing qualitative data
• Category frequency for a given category is the number of observations that fall in that category
• Category relative frequency for a given category is the proportion of the total number of observations that fall in that category

Summary frequency table
Year at college   Frequency   Percent   Cumulative Frequency   Cumulative Percent
Sophomore         11          12.4      11                     12.4
Junior            35          39.3      46                     51.7
Senior            43          48.3      89                     100.0

Quantitative Data

Graphical method for describing quantitative data (1)
Dot plots
Steps:
1. Draw a horizontal scale that spans the range of the data
2. Place a dot over the appropriate value on the scale for each observation
3. If a data value repeats, place the dots on top of each other

Graphical method for describing quantitative data (2)
Histograms (the most popular and traditional method for describing quantitative data)
Steps:
1. Calculate the range of the data
2. Divide the range into 5-20 classes of equal width
3. For each class, count the number of observations that fall in the class (the class frequency)
4. Calculate each relative class frequency = (class frequency) / (total number of measurements)

Graphical method for describing quantitative data (3)
Stem-and-Leaf Display
Steps:
1. Divide each observation in the data set into two parts, the stem and the leaf. For example, the stem and leaf of the CPU time 2.41 are 2 and 41, respectively.
   Stem   Leaf
   2      41
2. List the stems in order in a column, starting with the smallest stem and ending with the largest.
3. Proceed through the data set, placing the leaf for each observation in the appropriate stem row.
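The histogram and stem-and-leaf steps above can be sketched in Python. This is only an illustration under stated assumptions: the function names `class_frequencies` and `stem_and_leaf` are hypothetical, the CPU times other than 2.41 are made-up values, the data are assumed to be non-negative with two decimal places, and the largest observation is assigned to the last class.

```python
from collections import defaultdict


def class_frequencies(data, num_classes):
    """Histogram steps 1-4: equal-width classes with class and relative frequencies."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes  # assumes the data are not all identical
    counts = [0] * num_classes
    for y in data:
        # Assumption: the maximum value is counted in the last class.
        k = min(int((y - lo) / width), num_classes - 1)
        counts[k] += 1
    n = len(data)
    # One tuple per class: (lower bound, upper bound, class frequency, relative frequency)
    return [(lo + k * width, lo + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]


def stem_and_leaf(data):
    """Stem = integer part, leaf = the two decimal digits (non-negative data assumed)."""
    leaves = defaultdict(list)
    for y in data:
        stem, leaf = f"{y:.2f}".split(".")
        leaves[int(stem)].append(leaf)
    # List every stem from the smallest to the largest, even if it has no leaves.
    return {stem: leaves.get(stem, []) for stem in range(min(leaves), max(leaves) + 1)}


# Hypothetical CPU times (only 2.41 appears in the lecture).
cpu_times = [2.41, 1.17, 0.02, 1.23, 3.53, 0.15, 2.59, 1.38, 0.82, 1.61]

for lower, upper, freq, rel in class_frequencies(cpu_times, 4):
    print(f"{lower:5.2f} to {upper:5.2f}: frequency {freq}, relative frequency {rel:.2f}")

for stem, row in stem_and_leaf(cpu_times).items():
    print(stem, "|", " ".join(row))
```

The number of classes is left as a parameter because the lecture only gives a range (5-20); the value 4 above is used purely to keep the printed output short for ten observations.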

Numerical method for describing quantitative data
Measures of central tendency - help to locate the center of the relative frequency distribution

- Arithmetic mean (mean)
Suppose we have a set of n measurements, $y_1, y_2, y_3, \ldots, y_n$.
The arithmetic mean: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
Generally, we use $\bar{y}$ to represent the sample mean and $\mu$ to represent the population mean.

- Median
The median is the middle number when the measurements are arranged in ascending (or descending) order:
$m = y_{((n+1)/2)}$ if n is odd
$m = \frac{y_{(n/2)} + y_{(n/2+1)}}{2}$ if n is even
Generally, we use $m$ to represent the sample median and [symbol not recoverable] to represent the population median.

- Mode
The mode of a set of n measurements, $y_1, y_2, y_3, \ldots, y_n$, is the value of y that occurs with the greatest frequency.

Example: We have 10 sample measurements: 4, 5, 8, 1, 11, 6, 2, 8, 3, 7. Compute the mean, median, and mode.
Solution:
Mean = 5.5
Median = (5 + 6)/2 = 5.5
Mode = 8

Measures of central tendency: Geometric Mean (from Wikipedia)
For positive measurements, the geometric mean is $\left(\prod_{i=1}^{n} y_i\right)^{1/n}$.

Measures of central tendency: Harmonic Mean (from Wikipedia)
For positive measurements, the harmonic mean is $n \Big/ \sum_{i=1}^{n} \frac{1}{y_i}$.

Measures of variation - help to describe the spread of the distribution

- Range
Range = largest measurement - smallest measurement

- Variance (of n measurements, $y_1, y_2, y_3, \ldots, y_n$)
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = \frac{\sum_{i=1}^{n} y_i^2 - \left[\left(\sum_{i=1}^{n} y_i\right)^2 / n\right]}{n-1}$
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}$

- Standard Deviation
Standard deviation of a sample: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - \left[\left(\sum_{i=1}^{n} y_i\right)^2 / n\right]}{n-1}}$
Standard deviation of a population: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}}$

Skewness: measure of shape
Approximate formula (accurate for large n) and exact formula [equations not recoverable], where s is the sample standard deviation.
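
The measures of central tendency and variation defined above can be checked on the lecture's ten sample measurements. The sketch below is an illustration rather than part of the original slides: the variable names are arbitrary, and the population variance line assumes, purely for demonstration, that the ten values are the whole population (so the mean plays the role of $\mu$).

```python
import math
import statistics

# The 10 sample measurements from the lecture example.
y = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]
n = len(y)

# Measures of central tendency.
mean = sum(y) / n
ordered = sorted(y)
if n % 2 == 1:
    median = ordered[(n + 1) // 2 - 1]                  # y_((n+1)/2), 1-based
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
mode = statistics.mode(y)                               # most frequent value

# Measures of variation.
data_range = max(y) - min(y)
sum_sq_dev = sum((yi - mean) ** 2 for yi in y)          # sum of (y_i - ybar)^2
sample_var = sum_sq_dev / (n - 1)
shortcut_var = (sum(yi ** 2 for yi in y) - sum(y) ** 2 / n) / (n - 1)
sample_sd = math.sqrt(sample_var)
# Assumption for illustration only: treat the 10 values as an entire population.
pop_var = sum_sq_dev / n

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range)
print("sample variance:", round(sample_var, 4), "sample sd:", round(sample_sd, 4))
print("population variance:", pop_var)
```

The definitional and shortcut forms of the sample variance give the same value, and both agree with `statistics.variance`, which also divides by n - 1.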

Kurtosis: measure of "peakedness"
Approximate formula (accurate for large n) and exact formula [equations not recoverable], where s is the sample standard deviation.

Numerical method for describing quantitative data
Measures of relative standing - describe the relative position of an observation within the data set
Two measures used to describe the relative standing of an observation are percentiles and z-scores.

Percentiles
- 100pth percentile
The 100pth percentile of a data set is a value of y located so that 100p% of the area under the relative frequency distribution for the data lies to the left of the 100pth percentile and 100(1-p)% of the area lies to its right [note: $0 \le p \le 1$].
- Lower quartile, $Q_L$, corresponding to the 25th percentile
- Midquartile (median), $m$, corresponding to the 50th percentile
- Upper quartile, $Q_U$, corresponding to the 75th percentile

Z-scores
The z-score for a value y of a data set is the distance that y lies above or below the mean, measured in units of the standard deviation.
Sample z-score: $z = \frac{y - \bar{y}}{s}$
Population z-score: $z = \frac{y - \mu}{\sigma}$

Detecting Outliers
Definition of an outlier: An observation y that is unusually large or small relative to the other values in a data set is called an outlier.
Reasons for outliers in a data set:
1. The measurement is observed, recorded, or entered into the computer incorrectly
2. The measurement comes from a different population
3. The measurement is correct, but represents a rare (chance) event
Rule of Thumb for detecting outliers: Observations with z-scores greater than 3 in absolute value are considered outliers.

Detecting Outliers
Box Plot Method
Interquartile range, IQR: $IQR = Q_U - Q_L$
Steps to construct a Box Plot
1. Calculate the median $m$, the lower and upper quartiles $Q_L$ and $Q_U$, and the IQR for the y values in a data set.
2. Construct a box on the y-axis with $Q_L$ and $Q_U$ located at the lower corners. The base width will be equal to the IQR. Draw a vertical line inside the box to locate the median, $m$.
3. Construct two sets of limits on the box plot. Inner fences are located a distance of 1.5(IQR) below $Q_L$ and above $Q_U$; outer fences are located a distance of 3(IQR) below $Q_L$ and above $Q_U$.
4. Observations that fall between the inner and outer fences are called suspect outliers. Observations that fall outside the outer fences are called highly suspect outliers.
5. To further highlight extreme values, use whiskers.

Empirical Rule
If a data set has an approximately mound-shaped distribution, then the following rules of thumb may be used to describe the data set:
• Approximately 68% of the measurements will lie within the interval $\bar{y} \pm s$ (for samples)
• Approximately 95% of the measurements will lie within the interval $\bar{y} \pm 2s$ (for samples)

Summary
In this lecture, we have learned:
• Some important statistics terminology
  1. Population vs. Sample
  2. Descriptive statistics vs. Inferential statistics
  3. Data Types
• How to deal with qualitative data
  1. Graphical method (Bar graph, Pie chart, Pareto diagram)
  2. Numerical method
• How to deal with quantitative data
  1. Graphical method (Dot plot, Histogram, Stem-and-Leaf plot)
  2. Numerical method
• How to detect outliers in a data set
• Empirical Rule
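
As a supplement to the outlier-detection rules covered in this lecture (the z-score rule of thumb and the box plot fences), the following Python sketch applies both to a small data set. The data values and the function name `classify_outliers` are hypothetical. Quartile conventions differ between textbooks and software; the `method="inclusive"` option of `statistics.quantiles` is one choice among several and is an assumption here, not necessarily the convention used in the course.

```python
import statistics


def classify_outliers(data):
    """Apply the box plot fences and the |z| > 3 rule of thumb to a sample."""
    # Quartiles: one possible convention (assumption, see lead-in).
    q_l, m, q_u = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q_u - q_l

    # Inner fences: 1.5(IQR) below Q_L and above Q_U.
    inner_low, inner_high = q_l - 1.5 * iqr, q_u + 1.5 * iqr
    # Outer fences: 3(IQR) below Q_L and above Q_U.
    outer_low, outer_high = q_l - 3 * iqr, q_u + 3 * iqr

    highly_suspect = [y for y in data if y < outer_low or y > outer_high]
    suspect = [y for y in data
               if (outer_low <= y < inner_low) or (inner_high < y <= outer_high)]

    # Sample z-scores: (y - ybar) / s, with s computed using n - 1.
    y_bar = statistics.mean(data)
    s = statistics.stdev(data)
    z_outliers = [y for y in data if abs((y - y_bar) / s) > 3]

    return {
        "quartiles": (q_l, m, q_u),
        "iqr": iqr,
        "suspect": suspect,
        "highly_suspect": highly_suspect,
        "z_outliers": z_outliers,
    }


# Hypothetical measurements, not taken from the lecture.
sample = [10, 12, 13, 14, 15, 16, 17, 26, 40]
result = classify_outliers(sample)
for key, value in result.items():
    print(key, "=", value)
```

For this hypothetical sample the box plot method marks 26 as a suspect outlier and 40 as a highly suspect outlier, while no observation has a z-score above 3 in absolute value. With only nine observations a sample z-score cannot exceed $(n-1)/\sqrt{n} \approx 2.67$, so the z-score rule is better suited to larger data sets.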