Lecture 3 (Jan 19, 2004)

2.6 Numerical Summaries of Quantitative Variables

We have seen many ideas in this section before.

Notation for raw data:
n: the number of individuals in a data set.
x1, x2, …, xn: the individual raw data values.

Mean vs. Median
1. How to calculate the mean and the median. The mean is $\bar{x} = \frac{\sum x_i}{n}$; the median is the middle value of the ordered data (the average of the two middle values when n is even).
2. Influence of outliers on the mean and median: the mean is pulled toward outliers, while the median is hardly affected by them.

Spread of data:
1. Q1: lower quartile; Q3: upper quartile;
2. Range: high extreme – low extreme;
3. Interquartile range: Q3 – Q1, denoted IQR;
4. Percentiles: in general, the kth percentile is a number that has k% of the data values at or below it and (100 – k)% of the data values at or above it.

Boxplot and interpretation:
In Minitab, use Graph >> Boxplot, then choose the variable of interest as Y and an appropriate X. Using the ideal height data again, we compare males and females:

[Figure: boxplots of IdealHt (vertical axis, about 60 to 85) by Gender (female, male)]

1. Label the Y-axis with numbers from the minimum to the maximum of the data;
2. The lower end of the box is Q1 and the upper end is Q3;
3. The line in the middle of the box is the median;
4. Draw a line from the Q1 end of the box to the smallest data value that is no further than 1.5*IQR below Q1, and a line from the Q3 end to the largest data value that is no further than 1.5*IQR above Q3;
5. The remaining points are treated as outliers and are marked with asterisks at their proper positions.

2.7 Bell-shaped Distributions of Numbers

Nature seems to follow a predictable pattern for many kinds of measurements. Most individuals are concentrated around the center, and the greater the distance a value is from the center, the fewer individuals have that value. Human height and weight are examples. Numerical variables that follow this pattern are said to follow a bell-shaped curve; many of them have a particular distribution called the normal distribution.
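Looking back at the boxplot steps in Section 2.6, the quantities involved (quartiles, IQR, whisker ends, outliers) can be sketched in a few lines of Python. This is only an illustrative sketch, not Minitab's own procedure: the function name `boxplot_summary` and the height values are made up here, and the quartiles come from `statistics.quantiles`, whose default convention may differ slightly from the one used in the course software.

```python
import statistics


def boxplot_summary(values):
    """Compute the numbers needed to draw a boxplot by the 1.5*IQR rule."""
    data = sorted(values)
    # Quartile conventions vary; this uses Python's default ("exclusive") method.
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data values still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    # Everything outside the fences is drawn separately as an outlier.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "range": data[-1] - data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical heights in inches, invented for illustration only.
heights = [60, 62, 63, 64, 65, 66, 67, 68, 70, 72, 85]
summary = boxplot_summary(heights)
print(summary)
```

With these made-up values the box runs from 63 to 70, so the IQR is 7 and the upper fence is 70 + 1.5*7 = 80.5. The value 85 lies beyond that fence and is reported as an outlier, while the upper whisker stops at 72.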
Look at the actual height data:

[Figure: histogram of Height with normal curve; horizontal axis Height (60 to 90), vertical axis Frequency (0 to 50)]

In Minitab: Stat >> Basic Statistics >> Display Descriptive Statistics, choose the column of interest, click Graphs, and choose Histogram with normal curve.

Standard Deviation:
The standard deviation is, roughly, the average distance that values fall from the mean.
Sample mean: $\bar{x}$;
Population mean: $\mu$; note: usually we do NOT know this value. We want to draw inference on this population mean.
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$;
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$.

Empirical Rule: for any bell-shaped curve, approximately:
1. 68% of the values fall within 1 standard deviation of the mean in either direction;
2. 95% of the values fall within 2 standard deviations of the mean in either direction;
3. 99.7% of the values fall within 3 standard deviations of the mean in either direction.

Standardized z-score: $z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$. The z-score shows how far a given point is from the mean, measured in standard deviations.

Empirical Rule in terms of z-scores: for any bell-shaped curve, approximately:
1. 68% of the values have a z-score between -1 and 1;
2. 95% of the values have a z-score between -2 and 2;
3. 99.7% of the values have a z-score between -3 and 3.
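The standard deviation formulas, the z-score, and the Empirical Rule can be tried out with a short Python sketch. Everything named here is an assumption made for illustration: the helper functions, the small data list, and the simulated "heights" (drawn from a normal distribution with mean 70 and standard deviation 3) are not from the course data.

```python
import math
import random


def sample_mean(xs):
    return sum(xs) / len(xs)


def sample_sd(xs):
    """Definition form: square root of sum((x_i - xbar)^2) / (n - 1)."""
    n, xbar = len(xs), sample_mean(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


def sample_sd_shortcut(xs):
    """Shortcut form: square root of (sum(x_i^2) - n * xbar^2) / (n - 1)."""
    n, xbar = len(xs), sample_mean(xs)
    return math.sqrt((sum(x * x for x in xs) - n * xbar ** 2) / (n - 1))


def z_score(observed, mean, sd):
    """Distance of an observed value from the mean, in standard deviations."""
    return (observed - mean) / sd


def fraction_within(xs, k):
    """Fraction of values whose z-score lies between -k and k."""
    xbar, s = sample_mean(xs), sample_sd(xs)
    return sum(1 for x in xs if abs(z_score(x, xbar, s)) <= k) / len(xs)


# Small made-up data set: both formulas for s give the same answer.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_sd(data), sample_sd_shortcut(data))

# Simulated bell-shaped data (hypothetical heights), fixed seed for repeatability.
rng = random.Random(0)
simulated = [rng.gauss(70, 3) for _ in range(10_000)]
for k in (1, 2, 3):
    print(k, round(fraction_within(simulated, k), 3))
```

For the simulated values the three printed fractions should come out close to 0.68, 0.95, and 0.997. On real data that are skewed or have heavy tails the agreement can be noticeably worse, which is why the rule is stated only for bell-shaped curves.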