AMS 7L is required for AMS 7. About 10 of you have not yet registered for AMS 7L. You should do so ASAP. Lab starts this week!

Random Sample vs. Simple Random Sample

Suppose 36 people are in a room, seated in 6 rows of 6. They are assigned "names" as follows:

A1 A2 A3 A4 A5 A6
B1 B2 B3 B4 B5 B6
C1 C2 C3 C4 C5 C6
D1 D2 D3 D4 D5 D6
E1 E2 E3 E4 E5 E6
F1 F2 F3 F4 F5 F6

One example (not the only way) of random sampling with sample size 6 is to pick one row at random, so that the sample consists of all six people in that row.

A simple random sample must give equal probability to every possible sample of size 6. For example, a sample whose members sit in several different rows must be just as likely as a sample made up of one complete row.

The row-picking method described above CANNOT produce such a sample, which means it is not simple random sampling.

Ways to do simple random sampling for this example:

Describing, Exploring and Comparing Data

Data: 70, 40, 70, 150, 155, 70, 45, 65, 105, 50, 175, 40 (storm wind speeds, 2005)

Stem and leaf plot: shows the distribution, max/min, and clustering.

Histogram:
Frequency distribution (table columns): Speed, Frequency, Relative frequency
Histogram axes: Speed, Relative frequency (%)

Pareto Chart (Bar): a sorted bar chart (the counterpart of a histogram for categorical data)
Pie Chart: shows relative frequencies; a Pareto chart is usually better.

Measures of Center

What is a typical value of a dataset?
Aug 2005 storms: 65, 105, 50, 175, 40

Mean
Sample: \( \bar{x} \)    Population: \( \mu \)
The mean is often used because it has good theoretical and practical behavior. \( \bar{x} \) is usually the best possible estimator of \( \mu \).
The mean can be unreliable if there are outliers (extremely unusual observations).
E.g., at Duke University, the mean starting salary of the 10 math majors who graduated in 1999 was over $100,000. Why? Trajan Langdon was a math major and one of the top picks in the NBA draft. His starting salary was over $1 million.
The other 9 math majors had salaries less than $100,000. So the mean was larger than all but one of the data points because of the outlier.
When there are extreme values, the median is a more stable measure of center.
Storm example: Aug 2004: 40, 45, 65, 70, 105, 120, 135, 145
Median: (70 + 105)/2 = 87.5
When data are symmetric and without outliers, the mean and median will be very similar.

Mode: the most common value
Storm 2005: the mode is 70, which appears three times among the 12 wind speeds.
If only one value occurs most often: unimodal
If two values are tied for most often: bimodal
If more than two: multimodal

Skewness: If the distribution of the data is the same on both sides of the mean, it is symmetric; otherwise it is skewed.
Example: income

Measures of Variation

Range = Maximum - Minimum = length of an interval containing all of the data
Example: August 2005 tropical storms: 40, 50, 65, 105, 175
Range = 175 - 40 = 135

How far from the mean is a typical data point?
For a population, the standard deviation is defined as
\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]
and the variance is
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]
Variance is mathematically convenient; the standard deviation is more interpretable. The standard deviation is a measure of the distance from the mean to a typical data point, and it is the most common measure of spread.

Ex: If we are interested in the wind speed of Aug 2005 storms, what is our population? (Interested in 2005 storms?)
For the population,
\[ \mu = \frac{\sum x}{N} = \frac{435}{5} = 87, \qquad \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{12130}{5}} \approx 49.25 \]

In most cases, we only have a sample, not a population. Then we use \( \bar{x} \) to estimate \( \mu \), and \( s \) to estimate \( \sigma \). The formula for the sample standard deviation is
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Empirical rule for data with a bell-shaped distribution:
About 68% of the data will be within 1 s.d. of the mean.
About 95% will be within 2 s.d. of the mean.

Would the following be considered bell shaped? Is it symmetric? Skewed?
Textbook, page 48, Figure 2-11 gives a good illustration of mean, median, mode and skewness.

Is there anything we can say for non-bell-shaped distributions, or for all distributions?

Chebyshev's Theorem
The proportion of any set of data lying within K s.d. of the mean is always at least \( 1 - 1/K^2 \), where K > 1.

Standardized Scores (z Scores)

How extreme is an observation?
How do we compare observations from different datasets?
We standardize the data by subtracting the mean and dividing by the s.d. (p. 69)
Sample: \( z = \frac{x - \bar{x}}{s} \)    Population: \( z = \frac{x - \mu}{\sigma} \)

Ex: Hurricane Katrina: z = (175 - 87)/49.25 = 1.79
Aug 2004's biggest storm was Karl at 145. The mean and s.d. for Aug 2004 were 90.6 and 38.28. So Karl's z score is (145 - 90.6)/38.28 ≈ 1.42.
Conclusion: Even after adjusting for the increased variability in 2005, Katrina stands out as more extreme.

SAT scores (rescaled to 200-800): the rescaling adjusts for the varying difficulty of the exams.
From the empirical rule, |z| > 2 means the observation is unusual.

Quartiles and Percentiles

The median separates the data into two equally sized groups: half of the observations are above the median, half are below. Equivalently, 50% are below, so the median is the 50th percentile.
The 99 percentiles divide the data into 100 groups. 1% of the observations are less than the 1st percentile, 2% are less than the 2nd percentile, and so on.

To find the kth percentile (p. 73):
1.
2.
3.
   a.
   b.

Storm examples:
Aug 2004: 40, 45, 65, 70, 105, 120, 135, 145
Aug 2005: 40, 50, 65, 105, 175

The 25th percentile is called the 1st quartile (Q1); the 75th percentile is the 3rd quartile (Q3). The median = Q2 = 50th percentile.

Boxplot
Min, Q1, median, Q3, Max
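
The storm calculations in these notes (mean, standard deviation, z scores, and the Chebyshev bound) can be checked with a short script. This is only a sketch: it assumes Python's standard `statistics` module and treats each month's storms as a whole population, which is what the 49.25 and 38.28 figures correspond to. The helper names `z_score` and `chebyshev_bound` are made up for illustration and do not come from the textbook.

```python
import statistics

# Wind speeds as listed in the notes.
aug_2004 = [40, 45, 65, 70, 105, 120, 135, 145]
aug_2005 = [40, 50, 65, 105, 175]


def z_score(value, data):
    """Standardize `value` against `data`, treated as a whole population."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population s.d. (divides by N)
    return (value - mu) / sigma


def chebyshev_bound(k):
    """Minimum proportion of any data set within k s.d. of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / (k * k)


katrina_z = z_score(175, aug_2005)  # biggest storm of Aug 2005
karl_z = z_score(145, aug_2004)     # biggest storm of Aug 2004

print(f"Aug 2005: mean={statistics.mean(aug_2005):.1f}, "
      f"median={statistics.median(aug_2005)}, "
      f"range={max(aug_2005) - min(aug_2005)}, "
      f"population s.d.={statistics.pstdev(aug_2005):.2f}, "
      f"sample s.d.={statistics.stdev(aug_2005):.2f}")
print(f"Katrina z = {katrina_z:.2f}, Karl z = {karl_z:.2f}")
print(f"Chebyshev, K=2: at least {chebyshev_bound(2):.0%} within 2 s.d.")
```

Swapping `statistics.pstdev` for `statistics.stdev` inside `z_score` would divide by n - 1 instead of N, which is the choice to make when the storms are regarded as a sample rather than the whole population.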