Chapter 1: Looking at Data - Distributions
http://anengineersaspect.blogspot.com/2013_05_01_archive.html

What is Statistics?
http://vadlo.com/cartoons.php?id=71

What is Statistics
• Statistics is the science of learning from data.
• Components
  – Collection
  – Organization
  – Analysis
  – Interpretation

Applications of Statistics
• Computer Science
  – client-server performance
  – image processing
• Chemistry/Physics
  – determining outliers in your data
  – linear regression
  – propagation of error
  – dealing with large populations and approximations
• Engineering
  – is one process/technique better than another one?
• Business
  – making good decisions
• Everyday life
  – medical information
  – average cell phone usage of Purdue students

Branches of Statistics
• Collection of data
• Descriptive Statistics
• Inferential Statistics

1.1: Data: Goals
• Give examples of cases in a data set.
• Identify the variables in a data set.
• Demonstrate how a label can be used as a variable in a data set.
• Identify the values of a variable.
• Classify variables as categorical or quantitative.
Information on Histograms: Slides: 15 – 19, Book: pp. 15 – 20.

Basic Definitions
• Cases – objects that are described by the data
• Label – special variable used to separate the cases
• Variable – characteristic of a case

Types of Variables
• Number
  – univariate
  – bivariate
  – multivariate
• Type
  – Categorical
  – Quantitative
• Distribution of a variable
  – The values the variable can take and how often it takes each of these values

To better understand a data set, ask:
• Who?
  – What cases do the data describe?
  – How many cases?
• What?
  – How many variables?
  – What is the exact definition of each variable?
  – What is the unit of measurement for each variable?
• Why?
  – What is the purpose of the data?
  – What questions are being asked?
  – Are the variables suitable?

1.2: Displaying Distributions with Graphs: Goals
• Analyze the distribution of a categorical variable:
  – Bar Graphs
  – Pie Charts
• Analyze the distribution of a quantitative variable:
  – Histogram
  – Time plots
  – Identify the shape, center, and spread
  – Identify and describe any outliers

Categorical Variables - Display
The distribution of a categorical variable lists the categories and gives the count or percent or frequency of individuals who fall into each category.
• Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.
• Bar graphs represent categories as bars whose heights show the category counts or percents.

Categorical Variables – Display (STAT 311)
[Figure: pie chart and two bar graphs (Percent vs. Grade) of the grade distribution: A 20%, B 10%, C 40%, D 25%, F 5%. One bar graph orders the grades A, B, C, D, F; the other orders them C, D, A, B, F.]

Quantitative Variable: Histograms
Histograms show the distribution of a quantitative variable by using bars. The height of a bar represents the number of individuals whose values fall within the corresponding class.
Procedure - discrete
1. Calculate the frequency and/or relative frequency of each x value.
2. Mark the possible x values on the x-axis.
3. Above each value, draw a rectangle whose height is the frequency (or relative frequency) of that value.

Histogram - Discrete
100 married couples between 30 and 40 years of age are studied to see how many children each couple has. The table below is the frequency table of this data set.

Kids     # of Couples    Rel. Freq
0        11              0.11
1        22              0.22
2        24              0.24
3        30              0.30
4        11              0.11
5        1               0.01
6        0               0.00
7        1               0.01
Total    100             1.00

Quantitative Variable: Histograms continuous
Procedure - continuous
1. Divide the x-axis into a number of class intervals or classes such that each observation falls into exactly one interval.
2. Calculate the frequency or relative frequency for each interval.
3. Above each interval, draw a rectangle whose height is the frequency (or relative frequency) of that interval.
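
The two histogram procedures above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names `relative_frequencies` and `bin_counts` are made up here, the couples data is rebuilt from the frequency table, and the four values passed to `bin_counts` are hypothetical.

```python
from collections import Counter

def relative_frequencies(values):
    """Discrete procedure, step 1: frequency and relative frequency of each x value."""
    counts = Counter(values)
    n = len(values)
    return {x: (counts[x], counts[x] / n) for x in sorted(counts)}

def bin_counts(values, start, width, n_bins):
    """Continuous procedure, steps 1-2: count the observations that fall in the
    half-open class intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:  # values outside the covered range are ignored
            counts[k] += 1
    return counts

# Rebuild the 100 observations from the couples table (kids -> number of couples).
couples = {0: 11, 1: 22, 2: 24, 3: 30, 4: 11, 5: 1, 6: 0, 7: 1}
kids = [k for k, c in couples.items() for _ in range(c)]

# Steps 2-3 as a text histogram: one row per x value, bar length = frequency.
table = relative_frequencies(kids)
for x, (freq, rel) in table.items():
    print(f"{x} | {'#' * freq} ({rel:.2f})")

# Hypothetical continuous values, binned into 4 classes of width 1 starting at 2.
example_bins = bin_counts([2.1, 2.9, 3.4, 5.0], start=2, width=1, n_bins=4)
```

A value with frequency 0 (here, 6 children) does not appear in `table`, because `Counter` only records values that were observed. The choice of `width` in `bin_counts` changes the number of classes and therefore the apparent shape of the histogram.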

Visual Display: Continuous Histogram
Power companies need information about customer usage to obtain accurate forecasts of demand. Investigators from Wisconsin Power and Light determined the energy consumption (BTUs) during a particular period for a sample of 90 gas-heated homes. An adjusted consumption value was calculated via
\[ \text{adjusted consumption} = \frac{\text{consumption}}{(\text{weather, degree days})(\text{house area})} \]
The data is listed under furnace.txt under extra files on the computer web page.

Example (cont)
[Figure: histograms of the furnace data for different bin widths: Bin = 0.25 (63 classes), Bin = 0.5 (32 classes), Bin = 1 (17 classes), Bin = 3 (7 classes), Bin = 5 (4 classes).]

Examining Distributions
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
• You can describe the overall pattern by its shape, center, and spread.
• An important kind of deviation is an outlier, an individual that falls outside the overall pattern.

Shapes of Histograms - Number
[Figure: symmetric unimodal, bimodal, and multimodal histograms.]
http://www.particleandfibretoxicology.com/content/6/1/6/figure/F1?highres=y

Shapes of Histograms (cont)
[Figure: symmetric, positively skewed, and negatively skewed histograms.]

Outliers
http://ewencp.org/blog/url-reshorteners/

Time Plots
A time plot shows behavior over time.
• Time is always on the x-axis; the other variable is on the y-axis.
• Look for a trend and deviations from the trend. Connecting the data points by lines may emphasize this trend.
• Look for patterns that repeat at known regular intervals.

Example: Time Plots
We are interested in the temperature (°F) of effluent at a sewage treatment plant.
47 54 53 50 46 46 47 50 51 50 51 50 46 52 50 50
a) Plot a histogram of the data.
b) Plot a time plot of the data.
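
A rough way to try parts a) and b) without a plotting package is to draw both displays as text. The sketch below is an illustration only; it assumes the 16 temperatures are listed in the order in which they were measured, which the example does not state explicitly.

```python
from collections import Counter

# Effluent temperatures (°F) from the example.
# Assumption: the values are listed in time order.
temps = [47, 54, 53, 50, 46, 46, 47, 50, 51, 50, 51, 50, 46, 52, 50, 50]

# a) Histogram: one class per whole degree, bar length = frequency.
freq = Counter(temps)
for t in range(min(temps), max(temps) + 1):
    print(f"{t} | {'#' * freq[t]}")

print()

# b) Time plot: observation order runs left to right (x-axis),
#    temperature runs bottom to top (y-axis).
for level in range(max(temps), min(temps) - 1, -1):
    row = "".join("*" if t == level else " " for t in temps)
    print(f"{level} | {row}")
```

The histogram discards the order of the observations, while the time plot keeps it, so only the second display can reveal a trend or a repeating pattern.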

1.3: Describing Distributions with Numbers: Goals
• Describe the center of a distribution by:
  – mean
  – median
• Compare the mean and median
• Describe the measure of spread:
  – quartiles
  – standard deviation
• Describe a distribution by a boxplot (five-number summary and outliers)
• Be able to determine which summary statistics are appropriate for a given situation
• Be able to determine the effects of a linear transformation on the above summary statistics.

Sample Mean
\[ \bar{x} = \frac{\text{sum of observations}}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Sample Mean: Example
The following data give the time in months from hire to promotion to manager for a random sample of 20 software engineers from all software engineers employed by a large telecommunications firm.
5 7 12 14 18 14 14 22 21 25 23 24 34 37 34 49 64 47 67 69
a) What is the mean time for this sample?
b) Suppose that instead of x20 = 69, we had chosen another engineer that took 483 months to be promoted. What is the mean time for this new sample?

Sample Median, M or x̃
Procedure
1. Sort the n observations from smallest to largest.
2. If n is odd, x̃ is the center observation. If n is even, x̃ is the average of the two center observations.

Sample Median: Example
The following data give the time in months from hire to promotion to manager for a random sample of 20 software engineers from all software engineers employed by a large telecommunications firm.
5 7 12 14 14 14 18 21 22 23 24 25 34 34 37 47 49 64 67 69
a) What is the median time for this sample?
b) Suppose that instead of x20 = 69, we had chosen another engineer that took 483 months to be promoted. What is the median time for this new sample?

Mean and Median
[Figure: relative positions of the mean and median for different shapes, including left skew and right skew.]

Variability of Data
[Figure: the three data sets below plotted on an axis from -20 to 20.]

Set 1    Set 2    Set 3
-15      -15      -3
-10      -5       -2
-5       -1       -1
0        0        0
5        1        1
10       5        2
15       15       3

Quartiles
[Figure: the quartiles Q1, Q2, and Q3.]

Quartiles Procedure
1. Sort the values from lowest to highest and locate the median.
2. The first quartile, Q1, is the median of the lower half.
3. The third quartile, Q3, is the median of the upper half.

Quartiles: Example
The following data give the time in months from hire to promotion to manager for a random sample of 19 software engineers from all software engineers employed by a large telecommunications firm.
24 7 12 14 14 14 18 21 22 23 25 34 34 37 47 49 64 100 150
a) Find the median and the quartiles.
b) What is the Interquartile Range?
c) Are there any outliers in this data set?

Boxplots
Procedure
1. Draw and label a number line that includes the range of the distribution.
2. Draw a central box from Q1 to Q3.
3. Draw a line for the median.
4. Extend lines (whiskers) from the box to the minimum and maximum values that are not outliers.
5. Put in dots (* or some symbol) for the outliers.

Boxplot: Example
[Figure: "Boxplot of Promotion"; vertical axis: Promotion, from 0 to 160.]

Side-by-side Boxplot: Example
[Figure: side-by-side boxplots.]

Sample Standard Deviation
\[ \text{variance} = s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
\[ \text{standard deviation} = s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Properties of Standard Deviation
• s measures spread about the mean, so only use this measure when you are using the mean to measure the center.
• s = 0 means that all of the observations are the same; normally s > 0.
• s is not resistant to outliers.
• s has the same units of measurement as the original observations.

Sample Standard Deviation: Example
The following data give the time in months from hire to promotion to manager for a random sample of 20 software engineers from all software engineers employed by a large telecommunications firm. What is the standard deviation of the time for this sample?
5 7 12 14 18 14 14 22 21 25 23 24 34 37 34 49 64 47 67 69

Choosing Measures of Center and Spread
Choices
1. Mean and standard deviation
2. Median and IQR
ALWAYS PLOT YOUR DATA!
http://freshspectrum.com/wp-content/uploads/2012/09/Hans-Rosling-Bubble-Plot-Cartoon.jpg

Change of Measurement
• Linear transformation: xnew = a + bx
• Effects
  1. No change to shape
  2. Adding a: adds a to measures of center; doesn’t affect measures of spread
  3. Multiplying by b: multiplies measures of center by b and measures of spread (s, IQR) by |b|.

1.4: Density Curves and Normal Distributions: Goals
• Be able to state the definition and practical importance of a density curve.
• State the physical meaning of the measurements of center and spread for density distributions.
• Normal distributions
  – Be able to sketch the normal distribution.
  – Be able to state the importance of the 68 – 95 – 99.7 rule
  – Be able to standardize a value
  – Be able to use the Z-table
  – Be able to calculate percentages
  – Be able to calculate percentiles (inverse calculations)
  – Be able to determine if a distribution is normal (normal quantile plots)

Exploring Quantitative Data
1. Always plot your data.
2. Look for the overall pattern.
3. Calculate a numeric summary.
4. Sometimes, the overall pattern is regular so that we can describe it by a specific methodology.

Density Curve
[Figure: panels (a), (b), (c).]

Properties of Density Curve
[Figure: a density curve y = f(x).]
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \]
\[ \text{proportion between } a \text{ and } b = \int_{a}^{b} f(x)\,dx \]

Density Curves – Median and Mean
• The median of a density curve is the equal-areas point.
\[ p = 0.5 = \int_{-\infty}^{\text{median}} f(x)\,dx \]
• The mean of a density curve is the balance point.
• If the distribution is symmetric, the median and mean are the same and are the center of the curve.

Mean
http://isc.temple.edu/economics/notes/descprob/descprob.htm

Sample vs. Population
• Terms for samples (actual observations)
  – Mean: x̄, median: x̃, standard deviation: s
• Terms for populations (density curves)
  – Mean: μ, median: μ̃, standard deviation: σ

Normal Distribution
[Cartoon: a visual comparison of normal and paranormal distributions.]
http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon

Normal Distribution
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
where -∞ < μ < ∞, σ > 0
X ~ N(μ, σ)

Shapes of Normal Density Curve
http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref/process_simulations_sensitivity_analysis_and_error_analysis_modeling/distributions_for_assigning_random_values.htm

68-95-99.7 Rule
Empirical Rule

Standard Normal or z curve
\[ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}} \]

Cumulative z curve area
[Figure: area under the z curve to the left of z.]

Z-table
[Figure: the Z table.]

Using the Z table
• area right of z = 1 – area left of z
• area between z1 and z2 (with z1 < z2) = area left of z2 – area left of z1

Procedure for Normal Distribution Problems
1. Sketch the situation and shade the area to be found.
2. Standardize X to state the problem in terms of Z.
3. Use Table A to find the area to the left of z.
4. Calculate the final answer.
5. Write your conclusion in the context of the problem.

Normal Distribution: Example
A particular rash has shown up in an elementary school. It has been determined that the length of time that the rash will last is normally distributed with mean 6 days and standard deviation 1.5 days.
a) What percentage of students have the rash for longer than 8 days?
b) What percentage of students have the rash for between 3.7 and 8 days?

Percentiles
[Figure: percentiles of a normal distribution.]

Normal Distribution: Example
A particular rash has shown up in an elementary school. It has been determined that the length of time that the rash will last is normally distributed with mean 6 days and standard deviation 1.5 days.
c) How long would a student’s rash have to last to be in the top 10% of rash durations?

Symmetrically Located Areas
[Figure: symmetrically located areas under a normal curve.]

Normal Distribution: Example
A particular rash has shown up in an elementary school. It has been determined that the length of time that the rash will last is normally distributed with mean 6 days and standard deviation 1.5 days.
d) What interval, symmetrically placed about the mean, will capture 95% of the students’ rash durations?

Procedure: Normal Quantile Plot
1) Arrange the data from smallest to largest.
2) Record the corresponding percentiles (quantiles).
3) Find the z value corresponding to the quantile calculated in part 2.
4) Plot the original data points (from 1) vs. the z values (from 3).
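
The rash example (parts a to d) and the normal quantile plot procedure can be checked numerically. The sketch below is an illustration, not part of the original slides. It uses Python's standard-library `statistics.NormalDist` in place of Table A. In step 2 of the quantile plot it assumes the percentile (i − 0.5)/n for the i-th sorted value, which is one common convention; the slides do not say which convention is intended. The function name `normal_quantile_points` is made up here, and the sample is the promotion-time data from the earlier examples.

```python
from statistics import NormalDist

# Rash duration: X ~ N(mu = 6 days, sigma = 1.5 days), as stated in the example.
rash = NormalDist(mu=6, sigma=1.5)
z = NormalDist()  # standard normal; plays the role of the Z table

# a) P(X > 8): standardize, then use 1 - (area left of z).
z8 = (8 - rash.mean) / rash.stdev
p_longer_than_8 = 1 - z.cdf(z8)

# b) P(3.7 < X < 8): difference of two left-tail areas.
p_between = rash.cdf(8) - rash.cdf(3.7)

# c) Top 10% of durations: the 90th percentile (inverse calculation).
top_10_cutoff = rash.inv_cdf(0.90)

# d) Central 95%: cut off 2.5% in each tail.
central_95 = (rash.inv_cdf(0.025), rash.inv_cdf(0.975))

def normal_quantile_points(data):
    """Return (z value, observation) pairs for a normal quantile plot.

    Assumption: the i-th smallest of n values is assigned percentile (i - 0.5) / n.
    """
    xs = sorted(data)                                   # step 1
    n = len(xs)
    percentiles = [(i - 0.5) / n for i in range(1, n + 1)]  # step 2
    z_values = [z.inv_cdf(p) for p in percentiles]      # step 3
    return list(zip(z_values, xs))                      # step 4: points to plot

promotion = [5, 7, 12, 14, 18, 14, 14, 22, 21, 25,
             23, 24, 34, 37, 34, 49, 64, 47, 67, 69]
points = normal_quantile_points(promotion)

print(f"a) {p_longer_than_8:.4f}  b) {p_between:.4f}")
print(f"c) {top_10_cutoff:.2f} days  d) {central_95[0]:.2f} to {central_95[1]:.2f} days")
```

If the points returned by `normal_quantile_points` fall close to a straight line when plotted, the data are roughly normal; systematic curvature suggests skewness.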