Department of Statistics
TEXAS A&M UNIVERSITY
STAT 211
Instructor: Webster West

Register for the online materials
• Go to http://dl.stat.tamu.edu/dostat/
• Click on Register here
• Specify account info and click Submit
• Click on log in
• Enter your account info and click on Log in!
• Click on Add course
• Enter the following information
  – Course reference:
  – Registration code:

Topic 1: Data collection and summarization
• Populations and samples - pages 5-8
• Frequency distributions - pages 14-17
• Histograms - pages 17-20
• Mean, median, variance and standard deviation - pages 24-28
• Quartiles, interquartile range - pages 29-31
• Boxplots - pages 31-33

What is Statistics?
• What do you think of when you hear the word “statistics”?
• Statistics: The science of collecting, classifying, and interpreting data.
• Anticipated learning outcomes:
  – appreciate and apply basic statistical methods in an everyday life setting
  – appreciate and apply basic statistical methods in their scientific field

Collecting data
• Observational study: Observe a group and measure quantities of interest. This is passive data collection in that one does not attempt to influence the group. The purpose of the study is to describe the group.
• Experiment: Deliberately impose treatments on groups in order to observe responses. The purpose is to study whether the treatments cause a change in the responses.

Observational Study Terms
• Population: The entire group of interest
• Sample: A part of the population selected to draw conclusions about the entire population
• Census: A sample that attempts to include the entire population
• Parameter: A fixed unknown number that describes the population
• Statistic: A number produced from a sample that estimates a population parameter

Horry County SC Murder Case
• Do juries properly represent the racial makeup of Horry County, which is 13% African American?
• 295 jurors summoned, 22 were African American
• What is the population parameter of interest?
• What sample statistic could be used to estimate the parameter?

Experiment Terms
• Experimental Group: A collection of experimental units subjected to a real treatment.
• Control Group: A collection of experimental units subjected to the same conditions as those in an experimental group except that no treatment is imposed.
• This design helps control for potential confounding effects.

NCTR study
• A large-scale study was conducted to see if a new drug might have potential toxic effects.
• Dose groups of 0, 100, 200, and 400 ppg were evaluated for liver tumors at the end of a two-week exposure to the drug.
• What comparisons would you want to make?
• Should you evaluate each group on consecutive days at the end of the study?

Analyzing data with StatCrunch
• StatCrunch is a statistical software package that runs through a Web browser like Internet Explorer.
• You can access StatCrunch for free via DoStat.
• If you are not on the TAMU system, you will need to enter the passcode.
• When you access the StatCrunch site, the window below will appear. Click on the Run button.

All about variables
• Variable: Any characteristic or quantity to be measured on units in a study
• Categorical variable: Places a unit into one of several categories
  – Examples: Gender, race, political party
• Quantitative variable: Takes on numerical values for which arithmetic makes sense
  – Examples: SAT score, number of siblings, cost of textbooks
• Univariate data has one variable.
• Bivariate data has two variables.
• Multivariate data has three or more variables.

Cereal data
What types of variables do we have in this data set?
mfr       A = American Home; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
type      cold or hot
calories  calories per serving
protein   grams of protein
fat       grams of fat
sodium    milligrams of sodium
fiber     grams of dietary fiber
carbo     grams of complex carbohydrates
sugars    grams of sugars
potass    milligrams of potassium
vitamins  vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
shelf     display shelf (1, 2, or 3, counting from the floor)
weight    weight in ounces of one serving
cups      number of cups in one serving
rating    a rating of the cereal

Summarizing a single categorical variable
• Frequency - number of times the value occurs in the data
• Relative frequency - proportion of the data with the value
• Cereal Data

mfr   Frequency   Relative Frequency
A     1           0.012987013
G     22          0.2857143
K     23          0.2987013
N     6           0.077922076
P     9           0.116883114
Q     8           0.103896104
R     8           0.103896104

Analyzing a single quantitative variable
• Consider the concentration data, which contains the concentration of suspended solids in parts per million at 50 locations along a river.
• What is a typical concentration along the river?
• How much spread is there in the concentrations along the river?
• Typical is generally characterized by the center of the data
• Spread is generally reported as an interval containing most of the data

Histograms
• Histogram - bar graph of binned data where the height of the bar above each bin denotes the frequency (relative frequency) of values in the bin
• Typical concentration?
• Spread?
• Roughly how many concentrations below 50?
• StatCrunch

Choosing the number of histogram bins
• General rule: # of bins ≈ √(# of observations)
• Choosing the number of bins for a histogram can be tricky! Consider the Old Faithful data.

Describing the shape of quantitative data
• Symmetric data has roughly the same mirror image on each side of a center value.
• Skewed data has one side (either right or left) that is much longer than the other relative to the mode (peak value).
• The above definitions are most useful when describing data with a single mode.
• Multimodal data has more than one mode.
• Beware of outliers when describing shape.
• Shape of the concentration data?

States data from 1996
• Define the shape of each variable.

POVERTY  percentage of the state population living in poverty
CRIME    violent crime rate per 100,000 population
COLLEGE  percentage of the state's population who are enrolled in college
METRO    percentage of the state population living in a metropolitan area
INCOME   median household income in 1996 dollars

• Where does TX fall for each variable?

Stem and leaf plots
• Separate each value into a stem (all but the rightmost digit) and a leaf (the rightmost digit)
• Write unique sorted stems in a vertical column
• Add each leaf to the right of its stem in increasing order
• StatCrunch

Variable: concentration
2 : 7
3 : 024
3 : 6779
4 : 002
4 : 56778
5 : 03
5 : 66689
6 : 1111222
6 : 55566899
7 : 012
7 : 55679
8 : 3
8 : 7
9 : 1
9 : 5

Histograms vs. Stem and leaf plots
• Stem and leaf plots (typically) display actual data values whereas histograms do not
• Stem and leaf plots are more useful for small data sets (less than 100 values)
• Histograms can be constructed for larger data sets

Summary statistics for quantitative data
• Measures of center (typical)
  – The sample median is the middle observation if the values are arranged in increasing order.
  – The sample mean of n observations is the average, the sum of the values divided by n.
$X_1, \ldots, X_n$ represents n data values

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Summary statistics for quantitative data
• pth percentile - the value such that p×100% of the values are below it and (1-p)×100% are above it
  – first quartile (Q1) is the 25th percentile
  – second quartile (Q2) is the 50th percentile (median)
  – third quartile (Q3) is the 75th percentile
• 5-number summary: Min, Q1, Q2, Q3, Max
  – Boxplots: Stacking boxplots can be very useful for comparing multiple groups

Summary statistics for quantitative data
• Measures of spread:
  – Interquartile range, IQR = Q3 - Q1, the range of the middle 50% of the data
  – sample variance, $s^2$, is the sum of squared deviations from the sample mean divided by n-1

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

  – sample standard deviation, s, is the square root of the sample variance. Preferred because it has the same units as the data.

Cereal data
• Compare rating across shelf.

Comparing measures of center and spread
• The sample mean and the sample standard deviation are good measures of center and spread, respectively, for symmetric data
• If the data set is skewed or has outliers, the sample median and the interquartile range are more commonly used
• Mean versus median

Case Study: Salary data
• A fictitious large university decides to study the salaries of its graduates. A survey was conducted of 2232 recent graduates from engineering and education majors.
• The salary data consists of three variables:
  – Gender: Male or Female
  – Major: Education or Engineering
  – Salary: Reported in $
• What types of variables do we have?

Salary data by major
• Are both majors equally represented in the survey?
• Do salaries differ across major?

Salary data by gender
• Are both genders equally represented in the survey?
• Do salaries differ across gender? Discrimination?

Salary data by gender within each major
• How do male and female salaries compare in engineering?
• How do male and female salaries compare in education?
• What’s going on?
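
One way to explore the salary questions above is to compute the sample mean by gender overall and then again within each major. The slides do not include the survey values, so the records below are made up purely for illustration; the function name mean_salary and all of the numbers are assumptions, not the case-study data. The sketch shows one pattern that can produce this kind of puzzle: when the genders are unevenly spread across majors, the overall comparison can point the opposite way from the comparison within each major.

```python
from statistics import mean

# Hypothetical records (gender, major, salary in $). NOT the survey data.
records = (
    [("Male", "Engineering", 60000)] * 8
    + [("Female", "Engineering", 62000)] * 2
    + [("Male", "Education", 40000)] * 2
    + [("Female", "Education", 42000)] * 8
)


def mean_salary(rows, gender, major=None):
    """Sample mean salary for one gender, optionally within one major."""
    salaries = [
        salary
        for g, m, salary in rows
        if g == gender and (major is None or m == major)
    ]
    return mean(salaries)


# Comparison that ignores major.
overall = {g: mean_salary(records, g) for g in ("Male", "Female")}

# Comparison within each major.
by_major = {
    (m, g): mean_salary(records, g, m)
    for m in ("Engineering", "Education")
    for g in ("Male", "Female")
}

print("Overall:", overall)
for key, value in by_major.items():
    print(key, value)
```

With these made-up records the overall male mean (56000) is higher than the overall female mean (46000), even though the female mean is higher within each major. Whether the real survey behaves this way has to be checked against the actual data.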
Let’s Make a Deal
• This is motivation to study probability.
• Should you switch or should you stay with your original choice?
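
The switch-or-stay question can be explored by simulation before any probability theory is introduced. The slide does not spell out the rules, so this sketch assumes the standard three-door version of the game: one door hides the prize, the player picks a door, the host then opens a different door that does not hide the prize, and the player may switch to the remaining closed door. The function names play and win_rate are illustrative.

```python
import random


def play(switch, rng):
    """Play one round of the assumed three-door game; True if the player wins."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is still closed and was not picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=0):
    """Fraction of simulated rounds won under a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


stay = win_rate(switch=False)
swap = win_rate(switch=True)
print(f"stay: {stay:.3f}  switch: {swap:.3f}")
```

Under these assumed rules, staying wins about one third of the time and switching wins about two thirds of the time.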