Descriptive Statistics (Gathering and organizing data)
Chapter 6

Experimentation

The celebrity news magazine Us Weekly conducted a poll in which it asked 100 people whether they were shocked by the breakup of actress Carmen Electra and musician Dave Navarro after nearly three years of marriage. The participants in the survey were individuals encountered on Fifth Avenue in New York City. 43% of those asked said they were shocked by the news, while the other 57% said they were not.

In chapters 2-5 we looked at probability distributions of random variables such as

$f(x) = \frac{1}{2} e^{-x/2}, \quad x \ge 0$

where X is the lifetime of a certain type of insect. More often than not, the true underlying probability distribution of a random variable is not known. An investigator learns as much as possible about the probability density function by conducting studies and collecting data.

Observational Study: Individuals are observed and measured on variables of interest. Researchers do not implement a 'treatment'.

A recent study by a web site that gives quotes on insurance rates found that astrological signs are a significant factor in predicting car accidents. The study looked at 100,000 North American drivers' records from the past six years. (Reuters)
– Libras were found to be the worst offenders for tickets and accidents. Leos were found to be the best overall.
– In the study results, Leos were described as "generous, and comfortable in sharing the roadway," whereas Aries "have a 'me first' childlike nature that drives [them] into trouble."

One criticism of observational studies is that the researchers may overlook important variables (confounding factors) that influence the observed outcomes.

At a recent conference on eating disorders, a research group presented data showing that individuals who undergo a "stomach stapling" procedure to control their weight tend to have obese children.
The researchers concluded that the parents' stomach stapling caused the obesity in their children. What other factors might explain this relationship?

Experiment: Researchers impose some treatment on subjects in order to observe their responses. An experiment can be used to determine whether the treatment caused the response.

Duct tape no magical cure for warts, study finds
November 6, 2006
WASHINGTON (Reuters) -- Duct tape does not work any better than doing nothing to cure warts in schoolchildren, Dutch researchers reported on Monday in a study that contradicts a popular theory about an easy way to get rid of the unattractive lumps. The study of 103 children aged 4 to 12 showed the duct tape worked only slightly better than using a corn pad, a sticky cushion that does not actually touch the wart and which was considered to be a placebo.

To ensure that the data collected appropriately address the issue in question, the population must be appropriately identified. A population consists of all possible observations available from a particular probability distribution. The identification of the population depends on the question.

Example: Consider the following three questions that might underlie the Us Weekly poll seen previously.
1. What percentage of U.S. adults were shocked by Carmen Electra's and Dave Navarro's breakup?
2. What percentage of New York residents were shocked by Carmen Electra's and Dave Navarro's breakup?
3. What percentage of Us Weekly readers were shocked by Carmen Electra's and Dave Navarro's breakup?
The question indicates the population.

A parameter is a quantity θ which is a property of interest in an unknown probability distribution. Parameters are usually unknown; one goal of statistics is to estimate them.

Example: An investigator wishes to know what proportion of voters in the US will vote 'Republican' in the next election.
Population – Voters in the US.
Parameter – Proportion who will vote 'Republican'.

Example: A toothpaste company would like to know what proportion of dentists recommend the brand it produces. Identify a population and parameter.
Population –
Parameter –

Typically, it is not feasible to collect data from every individual in a given population. Why not? How would you get the data you need? If it were possible to gather data from the entire population, the true underlying probability mass function or probability density function would be known.

Data are generally collected for a subset of the population, a sample. The science of deducing properties of an underlying probability distribution from a sample is known as statistical inference. A statistic is a random variable whose value can be calculated from a sample. The statistic can then be used to estimate an unknown parameter and thereby to draw conclusions about the overall population.

Sample Selection

When the sample is representative of the population, estimating parameters from the sample is justified, so how a sample is chosen is very important. For a sample to be representative of the population, every individual in the population must have a chance of being included in the sample. When a sample is poorly chosen, a statistic calculated from that sample is a poor estimate of the population parameter.

Example: Suppose the population for the "Breakup" survey is all adults in the U.S. The parameter of interest is presumably the percentage of U.S. adults who were shocked by Carmen Electra's and Dave Navarro's breakup. Is the sample used in the survey representative of the population? Is the statistic obtained from the Us Weekly survey a good estimate of the parameter?

'If you had it to do again, would you still have children?' In the 1970s Ann Landers submitted this question to her readers. She received over 10,000 responses.
70% of the respondents answered "no." What were the population and parameter of interest? Did Ann Landers's sample adequately represent the population? Are there any issues regarding the way the sample was chosen that need to be addressed?

Some common sources of bias in a sample are:
Selection bias: a systematic tendency on the part of the sampling procedure to exclude one kind of person or another from the sample.
Non-response bias: bias introduced by important differences between those who respond and those who do not.
How would you select a representative sample to carry out Ann Landers's survey?

To ensure that the sample is representative of the population, a random sample should be used, i.e. the elements in the sample should be chosen randomly from the population. The following methods could be used to randomly select individuals for the sample:
– Toss a coin
– Draw names from a hat
– Use computer-generated random numbers
When a sample is randomly chosen, human choice is not part of the selection process.

Example: The Statesman Survey (www.utahstatesman.com)
What is the population? Is the sample representative of the population? What sources of bias might influence this survey?

Discuss the following methods of sample selection.
1. Four students want to determine which Logan grocery store has the lowest prices. Each student makes a list of the items he buys most often, and the students compare prices on those items at the local stores.
2. Four students want to determine which Logan grocery store has the lowest milk prices. They record the prices of all varieties of milk at all grocery stores. The students then randomly select four types of milk to compare for all stores.
3. A teacher wishes to know what proportion of the students who take stat 3000 are female. She teaches one section of stat 3000 in the spring, so she chooses this section for her sample.
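The "computer-generated random numbers" option mentioned under random sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 1,000 numbered individuals, the sample size of 10, the seed, and the function name `simple_random_sample` are all hypothetical and do not come from the slides.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Return k distinct members of `population`, chosen without replacement.

    Every subset of size k is equally likely, so no human choice enters
    the selection. `seed` is optional and only makes the draw repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(list(population), k)


# Hypothetical sampling frame: ID numbers 1..1000 standing in for the
# individuals in the population of interest.
frame = range(1, 1001)

sample = simple_random_sample(frame, 10, seed=2006)
print(sample)
```

Random selection of this kind only guarantees a representative sample if the frame itself lists the whole population; it does nothing about non-response among the individuals selected.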
Data Presentation

Presenting data in a clear, organized way makes the information easier to understand. There are both graphical and numeric methods of summarizing data.
"If 'a picture is worth a thousand words,' then it is worth at least a million numbers." (from the text)

Poorly constructed graphics for data presentation can be worse than useless, since they sometimes obscure information or appear to exaggerate relationships.

Pie Charts – Categorical Data

A pie chart presents frequencies of categorical data visually. It emphasizes the proportion of the total data set that falls into each category. A category containing r of n total observations is represented by a slice of the pie with an angle of (r/n) × 360°.

Bar Charts – Categorical Data

A bar chart presents frequencies of categorical data visually. Each category has a corresponding bar whose height is proportional to the frequency associated with that category. Bar charts are useful for comparing categories. When might a bar chart be more useful than a pie chart?

Histograms – Numerical Data

Histograms look similar to bar charts, but are used to present numerical data rather than categorical data. The histogram is composed of bins whose area is proportional to the number of observations that take values within the bin. (Note: if all bins have equal width, then the height of each bin is proportional to the number of observations with values within the bin.) A probability mass function (or density function) can be thought of as a smoothed version of a histogram.

The bin width needs to be carefully chosen: if it is too wide, little of the structure of the underlying probability density function will be observable; if it is too narrow, the histogram could have many gaps and spikes.

Outliers

Outliers are data points that are separated from the rest of the data. Outliers can affect conclusions drawn from the data and should therefore be removed if possible.
– If outliers are the result of misrecorded or otherwise incorrect data collection, they can be removed.
– If outliers represent true variation in the population, they should not be discarded.

Data from a stat 1040 class: the outliers represent true variation in the population.

Sample Statistics

Numerical data summaries are called sample statistics. Some useful sample statistics are the sample mean, median, standard deviation, and variance.

Measures of Center: For a data set consisting of n observations, x_1, ..., x_n,
• Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Sample median: the "middle" data point. When n is even, the average of the two middle points.
• Sample trimmed mean: the mean of the observations after removing some of the largest and some of the smallest, say 5% of each.
• Sample mode: the category or data value which contains the largest number of observations.

In 1984, the University of Virginia reported that the mean starting salary of graduates from its department of Rhetoric and Communications was $55,000. One of these graduates was Ralph Sampson, who played basketball in the NBA. His salary did not reflect the earning potential of other graduates from his department. UVA did not publish the median salary.

Examples: The ages of 5 students are 21, 22, 24, 24, 26.
1. Find the sample mean.
2. Find the sample median.
3. Find the sample trimmed mean (remove 20% top and bottom).
4. Find the sample mode.

Measures of Spread: For a data set consisting of n observations, x_1, ..., x_n,
• The sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• The sample standard deviation: $s = \sqrt{s^2}$
• The pth sample quantile (percentile): a value such that a proportion p (that is, 100p%) of the sample values are smaller than it.
• Upper and lower sample quartiles: the 75th percentile and the 25th percentile, respectively.
If these quartiles fall between values, they can be found by taking weighted averages of the two neighboring data values as follows:
  – Upper sample quartile = (1/4 × lower value) + (3/4 × higher value)
  – Lower sample quartile = (3/4 × lower value) + (1/4 × higher value)
• Interquartile range (IQR): the difference between the upper and lower sample quartiles.

Examples: The ages of 5 students are 21, 22, 24, 24, 26.
1. Find the sample variance.
2. Find the sample standard deviation.
3. Find the sample IQR.

Box Plots

A boxplot provides a good visual summary of the distribution of the data set. It presents the sample median, the upper and lower sample quartiles, and the largest and smallest data observations.

Boxplots can be useful for comparing distributions. This example shows the distribution of height in my stat 1040 and stat 3000 classes from fall of 2006.
• Why might the distributions be so different?

Data Presentation

What graphical representation(s) would you use for each of the data sets?
1. Ethnicity of participants in an asthma study
2. Depths of Massachusetts ponds
3. Data collected for the Electra/Navarro survey
4. College GPAs of students who attended private high schools and the same for students who attended public high schools.
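The measures of center and spread defined in the Sample Statistics slides can be checked numerically for the five student ages used in the examples. The sketch below uses Python's standard `statistics` module. Two things in it are assumptions rather than part of the slides: the helper `trimmed_mean` is a hypothetical name written for this illustration, and the quartiles use the module's "exclusive" rule, which places the pth quantile at position (n + 1)p and interpolates between neighboring values. Other quartile conventions exist and can give different values for small samples.

```python
import statistics

ages = [21, 22, 24, 24, 26]


def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the observations from each end.

    Hypothetical helper: the number dropped per end is rounded down.
    """
    ordered = sorted(data)
    k = int(len(ordered) * proportion)        # observations cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Measures of center
mean = statistics.mean(ages)
median = statistics.median(ages)
trimmed = trimmed_mean(ages, 0.20)            # removes one value from each end
mode = statistics.mode(ages)

# Measures of spread (variance and stdev use the n - 1 divisor)
variance = statistics.variance(ages)
std_dev = statistics.stdev(ages)

# Quartiles under the (n + 1)p position rule (an assumed convention)
lower_q, _, upper_q = statistics.quantiles(ages, n=4, method="exclusive")
iqr = upper_q - lower_q

print("mean:", mean, "median:", median, "trimmed mean:", trimmed, "mode:", mode)
print("variance:", variance, "standard deviation:", std_dev)
print("lower quartile:", lower_q, "upper quartile:", upper_q, "IQR:", iqr)
```

With n = 5 the quartile positions fall exactly halfway between two data values, so the 1/4 and 3/4 weights quoted in the slides do not apply directly here; the weights depend on where the position lands for a given sample size.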