* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download The Normal Distribution
Survey
Document related concepts
Transcript
Statistics for Algebra II Statistics is the practice or science of collecting and analyzing numerical data in large quantities, esp. for the purpose of inferring proportions in a whole from those in a representative sample. Designing a study Steps for designing a study: 1. Identify the variable (or variables) of interest and the population of the study. A variable is any characteristic that is recorded for subjects in a study. (Gender, major, age, and GPA might be variables for a study about college students.) The population is the collection of all of the things the researcher wants to describe or make decisions about in the study. (Students in the US, seniors at SAHS, registered vehicles in NC, likely voters in the 2016 presidential election, gears produced by a particular manufacturer, toys sold on Cyber Monday on Amazon, or the animals at a particular shelter during 2013 are all possible populations.) 2. Develop a detailed plan for collecting the data. When the data will be collected from the entire population, the study is called a census and the data collected are called parameters. When using a sample, it is important to make sure that the sample is representative of the population. Usually, researchers will employ sampling techniques which involve probability to randomize the sample selection process .The data collected from a sample are called statistics. 3. Collect the data. 4. Describe the data using descriptive statistics. 5. Interpret the data and use inferential statistics to make decisions (or assumptions) about the population. 6. Identify any possible errors (some potential problems can be identified earlier in the process). Conducting a Study We will focus on four methods for collecting data: observational studies, surveys, experiments, and simulations. In an observational study, a researcher measures and observes the variables of interest without changing existing conditions. Observational studies can point to a correlation between two variables of interest, but cannot be used to infer causation. A survey is used to investigate characteristics of a population. It is frequently used when the subjects are people, and questions are asked of them. When designing a survey, you must be very careful of wording (and sometimes ordering) the questions so that the results are not biased. The results are also subject to bias introduced by non-response. (In this case, subjects may refuse to answer some or all questions or they may not give truthful responses.) Correlation does not imply causation. In an experiment, randomization is used to assign members of the study group to treatment groups. A researcher then randomly assigns a treatment to each group and observes the response. Often, one group will be assigned as a control group (a group receiving no treatment or a placebo) to be used to compare the effectiveness of a treatment. A well designed experiment maybe used to infer causation. A simulation uses a mathematical, physical, or computer model to replicate the conditions of a process or situation. This is frequently used when the actual situation is too expensive, dangerous, or impractical to replicate in real life. Examples Identify which method for collecting data (observational study, an experiment, a simulation, or a survey) is best in each of the following situations and explain your answer. 1. The effect a severe earthquake would have on the Salt Lake Valley. Simulation, we cannot control when that area will have an earthquake. 2. Whether or not a certain coupon attached to the outside of a catalog makes recipients more likely to order products from a mail-order company. Experiment, since we are comparing two scenarios and we can control them. 3. Whether or not smoking has an effect on coronary heart disease. Observational study, since we will not be changing a person's behavior. (There are ethical and health concerns for deciding whether or not someone smokes.) 4. Determining the average household income of homes in Burlington, NC. Survey, since it can be answered with a brief question. Problems and Methods to Deal with Them in a Study A confounding variable occurs when a researcher cannot tell the difference between the effects of different factors on a variable. The placebo effect occurs when a subject (or an “experimental unit") reacts favorably to a placebo when no medicated treatment has been given. Blinding is a technique used to make the subjects “blind" to which treatment (or placebo) they are being given. A double-blind experiment is one in which neither the experimenter nor the subjects know which treatment is being given. Randomization is a process of randomly assigning subjects to treatment groups. There are several different techniques for randomization: A completely randomized design assigns subjects to different treatment groups through random assignment. A randomized block design is sometimes used to make sure that subjects with certain characteristics are assigned to each treatment. For example, when testing a certain medication, you might first want to split subjects in groups according to either gender or age (or both), then randomly assign each of these groups to the different treatments. A matched pairs design pairs up subjects according to similarities. One subject in the pair receives one treatment, while the other receives a different treatment. Sample size is the number of participants in the experiment. The larger the sample, the more representative of the population the results will be, but the costs of the experiment will be higher. Replication is the ability to reproduce the experiment (and results) under similar conditions. Examples 1. For the following experiment, determine the experimental units, treatments, and sample size. Indicate whether this experiment is blind, double-blind, and/or randomized. Also identify any potential problems with the design. A study with 233 low-income adult smokers evaluated the effectiveness of usual care (physician advice and follow-up) for smokers wishing to quit to the usual care enhanced by computer-assisted telephone counseling sessions. Each subject was assigned randomly either to the usual care or to the usual care plus counseling, and their smoking status (still smoking or quit smoking) was observed after 3 months. The percentage who had quit smoking was higher for the group receiving counseling. (from Journal of Family Practice 2000;50:138-144) The experimental units are the 233 adult smokers. The treatments are “usual care" and “usual care plus counseling." The sample size is 233. The experiment is not blind or double-blind (experimental units will know if they receive counseling or not), but it is randomized since the subjects were assigned randomly. 2. How would you design a placebo-controlled double-blind experiment with a randomized block design for the following situation: A veterinarian wants to test a strain of antibiotic on calves to determine their resistance to a common infection, and if their gender plays a role. In a pasture, there are 22 newborn calves (11 males and 11 females). There is enough antibiotic for 10 calves, but blood tests to determine their resistance to infection can be done on all calves. For the randomized block design, we can choose 5 males and 5 females randomly to receive the antibiotic, and the remaining 6 males and 6 females will receive a “placebo" (in this case, the “placebo" could be no treatment since the calves won't be telling anyone if they received a shot or not). In order to make the experiment double-blind, we need to make the calves unaware of what treatment they are receiving (not difficult), and for the person carrying out the blood tests to be unaware of the treatment (to make this possible, another person will assign which of the calves receive the antibiotic and which do not and will administer the antibiotic). Sampling Techniques Ideally, we would take a census, that is, use every member of a population as a subject since the descriptive statistics would be sufficient. However, this is often too costly and difficult. Instead, we sample part of the population. With sampling, we need to make sure that the sample is representative of the population and large enough to be meaningful. Definitions and Terminology A sampling error is the difference between the results of the sample and those of the population. Even with the best sampling techniques, this is possible. A biased sample is one that is not representative of the entire population. We want to avoid bias. A random sample is one in which every member of the population has an equal chance of being chosen. A simple random sample (SRS) is a sample in which every possible sample of the same size has the same chance of being collected. Normally, we will start by using a simple random sample. A stratified sample is used when it is important to have members from multiple segments of the population. First, the population is split into segments (called “strata"), then a predetermined number of subjects is chosen from each of the strata. Cluster sampling can be used when the population naturally falls into subgroups with similar characteristics. First, determine the clusters, then select all the members of one or more of the clusters. Systematic sampling first involves assigning a number to each member of the population and ordering them in some way. Sample members are selected by choosing the first member randomly, then selecting subsequent members at regular intervals after the starting number (for example, every 7th person). This method is fairly simple to use, but should be avoided if there are regularly occurring patterns in the data. A convenience sample consists only of available members of the population, but this often leads to biased studies. A volunteer sample is a kind of convenience sample in which only volunteers participate. A multi-stage sample is selected by applying two or more sampling techniques successively to determine the sample. To choose subjects for a SRS, first determine the size of the population and number everyone on the list. Then use a random process such as a table of random numbers or a random number generator to select your sample. (To use a table of random numbers, the size of your population tells you how many digits to read at once. For example, if there are 132 members, you will need 3 digits, but with 32, you would only need 2 digits. Note that with 100 members, you will only need 2 digits if you number the members of your population from 00-99. To select the first subject, read the first “few" digits of the table (“few" = the number of digits that you determined you needed), then find that number in your numbered list of the population. If the number you selected is larger than any number on your list, ignore that number and move onto the next “few" digits and try again. Continue until you have as many subjects as you need. Note that you need to make sure that subjects are not repeated. Examples Determine which kind of sample was used in each of the following scenarios: (a) To determine the quality of on-campus housing, 20 residents from each dorm were chosen to complete a survey. Stratified sample (b) To evaluate employee compensation, choose a random sample of 10 zip codes in the state, then survey all businesses within each chosen zip code about their benefits package. Cluster sample (c) Those who participate to a survey linked to from cnn.com. Volunteer sample (d) To determine the quality of education at the University of North Carolina, a PID number is chosen at random, then every 600th student is evaluated until 30 students are selected. Systematic sample (e) Interested in only one neighborhood, you walk door-to-door to ask residents questions. Everyone was home and willing to participate, so you have survey results from every household in the neighborhood. Census (f) Chosen at random, 300 people who received care at the University Hospital participated in a survey. Simple random sampling Measures of Central Tendency The mean is the number found by adding all of the values in the data set and dividing by the total number of values in that set. To find the mean of a set of data with n terms: The median is the middle number in an ordered data set. The number of values that precede the median will be the same as the number of values that follow it. To find the median of a set of data with n values: 1. Arrange the values in the data set into increasing or decreasing order. 2. If n is odd, the number in the middle is the median. 3. If n is even, the median is the average of the two middle numbers. The mode is the value which occurs with the highest frequency. The mode is the only measure of central tendency for categorical data. If there are no repeated numbers in a data set, then the set has no mode. It is possible for a set to have more than one mode. The mean, the median, and the mode are single numbers used to describe the entire set by pointing out the center of the data. Measures of Dispersion (Spread) While knowing the mean value for a set of data may give us some information about the set itself, many varying sets can have the same mean value. To determine how the sets are different, we need more information. Another way of examining single variable data is to look at how the data is spread out, or dispersed about the mean. The range is the difference between the greatest value in set and the least value. The IQR (interquartile range) is the difference between the third and first quartiles and is considered a more stable statistic than the total range. The IQR contains 50% of the data. Outliers are extreme values in the set (Any value which is either greater than Q3 + 1.5IQR or less than Q1 - 1.5IQR is considered a suspicious point that could be called an outlier.) Each value in a data set (each data point) is either less than, greater than or equal to the mean. The average difference from the mean is equal to zero, since the sum of all the positive differences would equal the sum of all of the negative differences. The variance is the average of the squared differences from the mean. To find the variance: • subtract the mean, , from each of the values in the data set, • square the result • add all of these squares . • and divide by n (the number of values) in the data set to find the variance of a population and by n – 1 to find the variance of a sample. Think of it as a "correction" when your data is only a sample. The standard deviation ( for a population and s for a sample) is the square root of the variance. This measure gives us a “standard” way to describe which values are normal (close to the average) and which values are very large (or small) compared to others in that set. The Normal Distribution Many data sets tend to follow a pattern called a normal distribution. Such data, when graphed as a histogram with the data on the horizontal axis and the frequency on the vertical axis, creates a symmetric “bell-shaped curve” with a single peak at the mean. The spread of a normal distribution is controlled by the standard deviation, . If the standard deviation is small, then the data is concentrated closely about the mean. If the standard deviation is large, then data is dispersed or spread with data points further away from the mean. The mean and the median are the same in a normal distribution. Fifty percent of the distribution lies to the left of the mean and fifty percent lies to the right of the mean. 50% of the distribution lies within 0.67448 standard deviations of the mean. (This refers to the IQR.) Normally distributed data follows the “Empirical Rule” 68% of the distribution lies within one standard deviation of the mean. 95% of the distribution lies within two standard deviations of the mean. 99.7% of the distribution lies within three standard deviations of the mean. Image: http://www.regentspre p.org/Regents/math/alg trig/ATS2/NormalLesson .htm Look for the words "normally distributed" in a question before referring to the Normal Distribution Standard Deviation chart seen on this page. When using the chart, your information should fall on the increments of one-half of one standard deviation as shown in the chart. When the increments are less than one- half of a standard deviation, you should refer to a Standard Normal Table or use a graphing calculator.