Statistics for Algebra II
Statistics is the practice or science of collecting and analyzing numerical data in large quantities, especially for the
purpose of inferring proportions in a whole from those in a representative sample.
Designing a study
Steps for designing a study:
1. Identify the variable (or variables) of interest and the population of the study.
A variable is any characteristic that is recorded for subjects in a study. (Gender, major, age, and GPA
might be variables for a study about college students.)
The population is the collection of all of the things the researcher wants to describe or make decisions
about in the study. (Students in the US, seniors at SAHS, registered vehicles in NC, likely voters in the
2016 presidential election, gears produced by a particular manufacturer, toys sold on Cyber Monday on
Amazon, or the animals at a particular shelter during 2013 are all possible populations.)
2. Develop a detailed plan for collecting the data.
When the data will be collected from the entire population, the study is called a census, and numerical
summaries of the population (such as its mean) are called parameters.
When using a sample, it is important to make sure that the sample is representative of the population.
Usually, researchers will employ sampling techniques which involve probability to randomize the
sample selection process. Numerical summaries computed from a sample are called statistics.
3. Collect the data.
4. Describe the data using descriptive statistics.
5. Interpret the data and use inferential statistics to draw conclusions (or make decisions) about the population.
6. Identify any possible errors (some potential problems can be identified earlier in the process).
Conducting a Study
We will focus on four methods for collecting data: observational studies, surveys, experiments, and simulations.
In an observational study, a researcher measures and observes the variables of interest without
changing existing conditions. Observational studies can point to a correlation between two variables of
interest, but cannot be used to infer causation.
A survey is used to investigate characteristics of a population. It is frequently used when the subjects are
people, and questions are asked of them. When designing a survey, you must be very careful about the wording
(and sometimes the ordering) of the questions so that the results are not biased. The results are also subject to
bias introduced by non-response (subjects may refuse to answer some or all questions) and by responses
that are not truthful. Like an observational study, a survey can reveal a correlation but cannot establish causation.
In an experiment, randomization is used to assign members of the study group to treatment groups. A
researcher then randomly assigns a treatment to each group and observes the response. Often, one group
will be assigned as a control group (a group receiving no treatment or a placebo) to be used to compare
the effectiveness of a treatment. A well-designed experiment may be used to infer causation.
A simulation uses a mathematical, physical, or computer model to replicate the conditions of a process
or situation. This is frequently used when the actual situation is too expensive, dangerous, or impractical
to replicate in real life.
Examples
Identify which method for collecting data (observational study, an experiment, a simulation, or a survey) is best
in each of the following situations and explain your answer.
1. The effect a severe earthquake would have on the Salt Lake Valley.
Simulation, since we cannot control when that area will have an earthquake.
2. Whether or not a certain coupon attached to the outside of a catalog makes recipients more likely to
order products from a mail-order company. Experiment, since we are comparing two scenarios and we
can control them.
3. Whether or not smoking has an effect on coronary heart disease. Observational study, since we will
not be changing a person's behavior. (There are ethical and health concerns for deciding whether or not
someone smokes.)
4. Determining the average household income of homes in Burlington, NC. Survey, since it can be
answered with a brief question.
Problems and Methods to Deal with Them in a Study
• Confounding occurs when a researcher cannot tell the difference between the effects of
different factors on a variable. A confounding variable is a factor, other than the treatment, that
also affects the response.
• The placebo effect occurs when a subject (or an “experimental unit”) reacts favorably to a placebo when
no medicated treatment has been given.
• Blinding is a technique used to make the subjects “blind” to which treatment (or placebo) they are being
given.
• A double-blind experiment is one in which neither the experimenter nor the subjects know which
treatment is being given.
• Randomization is a process of randomly assigning subjects to treatment groups. There are
several different techniques for randomization:
    • A completely randomized design assigns subjects to different treatment groups
    through random assignment.
    • A randomized block design is sometimes used to make sure that subjects with certain
    characteristics are assigned to each treatment. For example, when testing a certain
    medication, you might first split subjects into groups (blocks) according to gender
    or age (or both), then randomly assign the subjects within each block to the different treatments.
    • A matched pairs design pairs up subjects according to similarities. One subject in the
    pair receives one treatment, while the other receives a different treatment.
• Sample size is the number of participants in the experiment. The larger the sample, the
more precisely the results will tend to reflect the population, but the costs of the experiment
will be higher.
• Replication is the ability to reproduce the experiment (and results) under similar
conditions.
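As a rough illustration of a completely randomized design, the sketch below shuffles a list of subjects and deals them out to the treatment groups in turn. The subject labels, the function name completely_randomized, and the fixed seed are assumptions made only for this example.

```python
import random

def completely_randomized(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical subject labels; the seed is fixed only so the run is repeatable.
subjects = [f"subject_{i}" for i in range(1, 13)]
groups = completely_randomized(subjects, ["treatment", "control"], seed=1)
for name, members in groups.items():
    print(name, sorted(members))
```

With 12 subjects and 2 treatments, each group ends up with 6 subjects, and chance alone decides who lands in which group.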
Examples
1. For the following experiment, determine the experimental units, treatments, and sample size.
Indicate whether this experiment is blind, double-blind, and/or randomized. Also identify any
potential problems with the design.
A study with 233 low-income adult smokers compared the effectiveness of usual care
(physician advice and follow-up) for smokers wishing to quit with that of the usual care enhanced
by computer-assisted telephone counseling sessions. Each subject was assigned randomly
either to the usual care or to the usual care plus counseling, and their smoking status (still
smoking or quit smoking) was observed after 3 months. The percentage who had quit
smoking was higher for the group receiving counseling. (from Journal of Family Practice
2000;50:138-144)
The experimental units are the 233 adult smokers.
The treatments are “usual care” and “usual care plus counseling.”
The sample size is 233.
The experiment is not blind or double-blind (experimental units will know if they receive
counseling or not), but it is randomized since the subjects were assigned randomly.
2. How would you design a placebo-controlled double-blind experiment with a randomized
block design for the following situation:
A veterinarian wants to test a strain of antibiotic on calves to determine their
resistance to a common infection, and if their gender plays a role. In a pasture, there
are 22 newborn calves (11 males and 11 females). There is enough antibiotic for 10
calves, but blood tests to determine their resistance to infection can be done on all
calves.
For the randomized block design, we can choose 5 males and 5 females randomly to
receive the antibiotic, and the remaining 6 males and 6 females will receive a “placebo”
(in this case, the “placebo” could be no treatment since the calves won't be telling anyone
if they received a shot or not). In order to make the experiment double-blind, we need to
make the calves unaware of what treatment they are receiving (not difficult), and for the
person carrying out the blood tests to be unaware of the treatment (to make this possible,
another person will assign which of the calves receive the antibiotic and which do not and
will administer the antibiotic).
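The block assignment described above could be carried out along the following lines. This is only a sketch: the ear-tag labels (M1, F1, and so on) and the fixed seed are invented for the example, and in practice the person doing the blood tests would not see the resulting list.

```python
import random

rng = random.Random(2024)  # fixed seed only so the sketch is repeatable

# Hypothetical ear-tag labels; the blocks are formed by gender.
blocks = {
    "male": [f"M{i}" for i in range(1, 12)],    # 11 male calves
    "female": [f"F{i}" for i in range(1, 12)],  # 11 female calves
}

assignment = {}
for gender, calves in blocks.items():
    treated = set(rng.sample(calves, 5))  # 5 calves per block get the antibiotic
    for calf in calves:
        assignment[calf] = "antibiotic" if calf in treated else "placebo"

n_treated = sum(1 for t in assignment.values() if t == "antibiotic")
print(n_treated, "treated;", len(assignment) - n_treated, "placebo")
```

The randomization happens separately inside each block, so both genders are guaranteed 5 treated calves and 6 untreated ones.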
Sampling Techniques
Ideally, we would take a census, that is, use every member of a population as a subject since the
descriptive statistics would be sufficient. However, this is often too costly and difficult.
Instead, we sample part of the population. With sampling, we need to make sure that the sample
is representative of the population and large enough to be meaningful.
Definitions and Terminology
• A sampling error is the difference between the results of the sample and those of the
population. Some sampling error is possible even with the best sampling techniques.
• A biased sample is one that is not representative of the entire population. We want to
avoid bias.
• A random sample is one in which every member of the population has an equal chance of
being chosen.
• A simple random sample (SRS) is a sample in which every possible sample of the same
size has the same chance of being collected. Normally, we will start by using a simple
random sample.
• A stratified sample is used when it is important to have members from multiple segments
of the population. First, the population is split into segments (called “strata”), then a
predetermined number of subjects is chosen from each of the strata.
• Cluster sampling can be used when the population naturally falls into subgroups with
similar characteristics. First, determine the clusters, then randomly select one or more of the
clusters and include all of their members.
• Systematic sampling first involves assigning a number to each member of the population
and ordering them in some way. Sample members are selected by choosing the first
member randomly, then selecting subsequent members at regular intervals after the
starting number (for example, every 7th person). This method is fairly simple to use, but
should be avoided if there are regularly occurring patterns in the data.
• A convenience sample consists only of available members of the population, but this
often leads to biased studies.
• A volunteer sample is a kind of convenience sample in which only volunteers participate.
• A multi-stage sample is selected by applying two or more sampling techniques
successively to determine the sample.
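To make the differences between stratified, cluster, and systematic sampling concrete, here is a small sketch on an invented population of 120 students spread evenly over four dorms. The population, the function names, and the seed are all assumptions for illustration.

```python
import random

rng = random.Random(7)  # fixed seed only so the sketch is repeatable

# Hypothetical population: 120 students, listed so the dorms cycle 0, 1, 2, 3, ...
population = [(f"student_{i}", f"dorm_{i % 4}") for i in range(120)]

def stratified_sample(pop, per_stratum):
    """Choose a fixed number of members at random from every stratum (dorm)."""
    strata = {}
    for member in pop:
        strata.setdefault(member[1], []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

def cluster_sample(pop, n_clusters):
    """Choose whole clusters (dorms) at random and keep all of their members."""
    clusters = sorted({member[1] for member in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [m for m in pop if m[1] in chosen]

def systematic_sample(pop, interval):
    """Pick a random starting point, then take every interval-th member."""
    start = rng.randrange(interval)
    return pop[start::interval]

strat = stratified_sample(population, 5)
clust = cluster_sample(population, 2)
syst = systematic_sample(population, 10)
print(len(strat), len(clust), len(syst))  # 20 60 12
```

Because this made-up list cycles through the four dorms and the interval is 10, the systematic sample only ever reaches two of the dorms. That is the kind of regularly occurring pattern the definition warns about.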
To choose subjects for an SRS, first determine the size of the population and number everyone
on the list. Then use a random process such as a table of random numbers or a random number
generator to select your sample.
(To use a table of random numbers, the size of your population tells you how many digits to
read at once. For example, if there are 132 members, you will need 3 digits, but with 32, you
would only need 2 digits. Note that with 100 members, you will only need 2 digits if you
number the members of your population from 00-99. To select the first subject, read the first
“few” digits of the table (“few” = the number of digits that you determined you needed), then
find that number in your numbered list of the population. If the number you selected is larger
than any number on your list, ignore that number and move on to the next “few” digits and try
again. Continue until you have as many subjects as you need. If a number comes up that has
already been selected, skip it so that no subject is repeated.)
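The table-reading procedure can be mimicked in a few lines of code. In this sketch the row of digits is made up (it stands in for a printed table), members are assumed to be numbered from 1 up to the population size, and srs_from_digits is a hypothetical helper name.

```python
def srs_from_digits(digits, population_size, sample_size):
    """Read a digit string in fixed-width chunks, as with a random number table.

    Members are assumed to be numbered 1..population_size.
    """
    width = len(str(population_size))  # 132 members -> 3 digits, 32 -> 2 digits
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        number = int(digits[i:i + width])
        # Skip numbers that are not on the list and numbers already selected.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# A made-up row of digits standing in for a printed table.
row = "0731459902213207310458"
print(srs_from_digits(row, 32, 5))  # [7, 31, 2, 21, 32]
```

Reading two digits at a time gives 07, 31, 45, 99, 02, 21, 32; the values 45 and 99 are skipped because they are larger than 32. With a random number generator instead of a table, Python's random.sample(range(1, 33), 5) does the same job in one call.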
Examples
Determine which kind of sample was used in each of the following scenarios:
(a) To determine the quality of on-campus housing, 20 residents from each dorm were chosen
to complete a survey. Stratified sample
(b) To evaluate employee compensation, choose a random sample of 10 zip codes in the state,
then survey all businesses within each chosen zip code about their benefits package. Cluster
sample
(c) Those who participate in a survey linked from cnn.com. Volunteer sample
(d) To determine the quality of education at the University of North Carolina, a PID number is
chosen at random, then every 600th student is evaluated until 30 students are selected.
Systematic sample
(e) Interested in only one neighborhood, you walk door-to-door to ask residents questions.
Everyone was home and willing to participate, so you have survey results from every household
in the neighborhood. Census
(f) Chosen at random, 300 people who received care at the University Hospital participated in a
survey. Simple random sample
Measures of Central Tendency
The mean is the number found by adding all of the values in the data set and dividing by the
total number of values in that set.
To find the mean of a set of data with n values $x_1, x_2, \ldots, x_n$:
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
The median is the middle number in an ordered data set. The number of values that precede the
median will be the same as the number of values that follow it.
To find the median of a set of data with n values:
1. Arrange the values in the data set into increasing or decreasing order.
2. If n is odd, the number in the middle is the median.
3. If n is even, the median is the average of the two middle numbers.
The mode is the value which occurs with the highest frequency. The mode is the only measure
of central tendency for categorical data. If there are no repeated numbers in a data set, then the
set has no mode. It is possible for a set to have more than one mode.
The mean, the median, and the mode are single numbers used to describe the entire set by
pointing out the center of the data.
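As a quick check of these definitions, the sketch below computes all three measures for a small made-up data set using Python's standard statistics module.

```python
import statistics

data = [3, 7, 7, 2, 9, 4]  # made-up values

mean = statistics.mean(data)        # (3 + 7 + 7 + 2 + 9 + 4) / 6
median = statistics.median(data)    # n is even: average of the middle values 4 and 7
modes = statistics.multimode(data)  # every value tied for the highest frequency

print(mean, median, modes)
print(statistics.multimode([1, 1, 2, 2, 3]))  # a set with two modes: [1, 2]
```

Here the median is 5.5 and the only mode is 7. One difference in convention is worth noting: when no value repeats, multimode returns every value in the list, whereas the definition above says such a set has no mode.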
Measures of Dispersion (Spread)
While knowing the mean value for a set of data may give us some information about the set
itself, many varying sets can have the same mean value. To determine how the sets are
different, we need more information. Another way of examining single variable data is to look
at how the data is spread out, or dispersed about the mean.
The range is the difference between the greatest value in the set and the least value.
The IQR (interquartile range) is the difference between the third and first quartiles (Q3 - Q1) and is
considered a more stable statistic than the total range. The middle 50% of the data lies between
Q1 and Q3. Outliers are extreme values in the set. (Any value which is either greater than Q3 +
1.5IQR or less than Q1 - 1.5IQR is considered a suspicious point that could be called an
outlier.)
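The outlier rule can be tried out on a small made-up data set. Quartiles are computed here as the medians of the lower and upper halves of the ordered data, which is one common classroom convention; calculators and software sometimes use slightly different rules, so treat this as a sketch under that assumption.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values; with an odd count the middle value
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)

data = [2, 4, 5, 7, 8, 9, 11, 30]  # made-up values
q1, q3 = quartiles(data)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(max(data) - min(data))  # range: 28
print(q1, q3, iqr, outliers)  # 4.5 10.0 5.5 [30]
```

The fences work out to -3.75 and 18.25, so 30 is flagged as a possible outlier. Notice that this single value makes the range 28 while the IQR is only 5.5, which is why the IQR is described as the more stable statistic.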
Each value in a data set (each data point) is either less than, greater than, or equal to the mean.
The average difference from the mean is equal to zero, since the sum of all the positive
differences is cancelled exactly by the sum of all the negative differences. Squaring the differences
removes this cancellation: the variance is the average of the squared differences from the mean.
To find the variance:
• subtract the mean from each of the values in the data set,
• square each result,
• add all of these squares,
• and divide by n (the number of values in the data set) to find the variance of a population,
or by n – 1 to find the variance of a sample. Think of dividing by n – 1 as a "correction" when your data is
only a sample.
The standard deviation (σ for a population and s for a sample) is the square root of the
variance. This measure gives us a “standard” way to describe which values are normal (close to
the average) and which values are very large (or small) compared to others in that set.
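The steps for the variance translate directly into code. The sketch below follows them on a made-up data set and then compares the results with the functions in Python's statistics module (pvariance and pstdev for a population, variance and stdev for a sample).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = sum(data) / len(data)                       # 5.0
differences = [x - mean for x in data]             # these always sum to zero
squared = [d ** 2 for d in differences]            # squares add up to 32.0

population_variance = sum(squared) / len(data)     # divide by n
sample_variance = sum(squared) / (len(data) - 1)   # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, population_sd)          # 4.0 2.0
print(round(sample_variance, 3), round(sample_sd, 3))

# The standard library gives the same answers.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
```

Dividing by n – 1 instead of n always gives a slightly larger value, which is the "correction" for working with only a sample.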
The Normal Distribution
Many data sets tend to follow a pattern called a normal distribution. Such data, when graphed
as a histogram with the data on the horizontal axis and the frequency on the vertical axis,
create a symmetric “bell-shaped curve” with a single peak at the mean. The spread of a normal
distribution is controlled by the standard deviation, σ. If the standard deviation is small, then
the data is concentrated closely about the mean. If the standard deviation is large, then the data is
more dispersed, with data points further away from the mean.
• The mean and the median are the same in a normal distribution.
• Fifty percent of the distribution lies to the left of the mean and fifty percent lies to the
right of the mean.
• 50% of the distribution lies within about 0.6745 standard deviations of the mean. (This interval
runs from Q1 to Q3, so its width is the IQR.)
• Normally distributed data follows the “Empirical Rule”:
    • About 68% of the distribution lies within one standard deviation of the mean.
    • About 95% of the distribution lies within two standard deviations of the mean.
    • About 99.7% of the distribution lies within three standard deviations of the mean.
Image: http://www.regentsprep.org/Regents/math/algtrig/ATS2/NormalLesson.htm
Look for the words "normally distributed" in a question before referring to the Normal Distribution
Standard Deviation chart seen on this page. When using the chart, your information should fall on
the increments of one-half of one standard deviation as shown in the chart. When the increments
are less than one-half of a standard deviation, you should refer to a Standard Normal Table or use
a graphing calculator.
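In place of a Standard Normal Table or a graphing calculator, the same percentages can be looked up with Python's statistics.NormalDist. The sketch below checks the Empirical Rule and the quartile cutoff for a standard normal distribution (mean 0, standard deviation 1); the helper name within is an assumption for this example.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def within(k):
    """Fraction of the distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k) * 100, 1))  # 68.3, 95.4, 99.7

# Number of standard deviations above the mean at which Q3 falls.
print(round(z.inv_cdf(0.75), 5))  # 0.67449

# An increment that is not a multiple of one-half of a standard deviation.
print(round(within(1.25) * 100, 1))
```

The rounded values 68%, 95%, and 99.7% in the Empirical Rule are approximations of these more precise figures.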