Statistics 2014
Among the skills that you will be learning in this course are techniques for graphing data. Below is a
type of graph that we will be looking at later in the course. You might find that this graph provides
you with some useful information. (Note that in this course, we will consider only straight-line trends,
not curvilinear trends.)
[Figure: Relationship Between Class Attendance and Course Grade for a General Biology Course]
Chapter 1 – Data Collection
Definition: Statistics is the science of conducting studies to collect, organize, summarize and analyze
data and to draw conclusions from the data.
Definition: A population is the collection of all objects, animals, or persons that are of interest.
Example: A population of interest might be all students at UNF. An individual member of the
population would be a student.
Example: A population of interest might be all family units in the United States. An individual
member of the population would be a family.
Definition: A variable is a characteristic under study that assumes different values for different
elements of the population.
Example: For the population of all students at UNF, a variable might be the GPA of a student.
Example: For the population of all registered voters in the U.S., a variable might be the candidate
supported by a voter for the office of President.
Definition: Data are the measured values of (a) variable(s).
Definition: Variables whose values are determined by chance are called random variables.
Example: If I randomly choose someone from the population of all registered voters in the U.S., that
person’s preference for President would be a random variable; its value would depend on which voter
was randomly chosen.
Definition: A collection of data values is called a data set.
Definition: An experiment is a planned activity whose result yields a set of data.
Branches of Statistics
Definition: Descriptive statistics is the branch of statistics concerned with collection, organization,
summarization, and presentation of data.
Definition: A parameter is a numerical characteristic of a population.
Example: For the population of all students at UNF, one parameter would be the average GPA.
Example: For the population of all registered voters in the U.S., one parameter would be the level of
support for a particular Presidential candidate.
Definition: A sample is a subset of a population.
Definition: A statistic is a numerical characteristic of a sample.
Example: If I select a random sample of 100 UNF students, a statistic would be the average GPA for
the members of the sample.
Example: If I select a random sample of 1000 people from the population of all registered voters in the
U.S., a statistic would be the fraction of the sample who support a particular Presidential candidate.
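The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of GPAs below is simulated with invented values (it is not real UNF data), and the names population_gpas and sample_gpas are hypothetical.

```python
import random
import statistics

rng = random.Random(2014)  # fixed seed so the sketch is repeatable

# Hypothetical population: 16,000 invented GPAs clipped to the 0.0-4.0 scale.
population_gpas = [
    round(min(4.0, max(0.0, rng.gauss(3.0, 0.5))), 2) for _ in range(16000)
]

# Parameter: a numerical characteristic of the whole population.
parameter_mean = statistics.mean(population_gpas)

# Statistic: the same characteristic computed from a random sample of 100.
sample_gpas = rng.sample(population_gpas, 100)
statistic_mean = statistics.mean(sample_gpas)

print(f"population mean GPA (parameter): {parameter_mean:.3f}")
print(f"sample mean GPA (statistic):     {statistic_mean:.3f}")
```

In practice the parameter is usually unknown, because we cannot measure every member of the population; the statistic is what we can actually compute.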
Definition: Inferential statistics is the branch of statistics concerned with inferring the characteristics
of populations (i.e., parameter values) based on the information contained in sample data sets.
Inferential statistics includes estimation of parameters and hypothesis testing. Included in an
inferential procedure is a measure of reliability.
Example: If I want to predict the outcome of the next Presidential election, I might select a random
sample of 1000 registered voters and ask for each voter’s preference. The fraction of voters in the
sample who support a particular candidate would then estimate the fraction of voters in the entire
population who support that candidate.
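A rough sketch of this estimation procedure follows. The population is simulated, and the 52% support level is an invented figure used only for illustration. The margin of error shown is one common measure of reliability (an approximate 95% margin based on the normal approximation); it is an assumption of this sketch, not a method these notes have introduced yet.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 200,000 registered voters.
# True means the voter supports the candidate; 52% support is invented.
population = [rng.random() < 0.52 for _ in range(200_000)]

# Select a random sample of 1000 voters and compute the sample fraction.
sample = rng.sample(population, 1000)
p_hat = sum(sample) / len(sample)

# Approximate 95% margin of error for a sample fraction.
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))

print(f"estimated support: {p_hat:.3f} plus or minus {margin:.3f}")
```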
Types of Data
Definition: A categorical, or qualitative, variable categorizes elements of a population or sample
based on some characteristic of interest.
Examples: eye color, race, gender, nationality, species of tree
Definition: A quantitative variable is numerical in nature. It represents a count or a measurement
made on elements of a population or sample. A quantitative variable is discrete if it can assume only
a finite or countably infinite number of possible values. Otherwise, it is continuous.
Examples: 1) Consider the population of families in the United States. The variable is the number of
children in the family. This is a discrete quantitative variable (a count) with a finite number of
possible values.
2) Consider the population of all adults in the United States. The variable is each person’s height.
This is a continuous quantitative variable (a measurement), with an uncountably infinite number of
possible values.
Sampling Methods
Definition: A census is a set of data collected for every member of the population.
Definition: A representative sample is one whose characteristics reflect as closely as possible the
characteristics of the population from which the sample was selected.
Example: I want to examine the differences in choice of major field of study for male and female
students at UNF. I select a random sample of 100 male students and a random sample of 100 female
students. I ask each student’s major.
Definition: A simple random sample of size n is a sample drawn from a population by a method
which makes every sample of size n equally likely to be chosen. As a consequence, every member of
the population has exactly the same probability of being selected to be in the sample. (The converse
does not hold: a method can give every member the same chance of selection without making every
sample of size n equally likely.)
Steps in choosing a SRS of size n:
1) Obtain a list of all members of the population; this list is called a sampling frame. (Note: This is
the most difficult step in the whole process, and is also error-prone.)
2) Assign a unique ID number to each member of the population.
3) Go to a table of random numbers; choose a convenient starting point; go down the column,
recording numbers within the range of the assigned ID numbers, until n distinct numbers are selected.
4) The population members that have the ID numbers obtained by this process make up the SRS of size
n.
(Step 3 may also be done using the TI-83 calculator.)
Note: We can never be absolutely certain that our sample is representative, but simple random
sampling gives us a good chance.
Example: I want to estimate the average height of the class, without gathering height data for every
person in the class. I will select a simple random sample of size 6 and use the average height of the
members of the sample as the estimate of the average height of the members of the class. I assign a
unique ID number to each person in the class; the first person on the class roll will have the ID number
001, the second person 002, etc. I then go to a table of random numbers, open it, and blindly choose a
starting point. Reading down the column from the starting point, I find six distinct three-digit numbers
within the range of the values of the ID numbers. The class members with these 6 ID numbers
constitute the SRS.
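The same selection can be sketched in Python, with the standard library's random.sample standing in for the table of random numbers. The class size of 35 is an assumption made for illustration, and simple_random_sample is a hypothetical helper name.

```python
import random

def simple_random_sample(sampling_frame, n, rng):
    """Return n distinct members; every subset of size n is equally likely."""
    return rng.sample(sampling_frame, n)

# Steps 1 and 2: the sampling frame is the class roll, with unique
# three-digit ID numbers 001, 002, ... (35 students is an assumed class size).
class_roll = [f"{i:03d}" for i in range(1, 36)]

# Steps 3 and 4: draw six distinct ID numbers at random.
srs_ids = simple_random_sample(class_roll, 6, random.Random(7))
print(sorted(srs_ids))
```

The average height of the six selected class members would then serve as the estimate of the average height of the class.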
Definition: A systematic sample is chosen by randomly choosing one of the first k members on the
population list, and then choosing every kth member thereafter.
Note: Systematic sampling has the advantage of simplicity; it has the disadvantage that any regularity
in listing the population members could be reflected in the sample, making the sample less likely to be
representative of the population.
Example: Coca-Cola wants to poll its 29500 employees. They begin with a complete employee roster;
randomly choose one of the first 50 names on the list as a starting point; then choose every 50th name
after that one, obtaining a systematic sample of 590 employees. Suppose that the first person selected
is number 3 on the list, and suppose that the original list is arranged so that management people appear
in positions 1, 2, 3, 51, 52, 53, 101, 102, 103, etc. Then the systematic sample chosen will consist
solely of management people, and will thus not be representative of all employees.
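The Coca-Cola example can be checked with a short sketch. The roster layout is taken from the example above; the function names are hypothetical. The starting position is fixed at 3 to reproduce the example, whereas a real systematic sample would choose it at random from 1 through 50.

```python
def systematic_sample(population_size, k, start):
    """1-based list positions: start, start + k, start + 2k, ..."""
    return list(range(start, population_size + 1, k))

def is_management(position):
    # Roster layout from the example: management people appear in
    # positions 1, 2, 3, 51, 52, 53, 101, 102, 103, and so on.
    return (position - 1) % 50 < 3

sample_positions = systematic_sample(population_size=29500, k=50, start=3)

print(len(sample_positions))                            # 590
print(all(is_management(p) for p in sample_positions))  # True
```

Because the list repeats with the same period (50) as the sampling interval, every selected position falls on a management slot.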
Definition: A stratified random sample is chosen by separating the population into mutually exclusive
subgroups, or strata, and choosing a SRS from each stratum.
This method is useful if there are identifiable subpopulations about which we want information.
Example: We want to estimate the average height of the members of the class, using a (hopefully)
representative sample. Since males tend to be taller than females, and we also want to estimate the
average height of the males in the class and of the females in the class, we will first stratify by gender.
We will obtain a SRS from the group of males, and a SRS from the group of females. These two
SRSs together will make up our stratified random sample from the class.
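A minimal sketch of stratifying by gender and drawing an SRS from each stratum is given below. The class data (40 students, with invented heights in inches) and the choice of four students per stratum are assumptions for illustration; stratified_sample is a hypothetical helper.

```python
import random

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Split units into strata, then draw an SRS from each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {
        name: rng.sample(members, n_per_stratum[name])
        for name, members in strata.items()
    }

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical class: (ID number, gender, height in inches); values invented.
students = [
    (f"{i:03d}", "M" if i % 2 else "F", round(rng.gauss(70 if i % 2 else 65, 3), 1))
    for i in range(1, 41)
]

picked = stratified_sample(students, lambda s: s[1], {"M": 4, "F": 4}, rng)
for gender, members in picked.items():
    mean_height = sum(height for _, _, height in members) / len(members)
    print(gender, round(mean_height, 1))
```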
Definition: A cluster sample is a simple random sample of groups or clusters of elements.
Note: This method is useful when it is difficult or costly to develop a complete list of members of the
population, or when the population is widely dispersed geographically.
Example: We want to estimate the average household income in a large city. To use a SRS, we would
need a complete list of all households, with unique ID numbers. We would select a SRS of size n, and
interview someone from each of these households. If we are sending out interviewers to each
household, they may have to travel extensively, going from one chosen household to the next. On the
other hand, we could make a list of all residential blocks in the city, choose a SRS of blocks, and have
the interviewer go to each house in each selected block. This would involve less travel time. The
disadvantage is that this method increases sampling error; we are somewhat less likely to obtain a
representative sample than if we use SRS.
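The block-based approach can be sketched as follows. The city (200 blocks of 10 to 30 households each), the incomes, and the choice of 8 blocks are all invented for illustration; cluster_sample is a hypothetical helper.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Choose an SRS of clusters, then take every element of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    elements = [element for name in chosen for element in clusters[name]]
    return chosen, elements

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical city: each block maps to a list of invented household incomes.
city_blocks = {
    f"block-{b:03d}": [
        round(rng.uniform(20_000, 120_000)) for _ in range(rng.randint(10, 30))
    ]
    for b in range(1, 201)
}

chosen_blocks, incomes = cluster_sample(city_blocks, 8, rng)
estimate = sum(incomes) / len(incomes)
print(f"{len(incomes)} households in {len(chosen_blocks)} blocks; "
      f"estimated average income {estimate:.0f}")
```

Only a list of blocks is needed as the sampling frame, not a list of every household in the city.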
Definition: A convenience sample is a sample in which the individuals are easily obtained. (Note: It
is almost never a good idea to do convenience sampling.)
Example: Online “surveys” often use convenience samples. Almost on a daily basis, I see online
“surveys” that ask users for their opinions about current events in the news. The resulting data are
based on responses from users who volunteer to answer the questions. Conclusions drawn from such
“surveys” are questionable.
Sources of Error in Sampling
Definition: A nonsampling error is an error that results from the survey process. Such errors may be
due to 1) nonresponse, 2) inaccurate responses, 3) poorly worded survey questions, etc.
Example: In 1936, the magazine Literary Digest conducted a poll to predict the outcome of the
Presidential election. They mailed out sample ballots to a sample of 10 million voters. Based on the
results, they predicted that Alfred M. Landon would win with 57% of the votes. Actually, Franklin D.
Roosevelt won with 62% of the votes. What happened? The magazine used poor sampling methods.
First of all, as a sampling frame, they used a list composed of subscribers to their magazine,
households with telephones, and households with automobiles. Since the U.S. was in the middle of the
Great Depression, the people listed on the sampling frame were those who were more affluent. A
majority of the people on the list were Republicans. However, less affluent Americans, who were not
included in the frame, were more likely to favor the New Deal policies of the Roosevelt administration.
A second problem with the poll was that, of the 10 million sample ballots mailed out, only about
one-fourth were returned.
Some other examples of causes of nonsampling error are given in the text on pp. 34-36.
Definition: Sampling error is the error that results from using information from a sample to make
decisions about the characteristics of the population from which the sample was selected. This type of
error occurs because a sample gives incomplete information about a population.
The Design of Experiments
Definition: An observational study draws conclusions about the characteristics of a population by
measuring those characteristics of individuals in a sample, but does not attempt to manipulate or
influence the variables of interest.
Definition: A designed experiment applies a treatment to individuals (called experimental units) and
attempts to isolate the effects of the treatment on a response variable.
Example: In 1954, Jonas Salk tested a vaccine which he hoped would prevent polio. One group of
children was given the vaccine, and another group of children was given a placebo (not containing
any drug). He then observed whether there was a difference in the rates of polio between the vaccine
group and the placebo group. It was found that there was a much lower rate of polio in the vaccine
group than in the placebo group. Salk concluded that his vaccine would help to prevent polio.
Example: Suppose there are two methods for teaching a basic astronomy course.
1) One method is a standard chalk-and-talk method, in which the teacher lectures and assigns
homework and tests.
2) Another method is self-paced; in addition to lectures from the teacher, the student, at his/her
own pace, studies units of material (for example, there may be a unit of material on the internal
structure of a star). The student then takes a quiz on the unit of material, and has two
opportunities to pass the quiz. The student must successfully complete a certain number of
units in order to get a passing grade in the course.
A researcher wants to know whether the self-paced method improves student performance over the
standard method. The researcher selects a random sample of students enrolled in astronomy, and
randomly assigns some students to the standard instructional method, and other students to the
self-paced method. At the end of the semester, the researcher examines student grades.
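Random assignment of students to the two instructional methods might be sketched as follows. The sample of 40 students and the equal group sizes are assumptions for illustration; randomly_assign is a hypothetical helper.

```python
import random

def randomly_assign(units, treatments, rng):
    """Shuffle the units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical sample of 40 astronomy students.
students = [f"student-{i:02d}" for i in range(1, 41)]
groups = randomly_assign(students, ["standard", "self-paced"], random.Random(5))

for treatment, members in groups.items():
    print(treatment, len(members))
```

Because chance alone decides which method each student receives, other characteristics of the students tend to balance out between the two groups.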
Definition: The experimental unit (or subject) is a person, object, or some other well-defined item
to which a treatment is applied. The experimental unit is analogous to the individual in a survey.
Example: In the Salk vaccine study, an experimental unit was an individual child.
Definition: An experimental treatment is a condition applied by the researcher to the experimental
unit.
Example: In the Salk vaccine study, one experimental treatment was the Salk vaccine; the other
experimental treatment was the placebo.
Example: In the basic astronomy study, one experimental treatment was the standard instructional
method; the other experimental treatment was the self-paced method.
Definition: In an experimental design, a factor is an explanatory variable that could influence the
outcome of the experiment.
Example: In the Salk vaccine study, one factor is the experimental factor – whether a child is given the
vaccine or a placebo. Another factor (uncontrolled) is opportunity to be exposed to the polio virus.
Definition: In an experimental design, the response variable is the outcome variable about which
comparisons are made.
Example: In the Salk vaccine study, the response variable is a random variable with two possible
values: 1) the child developed polio, or 2) the child did not develop polio.
Example: In the basic astronomy study, the response variable is a random variable, namely a student’s
grade in the course.
Steps in Conducting an Experiment:
1) Identify the problem to be solved, as explicitly and completely as possible. (We want to know
whether the Salk vaccine reduces the rate of occurrence of polio in children.)
2) Determine the factors that affect the response variable. (Presence or absence of the vaccine,
opportunity for exposure to the virus that causes polio.)
3) Determine the number of experimental units. (How many children should be in the vaccine
group? How many in the placebo group?)
4) Determine the level of the predictor variables. (One predictor variable is the experimental
treatment – presence or absence of the vaccine. Another predictor variable is opportunity to be
exposed to the virus. The first variable is completely under the control of the researcher. The
second can only be controlled through randomly assigning children to receive either the
vaccine or the placebo.)
5) Collect and process the data. (After a period of time, each child is examined to determine
whether polio infection is present.)
6) Test the claim. (Compare the rate of infection in the vaccine group with the rate of infection in
the placebo group.)
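The comparison in step 6 amounts to computing and comparing two rates. The sketch below uses invented counts; they are NOT the results of the 1954 trial, and infection_rate is a hypothetical helper. Deciding whether an observed difference is larger than chance alone would produce requires a formal hypothesis test, which this sketch does not attempt.

```python
def infection_rate(cases, group_size):
    """Fraction of a group in which infection was observed."""
    return cases / group_size

# Invented counts for illustration only; not the actual trial data.
vaccine_rate = infection_rate(cases=30, group_size=100_000)
placebo_rate = infection_rate(cases=75, group_size=100_000)

print(f"vaccine group: {vaccine_rate * 100_000:.0f} cases per 100,000")
print(f"placebo group: {placebo_rate * 100_000:.0f} cases per 100,000")
print("lower rate in vaccine group:", vaccine_rate < placebo_rate)
```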