Lecture 4.1
Contemporary Mathematics
Instruction: Population, Sample, Data, Statistics
A datum is a statement of fact or at least accepted as fact. Data is the plural of datum, so
a set of data is a collection of statements. Data can be quantitative or qualitative. Consider a
sporting goods store that carries jerseys. The sizes of the jerseys carried can represent a set of
data that is quantitative, e.g., N = {10, 12, 14, 16, 18, 20, 24, 28} , or a set of data that is
qualitative, e.g., D = {small, medium, large, extra large} .
Data can be collected from a population or from a sample. A population is a set of data.
A sample is a subset of a population. A sample, then, is a set of data, and in this course, we will
deal exclusively with samples that are sets of quantitative (numerical) data.
There are many ways to sample the population. If our efforts are aimed at attempting to
reach all members of the population, the technique is census-taking; otherwise, we employ other
techniques to obtain a sample. Consider a shipment of 300,000 sealed crates. Law enforcement
officers searching for contraband may not possess the resources to inspect each crate (take a
census). Instead, officers may randomly select thirty crates from the population. The thirty
crates represent a random sample. There are many types of samples: random samples (selecting
thirty crates at random), cluster samples (selecting ten crates from three particular cargo bays),
stratified samples (selecting a few crates from several types such as small, medium, and large
crates or Nigerian, Ugandan, and Liberian crates), systematic samples (selecting every ten-thousandth crate), and many more.
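The four sampling techniques named above can be sketched in Python. This is only an illustration under stated assumptions: the crate ID numbers, the thirty cargo bays of 10,000 crates each, the particular bays chosen, and the rule that assigns a size to each crate are all hypothetical and do not come from the example. The cluster sample takes ten crates from each of three bays, which is one reading of the description above.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: crate IDs 1 through 300,000.
crates = list(range(1, 300_001))

# Random sample: thirty crates chosen at random.
random_sample = random.sample(crates, 30)

# Systematic sample: every ten-thousandth crate.
systematic_sample = crates[9_999::10_000]

# Cluster sample: assume (hypothetically) 30 cargo bays of 10,000 crates
# each, then take ten crates from each of three particular bays.
def bay_of(crate_id):
    return (crate_id - 1) // 10_000

chosen_bays = [2, 11, 25]  # hypothetical choice of bays
cluster_sample = []
for bay in chosen_bays:
    in_bay = [c for c in crates if bay_of(c) == bay]
    cluster_sample.extend(random.sample(in_bay, 10))

# Stratified sample: assume (hypothetically) every crate has a size,
# then take a few crates from each size.
def size_of(crate_id):
    return ("small", "medium", "large")[crate_id % 3]

stratified_sample = []
for size in ("small", "medium", "large"):
    stratum = [c for c in crates if size_of(c) == size]
    stratified_sample.extend(random.sample(stratum, 5))

print(len(random_sample), len(systematic_sample),
      len(cluster_sample), len(stratified_sample))  # 30 30 30 15
```

Which crates land in the random, cluster, and stratified samples depends on the seed. The systematic sample is always crates 10,000, 20,000, and so on up to 300,000.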
As mentioned above, we will consider numerical data sets understood to be samples,
which, in turn, are understood to be subsets of larger populations. Numbers that describe a
population are called parameters.
A parameter is a numerical value that describes a population.
We will be interested in statistics, that is, numbers that describe a sample in some way. The term
"statistics" is ambiguous because it refers both to the plural of statistic and to a field of study as
defined below.
A statistic is a numerical value that describes a sample.
Statistics refers both to the plural of a statistic and to a field of science, the science
of collecting, organizing, and analyzing empirical data.
Statistics–numbers that describe samples–are sometimes used to infer characteristics of
population parameters. Statistics–the field of study–employs certain tests and procedures to
gather knowledge concerning populations.
Instruction: Scores and Variables
Collecting data involves some activity requiring observation or measurement. The
measurements yield data values called scores, which are referred to as raw scores when it is
necessary to emphasize that the score has not been changed from the initial measurement.
A score is a datum collected by measurement or observation. Raw
scores equal unchanged measurements or observations.
This course deals with quantitative samples which are numerical data sets. The data is collected
via some measurement, each measurement being a particular datum or score. The scores change
(or at least could change) from object to object in the set. The measurement itself is called a
variable, represented by the letter X, and scores are possible values of the variable (possible x-values).
A variable is a measurable characteristic that takes different values.
We will be concerned with two types of variables, discrete variables and continuous variables.
Discrete variables are typically restricted to whole numbers. Examples include the number
of siblings an individual has and the number of deaths that occur in a hospital in a week.
A discrete variable takes values that represent separate categories such
that when the scores are ordered any two consecutive possible scores
are separated by a span of impossible values.
Not all variables are discrete. If a variable is not discrete, it is continuous.
A continuous variable takes values that represent categories such that
an infinite number of possible scores fall between any two measured
scores.
Examples of continuous variables include heights, weights, and durations. In cases where the
variable is continuous but measurements are rounded, it is important to recognize the rounded
scores as values of a continuous variable, not a discrete variable.
Instruction: Summation Notation
Statistics often requires the summation of a large number of numbers, so a special
notation for "summation" is required. The capital Greek letter sigma, Σ , serves this purpose.
For example, given a set of scores (x-values), A = {x1, x2, …, xn}, then ∑ X = x1 + x2 + … + xn.
In particular, if A = {5, 7, 8, 9, 11, 12, 18} where each element in the set is a datum
considered to be the value of some measurement called the random variable X, then
∑ X = 5 + 7 + 8 + 9 + 11 + 12 + 18 , so ∑ X = 70 .
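The summation above can be checked with a short Python sketch. The variable names are illustrative only. The loop mirrors the notation by adding the scores one at a time, and the built-in `sum` gives the same total.

```python
# The scores from the example above.
A = [5, 7, 8, 9, 11, 12, 18]

total = 0
for x in A:  # add the scores one at a time: x1 + x2 + ... + xn
    total += x

print(total)            # 70
print(total == sum(A))  # True
```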
Application Exercise 4.1
Problems
Donald Robertson writes in Space & Communications, "The SSMEs [space shuttle main
engines] are the most reliable of today's rocket engines. In 168 engine flights, there have been no
critical failures, just one early shut-down in flight caused by a sensor problem, and only four
shut-downs on the pad, according to Rocketdyne. Challenger was destroyed by a Solid Rocket
Booster leak. This main engine reliability has been achieved despite engines being re-used as
many as fifteen times."
#1 Is the proportion of "shut-downs" to "engine flights" an example of a statistic or a parameter?
A Gallup survey of 1,000 telephone interviews conducted shortly after the Columbia tragedy
indicated that 82 percent of respondents expressed support for continuing the manned space
shuttle program.
#2 Is the proportion of respondents who "expressed support for continuing the manned space
shuttle program" a statistic or a parameter?
A reporter investigating possible asteroid/Earth impact catastrophes writes, "The orbit of
asteroid 2003-QQ47 has been calculated using only fifty-one observations during a seven-day
period. Further observations are required to determine if any danger of impact with Earth does
exist. This asteroid will be monitored closely over the next two months. Astronomers expect the
risk of impact to decrease significantly as more data is gathered."
#3 To scientists assessing the risk of impact from asteroid 2003-QQ47, are the "fifty-one
observations" discussed in the article a population or a sample?
About.com reports, "More than 70 spacecraft have been sent to the Moon; 12 astronauts have
walked upon its surface and brought back 382 kg of lunar rock and soil to Earth."
#4 To scientists interested in the Moon, does the "382 kg of lunar rock and soil" mentioned in
the report represent a population or sample?
#1 parameter
#2 statistic
#3 sample
#4 sample
Assignment 4.1
Problems
Identify each data set described below as a sample or a population.
#1 A survey of five hundred University of Texas students.
#2 The age of each U.S. president upon election to office.
Identify each numerical value described below as a statistic or a parameter.
#3 The average annual salary of forty of a company's 1,100 employees is $76,000.00.
#4 According to ACT, Inc., the average ACT math score for all graduates in a particular
year was 20.7.
Consider the set of x-values: {5, 6, 9, 10}.
#5 Find ∑ X.
Lecture 4.2
Instruction: Frequency Distributions
One statistic is the frequency of a particular value in a data set. The frequency, f, of a
score, x, equals the number of times the score appears in the data set. A frequency distribution is
a common tool used to organize data from a sample.
A frequency distribution is an organized display—be it a tabulation or a graph—that
shows the frequency of each data value in a sample.
A frequency distribution helps organize data sets (samples) that contain numerous repeated
values. For example, consider the data set T collected by the National Oceanic and Atmospheric
Administration. T is the set of number of deaths in the United States attributed to tornados in the
month of February for the years 1950 to 1983.
T = {45, 1, 10, 3, 2, 0, 8, 0, 13, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 134, 0, 0, 0, 7, 5, 2, 0, 0, 0, 2, 0, 1}
The random variable, X, equals the number of tornado deaths in February in the United States for
a given year. The frequency distribution shown below lists each x-value and the corresponding
frequency of that value for the data contained in T.
X    0    1    2    3    5    7    8    10   13   21   45   134
f    20   2    3    1    1    1    1    1    1    1    1    1
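A tabular frequency distribution like this one can be produced by counting how often each score occurs. The sketch below is one way to do it in Python, using `collections.Counter` from the standard library. The names `T` and `freq` are illustrative.

```python
from collections import Counter

# February tornado deaths, 1950 to 1983 (data set T above).
T = [45, 1, 10, 3, 2, 0, 8, 0, 13, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     134, 0, 0, 0, 7, 5, 2, 0, 0, 0, 2, 0, 1]

freq = Counter(T)  # maps each score x to its frequency f

print("X", "f")
for x in sorted(freq):
    print(x, freq[x])
```

The frequencies sum to 34, the number of years in the data set.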
The frequency distribution above is in tabular form, but frequency distributions can be
graphs as well. Imagine a traffic engineer collecting data on intersections. Consider sample S
comprised of the monthly number of fatal automobile accidents at a particular intersection over a
period of sixteen months. If S = {4, 0, 1, 4, 2, 1, 1, 0, 0, 3, 0, 1, 0, 2, 0, 1} , then Figure A is a
frequency distribution of S in the form of a line graph, which is called a frequency polygon.
[Figure A: frequency polygon of S; vertical axis f from 0 to 7, horizontal axis x from 0 to 4]
Besides the line graph above, frequency distributions can take the form of bar graphs. Bar
graphs represent discrete variables with rectangles whose heights represent the frequency of each
score. The bar graph below is a frequency distribution for S.
[Bar Graph: bar graph of S; vertical axis f from 0 to 7, horizontal axis x from 0 to 4]
Some frequency distributions show the relative frequency of the data values. A relative
frequency distribution shows the fraction or percentage of the data set represented by a data
value. The table below is a relative frequency distribution for sample
S = {4, 0, 1, 4, 2, 1, 1, 0, 0, 3, 0, 1, 0, 2, 0, 1} .
x    0     1      2     3      4
f    3/8   5/16   1/8   1/16   1/8
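Each relative frequency is the frequency of a value divided by the number of scores in the sample. A minimal Python sketch of that calculation for S follows. It uses `fractions.Fraction` so the results appear as reduced fractions, and the names `n` and `rel_freq` are illustrative.

```python
from collections import Counter
from fractions import Fraction

S = [4, 0, 1, 4, 2, 1, 1, 0, 0, 3, 0, 1, 0, 2, 0, 1]
n = len(S)  # number of scores in the sample

freq = Counter(S)
rel_freq = {x: Fraction(freq[x], n) for x in sorted(freq)}

for x, r in rel_freq.items():
    # Show each relative frequency as a fraction and as a percentage.
    print(x, r, f"{float(r):.2%}")
```

The relative frequencies of a sample always sum to 1.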
Figure B is a relative frequency distribution for sample S in bar graph form.
[Figure B: bar graph of the relative frequencies of S; vertical axis from 0 to 0.4, horizontal axis x from 0 to 4]
Figure C is a relative frequency distribution of sample S in circle graph form. Circle
graphs divide the area inside a circle into wedges whose sizes represent the relative frequencies
of the data values.
[Figure C: circle graph of S with wedges labeled by value and relative frequency: 0, 37.50%; 1, 31.25%; 2, 12.50%; 3, 6.25%; 4, 12.50%]
Another type of frequency distribution is a grouped frequency distribution. A grouped
frequency distribution shows the frequencies of ranges of values. The ranges of values are
sometimes referred to as "classes" or "bins." Consider set D, a set of lengths.
D = {3.1, 3.6, 2.9, 4.1, 4.2, 2.2, 2.6, 3.1, 3.3, 4.2, 4.4, 5.0, 2.9, 4.1, 4.6}
It could be advantageous to organize this data according to ranges of values. Figure D is a
grouped frequency distribution using four classes in histogram form. Histograms represent
continuous variables with contiguous rectangles whose heights represent the frequency. The
histogram below is a grouped frequency distribution for set D.
[Figure D: histogram of D; vertical axis f from 0 to 7; classes on the horizontal axis: 1.5 ≤ x < 2.5, 2.5 ≤ x < 3.5, 3.5 ≤ x < 4.5, 4.5 ≤ x < 5.5]
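The frequencies behind a histogram like Figure D come from counting how many data values fall in each class. The Python sketch below does this for set D with the four classes from Figure D. The frequencies it prints are computed from D, and the variable names are illustrative.

```python
D = [3.1, 3.6, 2.9, 4.1, 4.2, 2.2, 2.6, 3.1, 3.3, 4.2, 4.4, 5.0, 2.9, 4.1, 4.6]

# (lower limit, upper limit) for each class in Figure D.
classes = [(1.5, 2.5), (2.5, 3.5), (3.5, 4.5), (4.5, 5.5)]

grouped = {}
for lower, upper in classes:
    # A value belongs to a class when lower <= x < upper.
    grouped[(lower, upper)] = sum(1 for x in D if lower <= x < upper)

for (lower, upper), f in grouped.items():
    print(f"{lower} <= x < {upper}: f = {f}")
```

For D the four class frequencies are 1, 6, 6, and 2, which sum to the fifteen lengths in the set.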
In Figure D, the numbers 1.5, 2.5, 3.5, and 4.5 represent lower class limits, the smallest
possible data values for the four classes respectively. The upper class limits for the four classes
are 2.5, 3.5, 4.5, and 5.5 respectively. Here, the upper class limits represent a boundary on the
largest possible data value for each class. Class limits are either the most extreme possible data
values (least or greatest) or a lower or upper bound on the possible data values. The uniform
class width equals the difference of any two successive upper class
limits. The class width for Figure D is 1, as calculated here: 5.5 − 4.5 = 1. The class mark is the
"middle" value of a class and can be calculated by dividing the sum of the lower and upper limits
of a class by two. The class marks for Figure D are calculated below.
(1.5 + 2.5) / 2 = 2,   (2.5 + 3.5) / 2 = 3,   (3.5 + 4.5) / 2 = 4,   (4.5 + 5.5) / 2 = 5
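The class width and class marks for Figure D can be reproduced with a few lines of Python. This is a minimal sketch, and the list names are illustrative.

```python
lower_limits = [1.5, 2.5, 3.5, 4.5]
upper_limits = [2.5, 3.5, 4.5, 5.5]

# Class width: difference of two successive upper class limits.
class_width = upper_limits[1] - upper_limits[0]

# Class mark: sum of the lower and upper limits of a class, divided by two.
class_marks = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

print(class_width)  # 1.0
print(class_marks)  # [2.0, 3.0, 4.0, 5.0]
```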
Application Exercise 4.2
Problems
Suppose NASA studies the effects of micro-gravity on the immune system. As part of this study,
NASA collects thirty blood samples from astronauts after six consecutive weeks in orbit and
records the number of white cells in thousands per cubic millimeter below.
3.6  5.9  6.3  5.1  5.0  7.2  5.2  9.3  8.1  7.1
9.9  9.2  5.9  9.9  5.7  7.9  9.9  8.4  6.0  8.5
6.7  7.9  7.7  4.4  8.0  4.7  6.9  7.8  9.1  4.9
#1
Create a grouped frequency distribution using seven classes.
#2
Represent the grouped frequency distribution from problem one using a frequency
polygon. Label each class using the class mark.
#3
Change the grouped frequency distribution from problem one into a relative frequency
distribution.
#1 width ≈ (9.9 − 3.6) / 7 = 6.3 / 7 = 0.9
x                  f
3.6 ≤ x < 4.5      2
4.5 ≤ x < 5.4      5
5.4 ≤ x < 6.3      4
6.3 ≤ x < 7.2      4
7.2 ≤ x < 8.1      6
8.1 ≤ x < 9.0      3
9.0 ≤ x ≤ 9.9      6
#2 [Frequency polygon: vertical axis f from 0 to 7; horizontal axis x labeled with the class marks 4.05, 4.95, 5.85, 6.75, 7.65, 8.55, and 9.45]
#3
x                  f/n
3.6 ≤ x < 4.5      1/15
4.5 ≤ x < 5.4      1/6
5.4 ≤ x < 6.3      2/15
6.3 ≤ x < 7.2      2/15
7.2 ≤ x < 8.1      1/5
8.1 ≤ x < 9.0      1/10
9.0 ≤ x ≤ 9.9      1/5
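The answers above can be checked with a short Python sketch. It rests on two assumptions that the listed frequencies require: each class includes its lower limit but not its upper limit, and the largest value, 9.9, is counted in the last class. The arithmetic is done in tenths so that values lying exactly on a class boundary, such as 6.3, compare exactly. The variable names are illustrative.

```python
from fractions import Fraction

counts = [3.6, 5.9, 6.3, 5.1, 5.0, 7.2, 5.2, 9.3, 8.1, 7.1,
          9.9, 9.2, 5.9, 9.9, 5.7, 7.9, 9.9, 8.4, 6.0, 8.5,
          6.7, 7.9, 7.7, 4.4, 8.0, 4.7, 6.9, 7.8, 9.1, 4.9]

# Work in whole tenths to avoid floating-point error at class boundaries.
tenths = [round(x * 10) for x in counts]
low, high = min(tenths), max(tenths)
num_classes = 7
# The range divides evenly here: 63 tenths / 7 classes = 9 tenths (0.9).
width = (high - low) // num_classes

freqs = [0] * num_classes
for t in tenths:
    # The maximum value would index an eighth class, so place it in the last.
    k = min((t - low) // width, num_classes - 1)
    freqs[k] += 1

n = len(counts)
for k, f in enumerate(freqs):
    lower = (low + k * width) / 10
    upper = (low + (k + 1) * width) / 10
    print(f"{lower:.1f} to {upper:.1f}: f = {f}, f/n = {Fraction(f, n)}")
```

The frequencies come out as 2, 5, 4, 4, 6, 3, and 6, matching the table for problem one, and the fractions match the table for problem three.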
Assignment 4.2
Problems
#1 Display the following data using a frequency polygon.
{1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 9, 9, 9, 9, 10, 10}
#2 Display the following data as a grouped frequency distribution using a histogram.
{0.2, 0.3, 0.7, 1.1, 1.7, 2.2, 2.3, 2.4, 2.5, 2.7, 2.8, 2.9, 3.3, 3.9}
#3 Use a frequency polygon to display a grouped frequency distribution for the data, which
represents the prices of grade A eggs (in dollars per dozen) for the indicated years.
Year   Price     Year   Price
1990   1.00      1996   1.03
1991   1.01      1997   1.02
1992   0.98      1998   0.97
1993   0.99      1999   0.99
1994   0.98      2000   1.09
1995   1.01      2001   1.10
#4 Use a pie chart to display the data as a relative frequency distribution. The numbers
represent the number of Nobel Prize laureates by country during the years from 1901 to
2002.
Country   Laureates
U.S.      270
U.K.      100
France    49
Sweden    30
Germany   77
Other     157
#5 Construct a grouped frequency distribution with six classes using a histogram for the data
set. The numbers represent the average amount in dollars spent on energy bills for a
month.
91   188  189  266  190  30   472  341  127  248
398  354  279  266  8    101  88   222  249  199
526  375  269  93   530  142  184  486  43   352