LIS 397.1
Introduction to Research in
Library and Information
Science
Introduction to Statistics
R. E. Wyllys
Copyright 2003 by R. E. Wyllys
Last revised 2003 Feb 6
School of Information - The University of Texas at Austin
LIS 397.1, Introduction to Research in Library and Information Science
Lesson Objectives
• You will acquire an introductory
understanding of
– Descriptive statistics
– Inferential statistics
Statistics as a Tool for LIS
• Why is statistics an important tool for
LIS? Here are some reasons:
– LIS problems
• concern people and other complex entities
• concern interactions among these complex
entities
• typically involve many contributing, or
potentially contributing, variables
– Many LIS problems involve variables that
exhibit random or probabilistic variation
Statistics Is a Tool for
Making Decisions
• “Statistics is a method of decision
making in the face of uncertainty, on
the basis of numerical data, and at
calculated risks.”*
*From: Chou, Ya-Lun. (1969). Statistical Analysis with
Business and Economic Applications. New York, NY:
Holt, Rinehart and Winston.
Two Branches of Statistics
• Descriptive Statistics
– Numbers that describe or characterize groups:
mean, median, mode, total, range, standard
deviation, proportion, etc.
– Presentations of such numbers in charts and
tables
• Inferential Statistics
– The use of numbers from samples to provide
generalizations (inferences) about the populations
from which the samples came
Statistical Terms
• Population or Universe
– The set of entities in which the investigator is
interested, typically with the purpose of seeking
the value of certain characteristics of the
population
– “Population” is the term generally used in social
statistics; “universe,” in the mathematical theory of
statistics
• Sample
– A subset of a population that is examined and
from which inferences are drawn about the
characteristics of the population
Statistical Terms
• Parameter
– Measurable characteristic of a population
(universe)
• Sample value
– Measurable characteristic of a sample
– Value that a population parameter takes on
in a particular sample
Note: Some writers use the term “statistic” to refer to samples only,
not to populations. In LIS 397.1 we do not restrict it that way.
Types of Data
• There are four main types of data that it
is useful to distinguish
– Categorical (nominal)
– Ordinal
– Interval
– Ratio
• The next three slides discuss these data
types in more detail.
Types of Data (cont'd)
• Categorical (nominal) Numbers
– Numbers used like names, e.g., faces of a die
– No order is implied (although order may be used for
convenience in listing numbers for human viewing)
• Ordinal Numbers
– Numbers used to indicate position in a sequence,
e.g., a rank number
– Order is indicated, but “distance” between
positions is not indicated
Types of Data (cont'd)
• Interval Numbers
– Express order and distances or differences
– Result from measurements and counts
– May lack a true zero
• Ratio Numbers
– Express order and distances or differences
– Result from measurements and counts
– Do have a true zero
Note that all ratio numbers are interval numbers, but not all interval
numbers are ratio numbers
Interval vs. Ratio Numbers
• A temperature of 68° F might look twice as hot as 34° F.
But the same temperatures expressed in the Celsius
scale are 20° C and 1.1° C. This is because the zeros in
both the Celsius and Fahrenheit scales are arbitrary
(though the Celsius zero makes more physical sense than
the Fahrenheit zero).
• Only in the Kelvin and Rankine scales—in which the zero
of the scale is a true zero, Absolute Zero (-273.15° C = 0
K)—do temperatures express true relative hotness. Since
34° F = 274.26 K and 68° F = 293.15 K, you can see that
68° F is nowhere near twice as hot as 34° F.
• Celsius and Fahrenheit are interval, but not ratio, scales;
Kelvin is a ratio (and, hence, also an interval) scale.
Why Use Samples?
• Observing a whole population would enable
certainty about its characteristics—so why
settle for uncertainty?
• If it is feasible to do so, you may observe the
whole population
• But, observing the whole population may be
– Logically impossible
• Infinite populations
• Future populations
– Destructive
– Too expensive
– Too time-consuming
Virtues of Sampling
• Reduced cost
• Results available sooner
• Broader scope yields more information.
– For example, you could make the effort to collect
1000 observations in either of the following ways
• 1 variable observed on each of 1000 elements in the population
vs.
• 10 variables observed on each of 100 elements in a sample
– Which way would likely be more useful?
• Greater accuracy
– More attention possible to each observation
Random Sampling Is Basic to
Statistical Inference
• The mathematical theory of statistics requires that
each element of the population have a known
chance of being chosen in a sample
• In practical terms, this usually translates to: Each
element of the population must have the same
chance as any other element of being chosen in a
sample
– Strictly speaking, this is called equiprobable sampling
• Ordinarily, the term "random sampling" is used
loosely to describe both known-probability and
equiprobable sampling.
Experiments in Sampling
• Toss a coin 10 times, writing down the exact
sequence of Heads and Tails. Also write down the
proportion of Heads (e.g. 0.6 if there are 6 Heads).
• Repeat this experiment 3 more times.
• You probably got 3 or 4 different values for the
proportion of heads, even though, as you already
know, you expect that proportion to be 0.5 in the
long run.
• Calculate the proportion of Heads in all 40 tosses of
the coin. You will usually find that it is closer to 0.5 than
are most of the 10-toss proportions.
• This should suggest to you one of the benefits of
taking large samples whenever feasible.
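If you would rather not toss a real coin 40 times, the experiment can be imitated with a small Python sketch. It assumes that a pseudo-random generator is an acceptable stand-in for a fair coin; the seed value and the names are arbitrary choices made here for illustration.

```python
import random

rng = random.Random(2003)  # fixed seed (an arbitrary choice) so the run is repeatable


def toss_coin(n_tosses):
    """Return a list of n_tosses results, each 'H' or 'T' with equal chance."""
    return [rng.choice("HT") for _ in range(n_tosses)]


# Four experiments of 10 tosses each, as described above.
experiments = [toss_coin(10) for _ in range(4)]
proportions = [seq.count("H") / len(seq) for seq in experiments]

# Pool all 40 tosses and compute the overall proportion of Heads.
all_tosses = [toss for seq in experiments for toss in seq]
pooled = all_tosses.count("H") / len(all_tosses)

for seq, p in zip(experiments, proportions):
    print("".join(seq), p)
print("Pooled proportion over 40 tosses:", pooled)
```

Because the four experiments are the same size, the pooled proportion is simply the average of the four 10-toss proportions, so it can never be farther from 0.5 than the most extreme of them.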
More Experiments in
Sampling
• Each student is to think of a number in the
range of 1 through 12 and write it down
• Each student is to write down the number
corresponding to his or her birth month (Jan =
1, Feb = 2, . . . , Dec = 12)
• The point of these two experiments is to show
you that it is hard for humans to make truly
random choices.
Random Sampling Can be Tricky
• Random sampling aims to eliminate all
anticipatable biases, i.e., tendencies to favor
one element or class of elements over
another in the population
• Think about how you would choose randomly
from among
– Lumps of coal in a gondola car
– Laboratory rats in a large cage
– Human volunteers in a survey
• What problems might you encounter? Please
think carefully about this question before you
move to the next slide to see my answer.
Random Sampling Can be Tricky
• It can be difficult to choose truly at random
– Lumps of coal in a gondola car get shaken during the
train's journey. Small lumps sift to the bottom, so that
the easily chosen lumps on top tend to be larger than
the average lump in the whole car
– Laboratory rats in a large cage will try to evade the
person who is trying to capture them for an experiment.
This means that the rats that get captured will tend to be
among the less vigorous, less speedy rats.
– Human volunteers in a survey tend to be those with an
interest in the purpose of the survey or in making money
as paid subjects. Either way, a group of volunteers
rarely resembles the kind of group that would be chosen
at random.
Random Sampling
• How to choose, randomly, population
elements for a sample? Here are some
possible ways:
– Use mechanical means
• Toss a coin
• Toss a die or dice
• Draw straws or cards
• Random interval timers
– Use a printed table of random numbers
– Use a random-number generator in a computer
program
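As one illustration of the last option, here is a minimal Python sketch of drawing an equiprobable sample with the standard library's random-number generator. The population of 500 numbered records, the sample size of 25, and the seed are all made-up values for the example.

```python
import random

rng = random.Random(397)  # arbitrary fixed seed, for a repeatable illustration

# Hypothetical population: 500 catalog records identified by number.
population = list(range(1, 501))

# Equiprobable sampling without replacement: every record has the same
# chance of being chosen, and no record can be chosen twice.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

Letting the program pick the elements removes the human tendencies to favor some elements over others.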
Basic Statistical Measures
• Measures of Central Tendency (central or
typifying values for a set of numbers)
– Mean (arithmetic average)
– Median
– Mode
• Measures of Dispersion (the scatteredness,
the variability of a set of numbers)
– Range
– Standard deviation and variance
Basic Statistical Measures
• Measures of central tendency apply to
sets of 1 or more elements
• Measures of dispersion apply only to
sets of 2 or more elements
• Variability pervades the entire biological
world and the world of human activity
• Variability is usually vastly under-recognized and under-estimated
Basic Statistical Measures
• The mean is the arithmetic average of a set
of numbers. For example,
– The mean of 2, 5, and 8 is 5, since 15÷3 = 5.
• Note: In statistical work, this equation would usually be
written as 15/3 = 5.
– The mean of 1, 3, 2, and 8 is 3.5, since 14/4 = 3.5.
• Note that this mean is not among the set of numbers
from which it was calculated.
• That is, the mean of a set of numbers can be a number
that cannot characterize any element of the set.
• Think of 1, 3, 2, and 8 as the numbers of children in four
different families to see the point of the previous remark.
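The two examples above can be reproduced with a few lines of Python. The function name `mean` is just a convenient label for this sketch.

```python
def mean(observations):
    """Arithmetic average: the sum of the observations divided by their count."""
    return sum(observations) / len(observations)


print(mean([2, 5, 8]))     # 15 / 3
print(mean([1, 3, 2, 8]))  # 14 / 4
```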
Basic Measures of Central Tendency
• If we make a set of observations, we can call them X in
general, or Xi if we want to emphasize the individual
observations: e.g., X1, X2, . . ., X97, etc.
• Using this notation, and using ∑ to represent the
operation of summation (i.e., adding things up, as you
probably recall from high-school algebra), we can write
the mean of n observations X as
$$ \bar{X} = \frac{\sum X}{n} \quad \text{or} \quad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} $$
Basic Measures of Central Tendency
• The median of a set of observations is the middle value of
the observations when they have been arranged in order.
This turns out to be slightly different for odd numbers of
observations than for even numbers of observations. For
example,
– The median of the five observations 1, 3, 15, 16, and 17 is 15. As
this example shows, the median has the property that just as many
observations are smaller than it as are larger than it, making it a
meaningful middle value.
– Determining the median of the six observations 1, 2, 3, 5, 8, and 9
requires us to agree that what we will mean by the "middle value"
is the half-way point between the middle pair of observations.
Here the middle pair is 3 and 5, and the half-way point between
them is 4. (More complicated definitions exist, but the half-way
point idea conveys the essence of the median satisfactorily for
most practical purposes.)
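The odd and even cases can be sketched in Python as follows. This uses the simple half-way-point definition given above, not any of the more complicated definitions; the function name is an illustrative choice.

```python
def median(observations):
    """Middle value of the sorted observations (half-way point for even counts)."""
    ordered = sorted(observations)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle observation.
        return ordered[middle]
    # Even count: take the half-way point between the middle pair.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1, 3, 15, 16, 17]))
print(median([1, 2, 3, 5, 8, 9]))
```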
Basic Measures of Central Tendency
• The mode of a set of observations is the
most frequently occurring value in the set
(provided that there is a
most frequent value). For example,
– The mode of the observations 1, 2, 2, 3, 4, 5 is 2.
– The set of observations 1, 2, 3, 4, 5 has no mode.
– The set of observations 1, 2, 3, 3, 4, 5, 5 has, strictly
speaking, no mode. However, it is sometimes
convenient to call such a set bi-modal, i.e., to allow a
set to have 2 modes: e.g., 3 and 5 in this example.
• Though it would be logical to speak of tri-modal, quadri-modal,
etc. sets of observations, the basic idea of the mode is rarely
stretched beyond allowing for two modes.
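One way to sketch these three cases in Python is shown below. Returning a list of modes, and treating "no value repeats" as "no mode", are conventions assumed for this example; other conventions are possible.

```python
from collections import Counter


def modes(observations):
    """Return every value tied for the highest frequency, or [] if no value repeats."""
    counts = Counter(observations)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once, so there is no most frequent value.
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 2, 3, 4, 5]))     # one mode
print(modes([1, 2, 3, 4, 5]))        # no mode
print(modes([1, 2, 3, 3, 4, 5, 5]))  # two values tie: a bi-modal set
```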
Basic Measures of Central Tendency
• Why are there so many measures of
central tendency?
– Because each has its own special
advantages and disadvantages.
Advantages and Disadvantages of the
Mean
• Advantage: The mean always exists, and can always be
calculated by a basically simple formula (even though
large numbers of observations can be very tedious to
handle). For this reason, the mean is readily usable as a
basis for building a theory for making statistical inferences.
• Disadvantage: The mean can, by extremely large or
extremely small values among the set of observations, be
"distorted" into a value that fails to be useful as a
characteristic of the set. For example,
– The mean of 1, 2, and 1,000,000 is 333,334.33, which fails to
provide any useful idea about the actual observations in this set.
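A quick check of this example, using Python's standard `statistics` module, shows how differently the mean and the median respond to the one extreme value.

```python
import statistics

observations = [1, 2, 1_000_000]

mean_value = statistics.mean(observations)
median_value = statistics.median(observations)

print(round(mean_value, 2))  # pulled far away from 1 and 2 by the extreme value
print(median_value)          # unaffected by how large the extreme value is
```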
Advantages and Disadvantages of the
Median
• Advantage: The median is a useful way of describing sets
of observations that are skewed (i.e., distorted) by
including extremely large or small values.
– The classic example is that of household incomes. Few households
have very small positive incomes (or even negative incomes), but in
almost any set of households (e.g., in a city or state) there are a
few households with extremely high incomes (e.g., the households
of Michael Milken in the 1980s, or Michael Dell and Bill Gates in
the 1990s, with annual incomes in the hundreds of millions of
dollars).
Advantages and Disadvantages of the
Median (cont'd)
– Because the mean is badly distorted by extreme
incomes, it has become conventional to express
household incomes in terms of the median.
• For example, the Bureau of the Census might report that the
median household income in Bellevue, Washington is, say,
$40,000.
• The interpretation is that half of all the households in Bellevue
have incomes below $40,000 and half have incomes above
$40,000, so that the $40,000 median is a meaningful
characteristic of the set of household incomes
• In contrast, the mean household income in Bellevue might be
well over $1,000,000 since Bill Gates's house is located there.
Advantages and Disadvantages of the
Median (cont'd)
• Disadvantage: Because the median is defined
somewhat differently for odd numbers than for
even numbers of observations, it has been
historically less usable than the mean as a basis
for building a theory of statistical inference.
Advantages and Disadvantages of the
Mode
• Advantage: If a set of observations has a mode, then that
mode is a useful way of characterizing the set. For
example, "The most common result of tossing two dice is
that their top faces add up to 7."
• Disadvantage: Many sets of observations lack a mode
because no observed value occurs more than once; other
sets of observations may have several different "most
frequent" values; and in either case, the notion of a mode
has no useful value for characterizing the set of
observations. Furthermore, because of these definitional
difficulties, the mode fails to provide a solid basis for
building a theory of statistical inference.
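The two-dice claim can be checked by listing all 36 equally likely outcomes, as in the Python sketch below. Note that this counts theoretical outcomes for two fair dice rather than observed tosses.

```python
from collections import Counter
from itertools import product

# All 36 equally likely (first die, second die) outcomes, tallied by their total.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# The single most frequent total and the number of ways it can occur.
most_common_total, ways = totals.most_common(1)[0]

print(most_common_total, ways)
```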
Measures of Dispersion
• The range of a set of observations is the distance
between the smallest and largest observations in
the set. For example,
– The range of the set of observations 2, 4, 7 is 5. The
range of the set -10, -3, 4 is 14.
– Sometimes it may be desirable to take into account the
matter of whether the observations come from counts
(in which case they are necessarily whole numbers) or
from measurements (in which case the extent to which
the measurements are rounded off can play a role).
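In Python, the basic range calculation is a one-line function. The name `value_range` is chosen here only to avoid clashing with Python's built-in `range`; the sketch ignores the rounding question raised above.

```python
def value_range(observations):
    """Distance between the smallest and largest observations."""
    return max(observations) - min(observations)


print(value_range([2, 4, 7]))
print(value_range([-10, -3, 4]))
```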
Measures of Dispersion (cont'd)
• The standard deviation s of a set of n
observations Xi is calculated by a somewhat
complicated formula.
i n
s
( X
i 1
 X)
2
i
n 1
– In pre-computer days the formula required some effort
to use; now, however, computer programs (and even
some sophisticated calculators) make the calculation
trivial for the user.
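To see how little effort the calculation takes on a computer, here is a minimal Python sketch that transcribes the formula directly and compares the result with the standard library's `statistics.stdev`. The function name and the data are illustrative.

```python
import math
import statistics


def sample_standard_deviation(observations):
    """Standard deviation with the n - 1 divisor, as in the formula for s."""
    n = len(observations)
    mean = sum(observations) / n
    # Sum of the squared distances between each observation and the mean.
    sum_of_squares = sum((x - mean) ** 2 for x in observations)
    return math.sqrt(sum_of_squares / (n - 1))


data = [2, 5, 8]
print(sample_standard_deviation(data))  # distances -3, 0, 3; squares sum to 18; 18 / 2 = 9
print(statistics.stdev(data))           # the library routine uses the same n - 1 divisor
```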
Measures of Dispersion (cont'd)
• What does this complicated formula really
do?
– Inside the parentheses are the distances between each
individual observation and the mean, i.e., the center, of the
whole set of observations.
– Each of these distances is squared, and the squared
distances are added up.
– This sum is divided by n-1, which is almost the number
of numbers in the sum; i.e., the result is an "adjusted
average" of the squared distances.
i n
s
2
(
X

X
)
 i
i 1
n 1
Measures of Dispersion (cont'd)
– It is worth noting that if n is large (say, over 30 or so), then
there is really very little difference between dividing by n-1 and
dividing by n in obtaining this average squared distance between
individual observations and the center (i.e., the mean) of the
set of observations.
– Finally, the formula takes the square root of this average
squared distance, and calls it s, the standard deviation.
– Speaking somewhat loosely, we can say that the standard
deviation is a fancy average distance between the individual
observations and the center of the set of observations.
i n
s
2
(
X

X
)
 i
i 1
n 1
Advantages and Disadvantages of the
Range
• Advantages: The range is very simple to
calculate, and it provides a meaningful
characteristic of a set of observations, viz., the
total spread of the observations.
• Disadvantage: The range measures only the
total spread; it fails to take any account of the
scatteredness or clustering of the observations
other than the two extremes. For example,
– The set 1, 2, 3, 4, 5, 6, 7, 8, 9 has a range of 8.
– But so also does the set 1, 9, 9, 9, 9, 9, 9, 9, 9, which
is, overall, obviously less scattered than the previous
set.
Advantages and Disadvantages of the
Standard Deviation
• Advantages: The standard deviation can always be
calculated (i.e., its definition, though complicated,
never runs into logical difficulties). It provides a
meaningful characteristic of a set of observations that
takes every observation into account in developing a
number to express the scatteredness of the
observations. For example,
– The set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard
deviation s = 2.74.
– The set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard
deviation s = 2.67, reflecting the lesser scatteredness of this
set compared with the first.
– In short, the range fails to distinguish any difference in
scatteredness between these two sets, but the standard
deviation does measure a difference in their scatteredness.
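You can verify both figures, and the identical ranges, with a short Python sketch that uses the standard library's `statistics.stdev` (which divides by n-1). The variable names are illustrative.

```python
import statistics

evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mostly_nines = [1, 9, 9, 9, 9, 9, 9, 9, 9]

for data in (evenly_spread, mostly_nines):
    spread = max(data) - min(data)   # the range: identical for both sets
    s = statistics.stdev(data)       # the standard deviation: differs
    print(spread, round(s, 2))
```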
Advantages and Disadvantages of the
Standard Deviation
• Disadvantage: The formula for the
standard deviation is complicated (for
humans, though not for computer
programs, or even for sophisticated
calculators).
Statistical Inference
• Although the descriptive statistics that we
have been discussing are useful in their own
right, they are also important as a basis for
making inferences from a sample of
observations to characteristics of the
population from which the sample came. For
example,
– The mean of a sample can be used to suggest the
likely value of the mean of the population.
– The standard deviation of a sample can be used
to suggest the likely value of the standard
deviation of the population.
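To get a feel for this idea, you can build an artificial population in Python, draw a random sample from it, and compare the sample values with the population values. Everything in this sketch is hypothetical: the population of 10,000 loan counts, the sample size of 200, and the seed were invented for the illustration.

```python
import random
import statistics

rng = random.Random(2003)  # arbitrary fixed seed, for a repeatable illustration

# Hypothetical population: number of loans for each of 10,000 library patrons.
population = [rng.randint(0, 40) for _ in range(10_000)]

# Equiprobable sample of 200 patrons, drawn without replacement.
sample = rng.sample(population, k=200)

print("Population mean:", statistics.mean(population))
print("Sample mean:    ", statistics.mean(sample))
print("Population standard deviation:", statistics.pstdev(population))
print("Sample standard deviation:    ", statistics.stdev(sample))
```

Here `statistics.pstdev` divides by n because the whole population is known, while `statistics.stdev` divides by n-1 because the sample is being used to suggest the population value.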
Statistical Inference
• The theory of statistical inference deals with
how to use sample statistics in reasonable
and suitable ways in order to enable us to
infer various values of a population through
observing one or more samples of
observations drawn from that population.
• You will learn about inferential statistics in
School of Information research courses.
Goal of Statistics: To Pierce
through the Haze of
Obscuring Variation & Reveal
Underlying Patterns