Topic (4) SUMMARIZING DATA –CENTER OR CENTRAL TENDENCY
I) QUANTITATIVE DATA
a) Median (50th percentile)
Defn: The MEDIAN of a data set is the middle value. That is,
when the data values are arranged from low to high, it is that
value in the list such that half of the data points are smaller and
the other half are larger.
For even number of observations:
The fish weights for the Tennessee River study (n=12) are: 986, 1023,
1266, 1398, 917, 1763, 1459, 778, 532, 441, 544, 897
1) first order them from low to high
2) median = average of the 2 middle values.
441, 532, 544, 778, 897, 917, 986, 1023, 1266, 1398, 1459, 1763
$$m = \text{median} = \frac{917 + 986}{2} = 951.5$$
6 of the observed values fall below m and the other 6 are
larger than m
For odd number of observations:
Thirteen fish weights are: 986, 1023, 1266, 1398, 917, 1763, 1459,
778, 532, 441, 544, 897, 1129
1) first order them from low to high
2) median = the middle value.
441, 532, 544, 778, 897, 917, 986, 1023, 1129, 1266, 1398, 1459, 1763
m = 986
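The two-step procedure above (order the values, then take the middle value or the average of the two middle values) can be sketched in Python. The function name `median` and the variable names are illustrative choices, not part of the original notes.

```python
def median(values):
    """Return the middle value of the sorted data, or the average of
    the two middle values when the number of observations is even."""
    ordered = sorted(values)      # step 1: order from low to high
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd n: a single middle value
        return ordered[mid]
    # even n: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

weights_12 = [986, 1023, 1266, 1398, 917, 1763, 1459, 778, 532, 441, 544, 897]
weights_13 = weights_12 + [1129]

print(median(weights_12))  # 951.5
print(median(weights_13))  # 986
```

With 13 values, six observations fall below the median and six fall above it, so the middle (7th) ordered value is the median.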
Important Point #1: The median is said to be robust
because it is resistant to outliers
Important Point #2: The sample median divides the total
area under the bars in a histogram in half.
Important Point #3: Populations also have medians
called the population median (M). This number divides
the area under the curve describing the population
frequency distribution in halves.
b) Arithmetic Mean
Defn: The MEAN of a data set is the average value. That is, it is
the value obtained by adding all of the numbers together and
dividing the result by the number of values in the sum (see
symbols later). The SAMPLE MEAN is denoted x̄
(pronounced “x-bar”). The POPULATION MEAN is denoted
µ (pronounced “mu”).
EXAMPLE
The fish lengths for the Tennessee River study are:
48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44
[Figure: dot plot of the 12 fish lengths; horizontal axis Length (cm), with ticks at 25, 30, 35, 40, 45, 50]
If each point has the same weight, where should the pivot point
be to balance the x-axis (i.e. keep it horizontal)?
Ans: the pivot point is the arithmetic mean
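One way to check the balance-point idea numerically: if every point has the same weight, the axis balances where the signed distances from the pivot sum to zero. The sketch below is only an illustration; the helper name `net_moment` is hypothetical, and the comparison pivot 44.5 is the median of these lengths.

```python
lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]

def net_moment(pivot, values):
    """Sum of signed distances from the pivot (equal weight per point).
    Zero means the left and right sides balance."""
    return sum(x - pivot for x in values)

mean = sum(lengths) / len(lengths)

# Pivot at the mean: the deviations cancel (up to floating-point rounding).
print(abs(net_moment(mean, lengths)) < 1e-9)   # True

# Pivot at the median (44.5): the deviations do not cancel,
# so the axis would tip toward the smaller values.
print(net_moment(44.5, lengths))               # -53.0
```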
To calculate the sample mean for these data: Sum the
data values and divide the result by n.
$$\bar{x} = \frac{48+45+49+51+44+49+46+28.5+26+25.5+25+44}{12} = \frac{481}{12} \approx 40.08$$
We say that the fish caught in the study averaged 40.08
cm in length.
Important Point #1: If one were able to observe the value
of every single element in a population (say, every single
fish in the Tennessee River in 1978), then it would be
possible to calculate the population mean µ. Since we
can’t do that, we say that an estimate of the population
mean µ is the sample mean x̄.
Important Point #2: Is the mean robust?
Ans: NO! Its value depends directly on every value in the
dataset, so a single extreme value can shift it.
Important Point #3: The mean of a set of data is the
fulcrum or balance point for the data.
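The difference in robustness between the mean and the median can be seen with a small experiment on the fish lengths. The outlier used here is hypothetical (the 51 cm fish mistakenly recorded as 510 cm) and is not part of the study data; `statistics.mean` and `statistics.median` come from the Python standard library.

```python
from statistics import mean, median

lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]

# Hypothetical recording error: the 51 cm fish is entered as 510 cm.
with_outlier = [510 if x == 51 else x for x in lengths]

print(round(mean(lengths), 2), median(lengths))            # 40.08 44.5
print(round(mean(with_outlier), 2), median(with_outlier))  # 78.33 44.5
```

The single bad value nearly doubles the mean, while the median is unchanged because the two middle ordered values (44 and 45) are the same in both lists.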
NOTATION:
X     denotes the NAME of the variable, e.g. LENGTH
x     denotes a value for the named variable, e.g. 48 cm
i     a subscript which denotes the index number for the observation, e.g. fish IDs run from 1 to 12
x_i   denotes the value for the ith observation (that is, the ith observed value), e.g. x_1 = 48, x_2 = 45, etc.
Σ     denotes the operation “SUM”
So, we can write
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
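The summation notation maps directly onto a loop over the index i. The following is a minimal sketch using the fish lengths; the function name `sample_mean` is an illustrative choice, and Python indexes from 0 where the notes index from 1.

```python
def sample_mean(x):
    """Compute x-bar = (x_1 + x_2 + ... + x_n) / n."""
    n = len(x)
    total = 0.0
    for i in range(n):   # i = 0, ..., n-1 here; i = 1, ..., n in the notes
        total += x[i]    # accumulate the sum of the x_i
    return total / n

lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]
x_bar = sample_mean(lengths)
print(round(x_bar, 2))  # 40.08
```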
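For frequency distributions, the relationship of the mean
to the median depends on the shape of the distribution:
Skewed to the right: the mean is typically pulled toward the long right tail (mean > median).
Skewed to the left: the mean is typically pulled toward the long left tail (mean < median).
Symmetric and unimodal: mean = median.
Uniform: mean = median.
Bimodal: the relationship depends on the size and position of the two peaks; if the distribution is symmetric, mean = median and both fall between the peaks.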
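The effect of shape on the two measures can be illustrated with small made-up datasets. All three lists below are hypothetical and chosen only to show the typical pattern; they do not come from the Tennessee River study.

```python
from statistics import mean, median

# Hypothetical data with a long right tail (one large value).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

# Mirror image of the same data: a long left tail.
left_skewed = [21 - v for v in right_skewed]

# Hypothetical symmetric data.
symmetric = [1, 2, 3, 4, 5, 6, 7]

for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, "mean =", mean(data), "median =", median(data))
```

In the right-skewed list the mean exceeds the median, in the mirrored list it falls below the median, and in the symmetric list the two coincide.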
Question: So, which measure of center do you use when?
II) CATEGORICAL DATA – BINARY DATA
In general, the summary statistics for categorical data are
the relative frequencies of each category in the dataset.
There is no such thing as a mean or average category; one can
only identify, for example, the most common or the least common
category. For the special case of binary data (only
two categories), the measure of central tendency is the
“Proportion of Successes”.
Defn: When there are only two possible outcomes, define
one category to be the “Success” (it’s the category you are
studying). The PROPORTION OF SUCCESSES, then,
is the fraction of observations that are successes. When
the dataset is a sample the SAMPLE PROPORTION is
denoted p; when the dataset is the entire population the
POPULATION PROPORTION is denoted π.
So we can write,
$$p = \frac{\#\,\text{successes in sample}}{\text{sample size}} = \frac{\#\,\text{successes}}{n}$$
$$\pi = \frac{\#\,\text{successes in population}}{\text{population size}} = \frac{\#\,\text{successes}}{N}$$
Example: Suppose a researcher is interested in the
recovery of submerged aquatic vegetation (SAV) in the
Chesapeake Bay. At each of 30 locations at which SAV
was historically found, the scientist categorizes the spot as
either medium to high amounts of SAV or low amount to
no SAV. 11 locations had medium to high SAV levels.
2 categories: Success = “medium or high SAV level”
Failure = “none or low SAV level”
The proportion of sample locations with medium to
high SAV levels is p = 11/30 = 0.3667.
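The sample proportion is simply the mean of the data once each success is coded as 1 and each failure as 0. A minimal sketch follows, assuming the 30 locations are stored as booleans; the ordering of the list is arbitrary and the function name is hypothetical.

```python
def proportion_of_successes(outcomes):
    """Fraction of observations that are successes.
    True counts as 1 and False as 0 when summed."""
    return sum(outcomes) / len(outcomes)

# 30 locations: 11 with medium to high SAV (success), 19 with low or no SAV.
# The order of the entries is arbitrary; only the counts come from the example.
sav_locations = [True] * 11 + [False] * 19

p = proportion_of_successes(sav_locations)
print(round(p, 4))  # 0.3667
```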
Example: Suppose the rate of the birth defect, spina
bifida, is 1 baby in every 100,000 live births.
Success = “has spina bifida”
$$\pi = \frac{1}{100{,}000} = 0.00001 = 1 \times 10^{-5}$$