Download Handout 1 - UTRGV Faculty Web

Sec 3.1 Measures of Central Tendency
The population arithmetic mean, μ (pronounced “mew”), is computed using all the
individuals in a population.
The population mean is a parameter.
If x1, x2, …, xN are the N observations of a variable from a population, then the
population mean, µ, is
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$
If x1, x2, …, xn are the n observations of a variable from a sample, then the sample
mean, x̄, is
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$
The median of a variable is the value that lies in the middle of the data when arranged
in ascending order.
We use M to represent the median.
Steps in Finding the Median of a Data Set:
• If the number of observations is odd, then the median is the data value exactly in the
middle of the data set. That is, the median is the observation that lies in the (n + 1)/2
position.
• If the number of observations is even, then the median is the mean of the two middle
observations in the data set. That is, the median is the mean of the observations that lie in
the n/2 position and the n/2 + 1 position.
Example: The following data represent the travel times (in minutes) to work for all seven
employees of a start-up web development company.
23, 36, 23, 18, 5, 26, 43
Determine the median of this data.
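The odd/even rule above can be checked with a short Python sketch. The helper names `mean` and `median` are illustrative choices, not part of the handout.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)

def median(values):
    """Median using the odd/even position rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the observation in position (n + 1)/2 (1-based).
        return ordered[mid]
    # Even n: mean of the observations in positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

travel_times = [23, 36, 23, 18, 5, 26, 43]
print(sorted(travel_times))          # [5, 18, 23, 23, 26, 36, 43]
print(median(travel_times))          # 23
print(round(mean(travel_times), 2))  # 24.86
```

With n = 7 the median is the 4th observation of the sorted data, which is 23 minutes.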
Example: Find the mean and median. Use the mean and median to identify the shape of the
distribution. Verify your result by drawing a histogram of the data.
The following data represent the asking price of homes for sale in Lincoln, NE.
The mean asking price is $168,320 and the median asking price is $148,700. Because the mean is
noticeably larger than the median, we would conjecture that the distribution is skewed right.
The mode of a variable is the most frequent observation of the variable that occurs in
the data set.
A set of data can have no mode, one mode, or more than one mode.
If no observation occurs more than once, we say the data have no mode.
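The three cases (no mode, one mode, more than one mode) can be sketched in Python as follows. The function name `modes` is an illustrative choice, and returning an empty list for "no mode" is an assumption made for this sketch.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # No observation occurs more than once: the data have no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([23, 36, 23, 18, 5, 26, 43]))  # [23]
print(modes([1, 2, 3]))                    # []
print(modes([1, 1, 2, 2, 3]))              # [1, 2]
```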
Sec 3.2 Measures of Dispersion
The range, R, of a variable is the difference between the largest data value and the
smallest data value. Range = R = Largest Data Value – Smallest Data Value
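The population standard deviation is symbolically represented by σ (lowercase Greek
sigma).
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
where x1, x2, . . . , xN are the N observations in the population and μ is the population
mean. The short-cut formula is as follows:
$\sigma = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}}$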
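The sample standard deviation, s, of a variable is
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
where x1, x2, . . . , xn are the n observations in the sample and x̄ is the sample mean.
The short-cut formula is as follows:
$s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}}$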
Example: The following data represent the travel times (in minutes) to work for all seven
employees of a start-up web development company.
23, 36, 23, 18, 5, 26, 43
Compute the population standard deviation of this data.
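As a check on this example, the sketch below computes the population standard deviation twice, once from the definition and once from the short-cut form. The function names are illustrative. The data are treated as a population because they cover all seven employees.

```python
import math

def population_std(values):
    """Definition: square root of the mean squared deviation from mu."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

def population_std_shortcut(values):
    """Short-cut form: uses only the sum and the sum of squares."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return math.sqrt((total_sq - total ** 2 / n) / n)

travel_times = [23, 36, 23, 18, 5, 26, 43]
print(round(population_std(travel_times), 2))           # 11.36
print(round(population_std_shortcut(travel_times), 2))  # 11.36
```

Both forms give σ ≈ 11.36 minutes. The short-cut form avoids computing each deviation, which helps when working by hand.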
The Empirical Rule:
If a distribution is roughly bell shaped, then
• Approximately 68% of the data will lie within 1 standard deviation of the mean. That is,
approximately 68% of the data lie between μ – 1σ and μ + 1σ.
• Approximately 95% of the data will lie within 2 standard deviations of the mean. That is,
approximately 95% of the data lie between μ – 2σ and μ + 2σ.
• Approximately 99.7% of the data will lie within 3 standard deviations of the mean. That
is, approximately 99.7% of the data lie between μ – 3σ and μ + 3σ.
Note: We can also use the Empirical Rule based on sample data with x̄ used in place
of μ and s used in place of σ.
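A minimal Python sketch of the Empirical Rule intervals follows. The mean of 100 and standard deviation of 15 are hypothetical values chosen for illustration and do not come from the handout. The function name is also illustrative.

```python
def empirical_rule_intervals(mean, std_dev):
    """Map k = 1, 2, 3 to (lower, upper, approximate percent of data).

    Only meaningful when the distribution is roughly bell shaped.
    """
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, pct)
        for k, pct in coverage.items()
    }

# Hypothetical bell-shaped variable with mean 100 and standard deviation 15.
for k, (low, high, pct) in empirical_rule_intervals(100, 15).items():
    print(f"about {pct}% of the data lie between {low} and {high}")
```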
Example: The following data represent the serum HDL cholesterol of the 54 female patients of a
family doctor.
(a) Compute the population mean and standard deviation.
(b) Draw a histogram to verify the data is bell-shaped.
(c) Determine the percentage of all patients that have serum HDL within 3 standard deviations of
the mean according to the Empirical Rule.
(d) Determine the percentage of all patients that have serum HDL between 34 and 69.1 according
to the Empirical Rule.
(e) Determine the actual percentage of patients that have serum HDL between 34 and 69.1.
Chebyshev’s Inequality:
For any data set or distribution (regardless of the shape of the distribution), at least
$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$
of the observations lie within k standard deviations of the mean, where k is any number greater
than 1. That is, at least
$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the data lie between μ – kσ and μ + kσ for k > 1.
Note: We can also use Chebyshev’s Inequality based on sample data.
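The bound is easy to evaluate for particular values of k. The sketch below uses an illustrative function name.

```python
def chebyshev_min_percent(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Inequality requires k > 1")
    return (1 - 1 / k ** 2) * 100

print(chebyshev_min_percent(2))            # 75.0
print(round(chebyshev_min_percent(3), 1))  # 88.9
```

For k = 2 the guarantee is at least 75%, compared with approximately 95% under the Empirical Rule. Chebyshev's bound is weaker because it holds for any shape of distribution.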
Sec 3.3 Measures of Central Tendency and Dispersion from Grouped Data
Approximate the Mean of a Variable from a Frequency Distribution
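Population Mean
$\mu = \frac{\sum x_i f_i}{\sum f_i} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{f_1 + f_2 + \cdots + f_n}$
Sample Mean
$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{f_1 + f_2 + \cdots + f_n}$
where xi is the midpoint or value of the ith class
fi is the frequency of the ith class
n is the number of classes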
Approximate the Standard Deviation of a Variable from a Frequency Distribution
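Population Standard Deviation
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}}$
Sample Standard Deviation
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}}$
where xi is the midpoint or value of the ith class
fi is the frequency of the ith class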
Example: The National Survey of Student Engagement is a survey that (among other things)
asked first year students at liberal arts colleges how much time they spend preparing for class
each week. The results from the 2007 survey are summarized below. Approximate the standard
deviation of the number of hours spent preparing for class each week.
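The survey table is not reproduced in this transcript, so the sketch below uses a small hypothetical frequency distribution to show how the grouped-data formulas are applied. The class midpoints, the frequencies, and the function names are all assumptions made for illustration.

```python
import math

def grouped_mean(midpoints, freqs):
    """Approximate mean: sum of (midpoint * frequency) over total frequency."""
    return sum(x * f for x, f in zip(midpoints, freqs)) / sum(freqs)

def grouped_std(midpoints, freqs, sample=True):
    """Approximate standard deviation from class midpoints and frequencies."""
    center = grouped_mean(midpoints, freqs)
    squared = sum((x - center) ** 2 * f for x, f in zip(midpoints, freqs))
    # Sample form divides by (sum of frequencies) - 1, population form by the sum.
    divisor = sum(freqs) - 1 if sample else sum(freqs)
    return math.sqrt(squared / divisor)

# Hypothetical classes 0-4.9, 5-9.9, 10-14.9 hours (NOT the survey data).
midpoints = [2.5, 7.5, 12.5]
freqs = [10, 20, 10]

print(grouped_mean(midpoints, freqs))                          # 7.5
print(round(grouped_std(midpoints, freqs, sample=True), 2))    # 3.58
print(round(grouped_std(midpoints, freqs, sample=False), 2))   # 3.54
```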
The weighted mean, $\bar{x}_w$, of a variable is found by multiplying each value of the variable by its
corresponding weight, adding these products, and dividing this sum by the sum of the weights.
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$
where wi is the weight of the ith observation
xi is the value of the ith observation.
Example: Bob goes to the “Buy the Weigh” Nut store and creates his own bridge mix. He
combines 1 pound of raisins, 2 pounds of chocolate covered peanuts, and 1.5 pounds of cashews.
The raisins cost $1.25 per pound, the chocolate covered peanuts cost $3.25 per pound, and the
cashews cost $5.40 per pound. What is the cost per pound of this mix?
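In this example the weights are the pounds of each ingredient and the values are the prices per pound. The sketch below works it through, with an illustrative function name.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

prices_per_pound = [1.25, 3.25, 5.40]  # raisins, chocolate covered peanuts, cashews
pounds = [1, 2, 1.5]

cost_per_pound = weighted_mean(prices_per_pound, pounds)
print(round(cost_per_pound, 2))  # 3.52
```

The total cost is 1.25 + 6.50 + 8.10 = $15.85 for 4.5 pounds, or about $3.52 per pound.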
Sec 3.4 Measures of Position
The z-score represents the distance that a data value is from the mean in terms of the number of
standard deviations.
Population z-score
$z = \frac{x - \mu}{\sigma}$
Sample z-score
$z = \frac{x - \bar{x}}{s}$
The z-score is unitless. It has mean 0 and standard deviation 1.
Example: The mean height of males 20 years or older is 69.1 inches with a standard
deviation of 2.8 inches. The mean height of females 20 years or older is 63.7 inches
with a standard deviation of 2.7 inches. Data is based on information obtained from the
National Health and Examination Survey. Who is relatively taller, Kevin Garnett or Candace Parker?
Kevin Garnett’s height is 4.96 standard deviations above the mean. Candace
Parker’s height is 4.56 standard deviations above the mean. Kevin Garnett is relatively
taller.
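The comparison can be reproduced with the z-score formula. The two heights are not stated in this transcript. The values of 83 inches and 76 inches below are assumptions chosen because they reproduce the stated z-scores of 4.96 and 4.56.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Assumed heights in inches (not given in the handout text).
garnett_height = 83
parker_height = 76

z_garnett = z_score(garnett_height, 69.1, 2.8)  # compared with males
z_parker = z_score(parker_height, 63.7, 2.7)    # compared with females

print(round(z_garnett, 2))  # 4.96
print(round(z_parker, 2))   # 4.56
```

Each height is compared with the mean and standard deviation of its own group, so the larger z-score identifies the relatively taller person.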
The kth percentile, denoted Pk, of a set of data is a value such that k percent of the
observations are less than or equal to the value.
Example: Interpret the percentile
The Graduate Record Examination (GRE) is a test required for admission to many U.S. graduate
schools. The University of Pittsburgh Graduate School of Public Health requires a GRE score no
less than the 70th percentile for admission into their Human Genetics MPH or MS program.
Interpret this admissions requirement.
In general, the 70th percentile is the score such that 70% of the individuals who took the exam
scored worse, and 30% of the individuals scored better. In order to be admitted to this program,
an applicant must score as high or higher than 70% of the people who take the GRE. Put another
way, the individual’s score must be in the top 30%.
Quartiles divide data sets into fourths, or four equal parts.
• The 1st quartile, denoted Q1, divides the bottom 25% of the data from the top 75%.
Therefore, the 1st quartile is equivalent to the 25th percentile.
• The 2nd quartile divides the bottom 50% of the data from the top 50% of the data, so
that the 2nd quartile is equivalent to the 50th percentile, which is equivalent to the
median.
• The 3rd quartile divides the bottom 75% of the data from the top 25% of the data, so that
the 3rd quartile is equivalent to the 75th percentile.
Example: A group of Brigham Young University—Idaho students (Matthew Herring, Nathan
Spencer, Mark Walker, and Mark Steiner) collected data on the speed of vehicles traveling
through a construction zone on a state highway, where the posted speed was 25 mph. The
recorded speed of 14 randomly selected vehicles is given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
Find and interpret the quartiles for speed in the construction zone.
Step 1: The data is already in ascending order.
Step 2: There are n = 14 observations, so the median, or second quartile, Q2, is the mean of the
7th and 8th observations. Therefore, M = 32.5.
Step 3: The median of the bottom half of the data is the first quartile, Q1.
20, 24, 27, 28, 29, 30, 32
The median of these seven observations is 28. Therefore, Q1 = 28. The median of the top half of
the data is the third quartile, Q3. Therefore, Q3 = 38.
• 25% of the speeds are less than or equal to the first quartile, 28 miles per hour, and
75% of the speeds are greater than 28 miles per hour.
• 50% of the speeds are less than or equal to the second quartile, 32.5 miles per hour,
and 50% of the speeds are greater than 32.5 miles per hour.
• 75% of the speeds are less than or equal to the third quartile, 38 miles per hour, and
25% of the speeds are greater than 38 miles per hour.
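The median-of-halves procedure used in this example can be sketched in Python. The function names are illustrative. Excluding the median from both halves when n is odd is an assumption, since the example only covers an even n, and statistical software often uses slightly different quartile rules.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median of the bottom and top halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    bottom = ordered[:half]
    # Assumption: for odd n, the median itself belongs to neither half.
    top = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(bottom), median(ordered), median(top)

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
print(quartiles(speeds))  # (28, 32.5, 38)
```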
The interquartile range, IQR, is the range of the middle 50% of the observations in a data set.
That is, the IQR is the difference between the third and first quartiles and is found using the
formula
IQR = Q3 – Q1
Example: A group of Brigham Young University—Idaho students (Matthew Herring, Nathan
Spencer, Mark Walker, and Mark Steiner) collected data on the speed of vehicles traveling
through a construction zone on a state highway, where the posted speed was 25 mph. The
recorded speed of 14 randomly selected vehicles is given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
Determine and interpret the interquartile range of the speed data.
The interquartile range is IQR = Q3 – Q1 = 38 – 28 = 10 miles per hour. That is, the range of
the middle 50% of the speeds of cars traveling through the construction zone is 10 miles per hour.
Suppose a 15th car travels through the construction zone at 100 miles per hour. How does this
value impact the mean, median, standard deviation, and interquartile range?
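One way to explore this question is to compute the four measures with and without the extra car. The sketch below uses Python's standard `statistics` module. The helper names are illustrative, the sample standard deviation is used because the vehicles are a random sample, and the quartiles follow the median-of-halves rule with the median excluded when n is odd (an assumption).

```python
import statistics

def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the bottom and top halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    bottom = ordered[:half]
    top = ordered[half + 1:] if n % 2 else ordered[half:]
    return statistics.median(top) - statistics.median(bottom)

def summarize(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "iqr": iqr(values),
    }

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
before = summarize(speeds)
after = summarize(speeds + [100])
for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

Under these assumptions the mean rises from about 32.1 to about 36.7 and the standard deviation roughly triples (about 6.2 to about 18.5). The median moves only from 32.5 to 33 and the IQR from 10 to 11. The median and IQR are resistant to extreme values, whereas the mean and standard deviation are not.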
Checking for Outliers by Using Quartiles
Step 1 Determine the first and third quartiles of the data.
Step 2 Compute the interquartile range.
Step 3 Determine the fences. Fences serve as cutoff points for determining outliers.
Lower Fence = Q1 – 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)
Step 4 If a data value is less than the lower fence or greater than the upper fence, it is
considered an outlier.
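Steps 2 through 4 can be sketched as follows, using the quartiles Q1 = 28 and Q3 = 38 found earlier for the 14 recorded speeds. The function names are illustrative. In the last line the quartiles of the original 14 vehicles are reused for the data with a 100 mph car added, which is a simplification.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
print(fences(28, 38))                         # (13.0, 53.0)
print(find_outliers(speeds, 28, 38))          # []
print(find_outliers(speeds + [100], 28, 38))  # [100]
```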
Sec 3.5 The Five-Number Summary and Boxplots
The five-number summary of a set of data consists of the smallest data value, Q1, the median,
Q3, and the largest data value. We organize the five-number summary as follows:
MINIMUM   Q1   M   Q3   MAXIMUM
Drawing a Boxplot
Step 1: Determine the lower and upper fences.
Lower Fence = Q1 – 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)
where IQR = Q3 – Q1
Step 2: Draw a number line long enough to include all necessary values. Insert vertical lines
at Q1, M, and Q3. Enclose these vertical lines in a box.
Step 3: Label the lower and upper fences.
Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower
fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence.
These values are called adjacent values and these lines are called whiskers.
Step 5: Any data values less than the lower fence or greater than the upper fence are
outliers and are marked with an asterisk (*).
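The quantities needed for Steps 1 through 5 can be computed before any drawing is done. The sketch below is one possible way to do it, applied to the construction-zone speeds from Sec 3.4 with a 100 mph vehicle added. All names are illustrative, and the quartile rule (median of halves, median excluded for odd n) is an assumption.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def boxplot_parts(values):
    """Five-number summary, fences, whisker ends, and outliers."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values that are not outliers; the whiskers end at the extremes of these.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "five_number_summary": (ordered[0], q1, median(ordered), q3, ordered[-1]),
        "fences": (lower_fence, upper_fence),
        "whisker_ends": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < lower_fence or x > upper_fence],
    }

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40, 100]
parts = boxplot_parts(speeds)
print(parts["five_number_summary"])  # (20, 28, 33, 39, 100)
print(parts["fences"])               # (11.5, 55.5)
print(parts["whisker_ends"])         # (20, 40)
print(parts["outliers"])             # [100]
```

The box spans Q1 to Q3, the whiskers run out to 20 and 40, and the 100 mph observation would be marked with an asterisk.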
Example: Every six months, the United States Federal Reserve Board conducts a
survey of credit card plans in the U.S. The following data are the interest rates charged
by 10 credit card issuers randomly selected for the July 2005 survey. Construct a
boxplot of the data.
Step 1: The interquartile range (IQR) is 14.4% – 12% = 2.4%. The lower and upper fences are:
Lower Fence = Q1 – 1.5(IQR) = 12 – 1.5(2.4) = 8.4%
Upper Fence = Q3 + 1.5(IQR) = 14.4 + 1.5(2.4) = 18.0%
Step 2: Draw the boxplot.
Use a boxplot and quartiles to describe the shape of a distribution.
The interest rate boxplot indicates that the distribution is skewed left.