Math Notes - Summit School District

Advanced Algebra 2 - Math Notes: Appendix C and Unit 9
C.1.1- Interquartile Range and Boxplots
Quartiles are points that divide a data set into four equal parts (and thus, the use of the prefix “quar” as in
“quarter”). One of these points is the median. The first quartile (Q1) is the median of the lower half, and the third
quartile (Q3) is the median of the upper half.
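To find quartiles, the data set must be placed in order from smallest to largest. Note that if there is an odd number of
data values, the median is not included in either half of the data set.
Suppose you have the data set: 22, 43, 14, 7, 2, 32, 9, 36, and 12. In order, the data are 2, 7, 9, 12, 14, 22, 32, 36, 43,
so the median is 14. The lower half (2, 7, 9, 12) has a median of 8, so Q1 = 8. The upper half (22, 32, 36, 43) has a
median of 34, so Q3 = 34.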
The interquartile range (IQR) is the difference between the third and first quartiles. It is used to measure the spread (the
variability) of the middle fifty percent of the data. The interquartile range is 34 – 8 = 26.
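The quartile procedure described above can be sketched in Python. This is a minimal illustration, assuming the median-of-halves rule from these notes; the function name `quartiles` is made up for the example and is not part of any library.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)          # quartiles require ordered data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # skip the median itself when n is odd
    return median(lower), median(ordered), median(upper)

data = [22, 43, 14, 7, 2, 32, 9, 36, 12]
q1, med, q3 = quartiles(data)
iqr = q3 - q1
print(q1, med, q3, iqr)  # 8.0 14 34.0 26.0
```

Because there are nine values, the median (14) is left out of both halves, which is why Q1 and Q3 are each the median of four values.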
A boxplot (also known as a box-and-whisker plot) displays a five-number summary of data: minimum, first quartile,
median, third quartile, and maximum. The box contains “the middle half” of the data and visually displays how large the
IQR is. The right segment represents the top 25% of the data and the left segment represents the bottom 25% of the
data. A boxplot makes it easy to see where the data are spread out and where they are concentrated. The wider the box,
the more the data are spread out.
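As a rough sketch, the five-number summary behind a boxplot can be computed with Python's standard library. The example reuses the data set 22, 43, 14, 7, 2, 32, 9, 36, 12. It assumes that `statistics.quantiles` with `method="exclusive"` is an acceptable stand-in for the median-of-halves rule; the two agree for this data set, but they are not guaranteed to agree for every data set.

```python
from statistics import quantiles

data = [22, 43, 14, 7, 2, 32, 9, 36, 12]

# n=4 asks for the three cut points that split the data into quarters.
# Assumption: the "exclusive" method matches the notes' rule for this data.
q1, med, q3 = quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, med, q3, max(data))
box_width = q3 - q1   # the width of the box is the IQR

print(five_number_summary)  # (2, 8.0, 14.0, 34.0, 43)
```

The box would run from 8 to 34, with the left segment reaching down to 2 and the right segment reaching up to 43.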
C.1.2 - Describing Shape (of a Data Distribution)
Statisticians use the words below to describe the shape of a data distribution.
Outliers are any data values that are far away from the bulk of the data distribution. In the example below, data values in
the right-most bin are outliers. Outliers are marked on a modified boxplot with a dot.
C.1.3 - Describing Spread (of a Data Distribution)
A distribution of data can be summarized by describing its center, shape, spread, and outliers. You have learned three
ways to describe the spread.
Interquartile Range (IQR)
The variability, or spread, in the distribution can be numerically summarized with the interquartile range (IQR). The IQR
is found by subtracting the first quartile from the third quartile. The IQR is the range of the middle half of the data. IQR
can represent the spread of any data distribution, even if the distribution is not symmetric or has outliers.
Standard Deviation
Either the interquartile range or standard deviation can be used to represent the spread if the data is symmetric and has no
outliers. The standard deviation is the square root of the average of the distances to the mean, after the distances have
been made positive by squaring.
For example, for the data 10, 12, 14, 16, 18 kilograms:
• The mean is 14 kg.
• The distances of each data value to the mean are –4, –2, 0, 2, 4 kg.
• The distances squared are 16, 4, 0, 4, 16 kg².
• The mean distance-squared is 8 kg².
• The square root is 2.83. The standard deviation is 2.83 kg.
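The steps of this worked example can be traced in a short Python sketch. The variable names are chosen for illustration only. The last line compares the result with `statistics.pstdev`, which divides by the number of values in the same way the example does.

```python
from math import sqrt
from statistics import mean, pstdev

data_kg = [10, 12, 14, 16, 18]

mu = mean(data_kg)                          # 14
deviations = [x - mu for x in data_kg]      # [-4, -2, 0, 2, 4]
squared = [d ** 2 for d in deviations]      # [16, 4, 0, 4, 16]
mean_squared = mean(squared)                # 8
std_dev = sqrt(mean_squared)

print(round(std_dev, 2))                    # 2.83
print(abs(std_dev - pstdev(data_kg)) < 1e-12)  # True
```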
Range
The range (maximum minus minimum) is usually not a very good way to describe the spread because it considers only the
extreme values in the data, rather than how the bulk of the data is spread.
9.1.1 - Population Parameters and Sample Statistics
The entire group that you are interested in studying and making conclusions about is called the population. Some
questions can be investigated by studying every member of the population. For example, you can figure out how many
students in your class have siblings by asking every student. This process of measuring every member of a population is
called taking a census. Numerical summaries, such as mean and standard deviation, computed from a population census
are called parameters of that population.
The United States government performs a census every ten years. The government uses this data to learn such things as
how the population is changing, where people live, what types of families exist, and what languages are spoken. For
example, the 2010 U.S. Census counted 308,745,538 people and 4,239,587 of them were over the age of 85.
To answer some questions, a census is not possible. It might be too expensive or impractical. For example, surveying
every math student in your state may be too time-consuming. It might be that the object being measured is destroyed
during the experiment, as when determining how strong bicycle tires are by filling every single bicycle tire with air until it
explodes. When it is not possible or practical to take a census, a portion of the population, called a sample, is measured or
surveyed. Numerical summaries of a sample are called statistics.
For example, if all of the people in this class make up our population, then every fifth member of our class is a possible
sample. The whole-class (population) average on the final exam is a parameter. The average test score of the students in
the sample is a statistic.
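The difference between a parameter and a statistic can be illustrated with a small Python sketch. The exam scores below are hypothetical values invented for the example, and the class size of 20 is an assumption.

```python
from statistics import mean

# Hypothetical final-exam scores for a class of 20 (illustrative values only).
scores = [88, 92, 75, 64, 97, 81, 70, 85, 90, 78,
          66, 95, 83, 72, 89, 91, 77, 68, 84, 79]

# Parameter: computed from every member of the population.
population_mean = mean(scores)

# Sample: every fifth member of the class (the 5th, 10th, 15th, and 20th).
sample = scores[4::5]

# Statistic: computed from the sample only.
sample_mean = mean(sample)

print(population_mean, sample, sample_mean)
```

The statistic generally differs from the parameter, and a different sample would usually give a different statistic.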
9.2.1 - Observational Studies and Experiments
In an observational study, data is collected by observing but without imposing any kind of change. A survey is one type
of observational study. Observational studies are often plagued by lurking variables, that is, hidden variables that are not
part of the study, but that are the true cause of an association. For example, the number of firefighters at a fire is
associated with the amount of damage at the scene of the fire. But of course firefighters are not the cause of the
damage! The lurking variable, “size of the fire,” causes both the number of firefighters and the amount of damage to
increase.
To determine cause and effect, often a randomized, controlled experiment is conducted. In an experiment, subjects are
randomly divided between two or more groups, and a treatment (a change) is imposed on at least one of the
groups. Randomized experiments can be used to determine cause and effect because the influence of lurking variables,
even though they are unknown, has mostly been equalized among all the groups. If there is a difference among groups, it
is most likely due to the treatment since everything else is mostly the same.
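Random assignment to groups might be sketched as follows. This is only an illustration under stated assumptions: the helper `randomly_assign`, the subject labels, and the group size are all hypothetical, and a fixed seed is used only so the example is repeatable.

```python
import random

def randomly_assign(subjects, n_groups=2, seed=None):
    """Shuffle the subjects, then deal them round-robin into groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)      # copy so the original order is untouched
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

# Hypothetical subject labels for illustration.
subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomly_assign(subjects, n_groups=2, seed=0)

print(len(treatment), len(control))  # 10 10
```

Because chance alone decides who lands in each group, any lurking variable tends to be spread roughly evenly between the treatment and control groups.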
9.3.1 - Symbols for Standard Deviation
In statistics, standard symbols are used to represent the mean and standard deviation.
The number of values in a data set is often represented with n.
The population parameters are written using lower-case Greek letters such as µ (pronounced “myoo”) for the mean,
and 𝜎 (pronounced “sigma”) for the population standard deviation.
When the data are a sample, 𝑥̅ (pronounced “x-bar”) represents the sample mean, and s is used to represent the sample
standard deviation.
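One way to keep the symbols straight is to map them onto Python's `statistics` module, as in the sketch below. Note an assumption that goes beyond these notes: by the usual convention, the sample standard deviation s divides by n – 1 rather than n, which is what `statistics.stdev` does, while `statistics.pstdev` divides by n.

```python
from statistics import mean, pstdev, stdev

data = [10, 12, 14, 16, 18]
n = len(data)          # number of values

# If the data are the whole population:
mu = mean(data)        # population mean
sigma = pstdev(data)   # population standard deviation (divides by n)

# If the same data are only a sample:
x_bar = mean(data)     # sample mean
s = stdev(data)        # sample standard deviation (divides by n - 1)

print(n, mu, round(sigma, 2), x_bar, round(s, 2))  # 5 14 2.83 14 3.16
```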
9.3.3 - Normal Density Function
For many situations in science, business, and industry, data may be represented by a bell-shaped curve. In order to be able
to work mathematically with that data – describing it to others, making predictions – an equation called a normal
probability density function is fitted to the data. This is very similar to how a line of best fit is used to describe and
make predictions of data on a scatterplot.
The normal probability density function (“normal distribution”) stretches to infinity in both directions. The area under the
normal distribution can be thought of as modeling the bars on a relative frequency histogram. However, instead of
drawing all the bars, a curve is drawn that represents the tops of all the bars. Like bars on a relative frequency histogram,
the area under the normal distribution (shaded in the diagram below) represents the portion of the population within that
interval.
The total area under the curve is 1, representing 100% of the population. The mean of the population is at the peak of the
normal distribution, resulting in 50% of the population below the mean and 50% above the mean.
The width of the normal distribution is determined by the standard deviation (the variability) of the population under
study. The more variability in the data, the wider (and flatter) the normal distribution is.
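These properties can be checked numerically with `statistics.NormalDist` from Python's standard library. The mean of 100 and the standard deviations of 5 and 15 are arbitrary values assumed for the sketch, not values from the notes.

```python
from statistics import NormalDist

# Two hypothetical populations with the same mean but different variability.
narrow = NormalDist(mu=100, sigma=5)
wide = NormalDist(mu=100, sigma=15)

# Area under the curve to the left of the mean: half the population.
below_mean = narrow.cdf(100)

# Area over an interval: the portion of the population between 95 and 105.
share_95_to_105 = narrow.cdf(105) - narrow.cdf(95)

# Height of each curve at its peak (the mean).
peak_narrow = narrow.pdf(100)
peak_wide = wide.pdf(100)

print(below_mean)               # 0.5
print(peak_narrow > peak_wide)  # True
```

The population with the larger standard deviation has the lower peak, which matches the description of a wider, flatter curve.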