USCMP Functions, Statistics and Trigonometry:
Ch 1; Making Sense of Data
Note to students: You may have learned the skills presented in the first sections of
this chapter in previous years, thus it may appear easy or elementary. But the
chapter does build up from this foundation and you will be lost by the end if you do
not follow closely in the early and middle sections.
1-1: Collecting data
Data is information. The singular form is datum, the origin of which is the Latin word
for fact.
Statistics: the branch of mathematics dealing with the collection, organization,
analysis and interpretation of information (data), usually in numerical form.
Variable: any characteristic of a person or thing which can be classified, counted,
ordered or measured.
Population: Set of all people or objects you wish to study.
Sample: A subset of the population which is studied.
Survey: gathering facts or opinions through an interview or questionnaire.
Random Sample: Each member of the population has an equal chance of being
chosen.
Content: the capture-recapture method is used to estimate population size in wild animals.
A number of animals are captured, tagged and released. Later on, a sample is
captured. Assuming that the sample is chosen randomly, the ratio of tagged animals
in the sample to the number of animals in the sample is nearly the same as the ratio
of tagged animals in the entire population to the population size P:
$\dfrac{\text{number of tagged animals in sample}}{\text{number of animals in sample}} = \dfrac{\text{number of tagged animals in population}}{P}$
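The capture-recapture proportion can be solved for P. The short Python sketch below is only an illustration: the function name and the numbers are made up for this example and do not come from the textbook.

```python
def estimate_population(tagged_in_population, sample_size, tagged_in_sample):
    """Solve tagged_in_sample / sample_size = tagged_in_population / P for P."""
    if tagged_in_sample == 0:
        raise ValueError("no tagged animals in the sample; P cannot be estimated")
    return tagged_in_population * sample_size / tagged_in_sample


# Hypothetical numbers: 50 animals were tagged and released.
# Later, 40 animals are caught and 8 of them carry tags.
estimate = estimate_population(50, 40, 8)
print(estimate)  # 250.0
```

Here 8/40 of the sample is tagged, so the 50 tagged animals are assumed to be about 8/40 of the whole population, giving an estimate of 250.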
1-2: Tables and Graphs
Circle Graph or Pie Chart
Calculate the measure of the central angle corresponding to each part of the whole.
The total circle has 360°.
For each piece of data, take its percentage of the whole (written as a decimal) and
multiply it by 360°.
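As a rough illustration of the central-angle calculation, here is a small Python sketch. The category names and counts are invented for the example.

```python
def central_angles(counts):
    """Return the central angle, in degrees, for each category of a circle graph."""
    total = sum(counts.values())
    return {label: 360 * count / total for label, count in counts.items()}


# Hypothetical survey of how 50 students get to school.
angles = central_angles({"walk": 10, "bus": 25, "car": 15})
for label, angle in angles.items():
    print(label, angle)
```

In this example the walk sector is 10/50 = 20% of the whole, so its angle is 0.20 × 360° = 72°. The angles of all sectors should add up to 360°.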
In a circle graph, normally, the areas of sectors in the graph are proportional to the
percentages they represent. Viewed from a different angle, however, the graph
becomes distorted. The graphs below both show the same data, but the right graph
appears as if viewed from an angle.
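[Figure: two circle graphs of the same data, with sectors labeled 1 through 4; the right graph is drawn as if viewed from an angle.]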
Often multiple graphs are made to compare data from different samples.
1-3: Other Displays
Bar graphs: The bar graph has variables on one axis and a numerical scale
(percents or counts) on the other axis. To display data accurately,
numerical scales should begin at 0. William Playfair was the pioneer of graphing
changes in a variable over time.* Data about changes in a variable over time is
called time-series data. As an alternative to a bar graph, time-series data can also
be graphed on a coordinate plane. In a scatter plot, points are plotted but not
connected. In a line graph, the points are connected with line segments.
[see identical data in 3 representations on p. 17 (Boston population data)]
• Here it is easier to represent the unequal intervals of time on the coordinate
  graphs.
The average rate of change between two points on the coordinate plane is the slope of
the line segment connecting the two points.
Slope formula: slope $m = \dfrac{y_2 - y_1}{x_2 - x_1}$
Reading from left to right, if a graph slants up its slope is positive, if the graph slants
down its slope is negative, and if the graph is horizontal its slope is 0. If a graph has
positive slope over an interval, the graph is increasing on that interval. If a graph
has negative slope over an interval, it is decreasing on that interval. If the slope of a
graph is zero over an interval, it is constant on that interval.
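The slope rules above can be tried out with a few lines of Python. This is only a sketch: the function names and the data points (year, population in thousands) are hypothetical.

```python
def average_rate_of_change(p1, p2):
    """Slope of the segment joining p1 = (x1, y1) and p2 = (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("the two points need different x-values")
    return (y2 - y1) / (x2 - x1)


def describe(slope):
    """Say whether the graph is increasing, decreasing or constant."""
    if slope > 0:
        return "increasing"
    if slope < 0:
        return "decreasing"
    return "constant"


# Hypothetical time-series points: (year, population in thousands).
rise = average_rate_of_change((1990, 120), (2000, 150))
fall = average_rate_of_change((2000, 150), (2010, 130))
print(rise, describe(rise))  # 3.0 increasing
print(fall, describe(fall))  # -2.0 decreasing
```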
Stem-and-leaf plots or stemplots (19), separate lesson: Similar to a bar graph, but individual data values are not lost.
Data points:
• maximum: highest value
• minimum: lowest value
• range: difference between highest and lowest
• cluster: a group of values close together
• outliers: values very different from the rest. In Sect 1.5 we will learn a method of
  determining what values in a given set are outliers.
*The bar graph may seem to be an obvious concept to us today, but it was
developed relatively late in the history of mathematics: William Playfair pioneered bar
graphs in 1786.
1-4: Measures of Center
Measures of center, or measures of central tendency, are descriptions of “typical”
or representative values from a data set. The most common measures of center
are mean and median.
The mean is the arithmetic average, the sum of the data divided by the number of
items in the data set.
The median is the middle value of data placed in size order. When the data set has
an even number of elements, the median is the average of the two middle values.
The mean and median of the same data set can be significantly different. (ex. 25)
Data outliers can strongly affect the mean but have little or no effect on the median.
The mode is the most common value in a data set. The mode is a poor measure of
central tendency, because the mode of a data set will sometimes be an extreme
value (think about a repeated outlier in a set with no other repeated values).
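To see how an outlier affects each measure of center, here is a short Python sketch using the standard `statistics` module. The data sets are made up for illustration.

```python
import statistics

typical = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 96]  # same set with the largest value replaced by an outlier

mean_typical = statistics.mean(typical)            # 20 / 5 = 4
median_typical = statistics.median(typical)        # middle value is 4
mean_outlier = statistics.mean(with_outlier)       # 110 / 5 = 22
median_outlier = statistics.median(with_outlier)   # middle value is still 4

# A repeated outlier in a set with no other repeated values becomes the mode.
repeated_outlier = [1, 2, 3, 4, 50, 50]
mode_value = statistics.mode(repeated_outlier)     # 50

print(mean_typical, median_typical)
print(mean_outlier, median_outlier)
print(mode_value)
```

The single outlier moves the mean from 4 to 22 while the median stays at 4, and in the last set the mode lands on the extreme value 50.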
Other Data points:
• maximum: highest value
• minimum: lowest value
• range: difference between highest and lowest
• cluster: a group of values close together
• outliers: values very different from the rest. In Sect 1.5 we will learn a method of
  determining what values in a given set are outliers.
Mean, median and mode are each valuable in certain situations. (top 26)
Summation or sigma notation: Σ-notation
Recall from AA ch. 13. Sigma notation is a shorthand used to indicate the sum of a
series of terms. The sum of the terms $x_i$ as i goes from a to b, with i
representing the index, or the position of a number in an ordered list, is written:
$\sum_{i=a}^{b} x_i = x_a + x_{a+1} + \dots + x_b$
1-5: Quartiles, Percentiles and Box Plots
When organizing data, a helpful first step is to list the values in size order, also
known as rank-ordering the data set. Rank order makes it very easy to calculate
the data measures introduced below.
In addition to measures of central tendency, it is useful to know how the data in a set
are spread out from the center. One way of getting an idea of the spread of data is to
calculate the quartiles. As the name implies, quartiles divide a rank-ordered set of
data into 4 equal parts. To find the quartiles, first find the median (the middle number
of a rank-ordered data set). The median is actually the 2nd quartile. To find the 1st
quartile, find the median of the values below the median. The 3rd quartile is the
median of the values above the median. (The 4th quartile is the maximum value.)
The interquartile range (IQR) is the difference between the 3rd and 1st quartiles.
This is the range of the middle 50% of the data, and can be a useful indication of the
spread around the center. (useful as a quick summary of the data)
A five-number summary of a data set is a list of the three quartiles together with the
minimum and maximum. The five-number summary can be represented by a box
plot, or box-and-whiskers plot (MCAS favorites).
To draw a box plot:
1. Draw a number line including the minimum and maximum data values.
2. Draw a rectangle with opposite sides at the lower and upper quartiles of the
data (sometimes these segments are called hinges).
3. Within the box, draw a segment parallel to the hinges at the median.
4. Draw segments from the midpoints of the hinges to the minimum and
maximum values.
Boxplots are useful when comparing data sets (ex. 4 p.34)
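The quartile method described above (median first, then the median of each half) can be sketched in Python as follows. The function names and the data are made up, and the sketch assumes the data set has at least two values. Note that calculators and software sometimes use slightly different quartile rules, so their answers may differ.

```python
def median(ordered):
    """Median of a list that is already in rank order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(data):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                    # values below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # values above the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical data set with 7 values.
summary = five_number_summary([7, 1, 3, 12, 9, 5, 15])
minimum, q1, med, q3, maximum = summary
iqr = q3 - q1
print(summary)  # (1, 3, 7, 12, 15)
print(iqr)      # 9
```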
The pth percentile of a set of numbers is a value in the set such that p percent of the
numbers are less than or equal to that value. (ex. PSAT score ranking)
• To find the percentile of an element in a data set, figure out how many values are
  equal to or less than it, then divide this by the number of elements in the set. The
  quotient, if calculated correctly, will be between 0 and 1. Multiply by 100 to
  convert this decimal to a percent.
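The percentile steps above translate directly into a short Python sketch. The function name and the test scores are hypothetical.

```python
def percentile_of(data, value):
    """Percent of the values in data that are less than or equal to value."""
    at_or_below = sum(1 for v in data if v <= value)
    return 100 * at_or_below / len(data)


# Hypothetical test scores for 10 students.
scores = [48, 52, 55, 60, 61, 67, 70, 74, 80, 93]
rank = percentile_of(scores, 67)
print(rank)  # 60.0
```

Six of the ten scores are less than or equal to 67, so 6/10 = 0.6, and a score of 67 is at the 60th percentile of this set.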
The lengths of whiskers in a box plot can depend on percentiles. Whiskers may be
drawn to the 5th and 95th percentiles (or 10th & 90th) and can indicate outliers beyond
these. There is also a common procedure to determine outliers.
Outliers: (IQR = interquartile range). To determine outliers on the larger end of the
data set, (1.5 * IQR) can be added to the 3rd quartile, and any value above this will be
an outlier. The same product (1.5 * IQR) can then be subtracted from the 1st quartile,
and any value below this is also considered an outlier.
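Here is a sketch of the 1.5 * IQR rule in Python. It finds the quartiles as the medians of the lower and upper halves of the rank-ordered data; the function names and the data are invented for this example.

```python
def median(ordered):
    """Median of a list that is already in rank order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def find_outliers(data):
    """Return the values more than 1.5 * IQR beyond the 1st or 3rd quartile."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    return [v for v in ordered if v < low_cutoff or v > high_cutoff]


# Hypothetical data: most values cluster between 10 and 16.
outliers = find_outliers([12, 40, 10, 15, 2, 13, 16, 11, 14])
print(outliers)  # [2, 40]
```

For this set the 1st quartile is 10.5 and the 3rd quartile is 15.5, so IQR = 5 and 1.5 * IQR = 7.5. Values below 3 or above 23 are outliers.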
1-6: Histograms
Histograms are a type of bar graph, where the data values are grouped into non-overlapping
intervals of equal width. Each bar displays the number of values that fall
within an interval. The intervals are usually continuous so, unlike most bar graphs,
there is no space between the bars on a histogram. (An interval which is empty will
become a bar with a height of 0.) The number of values within an interval can be
called the frequency for that interval.
A histogram representing actual counts of values is often called a frequency
distribution. A histogram representing the counts as parts of the total is called a
relative frequency distribution. When creating the relative frequency distribution,
calculate the percentage of the whole represented by the values in each interval. You
can use a percentage, or a value from 0 to 1, on the y-axis of your histogram.
When collecting data for a histogram, use a frequency table to record observations
per interval.
If a histogram covers the range from a to b, and there are n intervals, then each
interval’s width is (b-a)/n. So the first interval would run from a to [a + (b-a)/n],
the 2nd would run from [a + (b-a)/n] to [a + 2(b-a)/n], the 3rd from [a + 2(b-a)/n] to
[a + 3(b-a)/n], … until the last interval: [a + (n-1)(b-a)/n] to b, or [b - (b-a)/n] to b.
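The interval formula can be checked with a small Python sketch that builds the interval endpoints and a frequency table. One assumption is made here that the notes do not state: a value sitting exactly on a boundary is counted in the interval to its right, except that b itself goes in the last interval. The function names and data are hypothetical.

```python
def interval_edges(a, b, n):
    """Endpoints a, a + (b-a)/n, a + 2(b-a)/n, ..., b of n equal-width intervals."""
    width = (b - a) / n
    return [a + k * width for k in range(n)] + [b]


def frequency_table(data, a, b, n):
    """Count how many data values fall in each of the n intervals from a to b."""
    width = (b - a) / n
    counts = [0] * n
    for v in data:
        if not a <= v <= b:
            raise ValueError("value outside the range a to b")
        k = min(int((v - a) / width), n - 1)  # assumption: b goes in the last interval
        counts[k] += 1
    return counts


# Hypothetical scores from 0 to 100 grouped into 5 intervals of width 20.
data = [5, 18, 20, 37, 42, 45, 58, 61, 100]
edges = interval_edges(0, 100, 5)
counts = frequency_table(data, 0, 100, 5)
relative = [c / len(data) for c in counts]  # relative frequencies, from 0 to 1
print(edges)
print(counts)  # [2, 2, 3, 1, 1]
```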
Histograms can display some information about central tendency and spread.
However, the nature of the non-overlapping intervals means histograms rarely show
any exact data points, such as a median. The choice of intervals can greatly affect
the amount of information displayed in the graph. If intervals are not of equal width,
the histogram can be distorted. (example: top p. 41)
1-8: Variance and Standard Deviation
Variance, deviation, standard deviation, range and interquartile range are all
measures of spread, as opposed to mean, median and mode, which are measures of
center. Data sets can have similar or identical measures of center, but different
measures of spread. For example, {4, 5, 6, 6, 7, 8} is a set of 6 numbers with a mean
of 6, a median of 6 and a mode of 6. The set {1, 3, 6, 6, 9, 11} also contains 6 values
and has a mean, median and mode of 6. However, the numbers are spread
differently over a number line…
Measure of Spread:
Variance and Standard Deviation
To calculate the Variance and Standard Deviation (each step in words, then in notation):
1) Calculate the Mean of the data set: $\bar{x}$
2) Calculate the Deviations from the Mean: $x_i - \bar{x}$
   Deviation from Mean = (Each data value) - (Mean)
3) Square the Deviations from the Mean: $(x_i - \bar{x})^2$
4) Add all the squared Deviations from the Mean: $\sum (x_i - \bar{x})^2$
5) Divide the sum by (n-1): $\dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
   This is the VARIANCE.
6) Take the square root of the variance to get the STANDARD DEVIATION: $\sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
Note:
The Square Root of the Variance is the Standard Deviation.
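The variance and standard deviation steps can be followed in a short Python sketch, applied to the two data sets from the start of this section. The function names are made up for illustration.

```python
import math


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(data)
    mean = sum(data) / n                                  # step 1
    squared_deviations = [(x - mean) ** 2 for x in data]  # steps 2 and 3
    return sum(squared_deviations) / (n - 1)              # steps 4 and 5


def sample_standard_deviation(data):
    """Square root of the variance (step 6)."""
    return math.sqrt(sample_variance(data))


narrow = [4, 5, 6, 6, 7, 8]
wide = [1, 3, 6, 6, 9, 11]

print(sample_variance(narrow), sample_standard_deviation(narrow))
print(sample_variance(wide), sample_standard_deviation(wide))
```

Both sets have a mean of 6, but the squared deviations add up to 10 for the first set (variance 10/5 = 2) and to 68 for the second (variance 68/5 = 13.6), so the second set has the larger standard deviation.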