161.120 Introductory Statistics
Week 2 Lecture slides
• Graphical Displays of Univariate Data: Dot Plots, Stem-and-leaf Plots and Histograms
– Text sections 2.4 and 2.5
– CAST sections 2.2 and 2.4
• Describing Centre and Spread
– Text sections 2.6 and 2.7
– CAST sections 2.5 and 2.6
• Transformation & Discrete Data
– CAST sections 2.7 and 2.8
Dot Plots
• A graphical display of a batch of numbers
• Each value is shown as a dot against a numerical axis
• Problem of overlapping
– Jittering dots (used in CAST)
• randomly move the dots perpendicularly to the axis in order to
separate them somewhat
– Stacking dots
• group values into classes, then vertically stack the dots in each
class
• the heights of the stacks show the density for each class
• The loss of detailed information in a stacked dot plot is rarely
important
Stem and leaf Plots
Basically a stacked dot plot using digits instead of dots and
slightly different layout
• The 'axis' is drawn vertically
• A value is printed on the axis for each stack, giving the most
significant digits that are common for all values on that stack. This is
called the stem for the stack.
• The digits representing the values are called the leaf digits and are
drawn in a row to the right of the stems
• Decimal points are not shown in the stems or the leaves
– The stem '12' and leaf '3' could represent
12300 or 1230 or 123 or 12.3 or 1.23 or 0.123, etc.
so need to provide a key or state the units of the stem
• Distribution of values is shown by ‘canopy’ of leaves
• Sometimes not shown well
– Can change the value of the leaves
– Or split the stems
Example 2.8 Big Music Collection
About how many CDs do you own?
Stem is ‘100s’ and leaf unit is ‘10s’.
Final digit is truncated.
Numbers ranged from 0 to about 450,
with 450 being a clear outlier and
most values ranging from 0 to 99.
The shape is skewed right.
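The construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and the CD counts are hypothetical (they are not the survey data), and the sketch assumes non-negative values with a stem unit of 100s and a leaf unit of 10s, truncating the final digit as in the example.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=10):
    """Return the lines of a stem-and-leaf plot for non-negative numbers.

    Each value is truncated (not rounded) to a multiple of leaf_unit;
    the last remaining digit is the leaf and the digits before it the stem.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        truncated = int(v // leaf_unit)   # drop the digits below the leaf unit
        stem, leaf = divmod(truncated, 10)
        leaves[stem].append(str(leaf))
    # List every stem between the smallest and largest, even if it has no
    # leaves, so that gaps in the data stay visible.
    lo, hi = min(leaves), max(leaves)
    return [f"{stem} | {''.join(leaves[stem])}" for stem in range(lo, hi + 1)]

# Hypothetical CD counts, made up for illustration only.
cd_counts = [0, 12, 25, 30, 48, 50, 75, 99, 120, 150, 220, 450]
plot_lines = stem_and_leaf(cd_counts)
print("\n".join(plot_lines))
print("Key: stem unit = 100s, leaf unit = 10s (final digit truncated)")
```

With these made-up values the stem 3 has no leaves, which makes the gap before the large value 450 easy to see.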
Outliers and How to Handle Them
Outlier: a data point that is not consistent with
the bulk of the data.
• Look for them via graphs.
• Can have big influence on conclusions.
• Can cause complications in some statistical analyses.
• Cannot discard without justification.
Possible Reasons for Outliers
and Reasonable Actions
• Mistake made while taking measurement or entering it into
computer. If verified, should be discarded/corrected.
• Individual in question belongs to a different group than
bulk of individuals measured. Values may be discarded if
summary is desired and reported for the majority group
only.
• Outlier is legitimate data value and represents natural
variability for the group and variable(s) measured. Values
may not be discarded — they provide important
information about location and spread.
Example 2.7 Tiny Boatsmen
Weights (in pounds) of 18 men on crew teams:
Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0
Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5
Note: the last weight in each list is unusually small.
These two men are the coxswains for their teams, while the others are
rowers.
Clusters
• If a dot plot or stem and leaf plot separates into two or more groups
of values (clusters), this suggests that the 'individuals' from which
the data were recorded may similarly be split into two or more
groups.
• Clusters may correspond to males and females, different varieties of
plants,…
• Detecting the cause of differences between the groups may lead to
valuable insights into the data.
Histograms
• Directly displays the 'canopy' shape, without separately displaying
the individual values.
• Are particularly useful displays for large data sets
• Area equals relative frequency
– Each value must contribute the same area to the histogram
– Equal width classes
• height of the rectangles equals the frequency of the class
• vertical axis labeled ‘frequency’
– Mixed class widths
• vertical axis labeled ‘density’
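A minimal sketch of why mixed class widths need a density scale: if density is taken as relative frequency divided by class width, each rectangle's area equals the relative frequency of its class. The function name, the data values and the class boundaries below are all hypothetical.

```python
def histogram_densities(values, boundaries):
    """Return (relative_frequencies, densities) for classes
    [b0, b1), [b1, b2), ...; the final class also includes its upper boundary."""
    n = len(values)
    n_classes = len(boundaries) - 1
    counts = [0] * n_classes
    for v in values:
        for i in range(n_classes):
            is_last = i == n_classes - 1
            in_class = boundaries[i] <= v < boundaries[i + 1]
            if in_class or (is_last and v == boundaries[-1]):
                counts[i] += 1
                break
    widths = [boundaries[i + 1] - boundaries[i] for i in range(n_classes)]
    rel_freqs = [c / n for c in counts]
    # Density = relative frequency per unit of the measurement scale.
    densities = [rf / w for rf, w in zip(rel_freqs, widths)]
    return rel_freqs, densities

values = [1, 2, 3, 4, 6, 7, 12, 18, 25, 38]   # hypothetical data
boundaries = [0, 5, 10, 20, 40]               # class widths 5, 5, 10, 20

rel_freqs, densities = histogram_densities(values, boundaries)
areas = [d * (boundaries[i + 1] - boundaries[i]) for i, d in enumerate(densities)]
print(rel_freqs)
print(densities)
print(sum(areas))   # total area is 1, up to floating-point rounding
```

With equal class widths the densities are just the frequencies divided by a constant, which is why a 'frequency' axis gives the same picture in that case.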
Choice of histogram classes
• Histogram classes should be chosen to give an outline that is as
smooth as possible
– Too narrow leads to jagged histogram
– Too wide leads to 'blocky' histogram and detail is lost
• Adjusting the class width and the starting position for the first class
can give a surprising amount of variability in histogram shape
for small data sets. As a result, you must be wary of overinterpreting features such as clusters or skewness in such
histograms.
Interpreting Histograms, Stemplots,
and Dotplots
• Values are centered around 20 cm.
• Two possible low outliers.
• Apart from outliers, spans range from about 16 to 23
cm.
Five-Number Summaries
• Find extremes (high, low), the median, and the
quartiles (medians of lower and upper halves of
the values).
• Quick overview of the data values.
• Information about the center, spread, and shape of
data.
Notation and Finding the Quartiles
Split the ordered values into the half
that is below the median and the
half that is above the median.
Q1 = lower quartile
= median of data values
that are below the median
Q3 = upper quartile
= median of data values
that are above the median
Example 2.10 Fastest Speeds (cont)
Ordered data (in rows of 10 values) for the 87 males:
55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150
• Median: n = 87 is odd, so the median is at position (87+1)/2 = 44;
the 44th value in the list = 110 mph
• Q1 = median of the 43 values below the median: position
(43+1)/2 = 22, so Q1 = 22nd value from the start of the list = 95 mph
• Q3 = median of the 43 values above the median: position
(43+1)/2 = 22, so Q3 = 22nd value from the end of the list = 120 mph
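The same calculation can be checked with a short Python sketch. The helper names `median_of_sorted` and `five_number_summary` are hypothetical; the quartiles follow the rule used here (medians of the values below and above the median), which can differ slightly from the quartile definitions built into statistical software.

```python
def median_of_sorted(sorted_vals):
    """Median of an already sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def five_number_summary(values):
    """Return (low, Q1, median, Q3, high) using medians of the two halves."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]                                   # values below the median
    upper = s[half + 1:] if n % 2 == 1 else s[half:]   # values above the median
    return (s[0], median_of_sorted(lower), median_of_sorted(s),
            median_of_sorted(upper), s[-1])

# Fastest speeds (mph) for the 87 males, copied from the ordered list.
speeds = [
    55, 60, 80, 80, 80, 80, 85, 85, 85, 85,
    90, 90, 90, 90, 90, 92, 94, 95, 95, 95,
    95, 95, 95, 100, 100, 100, 100, 100, 100, 100,
    100, 100, 101, 102, 105, 105, 105, 105, 105, 105,
    105, 105, 109, 110, 110, 110, 110, 110, 110, 110,
    110, 110, 110, 110, 110, 112, 115, 115, 115, 115,
    115, 115, 120, 120, 120, 120, 120, 120, 120, 120,
    120, 120, 124, 125, 125, 125, 125, 125, 125, 130,
    130, 140, 140, 140, 140, 145, 150,
]

summary = five_number_summary(speeds)
print(summary)   # (55, 95, 110, 120, 150)
```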
Percentiles
The kth percentile is a number that has
k% of the data values at or below it and
(100 – k)% of the data values at or
above it.
• Lower quartile = 25th percentile
• Median = 50th percentile
• Upper quartile = 75th percentile
Median, quartiles and area
• The data set is split into quarters by the median and quartiles.
• Histogram area is proportional to relative frequency therefore
the median and quartiles split the histogram into four equal
areas.
Basic Box plot
What does a box plot tell you about the
distribution?
• Centre
– The vertical line inside the box (the median) gives an indication
of the centre of the distribution.
• Spread
– The width of the box (the interquartile range) gives an indication
of the spread of values in the distribution.
• IQR = UQ - LQ
• Shape
– High density corresponds to adjacent box plot values being
close together. In particular, if the extreme and quartile on one
side are closer to the median than the extreme and quartile on
the other side, this shows that the distribution is skew.
Box plot: Clusters & Outliers
• Clusters
– Boxplots cannot show clusters in a data set
– Before using a box plot, check that clusters do not exist by using a dot
plot, stem and leaf plot or a histogram
• Outliers
– The basic box plot does not clearly show an outlier
– Any values more than 1.5 times the IQR from the box are considered to
be outliers and are displayed with a separate cross
– The 'whiskers' that are drawn to the sides of the central box extend only
as far as the most extreme values that are not classified as outliers.
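A minimal sketch of the 1.5 × IQR rule, applied to the nine Cambridge weights from Example 2.7. The function name `boxplot_stats` is hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the data.

```python
from statistics import median

def boxplot_stats(values):
    """Quartiles, IQR, outliers and whisker ends for a box plot."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 == 1 else s[half:]
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    # Values more than 1.5 * IQR beyond the box are flagged as outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in s if v < low_fence or v > high_fence]
    inside = [v for v in s if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": outliers,
        # Whiskers stop at the most extreme values that are not outliers.
        "whiskers": (inside[0], inside[-1]),
    }

cambridge = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0]
stats = boxplot_stats(cambridge)
print(stats)
```

For these weights the quartiles are 180.75 and 199.0, so the IQR is 18.25 and anything below 153.375 or above 226.375 is flagged. Only the coxswain's 109.0 falls outside, and the whiskers run from 178.5 to 214.0.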
Example 2.10 Fastest Speeds Ever Driven
Five-Number Summary for 87 males
• Median = 110 mph measures the center of the data
• Two extremes describe spread over 100% of data
Range = 150 – 55 = 95 mph
• Two quartiles describe spread over middle 50% of data
Interquartile Range = 120 – 95 = 25 mph
[Figure: Boxplot of Males' Fastest Speeds; axis: Fastest Speeds, 50 to 150 mph]
Comparing two or more groups
• Box plots are particularly useful for comparing different groups of
values
• Rice yields in 1996
Picturing Location
and Spread with Boxplots
Boxplots for right handspans
of males and females.
• Box covers the middle
50% of the data
• Line within box marks
the median value
• Possible outliers are
marked with asterisk
• Apart from outliers,
lines extending from
box reach to min and
max values.
2.5 Pictures for
Quantitative Data
• Histograms: similar to bar graphs, used for any
number of data values.
• Stem-and-leaf plots and dotplots: present all
individual values, useful for small to moderate sized
data sets.
• Boxplot or box-and-whisker plot: useful summary
for comparing two or more groups.
2.6 Numerical Summaries
of Quantitative Data
Notation for Raw Data:
n = number of individuals in a data set
x1, x2 , x3,…, xn represent individual raw data values
Example: A data set consists of handspan
values in centimeters for six females;
the values are 21, 19, 20, 20, 22, and 19.
Then, n = 6
x1= 21, x2 = 19, x3 = 20, x4 = 20, x5 = 22, and x6 = 19
Describing the Location
of a Data Set
• Mean: the numerical average
• Median: the middle value (if n odd) or the
average of the middle two values (n even)
Symmetric: mean = median
Skewed Left: mean < median
Skewed Right: mean > median
Determining the Mean and Median
The Mean
$\bar{x} = \frac{\sum x_i}{n}$
where $\sum x_i$ means “add together all the values”
The Median
If n is odd: M = middle of ordered values.
Count (n + 1)/2 down from top of ordered list.
If n is even: M = average of middle two ordered values.
Average values that are (n/2) and (n/2) + 1
down from top of ordered list.
Example 2.9
Will “Normal” Rainfall
Get Rid of Those Odors?
Data: Average rainfall (inches)
for Davis, California for 47 years
Mean = 18.69 inches
Median = 16.72 inches
In 1997-98, a company
with odor problem blamed
it on excessive rain.
That year rainfall was
29.69 inches. More rain
occurred in 4 other years.
The Influence of Outliers
on the Mean and Median
Larger influence on mean than median.
High outliers will increase the mean.
Low outliers will decrease the mean.
If ages at death are: 70, 72, 74, 76, and 78
then mean = median = 74 years.
If ages at death are: 35, 72, 74, 76, and 78
then median = 74 but mean = 67 years.
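The ages-at-death example can be reproduced with Python's standard `statistics` module; this is just a quick check of the arithmetic, with variable names chosen for illustration.

```python
from statistics import mean, median

ages = [70, 72, 74, 76, 78]
ages_with_low_outlier = [35, 72, 74, 76, 78]

# Without the outlier the mean and median agree.
print(mean(ages), median(ages))                                     # 74 74
# The low outlier pulls the mean down but leaves the median unchanged.
print(mean(ages_with_low_outlier), median(ages_with_low_outlier))   # 67 74
```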
2.7 Bell-Shaped Distributions
of Numbers
Many measurements follow a predictable pattern:
• Most individuals are clumped around the center
• The greater the distance a value is from the
center, the fewer individuals have that value.
Variables that follow such a pattern are said
to be “bell-shaped”. A special case is called
a normal distribution or normal curve.
Example 2.11 Bell-Shaped
British Women’s Heights
Data: representative sample of 199 married British couples.
Below shows a histogram of the wives’ heights with a normal curve
superimposed. The mean height = 1602 millimeters.
Describing Spread
with Standard Deviation
Standard deviation measures
variability by summarizing how far
individual data values are from the
mean.
Think of the standard deviation as
roughly the average distance
values fall from the mean.
Describing Spread
with Standard Deviation
Both sets have same mean of 100.
Set 1: all values are equal to the mean so there is
no variability at all.
Set 2: one value equals the mean and other four
values are 10 points away from the mean, so
the average distance away from the mean is
about 10.
Calculating the Standard Deviation
Formula for the (sample) standard deviation:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The value of $s^2$ is called the (sample) variance.
An equivalent formula, easier to compute, is:
$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$
Calculating the Standard Deviation
Step 1:
Calculate $\bar{x}$, the sample mean.
Step 2:
For each observation, calculate the
difference between the data value
and the mean.
Step 3:
Square each difference in step 2.
Step 4:
Sum the squared differences in step 3,
and then divide this sum by n – 1.
Step 5:
Take the square root of the value
in step 4.
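The five steps can be followed literally in Python. This is a sketch with hypothetical data: five values with mean 100, one equal to the mean and four that are 10 points away from it. The function names are made up for illustration, and the second function uses the equivalent computing formula for comparison.

```python
import math

def sample_sd(values):
    """Sample standard deviation, following steps 1 to 5."""
    n = len(values)
    xbar = sum(values) / n                      # Step 1: sample mean
    differences = [x - xbar for x in values]    # Step 2: value minus mean
    squared = [d ** 2 for d in differences]     # Step 3: square each difference
    variance = sum(squared) / (n - 1)           # Step 4: sum, divide by n - 1
    return math.sqrt(variance)                  # Step 5: square root

def sample_sd_shortcut(values):
    """Same quantity from the sum of squares and the mean."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt((sum(x ** 2 for x in values) - n * xbar ** 2) / (n - 1))

data = [90, 90, 100, 110, 110]   # hypothetical values with mean 100

s = sample_sd(data)
s_shortcut = sample_sd_shortcut(data)
print(s, s_shortcut)   # both are 10 (up to floating-point rounding)
```

Here the squared differences sum to 400, so the variance is 400/4 = 100 and s = 10, in line with the idea that s is roughly the average distance of the values from the mean.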
Interpreting the Standard Deviation
for Bell-Shaped Curves:
The Empirical Rule
For any bell-shaped curve, approximately
• 68% of the values fall within 1 standard
deviation of the mean in either direction
• 95% of the values fall within 2 standard
deviations of the mean in either direction
• 99.7% of the values fall within 3 standard
deviations of the mean in either direction
The Empirical Rule, the Standard
Deviation, and the Range
• Empirical Rule => the range from the
minimum to the maximum data values
equals about 4 to 6 standard deviations for
data with an approximate bell shape.
• You can get a rough idea of the value of
the standard deviation by dividing the
range by 6.
$s \approx \frac{\text{Range}}{6}$
Example 2.11
Women’s Heights
(cont)
Mean height for the 199 British women is 1602
mm and standard deviation is 62.4 mm.
• 68% of the 199 heights would fall in the
range 1602 ± 62.4, or 1539.6 to 1664.4 mm
• 95% of the heights would fall in the interval
1602 ± 2(62.4), or 1477.2 to 1726.8 mm
• 99.7% of the heights would fall in the
interval 1602 ± 3(62.4), or 1414.8 to 1789.2
mm
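The three intervals are just mean ± k standard deviations for k = 1, 2, 3. A small sketch, using the mean and standard deviation quoted for the women's heights (the function name is hypothetical):

```python
def empirical_rule_intervals(mean, sd):
    """Map k = 1, 2, 3 to (lower, upper, approximate share of values)."""
    shares = {1: 0.68, 2: 0.95, 3: 0.997}
    return {k: (mean - k * sd, mean + k * sd, shares[k]) for k in (1, 2, 3)}

intervals = empirical_rule_intervals(mean=1602, sd=62.4)
for k, (low, high, share) in intervals.items():
    print(f"within {k} sd: {low:.1f} to {high:.1f} mm (about {share:.1%})")
```

These shares are approximations that apply only to roughly bell-shaped data; the actual percentages in a sample will usually differ a little.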
Example 2.11
Women’s Heights
(cont)
Summary of the actual results:
Note: The minimum height = 1410 mm and the maximum
height = 1760 mm, for a range of 1760 – 1410 = 350 mm.
So an estimate of the standard deviation is:
$s \approx \frac{\text{Range}}{6} = \frac{350}{6} \approx 58.3$ mm
Standardized z-Scores
Standardized score or z-score:
$z = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$
Example: Mean resting pulse rate for adult men is 70
beats per minute (bpm), standard deviation is 8 bpm.
The standardized score for a resting pulse rate of 80:
$z = \frac{80 - 70}{8} = 1.25$
A pulse rate of 80 is 1.25 standard deviations
above the mean pulse rate for adult men.
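As a quick check of the pulse-rate example, the z-score formula is a one-line function (the name `z_score` is hypothetical):

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / sd

z = z_score(observed=80, mean=70, sd=8)
print(z)   # 1.25

# A value below the mean gives a negative z-score.
z_low = z_score(observed=62, mean=70, sd=8)
print(z_low)   # -1.0
```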
The Empirical Rule Restated
For bell-shaped data,
• About 68% of the values have
z-scores between –1 and +1.
• About 95% of the values have
z-scores between –2 and +2.
• About 99.7% of the values have
z-scores between –3 and +3.
Transformations
• Sometimes it is convenient to express numbers on a different scale
Americans easily recognise that 90° Fahrenheit is a hot day.
We understand temperatures better on the Celsius scale.
• No gain or loss of information (usually)
• Graphical and numerical summaries are affected.
Transformations can help us understand a data set
Linear transformations
new value = a + b × old value
– imperial to metric measurements
– temperature
grams = 28.3495 × ounces
Fahrenheit = 32 + 1.8 × Celsius
• Relative positions of the points do not change so we neither gain nor
lose information.
Linear transformations
• Affect the centre and spread of the data
• Shape remains unchanged
• Graphical displays: only the numbers labeling the axis change
• Do not help you to understand the
distribution of values in the data
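A short sketch of these points, using the Celsius to Fahrenheit conversion with hypothetical temperatures: the mean becomes a + b × mean, the standard deviation becomes b × standard deviation (for positive b), and the z-scores, which describe relative position, are unchanged.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardized scores of each value within its own data set."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

celsius = [10.0, 15.0, 20.0, 25.0, 30.0]    # hypothetical temperatures
a, b = 32, 1.8                               # Fahrenheit = 32 + 1.8 x Celsius
fahrenheit = [a + b * c for c in celsius]

# Centre and spread change in a predictable way.
print(mean(fahrenheit), a + b * mean(celsius))
print(stdev(fahrenheit), b * stdev(celsius))

# Relative positions (and therefore shape) do not change.
z_c = z_scores(celsius)
z_f = z_scores(fahrenheit)
print(z_c)
print(z_f)
```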
Nonlinear transformations
Examples:
– The wavelength of radiation (in metres) may alternatively be recorded
as a frequency (in cycles per second) -- a reciprocal relationship.
– A medical researcher might record the mean time between seizures for
acute epileptic patients, or the rate of seizures per year -- another
reciprocal relationship.
– The Richter scale transforms the measured intensity of earthquakes to a
logarithmic scale.
Nonlinear transformations
• Change the relative distances between data values
• Change the shape of a distribution
Logarithmic transformations
• The most commonly used nonlinear transformation replaces each
value by its logarithm
new value = log10(old value)
• base-10 logarithms easier to interpret and used in CAST, but natural
logarithms (base e) have a similar effect.
• logarithms can be found only for positive numbers.
• log10(1) = 0, log10(10) = 1, log10(100) = 2, log10(1000) = 3,
• log10(0.1) = –1, log10(0.01) = –2, etc.
• Spreads out low values in a distribution and compresses high
values.
• Useful for skew data with a long tail towards the high values.
– It will spread out a dense cluster of low values and may detect
clustering or outliers that would not be visible in graphical displays of the
original data.
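A minimal sketch of this effect with hypothetical right-skewed values: on the original scale the four smallest values are squeezed into a tiny part of the range, while on the log10 scale they occupy a quarter of it.

```python
import math

values = [1, 2, 5, 10, 100, 1000, 10000]     # hypothetical, long upper tail
logs = [math.log10(v) for v in values]       # requires all values > 0

def share_of_range(data, low, high):
    """Fraction of the data's full range covered by the interval low..high."""
    return (high - low) / (max(data) - min(data))

# The stretch from 1 to 10 on the raw scale, and the same stretch after logs.
raw_share = share_of_range(values, 1, 10)
log_share = share_of_range(logs, math.log10(1), math.log10(10))

print([round(x, 3) for x in logs])
print(raw_share)   # about 0.0009 of the raw range
print(log_share)   # about 0.25 of the log range
```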
A family of nonlinear transformations
Power transformations raise each value in the data set to a
power p
• p < 1 increases the spread in the lower tail of the data values and
decreases the spread in the upper tail.
• p > 1 expands the upper tail of the data values and compresses
the lower tail. (Rarely helpful)
Discrete Data Displays
• Large counts
– The distribution of values can be summarised with the same methods
as continuous data.
• Moderate counts
– Most of the earlier displays can still be used, but
• Stacked dot plots are better than jittered dot plots
– No information is lost by stacking since there can be a column of
crosses for each distinct value.
• Histogram class boundaries should end in '.5' to ensure that data
values do not occur on the boundary of two classes.
• Since the median, quartiles and extremes are always whole
numbers (or occasionally half-way between two whole numbers),
box plots do not give a very effective comparison of groups.
• Small counts
– A bar chart is a better representation of the data than a histogram