Week1_Lecture 3_post

Review of Chapter 4
In this chapter, we learned how to display
quantitative variables.
Graphic techniques:
histogram , stem-and-leaf plot, dot plot;
How to describe the shape of the distribution?
 Unimodal/Bimodal/Multimodal/Uniform
 Symmetric/Skewed to the left/Skewed to the right
 Outlier
 How to describe the center of a distribution?
Midrange: (max+min)/2
Median: # is odd/ # is even
Mean (Next)
• Center of a distribution
 Measure of Center #3: Mean
 For convenience of discussion, we are going to
introduce some notation from now on.
In Statistics, notation is part of the vocabulary.
1) Variable (values of) ---- x (can also be y, z, etc.)
2) number of data values ----- n
3) mean of variable x ---- x̄ (pronounced “x-bar”)
• Center of a distribution
 Measure of Center #3: Mean
 Mean is defined by the following formula:
x̄ = Σx / n
(Σ means “sum”)
The formula says to add up all the values of the
variable and divide that sum by the number of
data values.
In daily life, we call it average.
• Center of a distribution
 Measure of Center #3: Mean
 Example: Find the mean for the following dataset
{12, 34, 45, 52}
Solution:
Here we have 4 data values in total, so n=4
x̄ = (12 + 34 + 45 + 52) / 4 = 143 / 4 = 35.75
So the mean of the dataset is 35.75
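The calculation above can be checked with a few lines of Python. This is only a minimal sketch of the definition (sum of the values divided by n); the function name `mean` is chosen here for illustration.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / n


data = [12, 34, 45, 52]
print(mean(data))  # 35.75
```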
• Center of a distribution
 Measure of Center #3: Mean
 Interpretation of mean
First, look at the following simple example:
Dataset 1: {4, 5, 6}
median=5
mean=5
Dataset 2: {4, 5, 9}
median=5
mean=6
Dataset 3: {1, 5, 6}
median=5
mean=4
Therefore, we can see
1) Median is more resistant to the extreme values.
2) Mean is more sensitive to the extreme values
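The three small datasets above can be compared directly. The sketch below uses Python's standard `statistics` module; the dictionary names are just labels for this illustration.

```python
from statistics import mean, median

# The three datasets from the example: only one extreme value changes.
datasets = {
    "Dataset 1": [4, 5, 6],
    "Dataset 2": [4, 5, 9],
    "Dataset 3": [1, 5, 6],
}

# Store (median, mean) for each dataset.
summary = {name: (median(v), mean(v)) for name, v in datasets.items()}

for name, (med, avg) in summary.items():
    print(f"{name}: median={med}, mean={avg}")
```

The median stays at 5 in every case, while the mean moves toward whichever extreme value was changed.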
• Center of a distribution
 Measure of Center #3: Mean
 Interpretation of mean
The mean feels like the center because it is the point
where the histogram balances:
[Figure: histogram of the GPA example with the mean and the median marked]
• Center of a distribution
 Discussion of relative position of mean and
median
Case 1: When the distribution is symmetric
median coincides with mean
• Center of a distribution
 Discussion of relative position of mean and
median
Case 2: When the distribution is skewed to the
left
the mean lies to the left of the median (mean < median)
• Center of a distribution
 Discussion of relative position of mean and
median
Case 3: When the distribution is skewed to the
right
the mean lies to the right of the median (mean > median)
• Center of a distribution
 Hint: How to judge the relative position of
mean and median?
Compared to the median, the mean is usually pulled toward the
longer tail (extreme values).
• Center of a distribution
 Let’s try the following example together.
A researcher is studying the distribution of a quantitative variable by
using the histogram below. On the histogram, he marked two vertical
lines, indicating the position of mean and median. But he is so careless
that he forgot to mark the corresponding names of them. Can you help
him to identify which line represents mean and which one is median?
• Center of a distribution
Q: When to use median and when to use mean as the
measure of the center ?
 Case 1: If the distribution is skewed or has outliers
We are usually better off with median
because it is resistant to the extreme values.
 Case 2: If the distribution is symmetric and there are
no outliers
We can report the mean and the median together
because they differ very little.
In practice, though, people usually prefer to report the
mean
• Center of a distribution
 Case 3: If you are not sure,
report both and discuss why they might differ.
For example, to tell the center of the distribution displayed
below, which one do you prefer, mean or median?
• Spread of a distribution
When we describe a distribution numerically, we
always report a measure of its spread along with its
center.
There are a number of measures of spread; we are going
to introduce three of them.
 Measure of spread #1: Range
Range= maximum value – minimum value
Example: Please find the range of GPA data
3.9 3.0 2.7 4.0 3.6 3.2
4.0 2.2 3.2 3.7 4.0 3.9
1.6 3.8 1.9 2.8 2.9 3.6
3.5 2.0 1.2 3.7 3.3 2.9
3.5 1.6 2.4 3.7 3.9 3.2
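As a quick check of the range definition, the sketch below types in the 30 GPA values listed above and subtracts the minimum from the maximum. The variable names are illustrative only.

```python
# The 30 GPA values from the example, row by row.
gpa = [
    3.9, 3.0, 2.7, 4.0, 3.6, 3.2,
    4.0, 2.2, 3.2, 3.7, 4.0, 3.9,
    1.6, 3.8, 1.9, 2.8, 2.9, 3.6,
    3.5, 2.0, 1.2, 3.7, 3.3, 2.9,
    3.5, 1.6, 2.4, 3.7, 3.9, 3.2,
]

# Range = maximum value - minimum value
data_range = max(gpa) - min(gpa)
print(f"max={max(gpa)}, min={min(gpa)}, range={data_range:.1f}")
```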
• Spread of a distribution
 Measure of spread #2: The Interquartile Range (IQR)
When we study the definition of median, we divide the data
set into two equal-size halves.
[Diagram: the sorted data from Low to High, split at the Median into two halves]
Furthermore, let’s divide the data set into four quarters.
And we call these new dividing points quartiles.
[Diagram: the sorted data from Low to High, split into four quarters by the
Lower Quartile (1st quartile, Q1), the Median (2nd quartile), and the
Upper Quartile (3rd quartile, Q3)]
• Spread of a distribution
 Measure of spread #2: The Interquartile Range (IQR)
 How to find quartiles by hand?
Always start from sorting (from low to high)
Case 1: When n (number of data values) is even.
For example, data set { 1 , 3 , 5 , 7, 9, 11} (n=6)
We know the median is the average of the middle two values, i.e. (5 + 7)/2 = 6.
lower quartile (Q1): we focus on the first half of numbers,
which are {1,3,5}. Find the median of {1,3,5}, then you will get
Q1 = 3
upper quartile (Q3): we focus on the second half of numbers,
which are {7,9,11}. Find the median of {7,9,11}, then you will
get Q3 = 9
• Spread of a distribution
 Measure of spread #2: The Interquartile Range (IQR)
 Then how to find quartiles by hand?
Let’s try an example immediately.
Please find the median, Q1, Q3 in the following data
set.
{ 64, 43, 64, 75}
• Spread of a distribution
 Measure of spread #2: The Interquartile Range (IQR)
 How to find quartiles by hand?
Always start from sorting (from low to high)
Case 2: When n (number of data values) is odd.
For example, data set { 1 , 3 , 5 , 7, 9, 11, 13} (n=7)
We know the median is the middle value 7.
lower quartile (Q1) : we focus on the numbers before the
median 7, which are {1,3,5}. Find the median of {1,3,5}, then
you will get Q1 = 3
upper quartile (Q3): we focus on the numbers after the median
7, which are {9,11,13}. Find the median of {9,11,13}, then you
will get Q3 = 11
Remark:
Some statisticians include the median in both halves.
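The by-hand rule for both cases can be sketched in Python as follows. The function name `quartiles_by_hand` is made up for this illustration, and the code follows the convention used in the examples above (for odd n, the median is left out of both halves). Statistical software often uses other quartile conventions, so its answers may differ slightly.

```python
from statistics import median


def quartiles_by_hand(values):
    """Return (Q1, median, Q3) using the halves rule from the slides.

    The data are sorted first. For odd n the median itself is excluded
    from both halves, as in the worked examples.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values to form two halves")
    half = n // 2
    lower = data[:half]        # values before the middle
    upper = data[n - half:]    # values after the middle
    return median(lower), median(data), median(upper)


print(quartiles_by_hand([1, 3, 5, 7, 9, 11]))      # (3, 6.0, 9)
print(quartiles_by_hand([1, 3, 5, 7, 9, 11, 13]))  # (3, 7, 11)
```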
• Spread of a distribution
 Measure of spread #2: The Interquartile Range (IQR)
 Then how to find quartiles by hand?
Let’s try an example immediately.
Please find the median, Q1, Q3 in the following data
set.
{ 14, 43, 64, 75, 72}
• Spread of a distribution
 Measure of spread #2: The Interquartile Range (IQR)
Now we are ready to define IQR,
IQR= upper quartile – lower quartile = Q3 – Q1
For example, the IQR of data set { 1 , 3 , 5 , 7, 9, 11, 13} is
IQR = Q3 – Q1 = 11 – 3 = 8
Comments on IQR:
• Just like the median, IQR is also resistant to values that are
extraordinarily large or small.
• So IQR is a good choice of the measure of the spread when
the distribution is skewed or has outliers.
• Spread of a distribution
 5-number Summary
The 5-number summary is commonly used to describe a
quantitative variable.
The 5-number summary of a distribution reports its
median, quartiles, and extremes (max and min).
For example, the 5-number summary for data set
{1 , 3 , 5 , 7 , 9 , 11 , 13} is
Max      13
Q3       11
Median   7
Q1       3
Min      1
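A 5-number summary and the IQR can be produced together. The sketch below is one possible way to do it, assuming the same by-hand quartile rule as in the examples above (the median is excluded from both halves when n is odd); the function name `five_number_summary` is hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max as a dictionary."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    return {
        "min": data[0],
        "Q1": median(data[:half]),        # median of the lower half
        "median": median(data),
        "Q3": median(data[n - half:]),    # median of the upper half
        "max": data[-1],
    }


summary = five_number_summary([1, 3, 5, 7, 9, 11, 13])
iqr = summary["Q3"] - summary["Q1"]
print(summary)
print("IQR =", iqr)  # IQR = 8
```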
• Spread of a distribution
 Measure of spread #3: The Standard Deviation(SD)
For each value x, the difference x - x̄ tells us how far
the value x is from the mean x̄, and it is called the deviation.
The standard deviation, denoted by s, is defined as
s = √( Σ(x - x̄)² / (n - 1) )
Comments on standard deviation:
• Like the mean, standard deviation is very sensitive to the
extraordinarily large or small values.
• So it’s a good idea to report SD as the measure of the
spread when the distribution is symmetric and has no
outliers.
• s² is called the variance.
• Spread of a distribution
 Measure of spread #3: The Standard Deviation(SD)
Example: Please find the standard deviation of the
following dataset {1,2,3,4}
Solution: n=4
Step 1: Find the mean
x̄ = (1 + 2 + 3 + 4) / 4 = 2.5
Step 2: Fill in the following table
Original Values    Deviations    Squared Deviations
x                  x - x̄         (x - x̄)²
1
2
3
4
• Spread of a distribution
 Measure of spread #3: The Standard Deviation(SD)
Original Values    Deviations          Squared Deviations
x                  x - x̄               (x - x̄)²
1                  1 - 2.5 = -1.5      (-1.5)² = 2.25
2                  2 - 2.5 = -0.5      (-0.5)² = 0.25
3                  3 - 2.5 = 0.5       0.5² = 0.25
4                  4 - 2.5 = 1.5       1.5² = 2.25
SUM                                    5
Step 3: Add the squared deviations up: 2.25 + 0.25 + 0.25 + 2.25 = 5
Step 4: s = √( 5 / (4 - 1) ) = √(5/3) ≈ 1.29
Q: What is the variance?
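The four steps of this worked example can be followed in code. The sketch below assumes the usual sample standard deviation, which divides the sum of squared deviations by n - 1, and compares the result with `statistics.stdev` from the Python standard library, which uses the same divisor. The function name `sample_sd` is illustrative.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation computed step by step (divisor n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    x_bar = sum(values) / n                           # Step 1: the mean
    squared_devs = [(x - x_bar) ** 2 for x in values] # Step 2: squared deviations
    total = sum(squared_devs)                         # Step 3: add them up
    return math.sqrt(total / (n - 1))                 # Step 4: divide, take the root


data = [1, 2, 3, 4]
s = sample_sd(data)
print(round(s, 4))                       # 1.291
print(round(statistics.stdev(data), 4))  # 1.291
```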
• Spread of a distribution
 Measure of spread #3: The Standard Deviation(SD)
 Interpretation of SD
Brain Storm: Quickly compute the standard deviation of
{1,1,1,1,1,1,1,1,1,1}
From it, we can see
1) The SD always equals zero if all the values in a
particular dataset are the same (i.e. no spread in
value)
2) The SD will be very large if the values in the dataset
vary a lot from each other (i.e. a huge spread in
value)
Therefore, in this sense, we use SD as a measure of
spread.
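Both observations are easy to see numerically. In the sketch below, `constant` is the dataset from the brainstorm question, while `tight` and `wide` are hypothetical datasets invented here to show a small and a large spread around the same mean.

```python
from statistics import stdev

constant = [1] * 10     # all values equal: no spread
tight = [4, 5, 6]       # hypothetical data: values close to the mean 5
wide = [-50, 5, 60]     # hypothetical data: values far from the mean 5

sd_constant = stdev(constant)
sd_tight = stdev(tight)
sd_wide = stdev(wide)

print(sd_constant, sd_tight, sd_wide)
```

The constant dataset gives an SD of zero, and the SD grows as the values move further from their mean.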
• Spread of a distribution
 TI instructions
How to find n, x̄, s, median, max, min, Q1, Q3 by using a TI calculator?
Step 1: Press STAT, choose 1: Edit under the EDIT menu,
and press ENTER. Input your data set into L1.
Step 2: Press STAT again and go to the CALC menu. Choose 1:
1-Var Stats and press ENTER.
Step 3: On the main screen, input L1 at the flashing cursor
position and press ENTER.
Then you will get every value you need.
Practice:
{12, 34, 63, 723, 668, 593, 832, 774, 326, 753 }
Practice: #7 #8 in Suggested problem set 1
Review of Chapter 4
In this chapter, we learned
Center of a distribution
Midrange; Median; Mean (Definition, Properties)
Spread of a distribution
Range; IQR;SD (Definition, Properties)
Ch5 Understanding and comparing
distributions
 To understand the distributions,
 Draw boxplot by hand
 Read Information from boxplot
 To compare the distributions,
 Compare by using boxplots
Term 1: Boxplot
 Why Boxplot?
The numerical descriptions for a distribution, e.g., median, Q1,
Q3 and IQR, are useful.
However, we love plots!!!
Boxplots are perfect tools to vividly display the
numerical descriptions of median, Q1, Q3, IQR and
outliers on a single plot.
We will discuss:
How to make a boxplot by hand
Read information from a boxplot.
Term 1: Boxplot
 Example: Draw a boxplot for {0,6,7,8,9,10,11,15}
 Preparations: We need the 5-number summary
 Making a boxplot by hand: (Vertical Boxplot)
 Draw Box: Draw short horizontal lines at the lower and upper
quartiles and at the median. Then connect them with vertical lines to
form a box.
 Compute
Upper fence = Q3 + 1.5 × IQR
Lower fence = Q1 - 1.5 × IQR
Caution: Don’t draw the upper and lower fences on the boxplot!!
 Draw Whiskers: Draw lines from the ends of the box up and down to
the most extreme values found within the fences.
 Draw Outliers: any data values outside the fences, denoted by special
symbols. (e.g. *)
Remark: Sometimes, people prefer to construct a horizontal boxplot.
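The numbers needed to draw this boxplot can be worked out in code before drawing anything. The sketch below is one way to do it, assuming the by-hand quartile rule (median of each half, with the median excluded from both halves when n is odd). The function name `boxplot_stats` and the parameter `k` (the 1.5 multiplier) are hypothetical names used only here.

```python
from statistics import median


def boxplot_stats(values, k=1.5):
    """Return the quantities needed to draw a boxplot by hand."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[n - half:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme values still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "Q1": q1, "median": median(data), "Q3": q3, "IQR": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


stats = boxplot_stats([0, 6, 7, 8, 9, 10, 11, 15])
for name, value in stats.items():
    print(name, value)
```

For this dataset the fences come out at 0.5 and 16.5, so the whiskers stop at 6 and 15 and the value 0 is drawn as an outlier.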
Term 1: Boxplot
• Interpretation of the boxplot
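[Figure: annotated vertical boxplot. From top to bottom: Upper Whisker
(maximum), Q3, Median, Q1, Lower Whisker, and an Outlier below it (minimum).
The box from Q1 to Q3 spans the IQR; the distance from minimum to maximum
is the Range. Each of the four sections (upper whisker, upper part of the
box, lower part of the box, lower whisker) holds 25% of data.]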
Note: Whatever the boxplot looks like, the maximum value is always the topmost
point of the boxplot and the minimum value is always the bottommost point
(an outlier symbol if there is one, otherwise the end of the whisker).