Chapter 3 Descriptive Statistics / Describing Distributions with
Numbers
A. Common Measurements of Location
- these measurements give you a sense of where a data point falls relative to the other data points.
Note: If you calculate a statistic from a sample, it is called a sample statistic. If you calculate it from a population, it is called a population parameter.
1. Mean – this is the most common measurement. It is simply the average. It is one of
three measures of centrality.
a. sample mean – x̄ = ∑xi / n = (x1 + x2 + … + xn) / n
b. population mean – μ = ∑xi / N
note: n = total number in sample N = total number in population
2. Median–Md – The middle value when the data is arranged in ascending order; second
measure of center.
- if the data has an odd number of points, then there is a true middle
-if the data has an even number of points, then take the average of the two middle points.
Example: If you are given 3, 6, 7, 8, 8, 10
Since there are an even number of points we take (x3 + x4) / 2 = (7 + 8) /2 = 7.5
3. Mode–Mo - this is the value that occurs with the greatest frequency. If more than one value takes on the greatest frequency, then we say the data is bi-, tri-, or multi-modal; last measure of centrality.
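A minimal Python sketch of these three measures of center, using the standard library's statistics module (Python 3.8+ for multimode) and the median example data above:

import statistics

data = [3, 6, 7, 8, 8, 10]  # even number of points, already in ascending order

print(statistics.mean(data))       # 7   -> sum of the data divided by n = 42 / 6
print(statistics.median(data))     # 7.5 -> average of the two middle values, (7 + 8) / 2
print(statistics.multimode(data))  # [8] -> value(s) occurring with the greatest frequency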
4. Percentiles/Quartiles – tell us how the data are spread from smallest to largest over a 0 to 100 percent scale.
How to calculate a percentile:
(a) arrange the data in ascending order
(b) compute the index of the percentile
Index – i = (p/100)n
p = percentile of interest
n = number of observations
(c) If i is not an integer, round it up; the value at that position is the pth percentile. If i is an integer, then the pth percentile is the average of the values in positions i and i+1.
a. For Lower Quartile (Q1 or 25th percentile):
i. Sort all observations in ascending order.
ii. Compute the position L1 = 0.25 * N, where N is the total number of observations.
iii. If L1 is a whole number, the lower quartile is midway between the L1-th value and the next one.
iv. If L1 is not a whole number, round it up to the nearest integer. The value at that position is the lower quartile.
b. For Upper Quartile (Q3 or 75th percentile):
i. Sort all observations in ascending order.
ii. Compute the position L3 = 0.75 * N, where N is the total number of observations.
iii. If L3 is a whole number, the upper quartile is midway between the L3-th value and the next one.
iv. If L3 is not a whole number, round it up to the nearest integer. The value at that position is the upper quartile.
Example: 61, 61, 61, 67, 73, 73, 74, 79, 81, 81, 87, 89, 89, 92, 97, 100
Given our test score data, if we wanted to know the 40th percentile we obtain it as follows:
i = (40/100)(16) = 0.4 * 16 = 6.4, which is not an integer, so we round up to 7.
So our 7th value is our 40th percentile. This corresponds to 74.
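A short Python sketch of the index method above (the function name percentile_index_method is just an illustrative choice):

import math

def percentile_index_method(data, p):
    # pth percentile using the index method described above
    xs = sorted(data)
    i = (p / 100) * len(xs)
    if i == int(i):                # integer index: average the values at positions i and i+1
        i = int(i)
        return (xs[i - 1] + xs[i]) / 2
    return xs[math.ceil(i) - 1]    # otherwise round up and take the value at that position

scores = [61, 61, 61, 67, 73, 73, 74, 79, 81, 81, 87, 89, 89, 92, 97, 100]
print(percentile_index_method(scores, 40))  # 74, matching the worked example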
5. Quartiles – these divide our data into 4 equal parts.
Q1 = first quartile; or the 25th percentile
Q2 = second quartile; or the 50th percentile
Q3 = third quartile; or the 75th percentile
The percentiles are calculated just as shown above.
6. Five-number summary – this is a way to show the data with 5 important values. It
gives us the max, min, and the 3 quartiles above.
Graph 1: Test Scores from Before (y-axis: test scores in percent, from 60 to 100).
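A minimal sketch of the five-number summary for these test scores, reusing the quartile rules above (note this textbook convention can differ slightly from other software's quantile methods):

import math

def pct(xs, p):  # xs must already be sorted ascending
    i = (p / 100) * len(xs)
    if i == int(i):
        return (xs[int(i) - 1] + xs[int(i)]) / 2
    return xs[math.ceil(i) - 1]

scores = sorted([61, 61, 61, 67, 73, 73, 74, 79, 81, 81, 87, 89, 89, 92, 97, 100])
summary = (min(scores), pct(scores, 25), pct(scores, 50), pct(scores, 75), max(scores))
print(summary)  # (61, 70.0, 80.0, 89.0, 100) -> min, Q1, median, Q3, max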
7. Weighted Mean
a. Weighted mean- μ= Σ wixi / Σ wi; this is used when the data points take on different
levels of importance or weights.
Example: Suppose we wanted the average cost per lb given the following data:

Purchase   Cost/lb   # lbs
1          3         1200
2          4         500
3          2         800

So μ = Σ wixi / Σ wi = [3(1200) + 4(500) + 2(800)] / (1200 + 500 + 800) = 7200 / 2500 = $2.88
So the average cost per pound is $2.88.
b. Sample mean if data is grouped – x̄ = Σ fiMi / Σ fi = Σ fiMi / n
fi – the frequency for class i
Mi – the midpoint for class i
n = sample size; this always equals the total frequency over all classes.
Note: the sample variance for grouped data – s² = ∑ fi(Mi – x̄)² / (n – 1)
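A quick Python check of the weighted-mean example (cost per lb weighted by pounds purchased):

costs = [3, 4, 2]          # cost per lb for each purchase (the xi)
pounds = [1200, 500, 800]  # pounds purchased (the weights wi)

weighted_mean = sum(w * x for w, x in zip(pounds, costs)) / sum(pounds)
print(weighted_mean)  # 2.88 -> the average cost per pound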
B. Measures of Variability
- these give us an idea of how dispersed the data are: whether the data are tightly bunched or not, what scale they are on, and where the majority of the data points lie.
1. Range – the largest value minus the smallest value; gives you a sense of the difference between the extreme points.
2. Interquartile Range – IQR = Q3 – Q1; gives the range of the middle 50% of the data.
- rule for outliers: if a point lies more than 1.5*IQR below Q1 or above Q3, we generally call it an outlier.
3. Variance – measures how different each data point is from the mean. It gives us an
overall idea of how different the data is from the mean value.
a. Population variance – σ² = ∑ (xi – μ)² / N
b. Sample variance – s² = ∑ (xi – x̄)² / (n – 1)
alt. formula: s² = [∑ xi² – (∑xi)² / n] / (n – 1)
note:
i. we divide by n-1 for the sample variance because it has been found that it gives a better
predictor of population variance than if we just divided by n.
ii. We must make sure to square the differences. If we didn't, the sum of all the differences would just add to 0; i.e. ∑ (xi – x̄) = 0.
iii. Variance measures cannot really be compared to one another, since the scales of different variables differ. If we measured the variance of ages in a class, ages are in much smaller units than incomes, whose variance would be in the thousands.
4. Standard Deviation – It gives us a better measure of spread because it gets us back to
our units of the variable of interest.
a. Population standard deviation – σ = √σ²
b. Sample standard deviation – s = √s²
5. Coefficient of Variation – tells how the standard deviation relates to the mean in
terms of magnitude. It gives us a way to compare standard deviations across different
distributions.
Coefficient of Variation = (Standard deviation / mean) * 100
The smaller the value, the more compact the data; the larger the value, the more dispersed the data.
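A short sketch of these spread measures using the statistics module and the test scores from before:

import statistics

scores = [61, 61, 61, 67, 73, 73, 74, 79, 81, 81, 87, 89, 89, 92, 97, 100]

s2 = statistics.variance(scores)  # sample variance (divides by n - 1)
s = statistics.stdev(scores)      # sample standard deviation, the square root of s2
cv = (s / statistics.mean(scores)) * 100  # coefficient of variation, in percent

print(round(s2, 2), round(s, 2), round(cv, 2))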
6. Skewness -For skewness we are looking at where the mean is relative to the median
and where the tail of the data is pointing.
So, there are two cases. The data can be skewed to the right, as in the case below: the mean is to the right of the median, and the tail points in the rightward direction.
(Figure: right-skewed relative frequency curve; the median lies to the left of the mean, and the tail points right.)
Also, if the mean is to the left of the median, the data is said to be left skewed. An example can be seen below. The tail of the distribution points to the left.
(Figure: left-skewed relative frequency curve; the mean lies to the left of the median, and the tail points left.)
If we have the mean and the median in the same place then the data is evenly
proportioned and there is said to be symmetry. This is what we talk about when we say
something is normal or if it has a bell-shape. See chapter 3 for more information.
7. Empirical Rule and Chebyshev's Theorem
a. Empirical Rule – This gives the percentage of data that lies within 1, 2, and 3 standard
deviations from the mean if we have a normal distribution. See chapter 6.
b. Chebyshev's Theorem – If we have a population with a mean of μ and standard deviation of σ, then for any value of k (with k > 1), at least 100(1 – 1/k²)% of the data points lie in the interval μ ± kσ.
example: Suppose we have μ = 5 with σ = 1.
Then if we consider k = 1.5, 100(1 – 1/1.5²) ≈ 55.6%. So what this tells us is that at least 55.6% of the data lies between 5 ± 1.5.
This is a very conservative estimate, because there could in fact be much more.
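A quick Python check of the bound for the example above:

k = 1.5
print(round(100 * (1 - 1 / k**2), 1))  # 55.6 -> at least this percent lies within mu ± 1.5*sigma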
C. Covariance/ Correlation/Line of Best Fit (OLS)
1. Cross-tabulations – this is a method of showing the relationship between the data for two variables simultaneously.
- it is very good at showing the relationship between two variables
- it can be used with quantitative variables, qualitative variables, or both
- you may use frequency, percent frequency, or relative frequency distributions (see the sketch after this list)
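A minimal pure-Python frequency cross-tabulation sketch; the variables (majors, passed) are made up for illustration:

from collections import Counter

majors = ["econ", "math", "econ", "econ", "math", "econ"]
passed = ["yes", "yes", "no", "yes", "no", "yes"]

# count each (major, passed) combination -> the cells of a frequency cross-tabulation
table = Counter(zip(majors, passed))
for (row, col), freq in sorted(table.items()):
    print(f"{row:5} {col:4} {freq}")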
2. Scatter diagram/plot – typical graphical representation between two quantitative
variables. Can show positive, negative, or no relationship.
(Graphs 1–4: example scatter plots showing positive, negative, and no relationship between two variables.)
a. Explanatory Variable – may explain or influence the response variable. It is generally
called an independent variable.
b. Response Variable – measures the outcome of a study. This is often called the dependent variable.
Ex: consider we want to focus on GPA. GPA would be the response variable, and all the
variables we thought would influence GPA would be the explanatory variables such as:
-study time
-ACT/SAT score
-age
Note: we can just as easily change our variable of interest to become the response variable. Doing so would most likely change the variables we use to explain it.
c. positive relationship – when one variable moves with another in the same direction. So as one variable goes up, so does the other.
d. negative relationship – when one variable moves with another in the opposite direction. So as one variable goes up, the other variable goes down.
e. outlier – a data point that falls well outside the overall pattern of the relationship between two variables
f. strength – how closely the data seems to follow a certain pattern. If the data closely
follows a specific pattern the relationship is said to be strong. If it does not closely
follow one, it is said to be weak.
3. Covariance – measures the linear association between two variables.
Mathematically:
a. Sample Covariance – Sxy = (1/(n – 1)) ∑ (xi – x̄)(yi – ȳ), for i = 1…n
b. Population Covariance – σxy = (1/N) ∑ (xi – μx)(yi – μy), for i = 1…N
Example: Suppose we have the following data that we define as coming from a sample:

x:  2   6   7   4   3
y: 15  14  12  17  12

So we first need the means of each set of data. We find that x̄ = 4.4, ȳ = 14, and n = 5.
Using these facts we can now find the covariance given the above formula as follows:
Sxy = (1/4) [(2 – 4.4)(15 – 14) + (6 – 4.4)(14 – 14) + (7 – 4.4)(12 – 14) + (4 – 4.4)(17 – 14) + (3 – 4.4)(12 – 14)] = (1/4) [–2.4 + 0 – 5.2 – 1.2 + 2.8] = (1/4)(–6) = –1.5
Note: the sign of the covariance tells you whether the relationship is positive or negative. If it is very close to zero, that implies no linear relationship.
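A quick Python check of this covariance calculation:

x = [2, 6, 7, 4, 3]
y = [15, 14, 12, 17, 12]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# sample covariance: sum of cross-products of deviations, divided by n - 1
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
print(s_xy)  # -1.5, matching the worked example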
4. Correlation Coefficient – measures the relationship between the two variables and also measures its magnitude.
a. sample correlation – rxy = (1/(n – 1)) ∑ ((xi – x̄)/Sx)((yi – ȳ)/Sy) = Sxy / (Sx Sy)
b. population correlation – ρxy = (1/N) ∑ ((xi – μx)/σx)((yi – μy)/σy) = σxy / (σx σy)
Note: Correlation always lies between –1 and 1. The closer it is to –1 or 1, the stronger the relationship; the closer to 0, the weaker the relationship between the two variables.
ex: Let's use the same data that we had before. To calculate correlation we need the covariance and the standard deviations of both the x & y variables.
So calculating the standard deviation of both x and y is as follows:
Sx = √{[(2 – 4.4)² + (6 – 4.4)² + (7 – 4.4)² + (4 – 4.4)² + (3 – 4.4)²] / 4} = √[(5.76 + 2.56 + 6.76 + 0.16 + 1.96) / 4] = √(17.2 / 4) ≈ 2.07
Sy = √{[(15 – 14)² + (14 – 14)² + (12 – 14)² + (17 – 14)² + (12 – 14)²] / 4} = √[(1 + 0 + 4 + 9 + 4) / 4] = √(18 / 4) ≈ 2.12
rxy = Sxy / (Sx Sy) = –1.5 / ((2.07)(2.12)) ≈ –0.34
This tells us that there is a moderate negative relationship between the x and y data. If we graph the data we can see this.
Graph 3: Scatterplot of X & Y (X data on the horizontal axis, 0–8; Y data on the vertical axis, 0–18).
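And the matching Python check for the correlation coefficient:

import math

x = [2, 6, 7, 4, 3]
y = [15, 14, 12, 17, 12]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

print(round(s_xy / (s_x * s_y), 2))  # -0.34 -> a moderate negative relationship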
5. Least Squares Method / Regression Line
- the least squares method is what is used to determine the best fit line. It is not simply drawn in.
- essentially, the data is plotted as a scatter diagram, and the line that minimizes the squared differences between the actual data points and the line drawn is the best fit line.
a. Least Squares Criterion – here is the mathematical idea of the explanation above.
Min Σ (yi - ŷi )2 , where yi = the actual value for the ith observation and ŷi = the estimated
value of the dependent variable for the ith observation.
- we find the b0 & b1 that minimize this sum of squared differences.
b. Slope and Y-intercept for the estimated regression equation
i. b1 = ∑ (xi – x̄)(yi – ȳ) / ∑ (xi – x̄)² or b1 = Sxy / Sx²
ii. b0 = ȳ – b1x̄
-so we use the above equations to estimate our regression equation and get a relationship
between x & y.
Note: we are assuming a linear relationship with this estimation, but in other estimation
techniques this need not be the case. For our purposes we will always assume a linear
form.
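A minimal sketch of these two formulas in Python, applied to the covariance example data from earlier:

x = [2, 6, 7, 4, 3]
y = [15, 14, 12, 17, 12]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# slope: b1 = sum of (xi - x_bar)(yi - y_bar) over sum of (xi - x_bar)^2
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar  # intercept: b0 = y_bar - b1 * x_bar

print(round(b1, 3), round(b0, 3))  # -0.349 15.535 -> estimated line ŷ = 15.535 - 0.349x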
c. Graphical example of the above analysis
(Figure: scatterplot with the line of best fit ŷ drawn through the data and the mean ȳ marked, showing for a data point yi:)
(ŷ – yi) – the difference between the data point and the estimated line
(yi – ȳ) – the difference between the average and the data point
(ŷ – ȳ) – the distance of the estimated line from the mean
**the important thing to get out of this is that the regression line or 'line of best fit' minimizes the sum of squared distances from each actual data point to the line.