Chapter 3 Descriptive Statistics: Numerical Measures Outline and Definitions
Sample Statistic: a measure computed from sample data (a point estimator of the corresponding population parameter)
Population Parameter: a measure computed from data for an entire population
Example (to be used throughout lecture): Apartment rental prices
Twenty efficiency apartments in a small town were sampled and their rental prices recorded:
425
430
435
435
440
445
445
445
465
470
475
480
490
500
510
535
550
570
600
615
Section 3.1: Measures of Location
I) Mean
-- average of all the data values
-- mean can be computed from a sample ($\bar{x}$):
$\bar{x} = \frac{\sum x_i}{n}$
where n = # observations and $\sum x_i = x_1 + x_2 + x_3 + \dots + x_n$
-- mean can be computed from a population ($\mu$):
$\mu = \frac{\sum x_i}{N}$
where N = # observations and $\sum x_i = x_1 + x_2 + x_3 + \dots + x_N$
-- influenced by extreme values (outliers), whether large or small
Example (Apartment rental prices)
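As a quick check of the sample mean formula, the rental data can be averaged directly. This is only a minimal sketch; the names `rents` and `sample_mean` are illustrative and do not come from the notes.

```python
# Rental prices of the 20 sampled efficiency apartments (from the example data)
rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

n = len(rents)                  # number of observations
sample_mean = sum(rents) / n    # x-bar = (sum of x_i) / n

print(sample_mean)  # 488.0
```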
II) Median
-- value in the middle when data is arranged in ascending order
-- preferred measure when data has extreme values
For odd # of observations
-- arrange data in ascending order
-- median is middle value
Example:
For even # of observations
-- median is the avg of the middle two values
Example:
Example (Apartment rental prices)
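The odd/even rule for the median can be sketched as follows. The helper name `median` and the three-value list are illustrative assumptions; only the rental data comes from the notes.

```python
def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)          # arrange data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd number of observations
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even number of observations

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

print(median([3, 1, 2]))  # 2 (hypothetical odd-sized list)
print(median(rents))      # 472.5 (average of the 10th and 11th values)
```

With 20 observations the median (472.5) sits below the mean (488), which is consistent with a few large rents pulling the mean upward.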
III) Mode:
-- value of data set that occurs most frequently
-- the greatest frequency can occur at 2 or more different values
   Data with exactly 2 modes are called bimodal
   Data with more than 2 modes are called multimodal
-- important measure for qualitative data
Example (Apartment rental prices)
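One way to find the mode, including the bimodal and multimodal cases, is to count frequencies and keep every value tied for the highest count. This is a sketch; the helper name `modes` and the small second list are hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the greatest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

print(modes(rents))            # [445] (445 occurs three times)
print(modes([1, 1, 2, 2, 3]))  # [1, 2] (a hypothetical bimodal list)
```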
IV] Percentiles
-- provides info about how data is spread from smallest to largest values
pth percentile = a value such that approximately p percent of the observations are < the value and approximately (100 - p) percent of the observations are > the value
Steps
1) Arrange data in ascending order
2) Compute the position of the pth percentile (index i)
i = (p/100)n
3) If i is not an integer, round up; the pth percentile is the value in the ith position
   If i is an integer, the pth percentile is the avg of the values in positions i and i + 1
Example (Apartment rental prices)
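The three steps above can be sketched in code. The function name `percentile` is an assumption made for illustration; note that library routines (for example NumPy's default) interpolate differently and may give slightly different answers than this rule.

```python
import math

def percentile(values, p):
    """pth percentile using the three-step rule from the notes (0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)              # step 1: ascending order
    n = len(ordered)
    i = p * n / 100                       # step 2: index i = (p/100)n
    if i != int(i):                       # step 3: i is not an integer, so round up
        return ordered[math.ceil(i) - 1]  # positions are 1-based, lists are 0-based
    i = int(i)                            # step 3: i is an integer
    return (ordered[i - 1] + ordered[i]) / 2

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

print(percentile(rents, 90))  # 585.0 (i = 18, average of the 18th and 19th values)
print(percentile(rents, 33))  # 445   (i = 6.6, rounded up to the 7th value)
```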
V] Quartiles
-- quartiles are specific percentiles
   1st quartile = 25th percentile
   2nd quartile = 50th percentile, or median
   3rd quartile = 75th percentile
Example (Apartment rental prices)
Section 3.2: Measures of Variability
-- measures of variability are also known as measures of dispersion
-- used in conjunction with measures of location to understand a business situation
Example
I] Range
-- difference between largest and smallest data value
-- very sensitive to smallest and largest data values so rarely used as an isolated measure of variability
Example (Apartment rental prices)
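For the rental data the range is a one-line calculation. The variable name `data_range` is illustrative.

```python
rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

# Range = largest value - smallest value
data_range = max(rents) - min(rents)

print(data_range)  # 190
```

Because only the two extreme values enter the calculation, changing either one changes the range directly, no matter how the other 18 values are distributed.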
II] Interquartile Range
-- difference between 3rd Quartile and 1st Quartile
-- range for middle 50% of data
-- overcomes influence of extreme values
Example (Apartment rental prices)
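A sketch of the quartiles and interquartile range for the rental data follows. It assumes the percentile rule described in Section 3.1 (index i = (p/100)n, round up if i is not an integer, otherwise average positions i and i + 1); the helper name `percentile` is hypothetical, and statistical software may use a different interpolation rule.

```python
import math

def percentile(values, p):
    """pth percentile via the index rule i = (p/100)n, for 0 < p < 100."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i != int(i):
        return ordered[math.ceil(i) - 1]   # round up; 1-based position
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

q1 = percentile(rents, 25)   # 1st quartile
q3 = percentile(rents, 75)   # 3rd quartile
iqr = q3 - q1                # spread of the middle 50% of the data

print(q1, q3, iqr)  # 442.5 522.5 80.0
```

The IQR of 80 is much smaller than the full range of 190, which reflects how far the highest rents sit from the bulk of the data.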
III] Variance
-- measure of variability considering all the data (how spread out your data points are around center of distribution)
-- based on the difference between the value of each observation (xi) and the mean; these differences are known as deviations about the mean:
   $x_i - \bar{x}$ (sample)
   $x_i - \mu$ (population)
-- variance is computed by squaring each deviation about the mean
-- variance is the average of the squared differences between each data value and the mean (the sample variance divides by n - 1 rather than n)
-- since the deviations are squared, the variance is expressed in squared units (not very intuitive on its own)
Sample Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Population Variance
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Example (Apartment rental prices)
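The two variance formulas differ only in the divisor, which a short calculation on the rental data makes visible. This is a sketch: the variable names are illustrative, and the population figure simply treats the 20 apartments as if they were the whole population.

```python
rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

n = len(rents)
mean = sum(rents) / n

# Squared deviations about the mean
squared_deviations = [(x - mean) ** 2 for x in rents]

sample_variance = sum(squared_deviations) / (n - 1)   # divides by n - 1
population_variance = sum(squared_deviations) / n     # divides by N

print(round(sample_variance, 2))  # 3366.84
print(population_variance)        # 3198.5
```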
IV] Standard Deviation
-- positive square root of the variance
-- roughly the average amount by which the data values differ from the mean (above or below it)
-- measured in the same units as the data, making it more easily interpreted than the variance.
-- for a bell-shaped (normal), symmetric distribution, about 68% of the cases fall within one standard deviation of the mean (34% above the mean and 34% below)
-- within the right context, the lower the standard deviation, the more concentrated the values are around the mean
Sample Standard Deviation
$s = \sqrt{s^2}$
Population Standard Deviation
$\sigma = \sqrt{\sigma^2}$
Example (Apartment rental prices)
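Python's standard `statistics` module provides both versions of the standard deviation, so the rental example can be checked as sketched below. The variable names `s` and `sigma` are chosen here to mirror the notation in the notes.

```python
import statistics

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

s = statistics.stdev(rents)       # sample standard deviation (n - 1 divisor)
sigma = statistics.pstdev(rents)  # population standard deviation (N divisor)

print(round(s, 2))      # 58.02
print(round(sigma, 2))  # 56.56
```

Unlike the variance, both results are in the same units as the rents themselves.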
V] Coefficient of Variation
-- indicates how large the standard deviation is in relation to the mean
Sample Coefficient of Variation
$\left( \frac{s}{\bar{x}} \times 100 \right)\%$
Population Coefficient of Variation
$\left( \frac{\sigma}{\mu} \times 100 \right)\%$
Example (Apartment rental prices)
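For the rental data, the sample coefficient of variation can be sketched as the sample standard deviation divided by the sample mean, times 100. The variable name `cv` is illustrative.

```python
import statistics

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

x_bar = statistics.mean(rents)
s = statistics.stdev(rents)

# Standard deviation expressed as a percentage of the mean
cv = s / x_bar * 100

print(round(cv, 2))  # 11.89
```

In other words, the sample standard deviation is about 11.9% of the sample mean.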
Section 3.3 Measures of Distribution Shape, Relative Location and Detecting Outliers
Skewness
-- describes the shape of the distribution; the numerical measure is typically generated via statistical software
-- varies in degree and is described using terms such as moderately skewed and highly skewed
-- can be seen by constructing a histogram
a) Symmetric
-- skewness = 0
-- mean = median = mode
Example
b) Moderately Skewed Left
-- skewness is negative (skewness measure between -1 and 0)
-- characterized as having a tail to the left
-- mean is usually < median
Example
c) Moderately Skewed Right
-- skewness is positive (skewness measure between 0 and 1)
-- mean is usually > median
Example
d) Highly Skewed Right
-- skewness is positive (skewness measure > 1)
-- mean is usually > median
Example
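The notes leave the skewness formula to software. As an illustration only, the sketch below assumes the adjusted sample skewness that spreadsheet programs commonly report, $\frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$; this formula and the function name `sample_skewness` are assumptions, not something stated in the notes.

```python
import statistics

def sample_skewness(values):
    """Adjusted sample skewness (assumed formula; see the lead-in)."""
    n = len(values)
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

skew = sample_skewness(rents)
print(round(skew, 2))  # 0.97
```

Under this formula the rental data come out as skewed right (positive, just under 1), which agrees with the mean (488) being greater than the median (472.5).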
Z Score
-- an observation’s z-score is a measure of the relative location of the observation in a data set.
-- often called a standardized value: it converts a value into units of standard deviation
-- it denotes the number of standard deviations a data value xi is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
$z_i$ = z score for $x_i$
$\bar{x}$ = sample mean
s = sample standard deviation
a) z score < 0 → data value < sample mean
b) z score > 0 → data value > sample mean
c) z score = 0 → data value = sample mean
Example
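Applied to the apartment rental data used throughout these notes, the z-score formula can be sketched as follows. The list name `z_scores` is illustrative.

```python
import statistics

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

x_bar = statistics.mean(rents)   # sample mean
s = statistics.stdev(rents)      # sample standard deviation

# z_i = (x_i - x_bar) / s for every observation
z_scores = [(x - x_bar) / s for x in rents]

print(round(z_scores[0], 2))    # -1.09 (425 is below the mean)
print(round(z_scores[-1], 2))   # 2.19  (615 is above the mean)
```

The smallest rent lies about 1.09 standard deviations below the mean, and the largest about 2.19 standard deviations above it.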
Chebyshev’s Theorem
-- applies for any type of distribution
-- allows us to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean; multiplying the proportion by 100 gives the percentage
Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean (+/-), where z is any value greater than 1.
Proportion of data values within z = 2 standard deviations of the mean → at least 75%
Proportion of data values within z = 3 standard deviations of the mean → at least 89%
Proportion of data values within z = 4 standard deviations of the mean → at least 94%
-- the actual proportion may be greater than the minimum
Example
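To see the theorem as a lower bound rather than an estimate, the sketch below compares the Chebyshev minimum with the proportion actually observed in the apartment rental data. The function names `chebyshev_bound` and `proportion_within` are hypothetical.

```python
import statistics

def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations (z > 1)."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2

def proportion_within(values, z):
    """Observed proportion of values within z sample standard deviations."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    inside = [x for x in values if abs(x - x_bar) <= z * s]
    return len(inside) / len(values)

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

for z in (2, 3, 4):
    print(z, round(chebyshev_bound(z), 4), proportion_within(rents, z))
```

For z = 2 the theorem guarantees at least 75%, while 95% of the rents (19 of 20) actually fall within two standard deviations of the mean.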
Empirical Rule
-- For normal or bell-shaped distributions:
   about 68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean
   about 95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean
   about 99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean
[Figure: normal curve over x, marked at μ ± 1σ, μ ± 2σ and μ ± 3σ, with the 68.26%, 95.44% and 99.72% regions labeled]
Detecting Outliers
-- an outlier is an unusually small or unusually large value in the data set
-- a data value with a z score < -3 or > 3 might be considered an outlier
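A minimal sketch of this screening rule, applied to the apartment rental data, is shown below. The cutoff of 3 comes from the rule above; the variable names are illustrative.

```python
import statistics

rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]

x_bar = statistics.mean(rents)
s = statistics.stdev(rents)

# Flag values whose z score is below -3 or above 3
outliers = [x for x in rents if abs((x - x_bar) / s) > 3]

print(outliers)  # [] (no rent is more than 3 standard deviations from the mean)
```

A flagged value is only a candidate: it should be checked for recording errors before deciding whether it belongs in the data set.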
Section 3.5 Measure of Association Between 2 Variables
-- descriptive measures of the relationship between 2 variables
Covariance
-- measure of linear association between 2 variables
-- positive values indicate a positive linear relationship (as x increases, y increases)
-- negative values indicate a negative linear relationship (as x increases, y decreases)
-- when the points are evenly distributed with no linear pattern, the covariance is close to 0
-- be careful about reading a larger covariance as a stronger linear association when comparing two covariance values: the value of the covariance depends on the units of measurement of x and y
E.g., the relationship between height and weight, with height measured in inches or in feet:
   measuring height in inches rather than feet gives larger $(x_i - \bar{x})$ values → larger numerator → larger covariance
For that reason, we don't normally assess the strength of the linear association using covariance (it is just used to denote a positive or negative relationship)
Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
for a sample of size n with observations $(x_1, y_1), (x_2, y_2), \dots$
Population Covariance
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
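The unit dependence of covariance can be illustrated with a small sketch. The height and weight figures below are made up purely for illustration (they are not data from the notes), and `sample_covariance` is a hypothetical helper name.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: heights in inches and weights in pounds for 5 people
height_in = [60, 62, 65, 68, 70]
weight_lb = [110, 120, 140, 155, 175]
height_ft = [h / 12 for h in height_in]   # the same heights, in feet

cov_in = sample_covariance(height_in, weight_lb)
cov_ft = sample_covariance(height_ft, weight_lb)

print(cov_in)            # 107.5
print(round(cov_ft, 2))  # 8.96
```

The relationship between the two variables is identical in both cases, yet the covariance is 12 times larger when height is measured in inches, so its magnitude says little about strength of association.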
Pearson Product Moment Correlation Coefficient
-- can take on values between -1 and +1
-- values near -1 indicate a strong negative linear relationship
values near +1 indicate a strong positive linear relationship
Sample Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
where
$s_{xy}$ = sample covariance
$s_x$, $s_y$ = sample standard deviations
Population Correlation Coefficient
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
where
$\sigma_{xy}$ = population covariance
$\sigma_x$, $\sigma_y$ = population standard deviations
-- correlation is a measure of linear association, not causation
   → just because 2 variables are highly correlated does not mean one causes the other
-- scatter plots and correlation coefficients are normally examined together
Example
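A sketch of the sample correlation coefficient follows, computed as the sample covariance divided by the product of the sample standard deviations. The height and weight values are made up for illustration only, and `sample_correlation` is a hypothetical helper name.

```python
import math

def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y), using n - 1 divisors throughout."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical data: heights in inches and weights in pounds for 5 people
height_in = [60, 62, 65, 68, 70]
weight_lb = [110, 120, 140, 155, 175]
height_ft = [h / 12 for h in height_in]   # the same heights, in feet

r_in = sample_correlation(height_in, weight_lb)
r_ft = sample_correlation(height_ft, weight_lb)

print(round(r_in, 3))  # 0.994
print(round(r_ft, 3))  # 0.994
```

Changing the unit of height from inches to feet leaves the correlation unchanged, which is why it is used to judge the strength of a linear relationship.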
Section 3.6 Weighted Mean and Working With Grouped Data
Weighted Mean
-- mean is computed by giving each data value a weight that reflects its importance
-- analyst must choose the weight that best reflects the importance of each value
Sample Weighted Mean
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where
xi = value of observation i
wi = weight for observation i
Example
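The weighted mean can be sketched as below. The scores and weights are hypothetical values chosen only to keep the arithmetic easy to follow, and `weighted_mean` is an illustrative helper name.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical example: three scores with weights 2, 3 and 5
scores = [80, 90, 70]
weights = [2, 3, 5]

print(weighted_mean(scores, weights))  # 78.0
print(sum(scores) / len(scores))       # 80.0 (unweighted mean, for comparison)
```

The weighted mean (78) falls below the simple mean (80) because the lowest score carries the largest weight.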