Descriptive Statistics
for one Variable
Variables and measurements
• A variable is a characteristic of an individual
or object in which the researcher is
interested. For example, the SAT score of a
college student.
• For a particular individual or object, the
variable takes a value called a measurement.
For example, John’s SAT score is 720.
Different Types of Variables
• Some variables are quantitative variables, like the time
a person takes to finish a task or the person’s age.
• Other variables are qualitative variables, such as the
person’s nationality or the person’s preferred sport.
• In these notes we will work with quantitative variables.
• All the measurements collected from individuals for
a particular variable are referred to as “data”.
• Our data will contain the measurement for only one
variable.
Statistics has two major chapters:
• Descriptive Statistics
• Inferential statistics
Statistics
Descriptive Statistics
• Provides numerical and
graphic procedures to
summarize the information
of the data in a clear and
understandable way
Inferential Statistics
• Provides procedures
to draw inferences
about a population
from a sample
Population and Samples
The Population under study is the set of all individuals
of interest for the research.
We will see that, in practice, the variable is measured
only for a part of the population.
That part of the population for which we collect
measurements is called a sample.
The number of individuals in a sample is denoted by n.
In these notes and examples we will assume that our
data correspond to a sample of the population under
study.
Descriptive Measures
• Central Tendency measures. They are
computed in order to give a “center” around which the
measurements in the data are distributed.
• Variation or Variability measures. They
describe “data spread” or how far away the
measurements are from the center.
• Relative Standing measures. They describe
the relative position of a specific measurement in the
data.
Measures of Central Tendency
• Mean:
Sum of all measurements in the data divided by the
number of measurements.
• Median:
A number such that at most half of the measurements
are below it and at most half of the measurements are
above it.
• Mode:
The most frequent measurement in the data.
Example of Mean
Measurements (x)    Deviation (x - mean)
       3                   -1
       5                    1
       5                    1
       1                   -3
       7                    3
       2                   -2
       6                    2
       7                    3
       0                   -4
       4                    0
Sum:  40                    0
• MEAN = 40/10 = 4
• Notice that the sum of the
“deviations” is 0.
• Notice that every single
observation enters into
the computation of the
mean.
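The computation in this example can be reproduced with a short Python sketch. The variable names are illustrative only; the data are the ten measurements from the example.

```python
# Measurements from the example of the mean.
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

# Mean: sum of all measurements divided by the number of measurements.
mean = sum(data) / len(data)

# Deviation of each measurement from the mean (x - mean).
deviations = [x - mean for x in data]

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```

The sum of the deviations comes out as 0, as noted in the example.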
Example of Median
Measurements (x)    Ranked measurements (x)
       3                    0
       5                    1
       5                    2
       1                    3
       7                    4
       2                    5
       6                    5
       7                    6
       0                    7
       4                    7
Sum:  40                   40
• Median: (4+5)/2 =
4.5
• Notice that only the
two central values are
used in the
computation.
• The median is not
sensitive to extreme
values
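A minimal Python sketch of the ranking procedure used in this example. The helper name `median` and the list `with_outlier` are illustrative assumptions; the second list replaces one measurement (a 7) with 700 to show how little the median reacts to an extreme value.

```python
def median(values):
    """Rank the values and return the middle one (or the mean of the two middle ones)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    # Even number of measurements: average the two central values.
    return (ranked[mid - 1] + ranked[mid]) / 2

data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
print("median:", median(data))

# Hypothetical variation: one 7 replaced by the extreme value 700.
with_outlier = [3, 5, 5, 1, 700, 2, 6, 7, 0, 4]
print("median with outlier:", median(with_outlier))
print("mean with outlier:", sum(with_outlier) / len(with_outlier))
```

The median stays at 4.5 in both cases, while the mean of the modified list is pulled far above most of the measurements.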
Example of Mode
Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
• In this case the data have
two modes:
• 5 and 7
• Both measurements are
repeated twice
Example of Mode
Measurements (x): 3, 5, 1, 1, 4, 7, 3, 8, 3
• Mode: 3
• Notice that it is possible for a
data set not to have any mode.
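Both mode examples can be checked with a small Python sketch. The helper `modes` is an illustrative name, and it assumes the convention that a data set in which no measurement repeats has no mode.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or an empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: no repeated measurement means no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

first = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
second = [3, 5, 1, 1, 4, 7, 3, 8, 3]

print(modes(first))      # two modes
print(modes(second))     # a single mode
print(modes([1, 2, 3]))  # no mode
```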
Measures of Variability
• Range
• Variance
• Standard Deviation
The Range
• Definition: The range of a data set is the difference
between the largest and the smallest measurements
in the data.
• To find the range, first order the data from least to
greatest. Then subtract the smallest value from the
largest value in the set.
• Example: A marathon race was completed by 7
participants. What is the range of times given in
hours below?
2.3 hr, 8.7 hr, 3.5 hr, 5.1 hr, 4.9 hr, 7.1 hr, 4.2 hr
Ordering the data from least to greatest, we get: 2.3,
3.5, 4.2, 4.9, 5.1, 7.1, 8.7. So highest - lowest = 8.7
hr - 2.3 hr = 6.4 hr. Answer: The range of marathon times
is 6.4 hr.
The Range is not Enough
Consider the following examples of data
1,1,1,1,8
1,2,4,6,8
1,8,1,8,1
In the three cases the Range is the same:
Range = 7
However, the three series exhibit completely
different distributions of values along the
range of values
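The range computations above can be sketched in Python. The helper name `data_range` is an assumption made for illustration. Sorting is not needed in code, since `max` and `min` find the extremes directly.

```python
def data_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)

# Marathon times in hours.
times = [2.3, 8.7, 3.5, 5.1, 4.9, 7.1, 4.2]
print(round(data_range(times), 1))  # rounded to avoid floating-point noise

# The three series share the same range despite different distributions.
series = [
    [1, 1, 1, 1, 8],
    [1, 2, 4, 6, 8],
    [1, 8, 1, 8, 1],
]
print([data_range(s) for s in series])
```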
The sample variance
The variance takes into account the deviation
around the mean of the Data.
The formula for the sample variance is as follows:
$$ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} $$
The Standard Deviation is the
square root of the Variance:
$$ s = \sqrt{\text{Variance}} = \sqrt{s^2} $$
Notice that the mean and the standard
deviation are expressed in the same units
as the measurements
Variance (for a sample)
• Steps:
– Compute each deviation
– Square each deviation
– Sum all the squares
– Divide by the data size (sample size) minus
one: n-1
Example of Variance
Measurements (x)    Deviations (x - mean)    Square of deviations
       3                   -1                        1
       5                    1                        1
       5                    1                        1
       1                   -3                        9
       7                    3                        9
       2                   -2                        4
       6                    2                        4
       7                    3                        9
       0                   -4                       16
       4                    0                        0
Sum:  40                    0                       54
• Variance = 54/9 = 6
• It is a measure of
“spread”.
• Notice that the larger
the deviations (positive
or negative) the larger
the variance
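The four steps for the sample variance can be followed one by one in a short Python sketch. The variable names are illustrative. The final line cross-checks the result against `statistics.variance` from the standard library, which also divides by n - 1.

```python
import statistics

data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
n = len(data)
mean = sum(data) / n

# Step 1: compute each deviation.
deviations = [x - mean for x in data]
# Step 2: square each deviation.
squares = [d ** 2 for d in deviations]
# Step 3: sum all the squares.
sum_of_squares = sum(squares)
# Step 4: divide by the sample size minus one.
variance = sum_of_squares / (n - 1)

print("sum of squares:", sum_of_squares)
print("variance:", variance)
print("statistics.variance:", statistics.variance(data))
```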
The standard deviation
• It is defined as the square root of the
variance
• In the previous example
• Variance = 6
• Standard deviation = Square root of the
variance = Square root of 6 ≈ 2.45
• The standard deviation summarizes the
deviations in one number
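As a quick check of this value, the sketch below takes the square root of the variance and compares it with `statistics.stdev`, the standard-library function for the sample standard deviation.

```python
import math
import statistics

data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

variance = 6  # sample variance from the previous example
std_dev = math.sqrt(variance)

print(round(std_dev, 2))
print(round(statistics.stdev(data), 2))
```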
Percentiles
• The p-th percentile is a number such that at most p%
of the measurements are below it and at most 100 – p
percent of the data are above it.
• For example, if the 85th percentile of a data set is 340,
then about 85% of the measurements are below 340 and
about 15% of the measurements are above 340.
• Notice that the median is the 50th percentile
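The definition of the p-th percentile can be turned into a direct check. The function below is a hypothetical helper, not a standard library routine: it tests whether a candidate number satisfies the "at most p% below, at most (100 - p)% above" condition. It does not compute a percentile, and software packages use several different interpolation rules for that.

```python
def satisfies_percentile(values, candidate, p):
    """Check the 'at most p% below, at most (100 - p)% above' definition."""
    n = len(values)
    pct_below = 100 * sum(1 for v in values if v < candidate) / n
    pct_above = 100 * sum(1 for v in values if v > candidate) / n
    return pct_below <= p and pct_above <= 100 - p

data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

# The median 4.5 satisfies the definition of the 50th percentile.
print(satisfies_percentile(data, 4.5, 50))
# The value 7 does not: 80% of the measurements are below it.
print(satisfies_percentile(data, 7, 50))
```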
Tchebichev’s Rule
The standard deviation can be used to construct an interval enclosing
a large percentage of the data. In fact, this rule says that for any
data set:
• At least 75% of the measurements differ from the mean by less than
twice the standard deviation.
• At least 8/9 (about 89%) of the measurements differ from the mean by less than
three times the standard deviation.
Note:
This is a general property and it is called Tchebichev’s Rule (also spelled Chebyshev): for any k > 1, at
least $1 - 1/k^2$ of the observations fall within k standard deviations of the
mean. It is true for every data set.
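The bound in this rule is easy to evaluate and to compare against actual data. The sketch below uses two illustrative helpers (`chebyshev_bound` and `fraction_within` are assumed names) and reuses the ten measurements from the earlier examples.

```python
import statistics

def chebyshev_bound(k):
    """Minimum fraction of measurements within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of measurements closer than k standard deviations to the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return sum(1 for v in values if abs(v - mean) < k * s) / len(values)

data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3), fraction_within(data, k))
```

For this data set the observed fractions are well above the guaranteed minimum, which is typical: the rule gives a lower bound, not an estimate.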
Example of Tchebichev’s Rule
Suppose that for a certain
data set:
• Mean = 20
• Standard deviation = 3
Then:
• At least 75% of the
measurements are
between 14 and 26
• At least 8/9 (about 89%) of the
measurements are
between 11 and 29
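The two intervals in this example are simply the mean plus and minus k standard deviations. A minimal sketch, with `interval` as an assumed helper name:

```python
def interval(mean, std_dev, k):
    """Return (mean - k * std_dev, mean + k * std_dev)."""
    return (mean - k * std_dev, mean + k * std_dev)

mean, std_dev = 20, 3

print(interval(mean, std_dev, 2))  # at least 75% of the measurements
print(interval(mean, std_dev, 3))  # at least 8/9 of the measurements
```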
Further Notes
• When the Mean is greater than the Median the
data distribution is skewed to the Right.
• When the Median is greater than the Mean the
data distribution is skewed to the Left.
• When Mean and Median are very close to each
other the data distribution is approximately
symmetric.
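These rules of thumb can be sketched as a small comparison of the mean and the median. The function name and the `tolerance` parameter are assumptions introduced here, because the notes do not say how close "very close" is. The result is a heuristic indication, not a formal test of skewness.

```python
import statistics

def skew_direction(values, tolerance=0.25):
    """Compare mean and median; `tolerance` is a hypothetical closeness threshold."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if abs(mean - median) <= tolerance:
        return "approximately symmetric"
    return "skewed right" if mean > median else "skewed left"

print(skew_direction([1, 1, 1, 1, 8]))  # mean 2.4, median 1
print(skew_direction([1, 8, 8, 8, 8]))  # mean 6.6, median 8
print(skew_direction([1, 2, 4, 6, 8]))  # mean 4.2, median 4
```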
Empirical Rule (68-95-99.7 Rule)
For “Normal Distributions” (data sets whose histograms
are bell- or mound-shaped):
• Approx. 68% of values are within 1 standard deviation of the
mean
• Approx. 95% of values are within 2 standard deviations of the
mean
• Approx. 99.7% of values are within 3 standard deviations of the
mean
Example of Empirical Rule
Suppose that the hourly wages of a certain type of workers have a
“normal distribution” (bell-shaped histogram). Assume also that
the mean is $14 with a standard deviation of $1.5.
Then we have:
1 standard deviation = $1.5
2 standard deviations = $3.0
3 standard deviations = $4.5
What does the empirical rule allow us to say?
Solution
The empirical rule allows us to say that:
• Approx. 68% of workers in this occupation earn wages that are
within 1 standard deviation of the mean:
– Between 14 – 1.5 and 14 + 1.5
– Between $12.5 and $15.5
• Approx. 95% of workers in this occupation earn wages that are
within 2 standard deviations of the mean:
– Between 14 – 3 and 14 + 3
– Between $11.0 and $17.0
• Approx. 99.7% of workers in this occupation earn wages that are
within 3 standard deviations of the mean:
– Between 14 – 4.5 and 14 + 4.5
– Between $9.5 and $18.5
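The three intervals in the solution can be reproduced with a short sketch. It uses the mean of $14 that appears in the solution's arithmetic and the standard deviation of $1.5. The function name `empirical_intervals` is an assumption made for illustration.

```python
def empirical_intervals(mean, std_dev):
    """Intervals of 1, 2 and 3 standard deviations around the mean."""
    return {k: (mean - k * std_dev, mean + k * std_dev) for k in (1, 2, 3)}

# Approximate coverage stated by the empirical rule for bell-shaped data.
coverage = {1: "68%", 2: "95%", 3: "99.7%"}

intervals = empirical_intervals(14.0, 1.5)
for k, (low, high) in intervals.items():
    print(f"approx. {coverage[k]} of wages between ${low} and ${high}")
```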