```
CHAPTER 3
Data Description
3-1
Objectives

Summarize data using measures of central
tendency, such as the mean, median, mode,
and midrange.

Describe data using the measures of
variation, such as the range, variance, and
standard deviation.

Identify the position of a data value in a data
set using various measures of position, such
as percentiles, deciles, and quartiles.
3-2
Objectives (cont’d.)

Use the techniques of exploratory data
analysis, including boxplots and five-number
summaries to discover various aspects of
data.
3-3
Introduction

Statistical methods can be used to summarize
data.

Measures of average are also called measures
of central tendency and include the mean,
median, mode, and midrange.

Measures that determine the spread of data
values are called measures of variation or
measures of dispersion and include the range,
variance, and standard deviation.
3-4
Introduction (cont’d.)

Measures of position tell where a specific data
value falls within the data set or its relative
position in comparison with other data
values.

The most common measures of position are
percentiles, deciles, and quartiles.
3-5
Introduction (cont’d.)

The measures of central tendency, variation,
and position are part of what is called
traditional statistics. These measures are
typically used to confirm conjectures about
the data.
3-6
Introduction (cont’d.)

Another type of statistics is called exploratory
data analysis. These techniques include the
boxplot and the five-number summary.
They can be used to explore data to see what
they show.
3-7
Basic Vocabulary

A statistic is a characteristic or measure
obtained by using the data values from a
sample.

A parameter is a characteristic or measure
obtained by using all the data values for a
specific population.

When the data in a data set are ordered, the
data set is called a data array.
3-8
General Rounding Rule

In statistics the basic
rounding rule is that
when computations
are done in the
calculation, rounding
should not be done
until the final answer
is calculated.
3-9
The Arithmetic Average

The mean is the sum of the values divided by
the total number of values.

Rounding rule: the mean should be rounded
to one more decimal place than occurs in the
raw data.

The type of mean that considers an additional
factor is called the weighted mean.
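
Worked example (illustrative values, not from
the original slides): for the data 2, 4, 4, 4,
5, 5, 7, 9 the sum is 40 and there are 8
values, so the mean is 40 / 8 = 5. The raw
data are whole numbers, so by the rounding
rule the mean is reported as 5.0.

For a weighted mean, assuming the usual form
(sum of weight × value) / (sum of weights):
grades of 4, 3, and 2 earned in courses worth
3, 4, and 3 credits give
(12 + 12 + 6) / 10 = 3.0.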
3-10
The Arithmetic Average

The Greek letter μ (mu) is used to represent
the population mean.

The symbol x̄ (“x-bar”) represents the sample
mean.

Assume that data are obtained from a sample
unless otherwise specified.
3-11
Median and Mode

The median is the halfway point in a data set.
The symbol for the median is MD.

The median is found by arranging the data in
order and selecting the middle point.

The value that occurs most often in a data set
is called the mode.

The mode for grouped data, or the class with
the highest frequency, is the modal class.
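
Worked example (illustrative values, not from
the original slides): the ordered data 2, 4, 4,
4, 5, 5, 7, 9 contain 8 values, so the middle
point lies between the 4th and 5th values.
Using the usual convention for an even number
of values, MD = (4 + 5) / 2 = 4.5. The value 4
occurs three times, more often than any other
value, so the mode is 4.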
3-12
Midrange

The midrange is defined as the sum of the
lowest and highest values in the data set
divided by 2.

The symbol for midrange is MR.
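
Worked example (illustrative values, not from
the original slides): for the data 2, 4, 4, 4,
5, 5, 7, 9 the lowest value is 2 and the
highest is 9, so MR = (2 + 9) / 2 = 5.5.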
3-13
Central Tendency: The Mean

One computes the mean by using all the
values of the data.

The mean varies less than the median or
mode when samples are taken from the same
population and all three measures are
computed for these samples.

The mean is used in computing other
statistics, such as variance.
3-14
Central Tendency: The Mean (cont’d.)

The mean for the data set is unique, and not
necessarily one of the data values.

The mean cannot be computed for an open-ended
frequency distribution.

The mean is affected by extremely high or low
values and may not be the appropriate
average to use in these situations.
3-15
Central Tendency: The Median

The median is used when one must find the center or
middle value of a data set.

The median is used when one must determine
whether the data values fall into the upper half or
lower half of the distribution.

The median is used to find the average of an
open-ended distribution.

The median is affected less than the mean by
extremely high or extremely low values.
3-16
Central Tendency: The Mode

The mode is used when the most typical case
is desired.

The mode is the easiest average to compute.

The mode can be used when the data are
nominal, such as religious preference, gender,
or political affiliation.

The mode is not always unique. A data set
can have more than one mode, or the mode
may not exist for a data set.
3-17
Central Tendency: The Midrange

The midrange is easy to compute.

The midrange gives the midpoint.

The midrange is affected by extremely high or
low values in a data set.
3-18
Distribution Shapes

In a positively skewed or right skewed
distribution, the majority of the data values
fall to the left of the mean and cluster at the
lower end of the distribution, with the tail to
the right.
3-19
Distribution Shapes (cont’d.)

In a symmetrical distribution, the data values
are evenly distributed on both sides of the
mean.
3-20
Distribution Shapes (cont’d.)

When the majority of the data values fall to
the right of the mean and cluster at the upper
end of the distribution, with the tail to the
left, the distribution is said to be negatively
skewed or left skewed.
3-21
The Range

The range is the highest value minus the
lowest value in a data set.

The symbol R is used for the range.
3-22
Variance and Standard Deviation

The variance is the average of the squares of
the distance each value is from the mean. The
symbol for the population variance is σ²
(sigma squared).

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
3-23
Variance and Standard Deviation

The standard deviation is the square root of
the variance. The symbol for the population
standard deviation is σ (sigma). Rounding rule:
The final answer should be rounded to one more
decimal place than the original data.

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
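
Worked example (illustrative values, not from
the original slides): treat the data 2, 4, 4,
4, 5, 5, 7, 9 as a population with mean 5.
The squared distances from the mean are
9, 1, 1, 1, 0, 0, 4, 16, which sum to 32.
Population variance = 32 / 8 = 4.
Population standard deviation = √4 = 2,
reported as 2.0 under the rounding rule.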
3-24
Coefficient of Variation

The coefficient of variation is the standard
deviation divided by the mean. The result is
expressed as a percentage.

The coefficient of variation is used to compare
standard deviations when the units are
different for the two variables being
compared.
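
Worked example (illustrative values, not from
the original slides): suppose heights have a
mean of 170 cm and a standard deviation of
8.5 cm, and weights have a mean of 70 kg and a
standard deviation of 10.5 kg.
Heights: (8.5 / 170) × 100% = 5%.
Weights: (10.5 / 70) × 100% = 15%.
Relative to their means, the weights are the
more variable of the two, even though the
units differ.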
3-25
Variance and Standard Deviation

Variances and standard deviations can be
used to determine the spread of the data. If
the variance or standard deviation is large,
the data are more dispersed. The information
is useful in comparing two or more data sets
to determine which is more variable.

The measures of variance and standard
deviation are used to determine the
consistency of a variable.
3-26
Variance and Standard Deviation (cont’d.)

The variance and standard deviation are used
to determine the number of data values that
fall within a specified interval in a
distribution.

The variance and standard deviation are used
quite often in inferential statistics.
3-27
Chebyshev’s Theorem

The proportion of values from a data set that
will fall within k standard deviations of the
mean will be at least 1 – 1/k², where k is a
number greater than 1.

This theorem applies to any distribution
regardless of its shape.
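
Worked example (illustrative values, not from
the original slides):
For k = 2: 1 – 1/4 = 0.75, so at least 75% of
the values fall within 2 standard deviations
of the mean.
For k = 3: 1 – 1/9 ≈ 0.889, so at least 88.9%
fall within 3 standard deviations.
With an assumed mean of 50 and standard
deviation of 5, at least 75% of the values
lie between 40 and 60.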
3-28
Empirical Rule for Normal Distributions
The following apply to a bell-shaped
distribution.

Approximately 68% of the data values fall
within one standard deviation of the mean.

Approximately 95% of the data values fall
within two standard deviations of the mean.

Approximately 99.7% of the data values fall
within three standard deviations of the mean.
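
Worked example (illustrative values, not from
the original slides): for a bell-shaped
distribution with an assumed mean of 100 and
standard deviation of 15,
about 68% of values lie between 85 and 115,
about 95% lie between 70 and 130, and
about 99.7% lie between 55 and 145.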
3-29
Standard Scores

A standard score or z score is used when
direct comparison of raw scores is impossible.

A standard score or z score for a value is
obtained by subtracting the mean from the
value and dividing the result by the standard
deviation.
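
Worked example (illustrative values, not from
the original slides): a student scores 85 on
test A (mean 75, standard deviation 5) and 60
on test B (mean 50, standard deviation 10).
Test A: z = (85 – 75) / 5 = 2.0.
Test B: z = (60 – 50) / 10 = 1.0.
Both raw scores are 10 points above the mean,
but the relative position is higher on test A.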
3-30
Percentiles

Percentiles are position measures used in
educational and health-related fields to
indicate the position of an individual in a
group.

A percentile, P, is an integer between 1 and 99
such that the Pth percentile is a value where
P% of the data values are less than or equal
to the value and (100 – P)% of the data values
are greater than or equal to the value.
3-31
Quartiles and Deciles

Quartiles divide the distribution into four
groups, separated by Q1, Q2, and Q3. Note that
Q1 is the same as the 25th percentile; Q2 is
the same as the 50th percentile or the median;
and Q3 corresponds to the 75th percentile.

Deciles divide the distribution into 10 groups.
They are denoted by D1, D2, …, D9.
3-32
Outliers

An outlier is an extremely high or an extremely
low data value when compared with the rest of
the data values.

Outliers can be the result of measurement or
observational error.

When a distribution is normal or bell-shaped,
data values that are beyond three standard
deviations of the mean can be considered
suspected outliers.
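
Worked example (illustrative values, not from
the original slides): for a bell-shaped
distribution with an assumed mean of 70 and
standard deviation of 4, three standard
deviations is 12, so values below 58 or above
82 can be considered suspected outliers. A
value of 85 would be flagged; a value of 80
would not.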
3-33
Exploratory Data Analysis

The purpose of exploratory data analysis is to
examine data in order to find out what
information can be discovered. For example:
Are there any gaps in the data?
Can any patterns be discerned?
3-34
Boxplots and Five-Number Summaries

Boxplots are graphical representations of a
five-number summary of a data set. The five specific
values that make up a five-number summary are:

The lowest value of the data set (minimum)

Q1 (or 25th percentile)

The median (or 50th percentile)

Q3 (or 75th percentile)

The highest value of the data set (maximum)
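
Worked example (illustrative values, not from
the original slides): for the ordered data
2, 4, 4, 4, 5, 5, 7, 9, and assuming Q1 and Q3
are taken as the medians of the lower and
upper halves (other conventions exist and can
give slightly different quartiles):
Minimum = 2
Q1 = (4 + 4) / 2 = 4
Median = (4 + 5) / 2 = 4.5
Q3 = (5 + 7) / 2 = 6
Maximum = 9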
3-35
Summary

Some basic ways to summarize data include
measures of central tendency, measures of
variation or dispersion, and measures of
position.

The three most commonly used measures of
central tendency are the mean, median, and
mode. The midrange is also used to represent
an average.
3-36
Summary (cont’d.)

The three most commonly used measures
of variation are the range, variance, and
standard deviation.

The most common measures of position are
percentiles, quartiles, and deciles.

Data values are distributed according to
Chebyshev’s theorem and, in special cases, the
empirical rule.
3-37
Summary (cont’d.)

The coefficient of variation is used to describe
the standard deviation in relationship to the
mean.

These methods are commonly called traditional
statistics.

Other methods, such as the boxplot and
five-number summary, are part of exploratory data
analysis; they are used to examine data to see
what they reveal.