Class 1
Introduction
Sigma Notation
Graphical Descriptions of Data
Numerical Descriptions of Data
Sigma Notation
• Representation of a sum
• Uses the Greek letter sigma, Σ, and a
variable of summation
$$\sum_{i=1}^{3} (2i + 1) = (2(1) + 1) + (2(2) + 1) + (2(3) + 1) = 3 + 5 + 7 = 15$$
Sigma Notation
• This is used in many situations to represent
a computation performed with a data set.
• Let $x_i$ represent the $i$th value in a data set of
size $n$. Then the sum of the data set can be
written as:
$$\sum_{i=1}^{n} x_i$$
Graphical Representations of
Data
• Frequently, there is too much information in
raw data.
• It is common to attempt to reduce the
amount of information. Examples include:
• Histograms
• Line graphs
• Bar charts
• Pie charts
Graphical Representations of
Data
• This is an art form. Creativity is a key to success.
• Some dimensions that can be used include:
• Vertical dimension
• Horizontal dimension
• Color
• Size
• Icon
• Animation
Numerical Representations of
Data
• It is absolutely critical to distinguish
between a population and a sample.
• A population is the entire body of data from which a
sample may be drawn.
• A sample is a specific subset of a population.
Numerical Representations of
Data
• A parameter is a numerical measure of a
population. Parameters are frequently
represented with Greek letters.
• A statistic is a numerical measure of a
sample.
Numerical Representations of
Data
Population
Parameters
Sample
Statistics
Numerical Representations of
Data
• Measures of Central Tendency in a
population
• The median is the middle value of a population
where the values have been ordered in size.
• The mode is the most frequently occurring value.
• The most important one is the mean (average). Let
$x_i$ be the $i$th data point in a population of size $N$.
Then
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
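As a rough illustration of the three measures, the sketch below computes them for a small made-up population. The data values are hypothetical and chosen only so the arithmetic is easy to follow; `statistics` is Python's standard library module.

```python
import statistics

# Hypothetical population (not from the slides).
population = [2, 3, 3, 5, 7]

mean = sum(population) / len(population)  # (1/N) * sum of x_i = 20 / 5
median = statistics.median(population)    # middle value after ordering
mode = statistics.mode(population)        # most frequently occurring value

print(mean, median, mode)  # 4.0 3 3
```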
Numerical Representations of
Data
• Note that the median and mode are
insensitive to outliers, while the mean is
not. What might this imply about using
means, medians, and modes?
• In a sample of size $n$, the mean is computed
by
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
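One way to see the sensitivity to outliers is to replace a single value with an extreme one and compare the measures. The two data sets below are hypothetical, and `sample_mean` is just an illustrative helper name.

```python
import statistics

def sample_mean(values):
    """Sample mean: (1/n) * sum of x_i."""
    return sum(values) / len(values)

# Hypothetical samples: identical except for the last value.
typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]

print(sample_mean(typical), statistics.median(typical))            # 3.0 3
print(sample_mean(with_outlier), statistics.median(with_outlier))  # 102.0 3
```

The median stays at 3 in both cases, while the mean is pulled far toward the extreme value.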
Numerical Representations of
Data
• Measures of Central Tendency might not
reflect important attributes of the data
• What are the measures of central tendency for the
following two populations? {31000, 40000, 40000,
49000} and {39000, 40000, 40000, 41000}
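The question can be worked through directly. The sketch below uses the two populations from the slide; the helper name `central_tendency` is an illustrative choice.

```python
import statistics

def central_tendency(values):
    """Return (mean, median, mode) for a list of values."""
    mean = sum(values) / len(values)
    return mean, statistics.median(values), statistics.mode(values)

pop_a = [31000, 40000, 40000, 49000]
pop_b = [39000, 40000, 40000, 41000]

print(central_tendency(pop_a))
print(central_tendency(pop_b))
```

Both populations have a mean, median, and mode of 40000, even though the first is spread much more widely than the second.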
Numerical Representations of
Data
• Measures of Variability or Dispersion
• The range is the difference between the largest and
smallest values in a population (sample).
» Consider the populations {0, 0, 0, 0, 4} and {0, 1, 2,
3, 4}
• How can we include all of the data in a measure of
dispersion? We can try to measure how far from
some point they are, but if we fix that point (say 0),
then we will get non-intuitive results.
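A quick sketch shows why the range alone can be misleading. It uses the two populations given above; `value_range` is an illustrative helper name.

```python
def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

pop_a = [0, 0, 0, 0, 4]
pop_b = [0, 1, 2, 3, 4]

print(value_range(pop_a))  # 4
print(value_range(pop_b))  # 4
```

The range is 4 for both, because it looks only at the two extreme values and ignores how the rest of the data are distributed.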
Numerical Representations of
Data
• If we select $\mu$ (for a population), then at
least we will be measuring the distance
from the middle of the population. Note
that the distance must be positive
(unsigned) or we always get 0! How can
we make the distance positive?
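The cancellation can be seen numerically. The sketch below uses the population {0, 1, 2, 3, 4} from the earlier slide and shows two common ways of making the distances unsigned, absolute values and squares; which one the lecture intends is answered on the next slide.

```python
population = [0, 1, 2, 3, 4]
mu = sum(population) / len(population)  # 2.0

signed = [x - mu for x in population]          # [-2.0, -1.0, 0.0, 1.0, 2.0]
absolute = [abs(x - mu) for x in population]   # unsigned distances
squared = [(x - mu) ** 2 for x in population]  # squared distances

print(sum(signed))    # 0.0: positive and negative deviations cancel
print(sum(absolute))  # 6.0
print(sum(squared))   # 10.0
```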
Numerical Representations of
Data
• The variance of a population is the average
(mean) squared distance of the values to the
mean.
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
• The standard deviation is the square root of
the variance.
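A minimal sketch of these two definitions, applied to the populations {0, 0, 0, 0, 4} and {0, 1, 2, 3, 4} from the earlier slide. The function name `population_variance` is an illustrative choice.

```python
import math

def population_variance(values):
    """Mean squared distance of the values from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

pop_a = [0, 0, 0, 0, 4]
pop_b = [0, 1, 2, 3, 4]

var_a = population_variance(pop_a)  # approximately 2.56
var_b = population_variance(pop_b)  # 2.0
sd_a = math.sqrt(var_a)             # approximately 1.6
sd_b = math.sqrt(var_b)             # approximately 1.414

print(var_a, sd_a)
print(var_b, sd_b)
```

Unlike the range, which is 4 for both populations, the variance uses every value and so tells the two apart.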
Numerical Representations of
Data
• The sample variance is computed in a
slightly different way:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
• The sample standard deviation, s, is
computed by taking the square root of the
variance.
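The only change from the population formula is the divisor, n - 1 instead of N. The sketch below treats {0, 1, 2, 3, 4} as a sample (an assumption made for illustration) and compares a hand-written version with `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared distances from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

sample = [0, 1, 2, 3, 4]

s2 = sample_variance(sample)  # 10 / 4 = 2.5
s = math.sqrt(s2)             # sample standard deviation

# Cross-check against the standard library implementation.
matches_stdlib = math.isclose(s2, statistics.variance(sample))
print(s2, s, matches_stdlib)
```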
Numerical Representations of
Data
• Chebyshev’s Theorem
• At least $(1 - 1/k^2)$ of the values in a data set must be
within $k$ standard deviations of the mean, where
$k > 1$.
• As an example, if $k = 2$, we can say that at least $(1 - 1/2^2) = (1 - 1/4) = 3/4$ of the values will be within 2
standard deviations of the mean. For a population,
this is the interval $[\mu - 2\sigma, \mu + 2\sigma]$. For a sample,
this is the interval $[\bar{x} - 2s, \bar{x} + 2s]$.
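The bound can be checked on any data set. The sketch below uses a hypothetical population with one extreme value and compares the observed fraction within k standard deviations against the guaranteed minimum; the names `chebyshev_bound` and `fraction_within` are illustrative.

```python
import math

def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of values in [mu - k*sigma, mu + k*sigma]."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)

# Hypothetical population (not from the slides), with one extreme value.
population = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

for k in (2, 3):
    print(k, chebyshev_bound(k), fraction_within(population, k))
```

For this data the observed fractions exceed the bounds, as the theorem requires. The theorem gives a guaranteed minimum, not an estimate of the actual fraction.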
Numerical Representations of
Data
• In fact, many data sets are unimodal
(mound or bell shaped). In this case, the
following approximation is found to hold
empirically:
• About 68% of the values will be within 1 standard
deviation of the mean.
• About 95% of the values will be within 2 standard
deviations of the mean.
• About 99.7% of the values will be within 3 standard
deviations of the mean.
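This approximation can be explored with simulated bell-shaped data. The sketch below is only an illustration under assumptions: the values are drawn from a normal distribution with an arbitrary mean of 50 and standard deviation of 10, and the seed is fixed so the run is repeatable. Real data sets will match the percentages only roughly.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same values

# Simulated bell-shaped data; the parameters 50 and 10 are arbitrary.
values = [random.gauss(50, 10) for _ in range(10_000)]
mean = statistics.fmean(values)
sd = statistics.pstdev(values)

def fraction_within(k):
    """Fraction of simulated values within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in values) / len(values)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 3))
```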
Looking for Outliers: z-scores
• A z-score for the $i$th data point in a sample is
computed by
$$z_i = \frac{x_i - \bar{x}}{s}$$
• How would we define it for a population?
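A short sketch of sample z-scores in use. The sample is hypothetical, and the cutoff of 2 for flagging a value is an assumed rule of thumb, not something stated in the slides.

```python
import statistics

# Hypothetical sample (not from the slides), with one extreme value.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

x_bar = statistics.mean(sample)  # sample mean
s = statistics.stdev(sample)     # sample standard deviation (n - 1 divisor)

z_scores = [(x - x_bar) / s for x in sample]

threshold = 2  # assumed cutoff for a "possible outlier"
flagged = [x for x, z in zip(sample, z_scores) if abs(z) > threshold]

print([round(z, 2) for z in z_scores])
print(flagged)  # [20]
```

A z-score expresses each value as a number of standard deviations above or below the mean, so the same cutoff can be applied to data measured in any units.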