Class 1
Introduction
• Sigma Notation
• Graphical Descriptions of Data
• Numerical Descriptions of Data

Sigma Notation
• Representation of a sum
• Uses the Greek letter sigma, Σ, and a variable of summation
  $\sum_{i=1}^{3} (2i + 1) = (2(1) + 1) + (2(2) + 1) + (2(3) + 1) = 3 + 5 + 7 = 15$

Sigma Notation
• This is used in many situations to represent a computation performed with a data set.
• Let $x_i$ represent the ith value in a data set of size n. Then the sum of the data set can be written as:
  $\sum_{i=1}^{n} x_i$

Graphical Representations of Data
• Frequently, there is too much information in raw data.
• It is common to attempt to reduce the amount of information. Examples include:
  • Histograms
  • Line graphs
  • Bar charts
  • Pie charts

Graphical Representations of Data
• This is an art form. Creativity is a key to success.
• Some dimensions that can be used include:
  • Vertical dimension
  • Horizontal dimension
  • Color
  • Size
  • Icon
  • Animation

Numerical Representations of Data
• It is absolutely critical to distinguish between a population and a sample.
• A population is the entire body of data from which a sample may be drawn.
• A sample is a specific subset of a population.

Numerical Representations of Data
• A parameter is a numerical measure of a population. Parameters are frequently represented with Greek letters.
• A statistic is a numerical measure of a sample.

Numerical Representations of Data
• Population: Parameters
• Sample: Statistics

Numerical Representations of Data
• Measures of Central Tendency in a population
  • The median is the middle value of a population where the values have been ordered by size.
  • The mode is the most frequently occurring value.
  • The most important one is the mean (average). Let $x_i$ be the ith data point in a population of size N. Then
    $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

Numerical Representations of Data
• Note that the median and mode are insensitive to outliers, while the mean is not. What might this imply about using means, medians, and modes?
• In a sample of size n, the mean is computed by
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Numerical Representations of Data
• Measures of Central Tendency might not reflect important attributes of the data
• What are the measures of central tendency for the following two populations?
  {31000, 40000, 40000, 49000} and {39000, 40000, 40000, 41000}

Numerical Representations of Data
• Measures of Variability or Dispersion
• The range is the difference between the largest and smallest values in a population (sample).
  » Consider the populations {0, 0, 0, 0, 4} and {0, 1, 2, 3, 4}
• How can we include all of the data in a measure of dispersion? We can try to measure how far from some point they are, but if we fix that point (say 0), then we will get non-intuitive results.

Numerical Representations of Data
• If we select $\mu$ (for a population), then at least we will be measuring the distance from the middle of the population. Note that the distance must be positive (unsigned) or we always get 0! How can we make the distance positive?

Numerical Representations of Data
• The variance of a population is the average (mean) squared distance of the values to the mean.
  $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
• The standard deviation is the square root of the variance.

Numerical Representations of Data
• The sample variance is computed in a slightly different way:
  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
• The sample standard deviation, s, is computed by taking the square root of the sample variance.

Numerical Representations of Data
• Chebyshev’s Theorem
• At least $(1 - 1/k^2)$ of the values in a data set must be within k standard deviations of the mean, where k > 1.
• As an example, if k = 2, we can say that at least $(1 - 1/2^2) = (1 - 1/4) = 3/4$ of the values will be within 2 standard deviations of the mean. For a population, this is the interval $[\mu - 2\sigma, \mu + 2\sigma]$. For a sample, this is the interval $[\bar{x} - 2s, \bar{x} + 2s]$.

Numerical Representations of Data
• In fact, many data sets are unimodal (mound or bell shaped).
  In this case, the following approximation is found to hold empirically:
  • About 68% of the values will be within 1 standard deviation of the mean.
  • About 95% of the values will be within 2 standard deviations of the mean.
  • About 99% of the values will be within 3 standard deviations of the mean.

Looking for Outliers: z-scores
• A z-score for the ith data point in a sample is computed by
  $z_i = \frac{x_i - \bar{x}}{s}$
• How would we define it for a population?
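
The formulas from these slides can be tried out on the two salary-like populations given earlier. The following is a minimal Python sketch, not part of the original slides. The helper names (`population_variance`, `sample_variance`, `z_scores`, `fraction_within`) are hypothetical and chosen only for illustration, and the z-score step treats population A as if it were a sample.

```python
import math
import statistics


def population_variance(values):
    """Mean squared distance to the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sample variance as on the slides (divide by n - 1)."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


def z_scores(values):
    """z-score of each point, treating `values` as a sample."""
    x_bar = sum(values) / len(values)
    s = math.sqrt(sample_variance(values))
    return [(x - x_bar) / s for x in values]


def fraction_within(values, k):
    """Fraction of values within k population standard deviations of the mean."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(population_variance(values))
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


# Sigma notation: the sum of (2i + 1) for i = 1..3.
sigma_example = sum(2 * i + 1 for i in range(1, 4))
print("sum of (2i + 1), i = 1..3:", sigma_example)

# The two populations from the central-tendency slide.
pop_a = [31000, 40000, 40000, 49000]
pop_b = [39000, 40000, 40000, 41000]

for name, pop in (("A", pop_a), ("B", pop_b)):
    print(
        name,
        "mean:", statistics.mean(pop),
        "median:", statistics.median(pop),
        "mode:", statistics.mode(pop),
        "range:", max(pop) - min(pop),
        "std dev:", round(math.sqrt(population_variance(pop)), 1),
    )

# Chebyshev's bound for k = 2: at least 1 - 1/k**2 = 3/4 of the values.
k = 2
print("within 2 std devs (A):", fraction_within(pop_a, k), ">=", 1 - 1 / k**2)

# z-scores, treating population A as if it were a sample.
print("z-scores (A):", [round(z, 2) for z in z_scores(pop_a)])
```

Both populations have the same mean, median, and mode (40000), yet their standard deviations differ by roughly a factor of nine (about 6364 versus about 707). This is the sense in which measures of central tendency alone might not reflect important attributes of the data.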