Last lecture summary
• Mode
• Distribution
• Five-number summary, percentiles, mean
• Box plot, modified box plot
• Robust statistic: mean, median, trimmed mean
• Outlier

MEASURES OF VARIABILITY

Warm-up
www.udacity.com – Introduction to statistics

Question
[Figure labels: Mean1, Mean2, Mode1, Mode2, Median1, Median2]
www.udacity.com – Statistics

Range
\( \text{range} = \max - \min \)

Range
The range changes when we add new data to the dataset:
• Always
• Sometimes
• Never

Adding Mark Zuckerberg

Cut off data
Interquartile range (IQR)

Interquartile range, IQR
Let's take this quiz; answer yes or no.
1. About 50% of the data fall within the IQR.
2. The IQR is affected by every value in the data set.
3. The IQR is not affected by outliers.
4. The mean is always between Q1 and Q3.

Data (n = 13, mean = 8.62): 0 1 1 1 2 2 2 2 2 3 3 3 90
Q1 = 1, Q2 (median) = 2, Q3 = 3

Define the outlier
\( \text{outlier} < Q_1 - 1.5 \times \mathrm{IQR} \) or \( \text{outlier} > Q_3 + 1.5 \times \mathrm{IQR} \)

Sample (n = 10): $38,946  $43,420  $49,160  $50,430  $50,557  $52,580  $53,595  $54,160  $60,181  $10,000,000

What values are outliers for this data set?
1. $60,000
2. $80,000
3. $100,000
4. $200,000

Problem with IQR
[Figure: normal, bimodal and uniform distributions]

Options for measuring variability
• Find the average distance between all pairs of data values.
• Find the average distance between each data value and either the max or the min.
• Find the average distance between each data value and the mean.

Average distance from mean
Sample: 10, 5, 3, 2, 19, 1, 7, 11, 1, 1

Sample   Deviation from mean \( (x_i - \bar{x}) \)
10        4
5        -1
3        -3
2        -4
19       13
1        -5
7         1
11        5
1        -5
1        -5

\( \sum (x_i - \bar{x}) = 0 \)

Find the average distance between each data value and the mean.

Preventing cancellation
• How can we prevent the negative and positive deviations from cancelling each other out?
1. Ignore (i.e. delete) the negative sign.
2. Multiply each deviation by two.
3. Square each deviation.
4. Take the absolute value of each deviation.

Average absolute deviation

Sample   Deviation from mean \( (x_i - \bar{x}) \)   Absolute deviation \( |x_i - \bar{x}| \)
10        4        4
5        -1        1
3        -3        3
2        -4        4
19       13       13
1        -5        5
7         1        1
11        5        5
1        -5        5
1        -5        5

avg. absolute deviation = 4.6

Average absolute deviation
Which formula describes what you just did?
[Five candidate formulas (1. to 5.) built from \( x_i \), \( \bar{x} \), \( n \) and the absolute value]

Squared deviations

Sample   Deviation from mean \( (x_i - \bar{x}) \)   Squared deviation \( (x_i - \bar{x})^2 \)
10        4        16
5        -1         1
3        -3         9
2        -4        16
19       13       169
1        -5        25
7         1         1
11        5        25
1        -5        25
1        -5        25

SS, sum of squares (the squared deviations added up): \( \sum (x_i - \bar{x})^2 \)
avg. squared deviation = 31.2

Variance
The average squared deviation has a special name: variance.
\[ \frac{\sum (x_i - \bar{x})^2}{n} \]

Standard deviation
• Standard deviation, \( s \):
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \]
• Which symbol would you use for a variance? \( s^2 \)

Standard deviation
• What is so great about the standard deviation? Why don't we just find the average absolute deviation?
1. SD is used because of tradition.
2. It is easier to work with the power of two than with the absolute value.
3. SD has a very nice interpretation in the Gaussian distribution.
More on absolute vs. standard deviation: http://www.leeds.ac.uk/educol/documents/00003759.htm

Standard deviation: empirical rule

Empirical rule: well-behaved distribution
n = 400, mean = 14.2, s.d. = 14.1
mean ± 1 s.d. covers 273 data values, 66.8%
mean ± 2 s.d. covers 380 data values, 95%
mean ± 3 s.d. covers 397 data values, 99.3%

Empirical rule: not-so-well-behaved distribution
197 countries, mean = 69.9, s.d. = 9.7
65% within 1 s.d.
94.7% within 2 s.d.
100% within 3 s.d.

Statistical inference
• The goal of statistics: make rational conclusions or decisions based on the incomplete information we have in our data.
• This process is known as statistical inference.
• In inferential statistics we want to answer:
1. Is some relationship in the data due to chance, or is it a real difference?
2. If the effect is real, can it be generalized to a larger group?

Statistical jargon
• Population: the group we are interested in making conclusions about.
• Census: a collection of data on the entire population.
• Sample: if we can't conduct a census, we collect data from a sample of the population. Goal: make conclusions about that population.

Statistical jargon
population (census) vs. sample
parameter (population) vs. statistic (sample)

                      Population (parameter)   Sample (statistic)
Mean                  \( \mu \)                \( \bar{x} \)
Standard deviation    \( \sigma \)             \( s \)

[Figure: three samples]
\( \bar{x} = 19.44,\ s = 2.45 \)
\( \bar{x} = 16.89,\ s = 9.17 \)
\( \bar{x} = 17.22,\ s = 6.24 \)

Statistical inference
• A statistic is a value calculated from our observed data (sample).
• A parameter is a value that describes the population.
• We want to be able to generalize what we observe in our data to our population. In order to do this, the sample needs to be representative.
• How to select a representative sample? Use randomization.
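
Worked example (Python)
The measures of variability from the slides above (range, IQR, outlier fences, average absolute deviation, variance, standard deviation) can be recomputed with a short script. This is only a sketch: the helper name `quartiles` and the quartile convention it uses (medians of the lower and upper halves of the sorted data) are assumptions, not something the slides specify. Other conventions can give slightly different Q1 and Q3.

```python
from math import sqrt
from statistics import mean, median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: for an odd number of values the middle value
    is left out of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])


def fences(values):
    """Return the outlier fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# IQR slide: n = 13, Q1 = 1, Q3 = 3
iqr_data = [0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 90]
q1, q3 = quartiles(iqr_data)
data_range = max(iqr_data) - min(iqr_data)
low, high = fences(iqr_data)
outliers = [x for x in iqr_data if x < low or x > high]

# Outlier slide: salary sample, n = 10
salaries = [38946, 43420, 49160, 50430, 50557,
            52580, 53595, 54160, 60181, 10_000_000]
salary_low, salary_high = fences(salaries)

# Deviation slides: sample of 10 values with mean 6
sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]
x_bar = mean(sample)
deviations = [x - x_bar for x in sample]
avg_abs_dev = sum(abs(d) for d in deviations) / len(sample)
variance = sum(d ** 2 for d in deviations) / len(sample)  # divides by n, as on the slides
sd = sqrt(variance)

print("range:", data_range, "Q1:", q1, "Q3:", q3, "outliers:", outliers)
print("salary fences:", salary_low, salary_high)
print("sum of deviations:", sum(deviations))   # 0
print("avg. absolute deviation:", avg_abs_dev)  # 4.6
print("variance:", variance)                    # 31.2
print("standard deviation:", round(sd, 2))
```

The variance here divides by n, matching the formula on the slides. Many libraries default to dividing by n − 1 for a sample, so their results differ slightly.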