Chapter 2: Basics
Petter Mostad

Empirical statistics versus statistical inference
- Empirical statistics = descriptive statistics
  - Extract important features from the data
  - Illustrate the data graphically
  - Give an overview
  - Examples: average values, (sample) standard deviation, histogram, scatterplot, etc.
- Statistical inference: using statistical models to represent and infer knowledge

Data types
- Categorical
- Numerical data
  - Discrete (whole numbers)
  - Continuous (real numbers, possibly limited to some interval)
- Univariate, bivariate, or multivariate data.

The empirical distribution
- The set of observed values
- Can be summarized and illustrated in many ways:
  - Average and (sample) standard deviation
  - Dot diagrams
  - Boxplots
  - Histograms or frequency distributions
  - Scatterplots

Average, standard deviation, and median
- If the data are $x_1, x_2, \ldots, x_n$, then:
- The average or mean (gives location) is
  $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
- The standard deviation (gives spread) is
  $$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
  The sample variance is the square of the sample standard deviation: $s^2$
- The median (with the observations sorted by size):
  - If there is an odd number of observations, the middle one
  - If there is an even number of observations, the average of the two middle ones

The coefficient of variation
- The ratio of the standard deviation to the mean is called the coefficient of variation: $s/\bar{x}$
- The inverse is sometimes called the signal-to-noise ratio: $\bar{x}/s$

Dot diagrams and boxplots
- Dot diagrams show the data directly
- Boxplots indicate the different portions of the set of values: the median, the quartiles, and the range of the data.
[Figure: example dot diagram and boxplot]

Visualizing numerical data with histograms
[Figure: example histogram]
- Bins and bin sizes
- The areas should be proportional to the number of observations within each bin.
- Standard: The area is equal to the frequency of observations.

Scatterplots
[Figure: scatterplot of y against x]
- Displays bivariate data. Here: two variables, x and y.
- Great tool to see relationships between variables.

Population and sample
- Often: Data are derived by sampling from a population
- Random sampling
- Often: We want to learn about properties of the population by studying the sample: statistical inference
- The properties of the sample approach the population properties as the sample size increases

Example: a political poll
- Population: All Swedes? In Göteborg? Adult?
- Is the sample a random sample?
- Goal: To find out about the political opinions of all Swedes, based on the answers from the sampled persons.
- Random sampling makes it possible to do so.

Example: Repeated measurements
Data: Repeated measurements of the acidity of a river.
- Population: All possible measurements that could have been done. (Infinite population?)
- Random sampling?
- Goal: To learn about the whole population (acidity in all of the river, and the variation) based on the sample data.

Example: How can we make inference from sample to population
Your newspaper claims 40% of Swedes vote S. Assume you ask a random sample of 100 persons, and 32 say they vote S. Can you claim the newspaper is wrong?
- Assume the newspaper is correct. On your computer, simulate a random sample of 100 persons, each with a 40% chance of voting S. You get that 38 vote S.
- Repeat the above 1000 times. You find that in only 64 of the cases (6.4%), 32 or fewer of the simulated Swedes vote S.
- This gives you a pretty good argument that the newspaper is probably wrong.

When population size increases towards infinity
- Sampling with or without replacement.
- When the population size increases towards infinity, the difference between sampling with and without replacement becomes unimportant.
- Often: The population size can be treated as infinite for practical (sampling) purposes (population of Sweden etc.) or imagined infinite.
- Histograms can be made with smaller and smaller bin sizes, and approach continuous functions with area under the curve equal to 1.
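The simulation described in the newspaper example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `simulate_poll` and `fraction_at_most` and the `seed` argument are made up here, and the exact count will vary from run to run, so it will not necessarily reproduce the 64 out of 1000 reported on the slide.

```python
import random


def simulate_poll(n_people=100, p_vote=0.40, rng=random):
    """Simulate one poll: count how many of n_people vote S,
    when each person independently votes S with probability p_vote."""
    return sum(rng.random() < p_vote for _ in range(n_people))


def fraction_at_most(observed, n_repeats=1000, n_people=100,
                     p_vote=0.40, seed=None):
    """Repeat the simulated poll n_repeats times and return the fraction
    of polls in which `observed` or fewer persons vote S."""
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    hits = sum(
        simulate_poll(n_people, p_vote, rng) <= observed
        for _ in range(n_repeats)
    )
    return hits / n_repeats


# Fraction of simulated polls with 32 or fewer S voters,
# assuming the newspaper's 40% is correct.
print(fraction_at_most(32, seed=1))
```

A small fraction means that a result as low as 32 would be unusual if the newspaper's figure were right, which is the argument made in the example.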
Sampling from an (almost) infinite population of continuous values
- The population is represented by a continuous function with integral 1, the probability density, representing the population distribution
- The probability that a sampled value will be in an interval is given by the area under the curve over that interval.
- Our goal: To learn about the population distribution from the sample

For discrete values:
- The population is represented by a set of probabilities on the possible values
- Our goal: To learn about these probabilities from the sample.
- NOTE: A probability distribution representing an infinite population is sometimes called a random variable (continuous or discrete).

Population average versus sample average
- Generally: The average of the sample will approach the average of the population when the sample size increases.
- For infinite populations represented by a probability distribution:
  - The population average is called the expectation
  - It is computed with an integral

Example: Average CO2 emissions from cars in Sweden
- Investigated by testing a few cars.
- Compared to the size of the sample, the size of the population is effectively infinite.
- The population is represented by a probability distribution, with some expectation.
- We want to learn about this expectation from our sample.

Sample averages seem to stabilize as sample size increases
[Figure: average of sample plotted against sample size, for sample sizes from 0 to 100]

Population standard deviation versus sample standard deviation
- We can define the population standard deviation of a probability distribution representing a population.
- The sample standard deviation will then (generally) approach the population standard deviation as the sample size increases.
- We need mathematical tools so that we know what we can say about the population standard deviation from a sample standard deviation.

A statistic
- A specific function of the data is called a statistic.
- Examples:
  - The mean
  - The variance
  - The range (max - min)
  - ...
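As a minimal sketch, the statistics named in this chapter (mean, sample standard deviation, sample variance, median, range, and coefficient of variation) can be computed with Python's standard `statistics` module. The helper name `describe` and the data values are made up for illustration and do not come from the course material.

```python
import statistics


def describe(data):
    """Return the descriptive statistics defined in this chapter."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample s.d.: divides by n - 1
    return {
        "mean": mean,
        "sd": sd,
        "variance": statistics.variance(data),  # sample variance, s^2
        "median": statistics.median(data),      # averages the two middle values if n is even
        "range": max(data) - min(data),
        "cv": sd / mean,                        # coefficient of variation, s / mean
    }


# Hypothetical data, for illustration only.
values = [150, 152, 155, 158, 160]
summary = describe(values)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Each entry in the returned dictionary is a statistic in the sense above: a specific function of the data alone.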
Population properties versus sample properties
- Descriptive statistics based on a sample have counterparts defined on probability distributions:
  - sample average vs. population expectation
  - sample variance vs. population variance
  - sample median vs. population median
  - histogram vs. probability density function
- Generally, the sample versions will approach the population versions as the sample size increases.
- How fast and how reliably this happens is an important part of the rest of the course.

Parametric families of probability distributions
- How can we solve the problem of learning about the population distribution from the sample?
- Usual procedure: Assume the population distribution has a particular parametric form, and estimate the parameters from the sample.

Example: The normal distribution
- A family of probability distributions.
- Two parameters: $\mu$ and $\sigma > 0$
- The probability density is
  $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
- For all values of the parameters, the integral is 1, so it is a probability distribution.
- The expectation is $\mu$ and the standard deviation is $\sigma$.

Plots of the normal distribution
[Figure: normal probability densities plotted for x from -6 to 6]
The standard normal distribution has expectation 0 and standard deviation 1

Quantiles of the standard normal distribution
[Figure: standard normal density plotted for x from -6 to 6]
- 68.3 percent of the area is between -1 and 1.
- 95.4 percent is between -2 and 2.
- 99.7 percent is between -3 and 3.

Using tables
- To find out how much area there is below the curve to the right of a number x, find x in table A in the textbook.
- Example: The area to the right of 2.77 is 0.0028. So: If the population has a standard normal distribution, the probability of observing something above 2.77 is 0.0028. By symmetry, the probability of observing something less than -2.77 is also 0.0028.

Linear transformations of normal distributions
- If X has a normal distribution with expectation $\mu$ and standard deviation $\sigma$, then $(X - \mu)/\sigma$ has a standard normal distribution.
- Question: If X has a normal distribution with expectation 4 and standard deviation 2, what is the probability that X is above 9?
- Answer: X > 9 if and only if (X - 4)/2 > (9 - 4)/2 = 2.5. The probability is the same as for a standard normal distribution to be above 2.5. Table A gives the probability 0.0062.

Example: fitting a normal distribution to data
[Figure: histogram of the sample, with frequency on the vertical axis]
- We have a sample of size 100: see histogram.
- The histogram shape is similar to a normal distribution: We use the parametric family of normal distributions as a guess for the population distribution.
- Sample average: 22.39. Sample standard deviation: 4.40.
- Our best guess for the population distribution is a normal distribution with expectation 22.39 and standard deviation 4.40.

Example continued
- Assuming our best guess for the population distribution is correct, we can answer questions such as:
  - What is the probability that a new sampled value will be larger than 30? (Answer: (30 - 22.39)/4.4 ≈ 1.73, table gives 0.0418)
  - What is the probability that a new sampled value will be less than 35? (Answer: (35 - 22.39)/4.4 ≈ 2.87, table gives 1 - 0.0021 = 0.9979)
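The table lookups in the examples above can also be checked numerically. The following is a small sketch, not part of the course material: the function name `normal_upper_tail` is made up here, and it uses the complementary error function `math.erfc` from Python's standard library in place of table A.

```python
import math


def normal_upper_tail(x, mu=0.0, sigma=1.0):
    """P(X > x) for X normal with expectation mu and standard deviation sigma."""
    z = (x - mu) / sigma  # standardize: (X - mu) / sigma is standard normal
    return 0.5 * math.erfc(z / math.sqrt(2))


# Area to the right of 2.77 under the standard normal curve.
print(round(normal_upper_tail(2.77), 4))

# P(X > 9) when X has expectation 4 and standard deviation 2.
print(round(normal_upper_tail(9, mu=4, sigma=2), 4))

# Fitted distribution from the example: expectation 22.39, standard deviation 4.40.
print(round(normal_upper_tail(30, mu=22.39, sigma=4.40), 4))      # P(new value > 30)
print(round(1 - normal_upper_tail(35, mu=22.39, sigma=4.40), 4))  # P(new value < 35)
```

The results can differ from the table values in the last digit, because the table is entered with z rounded to two decimals while the code uses the unrounded z.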