CE902 Lecture 3: Statistics for Research
Annie Louis
January 30, 2017
CE902 Professional Practice and Research Methodology, Spring 2017

Your 1-minute presentations start this week
- Three batches of approx. 30 students each
  - Check the list on Moodle to identify which batch you are in
  - Batch 1: today, Batch 2: week 19, Batch 3: week 20
- Your presentation counts for 2% of your proposal marks
  - Based on participation, not assessment of the content
- Email your image (of a single slide) to this address by noon on the Friday before your slot comes up
  - [email protected]
- Check the previous lecture for instructions

Recap previous lectures
- What common research methods are used in science?
  - Hypothesis-based
  - Measurements
- Good scientific practices
  - Science versus pseudoscience
  - Inference: drawing a conclusion
  - Fallacies: faulty inferences

This lecture: statistics for research
- Many experiments involve data
  - data which are used to make conclusions
  - results / intermediate values
- Statistics helps with data description and drawing conclusions from data

Types of values
What is the difference between these data types?
1. Height of each student in this classroom
2. Values from a survey where students in this class tell me their most preferred day of the week on which they like to eat out
3. Number of stars (between 1 and 5) that each customer gave to a product on Amazon.com

Types of values
- Categorical: values come from one of n categories
  - Not numeric
  - Eg. values from a survey where students in this class tell me their most preferred day of the week on which they like to eat out
- Numerical: values make numerical sense
  - Eg. height of each student in this classroom
- Ordinal: one of n categories, but the categories have a numerical order
  - Eg. number of stars (between 1 and 5) that each customer gave to a product on Amazon.com

Suppose you have a dataset you are using in your project. How will you analyze/present the data in your report? i.e. What are the first steps?
1. Height of students in this class (in cm)
  - 150, 160, 152, 162, 155, 168, 161, 170, 200

Arithmetic mean
- Arithmetic average of N numbers
- Suppose $x_1, x_2, \dots, x_N$ are N observations

  $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}$

2. What if you have monthly household income in an area as follows? (in pounds)
  - 1000, 1500, 2000, 2200, 2500, 3000, 3300, 3500, 4000, 4200, 25000

Arithmetic mean is influenced greatly by outliers
- Median: what is a middle value?
  - Arrange the values in ascending order
  - If there is an odd number of values, the number in the middle is the median
  - If even, take the average of the middle two numbers
- Median household income is a better measure of the center of the data
  - Not prone to effects of some very poor or very wealthy households

What about categorical data?
- Choice for the favourite day to eat out
  - 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 7
- Mode: most repeated value
  - There may be more than one mode
  - There may be no mode!
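
As a minimal sketch, the measures of central tendency discussed above can be computed on the example data from the slides with Python's standard `statistics` module. The variable names are illustrative assumptions and are not part of the lecture.

```python
import statistics

# Example datasets copied from the slides.
heights_cm = [150, 160, 152, 162, 155, 168, 161, 170, 200]
incomes_gbp = [1000, 1500, 2000, 2200, 2500, 3000, 3300, 3500, 4000, 4200, 25000]
eat_out_days = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 7]

# Mean and median for the numerical data.
mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
mean_income = statistics.mean(incomes_gbp)
median_income = statistics.median(incomes_gbp)

# Mode for the categorical data; multimode returns a list because
# there may be more than one most-repeated value.
modes_day = statistics.multimode(eat_out_days)

print("height: mean", round(mean_height, 2), "median", median_height)
print("income: mean", round(mean_income, 2), "median", median_income)
print("favourite day: mode(s)", modes_day)
```

On the income data the single value of 25000 pulls the mean (about 4745) well above the median (3000), which is the outlier effect described in the slides.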

These measures helped to describe data
- Descriptive statistics: summarizing data and observations
  - Eg: computing an average
  - But don’t go beyond the data we have
- Mean, median and mode are known as measures of central tendency
- Measures of variability: How spread out are the values around the mean?
  - Variance, standard deviation

Statistics also helps to make conclusions from measurements or data
- Inferential statistics: drawing conclusions based on data and observations
  - Eg: testing statistical hypotheses

Inferential Statistics
- How to make conclusions that go beyond the sample data?
- We will start with some basics of probability distributions before moving on

Probability
- “Degree of certainty”
- Two interpretations

Probability as Relative frequency
- How often something happens in a sequence of observations in the long run
  - Eg. probability of getting a head on a coin toss
- Based on the idea of a repeatable experiment or a trial
  - Eg. tossing a coin, rolling a die
  - Let the outcome of the experiment be A

Probability as Relative frequency
- If there are n trials and A comes up m times, the relative frequency of A is m/n
- Over a large number of trials, this relative frequency becomes stable
- For a fair coin, the relative frequency of getting a head is close to 1/2

Probability as Degree of Belief
- Subjective opinion of some individual regarding how certain an event is to occur
  - Eg. probability that it will snow tomorrow
  - Eg. probability of a patient surviving an operation
- Does not make sense to think of these as repeatable experiments

Sample space and Events
- Sample or outcome space: all possible outcomes of an experiment
  - Eg. rolling a die, S = {1, 2, 3, 4, 5, 6}
  - Eg. flipping three coins, S = {HHH, HTT, THT, TTH, HHT, THH, HTH, TTT}
- Event: a subset of the outcome space
  - Eg. event of getting a value less than 4, E1 = {1, 2, 3}
  - Eg. event of getting exactly two heads, F1 = {HHT, THH, HTH}

A random variable
- A convenient notation for representing the outcome of an experiment
- A random variable takes a unique value for each event
- Eg. experiment where 3 coins are tossed
  - Y = number of heads
  - Range of Y is 0 to 3
  - Y = 0 corresponds to the event {TTT}

Two types of random variables
- Discrete random variable
  - Takes countable values
  - Eg. X = number of heads in 10 coin tosses
- Continuous random variable
  - Takes any real-numbered value
  - Eg. Y = 1.534 for lifetime of a bulb in years

We can now talk about the probability distribution of a random variable
- The probabilities of the individual values of a random variable taken together form a probability distribution
- P(X = x), for all x in the range of X

Discrete probability distribution
- Defined by a probability mass function P(x) or P(X = x) giving probabilities for each value of a discrete random variable X
  - $0 \le P(X = x) \le 1$ for all $x \in E$
  - $\sum_{x \in E} P(x) = 1$

Discrete probability distribution
- Eg 1: X = outcome of a coin flip

  x         P(X = x)
  0 (head)  0.5
  1 (tail)  0.5

- Eg 2: X = number of heads in 3 flips of a coin
  - S = {HHH, HHT, THH, HTH, HTT, THT, TTH, TTT}

  x         0    1    2    3
  P(X = x)  1/8  3/8  3/8  1/8

Continuous probability distribution
- In some cases, it does not make sense to talk about the probability at an individual point, eg. P(weight = 3.334)
  - i.e. for continuous random variables
  - Rather we talk about intervals, P(3 < weight < 4)
- Defined by a probability density function f(x) giving the probability of a random variable X taking values in a range

  $P(a \le X \le b) = \int_a^b f(x)\,dx$   (1)

  - $f(x) \ge 0$ for all $x \in \mathbb{R}$
  - $\int_{-\infty}^{+\infty} f(x)\,dx = 1$

Normal distribution
- A popular continuous distribution
- Takes the shape of a bell curve, symmetric when divided vertically in the middle
- The density function of a normal distribution with mean µ and standard deviation σ

  $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$   (2)

Empirical rules
- 68% of the data is within one standard deviation of the mean
- 95% of the data is within two standard deviations
- 99.7% is within three standard deviations

Many quantities in nature are well approximated by normal distributions
- Empirical observations of
  - Test scores
  - Heights of people
  - Errors in measurements
  - Blood pressure

Standard normal distribution
- Z has a standard normal distribution when the mean is 0 and the standard deviation is 1
  - density function $g(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z^2}$
- A normal distribution X can be converted into a Z distribution
  - $Z = \frac{X - \mu}{\sigma}$
- Probabilities for Z (areas under the curve g(z)) can be looked up from the Z table

Quick summary
- We can have discrete and continuous probability distributions
- A normal distribution has a bell-shaped curve and is symmetric when vertically divided in the middle
  - 95% of the data is within two standard deviations
- A normal distribution can be transformed into a Z distribution

Population versus Sample
- Population is the universe of individuals you are interested in
  - Eg. for a coin flip, the population has outcomes of an infinite number of flips
  - Eg. people in Essex
  - Eg. salmon fish in the Pacific ocean
- Sample is a subset of the population from which you may want to make conclusions about the population
  - Eg. 100 flips of the coin
  - Eg. 100 people from Essex chosen for a survey
  - Eg. salmon fish observed in a 1 sq. mile area of the Pacific

Population parameters versus sample statistics
- Let P(X) be the probability distribution of the population
  - Eg. distribution of heads and tails
  - Eg. distribution of age of all the people of Essex
  - Eg. distribution of the lengths of salmon fish in the Pacific
- The characteristics of the population such as mean and standard deviation are known as population parameters, or otherwise as the true mean and true standard deviation
- When these measures are computed on the sample, we call them sample statistics: sample mean and sample s.d.

Remind ourselves of our goal
- We are interested in making conclusions about the population based on our sample
  - Eg: You may survey a small set of voters but what you are interested in is the actual election results
- We compute statistics based on the sample and want to know if the statistics are representative of the population
  - Ideally, we want to get sample statistics that are close to the population parameters
- Statistics provides tools to check this closeness between a sample’s statistics and the population parameters

There is a primary concern while using sample statistics. What is it?
- Variability!
- If we take different samples, the statistics will always show variability
  - The number of heads in 100 flips of a coin will be different when your friend makes a different 100 flips and computes the number
  - Different samples of 100 people from Essex: the average age and standard deviation will always have variability
  - Mean lengths of fish observed in 1 square mile of the Pacific: highly unlikely to get the exact same value in a different location

How do we know if the sample statistics we have computed are reliable?

Population distribution for outcome from the roll of a die
- Expected value = 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
- If you rolled a die an ‘infinite’ number of times and averaged the values, you would end up with 3.5

Now we only have 10 rolls (a sample)
- Outcomes = {2, 4, 3, 2, 1, 6, 1, 4, 2, 6}
- Sample mean = 3.1
- Sample standard deviation = 1.8
- N = 10 is known as the sample size

The Sampling Distribution
- If we take a very large number of samples of size N and plot the sample statistic, we get the sampling distribution
- For our example, we take a very large number of samples of 10 rolls of a die
  - In each case, we compute the sample mean
  - The distribution of all these sample means is a new distribution: the sampling distribution
- Here the random variable is the sample statistic (not values from the population)

Parameters of the sampling distribution
For the sampling distribution of sample means,
- Mean = population mean µ
  - if you had infinite samples
- Standard deviation $\Delta x = \frac{\sigma}{\sqrt{N}}$
  - σ ← population s.d.
  - N ← sample size
- This standard deviation is known as the Standard Error

Standard Error
- Standard Error $\Delta x = \frac{\sigma}{\sqrt{N}}$
  - Uncertainty in sample means
  - If I take different samples, how much do the means vary?
- Standard error decreases as the sample size increases
  - Less variation in sample statistics as the sample size increases
- When the population standard deviation σ is unknown, we can approximate it using the sample standard deviation s

  $\Delta x = \frac{s}{\sqrt{N}}$

Central Limit Theorem
- As N (sample size) becomes large, the sampling distribution of the sample mean can be approximated by a normal distribution
- Generally with a sample size of around 30 or more, we start seeing a normal distribution
- The sampling distribution is approximately normal regardless of whether the population distribution was normal or not

What are the implications of the Central Limit Theorem?
- We get one sample
- We compute the sample mean
- We can get the probability of this sample mean under the sampling distribution
  - If we did the sampling many many times, how often do we get such a mean value?
  - We can get this probability *without* actually doing the sampling many many times

A particular (sample) mean value can now be mapped to a Z distribution
Suppose z = 2.4
“If the data is actually sampled from a population with mean µ and stdev σ, then 95% of the time, z will lie between -2 and +2.
By chance it will lie outside this interval only 5% of the time.”

Z-test
- Null hypothesis: The data comes from a distribution with mean µ and standard deviation σ
  - Alternative hypothesis: The data does not come from this distribution
- Collect N samples, compute sample mean x̄ and standard error ∆x

  $z = \frac{\bar{x} - \mu}{\Delta x}$   (3)

- General rule: reject the null hypothesis if the z value could have arisen by chance < 5% of the time
  - z value less than -1.96 or greater than 1.96; 95% of the curve is between these values

p-value
- This percent or probability value is known as a p-value
- General rule: if p-value < 0.05, reject the null hypothesis
- Reject the null hypothesis at the 5% level: chance alone can produce such a statistic less than 5% of the time

Example: Z-test for a coin experiment
- You have a coin. You flip it 100 times; it comes up tails 54 times and heads 46 times. Is the coin fair?
- Represent heads by 1 and tails by 0
- Null hypothesis: The coin is fair. True mean is 0.5 and standard deviation is 0.5
- Alternative hypothesis: The coin is not fair, biased towards tails

Plot mean and standard error
- The error bar pretty much overlaps the expected fraction of heads
- Cannot reject the null hypothesis: The result of slightly more tails can be explained by random chance.
  - Therefore we have no evidence that the coin is unfair
- A statistical test can be used to get a precise value for how likely it is for the result to have occurred by chance

Z-test for this experiment
- Sample mean = (46 * 1 + 54 * 0) / 100 = 0.46
- Sample standard deviation = 0.501
- Standard error ∆x = 0.05
- $z = \frac{0.46 - 0.5}{0.05} = -0.8$
- p-value = 0.42
- “If the coin is fair, then we are likely to see such a sample mean by chance 42% of the time. Hence we cannot reject the null hypothesis”

Common use of a Z-test
- Check if a sample mean is close to the population’s mean
  - The population’s mean value and its variance are known

Summary
- Statistics helps in data analysis for research
- Descriptive statistics: describe data’s characteristics. Eg. measures of central tendency and variability
- Inferential statistics: draw conclusions based on a sample from a population. Eg. Is this sample statistic very different from the general population?

References and acknowledgements
1. [Book] Probability, Jim Pitman, 2006
2. [Book] Research Methods for Science, Michael P. Marder, 2011
3. [Book] Probability and Statistics for Engineers and Scientists, 2007
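
Worked sketch: the Z-test for the coin experiment in this lecture can be reproduced in a few lines of Python. The helper name `z_test` is hypothetical and introduced only for illustration. The sketch assumes a two-sided test and uses the null-hypothesis standard deviation σ = 0.5 for the standard error, whereas the slide uses the sample standard deviation 0.501 (both give ∆x ≈ 0.05).

```python
import math

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z, two-sided p-value) for a sample mean under the null hypothesis.

    Hypothetical helper for illustration; assumes the sampling distribution
    of the mean is approximately normal (Central Limit Theorem).
    """
    standard_error = pop_sd / math.sqrt(n)          # Delta x = sigma / sqrt(N)
    z = (sample_mean - pop_mean) / standard_error   # equation (3)
    # Two-sided p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Coin experiment from the slides: heads = 1, tails = 0.
heads, flips = 46, 100
sample_mean = heads / flips

z, p = z_test(sample_mean, pop_mean=0.5, pop_sd=0.5, n=flips)
print("z =", round(z, 2), "p-value =", round(p, 2))

# Decision at the 5% level.
reject_null = p < 0.05
print("reject null hypothesis:", reject_null)
```

With these inputs z is about -0.8 and the p-value is about 0.42, so the null hypothesis is not rejected at the 5% level, matching the slide.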