Welcome to the Graduate Workshop in Statistics
Instructor: Kam Hamidieh
Monday July 11, 2005

Today’s Agenda
• Workshop Introductions & Website Tour
• Plan Ahead
• Today: All About Descriptive Statistics
• Brief Demo of SPSS & R (if time permits)

Before we start…
When you see Bert on a slide, I will either go over the slide quickly or skip it entirely. However, you will need to read it on your own, since the subsequent sessions will depend on it.
When you see Sir Isaac Newton, the slide is of a technical/mathematical nature. Read it if you wish. I will not use the material in the subsequent sessions.

Workshop Plan
• July 11: Descriptive Statistics - making graphical and numerical summaries of data
• July 18: Language of research studies in statistics & a crash course in Probability and Random Variables
• July 25: Hypothesis Testing, lots of t-tests, & confidence intervals
• August 1: Categorical data & chi-squared tests
• August 8: Linear Regression
• August 15: ANOVA and catch up

What is Statistics?
• Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions.
• Want to know more about what statistics is and how its meaning has evolved? See Vic Barnett’s Comparative Statistical Inference.

Biostatistics
• Statistics applied to biological (life) problems, including:
  – Public health
  – Medicine
  – Ecological and environmental problems
• Much more statistics than biology; however, biostatisticians must learn the biology also.

Some Additional Terms
• Bioinformatics - computerized and statistical analysis used to extract information from biological data, particularly in studying the nucleotide sequences of DNA.
• Microarray Data – data with lots of variables and only a few observations (more variables than cases). Mostly biological data.

Some Other Applications
• Finance: statistical models used for analysis of stocks, bonds, and currencies to control risk or make money
• Economics: statistical models used to forecast economic trends
• Clinical trials: testing effectiveness of drugs
• Information technology: network traffic analysis, pattern recognition, separation of noise from data
• Business: fraud detection
• Government: analysis of current economic situation, forecasting, opinion polling

Further Reading for Pleasure!
• For a detailed history of statistics see Stephen M. Stigler’s History of Statistical Concepts and Methods.
• For new approaches to statistics see Breiman’s Statistical Modeling: The Two Cultures.

The Big Picture in Statistics
Use a small group of units to draw conclusions (inference) about a larger group.
[Figure: a Sample drawn from a Population (Characteristics Unknown)]

Populations and Parameters
• Population – a group of individuals (or things) that we would like to know something about
• Parameter – a characteristic of the population in which we have a particular interest
  – Often denoted with Greek letters (µ, σ)
  – Examples:
    • The proportion of the population that would respond to a certain drug
    • The population average height of males in Michigan

Samples and Statistics
• Sample – a subset of a population (hopefully representative and random)
• Statistic – a characteristic of the sample (any function of the sample data)
  – Examples:
    • The observed proportion of the sample that responds to treatment
    • The observed average height of a sample of males in Michigan

Example
• A sample of 1000 women between the ages of 30 and 39 is randomly chosen across the US for a marketing study. The results are: 825 women prefer product A over product B (and 175 prefer B over A).
• Population? All women between the ages of 30 and 39 living in the US.
• Sample? The 1000 women, ages 30-39, sampled in the survey.
• Parameter? Population proportion of women 30-39 in the US preferring A over B – this is unknown.
• Statistic? Sample proportion of women 30-39 in the random sample of n = 1000 who preferred A over B. Here the value of this statistic is 825/1000, or 82.5%.

Populations and Samples
• Studying entire populations is too expensive and time-consuming, and thus impractical.
• If a sample is representative of the population, then by observing the sample we can learn something about the population.
  – Thus, by looking at the characteristics of the sample (statistics), we may learn something about the characteristics of the population (parameters).

Issues
• Samples are random.
  – If we had chosen a different sample, we would have obtained different values for the statistics (although we are trying to estimate the same, unchanged population parameters).
• Samples should represent the population.

Explanatory and Response Variables
• Many questions in statistics are about the relationship between two or more variables.
• It is useful to identify one variable as the explanatory variable and the other as the response variable.
• In general, the value of the explanatory variable for an individual is thought to partially explain or account for the value of the response variable.
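
The point made under Issues, that a different random sample gives a different value of the statistic while the parameter stays fixed, can be illustrated with a small simulation. This is a minimal sketch and not part of the original workshop material: it assumes a hypothetical population in which the true proportion preferring product A is 0.80 (in a real study this parameter is unknown), and the helper name `sample_proportion` and the seed are illustrative choices.

```python
import random


def sample_proportion(rng, true_p, n):
    """Draw n individuals and return the sample proportion preferring A."""
    prefers_a = sum(1 for _ in range(n) if rng.random() < true_p)
    return prefers_a / n


# Assumption for illustration only: the population parameter is 0.80.
TRUE_P = 0.80
N = 1000  # same sample size as the marketing example

rng = random.Random(2005)  # fixed seed so the sketch is repeatable
proportions = [sample_proportion(rng, TRUE_P, N) for _ in range(5)]

# Each repeated sample yields its own value of the statistic,
# while the parameter TRUE_P never changes.
for i, p_hat in enumerate(proportions, start=1):
    print(f"sample {i}: p-hat = {p_hat:.3f}")
```

Running it prints five sample proportions that scatter around 0.80, which is the sense in which a statistic is random but the parameter is not.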

Explanatory and Response Variable
• Other names:
  – Explanatory: independent, factor, treatment, input, x
  – Response: dependent, y, output

Statistical Analyses
• Descriptive Statistics
  – Describe the sample: use numerical and graphical summaries to characterize a data set
• Inference
  – Make inferences about the population
  – Primarily performed in two ways:
    • Hypothesis testing
    • Estimation
      – Point estimation
      – Interval estimation

Descriptive Statistics - Data
• Data: pieces of information
• Types of Data
• Categorical Data:
  – Nominal – unordered categories
  – Ordinal – ordered categories
• Quantitative Data:
  – Discrete – only whole numbers are possible; order and magnitude matter
  – Continuous – any value is conceivable

Summary of Data Types
Types of Data
• Categorical: Nominal, Ordinal
• Quantitative: Discrete, Continuous

Examples of Data Types
• Age (years): quantitative, continuous
• Car Manufacturer (GM, Ford, etc.): categorical, nominal
• Starting Salary in Dollars: quantitative, continuous
• Starting Salary (Low, Med., High): categorical, ordinal
• Calcium Level (micrograms per liter): quantitative, continuous
• Current Smoker (yes or no): categorical, nominal
• Number on the roll of a die: quantitative, discrete

Data
• The vast majority of errors in research arise from poor planning (e.g., of data collection)
• Fancy statistical methods cannot rescue garbage data
• Collect exact values whenever possible

On Descriptive Statistics
• It is ALWAYS a good idea to summarize your data
  – You become familiar with the data and the characteristics of the people/things that you are studying
  – You can also identify problems or errors with the data
• This is the first step in any statistical analysis

Dataset Structure
• Think of data as a rectangular matrix of rows and columns.
• Rows represent the “experimental unit” (e.g., person)
• Columns represent variables measured on the experimental unit

Example Data Set
• Data are for 11 variables and n = 1,606 respondents in the 1993 General Social Survey, a national survey done by the National Opinion Research Center at the University of Chicago. Some questions are only asked of about two-thirds of the survey participants, so there is quite a bit of missing data. (Source: SDA archive at UC Berkeley website, http://csa.berkeley.edu:7502)
• I will be using a smaller version of it with only n = 500.

Example Data Set
Column  Name      Description
C1      sex       Sex of respondent
C2      race      Race of respondent (White, African American, Other)
C3      degree    Highest educational degree received (Five categories)
C4      relig     Religious preference (Catholic, Protestant, Jewish, Other)
C5      polparty  Does respondent think of self as Democrat, Republican, Indep. or Other?
C6      cappun    Does the respondent favor or oppose the death penalty?
C7      tvhours   Hours of watching television on a typical day
C8      marijuan  Whether the respondent thinks marijuana should be legalized or not
C9      owngun    Whether respondent owns a gun or not (Yes or No)
C10     gunlaw    Does respondent favor or oppose a law requiring a permit to buy a gun?
C11     age       Age of the respondent

Example Data Set (DataSet_1)

Summarizing Categorical Data
• Numerical Summaries
  – Frequency/Count tables
• Visual Summaries
  – Pie Charts – good for summarizing a single categorical variable
  – Bar Charts – good for summarizing one or two categorical variables and useful for making comparisons when there are two categorical variables

Numerical Summary of Categorical Data
• Count how many fall into each category
• Calculate the percent in each category
• If there are two variables, have the categories of the explanatory variable define the rows and compute the row percentages

Example Numerical Summary of the Sex Variable
sex
                Frequency  Percent  Valid Percent  Cumulative Percent
Valid  Female         283     56.6           56.6                56.6
       Male           217     43.4           43.4               100.0
       Total          500    100.0          100.0

Example Bar Chart of the Sex Variable
[Bar chart: Percent by sex; Female 56.6%, Male 43.4%]

Example Numerical Summary of political party affiliation
polparty
                    Frequency  Percent  Valid Percent  Cumulative Percent
Valid    Democrat         157     31.4           31.7                31.7
         Indpndnt         178     35.6           35.9                67.5
         Other              5      1.0            1.0                68.5
         Republcn         156     31.2           31.5               100.0
         Total            496     99.2          100.0
Missing                     4      0.8
Total                     500    100.0

Question: What percentage of people in the US identify themselves as Democrat/Independent/Republican/other?
At least we have some descriptive information from the table above: the largest group in the sample identify themselves as Independents, while the percentages of Democrats and Republicans are very close.

Example Visual Summary of the political party affiliation
[Bar chart: Percent by polparty; Democrat 31.7%, Indpndnt 35.9%, Other 1.0%, Republcn 31.5%]

Example Numerical Summary of the Sex Variable vs. Political Party Affiliation
sex * polparty Crosstabulation
                        polparty
sex                     Democrat  Indpndnt  Other  Republcn   Total
Female  Count                 94        95      2        89     280
        % within sex       33.6%     33.9%   0.7%     31.8%  100.0%
Male    Count                 63        83      3        67     216
        % within sex       29.2%     38.4%   1.4%     31.0%  100.0%
Total   Count                157       178      5       156     496
        % within sex       31.7%     35.9%   1.0%     31.5%  100.0%

Question: Is there a difference in party affiliation (in %) between the men and the women?
Again some descriptive information is available. There does not seem to be a big difference.

Example Numerical Summary of the political party affiliation vs. own a gun
polparty * owngun Crosstabulation
                                 owngun
polparty                         No     Yes     Total
Democrat  Count                  72      37       109
          % within polparty   66.1%   33.9%    100.0%
Indpndnt  Count                  61      44       105
          % within polparty   58.1%   41.9%    100.0%
Other     Count                   3       1         4
          % within polparty   75.0%   25.0%    100.0%
Republcn  Count                  40      62       102
          % within polparty   39.2%   60.8%    100.0%
Total     Count                 176     144       320
          % within polparty   55.0%   45.0%    100.0%

Example
Look at these! (callout on the crosstabulation above)
Question: Is there a relationship between gun ownership and party affiliation?
Descriptively, there seems to be a relationship. Most Republicans in the sample (60.8%) are gun owners.

Questions to Ask – 1 Categorical Variable
Question: How many and what percentage of individuals fall into each category?
Example: What percentage of college students favor legalization of marijuana?
Question: Are individuals equally divided across categories, or do the percentages across categories follow some other interesting pattern?
Example: When individuals are asked to choose a number from 1 to 10, are all numbers equally likely to be chosen?

Questions to Ask – 2 Categorical Variables
Question: Is there a relationship between the two categorical variables, so that the category into which individuals fall for one variable seems to depend on which category they are in for the other variable?
Example: Is there a relationship between gun ownership and party affiliation?
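
The row percentages behind this gun-ownership question can be recomputed directly from the counts in the polparty * owngun crosstabulation. This is a minimal sketch in Python rather than SPSS, which produced the original table; the dictionary layout and the helper name `row_percentages` are illustrative assumptions, while the counts are copied from the crosstabulation.

```python
# Counts (No, Yes) for owngun within each polparty category,
# copied from the polparty * owngun crosstabulation.
counts = {
    "Democrat": (72, 37),
    "Indpndnt": (61, 44),
    "Other": (3, 1),
    "Republcn": (40, 62),
}


def row_percentages(no, yes):
    """Return (% No, % Yes) within one row, rounded to one decimal."""
    total = no + yes
    return round(100 * no / total, 1), round(100 * yes / total, 1)


# The explanatory variable (polparty) defines the rows,
# so percentages are computed within each row.
for party, (no, yes) in counts.items():
    pct_no, pct_yes = row_percentages(no, yes)
    print(f"{party:<9} n={no + yes:<4} No={pct_no}%  Yes={pct_yes}%")

# Overall row, pooling all parties.
total_no = sum(no for no, _ in counts.values())
total_yes = sum(yes for _, yes in counts.values())
print("Total    ", row_percentages(total_no, total_yes))
```

Comparing the rows against the pooled Total row is what suggests, descriptively, that gun ownership differs by party in this sample.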
Another Example: The relationship between smoking and lung cancer was detected in part because someone noticed that the combination of being a smoker and having lung cancer is unusual.

Descriptive Statistics – Quantitative Data
• We will use a new data set from http://www.infoplease.com/ipa/A0194030.html on the age of presidents at inauguration

Interesting Features of Quantitative Variables
• Quick glance at the data values (Bloody Eyeball Test!)
• Location: where most values lie, or the value that represents the data best, e.g. mean or median
• Spread: variability in data
• Shape: a bit later…
• Five number summary: find the extremes (high, low), the median, and the quartiles (medians of the lower and upper halves of the values).

Location of a Data Set: Mean, Median, and Mode
• Mean: the numerical average; sum the data and then divide by the number of data points
• Formula: $\bar{x} = \frac{\sum x_i}{n}$
• Median: the middle value (if n is odd) or the average of the middle two values (if n is even) once the data have been ordered. About 50% of the data are above the median and about 50% are below it.
• Mode: the measurement that occurs most often.

A Word about Notation
Notation for Data:
n = number of individuals in a data set
x1, x2, x3, …, xn represent individual raw data values
Example: A data set consists of the presidents’ ages at inauguration; the values are 57, 61, …, 46, and 54. Then n = 43, x1 = 57, x2 = 61, …, x42 = 46, and x43 = 54.

Example of Mean
• What is the average age of the US Presidents at inauguration?
Mean age = (57 + 61 + … + 46 + 54)/43 = 54.81 ≈ 55

Statistics: age
N            Valid      43
             Missing     0
Mean                 54.81
Median               55.00
Std. Deviation       6.235
Range                   27
Minimum                 42
Maximum                 69
Percentiles  25      51.00
             50      55.00
             75      58.00

Example of Median
• What is the median age of the US Presidents at inauguration?
Since n = 43 is odd, take the (43 + 1)/2 = 22nd value of the sorted data, which is 55. This matches the Median of 55.00 in the Statistics table above.

A Bit More About Median
Median Calculations
If n is odd: M = middle of ordered values. Count (n + 1)/2 down from the top of the ordered list.
If n is even: M = average of the middle two ordered values. Average the values that are (n/2) and (n/2) + 1 down from the top of the ordered list.
Say you have the following list of numbers: 18, 29, 33, 45, 88, 100. The median here is the average of 33 and 45, so (33 + 45)/2 = 39.

Describing Spread/Variability in Data
• Range = highest/max value – lowest/min value
• Interquartile Range (IQR) = upper quartile – lower quartile
• Standard Deviation: a bit later….

Describe the Spread - Quartiles
• Split the ordered values into the half that is below the median and the half that is above the median.
• Q1 = lower quartile = median of data values that are below the median
• Q3 = upper quartile = median of data values that are above the median
• Q2 is just the median
• IQR, Interquartile Range = Q3 - Q1
• Min, Max, Median, Q1, and Q3 are used in the creation of boxplots
Min |-- 25% --| Q1 |-- 25% --| Med |-- 25% --| Q3 |-- 25% --| Max

Example Using the Presidents Age Data
From the Statistics table for age: Range = 27 (Max - Min), Minimum = 42 (Min), Maximum = 69 (Max), 25th percentile = 51.00 (Q1), 50th percentile = 55.00 (Median), 75th percentile = 58.00 (Q3).
• About 25% of the presidents were 51 years old or younger.
• About 75% were 58 or younger.
• About 50% (the middle 50%) were between the ages of 51 and 58. IQR = 58 - 51 = 7.
• The oldest was 69 (Reagan) and the youngest 42 (T. Roosevelt). Range = 69 – 42 = 27.
• About 50% were 55 or younger, or equivalently about 50% were 55 or older.

The Spread and Shape of Data are important!
• Suppose 20 people take an exam. Possible scores go from 0 to 100. The average score is 87. Bob got an 88. How well do you think he did?
Case I: Bob is hot!
Case II: Bob is not so hot!
Just knowing the mean or the median is not enough.
We need to know something about the spread and shape of data.
Case I: 80 81 85 86 86 86 86 86 86 86 86 86 86 86 87 87 87 88 99 100
Case II: 0 3 88 95 95 95 95 96 96 96 96 96 96 97 98 98 100 100 100 100

Graphical Summaries for Quantitative Data
• Histograms: similar to bar graphs, used for any number of data values
• Stem and Leaf plots and dot plots: present all the individual values, useful for small to moderate sized data sets
• Boxplots: useful summary for comparing two or more groups
• Scatter Plots: very useful for exploring relationships between two variables

Creating a Histogram
1. The horizontal axis has your variable of interest.
2. Decide how many equally spaced intervals to use for the horizontal axis. Between 6 and 15 intervals is a good number.
3. Decide whether to use frequencies (counts) or relative frequencies (proportions) on the vertical axis. Relative frequency is usually a better choice.
4. Draw equally spaced intervals on the horizontal axis covering the entire range of data values.
5. Determine the frequency or relative frequency of data values in each interval and draw a bar with the corresponding height.
6. Decide which rule to use for values that fall on the border between two intervals.

Histogram of Presidents Age Data
[Histogram: bin size = 2.5, 12 intervals]

Various Histograms (presidents age data)

Histograms and Software Dependency
[Histogram: default histogram generated by R, bin size = 5, 6 intervals; x-axis Age (40 to 70), y-axis Frequency (0 to 15)]

Describing Shape (By using histograms)

How About the Presidents Age Data?
Seems approximately bell shaped.

Mean vs. Median
• Mean is sensitive to extreme values.
• Median is not sensitive to extreme values.
• Simple example: Say you have the following set of data (n = 7): {2, 4, 6, 8, 10, 12, 14}. The mean and the median are both 8. Now suppose you have {2, 4, 6, 8, 10, 12, 50}. The mean jumps to 13.14 but the median is still 8. Which is a better measure of location? The median in this case.

Mean vs. Median
• Note:
  – Symmetric data: mean ≈ median (e.g. presidents age data)
  – Skewed Left: mean < median
  – Skewed Right: mean > median
• If your data are approximately symmetric, it is better to use the mean
• Note that extreme values can cause skewness
• With extreme skewness, the median may be a better measure
• How about “outliers”? Hang on….

Mean vs. Median (from DataSet1)
Statistics: tvhours
N            Valid     497
             Missing     3
Mean                  2.92
Median                2.00
Minimum                  0
Maximum                 16
Percentiles  25       1.50
             50       2.00
             75       4.00
The number of hours of TV watched per day is right skewed. Here the mean of 2.92 hours is greater than the median value of 2.00 hours per day.

Outliers
• Outlier: a data point that does not seem to be consistent with the bulk of the data.
• Remarks:
  – Look for them via graphs. Boxplots are recommended.
  – Can have a big influence on conclusions.
  – Can cause complications in some statistical analyses.
  – Cannot be discarded without solid justification.

More on Outliers
• An outlier is not necessarily a bad thing! Example: credit card fraud, where very high activity is associated with a stolen card.
• May sometimes be due to errors in data entry. Example: You have height data for people and the minimum height shows up as 2 inches! Can’t be right!
• How do you detect it? I recommend graphical methods such as boxplots.

More on Outliers
It is a BAD idea to exclude outliers in an automatic manner: NASA launched the Nimbus 7 satellite to record atmospheric data. A few years later, in 1985, a few scientists observed a large decrease in ozone over the Antarctic. It was found later that the NASA data processors were automatically throwing away very small ozone readings, assuming them to be mistakes. Had this been known earlier, perhaps the CFC phase-out would have been implemented sooner!

Possible Reasons for Outliers and Reasonable Actions
• Mistake made while taking the measurement or entering it into the computer. If verified, it should be discarded/corrected.
• Individual in question belongs to a different group than the bulk of individuals measured. Values may be discarded if a summary is desired and reported for the majority group only.
• Outlier is a legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded; they provide important information about location and spread.

Boxplots – Presidents Age Data
[Boxplot of the presidents’ ages, labeled Min, Q1, Q2, Q3, and Max]
• Possible outliers are marked.
• Apart from outliers, lines extending from the box reach to the min and max.
• The box covers the middle 50% of the data.
The boxplot gives you the visual summary of:
Statistics: age
N            Valid      43
             Missing     0
Median               55.00
Minimum                 42
Maximum                 69
Percentiles  25      51.00
             50      55.00
             75      58.00

Boxplots – Presidents Age Data from R
> summary(p$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  42.00   51.00   55.00   54.81   57.50   69.00
[Boxplot: Age (axis 45 to 70) for Presidents]

Comparing Two Groups via Boxplots
Boxplots are a great graphical tool for comparing numerical summaries across different categories.

Drawing Boxplots
• Step 1: Label either a vertical axis or a horizontal axis with numbers from the min to the max of the data.
• Step 2: Draw a box with lower end at Q1 and upper end at Q3.
• Step 3: Draw a line through the box at the median.
• Step 4: Draw a line from the Q1 end of the box to the smallest data value that is no further than 1.5 × IQR from Q1. Draw a line from the Q3 end of the box to the largest data value that is no further than 1.5 × IQR from Q3.
• Step 5: Mark data points further than 1.5 × IQR from either edge of the box with an asterisk. Points represented with asterisks are considered to be outliers.

Percentiles
The kth percentile is a number that has k% of the data values at or below it and (100 – k)% of the data values at or above it.
• Lower quartile = 25th percentile
• Median = 50th percentile
• Upper quartile = 75th percentile

Scatterplots
• Scatterplots are two-dimensional plots of data (quantitative data/variables, of course).
• They can tell us something about the strength, the direction, and the nature of the relationship between two variables.
  – Direction:
    • Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
    • Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.
  – Strength: how tightly the points are clustered around some straight line or curve.
  – Nature of the relationship: linear or curved?
• More when we cover Regression

Example Scatterplot
• The data set sats98 contains the average math and verbal SAT scores in 1998 for the 50 states and the District of Columbia.
• The pcttook variable is the percent of graduating seniors who took the test that year.
• Is there a relationship between the average math and the average verbal scores?

Example Scatterplot
Nature of the relationship: There seems to be a linear relationship. See the line!
Direction: The relationship is positive. As the math scores go up, the verbal scores go up on average.
Strength: The relationship seems to be strong. The points are bunched very close to the possible underlying line.
WARNING: You can NOT conclude that high math scores cause high verbal scores!

Bell-Shaped Data
• Many data sets or measurements follow a predictable pattern:
  – Most values are clumped around a center.
  – The greater the distance a value is from the center, the fewer individuals have that value.
• Variables that follow such a pattern are said to be “bell-shaped”. A special case is called a normal distribution or normal curve.

Example: Presidents Age Data!
Our presidential age data were approximately bell shaped.

Describing Spread via Standard Deviation
• Standard deviation measures variability by summarizing how far individual data values are from the mean of the data.
• Think of the standard deviation as roughly the average distance values fall from the mean.
• It will be in the same units as our data.

Computing Standard Deviation
Formula for the (sample) standard deviation:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The value of $s^2$ is called the (sample) variance. An equivalent formula, easier to compute, is:
$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$

Computing Standard Deviation
Step 1: Calculate $\bar{x}$, the sample mean.
Step 2: For each observation, calculate the difference between the data value and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared differences in step 3, and then divide this sum by n – 1.
Step 5: Take the square root of the value in step 4.

Simple Example
Consider just four numbers: 62, 68, 74, 76
Step 1: $\bar{x} = \frac{62 + 68 + 74 + 76}{4} = \frac{280}{4} = 70$
Steps 2 and 3:
Value  Value - Mean  (Value - Mean) squared
62               -8                      64
68               -2                       4
74                4                      16
76                6                      36
Sum               0                     120
Step 4: $s^2 = \frac{120}{4 - 1} = 40$
Step 5: $s = \sqrt{40} \approx 6.3$

Population Standard Deviation
Data sets usually represent a sample from a larger population. If the data set includes measurements for an entire population, the notations for the mean and standard deviation are different, and the formula for the standard deviation is also slightly different. A population mean is represented by the symbol µ (“mu”), and the population standard deviation is
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$

Some Remarks on Estimation
• The population mean, µ, is most often an unknown parameter.
• The sample mean, $\bar{x}$, a statistic computed from the sampled data, is our estimate of the unknown population mean µ.
• The population standard deviation, σ, is most often an unknown parameter.
• The sample standard deviation, s, a statistic computed from the sampled data, is our estimate of the unknown population standard deviation σ.
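
The five steps and the four-number example above can be checked with a short script. This is a minimal sketch in Python (the workshop itself demonstrates SPSS and R); the function names `sample_sd`, `sample_sd_shortcut`, and `population_sd` are illustrative, and the standard library's `statistics.stdev` and `statistics.pstdev` are used only as a cross-check.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation following Steps 1-5 (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n                             # Step 1
    squared_diffs = [(x - mean) ** 2 for x in values]  # Steps 2 and 3
    variance = sum(squared_diffs) / (n - 1)            # Step 4
    return math.sqrt(variance)                         # Step 5


def sample_sd_shortcut(values):
    """Equivalent formula: (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt((sum(x * x for x in values) - n * mean ** 2) / (n - 1))


def population_sd(values):
    """Population standard deviation (divides by n instead of n - 1)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


data = [62, 68, 74, 76]
s = sample_sd(data)
s_shortcut = sample_sd_shortcut(data)
sigma = population_sd(data)

print(f"s (steps)    = {s:.4f}")
print(f"s (shortcut) = {s_shortcut:.4f}")
print(f"sigma        = {sigma:.4f}")

# Cross-check against the standard library implementations.
assert math.isclose(s, statistics.stdev(data))
assert math.isclose(s, s_shortcut)
assert math.isclose(sigma, statistics.pstdev(data))
```

For the same four numbers, dividing by n rather than n – 1 gives a smaller value, which is the only difference between the sample and population formulas.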

Standard Deviation and Bell Shaped Data
For bell-shaped (normal) data, approximately
• 68% of the values fall within 1 standard deviation of the mean in either direction
• 95% of the values fall within 2 standard deviations of the mean in either direction
• 99.7% of the values fall within 3 standard deviations of the mean in either direction
The above approximation is sometimes called the Empirical Rule.

Example – President Age Data
Descriptive Statistics
                     N   Minimum  Maximum   Mean  Std. Deviation
age                  43       42       69  54.81           6.235
Valid N (listwise)   43
• The (sample) standard deviation for the presidents data is 6.235 years.
• Remember that our data looked bell shaped.
• We would expect 68% of the presidents’ ages to be between 54.81 ± 6.235 years old (at inauguration), or 49 to 61 years old. The actual data show about 72%. It is somewhat close.
• We would expect 95% of the presidents’ ages to be between 54.81 ± 2(6.235) years old, or 42 to 67 years old. The actual data show about 95%.
• We would expect 99.7% of the presidents’ ages to be between 54.81 ± 3(6.235) years old, or 36 to 74 years old. The actual data show 100%.
• Interpretation of the sample standard deviation: On average, the inaugural ages of the US presidents have been roughly 6 years away from their average age of 55.

Important Remarks
• What does s = 0 mean? No variability in your data! All values are the same.
• Like the mean, the standard deviation is sensitive to extreme observations.
• Use the mean and standard deviation for reasonably symmetric, bell-shaped data.
• The five number summary (min, max, median, Q1, and Q3) is better for skewed distributions or if outliers are present.

Summary
Descriptive Tools
• Quantitative Variables: Histogram, Q-Q Plots, Time Plots, Boxplots, Scatter Plots
• Categorical Variables: Bar Charts, Pie Charts, Freq. Tables

Next Time
• Please read articles:
  – Breiman’s article on two cultures of statistics (2001)
  – Altman’s articles on
    • Poor quality medical research (2002)
    • Statistical reviewing for medical journals (1998)
    • Some recent trends in statistics in medical journals (2000)
  – Goodman et al., statistical reviewing policies of journals (1998)
• Next Time:
  – Crash Course in Probability
  – Research Studies