Ex St 801 Statistical Methods
Introduction

Basic Definitions
STATISTICS: The area of science concerned with extracting information from numerical data and using it to make inferences about a population from data obtained from a sample.
POPULATION: The set representing all measurements of interest to the investigator.
PARAMETER: An unknown population characteristic of interest to the investigator.
SAMPLE: A subset of measurements selected from the population of interest.
STATISTIC: A sample characteristic of interest to the investigator.

Some Frequently Used Statistics and Parameters
                      SAMPLE           POPULATION
MEAN                  $\bar{y}$        $\mu$
VARIANCE              $s^2$            $\sigma^2$
STANDARD DEVIATION    $s$              $\sigma$
PROPORTION            $\hat{\pi}$      $\pi$

Basic Definitions (cont.)
STATISTICAL INFERENCE: Making an "INFORMED GUESS" about a parameter based on a statistic. (This is the main objective of statistics.)

STATISTICAL INFERENCE
[Diagram: data are gathered from the POPULATION (parameters $\mu$, $\sigma^2$, $\sigma$, $\pi$, etc.) to form the SAMPLE (sample statistics $\bar{y}$, $s^2$, $s$, $\hat{\pi}$, etc.); inferences are then made from the sample back to the population.]

More Basic Definitions
• A VARIABLE is a characteristic of an individual or object that may vary for different observations.
• A QUANTITATIVE VARIABLE takes values measured on a numerical scale.
• A QUALITATIVE VARIABLE takes values that fall into categories.

RAISIN BRAN EXAMPLE
• A cereal company claims that the average amount of raisins in its boxes of raisin bran is two scoops.
• A random sample of five boxes was taken off the production line, and an analysis revealed an average of 1.9 scoops per box.

Components of the Problem
• Identify the population
• Identify the sample
• Identify the symbol for the parameter
• Identify the symbol for the statistic
• Is the variable quantitative or qualitative?

ASPIRIN AND HEART ATTACKS
• Twenty thousand doctors participated in a study to determine whether taking an aspirin every other day would reduce heart attacks.
• The physicians were randomly divided into two groups.
The first group (called the treatment group) received an aspirin every other day, while the other group (called the control group) received a placebo.
• At the end of the study, there had been 104 heart attacks in the treatment group and 189 heart attacks in the control group.

Identifying Components of the Problem
• Identify the population
• Identify the sample
• Identify the symbol for the parameter
• Identify the symbol for the statistic
• Is the variable quantitative or qualitative?

Five Steps in a Statistical Study
1. Stating the problem
2. Gathering the data
3. Summarizing the data
4. Analyzing the data
5. Reporting the results

Stating the Problem
• Specifically identifying the population to be sampled
• Identifying the parameter(s) being studied

Stating the Problem Example
• A researcher wanted to determine whether a vitamin supplement would reduce the rate of certain cancers.
• A large study was conducted in China, and the results indicated that people who took the vitamin supplement had a significantly lower cancer rate.
• Do the results of this study apply to Americans? Why or why not?

Gathering the Data
• SURVEYS
  – Random Sampling
  – Stratified Sampling
  – Cluster Sampling
  – Systematic Sampling
• EXPERIMENTS
  – Completely Randomized Design
  – Randomized Block Design
  – Factorial Design

More Definitions
DESCRIPTIVE STATISTICS: Organizing and describing sample information. (Descriptive statistics describe how things are.)
Graphical Displays for Qualitative Data
• PIE CHART
• BAR CHART

[Pie chart: Major Volcanoes in the World. Categories: Africa, Antarctica, Asia, Europe, North America, South America. Slice labels as extracted: 13%, 30%, 8%, 3%, 11%, 35%.]

[Bar chart: Major Volcanoes in the World. One horizontal bar per continent (Africa, Antarctica, Asia, Europe, North America, South America); count axis from 0 to 50.]

Graphical Displays for Quantitative Data
• HISTOGRAM
• STEM AND LEAF DISPLAY

[Histogram of Major Volcanoes in the World. Horizontal axis: Elevation, 2500 to 20000; vertical axis: Frequency, 0 to 30.]

Life Expectancies in 33 Developed Nations
Country           Life Expectancy     Country           Life Expectancy
Australia         76.3                Italy             75.5
Austria           75.1                Japan             79.1
Belgium           74.3                Luxembourg        74.1
Britain           75.3                Malta             74.8
Bulgaria          71.5                The Netherlands   76.5
Canada            76.5                New Zealand       74.2
Czechoslovakia    71.0                Norway            76.3
Denmark           74.9                Poland            71.0
East Germany      73.2                Portugal          74.1
West Germany      75.8                Rumania           69.9
Finland           74.8                Soviet Union      69.8
France            75.9                Spain             76.6
Greece            76.5                Sweden            77.1
Hungary           69.7                Switzerland       77.6
Iceland           77.4                United States     75.0
Ireland           73.5                Yugoslavia        71.0
Israel            75.2

[Histogram of Life Expectancies in 33 Developed Nations. Horizontal axis: Life Expectancy, 71.20 to 79.20; vertical axis: Frequency, 0 to 10.]

Stem-Leaf Display for Elevation
STEM | LEAF
   0 | 001111
   0 | 222333
   0 | 444444444455555555
   0 | 6666667777777
   0 | 8888888999999999999
   1 | 0000000000000111111
   1 | 22222222333333
   1 | 44555
   1 | 67777
   1 | 8889999
KEY: UNIT = 1000; 1|2 REPRESENTS 12000

Construction of a Stem-Leaf Display
• List the stem values, in order, in a vertical column.
• Draw a vertical line to the right of the stem values.
• For each observation, record the leaf portion of the observation in the row corresponding to the appropriate stem.
• Reorder the leaves from lowest to highest within each stem row.

Construction of a Stem-Leaf Display (cont.)
• If the number of leaves appearing in each stem is too large, divide the stems into two groups, the first corresponding to leaves 0 through 4, and the second corresponding to leaves 5 through 9.
(This subdivision can be increased to five groups if necessary.)
• Provide a key to your stem and leaf coding, so the reader can reconstruct the actual measurements.

Numerical Measures for Summarizing Data
TYPES:
1. Measures of CENTRAL TENDENCY
2. Measures of VARIABILITY
3. Measures of RELATIVE LOCATION

The Arithmetic Mean
The ARITHMETIC MEAN of a set of n measurements (y1, y2, ..., yn) is equal to the sum of the measurements divided by n. The mathematical notation for the ARITHMETIC MEAN is:

$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

The Median
The MEDIAN of a set of n measurements (y1, y2, ..., yn) is the value that falls in the middle position when the measurements are ordered from the smallest to the largest.

RULE FOR CALCULATING THE MEDIAN
1. Order the measurements from the smallest to the largest.
2. A) If the sample size is odd, the median is the middle measurement.
   B) If the sample size is even, the median is the average of the two middle measurements.

Example
A random sample of six values was taken from a population. These values were: y1=7, y2=1, y3=10, y4=8, y5=4, and y6=12. What are the sample mean and sample median for these data?

Sample Mean
$\bar{y} = \frac{y_1 + y_2 + y_3 + y_4 + y_5 + y_6}{n} = \frac{7 + 1 + 10 + 8 + 4 + 12}{6} = \frac{42}{6} = 7$

CALCULATIONS FOR THE SAMPLE MEDIAN
(Ordered Sample) y2=1, y5=4, y1=7, y4=8, y3=10, y6=12
MEDIAN = (7 + 8) / 2 = 7.5

Consider the following sample:
 4  18  36  39  41  42  43  44  44  45
46  47  48  49  49  50  51  53  54  60
Which measure of central tendency best describes the central location of the data: THE SAMPLE MEAN OR THE SAMPLE MEDIAN?

STEM | LEAF
   0 | 4
   0 |
   1 |
   1 | 8
   2 |
   2 |
   3 |
   3 | 69
   4 | 12344
   4 | 567899
   5 | 0134
   5 |
   6 | 0

MEASUREMENTS OF VARIABILITY
• RANGE
• VARIANCE
• STANDARD DEVIATION

Deviation
The DEVIATION of an observation $y_i$ from the sample mean is equal to:

$(y_i - \bar{y})$

Deviations to the left of the sample mean are negative and deviations to the right of the sample mean are positive. Also, notice that the larger the squared deviation, the further away the observation is from the mean.
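The mean and median rules above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `sample_mean` and `sample_median` are illustrative, and the data are the six-value example and the twenty-value sample from this section.

```python
def sample_mean(values):
    """Sum of the measurements divided by n."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle value of the ordered sample.

    For an even sample size, the average of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Six-value example: y1..y6
y = [7, 1, 10, 8, 4, 12]
print(sample_mean(y))    # 7.0
print(sample_median(y))  # 7.5

# Twenty-value sample with one unusually small observation (4)
sample = [4, 18, 36, 39, 41, 42, 43, 44, 44, 45,
          46, 47, 48, 49, 49, 50, 51, 53, 54, 60]
print(sample_mean(sample), sample_median(sample))
```

In the second sample the mean comes out below the median, which illustrates how a few small observations pull the mean toward them while the median stays near the bulk of the data.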
Formula for the Sample Variance

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}{n - 1}$

Obs.     y     (y - \bar{y})     (y - \bar{y})^2
1        7      0                 0
2        1     -6                36
3       10      3                 9
4        8      1                 1
5        4     -3                 9
6       12      5                25
Total   42                       80
($\bar{y} = 7$)

Obs.     y     y^2
1        7      49
2        1       1
3       10     100
4        8      64
5        4      16
6       12     144
Total   42     374

Calculation of Sample Variance

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{80}{5} = 16$

$s^2 = \frac{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}{n - 1} = \frac{374 - \frac{42^2}{6}}{5} = 16$

THE EMPIRICAL RULE
Given a large set of measurements possessing a mound-shaped histogram, then
• the interval $\bar{y} \pm s$ contains approximately 68% of the measurements.
• the interval $\bar{y} \pm 2s$ contains approximately 95% of the measurements.
• the interval $\bar{y} \pm 3s$ contains approximately 99.7% of the measurements.

[Figure: Percent of Observations Included between Certain Values of the Standard Deviation. Mound-shaped curve with horizontal axis from -4s to 4s; 68% within 1s, 95% within 2s, and 99.7% within 3s of the mean.]

Major Volcanoes in the World: Empirical Rule
Interval           Percentage of Observations Expected     Actual Percentage of Observations
                   to Fall within the Interval             Found within the Interval
4912 to 14058      68%                                     66.6%
339 to 18630       95%                                     95.7%
-4232 to 23202     99.7%                                   100%

TWO MEASURES OF RELATIVE STANDING
• Percentile
• Quartile

The pth PERCENTILE is the value $X_p$ such that p% of the measurements fall below that value and (100-p)% of the measurements fall above that value.

[Figure: distribution split at $X_p$, with p% of the measurements below and (100-p)% above.]

QUARTILES divide the measurements into four parts such that 25% of the measurements are contained in each part. The first quartile (Lower Quartile) is denoted by Q1, the second by Q2, and the third (Upper Quartile) by Q3.
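The two forms of the sample variance formula in this section can be compared numerically. The following is a minimal Python sketch under the assumption that the data are the six example values (7, 1, 10, 8, 4, 12); the function names are illustrative and do not come from the slides.

```python
import math


def sample_variance_definition(values):
    """s^2 as the sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def sample_variance_shortcut(values):
    """s^2 via the computational form: (sum y^2 - (sum y)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)                    # sum of y   (42 for the example)
    total_sq = sum(v * v for v in values)  # sum of y^2 (374 for the example)
    return (total_sq - total ** 2 / n) / (n - 1)


y = [7, 1, 10, 8, 4, 12]
s2_definition = sample_variance_definition(y)
s2_shortcut = sample_variance_shortcut(y)
s = math.sqrt(s2_definition)  # sample standard deviation

print(s2_definition, s2_shortcut)  # 16.0 16.0
print(s)                           # 4.0
```

Both forms agree on this example. The shortcut form needs only the running totals of y and y^2, which is why it is convenient for hand calculation, although with floating-point data of large magnitude the definition form is usually the numerically safer choice.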
[Figure: measurements divided at Q1, Q2, and Q3 into four parts, each containing 25% of the measurements.]

Box and Whisker Plot
[Box-and-whisker plot: Life Expectancies in 33 Developed Nations; vertical axis from 68 to 80.]

Calculating Fence Values
Lower Inner Fence: Q1 - 1.5 (IQR)
Upper Inner Fence: Q3 + 1.5 (IQR)
Lower Outer Fence: Q1 - 3 (IQR)
Upper Outer Fence: Q3 + 3 (IQR)

EXAMPLE: Construct a Box-and-Whisker Plot for the elevations of volcanoes in Africa
Elevations: 1,650  5,981  7,745  9,281  10,023  11,400  12,198  13,451  19,340

Median =
Q1 =
Q2 =
IQR =
Lower Inner Fence =
Upper Inner Fence =
Lower Outer Fence =
Upper Outer Fence =

[Box-and-whisker plot: Major Volcanoes in Africa; vertical axis from 0 to 20000.]
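The fence formulas can be sketched in Python for the African volcano elevations. This is an illustrative sketch only. It assumes the nine elevations listed in the example are the complete data set, and it uses `statistics.quantiles` with its default "exclusive" method. Quartile conventions differ between textbooks and software, so Q1 and Q3 computed by another rule may differ slightly from the values here.

```python
import statistics

# Elevations of major volcanoes in Africa, as listed in the example
elevations = [1650, 5981, 7745, 9281, 10023, 11400, 12198, 13451, 19340]

# Quartiles under the default "exclusive" method (an assumed convention;
# the course may define quartiles differently)
q1, q2, q3 = statistics.quantiles(elevations, n=4)
iqr = q3 - q1  # interquartile range

fences = {
    "lower_inner": q1 - 1.5 * iqr,
    "upper_inner": q3 + 1.5 * iqr,
    "lower_outer": q1 - 3 * iqr,
    "upper_outer": q3 + 3 * iqr,
}

# Observations beyond the inner fences are flagged as potential outliers
outliers = [e for e in elevations
            if e < fences["lower_inner"] or e > fences["upper_inner"]]

print(q1, q2, q3, iqr)
for name, value in fences.items():
    print(name, value)
print(outliers)
```

Under this quartile convention no elevation falls outside the inner fences, so the whiskers of the box plot would extend to the smallest and largest observations.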