Topic (4) SUMMARIZING DATA – CENTER OR CENTRAL TENDENCY

I) QUANTITATIVE DATA

a) Median (50th percentile)

Defn: The MEDIAN of a data set is the middle value. That is, when the data values are arranged from low to high, it is the value in the list such that half of the data points are smaller and the other half are larger.

For an even number of observations:
The fish weights for the Tennessee River study (n = 12) are:
986, 1023, 1266, 1398, 917, 1763, 1459, 778, 532, 441, 544, 897

1) First order them from low to high.
2) median = average of the 2 middle values.

441, 532, 544, 778, 897, 917, | 986, 1023, 1266, 1398, 1459, 1763

median = m = (917 + 986) / 2 = 951.5

6 of the observed values fall below m and the other 6 are larger than m.

For an odd number of observations:
Thirteen fish weights are:
986, 1023, 1266, 1398, 917, 1763, 1459, 778, 532, 441, 544, 897, 1129

1) First order them from low to high.
2) median = the middle value.

441, 532, 544, 778, 897, 917, 986, 1023, 1129, 1266, 1398, 1459, 1763

m = 986 (the 7th of the 13 ordered values)

Important Point #1: The median is said to be robust because it is resistant to outliers: changing an extreme value does not move the middle of the ordered list.

Important Point #2: The sample median divides the total area under the bars in a histogram in half.

Important Point #3: Populations also have medians, called the population median (M). This number divides the area under the curve describing the population frequency distribution in half.

b) Arithmetic Mean

Defn: The MEAN of a data set is the average value. That is, it is the value obtained by adding all of the numbers together and dividing the result by the number of values in the sum (see symbols later).

The SAMPLE MEAN is denoted x̄ (pronounced “x-bar”).
The POPULATION MEAN is denoted µ (pronounced “mu”).
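As a rough check of the two definitions given so far (the median and the arithmetic mean), the following Python sketch applies them to the fish weights listed in the median section. The function and variable names are illustrative choices, and the mis-recorded weight of 17630 is a hypothetical outlier that does not appear in the study; it is included only to show how an extreme value affects each measure.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


# Fish weights from the Tennessee River study, as listed in the notes
weights_12 = [986, 1023, 1266, 1398, 917, 1763, 1459, 778, 532, 441, 544, 897]
weights_13 = weights_12 + [1129]

median_even = median(weights_12)  # average of the 6th and 7th ordered values
median_odd = median(weights_13)   # the 7th ordered value
mean_12 = mean(weights_12)

# Hypothetical outlier (an assumption, not study data): 1763 mis-recorded as 17630
with_outlier = [17630 if w == 1763 else w for w in weights_12]
median_outlier = median(with_outlier)
mean_outlier = mean(with_outlier)

print("median, n = 12:", median_even)
print("median, n = 13:", median_odd)
print("mean,   n = 12:", round(mean_12, 2))
print("median with outlier:", median_outlier)
print("mean with outlier:  ", round(mean_outlier, 2))
```

In this sketch the median is the same with and without the hypothetical outlier, while the mean more than doubles, which is one way to see what "robust" means in practice.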
EXAMPLE

The fish lengths for the Tennessee River study are:
48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44

[Dot plot of the 12 fish lengths; horizontal axis: Length (cm), marked from 25 to 50]

If each point has the same weight, where should the pivot point be to balance the x-axis (i.e. keep it horizontal)?
Ans: the pivot point is the arithmetic mean.

To calculate the sample mean for these data, sum the data values and divide the result by n:

x̄ = (48 + 45 + 49 + 51 + 44 + 49 + 46 + 28.5 + 26 + 25.5 + 25 + 44) / 12 = 481 / 12 ≈ 40.08

We say that the fish caught in the study averaged 40.08 cm in length.

Important Point #1: If one were able to observe the value of every single element in a population (say, every single fish in the Tennessee River in 1978), then it would be possible to calculate the population mean µ. Since we can’t do that, we use the sample mean x̄ as an estimate of the population mean µ.

Important Point #2: Is the mean robust? Ans: NO! Its value depends directly on every value in the dataset, so a single extreme value can shift it.

Important Point #3: The mean of a set of data is the fulcrum or balance point for the data.

NOTATION:
X    denotes the NAME of the variable, e.g. LENGTH
x    denotes a value for the named variable, e.g. 48 cm
i    a subscript which denotes the index number for the observation, e.g. fish IDs run from 1 to 12
x_i  denotes the value for the ith observation (that is, the ith observed value), e.g. x_1 = 48, x_2 = 45, etc.
Σ    denotes the operation “SUM”

So, we can write

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

For frequency distributions, the relationship of the mean to the median depends on the shape of the distribution:

Skewed to the right:     mean > median
Skewed to the left:      mean < median
Symmetric and unimodal:  mean = median
Uniform:                 mean = median
Bimodal:                 mean = median if the distribution is symmetric; otherwise either one can be larger

Question: So, which measure of center do you use when?

II) CATEGORICAL DATA – BINARY DATA

In general, the summary statistics for categorical data are the relative frequencies of each category in the dataset. There is no such idea as a mean or average category, only the most common one or the least common one or some other appellation.

For the special case of binary data (only two categories), the measure of central tendency is the “Proportion of Successes”.

Defn: When there are only two possible outcomes, define one category to be the “Success” (it’s the category you are studying). The PROPORTION OF SUCCESSES, then, is the fraction of observations that are successes. When the dataset is a sample, the SAMPLE PROPORTION is denoted p; when the dataset is the entire population, the POPULATION PROPORTION is denoted π.

So we can write

$$p = \frac{\#\,\text{successes in sample}}{\text{sample size}} = \frac{\#\,\text{successes}}{n}$$

$$\pi = \frac{\#\,\text{successes in population}}{\text{population size}} = \frac{\#\,\text{successes}}{N}$$

Example: Suppose a researcher is interested in the recovery of submerged aquatic vegetation (SAV) in the Chesapeake Bay. At each of 30 locations at which SAV was historically found, the scientist categorizes the spot as having either medium to high amounts of SAV or low to no SAV. 11 locations had medium to high SAV levels.

2 categories:
Success = “medium or high SAV level”
Failure = “none or low SAV level”

The proportion of sample locations with medium to high SAV levels is p = 11/30 ≈ 0.3667.

Example: Suppose the rate of the birth defect, spina bifida, is 1 baby in every 100,000 live births.
Success = “has spina bifida”
π = 1/100,000 = 0.00001 = 1 × 10⁻⁵
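The proportion of successes is simply a count divided by a total, so the two examples above can be reproduced in a few lines of Python. This is a minimal sketch; the function name `proportion` and the variable names are assumptions made for illustration and do not come from the notes.

```python
def proportion(successes, total):
    """Fraction of observations that are successes."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return successes / total


# SAV example: 11 of the 30 sampled locations had medium to high SAV levels
p_sav = proportion(11, 30)

# Spina bifida example: 1 baby in every 100,000 live births
pi_spina_bifida = proportion(1, 100_000)

print(f"sample proportion p = {p_sav:.4f}")                 # 0.3667
print(f"population proportion pi = {pi_spina_bifida:.0e}")  # 1e-05
```

The same function serves for both p and π; the only difference is whether the counts come from a sample (n observations) or from the entire population (N observations).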