Review of Chapter One

This chapter presented some important basics: fundamental definitions, such as sample and population, along with some very basic principles.

Data: observations (such as measurements, genders, survey responses) that have been collected.
Statistics: the science of collecting, analyzing, and drawing conclusions from sample data.
A population is the complete collection of all elements (scores, people, measurements, and so on) to be studied.
A census is the collection of data from every member of the population.
A sample is a subcollection of members selected from a population.
A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.

Two Types of Data:
• Quantitative data: consist of numbers representing counts or measurements, e.g. the heights of students.
• Qualitative (or categorical) data: can be separated into different categories that are distinguished by some nonnumerical characteristic, e.g. the genders (male/female) of students.

Two Types of Quantitative Data:
• Discrete data: result when the number of possible values is either a finite number or a countable number.
• Continuous data: result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps.

Four Levels of Measurement of Data (see Table 1-1 on Page 10 for more details):
• Nominal: categories only; data cannot be arranged in an ordering scheme.
• Ordinal: categories are ordered, but differences cannot be found or are meaningless.
• Interval: differences are meaningful, but there is no natural starting point and ratios are meaningless.
• Ratio: there is a natural zero starting point and ratios are meaningful.

Review of Chapter Two

In this chapter we considered methods for describing, exploring, and comparing data sets.
We have the following important characteristics of data:
• Center: a representative or average value that indicates where the middle of the data set is located.
• Variation: a measure of the amount that the data values vary among themselves.
• Distribution: the nature or shape of the distribution of the data.
• Outliers: sample values that lie very far away from the vast majority of the other sample values.
• Time: changing characteristics of the data over time.

Frequency distribution: lists data values (either individually or by groups of intervals), along with their corresponding frequencies (or counts).
Relative frequency distribution: includes the same class limits as a frequency distribution, but relative frequencies are used instead of actual frequencies.
Cumulative frequency distribution: the cumulative frequency for a class is the sum of the frequencies for that class and all previous classes.
Histogram: a bar graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values, and the bars are drawn adjacent to each other (without gaps).
Dotplot: a graph in which each data value is plotted as a point (or dot) along a scale of values. Dots representing equal values are stacked.
Stem-and-Leaf Plot: represents data by separating each value into two parts: the stem (such as the leftmost digit) and the leaf (such as the rightmost digit).
Pie Chart: a graph of a frequency distribution for a categorical data set. Each category is represented by a slice of the pie, and the area of the slice is proportional to the corresponding frequency or relative frequency.

Measures of Center

Sample mean: x̄ = (Σx)/n, which is used to measure the center of a sample with sample size n. Here Σx represents the sum of all data values.
Population mean: µ = (Σx)/N, which is the average value of the entire population (with size N).
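The sample-mean formula can be sketched in a few lines of Python (a minimal illustration with hypothetical data, not part of the text):

```python
def sample_mean(values):
    """x̄ = (Σx)/n: the sum of all data values divided by the sample size n."""
    return sum(values) / len(values)

heights = [160, 165, 170, 175, 180]  # hypothetical heights of five students
print(sample_mean(heights))  # 170.0
```

The same function computes a population mean µ when `values` holds the entire population rather than a sample.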
Median: the middle value in the ordered list of sample data. If n is even, the median is the average of the two middle values.
Mean from a frequency distribution: x̄ = Σ(f · x)/Σf, where f denotes the class frequency and x is the class midpoint.
Skewness: a distribution of data is skewed if it is not symmetric and extends more to one side than the other. A distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half.

Measures of Variation

Range: the difference between the highest value and the lowest value of a set of data.
Sample variance: s² = Σ(x − x̄)²/(n − 1), which is used to measure the variation of sample data.
Sample standard deviation: s = √s² = √(Σ(x − x̄)²/(n − 1)).
Population variance: denoted by σ²; population standard deviation: denoted by σ. These are used to measure the variation of the entire population.
Formula to calculate the standard deviation from a frequency distribution:

s = √( (n[Σ(f · x²)] − [Σ(f · x)]²) / (n(n − 1)) ),

where n is the sample size (or the total of the frequencies), x is the class midpoint, and f is the class frequency.

Range Rule of Thumb:
• minimum “usual” value = (mean) − 2 × (standard deviation).
• maximum “usual” value = (mean) + 2 × (standard deviation).

Empirical Rule for Data with a Bell-shaped Distribution:
• About 68% of all values fall within 1 standard deviation of the mean.
• About 95% of all values fall within 2 standard deviations of the mean.
• About 99.7% of all values fall within 3 standard deviations of the mean.

Chebyshev’s Theorem: the proportion of any set of data lying within K standard deviations of the mean is always at least 1 − 1/K², where K is any positive number greater than 1. For K = 2 and K = 3, we get the following statements:
• At least 3/4 (or 75%) of all values lie within 2 standard deviations of the mean.
• At least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean.
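The defining formula for s² and the frequency-distribution shortcut formula give the same answer when each class midpoint stands in for its data values. A small sketch (with made-up numbers) shows both:

```python
import math

def sample_variance(xs):
    """s² = Σ(x − x̄)² / (n − 1), the defining formula."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """s = √s²."""
    return math.sqrt(sample_variance(xs))

def grouped_sd(midpoints, freqs):
    """Shortcut for a frequency distribution:
    s = √((n·Σ(f·x²) − (Σ(f·x))²) / (n(n − 1)))."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

# Frequency table {x=2: f=3, x=4: f=2} expands to the raw data [2, 2, 2, 4, 4]:
print(grouped_sd([2, 4], [3, 2]))   # same value as ...
print(sample_sd([2, 2, 2, 4, 4]))   # ... the defining formula
```

Expanding the table and applying the defining formula is a quick consistency check on the shortcut.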
Measures of Relative Standing

z score: z = (x − x̄)/s (sample), and z = (x − µ)/σ (population). It is positive (negative) if the data value lies above (below) the mean. If the z score is given, we can find the corresponding x value: x = x̄ + z · s.

Quartiles and Percentiles:
• Q1 (first quartile): separates the bottom 25% of the sorted values from the top 75%.
• Q2 (second quartile): separates the bottom 50% of the sorted values from the top 50%; the same as the median.
• Q3 (third quartile): separates the bottom 75% of the sorted values from the top 25%.
• percentile of value x = (number of values less than x)/(total number of values) · 100.
• See Figure 2-15 for converting from the kth percentile to the corresponding data value.

Review of Chapter Three

In this chapter, we discussed the basic concepts related to probability and developed some basic skills to calculate probabilities in a variety of important circumstances.

Event: any collection of results or outcomes of a procedure.
Simple event: an outcome or an event that cannot be further broken down into simpler components.
Sample space: the collection of all possible simple events.
Probability: a number between 0 and 1 that reflects the likelihood of occurrence of some event. P denotes a probability and P(A) denotes the probability of event A occurring.

Some Basic Properties of Probability:
• The probability of an impossible event is 0.
• The probability of an event that is certain to occur is 1.
• 0 ≤ P(A) ≤ 1 for any event A.

Some Important Formulas:
• P(A or B) = P(A) + P(B) − P(A and B).
• If A and B are mutually exclusive, then P(A or B) = P(A) + P(B).
• P(Ā) = 1 − P(A), where Ā is the complement of event A. In particular, P(at least one) = 1 − P(none).
• If all the n possible simple events of a procedure are equally likely to occur, then P(A) = (number of simple events in A)/n.

Conditional Probability: let A and B be two events of a procedure.
The conditional probability of B given that A has occurred is

P(B|A) = P(A and B)/P(A).

Multiplication Rule: P(A and B) = P(A) · P(B|A).

Independence:
• Two events A and B are independent if P(A and B) = P(A)P(B). If P(A) ≠ 0, then this definition is equivalent to P(B|A) = P(B), which means that the probability of event B does not depend on whether A has occurred or not.
• If A1, A2, · · · , Ak are independent, then P(A1 and A2 and · · · and Ak) = P(A1)P(A2) · · · P(Ak).

Review of Chapter Four

In this chapter we introduced some important concepts like random variable and probability distribution. Two important discrete probability distributions, the binomial distribution and the Poisson distribution, were discussed.
• A random variable has values that are determined by chance.
• A probability distribution consists of all values of a random variable, along with their corresponding probabilities. A probability distribution must satisfy two requirements: ΣP(x) = 1, and, for each value of x, 0 ≤ P(x) ≤ 1.
• Important characteristics of a probability distribution can be explored by constructing a probability histogram and by computing its mean and standard deviation using these formulas:

µ = Σ[x · P(x)],
σ = √(Σ[x² · P(x)] − µ²).

• In a binomial distribution, there are two categories of outcomes and a fixed number of independent trials with a constant probability. The probability of x successes among n trials can be found by using the binomial probability formula, or Table A-1, or software.

Binomial probability formula:

P(x) = [n!/((n − x)! x!)] · p^x · q^(n − x), x = 0, 1, 2, . . . , n,

where n = number of trials, x = number of successes among n trials, p = probability of success in any trial, and q = 1 − p.
• In a binomial distribution, the mean and standard deviation can be easily found by calculating the values of µ = np and σ = √(npq).
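The binomial formula and the two requirements for a probability distribution can be checked numerically. A minimal sketch (the n = 10, p = 0.3 example is hypothetical, not from the text):

```python
import math

def binomial_pmf(x, n, p):
    """P(x) = [n!/((n − x)! x!)] · p^x · q^(n − x), with q = 1 − p."""
    q = 1 - p
    return math.comb(n, x) * p ** x * q ** (n - x)

n, p = 10, 0.3
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Requirement ΣP(x) = 1, and the shortcut µ = np:
print(round(sum(probs), 6))                                  # 1.0
print(round(sum(x * pr for x, pr in enumerate(probs)), 6))   # 3.0, i.e. np
```

Computing µ = Σ[x · P(x)] directly and comparing it with np is a useful sanity check on a hand calculation.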
• A Poisson probability distribution applies to occurrences of some event over a specific interval, and its probabilities can be calculated with

P(x) = (µ^x · e^(−µ))/x!, x = 0, 1, 2, . . . ,

where µ is the mean of x. The standard deviation is σ = √µ.

Review of Chapter Five

In this chapter, we introduced continuous probability distributions and focused on the most important category: normal distributions.

Continuous probability distribution: described by a smooth density curve. Areas under this curve are interpreted as probabilities.

A continuous random variable has a uniform distribution if its values spread evenly over the range of possibilities. The graph of a uniform distribution results in a rectangular shape. Its density function is given by

f(x) = 1/(b − a) if a ≤ x ≤ b, and f(x) = 0 otherwise.

Normal distribution: a continuous probability distribution that is specified by a particular type of bell-shaped and symmetric density curve. Its density function is given by

f(x) = (1/(√(2π) σ)) · e^(−(1/2)((x − µ)/σ)²),

where µ is the mean and σ is the standard deviation.

Standard normal distribution: the normal distribution with µ = 0 and σ = 1. One can use Table A-2 to find probabilities for given z scores. Some useful formulas:
• P(a < z < b) = P(z < b) − P(z < a).
• P(z > a) = 1 − P(z < a).
• P(z > a) = P(z < −a) (symmetry).

Finding a z score from a known area (or probability): using the cumulative area from the left, locate the closest probability in the body of Table A-2 and identify the corresponding z score.

Standardizing procedure: if x has a normal distribution with mean µ and standard deviation σ, then z = (x − µ)/σ has a standard normal distribution. We have P(x < a) = P(z < (a − µ)/σ).

Given the value range of x, find the probability related to x:
(1) State the problem.
(2) Standardize z = (x − µ)/σ and find the value range for z.
(3) Use Table A-2 to find the desired probability.

Given the probability, find the value of x:
(1) State the problem.
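The Poisson formula and the standardizing procedure can both be sketched in a few lines. Here the table lookup (Table A-2) is replaced by the closed form Φ(z) = (1 + erf(z/√2))/2, which gives the same cumulative area from the left; this substitution is mine, not the textbook's:

```python
import math

def poisson_pmf(x, mu):
    """P(x) = µ^x · e^(−µ) / x!."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for a normal distribution: standardize z = (x − µ)/σ,
    then take the cumulative left area Φ(z) = (1 + erf(z/√2))/2,
    playing the role of Table A-2."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# P(a < z < b) = P(z < b) − P(z < a); within 2 standard deviations:
print(round(normal_cdf(2) - normal_cdf(-2), 4))  # 0.9545, the Empirical Rule's ~95%
```

The symmetry formula P(z > a) = P(z < −a) corresponds to the identity 1 − Φ(a) = Φ(−a), which the function satisfies.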
(2) Use Table A-2 to find the z score corresponding to the given probability.
(3) Convert back to the x value by x = µ + (z · σ).
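Steps (2) and (3) above can be sketched without the table by inverting the cumulative area numerically. This bisection helper is a hypothetical substitute for looking up Table A-2, not the textbook's method, and the µ = 100, σ = 15 parameters are made up for illustration:

```python
import math

def normal_cdf(z):
    """Cumulative area from the left, Φ(z) = (1 + erf(z/√2))/2."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_from_area(area, lo=-10.0, hi=10.0):
    """Bisection search for the z score whose cumulative left area
    equals the given probability (step (2), the Table A-2 lookup)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 100, 15           # hypothetical normal distribution
z = z_from_area(0.95)         # z score with 95% of the area to its left
print(round(z, 3))            # 1.645
print(round(mu + z * sigma, 1))  # step (3): x = µ + z·σ, here 124.7
```

Bisection works here because Φ is strictly increasing, so each area in (0, 1) corresponds to exactly one z score.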