HW#2: 2.1.6, 2.2.6, 2.3.18, 2.4.14, 2.5.8, 2.5.9, 2.6.6, 2.6.22

1) Please answer the following questions
a. Are there any outliers in Female Heights? (answer using Excel analysis)
b. Draw a histogram of Female Heights (in Excel)
c. Draw a pie chart showing the relative proportion of males and females who did NOT respond to the survey (in Excel). There are 200 students in the class, of whom 120 are males.
d. Draw a box and whisker plot for the Female Heights data (manually)

2) Please answer the following questions
a. Compute the summary statistics of Female Heights
b. Draw a scatter plot with Female Weights on the X-axis and Female Heights on the Y-axis (in Excel)
c. Does the plot indicate any outliers? (visual inspection)
d. What can you say about the relationship between Female Heights and Female Weights? (visual inspection)
e. What is the correlation coefficient between Female Heights and Female Weights? (in Excel)

Example
There are 100 students at a school, 40 of whom are females, 50 are E 243 students, and 30 are both females and E 243 students. You come to know that a student met the president of the school.
a) What is the probability that this student is an E 243 student?
b) What is the probability that this student is an E 243 student, if you also know that the student who met the president is a female?

Solution
Define: Event F: the student is a female; Event E: the student is an E 243 student
Data: P(F) = 0.4, P(E) = 0.5, and P(F∩E) = 0.3
Part a) Nothing further is known about the student, so P(E) = 0.5
Part b) The conditional probability of E given F is P(E|F) = P(F∩E) / P(F) = 0.3 / 0.4 = 0.75

Example
A manufacturer claims that its drug test will detect steroid use (that is, show positive for an athlete who uses steroids) 95% of the time. What the company does not tell you is that 15% of all steroid-free individuals also test positive. Ten percent of all rugby team members use steroids.
a) What is the probability that a rugby team member tests positive?
b) If your friend on the rugby team has just tested positive, what is the probability that he is using steroids?

Solution
Define: Event D: the test detects steroids (a positive result); Event S: the athlete uses steroids; Sc: the complement of S (no steroid use)
Data: P(D|S) = 0.95, P(D|Sc) = 0.15, P(S) = 0.10, so P(Sc) = 1 - 0.10 = 0.90
Part a) From the theorem of total probability, P(D) = P(D|S) x P(S) + P(D|Sc) x P(Sc) = 0.95 x 0.10 + 0.15 x 0.90 = 0.23
Part b) From Bayes' rule, P(S|D) = P(D|S) x P(S) / P(D) = 0.95 x 0.10 / 0.23 = 0.413

Lecture 4

A variable whose measured value can change is called a random variable. Height, weight, GPA, expected grade, etc. are all examples of random variables. A random variable is usually represented by an upper case letter, say X. A measured value of the random variable is denoted by the corresponding lower case letter, in this case x. A collection of values of X is data.

Data Summary and Presentation

The first step after obtaining a dataset is to identify each variable and assign a letter to each, for example height (H), weight (W), and so on. Now pick any one of these variables, say height. Before we go any further, we must check whether there are any obvious errors in our dataset. These mostly show up as outliers. Before we give a formula for identifying outliers, it is useful to define percentiles, quartiles, and the interquartile range.

The pth percentile of a data set is the value below which p percent of the data lies. The 1st quartile (Q1) of a data set is the value below which 25% of the data lies. In other words, the 1st quartile is the 25th percentile. Similarly, the 3rd quartile (Q3) is the 75th percentile. The 2nd quartile is the median (50th percentile). The interquartile range (IQR) is defined as the difference between the 3rd quartile and the 1st quartile: IQR = Q3 - Q1.

An outlier is formally defined as any value that is lower than (Q1 - 1.5 IQR) or greater than (Q3 + 1.5 IQR). The median, quartiles, maximum, and minimum are often visualized using box and whisker plots.
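The outlier rule above can be sketched in a few lines of Python. This is only an illustration: the heights are made-up numbers, not the survey data, and the helper name `find_outliers` is hypothetical. Quartiles are computed with the standard library's `statistics.quantiles` using its "inclusive" interpolation method, which is one of several common conventions; other tools may report slightly different quartiles for the same data.

```python
import statistics

def find_outliers(values):
    """Return (q1, q3, iqr, outliers) using the 1.5 * IQR rule."""
    # n=4 splits the data into quartiles and returns [Q1, median, Q3].
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q3, iqr, outliers

# Hypothetical heights in inches (assumed for illustration only).
heights = [60, 61, 62, 63, 64, 64, 65, 66, 67, 68, 80]
q1, q3, iqr, outliers = find_outliers(heights)
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr, "outliers =", outliers)
```

For these numbers Q1 = 62.5 and Q3 = 66.5, so IQR = 4 and the cutoffs are 56.5 and 72.5. Only the value 80 falls outside them.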
[Figure: box and whisker plot with the minimum, 1st quartile, median, 3rd quartile, and maximum labeled]

People are interested in what the data tells them. They don't want to look at thousands of cells of data, so you must come up with ways to summarize all these cells. How would you summarize them? The goal is to come up with a small set of numbers that gives a significant amount of information about the data.

The first that comes to mind is a measure of the central location of the data. This can be measured in a number of ways: mean, median, mode, or trimmed mean.

The mean ($\bar{x}$) is obtained by summing all the values and dividing by the total number of points (n):

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

The median is the number such that 50% of the data points have a value greater than it and 50% have a value less than it. The mode is the most frequently occurring value in the dataset.

The second is a measure of the spread in the data. This can be measured as the standard deviation, variance, range, or interquartile range.

The variance ($s^2$) is given by

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

The standard deviation s is the square root of the variance. These are fundamental and are expected in any summary of data.

The trimmed mean is sometimes a better estimate of the central location of a data set than the mean, especially in the presence of outliers. A k% trimmed mean is the mean calculated after eliminating k% of the data from both the higher end and the lower end.
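As a rough illustration of these summary measures, the sketch below computes the mean, sample variance, standard deviation, and a trimmed mean in Python. The data are made-up heights, and the helper names `sample_mean`, `sample_variance`, and `trimmed_mean` are hypothetical. Rounding the number of trimmed points down when k% of n is not a whole number is an assumption, since the notes do not specify a convention.

```python
import math
import statistics

def sample_mean(values):
    """Sum of all values divided by the number of points n."""
    return sum(values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sample_mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

def trimmed_mean(values, k_percent):
    """Mean after dropping k% of the points from EACH end of the sorted data."""
    ordered = sorted(values)
    # Assumption: round the number of dropped points down to a whole number.
    drop = int(len(ordered) * k_percent / 100)
    kept = ordered[drop:len(ordered) - drop]
    if not kept:
        raise ValueError("k_percent trims away every data point")
    return sum(kept) / len(kept)

# Hypothetical heights in inches, with one suspiciously large value.
heights = [60, 61, 62, 63, 64, 64, 65, 66, 67, 88]

print("mean          =", sample_mean(heights))
print("median        =", statistics.median(heights))
print("mode          =", statistics.mode(heights))
print("variance      =", sample_variance(heights))
print("std deviation =", math.sqrt(sample_variance(heights)))
print("10% trimmed   =", trimmed_mean(heights, 10))
```

With these made-up numbers the single large value pulls the mean up to 66.0, while the 10% trimmed mean (which drops the values 60 and 88) is 64.0, the same as the median. The standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and can replace the hand-written versions.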