E 243 Spring 2015 Lecture 3

HW#2: 2.1.6, 2.2.6, 2.3.18, 2.4.14, 2.5.8, 2.5.9, 2.6.6, 2.6.22
1) Please answer the following questions
a. Are there any outliers in Female Heights? (answer using Excel analysis)
b. Draw a histogram of Female Heights (in Excel)
c. Draw a pie chart showing the relative proportion of males and females
who did NOT respond to the survey (in Excel). There are 200 students
in the class of whom 120 are males.
d. Draw a Box and Whisker plot for the Female Heights data (Manually)
2) Please answer the following questions
a. Compute the summary statistics of Female Heights
b. Draw a scatter plot with Female Weights on the X-axis and Female
Heights on the Y-axis (in Excel)
c. Does the plot indicate any outliers? (Visual Inspection)
d. What can you say about the relationship between Female Heights and
Female Weights? (Visual Inspection)
e. What is the correlation coefficient between Female Heights and Female
Weights? (in Excel)
Example
There are 100 students at a school, 40 of whom are females, 50 are E 243 students, and 30
are females and E 243 students. You come to know that a student met the president of
the school.
a) What is the probability that this student is an E 243 student?
b) What is the probability that this student is an E 243 student, if you also know that
the student who met the president is a female?
Solution:
Define: Event F – Student is a female, Event E – Student is an E 243 student
Data: P (F) = 0.4, P (E) = 0.5, and P (F∩E) = 0.3
Part a) P (E) = 0.5
Part b) P (E/F) = P (F∩E)/P(F) = 0.3/0.4 = 0.75.
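The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable and function names are illustrative, and exact fractions are used so that no floating-point rounding appears.

```python
from fractions import Fraction

# Counts taken from the example above.
total_students = 100
females = 40
e243_students = 50
female_and_e243 = 30

# Convert counts to probabilities.
p_f = Fraction(females, total_students)
p_e = Fraction(e243_students, total_students)
p_f_and_e = Fraction(female_and_e243, total_students)


def conditional(p_joint, p_condition):
    """Return P(A given B) = P(A and B) / P(B)."""
    if p_condition == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return p_joint / p_condition


p_e_given_f = conditional(p_f_and_e, p_f)

print("Part a) P(E) =", float(p_e))
print("Part b) P(E given F) =", float(p_e_given_f))
```

Knowing that the student is a female shrinks the sample space from 100 students to the 40 females, 30 of whom are E 243 students, which is why the answer rises from 0.5 to 0.75.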
Example
A manufacturer claims that its drug test will detect steroid use (that is, show positive
for an athlete who uses steroids) 95% of the time. What the company does not tell you is
that 15% of all steroid-free individuals also test positive. Ten percent of all rugby team
members use steroids.
a) What is the probability that a rugby team member tests positive?
b) If your friend on the rugby team has just tested positive, what is the probability that
he is using steroids?
Solution
Define: Event of Detection – D, Event of Steroid Use – S
Data: P (D/S) = 0.95, P (D/S^c) = 0.15, P (S) = 0.10, where S^c is the complement of S (no steroid use)
Part a) P (D) = P (D/S) x P (S) + P (D/S^c) x P (S^c) from the theorem of total probability
P (D) = 0.95 x 0.10 + 0.15 x 0.90 = 0.23
Part b) P (S/D) = P (D/S) x P (S)/P (D) from Bayes' Rule
P (S/D) = 0.95 x 0.10/0.23 = 0.413
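The two steps of this solution (total probability, then Bayes' Rule) can be sketched in Python as follows. The function names are illustrative and not part of the lecture; the inputs are the three numbers given in the problem.

```python
def total_probability(p_d_given_s, p_d_given_not_s, p_s):
    """P(D) = P(D given S) * P(S) + P(D given not S) * P(not S)."""
    return p_d_given_s * p_s + p_d_given_not_s * (1 - p_s)


def bayes_posterior(p_d_given_s, p_d_given_not_s, p_s):
    """P(S given D) = P(D given S) * P(S) / P(D)."""
    p_d = total_probability(p_d_given_s, p_d_given_not_s, p_s)
    return p_d_given_s * p_s / p_d


# Numbers from the drug-test example.
p_d = total_probability(0.95, 0.15, 0.10)
p_s_given_d = bayes_posterior(0.95, 0.15, 0.10)

print("Part a) P(D) =", round(p_d, 3))
print("Part b) P(S given D) =", round(p_s_given_d, 3))
```

Even with a test that detects 95% of users, a positive result implies steroid use with probability of only about 0.41, because the 90% of players who do not use steroids generate more positives (0.135) than the users do (0.095).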
Lecture 4
A variable whose measured value can change is called a random variable. Height,
weight, GPA, expected grade, etc., are all examples of random variables. The random
variable is usually represented by an upper case letter, say X. A measured value of the
random variable is denoted by the corresponding lower case letter; in this case x. A
collection of values of X is data.
Data Summary and Presentation
The first step after obtaining a dataset is to identify each variable and assign a letter to
each. For example, height (H), weight (W), and so on.
Now pick any of these variables, say height.
Before we go any further, we must check if there are any obvious errors in our dataset.
These are mostly identified as outliers. Before we formally define a formula to identify
outliers, it is useful to define percentiles, quartiles, and inter-quartile range.
The pth percentile value of any data set is that value below which p percent of the data
lies. 1st quartile (Q1) of a data set is that value below which 25% of the data lies. In other
words, the 1st quartile is the 25th percentile. Similarly, 3rd quartile (Q3) is the 75th
percentile. The 2nd quartile is the median (50th percentile).
The interquartile range (IQR) is defined as the difference between the 3rd quartile and
the 1st quartile.
An outlier is formally defined as any value that is lower than (Q1 – 1.5 IQR) or greater
than (Q3 + 1.5 IQR).
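The outlier rule can be sketched in Python with the standard library. The heights below are made-up values for illustration, not the class data, and the function name is hypothetical. Quartile conventions differ between tools, so the "inclusive" method used here is one reasonable choice and may give slightly different quartiles than a hand calculation or another package.

```python
import statistics


def iqr_outliers(values):
    """Return (lower fence, upper fence, outliers) using the 1.5 x IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers


# Hypothetical heights in inches (assumed data, for illustration only).
heights = [60, 62, 63, 64, 65, 66, 67, 68, 80]

lower_fence, upper_fence, outliers = iqr_outliers(heights)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

For this sample, Q1 = 63 and Q3 = 67, so IQR = 4 and the fences are 57 and 73; the value 80 lies above the upper fence and is flagged.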
The median, quartiles, maximum, and minimum are often visualized using box and
whisker plots.
[Figure: box and whisker plot with labels for the Maximum, 3rd Quartile, Median, 1st Quartile, and Minimum]
People are interested in what the data tells you. They don’t want to look at 1000s of cells
of data. So you must come up with ways to summarize all these cells. How would you
summarize this? The goal is to come up with a small set of numbers that give you a
significant amount of information about the data.
The first measure that comes to mind is the central location of the data. This could
be measured in any number of ways: mean, median, mode, or trimmed mean.
The mean ($\bar{x}$) is obtained by summing up all the numbers and dividing by the total number
of points (n).

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The median is the number such that 50% of the data points have a value greater than this and
50% have a value less than this. The mode is the most frequently occurring value in the
dataset.
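The three measures of central location defined so far can be computed as in the sketch below. The heights are made-up values for illustration; the mean follows the formula directly, while the median and mode come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical heights in inches (assumed data, for illustration only).
heights = [60, 62, 63, 63, 65, 66, 67, 68, 80]

n = len(heights)
mean_height = sum(heights) / n                 # sum of all values divided by n
median_height = statistics.median(heights)     # middle value of the sorted data
mode_height = statistics.mode(heights)         # most frequently occurring value

print("mean:", mean_height)
print("median:", median_height)
print("mode:", mode_height)
```

In this sample the single large value (80) pulls the mean (66) above the median (65), which is the kind of sensitivity that motivates looking at more than one measure.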
The second is a measure of spread in the data. This could be measured as standard
deviation, variance, range, or interquartile range. The variance ($s^2$) is given by

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
The standard deviation s is the square root of the variance.
These are fundamental and are expected in any summary of data.
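As a check on the definitions of variance and standard deviation, the sketch below computes both by hand, with the n − 1 divisor, and compares the results with the standard library. The data values are made up for illustration.

```python
import math
import statistics

# Hypothetical data (assumed values, for illustration only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / (n - 1)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print("variance:", round(variance, 3))
print("standard deviation:", round(std_dev, 3))

# The standard library uses the same n - 1 convention for sample statistics.
print("matches statistics.variance:", math.isclose(variance, statistics.variance(data)))
print("matches statistics.stdev:", math.isclose(std_dev, statistics.stdev(data)))
```

Here the mean is 5 and the squared deviations sum to 32, so the variance is 32/7, or about 4.571.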
The trimmed mean is sometimes a better estimate of the central location of the data set than
the mean, especially in the presence of outliers. A k% trimmed mean is the mean
calculated after eliminating k% of the data from both the higher end and the lower
end.
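A k% trimmed mean can be sketched as below. The function name and data are illustrative. One detail the definition leaves open is what to do when k% of n is not a whole number; this sketch rounds the number of dropped points down, which is an assumption rather than a rule stated in the lecture.

```python
def trimmed_mean(values, k):
    """Mean after dropping k percent of the points from each end of the sorted data."""
    if not 0 <= k < 50:
        raise ValueError("k must satisfy 0 <= k < 50")
    ordered = sorted(values)
    # Number of points to drop from EACH end (rounded down: an assumption).
    drop = int(len(ordered) * k / 100)
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)


# Hypothetical heights in inches with one suspicious entry (assumed data).
heights = [60, 62, 63, 63, 65, 66, 67, 68, 80, 120]

plain_mean = sum(heights) / len(heights)
trimmed = trimmed_mean(heights, 10)

print("mean:", plain_mean)
print("10% trimmed mean:", trimmed)
```

With ten points, a 10% trim drops one value from each end (60 and 120), so the trimmed mean of 66.75 is much less affected by the suspicious 120 than the ordinary mean of 71.4.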