Stats Review Topics
1. Combinations and Permutations (2)
2. Graphs: Box & Whisker, histograms (4)
3. Shape of data distributions: shape, outliers, skew, spread & center (3)
4. 2-way tables and conditional probability (3)
5. Probability calculations – mutually exclusive, independent, conditional (5)
6. Normal Distribution & Probability (3)
7. Regression – finding equation from mean and variance of each sample, correlation (4)
8. Sampling methods & bias (3)
9. Mean and variance of random variables (expected value) (3)
10. Hypothesis testing (5)
Combinations & Permutations
Permutations: The number of ways to put r out of n items IN ORDER.
1. To put all n items in order: n!
2. To put r out of n items in order: nPr = n! / (n-r)!
3. Calculator: MATH > PRB 2:nPr
Combinations: The number of ways to choose r out of n items when ORDER DOES NOT MATTER.
1. To choose r out of n items: nCr = n! / [(n-r)! r!]
2. Calculator: MATH > PRB 3:nCr
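The calculator commands above can be cross-checked with a short Python sketch. It assumes Python 3.8 or later, where `math.perm` and `math.comb` are available, and it uses the numbers from Ex 2, Ex 4 and Ex 5 below. The variable names are only illustrative.

```python
import math

# Permutations: order matters. Ex 2 (Gold, Silver, Bronze among 8 runners) is 8P3.
medal_orders = math.perm(8, 3)

# The same value from the formula n! / (n - r)!
medal_orders_formula = math.factorial(8) // math.factorial(8 - 3)

# Combinations: order does not matter. Ex 4 (4 students out of 20) is 20C4.
award_groups = math.comb(20, 4)

# All n items in order: Ex 5 (the 6 distinct letters of DAWSON) is 6!.
dawson_orders = math.factorial(6)

print(medal_orders, medal_orders_formula, award_groups, dawson_orders)
# prints: 336 336 4845 720
```

The first question to ask of each example is whether swapping two chosen items gives a different outcome. If it does, use nPr; if not, use nCr.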
Example 1: How many ways can a 4 digit code be created if none of the digits can be repeated?
Ex 2: How many ways can Gold, Silver & Bronze be awarded in a race with 8 runners?
Ex 3: How many ways can an ordered playlist of 6 songs be made from the 11 songs on an album?
Ex 4: How many ways can 4 students from a class of 20 be picked to receive the same award?
Ex 5: How many ways can the letters in the word DAWSON be rearranged?
Ex 6: How many ways can you choose 2 side dishes with your meal at a restaurant if there are 8 to
choose from?
Ex 7: An ice cream shop allows 4 different toppings on a sundae. How many different sundaes can be
made if they have 7 available toppings to choose from?
Ex 8: A test has 5 essay questions and each person must complete 3. How many different choices are
there of 3 essay questions to answer?
Graphs of 1 variable data
Histogram
Equal-width “bins” collect the data values.
Bars need to touch, because the horizontal axis is continuous.
Height of each bar represents the frequency of data items in that “bin”.
Used to see information on center and skew.
Box & Whisker
Shows the “5 number summary”: Min, Q1, median, Q3, Max
Easy to find the Inter-quartile Range (IQR) – this is the width of the box (Q3 – Q1)
Easy to see outliers: any value more than 1.5 × IQR above Q3 or below Q1
Divides the data into 4 sections, each holding about 25% of the values
Example 1: Using the following data draw a histogram and a box & whisker plot.
16, 13, 18, 12.5, 9.5, 15, 22.5, 16.5, 14, 13, 11, 12, 15, 12.5, 11, 13.5
Example 2: Are there any outliers in the data? What effect would removing any outliers have on the
data?
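The 5 number summary and the 1.5 × IQR outlier rule can be sketched in Python for the data in Example 1. This is a minimal sketch that assumes the "median of each half" convention for quartiles, which is what TI calculators report; other software may use slightly different quartile rules. The function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values below the median position
    upper = data[len(data) - half:]   # values above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - step or x > q3 + step]

data = [16, 13, 18, 12.5, 9.5, 15, 22.5, 16.5, 14, 13, 11, 12, 15, 12.5, 11, 13.5]
print(five_number_summary(data))  # (9.5, 12.25, 13.25, 15.5, 22.5)
print(outliers(data))             # [22.5]
```

Here the IQR is 15.5 – 12.25 = 3.25, so the upper fence is 15.5 + 1.5 × 3.25 = 20.375 and only 22.5 lies beyond it.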
Shape of Data Distributions
SHAPE – Symmetrical or skewed; number of modes
OUTLIERS – Points more than 1.5 × IQR below the 1st quartile or above the 3rd quartile
CENTER – Mean if the data is symmetrical / Median if the data is skewed
SPREAD – Standard deviation goes with the mean / IQR goes with the median / Range
Example 1: A sample of SAT scores from 100 students is symmetrical and unimodal. What measure of
center and spread should be used?
Example 2: The distribution of ages of people at a theme park is skewed to the right. What measure of
center and spread should be used?
Example 3: The 5 number summaries of points scored in basketball games are shown below. Which
team has the smallest IQR?
Team 1: 42, 58, 62, 68, 75
Team 2: 44, 51, 59, 64, 68
Team 3: 59, 64, 68, 74, 91
Team 4: 62, 68, 74, 76, 79
Example 4: In the above example, which teams had game scores that would be outliers?
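One way to check Examples 3 and 4 is sketched below. With only a 5 number summary available, the most that can be tested is whether the minimum or maximum falls outside the 1.5 × IQR fences, so the sketch does exactly that. The helper names `iqr` and `has_outlier` are hypothetical.

```python
# Five-number summaries (min, Q1, median, Q3, max) from Example 3.
teams = {
    "Team 1": (42, 58, 62, 68, 75),
    "Team 2": (44, 51, 59, 64, 68),
    "Team 3": (59, 64, 68, 74, 91),
    "Team 4": (62, 68, 74, 76, 79),
}

def iqr(summary):
    """Interquartile range: Q3 - Q1."""
    _, q1, _, q3, _ = summary
    return q3 - q1

def has_outlier(summary):
    """True if the min or max lies more than 1.5 * IQR outside the box."""
    low, q1, _, q3, high = summary
    step = 1.5 * (q3 - q1)
    return low < q1 - step or high > q3 + step

smallest = min(teams, key=lambda name: iqr(teams[name]))
with_outliers = [name for name, s in teams.items() if has_outlier(s)]

print(smallest)       # Team 4
print(with_outliers)  # ['Team 1', 'Team 3']
```

Team 1 qualifies because its minimum of 42 is below the lower fence 58 – 15 = 43, and Team 3 because its maximum of 91 is above the upper fence 74 + 15 = 89.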
Example 5: The following distances were recorded by a company’s fleet vehicles:
330, 402, 350, 382, 31, 412, 375, 363
It turns out that the 31 was a mistake and should be 310. How does that change affect the mean,
median, IQR, range, and standard deviation?
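The effect of correcting the mistaken value in Example 5 can be explored with the sketch below. It assumes the median-of-halves quartile rule and the sample standard deviation (dividing by n – 1); the helper names are hypothetical.

```python
from statistics import mean, median, stdev

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[len(data) - half:])

def describe(values):
    q1, q3 = quartiles(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "IQR": q3 - q1,
        "range": max(values) - min(values),
        "stdev": stdev(values),  # sample standard deviation (n - 1)
    }

recorded = [330, 402, 350, 382, 31, 412, 375, 363]
corrected = [310 if x == 31 else x for x in recorded]

before, after = describe(recorded), describe(corrected)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this data set the median and IQR do not move at all, while the mean, range, and standard deviation change noticeably. That is why the median and IQR are described as resistant to outliers.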
Two-Way Tables and Conditional Probability
Two-way tables show the relationship between two variables.
Marginal Probability – a probability taken from the edges of the table (the TOTAL column or row), divided by the table TOTAL
Joint Probability – the probability of a cell inside the table, divided by the table TOTAL
Conditional Probability – the probability of a cell divided by the total of only the column or row specified
Independent – the probability of event A is the same as the probability of A given B
        Blue  Green  Brown  Total
L.H.      12      8     35     55
R.H.      25     17     71    113
Total     37     25    106    168
Example 1: What is the probability of being left handed?
Example 2: What is the probability of having green eyes and being right handed?
Example 3: What is the probability of having green eyes, given that someone is right handed?
Example 4: What is the probability of having green eyes?
Example 5: Compare your answers from #3 and 4. Are they about the same? Does that mean green
eyes are dependent or independent from right handedness?
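The eye color questions can be worked through with the sketch below, which uses exact fractions so that the denominators stay visible. The right-handed, blue-eyed count of 25 is not printed in the table; it is inferred here from the Blue total (37 – 12), so treat it as an assumption.

```python
from fractions import Fraction

# Counts from the eye-color / handedness table (rows: hand, columns: eye color).
# R.H. Blue = 25 is inferred from the Blue total: 37 - 12.
counts = {
    "L.H.": {"Blue": 12, "Green": 8, "Brown": 35},
    "R.H.": {"Blue": 25, "Green": 17, "Brown": 71},
}

grand_total = sum(sum(row.values()) for row in counts.values())
right_total = sum(counts["R.H."].values())
left_total = sum(counts["L.H."].values())
green_total = counts["L.H."]["Green"] + counts["R.H."]["Green"]

p_left = Fraction(left_total, grand_total)                         # marginal
p_green_and_right = Fraction(counts["R.H."]["Green"], grand_total) # joint
p_green_given_right = Fraction(counts["R.H."]["Green"], right_total)  # conditional
p_green = Fraction(green_total, grand_total)                       # marginal

for label, p in [
    ("P(left)", p_left),
    ("P(green and right)", p_green_and_right),
    ("P(green | right)", p_green_given_right),
    ("P(green)", p_green),
]:
    print(f"{label:>20} = {p} = {float(p):.4f}")
```

P(green | right) and P(green) come out within about 0.002 of each other, which is the kind of comparison Example 5 asks about.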
        Girls  Boys  Total
10th        0     4      1
11th        2     6      8
12th       13    15     28
total      15    25     37
Example 6: What is the probability of being a girl in this class?
Example 7: What is the probability of being in 11th grade?
Example 8: What is the probability of being a girl given that a student is in 11th grade?
Example 9: What is the probability of being in 11th grade given that the student is a girl?
Example 10: Do these two variables appear to be dependent or independent?
        Pepsi  Coke  Total
Dogs                    25
Cats                    75
Total      40    60    100
Example 11: If Pet preference is independent from soda preference, how many cat lovers should prefer
Coke?
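If two variables are independent, each cell should hold (row total × column total) / table total. The sketch below applies that rule, reading the table as 25 dog lovers and 75 cat lovers, with 40 preferring Pepsi and 60 preferring Coke out of 100; that reading of the flattened table is an assumption.

```python
def expected_count(row_total, col_total, grand_total):
    """Cell count expected when the row and column variables are independent."""
    return row_total * col_total / grand_total

row_totals = {"Dogs": 25, "Cats": 75}
col_totals = {"Pepsi": 40, "Coke": 60}
grand_total = 100

expected = {
    (pet, soda): expected_count(r, c, grand_total)
    for pet, r in row_totals.items()
    for soda, c in col_totals.items()
}

print(expected[("Cats", "Coke")])  # 45.0
```

The same rule follows from P(A and B) = P(A) * P(B): 0.75 × 0.60 = 0.45 of the 100 people.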
Conditional Probability
Two events are INDEPENDENT if P(A) is the same with or without event B, so P(A) = P(A|B).
For independent events: P(A and B) = P(A) * P(B)
P(A or B) = P(A) + P(B) – P(A) * P(B)
Two events are MUTUALLY EXCLUSIVE if they cannot happen at the same time. These are NOT
independent. For mutually exclusive events: P(A and B) = 0 and P(A or B) = P(A) + P(B)
P(A|B) = P(A and B) / P(B)
P(A and B) = P(A|B) * P(B)
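The rules above translate directly into a few small functions. The sketch below uses the numbers from Example 1 and the coin and die from Example 5; the function names are hypothetical.

```python
import math

def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b

def either_independent(p_a, p_b):
    """P(A or B) when A and B are independent."""
    return p_a + p_b - p_a * p_b

def either_exclusive(p_a, p_b):
    """P(A or B) when A and B are mutually exclusive."""
    return p_a + p_b

# Example 1: P(A) = 0.7, P(B) = 0.4, P(A and B) = 0.15
p_a_given_b = conditional(0.15, 0.4)
print(round(p_a_given_b, 3))            # 0.375

# Independence check: A and B are independent only if P(A|B) equals P(A).
print(math.isclose(p_a_given_b, 0.7))   # False

# Example 5: heads (0.5) OR an even number on a die (0.5), independent events.
print(either_independent(0.5, 0.5))     # 0.75
```

Because P(A|B) = 0.375 differs from P(A) = 0.7, the events in Example 1 are dependent.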
Example 1: P(A) = .7 and P(B) = .4 and the probability of both happening together is 0.15. What is the
conditional probability of event A given event B? ANSWER: P(A|B) = P(A and B) / P(B) = 0.15/.4 = .375
Example 2: P(A) = .6 and P(B) = .2 and the probability of both happening together is .17. What is the
conditional probability of B given A?
Example 3: The probability of rain is 40%, the probability of temperatures over 90 is 20% and the
probability of both is 5%. What is the probability of the temp being over 90 given that it’s raining?
Example 4: In a certain bloodline, the probability of gray-eyed goats is 15% and the probability of
blue-eyed goats is 25%. These are mutually exclusive. What is the probability of a goat having blue OR gray
eyes?
Example 5: In an experiment, you flip a coin and roll a standard die. What is the probability of the coin
landing on heads OR the die showing an even number? These are independent.
Example 6: The probability that I go to the beach on Saturday is 0.75. The probability that I go to the
beach and get sunburn is .30. What is the probability that I get sunburn given that I am going to the
beach?
Example 7: P(A) = .3 and P(B|A) = .3. Find the probability of (A and B). Are these events mutually
exclusive? Are they independent?
Normal Distribution
Calculator: STAT > TESTS > Z-Test. Type in the mean and standard deviation of the population. The
number you are testing is entered as x-bar, with n = 1. Choose < or > to find the probability of a value being
less than or greater than your chosen value. Either CALCULATE or DRAW will give you p (the probability
associated with the number).
1) An insurance company finds that the ages of motorcyclists killed in crashes are normally distributed with a
mean of 26.9 years and a standard deviation of 8.4 years. If we randomly select one such motorcyclist, find
the probability that he or she was under 25 years old.
2) A population of 700 scores has a mean of 5.40, a standard deviation of 1.20, and its distribution is
approximately normal. If a score is randomly selected, find the probability that it is greater than 5.00.
3) A sociologist finds that for a certain segment of the population, the numbers of years of formal education are
normally distributed with a mean of 13.20 years and a standard deviation of 2.95 years.
a) For a person randomly selected from this group, find the probability that he or she has between 13.20
and 13.50 years of education.
b) For a person randomly selected from this group, find the probability that he or she has at least 12.00
years of education.
4) Scores on a standard IQ test are normally distributed with a mean of 100 and a standard deviation of 15. Find
the probability that a randomly selected subject will achieve a score between 90 and 120.
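As a cross-check on the calculator, the same probabilities can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch below covers problems 1 and 4; the variable names are only illustrative.

```python
from statistics import NormalDist

# Problem 1: ages with mean 26.9 and standard deviation 8.4; P(X < 25).
ages = NormalDist(mu=26.9, sigma=8.4)
p_under_25 = ages.cdf(25)

# Problem 4: IQ scores with mean 100 and standard deviation 15; P(90 < X < 120).
iq = NormalDist(mu=100, sigma=15)
p_90_to_120 = iq.cdf(120) - iq.cdf(90)

print(f"P(age < 25)     = {p_under_25:.4f}")
print(f"P(90 < IQ < 120) = {p_90_to_120:.4f}")
```

`cdf(x)` gives the area to the left of x, so a "greater than" probability is 1 minus the cdf and a "between" probability is the difference of two cdf values.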
Regression & Correlation
The regression line is the line of “best fit” through a set of data points.
The slope of the line is m = r (sy / sx)
The line also passes through the point (x̄, ȳ)
r = correlation coefficient, which measures how strong the linear relationship between X and Y is.
residual: the difference between the actual data value and the predicted value from the equation (actual – predicted).
Remember: Variance = (standard deviation)^2
Calculator: Enter numbers into STAT: EDIT
Use STAT: CALC: 1-Var Stats or 2-Var Stats for easy calculations
Use STAT: CALC: LinReg(a+bx) to get regression equation and correlation information (b = slope)
Example 1: Two different tests were designed to measure understanding of a topic. The two tests were
given to ten students with the following results:
Find the equation of the regression line, round to the nearest hundredth.
Example 2: Using the data above, find the variance of each of the test versions. Is there a difference in
variance between the tests?
Example 3: The accompanying data illustrates the number of movie theaters showing a popular film
and the film's weekly gross earnings, in millions of dollars.
Theaters: mean = 616.75 standard deviation = 205.08
Gross Earnings: mean = 4.63 standard deviation = 2.13
correlation = 0.9807
a) Find the slope of the regression equation
b) Write the appropriate regression equation
c) When the movie was shown in 530 theatres, the actual income was $4.05 million. Find the
residual.
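Example 3 can be worked from the summary statistics alone, using slope = r (sy / sx) and the fact that the line passes through the point of means. The sketch below takes the residual as actual minus predicted, which is the usual convention and is assumed here.

```python
# Summary statistics from Example 3 (x = theaters, y = gross earnings in $ millions).
x_mean, x_sd = 616.75, 205.08
y_mean, y_sd = 4.63, 2.13
r = 0.9807

slope = r * y_sd / x_sd               # m = r * (sy / sx)
intercept = y_mean - slope * x_mean   # line passes through (x_mean, y_mean)

def predict(theaters):
    """Predicted weekly gross (in $ millions) for a given number of theaters."""
    return intercept + slope * theaters

residual = 4.05 - predict(530)        # actual - predicted

print(f"y = {intercept:.3f} + {slope:.5f} x")
print(f"predicted at 530 theaters: {predict(530):.3f}")
print(f"residual: {residual:.3f}")
```

A positive residual means the film earned more than the line predicts for that number of theaters.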
Sampling Methods & Bias
Methods to know:
Convenience Sample: Sample is chosen simply because they are easy to contact.
Simple Random Sample: Each member of the population has an equal chance to be chosen.
Cluster Sample: The population is divided into clusters, each of which could represent the population
and one or a few of the clusters are randomly selected to include in the sample.
Stratified Sample: The population is separated into distinct groups and a random sample is taken from
each group so that some members of each group are included in the sample.
Census: A survey of the entire population
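The random sampling methods above can be sketched with Python's `random` module. The roster of 200 students, the grade labels, and the class size of 25 are all made up for illustration; only the sampling logic matters.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical roster: 200 students, 50 per grade level.
grades = ["freshman", "sophomore", "junior", "senior"]
roster = [(f"student{i:03d}", grades[i % 4]) for i in range(200)]

# Simple random sample: every student has an equal chance of being chosen.
simple_random = rng.sample(roster, 50)

# Stratified sample: a random sample of 10 from each grade level.
stratified = []
for grade in grades:
    group = [s for s in roster if s[1] == grade]
    stratified.extend(rng.sample(group, 10))

# Cluster sample: split into classes of 25, then take 2 whole classes at random.
classes = [roster[i:i + 25] for i in range(0, len(roster), 25)]
cluster = [s for chosen in rng.sample(classes, 2) for s in chosen]

print(len(simple_random), len(stratified), len(cluster))  # 50 40 50
```

The stratified sample guarantees 10 students from every grade, while the simple random and cluster samples do not.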
BIAS
Response: Questions are asked in a way that influences the answers, or respondents are led toward
non-truthful answers.
Non response: The results of a survey are altered due to a large number of people who cannot or will
not respond, especially if there is a common reason for their non-response.
Voluntary Response: Survey members are self-selected volunteers
Underrepresentation: Members of the population are inadequately represented in the sample. This is
usually a result of convenience sampling.
Example 1: I want a survey of the students at UHS to see how they feel about having a final exam week.
Identify the types of sampling represented:
a. I ask the first 20 students in building 3 in my hallway in the morning.
b. I ask 10 freshmen, 10 sophomores, 10 juniors, and 10 seniors
c. I list all students in alpha code order and use a random number generator to pick 50 students
d. I ask all the students in my 4th period class
e. I put a box in the cafeteria for students to enter their responses on paper if they choose.
Example 2: In each of the samples above, what type of bias may occur?
Random Variable (Expected Value)
A random variable includes possible values of the variable and the probability of each value.
The probabilities of all possible values must add up to 1.
The mean (expected value) of the variable is the sum of each possible value times its probability:
E(X) = ∑ x P(x)
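A minimal sketch of the expected value calculation follows, using a fair six-sided die as a hypothetical distribution (it is not taken from the worksheet). It also computes the variance through the standard identity Var(X) = E(X²) – [E(X)]², with exact fractions to avoid rounding.

```python
from fractions import Fraction

# Hypothetical random variable: a fair six-sided die, each face with probability 1/6.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# The probabilities of all possible values must add up to 1.
assert sum(distribution.values()) == 1

# E(X) = sum of x * P(x)
expected = sum(x * p for x, p in distribution.items())

# E(X^2) = sum of x^2 * P(x), then Var(X) = E(X^2) - [E(X)]^2
expected_sq = sum(x**2 * p for x, p in distribution.items())
variance = expected_sq - expected**2

print(expected, variance)  # 7/2 35/12
```

The expected value 3.5 is not a face of the die; it is the long-run average over many rolls.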
The variance of the random variable: Var(X) = E(X²) – [E(X)]²
Example 1