Math 140 Notes and Activity Packet (Word)
Unit 1 : Collecting Data, Bias, Experimental Design
Math 140 Notes – Good and Bad Ways to Collect Data
Population : The collection of every person or object you are studying
Ex) The weight of every man in Santa Clarita CA
Ex) The salary of every person in Norway
Census: Getting data from (almost) everyone in the population. A census is the most accurate data we
can get because it represents the whole population.
Sample: Get information from a sub-group of the population. Key Question: Will the sample
represent the population?
Descriptive Statistics (math 075) : Being able to analyze a data set
Inferential Statistics (math 140) : What does the data set tell us about the population?
There are good and bad ways to collect data. Try to avoid Bias
Bias: When a data set does not represent the population, we say the data set is biased. It only
represents part of the population and often leaves out specific groups of people.
Ways of Collecting Data – The good and the bad
1. Convenience Sample – Collect data in any way that is convenient or easy. Bad way of
collecting data and is full of Bias. Does not represent the population.
Ex) Stand outside of Ralphs and ask people what they think about taxes.
Ex) Collect data from my friends and family
2. Voluntary Response sample – When people in the population choose to be in your data set.
Bad way of collecting data and is full of Bias. Does not represent the population. Tend to get
only people that care a lot about the topic, or people with nothing better to do.
Ex) Mail survey – mail survey to every address in Los Angeles. My data is made up of
whoever fills out the survey.
Ex) On-line survey
3. Random Sample – Everyone in the population has a chance to be included in your data set.
Good way of collecting data. Does represent the population pretty well. Difficult to set up
and expensive. Eliminates bias. Get data from all groups in the population.
Ex) All students at COC have a student ID#. Have a computer randomly pick 100 student
ID#s. Then get info from those 100 students.
Ex) Take everyone’s name and put it in a box. Shake up the box and draw out 20 names.
4. Cluster Sample – Divide the population up into small groups. Then choose groups
(randomly) and get data from everyone in those groups. (Mini Census) Cluster can be good or
bad depending on if groups are chosen randomly.
Ex) Population is all elementary school students in L.A. County. Randomly choose 10
elementary schools in L.A. Then get information from every student at those 10 schools. (If
we choose the 10 schools that are most convenient for me, then it is a bad way of collecting
data.)
5. Stratified sample – Break the population into a few large groups. Then we (randomly) pick
data sets from each group. Comparison studies. Can be good or bad depending on if
individuals are chosen randomly.
Ex) Separate the population of California into men and women. Randomly choose 1000 men
and randomly choose 1050 women.
6. Systematic sample – Use a system to choose people in your data set. Can be good or bad
depending on if you use randomization.
Ex) Store. I can pick every 5th person that enters the store and ask them their opinion. (Convenient
also)
Ex) Alphabetical list of all employees in company. Pick every 20th person on list (#20, #40,
#60,…) Notice this is not random (bad). Statisticians often randomly choose a person from #1-20.
Then pick every 20th person after the initial person is chosen.
Best way of collecting data? Census
2nd best? Random sample
Bad ways of collecting data? Convenience, voluntary response, not random
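The random, systematic (with a random start), and stratified methods above can be sketched in a few lines of Python. This is only an illustrative sketch: the ID numbers 1 to 2000, the two group names, and the function names are made up for the example and are not part of the notes.

```python
import random

def random_sample(population, n, rng):
    """Random sample: the computer picks n individuals from the whole list."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Systematic sample: random start among the first `step` individuals,
    then every `step`-th individual after that."""
    start = rng.randrange(step)              # random index 0 .. step-1
    return population[start::step]

def stratified_sample(strata, n_per_group, rng):
    """Stratified sample: randomly pick individuals from each large group."""
    return {name: rng.sample(members, n_per_group[name])
            for name, members in strata.items()}

rng = random.Random(140)                     # fixed seed so the sketch is repeatable
student_ids = list(range(1, 2001))           # hypothetical ID numbers 1..2000

picked_random = random_sample(student_ids, 100, rng)
picked_systematic = systematic_sample(student_ids, 20, rng)

# Hypothetical strata: the population split into two large groups.
strata = {"group_a": student_ids[:1000], "group_b": student_ids[1000:]}
picked_stratified = stratified_sample(strata, {"group_a": 50, "group_b": 50}, rng)
```

Without the random start, the systematic sample would always be #20, #40, #60, and so on, which is the non-random version the notes warn about.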
Simple Random – Picking individual people or objects randomly. There is no restriction on the
grouping. Every group of size n has an equal chance of being chosen. Every type of group is
possible. If you restrict groups (cluster or stratified) that is still random but is not simple
random.
Regular random sample (computer picking #s) is both random (individual) and simple random
(group)
Here is a confusing question:
What is the difference between random and simple random?
The definition of random is that everyone in the population has a chance of being chosen. This is true
for both simple random and random.
Example:
Population in question: All students at UCLA
A simple random sample chooses individual people or objects randomly. Since every student has a
student ID number at UCLA, I will have a computer generate 200 random ID numbers and then track
down and talk to each of the 200 students. Notice there is no designed grouping, it is individual. In a
simple random sample any grouping is possible.
An example of something that is random but not simple random is when you define groups (Cluster and
Stratified). Let’s suppose instead of looking at individual people, we look at classes at UCLA. Have a
computer randomly pick 15 class section numbers. Then I go to those classes and get info from
everyone in those classes. (This would be a random cluster). It is random, everyone at UCLA has a
chance but I randomly picked groups not individuals. Notice that in this random cluster, groups are
designed in only one way (same class). It is random but not simple random.
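The difference between simple random and random cluster can be seen in a small simulation. The sketch below assumes a made-up campus of 40 class sections with 25 students each, with every student in exactly one section. These numbers and names are hypothetical and chosen only so that both samples end up with 200 students.

```python
import random

rng = random.Random(2024)                    # fixed seed so the sketch is repeatable

# Hypothetical campus: 40 class sections, 25 students each (ID numbers 0..999).
sections = {sec: list(range(sec * 25, (sec + 1) * 25)) for sec in range(40)}
all_students = [sid for members in sections.values() for sid in members]

# Simple random sample: the computer picks 200 individual ID numbers.
simple_random = rng.sample(all_students, 200)

# Random cluster sample: the computer picks 8 sections, then we take everyone in them.
chosen_sections = rng.sample(sorted(sections), 8)
cluster = [sid for sec in chosen_sections for sid in sections[sec]]

def sections_touched(sample):
    """Number of different class sections represented in a sample."""
    return len({sid // 25 for sid in sample})

print("sections in simple random sample:", sections_touched(simple_random))
print("sections in cluster sample:      ", sections_touched(cluster))
```

Both samples give every student a chance of being chosen, but the cluster sample can only ever contain whole sections, so most groupings of 200 students are impossible. In the simple random sample any grouping of 200 can occur.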
Math 140 Sampling & Experiments Activity 1
Sampling Techniques: Good and Bad Ways to Collect Data
Directions: Identify the following sampling type being used as (systematic, convenience, voluntary
response, cluster, stratified, random, simple random). Explain why you chose your answer and if the
sampling method will represent the population or not?
1. The COC Admissions department wants to see how many students would be in favor of using a
new program to register for classes. They put a link on their website so that any students that
want to try out the program can. The students can then take a survey and say how well they like
the new system.
2. Rick works for a sports equipment manufacturing company. He separates all the employees into
men and women, and then chooses 28 women and 30 men to ask if they want changes to their
medical insurance coverage.
3. Michelle, a teacher at Valencia High, wants to see how many students at Valencia High school
will be attending COC. She gives the students in her U.S. History class a questionnaire to fill out
that asks where they will be attending college.
4. Jamie is working at the Republican recruiting committee in Newhall. She is curious how many
people that live in Newhall will vote for the Republican candidate in the next election. She
obtained an alphabetical list of all the residents in Newhall and numbered them. She then used
the computer to generate random numbers to decide which people to question about their
voting preferences.
5. Rachael works at the Democrat recruiting center in Northridge. To determine what percent of
people will vote for the Democratic candidate, she obtains a list of all residents of Northridge
and decides to ask every 50th person on the list.
6. Mike is trying to take an opinion poll about how people in Los Angeles would feel about raising
taxes in order to have a professional football team. He randomly selects 45 streets in Los
Angeles and asks every person living on those streets.
Math 140 Sampling & Experiments Activity 2
Deciding How to Collect Data
1. Martin works for a small company that makes pretzels. He recently came up with an idea to create a
garlic mustard pretzel. The company is hesitant to offer this type of pretzel because they are not sure if
their customers will buy it or not. Martin has had Statistics and offers to take a sample of customers and
see if he can find approximately what percent of their customers would buy a garlic mustard pretzel. His
boss tells him to go ahead, but that they do not have the funds for an expensive statistics study. Martin
and his sales team brainstorm to come up with some possible ways to conduct the sample. For each
sampling method, write a description of the method and an example of how Martin could use that method.
Include how costly you think the method will be and whether or not it will represent the population of
customers.
a. Systematic Sampling
b. Voluntary Response Sampling
c. Cluster Sampling
d. Simple Random Sampling
e. Convenience Sampling
f. Stratified Sampling
g. Census
2. Julie works for a clothing store that specializes in men’s and women’s jeans. The store usually offers
high end brand name jeans, but was recently approached by a clothing manufacturing company to sell a
cheaper brand of jeans in their stores. Julie’s boss is not sure whether they should or not. Julie has had
statistics and offers to take a sample of their customers to see if they would be interested in a cheaper
brand of jeans. Julie’s boss gave Julie a budget of $200 to take the sample. Julie is trying to decide
which sampling method would be best. For each sampling method, write a description of the method and
an example of how Julie could use that method. Will Julie be able to stay under the budget of $200 with
this method? Will the sampling method represent the population of customers or not?
a. Systematic Sampling
b. Voluntary Response Sampling
c. Cluster Sampling
d. Simple Random Sampling
e. Convenience Sampling
f. Stratified Sampling
g. Census
Math 140 Notes: Types of Bias
Bias – When a data set does not reflect the population
Question Bias – People phrase their question in order to make people answer a certain way.
Ex) Should a president have the power of line item veto? (48% yes, 52% no)
Ex) Should a president have the power of line item veto in order to eliminate waste and help
the economy? (86% yes, 14% no)
Response Bias – People will not answer truthfully or accurately. (Controversial topics)
Ex) Take a random sample of women and ask them if they are alcoholics.
Sampling Bias – Did not incorporate randomization into sampling process. Used bad method
like convenience or voluntary response. Also taking a sample that is too small can create
bias.
Deliberate Bias – Intentionally leaving a specific group out of the data set.
Ex) Taking a random sample of people living in Los Angeles, but intentionally not asking
any people that are homeless. (Really bad)
Non-response bias – People refuse to be part of your study. Refuse to fill out data.
Math 140 Sampling & Experiments Activity 3
Spotting Bias
1. Define each of the following types of bias and give an example of each.
A. Question Bias
B. Response Bias
C. Sampling Bias
D. Deliberate Bias against a specific group
E. Non-response Bias
Directions for #2-7: For each of the following scenarios, describe the population being considered and
the type of bias that has taken place (Question, Response, Sampling, Deliberate or Non-response). There
may be more than one type of bias involved. Explain your answers and if there is bias, what groups of
people were not represented.
2. We are interested in calculating the percent of children in LA County that have had their vaccines. To
figure this out, a person put a survey up on the Yahoo webpage asking the following question: “Is your
child up to date with vaccines?” The computer will keep track of the number of people that answer yes
or no.
3. We are interested in calculating the percent of children in LA County that have had their vaccines. To
figure this out, we randomly selected 350 people, and asked them the following question: “In order to
save children from devastating diseases, should all children be vaccinated?”
4. We are interested in finding out how many people in the U.S. have had whooping cough this year. To
figure this out, we called every major hospital in the United States and asked how many cases of
whooping cough they had this year. Then we added these numbers up.
5. We are interested in finding out what percent of Americans use Cocaine. We randomly chose 400
Americans and asked them if they use Cocaine or not.
6. What is the average age of college students in Canada? Since my cousin lives in Canada, I asked him
to drive to two colleges near his house and ask people he bumps into what their age is.
7. Julie is interested in calculating the yearly income of adults in Palmdale. She drives around Palmdale
and stops at certain streets and then asks people that live on that street what their yearly income is.
She skips streets that look “sketchy” as she is worried about her safety.
Math 140 Notes : Letters used in Stats (Statistics versus Parameters) – PDF online
Math 140 Sampling & Experiments Activity 4
Sample Statistic or Population Parameter?
Directions: Determine if the numbers in the following clips from magazines and newspapers are
describing a population parameter or a sample statistic. In each case give the symbol we would
use for the parameter or statistic ( n, p, p̂, μ, x̄, σ, s ).
1. “Our study found that of the 2400 people tested, only 3% showed side effects to the
medication.”
2. “It has been speculated for years that the average height of men is 69.2 inches, but our study
may indicate that this may be wrong.”
3. “For normal body temperature, the standard deviation is about 1.8 degrees Fahrenheit.”
4. “We tested 3000 incoming college freshmen and found that their average IQ was 101.9 with a
standard deviation of 14.8.”
5. “Normal human body temperature has long been thought to be 98.6 degrees Fahrenheit, but
our sample of 150 randomly selected adults found that the average was 98.08”.
6. “Students take about 12 units on average per semester, but when we took a random sample of
1600 college students we found that the average was 12.3 units.”
7. “A public opinion poll showed that 47.2% of voters would vote for the candidate, but when
the votes were counted we found that only 41.3% voted for the candidate.”
Math 140 Sampling & Experiments Activity 5
Random Sample Values versus Population Values
(Gettysburg Address Activity)
We saw that there are different types of sampling methods that can be used to answer a question about a
population. We also saw that random samples were the best at reflecting the population. But how well
does a random sample reflect the population? How difficult is it to pick things or people at random? This
is the question we are going to strive to answer in this activity.
Directions Part 1: Open the Gettysburg Address (by President Abraham Lincoln) on the sampling
experiments page. Pick 10 words “randomly”. Count how many letters there are in each word. Find the
mean average number of letters in your 10 words. (Add up the numbers and divide by 10.) Put a magnet
on the “student” number line on the board at the place where your sample mean fell. (Or your teacher may
just have you write the number on the board.)
Directions Part 2: (We may do this as a class with the instructor’s computer.) Now we are going to have a
computer pick words from the Gettysburg Address at random. Go to
www.rossmanchance.com/applets . Click on the “Sampling Words” link on the top left of the page.
All the words to the Gettysburg Address have already been entered. Leave the number of samples as 1
and the sample size as 10. Click the button that says “draw samples”. Click this a few times. Put a
magnet on the other number line on the board labeled “computer” at the place where the computer’s
random sample mean fell. (Or your teacher may just have you write the numbers on the board.) Be careful
not to put your magnet on the “student” number line or write your numbers where the student picked
samples are written.
Problems
1. What is the population mean average of all the words of the Gettysburg Address? (Don’t
calculate this yourself. The Rossman/Chance App has already done it.)
2. How far away from the population mean average was your sample of 10 that you picked
yourself?
3. Look at the sample means on the board that were student picked. Look at the random sample
means picked by the computer. In general, were the random samples from the computer closer to
or farther away from the population mean than the student picked “random” samples?
4. When a sampling method does not reflect the population very well, we say that a Bias has
occurred. Which do you think has more bias, the random sample the computer picked or the
student picked samples?
5. What is the definition of random? Do you think the words you picked were truly random?
Were the words you picked longer or shorter in general than the population mean average?
What about other students around you? Were the words they picked longer or shorter than the
population mean average?
6. Discuss how difficult it is to choose a sample truly randomly without the help of a computer. Are
there ways to choose randomly without a computer? Describe a couple ways this might be done?
7. A student wanted to collect some data “randomly” from the people in Santa Clarita. She decided
to walk around the Valencia mall and ask people she “randomly” bumped into. Was this truly a
random sample of people in Santa Clarita? Why or why not?
8. Take the mean average of five of the computer sample means for the Gettysburg Address. Is this
average of averages closer to the population mean? (What you have seen is closely related to the
Central Limit Theorem, one of the most important theorems in all of Statistics.)
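The comparison in this activity can also be simulated. The sketch below is only an illustration under stated assumptions: it uses just the opening sentence of the Gettysburg Address as the "population" (the activity uses the full text), and it models a person "randomly" pointing at words by making longer words more likely to be picked. That eye-catching model is an assumption for the example, not something measured in class.

```python
import random
import re
from statistics import mean

# Opening sentence only; the class activity uses the whole address.
OPENING = ("Four score and seven years ago our fathers brought forth on this "
           "continent, a new nation, conceived in Liberty, and dedicated to the "
           "proposition that all men are created equal.")

words = re.findall(r"[A-Za-z]+", OPENING)     # keep letters only, drop punctuation
lengths = [len(w) for w in words]
population_mean = mean(lengths)                # mean word length of this "population"

rng = random.Random(1863)                      # fixed seed so the sketch is repeatable

def computer_sample_mean(k=10):
    """Mean length of k words picked at random, every word equally likely."""
    return mean(rng.sample(lengths, k))

def eyeballed_sample_mean(k=10):
    """Assumed model of a person picking words by eye: the chance of picking
    a word is proportional to its length (picked with replacement)."""
    return mean(rng.choices(lengths, weights=lengths, k=k))

computer_means = [computer_sample_mean() for _ in range(2000)]
eyeballed_means = [eyeballed_sample_mean() for _ in range(2000)]

print(f"population mean:              {population_mean:.2f}")
print(f"average of computer samples:  {mean(computer_means):.2f}")
print(f"average of eyeballed samples: {mean(eyeballed_means):.2f}")
```

Individual computer samples still miss the population mean, but their misses fall on both sides, so the average of many sample means lands close to it. Under the assumed eyeballing model the sample means are pulled toward longer words, which is what bias looks like.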
Math 140 Notes on Experimental Design
In stats, we often want to find and explore relationships (association, correlation).
“Is there a correlation between blood pressure and heart rate?”
“Is living in tropical climates related to having nut allergies?”
We can determine if there is a relationship or correlation by looking at data (observational
study).
However, sometimes it is necessary to prove cause and effect. To prove cause and effect you
need an experiment!
Example 1
Explanatory (Treatment) Variable: Smoking cigarettes or not
Response Variable (what we will measure): Did the person get lung cancer later in life?
We need to prove that smoking cigarettes causes lung cancer.
There is plenty of data that shows a relationship (correlation) between smoking cigarettes and
getting lung cancer. However that does not prove cause.
Correlation ≠ Causation!!!!!
Correlation (relationships) does not imply that one
causes the other.
What is the problem? Why doesn’t it show causation? Cannot show cause because of
confounding variables (lurking variables)
We need to prove that it was the cigarettes that caused the lung cancer and not something
else.
Confounding variables for lung cancer?
Genetics, chemicals, job, asbestos exposure, age, gender, smoking other things, poor air quality
Experimental Design is not Frankenstein!!! We do not experiment on people. We collect the
data in a special way to control confounding variables.
Experimental Design is controlling the confounding variables so that we can prove cause and
effect.
Experimental Design
Randomly assign people into two groups. (Random Assignment)
Two groups will be as alike as possible. (Similar ages, similar genders, similar stress levels,
similar racial and ethnic groups, similar places that they live, similar number of people that
smoke other things, similar jobs, similar air quality, similar asbestos exposure) Can also use
Direct Control (blocking) to make the groups more alike if needed. Random Assignment does
most of the work though.
Group 1: Treatment group (smoked cigarettes)
Group 2: Control group (did not smoke cigarettes)
Remember these two groups of people are very alike. So if group 1 has a significantly higher
rate of lung cancer, then we have controlled all the confounding variables and proved that
smoking cigarettes causes lung cancer.
Example 2
Prove that taking a new blood pressure medicine does decrease a person’s blood pressure.
(Prove cause and effect)
Confounding Variables? Stress, Diet, Genetics, Age, Gender, Racial / ethnic groups, Human
Brain (placebo effect)
Experiment:
Randomly assign people to two groups. Also use direct control (blocking) to make the two
groups alike.
Group 1: Treatment (get the medicine)
Group 2: Control group (do not get the medicine). They get a placebo (fake medicine). (Double Blind is
best)
Single Blind: means that the people in the groups do not know if they are getting medicine or
placebo
Double Blind: means that the people in the groups and the people giving the medicine do not
know if it is a placebo or not. (Obviously someone knows, just not the person giving the
medicine or treatment.)
If Group 1 has significantly lower blood pressure, then we have succeeded in proving that the
medicine lowers blood pressure.
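The random assignment step used in both examples can be sketched as follows. This is a minimal illustration, assuming 40 hypothetical volunteers with made-up labels. Blocking and blinding are not shown.

```python
import random

def randomly_assign(participants, rng):
    """Shuffle the participants and split them into two equal-sized groups."""
    shuffled = list(participants)        # copy so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(7)                   # fixed seed so the sketch is repeatable
participants = [f"person_{i:02d}" for i in range(1, 41)]   # hypothetical volunteers

treatment_group, control_group = randomly_assign(participants, rng)
print("treatment:", treatment_group[:5], "...")
print("control:  ", control_group[:5], "...")
```

Because chance alone decides who lands in each group, traits such as age, stress, or diet tend to be spread similarly across both groups, which is why random assignment does most of the work of controlling confounding variables.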
Overall Take-Away:
Don’t do an experiment unless you have to prove cause and effect. Experiments are expensive
and time consuming.
Observational Study: Look at data and see if there is a relationship (correlation) between two
things. Remember observational studies do not control confounding variables.
Experiment: Randomly assign two groups, double blind placebo, and control all confounding
variables so that you can prove cause and effect.
Now do Sampling Experiment Activity 6
Ruler and Reflexes Experiment!!
Ruler Experiment Data (Previous Class)
With Phone
Mean average catch : 10.3 inches
Number of Drops: 37
Without Phone
Mean average catch: 8.2 inches
Number of Drops: 9
Math 140 Sampling & Experiments Activity 6
The Ruler Experiment
1. Divide class into groups of three. Number the class from 1 to max number of students. Then
generate some random numbers with Statcrunch. (Go to the Applet button and click on
“random numbers”. Let the minimum value be 1 and the max value be the total number of
students. Let the sample size also equal the total number of students.) Use the column of
random numbers to put students into groups of three. (Every 3 numbers determine a group.)
2. Each group will need a ruler and their cell phones. It is best to stand up during this activity.
3. Procedure: Student A holds the bottom of the ruler up inside of student B’s non-dominant hand.
Student A should hold the ruler from below student B’s hand. The top of student B’s hand should be at
about 4 inches on the ruler. Student A releases the ruler and student B catches it. Student C
records the number of inches at the top of student B’s hand where the ruler was caught. Student C will
take the catch length and subtract off the 4 inches and then record the difference. If student B
misses the ruler altogether, then student C will just put “drop”. Make sure to label whether it
was with the cell phone or without. Then repeat the process, but this time student B is to text a
message to themselves while trying to catch the ruler. Continue until all students have done
the experiment three times without the cell phone and three times with the cell phone.
Alternate the person releasing the ruler and the time before release.
4. Put the without cell phone/with cell phone data up on the board without names. The
instructor or a student will collate the following results for the whole class: the mean average
catch length with the cell, the mean average catch length without the cell, the total number of
drops with the cell, the total number of drops without the cell.
5. When you are done collecting data, answer the following questions:
a) What is the explanatory (treatment) variable? What was the response variable?
b) Why did we bother to have the person catch the ruler without the phone?
Wouldn’t it have been quicker to just record the catching with the cell phone?
c) Why did we use Statcrunch to randomly assign groups? Why not just take
people that sit next to each other?
d) What are some of the confounding variables in this experiment? What are some
steps that we took to control these variables?
e) Was this experiment blind or double blind or neither? How do you know?
f) How does texting affect reflexes? How do you think this experiment might apply
to driving while texting?
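Step 1 of this activity uses Statcrunch to form the groups. The same idea can be sketched in Python, assuming a hypothetical class of 30 students numbered 1 to 30. This is not the Statcrunch applet itself, only an equivalent illustration.

```python
import random

def groups_of_three(class_size, rng):
    """Put the student numbers 1..class_size in random order, then let
    every 3 numbers in a row form a group."""
    numbers = list(range(1, class_size + 1))
    rng.shuffle(numbers)
    return [numbers[i:i + 3] for i in range(0, class_size, 3)]

rng = random.Random(6)                   # fixed seed so the sketch is repeatable
groups = groups_of_three(30, rng)        # hypothetical class of 30 students
for number, group in enumerate(groups, start=1):
    print(f"group {number}: {group}")
```

If the class size is not a multiple of three, the last group in this sketch ends up with one or two students and would need to be merged with another group by hand.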
Math 140 Sampling & Experiments Activity 7
Observational Studies versus Experiments
Directions: Analyze each of the following research questions. Tell whether the question is best
answered through an observational study or an experiment. Explain the reason for your
choice. Now come up with a method for answering the research question. Make sure to include
random sampling in your method and what population you will be addressing. If you chose to do
an experiment, also describe some of the lurking variables in the situation and some ways that
we can control these variables. Discuss the placebo and blinding techniques and why they are
important in this experiment.
1. Tuberculosis (TB) is a disease that affects millions of people worldwide. TB is a contagious
bacterial infection that affects the lungs. Doctors have long speculated that Tuberculosis spreads
the fastest in low income, crowded cities. Is there a relationship between low income, crowded
cities and the number of cases of TB?
2. Dramamine is a common medication used in preventing and treating nausea, vomiting and
dizziness caused by motion sickness. This medication has become a staple for thousands of
people who travel by boat, car or plane. But is Dramamine really effective in preventing and
treating the symptoms of motion sickness?
3. Unemployment has become a very important topic in the United States and worldwide. In an
effort to create more jobs, many countries raise taxes on people’s income. Many argue that
raising taxes will decrease people’s income and possibly force businesses to close down. It is
your job to shed light on this issue. Is there a relationship between the tax rate percentage of a
country and the unemployment rate?
4. College and High School students in the United States have long claimed that listening to
music helps them study and retain information at a higher rate. But is this really true? Does
listening to music really help a person better retain information?
Sampling , Experiments & EDA Review Sheet (With Answers)
Topics to Study for Exam
Major Terms : Population, Sample, Census, Random, Bias, Parameter, Statistic
Various Types of Bias
Various ways of collecting data
Experimental Design
Quantitative vs Categorical
Exploratory Data Analysis
Letters used in statistics
1. Determine whether each of the following statements is describing a parameter (population value) or
a statistic (sample value) and then give the letter that we use to represent it from the following list:
x̄, μ, p̂, p, s, σ
a) The standard deviation of the heights of American men is 3.6 inches.
b) 46% of the sample showed signs of increased rust.
c) The average yearly salary of adults in Los Angeles is $41,000.
d) Of the 200 dogs in the data set, 87% of them were licensed.
e) The standard deviation for the sample data was 5.2 years.
f) The average weight of the group in the data set was 155 pounds.
2. Jim wants to know how much money the average working COC student makes. Describe how Jim
could use the following techniques to collect data and describe how well the sample data will
approximate the population value.
a) Systematic
b) Voluntary Response
c) Random Sample
d) Convenience Sample
e) Cluster Sample
f) Stratified Sample
g) Simple Random Sample
h) Census
3. Define the following key terms and give an example of each.
a) population
b) census
c) sample
d) random
e) bias
f) parameter
g) statistic
4. Describe and give an example of each of the following types of bias.
a) Sampling Bias
b) Question Bias
c) Response Bias
d) Deliberate Bias
e) Non-Response Bias
5. What is the difference between a random sample and a simple random sample? Give an example of
a sample that is random, but not simple random. Explain Why.
6. Rachael needs to do an experiment that will show that the nicotine patch causes a person to stop
smoking. Set up the experiment for Rachael. Write a description of the experiment and include the
following. What are some lurking variables that she will need to control? How can Rachael control the
lurking variables? Include a description of how we will deal with the placebo effect?
7. Compare and contrast the similarities and differences between an experiment and an observational
study. How can we tell if we should use an experiment or an observational study?
8. Tell if the following data is categorical or quantitative. If the data set is quantitative and we created a
histogram for the data, what do you think the shape would look like? Why can’t we find the shape for
categorical data?
a) The types of cars in the different COC parking Lots.
b) The average number of hours spent practicing ping pong.
c) Areas in North Dakota that have wild mustangs.
d) Each person is asked if they wear glasses, contacts, neither, or both.
e) The average speed of the race cars at the Indianapolis 500.
f) The test scores on a really easy test.
9. Look at the following summary statistics: max, sample size, min, mean, stand dev, median, Q1, Q3,
IQR, Range, Variance, mode.
a) Which of the statistics are measures of center (average)?
b) Which of the statistics are measures of spread (variability)?
c) Which of the statistics are measures of position?
d) Are there any statistics in the list that are not a center, not a spread, nor a position?
e) What measure of center (average) should we use when the data is bell shaped?
f) What measure of spread (variability) should we use when the data is bell shaped?
g) How do we find two numbers that typical values are in between when the data is bell shaped?
h) What measure of center (average) should we use when the data is skewed or uniform?
i) What measure of spread (variability) should we use when the data is skewed or uniform?
j) How do we find two numbers that typical values are in between when the data is skewed or uniform?
10. The following data set describes the lengths in feet of pieces of lumber at a lumber yard. Type the
data into a column of statcrunch and make a histogram, dotplot and boxplot, and find the summary
statistics (max, min, mean, stand dev, median, Q1, Q3, IQR, Range, Variance, mode), best measure of
center, average, best measure of spread, range for typical values, outliers.
17.4, 10.7, 14.4, 19.7, 13.5,
21.6, 17.8, 18.2, 17.3, 17.2,
13.2, 16.3, 15.7, 19.1, 12.7,
18.6, 18.2, 13.6, 16.7, 13.1,
11.8, 21.3, 14.8, 16.4, 7.6
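For anyone without Statcrunch handy, the summary statistics for the lumber data in question 10 can also be computed with Python's standard `statistics` module. This is a sketch, not the course's required method. Software packages use slightly different quartile rules, so the Q1, Q3, and IQR from Statcrunch may differ a little from the "inclusive" rule assumed here. The graphs and the choice of best center and spread are left to the reader.

```python
import statistics as st

# Lengths in feet of the pieces of lumber from question 10.
lumber_lengths = [17.4, 10.7, 14.4, 19.7, 13.5, 21.6, 17.8, 18.2, 17.3, 17.2,
                  13.2, 16.3, 15.7, 19.1, 12.7, 18.6, 18.2, 13.6, 16.7, 13.1,
                  11.8, 21.3, 14.8, 16.4, 7.6]

n = len(lumber_lengths)
minimum, maximum = min(lumber_lengths), max(lumber_lengths)
data_range = maximum - minimum
mean_length = st.mean(lumber_lengths)
median_length = st.median(lumber_lengths)
stdev = st.stdev(lumber_lengths)          # sample standard deviation (divides by n - 1)
variance = st.variance(lumber_lengths)    # sample variance
mode_length = st.mode(lumber_lengths)

# Quartiles with the "inclusive" rule (an assumption; other rules exist).
q1, _, q3 = st.quantiles(lumber_lengths, n=4, method="inclusive")
iqr = q3 - q1

for label, value in [("n", n), ("min", minimum), ("max", maximum),
                     ("range", data_range), ("mean", mean_length),
                     ("median", median_length), ("stand dev", stdev),
                     ("variance", variance), ("mode", mode_length),
                     ("Q1", q1), ("Q3", q3), ("IQR", iqr)]:
    print(f"{label:>9}: {value:.3f}")
```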
11. The following graph was made from the final exam scores of students in a history class.
a) What is the shape?
b) Are the mean and standard deviation accurate representations of center and spread?
c) The mean average was 77. Write a couple sentences explaining this statistic in context and
what it tells us.
d) The standard deviation was 5.3 . Write a couple sentences explaining this statistic in context
and what it tells us.
e) The mean average was 77 and the standard deviation was 5.3 . Use the mean and standard
deviation to find two numbers that typical values are in between. Use the mean and standard deviation
to find the cut off for “unusually low” test scores. Use the mean and standard deviation to find the cut
off for “unusually high” test scores
[Figure: Histogram of C1. Horizontal axis: C1 (final exam scores), marked from 60 to 95. Vertical axis: Frequency, marked from 0 to 35.]
Sampling / Experiments / EDA Review Sheet Answers
1.
a) Parameter, σ = 3.6
b) Statistic, p̂ = 46%
c) Parameter, μ = $41000
d) Statistic, p̂ = 87%, n = 200
e) Statistic, s = 5.2
f) Statistic, x̄ = 155
2.
a) Systematic : Look at a list of all COC students and pick every 20th person on the list. Would
not represent the population because it is not random. If he chooses the first person randomly, then it
would represent the population.
b) Voluntary Response: Create a survey on Facebook and ask COC students to respond. Will not
represent the population.
c) Random Sample: He puts the names of all COC students in a hat and shakes it up and draws
out 50 names. This will represent the population since everyone had a chance of being chosen.
d) Convenience Sample : He picks everyone that he goes to class with. Will not represent the
population.
e) Cluster Sample : He randomly picks 8 classes and gets information from every individual in
those classes. Since it's random, it would represent the population.
f) Stratified Sample : He separates the COC students into 1st year, 2nd year, 3rd year and then
picks 50 people from each group. Would not represent the population unless he picks the 50 people
randomly.
g) Simple Random Sample : He puts the names of all COC students in a hat and shakes it up
and draws out 50 names. This will represent the population since everyone had a chance of being
chosen.
h) Census: Attempting to get data from all COC students. The COC computer has a list of all
students. He contacts and gets information from all of them.
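The random sampling methods in answer 2 can be sketched in Python. This is only an illustration under stated assumptions: the roster of 1,000 students, their class years, the 25-student classes, and all function names are made up, not taken from COC data.

```python
import random

# Hypothetical roster: 1,000 student IDs, each tagged with a made-up class year.
rng = random.Random(140)  # fixed seed so the sketch is repeatable
students = [{"id": i, "year": rng.choice([1, 2, 3])} for i in range(1000)]


def simple_random_sample(population, n, rng):
    """Names in a hat: every group of n students is equally likely."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Every k-th person on the list, starting from a random position."""
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(population, n_per_group, rng):
    """Split by class year, then draw randomly within each year."""
    sample = []
    for year in sorted({s["year"] for s in population}):
        group = [s for s in population if s["year"] == year]
        sample.extend(rng.sample(group, n_per_group))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole classes and keep everyone in them."""
    chosen = rng.sample(clusters, n_clusters)
    return [student for cluster in chosen for student in cluster]


# Hypothetical classes: 40 blocks of 25 consecutive students.
classes = [students[i:i + 25] for i in range(0, len(students), 25)]

srs = simple_random_sample(students, 50, rng)
systematic = systematic_sample(students, 20, rng)
stratified = stratified_sample(students, 50, rng)
cluster = cluster_sample(classes, 8, rng)
```

Each function leaves the choice of individuals to the random number generator, which is what separates these methods from the convenience and voluntary response samples.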
3.
a) population: The collection of all people or objects to be studied. For example, all domestic
animals in Lancaster CA.
b) census: Attempting to get information (data) from everyone in a population. May or may
not succeed. This is the best data and represents the population very well. For example, measuring the
IQ of every employee in a software company.
c) sample: Getting information (data) from a subgroup of the population. Usually less than
10% of the population. For example, measuring the breed, age and weight of 48 dogs in Oklahoma.
d) random: When everyone in a population has an equal chance of being included in the
sample. For example: Each employee of a software company has an employee ID number. Have a
computer randomly choose ID numbers. Whichever employee’s number comes up, that employee will
take an IQ test.
e) bias: When sample data does not represent the population. Usually specific groups have
been left out and are not being represented. For example: Wanting to get information about all
domestic animals in Lancaster CA and instead only getting information about dogs and cats. (Left out
other domestic animals.)
f) parameter: A number that represents a population. For example a population proportion
(percentage) p = 0.75
g) statistic: A number that represents a sample. For example a sample standard deviation
s = 10.6 pounds.
4.
a) Sampling Bias: When the sample was too small, collected incorrectly or without
randomization. For example, the sample data was collected by putting a survey up on Facebook.
b) Question Bias: When someone phrases a question in order to force people to answer the
way they want. For example: In order to save children from devastating diseases, should all children
have vaccinations?
c) Response Bias: When people do not feel comfortable answering truthfully. For example,
we ask people if they tend to hoard possessions in their house.
d) Deliberate Bias: When the people collecting the data deliberately leave out certain groups
from the population. For example, a person wants to get data on who Americans will vote for in the
next election, but does not ask any Filipino Americans.
e) Non-Response Bias: When people are asked to give data, but refuse to be part of the
study. For example, a random phone number generator gave a phone number, but when the person
called and tried to get data, the person said they did not want to participate.
5. In a random sample, every individual in the population has an equal chance of being chosen to be in
the sample. In a simple random sample, every group of size n has an equal chance of being chosen. A
random cluster sample is a random sample but not a simple random sample, because not every group that
could possibly be made has a chance. Only those groups designated in the clusters have a chance.
6. Rachael must randomly select two groups. One group will wear a nicotine patch and the other will
wear a placebo patch. The individuals and the people giving the patch must not know whether it contains
nicotine or not. This will control the placebo effect. Lurking variables include other sources of nicotine,
how long someone has smoked, and the number of cigarettes smoked per day. She will want her nicotine
and placebo groups to be as similar as possible by picking them randomly and blocking. If the treatment
group has a much higher percentage of individuals that were able to quit, then she has evidence that the
patch does cause people to stop smoking.
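The random assignment step in answer 6 can be sketched as follows. The volunteer labels, the group size of 100, and the function name are hypothetical. The sketch covers only the random split into two groups, not blocking or blinding.

```python
import random


def randomly_assign(volunteers, rng):
    """Shuffle a copy of the volunteer list and split it into two halves."""
    shuffled = volunteers[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(6)  # fixed seed so the sketch is repeatable
volunteers = [f"smoker_{i:03d}" for i in range(100)]  # hypothetical participants

# First half wears the nicotine patch, second half wears the placebo patch.
treatment, placebo = randomly_assign(volunteers, rng)
```

Because chance alone decides who lands in each group, lurking variables such as years of smoking tend to be spread evenly between the nicotine and placebo groups.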
7. They are similar in that we are exploring relationships between variables. The experiment has the
added condition of showing a cause and effect relationship. If we need to show blame or cause, then we
need to control lurking variables and therefore we need an experiment. If we are just showing a
relationship, then we do not need to control lurking variables and a correlational observational study
would be fine.
8.
a) The types of cars in the different COC parking lots. Categorical
b) The average number of hours spent practicing ping pong. Quantitative, skewed right
c) Areas in North Dakota that have wild mustangs. Categorical
d) Each person is asked if they wear glasses, contacts, neither, or both. Categorical
e) The average speed of the race cars at the Indianapolis 500. Quantitative, bell shaped
f) The test scores on a really easy test. Quantitative, skewed left
9.
a) Mean, Median, Mode
b) Standard Deviation, Variance, Range, IQR
c) Min, Max, Q1, Q3
d) Yes. Sample size (frequency)
e) Mean
f) Standard Deviation
g) Mean – Standard Deviation < Typical Values < Mean + Standard Deviation
h) Median
i) IQR
j) Q1 < Typical Values < Q3
10.
Variable  Mean    Standard Deviation
C1        15.876  3.338

Variable  Q1     Median  Q3    IQR   Mode  N for mode
C1        13.35  16.4    18.2  4.85  18.2  2

Variable  Min  Max   Range
C1        7.6  21.6  14.000
This data set describes the lengths in feet of pieces of lumber at a lumber yard. The data set is slightly
skewed left, so the median of 16.4 feet is the best measure of center. So the average length of the
boards at the lumber yard is 16.4 feet. The IQR of 4.85 feet is the best measure of spread. So typical
board lengths fall within a span of 4.85 feet. Hence typical boards were between 13.35 feet (Q1) and
18.2 feet (Q3) in length. There were no outliers.
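The summary statistics for the lumber data can be checked outside StatCrunch with a short Python sketch. Two choices here are assumptions rather than anything stated in the packet: the "exclusive" quartile method, picked because it reproduces the Q1 and Q3 values reported above, and the common 1.5 x IQR rule for flagging outliers.

```python
import statistics

# Lumber lengths in feet, copied from problem 10.
lengths = [
    17.4, 10.7, 14.4, 19.7, 13.5, 21.6, 17.8, 18.2, 17.3, 17.2,
    13.2, 16.3, 15.7, 19.1, 12.7, 18.6, 18.2, 13.6, 16.7, 13.1,
    11.8, 21.3, 14.8, 16.4, 7.6,
]

mean = statistics.mean(lengths)
stdev = statistics.stdev(lengths)  # sample standard deviation (divides by n - 1)

# Assumption: the "exclusive" method, which matches the reported quartiles.
q1, median, q3 = statistics.quantiles(lengths, n=4, method="exclusive")
iqr = q3 - q1

# Assumption: the 1.5 x IQR rule for outlier fences.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in lengths if x < low_fence or x > high_fence]

modes = statistics.multimode(lengths)
data_range = max(lengths) - min(lengths)

print(round(mean, 3), round(stdev, 3))
print(round(q1, 2), round(median, 1), round(q3, 1), round(iqr, 2))
print(modes, outliers)
```

Under these assumptions the lower fence sits below the minimum of 7.6 feet and the upper fence above the maximum of 21.6 feet, which agrees with the conclusion that there were no outliers.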
11. The following graph was made from the final exam scores of students in a history class.
a) What is the shape?
Bell Shaped
b) Are the mean and standard deviation accurate representations of center and spread?
Since the data is bell shaped (normal), the mean is an accurate measure of center and the
standard deviation is an accurate measure of spread.
c) The mean average was 77. Write a couple sentences explaining this statistic in context and
what it tells us.
The mean is a type of average. It is also the balancing point for the data. The sum of the
distances of numbers below the mean will equal the sum of the distances of numbers above the mean.
It is only accurate when the data is bell shaped. In this case the average test score was approximately
77.
d) The standard deviation was 5.3 . Write a couple sentences explaining this statistic in context
and what it tells us.
The standard deviation is a measure of typical spread from the mean. Data with more spread
tend to give less consistent values and may be difficult to predict, while data with less spread tend to
give more consistent values and may be easier to predict. In this case typical scores on the history exam
were within 5.3 points from the mean average of 77.
e) The mean average was 77 and the standard deviation was 5.3 . Use the mean and standard
deviation to find two numbers that typical values are in between. Use the mean and standard deviation
to find the cut off for “unusually low” test scores. Use the mean and standard deviation to find the cut
off for “unusually high” test scores
We find typical values by subtracting the standard deviation from the mean and by adding the
standard deviation to the mean. So typical scores on the history final were between 77 - 5.3 and
77 + 5.3, that is, between 71.7 and 82.3.
The cut off for a test score being unusually low is the mean minus two standard deviations or
77 – 2(5.3). So any score lower than 66.4 was unusually low when compared to the rest of the class.
The cut off for a test score being unusually high is the mean plus two standard deviations or
77 + 2(5.3). So any score higher than 87.6 was unusually high when compared to the rest of the class.
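The arithmetic in part e) can be checked with a short Python sketch. The function names are made up for illustration. The rules themselves (one standard deviation for typical values, two standard deviations for unusual values) are the ones used in the answer above.

```python
def typical_range(mean, sd):
    """Typical values lie within one standard deviation of the mean."""
    return mean - sd, mean + sd


def unusual_cutoffs(mean, sd):
    """Unusual values lie more than two standard deviations from the mean."""
    return mean - 2 * sd, mean + 2 * sd


typical_low, typical_high = typical_range(77, 5.3)
unusually_low, unusually_high = unusual_cutoffs(77, 5.3)

# Round to one decimal place to match the precision of the exam statistics.
print(round(typical_low, 1), round(typical_high, 1))      # 71.7 82.3
print(round(unusually_low, 1), round(unusually_high, 1))  # 66.4 87.6
```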