Practicing the Concepts #1 – Basic Concepts and Terminology
WSU STUDENT SURVEY - In order to generate data for use in one of my introductory
statistics courses a few years back, I had the class develop a short survey and administer
this survey to ten WSU students of their choosing. In the end, survey responses were
recorded for a total of 348 WSU students (n=348).
1. What is the population of interest?
WSU students
2. What is the sample?
348 students selected using a sample of convenience. The sample is not likely to be
representative as a result.
3. What are some potential problems with this survey methodology?
Non-random sampling can introduce bias, either intentional or unintentional.
4. The following items comprised the survey. Classify each item (variable) as being
either numeric (quantitative) or categorical (qualitative).
Survey Item/Variable | Variable Type
Gender | Cat
Age | Num
Did student have a declared major? | Cat
Major, if declared | Cat
College major program is in, e.g. College of Liberal Arts (CLA) | Cat
Class (Fr, So, Jr, or Sr) | Cat
Hours spent studying per day | Num
GPA | Num
Is student involved in extra-curricular activity, e.g. intramural sports or biology club? | Cat
Is student living on- or off-campus? | Cat
Hours of sleep per night | Num
Number of credits student is currently taking | Num
Does student have a “significant other”? | Cat
Does student skip at least one class per week? | Cat
If they do skip, what is the most common reason for skipping? | Cat
Does student drink alcohol? | Cat
If they do drink, what would be a typical number of “drinks” they would have per night that they drink? | Num
Does student smoke cigarettes? | Cat
If student is a smoker, how many cigarettes do they smoke per day? | Num
Should President Clinton be impeached for his sexual relations with Monica Lewinsky? | Cat
Is student of legal drinking age (21 yrs. old)? | Cat
How much did student spend on textbooks this semester? | Num
Does student think the WSU Laptop Program is a good idea? | Cat
Chapter 2 – Descriptive Statistics
WSU STUDENT SURVEY (cont’d) – We now consider methods for describing the
WSU student survey data described above.
Describing Categorical/Qualitative Data
To do this in JMP select Analyze > Distribution and put
College in the Y, Columns box.
The following options have also been selected from the
College pull-down menu.
• Display Options > Horizontal Layout
• Mosaic Plot
• Histogram Options > Std Err Bars
• Histogram Options > Prob Axis
• Histogram Options > Show Percents
The mosaic plot is essentially a rectangular pie chart. The
Prob Axis adds a vertical axis with relative frequencies.
Finally, the standard error bars are \( \hat{p} \pm SE(\hat{p}) \). The standard
error, \( SE(\hat{p}) \), is actually the estimated standard error because the
true proportion of students in each college (\( p \)) is not known, i.e.
\( SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \).
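As a rough illustration of the standard error formula, the sketch below plugs in the College of Liberal Arts proportion of .33 from the frequency table that follows. The sample size n = 300 is an assumption (348 respondents minus the 48 with no declared major), and the function name is made up for this example.

```python
import math

def se_proportion(p_hat: float, n: int) -> float:
    """Estimated standard error of a sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# p_hat = .33 is the CLA proportion quoted in the text.
# n = 300 is an assumption: 348 respondents minus the 48 with no declared major.
p_hat, n = 0.33, 300
se = se_proportion(p_hat, n)

print(f"SE = {se:.4f}")
print(f"error bar runs from {p_hat - se:.3f} to {p_hat + se:.3f}")
```

Under these assumptions the error bar extends a little under three percentage points on either side of the estimate.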
Frequency Distribution Table from JMP
The first numeric column contains the Count or Frequency, which is
the number of students in our sample that had declared majors in the
college. Notice that 48 students could not be classified according to
college because they had not yet declared a major (N Missing 48).
The second numeric column contains the estimated probability (Prob)
that a randomly selected WSU student would have a major in that
college. For example, from this we would estimate that the probability
that a randomly selected WSU student would have a declared major in
the College of Liberal Arts is .33, or a 33% chance. More practically, we
could say that, based on our study, we estimate that 33% of WSU
students are enrolled in the College of Liberal Arts. Alternative labels
for this column could be Relative Frequency or Proportion (p).
Exploring the Relationship Between Two Categorical Variables
Suppose that we wish to examine the relationship between a student’s opinion of the
WSU Laptop Program and their gender.
Comparative Bar Graphs
[Bar graphs: Opinions of Females; Opinions of Males]
1. Why can’t we directly compare the 142 females who do not think the laptop program
is a good idea to the 78 males who feel the same way?
Because the numbers of males and females sampled are different. We would expect the
number of females who think the laptop program is a bad idea to be larger, perhaps
simply because 56 more females were sampled.
Contingency Table and Mosaic Plot
Opinion of Laptop Program
Gender | No | Undecided | Yes | Row Totals
Female | 142 | 7 | 53 | 202
Male | 78 | 1 | 67 | 146
Column Totals | 220 | 8 | 120 | 348
2. What percentage of females surveyed have a favorable opinion of the laptop program?
\( \hat{p} = 53/202 = .2624 \) or 26.24%
3. What percentage of those students who had a favorable opinion of the laptop program
were female?
\( \hat{p} = 53/120 = .4417 \) or 44.17%
4. What percentage of males surveyed had an unfavorable opinion of the laptop program?
\( \hat{p} = 78/146 = .5342 \) or 53.42%
5. What is the estimated probability that a randomly selected student has a favorable
opinion of the laptop program?
\( 120/348 = .3448 \) or a 34.48% chance
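The four answers above differ only in which total goes in the denominator. A minimal sketch, using the counts from the contingency table (the variable names are chosen for this example):

```python
# Counts from the contingency table (rows: gender, columns: opinion).
counts = {
    "Female": {"No": 142, "Undecided": 7, "Yes": 53},
    "Male": {"No": 78, "Undecided": 1, "Yes": 67},
}

row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {
    op: sum(counts[g][op] for g in counts) for op in ("No", "Undecided", "Yes")
}
n = sum(row_totals.values())

# Question 2: favorable among females (divide by the female row total).
p_yes_given_female = counts["Female"]["Yes"] / row_totals["Female"]
# Question 3: female among favorable (divide by the Yes column total).
p_female_given_yes = counts["Female"]["Yes"] / col_totals["Yes"]
# Question 4: unfavorable among males (divide by the male row total).
p_no_given_male = counts["Male"]["No"] / row_totals["Male"]
# Question 5: favorable among all respondents (divide by n).
p_yes = col_totals["Yes"] / n

for label, value in [
    ("P(Yes | Female)", p_yes_given_female),
    ("P(Female | Yes)", p_female_given_yes),
    ("P(No | Male)", p_no_given_male),
    ("P(Yes)", p_yes),
]:
    print(f"{label}: {value:.4f}")
```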
Mosaic Plot of Laptop Program Opinion and Gender
How to read a mosaic plot
[Mosaic plot annotations: Yes = 26.24% for females and 45.89% for males; No ≈ 70% for females and ≈ 50% for males]
The thin strip off to the side shows the breakdown of laptop program opinion for
all 348 respondents. For example, the percent having a favorable opinion,
34.48%, is the blue shading.
The width of the vertical strips in the main plot is controlled by the
number/percent of respondents in each gender. Here more females were
sampled, so their strip in the plot is wider.
Contingency Table with Row %’s (in parentheses)
Count (Row %) | No | Undecided | Yes | Row Totals
F | 142 (70.30) | 7 (3.47) | 53 (26.24) | 202
M | 78 (53.42) | 1 (0.68) | 67 (45.89) | 146
Column Totals | 220 | 8 | 120 | n = 348
The shading within each strip shows the laptop opinions conditional on gender.
The plot is a graphical representation of the row percentages in the contingency
table. We can clearly see that in our sample male respondents had a more
favorable opinion of the laptop program. Nearly 70% of females had a negative
opinion vs. approximately 50% for males.
Questions to Think About:
• Do these results suggest that the proportion of WSU males who favor the Laptop
Program is greater than the proportion of WSU females who do?
We will look at ways to determine whether this difference is “real” or statistically
significant in Chapters 6, 14, and 16.
• Could a difference this large be explained as chance variation? How could we
determine this?
We need to somehow determine how likely we are to get sample
proportions/percentages this different if in reality the two populations, female and
male WSU students in this case, have the same distribution of opinion.
• If we did conclude the proportion of males with a favorable opinion exceeds that for
females, can we quantify how much larger we think it is?
We will look at how this is done in Chapters 13 and 14.
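One informal way to explore the chance-variation question is by simulation: shuffle the gender labels many times and see how often a difference as large as the observed one turns up. This is only a sketch under stated assumptions, not necessarily the method developed in the later chapters. It collapses opinion to favorable vs. not favorable, and the function name, number of repetitions and seed are arbitrary choices for this example.

```python
import random

def permutation_p_value(yes_f, n_f, yes_m, n_m, reps=2000, seed=1):
    """Share of random relabelings whose difference in proportions is at
    least as large (in absolute value) as the observed difference."""
    rng = random.Random(seed)
    observed = yes_m / n_m - yes_f / n_f
    # 1 = favorable, 0 = not favorable (No and Undecided combined).
    pooled = [1] * (yes_f + yes_m) + [0] * (n_f + n_m - yes_f - yes_m)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        # First n_f shuffled responses play the role of "females".
        diff = sum(pooled[n_f:]) / n_m - sum(pooled[:n_f]) / n_f
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / reps

# Counts from the contingency table: 53 of 202 females and 67 of 146 males
# had a favorable opinion.
observed, p_value = permutation_p_value(yes_f=53, n_f=202, yes_m=67, n_m=146)
print(f"observed difference (male - female): {observed:.4f}")
print(f"share of shuffles with a difference this large: {p_value:.4f}")
```

A very small share suggests that a gap this size would rarely arise if males and females really had the same distribution of opinion.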
GRAPHICAL METHODS FOR DESCRIBING NUMERIC DATA
(HISTOGRAMS and STEM-AND-LEAF PLOTS)
Histogram of Book Costs Per Semester
1. What would be the typical amount a WSU student would spend on books?
Somewhere between $250 and $275
2. Most students would have textbook costs in what range?
If we interpret “most” to mean more than half, $200 - $350 would include over
60% of the students surveyed.
3. How much variation in book costs do we see?
This distribution is unimodal and fairly bell-shaped or normal, with a large
percentage of the values within about $75 of what we might consider to
be typical. However, there are a few extreme values, with the largest values being at
least 6 times larger than the smallest values.
4. Estimate the probability that a randomly selected student would spend between $300
and $400 on books.
By adding up the heights of the appropriate bars we find this probability to be
approximately .25 + .07 = .32 or a 32% chance.
5. Estimate the percentage of students who spend more than $500 on books.
Approximately .02 + 0 + .01 = .03 or 3%.
Histogram with Density Scaling
[Histogram with density scale; horizontal axis: book costs in $100s]
6. Estimate the probability that a randomly selected student has book costs between $200
and $300.
Each rectangle or bar in the histogram has width = .5 (i.e. $50).
The area of the rectangles or bars corresponding to the desired interval is
given by: \( (.5 \times .44) + (.5 \times .46) = .22 + .23 = .45 \) or a 45% chance.
Note: Using the histogram with data in the original scale on previous page we
arrive at the same result by adding the heights of the appropriate bars,
.22 + .23 = .45.
7. Estimate the probability that a randomly selected student has book costs below $200.
The area of the rectangles or bars corresponding to this range of book costs is
given by: \( (.5 \times .02) + (.5 \times .08) + (.5 \times .18) = .01 + .04 + .09 = .14 \) or a 14% chance.
The key idea here is that:
PROBABILITY = AREA !!!
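The area calculations in questions 6 and 7 can be sketched as a sum of width times height over the relevant bars. The density heights come from the worked answers; the bin edges are assumptions, since the text gives only the bar width (.5) and which bars fall in each range.

```python
# Density heights read off the histogram in questions 6 and 7.
# The bin edges (in $100s) are assumptions: the text gives only the bar
# width (.5) and which bars fall below $200 or between $200 and $300.
BIN_WIDTH = 0.5
density = {
    (0.5, 1.0): 0.02,
    (1.0, 1.5): 0.08,
    (1.5, 2.0): 0.18,
    (2.0, 2.5): 0.44,
    (2.5, 3.0): 0.46,
}

def probability(low, high):
    """Sum of bar areas (width x height) for bins lying inside [low, high]."""
    return sum(
        BIN_WIDTH * height
        for (left, right), height in density.items()
        if left >= low and right <= high
    )

print(round(probability(2.0, 3.0), 2))  # question 6: between $200 and $300
print(round(probability(0.0, 2.0), 2))  # question 7: below $200
```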
If we could do it, better estimates of these probabilities or percentages might come from
considering the area beneath the smoothed histogram or density curve. This would
require two things: 1) knowing the exact mathematical formula for the curve, and
2) the ability to find the area beneath that curve, i.e. integral calculus!
We will actually be doing this in Chapter 7 when we discuss continuous random
variables.
Stem-and-Leaf Plot of Book Costs
8. What advantage if any does the stem-and-leaf display of these data provide when
compared to the histogram?
The only real advantage is that you can see the raw data as well as how it is
distributed. In my opinion this does not offset the disadvantages.
Histogram of Hours Spent Studying Per Day
9. Discuss what is learned about studying time of WSU students from the histogram.
There are a number of comments that could be made:
• The most common response was 3 hours.
• About half the students reported studying between 2 and 3 hours per day.
10. What interesting feature(s) does this particular histogram have?
• Students reported their times to the nearest half hour, with most reporting
their study times to the nearest hour.
Using Histograms to Compare Two Groups on a Single Numeric Variable
[Histograms: GPAs of Female WSU Students; GPAs of Male WSU Students]
11. Use these histograms to compare and contrast the
GPAs of female and male WSU students.
It appears that female students reported having
higher GPAs than male students. The variation in
GPAs is similar for both genders, as is the
distributional shape.
Hours Studying Per Day (Females vs. Males)
[Histograms of hours studying per day: Females; Males]
12. What are the differences between males and
females in terms of the hours they spend studying
per day?
Female respondents reported studying
more per day than male respondents. The most
common response, or mode, is 3 hours for females
and 2 hours for males.
Numerical Measures of “Average”
• Mean – \( \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \)
• Median – middle value when data is ranked from smallest to largest.
• Mode – most frequently occurring value.
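The three measures can be sketched with Python's standard `statistics` module. The data below are hypothetical study times made up for illustration, not the survey responses.

```python
import statistics

# Hypothetical study times (hours per day) for ten students, made up for
# illustration; these are not the survey responses.
hours = [1, 2, 2, 2.5, 3, 3, 3, 3.5, 4, 5]

mean = statistics.mean(hours)      # sum of the values divided by n
median = statistics.median(hours)  # middle of the sorted values
mode = statistics.mode(hours)      # most frequently occurring value

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

With an even number of observations, the median is taken as the average of the two middle values.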
Hours Spent Studying Per Day
1. What are the mean, median, and mode for the time spent studying per day?
Sample mean (\( \bar{x} \)) = 2.65 hrs.
Sample median (m) = 3.0 hrs.
Sample mode = 3.0 hrs.
[Histograms: Hours Spent Studying (WSU Females); Hours Spent Studying (WSU Males)]
2. Use the measures of central tendency to compare and contrast
the hours spent studying for male and female students.
\( \bar{x}_{females} = 2.91 \) and \( \bar{x}_{males} = 2.29 \), so \( \bar{x}_{females} > \bar{x}_{males} \);
similarly for the sample medians, \( m_{females} > m_{males} \),
and the sample modes.
Numerical Measures of Variability
• Range = max - min
• Variance and Standard Deviation
\( s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \), which is an estimate of the population standard deviation
\( \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \). The sample variance and population variance are the
squares of these quantities.
• Interquartile Range (IQR) – range of the middle 50% of the data:
\( IQR = Q_3 - Q_1 \)
• Coefficient of Variation (CV)
\( CV = \frac{s}{\bar{x}} \times 100\% \); this measures the amount of variation relative to the size of the
mean.
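A minimal sketch of these measures using Python's standard `statistics` module. The GPAs below are hypothetical, made up for illustration. Note that software packages use slightly different conventions for quartiles, so the IQR computed here need not match what JMP would report for the same data.

```python
import statistics

# Hypothetical GPAs, made up for illustration; not the survey data.
gpas = [2.1, 2.6, 2.8, 3.0, 3.1, 3.3, 3.4, 3.6, 3.8, 4.0]

data_range = max(gpas) - min(gpas)
s = statistics.stdev(gpas)             # sample SD, divides by n - 1
variance = statistics.variance(gpas)   # sample variance, equal to s squared
q1, q2, q3 = statistics.quantiles(gpas, n=4)  # quartile cut points
iqr = q3 - q1
cv = s / statistics.mean(gpas) * 100   # SD as a percentage of the mean

print(f"range = {data_range:.2f}")
print(f"s = {s:.3f}, s^2 = {variance:.3f}")
print(f"IQR = {iqr:.3f}")
print(f"CV = {cv:.2f}%")
```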
GPA of WSU Students
1. What is the range of the GPA’s?
Range = 4 – 1.90 = 2.10
2. What are the sample variance and standard deviation?
\( s = .473 \) and \( s^2 = .224 \)
3. What is the inter-quartile range (IQR)?
IQR = 3.465 – 2.720 = .745
4. What is the coefficient of variation (CV)?
CV = (.473/3.065)*100% = 15.43%
5. Approximately 68% of the students will have GPAs in what range?
Assuming the distribution of GPAs is approximately normal, we estimate that
68% of students will have GPAs in the range 3.065 – .473 to 3.065 + .473, which is 2.592 to 3.538.
Approximately 95% of the students will have GPAs in what range?
3.065 – 2*.473 to 3.065 + 2*.473, which is 2.119 to 4.011. We should truncate
the 4.011 to 4.00 as GPAs cannot exceed 4.00.
Approximately 99.73% of the students will have GPAs in what range?
Find the interval given by \( \bar{x} \pm 3s \).
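The three intervals follow the same pattern, mean plus or minus k standard deviations, so they can be sketched in a few lines. The mean and standard deviation are the values quoted above; the function name and the cap at 4.00 (following the truncation remark in the text) are choices made for this example.

```python
# Summary statistics quoted in the text for GPA.
x_bar, s = 3.065, 0.473
GPA_MAX = 4.00  # GPAs cannot exceed 4.00

def empirical_interval(k):
    """Interval x_bar +/- k*s, with the upper end capped at GPA_MAX."""
    low = round(x_bar - k * s, 3)
    high = min(round(x_bar + k * s, 3), GPA_MAX)
    return low, high

for k, share in [(1, "68%"), (2, "95%"), (3, "99.73%")]:
    low, high = empirical_interval(k)
    print(f"about {share} of GPAs: {low} to {high}")
```

These percentages rely on the distribution being approximately normal, as the answer to question 5 assumes.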
Cost of Textbooks for WSU Students
6. Approximately what percent of WSU students spend between $178.09 and $357.19?
This interval represents the sample mean +/- one standard deviation, so assuming the
population is approximately normally distributed we would say 68%.
Approximately what percent of WSU students spend between $88.52 and $446.76?
This interval represents the sample mean +/- two standard deviations, so assuming the
population is approximately normally distributed we would say 95%.
7. Which has more variation, GPAs of WSU students or their textbook costs? Explain.
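One possible way to approach question 7, since GPAs and dollars are on different scales, is to compare coefficients of variation. In the sketch below the textbook mean and standard deviation are not quoted in the handout; they are backed out of the interval in question 6 on the assumption that it is exactly the mean plus or minus one standard deviation.

```python
# GPA summary statistics quoted in the text.
gpa_mean, gpa_sd = 3.065, 0.473

# Textbook mean and SD are not quoted directly; they are backed out of the
# interval $178.09 to $357.19, which question 6 describes as mean +/- one SD.
low, high = 178.09, 357.19
book_mean = (low + high) / 2
book_sd = (high - low) / 2

def cv(sd, mean):
    """Coefficient of variation: SD as a percentage of the mean."""
    return sd / mean * 100

gpa_cv = cv(gpa_sd, gpa_mean)
book_cv = cv(book_sd, book_mean)

print(f"GPA CV       = {gpa_cv:.2f}%")
print(f"Textbook CV  = {book_cv:.2f}%")
```

Under these assumptions, textbook costs vary roughly twice as much relative to their mean as GPAs do.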
Numerical Measures of Relative Standing
• Percentiles/Quantiles and Quartiles
• z-scores
1. 25 percent of WSU students have GPAs below what value? 2.720
2. 75 percent of WSU students have GPAs below what value? 3.465
3. What percent of WSU students have GPAs below 3.706? 90%
4. What is the z-score associated with a GPA = 3.75? 1.45
5. What is the z-score associated with a GPA = 2.25? -1.72
6. What is the z-score associated with a GPA = 3.90? 1.76
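A z-score measures how many standard deviations a value lies from the mean, z = (x - mean) / s. The sketch below uses the rounded GPA mean (3.065) and standard deviation (.473) quoted earlier; the function name is chosen for this example. Because those inputs are rounded, the last digit can differ slightly from software output computed on the raw data (for GPA = 3.90 the rounded inputs give about 1.77 rather than 1.76).

```python
# Rounded GPA summary statistics quoted earlier in the text.
x_bar, s = 3.065, 0.473

def z_score(x):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - x_bar) / s

for gpa in (3.75, 2.25, 3.90):
    print(f"GPA {gpa:.2f}: z = {z_score(gpa):.2f}")
```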
Histogram of z-scores for GPAs
Mean = 0, Standard Deviation = 1
Comparative Boxplots
7. How do the GPAs of female students
compare to GPAs of males?
The mean and median for WSU female students are
approximately .15 grade points higher than those for
the male students. The variation in GPAs for both
groups is similar. This is evidenced by the fact
that both the sample standard deviations
(\( s_{female} = .46 \) and \( s_{male} = .48 \)) and the IQRs
(\( IQR_{female} = 1.0 \) and \( IQR_{male} = .90 \)) are nearly equal.
We will examine later how we can use these results
to decide whether or not the true population means
or medians significantly differ.
Note: The standard error of the mean for females is
smaller because more females were included in the
sample (\( SE_{female} = .46/\sqrt{192} \approx .033 \)).
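The standard error of the mean in the note is the sample standard deviation divided by the square root of the sample size. A short sketch with the female values from the text (the function name is made up for this example, and the n = 96 comparison is purely hypothetical):

```python
import math

def se_mean(s, n):
    """Estimated standard error of a sample mean."""
    return s / math.sqrt(n)

# Values quoted in the note: s = .46 for the 192 females with a reported GPA.
se_female = se_mean(0.46, 192)
print(f"SE (n = 192): {se_female:.3f}")

# Hypothetical comparison: the same spread with half as many respondents.
print(f"SE (n = 96):  {se_mean(0.46, 96):.3f}")
```

Holding the standard deviation fixed, a larger sample gives a smaller standard error, which is the point the note makes.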
8. How do the GPAs of students who skip at
least one class per week compare to those who
do not?
The mean and median for students who do not regularly
skip classes are larger than those for students who
regularly skip classes:
(\( \bar{x}_{no} = 3.14 \) vs. \( \bar{x}_{yes} = 2.87 \) and \( m_{no} = 3.2 \) vs. \( m_{yes} = 2.8 \))
The variation in GPAs for both groups is similar. This
is evidenced by the fact that both the sample standard
deviations (\( s_{no} = .45 \) and \( s_{yes} = .47 \)) and the IQRs
(\( IQR_{no} = 1.0 \) and \( IQR_{yes} = .90 \)) are nearly equal.
We will examine later how we can use these results to
decide whether or not the true population means or
medians significantly differ.
Note: The standard error of the mean for “non-skippers”
is smaller because there are over twice as many of them in our
survey (\( SE_{no} = .45/\sqrt{236} \approx .030 \)).
CDF Plots (ogives)
We will see in Chapter 4 that the cumulative distribution function (CDF) is defined as
\( F(x) = P(X \le x) \), which reads as “the probability the random variable X takes on a
value less than or equal to x”.
For example, let X = the GPA of a randomly selected female WSU student; we might
be interested in the probability that her GPA is at or below 3.00, which would be written
as \( P(X \le 3.00) \).
We can estimate this using data for a particular value x as follows:
\( \hat{F}(x) = \hat{P}(X \le x) = \frac{\#\text{ of } x_i\text{'s} \le x}{n} \)
The CDF Plot is a plot of the estimated cumulative distribution function vs. x. The
estimated probability only changes at the observed \( x_i \) values. This gives the CDF a step
function appearance.
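The estimate is just a count divided by n, which can be sketched as follows. The GPAs are hypothetical values made up for illustration, and `ecdf` is a name chosen for this example.

```python
def ecdf(data, x):
    """Estimated P(X <= x): the share of observations at or below x."""
    return sum(1 for xi in data if xi <= x) / len(data)

# Hypothetical GPAs, made up for illustration; not the survey data.
gpas = [2.4, 2.8, 2.8, 3.0, 3.1, 3.3, 3.5, 3.6, 3.9, 4.0]

print(ecdf(gpas, 3.00))  # 4 of the 10 values are at or below 3.00
print(ecdf(gpas, 3.05))  # unchanged: no observation lies between 3.00 and 3.05
print(ecdf(gpas, 3.10))  # steps up at the observed value 3.1
```

The unchanged value between 3.00 and 3.05 is the flat part of a step; the jump at 3.1 is where the next observation occurs.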
CDF Plot ~ GPA of “Skippers” vs. “Non-Skippers”
Skippers (Yes) – said “Yes” they skip at least one class per week.
Non-skippers (No) – said “No” they do not skip at least one class per week.
10. Estimate the probability that a skipper has a GPA below 2.80.
Approximately .50 or 50% chance
11. Estimate the probability that a non-skipper has a GPA below 3.00.
Approximately .35 or 35% chance
12. Which group, skippers or non-skippers, has a greater chance of having a GPA below
3.0?
Skippers
Comparing age at which respondent had their first child across education level
(comparative boxplots with histograms added)
How does the age distribution differ
across education level?
The typical age at which a person had their first
child appears to differ greatly across the different
education levels. The more educated a person is
the later in life they have their first child. We
estimate that those who dropped out of high
school had their first child at around 20 years of
age, those with a high school diploma only
around 22 years of age, those with a college
degree around 26 years of age and those with
professional degrees around 28 years of age.
The histograms suggest that the distributional
shapes also differ. Those with the least amount
of education have right-skewed age distributions,
while those with more education have left-skewed
distributions. The variability in each
distribution is roughly the same, with the
exception of those with professional degrees.