Methods for Describing Data
I. Describing Qualitative Data
(i) Frequency Table or Frequency Distribution is the organization or summarization of raw data in a table with two
columns
• classes : each class is one of the categories into which unit characteristics or data values are classified, and
• frequencies : each class frequency is the number of units or data values falling in the class.
Important Note: The terms “Class” and “Frequency” are generic only. Use the name of the variable as the heading of the
class column and the number of units as the heading of the frequency column. Also, you must include a title describing
the frequency table. In short, your frequency table must be self-explanatory.
Example.
Frequency distribution of race of 5000 students

Race       Number of Students   Percentage of Students
Black      1000                  20.00
White      2000                  40.00
Hispanic   1500                  30.00
Other       500                  10.00
Total      5000                 100.00

What percentage of students are not White? What percentage of students are Black or Hispanic?
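The two questions above only require adding or subtracting class percentages. A minimal Python sketch, using the counts from the race table (the variable names are illustrative, not part of the notes):

```python
# Class frequencies taken from the race table above.
counts = {"Black": 1000, "White": 2000, "Hispanic": 1500, "Other": 500}

total = sum(counts.values())

# Relative frequency of each class, as a percentage of all units.
percent = {race: 100 * n / total for race, n in counts.items()}

# Each question combines class percentages.
not_white = 100 - percent["White"]
black_or_hispanic = percent["Black"] + percent["Hispanic"]

for race, n in counts.items():
    print(f"{race:<10}{n:>6}{percent[race]:>8.2f}")
print(f"{'Total':<10}{total:>6}{sum(percent.values()):>8.2f}")
print(f"Not White: {not_white:.2f}%")
print(f"Black or Hispanic: {black_or_hispanic:.2f}%")
```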
II. Describing Quantitative Data
(i) Ungrouped Frequency Distribution for Discrete Data is the organization or summarization of raw data in the
form of a table with two columns
• classes : the distinct values (in order of magnitude) of the discrete variable under consideration, and
• frequencies : each class frequency is the number of times that value of the discrete variable occurs in the data set.
A third column that shows the percentage for each class frequency is also recommended.
Frequency distribution of the # of courses taken by 30 students

Number of Courses   Number of Students
3                    4
4                   18
5                    6
6                    2

1. How many students enrolled in at least four courses? (ans. 26)
2. What percentage of students enrolled in four courses? (ans. 60%)
3. What percentage of students enrolled in at most five courses? (ans. 93.33%)
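The three answers can be checked directly from the frequency table. The sketch below is one way to do it in Python; the dictionary layout is an assumption made for illustration.

```python
# Number of courses -> number of students, from the table above.
freq = {3: 4, 4: 18, 5: 6, 6: 2}

n = sum(freq.values())  # total number of students

# 1. Students enrolled in at least four courses.
at_least_four = sum(f for x, f in freq.items() if x >= 4)

# 2. Percentage of students enrolled in exactly four courses.
pct_exactly_four = 100 * freq[4] / n

# 3. Percentage of students enrolled in at most five courses.
pct_at_most_five = 100 * sum(f for x, f in freq.items() if x <= 5) / n

print(at_least_four, f"{pct_exactly_four:.2f}%", f"{pct_at_most_five:.2f}%")
```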
(ii) Grouped Frequency Distribution for Continuous Data is the organization or summarization of raw data in the
form of a table with two columns
• class intervals (preferably with boundaries): obtained by dividing the range of the data into several (about 5 to 15)
intervals, preferably of equal width, in such a way that no data value belongs to two different intervals;
• frequencies : the number of data values that fall in a class interval is the corresponding class frequency.
Example. Frequency distribution of EPA mileage ratings on 100 cars
(Data is in Table 2.2, page 38; H = 44.9, L = 30.0)

Mileage Ratings   Number of Cars
29.99 − 32.99      6
32.99 − 35.99     16
35.99 − 38.99     58
38.99 − 41.99     18
41.99 − 44.99      2

1. How many cars are rated under 33 miles per gallon?
2. What percentage of cars have a mileage rating of 36 or more?
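As a rough sketch of how raw values are sorted into class intervals, the Python function below counts values between consecutive boundaries. The boundaries are the ones in the table; the short `sample` list is hypothetical (it is not the textbook's Table 2.2), and the function name is an assumption. The last lines answer the two questions from the table's own frequencies.

```python
from bisect import bisect_right

# Class boundaries from the table. They carry one more decimal place than
# the ratings, so no rating can coincide with a boundary.
BOUNDARIES = [29.99, 32.99, 35.99, 38.99, 41.99, 44.99]

def class_frequencies(values, boundaries):
    """Count how many values fall between each pair of consecutive boundaries."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        idx = bisect_right(boundaries, v) - 1  # index of the interval holding v
        if idx < 0 or idx >= len(counts):
            raise ValueError(f"{v} lies outside the class intervals")
        counts[idx] += 1
    return counts

# Hypothetical ratings, used only to show the binning step.
sample = [30.0, 32.9, 33.1, 36.0, 37.0, 38.8, 39.0, 44.9]
sample_counts = class_frequencies(sample, BOUNDARIES)

# Questions 1 and 2, answered from the frequencies printed in the table.
table_counts = [6, 16, 58, 18, 2]
under_33 = table_counts[0]
pct_36_or_more = 100 * sum(table_counts[2:]) / sum(table_counts)

print(sample_counts, under_33, pct_36_or_more)
```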
(iii) Frequency Histogram for a grouped frequency distribution is a graph that displays class frequencies along the Y-axis
and class boundaries along the X-axis. Appropriate adjustments along the Y-axis are necessary to display a grouped
frequency distribution with unequal class widths. For an ungrouped frequency distribution, first create class boundaries (as
explained in class) and then construct the histogram.
Chapter 2: Practice Exercise 1
On a given day, a researcher gathered some information from each of a group of students sitting at the University Center.
Data for 30 students are shown below.

Class Rank (R)   No. of Courses (X)   GPA (Y)     Class Rank   No. of Courses   GPA
Jr               4                    3.76        So           5                3.36
Sr               5                    4.00        Fr           5                2.43
Jr               4                    3.66        Jr           4                1.52
Fr               4                    1.98        Fr           3                3.29
Jr               4                    3.54        Sr           4                2.69
Jr               4                    3.33        So           5                3.84
Sr               4                    2.97        So           6                3.74
Sr               4                    3.25        Jr           4                2.62
Fr               4                    3.36        Sr           3                3.41
Sr               6                    2.40        Fr           4                2.28
So               3                    1.51        So           5                3.46
So               4                    1.11        Jr           4                1.33
So               4                    2.85        Jr           3                3.22
Sr               4                    3.88        Sr           4                2.17
Jr               4                    3.17        Fr           5                3.47

Construct a frequency table for each variable. Display each table using a graph.

Distribution of Class Rank

Class Rank   # of Students
FR           6
SO           7
JR           9
SR           8

Distribution of Number of Courses

# of courses   # of Students
3               4
4              18
5               6
6               2
(mean = 4.20, std dev = 0.7611)

Frequency distribution of GPA

GPA             # of Students
1.105 − 1.685    4
1.685 − 2.265    1
2.265 − 2.845    6
2.845 − 3.425   10
3.425 − 4.005    9
(From the above grouped distribution, mean = 2.9223, std dev = 0.7689; using all thirty individual GPA, mean = 2.92, std
dev = 0.8158)
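The grouped mean and standard deviation quoted above can be reproduced by treating every value in a class as if it sat at the class midpoint. The notes do not spell this method out, so the sketch below is an assumption about how the figures were obtained; it does reproduce them.

```python
from math import sqrt

# (lower boundary, upper boundary, frequency) from the GPA table above.
classes = [
    (1.105, 1.685, 4),
    (1.685, 2.265, 1),
    (2.265, 2.845, 6),
    (2.845, 3.425, 10),
    (3.425, 4.005, 9),
]

n = sum(f for _, _, f in classes)

# Each class is represented by its midpoint.
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sample variance with divisor n - 1, weighting squared deviations by frequency.
grouped_var = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
grouped_sd = sqrt(grouped_var)

print(f"mean = {grouped_mean:.4f}, std dev = {grouped_sd:.4f}")
```

The grouped results differ slightly from those based on the thirty individual values because the midpoints only approximate the data inside each class.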
Cross-table - Summarizing Bivariate Attribute Data
Data sets with two attribute variables are often summarized in cross-tables with one row for each category of the first
attribute variable and one column for each category of the second. The number or the proportion of subjects that belong to
a given row-column combination is written in the corresponding cell of the table.
Example : Consider the data set collected from 18 students.

Class Rank   Employment Status     Class Rank   Employment Status
Jr           Full-time             Fr           Full-time
Sr           Part-time             Sr           Part-time
Jr           Full-time             So           Full-time
Fr           Unemployed            So           Part-time
Jr           Part-time             So           Unemployed
Fr           Part-time             Fr           Part-time
Jr           Full-time             Sr           Part-time
Sr           Full-time             Jr           Unemployed
Sr           Unemployed            So           Part-time

Distribution of students according to class rank and employment status

             Fr   So   Jr   Sr
Full-time    1    1    3    1
Part-time    2    2    1    3
Unemployed   1    1    1    1

One may be interested in answering the following types of questions.
1. What percentage of the surveyed students are Sr? (27.78%)
2. What percentage of the surveyed students are unemployed? (22.22%)
3. What percentage of the Sr students are unemployed? (20%)
4. What percentage of the unemployed students are Sr? (25%)
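A cross-table is a count of (row category, column category) pairs. The following Python sketch tallies the 18 students listed above and recomputes the four percentages; the variable names are illustrative. Note how questions 3 and 4 use the same cell count but different totals as the base.

```python
from collections import Counter

# The 18 students from the example, in the order listed.
ranks = ["Jr", "Sr", "Jr", "Fr", "Jr", "Fr", "Jr", "Sr", "Sr",
         "Fr", "Sr", "So", "So", "So", "Fr", "Sr", "Jr", "So"]
status = ["Full-time", "Part-time", "Full-time", "Unemployed", "Part-time",
          "Part-time", "Full-time", "Full-time", "Unemployed",
          "Full-time", "Part-time", "Full-time", "Part-time", "Unemployed",
          "Part-time", "Part-time", "Unemployed", "Part-time"]

n = len(ranks)
cells = Counter(zip(status, ranks))   # cell counts of the cross-table
rank_totals = Counter(ranks)          # column totals
status_totals = Counter(status)       # row totals

pct_sr = 100 * rank_totals["Sr"] / n
pct_unemployed = 100 * status_totals["Unemployed"] / n
# Same cell, two different bases: seniors vs. unemployed students.
pct_sr_unemployed = 100 * cells[("Unemployed", "Sr")] / rank_totals["Sr"]
pct_unemployed_sr = 100 * cells[("Unemployed", "Sr")] / status_totals["Unemployed"]

for s in ["Full-time", "Part-time", "Unemployed"]:
    print(f"{s:<12}", *(cells[(s, r)] for r in ["Fr", "So", "Jr", "Sr"]))
```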
Summarizing quantitative data by categories of a qualitative variable.
Simple Table: Data on one attribute and one quantitative variable are often summarized in a simple table with
a column for the attribute and a second column for summary results of the quantitative variable. (Note: This is not a
frequency table)
• First column shows the different categories of the attribute, and
• Second column shows summary results (e.g., mean, median) of the quantitative variable corresponding to each category
of the attribute.
Example :
Average age of students by class rank.

Class Rank   Average Age
Fr           20
So           21.2
Jr           23
Sr           24.5

The above summary results can be displayed using a bar chart with class rank along the horizontal axis and average age
along the vertical axis.
Summarizing bivariate quantitative data.
• Scatter diagram to summarize and display bivariate quantitative variables.
Summary measurements of Quantitative Data
Measures of Location or Central Tendency
Mean. If $x_1, x_2, \ldots, x_n$ are used to denote the observations of a variable X in the data set, then the mean
$\bar{X}$ (written $\bar{x}$ in the formulas that follow) is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{\sum x}{n}.$$
Median is the middle number in the ordered data set. An ordered data set is one in which observations are arranged in
ascending (or descending) order of magnitude. If the total number n of observations is even, then the median is the mean
of the middle two numbers. Write M to denote the median. The median is less sensitive than the mean to extremely large
or small measurements.
Mode is the observation (or observations) in the data set that occurs more often than any other.
A data set may have no mode, a unique mode, or more than one mode. Mode is an appropriate summary
measure for categorical data. For example, if answers for favorite colors are red, blue, green, purple, it is impossible to
add up those values to find the mean or to order the colors to find the median. The alternative for describing this set of
data is the mode. Then you can say that red (or whichever color occurs most often) is the most popular choice.
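The three measures of location can be compared on a small example. The sketch below uses Python's standard `statistics` module; both data sets are hypothetical and chosen only to show that one extreme value moves the mean but not the median, and that the mode still works for categorical answers.

```python
from statistics import mean, median, multimode

# Hypothetical quantitative data (already in ascending order).
data = [2, 3, 3, 5, 7, 10]

data_mean = mean(data)
data_median = median(data)    # n is even: mean of the two middle values, 3 and 5
data_modes = multimode(data)  # list of the most frequent value(s)

# Replace the largest value by an extreme one: the mean shifts, the median does not.
extreme = [2, 3, 3, 5, 7, 100]
extreme_mean = mean(extreme)
extreme_median = median(extreme)

# Hypothetical categorical answers: only the mode is meaningful here.
colors = ["red", "blue", "red", "green", "purple", "red"]
color_modes = multimode(colors)

print(data_mean, data_median, data_modes)
print(extreme_mean, extreme_median)
print(color_modes)
```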
Measures of Variability or Dispersion or Spread
Range = Highest observation (H) - Lowest observation (L).
Variance. If $x_1, x_2, \ldots, x_n$ are used to denote the observations in the data set, then the variance is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}.$$
Standard Deviation = $s = \sqrt{s^2}$
Coefficient of Variation = $\left(\frac{s}{\bar{x}}\right) 100\%$
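To see that the definitional and shortcut forms of the variance agree, the sketch below applies both to the number-of-courses data from Practice Exercise 1 (4 students with 3 courses, 18 with 4, 6 with 5, 2 with 6). The variable names are illustrative.

```python
from math import sqrt

# Number of courses for the 30 students, expanded from the frequency table.
x = [3] * 4 + [4] * 18 + [5] * 6 + [6] * 2

n = len(x)
x_bar = sum(x) / n

# Definitional form: squared deviations from the mean, divided by n - 1.
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1).
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

s = sqrt(s2_definition)   # standard deviation
cv = (s / x_bar) * 100    # coefficient of variation, in percent

print(f"mean = {x_bar:.2f}, s^2 = {s2_definition:.4f}, s = {s:.4f}, CV = {cv:.2f}%")
```

The output matches the mean of 4.20 and standard deviation of 0.7611 reported with the frequency table of number of courses.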
Numerical Measures of Relative Position or Relative Standing
Percentile : Percentiles are numbers that divide an ordered data set into 100 equal parts. The p-th percentile is the
value $x_p$ such that at most p% of all values are smaller than $x_p$ and at most (100 − p)% are larger than $x_p$.
With the n sample observations arranged in ascending order, $x_p$ is equal to the average of the
$\left(\frac{np}{100}\right)$-th and $\left(\frac{np}{100} + 1\right)$-th observations if $\frac{np}{100}$ is an integer;
otherwise $x_p$ is equal to the $(k + 1)$-th observation, where $k$ is the largest integer not exceeding $\frac{np}{100}$.
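The rule above translates into a few lines of code. This is a minimal Python sketch that assumes p is a whole number between 1 and 99 and that positions are counted from the smallest observation; the function name and the ten data values are hypothetical. Statistical software often uses other interpolation rules, so library results may differ slightly.

```python
def percentile(values, p):
    """p-th percentile by the rule in the notes (p a whole number, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100, exclusive")
    ordered = sorted(values)              # ascending order
    n = len(ordered)
    k, remainder = divmod(n * p, 100)     # k = largest integer not exceeding np/100
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th observations.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise take the (k+1)-th observation (index k, counting from zero).
    return ordered[k]

# Hypothetical data set with n = 10 observations.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]

q1 = percentile(data, 25)       # np/100 = 2.5 -> 3rd observation
med = percentile(data, 50)      # np/100 = 5   -> average of 5th and 6th
q3 = percentile(data, 75)       # np/100 = 7.5 -> 8th observation

print(q1, med, q3)
```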
z-score (also known as standardized score): This is defined as
$$z = \frac{\text{observed measurement} - \text{mean}}{\text{standard deviation}} = \frac{x - \bar{x}}{s} \quad\text{or}\quad \frac{x - \mu}{\sigma},$$
depending on whether x is a sample or a population measurement.
Note: The z-score represents the distance between a given data value x0 and the mean of all data values, expressed in standard
deviations. For example, if the z-score for x0 is 2, then x0 is 2 standard deviations above the mean. All observations with absolute
z-scores greater than three are considered outliers.
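A short Python sketch of the z-score and the outlier rule in the note. The mean of 1175 and standard deviation of 120 are borrowed from the SAT example later in these notes; the individual scores and the function name are assumptions made for illustration.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# SAT figures from the example later in the notes; the scores are illustrative.
MEAN, SD = 1175, 120

z_a = z_score(1415, MEAN, SD)   # two standard deviations above the mean
z_b = z_score(1055, MEAN, SD)   # one standard deviation below the mean
z_c = z_score(1600, MEAN, SD)   # more than three standard deviations above

for score, z in [(1415, z_a), (1055, z_b), (1600, z_c)]:
    flag = "outlier" if abs(z) > 3 else "not an outlier"
    print(f"x = {score}: z = {z:.2f} ({flag})")
```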
Boxplot for quantitative data - Detecting outliers and extreme observations
(**** This section on boxplot is intended for students in Honors Section Only ****)
A box plot displays the following summary results of quantitative variable(s):
(i) Lower adjacent value = smallest observation greater than or equal to Q1 − 1.5 ∗ IQR (= lower inner fence), where IQR = Q3 − Q1,
(ii) Q1 = first quartile,
(iii) median,
(iv) Q3 = third quartile,
(v) Upper adjacent value = largest observation less than or equal to Q3 + 1.5 ∗ IQR (= upper inner fence),
(vi) outlying and extreme observations: Observations that are unusually large or unusually small relative to other observations
in a data set are outliers. All observations falling outside the interval [Q1 −1.5(Q3 −Q1 ), Q3 +1.5(Q3 −Q1 )] are considered
outliers and those falling outside [Q1 − 3(Q3 − Q1 ), Q3 + 3(Q3 − Q1 )] are considered extreme outlying observations.
The upper and lower quartiles of the data are portrayed by the top and bottom of a rectangle, and the median is portrayed
by a horizontal line segment within the rectangle. The mean is marked by a dot or a plus sign. The vertical lines (or
“whiskers”) above and below the rectangle extend to the adjacent values.
• Box plots focus on the following important features of a data set:
→ typical or central value measured by mean
→ position of data measured by median, first and third quartiles
→ measures of spread or variability measured by IQR
→ shape - symmetry or skewness : the relative distances of the upper and lower quartiles from the median give information
about the shape of the distribution of the data. If one distance is much bigger than the other, the distribution is skewed.
For a symmetric distribution, the median cuts the box in half, the whiskers are about the same length and outside values
(if any) are symmetrically placed beyond the lower and upper whiskers.
→ Outlying and extreme data points and the behavior of the tails: outlying and extreme values (observations beyond
the adjacent values) are graphed individually. These values portray behavior in the extreme tails of the distribution,
providing further information about spread and shape.
• Box plots are often drawn side-by-side to compare two or more quantitative variables.
Example. The Environmental Protection Agency (EPA) performs extensive tests on all new car models to determine their mileage
ratings. Suppose that the following 100 measurements represent the results of such tests on a certain new car model:

30.0  31.8  32.5  32.7  32.9  32.9  33.1  33.2  33.6  33.8
33.9  33.9  34.0  34.2  34.4  34.5  34.8  34.8  35.0  35.1
35.2  35.3  35.5  35.6  35.6  35.7  35.8  35.9  35.9  36.0
36.1  36.2  36.3  36.3  36.4  36.4  36.5  36.5  36.6  36.6
36.7  36.7  36.7  36.8  36.8  36.8  36.9  36.9  36.9  37.0
37.0  37.0  37.0  37.1  37.1  37.1  37.2  37.2  37.3  37.3
37.4  37.4  37.5  37.6  37.6  37.7  37.7  37.8  37.9  37.9
38.0  38.1  38.2  38.2  38.3  38.4  38.5  38.6  38.7  38.8
39.0  39.0  39.3  39.4  39.5  39.7  39.8  39.9  40.0  40.1
40.2  40.3  40.5  40.5  40.7  41.0  41.0  41.2  42.1  44.9

Some summary results are mean =36.994, median= 37.00, Q1 = 35.65, Q3 = 38.35, L = 30.00, H = 44.90, 1.5(Q3 − Q1 ) =
4.05, 3(Q3 − Q1 ) = 8.1. Summarize the data in a box plot.
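Before drawing the box plot, it helps to locate the fences, the adjacent values, and any outliers. The Python sketch below takes Q1 and Q3 from the summary results above rather than recomputing them, and applies the inner-fence and outer-fence rules to the 100 measurements; the variable names are illustrative.

```python
from statistics import mean, median

# The 100 mileage ratings from the example, in ascending order.
mileage = [
    30.0, 31.8, 32.5, 32.7, 32.9, 32.9, 33.1, 33.2, 33.6, 33.8,
    33.9, 33.9, 34.0, 34.2, 34.4, 34.5, 34.8, 34.8, 35.0, 35.1,
    35.2, 35.3, 35.5, 35.6, 35.6, 35.7, 35.8, 35.9, 35.9, 36.0,
    36.1, 36.2, 36.3, 36.3, 36.4, 36.4, 36.5, 36.5, 36.6, 36.6,
    36.7, 36.7, 36.7, 36.8, 36.8, 36.8, 36.9, 36.9, 36.9, 37.0,
    37.0, 37.0, 37.0, 37.1, 37.1, 37.1, 37.2, 37.2, 37.3, 37.3,
    37.4, 37.4, 37.5, 37.6, 37.6, 37.7, 37.7, 37.8, 37.9, 37.9,
    38.0, 38.1, 38.2, 38.2, 38.3, 38.4, 38.5, 38.6, 38.7, 38.8,
    39.0, 39.0, 39.3, 39.4, 39.5, 39.7, 39.8, 39.9, 40.0, 40.1,
    40.2, 40.3, 40.5, 40.5, 40.7, 41.0, 41.0, 41.2, 42.1, 44.9,
]

# Quartiles as given in the summary results (not recomputed here).
q1, q3 = 35.65, 38.35
iqr = q3 - q1

inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr       # outer fences

# Whiskers end at the adjacent values: the most extreme points inside the inner fences.
lower_adjacent = min(x for x in mileage if x >= inner_low)
upper_adjacent = max(x for x in mileage if x <= inner_high)

outliers = [x for x in mileage if x < inner_low or x > inner_high]
extreme = [x for x in mileage if x < outer_low or x > outer_high]

print(f"mean = {mean(mileage):.3f}, median = {median(mileage):.2f}")
print(f"inner fences = ({inner_low:.2f}, {inner_high:.2f})")
print(f"adjacent values = ({lower_adjacent}, {upper_adjacent})")
print(f"outliers = {outliers}, extreme = {extreme}")
```

With these values the whiskers run from 31.8 to 42.1, and the observations 30.0 and 44.9 are plotted individually as outliers; neither lies beyond the outer fences.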
Interpreting the Mean and Standard Deviation
Empirical Rule: For approximately mound-shaped distributions of data, the following statements can be made.
1. Approximately 68% of the measurements will fall within 1 standard deviation of the mean.
2. Approximately 95% of the measurements will fall within 2 standard deviations of the mean.
3. Approximately 99.7% of the measurements will fall within 3 standard deviations of the mean.
Chebyshev’s Rule. For k > 1, at least $\left(1 - \frac{1}{k^2}\right)100\%$ of the observations will fall within k standard
deviations of the mean.
Example. A small computing center has found that the number of jobs submitted per day to its computers has a distribution
that is bell-shaped and symmetric with a mean of 83 jobs and a standard deviation of 10. On what proportion of days does the
number of jobs submitted exceed 93? (approximately 16%)
Example. A small computing center has found that the number of jobs submitted per day to its computers has a distribution
that is approximately mound-shaped with a mean of 83 jobs and a standard deviation of 10. On what proportion of days
does the number of jobs fall between 73 and 93? (approximately 68%)
Example. Assume that the average SAT score of first year UCF students is 1175 with a standard deviation of 120. If the
distribution of their SAT scores is approximately bell-shaped and symmetric, what percentage of students scored between 1055
and 1415? Find the minimum score of a student who scored among the top 2.5% of students. (approximately 81.5%; 1415)
Example. Solar energy is considered by many to be the energy of the future. A recent survey was taken to compare the cost
of solar energy to the cost of gas or electric energy. Results of the survey revealed that the distribution of the amount of the
monthly utility bill of a 3-bedroom house using gas or electric energy had a mean of $125 and a standard deviation of $10.
What percentage of homes will have a monthly utility bill of less than $105 or more than $145? (at most 25%)
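The answers to the four examples follow from the Empirical Rule (when the distribution is mound-shaped) or from Chebyshev's Rule (when the shape is not stated, as in the utility-bill example). The sketch below works through the arithmetic in Python; the constant and function names are assumptions made for illustration.

```python
# Approximate share of measurements within k standard deviations of the mean
# for mound-shaped distributions (Empirical Rule).
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

def chebyshev_min_within(k):
    """Chebyshev's lower bound on the share within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k > 1")
    return 1 - 1 / k ** 2

# Jobs per day: mean 83, sd 10, so 93 = mean + 1 sd and 73 = mean - 1 sd.
# By symmetry, half of the share outside one sd lies above 93.
jobs_above_93 = (1 - EMPIRICAL[1]) / 2
jobs_73_to_93 = EMPIRICAL[1]

# SAT: mean 1175, sd 120, so 1055 = mean - 1 sd and 1415 = mean + 2 sd.
sat_1055_to_1415 = EMPIRICAL[1] / 2 + EMPIRICAL[2] / 2
# The top 2.5% lie above mean + 2 sd, because (1 - 0.95) / 2 = 0.025.
sat_top_cutoff = 1175 + 2 * 120

# Utility bills: mean 125, sd 10, shape unknown, so use Chebyshev with k = 2.
k = (145 - 125) / 10
bills_outside_max = 1 - chebyshev_min_within(k)

print(f"{jobs_above_93:.0%}, {jobs_73_to_93:.0%}, {sat_1055_to_1415:.1%}, "
      f"{sat_top_cutoff}, at most {bills_outside_max:.0%}")
```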
Chapters 1 and 2 Review
Key-words: Population and sample, quantitative and qualitative variable, descriptive and inferential statistics, parameter
and statistic, census and survey, frequency tables and simple tables, bar diagram and histogram, measures of location (mean,
median and mode), measures of dispersion (range, standard deviation, variance, coefficient of variation) and measures of
relative position (percentiles, quartiles, z-score), outliers, box plot.
MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
(answers are given at the end)
1) A sample of high school teenagers reported that 92% of those sampled are interested in pursuing a college education.
This statement is a result of a
A) quantitative variable
B) statistical inference
C) descriptive statistic
2) A published analysis recently stated “Based on a sample of 250 newly hired truck drivers, there is evidence to indicate
that, on average, independent truck drivers are overpaid relative to company-hired truck drivers.” This statement is an
example of
A) descriptive statistics
B) a random sample
C) a conclusion
D) inferential statistics
4) Which of the following measures would allow you to compare a student’s combined SAT score (taken from a population
with mean = 600 and standard deviation of 110) and a student’s ACT score (taken from a population with mean = 22
and standard deviation of 1.4)?
A) the percentiles of both scores
B) the z-scores of both scores
C) both a and b
D) neither a nor b
5) From past figures, it is predicted that 47% of the registered voters in California will vote in the June primary. Does this
statement describe descriptive or inferential statistics?
A) descriptive statistics
B) inferential statistics
6) The average age of students in a statistics class is 19 years. Does this statement describe descriptive or inferential
statistics?
A) inferential statistics
B) descriptive statistics
SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.
7) Parking at a large university has become a very big problem. University administrators are interested in determining the
average parking time (e.g. the time it takes a student to find a parking spot) of its students. An administrator followed
110 students and carefully recorded their parking times. Identify the sample used.
8) Which is used more often, a sample or a population? Why?
9) The ages of professors in the biology department at a private university are 57, 68, 66, 64, and 44. Calculate the sample
variance of these ages.
MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
10) A study published in 1990 attempted to estimate the proportion of Florida residents who were willing to spend more
tax dollars on protecting the Florida beaches from environmental disasters. Forty-three hundred Florida residents were
surveyed. Which of the following describes the variable of interest in the study?
A) the response to the question “Do you use the beach?”
B) the 4300 Florida residents surveyed
C) being willing to spend more tax dollars on protecting the Florida beaches from environmental disasters
D) the response to the question “Do you live along the beach?”
11) Parking at a large university has become a very big problem. University administrators are interested in determining
the average parking time (e.g. the time it takes a student to find a parking spot) of its students. An administrator
followed 260 students and carefully recorded their parking times. Identify the inference of interest to the university
administration.
A) the generalization of the average time it takes the university administrators to find a parking spot
B) the generalization of the average time it takes a student to find a parking spot
C) the generalization of the average number of parking spots available for students
D) the generalization of the average amount of money that should be charged for student parking passes as a result of
limited parking spots
12) The amount of television viewed by today’s youth is of primary concern to Parents Against Watching Television (PAWT).
290 parents of elementary school-aged children were asked to estimate the number of hours per week that their child
watched television. The mean and the standard deviation for their responses were 12 and 4, respectively. Identify the
type of data collected by PAWT.
A) quantitative data
B) qualitative data
13) A radio station claims that the amount of advertising per hour of broadcast time has an average of 12 minutes and a
standard deviation equal to 2.2 minutes. You listen to the radio station for 1 hour, at a randomly selected time, and
carefully observe that the amount of advertising time is equal to 13 minutes. Calculate the z-score for this amount of
advertising time.
A) z = -0.45
B) z = 0.45
C) z = 0.90
D) z = 2.2
14. Identify the correct answer in each case.
• In a data set with 5 distinct numbers, the value of mode is
(a) zero
(b) one
(c) there is no mode
(d) five.
• Which is not a measure of central tendency or location?
(a) Mean
(b) Range
(c) Mode
(d) Median.
• Which is not a measure of relative position?
(a) Median
(b) Quartile
(c) standard deviation
(d) Percentile
• Which is not displayed in a box plot?
(a) Median
(b) Mode
(c) Outliers
(d) Quartiles
15. Identify each of the following as examples of attribute/qualitative (A) or quantitative (Q) variables.
• The amount of flu vaccine in a syringe
• Method of payments (cash, check, etc.) for books bought by UCF students.
• The amount of carbon monoxide produced per gallon of unleaded gas.
16. “TRUE” / “FALSE” questions.
• Both 2 and 3 are modes of the data set 2, 3, 2, 3, 3, 2.
• When we take the information contained in the sample and make statements or predictions about all of the information in
the population, we are utilizing the technique that is known as descriptive statistics.
• The median of a data set is also equal to one-half of the range of the data set.
• 50th percentile of a data set is also equal to Range/2.
• If your score on an exam (that has 100 true-false questions) corresponds to the 75th percentile, then you obtained 75 correct
answers out of 100 questions.
• If a data set has seven distinct observations, then the median of the data set is four.
• If a data set has all negative numbers, then its standard deviation is also negative.
• In a symmetric distribution, we expect the values of the mean, median, and mode to differ greatly from one another.
• In skewed distributions, the mean is the best measure of the center of the distribution since it is least affected by extreme
observations.
(Answers: 1(C); 2(D); 4(C); 5(B); 6(B); 7: parking time of 110 students; 8: sample; 9: $s^2$ = 95.20; 10(C); 11(B); 12(A);
13(B); 14(C, B, C, B); 15(Q, A, Q); 16(F, F, F, F, F, F, F, F, F))