Describing Data
Week 1
The W’s (Where do the Numbers
come from?)
• Who: Who was measured?
• By Whom: Who did the measuring?
• What: What was measured?
• Where: Where was the data measured?
• When: When was the measurement done?
• HoW: How was the data measured?
• Why: Why was the measurement done?
Always Check the W’s
• Anytime you see data always check the
W’s.
• This will help spot questionable statistics.
• ALWAYS QUESTION DATA
Variables (The What)
• Variables are characteristics that are
recorded about each individual.
• Categorical variables are non-numeric in
nature.
• Quantitative variables are measurements
and have units
Displaying and Describing
Categorical Data
Terms
• Frequency table: Categories and counts
• Distribution: lists the frequencies of each
category
• Relative frequency distribution: lists the
relative frequencies of each category
• Contingency Table: The frequencies or
relative frequencies of 2 variables.
Terms
• Marginal Distribution: the totals found on the margins
of the chart. The distribution of one of the two variables
• Conditional distribution: the distribution of one row or
column of a contingency table.
• Independence: two variables are independent if the
conditional distribution of all the values of a variable is
the same as the marginal distribution of that variable.
(Huh!)
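The independence definition is easier to see with numbers. The sketch below uses made-up counts (the group names and answers are hypothetical) and compares each row's conditional distribution with the marginal distribution of the column variable.

```python
# Made-up counts for illustration: rows = group, columns = answer.
counts = {
    "Group A": {"Yes": 30, "No": 20},
    "Group B": {"Yes": 60, "No": 40},
}

def marginal_columns(table):
    """Marginal distribution of the column variable, as relative frequencies."""
    totals = {}
    for row in table.values():
        for col, n in row.items():
            totals[col] = totals.get(col, 0) + n
    grand = sum(totals.values())
    return {col: n / grand for col, n in totals.items()}

def conditional_row(table, row_name):
    """Conditional distribution of the column variable within one row."""
    row = table[row_name]
    total = sum(row.values())
    return {col: n / total for col, n in row.items()}

def looks_independent(table, tol=1e-9):
    """True if every row's conditional distribution matches the marginal one."""
    marg = marginal_columns(table)
    return all(
        abs(conditional_row(table, r)[c] - marg[c]) <= tol
        for r in table
        for c in marg
    )

print(marginal_columns(counts))
print(conditional_row(counts, "Group A"))
print(looks_independent(counts))
```

In these made-up counts both groups answer "Yes" 60% of the time, which matches the overall 60%, so the two variables look independent. Real data will rarely match exactly, so a loose tolerance (or a formal test) would be needed in practice.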
Three Rules of Data Analysis
• First, make a picture!
• First, make a picture!
• First, make a picture!
Or you could
Why?
• Pictures reveal things charts don’t.
• Patterns can be revealed that are not
readily apparent from the numbers.
• Pictures are the easiest way to explain to
others about the data
To Make a Graph
• Make piles. Organize the data into like
groups
• Make a frequency table
• Make a relative frequency table by
finding the percentages
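A minimal sketch of these three steps in Python, using a made-up list of answers (the data are hypothetical):

```python
from collections import Counter

# Made-up categorical data for illustration.
answers = ["cat", "dog", "dog", "fish", "dog", "cat", "dog", "fish"]

# "Make piles" and count them: this is the frequency table.
freq = Counter(answers)

# Relative frequency table: each count as a percentage of the total.
total = sum(freq.values())
rel_freq = {category: 100 * count / total for category, count in freq.items()}

for category, count in freq.most_common():
    print(f"{category:5} {count:3} {rel_freq[category]:5.1f}%")
```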
Make a Graph
• Probably a bar chart graphing the
frequencies or . . .
• A pie chart to graph the relative
frequencies
• Beware of the area principle.
• Stay 2-D
To Make a Graph of Categorical
Data
• Think
 Check W’s
 Identify the variables
 Check to see if categories overlap
 Data are counts
To Make a Graph of Categorical
Data
• Show
 Select the appropriate graph to compare
categories
 Bar Graph for frequencies
 Pie Chart for relative frequencies (percents)
 Stacked bar graph can be used instead of a pie
chart
To Make a Graph of Categorical
Data
• Tell
 Interpret the results
 Describe the results in the context of the
problem
 Answers are sentences not numbers
Displaying Quantitative Data
More Graphs
Histograms
• Think:
 Must be quantitative data
 Want to see the distribution
 Could be counts or percents
Stem and Leaf Plots
• Think
 Must be quantitative data
 Want to see the distribution
 Usually counts
 Relatively small sample size
Stem and Leaf Plot
• Show
 Scale is usually vertical
 Put the ‘Stems’ on the vertical scale
 Stems are usually the data without the last digit
 Might be rounded
 If there are a lot of leaves with one stem make dual
stems and put 0-4 on one and 5-9 on the other
 Plot the ‘leaves’
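The steps above can be sketched in Python. This version assumes non-negative whole numbers, uses the last digit as the leaf, and does not split stems into 0-4 and 5-9; the function name `stem_and_leaf` is just for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative integers."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # stem = the value without its last digit
        groups[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so gaps stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(d) for d in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Made-up data for illustration.
for line in stem_and_leaf([12, 15, 21, 23, 23, 38, 40]):
    print(line)
```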
Dot Plot
• Think
 Must be quantitative data
 Want to see the distribution
 Usually counts
 Relatively small sample size
Dot Plot
• Show
 Scale can be vertical or horizontal
 Place a dot at the appropriate location
Describing the Distribution
• Tell
 Shape
 How many humps?
• Unimodal
• Bimodal - maybe more than one group thrown together
• Multimodal
 Uniform
 Symmetric
 Skewed
 Gaps
 Clusters
Describing the Distribution
• Tell (continued)
 Center
 What is the middle value
 What is the middle range
Describing the Distribution
• Tell (Continued)
 Spread
 Range = Maximum value - minimum value
 Variation: How much does the data jump around
Outliers
• Discuss any data points that do not seem
to fit the overall pattern.
• Is there a logical explanation for them to
be that different?
Comparing Two Distributions
• Compare the centers of the two
distributions
• Compare the shapes of the two
distributions
• Compare the spread of the two
distributions
• Compare any extreme values (outliers) of
the two distributions.
Time Plot
• Think:
 Quantitative data
 Looking for trends
• Show
 Time is horizontal scale
 Plot data
 Connect the dots
 Can use calculator
Describing Distributions
with Numbers
Measurements of the Center
• Mean: The ‘Average’
• µ: mean of a population
• x̄: mean of a sample
• Unique
• Median: The middle score
• Sort the data
• Middle score or the average of the middle two
scores
• Unique
More Measures of Center
• Mode: The most common score
 Not necessarily unique
 Does not necessarily exist
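As a quick illustration, Python's standard `statistics` module computes all three measures. The scores are made up; `multimode` returns a list because the mode is not necessarily unique.

```python
import statistics

# Made-up scores for illustration.
scores = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(scores)        # sum of the scores divided by how many there are
median = statistics.median(scores)    # average of the two middle scores (3 and 5)
modes = statistics.multimode(scores)  # every value tied for most common

print(mean, median, modes)
```

When no value repeats, `multimode` returns every value, which is one way of seeing that a meaningful mode does not always exist.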
Finding Quartiles
• Sort the data
• Find the median
• The 1st quartile (25% mark) is the median
of the smaller half of the data
• The 3rd quartile (75% mark) is the median
of the larger half of the data
The Five Number Summary
• The minimum data point
• The 1st quartile
• The median
• The 3rd quartile
• The largest data point
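A sketch of the five-number summary using the "median of each half" method for quartiles. One assumption is labeled in the code: when the number of data points is odd, the middle value is left out of both halves. Textbooks and software differ on this, so results may not match a calculator exactly.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]          # smaller half
    upper = xs[half + n % 2:]  # larger half (assumption: skip the middle value when n is odd)
    return (
        xs[0],
        statistics.median(lower),
        statistics.median(xs),
        statistics.median(upper),
        xs[-1],
    )

# Made-up data for illustration.
print(five_number_summary([1, 3, 4, 6, 7, 8, 10, 12, 15]))
```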
InterQuartile Range and Outliers
• Outliers are data points that do not fit the
pattern of the distribution.
• The interquartile range (IQR) is the 3rd quartile
minus the 1st quartile
• An outlier is a point more than one and a half
times the IQR below the 1st quartile or more than
one and a half times the IQR above the 3rd
quartile
Checking for Outliers
• Find the 5 number summary
• Calculate the Interquartile Range
• IQR = 3rd quartile - 1st quartile
• Lower cut off point = 1st quartile - 1.5(IQR)
• Upper cut off point = 3rd quartile + 1.5(IQR)
• Check for data outside the cut off points
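The checklist above can be sketched as two small functions. The quartiles are passed in directly, so any method of finding them can be used; the function names and the data are made up for illustration.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cut off points from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Return the data points that fall outside the cut off points."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Made-up example: Q1 = 3.5 and Q3 = 11, so IQR = 7.5.
print(outlier_fences(3.5, 11))
print(find_outliers([1, 3, 4, 6, 7, 8, 10, 12, 30], 3.5, 11))
```

Here the cut off points are -7.75 and 22.25, so only the value 30 is flagged.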
The Normal Model
Density Curves and
Normal Distributions
A Density Curve:
• Is always on or above the x axis
• Has an area of exactly 1 between the
curve and the x axis
• Describes the overall pattern of a
distribution
• The area under the curve above any range
of values is the proportion of all the
observations that fall in that range.
Mean vs Median
• The median of a density curve is the equal
area point that divides the area under the
curve in half
• The mean of a density function is the
center of mass, the point where curve
would balance if it were made of solid
material
Normal Curves
• Bell shaped, symmetric, single-peaked
• Mean = µ
• Standard deviation = σ
• Notation: N(µ, σ)
• The points one standard deviation on either side of µ
are the inflection points of the curve
68-95-99.7 Rule
• About 68% of the data in a normal curve
is within one standard deviation of the
mean
• About 95% of the data in a normal curve
is within two standard deviations of the
mean
• About 99.7% of the data in a normal curve
is within three standard deviations of
the mean
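The three percentages can be checked numerically. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `within_k_sd` is made up for illustration.

```python
from statistics import NormalDist

def within_k_sd(k):
    """Area under the standard normal curve between -k and +k."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within_k_sd(k):.4f}")
```

The printed areas are roughly 0.6827, 0.9545 and 0.9973, which is where the rounded 68-95-99.7 figures come from.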
Why are Normal Distributions
Important?
• Good descriptions for many distributions of
real data
• Good approximation to the results of many
chance outcomes
• Many statistical inference procedures
based on normal distributions work well for
other roughly symmetric distributions
Standard Normal Curve
Standardizing (z-score)
• If x is from a normal population with mean
µ and standard deviation σ, then
the standardized value z is the number of
standard deviations x is from the mean
• z = (x - µ)/σ
• The unit on z is standard deviations
Standard Normal Distribution
• A normal distribution with µ = 0 and
σ = 1, N(0,1), is called a Standard Normal
distribution
• Z-scores are standard normal where
z = (x - µ)/σ
Standard Normal Tables
• Table B (pg 552) in your book gives the percent of the
data to the left of the z value.
• Or in your Standard Normal table
• Find the 1st 2 digits of the z value in the left column and
move over to the column of the third digit and read off
the area.
• To find the cut-off point given the area, find the closest
value to the area ‘inside’ the chart. The row gives the
first 2 digits and the column gives the last digit
Solving a Normal Proportion
• State the problem in terms of a variable (say x) in the
context of the problem
• Draw a picture and locate the required area
• Standardize the variable using z = (x - µ)/σ
• Use the calculator/table and the fact that the total area
under the curve = 1 to find the desired area.
• Answer the question.
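The standardize-then-look-up steps can be sketched in Python, with `NormalDist().cdf` standing in for the table (it gives the area to the left of a z value). The function names and the numbers are made up for illustration.

```python
from statistics import NormalDist

def proportion_below(x, mu, sigma):
    """Area to the left of x under N(mu, sigma)."""
    z = (x - mu) / sigma       # standardize
    return NormalDist().cdf(z)  # area to the left of z, as in the table

def proportion_above(x, mu, sigma):
    """Area to the right of x, using the fact that the total area is 1."""
    return 1 - proportion_below(x, mu, sigma)

def proportion_between(lo, hi, mu, sigma):
    """Area between lo and hi: left area at hi minus left area at lo."""
    return proportion_below(hi, mu, sigma) - proportion_below(lo, mu, sigma)

# Made-up example: mean 100, standard deviation 15.
print(round(proportion_above(130, 100, 15), 4))
print(round(proportion_between(85, 115, 100, 15), 4))
```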

Finding a Cutoff Given the Area
• State the problem in terms of a variable
(say x) and area
• Draw a picture and shade the area
• Use the table to find the z value with the
desired area
• Go z standard deviations from the mean in
the correct direction.
• Answer the question.
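Going the other way, from an area to a cutoff, can be sketched with `NormalDist().inv_cdf`, which plays the role of reading the table "inside out". The function name and the numbers are made up, and the area is assumed to be the area to the left of the cutoff.

```python
from statistics import NormalDist

def cutoff_for_left_area(area, mu, sigma):
    """Value of x with the given area to its left under N(mu, sigma)."""
    z = NormalDist().inv_cdf(area)  # z value with that area to its left
    return mu + z * sigma           # go z standard deviations from the mean

# Made-up example: the cutoff for the top 10% is the point with 90% to its left.
print(round(cutoff_for_left_area(0.90, 100, 15), 1))
```

For an area given to the right of the cutoff, subtract it from 1 first, as in the example.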
Assessing Normality
• In order to use the previous techniques the
population must be normal
• To assess normality:
 Construct a stem plot or histogram and see if
the curve is unimodal and roughly symmetric
around the mean