Descriptive statistics I
Distributions, summary statistics
Frequency distributions
• Frequency means the number of cases at a single value of a variable
• A “distribution” depicts the frequency (number of cases) at every
value of a variable
– Frequency distributions illustrate how values disperse
– For categorical variables use a BAR graph
– For continuous variables use a HISTOGRAM (also try AREA)
• Open DEMO PLUS.SAV
• For categorical choose variable SEX (1=Male, 2=Female)
• For continuous choose variable AGE
• Open Height weight gender age.sav (or .xls), choose a categorical and
continuous variable, display their distributions as above
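Outside SPSS, a frequency distribution can also be tallied in a few lines of Python. This is only a sketch: the `sex` and `age` lists are made-up values, not data from DEMO PLUS.SAV, and the helper name `frequency_distribution` is hypothetical.

```python
from collections import Counter

# Hypothetical values for illustration only (not taken from DEMO PLUS.SAV).
sex = [1, 2, 2, 1, 2, 2, 1, 2]          # categorical: 1=Male, 2=Female
age = [19, 23, 23, 31, 35, 35, 35, 42]  # continuous

def frequency_distribution(values):
    """Return {value: number of cases}, sorted by value."""
    return dict(sorted(Counter(values).items()))

sex_freq = frequency_distribution(sex)
age_freq = frequency_distribution(age)

# A crude text "bar graph": one asterisk per case at each value.
for value, count in sex_freq.items():
    print(f"{value}: {'*' * count}")
```

Each key is a value of the variable and each count is the frequency at that value, which is the information a bar graph or histogram displays.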
Summarizing distributions
• Producing a single statistic that best depicts a distribution
• For categorical variables, use the statistic “proportion”
– Proportions with a base of 100 are called a “percentage” (per 100)
• For continuous variables, use a measure of central tendency
– The statistic “mean” (arithmetic average)
– The statistic “median” (midpoint value – half of cases above, half below)
– The statistic “mode” (most frequent value – can be more than one)
• Open DEMO PLUS.SAV
– For categorical choose variable SEX (1=Male, 2=Female)
• Analyze|Descriptive Statistics|Frequencies
• Ask for a Bar Chart
– For continuous choose variable AGE
• Analyze|Descriptive Statistics|Frequencies
• Ask for a Histogram
• Open Height weight gender age.sav (or .xls), choose a categorical and
continuous variable, proceed as above
Categorical variables
• “Percent” is a summary statistic – it summarizes a distribution
• “Percent” – per cent – per hundred. 100 is always the denominator
• Increases in percentage are computed off the base amount:
Increase in jail population of 100 prisoners
• 100 percent increase – 100 percent of 100 is 100; 100 + 100 = 200 (2 times the base amount)
• 150 percent increase – 150 percent of 100 is 150; 150 + 100 = 250
• 200 percent increase – 200 percent of 100 is 200; 200 + 100 = 300 (3 times the base amount)
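The jail population arithmetic above can be sketched in Python. The function name `apply_percent_increase` is hypothetical; it simply computes the increase off the base amount and adds it to the base.

```python
def apply_percent_increase(base, percent):
    """Return the new total after increasing `base` by `percent` percent."""
    increase = base * percent / 100  # the increase is computed off the base
    return base + increase

# Jail population of 100 prisoners, using the three increases from the slide.
for pct in (100, 150, 200):
    print(f"{pct} percent increase: {apply_percent_increase(100, pct):.0f}")
```

Note that a 200 percent increase triples the base: the base itself is still there, plus twice the base on top of it.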
• Percentages of less than 1 percent are described as a
fraction
– Example - 0.2 percent is 2/10th of 1 percent
– Do not confuse decimals and percentages
• Decimal .20 = 20/100 = 20 percent
• Decimal .0020 = 20/10,000 = .20 percent
• Percentages (proportions) are usually the best way to
summarize datasets using categorical variables
– 70 percent of students are employed
– 60 percent of parolees recidivate
• Percentages can be used to summarize findings when large
numbers are involved
– 50,000 persons were asked whether crime is a serious
problem: 32,700 said “yes”
Compute…
Divide 32,700 by 50,000 and multiply by 100
32,700
-------- = .654
50,000
.654 X 100 = 65.4% (about 65%)
• Percentages can be used to compare datasets
– This year, 65% of 10,000 people polled said crime is a
serious problem
– Last year, 12,000 people were polled and 9,000 said
crime is a serious problem
Compute…
9,000
--------- = .75
12,000
.75 X 100= 75%
• Because both samples were standardized (responses per
100 persons) they are directly comparable even though
different numbers of persons were polled
– 65% v. 75%
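As a rough sketch, the same standardization can be done in Python. The helper name `percentage` is hypothetical, and the count of 6,500 "yes" answers for this year is derived from the stated 65% of 10,000, not given directly.

```python
def percentage(count, total):
    """Cases per 100: divide the count by the total and multiply by 100."""
    return count / total * 100

# Large survey: 32,700 of 50,000 said "yes".
large_survey = round(percentage(32_700, 50_000), 1)

# Comparing two polls of different sizes.
this_year = round(percentage(6_500, 10_000), 1)   # 6,500 derived from 65% of 10,000
last_year = round(percentage(9_000, 12_000), 1)

print(large_survey, this_year, last_year)
```

Because each result is expressed per 100 persons, `this_year` and `last_year` can be compared directly even though the polls differ in size.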
• Percentages can magnify differences when raw numbers
are small
• Percentages can deflate differences when numbers are
large
– Increase from 1 to 3 convictions is …
– Increase from 5,000 to 6,000 convictions is …
Compute both...
• Increase from 1 to 3 convictions is 200 percent
– 3-1 = 2
– 2/1 (base) X 100= 200%
• Increase from 5,000 to 6,000 convictions is 20 percent
– 6,000 - 5,000 = 1,000
– 1,000/5,000 (base) X 100 = 20%
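The two computations follow the same recipe: subtract the base from the new amount, divide by the base, and multiply by 100. A minimal Python sketch (the function name `percent_change` is hypothetical):

```python
def percent_change(old, new):
    """Percent change from `old` (the base) to `new`."""
    return (new - old) / old * 100

small_numbers = percent_change(1, 3)          # raw difference of 2 convictions
large_numbers = percent_change(5_000, 6_000)  # raw difference of 1,000 convictions

print(round(small_numbers), round(large_numbers))
```

The smaller raw difference produces the larger percentage, which is why the raw numbers should be reported alongside percentages.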
Summarizing a distribution for ordinal variables
• Ordinal variables are categorical variables whose categories reflect an inherent rank or order
• Can summarize the distribution of an ordinal variable two ways:
– As a categorical variable, using proportions / percentages
– As a continuous variable, treating categories as points on a scale
• Assign a numerical value to each category and calculate a mean
• Open DEMO PLUS.SAV
– Variable “class” is ordinal
– Display and summarize the distribution both ways...
• As a categorical/ordinal variable
• As a continuous variable
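Both ways of summarizing an ordinal variable can be sketched in Python. Everything in this example is an assumption for illustration: the category labels, the responses, and the 1-2-3 scale values are made up and are not the coding of “class” in DEMO PLUS.SAV.

```python
from collections import Counter
from statistics import mean

# Hypothetical ordinal responses and hypothetical scale values.
responses = ["lower", "middle", "middle", "upper",
             "middle", "lower", "upper", "middle"]
scale = {"lower": 1, "middle": 2, "upper": 3}

# Way 1: treat as categorical and report the percentage in each category.
counts = Counter(responses)
percentages = {cat: counts[cat] / len(responses) * 100 for cat in scale}

# Way 2: treat as continuous by scoring each category, then take the mean.
mean_score = mean(scale[r] for r in responses)

print(percentages, mean_score)
```

The mean is only as meaningful as the numbers assigned to the categories, so the scale values should be stated whenever this summary is reported.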
Continuous variables
• If variables are continuous, can summarize a distribution with one or more
measures of “central tendency”
– Mean, median, mode
• Mean: arithmetic average of scores
– Pulled in the direction of extreme scores
– Experiment with Height weight gender age.sav
• Median: Middle score – half higher, half lower
– If there is an even number of scores, average the two center scores
– If there is an odd number of scores, use the center score
• Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
Answer: 8 (odd number of scores, so use the center score)
• Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
Answer: 10 (even number of scores, so average the two center scores, 8 and 12)
12 - 8 = 4
4/2 = 2
8 + 2 or 12 - 2 = 10
• Median is a useful summary statistic when there are extreme scores
– Extreme scores make the mean a misleading summary measure of a
distribution
• Median can be used with continuous or ordinal variables
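The median exercises can be checked with Python's standard `statistics` module. The `with_extreme_score` list is a made-up variation (the top score changed from 21 to 210) added only to show how an extreme score pulls the mean but not the median.

```python
from statistics import mean, median

exercise_1 = [2, 3, 5, 5, 8, 12, 17, 19, 21]       # odd number of scores
exercise_2 = [2, 3, 5, 5, 8, 12, 17, 19, 21, 21]   # even number of scores

median_1 = median(exercise_1)  # the center score
median_2 = median(exercise_2)  # average of the two center scores, 8 and 12

# Hypothetical variation: replace the top score with an extreme value.
with_extreme_score = [2, 3, 5, 5, 8, 12, 17, 19, 210]

print(median_1, median_2)
print(round(mean(exercise_1), 1), median(exercise_1))
print(round(mean(with_extreme_score), 1), median(with_extreme_score))
```

The extreme score roughly triples the mean while the median stays where it was.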
• Mode: Score that occurs most often (with the greatest frequency)
– There can be more than one mode (bi-modal, tri-modal, etc.)
• Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
Mode = 5 (uni-modal)
• Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
Modes = 5, 21 (bi-modal)
• Modes are a useful summary statistic for distributions where cases cluster at
particular scores – an interesting condition that would be missed by the mean
or median
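The mode exercises can be checked the same way. This sketch assumes Python 3.8 or later, where `statistics.multimode` returns every value tied for the greatest frequency.

```python
from statistics import multimode

exercise_1 = [2, 3, 5, 5, 8, 12, 17, 19, 21]
exercise_2 = [2, 3, 5, 5, 8, 12, 17, 19, 21, 21]

modes_1 = multimode(exercise_1)  # one mode: uni-modal
modes_2 = multimode(exercise_2)  # two modes: bi-modal

print(modes_1, modes_2)
```

`multimode` is used rather than `mode` because it reports all tied values, so a bi-modal distribution is not reduced to a single score.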
Range
• Another way to describe a distribution of a continuous variable
– Not a measure of central tendency
• Range depicts the lowest and highest scores in a distribution
2, 3, 5, 5, 8, 12, 17, 19, 21
Range is 2–21, or 19 (21 - 2)