Chapter 3 Notes - Mr. Davis Math

Data Description
Chapter 3
Introduction
 Different statistical methods can be used to summarize data
 Measures of central tendency: mean, median, mode, midrange
 Measures of variation: range, variance, standard deviation
 Measures of position: percentiles, quartiles, deciles
3.1 – Measures of Central Tendency
 Statistic:
 Characteristic or measure obtained using data values from sample
 Parameter
 Characteristic or measure obtained using all data values from specific
population
 General Rounding Rule:
 Do not round until final answer is obtained
 Round off to one more decimal place than raw data
Mean
 Mean (arithmetic average)
 Sum of all values divided by total number of values
 Symbol $\bar{X}$ represents sample mean so $\bar{X} = \frac{\sum X}{n}$, where n is total number of values in sample
 Symbol µ (mu) is used to represent population mean so $\mu = \frac{\sum X}{N}$, where N is total number of values in population
Examples
 Example 3 – 1
 The data represent the number of days off per year for a sample of individuals
selected from nine different countries. Find the mean.
20 26 40 36 23 42 35 24 30
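As a quick check of the sample mean formula, the Example 3-1 data can be worked through in a few lines of Python. This is only a sketch; the variable names are illustrative and not part of the notes.

```python
# Data from Example 3-1: days off per year for a sample of nine countries.
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

# Sample mean: sum of all values divided by the number of values n.
mean = sum(days_off) / len(days_off)

# Rounding rule: one more decimal place than the raw data (whole numbers here).
print(round(mean, 1))
```

The sum is 276 and n is 9, so the mean rounds to 30.7 days.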
 Example 3 – 3
 Using the frequency distribution for example 2-7, find the mean. Data represent
the number of miles run during one week for a sample of 20 runners.
Finding Mean for Grouped Data
Procedure
1. Make a table with:
Column A = CLASS
Column B = Frequency f
Column C = Midpoint Xm
Column D = f * Xm
2. Find midpoints of each class and place them in column C
3. Multiply frequency by midpoint for each class, and place product in Column D
4. Find sum of Column D
5. Divide sum obtained in column D by sum of frequencies obtained in column B
Formula is: $\bar{X} = \frac{\sum f \cdot X_m}{n}$
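The grouped-data procedure can be sketched in Python as below. The Example 2-7 table is not reproduced in these notes, so the class boundaries and frequencies here are made up purely for illustration, and the variable names are assumptions.

```python
# Hypothetical frequency distribution (illustrative only, NOT the Example 2-7 data).
# Each entry is (lower class boundary, upper class boundary, frequency f).
classes = [
    (0.5, 10.5, 2),
    (10.5, 20.5, 5),
    (20.5, 30.5, 3),
]

sum_f = 0        # sum of column B (frequencies), which equals n
sum_f_xm = 0.0   # sum of column D (f * Xm)
for lower, upper, f in classes:
    midpoint = (lower + upper) / 2   # column C: class midpoint Xm
    sum_f += f
    sum_f_xm += f * midpoint         # column D: f * Xm

# Step 5: divide the sum of column D by the sum of column B.
grouped_mean = sum_f_xm / sum_f
print(round(grouped_mean, 1))
```

With these made-up classes the midpoints are 5.5, 15.5, and 25.5, the sum of column D is 165, and the grouped mean is 16.5.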
Median
 Median
 Midpoint of a data array (data arranged in order)
 Symbol for median is MD
 Will be a specific value or fall between two values
 Examples
 3–4
 3–5
 3–6
Mode
 Mode
 Value that occurs most often in data set
 Data set that has only one value occurring with greatest frequency is called
unimodal
 Data set that has two values occurring with same greatest frequency is called bimodal
 Data set that has more than two values occurring with same greatest frequency is called multimodal
 When all values occur with same frequency data set is said to have no mode
Examples
 3–9
 3 – 10
 3 – 11
Modal class: class with largest frequency
Midrange
 Midrange
 Sum of lowest and highest values of data set, divided by 2
 Symbol is MR
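The median, mode, and midrange definitions above can be sketched as follows. The helper names `modes` and `midrange` are hypothetical, and the second data set is made up to show a bimodal case.

```python
from collections import Counter
from statistics import median

def modes(values):
    """Return the values with the greatest frequency, or [] when every value
    occurs equally often (the case the notes describe as having no mode)."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

def midrange(values):
    """Sum of the lowest and highest values, divided by 2."""
    return (min(values) + max(values)) / 2

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]   # Example 3-1 data
print(median(days_off))     # middle value of the ordered data
print(modes(days_off))      # every value occurs once, so the list is empty
print(midrange(days_off))

illustrative = [1, 2, 2, 3, 3, 4]   # made-up data with two modes
print(modes(illustrative))
```

For the Example 3-1 data the median is 30 and the midrange is 31; the made-up data set is bimodal with modes 2 and 3.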
Properties and Uses of Central Tendency
The Mean
1. Found by using all values of the data
2. Varies less than median or mode when samples are taken from same population and all three measures are computed for these samples
3. Used in computing other statistics, such as variance
4. Unique and not necessarily one of the data values
5. Cannot be computed for data in a frequency distribution that has an open-ended class
6. Affected by extremely high or low values, called outliers, and may not be appropriate average to use in these situations
The Median
1. Used to find center or middle value of data set
2. Used when it is necessary to find out whether data values fall into the upper half or lower half of the distribution
3. Used for an open-ended distribution
4. Affected less than mean by extremely high or low values
The Mode
1. Used when most typical case is desired
2. Easiest average to compute
3. Used when data are nominal
4. Not always unique
The Midrange
1. Easy to compute
2. Gives the midpoint
3. Affected by extremely high or low data values
Distribution Shapes
 Positively-skewed (right-skewed) distribution
 Majority of data values fall to left of mean and cluster at lower end of distribution
 “tail” of distribution is to the right
 Mean is right of median, mode is left of median
 Symmetric distribution
 Data values are evenly distributed on both sides of mean
 Mean, median, and mode are the same and at center of distribution
 Negatively-skewed (left-skewed) distribution
 Majority of data values fall to right of mean and cluster at upper end of distribution
 Mean is left of median, mode is right of median
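To make the ordering of mean, median, and mode concrete, here is a small sketch using a made-up right-skewed data set (the values are illustrative, not from the notes).

```python
from statistics import mean, median, mode

# Made-up right-skewed data: most values are small, with a long tail to the right.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

m_mode = mode(right_skewed)
m_median = median(right_skewed)
m_mean = mean(right_skewed)

print(m_mode, m_median, m_mean)
# Ordering described for a positively-skewed distribution: mode < median < mean.
print(m_mode < m_median < m_mean)
```

Here the mode is 2, the median is 3, and the mean is 4.5, so the mean is pulled toward the tail. Negating every value would mirror the data and reverse the ordering.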
3.2 – Measures of Variation
 To describe data sets accurately, statisticians must know more than
measures of central tendency
 Range
 Highest value minus lowest value
 Symbol R is used for range
 Example 3 – 19
R = highest value – lowest value
Population Variance and Standard Deviation
 Rounding rule: round to one more decimal place than that of original data
 Variance
 Average of the squares of the distance each value is from the mean
 Symbol for population variance is $\sigma^2$
 Formula is: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
 Standard Deviation
 Square root of variance
 Symbol is σ
 Formula is: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Example 3 – 22
 Find variance and standard deviation for the Brand B paint data. The values, in months, are
35, 45, 30, 35, 40, 25
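A minimal sketch of the population formulas applied to the Example 3-22 data follows; the variable names are illustrative.

```python
from math import sqrt

months = [35, 45, 30, 35, 40, 25]   # Brand B paint data from Example 3-22

N = len(months)
mu = sum(months) / N                                 # population mean
variance = sum((x - mu) ** 2 for x in months) / N    # sigma squared
std_dev = sqrt(variance)                             # sigma

# Rounding rule: one more decimal place than the original data.
print(round(variance, 1), round(std_dev, 1))
```

The mean is 35 and the squared deviations sum to 250, so the variance is 250/6, about 41.7, and the standard deviation is about 6.5.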
Sample Variance and Standard Deviation
 Formulas for sample variance and standard deviation
 Symbol is s
 Sample Variance: $s^2 = \frac{n(\sum X^2) - (\sum X)^2}{n(n-1)}$
 Sample Standard Deviation: $s = \sqrt{s^2} = \sqrt{\frac{n(\sum X^2) - (\sum X)^2}{n(n-1)}}$
Example 3 – 23
 Find the sample variance and standard deviation for the amount of
European auto sales for a sample of 6 years shown. The data are in millions
of dollars.
 11.2
11.9
12.0
12.8
13.4
14.3
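The shortcut formula for the sample variance can be checked on the Example 3-23 data as sketched below. The cross-check uses `statistics.variance` from the Python standard library, which computes the same quantity from the definition with n - 1 in the denominator.

```python
from math import sqrt
from statistics import variance

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # Example 3-23 data, millions of dollars

n = len(sales)
sum_x = sum(sales)                      # sum of X
sum_x_sq = sum(x * x for x in sales)    # sum of X squared

# Shortcut formula: [n * sum(X^2) - (sum X)^2] / [n * (n - 1)]
s_squared = (n * sum_x_sq - sum_x ** 2) / (n * (n - 1))
s = sqrt(s_squared)

print(round(s_squared, 2), round(s, 2))
# The definition-based result should agree up to floating-point error.
print(abs(s_squared - variance(sales)) < 1e-9)
```

Here the sum of X is 75.6 and the sum of X squared is 958.94, giving a sample variance of about 1.28 and a sample standard deviation of about 1.13.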
Uses of the Variance and Standard Deviation
1. Variances and standard deviations can be used to determine the spread
of data. If variance or standard deviation is large, data are more dispersed.
This information is useful in comparing two (or more) data sets to determine which
is more (most) variable.
2. Measures of variance and standard deviation are used to determine the
consistency of a variable.
3. Variance and standard deviation are used to determine number of data
values that fall within a specified interval in a distribution.
4. Variance and standard deviation are used quite often in inferential
statistics.
Coefficient of Variation
 Sometimes we are required to compare standard deviations of data that are
not in the same units
 Coefficient of variation
 Denoted by CVar, is standard deviation divided by mean
 Result is expressed as a percentage
 Samples: $\text{CVar} = \frac{s}{\bar{X}}$
 Populations: $\text{CVar} = \frac{\sigma}{\mu}$
Examples
 Example 3 – 25
 The mean of the number of sales of cars over a 3-month period is 87, and the
standard deviation is 5. The mean of the commissions is $5225, and the standard
deviation is $773. Compare the variations of the two.
 Example 3 – 26
 The mean for the number of pages of a sample of women’s fitness magazines is
132, with a variance of 23; the mean for the number of advertisements of a
sample of women’s fitness magazines is 182, with a variance of 62. Compare the
variations.
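Both examples can be worked with a small helper, sketched below. The function name `cvar` is hypothetical; it multiplies by 100 so the result reads as a percentage. Example 3-26 gives variances, so the sketch takes square roots to obtain standard deviations first.

```python
from math import sqrt

def cvar(std_dev, mean):
    """Coefficient of variation as a percentage: (standard deviation / mean) * 100."""
    return std_dev / mean * 100

# Example 3-25: sales (mean 87, s = 5) versus commissions (mean $5225, s = $773).
cv_sales = cvar(5, 87)
cv_commissions = cvar(773, 5225)

# Example 3-26: variances of 23 and 62 are converted to standard deviations.
cv_pages = cvar(sqrt(23), 132)
cv_ads = cvar(sqrt(62), 182)

print(round(cv_sales, 1), round(cv_commissions, 1))
print(round(cv_pages, 1), round(cv_ads, 1))
```

The results are about 5.7% versus 14.8% for Example 3-25 and about 3.6% versus 4.3% for Example 3-26, so commissions are more variable than sales, and the number of advertisements is more variable than the number of pages.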
Chebyshev’s Theorem
 Developed by Russian mathematician Chebyshev (1821-1894)
 Specifies proportions of spread in terms of standard deviation
 Chebyshev’s theorem
 Proportion of values from a data set that will fall within k standard deviations of
the mean will be at least $1 - 1/k^2$, where k is a number greater than 1
 k is not necessarily an integer
 Example 3 – 27
 The mean price of houses in a certain neighborhood is $50,000, and the standard
deviation is $10,000. Find the price range for which at least 75% of the houses will
sell.
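Example 3-27 can be approached by solving 1 - 1/k^2 = 0.75 for k and then stepping k standard deviations either side of the mean. The sketch below does this; both function names are hypothetical.

```python
from math import sqrt

def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def k_for_proportion(p):
    """Solve 1 - 1/k^2 = p for k, where 0 < p < 1."""
    return 1 / sqrt(1 - p)

mean, std_dev = 50_000, 10_000      # Example 3-27
k = k_for_proportion(0.75)          # at least 75% corresponds to k = 2
low = mean - k * std_dev
high = mean + k * std_dev
print(k, low, high)
```

With k = 2, at least 75% of the houses will sell between $30,000 and $70,000.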
Empirical (Normal) Rule
 Chebyshev’s theorem applies to any distribution regardless of shape
 When distribution is bell-shaped (normal), then following are true
 Empirical Rule
 Approximately 68% of the data values will fall within 1 standard deviation of the
mean.
 Approximately 95% of the data values will fall within 2 standard deviations of the
mean.
 Approximately 99.7% of the data values will fall within 3 standard deviations of the
mean.
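The Empirical Rule percentages are rounded areas under the normal curve. As a sketch, they can be reproduced with `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later); the helper name `proportion_within` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 1))
```

The areas come out to roughly 68.3%, 95.4%, and 99.7%. Chebyshev's theorem guarantees only 75% for k = 2 and about 88.9% for k = 3, because it makes no assumption about shape.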
3.3 – Measures of Position
 Measures exist for position or location within a data set
 Include standard scores, percentiles, deciles, and quartiles
 Standard score (z score)
 Value obtained by subtracting mean from value and dividing result by standard
deviation.
 Symbol for standard score is z
 Sample formula: $z = \frac{X - \bar{X}}{s}$
 Population formula: $z = \frac{X - \mu}{\sigma}$
 z score represents number of standard deviations that a data value falls above or
below the mean
 z > 0 (above mean)
 z < 0 (below mean)
 z = 0 (equal to mean)
Example 3 – 29
 A student scored 65 on a calculus test that had a mean of 50 and a
standard deviation of 10; she scored 30 on a history test with a mean of 25
and a standard deviation of 5. Compare her relative positions on the two
tests.
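Example 3-29 comes down to computing one z score per test and comparing them. A minimal sketch, with an illustrative function name:

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that a value lies above or below the mean."""
    return (value - mean) / std_dev

z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)
print(z_calculus, z_history)
```

The calculus score is 1.5 standard deviations above the mean and the history score is 1.0 above, so her relative position is higher in calculus.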
Percentiles
 Percentiles
 Divide data set into 100 equal groups
 Percentile formula
 Percentile corresponding to a given value X is computed by using the following
formula:
Percentile = $\frac{(\text{number of data values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$
Example 3 – 32
 A teacher gives a 20-point test to 10 students. The scores are shown here.
Find the percentile rank of a score of 12.
18 15 12 6 8 2 3 5 20 10
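The percentile formula can be applied to the Example 3-32 scores as sketched below; the function name `percentile_rank` is hypothetical.

```python
def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]   # Example 3-32 data
rank = percentile_rank(scores, 12)
print(round(rank))
```

Six scores (2, 3, 5, 6, 8, 10) fall below 12, so the rank is (6 + 0.5) / 10 * 100, which puts a score of 12 at the 65th percentile.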
Quartiles and Deciles
 Quartiles
 Divide distribution into four groups, separated by Q1, Q2, Q3
 Q1 is same as 25th percentile, Q2 is same as 50th percentile or median, Q3 is same
as 75th percentile
 Procedure for finding data values corresponding to Q1, Q2, Q3
1. Arrange data in order from lowest to highest
2. Find median of data values. This is Q2
3. Find median of data values that fall below Q2. This is Q1
4. Find median of data values that fall above Q2. This is Q3
Example 3 – 36
 Find Q1, Q2, Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18
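The median-of-halves procedure can be sketched as below. The function name `quartiles` is hypothetical, and the sketch assumes the halves are split by position, leaving out the middle value when n is odd. Library routines often use other quartile conventions and may return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    data = sorted(values)             # step 1: arrange in order
    n = len(data)
    q2 = median(data)                 # step 2: median of all values
    lower = data[: n // 2]            # values positioned below the median
    upper = data[(n + 1) // 2 :]      # values positioned above the median
    return median(lower), q2, median(upper)

q1, q2, q3 = quartiles([15, 13, 6, 5, 12, 50, 22, 18])   # Example 3-36 data
print(q1, q2, q3)
```

The ordered data are 5, 6, 12, 13, 15, 18, 22, 50, giving Q1 = 9, Q2 = 14, and Q3 = 20.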
IQR
 Interquartile range (IQR)
 Difference between Q3 and Q1 (IQR = Q3 – Q1)
 Range of middle 50% of data
 Used to identify outliers and as measure of variability in exploratory data analysis
 Deciles
 Divide distribution into ten groups, separated by D1, D2, … , D9
 Found by using the same formula as for percentiles
Outliers
 Outlier
 Extremely high or extremely low data value when compared with rest of data values
 Can strongly affect mean and standard deviation of a variable
 Procedure for Identifying Outliers
1. Arrange data in order and find Q1 and Q3
2. Find interquartile range: IQR = Q3 – Q1
3. Multiply IQR by 1.5
4. Subtract value obtained in step 3 from Q1 and add value to Q3
5. Check data set for any data value that is smaller than Q1 – 1.5(IQR) or larger than Q3 + 1.5(IQR)
Example 3 – 37
 Check data set for outliers: 5, 6, 12, 13, 15, 18, 22, 50
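The outlier procedure can be sketched as below for Example 3-37. The function name `find_outliers` is hypothetical, and Q1 and Q3 are found as the medians of the lower and upper halves of the ordered data.

```python
from statistics import median

def find_outliers(values):
    """Return (lower fence, upper fence, outliers) using the 1.5 * IQR rule."""
    data = sorted(values)                    # step 1: order the data
    n = len(data)
    q1 = median(data[: n // 2])              # median of the lower half
    q3 = median(data[(n + 1) // 2 :])        # median of the upper half
    iqr = q3 - q1                            # step 2
    step = 1.5 * iqr                         # step 3
    low_fence = q1 - step                    # step 4
    high_fence = q3 + step
    outliers = [x for x in data if x < low_fence or x > high_fence]   # step 5
    return low_fence, high_fence, outliers

low_fence, high_fence, outliers = find_outliers([5, 6, 12, 13, 15, 18, 22, 50])
print(low_fence, high_fence, outliers)
```

Here Q1 = 9 and Q3 = 20, so IQR = 11 and the fences are -7.5 and 36.5. The value 50 lies above the upper fence and is flagged as an outlier.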
3.4 – Exploratory Data Analysis
 The Five-Number Summary and Boxplots
 Boxplot (box and whisker plot)
 Can be used to graphically represent data set
 Plot involves five specific values called a five-number summary
1. Lowest value of data set (minimum)
2. Q1
3. Median
4. Q3
5. Highest value of data set (maximum)
Boxplot
 Procedure for Constructing a Boxplot
1. Find five-number summary for data values
2. Draw horizontal axis with a scale such that it includes maximum and minimum
data values
3. Draw a box whose vertical sides go through Q1 and Q3, and draw a vertical line
through the median
4. Draw a line from minimum data value to left side of box and a line from
maximum data value to right side of box
Example 3 – 38
 The number of meteorites found in 10 states of the United States is 89, 47,
164, 296, 30, 215, 138, 78, 48, 39. Construct a boxplot for the data.
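Step 1 of the boxplot procedure, the five-number summary, can be sketched for the Example 3-38 data as below. The function name is hypothetical, and the quartiles are taken as the medians of the lower and upper halves, which can differ from the defaults of plotting libraries.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])           # median of the lower half
    q3 = median(data[(n + 1) // 2 :])     # median of the upper half
    return data[0], q1, median(data), q3, data[-1]

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]   # Example 3-38 data
summary = five_number_summary(meteorites)
print(summary)
```

The summary is 30, 47, 83.5, 164, 296. The median sits left of the center of the box and the right line is much longer than the left, both of which point to a positively skewed distribution.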
Information Obtained from a Boxplot
1. If the median is near the center of the box, the distribution is approximately symmetric
2. If the median falls to the left of the center of the box, the distribution is positively
skewed
3. If the median falls to the right of the center, the distribution is negatively skewed
4. If the lines are about the same length, the distribution is approximately symmetric
5. If the right line is longer than the left line, the distribution is positively skewed
6. If the left line is longer than the right line, the distribution is negatively skewed
 Resistant statistic
 Median and IQR are less affected by outliers and are called resistant statistics
 Mean and standard deviation are affected more by outliers and are called nonresistant statistics