ONE-VARIABLE Data Analysis Class Notes

Statistics is a Science of Data
Class Notes


Statistics uses data to ESTIMATE UNKNOWN QUANTITIES, to make decisions, and to develop
policies.
To draw any sensible conclusions from collected data we need to SUMMARIZE the data to
EXAMINE the patterns that it forms.
Unit #1 “Exploring One-Variable Data” (One-Variable Data Analysis)
“Exploring data” includes graphical and numerical techniques used to study data (this topic appears in
8 to 12 out of 40 multiple-choice questions; in the free-response section, this topic appears in 1 to 2 out of 6
questions).
Main issues:




Who collects data and what do they do with collected data?
Data is rarely collected in a form that is immediately useful for decision-making. To be used, it needs to
be ORGANIZED and SUMMARIZED.
Descriptive methods are useful for data presentation, data reduction, and summarization. The best
method depends on the type of data being collected.
Types of Descriptive Methods (all of them complement each other):
o Tabular methods
o Graphical methods
o Numerical methods
Tabular methods (frequency distribution table; this table facilitates the analysis of
patterns of variation among observed data).
• Frequency of a value (f) is the number of times that observation occurs.
• Relative frequency of a value (rf) is the ratio of the frequency (f) to the total
number of observations (n).
• Cumulative frequency (cf) gives the number of observations less than or equal to
a specified value.
• See example 1, pages 41 to 43, from “Cracking the AP Statistics Exam” Ed
2010.
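The three tabular quantities above (f, rf, cf) can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the sample data are assumptions made for the example and do not come from the notes or the textbook.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Return rows of (value, f, rf, cf), sorted by value."""
    counts = Counter(values)                  # f for each distinct value
    n = len(values)                           # total number of observations
    ordered = sorted(counts)
    freqs = [counts[v] for v in ordered]
    cumulative = list(accumulate(freqs))      # cf: running total of f
    return [(v, f, f / n, cf) for v, f, cf in zip(ordered, freqs, cumulative)]


# Hypothetical data: number of siblings reported by 10 students.
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 1]
table = frequency_table(data)

print("value   f     rf   cf")
for value, f, rf, cf in table:
    print(f"{value:>5} {f:>3} {rf:>6.2f} {cf:>4}")
```

The last cumulative frequency always equals n, which is a quick check that no observation was lost.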
Graphical methods
1. For QUALITATIVE (categorical) data, bar charts and pie charts are used. A bar chart is
particularly useful; pie charts are frequently used but are not recommended.
2. For QUANTITATIVE data, dotplots and stemplots are used for small sets of data.
For larger sets, histograms, cumulative frequency charts, and boxplots are often
employed.
▪ To describe quantitative data we need to observe the center of the distribution,
the spread, and the shape (symmetric, left-skewed, or right-skewed distribution;
see page 47). Also look for patterns and for striking (unusual) deviations:
clusters and gaps; outliers. (See page 48, from
“Cracking the AP Statistics Exam” Ed 2010.)
3. Making a histogram using the entire data set. How do we read a histogram?
(See pages 54 to 58, from “Cracking the AP Statistics Exam” Ed
2010.)
Numerical methods for Continuous Variables. There are three types of numerical
measures:
• Measures of central tendency: Mean (population mean µ and
sample mean x̄) and Median (M).
• Measures of variation (spread): Range, Interquartile Range (IQR),
and Standard Deviation. σ denotes a population standard
deviation and σ² a population variance; s denotes a
sample standard deviation and s² a sample variance. Standard
deviation measures the typical distance between the measurements and the
mean of the set of data. s = 0 indicates that all of the measurements
are identical. A larger s (and consequently, variance) indicates a
larger spread among measurements. See example on page 63.
• Measures of position: Quartiles (divide the data into four equal
parts), Percentiles, and standardized scores (z-scores).
▪ Q1, first quartile. A number such that at most 25% of the values are
below it and at most 75% of the values are above it. In
other words, about 25% of the values are below Q1.
▪ Q2, second quartile. Same as the median. A number such that at most
50% of the values are below it and at most 50% of the values are
above it. In other words, about 25% of the values are between Q1 and Q2.
▪ Q3, third quartile. A number such that at most 75% of the values are
below it and at most 25% of the values are above it. In other words,
about 25% of the values are above Q3.
▪ Quartiles can be calculated using STAT/CALC/1-Var Stats on your
calculator.
▪ Percentiles divide the set of data into 100 equal parts. For each
variable there are 99 percentiles, denoted by P1, P2, ..., P99. Pk is the kth
percentile, which is the number such that at most k percent of the
values are below it and at most (100-k) percent of the values
are above it. For example, about 95% of the values
are below the 95th percentile P95 and about 5% of the values are above P95.
▪ P25 = Q1
▪ P50 = Q2 = M
▪ P75 = Q3
▪ Location of the kth percentile in the ordered data: l = (n+1) k / 100
▪ Z-scores are independent of the units in which the data values are
measured. They are useful when comparing observations measured
on different scales.
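As a rough illustration of the percentile location formula l = (n+1)k/100, the sketch below finds Pk in a sorted list. The notes give only the location; interpolating linearly between neighbouring values when l is not a whole number is an assumption made here, and a calculator or textbook may use a slightly different rule. The function name `percentile` and the data are hypothetical.

```python
def percentile(values, k):
    """k-th percentile using the location l = (n + 1) * k / 100 (1-based).

    Assumption: when l is not a whole number, interpolate linearly
    between the two neighbouring ordered values.
    """
    data = sorted(values)
    n = len(data)
    loc = (n + 1) * k / 100
    if loc <= 1:              # location falls before the first value
        return data[0]
    if loc >= n:              # location falls at or after the last value
        return data[-1]
    lower = int(loc)          # whole part of the location
    frac = loc - lower        # fractional part used for interpolation
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


# Hypothetical data set with n = 11 observations.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]
q1 = percentile(data, 25)   # P25 = Q1, location 3
q2 = percentile(data, 50)   # P50 = Q2 = M, location 6
q3 = percentile(data, 75)   # P75 = Q3, location 9
print(q1, q2, q3)
```

With n = 11 the locations for Q1, Q2, and Q3 are whole numbers (3, 6, and 9), so no interpolation is needed for this data set.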
Z-score = (measurement – mean) / standard deviation. The z-score gives the distance
between the measurement and the mean in units of standard deviation. A negative
z-score indicates that the measurement is smaller than the mean. A positive z-score
indicates that the measurement is larger than the mean. See example 10, page 66
and example 11, page 67.
See page 68 to see calculator instructions.
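The z-score calculation can be sketched as follows. The two exams and their means and standard deviations are invented for illustration; they are not the textbook examples cited above.

```python
def z_score(measurement, mean, std_dev):
    """Distance of a measurement from the mean, in standard deviations."""
    return (measurement - mean) / std_dev


# Hypothetical comparison of two scores measured on different scales.
z_exam_a = z_score(82, mean=70, std_dev=8)      # exam A: 82 points
z_exam_b = z_score(600, mean=500, std_dev=100)  # exam B: 600 points

print(z_exam_a, z_exam_b)
# The larger z-score marks the score that is further above its own mean.
```

Because z-scores carry no units, the 82 on exam A (z = 1.5) can be compared directly with the 600 on exam B (z = 1.0).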
BOXPLOTS: To learn how to draw one by hand and how to read it, see the
instructions on page 69.
Effect of changing units on summary measures (copy pg 71 and 72)
Summarizing Distributions








Measuring the Center.
Measuring Spread.
Measuring Position.
Empirical Rule.
Histograms
Cumulative Frequency
Box plots
Changing Units
Notes
-The presentation of data includes summarizations and descriptions, and involves concepts
such as average values, measures of dispersion, positions of various values, and the shape
of a distribution (Descriptive statistics).
-Median: the middle number of a set of numbers arranged in numerical order (for an even
number of values, the average of the two middle numbers). The
median is not affected by exactly how large the larger values are or by exactly how small
the smaller values are. It is a particularly useful measurement when the outliers are in some
way suspicious or when we want to diminish their effects.
-Mean: found by summing the items in a set and dividing by the number of items. The notation μ
is used to denote the mean of the whole population and x̄ is used to represent the mean of a
sample (or a part of a population).
Mean = ∑x/n
-Measures of center: Mean and Median
Mean > Median → We can suppose that the distribution is right skewed
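A small sketch of how the mean and the median react to one large value, using Python's standard `statistics` module. The income figures are made up for the example.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars.
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [400]    # add one very large value

print(mean(incomes), median(incomes))            # both measures agree here
print(mean(with_outlier), median(with_outlier))  # the mean is pulled upward
```

In the second data set the mean is far above the median, which is the pattern the notes associate with a right-skewed distribution.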
-Measures of spread
-Range = Max – min = Largest – Smallest
- IQR = Q3 - Q1 ( Interquartile Range)
Upper Quartile – Lower Quartile
-Variance
-Standard Deviation
-IQR: is the difference between the largest and the smallest values after removing the lower and
upper quarters of the data. IQR is the range of the middle 50%; that is, IQR = Q3 - Q1 = 75th percentile minus 25th
percentile.
-Variance: is determined by averaging the squared differences of all the values from the mean. It is
measured in square units.
-Standard Deviation: is the square root of the variance. It is measured in the same units as the
data.
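The variance and standard deviation definitions can be checked with Python's `statistics` module. One detail is assumed here rather than stated in the notes: the population versions (σ², σ) divide the sum of squared differences by n, while the sample versions (s², s) divide by n - 1. The data set is hypothetical.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical measurements; their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = pvariance(data)    # sigma squared: divide by n
pop_sd = pstdev(data)        # sigma: square root of the population variance
samp_var = variance(data)    # s squared: divide by n - 1
samp_sd = stdev(data)        # s: square root of the sample variance

print(pop_var, pop_sd)
print(samp_var, samp_sd)
```

The sample values come out slightly larger than the population values because of the smaller divisor.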
-Mean and standard deviation are most appropriate for roughly symmetric data without outliers
-IQR is a reasonable summary of spread (it uses only two quartiles and ignores information about how
individual values vary)
-Outliers: Data away from the body of the distribution (extreme values / unusual values).
Notes about Outliers:







Can be the most informative part of the data
Can be an error (then fix it if you can)
Are very important information to include in your comments
Affect the mean and variance
Median and IQR are not sensitive to outliers
Do not delete outliers; think about them
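The points about outliers can be illustrated with a short sketch. The data are invented, and `statistics.quantiles` (Python 3.8+) is used with its default "exclusive" method, which is an assumption that may not match a calculator's quartile rule exactly.

```python
from statistics import mean, median, quantiles, variance


def summary(values):
    """Mean, sample variance, median, and IQR of a data set."""
    q1, _, q3 = quantiles(values, n=4)   # three cut points: Q1, Q2, Q3
    return {
        "mean": mean(values),
        "variance": variance(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


# Hypothetical data; the second list replaces 15 with the outlier 150.
clean = summary([2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15])
dirty = summary([2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 150])

for name in ("mean", "variance", "median", "iqr"):
    print(f"{name:>8}: {clean[name]:10.2f} {dirty[name]:10.2f}")
```

Only the mean and variance move when the outlier is introduced; the median and IQR stay the same for this data set.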
Comparing Distributions





Dotplots
Double Bar Charts
Back-to-back Stemplots
Parallel Boxplots
Cumulative Frequency Plots.
Many real-life applications of statistics involve comparisons of TWO populations. We should portray both
sets simultaneously. (See EXAMPLES I, II, III, IV, and V, pages 74 to 77, from “Cracking the AP
Statistics Exam” Ed 2010.)
Unit #2 “Exploring Relationships between TWO Variables” (BIVARIATE DATA)
(To investigate the relationship between two quantitative variables)
Main topics






Scatterplots (shape, direction, and strength of relationship)
Correlation and Linearity
Least Square Regression Line
Residual plots
Outliers and Influential Points
Transformations to Achieve Linearity
Notes



Our studies so far have been concerned with measurements of a SINGLE VARIABLE. However, many
statistical applications involve two or more variables related to one another.
First question: How can the strength of an apparent relationship be measured?
Second question: How can an observed relationship be put into functional terms?
1. Scatterplot and Correlation Coefficient




If two different variables have a LINEAR RELATION, then we can measure the strength of that
relationship with the CORRELATION COEFFICIENT and describe it with a LINEAR REGRESSION.
How to SUMMARIZE the relationship between two variables? There are two commonly used
measures:
• A SCATTERPLOT, a GRAPHICAL SUMMARY MEASURE.
• The CORRELATION COEFFICIENT (r), a NUMERICAL SUMMARY MEASURE.
A SCATTERPLOT is used to describe the nature, degree, and direction of the relation between two
variables X and Y; each observation (x, y) gives an ordered pair.
A SCATTERPLOT can tell us
• Shape: Is the relation between X and Y linear or nonlinear?
• Direction: Whether the y-value increases (positive relation) or decreases (negative
relation) as x increases.
• Strength of relationship: If the trend of the data can be described with a line or a
curve, then the spread of the data values around the line or curve describes the
degree (or strength) of the relation:
o Points very close to the line or curve: a strong relationship between the
two variables.
o More scattered points: a weaker relationship between the two
variables.
o Points scattered without any apparent pattern: no relationship
between the two variables.


Correlation Coefficient (r) is a numeric measure of the degree and direction of the LINEAR RELATION
between two quantitative variables.
• -1 ≤ r ≤ 1 always
• Direction: the positive or negative sign of the correlation describes the direction of
the relation between two variables.
• Strength: the absolute value of the correlation describes the strength (or degree) of
the linear relation between two variables:
o +1 describes a perfect positive correlation
o -1 describes a perfect negative correlation
o Close to -1 or +1 describes a strong relationship between the two
variables.
o Close to 0 describes a weak linear relationship (or none).
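The notes do not write out a formula for r. The sketch below uses the standard definition, r = SSxy / √(SSxx · SSyy), where each SS term is a sum of products of deviations from the means. The function name `correlation` and the data are assumptions made for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)


# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
r_perfect_pos = correlation(xs, [2, 4, 6, 8, 10])   # points on a rising line
r_perfect_neg = correlation(xs, [10, 8, 6, 4, 2])   # points on a falling line
r_scattered = correlation(xs, [2, 1, 4, 3, 5])      # rising trend with scatter

print(r_perfect_pos, r_perfect_neg, r_scattered)
```

The sign of each result gives the direction of the relation and its absolute value gives the strength, as described in the list above.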
How to compute the correlation by using a calculator.
Page 84



How to make a scatterplot.
How to get summary statistics for bivariate data.
Least Squares Regression Line.
After we know that two variables are related to each other, we need to quantify (or estimate) the relation
between the two variables.

A Linear Regression Equation or Linear Regression Model is an equation that gives a straight-line
relationship between two variables:
Y = β0 + β1 X + e, where
Y is the dependent variable or response variable.
X is the independent variable or explanatory variable.
β0 is the y-intercept. It is the value of Y for X = 0.
β1 is the slope of the line. It gives the amount of change in Y
for every unit change in X.
e is the random error or RESIDUAL. It is the difference
between the observed and predicted value.
The predicted value of Y for a given value of X is denoted by ŷ (read “y-hat”). It is
computed by using the estimated regression line ŷ = a + bx, where a = β̂0 = the estimated
y-intercept of the regression line and b = β̂1 = the estimated slope of the regression line; e
(error or residual) = (y - ŷ).


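The least squares regression line is the line that minimizes the error sum of squares (SSE), the sum of
the squared residuals ∑(y - ŷ)². It is also known as the line of best fit.
b = β̂1 = SSxy / SSxx and a = β̂0 = Ȳ - b X̄, where SSxy = ∑(x - X̄)(y - Ȳ) and SSxx = ∑(x - X̄)².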
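A minimal sketch of the least squares estimates, following the slope and intercept formulas above with SSxy and SSxx taken as sums of products of deviations from the means. The function name `least_squares` and the data are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) for the fitted line y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b = ss_xy / ss_xx          # estimated slope
    a = y_bar - b * x_bar      # estimated y-intercept
    return a, b


# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
a, b = least_squares(xs, ys)
predicted_at_6 = a + b * 6     # y-hat for a new x value

print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"prediction at x = 6: {predicted_at_6:.2f}")
```

For this data SSxy = 8 and SSxx = 10, so the slope is 0.8 and the intercept is 3 - 0.8 × 3 = 0.6.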
Using TI-83
To make a scatterplot









Choose STAT – EDIT
Enter X-values into L1
Enter corresponding Y-values into L2
Choose scatterplot by highlighting the picture of a scatterplot under Type
Enter XList: L1
Enter YList: L2
Select type of marking from 3 available marks
Choose ZOOM – 9 : ZoomStat
Use TRACE option to trace points in the scatterplot
To compute the correlation coefficient (r):






Enter X-values into L1
Enter corresponding Y-values into L2
Choose 2nd – CATALOG – DiagnosticOn
Choose STAT-CALC- 8: LinReg(a+bx)
Enter L1 , L2 (Don’t forget the comma). This will result in LinReg(a+bx) L1 , L2
Press ENTER
This option will show a, b, r², and r for the regression line y = a + bx
To get summary statistics for bivariate data:





Enter X-values into L1
Enter corresponding Y-values into L2
Select STAT – CALC-2:2-VAR Stats
Enter L1 , L2 . This will result in 2-Var Stats L1 , L2
Press ENTER
This option will provide
X̄, ∑X, ∑X², Sx, σx, n
Ȳ, ∑Y, ∑Y², Sy, σy
∑XY, min X, max X, min Y, max Y
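For readers without the calculator, the same bivariate summary quantities can be sketched in Python. The function name `two_var_stats` and the data are assumptions; Sx is taken to be the sample standard deviation (divisor n - 1) and σx the population standard deviation (divisor n).

```python
from statistics import mean, pstdev, stdev


def two_var_stats(xs, ys):
    """Summary statistics for paired lists, similar to a 2-Var Stats screen."""
    return {
        "x_bar": mean(xs), "sum_x": sum(xs),
        "sum_x2": sum(x * x for x in xs),
        "Sx": stdev(xs), "sigma_x": pstdev(xs),
        "n": len(xs),
        "y_bar": mean(ys), "sum_y": sum(ys),
        "sum_y2": sum(y * y for y in ys),
        "Sy": stdev(ys), "sigma_y": pstdev(ys),
        "sum_xy": sum(x * y for x, y in zip(xs, ys)),
        "min_x": min(xs), "max_x": max(xs),
        "min_y": min(ys), "max_y": max(ys),
    }


# Hypothetical paired data standing in for lists L1 and L2.
stats = two_var_stats([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
for name, value in stats.items():
    print(f"{name:>8}: {value}")
```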
To compute least squares estimates of the slope (b), y-intercept (a), and coefficient of determination (r²)
for the line y = a + bx

Enter X-values into L1




Enter corresponding Y-values into L2
Choose STAT-CALC- 8: LinReg(a+bx)
Enter L1 , L2 (Don’t forget the comma). This will result in LinReg(a+bx) L1 , L2
Press ENTER
This option will show a, b, r², and r
To get a regression line superimposed on a scatterplot
• Enter X-values into L1
• Enter corresponding Y-values into L2
• Choose STAT-CALC- 8: LinReg(a+bx)
• Enter L1 , L2 , Y1. This will display LinReg(a+bx) L1 , L2 , Y1
To get Y1, use the following sequence of commands.
• Choose VARS-Y-VARS-1: Function-1: Y1
• Press ENTER
• Press GRAPH
To get a RESIDUAL Plot






Enter X-values into L1
Enter corresponding Y-values into L2
Choose STAT-CALC- 8: LinReg(a+bx)
Enter L1 , L2 , Y1. This will display LinReg(a+bx) L1 , L2 , Y1
To get Y1, use the following sequence of commands.
Choose VARS-Y-VARS-1: Function-1: Y1
Press ENTER
This will store the regression equation in Y1
On the TI-83, create a new column in the list editor and title it RESDL. This will allow you to define the column. Enter
L2 - Y1(L1). This command will fill the column titled RESDL with residuals
Press ENTER to compute residuals for the above estimated line.
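The list operation L2 - Y1(L1) amounts to subtracting each predicted value from its observed value. A sketch in Python, with hypothetical data and a helper `fit_line` that is an illustrative stand-in for the calculator's LinReg(a+bx), not its actual implementation:

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


# Hypothetical lists standing in for L1 (x-values) and L2 (y-values).
l1 = [1, 2, 3, 4, 5]
l2 = [2, 1, 4, 3, 5]

a, b = fit_line(l1, l2)
# Residual = observed y minus predicted y-hat, one per data point.
residuals = [y - (a + b * x) for x, y in zip(l1, l2)]

print([round(e, 2) for e in residuals])
```

For a least squares line with an intercept, the residuals sum to zero (up to rounding), which is a convenient check on the computation.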
To plot the computed residuals
• Choose Y=
• In this screen, deselect the regression equation \Y1 by moving the cursor over the = symbol and pressing
ENTER
• In this screen, set \Y2 = 0
• Choose 2nd-STAT PLOT-1:Plot1
• Turn Plot1 Off by highlighting OFF and pressing ENTER.
• Choose 2nd-STAT PLOT-2:Plot2
• Turn Plot2 On by highlighting ON and pressing ENTER.
• Choose scatterplot by highlighting the picture of a scatterplot under Type and pressing ENTER
• Enter Xlist: L1
• Enter Ylist: RESDL
• Select type of marking from the three available marks


Choose ZOOM – 9 : ZoomStat
Use TRACE option to trace points in the scatterplot
Transformation to achieve linearity

Using nonlinear models. Here are some examples of transformations used to linearize the regression
model:
• The log transformation (Z = ln(Y)) is used to linearize the regression model when the
relationship between X and Y suggests a model with a consistently increasing slope.
• The square root transformation (Z = √Y = Y^(1/2)) is used when the spread of observations
increases with the mean.
• The reciprocal transformation (Z = 1/Y) is used to minimize the effect of large values of
X.
• The square transformation (Z = Y²) is used when the slope of the relation consistently
decreases as the independent variable increases.
• The power transformation (ln(Y) and ln(X)) is used if the relation between dependent
and independent variables is modeled by Y = aX^b, since then ln(Y) = ln(a) + b ln(X) is linear in ln(X).
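As an illustration of the power transformation, the sketch below fits a straight line to (ln X, ln Y) and converts the result back to the constants of Y = aX^b. The data are generated from an assumed exact power law (a = 3, b = 2) purely for demonstration, and the helper name `fit_line` is hypothetical.

```python
from math import exp, log


def fit_line(xs, ys):
    """Least squares intercept and slope for y-hat = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


# Hypothetical data that follow Y = 3 * X**2 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Transform both variables, then fit a line: ln(Y) = ln(a) + b * ln(X).
log_xs = [log(x) for x in xs]
log_ys = [log(y) for y in ys]
intercept, b = fit_line(log_xs, log_ys)
a = exp(intercept)             # undo the log to recover a

print(f"Y = {a:.3f} * X^{b:.3f}")
```

With real data the points would not fall exactly on the line, and a residual plot of the transformed data would be the usual check that the transformation achieved linearity.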
Exploring Categorical Data: Frequency Tables
Categorical data is data classified into different categories.
Example 1: Students might be classified by gender (male or female) and their student status (freshman, sophomore,
junior, or senior).
Example 2: Suppose 200 students were classified by gender and academic major. The data resulted in a 2×4
contingency table, shown in Table 10.
                  Academic Major
Gender    Arts   Sciences   Engineering   Education
Female     35       15           5            25
Male       10       40          50            20
The table shows that:
• 35 female students are majoring in arts.
• 10 male students are majoring in arts.
• 15 female students are majoring in sciences.
The figures above are joint frequencies of the respective categories. The joint frequency of two categories is the
frequency with which two categories, one from each of the two classification criteria, occur together.
                         Academic Major
Gender          Arts   Sciences   Engineering   Education   Row Totals
Female           35       15           5            25           80
Male             10       40          50            20          120
Column Totals    45       55          55            45          200
This table shows that:
• There are 80 female students in this study.
• There are 120 male students in this study.
• Out of the 200 students in this study, 45 are majoring in arts subjects.
• Out of the 200 students in this study, 55 are majoring in science subjects.
• Out of the 200 students in this study, 55 are majoring in engineering subjects.
• Out of the 200 students in this study, 45 are majoring in education subjects.
From the above contingency table we can see that:
• Among female students, 35/80 or 43.75 percent are majoring in arts subjects. In other words, the
conditional percentage of arts majors among female students is 43.75 percent.
• Among male students, 50/120 or 41.67 percent are majoring in engineering subjects. In other words,
the conditional percentage of engineering majors among male students is 41.67 percent.
• Among science majors, 15/55 or 27.27 percent are females. In other words, the conditional
percentage of female students among the science majors is 27.27 percent.
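The row totals, column totals, and conditional percentages above can be reproduced with a short sketch. The counts are the ones given in the contingency table; the variable and function names are chosen only for this example.

```python
majors = ["Arts", "Sciences", "Engineering", "Education"]
# Joint frequencies from the contingency table, one row per gender.
table = {
    "Female": [35, 15, 5, 25],
    "Male": [10, 40, 50, 20],
}

# Marginal totals.
row_totals = {gender: sum(counts) for gender, counts in table.items()}
col_totals = {major: sum(table[g][i] for g in table)
              for i, major in enumerate(majors)}
grand_total = sum(row_totals.values())


def conditional_pct(joint, condition_total):
    """Percentage of the conditioning group that falls in the joint cell."""
    return 100 * joint / condition_total


arts_given_female = conditional_pct(table["Female"][0], row_totals["Female"])
eng_given_male = conditional_pct(table["Male"][2], row_totals["Male"])
female_given_sci = conditional_pct(table["Female"][1], col_totals["Sciences"])

print(row_totals, col_totals, grand_total)
print(f"{arts_given_female:.2f} {eng_given_male:.2f} {female_given_sci:.2f}")
```

The denominator depends on which group is being conditioned on: a row total for "among female students" and a column total for "among science majors".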