Semester 1 Project (1 & 7)
Chapter 1
Exploring Data
Section 1.1 – Displaying Distributions with Graphs
Introduction
• Any set of data contains information about some group of individuals
• Individuals: the objects described by a set of data
• Variable: any characteristic of an individual
• Variables can be divided into two types: categorical and quantitative
• Categorical: places an individual into one of several categories
• Quantitative: takes a numerical value
Let’s try!
A political scientist selects a large sample of registered voters.
For each voter, she records gender, age, and household income.
Which variables are quantitative and which are categorical?
• Gender: Categorical
• Age: Quantitative
• Household income: Quantitative
Distribution
• The distribution of a variable tells us what values the variable takes and how often it takes these values
• Exploratory data analysis examines and describes the data’s main features
• Two basic strategies:
1. Examine each variable by itself, then move on to the relationships among the variables
2. Begin with a graph, then add numerical summaries of specific aspects of the data
Bar Graphs
• Bar graphs help the audience grasp the distribution quickly
• To construct a bar graph:
1. Label your axes and title your graph
2. Scale your axes. Use the counts in each category to help you scale your vertical axis
3. Draw a vertical bar above each category name to a height that corresponds to the count in that category
Example: Color Preference
Determine which color students prefer to wear to class:
Red- 5
Green- 2
Blue- 5
Black- 3
[Bar graph: Color Preference, with one bar each for Red, Green, Blue, and Black]
Pie Charts
• Pie charts help us see what part of the whole each group forms
• How to construct a pie chart:
* Tip: Using a statistical software package is recommended
1. Change each count into a percent of the total
2. Estimate how much of the circle each category will cover, based on its percent
3. All percents must add up to 100% (the proportions add up to 1)
Example (15 students in total):
Red- 5 = 5/15 = .33 = 33%
Green- 2 = 2/15 = .13 = 13%
Blue- 5 = 5/15 = .33 = 33%
Black- 3 = 3/15 = .20 = 20%
(The rounded percents add to 99% rather than 100% because of rounding.)
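The percent step for the color-preference counts can be sketched in a few lines of Python. This is only an illustration; the function name `to_percents` is made up for this example and is not part of the original slides.

```python
# Color-preference counts taken from the example above.
counts = {"Red": 5, "Green": 2, "Blue": 5, "Black": 3}

def to_percents(counts):
    """Return each category's share of the total, as a percent."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

percents = to_percents(counts)
for category, pct in percents.items():
    print(f"{category}: {pct:.1f}%")
```

Working from unrounded percents and rounding only for display keeps the total at exactly 100%.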
Dotplot
• A dotplot helps display quantitative data
• How to construct:
1. Draw one horizontal line going across
2. Label the axis
3. Scale the axis
4. Put a dot in the correct place for every value that appears in the data
You roll a die 50 times and record the
numbers that you got. Using the data
provided, construct a dotplot for this
observation.
Data:
8 1’s
6 2’s
6 3’s
9 4’s
7 5’s
14 6’s
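As a rough sketch, the die-roll counts can be turned into a text dotplot, with one `*` stacked above each face for every time it was rolled. The helper name `text_dotplot` is hypothetical, and a graphing package would normally be used instead.

```python
# Die-roll counts from the example above: face -> number of times rolled.
rolls = {1: 8, 2: 6, 3: 6, 4: 9, 5: 7, 6: 14}

def text_dotplot(counts):
    """Build a dotplot as text rows, stacking one '*' per observation above each value."""
    values = sorted(counts)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # Mark a dot at this height only for values observed at least `level` times.
        row = " ".join("*" if counts[v] >= level else " " for v in values)
        rows.append(row.rstrip())
    rows.append(" ".join(str(v) for v in values))  # the labeled axis
    return rows

plot = text_dotplot(rolls)
print("\n".join(plot))
```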
Overall Pattern of a Distribution
• To describe the overall pattern of a distribution:
1. Give the center and the spread
2. See if the distribution has a simple shape that you can describe
• The center is the value that divides the observations in half
• The spread is given by the smallest and largest values
• An outlier in any graph of data is an observation that falls outside the overall pattern of the graph
Stemplot
• Stemplots are used when the values of a variable are too spread out for us to make a reasonable dotplot
• How to construct a stemplot:
1. Separate each observation into a stem consisting of all but the rightmost digit and a leaf, the final digit
2. Write the stems vertically in increasing order from top to bottom, and draw a vertical line to the right of the stems. Then write each leaf to the right of its stem
3. Write the stems again, and rearrange the leaves in increasing order out from the stem
4. Title your graph and add a key describing what the stems and leaves represent
Given these values, construct a stemplot.
40 26 39 14 42 18 25 43 46 27 19 47
19 26 35 34 15 44 40 38 31 46 52 59
1 | 45899
2 | 5667
3 | 14589
4 | 00234667
5 | 29
Key: 5 | 2 = 52
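The construction steps can be checked with a short Python sketch. It assumes non-negative whole-number data and one-digit leaves; the function name `stemplot` is chosen only for this illustration.

```python
from collections import defaultdict

# The 24 values from the example above.
data = [40, 26, 39, 14, 42, 18, 25, 43, 46, 27, 19, 47,
        19, 26, 35, 34, 15, 44, 40, 38, 31, 46, 52, 59]

def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting first puts leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    # List every stem from smallest to largest, even one with no leaves.
    return [f"{stem} | {''.join(leaves[stem])}"
            for stem in range(min(leaves), max(leaves) + 1)]

rows = stemplot(data)
print("\n".join(rows))
print("Key: 5 | 2 = 52")
```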
Histograms
• The most common graph of the distribution of one quantitative variable is a histogram
• How to construct a histogram:
1. Divide the range of the data into classes of equal width. Count the number of observations in each class
2. Label and scale your axes and title your graph. The vertical axis contains the scale of counts
3. Draw a bar that represents the count in each class. The base of a bar should cover its class, and the bar height is the class count
Example: The data below are the numbers of unprovoked attacks by alligators on people in Florida each year for a 33-year period. Construct a histogram for this distribution:
6 12 2 4 17 4 6 10 3 9 13 9 15
14 6 18 1 9 6 6 11 24 14 14 5
17 17 5 13 22 20 3 5
Class counts:
1 – 5 = 9
6 – 10 = 9
11 – 15 = 8
16 – 20 = 5
21 – 25 = 2
[Histogram: vertical axis Frequency; horizontal axis Number of Unprovoked Attacks, with classes 1-5, 6-10, 11-15, 16-20, 21-25]
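The class counts for the alligator data can be reproduced with a small Python sketch. The helper `class_counts` and its parameters are assumptions made for this illustration; the classes are the ones used in the slide (width 5, starting at 1).

```python
# Yearly counts of unprovoked alligator attacks from the example above (33 years).
attacks = [6, 12, 2, 4, 17, 4, 6, 10, 3, 9, 13, 9, 15,
           14, 6, 18, 1, 9, 6, 6, 11, 24, 14, 14, 5,
           17, 17, 5, 13, 22, 20, 3, 5]

def class_counts(values, start, width, n_classes):
    """Count observations in equal-width classes beginning at `start`."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts

counts = class_counts(attacks, start=1, width=5, n_classes=5)
for i, count in enumerate(counts):
    low = 1 + 5 * i
    # One '#' per observation gives a sideways text histogram.
    print(f"{low:>2}-{low + 4:<2} {'#' * count} ({count})")
```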
Symmetric and Skewed Distributions
• A distribution is symmetric if the right and left sides of the histogram are almost mirror images of each other
• A distribution is skewed to the right if the right side of the histogram extends much farther out than the left side
• A distribution is skewed to the left if the left side of the histogram extends much farther out than the right side
[Figures: example histograms of a left-skewed distribution, a right-skewed distribution, and a symmetric distribution]
Percentile
• The pth percentile of a distribution is the value such that p percent of the observations fall at or below it
• For example:
You may have received a standardized test score report that said you were in the 80th percentile. This means that 80% of the people who took the test earned scores that were less than or equal to your score. The remaining 20% of students earned a higher score than you.
Tip: Think of it like your SAT scores: if you are in the 60th percentile, you did as well as or better than 60% of the students who also took the SAT.
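The "at or below" definition can be made concrete with a minimal Python sketch. The scores below are hypothetical values invented for illustration, and `percent_at_or_below` is not a standard library function.

```python
def percent_at_or_below(scores, value):
    """Percent of observations that are less than or equal to `value`."""
    return 100 * sum(s <= value for s in scores) / len(scores)

# Hypothetical test scores, made up for illustration only.
scores = [52, 61, 64, 70, 73, 75, 80, 84, 88, 95]

# 8 of the 10 scores are at or below 84, so 84 sits at the 80th percentile here.
print(percent_at_or_below(scores, 84))
```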
Ogive
• Also known as a cumulative relative frequency graph
• Helps us understand the relative standing of an individual observation
• How to construct an ogive:
1. Decide on class intervals and make a frequency table, just as in making a histogram. Add three columns to your frequency table: relative frequency, cumulative frequency, and relative cumulative frequency
2. Label and scale your axes and title your graph
3. Plot a point corresponding to the relative cumulative frequency in each class interval at the left endpoint of the next class interval
Ogive (Cont’d)
• To get the frequency, count the number of times each value (or class) appears
• To fill in the relative frequency column, divide each frequency by the total number of observations to find its % of the data
• To fill in the cumulative frequency column, add each frequency to the frequencies of all the classes before it
• For the relative cumulative frequency column, add the %’s together in the same running fashion
Example:
Construct an ogive with the data provided
Twenty-nine female raccoons were observed, and the number of male partners each female had during the time she was accepting partners (generally 1 to 4 days each year) was recorded:
1 3 2 1 1 4 2 4 1 1 1 3 1 1 1
1 2 2 1 1 4 1 1 2 1 1 1 1 3
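The four columns of the frequency table behind the ogive can be built for the raccoon data with a short Python sketch. The function name `ogive_table` and the row layout are assumptions for this illustration; each distinct value is treated as its own class.

```python
from collections import Counter
from itertools import accumulate

# Number of male partners for each of the 29 female raccoons (from the example above).
partners = [1, 3, 2, 1, 1, 4, 2, 4, 1, 1, 1, 3, 1, 1, 1,
            1, 2, 2, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 3]

def ogive_table(values):
    """Rows of (value, frequency, relative, cumulative, relative cumulative)."""
    freq = Counter(values)
    n = len(values)
    keys = sorted(freq)
    running = list(accumulate(freq[k] for k in keys))  # running totals of the counts
    return [(k, freq[k], freq[k] / n, c, c / n) for k, c in zip(keys, running)]

table = ogive_table(partners)
for value, f, rel, cum, rel_cum in table:
    print(f"{value}: freq={f:2d} rel={rel:.3f} cum={cum:2d} rel_cum={rel_cum:.3f}")
```

The last column is what gets plotted on the vertical axis of the ogive.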
Time Plot
• A time plot of a variable plots each observation against the time at which it was measured. Always mark the time scale on the horizontal axis and the variable of interest on the vertical axis
• When examining a time plot, look once again for an overall pattern and for strong deviations from the pattern
• Trend: a long-term upward or downward movement over time
• Seasonal variation: a pattern that repeats itself at regular time intervals
[Figure: example of a time plot]
Section 1.2 – Describing Distributions with Numbers
• Mean: the average of the observations
• Median: the midpoint of the values (center)
• Interquartile Range (IQR): IQR = Q3 – Q1
• Outlier: an observation that is
less than Q1 – 1.5 x IQR, or
more than Q3 + 1.5 x IQR
The Five Number Summary
• Overall description of a distribution: Min, Q1, M, Q3, Max
Example:
22 25 34 |35| 41 41 46 |46| 46 47 49 |54| 54 59 60
Min = 22
Q1 = 35
M = 46
Q3 = 54
Max = 60
IQR and Outlier
• IQR = Q3 – Q1 = 54 – 35 = 19
• Finding outliers:
Q1 – 1.5 x IQR
35 – 1.5 x 19 = 6.5 (Lower cutoff)
Q3 + 1.5 x IQR
54 + 1.5 x 19 = 82.5 (Upper cutoff)
• Every observation lies between 6.5 and 82.5, so there are no outliers.
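The five-number summary and the 1.5 x IQR check can be sketched in Python as follows. This sketch assumes the quartile convention used in the example (Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when the count is odd); statistical software may use slightly different quartile rules. The function names are chosen for this illustration.

```python
from statistics import median

data = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(values)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[len(xs) - half:]  # the middle value is skipped if n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(values):
    """Observations below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

print(five_number_summary(data))
print(outliers(data))
```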
Chapter 7
Random Variables
Section 7.1 – Discrete and Continuous Random Variables
• Random variable: a variable whose value is a numerical outcome of a random phenomenon
• Discrete random variables
• The outcome probabilities must be between 0 and 1 and have a sum of 1.
• When the outcomes are numerical, they are values of a random variable.
• A discrete random variable X has a countable number of possible values.
• The probability distribution of X lists the values and their probabilities.
Value of X:   x₁  x₂  x₃  …  xₖ
Probability:  p₁  p₂  p₃  …  pₖ
• The probabilities pᵢ must satisfy two requirements:
1) Every probability pᵢ is a number between 0 and 1.
2) p₁ + p₂ + … + pₖ = 1.
• Find the probability of any event by adding the probabilities pᵢ of the particular values xᵢ that make up the event.
7.1 - Example #1
• The instructor of a large class gives 15% each of A’s and D’s, 30% each of B’s and C’s, and 10% F’s. Choose a student at random from this class. To “choose at random” means to give every student the same chance to be chosen. The student’s grade on a four-point scale (A = 4) is a random variable X.
• The value of X changes when we repeatedly choose students at random, but it is always one of 0, 1, 2, 3, or 4.
This is the distribution of X:
Grade:        0     1     2     3     4
Probability:  0.10  0.15  0.30  0.30  0.15
• The probability that the student got a B or better is the sum of the probabilities of an A and a B:
P(grade is a 3 or 4) = P(X = 3) + P(X = 4)
= 0.30 + 0.15
= 0.45
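The two requirements on the probabilities and the "add up the values in the event" rule can be sketched in Python using the grade distribution. The helper names `is_valid_distribution` and `prob` are hypothetical, introduced only for this example.

```python
# Grade distribution from the example above (A = 4, ..., F = 0).
grade_dist = {0: 0.10, 1: 0.15, 2: 0.30, 3: 0.30, 4: 0.15}

def is_valid_distribution(dist, tol=1e-9):
    """Check both requirements: each p is between 0 and 1, and the p's sum to 1."""
    in_range = all(0 <= p <= 1 for p in dist.values())
    return in_range and abs(sum(dist.values()) - 1) < tol

def prob(dist, event):
    """P(event): add the probabilities of the values that make up the event."""
    return sum(p for x, p in dist.items() if event(x))

# "B or better" is the event X >= 3.
p_b_or_better = prob(grade_dist, lambda x: x >= 3)
print(is_valid_distribution(grade_dist), round(p_b_or_better, 2))
```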
Probability Histogram
• We can use histograms to display probability distributions as well as distributions of data.
• Probability histograms can be used, for example, to compare the probability model for random digits with the model given by Benford’s law (Chapter 6).
• The height of each bar represents the probability of the outcome at its base.
• The heights all add to 1.
• Using histograms helps us quickly compare the two distributions.
Continuous Random Variables
• When we use the table of random digits to select a digit between 0 and 9, the result is a discrete random variable.
• This is one way of assigning probabilities, by using the random digits table.
• However, for certain events this may be impossible because there are infinitely many possible values.
• A different way of assigning probabilities to events is to use areas under a density curve.
• The total area under a density curve is exactly 1, corresponding to a total probability of 1.
• This is an important way of assigning probabilities to events.
Continuous Random Variables (Cont’d)
• A continuous random variable X takes all values in an interval of numbers.
• The probability distribution of X is described by a density curve.
• The probability of any event is the area under the density curve and above the values of X that make up the event.
• The probability model for a continuous random variable assigns probabilities to intervals of outcomes rather than to individual outcomes.
• All continuous probability distributions assign probability 0 to every individual outcome.
Normal Distributions as Probability Distributions
• Normal distributions are probability distributions.
• This is because density curves describe an assignment of probabilities.
• As we know, N(μ, σ) is the shorthand notation for the normal distribution with mean μ and standard deviation σ. In the language of random variables, if X has the N(μ, σ) distribution, then the standardized variable
Z = (X – μ) / σ
is a standard normal random variable having the distribution N(0, 1).
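A brief Python sketch can illustrate why standardizing works: the probability that X falls at or below x under N(μ, σ) equals the probability that Z falls at or below z under N(0, 1). The values of μ, σ, and x here are hypothetical, chosen only for the illustration; `NormalDist` is from Python's standard `statistics` module.

```python
from statistics import NormalDist

def standardize(x, mu, sigma):
    """Z = (X - mu) / sigma."""
    return (x - mu) / sigma

# Hypothetical values, not taken from the slides.
mu, sigma, x = 100.0, 15.0, 130.0

z = standardize(x, mu, sigma)
p_direct = NormalDist(mu, sigma).cdf(x)    # area to the left of x under N(mu, sigma)
p_standard = NormalDist(0, 1).cdf(z)       # area to the left of z under N(0, 1)
print(z, round(p_direct, 4), round(p_standard, 4))
```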
Section 7.2 – Means and Variances of Random Variables
• Rules for Variances
• Two random variables X and Y are independent if knowing that any event involving X alone did or did not occur tells us nothing about the occurrence of any event involving Y alone.
• When random variables are not independent, the variance of their sum depends on the correlation between them as well as on their individual variances.
• We use ρ, the Greek letter rho, for the correlation between two random variables.
• The correlation between two independent random variables is zero.
• Rule 1. If X is a random variable and a and b are fixed numbers, then
σ²a+bX = b²σ²x
• Rule 2. If X and Y are independent random variables, then
σ²x+y = σ²x + σ²y
σ²x-y = σ²x + σ²y
• This is the addition rule for variances of independent random variables.
• Rule 3. If X and Y have correlation ρ, then
σ²x+y = σ²x + σ²y + 2ρσxσy
σ²x-y = σ²x + σ²y - 2ρσxσy
• This is the general addition rule for variances of random variables.
Combining Normal Random Variables
• Any linear combination of independent normal random variables is also normally distributed. That is, if X and Y are independent normal random variables and a and b are any fixed numbers, aX + bY is also normally distributed. In particular, the sum or difference of independent normal random variables has a normal distribution.
7.2 - Example #1
• A college uses SAT scores as one criterion for admission. Experience has shown that the distribution of SAT scores among its entire population of applicants is such that
SAT Math score X:    μx = 625   σx = 90
SAT Verbal score Y:  μy = 590   σy = 100
What are the mean and standard deviation of the total score X + Y
among students applying to this college?
The mean overall SAT score is
μx+y = μx + μy = 625 + 590 = 1215
The variance and standard deviation of the total cannot be computed
from the information given. SAT verbal and math scores are not
independent, because students who score high on one exam tend to
score high on the other also. Therefore Rule 2 does not apply and we
need to know ρ, the correlation between X and Y, to apply Rule 3.
7.2 - Example #1 (Cont’d)
• Nationally, the correlation between SAT Math and Verbal scores is about ρ = 0.7.
If this is true for these students,
σ²x+y = σ²x+σ²y + 2ρσxσy
= (90)² + (100)² + (2)(0.7)(90)(100)
= 30,700
The variance of the sum X + Y is greater than the sum of the variances
σ²x+σ²y because of the positive correlation between SAT Math scores
and SAT Verbal scores. That is, X and Y tend to move up together and
down together, which increases the variability of their sum. We find
the standard deviation from the variance:
σx+y = √30,700 ≈ 175
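The arithmetic for the SAT total can be checked with a minimal Python sketch of the general addition rule. The function name `variance_of_sum` is made up for this illustration; setting ρ = 0 shows what the result would be if the scores were independent.

```python
from math import sqrt

def variance_of_sum(sd_x, sd_y, rho):
    """General addition rule: Var(X + Y) = sd_x^2 + sd_y^2 + 2*rho*sd_x*sd_y."""
    return sd_x**2 + sd_y**2 + 2 * rho * sd_x * sd_y

# SAT Math (sd 90) and Verbal (sd 100) with correlation 0.7, as in the example.
var_total = variance_of_sum(90, 100, rho=0.7)
sd_total = sqrt(var_total)

# For comparison only: the variance if the scores were independent (rho = 0).
var_if_independent = variance_of_sum(90, 100, rho=0.0)

print(round(var_total), round(sd_total), round(var_if_independent))
```

The positive correlation adds 2ρσxσy = 12,600 to the independent-case variance of 18,100.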
7.2 - Example #2
• Zadie has invested 20% of her funds in Treasury bills and 80% in an “index fund” that represents all U.S. common stocks. The rate of return on an investment over a time period is the percent change in the price during the time period, plus any income received. If X is the annual return on T-bills and Y the annual return on stocks, the portfolio rate of return is
R = 0.2X + 0.8Y
The returns X and Y are random variables because they vary from year
to year. Based on annual returns between 1950 and 2000, we have
X = annual return on T-bills    μx = 5.2%    σx = 2.9%
Y = annual return on stocks     μy = 13.3%   σy = 17.0%
Correlation between X and Y     ρ = -0.1
Stocks had higher returns than T-bills on the average, but the standard deviations show that returns on stocks varied much more from year to year. That is, the risk of investing in stocks is greater than the risk for T-bills because their returns are less predictable.
7.2 - Example #2 (Cont’d)
• For the return R on Zadie’s portfolio of 20% T-bills and 80% stocks,
R = 0.2X + 0.8Y
μR = 0.2μx + 0.8μy
= (0.2 x 5.2) + (0.8 x 13.3) = 11.68%
To find the variance of the portfolio return, combine Rule 1 and Rule 3:
σ²R = σ²0.2X + σ²0.8Y + 2ρσ0.2Xσ0.8Y
= (0.2)²σ²x + 0.8²σ²y + 2ρ(0.2σx)(0.8σy)
= (0.2)²(2.9)² + (0.8)²(17.0)² + (2)(-0.1)(0.2 x 2.9)(0.8 x 17.0)
= 183.719
σR = √183.719 = 13.55%
The portfolio has a smaller mean return than an all-stock portfolio, but it
is also less risky. As a proportion of the all-stock values, the reduction in
standard deviation is greater than the reduction in mean return. That’s
why Zadie put some funds into Treasury bills.
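The portfolio calculation combines the rule for means with Rules 1 and 3 for variances, and can be sketched in Python. The function name `linear_combo_mean_sd` is hypothetical; the inputs are the percent figures from the example.

```python
from math import sqrt

def linear_combo_mean_sd(a, b, mu_x, mu_y, sd_x, sd_y, rho):
    """Mean and standard deviation of R = a*X + b*Y for correlated X and Y."""
    mean = a * mu_x + b * mu_y
    # Rule 1 scales each sd by its coefficient; Rule 3 adds the correlation term.
    variance = (a * sd_x)**2 + (b * sd_y)**2 + 2 * rho * (a * sd_x) * (b * sd_y)
    return mean, sqrt(variance)

mean_r, sd_r = linear_combo_mean_sd(0.2, 0.8, 5.2, 13.3, 2.9, 17.0, rho=-0.1)

# Reductions relative to an all-stock portfolio (mean 13.3%, sd 17.0%).
mean_cut = 1 - mean_r / 13.3
sd_cut = 1 - sd_r / 17.0
print(round(mean_r, 2), round(sd_r, 2), round(mean_cut, 3), round(sd_cut, 3))
```

Comparing `sd_cut` with `mean_cut` is one way to see the closing claim that risk falls proportionally more than the mean return.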
7.2 Mean and Variances of Random Variables (Continued)
• Mean x-bar: the ordinary average of a set of observations
• Mean of a random variable X: an average of the possible values of X, weighted by their probabilities
Example: Taking X to be the amount your ticket pays you, the probability distribution of X is:
Payoff x:     $0     $500
Probability:  0.999  0.001
Long-run average: $500 x (1/1000) + $0 x (999/1000) = $0.50
• You will often find the mean of a random variable X called the expected value.
Mean of a Discrete Random Variable
The mean of a discrete random variable X is a weighted average of the possible values that the random variable can take. Unlike the sample mean of a group of observations, which gives each observation equal weight, the mean of a random variable weights each outcome xᵢ according to its probability, pᵢ. The common symbol for the mean (also known as the expected value of X) is μ, formally defined by
μ = x₁p₁ + x₂p₂ + … + xₖpₖ
The mean of a random variable provides the long-run average of the variable, or the expected average outcome over many observations.
Example: Suppose an individual plays a gambling game where it is possible to lose $1.00, break even, win $3.00, or win $5.00 each time she plays. The probability distribution for each outcome is provided by the following table:
Outcome:      -$1.00  $0.00  $3.00  $5.00
Probability:   0.30   0.40   0.20   0.10
The mean outcome for this game is calculated as follows:
μ = (-1 x 0.3) + (0 x 0.4) + (3 x 0.2) + (5 x 0.1) = -0.3 + 0 + 0.6 + 0.5 = 0.8.
In the long run, then, the player can expect to win about 80 cents per play on average; the odds are in her favor.
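The weighted-average definition of the mean can be sketched in Python and applied to both the lottery ticket and the gambling game. The dictionaries map each payoff to its probability, using the payoffs listed in the tables; `expected_value` is a name chosen for this illustration.

```python
def expected_value(dist):
    """Mean of a discrete random variable: the sum of value x probability."""
    return sum(x * p for x, p in dist.items())

# Payoff -> probability, taken from the two examples above.
lottery = {0: 0.999, 500: 0.001}
game = {-1: 0.30, 0: 0.40, 3: 0.20, 5: 0.10}

print(round(expected_value(lottery), 2))
print(round(expected_value(game), 2))
```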
• Continuous random variable X: described by a density curve.
• Mean: a measure of the center of a distribution.
• The variance of a random variable X is denoted by σ², and is sometimes written as Var(X).
• The variance of a random variable can be defined as the expected value of the square of the difference between the random variable and the mean.
• Given that the random variable X has a mean of μ, the variance is expressed as:
σ² = Var(X) = E[(X – μ)²]
Variance of a Discrete Random Variable
• Discrete random variables are introduced here. The related concepts of mean, expected value, variance, and standard deviation are also discussed.
• Let X be a numerically valued random variable with expected value µ = E(X). Then the variance of X, denoted by V(X), is
V(X) = E((X − µ)²)
• For a discrete random variable, this is V(X) = (x₁ − µ)²p₁ + (x₂ − µ)²p₂ + … + (xₖ − µ)²pₖ
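For a discrete random variable, the definition V(X) = E((X − µ)²) becomes a probability-weighted sum of squared deviations. A minimal Python sketch, applied to the gambling game payoffs from the earlier table (the function names are chosen for this illustration):

```python
from math import sqrt

def mean(dist):
    """E(X) for a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """V(X) = E((X - mu)^2): squared deviations weighted by probability."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

game = {-1: 0.30, 0: 0.40, 3: 0.20, 5: 0.10}
var_game = variance(game)
sd_game = sqrt(var_game)  # the standard deviation is the square root of the variance
print(round(var_game, 2), round(sd_game, 2))
```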
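• Law of Large Numbers: as the number of observations increases, the sample mean x-bar gets closer and closer to the mean μ of the population. This is a remarkable fact because it holds for any population, not just for some special class such as normal distributions.
• The mean μ of a random variable is the average value of the variable in two senses:
• μ is the average of the possible values, weighted by their probability of occurring.
• μ is also the long-run average of many independent observations on the variable.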
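The long-run-average idea can be illustrated with a small simulation in Python. This sketch reuses the gambling game payoffs from the earlier table (mean 0.8) and a fixed seed so that runs are repeatable; the function name and sample sizes are arbitrary choices for illustration.

```python
import random

# Payoffs and probabilities of the gambling game from the earlier example.
values = [-1, 0, 3, 5]
probs = [0.30, 0.40, 0.20, 0.10]

def simulated_mean(n, seed=0):
    """Average payoff over n simulated plays of the game."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    plays = rng.choices(values, weights=probs, k=n)
    return sum(plays) / n

# The average tends to settle near the mean of 0.8 as n grows.
for n in (10, 1_000, 100_000):
    print(n, simulated_mean(n))
```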
Rules for Means:
• RULE 1: If X is a random variable and a and b are fixed numbers, then
μa+bX = a + bμx
• RULE 2: If X and Y are random variables, then
μx+y = μx + μy