Psyc 235:
Introduction to Statistics
Lecture Format
• New Content/Conceptual Info
• Questions & Work through problems
What you should have accomplished so far…
• ALEKS account set up
• Completed first assessment
• Worked through first section of material
• Spent 5+ hours on ALEKS
• Watched the video “What is statistics?”
Any questions/problems so far?
From Last week:
• Definition of Statistics…
 C: Collecting …
 O: Organizing …
 D: Displaying …
 I: Interpreting …
 A: Analyzing …
 … Data
What is Data?
• Data is the generic term for numerical
information that has been obtained on a set of
objects/individuals etc.
• Variable:
 - Some characteristic of the objects/individuals (e.g., height)
• Data:
 - The values of a variable for a certain set of objects/individuals
Two branches of statistics:
Descriptive Statistics
Describes a given set of data you have.
Inferential Statistics
Given the data you have about these people,
does this say anything about other people?
Today: Descriptive Statistics
• Graphical Presentations of Distributions
 - Histograms
 - Frequency Polygons
 - Cumulative Distributions
 - Box-and-whisker plots
• Descriptive Measures of Data
 - Measures of Central Tendency
 - Measures of Dispersion
Organizing Data
• Data from last week (30 wake-up times, sorted):
 6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
 8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
 9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50
• Frequency Table

 Time Awake     Number of Students
 6:30-7:00      1
 7:00-7:30      1
 7:30-8:00      3
 8:00-8:30      2
 8:30-9:00      4
 9:00-9:30      5
 9:30-10:00     7
 10:00-10:30    4
 10:30-11:00    3
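As a rough illustration of how the listed data turns into the frequency table, here is a minimal Python sketch. The helper names (`to_minutes`, `to_clock`, `frequency_table`) are made up for this example, and it assumes each bin includes its lower edge but not its upper edge (so 7:30 counts in 7:30-8:00), which is what the counts in the table imply.

```python
from collections import Counter

# The 30 wake-up times from the listed data, written as "H:MM" strings.
WAKE_TIMES = """
6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50
""".split()


def to_minutes(clock):
    """Convert an "H:MM" string to minutes after midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


def to_clock(minutes):
    """Convert minutes after midnight back to an "H:MM" string."""
    return f"{minutes // 60}:{minutes % 60:02d}"


def frequency_table(times, start="6:30", width=30, n_bins=9):
    """Count observations per bin; each bin includes its lower edge only."""
    first = to_minutes(start)
    # Integer division by the bin width gives the index of the bin.
    counts = Counter((to_minutes(t) - first) // width for t in times)
    table = []
    for k in range(n_bins):
        low = first + k * width
        label = f"{to_clock(low)}-{to_clock(low + width)}"
        table.append((label, counts.get(k, 0)))
    return table


table = frequency_table(WAKE_TIMES)
for label, count in table:
    print(f"{label:<14}{count}")
```

The bin width and starting point are choices, not facts about the data; a different width would give a differently shaped table and histogram.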
Histograms
[Figure: histogram of the wake-up data. x-axis: Wake-Up Time in half-hour bins from 6:30 to 11:00; y-axis: Number of Students, 0 to 8.]
Note: Use a histogram to spot patterns in the data (skew, etc.).
Frequency Polygon
 Time Awake     Number of Students   Frequency
 6:30-7:00      1                    0.0333
 7:00-7:30      1                    0.0333
 7:30-8:00      3                    0.1
 8:00-8:30      2                    0.0667
 8:30-9:00      4                    0.1333
 9:00-9:30      5                    0.1667
 9:30-10:00     7                    0.2333
 10:00-10:30    4                    0.1333
 10:30-11:00    3                    0.1
 Total          30                   1

[Figure: frequency polygon of the wake-up data. x-axis: Time Awake in half-hour bins from 6:30 to 11:00; y-axis: Proportion of Students, 0 to 0.25.]
Cumulative Frequency
 Time Awake     Number of Students   Frequency   Cumulative
 6:30-7:00      1                    0.0333      0.0333
 7:00-7:30      1                    0.0333      0.0667
 7:30-8:00      3                    0.1         0.1667
 8:00-8:30      2                    0.0667      0.2333
 8:30-9:00      4                    0.1333      0.3667
 9:00-9:30      5                    0.1667      0.5333
 9:30-10:00     7                    0.2333      0.7667
 10:00-10:30    4                    0.1333      0.9000
 10:30-11:00    3                    0.1         1.0000
 Total          30                   1

[Figure: cumulative frequency plot of the wake-up data. x-axis: Time Awake in half-hour bins from 6:30 to 11:00; y-axis: 0 to 1.2.]
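The Frequency and Cumulative columns can be recomputed from the bin counts alone. The short Python sketch below shows one way to do it; the variable names are illustrative, and the counts are simply copied from the table.

```python
from itertools import accumulate

# Bin counts copied from the frequency table (30 students in total).
counts = [1, 1, 3, 2, 4, 5, 7, 4, 3]
total = sum(counts)

# Relative frequency: the share of all students that falls in each bin.
relative = [c / total for c in counts]

# Cumulative frequency: running total of the counts, divided by the grand total.
cumulative = [running / total for running in accumulate(counts)]

for c, r, cum in zip(counts, relative, cumulative):
    print(f"{c:>2}  {r:.4f}  {cum:.4f}")
```

The last cumulative value is always 1, because by then every observation has been counted.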
Box and Whisker Plots
• Graphical representation of the quartiles (i.e., the data is
split into 4 equally sized groups)
• If there are an even number of observations, let the “top”
be the top half, and let the “bottom” be the bottom half.
• If there are an odd number of observations, let the “top”
be everything above the median and the “bottom” be
everything below the median.
• The first quartile is the “median of the bottom”. The third
quartile is the “median of the top”.
Box-and-Whisker Example
 6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
 8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
 9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50

Median: 9:20
1st Quartile: 8:30
3rd Quartile: 9:45
Again, note the information you can obtain by looking at this graphical representation of the data.
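The quartile rule described for box-and-whisker plots (median of the bottom half, median of the top half) can be sketched in Python as follows. The function name `five_number_summary` and the helpers are made up for this example; `statistics.median` is from the standard library. Note that statistical software often uses slightly different quartile conventions, so other tools may report slightly different values.

```python
from statistics import median

WAKE_TIMES = """
6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50
""".split()


def to_minutes(clock):
    """Convert an "H:MM" string to minutes after midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


def to_clock(value):
    """Convert minutes after midnight to "H:MM", rounded to the nearest minute."""
    whole = round(value)
    return f"{whole // 60}:{whole % 60:02d}"


def five_number_summary(values):
    """Minimum, quartiles, and maximum using the 'median of each half' rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    bottom = ordered[:half]
    # With an odd number of observations the median itself belongs to neither half.
    top = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": median(bottom),
        "median": median(ordered),
        "q3": median(top),
        "max": ordered[-1],
    }


summary = five_number_summary([to_minutes(t) for t in WAKE_TIMES])
print({name: to_clock(value) for name, value in summary.items()})
```

For the wake-up data this reproduces the median and quartiles quoted in the example; the minimum and maximum give the ends of the whiskers.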
Graphical Presentations of Data
• Listed Data: All data available
• Frequency Table: Data frequency for each cell is available
• Histograms: Data frequency for each bin is available
• Polygons: Data frequency for each bin is available
• Box-and-whisker plots: Summary info and data range available
(Less and less information as you move down the list.)
• Often: Just summarize key features of the distribution.
Describing Distributions
Summary Measures
• Measures of Central Tendency
 “Average”, “Location”, “Center” of the distribution.
• Measures of Dispersion
 “Spread”, “Variability” of the distribution.
Measures of Central Tendency
• Mean
• Median
• Mode
• You may already be familiar with these
concepts, but I want you to think of them in
relation to describing data.
Mode
• Most frequent observation or observation
class
• There can be several distinct modes
• “Best guess” in single shot guessing game
[Figure: example chart with categories A, B, C, D; data labels 19, 5, 12.]
Mode (example data)
 6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
 8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
 9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50

Mode?
9:30
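Finding the mode amounts to counting how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library; the function name `modes` is illustrative, and it returns a list because, as noted above, there can be several distinct modes.

```python
from collections import Counter

WAKE_TIMES = """
6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50
""".split()


def modes(values):
    """Return every value that ties for the highest count, in first-seen order."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes(WAKE_TIMES))                 # the wake-up data has a single mode
print(modes(["A", "B", "B", "A", "C"]))  # made-up data with two modes
```

Because the mode only needs counts, it also works for categories (like A, B, C, D) where a mean or median would make no sense.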
Median
• Any value M for which at least 50% of all
observations are at or above M and at
least 50% are at or below M.
• Resistant measure of central tendency
(not heavily influenced by extreme values)
Calculating the Median
Order all observations from smallest to largest.
If the number of observations is odd, the median is the
“middle” observation, namely the [(n+1)/2]th
observation.
For n = 61, it is the 31st observation.
If the number of observations is even, then, to get a
unique value, take the average of the (n/2)th and
the (n/2 + 1)th observations.
For n = 60, it is the average of the 30th and the 31st
observations.
Median (example data)
 6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
 8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
 9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50

Since there is an even number of data points (n = 30),
take the average of the middle two values, the 15th and 16th:
(9:15 + 9:25) / 2 = 9:20.
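The positional rule for the median translates almost directly into code. This is a minimal sketch with a made-up function name, `median_by_position`; the only subtlety is that the rule counts observations from 1 while Python lists count from 0. The wake-up times are given in minutes after midnight (for example, 6:55 is 415).

```python
def median_by_position(values):
    """Median via the positional rule: middle value, or mean of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of no observations is undefined")
    if n % 2 == 1:
        # Odd n: the [(n + 1) / 2]th observation, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations, counting from 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# Wake-up times in minutes after midnight.
WAKE_MINUTES = [
    415, 420, 450, 450, 465, 480, 505, 510, 525, 525,
    530, 540, 540, 540, 555, 565, 570, 570, 570, 570,
    570, 585, 585, 600, 600, 615, 625, 630, 645, 650,
]

middle = median_by_position(WAKE_MINUTES)
print(f"median = {middle} minutes after midnight")
```

Replacing the largest value (650) with something extreme such as 1400 leaves the result unchanged, which is what "resistant" means for the median.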
Mean
• Sum up all observations (say, n many) and
divide the total by n.
• Extreme values strongly influence the
mean
• The mean is the balance point of the values in a
distribution (its center of gravity)
Calculating the mean
• Suppose that we collect n many observations
• Let $X_1, X_2, X_3, \dots, X_n$ denote the individual
observations.
Mean
• Sum up all observations (say, n many)
and divide the total by n.
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\left(X_1 + X_2 + \dots + X_n\right)$$
Mathematical Notation
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n$$
Mean:
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\left(X_1 + X_2 + \dots + X_n\right) = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Mean (example data)
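Wake-up times:
 6:55 7:00 7:30 7:30 7:45 8:00 8:25 8:30 8:45 8:45
 8:50 9:00 9:00 9:00 9:15 9:25 9:30 9:30 9:30 9:30
 9:30 9:45 9:45 10:00 10:00 10:15 10:25 10:30 10:45 10:50

The same values in decimal hours (e.g., 6:55 = 6 + 55/60 ≈ 6.92):
 6.92 7 7.5 7.5 7.75 8 8.42 8.5 8.75 8.75
 8.83 9 9 9 9.25 9.42 9.5 9.5 9.5 9.5
 9.5 9.75 9.75 10 10 10.25 10.42 10.5 10.75 10.83

$\sum X = 273.34$
$\bar{X} = 273.34 / 30 \approx 9.11$
Transform back into time scale:
≈ 9:06 (unrounded, the mean is 9:06:40, i.e., 9:07 to the nearest minute)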
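One way to check the mean of the wake-up data is to work in whole minutes after midnight, which avoids the small rounding errors in the decimal-hour values. The sketch below is only an illustration with made-up variable names. The exact mean is 546 and 2/3 minutes, i.e. 9:06:40, which reads as 9:06 if truncated and 9:07 if rounded to the nearest minute.

```python
# Wake-up times in minutes after midnight (for example, 6:55 is 6 * 60 + 55 = 415).
WAKE_MINUTES = [
    415, 420, 450, 450, 465, 480, 505, 510, 525, 525,
    530, 540, 540, 540, 555, 565, 570, 570, 570, 570,
    570, 585, 585, 600, 600, 615, 625, 630, 645, 650,
]

n = len(WAKE_MINUTES)
total = sum(WAKE_MINUTES)      # sum of all observations
mean_minutes = total / n       # divide the total by n
mean_hours = mean_minutes / 60

# Convert back to clock time, rounding to the nearest minute.
hours, minutes = divmod(round(mean_minutes), 60)
print(f"n = {n}, total = {total} minutes")
print(f"mean = {mean_hours:.2f} hours, about {hours}:{minutes:02d}")
```

Note that the mean (about 9:07) lies below the median (9:20): the few early risers pull the mean down, which is the sensitivity to extreme values mentioned earlier.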
A few notes about summation, and implications
for calculation of the mean
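$$\underbrace{a + a + \dots + a}_{n \text{ terms}} = na, \qquad\text{that is,}\qquad \sum_{i=1}^{n} a = na$$
If all data has the same value, a,
then the mean value is also a, because:
$$\frac{1}{n}\sum_{i=1}^{n} a = \frac{1}{n}\, na = a$$
[Figure: chart with x-axis 1 to 5 and y-axis 0 to 10, with the mean marked.]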
Multiplying all values by a
constant
$$aX_1 + aX_2 + \dots + aX_n = a\left(X_1 + X_2 + \dots + X_n\right)$$
$$\sum_{i=1}^{n} aX_i = a\sum_{i=1}^{n} X_i$$
If we multiply each observation
by 2, then we obtain a new
distribution with a different shape:
$$\frac{1}{n}\sum_{i=1}^{n} 2X_i = 2\cdot\frac{1}{n}\sum_{i=1}^{n} X_i = 2\bar{X}$$
A multiplying constant
affects the mean
(and the “spread”).
[Figure: the original distribution and the distribution after multiplying by 2, each on an axis from 1 to 10.]
Adding a constant to all values
$$(X_1 + a) + (X_2 + a) + \dots + (X_n + a) = (X_1 + X_2 + \dots + X_n) + na$$
$$\sum_{i=1}^{n}(X_i + a) = \left(\sum_{i=1}^{n} X_i\right) + na$$
If we add the constant 5
to each observation,
then we obtain a new
distribution that is shifted
to the right by 5 units:
$$\frac{1}{n}\sum_{i=1}^{n}(X_i + 5) = \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right) + \frac{1}{n}\, n \cdot 5 = \bar{X} + 5$$
A shift affects the mean
(but not the “spread”).
[Figure: the original distribution and the distribution shifted right by 5 units, each on an axis from 1 to 10.]
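The two rules for constants (multiplying every value, and adding to every value) can be checked numerically. In this sketch the data are made up, and "spread" is measured by the range (largest minus smallest value), which is only a simple stand-in chosen for this example.

```python
from statistics import mean


def spread(values):
    """Range (largest minus smallest), used here as a simple stand-in for 'spread'."""
    return max(values) - min(values)


data = [1, 2, 2, 3, 4]             # made-up observations
doubled = [2 * x for x in data]    # multiply every value by the constant 2
shifted = [x + 5 for x in data]    # add the constant 5 to every value

for name, values in [("original", data), ("doubled", doubled), ("shifted", shifted)]:
    print(f"{name:<9} mean = {mean(values):.1f}   range = {spread(values)}")
```

Doubling moves the mean and stretches the range by the same factor; shifting moves the mean by 5 and leaves the range alone.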
Combining two variables
$$(X_1 + Y_1) + (X_2 + Y_2) + \dots + (X_n + Y_n) = (X_1 + X_2 + \dots + X_n) + (Y_1 + Y_2 + \dots + Y_n)$$
$$\sum_{i=1}^{n}(X_i + Y_i) = \left(\sum_{i=1}^{n} X_i\right) + \left(\sum_{i=1}^{n} Y_i\right)$$
Adding two variables
$$\frac{1}{n}\sum_{i=1}^{n}(X_i + Y_i) = \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) + \left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \bar{X} + \bar{Y}$$
The mean of the sum of two variables
is the sum of their means
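A quick numerical check of this rule, using made-up paired observations (two measurements, `x` and `y`, on the same four individuals):

```python
from statistics import mean

# Made-up paired data: x[i] and y[i] belong to the same individual.
x = [2, 4, 6, 8]
y = [1, 3, 5, 11]

# Add the two variables individual by individual.
sums = [xi + yi for xi, yi in zip(x, y)]

print("mean of the sums:", mean(sums))
print("sum of the means:", mean(x) + mean(y))
```

The rule needs the two variables to be paired, so both lists must have the same length n.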
Measures of Dispersion
• Population Standard Deviation
• Sample Standard Deviation
If we want to know how much the
values vary around the mean….
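We could calculate how much each value varies from the mean and add up these deviations:
$$(X_1 - \bar{X}) + (X_2 - \bar{X}) + \dots + (X_n - \bar{X}) = \sum_{i=1}^{n}(X_i - \bar{X})$$
Because of the way we calculate the mean,
this formula gives zero no matter what data you have:
$$\sum_{i=1}^{n}(X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\bar{X} = n\bar{X} - n\bar{X} = 0$$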
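To see this with numbers, the sketch below uses a small made-up data set whose mean is a whole number, so the arithmetic is exact. It also shows that squaring the deviations first stops the positive and negative ones from cancelling.

```python
data = [3, 7, 8, 12, 20]            # made-up observations, mean = 10
mean_value = sum(data) / len(data)

# Deviation of each observation from the mean.
deviations = [x - mean_value for x in data]

print("deviations:            ", deviations)
print("sum of deviations:     ", sum(deviations))
print("sum of squared devs.:  ", sum(d ** 2 for d in deviations))
```

With data whose mean is not exactly representable, floating-point rounding can leave a tiny nonzero sum instead of an exact zero.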
Population Standard Deviation
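• Variance
$$\sigma^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n}$$
where $\bar{X}$ is the mean of all $n$ values in the population.
• Standard Deviation
$$\sigma = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n}}$$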
Sample Standard Deviation
• Variance
$$s^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}$$
• Standard Deviation
$$s = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}}$$
There are n-1 “degrees of freedom”
(If you know the mean and n-1 observations
then you can figure out the n’th observation)
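The difference between the two standard deviations is only the divisor. The sketch below assumes the usual convention that the population version divides by n and the sample version by n - 1; the function name `variance` and the data are made up, and the results are compared against the Python standard library's `statistics` module.

```python
import math
import statistics


def variance(values, sample=True):
    """Mean squared deviation from the mean; divides by n - 1 for a sample, n otherwise."""
    n = len(values)
    mean_value = sum(values) / n
    squared_deviations = sum((x - mean_value) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


data = [3, 7, 8, 12, 20]  # made-up observations, mean = 10

population_sd = math.sqrt(variance(data, sample=False))
sample_sd = math.sqrt(variance(data, sample=True))

print(f"population: variance = {variance(data, sample=False):.2f}, sd = {population_sd:.3f}")
print(f"sample:     variance = {variance(data):.2f}, sd = {sample_sd:.3f}")

# The standard library offers the same two versions.
print(statistics.pstdev(data), statistics.stdev(data))
```

Dividing by the smaller number n - 1 always makes the sample value a little larger; the gap shrinks as n grows.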
Computational Formulas
• Note that there are computational formulas
for the standard deviation.
• Look for them in ALEKS and write them
down.
• Remember you can bring notes to your
assessments
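For orientation, one common computational form of the sample variance is sketched here; ALEKS may present it with different notation, so treat this as a cross-check rather than the version to copy. Expanding the squared deviations and using $\sum X_i = n\bar{X}$ gives

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - 2\bar{X}\sum_{i=1}^{n} X_i + n\bar{X}^2 = \sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2$$

so that

$$s^2 = \frac{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}{n - 1}, \qquad s = \sqrt{s^2}$$

This form needs only the sum of the values and the sum of their squares, so the mean does not have to be subtracted from each observation first.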
For Next Week…
• Keep working on ALEKS
• Finish the descriptive statistics section
• Watch the second video
• If you can, start probability section before
Jason’s lecture next week.
• Remember: Office Hours and Lab are
always available for you.