Introductory Statistics
Learning Objectives
After the session the students should be able to:
• Distinguish between different data types
• Evaluate the central tendency of realistic business data
• Evaluate the dispersion of data
• Evaluate test statistics
• Use a test statistic to formulate business decisions using regression analysis
Types of data
• Discrete data (a variable restricted to a fixed set of values)
• Continuous data (a variable measured on a continuous scale)
• These data may be collected (ungrouped) and then grouped together in a particular form so that they can be easily inspected
• But how would we collect data?
Sampling Techniques
Simple random sampling
Stratified sampling
Cluster sampling
Quota sampling
Systematic sampling
Mechanical sampling
Convenience sampling
Frequency distributions
The following are the ages of a sample of managers:
42 30 53 50 52 30 55 49 61 74
26 58 40 40 40 28 36 30 31 37
32 37 30 32 23 32 58 43 30 29
34 50 47 31 35 26 64 46 40 43
57 30 49 40 25 50 52 32 60 54
How could we represent these data effectively?
Scattering the data
• Scatter Diagrams
• Bar Diagrams
[Figures: scatter diagram and bar diagram of the age data]
The histogram
• We could group the data into convenient class intervals, thus:

Class range   Central value   Frequency   Cumulative frequency
20-29         24.5            6           6
30-39         34.5            17          23
40-49         44.5            12          35
50-59         54.5            11          46
60-69         64.5            3           49
70-79         74.5            1           50

• and plot these to produce a histogram
[Figure: histogram of frequency against class central value, 24.5 to 74.5]
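The grouping step can also be sketched in Python rather than a spreadsheet. This is only an illustrative sketch: the function name `group_ages` and the constants are assumptions made for the example, and the class limits (20-29, 30-39, ...) follow the table.

```python
from collections import Counter

# Ages of the 50 managers, as listed on the frequency-distribution slide.
ages = [
    42, 30, 53, 50, 52, 30, 55, 49, 61, 74,
    26, 58, 40, 40, 40, 28, 36, 30, 31, 37,
    32, 37, 30, 32, 23, 32, 58, 43, 30, 29,
    34, 50, 47, 31, 35, 26, 64, 46, 40, 43,
    57, 30, 49, 40, 25, 50, 52, 32, 60, 54,
]

FIRST_LOWER = 20  # lower limit of the first class (assumed from the table)
CLASS_WIDTH = 10  # classes 20-29, 30-39, ..., 70-79


def group_ages(values, first_lower=FIRST_LOWER, width=CLASS_WIDTH):
    """Return rows of (lower, upper, central value, frequency, cumulative frequency)."""
    # Map each value to the index of the class it falls in, then count.
    counts = Counter((v - first_lower) // width for v in values)
    rows, running = [], 0
    for k in range(max(counts) + 1):
        lower = first_lower + k * width
        upper = lower + width - 1
        freq = counts.get(k, 0)
        running += freq
        rows.append((lower, upper, (lower + upper) / 2, freq, running))
    return rows


table = group_ages(ages)
for lower, upper, centre, freq, cumulative in table:
    print(f"{lower}-{upper}  {centre:5.1f}  {freq:3d}  {cumulative:3d}")
```

The frequencies and cumulative frequencies it prints can be compared line by line with the class table.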
What measures of the central tendency do we have?
Measures of the central tendency
Mode
• The value at the peak of the distribution, i.e. the most frequently occurring value (in practice this can be evaluated using a standard formula)
Median
• The central value of a set of data or a distribution. It can be evaluated by a standard method using the CDF
Arithmetic mean
• The central value assuming the data are distributed in accordance with an arithmetic progression
Geometric mean
• The central value assuming the data are distributed according to a geometric progression
The mode
• For our data this occurs between 30-39 (the modal range)
• The construction shown can be employed to home in on the exact value
• Or the formula below, where L = lower boundary, l = lower freq diff, u = upper freq diff & c = the class boundary width:
$$x_{mode} = L + \left(\frac{l}{l+u}\right) c$$
[Figure: frequency histogram with the construction locating x_mode inside the 30-39 class]
The mode
• Here, for our data, L = 29.5, l = 17 - 6 = 11, u = 17 - 12 = 5 and the class boundary width c = 10:
$$x_{mode} = L + \left(\frac{l}{l+u}\right) c = 29.5 + \left(\frac{11}{11+5}\right) 10 = 29.5 + \frac{110}{16} \approx 36.38$$
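As a rough check, the mode formula can be written as a small Python function. The name `grouped_mode` is an assumption for illustration; the frequency differences are taken directly from the grouped table (17 - 6 for the lower difference, 17 - 12 for the upper one).

```python
def grouped_mode(lower_boundary, f_below, f_modal, f_above, width):
    """Mode of grouped data: L + (l / (l + u)) * c."""
    l = f_modal - f_below   # lower frequency difference
    u = f_modal - f_above   # upper frequency difference
    return lower_boundary + (l / (l + u)) * width


# Modal class 30-39: boundary 29.5, frequencies 6 (below), 17 (modal), 12 (above).
x_mode = grouped_mode(29.5, f_below=6, f_modal=17, f_above=12, width=10)
print(f"x_mode = {x_mode:.3f}")
```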
The Median
• For our data we could evaluate this quantity in two ways:
• Approximately, by plotting the cumulative frequency diagram (CDF)
• Via logical inference
[Figures: cumulative frequency diagram (CDF) and frequency histogram of the age data]
$$(12 + 11 + 3 + 1) - (17 + 6) = 4$$
$$m = 39.5 + 10 \times \frac{4}{2 \times 12} \approx 41.17$$
Measures of Dispersion
• The range
• Largest value minus Smallest value
$$R = L - S$$
• Variance
• Mean square variation from the mean
$$\sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}$$
• Standard Deviation
• Square root of the variance
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$$
NOTE: $n = \sum f_i$
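A minimal Python sketch of these three measures, applied to the grouped age data. It assumes the class central values stand in for the individual observations, so the variance is an approximation to the one obtained from the raw ages; the variable names are illustrative.

```python
import math

# Class central values and frequencies from the grouped age table.
centres = [24.5, 34.5, 44.5, 54.5, 64.5, 74.5]
freqs = [6, 17, 12, 11, 3, 1]

# Largest and smallest raw ages in the sample.
largest, smallest = 74, 23

n = sum(freqs)                        # n = sum of f_i
data_range = largest - smallest       # R = L - S
mean = sum(f * x for f, x in zip(freqs, centres)) / n
variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, centres)) / n
std_dev = math.sqrt(variance)

print(f"range = {data_range}, mean = {mean:.2f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```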
Use of Computer packages
• Example:
• Given the following data, use a spreadsheet to produce a grouped histogram using 9 bins; also produce a CFD. Hence or otherwise evaluate:
a) Three measures of the central tendency and,
b) Three measures of the dispersion
Decision Processes
• This is all very well and good; however, how does this allow us to make managerial and research decisions?
• To answer this we need to consider the pattern of the data, thus:
[Figure: histogram of a data set with class central values from 20.445 to 21.15]
The Normal distribution
• Many sets of data adhere to the normal distribution.
• The most important distribution of them all
• It is pretty much this property that allows us to obtain (research) management decisions
• The normal distribution is usually written N(μ, σ²), with μ the population mean and σ² the variance
Properties of N(μ, σ²)
For any normal curve with mean μ and standard deviation σ:
• About 68 percent of the observations fall within one standard deviation σ of the mean.
• About 95 percent of observations fall within 2 standard deviations.
• About 99.7 percent of observations fall within 3 standard deviations of the mean.
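These percentages can be checked numerically. The sketch below uses the identity P(|Z| < k) = erf(k/√2) for a standard normal variable; the helper name `within_k_sigma` is an assumption for the example.

```python
import math


def within_k_sigma(k):
    """Probability that a normal observation lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sd: {100 * p:.2f}%")
```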
The Z-Score
This is a formula that allows us to evaluate the probability of an event if we know that a particular population is normally distributed:
$$Z = \frac{X - \mu}{\sigma}$$
Example: If a population is N(48, 12²), find the probability that some value of X < 20.
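Solution Protocol
1. Establish hypothesis
2. Evaluate the Z-score
3. Sketch the distribution
4. Evaluate probability
$$P(X < 20 \mid N(48, 12^2)) = p$$
$$Z = \frac{20 - 48}{12} = -2.333$$
[Figure: sketch of the normal distribution with the tail area p below the Z-score shaded]
$$p = 0.5 - 0.4901 = 0.0099 \approx 1\%$$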
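Spreadsheet Solution Protocol
1. Establish hypothesis
2. Use normal distribution function
3. Perform check, i.e. use Z-function
$$P(X < 20 \mid N(48, 12^2)) = p$$
Using the normal distribution function:
$$p = 0.009815 \approx 1\%$$
Check using the Z-function:
$$Z = \frac{20 - 48}{12} = -2.333$$
$$p = 0.009815 \approx 1\%$$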
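Outside a spreadsheet, the same probability can be sketched in Python. The helper `normal_cdf` is an assumption for the example (it is built from the standard library's error function, not from any spreadsheet function).

```python
import math


def normal_cdf(x, mu, sigma):
    """P(X < x) for X distributed as N(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


mu, sigma = 48, 12
z = (20 - mu) / sigma          # Z-score for X = 20
p = normal_cdf(20, mu, sigma)  # lower-tail probability P(X < 20)

print(f"Z = {z:.3f}, p = {p:.6f}")
```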
Exercise
• Example: Using a z score, if a population is N(111, 33.8²), find the probability that some value of X lies in 100 < X < 150.
$$P(a < X < b \mid N(\mu, \sigma^2)) = p$$
$$Z = \frac{X - \mu}{\sigma}$$
p =
Exercise
• Using a z score and given that the population is N(37, 4.35²), find the probability that some value of X > 150.
$$P(a < X < b \mid N(\mu, \sigma^2)) = p$$
$$Z = \frac{X - \mu}{\sigma}$$
p =
Samples
• If we are using a sample of values then, as a consequence of the central limit theorem, the z score will change, thus:
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
Example
• The mean expenditure per customer at a tire store is £60 and the sd is £6. It is known that the nominal number of customers per day is 40. A new product costs £64; what is the probability of selling such a product per customer?
$$P(a < X < b \mid N(\mu, \sigma^2)) = p$$
$$Z = \frac{64 - 60}{6 / \sqrt{40}} \approx 4.22$$
p =
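The sample Z-score for this example can be evaluated with a short Python sketch. The function names are illustrative, and reading the question as the upper-tail probability P(X̄ > 64) is an assumption, since the slide leaves p blank.

```python
import math


def sample_z(x_bar, mu, sigma, n):
    """Z-score of a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / math.sqrt(n))


def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = sample_z(x_bar=64, mu=60, sigma=6, n=40)
p_upper = upper_tail(z)  # assumed interpretation: P(sample mean > 64)

print(f"Z = {z:.3f}, P(Z > z) = {p_upper:.2e}")
```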
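Try one
• In a store, the average number of shoppers is 448, with an sd of 21. What is the probability that 49 shopping hours have a mean between 441 and 446?
$$P(441 < \bar{X} < 446 \mid N(\mu, \sigma^2)) = p$$
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}, \qquad \bar{X} = 441 \text{ and } 446, \qquad \frac{\sigma}{\sqrt{n}} = \frac{21}{\sqrt{49}}$$
p =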
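One way to check an answer to this kind of question is to evaluate both Z-scores and subtract the two lower-tail probabilities. The sketch below assumes the sample mean is normally distributed with standard error σ/√n; the names are illustrative.

```python
import math


def standard_normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


mu, sigma, n = 448, 21, 49
standard_error = sigma / math.sqrt(n)

z_low = (441 - mu) / standard_error
z_high = (446 - mu) / standard_error

# P(441 < sample mean < 446) = P(Z < z_high) - P(Z < z_low)
p = standard_normal_cdf(z_high) - standard_normal_cdf(z_low)

print(f"z_low = {z_low:.3f}, z_high = {z_high:.3f}, p = {p:.4f}")
```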
Regression & Correlation analysis
• A scatter diagram can be used to show the relationship between two variables
• Correlation analysis is used to measure strength of the association (linear relationship) between two variables
• Correlation is only concerned with strength of the relationship
• No causal effect is implied with correlation
• Scatter diagrams were presented in the last sessions
• As was Correlation
Introduction to Regression Analysis
o Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
o Dependent variable: the variable we wish to predict or explain
o Independent variable: the variable used to explain the dependent variable
Simple Linear Regression
Model
o Only one independent variable, X
o Relationship between X and Y is
described by a linear function
o Changes in Y are assumed to be
caused by changes in X
Types of Relationships
• Linear relationships
• Curvilinear relationships
• Weak relationships
• Strong relationships
• No relationship
[Figures: example scatter plots of Y against X for each type of relationship]
The regression model
$$y_i = A + B x_i + \varepsilon_i$$
• y_i: dependent variable
• A: population Y intercept (β0 in the figure)
• B: population slope coefficient (β1 in the figure)
• x_i: independent variable
• ε_i: random error term
• A + B x_i is the linear component; ε_i is the random error component
[Figure: regression line of Y against X with its intercept and slope marked, showing for a given X_i the observed value of Y, the predicted value of Y, and the random error ε_i between them]
The Least Squares approach
• The intercept A and slope B (b0 and b1 in some notations) are obtained by finding the values that minimize the sum of the squared differences between Y and Ŷ:
$$\min \sum (y_i - \hat{y}_i)^2 = \min \sum \left(y_i - (A + B x_i)\right)^2$$
Rendering:
$$B = \frac{S_{XY}^2}{S_{XX}^2}, \qquad A = \bar{y} - B\bar{x}$$
The proof of these requires the calculus
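For reference, the calculus step can be sketched as follows. This is an outline in the slides' notation, writing Q for the sum of squares (Q is a symbol introduced here for convenience) and using overbars for means.

```latex
% Sum of squared differences to be minimised
Q(A, B) = \sum_{i=1}^{n} \left( y_i - A - B x_i \right)^2

% Set both partial derivatives to zero
\frac{\partial Q}{\partial A} = -2 \sum_{i=1}^{n} \left( y_i - A - B x_i \right) = 0
  \quad\Rightarrow\quad A = \bar{y} - B \bar{x}

\frac{\partial Q}{\partial B} = -2 \sum_{i=1}^{n} x_i \left( y_i - A - B x_i \right) = 0
  \quad\Rightarrow\quad \overline{xy} - A \bar{x} - B\, \overline{x^2} = 0

% Substitute A = \bar{y} - B \bar{x} into the second condition
\overline{xy} - \bar{x}\,\bar{y} = B \left( \overline{x^2} - \bar{x}^2 \right)
  \quad\Rightarrow\quad
  B = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}
    = \frac{S_{XY}^2}{S_{XX}^2}
```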
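Regression Formulae
• Thus the formulae can be summarized as:
$$y_i = A + B x_i + \varepsilon_i$$
$$S_{XY}^2 = \overline{xy} - \bar{y}\,\bar{x} \quad \text{(the covariance)}$$
$$S_{XX}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 \quad \text{(the variance)}$$
$$B = \frac{S_{XY}^2}{S_{XX}^2}$$
$$A = \bar{y} - B\bar{x}$$
Where:
• $\overline{xy}$ = mean of x × y
• $\bar{y}\,\bar{x}$ = (mean of y) × (mean of x)
• $\overline{x^2}$ = mean of x²
• $\bar{x}^2$ = square of the mean of x
and of course:
• $\bar{x}$ = mean of x
• $\bar{y}$ = mean of y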
Regression Example
• An estate agent wishes to find the relationship between house prices and size. It is suspected that a linear relationship exists between the house price (the dependent variable Y) and the house size in square metres (the independent variable X). Using linear regression, find the relationship and make a prediction of the price of a house measuring 200 m². The following data have been collected by the estate agent.
Regression data
House Price in £k (Y)   Area in m sqr (X)
123                     156
156                     178
140                     189
154                     208
100                     122
110                     172
203                     261
162                     272
160                     158
128                     189
Regression Solution
• It is usual to set up a table of results, using an appropriate Excel spreadsheet

Area in m sqr (X)   House Price in £k (Y)   X×Y       X×X = X²
156                 123                     19188     24336
178                 156                     27768     31684
189                 140                     26460     35721
208                 154                     32032     43264
122                 100                     12200     14884
172                 110                     18920     29584
261                 203                     52983     68121
272                 162                     44064     73984
158                 160                     25280     24964
189                 128                     24192     35721
Mean values:
190.5               143.6                   28308.7   38226.3
Regression Solution Cont…
• Now we simply apply the formulae as follows, first the regression coefficient, i.e. the gradient
$$B = \frac{S_{XY}^2}{S_{XX}^2}$$
$$S_{XX}^2 = \overline{x_i^2} - \bar{x}^2 = 38226.3 - 190.5^2 = 1936.05$$
$$S_{XY}^2 = \overline{xy} - \bar{x}\,\bar{y} = 28308.7 - 190.5 \times 143.6 = 952.9$$
$$B = \frac{952.9}{1936.05} = 0.492$$
Regression Solution Cont…
• Then we evaluate the regression constant
$$A = \bar{y} - B\bar{x} = 143.6 - 0.49219 \times 190.5 \approx 49.838$$
There are various computer methods available which do these calculations for you; these are detailed in the handout
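As one such computer method, the whole calculation can be sketched in Python using the same mean-based formulae. The variable names are illustrative, and the final line applies the fitted line to the 200 m² house asked about in the example.

```python
# House areas (X, m sqr) and prices (Y, £k) from the regression data table.
areas = [156, 178, 189, 208, 122, 172, 261, 272, 158, 189]
prices = [123, 156, 140, 154, 100, 110, 203, 162, 160, 128]


def mean(values):
    return sum(values) / len(values)


x_bar, y_bar = mean(areas), mean(prices)

# Covariance and variance in the slides' "mean of products minus product of means" form.
s_xy = mean([x * y for x, y in zip(areas, prices)]) - x_bar * y_bar
s_xx = mean([x * x for x in areas]) - x_bar ** 2

B = s_xy / s_xx          # regression coefficient (gradient)
A = y_bar - B * x_bar    # regression constant (intercept)

predicted_200 = A + B * 200  # predicted price in £k for a 200 m sqr house

print(f"B = {B:.4f}, A = {A:.3f}")
print(f"Predicted price for 200 m sqr: £{predicted_200:.1f}k")
```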
Regression computer solution
There are three methods to evaluate the Regression coefficient and constant using an Excel spreadsheet. These being:
• Graphical
• Calculation
• Functions
[Figure: scatter plot of House Price (£k) against House size (m sqr) with fitted linear trendline y = 0.4922x + 49.838, R² = 0.5784]
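The R² value reported with the trendline can be reproduced from the same data. This sketch assumes R² is the squared correlation coefficient of a simple linear fit, computed in the slides' mean-based form; the names are illustrative.

```python
# House areas (X, m sqr) and prices (Y, £k) from the regression data table.
areas = [156, 178, 189, 208, 122, 172, 261, 272, 158, 189]
prices = [123, 156, 140, 154, 100, 110, 203, 162, 160, 128]


def mean(values):
    return sum(values) / len(values)


x_bar, y_bar = mean(areas), mean(prices)
s_xy = mean([x * y for x, y in zip(areas, prices)]) - x_bar * y_bar
s_xx = mean([x * x for x in areas]) - x_bar ** 2
s_yy = mean([y * y for y in prices]) - y_bar ** 2

# Squared correlation coefficient for a straight-line fit.
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"R^2 = {r_squared:.4f}")
```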
Regression computer solution Cont…
• This is an example of the graphical method, which is required for a pass grade in the forthcoming assignment! If you want higher grades, however, you will have to check these answers using the other two methods shown in the handout
Summary
Have we met our learning objectives? Specifically, are you able to:
• Distinguish between different data types
• Evaluate the central tendency of realistic business data
• Evaluate the dispersion of data
• Evaluate test statistics
• Use a test statistic to formulate business decisions using regression analysis