ST3905
Lecturer : Supratik Roy
Email : [email protected]
(Unix) : [email protected]
Phone: ext. 3626
What do we want to do?
1. What is statistics?
2. Describing information: summarization, visual and non-visual representation
3. Drawing conclusions from information: managing uncertainty and incompleteness of information
Describing Information
1. Why summarization of information?
2. Visual representation (aka graphical Descriptive Statistics)
3. Non-visual representation (numerical measures)
4. Classical techniques vs modern IT
Stem and Leaf Plot
Decimal point is 2 places to the right of the colon
0 : 8
1 : 000011122233333333333344444
1 : 55555566666677777778888888899999999999
2 : 0000000111111111111222222233333333444444444
2 : 555556666666666777778889999999999999999
3 : 000000001111112222333333333444
3 : 55555555666667777777888888899999999
4 : 0122234
4 : 55555678888889
5 : 111111134
5 : 555667
6 : 44
6 : 7
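Output in this format is produced by the stem() function in S/R; a minimal sketch using the built-in precip data, since the data behind the display above is not listed in the notes:

stem(precip)   # stem-and-leaf display of 70 annual precipitation totals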
Pie-Chart
[Pie chart with segments labelled: complex, diffgeom, algebra, reals, statistics]
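A chart like this can be drawn with pie(); a minimal sketch in which the shares are purely illustrative, since the notes do not give the actual proportions:

# hypothetical shares for the five segments named above
shares <- c(complex = 20, diffgeom = 15, algebra = 25, reals = 20, statistics = 20)
pie(shares)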
DotChart
[Dot chart of counts (horizontal axis roughly 10 to 30) for Old Suburb, Coast County and New Suburb within each category: Child Care, Health Services, Community Centers, Family & Youth, Other]
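A grouped dot chart of this kind can be drawn with dotchart(); a minimal sketch using the built-in VADeaths matrix as a stand-in, since the community-services counts are not reproduced here:

# for a matrix, dotchart() draws one dot per cell, grouped by column
dotchart(VADeaths, xlab = "Deaths per 1000")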
Histogram
[Histogram of my.sample: 50 samples from a t distribution with 5 d.f.; x-axis from -4 to 2, frequency axis from 0 to 15]
Histogram-Categorical
[Bar chart of counts by state.region: Northeast, South, North Central, West]
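A display like this can be reproduced from the built-in state.region factor; a minimal sketch (the exact call used for the slide is an assumption):

# counts of U.S. states in each region, one separated bar per category
barplot(table(state.region))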
Rules for Histograms
1. Height of rectangle proportional to frequency of class.
2. Number of classes roughly proportional to sqrt(total number of observations) [not a hard and fast rule].
3. In case of categorical data, keep rectangle widths identical and the bases of the rectangles separate.
4. Best, if possible, let the software do it (see the sketch below).
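A minimal R sketch along these lines, regenerating data like the my.sample figure above (the call actually used for the slides is not shown, so this is an assumption):

# 50 samples from a t distribution with 5 d.f., about sqrt(50) ~ 8 classes
my.sample <- rt(50, df = 5)
hist(my.sample, breaks = ceiling(sqrt(length(my.sample))))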
Data
 [1] -0.053626486 -0.828128399 0.214910482 0.346570399
[5] -0.849316517 0.001077376 0.736191791 1.417540397
[9] -2.382332275 -2.699019949 -0.111907192 1.384903284
[13] 2.113286699 -1.828108272 -1.108280724 0.131883612
[17] -0.394494473 0.829806888 0.023178033 0.019839537
[21] -0.346280222 -0.251981108 1.159853307 -0.249501904
[25] -1.342704742 -2.012653224 -1.535503208 0.869806233
[29] -1.313495887 -0.244408426 -0.998886998 -1.446769605
[33] 1.224528053 -0.410163230 0.032230907 -0.137297112
[37] -2.717620031 -0.728570438 0.034697116 2.202863874
[41] -0.170794163 0.353651680 -0.673296374 3.136364814
[45] -1.260108638 -0.367334893 -0.652217259 -0.301847039
[49] 0.315180215 0.190766333
Tabulation
Class   Tally                      freq
-3,-2   ////                         4
-2,-1   ///// //                     7
-1,0    ///// ///// ///// ///       18
 0,1    ///// ///// ////            14
 1,2    ////                         4
 2,3    //                           2
 3,4    /                            1
Total                               50
Box-Plot - I
[Box plot; value axis from 1 to 7]
Box Plot – II
[Box plots by age group: 18-24, 25-34, 35-44, 45-54, 55-64, 65+]
Box Plot – III
[Box plots of Payoff (about 200 to 800) by Leading Digit of Winning Numbers (0 to 9), NJ Pick-it Lottery (5/22/75-3/16/76)]
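Grouped box plots such as Box Plot III come from the formula interface of boxplot(); a minimal sketch on a built-in data set, since the lottery payoffs are not reproduced in these notes:

# one box per group, analogous to one box of payoffs per leading digit
boxplot(count ~ spray, data = InsectSprays)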
Non-Visual (numerical measures)
1. Pictures vs. quantitative measures
2. Criteria for selection of a measure – purpose of study
3. Qualities that a measure should have
4. We live in an uncertain world – chances of error
Measures of Location
1. Mean
2. Mode
3. Median
Location : mean, median
algebra test scores (25 students):
43 50 41 69 52 38 51 54 43 47 54 51 70 58 44 54 52 32 42 70 50 49 56 59 38
Mean = 50.68
10% trimmed mean of scores = 50.33333
Median = 51
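These figures can be reproduced directly (a sketch in R; the course output may have come from S-Plus, but the calls are the same):

scores <- c(43, 50, 41, 69, 52, 38, 51, 54, 43, 47, 54, 51, 70,
            58, 44, 54, 52, 32, 42, 70, 50, 49, 56, 59, 38)
mean(scores)               # 50.68
mean(scores, trim = 0.1)   # 10% trimmed mean, 50.33333
median(scores)             # 51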
Location : Non-classical
An M-estimate of location is a solution mu of the equation sum(psi((y - mu)/s)) = 0.
Data set: car.miles
M-estimate (bisquare): 204.5395
M-estimate (Huber's): 204.2571
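The defining equation can be solved by iteratively reweighted averaging; a minimal sketch with Huber's psi (the function name, tuning constant k and stopping rule are illustrative, not the course's code):

# M-estimate of location: solve sum(psi((y - mu)/s)) = 0 with Huber's psi
huber.location <- function(y, k = 1.345, tol = 1e-6) {
  s  <- mad(y)       # fixed robust scale estimate
  mu <- median(y)    # starting value
  repeat {
    r <- (y - mu) / s
    w <- ifelse(abs(r) <= k, 1, k / abs(r))   # Huber weights psi(r)/r
    mu.new <- sum(w * y) / sum(w)
    if (abs(mu.new - mu) < tol * s) break
    mu <- mu.new
  }
  mu.new
}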
Tabular method of computing
Class   freq   Class midpt   Rel. freq   r.f. × midpt
-3,-2    4       -2.5          0.08        -0.20
-2,-1    7       -1.5          0.14        -0.21
-1,0    18       -0.5          0.36        -0.18
 0,1    14        0.5          0.28         0.14
 1,2     4        1.5          0.08         0.12
 2,3     2        2.5          0.04         0.10
 3,4     1        3.5          0.02         0.07
Total   50                                 -0.16
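The last column multiplies each class's relative frequency by its midpoint, and its total (-0.16) is the grouped-data approximation to the mean of the 50 values listed earlier. The same computation as a short R sketch:

midpt <- seq(-2.5, 3.5, by = 1)
freq  <- c(4, 7, 18, 14, 4, 2, 1)
rf    <- freq / sum(freq)
sum(rf * midpt)   # grouped-data mean = -0.16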
Tabular method of computing
A = -0.5, d = 1 (the class width)
Class   freq   Class midpt (x)   x' = (x-A)/d   Rel. freq   r.f. × x'
-3,-2    4        -2.5               -2            0.08        -0.16
-2,-1    7        -1.5               -1            0.14        -0.14
-1,0    18        -0.5                0            0.36         0
 0,1    14         0.5                1            0.28         0.28
 1,2     4         1.5                2            0.08         0.16
 2,3     2         2.5                3            0.04         0.12
 3,4     1         3.5                4            0.02         0.08
Total   50                                                      0.34
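Undoing the coding recovers the same mean; continuing the sketch above:

A <- -0.5; d <- 1                 # d is the class width
xprime <- (midpt - A) / d
A + d * sum(rf * xprime)          # -0.5 + 1 * 0.34 = -0.16, as before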
Measures of Scale (aka Dispersion)
1. Variance (unbiased) : sum((x-mean(x))^2)/(N-1)
2. Variance (biased) : sum((x-mean(x))^2)/(N)
3. Standard Deviation : sqrt( variance)
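In R, var() and sd() use the unbiased (N-1) denominator; a minimal sketch showing all three quantities on a stand-in sample:

x <- rt(50, df = 5)                  # stand-in for the 50-value data set
var(x)                               # sum((x - mean(x))^2) / (N - 1)
sum((x - mean(x))^2) / length(x)     # biased version, divides by N
sd(x)                                # square root of the unbiased variance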
Tabular method of computing
A = -0.5, d = 1
Class   Class midpt (x)   x' = (x-A)/d   x'^2   Rel. freq   r.f. × x'^2
-3,-2      -2.5               -2           4       0.08         0.32
-2,-1      -1.5               -1           1       0.14         0.14
-1,0       -0.5                0           0       0.36         0
 0,1        0.5                1           1       0.28         0.28
 1,2        1.5                2           4       0.08         0.32
 2,3        2.5                3           9       0.04         0.36
 3,4        3.5                4          16       0.02         0.32
Total                                                           1.74
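The table gives Σ r.f.·x'² = 1.74; together with Σ r.f.·x' = 0.34 from the previous table, the grouped-data (biased, ÷N) variance is d²(1.74 - 0.34²). Continuing the earlier sketch:

s2.coded <- sum(rf * xprime^2) - sum(rf * xprime)^2   # 1.74 - 0.34^2
d^2 * s2.coded                                        # grouped-data variance ≈ 1.62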
Robust measures of scale
1. The MAD scale estimate generally has very small bias
compared with other scale estimators when there is
"contamination" in the data.
2. Tau-estimates and A-estimates also have 50% breakdown,
but are more efficient for Gaussian data.
3. The A-estimate that scale.a computes is redescending, so it is inappropriate if it is necessary that the scale estimate always increase as the size of a data point is increased.
However, the A-estimate is very good if all of the
contamination is far from the "good" data.
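A quick illustration of point 1, not taken from the course notes: with a few gross outliers the standard deviation is inflated while the MAD barely moves.

set.seed(1)
x <- c(rnorm(95), rnorm(5, mean = 10))   # 95 "good" points plus 5 contaminants
sd(x)    # pulled up by the contamination
mad(x)   # stays close to 1, the scale of the good data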
Comparison of scale measures
MAD(corn.yield) = 4.15128
scale.tau(corn.yield) = 4.027753
scale.a(corn.yield) = 4.040902
var(corn.yield) = 19.04191
sqrt(var(corn.yield)) = 4.363703
N.B. A real comparison has to consider various probability distributions as well as various sample sizes.
Probability
1. Concept of an Experiment on Random observables
2. Sets and Events, Random variables, Probability
(a) Set of all basic outcomes = Sample space = S
(b) An element of S or a union of elements in S = An event
(a singleton event = simple event, else compound)
(c) A numerical function that associates an event with a number = Random variable
(d) A map from the set of events E into [0,1] obeying certain rules = Probability
Examples of Probability
Consider the toss of a single coin:
1. A single throw : Only two possible outcomes – Head or Tail
2. Two consecutive throws : Four possible outcomes – (Head,
Head), (Head, Tail), (Tail, Head), (Tail, Tail)
3. Unbiased coin : P(Head turns up) = 0.5
4. Define R.V. X to be X(Head)=1, X(Tail)=0. P(X=1)=0.5,
P(X=0)=0.5.
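A small R sketch enumerating the two-throw sample space (illustrative only):

# two consecutive throws of a coin: four equally likely outcomes
S <- expand.grid(first = c("H", "T"), second = c("H", "T"))
S               # (H,H), (T,H), (H,T), (T,T)
1 / nrow(S)     # probability of each outcome for an unbiased coin: 0.25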
Axioms of Probability
1. 0 <= P(A) <= 1 for any event A
2. P[A ∪ B] = P[A] + P[B] if A, B are disjoint sets/events
3. P[S] =1
Basic Formulae-I
1. P[A'] = 1 - P[A]
2. P[A ∩ B] = 0 if A, B are disjoint
3. P[A ∪ B] = P[A] + P[B] - P[A ∩ B]
4. P[A ∪ B ∪ C] = P[A] + P[B] + P[C]
   - P[A ∩ B] - P[A ∩ C] - P[B ∩ C]
   + P[A ∩ B ∩ C]
Basic Formulae - II
1. Counting Principle: For an ordered sequence to be formed from N groups G1, G2, …, GN with sizes k1, k2, …, kN, the total number of sequences that can be formed is k1 × k2 × … × kN.
2. An ordered sequence of k objects taken from a set of n distinct objects is called a Permutation of size k of the objects, and is denoted by Pk,n.
3. For any positive integer m, m! is read as "m-factorial" and defined by m! = m(m-1)(m-2)…3·2·1
4. Any unordered subset of size k from a set of n distinct
objects is called a Combination, denoted Ck,n.
Basic Formulae-III
1. Pk,n = n!/(n-k)!
2. Ck,n = n!/[k!(n-k)!] (see the sketch below)
3. For any two events A and B with P(B) > 0, the Conditional Probability of A given (that) B (has occurred) is defined by P(A|B) = P(A ∩ B)/P(B) [= 0 if P(B) = 0]
4. Let A, B be disjoint with A ∪ B = S, and let C be any event with P[C] > 0. Then P(C) = P(C|A)P(A) + P(C|B)P(B) [Law of Total Probability]
5. Let A, B be disjoint with A ∪ B = S, and let C be any event with P[C] > 0. Then P(A|C) = P(C|A)P(A) / [P(C|A)P(A) + P(C|B)P(B)] [Bayes' Theorem]
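A quick check of formulae 1 and 2, with n and k chosen arbitrarily:

n <- 5; k <- 3
factorial(n) / factorial(n - k)   # Pk,n: ordered sequences, 60
choose(n, k)                      # Ck,n: unordered subsets, 10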
Random Variables - Discrete
1. A discrete set is a set that is either finite or whose elements can be put in one-to-one correspondence with a subset of the set of natural numbers.
2. A discrete random variable is a r.v. which takes values in a discrete set consisting of numbers.
3. The probability distribution or probability mass function (pmf) of a discrete r.v. X is defined for every number x by p(x) = P(X = x) = P(all s ∈ S : X(s) = x).
[P[X = x] is read "the probability that the r.v. X assumes the value x". Note that p(x) >= 0 and the sum of p(x) over all possible x is 1.]
Cumulative Distribution Function
1. The cumulative distribution function (cdf) F(x) of a discrete r.v. X with pmf p(x) is defined for every number x by F(x) = P(X ≤ x) = Σ{y : y ≤ x} p(y)
2. For any number x, F(x) is the probability that the observed value of X will be at most x.
3. For any two numbers a, b with a ≤ b, P(a ≤ X ≤ b) = F(b) - F(a-), where a- represents the largest possible X value that is strictly less than a.
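A small illustration with a fair die (not from the notes): the cdf is the running sum of the pmf, and F(b) - F(a-) gives interval probabilities.

p  <- rep(1/6, 6)    # pmf of a fair die on 1..6
Fx <- cumsum(p)      # cdf evaluated at x = 1..6
Fx[4] - Fx[1]        # P(2 <= X <= 4) = F(4) - F(1) = 0.5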
Operations on RV’s
1. Expectation of a RV
2. Expectations of functions of RV’s
3. Special Cases : Moments, Covariance
Expected Values of Random Variables
1. Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of X, denoted by E(X) or μX, is E(X) = μX = Σ{x ∈ D} x·p(x)
2. Note that E(X) may not always exist. Consider p(x) = k/x² on x = 1, 2, 3, …: then Σ x·p(x) = Σ k/x diverges.
Expected Values of Functions of Random Variables
1. Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of f(X), denoted by E(f(X)) or μf(X), is E(f(X)) = Σ{x ∈ D} f(x)·p(x)
2. Example: Variance. Var(X) = V(X) = E[X - E(X)]² = E(X²) - [E(X)]²
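A worked example with a fair die pmf (not from the notes), showing both definitions:

x <- 1:6; p <- rep(1/6, 6)   # possible values and pmf of a fair die
EX  <- sum(x * p)            # E(X) = 3.5
EX2 <- sum(x^2 * p)          # E(X^2) = 91/6
EX2 - EX^2                   # Var(X) = 35/12, about 2.917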
Random Variables - Continuous
Joint distribution of >1 RV’s
Gaussian or Normal Distribution
Sample as Random Observables
Parametric Inference
Tests of Hypothesis
Hypothesis Tests for Normal Population