Download data analysis - DCU School of Computing

DATA ANALYSIS
Module Code: CA660
Lecture Block 4
Examples using Standard Distributions / Sampling Distributions
Background Recombinant Interference
$\hat{r} = n_r / n$ = recombination fraction (gametes or...)
Greater physical distance between loci $\Rightarrow$ greater chance to recombine (homologous). Departure from additivity increases with distance, hence mapping.
Example: 2 loci A, B on the same chromosome, segregating for two alleles at each locus
$\Rightarrow$ alleles A, a, B, b; gametes AB, Ab, aB, ab. Parental types AB, ab give Ab and aB recombinants. Simple ratio. Denote recombinant fraction as R.F. (r)
Example: For 3 linked loci A, B, C, relationship based on simple prob. theory
\[ r_{AC} = r_{AB} + r_{BC} - 2 r_{AB} r_{BC} \]
3 possible R.F.s, so Interference; more generally
\[ r_{AC} = r_{AB} + r_{BC} - 2 C^{*} r_{AB} r_{BC} \]
$1 - C^{*}$ = Interference, $C^{*}$ = Coefft. of Coincidence,
\[ C^{*} = \frac{r_{12}}{2 r_{AB} r_{BC}} \]
where $r_{12}$ = true double recombinant frequency
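The three-locus relation can be checked numerically. The sketch below is illustrative only: the function names and the R.F. values 0.1 and 0.2 are assumptions chosen for the example, not data from the lecture. It evaluates $r_{AC} = r_{AB} + r_{BC} - 2C^{*} r_{AB} r_{BC}$ for a few values of the coefficient of coincidence.

```python
def r_ac(r_ab: float, r_bc: float, coincidence: float = 1.0) -> float:
    """R.F. between outer loci A and C for locus order A-B-C.

    coincidence is C*; C* = 1 is the no-interference case,
    C* = 0 is complete interference (R.F.s become additive).
    """
    return r_ab + r_bc - 2.0 * coincidence * r_ab * r_bc


def interference(coincidence: float) -> float:
    """Interference is defined as 1 - C*."""
    return 1.0 - coincidence


if __name__ == "__main__":
    # Hypothetical recombination fractions, for illustration only.
    for c_star in (1.0, 0.5, 0.0):
        print(c_star, interference(c_star), round(r_ac(0.1, 0.2, c_star), 4))
```

With complete interference ($C^{*} = 0$) the R.F.s simply add; as $C^{*}$ grows toward 1 the departure from additivity grows with it.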
Example cont. - LINKAGE / G.M. CONSTRUCTION
• Genetic Map - models the linear arrangement of a group of genes / markers (easily identified genetic features, e.g. change in a known gene, piece of DNA with no known function). Map based on homologous recombination during meiosis. If two or more markers are located close together on a chromosome, their alleles are usually inherited together through meiosis
• 4 basic steps after marker data obtained:
(1) Pairwise linkage - all 2-locus combinations (based on observed and expected frequencies of genotypic classes).
(2) Grouping markers into Linkage Groups (based on R.F.s, significance level etc.). If good genome coverage (many markers), good data and genetic model, no. of linkage groups should equal haploid no. of chromosomes for the organism.
(3) Ordering markers within group (key step, computationally demanding, precision important).
(4) Estimation of multipoint R.F. (physical distance, i.e. no. of DNA base pairs between two genes, vs map distance => transformation of R.F.).
• Ultimate Physical map = DNA sequence (restriction map also common)
STANDARD DISTRIBUTIONS Examples/Extensions
GENETIC LINKAGE and MAPPING
• Linkage Phase
- chromatid associations of alleles of linked loci
- same chromosome = coupled, different = repulsion
• Genetic Recombination - define R.F. (in terms of gametes or phenotypes); homologous case: the greater the distance between loci, the greater the chance of recombining. High interference = problem for multiple locus models. R.F. between loci not additive. Need Mapping Function
• Haldane's Mapping Function
Assume crossovers occur randomly along chromosome length and average number = $\lambda$, model as Poisson, so
P{no crossover} = $e^{-\lambda}$
and
P{crossover} = $1 - e^{-\lambda}$
Example - continued
• P{recombinant} = 0.5 $\times$ P{crossover} (each pair of homologs, with one crossover resulting in one-half recombinant gametes)
• Define expected no. of recombinants in terms of mapping function ($m = 0.5\lambda$), giving R.F.
\[ r = 0.5\,(1 - e^{-2m}) \quad \text{(form of Haldane's M.F.)} \]
with inverse
\[ m = -0.5 \ln(1 - 2r) \]
so converting an estimated R.F. to Haldane's map distance
• Thus, for locus order ABC
\[ m_{AC} = m_{AB} + m_{BC} \quad (\text{since } m_{AB} = -0.5 \ln(1 - 2 r_{AB}) \text{ etc.}) \]
Substituting for each of these gives $1 - 2r_{AC} = (1 - 2r_{AB})(1 - 2r_{BC})$, i.e. the usual relationship between R.F.s, $r_{AC} = r_{AB} + r_{BC} - 2 r_{AB} r_{BC}$ (for the no interference situation)
• Net Effect - transform to straight line, i.e. $m_{AC}$ vs $m_{AB}$ or $m_{BC}$
• In practice - too simple / only applies to specific conditions; may not relate directly to physical distance (a common Mapping Fn. issue).
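Haldane's mapping function and its inverse are easy to sketch in code. The example below is a minimal illustration: the function names and the R.F. values 0.1 and 0.2 are assumptions, not taken from the lecture. It converts R.F.s to map distances, adds the distances, and converts back, which should recover the no-interference relation between the R.F.s.

```python
import math


def haldane_rf(m: float) -> float:
    """Recombination fraction r for Haldane map distance m (in Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * m))


def haldane_distance(r: float) -> float:
    """Haldane map distance m for recombination fraction r, 0 <= r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * r)


if __name__ == "__main__":
    r_ab, r_bc = 0.1, 0.2  # hypothetical R.F.s for locus order A-B-C
    m_ac = haldane_distance(r_ab) + haldane_distance(r_bc)  # distances add
    print("r_AC via map distances:", haldane_rf(m_ac))
    print("r_AC via R.F. formula: ", r_ab + r_bc - 2 * r_ab * r_bc)
```

The two printed values should agree up to floating-point rounding, because additivity on the map scale is equivalent to the no-interference formula on the R.F. scale.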
Examples
RECOMBINANTS, BINOMIAL and MULTINOMIAL
• Binomial: no. of recombinant gametes produced by a heterozygous parent for a 2-locus model, with parameters n and $\theta$ = P{gamete recombinant} (= R.F.)
So for r recombinants in a sample of n
\[ P\{X = r\} = \binom{n}{r} \theta^{r} (1 - \theta)^{n-r} \]
• Multinomial: 3-locus model (A, B, C) - 4 possible classes of gametes (non-recombinants, AB recombinants, BC recombinants and double recombinants at loci ABC).
Joint probability distribution for r.v.'s requires counting the number in each class
\[ P\{X_1 = a, X_2 = b, X_3 = c, X_4 = d\} = \frac{n!}{a!\,b!\,c!\,d!}\, P_1^{a} P_2^{b} P_3^{c} P_4^{d} \]
where a+b+c+d = n and $P_1, P_2, P_3, P_4$ are the probabilities of observing a member of each of the 4 classes respectively
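Both probability functions can be evaluated directly. The sketch below is illustrative: the function names, the sample counts and the R.F. values are assumptions, and the four class probabilities are built under an assumed no-interference model (independent recombination in the A-B and B-C intervals), which the lecture does not prescribe at this point.

```python
from math import comb, factorial, prod


def binomial_pmf(r: int, n: int, theta: float) -> float:
    """P{X = r}: r recombinant gametes in a sample of n, R.F. = theta."""
    return comb(n, r) * theta**r * (1.0 - theta) ** (n - r)


def multinomial_pmf(counts, probs) -> float:
    """Joint probability of the observed class counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # exact integer multinomial coefficient
    return coef * prod(p**c for p, c in zip(probs, counts))


if __name__ == "__main__":
    print(binomial_pmf(3, 20, 0.1))  # hypothetical n = 20, theta = 0.1

    # Assumed no-interference class probabilities for locus order A-B-C.
    r_ab, r_bc = 0.1, 0.2
    probs = [(1 - r_ab) * (1 - r_bc),  # non-recombinants
             r_ab * (1 - r_bc),        # AB recombinants
             (1 - r_ab) * r_bc,        # BC recombinants
             r_ab * r_bc]              # double recombinants
    print(multinomial_pmf([14, 2, 3, 1], probs))  # hypothetical counts, n = 20
```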
Sampling and Sampling Distributions - Extended Examples: refer to primer
Central Limit Theorem
If $X_1, X_2, \ldots, X_n$ are a random sample of r.v. X (mean $\mu$, variance $\sigma^2$), then, in the limit as $n \to \infty$, the sampling distribution of the standardized means has a Standard Normal distribution, N(0,1)
\[ x_i' = \frac{x_i - \mu}{\sigma}, \quad i = 1, 2, \ldots, n \]
Probabilities for sampling distribution - limits
• for large n
\[ P\left\{ a \le \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} \le b \right\} \approx P\{a \le U \le b\} \]
U = standardized Normal deviate
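A small simulation can make the Central Limit Theorem statement concrete. This is a sketch under stated assumptions: the parent distribution (exponential with mean 1 and s.d. 1), the sample size, the number of replicates and the function names are all chosen for illustration. It standardizes each sample mean and records how often it falls within $\pm 1.96$, which should be close to 0.95 if the Normal approximation holds.

```python
import math
import random


def standardized_mean(sample, mu: float, sigma: float) -> float:
    """(x_bar - mu) / (sigma / sqrt(n)) for one sample."""
    n = len(sample)
    x_bar = sum(sample) / n
    return (x_bar - mu) / (sigma / math.sqrt(n))


def coverage(n: int, reps: int, seed: int = 0, z: float = 1.96) -> float:
    """Fraction of standardized means from Exp(1) samples lying in [-z, z]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]  # mean 1, s.d. 1
        u = standardized_mean(sample, mu=1.0, sigma=1.0)
        if -z <= u <= z:
            inside += 1
    return inside / reps


if __name__ == "__main__":
    for n in (5, 30, 200):
        print(n, coverage(n, reps=5000))
```

The parent distribution here is strongly skewed, so the agreement with 0.95 is expected to improve as n grows.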
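Large Sample theory
• In particular
\[ P\{ |\bar{x} - \mu| < r \} = P\{ -r < \bar{x} - \mu < r \} = P\left\{ \frac{-r}{\sigma_{\bar{x}}} < \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} < \frac{r}{\sigma_{\bar{x}}} \right\} \approx F\left( \frac{r}{\sigma / \sqrt{n}} \right) - F\left( \frac{-r}{\sigma / \sqrt{n}} \right) \]
• F is the C.D.F. or D.F. of the Standard Normal
• In general, the closer the behaviour of the random variable X is to the Normal, the faster the approximation approaches U. Generally, n > 30 $\Rightarrow$ "Large sample" theory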
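The large-sample probability statement can be evaluated with the Standard Normal C.D.F. The sketch below is illustrative: the function names and the values $\sigma = 10$, $n = 100$, $r = 1.96$ are assumptions chosen so that the answer is the familiar 0.95.

```python
import math


def normal_cdf(u: float) -> float:
    """F(u): C.D.F. of the Standard Normal, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def prob_mean_within(r: float, sigma: float, n: int) -> float:
    """Large-sample approximation to P{|x_bar - mu| < r}."""
    z = r / (sigma / math.sqrt(n))
    return normal_cdf(z) - normal_cdf(-z)


if __name__ == "__main__":
    # Hypothetical values: sigma = 10, n = 100, so sigma / sqrt(n) = 1.
    print(prob_mean_within(1.96, sigma=10.0, n=100))
```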
Attribute and Proportionate Sampling
Recall primer: sample proportion $\hat{p}$ and sample mean $\bar{x}$ synonymous
Probability Statements
If X and Y are independent Binomially distributed r.v.'s with parameters n, p and m, p respectively, then X+Y ~ B(n+m, p) (show e.g. by m.g.f.'s)
• So, $Y = X_1 + X_2 + \ldots + X_n$ ~ B(n, p) for the IID X ~ B(1, p).
• Since we know $\mu_Y = np$, $\sigma_Y = \sqrt{npq}$ and, clearly, $Y = n\bar{x}$, then
\[ \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{Y/n - \mu_Y/n}{\sigma_Y/n} = \frac{Y - np}{\sqrt{npq}} \to N(0,1) \text{ as } n \to \infty \]
• and, further,
\[ U = \frac{\hat{p} - p}{\sqrt{pq/n}} \sim N(0,1) \]
is the sampling distribution of a proportion
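Difference in Proportions
• Can use $\chi^2$: Contingency table type set-up
• Can set up as parallel to difference estimate or test of 2 means (independent), so for 100(1-$\alpha$)% C.I. (2-sided)
\[ (\hat{p}_1 - \hat{p}_2) \pm U_{\alpha/2} \sqrt{ \frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2} } \]
• Under H0: $P_1 - P_2 = 0$, so can write S.E. as
\[ \sqrt{ \hat{p}\hat{q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }, \qquad \hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{X + Y}{n_1 + n_2} \]
where $\hat{p}$ is the pooled estimate and X & Y = no. of successes in the two samples.
S.E. as given is for $n_1$, $n_2$ large. Small sample: n-1 for pooled.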
C.L.T. and Approximations summary
• General form of theorem - an infinite sequence of independent r.v.'s, with means, variances as before, then approximation $\to U$ for n large enough. Note: no condition on form of distribution of the X's (the raw data)
• Strictly - for approximations of discrete distributions, can improve by considering correction for continuity, e.g.
\[ U = \frac{X - \lambda \pm 0.5}{\sqrt{\lambda}} \quad \text{Poisson, parameter } \lambda \]
\[ U = \frac{(x \pm 0.5)/n - p}{\sqrt{pq/n}} \quad x = \text{no. in sample, so observed / sample proportion } \hat{p} = x/n \]
Generalising Sampling Distn. Concept - see primer
• For the sampling distribution of any statistic, a sample characteristic is an unbiased estimator of the parent population characteristic if the mean of the corresponding sampling distribution is equal to the parent characteristic.
So the sample average and the sample proportion are unbiased estimators of the parent average and the parent proportion:
\[ E\{\bar{x}\} = \mu, \qquad E\{\hat{p}\} = P \]
• Sampling without replacement from a finite population gives the Hypergeometric distribution.
Finite population correction (fpc) = $\sqrt{(N - n)/(N - 1)}$, where N, n are the parent population and sample size respectively.
• Above applies to variance also.
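The finite population correction can be applied to a standard error in a couple of lines. This is a sketch with assumed names and hypothetical numbers (N = 1000, n = 100, $\sigma$ = 10); it treats the fpc as a multiplier on the usual S.E. of the mean, $\sigma/\sqrt{n}$.

```python
import math


def fpc(pop_size: int, sample_size: int) -> float:
    """Finite population correction sqrt((N - n) / (N - 1))."""
    return math.sqrt((pop_size - sample_size) / (pop_size - 1))


def se_mean_finite(sigma: float, pop_size: int, sample_size: int) -> float:
    """S.E. of the sample mean when sampling without replacement."""
    return sigma / math.sqrt(sample_size) * fpc(pop_size, sample_size)


if __name__ == "__main__":
    # Hypothetical population and sample sizes.
    print(fpc(1000, 100))
    print(se_mean_finite(10.0, 1000, 100))
```

The correction is 1 for a sample of size one and shrinks to 0 when the whole population is sampled, so it only matters when n is a non-trivial fraction of N.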
Examples in context
Rates of prevalence of CF antibody to P1 virus among children in a given age group. Of 113 boys tested, 34 have antibody, while of 139 girls tested, 54 have antibody. Is evidence strong for a higher prevalence rate in girls?
H0: $p_1 = p_2$ vs H1: $p_1 < p_2$ (where $p_1$, $p_2$ = proportion of boys, girls with antibody respectively).
Soln.
\[ \hat{p}_1 = \frac{34}{113} = 0.301, \qquad \hat{p}_2 = \frac{54}{139} = 0.388, \qquad \hat{p} = \frac{34 + 54}{113 + 139} = 0.349 \]
\[ U = \frac{0.301 - 0.388}{\sqrt{ 0.349 \times 0.651 \left( \frac{1}{113} + \frac{1}{139} \right) }} = -1.44 \]
Cannot reject H0
Actual p-value = P{U ≤ -1.44} = 0.0749
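The antibody example can be reproduced with the pooled-S.E. test statistic. The sketch below uses the counts given in the example; the function names are assumptions, and the Standard Normal C.D.F. is computed from the error function rather than read from tables.

```python
import math


def normal_cdf(u: float) -> float:
    """C.D.F. of the Standard Normal, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def two_proportion_u(x1: int, n1: int, x2: int, n2: int) -> float:
    """U statistic for H0: p1 = p2, using the pooled estimate of p."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se


if __name__ == "__main__":
    u = two_proportion_u(34, 113, 54, 139)  # boys, girls
    p_value = normal_cdf(u)                 # left tail, since H1: p1 < p2
    print(round(u, 3), round(p_value, 4))
```

Working with unrounded proportions gives a statistic of roughly -1.45 rather than the -1.44 obtained from the three-decimal values, with a correspondingly slightly smaller p-value; the conclusion is unchanged.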
Examples - contd.
Large scale 1980 survey in country showed 30% of adult population with given genetic trait. If this is still the current rate, what is the probability that, in a random sample of 1000, the number with the trait will be (a) < 280, (b) 316 or more?
Soln. Let X = no. of successes (with trait) in sample. So, for expected proportion of 0.3 in population, we suppose X ~ B(1000, 0.3)
Since np = 300 and $\sqrt{npq} = \sqrt{210} = 14.49$, distn. of X is approximately $N(300, 14.49^2)$
(a) \[ P\{X < 280\} \text{ or } P\{X \le 279\} \approx P\left\{ U \le \frac{279.5 - 300}{14.49} \right\} = P\{U \le -1.415\} = 0.0786 \]
(b) \[ P\{X \ge 316\} \approx P\left\{ U \ge \frac{315.5 - 300}{14.49} \right\} = P\{U \ge 1.07\} = 1 - 0.8577 = 0.1423 \]
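The Normal approximation with continuity correction used in this example can be sketched as follows. The function names are assumptions; n = 1000 and p = 0.3 come from the example, and the cut-offs 279.5 and 315.5 are the continuity-corrected boundaries used in the solution.

```python
import math


def normal_cdf(u: float) -> float:
    """C.D.F. of the Standard Normal, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def binom_normal_le(k: int, n: int, p: float) -> float:
    """Approximate P{X <= k} for X ~ B(n, p), with continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    return normal_cdf((k + 0.5 - mean) / sd)


if __name__ == "__main__":
    n, p = 1000, 0.3
    prob_a = binom_normal_le(279, n, p)        # P{X <= 279}, cut-off 279.5
    prob_b = 1.0 - binom_normal_le(315, n, p)  # P{X >= 316}, cut-off 315.5
    print(round(prob_a, 4), round(prob_b, 4))
```

Both printed values should be close to the table-based answers 0.0786 and 0.1423, with small differences due to rounding of U in the hand calculation.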
Examples contd.
Blood pressure readings before and after 6 months on medication, taken in women students (aged 25-35); sample of 15. Calculate (a) 95% C.I. for mean change in B.P. (b) test at 1% level of significance ($\alpha = 0.01$) that the medication reduces B.P.
Data:
Subject    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1st (x)   70  80  72  76  76  76  72  78  82  64  74  92  74  68  84
2nd (y)   68  72  62  70  58  66  68  52  64  72  74  60  74  72  74
d = x-y    2   8  10   6  18  10   4  26  18  -8   0  32   0  -4  10
(a) So for 95% C. limits
\[ \bar{d} = \frac{\sum d_i}{15} = 8.80, \qquad s = \sqrt{ \frac{\sum (d_i - \bar{d})^2}{14} } = 10.98, \qquad \text{limits: } \bar{d} \pm t_{0.025} \frac{s}{\sqrt{15}} \]
Contd.
Value for $t_{0.025}$ based on d.o.f. = 14. From t-table, find $t_{0.025}$ = 2.145
So, 95% C.I. is:
\[ P\left\{ 8.80 - 2.145 \frac{10.98}{\sqrt{15}} \le \mu_D \le 8.80 + 2.145 \frac{10.98}{\sqrt{15}} \right\} = 0.95 \]
i.e. limits are 8.80 ± 6.08 or (2.72, 14.88), so 95% confident that there is a mean difference (reduction) in B.P. of between 2.72 and 14.88
(b) The claim is that $\mu_D > 0$, so we look at H0: $\mu_D = 0$ vs H1: $\mu_D > 0$.
So t-statistic as before, but right-tailed (one-sided only) Rejection Region.
For d.o.f. = 14, $t_{0.01}$ = 2.624. So calculated value from our data
\[ t = \frac{\bar{d}}{s / \sqrt{n}} = \frac{8.80}{10.98 / \sqrt{15}} = 3.10 \]
is clearly in the Rejection region, so H0 is rejected in favour of H1 at $\alpha$ = 0.01. Reduction in B.P. after medication strongly supported by data.
[Figure: $t_{14}$ distribution; Accept region to the left of $t_{0.01}$ = 2.624, Reject region (upper tail, area = 1%) to the right.]
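The paired blood pressure analysis can be reproduced from the data in the example. In the sketch below the variable and function names are assumptions; the critical values 2.145 and 2.624 are the t-table values quoted in the solution for 14 d.o.f., entered as constants rather than computed.

```python
import math
import statistics

# Readings for the 15 subjects, from the data table in the example.
first = [70, 80, 72, 76, 76, 76, 72, 78, 82, 64, 74, 92, 74, 68, 84]
second = [68, 72, 62, 70, 58, 66, 68, 52, 64, 72, 74, 60, 74, 72, 74]

T_025_14 = 2.145  # t-table value quoted in the solution (two-sided 95%)
T_01_14 = 2.624   # t-table value quoted in the solution (one-sided 1%)


def paired_summary(x, y):
    """Mean, sample s.d. (divisor n - 1) and S.E. of the differences x - y."""
    d = [a - b for a, b in zip(x, y)]
    d_bar = statistics.mean(d)
    s = statistics.stdev(d)
    return d_bar, s, s / math.sqrt(len(d))


if __name__ == "__main__":
    d_bar, s, se = paired_summary(first, second)
    print("mean difference:", round(d_bar, 2), " s:", round(s, 2))
    print("95% C.I.:", round(d_bar - T_025_14 * se, 2),
          round(d_bar + T_025_14 * se, 2))
    t_stat = d_bar / se
    print("t:", round(t_stat, 2), " reject H0 at 1%:", t_stat > T_01_14)
```

Keeping the critical values as named constants makes the dependence on the t-table explicit; a statistics library could supply them instead if one is available.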