A Genetical Genomics Project
Richard Mott
Wellcome Trust Centre for Human
Genetics
Project Summary
• Gene Expression Analysis
– Gene Coexpression Networks
– Comparison between tissues
– Comparison with phenotypes
– Gene Ontology Analysis
• NB: There are many R packages available from
CRAN for gene expression and network analysis,
which are not covered in this lecture.
– You should explore them!
Gene expression datasets
• Hippocampus (460 mice),
• Liver and Lung (260 mice)
• 100 Phenotypes
• Mice are from a Heterogeneous Stock, from
164 families
Gene Expression data
• Gene expression measured on Illumina Mouse
arrays
– 47000 50-mer probes
– Approx 2 probes per gene
– Covariates (eg Sex, Family) also available
• > load("liver.exp.RData")
• > load("liver.cov.RData")
• > source("expression.tutorial.R")
Exploring Expression Data
> liver.median <- apply(liver.exp, 2, median )
> hist(liver.median, breaks=50)
> liver.subset <- liver.exp[,liver.median>6]
Sex Effects
• Which transcripts have different expression levels for the two
sexes?
– Use a T-test on each transcript
– The R apply() function speeds up the analysis
– First define a function tfunc that performs the T test and reports the
P-value
– tfunc <- function( X, GENDER ) {
tt <- t.test( X ~ GENDER );
return(tt$p.value) }
– Then compute the test for each transcript
– > sex.pvalue <- apply(liver.subset, 2,tfunc, liver.cov$GENDER )
– Then plot the distribution of p-values
– > hist( sex.pvalue, breaks=50)
– > sum(sex.pvalue<1.0e-5)
– [1] 78
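The per-transcript test can be tried without the course data files. The sketch below uses a small simulated matrix in place of liver.subset; the names sim.exp, sim.sex and sim.pvalue and all the values are invented for illustration, and only tfunc is taken from the slide.

```r
# Simulated stand-in for liver.subset: 40 mice (rows) x 5 transcripts (columns).
# All names and values here are invented for illustration.
set.seed(1)
sim.sex <- factor(rep(c("F", "M"), each = 20))
sim.exp <- matrix(rnorm(40 * 5, mean = 8), nrow = 40,
                  dimnames = list(NULL, paste0("probe", 1:5)))
# Give probe1 a genuine sex effect: shift expression upwards in males.
sim.exp[sim.sex == "M", "probe1"] <- sim.exp[sim.sex == "M", "probe1"] + 3

# Same helper as on the slide: t-test p-value for one transcript.
tfunc <- function(X, GENDER) {
  tt <- t.test(X ~ GENDER)
  return(tt$p.value)
}

# MARGIN = 2 applies tfunc to each column, i.e. to each transcript.
sim.pvalue <- apply(sim.exp, 2, tfunc, sim.sex)
print(sim.pvalue)
```

The result is a named vector with one p-value per column, so it can be passed straight to hist() or thresholded as on the slide.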
Sex Effects
312/2796 (11%) of transcripts with median level > 6 have sex effects with P < 0.01
78/2796 (3%) of transcripts with median level > 6 have sex effects with P < 0.00001
Family effects (Heritability)
• Which transcripts are affected by genetic background?
• Use one-way ANOVA wrapped inside apply()
• First define a function to return the p-value of the ANOVA:
anova.pvalue <- function( X, factor ) {
a <- anova(lm( X ~ factor))
return(a[1,5])
}
• Then find the transcripts with high heritability
family <- apply( liver.subset, 2, anova.pvalue, liver.cov$Family )
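One point worth checking when reusing anova.pvalue: lm() only fits a one-way ANOVA if the grouping variable is a factor. The slides do not say how liver.cov$Family is stored, so the sketch below simply illustrates the difference on simulated data; sim.family, family.mean and sim.x are hypothetical names.

```r
set.seed(2)
# 10 hypothetical families of 6 mice each; family IDs coded as numbers.
sim.family <- rep(1:10, each = 6)
# Each family gets its own mean, so expression is "heritable" by construction.
family.mean <- rnorm(10, sd = 2)
sim.x <- family.mean[sim.family] + rnorm(60)

# Helper from the slide.
anova.pvalue <- function(X, factor) {
  a <- anova(lm(X ~ factor))
  return(a[1, 5])  # column 5 of the ANOVA table is Pr(>F)
}

# Numeric IDs: lm() fits a single slope (1 degree of freedom).
a.num <- anova(lm(sim.x ~ sim.family))
# Factor IDs: one-way ANOVA with (number of families - 1) degrees of freedom.
a.fac <- anova(lm(sim.x ~ as.factor(sim.family)))
print(c(df.numeric = a.num[1, "Df"], df.factor = a.fac[1, "Df"]))

p.family <- anova.pvalue(sim.x, as.factor(sim.family))
print(p.family)
```

If the family identifiers turn out to be numeric, wrapping them in as.factor() before calling apply() gives the intended test.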
Family Effects
18% of transcripts with median level >6 have heritability p-value < 0.01
0.2% of transcripts with median level >6 have heritability p-value < 0.00001
Body Weight
• We can find transcripts associated with body
weight in a similar fashion to family effects,
except that linear regression is used.
– Note that the direction of causality is no longer
certain, ie it is not clear whether variation in a
transcript is causative for variation in weight or
vice versa
> weight.pvalue <- apply( liver.subset, 2,
anova.pvalue, liver.cov$EndNormalBW )
> hist(weight.pvalue,breaks=50)
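With a numeric covariate, the p-value returned by anova(lm(...)) is the p-value for the regression slope, and it is the same whichever variable is treated as the response. That is one way to see why the test says nothing about the direction of causality. A minimal sketch on simulated data (sim.weight and sim.x are invented stand-ins, not the course data):

```r
set.seed(3)
sim.weight <- rnorm(50, mean = 25, sd = 3)     # hypothetical body weights
sim.x <- 8 + 0.2 * sim.weight + rnorm(50)      # transcript associated with weight

fit <- lm(sim.x ~ sim.weight)
p.anova <- anova(fit)[1, 5]                            # F-test p-value
p.slope <- summary(fit)$coefficients["sim.weight", 4]  # t-test p-value for the slope

# Swap response and predictor: the p-value does not change.
fit.rev <- lm(sim.weight ~ sim.x)
p.reverse <- anova(fit.rev)[1, 5]

print(c(anova = p.anova, slope = p.slope, reversed = p.reverse))
```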
Body Weight
11% of transcripts with median levels > 6 are significant at P < 0.01
1.5% of transcripts with median levels > 6 are significant at P < 0.00001
What do the genes do?
• So far we have identified sets of genes which are associated
with sex, family and weight
• How can we characterise these genes ?
• One popular method is to test if the annotations of these
genes have unusual features.
• Annotations include:
– genome location
– protein domain architecture (eg from INTERPRO)
– gene function, where known (eg from GO)
• From a statistical perspective, it is important that a
controlled vocabulary (ontology) is used to describe gene
functions.
– The analysis then does not have to understand any biology!!
The Gene Ontology (GO)
http://www.geneontology.org/
• GO associates a set of GO-terms with every gene,
describing aspects of the gene’s known function.
• It has become a very popular tool for automated
investigation of large sets of genes.
• But note that:
– GO is not complete, covering only biological
processes, cellular components and molecular
functions. Other ontologies are also important.
– many genes have no known function
GO annotation Examples
GO:0000001 mitochondrion inheritance
GO:0000002 mitochondrial genome maintenance
GO:0000003 reproduction
GO:0000005 ribosomal chaperone activity
GO:0000006 high affinity zinc uptake transporter activity
GO:0000007 low-affinity zinc ion transporter activity
GO:0000008 thioredoxin
GO:0000009 alpha-1,6-mannosyltransferase activity
GO:0000010 trans-hexaprenyltranstransferase activity
GO:0000011 vacuole inheritance
GO:0000012 single strand break repair
ENSMUSG00000061404 Olfr936 GO:0001584 GO:0016020 GO:0007600 GO:0007166 GO:0004930 GO:0031224 GO:0003674 GO:0005623 GO:0050896
GO:0016021 GO:0004888 GO:0004871 GO:0050877 GO:0007582 GO:0005575 GO:0007186 GO:0007608 GO:0007165 GO:0004872 GO:0007154
GO:0044464 GO:0044425 GO:0004984 GO:0007606 GO:0050874 GO:0009987 GO:0051869 GO:0008150
ENSMUSG00000030105 Arl8b GO:0016043 GO:0007046 GO:0051233 GO:0005737 GO:0016817 GO:0044424 GO:0048487 GO:0005515 GO:0016787
GO:0043014 GO:0005623 GO:0007028 GO:0044422 GO:0044237 GO:0007242 GO:0043228 GO:0005856 GO:0007582 GO:0008152 GO:0007165
GO:0015630 GO:0008092 GO:0019003 GO:0016462 GO:0005622 GO:0044464 GO:0007154 GO:0003824 GO:0006139 GO:0006364 GO:0005488
GO:0003924 GO:0043170 GO:0016818 GO:0019001 GO:0009987 GO:0005525 GO:0008150 GO:0017076 GO:0043229 GO:0006396 GO:0016072
GO:0007059 GO:0043232 GO:0050875 GO:0044430 GO:0043283 GO:0044446 GO:0030496 GO:0015631 GO:0003674 GO:0042254 GO:0007264
GO:0000166 GO:0005819 GO:0017111 GO:0044238 GO:0043226 GO:0016070 GO:0005575 GO:0006996
ENSMUSG00000042428 Mgat3 GO:0016020 GO:0008375 GO:0043413 GO:0005615 GO:0044421 GO:0005737 GO:0031224 GO:0044267 GO:0044424
GO:0008194 GO:0009058 GO:0005623 GO:0044422 GO:0044237 GO:0007582 GO:0008152 GO:0044425 GO:0005622 GO:0044464 GO:0003824
GO:0044431 GO:0003830 GO:0043227 GO:0043170 GO:0019538 GO:0006487 GO:0009059 GO:0006412 GO:0009987 GO:0016740 GO:0008150
GO:0043229 GO:0006486 GO:0009101 GO:0050875 GO:0016758 GO:0043283 GO:0044446 GO:0005795 GO:0003674 GO:0009100 GO:0016021
GO:0005576 GO:0044238 GO:0044249 GO:0043226 GO:0043231 GO:0005575 GO:0044260 GO:0016757 GO:0006464 GO:0005794 GO:0043412
GO:0044444
Testing for GO association
• Set of genes G is classified into two groups eg by sex
• A given GO annotation term classifies the genes into
two groups (present, absent)
• The data are a 2x2 contingency table classified by sex
and GO, and the test of GO/sex association can be
done by a chi-squared test, by Fisher’s Exact
Test (FET), or by a generalised linear model with Poisson errors
and a log link function.
• The most popular methods use the FET, which can be
calculated quickly using the hypergeometric
distribution, and is exact even when the counts of data
are small
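The link between the FET and the hypergeometric distribution can be checked directly. In the sketch below the cell counts are invented; only the margins (1650 annotated transcripts, 174 of them with a sex effect) echo numbers that appear later in the lecture. It uses a one-sided test for over-representation, which is an assumption here: fisher.test() is two-sided by default.

```r
# Hypothetical counts for one GO term among 1650 annotated transcripts.
#                    sex effect   no sex effect
# GO term present        30            120
# GO term absent        144           1356
tab <- matrix(c(30, 144, 120, 1356), nrow = 2,
              dimnames = list(GO = c("present", "absent"),
                              sex = c("effect", "none")))

# One-sided FET: is the GO term over-represented among sex-affected transcripts?
p.fet <- fisher.test(tab, alternative = "greater")$p.value

# Same probability from the hypergeometric distribution:
# draw 174 sex-affected transcripts from 150 with the term and 1500 without,
# and ask for the chance of seeing 30 or more with the term.
p.hyper <- phyper(30 - 1, m = 150, n = 1500, k = 174, lower.tail = FALSE)

print(c(fisher = p.fet, hypergeometric = p.hyper))
```

The two values agree, which is why a single phyper() call per GO term is enough when speed matters.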
Testing for GO association
• Read in a file of mappings between Illumina probe ids and Ensembl
Mouse gene ids
transcripts <- read.delim("mouse.transcripts.genes.txt", header=TRUE)
• Match them to the expressed transcripts (note that not all
transcripts match genes)
idx <- match( colnames(liver.exp), transcripts$transcript)
> length(idx)
[1] 47429
> sum(is.na(idx))
[1] 9343
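For readers new to match(): it returns, for each element of its first argument, the position of the first match in the second argument, or NA when there is none. A toy sketch with made-up identifiers (probe.ids and map stand in for colnames(liver.exp) and the mapping file):

```r
# Made-up identifiers for illustration only.
probe.ids <- c("p1", "p2", "p3", "p4")
map <- data.frame(transcript = c("p3", "p1", "p9"),
                  gene = c("geneA", "geneB", "geneC"),
                  stringsAsFactors = FALSE)

idx <- match(probe.ids, map$transcript)
print(idx)                    # position in map of each probe, NA when unmapped
print(sum(is.na(idx)))        # number of probes without a gene

mapped.gene <- map$gene[idx]  # gene per probe, with NA carried through
print(mapped.gene)
```

The result always has the same length as the first argument, which is why length(idx) on the slide equals the number of probes.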
• Read in a file of GO terms associated with each Ensembl Mouse
gene (this set has been reduced to include only those GO terms
present in more than 5% of genes)
go1 <- read.delim("GO.Ensembl.01.txt")
> dim(go1)
[1] 19988 387
Testing for GO association
• Find the common transcripts between liver.subset and the annotations, and those transcripts with sex p-values < 0.01
> intersect <- colnames(liver.subset)[match(go1$transcript, colnames(liver.subset), nomatch = 0)]
> intersect <- unique(sort(intersect))
> liver.subset.intersect <- liver.subset[, match(intersect, colnames(liver.subset))]
> dim( liver.subset.intersect)
[1] 275 1650
> go.intersect <- go1[match(intersect,go1$transcript),]
> dim(go.intersect)
[1] 1650 388
> sex.ids <- colnames(liver.subset)[sex.pvalue<0.01]
> sex.intersect <- sex.ids[sex.ids %in% intersect]
> length(sex.intersect)
[1] 174
> sex.idx <- go.intersect$transcript %in% sex.ids
Testing for GO Association
using apply() and fisher.test()
fisher.func <- function( X, sex.idx) {
  X <- as.factor(X)
  if ( nlevels(X) == 2 ) {
    f <- fisher.test(X, sex.idx)
    return(f$p.value)
  } else return(1)
}
> fish <- apply( go.intersect[,4:ncol(go.intersect)], 2, fisher.func, sex.idx )
> length(fish)
[1] 385
> fish[fish < 0.01]
  GO.0000267   GO.0002376   GO.0003735   GO.0005624   GO.0005783   GO.0005840
5.255498e-03 6.142841e-04 9.096193e-03 4.108839e-04 1.153113e-03 9.125263e-03
  GO.0006412   GO.0006955   GO.0009058   GO.0009059   GO.0016740   GO.0016788
9.852476e-05 4.726468e-05 7.243732e-03 4.532035e-05 3.915276e-03 4.224464e-03
  GO.0030529   GO.0043170   GO.0043234   GO.0044249   GO.0044422   GO.0044446
2.250219e-03 2.157644e-03 5.039780e-04 2.347306e-04 2.360906e-03 2.360906e-03
> length(fish[fish < 0.01])
[1] 18
Significant GO terms
> data.frame( pvalue=fish[fish<0.01], desc=as.character(go2name$desc[go2name$go %in% names(fish[fish<0.01])]))
                 pvalue desc
GO.0000267 5.255498e-03 cell fraction
GO.0002376 6.142841e-04 immune system process
GO.0003735 9.096193e-03 structural constituent of ribosome
GO.0005624 4.108839e-04 membrane fraction
GO.0005783 1.153113e-03 endoplasmic reticulum
GO.0005840 9.125263e-03 ribosome
GO.0006412 9.852476e-05 protein biosynthesis
GO.0006955 4.726468e-05 immune response
GO.0009058 7.243732e-03 biosynthesis
GO.0009059 4.532035e-05 macromolecule biosynthesis
GO.0016740 3.915276e-03 transferase activity
GO.0016788 4.224464e-03 hydrolase activity, acting on ester bonds
GO.0030529 2.250219e-03 ribonucleoprotein complex
GO.0043170 2.157644e-03 macromolecule metabolism
GO.0043234 5.039780e-04 protein complex
GO.0044249 2.347306e-04 cellular biosynthesis
GO.0044422 2.360906e-03 organelle part
GO.0044446 2.360906e-03 intracellular organelle part
Testing for GO association
1. Remove missing data
> load("hippocampus.pdo")
> names(pdo)
[1] "transformed.response.matrix" "covariate.data"
[3] "subformula.lhs"
> dim(pdo$transformed.response.matrix)
[1] 460 47429
> hipp <- pdo$transformed.response.matrix
> hipp.mean <- apply( hipp, 2, mean)
> hist(hipp.mean)
> hist(hipp.mean,breaks=50,freq=FALSE)
> sum(hipp.mean>5)
[1] 7805
> sum(hipp.mean>10)
[1] 4753
> sum(hipp.mean>20)
[1] 4618
> hipp.subset <- hipp[,hipp.mean>20]
> dim(hipp.subset)
[1] 460 4618
Compute pairwise correlations
between probes
> corr <- cor(hipp.subset)
> dim(corr)
[1] 4618 4618
> cor.t <- corr* sqrt( 458/(1-corr*corr))
> cor.pvalue <- pt( abs(cor.t), df=458,
lower=FALSE)
> qqplot(cor.t[lower.tri(cor.t)],
rt(4618*4617/2,df=458))
> abline(0,1)
> sum(cor.pvalue[lower.tri(cor.pvalue)]<
1.0e-6)
[1] 1221647
> sum(cor.pvalue[lower.tri(cor.pvalue)]<
1.0e-10)
[1] 721629
Plenty of significant correlations !
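The t statistic used above is r * sqrt((n - 2) / (1 - r^2)) with n - 2 = 458 degrees of freedom for 460 mice. The sketch below checks that formula against R's built-in cor.test() on simulated data (x and y are invented). Note that pt(..., lower=FALSE) as used on the slide returns a one-sided tail probability; if a two-sided test is intended, the value needs doubling, which is what cor.test() reports by default.

```r
set.seed(4)
n <- 460                        # number of mice, as in hipp.subset
x <- rnorm(n)                   # simulated expression values, for illustration
y <- 0.3 * x + rnorm(n)
r <- cor(x, y)

# t statistic for a correlation coefficient, with n - 2 degrees of freedom.
t.stat <- r * sqrt((n - 2) / (1 - r^2))
p.one.sided <- pt(abs(t.stat), df = n - 2, lower.tail = FALSE)

ct <- cor.test(x, y)            # built-in test, two-sided by default
print(c(manual.t = t.stat, cor.test.t = unname(ct$statistic)))
print(c(two.sided.manual = 2 * p.one.sided, cor.test.p = ct$p.value))
```

The matrix version on the slide applies the same formula to every pair at once, which is far faster than calling cor.test() on millions of pairs.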
Some Transcripts are highly correlated
What is the cause of the significant
correlations?
• Sex differences
• Linkage disequilibrium
• Gene Coexpression Networks
Sex Differences
T-test
> tfunc <- function( X, GENDER ) { tt <- t.test( X ~ GENDER ); return(tt$p.value) }
> tt <- apply( hipp.subset, 2, tfunc,
pdo$covariate.data$GENDER )
> length(tt)
[1] 4618
> qqplot( tt, runif(4618))
> sum(tt<0.01)
[1] 72
Few sex differences!
(Because the data were cleaned up beforehand)
This suggests that the correlations are not due
to sex differences
Positional and Linkage Disequilibrium (LD)
effects on gene expression
• Some probes belong to the same gene and so should be correlated
• LD is the correlation between neighbouring polymorphisms, which
breaks down over larger genomic distances due to recombination
• Neighbouring genes may have correlated gene expression because
– They might be controlled by the same cis-regulatory sites, so
polymorphisms within these sites will generate coordinated changes in
expression
– Polymorphisms within different cis-regulatory sites may be in LD
– SNPs within neighbouring probe sequences may be in LD
Using Probe Annotation Data
(thanks to Ernesto Lowy)
> anot <- read.delim( "Illumina.Annotations21112008.txt", sep="\t")
> head(anot)
             probe_id score chr     start       end               genes
1    scl011438.1_28-S    50   2 180759405 180759455 ENSMUSG00000027577,
2    scl1849.1.1_93-S    50  14 117869648 117869698 ENSMUSG00000058571,
3 scl44597.1.11_205-S    50  13  78165362  78165412 ENSMUSG00000064138,
4   scl0001543.1_72-S    50  11 103248523 103248573 ENSMUSG00000034247,
5       GI_38092213-S    50  11   4583375   4583425 ENSMUSG00000020412,
6    scl42569.2_692-S    50  12  29789380  29789430            NO-GENES
                                                 transcript
1                    ENSMUST00000108851,ENSMUST00000067120,
2                    ENSMUST00000088483,ENSMUST00000078849,
3                    ENSMUST00000091459,ENSMUST00000099358,
4                                       ENSMUST00000041272,
5 ENSMUST00000109932,ENSMUST00000070257,ENSMUST00000109930,
6
                                                   proteins
1                    ENSMUSP00000104479,ENSMUSP00000066338,
2                    ENSMUSP00000085835,ENSMUSP00000077893,
3                    ENSMUSP00000089038,ENSMUSP00000096960,
4                                       ENSMUSP00000047327,
5 ENSMUSP00000105558,ENSMUSP00000063272,ENSMUSP00000105556,
6
                            name
1    HP.express.scl011438.1_28.S
2    HP.express.scl1849.1.1_93.S
3 HP.express.scl44597.1.11_205.S
4   HP.express.scl0001543.1_72.S
5       HP.express.GI_38092213.S
6    HP.express.scl42569.2_692.S
Do probes from the same gene have
more similar signals?
Gene Networks
• Networks of interacting genes are important
• Interaction has several meanings
– The proteins encoded by the genes physically interact
– The proteins form part of the same complex
– The proteins are in the same metabolic pathway
– The expression of the proteins/mRNA are co-localised
• In the same tissue
• By subcellular localisation (nuclear, cytoplasmic, secreted)
– The mRNA expression levels are correlated in a
population of individuals
• Correlation coefficient measures pairwise association of
transcripts
Gene Co-expression Networks
Correlation coefficients between transcripts
Network:
Nodes are genes/transcripts
Edges connect highly correlated nodes
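A co-expression network of this kind can be sketched in a few lines of base R: threshold the correlation matrix to get an adjacency matrix, then count edges per node. The data below are simulated, and the 0.7 cut-off is an arbitrary choice for illustration, not a recommendation from the lecture.

```r
set.seed(5)
n <- 100
hub <- rnorm(n)
# Transcripts g1-g3 follow a shared signal; g4-g6 are independent noise.
sim <- cbind(g1 = hub + rnorm(n, sd = 0.3),
             g2 = hub + rnorm(n, sd = 0.3),
             g3 = hub + rnorm(n, sd = 0.3),
             g4 = rnorm(n), g5 = rnorm(n), g6 = rnorm(n))

corr <- cor(sim)
# Edge between two nodes when |correlation| exceeds the threshold.
adj <- abs(corr) > 0.7
diag(adj) <- FALSE               # no self-edges
degree <- rowSums(adj)           # number of edges per node
print(degree)
```

Here g1, g2 and g3 end up connected to each other and the noise transcripts are left isolated. For real data the threshold could instead be set from the correlation p-values computed earlier.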
Types of Networks
• Random
– Number of edges per node is about the same
– Every pair of nodes is equally likely to be
connected
• Scale-Free
– Some nodes have many more edges
• Hub genes that may control the expression of many
others
– Such Networks tend to be hierarchical (tree-like)
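The two network types can be contrasted by simulating their degree distributions. The sketch below is only an illustration: the random graph connects every pair with equal probability, and the second graph is grown by preferential attachment, which is one well-known mechanism for producing a few highly connected hubs. It is a swapped-in toy model and not a method taken from the lecture; all names and parameter values are assumptions.

```r
set.seed(6)
n <- 500

# Random graph: every pair of nodes is connected with the same probability.
p <- 4 / (n - 1)                          # expected degree of about 4
rand.adj <- matrix(FALSE, n, n)
rand.adj[upper.tri(rand.adj)] <- runif(n * (n - 1) / 2) < p
rand.degree <- rowSums(rand.adj) + colSums(rand.adj)

# Preferential attachment: each new node links to 2 existing nodes chosen
# with probability proportional to their current degree.
pa.degree <- integer(n)
pa.degree[1:3] <- 2L                      # start from a triangle
for (i in 4:n) {
  targets <- sample(i - 1, size = 2, prob = pa.degree[1:(i - 1)])
  pa.degree[targets] <- pa.degree[targets] + 1L
  pa.degree[i] <- 2L
}

print(c(random.max = max(rand.degree), hub.max = max(pa.degree)))
print(c(random.mean = mean(rand.degree), hub.mean = mean(pa.degree)))
```

Plotting hist(rand.degree) next to hist(pa.degree) shows the difference described above: similar average connectivity, but a much longer tail when hubs are present.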