Biostatistics & Bioinformatics Unit
(research summary)
BBU members
• Peter Holmans
• Valentina Moskvina
• Andrew Pocklington
• Marian Hamshere
• Dobril Ivanov
• Giancarlo Russo
• Alex Richards
• Alexey Vedernikov
Research Areas
• Genome-wide association analyses
• Polygenic analyses to investigate genetic
architecture and relationship between traits
• Sub-phenotype analysis: can refining the
phenotype refine the association signal?
• Gene-wide analysis to summarise
association evidence per gene
• Genome-wide interaction analysis:
statistically desirable?
• Integrating gene expression data and
association data: are eQTLs useful for
predicting disease?
• Pathway analysis: are sets of biologically-related
genes enriched for association (or CNV) signal?
• Next generation sequencing
Data
• Samples
– Bipolar, Schizophrenia, Alzheimer’s Disease, Parkinson’s,
ADHD, VCFS
– 3k cases, 5k controls (on average)
– 9k cases, 12k controls through collaboration
• SNP data
– 500k–1.2M genotyped (genome-wide on chip)
– 8.5M imputed
   • currently 1kG + HapMap3 CEU+TSI samples as reference panel
   • 256-node cluster (PBS script)
– 200k custom chip
• QC
– Remove poor samples, poor SNPs, minimise systematic bias,
cluster plots
• Merging data
– Strand alignment, overlapping samples
• Analyse & store results
– Plink, snptest, mach2dat, custom scripts etc.
Marian Hamshere
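The SNP-level part of the QC step can be illustrated with a small sketch that drops SNPs with a low call rate or a low minor allele frequency. The thresholds, function names and genotype coding (0/1/2 copies of an allele, None for a missing call) are assumptions made for illustration; the slides only say that tools such as Plink and custom scripts are used, not how the filters are set.

```python
from typing import Dict, List, Optional

# Assumed coding: 0/1/2 copies of the coded allele, None for a missing call.
Genotypes = List[Optional[int]]


def call_rate(genos: Genotypes) -> float:
    """Fraction of samples with a non-missing genotype."""
    if not genos:
        return 0.0
    return sum(g is not None for g in genos) / len(genos)


def minor_allele_freq(genos: Genotypes) -> float:
    """Minor allele frequency among the non-missing genotypes."""
    called = [g for g in genos if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def filter_snps(
    data: Dict[str, Genotypes],
    min_call_rate: float = 0.98,  # hypothetical threshold
    min_maf: float = 0.01,        # hypothetical threshold
) -> Dict[str, Genotypes]:
    """Keep only SNPs that pass both the call-rate and the MAF filter."""
    return {
        snp: genos
        for snp, genos in data.items()
        if call_rate(genos) >= min_call_rate
        and minor_allele_freq(genos) >= min_maf
    }


# Made-up genotypes for ten samples.
example = {
    "snp_good": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],
    "snp_missing": [0, None, None, 1, 0, 1, 0, 2, 1, 0],
    "snp_rare": [0] * 10,
}
kept = filter_snps(example)
print(sorted(kept))
```

Sample-level filters and cluster-plot inspection, also listed on the slide, are not covered by this sketch.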
Data
SNPs
Genes (exons)
Genome
Phenotype data
• Plenty!
– Psychological measures, e.g. grandiose
delusions, depression
– fMRI brain volume
– Neurocognitive measures, e.g. IQ, speed
tasks
• Define phenotypes across
– Collections
– Diseases
• Targeted/guided hypotheses
• Data reduction techniques, e.g. PCA
• Cross disorder polygenic analysis (with care!)
• Phenotype analysis -> guide GWAS
Marian Hamshere
Databases
• 37 interrelated MySQL databases on 3 servers,
approximately:
- 730 tables
- 9,600,000,000 records
- 345 GB of disk space
• 2 Web Applications using the databases
WGA Results:
- Genome-wide association analysis results (AD, BD, SZ, PD)
- SNP-to-Gene and Gene-to-SNP annotations
- Upload SNP lists for analysis and annotation
- Download query results, top hits, entire result sets
WTCCC mirror:
- analysis, SNP, genotype, phenotype and sample data
- selection and graphical representation of data according
to user filters
Alexey Vedernikov
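A SNP-to-Gene annotation of the kind listed above can be thought of as an interval lookup: a SNP is assigned to every gene whose coordinates (optionally extended by a flanking window) contain its position. The sketch below uses in-memory Python structures rather than the MySQL databases described on the slide, and all gene names, SNP identifiers and coordinates are made up for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def snp_to_genes(
    snps: Dict[str, Tuple[str, int]],
    genes: List[Gene],
    flank: int = 0,
) -> Dict[str, List[str]]:
    """Map each SNP to the genes whose (flank-extended) interval contains it."""
    annotation: Dict[str, List[str]] = {}
    for snp_id, (chrom, pos) in snps.items():
        annotation[snp_id] = [
            g.name
            for g in genes
            if g.chrom == chrom and g.start - flank <= pos <= g.end + flank
        ]
    return annotation


def gene_to_snps(annotation: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert a SNP-to-Gene mapping into a Gene-to-SNP mapping."""
    inverted: Dict[str, List[str]] = {}
    for snp_id, gene_names in annotation.items():
        for name in gene_names:
            inverted.setdefault(name, []).append(snp_id)
    return inverted


# Hypothetical genes and SNPs.
genes = [
    Gene("GENE_A", "chr1", 1000, 2000),
    Gene("GENE_B", "chr1", 1800, 3000),
    Gene("GENE_C", "chr2", 500, 900),
]
snps = {"snp1": ("chr1", 1500), "snp2": ("chr1", 1900), "snp3": ("chr2", 1000)}

strict = snp_to_genes(snps, genes)
flanked = snp_to_genes(snps, genes, flank=200)
print(strict)
print(gene_to_snps(strict))
```

Overlapping genes mean that one SNP can map to several genes, which matters later when association evidence is summarised per gene.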
eQTLs and polygenic score analysis
• Analysis
- Polygenic score analysis is a method of aggregating
genotype data across many SNPs to predict affected status
- eQTL analysis is a way of linking genotype data with gene
expression levels
- eQTL analysis is used to define groups of SNPs with a
greater or lesser effect on global brain gene expression
• Data
- eQTLs are defined in the datasets of Myers et al (8361
transcripts and 380157 SNPs, for 163 adult control brains)
and Gibbs et al (2532685 SNPs and around 14000
transcripts in 125 adult control brain samples, across 4
brain areas)
- ISC and MGS data: SNPs with a greater effect on global
gene expression generally predict schizophrenia affected
status significantly better than those with a lesser effect
Alex Richards
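The aggregation step in polygenic score analysis can be sketched as a weighted sum of allele counts. The sketch assumes per-SNP weights such as log odds ratios estimated in a separate discovery sample; the weights, SNP identifiers and the optional `snp_subset` argument (for example, a group of SNPs defined by eQTL effect) are illustrative assumptions, not values from the datasets named above.

```python
from typing import Dict, Optional, Set


def polygenic_score(
    allele_counts: Dict[str, int],
    weights: Dict[str, float],
    snp_subset: Optional[Set[str]] = None,
) -> float:
    """Sum of weight * allele count over the SNPs that have a weight.

    If snp_subset is given, only SNPs in that subset contribute.
    """
    score = 0.0
    for snp, count in allele_counts.items():
        if snp not in weights:
            continue
        if snp_subset is not None and snp not in snp_subset:
            continue
        score += weights[snp] * count
    return score


# Hypothetical weights (e.g. log odds ratios) and one person's allele counts.
weights = {"snp1": 0.10, "snp2": -0.05, "snp3": 0.20}
person = {"snp1": 2, "snp2": 1, "snp3": 0}

all_snps_score = polygenic_score(person, weights)
subset_score = polygenic_score(person, weights, snp_subset={"snp1"})
print(all_snps_score, subset_score)
```

In a full analysis the scores computed for each individual in a target sample would then be tested as a predictor of affected status, for instance by logistic regression.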
Gene-wide analysis
• Analysis
- GWA studies are focused on SNPs as the unit of analysis
- Complex patterns of association might not be reflected by
association to the same SNPs in different samples
- Power to detect association might be enhanced by
exploiting information from multiple (quasi) independent
signals within genes
- Risk likely reflects the co-action of several loci but the
approximate numbers of loci involved at the individual or
the population levels are unknown
• Methods
- SNP – Gene annotations
- Permutations
- Use of summary statistics only
Valentina Moskvina
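One simple way to summarise association evidence per gene from summary statistics alone is to combine the SNP p-values assigned to that gene. The sketch below uses Fisher's combination method as an illustrative stand-in; it is not claimed to be the method used by the unit. Fisher's method assumes independent tests, which SNPs in linkage disequilibrium violate, so in practice a permutation procedure or a correction for correlation is needed.

```python
import math
from typing import Sequence


def fisher_combined_p(pvalues: Sequence[float]) -> float:
    """Combine p-values with Fisher's method (assumes independent tests).

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    upper tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical SNP p-values assigned to one gene.
gene_p = fisher_combined_p([0.04, 0.20, 0.01])
print(gene_p)
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the formula.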
Biological models of disease
• Molecular systems biology: models of neuronal
signalling and diversity
– Text-mining/data curation (interactions, functional annotations)
– Network analysis
– Data integration
• Neurobiology of disease: use these models to
understand disease genetics (e.g. by identifying
biologically-relevant sets of genes for pathway
analysis).
Andrew Pocklington
Biological models of disease
21,000 genes
179 neuro-anatomical structures
http://www.brain-map.org/
Pathway analysis
Testing whether biologically-related genes are
enriched for association signal
• GWAS data: do pathways contain a larger
number of significantly-associated genes than
expected? (allowing for varying gene sizes,
numbers of SNPs, genetic linkage,...)
• CNV data: do CNVs in cases hit more genes in a
pathway than CNVs in controls? (allowing for
varying gene and CNV sizes)
• Future:
– Continue to develop better-annotated pathways,
using genomic data from multiple sources (e.g.
expression, proteomics)
– Extend methods to use next-gen sequencing data
Peter Holmans
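The GWAS question "does a pathway contain more significantly-associated genes than expected?" can be illustrated in its simplest form with a one-sided hypergeometric test. This is a deliberately naive sketch: it treats all genes as equally likely to be significant, so it does not allow for varying gene sizes, numbers of SNPs or genetic linkage, which the slides say a real analysis must handle. The function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p(
    n_genes: int,
    n_significant: int,
    pathway_size: int,
    pathway_significant: int,
) -> float:
    """P(at least pathway_significant significant genes in the pathway).

    Assumes the pathway is a random draw of pathway_size genes from n_genes,
    of which n_significant are significantly associated.
    """
    max_hits = min(pathway_size, n_significant)
    total = comb(n_genes, pathway_size)
    tail = sum(
        comb(n_significant, k) * comb(n_genes - n_significant, pathway_size - k)
        for k in range(pathway_significant, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 21,000 genes, 500 significant, a 50-gene pathway
# containing 6 significant genes.
p_value = enrichment_p(21000, 500, 50, 6)
print(p_value)
```

Under this model the expected number of significant genes in the example pathway is 50 * 500 / 21000, a little over one, so six hits gives a small p-value.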
Next Generation Sequencing
• Targeted exome capture:
- Digest DNA to ~300 bp
- Clean and anneal adapters
- Perform Pre-capture PCR with indexed primers
- Hybridise exome capture probes
- Clean and extract captured regions
- Perform Post-capture PCR
- Quantify and pool DNA samples
• Data processing
- Indexing, aligning and sorting the output reads
- Removing PCR duplicate reads
- Analysis of target capture and coverage
- Local realignment around indels
- Recalibration of phred scores
- QC using depth information and phred scores
- Variant calling
• Analysis...
Giancarlo Russo
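The QC step that uses depth information and phred scores relies on the standard phred definition, Q = -10 log10(p), where p is the estimated error probability. The sketch below converts between the two and applies a simple depth and quality filter; the thresholds and function names are assumptions for illustration and are not taken from the pipeline described above.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an error probability in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must lie in (0, 1]")
    return -10 * math.log10(p)


def passes_qc(
    depth: int,
    quality: float,
    min_depth: int = 10,       # hypothetical threshold
    min_quality: float = 30.0, # hypothetical threshold
) -> bool:
    """Keep a site only if it has enough reads and a high enough quality."""
    return depth >= min_depth and quality >= min_quality


# Made-up sites: (read depth, phred-scaled quality).
sites = {"site1": (25, 45.0), "site2": (4, 60.0), "site3": (30, 12.0)}
kept_sites = [name for name, (d, q) in sites.items() if passes_qc(d, q)]
print(kept_sites)
```

A phred score of 30 corresponds to an error probability of 1 in 1,000, and each additional 10 points reduces it tenfold.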