Bioinformatics and Genetics
Kun Huang
Department of Biomedical Informatics
OSUCCC Biomedical Informatics Shared Resource
The Ohio State University
2011
Outline
• Introduction
• Genetic variations
• Technologies
• Array-based technology
• Massive sequencing
• Genome wide association study (GWAS)
• SNP array → exome sequencing → genome resequencing
• Expression quantitative trait loci (eQTL)
• Allelic specific expression
Genetic Variations
• SNP
• In-Del
• Transposon
• Copy number variation
• LOH
• Gene fusion
• …
Single Nucleotide Polymorphism
(SNP)
The single nucleotide polymorphism (SNP) [pronounced "snip"] is the
most common form of genetic variation. As the name suggests, each SNP
is a difference in a single nucleotide (A, T, C, or G) of an individual's DNA
sequence, such as having AAGG instead of ATGG. There may be from 1 to
10 million SNPs in the entire human genome, but perhaps only a few
thousand relate to disease outcomes. The numbers seem to change with
every news report.
By convention, a variant counts as a SNP when at least 1% of a population
has a different nucleotide at that position.
There are many other classes of variants (e.g., deletions and
duplications), and these are no less important; SNPs are simply the most
abundant.
First SNPs - RFLPs – D. Botstein - 1980
Critical SNP concepts
Marker SNP vs. Functional SNP
Marker SNPs highlight the spots to search (features, regions of interest).
SNP patterns from a target population can be compared with SNP patterns from
unaffected populations to find genetic variations shared only by the affected group.
The most useful SNPs are known as "functional SNPs." A single functional
SNP or certain combinations of functional SNPs may help explain variability
in individual responses to a given drug or pinpoint the subtle genetic
differences that predispose some to diseases such as arthritis, Alzheimer's,
cancer, diabetes, and depression.
Critical SNP concepts
• Understand evolution
• DNA fingerprinting – forensic applications
• Markers for polygenic traits
• Genotype-specific medicine (personalized medicine)
Critical SNP concepts
1. Humans are diploid and exhibit significant heterogeneity
and heterozygosity
2. DNA is essentially identical in every cell
3. The closer two SNPs are, the less likely they are to have
segregated independently in a population (linkage disequilibrium)
4. Multiple variants/alleles can be combined into
haplotypes (polygenic markers – quantitative trait loci or
QTL)
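Linkage disequilibrium between two SNPs is often summarized by D and r^2, computed from haplotype frequencies. The following is a minimal sketch, assuming phased haplotypes are available as two-character strings (first character is the allele at SNP 1, second the allele at SNP 2). The function name and the toy data are illustrative and not from the slides.

```python
def linkage_disequilibrium(haplotypes, allele1, allele2):
    """Return (D, r^2) for two biallelic SNPs from phased haplotypes.

    haplotypes: list of 2-character strings, e.g. "AB" (hypothetical encoding).
    allele1, allele2: the reference allele at SNP 1 and at SNP 2.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n   # frequency of allele1
    p2 = sum(h[1] == allele2 for h in haplotypes) / n   # frequency of allele2
    p12 = sum(h[0] == allele1 and h[1] == allele2 for h in haplotypes) / n
    d = p12 - p1 * p2                                   # D = p_AB - p_A * p_B
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    r2 = d * d / denom if denom > 0 else float("nan")   # undefined if a SNP is monomorphic
    return d, r2


# Toy data: the two SNPs always travel together, so LD is complete.
linked = ["AB", "AB", "ab", "ab"]
# Toy data: all four haplotypes are equally common, so there is no LD.
unlinked = ["AB", "Ab", "aB", "ab"]

print(linkage_disequilibrium(linked, "A", "B"))
print(linkage_disequilibrium(unlinked, "A", "B"))
```

An r^2 near 1 means one SNP is a good proxy (tag) for the other; an r^2 near 0 means the two SNPs carry independent information.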
HapMap
• The International HapMap Project is a multi-country effort to
identify and catalog genetic similarities and differences in
human beings. Six participating countries: Japan, the United
Kingdom, Canada, China, Nigeria, and the United States.
• The goal is to compare the genetic sequences of different
individuals to identify chromosomal regions where genetic
variants are shared.
• Data generated by the Project can be downloaded with minimal
constraints.
• http://www.hapmap.org/index.html.en
NCBI SNP
• In-Del
• Transposon
• Aneuploidy
• …
Keiko et al, Genome Research 2008
SNP Array
Affymetrix SNP 6.0 array
• More than 906,600 SNPs:
  • Unbiased selection of 482,000 SNPs; historical SNPs from the SNP Array 5.0
  • Selection of additional 424,000 SNPs:
    • Tag SNPs
    • SNPs from chromosomes X and Y
    • Mitochondrial SNPs
    • New SNPs added to the dbSNP database
    • SNPs in recombination hotspots
• More than 946,000 copy number probes
SNP Array
Affymetrix SNP 5.0 array
Cytogenetics
CGH – Comparative Genomic
Hybridization
Re-sequencing using massively parallel
sequencers
$1000 genome project
Solexa
SOLiD
454
GWAS
• Focus is on SNPs
• Control vs case
• Chi-square based test
• Distribution of haplotypes in different conditions
• Contingency table
• Other statistics or metrics can also be used
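The chi-square test on a contingency table can be sketched as follows. This is a minimal example, assuming a 2x2 table of allele counts (two alleles, cases vs. controls) and Pearson's statistic without continuity correction; the counts are made up for illustration. With one degree of freedom the p-value has a closed form, so only the standard library is needed.

```python
import math


def chi_square_2x2(case_a, case_b, control_a, control_b):
    """Pearson chi-square test (1 degree of freedom) for a 2x2 table.

    Rows are cases and controls; columns are the two alleles of one SNP.
    Returns (chi2, p_value).
    """
    n = case_a + case_b + control_a + control_b
    row_cases, row_controls = case_a + case_b, control_a + control_b
    col_a, col_b = case_a + control_a, case_b + control_b
    cells = (
        (case_a, row_cases, col_a),
        (case_b, row_cases, col_b),
        (control_a, row_controls, col_a),
        (control_b, row_controls, col_b),
    )
    chi2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n   # count expected under independence
        chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts for one SNP.
chi2, p = chi_square_2x2(case_a=60, case_b=40, control_a=40, control_b=60)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```

In a genome-wide scan the same test is repeated for every SNP, which is why the resulting p-values need a much stricter cutoff than the usual 0.05.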
GWAS
• Statistical challenges
• Millions of SNPs – millions of tests
• Compensate for multiple tests
• P-value cutoff is very stringent
• Needs a lot of samples (thousands or more) to achieve
the necessary power
• Rare event detection is statistically challenging
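To see how stringent the cutoff becomes, here is a small sketch using the Bonferroni correction. The slides do not name a specific correction method, so Bonferroni is an assumption here, chosen because it is the simplest. The SNP identifiers and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# One million SNPs tested at a family-wise alpha of 0.05.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"per-SNP cutoff: {threshold:.1e}")

# Hypothetical p-values; only SNPs below the corrected cutoff are reported.
p_values = {"snp_1": 3e-9, "snp_2": 1e-6, "snp_3": 0.02}
significant = [snp for snp, p in p_values.items() if p < threshold]
print(significant)
```

A SNP with p = 1e-6 would look convincing in a single test but does not survive the correction, which is one reason thousands of samples are needed.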
GWAS
• Interpretation challenges
• Association is NOT causation
• Many SNPs are in intergenic regions (not in genes)
• For SNPs in genes, most of them do NOT affect
protein coding – what are they doing?
• Due to the stringent cutoff, many potentially associated
genes are not selected, and it is hard to infer high-level
information such as pathways
GWAS
• Integration of bioinformatics information
• Pathway information – not necessarily the same genes
are targeted – could be the same pathways
• Other annotations – networks, GO terms
• Frequent patterns – data mining using frequent itemsets
on SNPs
• Frequent set mining on pathways (not just genes)
• The only phenotypes are disease vs control – how
about other phenotypes?
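Frequent itemset mining on SNPs can be illustrated with the simplest case, pairs of variants. This is a sketch under stated assumptions: each sample is represented as the set of variant SNPs it carries, and a pair is "frequent" if it co-occurs in at least a chosen fraction of samples. The SNP names and the support threshold are hypothetical, and a real analysis would use a full algorithm such as Apriori for larger itemsets.

```python
from collections import Counter
from itertools import combinations


def frequent_snp_pairs(samples, min_support):
    """Return {(snp_x, snp_y): support} for pairs carried together often enough.

    samples: list of sets, each holding the variant SNPs of one individual.
    min_support: minimum fraction of samples that must carry both SNPs.
    """
    counts = Counter()
    for variants in samples:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(variants), 2):
            counts[pair] += 1
    n = len(samples)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical case group: four individuals and the variant SNPs each carries.
cases = [
    {"snp_1", "snp_2", "snp_3"},
    {"snp_1", "snp_2"},
    {"snp_2", "snp_3"},
    {"snp_1", "snp_2"},
]
print(frequent_snp_pairs(cases, min_support=0.5))
```

The same function works on pathways instead of SNPs if each sample is described by the set of pathways hit by its variants.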
Quantitative Trait Locus (QTL)
• Quantitative phenotype – phenotype attributed to
multiple genes (polygenic effects)
• Examples – height, longevity
• Multiple genes + environment
• QTLs – stretches of DNA containing or linked to the genes
that underlie a quantitative trait
• Detection – copy number variation, SNPs
• Statistical analysis
• t statistics (compare the quantitative phenotypes
between the two groups with different genotype)
• Multiple genotype groups – ANOVA (F statistics)
• Mutual information
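The ANOVA case can be sketched as a one-way F statistic across genotype groups. The example below is a minimal illustration with made-up trait values for three genotypes at one SNP; the function name and data are assumptions, and only the F statistic is computed (turning it into a p-value requires the F distribution, for example from a statistics library).

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a quantitative trait split by genotype.

    groups: list of lists, one list of trait values per genotype group.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical trait values for genotypes AA, AG and GG.
trait_by_genotype = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(anova_f(trait_by_genotype))  # F = 3.0 for this toy data
```

With only two genotype groups, F equals the square of the pooled two-sample t statistic, so the t-test is the special case.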
Expression Quantitative Trait
Locus (eQTL)
• Gene expression is a quantitative phenotype –
phenotype attributed to multiple genes (what are the
possible ones?)
• Besides other genes – regulatory elements
• eQTLs – most focus on SNP vs gene expression
• 3 million SNPs × 20,000 genes → 6 × 10^10 ANOVA tests
Expression Quantitative Trait
Locus (eQTL)
• Restrict to a small set of SNPs
• E.g., for a gene, only focus on the SNPs in the gene
• Cis-eQTL (local)
• Trans-eQTL (distal)
• Direct and indirect effects
• Second and third order effects
• eQTL networks
Lodish et al, Molecular Cell Biology
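Restricting the search to cis SNPs amounts to a coordinate filter. The sketch below assumes genes are given as (chromosome, start, end) and SNPs as a mapping from ID to (chromosome, position). The optional flanking window, the identifiers and the coordinates are all hypothetical.

```python
def cis_snps(gene, snps, window=0):
    """Return IDs of SNPs inside a gene, optionally extended by a flanking window.

    gene: (chromosome, start, end) of the gene.
    snps: dict mapping SNP ID to (chromosome, position).
    window: number of flanking bases to include on each side (0 = gene body only).
    """
    chrom, start, end = gene
    return [
        snp_id
        for snp_id, (snp_chrom, pos) in snps.items()
        if snp_chrom == chrom and start - window <= pos <= end + window
    ]


# Hypothetical gene and SNP coordinates.
gene = ("chr1", 1000, 2000)
snps = {
    "snp_a": ("chr1", 1500),   # inside the gene
    "snp_b": ("chr1", 5000),   # same chromosome, 3 kb downstream
    "snp_c": ("chr2", 1500),   # different chromosome
}
print(cis_snps(gene, snps))               # gene body only
print(cis_snps(gene, snps, window=3000))  # gene plus 3 kb flanks
```

Every SNP the filter rejects for a given gene is a candidate trans-eQTL, so the same function also defines the much larger set that the cis restriction avoids testing.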
RNA-seq
Paradigm changes by NGS
• RNA-seq – not only gene expression, but also
sequences
TopHat
Trapnell et al. Bioinformatics 2009
After TopHat
• You got this:
Cufflinks
• But you want this:
Cufflinks
• Assigns each read to its potential isoform by maximizing a
function that assigns a likelihood to all possible sets of
relative abundances of the different isoforms.
• Open source software
Trapnell et al. Nat. Biot 2010
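The idea of choosing relative abundances that maximize a likelihood can be illustrated with a toy expectation-maximization (EM) loop. This is not Cufflinks' actual implementation: it is a simplified sketch that assumes all isoforms have equal length, ignores fragment-length and bias models, and takes as input only the set of isoforms each read is compatible with. All names and data are hypothetical.

```python
def em_isoform_abundance(read_compat, n_isoforms, n_iter=100):
    """Estimate relative isoform abundances from read compatibility sets.

    read_compat: list of sets; each set holds the isoform indices a read could
                 have come from (every read needs at least one).
    Returns a list of abundances that sums to 1.
    """
    theta = [1.0 / n_isoforms] * n_isoforms          # start from equal abundances
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among compatible isoforms in proportion
        # to the current abundance estimates.
        for compatible in read_compat:
            total = sum(theta[i] for i in compatible)
            for i in compatible:
                counts[i] += theta[i] / total
        # M-step: new abundances are the normalized expected read counts.
        theta = [c / len(read_compat) for c in counts]
    return theta


# Toy data: two reads unique to isoform 0, one unique to isoform 1,
# and one read compatible with both.
reads = [{0}, {0}, {0, 1}, {1}]
print(em_isoform_abundance(reads, n_isoforms=2))
```

The ambiguous read ends up split in the same proportion as the uniquely assigned reads, which is the behavior a likelihood-based assignment is meant to capture.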
From sequence reads to isoforms
Primary aligner: Eland, BFAST, Bowtie, …
Junction-finding strategy: TopHat, SOLiD Bioscope, …
Isoform identification:
Xing et al. NAR 2006
Jiang et al. Bioinformatics 2009
Cufflinks (Nat Biot 2010)
Scripture (Nat Biot 2010)
…
Allelic Specific Expression
• Specific X-chromosome suppression
• Much broader presence in the genome
• Screen for functional SNPs
Allelic Specific Expression
• Screen for functional SNPs
A=48 G=89
A=99 G=105
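If the figures above are read as allele-specific read counts at two heterozygous SNPs (an assumption, since the slide does not label them), allelic imbalance can be screened with a two-sided exact binomial test against the 50:50 ratio expected when both alleles are expressed equally. The function name is illustrative.

```python
from math import comb


def allelic_imbalance_pvalue(count_a, count_b):
    """Two-sided exact binomial test of count_a vs. count_b against a 50:50 ratio."""
    n = count_a + count_b
    k = min(count_a, count_b)
    # Probability of seeing k or fewer reads from one allele when p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * tail)


for a, g in [(48, 89), (99, 105)]:
    print(f"A={a} G={g}: p = {allelic_imbalance_pvalue(a, g):.3g}")
```

Under this reading, the first SNP (48 vs. 89) departs clearly from a balanced ratio while the second (99 vs. 105) does not, so only the first would be flagged as a candidate functional SNP.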
Allelic Specific Binding
• Protein binding requires recognition of specific
sequences (motifs)
• Mutations in the binding sites may lead to disruption
of regulation and hence expression
Kasowski et al, Science,
2010.
Allelic Specific Methylation
• One of the earliest known mechanisms for allelic specific
expression
Other Allelic Specific Events
• Allelic specific splicing
Nembaware V, Lupindo B, Schouest K, Spillane C, Scheffler K, Seoighe C.
Genome-wide survey of allele-specific splicing in humans.
BMC Genomics. 2008 Jun 2;9:265.