Introduction to myself
2009.12.11
Do Kyoon Kim
Outline
• Introduction to myself
• Course work
• Introduction to Bioinformatics
• What I have done in SNUBI
• Research Plan
Kim Do Kyoon
• Education
– Ph.D. candidate in Molecular and Genomic Medicine,
College of Medicine, Seoul National University, Mar 2006 – Present
– B.S. in Computer Science, Korea University, Mar 1999 – Feb 2006
• Experience
– Summer, 2005
• Intern in SNUBI Lab. at Seoul National University, College of Medicine
– Winter, 2004
• Intern in Sun Microsystems in Colorado, U.S.
– Summer, 2004
• Exchange student in University of Colorado at Boulder, U.S.
– Winter, 2003
• Intern in Ballet Creole, Department of Marketing, Canada
– Spring, 2003
• Study abroad in BEST (Business English School of Toronto), Business English Program,
Canada
Kim Do Kyoon
• English
– Speaking
– Writing
– Official English Test
• Strengths
– Programming skills: C, Java, Python, JSP, PHP
– Data manipulation
– Statistical Packages: R
– Database (Modeling)
– Linux
• Weaknesses
– Machine Learning (Lack of experience)
– Ability to write a paper
Course work
Introduction to Bioinformatics
Central Dogma of Molecular Biology
What is DNA?
DNA: about 3 billion bases
What is SNP?
How similar?
SNP: (Single Nucleotide Polymorphism)
Copy Number Variation (CNV)
• Human Genetic Variation
– SNP (~ 0.1%)
– Microsatellite
– Copy Number Variation (~ 18%)
• Copy Number Variant
– Segment of DNA > 1 kb in length
– Present at variable copy number with respect to a reference genome
– If present in > 1% of population: Copy Number Polymorphism
– Polymorphisms, not somatic rearrangements (tumours)
– Duplications, deletions, inversions
Feuk et al. 2006 Nature
What is Bioinformatics?
• Bioinformatics: the application of information technology to the field of
molecular biology
• Genomic data explosion
– Human genome project, DNA chip, Next generation sequencing
Bioinformatics Data
• DNA Sequence information
– Genome Projects, etc
• mRNA expression information
– Microarrays, SAGE
• Metabolite concentrations
– Mass Spec, etc
• Protein sequence information
• Protein structure information
Microarray
• Animation
– http://www.bio.davidson.edu/courses/genomics/chip/chip.html
• Data
– Matrix
What is Database?
• A collection of data
– Structured
– Searchable (index)
– Cross-referenced (hyperlinks): Link with other DB
• Access, updating, information insertion, information deletion
• Data storage management: flat files, relational database
Bio Databases
• Factual Database
– Sequence
– Gene
– Protein
– Transcription factor
• Knowledgebase
– Gene Ontology
– Pathway
– OMIM (Disease)
• Experiment database
– GEO (Gene Expression Omnibus)
– ArrayExpress
What I have done in SNUBI?
Microarray Analysis
• Gene Expression
– Basic analysis procedure: detect DEGs, Gene Ontology enrichment, pathway enrichment, classification
• SNP
• Copy number
Copy number detection
• Copy number proportional to hybridization intensity
• Examine intensity ratios with respect to a reference genome
• Change in intensity ratio: duplication / deletion
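The ratio-based detection described on this slide can be sketched as follows. This is a minimal illustration, not the pipeline used in SNUBI: the function name `call_copy_number`, the probe intensities, and the log2-ratio cut-offs of +0.3 and -0.3 are all assumptions chosen for the example.

```python
import math

def call_copy_number(sample_intensity, reference_intensity, gain=0.3, loss=-0.3):
    """Classify one probe from its log2 intensity ratio against the reference.

    The gain/loss thresholds are illustrative assumptions, not values
    taken from the slides.
    """
    log_ratio = math.log2(sample_intensity / reference_intensity)
    if log_ratio >= gain:
        return "duplication", log_ratio
    if log_ratio <= loss:
        return "deletion", log_ratio
    return "normal", log_ratio

# Hypothetical (sample, reference) hybridization intensities per probe
probes = {
    "probe_A": (1500.0, 1000.0),
    "probe_B": (1020.0, 1000.0),
    "probe_C": (480.0, 1000.0),
}

calls = {name: call_copy_number(s, r) for name, (s, r) in probes.items()}
for name, (state, ratio) in calls.items():
    print(f"{name}: log2 ratio {ratio:+.2f} -> {state}")
```

Real array data would normally be normalized and segmented across neighbouring probes before calling; the per-probe threshold here only shows the ratio idea.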
Analysis of copy number data
Amplification
Deletion
Heatmap of inferred copy number
ChromoViz-web
Poster
ISMB, 2006
• Multimodal visualization of gene expression data onto chromosomes
using scalable vector graphics
• http://xperanto.snubi.org:8080/ChromoViz/
Identify copy number aberrant regions
using gene expression data
Poster
ECCB, 2008
• Detect Differentially Expressed Genes (DEGs)
– T-test
• Identify putative regions of interest (ROIs) affected by genetic changes using
expression profiles
– Due to the spatial nature of the mechanism of amplification, genes that occur
closer to one another are more likely to be included in the same amplicon
– Hypergeometric Distribution can be used
• i: number of DEGs within window
• A: number of DEGs
• n: number of genes within window
• N: number of genes
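With the symbols defined above, one way to score a window is the upper tail of the hypergeometric distribution, P(X >= i): the probability of seeing at least i DEGs among n genes drawn from N genes of which A are DEGs. The sketch below assumes that one-sided tail is the intended test; the function name and the example counts are made up for illustration.

```python
from math import comb

def hypergeom_enrichment_p(i, A, n, N):
    """P(X >= i) for a window of n genes, given A DEGs among N genes.

    i: number of DEGs within window
    A: number of DEGs
    n: number of genes within window
    N: number of genes
    """
    upper = min(n, A)          # cannot observe more DEGs than n or A
    total = comb(N, n)         # all ways to pick the window's genes
    tail = sum(comb(A, k) * comb(N - A, n - k) for k in range(i, upper + 1))
    return tail / total

# Hypothetical chromosome: 2000 genes, 200 DEGs, window of 20 genes with 8 DEGs
p_value = hypergeom_enrichment_p(i=8, A=200, n=20, N=2000)
print(f"enrichment p-value: {p_value:.2e}")
```

When many windows are scanned along a chromosome, the resulting p-values would still need a multiple-testing correction.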
Chromosome 1
Results
Results
• Areas of amplification and deletion identified in human
tumors are strong candidates to contain genes
important for cancer development and progression
• Find an oncogene
– (e.g. REL/BCL11A gene loci: B-cell CLL/lymphoma 11A)
Integration of drug databases (project)
• Problem: the existing drug databases in Korea are not adequate for use as a
pharmacopoeia
– e.g. Korean Pharmaceutical Association, KFDA, HIRA, KIMS, Druginfor, etc.
• Need to integrate the existing drug databases under a
new schema
• http://snubi.org/~shats99/kma
Korean SNP Database
• Conference presentation
– 07/2008 (Invited) “Application of Pharmacogenomics: HapMap and Korean
polymorphism database,” 2008 Asian Institute in Statistical Genetics and
Genomics, Kyunghee University, Seoul, Korea
• Whole genome association studies using high-density DNA
oligonucleotide arrays
• Korean SNP data
– 200 samples
• Need to handle large-scale SNP data systematically
• Create Korean reference polymorphism database
Work flow
• Sample Collection
– 500K SNP data: 200 Korean samples
• Automated Genotype Calls
– RLMM, MAMS, Gtype, BRLMM
• Quality Control
– HWE test, confidence score, multiple testing correction, Snipper-HD
• Calc. allele frequency, genotype frequency, etc. from genotype call data
• Calc. pairwise Linkage Disequilibrium
– LD statistics (D', r^2)
• DB
– Store genotype, allele freq., HW Equilibrium, etc.
– Store phenotype data in the SNP DB
• Annotation
– Chr. pos., gene, region, AA change, etc.
• View, Search SNPs via Web interface
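The frequency step of the work flow can be illustrated with a small sketch. It assumes biallelic SNPs whose genotype calls are two-letter strings, with `None` for a no-call; the function name and the example calls are hypothetical and do not reflect the actual database code.

```python
from collections import Counter

def snp_frequencies(genotypes):
    """Genotype frequencies, allele frequencies and MAF for one biallelic SNP.

    genotypes: list of calls such as "AA", "AG", "GG"; None marks a no-call.
    """
    called = [g for g in genotypes if g]
    if not called:
        raise ValueError("no genotype calls available for this SNP")
    # Sort the two alleles so that "AG" and "GA" count as the same genotype
    genotype_counts = Counter("".join(sorted(g)) for g in called)
    allele_counts = Counter(allele for g in called for allele in g)
    n_alleles = sum(allele_counts.values())
    genotype_freq = {g: c / len(called) for g, c in genotype_counts.items()}
    allele_freq = {a: c / n_alleles for a, c in allele_counts.items()}
    # A monomorphic SNP has only one allele, so its MAF is 0
    maf = min(allele_freq.values()) if len(allele_freq) > 1 else 0.0
    return genotype_freq, allele_freq, maf

# Hypothetical calls for eight samples (one no-call)
calls = ["AA", "AG", "GA", "GG", "AA", None, "AG", "AA"]
genotype_freq, allele_freq, maf = snp_frequencies(calls)
print(genotype_freq, allele_freq, round(maf, 3))
```

A filter such as "MAF > 0.20, exclude monomorphic SNPs" then reduces to a comparison on the returned `maf`.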
1. Access the SNP DB
• http://kprn.snubi.org/snpchip
• Log in
– ID: kprn
– Password: kprnsnu12
2. Filters
Search Page
Filtering condition
MAF, is the SNP non-monomorphic?
3’UTR, 5’UTR, Exon, Intron,
Synonymous/non-synonymous SNP
Specific SNPs only
Search by
Chromosome, physical position,
Cytoband , gene
Output
Genotype, Frequency, Linkage Disequilibrium
3. Genotype: Filter page
MAF > 0.20
Exclude monomorphic SNPs
Specific position
Output: GENOTYPE
Click Submit: It will take some time!
3. Genotype: Result
Click for annotation
Genotype info
3. Genotype: Annotation
SNP annotation
(chr, physical position, cytoband, feature,
allele, associated gene,
allele freq. in Japanese, Chinese, Yoruba,
Caucasian, heterozygosity …)
Gene related information annotation
(Gene, Protein, Protein Family,
Enzyme and Pathway info.)
Click for annotation of a specific gene: using GRIP
6. Haploview
Click for Haploview
Xperanto-SNP
Poster
ISMB, 2008
• A web-based integrated system for SNP data management and
analysis
Xperanto-SNP
• http://kprn.snubi.org/xperanto_SNP
Developing a database for integrative
genomics
Poster
Pharmacogenomics &
Personalized Medicine, 2009
• Genetical genomics
– Measure the influence of genetic variation on gene expression. (Williams et al.
Nature 2006)
– Identifying chromosomal loci that control the level of expressions of a particular
gene. (Schadt et al. Nature 2005)
– Studying the genetic basis of gene expression. (Li et al. Human Molecular
Genetics 2005)
• There is no comprehensive database for multi-dimensional genomic
data from single individuals
• A database for genetical genomics can be extended for other types of
genomic data
Adrsnp (Project)
• http://kprn.snubi.org/adrsnp
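• In drug therapy, even when the same drug is given to patients with the same
disease, some patients suffer discomfort or even die because of unexpected
adverse drug reactions
• So far there is no way to predict which patients will develop such adverse
drug reactions
• Developing methods to select drugs or adjust doses according to each
individual's genomic characteristics is an urgent task
• This study analyzes genetic differences in patients who showed adverse drug
reactions and, on that basis, develops a technique for identifying patients who
are likely to develop adverse drug reactions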
The effects of copy number variation on
classical genetic studies
• Objective
– Identify the effects of copy number variation
on traditional genetic studies such as
• Linkage studies
• Genome-wide association studies
• Error checking
– Hardy-Weinberg Equilibrium test
– Mendelian Inconsistency test
Materials and Methods
• Data
– Phenotypes
• Expression level of genes in lymphoblastoid cells
• For 3,554 of the 8,500 genes tested
– Genotypes
• Genotypes of 2,882 autosomal and X-linked SNPs of members of
the 14 CEPH Utah families
• Hardy-Weinberg Disequilibrium test (p-value < 0.05)
– Pearson’s chi-square test
– Fisher’s exact test
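For a biallelic SNP, the Pearson chi-square version of the test compares observed genotype counts with the counts expected under Hardy-Weinberg proportions (p^2, 2pq, q^2) and has one degree of freedom. The sketch below covers only the chi-square variant, not Fisher's exact test; the function name and genotype counts are assumptions for illustration.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test for Hardy-Weinberg equilibrium (1 df).

    n_aa, n_ab, n_bb: observed counts of the three genotypes.
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1.0 - p                          # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical SNPs: one in equilibrium, one with a heterozygote deficit
chi2_ok, p_ok = hwe_chi_square(50, 100, 50)
chi2_bad, p_bad = hwe_chi_square(60, 20, 20)
print(f"in equilibrium: chi2={chi2_ok:.2f}, p={p_ok:.3f}")
print(f"deviating:      chi2={chi2_bad:.2f}, p={p_bad:.2e}")
```

SNPs with a p-value below 0.05, the threshold named on the slide, would be flagged as deviating from equilibrium.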
Materials and Methods
• Mendelian Inconsistency test
– PedStat (Abecasis et al 2002)
• Public Copy Number Variation data localization
– Database of Genomic Variants (http://projects.tcag.ca/variation/)
– Total entries: 6559
• CNVs: 6482
• Inversions: 77
– Localize annotations for SNP for convenient queries
• Search out SNPs within Copy Number Variation regions
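Searching for SNPs within CNV regions is an interval lookup once both annotations are stored locally. A minimal sketch, assuming inclusive start/end coordinates on the same genome build; the function name, SNP IDs and regions are made up and are not taken from the Database of Genomic Variants.

```python
def snps_in_cnv_regions(snps, cnv_regions):
    """Return IDs of SNPs whose position lies inside any CNV region.

    snps: iterable of (snp_id, chromosome, position)
    cnv_regions: iterable of (chromosome, start, end), inclusive coordinates
    """
    # Group the regions by chromosome so each SNP is only compared
    # against regions on its own chromosome
    by_chrom = {}
    for chrom, start, end in cnv_regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for snp_id, chrom, pos in snps:
        if any(start <= pos <= end for start, end in by_chrom.get(chrom, [])):
            hits.append(snp_id)
    return hits

# Hypothetical annotations
cnv_regions = [("chr1", 1_000_000, 1_250_000), ("chr7", 500_000, 900_000)]
snps = [
    ("snp_001", "chr1", 1_100_000),   # inside the chr1 region
    ("snp_002", "chr1", 2_000_000),   # outside
    ("snp_003", "chr7", 900_000),     # on the region boundary
    ("snp_004", "chrX", 600_000),     # chromosome without CNV entries
]
in_cnv = snps_in_cnv_regions(snps, cnv_regions)
print(in_cnv)
```

For a few thousand SNPs and CNV entries this linear scan is sufficient; an interval tree or sorted bisect lookup would be the usual choice at genome-wide scale.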
Results
• Genotype missing data
– Plot a grid showing which genotypes are missing
Results
• Hardy-Weinberg Disequilibrium test
Table 1. Result of Hardy-Weinberg Equilibrium test and search SNPs in CNV regions
(p-value < 0.05)
Test         N of SNPs    N of SNPs in CNV regions
HWE.exact    408          72
HWE.chisq    440          76
• Mendelian Inconsistency test
– 14 of the 63 distinct SNPs with Mendelian inconsistencies are found
within copy number variation regions
Results
• Morley et al.'s result (2004)
Structural insertion/deletion variation in IRF5 is associated with a risk haplotype and defines
the precise IRF5 isoforms expressed in systemic lupus erythematosus (Kozyrev et al.
ARTHRITIS & RHEUMATISM, 2007)
• Target gene: IRF5
– Encodes a member of the interferon regulatory factor (IRF) family
– A group of transcription factors with diverse roles
• Virus-mediated activation of interferon, modulation of cell growth,
differentiation, apoptosis, and immune system activity
Rare disease knowledge base (Project)
• Useful to all clinicians regardless of availability of molecular genetic
testing
• Provide to non-expert clinicians information on the diagnosis,
management and genetic counseling of patients with inherited
disorders and their families
• Expert-authored, peer-reviewed, updated regularly
• Disease descriptions focused on use of currently available molecular
genetic testing in diagnosis, management, and genetic counseling
TMA-TAB
• The importance of tissue microarrays (TMA) as clinical validation tools for
cDNA microarray results is increasing, whereas researchers are still suffering
from TMA data management issues
• We developed TMA-TAB, a spreadsheet-based data format for TMA data
submission to the TMA-OM supportive TMA database system
Research Plan
Research interests
• Integration and integrative analysis with multidimensional genomic
data
– SNP, copy number, LOH, gene expression, miRNA, methylation, exon, sequence
• Why important?
Biological Organization
[Figure: flow from DNA to phenotype. Genes and their transcription factor
binding sites (TFbs) are affected by TF binding, SNP, methylation, and CNV, LOH,
Del. TRANSCRIPTION and alternative splicing produce mRNA (EXPRESSION), which is
regulated by microRNA. TRANSLATION and post modification (glycosylation,
phosphorylation) produce proteins, including TF proteins (FUNCTION), leading to
the phenotype.]
TF: transcription factor
TFbs: transcription factor binding site
What is “Integrative genomics” ?
versus traditional research approaches
[Figure: normal vs. patient comparison showing a G/A variant, methylation and miRNA]
Mode of action based research
Identifying functional impact of genomic
alteration
Integration with methylation, expression,
and copy number aberration
The Cancer Genome Atlas (TCGA)
• Mission
– The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to
accelerate our understanding of the molecular basis of cancer through the
application of genome analysis technologies, including large-scale sequencing
• Goal
– To improve our ability to diagnose, treat and prevent cancer
– A pilot project developed and tested the research framework needed to
systematically explore the entire spectrum of genomic changes involved in human
cancer
The Cancer Genome Atlas
• 1981 discovery of a cancer-promoting version of a human gene,
known as an Oncogene
• Cancer is caused primarily by mutations in specific genes
• Mutations disrupt biological pathways in ways that result in
uncontrolled cell replication, or growth
• TCGA aims to find all mutations that occur with a frequency of 5% or
more for each tumor type
TCGA pilot project
• Focus on three selected cancer types
– Serous cystadenocarcinoma (ovarian)
– Squamous carcinoma (lung)
– Glioblastoma multiforme (brain)
• 500 samples per tumor type
TCGA GBM: Center Overview
[Figure: glioblastoma samples are distributed to the characterization centers
(Broad/DFCI, Harvard, LBNL, MSKCC, JHU/USC, Stanford, UNC) and the sequencing
centers (Broad, WU, Baylor).
Platforms: SNP 6.0, HTA, aCGH, Exon Array, GoldenGate, Infinium, 2 color arrays, PCR > ABI.
Data types: copy number, RNA expression, miRNA expression, methylation, somatic mutations.]
TCGA data
How to integrate?
TCGA research network., (2008), Nature
The second page of TCGA project
TCGA provides us with many challenges
• Very large, noisy datasets
• Integration of different data types
• Complex interactions within and between different data types
• Future genomics data sets will increase in size as new
technologies become available
Key questions posed at start of project
• Can samples of adequate quality and quantity be assembled?
• How sensitive, specific and comparable are current platforms?
• How can diverse data sets be integrated -- and what can be learned from
integration?
• Can we identify new genes associated with cancer types?
• Can we identify new subtypes of cancer?
• Does new knowledge suggest therapeutic implications?
Francis S. Collins et al., 2007
Data release
• Data Levels I and II correspond to raw and processed data, respectively, for
each sample
• Level III data are the output of basic analyses of Level I/II data, such as
mutational calls of sequenced genes, copy number and LOH calls of genomic
regions of aberrations, and expression level of a gene for each sample
• Level IV data represent interpretations of the data, such as what genes are
significantly mutated, or altered in copy number, DNA methylation, or
expression across multiple samples and data types
• For protection of patient privacy, access to Level I and/or II data for certain
platforms (e.g. SNP genotyping) or data types (e.g. germ-line mutations) is
restricted to qualified researchers and requires approval of a TCGA Data
Access Committee
Retrieving TCGA data
• Download TCGA data with open-access
– ftp
• ftp://ftp1.nci.nih.gov/tcga/
– Search by Archive
• Data Portal
– Search by Sample
• Data Access Matrix
Retrieving available TCGA Data: Done
• Time: about 10 days
• Size: About 225 GB
Mapping to common object for queries
[Figure: each platform provides a feature-by-sample matrix (samples S1, S2, …, SN):
– SNP: SNP_1 … SNP_500k
– Expression: affy_1 … affy_21860
– Exon: exon_1 … exon_32000
– miRNA: miRNA_1 … miRNA_1254
– Methylation: methy_1 … methy_1505
– CGH: clone_1 … clone_243430
All matrices are mapped to common objects: gene (Gene_1 … Gene_M, e.g. ERBB2)
and chromosome position (e.g. Chr17 q12).]
Identify common unit for integration with
multidimensional genomic data
• Copy Number: regions with copy number changes
– Position, sequence (clone), genes belonging to regions
• LOH: LOH region
– Position, genes belonging to regions
• Expression-Gene: differentially expressed genes
• Expression-miRNA: differentially expressed miRNAs
– Position, sequence (miRNA), target genes
• Expression-Exon: alternative splicing, differential expression of each exon
within a gene
– Position, sequence, gene (within)
• DNA Methylation: methylated sites (hyper- / hypomethylation)
– Position, sequence, genes belonging to regions (promoter)
• SNP: genotypes
– Position, flanking sequence (16 bases on each side of the SNP), annotated gene (5'UTR,
3'UTR, intron, exon, upstream, downstream)
• Common object
– Physical position
– Gene
– Sequence
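Using the gene as the common object, features from different platforms can be grouped under one key. The sketch below is a hypothetical illustration: the feature IDs, sample values and the two lookup tables are invented, and a real mapping would come from each platform's annotation files.

```python
from collections import defaultdict

# Hypothetical annotation: (platform, feature ID) -> gene symbol
FEATURE_TO_GENE = {
    ("expression", "affy_101"): "ERBB2",
    ("cgh", "clone_2040"): "ERBB2",
    ("methylation", "methy_77"): "ERBB2",
    ("expression", "affy_555"): "EGFR",
}

# Hypothetical measurements: (platform, feature ID) -> {sample: value}
MEASUREMENTS = {
    ("expression", "affy_101"): {"S1": 8.4, "S2": 11.9},
    ("cgh", "clone_2040"): {"S1": 0.05, "S2": 1.30},
    ("methylation", "methy_77"): {"S1": 0.62, "S2": 0.08},
    ("expression", "affy_555"): {"S1": 10.2, "S2": 9.7},
}

def collect_by_gene(feature_to_gene, measurements):
    """Group per-platform measurements under their shared gene symbol."""
    by_gene = defaultdict(dict)
    for key, gene in feature_to_gene.items():
        platform, feature_id = key
        if key in measurements:
            by_gene[gene].setdefault(platform, {})[feature_id] = measurements[key]
    return dict(by_gene)

by_gene = collect_by_gene(FEATURE_TO_GENE, MEASUREMENTS)
for platform, features in by_gene["ERBB2"].items():
    print(platform, features)
```

The same pattern applies to the other common objects by swapping the key, for example binning features by physical position instead of gene symbol.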
Clinical data
• Samples: about 220
– Statistics: not yet
• Clinical data
– Cancer and Normal (15)
State of the art
Pathway Analysis in GBM
[Figure: mutations, amplifications and homozygous deletions in three signaling
networks in GBM, with the share of tumors altered in each:
– RTK/RAS/PI-3K signaling network (86%): EGFR, ERBB2, PDGFRA, MET, NF-1, RAS,
PI-3K (Class I), PTEN, AKT, FOXO; outcomes: proliferation, survival, translation
– P53 signaling (86%): CDKN2A (ARF), MDM2, MDM4, TP53; outcomes: senescence, apoptosis
– RB signaling (77%): CDKN2A (P16/INK4A), CDKN2B, CDKN2C, CDK4, CDK6, CCND2, RB1;
outcome: G1/S progression
Per-gene alteration frequencies are given in the original figure.]
TCGA research network., (2008), Nature
Databasing TCGA data
• Multiple types of Annotation Files -> Database
– Experiment data (e.g. expression value)
• Level 1, 2, and 3
– Annotation (each platform)
– ADF files (same genome build)
• Theoretically, queries are possible
– Select all data with level 3 where gene symbol is ‘ERBB2’
– Select expression, CGH, methylation with level 2 where chromosome position is ‘17q12’
– Pathway?
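The two example queries could look like this against a relational store. This is a sketch under stated assumptions: the single-table schema, its column names and every row are hypothetical, and SQLite is used only because it ships with Python; the actual database design is not shown on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-table layout; a real schema would likely be normalized
conn.execute(
    """CREATE TABLE measurement (
           platform TEXT, data_level INTEGER, feature_id TEXT,
           gene_symbol TEXT, cytoband TEXT, sample_id TEXT, value REAL)"""
)
rows = [  # made-up values for illustration only
    ("expression", 3, "affy_101", "ERBB2", "17q12", "S1", 8.4),
    ("cgh", 2, "clone_2040", "ERBB2", "17q12", "S1", 0.9),
    ("methylation", 2, "methy_77", "ERBB2", "17q12", "S1", 0.15),
    ("mirna", 2, "miRNA_9", None, "1p36.22", "S1", 5.1),
    ("expression", 3, "affy_555", "EGFR", "7p11.2", "S1", 10.2),
]
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Select all data with level 3 where gene symbol is 'ERBB2'
level3_erbb2 = conn.execute(
    "SELECT platform, feature_id, sample_id, value FROM measurement "
    "WHERE data_level = ? AND gene_symbol = ?",
    (3, "ERBB2"),
).fetchall()

# Select expression, CGH, methylation with level 2 for one chromosome band
# (ERBB2 lies in cytoband 17q12)
level2_region = conn.execute(
    "SELECT platform, feature_id, value FROM measurement "
    "WHERE data_level = 2 AND cytoband = ? "
    "AND platform IN ('expression', 'cgh', 'methylation') "
    "ORDER BY platform",
    ("17q12",),
).fetchall()

print(level3_erbb2)
print(level2_region)
```

Indexes on `gene_symbol` and on chromosome position would matter once hundreds of thousands of probes per platform are loaded.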
Upload new platforms and Experiment data
• Probe
– Gene expression: 22277
– miRNA: 12033
– Array CGH: 243430
• Column-wise queries
• Row-wise queries
Integration of multidimensional genomic data
• Goal: develop a repository system to organize and mine multiple types of
genomics data
• Motivation
– Growing number of multi-type datasets produced (ovarian, lung cancer)
– Data cannot be readily analyzed (due to complexity of multiple types of
data)
– ‘Omics’ repository
• Vision
– Organize data to reflect their biological interdependencies
– Query subsets of data (e.g. by gene or chromosome position)
– Provide meaningful biological discoveries with integrative analysis using
different types of level 4 data
How?
Specific Goal
• Problem: Classification of tumor subtypes using multidimensional genomic data
• Kernel-level integration: possible?
Introduction
• Kernel methods are a powerful class of methods for pattern analysis
– Reliability, accuracy, and computational efficiency
• Kernel methods have the capability to handle a very wide range of data types
(sequences, vectors, networks, phylogenetic trees, and so on)
– The ability of kernel methods to deal with complex structured data makes them
ideally positioned for heterogeneous data integration (at the level of kernel
matrices)
• We propose a kernel-based approach for clinical decision support in which
many genome-wide sources are combined
• We apply this framework to two cancer cases, namely, a rectal cancer data
set containing microarray and proteomics data and a prostate cancer data set
containing microarray and genomic data
• For both cancer sites the prediction of all outcomes improved when more than
one genome-wide data set was considered
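Integration at the level of kernel matrices can be sketched in a few lines: build one sample-by-sample kernel per data source, normalize each, and combine them with a weighted sum. The sketch assumes linear kernels and uniform weights and stops before any classifier is trained; the data values and function names are hypothetical, and the weighting scheme of the cited approach is not reproduced here.

```python
import math

def linear_kernel(X):
    """Sample-by-sample matrix of dot products for one data source."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def normalize_kernel(K):
    """Scale a kernel so every sample has unit self-similarity.

    Assumes no sample is an all-zero vector (non-zero diagonal).
    """
    d = [math.sqrt(K[i][i]) for i in range(len(K))]
    return [[K[i][j] / (d[i] * d[j]) for j in range(len(K))] for i in range(len(K))]

def combine_kernels(kernels, weights=None):
    """Weighted sum of kernel matrices; uniform weights are an assumption."""
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical data: the same 3 samples measured on two platforms
expression = [[2.0, 1.0, 0.5], [1.8, 1.1, 0.4], [0.2, 3.0, 2.5]]
copy_number = [[0.1, 0.9], [0.2, 1.0], [1.5, -0.8]]

kernels = [normalize_kernel(linear_kernel(X)) for X in (expression, copy_number)]
combined = combine_kernels(kernels)
for row in combined:
    print([round(v, 3) for v in row])
```

Because each source is reduced to a matrix of the same size (samples by samples), sources with different feature counts can be combined directly; the combined matrix can then be passed to any kernel classifier that accepts a precomputed kernel.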
Data set
cetuximab
• WHEELER: grade of tumor regression
• pN-STAGE: lymph node stage at surgery
• CRM (circumferential resection margins): knowledge of the CRM before therapy
provides important prognostic information for local recurrence and development of
distant metastasis
Model building
• Kernel methods and weighted least squares support vector machine
– Single data set
– Manual integration of data over time
– Multiple omics integration approach
Results in rectal cancer
Thank You