Introduction to myself
2009.12.11
Do Kyoon Kim

Outline
• Introduction to myself
• Course work
• Introduction to Bioinformatics
• What I have done in SNUBI
• Research Plan

Kim Do Kyoon
• Education
  – Ph.D. candidate in Molecular and Genomic Medicine, College of Medicine, Seoul National University, Mar 2006 – Present
  – B.S. in Computer Science, Korea University, Mar 1999 – Feb 2006
• Experience
  – Summer 2005: Intern in SNUBI Lab. at Seoul National University, College of Medicine
  – Winter 2004: Intern at Sun Microsystems in Colorado, U.S.
  – Summer 2004: Exchange student at University of Colorado at Boulder, U.S.
  – Winter 2003: Intern at Ballet Creole, Department of Marketing, Canada
  – Spring 2003: Study abroad at BEST (Business English School of Toronto), Business English Program, Canada

Kim Do Kyoon
• English
  – Speaking
  – Writing
  – Official English Test
• Strengths
  – Programming skills: C, Java, Python, JSP, PHP
  – Data manipulation
  – Statistical packages: R
  – Database (modeling)
  – Linux
• Weaknesses
  – Machine learning (lack of experience)
  – Ability to write a paper

Course work

Introduction to Bioinformatics

Central Dogma of Molecular Biology

What is DNA?
• DNA: about 3 billion bases

What is SNP?
• How similar?
• SNP: Single Nucleotide Polymorphism

Copy Number Variation (CNV)
• Human Genetic Variation
  – SNP (~0.1%)
  – Microsatellite
  – Copy Number Variation (~18%)
• Copy Number Variant
  – Segment of DNA > 1 kb in length
  – Present at variable copy number with respect to a reference genome
  – If present in > 1% of population: Copy Number Polymorphism
  – Polymorphisms, not somatic rearrangements (tumours)
  – Duplications, deletions, inversions
Feuk et al. 2006 Nature

What is Bioinformatics?
• Bioinformatics: the application of information technology to the field of molecular biology
• Genomic data explosion
  – Human Genome Project, DNA chip, next-generation sequencing

Bioinformatics Data
• DNA sequence information
  – Genome projects, etc.
• mRNA expression information
  – Microarrays, SAGE
• Metabolite concentrations
  – Mass spec, etc.
• Protein sequence information
• Protein structure information

Microarray
• Animation
  – http://www.bio.davidson.edu/courses/genomics/chip/chip.html
• Data
  – Matrix

What is Database?
• A collection of data
  – Structured
  – Searchable (index)
  – Cross-referenced (hyperlinks): link with other DBs
• Access, updating, information insertion, information deletion
• Data storage management: flat files, relational database

Bio Databases
• Factual database
  – Sequence
  – Gene
  – Protein
  – Transcription factor
• Knowledgebase
  – Gene Ontology
  – Pathway
  – OMIM (Disease)
• Experiment database
  – GEO (Gene Expression Omnibus)
  – ArrayExpress

What I have done in SNUBI?

Microarray Analysis
• Data types: gene expression, SNP, copy number
• Basic analysis procedure: detect DEGs, Gene Ontology enrichment, pathway enrichment, classification

Copy number detection
• Copy number is proportional to hybridization intensity
• Examine intensity ratios with respect to a reference genome
• Change in intensity ratio: duplication / deletion

Analysis of copy number data
[Figure: heatmap of inferred copy number, showing amplification and deletion]

ChromoViz-web (Poster, ISMB 2006)
• Multimodal visualization of gene expression data onto chromosomes using scalable vector graphics
• http://xperanto.snubi.org:8080/ChromoViz/

Identify copy number aberrant regions using gene expression data (Poster, ECCB 2008)
• Detect Differentially Expressed Genes (DEGs)
  – T-test
• Identify putative regions of interest (ROIs) affected by genetic changes using the expression profile
  – Due to the spatial nature of the mechanism of amplification, genes that occur closer to one another are more likely to be included in the same amplicon
  – The hypergeometric distribution can be used
    • i: number of DEGs within window
    • A: number of DEGs
    • n: number of genes within window
    • N: number of genes

Results
[Figure: Chromosome 1]

Results
• Areas of amplification and deletion identified in human tumors are strong candidates to contain genes important for cancer development and progression
• Find an oncogene
  – (ex. REL/BCL11A gene loci: B-cell CLL/lymphoma 11A)

Integration drug databases (Project)
• Problem: existing drug databases in Korea are not adequate for the pharmacopoeia
  – Ex. Korean Pharmaceutical Association, Korea Food and Drug Administration (KFDA), Health Insurance Review and Assessment Service (HIRA), KIMS, Druginfor, etc.
• Need to integrate the existing drug databases under a new schema
• http://snubi.org/~shats99/kma

Korean SNP Database
• Conference presentation
  – 07/2008 (Invited) “Application of Pharmacogenomics: HapMap and Korean polymorphism database,” 2008 Asian Institute in Statistical Genetics and Genomics, Kyunghee University, Seoul, Korea
• Whole genome association studies using high-density DNA oligonucleotide arrays
• Korean SNP data
  – 200 samples
• Need to handle large-scale SNP data systematically
• Create a Korean reference polymorphism database

Work flow
[Flowchart, in extracted order]
• 500K SNP data: 200 Korean samples
• Sample collection
• Automated genotype calls (RLMM, MAMS, Gtype, BRLMM)
• Quality control
• Calc. allele frequency, genotype frequency, etc. from genotype call data
• Calc. pairwise linkage disequilibrium: LD statistics (D’, r^2)
• DB: store genotype, allele freq., HW equilibrium, etc. and phenotype data into SNP DB
• HWE test, confidence score, multiple testing correction, Snipper-HD
• Annotation: chr. pos., gene, region, AA change, etc.
• View, search SNPs via Web interface

1. Access the SNP DB
• http://kprn.snubi.org/snpchip
• Log in
  – ID: kprn
  – Password: kprnsnu12

2. Filters
Search Page
Filtering condition
• MAF; is the SNP non-monomorphic?
• 3’UTR, 5’UTR, exon, intron, synonymous/non-synonymous SNP
• Specific SNPs only
• Search by chromosome, physical position, cytoband, gene
Output
• Genotype, frequency, linkage disequilibrium

3. Genotype: Filter page
• MAF > 0.20
• Exclude monomorphic SNPs
• Specific position
• Output: GENOTYPE
• Click Submit: it will take some time!

3. Genotype: Result
• Click for annotation
• Genotype info

3. Genotype: Annotation
• SNP annotation (chr, physical position, cytoband, feature, allele, associated gene; allele freq. in Japanese, Chinese, Yoruba, Caucasian; heterozygosity …)
• Gene-related information annotation (gene, protein, protein family, enzyme and pathway info.)
• Click for annotation of a specific gene: using GRIP

6. Haploview
• Click for Haploview

Xperanto-SNP (Poster, ISMB 2008)
• A web-based integrated system for SNP data management and analysis
• http://kprn.snubi.org/xperanto_SNP

Developing a database for integrative genomics (Poster, Pharmacogenomics & Personalized Medicine, 2009)
• Genetical genomics
  – Measure the influence of genetic variation on gene expression. (Williams et al. Nature 2006)
  – Identifying chromosomal loci that control the level of expression of a particular gene. (Schadt et al. Nature 2005)
  – Studying the genetic basis of gene expression. (Li et al. Human Molecular Genetics 2005)
• There is no comprehensive database for multi-dimensional genomic data from single individuals
• A database for genetical genomics can be extended to other types of genomic data

Adrsnp (Project)
• http://kprn.snubi.org/adrsnp
• In drug therapy, even when the same drug is given to patients with the same disease, some patients suffer discomfort or even die because of unexpected adverse drug reactions
• So far there is no method for predicting which patients will develop such adverse drug reactions
• Developing methods to select drugs or adjust doses according to each individual's genomic characteristics is an urgent task
• This study analyzes genetic differences among patients who showed adverse drug reactions and, on that basis, develops a technique for screening patients who are likely to develop them

The effects of copy number variation on classical genetic studies
• Objective
  – Identify the effects of copy number variation on traditional genetic studies such as
    • Linkage studies
    • Genome-wide association studies
• Error checking
  – Hardy-Weinberg Equilibrium test
  – Mendelian Inconsistency test

Materials and Methods
• Data
  – Phenotypes
    • Expression level of genes in lymphoblastoid cells
    • For 3,554 of the 8,500 genes tested
  – Genotypes
    • Genotypes of 2,882 autosomal and X-linked SNPs of members of the 14 CEPH Utah families
• Hardy-Weinberg Disequilibrium test (p-value < 0.05)
  – Pearson’s chi-square test
  – Fisher’s exact test

Materials and Methods
• Mendelian Inconsistency test
  – PedStat (Abecasis et al. 2002)
• Public Copy Number Variation data localization
  – Database of Genomic Variants (http://projects.tcag.ca/variation/)
  – Total entries: 6559
    • CNVs: 6482
    • Inversions: 77
  – Localize annotations for SNPs for convenient queries
• Search for SNPs within Copy Number Variation regions

Results
• Genotype missing data
  – Plot a grid showing which genotypes are missing

Results
• Hardy-Weinberg Disequilibrium test

Table 1. Result of Hardy-Weinberg Equilibrium test and search for SNPs in CNV regions (p-value < 0.05)

  Test         N of SNPs    N of SNPs in CNV regions
  HWE.exact    408          72
  HWE.chisq    440          76

• Mendelian Inconsistency test
  – 14 of the 63 distinct SNPs with Mendelian inconsistencies are found within copy number variation regions

Results
• Morley et al.’s result (2004)

Structural insertion/deletion variation in IRF5 is associated with a risk haplotype and defines the precise IRF5 isoforms expressed in systemic lupus erythematosus (Kozyrev et al. ARTHRITIS & RHEUMATISM, 2007)
• Target gene: IRF5
  – Encodes a member of the interferon regulatory factor (IRF) family
  – A group of transcription factors with diverse roles
    • Virus-mediated activation of interferon, modulation of cell growth, differentiation, apoptosis, and immune system activity

Rare disease knowledge base (Project)
• Useful to all clinicians regardless of availability of molecular genetic testing
• Provides non-expert clinicians with information on the diagnosis, management and genetic counseling of patients with inherited disorders and their families
• Expert-authored, peer-reviewed, updated regularly
• Disease descriptions focused on use of currently available molecular genetic testing in diagnosis, management, and genetic counseling

TMA-TAB
• The importance of tissue microarrays (TMA) as clinical validation tools for cDNA microarray results is increasing, whereas researchers are still suffering from TMA data management issues
• We developed TMA-TAB, a spreadsheet-based data format for TMA data submission to the TMA-OM supportive TMA database system

Research Plan

Research interests
• Integration and integrative analysis with multidimensional genomic data
  – SNP, copy number, LOH, gene expression, miRNA, methylation, exon, sequence
• Why important?
Biological Organization
[Figure: levels of biological organization. TFbs and genes (TF binding, SNP, methylation, CNV, LOH, deletion) -> transcription (alternative splicing, expression; mRNA, microRNA) -> translation (post modification, glucosylation, phosphorylation) -> protein / TF function -> phenotype]
TF: transcription factor
TFbs: transcription factor binding site

What is “Integrative genomics”?
• Versus traditional research approaches
[Figure: normal vs. patient comparison (G/A SNP, methylation, miRNA)]
• Mode-of-action-based research
• Identifying the functional impact of genomic alteration
• Integration with methylation, expression, and copy number aberration

The Cancer Genome Atlas (TCGA)
• Mission
  – The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale sequencing
• Goal
  – To improve our ability to diagnose, treat and prevent cancer
  – A pilot project developed and tested the research framework needed to systematically explore the entire spectrum of genomic changes involved in human cancer

The Cancer Genome Atlas
• 1981 discovery of a cancer-promoting version of a human gene, known as an oncogene
• Cancer is caused primarily by mutations in specific genes
• Mutations disrupt biological pathways in ways that result in uncontrolled cell replication, or growth
• TCGA aims to find all mutations that occur with a frequency of 5% or more for each tumor type

TCGA pilot project
• Focus on three selected cancer types
  – Serous cystadenocarcinoma (ovarian)
  – Squamous carcinoma (lung)
  – Glioblastoma multiforme (brain)
• 500 samples per tumor type

TCGA GBM: Center Overview
[Figure: glioblastoma samples distributed to the TCGA centers (Broad/DFCI, Harvard, LBNL, MSKCC, JHU/USC, Stanford, UNC; sequencing at Broad, WU, Baylor). Platforms: SNP 6.0, HTA, aCGH, exon array, GoldenGate, Infinium, 2-color arrays, PCR > ABI. Data types: copy number, RNA expression, methylation, miRNA expression, somatic mutations]

TCGA data
• How to integrate?
TCGA Research Network (2008), Nature

The second page of TCGA project

TCGA provides us with many challenges
• Very large, noisy datasets
• Integration of different data types
• Complex interactions within and between different data types
• Future genomics data sets will increase in size as new technologies become available

Key questions posed at start of project
• Can samples of adequate quality and quantity be assembled?
• How sensitive, specific and comparable are current platforms?
• How can diverse data sets be integrated, and what can be learned from integration?
• Can we identify new genes associated with cancer types?
• Can we identify new subtypes of cancer?
• Does new knowledge suggest therapeutic implications?
Francis S. Collins et al., 2007

Data release
• Data Levels I and II correspond to raw and processed data, respectively, for each sample
• Level III data are the output of basic analyses of Level I/II data, such as mutational calls of sequenced genes, copy number and LOH calls of genomic regions of aberrations, and expression level of a gene for each sample
• Level IV data represent interpretations of the data, such as what genes are significantly mutated, or altered in copy number, DNA methylation, or expression across multiple samples and data types
• For protection of patient privacy, access to Level I and/or II data for certain platforms (e.g. SNP genotyping) or data types (e.g. germ-line mutations) is restricted to qualified researchers and requires approval of a TCGA Data Access Committee

Retrieving TCGA data
• Download TCGA data with open access
  – ftp
    • ftp://ftp1.nci.nih.gov/tcga/
  – Search by Archive
    • Data Portal
  – Search by Sample
    • Data Access Matrix

Retrieving available TCGA Data: Done
• Time: about 10 days
• Size: about 225 GB

Mapping to common object for queries
[Figure: probe-by-sample data matrices (samples S1, S2, …, SN) for SNP (SNP_1 … SNP_500k), expression (affy_1 … affy_21860), exon (exon_1 … exon_32000), miRNA (miRNA_1 … miRNA_1254), methylation (methy_1 … methy_1505), and CGH (clone_1 … clone_243430), each mapped to the common objects gene (Gene_1 … Gene_M, e.g. ERBB2) and chromosome position (e.g. Chr17 q36.1)]

Identify common unit for integration with multidimensional genomic data
• Copy Number: regions with copy number changes
  – Position, sequence (clone), genes belonging to regions
• LOH: LOH region
  – Position, genes belonging to regions
• SNP: genotypes
  – Position, flanking sequence (16 bases on each side of the SNP), annotated gene (5UTR, 3UTR, intron, exon, upstream, downstream)
• DNA Methylation: methylated sites (hyper- / hypomethylation)
  – Position, sequence, genes belonging to regions (promoter)
• Expression-Exon: alternative splicing, differential expression of each exon within a gene
  – Position, sequence, gene (within)
• Expression-Gene: differentially expressed genes
• Expression-miRNA: differentially expressed miRNAs
  – Position, sequence (miRNA), target genes
• Common object
  – Physical position
  – Gene
  – Sequence

Clinical data
• Samples: about 220
  – Statistics: not yet
• Clinical data
  – Cancer and Normal (15)

State of the art

Pathway Analysis in GBM
[Figure: frequent genetic alterations (mutation, amplification, homozygous deletion) in three GBM signaling pathways. RTK/RAS/PI-3K signaling (EGFR, ERBB2, PDGFRA, MET, NF-1, RAS, PI-3K Class I, PTEN, AKT, FOXO; proliferation, survival, translation): 86%. P53 signaling (CDKN2A, MDM2, MDM4, TP53; senescence, apoptosis): 86%. RB signaling (CDKN2A, CDKN2B, CDKN2C, CDK4, CDK6, CCND2, RB1; G1/S progression): 77%]
TCGA Research Network (2008), Nature

Databasing TCGA data
• Multiple types of annotation files -> database
  – Experiment data (e.g. expression value)
    • Level 1, 2, and 3
  – Annotation (each platform)
  – ADF files (same genome build)
• Theoretically, queries are possible
  – Select all data with level 3 where gene symbol is ‘ERBB2’
  – Select expression, CGH, methylation with level 2 where chromosome position is ‘17q36.1’
  – Pathway?

Upload new platforms and experiment data
• Probe
  – Gene expression: 22277
  – miRNA: 12033
  – Array CGH: 243430
• Column-wise queries
• Row-wise queries

Integration of multidimensional genomic data
• Goal: develop a repository system to organize and mine multiple types of genomics data
• Motivation
  – Growing number of multi-type datasets produced (ovarian, lung cancer)
  – Data cannot be readily analyzed (due to complexity of multiple types of data)
  – ‘Omics’ repository
• Vision
  – Organize data to reflect their biological interdependencies
  – Query subsets of data (e.g. gene or chromosome position)
  – Provide meaningful biological discoveries with integrative analysis using different types of level 4 data
• How?
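
The two example queries listed under "Databasing TCGA data" (by gene symbol and by chromosome position) could look roughly like the sketch below. It is only an illustration: the two-table schema, the table and column names, and the toy values are assumptions made here, not the schema actually used for the TCGA data. The cytoband string ‘17q36.1’ is copied from the slides as a plain lookup key.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE annotation (      -- one row per probe (assumed layout)
            probe_id    TEXT PRIMARY KEY,
            platform    TEXT,          -- e.g. expression, CGH, methylation
            gene_symbol TEXT,
            cytoband    TEXT
        );
        CREATE TABLE measurement (     -- one row per probe, sample and level
            probe_id   TEXT REFERENCES annotation(probe_id),
            sample_id  TEXT,
            data_level INTEGER,
            value      REAL
        );
        """
    )
    # Toy rows for illustration only; these are not real TCGA values.
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?)",
        [
            ("affy_1", "expression", "ERBB2", "17q36.1"),
            ("clone_1", "CGH", "ERBB2", "17q36.1"),
            ("methy_1", "methylation", "TP53", "17p13.1"),
        ],
    )
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?, ?)",
        [
            ("affy_1", "S1", 3, 2.5),
            ("affy_1", "S1", 2, 2.4),
            ("clone_1", "S1", 3, 0.8),
            ("clone_1", "S1", 2, 0.7),
            ("methy_1", "S1", 3, 0.1),
        ],
    )
    return conn


def query_by_gene(conn, gene_symbol, level):
    """Select all data of one level where the gene symbol matches."""
    sql = """
        SELECT a.platform, m.sample_id, m.value
        FROM measurement AS m JOIN annotation AS a USING (probe_id)
        WHERE a.gene_symbol = ? AND m.data_level = ?
        ORDER BY a.platform
    """
    return conn.execute(sql, (gene_symbol, level)).fetchall()


def query_by_cytoband(conn, cytoband, level, platforms):
    """Select chosen platforms of one level where the cytoband matches."""
    marks = ",".join("?" for _ in platforms)
    sql = f"""
        SELECT a.platform, m.sample_id, m.value
        FROM measurement AS m JOIN annotation AS a USING (probe_id)
        WHERE a.cytoband = ? AND m.data_level = ? AND a.platform IN ({marks})
        ORDER BY a.platform
    """
    return conn.execute(sql, (cytoband, level, *platforms)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(query_by_gene(db, "ERBB2", 3))
    print(query_by_cytoband(db, "17q36.1", 2, ["expression", "CGH", "methylation"]))
```

Keeping the annotation (gene, position) separate from the measurements is one way to let every platform share the same common objects, so the same query works across expression, CGH and methylation data.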
Specific Goal
• Problem: classification of tumor subtypes using multidimensional genomic data
• Kernel-level integration: possible?

Introduction
• Kernel methods are a powerful class of methods for pattern analysis
  – Reliability, accuracy, and computational efficiency
• Kernel methods have the capability to handle a very wide range of data types (sequences, vectors, networks, phylogenetic trees, and so on)
  – The ability of kernel methods to deal with complex structured data makes them ideally positioned for heterogeneous data integration (at the level of kernel matrices)
• We propose a kernel-based approach for clinical decision support in which many genome-wide sources are combined
• We apply this framework to two cancer cases, namely, a rectal cancer data set containing microarray and proteomics data and a prostate cancer data set containing microarray and genomic data
• For both cancer sites the prediction of all outcomes improved when more than one genome-wide data set was considered

Data set
• Cetuximab
• WHEELER: grade of tumor regression
• pN-STAGE: lymph node stage at surgery
• CRM (circumferential resection margins): knowledge of the CRM before therapy provides important prognostic information for local recurrence and development of distant metastasis

Model building
• Kernel methods and weighted least squares support vector machine
  – Single data set
  – Manual integration of data over time
  – Multiple omics integration approach

Results in rectal cancer

Thank You
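
Backup: the kernel-level integration raised under "Specific Goal" can be illustrated with a minimal sketch. The slides do not say how the kernel matrices are combined, so the weighted sum used here, the linear kernel, the cosine normalization, and all function names are assumptions for illustration only. Each data type yields one sample-by-sample kernel matrix, and the combined matrix would then be passed to a kernel classifier.

```python
import math


def linear_kernel(rows):
    """Gram matrix K[i][j] = <x_i, x_j>; each row is one sample's features."""
    return [[sum(a * b for a, b in zip(x, y)) for y in rows] for x in rows]


def normalize_kernel(K):
    """Cosine-normalize K so every sample has self-similarity 1.

    Assumes no sample is the all-zero vector (K[i][i] > 0).
    """
    n = len(K)
    d = [math.sqrt(K[i][i]) for i in range(n)]
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices computed over the same samples.

    With non-negative weights the sum of valid kernels is again a valid
    kernel, which is what allows integration at the kernel-matrix level.
    """
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Toy data: three samples measured on two hypothetical platforms.
    expression = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    copy_number = [[2.0], [2.0], [1.0]]
    kernels = [
        normalize_kernel(linear_kernel(expression)),
        normalize_kernel(linear_kernel(copy_number)),
    ]
    combined = combine_kernels(kernels, [0.5, 0.5])
    for row in combined:
        print([round(v, 3) for v in row])
```

Normalizing each kernel before summing keeps a platform with many probes from dominating one with few; the equal weights are arbitrary and would normally be tuned.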