Integrated Genomic and Proteomic Databases
Assistant Prof. Chih-Chin Liu (劉志俊)
Department of Computer Science and Information Engineering, Chung Hua University
July 2008

Outline
* Bioinformatics: a database perspective
* The four major data types in bioinformatics
* Biological database design and UML
* Integrated biological database: UniBio
* Pig / native chicken genome database
* Proteome database

When Biology Meets Informatics
* Biology: molecular genetics, molecular biology, biochemistry, cell biology, protein science, immunology
* Informatics: programming languages, data structures, algorithms, databases, parallel processing, data mining
* Where the two meet: bioinformatics

Genome, Transcriptome, Proteome, Metabolome
* Genome
* Transcriptome: the complement of expressed genes found in a particular cell or tissue.
* Proteome: the complement of proteins found in a particular cell or tissue.
* Metabolome: the assembly of substrates, metabolites, and other small molecules present in a population of cells.

More "-omes"
* Structurome (∑ structures)
* SNPome (∑ SNPs)
* Literaturome (∑ literature)
* Transductome (∑ signal transductions)
* Pathwayome (∑ pathways)
* Diseasome (∑ genetic diseases)
* Diagram labels: "-ome", "database"

Research Issues in Biological Databases
* Data modeling: how to store and represent biological data
* Data retrieval: how to retrieve similar biological objects
* Data mining: how to find the rules behind biological data
* Simulation: pathway simulation, virtual cell, virtual life

New Data Types in Bio-Databases
* Large strings: DNA sequences, protein sequences
* Biological images: 2D gels, microarray images
* 3D structures: proteins, compounds
* Networks: pathways

New Data Types in Bio-Databases: Large Strings: DNA Sequences
* The complete sequence of human chromosome 1 is 245,564,334 bp long, the longest single sequence record in GenBank.

New Data Types in Bio-Databases: Large Strings: Protein Sequences
* PIR: I38344, titin, cardiac muscle [validated], human
* The longest protein sequence in the PIR database: 26,926 amino acids

New Data Types in Bio-Databases: Images: Microarray (Stanford Microarray Database)

New Data Types in Bio-Databases: Images: 1D-Gel, 2D-Gel
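
To make the "large strings" data type concrete, the sketch below packs a DNA sequence at 2 bits per base and estimates the packed size of a sequence as long as the chromosome 1 record quoted above. It is an illustration only: the function names `pack`, `unpack`, and `packed_size` are hypothetical, the sample sequence is made up, and the encoding assumes the sequence contains only A, C, G, and T (real records also contain N and other ambiguity codes). It is not a description of how GenBank or UniBio stores sequences.

```python
# 2-bit encoding for the four unambiguous DNA bases (assumption: no N or IUPAC codes).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def packed_size(num_bases: int) -> int:
    """Bytes needed to hold num_bases at 4 bases per byte (rounded up)."""
    return (num_bases + 3) // 4


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte, lowest bits first."""
    out = bytearray(packed_size(len(seq)))
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


seq = "ACGTTGCAAC"            # made-up 10-base example
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), len(packed))   # 10 3

CHR1_BP = 245_564_334          # length quoted on the slide
print(packed_size(CHR1_BP))    # 61391084
```

At one byte per base the same record needs 245,564,334 bytes, so either way a single value is far larger than a typical text column, which is why such sequences count as a distinct data type.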

New Data Types in Bio-Databases: 3D Structures: Chemical Compound

New Data Types in Bio-Databases: 3D Structures

New Data Types in Bio-Databases: 3D Structures (PDB records)
  ATOM  1  N   VAL  1   -4.004  15.224  13.636  1.00  32.64  N
  ATOM  2  CA  VAL  1   -3.526  15.758  14.900  1.00  18.42  C
  ATOM  3  C   VAL  1   -2.662  14.733  15.628  1.00  17.06  C
  ATOM  4  O   VAL  1   -3.053  13.569  15.714  1.00  18.61  O
* Each ATOM record is paired with an ANISOU record (anisotropic temperature factors) for the same atom.

New Data Types in Bio-Databases: Network: Pathways

Database Design
* Conceptual database design
  * Class diagram (ER model, UML class diagram)
  * Entities (classes), relationships, attributes
* Logical database design
  * Relational schema
  * Normalization, ER-to-relational data model mapping
* Physical database design
  * Implementation (e.g. Oracle, MySQL, SQL Server)
  * Indexes and storage methods

The UniBio Project
* Completeness: collect all downloadable biology-related databases.
* Integration: all data are cross-referenced, so logically there is a single database.
* Chinese localization: provide corresponding Chinese-language data wherever possible to lower the learning barrier.

The UniBio Project (workflow diagram)
* Bioinformatics websites: download data in its original format
* Perl: adjust the bioinformatics data formats
* UML: biological database design
* MySQL and phpMyAdmin: biological database construction
* Result: the biological database

The UniBio Project: Development Environment
* RedHat Linux 9.0 (free, stable, high performance)
* MySQL (free, the fastest-running database)
* Apache (free, stable, powerful, high performance)
* Perl (free, the main programming language of bioinformatics, concise code, cross-platform)
* PHP (free, many built-in functions, easy to write, cross-platform)
* C/C++ (free, long history, powerful)
* Java (free, can be displayed on the Web, cross-platform)

The UniBio Project
* http://140.126.11.172/

Genome Data Management
* Sampling: Sample Database
* Cloning: Clone Database
* Sequencing: cDNA Database
* BLASTing: BLAST Report Database
* Submitting: GenBank submission files
* External databases: GenBank, EMBL, DDBJ, RefSeq, TIGR TGI, UniGene

Functional Genome Data Management
* cDNA Database, MicroArray Database
* Gene expression profiling: Gene Expression Profile Database
* In silico simulation (?): Simulation Result Database
* In situ verification (?): Verification Report Database
* In vivo testing (?): new drug ($$$)
* External databases: KEGG, Enzyme

Pig / Native Chicken Genome Database (four screenshot slides)

Pig / Native Chicken Genome Database: BLAST Results (GenBank)

Pig / Native Chicken Genome Database: dbEST Submission
  TYPE: EST
  STATUS: New
  CONT_NAME: Wen-Chuan Lee
  CITATION: Porcine testis EST project
  LIBRARY: Porcine testis cDNA library I
  EST#: PDUts1001A02
  CLONE: PDUts1001A02
  SOURCE: Division of Biotechnology, Animal Technology Institute Taiwan
  ...
  SEQ_PRIMER: T7 promoter primer
  HIQUAL_START: 1
  HIQUAL_STOP: 306
  DNA_TYPE: cDNA
  PUBLIC: 12/31/2005
  SEQUENCE:
  CTCAACCATTGATGGAGCATATTTCTCTATTTTTAGTAGATCTAGAAAAAAATAGTATGA
  AGTTAGATATCCTAAGAAGAGCAATTACCGCTATTTCATTATATTTTGCTTAAAAAAAAA
  CAAGATTATTTTAATGGATATATCAAATCCTCGTGCACGATGTACAAAAATTAAAGCACG
  TCTGGGGCCACAAAGCACATCTCGATGAACTCTGAATAGATAGTACCAAGCAATTAGGTT
  ATAAATTAATACTTTACAAGAGAATTTAGAAAATTTCATAGTTGCCCAGTGTAAGCTACC
  TTTCTA
  ||

Integrated Proteomic Database
* Databases shown: SWISS2DPAGE, MassSpec, Siena2DPAGE, ATIT2DPAGE, PMMA2DPAGE, Plasma2DPAGE, UniProt, RESID, Dali/FSSP, Pfam, SWISS-PROT, PROSITE, PIR, MIPS/JIPID, PDB, CATH, SCOP, PRINTS, BioCyc, KEGG, BLOCKS, EMOTIF, ENZYME, BRENDA, WIT, LIGAND

2D Gel Electrophoresis
* Separation by molecular weight (MW)
* Separation by charge (pI)
* Molecular weight markers

Exploring Diseases
* Detect the spots that changed.
* Identify which proteins they are by PMF (peptide mass fingerprinting).
* They could be candidates for drug screening.

2D-PAGE Example
* 2D123456_1.tif

2D-PAGE Spot Examples
* 2D123456_1.out
  SSP    MR            PI           quantity (TA20040301PH4~7)
  0105    14.000000     0.940249     17718.58
  0304    20.000000     0.100000      3015.93
  0409    27.025288     2.881626      4703.69
  0410    28.200542     3.015601      7963.92
  0411    26.410089     3.035875      5168.19
  0510    30.000000     0.100000       568.17
  0610    45.000000    -1.000000       256.19
  0708    70.379211     4.008969     12372.92
  0709    60.177605     4.017597     60490.97
  0710    71.341202     4.018401     20098.13
  0711    68.146568     4.018714     25632.64
  0712    57.148594     4.023514     73912.91
  0713    66.000000    -1.000000       940.28
  0902   116.400002     4.000000    160499.94

Gel Database
* A Gel UML class diagram for modeling 2D-PAGE images and their spots
* Database: GelDB; Date: 2004/03/05; DBA: Chih-Chin Liu
* Classes and attributes:
  * Sample: Sample_ID, Description, Date, Qty, Method, Prepare
  * SampleType: Species, Organ, Tissue, Sex, Age, Genotype, Phenotype
  * Gel: Gel_ID, Expt_No, ImageFile, IPG_Strip, pH_Low, pH_High, Linear, pI_Low, pI_High, MW_Low, MW_High, Complexity, Property
  * Spot: SSP, MW, PI, Qty
* Association shown: electrophoresis, with multiplicity 1 to 1..n

MassSpec Database
* Samples
* MassSpec analysis results (.pkl)
* Mascot configuration
* Mascot query
* Mascot result (.dat)
* Mascot protein reports
* Mascot peptide reports

MassSpec Sample

MassSpec Instruments

Mass Spectrum Example
* MIxxxxxx.pkl

Mascot Query Example

Mascot Search Result
* Fxxxxxx.dat
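
Returning to the spot table on the 2D-PAGE Spot Examples slide: before spots can be loaded into a Spot table, an export like 2D123456_1.out has to be parsed into records. The sketch below shows one way to do that. The exact layout of the .out file is not given on the slide, so the two quoted header rows and whitespace-separated columns in `SAMPLE` are assumptions, and `Spot` and `parse_spots` are hypothetical names; only the three data rows are copied from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Spot:
    ssp: str        # spot number, kept as text so leading zeros survive
    mr: float       # "MR" column from the slide
    pi: float       # "PI" column from the slide
    quantity: float


def parse_spots(text: str) -> List[Spot]:
    """Parse whitespace-separated spot rows, skipping quoted header rows."""
    spots = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: 4 columns; header cells are wrapped in double quotes.
        if len(fields) != 4 or fields[0].startswith('"'):
            continue
        ssp, mr, pi, qty = fields
        spots.append(Spot(ssp, float(mr), float(pi), float(qty)))
    return spots


# Assumed file layout; the three data rows are taken from the slide.
SAMPLE = '''"SSP" "MR" "PI" "TA20040301PH4~7"
"" "" "" "quantity"
0105 14.000000 0.940249 17718.58
0709 60.177605 4.017597 60490.97
0902 116.400002 4.000000 160499.94
'''

spots = parse_spots(SAMPLE)
largest = max(spots, key=lambda s: s.quantity)
print(largest.ssp, largest.quantity)   # 0902 160499.94
```

Keeping SSP as a string is a deliberate choice here: values such as 0105 would lose their leading zero if stored as integers, and the spot number acts as an identifier rather than a quantity.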

MassSpec Database
* A MassSpec UML class diagram for modeling Mascot search results
* Database: MassSpec; Date: 2003/12/20; DBA: Chih-Chin Liu
* Classes and attributes:
  * 2D_PAGE_Spot
  * MassSpecResult: FileName, FileType, Instrument
  * PeakList: MassMin, MassMax, IntMin, IntMax, NumPeaks
  * Peak: PeakMass, PeakIntensity
  * MascotQuery: UserName, UserEmail, TaxonomyFilter, CleaveEnzyme, MissedCleave, StaticMods, ICAT, PeptideTol, PeptideTolUnit, FragmentTol, FragmentTolUnit, ChargeState, MassType, TypeOfSearch, PrecursorMass, CTermMass, NTermMass
  * MascotResult: FileName, NumHits, ExecTime, ObservedMass, ObservedCharge, ObservedMrValue, RepeatSearchString
  * MascotConfig: FastaVer, MascotVer, MSParserVer, Database, NumSeqs, NumResidues
  * MS_Protein: Accession, Description, Score, Mass, Frame, Coverage, NumPeptides
  * MS_Peptide: Query, Rank, PrettyRank, Matched, MissedCleave, MrCalc, Delta, Observed, Charge, MrExp, IonsMatched, PeptideStr, PeaksUsed1, VarModsStr, VarMods, IonsScore, SeriesUsed, PeaksUsed2, PeaksUsed3, PeptideIdTh, HomologyTh, ProbOfPep
* Associations shown:
  * 2D_PAGE_Spot (1) associated_with MassSpecResult (0..*)
  * contain (1 to 1), for PeakList
  * query (1 to 1..n), for MascotQuery
  * config (1), for MascotConfig
  * hit, for MS_Protein

Flowchart
* *.txt, *.pkl: input to Mascot Search (PMF)
* Mascot Search (PMF): produces *.dat
* Mascot Parser: reads *.dat and loads the MassSpec Database

Proteome Data Management
* Sample: key-in
* 2D-PAGE (*.tiff): upload
* Spot (*.out): upload/parsing
* Mass spectrum (*.pkl): upload
* Protein/peptide report (*.dat): upload/parsing
* Targets: Gel Database, MassSpec Database

Proteome Database (four screenshot slides)
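
The step from the UML class diagrams to a relational schema (the "ER-to-relational mapping" of the Database Design slide) can be sketched as follows. This is a minimal illustration, not the actual GelDB or MassSpec schema: it uses Python's built-in sqlite3 in place of MySQL, keeps only a few attributes per class, and adds surrogate keys (`Spot_ID`, `Result_ID`) and foreign-key columns that do not appear on the slides. The Gel-to-Spot link is assumed from the diagram title; the Spot-to-MassSpecResult link follows the "1 associated_with 0..*" association. File names and values other than 2D123456_1.tif and the 0709 spot row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # in-memory stand-in for MySQL
conn.execute("PRAGMA foreign_keys = ON")    # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE Gel (
    Gel_ID    INTEGER PRIMARY KEY,
    Expt_No   TEXT,
    ImageFile TEXT
);
CREATE TABLE Spot (
    Spot_ID INTEGER PRIMARY KEY,            -- surrogate key (assumption)
    Gel_ID  INTEGER NOT NULL REFERENCES Gel(Gel_ID),
    SSP     TEXT NOT NULL,
    MW      REAL,
    PI      REAL,
    Qty     REAL,
    UNIQUE (Gel_ID, SSP)                    -- one SSP number per gel (assumption)
);
CREATE TABLE MassSpecResult (
    Result_ID  INTEGER PRIMARY KEY,         -- surrogate key (assumption)
    Spot_ID    INTEGER NOT NULL REFERENCES Spot(Spot_ID),
    FileName   TEXT,
    FileType   TEXT,
    Instrument TEXT
);
""")

conn.execute(
    "INSERT INTO Gel (Gel_ID, Expt_No, ImageFile) VALUES (1, 'E1', '2D123456_1.tif')"
)
conn.execute(
    "INSERT INTO Spot (Spot_ID, Gel_ID, SSP, MW, PI, Qty) "
    "VALUES (1, 1, '0709', 60.177605, 4.017597, 60490.97)"
)
# Two hypothetical spectra for the same spot: the 0..* side of associated_with.
conn.executemany(
    "INSERT INTO MassSpecResult (Spot_ID, FileName, FileType, Instrument) "
    "VALUES (?, ?, ?, ?)",
    [(1, "MI000001.pkl", "pkl", "instrument-A"),
     (1, "MI000002.pkl", "pkl", "instrument-A")],
)

# Count mass-spec results per spot, joined back to the gel image.
rows = conn.execute("""
    SELECT g.ImageFile, s.SSP, COUNT(r.Result_ID)
    FROM Gel g
    JOIN Spot s ON s.Gel_ID = g.Gel_ID
    LEFT JOIN MassSpecResult r ON r.Spot_ID = s.Spot_ID
    GROUP BY s.Spot_ID
""").fetchall()
print(rows)   # [('2D123456_1.tif', '0709', 2)]
```

The LEFT JOIN keeps spots that have no mass-spec result yet, which matches the 0..* multiplicity: a spot may exist in the gel database before any spectrum has been uploaded for it.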