Integrated Genomic and Proteomic Databases
Chih-Chin Liu (劉志俊)
Department of Computer Science and Information Engineering, Chung Hua University
July 2008
Outline

Bioinformatics: A Database Perspective

The Four Major Data Types in Bioinformatics

Biological Database Design and UML

An Integrated Biological Database: UniBio

Pig / Native Chicken Genome Database

Proteome Database
Assistant Prof. Chih-Chin Liu
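When Biology Meets Informatics
Biology: molecular genetics, molecular biology, biochemistry, cell biology, protein science, immunology
Informatics: programming languages, data structures, algorithms, databases, parallel processing, data mining
Bioinformatics sits where the two meet.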
Genome, Transcriptome, Proteome, Metabolome

Genome: The complete set of genetic material (DNA) of an organism.

Transcriptome: The complement of expressed genes that are found in a particular cell or tissue.

Proteome: The complement of proteins that are found in a particular cell or tissue.

Metabolome: The assembly of substrates, metabolites, and other small molecules that are present in a population of cells.
More "-omes"

Structurome (∑ Structures)

SNPome (∑ SNPs)

Literaturome (∑ Literature)

Transductome (∑ Signal Transductions)

Pathwayome (∑ Pathways)

Diseasome (∑ Genetic Diseases)

Each "-ome" maps to a database.
Research Issues in Biological Databases

Data Modeling: How to store/represent biological data

Data Retrieval: How to retrieve similar biological objects

Data Mining: How to find rules behind biological data

Simulation: Pathway Simulation, Virtual Cell, Virtual Life
New Data Types in Bio-Databases

Large Strings: DNA Sequences, Protein Sequences

Biological Images: 2D Gels, Microarray Images

3D Structures: Proteins, Compounds

Network: Pathways
New Data Types in Bio-Databases

Large Strings: DNA Sequences
The complete sequence of human chromosome 1, 245,564,334 bp long, is the longest single sequence record in GenBank.
New Data Types in Bio-Databases

Large Strings: Protein Sequences
PIR: I38344 is the longest protein sequence in the PIR database: 26,926 amino acids (titin, cardiac muscle [validated], human).
New Data Types in Bio-Databases

Images: Microarray (Stanford Microarray Database)

Images: 1D-Gel, 2D-Gel

3D Structures: Chemical Compound
New Data Types in Bio-Databases

3D Structures: ATOM and ANISOU records for residue VAL 1
ATOM      1  N   VAL     1      -4.004  15.224  13.636  1.00 32.64           N
ATOM      2  CA  VAL     1      -3.526  15.758  14.900  1.00 18.42           C
ATOM      3  C   VAL     1      -2.662  14.733  15.628  1.00 17.06           C
ATOM      4  O   VAL     1      -3.053  13.569  15.714  1.00 18.61           O
[On the slide, each ATOM record is followed by an ANISOU record giving the anisotropic temperature factors of the same atom.]
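ATOM records like these are fixed-width text, so they are usually read by column position rather than by splitting on spaces. The sketch below is a minimal illustration under stated assumptions: it handles only ATOM records, uses a blank chain ID as on the slide, and simplifies atom-name alignment. The `Atom` class and the `format_atom` / `parse_atom` helpers are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def format_atom(a: Atom) -> str:
    """Write an ATOM record in the fixed PDB column layout (blank chain ID)."""
    return (
        f"ATOM  {a.serial:>5} {a.name:<4} {a.res_name:>3}  {a.res_seq:>4}    "
        f"{a.x:8.3f}{a.y:8.3f}{a.z:8.3f}{a.occupancy:6.2f}{a.b_factor:6.2f}"
        f"          {a.element:>2}"
    )


def parse_atom(line: str) -> Atom:
    """Read an ATOM record by column position (0-based slices)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


if __name__ == "__main__":
    # Values taken from the first ATOM record on the slide.
    n_atom = Atom(1, "N", "VAL", 1, -4.004, 15.224, 13.636, 1.00, 32.64, "N")
    line = format_atom(n_atom)
    print(line)
    print(parse_atom(line))
```

Storing the parsed fields as typed columns, rather than the raw line, is what makes coordinate-based queries possible in a database.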
New Data Types in Bio-Databases

Network: Pathways
Database Design

Conceptual Database Design: Class Diagram (ER Model, UML Class Diagram); Entities (Classes), Relationships, Attributes

Logical Database Design: Relational Schema; Normalization, ER to Relational Data Model Mapping

Physical Database Design: Implementation (e.g. Oracle, MySQL, SQL Server); Indexes and Storage Methods
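The UniBio Project

Completeness: collect every downloadable biology-related database

Integration: all data cross-reference one another, so that logically they form a single database

Chinese localization: provide corresponding Chinese-language data wherever possible, to lower the learning barrier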
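The UniBio Project
Data flow (diagram): bioinformatics websites → download data in original format → convert bioinformatics data formats (Perl) → biological database
Design and construction (diagram): biological database design (UML) → biological database construction (MySQL, phpMyAdmin)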
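The UniBio Project
Development Environment

RedHat Linux 9.0 (free, stable, high performance)

MySQL (free, the fastest-running database)

Apache (free, stable, powerful, high performance)

Perl (free, the main programming language of bioinformatics, concise code, cross-platform)

PHP (free, many built-in functions, easy to write, cross-platform)

C/C++ (free, long history, powerful)

Java (free, can be displayed on the Web, cross-platform)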
The UniBio Project
http://140.126.11.172/
Genome Data Management
Pipeline (diagram): Sampling → Sample Database → Cloning → Clone Database → Sequencing → cDNA Database → BLASTing → BLAST Report Database → Submitting → GenBank Submission Files
External databases: GenBank, EMBL, DDBJ, RefSeq, TIGR TGI, UniGene
Functional Genome Data Management
Pipeline (diagram): cDNA Database → Gene Expression (MicroArray Database) → Gene Expression Profile (Profile Database) → in silico Simulation ? (Simulation Result Database) → in situ Verification ? (Verification Report Database) → in vivo Testing ? → New Drug $$$
External databases: KEGG, Enzyme
Pig / Native Chicken Genome Database
BLAST Results (GenBank)
dbEST Submission
TYPE: EST
STATUS: New
CONT_NAME: Wen-Chuan Lee
CITATION:
Porcine testis EST project
LIBRARY: Porcine testis cDNA library I
EST#: PDUts1001A02
CLONE: PDUts1001A02
SOURCE: Division of Biotechnology, Animal Technology Institute Taiwan
...
SEQ_PRIMER: T7 promoter primer
HIQUAL_START: 1
HIQUAL_STOP: 306
DNA_TYPE: cDNA
PUBLIC: 12/31/2005
SEQUENCE:
CTCAACCATTGATGGAGCATATTTCTCTATTTTTAGTAGATCTAGAAAAAAATAGTATGA
AGTTAGATATCCTAAGAAGAGCAATTACCGCTATTTCATTATATTTTGCTTAAAAAAAAA
CAAGATTATTTTAATGGATATATCAAATCCTCGTGCACGATGTACAAAAATTAAAGCACG
TCTGGGGCCACAAAGCACATCTCGATGAACTCTGAATAGATAGTACCAAGCAATTAGGTT
ATAAATTAATACTTTACAAGAGAATTTAGAAAATTTCATAGTTGCCCAGTGTAAGCTACC
TTTCTA
||
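A submission file like this has to be parsed before its fields can be loaded into a database. The sketch below is one hedged way to do that, based only on what the slide shows: `KEY: value` lines, values that may continue on following lines (as with CITATION and SEQUENCE), `...` marking elided fields, and `||` ending the record. The function name `parse_dbest_record` is hypothetical, and the sample is abbreviated from the slide.

```python
import re

# A field line starts with an upper-case tag such as TYPE, EST# or HIQUAL_START.
FIELD = re.compile(r"^([A-Z][A-Z_#]*):\s*(.*)$")


def parse_dbest_record(text: str) -> dict:
    """Parse one record into a dict; stops at the '||' record terminator."""
    record = {}
    key = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":
            break
        if not line or line == "...":
            continue
        m = FIELD.match(line)
        if m:
            key = m.group(1)
            record[key] = m.group(2)
        elif key is not None:
            # Continuation line: sequence chunks are joined without spaces.
            joiner = "" if key == "SEQUENCE" else " "
            record[key] = (record[key] + joiner + line).strip()
    return record


if __name__ == "__main__":
    sample = """TYPE: EST
STATUS: New
CITATION:
Porcine testis EST project
EST#: PDUts1001A02
DNA_TYPE: cDNA
SEQUENCE:
CTCAACCATTGATGGAGCATATTTCTCTATTTTTAGTAGATCTAGAAAAAAATAGTATGA
TTTCTA
||
"""
    rec = parse_dbest_record(sample)
    print(rec["EST#"], rec["CITATION"], len(rec["SEQUENCE"]))
```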
Integrated Proteomic Database
SWISS2DPAGE
MassSpec
Siena2DPAGE
ATIT2DPAGE
PMMA2DPAGE
Plasma2DPAGE
UniProt
RESID
Dali/FSSP
Pfam
SWISS-PROT
PROSITE
PIR
MIPS/JIPID
PDB
CATH
SCOP
PRINTS
BioCyc
KEGG
BLOCKS
EMOTIF
ENZYME
BRENDA
WIT
LIGAND
2D Gel Electrophoresis
Separation by Molecular Weight (MW)
Separation by Charge (pI)
Molecular Weight Markers
Exploring Diseases
Detect the spots that changed.
Identify which proteins they are by PMF (Peptide Mass Fingerprinting).
They could be candidates for drug screening.
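The idea behind PMF can be sketched in a few lines: digest a candidate protein sequence in silico, compute the peptide masses, and count how many observed peaks they explain. This is a simplified illustration, not the Mascot scoring algorithm. It assumes trypsin (cleaving after K or R unless the next residue is P), no missed cleavages, no modifications, monoisotopic masses, and singly protonated peaks; all function names are hypothetical.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(seq: str) -> list:
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(peaks, seq: str, tol: float = 0.2) -> int:
    """Count observed [M+H]+ peaks within tol Da of any theoretical peptide."""
    theo = [peptide_mass(p) + PROTON for p in tryptic_digest(seq)]
    return sum(any(abs(pk - t) <= tol for t in theo) for pk in peaks)


if __name__ == "__main__":
    seq = "AAKRPGKC"  # toy sequence, not a real protein
    for pep in tryptic_digest(seq):
        print(pep, round(peptide_mass(pep), 4))
    print(count_matches([289.19, 457.29, 500.0], seq), "peaks matched")
```

A search engine repeats this comparison against every sequence in a database and ranks the candidates; the simple match count here stands in for that ranking.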
2D-PAGE Example
2D123456_1.tif
2D-PAGE Spot Examples
2D123456_1.out
"SSP"   "MR"          "PI"        "TA20040301PH4~7"
""      ""            ""          "quantity"
0105    14.000000     0.940249    17718.58
0304    20.000000     0.100000    3015.93
0409    27.025288     2.881626    4703.69
0410    28.200542     3.015601    7963.92
0411    26.410089     3.035875    5168.19
0510    30.000000     0.100000    568.17
0610    45.000000     -1.000000   256.19
0708    70.379211     4.008969    12372.92
0709    60.177605     4.017597    60490.97
0710    71.341202     4.018401    20098.13
0711    68.146568     4.018714    25632.64
0712    57.148594     4.023514    73912.91
0713    66.000000     -1.000000   940.28
0902    116.400002    4.000000    160499.94
Gel Database
A Gel UML Class Diagram for Modeling 2D-PAGE Images and Their Spots
Database: GelDB
Date: 2004/03/05
DBA: Chih-Chin Liu
Class SampleType: Species, Organ, Tissue, Sex, Age, Genotype, Phenotype
Class Sample: Sample_ID, Description, Date, Qty, Method
Class Gel: Gel_ID, Expt_No, ImageFile, IPG_Strip, pH_Low, pH_High, Linear, pI_Low, pI_High, MW_Low, MW_High, Complexity
Class Spot: SSP, MW, PI, Qty
Other labels on the diagram: Prepare (multiplicity 1 to 1..n), electrophoresis, Property
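Mapping a class diagram like this to a relational schema is the logical-design step. The sketch below shows one possible mapping for three of the classes. It is an assumption-laden illustration, not the actual GelDB schema: it uses Python's built-in `sqlite3` as a stand-in for MySQL, keeps only a subset of the Gel attributes, and assumes one-to-many foreign keys (Sample to Gel, Gel to Spot) and a composite key for Spot. The spot rows are taken from the 2D-PAGE spot example.

```python
import sqlite3

# Assumed mapping: each class becomes a table; associations become foreign keys.
SCHEMA = """
CREATE TABLE Sample (
    Sample_ID   INTEGER PRIMARY KEY,
    Description TEXT,
    Date        TEXT,
    Qty         REAL,
    Method      TEXT
);
CREATE TABLE Gel (
    Gel_ID    INTEGER PRIMARY KEY,
    Sample_ID INTEGER NOT NULL REFERENCES Sample(Sample_ID),
    Expt_No   TEXT,
    ImageFile TEXT,
    pH_Low    REAL,
    pH_High   REAL
);
CREATE TABLE Spot (
    Gel_ID INTEGER NOT NULL REFERENCES Gel(Gel_ID),
    SSP    TEXT NOT NULL,
    MW     REAL,
    PI     REAL,
    Qty    REAL,
    PRIMARY KEY (Gel_ID, SSP)
);
"""


def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding one sample, one gel, three spots."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Sample (Sample_ID, Description) VALUES (1, 'demo sample')")
    conn.execute(
        "INSERT INTO Gel (Gel_ID, Sample_ID, ImageFile, pH_Low, pH_High) "
        "VALUES (1, 1, '2D123456_1.tif', 4, 7)"
    )
    spots = [
        (1, "0105", 14.000000, 0.940249, 17718.58),
        (1, "0709", 60.177605, 4.017597, 60490.97),
        (1, "0902", 116.400002, 4.000000, 160499.94),
    ]
    conn.executemany("INSERT INTO Spot VALUES (?, ?, ?, ?, ?)", spots)
    conn.commit()
    return conn


def top_spots(conn: sqlite3.Connection, gel_id: int, n: int):
    """Return the n most abundant spots on a gel as (SSP, Qty) rows."""
    cur = conn.execute(
        "SELECT SSP, Qty FROM Spot WHERE Gel_ID = ? ORDER BY Qty DESC LIMIT ?",
        (gel_id, n),
    )
    return cur.fetchall()


if __name__ == "__main__":
    print(top_spots(build_db(), gel_id=1, n=2))
```

SSP numbers repeat from gel to gel, which is why the sketch keys Spot on the pair (Gel_ID, SSP) rather than on SSP alone.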
MassSpec Database

Samples

MassSpec Analysis Results (.pkl)

Mascot Configuration

Mascot Query

Mascot Result (.dat)

Mascot Protein Reports

Mascot Peptide Reports
MassSpec Sample
MassSpec Instruments
Mass Spectrum Example
MIxxxxxx.pkl
Mascot Query Example
Mascot Search Result
Fxxxxxx.dat
MassSpec Database
A MassSpec UML Class Diagram for Modeling Mascot Search Results
Database: MassSpec
Date: 2003/12/20
DBA: Chih-Chin Liu
Class 2D_PAGE_Spot
Class MassSpecResult: FileName, FileType, Instrument
Class PeakList: MassMin, MassMax, IntMin, IntMax, NumPeaks
Class Peak: PeakMass, PeakIntensity
Class MascotQuery: UserName, UserEmail, TaxonomyFilter, CleaveEnzyme, MissedCleave, StaticMods, ICAT, PeptideTol, PeptideTolUnit, FragmentTol, FragmentTolUnit, ChargeState, MassType, TypeOfSearch, PrecursorMass, CTermMass, NTermMass
Class MascotResult: FileName, NumHits, ExecTime, ObservedMass, ObservedCharge, ObservedMrValue, RepeatSearchString
Class MascotConfig: FastaVer, MascotVer, MSParserVer, Database, NumSeqs, NumResidues
Class MS_Protein: Accession, Description, Score, Mass, Frame, Coverage, NumPeptides
Class MS_Peptide: Query, Rank, PrettyRank, Matched, MissedCleave, MrCalc, Delta, Observed, Charge, MrExp, IonsMatched, PeptideStr, PeaksUsed1, VarModsStr, VarMods, IonsScore, SeriesUsed, PeaksUsed2, PeaksUsed3, PeptideIdTh, HomologyTh, ProbOfPep
Association labels on the diagram: associated_with (2D_PAGE_Spot 1 to MassSpecResult 0..*), contain (1 to 1), query (1 to 1..n), config, hit
Flowchart
*.txt, *.pkl → Mascot Search (PMF) → *.dat → Mascot Parser → MassSpec Database
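The first step of such a flow, turning a peak-list file into rows for the PeakList and Peak classes, might look like the sketch below. It rests on an assumption the slides do not confirm: that each line of the peak list holds a mass and an intensity separated by whitespace, with any other line ignored. The attribute names follow the UML class diagram; `parse_peak_list` and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Peak:
    PeakMass: float
    PeakIntensity: float


@dataclass
class PeakList:
    MassMin: float
    MassMax: float
    IntMin: float
    IntMax: float
    NumPeaks: int
    peaks: List[Peak] = field(default_factory=list)


def parse_peak_list(text: str) -> PeakList:
    """Read 'mass intensity' lines and compute the PeakList summary fields."""
    peaks = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            mass, intensity = float(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip headers or comments that are not numeric
        peaks.append(Peak(mass, intensity))
    if not peaks:
        raise ValueError("no peaks found")
    masses = [p.PeakMass for p in peaks]
    intensities = [p.PeakIntensity for p in peaks]
    return PeakList(
        MassMin=min(masses),
        MassMax=max(masses),
        IntMin=min(intensities),
        IntMax=max(intensities),
        NumPeaks=len(peaks),
        peaks=peaks,
    )


if __name__ == "__main__":
    # Made-up values for illustration only.
    sample = "842.51 1200\n1045.56 560.5\n2211.10 90\n"
    summary = parse_peak_list(sample)
    print(summary.NumPeaks, summary.MassMin, summary.MassMax)
```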
Proteome Data Management
Sample: key-in → Gel Database
2D-PAGE (*.tiff): upload → Gel Database
Spot (*.out): upload/parsing → Gel Database
Mass Spectrum (*.pkl): upload → MassSpec Database
Protein/Peptide Report (*.dat): upload/parsing → MassSpec Database
Proteome Database