Bioinformatics for Proteomics
Shu-Hui Chen (陳淑慧)
Department of Chemistry
National Cheng Kung University
Bioinformatics I
How do we find protein coding regions, introns and exons in genomic DNA sequences?
DNA (5’ → 3’) → Transcription → Splicing → mRNA → Translation → Polypeptide → Folding → Protein → Function
Between protein and function:
• Transport / Localization
• Oligomerization
• PTM (Post-Translational Modification)
What is Proteomics ?
Systematic analysis of
All protein sequences
All protein expression patterns
All protein interactions
This involves
Protein isolation
Protein separation
Protein identification
Functional characterization of all proteins
The tools of Proteomics
Traditional protein chemistry assay methods struggle to establish identity.
Identity requires:
• Specificity of measurement (precision): mass spectrometry, MS-based data acquisition algorithms
• A reference for comparison: protein sequence databases, search algorithms
MS-based Proteomics and
Bioinformatics
• MS instruments are so far not sensitive enough to resolve the proteins in a biological system based solely on the measured signals.
• MS can, however, acquire enough data to map a protein to a database entry, using computer algorithms to analyze the data.
• This is where bioinformatics comes in.
Instrumentation
[Figure: block diagram of a mass spectrometer: sample inlet, vacuum, ion source, mass analyzer, data acquisition]
“Bioanalytical Chemistry” Mikkelsen, S.R.,
published by John Wiley & Sons, Inc.
MS-based Protein Identification
• Mass Mapping
• Peptide Sequencing
Conventional Methodology - Expression Proteomics
Trypsin Digestion
Trypsin cleaves polypeptides on the C-terminal side of the basic amino acids lysine (K) and arginine (R).
-NH-CH(R1)-CO-NH-CH(R2)-CO-  →(trypsin)→  -NH-CH(R1)-COOH + H2N-CH(R2)-CO-
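The cleavage rule above can be tried out in silico. The following is a minimal sketch, not the procedure used by any particular search engine: the function name `tryptic_digest` is hypothetical, and the rule is simplified to "cut after every K or R" (real tools usually add refinements such as missed cleavages, which are left out here).

```python
def tryptic_digest(sequence):
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            # The peptide ends with the basic residue (C-terminal cleavage).
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Whatever follows the last K/R is the C-terminal peptide.
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    # First 20 residues of the TRAP-1 sequence listed in this deck.
    for peptide in tryptic_digest("RALRRAPALAAVPGGKPILC"):
        print(peptide)
```

Each resulting peptide has a characteristic mass, which is what the mass mapping approach compares against the measured spectrum.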
[Figure: mass spectrum, ion intensity vs. m/z]
Mass Spectrometry
Protein identified by database mapping
Automated Database Search
Number 1 match: tumor necrosis factor type 1 receptor associated protein TRAP-1 (Mr): 76030.27
  1 RALRRAPALA AVPGGKPILC PRRTTAQLGP RRNPAWSLQA GRLFSTQTAE
 51 DKEEPLHSII SSTESVQGST SKHEFQAETK KLLDIVARSL YSEKEVFIRE
101 LISNASDALE KLRHKLVSDG QALPEMEIHL QTNAEKGTIT IQDTGIGMTQ
151 EELVSNLGTI ARSGSKAFLD ALQNQAEASS KIIGQFGVGF YSAFMVADRV
201 EVYSRSAAPG SLGYQWLSDG SGVFEIAEAS GVRTGTKIII HLKSDCKEFS
251 SEARVRDVVT KYSNFVSFPL YLNGRRMNTL QAIWMMDPKD VGEWQHEEFY
301 RYVAQAHDKP RYTLHYKTDA PLNIRSIFYV PDMKPSMFDV SRELGSSVAL
351 YSRKVLIQTK ATDILPKWLR FIRGVVDSED IPLNLSRELL QESALIRKLR
401 DVLQQRLIKF FIDQSKKDAE KYAKFFEDYG LFMREGIVTA TEQEVKEDIA
451 KLLRYESSAL PSGQLTSLSE YASRMRAGTR NIYYLCAPNR HLAEHSPYYE
501 AMKKKDTEVL FCFEQFDELT LLHLREFDKK KLISVETDIV VDHYKEEKFE
551 DRSPAAECLS EKETEELMAW MRNVLGSRVT NVKVTLRLDT HPAMVTVLEM
601 GAARHFLRMQ QLAKTQEERA QLLQPTLEIN PRHALIKKLN HCAQASLAWL
651 SCWWIRYTRT P
Total coverage: 33.4%
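A coverage figure like the one above is the percentage of residues in the protein sequence that fall inside at least one matched peptide. The sketch below shows one way to compute it. The function name `sequence_coverage` and the toy sequences in the example are illustrative assumptions, not the actual peptides matched in this search.

```python
def sequence_coverage(protein, matched_peptides):
    """Return the percentage of residues covered by at least one matched peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in matched_peptides:
        if not peptide:
            continue
        # Mark every occurrence of the peptide in the protein sequence.
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


if __name__ == "__main__":
    # Toy example: 5 of 10 residues are covered.
    print(sequence_coverage("ABCDEFGHIJ", ["BCD", "HI"]))
```

Overlapping peptides are counted only once, because coverage is tracked per residue rather than per peptide.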
Minimal content of a « protein sequence » db
• Sequences !!
• Accession number (AC)
• Taxonomic data
• References
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation
SWISS-PROT/TrEMBL
• Collaboration between the SIB (CH) and EMBL/EBI (UK)
• SWISS-PROT: Fully annotated (manually), non-redundant,
cross-referenced, documented protein sequence database.
• TrEMBL: automatically generated from annotated EMBL coding sequences (CDS) and annotated using software tools.
http://www.expasy.org/sprot/
ExPASy Web Server
ExPASy = Expert Protein Analysis System
History for MS Searching
1993 MOWSE (Molecular Weight Search), by Pappin and Bleasby
1994 SEQUEST, by Yates and Eng
1996 MOWSE II
1997 MOWSE III
1998 MASCOT, by Matrix Science
Scoring algorithm
Final score = -10 × log10(P), where P is the absolute probability that the observed match is a random event.
E value (expected value): the number of hits one can expect to see by chance when searching a database of a particular size. A value close to zero indicates that essentially no matches would be expected by chance.
Significant hits at the 95% confidence level (p < 0.05): there is less than a 1 in 20 chance that the observed match is a random event.
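The relation between the probability P and the reported score is simple enough to check numerically. This is a small sketch of the formula as stated on the slide; the function names are hypothetical, and it does not reproduce how any search engine derives P in the first place.

```python
import math


def match_score(p):
    """Convert a random-match probability P into a score of -10 * log10(P)."""
    return -10.0 * math.log10(p)


def is_significant(score, alpha=0.05):
    """A match is significant if its score exceeds the score of P = alpha."""
    return score > match_score(alpha)


if __name__ == "__main__":
    print(round(match_score(0.05), 2))  # score at the 95% confidence threshold
    print(is_significant(20.0))         # P = 0.01, better than the threshold
    print(is_significant(10.0))         # P = 0.1, worse than the threshold
```

Because the scale is logarithmic, every additional 10 score points corresponds to a tenfold smaller probability that the match is random.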
[Figure: search results after increasing the mass tolerance; values shown: 5, 7]
MS-based Protein Identification
Mass Mapping

Peptide Sequencing
Tandem Mass Spectrometry (MS/MS)
MS/MS acquisition is controlled by software settings
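Protein Identification
Peptide Sequencing using MS/MS
Precursor ion (peptide): ABCDEF
CID breaks the precursor at different peptide bonds:
A | BCDEF
AB | CDEF
ABC | DEF
ABCD | EF
[Figure: product ion spectrum (m/z axis) with peaks for A, AB, ABC, ABCD and ABCDE; the spacing between adjacent peaks corresponds to the residues B, C, D, E and F]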
Nomenclature used for CID peptide fragmentation
Low energy (eV): Q, TOF, FT
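The fragment ladder can be made concrete by computing fragment masses for a real peptide. The sketch below assumes the commonly used b/y naming (b ions contain the N-terminus, y ions the C-terminus), singly charged fragments, and standard monoisotopic residue masses; the transcript does not name the ion types, so treat these as assumptions. The function name `fragment_ions` is hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, running = [], 0.0
    for mass in masses[:-1]:            # N-terminal prefixes: b1 .. b(n-1)
        running += mass
        b_ions.append(running + PROTON)
    y_ions, running = [], 0.0
    for mass in reversed(masses[1:]):   # C-terminal suffixes: y1 .. y(n-1)
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    print([round(m, 4) for m in b])
    print([round(m, 4) for m in y])
```

The difference between two neighbouring ions of the same series equals one residue mass, which is how a sequence is read off the spectrum.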
Protein Identification by Database Search
Sequence Tag Approach for Peptide Sequencing
The Basic Local Alignment Search Tool (BLAST) finds
regions of local similarity between sequences.
The program compares nucleotide or protein sequences to
sequence databases and calculates the statistical
significance of matches.
BLAST can be used to infer functional and evolutionary
relationships between sequences as well as help identify
members of gene families.
BLAST = Basic Local Alignment Search Tool
NCBI BLAST
http://www.ncbi.nlm.nih.gov/blast/
Sequence alignments and comparison
1: MYTAILORISRICH
2: MONTAILLEURESTRICHE

Global Alignment
1: MY-TAIL--ORIS-RICH-
   ¦x ¦¦¦¦  x¦x¦ ¦¦¦¦
2: MONTAILLEURESTRICHE

Two Local Alignments
1: TAILO    1: RICH
   ¦¦¦¦x       ¦¦¦¦
2: TAILL    2: RICHE

¦ = Identity
x = Mismatch
- = Insertion / Deletion
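The difference between global and local alignment can be illustrated with a small dynamic-programming sketch. It scores in the style of Needleman-Wunsch (global) and Smith-Waterman (local) with an assumed toy scoring scheme of +1 for identity, -1 for mismatch and -1 per gap position. Real tools such as BLAST use substitution matrices and heuristics instead, so this is only an illustration; `alignment_score` is a hypothetical name.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b: global by default, local if requested."""
    # First row: aligning a prefix of b against nothing costs gaps (global only).
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, cur[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                score = max(score, 0)
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


if __name__ == "__main__":
    s1, s2 = "MYTAILORISRICH", "MONTAILLEURESTRICHE"
    print("global:", alignment_score(s1, s2))
    print("local: ", alignment_score(s1, s2, local=True))
```

The global score has to pay for every gap and mismatch along the full length of both sequences, while the local score only reports the best-matching region.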
Multiple Sequence Alignment (MSA)
Programs:
• CLUSTALW
• T_COFFEE
• MULTALIGN
HBA_CHICK   VL-SAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHF-DL  48
HBAD_CHICK  ML-TAEDKKLIQQAWEKAASHQEEFGAEALTRMFTTYPQTKTYFPHF-DL  48
HBPI_CHICK  AL-TQAEKAAVTTIWAKVATQIESIGLESLERLFASYPQTKTYFPHF-DV  48
HBB_CHICK   VHWTAEEKQLITGLWGKV--NVAECGAEALARLLIVYPWTQRFFASFGNL  48
HBE_CHICK   VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFASFGNL  48
HBRH_CHICK  VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFDNFGNL  48
MYG_CHICK   GL-SDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGL  49

HBA_CHICK   SH-----GSAQIKGHGKKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRV  93
HBAD_CHICK  SP-----GSDQVRGHGKKVLGALGNAVKNVDNLSQAMAELSNLHAYNLRV  93
HBPI_CHICK  SQ-----GSVQLRGHGSKVLNAIGEAVKNIDDIRGALAKLSELHAYILRV  93
HBB_CHICK   SSPTAILGNPMVRAHGKKVLTSFGDAVKNLDNIKNTFSQLSELHCDKLHV  98
HBE_CHICK   SSPTAIMGNPRVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCDKLHV  98
HBRH_CHICK  SSPTAIIGNPKVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCEKLHV  98
MYG_CHICK   KTPDQMKGSEDLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKI  99

HBA_CHICK   DPVNFKLLGQCFLVVVAIHHPAALTPEVHASLDKFLCAVGTVLTAKYR--  141
HBAD_CHICK  DPVNFKLLSQCIQVVLAVHMGKDYTPEVHAAFDKFLSAVSAVLAEKYR--  141
HBPI_CHICK  DPVNFKLLSHCILCSVAARYPSDFTPEVHAEWDKFLSSISSVLTEKYR--  141
HBB_CHICK   DPENFRLLGDILIIVLAAHFSKDFTPECQAAWQKLVRVVAHALARKYH--  146
HBE_CHICK   DPENFRLLGDILIIVLASHFARDFTPACQFAWQKLVNVVAHALARKYH--  146
HBRH_CHICK  DPENFRLLGNILIIVLAAHFTKDFTPTCQAVWQKLVSVVAHALAYKYH--  146
MYG_CHICK   PVKYLEFISEVIIKVIAEKHAADFGADSQAAMKKALELFRNDMASKYKEF  149

HBA_CHICK   ----  141
HBAD_CHICK  ----  141
HBPI_CHICK  ----  141
HBB_CHICK   ----  146
HBE_CHICK   ----  146
HBRH_CHICK  ----  146
MYG_CHICK   GFQG  153

[The conservation marker row under each block is not reproduced here]
Consensus length: 154; Identity : 19 ( 12.3%); Similarity: 51 ( 33.1%)
Character to show that a position in the alignment is perfectly conserved: '*'
Character to show that a position is well conserved: '.'
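Perfectly conserved positions (the '*' marker) are columns in which every aligned sequence carries the same residue. A minimal sketch of that check follows. The function name `conserved_columns` is hypothetical, and the rule for the weaker '.' marker depends on the alignment program's similarity groups, so it is not attempted here.

```python
def conserved_columns(alignment):
    """Return the 0-based indices of columns where all sequences share one residue."""
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        # A column of identical gap characters does not count as conserved.
        if len(set(column)) == 1 and column[0] != "-":
            conserved.append(index)
    return conserved


def percent_identity(alignment):
    """Percentage of alignment columns that are perfectly conserved."""
    length = len(alignment[0])
    return 100.0 * len(conserved_columns(alignment)) / length


if __name__ == "__main__":
    # First five columns of three of the chicken globin rows shown in this deck.
    block = ["VL-SA", "ML-TA", "AL-TQ"]
    print(conserved_columns(block))
    print(percent_identity(block))
```

The sketch assumes all rows have equal length, as they do in a finished alignment.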
Searching databases with multiple alignments
PSI-BLAST: Position-Specific Iterative BLAST (Altschul et al., 1997)
1. Starting with a single sequence, PSI-BLAST searches a database
using BLAST and builds a multiple sequence alignment and a profile.
2. The profile is then used to search the protein database again.
3. Running the program several times can further refine the profile
and increase search sensitivity.
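The idea of a profile in step 1 can be sketched as a position-specific scoring matrix built from the columns of a multiple alignment. The version below is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency of 1/20 per amino acid, a flat pseudocount, and no sequence weighting or gap handling. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build log-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, sequence):
    """Score an ungapped sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


if __name__ == "__main__":
    profile = build_profile(["TAIL", "TAIL", "TALL"])
    print(round(profile_score(profile, "TAIL"), 2))
    print(round(profile_score(profile, "GGGG"), 2))
```

A sequence that agrees with the residues seen at each position scores higher than an unrelated one, which is what makes the profile search in step 2 more sensitive than a search with a single sequence.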
Error tolerance search
[Figure: search results at mass tolerances of 0.05 Da/0.05 Da, 0.2 Da/0.2 Da and 0.5 Da/0.5 Da; values shown: 27, 32, 33]
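How the tolerance setting changes the number of matches can be shown with a toy comparison of observed and theoretical masses. The masses below are made-up illustrative numbers, not data from this search, and `count_matches` is a hypothetical helper.

```python
def count_matches(observed, theoretical, tolerance_da):
    """Count observed masses lying within +/- tolerance_da of any theoretical mass."""
    return sum(
        any(abs(obs - theo) <= tolerance_da for theo in theoretical)
        for obs in observed
    )


if __name__ == "__main__":
    theoretical = [500.30, 742.45, 1001.52]   # hypothetical peptide masses (Da)
    observed = [500.33, 742.60, 1001.90]      # hypothetical measured masses (Da)
    for tolerance in (0.05, 0.2, 0.5):
        print(tolerance, count_matches(observed, theoretical, tolerance))
```

A wider tolerance accepts more matches, but it also raises the chance that a match is random, so the setting should reflect the mass accuracy of the instrument.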
MS/MS Scan Functions
[Figure: MS/MS schematic. Ions m1 to m4; single mass transmission of m2; collision chamber (N2 gas); mass scan]
Scan mode                            Q1     Q3
Product Ion Scan (PI)                Fix    Scan
Multiple Reaction Monitoring (MRM)   Fix    Fix
Precursor Ion Scan (PS)              Scan   Fix
Neutral Loss Scan (NL)               Scan   Scan
IP + MS/ID for identifying protein interaction complexes
Conclusions
Protein identification by MS is a key element of proteomics, and the ID process is an informatics-based methodology.
MS + sequence databases represent a huge leap for protein biochemistry: a large-scale analysis approach.
Biochemical manipulation + protein ID can provide functional information on proteins.
Bioinformatics tools are needed to link proteomics data to protein interactions and biological pathways.