Big Data
Protein Functional Prediction
BLAST
1042. Data Science in Practice
Week 16, 06/06
Jia-Ming Chang
http://www.cs.nccu.edu.tw/~jmchang/course/1042/datascience/
These slides are for educational purposes only. In case of any infringement, please contact me and we will correct it immediately.
Dataset for Homework 4
Performance comparison for archaeal proteins
• Yu, N.Y. et al. (2010) PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics, 26, 1608–1615.
Protein subcellular localization prediction
[Figure: prokaryotic cell structure; a protein sequence, e.g. MPLDLYNTLTRRKERF…, is assigned to a subcellular localization site]
1. Chang, J.-M., Su, E.C.-Y., Lo, A., Chiu, H.-S., Sung, T.-Y. and Hsu, W.-L. (2008) PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins, 72, 693-710.
2. Chang, J.-M., Taly, J.-F., Erb, I., Sung, T.-Y., Hsu, W.-L., Tang, C.Y., Notredame, C. and Su, E.C.-Y. (2013) Efficient and Interpretable Prediction of Protein Functional Classes by Correspondence Analysis and Compact Set Relations. PLoS One, 8, e75542.
Document Classification
[Figure: documents → classifier → categories]
Salton's vector space model
Represent each document by a high-dimensional vector in the space of words
Documents → Vectors
Journal of Artificial Intelligence Research
"JAIR is a refereed journal, covering all areas of Artificial Intelligence, which is distributed free of charge over the internet. Each volume of the journal is also published by AI Access Foundation …"
Vector of term counts: 0 learning, 2 Journal, 3 Intelligence, 0 text, 0 agent, 1 internet, 0 webwatcher, 0 perlS, …, 1 volume
Gerald Salton
bag-of-words model
The term-document matrix is an m × n matrix, where m is the number of terms and n is the number of documents:
$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
$$
Columns correspond to documents $d_1, d_2, \ldots, d_n$; rows correspond to terms $t_1, t_2, \ldots, t_m$.
Vectors in Term Space
Predicted by 1-Nearest-Neighbor based on Cosine Similarity
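A minimal sketch of 1-nearest-neighbor prediction with cosine similarity over term vectors (toy counts and labels, not the course data):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two term vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predict_1nn(query, train_vectors, train_labels):
    # Assign the label of the most cosine-similar training document
    sims = [cosine_similarity(query, v) for v in train_vectors]
    return train_labels[int(np.argmax(sims))]

# Toy term-count vectors (rows: documents, columns: terms)
train_vectors = np.array([[2.0, 3.0, 0.0], [0.0, 1.0, 4.0]])
train_labels = ["AI journal", "web agent"]
print(predict_1nn(np.array([1.0, 2.0, 0.0]), train_vectors, train_labels))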
Term Weighting by TFIDF
• The term frequency (tf) of term $t_i$ in a given document $d$ gives a measure of the importance of $t_i$ within that particular document:
$$\mathrm{tf}(t_i, d) = \frac{n_i}{\sum_k n_k}$$
with $n_i$ being the number of occurrences of the considered term, and the denominator the number of occurrences of all terms in $d$.
• The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing the term $t_i$:
$$\mathrm{idf}(t_i) = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
where $|D|$ is the total number of documents in the corpus and the denominator counts the documents where the term $t_i$ appears.
• tfidf = tf × idf
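These formulas translate directly to code; a small sketch with illustrative token lists (it assumes the queried term appears in at least one document, otherwise idf would divide by zero):

import math
from collections import Counter

def tf(term, doc_tokens):
    # Occurrences of the term divided by occurrences of all terms in the document
    counts = Counter(doc_tokens)
    return counts[term] / sum(counts.values())

def idf(term, corpus):
    # log(|D| / number of documents containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [["journal", "intelligence", "journal"],
          ["agent", "internet"],
          ["journal", "volume"]]
print(tfidf("journal", corpus[0], corpus))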
Feature Reduction
• The best choice of axes is the one that shows the most variation in the data; it can be found by linear algebra: Singular Value Decomposition (SVD)
[Figure: true plot in k dimensions vs. reduced-dimensionality plot]
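A minimal illustration of SVD-based dimensionality reduction with NumPy (the matrix here is a toy stand-in for a term-document matrix):

import numpy as np

# Toy term-document matrix (terms x documents)
A = np.array([[2.0, 0.0, 1.0],
              [3.0, 1.0, 0.0],
              [0.0, 4.0, 1.0],
              [1.0, 1.0, 2.0]])

# SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest axes of variation
k = 2
A_reduced = np.diag(s[:k]) @ Vt[:k, :]  # k-dimensional document coordinates
print(A_reduced)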
System Architecture
MPLDLYNTLT…
PSIBLAST
PSSM profile for positions 1–10 of the query (columns: sequence M P L D L Y N T L T; rows: the 20 amino acids):

Pos:   1   2   3   4   5   6   7   8   9  10
Seq:   M   P   L   D   L   Y   N   T   L   T
A     -3   2  -4  -2  -4  -4  -4  -2   0  -1
R     -3  -3  -5   5  -5  -3  -3  -3  -1  -3
N     -4  -3  -6  -1  -6  -3   8  -1  -5  -1
D     -5  -1  -6  -3  -6  -5   4  -3  -5  -1
C     -3  -3  -4  -4  -4  -5  -6  -1  -4  -4
Q     -3  -1  -3   2  -5  -3  -3  -3  -3  -2
E     -4  -1  -5  -1  -6  -4  -2  -3  -4  -3
G     -5  -1  -6  -4  -6  -5  -3  -4  -4  -2
H     -4  -4  -5   2  -4   4  -2  -3  -3  -1
I      0  -2   3  -5   4  -4  -6  -4  -1  -4
L      1  -4   5  -3   4  -3  -6  -4   5  -3
K     -3  -2  -5   5  -5  -3  -3  -1  -3  -1
M     10  -2   4  -2   0  -2  -5  -4   3  -3
F     -2  -5   0  -2   1   4  -6  -4   0  -4
P     -5   4  -5  -4  -5  -5  -4  -4  -4  -4
S     -4   2  -5  -2  -5  -3  -1   4  -3   3
T     -3   4  -3   0  -3  -2  -3   6  -3   6
W     -4  -5  -4  -1  -4   2  -7  -5  -3  -5
Y     -3  -4  -3   0  -3   8  -5  -4  -2  -4
V     -1  -3   2  -3   3  -4  -6  -2  -1  -3
Gapped-Dipeptide Representation
A0A, A1A, A2A, A3A, A4A, A5A, …, Y5Y
{0.81396, 0.78755, 0.788206, 0.799535, 0.784058, 0.742093, …, 0.437457}
PSLDoc: Protein Subcellular Localization prediction by Document classification
PLSA Reduction
{0.012103, 0.014095, 0.015480, 0.018894, …, 0.003121}
SVMCP, SVMIM, SVMPP, SVMOM, SVMEC
Highest Probability → Predicted Localization Site
PSLDoc 2
PSLDoc
[Pipeline figure repeated: MPLDLYNTLT… → PSIBLAST → PSSM (as above) → Gapped-Dipeptide Representation (A0A, A1A, …, Y5Y) → PLSA Reduction → SVMCP/SVMIM/SVMPP/SVMOM/SVMEC → Highest Probability → Predicted Localization Site]
Term Weighting Scheme – TF: Position Specific Score Matrix
• Position Specific Score Matrix (PSSM): a PSSM is constructed from a multiple alignment of the highest-scoring hits in the BLAST search
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 M   -3 -3 -4 -5 -3 -3 -4 -5 -4  0  1 -3 10 -2 -5 -4 -3 -4 -3 -1
 2 P    2 -3 -3 -1 -3 -1 -1 -1 -4 -2 -4 -2 -2 -5  4  2  4 -5 -4 -3
 3 L   -4 -5 -6 -6 -4 -3 -5 -6 -5  3  5 -5  4  0 -5 -5 -3 -4 -3  2
 4 D   -2  5 -1 -3 -4  2 -1 -4  2 -5 -3  5 -2 -2 -4 -2  0 -1  0 -3
 5 L   -4 -5 -6 -6 -4 -5 -6 -6 -4  4  4 -5  0  1 -5 -5 -3 -4 -3  3
 ...
78 N   -4 -3  8  4 -6 -3 -2 -3 -2 -6 -6 -3 -5 -6 -4 -1 -3 -7 -5 -6
79 T   -2 -3 -1 -3 -1 -3 -3 -4 -3 -4 -4 -1 -4 -4 -4  4  6 -5 -4 -2
80 L    0 -1 -5 -5 -4 -3 -4 -4 -3 -1  5 -3  3  0 -4 -3 -3 -3 -2 -1
81 T   -1 -3 -1 -1 -4 -2 -3 -2 -1 -4 -3 -1 -3 -4 -4  3  6 -5 -4 -3
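The exact TF weighting from the PSSM follows the PSLDoc paper; below is only a hedged sketch of the idea, under the assumption that log-odds PSSM scores are squashed into (0, 1) with a logistic function and that a gapped dipeptide aa1-X{gap}-aa2 is weighted by the product of the two position scores (the helper names and the dict layout are illustrative, not the paper's code):

import math

def pssm_prob(score):
    # Assumption: squash a PSSM log-odds score into (0, 1) with a logistic function
    return 1.0 / (1.0 + math.exp(-score))

def gapped_dipeptide_weight(pssm, i, aa1, aa2, gap):
    # Weight of gapped dipeptide aa1-X{gap}-aa2 anchored at position i.
    # pssm: dict position -> dict amino_acid -> score (illustrative layout).
    j = i + gap + 1
    if j not in pssm:
        return 0.0
    return pssm_prob(pssm[i][aa1]) * pssm_prob(pssm[j][aa2])

# Toy two-position profile: gapped dipeptide M0P (M at position 1, P at position 2)
pssm = {1: {"M": 10, "P": -5}, 2: {"M": -2, "P": 4}}
print(gapped_dipeptide_weight(pssm, 1, "M", "P", gap=0))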
Database Size
Protein sequence databases: NCBI non-redundant (NR); UniProt (release 15.15, 2010); UniRef50; UniRef90; UniRef100

Data Set     No. of sequences
UniRef50        3,077,464
UniRef90        6,544,144
UniRef100       9,865,668
UniProt        11,009,767
NCBI NR        10,565,004
Feature reduction – topic model
[Figure: terms (e.g. "economic", "imports", "trade") and documents are linked through latent concepts such as TRADE]
Probabilistic Latent Semantic Analysis
The joint probability of a term w and a document d can be modeled as:
$$P(w, d) = P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d)$$
where z is a latent variable with a "small" number of states (concepts), $P(w \mid z)$ are the concept expression probabilities, and $P(z \mid d)$ are the document-specific mixing proportions.
The parameters can be estimated by maximizing the likelihood function with the EM algorithm.
Hofmann T: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Mach Learn 2001, 42(1-2):177-196.
PLSA model fitting
• Log-likelihood function:
$$\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(w, d)$$
• E-step: the probability that a term w in a particular document d is explained by the class corresponding to z:
$$P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}$$
• M-step:
$$P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \qquad P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w)$$
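A compact sketch of these EM updates in NumPy, on random toy counts (a minimal illustration of the update rules, not the course code):

import numpy as np

rng = np.random.default_rng(0)
n_docs, n_terms, n_topics = 6, 12, 3
counts = rng.integers(0, 5, size=(n_docs, n_terms)).astype(float)  # n(d, w)

# Random normalized initialization of P(w|z) and P(z|d)
p_w_given_z = rng.random((n_topics, n_terms)); p_w_given_z /= p_w_given_z.sum(1, keepdims=True)
p_z_given_d = rng.random((n_docs, n_topics)); p_z_given_d /= p_z_given_d.sum(1, keepdims=True)

for _ in range(50):
    # E-step: P(z | d, w) for every (d, w) pair, shape (n_docs, n_terms, n_topics)
    joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
    p_z_given_dw = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate P(w | z) and P(z | d) from expected counts
    expected = counts[:, :, None] * p_z_given_dw
    p_w_given_z = expected.sum(axis=0).T
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = expected.sum(axis=1)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

print(p_z_given_d.round(2))  # document-specific mixing proportions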
Probabilistic Latent Semantic Analysis
[Figure: PLSA feature reduction maps document vectors from term space (Term 1–Term 5) into a lower-dimensional topic space (Topic 1–Topic 3)]
Gapped-peptide signature
The site-topic preference of topic z for a site l = average{ P(z|d) : d (a protein) belongs to class l }; collecting these values gives the site-topic preference matrix.
For each site, 10 preferred topics are chosen according to preference confidence (= the 1st site-topic preference minus the 2nd site-topic preference).
For each topic, the 5 most frequent gapped-dipeptides are selected.
Classifier – Support Vector Machines
• Support Vector Machines (SVM)
  – LIBSVM software
  – Five 1-vs-rest SVM classifiers corresponding to the five localization sites:
    SVMCP: CP vs. non-CP
    SVMIM: IM vs. non-IM
    SVMPP: PP vs. non-PP
    SVMOM: OM vs. non-OM
    SVMEC: EC vs. non-EC
  – Kernel: Radial Basis Function (RBF)
  – Parameter selection: c (cost) and γ (gamma) are optimized by five-fold cross-validation (a minimal sketch follows the citation below)
Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
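A minimal sketch of this setup using scikit-learn's wrapper around LIBSVM (toy random data; the actual PSLDoc features come from the PLSA step, and the parameter grids here are illustrative):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 80))          # PLSA-reduced feature vectors (toy)
y = rng.integers(0, 5, size=100)   # 5 localization sites: CP, IM, PP, OM, EC

# RBF-kernel SVM; optimize c (cost) and gamma with five-fold cross-validation
grid = GridSearchCV(
    SVC(kernel="rbf", probability=True),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]},
    cv=5,
)
clf = OneVsRestClassifier(grid).fit(X, y)  # five 1-vs-rest classifiers
print(clf.predict(X[:3]))                  # highest-probability site wins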
Prediction Confidence
[Pipeline figure as above: PSSM → Gapped-Dipeptide Representation (A0A, A1A, …, Y5Y) → PLSA Reduction → SVMCP/SVMIM/SVMPP/SVMOM/SVMEC]
• The confidence of the final predicted class:
  Prediction Confidence = the largest probability − the second largest probability
• Example: if SVMCP outputs the largest probability and SVMOM the second largest, then Prediction Confidence = SVMCP − SVMOM
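In code, the confidence is simply the margin between the two largest class probabilities (the probability values below are illustrative):

probs = {"CP": 0.61, "IM": 0.05, "PP": 0.08, "OM": 0.19, "EC": 0.07}
top, second = sorted(probs.values(), reverse=True)[:2]
confidence = top - second  # here: SVMCP - SVMOM = 0.42
print(confidence)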
Highest Probability → Predicted Localization Site
[Bar chart: Overall Accuracy (%) vs. Prediction Confidence, binned from [0–0.1) to [0.9–1]]
Prediction Threshold (1/3)
If Prediction Confidence ≥ Threshold: report the Predicted Localization Site; otherwise: report Unknown.
[Plot: Precision (%) on the x-axis (92–100) vs. Recall (%) on the y-axis (70–95) for thresholds 0, 0.1, …, 0.9; the value above each point denotes the corresponding prediction threshold]
Prediction Threshold (2/3)

Loc. Sites   PSLDoc_PreThr=0.7       PSLDoc_PreThr=0.3       PSORTb v.2.0
             Precision   Recall      Precision   Recall      Precision   Recall
CP           97.30       77.70       94.92       87.41       92.86       70.14
IM           98.91       88.35       97.94       92.23       95.33       92.56
PP           96.19       73.19       93.00       81.88       95.50       69.20
OM           99.46       93.61       98.41       95.14       97.38       94.88
EC           95.57       79.47       91.57       85.79       97.40       78.95
Overall      97.89       83.66       95.77       89.27       95.82       82.62
Prediction Threshold (3/3)
*The threshold is set such that the coverage is similar to that of PSLT.
PSLDoc
MPLDLYNTLT… → PSIBLAST → [PSSM matrix as shown above]
How to efficiently search?
How to directly infer?
Gapped-Dipeptide Representation: A0A, A1A, A2A, A3A, A4A, A5A, …, Y5Y
{0.81396, 0.78755, 0.788206, 0.799535, 0.784058, 0.742093, …, 0.437457}
PLSA Reduction (How to intuitively predict?)
{0.012103, 0.014095, 0.015480, 0.018894, …, 0.003121}
SVMCP, SVMIM, SVMPP, SVMOM, SVMEC → Highest Probability → Predicted Localization Site
PSLDoc 2
Correspondence analysis
CA may be defined as a special case of principal components analysis (an eigenvector method).
Matrix Y is decomposed using the generalized singular value decomposition under the constraints imposed by the matrices M (masses for the rows) and W (weights for the columns):
$$Y = P \Delta Q^{\mathsf{T}} \quad \text{with} \quad P^{\mathsf{T}} M P = Q^{\mathsf{T}} W Q = I$$
This is illustrated by the analysis of the columns of matrix X, or equivalently by the rows of the transposed matrix Xᵀ.
Because the factor scores obtained for the rows and the columns have the same variance (i.e., they have the same "scale"), it is possible to plot them in the same space.
http://www.universityoftexasatdallascomets.com/~herve/abdi-CorrespondenceAnaysis2010-pretty.pdf
Correspondence analysis of the Gram-negative dataset
[Figure legend: IM, OM, CP, EC, PP: proteins; dots: gapped-dipeptides; *: gapped-dipeptide signatures]
Compact set

Distance matrix:
      S1  S2  S3  S4  S5  S6
S1     0  10  16  18  13   8
S2         0  14  17  15   9
S3             0   9  10  12
S4                 0   8  19
S5                     0  11
S6                         0

[Figure: graph over S1–S6 with edge weights 8, 9, 10, 11 illustrating compact sets]

C is a compact set if
$$\min\{\, E(v_i, v_k) \mid v_i \in C,\ v_k \in V \setminus C \,\} > \max\{\, D(v_i, v_j) \mid v_i, v_j \in C \,\}$$
Hierarchical clustering

Same distance matrix over S1–S6 as above.

[Figure: the compact set tree (compact sets C1, C2, C3 over s1–s6) compared with the single-linkage clustering dendrogram]
Compact set
• Input: a connected undirected graph G = (V, E), where V represents proteins and the edge weight E(vi, vj) is the distance between proteins vi and vj, measured as the Euclidean distance in the CA-reduced space
• Output: all the compact sets in G (a brute-force sketch follows this list)
  – Step 1: Construct a Kruskal Merging Ordering Tree TKru of G. (CONSTRUCT_TKru)
  – Step 2: Verify all candidate sets.
• Time = O(L + M + M log N)
  – M = the number of edges
  – N = the number of vertices
  – L = the sum of the sizes of all compact sets
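A hedged, brute-force sketch of the idea (not the paper's O(L + M + M log N) algorithm): generate candidate clusters from Kruskal's merge order, then verify each one against the compact-set definition above. Point names and coordinates are illustrative.

from itertools import combinations

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def is_compact(cluster, points):
    # Compact set: min distance leaving the cluster > max distance inside it
    inside = [euclidean(points[a], points[b]) for a, b in combinations(cluster, 2)]
    outside = [euclidean(points[a], points[b])
               for a in cluster for b in points if b not in cluster]
    if not inside or not outside:
        return True
    return min(outside) > max(inside)

def kruskal_candidates(points):
    # Candidate clusters: the merge products of Kruskal's algorithm (MST order)
    parent = {p: p for p in points}
    members = {p: {p} for p in points}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((euclidean(points[a], points[b]), a, b)
                   for a, b in combinations(points, 2))
    candidates = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            members[ra] |= members[rb]
            parent[rb] = ra
            candidates.append(frozenset(members[ra]))
    return candidates

# Toy points standing in for proteins in the CA-reduced space
points = {"S1": (0.0, 0.0), "S2": (1.0, 0.5), "S5": (5.0, 5.0), "S6": (5.5, 4.5)}
for c in kruskal_candidates(points):
    if is_compact(c, points):
        print(sorted(c))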
CS+1NN on Gram-Negative
PSORTdb
• http://psort.org/psortb/index.html
• Peabody,M.A. et al. (2016) PSORTdb: expanding the bacteria and
archaea protein subcellular localization database to better reflect
diversity in cell envelope structures. Nucleic Acids Res., 44, D663–8.
BLAST
Basic local alignment search tool
SF Altschul, W Gish, W Miller, EW Myers, DJ Lipman
Journal of molecular biology 215 (3), 403-410
BLAST
• The top 100 papers
  – http://www.nature.com/news/the-top-100-papers1.16224#/interactive
What is BLAST?
BLAST is to nucleotide/protein sequence databases as Google is to the Internet.
Credit by David Fristrom, Bibliographer/Librarian, Science and Engineering Library, [email protected]
Alignment

AACGTTTCCAGTCCAAATAGCTAGGC
===--===   =-===-==-======
AACCGTTC   TACAATTACCTAGGC

Hits (+1): 18
Misses (−2): 5
Gaps (existence −2, extension −1): 1, length 3
Score = 18 × 1 + 5 × (−2) − 2 − 2 = 4
Credit by David Fristrom, Bibliographer/Librarian, Science and Engineering Library, [email protected]
Global Alignment
• Compares the total length of two sequences
• Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53 (1970).
Credit by David Fristrom, Bibliographer/Librarian, Science and Engineering Library, [email protected]
Local Alignment
• Compares segments of sequences
• Finds cases where one sequence is part of another, or where they match only in parts
• Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J Mol Biol. 147(1):195-7 (1981).
Credit by David Fristrom, Bibliographer/Librarian, Science and Engineering Library, [email protected]
Arranging Everything in a Table
[Figure: DP table for aligning FAST against FAT; the score of cell (I, J) is built from the three neighbouring subproblems: 1…I−1 vs. 1…J−1 (match/mismatch), 1…I−1 vs. 1…J (gap), and 1…I vs. 1…J−1 (gap)]
Adapted from Cedric Notredame
Filling Up the Matrix
Adapted from Cedric Notredame
Delivering the alignment: Trace-back
T S A F
T - A F
Score of 1…3 vs. 1…4 → Optimal alignment score
Adapted from Cedric Notredame
Local Alignments
[Figure: DP matrices comparing a GLOBAL alignment path with a LOCAL alignment path]
Smith and Waterman (SW) = LOCAL alignment
Adapted from Cedric Notredame
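A compact Smith and Waterman sketch in Python, scoring only (no trace-back); the match/mismatch/gap values are illustrative, not BLAST's defaults:

def smith_waterman(a, b, match=1, mismatch=-2, gap=-2):
    # Local alignment score by dynamic programming: cells are floored at 0
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("TSAF", "TAF"))  # small example from the trace-back slide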
Search Tool
• By aligning a query sequence against all sequences in a database, alignment can be used to search the database for similar sequences
• But alignment algorithms are slow
Credit by David Fristrom, Bibliographer/Librarian, Science and Engineering Library, [email protected]
What is BLAST?
• A quick, heuristic alignment algorithm
• Divides the query sequence into short words, initially looks only for (exact) matches of these words, then tries to extend the alignment
• Much faster, but can miss some alignments
• Altschul, S.F. et al. Basic local alignment search tool. J Mol Biol. 215(3):403-10 (1990).
Credit by David Fristrom, Bibliographer/Librarian, Science and Engineering Library, [email protected]
What is BLAST?
• Basic Local Alignment Search Tool
• BLAST is a program designed for RAPIDLY comparing your sequence with every sequence in a database and REPORTING the most SIMILAR sequences
Adapted from Cedric Notredame
Database Search
[Figure: a QUERY is passed to a comparison engine (SW) and compared against every sequence in the database; each hit is reported with an E-value, e.g. 1.10e-100, 1.10e-20, 1.10e-2, 1.10e-1]
E-values: how many times do we expect such an alignment by chance?
Adapted from Cedric Notredame
Database Search
1 - Query
2 - Comparison Engine (LOCAL Alignment)
3 - Database
4 - Statistical Evaluation (E-Value)
PROBLEM: LOCAL ALIGNMENT (SW) IS TOO SLOW
Adapted from Cedric Notredame
BLAST
BLAST is a Heuristic Smith and Waterman
BLAST = 3 STEPS
1-Decide who will be compared
This is where Blast SAVES TIME
This is where it LOSES HITS
Most BLAST parameters refer to this step
Adapted from Cedric Notredame
BLAST
BLAST is a Heuristic Smith and Waterman
BLAST = 3 STEPS
1-Decide who will be compared
2-Check the most promising Hits
3-Compute the E-value of the most
interesting Hits
Adapted from Cedric Notredame
Inside BLAST
Step 1: finding the worthy words
[Figure: the list of all 3-amino-acid words (AAA, AAC, AAD, …, YYY) is scored against the query; words with a score > T (e.g. ACT, REL, RSL, LKP, TVF) are kept, words with a score < T are discarded]
Adapted from Cedric Notredame
Inside BLAST
Step 2: eliminate the database sequences that do not contain any interesting word
[Figure: sequences within the database are scanned for the "interesting" words (score > T, e.g. ACT, RSL, TVF); the sequences containing interesting words become the hits]
Adapted from Cedric Notredame
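A toy sketch of this word-based filtering, using exact 3-letter word matching only (the real BLAST also expands each query word into its score > T neighbourhood; sequence names here are illustrative):

def words(seq, k=3):
    # All overlapping k-letter words in a sequence
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_hits(query, database, k=3):
    # Keep only database sequences sharing at least one word with the query
    query_words = words(query, k)
    return [name for name, seq in database.items()
            if words(seq, k) & query_words]

database = {"seq1": "MKLACTVFGH", "seq2": "PPPQQQRRR"}
print(candidate_hits("AACTVF", database))  # ['seq1'] shares ACT, CTV, TVF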
Inside BLAST: the end
Step 3: extension of the hits
• Two "hits" on the same diagonal of the query vs. database-sequence comparison, distant by less than X, are extended by limited dynamic programming
Adapted from Cedric Notredame
BLAST Statistics
• Raw Score
  – Sum of the substitution scores and gap penalties
  – Not very informative
• p-value (derived statistic)
  – Probability of finding an alignment with such a score by chance
  – The lower, the better
Adapted from Cedric Notredame
Any Questions?