Lecture 11
SUPERVISED NEURAL NETWORKS FOR PROTEIN SEQUENCE ANALYSIS
Dr Lee Nung Kion
Faculty of Cognitive Sciences and Human Development
UNIMAS, http://www.unimas.my
Introduction
• Protein sequences are composed of 20 amino acids.
• The twenty amino acid letters are: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y.
• Proteins are the products of genes and serve many functions in the body: antibodies, enzymes, structural components (hair, tendons, etc.), and so on.

Introduction
• A sequence motif is a short pattern of amino acids in a protein sequence that has biological significance.
• For example:
  AATCKLMMVTVVWTTAGA
  (In the original slide, the motifs important for the function of this protein are underlined.)
• Proteins in the same functional domain will share a common motif.

Introduction
• A protein superfamily comprises a set of protein sequences that are evolutionarily, and therefore functionally and structurally, related.
• Protein sequences in a family share some common motifs.
• Two protein sequences are assigned to the same class if they have high homology at the sequence level (e.g., a common motif).

First fact of biology

“If two peptide stretches exhibit sufficient similarity at the sequence level, then they are likely to be biologically related.”
Sequence alignment

• The similarity between two protein sequences is commonly established through sequence alignment algorithms (e.g., BLAST).
[Figure: Example of multiple sequence alignment to identify common motifs in protein sequences]
Protein families
• Rapid growth in the number of protein sequences.
• Searching one query sequence against all sequences in all the databases is computationally expensive; it needs a supercomputer or weeks of computation time.
[Figure: Growth of sequence data in GenBank]
Solution

• Grouping proteins into families based on sequence-level similarity can aid in:
◦ Group analysis of sequences to identify common motifs;
◦ Rapid search for a protein sequence.
The idea
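[Diagram: The protein database is grouped into families (Fam 1, Fam 2, Fam 3, …, Fam m) using clustering algorithms, and feature extraction yields the features that represent each protein family. A query sequence goes through feature extraction as well and is then matched against the family features.]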
Neural network for protein family classification
[Diagram: The protein database is grouped into families (Fam 1, Fam 2, Fam 3, …, Fam m) using clustering algorithms. Features are extracted for all protein sequences, and the resulting input patterns from the different classes are used for neural network training. The input vector of a query sequence is then given to the trained network, which predicts its family (Fam ?).]
Issues in applying NNs to biological sequence analysis
Input encoding
• How can amino acids be converted into numerical values?
• Direct encoding:
◦ Each amino acid letter is represented by a binary code.
◦ Various binary representations can be obtained based on the properties of amino acids.
Input encoding

• Each amino acid is converted into a numerical value based on its properties.
Input encoding

• Amino acids are converted to binary vectors or feature values (Wu & McLarty, 2000).
Direct encoding 1-of-20 (Bin20)
• Each amino acid is represented by 20 binary digits, exactly one of which is 1 and the others 0.
• E.g.,
  A = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  D = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  …
  K = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
• A protein sequence becomes the concatenation of these binary vectors.
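The 1-of-20 scheme can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names `AMINO_ACIDS` and `bin20_encode` are hypothetical, and the bit position of each amino acid is assumed to follow the alphabetical letter list from the introduction. Any fixed ordering works equally well as long as it is used consistently.

```python
# Assumed ordering: the 20 one-letter codes in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def bin20_encode(sequence):
    """Return the concatenated 1-of-20 (Bin20) encoding of a protein sequence."""
    encoded = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        vec[INDEX[aa]] = 1  # raises KeyError for letters outside the alphabet
        encoded.extend(vec)
    return encoded


if __name__ == "__main__":
    x = bin20_encode("ACD")
    print(len(x))  # 3 residues x 20 bits each
```

Note that the input length grows with the sequence length (20 values per residue), so sequences of different lengths give input vectors of different sizes.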

Exchange groups

The twenty amino acids can be grouped
according to various chemical/structural
properties.
Indirect encoding

n-gram features
◦ An n-gram feature is the number of occurrences of a short subsequence of length n in a protein sequence.
◦ Definition:
The n-gram features are pairs of values (v_i, c_i), where v_i denotes the i-th n-gram and c_i is the count of that n-gram in a protein sequence, for i = 1, …, |Σ|^n, with Σ the amino acid alphabet.
◦ For example, the 2-grams are:
◦ (AA, AC, …, AY, CA, CC, …, CY, …, YA, …, YY).
N-gram feature example

AGCCDDAGAGKDDV
AG - 3
GC - 1
CC - 1
CD - 1
DD - 2
DA - 1
GA - 1
GK - 1
KD - 1
DV - 1
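The counting in this example can be reproduced with a short Python sketch. The function names `ngram_counts` and `ngram_feature_vector` are hypothetical, and overlapping windows are assumed, which is what the counts in the example imply. The second function turns the counts into a fixed-length vector over all possible n-grams of the 20-letter alphabet, which is the form a neural network input would need.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(sequence, n=2):
    """Count every overlapping window of length n in the sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_feature_vector(sequence, n=2, alphabet=AMINO_ACIDS):
    """Fixed-length count vector over all len(alphabet)**n possible n-grams."""
    counts = ngram_counts(sequence, n)
    # Counter returns 0 for n-grams that never occur in the sequence.
    return [counts["".join(gram)] for gram in product(alphabet, repeat=n)]


if __name__ == "__main__":
    for gram, count in ngram_counts("AGCCDDAGAGKDDV").items():
        print(gram, count)
```

Unlike the direct encoding, the resulting vector has the same length for every sequence, regardless of how long the sequence is.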
Indirect encoding

• Problems with n-gram features:
◦ Large input dimension (20^n) for n > 2

n-gram   # of inputs
2        400
3        8,000
4        160,000
5        3,200,000
6        64,000,000

◦ Solution: feature selection
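The growth in input dimension follows directly from the alphabet size: with 20 amino acid letters there are 20^n distinct n-grams. A small sketch makes the arithmetic explicit (the function name `ngram_input_dimension` is hypothetical).

```python
ALPHABET_SIZE = 20  # number of amino acid letters


def ngram_input_dimension(n, alphabet_size=ALPHABET_SIZE):
    """Number of distinct n-grams, i.e. the number of network inputs."""
    return alphabet_size ** n


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, ngram_input_dimension(n))
```

Passing a smaller `alphabet_size` shows how strongly a reduced alphabet shrinks the input layer.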
N-gram feature
• The number of n-gram features can also be reduced by using the amino acid exchange groups in Table 6.2.
• E.g.:

N-gram with exchange groups

• Using hydrophobicity:
◦ A = {D, E, N, Q, R, K}, B = {C, S, T, P, G, H, Y}, C = {A, M, I, L, V, F, W}
AGCCDDAGAGKDDV
CBBBAACBCBAAAC
• 2-grams are:
• CB - 3, BB - 2, BA - 2, AA - 3, AC - 2, BC - 1
• # of features is just 3^2 = 9
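The exchange-group reduction can be sketched as a two-step procedure: first map each amino acid to its group letter, then count n-grams over the reduced three-letter alphabet. The group memberships are taken from the hydrophobicity grouping above; the names `HYDROPHOBICITY_GROUPS`, `to_exchange_groups` and `ngram_counts` are hypothetical, and overlapping windows are assumed.

```python
from collections import Counter

# Hydrophobicity exchange groups as listed on the slide.
HYDROPHOBICITY_GROUPS = {"A": "DENQRK", "B": "CSTPGHY", "C": "AMILVFW"}
AA_TO_GROUP = {
    aa: group
    for group, members in HYDROPHOBICITY_GROUPS.items()
    for aa in members
}


def to_exchange_groups(sequence):
    """Replace every amino acid letter by the letter of its exchange group."""
    return "".join(AA_TO_GROUP[aa] for aa in sequence)


def ngram_counts(sequence, n=2):
    """Count every overlapping window of length n in the sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


if __name__ == "__main__":
    reduced = to_exchange_groups("AGCCDDAGAGKDDV")
    print(reduced)
    for gram, count in sorted(ngram_counts(reduced).items()):
        print(gram, count)
```

The price of the smaller input layer is that amino acids within one group become indistinguishable to the network.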

Conclusion

• Neural networks are an effective and efficient option for protein family classification, provided that a proper feature representation and input encoding are used.