Sequence Search
Abhishek Niroula
Department of Experimental Medical Science
Lund University
2015-12-10
Sequence search
• Sequences
– Nucleotide and amino acid sequences
– Known sequences are stored in different databases
• NCBI, Ensembl, and others
– Number of organisms being sequenced is increasing
• Ensembl, Ensembl Plants, Ensembl Fungi, Ensembl Bacteria, etc.
• Genome 10K (genomic zoo) project
– Use of sequences is expanding rapidly in biomedical research
– 1000 Genomes Project, 100K Genomes Project
• Sequence search
– Search for an appropriate sequence
– Search for similar sequences in a database
Sequence search
• Identify a new sequence
• Functional and structural annotation of sequences
• Finding homologous sequences for
– Genomic, phylogenetic, structural studies, etc.
• Haemophilus influenzae
– The genome of Haemophilus influenzae was reported in 1995 (Fleischmann et al.)
– 1,743 assumed coding regions were translated into amino acid sequences and searched for similarity in the Swiss-Prot database
– 1,007 of them matched
• The biochemical function could be deduced for each of them
• Multiple Sclerosis (source: Martin Tompa)
– Autoimmune disease in which the T-cells recognise the nerves’ myelin sheaths as foreign bodies and attack them
– Hypothesis: Nerves’ myelin sheath proteins were similar to bacterial sheath proteins from an earlier infection
– Methodology
• Myelin sheath proteins were sequenced
• A database was searched for similar bacterial and viral sequences
• Lab tests checked whether the T-cells attacked the identified bacterial and viral proteins
Similarity vs homology
• Similarity
– Similarity is the degree of likeness of two sequences
– It is a quantitative measure
• Homology
– Homology is an evolutionary relationship between two sequences
– It cannot be measured; two sequences are either homologous or not
– Distant and close homology refer to the distance between the sequences and their common ancestor
• Correct: Two sequences are 80% similar.
• Incorrect: Two sequences are 80% homologous.
Orthologs and paralogs
Source: http://www.ensembl.org/info/genome/compara/tree_example1.png
Sequence search: Problem
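• Given
– Query sequence
– Database
• Goal
– To find statistically significant similarity that can be used to infer homology
• Example (figure): the database holds sequences 1 to 7; the search with the query returns sequences 1, 2, 4 and 7
• Sensitivity: Are all related sequences identified?
• Specificity: Are all unrelated sequences rejected?
• Outcome of the example search
– TP (true positives): 1, 2, 4
– FP (false positives): 7
– FN (false negatives): 3, 5
– TN (true negatives): 6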
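The TP, FP, FN and TN sets of the example search can be turned into numbers. The sketch below assumes the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the function names are chosen here for illustration.

```python
# Outcome of the example search (sequence numbers from the slide).
tp = {1, 2, 4}  # related to the query and reported
fp = {7}        # unrelated but reported
fn = {3, 5}     # related but missed
tn = {6}        # unrelated and correctly not reported


def sensitivity(tp, fn):
    """Fraction of the related sequences that the search identified."""
    return len(tp) / (len(tp) + len(fn))


def specificity(tn, fp):
    """Fraction of the unrelated sequences that the search rejected."""
    return len(tn) / (len(tn) + len(fp))


print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 3 of 5 related found
print(f"specificity = {specificity(tn, fp):.2f}")  # 1 of 2 unrelated rejected
```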
Heuristic database searching
• Sequence search: problem
– Exact similarity computation between a query sequence and a database using dynamic programming is computationally intensive
– Aligning a query sequence against every entry of a large database in this way is too slow for routine searches
• Solution
– Heuristic methods: Fast scanning of similar sequences
– Sequences similar to a query sequence are searched from the database
using heuristic methods before computing exact alignment scores
– Tools
• BLAST
• FASTA
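To see why the exact computation is expensive, here is a minimal sketch of a Smith-Waterman local alignment score. The scoring values (match, mismatch, linear gap penalty) are assumptions for illustration, not the parameters of any particular tool. The two nested loops fill one cell per pair of residues, so the work grows with the product of the query length and the total database length.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # shared segment "ACG"
```

Heuristic tools avoid filling this matrix for most database sequences and run the exact step only on a few promising candidates.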
BLAST
• Basic Local Alignment Search Tool
• Developed by Altschul et al., 1990
• Determines the local alignment between a query and a database
• BLAST consists of two steps:
– Searching for matches
– Computing the statistical significance of the matches
BLAST
• Given a query sequence, split the query sequence into words with k residues
– k = 3, for amino acid sequence
– k = 11-12, for nucleotide sequence
• Generate all other possible words with k residues
• Score each of the words against the query words using a substitution matrix
– Words with scores higher than the threshold are considered in the next step
• Example (figure): the query sequence MDLSALTRQ is split into the k-mers MDL, DLS, LSA, SAL, ALT, ...; each k-mer gets a list of similar words, e.g. MDL: MDV, MDM, MRL, MQL and DLS: DLR, QLS, DVS, DKS
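The word-generation step can be sketched as follows. This is an illustration only: `toy_score` is a made-up stand-in for a real substitution matrix such as BLOSUM62, and the threshold value is arbitrary, so the resulting word lists differ from those a real BLAST run would produce.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Toy substitution score (assumption): +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


def kmers(seq, k=3):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbourhood(word, threshold, score=toy_score):
    """All words of the same length that score above threshold against word."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            hits.append("".join(candidate))
    return hits


words = kmers("MDLSALTRQ")
print(words)
# With the toy scores, a threshold of 8 keeps the word itself (score 15)
# and every word with exactly one substitution (score 9).
print(len(neighbourhood("MDL", threshold=8)))
```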
BLAST
• Match each of the high scoring words in the database sequences
• The matches are extended in both directions to form an ungapped local alignment, the high-scoring segment pair (HSP)
• HSPs with a score greater than the cutoff threshold are kept
• Significance of the ungapped HSP is calculated
• Figure: high scoring words are matched in the database, followed by ungapped extension into an HSP
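A minimal sketch of ungapped extension around a word match. The stopping rule used here (give up once the running score falls more than `x_drop` below the best score seen) and the toy scores are assumptions for illustration; the coordinates returned are end-exclusive.

```python
def toy_score(a, b):
    """Toy substitution score (assumption): +2 for identity, -3 otherwise."""
    return 2 if a == b else -3


def _extend(query, subject, q, s, step, score, x_drop):
    """Walk from (q, s) in direction step; return (best gain, residues kept)."""
    best = run = 0
    kept = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        run += score(query[q], subject[s])
        n += 1
        if run > best:
            best, kept = run, n
        elif best - run > x_drop:
            break  # score dropped too far below the best seen so far
        q += step
        s += step
    return best, kept


def extend_seed(query, subject, q_start, s_start, k, score=toy_score, x_drop=5):
    """Extend a k-residue word match in both directions without gaps."""
    seed = sum(score(query[q_start + i], subject[s_start + i]) for i in range(k))
    right, r_len = _extend(query, subject, q_start + k, s_start + k, 1, score, x_drop)
    left, l_len = _extend(query, subject, q_start - 1, s_start - 1, -1, score, x_drop)
    return (seed + left + right,
            q_start - l_len, q_start + k + r_len,
            s_start - l_len, s_start + k + r_len)


# The word "LSA" matches at query position 4 and subject position 5.
hsp = extend_seed("TTMDLSALTT", "CCCMDLSALCC", q_start=4, s_start=5, k=3)
print(hsp)  # (score, query from, query to, subject from, subject to)
```

An HSP found this way would then be kept only if its score exceeds the cutoff.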
Gapped BLAST
• Altschul et al., 1997
• Extension of matches requires two non-overlapping matches on the same diagonal within a distance ”A”
• Fewer extensions make the search faster
• Gapped alignments are performed around the hits that have higher scores than a pre-defined score
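The two-hit requirement can be sketched like this. A word match at query position q and database position s lies on diagonal s - q. The function name and the simplification of comparing only neighbouring hits on each diagonal are choices made here for illustration; `window` plays the role of the distance "A".

```python
from collections import defaultdict


def two_hit_seeds(hits, k, window):
    """Pairs of word hits that would trigger an extension.

    hits: iterable of (query_pos, subject_pos) matches of length k.
    A pair qualifies when both hits lie on the same diagonal, do not
    overlap (start positions at least k apart) and are less than
    `window` residues apart.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Simplification: only neighbouring hits on a diagonal are compared.
        for first, second in zip(positions, positions[1:]):
            if k <= second - first < window:
                seeds.append(((first, first + diagonal),
                              (second, second + diagonal)))
    return seeds


hits = [(0, 10), (5, 15), (20, 3), (6, 40)]
print(two_hit_seeds(hits, k=3, window=40))
```

Only the first two hits share a diagonal, so only they would be extended; the isolated hits are skipped, which is where the time saving comes from.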
FASTA
• FAST-All (extension of FASTP and FASTN)
• Developed by Lipman and Pearson (FASTP, 1985; extended to FASTA in 1988)
• FASTA also builds a local alignment between query and database
• FASTA has four steps:
– Hashing
– 1st scoring
– 2nd scoring
– Alignment
FASTA
• Hashing
– Query sequence is split into words of size k
– Exact word matches are identified in the database
– Regions populated with matches are identified and the 10 best regions are selected
• 1st scoring
– Within the selected regions, the optimal local alignment is computed using a substitution matrix
• 2nd scoring
– Alignments are combined to obtain a single larger alignment
– Gaps are allowed in the alignment
• Alignment
– Alignment is optimized using Smith-Waterman dynamic programming
• Statistical significance for each alignment is computed
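The hashing step can be sketched as a lookup table of query words plus a count of exact word matches per diagonal. This is a simplified illustration: it ranks whole diagonals rather than regions within them, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def ktup_index(seq, k):
    """Lookup table: word of length k -> start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonals(query, subject, k=2, top=10):
    """Count exact word matches per diagonal; return the densest diagonals."""
    index = ktup_index(query, k)
    counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            counts[j - i] += 1  # diagonal = subject offset - query offset
    return counts.most_common(top)


# The subject contains the query shifted by two residues.
print(best_diagonals("MDLSALTRQ", "AAMDLSALTRQ"))
```

Because the table is built once for the query, each database word is looked up in constant time, which is what makes this scan fast.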
Variants of BLAST and FASTA
Query        Database     BLAST program   FASTA program    Comment
Protein      Protein      blastp          fasta
Nucleotide   Nucleotide   blastn          fasta
Nucleotide   Protein      blastx          fastx, fasty     Translate query to a protein
Protein      Nucleotide   tblastn         tfastx, tfasty   Translate database
Nucleotide   Nucleotide   tblastx                          Translate both query and database
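The translating programs compare protein sequences derived from nucleotide sequences. As a rough sketch of what "translate" means here, the code below translates a DNA sequence in all six reading frames with the standard genetic code. The helper names are chosen for illustration and the handling of ambiguous bases (reported as X) is an assumption.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; "*" is a stop.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of the three forward and three reverse reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {f"+{f + 1}": translate(dna[f:]) for f in range(3)}
    frames.update({f"-{f + 1}": translate(rc[f:]) for f in range(3)})
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```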
Using BLAST and FASTA
• Web application
– BLAST
• http://blast.ncbi.nlm.nih.gov/Blast.cgi
– FASTA
• http://www.ebi.ac.uk/Tools/sss/fasta/
• Standalone
– Local installation
– Database should also be downloaded
BLASTP
FASTA
Input formats
• FASTA format files
– Widely used in bioinformatics
• Other file formats
– GCG, EMBL, GenBank, PIR, UniProtKB/Swiss-Prot, PHYLIP
• Identifiers
– Supported in BLAST
– Accession
– GI (GenInfo Identifier)
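For reference, a FASTA file holds one or more records, each a header line starting with ">" followed by sequence lines. A minimal reader might look like this; the example records are invented for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example protein
MDLSALTRQ
MDLSAL
>seq2
ACGT
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```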
Database
• Generic databases
– UniProt or RefSeq databases
– UniRef and Non-redundant database: Database of unique sequence
entries
– Genome, Chromosome
• Structure databases
– Database of sequences for which 3D structures are available in PDB
– Used especially for finding template sequences for homology modelling
• Specialized databases
– A local database can be created from the sequences that are relevant for your purpose
Other parameters
• Expect
– Statistical significance parameter
– Default = 10, i.e. hits are reported as long as no more than 10 matches of that quality are expected by chance
• Filter
– Mask regions of low-complexity and short repeats
• Alignment options
– Substitution matrix and gap penalty function
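To give a feeling for the Expect value, the sketch below uses the relation E = m × n × 2^(-S'), where S' is the bit score of an alignment, m the query length and n the database length. Using the raw lengths instead of the length-corrected ("effective") values that BLAST applies is a simplifying assumption, so the numbers are only indicative.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value: E = m * n * 2**(-S') with bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 100 million residues.
for bits in (30, 40, 50, 60):
    print(bits, f"{expect_value(bits, 300, 100_000_000):.2e}")
```

Each additional bit of score halves the E-value, and the same alignment becomes less significant as the database grows.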
Output
• There are three major sections in BLAST output
– Header
• Information about the query sequence and the database searched
• Graphical overview of matches (only in web version)
– Description
• Description of the sequences (hits)
• Scores: Generated from alignment, higher is better
• E-value: Number of hits expected by chance, lower is better
• Sequence identifier in NCBI databases
– Alignment
• Pairwise alignment
• Details of the alignment (Score, E-value, similarity, etc.)
PSI-BLAST
• Position-Specific Iterated BLAST
• More sensitive to distantly related sequences
• Algorithm
– In the first iteration, standard BLAST is run
• A PSSM (position specific scoring matrix) is generated based on the
significant alignments
– In the next iteration, the new PSSM is used instead of the substitution matrix to search the database and score the alignments
• A new PSSM is generated based on the significant alignments
– The above step is repeated until a stop criterion is met. Stop criteria
may be:
• No new sequences are identified in two consecutive iterations
• The desired number of iterations is reached
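A PSSM can be illustrated with a toy calculation. The sketch below builds log-odds scores from the columns of an ungapped alignment of equal-length hits. The uniform background frequencies and the simple pseudocount are assumptions made here for brevity; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict of log2-odds scores per column of an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def pssm_score(pssm, segment):
    """Score a segment of the same length as the PSSM, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


hits = ["MDLS", "MDLT", "MELS"]  # hypothetical significant hits
pssm = build_pssm(hits)
print(round(pssm_score(pssm, "MDLS"), 2), round(pssm_score(pssm, "AAAA"), 2))
```

Unlike a substitution matrix, the score of a residue now depends on its position, which is what lets later iterations pick up more distant relatives.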
Sequence search: Challenges
• Self hits are uninteresting
• Size of target database
– Use a database no bigger than required
• Paralogs have similar sequences but often have different functions
• Low-complexity regions reduce the quality of alignments
• Short repeats give false hits
• Results for very short queries may be less reliable
– Matches that are 50% identical over a length of 20-40 amino acids occur frequently by chance
• Distant homologs may have very low similarity
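One way to picture low-complexity filtering is to measure the Shannon entropy of sliding windows and mask windows that use very few distinct residues. This is only a rough proxy for the dedicated filters used by search tools; the window size and entropy threshold below are arbitrary choices for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every window whose entropy is below min_entropy by mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = mask_char * window
    return "".join(masked)


print(mask_low_complexity("ACDEFGHIKLMN"))  # diverse window, left as is
print(mask_low_complexity("QQQQQQQQQQQQ"))  # single residue type, masked
```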
Sequence search
• Exercise
– BLAST
– FASTA