Paracel GeneMatcher2 - ILRI Research Computing

Paracel GeneMatcher2
Overview
GeneMatcher2
• The GeneMatcher system comprises hardware
and software components that significantly accelerate
a number of computationally intensive sequence
similarity search algorithms.
• There are two hardware components:
– GeneMatcher accelerator
– Post-Processor (Blastmachine)
• There are two client interfaces:
– Unix command line
– Web-based GUI (BioView Workbench)
GeneMatcher2 Architecture
[Diagram labels: CPU 1, CPU 2, ..., CPU 6912; Query #1 (agaggt..), ..., Query #n; Web interface; Switch; GeneMatcher2; Blast machine]
GeneMatcher2 System
• Massively Parallel Bioinformatics supercomputer
• Array of ASIC (Application Specific Integrated Circuit)
chips combined with state-of-the-art Linux cluster
technology
• Accelerates dynamic programming search algorithms
• 3,000 to 220,000 processors
• Thousands of times faster than general purpose
computers
GeneMatcher2 Components
• 3 processor units (6,142 processors per unit)
• UltraSPARC computer
• Up to 4 disk drives for database storage
GeneMatcher2 Algorithms
• HMM and HMM-Frame
– Searches protein or DNA sequence data with domain models
– HMM-Frame aligns protein models to DNA with frame shift
and optional intron tolerance
• Profile and Profile-Frame
– Position-specific scoring with profile models
– Frame shift tolerant protein profile searches against DNA
sequence data
• GeneWise
– Aligns protein sequences or HMM against genomic data
– Tolerates introns and frame shifts
GeneMatcher2 Algorithms (cont.)
• Smith-Waterman
– Comparison of DNA-DNA, Protein-Protein, Protein-DNA or
DNA-DNA through protein
– Frame algorithms tolerate frame shifts, unlike BLAST
counterparts
– Optional intron tolerance for searches of genomic data
– Highly sensitive search capacity finds hits BLAST
potentially misses
• NCBI BLAST
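As a rough illustration of the dynamic programming behind Smith-Waterman, the sketch below computes the best local alignment score of two sequences. The scoring values (match, mismatch, gap) and the simple linear gap penalty are assumptions chosen for the example, not GeneMatcher2 settings.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman("ACGT", "ACGT"))
print(smith_waterman("AAAA", "TTTT"))
```

Every cell of the table is filled in, which is why the method is sensitive but computationally expensive on general purpose computers.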
What about BLAST?
• BLAST is an approximation of Smith-Waterman
• So is FASTA, though it is better and supports protein fragment
searches
• The approximation may not yield correct results in some situations:
– Data with many ambiguities or frameshifts, such as raw ESTs and
unfinished genomic sequence
– Distantly related sequences
– When global alignments are desired
– Protein alignment of sequences with introns (introns are not
penalized on GeneMatcher)
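One way to see why an approximation can miss distantly related sequences: BLAST-style searches start from short exact word matches (seeds) before extending them. The sketch below is a simplified illustration of that seeding step only, not NCBI BLAST's implementation; the function name, the word size k and the sequences are made up for the example.

```python
def shared_words(query, subject, k):
    """Return the k-letter words that occur in both sequences."""
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    subject_words = {subject[j:j + k] for j in range(len(subject) - k + 1)}
    return query_words & subject_words


query = "AAAACCCCGGGG"
subject = "AAATCCCTGGGT"  # hypothetical: differs from the query at every fourth base

identical = sum(q == s for q, s in zip(query, subject))
print(identical, "of", len(query), "positions identical")
print(shared_words(query, subject, k=4))  # empty set: no seed to extend
```

The two sequences agree at most positions, yet share no 4-letter word, so a purely seed-based search would have nothing to extend. A full dynamic programming search does not depend on seeds.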
Why GeneMatcher2
• Comparison of sensitivity and selectivity of various
sequence search methods
• Sensitivity: What proportion of the real hits are reported?
(More sensitive means more real hits)
• Selectivity: What proportion of the reported hits are real?
(More selective means fewer false positives)
[Figure: comparison plot; axis labels: fewer false positives, more true positives]
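The two definitions above can be written as simple ratios. The sketch below assumes the hits have already been classified as true positives, false positives and false negatives; the counts are hypothetical and not taken from the comparison.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of the real hits that are reported."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_pos, false_pos):
    """Proportion of the reported hits that are real."""
    return true_pos / (true_pos + false_pos)


# Hypothetical search: 80 real hits reported, 20 real hits missed,
# 10 reported hits that are not real
print(round(sensitivity(80, 20), 3))
print(round(selectivity(80, 10), 3))
```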
GeneMatcher2 Performance
• Time-to-completion comparison of the original methods
and the same methods on GeneMatcher2
• TBLASTX improvement is 20-fold
• Other methods show at least 100-fold improvement
[Figure: bar chart "Runtime for an average query"; y-axis: seconds (0 to 1000); x-axis: method]
Source: Genome Canada Bioinformatics Platform Project
Running a search
• Load a sequence (or set of sequences) as a query
set if it will be used several times
• Select the appropriate search depending on the
query type and database type (only suitable
candidates will be displayed on the search forms)
• Check your form options!
• Watch the search queue (can raise priority of small
jobs if machine is busy)
• Select a result format
Databases
• While you can load your own databases, disk space
on the post-processor is not infinite! Ask us about
maintaining public databases that are not currently
available.
• If you upload a private database, special files need
to be created to use translated database searches
such as rframe.
• You can create private data sets to search against
(e.g. Unigene-mouse and Unigene-rat in a data set
called Unigene-rodent). These don’t take up any
space.
Hidden Markov Models
• Positive examples:
– Seq 1: THE LAST FAT CAT
– Seq 2: THE FAST CAT
– Seq 3: THE VERY FAST CAT
– Seq 4: THE FAT CAT
• The positive examples are aligned (multiple sequence alignment with
Clustalw or T-coffee) to build the Hidden Markov Model; GeneMatcher2
searches the query against the model
• Word level: only nothing (a gap), “LAST” or “VERY” in the second position
– Query: THE VAST VERY FAST CAT
– THE LAST FAST CAT
  +++ ++++ ++++ +++ all matches
• Position specific: aligned positive examples (“-” marks a gap)
  THE LAST FA-T CAT
  THE ---- FAST CAT
  THE VERY FAST CAT
  THE ---- FA-T CAT
  THE LAST FAST CAT
– Query: THE VAST FAST CAT (“V” from VERY, “AST” from LAST)
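The position-specific idea can be sketched with the toy alignment from this slide. This is a deliberately simplified stand-in for a Hidden Markov Model: it only records which characters were seen in each alignment column and uses no probabilities. The gap placement ("-") is one reading of the slide, and the function names are made up for the example.

```python
# Aligned positive examples; "-" marks a gap (assumed placement)
ALIGNMENT = [
    "THE LAST FA-T CAT",
    "THE ---- FAST CAT",
    "THE VERY FAST CAT",
    "THE ---- FA-T CAT",
]


def build_profile(alignment):
    """Return, for each column, the set of characters seen in that column."""
    return [set(column) for column in zip(*alignment)]


def matches(profile, query):
    """True if every query character was observed in the same column."""
    return len(query) == len(profile) and all(
        ch in allowed for ch, allowed in zip(query, profile)
    )


profile = build_profile(ALIGNMENT)
print(matches(profile, "THE VAST FAST CAT"))  # "V" from VERY, "AST" from LAST
print(matches(profile, "THE BEST FAST CAT"))  # "B" was never seen in that column
print("VAST" in {"LAST", "VERY"})             # the word-level model rejects VAST
```

A real profile or HMM would assign a score to each character at each position instead of a yes/no answer, but the column-by-column view is the same.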
GeneWise
• Predicts introns and exons based on conserved
protein domains (e.g. the Pfam database)
• Uses HMMs; the reverse query/data set relationship
also holds
• Unlike genscan or fgenes predictions, you can believe these hits,
though they may not be complete where exons don’t
contain conserved domains.