Paracel GeneMatcher2 Overview

GeneMatcher2
• The GeneMatcher system comprises hardware and software components that significantly accelerate a number of computationally intensive sequence similarity search algorithms.
• There are two hardware components:
– GeneMatcher accelerator
– Post-Processor (Blastmachine)
• Two client interfaces:
– Unix command line
– Web-based GUI (BioView Workbench)

GeneMatcher2 Architecture
[Figure: architecture diagram. Labels: Query #1 (agaggt..) through Query #n; Web interface; Switch; GeneMatcher2 (CPU 1, CPU 2, ..., CPU 6912); Blast machine.]

GeneMatcher2 System
• Massively parallel bioinformatics supercomputer
• Array of ASIC (Application-Specific Integrated Circuit) chips combined with state-of-the-art Linux cluster technology
• Accelerates dynamic programming search algorithms
• 3,000 to 220,000 processors
• Thousands of times faster than general-purpose computers

GeneMatcher2 Components
• 3 processor units (6,142 processors per unit)
• UltraSPARC computer
• Up to 4 disk drives for database storage

GeneMatcher2 Algorithms
• HMM and HMM-Frame
– Searches protein or DNA sequence data with domain models
– HMM-Frame aligns protein models to DNA with frame shift and optional intron tolerance
• Profile and Profile-Frame
– Position-specific scoring with profile models
– Frame-shift-tolerant protein profile searches against DNA sequence data
• GeneWise
– Aligns protein sequences or HMMs against genomic data
– Tolerates introns and frame shifts

GeneMatcher2 Algorithms, cont.
• Smith-Waterman
– Comparison of DNA-DNA, Protein-Protein, Protein-DNA or DNA-DNA through protein
– Frame algorithms tolerate frame shifts, unlike their BLAST counterparts
– Optional intron tolerance for searches of genomic data
– Highly sensitive search capacity finds hits that BLAST potentially misses
• NCBI Blast

What about Blast?
• Blast is an approximation of Smith-Waterman
• So is FastA, but it is better and has protein fragment searches
• Approximations may not yield correct results in some situations:
– Data with many ambiguities or frame shifts, such as raw ESTs and unfinished genomic sequence
– Distantly related sequences
– When global alignments are desired
– Protein alignment of sequences with introns (not penalized on GeneMatcher)

Why GeneMatcher2
• Comparison of sensitivity and selectivity of various sequence search methods
• Sensitivity: What proportion of the real hits are reported? (More sensitive means more real hits)
• Selectivity: What proportion of the reported hits are real? (More selective means fewer false positives)
[Figure: sensitivity/selectivity comparison plot. Axis labels: fewer false positives; more true positives.]

GeneMatcher2 Performance
• Time-to-completion comparison of original methods and methods on GeneMatcher2
• TBLASTX improvement is 20-fold
• Other methods show at least 100-fold improvement
[Figure: bar chart "Runtime for an average query"; y-axis: Seconds; x-axis: Method. Source: Genome Canada Bioinformatics Platform Project]

Running a search
• Load a sequence (or set of sequences) as a query set if it will be used several times
• Select the appropriate search depending on the query type and database type (only suitable candidates will be displayed on the search forms)
• Check your form options!
• Watch the search queue (the priority of small jobs can be raised if the machine is busy)
• Select a result format

Databases
• While you can load your own databases, disk space on the post-processor is not infinite! Ask us about maintaining public databases that are not currently available.
• If you upload a private database, special files need to be created to use translated database searches such as rframe.
• You can create private data sets to search against (e.g. Unigene-mouse and Unigene-rat in a data set called Unigene-rodent). These don’t take up any space.
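
The sensitivity and selectivity definitions from the "Why GeneMatcher2" slide can be made concrete with a small calculation. The following is a minimal sketch, not part of the GeneMatcher2 software: it assumes hits are identified by string IDs and that a trusted set of real hits is available. The function names and the sample IDs are hypothetical.

```python
def sensitivity(reported, real):
    """Proportion of the real hits that were reported."""
    reported, real = set(reported), set(real)
    if not real:
        raise ValueError("need at least one real hit")
    return len(reported & real) / len(real)


def selectivity(reported, real):
    """Proportion of the reported hits that are real."""
    reported, real = set(reported), set(real)
    if not reported:
        raise ValueError("need at least one reported hit")
    return len(reported & real) / len(reported)


# Hypothetical example: four real hits exist, the search reports three IDs.
real_hits = {"seqA", "seqB", "seqC", "seqD"}
reported_hits = ["seqA", "seqB", "seqX"]

print(sensitivity(reported_hits, real_hits))  # 2 of the 4 real hits were reported
print(selectivity(reported_hits, real_hits))  # 2 of the 3 reported hits are real
```

A more sensitive method raises the first number by finding seqC and seqD; a more selective method raises the second by not reporting the false positive seqX.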

Hidden Markov Models
Positive examples:
  Seq 1: THE LAST FAT CAT
  Seq 2: THE FAST CAT
  Seq 3: THE VERY FAST CAT
  Seq 4: THE FAT CAT
Multiple sequence alignment (Clustalw or T-coffee), with "-" marking a gap:
  THE LAST FA-T CAT
  THE ---- FAST CAT
  THE VERY FAST CAT
  THE ---- FA-T CAT
Hidden Markov Model searched on GeneMatcher2:
  Model:   THE LAST FAST CAT
  Query:   THE VAST FAST CAT
           +++ ++++ ++++ +++   (all matches: “V” from VERY, “AST” from LAST)
Position specific: only nothing (a gap), “LAST” or “VERY” occurs in that position.

GeneWise
• Predicts introns and exons based on conserved protein domains (e.g. the Pfam database)
• Uses HMMs; the reverse query/data set relationship holds
• Unlike hits from genscan or fgenes, you can believe these hits, though they may not be complete where exons don’t contain conserved domains.
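
The position-specific matching illustrated on the Hidden Markov Models slide can be sketched in a few lines. This is a deliberately simplified illustration, not a real HMM and not how GeneMatcher2 works internally: it has no probabilities or state transitions, it only records which characters were observed in each column of the aligned positive examples, and it assumes the query is already padded to the alignment length. The alignment rows (with "-" for a gap) and all function names are assumptions made for this example.

```python
# Aligned positive examples from the slide; "-" marks a gap.
ALIGNMENT = [
    "THE LAST FA-T CAT",
    "THE ---- FAST CAT",
    "THE VERY FAST CAT",
    "THE ---- FA-T CAT",
]


def build_columns(alignment):
    """Collect the set of characters observed in each alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    return [set(column) for column in zip(*alignment)]


def match_string(query, columns):
    """Mark '+' where the query character was seen in that column, '.' otherwise.

    Columns that only ever contain a space are echoed as a space so the
    output lines up with the words of the query.
    """
    if len(query) != len(columns):
        raise ValueError("query must be padded to the alignment length")
    marks = []
    for char, seen in zip(query, columns):
        if char == " " and seen == {" "}:
            marks.append(" ")
        else:
            marks.append("+" if char in seen else ".")
    return "".join(marks)


columns = build_columns(ALIGNMENT)
for query in ("THE LAST FAST CAT", "THE VAST FAST CAT", "THE FAST FAST CAT"):
    print(query)
    print(match_string(query, columns))
```

"THE VAST FAST CAT" matches in every column because "V" was seen in VERY and "AST" in LAST, whereas the "F" of "THE FAST FAST CAT" in the second word was never observed at that position and is marked as a mismatch.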