Basic Overview of Bioinformatics Tools and Biocomputing Applications II
Dr Tan Tin Wee
Director, Bioinformatics Centre

Common Computational Analyses
• Sequence assembly
• Simple sequence analysis
  – Translation and reverse complement, ORF
  – Composition statistics (protein & DNA)
  – Molecular mass
  – Total charge and pI; local hydropathy
  – Simple determination of secondary structures
  – Restriction site analysis
  – Internal repeat analysis
• Detection of active sites, functional residues, characteristic structures, substrates, and processing signals

Common Computational Analyses
• Database sequence search
• Multiple alignment
• Secondary and tertiary structure prediction; transmembrane helix detection
• Structure modeling
• Docking prediction and design
• Hidden Markov model searches

Database Searching
• Text-based Database Searching: using a text string to match an annotation in a sequence database record, i.e. keyword search
• Sequence-based Database Searching: using a biological sequence to match all or part of it against the sequence of every record in a sequence database

Text-Based Database Searching
• Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems
• Search Concepts
  – Boolean search - AND, OR, NOT
  – Broadening the search
  – Narrowing the search
  – Proximity searching, soundex
  – Wild card, stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
• Use standard string search algorithms and Boolean operations, vocabulary matches

Text-based Database Searching
• Example: to find the human homolog of the Drosophila per gene
• Procedure
  – Web to Entrez
  – All Fields: enter "human" "per"
  – Hits returned, irrelevant - broaden search
  – "human" "period" - more hits
  – Check every one, find the human RIGUI gene
• Hit and miss, clever guesswork; free form or controlled vocabulary (MeSH terms)? Use Boolean searches?
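
The search concepts above (Boolean AND/OR/NOT and trailing wild cards such as Thala*) can be sketched in a few lines of Python. This is only an illustration, not how Entrez, SRS, DBGET or AceDB are implemented: the function names (`term_to_regex`, `search`), the record identifiers and the annotation strings in `TOY_RECORDS` are all made up for the example.

```python
import re

def term_to_regex(term):
    """Compile a query term as a case-insensitive whole-word pattern.

    A trailing '*' is treated as a wild card that matches any word ending.
    """
    if term.endswith("*"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)

def search(records, all_of=(), any_of=(), none_of=()):
    """Return sorted ids of records matching: AND(all_of), OR(any_of), NOT(none_of)."""
    def hit(term, text):
        return term_to_regex(term).search(text) is not None

    results = []
    for rec_id, text in records.items():
        if not all(hit(t, text) for t in all_of):
            continue  # AND: every term must be present
        if any_of and not any(hit(t, text) for t in any_of):
            continue  # OR: at least one term must be present
        if any(hit(t, text) for t in none_of):
            continue  # NOT: no excluded term may be present
        results.append(rec_id)
    return sorted(results)

# Hypothetical annotation lines, invented purely for this illustration.
TOY_RECORDS = {
    "rec1": "human RIGUI, homolog of Drosophila period",
    "rec2": "Drosophila per gene, circadian rhythm",
    "rec3": "human beta-globin, thalassemia mutation",
    "rec4": "human haemoglobin, thalasemia (alternate spelling)",
    "rec5": "mouse model, thalassemic phenotype",
}

print(search(TOY_RECORDS, all_of=["human", "per"]))     # []
print(search(TOY_RECORDS, all_of=["human", "period"]))  # ['rec1']
print(search(TOY_RECORDS, any_of=["thala*"]))           # ['rec3', 'rec4', 'rec5']
print(search(TOY_RECORDS, any_of=["thala*"], none_of=["mouse"]))  # ['rec3', 'rec4']
```

On this toy data the narrow query "human" AND "per" finds nothing useful, while broadening to "period" reaches the RIGUI record, which mirrors the hit-and-miss procedure described in the example slide.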
Sequence-based Database Searching
• Homology search
• Global or local sequence alignment
• Needleman-Wunsch algorithm
• Smith-Waterman algorithm
• Lipman-Pearson FASTA
• Altschul's BLAST
• Take a sequence, pairwise comparison with each sequence in the database

Sequence-based Database Searching
• Basic assumptions:
• Sequences of homologous genes/proteins diverge over time even though structure and/or function change little
• Significant sequence similarity is inferred as potential structural/functional similarity or common evolutionary origin
• Based on a well-characterised protein, infer the function of an unknown sequence at the gene or protein sequence level

Sequence-based Database Searching
• Global alignment forces complete alignment of the two input sequences in the pairwise comparison
• Local alignment looks for local stretches of similarity and tries to align the most similar segments
• Algorithms used may be similar, but the output differs; statistics are needed to assess results

Sequence-based Database Searching
• Alignment scoring
• Substitution score and substitution matrix: PAM, BLOSUM
• Affine gap costs/gap penalty and gap scores
• Optimal alignments, dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
• Additional heuristics to speed up the search - FASTA, BLAST

Some definitions
• Affine gap costs - scoring system for gaps within alignments which charges a penalty for gap formation and an additional per-residue penalty proportional to the size of the gap
• Alignment score - numerical value indicating the overall quality of an alignment; the higher the score, the better the alignment
• Algorithm - fixed procedure embodied in a computer program
• Heuristics - a computer science term referring to guesses made by the program to approximate results, usually based on arbitrary or predefined rules
• Gapped alignment - alignment of sequences where gaps are permitted

Computational Genefinding
• Major challenge in genome projects
• Given a DNA sequence, where does a gene begin and stop? - ORF
• Where are the exons and introns?
• Where are the transcription elements?
• Gene structure and other regulatory elements?

Genomic Elements
• Intron-exon splice sites
• Start-stop codons
• Branch points
• Promoters and terminators of transcription
• Polyadenylation sites
• Ribosomal binding sites
• Topoisomerase II binding sites
• Topoisomerase I cleavage sites
• Transcription factor binding sites

Detecting Genomic Elements
• Local sites and motifs/patterns for such elements - signals and signal sensors
• Extended variable-length regions, e.g. exons and introns - contents and content sensors
• Linguistic technique - gene structure described in formal grammar - GeneLang genefinding program

Signal sensors
• Simple consensus sequence - use of pattern matching algorithms
• Weight matrices - allow the weighted scores of individual weight matrix sensors to be summed
• Use of Artificial Neural Networks (ANN)

Content Sensors
• Long ORF for bacteria
• Statistical models, e.g. Markov models - GeneMark: statistical models of nucleotide frequencies and dependencies in codon structure
• Neural nets, e.g. Grail - exon detection by neural network combined with signal sensors for exon-intron splice sites

Some Definitions
• Artificial Neural Nets - statistical pattern recognition method - a type of nonlinear regression
• Markov Models - statistical models for sequences in which the probability of each residue depends on the residues preceding it
• Dynamic Programming - type of algorithm widely used for constructing sequence alignments and for evaluating all possible candidate gene structures

Other Genefinding methods
• Use of dynamic programming
  – Linguistic rules for functional features
  – Parameters of a Markov process on hidden variables - hidden Markov models (HMM)
• HMM genefinders - EcoParse, Xpound, GeneMark HMM, Veil, HMMgene, GenScan
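
To make the link between dynamic programming and hidden Markov models concrete, here is a minimal sketch of Viterbi decoding on a toy two-state HMM. It is not the model used by any of the genefinders listed above: the two states ("n" for non-coding, "c" for coding), the probabilities in `START`, `TRANS` and `EMIT`, and the helper names `viterbi` and `decode` are all assumptions chosen for illustration. The toy "coding" state simply prefers G and C, whereas real genefinders model codon structure and splice signals.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for the observations (log-space Viterbi)."""
    if not observations:
        return []
    # best[i][s]: log-probability of the best path ending in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for i in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score + math.log(emit_p[s][observations[i]])
            back[i][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Toy parameters, invented for this example only.
STATES = ["n", "c"]
START = {"n": 0.5, "c": 0.5}
TRANS = {"n": {"n": 0.9, "c": 0.1}, "c": {"n": 0.1, "c": 0.9}}
EMIT = {
    "n": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},  # AT-rich "non-coding"
    "c": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},  # GC-rich "coding"
}

def decode(dna):
    """Label each base of dna with its most probable toy state."""
    return "".join(viterbi(dna, STATES, START, TRANS, EMIT))

print(decode("AAAAGGGG"))  # nnnncccc
print(decode("AAGAA"))     # nnnnn
```

With these parameters a sustained run of G is worth a state switch, but a single G inside a run of A is not, because the two transitions needed to enter and leave the "c" state cost more than the one better emission gains.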