Basic Overview of Bioinformatics Tools and Biocomputing Applications II
Dr Tan Tin Wee
Director, Bioinformatics Centre

Common Computational Analyses
• Sequence assembly
• Simple sequence analysis
  – Translation and reverse complement, ORF
  – Composition statistics (protein & DNA)
  – Molecular mass
  – Total charge and pI; local hydropathy
  – Simple determination of secondary structures
  – Restriction site analysis
  – Internal repeat analysis
• Detection of active sites, functional residues, characteristic structures, substrates, and processing signals

Common Computational Analyses
• Database sequence search
• Multiple alignment
• Secondary and tertiary structure prediction; transmembrane helix detection
• Structure modeling
• Docking prediction and design
• Hidden Markov model searches

Database Searching
• Text-based Database Searching: using a text string to match an annotation in a sequence database record, i.e. keyword search
• Sequence-based Database Searching: using a biological sequence to match the whole or parts of its sequence against the sequences of every record in a sequence database

Text-Based Database Searching
• Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems
• Search Concepts
  – Boolean Search - AND, OR, NOT
  – Broadening Search
  – Narrowing the Search
  – Proximity searching, soundex
  – Wild Card, Stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
• Use standard string search algorithms and boolean operations, vocabulary matches

Text-based Database Searching
• Example: To find the human homolog of the Drosophila per gene
• Procedure
  – Web to Entrez
  – All Fields: enter "human" "per"
  – Hits returned, irrelevant - broaden search
  – "human" "period" - more hits
  – Check every one, find the human RIGUI gene
• Hit and miss, clever guesswork; free form or controlled vocabulary (MeSH terms)? Use Boolean searches?
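The Boolean and wild-card search concepts above can be sketched in a few lines of Python. This is only a toy illustration, not how Entrez or SRS work internally: the records, their identifiers and the `search` helper are all hypothetical, and matching is done word by word with the standard `fnmatch` module.

```python
import fnmatch
import re

# Hypothetical annotation records, invented purely for illustration.
RECORDS = [
    {"id": "REC1", "text": "Human period homolog RIGUI mRNA"},
    {"id": "REC2", "text": "Drosophila melanogaster period (per) gene"},
    {"id": "REC3", "text": "Human beta-globin gene, thalassemia mutation"},
    {"id": "REC4", "text": "Thalassemic patient hemoglobin variant, human"},
]


def tokens(text):
    """Split an annotation into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_hits(term, text):
    """True if any word of the text matches the term; '*' acts as a wild card."""
    pattern = term.lower()
    return any(fnmatch.fnmatchcase(word, pattern) for word in tokens(text))


def search(records, all_of=(), any_of=(), none_of=()):
    """Return ids of records satisfying AND (all_of), OR (any_of), NOT (none_of)."""
    hits = []
    for rec in records:
        text = rec["text"]
        if not all(term_hits(t, text) for t in all_of):
            continue  # an AND term is missing
        if any_of and not any(term_hits(t, text) for t in any_of):
            continue  # none of the OR terms is present
        if any(term_hits(t, text) for t in none_of):
            continue  # a NOT term is present
        hits.append(rec["id"])
    return hits


if __name__ == "__main__":
    print(search(RECORDS, all_of=["human", "per"]))     # narrow query
    print(search(RECORDS, all_of=["human", "period"]))  # reworded query
    print(search(RECORDS, all_of=["thala*"]))           # wild card
```

On this made-up data, searching for "human" AND "per" finds nothing, while "human" AND "period" finds the record that mentions RIGUI. This mirrors the hit-and-miss rewording described in the procedure above.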
Sequence-based Database Searching
• Homology Search
• Global or Local Sequence Alignment
• Needleman-Wunsch Algorithm
• Smith-Waterman Algorithm
• Lipman-Pearson FASTA
• Altschul's BLAST
• Take a sequence, pairwise comparison with each sequence in the database

Sequence-based Database Searching
• Basic Assumptions:
• Sequences of homologous genes/proteins diverge over time even though structure and/or function change little
• Significant sequence similarity is inferred as potential structural/functional similarity or common evolutionary origin
• Based on a well-characterised protein, infer the function of an unknown sequence at the gene or protein sequence level

Sequence-based Database Searching
• Global Alignment forces complete alignment of the two input sequences in the pairwise comparison
• Local Alignment looks for local stretches of similarity and tries to align the most similar segments
• Algorithms used may be similar but the output differs; statistics are needed to assess results

Sequence-based Database Searching
• Alignment Scoring
• Substitution score and substitution matrix: PAM, BLOSUM
• Affine gap costs/gap penalty and gap scores
• Optimal alignments, dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
• Additional heuristics to speed up the search - FASTA, BLAST

Some definitions
• Affine gap costs - scoring system for gaps within alignments which charges a penalty for gap formation and an additional per-residue penalty proportional to the size of the gap
• Alignment score - numerical value indicating the overall quality of an alignment; the higher the score, the better the alignment
• Algorithm - fixed procedure embodied in a computer program
• Heuristics - a computer science term referring to guesses made by the program to approximate results, usually based on arbitrary or predefined rules
• Gapped Alignment - alignment of sequences where gaps are permitted

Computational Genefinding
• Major challenge in genome projects
• Given a DNA sequence, where does a gene begin and stop? - ORF
• Where are the exons and introns?
• Where are the transcription elements?
• Gene structure and other regulatory elements?

Genomic Elements
• Intron-exon splice sites
• Start-Stop codons
• Branch Points
• Promoters and terminators of transcription
• Polyadenylation sites
• Ribosomal binding sites
• Topoisomerase II binding sites
• Topoisomerase I cleavage sites
• Transcription factor binding sites

Detecting Genomic Elements
• Local sites and motifs/patterns for such elements - signals and signal sensors
• Extended variable-length regions, e.g. exons and introns - contents and content sensors
• Linguistic technique - gene structure described in formal grammar - GeneLang genefinding program

Signal sensors
• Simple consensus sequence - use of pattern matching algorithms
• Weight matrices - allow the weighted scores from each weight matrix sensor to be summed
• Use of Artificial Neural Networks (ANN)

Content Sensors
• Long ORF for bacteria
• Statistical models, e.g. Markov models: GeneMark - statistical models of nucleotide frequencies and dependencies in codon structure
• Neural nets, e.g. Grail - exon detection by neural network combined with signal sensors for exon-intron splice sites

Some Definitions
• Artificial Neural Nets - statistical pattern recognition method - a type of nonlinear regression
• Markov Models - statistical models for sequences in which the probability of each residue depends on the residues preceding it
• Dynamic Programming - type of algorithm widely used for constructing sequence alignments and for evaluating all possible candidate gene structures

Other Genefinding methods
• Use of dynamic programming
• Linguistic rules for functional features
• Parameters of a Markov process on hidden variables - hidden Markov models (HMM)
• HMM genefinders - EcoParse, Xpound, GeneMark HMM, Veil, HMMgene, GenScan
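To show how dynamic programming and hidden Markov models fit together, here is a minimal sketch of Viterbi decoding on a two-state HMM. It is a toy under stated assumptions and does not reproduce any of the genefinders listed above: the states "coding" and "noncoding" and every probability in `START`, `TRANS` and `EMIT` are invented for illustration. Real HMM genefinders model codon position, exons, introns and splice signals with trained parameters.

```python
import math

STATES = ("coding", "noncoding")

# All probabilities below are made-up illustrative values, not trained ones.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq):
    """Return the most probable hidden-state path for an upper-case DNA string.

    Works in log space; raises KeyError on characters other than A, C, G, T.
    """
    if not seq:
        return []
    # score[i][s]: best log-probability of any path ending in state s at base i
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[i][s]: best predecessor of state s at base i + 1
    for base in seq[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            col[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state to recover the path.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    dna = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
    for base, state in zip(dna, viterbi(dna)):
        print(base, state)
```

Each column of `score` is computed only from the previous column. This is the dynamic programming step: it avoids enumerating every possible state path. With these made-up parameters a long GC-rich stretch is labeled "coding", while a very short one is absorbed into the surrounding "noncoding" region because two state switches cost more than the emissions gain.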