Basic Overview of Bioinformatics Tools and Biocomputing Applications II
Dr Tan Tin Wee
Director, Bioinformatics Centre
Common Computational Analyses
• Sequence Assembly
• Simple sequence analysis
– Translation and reverse Complement, ORF
– Composition statistics (protein & DNA)
– Molecular mass
– Total charge and pI; local hydropathy
– Simple determination of secondary structures
– Restriction site analysis
– Internal repeat analysis
• Detection of active sites, functional residues, characteristic
structures, substrates, and processing signals
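A few of the simple sequence analyses listed above (reverse complement, translation, composition statistics) can be sketched in Python as follows. This is a minimal illustration only: the function names are made up for this example, the standard genetic code is assumed, and the input is assumed to be unambiguous DNA (A, C, G, T).

```python
from collections import Counter

# Standard genetic code, written in the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def composition(seq):
    """Count each residue in a DNA or protein sequence."""
    return Counter(seq.upper())


def gc_fraction(dna):
    """Fraction of G and C residues in a DNA string."""
    counts = composition(dna)
    return (counts["G"] + counts["C"]) / len(dna) if dna else 0.0
```

For example, `translate("ATGGCCTAA")` reads the codons ATG, GCC, TAA and yields a Met, an Ala and a stop (`*`).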
Common Computational Analyses
• Database sequence search
• Multiple alignment
• 2 and 3 Structure prediction;
transmembrane helix detection
• Structure modeling
• Docking prediction and design
• Hidden Markov model searches
Database Searching
• Text-based Database Searching - using a text string to match an annotation in
a sequence database record, i.e. keyword search
• Sequence-based Database Searching - using a biological sequence to match the
whole or parts of its sequence against the
sequences of every record in a sequence database
Text-Based Database Searching
• Examples: Entrez, SRS, DBGET, AceDB
- common integrated database systems
• Search Concepts
– Boolean Search - AND, OR, NOT
– Broadening Search
– Narrowing the Search
– Proximity searching, soundex
– Wild Card, Stemming e.g. Thala* for thalasemia, thalassemia, thalassemic
• Use standard string search algorithms and
boolean operations, vocabulary matches
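The Boolean and wildcard search concepts above can be sketched with plain string matching. This is not how Entrez, SRS or DBGET are implemented; the record identifiers and annotation text below are hypothetical, and `search` is an illustrative helper name.

```python
import fnmatch

# Hypothetical annotation records, keyed by made-up identifiers.
RECORDS = {
    "rec1": "human period homolog RIGUI",
    "rec2": "Drosophila per gene period",
    "rec3": "human beta thalassemia mutation",
    "rec4": "thalassemic mouse model",
}


def tokens(text):
    """Split an annotation into lower-case words."""
    return set(text.lower().split())


def matches(term, words):
    """True if any word matches the term; '*' acts as a wildcard."""
    return any(fnmatch.fnmatchcase(word, term.lower()) for word in words)


def search(all_of=(), any_of=(), none_of=()):
    """Boolean search: AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for rec_id, text in RECORDS.items():
        words = tokens(text)
        if not all(matches(term, words) for term in all_of):
            continue
        if any_of and not any(matches(term, words) for term in any_of):
            continue
        if any(matches(term, words) for term in none_of):
            continue
        hits.append(rec_id)
    return hits
```

Adding terms to `all_of` or `none_of` narrows the search, while adding alternatives to `any_of` or using a wildcard such as `thala*` broadens it.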
Text-based Database Searching
• Example: To find the human homolog of the
Drosophila per gene
• Procedure
– Web to Entrez
– All Fields : enter "human" "per"
– Hits returned, irrelevant - broaden search
– "human" "period" - more hits
– check every one, find the human RIGUI gene
• Hit and miss, clever guess work,
free form or controlled vocabulary (MeSH terms)?
Use Boolean searches?
Sequence-based Database Searching
• Homology Search
• Global or Local Sequence Alignment
• Needleman-Wunsch Algorithm
• Smith-Waterman Algorithm
• Lipman - Pearson FASTA
• Altschul's BLAST
• Take a sequence, pairwise comparison with
each sequence in the database
Sequence-based Database Searching
• Basic Assumptions:
– Sequences of homologous genes/proteins diverge
over time even though structure and/or function
change little
– Significant sequence similarity is inferred as
potential structural/functional similarity or
common evolutionary origin
– Based on a well-characterised protein, infer the
function of an unknown sequence at the gene or
protein sequence level.
Sequence-based Database Searching
• Global Alignment
forces the two input sequences to be aligned
over their complete lengths
• Local Alignment
looks for local stretches of similarity and
tries to align the most similar segments
• Algorithms used may be similar, but the output
differs; statistics are needed to assess results
Sequence-based Database Searching
• Alignment Scoring
• Substitution score and substitution matrix
PAM, BLOSUM
• affine gap costs/gap penalty and gap scores
• Optimal alignments, dynamic programming
Needleman-Wunsch algorithm,
Smith-Waterman algorithm (SSEARCH)
• Additional heuristics to speed up the search
- FASTA, BLAST
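The dynamic programming idea behind Needleman-Wunsch (global) and Smith-Waterman (local) alignment can be sketched in one short function. To keep it small, this sketch makes two simplifying assumptions that are not in the slides: a flat match/mismatch score stands in for a PAM or BLOSUM matrix, and gaps carry a linear rather than affine penalty. It returns only the optimal score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style) score.
    local=True:  local (Smith-Waterman style) score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]
```

With two sequences that share only a central region, such as `"TTTGATTACATTT"` and `"CCCGATTACACCC"`, the local score reflects just the shared `GATTACA` segment, while the global score is pulled down by the mismatched flanks. The table costs time and memory proportional to the product of the two lengths, which is why heuristics such as FASTA and BLAST are used for whole-database searches.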
Some definitions
• Affine gap costs - scoring system for gaps within alignments
which charges a penalty for gap formation and an additional per-residue penalty proportional to the size of the gap
• Alignment score - numerical value indicating the overall quality
of an alignment; the higher the score, the better the alignment.
• Algorithm - fixed procedure embodied in a computer program
• Heuristics - a computer science term referring to guesses made
by the program to approximate results, usually based on
arbitrary or predefined rules.
• Gapped Alignment - alignment of sequences where gaps are
permitted
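The affine gap cost defined above amounts to a one-line formula. The sketch below assumes the convention cost = opening penalty + extension penalty × gap length; some programs instead charge the extension only for residues after the first, and the default values here are illustrative, not taken from any particular tool.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Penalty for a single gap of the given length under affine gap costs."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length
```

Under this scheme one long gap is cheaper than several short gaps covering the same number of residues, which matches the biological intuition that a single insertion or deletion event can span many residues.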
Computational Genefinding
• Major challenge in genome projects
• Given a DNA sequence, where does a gene
begin and stop? - ORF
• Where are the exons and introns?
• Where are the transcription elements?
• Gene structure and other regulatory
elements?
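The simplest answer to "where does a gene begin and stop?" is a six-frame scan for open reading frames. The sketch below is a naive illustration, assuming ATG as the only start codon and the three standard stop codons; it ignores introns entirely, so it is only meaningful for intron-free sequence. The function names and the `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=3):
    """Find ATG...stop ORFs in all six reading frames.

    Returns (strand, start, end, sequence) tuples. Coordinates are 0-based,
    end-exclusive, and for the '-' strand refer to the reverse complement.
    The codon count includes the stop codon.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        orfs.append((strand, start, end, seq[start:end]))
                    start = None  # look for the next ORF in this frame
    return orfs
```

In practice `min_codons` would be set much higher (hundreds of codons) to suppress ORFs that arise by chance.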
Genomic Elements
• Intron-exon splice sites
• Start-Stop codons
• Branch Points
• Promoters and terminators of transcription
• Polyadenylation sites
• Ribosomal binding sites
• Topoisomerase II binding sites
• Topoisomerase I cleavage sites
• Transcription factor binding sites
Detecting Genomic Elements
• Local sites and motifs/patterns for such
elements - signals and signal sensors
• Extended variable-length regions, e.g. exons
and introns - contents and content sensors
• Linguistic technique - gene structure
described in a formal grammar - GeneLang
genefinding program
Signal sensors
• Simple consensus sequence
Use of Pattern matching algorithms
• Weight matrices
allow the weighted scores from each weight
matrix sensor to be summed
• Use of Artificial Neural Networks (ANN)
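A weight matrix signal sensor can be sketched as a log-odds position weight matrix built from aligned example sites and slid along a sequence. The example sites used below are hypothetical, and the pseudocount and uniform background frequency are assumptions chosen for simplicity; sequences are assumed to contain only A, C, G and T.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background
            )
            for base in "ACGT"
        })
    return matrix


def score_window(matrix, window):
    """Sum the per-position weights for one candidate site."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_hit(matrix, seq):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(matrix)
    return max(
        (score_window(matrix, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )
```

For instance, a matrix built from the hypothetical sites `["TATAAT", "TATGAT", "TACAAT", "TATATT"]` gives its highest score to the window `TATAAT`, so scanning `"GGCGTATAATGCGC"` reports position 4. Unlike a plain consensus match, near-matches still score positively, and scores from several such sensors can be added together.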
Content Sensors
• Long ORF for bacteria
• Statistical models, e.g. Markov models (GeneMark)
statistical models of nucleotide frequencies
and dependencies in codon structure
• Neural nets, e.g. Grail
exon detection by neural network combined
with signal sensors for exon-intron splice
sites
Some Definitions
• Artificial Neural Nets - statistical pattern
recognition method - a type of nonlinear
regression
• Markov Models - statistical models for sequences
in which the probability of each residue depends
on the residues preceding it.
• Dynamic Programming - type of algorithm widely
used for constructing sequence alignments and for
evaluating all possible candidate gene structures
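The Markov model definition above can be made concrete with a first-order chain, in which the probability of each residue depends only on the one before it. This is a minimal sketch, not the model used by GeneMark or any other program: the pseudocount is an assumption, the probability of the first residue is ignored, and the function names are illustrative.

```python
import math


def train_markov(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate first-order transition probabilities P(current | previous)."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in seqs:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }


def log_likelihood(model, seq):
    """Log-probability of seq under the model (first residue not scored)."""
    return sum(math.log(model[prev][cur]) for prev, cur in zip(seq, seq[1:]))
```

A content sensor can then be built by training one model on coding sequence and another on non-coding sequence, and comparing the two log-likelihoods for a candidate region.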
Other Genefinding methods
• Use of dynamic programming
• Linguistic rules for functional features
• Parameters of a Markov Process on hidden
variables - hidden Markov Models (HMM)
• HMM genefinders - EcoParse, Xpound,
GeneMark HMM, Veil, HMMgene, GenScan
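How an HMM combines hidden states with dynamic programming can be sketched with the Viterbi algorithm, which finds the most probable sequence of hidden states for an observed sequence. The two-state model below ("coding" versus "noncoding", distinguished only by GC content) and all of its probabilities are toy values invented for illustration; real genefinders such as those listed above use far richer state structures.

```python
import math

# Toy two-state model; every number here is an illustrative assumption.
STATES = ("coding", "noncoding")
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT_P = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(obs, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Most probable hidden-state path for a non-empty observed sequence."""
    # best[t][s]: log-probability of the best path ending in state s at t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev  # remember which state led here

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Working in log space avoids numerical underflow on long sequences. On an input such as `"ATATATGCGCGCGC"` the toy model labels the AT-rich prefix as noncoding and the GC-rich suffix as coding.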