Bioinformatics
Timothy Ketcham
Union College
Graduate Seminar
2003
Introduction
Agenda
- What is Bioinformatics?
- Goals
- Molecular Biology – Genes & Proteins
- AI Techniques applied to Gene & Protein Studies
- Molecular Biology – Phylogenetic Trees
- CS Techniques applied to Tree Estimation
- Databases
- Tools
- Results
- Discussion
Introduction
What is Bioinformatics?
- Entire field of Computational Biology?
- Computational Molecular Biology?
- Application of Computer Science to Genome Analysis?
Introduction
What is Bioinformatics?
Definition:
…conceptualizing biology in terms of molecules
(in the sense of physical chemistry) and applying
“informatics techniques” (derived from disciplines
such as applied maths, computer science and
statistics) to understand and organize the
information associated with these molecules, on
a large scale. In short, Bioinformatics is an
information management system for molecular
biology…
Goals
Organizing existing biological data.
Developing tools and techniques to mine the data.
Using the data and tools for knowledge discovery.
Molecular Biology
Genetics
- Genome
- Chromosomes
- Genes
- Nucleotides
- Base Pairs
- Key Point
The sequence of nucleotides in a gene
determines its functions, and changes in
the sequence can lead to major changes in
those functions.
Molecular Biology
Proteins
- Linear chains of amino acids
- Structural Components
- Primary
- Secondary
- Tertiary
- Quaternary
- Key Point
The four structural components along with
the chemical properties of the amino acids
determine the function of the protein.
Artificial Intelligence
- Components
- Performance element
- Learning element
- Critic
- Training
- Testing
- Operation
Techniques
Decision Trees
[Figure: decision tree. Attribute 1 branches on Condition 1, Condition 2 and Condition 3; each branch leads to a test on Attribute 2, where Condition 1 gives Result 1 and Condition 2 gives Result 2.]
Techniques
Decision Trees
[Figure: a single decision node. Attribute 2: Condition 1 gives Result 1, Condition 2 gives Result 2.]
Techniques
Neural Networks
[Figure: neural network. Input nodes Attribute 1 through Attribute 5 feed intermediate nodes labeled "fact", which feed a single output node, Decision.]
Techniques
Belief Networks
[Figure: belief network. Nodes Attribute 1 through Attribute 3 are linked to Result 1 through Result 3 by edges labeled with probabilities between p = 0.3 and p = 0.7.]
Techniques
Hidden Markov Models
[Figure: hidden Markov model. Start and End nodes connected through Match, Insert and Delete states.]
Molecular Biology
Phylogenetic Trees
- Used to map evolutionary relationships
- Traditionally done at the organism level
- Mapping at molecular level can help evaluate
the relationships and/or evolution of genetic
structures, proteins or organisms
Techniques
Tree Estimation
- Number of rooted, bifurcating trees (T) for a given number of taxa (n):
$$T = \prod_{i=2}^{n} (2i - 3)$$
- T increases very rapidly (more than 10^8 trees for 11 taxa)
- Need efficient search methods
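The growth in T can be checked numerically. The sketch below assumes the count refers to rooted, bifurcating trees with labeled tips, for which T is the product of the odd numbers up to 2n - 3; the function name `count_rooted_trees` is invented for this illustration.

```python
def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted, bifurcating trees on n_taxa labeled tips.

    Assumes T = 1 * 3 * 5 * ... * (2n - 3); unrooted trees would use
    a different product and give smaller counts.
    """
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    total = 1
    for i in range(2, n_taxa + 1):
        total *= 2 * i - 3  # each added taxon multiplies the count
    return total


if __name__ == "__main__":
    for n in (4, 11, 20):
        print(n, count_rooted_trees(n))
```

Python integers do not overflow, so the same function works for the 20-taxon case without special handling.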
Techniques
Exhaustive Search
- Brute Force Method
- Algorithm
- Create all possible trees
- Evaluate against optimality criteria
- Select best tree
- Only used up to 11 taxa
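The three steps above can be sketched for a handful of taxa. This is an illustration, not the method of any particular package: trees are nested tuples, the optimality criterion is assumed to be parsimony length (computed with Fitch's algorithm, which the slides do not name), and `toy_seqs` is invented data.

```python
from typing import Dict, Iterator, List, Union

Tree = Union[str, tuple]  # a tip name, or a (left, right) pair of subtrees


def insert_everywhere(tree: Tree, taxon: str) -> Iterator[Tree]:
    """Yield each tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)  # attach on the branch above this node
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)


def all_rooted_trees(taxa: List[str]) -> List[Tree]:
    """Step 1: create all possible rooted, bifurcating trees."""
    trees: List[Tree] = [taxa[0]]
    for taxon in taxa[1:]:
        trees = [bigger for t in trees for bigger in insert_everywhere(t, taxon)]
    return trees


def fitch_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Step 2: parsimony length, the minimum number of state changes."""
    def walk(node, site):
        if not isinstance(node, tuple):
            return {seqs[node][site]}, 0
        left_states, left_cost = walk(node[0], site)
        right_states, right_cost = walk(node[1], site)
        shared = left_states & right_states
        if shared:
            return shared, left_cost + right_cost
        return left_states | right_states, left_cost + right_cost + 1

    n_sites = len(next(iter(seqs.values())))
    return sum(walk(tree, site)[1] for site in range(n_sites))


def exhaustive_search(seqs: Dict[str, str]) -> Tree:
    """Step 3: select the best tree (the first one found if several tie)."""
    return min(all_rooted_trees(sorted(seqs)), key=lambda t: fitch_score(t, seqs))


# Hypothetical aligned sequences, invented for this example.
toy_seqs = {"A": "AAC", "B": "AAC", "C": "GGT", "D": "GGT"}
best = exhaustive_search(toy_seqs)
print(best, fitch_score(best, toy_seqs))
```

Every tree is built and scored, so the running time grows with T; that is why the approach stops being practical at around 11 taxa.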
Techniques
Branch and Bound
- Effectively used for problems involving fewer than 20
taxa (approximately 10^22 trees)
- Algorithm
- Establish minimally acceptable criteria
- Evaluate all n taxa trees, discard ones not
meeting criteria
- Evaluate n+1 taxa trees using remaining n taxa
trees as bases
- Repeat until all taxa have been evaluated
- Select optimal remaining tree
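One common way to realize this idea is a depth-first search that adds one taxon at a time and abandons any partial tree already scoring no better than the best complete tree found so far. The sketch below takes that variant as an assumption; it is not necessarily the level-by-level procedure on the slide. It also assumes parsimony length as the criterion, which works as a bound because adding a taxon can never shorten a tree. The sequences are invented.

```python
from typing import Dict, Iterator, Tuple, Union

Tree = Union[str, tuple]  # a tip name, or a (left, right) pair of subtrees


def insert_everywhere(tree: Tree, taxon: str) -> Iterator[Tree]:
    """Yield each tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)


def fitch_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Parsimony length of `tree` over the tips it contains."""
    def walk(node, site):
        if not isinstance(node, tuple):
            return {seqs[node][site]}, 0
        left_states, left_cost = walk(node[0], site)
        right_states, right_cost = walk(node[1], site)
        shared = left_states & right_states
        if shared:
            return shared, left_cost + right_cost
        return left_states | right_states, left_cost + right_cost + 1

    n_sites = len(next(iter(seqs.values())))
    return sum(walk(tree, site)[1] for site in range(n_sites))


def branch_and_bound(seqs: Dict[str, str]) -> Tuple[Tree, int, int]:
    """Return (best tree, its score, number of partial or full trees scored)."""
    taxa = sorted(seqs)
    best_tree, best_score, visited = None, float("inf"), 0

    def extend(tree, remaining):
        nonlocal best_tree, best_score, visited
        visited += 1
        score = fitch_score(tree, seqs)
        if score >= best_score:
            return  # bound: no tree built on this one can beat the best
        if not remaining:
            best_tree, best_score = tree, score  # new best complete tree
            return
        for bigger in insert_everywhere(tree, remaining[0]):
            extend(bigger, remaining[1:])  # branch: try every attachment

    extend(taxa[0], taxa[1:])
    return best_tree, best_score, visited


# Hypothetical aligned sequences, invented for this example.
toy_seqs = {"A": "AAC", "B": "AAC", "C": "GGT", "D": "GGT", "E": "GGA"}
tree, score, visited = branch_and_bound(toy_seqs)
print(tree, score, visited)
```

The third return value shows how much of the search was skipped: an unpruned search over five taxa would score 125 partial and complete trees.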
Techniques
Branch Swapping
- Used in most phylogenetic tree estimates
- Algorithm
- Construct trees with n taxa
- Discard all but optimal tree
- Rearrange branches of optimal tree to check
for more optimal arrangement
- Best tree becomes base for n+1 taxa
- Repeat for n+1 taxa
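The rearrangement step can be sketched as a hill climb. The move used here, pruning a single tip and reattaching it on another branch, is an assumption chosen for brevity; it is a restricted form of the swaps used in practice. Only the rearrangement of a given starting tree is shown, not the stepwise addition of taxa. Parsimony length is again assumed as the criterion, and the data are invented.

```python
from typing import Dict, Iterator, List, Tuple, Union

Tree = Union[str, tuple]  # a tip name, or a (left, right) pair of subtrees


def tips(tree: Tree) -> List[str]:
    """List the tip names of `tree`."""
    if isinstance(tree, tuple):
        return tips(tree[0]) + tips(tree[1])
    return [tree]


def insert_everywhere(tree: Tree, taxon: str) -> Iterator[Tree]:
    """Yield each tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)


def remove_tip(tree: tuple, tip: str) -> Tree:
    """Prune `tip` from `tree` and collapse the node it hung from."""
    left, right = tree
    if left == tip:
        return right
    if right == tip:
        return left
    if isinstance(left, tuple) and tip in tips(left):
        return (remove_tip(left, tip), right)
    return (left, remove_tip(right, tip))


def neighbors(tree: Tree) -> Iterator[Tree]:
    """Every tree reachable by moving one tip to another branch."""
    for tip in tips(tree):
        yield from insert_everywhere(remove_tip(tree, tip), tip)


def fitch_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Parsimony length of `tree`."""
    def walk(node, site):
        if not isinstance(node, tuple):
            return {seqs[node][site]}, 0
        ls, lc = walk(node[0], site)
        rs, rc = walk(node[1], site)
        return (ls & rs, lc + rc) if ls & rs else (ls | rs, lc + rc + 1)

    n_sites = len(next(iter(seqs.values())))
    return sum(walk(tree, site)[1] for site in range(n_sites))


def branch_swap(start: Tree, seqs: Dict[str, str]) -> Tuple[Tree, int]:
    """Keep taking the first improving rearrangement until none is left."""
    current, current_score = start, fitch_score(start, seqs)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            score = fitch_score(candidate, seqs)
            if score < current_score:
                current, current_score, improved = candidate, score, True
                break
    return current, current_score


# Hypothetical aligned sequences and a deliberately poor starting tree.
toy_seqs = {"A": "AAC", "B": "AAC", "C": "GGT", "D": "GGT"}
print(branch_swap((("A", "C"), ("B", "D")), toy_seqs))
```

The loop stops at the first tree with no better neighbor, which need not be the best tree overall.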
Techniques
Divide and Conquer
- Subdivides the problem: finds optimal sub-trees and combines
them into a super-tree
- Algorithm
- Select a subset size (less than n)
- Divide taxa into subsets
- Find optimal trees for each subset of taxa
- Combine optimal sub-trees into super-tree
with all taxa
Techniques
Problem
All the previous methods except Exhaustive Search (and
Branch and Bound, when its bound is a true lower bound)
may find a locally optimal tree rather than the
globally optimal tree
Techniques
Stochastic Methods
- Simulated Annealing Algorithm
- Create trees for n taxa (based on other methods)
- Evaluate against optimality criteria, select best
- Evaluate remaining trees using other parameters
(“cooling schedule”)
- Tree retained is one best meeting both optimality criteria
and cooling schedule
- Allows retention of a less optimal tree in some
cases, but may lead to better globally optimal result
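In the usual formulation of simulated annealing, a worse candidate is accepted with probability exp(-delta / temperature), and the cooling schedule lowers the temperature over time so that such moves become rarer. The sketch below is generic and makes several assumptions: `neighbor` stands in for a tree rearrangement, `score` for the optimality criterion (lower is better), and the demo uses an invented numeric landscape rather than real trees. All parameter names and defaults are hypothetical.

```python
import math
import random


def simulated_annealing(start, neighbor, score, t_start=10.0, t_end=0.01,
                        cooling=0.95, steps_per_temp=20, rng=None):
    """Return (best state seen, its score); lower scores are better."""
    rng = rng or random.Random()
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temperature = t_start
    while temperature > t_end:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            candidate_score = score(candidate)
            delta = candidate_score - current_score
            # Always accept improvements; accept a worse state with a
            # probability that shrinks as the temperature falls.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_score = candidate, candidate_score
                if current_score < best_score:
                    best, best_score = current, current_score
        temperature *= cooling  # geometric cooling schedule (an assumption)
    return best, best_score


# Toy stand-in for tree space (hypothetical): positions on a bumpy line
# with a local minimum at index 3 and the global minimum at index 10.
landscape = [6, 4, 5, 3, 4, 6, 7, 5, 2, 1, 0, 1, 3]


def step(position, rng):
    """Move one place left or right, staying inside the landscape."""
    return min(len(landscape) - 1, max(0, position + rng.choice((-1, 1))))


best, best_score = simulated_annealing(
    start=0, neighbor=step, score=lambda p: landscape[p], rng=random.Random(0))
print(best, best_score)
```

Accepting occasional uphill moves is what lets the search leave a local optimum such as index 3 in the toy landscape.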
Techniques
Stochastic Methods
- Genetic Algorithm
- Create trees for n taxa (based on other methods)
- Select a population of trees to proceed to next
generation
- Allow trees to mutate or cross over based on criteria
established by designer
- Follows the Darwinian Evolution Model (Survival of
the Fittest)
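The select, cross over and mutate loop can be sketched generically. Everything specific here is an assumption: for trees, `crossover` would exchange subtrees, `mutate` would apply a branch rearrangement, and `fitness` would reward shorter trees, but the demo uses invented bit strings as individuals because they keep the operators to a few lines.

```python
import random


def genetic_algorithm(population, fitness, mutate, crossover,
                      generations=50, n_parents=None, rng=None):
    """Evolve `population` and return its fittest member (higher is better)."""
    rng = rng or random.Random()
    size = len(population)
    if size < 2:
        raise ValueError("need at least two individuals")
    n_parents = n_parents or max(2, size // 2)
    for _ in range(generations):
        # Selection: only the fittest individuals reach the next generation.
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        children = []
        while len(parents) + len(children) < size:
            mother, father = rng.sample(parents, 2)
            children.append(mutate(crossover(mother, father, rng), rng))
        population = parents + children  # parents survive unchanged
    return max(population, key=fitness)


def one_point_crossover(a, b, rng):
    """Child takes the start of one parent and the end of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one_bit(individual, rng):
    """Mutation: flip a single randomly chosen position."""
    k = rng.randrange(len(individual))
    return individual[:k] + (1 - individual[k],) + individual[k + 1:]


# Hypothetical demo: individuals are 12-bit tuples, fitness is the number
# of ones. These stand in for trees and tree scores.
rng = random.Random(1)
start = [tuple(rng.randint(0, 1) for _ in range(12)) for _ in range(10)]
winner = genetic_algorithm(start, fitness=sum, mutate=flip_one_bit,
                           crossover=one_point_crossover,
                           generations=30, rng=rng)
print(winner, sum(winner))
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.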
Resources
Databases
- Overwhelming amount of information available
- As of 1998, over 200 databases
- Some have well over 1,000,000 entries
- Includes sequences and metadata
- Most freely available over web
Resources
Databases
- EpoDB
- Used to study gene regulation in red blood cell development (erythropoiesis)
- Organized by gene, not structure
- 10,000 entries
- GenBank
- Operated by NIH
- Over 18,000,000 records
- Contains info on all publicly available DNA
sequences
- Flat file structure
Resources
Databases
- GeneCards
- Focus on medical aspects of genetics
- Uses metadata
- Provides efficient navigation system to other
databases
- The Genome Database
- Official database for the Human Genome Initiative (HGI)
- Information includes maps of gene locations,
genetic structure and variations.
Resources
Databases
- PIR – International Protein Sequence Database
- Oldest database of molecular sequence info
- Begun in the 1960s (paper based)
- info on protein sequences, functional and
structural properties and phylogeny
- SWISS-PROT
- Protein database (90,000 entries)
- Links to other databases
- Most often cited
Resources
Tools
- Search engines
- Programming languages for structured queries
- Phylogenetic Tree Analysis tools
Resources
Tools
- BLAST (Basic Local Alignment Search Tool)
- Dominant search engine for biological
sequence databases.
- Uses an algorithm that concentrates on finding
regions of high local similarity and then
attempts to extend each match into
adjacent areas.
- Provides an estimate of the statistical
significance of sequence matches.
- Various versions
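The find-then-extend idea can be sketched in a few lines. This is a simplified illustration and not BLAST itself: it uses exact word matches as seeds, ungapped extension, and a fixed match/mismatch score, and it omits scoring matrices, gapped alignment and the significance estimate. The function names, parameters and default values are all invented for the example.

```python
from collections import defaultdict


def extend_one_way(query, subject, i, j, step, match, mismatch, x_drop):
    """Extend from (i, j) in direction `step`; return (best gain, its length)."""
    score = best = length = best_length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen: stop extending
        i += step
        j += step
    return best, best_length


def seed_and_extend(query, subject, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Return ungapped local matches as (score, query_start, subject_start, length)."""
    # Index every word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    hits = set()  # a set, so seeds that grow into the same match merge
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            right, right_len = extend_one_way(
                query, subject, i + word_size, j + word_size, 1,
                match, mismatch, x_drop)
            left, left_len = extend_one_way(
                query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
            hits.add((word_size * match + left + right,
                      i - left_len, j - left_len,
                      left_len + word_size + right_len))
    return sorted(hits, reverse=True)  # highest score first


# Hypothetical sequences, invented for this example.
print(seed_and_extend("GATTACA", "CCGATTACACC"))
```

Indexing the subject once means each query word is looked up directly instead of being compared against every position, which is where most of the speed of this style of search comes from.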
Resources
Tools
- Entrez
- Search and retrieval system at National
Center for Biotechnology Information
- Searches all databases at NCBI for
information on nucleotide and protein
sequences, macromolecular structures
and whole genomes.
- User defined custom search strategies
- Frequently cited
Resources
Tools
- Kleisli
- Integrated data management system
- Functional programming language (CPL)
- Built in data types – user extensible
- Extends Flat and Relational DBs to OODB
- Works with Sybase, ORACLE, Entrez & BLAST
Resources
Tools
- PHYLIP (Phylogeny Inference Package)
- Collection of tools for developing trees
- Works with proteins and genes
- Uses branch and bound & branch swapping
techniques.
- Created in 1980 (lots of citations)
- Freely available on web (both source code
& executables)
Resources
Tools
- SMART (Simple Modular Architecture Research Tool)
- Analyzes protein sequences
- Can identify more than 400 structural families
- Information on phylogeny, function and
structure
- Uses Hidden Markov Models
- Web-based
Research
Human Genome Project
- Requires identifying and decoding 35,000 genes
- From 2,000 – 2,000,000 base pairs per gene
- First draft (~90% of base pairs) in 2001
- Recently published 4th chromosome map
(87,000,000 base pairs)
- Expect to complete in April, 2003
Research
Other Work
- HIV-1 Genome Mutation Detection
- Link between Neuregulin-1 and Schizophrenia
- MLP and Cardiomyopathy Link
Research
Other Work
- Study to Identify Genetic & Environmental Disease
Causes
- “in silico” Biology
Discussion
- What level of domain knowledge is needed for IT
professionals working in Bioinformatics?
- What courses would be needed in a Bioinformatics
curriculum?
- Is a Bioethics course needed for IT professionals
working in the field?