Introduction to bioinformatics
Based on Kim’s article “Computers are from Mars, organisms
are from Venus”, Computer, pp. 25-32, July 2002.
CS480-01 Computer Science Seminar
Fall, 2002
Merging computer science &
biology: motivation
• Erwin Schrödinger envisioned life as an aperiodic
crystal, suggesting that the structure of life is neither
periodic nor amorphous. As a result, classical or
traditional mathematical tools have not been
satisfactory in biological analysis.
• Elegant algorithms + brute-force calculations
suggest a reasonable approach to this aperiodic
structure.
Merging computer science &
biology: motivation continued
• Computing seeks to create a machine that can
flexibly solve diverse problems. In nature, such
plastic problem solving resides uniquely in the
domain of organic matter. Whether through
historical evolution or individual behavior,
organisms always adaptively solve the problems their
environment poses.
• Thus, examining how organisms solve problems
can lead to new computation and algorithm-development
approaches, e.g., DNA computers.
Computers are needed to analyze huge
amounts of information
• Until recently, the major activity in biology was
information (data) gathering. The amount of information
gathered, especially at the molecular level over the last five
years, has been overwhelming, e.g., the information in GenBank
(http://www.ncbi.nlm.nih.gov/) has nearly doubled every
18 months, mainly due to improvements in
biotechnology. Ten years ago, it took 5 days to obtain 200
base pairs of DNA sequence data. Today, with the Human Genome
Project, the rate has increased to 28 million base pairs a
month.
• The two most successful uses of computers in biology are:
– comparative sequence analysis
– In silico cloning
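To make the "doubling every 18 months" figure concrete, the short sketch below computes the implied growth factor over a few time spans. It assumes clean exponential growth, which the slide only states approximately; the function name `growth_factor` is illustrative and not from the article.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months`, assuming the size doubles
    every `doubling_months` (an idealized exponential model)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Print the implied growth for a few spans, in years.
    for years in (1.5, 3, 5, 10):
        print(f"{years:>4} years: x{growth_factor(years * 12):.1f}")
```

Under this assumption, five years of growth corresponds to roughly a tenfold increase in stored data.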
Comparative sequence analysis
• When researchers isolate a bio-molecular sequence in the
lab, they want to know whether it is similar to any existing
sequences.
• By extrapolating information from similar,
already well-studied sequences, scientists can
learn a great deal about the newly isolated sequence.
• BLAST (Basic Local Alignment Search Tool), a
tool for searching databases such as GenBank, is used by most
researchers when they isolate a sequence.
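The idea of comparing a new sequence against known ones can be sketched with a local alignment score. The code below uses the Smith-Waterman dynamic-programming recurrence, not BLAST's own heuristic search, and the scoring values and the toy database are assumptions chosen only for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b.

    The scoring parameters are illustrative defaults, not BLAST's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query: str, database: dict) -> list:
    """Return (score, name) pairs, best-scoring database entry first."""
    scored = [(local_alignment_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Hypothetical sequences, invented for this example.
    toy_db = {"seq_a": "TTACCAGTCGG", "seq_b": "GGGGGGGG", "seq_c": "ACCAGT"}
    for score, name in rank_database("ACCAGTC", toy_db):
        print(name, score)
```

Exact dynamic programming like this is quadratic in the sequence lengths, which is one reason heuristic tools are preferred when searching large databases.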
In silico cloning
• The process of using a computer search of existing
databases to clone a gene.
• For example, one may want to clone a gene by its
phenotype, such as olfaction; by its structure, such as a G
protein-coupled receptor; or by a pattern fragment, such as
the DNA pattern “ACCAGTC”.
• Computational algorithms have been designed to guide the
otherwise extremely time-consuming and expensive wet-lab
experiments, with great success. For example, the 15-year-old
problem of isolating the fruit fly’s olfactory genes was solved
with the help of a computer algorithm
named QFC (quasiperiodic feature classifier), Bioinformatics,
vol. 16, 2000, pp. 767-775.
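Searching by a pattern fragment is the simplest of these cases to sketch. The code below scans a small, made-up database for exact occurrences of a pattern such as “ACCAGTC”; the database contents and function names are hypothetical, and a real search would also need to handle mismatches and the reverse strand.

```python
def find_pattern(sequence: str, pattern: str) -> list:
    """Return 0-based start positions of every occurrence of pattern,
    including overlapping ones."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        start = sequence.find(pattern, start + 1)
    return positions


def search_database(database: dict, pattern: str) -> dict:
    """Map each sequence name to its hit positions, omitting names with no hits."""
    hits = {}
    for name, seq in database.items():
        found = find_pattern(seq, pattern)
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    # Hypothetical entries, invented for this example.
    toy_db = {"clone_1": "TTACCAGTCAACCAGTC", "clone_2": "GGGTTTAAA"}
    print(search_database(toy_db, "ACCAGTC"))
```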
Other challenging tasks
• Collecting and integrating vast amounts of
information from distributed databases of
heterogeneous sources into a coherent information
set. This has been a challenge because of the many problems
that need to be addressed, e.g., the same object may be
given many different names, or database
specialists may define a gene differently.
Currently, most major databases, such as
GenBank and Swiss-PROT
(http://www.ebi.ac.uk/swissprot/index.html/),
operate partially by human curation and partially
by automated tools.
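One small piece of the naming problem can be sketched as an alias table that maps the different names used by different sources onto one canonical identifier. Everything in the example below (the record layout, the alias table, the gene names) is hypothetical; real integration also has to reconcile differing definitions, which a lookup table alone cannot do.

```python
from collections import defaultdict


def merge_records(records, aliases):
    """Group (source, name, note) records under one canonical name.

    `aliases` maps a lower-cased alternative name to its canonical form;
    names without an entry are used as-is (lower-cased).
    """
    merged = defaultdict(list)
    for source, name, note in records:
        key = name.lower()
        canonical = aliases.get(key, key)
        merged[canonical].append((source, note))
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical records from two hypothetical databases.
    records = [
        ("db1", "GeneX", "note a"),
        ("db2", "gene-x", "note b"),
        ("db1", "GeneY", "note c"),
    ]
    aliases = {"gene-x": "genex"}
    print(merge_records(records, aliases))
```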
Other challenging tasks continued
• Annotating raw data collected from genome projects with
all the relevant information (such as whether a stretch of
DNA contains an amino acid coding sequence,
transposons, or a regulator sequence, and if an amino acid
is coded, what its putative function is, etc.). As of now,
genome projects generate raw data without giving them
biological meaning, and biologists use existing
information to extrapolate knowledge about novel biomolecules.
• One complete annotation addressed the 3,000,000 bases
that surround the Drosophila melanogaster ADH sequence
(http://www.flybase.org/).
• Given the rate at which researchers now generate DNA sequence
information, automatically annotating the raw data presents a
computational challenge, and careful human analysis is becoming
increasingly difficult. The tools necessary for addressing this problem
include gene prediction, gene classification, comparative genomics,
and evolutionary modeling.
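As a very rough illustration of one annotation step, deciding whether a stretch of DNA might contain a coding sequence, the sketch below looks for open reading frames (ORFs): a start codon ATG followed in the same frame by a stop codon. This is a simplification and not a real gene predictor; it scans only the forward strand, and the minimum-length threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end) pairs of forward-strand ORFs.

    `start` is the index of the ATG, `end` is exclusive and includes the
    stop codon. `min_codons` counts codons before the stop (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                j = i + 3
                # Walk codon by codon until a stop codon or the end.
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(dna):  # an in-frame stop codon was found
                    if (j - i) // 3 >= min_codons:
                        orfs.append((i, j + 3))
                    i = j + 3
                    continue
            i += 3
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGCC"))
```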
Other challenging tasks continued
• Synthesizing information (data) into general theories
remains a challenging task.
• For example, researchers working on the same object made
hundreds of independent observations which appear in
thousands of research articles that use scores of variations
in terminology, methodology, and so forth.
• To cope with the information explosion, we may need a
computer system that performs automatic knowledge
extraction and produces synthetic new information. Such a
tool may be used in many other fields as well
(http://www.cmu.edu/cald/research.html).
Other computer tasks frequently
performed by biologists today
• Biomolecular sequence alignment
• Assembly of DNA pieces
• Multivariate analysis of large-scale gene
expressions
• Metabolic pathway analysis
Computational biology’s holy grail
• Predicting molecular structure
• Computing the genotype-phenotype map
Predicting molecular structure
• Given the molecule’s sequence identity, predict its
3-D structure and from the structure, infer the
molecular function.
• What’s the challenge?
– The genetic code specifies 20 amino acids.
– Proteins consist of approx. 1,000 different major
structures called folds, each with tens of thousands of
variations.
– In proteins, the physical forces that govern the
interaction of the hundreds to thousands of amino acid
residues determine the structure, and we do not know
the details of these interactions. (Even if we did, it is an
extremely difficult many-body problem.)
• Still, some significant progress has been made, thanks in part to CASP
(Critical Assessment of Structure Prediction
http://www.ncbi.nlm.nih.gov/structure/Research/casp3/index.shtml)
From structure to function
• A protein’s structure approximately determines its
molecular functions such as catalysis, DNA
binding, and cell component binding.
• Some researchers believe that a relational map
between structure and function should be
deducible (a “third genetic code”). The idea also
drives “rational drug design”. It remains beyond
reach because knowledge about protein
function is unclear in theory and practice. Also, an
object’s function often depends on context. For
example, the function of a screw that holds a chair
together is quite different from that of a screw in a car
jack.
From genotype to phenotype
• A genotype refers to the genetically encoded
information in an individual genome; it consists of
the sequence identity for a person’s entire DNA.
• The phenotype refers to any measured trait of a
particular individual such as hair color, body
weight, propensity for a heart attack, and so on
(drug companies have begun to ask how the
efficacy of a given drug treatment interacts with
the recipient’s genotype).
The DNA computer
• Motivated by the lack of an efficient algorithm for solving
NP-complete problems (NP: nondeterministic polynomial time),
Adleman suggested using DNA for computation. He used
sequence-specific hybridization of DNA molecules and the polymerase
chain reaction to solve the problem of finding a Hamiltonian
path in a directed graph.
• Additional research has shown that by using DNA’s ability
to find complementary sequence pairs, DNA can be used to
encode a universal computer.
• A practical DNA computer remains elusive due to
– Encoding the problem and reading the output, which are
extremely time consuming.
– Inherent computational errors.
– The amount of DNA required to solve a practical hard
problem.
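The Hamiltonian path problem mentioned above can be sketched in software as a generate-and-filter search: propose every ordering of the vertices that starts and ends at the required ones, then keep only the orderings whose consecutive vertices are joined by edges. This is only a loose software analogy to the DNA procedure, not a description of Adleman's laboratory steps, and the small graph is hypothetical.

```python
from itertools import permutations


def hamiltonian_paths(n, edges, start, end):
    """Return all paths from start to end that visit each of the n
    vertices (numbered 0..n-1) exactly once in the directed graph."""
    edge_set = set(edges)
    middle = [v for v in range(n) if v not in (start, end)]
    found = []
    for perm in permutations(middle):      # generate candidate orderings
        path = (start, *perm, end)
        # Filter: every consecutive pair must be a directed edge.
        if all((u, v) in edge_set for u, v in zip(path, path[1:])):
            found.append(path)
    return found


if __name__ == "__main__":
    # Hypothetical 4-vertex directed graph, invented for this example.
    edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
    print(hamiltonian_paths(4, edges, start=0, end=3))
```

The number of candidate orderings in this sketch grows as (n-2)!, which gives one intuition for why the resources needed for a practical hard problem become a limiting factor.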
Genetic algorithm
• John Holland laid the foundation in the ’60s.
• The algorithm emulates the evolutionary adaptive
behavior of real organisms.
• Three components:
– The organism must have a property or suite of
properties that governs its differential survival.
– The individuals should inherit these properties.
– A mechanism should generate variations of these
properties via mutation.
• Evolutionary computing thus involves generating
a population of computer programs and --- by
tying their survival to their problem-solving ability
--- selecting those that are particularly good at
solving a posed problem.
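The three components listed above (differential survival, inheritance, and variation via mutation) can be sketched in a few lines. In this minimal example, bit strings stand in for the "computer programs", the fitness function is supplied by the caller, and all parameter values are arbitrary assumptions; it uses mutation only, with no crossover.

```python
import random


def evolve(fitness, genome_length=20, population_size=30,
           generations=40, mutation_rate=0.02, seed=0):
    """Evolve bit-string genomes toward higher fitness.

    Returns the best genome found and the best fitness per generation.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Differential survival: only the fitter half is kept.
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)  # inheritance from a survivor
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in parent]
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    # Toy problem: maximize the number of 1 bits in the genome.
    best, history = evolve(sum)
    print(best, history[0], history[-1])
```

Because survivors are carried over unchanged, the best fitness in this sketch never decreases from one generation to the next.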
Terminology
phe·no·type
a. The observable physical or biochemical
characteristics of an organism, as determined
by both genetic makeup and environmental
influences.
b. The expression of a specific trait, such as
stature or blood type, based on genetic and
environmental influences.
Terminology continued
trans·po·son
A segment of DNA that is capable of moving to a new
position within the same or another chromosome,
plasmid, or cell and thereby transferring genetic
properties such as resistance to antibiotics.
dro·soph·i·la
Any of various small fruit flies of the genus Drosophila,
especially D. melanogaster, used extensively in
genetic research.
Terminology continued
• phy·log·e·ny
• 1. The evolutionary development and history
of a species or higher taxonomic grouping of
organisms. Also called phylogenesis.
• 2. The evolutionary development of an organ
or other part of an organism: the phylogeny of
the amphibian intestinal tract.
• 3. The historical development of a tribe or
racial group.
Genomics
• Each cell of a living organism contains
chromosomes composed of a sequence of DNA
base pairs. The sequence, the genome, represents a
set of instructions that controls the replication and
function of each organism.
• Genomics: The automated DNA sequencer gave
birth to genomics --- the analytic and comparative
study of genomes --- by allowing researchers to
decode entire genomes.
Other articles of interest
• Genome sequence assembly: algorithms and
issues.
• Toward new software for computational
phylogenetics.
• BioSig: an imaging bioinformatics system for
studying phenomics.
• A random walk down the genomes: DNA
evolution in Valis (Vast active living intelligent
systems).
• Interactively exploring hierarchical clustering
results.