Bioinformatics Basics III

Bioinformatics
The Genome
The hereditary information that an organism passes to its offspring is represented in each of
its cells. The representation is in the form of DNA molecules. The totality of this information
is called the genome of the organism. In humans the genome consists of about 3 billion nucleotide pairs.
The genome is the totality of DNA stored in chromosomes typical of each species. The
genome contains most of the information needed to specify an organism’s properties.
Genetics: Genetics is the study of heredity.
Proteins:
A protein is a very large biological molecule composed of a chain of smaller molecules
called amino acids.
DNA
DNA was discovered in 1869. Most of the DNA in cells is contained in the chromosomes.
DNA is chemically very different from protein. DNA is structured as a double helix
consisting of two long strands that wind around a common axis. Each strand is a very long
chain of nucleotides of four types: A, C, T and G.
The linear ordering of the nucleotides determines the genetic information.
The major tasks of molecular biology are to:
- Extract the information contained in the genomes of different organisms;
- Elucidate the structure of the genome;
- Apply this knowledge to the diagnosis and, ultimately, treatment of genetic diseases
(about 4000 such diseases in humans have been identified);
- Explain the process and mechanisms of evolution by comparing the genomes of different
species.
These tasks require the invention of new algorithms.
Bioinformatics supports all of the above objectives.
Biopolymer:
A macromolecule in a living organism that is formed by linking together several smaller
molecules, as a protein from amino acids or DNA from nucleotides.
Sequencing:
Sequencing means determining the primary structure of an unbranched biopolymer.
Sequencing results in a symbolic linear depiction known as a sequence, which succinctly
summarizes much of the atomic-level structure of the sequenced molecule. Examples: DNA
sequencing, RNA sequencing, protein sequencing.
The Human Genome Project (HGP) is an international scientific research project with the
goal of determining the sequence of chemical base pairs which make up human DNA, and
of identifying and mapping all of the genes of the human genome from both a physical and
functional standpoint. It remains the world's largest collaborative biological project.
The Human Genome Project (HGP) was the international, collaborative research program
whose goal was the complete mapping and understanding of all the genes of human
beings. All our genes together are known as our "genome." The HGP was the natural
culmination of the history of genetics research.
The HGP has revealed that there are about 20,500 human genes. The completed human
sequence can now identify their locations. This ultimate product of the HGP has given the
world a resource of detailed information about the structure, organization and function of
the complete set of human genes.
When the Human Genome Project was begun in 1990 it was understood that to meet the
project's goals, the speed of DNA sequencing would have to increase and the cost would
have to come down. Over the life of the project virtually every aspect of DNA sequencing
was improved. It took the project approximately four years to sequence its first one billion
bases but just four months to sequence the second billion bases.
During the month of January, 2003, 1.5 billion bases were sequenced. As the speed of
DNA sequencing increased, the cost decreased from 10 dollars per base in 1990 to 10
cents per base at the conclusion of the project in April 2003. Although the Human Genome
Project is officially over, improvements in DNA sequencing continue to be made.
Researchers are experimenting with new methods for sequencing DNA that have the
potential to sequence a human genome in just a matter of weeks for a few thousand dollars.
DNA sequencing performed on an industrial scale has produced a vast amount of data to
analyze. In August 2005 it was announced that the three largest public collections of DNA
and RNA sequences together store one hundred billion bases, representing over 165,000
different organisms. As sequence data began to pile up, the need for new and better
methods of sequence analysis was critical.
Bioinformatics is the branch of biology that is concerned with the acquisition, storage, and
analysis of the information found in nucleic acid and protein sequence data. Computers and
bioinformatics software are the tools of the trade.
Genetic data represent a treasure trove for researchers and companies interested in how
genes contribute to our health and well being. Almost half of the genes identified by the
Human Genome Project have no known function. Researchers are using bioinformatics to
identify genes, establish their functions, and develop gene-based strategies for preventing,
diagnosing, and treating disease.
A DNA sequencing reaction produces a sequence that is several hundred bases long.
Gene sequences typically run for thousands of bases. The largest known gene is that
associated with Duchenne muscular dystrophy. It is approximately 2.4 million bases in
length. In order to study genes, scientists first assemble long DNA sequences from series
of shorter overlapping sequences.
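The assembly step described above can be illustrated with a small sketch. It assumes error-free reads from a single strand and uses a simple greedy strategy (merge the pair with the longest suffix-prefix overlap first); the function names and the `min_overlap` parameter are illustrative choices, not part of any particular assembler, and real assembly software is considerably more elaborate.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, -1, -1)  # (overlap size, index of left read, index of right read)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# Three short, made-up reads that overlap by four bases each.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

The greedy choice keeps the sketch short; it can mis-assemble repetitive sequences, which is one reason practical assemblers use more careful graph-based methods.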
Scientists enter their assembled sequences into genetic databases so that other scientists
may use the data. Since the sequences of the two DNA strands are complementary, it is
only necessary to enter the sequence of one DNA strand into a database. By selecting an
appropriate computer program, scientists can use sequence data to look for genes, get
clues to gene functions, examine genetic variation, and explore evolutionary relationships.
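Because the two strands are complementary (A pairs with T, C pairs with G), the second strand can always be recomputed from the stored one. A minimal sketch, assuming the sequence contains only the four letters A, C, G and T; the function name is an illustrative choice:

```python
# Translation table mapping each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGGCC"))  # GGCCAT
```

Applying the function twice returns the original sequence, which is why storing one strand loses no information.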
Bioinformatics is a young and dynamic science. New bioinformatic software is being
developed while existing software is continually updated.
BIOINFORMATICS:
Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of
nucleic acid sequence (genes and RNAs), protein sequence and structural information. This
includes databases of the sequences and structural information as well as methods to access,
search, visualize and retrieve the information.
Sequence data can be used to predict the functions of newly identified
genes, estimate evolutionary distance in phylogeny reconstruction, determine the active
sites of enzymes, construct novel mutations and characterize alleles of genetic diseases, to
name just a few uses. Sequence data facilitates:
- Analysis of the organization of genes and genomes and their evolution
- Prediction of protein sequence from DNA sequence, which in turn facilitates
prediction of protein properties, structure, and function (proteins are rarely
sequenced in their entirety today)
- Identification of regulatory elements in genes or RNAs
- Identification of mutations that lead to disease, etc.
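As an illustration of predicting a protein sequence from a DNA sequence, the sketch below translates a coding sequence with the standard genetic code. It assumes the input is the coding strand, starts in the correct reading frame, and contains only A, C, G and T; the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon ('*')."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # MAK
```

Ambiguous bases such as N are not handled here and would raise a `KeyError`.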
Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. The ultimate goal of the field is to enable the
discovery of new biological insights as well as to create a global perspective from which
unifying principles in biology can be discerned.
There are three important sub-disciplines within bioinformatics involving computational
biology:
- the development of new algorithms and statistics with which to assess relationships
among members of large data sets;
- the analysis and interpretation of various types of data including nucleotide and amino
acid sequences, protein domains, and protein structures; and
- the development and implementation of tools that enable efficient access and
management of different types of information.
One of the simpler tasks in bioinformatics concerns the creation and maintenance of
databases of biological information. Nucleic acid sequences (and the protein sequences
derived from them) comprise the majority of such databases. While the storage and
organization of millions of nucleotides is far from trivial, designing a database and developing
an interface whereby researchers can both access existing information and submit new
entries is only the beginning.
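A very small sketch of such a database, using Python's built-in `sqlite3` module, is given here. The table layout, the column names, the `submit` and `fetch_by_organism` helpers and the `DEMO...` accession numbers are all hypothetical and chosen only to show the two operations mentioned above: submitting a new entry and accessing existing ones.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequences (
        accession TEXT PRIMARY KEY,
        organism  TEXT NOT NULL,
        kind      TEXT NOT NULL CHECK (kind IN ('DNA', 'RNA', 'protein')),
        residues  TEXT NOT NULL
    )
    """
)


def submit(conn, accession, organism, kind, residues):
    """Add a new entry; the transaction is committed when the block exits."""
    with conn:
        conn.execute(
            "INSERT INTO sequences VALUES (?, ?, ?, ?)",
            (accession, organism, kind, residues),
        )


def fetch_by_organism(conn, organism):
    """Return (accession, kind, residues) for every entry of one organism."""
    cursor = conn.execute(
        "SELECT accession, kind, residues FROM sequences "
        "WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return cursor.fetchall()


submit(conn, "DEMO0001", "Homo sapiens", "DNA", "ATGGCC")
submit(conn, "DEMO0002", "Homo sapiens", "protein", "MA")
print(fetch_by_organism(conn, "Homo sapiens"))
```

The primary key on `accession` makes the database reject a second submission under the same identifier, a minimal form of the consistency checking a shared database needs.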
The most pressing tasks in bioinformatics involve the analysis of sequence information.
Computational Biology is the name given to this process, and it involves the following:
- Finding the genes in the DNA sequences of various organisms
- Developing methods to predict the structure and/or function of newly discovered
proteins and structural RNA sequences.
- Clustering protein sequences into families of related sequences and the development of
protein models.
- Aligning similar proteins and generating phylogenetic trees to examine evolutionary
relationships.
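To make the alignment task in the list above concrete, here is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration; protein alignment in practice uses substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best score of aligning all of `a` against all of `b`."""
    # prev[j] holds the best score for a[:i-1] versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "AGT"))         # 1
```

Only the score is computed, using two rows of the table; recovering the alignment itself would require keeping the full table and tracing back through it.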
Data-mining is the process by which testable hypotheses are generated regarding the
function or structure of a gene or protein of interest by identifying similar sequences in
better characterized organisms. For example, new insight into the molecular basis of a
disease may come from investigating the function of homologs of the disease gene in
model organisms.
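The idea of finding similar sequences in better characterized organisms can be sketched with a crude, alignment-free similarity measure: the fraction of shared k-letter words (k-mers). This is only an assumed stand-in for real similarity search; the sequences and the names `yeast_gene` and `fly_gene` are made up for the example.

```python
def kmers(sequence: str, k: int = 3) -> set[str]:
    """All overlapping words of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    words_a, words_b = kmers(a, k), kmers(b, k)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_by_similarity(query: str, database: dict[str, str], k: int = 3):
    """Database entries sorted from most to least similar to the query."""
    scores = [(name, jaccard_similarity(query, seq, k)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical protein fragments from two model organisms.
database = {"yeast_gene": "MKTAYIAKQR", "fly_gene": "GGGSSSPPPL"}
ranking = rank_by_similarity("MKTAYIAKQL", database)
print(ranking[0][0])  # yeast_gene
```

A high-ranking hit is only a hypothesis about shared function, to be tested experimentally, which matches the role data-mining plays in the paragraph above.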
Equally exciting is the potential for uncovering phylogenetic relationships and evolutionary
patterns. The process of evolution has produced DNA sequences that encode proteins with
very specific functions. It is possible to predict the three-dimensional structure of a protein
using algorithms that have been derived from our knowledge of physics, chemistry and, most
importantly, from the analysis of other proteins with similar amino acid sequences.
Definition of Bioinformatics
Roughly, bioinformatics describes any use of computers to handle biological
information.
In practice the definition used by most people is narrower; bioinformatics to them is
a synonym for "computational molecular biology"- the use of computers to
characterize the molecular components of living things.
"Classical" bioinformatics:
"The mathematical, statistical and computing methods that aim to solve biological
problems using DNA and amino acid sequences and related information."
The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics
as:
"Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which to assess
relationships among members of large data sets; the analysis and interpretation of various
types of data including nucleotide and amino acid sequences, protein domains, and protein
structures; and the development and implementation of tools that enable efficient access
and management of different types of information."
Introduction
Bioinformatics derives knowledge from computer analysis of biological data. These
can consist of the information stored in the genetic code, but also experimental results from
various sources, patient statistics, and scientific literature. Research in bioinformatics
includes method development for storage, retrieval, and analysis of the data. Bioinformatics
is a rapidly developing branch of biology and is highly interdisciplinary, using techniques
and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics,
and linguistics. It has many practical applications in different areas of biology and
medicine.
Bioinformatics and computational biology involve the use or development of
techniques including applied mathematics, informatics, statistics, computer science,
artificial intelligence, chemistry, and biochemistry to solve biological problems usually on the
molecular level. The core principle of these techniques is using computing resources in
order to solve problems on scales of magnitude far too great for human discernment.
Research in computational biology often overlaps with systems biology. Major research
efforts in the field include sequence alignment, gene finding, genome assembly, protein
structure alignment, protein structure prediction, prediction of gene expression and
protein-protein interactions, and the modeling of evolution.
Bioinformatics has evolved into a full-fledged multidisciplinary subject that integrates
developments in information and computer technology as applied to Biotechnology and
Biological Sciences. Bioinformatics uses computer software tools for database creation,
data management, data warehousing, data mining and global communication networking.
Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval
of nucleic acid sequence (genes and RNAs), protein sequence and structural information.
This includes databases of the sequences and structural information as well as methods to
access, search, visualize and retrieve the information. Bioinformatics concerns the creation
and maintenance of databases of biological information whereby researchers can both
access existing information and submit new entries. Functional genomics, biomolecular
structure, proteome analysis, cell metabolism, biodiversity, downstream processing in
chemical engineering, drug and vaccine design are some of the areas in which
Bioinformatics is an integral component.
Sub-disciplines within bioinformatics
There are three important sub-disciplines within bioinformatics involving computational
biology:
- The development of new algorithms and statistics with which to assess relationships
among members of large data sets
- The analysis and interpretation of various types of data including nucleotide and
amino acid sequences, protein domains, and protein structures
- The development and implementation of tools that enable efficient access and
management of different types of information
Activities in bioinformatics
We can split the activities in bioinformatics into two areas: (1) the organization and (2) the
analysis of biological data.
Organization activity in Bioinformatics:
- The creation of databases of biological information
- The maintenance of these databases
Analysis activity in Bioinformatics:
- Development of methods to predict the structure and/or function of newly discovered
proteins and structural RNA [1] sequences.
- Clustering protein sequences into families of related sequences and the development of
protein models.
- Aligning similar proteins and generating phylogenetic trees to examine evolutionary
relationships
[1] RNA: ribonucleic acid, a long linear polymer of nucleotides found in the nucleus but mainly in
the cytoplasm of a cell, where it is associated with microsomes; it transmits genetic information
from DNA to the cytoplasm and controls certain chemical processes in the cell.
Aims of Bioinformatics:
The aims of bioinformatics are basically three-fold. They are:
- Organization of data in such a way that it allows researchers to access existing
information and to submit new entries as they are produced. While data creation is an
essential task, the information stored in these databases is useless unless analysed.
Thus the purpose of bioinformatics extends well beyond mere volume control.
- Development of tools and resources that help in the analysis of data. For example, having
sequenced a particular protein, it is of interest to compare it with previously characterized
sequences. This requires more than just a straightforward database search. As such, programs
such as BLAST must consider what constitutes a biologically significant resemblance.
Development of such resources requires extensive knowledge of computational theory, as
well as a thorough understanding of biology.
- Use of these tools to analyse individual systems in detail, and frequently to compare
them with a few that are related.
Bioinformatics and its scope:
Bioinformatics uses advances in computer science, information science,
computer and information technology, and communication technology to solve complex
problems in life sciences and particularly in biotechnology. Data capture, data warehousing
and data mining have become major issues for biotechnologists and biological scientists
due to the sudden growth in quantitative data in biology, such as complete genomes of
biological species including the human genome, protein sequences, protein 3-D structures,
metabolic pathway databases, cell line and hybridoma information, and biodiversity-related
information. Advancements in information technology, particularly the Internet, are being
used to gather and access ever-increasing information in biology and biotechnology.
Functional genomics, proteomics, discovery of new drugs and vaccines, molecular
diagnostic kits and pharmacogenomics are some of the areas in which bioinformatics has
become an integral part of Research & Development. The knowledge of multimedia
databases, tools to carry out data analysis and modeling of molecules and biological
systems on computer workstations as well as in a network environment has become
essential for any student of Bioinformatics.
Bioinformatics, the multidisciplinary area, has grown so much that one divides it into
molecular bioinformatics, organal bioinformatics and species bioinformatics. Issues related
to biodiversity and environment, cloning of higher animals such as Dolly and Polly, tissue
culture and cloning of plants have shown that Bioinformatics is not only a support
branch of science but also a subject that directs the future course of research in
biotechnology and life sciences. The importance and usefulness of Bioinformatics has been
realized in the last few years by many industries. Therefore, large Bioinformatics R & D
divisions are being established in many pharmaceutical companies, biotechnology
companies and even in other conventional industry dealing with biological. Bioinformatics is
thus rated as number one career in the field of biosciences.
In short, Bioinformatics deals with database creation, data analysis and modeling.
Data capturing is done not only from printed material but also from network resources.
Databases in biology are generally in multimedia form, organized in a relational database
model. Modeling is done not only on single biological molecules but also on multiple systems,
thus requiring the use of high-performance computing systems.
The Potential of Bioinformatics:
The potential of Bioinformatics in the identification of useful genes leading to the
development of new gene products, drug discovery and drug development has led to a
paradigm shift in biology and biotechnology: these fields are becoming more and more
computationally intensive. The new paradigm, now emerging, is that all the genes will be
known "in the sense of being resident in database available electronically", and the starting
point of biological investigation will be theoretical: a scientist will begin with a theoretical
conjecture and only then turn to experiment to follow or test the hypothesis. With a much
deeper understanding of biological processes at the molecular level, Bioinformatics
scientists have developed new techniques to analyse genes on an industrial scale, resulting
in a new area of science known as 'Genomics'.
The shift from gene biology to genomics has resulted in the development of strategies, from lab
techniques to computer programmes, to analyse whole batches of genes at once. Genomics is
revolutionizing drug development, gene therapy, and our entire approach to health care and
human medicine.
The genomic discoveries are being translated into practical biomedical results
through Bioinformatics applications. Work on proteomics and genomics will continue using
highly sophisticated software tools and data networks that can carry multimedia databases.
Thus, the research will be in the development of multimedia databases in various areas of
life sciences and biotechnology. There will be an urgent need for development of software
tools for data mining, analysis and modelling, and downstream processing. Security of data,
data transfer and data compression, and automatic checks on data accuracy and correctness will
also be major research areas of bioinformatics. The use of virtual reality in drug design,
metabolic pathway design, and unicellular organism design, paving the way to the design and
modification of multicellular organisms, will be among the challenges which
Bioinformatics scientists and specialists have to tackle. It has now been universally
recognized that Bioinformatics is the key to the new grand data-intensive molecular biology
that will take us into the 21st century.
Bioinformatics - Industry Overview
The Bioinformatics industry has grown to keep up with the information explosion,
growing at 25-50% a year. In 2000, the US market research company Oscar Gruss
estimated that the value of the Bioinformatics industry would touch $2 billion. Now it
seems likely that this figure will be reached within the coming year. Demand for
individuals capable of doing bioinformatics is soaring: industry's demand for
scientists with skills in Bioinformatics far exceeds the supply of qualified specialists in the
field. Therefore, companies are developing methods of spotting potential Bioinformatics
experts and then training them on the job.
Bioinformatics and drug discovery:
In recent years, we have seen an explosion in the amount of biological information
that is available. Various databases are doubling in size every 15 months and we now have
the complete genome sequences of more than 100 organisms. It appears that the ability to
generate vast quantities of data has surpassed the ability to use this data meaningfully. The
pharmaceutical industry has embraced genomics as a source of drug targets. It also
recognises that the field of bioinformatics is crucial for validating these potential drug
targets and for determining which ones are the most suitable for entering the drug
development pipeline.
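The "doubling every 15 months" figure quoted above implies exponential growth, which a short calculation makes tangible. The sketch assumes the doubling time stays constant, which is an idealization; the function name is an illustrative choice.

```python
def growth_factor(elapsed_months: float, doubling_months: float = 15.0) -> float:
    """Factor by which a database grows if it doubles every `doubling_months`."""
    return 2.0 ** (elapsed_months / doubling_months)


print(round(growth_factor(12), 2))  # 1.74, i.e. about 74% growth per year
print(growth_factor(60))            # 16.0, a sixteen-fold increase in five years
```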
Recently, there has been a change in the way that medicines are being developed
due to our increased understanding of molecular biology. In the past, new synthetic organic
molecules were tested in animals or in whole organ preparations. This has been replaced
with a molecular target approach in which in-vitro screening of compounds against purified,
recombinant proteins or genetically modified cell lines is carried out with a high throughput.
This change has come about as a consequence of better and ever improving knowledge of
the molecular basis of disease.
All marketed drugs today target only about 500 gene products. The elucidation of the
human genome, which early estimates put at around 30,000 genes, presents immense new
opportunities for drug discovery and simultaneously creates a potential bottleneck regarding
the choice of targets to support the drug discovery pipeline. The major advances in
genomics and sequencing mean that finding an attractive target is no longer a problem, but
finding the targets that are most likely to succeed has become the challenge. The focus of
bioinformatics in the drug discovery process has therefore shifted from target identification
to target validation.
The accumulation of this information into databases about potential targets means
that the pharmaceutical companies can save themselves much of the time, effort and expense
of bench work on targets that will ultimately fail. The information that is gathered
helps to characterise the different targets into families and subfamilies. It also classifies the
behaviour of the different molecules in a biochemical and cellular context.
Bioinformatics and computational biology:
Bioinformatics and computational biology each maintain close interactions with life
sciences to realize their full potential. Bioinformatics applies principles of information
sciences and technologies to make the vast, diverse, and complex life sciences data more
understandable and useful. Computational biology uses mathematical and computational
approaches to address theoretical and experimental questions in biology. Although
bioinformatics and computational biology are distinct, there is also significant overlap and
activity at their interface.
Biocomputing
Biocomputing is often used as a catch-all term covering the whole area at the
intersection of Biology and Computation, although many other terms are used to name the
same area. We can distinguish four (non-disjoint) sub-fields:
- Bioinformatics: this includes management of biological databases, data mining and
data modeling, as well as IT tools for data visualization
- Computational Biology: this includes efforts to solve biological problems with
computational tools (such as modeling, algorithms, heuristics)
- DNA [2] computing and nano-engineering: this includes models and experiments to
use DNA (and other) molecules to perform computations
- Computations in living organisms: this is concerned with constructing computational
components in living cells, as well as with studying computational processes taking
place daily in living organisms
Computational Biology
Computational Biology is the application of the core technology of computer science (e.g.
algorithms, artificial intelligence, databases) to problems arising from biology.
Computational biology is particularly exciting today because the problems are large enough
to motivate efficient algorithms and, moreover, the demands of biology on computational
science are increasing.
The most pressing tasks in bioinformatics involve the analysis of sequence information.
Computational Biology is the name given to this process, and it involves the following:
- Finding the genes in the DNA sequences of various organisms
- Developing methods to predict the structure and/or function of newly discovered
proteins and structural RNA sequences.
- Clustering protein sequences into families of related sequences and the
development of protein models.
- Aligning similar proteins and generating phylogenetic trees to examine evolutionary
relationships.
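The first task in the list, finding genes in DNA sequences, can be illustrated in its simplest form by scanning for open reading frames (ORFs): stretches that begin with the start codon ATG and run, codon by codon, to a stop codon. The sketch below is a deliberately naive assumption-laden version: it searches only the given strand, ignores introns and regulatory signals, and the `min_codons` threshold and function name are illustrative.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, sequence) for each ORF in the three forward frames.

    `start` and `end` are 0-based, with `end` exclusive; the stop codon is included.
    `min_codons` counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # close the frame and keep scanning
    return orfs


print(find_orfs("CCATGGCCAAGTGATT"))  # [(2, 14, 'ATGGCCAAGTGA')]
```

Practical gene finders combine signals like these with statistical models of coding sequence, and also examine the complementary strand.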
Conclusion:
The ultimate goal of bioinformatics is to uncover the wealth of biological information
hidden in the mass of sequence, structure, literature and other biological data, to obtain a
clearer insight into the fundamental biology of organisms, and to use this information to
enhance the standard of life for mankind. It is being used now, and will be used in the
foreseeable future, in molecular medicine to help produce better and more customised medicines
to prevent or cure diseases; it has environmental benefits in identifying waste cleanup
bacteria; and in agriculture it can be used for producing high-yield, low-maintenance crops.
These are just a few of the many benefits bioinformatics will help develop.
Genomics and bioinformatics will influence not only science but also society in many ways:
from crop cultivation and food production to health care and life insurance, and from crime
investigation and personal identification to computer chip fabrication and the development of
genetic modification law.
[2] DNA: deoxyribonucleic acid, a long linear polymer found in the nucleus of a cell, formed from
nucleotides and shaped like a double helix; associated with the transmission of genetic
information. "DNA is the king of molecules"