Bioinformatics
The application of computational
techniques to understand and organise
the information associated with
biological macromolecules
Aims of Bioinformatics
1. to organise data in a way that allows researchers to
access existing information and to submit new
entries as they are produced
2. to develop tools and resources that aid in the
analysis of data
3. to conduct global analyses of all the available data
with the aim of uncovering common principles that
apply across many systems and highlight novel
features
Source of data
1. DNA or protein sequences
2. Macromolecular structures
3. Results of functional genomics and proteomics experiments (gene expression data)
DNA or Protein sequences
DNA sequences are strings of the 4
base-letters comprising genes, each
typically 1,000 bases long. The largest
database contains at least 27 million entries.
Protein sequences are strings of the
20 amino acid letters. At present more
than 400,000 protein sequences are
known.
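Since both kinds of sequence are plain strings over a small alphabet, a short sketch can make the idea concrete. The function name `classify_sequence` and the decision rule (looking only at which letters occur) are assumptions made for illustration; a short peptide built only from A, C, G and T would be reported as DNA.

```python
# The four DNA base letters and the 20 standard amino acid one-letter codes.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def classify_sequence(seq):
    """Return 'dna', 'protein' or 'unknown' based only on the letters used."""
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_ALPHABET:        # every letter is a DNA base
        return "dna"
    if letters <= PROTEIN_ALPHABET:    # every letter is a standard amino acid
        return "protein"
    return "unknown"


print(classify_sequence("ATGGCC"))
print(classify_sequence("MKTAYIAKQR"))
```

The example sequences are made up and are not taken from any database.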
Size of data
Biological data are being produced at a phenomenal rate.
As of April 2001:
- the GenBank database of nucleic acid sequences contained
11,546,000 entries
- the SwissProt database of protein sequences contained
95,320 entries
These databases doubled in
size in 15 months.
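To get a feeling for what a 15-month doubling time means, the following sketch extrapolates the April 2001 counts. It assumes the doubling time stays constant, which the slide does not claim; the function name `projected_entries` is hypothetical.

```python
def projected_entries(initial_entries, months_elapsed, doubling_months=15.0):
    """Extrapolate a database size assuming a constant doubling time."""
    return initial_entries * 2 ** (months_elapsed / doubling_months)


GENBANK_APRIL_2001 = 11_546_000   # figure quoted on the slide
SWISSPROT_APRIL_2001 = 95_320     # figure quoted on the slide

for months in (15, 30):
    print(months, projected_entries(GENBANK_APRIL_2001, months))
```

Under this assumption each database grows fourfold in 30 months, so the numbers are an illustration of exponential growth rather than a forecast.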
Size of data
Anthony Kerlavage of Celera
recently stated that
an experimental laboratory can
produce over 100 gigabytes of data
per day with ease.
Biological processing power
This incredible processing
power has been matched by
developments in computer
technology
Areas of improvement
- CPU (faster computations)
- disk storage (better data storage)
- Internet (revolutionised the methods for accessing and exchanging data)
Macromolecular structure
There are currently 15,000 entries in the
Protein Data Bank (PDB).
The PDB contains atomic structures
(xyz-coordinates) of proteins, DNA and RNA
solved by X-ray crystallography and NMR.
A typical PDB file contains the
xyz-coordinates of ca. 2000 atoms.
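As a rough sketch of what reading such coordinates involves: in the fixed-width PDB text format, ATOM and HETATM records carry the atom name in columns 13-16 and x, y, z in columns 31-38, 39-46 and 47-54. The helper `make_atom_record` is hypothetical and exists only to build a demo line with the columns in the right place; the coordinate values are invented.

```python
def parse_atom_record(line):
    """Extract (atom_name, x, y, z) from one fixed-width ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        return None                      # not a coordinate record
    name = line[12:16].strip()           # columns 13-16
    x = float(line[30:38])               # columns 31-38
    y = float(line[38:46])               # columns 39-46
    z = float(line[46:54])               # columns 47-54
    return name, x, y, z


def make_atom_record(serial, name, residue, chain, res_seq, x, y, z):
    """Build a minimal ATOM line (hypothetical helper, for the demo only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {residue:>3s} {chain}{res_seq:4d}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


line = make_atom_record(1, "CA", "MET", "A", 1, 27.34, 24.43, 2.614)
print(parse_atom_record(line))
```

A real file has further fields (occupancy, temperature factor, element) and many other record types, which this sketch ignores.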
Gene expression data
These experiments measure the amount of
mRNA (functional genomics) or protein
(proteomics) that is produced by the cell
under different conditions, at different stages of
the cell cycle and in different cell types in multicellular organisms.
One of the largest datasets available contains
approximately 20 time-point measurements
for 6,000 genes (yeast).
Gene expression data
From an experimental point of view, it is
possible to determine the expression levels
of almost every gene in a given cell on a
whole-genome level.
However, there is currently no central
repository for these data and public
availability is limited.
Biological data
Datasets differ widely in size and complexity.
Although macromolecular structures and
gene expression experiments provide
much more biological information than
raw sequence data, there are invariably
more sequence-based data than any other kind.
Why?
- Because of the relative ease with
which they can be produced
- Because they can be handled easily
both by biologists and by
computer scientists, even those with very
little biological background
Gene expression data
On the other hand, gene expression data are
far more complex to manage, and:
1. biologists rarely achieve mathematical
competence beyond elementary
calculus and maybe a few statistical
formulae.
2. although everybody uses a computer,
biologists rarely use anything but
standard commercial software.
3. people without a biological background
can find it surprisingly difficult to master
the complex and apparently
unconnected information that is the
working knowledge of every biologist.
Source of data
4. Genomic-scale data include
biochemical information on metabolic
pathways, regulatory networks,
protein-protein interactions and data
from two-hybrid experiments and
systematic knockouts of individual
genes
Integration
Integration of multiple sources of data.
At a basic level, this problem is frequently
addressed by providing external links to
other databases.
At a more advanced level, integrated
access across several data sources is
provided.
Data organisation
The first biological databases were simple
flat files.
At present most of them are
relational databases with Web-page interfaces.
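The difference between the two styles of organisation can be sketched with Python's built-in `sqlite3` module. The flat-file layout below (two-letter line tags and a `//` terminator) is a simplified, hypothetical format loosely in the spirit of tagged sequence flat files; the entries, the table name `protein` and its columns are invented for illustration.

```python
import sqlite3

FLAT_FILE = """\
ID   SEQ1
DE   hypothetical example protein
SQ   MKTAYIAKQR
//
ID   SEQ2
DE   second hypothetical entry
SQ   MSTNPKPQRK
//
"""


def parse_flat_file(text):
    """Parse the tagged flat file into a list of {tag: value} dictionaries."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
        elif line.strip():
            current[line[:2]] = line[5:].strip()
    return entries


def load_into_sqlite(entries):
    """Store the parsed entries in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [(e["ID"], e["DE"], e["SQ"]) for e in entries],
    )
    return conn


conn = load_into_sqlite(parse_flat_file(FLAT_FILE))
rows = conn.execute("SELECT id FROM protein WHERE sequence LIKE 'MK%'").fetchall()
print(rows)
```

With a flat file every question requires re-reading and re-parsing the text, whereas the relational form lets a query such as the `SELECT` above be expressed declaratively.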
Data and Software Tools
For example:
- software for gene finding (identification of coding regions)
- software for similarity searches
- multiple sequence alignments and searching for functional domains
- homology modeling
- calculations of surface and volume shapes and analysis of protein interactions with DNA, RNA, other proteins or drugs (chemoinformatics)
Similarity searching
Having sequenced a particular protein, it is of
interest to compare it with previously characterised
sequences.
This needs more than a simple text-based search:
these programs must consider what
constitutes a biologically significant match.
Biologically significant match:
- two sequences share a common function
- two sequences share a common
evolutionary history (homologs)
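One classical way to go beyond exact text matching is local alignment. The sketch below computes a Smith-Waterman local alignment score with a linear gap penalty; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions, whereas real search programs typically use amino acid substitution matrices and estimate the statistical significance of a hit.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    previous = [0] * (len(b) + 1)          # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("XXACGTXX", "ACGT"))
```

A high score only says that two sequences share a similar region; deciding whether that reflects common function or common ancestry still needs biological judgement.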
Homology modeling
At a structural level, a finite number of different
tertiary structures is predicted to exist: estimates
range between 1,000 and 10,000 folds.
A structure can be predicted in a homology-based
manner, by comparison with known structures (3-D
structural alignments).
Although the number of structures in the PDB db
has increased exponentially, the rate of discovery
of novel folds has actually decreased.
Ab initio structure prediction
Prediction of the 3-D structure is
based on the protein sequence
only, e.g. on the propensity of
certain amino acid combinations
to produce secondary structural
elements.
Data exploration
Finding relationships between different proteins:
- Analysis of one type of data to infer and
understand the observations for another type
of data
- Comparative analysis to do classification
Expansion of biological analysis in two
dimensions, depth and breadth
Expansion of biological analysis in two dimensions: depth
Example: Rational drug design
This approach takes a single gene and follows through
an analysis that maximises our understanding of the
protein it encodes.
Prediction algorithms can then be used to calculate the
structure and to form hypotheses about its function.
Geometry calculations can define the shape of the
protein’s surface and identify or design ligands that can
become drugs specifically altering the protein’s
function.
Expansion of biological analysis in two dimensions: breadth
Example: comparison of a gene or a
gene product with others.
This approach can extract sequence patterns or
structural templates that define a family of proteins
sharing a common property.
It can also be used to construct phylogenetic
trees to trace evolution, e.g. that of the SARS virus.
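A sequence pattern of this kind can be written as a regular expression. As an illustration, the sketch below searches for the N-glycosylation site pattern N-{P}-[ST]-{P} as catalogued in PROSITE (asparagine, any residue but proline, serine or threonine, any residue but proline). The function name `find_motif` and the test sequence are made up for the example.

```python
import re

# N-{P}-[ST]-{P} rewritten as a regular expression; the lookahead lets
# overlapping occurrences be reported as separate hits.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (1-based position, matched text) for every hit in the sequence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]


print(find_motif("MKNGSAAANPTQ"))
```

In the made-up sequence the first asparagine is reported, while the second is rejected because it is followed by a proline.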
Sequence analysis
Techniques mainly include string
comparison methods.
Motif and pattern identification
and classification
depend on:
- Machine learning
- Clustering and data mining techniques
3-D structural analysis
Techniques include:
- Euclidean geometry calculations
- Basic application of physical chemistry
- Graphical representation of surfaces and
volumes
- Structural comparison (3-D matching)
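The Euclidean geometry involved can be quite elementary. The following sketch shows three typical calculations on atomic coordinates: the distance between two atoms, the centroid of a set of atoms, and the root-mean-square deviation (RMSD) between two equally sized coordinate sets. It assumes the two structures are already superimposed; real 3-D matching would first search for the best rotation and translation. All function names are illustrative.

```python
import math


def distance(p, q):
    """Euclidean distance between two points (requires Python 3.8+)."""
    return math.dist(p, q)


def centroid(points):
    """Geometric centre of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def rmsd(coords_a, coords_b):
    """RMSD between paired coordinates, without any superposition step."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


print(distance((0, 0, 0), (3, 4, 0)))
print(centroid([(0, 0, 0), (2, 4, 6)]))
```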
Bio Informatics
This unexpected union between the two
subjects is attributed to the fact that life
itself is an information technology.
An organism’s physiology is largely
determined by its genes, which at their
most basic can be viewed as digital
information.
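The "digital" view can be taken literally: with four possible bases, each position of a DNA sequence carries two bits. The sketch below packs a sequence into an integer and back; the particular bit assignment (A=00, C=01, G=10, T=11) and the function names are arbitrary choices for illustration.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_dna(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value, len(seq)     # the length is needed to restore leading A's


def decode_dna(value, length):
    """Unpack an integer produced by encode_dna back into a DNA string."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])   # read the last two bits
        value >>= 2
    return "".join(reversed(bases))


packed, n = encode_dna("ACGT")
print(packed, decode_dna(packed, n))
```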