What Is Bioinformatics?
• Using computers to solve problems in
biology.
• Advances in biology have generated large
amounts of data; it is no longer possible to
categorize or search it all manually.
• Advances in computers have made it
possible to investigate problems that were
formerly too computationally intensive to
tackle.
Some Areas Of Bioinformatics
• Creation and maintenance of databases of
DNA and protein sequences.
• Prediction of protein structure and function.
• Construction of ancestry trees of organisms.
Why Study Bioinformatics?
• Solve interesting scientific and technological
problems.
• Help cure diseases.
• Job opportunities.
The Central Dogma
Proposed by Francis Crick in 1958 to describe the
flow of information in a cell.
DNA → RNA → Protein
Information stored in DNA is transferred
residue-by-residue to RNA, which in turn transfers
the information residue-by-residue to protein.
The Central Dogma was proposed by Crick to help
scientists think about molecular biology. It has
undergone numerous revisions in the past 45
years.
Central Dogma
• Replication: duplication of DNA using DNA as the template
• Transcription: synthesis of RNA using DNA as the template
• Translation: synthesis of proteins using RNA as the template
DNA → RNA → Protein
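The transcription and translation steps above can be sketched in a few lines of Perl. This is a minimal illustration, not a full implementation: the codon table holds only five of the 64 codons, the input sequence is made up, and transcription is modeled by copying the coding strand with T replaced by U (the cell actually reads the complementary template strand). The names transcribe and translate are chosen for this example.

```perl
use strict;
use warnings;

# Partial codon table (RNA codons); a real table has all 64 entries.
my %codon_table = (
    AUG => 'M', UUU => 'F', GGC => 'G', GCU => 'A', UAA => '*',
);

# Transcription (simplified): copy the coding strand, replacing T with U.
sub transcribe {
    my ($dna) = @_;
    (my $rna = uc $dna) =~ tr/T/U/;
    return $rna;
}

# Translation: read the RNA three residues at a time until a stop codon.
sub translate {
    my ($rna) = @_;
    my $protein = '';
    while ($rna =~ /(...)/g) {
        my $aa = $codon_table{$1} // 'X';   # X marks codons missing from the partial table
        last if $aa eq '*';                 # stop codon ends the protein
        $protein .= $aa;
    }
    return $protein;
}

my $dna     = 'ATGTTTGGCGCTTAA';   # made-up coding-strand sequence
my $rna     = transcribe($dna);
my $protein = translate($rna);
print "RNA:     $rna\n";
print "Protein: $protein\n";
```

Each amino acid is written with its one-letter code, so the protein comes out as a short string that can be compared or searched like any other sequence.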
Central Dogma: DNA
DNA → RNA → Protein
DNA (deoxyribonucleic acid) stores the
blueprint for cell-specific synthesis
of proteins
necessary for life.
Molecular Structure of DNA
[Figure labels: base thymine (pyrimidine); monophosphate; sugar 2'-deoxyribose; base adenine (purine)]
DNA: Nucleoside Structure
• nucleoside: base + sugar
• nucleotides (nucleoside mono-, di-, and triphosphates): base + sugar + phosphate(s)
DNA: Molecule Structure
• DNA is double stranded.
• DNA strands are antiparallel.
• G-C pairs have 3 hydrogen bonds.
• A-T pairs have 2 hydrogen bonds.
• One strand is the complement of the other.
• Major and minor grooves present different surfaces for interaction.
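Because the strands are complementary and antiparallel, one strand determines the other: complement each base, then reverse the result. A minimal Perl sketch of that rule follows, together with a hydrogen-bond count based on the pairing numbers above. The function names and the six-base input are made up for illustration.

```perl
use strict;
use warnings;

# Reverse complement: complement each base, then reverse, because the
# two strands run antiparallel.
sub reverse_complement {
    my ($strand) = @_;
    (my $comp = uc $strand) =~ tr/ACGT/TGCA/;
    return scalar reverse $comp;
}

# Hydrogen bonds in the duplex: 3 per G-C pair, 2 per A-T pair.
sub hydrogen_bonds {
    my ($strand) = @_;
    my $s  = uc $strand;
    my $gc = ($s =~ tr/GC//);   # tr with an empty replacement only counts
    my $at = ($s =~ tr/AT//);
    return 3 * $gc + 2 * $at;
}

my $strand = 'ATGCGC';          # made-up example
print "Other strand:   ", reverse_complement($strand), "\n";
print "Hydrogen bonds: ", hydrogen_bonds($strand), "\n";
```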
Central Dogma: RNA
DNA → RNA → Protein
RNA (ribonucleic acid) carries the
instructions from the
cell nucleus to the
cytoplasm for
synthesis of
proteins.
RNA Terminology
Base        Nucleoside (RNA)      Deoxynucleoside (DNA)
Adenine     Adenosine             Deoxyadenosine
Guanine     Guanosine             Deoxyguanosine
Cytosine    Cytidine              Deoxycytidine
Uracil      Uridine               (not usually found)
Thymine     (not usually found)   (Deoxy)thymidine
[Figure labels: base + sugar = nucleoside]
RNA: Structure
• RNA can be single or double stranded
• G-C pairs have 3 hydrogen bonds
• A-U pairs have 2 hydrogen bonds
• Single-stranded, double-stranded, and loop RNA present different surfaces
Central Dogma: Proteins
DNA → RNA → Protein
Proteins are synthesized in
the cytoplasm using the
message on messenger RNA. Proteins serve as the
basis for the cellular
structure, function,
communications and
metabolism.
Amino Acids: Protein Building Blocks
The 20 Amino Acids
carboxyl group
amino group
Protein Structure
α-helix
antiparallel β-sheet
Biological Databases
Problems of Biological Databases
• Biologists have sequenced (determined the
base pairs of) a large amount of DNA.
• In addition to the raw sequence data,
ancillary information about each sequence
must also be stored, such as what species it
is from, who discovered it, what protein it
encodes, and the function of the protein.
GenBank
• To handle all this information and make it
available to researchers, biologists have set
up several different databases, each
specializing in one aspect of the data.
• The main one in the United States that deals
with DNA sequences is called GenBank.
• It is maintained by the NCBI, a branch of the
National Institutes of Health.
EMBL and DDBJ
• There are two other DNA databases: EMBL
in Europe and DDBJ in Japan.
• Researchers can submit their sequences to
any of these; they exchange information
daily to keep the databases in
synchronization.
• The amount of data is huge: over 30 million
sequences and almost 40 billion base pairs.
(As of February, 2004.)
Other Databases
• In addition to the sequence, other databases
track the proteins that result from the
transcription and translation of the DNA, the
function of these proteins (ontology), the
species the DNA came from (taxonomy), and
the authors and journals of the reports
published about the sequence.
Accession Numbers
• The key to tying the records that appear in
different databases together is the accession
number.
• This is assigned to the sequence when it is
first submitted, and is subsequently used by
all other databases.
• It is either a six character (one letter followed
by five numbers) or eight character (two
letters, six numbers) field.
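The two accession number formats described above can be checked with a single pattern. The sketch below assumes the letters are uppercase, which the slide does not state; is_accession is a name chosen for this example, and the pattern checks format only, not whether a record exists.

```perl
use strict;
use warnings;

# True if $id has one of the two formats described above:
# one letter + five digits, or two letters + six digits.
# Assumption: letters are uppercase.
sub is_accession {
    my ($id) = @_;
    return $id =~ /\A(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\z/ ? 1 : 0;
}

for my $id ('U01317', 'AF162692', 'AF16269', '12345A') {
    printf "%-10s %s\n", $id,
        is_accession($id) ? 'valid format' : 'not a valid format';
}
```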
Database Schema
• Taxonomy: SeqAccession, Version, Kingdom, Phylum, Species
• Feature: FeatureId, SeqAccession, Version, StartLoc, EndLoc, Date, Author
• Sequence: SeqAccession, Version, Source, Author, Date, Length
• Publications: SeqAccession, Version, PubMed-Id
• Feature Data: FeatureId, Details, URL
• Sequence Data: SeqAccession, Version, Data
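In this schema the SeqAccession and Version fields appear in most tables, so together they act as the key that ties records together. The Perl sketch below illustrates the idea with in-memory hashes instead of a real database. Every record in it is invented, including the accession ZZ000001, and the Key field simply combines SeqAccession and Version for brevity.

```perl
use strict;
use warnings;

# Invented records; field names follow the schema, values are made up.
my %sequence = (
    'ZZ000001.1' => { Source => 'example source', Length => 1200 },
);
my %taxonomy = (
    'ZZ000001.1' => { Kingdom => 'Animalia', Phylum => 'Chordata',
                      Species => 'Homo sapiens' },
);
my @features = (
    { FeatureId => 1, Key => 'ZZ000001.1', StartLoc => 10,  EndLoc => 250 },
    { FeatureId => 2, Key => 'ZZ000001.1', StartLoc => 400, EndLoc => 900 },
);

# Join the three tables on accession + version and summarize the result.
sub describe {
    my ($acc, $version) = @_;
    my $key = "$acc.$version";
    return unless exists $sequence{$key};    # unknown record
    my @spans = map  { "$_->{StartLoc}..$_->{EndLoc}" }
                grep { $_->{Key} eq $key } @features;
    return "$key ($taxonomy{$key}{Species}, $sequence{$key}{Length} bp): "
         . "features at " . join(', ', @spans);
}

print describe('ZZ000001', 1), "\n";
```

A relational database would express the same lookup as a join on SeqAccession and Version.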
GenBank Format
• GenBank uses a flat file format. An example:
GenBank Schema (portion)
Usable via:
•Web interface at NCBI
•http://www.ncbi.nlm.nih.gov/BLAST/
•Local web server
•Download database and search engine to
personal computer
Types of Sequence Searches
•DNA
•Protein
•Translated searches
•Pairwise
•Genomic
•Specialized
•Existing searches
Nucleotide Databases
•NR: All non-redundant
•Month: Last 30 days
•EST: Expressed Sequence Tags
•EST_Human
•EST_Mouse
•HTG: High Throughput Genomic
•Yeast
•Ecoli
GenBank Format
The heading gives the accession number, a brief description
(if known) and the date submitted.
GenBank Format
• The next section indicates which organism the DNA
was obtained from; in this case, a human. Clicking
on the hyperlink will take us to the taxonomy
database, for more details.
Human Taxonomy Detail
GenBank Format
• This portion names the researchers who
sequenced the DNA, and tells in what journal the
paper describing it may be found. The hyperlink will
take us to the actual article.
Journal Article
Genomics. 1998 Dec 15;54(3):542-55.
A long terminal repeat of the human endogenous retrovirus ERV-9 is located in
the 5' boundary area of the human beta-globin locus control region.
Long Q, Bengra C, Li C, Kutlar F, Tuan D.
Department of Medicine, Medical College of Georgia, Augusta, Georgia, 30912, USA.
Transcription of the human beta-like globin genes in erythroid cells is regulated by the
far-upstream locus control region (LCR). In an attempt to define the 5' border of the
LCR, we have cloned and sequenced 5 kb of new upstream DNA. We found an LTR
retrotransposon belonging to the ERV-9 family of human endogenous retroviruses in
the apparent 5' boundary area of the LCR. This ERV-9 LTR contains an unusual U3
enhancer region composed of 14 tandem repeats with recurrent GATA, CACCC, and
CCAAT motifs. This LTR is conserved in human and gorilla, indicating its evolutionary
stability in the genomes of the higher primates. In both recombinant constructs and the
endogenous human genome, the LTR enhancer and promoter activate the
transcription of cis-linked DNA preferentially in erythroid cells. Our findings suggest the
possibility that this LTR retrotransposon may serve a relevant host function in
regulating the transcription of the beta-globin LCR. Copyright 1998 Academic Press.
GenBank Format
• Here is a portion of the actual sequence, in this
case for a gene that encodes a part of the
hemoglobin molecule. The letters a, c, t, and g represent
adenine, cytosine, thymine, and guanine,
respectively.
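Sequence lines in a GenBank flat file carry a position number followed by blocks of ten lowercase bases. The Perl sketch below shows one way to strip the numbering and count each base. The two input lines are made up to imitate that layout and are not taken from a real record.

```perl
use strict;
use warnings;

# Made-up lines in the layout of a GenBank sequence section:
# a position number followed by blocks of ten lowercase bases.
my @lines = (
    '        1 gaattctaat ctccctctca',
    '       21 accctacagt',
);

# Concatenate the bases, dropping position numbers and spacing.
my $seq = '';
for my $line (@lines) {
    (my $bases = $line) =~ s/[\d\s]//g;
    $seq .= $bases;
}

# Tally how often each base letter occurs.
my %count;
$count{$_}++ for split //, $seq;

print "Length: ", length($seq), "\n";
print "$_: $count{$_}\n" for sort keys %count;
```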
Protein, Ontology, Etc
• [slides]
DB Relations
Other Biological Databases
• TIGR dbs, e.g. Chlamydia trachomatis
• ACEdb
• Globin Gene Server
• DAS – Distributed Annotation System
The Institute for Genomic Research
• [history, Venter, etc]
• [collection of microbial databases]
TIGR: Chlamydia
• [screen shot of query page]
TIGR: Chlamydia
ACEdb
• http://www.wormbase.org/
• Originally designed for Caenorhabditis
elegans (a small worm), now used for many
organisms.
• Object oriented.
ACEdb BNF Definition
• BNF Grammar for the ACEDB Models
<models> ::= <model> | <model> <models> ;
<model> ::= ?<model name> <unique> <tag column>   /* For classes */
          | #<model name> <unique> <tag column>   /* For constructed types */ ;
<tag column> ::= <tag node> | <tag node> NL <tag column> ;
<tag node> ::= <tag>
             | <tag> <unique> <data cluster>
             | <tag> <unique> START_INDENT <tag column> END_INDENT
             /* In addition, in ACEDB 1-x, we allowed */
             | <tag> <unique> START_INDENT <data cluster> NL <tag column> END_INDENT
             /* This construction however can lead to ambiguities when parsing data and */
             /* will be forbidden in release 2-x */ ;
<data cluster> ::= <data type list> | <data type list> REPEAT
                 | <data type list> #<model name> | #<model name> ;
<data type list> ::= <data type> | <data type> <unique> <data type list> ;
<unique> ::= <null> | UNIQUE ;
<data type> ::= <primitive data type> | <class reference>
              | ANY   /* reserved for kernel use */ ;
<primitive data type> ::= Int | Text | Float ;
<class reference> ::= ? <class name> | ? <class name> XREF <tag name> ;
• Etc.
ACEdb
• [screen shots]
• [live links]
Globin Gene Server
• Catalogs variants in hemoglobin genes that
can cause diseases such as sickle-cell
anemia and beta-thalassemia.
• [live links]
Globin Server Schema
TABLE 1. A Synopsis of the Schema for HbVar, a Database of Human Hemoglobin Variants and Thalassemias
Name
Category
Type of Thalassemia
Description:
Chain
Residue number
Substitutions
Insertions
Deletions
Fusion gene Hbs
Contact
Haplotype
Hematology:
Genotype
Hematological findings
Modifier
Condition
Laboratory findings
Assay
Range
Units
Other factors
Electrophoresis
Method
Quantitative result
Chromatography
Method
Stability
Relative stability
Dissociation
Other stability information
Occurrence
Ethnic background
Frequency
Structure studies
Separation of hemoglobins
Separation of globin chains
Methods
Protein analysis
DNA analysis
Functional studies
Study
Result
What the study covered
Comments on the variant
References
Authors/editors
Journal articles
Other references
Globin Gene Server
Distributed Annotation System - DAS
• [discussion]
• [import and modify slides from project]
APIs for Database Access
Bio-PERL
• Officially organized in 1995
• The Bio-PERL Project is an open source
project using PERL tools for
bioinformatics, genomics and life science
research.
• http://www.bioperl.org
Retrieving a Sequence from GenBank
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EMBL;

# BioPerl API for sequence retrieval. The EMBL interface is used here;
# GenBank and EMBL exchange records, so the same accession number applies.
my $gb  = Bio::DB::EMBL->new();
my $id  = 'AF162692';                   # Accession number
my $seq = $gb->get_Seq_by_acc($id);

# Member functions of the sequence object
my $desc        = $seq->desc();
my $description = $seq->description();
my $len         = $seq->length();
my $dna         = $seq->seq();

print "Sequence Desc = $desc \n";
print "Sequence Description = $description \n";
print "Sequence Len = $len \n";
print "Seq = $dna \n";
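Clones and Contigs
#!/usr/bin/perl -w
use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# Connect to the Ensembl core database (note the leading dash on every option).
my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(-host   => 'kaka.sanger.ac.uk',
                                             -user   => 'anonymous',
                                             -dbname => 'homo_sapiens_core_19_34a');

# Fetch the first 10 million base pairs of chromosome 18 as a slice.
my $slice_adaptor = $db->get_SliceAdaptor;
my $slice = $slice_adaptor->fetch_by_chr_start_end('18', 1, 10000000);

# List every gene on the slice with its stable ID and description.
my $count = 0;
my @genes = @{ $slice->get_all_Genes };
foreach my $gene (@genes)
{
    $count++;
    print "Gene # $count- " . $gene->stable_id . ":" . $gene->description . "\n";
}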