Data Acquisition
Tools & Techniques
In this presentation:
Part 1 – Sequencing Technology
Part 2 – Genomic Databases
Part 1 – Sequencing Technology
Principles of DNA sequencing
• DNA sequencing is performed using an automated
version of the chain termination reaction, in which
limiting amounts of dideoxyribonucleotides
generate nested sets of DNA fragments with
specific terminal bases
• Four reactions are set up, one for each of the four
bases in DNA, each incorporating a different
fluorescent label
• The DNA fragments are separated by PAGE and
the sequence is read by a scanner as each fragment
moves to the bottom of the gel
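The chain termination principle above can be illustrated with a minimal sketch. It assumes an idealised reaction in which every possible terminated fragment is produced exactly once; the function names and the example strand are hypothetical and chosen only for illustration.

```python
def chain_termination_fragments(synthesised_strand):
    """Return, for each base, the lengths of the fragments that end in that base.

    Each of the four reactions yields a nested set of fragments that all
    terminate at the dideoxy form of one specific base.
    """
    reactions = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesised_strand, start=1):
        reactions[base].append(length)
    return reactions


def read_gel(reactions):
    """Read the sequence by ordering labelled fragments from shortest to longest.

    Shorter fragments migrate faster, so they reach the bottom of the gel
    (and the scanner) first.
    """
    labelled = [(length, base)
                for base, lengths in reactions.items()
                for length in lengths]
    return "".join(base for _, base in sorted(labelled))


fragments = chain_termination_fragments("GATTACA")
print(fragments)
print(read_gel(fragments))
```

Sorting by fragment length stands in for separation by PAGE, and the base label stands in for the fluorescent dye attached in each reaction.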
Types of DNA sequencing
DNA sequences come in three major forms
• Genomic DNA comes directly from the genome and
includes extragenic material as well as genes. In
eukaryotes, genomic DNA contains introns
• cDNA is reverse-transcribed from mRNA and
corresponds only to the expressed parts of the
genome. It does not contain introns
• Recombinant DNA comes from the laboratory and
comprises artificial DNA molecules such as cloning
vectors
Genome sequencing strategies
Only short DNA molecules (~800 bp) can be
sequenced in one read, so large DNA molecules,
such as genomes, must first be broken into
fragments. Genome sequencing can be approached
in two ways
• Shotgun sequencing involves the generation of
random DNA fragments, which are sequenced in
large numbers to provide genome-wide coverage
• Clone contig sequencing involves the systematic
production and sequencing of subclones
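As a rough sketch of the shotgun approach, the following simulates random reads from a made-up genome and measures how deeply each position is covered. The genome length, the number of reads and all function names are assumptions for illustration; only the ~800 bp read length comes from the slide.

```python
import random


def shotgun_reads(genome, read_length, n_reads, seed=0):
    """Draw reads of fixed length from random start positions in the genome."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(genome) - read_length + 1)
        reads.append((start, genome[start:start + read_length]))
    return reads


def coverage(genome_length, reads):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_length
    for start, sequence in reads:
        for position in range(start, start + len(sequence)):
            depth[position] += 1
    return depth


# A hypothetical 5000 bp random "genome", sequenced with fifty 800 bp reads.
genome = "".join(random.Random(1).choice("ACGT") for _ in range(5000))
reads = shotgun_reads(genome, read_length=800, n_reads=50)
depth = coverage(len(genome), reads)
mean_depth = sum(depth) / len(depth)
print(f"mean depth: {mean_depth:.1f}x")
print(f"positions with no coverage: {depth.count(0)}")
```

Mean depth is simply (number of reads × read length) / genome length, but because the reads are random, individual positions can still be covered poorly or not at all, which is why shotgun projects sequence fragments in large numbers.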
Sequence quality control
• High quality sequence data is generated by
performing multiple reads on both DNA strands
• Preliminary trace data is then base called and
assessed for quality using a program such as Phred
• Vector sequences and repeated DNA elements are
masked off and then the sequence is assembled
into contigs using a program such as Phrap
• Remaining inconsistencies must be addressed by
human curators
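Phred-style quality values relate a base call to its estimated error probability P by Q = -10 log10(P). The sketch below shows that conversion and uses it to flag unreliable base calls. Note that it masks low-quality bases rather than vector or repeat sequences, and that the threshold of 20, the example data and the function names are assumptions, not part of Phred or Phrap themselves.

```python
import math


def phred_quality(error_probability):
    """Convert a base-calling error probability into a Phred-scaled quality."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Convert a Phred-scaled quality back into an error probability."""
    return 10 ** (-quality / 10)


def mask_low_quality(bases, qualities, threshold=20):
    """Replace base calls below the quality threshold with 'N'."""
    return "".join(base if quality >= threshold else "N"
                   for base, quality in zip(bases, qualities))


print(round(phred_quality(0.001), 6))   # error rate of 1 in 1000
print(error_probability(20))            # quality 20 as a probability
print(mask_low_quality("ACGTAC", [35, 12, 40, 8, 22, 30]))
```

On this scale a quality of 20 corresponds to one expected error in 100 calls and a quality of 30 to one in 1000.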
Single-pass sequencing
• Sequence data of lower quality can be
generated by single reads (single-pass
sequencing)
• Although somewhat inaccurate, single-pass
sequences such as ESTs and GSSs can be
generated in large amounts very quickly and
inexpensively
RNA sequencing
Most RNA sequences are deduced from the
corresponding DNA sequences but special methods
are required for the identification of modified
nucleotides. These include biochemical assays,
NMR spectroscopy and MS
Protein sequencing
• Most protein sequencing is nowadays carried out by
MS, a technique in which accurate molecular masses
are calculated from the mass/charge ratio of ions in a
vacuum
• Soft ionization methods allow MS analysis of large
macromolecules such as proteins
• Sequences can be deduced by comparing the masses of
tryptic peptide fragments to those predicted from
virtual digests of proteins in databases
• Also, de novo sequencing can be carried out by
generating nested sets of peptide fragments in a
collision cell and calculating the difference in mass
between fragments differing in length by a single
amino acid residue
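The de novo idea in the last bullet can be sketched as reading a "mass ladder": consecutive fragments differ by one residue, so each mass difference identifies an amino acid. This is a simplified illustration that ignores terminal groups, ion types and charge; the table holds monoisotopic residue masses for a subset of amino acids, and the peptide and function names are hypothetical. Leucine and isoleucine have identical masses, so only L is listed.

```python
# Monoisotopic residue masses in daltons (subset, for illustration only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259, "F": 147.06841, "R": 156.10111,
}


def fragment_ladder(peptide):
    """Return cumulative masses of the nested fragments of a peptide."""
    masses, total = [], 0.0
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses


def read_ladder(masses, tolerance=0.01):
    """Infer the sequence from mass differences between adjacent fragments."""
    sequence, previous = [], 0.0
    for mass in sorted(masses):
        delta = mass - previous
        matches = [aa for aa, m in RESIDUE_MASS.items()
                   if abs(m - delta) <= tolerance]
        sequence.append(matches[0] if matches else "?")
        previous = mass
    return "".join(sequence)


ladder = fragment_ladder("GASPVK")
print([round(mass, 3) for mass in ladder])
print(read_ladder(ladder))
```

A difference that matches no residue within the tolerance is reported as "?", which mirrors how gaps in a real fragment series leave parts of a sequence unresolved.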
Importance of protein interactions
• They underlie most cellular functions. Protein-protein
interactions result in formation of transient or stable
multi-subunit complexes
• Understanding of these complexes is required for
functional annotation of proteins and is a step towards
the elucidation of molecular pathways such as signaling
cascades and regulatory networks
• Protein interactions with nucleic acids form an
important area of study, since such interactions are
required for replication, transcription, recombination,
DNA repair and many other processes. Proteins also
interact with small molecules, which act as ligands,
substrates, cofactors and allosteric regulators
Methods for protein interactions
• Genetic methods
– Suppressor mutant
– Synthetic lethal effect
– Dominant negative
mutations
• Affinity methods
– Affinity chromatography
– Co-immunoprecipitation
• Molecular and atomic
methods
– X-ray crystallography
– NMR spectroscopy
– Other methods
• FRET
• SPR spectroscopy
• SELDI
• Library-based methods
– Y2H system
Other methods
• For larger proteins that do not readily form
crystals, alternative analytical methods are
required to deduce structures
• These include X-ray fiber diffraction, electron
microscopy and circular dichroism (CD)
spectroscopy
Protein structure determination
• X-ray crystallography
• NMR spectroscopy
• Other methods
– X-ray fiber diffraction
– Electron microscopy
– CD spectroscopy
X-ray crystallography
• Involves determination of protein structure by
studying the diffraction pattern of X-rays through a
precisely orientated protein crystal
• The way in which X-rays are scattered depends
on the electron density and spatial orientation of
the atoms in the crystal
• A mathematical method called the Fourier
transform is used to reconstruct electron density
maps from the diffraction data allowing structural
models to be built
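The role of the Fourier transform can be illustrated with a one-dimensional toy example: a made-up electron density is transformed into complex coefficients (playing the part of structure factors) and then recovered by the inverse transform. The density values and function name are assumptions for illustration, and a real experiment records only intensities, so the phases used here would have to be estimated separately.

```python
import cmath


def dft(values, inverse=False):
    """Plain discrete Fourier transform (or its inverse) of a sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(value * cmath.exp(sign * 2j * cmath.pi * k * index / n)
                    for index, value in enumerate(values))
        result.append(total / n if inverse else total)
    return result


# Hypothetical 1-D electron density with two "atoms" in the unit cell.
density = [0, 0, 5, 8, 5, 0, 0, 0, 3, 6, 3, 0]

structure_factors = dft(density)                       # amplitude and phase
intensities = [abs(f) ** 2 for f in structure_factors]  # what a detector sees
recovered = [round(x.real, 6) for x in dft(structure_factors, inverse=True)]

print([round(i, 2) for i in intensities])
print(recovered)
```

The inverse transform returns the original density only because both amplitudes and phases are available in this sketch.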
NMR spectroscopy
• NMR is a property of certain atoms that can switch
between magnetic states in an applied magnetic field
by absorbing electromagnetic radiation
• The nature of the absorbance spectrum is influenced by
the type of atom and its chemical context, so that
NMR spectroscopy can discriminate between different
chemical groups
• NMR spectra are also modified by the proximity of
atoms in space
• Analysis of NMR spectra allows the 3D configuration of
atoms to be reconstructed, resulting in a series of
structural models
• The technique is suitable only for the analysis of
small, soluble proteins
2-D gel electrophoresis
• The current method for studying proteins relies in part on a technique
called two-dimensional (2-D) gel electrophoresis, which separates proteins by
charge and size
• In the technique, researchers apply a solution of cell contents to a
narrow polymer strip that has a gradient of acidity (pH). When the strip is
exposed to an electric current, each protein in the mixture settles into a
band according to its charge. Next, the strip is placed along the edge of a
flat gel and exposed to electricity again. As the proteins migrate through
the gel, they separate according to their molecular weight. The result
is a smudgy pattern of spots, each of which contains a different protein
• In academic laboratories, scientists generally use a tool similar to a hole
puncher to cut the protein spots from 2-D gels for individual
identification by another method, mass spectrometry
• Nowadays, companies have started using robots to do this
Part 2 – Genomic Databases
Types of databases
• There are many types of databases available
for researchers in the field of biology
– Primary sequence databases - for storage of raw
experimental data
– Secondary databases - contain information on
sequence patterns and motifs
– Organism specific databases
– Other databases
Primary sequence databases
• Three primary sequence databases are GenBank
(NCBI), the Nucleotide Sequence Database
(EMBL) and the DNA Databank of Japan (DDBJ)
• These are repositories for raw sequence data, but
each entry is extensively annotated and has a
features table to highlight the important properties
of each sequence
• The three databases exchange data on a daily basis
Subsidiary sequence databases
• Particular types of sequence data are stored
in subsidiaries of the main sequence
databases. For instance, ESTs are stored in
dbEST, a division of GenBank
• There are also subsidiary databases for GSSs
and unfinished genomic sequence data
Organism specific resource
• As well as general databases that serve the entire
biology community, there are many organism
specific databases that provide information and
resources for those researchers working on
particular species
• The number of such databases is growing as more
genome projects are initiated, and many can be
accessed from general genomics gateway sites
such as GOLD
Organism-specific genomic databases
Organism | Database/resource | URL
Escherichia coli | EcoGene | http://bmb.med.miami.edu/EcoGene/EcoWeb
Escherichia coli | EcoCyc (Encyclopedia of E. coli genes and metabolism) | http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html
Escherichia coli | Colibri | http://genolist.pasteur.fr/Colibri
Bacillus subtilis | SubtiList | http://genolist.pasteur.fr/SubtiList
Saccharomyces cerevisiae | Saccharomyces Genome Database (SGD) | http://genome-www.stanford.edu/Saccharomyces
Plasmodium falciparum | PlasmoDB | http://PlasmoDB.org
Arabidopsis thaliana | MIPS Arabidopsis thaliana Database (MAtDB) | http://mips.gsf.de/proj/thal/db
Arabidopsis thaliana | The Arabidopsis Information Resource (TAIR) | http://www.arabidopsis.org
Drosophila melanogaster | FlyBase | http://flybase.bio.indiana.edu
Caenorhabditis elegans | A C. elegans DataBase (ACeDB) | http://www.acedb.org
Mouse | Mouse Genome Database (MGD) | http://www.informatics.jax.org
Human | OnLine Mendelian Inheritance in Man (OMIM) | http://www.ncbi.nlm.nih.gov/omim
Finding organism-specific
databases
• Organism specific databases are widely distributed
on the Internet
• In order to find and interrogate databases on specific
organisms, it is necessary to use a gateway site to
access relevant databases and information resources
• Worked examples are provided, using GOLD as the
gateway and illustrated with Ebola virus, the
bacterium E. coli, the fruit fly Drosophila
melanogaster and the human genome
Useful gateway sites providing information on
multiple organism and genomic resources
Gateway site | URL
NCBI Genomic Biology | www.ncbi.nlm.nih.gov/Genomes/
GOLD (Genomes OnLine Database) | wit.integratedgenomics.com/GOLD
Organism specific genomic databases | www.unl.edu/stc95/ResTools/biotools/biotools10.html
TIGR Microbial Database | www.tigr.org/tdb/mdb/mdbcomplete.html
Bacterial genomes | genolist.pasteur.fr
Yeast database | genome-www.stanford.edu/Saccharomyces/yeast_info.html
EnsEMBL genome database project | www.ensembl.org
MIPS (Munich Information Centre for Protein Sequences) | mips.gsf.de
[Images: nematode; baker's yeast cells]
Other databases
• Specialized sequence databases – for storage and analysis of
particular types of sequences e.g., rRNA and tRNA, introns,
promoters and other regulatory elements
• OMIM – for study of human genetics and molecular biology
• Incyte and UniGene – for providing gene sequences and
transcripts with expert annotation for use in drug design and
research
• Structural databases – for protein structural data (e.g. PDB,
MMDB) – containing X-ray crystallography and NMR studies
• Proteins and higher order functions – to store information on
particular types of proteins such as receptors, signal
transduction components, regulatory hierarchies and enzymes
• Literature databases – to store scientific articles with text search
facility (e.g. Medline and PubMed)
Database tools for displaying and
annotating genomic sequence data
Viewer format | URL
Artemis | www.sanger.ac.uk/Software/Artemis
ACeDB | www.acedb.org/Tutorial/brief-tutorial/shtml
Apollo | www.ensembl.org/apollo
EnsEMBL | www.ensembl.org
NCBI map viewer | www.ncbi.nlm.nih.gov
GoldenPath | genome.ucsc.edu
Database formats
• There is no universally agreed format for genome
databases and several viewers and browsers have
been developed with graphical displays for
genomic sequence analysis and annotation
• One of the most versatile formats is ACeDB
(originally designed for the nematode C. elegans),
which has an object-oriented database architecture
and is now used in many applications outside the
field of genomic bioinformatics
Common formats
• There are several conventions for representing
nucleic acid and protein sequences, of which
the following are widely used
– NBRF/PIR
– FASTA
– GDE
• These formats have limited facilities for
comments, which must include a unique
identifier code and sequence accession number
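Of the formats listed, FASTA is the simplest: a header line beginning with ">" carries the identifier and an optional comment, and the following lines hold the sequence. A minimal parsing sketch is shown below; the record names and sequences are made up, and the function name is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of identifier -> sequence."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is comment.
            identifier = line[1:].split()[0]
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical nucleotide record
GATTACAGATTACA
GATTACA
>seq2 hypothetical protein record
MKVLAAGIVG
"""

sequences = parse_fasta(example)
for name, sequence in sequences.items():
    print(name, len(sequence), sequence)
```

Sequence lines are concatenated because FASTA files usually wrap long sequences over several lines.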
Formats for multiple sequence
alignment
• There are separate formats for multiple
sequence alignment representation, of
which the following are popular
– MSF
– PHYLIP
– ALN
Files of structural data
• Structural data are maintained as flat files
using the PDB format
• Such files contain orthogonal atomic coordinates together with annotations,
comments and experimental details
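PDB flat files store each atom on a fixed-width ATOM line, with the orthogonal coordinates in set column ranges. The sketch below assembles one such line piece by piece so the column layout is visible, then reads it back by slicing. The atom and its coordinate values are hypothetical, and the parser covers only a few of the fields in a real record.

```python
def parse_atom_record(line):
    """Extract selected fields from a fixed-width PDB ATOM record."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


# A hypothetical ATOM line, built field by field to show the columns.
atom_line = (
    "ATOM  "     # columns 1-6: record name
    "    1"      # columns 7-11: atom serial number
    " "          # column 12: blank
    " N  "       # columns 13-16: atom name
    " "          # column 17: alternate location indicator
    "MET"        # columns 18-20: residue name
    " "          # column 21: blank
    "A"          # column 22: chain identifier
    "   1"       # columns 23-26: residue sequence number
    "    "       # columns 27-30: insertion code and blanks
    "  27.340"   # columns 31-38: x coordinate
    "  24.430"   # columns 39-46: y coordinate
    "   2.614"   # columns 47-54: z coordinate
)

atom = parse_atom_record(atom_line)
print(atom)
```

Slicing by column, rather than splitting on whitespace, matters because adjacent fields in a PDB file can run together without a separating space.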
Submission of sequences
• Sequences may be submitted to any of the
three primary databases using the tools
provided by the database curators
• Such tools include WebIn and BankIt,
which can be used over the Internet, and
Sequin, a stand-alone application
Database interrogation
• All the databases discussed above can be searched
by sequence similarity
• However, detailed text-based searches of the
annotations are also possible using tools such as
Entrez
• The simplest way to cross-reference between the
primary nucleotide sequence databases and
SWISS-PROT is to search by accession number,
as this provides an unambiguous identifier of
genes and their products
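Text-based and accession-based queries of this kind can also be issued programmatically. The sketch below only builds the request URLs and does not contact any server. It assumes NCBI's E-utilities web interface to Entrez (the `esearch.fcgi` and `efetch.fcgi` endpoints); the function names, the search term and the accession are illustrative choices, not taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed interface to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(database, term, retmax=20):
    """Build a URL for a text search of the annotations in one database."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(database, accession, rettype="fasta"):
    """Build a URL that retrieves a single record by accession number."""
    query = urlencode({"db": database, "id": accession,
                       "rettype": rettype, "retmode": "text"})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Text search restricted to an organism field, then retrieval by accession.
print(esearch_url("nucleotide", "Escherichia coli[Organism]"))
print(efetch_url("nucleotide", "U00096"))
```

Searching by accession number avoids the ambiguity of free-text terms, which is the point made in the last bullet above.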
Databases covered by Entrez
Category | Database
Nucleic acid sequences | Entrez nucleotides: sequences obtained from GenBank, RefSeq and PDB
Protein sequences | Entrez protein: sequences obtained from SWISS-PROT, PIR, PRF, PDB and translations from annotated coding regions in GenBank and RefSeq
3D structures | Entrez Molecular Modeling Database (MMDB)
Genomes | Complete genome assemblies from many sources
PopSet | From GenBank, sets of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
OMIM | OnLine Mendelian Inheritance in Man
Taxonomy | NCBI Taxonomy Database
Books | Bookshelf
ProbeSet | Gene Expression Omnibus (GEO)
3D domains | Domains from the Entrez MMDB
Literature | PubMed
Databases covered by DBGET/LinkDB
Category | Database
Nucleic acid sequences | GenBank, EMBL
Protein sequences | SWISS-PROT, PIR, PRF, PDBSTR
3D structures | PDB
Sequence motifs | PROSITE, EPD, TRANSFAC
Enzyme reactions | LIGAND
Metabolic pathways | PATHWAY
Amino acid mutations | PMD
Amino acid indices | AAindex
Genetic diseases | OMIM
Literature | LITDB, Medline
Organism-specific gene catalogs | E. coli, H. influenzae, M. genitalium, M. pneumoniae, M. jannaschii, Synechocystis, S. cerevisiae