Download Intro to Bioinformatics

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Western blot wikipedia , lookup

RNA-Seq wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Expression vector wikipedia , lookup

SNP genotyping wikipedia , lookup

Gene expression wikipedia , lookup

Community fingerprinting wikipedia , lookup

Biochemistry wikipedia , lookup

Interactome wikipedia , lookup

Gene wikipedia , lookup

Proteolysis wikipedia , lookup

Molecular Inversion Probe wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Point mutation wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Whole genome sequencing wikipedia , lookup

Non-coding DNA wikipedia , lookup

Endogenous retrovirus wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Protein structure prediction wikipedia , lookup

Homology modeling wikipedia , lookup

Genomic library wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Molecular evolution wikipedia , lookup

Transcript
Topics
The topics:








basic concepts of molecular biology
more on Perl
overview of the field
biological databases and database searching
sequence alignments
phylogenetic trees
protein structure prediction
microarray data analysis
The Human Genome Project
The human genome sequence is complete almost - approximately 3 billion base pairs.
Some of these slides are adapted from Lecture Notes of Stuart M. Brown at NYU
Whole genome sequencing
has now become routine
How does the human genome
stack up?
Organism
Genome Size (Bases) Estimated Genes
Human (Homo sapiens)
3.2 billion
25,000
Laboratory mouse (M. musculus)
2.6 billion
25,000
Mustard weed (A. thaliana)
100 million
25,000
Roundworm (C. elegans)
97 million
19,000
Fruit fly (D. melanogaster)
137 million
13,000
Yeast (S. cerevisiae)
12.1 million
6,000
Bacterium (E. coli)
4.6 million
3,200
Human immunodeficiency virus (HIV)
9700
9
U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003
The Path Forward

How does DNA impact health?


What do all the genes do?


Discover the functions of human genes by experimentation
and by finding genes with similar funcs in the model
organisms
What are the functions of nongene areas?


Identify and understand the difference in DNA sequence
(A,T,C,G) among human populations
Identify important elements in the nongene regions of DNA
How does info in the genome enable life?

Explore life at the ultimate level of the whole organism
instead of single genes/proteins.
U.S. Department of Energy, 2005
Diverse applications


Medicine – customized treatments, …
Microbes for energy and the environment – generate
clean energy source, clean up toxic wastes,…


Bioanthropology – human lineage
Agriculture, livestock breeding, Bioprocessing –
crops&animals more resistant to diseases, efficient
industrial processes,…

DNA identification – implicate people accused of
crimes, identify contaminants in air, water, …
U.S. Department of Energy, 2005
Genomics: Journey to the Center of Biology
Without doubt, the greatest achievement in biology over the
past millennium has been the elucidation of the mechanism
of heredity. The instructions for assembling every organism
on the planet are all specified in DNA sequences that can
be translated into digital information and stored in a
computer for analysis. As a consequence of this revolution,
biology in the 21st century is rapidly becoming an
information science. Powerful new types of
bioinformatics will clearly be required to assimilate and
interpret the data that will issue from various types of
genomics research.
Eric Lander & Robert Weinberg, Science, 2000
Nucleic Acid Sequence Databases

the principal nucleic acid sequence databases are
GeneBank, EMBL and DDBJ, which each collect a
portion of the total sequence data reported worldwide, and exchange new and updated entries on a
daily basis
Nucleic acid sequence Databases
EMBL (European Molecular Biology Laboratory)
GenBank (USA)
DDBJ (DNA Data Bank of Japan)
ENSEMBL (project between EMBL - EBI and the Sanger Institute, to produce and maintain
automatic annotation on selected eukaryotic genomes )
dbEST (division of GenBank)
GSDB (Genome Sequence DataBase, division of GenBank)
GenBank

Once upon a time, GenBank sent out
sequence updates on CD-ROM disks a few
times per year.
Specialised Genomic Resources


In addition to the comprehensive DNA sequence DBs, there is a
variety of more specialised genomic resources.
These so called boutique DBs bring focus to species-specific
genomics and to particular sequencing techniques.
Specialised Genomic Resources
SGD – Saccharomyces Genome Database
UniGene - gene-oriented clusters from GenBank
TIGR - Databases of The Institute for Genomic Research
ACeDB – A C.elegans DataBase
Protein Information Resources
Levels of protein sequence and structural organisation:
primary
secondary
tertiary
The primary structure of a protein is its amino acid sequence
The second structure of a protein corresponds to regions of local
regularity (e.g., α-helices and β-strands).
The tertiary structure of a protein arises from the packing of its
secondary structure elements, which may form discrete
domains within a fold.
Primary Protein Databases

The primary structure of a protein is its amino
acid sequence. These are stored in primary
databases as linear alphabets that denote the
constituent residues.
Protein sequence Databases
SWISS-PROT - Protein knowledgebase
TrEMBL - Computer-annotated supplement to Swiss-Prot
PIR – Protein Information Resource
MIPS – Munich Information Centre for Protein Sequences
NRL-3D - produced by PIR
Structure Classification DBs

Contain 3D structures available from crystallographic
and spectroscopic studies
Structure Classification Databases
PDB – Protein Data Bank
CATH – Class, Architecture, Topology, Homology
SCOP – Structural Classification of Proteins
PDB: Growth (2006)
Databases concerning Mutations

dbSNP
http://www.ncbi.nlm.nih.gov/SNP

HGBASE (Human Genome Variation Database)
http://hgbase.cgr.ki.se

The SNP Consortium (TSC)
http://snp.cshl.org
Literature Databases

PubMed
http://www.ncbi.nlm.nih.gov/entrez/query

Bioinformatics Online
http://www.bioinformatics.oupjournals.org

Nature
http://www.nature.com

Science
http://www.sciencemag.org
Systems Biology

Integrate different levels of
information to understand
how biological systems function

Use computational and mathematical models to
analyze, model and simulate cellular networks,
interactions and pathways.
Microarray
 DNA
microarray is a new
technology to measure the level of
the mRNA gene products of a
living cell.
®
Affymetrix GeneChip Probe Arrays
Hybridized Probe Cell
GeneChip Probe Array
Single stranded, fluorescently
labeled cRNA target
*
*
*
*
*
*
Oligonucleotide probe
24~50µm
1.28cm
Each probe cell or feature contains
millions of copies of a specific
oligonucleotide probe
Image of Hybridized Probe Array
BGT108_DukeUniv
Bioinformatics Tools
 Database
& searching
 Computational algorithms
 Alignment
 Similarity
 Clustering
 Pattern
 Structure
Searching
predictions
 Statistical methods
 Data visualization
Bioinformatics
Bioinformatics is the research, development, or application
of computational tools and approaches for expanding the
use of biological, medical, behavioral or health data,
including those to acquire, store, organize, archive,
analyze, or visualize such data;
Computational biology is the development and
application of data-analytical and theoretical methods,
mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and
social systems.