IBGP/BMI 730 Bio-(medical)-Informatics
Director: Prof. Kun Huang
What is Bio(medical)-informatics?
bio·in·for·mat·ics
: the collection, classification, storage, and analysis of biochemical and
biological information using computers especially as applied in molecular
genetics and genomics.
Source: Merriam-Webster's Medical Dictionary, © 2002 Merriam-Webster,
Inc.
Myth 1: Bioinformatics is about genomics
• Nucleotide – DNA, RNA, …
• Genome – Sequences, chromosomes, expressed data, …
• Protein – Sequences, 3-D structure, interaction, …
• System – Gene network, protein network, TFs, …
• Other – Mass spec, microarray, images, lab records, journals, literature, …
The goal is to understand how the system works.
Myth 2: Data vs. Information
Data
Nucleotide – DNA, RNA, …
Genome – Sequences, chromosomes, expressed data, …
Protein – Sequences, 3-D structure, interaction, …
System – Gene network, protein network, TFs, …
Other – Mass spec, microarray, images, lab records, journals, literature, …
Information
Genotype
Phenotype
Genotype-Phenotype relationship
SNPs
Pathways
Drug targets
Getting data is “easy”, extracting information is hard!
Myth 3: Computers are intelligent
Pros
• Repeated work
• Accurate storage
• Precise computation
• Fast communication
…
Cons
• Cannot generalize
• No real intelligence
…
The results must be reviewed and validated by
biologists. In addition, biologists must have some
understanding of how computers process data
(algorithms). That is why we need to learn
bioinformatics.
Biology – Biomedical informatics – System biology
Biomedical Informatics
Barabasi A-L, Network medicine – from obesity to the “Diseasome”, NEJM, 357(4): 404-407, 2007.
System Biology
Understanding! Prediction!
System Sciences
• Theory
• Analysis
• Modeling: synthesis/prediction, simulation, hypothesis generation
Biology
• Domain knowledge: hypothesis testing
• Experimental work: genetic manipulation, quantitative measurement, validation
Informatics
• Data management: database
• Computational infrastructure: modeling tools, high performance computing
• Visualization
Where does large data come from (who to blame)?
High-throughput techniques
Fred Sanger
• Nobel prize in chemistry in 1958 "for his work on the structure of proteins, especially that of insulin"
• Nobel prize in chemistry in 1980 "for their contributions concerning the determination of base sequences in nucleic acids"
High-throughput techniques
DNA Sequencing
• 1970’s – Nobel prize
• 1980’s – Ph.D. thesis
• Early 1990’s – Major research projects
• Late 1990’s to now – $20
Human Genome Project
The Beginning (1988)
Cold Spring Harbor Laboratory
Long Island, New York
June 26, 2000 at the White House
Initial Analysis of the Human Genome
The sequencing of the equivalent of an entire human genome for $1,000
has been announced as a goal for the genetics community, and new
technologies suggest that reaching this goal is a matter of when, rather
than if. What then? In celebration of its upcoming 15th anniversary,
Nature Genetics is asking prominent geneticists to weigh in on this
question: what would you do if this sequencing capacity were available
immediately? This new Nature Genetics 'Question of the Year' website,
sponsored by Applied Biosystems, will reveal their answers. The website
will be updated monthly, so check back regularly to get a glimpse of the
future of genetics.
http://www.nature.com/ng/qoty/index.html
What information do we want to extract?
Science, 9/2/2005
Total genetic difference (# of bases) between human and chimpanzee is 4%
35 million single base substitutions plus 5 million insertions or deletions (indels)
The average protein differs by only two amino acids, and 29% of proteins are identical.
Genotype – Phenotype relationship!!!
Phenotype
• mRNA level
• Protein expression
• Protein structure
• Cell morphology
• Tissue morphology
• System physiological functions
• Behavior
•…
High-throughput techniques
High throughput protein crystallization
Mass spectrometry
Microarray
High throughput cell imaging
High throughput in vivo screening
…
“A key element of the GTL program is an
integrated computing and technology
infrastructure, which is essential for timely and
affordable progress in research and in the
development of biotechnological solutions. In
fact, the new era of biology is as much about
computing as it is about biology. Because of
this synergism, GTL is a partnership between
our two offices within DOE’s Office of Science—
the Offices of Biological and Environmental
Research and Advanced Scientific Computing
Research.
Only with sophisticated computational power and
information management can we apply new
technologies and the wealth of emerging data to
a comprehensive analysis of the intricacies and
interactions that underlie biology. Genome
sequences furnish the blueprints,
technologies can produce the data, and
computing can relate enormous data sets to
models linking genome sequence to
biological processes and function.”
How to extract the information?
Computational tools
• Building the databases
• Perform analysis/extract features
• Data fusion/integration
• Data mining/classification/statistical learning
• Visualization/representation
Biological information!!!
How to extract the information?
http://bmi.osu.edu/bioinformatics/project.php?id=200
What we are going to do:
• Search the databases
• Perform analysis
• Present output
Be a salient user!
What we are going to teach:
• Genomics
• Proteomics
• Microarray analysis
• Other aspects
• Ontology
• Imaging informatics
• System biology
• Machine learning / statistical analysis
• Visualization
• Data sources (databases)
• Available tools
• Major issues in using the databases and tools
• Other resources
Review of Biology
• Central dogma
• Operon
• mRNA, cDNA, exon, intron
• Protein folding and structure
Databases
GenBank www.ncbi.nlm.nih.gov/GenBank/
EMBL www.ebi.ac.uk/embl/
DDBJ www.ddbj.nig.ac.jp
Synchronized daily.
Accession numbers are managed in a consistent way.
AceDB
DDJP DNA
JJPID
MIPS
PHRED
PIR
PROSITE
RDP
TIGR
UNIGENE
…
Resources
Local:
OSU library
Web:
PubMed
JSTOR (http://www.jstor.com)
http://www.expasy.org
http://www.genecards.org
http://www.pathguide.org/
Resources – What’s out there?
PubMed – Entrez
PubMed : http://www.pubmed.gov,
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
PubMed training : http://www.nlm.nih.gov/bsd/disted/pubmed.html
Entrez : http://www.ncbi.nlm.nih.gov/Database/index.html
Entrez is the integrated, text-based search and retrieval system used
at NCBI for the major databases, including PubMed, Nucleotide and
Protein Sequences, Protein Structures, Complete Genomes,
Taxonomy, and others.
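As a rough illustration of what a text-based Entrez search looks like when done programmatically, the sketch below builds a query URL for NCBI's E-utilities ESearch service. The slides only describe the web interface; the ESearch endpoint, the function name build_esearch_url, and the default of 20 results are assumptions added here, and no request is actually sent.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (an addition; the slides only show the web pages).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, field=None, max_results=20):
    """Return an ESearch URL for `term` in `database`.

    If `field` is given, the term is restricted to that search field
    using the bracket syntax, e.g. M12345[ACCN].
    """
    query = f"{term}[{field}]" if field else term
    params = {"db": database, "term": query, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Literature search for a gene name, then a nucleotide search by accession number.
print(build_esearch_url("pubmed", "E2F3"))
print(build_esearch_url("nucleotide", "M12345", field="ACCN"))
```

Building the URL separately from fetching it keeps the example free of network access; the resulting string can be pasted into a browser to inspect the result.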
Entrez Databases
Literature
Examples:
1. E2F3
2. Retinoblastoma
Constraints: automatic vs. manual
Save: Tutorial at
http://www.nlm.nih.gov/bsd/viewlet/myncbi/saving_searches.swf
Nucleotide
• Gene
• Genome
• Sequence
• mRNA
• cDNA
• SNP
• ESTs (expressed sequence tags) / UniGene
• Name
• Accession number
• GI number
• Version number
• Alias
Accession number, GI number, Version
• accession number (GenBank) - The accession number is the unique identifier
assigned to the entire sequence record when the record is submitted to
GenBank. The GenBank accession number is a combination of letters and
numbers that are usually in the format of one letter followed by five digits (e.g.,
M12345) or two letters followed by six digits (e.g., AC123456).
The accession number for a particular record will not change even if the author submits a request to change
some of the information in the record. Take note that an accession number is a unique identifier for a
complete sequence record, while a Sequence Identifier, such as a Version, GI, or ProteinID, is an
identification number assigned just to the sequence data. The NCBI Entrez System is searchable by
accession number using the Accession [ACCN] search field.
• GI (GenBank) - A GI or "GenInfo Identifier" is a sequence identifier that can be
assigned to a nucleotide sequence or protein translation. Each GI is a numeric
value of one or more digits. The protein translation and the nucleotide sequence
contained in the same record will each be assigned different GI numbers.
Every time the sequence data for a particular record is changed, its version number increases and it
receives a new GI. However, while each new version number is based upon the previous version number, a
new GI for an altered sequence may be completely different from the previous GI. For example, in the
GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted,
the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences
and protein translations by GI using the UID search field in the NCBI sequence databases.
• GI number is NOT GeneID.
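The identifier rules above can be checked mechanically. The following is a minimal sketch, assuming only the two GenBank accession layouts quoted on the slide (one letter plus five digits, or two letters plus six digits); other accession styles, such as the RefSeq identifier NM_000240.2 used in the FASTA example, are deliberately not matched. The function names are hypothetical.

```python
import re

# The two GenBank accession layouts described above:
# one letter + five digits (M12345) or two letters + six digits (AC123456).
GENBANK_ACCESSION = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def is_genbank_accession(text):
    """Return True if `text` matches one of the two layouts above."""
    return GENBANK_ACCESSION.fullmatch(text) is not None


def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when no numeric
    version suffix is present.
    """
    accession, dot, suffix = identifier.partition(".")
    version = int(suffix) if dot and suffix.isdigit() else None
    return accession, version


print(is_genbank_accession("M12345"))    # letter + five digits
print(is_genbank_accession("7654321"))   # a GI number is purely numeric
print(split_version("NM_000240.2"))
```

Keeping the accession and the version apart reflects the point made above: the accession stays fixed for the record, while the version (and the GI) changes whenever the sequence data changes.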
Example : E2F3
Data Format
FASTA (.fasta file)
>gi|33469954|ref|NM_000240.2| Homo sapiens monoamine oxidase A (MAOA), nuclear gene encoding mitochondrial protein, mRNA
GGGCGCTCCCGGAGTATCAGCAAAAGGGTTCGCCCCGCCCACAGTGCCCGGCTCCCCCCGGGTATCAAAA
GAAGGATCGGCTCCGCCCCCGGGCTCCCCGGGGGAGTTGATAGAAGGGTCCTTCCCACCCTTTGCCGTCC
CCACTCCTGTGCCTACGACCCAGGAGCGTGTCAGCCAAAGCATGGAGAATCAAGAGAAGGCGAGTATCGC
GGGCCACATGTTCGACGTAGTCGTGATCGGAGGTGGCATTTCAGGACTATCTGCTGCCAAACTCTTGACT
GAATATGGCGTTAGTGTTTTGGTTTTAGAAGCTCGGGACAGGGTTGGAGGAAGAACATATACTATAAGGA
ATGAGCATGTTGATTACGTAGATGTTGGTGGAGCTTATGTGGGACCAACCCAAAACAGAATCTTACGCTT
GTCTAAGGAGCTGGGCATAGAGACTTACAAAGTGAATGTCAGTGAGCGTCTCGTTCAATATGTCAAGGGG
AAAACATATCCATTTCGGGGCGCCTTTCCACCAGTATGGAATCCCATTGCATATTTGGATTACAATAATC
TGTGGAGGACAATAGATAACATGGGGAAGGAGATTCCAACTGATGCACCCTGGGAGGCTCAACATGCTGA
CAAATGGGACAAAATGACCATGAAAGAGCTCATTGACAAAATCTGCTGGACAAAGACTGCTAGGCGGTTT
GCTTATCTTTTTGTGAATATCAATGTGACCTCTGAGCCTCACGAAGTGTCTGCCCTGTGGTTCTTGTGGT
ATGTGAAGCAGTGCGGGGGCACCACTCGGATATTCTCTGTCACCAATGGTGGCCAGGAACGGAAGTTTGT
AGGTGGATCTGGTCAAGTGAGCGAACGGATAATGGACCTCCTCGGAGACCAAGTGAAGCTGAACCATCCT
GTCACTCACGTTGACCAGTCAAGTGACAACATCATCATAGAGACGCTGAACCATGAACATTATGAGTGCA
AATACGTAATTAATGCGATCCCTCCGACCTTGACTGCCAAGATTCACTTCAGACCAGAGCTTCCAGCAGA
GAGAAACCAGTTAATTCAGCGGCTTCCAATGGGAGCTGTCATTAAGTGCATGATGTATTACAAGGAGGCC
TTCTGGAAGAAGAAGGATTACTGTGGCTGCATGATCATTGAAGATGAAGATGCTCCAATTTCAATAACCT
TGGATGACACCAAGCCAGATGGGTCACTGCCTGCCATCATGGGCTTCATTCTTGCCCGGAAAGCTGATCG
ACTTGCTAAGCTACATAAGGAAATAAGGAAGAAGAAAATCTGTGAGCTCTATGCCAAAGTGCTGGGATCC
CAAGAAGCTTTACATCCAGTGCATTATGAAGAGAAGAACTGGTGTGAGGAGCAGTACTCTGGGGGCTGCT
ACACGGCCTACTTCCCTCCTGGGATCATGACTCAATATGGAAGGGTGATTCGTCAACCCGTGGGCAGGAT
TTTCTTTGCGGGCACAGAGACTGCCACAAAGTGGAGCGGCTACATGGAAGGGGCAGTTGAGGCTGGAGAA
CGAGCAGCTAGGGAGGTCTTAAATGGTCTCGGGAAGGTGACCGAGAAAGATATCTGGGTACAAGAACCTG
…
>gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]
MENQEKASIAGHMFDVVVIGGGISGLSAAKLLTEYGVSVLVLEARDRVGGRTYTIRNEHVDYVDVGGAYV
GPTQNRILRLSKELGIETYKVNVSERLVQYVKGKTYPFRGAFPPVWNPIAYLDYNNLWRTIDNMGKEIPT
DAPWEAQHADKWDKMTMKELIDKICWTKTARRFAYLFVNINVTSEPHEVSALWFLWYVKQCGGTTRIFSV
TNGGQERKFVGGSGQVSERIMDLLGDQVKLNHPVTHVDQSSDNIIIETLNHEHYECKYVINAIPPTLTAK
IHFRPELPAERNQLIQRLPMGAVIKCMMYYKEAFWKKKDYCGCMIIEDEDAPISITLDDTKPDGSLPAIM
GFILARKADRLAKLHKEIRKKKICELYAKVLGSQEALHPVHYEEKNWCEEQYSGGCYTAYFPPGIMTQYG
RVIRQPVGRIFFAGTETATKWSGYMEGAVEAGERAAREVLNGLGKVTEKDIWVQEPESKDVPAVEITHTF
WERNLPSVSGLLKIIGFSTSVTALGFVLYKYKLLPRS
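To make the FASTA layout concrete, here is a small sketch of a reader for it, assuming each record is one header line starting with ">" followed by one or more sequence lines. The helper names parse_fasta and parse_ncbi_header are hypothetical, and the header splitter only handles the "gi|...|ref|...|" style shown above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_ncbi_header(header):
    """Split 'gi|<GI>|<db>|<accession.version>| description' into a dict."""
    fields = header.split("|", 4)
    if len(fields) < 5 or fields[0] != "gi":
        return {"description": header.strip()}
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "description": fields[4].strip(),
    }


sample = (
    ">gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]\n"
    "MENQEKASIAGHMFDVVVIGGGISGLSAAKLLTEYGVSVLVLEARDRVGGRTYTIRNEHVDYVDVGGAYV\n"
    "GPTQNRILRLSKELGIETYKVNVSERLVQYVKGKTYPFRGAFPPVWNPIAYLDYNNLWRTIDNMGKEIPT\n"
)
for header, sequence in parse_fasta(sample):
    print(parse_ncbi_header(header), len(sequence))
```

The sample uses only the first two lines of the protein record above, so the printed length covers that excerpt rather than the full protein.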
Data Format
Other formats
NBRF/PIR (.pir file)
Begins with “>P1;” for a protein sequence and “>N1;” for a nucleotide sequence.
GDE (.gde file)
Similar to a FASTA file, but begins with “%” instead of “>”.
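Since the three formats differ in their first characters, a file's format can be guessed from its first non-blank line. The sketch below does only that, using the markers listed above; the function name and the returned labels are hypothetical, and a real file may need more checks than this.

```python
def detect_sequence_format(text):
    """Guess the sequence file format from the first non-blank line."""
    for line in text.splitlines():
        first = line.strip()
        if not first:
            continue
        # NBRF/PIR must be tested before FASTA because both start with ">".
        if first.startswith(">P1;"):
            return "NBRF/PIR protein"
        if first.startswith(">N1;"):
            return "NBRF/PIR nucleotide"
        if first.startswith(">"):
            return "FASTA"
        if first.startswith("%"):
            return "GDE"
        return "unknown"
    return "unknown"  # empty input


print(detect_sequence_format(">gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]"))
print(detect_sequence_format("%example_record\nACGT"))
```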
Protein Databases
UniProt is the universal protein database, a central repository of
protein data created by combining Swiss-Prot, TrEMBL and PIR. This
makes it the world's most comprehensive resource on protein
information.
The Protein Information Resource (PIR), located at Georgetown
University Medical Center (GUMC), is an integrated public
bioinformatics resource to support genomic and proteomic
research, and scientific studies.
Swiss-Prot is a curated biological database of protein sequences
from different species created in 1986 by Amos Bairoch during his
PhD and developed by the Swiss Institute of Bioinformatics and the
European Bioinformatics Institute.
Pfam is a large collection of multiple sequence alignments and
hidden Markov models covering many common protein domains and
families.
PDB
NCBI
http://proteome.nih.gov/links.html
PubMed – Protein Databases
The Protein database contains sequence data from the translated
coding regions from DNA sequences in GenBank, EMBL, and DDBJ
as well as protein sequences submitted to Protein Information
Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF),
and Protein Data Bank (PDB) (sequences from solved structures).
The Structure database or Molecular Modeling Database (MMDB)
contains experimental data from crystallographic and NMR structure
determinations. The data for MMDB are obtained from the Protein
Data Bank (PDB). The NCBI has cross-linked structural data to
bibliographic information, to the sequence databases, and to the
NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy
interactive visualization of molecular structures from Entrez.
Tutorial: http://www.pdb.org/pdbstatic/tutorials/tutorial.html
Example – UniProt - Expasy
http://www.uniprot.org/
http://www.expasy.org/
Annotation - Visualization
UCSC Genome Browser (http://genome.ucsc.edu/)
Exercises
Question 1 - Database search
Find the following genes in GenBank. Write down their accession
numbers, GI numbers, and chromosome numbers:
Rb1 (human), Rb1 (mouse), Rb1 (rat), Rb1 (bovine)
Find the protein sequences for the above. Present them in FASTA format.
Note: choose the closest match (e.g., if both Rb1 and Rb are present,
choose Rb1).
Question 2 – Gene information search
Find the function and alias for the following genes:
TCF3, Col4A1, MMP9 and WASP.
Question 3 – Protein information search
Look up human catalase at www.expasy.org. Find out:
• How long is the protein chain? Where is its active site?
• Is its 3D structure available? If so, how was it obtained?
• How long is its longest helix chain and where is it located?
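For Question 1, sequences copied from a database page may need to be put back into FASTA layout by hand. One possible helper is sketched below; it wraps the sequence at 70 characters per line, which matches the line width of the FASTA example in these slides but is a convention rather than a requirement. The function name and the placeholder header are hypothetical, and the sequence is dummy data, not a real Rb1 sequence.

```python
def format_fasta(header, sequence, width=70):
    """Return one FASTA record: a '>' header line plus wrapped sequence lines."""
    # Remove spaces and line breaks that often come along when copying a sequence.
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([f">{header}"] + lines) + "\n"


# Dummy sequence used only to show the wrapping behaviour.
print(format_fasta("placeholder_header my protein", "mens qek " * 20))
```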
Reading – Entrez tutorial
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/entrez_tutorial_BIB.pdf