Bios 540
Introduction to Bioinformatics
Lecture 1.
Course outline
The biological system
Omics and its impact
Big data
The statistician/bioinformatician’s role
Course Outline
Instructor:
Tianwei Yu
Office: GCR Room 334
Email: [email protected]
Office Hours: by appointment.
Teaching Assistant:
Mr. Qingpo Cai
Office Hours: TBA
Course Website:
http://web1.sph.emory.edu/users/tyu8/540/index.htm
Evaluation
Class participation (5%)
Three homeworks (15% × 3)
Final report based on a research article (50%).
Course Outline
(Diagram: how Bios 540 relates to Bioinformatics, Statistics, Machine learning and other courses, and other disciplines such as CS, Biology, Genetics, ……)
Course Outline
• Biological sequence analysis
Pairwise alignment; multiple alignment; sequence models; motifs; fast alignment; phylogenetic trees
• High-throughput data generation and preprocessing
Next generation sequencing; Microarray RNA/DNA profiling; LC/MS based Proteomics/Metabolomics. (Technique; popular models)
• General statistical techniques in high-throughput data
Multiple testing & FDR; clustering; classification
• Data Interpretation & Integration
Ontology; Some Important Databases; Networks
Related course
Bios 740 (Bios/CS 534 from 2017): Machine Learning.
Supervised learning:
Classification: Bayesian decision theory, LDA, classification tree, random forest, SVM, boosting, bump hunting, neural networks, deep learning.
Model generalization.
Variance/Bias, training/testing error, cross validation.
Unsupervised learning:
Dimension reduction: PCA, factor analysis, ICA, NCA, SIR
Clustering: similarity measures, hierarchical, k-means, model-based clustering …
Tentative schedule
Lecture 1: Introduction
Lecture 2: Sequencing; Dynamic programming sequence alignment
Lecture 3: BLAST; Hidden Markov Models in alignment (1)
Lecture 4: Hidden Markov Models (2); Multiple Alignment
Lecture 5: Motif discovery; Phylogeny
Lecture 6: Gene expression: microarray and deep sequencing
Lecture 7: Supervised and Unsupervised Learning (1)
Lecture 8: Supervised and Unsupervised Learning (2)
Lecture 9: Multiple Testing
Lecture 10: Analyzing the DNA by deep sequencing (1)
Lecture 11: Analyzing the DNA by deep sequencing (2); MS-based Proteomics & Metabolomics (1)
Lecture 12: MS-based Proteomics & Metabolomics (2)
Lecture 13: Networks and Ontology
Lecture 14: Data integration
Course Outline
Recommended Readings for the basics:
Richard Durbin et al. (2005) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Michael Waterman (1996) Introduction to Computational Biology – Maps, Sequences & Genomes.
The complex biological system
Red: central dogma
Blue line: interactions
Metabolites
(Picture edited from http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/)
The complex biological systems
(Diagram labels: Organism; Cell; Tissue architectures; Cell interactions; Signaling; Genome; Transcriptome; Proteome; Metabolome; Interactome; ……; Environment: Chemicals; Microorganisms)
The complex biological systems
How many players are there in the human system?
• 30,000~70,000 genes
• One or more regulatory sequences per gene
• ~70% of the genes are alternatively spliced to generate >1 transcript and >1 protein per gene
• >40,000 different metabolites (Human Metabolome Database)
• Hundreds of signaling molecules
• Different cellular architectures
The items listed above are just the species. The amount of each species also matters!
The complex biological systems
Our goal – the comprehensive understanding of diseases
We face the “big data” challenge
Comprehensive studies of a disease
Li et al. Seminars in Immunology 25:209
The complex biological system – our goals
“Omics” includes: Genomics; Transcriptomics; Proteomics; Metabolomics; Interactomics; ……
Measured by high-throughput techniques: Sequencing; Microarrays; LC/MS; NMR; Two-hybrid; ……
Their data are high-noise.
To reduce noise: advanced preprocessing techniques -> reliable high-throughput information
Techniques to analyze high-dimensional data and knowledgebases -> Biological knowledge; Medical knowledge -> Improved health
The complex biological systems --- the genome
http://content.answers.com/main/content/wp/en/f/f0/DNA_Overview.png
http://www.insectscience.org/2.10/ref/fig5a.gif
The complex biological systems --- the genome
• The human genome is a book with 3 billion characters. 5% are words (protein-coding sequences) and 95% are not.
• The mouse genome contains about 2.5 billion characters. It is very similar to the human genome (85% identical in protein-coding regions). That is one of the reasons why mice are suitable for elucidating biological mechanisms and for drug discovery. The similarity results from a common ancestor 80 million years ago.
• How many genomes have been sequenced? The number is increasing rapidly.
http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html
The complex biological systems --- the genome
Small variations in the genome can cause huge differences in phenotypes: disease susceptibility, drug response, etc.
The sequence variations in an individual genome can be measured by PCR (low-throughput), or by microarray and deep sequencing (high-throughput).
http://archive.hpcwire.com/hpcwire/2013-06-05/dell_boxes_up_hpc_for_life_sciences.html
The complex biological systems --- the epigenome
DNA is structured. Modifications to relevant proteins (methylation/acetylation/…) and
DNA itself can change its structure and control gene expression (DNA -> RNA).
http://www.roadmapepigenomics.org/
The complex biological systems --- transcriptome and proteome
The cell is a complex machine. Its active parts are the proteins.
The DNA records how each protein should be made, but not the quantity present at a given moment.
To understand the operation of the machinery, we want to know how much of each protein is present under certain conditions. There are potentially >10,000 species of proteins in the cell. Protein modifications further complicate things. The proteome can be measured directly by methods like LC/MS/MS, which is costly.
A much easier way is to measure the transcriptome. The messenger RNAs serve as the molds in the making of the parts. Normally, the more molds, the more parts are made. mRNA doesn’t have tertiary structures, so it is much easier to quantify by microarrays.
http://www.katiephd.com/a-whole-new-rna-world/
The complex biological systems --- metabolome
• Small molecules, not coded by DNA.
• Substrates of enzymes (proteins). Reflects activities of the regulatory systems and the environment.
• Directly reflects:
Metabolic regulation
Nutrition
Environmental response
Drug response
• Indirectly reflects system changes (redox potential…)
• Measured by NMR, GC/MS, LC/MS, ……
The interactome
The Scientist 2004, 18(12):18
The reactome
KEGG network – proteins(enzymes) & metabolites.
The relevance of omics experiments in medicine
Biomarker discovery
To find non-invasive methods to:
Predict disease risk; early detection
Disease classification
Predict response to treatment
Monitor disease progression
Before the era of high-throughput experiments, what did the
doctors do?
Age, gender, ethnicity, behavioral measures, …
Disease stage, dissection of disease tissue …
Use one-at-a-time methods to analyze proteins/metabolites
in disease tissue or biological fluids
The relevance of omics experiments in medicine
To study the disease mechanism to find a cure:
(1) Diseases with a pathogen -> Interaction of the human system with the pathogen.
Protein interaction, regulation of gene expression, change in metabolite concentration …
What can we block to stop the disease progression?
(2) Diseases without a pathogen -> What goes wrong in the human system? Is it a genetic disorder? Is it a disturbance of the regulation of the system?
The relevance of omics experiments in medicine
Genomics – a few examples
Medical question: Is there a (set of) special mutation causing a disease?
Experimental techniques: Deep sequencing; Single Nucleotide Polymorphism (SNP) arrays
Computational techniques: Association analysis; Linkage analysis; Multiple testing; ……

Medical question: How to find gene products that aid/suppress the development of a certain type of cancer?
Experimental techniques: Array comparative genomic hybridization; SNP array CGH/LOH; Deep sequencing
Computational techniques: Segmentation; Multiple testing; Clustering; Classification; ……

Medical question: How to find a region of DNA whose folding structure affects disease status?
Experimental techniques: Deep sequencing
Computational techniques: Alignment; Peak modeling; Segmentation; ……
The relevance of omics experiments in medicine
Transcriptomics – a few examples
Medical question: Are certain gene products associated with the incidence/progression of a disease?
Experimental techniques: Expression microarrays; Whole Transcriptome Shotgun Sequencing
Computational techniques: Alignment; Multiple testing; Dimension reduction; Clustering; Classification; ……

Medical question: Are there subtypes of disease undetected by regular medical examination?
Experimental techniques: Gene expression (and potentially all other “omics” methods)
Computational techniques: (same as above)
The relevance of omics experiments in medicine
Proteomics – a few examples
Medical question: Are certain proteins associated with the incidence/progression of a disease?
Experimental techniques: Mass spectrometry (2D gel -> MS, tandem MS, LC/MS/MS, ……)
Computational techniques: Sequence matching; Multiple testing; Dimension reduction; Clustering; Classification; ……

Medical question: How do proteins change their modification patterns in a disease state?
Experimental techniques: (targeted) Mass spectrometry
Computational techniques: (same as above)

Medical question: How do proteins of pathogens work and interact with human proteins?
Experimental techniques: Mass spectrometry; Immunological methods; Large-scale structural study
Computational techniques: (same as above); Protein structure analysis
The relevance of omics experiments in medicine
Metabolomics – a few examples
Medical question: How are bodily metabolic networks disrupted in metabolic diseases?
Experimental techniques: Mass spectrometry; NMR; etc.
Computational techniques: Data alignment; Metabolite mapping; Multiple testing; Dimension reduction; Functional data analysis; ……

Medical question: How do some drugs interfere with the human metabolome? How are they transformed/degraded?
Experimental techniques: (same as above)
Computational techniques: (same as above)

Medical question: Do pollutants accumulate in the human body and cause diseases?
Experimental techniques: Mass spectrometry
Computational techniques: (same as above)
The relevance of omics experiments in medicine
“Omics” is revolutionizing medicine.
Personalized medicine
Understand each patient’s system, match them with treatments.
(success example: Oncotype DX breast cancer test from Genomic
Health, in order to tailor treatment.)
Predictive medicine & preventive medicine
Find the increased risk, even before the disease onset.
Predict the progression of disease after it occurs.
Systems biology -> better understanding of diseases
How are all the “omics” measurements related? How do they interact?
What do they say about possible treatments and the development of drugs?
The relevance of omics experiments in medicine
What is Personalized medicine?
Each person differs in:
DNA sequence (tens of millions of sequence variations)
DNA structures
Gene expression levels
Protein modification/degradation patterns
Metabolite levels in the blood
Exposure history
……
Bioinformatics, Vol. 27 no. 13 2011, pages 1741–1748
The relevance of omics experiments in medicine
Fig. 1. Personalized medicine. Personal genomics connect genotype to phenotype and provide insight into disease. Pharmacogenomics connect genotype to patient-specific treatment. Traditional medicine defines the pathologic states and clinical observations to evaluate and adjust treatments.
Bioinformatics, Vol. 27 no. 13 2011, pages 1741–1748
The relevance of omics experiments in medicine
Nature Medicine, volume 17, number 3, pages 297–303.
The challenges to statisticians/bioinformaticians
Luckily, or unluckily, we are part of the “big data” game.
http://jeffhurtblog.com/2012/07/20/three-vs-of-big-data-as-applied-conferences/
The challenges to statisticians/bioinformaticians
All Omics experiments share one characteristic:
Omics -> the “totality” -> there are many!
We are measuring hundreds of thousands of features from one single person.
We are overwhelmed by data --- even eyeballing the data becomes impossible.
The task:
Reduce the data into a more useful form.
Make use of the data in medicine and biological research!
Nature Methods 6, S2 - S5 (2009)
The challenges to statisticians/bioinformaticians
The sample size issue.
Up to now, most genome-wide association studies (GWAS) have yielded very weak biomarkers. Biomarkers found by microarray are often unreliable. Why?
Diseases are complicated! The human population is diverse!
We are limited by sample size!
If a disease is caused by the combinatorial effect of 3 genes located in different regions of the genome, high-throughput technology will have difficulty finding them, even with 1000 samples!
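The sample-size argument can be made concrete with a little arithmetic. The sketch below counts how many gene combinations an exhaustive search would have to test and what per-test cutoff a Bonferroni correction would then require. The gene count of 30,000 (the lower figure quoted earlier in this lecture), the 0.05 error level, the choice of Bonferroni correction, and the function names are illustrative assumptions, not values taken from the slide.

```python
from math import comb


def n_combinations(n_features: int, k: int) -> int:
    """Number of distinct k-feature subsets among n_features."""
    return comb(n_features, k)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_genes = 30_000  # assumed: lower end of the gene count quoted in the lecture
    alpha = 0.05      # assumed family-wise error level
    for k in (1, 2, 3):
        tests = n_combinations(n_genes, k)
        cutoff = bonferroni_threshold(alpha, tests)
        print(f"k={k}: {tests:.3e} tests, per-test cutoff {cutoff:.3e}")
```

Under these assumptions there are 4,499,550,010,000 possible gene triplets, so the per-test cutoff falls to roughly 1e-14, which is very hard to reach with about 1000 samples.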
The challenges to statisticians/bioinformaticians
Many medical questions using Omics can be generalized into these forms:
• Processing the data to find the features
Pre-processing, sequence comparison, data modeling…
• Identifying features (SNPs, genes, proteins, etc.) associated with a disease (or disease state)
Find whether a feature is significantly different between normal/disease samples. Statistical models, Multiple Testing, model validation, generalization…
• Finding previously unknown subtypes of a disease
Group samples based on their feature measurements. Dimension Reduction, Clustering, …
• Predicting disease/normal status or different disease subtypes/states
Based on the measurements of some features, predict a new case. Predictive Model Building…
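As one illustration of the multiple-testing step named in the list above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure for controlling the false discovery rate (FDR). It assumes that one p-value per feature has already been computed by some per-feature test comparing normal and disease samples. The function name, the FDR level q = 0.05, and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per input p-value: True if rejected at FDR level q."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for ten features.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, q=0.05))
```

Because the rule is step-up, a feature whose own p-value misses its rank threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.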
The challenges specific to statistical bioinformaticians
Compromise
The models may be too complex -> assumptions may not hold; theoretical rigor may not be achieved
Too much background knowledge
Computing needs
Work with others
Different data types - integration
“Dirty” data
Speed: the first few methods (not the best methods) dominate, and the data evolve