CS 598SS
Probabilistic Methods in
Biological Sequence Analysis
Saurabh Sinha
What is the course about?
• Bioinformatics / Computational Biology
• Tools for analyzing genomes
• Probabilistic methods
What is the course format?
• Research course
• Lectures by instructor
• Student presentations of research papers
– 1 or 2 paper(s) per student
• Research project & presentation
– Typically, 2 students per project
– 30-minute presentation at end of course
Grading
• Project: 40%
• Paper presentation: 25%
• Assignments and/or tests: 25%
• Participation: 10%
• Grade distribution
Expectations
• Programming skills (for the project)
• Basic exposure to probability theory
• Basic exposure to algorithms
What you can do at the end of
the course
• Start working on research projects in
bioinformatics: biological sequence analysis
• Use principled approaches, supported by
probability theory, instead of ad hoc methods
• Join me as a graduate advisee ?
Administrative Details
• Instructor:
– Saurabh Sinha
– Room 2122, Siebel Center
– Email: [email protected]
• Class hrs: Tue & Thurs, 2:00pm - 3:15pm,
1131SC
• CRN: 43781
• Credits: 4 graduate hrs
• Welcome to sit in, if not taking for credit
Books
• Not required
1. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
-- Durbin, Eddy, Krogh, Mitchison
2. Bioinformatics: The Machine Learning Approach
-- Baldi, Brunak
3. Statistical Methods in Bioinformatics
-- Ewens and Grant
4. Bioinformatics
-- Polanski and Kimmel
Why study bioinformatics?
• Molecular biology is the new frontier of
21st century science
• Computer science is the crown prince of
20th century engineering
• Bioinformatics is the application and
development of computer science with
the goal of supporting molecular biology
Why study bioinformatics?
• Flood of data: several gigabytes (terabytes?)
of sequence and gene expression data
• Noise in the data
– Biological
– Experimental
• Algorithms needed to make discoveries
– Probabilistic methods
– Need for efficiency
Why study bioinformatics?
• The big picture:
– Human health and quality of life
– Fundamental science
• Billions of dollars being spent
– Health research gets the major chunk of the US
Govt’s funds
– Fundamental health research is at the molecular
level
– Molecular biology research increasingly a
quantitative science
Why study bioinformatics?
• Recent issue of Science: top 25 questions
– What Is the Universe Made Of?
– What is the Biological Basis of Consciousness?
– Why Do Humans Have So Few Genes?
– To What Extent Are Genetic Variation and Personal Health Linked?
– Can the Laws of Physics Be Unified?
– How Much Can Human Life Span Be Extended?
– What Controls Organ Regeneration?
– How Can a Skin Cell Become a Nerve Cell?
– How Does a Single Somatic Cell Become a Whole Plant?
– How Does Earth's Interior Work?
– Are We Alone in the Universe?
– How and Where Did Life on Earth Arise?
– What Determines Species Diversity?
– What Genetic Changes Made Us Uniquely Human?
– How Are Memories Stored and Retrieved?
– How Did Cooperative Behavior Evolve?
– How Will Big Pictures Emerge from a Sea of Biological Data?
– How Far Can We Push Chemical Self-Assembly?
– What Are the Limits of Conventional Computing?
– Can We Selectively Shut Off Immune Responses?
– Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
– Is an Effective HIV Vaccine Feasible?
– How Hot Will the Greenhouse World Be?
– What Can Replace Cheap Oil -- and When?
– Will Malthus Continue to Be Wrong?
Basic Molecular Biology
Life, Cells, Proteins
• The study of life --> the study of cells
• Cells are born, do their job, duplicate,
die
– What is “their job”?
– Break down nutrients, produce energy,
produce required molecules
• All these processes controlled by
proteins
Protein functions
• “Enzymes” (catalysts)
– Control chemical reactions in cell
• Transfer of signals/molecules between
and inside cells
– E.g., sensing of environment
• Regulate production of other proteins
Protein molecule
• Protein is a sequence of amino-acids
• 20 possible amino acids
• The amino-acid sequence “folds” into a
3-D structure called protein
Protein Structure
[Figure: The DNA repair protein MutY (blue) bound to DNA (purple). PNAS cover, courtesy Amie Boal.]
DNA
• Deoxyribonucleic acid: a molecule that
is involved in production of proteins
• Double helical structure (discovered by
Watson, Crick, Wilkins & Franklin)
• Chromosomes are densely coiled and
packed DNA
[Figure labels: Chromosome, DNA. SOURCE: http://www.microbe.org/espanol/news/human_genome.asp]
The DNA Molecule
5’ G A T G C G T G T T A A C T 3’
   | | | | | | | | | | | | | |
3’ C T A C G C A C A A T T G A 5’
Base = Nucleotide
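The base pairing between the two strands above (A with T, G with C) can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are made up for this example, and the input is the 14-base strand read off the slide.

```python
# Watson-Crick base pairing: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base paired with each position, in the same left-to-right order."""
    return "".join(PAIRING[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top = "GATGCGTGTTAACT"              # strand from the slide, read 5' to 3'
print(complement(top))              # opposite strand, read 3' to 5'
print(reverse_complement(top))      # opposite strand, read 5' to 3'
```

The two strands run in opposite directions, which is why the opposite strand has to be reversed before it can be read 5’ to 3’.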
From DNA to Amino-acid sequence
[Figure label: Cell. SRC: http://www.biologycorner.com/resources/DNA-RNA.gif]
From DNA to Protein: In words
1. DNA = nucleotide sequence
• Alphabet size = 4 (A,C,G,T)
2. DNA --> mRNA (single stranded)
• Alphabet size = 4 (A,C,G,U)
3. mRNA --> amino acid sequence
• Alphabet size = 20
4. Amino acid sequence “folds” into 3-dimensional molecule called protein
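Step 2 above can be sketched as a simple string operation. This is a minimal illustration, not a model of the cellular machinery: it assumes the input is the coding (non-template) strand, so the mRNA reads the same with U in place of T. The name `transcribe` and the example sequence are illustrative only.

```python
DNA_ALPHABET = set("ACGT")   # alphabet size 4
RNA_ALPHABET = set("ACGU")   # alphabet size 4, with U replacing T


def transcribe(coding_strand: str) -> str:
    """Map a DNA coding strand to its mRNA by replacing every T with U."""
    unknown = set(coding_strand) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not a DNA sequence, unexpected symbols: {sorted(unknown)}")
    return coding_strand.replace("T", "U")


mrna = transcribe("GATGCGTGTTAACT")
print(mrna)
# Every symbol of the result belongs to the RNA alphabet.
print(set(mrna) <= RNA_ALPHABET)
```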
Central Dogma
• “Information” flows from DNA to RNA to
Protein
• Why “information” ?
• The DNA in a cell has complete
information of which proteins will be
present in the cell
DNA and genes
• DNA is a very “long” molecule
• DNA in human has 3 billion base-pairs
– String of 3 billion characters !
• DNA harbors “genes”
– A gene is a substring of the DNA string
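Treating DNA as a string and a gene as a substring maps directly onto string slicing. The sketch below is only an illustration: the short `genome` string and the coordinates are hypothetical, chosen for the example, and real genes are far longer.

```python
def extract_gene(genome: str, start: int, end: int) -> str:
    """Return the substring genome[start:end] (0-based start, end exclusive)."""
    if not 0 <= start <= end <= len(genome):
        raise ValueError("gene coordinates fall outside the genome")
    return genome[start:end]


genome = "GATGCGTGTTAACT"            # toy stand-in for a 3-billion-character string
gene = extract_gene(genome, 1, 13)   # hypothetical gene coordinates
print(gene)
print(gene in genome)                # a gene is a substring of the DNA string
```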
Genes code for proteins
• DNA --> mRNA --> protein can actually
be written as Gene --> mRNA --> protein
• A gene is typically a few hundred base-pairs (bp) long
Transcription
• Process of making a single stranded
mRNA using double stranded DNA as
template
Step 1: From DNA to mRNA
Transcription
Translation
• Process of making an amino acid sequence
from (single stranded) mRNA
• Each triplet of bases translates into one
amino acid: each such triplet is called “codon”
• The translation is basically a table lookup
SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm
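The table lookup described above can be sketched as a dictionary from codons to one-letter amino acid symbols. This is a minimal illustration that assumes the standard genetic code, a reading frame starting at the first base, and translation ending at the first stop codon. The names `CODON_TABLE` and `translate` are chosen for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ...
# One-letter amino acid symbols; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    # Step through complete triplets only; trailing bases are ignored.
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGCGUGUUAAC"))
```

The 64 codons map onto only 20 amino acids plus the stop signal, so several codons share the same table entry.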
Step 2: mRNA to Amino acid sequence
Translation
Review so far
• Proteins: important molecules, amino acid
sequences
• DNA: structure, base-pairing.
• Genes: substrings of DNA
• Gene --> mRNA (transcription)
• mRNA --> amino acid sequence (translation),
genetic code.
Gene expression
• Process of making a protein from a
gene as template
• Transcription, then translation
• Can be regulated
Transcriptional regulation
[Figure labels: TRANSCRIPTION FACTOR, GENE, ACAGTGA, PROTEIN]
The importance of gene
regulation
Genetic regulatory network controlling the development of the body plan of the sea urchin embryo
Davidson et al., Science, 295(5560):1669-1678.
• That was the “circuit” responsible for
development of the sea urchin embryo
• Nodes = genes
• Switches = gene regulation
• Change the switches and the circuit
changes
• Gene regulation significance:
– Development of an organism
– Functioning of the organism
– Evolution of organisms
Genome
• The entire sequence of DNA in a cell
• All cells of an organism have the same genome
– All cells came from repeated duplications
starting from initial cell (zygote)
• Human genome is 99.9% identical
among individuals
• Human genome is 3 billion base-pairs
(bp) long
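To put the two figures above together, a short calculation shows what 99.9% identity means for a genome of 3 billion bp. This is a back-of-the-envelope sketch that treats both numbers as exact; the helper names are chosen for this example, and `percent_identity` assumes two sequences of equal length that are already aligned.

```python
GENOME_LENGTH_BP = 3_000_000_000   # human genome length from the slide
IDENTITY = 0.999                   # 99.9% identical among individuals


def expected_differences(length_bp: int, identity: float) -> int:
    """Number of positions expected to differ between two genomes."""
    return round(length_bp * (1 - identity))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of matching positions in two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# About 3 million differing positions between two individuals.
print(expected_differences(GENOME_LENGTH_BP, IDENTITY))
print(percent_identity("ACGT", "ACGA"))
```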
Genome features
• Genes
• Regulatory sequences
• The above two make up 5% of human
genome
• What’s the rest doing?
– We don’t know for sure
• “Annotating” the genome
– Task of bioinformatics