CS691K Bioinformatics
Kulp Lecture Notes #0
Molecular & Cell Biology
Fall 2005
[email protected]
Logistics
• Syllabus distributed
– Class taught in 3 stages by faculty in CS, math/stats, and microbio
– Grades will be based on up to six homework assignments
– Office hours on syllabus. All faculty are readily available by email.
We are happy to discuss the class with you personally.
– Not all notes will be available online - you should attend all lectures
and take good notes
• Diverse group of students
• Emphasis will be on understanding methods and practical
use of existing bioinformatics tools
• Why are you here? What is your background? What are
you hoping to get out of this class? Please sign the email
sheet!
• Homework will involve the use of the unix ED-LAB
computers. There will be a special meeting on
WEDNESDAY, SEPTEMBER 14 for novice unix users.
What is Bioinformatics
• Computational Biology: The use of algorithmic,
mathematical, and statistical methods to analyze
genome sequences (i.e. DNA, RNA, protein) and
derived data (e.g. expression, NMR, etc.)
• Informatics: The software and data management
methodologies for storing, retrieving, and
integrating such data
• Data Mining / In-silico Biology: Hypothesis
generation and testing from genome data sets
Topics
• Detecting similar sequences (homology)
– Pairwise and multiple sequence alignment
– Protein function/structure prediction
• Sequence pattern modeling and recognition
– Motif discovery
– Gene finding
• Analyzing high-dimension data
– Function prediction, target discovery, etc. from gene
expression
• Constructing trees
– Phylogenetics
• Informatics and integration
– Genome biology
The Cell
• Prokaryotes (bacteria, archaea) are unicellular with minimal compartments
• Eukaryotes have many organelles, including the nucleus, are often
multicellular with differentiation, and typically can
reproduce sexually - all higher organisms including
mammals, birds, fish, invertebrates, mushrooms, plants,
and yeast. Tens of trillions of cells (roughly 3 x 10^13) in a human.
The Cell
• The cell is composed of and makes thousands of proteins, e.g.
– The cell membrane is made of a layer of lipids and proteins.
– There are special proteins embedded in the membrane as channels and
pumps
– And the cell makes (synthesizes) proteins
• “DNA makes RNA, RNA makes proteins, and proteins make us!” - F. Crick
• The cell is a chemical catalytic machine
• Networks:
– One type of network is the metabolic network, describing catalytic
reactions for the consumption or synthesis of products necessary
for life. Many of these are fairly well understood. (e.g.
photosynthesis)
– Another type of network is the signaling network, where information
is conveyed about the environment. These are partially understood.
(e.g. protein kinases are involved in cell differentiation and cell
death)
• From KEGG
(http://www.genome.ad.jp/kegg/pathway.html)
The Cell - Genetic Information
• There is a third major type of network: genetic
information processing. We will focus on these
networks.
• To understand this:
– we describe the nature of DNA
– Tangentially mention homology and conservation
– Then discuss the process of translation
DNA Structure - Eukaryotic Chromosome
• DNA - a string of nucleotides with four bases (Adenine, Guanine, Cytosine, and Thymine)
• Regular, long, stable, oriented, double-stranded, helical structure
• Humans: 23 pairs of chromosomes. Total ~3B “bases” (x2)
• DNA resides in the nucleus in eukaryotes
DNA Structure
DNA
• Always: chemical pairing of A-T and
C-G. Thus, strands are
complementary.
• Two chains run in opposite directions:
one strand reads 5’ to 3’ while its partner reads 3’ to 5’
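The pairing and orientation rules above mean that one strand fully determines the other. A minimal sketch in Python, assuming strands are written 5’ to 3’ as plain strings over A, C, G, T (the function name `reverse_complement` is ours, not from the lecture):

```python
# Map each base to its pairing partner (A-T, C-G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the partner strand, also written 5' to 3'.

    Complementing gives the partner read 3' to 5'; reversing the
    result restores the conventional 5' to 3' reading direction.
    """
    return strand.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCCG"))  # CGGCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on any implementation.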
Prokaryotic Chromosomes
• Prokaryotes (and mitochondria) have one circular chromosome
• This shows the E. coli genome with orange and yellow bars
indicating the positions of the genes on the two strands.
RNA
• RNA is a similar molecule composed of 4 nucleotides (A, C,
G, and U)
• Single-stranded.
• Can base-pair with DNA (synthesis)
• Can self-base-pair and fold
DNA Replication
• We won’t be discussing the details of DNA replication.
There are 2 processes:
– Mitosis for normal cell duplication
– Meiosis for gametes for sexual reproduction - single,
recombined chromosomes
• In both processes, DNA is copied by breaking double-stranded
DNA (dsDNA) into single strands (ssDNA) at origins
of replication and synthesizing a complementary copy
from the template.
– At 50 bp/sec per origin with 15K origins, replicating the ~3B bp
human genome takes ~1 hr
• Problem:
– How does DNA polymerase find the origins? Are there
sequence patterns?
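The replication-time estimate on this slide can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes all origins fire at once and each contributes 50 bp/sec, which is the slide's simplification rather than a model of real replication timing; the variable names are ours.

```python
# Figures taken from the slides: ~3B bases, 50 bp/sec, 15K origins.
genome_bp = 3_000_000_000
rate_bp_per_sec = 50
origins = 15_000

# Total synthesis rate if every origin is active simultaneously.
total_rate = rate_bp_per_sec * origins   # bp per second
seconds = genome_bp / total_rate
hours = seconds / 3600

print(f"{seconds:.0f} s, about {hours:.1f} hr")
```

With a single origin the same genome would take 60,000,000 seconds, roughly two years, which is why many origins are needed.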
The Tree of Life
Single common ancestral genome!
DNA Conservation and Variation
• Mutations occur in DNA due to environmental effects (e.g. radiation)
and random mistakes during synthesis. Usually just single
nucleotides are changed, sometimes there are large rearrangements.
• Those changes occurring in somatic (non-sex) cells cause local
damage, usually cell death, but can cause cancer. (Search for the
common mutations that cause different types of cancers.)
• Those changes occurring in gametes can be inherited and if favorable
can become “fixed”
• Variation in non-functional (junk) DNA tends to “drift”, whereas
functional DNA (e.g. containing genes) tends to remain “conserved”.
• Problems:
– Given a set of sequences from different organisms:
• Identify and align sequences from a common ancestor (homologous)
• What are the important (conserved) parts?
• What was the evolutionary history? (Reconstruct the “tree”)
– Given a model organism (e.g. mouse, yeast, fruitfly, etc.), find the
orthologous locus in human
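Once sequences are aligned, the "important (conserved) parts" question reduces to inspecting alignment columns. The sketch below assumes the alignment has already been computed and uses "-" for gaps; the three toy sequences are made up for illustration and the function name `conserved_columns` is ours.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where every sequence
    carries the same residue and none has a gap."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    return [
        i for i in range(length)
        if alignment[0][i] != "-" and len({seq[i] for seq in alignment}) == 1
    ]

# Hypothetical aligned sequences from three organisms.
toy_alignment = ["ACGTTGCA",
                 "ACGATGCA",
                 "ACG-TGCA"]

print(conserved_columns(toy_alignment))  # [0, 1, 2, 4, 5, 6, 7]
```

Real analyses usually score columns by degree of conservation rather than requiring perfect identity, but the column-wise view is the same.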
Examples of Sequence Conservation
• A segment from the RNA needed for protein synthesis - a fundamental
process in all life forms. It is conserved across all 3 major branches of
the tree of life.
• A multiple alignment of homologous protein sequences. Colors
indicate different classes of amino acids. Dots are inserts/deletes.
DNA contains “GENES”
• Genes are hereditary units of DNA
– We now know that, for the most part, genes are regions that “code”
for proteins
• Proteins are derived from DNA according to the “central
dogma”: DNA => RNA => Protein
– As in DNA replication, the DNA is opened into two single strands.
– Using one ssDNA strand as a template, a complementary copy of RNA is
synthesized for a small region of the genome (1,000-100,000 nt)
– The RNA is processed and transported (more about that in later
lectures)
– Each triplet of RNA nucleotides (codon) is translated to one of 20 amino
acids, creating a polypeptide chain, which folds into a protein
• Problems:
– How does the cell know where to find a gene? (Sequence
patterns?)
– How does RNA transcription know when to stop? (Patterns?)
– How is RNA edited?
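The DNA => RNA => Protein steps can be sketched in a few lines of Python. This is a simplified illustration, not a gene finder: it assumes the input is the coding strand written 5’ to 3’ (so the RNA is the same sequence with U in place of T, which equals the complement of the template strand), ignores RNA processing, and starts reading at the first base. The standard genetic code is built from its usual compact one-letter listing; the function names are ours.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(coding_strand: str) -> str:
    """RNA copy of the coding strand: same sequence with U replacing T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon ('*') or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGCTTGGATAA")
print(mrna)             # AUGCUUGGAUAA
print(translate(mrna))  # MLG
```

Here AUG, CUU, and GGA give Met, Leu, and Gly, and UAA is a stop codon that ends the chain.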
“Central Dogma” - DNA - RNA - Protein
©1998 by Alberts, Bray, Johnson, Lewis, Raff, Roberts, Walter
Codon Translation
• Each triplet translates to exactly one amino acid (or a stop signal). For
example, CUU is Leucine.
• There are 4*4*4=64 possible codons that translate into 20
amino acids, so most amino acids are encoded by more than one codon
• This translation table is fixed for almost all life
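The 64-codons-to-20-amino-acids counting can be verified directly. The sketch below rebuilds the standard genetic code from its usual compact one-letter listing (codons in U, C, A, G order, "*" marking stop) and counts how many codons map to each amino acid; variable names are ours.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Count codons per amino acid, then set the stop codons aside.
codons_per_amino_acid = Counter(CODON_TABLE.values())
stop_codons = codons_per_amino_acid.pop("*")

print(len(CODON_TABLE))            # 64 codons in total
print(len(codons_per_amino_acid))  # 20 amino acids
print(stop_codons)                 # 3 stop codons
print(codons_per_amino_acid["L"])  # 6 codons for Leucine
print(codons_per_amino_acid["M"])  # 1 codon for Methionine
```

So 61 codons encode amino acids and 3 signal stop, with redundancy ranging from one codon (Methionine, Tryptophan) to six (Leucine, Serine, Arginine).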
Cell Differentiation
• Eukaryotes have many different cell types (skin,
muscle, neurons, etc.) that each play a different
role.
• To accomplish the cell’s role, different genes must
be activated
• Problems:
– How are genes activated? What regulatory patterns are
in the DNA?
– What genes control other genes? What network
associations among genes can be found?
– What genes are “differentially expressed”?
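One common first look at "differentially expressed" genes is the log2 fold change between two cell types. The sketch below is only an illustration: the gene names and expression values are invented, the pseudocount and the threshold of 1.0 (a two-fold change) are arbitrary choices, and a real analysis would use replicates and a statistical test.

```python
import math

def log2_fold_change(a: float, b: float, pseudocount: float = 1.0) -> float:
    """log2 of the ratio a/b, with a pseudocount to avoid dividing by zero."""
    return math.log2((a + pseudocount) / (b + pseudocount))

# Hypothetical expression levels: gene -> (skin, muscle).
expression = {
    "geneA": (7.0, 63.0),
    "geneB": (20.0, 22.0),
    "geneC": (99.0, 24.0),
}

threshold = 1.0  # assumed cutoff: at least a two-fold change
differential = {}
for gene, (skin, muscle) in expression.items():
    change = log2_fold_change(muscle, skin)
    if abs(change) >= threshold:
        differential[gene] = change

print(differential)
```

In this toy data geneA is higher in muscle, geneC is higher in skin, and geneB is essentially unchanged and is filtered out.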
Cell Differentiation
Differential Expression
• Interleukin 1 alpha expressed in different cell
types
Protein Sequence, Structure, Function
• Lastly, given a protein sequence, what is the 3-D
structure and function?
• The most common approach is to exploit
conservation (see earlier)
• Problem:
– Find similar proteins to my query protein. Maybe I can
assign structure or function to my new query protein, if
structure or function is already known for a homologous
protein. (Sequence similarity searching, protein family
modeling)
Protein Structure
Further Reading
• Many online intros to genome biology
– E.g. http://www.ncbi.nlm.nih.gov/About/primer/
• Any molecular biology text
– E.g. Molecular Biology of the Cell by Alberts et al. or
Genomes by Brown.