Download cis667-1 - Electrical Engineering and Computer Science

CIS 667 Bioinformatics
Cleveland State University
Department of Computer and Information Science
Fall 2003

What is Bioinformatics?
• Field of science in which biology, computer science, and information technology merge to form a single discipline
  - Historically, the creation and maintenance of biological sequence databases has been important
• Biology is being transformed from a purely lab-based science to an information science as well

What is Bioinformatics?
• Three important sub-disciplines
  - Development of new algorithms and statistical methods to analyze relationships among members of large data sets
  - Analysis and interpretation of various types of data (nucleotide and amino acid sequences, protein structures, etc.)
  - Development and implementation of tools for efficient access to and management of various types of data

Why now?
• Recent advances in molecular biology and genomic technologies have led to explosive growth in the amount of biological information generated
• This requires computerized databases to store, organize, and index the data, and specialized tools to view and analyze it

What skills should a Bioinformatician have?
• Deep background in some area of molecular biology
• Understanding of the central dogma of molecular biology
• Substantial experience with at least one or two major software packages
• Experience working in a command-line computing environment
• Experience with both high-level and scripting languages

Others…
• Molecular evolution
• Physical chemistry
• Statistics and probability
• Database design
• Algorithm development
• Molecular biology lab methods

What will we learn?
• Central dogma of molecular biology + other necessary biology background
• Working in a Unix command-line environment
• Programming in Perl
• Algorithms for molecular biology
• Hands-on experience with bioinformatics tools

Molecular Biology
• Primarily concerned with two basic molecules of all living things:
  - Proteins
    - Structural proteins are tissue building blocks, while enzymes catalyze chemical reactions
    - Proteins are chains of amino acids

[Figure: Example Amino Acid, with labels: Amino Group (H2N), Alpha Carbon (C, with H), Side Chain (CH3), Carboxy Group (COOH)]

Amino Acids
• There are 20 naturally occurring amino acids
  - Amino acids can be identified by a 3-letter code (and sometimes by a 1-letter code)
  - In a protein, amino acids are joined by peptide bonds (the C from the carboxy group binds to the N from the amino group)
  - A water molecule is liberated, so we speak of residues in the chain

Amino Acids

Name            One-letter code   Three-letter code
Alanine         A                 Ala
Cysteine        C                 Cys
Aspartic Acid   D                 Asp
Glutamic Acid   E                 Glu
Phenylalanine   F                 Phe
Glycine         G                 Gly
Histidine       H                 His
Isoleucine      I                 Ile
Lysine          K                 Lys
Leucine         L                 Leu
Methionine      M                 Met
Asparagine      N                 Asn
Proline         P                 Pro
Glutamine       Q                 Gln
Arginine        R                 Arg
Serine          S                 Ser
Threonine       T                 Thr
Valine          V                 Val
Tryptophan      W                 Trp
Tyrosine        Y                 Tyr

Proteins
• A typical protein contains about 300 residues
• Chains have an amino group at one end and a carboxy group at the other, giving the chain an orientation (start to end)
• The sequence of residues in the chain is called the protein’s primary structure

Proteins
• Proteins fold in three dimensions, resulting in secondary, tertiary, and quaternary structures
• The two most common secondary structures are the α-helix and the β-sheet

Secondary Structure
• Only a small number of patterns are common
• Patterns are formed by regular intramolecular hydrogen bonding

Proteins
• The specific shape that a protein folds into determines its unique function
  - Different shapes mean the protein can bind to different molecules
• Proteins are produced in a cell structure called a ribosome
  - Amino acids are added one after the other in the sequence coded by a messenger ribonucleic acid (mRNA) molecule

[Figure: Ribosomes, with the large subunit and small subunit labeled]

Nucleic Acid
• Two types of nucleic acids
  - Ribonucleic acid (RNA)
  - Deoxyribonucleic acid (DNA)
• DNA, like protein, is a chain of simpler molecules, but double stranded
  - Each strand consists of a chain of nucleotides

Nucleic Acids
• Each nucleotide consists of
  - A sugar molecule
  - A phosphate residue
  - A base
• The sugar molecule has five carbon atoms labeled 1’ - 5’
  - The 3’ carbon of one nucleotide is bound to the 5’ carbon of the next nucleotide in the chain, giving an orientation to the chain
  - 5’ is the start and 3’ is the end

[Figure: Nucleic Acids]

Nucleic Acids
• The chain of sugar/phosphate groups forms the backbone of a strand of DNA
• Attached to each 1’ carbon in the backbone is a molecule called a base
  - There are four different bases
    - Adenine (A)
    - Guanine (G)
    - Cytosine (C)
    - Thymine (T)

DNA
• DNA molecules are double strands
  - The strands form a double helix
  - The strands are held in the helix form by bonds between complementary bases in the two strands
    - A and T are complements
    - G and C are complements
  - We refer to the paired bases as base pairs (bp) and use base pairs as the unit of length of DNA molecules

[Figure: DNA Double Helix]

DNA
• DNA can be considered as a string of letters from the set {A, T, C, G}
  - 5’ … TACTGAA … 3’
• The other strand connected to this one is antiparallel and complementary
  - 3’ … ATGACTT … 5’
• Note that the orientations of the two strands are opposite

DNA
• Given one of the strands, we can infer the other strand
  - One of the strands can act as a template for the construction of the other
  - This property allows for cell division and replication, with each new cell containing a copy of the DNA from the original cell
• Complementary base pairs are held together by hydrogen bonds

DNA
• In higher organisms, DNA is found inside the cell nucleus
  - Also in cell organelles called mitochondria (plants and animals) and chloroplasts (plants only)
• The DNA is found in a few very long DNA molecules called chromosomes

RNA
• RNA molecules are similar to DNA, but
  - Have a different sugar
  - Have the base uracil (U) instead of thymine (T)
    - U binds with A, as does T
  - RNA does not form a double helix
    - Hybrid DNA-RNA helices may form
    - Parts of an RNA molecule may bind to other parts of the same molecule by complementarity
  - Three-dimensional structure is variable (compare Protein)

Central Dogma of Molecular Biology
• Information stored in DNA is used to make a transient RNA
  - The process is called transcription and is accomplished by the enzyme RNA polymerase
• The RNA is used to make proteins
  - The process is called translation and is performed by ribosomes

[Figure: RNA Transcription]

Genes and the Genetic Code
• Each protein in an organism is specified by a contiguous stretch of DNA called a gene
  - Remember that the DNA is contained in a small number of molecules called chromosomes
  - Not all of the DNA specifies some protein
  - Some genes code for RNA products

Gene Expression
• Gene expression is the process of using the information stored in DNA to make an RNA molecule and then a protein
  - RNA polymerases must
    - determine the start of genes
    - determine whether the protein coded by a gene is needed at the present moment
  - The start of a gene is marked by a 13-nucleotide promoter sequence (why 13 and not, e.g., 1?)

[Figure: Gene Expression]

Gene Expression
• How does the RNA polymerase then tell if a protein should now be produced?
  - Specific regulatory genes produce proteins that, in some circumstances, can bind to a cell’s DNA near the promoter sequence of a gene they control
  - Positive regulation: binding makes it easier for RNA polymerase to initiate transcription; negative regulation: binding makes it harder

Genetic Code
• A gene codes the sequence of amino acids needed to form a protein
• 20 amino acids (aa) > 4 bases
  - More than one base is needed to specify an aa
  - 4^2 = 16 < 20, so 2 bases are not enough; 4^3 = 64 > 20, so 3 bases suffice
  - Each sequence of 3 bases (a codon) codes for an amino acid (with 3 exceptions)
  - Three codons cause translation to end and are called stop codons

Genetic Code
• Since 64 > 20, more than one codon must code for some amino acid(s)
• In fact, 18 of the 20 amino acids are coded for by more than one codon
• The genetic code is therefore a degenerate code
  - Errors in transcription may not cause the wrong aa to be produced (especially if the error is in the 3rd nucleotide)
  - Even if the wrong aa is produced due to a single error, a similar aa is likely to be produced

Open Reading Frames
• One special start codon (AUG) marks the spot where translation begins
• A sequence of codons is called a reading frame
  - A sequence of codons which begins with a start codon and has no stop codons is called an open reading frame (ORF)

Prokaryotes and Eukaryotes
• Living organisms may be classified as either prokaryotes (bacteria) or eukaryotes (higher organisms like yeast, plants, people)
  - The cells of eukaryotes have a nucleus while those of prokaryotes don’t
  - DNA is linear in eukaryotes and circular in prokaryotes

[Figure: Prokaryotes and Eukaryotes]

Introns and Exons
• In prokaryotes, the mRNA copies of the genes correspond directly to the DNA sequence in the genome (with U substituted for T)
• In eukaryotes, the mRNA is carried outside the nucleus before translation
  - The mRNA is modified by splicing out intron sequences and rejoining the exons that flank them

Introns and Exons
• Splicing is controlled by enzyme complexes called spliceosomes
  - Incorrect splicing leads to frame shifts or premature stop codons, which make the resulting protein useless
  - The position of introns is signalled by several specific sequences of nucleotides
  - Since there is more than one such sequence, alternative splicing is possible, resulting in different proteins being produced in different circumstances

Molecular Biology Tools
• A small set of laboratory techniques is used by molecular biologists to identify the information content of organisms so that it can be processed using bioinformatics methods

Restriction Enzyme Digests
• Restriction enzymes can be used to cut DNA molecules wherever a particular sequence occurs
  - Digesting a DNA molecule and observing how many fragments occur gives some insight into the organization and sequence of that DNA
  - This is called restriction mapping
  - It allowed the isolation of, and experimentation on, individual genes for the first time

[Figure: Restriction Enzyme Digests]

Gel Electrophoresis
• We can separate the fragments of DNA obtained by restriction enzymes with gel electrophoresis
  - DNA fragments are pulled through a gel towards an electrical charge
  - Larger fragments do not move as quickly, so this provides a way to separate the fragments by size

[Figure: Gel Electrophoresis]

Blotting and Hybridization
• To study a single fragment, DNA is transferred from the gel to a piece of paper or cloth (blotting)
  - The DNA fragments are then permanently attached to the membrane using (e.g.) UV light
  - A specially prepared labeled fragment of DNA (a probe) is allowed to base pair with the fragments to try to find a specific fragment

Blotting and Hybridization
• The probe is tagged using (e.g.) a fluorescent dye (hybridization)
  - Then determine where on the membrane base pairing has occurred
• DNA chip or microarray techniques are similar
  - Thousands of nucleotide sequences are affixed to portions of a small silica chip
  - A large number of probes are washed over the chip and a laser is used to find which probes bind to which sequences

[Figure: DNA Chip]

Cloning
• Large amounts of DNA material are typically required for analysis
  - In cloning, specific DNA fragments are inserted into chromosome-like carriers called vectors in living cells
  - The identical copies of the fragments are called molecular clones and can be stored in libraries for later study
  - Vectors are derived from bacteria and yeast chromosomes

[Figure: Cloning]

Polymerase Chain Reaction
• Polymerase Chain Reaction (PCR) is an alternative to cloning for amplifying DNA fragments
  - DNA fragments are heated to break them into single strands
  - Short probes (primers) are added to bind to the portion of DNA to be amplified
  - DNA polymerase grows the strands from the probes
  - The process repeats

[Figure: Polymerase Chain Reaction]
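
Looking back at the sequence-level material in these slides (complementary strands, transcription, codons, and the degenerate genetic code), a small Python sketch can make the string view of DNA concrete. It is only an illustration, not part of the course material: the function names are made up for this example, the codon table is the standard genetic code written in U, C, A, G order, and `translate` simply starts at the first AUG it finds and stops at the first stop codon.

```python
from collections import Counter

# Standard genetic code, codons enumerated in U, C, A, G order ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

# A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Base-by-base complement (the other strand, read 3' to 5')."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """The other strand, written in the usual 5' to 3' direction."""
    return complement(dna)[::-1]


def transcribe(coding_strand: str) -> str:
    """mRNA matches the coding strand with U substituted for T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Number of codons per amino acid, ignoring the stop codons.
codons_per_aa = Counter(aa for aa in AMINO if aa != "*")
degenerate = sum(1 for n in codons_per_aa.values() if n > 1)

print(complement("TACTGAA"))          # the slide's example strand
print(reverse_complement("TACTGAA"))
print(translate(transcribe("CCATGGCTTGGTAAGG")))
print(degenerate, "of", len(codons_per_aa), "amino acids have several codons")
```

Counting the entries of the table is one way to check the claim that 18 of the 20 amino acids are coded for by more than one codon; only methionine (M) and tryptophan (W) have a single codon.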

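
The laboratory steps from the Molecular Biology Tools slides can also be imitated on strings. The sketch below is a toy model under stated assumptions: `digest`, `gel_order`, and `pcr_copies` are hypothetical helper names, the DNA is treated as a linear single-strand string, the example site is the EcoRI recognition sequence GAATTC (cut after the first G), and PCR is idealized so that every cycle exactly doubles the number of copies.

```python
def digest(dna: str, site: str, cut_offset: int) -> list[str]:
    """Cut a linear DNA string at every occurrence of `site`.

    `cut_offset` is how many bases into the site the cut falls.
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    # Drop empty pieces that arise when a site sits at the very start.
    return [f for f in fragments if f]


def gel_order(fragments: list[str]) -> list[str]:
    """Order fragments as on a gel: smallest (fastest-moving) first."""
    return sorted(fragments, key=len)


def pcr_copies(initial_copies: int, cycles: int) -> int:
    """Idealized PCR: each cycle doubles the number of copies."""
    return initial_copies * 2 ** cycles


dna = "AAGAATTCTTTTGAATTCGG"
fragments = digest(dna, "GAATTC", 1)
print(fragments)
print([len(f) for f in gel_order(fragments)])
print(pcr_copies(1, 10))
```

Real digests act on double-stranded DNA and real PCR efficiency is below 100%, so the numbers here are upper bounds on what an experiment would show.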