BIONF/BENG 203: Functional Genomics
Sources of Functional Data
Lectures 1 and 2
Lecture TI 1
Trey Ideker
UCSD Department of Bioengineering

Grading
- 40% Problem Sets (best 4 of 5)
- 30% Midterm
- 30% Final Project

Outline of the course (total of 17 lectures)
- Biological data sources (2)
- Data preprocessing (6)
- Unsupervised: Clustering, Inference
- Supervised: Classification
- Population Genetics and Linkage
- Single Source (3) (3) (1)
- MultiSource (2)
- Project Presentations (2)
- FINAL PROJECT

Functional Genomics Data
- Expression: mRNA, protein
- Molecular interactions: protein, mRNA, small molecules
- Knockout phenotypes: 1st, 2nd, higher orders
- SNP sequence (polymorphism) data
- Imaging data: sub-cellular localization, cell morphology
- Gene ontology

Dividing the data into two classes of information: Biological Networks and Network States

1) Directly observe the network “wires” themselves
- Protein-protein interactions: two-hybrid system, coIP, protein antibody arrays (databases: BIND, DIP)
- Protein-DNA interactions: chromatin IP (databases: BIND, Transfac, SCPD)
- Other types not yet possible: e.g., protein-small molecule

2) Observe molecular states that result from the interaction wiring
- DNA/RNA gene expression: DNA microarrays, SAGE
- Protein levels, locations, and modifications: mass spectrometry, fluorescence microscopy, protein arrays
- Gross phenotypes: e.g., growth rates of single and double deletion strains

High-throughput methods for measuring cellular states
- Gene expression levels: RT-PCR, arrays
- Protein levels, modifications: mass spec
- Protein locations: fluorescent tagging
- Metabolite levels: NMR and mass spec
- Systematic phenotyping

The transcriptome and proteome
- The transcriptome is the full complement of RNA molecules produced by a genome
- The proteome is the full complement of proteins enabled by the transcriptome

DNA → RNA → protein
Genome → transcriptome → proteome
30,000 genes → ??? RNAs → ??? proteins?
For example, the Drosophila gene Dscam can generate 40,000 distinct transcripts through alternative splicing. What is the minimum number of exons that would be required?

Expression: High-throughput approaches
RNA
- DNA microarrays
- cDNA / EST sequencing
- RT-PCR
- Differential display
- SAGE
- Massively parallel signature sequencing (MPSS)
Proteins
- 2D PAGE
- Mass spectrometry

Gene expression arrays
They are really, really, really, really, really, really, really, really, really, really, really, really, really important

Microarrays
Monitor the expression level of each gene:
- Is it turned on or off in a particular biological condition?
- Is this on/off state different between two biological conditions?
A microarray is a rectangular grid of spots printed on a glass microscope slide, where each spot contains DNA for a different gene.

Two-color DNA microarray design
[Figure: reverse transcription and two-color labeling; cDNA chip of brain glioblastoma]

Types of microarrays
Spotted (cDNA)
- Robotic transfer of cDNA clones or PCR products
- Spotting on nylon membranes or glass slides coated with poly-lysine
Synthetic (oligo)
- Direct oligo synthesis on a solid microarray substrate
- Uses photolithography (Affymetrix) or ink-jet printing (Agilent)
All configurations assume the DNA on the array is in excess of the hybridized sample; thus the kinetics are linear and the spot intensity reflects the amount of hybridized sample.
Labeling can be radioactive, fluorescent (one-color), or two-color.

Microarray Spotter
Affymetrix High Density Arrays

Microarrays (continued)
Imaging
- Radioactive 32P labeling: autoradiography or phosphorimager
- Fluorescent labeling: confocal microscope (invented by Marvin Minsky!!)
Feature density
- Nylon membrane macroarrays: 100-1000 features
- Glass slide spotted array: 5,000 features / cm²
- Synthesized arrays: 50,000 features / cm²

Microarray confocal scanner
- Collects sharply defined optical sections from which 3D renderings can be created
- The key is spatial filtering to eliminate out-of-focus light or glare in specimens whose thickness exceeds the immediate plane of focus
- Two lasers for excitation
- Two-color scan in less than 10 minutes
- High resolution, 10 micron pixel size

cDNA / EST sequencing projects
cDNA = complementary or copy DNA
EST = Expressed Sequence Tag
- The microarray could be described as a “closed system” because information about RNAs is limited by the targets available for hybridization. RNAs not represented on the array are not interrogated.
- Direct sequencing of cDNAs (yielding ESTs) overcomes this problem by large-scale random sampling of sequences from a whole-cell RNA extract
- Statistical counting of distinct sequences provides an estimate of expression level
- Conversely, a cDNA library can be normalized to capture rare messages
- Requires large-scale sequencing to get statistical significance

cDNA / EST Sequencing: Preparation of a cDNA library in a phage λ vector

SAGE Technology
Serial Analysis of Gene Expression
- Takes the idea of sequence sampling to the extreme
- Generates short ESTs (9-14 nt) which are joined into long concatemers and then sequenced
- 4^9 is 262,144 possible 9-nt tags, ~5-fold the number of human genes
- The count of each type of tag estimates RNA copy number
- >50X more efficient than cDNA sequencing because many RNAs are represented in a single sequencing run

Steps to SAGE
- Copy mRNA into ds cDNA using a biotinylated oligo(dT) primer
- Cleave with an anchoring enzyme (AE), which cleaves within ~250 bp of the poly-A tail at the 3’ end. Capture this segment on streptavidin beads
- Ligate to linkers containing a type IIS restriction site; the corresponding enzyme cleaves the DNA 14 bp away from this site
- Ligate sequences to each other and PCR amplify
- Cleave with AE to remove linkers
- Concatenate, clone, and sequence
Velculescu et al. Science (1995)

WHY DI-TAGS?
- Ditags are used to detect bias in the PCR amplification step.
- The probability of any two tags being coupled in the same ditag is small.
- Biased amplification can be detected as many ditags always having the same 2 tags present.
[Figure: ditags formed from A-linked and B-linked tags, amplified between Primer A and Primer B sites]

SAGE (continued)
Example of a concatemer (TAG1 to TAG4 are marked along it on the slide):
CATGACCCACGAGCAGGGTACGATGATACATGGAAACCTATGCACCTTGGGTAGCACATG

Counting the tags:
Tag Sequence    Count
ATCTGAGTTC      1075
GCGCAGACTT      125
TCCCCGTACA      112
TAGGACGAGG      92
GCGATGGCGG      91
TAGCCCAGAT      83
GCCTTGTTTA      80
GCGATATTGT      66
TACGTTTCCA      66
TCCCGTACAT      66
TCCCTATTAA      66
GGATCACAAT      55
AAGGTTCTGG      54
CAGAACCGCG      50
GGACCGCCCC      48

Proteomics
- SDS PAGE
- 2D PAGE
- MS/MS

An example SDS-PAGE
How many proteins are in a band?
Protein stains:
- Silver
- Copper
- Coomassie Blue

2D-PAGE
- Dimension 1: isoelectric focusing gel
- Dimension 2: size
[Figure: 2D gel from macrophage phagosomes]

Mass spectrometry
Mass spectrometers consist of three essential parts:
- Ionization source: converts peptides into gas-phase ions (MALDI, ESI)
- Mass analyzer: separates ions by mass-to-charge (m/z) ratio (ion trap, time of flight, quadrupole)
- Ion detector: current over time indicates the amount of signal at each m/z value

MS/MS Overview
[Figure: a raw fragmentation spectrum]
By calculating the molecular weight difference between ions of the same type, the sequence can be determined.
SEQUEST uses the fragmentation pattern to search through a complete protein database to identify the sequence which best fits the pattern.
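
The mass-difference idea above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, and it is not how SEQUEST works (SEQUEST scores database candidates against the spectrum). It assumes an idealized, noise-free ladder of singly charged b-ions, unmodified residues, and the standard monoisotopic residue masses. The function names `b_ion_ladder` and `read_sequence` are hypothetical, and the peptide EACDPLR is used purely as an example input.

```python
# Monoisotopic residue masses in daltons (standard values).
# Leucine and isoleucine have identical masses, so "L" stands for either.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # mass of the charge-carrying proton


def b_ion_ladder(peptide):
    """Return the singly charged b-ion m/z values b1..bn for a peptide."""
    ladder = []
    total = PROTON
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        ladder.append(total)
    return ladder


def read_sequence(ladder, tol=0.01):
    """Infer residues from mass differences between consecutive b-ions.

    Assumption: the ladder is complete and sorted by increasing m/z.
    A difference that matches no residue within `tol` is reported as "?".
    """
    sequence = []
    previous = PROTON
    for mz in ladder:
        diff = mz - previous
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - diff) <= tol]
        sequence.append(matches[0] if matches else "?")
        previous = mz
    return "".join(sequence)


ladder = b_ion_ladder("EACDPLR")
print(read_sequence(ladder))  # EACDPLR
```

Real spectra are incomplete and noisy and mix several ion types, which is one reason database search is used in practice rather than reading the ladder directly.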
Tandem Mass Spec (MS/MS)
[Figure: typical nanoelectrospray source]

Isotope Coded Affinity Tags (ICAT)
Mass spec based method for measuring relative protein abundances between two samples
ICAT Reagents:
- Heavy reagent: d8-ICAT (X = deuterium)
- Normal reagent: d0-ICAT (X = hydrogen)
[Structure: biotin tag, linker (d0 or d8, carrying the X positions), thiol-specific reactive group]

Protein Quantification & Identification via ICAT Strategy
- Mixture 1 and Mixture 2: ICAT-labeled cysteines (light and heavy)
- Combine and proteolyze (trypsin)
- Affinity separation (avidin)
- Quantitation: light and heavy peaks compared (spectrum shown over m/z 550-580)
- Protein identification: fragmentation spectrum of NH2-EACDPLR-COOH (shown over m/z 0-800)
ICAT Flash animation: http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/ICAT/ICAT.html

ICAT continued
- The heavy (blue) and light (gray) peptides are separated and quantified to produce a ratio for each peptide; here, a single peptide ratio is shown
- Each peptide is subjected to CID fragmentation in the second MS stage in order to identify it

Metabolomic measurements
- 2D NMR or mass spectrometry
- Currently not global and in less widespread use than microarrays, but have tremendous potential

Gene knockout and RNAi libraries for model species
Example from yeast: replacement of yeast ORFs with the kanMX gene flanked by unique oligo barcodes (Yeast Deletion Project Consortium)

YFP tagging for protein localization
YFP is green, transmitted light is red
- NIC96: nuclear pore
- TUB1: tubulin cytoskeleton
- HHF2: histone, nucleus
- BNI4: bud neck
Images courtesy T. Davis lab
See also recent work by Weissman and O’Shea labs at UCSF

Systematic phenotyping
Barcodes (UPTAG): CTAACTC, TCGCGCA, TCATAAT
Deletion strains: yfg2Δ, yfg3Δ, yfg1Δ
Rich media … Growth 6 hrs in minimal media (how many doublings?)
Harvest and label genomic DNA

Systematic phenotyping with a barcode array
Ron Davis and friends…
These oligo barcodes are also spotted on a DNA microarray
Growth time in minimal media:
- Red: 0 hours
- Green: 6 hours

Molecular Interactions
Among proteins, mRNA, small molecules, and so on…
(▲ = method for measuring the interactions, ▼ = method for measuring the levels)
- Protein→DNA interactions: ▲ Chromatin IP
- Gene levels (on/off): ▼ DNA microarray
- Protein-protein interactions: ▲ Protein coIP
- Protein levels (present/absent): ▼ Mass spectrometry
- Biochemical reactions: ▲ Not yet!!!
- Biochemical levels: ▼ Metabolic flux measurements

Like sequence data, protein interaction data are growing exponentially…
[Figure: EMBL Database Growth, total nucleotides (gigabases), 1980-2000; DIP Database Growth, total interactions, 2000-2005]
(As are the false positives!!!)

High-throughput methods for measuring interaction networks
- 2-hybrid
- co-immunoprecipitation w/ mass spec
- ChIP-on-chip
- systematic genetic analysis

Yeast two-hybrid method
Fields and Song

Detection of protein interactions with antibody arrays
MacBeath and Schreiber

Kinase-target interactions
Mike Snyder and colleagues

Protein interactions by protein immunoprecipitation followed by mass spectrometry
- TEV = Tobacco Etch Virus proteolytic site
- CBP = Calmodulin binding peptide
- Protein A = IgG-binding protein from Staphylococcus
Gavin / Cellzome TAP purification
Image courtesy of Bertrand Seraphin

ChIP-chip measurement of protein→DNA interactions
From Figure 1 of Simon et al.
Cell 2001

Genetic interactions: synthetic lethals and suppressors
Genetic Interactions:
- Widespread method used by geneticists to discover pathways in yeast, fly, and worm
- Implications for drug targeting and drug development for human disease
- Thousands are now reported in literature and systematic studies
As with other types, the number of known genetic interactions is exponentially increasing…
Adapted from Tong et al., Science 2001

Most recorded genetic interactions are synthetic lethal relationships
[Diagram: wild type (A, B), single deletions ΔA and ΔB, and the double deletion ΔA ΔB]
Adapted from Hartman, Garvik, and Hartwell, Science 2001

Synthetic-lethal protein interaction
[Diagram: A and B with the deletions ΔA, ΔB, and ΔA ΔB; the double deletion is marked X]

Suppressor protein interaction
[Diagram: A and B with the deletions ΔA, ΔB, and ΔA ΔB; one single deletion is marked X]

Interpretation of genetic interactions (Guarente T.I.G. 1990)
GOAL: Identify downstream physical pathways
- Parallel Effects (Redundant or Additive): A and B act on w in parallel. Single A or B mutations typically abolish their biochemical activities
- Sequential Effects (Additive): A and B act on w in sequence. Single A or B mutations typically reduce their biochemical activities
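
The synthetic-lethal and suppressor relationships above can be written as a small decision rule. The sketch below is a simplification under stated assumptions. It treats each strain as simply viable or inviable, whereas real screens usually measure quantitative fitness, and the function name `classify_interaction` and its labels are hypothetical.

```python
def classify_interaction(viable_a, viable_b, viable_ab):
    """Label a gene pair from the viability of its deletion strains.

    viable_a  -- True if the single deletion of A is viable
    viable_b  -- True if the single deletion of B is viable
    viable_ab -- True if the double deletion of A and B is viable
    """
    # Synthetic lethal: each single deletion is tolerated,
    # but the combination is not.
    if viable_a and viable_b and not viable_ab:
        return "synthetic lethal"
    # Suppressor: at least one single deletion is inviable,
    # yet the double deletion is viable.
    if viable_ab and not (viable_a and viable_b):
        return "suppressor"
    return "no interaction"


# Hypothetical observations, for illustration only.
print(classify_interaction(True, True, False))   # synthetic lethal
print(classify_interaction(False, True, True))   # suppressor
```

A quantitative version would compare the double-mutant fitness with an expectation derived from the two single mutants. The boolean form is kept here because it matches the viable or inviable diagrams on the slides.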