The Transcriptome
Gene Discovery
Quantitation of Gene Expression
Reading: Ch 15.1
BIO520 Bioinformatics
Jim Lund

WHY?
• The genes (proteins) expressed determine the state of the cell.
  – Signaling.
  – Metabolic capabilities.
  – Differentiation state (cell type).
  – Response to changes in environment.
• Verifies gene predictions.
• Transcriptional regulation
  – Normal vs. abnormal
  – Conditional expression

Transcriptome Analysis
• Gene (transcript) discovery
  – transcripts
  – alternative splicing/processing
• Transcript assays
• Promoter analysis
• Transcription Factors
• Cellular control networks

Gene Discovery
• Inference from genomic DNA
  – Prokaryotes & fungi OK
• cDNA characterization
  – EST
  – SAGE

EST (Expressed Sequence Tag)
• Sequence cDNA libraries
  – proportional libraries
  – subtracted or normalized libraries
• Which end?
  – 5’ or 3’ or Whole

Library Type
• “regular” or proportional
• Subtracted
  – Miss alternate transcripts
• normalized
• Tissue
• Primer
  – dT vs random

Ideal cDNAs

“Real” cDNAs

Which end?
• Whole cDNA
  – BEST & HARDEST (Long)
• 3’-end
  – Consistent technically, limited information
• 5’-end
  – Coding “identity” highest
• 5’ AND 3’
  – Good, but technical & informatic challenge

Gene Discovery-Yeast
[Figure: Fraction Known (0 to 1) vs. # of ESTs (log scale!)]

EST Data Analyses
• Clustering Analysis
  – Assemble ESTs into genes.
  – Alternative splicing forms
  – Find coding SNPs.
  – Truncated, unspliced, and junk ESTs can be misleading
  – Project: Unigene
  – Program: stackPACK
• Frequency analysis
  – Digital Differential Display
    • DDD is a computational method for comparing sequence-based gene representation profiles among individual cDNA libraries or pools of libraries.

EST Results (old)
• Known genes (30%)
• Similarities to other ORFs, ESTs (30%)
  – Infer Function?
• Novel Class (30%, w/ time)

Typical Progress/Results
• Humans
  – 6,694,833 ESTs
  – 124,179 clusters (“sets”)
    • 29,000 sets contain EST and mRNA seqs.
  – CGAP EST library “plateau” broken by:
    • different tissues, different states
    • normalized libraries

Data Quality Considerations
• 99% correct data (1% errors!).
• Frameshifts: effects depend on tools
  – BLASTX tool to “find” frameshifts
    • How sensitive?
  – TBLASTX, TBLASTN to “use” in other projects
    • How sensitive?

Gene Expression Assays
• EST (Poor method)
• SAGE
• Microarray Hybridization
• Next Gen Sequencing.
• Transcriptional Fusions
  – GFP, LacZ fusions

Serial Analysis of Gene Expression (SAGE)
1. Collect mRNA
2. Isolate short oligomers from each transcript.
3. Ligate together the oligomers and clone them.
4. Sequence thousands of clones.
5. Map the 1×10^4 to 1×10^5 oligomers to their genes.
6. Find which genes are transcribed and their relative expression levels.
• http://www.sagenet.org (Vogelstein at JHU)

SAGE technique
• Prepare biotin labeled cDNA
• Cleave with anchoring enzyme (NlaIII)

SAGE technique
• Ligate on linkers
• Cleave with tagging enzyme (BsmFI)

SAGE technique
• Ligate, PCR, and gel purify ditags (102 bp).
• Recleave with anchoring enzyme (NlaIII), ligate to form concatemers.
• Size select, clone and sequence concatemers.

Colon cancer vs. normal colon epithelium (SAGE)

Microarray Hybridization
• Determine gene expression by parallel hybridization of labeled cDNA to DNA attached to a fixed support.
• http://cmgm.stanford.edu/pbrown/

Microarray Hybridization
• Producing chips
• Producing probes / reading arrays
• Analyzing and interpreting data

Transcriptional Array
[Figure: array of ORF spots (orf 1, orf 2, orf 3, ...) hybridized with Condition 1 mRNA and Condition 2 mRNA; 200 spots per 3 cm; 40,000 dots/9 cm², or > all human genes]

Microarray Technologies
• Spotted arrays (Brown et al.)
  – Spot arrays on glass slides
  – PCR fragments
  – Long (50-70 bp) oligo arrays
• Synthesis
  – Affymetrix (www.affymetrix.com)
    • High density array of 25 bp oligos
    • Made using light directed oligonucleotide synthesis and photolithography
  – Agilent, CombiMatrix
    • Made using light directed oligonucleotide synthesis and mirrors.

Spotted Arrays

Print Quill

Spotted microarray image

Affymetrix photolithographic technology
• Lithographic masks are used to either block or transmit light onto specific locations of the array.
• The surface is then flooded with a solution containing either adenine, thymine, cytosine, or guanine, and coupling occurs only in those regions on the glass that have been deprotected through illumination.
• The coupled nucleotide also bears a light-sensitive protecting group, so the cycle can be repeated.
• Microarray is built as the probes are synthesized through repeated cycles of deprotection and coupling.
• Typically ends at 25 bp.
• Current arrays have 1.3 million unique features per array.

GeneChip Expression Assay Design

Affymetrix GeneChips: Expression Analysis
• Available for humans and model organisms.
• Made only by Affymetrix.
• Chip designs change slowly.
• GeneChips:
  – Human: 50,000 RefSeq genes and ESTs
  – C. elegans: 22,500 genes (12/00 genome annotation)
  – Rat 230: 30,000 genes, ESTs
  – Yeast: 6100 gene set
  – Tiling arrays for model organisms
• http://affymetrix.com

Quantitation of fluorescence signals (Image to data)
1. Hybridization, scan in chip image.
2. Gridding
   – Determine where the spots are.
3. Spot intensity and local background determination.
4. Normalization
   – Adjust to make the red and green total signal intensities the same.
5. Gene expression ratio.
   – Red channel/green channel.
• Programs:
  – ScanAlyze, http://rana.lbl.gov/EisenSoftware.htm
  – GenePix, http://www.moleculardevices.com/pages/instruments/microarray_main.html

Microarray data
Big tables of numbers!
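
The "image to data" steps above (local background, normalization to equal red and green totals, red/green ratio) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spot intensities are made up, the function names are hypothetical, and clamping background-corrected values at a small floor is one simple choice among several, not what ScanAlyze or GenePix necessarily do.

```python
import math

def background_correct(foreground, background, floor=1.0):
    """Subtract the local background from each spot.

    Values are clamped at `floor` (an assumption) so later ratios and
    logarithms stay defined.
    """
    return [max(f - b, floor) for f, b in zip(foreground, background)]

def total_intensity_normalize(red, green):
    """Rescale the green channel so both channels have the same total signal."""
    scale = sum(red) / sum(green)
    return list(red), [g * scale for g in green]

def log2_ratios(red, green):
    """Per-spot expression ratio, red channel / green channel, on a log2 scale."""
    return [math.log2(r / g) for r, g in zip(red, green)]

# Hypothetical foreground/background readings for four spots.
red_fg = [1200.0, 450.0, 3100.0, 700.0]
red_bg = [200.0, 50.0, 100.0, 300.0]
green_fg = [600.0, 250.0, 850.0, 1050.0]
green_bg = [100.0, 50.0, 100.0, 100.0]

red = background_correct(red_fg, red_bg)        # [1000, 400, 3000, 400]
green = background_correct(green_fg, green_bg)  # [500, 200, 750, 950]
red_n, green_n = total_intensity_normalize(red, green)
ratios = log2_ratios(red_n, green_n)

for spot, ratio in enumerate(ratios, start=1):
    print(f"spot {spot}: log2(red/green) = {ratio:+.2f}")
```

After normalization the first two spots come out unchanged (log2 ratio 0), the third is two-fold higher in red, and the fourth is lower in red. Total-intensity scaling assumes most genes do not change between the two conditions.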
Viewing microarray data
• Scatter plot: log(ch1) vs log(ch2)
• M vs A: expression level vs expression change
• Clustergram
• Volcano plot: log(expr) vs p-value
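
For the M vs A view, a minimal sketch of how the two coordinates are usually computed from a pair of channel intensities. It assumes the common convention M = log2(ch1/ch2) for expression change and A = the mean of the two log2 intensities for expression level; the function name and the intensities are hypothetical.

```python
import math

def ma_values(ch1, ch2):
    """Return one (M, A) pair per spot.

    M = log2(ch1) - log2(ch2)          (expression change)
    A = (log2(ch1) + log2(ch2)) / 2    (overall expression level)
    """
    pairs = []
    for x, y in zip(ch1, ch2):
        lx, ly = math.log2(x), math.log2(y)
        pairs.append((lx - ly, 0.5 * (lx + ly)))
    return pairs

# Hypothetical normalized intensities for three spots.
ch1 = [1024.0, 256.0, 64.0]
ch2 = [256.0, 256.0, 256.0]

points = ma_values(ch1, ch2)
print(points)  # [(2.0, 9.0), (0.0, 8.0), (-2.0, 7.0)]
```

Plotting M on the vertical axis against A on the horizontal axis is the scatter plot of log(ch1) vs log(ch2) rotated by 45 degrees, which makes intensity-dependent bias easier to see.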