DNA, Gene, and Genome

Translating Machinery for Genetic Information
Transcription factors
mRNA levels

Automated DNA Sequencing

Data Increase (from NCBI web site)

Partial Display of Human Draft Sequence (Nature, 2001)

Human Genome Map at NCBI

60-70 kDa Protein interacting with prostate cancer suppressor
MGALRPTLLPPSLPLLLLLMLGMGCWAREVLVPEGPLYRVAGTAVSISCNVTGY
EGPAQQNFEWFLYRPEAPDTALGIVSTKDTQFSYAVFKSRVVAGEVQVQRLQGD
AVVLKIARLQAQDQGIYECTPSTDTRYLGSYSGKVELRVLPDVLQVSAAPPGPR
GRQAPTSPPRMTVHEGQELALGCLARTSTQKHTHLAVSFGRSVPEAPVGRSTLQ
EVVGIRSDLAVEAGAPYAERLAAGELRLGKEGTDRYRMVVGGAQAGDAGTYH
CTAAEWIQDPDGSWAQIAEKRAVLAHVDVQTLSSQLAVTVGPGERRIGPGEPLE
LLCNVSGALPPAGRHAAYSVGWEMAPAGAPGPGRLVAQLDTEGVGSLGPGYE
GRHIAMEKVASRTYRLRLEAARPGDAGTYRCLAKAYVRGSGTRLREAASARSR
PLPVHVREEGVVLEAVAWLAGGTVYRGETASLLCNISVRGGPPGLRLAASWWV
ERPEDGELSSVPAQLVGGVGQDGVAELGVRPGGGPVSVELVGPRSHRLRLHSL
GPEDEGVYHCAPSAWVQHADYSWYQAGSARSGPVTVYPYMHALDTLFVPLL
VGTGVALVTGATVLGTITCCFMKRLRKR

Molecular biology databases
• Sequence databases
  – Annotated
  – Low-annotation
  – Specialized
• Structural databases
• Motif databases
• Genome databases
• Proteome databases
• RNA expression
• Literature
• Populations
• Mutations
• Polymorphisms
• Organisms
• Pathways

Diagram labels: Mutations/polymorphisms, Promoters, ESTs, Tissues and cells, DNA motifs, RNA expression, DNA sequences, Molecular Phylogeny, Substrates, Transcription Factors, Metabolic pathways, Genome maps, Protein sequences, Protein structures, Gene Family, Protein motifs

Database formats
• Relational databases
  – GDB, GSDB, MGD etc.
  – Vendors: Sybase, Oracle etc.
• Flat file databases
  – GenBank, SWISS-PROT etc.
• Object-oriented databases
  – ACeDB, AtDB etc.

Molecular biology data types
Organisms, Genome maps
Mouse chromosome X from the Mouse Genome Informatics project
http://www.informatics.jax.org/

Molecular biology data types
Organisms, Genome maps, DNA sequences, RNA sequences
...AATGGTACCGATGACCTGGAGCTTGGTTCGA...
Molecular biology data types
Organisms, Genome maps, DNA sequences, RNA sequences, Protein sequences
...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...

Molecular biology data types
Organisms, Genome maps, DNA sequences, RNA sequences, Protein sequences, Protein structures, RNA structures
PDB entry 1CIS (P.Osmark, P.Sorensen, F.M.Poulsen)

Molecular biology data types
Organisms, Genome maps, DNA motifs, RNA expression, DNA sequences, RNA sequences, Protein sequences, Protein structures, Protein motifs, RNA structures

DNA microarrays measure variations in RNA levels
The full Yeast genome on a chip
• Red dots: genes whose RNA level increased
• Green dots: genes whose RNA level decreased
De Risi et al, Science 278:680
http://cmgm.Stanford.EDU/pbrown/

Substrates for High Throughput Arrays
• Nylon Membrane: single label, P33
• GeneChip: single label, biotin streptavidin
• Glass Slides: dual label, Cy3, Cy5

GeneChip® Probe Arrays
Figure labels: GeneChip Probe Array (1.28cm); Hybridized Probe Cell (24µm); Single stranded, labeled RNA target; Oligonucleotide probe; Image of Hybridized Probe Array
• Millions of copies of a specific oligonucleotide probe
• >200,000 different complementary probes

GeneChip® Expression Array Design
Figure labels: Gene Sequence (5´ to 3´); Multiple oligo probes
• Probes designed to be Perfect Match
• Probes designed to be Mismatch

Procedures for Target Preparation
Cells → Poly (A)+ / Total RNA → cDNA → IVT (Biotin-UTP, Biotin-CTP) → Labeled transcript → Fragment (heat, Mg2+) → Labeled fragments → Hybridize (16 hours) → Wash & Stain → Scan

Microarray Technology
Printing Arrays on 50 slides
NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab

Ratio of expression of genes from two sources
Figure labels: Cells from condition A; Cells from condition B; Total or mRNA; Label Dye 1; Label Dye 2; cDNA; Mix
Spot legend: equal, over, under
NSF / U of Illinois Microarray Workshop - Steve Clough / Vodkin Lab

GSI Lumonics
NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab

Cattle and Soy Controls
Spot labels: Beta Actin, PKG, HPRT, Beta 2 microglobulin, Rubisco, AB binding protein, Major latex protein homologue (MSG)
Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng) and MSG (0.05 ng) were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).

Fetal Spleen-Cy3, Adult Spleen-Cy5
Spot labels: IgM heavy chain, MYLK, COL1A2

GenePix Image Analysis Software

Placenta vs. Brain – 3800 Cattle Placenta Array
cy3, cy5

GeneFilter Comparison Report
GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 1 Name: N2#1finaladj
Intensities are reported raw and normalized.

ORF NAME  GENE NAME  CHRM  F  G  R     RAW GF1  RAW GF2   NORM GF1   NORM GF2  DIFFERENCE  RATIO
YAL001C   TFC3          1  1  A  1  2    12.03     7.38     403.83     209.79      194.04   1.92
YBL080C   PET112        2  1  A  1  3    53.21    35.62   1,786.11   1,013.13      772.98   1.76
YBR154C   RPB5          2  1  A  1  4    79.26    78.51   2,660.73   2,232.86      427.87   1.19
YCL044C                 3  1  A  1  5    53.22    44.66   1,786.53   1,270.12      516.41   1.41
YDL020C   SON1          4  1  A  1  6    23.80    20.34     799.06     578.42      220.64   1.38
YDL211C                 4  1  A  1  7    17.31    35.34     581.00   1,005.18     -424.18  -1.73
YDR155C   CPH1          4  1  A  1  8   349.78   401.84  11,741.98  11,428.10      313.88   1.03
YDR346C                 4  1  A  1  9    64.97    65.88   2,180.87   1,873.67      307.21   1.16
YAL010C   MDM10         1  1  A  2  2    13.73     9.61     461.03     273.36      187.67   1.69
YBL088C   TEL1          2  1  A  2  3     8.50     7.74     285.38     220.01       65.37   1.30
YBR162C                 2  1  A  2  4   226.84   293.83   7,614.82   8,356.39     -741.57  -1.10
YCL052C   PBN1          3  1  A  2  5    41.28    34.79   1,385.79     989.41      396.38   1.40
YDL028C   MPS1          4  1  A  2  6     7.95     6.24     266.99     177.34       89.65   1.51
YDL219W                 4  1  A  2  7    16.08    11.33     539.93     322.20      217.74   1.68
YDR163W                 4  1  A  2  8    19.13    14.19     642.17     403.56      238.61   1.59
YDR354W   TRP4          4  1  A  2  9    62.24    40.74   2,089.48   1,158.64      930.84   1.80
YAL018C                 1  1  A  3  2    10.72     8.81     359.75     250.60      109.15   1.44
YBL096C                 2  1  A  3  3    10.91     8.98     366.40     255.40      111.00   1.43
YBR169C   SSE2          2  1  A  3  4    17.33    27.81     581.80     790.84     -209.05  -1.36
YCL060C                 3  1  A  3  5    17.99    24.75     603.96     703.75      -99.79  -1.17
YDL036C                 4  1  A  3  6    14.22     8.86     477.39     251.94      225.44   1.89
YDL227C   HO            4  1  A  3  7    25.61    31.52     859.71     896.46      -36.75  -1.04
YDR171W   HSP42         4  1  A  3  8   102.08    98.37   3,426.83   2,797.58      629.25   1.22
YDR362C                 4  1  A  3  9    16.32    12.95     547.96     368.39      179.57   1.49
YAL026C   DRS2          1  1  A  4  2    11.32     7.97     379.85     226.53      153.33   1.68
YBL102W   SFT2          2  1  A  4  3    55.88    63.74   1,875.82   1,812.81       63.02   1.03
YBR177C                 2  1  A  4  4    63.31    29.03   2,125.20     825.60    1,299.60   2.57
YCL068C                 3  1  A  4  5     8.33     4.47     279.51     127.16      152.35   2.20
YDL044C   MTF2          4  1  A  4  6    11.73     6.96     393.88     198.07      195.81   1.99
YDL235C   YPD1          4  1  A  4  7    38.71    30.20   1,299.33     858.83      440.50   1.51
YDR179C                 4  1  A  4  8    12.77    11.05     428.60     314.12      114.48   1.36
YDR370C                 4  1  A  4  9    16.70    15.30     560.62     435.13      125.49   1.29
YAL034C   FUN19         1  1  A  5  2    20.89    24.21     701.32     688.59       12.73   1.02
YBL111C                 2  1  A  5  3    22.38    13.67     751.39     388.69      362.70   1.93

Microarray Data Process
1. Experimental Design
2. Image Analysis – raw data
3. Normalization – “clean” data
4. Data Filtering – informative data
5. Model building
6. Data Mining (clustering, pattern recognition, etc.)
7. Validation

Scatterplot of Normalized Data
Axes: Fetal, Adult; marked ranges: <-0.3 and >0.3

Complexity Levels of Microarray Experiments:
1. Compare genes in a control situation versus a treatment situation
  • Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
  • Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
  • Example: Find related genes that are dependent
  • Methods: Clustering (hierarchical, k-means, self-organizing maps, neural network, support vector machines)
3. Infer the underlying gene and protein networks that are responsible for the patterns and functional pathways observed
  • Example: What is the gene regulation at system level?
  • Directions: mining regulatory regions, modeling regulatory networks on a global scale

Comparing data from two experiments.
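One simple way to compare two experiments gene by gene is the DIFFERENCE and RATIO pair shown in the GeneFilter comparison report above. The sketch below is a minimal illustration, not the GeneFilter software's own code. It assumes a signed-ratio convention inferred from the report rows (larger value divided by smaller, negated when GF2 exceeds GF1); the function name `compare_intensities` is hypothetical, and intensities are assumed to be positive.

```python
def compare_intensities(gf1, gf2):
    """Return (difference, signed ratio) for two normalized intensities."""
    difference = gf1 - gf2
    # Assumed convention, inferred from the report rows: divide the larger
    # value by the smaller and negate the result when GF2 exceeds GF1.
    if gf1 >= gf2:
        ratio = gf1 / gf2
    else:
        ratio = -(gf2 / gf1)
    return round(difference, 2), round(ratio, 2)


# Normalized intensities copied from two rows of the report.
rows = {
    "YAL001C": (403.83, 209.79),
    "YDL211C": (581.00, 1005.18),
}
results = {orf: compare_intensities(*pair) for orf, pair in rows.items()}
for orf, (difference, ratio) in results.items():
    print(orf, difference, ratio)
```

Under this assumed convention the two rows reproduce the report's values (194.04 and 1.92 for YAL001C, -424.18 and -1.73 for YDL211C), which is why the convention looks plausible; it remains an inference from the data rather than a documented definition.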
Clustering to extract genes which tightly co-express.

Statistical filters used: The genes present (Presence Call in Affymetrix) in drug treated, ANOVA p<0.02 between groups. Red indicates increased expression, and green is decreased expression (Log(fold change)).
Genesight 3 (Biodiscovery Software, www.biodiscovery.com)
Columns: NO DRUG, 1nM Drug, 1 mM Drug

Statistical filters used: The genes present (Presence Call in Affymetrix) in absence of drug, ANOVA p<0.02 between groups.
Columns: NO DRUG, 1nM Drug, 1 mM Drug

Self Organizing Maps

Molecular Classification of Cancer

Gene Expression Profile of Aging and Its Retardation by Caloric Restriction
Cheol-Koo Lee, Roger G. Klopp, Richard Weindruch, Tomas A. Prolla

Data Mining Methods
• Classification, Regression (Predictive Modeling)
• Clustering (Segmentation)
• Association Discovery (Summarization)
• Change and deviation detection
• Dependency Modeling
• Information Visualization
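Of the methods listed, clustering is the one these slides apply to expression data. A minimal k-means sketch follows, using only the Python standard library. The gene profiles are made-up log fold changes across three conditions, the function name `kmeans` is hypothetical, and seeding the centroids with the first k profiles is a simplifying assumption made to keep the example deterministic. It is not the method used by any of the tools named above.

```python
import math


def kmeans(profiles, k, max_iterations=50):
    """Group expression profiles into k clusters; returns one label per profile."""
    # Assumption: seed the centroids with the first k profiles so the sketch
    # is deterministic. Real analyses usually use random or smarter seeding.
    centroids = [list(p) for p in profiles[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each profile joins its nearest centroid (Euclidean).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up log fold changes across three conditions (no drug, low dose, high dose).
profiles = [
    [0.0, 1.0, 2.0],     # hypothetical gene A: rises with dose
    [0.0, -1.0, -2.0],   # hypothetical gene B: falls with dose
    [0.1, 1.1, 1.9],     # hypothetical gene C: rises with dose
    [-0.1, -0.9, -2.1],  # hypothetical gene D: falls with dose
]
labels = kmeans(profiles, k=2)
print(labels)
```

In this toy data the two rising genes end up with one label and the two falling genes with the other. Euclidean distance is used for brevity; correlation-based distances are a common alternative when the shape of a profile matters more than its magnitude.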