* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Protein World
Human genome wikipedia , lookup
Oncogenomics wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Metagenomics wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Ridge (biology) wikipedia , lookup
Gene therapy wikipedia , lookup
Gene desert wikipedia , lookup
Gene nomenclature wikipedia , lookup
Public health genomics wikipedia , lookup
Genomic imprinting wikipedia , lookup
Genome (book) wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
History of genetic engineering wikipedia , lookup
Microevolution wikipedia , lookup
Genome evolution wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Designer baby wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Gene expression profiling wikipedia , lookup
Gene expression programming wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Affymetrix Expression Data Comics Group 12-05-2003 Nijmegen Tim Hulsen General Information (1) • Affymetrix oligo microarrays: HG-U133 A and B (human) and MG-U74v2 A, B and C (mouse) • Updated every two months; releases used here: november 2002 and january 2003 • UniGene-based • Probes: 25mer oligos complementary to the sequences of interest • Probe pairs: perfect match (PM) probe and mismatch (MM) probe, MM is different from PM in the 13th position General Information (2) • Human chips: 3269 samples, 44792 fragments, 115 tissue categories (114 for nov. 2002 release), 15 SNOMED tissue categories • Mouse chips: 859 samples, 36701 fragments, 25 tissue categories, 12 SNOMED tissue categories • Results from all samples within a tissue category are combined by generating electronic Northerns • For each tissue fragment and each tissue category is determined: – Median expression value – Present call (percentage) Median expression value • Expression value: intensity • All expression values that have a ‘present call’ are used to determine the median expression value • Varies from 0 to ~65,000 in human and from 0 to ~97,000 in mouse Present Call (Percentage) • Normalization/scaling procedures (MAS 5.0) are used to determine an expression intensity value with an associated confidence level to each fragment • When confidence level p for expression is smaller than 0.05, the expression intensity for this specific fragment in this particular sample is called present (P) • Call values are used to calculate a present call percentage (P calls / total calls) Snomed category definitions (1) • SNOMED: Systematised Nomenclature of Medicine • Combines specific categories into more global categories, i.e. organ systems • In human far more useful than in mouse (115>15,25->12) • Categories like: cardiovascular system, digestive organs, endocrine gland, female genital system, male genital system, musculoskeletal system, nervous system, respiratory system, etc. Snomed category definitions (2) • Example: cat. 7: ‘hematopoietic system’: Human Mouse 36. Bone marrow 10. Blood 37. Dendritic reticulum cell 11. Bone marrow 38. Lymph node 12. Mesenteric lymph node 39. Lymphocyte 13. Spleen 40. Monocyte 14. Thymus 41. Segmented neutrophil 42. Spleen 43. Thymus 44. Tonsil 45. White blood cell Annotation provided For each fragment, if available: • title • unigeneAcc • geneSymbol • unigeneId • geneAlias • interproId • exemplarAcc • pfamId • omimId • swissprotId • snpId • goId • refseqId • goFunction • refseqprotId • goProcess • ncbiNuclId • goComponent • ncbiProtId • comment Goals & Problems • Goal: use data set to see if co-expression between orthologous/paralogous gene pairs is higher than between ‘unrelated’ gene pairs, in human & mouse • Problem 1: limited annotation • Problem 2: empty expression profiles • Problem 3: size of data set Limited annotation (1) For example for three of the most used protein ids: ncbiProtId (in red), refseqProtId (in green), swissprotId (in blue) Human: Mouse: Limited annotation (2) Solutions: • Smith-Waterman of (SWX) of all Affymetrix sequences to the human & mouse IPI sets, for which orthologs and paralogs were already defined -> IPI id added to database • Smith-Waterman (SWN) of all Affymetrix sequences to each other for better orthology/paralogy prediction Empty expression profiles • Lots of genes have no expression at all in any tissue category • Useless for correlation calculation; two genes with no expression will have a top correlation! • For human: 4114 out of 44792 fragments completely no expression in all tissue categories -> 40678 left • For mouse: 6791 out of 36701 fragments completely no expression in all tissue categories -> 29910 left Size of data set • Correlation between gene pairs is calculated: the number of pairs is (x2-x)/2 for x genes -> millions of data points • Number of gene pairs is already brought down by the ‘no expression gene removal’: in human from 1,003,139,236 to 827,329,503, in mouse from 673,463,350 to 447,289,095 • For some quick analyses, sets of e.g. 1000 randomly selected genes were used -> 499,500 gene pairs Uncentered Correlation • ‘Uncentered’: from 0 to 1 • UC(X,Y)= Σ( X / ( sqrt ( Σ( X2 / N ) ) ) * ( Y / ( sqrt ( Σ( Y2 / N ) ) ) ) / N • Calculated correlations between gene pairs were used to see if the co-expression for orthologous pairs and/or paralogous pairs is higher than for ‘unrelated’ pairs • This was measured by using the KEGG Pathway map (release 25) • The best, however not completely convincing, result was found using PCP and not ME: Correlation KEGG Pathway Check • Data points above a correlation threshold of 0,9 and 1,0 were left out because of very low numbers (irreliability) • Only orthologous conserved gene pairs have a higher accuracy when increasing the correlation threshold • May be a combination of PCP and ME should be used • Another measure could be used: same GO category instead of KEGG, GO is already annotated by Affymetrix • Lots of genes have only an expression value in one tissue; this correlation method is not really suitable -> mutual information analysis Mutual Information • For each tissue category: 0 or 1 (ME/PCP value below/above a specified threshold) • x0 = % of 0s, x1 = % of 1s • x00 = % of 0-0 pairs, x01 = % of 0-1 pairs, x10 = % of 1-0 pairs, x11 = % of 1-1 pairs • Entropy per gene: -(x0*ln(x0)+x1*ln(x1)) • Entropy per gene pair: (x00*ln(x00)+x01*ln(x01)+x10*ln(x10)+x11*ln(x11)) • MI = Entropy(1) + Entropy(2) – Entropy(1,2) • 0<=MI<=0,693147 MI GO Category Check • Mutual information check using GO Biological Process 3rd level of specification • Horizontal axis shows log(MI) • Different lines: different thresholds for defining as a ‘0’ or a ‘1’ • Accuracy indeed seems to be higher for pairs with much mutual information, but there is also a peak at -9<=log(MI)<-8 • Orthologous/paralogous pairs not checked yet Future plans • Complete mutual information analysis, using both KEGG Pathway and GO databases; look at orthologous and paralogous gene pairs too • Check alternative splicing • Speed – license ends at the end of June