
Looking for signals in tens of thousands of GeneChips

Dr Andrew Harrison
Departments of Mathematical Sciences and Biological Sciences, University of Essex
[email protected]

There are >10^5 GeneChip experiments in the public domain, that cost ~$10^9 to produce. Extracting further information from this resource will be very cost effective.

Microarray informatics at Essex University
Departments of Mathematical Sciences and Biological Sciences

Faculty (degrees in parentheses): Dr Andrew Harrison (Physics), Professor Graham Upton (Statistics), Dr Berthold Lausen (Statistics) + Dr Hugh Shanahan, Royal Holloway (Physics)

PhD students: Farhat Memon (Computer Science), Anne Owen (Mathematics), Fajriyah Rohmatul (Statistics)

Current MSc and UG students: Aleksandra Iljina (Statistics and Data Analysis), Lina Hamadeh (Statistics and Data Analysis), Madalina Ghita (Mathematics)

Alumni: Dr Jose Arteaga-Salas (Statistics), Dr Renata Camargo (Computer Science), Dr Caroline Johnston (Molecular Biology and Bioinformatics), Dr William Langdon (Computer Science and Physics), Dr Joanna Rowsell (Mathematics), Dr Olivia Sanchez-Graillet (Computer Science and Bioinformatics), Dr Maria Stalteri (Inorganic Chemistry and Bioinformatics) + 4 former MSc students

Perfect Match (PM), Mismatch (MM)

m = log2(Fold Change), a = log2(Average Intensity)

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene.
Harrison, Johnston and Orengo, 2007, BMC Bioinformatics, 8: 195

There is a huge multiple-testing problem.

What can be learnt from comparing different experiments?

Some genes are represented by multiple probe-sets (Probe-set A, Probe-set B). If they are measuring the same thing the signals should be up and down regulated together. Is that always true? No.
Stalteri and Harrison, 2007, BMC Bioinformatics, 8:13

Probes map to different exons. Alternative splicing may cause some exons to be upregulated and others to be downregulated.

Genes come in pieces. But exons do not.
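The idea that two probe-sets measuring the same thing should be up and down regulated together can be checked with a correlation of log2 intensities across many arrays. The following is only a minimal sketch under stated assumptions: the function names and the intensity values are made up for illustration, and Pearson correlation is assumed here because the slides do not say which correlation measure was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return sxy / math.sqrt(sxx * syy)


def log2_correlation(intensities_a, intensities_b):
    """Correlate two signals after a log2 transform of the raw intensities."""
    log_a = [math.log2(v) for v in intensities_a]
    log_b = [math.log2(v) for v in intensities_b]
    return pearson(log_a, log_b)


# Hypothetical intensities for three probe-sets across four arrays.
probe_set_a = [120.0, 240.0, 480.0, 960.0]
probe_set_b = [50.0, 100.0, 200.0, 400.0]   # rises and falls with A
probe_set_c = [400.0, 200.0, 100.0, 50.0]   # moves in the opposite direction

r_ab = log2_correlation(probe_set_a, probe_set_b)
r_ac = log2_correlation(probe_set_a, probe_set_c)
print(f"A vs B: {r_ab:.2f}, A vs C: {r_ac:.2f}")
```

A pair such as A and C, with a strongly negative correlation, is the kind of disagreement that alternative splicing of different exons could produce.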
Multiple probes mapping to the same exon should measure the same thing.

We are studying the correlations in expression across >6,000 GeneChips (HG-U133A), sampling RNA from many tissues and phenotypes.

The correlations in intensities (log2) between probes in probeset 208772_at on the HG-U133A array. The number in each square is the correlation ×10. Blue = low correlation, yellow = high correlation. Axes: average intensity in GEO; probe order along the gene.

The correlation calculated for PM probes 9 and 11, the data in the earlier scatter plot, is reported as 8 (0.76 multiplied by 10 and rounded).

This probeset shows no coherent correlations amongst its probes.

Some probesets clearly have outliers. Probes 1-11 all map to the same exon.

This is a different probeset mapping to the same exon: there seems to be one outlier.

The outliers are correlated with each other!

There is little sequence similarity between the probes, and they are from probe-sets picking up different biology, yet they are correlated!

TCCTGGACTGAGAAAGGGGGTTCCT
GAGACACACTGTACGTGGGGACCAC
GGTAGACTGGGGGTCATTTGCTTCC

Virtually all of the probes in the group have runs of guanines within their 25 bases.

Comparing probes with runs of Gs.

Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75

We are only looking at a small fraction of the entire probe, yet it is dominating the effects across all experiments.

G-quadruplexes

Probes all have the same sequence in a cell, so a run of guanines will result in closely packed DNA with just the right properties to form G-quadruplexes.
Upton et al. 2008, BMC Genomics, 9, 613

How do we deal with known outliers such as G-quadruplexes? What is the best way to calculate expression in the presence of outliers?

G-stacks bias which genes are reported to be clustered together within published experiments.
Kerkhoven et al. 2008, PLoS ONE 3(4): e1980

Probes containing GCCTCCC will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridization.

Log(magnitude) of averaged probe values, colour coded by size. Note the perimeter of bright-dark pairs. Cell (0,0) contains a probe which does not measure any biology.

Corner correlations (correlations with values in cell (0,0)). Numbers are correlations times 10 (red: greater than 0.8). Negative correlations appear as blanks. Filled circles indicate probes not listed in the CDF file. Large circles indicate correlations greater than 0.8.

Correlations with cell (0,0). Being in the opposite corner has not reduced the correlations of the interior row and column.

What is in the sheep pens? Entries are correlation with cell (0,0). Entries are log(mean(Intensity)).

Sheep!

Many thousands of probes are correlated with each other simply because they are adjacent to bright probes.

We believe that the focus of the scanner may be responsible: regions adjacent to bright spots will gain the same fraction of light. A comparison of many images at different levels of blurriness will appear to indicate that dark regions adjacent to bright regions are correlated in their intensities.

A CEL file contains information about the ID of the scanner as well as the date on which the image was scanned. How does the impact of blur change over time for each scanner?
Upton and Harrison, 2010, Stat Appl Genet Mol Biol, 9(1), Article 37

How best to transform a DAT image into a CEL file? We are testing whether ideas from astronomy are applicable.

We are checking whether the temporal patterns in scanner performance for human and other organisms are related.

Bioinformatix, Genomix, Mathematix, Physix, Statistix, Transcriptomix are needed in order to extract reliable information from Affymetrix GeneChips.

Thank you for your attention.
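The two sequence-based sources of outliers named in the talk, runs of contiguous guanines and the GCCTCCC motif, can be screened for directly from the 25-base probe sequences. The sketch below is illustrative only: the function names and the default threshold of four contiguous Gs are assumptions and are not taken from the slides, which report mean correlations for run lengths of 3 to 7 without stating a cut-off.

```python
import re

# Motif named in the slides as hybridizing to the primer spacer sequence.
PRIMER_SPACER_MOTIF = "GCCTCCC"


def longest_g_run(sequence):
    """Length of the longest run of contiguous Gs in a probe sequence."""
    runs = re.findall(r"G+", sequence.upper())
    return max((len(run) for run in runs), default=0)


def flag_probe(sequence, min_g_run=4):
    """Return the reasons (possibly none) to treat a probe as a known outlier.

    min_g_run is a hypothetical threshold chosen for this example.
    """
    flags = []
    if longest_g_run(sequence) >= min_g_run:
        flags.append("g_run")
    if PRIMER_SPACER_MOTIF in sequence.upper():
        flags.append("primer_spacer_motif")
    return flags


# The three probe sequences shown on the slides.
slide_probes = [
    "TCCTGGACTGAGAAAGGGGGTTCCT",
    "GAGACACACTGTACGTGGGGACCAC",
    "GGTAGACTGGGGGTCATTTGCTTCC",
]

g_runs = [longest_g_run(probe) for probe in slide_probes]
for probe, run in zip(slide_probes, g_runs):
    print(probe, run, flag_probe(probe))
```

Flagged probes could then be excluded or down-weighted before probe-level signals are merged into a gene-level expression value. How best to do that is the open question the talk raises.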
