Data Analysis

Nick Holmes
Pathology Flow Cytometry Facility
http://www.bio.cam.ac.uk/~nh106/flowsite/flowindex.html
Or via Quick links on http://www.path.cam.ac.uk/

Flow cytometry data is technically a discrete (discontinuous) variable, because the values recorded in your FCS data files are the output of analogue-to-digital converters (ADCs).

The number of bins (or channels, as flow jocks call them) varies greatly between today's cytometers, e.g.

Cytometer   Channels
Facscan     1024
Cyan        65536
Canto       262144
CytekDxP    262144
Accuri      16777216

Different analysis programmes will display these data on a variety of scales. Often the data are rebinned, i.e. channels are recombined into a smaller number of values. This helps make distributions look smooth. However, more complex data processing algorithms can also be used, e.g. in FlowJo; these will make your data look different (sometimes nicer).

Sources of true background noise
• Thermionic emission
• Stray light
• Electrical circuits

Sources of variation in signal
• Mains (transformer) fluctuation
• Laser power fluctuation
• Cell (particle) position within the beam (fluidic fluctuation)
• Sample preparation and staining
• Biological variance

It is becoming increasingly common for operators to QC cytometers by regularly running beads; some machines have automated software routines for doing this. If I do it manually, what kind of variability am I prepared to tolerate before I decide the machine needs attention?

Typically I would have a batch of control beads and standard settings which would have placed my beads in linear channel 128 on a 256-channel display (note that these are really channels 32512-32767 if I am using our Cyan). When we repeat this, e.g. every week, as well as reasonable precision (CV < 3-8% depending on fluorescence parameter), I am also expecting fluctuations in mean and median from week to week. I guess I am not going to worry if the median lies in the range 115-140 or so.

Which measure of central tendency?
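As a rough illustration of how the common summary statistics can disagree on a skewed, non-Gaussian population, here is a minimal Python sketch. The fluorescence values are invented for this example (they are not the data behind the slide), and only the standard library `statistics` module is used.

```python
from statistics import geometric_mean, mean, median, mode

# Hypothetical fluorescence values: a dim cluster, a middle group and a
# bright tail. Invented for illustration only.
values = [
    100, 150, 200, 200, 200, 250, 300, 400,   # dim cluster
    500, 550, 600, 700, 900,                  # middle group
    1500, 3000, 6000, 12000,                  # bright tail
]

summary = {
    "median": median(values),
    "mode": mode(values),
    "geometric mean": geometric_mean(values),
    "mean": mean(values),
}

for name, value in summary.items():
    print(f"{name:>15}: {value:8.1f}")
```

With a sample like this the mode sits in the dim cluster, the mean is dragged upwards by the few bright events, and the median and geometric mean land between the two.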
Often we will want to use a single statistic to compare populations, but which one gives the best description of non-Gaussian populations?

median           553
mode             200
geometric mean   621
mean             1755

I think you can see that both mode and mean do a pretty poor job of describing the complex population. Geometric mean and median are quite similar. I prefer the median as a general rule.

When are cells positive?

Here our control (red) population has a lower peak (mode) than our test sample. But does this mean all or most of the cells are positive for our test antibody (antigen by implication)? I would favour a conservative interpretation, namely that the main peak in the test sample is negative; its slightly higher fluorescence is probably due to differences between the control and test antibody preparations (coupling profile, aggregation profile etc.).

Even then, bear in mind that wherever you set your lower boundary for positivity, you count some negatives as positives or some positives as negatives, and usually both. So this boundary is a sensitive and subjective measure. Be careful when making conclusions that the exact position of the boundary doesn't alter your result!

Basic KS comparisons are too sensitive to distinguish real from instrument variation

[Figure: two histogram overlay panels, captioned in order: Dmax = 1, p < 0.001*; Dmax = 0.1857, p < 0.001*]

The right-hand panel shows 2 samples from the same tube, so they can't be 'truly' different except by sampling.

* Kolmogorov-Smirnov comparison in FlowJo

If we use KS to compare histograms, any d > 0.042 would be significant at p < 10^-6 for n = 4096*, and you might think that this level of cut-off could avoid mere spurious statistical noise making things appear different when they aren't. However, even this level isn't high enough, or anywhere near, actually.

Unfortunately FlowJo doesn't return D values for KS. Nor is it easy to calculate (with precision) Dc for p < 10^-7 or less.

In practice, you have to use common sense judgement.
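The cut-off quoted above (d > 0.042 at p < 10^-6 for n = 4096) can be reproduced with the usual large-sample approximation for the Kolmogorov-Smirnov statistic. The sketch below assumes p ≈ 2·exp(−2·n·D²) and, like the slide, uses n directly as the number of data points; the function name `ks_critical_d` is made up for this example, and this is not the calculation FlowJo performs.

```python
from math import log, sqrt

def ks_critical_d(p, n):
    """Approximate critical D for significance level p and n data points.

    Assumes the asymptotic relation p ~ 2 * exp(-2 * n * D**2), which
    rearranges to D * sqrt(n) = sqrt(-0.5 * ln(p / 2)).
    """
    return sqrt(-0.5 * log(p / 2)) / sqrt(n)

n = 4096
d_crit = ks_critical_d(1e-6, n)
d_root_n = d_crit * sqrt(n)

print(f"D * sqrt(n) at p = 1e-6: {d_root_n:.2f}")
print(f"critical D for n = {n}: {d_crit:.3f}")
```

Because the critical D shrinks as 1/sqrt(n), larger event counts make ever smaller differences between histograms "significant", which is the practical problem the slide describes.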
If things don't look different enough to be believably BIOLOGICALLY different, then don't let stats trick you. Conversely, anything which looks biologically meaningful WILL be statistically different if you just compare the raw data of histograms. BUT is this what we need to establish?

* In fact a general approximation can be made that p ≈ 10^-6 for any d√n ≈ 2.69

The Mann-Whitney U test: a simple test for reproducibility

• The Mann-Whitney U test is generally applicable wherever you want to compare univariate distributions for a test sample and a control sample
• If the ranges do not overlap, you only need 3 samples of each to reach the 5% level (one-tailed)

Overlapping ranges require more samples, but for example

Median FI
Controls   Test
5.61       5.99
5.83       6.58
5.87       7.02
6.01       7.31
7.15       7.39

gives U = 4; p < 0.05 (one-tailed).

Where inter-experiment variability is too high, Wilcoxon's signed rank test will still deliver significance at 5% for 5 pairs of samples, provided that the control value is always lower than the test value within each experimental pair.

Example: a monoclonal antibody against a novel antigen is used to stain the cell line BRAVO. The fluorescence obtained by indirect staining with the Mab + anti-mouse Ig is compared to that obtained with the same secondary + an isotype control. The experiment is repeated on 5 separate occasions. The following median FI values are obtained:

Control   3   6   14   11   8
Test      5   7   16   13   10

By Mann-Whitney: U = 10, P > 0.1
By Wilcoxon: W = 0, P < 0.05
1% significance requires seven such samples.

NB: These two sets are very close. I would still exercise caution in interpreting the data. Clearly they give a small reproducible difference. This could mean that BRAVO expresses low levels of the novel antigen. Alternatively, there may be unknown differences in the non-specific binding of the control and novel antibody, e.g. a higher level of dimers, trimers etc.; almost all antibody preps contain some higher-order species.
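The two worked examples above can be checked with a short exact calculation. This is a minimal sketch in plain Python, not the procedure of any particular statistics package: the function names are invented for illustration, and the p-values are one-tailed exact values, which is an assumption made here because it reproduces the significance levels quoted on the slides.

```python
from itertools import combinations, product

def mann_whitney_u(control, test):
    """U = number of (control, test) pairs where test < control (ties count 0.5)."""
    return sum((t < c) + 0.5 * (t == c) for c in control for t in test)

def mann_whitney_p_one_tailed(control, test):
    """Exact one-tailed p: fraction of all relabelings with U <= observed U."""
    pooled = list(control) + list(test)
    observed = mann_whitney_u(control, test)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(control)):
        chosen = set(idx)
        c = [pooled[i] for i in chosen]
        t = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        hits += mann_whitney_u(c, t) <= observed
        total += 1
    return hits / total

def abs_diff_ranks(control, test):
    """Mid-ranks of |test - control| for each non-zero pair, plus the differences."""
    diffs = [t - c for c, t in zip(control, test) if t != c]
    ordered = sorted(abs(d) for d in diffs)
    ranks = [ordered.index(abs(d)) + 1 + (ordered.count(abs(d)) - 1) / 2
             for d in diffs]
    return ranks, diffs

def wilcoxon_w(control, test):
    """W = sum of ranks for the pairs in which test is LOWER than control."""
    ranks, diffs = abs_diff_ranks(control, test)
    return sum(r for r, d in zip(ranks, diffs) if d < 0)

def wilcoxon_p_one_tailed(control, test):
    """Exact one-tailed p: fraction of sign patterns with W <= observed W."""
    ranks, _ = abs_diff_ranks(control, test)
    observed = wilcoxon_w(control, test)
    hits = sum(
        sum(r for r, s in zip(ranks, signs) if s) <= observed
        for signs in product((False, True), repeat=len(ranks))
    )
    return hits / 2 ** len(ranks)

# First example: five controls and five tests with overlapping ranges.
controls_1 = [5.61, 5.83, 5.87, 6.01, 7.15]
tests_1 = [5.99, 6.58, 7.02, 7.31, 7.39]
# Second example: BRAVO staining, paired by experiment.
controls_2 = [3, 6, 14, 11, 8]
tests_2 = [5, 7, 16, 13, 10]

print("Example 1:", mann_whitney_u(controls_1, tests_1),
      mann_whitney_p_one_tailed(controls_1, tests_1))
print("Example 2 (unpaired):", mann_whitney_u(controls_2, tests_2),
      mann_whitney_p_one_tailed(controls_2, tests_2))
print("Example 2 (paired):", wilcoxon_w(controls_2, tests_2),
      wilcoxon_p_one_tailed(controls_2, tests_2))
```

The paired test succeeds where the unpaired one fails because it only asks whether the test value beats the control within each experiment, so the large run-to-run drift in the BRAVO data drops out.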
A comparison of multiple treatments

If we want to compare multiple pairs of samples within an experiment, then we need a different method. Friedman's test provides a method to assess the possibility that a dataset is different by either 'block' (by which we would mean experiment/run) or by 'treatment' (by which we could mean different antibodies, different concentrations of antibodies or drug, etc.).

However, in order to compare individual pairs of treatments we need to apply Dunn's post test to the data. Friedman's test only tests whether the null hypothesis, that all samples might plausibly be drawn from the same population, is demonstrably false at some defined level of certainty.

An example of Friedman's test

We have a T lymphocyte cell line which expresses GFP under the control of a minimal promoter with 3 NFAT binding sites upstream of the transcription start site. Thus we can measure the degree of NFAT activity after anti-CD3 activation.

We have 4 different drugs which we believe may inhibit the dephosphorylation of NFAT (required for nuclear entry, hence transcriptional activity). We treat cells with anti-CD3 and, independently, each of the 4 drugs, with vehicle as a control, and we measure the fluorescence of cells excited by 488 nm light and measured between 515 and 545 nm.

We did this experiment 3 times using, so far as possible, the same cells, drug doses, cytometer settings etc.
Median GFP fluorescence

Expt                   Vehicle   A      B      C      D
1                      2657      2612   2271   1907   1439
2                      2347      2333   2201   1899   1333
3                      2784      2636   2311   2089   1566
mean                   2596      2527   2261   1965   1466
mean as % of control             97     87     76     56

Convert these to ranks within each experiment

Expt    Vehicle   A     B    C    D    Total
1       5         4     3    2    1
2       5         4     3    2    1
3       5         4     3    2    1
ΣRi     15        12    9    6    3    45
ΣRi²    225       144   81   36   9    495

Friedman's test statistic S is given by

$$S = R_1^2 + R_2^2 + R_3^2 + R_4^2 + R_5^2 - \frac{R^2}{n} = \sum_{i=1}^{n} R_i^2 - \frac{R^2}{n}$$

where n = the number of treatments, R_i is the sum of ranks of treatment i and R is the sum of all R_i.

For our example S = 495 - 405 (since R²/n = 45²/5 = 405), thus S = 90.

We can use tables of significance levels to find that, for 3 replicates of 5 treatments, S = 86 corresponds to a probability p = 0.009 that all values are drawn from the same underlying population; our S = 90 exceeds this, so p < 0.009.

This only tells us that the drugs made a difference!

Dunn's post test for pairwise comparison

If we want to ask whether particular drugs were effective, and whether some were better than others, we need to do pairwise comparisons of treatments. We could choose two levels of query.

1. For each drug, does it inhibit the activation of GFP expression?
2. For all drugs, is A>B>C etc.?

Whenever you perform multiple comparisons within a dataset, you need to correct for the fact that the more comparisons you do, the more probable it is that you will see an effect by chance.

Query 1 makes a total of 4 comparisons and Q2 10 comparisons, so we divide the level of significance we are prepared to accept by these values*, i.e. for Q1 we need p ≤ 0.0125 and for Q2 p ≤ 0.005, if we want to use the conventional low-level significance threshold (p = 0.05).

We need to find the value of z from the normal distribution that corresponds to that two-tailed probability; this can be done using online calculators, e.g. http://graphpad.com/quickcalcs/Statratio1.cfm

* Technically the correction is to 1-(0.95)^(1/N), where N is the number of comparisons, but 0.05/N is a close approximation for small N.

To compare groups i and j, we find the absolute value of the difference between the mean rank in group i and the mean rank in group j, then divide this difference in mean ranks by its standard deviation (square root of [(N*(N+1)/12)*(1/Ni + 1/Nj)]). Here N is the total number of data points in all groups, and Ni and Nj are the numbers of data points in the two groups being compared. Furthermore, the ranks are calculated using all samples rather than the within-'block' ranks used for Friedman's test.

If the ratio calculated in the preceding paragraph is larger than the critical value of z, then we conclude that the difference is statistically significant.

For Q1 we require z ≥ 2.498 for p ≤ 0.0125
For Q2 we need z ≥ 2.807 for p ≤ 0.005

The upshot is that if we asked Q1, then only drug D gave significantly different activation. For Q2, only the same Vehicle-drug D comparison can be said to be significantly different.

This does not mean that drug C doesn't inhibit NFAT activation or that there is no difference between drug D and drug A! It means we need to do more replicate experiments to show further significance.
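The Friedman statistic and the Dunn comparisons described above can be worked through in a few lines of Python. This is a sketch under stated assumptions rather than a reference implementation: the function names are invented for this example, the ranking helper assumes there are no tied values (true for this dataset), and the critical z is taken from the standard library `NormalDist` instead of an online calculator.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

treatments = ["Vehicle", "A", "B", "C", "D"]
# Median GFP fluorescence, one row per experiment (values from the table above).
data = [
    [2657, 2612, 2271, 1907, 1439],
    [2347, 2333, 2201, 1899, 1333],
    [2784, 2636, 2311, 2089, 1566],
]

def ranks(values):
    """Rank from 1 (lowest) upwards; this sketch assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def friedman_s(blocks):
    """S = sum(Ri^2) - R^2 / n, using ranks within each experiment."""
    rank_sums = [sum(col) for col in zip(*(ranks(row) for row in blocks))]
    n = len(rank_sums)
    return sum(r * r for r in rank_sums) - sum(rank_sums) ** 2 / n

def dunn_z(blocks):
    """z for every pair of treatments, using ranks over all samples pooled."""
    k = len(blocks[0])
    pooled = ranks([v for row in blocks for v in row])
    # pooled is row-major, so every k-th entry belongs to the same treatment
    mean_rank = [sum(pooled[t::k]) / len(blocks) for t in range(k)]
    n_total = len(pooled)
    sd = sqrt(n_total * (n_total + 1) / 12 * (2 / len(blocks)))
    return {(i, j): abs(mean_rank[i] - mean_rank[j]) / sd
            for i, j in combinations(range(k), 2)}

def critical_z(alpha, comparisons):
    """Two-tailed critical z after dividing alpha by the number of comparisons."""
    return NormalDist().inv_cdf(1 - alpha / comparisons / 2)

s = friedman_s(data)
z = dunn_z(data)

# Query 1: each drug against vehicle (4 comparisons).
q1 = [treatments[j] for (i, j), v in z.items()
      if i == 0 and v >= critical_z(0.05, 4)]
# Query 2: every pair of treatments (10 comparisons).
q2 = [(treatments[i], treatments[j]) for (i, j), v in z.items()
      if v >= critical_z(0.05, 10)]

print("Friedman S:", s)
print("Q1 significant vs vehicle:", q1)
print("Q2 significant pairs:", q2)
```

Printing the individual entries of `z` shows how far each remaining comparison falls short of its critical value, which gives a feel for how many more replicates might be needed.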
Set X

Expt          Vehicle   A      B      C      D
1             2657      2612   2271   1907   1439
2             2347      2333   2201   1899   1333
3             2784      2636   2311   2089   1566
means         2596      2527   2261   1965   1466
% V control             97     87     76     56

Set Y

Expt          Vehicle   A        B       C       D
1             797.92    109.45   28.76   16.73   10.09
2             758.02    103.98   27.32   15.89   9.59
3             853.77    117.11   30.77   17.9    10.8
means         803.24    110.18   28.95   16.84   10.16
% V control             14       4       2       1

Dunn's test, vehicle control vs
A   NS
B   NS
C   NS
D   p<0.05

ANOVAR (Set Y), control vs
A   P<<0.001
B   P<<0.001
C   P<<0.001
D   P<<0.001

These two datasets have exactly the same within-experiment ranks. For Set Y, however, the Dunn results clearly miss something important. Parametric ANOVAR may be permissible (after all, we don't know for certain that the underlying distribution of median values ISN'T Gaussian).

• Use common sense
• Biological significance, not statistical significance
• Compare like with like
• Reduce heterogeneity as far as possible
• Use non-parametric tests
  - Mann-Whitney
  - Wilcoxon's
  - Friedman + Dunn's
• Clear, reproducible cytometry data does not need stats; but if pushed you can risk using parametric stats with care
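The Set X versus Set Y comparison can be made concrete with a small sketch showing why a rank-based test cannot tell the two datasets apart even though the size of the drug effect is very different. The helper names are invented for this example, and the percentages are computed as the ratio of column means, which is assumed to be how the "% V control" rows were derived.

```python
set_x = [
    [2657, 2612, 2271, 1907, 1439],
    [2347, 2333, 2201, 1899, 1333],
    [2784, 2636, 2311, 2089, 1566],
]
set_y = [
    [797.92, 109.45, 28.76, 16.73, 10.09],
    [758.02, 103.98, 27.32, 15.89, 9.59],
    [853.77, 117.11, 30.77, 17.9, 10.8],
]

def within_experiment_ranks(blocks):
    """Rank each row from 1 (lowest) upwards; assumes no ties within a row."""
    return [[sorted(row).index(v) + 1 for v in row] for row in blocks]

def percent_of_vehicle(blocks):
    """Column mean as a rounded percentage of the vehicle (first column) mean."""
    sums = [sum(col) for col in zip(*blocks)]
    return [round(100 * s / sums[0]) for s in sums]

same_ranks = within_experiment_ranks(set_x) == within_experiment_ranks(set_y)
pct_x = percent_of_vehicle(set_x)
pct_y = percent_of_vehicle(set_y)

print("identical within-experiment ranks:", same_ranks)
print("Set X, % of vehicle:", pct_x)
print("Set Y, % of vehicle:", pct_y)
```

Once the values are replaced by ranks, the magnitude of each difference is discarded, so a modest reduction and a near-total reduction produce the same test statistic.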