Introduction to Statistics
Colm O’Dushlaine
Neuropsychiatric Genetics, TCD
[email protected]

Overview
• Descriptive Statistics & Graphical Presentation of Data
• Statistical Inference
• Hypothesis Tests & Confidence Intervals
• T-tests (Paired/Two-sample)
• Regression (SLR & Multiple Regression)
• ANOVA/ANCOVA
Intended as an overview. Slides will be provided after the lectures.

What’s in the lectures?

Lecture 1: Descriptive Statistics and Graphical Presentation of Data
1. Terminology
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)

Lecture 2: Statistical Inference
1. Distributions & Densities
2. Normal Distribution
3. Sampling Distribution & Central Limit Theorem
4. Hypothesis Tests
5. P-values
6. Confidence Intervals
7. Two-Sample Inferences
8. Paired Data

Lecture 3: Sample Inferences
1. Two-Sample Inferences
   – Paired t-test
   – Two-sample t-test
2. Inferences for more than two samples
   – One-way ANOVA
   – Two-way ANOVA
   – Interactions in Two-way ANOVA
3. DataDesk demo

Lecture 4
1. Regression
2. Correlation
3. Multiple Regression
4. ANCOVA
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites

FIRST, A REALLY USEFUL SITE
• Explanations of outputs
• Videos with commentary
• Help with deciding what test to use with what data

1. Terminology

Populations & Samples
• Population: the complete set of individuals, objects or scores of interest.
   – Often too large to sample in its entirety.
   – It may be real or hypothetical (e.g. the results from an experiment repeated ad infinitum).
• Sample: a subset of the population.
   – A sample may be classified as random (each member has an equal chance of being selected from the population) or convenience (what’s available).
   – Random selection attempts to ensure the sample is representative of the population.
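The difference between a random sample and a convenience sample can be illustrated with a short Python sketch. The population of 1000 subject IDs, the sample size of 20 and the seed are all hypothetical values chosen for illustration; they do not come from the lecture.

```python
import random

# Hypothetical population: 1000 subject IDs (an assumption for illustration only).
population = list(range(1, 1001))

# Simple random sample: every member has the same chance of being selected.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
random_sample = rng.sample(population, k=20)

# Convenience sample: whatever is easiest to reach, here the first 20 IDs.
convenience_sample = population[:20]

print("Random sample:     ", sorted(random_sample))
print("Convenience sample:", convenience_sample)
```

The convenience sample covers only the low end of the ID range, which is why it may not be representative if the IDs are related to the quantity being measured.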
Variables
Variables are the quantities measured in a sample. They may be classified as:
• Quantitative, i.e. numerical
   – Continuous (e.g. pH of a sample, patient cholesterol levels)
   – Discrete (e.g. number of bacteria colonies in a culture)
• Categorical
   – Nominal (e.g. gender, blood group)
   – Ordinal (ranked, e.g. mild, moderate or severe illness). Often ordinal variables are re-coded to be quantitative.

Variables
Variables can be further classified as:
• Dependent/Response: the variable of primary interest (e.g. blood pressure in an antihypertensive drug trial). Not controlled by the experimenter.
• Independent/Predictor
   – called a Factor when controlled by the experimenter. It is often nominal (e.g. treatment).
   – called a Covariate when not controlled.
If the value of a variable cannot be predicted in advance, then the variable is referred to as a random variable.

Parameters & Statistics
• Parameters: quantities that describe a population characteristic. They are usually unknown and we wish to make statistical inferences about them. Not to be confused with perimeters.
• Descriptive Statistics: quantities and techniques used to describe a sample characteristic or illustrate the sample data, e.g. mean, standard deviation, box-plot.

2. Frequency Distributions
• An (Empirical) Frequency Distribution or Histogram for a continuous variable presents the counts of observations grouped within prespecified classes or groups.
• A Relative Frequency Distribution presents the corresponding proportions of observations within the classes.
• A Barchart presents the frequencies for a categorical variable.

Example – Serum CK
Blood samples were taken from 36 male volunteers as part of a study to determine the natural variation in CK concentration.
The serum CK concentrations, measured in U/l, are as follows:

Serum CK Data for 36 male volunteers
121   95   84  119   62   25   82  145   57
104   83  123  100  151   64  201  139   60
110  113   67   93   70   48   68  101   78
118   92   95   58  163   94  203  110   42

Relative Frequency Table
Serum CK (U/l)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39                1            0.028                 0.028
40-59                4            0.111                 0.139
60-79                7            0.194                 0.333
80-99                8            0.222                 0.556
100-119              8            0.222                 0.778
120-139              3            0.083                 0.861
140-159              2            0.056                 0.917
160-179              1            0.028                 0.944
180-199              0            0.000                 0.944
200-219              2            0.056                 1.000
Total               36            1.000

Frequency Distribution
[Figure: histogram of CK concentration (U/l); x-axis from 20 to 220 U/l, y-axis: Frequency]

Relative Frequency Distribution
[Figure: relative frequency histogram of CK concentration (U/l); y-axis: Relative Frequency; annotations mark the mode, the left tail and the right (skewed) tail]
The shaded area is the percentage of males with CK values between 60 and 100 U/l, i.e. 42%.

3. Measures of Central Tendency (Location)
Measures of location indicate where on the number line the data are to be found. Common measures of location are:
(i) the Arithmetic Mean,
(ii) the Median, and
(iii) the Mode

The Mean
Let $x_1, x_2, x_3, \ldots, x_n$ be the realised values of a random variable $X$, from a sample of size $n$. The sample arithmetic mean is defined as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Example
Example 2: The systolic blood pressures of seven middle-aged men were as follows: 151, 124, 132, 170, 146, 124 and 113.
The mean is
$$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14$$

The Median and Mode
If the sample data are arranged in increasing order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if n is an even number.
The mode is the most commonly occurring value.
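The definitions of the mean, median and mode can be checked numerically. The following is a minimal Python sketch using the standard `statistics` module on the blood pressure and serum CK data given above; the variable names are illustrative only.

```python
import statistics

# Systolic blood pressure (mm Hg) of the seven men in the example.
bp = [151, 124, 132, 170, 146, 124, 113]

mean_bp = sum(bp) / len(bp)        # arithmetic mean: (1/n) * sum of the x_i
median_bp = statistics.median(bp)  # n is odd: the middle value of the sorted data
mode_bp = statistics.mode(bp)      # the most commonly occurring value

# Serum CK concentrations (U/l) of the 36 volunteers.
ck = [121, 95, 84, 119, 62, 25, 82, 145, 57,
      104, 83, 123, 100, 151, 64, 201, 139, 60,
      110, 113, 67, 93, 70, 48, 68, 101, 78,
      118, 92, 95, 58, 163, 94, 203, 110, 42]

mean_ck = statistics.mean(ck)
median_ck = statistics.median(ck)  # n is even: midway between the two middle values

print(f"Blood pressure: mean={mean_bp:.2f}, median={median_bp}, mode={mode_bp}")
print(f"Serum CK:       mean={mean_ck:.2f}, median={median_ck}")
```

For the blood pressure data this gives a mean of about 137.14, a median of 132 and a mode of 124.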
Example 1 – n is odd
The reordered systolic blood pressure data seen earlier are: 113, 124, 124, 132, 146, 151 and 170.
The Median is the middle value of the ordered data, i.e. 132.
Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.

Example 2 – n is even
Six men with high cholesterol participated in a study to investigate the effects of diet on cholesterol level. At the beginning of the study, their cholesterol levels (mg/dL) were as follows: 366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows: 230, 274, 274, 292, 327 and 366.
The Median is halfway between the middle two readings, i.e. (274 + 292)/2 = 283.
Two men have the same cholesterol level, so the Mode is 274.

Mean versus Median
• Large sample values tend to inflate the mean. This will happen if the histogram of the data is right-skewed.
• The median is not influenced by large sample values and is a better measure of centrality if the distribution is skewed.
• Note: if the distribution is symmetric (and has a single peak), then mean = median = mode.
• E.g. in the CK measurement study, the sample mean = 98.28 and the median = 94.5, i.e. the mean is larger than the median, indicating that the mean is inflated by the two large data values 201 and 203.

4. Measures of Dispersion
Measures of dispersion characterise how spread out the distribution is, i.e. how variable the data are. Commonly used measures of dispersion include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard deviation)
4. Inter-quartile range

Range
• The sample Range is the difference between the largest and smallest observations in the sample.
• Easy to calculate. Blood pressure example: min = 113 and max = 170, so the range = 57 mm Hg.
• Useful for “best” or “worst” case scenarios.
• Sensitive to extreme values.

Sample Variance
The sample variance, $s^2$, is the sum of the squared deviations from the sample mean divided by $n - 1$, i.e. approximately the average squared deviation:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Standard Deviation
The sample standard deviation, $s$, is the square root of the variance:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
$s$ has the advantage of being in the same units as the original variable $x$.

Example
Data      Deviation   Deviation²
151         13.86       192.02
124        -13.14       172.73
132         -5.14        26.45
170         32.86      1079.59
146          8.86        78.45
124        -13.14       172.73
113        -24.14       582.88
Sum = 960.0   Sum = 0.00   Sum = 2304.86
$\bar{x} = 137.14$

Example (contd.)
$$\sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86$$
Therefore,
$$s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6$$

Coefficient of Variation
The coefficient of variation (CV) or relative standard deviation (RSD) is the sample standard deviation expressed as a percentage of the mean, i.e.
$$CV = \frac{s}{\bar{x}} \times 100\%$$
• The CV is not affected by multiplicative changes in scale.
• Consequently, it is a useful way of comparing the dispersion of variables measured on different scales.

Example
The CV of the blood pressure data is:
$$CV = \frac{19.6}{137.1} \times 100\% = 14.3\%$$
i.e. the standard deviation is 14.3% as large as the mean.

Inter-quartile range
The Median divides a distribution into two halves. The first and third quartiles (denoted Q1 and Q3) are defined as follows:
• 25% of the data lie below Q1 (and 75% lie above Q1),
• 25% of the data lie above Q3 (and 75% lie below Q3).
The inter-quartile range (IQR) is the difference between the third and first quartiles, i.e. IQR = Q3 − Q1.

Example
The ordered blood pressure data are:
113   124   124   132   146   151   170
Here Q1 = 124 and Q3 = 151, so the Inter-Quartile Range (IQR) is 151 − 124 = 27.

60% of slides complete!
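The dispersion measures above can be reproduced for the blood pressure data with a short Python sketch. One assumption is made about the quartile convention: Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered data (leaving out the middle value when n is odd), which matches the worked example. Statistical packages use several other conventions and can return slightly different quartiles. The helper name `quartiles` is illustrative only.

```python
import math
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: when n is odd the middle value is left out of both halves.
    Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.median(lower), statistics.median(upper)

# Systolic blood pressure (mm Hg) of the seven men in the example.
bp = [151, 124, 132, 170, 146, 124, 113]
n = len(bp)

mean = sum(bp) / n
value_range = max(bp) - min(bp)
sum_sq_dev = sum((x - mean) ** 2 for x in bp)  # sum of squared deviations
variance = sum_sq_dev / (n - 1)                # sample variance, divisor n - 1
sd = math.sqrt(variance)                       # same units as the data
cv = 100 * sd / mean                           # standard deviation as % of the mean
q1, q3 = quartiles(bp)
iqr = q3 - q1

print(f"range={value_range}, variance={variance:.2f}, sd={sd:.1f}, CV={cv:.1f}%")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```

The hand-computed variance agrees with `statistics.variance(bp)`, which also uses the n − 1 divisor.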
5. Box-plots
A box-plot is a visual description of the distribution based on:
• Minimum
• Q1
• Median
• Q3
• Maximum
Useful for comparing large sets of data.

Example 1
The pulse rates of 12 individuals arranged in increasing order are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80
Q1 = (68 + 70)/2 = 69, Q3 = (76 + 78)/2 = 77
IQR = 77 − 69 = 8

Example 1: Box-plot
[Figure: box-plot of the pulse rate data]

Example 2: Box-plots of intensities from 11 gene expression arrays
[Figure: side-by-side box-plots, one per array; visible array labels include AG_04659_AS.cel, AG_11745_AS.cel, KB_5828_AS.cel and KB_8840_AS.cel]

Outliers
• An outlier is an observation which does not appear to belong with the other data.
• Outliers can arise because of a measurement or recording error, or because of equipment failure during an experiment, etc.
• An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate the presence of an illness in the patient.

Outlier Boxplot
• Re-define the upper and lower limits of the boxplot (the whisker lines) as:
  Lower limit = Q1 − 1.5 × IQR, and
  Upper limit = Q3 + 1.5 × IQR
• Note that the whisker lines may not go as far as these limits.
• If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.

Example – CK data outliers
[Figure: outlier box-plot of the CK data]

6. Scatter-plot
• Displays the relationship between two continuous variables.
• Useful in the early stages of analysis when exploring data and determining whether a linear regression analysis is appropriate.
• May show outliers in your data.

Example 1: Age versus Systolic Blood Pressure in a Clinical Trial
[Figure: scatter-plot]

Example 2: Up-regulation/Down-regulation of gene expression across an array (Control Cy5 versus Disease Cy3)
[Figure: scatter-plot]

Example of a Scatter-plot matrix (multiple pair-wise plots)
[Figure: scatter-plot matrix]

Other graphical representations
• Dot-plots, Stem-and-leaf plots: not visually appealing.
• Pie-chart: visually appealing, but hard to compare two datasets. Best for 3 to 7 categories. A total must be specified.
• Violin-plots (= box-plot + smooth density): a nice visual of the data shape.

Multivariate Data
• Clustering is useful for visualising multivariate data and uncovering patterns, often reducing its complexity.
• Clustering is especially useful for high-dimensional data (p >> n): hundreds or perhaps thousands of variables.
• Obvious areas of application are gel electrophoresis and microarray experiments, where the variables are protein abundances or gene expression ratios.

7. Clustering
• Aim: find groups of samples or variables sharing similarity.
• Clustering requires a definition of distance between objects, quantifying a notion of (dis)similarity.
• Points are grouped on the basis of the minimum distance apart (distance measures).
• Once a pair is grouped, its members are combined into a single point (using a linkage method), e.g. by taking their average. The process is then repeated.

Clustering
• Clustering can be applied to the rows or columns of a data set (matrix), i.e. to the samples or variables.
• A tree, called a Dendrogram, can be constructed with branch lengths proportional to the distances between linked clusters.
• Clustering is an example of unsupervised learning: no use is made of sample annotations, i.e. treatment groups, diagnosis groups.

UPGMA
• Unweighted Pair-Group Method using Averages
• Most commonly used clustering method
• Procedure:
  1. Each observation forms its own cluster.
  2. The two with minimum distance are grouped into a single cluster representing a new observation (take their average).
  3. Repeat step 2 until all data points form a single cluster.

Contrived Example
5 genes of interest on 3 replicate arrays/gels:

             Array1   Array2   Array3
p53             9        3        7
mdm2           10        2        9
bcl2            1        9        4
cyclinE         6        5        5
caspase 8       1       10        3

Calculate the distance between each pair of genes, e.g.
$$d(\text{p53}, \text{mdm2}) = \sqrt{(9 - 10)^2 + (3 - 2)^2 + (7 - 9)^2} = \sqrt{6} \approx 2.45$$

Example
Construct a distance matrix of all pair-wise distances:

             p53    mdm2    bcl2    cyclinE   caspase 8
p53           0     2.45   10.44     4.12      11.36
mdm2          -      0     12.45     6.40      13.45
bcl2          -      -       0       6.48       1.41
cyclinE       -      -       -        0         7.35
caspase 8     -      -       -        -          0

• Cluster the 2 genes with the smallest distance (here caspase 8 and bcl2, at 1.41).
• Take their average and re-calculate the distances to the other genes.

                       p53    mdm2   cyclin E   {caspase-8 & bcl-2}
p53                     0     2.45     4.12          10.89
mdm2                    -      0       6.40          12.94
cyclin E                -      -        0             6.89
{caspase-8 & bcl-2}     -      -        -              0

                       {p53 & mdm2}   cyclin E   {caspase-8 & bcl-2}
{p53 & mdm2}                0           5.24          11.90
cyclin E                    -            0             6.89
{caspase-8 & bcl-2}         -            -              0

Example (contd)
...and the final cluster:
[Figure: dendrogram of the five genes]

Example of a gene expression dendrogram
[Figure: gene expression dendrogram]

Variety of approaches to clustering
• Clustering techniques
  – agglomerative: start with every element in its own cluster, and iteratively join clusters together
  – divisive: start with one cluster and iteratively divide it into smaller clusters
• Distance Metrics
  – Euclidean (as-the-crow-flies)
  – Manhattan
  – Minkowski (a whole class of metrics)
  – Correlation (similarity in profiles: called similarity metrics)
• Linkage Rules
  – average: use the mean distance between cluster members
  – single: use the minimum distance (gives loose clusters)
  – complete: use the maximum distance (gives tight clusters)
  – median: use the median distance
  – centroid: use the distance between the “average” member of each cluster

Clustering Summary
• The clusters and tree topology often depend highly on the distance measure and linkage method used.
• It is recommended to use two distance metrics, such as Euclidean and a correlation metric.
• A clustering algorithm will always yield clusters, whether the data are organised in clusters or not!
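The contrived example can be worked through in code. The following minimal Python sketch follows the procedure as described on the slides: the closest pair is replaced by the average of its two expression profiles and the distances are then recomputed. This is a simplification; textbook UPGMA instead averages the pair-wise distances between cluster members, so results can differ on other data. The names `euclidean` and `cluster` are illustrative only, and the distances are recomputed directly from the expression table, so individual values may differ from those printed on the slides.

```python
import math
from itertools import combinations

# Expression values (Array1, Array2, Array3) from the contrived example.
expression = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "bcl2": (1, 9, 4),
    "cyclinE": (6, 5, 5),
    "caspase8": (1, 10, 3),
}

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(points):
    """Merge the closest pair repeatedly until one cluster is left.

    Each merged pair is replaced by the plain average of its two profiles,
    as in the slide's description (an assumption, not textbook UPGMA).
    Returns a list of (member_a, member_b, distance) in merge order.
    """
    clusters = dict(points)
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: euclidean(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = euclidean(clusters[a], clusters[b])
        average = tuple((x + y) / 2 for x, y in zip(clusters[a], clusters[b]))
        del clusters[a], clusters[b]
        clusters[f"({a}, {b})"] = average
        merges.append((a, b, round(distance, 2)))
    return merges

merges = cluster(expression)
for a, b, distance in merges:
    print(f"merge {a} + {b} at distance {distance}")
```

The order of the merges gives the tree topology, and the merge distances correspond to the heights at which branches join in a dendrogram.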