Introduction to Biostatistics (ZJU 2008)
Wenjiang Fu, Ph.D
Associate Professor
Division of Biostatistics, Department of Epidemiology
Michigan State University
East Lansing, Michigan 48824, USA
Email: [email protected]
www: http://www.msu.edu/~fuw

Introduction
• Biostatistics?
• Why do we need to study biostatistics?
• A test for myself!

Statistics
Data science that helps to decipher data collected on many kinds of events, using probability theory and statistical principles with the help of computers.

Statistics
• Theoretical
• Applied: Biostats, Economics, Finance, Engineering, Sports ……
Data:
• Events: party, disease, accident, award, game …
• Subjects: human, animal …
• Characteristics: sex, race, age, weight, height …

Statistics
Most commonly, statistics refers to numerical data or other data. Statistics may also refer to the process of collecting, organizing, presenting, analyzing and interpreting data for the purpose of making inferences, decisions and policy, and assisting scientific discovery.
Key terms: population, sampling, parameter, sample, statistic, descriptive statistics, frequency, probability, prediction, estimation, inferential statistics, hypothesis testing.

Grand challenges we are facing
"Data" → Knowledge & Information → … → Decision (Statistics)
The 21st century will be the golden age of statistics!

Grand challenges we are facing
1. Data collection technology has advanced dramatically, but without sufficient statistical sampling design and experimental design.
2. Advancement of technology for discovering and retrieving useful information has been lagging and has become the bottleneck.
3. More sophisticated approaches are needed for decision making and risk management.
…

Statistical Challenges - Massive Amount of Data

Statistical Challenges - Image Data

Statistics in Science
• Cosmic microwave background radiation
• Tick-by-tick stock data
• High energy physics
• Genomic/proteomic data

Statistics in Science
• Fingerprints
• Microarray

What do we do?
• New ways of thinking about and attacking problems
• Finding sub-optimal but computationally feasible solutions
• New paradigms for new types of data
• Be satisfied with 'very rough' approximations
• Turn research results into easy-to-use, publicly available software and programs
• Join forces with computer scientists

Some 'hot' research directions
• Dimension reduction
• Visualization
• Dynamic systems
• Simulation and real-time computation
• Uncertainty and risk management
• Interdisciplinary research

Reasons to Study Biostatistics I
Biostatistics is everywhere around us:
• Our life: entertainment, sports games, shopping, parties, communication (cell phone), travel …
• Our work: career, business, school …
• Our health: food, weather, disease …
• Our environment: safety, security, chemicals, animals
• Our well-being: physical examination, hospital, being happy, longevity

Reasons to Study Biostatistics I
• Entertainment, party: music / dance / food
• Sports games
• Allergy to certain foods / smells: peanuts, flowers …
• Communication: cell phone use
• Car racing, skiing (time to event: survival analysis)
• Shopping: different tastes / preferences: alcohol, cigarettes, drugs, etc.
• Potential hazards that lead to health problems (CA …)
• Travel: infectious diseases, safety, accidents …

Reasons to Study Biostatistics II
We care about our society, our family, our environment, our school, scientific research …
Major impact on society and communities:
• Disease transmission
• Healthcare benefit, health economics
• Quality of life (research, health improvement)
• Safety issues (outbreaks of diseases, etc.)
The job market is very promising.
Applications in a wide range of areas:
• Healthcare, quality of life
• Career, job market: scientific, public or private, industrial …

Reasons to Study Biostatistics III
Biostatistics research and applications
Major employers in the US:
• Research universities, hospitals, institutes (NIH), CDC, DoD, NASA, pharmaceutical industry, biotech industry, banks and other data warehouses …
Major universities having biostatistics departments in the US:
• Harvard U, U. Michigan, U. Washington (Seattle), UC (Berkeley, LA, SF), JHU, Yale U, Stanford U …

Reasons to Study Biostatistics IV
New biostatistics research areas (still growing):
• Medical research
Recent trends in employment:
• Private industry: Google, Microsoft … Affymetrix, Illumina, Agilent, Golden Helix, 23andMe …
• Investment: stock market, Capital One, Bank of America, Goldman Sachs, etc.
• Nano tech, green energy (alternative energy) …

Example 1. Medical study data: Ob/Gyn
Modeling of PlGF: Placental Growth Factor

Example 2. Genomics study
Single Nucleotide Polymorphism (SNP)
Homologous pair of chromosomes:
Paternal allele: ACGAACAGCT / TGCTTGTCGA
Maternal allele: ACGAGCAGCT / TGCTCGTCGA
SNP: A/G

Computational Genomics: SNP Genotype
• Error rate: around 5%
• Genome-wide association studies: millions of SNPs

Applications
• Genetic counseling: achieve accurate estimation and prediction
  gene expression + family medical history → disease
  Breast cancer (BRCA) …
• Early detection / early treatment (cancer, …)
• Accurate diagnosis (HIV +)
• Help the development of new drugs for treatment
• Help to protect the environment, live longer and happier, improve quality of life

Did I pass my test?
I hope I have convinced you to study biostatistics.

Chapter 2. Descriptive Statistics
The first important thing to do is to visualize the data.
• Plot of data
• Scatter plot: pair-wise (var 1 vs.
var 2)

Scatter plot

Descriptive Statistics
Summarize data using statistics:
• Central location (mean, median)
• Range (min, max)
• Variability (variance, standard deviation)
• Mode
• Quantiles (percentiles)
• Rank data, but avoid long listing (use grouping, instead)

Measure of Location: Mean
The mean is the sum of all the observations divided by the number of observations.
Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$, where $N$ is the number of observations in the population.
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, where $n$ is the number of observations in the sample.

Properties of the mean
The mean is the most widely used measure of location and has the following properties:
• $\sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Translation of data: if $y_i = a x_i + b$, $i = 1, \dots, n$, then $\bar{y} = a \bar{x} + b$.
• The mean is oversensitive to extreme values in the sample.

Measure of Location: Median and Mode
The median is the value of the "middle" point of the sample when the observations are arranged in ascending order.
Median = the [(n+1)/2]-th largest observation if n is odd
       = the average of the (n/2)-th and (n/2+1)-th largest observations if n is even.
The mode is the most frequently occurring value among all the observations in a sample. It is the most probable value that would be obtained if one data point is selected at random from a population.

Example: Median and Mode
Calculate the median and mode of the following data: 12, 24, 36, 25, 17, 19, 24, 11
Sorted data: 11, 12, 17, 19, 24, 24, 25, 36
Median = $\frac{19 + 24}{2} = 21.5$
Mode = 24
The mean is influenced by outliers while the median is not.
The mode is very unstable. Minor fluctuations in the data can change it substantially; for this reason it is seldom calculated.
[Figure: positions of the mean, median and mode in symmetric (Mean = Median = Mode), skewed (ordered by ≤) and bimodal distributions]

Symmetry and Skewness in Distribution
When the shape of a distribution to the left and to the right is a mirror image of each other, the distribution is symmetrical. Examples of symmetrical distributions are shown below:
[Figure: examples of symmetrical distributions]
A skewed distribution is a distribution that is not symmetrical.
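As a quick check of the measures of location covered in this chapter, the sketch below recomputes the mean, median and mode for the example data 12, 24, 36, 25, 17, 19, 24, 11 using Python's standard `statistics` module. The second data set, in which 36 is replaced by 360, is a hypothetical outlier added here for illustration and is not part of the slides.

```python
from statistics import mean, median, mode

# Data from the median and mode example in this chapter.
data = [12, 24, 36, 25, 17, 19, 24, 11]

sample_mean = mean(data)      # sum of observations / number of observations
sample_median = median(data)  # n is even: average of the two middle values
sample_mode = mode(data)      # most frequently occurring value

# Deviations from the mean sum to zero (up to floating-point rounding).
deviation_sum = sum(x - sample_mean for x in data)

# Hypothetical outlier (an assumption, not in the source): replace 36 by 360.
with_outlier = [360 if x == 36 else x for x in data]
outlier_mean = mean(with_outlier)
outlier_median = median(with_outlier)

print("mean:", sample_mean, "median:", sample_median, "mode:", sample_mode)
print("with outlier, mean:", outlier_mean, "median:", outlier_median)
```

With the outlier, the mean moves from 21 to 61.5 while the median stays at 21.5, which illustrates the sensitivity of the mean to extreme values.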
Examples of skewed distributions are shown below:
[Figure: positively skewed and negatively skewed distributions]

Measure of Dispersion: Range and Mean Absolute Deviation (MAD)
The range is the simplest measure of dispersion. It is simply the difference between the largest and smallest observations in a sample.
$\text{Range} = x_{\max} - x_{\min}$
The mean absolute deviation is the average of the absolute values of the deviations of individual observations from the mean.
$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$

Measure of Dispersion: Quantiles or Percentiles
A quantile (percentile) is the general term for a value at or below which a stated proportion (p/100) of the data in a distribution lies.
• Quartiles: the 25th, 50th and 75th percentiles (proportions .25, .50, .75)
• Quantile / Percentile: the proportion can be any probability value

Calculating Quantiles or Percentiles
Let [k] denote the largest integer not exceeding k. For example, [3] = 3, [4.7] = 4.
The p-th percentile of n observations is defined as follows:
• Find k = np/100.
• If k is an integer, the p-th percentile is the mean of the k-th and (k+1)-th observations (in ascending sorted order).
• If k is NOT an integer, the p-th percentile is the ([k]+1)-th observation.

Example
Calculate the 10th percentile and the 75th percentile of the following data: 7, 12, 16, 2, 8, 4, 20, 14, 19, 17
Sorted data: 2, 4, 7, 8, 12, 14, 16, 17, 19, 20 (n = 10)
10th percentile: k = np/100 = 10×10/100 = 1, so take the average of the 1st and 2nd observations = (2+4)/2 = 3
75th percentile: k = np/100 = 10×75/100 = 7.5, so take the ([7.5]+1) = (7+1) = 8th observation = 17

Measure of Dispersion: Variance and Standard Deviation
The variance is a measure of how spread out a distribution is. It is computed as the average squared deviation of each number from its mean.
The standard deviation is the square root of the variance. It is the most commonly used measure of spread.
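The definition above can be sketched in a few lines of Python. The sample version divides by n - 1 rather than n, the usual convention for sample data. The function names and the population variant are assumptions added for illustration, and the five values are the BMI figures used in this chapter's worked example.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def population_variance(xs):
    """Average squared deviation from the mean (divides by N)."""
    n = len(xs)
    if n < 1:
        raise ValueError("need at least one observation")
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


# BMI values from the worked example in this chapter.
bmi = [18, 20, 22, 25, 24]

s2 = sample_variance(bmi)       # sample variance
s = math.sqrt(s2)               # sample standard deviation
sigma2 = population_variance(bmi)

# Cross-check against the standard library's sample variance.
library_s2 = statistics.variance(bmi)

print("sample variance:", round(s2, 4))
print("sample standard deviation:", round(s, 4))
print("population variance:", round(sigma2, 4))
print("statistics.variance:", round(library_s2, 4))
```

Dividing by n - 1 gives a slightly larger value than dividing by n; the difference shrinks as the sample grows.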
Sample variance: $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Sample standard deviation: $s_x = \sqrt{s_x^2}$
If $y_i = a x_i + b$, $i = 1, \dots, n$, then $s_y^2 = a^2 s_x^2$ and $s_y = |a| s_x$.

Example
Five people have their body mass index (BMI) calculated as [body weight (kg)] / [height]^2:
18, 20, 22, 25, 24
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{109}{5} = 21.8$
$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2 = \frac{32.8}{5-1} = 8.2$
$s_x = \sqrt{s_x^2} = \sqrt{8.2} = 2.86$

Relative Dispersion: Coefficient of Variation
A direct comparison of two or more measures of dispersion may be difficult because of differences in their means. A relative dispersion is the amount of variability in a distribution relative to a reference point or benchmark. A common measure of relative dispersion is the coefficient of variation (CV).
$CV = \frac{s_x}{\bar{x}} \times 100\%$
This measure remains the same regardless of the units used when only scaling applies. Very useful!
• Good example: weight, kg versus lb.
• Bad example: temperature, C vs F (the conversion includes a shift as well as a scaling, so the CV changes).

Frequency Distribution
A long list of collected data can be confusing, and needs to be grouped into moderate intervals rather than listed as raw data points.
Hospital Length of Stay (LOS)
81 63 98 86 83 44 43 58 55 50
29 28 42 36 32 23 21 28 27 26
16 16 20 19 27 13 13 15 15 14
12 12 13 12 12 11 11 12 12 12
11 11 11 11 11 64 58 10 10 10
43 42 93 83 81 28 28 56 50 48
22 21 36 32 30 16 15 28 27 23
13 13 20 18 17 12 12 15 14 14
11 11 12 12 11 10 12 12 12 11
11 11 12 10 10 10
A summary table works better than raw data.
[Table: Interval (LOS), Frequency, Relative Frequency]

Graphic Methods: Bar Graph
A bar graph is simply a bar chart of data that has been classified into a frequency distribution. The attractive feature of a bar graph is that it allows us to quickly see where most of the observations are concentrated.
[Figure: bar graph of Frequency by LOS interval]

Graphic Methods: Histogram
A histogram provides a distribution plot, where the bars are not necessarily of the same width.
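Grouping raw values into intervals of unequal width can be sketched as follows. The 20 values are the first 20 entries of the length-of-stay listing above. The interval boundaries (20, 30, 50, 100), the half-open convention and the function name are assumptions chosen only to show unequal widths; they are not the intervals used on the slide. Density is taken as relative frequency divided by interval width, so that the area of a bar equals its relative frequency.

```python
def frequency_table(values, edges):
    """Tabulate values into half-open intervals [edges[i], edges[i+1])."""
    n = len(values)
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        rel = count / n
        rows.append({
            "interval": (lo, hi),
            "frequency": count,
            "relative_frequency": rel,
            # Bar height for a histogram with unequal widths.
            "density": rel / (hi - lo),
        })
    return rows


# First 20 values of the LOS listing in this chapter.
los = [81, 63, 98, 86, 83, 44, 43, 58, 55, 50,
       29, 28, 42, 36, 32, 23, 21, 28, 27, 26]

# Hypothetical interval boundaries (an assumption, not from the slides).
edges = [20, 30, 50, 100]

table = frequency_table(los, edges)
for row in table:
    lo, hi = row["interval"]
    area = row["density"] * (hi - lo)  # area of the bar = relative frequency
    print(f"[{lo}, {hi}): n={row['frequency']}, "
          f"rel={row['relative_frequency']:.2f}, area={area:.2f}")
```

Because each bar's area equals its relative frequency, the areas add up to 1 whatever interval widths are chosen.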
The area of each bar is proportional to the percentage of data points within the bar, so the height of the bar represents the density of the data.

Graphic Methods: Box Plot
The box plot is a summary plot based on the median and the interquartile range (IQR), which contains 50% of the values. Whiskers extend from the box to the highest and lowest values, excluding outliers. A line across the box indicates the median.
$\text{IQR} = Q_3 - Q_1$
$\text{MIN} = Q_1 - 1.5 \times \text{IQR}, \quad \text{MAX} = Q_3 + 1.5 \times \text{IQR}$
[Figure: box plot with MIN and MAX marked]
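The box plot quantities can be sketched by combining the percentile rule from this chapter (k = np/100, averaging the k-th and (k+1)-th observations when k is an integer, otherwise taking the ([k]+1)-th) with the MIN and MAX limits above. The function names, the restriction to integer p with 0 < p < 100, and the extra value 45 appended to the percentile example data are assumptions made for illustration.

```python
def percentile(values, p):
    """p-th percentile by this chapter's rule (integer p, 0 < p < 100)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must satisfy 0 < p < 100")
    xs = sorted(values)
    k, remainder = divmod(len(xs) * p, 100)  # k = [np/100]
    if remainder == 0:
        # k is an integer: mean of the k-th and (k+1)-th observations.
        return (xs[k - 1] + xs[k]) / 2
    # k is not an integer: the ([k]+1)-th observation (index k, 0-based).
    return xs[k]


def box_plot_summary(values):
    """Quartiles, IQR, MIN/MAX limits, whisker ends and outliers."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # MIN limit
    upper = q3 + 1.5 * iqr   # MAX limit
    inside = [v for v in values if lower <= v <= upper]
    return {
        "q1": q1,
        "median": percentile(values, 50),
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower,
        "upper_limit": upper,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < lower or v > upper),
    }


# Percentile example data from this chapter.
example = [7, 12, 16, 2, 8, 4, 20, 14, 19, 17]
p10 = percentile(example, 10)
p75 = percentile(example, 75)

# Same data plus a hypothetical extreme value (an assumption, not in the source).
summary = box_plot_summary(example + [45])

print("10th percentile:", p10, "75th percentile:", p75)
print(summary)
```

Other quantile conventions exist (statistical packages often interpolate between observations), so quartiles computed elsewhere may differ slightly from this rule.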