Introduction to
Biostatistics (ZJU 2008)
Wenjiang Fu, Ph.D
Associate Professor
Division of Biostatistics, Department of
Epidemiology
Michigan State University
East Lansing, Michigan 48824, USA
Email: [email protected]
www: http://www.msu.edu/~fuw
Introduction

Biostatistics? Why do we need to study biostatistics? A test for myself!

Statistics – the science of data: it helps decipher data collected from many kinds of
events, using probability theory and statistical principles with the help of
computers.

Statistics: Theoretical; Applied (Biostats, Economics, Finance, Engineering, Sports, ……)

Data:
Events: party, disease, accident, award, game …
Subjects: human, animal …
Characteristics: sex, race, age, weight, height …
Statistics
Most commonly, statistics refers to numerical data or other data.
Statistics may also refer to the process of collecting, organizing,
presenting, analyzing and interpreting data for the purpose of making
inference, decision, policy and assisting scientific discoveries.
[Diagram: population (parameter), sampling, sample (statistic); descriptive statistics (frequency); probability; inferential statistics (estimation, prediction, hypothesis testing)]
Grand challenges we are facing
[Diagram: “Data”, Knowledge & Information, …, Decision; Statistics]
21st century will be the golden age of statistics!
Grand challenges we are facing
1. Data collection technology has advanced dramatically, but without sufficient statistical sampling design and experimental design.
2. Advancement of technology for discovering and retrieving useful information has been lagging and has become the bottleneck.
3. More sophisticated approaches are needed for decision making and risk management.
…
Statistical Challenges
- Massive Amount of Data
Statistical Challenges – Image Data
Statistics in Science
Cosmic microwave background radiation
Tick-by-tick stock data
High Energy Physics
Genomic/proteomic data
Statistics in Science
Finger Prints
Microarray
What do we do?

• New ways of thinking and attacking problems
  • Finding sub-optimal but computationally feasible solutions.
  • New paradigm for new types of data
  • Be satisfied with ‘very rough’ approximations
  • Turn research results into easy and publicly available software and programs
• Join forces with computer scientists.
Some ‘hot’ research directions
• Dimension reduction
• Visualization
• Dynamic systems
• Simulation and real time computation
• Uncertainty and risk management
• Interdisciplinary research

Reasons to Study Biostatistics I

Biostatistics is everywhere around us:
• Our life: entertainment, sports game, shopping, party, communication (cell phone), travel …
• Our work: career, business, school …
• Our health: food, weather, disease …
• Our environment: safety, security, chemical, animal …
• Our well-being: physical examination, hospital, being happy, longevity.

Reasons to Study Biostatistics I

• Entertainment - party: music / dance / food
  • Allergy to certain food / smell: peanut, flowers …
• Sports game
  • Car racing, skiing (time to event – survival analysis).
• Communication - cell phone use
  • Potential hazard – leads to health problem (CA …)
• Shopping: different taste / preference:
  • Alcohol, cigarette, drug, etc.
• Travel – infectious diseases, safety, accident …
Reasons to Study Biostatistics II


• We care about our society, our family, our environment, our school, scientific research …
• Major impact on society and communities.
  • Disease transmission
  • Healthcare benefit, health economics
  • Quality of life (research, health improvement)
  • Safety issue (outbreaks of diseases, etc.)
• Job market is very promising.
• Applications in a wide range of areas.
  • Healthcare, quality of life,
  • Career – job market: scientific, public or private, industrial …
Reasons to Study Biostatistics III


• Biostatistics research and applications
• Major employers in the US
  Research universities, hospitals, institutes (NIH), CDC, DoD, NASA, pharmaceutical industry, biotech industry, banks and other data warehouses …
• Major universities with biostatistics departments in the US
  Harvard U, U. Michigan, U. Washington (Seattle), UC (Berkeley, LA, SF), JHU, Yale U, Stanford U …
Reasons to Study Biostatistics IV



• New biostatistics research areas (still growing)
  • Medical research.
• Recent trend in employment
  • Private industry: Google, Microsoft …
  • Affymetrix, Illumina, Agilent, Golden Helix, 23andMe …
  • Investment – stock market, Capital One, Bank of America, Goldman Sachs, etc.
  • Nano tech, green energy (alternative energy) …
Example 1. Medical study data:
Ob/Gyn
Modeling of PlGF: Placental Growth Factor
Example 2. Genomics study
Single Nucleotide Polymorphism (SNP)

Homologous pairs of chromosomes
Paternal allele: ACGAACAGCT / TGCTTGTCGA
Maternal allele: ACGAGCAGCT / TGCTCGTCGA
SNP A/G
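As a rough illustration of what "SNP A/G" means for the two alleles on this slide, the sketch below compares the paternal and maternal strands position by position and also checks that each lower strand is the base-pair complement of the upper one. The function names `snp_positions` and `complement` are illustrative and not part of the lecture.

```python
# Complementary base pairs in DNA.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-pair complement of a DNA strand (same reading direction)."""
    return "".join(PAIR[base] for base in strand)

def snp_positions(seq_a, seq_b):
    """List (1-based position, base in seq_a, base in seq_b) where the sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

paternal = "ACGAACAGCT"   # upper strand of the paternal allele (from the slide)
maternal = "ACGAGCAGCT"   # upper strand of the maternal allele (from the slide)

print(snp_positions(paternal, maternal))       # [(5, 'A', 'G')]
print(complement(paternal) == "TGCTTGTCGA")    # True
print(complement(maternal) == "TGCTCGTCGA")    # True
```

The single differing position (the fifth base, A versus G) is the SNP shown on the slide.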
Computational Genomics: SNP Genotype
Error rate: around 5%
Genome-wide association studies – millions of SNPs
Applications

• Genetic counseling:
  • gene expression + family medical history → disease
  • Breast cancer (BRCA) …
• Achieve accurate estimation and prediction
  • Early detection / early treatment (cancer, …)
  • Accurate diagnosis (HIV +)
• Help development of new drugs for treatment.
• Help to protect environment, live longer and happier, improve quality of life.
Did I pass my test?

I hope I have convinced you to study
biostatistics.
Chapter 2. Descriptive Statistics


The first important thing to do is to visualize the data.
Plot of data

Scatter plot – pair-wise (var 1 vs. var 2)
Scatter plot
Descriptive Statistics

• Summarize data using statistics
  • Central location (mean, median)
  • Range (min, max)
  • Variability (variance, standard deviation)
  • Mode
  • Quantiles (percentiles)
• Rank data, but avoid long listing (use grouping instead)
Measure of Location
Mean
The mean is the sum of all the observations divided by the number
of observations.
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $N$ is the number of observations in the population.
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of observations in the sample.
Properties of the mean
The mean is the most widely used measure of location
and has the following properties :
$\sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$
Translation of data: if $y_i = a x_i + b$, $i = 1, \dots, n$, then $\bar{y} = a\bar{x} + b$.
The mean is oversensitive to extreme values in the sample.
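The properties of the mean can be checked numerically. The following is a minimal sketch using the data from the median example in these slides; the constants `a` and `b` and the replaced extreme value 360 are arbitrary choices made only for illustration.

```python
from statistics import mean

def deviations_sum(xs):
    """Sum of the deviations of each observation from the sample mean."""
    m = mean(xs)
    return sum(x - m for x in xs)

def linear_transform(xs, a, b):
    """Apply y_i = a * x_i + b to every observation."""
    return [a * x + b for x in xs]

data = [12, 24, 36, 25, 17, 19, 24, 11]
a, b = 2.0, 5.0                      # illustrative constants (assumed)
ys = linear_transform(data, a, b)

# Property 1: deviations from the mean cancel out.
print(abs(deviations_sum(data)) < 1e-9)              # True

# Property 2: the mean of y equals a * mean(x) + b.
print(abs(mean(ys) - (a * mean(data) + b)) < 1e-9)   # True

# Sensitivity: replacing one value by an extreme one shifts the mean a lot.
extreme = [360 if x == 36 else x for x in data]
print(mean(data), mean(extreme))                     # 21 61.5
```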
Measure of Location
Median and Mode
The median is the value of the “middle” point of samples,
when samples are arranged in ascending order.
Median = the [(n+1)/2]th largest observation if n is odd;
       = the average of the (n/2)th and (n/2+1)th largest observations if n is even.
The mode is the most frequently occurring value among all
the observations in a sample. It is the most probable value
that would be obtained if one data point is selected at
random from a population.
Example: Median and Mode
Calculate the median and mode of the following data:
12, 24, 36, 25, 17, 19, 24, 11
Sorted data : 11, 12, 17, 19, 24, 24, 25, 36
Median = $\frac{19 + 24}{2} = 21.5$, Mode = 24
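The same calculation can be sketched in a few lines of Python. The helper names `median` and `modes` are illustrative; `modes` returns a list because a sample may have more than one most frequent value (the bimodal case).

```python
from collections import Counter

def median(xs):
    """Middle value of the sorted sample (average of the two middle values if n is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                      # the ((n+1)/2)-th observation, counting from 1
    return (s[mid - 1] + s[mid]) / 2       # average of the (n/2)-th and (n/2+1)-th

def modes(xs):
    """All values that occur most frequently in the sample."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

data = [12, 24, 36, 25, 17, 19, 24, 11]
print(sorted(data))    # [11, 12, 17, 19, 24, 24, 25, 36]
print(median(data))    # 21.5
print(modes(data))     # [24]
```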
The mean is influenced by outliers
while the median is not.
The mode is very unstable.
Minor fluctuations in the data
can change it substantially;
for this reason it is seldom
calculated.
≤≤
bimodal
mode
==
 Mean
 Median
 Mode
mode
≤≤
Symmetry and Skewness in Distribution
When the left and right sides of a distribution are mirror images of each
other, the distribution is symmetrical. Examples of symmetrical
distributions are shown below:
A skewed distribution is a distribution that is not symmetrical.
Examples of skewed distributions are shown below:
Positively skewed
Negatively skewed
Measure of Dispersion
Range and Mean Absolute Deviation (MAD)
The Range is the simplest measure of dispersion. It is
simply the difference between the largest and smallest
observations in a sample.
$\text{Range} = x_{\max} - x_{\min}$
The mean absolute deviation is the average of the
absolute values of the deviations of individual
observations from the mean.
$\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
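A minimal sketch of both measures follows, applied to the data from the median example in these slides. The function names are illustrative.

```python
def value_range(xs):
    """Difference between the largest and smallest observations."""
    return max(xs) - min(xs)

def mean_absolute_deviation(xs):
    """Average absolute deviation of the observations from their mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

data = [12, 24, 36, 25, 17, 19, 24, 11]
print(value_range(data))               # 25
print(mean_absolute_deviation(data))   # 6.25
```

Here the mean is 21, the absolute deviations sum to 50, and 50 / 8 = 6.25.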
Measure of Dispersion
Quantiles or Percentiles
Quantile (percentile) is the general term for a value at or
below which a stated proportion (p/100) of the data in a
distribution lies.
 Quartiles: p = .25, .50, .75
 Quantile / Percentile : p is any probability value
Calculating Quantiles or Percentiles
Let [k] denote the largest integer less than or equal to k.
For example, [3]=3, [4.7]=4.
The p-th percentile is defined as follows:
• Find k = np/100.
• If k is an integer, the p-th percentile is the mean of
the k-th and (k+1)-th observations (in the ascending
sorted order).
• If k is NOT an integer, the p-th percentile is the
([k]+1)-th observation.
Example
Calculate the 10th percentile and the 75th percentile
of the following data:
7, 12, 16, 2, 8, 4, 20, 14, 19, 17
Sorted data : 2, 4, 7, 8, 12, 14, 16, 17, 19, 20
(n = 10)
10th percentile: k = np/100 = 10×10/100 = 1
Average of 1st and 2nd observations = (2+4)/2 = 3
75th percentile: k = np/100 = 10×75/100 = 7.5
[7.5]+1 = 7+1 = 8th observation = 17
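The rule above can be written as a short function. This is a sketch of the definition used in these slides only; statistical packages often use other interpolation rules and may return slightly different values. It assumes 0 < p < 100 and is exact when p is an integer.

```python
import math

def percentile(xs, p):
    """p-th percentile by the slide's rule: k = n*p/100, then average or round up."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    s = sorted(xs)
    n = len(s)
    k = n * p / 100
    if k.is_integer():
        k = int(k)
        # Mean of the k-th and (k+1)-th observations (1-based positions).
        return (s[k - 1] + s[k]) / 2
    # The ([k]+1)-th observation, which is index floor(k) when counting from 0.
    return s[math.floor(k)]

data = [7, 12, 16, 2, 8, 4, 20, 14, 19, 17]
print(percentile(data, 10))   # 3.0
print(percentile(data, 75))   # 17
```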
Measure of Dispersion
Variance and Standard Deviation
The variance is a measure of how spread out a distribution
is. It is computed as the average squared deviation of each
number from its mean (for a sample, the sum of squared deviations
is divided by n − 1). The standard deviation is the square
root of the variance. It is the most commonly used measure
of spread.
Sample variance: $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Sample standard deviation: $s_x = \sqrt{s_x^2}$
If $y_i = a x_i + b$, $i = 1, \dots, n$, then $s_y^2 = a^2 s_x^2$ and $s_y = |a| s_x$.
Example
Five people have their body mass index (BMI) calculated as
[body weight (kg)] / [height (m)]^2:
18, 20, 22, 25, 24
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{109}{5} = 21.8$
$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2 = \frac{32.8}{5-1} = 8.2$
$s_x = \sqrt{8.2} = 2.86$
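The BMI example can be reproduced with a short sketch. The second part checks the scaling property of the standard deviation under y = a·x + b; the constants `a = -2` and `b = 3` are arbitrary values chosen for illustration.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

bmi = [18, 20, 22, 25, 24]
s2 = sample_variance(bmi)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))   # 8.2 2.86

# Under y = a*x + b the standard deviation scales by |a| and ignores b.
a, b = -2, 3                       # illustrative constants (assumed)
ys = [a * x + b for x in bmi]
print(math.isclose(math.sqrt(sample_variance(ys)), abs(a) * s))   # True
```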
Relative Dispersion – Coefficient of Variation
A direct comparison of two or more measures of dispersion
may be difficult because of difference in their means.
A relative dispersion is the amount of variability in a
distribution relative to a reference point or benchmark.
A common measure of relative dispersion is the coefficient of
variation (CV).
$CV = 100 \times \frac{s_x}{\bar{x}}$
This measure remains the same regardless of the units used
when only scaling applies. Very useful !
Good Example: Weight, Kg versus Lb.
Bad Example: Temperature: C vs F.
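The good and bad examples can be illustrated with a small sketch. The weights and temperatures below are made-up values, not data from the lecture. Converting kg to lb is pure scaling, so the CV is unchanged; converting C to F also adds a constant (32), which shifts the mean but not the standard deviation, so the CV changes.

```python
import math

def coefficient_of_variation(xs):
    """CV = 100 * (sample standard deviation) / (sample mean)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return 100 * s / m

weights_kg = [60.0, 72.5, 81.0, 55.5, 90.0]        # made-up values
weights_lb = [w * 2.20462 for w in weights_kg]     # scaling only

temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]           # made-up values
temps_f = [c * 9 / 5 + 32 for c in temps_c]        # scaling plus a shift

cv_kg = coefficient_of_variation(weights_kg)
cv_lb = coefficient_of_variation(weights_lb)
cv_c = coefficient_of_variation(temps_c)
cv_f = coefficient_of_variation(temps_f)

print(math.isclose(cv_kg, cv_lb))        # True: same CV in kg and lb
print(round(cv_c, 2), round(cv_f, 2))    # the two temperature CVs differ
```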
Frequency Distribution
A long list of collected data can be confusing; the data
should be grouped into moderate-sized intervals rather
than listed as raw data points.
Hospital Length of Stay (LOS)
81, 63, 98, 86, 83, 44, 43, 58, 55, 50
29, 28, 42, 36, 32, 23, 21, 28, 27, 26
16, 16, 20, 19, 27, 13, 13, 15, 15, 14
12, 12, 13, 12, 12, 11, 11, 12, 12, 12
11, 11, 11, 11, 11, 64, 58, 10, 10, 10
43, 42, 93, 83, 81, 28, 28, 56, 50, 48
22, 21, 36, 32, 30, 16, 15, 28, 27, 23
13, 13, 20, 18, 17, 12, 12, 15, 14, 14
11, 11, 12, 12, 11, 10, 12, 12, 12, 11
11, 11, 12, 10, 10, 10
A summary table works better
than raw data.
[Table: LOS interval, frequency and relative frequency]
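A frequency table of this kind can be built by grouping the raw LOS values into intervals. The sketch below uses the LOS values listed on the previous slide; the interval width of 10 is an assumption, since the intervals actually used in the lecture's table are not shown here.

```python
from collections import Counter

# Hospital length-of-stay values as listed on the slide.
los = [
    81, 63, 98, 86, 83, 44, 43, 58, 55, 50,
    29, 28, 42, 36, 32, 23, 21, 28, 27, 26,
    16, 16, 20, 19, 27, 13, 13, 15, 15, 14,
    12, 12, 13, 12, 12, 11, 11, 12, 12, 12,
    11, 11, 11, 11, 11, 64, 58, 10, 10, 10,
    43, 42, 93, 83, 81, 28, 28, 56, 50, 48,
    22, 21, 36, 32, 30, 16, 15, 28, 27, 23,
    13, 13, 20, 18, 17, 12, 12, 15, 14, 14,
    11, 11, 12, 12, 11, 10, 12, 12, 12, 11,
    11, 11, 12, 10, 10, 10,
]

def frequency_table(values, width=10):
    """Group integer values into intervals [start, start + width - 1]."""
    counts = Counter((v // width) * width for v in values)
    n = len(values)
    rows = []
    for start in range(min(counts), max(counts) + width, width):
        freq = counts.get(start, 0)
        rows.append((start, start + width - 1, freq, freq / n))
    return rows

table = frequency_table(los)
print("Interval  Frequency  Relative frequency")
for start, end, freq, rel in table:
    print(f"{start:>3}-{end:<3}   {freq:>6}     {rel:.3f}")
```

The frequencies add up to the number of observations and the relative frequencies add up to 1, which is a quick check that no value was dropped by the grouping.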
Graphic Methods
Bar Graph
A bar graph is simply a bar chart of data that have been
classified into a frequency distribution. The attractive feature
of a bar graph is that it allows us to quickly see where
most of the observations are concentrated.
[Figure: bar graph of frequency by LOS interval]
Graphic Methods
Histogram
A histogram provides a distribution plot in which the bars are not
necessarily of the same width. The area of each bar is
proportional to the percentage of data points falling within
the bar's interval, so the height of a bar represents the density of the data.
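To make the area interpretation concrete, the sketch below computes bar heights for intervals of unequal width so that each bar's area equals the relative frequency of its interval. The data are taken from the percentile example in these slides; the interval edges are arbitrary and chosen only for illustration.

```python
def histogram_density(values, edges):
    """Return (left, right, height) per bar, with height = relative frequency / width."""
    n = len(values)
    bars = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if left <= v < right)   # interval is [left, right)
        relative_frequency = count / n
        bars.append((left, right, relative_frequency / (right - left)))
    return bars

values = [2, 4, 7, 8, 12, 14, 16, 17, 19, 20]
edges = [0, 5, 10, 20, 25]          # unequal widths (assumed for illustration)

bars = histogram_density(values, edges)
for left, right, height in bars:
    print(f"[{left}, {right}): height = {height:.3f}, area = {height * (right - left):.2f}")

total_area = sum(height * (right - left) for left, right, height in bars)
print(round(total_area, 6))         # 1.0
```

The wide interval [10, 20) holds half of the data but is not twice as tall as its neighbours, because its area, not its height, carries the proportion.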
Graphic Methods
Box Plot
The box plot is a summary plot based on the median and the
interquartile range (IQR); the box spans the IQR and contains 50% of the
values. Whiskers extend from the box to the highest and
lowest values, excluding outliers. A line across the box
indicates the median.
$IQR = Q_3 - Q_1$
$MIN = Q_1 - 1.5 \times IQR$, $MAX = Q_3 + 1.5 \times IQR$
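The quantities behind a box plot can be computed directly. The sketch below uses the percentile rule from these slides for the quartiles; plotting libraries may use a different quartile rule. The data are those of the percentile example with one extra value, 40, added here as an assumed outlier for illustration.

```python
import math

def percentile(sorted_xs, p):
    """p-th percentile (0 < p < 100) of already sorted data, by the slide's rule."""
    n = len(sorted_xs)
    k = n * p / 100
    if k.is_integer():
        k = int(k)
        return (sorted_xs[k - 1] + sorted_xs[k]) / 2
    return sorted_xs[math.floor(k)]

def box_plot_summary(xs):
    """Quartiles, IQR, the 1.5 * IQR limits, whisker ends and outliers."""
    s = sorted(xs)
    q1, med, q3 = percentile(s, 25), percentile(s, 50), percentile(s, 75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr      # MIN on the slide
    upper_limit = q3 + 1.5 * iqr      # MAX on the slide
    inside = [x for x in s if lower_limit <= x <= upper_limit]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in s if x < lower_limit or x > upper_limit],
    }

data = [7, 12, 16, 2, 8, 4, 20, 14, 19, 17, 40]   # 40 is an added, assumed outlier
summary = box_plot_summary(data)
print(summary["q1"], summary["median"], summary["q3"], summary["iqr"])   # 7 14 19 12
print(summary["lower_limit"], summary["upper_limit"])                    # -11.0 37.0
print(summary["whiskers"], summary["outliers"])                          # (2, 20) [40]
```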