Introduction to Biostatistics and Bioinformatics
Exploring Data and Descriptive Statistics
Learning Objectives
Python matplotlib library to visualize data:
• Scatter plot
• Histogram
• Kernel density estimate
• Box plots
Descriptive statistics:
• Mean and median
• Standard deviation and inter quartile range
• Central limit theorem
An Example Data Set
0.022
-0.083
0.048
-0.010
-0.125
0.195
-0.071
-0.147
0.033
0.080
0.073
0.016
0.148
0.135
0.006
-0.089
0.165
-0.088
-0.137
0.094
Scatter Plot
[Figure: measurement plotted against order of measurement]
Histogram
[Figure: number of measurements against measurement for bin sizes 0.025, 0.05 and 0.1, shown next to the scatter plot]
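The effect of the bin size can also be checked numerically. The following is a minimal sketch using the 20 values of the example data set; the histogram range of [-0.2, 0.2] is an assumption, since the slides do not state it.

```python
import numpy as np

# The 20 measurements from the example data set.
data = np.array([0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071,
                 -0.147, 0.033, 0.080, 0.073, 0.016, 0.148, 0.135,
                 0.006, -0.089, 0.165, -0.088, -0.137, 0.094])

lo, hi = -0.2, 0.2  # assumed histogram range, not given on the slides

counts_by_bin_size = {}
for bin_size in [0.025, 0.05, 0.1]:
    # Number of equal-width bins that fit into the range.
    n_bins = int(round((hi - lo) / bin_size))
    counts, edges = np.histogram(data, bins=n_bins, range=(lo, hi))
    counts_by_bin_size[bin_size] = counts
    print(bin_size, counts)
```

Small bins leave many bins empty or with a single count, while large bins hide the shape of the distribution.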
Cumulative Distributions
[Figure: cumulative frequency against measurement, shown next to the scatter plot]
Kernel Density Estimate
[Figure: kernel density estimate against measurement, shown next to the scatter plot]
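A kernel density estimate can be thought of as a smooth curve obtained by placing a small bell-shaped kernel on every data point and averaging them. The sketch below illustrates this with Gaussian kernels in plain NumPy. The function name `gaussian_kde_manual`, the random test data and the bandwidth rule (Scott's rule of thumb) are assumptions made for illustration, not something taken from the slides.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of Gaussian kernels of width `bandwidth` centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = 0.1 * rng.normal(size=20)   # assumed test data: mean 0, sd 0.1

# Scott's rule of thumb for the bandwidth (an assumed choice).
bandwidth = samples.std(ddof=1) * len(samples) ** (-1.0 / 5.0)

grid = np.linspace(-1, 1, 1000)
density = gaussian_kde_manual(samples, grid, bandwidth)

# The estimate is a probability density, so it should integrate to about 1.
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth gives a spikier curve and a larger one a smoother curve, much like the bin size of a histogram.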
Original Distribution
[Figure: panels showing the scatter plot, the histogram (bin size = 0.05), the original distribution (frequency) and the kernel density estimate, each against measurement]
More Data
[Figure: scatter plot, histogram (bin size = 0.05), original distribution (frequency) and kernel density estimate for a larger number of measurements]
Exercise 1
(a) Draw 20 points from a normal distribution with mean=0 and standard
deviation=0.1.
import numpy as np
# Standard normal draws (mean 0, sd 1) scaled by 0.1 have sd 0.1
y=0.1*np.random.normal(size=20)
print(y)
[-0.09946073 -0.19612617 0.03442682 0.02622746
-0.28418124 -0.04245968 0.05922837 0.01199874
0.13454915 -0.07482707 -0.11688758 0.01714036
0.03280043 0.01356022 0.09128649 -0.18923468
0.14536047 -0.07764629 -0.0349553 0.04300367]
Exercise 1
(b) Make scatter plot of the 20 points.
import matplotlib.pyplot as plt
points=len(y)   # number of data points (20)
x=range(1,points+1)   # order of measurement
fig, (ax1) = plt.subplots(1,figsize=(6,6))
ax1.scatter(x,y,color='red',lw=0,s=40)
ax1.set_xlim([0,points+1])
ax1.set_ylim([-1,1])
fig.savefig('ibb2015_7_exercise1_scatter_points'+str(points)+'.png',dpi=300,bbox_inches='tight')
plt.close(fig)
Exercise 1
(c) Plot histograms.
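# 20, 40 and 80 bins over [-1,1] give bin sizes 0.1, 0.05 and 0.025
for nbins in [20,40,80]:
    fig, (ax1) = plt.subplots(1,figsize=(6,6))
    # density=True replaces normed=True, which newer matplotlib versions no longer accept
    ax1.hist(y,bins=nbins,histtype='step',color='black',
             range=[-1,1], lw=2, density=True)
    ax1.set_xlim([-1,1])
    fig.savefig('ibb2015_7_exercise1_bin'+str(nbins)+'_points'+str(points)+'.png',dpi=300,bbox_inches='tight')
    plt.close(fig)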
Exercise 1
(d) Plot cumulative distribution.
y_cumulative=np.linspace(0,1,points)
x_cumulative=np.sort(y)
fig, (ax1) = plt.subplots(1,figsize=(6,6))
ax1.plot(x_cumulative,y_cumulative,color='black', lw=2)
ax1.set_xlim([-1,1])
ax1.set_ylim([0,1])
fig.savefig('ibb2015_7_exercise1_cumulative_points'+
str(points)+'.png',dpi=300,bbox_inches='tight')
plt.close(fig)
Exercise 1
(e) Plot kernel density estimate.
import scipy.stats as stats
kde_points=1000
kde_x = np.linspace(-1,1,kde_points)
fig, (ax1) = plt.subplots(1,figsize=(6,6))
kde_y=stats.gaussian_kde(y)
ax1.plot(kde_x,kde_y(kde_x),color='black', lw=2)
ax1.set_xlim([-1,1])
fig.savefig('ibb2015_7_exercise1_kde_points'+str(points)
+'.png',dpi=300,bbox_inches='tight')
plt.close(fig)
Comparing Measurements
Comparing Measurements – Cumulative distributions
Systematic Shifts
Exercise 2
(a) Generate 5 data sets with 20 data points each from normal
distributions with means = 0, 0, 0.1, 0.5 and 0.3 and standard
deviation=0.1.
y=[]
for j in range(5):
    y.append(0.1*np.random.normal(size=20))
# Shift the last three data sets to means 0.1, 0.5 and 0.3
y[2]+=0.1
y[3]+=0.5
y[4]+=0.3
print(y)
Exercise 2
(b) Make scatter plots for the 5 data sets.
sixcolors=['#D4C6DF','#8968AC','#3D6570','#91732B',
           '#963725','#4D0132']
fig, (ax1) = plt.subplots(1,figsize=(6,6))
for j in range(5):
    # Spread the 20 points of data set j around x=j+1
    ax1.scatter(np.linspace(j+1-0.2,j+1+0.2,20),
                y[j],color=sixcolors[6-(j+1)], lw=0,
                alpha=1)
ax1.set_xlim([0,6])
ax1.set_ylim([-1,1])
fig.savefig('ibb2015_7_exercise2_scatter_sample'+
            str(20)+'.png',dpi=300,bbox_inches='tight')
plt.close(fig)
Correlation Between Two Variables
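The slides on this topic consist of figures only. As a hedged illustration of how the strength of a linear relationship between two variables can be quantified, the sketch below computes the Pearson correlation coefficient. The choice of Pearson correlation and the simulated data are assumptions and are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
noise = rng.normal(size=n)

# Hypothetical linear relationship with added noise.
y = 0.8 * x + 0.6 * noise

# Pearson correlation from its definition: covariance / (sd_x * sd_y).
r_manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

# The same quantity from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate none, although a non-linear relationship can still be present.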
Data Visualization
http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html
Process of Statistical Analysis
Population
Random Sample
Describe
Sample Statistics
Make Inferences
Distributions
[Figure: example distributions (Normal, Skewed, Long tails, Complex) with samples of size n=3, n=10 and n=100]
Mean
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
Mean - Sample Size
[Figure: mean (axis from -0.2 to 0.2) against sample size (0 to 100) for the normal distribution]
Mean – Sample Size
[Figure: mean against sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions]
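How the mean settles down as the sample grows can be reproduced with a running mean. This is a minimal sketch assuming normally distributed values with mean 0 and standard deviation 0.1; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
values = 0.1 * rng.normal(size=100)   # assumed: mean 0, standard deviation 0.1

sample_sizes = np.arange(1, len(values) + 1)
# Entry n-1 holds the mean of the first n values.
running_mean = np.cumsum(values) / sample_sizes

for n in [3, 10, 100]:
    print(n, running_mean[n - 1])
```

For small sample sizes the running mean fluctuates strongly; it approaches the true mean of 0 as more values are included.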
Mode, Maximum and Minimum
Sample: $x_1, x_2, \ldots, x_n$
Mode: the most common value
Maximum: $\max(x_1, x_2, \ldots, x_n)$
Minimum: $\min(x_1, x_2, \ldots, x_n)$
Median, Quartiles and Percentiles
Sample: $x_1, x_2, \ldots, x_n$
Quartiles:
$Q_1 \geq x_i$ for 25% of the sample
$Q_2 \geq x_i$ for 50% of the sample (median)
$Q_3 \geq x_i$ for 75% of the sample
Percentiles:
$P_m \geq x_i$ for m% of the sample
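As a small worked example, the quartiles and a percentile of the 20 values of the example data set can be computed with `np.percentile`. Note that NumPy interpolates linearly between neighbouring data points by default, which is one of several common conventions, so other tools may give slightly different quartiles.

```python
import numpy as np

# The 20 measurements from the example data set.
data = np.array([0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071,
                 -0.147, 0.033, 0.080, 0.073, 0.016, 0.148, 0.135,
                 0.006, -0.089, 0.165, -0.088, -0.137, 0.094])

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(data, [25, 50, 75])
p90 = np.percentile(data, 90)

print(q1, q2, q3, p90)
print(np.median(data))   # the median is the same as Q2
```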
Median and Mean – Sample Size
[Figure: mean and median (gray) against sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions]
Variance
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
Variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
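The mean and variance can be computed directly from their definitions and compared with NumPy's built-in functions. This sketch uses the 20 values of the example data set and the division by n shown above.

```python
import numpy as np

# The 20 measurements from the example data set.
data = np.array([0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071,
                 -0.147, 0.033, 0.080, 0.073, 0.016, 0.148, 0.135,
                 0.006, -0.089, 0.165, -0.088, -0.137, 0.094])

n = len(data)
mu = data.sum() / n                        # mean
variance = np.sum((data - mu) ** 2) / n    # variance with division by n
sigma = np.sqrt(variance)                  # standard deviation

# np.var and np.std divide by n by default (ddof=0);
# ddof=1 divides by n-1 instead, which gives a slightly larger value.
print(mu, variance, sigma)
print(np.var(data), np.var(data, ddof=1))
```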
Variance – Sample Size
[Figure: variance against sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions]
Inter Quartile Range (IQR)
Sample: $x_1, x_2, \ldots, x_n$
Quartiles:
$Q_1 \geq x_i$ for 25% of the sample
$Q_2 \geq x_i$ for 50% of the sample (median)
$Q_3 \geq x_i$ for 75% of the sample
Inter Quartile Range: $\mathrm{IQR} = Q_3 - Q_1$
Inter Quartile Range and Standard Deviation
[Figure: standard deviation and IQR/1.349 (gray) against sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions]
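The factor 1.349 comes from the normal distribution, where the inter quartile range is about 1.349 standard deviations wide. A minimal sketch to check this, assuming normally distributed values with standard deviation 0.1 (sample size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.1
values = sigma * rng.normal(size=10000)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# For normal data, IQR / 1.349 is an estimate of the standard deviation.
sigma_from_iqr = iqr / 1.349
print(values.std(), sigma_from_iqr)
```

For skewed or long-tailed data the two numbers generally differ, because the standard deviation is much more sensitive to extreme values than the IQR.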
Central Limit Theorem
The sum of a large number of values drawn from a
distribution converges to a normal distribution if:
• The values are drawn independently;
• The values are all drawn from the same distribution; and
• The distribution has a finite mean and variance.
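The theorem can be illustrated with a small simulation. The sketch below sums independent draws from a uniform distribution, which is clearly not normal but has a finite mean and variance, and checks that the standardized sums behave like a standard normal variable. The uniform distribution, the number of values per sum and the number of repeats are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_values = 50       # number of values added up in each sum
n_sums = 20000      # number of sums simulated

# Uniform on [0, 1): not normal, but with finite mean (1/2) and variance (1/12).
draws = rng.uniform(0.0, 1.0, size=(n_sums, n_values))
sums = draws.sum(axis=1)

# Standardize using the known mean and variance of a single draw.
z = (sums - n_values * 0.5) / np.sqrt(n_values / 12.0)

# For a standard normal, about 68.3% of values lie within one standard deviation.
within_one_sd = np.mean(np.abs(z) < 1.0)
print(z.mean(), z.std(), within_one_sd)
```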
Uncertainty in Determining the Mean
[Figure: distributions of the mean for the Normal, Skewed, Long tails and Complex distributions at sample sizes from n=3 to n=1000]
Standard Error of the Mean
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
Variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
Standard Error of the Mean: $\mathrm{s.e.m.} = \frac{\sigma}{\sqrt{n}}$
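The standard error of the mean can be computed from a single sample and compared with the spread of the means of many samples of the same size. This is a sketch under the assumption of normally distributed data with standard deviation 0.1; the sample size of 25 and the 5000 repeats are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
sample = 0.1 * rng.normal(size=n)

sigma = sample.std()          # standard deviation with division by n
sem = sigma / np.sqrt(n)      # standard error of the mean from one sample

# Empirical check: spread of the means of many independent samples of size n.
means = (0.1 * rng.normal(size=(5000, n))).mean(axis=1)
print(sem, means.std())
```

Both numbers should be close to 0.1/sqrt(25) = 0.02, with the single-sample estimate being the noisier of the two.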
Exercise 3
(a) Generate skewed data sets.
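sample_size=10
# Candidate points (x,y): x within the sampling range, y between 0 and 1
x_test=np.random.uniform(-1.0,1.0,size=30*sample_size)
y_test=np.random.uniform(0.0,1.0,size=30*sample_size)
# skew() is a helper function that is not shown on the slides
y_test2=skew(x_test,-0.1,0.2,10)
y_test2/=max(y_test2)   # scale so that the largest value is 1
# Keep x where the point (x,y) lies under the distribution curve
x_test2=x_test[y_test<y_test2]
x_sample=x_test2[:sample_size]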
1. Generate a pair of random numbers within the range.
2. Assign them to x and y.
3. Keep x if the point (x,y) is within the distribution.
4. Repeat 1-3 until the desired sample size is obtained.
5. The values x obtained in this way will be distributed according to the
original distribution.
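The steps above can be sketched as a loop that accepts or rejects one point at a time. The density `example_density` below is a hypothetical skewed shape chosen for illustration; it is not the `skew` helper used on the slide, whose definition is not shown. The function name `rejection_sample` and its parameters are likewise assumptions.

```python
import numpy as np

def example_density(x):
    """Hypothetical skewed shape on [-1, 1] with its peak (value 1) at x = -0.1."""
    # Narrow on the left of the peak, wide on the right: a right-skewed curve.
    return np.where(x < -0.1,
                    np.exp(-0.5 * ((x + 0.1) / 0.1) ** 2),
                    np.exp(-0.5 * ((x + 0.1) / 0.4) ** 2))

def rejection_sample(density, sample_size, rng, lo=-1.0, hi=1.0, peak=1.0):
    accepted = []
    while len(accepted) < sample_size:   # step 4: repeat until enough values
        x = rng.uniform(lo, hi)          # steps 1-2: random pair (x, y)
        y = rng.uniform(0.0, peak)
        if y < density(x):               # step 3: keep x if (x, y) is under the curve
            accepted.append(x)
    return np.array(accepted)

rng = np.random.default_rng(6)
x_sample = rejection_sample(example_density, 1000, rng)
print(x_sample.mean(), np.median(x_sample))
```

The slide version draws all candidate points at once with array operations, which is faster but may return fewer accepted values than requested if too many candidates are rejected.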
Exercise 3
(b) Calculate the mean of samples drawn from the skewed data set and the
standard error of the mean, and plot the distribution of averages.
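average=[]
for repeat in range(1000):
    …
    average.append(np.mean(x_sample))
# The standard deviation of the sample means estimates the standard error of the mean
sem=np.std(average)
fig, (ax1) = plt.subplots(1,figsize=(6,6))
ax1.set_title('Sample size = '+str(sample_size)+', SEM = '
              +str(sem))
# density=True replaces normed=True, which newer matplotlib versions no longer accept
ax1.hist(average,bins=100,histtype='step',color='red',range=
         [-0.5,0.5],density=True,lw=2)
ax1.set_xlim([-0.5,0.5])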
Box Plot
M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119
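The numbers that a box plot displays can be computed by hand. The sketch below assumes the common Tukey convention, in which the whiskers extend to the most extreme data points within 1.5 x IQR of the box (this is also matplotlib's default, `whis=1.5`); other conventions exist. The sample is simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
sample = 0.1 * rng.normal(size=100)   # assumed test data

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach to the most extreme data points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = sample[sample >= lower_fence].min()
whisker_high = sample[sample <= upper_fence].max()

# Points beyond the whiskers are drawn individually as outliers.
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, len(outliers))
```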
Box Plots
[Figure: box plots of samples of size n=5, n=10 and n=100 from the Normal, Skewed, Long tails and Complex distributions]
Box Plots with All the Data Points
[Figure: box plots with all data points overlaid for samples of size n=5, n=10 and n=100 from the Normal, Skewed, Long tails and Complex distributions]
Box Plots, Scatter Plots and Bar Graphs
[Figure: normal distribution shown as box plots, scatter plots and bar graphs, with error bars showing the standard deviation or the standard error]
Box Plots, Scatter Plots and Bar Graphs
[Figure: skewed distribution shown as box plots, scatter plots and bar graphs, with error bars showing the standard deviation or the standard error]
Exercise 4
# thiscolor is a plot color that is assumed to be defined earlier (not shown on the slide)
fig, (ax1) = plt.subplots(1,figsize=(6,6))
# Individual data points, spread around x=1
ax1.scatter(np.linspace(1-0.1, 1+0.1,sample_size),
            x_sample, facecolors='none',
            edgecolor=thiscolor, lw=1)
# Box plot of the same sample; sym='' hides the outlier markers
bp=ax1.boxplot(x_sample, notch=False, sym='')
plt.setp(bp['boxes'], color=thiscolor, lw=2)
plt.setp(bp['whiskers'], color=thiscolor, lw=2)
plt.setp(bp['medians'], color='black', lw=2)
plt.setp(bp['caps'], color=thiscolor, lw=2)
plt.setp(bp['fliers'], color=thiscolor, marker='o', lw=0)
fig.savefig(…)
Descriptive Statistics - Summary
• Example distributions:
• Normal distribution
• Skewed distribution
• Distribution with long tails
• Complex distribution with several peaks
• Mean, median, quartiles, percentiles
• Variance, Standard deviation, Inter Quartile Range (IQR), error bars
• Box plots, bar graphs, and scatter plots