Previous Lecture: Exploring Data
Introduction to Biostatistics and Bioinformatics
This Lecture: Descriptive Statistics
Process of Statistical Analysis
Population -> Random Sample -> Describe -> Sample Statistics -> Make Inferences
Distributions
Normal
Skewed
Long tails
Complex
Randomly Sample from any Distribution
1. Generate a pair of random numbers within the range.
2. Assign them to x and y.
3. Keep x if the point (x,y) is within the distribution.
4. Repeat 1-3 until the desired sample size is obtained.
5. The values x obtained in this way will be distributed according to the original distribution.
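The sampling steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name rejection_sample, the example density bell, and the bounding box (x from -5 to 5, y from 0 to 1) are assumptions chosen for the example.

```python
import math
import random


def rejection_sample(density, x_min, x_max, y_max, size, rng=random):
    """Draw `size` values distributed according to `density` on [x_min, x_max].

    `density` does not need to be normalized, but it must never exceed y_max.
    """
    samples = []
    while len(samples) < size:
        # Steps 1-2: a random point (x, y) inside the bounding box.
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(0.0, y_max)
        # Step 3: keep x only if the point lies under the density curve.
        if y <= density(x):
            samples.append(x)
    # Step 4 is the loop itself: repeat until enough values are kept.
    return samples


def bell(x):
    """Hypothetical example density: an unnormalized standard normal curve."""
    return math.exp(-0.5 * x * x)


rng = random.Random(0)  # fixed seed so the example is repeatable
values = rejection_sample(bell, -5.0, 5.0, 1.0, 2000, rng)
```

The tighter the bounding box fits the distribution, the fewer points are rejected.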
Mean
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
[Figure: mean versus sample size (up to 100); panels: Normal, Skewed, Long tails, Complex]
Median, Quartiles and Percentiles
Sample: $x_1, x_2, \ldots, x_n$
Quartiles:
$Q_1 > x_i$ for 25% of the sample
$Q_2 > x_i$ for 50% of the sample (median)
$Q_3 > x_i$ for 75% of the sample
Inter Quartile Range: $IQR = Q_3 - Q_1$
Percentiles: $P_m > x_i$ for m% of the sample
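A short Python sketch of these quantities, using the standard library's statistics.quantiles. The helper names quartiles and percentile and the sample values are assumptions for illustration. Several interpolation conventions exist for quantiles; the "inclusive" method used here is only one of them, so other tools may return slightly different numbers for small samples.

```python
import statistics


def quartiles(sample):
    """Return (Q1, Q2, Q3); Q2 is the median."""
    q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return q1, q2, q3


def percentile(sample, m):
    """Return the m-th percentile for an integer m between 1 and 99."""
    cut_points = statistics.quantiles(sample, n=100, method="inclusive")
    return cut_points[m - 1]


sample = [7, 1, 9, 3, 5, 2, 8, 4, 6]  # hypothetical data
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1                          # inter quartile range
p90 = percentile(sample, 90)
```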
Median and Mean
[Figure: median (gray) and mean versus sample size (up to 100); panels: Normal, Skewed, Long tails, Complex]
Quartiles and Mean
[Figure: Q1 (gray), Q3 (purple), and mean versus sample size (up to 100); panels: Normal, Skewed, Long tails, Complex]
Central Limit Theorem
For many distributions, the sum of a large number of drawn values converges to a normal distribution if:
• The values are drawn independently;
• The values are all drawn from the same distribution; and
• The distribution has a finite mean and variance.
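A small simulation can make the theorem concrete. The sketch below sums independent draws from a skewed (exponential) distribution and checks that about 68% of the sums fall within one standard deviation of their mean, as expected for a normal distribution. The choice of distribution, the sample counts, and the function names are assumptions for illustration.

```python
import random
import statistics


def skewed_draw(rng):
    """One value from an exponential distribution (mean 1, variance 1)."""
    return rng.expovariate(1.0)


def sums_of_draws(draw, n_values, n_sums, rng):
    """Return n_sums sums, each over n_values independent draws."""
    return [sum(draw(rng) for _ in range(n_values)) for _ in range(n_sums)]


rng = random.Random(1)  # fixed seed so the example is repeatable
sums = sums_of_draws(skewed_draw, 50, 4000, rng)

mean_of_sums = statistics.fmean(sums)  # theory: 50 * 1 = 50
sd_of_sums = statistics.pstdev(sums)   # theory: sqrt(50 * 1), about 7.07

# Fraction of sums within one standard deviation of the mean;
# a normal distribution gives about 0.68.
within_one_sd = sum(
    abs(s - mean_of_sums) <= sd_of_sums for s in sums
) / len(sums)
```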
Variance
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
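The mean and variance formulas translate directly into Python. This is a minimal sketch with hypothetical sample values; it divides by n, as the formulas on this slide do. Many references divide by n - 1 when estimating the variance of a population from a sample (statistics.variance in the standard library does so).

```python
def mean(sample):
    """Arithmetic mean: the sum of the values divided by n."""
    return sum(sample) / len(sample)


def variance(sample):
    """Mean squared deviation from the mean (divides by n)."""
    mu = mean(sample)
    return sum((x - mu) ** 2 for x in sample) / len(sample)


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
mu = mean(sample)
sigma2 = variance(sample)
sigma = sigma2 ** 0.5  # standard deviation
```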
[Figure: variance versus sample size (up to 100); panels: Normal, Skewed, Long tails, Complex]
Inter Quartile Range and Standard Deviation
[Figure: IQR/1.349 (gray) and standard deviation versus sample size (up to 100); panels: Normal, Skewed, Long tails, Complex]
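The factor 1.349 is the inter quartile range of a normal distribution with standard deviation 1, so for normally distributed data IQR/1.349 estimates the standard deviation. The sketch below checks the factor and compares both estimates on simulated normal data; the function name sd_from_iqr and the simulated parameters (mean 10, standard deviation 2) are assumptions for illustration.

```python
import random
import statistics

# IQR of the standard normal distribution: distance between the
# 75th and 25th percentiles, about 1.349.
unit_normal = statistics.NormalDist()
normal_iqr = unit_normal.inv_cdf(0.75) - unit_normal.inv_cdf(0.25)


def sd_from_iqr(sample):
    """Estimate the standard deviation as IQR / 1.349."""
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return (q3 - q1) / 1.349


rng = random.Random(2)  # fixed seed so the example is repeatable
sample = [rng.gauss(10.0, 2.0) for _ in range(5000)]

sd_estimate = sd_from_iqr(sample)      # based on the quartiles
sd_direct = statistics.pstdev(sample)  # based on the variance formula
```

For skewed or long-tailed data the two numbers can differ noticeably, because extreme values affect the IQR much less than the variance.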
Uncertainty in Determining the Mean
[Figure: average for sample sizes n = 3, 10, 100, and 1000; panels: Normal, Skewed, Long tails, Complex]
Standard Error of the Mean
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
Standard Error of the Mean: $\text{s.e.m.} = \frac{\sigma}{\sqrt{n}}$
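A minimal Python sketch of the standard error of the mean, following the formulas on this slide (standard deviation computed with n in the denominator). The function name and the sample values are assumptions for illustration.

```python
import math


def standard_error_of_mean(sample):
    """Standard deviation of the sample divided by sqrt(n)."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in sample) / n)
    return sigma / math.sqrt(n)


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
sem = standard_error_of_mean(sample)  # sigma is 2 and n is 8 here
```

Because of the square root, halving the standard error requires four times as many measurements.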
Error bars
In 2012, error bars appeared in Nature Methods in about two-thirds
of the figure panels in which they could be expected (scatter and bar
plots). The type of error bars was nearly evenly split between s.d. and
s.e.m. bars (45% versus 49%, respectively). In 5% of cases the error
bar type was not specified in the legend. Only one figure used bars
based on the 95% CI.
None of the error bar types is intuitive. An alternative is to select a
value of CI% for which the bars touch at a desired P value (e.g., 83% CI
bars touch at P = 0.05).
M. Krzywinski & N. Altman, Error Bars, Nature Methods 10 (2013) 921
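The 83% figure can be reproduced with a short calculation. The sketch below is not taken from the cited article; it assumes two independent sample means with equal standard errors that are compared with a two-sided z-test (normal approximation). Under those assumptions the means differ by z * sqrt(2) standard errors at a given P value, and the bars touch when each half-width equals half that difference.

```python
import math
from statistics import NormalDist


def touching_ci_level(p_value):
    """CI level at which two error bars just touch at the given P value.

    Assumes equal standard errors, independent samples, and a
    two-sided z-test.
    """
    unit_normal = NormalDist()
    # Critical value of the test, e.g. about 1.96 for P = 0.05.
    z = unit_normal.inv_cdf(1.0 - p_value / 2.0)
    # Difference between the means is z * sqrt(2) standard errors;
    # each bar covers half of it.
    half_width = z * math.sqrt(2.0) / 2.0
    # Coverage of an interval of +/- half_width standard errors.
    return 2.0 * unit_normal.cdf(half_width) - 1.0


level = touching_ci_level(0.05)  # about 0.83
```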
Box Plot
M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119
Box Plots
[Figure: box plots for sample sizes n = 5, 10, and 100; panels: Normal, Skewed, Long tails, Complex]
Box Plots with All the Data Points
[Figure: box plots overlaid with the individual data points for sample sizes n = 5, 10, and 100; panels: Normal, Skewed, Long tails, Complex]
Box Plots, Scatter Plots and Bar Graphs
[Figure: Normal Distribution; error bars: standard deviation and standard error]
[Figure: Skewed Distribution; error bars: standard deviation and standard error]
[Figure: Distribution with Fat Tail; error bars: standard deviation and standard error]
Application: Analytical Measurements
[Figure: measured concentration versus theoretical concentration]
A Few Characteristics of Analytical Measurements
Accuracy: Closeness of agreement
between a test result and an accepted
reference value.
Precision: Closeness of agreement
between independent test results.
Robustness: Test precision given small, deliberate changes in test
conditions (preanalytic delays, variations in storage temperature).
Lower limit of detection: The lowest amount of analyte that is
statistically distinguishable from background or a negative control.
Limit of quantification: Lowest and highest concentrations of
analyte that can be quantitatively determined with suitable precision
and accuracy.
Linearity: The ability of the test to return values that are directly
proportional to the concentration of the analyte in the sample.
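Accuracy and precision as defined above can be put into numbers in a simple way: the difference between the mean result and the reference value for accuracy, and the standard deviation of repeated results for precision. The sketch below is one possible quantification under those assumptions; the function names, the measurements, and the reference value are hypothetical.

```python
import statistics


def bias(results, reference_value):
    """Accuracy: mean test result minus the accepted reference value."""
    return statistics.fmean(results) - reference_value


def spread(results):
    """Precision: standard deviation of independent test results."""
    return statistics.pstdev(results)


results = [9.8, 10.1, 10.0, 10.3, 9.8]  # hypothetical repeated measurements
reference = 10.5                        # hypothetical reference value

accuracy_bias = bias(results, reference)  # negative: results read low
precision_sd = spread(results)
```

A method can be precise without being accurate: here the results agree closely with each other but sit about 0.5 below the reference value.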
Measuring Blanks
Coefficient of Variation
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
Coefficient of Variation (CV): $CV = \frac{\sigma}{\mu}$
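A minimal Python sketch of the coefficient of variation, using the same conventions as the formulas on this slide (standard deviation with n in the denominator). The function name and the sample values are assumptions for illustration.

```python
import statistics


def coefficient_of_variation(sample):
    """Standard deviation divided by the mean (dimensionless)."""
    return statistics.pstdev(sample) / statistics.fmean(sample)


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
cv = coefficient_of_variation(sample)  # sigma is 2 and the mean is 5 here
cv_percent = 100.0 * cv
```

The CV is only meaningful when the mean is clearly different from zero, which is why it becomes unstable for measurements close to the blank.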
Lower Limit of Detection
The lowest amount of analyte that is statistically distinguishable
from background or a negative control.
Two methods to determine the lower limit of detection:
1. Take the lowest concentration of the analyte at which the CV is below a chosen threshold, for example 20%.
2. Determine the level of the blank by taking the 95th percentile of the blank measurements, then add a constant times the standard deviation of the measurements at the lowest concentration.
K. Linnet and M. Kondratovich, Partly Nonparametric Approach for Determining the Limit of Detection,
Clinical Chemistry 50 (2004) 732–740.
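Both approaches to the lower limit of detection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure from the cited paper: the function names, the data, the 20% threshold, and the default constant of 1.645 are all hypothetical choices, and the percentile uses the "inclusive" interpolation of statistics.quantiles.

```python
import statistics


def lod_from_cv(replicates_by_concentration, threshold=0.20):
    """Lowest concentration whose CV is below the threshold.

    Takes a dict mapping concentration to a list of replicate
    measurements. Returns None if no concentration qualifies.
    """
    passing = []
    for concentration, replicates in replicates_by_concentration.items():
        cv = statistics.pstdev(replicates) / statistics.fmean(replicates)
        if cv < threshold:
            passing.append(concentration)
    return min(passing) if passing else None


def lod_from_blanks(blanks, low_sample, constant=1.645):
    """95th percentile of the blanks plus constant * SD of a low sample.

    The default constant is an assumption for this sketch.
    """
    cut_points = statistics.quantiles(blanks, n=100, method="inclusive")
    blank_level = cut_points[94]  # 95th percentile
    return blank_level + constant * statistics.pstdev(low_sample)


# Hypothetical measurements.
replicates = {0.5: [0.2, 0.8], 1.0: [0.9, 1.1], 2.0: [1.9, 2.1]}
blanks = [0.01 * i for i in range(21)]
low_sample = [0.2, 0.4, 0.4, 0.4, 0.5, 0.5, 0.7, 0.9]

lod_cv = lod_from_cv(replicates)
lod_blank = lod_from_blanks(blanks, low_sample)
```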
Limit of Detection and Linearity
[Figure: measured concentration versus theoretical concentration, two panels]
Precision and Accuracy
[Figure: measured concentration versus theoretical concentration, two panels]
Descriptive Statistics - Summary
• Example distributions:
• Normal distribution
• Skewed distribution
• Distribution with long tails
• Complex distribution with several peaks
• Mean, median, quartiles, percentiles
• Variance, Standard deviation, Inter Quartile Range (IQR), error bars
• Box plots, bar graphs, and scatter plots
• Application: Analytical measurements:
• Accuracy and precision
• Limit of detection and quantitation
• Linearity
• Robustness
Descriptive Statistics – Recommended Reading
http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html
http://greenteapress.com/thinkstats/
Next Lecture: Data types and representations in Molecular Biology
FASTA
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCC
CGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGG
GCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCA
TTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAAC
ACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACA
TCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCC
TCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGT
CCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTA
TAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAA
CACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATG
ACCAGTGGAAAAACAATG
GFF3
##gff-version 3
#!gff-spec-version 1.20
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7425
NC_015867.2 RefSeq cDNA_match 66086 66146 . - . ID=aln0;Target=XM_008204328.1 1 61 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65959 66007 . - . ID=aln0;Target=XM_008204328.1 62 110 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65799 65825 . - . ID=aln0;Target=XM_008204328.1 111 137 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
FASTQ
@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCC
GTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC
+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGG
GGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII
@SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGGAAAAAAGATACA
ATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT
+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
#.,')2/@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHIHIII
II8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG
Next Tutorial: Python Programming
Saturday 9/13 at 3 PM in TRB 120