BISC-576
Practical Statistics and Bioinformatics
Instructors:
Ting Chen, RRI-408H, Ph: 213-740-2415; [email protected]
Remo Rohs, RRI-404C, Ph: 213-740-0552; [email protected]
Units: 2.
Description:
This course provides basic training and practical experience in statistics and bioinformatics.
Students will learn basic statistical and bioinformatics methods and apply them to state-of-the-art biological applications.
Goals:
• To develop basic analytical skills in statistics and bioinformatics.
• To gain familiarity and competency in statistics and bioinformatics software packages
applicable to molecular biology, genomics analysis, and structural bioinformatics and
their underlying principles.
Textbooks
The Practice of Statistics in the Life Sciences (second edition) by Brigitte Baldi and David S.
Moore (W.H. Freeman 2010).
Bioinformatics: Sequence and Genome Analysis (second edition) by David W. Mount (Cold
Spring Harbor Lab 2004).
Introduction to Proteins (first edition) by A. Kessel and N. Ben-Tal (Chapman & Hall/CRC Press
2011).
Course Contents: This course will cover three major areas of bioinformatics: statistics for
biological sequence analysis, computer algorithms for sequence alignment, and molecular structural
analysis. More specifically, it includes the following topics: discrete and continuous random
variables, parametric and nonparametric statistics, NCBI resources, pairwise sequence
alignment, multiple sequence alignment, BLAST searching, phylogenetic trees, the UCSC genome
browser, clustering, analysis of high-throughput sequencing data, and molecular structure
analysis and prediction.
Homework: Eight sets of homework will be assigned by the instructors. Students should hand
in each homework by the specified due date. Points will be subtracted for homework submitted
after the due date.
Grade: The course grade will be based upon 160 points, with 20 points for each of the eight
homework sets.
Tentative Course Schedule:
WK   Topic
1    Introduction to probability: discrete random variables and distributions (Chap 9 & 10, The Practice of Statistics in the Life Sciences)
2    Continuous random variables and normal distributions, confidence intervals (Chap 11 & 12, The Practice of Statistics in the Life Sciences)
3    The Central Limit Theorem (Chap 13, The Practice of Statistics in the Life Sciences)
4    Parametric hypothesis testing: t-test, F-test, chi-square test (Chap 17 & 21, The Practice of Statistics in the Life Sciences)
5    Pairwise sequence alignment (Chap 3, Bioinformatics: Sequence and Genome Analysis)
6    BLAST and Statistics (Chap 4, Bioinformatics: Sequence and Genome Analysis)
7    Multiple sequence alignment (Chap 5, Bioinformatics: Sequence and Genome Analysis)
8    Motif finding (Chap 5, Bioinformatics: Sequence and Genome Analysis)
9    Hierarchical Clustering (Chap 7, Bioinformatics: Sequence and Genome Analysis)
10   Analysis of High-throughput sequencing data (Lecture notes)
11   Secondary structure elements, structural alignment, and fold classification (Chap 1, Introduction to Proteins)
12   Homology modeling and molecular simulations (Chap 3, Introduction to Proteins)
13   Protein function annotation and prediction (Chap 2, Introduction to Proteins)
14   RNA folding and sequence-dependent DNA shape (Lecture notes)
15   Classification of protein-nucleic acid readout modes (Lecture notes)
Introduction to Probability: discrete random variables and distributions. We will introduce
the basic concept of probability in the context of DNA and protein sequences, along with the binomial
distribution and the Poisson distribution. (Chap 9 & 10, The Practice of Statistics in the Life
Sciences)
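As an illustration of how these two distributions arise for sequences, the sketch below compares the binomial probability of k events among n independent sites with its Poisson approximation. The per-site probability (0.001) and sequence length (1000) are hypothetical values chosen for the example, not course data.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical setting: 1000 sites, each altered independently with probability 0.001.
n, p = 1000, 0.001
for k in range(4):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, n * p), 4))
```

When n is large and p is small, the two columns agree closely, which is why the Poisson distribution is often used for counts of rare events along a sequence.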
Introduction to Probability: continuous random variables and distributions. We will
introduce the basic concept of probability for continuous random variables, and the normal
distribution. (Chap 11 & 12, The Practice of Statistics in the Life Sciences)
The Central Limit Theorem: We will introduce the central limit theorem, which is fundamental to data
analysis. (Chap 13, The Practice of Statistics in the Life Sciences)
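A small simulation can make the theorem concrete. The sketch below draws repeated samples from a uniform distribution on (0, 1) and reports the mean and standard deviation of the sample means next to the predicted value sqrt(1/(12n)). The choice of distribution, sample sizes, and seed are assumptions made for illustration.

```python
import random
import statistics

def sample_means(n, trials, rng):
    """Return `trials` sample means, each computed from n draws of Uniform(0, 1)."""
    return [statistics.fmean([rng.random() for _ in range(n)]) for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (1, 5, 30):
    means = sample_means(n, 2000, rng)
    predicted_sd = (1 / 12 / n) ** 0.5  # sd of Uniform(0, 1) is sqrt(1/12)
    print(n, round(statistics.fmean(means), 3),
          round(statistics.stdev(means), 3), round(predicted_sd, 3))
```

As n grows, the spread of the sample means shrinks in proportion to 1/sqrt(n), and a histogram of `means` would look increasingly normal.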
Parametric hypothesis testing: We will introduce the concept of hypothesis testing, the t-test, the
F-test, and the chi-square test. (Chap 17 & 21, The Practice of Statistics in the Life Sciences)
Pairwise sequence alignment: We will introduce the dynamic programming algorithm for pairwise
sequence alignment and its applications to DNA and protein sequence alignments. (Chap 3,
Bioinformatics: Sequence and Genome Analysis)
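For orientation, here is a minimal sketch of dynamic programming for global alignment (the Needleman-Wunsch recurrence), computing only the optimal score. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions for the example; the course may use other scoring schemes or substitution matrices.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))
```

Keeping only two rows is enough for the score; recovering the alignment itself requires storing the full matrix (or traceback pointers).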
BLAST and Statistics: We will introduce the basic hashing used in BLAST, and the statistics of
the BLAST scores. (Chap 4, Bioinformatics: Sequence and Genome Analysis)
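The word-lookup idea can be sketched as follows: index every length-k word of a database sequence, then look up each query word to find seed positions. This is a simplified illustration only; the function names are hypothetical, and real BLAST adds neighborhood words, hit extension, and score statistics on top of this step.

```python
from collections import defaultdict

def build_word_index(seq, k):
    """Map each length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def seed_hits(query, index, k):
    """Return (query_pos, db_pos) pairs where a query word exactly matches a database word."""
    return [(q, s)
            for q in range(len(query) - k + 1)
            for s in index.get(query[q:q + k], [])]

index = build_word_index("ACGTACGT", 3)
print(seed_hits("TACG", index, 3))
```

Each seed hit marks a diagonal on which an alignment could then be extended.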
Motif Finding: In this section, we will introduce the concept of DNA motifs for protein-DNA
binding sites and the representations of DNA motifs. We will discuss several algorithms for finding DNA
motifs: word-count statistics, the maximum likelihood method, and the Bayesian method.
(Chap 5, Bioinformatics: Sequence and Genome Analysis)
Multiple Sequence Alignments: We will introduce the neighbor-joining method and its
application for multiple sequence alignments. (Chap 5, Bioinformatics: Sequence and Genome
Analysis)
Hierarchical Clustering: Hierarchical clustering has many applications in biology. We will
introduce the basic algorithms and three basic clustering strategies: single-linkage,
average-linkage, and complete-linkage. (Chap 7, Bioinformatics: Sequence and Genome Analysis)
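The three linkage strategies differ only in how the distance between two clusters is defined. The sketch below shows a naive agglomerative procedure on one-dimensional points; the function names and the toy data are hypothetical, and the quadratic pair search is chosen for clarity rather than efficiency.

```python
from itertools import combinations

def linkage_distance(c1, c2, dist, method):
    """Distance between clusters c1 and c2 under the chosen linkage."""
    d = [dist(x, y) for x in c1 for y in c2]
    if method == "single":
        return min(d)           # closest pair of members
    if method == "complete":
        return max(d)           # farthest pair of members
    if method == "average":
        return sum(d) / len(d)  # mean over all member pairs
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, k, method="single", dist=lambda x, y: abs(x - y)):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]],
                                                   dist, method))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

print(agglomerate([1.0, 1.1, 5.0, 5.2, 9.0], 3, method="average"))
```

Recording the order and height of each merge, rather than stopping at k clusters, yields the dendrogram usually shown for expression data.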
Next Generation Sequence Analysis: The analysis of next generation sequencing data is
critical in many biological applications. We will introduce the basic algorithms for read-mapping,
and the challenges in the analysis. We will also discuss the identification of genome variants
and two basic statistical approaches: the likelihood ratio test and Bayesian methods. (Lecture
Notes)
Secondary structure elements, structural alignment, and fold classifications: This lecture
will introduce alpha-helices and beta-sheets and the Ramachandran plot as means of identifying
secondary structure elements of proteins. In addition, the basic principles for structural
alignment methods will be discussed and various proteins will be aligned. We will classify
protein folds according to their structural topology. (Chap 1, Introduction to Proteins)
Homology modeling and molecular simulations: Computational prediction methods are
applied if an experimentally solved structure is unavailable. We will compare knowledge-based
prediction methods such as homology modeling with physics-based molecular simulation
approaches, including molecular dynamics and Monte Carlo methods. (Chap 3, Introduction to
Proteins)
Protein function annotation and prediction: Revealing the unknown function of a protein is a
primary goal of structural bioinformatics analyses. We will demonstrate how the function of a protein
can be annotated or predicted based on structural homology, evolutionary conservation,
electrostatic potential, and other properties. (Chap 2, Introduction to Proteins)
RNA folding and sequence-dependent DNA shape: While RNA and DNA have very similar
chemical properties, they have very different biological functions. We will explain this difference
by analyzing fold characteristics of RNA in comparison with nuances in the double helix as a
function of its base sequence. (Lecture notes)
Classification of protein-nucleic acid readout modes: Non-specific binding in nucleosomes
deforms DNA, while many transcription factors recognize DNA without major deformations but with
high binding specificity. We will identify base readout (hydrogen bonding between protein side
chains and base pairs) and shape readout (recognition of sequence-dependent electrostatic
potential) as origins of binding specificity. (Lecture notes)
Statement for Students with Disabilities
Any student requesting academic accommodations based on a disability is required to register
with Disability Services and Programs (DSP) each semester. A letter of verification for approved
accommodations can be obtained from DSP. Please be sure the letter is delivered to me (or to
the TA) as early in the semester as possible. DSP is located in STU 301 and is open 8:30 a.m.–
5:00 p.m., Monday through Friday. The phone number for DSP is (213) 740-0776.
Statement on Academic Integrity
USC seeks to maintain an optimal learning environment. General principles of academic
honesty include the concept of respect for the intellectual property of others, the expectation
that individual work will be submitted unless otherwise allowed by an instructor, and the
obligations both to protect one’s own academic work from misuse by others as well as to avoid
using another’s work as one’s own. All students are expected to understand and abide by these
principles. Scampus, the Student Guidebook, contains the Student Conduct Code in Section
11.00, while the recommended sanctions are located in Appendix A. Students will be referred
to the Office of Student Judicial Affairs and Community Standards for further review, should
there be any suspicion of academic dishonesty. The Review process can be found at:
http://www.usc.edu/student-affairs/SJACS/.