Course Syllabus
Brandeis University
Division of Graduate Professional Studies
Rabb School of Continuing Studies
I. Course Information
1. Biological Data Mining and Modeling
2. 143RBIF-112-1DL
3. Distance Learning Course
Week 1 starts on Wednesday, Sep. 17, 2014. The course week runs
Wednesdays through Tuesdays. The last week ends November 25, 2014.
4. Instructor’s Contact Information
Madhu Natarajan, PhD
[email protected]
Please use email to arrange appointments.
5. Document Overview
This syllabus contains all relevant information about the course: its objectives and
outcomes, the grading criteria, the texts and other materials of instruction, and the weekly
topics, outcomes, assignments, and due dates.
Consider this your roadmap for the course. Please read through the syllabus carefully
and feel free to share any questions that you may have. Please print a copy of this
syllabus for reference.
6. Course Description
 The development of new bioinformatics tools typically involves some form of data
modeling, prediction or optimization. This course introduces various modeling and
prediction techniques including linear and nonlinear regression, principal component
analysis, support vector machines, self-organizing maps, neural networks, set
enrichment, Bayesian networks, and model-based analysis.
 This course is not intended to explore intricacies of analysis methods and/or algorithm
development but to explore how to use different approaches to analyze biological data
and extract some insight into biology.
 The didactic part of this course is designed to introduce you to (a) various analysis
techniques & methods, (b) tool kits for implementing these methods, and (c) some
biological/experimental methods that provide the data for analysis using (a) and
(b). Students will be introduced to examples of analysis from scientific literature, and are
actively encouraged to identify new examples/sources and bring these to the class for
discussion. Part of the expectation of the student is also to contribute to weekly
discussions, especially around the pros and cons of methods, identifying when methods
fail and how these translate into real life expectations of the practicing bioinformatician.
It is important to realize that distance learning does not imply learning in isolation;
communication is crucial to success in a DL course and provides opportunities for
self-exploration, collaboration with peers, and learning from your own A-Ha moments when
you learn by asking probing questions. I look forward to our discussions during these ten
weeks.
 Prerequisites: Probability & Statistics; Proficiency in R programming, RBIF 111.
7. Materials of Instruction
a. Required Texts

 Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Eds. R.
Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Springer 1st Edition, 2005.
ISBN 0-387-25146-4. Students from previous years have pointed to links where portions
of this book are available online. I cannot vouch for these links, but I would encourage
students to look online to see if the authors have shared any of this information.
b. Required Software
 R version 3.1.1 (see http://www.r-project.org/ )
c. Optional Text(s) / Journals

 The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie,
Robert Tibshirani, Jerome H. Friedman, Springer-Verlag, New York, 2009. This book is
highly recommended. It makes for very dense reading, so I am not using it as a
textbook, but it is a very useful reference manual for this course and for all practicing
bioinformaticians.
 An Introduction to Statistical Learning, Gareth James, Daniela Witten, Trevor Hastie,
Robert Tibshirani, Springer-Verlag, 2013. This shares lead authors with the previous book, but the
reading material is a lot less dense. There are more examples with R that allow you to
look at the implementation of the algorithms.
 Handbook of Parametric and Nonparametric Statistical Procedures, D. J. Sheskin,
Chapman & Hall/CRC 3rd Edition, 2003. ISBN-10 1584884401; ISBN-13 9781584884408
 Pattern Classification, R.O. Duda, P.E. Hart, and D.G. Stork, Wiley-Interscience 2nd
Edition, 2000. ISBN 0-471-05669-3
 A Handbook of Statistical Analyses Using R, B.S. Everitt and T. Hothorn, Chapman and
Hall/CRC 1st Edition, 2006. ISBN 1-584-88539-4
d. Online Course Content
 This is a Distance Learning (DL) course, which will be hosted at Brandeis’ LATTE site,
available at http://latte.brandeis.edu. The site contains the course syllabus, weekly topic
notes, assignments, and discussion forums through which we will communicate during
this course.
8. Overall Course Objectives
This course is intended to provide students with an understanding of:
 What methods are commonly used for data analysis
 How analysis results are interpreted in the context of drug discovery and development
 How specific software tools are applied to data modeling
9. Overall Course Outcomes
At the end of the course, students will be able to:
1. Find the appropriate method for common data analysis problems
2. Have a sense of where to look if these methods are insufficient
3. Be familiar with the application of commonly used software tools
4. Be able to compose a meaningful report of the analysis
10. Course Grading Criteria
a. Percentages earned per assignment:
Percent    Component
N/A        Course questionnaire
50%        Homework problem sets (5 weeks)
30%        Discussion and online class participation (10 weeks)
20%        Final project
b. Grading Criteria for Discussions/Online Participation (100 raw points total per week,
translating to 3% of total course per week)
 Per GPS guidelines you must post on three different calendar days of the course week.
Failure to do so will result in a 10 point deduction from your total participation points for
the week.
 There will be two discussion topics posted each week.
 The 100 points for discussion each week are divided into two 35 point discussion
responses to instructor posts, and one 30 point response to a peer post.
 Exceptional posts are those that (for example)
o Provide/include original analysis of the course material,
o Provide/include analysis of the same methods on novel data sets,
o Provide/include appended code that runs without errors,
o Provide/include extrapolation and analysis of where methods successfully worked
and where they did not,
o Provide appropriate citation of references,
o Are well-written (grammar/spelling).
 Responses to peer posts will be graded on the same criteria as above, with the additional
requirements that they clearly identify the original author/message to which the post
responds, and that they provide novel insight beyond a simple “I agree.”
 In layman’s terms, the response must clearly go beyond being the equivalent of a +1 or a
“Like” post.
 Any discussion disagreements on analysis, interpretation or results MUST be polite and
constructive. This is a critical and absolute requirement.
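As a rough illustration of the discussion scoring described above, the weekly arithmetic can be sketched as follows. This is only a sketch: the function name and its arguments are hypothetical, the language is chosen for illustration, and flooring the weekly score at zero is an assumption rather than a stated policy. The point caps (35, 35, 30), the 10 point deduction, and the 100 raw points to 3% conversion come from the criteria above.

```python
def weekly_discussion_score(instructor_1, instructor_2, peer, posting_days):
    """Return (raw_points, course_percent) for one week of discussion.

    Hypothetical helper: the caps and the 3% conversion follow the
    syllabus text; everything else is an illustrative assumption.
    """
    # Two responses to instructor posts (35 points each) and one
    # response to a peer post (30 points) make up the 100 raw points.
    raw = min(instructor_1, 35) + min(instructor_2, 35) + min(peer, 30)

    # Posting on fewer than three calendar days costs 10 raw points.
    if posting_days < 3:
        raw -= 10

    # Assumption: a weekly score cannot drop below zero.
    raw = max(raw, 0)

    # 100 raw points in a week translate to 3% of the total course grade.
    return raw, raw / 100 * 3


# A full-credit week with posts on three different days.
print(weekly_discussion_score(35, 35, 30, posting_days=3))  # (100, 3.0)
```

Over ten weeks, full credit each week adds up to the 30% allotted to discussion and online class participation.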
c. Make up policies.
 Any student who misses deadlines for homework submissions can make up grades by
working on additional assignments. These can be in the form of extra credit associated
with already assigned homework, or can be materials provided upon special request.
11. Academic Integrity
http://lts.brandeis.edu/courses/instruction/academic-integrity/index.html#Student
All students are expected to read and understand the guidelines on the Academic
Honesty and Student Integrity website linked above. If any part of this is not clear, please
contact your instructor immediately.
II. Weekly Information
Week 1 (Sep 17-23)
Introduction to biological data mining and modeling
 On the differences between data mining and modeling
 An introduction to regression
 An introduction to model building
 Understanding the predictive power of modeling
o When models go wrong
 Overview of the field of biological data mining - applications, challenges, future
directions.
Homework 1 assigned
Week 2 (Sep 24-30)
Uncertainty in Biology – Causes, concerns, approaches to deal with uncertainty.
 Introduction to data visualization
 Introduction to normalization
Introduction to high throughput biology
 High throughput technologies
 What can we reliably measure and what can it tell us about the cell?
a. Target-based compound screening
b. Cell-based screening,
c. High content screens,
d. Large scale RNAi screens.
 Statistical analysis of screens, Z and Z’ factor, data visualization and integration.
Homework 1 due
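For orientation on the Z’ factor listed among this week's topics, here is a minimal sketch of the commonly used definition, Z’ = 1 - 3(sd_pos + sd_neg) / |mean_pos - mean_neg|, computed from positive and negative control wells. The function name and the control values are hypothetical and chosen only for illustration; course materials may use a different toolkit or notation.

```python
from statistics import mean, stdev


def z_prime_factor(positive_controls, negative_controls):
    """Z' factor of an assay from its positive and negative control readings."""
    # Separation between the two control means (the dynamic range).
    window = abs(mean(positive_controls) - mean(negative_controls))
    if window == 0:
        raise ValueError("control means are identical; Z' is undefined")

    # Three standard deviations on each side approximate the spread
    # of each control population (sample standard deviation is used).
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    return 1 - spread / window


# Made-up plate readings, for illustration only.
positive = [100, 102, 98, 101, 99]
negative = [10, 12, 8, 11, 9]
print(round(z_prime_factor(positive, negative), 3))
```

Values close to 1 indicate well-separated controls; a value above about 0.5 is conventionally read as a sign of a robust assay.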
Week 3 (Oct 1-7)
Unsupervised methods – Part I
 Hierarchical clustering
 Principal component analysis
 Independent component analysis
Introduction to transcription data
 RNA profiling technologies, experimental design,
 Data normalization,
 Application of clustering and dimension reduction methods
Homework 2 assigned
Week 4 (Oct 8-14)
Unsupervised methods – Part II
 Unsupervised: hierarchical clustering, principal component analysis, independent
component analysis
 Set enrichment methods,
 Meta analysis of microarray data to build on identified patterns
Homework 2 due
Week 5 (Oct 15-21)
Supervised methods, model assessment and selection
 Linear methods for regression and classification:
 Linear discriminant analysis,
 Logistic regression;
 Naïve Bayes classifier;
 Nearest-neighbor method
Homework 3 assigned
Week 6 (Oct 22-28)
Supervised methods, model assessment and selection - Part II
 Regression and classification trees,
 Neural networks,
 Support vector machines
 Model assessment and selection: AIC, BIC, cross-validation;
Homework 3 due
Final Project assigned
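As a pointer for the model assessment topics above, the two information criteria can be sketched from their standard definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of fitted parameters, n the number of observations, and ln(L) the maximized log-likelihood. The function names and the two example models are hypothetical, not taken from the course material.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; penalizes parameters more as n grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Two made-up models fitted to the same 50 observations.
models = {
    "small": {"log_likelihood": -100.0, "n_params": 3},
    "large": {"log_likelihood": -98.5, "n_params": 6},
}
for name, m in models.items():
    print(name,
          round(aic(m["log_likelihood"], m["n_params"]), 2),
          round(bic(m["log_likelihood"], m["n_params"], n_obs=50), 2))
```

Both criteria trade goodness of fit against model size; cross-validation addresses the same question empirically by scoring predictions on held-out data.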
Week 7 (Oct 29-Nov 04)
Meta methods
 Boosting trees,
 Model averaging and bagging,
 Random forest
Homework 4 assigned
Week 8 (Nov 05-11)
Integration and meta-analysis of high throughput datasets
 Biological databases, set enrichment methods, text-mining
Proteomics
 Review of proteomics technologies: 2D gels, mass spectrometry, protein arrays,
2-hybrid methods, post-translational modification detection,
 Analysis of protein networks, network properties.
Protein pathways and their interaction
 Protein pathway compendia
 Pathway comparison metrics and applications
 Data integration examples: relevance networks, machine learning.
Homework 4 due
Week 9 (Nov 12-18)
Principles of biological networks.
 Reconstruction of networks - Graphical models:
a. Boolean networks,
b. Co-regulation networks,
c. Bayesian networks.
 Dynamic network inference
Homework 5 assigned
Week 10 (Nov 19-25)
Mechanistic modeling of biological systems.

 Principles of mechanistic modeling: mass balance, chemical reaction systems,
flux balance analysis
 Deterministic and probabilistic modeling approaches to specific common
biological problems.
Homework 5 due
Final Project due
III. Course Policies and Procedures
I. Late Policies
 Discussion responses will be accepted late with a 5 (raw) point deduction per day.
 Homework assignments will be accepted late with a 5 (raw) point deduction per
day after the deadline.
 Substantive responses to discussion posts will not be accepted after the
deadlines.
 Any student who misses deadlines for homework submissions can make up
grades by working on additional assignments. These can be in the form of extra
credits associated with already assigned homework, or can be de novo materials
provided upon special request. Students cannot make up for missed discussion
responses.
II. Work Expectations
 Expect to spend about 2-4 hours per week reading the course material and
anywhere from 4-8 hours doing homework, responding to discussion posts, etc.
 Plan ahead to make sure your tasks are completed in a timely manner.
 The final project will be an amalgamation of tasks accomplished throughout the
course and will take an additional 4-24 hours of work.
 Each week I will post that week's expectations and deadlines.
 A cumulative list of all expectation deadlines will also be posted in Week 1.
III. Feedback
 Homework and class participation grades will typically be posted within a week of
completion of tasks.
IV. Confidentiality
 In the course of the class, some of you may want to post examples of real data
from your day jobs or other sources. Please remember that you must not share
any information that is confidential, proprietary or in any way embargoed from
public disclosure.
 Please refrain from discussing your peers’ work or your interactions with peers
outside the classroom.