Course Syllabus
Brandeis University
Division of Graduate Professional Studies
Rabb School of Continuing Studies

I. Course Information

1. Biological Data Mining and Modeling

2. 143RBIF-112-1DL

3. Distance Learning Course
Week 1 starts on Wednesday, Sep. 17, 2014. Each course week runs Wednesday through Tuesday. The last week ends November 25, 2014.

4. Instructor’s Contact Information
Madhu Natarajan, PhD
[email protected]
Please use email to arrange appointments.

5. Document Overview
This syllabus contains all relevant information about the course: its objectives and outcomes, the grading criteria, the texts and other materials of instruction, and the weekly topics, outcomes, assignments, and due dates. Consider this your roadmap for the course. Please read through the syllabus carefully and feel free to share any questions that you may have. Please print a copy of this syllabus for reference.

6. Course Description
The development of new bioinformatics tools typically involves some form of data modeling, prediction or optimization. This course introduces various modeling and prediction techniques including linear and nonlinear regression, principal component analysis, support vector machines, self-organizing maps, neural networks, set enrichment, Bayesian networks, and model-based analysis. This course is not intended to explore the intricacies of analysis methods and/or algorithm development; rather, it explores how to use different approaches to analyze biological data and extract some insight into biology.
The didactic part of this course is designed to introduce you to (a) various analysis techniques and methods, (b) tool kits for implementing these methods, and (c) some of the biological/experimental methods that provide the data for analysis using (a) and (b). Students will be introduced to examples of analysis from the scientific literature, and are actively encouraged to identify new examples/sources and bring these to the class for discussion.
Students are also expected to contribute to weekly discussions, especially around the pros and cons of methods, identifying when methods fail and how these translate into the real-life expectations of the practicing bioinformatician. It is important to realize that distance learning does not imply learning in isolation; communication is crucial to success in a DL course and provides opportunities for self-exploration, collaboration with peers and learning from your own “aha” moments when you learn by asking probing questions. I look forward to our discussions during these ten weeks.
Prerequisites: Probability & Statistics; Proficiency in R programming, RBIF 111.

7. Materials of Instruction

a. Required Texts
Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Eds. R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Springer 1st Edition, 2005. ISBN 0-387-25146-4.
Students from previous years have pointed to links where portions of this book are available online. I cannot vouch for these links, but I would encourage students to look online to see if the authors have shared any of this information.

b. Required Software
R version 3.1.1 (see http://www.r-project.org/ )

c. Optional Text(s) / Journals
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, Springer-Verlag, New York, 2009.
This book is highly recommended. It makes for very dense reading, so I am not using it as a textbook, but it is a very useful reference manual for this course and for all practicing bioinformaticians.
An Introduction to Statistical Learning, Gareth James, Trevor Hastie, Robert Tibshirani, Springer-Verlag, 2013.
This is by the same lead authors as the previous book, but the reading material is a lot less dense. There are more examples with R that allow you to look at the implementation of the algorithms.
Handbook of Parametric and Nonparametric Statistical Procedures, D. J. Sheskin, Chapman & Hall/CRC 3rd Edition, 2003. ISBN-10 1584884401; ISBN-13 9781584884408
Pattern Classification, R.O. Duda, P.E. Hart, and D.G. Stork, Wiley-Interscience 2nd Edition, 2000. ISBN 0-471-05669-3
A Handbook of Statistical Analyses Using R, B.S. Everitt and T. Hothorn, Chapman and Hall/CRC 1st Edition, 2006. ISBN 1-584-88539-4

d. Online Course Content
This is a Distance Learning (DL) course, which will be hosted at Brandeis’ LATTE site, available at http://latte.brandeis.edu. The site contains the course syllabus, weekly topic notes, assignments, and discussion forums through which we will communicate during this course.

8. Overall Course Objectives
This course is intended to provide students with an understanding of:
What methods are commonly used for data analysis
How analysis results are interpreted in the context of drug discovery and development
How specific software tools are applied to data modeling

9. Overall Course Outcomes
At the end of the course, students will:
1. Be able to find the appropriate method for common data analysis problems
2. Have a sense of where to look if these methods are insufficient
3. Be familiar with the application of commonly used software tools
4. Be able to compose a meaningful report of the analysis

10. Course Grading Criteria

a. Percentages earned per assignment:
Percent   Component
N/A       Course questionnaire
50%       Homework problem sets (5 weeks)
30%       Discussion and online class participation (10 weeks)
20%       Final project

b. Grading Criteria for Discussions/Online Participation (100 raw points total per week, translating to 3% of the total course grade per week)
Per GPS guidelines you must post on three different calendar days of the course week. Failure to do so will result in a 10 point deduction from your total participation points for the week. There will be two discussion topics posted each week.
The 100 points for discussion each week are divided into two 35-point discussion responses to instructor posts and one 30-point response to a peer post. Exceptional posts are those that (for example):
o Provide/include original analysis of the course material,
o Provide/include analysis of the same methods on novel data sets,
o Provide/include appended code that runs without errors,
o Provide/include extrapolation and analysis of where methods successfully worked and where they did not,
o Provide appropriate citation of references,
o Are well-written (grammar/spelling).
Responses to peer posts will be graded on the same criteria, with the additional requirements that they clearly identify the original author/message to which the post responds, and provide novel insight beyond a simple “I agree”. In layman’s terms, the response must clearly go beyond being the equivalent of a +1 or a “Like” post.
Any discussion disagreements on analysis, interpretation or results MUST be polite and constructive. This is a critical and absolute requirement.

c. Make-up policies
Any student who misses deadlines for homework submissions can make up grades by working on additional assignments. These can be in the form of extra credit associated with already assigned homework, or can be materials provided upon special request.

11. Academic Integrity
http://lts.brandeis.edu/courses/instruction/academic-integrity/index.html#Student
All students are expected to read and understand the guidelines posted on the Academic Honesty and Student Integrity website linked above. If any part of this is not clear, please contact your instructor immediately.

II. Course Information

Week 1 (Sep 17-23)
Introduction to biological data mining and modeling
On the differences between data mining and modeling
An introduction to regression
An introduction to model building
Understanding the predictive power of modeling
  o When models go wrong
Overview of the field of biological data mining - applications, challenges, future directions.
Homework 1 assigned

Week 2 (Sep 24-30)
Uncertainty in Biology – Causes, concerns, approaches to deal with uncertainty.
Introduction to data visualization
Introduction to normalization
Introduction to high throughput biology
High throughput technologies
What can we reliably measure and what can it tell us about the cell?
  a. Target-based compound screening,
  b. Cell-based screening,
  c. High content screens,
  d. Large scale RNAi screens.
Statistical analysis of screens, Z and Z’ factor, data visualization and integration.
Homework 1 due

Week 3 (Oct 1-7)
Unsupervised methods – Part I
Hierarchical clustering
Principal component analysis
Independent component analysis
Introduction to transcription data
RNA profiling technologies, experimental design, data normalization
Application of clustering and dimension reduction methods
Homework 2 assigned

Week 4 (Oct 8-14)
Unsupervised methods – Part II
Unsupervised: hierarchical clustering, principal component analysis, independent component analysis
Set enrichment methods
Meta-analysis of microarray data to build on identified patterns
Homework 2 due

Week 5 (Oct 15-21)
Supervised methods, model assessment and selection
Linear methods for regression and classification: linear discriminant analysis, logistic regression; Naïve Bayes classifier; nearest-neighbor method
Homework 3 assigned

Week 6 (Oct 22-28)
Supervised methods, model assessment and selection - Part II
Regression and classification trees, neural networks, support vector machines
Model assessment and selection: AIC, BIC, cross-validation
Homework 3 due
Final Project assigned

Week 7 (Oct 29-Nov 04)
Meta methods
Boosting trees, model averaging and bagging, random forest
Homework 4 assigned

Week 8 (Nov 05-11)
Integration and meta-analysis of high throughput datasets
Biological databases, set enrichment methods, text-mining
Proteomics
Review of proteomics technologies: 2D gels, mass spectrometry, protein arrays, 2-hybrid methods, post-translational modification detection
Analysis of protein networks, network properties.
Protein pathways and their interaction
Protein pathway compendia
Pathway comparison metrics and applications
Data integration examples: relevance networks, machine learning.
Homework 4 due

Week 9 (Nov 12-18)
Principles of biological networks.
Reconstruction of networks - Graphical models:
  a. Boolean networks,
  b. Co-regulation networks,
  c. Bayesian networks.
Dynamic network inference
Homework 5 assigned

Week 10 (Nov 19-25)
Mechanistic modeling of biological systems.
Principles of mechanistic modeling: mass balance, chemical reaction systems, flux balance analysis
Deterministic and probabilistic modeling approaches to specific common biological problems.
Homework 5 due
Final Project due

III. Course Policies and Procedures

I. Late Policies
Discussion responses will be accepted late with a 5 (raw) point deduction per day. Homework assignments will be accepted late with a 5 (raw) point deduction per day after the deadline. Substantive responses to discussion posts will not be accepted after the deadlines.
Any student who misses deadlines for homework submissions can make up grades by working on additional assignments. These can be in the form of extra credit associated with already assigned homework, or can be de novo materials provided upon special request. Students cannot make up for missed discussion responses.

II. Work Expectations
Expect to spend about 2-4 hours per week reading the course material and anywhere from 4-8 hours doing homework, responding to discussion posts, etc. Plan ahead to make sure your tasks are completed in a timely manner.
The final project will be an amalgamation of tasks accomplished throughout the course and will take an additional 4-24 hours of work. I will post weekly deadlines for expectations for the week. A cumulative list of all expectation deadlines will also be posted in Week 1.

III. Feedback
Homework and class participation grades will typically be posted within a week of completion of tasks.

IV. Confidentiality
In the course of the class, some of you may want to post examples of real data from your day jobs or other sources. Please remember that you must not share any information that is confidential, proprietary or in any way embargoed from public disclosure. Please refrain from discussion of your peers’ work or interactions with peers outside the classroom.
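
As a rough illustration of how the grading weights and participation points described under Course Grading Criteria could combine, the sketch below assumes that each homework set and the final project are scored on a 0-100 scale, that the five homework sets count equally, and that the course grade is the simple weighted sum. These are assumptions, as are all function and variable names; the weights, point caps and deductions are the ones stated in this syllabus, and the instructor's actual calculation may differ.

```python
# Weights (percent of course grade) as stated in the syllabus grading table.
HOMEWORK_WEIGHT = 50.0
FINAL_PROJECT_WEIGHT = 20.0
PARTICIPATION_PER_WEEK = 3.0  # each week's 100 raw points is worth 3%


def weekly_participation(instructor_post_1, instructor_post_2, peer_post, posting_days):
    """Raw participation points (0-100) for one course week."""
    # Two responses to instructor posts (35 points each), one peer response (30).
    raw = min(instructor_post_1, 35) + min(instructor_post_2, 35) + min(peer_post, 30)
    # Posting on fewer than three calendar days costs 10 raw points.
    if posting_days < 3:
        raw -= 10
    return max(raw, 0)


def late_adjusted(score, days_late):
    """Apply the 5 raw points per day late deduction, floored at zero."""
    return max(score - 5 * days_late, 0)


def course_grade(homework_scores, weekly_points, final_project):
    """Course grade in percent, ASSUMING equal homework weights and a weighted sum."""
    homework_mean = sum(homework_scores) / len(homework_scores)
    homework_part = HOMEWORK_WEIGHT * homework_mean / 100
    participation_part = sum(PARTICIPATION_PER_WEEK * p / 100 for p in weekly_points)
    final_part = FINAL_PROJECT_WEIGHT * final_project / 100
    return homework_part + participation_part + final_part


if __name__ == "__main__":
    # Hypothetical student: five homework sets, ten weeks of discussion, one project.
    weeks = [weekly_participation(30, 35, 25, posting_days=3) for _ in range(10)]
    print(round(course_grade([90, 80, 100, 85, 95], weeks, 80), 1))
```

For example, a week with full marks on all three posts but postings on only two calendar days would earn 90 of the 100 raw points under this reading of the policy.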