Biology 162: Computational Genetics
Fall 2004
Todd Vision
Assistant Professor
Department of Biology, UNC Chapel Hill

Bioinformatics vs computational genetics
• Bioinformatics: The application of computing technology to molecular biology
• Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics

Course emphasis
• Data analysis in molecular genetics
• We will not cover
  – Developments in IT hardware
  – Analysis of protein structure
  – Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e. systems biology)

Prerequisites
• Bio 50: Molecular Biology and Genetics
  – Gene/protein structure and expression
  – Principles of inheritance
• Comp Sci 14: Introduction to Programming
  – Algorithms and their design
  – Fundamental programming skills
• Stat 31: Introduction to Statistics
  – Probability and Distributions
  – Hypothesis testing and parameter estimations

Related courses at UNC
• Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio)
• Summer courses in
  – Computer Science
• Graduate courses in
  – Bioinformatics and Computational Biology
  – Biostatistics
  – School of Pharmacy

Readings
• Gibson and Muse, A Primer of Genome Science, Sinauer Associates.
  – Available in Student Bookstore
  – Primarily covers genomic technologies
  – Brief on computational/statistical aspects
• Supplemental papers
  – Handed out in class or posted on Blackboard
  – Includes
    • More detail on computational/statistical aspects
    • Papers which you will review for class assignments

https://blackboard.unc.edu

Computer labs / Problem sets
• Thursdays 3:30-4:30 in Wilson 132
• Assignments are due the following Tuesday
• Purpose:
  – Familiarity with genomic databases and tools
    • Functional and evolutionary sequence analysis
    • Gene expression analysis
    • Mapping of genomes and complex traits
  – Comfort with command-line tools and computing
  – Exercise of scientific reasoning and biological judgement
  – No programming required (but learn Perl anyway!)

Research paper
• Critical review of the computational challenges involved in assembly of the human genome
• Based on opposing articles from the main players in the drama
• Paper will be judged on
  – Understanding of content
  – Critical and synthetic reasoning
  – Clarity of scientific writing

Late policy
• Assignments are due at the beginning of class on the due date
• Late assignments receive half-credit
• Exceptions can be made but require more than 24 hours' notice

Group work
• You are encouraged to work together on most assignments (some exceptions)
• What you turn in should be your own
  – Show your work
  – Be able to defend your answers
• Know and love the UNC Honor Code
  – http://honor.unc.edu

Exams
• Two midterms
• Final exam will be cumulative
• May include material from labs/problem sets, readings and lectures
• Most questions will be similar to those on lab/problem sets
• You will receive a study guide in advance

Grading
• 10 Labs/problem sets - 50% (5% each)
• Review paper - 10%
• Midterms - 20% (10% each)
• Final exam - 20%
• Final grades
  – No curve, point divisions at discretion of instructor
  – Different divisions for undergraduate/graduate students

Computer lab server: Biolinux
• All necessary analysis software is installed
• Dell PowerEdge server
  – Red Hat Linux operating system
  – 2 Xeon processors
  – 2 GB RAM
  – 60 GB disk space
• Requires an ONYEN for login
• Uses AFS file space

Connecting to Biolinux
• biolinux.bio.unc.edu (IP 152.2.66.25)
• Windows
  – Zip archive contains necessary connection software
• MacOSX
  – X11 for graphical sessions
  – Fugu for secure ftp
• Linux/Solaris/etc.
  – Should work as is

https://onyen.unc.edu

http://cilantro.bio.unc.edu/biolinux

Cretaceous Park?
• In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil.
• DNA was extracted
  – Care was taken to prevent contamination
• Specific regions were amplified
  – 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB
  – How would you design primers for dinosaur DNA?
  – All yielded products in mammals, birds and reptiles
  – Only one cytB pair yielded a product from the fossil

Cretaceous Park?
• One cytB fragment amplified
• 9 sequences obtained from two bone samples
  – Variability was present within and between the two samples; no two sequences were identical
• Consensus sequences used to search for homologs
  – Genbank (215,000 sequences)
  – BLAST
    • Measured percent identity
• Closest matches were ~70% identical
  – Equidistant to mammals, birds, and reptiles

Cretaceous Park?
• One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians
• Other authors reanalyzed the data
  – Multiple alignment
  – Protein sequence scoring matrix
  – Phylogenetic analysis
• All concluded that the DNA was clearly mammalian, possibly human
• One group showed that similar sequences could be amplified from human nuclear DNA

Cretaceous Park?
• Three possibilities
  – Preparation of human nuclear DNA could have been contaminated by dinosaur DNA
  – Dinosaurs and humans might have hybridized during the Cretaceous
  – Dinosaur extracts were contaminated by human DNA
• Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs
• Lesson learned: naïve computational analysis can lead to very misguided conclusions!

Discussion question
• You are given the sequence of a new gene and asked to determine its function.
• How would you begin?
  – What ‘wet lab’ approaches are possible?
  – What ‘in silico’ approaches are possible?
  – What approaches might require both wet lab and in silico components?

Biological topics
• Sequence alignment and assembly
• Sequence homology searching
• Sequence evolution and phylogenetics
• Finding genes and other features
• Patterns of gene expression
• Genetic mapping
• Dissecting genetic diseases and quantitative traits

Computational topics
• Dynamic programming
• Regular expressions and suffix trees
• Markov chains
• Hidden Markov models and machine learning
• Techniques for clustering and classification
• Maximum likelihood and Bayesian statistics
• Graph traversal

Some informatics tools
• Genbank, Uniprot, and major sequence repositories
• InterPro and protein signature dBs
• Gene Ontology
• Model organism genome databases (SGD, FlyBase, Ensembl)
• A sampling of software programs
  – Chosen primarily for pedagogical utility

Genomics
• Genetics on lots of genes?
• Hypothesis-free science?
• Some technologies
• Enabled by
  – Robotics
  – Computers

Genome database examples
• Primary databases
  – Genbank/EMBL/DDBJ
• Secondary databases
  – Pfam (protein domains)
• Organism-specific
  – SGD (yeast genomics)
• Specialized dBs
  – OMIM (human genetic disorders)
• Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/

Growth of Genbank
[Figure]
http://www.expasy.org/cgi-bin/show_thumbnails.pl?2

First bacterial genome: 1995
• Haemophilus influenzae (TIGR)
  – 1.8 x 10^6 bp shotgun assembly
  – Required 9 months of computer time
• Now there are hundreds
  – 160 Bacterial
  – 19 Archaeal
  – 32 Eukaryotic
• Over a thousand projects ongoing
• And a bacterial genome takes only days to sequence and assemble

Tree of life
[Figure]

More protein families await
[Figure]

Other types of genomic data
• Spatiotemporal gene expression
• Alternative transcription
• Genetic knockout/overexpression phenotypes
• Genetic variability
  – Molecular polymorphism
• Phenotypic variation / disease
• Comparative data / molecular evolution
• Protein
  – Structure, including modifications
  – Interactions with other molecules
• Metabolic profiling, etc., etc.

Algorithmic/statistical innovations
• The most fundamental and heavily used application in the field is pairwise alignment
  – Smith-Waterman algorithm (1981)
    • Still too slow for general database search
  – BLAST (1990)
    • Made database search of 10^7 to 10^8 sequences feasible
    • Statistical ranking of each alignment
• Statistical methods in molecular evolution <25 yrs old
• Modern genetic mapping methods ~15 yrs old

Things to review
• Chemical differences among amino acids
• Prokaryotic and eukaryotic gene structure
• The central dogma
• Anatomy of a typical protein

Reading for Thursday
• Gibson and Muse, Ch. 1 Genome Projects, pp. 1-58.
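
Supplement: pairwise local alignment
The Smith-Waterman algorithm named under "Algorithmic/statistical innovations" is a dynamic programming method, and a small example may make the idea concrete before the course covers it. The sketch below is illustrative only and is not taken from the course materials: the function name smith_waterman, the scoring values (match = 2, mismatch = -1) and the linear gap penalty (gap = -1) are all assumptions chosen for simplicity. Real tools typically use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not course-specified.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below 0, which makes the alignment local
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling smith_waterman(query, subject) on two strings returns the top score and the two aligned substrings. The table has (len(a)+1) x (len(b)+1) cells, so the work grows with the product of the sequence lengths, which is one way to see why the slide calls this approach too slow for searching a whole database.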