Biomedical Data Science

Sawsan Khuri, Ph.D.
Director of Engagement, Center for Computational Science
Assistant Research Professor, Dept of Computer Science

Stefan Wuchty, Ph.D.
Associate Professor, Dept of Computer Science

Biology
• An information science
  – what, when, where
• Diverse data types
  – molecular, cellular, system, individual, population
  – local, regional, global
• Cross disciplinary
  – Great explorers
  – Current explorers

Got Data, Want Info
1 – Creation and Curation of Databases
    what goes into it, how should it be searched
2 – Data Analysis
    statistics, inference, prediction
    what is your question
3 – Tool Development
    bought / freely available / in-house

What Data
Biodiversity – whole populations and species
    species diversity and distribution, conservation, systematics, ecology
Molecular – molecular sequences and structures
    DNA: genes, genomes, regulation
    Protein: sequence, structure, function
Medical (Health informatics) – patient data
    Name, age, gender, ethnicity
    Disease symptoms, treatment, progress

Ancient history…
Protein and nucleic acid sequence
    Chromatographic and labeling methods began in the 1960s …
Protein structure
    X-ray crystallography methods improved in the 1960s
    Protein structures began to be resolved
DNA sequencing
    Gene cloning came in, and manual sequencing methods began in the late 1970s
    Mid-to-late 1980s: Polymerase Chain Reaction and its automation
    1985 saw the first proposals for the Human Genome Project (formally launched in 1990)
    Soon we had automation of DNA sequencing
… fast forward

Humanity Communicates
Telecommunications went from Morse code in 1837
to phones about 40 years later.
It took another 100 years for computers to come in.
1971  Birth of email
1973  Internet becomes international (UCL)
1976  UNIX becomes routine in the scientific community
1980  Cellphones, fax machines
The WWW was released by CERN in 1991, was immediately put to good use by the academic community, and enabled the big data world as we now know it…

Number Crunching – Part I
Abacus, punch cards, WWII Colossus, WITCH, mainframe, supercomputer

Number Crunching – Part IIa
Mainframe, supercomputer, PC / Unix

Number Crunching – Part IIb
Mainframe, supercomputers, PC / Unix, clusters

Number Crunching – Part III
Mainframe, supercomputers, PC / Unix, clusters, WWW, parallel systems

Number Crunching – Part IV
Mainframe, supercomputers, PC / Unix, clusters, WWW, parallel systems, grid computing & citizen science
http://www.galaxyzoo.org
http://www.iau.org

Number Crunching – Part V
Mainframe, supercomputers, PC / Unix, clusters, WWW, parallel systems, grid computing & citizen science, cloud computing
Source: http://cyberpingui.free.fr/humour/evolution-white.jpg

Molecular Data
DNA: single genes, chromosomes, genomes
A C T G (N)
gaagtatcataaacactcatcatatatatcatccaaataattgcagaaagaaaaagaaaa
tggtgatgatgagaatcttcttcttcctattcctcttggcctttccggtcttcactgcaa
atgcatcagtgaatgacttctgcgtggccaacggccctggagcccgcgacaccccgtcag
gcttcgtgtgcaagaataccgccaaagtcacagccgccgacttcgtctactccggcctgg
caaaacccggcaacaccaccaacatcatcaacgccgccgttactccggcgttcgtgggtc
RNA: DNA is transcribed to mRNA (regulated by snoRNA and miRNA)
Protein: mRNA is translated to polypeptide chains, and these chains are folded into protein structures

Orders of Magnitude
Human average figures:
    Gene: 10–15 kb, with huge variability
    Chromosome: 50 × 10^6 to 250 × 10^6 nucleotides
    Genome: ~3 billion nucleotides
Pattern recognition:
    Genes within genomes
    Repeat regions
    Regulatory elements
© Sawsan Khuri

Levels of Complexity
A Gene: A gene is made up of exons, which (sometimes) code for protein, and introns, which (usually) do not.
Within "a gene", there are also
• UTRs
• alternative splicing leading to transcript variants
• alternative promoters
• genes within the introns of other genes
• regulatory elements everywhere and anywhere
and there are intergenic regions, centromeres, telomeres…

Data Deluge
Then it became silly to continue counting…

Human Sequence Data With Added Value
Human Genome Project
HapMap project
SNP consortium
Individual genomes
+ non-sequence data that is relevant
+ every single major lab in the world
+ every single medium lab
+ every single small lab
+ non-human data that is relevant

"Here's my sequence..." (New Yorker)

Data Science
It's about handling, manipulating, analysing, visualizing, and interpreting data.
So first you have to learn how to handle data files, and this is where we will start.

Data Analysis
• The process of manipulating data in order to extract useful information.
A good experiment can be ruined through bad analysis.
A good analysis can sometimes save a bad experiment.
Good tools are important.
Good people are crucial.

Algorithms
• An algorithm is a formula, a recipe
You need something done
    compare two sequences
Devise a method of doing it
    compare the bases one by one along both sequences
Create the algorithm
    For two sequences of length n and m, compare the base at position 1 of the first sequence with the base at position 1 of the second, and repeat for each following position. Record how many positions are the same and how many are different, add them up, and divide by a fudge factor q.
Implement
    by writing the code that will execute your algorithm

Biologist or Computer Scientist
• When an algorithm doesn't work, and you've checked that it isn't the programmer's fault, it could be
    – the method: it may need gaps, a sliding window, sequences of the same length …
    – the type of sequences that are being compared
    – the question you are asking, which may need to be modified
Who created the algorithm?
Did they have enough biological knowledge?
Did they understand how algorithms work?
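The comparison recipe from the Algorithms slide can be sketched in Python, the scripting language used in this course. This is only an illustration under stated assumptions: the function name `compare_sequences` is hypothetical, the comparison stops at the end of the shorter sequence, and the fudge factor q defaults to the number of positions compared, because the slide does not say what q should be.

```python
def compare_sequences(seq_a, seq_b, q=None):
    """Compare two sequences base by base, with no gaps and no sliding window.

    Returns (same, different, score), where score = same / q.
    Assumption: q defaults to the number of positions compared.
    """
    a = seq_a.upper()
    b = seq_b.upper()

    # Only positions present in both sequences can be compared.
    overlap = min(len(a), len(b))

    # zip() stops at the end of the shorter sequence.
    same = sum(1 for x, y in zip(a, b) if x == y)
    different = overlap - same

    if q is None:
        q = overlap

    # Avoid dividing by zero when there is nothing to compare.
    score = same / q if q else 0.0
    return same, different, score


if __name__ == "__main__":
    same, different, score = compare_sequences("ACTGACTG", "ACTGTCTGAA")
    print(same, different, score)
```

The two trailing bases of the longer sequence are silently ignored here. That is the kind of limitation the Biologist or Computer Scientist slide warns about: the code does what it was asked to do, and it is the method itself that may need gaps or a sliding window.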
Biomedical Algorithms
• May be applicable to different types of data
• Usually involve some type of approximation
    exhaustive approach vs heuristic approach, i.e. one more suited to the available resources of time and computational power
    Existing algorithms work "well enough", or "strangely enough, they work!"
• Current research focuses on making existing algorithms more efficient, scalable, and more (or just as) accurate

Tool Development
• Enabling others to use your algorithm
Too many good methods lie dormant in journal articles
    Some for good reason
    Others because no one has developed a tool that packages them into something a biologist can use
I want to click on a button and get the answer immediately.

Interpretation

Course Objectives
To provide students with the computational skills needed for analysis of genomic data sets.
1. Manipulate data files in Unix/Linux
2. Work in an HPC environment
3. Write scripts in Python that are relevant to genomic data analysis
4. Gain a deeper knowledge of biomedical data types and the commonly used bioinformatics algorithms
5. Apply the above skill set in a genomics data analysis project
6. Interpret and present the results of your project

In this course you will also
• Have fun
• Gain 3 credits
• Learn, Achieve, Evolve
… but you will not
• Create a new algorithm
• Develop a new software tool
• Leave disappointed

#gdbk, not required
Another #gdbk

What you have to do
• Come to class
• Be an engaged student
• Read assigned papers
• Submit assignments on time

First Homework
• Download the Emacs editor
• Play with it a little
• We will go through it together on Thursday
If you already use an editor, let us know which one. We may still ask you to learn this one, because the class has to be graded in a comparable manner.