Download Overture

Overture (Motivating bioinformatics) UIUC Saurabh Sinha Special issue of journal Science, July 1, 2005. >What Is the Universe Made Of?>What is the Biological Basis of Consciousness?>Why Do Humans Have So Few Genes?>To What Extent Are Genetic Variation and Personal Health Linked?>Can the Laws of Physics Be Unified?>How Much Can Human Life Span Be Extended?>What Controls Organ Regeneration?>How Can a Skin Cell Become a Nerve Cell?>How Does a Single Somatic Cell Become a Whole Plant?>How Does Earth's Interior Work?>Are We Alone in the Universe?>How and Where Did Life on Earth Arise?>What Determines Species Diversity?>What Genetic Changes Made Us Uniquely Human?>How Are Memories Stored and Retrieved?>How Did Cooperative Behavior Evolve?>How Will Big Pictures Emerge from a Sea of Biological Data?>How Far Can We Push Chemical SelfAssembly?>What Are the Limits of Conventional Computing?>Can We Selectively Shut Off Immune Responses?>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?>Is an Effective HIV Vaccine Feasible?>How Hot Will the Greenhouse World Be?>What Can Replace Cheap Oil -- and When?>Will Malthus Continue to Be Wrong? Where does bioinformatics come in ? “Why do humans have so few genes ?” A simple organism Environmental signal GENE Response (protein) A simple organism GENE1 GENE2 GENE3 A simple organism GENE1 GENE6 GENE2 GENE7 GENE3 GENE8 GENE4 GENE9 GENE5 GENE10 A complex organism GENE1 GENE6 GENE2 GENE7 GENE3 GENE8 GENE4 GENE9 GENE5 GENE10 Complex circuit of interactions Regulatory networks • Genes are switches, transcription factors are (one type of) input signals, proteins are outputs • Proteins (outputs) may be transcription factors and hence become signals for other genes (switches) • This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity) • Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information Decoding the regulatory network: Method 1 • Find patterns (“motifs”) in DNA sequence that occur more often than expected by chance – These are likely to be binding sites for transcription factors – Knowing these can tell us if a gene is regulated by a transcription factor (i.e., the “switch”) How to find motifs ? • One method called “Gibbs sampling” • A special kind of “Markov chain Monte Carlo” sampling technique popular in physics and computer science • We shall see such an approach (Lawrence et al. 1992) later in the course Decoding the regulatory network: Method 2 More on method 2 • Paper from Nature journal, Sep 2004. – We shall see this later in the course • Integrates multiple, heterogeneous sources of information Common problem in bio– Multiple species conservation informatics: Integration of • What is functional is probably conserved different types of information, • How to find “conserved” sequence ? (Next topic) in principled manner – “ChIP-on-chip” data • Which transcription factors bind which sequences ? A high throughput, low resolution measurement Sequence alignment • The staple of a bioinformatician’s diet • Are two sequences similar ? Are portions of two sequences similar ? Which portions are these ? • Several algorithms that use dynamic programming, a popular technique in computer science • Realistic versions of the problem are NPhard, approximation algorithms abound. • Can we handle sequences of length ~109 ? Sequence alignment • Blanchette et al. (2004) in the journal Genome Research – We shall see this paper later in the course • Efficient algorithm for genome-wide multiple sequence alignment • Uses the “divide and conquer” approach, as well as “dynamic programming” • Can help in reconstruction of “ancestral genomes” – Jurassic Park MMVI ? ! ? ! On counting genes • The original question was “Why do humans have so few genes?” • How do we know how many genes there are in the human genome ? (And where they are in the genome) • Experiments can be designed, but bioinformatics plays a major role Gene prediction • The task of predicting the locations of genes in a new genome (“annotation”) • Several gene prediction software exist • We shall see one such approach (Lukashin & Borodovsky 1998, in the journal Nucleic Acids Research) later in the course. • The more sophisticated ones use Hidden Markov models (HMM) and multiple species comparison “What controls organ regeneration ?” “How does a single somatic cell become a whole plant ?” Regeneration is mysterious, but do we understand the “generation” part of “regeneration” • Developmental biology • The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult • A paradox : All cells in the adult body have the same DNA, then how come different cells are different ? … and to this ? Drosophila (fruitfly) How does a single cell lead to this ? … Answer: Regulatory networks (Again !) • Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo • Hidden Markov models, a popular technique in signal processing, used to detect such regions – Rajewsky et al. (2002) in the journal “BMC Bioinformatics” • Multiple species comparison aids discovery – Evolutionary models to integrate multiple species information in the HMM framework – Sinha et al. (2004) in the journal “BMC Bioinformatics” “How did cooperative behavior evolve?” A related question • What is the genetic (molecular) basis of social behavior ? • Social behavior in honey bees • Young worker bees are nurses in the hive; older ones go out to forage • This behavioral maturation is determined by needs of colony – Shortage of foragers => some nurses will become foragers prematurely – Bees respond to social cues. – What is the genetic basis of this ? Bioinformatics of social behavior • UIUC team (Sinha, Ling, Whitfield, Zhai, Robinson) scanned the genome to understand this – Attend the BIO seminar to learn about this • Regulatory network of social behavior • Statistical tools, such as Hypergeometric test, partial correlation analysis, threshold-based classification, support vector machine classifier, etc. used for this project “How will big pictures emerge from a sea of biological data?” The sea • Genomes: 3 x 109 bp of human genome • Similar numbers for other genomes: mouse, rat, dog, chicken, chimp etc. • Microarray: snapshots of 1000s of genes’ activities at one time and condition. Thousands of microarrays. • ChIP-on-chip data: measurements of a transcription factor’s binding affinity for 1000s of genes (promoters). Big pictures • An example: Segal et al. (2004) in the journal Nature Genetics – We shall see this work later in the course • “A module map showing conditional activity of expression modules in cancer” (Segal et al.) • Integrates sequence, motif, and microarray data through statistical tests to predict involvement of particular genes in particular types of cancer The sea of biological data • Biological literature, capturing decades of painstaking experimental work on genetics and molecular biology • Can we glean useful information from this vast body of knowledge ? • Biological literature mining. – Natural language processing – Text Information Retrieval (statistical approaches) • Ling et al. (2005) in Pacific Symposium on Biocomputing. On “Gene summarization” – We shall see this paper later in the course Some other challenges • Protein structure prediction • Can we predict the 3-D structure of a protein from its amino acid sequence ? – Why ? – One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function – We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design !! • One approach: neural networks, a popular technique in comptuer science, applied to this problem (Jones, 1999 in the J. of Mol Bio.) Some other challenges • “Metagenomics” • Most studies to date are on genomes of one species • A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these ? • Bioinformatics is indispensable !! • New type of data, new types of algorithms Many more challenges • New types of data come due to technological breakthroughs in biology • High throughput data carries unprecedented amount of information • Too much noise • Bioinformatics removes the noise and reveals the truth Bioinformatics • Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.) • Is about a family of very different problems, all related to biology, all related to each other • How can computers help solve any of this family of problems ? Bioinformatics and You • You can learn the tools of bioinformatics • These tools owe their origin to computer science, information theory, probability theory, statistics, etc. • You can learn the language of biology, enough to understand what the problems are • You can apply the tools to these problems and contribute to science

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Overture