Download slides - Indiana University Computer Science Department

Overview of I519/I617 &
Introduction to Bioinformatics
Yuzhen Ye ([email protected])
School of Informatics and Computing, IUB
Structure of I519
  Two classes and one lab each week
  Python & C (& R)
  Textbook: Understanding Bioinformatics
  Homework assignments (~5 in total)
  Grading:
–  midterm exam (30%) + final exam (25%) + assignments (30%) + class project (15%)
  Course webpage:
http://mendel.informatics.indiana.edu/~yye/lab/teaching/fall2011-I519.php
What’s bioinformatics
What’s Bioinformatics
  "Bioinformatics is the field of science in which biology,
computer science, and information technology merge into a
single discipline. There are three important sub-disciplines
within bioinformatics: the development of new algorithms and
statistics with which to assess relationships among members of
large data sets; the analysis and interpretation of various types
of data including nucleotide and amino acid sequences, protein
domains, and protein structures; and the development and
implementation of tools that enable efficient access and
management of different types of information.” (NCBI)
  "I do not think all biological computing is bioinformatics, e.g.
mathematical modelling is not bioinformatics, even when
connected with biology-related problems. In my opinion,
bioinformatics has to do with management and the
subsequent use of biological information, particular genetic
information.” (Durbin)
Bioinformatics vs Computational Biology
  Almost interchangeable
  Computational biology may be broader
–  Computational biology is an interdisciplinary field
that applies the techniques of computer science,
applied mathematics and statistics to address
biological problems (wikipedia)
–  Includes bioinformatics
Impacts of Bioinformatics
  On biological sciences (and medical sciences)
–  Large scale experimental techniques
–  Information growth
  On computational sciences
–  Biology has become a large source of new algorithmic and statistical problems!
Related Fields
  Proteomics/genomics/metagenomics/
comparative genomics/structural genomics
  Chemical informatics
  Health informatics/Biomedical informatics
  Complex systems
  Systems biology
  Biophysics
  Mathematical biology
–  tackles biological problems using methods that
need not be numerical and need not be
implemented in software or hardware
Bioinformatics Problems/Applications
Figure from “Bioinformatics for Dummies”
Biology primer
Biology Primer
Eggs
Cell divisions
Multicellular organisms
Figure 1-1 Molecular Biology of the Cell
Underlying the diversity of life is a striking unity: DNA is the universal genetic language; cells are the basic units of structure and function
Cells are the Basic Unit of Life
  Cell Theory
–  All organisms are made up of cells
–  The cell is the basic living unit of organization for all organisms
–  All cells come from pre-existing cells by division
–  Cells contain hereditary information, which is passed from cell to cell during cell division
–  All cells are basically the same in chemical composition
–  All energy flow (metabolism & biochemistry) of life occurs within cells
  Organisms can consist of a single cell or of multiple cells (multicellular organisms)
−  Most living organisms are single-celled (e.g., E. coli, yeast)
−  Multicellular organisms: e.g., a human has more than 10^13 cells. To put this number in perspective, the world population as of July 2008 was 6.684 billion (1 billion = 10^9)
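To get a feel for the cell count quoted above, a quick back-of-the-envelope calculation can compare it with the 2008 world population. This is only a sketch using the two rough figures from the slide; the variable names are made up for illustration.

```python
# Both figures are the rough estimates quoted on the slide.
cells_per_human = 10**13                 # "more than 10^13 cells"
world_population_2008 = 6.684 * 10**9    # 6.684 billion people (July 2008)

# How many times larger is the cell count than the world population?
ratio = cells_per_human / world_population_2008
print(f"One human body has about {ratio:,.0f} times as many cells "
      f"as there were people on Earth in 2008.")
```

In other words, a single body contains on the order of a thousand "world populations" of cells.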
Cell Structures
Animal cell structure
Prokaryotic cell structure
http://hyperphysics.phy-astr.gsu.edu/hbase/biology/imgbio/cellhlabel.gif
http://micro.magnet.fsu.edu/cells/procaryotes/images/procaryote.jpg
Scale Down to the Atomic Level
Cell
Figure 9-1 Molecular Biology of the Cell
Figure 9-2
The Central Dogma
The flow of genetic information in cells
is from DNA to RNA to protein. All
cells, from bacteria to humans,
express their genetic information in
this way—a principle so fundamental
that it is termed the central dogma of
molecular biology.
Diagram labels: DNA → (Transcription) → RNA → (Translation) → Protein; also marked: retrovirus, RNA virus
DNA and Replication
Figure 1-2 Molecular Biology of the Cell, Fifth Edition
From DNA (to RNA) to Protein
The Genetic Code
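The flow from DNA to RNA to protein can be illustrated with a short Python sketch. It assumes the standard genetic code and a coding-strand DNA sequence read in frame from its first base; the function names `transcribe` and `translate` are illustrative and not taken from the course materials.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in
# T, C, A, G order (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTAA")
print(mrna, "->", translate(mrna))
```

Real genes add complications this sketch ignores, such as introns, alternative start sites, and organism-specific variant codes.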
Genome
  Definition
–  The genome of an organism is its whole hereditary information, encoded in the form of DNA (or, for some viruses, RNA)
–  Chromosome: a structure composed of a long DNA molecule and associated proteins; humans have 46 chromosomes
  DNA sequences can be determined by various
sequencing techniques
  Sequence first. Ask questions later
–  Cell. 2002 Oct 4;111(1):13-6
Three (Super)Kingdoms
Characteristic                        Archaea    Bacteria   Eukaryotes
Predominantly multicellular           No         No         Yes
DNA structure                         circular   circular   linear
Cytoplasm is compartmentalized        No         No         Yes
Introns are present in most genes     No         No         Yes
Photosynthesis with chlorophyll       No         Yes        Yes
Histone proteins present in cell      Yes        No         Yes
Organisms at Pivotal Positions in the Tree of Life
Fly: 2000
Worm: 1998
E. coli: 1997
Cell. 2002 Oct 4;111(1):13-6
Figure 1. Concepts in Phylogeny as It Relates to Comparative Genomics
(A) Tree of select organisms (large font: whole-genome sequence obtained or slated for sequencing) and the higher t
represent, drawn to emphasize major innovations in our evolutionary history. Notice that there is something of an evol
Model Organisms
  A model organism is a species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.
  Types include genetic models (with short generation times, such as the fruit fly and nematode worm), experimental models, and genomic models (with a pivotal position in the tree of life)
Escherichia coli (E. coli)
  A common gut bacterium and the most widely used organism in molecular genetics
  Some strains of E. coli are capable of causing disease under certain conditions
  Different strains of E. coli have been extensively studied
  The whole genomes of several E. coli strains have been sequenced (e.g., K-12, O157:H7, HS)
The Genome of E. coli K-12
Circular DNA: a single,
closed loop
Protein-coding genes
RNA genes
The whole genome was sequenced in 1997
Total 4,639,221 bp.
Figure 1-29 Molecular Biology of the Cell, Fifth Edition (© Garland Science 2008)
Caenorhabditis elegans
  C. elegans is a eukaryote (a nematode, or roundworm)
  Has a small genome (~97 megabases); the whole genome was sequenced in 1998
  C. elegans is easy to maintain in the laboratory (in petri dishes) and has a fast and convenient life cycle
–  The life span is 2-3 weeks
–  It is a tiny (1 mm in length), transparent organism, and the developmental pattern of all 959 of its somatic cells has been traced
•  somatic cell: any cell of a plant or animal other than cells of the germ line (from Greek soma, body)
Caenorhabditis elegans (Cont.)
  Discovery of the mechanism of
RNA interference in C. elegans
(1998)
–  Andrew Fire and Craig C. Mello shared the
Nobel Prize in Physiology or Medicine in
2006
–  Silencing was triggered efficiently by injected
dsRNA, but weakly or not at all by sense or
antisense single-stranded RNAs
Drosophila melanogaster (fruit fly)
  It has been used as a model organism for over 100 years and is widely used to study genetics and developmental biology
–  Small, with a simple diet
–  Short life cycle: about two weeks
–  Has large polytene chromosomes, whose barcode patterns of light and dark bands allow genes to be mapped accurately
  It was chosen in 1990 as one of the model
organisms to be studied under the auspices of the
federally funded Human Genome Project
  Whole genome sequenced in 2000
  >10 Drosophila genomes have been sequenced
  FlyBase: http://flybase.org/
Species Classification
  Classification is the arrangement of organisms into orderly groups based on their similarities
  Also known as taxonomy
  Provides an accurate and uniform naming system
Linnaean System of Classification
  Carolus Linnaeus (the “father of taxonomy”) introduced the first widely accepted hierarchical scheme, which today consists of 7 categories (kingdom, phylum, class, order, family, genus, and species), not including domain
  Species is the most basic unit of biological classification (means “kind” in Latin)
–  Each species is different, and reproduces itself faithfully
–  Heredity is a central part of the definition of life
  The Linnaean system uses two Latin name categories, genus and species, to designate each type of organism
–  Example: Salmonella saintpaul (which caused a recent food-borne disease outbreak)
–  Capitalize the genus, but not the species; italicize both in print
Homo sapiens
Domain: Eukaryotes
Kingdom: Metazoa (many-celled animals)
Phylum: Chordata (characterized by a notochord, nerve cord, and gill slits) (subphylum: Vertebrata)
Class: Mammalia (warm-blooded vertebrates)
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Mnemonic for the ranks: King Philip Came Over For Gooseberry Soup
http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy
Gene/Protein Family
  A protein/gene family is a group of evolutionarily related proteins/genes
  Genes/proteins of the same family typically have similar functions (and, for proteins, similar structures) and share sequence similarity
  There are far more genes/proteins than the
number of families—which shows the advantage of
grouping genes/proteins into different families
Evolution of Genes
  New genes are generated from preexisting genes
–  Intragenic mutation (a gene is modified by changes in its DNA sequence, e.g. errors that occur during DNA replication)
–  Gene duplication: the two copies of a gene may then diverge in the course of evolution
–  Segment shuffling
–  Horizontal transfer
More on what’s bioinformatics
Analysis of Gene/Protein Families: Key Problems in Bioinformatics
  Homolog detection
  Alignment (the residue-level mapping among homologous genes/proteins)
  Application of the alignments
–  Detect the conserved residues – functional sites
–  Prediction of protein structures
–  Motif finding (cis-elements)
  Phylogeny
  Function annotation
None of these problems have been solved!
Is Protein A Related/Similar to Protein B?
  Sequence similarity (alignment!)
  Structure similarity (structural comparison)
  Co-expression (microarray data analysis)
  Any type of correlation (operon structure, etc.)
You will see this question again and again!
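As a small illustration of sequence similarity via alignment, the sketch below computes a global alignment score with the classic Needleman-Wunsch dynamic program. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen for readability; real protein comparisons typically use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    Only two rows of the dynamic-programming table are kept in memory,
    so this returns the score but not the alignment itself.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty string
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("GATTACA", "GATACA"))
```

A higher score suggests greater similarity under the chosen scoring scheme; whether a score indicates real relatedness is a statistical question this sketch does not address.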
Guilt by Association
Computational Abstractions: Biological Sequences as Strings
Diagram labels: DNA, RNA, Protein
A string in a four-letter alphabet
Phylotype
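Treating DNA as a string over a four-letter alphabet means ordinary string operations become biological operations. The following minimal sketch (function names are illustrative, and it assumes unambiguous bases A, C, G, T only) checks the alphabet, computes a reverse complement, and measures GC content.

```python
DNA_ALPHABET = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")  # A<->T, C<->G

def is_dna(seq):
    """True if seq is non-empty and uses only the letters A, C, G, T."""
    return bool(seq) and set(seq.upper()) <= DNA_ALPHABET

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Fraction of bases that are G or C (seq must be non-empty)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(is_dna("GATTACA"), reverse_complement("ATGC"), gc_content("ATGC"))
```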
Computational Abstractions: Networks (and Others) as Graphs
  Protein-protein interaction network
  Protein structures presented as graphs
  Gene functions presented as graphs (Gene
ontology)
  Metabolic pathways as graphs (directed)
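As a sketch of the graph abstraction, a protein-protein interaction network can be stored as an undirected adjacency structure and explored with breadth-first search. The protein names and interactions below are entirely hypothetical and serve only to show the data structure.

```python
from collections import deque

# Hypothetical interaction pairs; not real data.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]

def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def connected_component(graph, start):
    """All nodes reachable from start, found by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

graph = build_graph(interactions)
print(sorted(connected_component(graph, "P1")))
```

A directed graph, as used for metabolic pathways, would store each edge in one direction only.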
More than Implementation
  Find old/new biological problems
–  Remember: biology has become a large source of new algorithmic and statistical problems
  Formulate as a computational problem
–  Define inputs and outputs
–  (though many papers work on well-defined bioinformatics problems)
  Apply existing algorithms and/or tools to solve your problem
  Develop new ones if necessary
  Implement your algorithms in appropriate programming language(s)
Where Can I Get the Biological Data?
  Sequences
–  NCBI GenBank
–  Swiss-Prot
  Structures
–  PDB
  Genomes
–  NCBI, IMG, GOLD
–  Specialized genome resources
•  Ensembl: originally limited to selected eukaryotic genomes, but no longer; release 19 (July 2013) includes a total of 6440 genomes!
  Others
–  KEGG, NCBI SRA, etc.
Dealing with Databases
  Databases are the backbone of bioinformatics research
  Flat files were the first type of database, and are still used today
  Relational databases are good for searching purposes
  Databases can contain data and annotations of data
–  Primary and derived (secondary) data
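The contrast between flat files and searchable databases can be sketched in a few lines of Python. The example below parses FASTA-formatted text (a common flat-file format for sequences) and loads it into an in-memory SQLite table so it can be queried. The records, table name, and column names are made up for illustration.

```python
import sqlite3

# Two made-up records in FASTA format: a '>' header line, then sequence lines.
FASTA_TEXT = """>seq1 hypothetical example record
ATGGCC
TAA
>seq2 another made-up record
GATTACA
"""

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

records = parse_fasta(FASTA_TEXT)

# Load the parsed records into a relational table and search it with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (id TEXT PRIMARY KEY, description TEXT, residues TEXT)"
)
for header, seq in records:
    seq_id, _, description = header.partition(" ")
    conn.execute("INSERT INTO sequences VALUES (?, ?, ?)", (seq_id, description, seq))

long_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sequences WHERE length(residues) > 7 ORDER BY id"
    )
]
print(long_ids)
```

The flat file is easy to read and exchange, while the relational table makes questions such as "which sequences are longer than 7 residues?" a one-line query.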
Buzz Word: Big Data
  “Big data is new and “ginormous” and scary –
very, very scary. No, wait. Big data is just another
name for the same old data marketers have
always used, and it’s not all that big, and it’s
something we should be embracing, not fearing.
No, hold on. That’s not it, either. What I meant to
say is that big data is as powerful as a tsunami,
but it’s a deluge that can be controlled . . . in a
positive way, to provide business insights and
value. Yes, that’s right, isn’t it?”
  Ref: http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/
Biologists Join Big-data Club
  “Biologists are joining the big-data club. With the advent
of high-throughput genomics, life scientists are
starting to grapple with massive data sets, encountering
challenges with handling, processing and moving
information that were once the domain of astronomers
and high-energy physicists”
  “Much of the construction in big-data biology is virtual,
focused on cloud computing — in which data and
software are situated in huge, off-site centres that users
can access on demand, so that they do not need to
buy their own hardware and maintain it on site. ”
  Biology: The big challenges of big data, Nature 498,
255–260 (13 June 2013)
Big Data 2 Big Knowledge (BD2K)
  “I’m talking enormous quantities—think tera-,
peta-, and even exa-bytes. The challenge
presented by this revolution is the need to
develop and implement hardware and software
that can store, retrieve, and analyze this
mountain of complex data—and transform it into
knowledge that can improve our understanding
of human health and disease.”
  A post by Dr. Francis Collins (July 23, 2013; NIH
Director’s Blog)
Different ways of doing computing
  As a user, you have many choices
–  Download the tools to your local machine
–  Run the tool on a supercomputer
•  Yes, IU has several powerful supercomputers (the newest addition is Big Red II).
–  Use a web server
–  Use a cloud
•  Galaxy
–  An app on your smart phone?
•  See a survey at https://cs.wmich.edu/elise/courses/cs603-bio/SII-12/Presentation1-Jason.pdf
  Similarly, as a developer, you also have many choices
Readings
  Biology primer (available at the course website)
  Anything about Python and/or C (if you have no
programming experience at all)
  Biology: The big challenges of big data
  What’s in the textbook?
–  Chapter 1 (The Nucleic Acid World)
–  Chapter 2 (Protein Structure)
–  Chapter 3 (Dealing With Databases)