The BIG Goal

“The greatest challenge, however, is analytical. … Deeper biological insight is likely to emerge from examining datasets with scores of samples.”
Eric Lander, “Array of hope”, Nat. Gen. volume 21 supplement, pp 3-4, 1999.

Bio-informatics: Provide methodologies for elucidating biological knowledge from biological data.

Central Paradigm of Bio-informatics

Genetic Information -> Molecular Structure -> Biochemical Function -> Symptoms

Computer Science Tools are Crucial

• New bio-technologies create huge amounts of data.
• It is impossible to analyze data by manual inspection.
• Novel mathematical, statistical, algorithmic and computational tools are necessary!

http://www.sanger.ac.uk/PostGenomics/S_pombe/presentations/EMBOCopenhagenWebsite.pdf

Automated Sequencing

http://cbms.st-and.ac.uk/academics/ryan/Teaching/SB&Bioinf/lecture1.htm

What is Bio-Informatics?

• A field of science in which Biology, Computer Science and Information Technology merge into a single discipline.
• Computers (& software tools) are used to collect, analyze and interpret biological information at the molecular level.
• Goal: To enable the discovery of new biological insights and create a global perspective for biologists.

Disciplines

• Development of new algorithms and statistical methods to assess relationships among members of large data sets.
• Analysis and interpretation of various types of data.
• Development and implementation of tools to efficiently access and manage different types of information.

Why Use Bio-Informatics?
An explosive growth in the amount of biological information necessitates the use of computers for cataloging and retrieval of data (> 3 billion bps, > 30,000 genes).

• The human genome project.
• Automated sequencing.
• GenBank has over 16 billion bases and is doubling every year!

New Types of Biological Data

• Micro arrays - gene expression.
• Multi-level maps: genetic, physical: sequence, annotation.
• Networks of protein-protein interactions.
• Cross-species relationships:
    • Homologous genes.
    • Chromosome organization.

http://www.the-scientist.com/yr2002/apr/research020415.html

Why Bio-Informatics? (cont.)

• A more global view of experimental design (from the “one scientist = one gene/protein/disease” paradigm to whole-organism consideration).
• Data mining - functional/structural information is important for studying the molecular basis of diseases, diagnostics, developing drugs (personal medicine), evolutionary patterns, etc.

Why Bio-Informatics? (cont.)

http://www.library.csi.cuny.edu/~davis/Bioinfo_326/lectures/lect14/lect_14.html

Future of Genomic Research

Principal milestones in data mining and genome analysis:

• Sanger method for sequencing, invented in 1977 (awarded the Nobel Prize in 1980).
• Polymerase chain reaction (PCR), invented in 1983 (awarded the Nobel Prize in 1993).

http://www.usgenomics.com/technology/index.shtml

The next step: Locate all the genes and understand their function. This will probably take another 15-20 years!

Disease Genes Discovered

The job of biologists is changing…

One can efficiently find information using databases and software on the web.

Question: How likely are you to use a free bio-informatics library of accessible software?

http://www.cryst.bbk.ac.uk/classlib/BBSRC_poster/potential.html

Molecular Biology Analysis Software Tools Freely Available on the Web.
- Highlights

Broad Classification of Biological Databases

http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pres/biodb.htm

NCBI ENTREZ - PubMed

http://www3.ncbi.nlm.nih.gov/Entrez/index.html

Post-genomic terms (Oct. 2002)

Term            Google search   PubMed hits
Genome          2.1 x 10^6      76,566
Proteome        89,300          1,701
Transcriptome   9,960           229
Gene function   1.2 x 10^6      6.5 x 10^5
Metabolome      1,170           29
Glycome         138             6

From: Computational Proteomics, Mark B Gerstein, Yale U.

Proteome

http://cbms.st-and.ac.uk/academics/ryan/Teaching/SB&Bioinf/lecture1.htm

Similarity / Analogy

Examples:
If it looks like an elephant and smells like an elephant, it’s an elephant.
If it walks like a duck and quacks like a duck, it’s a duck.

http://cbms.st-and.ac.uk/academics/ryan/Teaching/molbiol/Bioinf_files/v3_document.htm

Similarity Search in Databanks

Find sequences similar to a working draft. As databanks grow, homologies get harder to find, and quality is reduced.

Alignment tools: BLAST & FASTA (time-saving heuristic approximations).

Pairwise alignment:

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369

Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296

Multiple Sequence Alignment

Multiple alignment: find protein families and functional domains.

Structure - Function Relationships

[Figure labels: structure, sequence, function]

Protein Structure (domains)

Phylogeny

Evolution - a process in which small changes occur within species over time. These changes can be monitored today using molecular techniques.

The Tree of Life: A classical, basic science problem, since Darwin’s 1859 “Origin of Species”.

Tree of Life: Searching Protein Sequence Databases

How far back can we see?

• Mammalian radiation
• Invertebrates / vertebrates
• Plants / animals
• Prokaryotes / eukaryotes
• First self-replicating systems
• Formation of the solar system
• Origin of the universe?

The Human Genome Project (HGP)

• Write down all of human DNA on a single CD (“completed” 2001).
• Identify all genes, their location and function (far from completion).

Example for Gene Localization Bio-Tool (FISH).

FISH - Fluorescence In-Situ Hybridization.
• Fluorescently labeled probes hybridize to specific chromosomal locations.
• Example application: low-resolution localization of a gene.

Sequencing Genes & Gene Assembly

Automated sequencing

Gene Finding

• Only 2-3% of the human genome codes for functional genes.
• Genes are spread along large non-coding DNA regions.
• Repeats, pseudo-genes, introns and vector contamination confuse the search.

Gene Finding (cont.)

Find special gene patterns:

• Translation start and stop sites (open reading frames - ORF).
• Transcription factor binding sites, promoters.
• Intron splice sites.
• Etc…

Micro Arrays (“DNA Chips”)

New biotechnology breakthrough: measure RNA expression levels of thousands of genes in one experiment.

The Idea Behind Micro Arrays

Clustering Analysis of Gene Expression Data

DNA chips and personalized medicine (leading edge, future technologies).

Pharmaco-genomics

• Use DNA information to measure and predict the reaction to drugs.
• Personalized medicine.
• Faster clinical trials: selected populations.
• Fewer drug side effects.

Protein and Other Arrays

Sequencing the human genome => finite problem.
Studying the proteome => endless possible variations, dynamic.

Future fields of study:

Proteins + Genomics = Proteomics
Lipids + Genomics = Lipomics
Sugars + Genomics = Glycomics

Protein array

Understanding Mechanisms of Disease

[Figure labels: EC number, compound]

Putting it all together: Bio-Informatics

• SEQUENCE ALIGNMENT
• ORTHOLOG GENES (Taxonomy)
• CODING REGIONS
• CONSERVED DOMAINS
• 3-D STRUCTURE
• SEQUENCES & LITERATURE
• SIGNAL PEPTIDE
• CELLULAR LOCATION
• GENE FAMILIES
• GENOME MAPS
• MUTATIONS & POLYMORPHISM

Putting it all together: Bio-Informatics (cont.)

• SEQUENCE ALIGNMENT
• ORTHOLOG GENES (Taxonomy)
• CODING REGIONS
• 3-D STRUCTURE
• SIGNAL PEPTIDE
• GENE EXPRESSION, GENE FUNCTION, DRUG & PERSONAL THERAPY
• CELLULAR LOCATION
• GENOME MAPS
• CONSERVED DOMAINS
• GENE FAMILIES
• MUTATIONS & POLYMORPHISM
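
Worked example: scanning for open reading frames

The Gene Finding slides list translation start and stop sites (open reading frames) as one of the patterns to search for. The sketch below illustrates that single idea in Python. It is a toy under stated assumptions, not a real gene finder: it assumes the standard genetic code (ATG start; TAA, TAG, TGA stops), scans only the forward strand, and ignores introns, promoters and splice sites. The names `find_orfs` and `min_codons` are illustrative and do not come from the lecture.

```python
# Toy ORF scan (illustrative only). Assumes the standard genetic code
# and looks at the forward strand in its three reading frames.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return a list of (start, end, frame) tuples for forward-strand ORFs.

    start and end are 0-based and end-exclusive; the span includes the
    stop codon. min_codons counts codons from the start codon up to,
    but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        # Step through complete codons in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == START:
                    start = i
            elif codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATGACC"
    for start, end, frame in find_orfs(seq):
        print(frame, start, end, seq[start:end])
```

For the short made-up sequence `CCATGAAATGACC`, the only complete ORF is `ATGAAATGA` in frame 2, so `find_orfs` returns `[(2, 11, 2)]`. Real genomes need far more than this (both strands, splice-site and promoter signals, statistical models), which is why the slides describe repeats, pseudo-genes and introns as confusing.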