
From Genetics to Bioinformatics
Lushan Wang
2008.10.15

Modern Biology: Molecular Basis of Inheritance
• DNA structure and its biological function
• The Human Genome
• The Human Genome Project

I. DNA structure and its biological function

Mendel: The Father of Genetics
• 1865: Gregor Mendel discovered the basic rules of heredity in the garden pea.
• What are these factors? And where are they located?
• Complex biological traits can be described by mathematical rules.

DNA structure and its biological function
• Cell, chromosome, nuclein (its function was not yet known)
• Johannes Friedrich Miescher discovered DNA and named it nuclein.

Major events in the history of Molecular Biology 1900-1911
• 1902: Emil Fischer wins the Nobel Prize; he showed that amino acids are linked together to form proteins.
• Emil Fischer introduced formulas depicting the spatial arrangement of groups around chiral carbon atoms.
• Thomas Morgan and the fruit fly: finding the genes. 1911: Thomas Morgan discovers that genes on chromosomes are the discrete units of heredity.

1910-1925: Development of Cytological Genetics
• Cytogenetics is the study of chromosomes and chromosome abnormalities.

Inborn Errors of Metabolism
• The relationship between genes and proteins was first proposed by Garrod in 1908.
• Garrod, a prominent physician at St. Bartholomew's Hospital in London, understood both the new science of biochemistry and the emerging discipline of genetics.
• Following Mendel's laws, Garrod concluded that alkaptonuria is a congenital disorder, not the result of a bacterial infection as was commonly thought.
• He observed that inherited diseases reflect a patient's inability to make a particular enzyme, and he referred to them as "inborn errors of metabolism".

Tetranucleotide Hypothesis
• Phoebus Levene (Russian-American, 1869-1940)
• He worked with Albrecht Kossel and Emil Fischer, the nucleic acid and protein experts at the turn of the 20th century.
• The simplicity of the proposed structure (a repeating unit of the four nucleotides) implied that DNA was too uniform to contribute to complex genetic variation.

Major events in the history of Molecular Biology 1940-1950
• One gene-one enzyme hypothesis (George Beadle, Edward Tatum): established that genes direct the making of proteins. But what is a gene?
• "Transforming principle" identified as DNA: Avery's work.

The Hershey-Chase Experiment
• The Americans Alfred Hershey (1908-1997) and Martha Chase (1927-2003) published a now-classic paper in 1952.

Base Ratios
• Erwin Chargaff (1905-2002) showed (1950s):
  – The amount of adenine relative to guanine differs among species.
  – The amount of adenine always equals the amount of thymine, and the amount of guanine always equals the amount of cytosine: %A = %T and %G = %C.

X-ray Crystallography Applied to Nucleic Acids
• Between the 1940s and 1950s, Maurice Wilkins (1916-2004) and Rosalind Franklin (1920-1958) worked on X-ray diffraction of DNA.

Major events in the history of Molecular Biology 1952-1960
• James Watson (American, 1928-)
• Francis Crick (British, 1916-2004)

Principle | Data | Source
X-ray crystallography | Stacked layers of subunits in spirals; long chain, no ruling out of two chains, sugar-phosphate on the outside | Wilkins and Franklin (but mostly Franklin)
Organic chemistry | 4 nucleotides | Levene
Biochemistry | α-helix, model building | Pauling
Chromatography | Base ratios | Chargaff
Chemical bonding | Right form of the bases | J. Donohue
Mathematics | Attractive forces between DNA bases | J. Griffith

Enter Watson and Crick
• Informational approach (transfer of information, translation of information)

The central dogma of molecular biology
DNA → (transcription) → RNA → (translation) → Protein
• DNA is transcribed into rRNA (ribosomal), mRNA (messenger) and tRNA (transfer).
• At the ribosome, mRNA is translated into protein.

Molecular biology is born

Major events in the history of Molecular Biology 1970-1977
• 1972: Paul Berg and coworkers create the first recombinant DNA molecule. Proc Natl Acad Sci U S A. 1972, 69(10):2904-2909.
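The information flow of the central dogma described above (DNA to mRNA to protein) can be illustrated with a short Python sketch. This is only an illustration, not part of the original slides: the function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding strand read 5' to 3' in frame, and the codon table is the standard genetic code written in the compact base-order form (U, C, A, G).

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G base order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (assumption: T is simply read as U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGTGCTGTCTCCTGCCTAA"  # made-up example fragment
    mrna = transcribe(dna)
    print(mrna)
    print(translate(mrna))
```

A real pipeline would also handle the template strand, reading frames and ambiguous bases; this sketch deliberately ignores them.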
A new method for sequencing DNA
• 1977: Allan Maxam and Walter Gilbert at Harvard University and Frederick Sanger at the U.K. Medical Research Council (MRC) independently develop methods for sequencing DNA (PNAS, February; PNAS, December). Maxam, A.M., Gilbert, W. Proc Natl Acad Sci, 74(2):560-4. 1977.

Major events in the history of Molecular Biology 1980-1995

GenBank Database Formed
• 1982: GenBank, NIH's publicly accessible genetic sequence database, was formed at Los Alamos National Laboratory. Scientists submit DNA sequence data from a wide range of organisms to GenBank; researchers routinely retrieve and analyze the data in the archive.

1983: First Disease Gene Mapped
• A genetic marker linked to Huntington disease (HD, also known as Huntington's chorea) was found on chromosome 4 in 1983, making HD the first genetic disease mapped using DNA polymorphisms. A polymorphic DNA marker genetically linked to Huntington's disease. Nature, 306(5940):234-8. 1983.

PCR Invented in 1985 (1993 Nobel Prize in Chemistry)
• 1985: Kary Mullis and colleagues at Cetus Corp. develop PCR, a technique to replicate vast amounts of DNA.

The first automated DNA sequencing machine
• 1986: Leroy Hood and Lloyd Smith of the California Institute of Technology and colleagues announce the first automated DNA sequencing machine.

The Secret to Sanger Sequencing: Principles of DNA Sequencing
Template: G C A T G C, with a primer annealed. Each of the four reactions contains all four dNTPs (dATP, dCTP, dGTP, dTTP) plus one chain-terminating ddNTP:
• with ddCTP: GddC, GCATGddC
• with ddATP: GCddA
• with ddTTP: GCAddT
• with ddGTP: ddG, GCATddG

Automating Sanger Sequencing

1986: First Time Gene Positionally Cloned
• The first comprehensive genetic map of human chromosomes was based on 400 restriction fragment length polymorphisms (RFLPs). A genetic linkage map of the human genome. Cell. 1987 Oct 23;51(2):319-37.

1987: YACs Developed
• Yeast artificial chromosomes (YACs) can carry large segments of DNA from other species, such as humans.
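The chain-termination principle from the Sanger sequencing slide above can be sketched in a few lines of Python. This is a hedged illustration rather than a description of any real instrument: the function names are hypothetical, and instead of simulating random ddNTP incorporation it simply enumerates every fragment that could end at each position of the newly synthesized strand.

```python
def termination_fragments(new_strand):
    """Group all possible terminated fragments by the ddNTP base that ends them."""
    fragments = {base: [] for base in "ACGT"}
    for end in range(1, len(new_strand) + 1):
        fragment = new_strand[:end]
        # The last base of each fragment is the dideoxy (chain-terminating) one.
        fragments[fragment[-1]].append(fragment)
    return fragments


def read_sequence(fragments):
    """Recover the sequence by ordering fragments by length, as on a gel."""
    by_length = sorted(
        (len(fragment), base)
        for base, frags in fragments.items()
        for fragment in frags
    )
    return "".join(base for _, base in by_length)


if __name__ == "__main__":
    reactions = termination_fragments("GCATGC")  # strand from the slide
    for base, frags in reactions.items():
        print("dd" + base + "TP:", frags)
    print(read_sequence(reactions))
```

Sorting by fragment length plays the role of electrophoresis here: the shortest fragment gives the first base and the longest gives the last.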
• YACs can carry fragments of human DNA a million base pairs long, whereas plasmids and viruses carry pieces of only a few thousand base pairs.

1989: Microsatellites, New Genetic Markers
• A microsatellite is a stretch of DNA made of a sequence two to four base pairs long that is repeated in tandem, e.g. a stretch of DNA that looks like this: CAGCAGCAGCAGCAGCAGCAG. Weber, J.L., May, P.E. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet, 44:388-96. 1989.

1989: Sequence-tagged Sites, Another Marker
• A sequence-tagged site (STS) is a unique stretch of DNA that the polymerase chain reaction (PCR) can easily detect. STSs are very useful for making physical maps of human chromosomes.
• Creating a physical map is much like putting together a large puzzle, where the pieces of the puzzle are pieces of DNA made by cutting up chromosomes.
• A Common Language for Physical Mapping of the Human Genome. Science, 245:1434-5. 1989.

Office of Human Genome Research
• 1988: NIH establishes the Office of Human Genome Research and recruits Watson as its head. Watson declares that 3% of the genome budget should be devoted to studies of social and ethical issues.

1990: Launch of the Human Genome Project
• Watson, J.D., Jordan, E. The Human Genome Program at the National Institutes of Health. Genomics, 5:654-56. 1989.
• Beginning in December 1984, the U.S. Department of Energy (DOE), the National Institutes of Health (NIH) and international groups sponsored meetings to consider the feasibility and usefulness of mapping and sequencing the human genome.

The Human Genome Project
• In 1990, DOE and NIH published a plan for the first five years of what was projected to be a 15-year project.
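The microsatellite definition given above (a unit of two to four base pairs repeated in tandem) translates naturally into a regular expression. The sketch below is an illustration only: the function name and the threshold of at least four copies are assumptions chosen for the example, not values from the slides.

```python
import re

# A unit of 2-4 bases (shortest unit tried first) followed by at least
# three more copies of itself, i.e. four or more tandem copies in total.
# The minimum copy number is an assumption made for this example.
MICROSATELLITE = re.compile(r"([ACGT]{2,4}?)\1{3,}")


def find_microsatellites(sequence):
    """Return (start, repeat_unit, copy_number) for each tandem repeat found."""
    hits = []
    for match in MICROSATELLITE.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    print(find_microsatellites("CAGCAGCAGCAGCAGCAGCAG"))
```

Note that this simple pattern also reports homopolymer runs such as AAAAAAAA as a two-base unit; a more careful tool would filter those out.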
The goals of the project included:
• Mapping the human genome and eventually determining the sequence of all 3.2 billion letters in it;
• Mapping and sequencing the genomes of other organisms important to the study of biology;
• Developing technology for analyzing DNA;
• Studying the ethical, legal and social implications of genome research.

The human chromosomes

Genome DNA sequencing and workflow

Goals of the Human Genome Project
1. Map and sequence the human genome
  – Build genetic and physical maps spanning the human genome.
  – Determine the sequence of the estimated 3 billion letters of human DNA, to 99.99% accuracy.
  – Chart variations in DNA spelling among human beings.
  – Map all the human genes.
  – Begin to label the functions of genes and other parts of the genome.
2. Map and sequence the genomes of model organisms (genome size in base pairs)
  – The bacterium E. coli (4.6 million)
  – The yeast S. cerevisiae (12 million)
  – The roundworm C. elegans (100 million)
  – The fruit fly D. melanogaster (180 million)
  – The mouse M. musculus (3 billion)
3. Collect and distribute data
  – Distribute genomic information and the tools for using it to the research community.
  – Release all sequence data that spans more than 2000 base pairs within 24 hours.
  – Create and run databases.
  – Develop software for large-scale DNA analysis.
  – Develop tools for comparing and interpreting genome information.
  – Share information with the wider public.
4. Study the ethical, legal and social implications of genetic research
5. Train researchers
6. Develop technologies
  – Make large-scale sequencing faster and cheaper.
  – Develop technology for finding sequence variations.
  – Develop ways to study functions of genes on a genomic scale.
7. Transfer technology to the private sector

Time Line of the Human Genome Project

Standard molecular biology techniques: running agarose gels.
CS-Packard DNA production robotic systems (x 3)
Capillary electrophoresis: separation by electro-osmotic flow technology

Exploring the essence of the life sciences
• Stage 1: established the cellular basis of heredity, the chromosome.
• Stage 2: defined the molecular basis of heredity, the DNA double helix.
• Stage 3: uncovered the informational basis of heredity, the central dogma. With the discovery of the biological mechanisms by which cells read genetic information, and the invention of recombinant DNA cloning and sequencing technologies, scientists could explore the information contained in genes.
• Stage 4: completed a great scientific program, the Human Genome Project, which sequenced and analyzed the 3.2 billion base pairs on the 24 distinct human chromosomes (22 autosomes plus X and Y).

1998: Company Announces Sequencing Plan
• In May 1998, the company Celera Genomics was formed to sequence much of the human genome in three years. The company used many HGP-generated resources. Unlike the HGP, however, which built detailed maps before sequencing defined regions, Celera used a shotgun sequencing strategy, in which the entire genome is fragmented and random segments are sequenced and then put in order.
[Timeline label: 1996, HGP start]

Shotgun Sequencing
Sequence → Chromatogram → Send to computer → Assembled sequence

Draft Sequences, 2001
• International Human Genome Sequencing Consortium ("public project"): Initial Sequencing and Analysis of the Human Genome. Nature 409:860-921, 2001.
• Celera Genomics ("private project"): Venter JC et al. The Sequence of the Human Genome. Science 291:1304-1351, 2001.
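The "fragment, sequence, then put in order" idea behind shotgun sequencing described above can be sketched as a toy greedy assembler. This is a simplified illustration under strong assumptions (error-free reads, all from the same strand, no repeats); it is not the algorithm used by Celera or the HGP, and the function names and the minimum overlap of 3 are hypothetical choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Made-up reads covering the made-up sequence ATGGCGTACGTTAGC.
    print(greedy_assemble(["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]))
```

Real assemblers must cope with sequencing errors, both strands and repeated regions, which is where the detailed maps mentioned above become useful.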
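Biochemistry → molecular biology → bioinformatics

Biochemical representation of a nucleotide sequence

Biochemical representation of an amino acid sequence
• Chemical structural formula of a short protein sequence
• Sequence in three-letter amino acid symbols: Ser-Gly-Tyr-Ala-Leu
• Sequence in one-letter amino acid symbols: SGYAL
• The N-terminus of the polypeptide is on the left and the C-terminus on the right.

Representation of nucleic acid sequences in bioinformatics data
• Note that DNA and RNA are represented in the same way: both U and T are written as T.

IUB/IUPAC nucleotide symbols
Symbol | Meaning
A | Adenosine
C | Cytidine
G | Guanosine
T | Thymidine
U | Uridine
R | G or A (purine)
Y | T or C (pyrimidine)
K | G or T (keto)
M | A or C (amino)
S | G or C (strong interaction)
W | A or T (weak interaction)
B | G, T or C (not A)
D | G, A or T (not C)
H | A, C or T (not G)
V | G, C or A (not T)
N | A, G, C or T (any base)
- | Gap of indeterminate length

Representation of protein sequences in bioinformatics data
• The primary structure of a protein is its amino acid sequence. In bioinformatics analysis, protein sequences are usually stored in one-letter symbols rather than the three-letter form.
• Myoglobin, a polypeptide chain of 154 amino acid residues, is stored in bioinformatics databases as follows:
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPE
TLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKK
GHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDF
GADAQGAMNKALELFRKDMASNYKELGFQG
• The first letter, M (methionine, Met), is the N-terminus of myoglobin, and the last letter, G (glycine, Gly), is its C-terminus.

Sequence data in FASTA format
>alpha 2 globin
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>gi|4504345|ref|NP_000508.1| alpha 2 globin [Homo sapiens]
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
• The fields of the title line are separated by "|": the GI number of the sequence is 4504345, the accession number is NP_000508.1, the name is alpha 2 globin, and [Homo sapiens] indicates that the species is human.

GenBank format

File forms of bioinformatics data
• Text file (flat-file)
  – Information is stored sequentially in the file in a specific format.
  – Each record (entry) is uniquely identified by its accession number (accession #).
  – Links within a file and between files are both made through accession numbers.
• Relational database (relational DB)
  – Based on the entity-relationship model (E-R model).
  – Each record (record/tuple) in a table is uniquely identified by a key.
  – Tables are linked to one another through foreign keys.

Information representation: relational databases
[Figure: a query is matched to attributes and relations by semantic mapping, then processed to give results.]

Growth of PDB (1975-2005)
[Figure: yearly growth of total structures, by year, 1975-2005. Source: http://nar.oxfordjournals.org/]

Problems with bioinformatics data
• Information sources are distributed across different sites around the world.
• Global questions that involve several data sources cannot be answered immediately:
  – Painfully collecting unstructured information around the sites
  – Manually putting pieces together
  – Hopefully getting the right picture...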
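The FASTA layout shown above (a title line starting with ">" followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: the function names are hypothetical, and `split_ncbi_header` assumes the specific gi|…|ref|…| title layout used in the example above and will not work for other header styles.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split a title of the assumed form gi|<GI>|ref|<accession>|<description>."""
    fields = header.split("|")
    return {
        "gi": fields[1],
        "accession": fields[3],
        "description": fields[4].strip(),
    }


if __name__ == "__main__":
    example = (
        ">gi|4504345|ref|NP_000508.1| alpha 2 globin [Homo sapiens]\n"
        "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS\n"
        "HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF\n"
        "KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR\n"
    )
    for title, sequence in parse_fasta(example):
        print(split_ncbi_header(title), len(sequence))
```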
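• In short, the information sources are:
  – autonomous
  – distributed
  – heterogeneous

Data Integration
[Figure: Site A and Site B exchanging data as XML.]

The most important task of bioinformatics is to extract new knowledge from massive amounts of data.

III. Types of biological databases

Development trends of biological databases

Sequence databases
• Main nucleic acid sequence databases: GenBank, EMBL, DDBJ → INSDC
• Main protein sequence databases: Swiss-Prot, PIR → UniProt

Nucleic acid sequence databases
• Construction of GenBank, the U.S. nucleic acid database [Benson, D.A. et al. (1998) Nucleic Acids Res. 26, 1-7], began in 1979, and it formally went into operation in 1982.
• The EMBL database of the European Molecular Biology Laboratory also began service in 1982.
• Japan began building its national nucleic acid database, DDBJ, in 1984, and it formally went into service in 1987.
• Since then, DNA sequence data has grown from a hundred or so sequences and a few hundred thousand bases in the early 1980s to 11 billion bases today. In other words, the volume of data has grown nearly 100,000-fold in only about 18 years.

Nucleic acid sequences
• A nucleic acid sequence is a string of the single-letter symbols of the four nucleotides (ATGC).

Protein sequence databases
• SWISS-PROT and PIR are the two main international protein sequence databases; both now have mirror sites at EMBL and GenBank. SWISS-PROT includes protein sequences translated from EMBL, which are checked and annotated. The data in PIR are translated from GenBank DNA sequences by the U.S. National Center for Biotechnology Information (NCBI).
• In 1952 Sanger determined the sequence of a protein, insulin; in 1977 Sanger and colleagues invented a technique for sequencing DNA.
• Protein databases are mainly translated from DNA databases and then annotated.

Protein sequences
MNIQQLALQNIKGNWRNYKVFFLSSCFAIFASFAYMSV
IVHPYMKETMWYQNVRWGLIICNIIIISFFIIFILYSTSIFI
EARKKELGLYMLMGATKSNVIGVIMTEQMLIGVFANIF
GIGLGIIFLKLFFMVFSMLLGLPKELPIIFDVRAIGGTFIA
YMVVFVVLSFISALRIWNIKIIRLLKEFRTDKKEKKTSM
RLCIFGLICLGIGYALALQTTMPTIAFYFFPVSILVFFGT
YFSFTHGTAQILELIKRNKKIMYTYPYLFIVNQLSHRM
KENGRFFFLMSMATTFVVTATGTVFLYFSGMQDMWR
GGGVHSFSYIEKGTSSHEVFAEGMVEQLLHQYGYDDF
QSMSFVGVYASFQSSKGETEIATLMKESEYNQEARKQ
GQKTYHPKKGSVTLVYYNKYNHPNMYDQKEIQLQV
MNQTYSFVFNGQKEGIQFNYHPSQINGLFFVMHDEDF
DGIANKVPDSEKMIYRGYTLPNIENTKELNEDLRKHM
KQDDNNAFRSNMELYVNMKAFGDITLFVGSFISILFFL
TSCSIVYFKWFHNIASDRKEYGALSKLGMTKEEVWRIS
RWQLCMLFFAPIIVGSMHSAVALYTFHNTIFMDGSLRK
VGLFILFYIAACIMYFFFAQREYRKHLD
• A protein sequence is a string of the single-letter symbols of the 20 amino acids.

Genome databases
• GDB: the human genome database
• AceDB: the genome database of the nematode Caenorhabditis elegans

IV. Database retrieval tools
• Entrez: http://www.ncbi.nlm.nih.gov/Entrez/
• SRS

Entrez: GenBank

SRS (Sequence Retrieval System)
• SRS is the main retrieval tool of EMBnet, the European Molecular Biology Network.
• SRS, the Sequence Retrieval System, is a powerful database management system developed specifically for biological databases.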
• The goal of SRS is to provide efficient access to databases with biological content, whatever format they are available in, while allowing complex search criteria.

The most important task of bioinformatics is to extract new knowledge from massive amounts of data.
http://www.bioinfo.sdu.edu.cn
http://202.194.15.192/bioinfo/bio
