From Genetics to Bioinformatics
Lushan Wang
2008.10.15

Modern Biology
Molecular Basis of Inheritance
DNA structure and its biological function
The Human Genome
The Human Genome Project

I. DNA structure and its biological function

Mendel: The Father of Genetics
1865: Gregor Mendel discovered the basic rules of heredity in the garden pea.
What are these factors? And where are they located?
Complex biological traits can be described by mathematical laws.

Cell, Chromosome, Nuclein (function unknown at the time)
Johann Miescher discovered DNA and named it nuclein.

Major events in the history of Molecular Biology 1900-1911
1902 - Emil Fischer wins the Nobel Prize; he showed that amino acids are linked together to form proteins.
Emil Fischer introduced formulas depicting the spatial arrangement of groups around chiral carbon atoms.

Thomas Morgan, Fruit Fly: Finding the Genes
1911 – Thomas Morgan discovers that genes on chromosomes are the discrete units of heredity.
1910-1925: Development of Cytological Genetics
Cytogenetics is the study of chromosomes and chromosome abnormalities.

Inborn Errors of Metabolism
The relationship between genes and proteins was first proposed by Garrod in 1908. Garrod, a prominent physician at St. Bartholomew's Hospital in London, understood both the new science of biochemistry and the emerging discipline of genetics.
Following Mendel’s laws, Garrod concluded that alkaptonuria is a congenital disorder, not the result of a bacterial infection as was commonly thought. He observed that inherited diseases reflect a patient's inability to make a particular enzyme, and he referred to them as “inborn errors of metabolism”.

Tetranucleotide Hypothesis
Phoebus Levene (Russian-American, 1869-1940)
He worked with Albrecht Kossel and Emil Fischer, the nucleic acid and protein experts at the turn of the 20th century.
The simplicity of the structure implied that DNA was too uniform to contribute to complex genetic variation.

Major events in the history of Molecular Biology 1940 - 1950
One Gene-One Enzyme Hypothesis
George Beadle, Edward Tatum
They identified that genes make proteins, but what is a gene?

“Transforming Principle” identified as DNA
Avery's work

The Hershey-Chase Experiment
The Americans Alfred Hershey (1908-1997) and Martha Chase (1927-2003) published a now-classic paper in 1952.

Base Ratios
Erwin Chargaff (1905-2002) showed (1950’s):
– Amount of adenine relative to guanine differs among species
– Amount of adenine always equals amount of thymine and amount of guanine always equals amount of cytosine: %A=%T and %G=%C

X-ray Crystallography Applied to Nucleic Acids
Between the 1940s and 1950s: Maurice Wilkins (1916-2004) and Rosalind Franklin (1920-1958) worked on X-ray diffraction of DNA.

Major events in the history of Molecular Biology 1952 - 1960
James Watson (American, 1928-)
Francis Crick (British, 1916-2004)

Principle | Data | Source
X-ray crystallography | Stacked layers of subunits in spirals; long chain, no ruling out of two chains, sugar-phosphate on the outside | Wilkins and Franklin (but mostly Franklin)
Organic chemistry | 4 nucleotides | Levene
Biochemistry | α-helix, model building | Pauling
Chromatography | Base ratios | Chargaff
Chemical bonding | Right form of the bases | J. Donohue
Mathematics | Attractive forces between DNA bases | J. Griffith

Enter Watson and Crick
Informational approach (transfer of information, translation of information)

The central dogma of molecular biology
DNA -> (transcription) -> RNA -> (translation) -> Protein
DNA -> (transcription) -> rRNA (ribosomal)
DNA -> (transcription) -> mRNA (messenger) -> (translation, on the ribosome) -> Protein
DNA -> (transcription) -> tRNA (transfer)
Molecular biology is born

Major events in the history of Molecular Biology 1970 - 1977
1972: Paul Berg and coworkers create the first recombinant DNA molecule.
Proc Natl Acad Sci U S A. 1972, 69(10):2904-2909.

A new method for sequencing DNA.
1977: Allan Maxam and Walter Gilbert (pictured) at Harvard University and Frederick Sanger at the U.K. Medical Research Council (MRC) independently develop methods for sequencing DNA (PNAS, February; PNAS, December).
Maxam, A.M., Gilbert, W. Proc Natl Acad Sci, 74(2): 560-4. 1977.

Major events in the history of Molecular Biology 1980 - 1995

GenBank Database Formed
1982: GenBank, NIH’s publicly accessible genetic sequence database, was formed at Los Alamos National Laboratory. Scientists submit DNA sequence data from a wide range of organisms to GenBank; researchers routinely retrieve and analyze the data in the archive.

1983: First Disease Gene Mapped
A genetic marker linked to Huntington disease (HD, Huntington's chorea) was found on chromosome 4 in 1983, making HD the first genetic disease mapped using DNA polymorphisms.
A polymorphic DNA marker genetically linked to Huntington's disease. Nature, 306(5940):234-8. 1983.

PCR Invented in 1985 (1993 Nobel Prize in Chemistry)
1985: Kary Mullis and colleagues at Cetus Corp. develop PCR, a technique for amplifying a DNA segment into vast numbers of copies.

The first automated DNA sequencing machine
1986: Leroy Hood and Lloyd Smith of the California Institute of Technology and colleagues announce the first automated DNA sequencing machine.

The Secret to Sanger Sequencing
Principles of DNA Sequencing
Template (5’ to 3’): G C A T G C, with a 5’ primer
Four reactions, each containing dATP, dCTP, dGTP and dTTP plus one dideoxynucleotide, give these terminated products:
– ddCTP: GddC, GCATGddC
– ddATP: GCddA
– ddTTP: GCAddT
– ddGTP: ddG, GCATddG

Automating Sanger Sequencing

1986: First Time Gene Positionally Cloned

The first comprehensive genetic map of human chromosomes was based on 400 restriction fragment length polymorphisms (RFLPs).
A genetic linkage map of the human genome. Cell. 1987 Oct 23;51(2):319-37.

1987: YACs Developed
Yeast artificial chromosomes (YACs) can carry large segments of DNA from other species, such as humans.
YACs can carry fragments of human DNA about a million base pairs long, whereas plasmids and viruses carry pieces only a few thousand base pairs long.

1989: Microsatellites, New Genetic Markers
A microsatellite is a stretch of DNA made of a sequence two to four base pairs long that is repeated in tandem, e.g. a stretch of DNA that looks like this: CAGCAGCAGCAGCAGCAGCAG.
Weber, J.L., May, P.E. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet, 44:388-96. 1989.

1989: Sequence-tagged Sites, Another Marker
A Common Language for Physical Mapping of the Human Genome. Science, 245:1434-5. 1989.
A sequence-tagged site (STS) is a unique stretch of DNA that the polymerase chain reaction (PCR) can easily detect. STSs are very useful for making physical maps of human chromosomes. Creating a physical map is much like putting together a large puzzle, where the pieces of the puzzle are pieces of DNA made by cutting up chromosomes.

Office of Human Genome Research
1988: NIH establishes the Office of Human Genome Research and appoints Watson (pictured) as its head. Watson declares that 3% of the genome budget should be devoted to studies of social and ethical issues.

1990: Launch of the Human Genome Project
Watson, J.D., Jordan, E. The Human Genome Program at the National Institutes of Health. Genomics, 5: 654-56. 1989.
Beginning in December 1984, the U.S. Department of Energy (DOE), the National Institutes of Health (NIH) and international groups sponsored meetings to consider the feasibility and usefulness of mapping and sequencing the human genome.

The Human Genome Project
In 1990, DOE and NIH published a plan for the first five years of what was projected to be a 15-year project.
The goals of the project included:
– Mapping the human genome and eventually determining the sequence of all 3.2 billion letters in it;
– Mapping and sequencing the genomes of other organisms important to the study of biology;
– Developing technology for analyzing DNA;
– Studying the ethical, legal and social implications of genome research.

Figure: The human chromosomes
Figure: Genome DNA sequencing and work flow

Goals of the Human Genome Project
1. Map and sequence the human genome
– Build genetic and physical maps spanning the human genome.
– Determine the sequence of the estimated 3 billion letters of human DNA, to 99.99% accuracy.
– Chart variations in DNA spelling among human beings.
– Map all the human genes.
– Begin to label the functions of genes and other parts of the genome.
2. Map and sequence the genomes of model organisms (approximate genome size in base pairs)
– The bacterium E. coli (4.6 million)
– The yeast S. cerevisiae (12 million)
– The roundworm C. elegans (100 million)
– The fruit fly D. melanogaster (180 million)
– The mouse M. musculus (3 billion)
3. Collect and distribute data
– Distribute genomic information and the tools for using it to the research community.
– Release all sequence data that spans more than 2000 base pairs within 24 hours.
– Create and run databases.
– Develop software for large-scale DNA analysis.
– Develop tools for comparing and interpreting genome information.
– Share information with the wider public.
4. Study the ethical, legal and social implications of genetic research
5. Train researchers
6. Develop technologies
– Make large-scale sequencing faster and cheaper.
– Develop technology for finding sequence variations.
– Develop ways to study functions of genes on a genomic scale.
7. Transfer technology to the private sector

Figure: Time Line of the Human Genome Project
Figure: Standard molecular biology techniques – running agarose gels.
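Figure: CS-Packard DNA Production Robotic Systems (x 3)
Figure: Capillary Electrophoresis, Separation by Electro-osmotic Flow Technology

Exploring the essence of the life sciences
Stage 1: established the cellular basis of heredity, the chromosome.
Stage 2: defined the molecular basis of heredity, the DNA double helix.
Stage 3: unraveled the informational basis of heredity, the central dogma. With the discovery of the biological mechanisms by which cells read genetic information, and the invention of recombinant DNA cloning and sequencing technologies, scientists can use these techniques to explore the information contained in genes.
Stage 4: completed a great scientific program, the Human Genome Project: sequencing the 3.2 billion base pairs on the 24 distinct human chromosomes (22 autosomes plus X and Y) and completing the corresponding analysis.

1998: Company Announces Sequencing Plan
In May 1998, the company Celera Genomics was formed to sequence much of the human genome in three years. While the company used many HGP-generated resources, its strategy differed: the HGP built detailed maps before sequencing defined regions, whereas Celera used a shotgun sequencing strategy, in which the entire genome is fragmented and random segments are sequenced and then put in order.
Timeline figure label: 1996, HGP start

Shotgun Sequencing
Sequence -> Chromatogram -> Send to Computer -> Assembled Sequence

Draft Sequences, 2001
International Human Genome Sequencing Consortium (‘public project’) – Initial Sequencing and Analysis of the Human Genome. Nature 409:860-921, 2001.
Celera Genomics, Venter JC et al. (‘private project’) – The Sequence of the Human Genome. Science 291:1304-1351, 2001.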
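
The "fragment, sequence, then put in order" step of shotgun sequencing can be illustrated with a toy example. The following is only a minimal sketch, assuming error-free reads from a single strand and no repeated regions; it uses a simple greedy overlap merge, not the assembler Celera or the HGP actually used. The function names `overlap` and `greedy_assemble` and the short test sequence are hypothetical, chosen for illustration.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the two reads with the largest overlap (toy assembler)."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # nothing overlaps any more: contigs stay unjoined
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])  # merge, keeping the overlap once
    return max(reads, key=len)


# Hypothetical 25-base "genome", cut into overlapping 10-base reads.
genome = "ATGGCGTACGTTAGCCTAGGATCCA"
reads = [genome[i:i + 10] for i in range(0, len(genome) - 9, 3)]
contig = greedy_assemble(reads)
print(contig == genome)
```

Real genomes contain repeats and reads contain errors, which is why production assemblers rely on much more elaborate overlap graphs and on mapping information.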
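Biochemistry -> molecular biology -> bioinformatics

Biochemical representation of nucleotide sequences

Biochemical representation of amino acid sequences
Chemical structural formula of a segment of protein sequence
Sequence in three-letter amino acid symbols: Ser-Gly-Tyr-Ala-Leu
Sequence in one-letter amino acid symbols: SGYAL
The left end is the N-terminus of the polypeptide; the right end is the C-terminus.

Representation of nucleic acid sequences in bioinformatics data
Note that DNA and RNA are represented in the same way: both U and T are written as T.

IUB/IUPAC nucleotide symbols
Symbol | Meaning
A | Adenosine
C | Cytidine
G | Guanosine
T | Thymidine
U | Uridine
R | G or A (purine)
Y | T or C (pyrimidine)
K | G or T (keto)
M | A or C (amino)
S | G or C (strong interaction)
W | A or T (weak interaction)
B | G, T or C (not A)
D | G, A or T (not C)
H | A, C or T (not G)
V | G, C or A (not T)
N | A, G, C or T (any base)
- | Gap of indeterminate length

Representation of protein sequences in bioinformatics data
The primary structure of a protein is its amino acid sequence. In bioinformatics analysis, protein sequence information is usually stored as one-letter symbols rather than in three-letter form.
Myoglobin, a polypeptide chain of 154 amino acid residues, is stored in bioinformatics databases as follows:
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPE
TLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKK
GHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDF
GADAQGAMNKALELFRKDMASNYKELGFQG
The first letter, M (methionine, Met), is the N-terminus of myoglobin, and the last letter, G (glycine, Gly), is the C-terminus.

FASTA format for sequence data
>myoglobin
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>gi|4504345|ref|NP_000508.1| alpha 2 globin [Homo sapiens]
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
In the header line, the fields are separated by “|”: the GI number of the sequence is 4504345, its accession number is NP_000508.1, its name is alpha 2 globin, and [Homo sapiens] indicates that it comes from the human species.

GenBank format

File forms of bioinformatics data
Flat file
– Information is stored sequentially in the file in a specific format
– Each record (entry) is uniquely identified by its accession number (accession #)
– Links between information within a file and across files are made through the accession #
Relational database (relational DB)
– Based on the entity-relationship model (E-R model)
– Each record (record/tuple) in a table is uniquely identified by a key
– Tables are linked through foreign keys

Information representation: relational database
Figure labels: semantic matching, semantic mapping, attributes, query, relations, semantic mapping and processing, results

Figure: Growth of PDB (1975-2005), yearly growth of total structures by year. http://nar.oxfordjournals.org/

Problems with bioinformatics data
Information sources are distributed across different sites around the world.
Global questions that involve several data sources cannot be answered immediately:
– Painfully collecting unstructured information around the sites
– Manually putting pieces together
– Hopefully getting the right picture...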
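In summary, the information sources are:
– autonomous
– distributed
– heterogeneous

Data Integration
Figure: Site A and Site B exchanging data through XML

The most important task of bioinformatics is to extract new knowledge from massive amounts of data.

III. Types of biological databases
Development directions of biological databases

Sequence databases
Major nucleic acid sequence databases: GenBank, EMBL, DDBJ -> INSDC
Major protein sequence databases: Swiss-Prot, PIR -> UniProt

Nucleic acid sequence databases
The US nucleic acid database GenBank [Benson, D.A. et al. (1998) Nucleic Acids Res. 26, 1-7] was under construction from 1979 and went into formal operation in 1982.
The EMBL database of the European Molecular Biology Laboratory also began service in 1982.
Japan began building its national nucleic acid database, DDBJ, in 1984, and it formally began service in 1987.
Since then, DNA sequence data have grown from about a hundred sequences and a few hundred thousand bases in the early 1980s to 11 billion bases today. In other words, the volume of data has increased nearly 100,000-fold in only about 18 years.

Nucleic acid sequences
A nucleic acid sequence is a sequence written in the one-letter symbols of the 4 nucleotides (A, T, G, C).

Protein sequence databases
SWISS-PROT and PIR are the two major international protein sequence databases; both currently have mirror sites at EMBL and GenBank. SWISS-PROT includes protein sequences translated from EMBL, which have been checked and annotated. The data in PIR are translated from GenBank DNA sequences by the US National Center for Biotechnology Information (NCBI).
In 1952 Sanger determined the sequence of a protein, insulin; in 1977 Sanger and colleagues invented DNA sequencing technology. Protein databases are mainly translated from DNA databases and then annotated.

Protein sequences
MNIQQLALQNIKGNWRNYKVFFLSSCFAIFASFAYMSV
IVHPYMKETMWYQNVRWGLIICNIIIISFFIIFILYSTSIFI
EARKKELGLYMLMGATKSNVIGVIMTEQMLIGVFANIF
GIGLGIIFLKLFFMVFSMLLGLPKELPIIFDVRAIGGTFIA
YMVVFVVLSFISALRIWNIKIIRLLKEFRTDKKEKKTSM
RLCIFGLICLGIGYALALQTTMPTIAFYFFPVSILVFFGT
YFSFTHGTAQILELIKRNKKIMYTYPYLFIVNQLSHRM
KENGRFFFLMSMATTFVVTATGTVFLYFSGMQDMWR
GGGVHSFSYIEKGTSSHEVFAEGMVEQLLHQYGYDDF
QSMSFVGVYASFQSSKGETEIATLMKESEYNQEARKQ
GQKTYHPKKGSVTLVYYNKYNHPNMYDQKEIQLQV
MNQTYSFVFNGQKEGIQFNYHPSQINGLFFVMHDEDF
DGIANKVPDSEKMIYRGYTLPNIENTKELNEDLRKHM
KQDDNNAFRSNMELYVNMKAFGDITLFVGSFISILFFL
TSCSIVYFKWFHNIASDRKEYGALSKLGMTKEEVWRIS
RWQLCMLFFAPIIVGSMHSAVALYTFHNTIFMDGSLRK
VGLFILFYIAACIMYFFFAQREYRKHLD
A protein sequence is a sequence written in the one-letter symbols of the 20 amino acids.

Genome databases
GDB: the human genome database
AceDB: the nematode (Caenorhabditis elegans) genome database

IV. Database retrieval tools
Entrez, SRS
http://www.ncbi.nlm.nih.gov/Entrez/
Entrez -- GenBank

SRS (Sequence Retrieval System)
SRS is the main retrieval tool of EMBnet, the European Molecular Biology network.
SRS, the Sequence Retrieval System, is a powerful database management system developed specifically for biological databases. Its goal is to provide efficient access to databases with biological content, regardless of the format in which they are available, while allowing complex search criteria.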
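
Sequence records retrieved from such databases are often handled as FASTA text, in the one-letter notation described above. The following is a minimal sketch of reading such a record, assuming the NCBI-style header with fields separated by "|" as in the alpha 2 globin record shown earlier. The helper names `parse_fasta` and `parse_ncbi_header` are hypothetical and not part of Entrez, SRS or any library.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_ncbi_header(header: str) -> dict[str, str]:
    """Split a 'gi|<number>|ref|<accession>| description' header into fields."""
    parts = header.split("|")
    fields = {"description": parts[-1].strip()}
    # Before the free-text description, tags and values alternate.
    for tag, value in zip(parts[0:-1:2], parts[1:-1:2]):
        fields[tag] = value
    return fields


example = """\
>gi|4504345|ref|NP_000508.1| alpha 2 globin [Homo sapiens]
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
"""

records = parse_fasta(example)
header, sequence = records[0]
fields = parse_ncbi_header(header)
print(fields["gi"], fields["ref"], fields["description"])
```

Headers that do not follow the "|" convention, such as a bare name, simply end up with only a `description` field in this sketch.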
The most important task of bioinformatics is to extract new knowledge from massive amounts of data.
http://www.bioinfo.sdu.edu.cn
http://202.194.15.192/bioinfo/bio