From Genetics to
Bioinformatics
Lushan Wang
2008.10.15
Modern Biology
Molecular Basis of Inheritance

DNA structure and its biological
function

The Human Genome

The Human Genome Project
I. DNA structure and its biological function
Mendel: The Father of Genetics
1865: Gregor Mendel discovers the basic rules of heredity in the garden pea.
What are these factors? And where are they located?
Complex biological traits can be described by mathematical rules.
DNA structure and its biological function
Cell
Chromosome
Nuclein (its function was not yet known)
Johann Miescher discovered DNA and named it nuclein.
Major events in the history of
Molecular Biology 1900-1911

1902 - Emil Fischer wins Nobel
prize: showed amino acids are
linked and form proteins
Emil Fischer
introduced formulas depicting the spatial arrangement of groups around chiral carbon atoms
Major events in the history of
Molecular Biology 1900-1911
Thomas Morgan
Fruit Fly: Finding the Genes
1911 – Thomas Morgan discovers that genes on
chromosomes are the discrete units of heredity
1910-1925: Development of Cytological
Genetics

Cytogenetics is the study of chromosomes and
chromosome abnormalities
Inborn Errors of Metabolism

The relationship between genes and
proteins was first proposed by Garrod in
1908

Garrod, a prominent physician at St.
Bartholomew's Hospital in London,
understood both the new science of
biochemistry and the emerging
discipline of genetics
Inborn Errors of Metabolism

Following Mendel's laws, Garrod
concluded that alkaptonuria is a
congenital disorder,
not the result of a bacterial infection
as was commonly thought.

He observed that inherited diseases
reflect a patient's inability to make a
particular enzyme, which he referred
to as “inborn errors of metabolism”
Tetranucleotide Hypothesis

Phoebus Levene (Russian-American, 1869-1940)

He worked with Albrecht Kossel
and Emil Fischer, the nucleic acid
and protein experts at the turn of
the 20th century

The simplicity of the structure
implied that DNA was too
uniform to contribute to complex
genetic variation
Major events in the history of
Molecular Biology 1940 - 1950
One gene-One enzyme Hypothesis
George
Beadle
Edward
Tatum
They established that genes direct the making of proteins, but what is a gene?
“Transforming Principle” identified as DNA
Avery's work
The Hershey-Chase Experiment

The Americans Alfred Hershey (1908-1997) and Martha Chase
(1930-2003) published in 1952 a now classical paper.
Base Ratios

Erwin Chargaff showed (1950s):
– Amount of adenine relative to guanine
differs among species
– Amount of adenine always equals
amount of thymine and amount of
guanine always equals amount of
cytosine
%A=%T and %G=%C
Erwin Chargaff (1905-2002)
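The base-ratio rule (%A=%T and %G=%C) can be checked on any double-stranded sequence. The sketch below is only an illustration and is not from the original slides: the function names and the example strand are made up, and the duplex is modelled simply as one strand plus its reverse complement.

```python
from collections import Counter

# Watson-Crick pairing used to build the second strand (A-T, G-C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Return the complementary strand, read 5'->3'."""
    return strand.translate(COMPLEMENT)[::-1]

def base_percentages(strand):
    """Percentage of each base in the duplex formed by `strand` and its partner."""
    duplex = strand + reverse_complement(strand)
    counts = Counter(duplex)
    return {base: 100.0 * counts[base] / len(duplex) for base in "ACGT"}

pct = base_percentages("AAGCTTTGCA")  # arbitrary example strand
print(pct)
```

Because every A on one strand pairs with a T on the other (and every G with a C), the A and T percentages come out equal, as do G and C, while the A-to-G ratio depends on the sequence chosen.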
X-ray Crystallography Applied to
Nucleic Acids

Between the 1940s and 1950s, Maurice Wilkins (1916-2004) and Rosalind
Franklin (1920-1958) worked on X-ray studies of DNA.
Major events in the history of
Molecular Biology 1952 - 1960


James Watson (American, 1928-)
Francis Crick (British, 1916-2004)
Principle | Data | Source
X-ray crystallography | Stacked layers of subunits in spirals; long chain, no ruling out of two chains, sugar-phosphate on the outside | Wilkins and Franklin (but mostly Franklin)
Organic chemistry | 4 nucleotides | Levene
Biochemistry | α-helix, model building | Pauling
Chromatography | Base ratios | Chargaff
Chemical bonding | Right form of the bases | J. Donohue
Mathematics | Attractive forces between DNA bases | J. Griffith
Enter Watson and Crick
Informational approach
(transfer of information, translation of information)
The central dogma of molecular biology
DNA transcript

 RNA translatio
n  Protein
ion
DNA
transcription
rRNA
(ribosomal)
transcription
mRNA
(messenger)
Ribosome
translation
Protein
transcription
tRNA
(transfer)
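The flow DNA → RNA → protein can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the original slides: the function names `transcribe` and `translate` are invented for the example, the input is assumed to be the coding (sense) strand read 5'→3', and the table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand):
    """DNA coding strand -> mRNA: same sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGGTCTGTCTTAA")
print(mrna, translate(mrna))
```

Real translation also involves locating the start codon and reading frame; the sketch assumes the sequence already begins in frame.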
Molecular biology is born
Major events in the history of
Molecular Biology 1970- 1977
1972: Paul Berg and coworkers create the first recombinant DNA molecule.
Proc Natl Acad Sci U S A. 1972, 69(10):2904-2909.
A new method for sequencing DNA.
1977: Allan Maxam and Walter Gilbert (pictured) at Harvard University
and Frederick Sanger at the U.K. Medical Research Council (MRC)
independently develop methods for sequencing DNA (PNAS, February; PNAS, December).
Maxam, A.M., Gilbert, W. Proc Natl Acad Sci, 74 (2): 560-4. 1977.
Major events in the history of
Molecular Biology 1980 - 1995
GenBank Database Formed

1982: GenBank, NIH's publicly
accessible genetic sequence
database, was formed at Los
Alamos National Laboratory.
Scientists submit DNA sequence
data from a wide range of
organisms to GenBank;
researchers routinely retrieve and
analyze the data in the archive.
1983: First Disease Gene Mapped
A genetic marker linked to
Huntington disease was found on
chromosome 4 in 1983, making
Huntington disease, or HD, the
first genetic disease mapped
using DNA polymorphisms.
A polymorphic DNA marker genetically linked to
Huntington's disease. Nature, 306(5940):234-8 1983.
PCR Invented in 1985; 1993
Nobel Prize in Chemistry

1985: Kary Mullis and
colleagues at Cetus Corp.
develop PCR, a technique to
replicate vast amounts of DNA
The first automated DNA
sequencing machine
1986
Leroy Hood and
Lloyd Smith of the
California Institute of
Technology and
colleagues announce
the first automated
DNA sequencing
machine
The Secret to Sanger Sequencing
Principles of DNA Sequencing
5’
3’ Template
G C A T G C
5’ Primer
Four reactions, each containing dATP, dCTP, dGTP and dTTP plus one chain-terminating ddNTP:
– ddCTP: GddC, GCATGddC
– ddATP: GCddA
– ddTTP: GCAddT
– ddGTP: ddG, GCATddG
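The chain-termination idea can be mimicked in code: each reaction yields every fragment that ends at its dideoxy base, and sorting all fragments by length (as a gel does) spells out the sequence. This is a toy model with invented function names, using the six-base example G C A T G C from the slide; it ignores chemistry, strand complementarity and signal noise.

```python
def chain_termination_fragments(product, dd_base):
    """All prefixes of `product` that end in the chain-terminating base."""
    return [product[:i + 1] for i, base in enumerate(product) if base == dd_base]

def read_sequence(fragments_by_base):
    """Order fragments by length and read off the terminating base of each."""
    labelled = [
        (len(fragment), base)
        for base, fragments in fragments_by_base.items()
        for fragment in fragments
    ]
    return "".join(base for _, base in sorted(labelled))

product = "GCATGC"
reactions = {base: chain_termination_fragments(product, base) for base in "ACGT"}
print(reactions)
print(read_sequence(reactions))
```

The shortest fragment identifies the first base, the next shortest the second, and so on, which is why separating fragments by size is enough to recover the order of bases.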
Automating Sanger Sequencing
The first comprehensive genetic map of human chromosomes was based on
400 restriction fragment length polymorphisms (RFLPs).
1986: First Time Gene
Positionally Cloned
A genetic linkage map of the
human genome.
Cell. 1987 Oct 23;51(2):319-37.
1987: YACs Developed
Yeast artificial chromosomes (YACs) can carry large segments of
DNA from other species, such as humans.
YACs can carry fragments of human DNA a million base pairs long,
whereas plasmids and viruses carry pieces only a few thousand
base pairs long.
1989: Microsatellites, New Genetic Markers
A microsatellite is a stretch of DNA
made of a two to four base-pair long
sequence that is repeated in tandem – e.g. a
stretch of DNA that looks like this:
CAGCAGCAGCAGCAGCAGCAG.
Weber, J.L., May, P.E. Abundant class of
human DNA polymorphisms which can be
typed using the polymerase chain reaction.
Am J Hum Genet, 44:388-96 1989.
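A tandem repeat such as CAGCAGCAG... is easy to find with a regular expression. The following is a rough sketch, not a genotyping tool: the function name, the `min_repeats` threshold and the test string are assumptions chosen for illustration, and only perfect repeats of a 2 to 4 base unit are reported.

```python
import re

def find_microsatellites(seq, min_repeats=4):
    """Return (start, unit, copies) for each perfect tandem repeat of a 2-4 bp unit."""
    # ([ACGT]{2,4}?) captures the shortest candidate unit; \1{n,} then requires
    # at least n further back-to-back copies of exactly that unit.
    pattern = re.compile(r"([ACGT]{2,4}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) > 1:  # skip single-base runs such as AAAAAAAA
            hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

example = "TTA" + "CAG" * 7 + "GT"  # the slide's CAG repeat with made-up flanks
print(find_microsatellites(example))
```

The number of copies is what varies between individuals, which is why the copy count is returned alongside the position.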
1989: Sequence-tagged Sites, Another Marker
A Common Language
for Physical Mapping of
the Human Genome.
Science, 245:1434-5.
1989
A sequence-tagged site (STS) is a
unique stretch of DNA that
polymerase chain reaction (PCR)
can easily detect. STSs are very
useful for making physical maps of
human chromosomes. Creating a
physical map is much like putting
together a large puzzle, where the
pieces of the puzzle are pieces of
DNA made by cutting up
chromosomes.
Office of Human Genome Research
1988
NIH establishes the Office of
Human Genome Research and
snags Watson (pictured) as its head.
Watson declares that 3% of the
genome budget should be devoted
to studies of social and ethical
issues.
1990: Launch of the Human
Genome Project

Watson, J.D., Jordan, E. The Human Genome
Program at the National Institutes of Health.
Genomics, 5: 654-56. 1989
Beginning in December 1984, the U.S.
Department of Energy (DOE), National Institutes
of Health (NIH) and international groups had
sponsored meetings to consider the feasibility and
usefulness of mapping and sequencing the human
genome.
The Human Genome Project

In 1990, DOE and NIH published a plan for the first five
years of what was projected to be a 15-year project. The
goals of the project included:

Mapping the human genome and eventually determining
the sequence of all 3.2 billion letters in it;

Mapping and sequencing the genomes of other organisms
important to the study of biology;

Developing technology for analyzing DNA;

Studying the ethical, legal and social implications of
genome research.
The human chromosomes
Genome
DNA Sequencing and work flow
Goals of the Human Genome Project
1. Map and sequence the human genome
– Build genetic and physical maps spanning the human
genome.
– Determine the sequence of the estimated 3 billion
letters of human DNA, to 99.99% accuracy.
– Chart variations in DNA spelling among human beings.
– Map all the human genes.
– Begin to label the functions of genes and other parts of
the genome.
Goals of the Human Genome Project
2. Map and sequence the genomes of model
organisms
– The bacterium E. coli (4.6 million)
– The yeast S. cerevisiae (12 million)
– The roundworm C. elegans (100 million)
– The fruitfly D. melanogaster (180 million)
– The mouse M. musculus (3 billion)
Goals of the Human Genome Project
3. Collect and distribute data
– Distribute genomic information and the tools for using
it to the research community.
– Release all sequence data that spans more than 2000
base pairs within 24 hours.
– Create and run databases.
– Develop software for large-scale DNA analysis.
– Develop tools for comparing and interpreting genome
information.
– Share information with the wider public.
Goals of the Human Genome Project
4. Study the ethical, legal and social implications of
genetic research
5. Train researchers
6. Develop technologies
– Make large-scale sequencing faster and cheaper.
– Develop technology for finding sequence variations.
– Develop ways to study functions of genes on a genomic
scale.
7. Transfer technology to the private sector
Time Line of the Human Genome Project
Standard Molecular Biology techniques –
running agarose gels.
CS-Packard DNA Production Robotic Systems (x 3)
Capillary Electrophoresis
Separation by Electro-osmotic Flow
Technology
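Exploring the essence of the life sciences

Stage 1: established the cellular basis of heredity: the chromosome.

Stage 2: defined the molecular basis of heredity: the DNA double helix.

Stage 3: unlocked the informational basis of heredity: the central dogma.
With the discovery of the biological mechanism by which cells read genetic information, and the invention of recombinant DNA cloning and sequencing technologies, scientists can use these techniques to explore the information contained in genes.

Stage 4: completed a great scientific program: the Human Genome Project,
which determined and analyzed the sequence of the 3.2 billion base pairs on the 24 distinct human chromosomes (22 autosomes plus X and Y).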
1998: Company Announces
Sequencing Plan
In May 1998, the company Celera Genomics
was formed to sequence much of the human
genome in three years. The company
used many HGP-generated resources, but unlike
the HGP, which built detailed maps before
sequencing defined regions, Celera used a
shotgun sequencing strategy, in which the
entire genome is fragmented and random
segments are sequenced and then put in order.
[Figure label: 1996 HGP start]
Shotgun Sequencing
Sequence → Chromatogram → Send to Computer → Assembled Sequence
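The "put in order" step of shotgun sequencing amounts to finding overlaps between random reads and merging them. The sketch below uses a simple greedy overlap merge on three invented reads; it is a teaching toy under those assumptions and not the assembler used by Celera or the HGP, which also had to handle sequencing errors, repeats and both strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads sampled from the made-up sequence ATGCGTACGTTAGC.
print(greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"]))
```

Greedy merging can misassemble repetitive regions, which is one reason map-based and shotgun strategies were debated at the time.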
Draft Sequences, 2001
International Human Genome Sequencing Consortium ('public project')
– Initial Sequencing and Analysis of the Human
Genome. Nature 409:860-921, 2001
Celera Genomics, Venter JC et al. ('private project')
– The Sequence of the Human Genome. Science
291:1304-1351, 2001.
Biochemistry → molecular biology → bioinformatics
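Biochemical representation of nucleotide sequences
Biochemical representation of amino acid sequences
Chemical structural formula of a stretch of protein sequence
Sequence in three-letter amino acid symbols: Ser-Gly-Tyr-Ala-Leu
Sequence in one-letter amino acid symbols: SGYAL
The N-terminus of the polypeptide is on the left and the C-terminus on the right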
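Converting between the two notations is a table lookup. A small sketch follows, with an invented function name; the mapping is the standard three-letter to one-letter amino acid code for the 20 common residues.

```python
# Standard three-letter -> one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(three_letter_sequence):
    """Convert 'Ser-Gly-Tyr' style notation (N- to C-terminus) to 'SGY'."""
    residues = three_letter_sequence.split("-")
    return "".join(THREE_TO_ONE[residue.capitalize()] for residue in residues)

print(to_one_letter("Ser-Gly-Tyr-Ala-Leu"))
```

Applied to the slide's example, the five residues collapse to the five-letter string shown there.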
Representation of nucleic acid sequences in bioinformatics data
Note that DNA and RNA are represented the same way: both U and T are written as T.
IUB/IUPAC nucleotide symbols
Symbol | Meaning
A | Adenosine
C | Cytidine
G | Guanosine
T | Thymidine
U | Uridine
R | G or A (purine)
Y | T or C (pyrimidine)
M | A or C (amino)
K | G or T (keto)
S | G or C (strong interaction)
W | A or T (weak interaction)
B | G, T or C (not A)
D | G, A or T (not C)
H | A, C or T (not G)
V | G, C or A (not T)
N | A, G, C or T (any base)
- | Gap of indeterminate length
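One practical use of the ambiguity symbols is matching a degenerate motif against a concrete sequence. The sketch below turns an IUPAC pattern into a regular expression; the function name and the motif `GAYTN` are made up for the example, and U is treated as T, following the convention noted above.

```python
import re

# Each IUPAC symbol mapped to the set of concrete bases it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regex over the letters A, C, G, T."""
    return re.compile("".join("[%s]" % IUPAC[symbol] for symbol in motif.upper()))

motif = iupac_to_regex("GAYTN")  # hypothetical motif: G, A, pyrimidine, T, any base
print(bool(motif.fullmatch("GACTA")), bool(motif.fullmatch("GAATG")))
```

The gap symbol is deliberately left out of the table, since a gap of indeterminate length has no fixed-width equivalent in such a pattern.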
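Representation of protein sequences in bioinformatics data
The primary structure of a protein is its amino acid sequence. In bioinformatics analysis,
protein sequence information is usually stored using one-letter symbols
rather than the three-letter form.
Myoglobin, a polypeptide chain of 154 amino acid residues,
is stored in bioinformatics databases as follows: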
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPE
TLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKK
GHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDF
GADAQGAMNKALELFRKDMASNYKELGFQG
Here the first letter, M (methionine, Met), is the N-terminus of myoglobin,
and the last letter, G (glycine, Gly), is its C-terminus.
Sequence data: FASTA format
>alpha 2 globin
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>gi|4504345|ref|NP_000508.1| alpha 2 globin [Homo sapiens]
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
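The fields of the header line are separated by "|": the GI number of the sequence is 4504345, its accession number
is NP_000508.1, its name is alpha 2 globin, and [Homo sapiens] indicates that it comes from
the human species.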
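Reading such a record programmatically takes only a few lines. The parser below is a bare-bones sketch with invented function names; it assumes the "gi|number|ref|accession| description" header layout shown on the slide, which is only one of several header styles found in FASTA files.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_gi_header(header):
    """Split a 'gi|...|ref|...| description' header into named fields."""
    fields = header.split("|")
    return {"gi": fields[1], "accession": fields[3], "description": fields[4].strip()}

sample = """>gi|4504345|ref|NP_000508.1| alpha 2 globin [Homo sapiens]
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF
KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
"""
header, sequence = parse_fasta(sample)[0]
print(split_gi_header(header), len(sequence))
```

For production work an established library parser is usually preferable; the point here is only that the format is a header line followed by wrapped sequence lines.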
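GenBank format
File forms of bioinformatics data
Flat file
– Information is stored sequentially in the file in a specific format
– Each record (entry) is uniquely identified by its accession number (accession #)
– Links within a file and between files are both made
through the accession #
Relational database (relational DB)
– Based on the entity-relationship model (E-R model)
– Each record (record/tuple) in a table is uniquely identified
by its key
– Tables are linked through foreign keys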
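The relational alternative can be illustrated with Python's built-in `sqlite3` module. The table and column names below (`organism`, `sequence`, `organism_id`) are hypothetical and chosen only for this sketch; the single data row reuses the alpha 2 globin record from the FASTA example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Each table has a primary key; `sequence` points at `organism` via a foreign key.
conn.execute(
    "CREATE TABLE organism ("
    " organism_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE sequence ("
    " accession TEXT PRIMARY KEY,"
    " description TEXT,"
    " organism_id INTEGER NOT NULL REFERENCES organism(organism_id))"
)
conn.execute("INSERT INTO organism VALUES (1, 'Homo sapiens')")
conn.execute("INSERT INTO sequence VALUES ('NP_000508.1', 'alpha 2 globin', 1)")

# A join follows the foreign key to combine information from both tables.
rows = conn.execute(
    "SELECT s.accession, s.description, o.name"
    " FROM sequence AS s JOIN organism AS o ON o.organism_id = s.organism_id"
).fetchall()
print(rows)
```

Here the accession number plays the same identifying role it has in a flat file, but the link to the organism is an explicit key rather than text to be parsed.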
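Information representation: relational database
[Figure: a query is resolved against relations and their attributes through semantic mapping and processing, giving a result.]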
Growth of PDB (1975-2005)
[Figure: yearly growth of total structures, 1975 to 2005; vertical axis 0 to 35,000. Source: http://nar.oxfordjournals.org/]
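Problems with bioinformatics data
Information sources are spread across different sites around the world
Global questions that involve several data sources cannot be answered immediately
– Painfully collecting unstructured information around the
sites
– Manually putting pieces together
– Hopefully getting the right picture...
In short, the information sources are:
– autonomous
– distributed
– heterogeneous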
Data Integration
[Figure: Site A and Site B exchange data as XML.]
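The most important task of bioinformatics is to extract new knowledge from massive amounts of data
III. Types of biological databases
Development directions of biological databases
Sequence databases
Major nucleic acid sequence databases:
GenBank, EMBL, DDBJ → INSDC
Major protein sequence databases:
Swiss-Prot, PIR → UniProt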
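Nucleic acid sequence databases

GenBank, the U.S. nucleic acid database [Benson, D.A. et al. (1998)
Nucleic Acids Res. 26, 1-7], has been under construction since 1979 and went into formal
operation in 1982;

The EMBL database of the European Molecular Biology Laboratory also began service in 1982;

Japan began building its national nucleic acid database, DDBJ, in 1984, and it entered
formal service in 1987.
Since then, DNA sequence data have grown from about a hundred sequences
totaling a few hundred thousand bases in the early 1980s to 11 billion bases today. In other words,
in only about 18 years the data volume has grown nearly 100,000-fold.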
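To put that growth figure in perspective, a fold increase over a period can be converted into a doubling time, assuming steady exponential growth (an assumption made here for illustration; the slide gives only the two endpoints). The function name is invented for the sketch.

```python
import math

def doubling_time(fold_increase, years):
    """Years per doubling, assuming constant exponential growth."""
    return years / math.log2(fold_increase)

# Roughly 100,000-fold growth in about 18 years, as quoted above.
print(round(doubling_time(1e5, 18), 2))
```

Under that assumption the archive doubled in a little over one year, since 100,000-fold growth corresponds to between 16 and 17 doublings.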
Nucleic acid sequence

A nucleic acid sequence is a sequence made up of the single-letter
symbols (A, T, G, C) of the four nucleotides.
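Protein sequence databases

SWISS-PROT and PIR are the two major international protein sequence
databases; both currently have mirror sites at EMBL and GenBank.
SWISS-PROT includes protein sequences translated from EMBL, which
are checked and annotated. The data in PIR are translated from GenBank DNA sequences
by the U.S. National Center for Biotechnology Information (NCBI).
In 1952 Sanger determined the sequence of a protein, insulin;
in 1977 Sanger and colleagues invented a technique for determining DNA sequences.
Protein databases are mainly translated from DNA databases and then annotated.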
Protein sequence

MNIQQLALQNIKGNWRNYKVFFLSSCFAIFASFAYMSV
IVHPYMKETMWYQNVRWGLIICNIIIISFFIIFILYSTSIFI
EARKKELGLYMLMGATKSNVIGVIMTEQMLIGVFANIF
GIGLGIIFLKLFFMVFSMLLGLPKELPIIFDVRAIGGTFIA
YMVVFVVLSFISALRIWNIKIIRLLKEFRTDKKEKKTSM
RLCIFGLICLGIGYALALQTTMPTIAFYFFPVSILVFFGT
YFSFTHGTAQILELIKRNKKIMYTYPYLFIVNQLSHRM
KENGRFFFLMSMATTFVVTATGTVFLYFSGMQDMWR
GGGVHSFSYIEKGTSSHEVFAEGMVEQLLHQYGYDDF
QSMSFVGVYASFQSSKGETEIATLMKESEYNQEARKQ
GQKTYHPKKGSVTLVYYNKYNHPNMYDQKEIQLQV
MNQTYSFVFNGQKEGIQFNYHPSQINGLFFVMHDEDF
DGIANKVPDSEKMIYRGYTLPNIENTKELNEDLRKHM
KQDDNNAFRSNMELYVNMKAFGDITLFVGSFISILFFL
TSCSIVYFKWFHNIASDRKEYGALSKLGMTKEEVWRIS
RWQLCMLFFAPIIVGSMHSAVALYTFHNTIFMDGSLRK
VGLFILFYIAACIMYFFFAQREYRKHLD

A protein sequence is a sequence made up of the single-letter
symbols of the 20 amino acids.
Genome databases

GDB: Human Genome Database

AceDB: nematode (Caenorhabditis elegans) genome database
IV. Database retrieval tools

Entrez

SRS
http://www.ncbi.nlm.nih.gov/Entrez/

Entrez: GenBank
SRS
(Sequence Retrieval System)
SRS is the main retrieval tool of EMBnet,
the European molecular biology network.
SRS, the Sequence Retrieval System, is a powerful database management
system developed specifically for biological databases. The goal of SRS
is to provide efficient access to databases with biological content,
whatever format they are available in, and to allow complex
search criteria.
The most important task of bioinformatics is to extract new knowledge from massive amounts of data
http://www.bioinfo.sdu.edu.cn
http://202.194.15.192/bioinfo/bio