CSCE555 Bioinformatics
Lecture 2
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Roadmap
DNA, Chromosomes, Genomes
Genome Sequencing and whole genomes
DNA Sequence Representation, Models
Sequence Retrieval, Manipulation
Basic Analysis and Questions of Genomes
Summary
5/23/2017
Tools to Learn Concepts Quickly
Wikipedia.org
◦ Searching “Genome” brings up a lot of related
information
◦ In Google, type “keywords wiki”
Google search tips
◦ Find info from university websites
▪ Genome, site:edu
◦ Find info as PowerPoint files
▪ Genome, tutorial, filetype:ppt
DNA
Deoxyribonucleic
acid (DNA) is a
nucleic acid that
contains the genetic
instructions used in
the development and
functioning of all
known living
organisms.
DNA is a long polymer of simple units
called nucleotides
Bases
A: adenine
C: cytosine
G: guanine
T: thymine
Backbone:
sugars and phosphate groups
Microbial Genome: Clostridium sp. OhILAs
CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG
AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT
TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA
GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT
ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT
ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT
AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT
TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT
TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG
AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT
CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC
GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT
TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT
AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC
AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT
TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC
TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC
AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT
Complementary Base Pairing:
A-T
C-G
Write a program to export the complementary sequence?
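One possible answer to the exercise above, as a minimal Python sketch. The function names `complement` and `reverse_complement` are illustrative choices, not from the slides; the input is the start of the Clostridium sequence shown earlier.

```python
# Translation table for the pairing rules A-T and C-G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complementary strand read 5' to 3' (reversed)."""
    return complement(seq)[::-1]

print(complement("CTGCTGTACTAGG"))          # GACGACATGATCC
print(reverse_complement("CTGCTGTACTAGG"))  # CCTAGTACAGCAG
```

The reverse complement is usually the more useful form, because sequence databases write both strands in the 5' to 3' direction.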
Genome of organisms
The genome of an
organism is the
complete DNA
sequence of one set
of chromosomes
Sequencing: Basic Ideas
Current lab techniques can sequence small (say 700 base pairs) DNA
pieces.
◦ Use restriction enzymes to cut DNA pieces
◦ Sort pieces of different sizes using gel electrophoresis and use the sorting to
read them
Mapping and Walking
◦ Sequence one piece, get 700 letters, make a primer that allows you to read the
next 700, and work sequentially down the clone
◦ Estimate for human genome sequencing using this method: 100 years
Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing
genomes
◦ Obtain random sequence reads from a genome
◦ Assemble them into contigs on the basis of sequence overlaps
▪ Straightforward for simple genomes (with no or few repeat sequences)
▪ Merge reads containing overlapping sequence
Shotgun sequencing is more challenging for complex (repeat-rich)
genomes: two approaches
How Sequencing Works
Beckman CEQ 8000
Sequencing small DNA pieces
[Figure: gel electrophoresis table with one lane each for G, A, T and C; each band marks a fragment length]
Use DNA cloning or PCR to make
multiple copies.
Put them in 4 test tubes marked G, A, T and
C.
In test tube G, use restriction enzymes
that cut at G.
Repeat the above step for the other
test tubes.
Run gel electrophoresis separately on
the content of each test tube.
The results form a table of fragment
lengths per test tube.
Reading the table, we get that G has
lengths 1, 7, 12, 13, 19; A has lengths 2,
6, 8, 11, 14, 15, 16; T has lengths 4, 5, 9,
18; and C has lengths 3, 10, 17.
This gives us the sequence.
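The last step can be sketched in a few lines of Python: each fragment length is a position in the sequence, and the test tube it came from tells us the base at that position. The name `read_gel` is an illustrative choice; the lengths are the ones listed above.

```python
# Fragment lengths read from the gel, one list per test tube.
lengths = {
    "G": [1, 7, 12, 13, 19],
    "A": [2, 6, 8, 11, 14, 15, 16],
    "T": [4, 5, 9, 18],
    "C": [3, 10, 17],
}

def read_gel(lengths):
    """Place each base at the positions given by its fragment lengths."""
    base_at = {pos: base for base, positions in lengths.items() for pos in positions}
    return "".join(base_at[pos] for pos in sorted(base_at))

print(read_gel(lengths))  # GACTTAGATCAGGAAACTG
```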
Methods for very large scale sequencing
A hierarchical approach
◦ Map on a large scale (physical mapping), sequence
specific clones whose position in the genome is
known
Shotgun sequencing
◦ “Tear up” the genome and sequence random
fragments until it is done
Sequence tagged connectors (STC)
◦ Sequence the ends of many clones and use this info
to pick overlapping clones
“Shotgun” sequencing
Copy
Clone to sequence
Sequence and “assemble”
….GTCTACCTGTACTGATCTAGC...
…. CCTGTACTGATCTAGCATTA...
…. GTACTGATCTAGCATTACG...
Subclone
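The "assemble" step for the three reads above can be illustrated with a simplified Python sketch. It assumes the reads are already ordered left to right and error-free, which real assemblers cannot assume; `overlap` and `assemble` are hypothetical helper names.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def assemble(reads):
    """Merge reads in the given order by their longest overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read)
        contig += read[k:]  # append only the part that is new
    return contig

reads = [
    "GTCTACCTGTACTGATCTAGC",
    "CCTGTACTGATCTAGCATTA",
    "GTACTGATCTAGCATTACG",
]
print(assemble(reads))  # GTCTACCTGTACTGATCTAGCATTACG
```

With repeat-rich genomes the longest overlap is no longer a reliable guide, which is why this greedy idea breaks down for complex genomes.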
Emerging Sequence Methods
Sequencing by Hybridization (SBH)
Mass spectrometric sequencing
Direct visualization of single DNA molecules by atomic force microscopy (AFM)
Single molecule sequencing techniques
Single nucleotide cutting
Nanopore sequencing
Readout of cellular gene expression
Whole Genomes of Species
Bacterial Genomes
Eukaryotic Genomes
Human Genome Project
Other Animal and Plant Genomes
Model Genomes
The genomes of more than 180 organisms
have been sequenced since 1995
http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml
Sizes of Genomes
You will learn to download all these
genomes to your computer’s
hard drive
Refer to Table 1.1 Page 2 of Intro
to Comp Genomics book.
DNA Sequence Representation
DNA sequence: a string of letters with
alphabet {A, C, G, T}
Protein sequence: a string of amino acids
with alphabet
{ARNDCEQGHILKMFPSTWYV}
◦ 20 standard amino acids
Genetic code:
Genetic Code: Codon
DNA (ATCG)
RNA (AUCG)
Three bases of DNA
encode an amino
acid
Genetic Code with Degeneracy
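A small Python sketch can make the codon idea and its degeneracy concrete. It builds the standard genetic code from the conventional compact listing (bases in the order T, C, A, G, with `*` marking stop codons); `translate` is an illustrative helper, and reading starts at the first base with no search for a start codon.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring trailing bases."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Degeneracy: how many codons encode each amino acid.
degeneracy = Counter(CODON_TABLE.values())

print(translate("ATGGCCTAA"))            # MA*
print(degeneracy["L"], degeneracy["M"])  # 6 1
```

There are 64 codons for 20 amino acids plus stop, so most amino acids are encoded by more than one codon.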
Representation of Sequences
Single DNA sequence
◦ ATCCTTAAGGAAA
Multiple sequences with similarity
Regular Expression
ATAAA
ACAAAA
ATAAAAAA
A[TC]A+
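The pattern above can be checked directly with Python's `re` module. This is a minimal sketch; `matches` is an illustrative helper name, and `fullmatch` is used so that the whole sequence must fit the pattern.

```python
import re

# A, then T or C, then one or more A.
PATTERN = re.compile(r"A[TC]A+")

def matches(seq):
    """True if the entire sequence fits the pattern."""
    return PATTERN.fullmatch(seq) is not None

for seq in ["ATAAA", "ACAAAA", "ATAAAAAA", "AGAAA"]:
    print(seq, matches(seq))
```

The three sequences from the slide match; `AGAAA` does not, because G is not allowed in the second position.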
Representation of Sequences
Probabilistic Model: Position-specific
scoring matrices (PSSM)
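As a rough illustration of the idea, the sketch below builds a PSSM from a few aligned sequences of equal length. The example sequences, the pseudocount of 1, and the uniform background frequency of 0.25 are all assumptions made for illustration, not values from the slides.

```python
import math

def build_pssm(seqs, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """One dict of log2-odds scores per alignment column."""
    total = len(seqs) + pseudocount * len(alphabet)
    pssm = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        scores = {}
        for base in alphabet:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, seq):
    """Sum of the per-position scores for a sequence of the same length."""
    return sum(column[base] for column, base in zip(pssm, seq))

# Hypothetical aligned sequences.
aligned = ["ATAAA", "ACAAA", "ATAAA", "ATAAT"]
pssm = build_pssm(aligned)
print(score(pssm, "ATAAA"), score(pssm, "GGGGG"))
```

Unlike a regular expression, the matrix gives a graded score, so a sequence that deviates at one position is penalized rather than rejected outright.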
Representation of Sequence:
FASTA format
A text-based format for representing either
nucleic acid sequences or peptide
sequences
Allows for sequence names and comments
to precede the sequences
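A minimal FASTA reader might look like the Python sketch below. It assumes each record starts with a `>` header line followed by one or more sequence lines; the record names in the example are made up.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

# Hypothetical two-record FASTA text.
example = """>seq1 example DNA
CTGCTGTACTAGG
ATGCTGG
>seq2 example protein
MKKKNTILTG
"""
print(parse_fasta(example))
```

For real work, an established library such as Biopython is usually a better choice than a hand-written parser.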
Sequence Retrieval, Manipulation
Where to download genome/sequence
data
◦ Online databases: EMBL, GenBank
◦ Entrez cross-database search (life science
search engine)
◦ Google
Example: Download H. influenzae
Genome
First bacterial genome: H. influenzae,
1,830 kb
http://www.ncbi.nlm.nih.gov/sites/entrez
NC_007146
Haemophilus influenzae 86-028NP, complete genome
DNA; circular; Length: 1,914,490 nt
Replicon Type: chromosome
Created: 2005/06/27
Genome Information of H. influenzae
Download the Complete Genome Sequence in FASTA Format
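The slides download the genome through the Entrez web page. As an alternative, the Python sketch below uses NCBI's E-utilities `efetch` service, which is not mentioned in the slides. The helper names and the output file name are illustrative, and the download itself needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession):
    """Build an efetch URL that returns the record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def download_fasta(accession, path):
    """Save the FASTA record for `accession` to `path` (requires network)."""
    with urlopen(efetch_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())

print(efetch_url("NC_007146"))
# download_fasta("NC_007146", "h_influenzae.fasta")  # uncomment to download
```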
Simple Questions and Analysis of
Genome Sequence
Frequencies of bases A/C/G/T by simple
counting
Sliding windows to check local density
K-mers: frequent/unusual words
◦ 2-mers: AT AG AC TA TG TC etc.
◦ 3-mers
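Base frequencies and k-mer counts both come down to simple counting, as the Python sketch below shows. The function names are illustrative, and the input is the start of the Clostridium sequence shown earlier.

```python
from collections import Counter

def base_frequencies(seq):
    """Fraction of each base A, C, G, T in the sequence."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}

def kmer_counts(seq, k):
    """Count every overlapping word of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATC"
print(base_frequencies(seq))
print(kmer_counts(seq, 2).most_common(3))  # most frequent 2-mers
print(kmer_counts(seq, 3).most_common(3))  # most frequent 3-mers
```

Unusual words are the k-mers whose counts are far above or below what the base frequencies alone would predict.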
Genomic landscape: GC
content analysis
The overall GC content of the human
genome is 41%.
A plot of GC content versus number of
20 kb windows shows a broad profile
with skewing to the right.
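A windowed GC profile like the one described above can be computed with a short Python sketch. The function names are illustrative, and the tiny window in the example is only for demonstration; the human genome analysis used 20 kb windows.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window, step):
    """GC content of each full window, moving `step` bases at a time."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]

print(windowed_gc("GGGGAAAA", window=4, step=4))  # [1.0, 0.0]
```

For a whole genome one would call `windowed_gc(genome, 20000, 20000)` and plot a histogram of the returned values.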
Page 627
GC content of the human genome: mean 41%
Source: IHGSC (2001)
Fig. 17.15
Page 628
Genomic landscape: CpG islands
CpG dinucleotides are under-represented in genomic
DNA, occurring at one fifth of the expected frequency.
CpG dinucleotides are often methylated on cytosine (and
may subsequently be deaminated to thymine).
Methylated CpG residues are often associated with housekeeping genes in the promoter and exonic regions.
Methyl-CpG binding proteins recruit histone deacetylases
and are thus responsible for transcriptional repression.
They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
Broad genomic landscape: CpG
islands
Findings:
◦ 50,267 CpG islands in human genome
◦ 28,890 after masking repeats with
RepeatMasker
◦ 5-15 CpG islands per megabase
◦ (about <40 genes per megabase)
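The under-representation of CpG can be measured with an observed/expected ratio. The Python sketch below uses the commonly cited definition (number of CpG times sequence length, divided by number of C times number of G); this formula is an assumption added here and is not stated in the slides.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both bases; report 0
    return seq.count("CG") * len(seq) / (c * g)

print(cpg_obs_exp("CCGG"))  # 1.0
print(cpg_obs_exp("CGCG"))  # 2.0
```

A ratio near 0.2 corresponds to the one-fifth depletion mentioned above; CpG islands are stretches where the ratio is much higher than in the rest of the genome.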
Summary
DNA, Chromosome, Genome
Sequence models
Sequence database, retrieval
Whole genome sequence analysis
Slides Credits
Slides in this presentation are partially
based on slides from the
Internet.