Computational Analysis of
Transcript Identification Using
GenBank
Slides by Terry Clark
Differentiation of hematopoietic cells
Pluripotent stem cell
Myeloid: Neutrophil, Monocyte, Erythrocyte, Platelet, Eosinophil, Basophil
Lymphoid: B cell, T cell
Genome-wide gene expression
SAGE
(Serial Analysis of Gene Expression)
Figure 1 Schematic illustration of the SAGE process
Jes Stollberg et al. Genome Res. 2000; 10: 1241-1248
SAGE & GLGI Overview
mRNA
SAGE
quantitative analysis of expressed genes
by collecting tags
SPGI
match
identify most of the expressed genes
collect cDNA clones
match
GenBank
no match
multi-match
GLGI
extend tags into longer 3' cDNAs
single-match
Gene identification
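The match step in the overview can be sketched in a few lines of Python. This is only an illustration of the three outcomes named on the slide, not the pipeline's actual code: the function name `route_tag` and the dictionary layout are assumptions, and the two reference entries are the gene lists that line up with those tags in the Hashimoto et al. tag table in these slides.

```python
def route_tag(tag, reference):
    """Classify a SAGE tag by how many reference genes it matches.

    `reference` is a hypothetical dict mapping tag -> list of gene IDs,
    standing in for a GenBank-derived tag reference.
    Returns (match class, next step in the overview).
    """
    genes = reference.get(tag, [])
    if len(genes) == 1:
        return "single-match", "gene identification"
    if not genes:
        return "no match", "GLGI"
    return "multi-match", "GLGI"


# Toy reference with two entries (layout is an assumption).
reference = {
    "CCAGAACAGA": ["Hs.111222"],
    "TTGGTCCTCT": ["Hs.12328", "Hs.108124", "Hs.9739", "Hs.112845"],
}

for tag in ("CCAGAACAGA", "TTGGTCCTCT", "AAAAAAAAAA"):
    print(tag, route_tag(tag, reference))
```

Tags with no match or with several matches go to GLGI for extension; only single-match tags identify a gene directly.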
What is the chance of duplicate tags?
• We can assume we are drawing randomly
from the set of all sequences over the 4-letter
alphabet (A, C, G, T) of the given tag length
• This is the same problem as having unique
overlaps in the contig matching problem for
shotgun sequencing
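Under the uniform assumption above, the duplicate-tag question is the classic birthday problem over 4^L possible tags of length L. A minimal sketch follows; the function names are illustrative and not from the slides.

```python
import math


def tag_space(tag_len):
    """Number of distinct sequences of length tag_len over A, C, G, T."""
    return 4 ** tag_len


def prob_duplicate(n_tags, tag_len):
    """P(at least two of n_tags uniform random draws are identical)."""
    space = tag_space(tag_len)
    if n_tags > space:
        return 1.0  # pigeonhole: more draws than possible tags
    # log P(all distinct) = sum of log(1 - i/space) for i = 0 .. n_tags-1
    log_all_distinct = sum(math.log1p(-i / space) for i in range(n_tags))
    return 1.0 - math.exp(log_all_distinct)


def expected_distinct(n_tags, tag_len):
    """Expected number of distinct tags seen after n_tags uniform draws."""
    space = tag_space(tag_len)
    # space * (1 - (1 - 1/space)**n_tags), computed stably
    return space * -math.expm1(n_tags * math.log1p(-1.0 / space))


print(tag_space(10))                  # 4**10 = 1048576 possible 10-base tags
print(prob_duplicate(2000, 10))       # chance of any duplicate among 2000 draws
print(expected_distinct(100000, 10))  # distinct tags expected from 100000 draws
```

Even a few thousand uniformly drawn 10-base tags make at least one collision likely, so duplicates are expected before any biological bias is taken into account.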
Random Model
Random model does not reflect
biological process
• Genes evolve by duplication as well as
point mutation
• Many motifs are repeated
• Function widgets at work?
• The result is a strong bias in observed
biological sequences, not the uniform
distribution the simple model assumes.
• Here are some numbers …
SAGE tags match to many genes
(Tags from Hashimoto S, et al. Blood 94:837, 1999)
Tags        Matched gene numbers  Matched genes (only show up to 10)
CCTGTAATCC  405  Hs.267557,Hs.240615,Hs.231705,Hs.283045,Hs.236713,Hs.232277,Hs.181553,Hs.262716,Hs.181392,Hs.220696
GTGAAACCCC  305  Hs.282868,Hs.170225,Hs.184220,Hs.194021,Hs.231625,Hs.171830,Hs.270571,Hs.270572,Hs.272193,Hs.283921
CCACTGCACT  174  Hs.118778,Hs.256868,Hs.96023,Hs.31575,Hs.47517,Hs.200451,Hs.271222,Hs.253240,Hs.270018,Hs.270415
ACTTTTTCAA  44   Hs.16426,Hs.10669,Hs.75155,Hs.28166,Hs.13975,Hs.79136,Hs.111334,Hs.133430,Hs.79356,Hs.239100
TTGGGGTTTC  9    Hs.231375,Hs.273127,Hs.275603,Hs.175173,Hs.276612,Hs.224773,Hs.62954,Hs.182771,Hs.276326
TGCACGTTTT  8    Hs.199160,Hs.279943,Hs.36927,Hs.5338,Hs.169793,Hs.83450,Hs.173902,Hs.183506
TGTGTTGAGA  5    Hs.284136,Hs.275865,Hs.275221,Hs.274466,Hs.181165
CCCGTCCGGA  5    Hs.276353,Hs.277498,Hs.277573,Hs.276350,Hs.180842
TTGGTCCTCT  4    Hs.12328,Hs.108124,Hs.9739,Hs.112845
CTGACCTGTG  3    Hs.277477,Hs.181244,Hs.77961
TACCTGCAGA  3    Hs.100000,Hs.256957,Hs.253884
AGGCTACGGA  3    Hs.119122,Hs.211582,Hs.183297
GGGCTGGGGT  3    Hs.183698,Hs.118757,Hs.90436
CCCTGGGTTC  2    Hs.52891,Hs.111334
CACAAACGGT  2    Hs.2043,Hs.195453
GTGAAGGCAG  2    Hs.4221,Hs.77039
GGGCATCTCT  2    Hs.75061,Hs.76807
ATGGCTGGTA  2    Hs.254246,Hs.182426
CGCCGCCGGC  2    Hs.182825,Hs.132753
AGGGCTTCCA  2    Hs.29797,Hs.276544
TTGGTGAAGG  2    Hs.278674,Hs.75968
GTGGCCACGG  1    Hs.112405
GTTCACATTA  1    Hs.84298
TGGTGTTGAG  1    Hs.275865
CCCATCGTCC  1    Hs.151604
GTTGTGGTTA  1    Hs.75415
TTGTAATCGT  1    Hs.125078
CCCACAACCT  1    Hs.252136
GAGGGAGTTT  1    Hs.76064
CCAGAACAGA  1    Hs.111222
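A table of this shape can be produced from a flat list of (tag, gene) matches. The sketch below assumes such a list is already available; the function name `matched_gene_table` is hypothetical, and the input pairs are three of the shorter gene lists from the slide data above.

```python
from collections import defaultdict


def matched_gene_table(pairs, max_shown=10):
    """Group (tag, gene_id) pairs into rows of (tag, gene count, genes shown).

    Rows are sorted by descending gene count; at most max_shown gene IDs
    are kept per row, mirroring the "only show up to 10" note.
    """
    genes_by_tag = defaultdict(list)
    for tag, gene in pairs:
        if gene not in genes_by_tag[tag]:  # ignore repeated matches
            genes_by_tag[tag].append(gene)
    rows = [(tag, len(genes), genes[:max_shown])
            for tag, genes in genes_by_tag.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


pairs = [
    ("CCAGAACAGA", "Hs.111222"),
    ("CTGACCTGTG", "Hs.277477"), ("CTGACCTGTG", "Hs.181244"),
    ("CTGACCTGTG", "Hs.77961"),
    ("TTGGTCCTCT", "Hs.12328"), ("TTGGTCCTCT", "Hs.108124"),
    ("TTGGTCCTCT", "Hs.9739"), ("TTGGTCCTCT", "Hs.112845"),
]

for tag, count, genes in matched_gene_table(pairs):
    print(tag, count, ",".join(genes))
```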
Tag Frequency Groups for 10-base Tag Set
Containing 878,938 Tags for UniGene Human
Unique Tags among 878,938 EST Derived Tags
Unique Tags among 32,851 Gene Derived Tags
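Frequency groups like those named in these figure titles can be computed with two counting passes. The sketch below is an assumption-laden illustration: it treats a "unique" tag as one that occurs exactly once in the tag set, which the slides do not define, and it uses a toy tag list rather than the UniGene data.

```python
from collections import Counter


def frequency_groups(tags):
    """Return (occurrences per tag, number of distinct tags per occurrence count)."""
    per_tag = Counter(tags)             # tag -> how many times it occurs
    groups = Counter(per_tag.values())  # occurrence count -> number of tags
    return per_tag, groups


def unique_fraction(tags):
    """Fraction of distinct tags that occur exactly once (assumed meaning of 'unique')."""
    per_tag = Counter(tags)
    if not per_tag:
        return 0.0
    singletons = sum(1 for count in per_tag.values() if count == 1)
    return singletons / len(per_tag)


# Toy input: one tag seen twice, two tags seen once.
tags = ["CCTGTAATCC", "CCTGTAATCC", "GTGAAACCCC", "CCAGAACAGA"]
per_tag, groups = frequency_groups(tags)
print(dict(groups))
print(unique_fraction(tags))
```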
Converting tag into longer 3’ sequence
5' end
3' end
SAGE tag
3' longer sequence
3' end
Generation of Longer 3' cDNA for Gene Identification (GLGI)
SAGE tag: NNNNNNNNNN / nnnnnnnnnn (10 bases)
Sense extension; antisense extension
3' end, both strands: TAAAAAAAAAAACTCGCCGGCGAA / ATTTTTTTTTTTGAGCGGCCGCTT
Shorter lower strand shown in the later steps: TGAGCGGCCGCTT
Extended product: hundred bases
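GLGI itself is a laboratory procedure, but its effect (a 10-base tag becomes a longer 3' sequence that can be matched unambiguously) can be mimicked in silico when a candidate cDNA sequence is at hand. The sketch below is only that analogy: `extend_tag`, the `min_poly_a` cutoff, and the toy cDNA are assumptions, not part of the slides.

```python
def extend_tag(tag, cdna, min_poly_a=8):
    """Return the stretch of cdna from the tag up to the poly(A) tail.

    Finds the first occurrence of the tag, then the first run of at least
    min_poly_a A's after it. Returns None if the tag is absent. If no
    poly(A) run is found, the sequence is returned through its 3' end.
    """
    start = cdna.find(tag)
    if start == -1:
        return None
    tail = cdna.find("A" * min_poly_a, start + len(tag))
    end = len(cdna) if tail == -1 else tail
    return cdna[start:end]


# Toy cDNA: upstream bases, a 10-base tag, a short 3' stretch, a poly(A) tail.
tag = "CCTGTAATCC"
cdna = "GGCATG" + tag + "GATTACAGATTACG" + "A" * 12
print(extend_tag(tag, cdna))
```

A real extension would be hundreds of bases long, as the slide notes; the toy sequence is short only to keep the example readable.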
UniGene Human 3’ Part Length Distribution
Myeloid Tag Matches with UniGene Human SAGE
Tag Reference Database
SAGE Tag Processing with GIST
k-mer tree
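The slide names a k-mer tree without showing its layout. One common reading is a trie over the DNA alphabet whose depth-k nodes record which sequences contain each k-mer. The sketch below follows that reading; it is not GIST's implementation, and the class and method names are hypothetical.

```python
class KmerTree:
    """Trie over A/C/G/T; a node at depth k stores IDs of sequences with that k-mer."""

    def __init__(self, k):
        self.k = k
        self.root = {}

    def add(self, kmer, seq_id):
        """Insert one k-mer and remember which sequence it came from."""
        node = self.root
        for base in kmer:
            node = node.setdefault(base, {})
        # "ids" cannot collide with a child key, since child keys are single bases
        node.setdefault("ids", set()).add(seq_id)

    def add_sequence(self, seq_id, seq):
        """Insert every overlapping k-mer of seq."""
        for i in range(len(seq) - self.k + 1):
            self.add(seq[i:i + self.k], seq_id)

    def lookup(self, kmer):
        """Return the set of sequence IDs containing kmer (empty if none)."""
        node = self.root
        for base in kmer:
            node = node.get(base)
            if node is None:
                return set()
        return set(node.get("ids", ()))


tree = KmerTree(4)
tree.add_sequence("s1", "ACGTAC")
tree.add_sequence("s2", "CGTACG")
print(sorted(tree.lookup("CGTA")))
print(sorted(tree.lookup("TTTT")))
```

With k set to the tag length, a lookup walks at most k nodes regardless of how many reference sequences are indexed, which is the usual reason to prefer a tree over scanning each sequence.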
GIST Performance with Improved IO
Conspirators
Terry Clark
Andrew Huntwork
Josef Jurek
L. Ridgway Scott
Sanggyu Lee
Janet D. Rowley
San Ming Wang