Download Supplementary Figures (doc 9746K)

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Artificial gene synthesis wikipedia, lookup

Site-specific recombinase technology wikipedia, lookup

Genome evolution wikipedia, lookup

RNA-Seq wikipedia, lookup

Therapeutic gene modulation wikipedia, lookup

NEDD9 wikipedia, lookup

Helitron (biology) wikipedia, lookup

Genomics wikipedia, lookup

Human genome wikipedia, lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia, lookup

Genome editing wikipedia, lookup

Metagenomics wikipedia, lookup

Pathogenomics wikipedia, lookup

Genomic library wikipedia, lookup

Molecular Inversion Probe wikipedia, lookup

Bisulfite sequencing wikipedia, lookup

Whole genome sequencing wikipedia, lookup

Oncogenomics wikipedia, lookup

SNP genotyping wikipedia, lookup

Human Genome Project wikipedia, lookup

Holliday junction wikipedia, lookup

Human–animal hybrid wikipedia, lookup

NUMT wikipedia, lookup

Alternative splicing wikipedia, lookup

Transcript
Supplementary Figures
Identification of latent biomarkers in hepatocellular carcinoma by ultradeep whole-transcriptome sequencing
Kuan-Ting Lin1,#, Yih-Jyh Shann2,#, Gar-Yang Chau3, Chun-Nan Hsu4,5, and
Chi-Ying F. Huang1,2,*
1. Institute of Biomedical Informatics, National Yang-Ming University, Taipei
112, Taiwan
2. Institute of Biopharmaceutical Sciences, National Yang-Ming University,
Taipei 112, Taiwan
3. Division of General Surgery, Department of Surgery, Taipei Veterans
General Hospital, Taipei 112, Taiwan
4. USC/Information Sciences Institute, Marina del Rey, CA 90292, USA
5. Institute of Information Science, Academia Sinica, Taipei 116, Taiwan
# These authors contributed equally to this work.
* To whom correspondence should be addressed.
Chi-Ying F. Huang, Ph.D.
Institute of Biopharmaceutical Sciences, National Yang-Ming University,
Taipei 112, Taiwan. Tel: 886-228267904; Fax: 886-228224045; E-mail:
cyhuang5@ym.edu.tw
Figure 1. Summary of mapped regions of RNA-Seq reads on the human genome. (a)
The percentage of the human genome mapped by RNA-Seq reads grew as we sequenced
more lanes of RNA-Seq. We show that Y% of the human genome has X number of
supporting reads. For example, aligning 1 lane of normal reads (the dashed blue line at
the very bottom) shows that ~3% of the human genome has at least 1 supporting read and
~2% of the human genome has at least 2 supporting reads. Likewise, 4 lanes of RNA-Seq
reads from the sequenced liver tumor can cover ~7.59% of the human genome at most. (b)
The percentages of categories of mapped regions shift from 1 lane to 4 lanes based on
the Ensembl gene annotation (Homo_sapiens.GRCh37.65). At 1 lane, on average, 47% of
the mapped area lies in exons, 39% is in the introns, and 14% is located in intergenic
regions. However, at 4 lanes, which contain ~4 times more reads than 1 lane, the
percentages changed significantly. The exonic proportion decreases from 47% to 28%, the
intronic proportion increases from 39 to 53%, and the intergenic proportion increases from
14% to 18%.
Figure 2. Summary of identified junctions from RNA-Seq. (a) According to the 5’ start
site (5’ SS) and 3’ stop site (3’ SS) information, we categorized TopHat (TH) and
MapSplice (MS) junctions into 6 subgroups: (1) known junction; (2) new junction; (3) new
3’ splice site; (4) new 5’ splice site; (5) intragenic; and (6) intergenic. The first two
categories are junctions whose start and stop sites have been annotated in the Ensembl
gene annotation. The only difference is that the new junction group uses different
combinations of start and stop sites. The 3rd and 4th categories are junctions having new
3’ stop sites (acceptor site) or 5’ start sites (donor site). The 5th and 6th categories are
junctions whose splice sites are both unknown. In general, MS reports more novel
junctions than TH. In (b), we show the junctions with at least 10 supporting reads.
Figure 3. The identification of DUNQU1 as a novel gene in the sequenced liver
tumor. (a) The functional potential of DUNQU1. DUNQU1 was predicted to be 5438 bp
long, and it has a relatively small coding region (colored in orange) across exon 1 (E1) and
exon 2 (E2). The protein sequence is 101 amino acids and comprises three domains
(IPR002591, IPR017849, and IPR017850) and one signal peptide sequence, as predicted
by InterProScan.[1] (b) DUNQU1 has three exons and two splicing isoforms: SP1 and
SP2. SP1 has an additional exon in the middle. SP1 is colored darker, because it is more
abundant than SP2 in terms of supporting reads. DUNQU1 is located on Chromosome
band 16p11.2. A transcriptional factor binding site (TFBS) is located upstream of DUNQU1
(red arrow). At the same location, there is a DNase hypersensitivity cluster. Moreover, the
first exon of DUNQU1 is conserved in several species, such as Xenopus tropicalis,
Tetraodon, Fugu, stickleback, medaka, zebrafish, and lamprey.
Figure 4. The domain structures of the two DUNQU1 isoforms. (a) SP1 (101 a.a.) and
(b) SP2 (94 a.a.) are similar to each other and have the same predicted domain structures.
Figure 5. Predicted DUNQU1 amino acid sequence reveals high sequence identity
between DUNQU1 and the N-terminus of ENPP7.
Figure 6. Identification of DUNQU1 transcriptional start site by 5’ RACE. (a) For the
5’-RACE strategy (Invitrogen 5'RACE system), we used primers DUN-4 and DUN-2 to
perform 5’ RACE. We directly sequenced the PCR product with primer DUN-7. (b)
Nucleotide sequences highlighted in yellow are the sequences confirmed by 5’ RACE. 5’
RACE detected a start site close to that predicted by RNA-Seq short reads.
DUNQU1
DUN$1
SP1
E3
E2
E1
DUN$2
DUN$1
SP2
E3
E1
DUN$2
C
T
T
T
G
T C A C
C A
T G
A C
C A
G
C
C
C
C
T G
C
C A C
T
T
C A C
C
C
T G
G
T
T
G T
C A C
C
G
A T
T C
T
T
T
30
T
T
A
C A A
A C T
A T
G
T
G
C
A
A A C
C A
A
T
G
A
G C
C
A
T
C
C A
G
G
T
A
C
A C
G
C
C A
A T
T
T
G
G
T
T
C
T
G
G
A
T
T
T
A T
G
A A
T
A G
C
A T
C
T A
T
T
C T
T
T
C
A
T A
G
C A
A A
G
T
T A
T
T
T
T
A
C
C
T
C A
C A
G
A A
A T
T
T A
A
A C
G T C G G
T
T
A
T
A
T
G A A
C
C
A
A
G
C
A
A
G
T
G
T
A
T G C
Figure 7.
DNA
fragment
of the
predicted
hybrid
Figure
7. Direct
Directsequencing
sequencingofofa agel-eluted
gel-eluted
DNA
fragment
of the
predicted
hybrid
form shows
and
SP2.
Primary
PCRform
shows the
the mixture
mixtureof
ofsignals
signalsfrom
fromboth
bothDUNQU1
DUNQU1SP1
SP1
and
SP2.
Primary
PCRamplified samples from
patient
cDNA
were
subjected
to
nested
PCR
with
primers
DUN-1
from patient cDNA were subjected to nested PCR with primers DUN-1
and DUN-2. The
The predicted
predicted hybrid
hybrid form
form was
wasgel-eluted
gel-elutedand
anddirectly
directlysequenced
sequencedwith
withprimer
primer
bar.
The
signal
from
DUN-1. The region
region with
with aa mixed
mixedsignal
signal(64
(64bp)
bp)isismarked
markedwith
witha ared
red
bar.
The
signal
from
signal
from
thethe
SP2 is composed
composed of
of SP2
SP2E3
E3bases
bases1-63
1-63and
andananadditional,
additional,nonspecific
nonspecificA A
signal
from
in in
agarose
gelgel
of DUNQU1
PCR
PCR reaction. These
These data
dataconfirm
confirmthat
thatthe
theextra
extrasignal
signal
agarose
of DUNQU1
PCR
product is a hybrid form of SP1 and SP2.
Inclusion Ratio Distribution
Tumor Inclusion Ratio
1.00
0.75
0.50
0.25
0
0
0.25
0.50
0.75
1.00
Normal Inclusion Ratio
CA-CA
AA-AA
AD-AD
IR-IR
CA-CS
Figure 8. 1,003 AS events are considered to be significant by SpliceTrap. Each
symbol represents an alternative splicing (AS) event such as CAssette exon (CA),
Alternative Acceptor (AA), Alternative Donor (AD), and Intron Retention (IR). SpliceTrap [2]
reports 1,003 AS events in 825 exons in 648 genes. The majority of the AS events are
detected from exons with lower inclusion ratios (below 0.25).
Figure 9. Alternative splicing events of FGFR2 and EXOC7. (a) Four of the exons of
FGFR2 are shown. From right to left, they are exon 7, exon 8, exon 9, and exon 10. Exon
8 is the IIIb-specific exon, whereas exon 9 is IIIc-specific. The blue exon peaks are the
alignment results of normal RNA-Seq reads, while the yellow ones are from tumor reads.
The arcs with numbers in them show the number of junction reads bridging the two exons.
For example, the junction of exon 7-exon 8 has 126 reads in normal and 22 reads in tumor
tissue. Overall, in normal liver, exon 8 has a higher inclusion ratio. In contrast, exon 9 has
a higher inclusion ratio in liver tumors. Therefore, the expression of FGFR2-IIIb decreased
in the tumors. (b) From right to left, they are exon 6, exon 7, exon 8, exon 9, exon 10, exon
11, exon 12, and exon 13. The inclusion of exon 7 increased in the tumor liver. It suggests
that EXOC7 switched its isoform from NM_001145298 to NM_001145299 in the tumor
liver. NM_001013839 also increased slightly.
Figure 10. The novel junction of TELO2 is highly expressed in normal but not tumor
liver.
Figure 11. FGFR2-IIIc inclusion ratios in cell lines and correlation with tumor sizes
and recurrence. (a) By realtime PCR, we used ΔCt values (normalized to actin) to
estimate FGFR2-IIIc inclusion ratios in cell lines. The inclusion ratios in normal/fetal cells
are near 50% or below 50%, whereas, in tumor cells, the inclusion ratios are above 50%
and near 90%. (b) By realtime PCR of 43 adjacent normal patient samples, we found that
the FGFR2-IIIc inclusion ratio is correlated with tumor size. Here, we used a logarithmic
trend line to fit the data points. (c) FGFR2-IIIc inclusion ratio change is the difference of
tumor inclusion ratio minus normal inclusion ratio. If the inclusion ratio change is positive, it
means that the inclusion ratio of FGFR2-IIIc in tumor liver increases. The violin plot shows
that patients without recurrence have higher inclusion ratios of FGFR2-IIIc in the tumors
while patients with recurrence has lower inclusion ratios. T-test showed that the two
distributions are significantly different (p=0.03265).
Figure 12. Expression patterns of LncRNAs in the sequenced tumor liver. Cancerassociated LncRNAs are grouped by their functions to cancer hallmarks. Low abundant
means the coverage of the gene is below 5, so the expression pattern is not
distinguishable.
Figure 13. Different alignment settings have different sizes of genome mapped by
reads. Here, we align our reads by Bowtie and report all valid alignments based on the
two settings: (1) zero mismatch in seed region is allowed, and (2) the maximum of total
quality scores of mismatches after the seed region is 70. (a) is the alignment results from
all pair-end reads and (b) is a simulation for single-end reads, which are derived from the
first end of our pair-end reads. (a) and (b) have the same amount of reads. These
alignments don’t contain any junction reads.
References
[1]
[2]
Mulder N, Apweiler R. InterPro and InterProScan: tools for protein sequence
classification and comparison. Methods Mol Biol 2007;396:59-70.
Wu J, Akerman M, Sun S, McCombie WR, Krainer AR, Zhang MQ. SpliceTrap: a
method to quantify alternative splicing under single cellular conditions.
Bioinformatics 2011;27:3010-3016.