Supplementary Figures Identification of latent biomarkers in hepatocellular carcinoma by ultradeep whole-transcriptome sequencing Kuan-Ting Lin1,#, Yih-Jyh Shann2,#, Gar-Yang Chau3, Chun-Nan Hsu4,5, and Chi-Ying F. Huang1,2,* 1. Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan 2. Institute of Biopharmaceutical Sciences, National Yang-Ming University, Taipei 112, Taiwan 3. Division of General Surgery, Department of Surgery, Taipei Veterans General Hospital, Taipei 112, Taiwan 4. USC/Information Sciences Institute, Marina del Rey, CA 90292, USA 5. Institute of Information Science, Academia Sinica, Taipei 116, Taiwan # These authors contributed equally to this work. * To whom correspondence should be addressed. Chi-Ying F. Huang, Ph.D. Institute of Biopharmaceutical Sciences, National Yang-Ming University, Taipei 112, Taiwan. Tel: 886-228267904; Fax: 886-228224045; E-mail: email@example.com Figure 1. Summary of mapped regions of RNA-Seq reads on the human genome. (a) The percentage of the human genome mapped by RNA-Seq reads grew as we sequenced more lanes of RNA-Seq. We show that Y% of the human genome has X number of supporting reads. For example, aligning 1 lane of normal reads (the dashed blue line at the very bottom) shows that ~3% of the human genome has at least 1 supporting read and ~2% of the human genome has at least 2 supporting reads. Likewise, 4 lanes of RNA-Seq reads from the sequenced liver tumor can cover ~7.59% of the human genome at most. (b) The percentages of categories of mapped regions shift from 1 lane to 4 lanes based on the Ensembl gene annotation (Homo_sapiens.GRCh37.65). At 1 lane, on average, 47% of the mapped area lies in exons, 39% is in the introns, and 14% is located in intergenic regions. However, at 4 lanes, which contain ~4 times more reads than 1 lane, the percentages changed significantly. The exonic proportion decreases from 47% to 28%, the intronic proportion increases from 39 to 53%, and the intergenic proportion increases from 14% to 18%. Figure 2. Summary of identified junctions from RNA-Seq. (a) According to the 5’ start site (5’ SS) and 3’ stop site (3’ SS) information, we categorized TopHat (TH) and MapSplice (MS) junctions into 6 subgroups: (1) known junction; (2) new junction; (3) new 3’ splice site; (4) new 5’ splice site; (5) intragenic; and (6) intergenic. The first two categories are junctions whose start and stop sites have been annotated in the Ensembl gene annotation. The only difference is that the new junction group uses different combinations of start and stop sites. The 3rd and 4th categories are junctions having new 3’ stop sites (acceptor site) or 5’ start sites (donor site). The 5th and 6th categories are junctions whose splice sites are both unknown. In general, MS reports more novel junctions than TH. In (b), we show the junctions with at least 10 supporting reads. Figure 3. The identification of DUNQU1 as a novel gene in the sequenced liver tumor. (a) The functional potential of DUNQU1. DUNQU1 was predicted to be 5438 bp long, and it has a relatively small coding region (colored in orange) across exon 1 (E1) and exon 2 (E2). The protein sequence is 101 amino acids and comprises three domains (IPR002591, IPR017849, and IPR017850) and one signal peptide sequence, as predicted by InterProScan. (b) DUNQU1 has three exons and two splicing isoforms: SP1 and SP2. SP1 has an additional exon in the middle. SP1 is colored darker, because it is more abundant than SP2 in terms of supporting reads. DUNQU1 is located on Chromosome band 16p11.2. A transcriptional factor binding site (TFBS) is located upstream of DUNQU1 (red arrow). At the same location, there is a DNase hypersensitivity cluster. Moreover, the first exon of DUNQU1 is conserved in several species, such as Xenopus tropicalis, Tetraodon, Fugu, stickleback, medaka, zebrafish, and lamprey. Figure 4. The domain structures of the two DUNQU1 isoforms. (a) SP1 (101 a.a.) and (b) SP2 (94 a.a.) are similar to each other and have the same predicted domain structures. Figure 5. Predicted DUNQU1 amino acid sequence reveals high sequence identity between DUNQU1 and the N-terminus of ENPP7. Figure 6. Identification of DUNQU1 transcriptional start site by 5’ RACE. (a) For the 5’-RACE strategy (Invitrogen 5'RACE system), we used primers DUN-4 and DUN-2 to perform 5’ RACE. We directly sequenced the PCR product with primer DUN-7. (b) Nucleotide sequences highlighted in yellow are the sequences confirmed by 5’ RACE. 5’ RACE detected a start site close to that predicted by RNA-Seq short reads. DUNQU1 DUN$1 SP1 E3 E2 E1 DUN$2 DUN$1 SP2 E3 E1 DUN$2 C T T T G T C A C C A T G A C C A G C C C C T G C C A C T T C A C C C T G G T T G T C A C C G A T T C T T T 30 T T A C A A A C T A T G T G C A A A C C A A T G A G C C A T C C A G G T A C A C G C C A A T T T G G T T C T G G A T T T A T G A A T A G C A T C T A T T C T T T C A T A G C A A A G T T A T T T T A C C T C A C A G A A A T T T A A A C G T C G G T T A T A T G A A C C A A G C A A G T G T A T G C Figure 7. DNA fragment of the predicted hybrid Figure 7. Direct Directsequencing sequencingofofa agel-eluted gel-eluted DNA fragment of the predicted hybrid form shows and SP2. Primary PCRform shows the the mixture mixtureof ofsignals signalsfrom fromboth bothDUNQU1 DUNQU1SP1 SP1 and SP2. Primary PCRamplified samples from patient cDNA were subjected to nested PCR with primers DUN-1 from patient cDNA were subjected to nested PCR with primers DUN-1 and DUN-2. The The predicted predicted hybrid hybrid form form was wasgel-eluted gel-elutedand anddirectly directlysequenced sequencedwith withprimer primer bar. The signal from DUN-1. The region region with with aa mixed mixedsignal signal(64 (64bp) bp)isismarked markedwith witha ared red bar. The signal from signal from thethe SP2 is composed composed of of SP2 SP2E3 E3bases bases1-63 1-63and andananadditional, additional,nonspecific nonspecificA A signal from in in agarose gelgel of DUNQU1 PCR PCR reaction. These These data dataconfirm confirmthat thatthe theextra extrasignal signal agarose of DUNQU1 PCR product is a hybrid form of SP1 and SP2. Inclusion Ratio Distribution Tumor Inclusion Ratio 1.00 0.75 0.50 0.25 0 0 0.25 0.50 0.75 1.00 Normal Inclusion Ratio CA-CA AA-AA AD-AD IR-IR CA-CS Figure 8. 1,003 AS events are considered to be significant by SpliceTrap. Each symbol represents an alternative splicing (AS) event such as CAssette exon (CA), Alternative Acceptor (AA), Alternative Donor (AD), and Intron Retention (IR). SpliceTrap  reports 1,003 AS events in 825 exons in 648 genes. The majority of the AS events are detected from exons with lower inclusion ratios (below 0.25). Figure 9. Alternative splicing events of FGFR2 and EXOC7. (a) Four of the exons of FGFR2 are shown. From right to left, they are exon 7, exon 8, exon 9, and exon 10. Exon 8 is the IIIb-specific exon, whereas exon 9 is IIIc-specific. The blue exon peaks are the alignment results of normal RNA-Seq reads, while the yellow ones are from tumor reads. The arcs with numbers in them show the number of junction reads bridging the two exons. For example, the junction of exon 7-exon 8 has 126 reads in normal and 22 reads in tumor tissue. Overall, in normal liver, exon 8 has a higher inclusion ratio. In contrast, exon 9 has a higher inclusion ratio in liver tumors. Therefore, the expression of FGFR2-IIIb decreased in the tumors. (b) From right to left, they are exon 6, exon 7, exon 8, exon 9, exon 10, exon 11, exon 12, and exon 13. The inclusion of exon 7 increased in the tumor liver. It suggests that EXOC7 switched its isoform from NM_001145298 to NM_001145299 in the tumor liver. NM_001013839 also increased slightly. Figure 10. The novel junction of TELO2 is highly expressed in normal but not tumor liver. Figure 11. FGFR2-IIIc inclusion ratios in cell lines and correlation with tumor sizes and recurrence. (a) By realtime PCR, we used ΔCt values (normalized to actin) to estimate FGFR2-IIIc inclusion ratios in cell lines. The inclusion ratios in normal/fetal cells are near 50% or below 50%, whereas, in tumor cells, the inclusion ratios are above 50% and near 90%. (b) By realtime PCR of 43 adjacent normal patient samples, we found that the FGFR2-IIIc inclusion ratio is correlated with tumor size. Here, we used a logarithmic trend line to fit the data points. (c) FGFR2-IIIc inclusion ratio change is the difference of tumor inclusion ratio minus normal inclusion ratio. If the inclusion ratio change is positive, it means that the inclusion ratio of FGFR2-IIIc in tumor liver increases. The violin plot shows that patients without recurrence have higher inclusion ratios of FGFR2-IIIc in the tumors while patients with recurrence has lower inclusion ratios. T-test showed that the two distributions are significantly different (p=0.03265). Figure 12. Expression patterns of LncRNAs in the sequenced tumor liver. Cancerassociated LncRNAs are grouped by their functions to cancer hallmarks. Low abundant means the coverage of the gene is below 5, so the expression pattern is not distinguishable. Figure 13. Different alignment settings have different sizes of genome mapped by reads. Here, we align our reads by Bowtie and report all valid alignments based on the two settings: (1) zero mismatch in seed region is allowed, and (2) the maximum of total quality scores of mismatches after the seed region is 70. (a) is the alignment results from all pair-end reads and (b) is a simulation for single-end reads, which are derived from the first end of our pair-end reads. (a) and (b) have the same amount of reads. These alignments don’t contain any junction reads. References   Mulder N, Apweiler R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 2007;396:59-70. Wu J, Akerman M, Sun S, McCombie WR, Krainer AR, Zhang MQ. SpliceTrap: a method to quantify alternative splicing under single cellular conditions. Bioinformatics 2011;27:3010-3016.