Metatranscriptomics Tutorial
Overview
This tutorial will take you through various steps of processing metatranscriptomic data. These
steps are as follows:
1. Remove adaptor sequences. These are added during library preparation and sequencing steps
and can be generated during sequencing runs
2. Remove low quality sequences.
3. Remove abundant rRNA sequences which can dominate the dataset
4. Remove host reads (if exploring a microbiome in which host is an issue)
5. Assemble the reads into contigs to improve annotation quality
6. Annotate reads to known genes
7. Map the known genes to E. coli homologs to facilitate network visualization
8. Generate normalized expression values associated with each gene
9. Visualize the results
We are currently working on a single script that will run the entire process, taking the fastq sequence
files generated by the sequencer and producing files that can be read by the visualization tool. Here we
go through the individual steps by hand to illustrate the complexity of the process and the underlying
tools and files used and generated by the pipeline.
To illustrate the process we are going to use sequence reads generated from the rumen of a cow.
These are 100bp paired end reads - single end reads can also be used, but paired end reads can
increase sequence length if there is significant overlap and consequently improve annotation
quality.
Rather than use the entire set of 14 million reads, which might take several days to process on a
desktop, the tutorial will take you through processing a subset of 100,000 reads.
Preliminaries
Preinstalled software and tools
You need to install all required tools and set the proper environment variables. The tools used
are as follows:
1) Perl
2) bioperl
3) Python
4) cross_match
5) Trinity
6) BWA
7) SAMTOOLS
8) BLAT
9) BLAST+
10) Cytoscape
Pre-downloaded databases
You need to download several reference databases as follows:
1) UniVec_Core (ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core)
2) Bacteria_rRNA.fasta – rRNA genes collated from the SILVA, Greengenes and NCBI
databases
3) Rfam (http://rfam.xfam.org/)
4) cow_cds.fa
(ftp://ftp.ensembl.org/pub/current_fasta/bos_taurus/cds/Bos_taurus.UMD3.1.cds.all.fa.gz).
5) microbial_all_cds.fasta – download from NCBI
(ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ffn.tar.gz), and then merge all ffn files into a
single file.
6) nr – an indexed Non-Redundant (NR) protein database
(ftp://ftp.ncbi.nih.gov/blast/db/nr.*.tar.gz).
7) bacterial_nr – download from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.faa.tar.gz),
and then merge all faa files into a single file.
8) EcoliMG1655.fsa – downloaded from NCBI
Input sample files
Our data set consists of 100 bp paired-end Illumina reads from cow rumen. To inspect their
contents:
less cow1.fastq
less cow2.fastq
Processing the Reads
To ensure compatibility with downstream software we first need to reformat the headers of the
paired-end reads so that the 5' and 3' reads are assigned matching sequence identifiers,
e.g. 5' reads are marked with a trailing '/1' while 3' reads are marked with a trailing '/2'.
perl main_add_subID_reads_fastq.pl cow
Notes:

check input file: 'less cow1.fastq'
o @SRR594215.2 FCFC81EB6ABXX:7:1101:1495:2185 length=200/1
o TGTACCTTGAGAGGAAGCACCGGCAAACTTCGTGCCAGGAGCCGCGGTAATACGAGGGGTGC
AAGCGTTGTTCGGAATTACTGGGCGGACAGGGAGAGGT

check output file: 'less cow1_new.fastq'
o @SRR594215.2/1
o TGTACCTTGAGAGGAAGCACCGGCAAACTTCGTGCCAGGAGCCGCGGTAATACGAGGGGTGC
AAGCGTTGTTCGGAATTACTGGGCGGACAGGGAGAGGT
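The header change shown in the notes above can be sketched in a few lines of Python. This is only an illustration of the transformation, not the code of main_add_subID_reads_fastq.pl; the function name add_pair_suffix is hypothetical and the sketch assumes strict four-line FASTQ records.

```python
def add_pair_suffix(fastq_lines, mate):
    """Yield FASTQ lines with each header cut to its first token plus '/<mate>'.

    Assumes every record occupies exactly four lines (header, sequence,
    separator, qualities); this is an assumption, not a FASTQ guarantee.
    """
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            read_id = line.split()[0]  # e.g. '@SRR594215.2'
            yield f"{read_id}/{mate}"
        else:
            yield line


# One record from the 5' read file (mate 1), truncated for brevity.
record = [
    "@SRR594215.2 FCFC81EB6ABXX:7:1101:1495:2185 length=200/1",
    "TGTACCTTGAGAGG",
    "+",
    "IIIIIIIIIIIIII",
]
renamed = list(add_pair_suffix(record, 1))
print(renamed[0])
```

Running the same function with mate set to 2 on the second file gives both mates of a pair the same identifier apart from the trailing '/1' or '/2'.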
Step 1. Remove adaptor sequences
Here we rely on the software tool cross_match, which is part of the Phrap suite of tools
originally developed in the late 90's (http://www.phrap.org). Trimmomatic (see below) offers a
reasonable alternative.
In order to run cross_match, we first need to convert the read files from fastq format to fasta format:
python fastq2fastaqual_64.py cow1_new.fastq
python fastq2fastaqual_64.py cow2_new.fastq

output files: 'cow1_new.fastq.fa', 'cow1_new.fastq.fa.qual'
We can then run the cross-match tool to compare our sequences against a database of adaptor
sequences (in fasta format) to identify adaptor contamination:
cross_match.manyreads cow1_new.fastq.fa cow1_new.fastq.fa.qual UniVec_Core -minmatch 10 -minscore 20 -screen
cross_match.manyreads cow2_new.fastq.fa cow2_new.fastq.fa.qual UniVec_Core -minmatch 10 -minscore 20 -screen
Notes:

The command line parameters are:
o `-minmatch 10`: Minimum length of matching word is 10.
o `-minscore 20`: Minimum alignment score is 20.
o `-screen`: A file named "cow1_new.fastq.fa.screen" is created, containing "vector
masked" versions of the original sequences: i.e. any region that matches any part of a
vector sequence is replaced by X's. You can see which sequences have adaptors that
have been screened using:
grep -B1 X cow1_new.fastq.fa.screen
o `UniVec_Core` is a fasta file of known vectors and sequencing adaptors; you can view it
with the command:
less UniVec_Core
Once we have identified the adaptor sequence (now represented as 'X's), we trim the 'X's from the
5' and 3' ends of each read and remove those reads whose remaining length is less than a predefined
cutoff (here we use a cutoff of 50bp):
perl main_merge_to_1line_fasta.pl cow
python merge_paired_filtered_qual_fa.py cow1_new.fastq.fa.screen_new cow1_new.fastq.fa.qual cow2_new.fastq.fa.screen_new cow2_new.fastq.fa.qual 50

The resulting output files are termed: 'cow1_new_adapter.fastq', 'cow2_new_adapter.fastq'
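As a rough illustration of the filtering just described, the following Python sketch strips masked 'X' characters from both ends of a read and keeps a pair only if both mates remain long enough. The function names are hypothetical, and the rule that both mates must pass is an assumption; the actual behaviour of merge_paired_filtered_qual_fa.py may differ.

```python
def trim_masked_ends(seq, mask="X"):
    """Return (start, end) of the region left after stripping masked
    characters from both ends. Internal masked bases are left untouched."""
    start = len(seq) - len(seq.lstrip(mask))
    end = len(seq.rstrip(mask))
    return start, max(start, end)  # a fully masked read gives an empty region


def keep_pair(seq1, seq2, min_len=50):
    """Assumption: a pair is kept only if BOTH mates are still at least
    min_len bases long after end-trimming."""
    for seq in (seq1, seq2):
        start, end = trim_masked_ends(seq)
        if end - start < min_len:
            return False
    return True


print(trim_masked_ends("XXXACGTXX"))
print(keep_pair("A" * 60, "X" * 20 + "C" * 40))
```

The same (start, end) coordinates would also be applied to the quality string so that sequence and qualities stay in register.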
Step 2. Trim and remove low quality sequence
We trim bases whose quality score is below 15, and then remove trimmed reads that are shorter
than 50 bp.
python trim_low_quality_paired_end.py cow1_new_adapter.fastq cow2_new_adapter.fastq 50
mv cow1_new_adapter_qual.fastq cow1_qual_all.fastq
mv cow2_new_adapter_qual.fastq cow2_qual_all.fastq
Notes:


The resulting output files of 'trim_low_quality_paired_end.py' are termed:
'cow1_new_adapter_qual.fastq', 'cow2_new_adapter_qual.fastq'
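To make the trimming rule concrete, here is a minimal Python sketch that removes bases with a quality score below 15 from both ends of a read. It is not the code of trim_low_quality_paired_end.py: the end-trimming strategy is an assumption, and so is the Phred+64 quality offset (suggested only by the name of the fastq2fastaqual_64.py script; more recent Illumina data use an offset of 33).

```python
def trim_low_quality_ends(seq, qual, min_q=15, offset=64):
    """Strip bases with Phred score < min_q from both ends of a read.

    offset=64 is an assumption (older Illumina encoding); pass offset=33
    for Sanger / recent Illumina FASTQ files.
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


# 'B' encodes Q2 and 'h' encodes Q40 under the Phred+64 assumption.
trimmed_seq, trimmed_qual = trim_low_quality_ends("ACGTACGT", "BBhhhhBB")
print(trimmed_seq, len(trimmed_seq))
```

A read would then be discarded when the trimmed length falls below the cutoff passed on the command line (50 in this tutorial).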
In a future version of the pipeline, we will use Trimmomatic, a relatively new tool that rapidly
identifies and removes low quality sequence data - you can find it at
http://www.usadellab.org/cms/?page=trimmomatic. In addition to trimming low quality data, it
can also identify and trim adaptor sequences and is therefore a viable alternative to
cross_match.
Step 3. Remove abundant rRNA sequences
rRNA genes tend to be highly expressed in all samples and must therefore be screened out to
avoid lengthy downstream processing times for the assembly and annotation steps. We
recommend the use of two tools to achieve this screening. The first uses the Burrows-Wheeler
alignment tool (BWA) to align sequence reads against a database of bacterial rRNA genes. Here
we use a custom database of 432,934 rRNA genes collated from the SILVA, Greengenes and
NCBI databases (Bacteria_rRNA.fasta). This step is very fast. The other tool is Infernal
(http://infernal.janelia.org/), which relies on a database of covariance models (CMs), i.e.
probabilistic profiles of RNA sequence and secondary structure, typically obtained from the Rfam
database. However, because covariance model searches are computationally expensive, Infernal,
while more sensitive than BWA, takes a long time (26 hours for ~100,000 reads on a single core!).
We have therefore pre-generated the Infernal output for this step, so the Infernal search itself
can be skipped here.
1. BWA mapping against our Bacteria rRNA database:
For faster and more efficient alignment using BWA and SAMTools, we use the following
commands to build the Bacteria rRNA database.
bwa index -a bwtsw Bacteria_rRNA.fasta
samtools faidx Bacteria_rRNA.fasta
To align paired-end reads to rRNA sequences, we use the following commands with the default
parameters (for single-end reads, use 'bwa samse' in place of 'bwa sampe'):
bwa aln -t 4 Bacteria_rRNA.fasta cow1_qual_all.fastq > cow1_rRNA.sai
bwa aln -t 4 Bacteria_rRNA.fasta cow2_qual_all.fastq > cow2_rRNA.sai
bwa sampe Bacteria_rRNA.fasta cow1_rRNA.sai cow2_rRNA.sai cow1_qual_all.fastq cow2_qual_all.fastq > cow_rRNA.sam
Next we apply the SAMTools sequence formatting software (http://samtools.sourceforge.net/) to
convert the SAM-formatted BWA output into a name-sorted binary BAM file, and then write the
mapped alignments ('-F 4' excludes unmapped reads) to an easier-to-process tab-delimited file:
samtools view -bS cow_rRNA.sam | samtools sort -n - cow_rRNA
samtools view -F 4 cow_rRNA.bam > cow_rRNA.bwaout
Finally we extract unmapped reads for further processing:
perl main_read_samout.pl cow rRNA bwa pair
perl main_select_reads_fromfile.pl cow rRNA1 bwa pair
The processed files are called "cow1_qual_n_rRNA1.fasta" and "cow2_qual_n_rRNA1.fasta"
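The selection of unmapped reads rests on the SAM flag field: bit 0x4 is set when a read is unmapped, which is what 'samtools view -F 4' filters on. The Python sketch below separates read names by that bit. It illustrates the idea only and is not the code of main_read_samout.pl; the function name is hypothetical, and treating a pair as rRNA if either mate maps is an assumption.

```python
def split_by_mapping(sam_lines):
    """Return (mapped_names, clean_names) from SAM text lines.

    A read name lands in mapped_names if at least one of its alignment
    records lacks the 0x4 (unmapped) flag bit. clean_names holds names
    for which no record mapped (assumption: a pair is discarded as soon
    as one mate maps).
    """
    mapped, unmapped = set(), set()
    for line in sam_lines:
        if line.startswith("@"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            unmapped.add(name)
        else:
            mapped.add(name)
    return mapped, unmapped - mapped


sam = [
    "@SQ\tSN:rRNA_1\tLN:1500",
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t99\trRNA_1\t10\t60\t4M\t=\t50\t44\tACGT\tIIII",
    "r2\t147\trRNA_1\t50\t60\t4M\t=\t10\t-44\tACGT\tIIII",
    "r3\t73\trRNA_1\t10\t60\t4M\t=\t10\t0\tACGT\tIIII",
    "r3\t133\trRNA_1\t10\t0\t*\t=\t10\t0\tACGT\tIIII",
]
mapped_names, clean_names = split_by_mapping(sam)
print(sorted(mapped_names), sorted(clean_names))
```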
Notes:

You might get a warning when running the main_select_reads_fromfile.pl script - "Replacement
list is longer than search list at /usr/local/share/perl/5.18.2/Bio/Range.pm line 251." - This is a
known bug (feature?) associated with older versions of BioPerl that we can safely ignore here
AND elsewhere in this tutorial.
2. Infernal:
Infernal relies on profiles of RNA sequence and secondary structure consensus to identify
putative rRNA sequences.
cmscan -o cow1_rRNA.log --tblout cow1_rRNA.infernalout --noali --notextw --rfam -E 0.001 Rfam.cm cow1_qual_n_rRNA1.fasta
cmscan -o cow2_rRNA.log --tblout cow2_rRNA.infernalout --noali --notextw --rfam -E 0.001 Rfam.cm cow2_qual_n_rRNA1.fasta
Notes:


The command line parameters are:
o `--tblout`: save a simple tabular file.
o `--noali`: omit the alignment section from the main output. This can greatly reduce the
output volume.
o `--rfam`: use a strict filtering strategy devised for large database. This will speed the
search at a potential cost to sensitivity.
o `-E`: report target sequences with an E-value of at most 0.001.
Due to the sensitive nature of these searches, Infernal would take ~6 hours to complete.
Finally from all these output files we need to extract all those reads which were not deemed to be
rRNA for subsequent processing:
perl main_get_sequence_length.pl cow rRNA
perl main_get_infernal_fromfile_1tophit.pl cow pair 1 0.001 90
perl main_select_reads_fromfile.pl cow rRNA infernal pair
Step 4. Remove host reads
To identify and screen out host reads (here reads of bovine origin) we use the Burrows-Wheeler
aligner again to search against a database of cow sequences. For our purposes we use the set of
cow coding sequences (CDS) downloaded from Ensembl
(ftp://ftp.ensembl.org/pub/release-80/fasta/bos_taurus/cds/Bos_taurus.UMD3.1.cds.all.fa.gz).
First, you need to generate an index for these sequences using 'bwa index' and 'samtools faidx'.
Then you can perform the alignments for the reads using 'bwa aln' and 'bwa sampe':
bwa index -a bwtsw cow_cds.fa
samtools faidx cow_cds.fa
bwa aln -t 4 cow_cds.fa cow1_qual_n_rRNA.fastq > cow1_host.sai
bwa aln -t 4 cow_cds.fa cow2_qual_n_rRNA.fastq > cow2_host.sai
bwa sampe cow_cds.fa cow1_host.sai cow2_host.sai cow1_qual_n_rRNA.fastq cow2_qual_n_rRNA.fastq > cow_host.sam
Again we use SAMTools to convert the SAM-formatted BWA output to BAM, and custom perl
scripts to extract unmapped reads (which are now our set of putative mRNAs - congratulations!):
samtools view -bS cow_host.sam | samtools sort -n - cow_host
samtools view -F 4 cow_host.bam > cow_host.bwaout
perl main_read_samout.pl cow host bwa pair
perl main_select_reads_fromfile.pl cow host bwa pair
Notes:

This step does not actually identify any reads of bovine origin. However, from the Infernal results
we did identify 1937 reads that map to bovine LSU rRNAs!
Step 5. Assembling reads
Previous studies have shown that assembling reads into larger contigs significantly increases our
ability to annotate them through sequence similarity searches. Comparisons of various assembly
methods have shown Trinity yields the best performance in terms of number of reads annotated
after assembly (Celaj, A., Markle, J., Danska, J. and Parkinson, J. (2014) Comparison of
assembly algorithms for improving rate of metatranscriptomic functional annotation.
Microbiome. 2:39.). Here we will apply the Trinity pipeline to our set of putative mRNA's
recovered at the end of Step 4.
1. Perform Trinity assembly:
For paired-end reads:
Trinity --seqType fq --left cow1_qual_n_rRNA_n_host.fastq --right cow2_qual_n_rRNA_n_host.fastq --CPU 8 --max_memory 10G --min_contig_length 75 --full_cleanup
Notes:

The command line parameters are:
o `--seqType`: type of reads: ( fa, or fq ).
o `--CPU`: number of CPUs to use is 8.
o `--max_memory`: maximal memory to use by Trinity is 10GB.
o `--min_contig_length`: minimal length of contigs is 75bp .
o `--full_cleanup`: remove the temporary folder and intermediate results, keeping only the final Trinity fasta file.

Trinity assembles reads into contigs which are placed into a file named
"trinity_out_dir.Trinity.fasta". By entering "less trinity_out_dir.Trinity.fasta", you can see the
format of contig sequences as follows:
>TR1|c0_g1_i1 len=101 path=[157:0-100] [-1, 157, -2]
GGATATATGTCCTCTGTTAGAAATTTGACTAAGCCTGCTTCTGAATGGATTGCTGGTGGT
ATCCCCATCACTGCTATGATGAGTATGGAAAAAAGAAGTGG
Because Trinity headers are not consistent between runs, we have to cheat here slightly to
ensure that our named contigs are consistent with pregenerated BLASTx output used in a
subsequent annotation step and will rely on a pregenerated contig assembly file termed
"cow_contigs.fasta". Alternatively, you can rename it to "cow_contigs.fasta" using the following
command:
o mv trinity_out_dir.Trinity.fasta cow_contigs.fasta
2. Extract singleton reads to a fastq format file:
In order to extract unassembled reads, i.e. singletons, we need to map all putative mRNA reads
to our set of assembled contigs by BWA. Unmapped reads represent our set of singletons.
First, we need to build an index to allow BWA to search against our set of contigs:
bwa index -a bwtsw cow_contigs.fasta
samtools faidx cow_contigs.fasta
Next we attempt to map the entire set of putative mRNA reads to this contig database:
bwa aln -t 4 cow_contigs.fasta cow1_qual_n_rRNA_n_host.fastq > cow1_trinity.sai
bwa aln -t 4 cow_contigs.fasta cow2_qual_n_rRNA_n_host.fastq > cow2_trinity.sai
bwa sampe cow_contigs.fasta cow1_trinity.sai cow2_trinity.sai cow1_qual_n_rRNA_n_host.fastq cow2_qual_n_rRNA_n_host.fastq > cow_trinity.sam
samtools view -bS cow_trinity.sam | samtools sort -n - cow_trinity
samtools view -F 4 cow_trinity.bam > cow_trinity.bwaout
We then extract singletons in a fastq format for subsequent processing:
perl main_read_samout.pl cow assembly bwa pair
perl main_select_reads_fromfile.pl cow assembly bwa pair
Finally we generate a mapping table in which each contig is associated with the number of reads
used to assemble that contig. This table is useful for determining how many reads map to a
contig and is used for determining relative expression (see Steps 6 and 8).
perl main_get_maptable_contig.pl cow assembly
Notes:
From the following files we observe 


cow_contigs.fasta: 281 contigs = 2109 reads
cow1_singletons.fastq: 3367 reads
cow2_singletons.fastq: 3367 reads
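The contig mapping table described above amounts to counting, for each contig, the reads aligned to it. A minimal Python sketch of that counting step follows, assuming SAM-formatted text such as the lines in cow_trinity.bwaout. The function name is hypothetical, and main_get_maptable_contig.pl may handle paired reads or multiple alignments differently.

```python
from collections import Counter


def reads_per_contig(sam_lines):
    """Count mapped alignment records per reference (contig) name.

    Records with the 0x4 flag bit set, or with '*' as reference, are
    skipped. Each mate of a pair is counted separately (an assumption).
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, contig = int(fields[1]), fields[2]
        if flag & 0x4 or contig == "*":
            continue
        counts[contig] += 1
    return counts


sam = [
    "@SQ\tSN:TR1|c0_g1_i1\tLN:101",
    "r1/1\t0\tTR1|c0_g1_i1\t1\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2/1\t16\tTR1|c0_g1_i1\t20\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r3/1\t0\tTR2|c0_g1_i1\t5\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r4/1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
table = reads_per_contig(sam)
for contig, n in sorted(table.items()):
    print(contig, n)
```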
Step 6. Annotate reads to known genes/proteins
This is the step where we attempt to infer the origins of the putative microbial mRNA reads. We
use a tiered set of sequence similarity searches of decreasing stringency - BWA, BLAT and
BLASTX. BWA and BLAT are quick and highly stringent, but sequence diversity at the
nucleotide level means that they yield few matches.
To avoid the problems of diversity that occur at the nucleotide level, particularly in the
absence of reference microbial genomes, we use BLASTX searches to provide more sensitive
peptide-based searches, which are less affected by sequence changes between strains.
Since BWA and BLAT utilize nucleotide searches, we rely on a microbial genome database that
we obtained from the NCBI, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ffn.tar.gz, which
contains 5220 ffn files. We then merge all 5220 ffn files into one fasta file
"microbial_all_cds.fasta". We then build indexes for this database to allow searching via BWA
and BLAT. For BLASTX searches we use an indexed Non-Redundant (NR) protein database
also from NCBI: ftp://ftp.ncbi.nih.gov/blast/db/nr.*.tar.gz.
Notes:

the commands used to build the indexed databases are as follows:
o bwa index -a bwtsw microbial_all_cds.fasta
o samtools faidx microbial_all_cds.fasta
o makeblastdb -in microbial_all_cds.fasta -dbtype nucl
BWA searches against microbial genome database
for contigs:
bwa aln -t 4 microbial_all_cds.fasta cow_contigs.fasta > cow_contigs.sai
bwa samse microbial_all_cds.fasta cow_contigs.sai cow_contigs.fasta > cow_contigs.sam
samtools view -bS cow_contigs.sam | samtools sort -n - cow_contigs
samtools view -F 4 cow_contigs.bam > cow_contigs_micro_cds.bwaout
perl main_read_samout.pl cow microgenes bwa contigs micro_cds
perl main_select_reads_fromfile.pl cow microgenes bwa contigs micro_cds
for singletons:
bwa aln -t 4 microbial_all_cds.fasta cow1_singletons.fastq > cow1_singletons.sai
bwa aln -t 4 microbial_all_cds.fasta cow2_singletons.fastq > cow2_singletons.sai
bwa sampe microbial_all_cds.fasta cow1_singletons.sai cow2_singletons.sai cow1_singletons.fastq cow2_singletons.fastq > cow_singletons.sam
samtools view -bS cow_singletons.sam | samtools sort -n - cow_singletons
samtools view -F 4 cow_singletons.bam > cow_singletons_micro_cds.bwaout
perl main_read_samout.pl cow microgenes bwa singletons micro_cds
perl main_select_reads_fromfile.pl cow microgenes bwa singletons micro_cds
Notes:

The contig searches rely on the 'single end' (samse) mode of searching, while the singleton
searches rely on the 'paired end' (sampe) mode of searching. This is one reason why we have
persisted with these two types of files through this pipeline.
BLAT searches against microbial genome database
Because the microbial genome database is very large, we can run into "out-of-memory"
features(!) when running BLAT. We therefore split the database into two sub-databases as
follows,
perl splitFsaSub.pl 3700000 microbial_all_cds.fasta
makeblastdb -in microbial_all_cds_1.fasta -dbtype nucl
makeblastdb -in microbial_all_cds_2.fasta -dbtype nucl
After building the corresponding indexed databases, we then issue the following commands:
for contigs:
blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_1.fasta cow_contigs_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_contigs_1.blatout
blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_2.fasta cow_contigs_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_contigs_2.blatout
cat cow_contigs_1.blatout cow_contigs_2.blatout > cow_contigs_n_micro_cds.blatout
perl main_sort_blastout_fromfile.pl cow n_micro_cds blat contigs 10
perl main_get_blast_fromfile_1tophit.pl cow micro_cds blat contigs 1 100 85 65 60
perl main_select_reads_fromfile.pl cow microgenes_blat blat contigs micro_cds
Notes:

The command line parameters are:
o `-noHead`: Suppresses the .psl header (so it's just a tab-separated file).
o `-minIdentity`: Sets the minimum sequence identity to 90%.
o `-minScore`: Sets the minimum score to 50. This is the matches minus the mismatches minus
some sort of gap penalty.
o `-fine`: For high-quality mRNAs.
o `-q`: Query type is RNA sequence.
o `-t`: Database type is DNA sequence.
for singletons:
blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_1.fasta cow1_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons1_1.blatout
blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_2.fasta cow1_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons1_2.blatout
cat cow_singletons1_1.blatout cow_singletons1_2.blatout > cow1_singletons_n_micro_cds.blatout
blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_1.fasta cow2_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons2_1.blatout
blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_2.fasta cow2_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons2_2.blatout
cat cow_singletons2_1.blatout cow_singletons2_2.blatout > cow2_singletons_n_micro_cds.blatout
perl main_sort_blastout_fromfile.pl cow n_micro_cds blat singletons 10
perl main_get_blast_fromfile_1tophit.pl cow micro_cds blat singletons 1 100 85 65 60
perl main_select_reads_fromfile.pl cow microgenes_blat blat singletons micro_cds
BLASTX against NR protein DB
BLASTX searches are very time consuming and would take over 11 hours if using 4 cores for
our relatively small dataset. These (like the Infernal searches) require access to compute cluster
platforms so that the searches can be split over several hundred nodes and the results
subsequently reimported back into the pipeline. We have therefore generated the BLASTX
output for our reads already - "cow_contigs_nr.blastxout", "cow1_singletons_nr.blastxout", and
"cow2_singletons_nr.blastxout".
for contigs:
blastx -db nr -query cow_contigs_n_micro_cds_rest.fasta -outfmt 6 -max_target_seqs 10 -evalue 10 -num_threads 4 -out cow_contigs_nr.blastxout
for singletons:
blastx -db nr -query cow1_singletons_n_micro_cds_rest.fasta -outfmt 6 -max_target_seqs 10 -evalue 10 -num_threads 4 -out cow1_singletons_nr.blastxout
blastx -db nr -query cow2_singletons_n_micro_cds_rest.fasta -outfmt 6 -max_target_seqs 10 -evalue 10 -num_threads 4 -out cow2_singletons_nr.blastxout
Notes:

The command line parameters are:
o `-query`: Input file name.
o `-db`: BLAST database name.
o `-out`: Output file name.
o `-evalue`: Expectation value (E) threshold for saving hits.
o `-outfmt 6`: Output file is in a tabular format.
o `-max_target_seqs`: Maximum number of aligned sequences to keep is 10.
o `-num_threads`: Number of threads to use in the BLAST search is 4.
Next we use these scripts to extract the top BLAST sequence similarity match for each contig
and each singleton:
perl main_get_blast_fromfile_tophits.pl cow nr blastx contigs 1 100 85 65 60
perl main_get_blast_fromfile_1topbachit.pl cow nr blastx contigs
perl main_get_blast_fromfile_tophits.pl cow nr blastx singletons 1 100 85 65 60
perl main_get_blast_fromfile_1topbachit.pl cow nr blastx singletons
Notes



Here we consider a hit to be a match if it shows at least 85% sequence identity over at least 65%
of the read length. This can result in very poor e-values (E = 3!) but the matches are nonetheless
reasonable.
We see a lot of 'Errors' of Entries not being found in the database - this arises because our
precomputed search was against a database of non-redundant proteins, many of which are not
found in the more limited non-redundant database of bacterial proteins we provide here.
In a future version of the pipeline, we will use the recently published tool DIAMOND instead of
BLASTX for searches against the NR database. DIAMOND is a BLAST-compatible local aligner for
mapping translated DNA query sequences against a protein reference database (BLASTX alignment
mode). The speedup over BLAST is up to 20,000-fold on short reads at a typical sensitivity of
90-99% relative to BLAST, depending on the data and settings (Buchfink B. et al., Nature Methods, 2015[1]).
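The identity and coverage cutoffs mentioned in the notes above can be applied to tabular ('-outfmt 6') BLAST output with a few lines of Python. The sketch below is an illustration under stated assumptions, not the code of main_get_blast_fromfile_tophits.pl: the function name is hypothetical, the query lengths are assumed to be supplied separately (tabular output does not include them), and coverage is computed from the query start and end coordinates.

```python
def passes_cutoffs(hit_line, query_lengths, min_identity=85.0, min_coverage=65.0):
    """Check one '-outfmt 6' line against identity and coverage cutoffs.

    Default column order: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore. For BLASTX the query
    coordinates are in nucleotides, so coverage is taken from
    qstart/qend rather than from the (amino acid) alignment length.
    """
    f = hit_line.rstrip("\n").split("\t")
    qseqid, pident = f[0], float(f[2])
    qstart, qend = int(f[6]), int(f[7])
    aligned = abs(qend - qstart) + 1
    coverage = 100.0 * aligned / query_lengths[qseqid]
    return pident >= min_identity and coverage >= min_coverage


lengths = {"read1": 100}  # hypothetical query lengths, in nucleotides
good = "read1\tgi|1\t90.0\t25\t2\t0\t1\t75\t10\t34\t3.0\t30.1"
short = "read1\tgi|2\t95.0\t20\t1\t0\t1\t60\t10\t29\t3.0\t28.0"
print(passes_cutoffs(good, lengths), passes_cutoffs(short, lengths))
```

The first hit covers 75% of the read at 90% identity and passes; the second covers only 60% and is rejected despite its higher identity.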
We then generate a sequence file of mapped microbial genes:
perl main_get_microbial_cds_sub.pl cow
As well as a sequence file of mapped NR proteins:
perl main_get_nr_sub.pl cow
perl main_get_sequence_length.pl cow nr_sub
SUMMARY:
In order to know the number of mapped reads at different processing steps, you can use the
following commands:
cp cow_contigs_nr_blastx_IDs.txt cow_contigs_nr_IDs.txt
cp cow_singletons_nr_blastx_IDs.txt cow_singletons_nr_IDs.txt
perl main_get_maptable_contig.pl cow bwa
perl main_get_maptable_contig.pl cow blat
perl main_get_maptable_contig.pl cow blastx
grep ">" microbial_cds_sub.fasta | wc -l
grep ">" nr_all_sub.fasta | wc -l
Note:

BWA: Total number of mapped-reads = 36 reads


BLAT: Total number of mapped-reads = 297 reads
BLASTX: Total number of mapped-reads = 1120 reads


Total number of mapped micro_cds genes = 318
Total number of mapped nr proteins = 877
Thus of ~6100 reads of putative microbial mRNA origin, we can annotate only ~1400 of them!!
This appears to be typical for microbiome samples.
Step 7. Map the known genes to E. coli homologs to facilitate network
visualization
To help interpret our metatranscriptomic datasets from a functional perspective, we rely on
mapping our data to functional networks such as metabolic pathways and maps of protein
complexes. Here we will use a previously published map of functional protein-protein
interactions (PPI) constructed for E. coli (Peregrín-Alvarez JM. et al., PLoS Comput Biol.
2009[2]) as a proxy to get a systems-level view of annotated reads. To begin, we need to first
define E. coli homologs for our annotated genes and proteins from the BWA, BLAT and
BLASTX searches.
For mapped microbial genes identified through our BWA and BLAT searches:
blastx -db EcoliMG1655.fsa -query microbial_cds_sub.fasta -outfmt 6 -max_target_seqs 10 -evalue 1e-10 -num_threads 4 -out microbial_cds_sub_ecoli_ppi.blastxout
perl main_get_blast_fromfile_1tophit.pl cow ecoli_ppi blastx genes 0
For mapped NR proteins identified through our BLASTX searches:
blastp -db EcoliMG1655.fsa -query nr_all_sub.fasta -outfmt 6 -max_target_seqs 10 -evalue 1e-10 -num_threads 4 -out nr_all_sub_ecoli_ppi.blastpout
perl main_get_blast_fromfile_1tophit.pl cow ecoli_ppi blastp proteins 0
Notes:

The output files are "microbial_cds_sub_ecoli_ppi_pairs.txt" and "nr_all_sub_ecoli_ppi_IDs.txt".
We then need to generate a "PPI_pairs.txt" mapping file which lists the E. coli homolog (defined
through its 'b'-number) for each of our mapped genes/proteins:
perl main_combine_PPI_results.pl cow
Step 8. Generate normalized expression values associated with each gene
We have removed low quality, adaptors, rRNA and host sequences and annotated reads to the
best of our ability - now let’s summarize our findings. We do this by looking at the relative
expression of each of our genes in our microbiome. First we generate a mapping table, which
links our gene and proteins identified in our BWA, BLAT and BLASTX mappings with their
respective taxonomic information (NCBI taxon ID, species name and phylum). This enables us
to identify which species are contributing which functions to the microbiome:
perl main_get_taxonID_microbial_cds.pl cow
perl main_get_phylum.pl cow micro_cds
perl main_get_taxonID_nr_sub.pl cow
perl main_get_phylum.pl cow nr
Then for each gene and protein, we calculate a normalized expression value (Reads Per Kilobase
of transcript per Million mapped reads - RPKM):
perl main_get_mapped_genesID_counts.pl cow micro_cds
perl main_get_mapped_genesID_counts.pl cow nr
perl main_get_mapped_gene_table.pl cow micro_cds
perl main_get_mapped_gene_table.pl cow nr
perl main_get_mapped_gene_table_RPKM.pl cow
Notes:

The final output file is named "cow_table_RPKM_all.txt" and has the following format:
o [geneID/proteinID, length, #reads, taxonID, species, phylum, RPKM, PPI]
o gi|303235869|ref|ZP_07322472.1| ribonucleoside-diphosphate reductase, adenosylcobalamin-dependent 2595 1 866771 Prevotella disiens FB035-09AN bacteroidetes 1.05 1.17.4.1 b2675
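For reference, the conventional RPKM calculation can be written as a one-line function. This sketch uses the standard definition (reads per kilobase of sequence per million mapped reads); it is an assumption that main_get_mapped_gene_table_RPKM.pl follows exactly this formula, and the choice of which read total to use as the denominator is left to the caller.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Standard RPKM: reads / (gene length in kb * mapped reads in millions).

    read_count         -- reads assigned to this gene
    gene_length_bp     -- gene length in base pairs
    total_mapped_reads -- total reads used for normalization (caller's choice)
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers only, not taken from the cow rumen dataset.
value = rpkm(read_count=10, gene_length_bp=2000, total_mapped_reads=1_000_000)
print(f"{value:.2f}")
```

Dividing by gene length corrects for longer genes collecting more reads, and dividing by the read total makes values comparable between samples of different sequencing depth.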
Step 9. Visualization of PPI network
To visualize our processed microbiome dataset in the context of the functional PPI network, we
use the network visualization tool - Cytoscape together with the enhancedGraphics plugin.
Here we will skip the steps of generating the node attribute file "cow_PPI.nodes.txt" from
"cow_table_RPKM_all.txt", however for your information the steps involve:

predefine taxonomic categories (here we use the following 12 phylum categories: archaea,
protozoan, bacteria, actinobacteria, bacteroidetes, gammaproteobacteria, deltaproteobacteria,
betaproteobacteria, alphaproteobacteria, clostridiales, leuconostocaceae, lactobacillaceae, but
you could define these categories to fit your microbiome).

calculate RPKM values of each E. coli protein, for every phylum category, by adding RPKM values
of the protein's mapped genes/proteins.

generate a node attribute file which is a tab-delimited table with a format as follows:
o the first line is the header - you could use:
ecoli_protein	b#	RPKM	piechart	archaea	protozoan	bacteria	actinobacteria	bacteroidetes	gammaproteobacteria	deltaproteobacteria	betaproteobacteria	alphaproteobacteria	clostridiales	leuconostocaceae	lactobacillaceae
o subsequent lines then use the format, with the final numbers being the RPKM associated with each taxon:
dapF	b3809	70.38	piechart: attributelist="archaea,protozoan,bacteria,actinobacteria,bacteroidetes,gammaproteobacteria,deltaproteobacteria,betaproteobacteria,alphaproteobacteria,clostridiales,leuconostocaceae,lactobacillaceae" colorlist="#FFA500,#C0C0C0,#EDF252,#0000FF,#FF00FF,#2C94DE,#ED4734,#00FFFF,#FFCCFF,#34C400,#A52A2A,#663366" showlabels=false	0.00	0.00	3.29	0.00	50.79	3.29	0.00	0.00	0.00	13.01	0.00	0.00
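The node attribute rows described above could be assembled with a short Python helper such as the one below. It is a sketch only: the function name node_row and the input dictionary are hypothetical, while the category names, colours and field order are copied from the example row.

```python
CATEGORIES = [
    "archaea", "protozoan", "bacteria", "actinobacteria", "bacteroidetes",
    "gammaproteobacteria", "deltaproteobacteria", "betaproteobacteria",
    "alphaproteobacteria", "clostridiales", "leuconostocaceae", "lactobacillaceae",
]
COLORS = [
    "#FFA500", "#C0C0C0", "#EDF252", "#0000FF", "#FF00FF", "#2C94DE",
    "#ED4734", "#00FFFF", "#FFCCFF", "#34C400", "#A52A2A", "#663366",
]


def node_row(protein, b_number, rpkm_by_category):
    """Build one tab-delimited node attribute line.

    rpkm_by_category maps a category name to the summed RPKM of all genes
    or proteins in that category that map to this E. coli protein;
    missing categories count as 0.
    """
    values = [rpkm_by_category.get(c, 0.0) for c in CATEGORIES]
    piechart = 'piechart: attributelist="{}" colorlist="{}" showlabels=false'.format(
        ",".join(CATEGORIES), ",".join(COLORS)
    )
    fields = [protein, b_number, f"{sum(values):.2f}", piechart]
    fields += [f"{v:.2f}" for v in values]
    return "\t".join(fields)


row = node_row(
    "dapF", "b3809",
    {"bacteria": 3.29, "bacteroidetes": 50.79,
     "gammaproteobacteria": 3.29, "clostridiales": 13.01},
)
print(row)
```

The total in the third column is the sum of the per-category values, which is what the enhancedGraphics pie chart divides into slices.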
Once the node attribute file has been generated, we provide one network file onto which these
attributes can be mapped: "Ecoli PPI cellwall.cys", which can be opened using Cytoscape
installed on your local computer. To import node attributes:
1) select File -> Import -> Table -> File, select "cow_PPI.nodes.txt" from
your working folder, click OK from the prompting window.
2) from Control Panel, select Style -> Properties -> Paint -> Custom Paint 1
-> Custom Graphics 1,
3) click Custom Graphics 1, select piechart for Column, and select
Passthrough Mapping for Mapping Type.
Notes:

To install the required plugin within Cytoscape, select Apps from the header bar, then
select App Manager. Search for enhancedGraphics in the Search box. Select enhancedGraphics
and click Install.

You can open the Cytoscape file, 'Ecoli PPI cellwall cow.cys', to see what the PPI network looks
like.