Metatranscriptomics Tutorial

Overview

This tutorial takes you through the steps of processing metatranscriptomic data:

1. Remove adaptor sequences. These are added during library preparation and sequencing steps and can be generated during sequencing runs.
2. Remove low quality sequences.
3. Remove abundant rRNA sequences, which can dominate the dataset.
4. Remove host reads (if exploring a microbiome in which host is an issue).
5. Assemble the reads into contigs to improve annotation quality.
6. Annotate reads to known genes.
7. Map the known genes to E. coli homologs to facilitate network visualization.
8. Generate normalized expression values associated with each gene.
9. Visualize the results.

We are currently working on a single script that runs the entire process, taking the fastq sequence files generated by the sequencer and producing files that can be read by the visualization tool. Here we instead go through the individual steps to illustrate the complexity of the process and the underlying tools and files used and generated by the pipeline.

To illustrate the process we use sequence reads generated from the rumen of a cow. These are 100 bp paired-end reads. Single-end reads can also be used, but paired-end reads can increase sequence length if there is significant overlap and consequently improve annotation quality. Rather than use the entire set of 14 million reads, which might take several days to process on a desktop, the tutorial takes you through processing a subset of 100,000 reads.

Preliminaries

Preinstalled software and tools

You need to install all required tools and set the proper environment variables.
The tools used are as follows:

1) Perl
2) bioperl
3) Python
4) cross_match
5) Trinity
6) BWA
7) SAMTOOLS
8) BLAT
9) BLAST+
10) Cytoscape

Pre-downloaded databases

You need to download several reference databases as follows:

1) UniVec_Core (ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core)
2) Bacteria_rRNA.fasta – rRNA genes collated from the SILVA, Greengenes and NCBI databases
3) Rfam (http://rfam.xfam.org/)
4) cow_cds.fa (ftp://ftp.ensembl.org/pub/current_fasta/bos_taurus/cds/Bos_taurus.UMD3.1.cds.all.fa.gz)
5) microbial_all_cds.fasta – download from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ffn.tar.gz), and then merge all ffn files into a single file
6) nr – an indexed Non-Redundant (NR) protein database (ftp://ftp.ncbi.nih.gov/blast/db/nr.*.tar.gz)
7) bacterial_nr – download from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.faa.tar.gz), and then merge all faa files into a single file
8) EcoliMG1655.fsa – downloaded from NCBI

Input sample files

Our data set consists of 100 bp paired-end Illumina reads from cow rumen. To inspect their contents:

    less cow1.fastq
    less cow2.fastq

Processing the Reads

To ensure compatibility with downstream software we first need to reformat the headers of the paired-end reads so that the 5' and 3' reads are assigned matching sequence identifiers, e.g. 5' reads are marked with a trailing '/1' while 3' reads are marked with a trailing '/2'.

    perl main_add_subID_reads_fastq.pl cow

Notes:
Check the input file with 'less cow1.fastq':
o @SRR594215.2 FCFC81EB6ABXX:7:1101:1495:2185 length=200/1
o TGTACCTTGAGAGGAAGCACCGGCAAACTTCGTGCCAGGAGCCGCGGTAATACGAGGGGTGCAAGCGTTGTTCGGAATTACTGGGCGGACAGGGAGAGGT
Check the output file with 'less cow1_new.fastq':
o @SRR594215.2/1
o TGTACCTTGAGAGGAAGCACCGGCAAACTTCGTGCCAGGAGCCGCGGTAATACGAGGGGTGCAAGCGTTGTTCGGAATTACTGGGCGGACAGGGAGAGGT

Step 1.
Remove adaptor sequences

Here we rely on the software tool cross_match, which is part of the Phrap suite of tools originally developed in the late 1990s (http://www.phrap.org). Trimmomatic (see below) offers a reasonable alternative. In order to run cross_match, we first need to convert the read files from fastq format to fasta format:

    python fastq2fastaqual_64.py cow1_new.fastq
    python fastq2fastaqual_64.py cow2_new.fastq

Output files: 'cow1_new.fastq.fa', 'cow1_new.fastq.fa.qual' (and the corresponding cow2 files).

We can then run the cross_match tool to compare our sequences against a database of adaptor sequences (in fasta format) to identify adaptor contamination:

    cross_match.manyreads cow1_new.fastq.fa cow1_new.fastq.fa.qual UniVec_Core -minmatch 10 -minscore 20 -screen
    cross_match.manyreads cow2_new.fastq.fa cow2_new.fastq.fa.qual UniVec_Core -minmatch 10 -minscore 20 -screen

Notes: The command line parameters are:
o `-minmatch 10`: Minimum length of matching word is 10.
o `-minscore 20`: Minimum alignment score is 20.
o `-screen`: A file named "cow1_new.fastq.fa.screen" is created, containing "vector masked" versions of the original sequences, i.e. any region that matches any part of a vector sequence is replaced by X's. You can see which sequences have adaptors that have been screened using: grep -B1 X cow1_new.fastq.fa.screen
o `UniVec_Core` is a fasta file of known vectors and sequencing adaptors. You can view it with the command: less UniVec_Core

Once we have identified the adaptor sequence (now represented as 'X's), we can remove those reads whose length, after trimming the 'X's in the 5' and 3' regions, is less than a predefined cutoff (here we use a cutoff of 50 bp):

    perl main_merge_to_1line_fasta.pl cow
    python merge_paired_filtered_qual_fa.py cow1_new.fastq.fa.screen_new cow1_new.fastq.fa.qual cow2_new.fastq.fa.screen_new cow2_new.fastq.fa.qual 50

The resulting output files are named 'cow1_new_adapter.fastq' and 'cow2_new_adapter.fastq'.

Step 2.
Trim and remove low quality sequence

We trim low quality bases (quality score < 15) from the sequences, and then remove trimmed sequences shorter than 50 bp.

    python trim_low_quality_paired_end.py cow1_new_adapter.fastq cow2_new_adapter.fastq 50
    mv cow1_new_adapter_qual.fastq cow1_qual_all.fastq
    mv cow2_new_adapter_qual.fastq cow2_qual_all.fastq

Notes: The output files of 'trim_low_quality_paired_end.py' are named 'cow1_new_adapter_qual.fastq' and 'cow2_new_adapter_qual.fastq'.

In a future version of the pipeline we will use Trimmomatic, a relatively new tool that rapidly identifies and removes low quality sequence data. You can find it at http://www.usadellab.org/cms/?page=trimmomatic. In addition to trimming low quality data, it can also identify and trim adaptor sequences and is therefore a viable alternative to cross_match.

Step 3. Remove abundant rRNA sequences

rRNA genes tend to be highly expressed in all samples and must therefore be screened out to avoid lengthy downstream processing times for the assembly and annotation steps. We recommend the use of two tools to achieve this screening. The first is the Burrows-Wheeler alignment tool (BWA), which aligns sequence reads against a database of bacterial rRNA genes. Here we use a custom database of 432,934 rRNA genes collated from the SILVA, Greengenes and NCBI databases (Bacteria_rRNA.fasta). This is very fast! The other tool is Infernal (http://infernal.janelia.org/), which relies on a database of hidden Markov models (HMMs) describing rRNA sequence profiles, typically obtained from the Rfam database. However, due to the reliance on HMMs, Infernal, while more sensitive than BWA, takes a long time (26 hours for ~100,000 reads on a single core!). We have therefore pre-generated the output for this step, so the Infernal run itself can be skipped here.

1.
BWA mapping against our Bacteria rRNA database:

For faster and more efficient alignment using BWA and SAMTools, we use the following commands to index the Bacteria rRNA database:

    bwa index -a bwtsw Bacteria_rRNA.fasta
    samtools faidx Bacteria_rRNA.fasta

To align paired-end reads to rRNA sequences, we use the following commands with the default parameters (single-end reads use 'bwa samse' in place of 'bwa sampe'):

    bwa aln -t 4 Bacteria_rRNA.fasta cow1_qual_all.fastq > cow1_rRNA.sai
    bwa aln -t 4 Bacteria_rRNA.fasta cow2_qual_all.fastq > cow2_rRNA.sai
    bwa sampe Bacteria_rRNA.fasta cow1_rRNA.sai cow2_rRNA.sai cow1_qual_all.fastq cow2_qual_all.fastq > cow_rRNA.sam

Next we apply the SAMTools sequence formatting software (http://samtools.sourceforge.net/) to convert the SAM-formatted BWA output into a sorted binary BAM file, and then write the mapped reads (the `-F 4` filter excludes unmapped reads) to an easier to process tab-delimited file:

    samtools view -bS cow_rRNA.sam | samtools sort -n - cow_rRNA
    samtools view -F 4 cow_rRNA.bam > cow_rRNA.bwaout

Finally we extract unmapped reads for further processing:

    perl main_read_samout.pl cow rRNA bwa pair
    perl main_select_reads_fromfile.pl cow rRNA1 bwa pair

The processed files are called "cow1_qual_n_rRNA1.fasta" and "cow2_qual_n_rRNA1.fasta".

Notes: You might get a warning when running the main_select_reads_fromfile.pl script: "Replacement list is longer than search list at /usr/local/share/perl/5.18.2/Bio/Range.pm line 251." This is a known bug (feature?) associated with older versions of BioPerl that we can safely ignore here AND elsewhere in this tutorial.

2. Infernal:

Infernal relies on profiles of RNA sequence and secondary structure consensus to identify putative rRNA sequences.

    cmscan -o cow1_rRNA.log --tblout cow1_rRNA.infernalout --noali --notextw --rfam -E 0.001 Rfam.cm cow1_qual_n_rRNA1.fasta
    cmscan -o cow2_rRNA.log --tblout cow2_rRNA.infernalout --noali --notextw --rfam -E 0.001 Rfam.cm cow2_qual_n_rRNA1.fasta

Notes: The command line parameters are:
o `--tblout`: save a simple tabular file.
o `--noali`: omit the alignment section from the main output. This can greatly reduce the output volume.
o `--rfam`: use a strict filtering strategy devised for large databases. This will speed the search at a potential cost to sensitivity.
o `-E`: report target sequences with an E-value of 0.001 or better.

Due to the sensitive nature of these searches, Infernal would take ~6 hours to complete.

Finally, from all these output files we need to extract all those reads which were not deemed to be rRNA for subsequent processing:

    perl main_get_sequence_length.pl cow rRNA
    perl main_get_infernal_fromfile_1tophit.pl cow pair 1 0.001 90
    perl main_select_reads_fromfile.pl cow rRNA infernal pair

Step 4. Remove host reads

To identify and screen out host reads (here reads of bovine origin) we use the Burrows-Wheeler aligner again to search against a database of cow sequences. For our purposes we use a database of cow coding sequences (CDS), downloaded from Ensembl (ftp://ftp.ensembl.org/pub/release80/fasta/bos_taurus/cds/Bos_taurus.UMD3.1.cds.all.fa.gz). First, you need to generate an index for these sequences using 'bwa index' and 'samtools faidx'.
Then you can perform the alignments for the reads using 'bwa aln' and 'bwa sampe':

    bwa index -a bwtsw cow_cds.fa
    samtools faidx cow_cds.fa
    bwa aln -t 4 cow_cds.fa cow1_qual_n_rRNA.fastq > cow1_host.sai
    bwa aln -t 4 cow_cds.fa cow2_qual_n_rRNA.fastq > cow2_host.sai
    bwa sampe cow_cds.fa cow1_host.sai cow2_host.sai cow1_qual_n_rRNA.fastq cow2_qual_n_rRNA.fastq > cow_host.sam

Again we use SAMTools to convert the SAM-formatted BWA output into a sorted BAM file and a tab-delimited list of mapped reads, and custom perl scripts to extract the unmapped reads (which are now our set of putative mRNAs. Congratulations!):

    samtools view -bS cow_host.sam | samtools sort -n - cow_host
    samtools view -F 4 cow_host.bam > cow_host.bwaout
    perl main_read_samout.pl cow host bwa pair
    perl main_select_reads_fromfile.pl cow host bwa pair

Notes: This step does not actually identify any reads of bovine origin. However, from the Infernal results we did identify 1937 reads that map to bovine LSU rRNAs!

Step 5. Assembling reads

Previous studies have shown that assembling reads into larger contigs significantly increases our ability to annotate them through sequence similarity searches. Comparisons of various assembly methods have shown that Trinity yields the best performance in terms of number of reads annotated after assembly (Celaj, A., Markle, J., Danska, J. and Parkinson, J. (2014) Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation. Microbiome. 2:39.). Here we will apply the Trinity pipeline to our set of putative mRNAs recovered at the end of Step 4.

1. Perform Trinity assembly:

For paired-end reads:

    Trinity --seqType fq --left cow1_qual_n_rRNA_n_host.fastq --right cow2_qual_n_rRNA_n_host.fastq --CPU 8 --max_memory 10G --min_contig_length 75 --full_cleanup

Notes: The command line parameters are:
o `--seqType`: type of reads (fa or fq).
o `--CPU`: number of CPUs to use (here 8).
o `--max_memory`: maximum memory Trinity may use (here 10 GB).
o `--min_contig_length`: minimum contig length (here 75 bp).
o `--full_cleanup`: remove the temporary folder and intermediate results.

Trinity assembles reads into contigs which are placed into a file named "trinity_out_dir.Trinity.fasta". By entering "less trinity_out_dir.Trinity.fasta", you can see the format of contig sequences as follows:

    >TR1|c0_g1_i1 len=101 path=[157:0-100] [-1, 157, -2]
    GGATATATGTCCTCTGTTAGAAATTTGACTAAGCCTGCTTCTGAATGGATTGCTGGTGGT
    ATCCCCATCACTGCTATGATGAGTATGGAAAAAAGAAGTGG

Because Trinity headers are not consistent between runs, we have to cheat slightly here: to ensure that our named contigs are consistent with the pregenerated BLASTX output used in a subsequent annotation step, we rely on a pregenerated contig assembly file named "cow_contigs.fasta". Alternatively, you can rename your own assembly to "cow_contigs.fasta" using the following command:

    mv trinity_out_dir.Trinity.fasta cow_contigs.fasta

2. Extract singleton reads to a fastq format file:

In order to extract unassembled reads, i.e. singletons, we need to map all putative mRNA reads to our set of assembled contigs using BWA. Unmapped reads represent our set of singletons.
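The idea that "unmapped reads are the singletons" can be sketched in a few lines of Python. This is only an illustration, not the logic of the pipeline's Perl scripts, which are not shown in this tutorial. It assumes a read counts as unmapped when bit 0x4 of its own SAM FLAG is set; the read and contig names below are made up.

```python
# Sketch: separate mapped from unmapped read IDs using the SAM FLAG field.
# Assumption: a read is "unmapped" when bit 0x4 of its own FLAG is set;
# the pipeline's scripts may treat read pairs differently.

UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def split_read_ids(sam_lines):
    """Return (mapped_ids, unmapped_ids) from an iterable of SAM text lines."""
    mapped, unmapped = set(), set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        read_id, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            unmapped.add(read_id)
        else:
            mapped.add(read_id)
    return mapped, unmapped


# Toy SAM records (hypothetical names): read1 aligns, read2 does not.
example = [
    "@SQ\tSN:contig1\tLN:101",
    "read1\t99\tcontig1\t1\t60\t50M\t=\t40\t90\t*\t*",
    "read2\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
mapped_ids, singleton_ids = split_read_ids(example)
```

Note that `samtools view -F 4`, as used throughout this tutorial, keeps the mapped records; the singletons are the reads that are absent from that output.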
First, we need to build an index to allow BWA to search against our set of contigs:

    bwa index -a bwtsw cow_contigs.fasta
    samtools faidx cow_contigs.fasta

Next we attempt to map the entire set of putative mRNA reads to this contig database:

    bwa aln -t 4 cow_contigs.fasta cow1_qual_n_rRNA_n_host.fastq > cow1_trinity.sai
    bwa aln -t 4 cow_contigs.fasta cow2_qual_n_rRNA_n_host.fastq > cow2_trinity.sai
    bwa sampe cow_contigs.fasta cow1_trinity.sai cow2_trinity.sai cow1_qual_n_rRNA_n_host.fastq cow2_qual_n_rRNA_n_host.fastq > cow_trinity.sam
    samtools view -bS cow_trinity.sam | samtools sort -n - cow_trinity
    samtools view -F 4 cow_trinity.bam > cow_trinity.bwaout

We then extract singletons in fastq format for subsequent processing:

    perl main_read_samout.pl cow assembly bwa pair
    perl main_select_reads_fromfile.pl cow assembly bwa pair

Finally we generate a mapping table in which each contig is associated with the number of reads used to assemble that contig. This table is used for determining relative expression (see Steps 6 and 8).

    perl main_get_maptable_contig.pl cow assembly

Notes: From the resulting files we observe:
o cow_contigs.fasta: 281 contigs = 2109 reads
o cow1_singletons.fastq: 3367 reads
o cow2_singletons.fastq: 3367 reads

Step 6. Annotate reads to known genes/proteins

This is the step where we attempt to infer the origins of the putative microbial mRNA reads. We use a tiered set of sequence similarity searches of decreasing stringency: BWA, BLAT and BLASTX. While BWA and BLAT provide high stringency, the sequence diversity that occurs at the nucleotide level means that these searches yield few matches. Nonetheless they are quick. To avoid the problems of nucleotide-level diversity, particularly in the absence of reference microbial genomes, we use BLASTX to provide more sensitive peptide-based searches, which are less affected by sequence changes between strains.
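The tiered strategy can be summarized as "each search only sees what the previous searches failed to annotate". The Python sketch below illustrates that control flow only. The tier functions are hypothetical stand-ins (simple dictionary lookups), not real BWA, BLAT or BLASTX calls, and the sequence IDs are invented.

```python
# Sketch of a tiered annotation loop: every tier only receives the
# sequences that all earlier tiers failed to annotate.
# The "search" callables are hypothetical stand-ins for BWA/BLAT/BLASTX.

def annotate_in_tiers(seq_ids, tiers):
    """tiers: list of (name, search) pairs, where search(ids) -> {seq_id: hit}."""
    remaining = list(seq_ids)
    annotations = {}
    for name, search in tiers:
        hits = search(remaining)
        for seq_id, hit in hits.items():
            annotations[seq_id] = (name, hit)  # remember which tier found it
        remaining = [s for s in remaining if s not in hits]
    return annotations, remaining


def make_lookup(table):
    """Build a toy search function backed by a dictionary of known hits."""
    return lambda ids: {i: table[i] for i in ids if i in table}


tiers = [
    ("bwa", make_lookup({"c1": "geneA"})),
    ("blat", make_lookup({"c1": "geneA", "c2": "geneB"})),
    ("blastx", make_lookup({"c3": "protC"})),
]
annotations, unannotated = annotate_in_tiers(["c1", "c2", "c3", "c4"], tiers)
```

In the commands that follow, the same hand-off appears in the file names: sequences not matched by BWA are carried forward in the "_n_micro_cds" files, and those not matched by BLAT in the "_n_micro_cds_rest" files.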
Since BWA and BLAT utilize nucleotide searches, we rely on a microbial genome database that we obtained from the NCBI, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ffn.tar.gz, which contains 5220 ffn files. We then merge all 5220 ffn files into one fasta file "microbial_all_cds.fasta". We then build indexes for this database to allow searching via BWA and BLAT. For BLASTX searches we use an indexed Non-Redundant (NR) protein database also from NCBI: ftp://ftp.ncbi.nih.gov/blast/db/nr.*.tar.gz.

Notes: the commands used to build the indexed databases are as follows:
o bwa index -a bwtsw microbial_all_cds.fasta
o samtools faidx microbial_all_cds.fasta
o makeblastdb -in microbial_all_cds.fasta -dbtype nucl

BWA searches against microbial genome database

for contigs:

    bwa aln -t 4 microbial_all_cds.fasta cow_contigs.fasta > cow_contigs.sai
    bwa samse microbial_all_cds.fasta cow_contigs.sai cow_contigs.fasta > cow_contigs.sam
    samtools view -bS cow_contigs.sam | samtools sort -n - cow_contigs
    samtools view -F 4 cow_contigs.bam > cow_contigs_micro_cds.bwaout
    perl main_read_samout.pl cow microgenes bwa contigs micro_cds
    perl main_select_reads_fromfile.pl cow microgenes bwa contigs micro_cds

for singletons:

    bwa aln -t 4 microbial_all_cds.fasta cow1_singletons.fastq > cow1_singletons.sai
    bwa aln -t 4 microbial_all_cds.fasta cow2_singletons.fastq > cow2_singletons.sai
    bwa sampe microbial_all_cds.fasta cow1_singletons.sai cow2_singletons.sai cow1_singletons.fastq cow2_singletons.fastq > cow_singletons.sam
    samtools view -bS cow_singletons.sam | samtools sort -n - cow_singletons
    samtools view -F 4 cow_singletons.bam > cow_singletons_micro_cds.bwaout
    perl main_read_samout.pl cow microgenes bwa singletons micro_cds
    perl main_select_reads_fromfile.pl cow microgenes bwa singletons micro_cds

Notes: The contig searches rely on the 'single end' (samse) mode of searching, while the singleton searches rely on the 'paired end' (sampe) mode of searching.
This is one reason why we have persisted with these two types of files through this pipeline.

BLAT searches against microbial genome database

Because the microbial genome database is very large, we can run into "out-of-memory" features(!) when running BLAT. We therefore split the database into two sub-databases as follows:

    perl splitFsaSub.pl 3700000 microbial_all_cds.fasta
    makeblastdb -in microbial_all_cds_1.fasta -dbtype nucl
    makeblastdb -in microbial_all_cds_2.fasta -dbtype nucl

After building the corresponding indexed databases, we then issue the following commands:

for contigs:

    blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_1.fasta cow_contigs_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_contigs_1.blatout
    blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_2.fasta cow_contigs_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_contigs_2.blatout
    cat cow_contigs_1.blatout cow_contigs_2.blatout > cow_contigs_n_micro_cds.blatout
    perl main_sort_blastout_fromfile.pl cow n_micro_cds blat contigs 10
    perl main_get_blast_fromfile_1tophit.pl cow micro_cds blat contigs 1 100 85 65 60
    perl main_select_reads_fromfile.pl cow microgenes_blat blat contigs micro_cds

Notes: The command line parameters are:
o `-noHead`: Suppresses the .psl header (so the output is just a tab-separated file).
o `-minIdentity`: Sets the minimum sequence identity to 90%.
o `-minScore`: Sets the minimum score to 50. The score is the number of matches minus the mismatches minus a gap penalty.
o `-fine`: For high-quality mRNAs.
o `-q`: Query type is RNA sequence.
o `-t`: Database type is DNA sequence.
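The BLAT runs above write blast8 output (12 tab-separated columns: query, subject, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score), from which one top hit per query is kept. The Python sketch below shows one way such a top-hit filter could work. It assumes thresholds of at least 85% identity over at least 65% of the query length, approximates coverage as alignment length divided by query length, and ranks hits by bit score. These are assumptions for illustration; the actual rules inside main_get_blast_fromfile_1tophit.pl are not shown in this tutorial, and the query lengths must be supplied by the caller.

```python
# Sketch: keep the best-scoring hit per query from blast8 tabular lines,
# after applying identity and query-coverage cutoffs (assumed values).

def top_hits(blast8_lines, query_lengths, min_identity=85.0, min_coverage=65.0):
    """Return {query_id: subject_id} for the best passing hit of each query."""
    best = {}
    for line in blast8_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete blast8 record
        query, subject = cols[0], cols[1]
        identity = float(cols[2])
        aln_len = int(cols[3])
        bitscore = float(cols[11])
        # Approximate coverage: aligned columns relative to the query length.
        coverage = 100.0 * aln_len / query_lengths[query]
        if identity < min_identity or coverage < min_coverage:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: hit[0] for q, hit in best.items()}


# Toy records (hypothetical IDs); all three queries are 100 bp long.
rows = [
    "r1\tgeneA\t95.0\t90\t4\t0\t1\t90\t10\t99\t1e-30\t150.0",
    "r1\tgeneB\t99.0\t95\t1\t0\t1\t95\t1\t95\t1e-40\t180.0",
    "r2\tgeneC\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t90.0",
    "r3\tgeneD\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-08\t70.0",
]
hits = top_hits(rows, {"r1": 100, "r2": 100, "r3": 100})
```

Here r2 fails the identity cutoff and r3 fails the coverage cutoff, so only r1 is annotated, with its higher-scoring hit.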
for singletons:

    blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_1.fasta cow1_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons1_1.blatout
    blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_2.fasta cow1_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons1_2.blatout
    cat cow_singletons1_1.blatout cow_singletons1_2.blatout > cow1_singletons_n_micro_cds.blatout
    blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_1.fasta cow2_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons2_1.blatout
    blat -noHead -minIdentity=90 -minScore=50 microbial_all_cds_2.fasta cow2_singletons_n_micro_cds.fasta -fine -q=rna -t=dna -out=blast8 cow_singletons2_2.blatout
    cat cow_singletons2_1.blatout cow_singletons2_2.blatout > cow2_singletons_n_micro_cds.blatout
    perl main_sort_blastout_fromfile.pl cow n_micro_cds blat singletons 10
    perl main_get_blast_fromfile_1tophit.pl cow micro_cds blat singletons 1 100 85 65 60
    perl main_select_reads_fromfile.pl cow microgenes_blat blat singletons micro_cds

BLASTX against NR protein DB

BLASTX searches are very time consuming and would take over 11 hours using 4 cores for our relatively small dataset. These (like the Infernal searches) require access to compute cluster platforms so that the searches can be split over several hundred nodes and the results subsequently reimported back into the pipeline. We have therefore generated the BLASTX output for our reads already: "cow_contigs_nr.blastxout", "cow1_singletons_nr.blastxout", and "cow2_singletons_nr.blastxout".
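Splitting a search over many nodes usually starts by dividing the query file into chunks that can be searched independently. The Python sketch below shows one simple way to do that for FASTA records, dealing them out round-robin. It is not part of the pipeline; the function names and toy sequences are made up for illustration, and writing the chunks to files and submitting jobs is left out.

```python
# Sketch: parse FASTA text and deal the records into n chunks round-robin,
# so that each chunk could be searched as a separate cluster job.

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)  # emit the final record


def split_round_robin(records, n_chunks):
    """Distribute records over n_chunks lists, one record at a time."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


fasta = [">a", "ACGT", "AC", ">b", "GG", ">c", "TT"]
chunks = split_round_robin(read_fasta(fasta), 2)
```

Because each BLASTX query is scored independently, the tabular outputs of the chunks can simply be concatenated before being fed back into the pipeline.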
for contigs:

    blastx -db nr -query cow_contigs_n_micro_cds_rest.fasta -outfmt 6 -max_target_seqs 10 -evalue 10 -num_threads 4 -out cow_contigs_nr.blastxout

for singletons:

    blastx -db nr -query cow1_singletons_n_micro_cds_rest.fasta -outfmt 6 -max_target_seqs 10 -evalue 10 -num_threads 4 -out cow1_singletons_nr.blastxout
    blastx -db nr -query cow2_singletons_n_micro_cds_rest.fasta -outfmt 6 -max_target_seqs 10 -evalue 10 -num_threads 4 -out cow2_singletons_nr.blastxout

Notes: The command line parameters are:
o `-query`: Input file name.
o `-db`: BLAST database name.
o `-out`: Output file name.
o `-evalue`: Expectation value (E) threshold for saving hits.
o `-outfmt 6`: Output file is in a tabular format.
o `-max_target_seqs`: Maximum number of aligned sequences to keep (here 10).
o `-num_threads`: Number of threads to use in the BLAST search (here 4).

Next we use these scripts to extract the top BLAST sequence similarity match for each contig and each singleton:

    perl main_get_blast_fromfile_tophits.pl cow nr blastx contigs 1 100 85 65 60
    perl main_get_blast_fromfile_1topbachit.pl cow nr blastx contigs
    perl main_get_blast_fromfile_tophits.pl cow nr blastx singletons 1 100 85 65 60
    perl main_get_blast_fromfile_1topbachit.pl cow nr blastx singletons

Notes:
o Here we accept a match if it has 85% sequence identity over 65% of the read length. This can result in very poor E-values (E = 3!) but the matches are nonetheless reasonable.
o We see a lot of 'Errors' of entries not being found in the database. This arises because our precomputed search was against a database of non-redundant proteins, many of which are not found in the more limited non-redundant database of bacterial proteins we provide here.

In a future version of the pipeline we will use a recently published tool, DIAMOND, instead of BLASTX for searches against the NR database. DIAMOND is a BLAST-compatible local aligner for mapping translated DNA query sequences against a protein reference database (BLASTX alignment mode).
The speedup over BLAST is up to 20,000-fold on short reads, at a typical sensitivity of 90-99% relative to BLAST depending on the data and settings (Buchfink B. et al., Nature Methods, 2015[1]).

We then generate a sequence file of mapped microbial genes:

    perl main_get_microbial_cds_sub.pl cow

As well as a sequence file of mapped NR proteins:

    perl main_get_nr_sub.pl cow
    perl main_get_sequence_length.pl cow nr_sub

SUMMARY: To find the number of mapped reads at the different processing steps, you can use the following commands:

    cp cow_contigs_nr_blastx_IDs.txt cow_contigs_nr_IDs.txt
    cp cow_singletons_nr_blastx_IDs.txt cow_singletons_nr_IDs.txt
    perl main_get_maptable_contig.pl cow bwa
    perl main_get_maptable_contig.pl cow blat
    perl main_get_maptable_contig.pl cow blastx
    grep ">" microbial_cds_sub.fasta | wc -l
    grep ">" nr_all_sub.fasta | wc -l

Note:
o BWA: Total number of mapped reads = 36
o BLAT: Total number of mapped reads = 297
o BLASTX: Total number of mapped reads = 1120
o Total number of mapped micro_cds genes = 318
o Total number of mapped nr proteins = 877

Thus of ~6100 reads of putative microbial mRNA origin, we can annotate only ~1400 of them! This appears to be typical for microbiome samples.

Step 7. Map the known genes to E. coli homologs to facilitate network visualization

To help interpret our metatranscriptomic datasets from a functional perspective, we rely on mapping our data to functional networks such as metabolic pathways and maps of protein complexes. Here we will use a previously published map of functional protein-protein interactions (PPI) constructed for E. coli (Peregrín-Alvarez JM. et al., PLoS Comput Biol. 2009[2]) as a proxy to get a systems-level view of annotated reads. To begin, we first need to define E. coli homologs for our annotated genes and proteins from the BWA, BLAT and BLASTX searches.
For mapped microbial genes identified through our BWA and BLAT searches:

    blastx -db EcoliMG1655.fsa -query microbial_cds_sub.fasta -outfmt 6 -max_target_seqs 10 -evalue 1e-10 -num_threads 4 -out microbial_cds_sub_ecoli_ppi.blastxout
    perl main_get_blast_fromfile_1tophit.pl cow ecoli_ppi blastx genes 0

For mapped NR proteins identified through our BLASTX searches:

    blastp -db EcoliMG1655.fsa -query nr_all_sub.fasta -outfmt 6 -max_target_seqs 10 -evalue 1e-10 -num_threads 4 -out nr_all_sub_ecoli_ppi.blastpout
    perl main_get_blast_fromfile_1tophit.pl cow ecoli_ppi blastp proteins 0

Notes: the output files are "microbial_cds_sub_ecoli_ppi_pairs.txt" and "nr_all_sub_ecoli_ppi_IDs.txt".

We then need to generate a "PPI_pairs.txt" mapping file which lists the E. coli homolog (defined through its 'b'-number) for each of our mapped genes/proteins:

    perl main_combine_PPI_results.pl cow

Step 8. Generate normalized expression values associated with each gene

We have removed low quality, adaptor, rRNA and host sequences and annotated reads to the best of our ability. Now let's summarize our findings. We do this by looking at the relative expression of each of our genes in our microbiome. First we generate a mapping table, which links the genes and proteins identified in our BWA, BLAT and BLASTX mappings with their respective taxonomic information (NCBI taxon ID, species name and phylum).
This enables us to identify which species are contributing which functions to the microbiome:

    perl main_get_taxonID_microbial_cds.pl cow
    perl main_get_phylum.pl cow micro_cds
    perl main_get_taxonID_nr_sub.pl cow
    perl main_get_phylum.pl cow nr

Then for each gene and protein, we calculate a normalized expression value (Reads Per Kilobase per Million mapped reads, RPKM):

    perl main_get_mapped_genesID_counts.pl cow micro_cds
    perl main_get_mapped_genesID_counts.pl cow nr
    perl main_get_mapped_gene_table.pl cow micro_cds
    perl main_get_mapped_gene_table.pl cow nr
    perl main_get_mapped_gene_table_RPKM.pl cow

Notes: The final output file is named "cow_table_RPKM_all.txt" and has the following format:
o [geneID/proteinID, length, #reads, taxonID, species, phylum, RPKM, PPI]
o gi|303235869|ref|ZP_07322472.1| ribonucleoside-diphosphate reductase, adenosylcobalamin-dependent 2595 1 866771 Prevotella disiens FB035-09AN bacteroidetes 1.05 1.17.4.1 b2675

Step 9. Visualization of PPI network

To visualize our processed microbiome dataset in the context of the functional PPI network, we use the network visualization tool Cytoscape together with the enhancedGraphics plugin. Here we will skip the steps of generating the node attribute file "cow_PPI.nodes.txt" from "cow_table_RPKM_all.txt". For your information, however, the steps involve:

o predefine taxonomic categories (here we use the following 12 phylum categories: archaea, protozoan, bacteria, actinobacteria, bacteroidetes, gammaproteobacteria, deltaproteobacteria, betaproteobacteria, alphaproteobacteria, clostridiales, leuconostocaceae, lactobacillaceae, but you could define these categories to fit your microbiome).
o calculate RPKM values of each E. coli protein, for every phylum category, by adding the RPKM values of the protein's mapped genes/proteins.
o generate a node attribute file, which is a tab-delimited table with the following format:
    the first line is the header; you could use:
    ecoli_protein b# RPKM piechart archaea protozoan bacteria actinobacteria bacteroidetes gammaproteobacteria deltaproteobacteria betaproteobacteria alphaproteobacteria clostridiales leuconostocaceae lactobacillaceae
    subsequent lines then use the following format, with the final numbers being the RPKM associated with each taxon:
    dapF b3809 70.38 piechart: attributelist="archaea,protozoan,bacteria,actinobacteria,bacteroidetes,gammaproteobacteria,deltaproteobacteria,betaproteobacteria,alphaproteobacteria,clostridiales,leuconostocaceae,lactobacillaceae" colorlist="#FFA500,#C0C0C0,#EDF252,#0000FF,#FF00FF,#2C94DE,#ED4734,#00FFFF,#FFCCFF,#34C400,#A52A2A,#663366" showlabels=false 0.00 0.00 3.29 0.00 50.79 3.29 0.00 0.00 0.00 13.01 0.00 0.00

Once the node attribute file has been generated, we provide one network file onto which these attributes can be mapped: "Ecoli PPI cellwall.cys", which can be opened using Cytoscape installed on your local computer.

To import node attributes:
1) Select File -> Import -> Table -> File, select "cow_PPI.nodes.txt" from your working folder, and click OK in the window that appears.
2) From the Control Panel, select Style -> Properties -> Paint -> Custom Paint 1 -> Custom Graphics 1.
3) Click Custom Graphics 1, select piechart for Column, and select Passthrough Mapping for Mapping Type.

Notes: To install the required plugin within Cytoscape, select Apps from the header bar -> select App Manager. Search for enhancedGraphics in the Search box. Select enhancedGraphics and click Install. You can open the Cytoscape file 'Ecoli PPI cellwall cow.cys' to see what the PPI network looks like.
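For readers who want to build the node attribute file themselves, the Python sketch below assembles one row in the format described above: it sums RPKM values per phylum category for a single E. coli protein and writes the total, the piechart string and the twelve per-phylum values as tab-separated fields. The function name and the input structure (a list of (phylum, RPKM) pairs) are assumptions made for illustration, and the way the bacteroidetes value is split across two genes in the example is invented. The pipeline's own script for this step is not shown in this tutorial.

```python
# Sketch: build one tab-delimited row of the node attribute file.
# Category names and colours are copied from the tutorial's example row;
# node_row() and its input format are hypothetical.

PHYLA = [
    "archaea", "protozoan", "bacteria", "actinobacteria", "bacteroidetes",
    "gammaproteobacteria", "deltaproteobacteria", "betaproteobacteria",
    "alphaproteobacteria", "clostridiales", "leuconostocaceae", "lactobacillaceae",
]
COLORS = [
    "#FFA500", "#C0C0C0", "#EDF252", "#0000FF", "#FF00FF", "#2C94DE",
    "#ED4734", "#00FFFF", "#FFCCFF", "#34C400", "#A52A2A", "#663366",
]


def node_row(protein, b_number, gene_rows):
    """gene_rows: iterable of (phylum, rpkm) for genes mapped to this protein."""
    totals = dict.fromkeys(PHYLA, 0.0)
    for phylum, rpkm in gene_rows:
        totals[phylum] += rpkm  # add up RPKM within each phylum category
    piechart = 'piechart: attributelist="{}" colorlist="{}" showlabels=false'.format(
        ",".join(PHYLA), ",".join(COLORS))
    fields = [protein, b_number, "{:.2f}".format(sum(totals.values())), piechart]
    fields += ["{:.2f}".format(totals[p]) for p in PHYLA]
    return "\t".join(fields)


# Example reproducing the per-phylum values of the dapF row shown above.
row = node_row("dapF", "b3809", [
    ("bacteria", 3.29),
    ("bacteroidetes", 25.00),
    ("bacteroidetes", 25.79),
    ("gammaproteobacteria", 3.29),
    ("clostridiales", 13.01),
])
```

The total in the third column is simply the sum of the per-phylum values (3.29 + 50.79 + 3.29 + 13.01 = 70.38), which matches the example row in the tutorial.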