Appendix 1 Additional experimental procedure details

Genomic DNA sequence generation and variant analysis

DNA was randomly sheared into ~250 bp fragments, and the resulting fragments were used to create an Illumina library. This library was sequenced on Illumina HiSeq2000 sequencers, generating 73-100 bp paired-end reads. MAQ-0.7.1 (Li and Durbin 2009) and SOAP (Li et al. 2009b) packages were used to map quality-filtered Illumina reads to the 8x genome assembly of the Bd21 reference strain. We required support from at least three reads to identify variants through the MAQ pipeline. Tracks of genomic positions with fewer than three reads were extracted and merged within 100 bp to identify potential deletions. We used custom Perl scripts to generate the high-confidence SNP set, containing SNPs found in both the MAQ and SOAP pipelines. SNPs common to the MAQ and SOAP pipelines were further filtered to remove SNPs in which the consensus base was ambiguous and SNPs in which the reference base was the most common allele (potential false positives). Putative SVs were called using BreakDancer, filtering for a confidence score of >90 (Chen et al. 2009). IMR/DENOM (Gan et al. 2011) was used in conjunction with the SOAP assembler and the BWA aligner (Li and Durbin 2009). Ambiguous and heterozygous SNPs were removed from the IMR/DENOM output, and variants from the pipeline were incorporated into the reference sequence using MCMERGE to create line-specific genomes. ACT was used to generate variant saturation and correlation plots (Jee et al. 2011). BEDTools was used to identify variable intersects between indels and other features, as well as to calculate variant frequency among genomic windows (Quinlan and Hall 2010).

To identify SNPs for population divergence estimates, we used samtools mpileup (Li et al. 2009a) to output counts for each base at all genomic positions, requiring a minimum read mapping quality of 29, while keeping track of genomic positions lacking sufficient sequence information. We further filtered the mpileup output for each genomic position, requiring coverage between 5 and 200. In total, 234,045,216 genomic positions met the above criteria for at least two resequenced lines in addition to the Bd21 reference (~86% of the Bd21 reference genome). To count towards a particular base assignment in a given sample, we required a minimum base quality of phred 30. For subsequent comparative analysis, we required the consensus base to contribute at least 60 percent of the total base calls for a given position in a sample; otherwise the position was excluded as uninformative. A custom Perl script was used to assign the consensus alternate base for each line and to compare nucleotides at each position, omitting positions with missing data, to calculate pairwise distance matrices using a 250 kb sliding window with a 100 kb step size. A similar script was used to calculate nucleotide diversity. The PHYLIP package, including the FITCH, CONSENSE and DRAWTREE programs, was used to generate phylogenetic trees (Felsenstein 1989).

Concordance between large indels and gene expression was assessed using BEDTools (Quinlan and Hall 2010) and custom Perl scripts. We functionally annotated variants using the Phytozome 8 gene annotation and incorporated SNPs and small indels into the Bd21 reference sequence to generate conservative synthetic genome sequences for each line. Bd21 transcripts were mapped to the synthetic genome sequences with BLAT and further processed with custom scripts to annotate gene models. To obtain population genetic statistics separately for each protein-coding gene, the annotated coding sequences were aligned with ClustalW2 (Larkin et al. 2007).
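As an illustration of the per-position consensus rule and the sliding-window comparison described above for the divergence estimates, the following Python sketch shows one way such a calculation could be organized. It is not the custom Perl script used in this study. The function names, the dictionary-based inputs, the choice to apply the 5-200 coverage bound to quality-filtered counts within each sample, and the treatment of shortened windows at the chromosome end are all assumptions made for the example.

```python
# Hypothetical sketch; names and input layout are illustrative assumptions.
MIN_COV, MAX_COV = 5, 200          # coverage bounds given in the text
MIN_FRACTION = 0.60                # consensus must reach 60% of base calls
WINDOW, STEP = 250_000, 100_000    # 250 kb window, 100 kb step


def consensus_base(counts):
    """Return the consensus base for one sample at one position, or None.

    `counts` maps base -> number of calls that already passed the
    base-quality filter (assumed to be done upstream).
    """
    total = sum(counts.values())
    if not MIN_COV <= total <= MAX_COV:
        return None                # insufficient or excessive coverage
    base, n = max(counts.items(), key=lambda kv: kv[1])
    return base if n / total >= MIN_FRACTION else None


def windowed_distance(calls_a, calls_b, chrom_len, window=WINDOW, step=STEP):
    """Yield (start, end, sites_compared, distance) for each window.

    `calls_a` and `calls_b` map 1-based position -> consensus base.
    Positions missing from either line are omitted from the comparison.
    Distance is the fraction of compared sites that differ, or None
    when a window contains no comparable site.
    """
    shared = sorted(set(calls_a) & set(calls_b))
    start = 1
    while start <= chrom_len:
        end = min(start + window - 1, chrom_len)
        in_win = [p for p in shared if start <= p <= end]
        diffs = sum(calls_a[p] != calls_b[p] for p in in_win)
        dist = diffs / len(in_win) if in_win else None
        yield start, end, len(in_win), dist
        start += step
```

In this sketch a position with no consensus call is left out of the input dictionaries, so missing data reduce the number of compared sites in a window rather than counting as a difference.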
Sites with an alignment gap in any sequence were removed with a custom Perl script, and Tajima’s D was calculated with the PopGen modules (Stajich and Hahn 2005) in BioPerl (Stajich et al. 2002).

Quantitative mRNA-Seq data generation and analysis

RNA was extracted using the Spectrum Plant Total RNA Kit (Sigma Aldrich, St. Louis, MO, USA) and quantified using a NanoDrop (NanoDrop Products). We prepared 1.5 µg of total RNA for tag-based RNA sequencing (Meyer et al. 2011). This method incorporates sequencing primers and barcodes at the 3’ end of mRNA molecules; sequenced tags are therefore enriched for 3’ UTR and 3’ exonic sequence and do not require correction for gene length. Prepared library samples were analyzed for quality on a BioAnalyzer (Agilent, Santa Clara, CA, USA) and then run on two lanes of the SOLiD 5500 system (Life Technologies, Carlsbad, CA, USA) at the University of Texas at Austin Genome Sequence and Analysis Facility. Reads were sorted by barcode and assigned to individual sample plants. Low-quality reads (those with homopolymer stretches > 15% of the read length and/or those containing >10 bases with quality scores below 16) were removed and barcodes trimmed using custom Perl scripts. We mapped all reads against the 8x Bd21 genome sequence using SHRiMP ver 2.1.1b (Rumble et al. 2009). We recovered between 310,000 and 4,400,000 mapped reads per sample, with most samples falling in the range of one to three million reads. We recovered very few reads from multiple samples of Bd21-3; this line was excluded from further differential expression analyses. Mapping efficiency ranged from 61% to 70% and was significantly correlated with genotype (ANOVA: F6,32 = 5.58, P = 0.0005), though the mean mapping efficiency varied by only 4.1% among genotypes. Mapped reads from each sample were used to create a table reporting counts for each locus in each sample.
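The two read-rejection criteria stated above (a homopolymer stretch longer than 15% of the read length, or more than 10 bases with quality below 16) can be expressed compactly. The Python sketch below is only an illustration and does not reproduce the custom Perl scripts used here. It assumes each read is available as a plain string with a matching list of integer quality scores; how the original scripts handled SOLiD color-space encoding is not stated, and the function names are hypothetical.

```python
# Hypothetical sketch; the read representation is an assumption.
from itertools import groupby

MAX_HOMOPOLYMER_FRACTION = 0.15    # longest run may not exceed 15% of length
MAX_LOW_QUALITY_BASES = 10         # at most 10 bases below the quality cutoff
MIN_QUALITY = 16


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated symbol in `seq`."""
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def is_low_quality(seq, quals):
    """Return True if a read fails either rejection rule from the text."""
    if not seq:
        return True                # assumption: empty reads are discarded
    too_repetitive = longest_homopolymer(seq) > MAX_HOMOPOLYMER_FRACTION * len(seq)
    low_bases = sum(q < MIN_QUALITY for q in quals)
    return too_repetitive or low_bases > MAX_LOW_QUALITY_BASES
```

A read is kept only when `is_low_quality` returns False, which mirrors the "and/or" wording: failing either criterion is sufficient for removal.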
Loci for which very few reads were recovered (zero reads from >50% of samples) were excluded from further analysis. This filtering scheme resulted in a dataset of 15,168 transcripts. Counts were normalized across samples using the KDMM protocol implemented in JMP Genomics 6.0 (SAS Institute, Cary, NC). We modeled the variance among samples as a negative binomial function and then fit a general linear model (GLM) with fixed effects for “genotype,” “treatment,” and their interaction and a random “block” effect in JMP Genomics. Overdispersion was accommodated by incorporating an overdispersion parameter estimated in PROC GLIMMIX. We controlled for multiple testing using a positive false-discovery rate of 0.05 (Storey and Tibshirani 2003). We also identified transcripts showing a significant treatment response in each line using the negative binomial test implemented in DESeq (Anders and Huber 2010), again using an FDR of 0.05. Compared to the full-model GLM, this approach results in a significant loss of statistical power but allows us to identify transcripts that specifically respond to the treatment in each line.

We used the program ELEMENT (Mockler et al. 2007) to test the hypothesis that gene sets are enriched for particular sequence elements that may control expression in cis. Specifically, we looked for over-represented motifs in the proximal 1000 bp upstream of genes annotated in the Bd21 reference sequence.

Bd1-1 deep RNA sequencing and transcriptome assembly

The strand-specific libraries were sequenced in a single-end, 101-cycle run on an Illumina HiSeq2000 sequencing system at the Center for Genome Research and Biocomputing (CGRB), Oregon State University, Corvallis, OR. Three lanes, with 8 libraries (one experiment) multiplexed per lane, were run in one flow cell. The processing of fluorescent images into sequences, base-calling and quality value calculations were performed using the CASAVA software version 1.8.
Pooled reads from all experiments were filtered and assembled using Rnnotator (Martin et al. 2010). The resulting contigs were filtered by size (≥ or < 1 kb) and aligned with GMAP (Wu and Watanabe 2005) to the Bd21 genome assembly (IBI 2010) and the line-specific genomes. BEDTools and custom scripts were used to identify aligned contigs defining previously unannotated transcripts (Quinlan and Hall 2010). Tophat (Trapnell et al. 2009) was used to map Bd1-1 and Bd21 Illumina RNA-Seq reads (Davidson et al. 2012) (obtained from the NCBI sequence read archive, SRP008505) to the genome sequences and to independently determine expression support for each transcript not contained in the reference annotation. Bd1-1 contigs that did not align to the Bd21 and Bd1-1 genome sequences were extracted. These transcripts were annotated using Blast2GO, by running BLASTx (E-value of <1e-6) against the NCBI Genbank non-redundant protein database followed by an InterProScan search (E-value of <1e-6) (Conesa and Gotz 2008).

References

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biol, 11, R106.
Chen, K., Wallis, J.W., McLellan, M.D., Larson, D.E., Kalicki, J.M., Pohl, C.S., McGrath, S.D., Wendl, M.C., Zhang, Q., Locke, D.P., Shi, X., Fulton, R.S., Ley, T.J., Wilson, R.K., Ding, L. and Mardis, E.R. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods, 6, 677-681.
Conesa, A. and Gotz, S. (2008) Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics, 2008, 619832.
Davidson, R.M., Gowda, M., Moghe, G., Lin, H., Vaillancourt, B., Shiu, S.H., Jiang, N. and Buell, C.R. (2012) Comparative transcriptomics of three Poaceae species reveals patterns of gene expression evolution. Plant J, 71, 492-502.
Felsenstein, J. (1989) PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics, 5, 164-166.
Gan, X., Stegle, O., Behr, J., Steffen, J.G., Drewe, P., Hildebrand, K.L., Lyngsoe, R., Schultheiss, S.J., Osborne, E.J., Sreedharan, V.T., Kahles, A., Bohnert, R., Jean, G., Derwent, P., Kersey, P., Belfield, E.J., Harberd, N.P., Kemen, E., Toomajian, C., Kover, P.X., Clark, R.M., Ratsch, G. and Mott, R. (2011) Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature, 477, 419-423.
IBI (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature, 463, 763-768.
Jee, J., Rozowsky, J., Yip, K.Y., Lochovsky, L., Bjornson, R., Zhong, G., Zhang, Z., Fu, Y., Wang, J., Weng, Z. and Gerstein, M. (2011) ACT: aggregation and correlation toolbox for analyses of genome tracks. Bioinformatics, 27, 1152-1154.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J. and Higgins, D.G. (2007) Clustal W and Clustal X version 2.0. Bioinformatics, 23, 2947-2948.
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R. (2009a) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079.
Li, R., Li, Y., Fang, X., Yang, H., Wang, J. and Kristiansen, K. (2009b) SNP detection for massively parallel whole-genome resequencing. Genome Res, 19, 1124-1132.
Martin, J., Bruno, V.M., Fang, Z., Meng, X., Blow, M., Zhang, T., Sherlock, G., Snyder, M. and Wang, Z. (2010) Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNASeq reads. BMC Genomics, 11, 663.
Meyer, E., Aglyamova, G.V. and Matz, M.V. (2011) Profiling gene expression responses of coral larvae (Acropora millepora) to elevated temperature and settlement inducers using a novel RNA-Seq procedure. Mol Ecol, 20, 3599-3616.
Mockler, T.C., Michael, T.P., Priest, H.D., Shen, R., Sullivan, C.M., Givan, S.A., McEntee, C., Kay, S.A. and Chory, J. (2007) The DIURNAL project: DIURNAL and circadian expression profiling, model-based pattern matching, and promoter analysis. Cold Spring Harb Symp Quant Biol, 72, 353-363.
Quinlan, A.R. and Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841-842.
Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol, 5, e1000386.
Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C.J., Osborne, B.I., Pocock, M.R., Schattner, P., Senger, M., Stein, L.D., Stupka, E., Wilkinson, M.D. and Birney, E. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res, 12, 1611-1618.
Stajich, J.E. and Hahn, M.W. (2005) Disentangling the effects of demography and selection in human history. Mol Biol Evol, 22, 63-73.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 100, 9440-9445.
Trapnell, C., Pachter, L. and Salzberg, S.L. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25, 1105-1111.
Wu, T.D. and Watanabe, C.K. (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21, 1859-1875.