MCB3895-004 Lecture #21
Nov 20/14
Prokaryote RNAseq
Today:
• Building off last lecture, we will use reference
alignment methods to understand differential
gene expression in prokaryotes
• Use Bowtie2 for alignment
• Use EDGE-pro for determining transcript
abundance
Experiment:
• Compare E. coli K-12 grown in glucose minimal
medium aerobically vs. anaerobically
• Aerobic datasets: SRR922260
• Anaerobic datasets: SRR922265
• All sequenced using Illumina GAIIx, 2x36bp PE
Basic idea of RNAseq
• One way to analyze a transcriptome (i.e., all
the mRNA molecules) is to count the number of
transcripts from each gene
• More transcripts implies more activity of that
gene
• Improvement over the previous technology
(microarrays), which required prior knowledge of
which genes to look for and was less sensitive
Problems:
1. How to compare short genes to long ones?
• Short genes will have fewer reads mapping to them
by random chance
2. How to compare genes from different
transcriptomes with different sampling intensity?
• Transcriptomes sampled more deeply will have more
reads mapping to each gene
RPKM
• "Reads per kilobase per million"
• RPKM normalizes for both gene length and
sampling intensity
• RPKM = [# of reads mapped to the gene] /
([length of transcript in kb] x [total mapped reads
in millions])
• Allows genes to be compared to each other
• Allows transcription to be compared between
transcriptomes
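A minimal Python sketch of the RPKM formula above. The function name and the toy read counts are invented for illustration and do not come from the SRR datasets. It shows that a short gene and a long gene with the same read density end up with the same RPKM.

```python
def rpkm(reads_mapped_to_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    length_kb = gene_length_bp / 1_000                # normalize for gene length
    millions_mapped = total_mapped_reads / 1_000_000  # normalize for sampling depth
    return reads_mapped_to_gene / (length_kb * millions_mapped)


# Toy numbers (hypothetical): same read density, different gene lengths.
short_gene = rpkm(reads_mapped_to_gene=500, gene_length_bp=500,
                  total_mapped_reads=10_000_000)
long_gene = rpkm(reads_mapped_to_gene=2_000, gene_length_bp=2_000,
                 total_mapped_reads=10_000_000)
print(short_gene, long_gene)
```

The long gene collects four times as many reads as the short one, but after dividing by length both values agree, which is what makes genes comparable to each other.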
RNAseq software
• Many packages exist for comparing
transcriptomes
• Most are tailored towards eukaryotes
• Emphasis on finding splice variants (not in bacteria)
• Do not account for overlapping genes (common in
bacteria, rare in eukaryotes)
Generalized scheme for RNAseq
1. Map reads to reference genome
2. Count reads mapping to each gene
3. Normalize for gene length and sampling
depth (i.e., calculate RPKM)
4. Statistically compare test and control sample
sets (a topic in itself, not covered in depth
here)
EDGE-pro
• The software we will use is EDGE-pro
• Installed on server in
/opt/bioinformatics/EDGE_pro_v1.3.1/
• Tailored for prokaryotes
• Magoc et al. (2013) Evolutionary Bioinformatics
9:127-136
• http://ccb.jhu.edu/software/EDGE-pro/
EDGE-pro outline
1. Use Bowtie2 to map reads
2. Calculate per base coverage
3. Assign per gene coverage
4. Disambiguate overlapping genes
5. Calculate RPKM for each gene
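To make the "per base coverage" and "per gene coverage" steps concrete, here is a toy Python sketch. It is not EDGE-pro's actual code: the function names, the 0-based end-exclusive coordinates, and the use of a simple mean per gene are all assumptions for illustration, and the disambiguation of overlapping genes is left out entirely.

```python
def per_base_coverage(genome_length, alignments):
    """Count how many aligned reads cover each base.

    alignments: list of (start, end) pairs, 0-based, end-exclusive (assumed).
    """
    coverage = [0] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            coverage[pos] += 1
    return coverage


def per_gene_coverage(coverage, genes):
    """Average the per-base coverage over each gene's interval."""
    result = {}
    for name, (start, end) in genes.items():
        window = coverage[start:end]
        result[name] = sum(window) / len(window) if window else 0.0
    return result


# Toy 10 bp "genome" with three aligned reads and two non-overlapping genes.
alignments = [(0, 4), (2, 6), (8, 10)]
genes = {"geneA": (0, 5), "geneB": (5, 10)}

coverage = per_base_coverage(10, alignments)
gene_cov = per_gene_coverage(coverage, genes)
print(coverage)
print(gene_cov)
```

Where two genes overlap, the bases they share would be counted for both genes in this sketch, which is the situation the disambiguation step in the outline exists to handle.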
Running EDGE-pro
• syntax: $ perl
/opt/bioinformatics/EDGE_pro_v1.3.1/edge.pl
-g [.fna name] -p [.ptt name] -r [.rnt name]
-u [.fastq 1 name] -v [.fastq 2 name] -s
/opt/bioinformatics/EDGE_pro_v1.3.1/
• -g: reference .fna file name
• -p: reference .ptt file name
• -r: reference .rnt file name
• -u: .fastq file name to map
• -v: .fastq file pairing with that specified by -u, if it exists
• -s: location where the program lives
• e.g.: $ perl
/opt/bioinformatics/EDGE_pro_v1.3.1/edge.pl
-g NC_000913.fna -p NC_000913.ptt -r
NC_000913.rnt -u SRR922260_1.fastq -v
SRR922260_2.fastq -s
/opt/bioinformatics/EDGE_pro_v1.3.1/
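Since the same command has to be run once per sample, a small Python helper can assemble it for each run ID. This is only a sketch: the helper name edge_pro_command is hypothetical, the file naming (run ID plus _1.fastq / _2.fastq) is assumed from the example above, and only the flags listed on this slide are used. The code builds and prints the command lines; it does not run EDGE-pro.

```python
import shlex

# Install location as given on the slides.
EDGE_DIR = "/opt/bioinformatics/EDGE_pro_v1.3.1/"


def edge_pro_command(run_id, genome_prefix="NC_000913", edge_dir=EDGE_DIR):
    """Return the EDGE-pro command for one paired-end run as an argument list."""
    return [
        "perl", edge_dir + "edge.pl",
        "-g", genome_prefix + ".fna",   # reference genome
        "-p", genome_prefix + ".ptt",   # protein-coding gene locations
        "-r", genome_prefix + ".rnt",   # RNA gene locations
        "-u", run_id + "_1.fastq",      # first read of each pair
        "-v", run_id + "_2.fastq",      # second read of each pair
        "-s", edge_dir,                 # location where the program lives
    ]


commands = {run: edge_pro_command(run) for run in ("SRR922260", "SRR922265")}
for run, cmd in commands.items():
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

An argument list like this could be handed to subprocess.run. Running each sample in its own working directory is one way to keep the two sets of output files apart; that is a suggestion, not something the slides specify.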
EDGE-pro: results
• One nice thing about EDGE-pro is that it runs
many scripts all by itself
• A "wrapper" or "pipeline" is something that bundles
different programs together
• Many of the output files are from bowtie2,
some are from EDGE-pro itself
• Note: make sure that you have enough space in
your account for these files
• The RPKM data are located in "out.rpkm_0",
which is a tab-delimited table listing the reads
mapped to each predicted transcript
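A hedged Python sketch for loading such a tab-delimited table into a dictionary. The slides do not describe the column layout of "out.rpkm_0", so the column positions here (gene ID in the first column, RPKM in the last) are assumptions exposed as parameters; check them against the real file before use. The sample text is a made-up toy, not real EDGE-pro output.

```python
import csv
import io


def read_rpkm_table(handle, id_col=0, rpkm_col=-1):
    """Read a tab-delimited table into {gene_id: rpkm}.

    id_col and rpkm_col are ASSUMED column positions, not documented ones.
    Rows whose RPKM field is not numeric (e.g. a header) are skipped.
    """
    values = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # blank or malformed line
        try:
            values[row[id_col]] = float(row[rpkm_col])
        except ValueError:
            continue  # header or other non-numeric line
    return values


# Toy stand-in for a results file (hypothetical contents).
toy_table = io.StringIO("gene_ID\tRPKM\ngeneA\t12.5\n\ngeneB\t0\n")
rpkm_by_gene = read_rpkm_table(toy_table)
print(rpkm_by_gene)

# With a real file one might write:
# with open("out.rpkm_0") as fh:
#     rpkm_by_gene = read_rpkm_table(fh)
```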
Comparing conditions
• There are many different ways to compare test
and control conditions
• This is outside of the scope of this class
• The RPKM values generated by EDGE-pro can
be reformatted as input for other comparison tools
• EDGE-pro contains a script that will do this for
DESeq, one of the most popular tools
• Generally multiple replicates should be
considered for each condition
EDGE-pro comparison
• The EDGE-pro paper suggests an easy
heuristic for transcriptome comparison:
1. Average RPKM values from treatment
replicates
2. Determine the RPKM fold change between
test and control treatments using simple
division
3. Only keep results >4-fold different
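The three steps above can be sketched in Python as follows (the assignment itself asks for Perl; this is only an illustration of the logic). The function names and toy RPKM values are invented. Skipping genes with an RPKM of zero in either condition is an assumption made here to avoid dividing by zero; the slides do not say how such genes should be treated.

```python
def mean_rpkm(replicates):
    """Step 1: average RPKM per gene across replicate dicts (gene -> RPKM)."""
    shared = set.intersection(*(set(rep) for rep in replicates))
    return {g: sum(rep[g] for rep in replicates) / len(replicates) for g in shared}


def fold_changes(test, control, threshold=4.0):
    """Steps 2 and 3: divide test by control, keep genes > threshold-fold apart."""
    up_in_test, up_in_control = {}, {}
    for gene in sorted(set(test) & set(control)):
        t, c = test[gene], control[gene]
        if t <= 0 or c <= 0:
            continue  # ASSUMPTION: skip zero RPKM to avoid division by zero
        ratio = t / c
        if ratio > threshold:
            up_in_test[gene] = ratio
        elif ratio < 1 / threshold:
            up_in_control[gene] = c / t
    return up_in_test, up_in_control


# Toy data (hypothetical): two "test" replicates, one "control" replicate.
test_reps = [
    {"geneA": 80, "geneB": 10, "geneC": 5, "geneD": 0},
    {"geneA": 120, "geneB": 10, "geneC": 5, "geneD": 0},
]
control_reps = [{"geneA": 10, "geneB": 12, "geneC": 50, "geneD": 7}]

up_test, up_control = fold_changes(mean_rpkm(test_reps), mean_rpkm(control_reps))
print(up_test)
print(up_control)
```

With a single replicate per condition, mean_rpkm simply returns the input values, so the same two functions cover the one-replicate case.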
A reference genome quirk:
• EDGE-pro requires the standard .fna
genome file and .ptt and .rnt files that list
gene locations on the chromosome
• Unfortunately, these files are only available from the old
version of the NCBI ftp server
• Location for today:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/
Today's assignment
• Use EDGE-pro to calculate RPKM values for
the E. coli K-12 RNAseq transcriptomes
generated under aerobic (SRR922260) and
anaerobic (SRR922265) conditions
• Write a short perl script to calculate the
recommended EDGE-pro comparison
• Only one replicate so no averaging needed
• Report 4-fold overrepresented genes in aerobic
treatment
• Report 4-fold overrepresented genes in anaerobic
treatment