1d Mapping lab

Mapping the reads to a reference
A FASTQ file contains the sequences of the reads and their corresponding quality
information. It does not contain any information about the origin of the reads, i.e.
to which transcript they belong. To find that out we could use a Blast search for
each of the reads.
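To make the FASTQ layout concrete, here is a minimal Python sketch that reads records from a FASTQ stream and converts the quality string to Phred scores. It assumes the common four-lines-per-read layout and a quality offset of 33; the read in the example is made up and does not come from R1.fastq.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"not a FASTQ header: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# Invented example read, only to show the structure of a record.
example = io.StringIO("@read1/1\nACGTAC\n+\nIIIIH#\n")
first = next(read_fastq(example))
print(first.read_id, first.sequence, phred_scores(first.quality))
```

Note that the record holds an identifier, a sequence and qualities, but nothing about where on the genome the read came from.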
1. Take the first read in the R1.fastq file and search for its origin using Blast
(for instance: https://www.arabidopsis.org/Blast/index.jsp). What is the
most likely gene of origin?
2. Estimate how long it would take to identify the 250,000 reads in R1 this
way.
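The estimate in question 2 is simple arithmetic, sketched below in Python. The 30 seconds per Blast search is only an assumed placeholder; replace it with the time you measured yourself in question 1.

```python
def sequential_search_days(n_reads: int, seconds_per_search: float) -> float:
    """Days needed to search all reads one after another."""
    seconds_per_day = 24 * 60 * 60
    return n_reads * seconds_per_search / seconds_per_day


# 30 s per search is an assumption, not a measured value.
days = sequential_search_days(250_000, 30.0)
print(f"{days:.1f} days")
```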
Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2) is a much faster software
tool to search/align/map reads. It is also much more sensitive to mismatches
and indels. Bowtie2 takes FASTQ files as input and also needs a reference
genome, in our case Arabidopsis thaliana TAIR10. The aligned reads, with their
coordinates on the reference genome, are stored in a so-called BAM file, which is
the compressed (‘zipped’) form of a SAM file. SAM stands for “Sequence
Alignment/Map” (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/).
SAM files can be read in a text editor, BAM files cannot (because they are
compressed). You can view the aligned reads using the “Integrative Genomics
Viewer” (IGV).
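Because a SAM file is plain tab-separated text, one alignment line can be taken apart with a few lines of Python. The sketch below splits a line into the eleven mandatory SAM fields and checks the "unmapped" flag bit (0x4). The example line is invented for illustration and is not taken from the course data.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # read qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the eleven mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_mapped(record: SamRecord) -> bool:
    """A read is mapped when flag bit 0x4 (segment unmapped) is not set."""
    return not record.flag & 0x4


# Invented alignment line; optional tag fields after column 11 are ignored.
line = "read1\t99\tChr1\t3700\t42\t6M\t=\t3850\t156\tACGTAC\tIIIIH#"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_mapped(record))
```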
3. First download (registration required) IGV from this website:
https://www.broadinstitute.org/igv/. Choose the “Java Web Start”
750 MB or 1.2 GB version. When IGV is running, go to the “Genomes”
menu and then “Load Genome From Server” to download the Arabidopsis
TAIR10 genome.
4. In http://www.bioinformatics.nl/galaxy search for the “Bowtie2” tool.
Run it in “paired-end” mode, select R1.fastq as the forward FASTQ file and R2.fastq
as the reverse. Make sure to select the right files from your history. Click
“Execute”.
5. Viewing the resulting BAM file will not work, but you do get a summary of
the mapping with the number of reads that aligned 0, 1, or more times.
What is the percentage of reads that mapped at least once?
6. Download the BAM file to a folder on the D:\ drive and also make sure to
download its corresponding bam_index file (use the disc/save icon). Open
the BAM file in IGV (not the index file, that just has to be in the same
folder). In IGV, type the gene identifier for Rubisco (AT1G67090) in the
search box (next to the “Go” button), to see whether any reads mapped to
that gene. Does the mapping make sense?
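For the percentage asked in step 5, the calculation can be sketched as follows. The function takes the three counts from the mapping summary (reads aligned 0 times, exactly 1 time, and more than 1 time). The counts in the example are made up; use the numbers from your own Galaxy history item. For paired-end data the summary contains more categories than these three, so treat this as a simplified sketch.

```python
def percent_mapped(zero_times: int, one_time: int, multiple_times: int) -> float:
    """Percentage of reads that aligned at least once."""
    total = zero_times + one_time + multiple_times
    if total == 0:
        raise ValueError("no reads in the summary")
    return 100.0 * (one_time + multiple_times) / total


# Hypothetical counts that add up to 250,000 reads.
print(percent_mapped(zero_times=50_000, one_time=180_000, multiple_times=20_000))
```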
RNAseq reads originate from mature mRNA transcripts, which means that they
should only align to exons and not to introns. A read that spans two exons will
therefore have to be split to be mapped properly on the genome (to allow for the
intron). Bowtie2 does not do that, but a splice junction mapper like Tophat does
(http://ccb.jhu.edu/software/tophat/index.shtml).
Tophat first uses Bowtie2 to map the reads that fall entirely within one exon, and
then tries to map the remaining reads split over two or more exons. Tophat uses
the known exon/intron structure, but can also predict new exons.
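In a SAM/BAM file a split read is recorded in the CIGAR string, where the operation N marks a skipped stretch of the reference (the intron). The Python sketch below turns a mapping position and a CIGAR string into the genomic blocks the read covers. The positions and CIGAR strings in the example are invented.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive reference intervals, split at N (intron) operations."""
    blocks = []
    start = pos
    cursor = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "MD=X":
            cursor += n  # these operations consume the reference within a block
        elif op == "N":
            if cursor > start:
                blocks.append((start, cursor - 1))
            cursor += n  # jump over the intron
            start = cursor
        # I, S, H and P do not consume reference positions
    if cursor > start:
        blocks.append((start, cursor - 1))
    return blocks


# A read of 50 bases split over two exons with a 500 bp intron in between.
print(aligned_blocks(1000, "30M500N20M"))
# The same read length mapped within a single exon.
print(aligned_blocks(1000, "50M"))
```

A read with one block is what Bowtie2 can report; a read with two or more blocks needs a splice junction mapper.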
7. Search for “Tophat” in Galaxy (do not use the “Tophat for Illumina”).
Select “Paired-end as individual datasets” and again use R1.fastq and
R2.fastq from your Galaxy history. Keep the rest default and click
“Execute”. This will take some time. When Tophat is finished, check the “align
summary” in the Galaxy history to see what percentage of reads mapped.
Is it better than Bowtie2?
8. Download the BAM file and its index from the “accepted hits” history item
and open the BAM file in IGV. Again look at the reads that mapped to
Rubisco and some other genes.
The BAM files show the reads aligned to the reference genome, but they do not
directly tell you which genes or transcripts are actually expressed. A tool that
does that is Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/). If a gene has
several isoforms, Cufflinks predicts the expression of the different isoforms
based on the expressed exons. Cufflinks takes BAM files as input and produces
several output tables with the expressed transcripts and genes. Cufflinks
optionally takes a GTF file as input. A GTF file contains the coordinates of exons,
genes etc. on a DNA sequence (like the reference genome).
http://www.ensembl.org/info/website/upload/gff.html
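A GTF line has nine tab-separated columns, with the last column holding key "value" attribute pairs. The Python sketch below parses one such line. The example line is illustrative only: the start position, gene identifier and gene name follow the exon discussed in this lab, but the source column, end coordinate and attribute layout are assumptions and may differ from genes.gtf.

```python
import re
from typing import Any, Dict

# Attributes look like: gene_id "AT1G01010"; transcript_id "AT1G01010.1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line: str) -> Dict[str, Any]:
    """Parse the nine tab-separated columns of one GTF feature line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GTF feature line has exactly 9 columns")
    seqname, source, feature, start, end, score, strand, frame, attributes = columns
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE.findall(attributes)),
    }


# Illustrative line; end coordinate and attribute values are assumptions.
example = (
    "1\tprotein_coding\texon\t3631\t3913\t.\t+\t.\t"
    'gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "ANAC001";'
)
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["attributes"]["gene_id"])
```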
9. Use “Get data”, “Upload File” to upload the GTF file that accompanies the
Arabidopsis thaliana genome. You can directly upload from this URL:
http://www.bioinformatics.nl/courses/RNAseq/genes.gtf (use
“Paste/fetch data”) or first download the file to your local computer.
10. If you click on the eye icon in the GTF history item you see the top of the
GTF file. The first line of the file shows that the first exon of the gene
AT1G01010 (ANAC001) starts at position 3631 on chromosome 1. Two
lines lower you can see that the start codon for this gene is at position
3760.
11. Search for Cufflinks in Galaxy and select the Tophat “accepted_hits” BAM
file as input. Choose to use a “Reference Annotation as guide”, and as the
reference annotation choose the GTF file that you just uploaded. Leave the
rest of the settings as they are and click “Execute”. Cufflinks will run for a
few minutes. Cufflinks produces four outputs. The “assembled
transcripts” output looks similar to the genes.gtf file, but it will have some newly
predicted transcripts with “CUFF” identifiers. Most of these overlap with
already known genes. There are two files with (normalized) read counts,
for the genes and the transcripts. You could use the “Filter data” tool in
Galaxy to see the read count for the Rubisco gene (c4=='AT1G67090'), or
load the complete table in Excel. More about quantification tomorrow.
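Outside Galaxy, the filter c4=='AT1G67090' amounts to keeping the rows whose fourth column equals the gene identifier. A minimal Python sketch of that idea is shown below. The small table is invented and only has a few columns; the real Cufflinks output has more columns, and the assumption here is simply that the gene identifier sits in column 4, as in the Galaxy filter expression.

```python
import csv
import io
from typing import List, TextIO


def filter_by_column(handle: TextIO, column: int, value: str) -> List[List[str]]:
    """Keep rows of a tab-separated table whose 1-based column equals value."""
    reader = csv.reader(handle, delimiter="\t")
    return [row for row in reader if len(row) >= column and row[column - 1] == value]


# Invented table; the values are placeholders, not real expression levels.
table = io.StringIO(
    "tracking_id\tclass_code\tnearest_ref_id\tgene_id\tFPKM\n"
    "AT1G01010\t-\t-\tAT1G01010\t1.5\n"
    "AT1G67090\t-\t-\tAT1G67090\t2500.0\n"
)
hits = filter_by_column(table, column=4, value="AT1G67090")
print(hits)
```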