1d Mapping lab
Mapping the reads to a reference

A FASTQ file contains the sequences of the reads and their corresponding quality information. It does not contain any information about the origin of the reads, i.e. to which transcript they belong. To find that out we could use a Blast search for each of the reads.

1. Take the first read in the R1.fastq file and search for its origin using Blast (for instance: https://www.arabidopsis.org/Blast/index.jsp). What is the most likely gene of origin?
2. Estimate how long it would take to identify the 250,000 reads in R1 this way.

Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2) is a much faster software tool to search/align/map reads. It is also much more sensitive to mismatches and indels. Bowtie2 takes FASTQ files as input and also needs a reference genome, in our case Arabidopsis thaliana TAIR10. The aligned reads, with their coordinates on the reference genome, are stored in a so-called BAM file, which is the compressed (‘zipped’) form of a SAM file. SAM stands for “Sequence Alignment/Map” (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/). SAM files can be read in a text editor, BAM files cannot (because they are compressed). You can view the aligned reads using the “Integrative Genomics Viewer” (IGV).

3. First download (registration required) IGV from this website: https://www.broadinstitute.org/igv/. Choose the “Java Web Start” 750MB or 1.2GB version. When IGV is running, go to the “Genomes” menu and then “Load Genome From Server” to download the Arabidopsis TAIR10 genome.
4. In http://www.bioinformatics.nl/galaxy search for the “Bowtie2” tool. Use it “paired-end”, select R1.fastq as the forward FASTQ file and R2.fastq as the reverse. Make sure to select the right files from your history. Click “Execute”.
5. Viewing the resulting BAM file will not work, but you do get a summary of the mapping with the number of reads that aligned 0, 1, or more times. What is the percentage of reads that mapped at least once?
6. Download the BAM file to a folder on the D:\ drive and also make sure to download its corresponding bam_index file (use the disc/save icon). Open the BAM file in IGV (not the index file, that just has to be in the same folder). In IGV, type the gene identifier for Rubisco (AT1G67090) in the search box (next to the “Go” button), to see whether any reads mapped to that gene. Does the mapping make sense?

RNAseq reads originate from mature mRNA transcripts, which means that they should only align to exons and not to introns. A read that spans two exons will therefore have to be split to be mapped properly on the genome (to allow for the intron). Bowtie2 does not do that, but a splice junction mapper like Tophat does (http://ccb.jhu.edu/software/tophat/index.shtml). Tophat first uses Bowtie2 to map the reads that fall entirely within one exon, and then tries to map the remaining reads split over two or more exons. Tophat uses the known exon/intron structure, but can also predict new exons.

7. Search for “Tophat” in Galaxy (do not use the “Tophat for Illumina”). Select “Paired-end as individual datasets” and again use R1.fastq and R2.fastq from your Galaxy history. Keep the rest default and click “Execute”. This will take some time. If Tophat is finished, check the “align summary” in the Galaxy history to see what percentage of reads mapped. Is it better than Bowtie2?
8. Download the BAM file and its index from the “accepted hits” history item and open the BAM file in IGV. Again look at the reads that mapped to Rubisco and some other genes.

The BAM files show the reads aligned to the reference genome, but they do not directly tell you which genes or transcripts are actually expressed. A tool that does that is Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/). If a gene has several isoforms, Cufflinks predicts the expression of the different isoforms based on the expressed exons. Cufflinks takes BAM files as input and produces several output tables with the expressed transcripts and genes. Cufflinks optionally takes a GTF file as input. A GTF file contains the coordinates of exons, genes etc. on a DNA sequence (like the reference genome). http://www.ensembl.org/info/website/upload/gff.html

9. Use “Get data”, “Upload File” to upload the GTF file that accompanies the Arabidopsis thaliana genome. You can directly upload from this URL: http://www.bioinformatics.nl/courses/RNAseq/genes.gtf (use “Paste/fetch data”) or first download the file to your local computer.
10. If you click on the eye icon in the GTF history item you see the top of the GTF file. The first line of the file shows that the first exon of the gene AT1G01010 (ANAC001) starts at position 3631 on chromosome 1. Two lines lower you can see that the start codon for this gene is at position 3760.
11. Search for Cufflinks in Galaxy and select the Tophat “accepted_hits” BAM file as input. Choose to use a “Reference Annotation as guide”, and as the reference annotation choose the GTF file that you just uploaded. Leave the rest of the settings as they are and click “Execute”. Cufflinks will run for a few minutes. Cufflinks produces four outputs. The “assembled transcripts” looks similar to the genes.gtf file, but it will have some newly predicted transcripts with “CUF” identifiers. Most of these overlap with already known genes. There are two files with (normalized) read counts, for the genes and the transcripts. You could use the “Filter data” tool in Galaxy to see the read count for the Rubisco gene (c4=='AT1G67090'), or load the complete table in Excel. More about quantification tomorrow.
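For those who prefer a script over Galaxy or Excel, the filter on column 4 can also be sketched in Python. This is only a minimal illustration, not part of the lab itself: the function name filter_rows is made up here, and the small table in SAMPLE is invented placeholder data (the expression values are not real results). It assumes the Cufflinks gene table is tab-separated with the gene identifier in the fourth column, as the c4 condition implies.

```python
import csv
import io


def filter_rows(table_text, column, value):
    """Return the rows whose 1-based `column` equals `value`.

    Mimics a Galaxy "Filter data" condition such as c4=='AT1G67090'.
    `filter_rows` is a hypothetical helper name used for illustration.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    return [row for row in reader
            if len(row) >= column and row[column - 1] == value]


# Invented placeholder table: the header names and numbers are assumptions
# for illustration only, not output from a real Cufflinks run.
SAMPLE = (
    "tracking_id\tclass_code\tnearest_ref_id\tgene_id\tvalue\n"
    "AT1G01010\t-\t-\tAT1G01010\t1.0\n"
    "AT1G67090\t-\t-\tAT1G67090\t2.0\n"
)

hits = filter_rows(SAMPLE, 4, "AT1G67090")
for row in hits:
    print("\t".join(row))
```

To apply this to a table downloaded from the Galaxy history, read the file contents into a string and pass that string in place of SAMPLE.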