Gene 760 – Problem Set 2

The purpose of this problem set is to familiarize students with the analysis of ChIP-seq data. You will be working with ChIP-seq data derived from two human cell lines: K562, an erythroleukemia (i.e., blood) derived cell line, and NHEK, a normal epidermal (i.e., skin) keratinocyte line. The targets in these ChIP-seq experiments are RNA polymerase II (polII) and the histone modification H3K27ac, which marks active promoters and enhancers. By the end of this problem set, you will have learned how to use:

• Bowtie to map short reads to a genome
• BAM/SAM/Wig/bigwig files to store genomic data
• MACS to call peaks in ChIP-Seq data
• Genome browser tools such as table browser and kent tools
• DAVID to perform gene ontology analysis on a list of genes
• Python to perform QC and normalize reads, PySAM to interact with BAMs in python

Students are to submit a gzipped tarball called [Your_NetID]_PS2.tar.gz containing the following files:

• [Your_NetID]_PS2_answers.txt: Answers to the questions below
• [Your_NetID]_QC.py: QC script in Python
• [Your_NetID]_norm.py: Normalization script in Python
• [Your_NetID]_translate_IDs.py: ID Translation script in Python
• [Your_NetID]_bigWig
• [Your_NetID]_DAVID_polII_both
• [Your_NetID]_DAVID_polII_K562
• [Your_NetID]_DAVID_polII_NHEK
• [Your_NetID]_GREAT_shared
• [Your_NetID]_GREAT_K562
• [Your_NetID]_GREAT_NHEK

These are due to ‘DROPBOX/PS2/’ by 9 AM on Monday, March 7.

The datasets for this problem set are located in DATA/PS2/

In this directory, you will find 6 FASTQ formatted sequencing files:

• K562_H3K27ac.fastq
• K562_input.fastq
• K562_polII.fastq
• NHEK_H3K27ac.fastq
• NHEK_input.fastq
• NHEK_polII.fastq

As well as the hg19 index files for bowtie:

• hg19 index files (hg19.1.ebwt, hg19.2.ebwt, hg19.3.ebwt, hg19.4.ebwt, hg19.rev.1.ebwt, hg19.rev.2.ebwt)

1. Generate alignments for each dataset using bowtie.

Allow 2 mismatches per seed, and only report uniquely mapping reads. Output your results in SAM format.
These are human cell lines, so be sure to align your reads to hg19. Remember: bowtie and everything else this problem set contains is computationally intensive! Do not run on the login node!

* If you are feeling crafty, you could write a simple queue script and set it up to run jobs in parallel and take advantage of more than one node.

ANS: For each dataset, report the commands you used.

ANS: Convert each SAM file to a BAM file using samtools. Report the commands you used and the file size of the SAM vs. BAM files in bytes.

ANS: Sort each BAM file using samtools. Report the commands you used and the file size of these sorted files in bytes versus the originals. Is there a difference? Why or why not?

ANS: Use samtools to create an index for each bam file. Report the commands you used. What is the advantage of indexing a bam file?

ANS: Use samtools to report the number of mapped reads in each .bam file, using both ‘samtools idxstats’ and ‘samtools flagstat’. Are the numbers the same? Why or why not? Which is faster, flagstat or idxstats, and why?

Note: once you’re finished with this question, keep only the sorted .bam files and .bai indices.

2. Call H3K27ac enriched regions and RNA PolII peaks.

We will be using MACS for calling peaks.

• (http://liulab.dfci.harvard.edu/MACS/README.html) Consult the README file for information about how to use MACS. The parameters for calling peaks will depend on whether you are calling a TF peak or a histone mark.

MACS outputs called peaks in BED format. MACS will also output ChIP-seq signal tracks in WIG (wiggle) format as well as input signal tracks. WIG files are large, so they are frequently converted to a binary format, bigWig, which is stored locally but can be visualized on the Genome Browser.

Each MACS run will take around an hour (4 total).
ANS: Upload your peak calls (in BED format) as custom tracks to the UCSC Genome Browser and save as a session to share with the TAs (include the link to that session in your answers).

ANS: Report the total number of peaks called for H3K27ac and polII in each cell line in your answer sheet.

ANS: Using bedtools, determine and report the following in your answers:

The number of MACS H3K27ac and polII peaks called for each cell type at promoters (defined as 1 kb upstream of an annotated transcription start site), exons, and intergenic/intronic regions (potential enhancers).

Note: Calling peaks as in one region should mean excluding them from any other region (i.e., promoter peaks overlapping the first exon should not also be called as exon peaks, and exon peaks should not also be called intronic/intergenic peaks).

Remember that when overlapping genes (the exons bed file), bedtools will not consider just the exons by default. Also, some peaks overlap multiple features, so be sure to report unique results only (make sure promoters + exons + inter = total_peaks).

BED files for promoters and exons can be found in DATA/PS2

ANS: What do H3K27ac peaks at promoters mean? PolII at promoters?

ANS: What about H3K27ac at intergenic/intronic regions? PolII?

ANS: What does overlap of both H3K27ac and PolII at promoters or intergenic/intronic regions mean?

IMPORTANT: MACS outputs a directory with .wig.gz files for each chromosome. Once you are done, copy the chr1 .wig.gz files output by MACS out of this directory and extract them, then delete the rest of the chr wig files, as we will only be working with chr1 wigs moving forward.

3. ChIP and MACS quality control:

It is important to run some basic checks of your results when performing this sort of genomics analysis, both to make sure everything is working and to get an idea of the data you’re working with. Write a python script that counts and reports the percentage of ChIP-Seq reads that fall within the peaks called by MACS.

Your script should take the following inputs: [Input bam file] [Input bed file]

To count the total reads in a region of a BAM file, you should use the ‘pysam’ module. Documentation can be found here: http://pysam.readthedocs.org/en/latest/

Louise has an older version of pySAM, so pysam.AlignmentFile in the documentation is equivalent to pysam.Samfile in our installation.
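The percentage described above is just the sum of per-peak read counts divided by the total mapped reads. The following is a minimal sketch of that arithmetic only, not a reference solution. The function names and the `count_in_region` callable are hypothetical; the callable stands in for whatever region-counting call your pysam installation provides.

```python
def read_bed_intervals(lines):
    """Return (chrom, start, end) tuples from BED lines, skipping header lines."""
    intervals = []
    for line in lines:
        fields = line.split()
        # Skip blank/short lines, comments, and UCSC track/browser headers.
        if len(fields) < 3 or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def percent_reads_in_peaks(count_in_region, total_mapped_reads, intervals):
    """Percentage of mapped reads that fall inside the given intervals.

    count_in_region is any callable taking (chrom, start, end) and returning
    a read count (hypothetical interface, chosen for illustration). Reads that
    overlap two peaks are counted twice, as the problem set allows.
    """
    in_peaks = sum(count_in_region(chrom, start, end)
                   for chrom, start, end in intervals)
    return 100.0 * in_peaks / total_mapped_reads
```

With pysam, `count_in_region` could plausibly be the `count` method of an opened, indexed BAM file, with the file's `mapped` attribute supplying the total. Check both against the documentation for the installed version before relying on them.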
It is fair to assume that MACS peaks will not be close enough together to worry about double counting a significant fraction of reads, though for some samples you may get over 100% because of double-counting.

IMPORTANT: pysam is not installed in python by default. To run your python file, use ‘py27 file.py’ rather than ‘python file.py’. py27 is an alias in your .bashrc that links to a version of python on the cluster with the pysam module installed.

ANS: Save your script as [Your_NetID]_QC.py

ANS: Report in your answers the percentage of reads that fall within peaks for each factor in each cell type. Write a sentence interpreting the result.

4. Normalize ChIP data

Write a python script that will normalize the signal files output by MACS relative to the total number of million mapped reads in the experiment (reads per million), i.e., if your ChIP has 10 million mapped reads after filtering, for a nucleotide covered by 10 reads, the normalized value should be 1.

Your python script should take the following inputs from the command line: [input wig file] [Total mapped reads]

Your python script should output a normalized wiggle file named [InputWigName]_Norm.wig. Your script should skip over any positions whose value is below 0.5 after normalization.

IMPORTANT: You will need to maintain proper structure of the wig file, as not all lines contain read coverage data. Read http://genome.ucsc.edu/goldenpath/help/wiggle.html for more information.

ANS: Save and name this script [Your_NetID]_norm.py

ANS: Add to your previous UCSC browser session the normalized chr1 wiggle tracks for both factors and input, for both tissue types. Save the session.

ANS: Why write such a script, i.e., why normalize signal files?

5. Convert the Chr1 wig file to bigWig format using wigToBigWig

Instructions are found at: http://genome.ucsc.edu/goldenPath/help/bigWig.html, complete the first 5 steps.

You should use ‘wget’ on louise for Step 3 to download the executable file.

You do not need to complete Step 4: chromosome size annotations are located in ANNOTATION/hg19.chrom.sizes

ANS: Save the output as [Your_NetID]_bigWig

ANS: Record the file size for the Chr1 bigWig vs the Chr1 wig file for each cell type and factor in your answer file (treatment, not input DNA).

ANS: Why is bigWig format useful? Are there drawbacks?
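For the reads-per-million normalization in question 4, each data value is divided by the number of millions of mapped reads. The sketch below shows only that arithmetic and the pass-through of non-data lines. It assumes variableStep data lines of the form `position value`; confirm this against your own MACS output. The function name and the output number formatting are illustrative choices, not requirements of the assignment.

```python
def normalize_wig_lines(lines, total_mapped_reads, min_value=0.5):
    """Yield wig lines with data values rescaled to reads per million.

    Assumes variableStep data lines ("position value"); this is an
    assumption of the sketch, not something the wig format guarantees.
    """
    scale = total_mapped_reads / 1e6  # millions of mapped reads
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        fields = stripped.split()
        # Declaration and header lines carry no coverage data; keep them as-is.
        if stripped.startswith("#") or fields[0] in (
                "track", "browser", "variableStep", "fixedStep"):
            yield stripped
            continue
        position, value = fields[0], float(fields[1])
        normalized = value / scale
        # Drop positions whose normalized value is below the cutoff.
        if normalized < min_value:
            continue
        yield "%s\t%.4f" % (position, normalized)
```

With 10,000,000 mapped reads, a position covered by 10 reads comes out as 1.0000, matching the example in the question, while a position covered by 4 reads (0.4 after scaling) is skipped.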
6. Compare H3K27ac and polII profiles between cell types.

ANS: Use BedTools to determine the number of (1) promoter and (2) intergenic/intronic H3K27ac and polII peaks (MACS calls; treat H3K27ac and PolII separately) that are:

a) Shared between K562 and NHEK cells and
b) Unique to each cell type.

For each promoter marked by H3K27ac or polII in each cell type, identify the NCBI Entrez Gene ID for that gene. To do this, you should create a simple python script that uses a dictionary to translate RefSeq IDs in a bed file to Entrez IDs provided in a translation file. We have created a file with RefSeq IDs and their associated Entrez Gene ID in DATA/PS2/Prom_with_Entrez.txt. Note that there will not necessarily be an Entrez ID for every RefSeq ID; if there is none, ignore that promoter.

Your script should take the following input from the command line: [Input bed file with RefSeq IDs] [Input translation file]

The script should output a text file with a list of Entrez IDs corresponding to the RefSeq IDs of the promoters in the input file, 1 per line. This will be necessary for the next problem.

ANS: Save your python script as [Your_NetID]_translate_IDs.py

7. Use DAVID to determine Gene Ontology enrichments for promoters marked by H3K27ac and polII in each cell type.

DAVID is a very useful set of tools for annotating functional genomics data. Go to http://david.abcc.ncifcrf.gov/. DAVID accepts a list of gene names (in this case, it should be the Entrez Gene IDs you obtained above) identified in gene expression or ChIP-seq analyses and can be used to determine functional enrichments in those gene sets compared to all genes. Read the DAVID FAQ page to understand how DAVID works: http://david.abcc.ncifcrf.gov/content.jsp?file=FAQs.html

ANS: Use the DAVID Gene Functional Annotation Tool to determine GO Biological Process (in the GOTERM_BP_FAT output table) enrichments for the following comparisons (use MACS peak calls). Save the output of each analysis as a flat text file with the indicated name.

a) Promoters marked by polII in both K562 and NHEK ([Your_NetID]_DAVID_polII_both)
b) Promoters marked by polII only in K562 ([Your_NetID]_DAVID_polII_K562)
c) Promoters marked by polII only in NHEK ([Your_NetID]_DAVID_polII_NHEK)

ANS: Interpret your results in a, b, and c with respect to the two cell types.
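The dictionary lookup asked for in question 6 can be sketched as follows. The file layouts here are assumptions, since the problem set does not state them: the translation file is taken to be tab-delimited with the RefSeq ID in the first column and the Entrez ID in the second, and the BED file is taken to carry the RefSeq ID in its fourth column. Removing duplicate Entrez IDs is a design choice of this sketch, not a stated requirement.

```python
def build_translation(lines):
    """Map RefSeq ID -> Entrez ID from tab-delimited lines (assumed layout)."""
    refseq_to_entrez = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Rows without an Entrez ID are left out of the dictionary.
        if len(fields) < 2 or not fields[1].strip():
            continue
        refseq_to_entrez[fields[0].strip()] = fields[1].strip()
    return refseq_to_entrez


def translate_bed(bed_lines, refseq_to_entrez, name_column=3):
    """Return Entrez IDs for the RefSeq IDs named in a BED file.

    name_column=3 assumes the RefSeq ID sits in the 4th BED column.
    Promoters with no Entrez ID are ignored; repeats are reported once.
    """
    entrez_ids = []
    seen = set()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= name_column:
            continue
        entrez = refseq_to_entrez.get(fields[name_column])
        if entrez is not None and entrez not in seen:
            seen.add(entrez)
            entrez_ids.append(entrez)
    return entrez_ids
```

Writing the result with one ID per line, for example `"\n".join(entrez_ids)`, gives a list in the one-per-line form described in question 6.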
8. Use GREAT to associate putative cell-type specific enhancers with target genes.

GREAT (great.stanford.edu) is a tool for functional enrichment analyses of distant-acting regulatory elements. It can be used to identify potential target genes for enhancers and to determine whether that target gene set is enriched for particular biological functions. There are two ways to load data into GREAT: directly from the Table Browser, or uploading BED files. In task 6 above, you generated BED files for H3K27ac peak calls in K562 and NHEK cells.

ANS: Use GREAT to determine functional enrichments for:

a) Intergenic/intronic H3K27ac sites shared between both cell types;
b) Intergenic/intronic H3K27ac sites unique to each cell type.

ANS: For each comparison, export the GO Biological Process tables as .tsv files named [Your_NetID]_GREAT_shared, [Your_NetID]_GREAT_K562, [Your_NetID]_GREAT_NHEK.

ANS: Interpret your results in a and b with respect to the two cell types.

ANS: Would you expect there to be greater divergence in gene ontology/functional enrichments between cells when looking at promoters or enhancers? What do you observe here?

Optional programming challenges:

For those of you with some programming experience looking to practice your python skills, here are optional challenges. These will NOT be worth more than a point of extra credit (negligible to your grade), and you should not complete them if you haven’t finished the problem set already. If you decide to work on these, please use a separate file from your main script when handing in.

Writing a wrapper

It is common for programmers to write wrappers around their scripts that are able to take multiple files or user interaction and pipeline the scripts’ functions accordingly.

Medium: Write a wrapper for your QC and normalization scripts that will search a directory for .bam files, and automatically run QC and normalization on every MACS output file for those bam files (assuming you use consistent naming conventions).

Hint: you may find ‘import subprocess’ to be useful, as it will allow you to run system commands (such as ls) and parse their output.
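One possible shape for such a wrapper is sketched below, covering only the QC step. It assumes a hypothetical naming convention in which `sample.bam` is paired with `sample_peaks.bed` in the same directory; adjust this to whatever names your MACS runs actually produced. It uses `os.listdir` in place of parsing `ls` output, and takes the interpreter as a parameter because a shell alias such as py27 is generally not visible to `subprocess` unless a shell is involved.

```python
import os
import subprocess


def plan_qc_commands(directory, qc_script, python_exe="python"):
    """Build one QC command per .bam file that has a matching peaks file.

    The "<sample>_peaks.bed" pairing is an assumed naming convention.
    """
    commands = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".bam"):
            continue
        sample = name[:-len(".bam")]
        bed_path = os.path.join(directory, sample + "_peaks.bed")
        # Only plan a run when the expected peak file is actually present.
        if os.path.exists(bed_path):
            commands.append(
                [python_exe, qc_script, os.path.join(directory, name), bed_path])
    return commands


def run_all(commands):
    """Run each planned command in turn and collect the exit codes."""
    return [subprocess.call(command) for command in commands]
```

Separating planning from running makes it easy to print and inspect the commands before anything is launched on the cluster. The normalization step could follow the same pattern with its own file pairing.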