Gene 760 – Problem Set 2

The purpose of this problem set is to familiarize students with the analysis of ChIP-seq data. You will be working with ChIP-seq data derived from two human cell lines: K562, an erythroleukemia (i.e., blood) derived cell line, and NHEK, a normal epidermal (i.e., skin) keratinocyte line. The targets in these ChIP-seq experiments are RNA polymerase II (polII) and the histone modification H3K27ac, which marks active promoters and enhancers. By the end of this problem set, you will have learned how to use:
• Bowtie to map short reads to a genome
• BAM/SAM/bedGraph/bigWig files to store genomic data
• MACS to call peaks in ChIP-seq data
• Genome browser tools such as the Table Browser and Kent tools
• DAVID to perform gene ontology analysis on a list of genes
• Python to perform QC and normalize reads, and PySAM to interact with BAMs in Python

Students are to submit a gzipped tarball called [Your_NetID]_PS2.tar.gz containing the following files:
• [Your_NetID]_PS1_answers.txt: Answers to the questions below
• [Your_NetID]_QC.py: QC script in Python
• [Your_NetID]_norm.py: Normalization script in Python
• [Your_NetID]_translate_IDs.py: ID translation script in Python
• [Your_NetID]_bigWig
• [Your_NetID]_DAVID_polII_both
• [Your_NetID]_DAVID_polII_K562
• [Your_NetID]_DAVID_polII_NHEK
• [Your_NetID]_GREAT_shared
• [Your_NetID]_GREAT_K562
• [Your_NetID]_GREAT_NHEK

These are due to ‘DROPBOX/PS2/’ by 5 PM on Friday, Mar. 10th.

The datasets for this problem set are located in DATA/PS2/
In this directory, you will find 6 FASTQ formatted sequencing files:
• K562_H3K27ac.fastq
• K562_input.fastq
• K562_polII.fastq
• NHEK_H3K27ac.fastq
• NHEK_input.fastq
• NHEK_polII.fastq

As well as the hg19 index files for bowtie2:
• hg19 index files (hg19.1.bt2, hg19.2.bt2, hg19.3.bt2, hg19.4.bt2, hg19.rev.1.bt2, hg19.rev.2.bt2)

Remember to run and process all data in your project folder (symlinked in your home directory), as your home directory does not have sufficient storage allocated for all your intermediate files.

Remember, everything this problem set contains is computationally intensive! Do not run on the login node! We highly recommend using slurm to run the jobs in parallel and take advantage of more than one node.
1. Generate alignments for each dataset using Bowtie2.
Allow 1 mismatch in the seed sequence and use samtools to filter for reads with high mapping quality. These are human cell lines, so be sure to align your reads to the hg19 index in the DATA/PS2 directory.
ANS: For each dataset, report the commands you used.
ANS: Convert each SAM file to a BAM file using samtools while filtering for a MAPQ score >= 10. Report the commands you used and the file size of the SAM vs. BAM files in bytes. What is the bowtie2 MAPQ score and why do we filter for 10 or greater?
ANS: Sort each filtered BAM file using samtools. Report the commands you used and the file size of these sorted files in bytes versus the originals. Is there a difference? Why or why not?
ANS: Use samtools to create an index for each bam file. Report the commands you used. What is the advantage of indexing a bam file?
ANS: Use samtools to report the number of mapped reads in each .bam file, using both ‘samtools idxstats’ and ‘samtools flagstat’. Are the numbers the same? Why or why not? Which is faster, flagstat or idxstats, and why?
Note: once you’re finished with this question, keep only the sorted .bam files and .bai indices.

2. Call H3K27ac enriched regions and RNA PolII peaks.
We will be using MACS2 for calling peaks (https://github.com/taoliu/MACS). Consult the README file for information about how to use MACS2. Remember, the parameters for calling peaks will depend on whether you are calling a TF peak or a histone mark. MACS outputs called peaks in an extended BED format. MACS will also output ChIP-seq signal (_treat_pileup.bdg) and control (_control_lambda.bdg) tracks in bedGraph (.bdg) format. Each MACS run will take around 30 min (4 total).
ANS: Generate a bed file from your peak calls (.broadPeak or .narrowPeak file) using the cut command. Upload this bed file as custom tracks to the UCSC Genome Browser and save as a session to share with the TAs (include the link to that session in your answers).
ANS: Report the total number of peaks called for H3K27ac and polII in each cell line in your answer sheet.
ANS: Using bedtools, determine and report the following in your answers:
The number of MACS H3K27ac and polII peaks called for each cell type at promoters (defined as 1 kb upstream of an annotated transcription start site), exons, and intergenic/intronic regions (potential enhancers). Note: Calling a peak as being in one region should mean excluding it from any other region (i.e., promoter peaks overlapping the first exon should not also be called as exon peaks, and exon peaks should not also be called intronic/intergenic peaks).
Remember that when overlapping genes (the exons bed file), bedtools will not consider just the exons by default. Also, some peaks overlap multiple features, so be sure to report unique results only (make sure promoters + exons + inter = total_peaks).
BED files for promoters and exons can be found in DATA/PS2.
ANS: What do H3K27ac peaks at promoters mean? PolII at promoters?
ANS: What about H3K27ac at intergenic/intronic regions? PolII?
ANS: What does overlap of both H3K27ac and PolII at promoters or intergenic/intronic regions mean?

3. ChIP and MACS quality control:
It is important to run some basic checks of your results when performing this sort of genomics analysis, both to make sure everything is working and to get an idea of the data you’re working with. Write a python script that counts and reports the percentage of ChIP-Seq reads that fall within the peaks called by MACS2.
Your script should take the following inputs from the command line: [Input bam file] [Input bed file]
To count the total reads in a region of a BAM file, you should use the ‘pysam’ module installed on Farnam and accessed through the module system. Documentation can be found here: http://pysam.readthedocs.org/en/latest/
It is fair to assume that MACS2 peaks will not be close enough together to worry about double counting a significant fraction of reads, though for some samples you may get over 100% because of double-counting.
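As a starting point for the QC script in section 3, here is a minimal sketch rather than a reference solution. It assumes the BAM file is sorted and indexed (the total mapped read count is read from the index), that the BED file is whitespace-delimited with the coordinates in the first three columns, and that every chromosome in the BED file is present in the BAM header. The helper names parse_bed and percent_in_peaks are illustrative and not part of the assignment.

```python
import sys


def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping header/comment lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        yield fields[0], int(fields[1]), int(fields[2])


def percent_in_peaks(count_reads, total_mapped, peaks):
    """Return 100 * (reads counted over all peaks) / total mapped reads.

    count_reads is any callable taking (chrom, start, end) and returning a count.
    Reads spanning two peaks are counted twice, as the assignment allows.
    """
    in_peaks = sum(count_reads(chrom, start, end) for chrom, start, end in peaks)
    return 100.0 * in_peaks / total_mapped


def main(bam_path, bed_path):
    import pysam  # imported here so the helpers above can be used without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam, open(bed_path) as bed:
        pct = percent_in_peaks(
            lambda c, s, e: bam.count(contig=c, start=s, stop=e),
            bam.mapped,  # total mapped reads, taken from the .bai index
            parse_bed(bed),
        )
    print(f"{pct:.2f}% of mapped reads fall within peaks")


# Only run when called as: python script.py [Input bam file] [Input bed file]
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Passing the counting function in as an argument keeps the percentage calculation separate from pysam, which makes it easy to check the arithmetic on a small hand-made peak list before running on the full data.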
ANS: Save your script as [Your_NetID]_QC.py
ANS: Report in your answers the percentage of reads that fall within peaks for each factor in each cell type. Write a sentence interpreting the result.

4. Normalize ChIP data
Write a python script that will normalize the signal files output by MACS relative to the total number of million mapped reads in the experiment (reads per million); i.e., if your ChIP has 10 million mapped reads after filtering, for a nucleotide covered by 10 reads, the normalized value should be 1.
IMPORTANT: We only want to work with chr1 data moving forward, so use the grep command to extract all chr1 entries from the ChIP and input bedGraph files for both tissues and H3K27ac and PolII.
Your python script should take the following inputs from the command line: [input chr1 bedGraph file] [Total mapped reads]
Your python script should output a normalized bedGraph file named [inputBedGraphName]_Norm.bdg. Your script should skip over any positions that contain fewer than 0.5 reads after normalization.
IMPORTANT: You will need to maintain proper structure of the bedGraph file, as not all lines contain read coverage data. Read https://genome.ucsc.edu/goldenpath/help/bedgraph.html for more information.
ANS: Save and name this script [Your_NetID]_norm.py
ANS: Compress (gzip) and upload the normalized bedGraph tracks for both factors (H3K27ac & RNA PolII) and input for both tissue types to your previous UCSC browser session. Save the session. Don’t forget to add the appropriate bedGraph header.
ANS: Why write such a script (i.e., why normalize signal files)?

5. Convert the bedGraph files to bigWig format using bedGraphToBigWig
BedGraph files are large, so they are frequently converted to a binary format, bigWig, which is stored locally but can be visualized on the Genome Browser. Instructions are found at http://genome.ucsc.edu/goldenPath/help/bigWig.html; complete the first 5 steps.
To convert, use the bedGraphToBigWig executable in TOOLS/bedGraphToBigWig and chromosome size annotations in ANNOTATION/chromInfo.txt. Remember to sort your bedGraph files first! (using sort -k1,1 -k2,2n)
ANS: Save the output as [Your_NetID]_bigWig
ANS: Record the file size for the bedGraph vs bigWig file for each cell type and factor in your answer file (treatment, not input DNA).
ANS: Why is bigWig format useful? Are there drawbacks?

6. Compare H3K27ac and polII profiles between cell types.
ANS: Use BedTools to determine the number of (1) promoter and (2) intergenic/intronic H3K27ac and polII peaks (MACS calls; treat H3K27ac and PolII separately) that are:
a) Shared between K562 and NHEK cells and
b) Unique to each cell type.
For each promoter marked by H3K27ac or polII in each cell type, identify the NCBI Entrez Gene ID for that gene. To do this, you should create a simple python script that uses a dictionary to translate RefSeq IDs in a bed file to Entrez IDs provided in a translation file.
We have created a file with RefSeq IDs and their associated Entrez Gene ID in DATA/PS2/Prom_with_Entrez.txt. Note that there will not necessarily be an Entrez ID for every RefSeq ID; if there is none, ignore that promoter.
Your script should take the following input from the command line: [Input bed file with RefSeq IDs] [Input translation file]
The script should output a text file with a list of Entrez IDs corresponding to the RefSeq IDs of the promoters in the input file, 1 per line. This will be necessary for the next problem.
ANS: Save your python script as [Your_NetID]_translate_IDs.py

7. Use DAVID to determine Gene Ontology enrichments for promoters marked by H3K27ac and polII in each cell type.
DAVID is a very useful set of tools for annotating functional genomics data. Go to http://david.abcc.ncifcrf.gov/. DAVID accepts a list of gene names (in this case, it should be the Entrez Gene IDs you obtained above) identified in gene expression or ChIP-seq analyses and can be used to determine functional enrichments in those gene sets compared to all genes. Read the DAVID FAQ page to understand how DAVID works: http://david.abcc.ncifcrf.gov/content.jsp?file=FAQs.html
ANS: Use the DAVID Gene Functional Annotation Tool to determine GO Biological Process (in the GOTERM_BP_FAT output table) enrichments for the following comparisons (use MACS peak calls). Save the output of each analysis as a flat text file with the indicated name.
a) Promoters marked by polII in both K562 and NHEK ([Your_NetID]_DAVID_polII_both)
b) Promoters marked by polII only in K562 ([Your_NetID]_DAVID_polII_K562)
c) Promoters marked by polII only in NHEK ([Your_NetID]_DAVID_polII_NHEK)
ANS: Interpret your results in a, b, and c with respect to the two cell types.

8. Use GREAT to associate putative cell-type specific enhancers with target genes.
GREAT (great.stanford.edu) is a tool for functional enrichment analyses of distant-acting regulatory elements. It can be used to identify potential target genes for enhancers and to determine whether that target gene set is enriched for particular biological functions.
There are two ways to load data into GREAT: directly from the Table Browser, or uploading BED files. In task 6 above, you generated BED files for H3K27ac peak calls in K562 and NHEK cells.
ANS: Use GREAT to determine functional enrichments for:
a) Intergenic/intronic H3K27ac sites shared between both cell types;
b) Intergenic/intronic H3K27ac sites unique to each cell type.
ANS: For each comparison, export the GO Biological Process tables as .tsv files named [Your_NetID]_GREAT_shared, [Your_NetID]_GREAT_K562, [Your_NetID]_GREAT_NHEK.
ANS: Interpret your results in a, b with respect to the two cell types.
ANS: Would you expect there to be greater divergence in gene ontology/functional enrichments between cells when looking at promoters or enhancers? What do you observe here?

Optional programming challenges:
For those of you with some programming experience looking to practice your python skills, here are optional challenges. These will NOT be worth more than a point of extra credit (negligible to your grade), and you should not complete them if you haven’t finished the problem set already. If you decide to work on these, please use a separate file from your main script when handing in.

Writing a wrapper
It is common for programmers to write wrappers around their scripts that are able to take multiple files or user interaction and pipeline the scripts’ functions accordingly.
Medium: Write a wrapper for your QC and normalization scripts that will search a directory for .bam files, and automatically run QC and normalization on every MACS output file for those bam files (assuming you use consistent naming conventions).
Hint: you may find ‘import subprocess’ to be useful, as it will allow you to run system commands (such as ls) and parse their output.
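
One possible shape for the wrapper is sketched below; it covers only the QC step, and normalization would follow the same pattern. The naming convention is an assumption made for illustration: each BAM is named [sample].sorted.bam and its MACS2 peak file sits in the same directory as [sample]_peaks.narrowPeak or [sample]_peaks.broadPeak. The script name NetID_QC.py and the function names are placeholders.

```python
import subprocess
import sys
from pathlib import Path


def build_qc_commands(directory, qc_script="NetID_QC.py"):
    """Pair each sorted BAM with its peak file and return one command per pair.

    Assumes the hypothetical naming scheme <sample>.sorted.bam and
    <sample>_peaks.narrowPeak / <sample>_peaks.broadPeak in the same directory.
    BAMs without a matching peak file (e.g. input controls) are skipped.
    """
    commands = []
    for bam in sorted(Path(directory).glob("*.sorted.bam")):
        sample = bam.name[: -len(".sorted.bam")]
        for suffix in ("_peaks.narrowPeak", "_peaks.broadPeak"):
            peaks = bam.with_name(sample + suffix)
            if peaks.exists():
                commands.append([sys.executable, qc_script, str(bam), str(peaks)])
    return commands


def run_all(directory):
    """Run the QC script on every BAM/peak pair and print what it reported."""
    for cmd in build_qc_commands(directory):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(" ".join(cmd[1:]), "->", result.stdout.strip() or result.stderr.strip())


# Only run when called as: python wrapper.py [directory]
if __name__ == "__main__" and len(sys.argv) == 2:
    run_all(sys.argv[1])
```

Building the command lists separately from running them makes it possible to print and inspect the planned jobs first, which is useful before submitting anything through slurm.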