Gene 760 – Problem Set 2
The purpose of this problem set is to familiarize students with the analysis of ChIP-seq data.
You will be working with ChIP-seq data derived from two human cell lines: K562, an
erythroleukemia (i.e., blood) derived cell line, and NHEK, a normal epidermal (i.e., skin)
keratinocyte line. The targets in these ChIP-seq experiments are RNA polymerase II (polII)
and the histone modification H3K27ac, which marks active promoters and enhancers. By
the end of this problem set, you will have learned how to use:
• Bowtie to map short reads to a genome
• BAM/SAM/Wig/bigWig files to store genomic data
• MACS to call peaks in ChIP-seq data
• Genome browser tools such as the Table Browser and Kent tools
• DAVID to perform gene ontology analysis on a list of genes
• Python to perform QC and normalize reads, PySAM to interact with BAMs in Python
Students are to submit a gzipped tarball called [Your_NetID]_PS2.tar.gz containing the
following files:
• [Your_NetID]_PS2_answers.txt: Answers to the questions below
• [Your_NetID]_QC.py: QC script in Python
• [Your_NetID]_norm.py: Normalization script in Python
• [Your_NetID]_translate_IDs.py: ID translation script in Python
• [Your_NetID]_bigWig
• [Your_NetID]_DAVID_polII_both
• [Your_NetID]_DAVID_polII_K562
• [Your_NetID]_DAVID_polII_NHEK
• [Your_NetID]_GREAT_shared
• [Your_NetID]_GREAT_K562
• [Your_NetID]_GREAT_NHEK
These are due to ‘DROPBOX/PS2/’ by 9 AM on Monday, March 7.
The datasets for this problem set are located in DATA/PS2/
In this directory, you will find 6 FASTQ-formatted sequencing files:
• K562_H3K27ac.fastq
• K562_input.fastq
• K562_polII.fastq
• NHEK_H3K27ac.fastq
• NHEK_input.fastq
• NHEK_polII.fastq
As well as the hg19 index files for bowtie:
• hg19 index files (hg19.1.ebwt, hg19.2.ebwt, hg19.3.ebwt, hg19.4.ebwt,
hg19.rev.1.ebwt, hg19.rev.2.ebwt)
1. Generate alignments for each dataset using bowtie.
Allow 2 mismatches per seed, and only report uniquely mapping reads. Output your results in
SAM format.
These are human cell lines, so be sure to align your reads to hg19. Remember: bowtie and
everything else this problem set contains are computationally intensive! Do not run on the
login node! *If you are feeling crafty, you could write a simple queue script and set it up to
run jobs in parallel and take advantage of more than one node.
ANS: For each dataset, report the commands you used.
ANS: Convert each SAM file to a BAM file using samtools. Report the commands you used
and the file size of the SAM vs. BAM files in bytes.
ANS: Sort each BAM file using samtools. Report the commands you used and the file size of
these sorted files in bytes versus the originals. Is there a difference? Why or why not?
ANS: Use samtools to create an index for each BAM file. Report the commands you used.
What is the advantage of indexing a BAM file?
ANS: Use samtools to report the number of mapped reads in each .bam file, using both
‘samtools idxstats’ and ‘samtools flagstat’. Are the numbers the same? Why or why not?
Which is faster, flagstat or idxstats, and why?
Note: once you’re finished with this question, keep only the sorted .bam files and .bai indices.
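One way to keep the six alignment runs consistent is to generate the command lines from a single template. The sketch below only builds and prints the commands; it does not run anything. It assumes bowtie 1 (the version that uses .ebwt indexes), where -n sets the maximum number of seed mismatches, -m 1 suppresses reads with more than one reportable alignment, and -S requests SAM output. The function name, the index prefix DATA/PS2/hg19, and the output file names are illustrative assumptions; check the options against the bowtie manual for your installed version.

```python
# Sample names taken from the FASTQ files listed for this problem set.
DATASETS = [
    "K562_H3K27ac", "K562_input", "K562_polII",
    "NHEK_H3K27ac", "NHEK_input", "NHEK_polII",
]


def bowtie_command(sample, index_prefix="DATA/PS2/hg19", data_dir="DATA/PS2"):
    """Return the argument list for one bowtie 1 run (nothing is executed).

    index_prefix and the <sample>.sam output name are assumptions made
    for illustration, not paths confirmed by the problem set.
    """
    return [
        "bowtie",
        "-n", "2",   # at most 2 mismatches in the seed
        "-m", "1",   # drop reads with more than 1 reportable alignment
        "-S",        # write SAM output
        index_prefix,
        "{0}/{1}.fastq".format(data_dir, sample),
        "{0}.sam".format(sample),
    ]


if __name__ == "__main__":
    for sample in DATASETS:
        print(" ".join(bowtie_command(sample)))
```

The printed lines can be pasted into a queue script so that the jobs run on compute nodes rather than on the login node.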
2. Call H3K27ac enriched regions and RNA Pol II peaks.
We will be using MACS for calling peaks.
• (http://liulab.dfci.harvard.edu/MACS/README.html) Consult the README file for
information about how to use MACS. The parameters for calling peaks will depend
on whether you are calling a TF peak or a histone mark.
MACS outputs called peaks in BED format. MACS will also output ChIP-seq signal tracks in
WIG (wiggle) format as well as input signal tracks. WIG files are large, so they are
frequently converted to a binary format, bigWig, which is stored locally but can be
visualized on the Genome Browser.
Each MACS run will take around an hour (4 total).
ANS: Upload your peak calls (in BED format) as custom tracks to the UCSC Genome
Browser and save as a session to share with the TAs (include the link to that session in your
answers).
ANS: Report the total number of peaks called for H3K27ac and polII in each cell line in your
answer sheet.
ANS: Using bedtools, determine and report the following in your answers:
The number of MACS H3K27ac and polII peaks called for each cell type at promoters
(defined as 1 kb upstream of an annotated transcription start site), exons, and
intergenic/intronic regions (potential enhancers). Note: Calling peaks as in one region
should mean excluding them from any other region (i.e., promoter peaks overlapping the first
exon should not also be called as exon peaks, and exon peaks should not also be called
intronic/intergenic peaks).
Remember that when overlapping genes (the exons bed file), bedtools will not consider just
the exons by default. Also, some peaks overlap multiple features, so be sure to report unique
results only (make sure promoters + exons + inter = total_peaks).
BED files for promoters and exons can be found in DATA/PS2
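The exclusivity rule above amounts to a priority order: promoter first, then exon, then intergenic/intronic. The sketch below illustrates that logic in plain Python on a few made-up intervals. It is not a replacement for bedtools (the nested loops are far too slow for real peak sets), and the function names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open BED intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_peaks(peaks, promoters, exons):
    """Assign each peak to exactly one class: promoter > exon > inter."""
    counts = {"promoter": 0, "exon": 0, "inter": 0}
    for peak in peaks:
        if any(overlaps(peak, p) for p in promoters):
            counts["promoter"] += 1
        elif any(overlaps(peak, e) for e in exons):
            counts["exon"] += 1
        else:
            counts["inter"] += 1
    return counts


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    promoters = [("chr1", 1000, 2000)]
    exons = [("chr1", 2000, 2500), ("chr1", 5000, 5200)]
    peaks = [
        ("chr1", 1900, 2100),  # touches promoter and first exon: promoter
        ("chr1", 5100, 5300),  # exon only
        ("chr1", 8000, 8200),  # neither: intergenic/intronic
    ]
    counts = classify_peaks(peaks, promoters, exons)
    print(counts)
    # The sanity check from the problem statement.
    assert sum(counts.values()) == len(peaks)
```

Because each peak passes through a single if/elif/else chain, the three counts always sum to the total number of peaks, which is the check the problem asks for.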
ANS: What do H3K27ac peaks at promoters mean? PolII at promoters?
ANS: What about H3K27ac at intergenic/intronic regions? PolII?
ANS: What does overlap of both H3K27ac and PolII at promoters or intergenic/intronic
regions mean?
IMPORTANT: MACS outputs a directory with .wig.gz files for each chromosome. Once you are
done, copy the chr1.wig.gz files output by MACS out of this directory and extract them, then
delete the rest of the chr wig files, as we will only be working with chr1 wigs moving forward.
3. ChIP and MACS quality control:
It is important to run some basic checks of your results when performing this sort of
genomics analysis, both to make sure everything is working and to get an idea of the data
you’re working with. Write a Python script that counts and reports the percentage of ChIP-Seq reads that fall within the peaks called by MACS.
Your script should take the following inputs: [Input bam file] [Input bed file]
To count the total reads in a region of a BAM file, you should use the ‘pysam’ module.
Documentation can be found here: http://pysam.readthedocs.org/en/latest/ Louise has an
older version of pySAM, so pysam.AlignmentFile in the documentation is equivalent to
pysam.Samfile in our installation.
It is fair to assume that MACS peaks will not be close enough together to worry about
double counting a significant fraction of reads, though for some samples you may get over
100% because of double-counting.
IMPORTANT: pysam is not installed in python by default. To run your python file, use
‘py27 file.py’ rather than ‘python file.py’. py27 is an alias in your .bashrc that links to a
version of python on the cluster with the pysam module installed.
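As a rough outline of how such a QC script could be organized, the sketch below separates BED parsing from the counting step. It assumes a sorted, indexed BAM file, so that the count() method and the mapped attribute of pysam.Samfile are available, and it assumes the first three BED columns are chrom, start, and end. The function names are hypothetical. pysam is imported inside main() only so the helper functions can be tried without it.

```python
import sys


def read_bed(path):
    """Yield (chrom, start, end) from a BED file, skipping header lines."""
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue
            fields = line.split()
            yield fields[0], int(fields[1]), int(fields[2])


def percent_reads_in_peaks(bam, peaks):
    """Percentage of mapped reads that overlap any peak.

    `bam` only needs a count(chrom, start, end) method and a `mapped`
    attribute, both of which an indexed pysam.Samfile provides. Reads
    spanning two peaks are counted twice, as the problem text allows.
    """
    in_peaks = sum(bam.count(chrom, start, end) for chrom, start, end in peaks)
    return 100.0 * in_peaks / bam.mapped


def main(argv):
    import pysam  # imported here so the helpers above work without pysam
    bam = pysam.Samfile(argv[1], "rb")  # pysam.AlignmentFile in newer releases
    percent = percent_reads_in_peaks(bam, read_bed(argv[2]))
    bam.close()
    print("%.2f%% of mapped reads fall within peaks" % percent)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Dividing by the number of mapped reads rather than all reads in the file is a design choice made here; state which denominator you used when reporting the percentage.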
ANS: Save your script as [Your_NetID]_QC.py
ANS: Report in your answers the percentage of reads that fall within peaks for each factor
in each cell type. Write a sentence interpreting the result.
4. Normalize ChIP data
Write a Python script that will normalize the signal files output by MACS relative to the
total number of million mapped reads in the experiment (reads per million), i.e., if your ChIP
has 10 million mapped reads after filtering, for a nucleotide covered by 10 reads, the
normalized value should be 1.
Your Python script should take the following inputs from the command line:
[input wig file] [Total mapped reads]
Your Python script should output a normalized wiggle file named
[InputWigName]_Norm.wig. Your script should skip over any positions that contain fewer
than 0.5 reads after normalization.
IMPORTANT: You will need to maintain proper structure of the wig file, as not all lines
contain read coverage data. Read http://genome.ucsc.edu/goldenpath/help/wiggle.html
for more information.
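The normalization is a single scaling factor: normalized = raw × 1,000,000 / total mapped reads. The sketch below applies it line by line while leaving track and declaration lines untouched. It assumes the wig file uses variableStep data lines of the form "position value"; that assumption should be checked against the actual MACS output. A fixedStep file would need different handling, because positions cannot be dropped without writing a new declaration line. The function names are hypothetical.

```python
def normalize_wig_lines(lines, total_mapped_reads, min_value=0.5):
    """Yield reads-per-million wiggle lines, assuming variableStep data.

    Track, browser, comment, and variableStep lines pass through
    unchanged. Data lines are rescaled and dropped when the normalized
    value is below `min_value`.
    """
    scale = 1e6 / float(total_mapped_reads)
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "fixedStep":
            raise ValueError("fixedStep data is not handled by this sketch")
        if fields[0] in ("track", "browser", "variableStep") or line.startswith("#"):
            yield line
            continue
        normalized = float(fields[1]) * scale
        if normalized >= min_value:
            yield "%s\t%.4f" % (fields[0], normalized)


def output_name(wig_path):
    """One reading of [InputWigName]_Norm.wig: chr1.wig -> chr1_Norm.wig."""
    base = wig_path[:-4] if wig_path.endswith(".wig") else wig_path
    return base + "_Norm.wig"


if __name__ == "__main__":
    example = [
        "track type=wiggle_0 name=example",
        "variableStep chrom=chr1 span=10",
        "1\t10",   # 10 reads at 10 million mapped reads -> 1.0
        "11\t4",   # 0.4 after normalization, so it is skipped
    ]
    for out_line in normalize_wig_lines(example, 10000000):
        print(out_line)
```

With 10 million mapped reads, the position covered by 10 reads is written with a value of 1, matching the example in the problem statement, and the position covered by 4 reads is dropped.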
ANS: Save and name this script [Your_NetID]_norm.py
ANS: Add to your previous UCSC browser session the normalized chr1 wiggle tracks for
both factors and input, for both tissue types. Save the session.
ANS: Why write such a script, i.e., why normalize signal files?
5. Convert the Chr1 wig file to bigWig format using wigToBigWig
Instructions are found at: http://genome.ucsc.edu/goldenPath/help/bigWig.html,
complete the first 5 steps.
You should use ‘wget’ on louise for Step 3 to download the executable file.
You do not need to complete Step 4; chromosome size annotations are located in
ANNOTATION/hg19.chrom.sizes
ANS: Save the output as [Your_NetID]_bigWig
ANS: Record the file size for the Chr1 bigWig vs the Chr1 wig file for each cell type and
factor in your answer file (treatment, not input DNA).
ANS: Why is bigWig format useful? Are there drawbacks?
6. Compare H3K27ac and polII profiles between cell types.
ANS: Use BedTools to determine the number of (1) promoter and (2) intergenic/intronic
H3K27ac and polII peaks (MACS calls; treat H3K27ac and PolII separately) that are:
a) Shared between K562 and NHEK cells and
b) Unique to each cell type.
For each promoter marked by H3K27ac or polII in each cell type, identify the NCBI Entrez
Gene ID for that gene. To do this, you should create a simple Python script that uses a
dictionary to translate RefSeq IDs in a bed file to Entrez IDs provided in a translation file.
We have created a file with RefSeq IDs and their associated Entrez Gene ID in
DATA/PS2/Prom_with_Entrez.txt. Note that there will not necessarily be an Entrez ID for
every RefSeq ID; if not, ignore that promoter.
Your script should take the following input from the command line: [Input bed file with
RefSeq IDs] [Input translation file]
The script should output a text file with a list of Entrez IDs corresponding to the RefSeq IDs
of the promoters in the input file, 1 per line. This will be necessary for the next problem.
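The dictionary lookup described above might be structured as in the sketch below. The layout of Prom_with_Entrez.txt is not given here, so the code assumes, purely for illustration, a tab-separated file with the RefSeq ID in the first column and the Entrez ID in the second, and a BED file whose fourth column holds the RefSeq ID. Inspect the real files and adjust the column indexes accordingly. The function names are hypothetical.

```python
def load_translation(lines):
    """Build {RefSeq ID: Entrez ID}, assuming 'RefSeq<TAB>Entrez' lines.

    Lines with a missing or empty Entrez column are left out, so those
    promoters are ignored during translation.
    """
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip():
            table[fields[0]] = fields[1].strip()
    return table


def translate_bed(bed_lines, table, name_column=3):
    """Return Entrez IDs for BED records whose name is a known RefSeq ID."""
    entrez_ids = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) > name_column and fields[name_column] in table:
            entrez_ids.append(table[fields[name_column]])
    return entrez_ids


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    table = load_translation(["NM_000001\t111\n", "NM_000002\t\n"])
    bed = ["chr1\t100\t1100\tNM_000001\n", "chr1\t5000\t6000\tNM_000002\n"]
    for entrez_id in translate_bed(bed, table):
        print(entrez_id)  # one Entrez ID per line
```

Building the dictionary once and then looking up each BED record keeps the translation linear in the size of the two files.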
ANS: Save your Python script as [Your_NetID]_translate_IDs.py
7. Use DAVID to determine Gene Ontology enrichments for promoters marked by
H3K27ac and polII in each cell type.
DAVID is a very useful set of tools for annotating functional genomics data. Go to
http://david.abcc.ncifcrf.gov/. DAVID accepts a list of gene names (in this case, it should be
the Entrez Gene IDs you obtained above) identified in gene expression or ChIP-seq analyses
and can be used to determine functional enrichments in those gene sets compared to all
genes. Read the DAVID FAQ page to understand how DAVID works:
http://david.abcc.ncifcrf.gov/content.jsp?file=FAQs.html
ANS: Use the DAVID Gene Functional Annotation Tool to determine GO Biological Process
(in the GOTERM_BP_FAT output table) enrichments for the following comparisons (use
MACS peak calls). Save the output of each analysis as a flat text file with the indicated name.
a) Promoters marked by polII in both K562 and NHEK ([Your_NetID]_DAVID_polII_both)
b) Promoters marked by polII only in K562 ([Your_NetID]_DAVID_polII_K562)
c) Promoters marked by polII only in NHEK ([Your_NetID]_DAVID_polII_NHEK)
ANS: Interpret your results in a, b, and c with respect to the two cell types.
8. Use GREAT to associate putative cell-type specific enhancers with target genes.
GREAT (great.stanford.edu) is a tool for functional enrichment analyses of distant-acting
regulatory elements. It can be used to identify potential target genes for enhancers and to
determine whether that target gene set is enriched for particular biological functions.
There are two ways to load data into GREAT: directly from the Table Browser, or uploading
BED files. In task 6 above, you generated BED files for H3K27ac peak calls in K562 and
NHEK cells.
ANS: Use GREAT to determine functional enrichments for:
a) Intergenic/intronic H3K27ac sites shared between both cell types;
b) Intergenic/intronic H3K27ac sites unique to each cell type.
ANS: For each comparison, export the GO Biological Process tables as .tsv files named
[Your_NetID]_GREAT_shared, [Your_NetID]_GREAT_K562, [Your_NetID]_GREAT_NHEK.
ANS: Interpret your results in a, b with respect to the two cell types.
ANS: Would you expect there to be greater divergence in gene ontology/functional
enrichments between cells when looking at promoters or enhancers? What do you observe
here?
Optional programming challenges:
For those of you with some programming experience looking to practice your Python skills,
here are optional challenges. These will NOT be worth more than a point of extra credit
(negligible to your grade), and you should not complete them if you haven’t finished
the problem set already. If you decide to work on these, please use a separate file from
your main script when handing in.
Writing a wrapper
It is common for programmers to write wrappers around their scripts that are able to take
multiple files or user interaction and pipeline the scripts’ functions accordingly.
Medium: Write a wrapper for your QC and normalization scripts that will search a
directory for .bam files, and automatically run QC and normalization on every MACS output
file for those bam files (assuming you use consistent naming conventions).
Hint: you may find ‘import subprocess’ to be useful, as it will allow you to run system
commands (such as ls) and parse their output.
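A possible starting point for the wrapper is sketched below. It covers only the QC half and assumes a hypothetical naming convention in which sample.bam is paired with sample_peaks.bed in the same directory; normalization jobs could be planned in the same way. It uses glob instead of parsing ls output, and subprocess.check_call to run each job. Shell aliases such as py27 are generally not visible to subprocess, so the interpreter argument should be the real path to the pysam-enabled Python.

```python
import glob
import os
import subprocess
import sys


def plan_qc_jobs(directory, qc_script="QC.py", interpreter=sys.executable):
    """Return one QC command per .bam file that has a matching peak file.

    The sample.bam / sample_peaks.bed pairing is an assumed convention,
    not something the problem set prescribes.
    """
    jobs = []
    for bam in sorted(glob.glob(os.path.join(directory, "*.bam"))):
        bed = bam[:-len(".bam")] + "_peaks.bed"
        if os.path.exists(bed):
            jobs.append([interpreter, qc_script, bam, bed])
    return jobs


def run_jobs(jobs):
    """Run each command in turn; check_call raises if a job fails."""
    for job in jobs:
        subprocess.check_call(job)


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for job in plan_qc_jobs(target):
        print(" ".join(job))  # dry run: print the commands instead of running them
```

Separating planning from execution makes it easy to inspect the command list before anything is submitted to the cluster.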