Gene 760 – Problem Set 2
The purpose of this problem set is to familiarize students with the analysis of ChIP-seq data.
You will be working with ChIP-seq data derived from two human cell lines: K562, an
erythroleukemia (i.e., blood) derived cell line, and NHEK, a normal epidermal (i.e., skin)
keratinocyte line. The targets in these ChIP-seq experiments are RNA polymerase II (polII)
and the histone modification H3K27ac, which marks active promoters and enhancers. By
the end of this problem set, you will have learned how to use:
• Bowtie to map short reads to a genome
• BAM/SAM/bedGraph/bigwig files to store genomic data
• MACS to call peaks in ChIP-Seq data
• Genome browser tools such as table browser and kent tools
• DAVID to perform gene ontology analysis on a list of genes
• Python to perform QC and normalize reads, PySAM to interact with BAMs in python
Students are to submit a gzipped tarball called [Your_NetID]_PS2.tar.gz containing the
following files:
• [Your_NetID]_PS1_answers.txt: Answers to the questions below
• [Your_NetID]_QC.py: QC script in Python
• [Your_NetID]_norm.py: Normalization script in Python
• [Your_NetID]_translate_IDs.py: ID Translation script in Python
• [Your_NetID]_bigWig
• [Your_NetID]_DAVID_polII_both
• [Your_NetID]_DAVID_polII_K562
• [Your_NetID]_DAVID_polII_NHEK
• [Your_NetID]_GREAT_shared
• [Your_NetID]_GREAT_K562
• [Your_NetID]_GREAT_NHEK
These are due to ‘DROPBOX/PS2/’ by 5 PM on Friday, Mar. 10th.
The datasets for this problem set are located in DATA/PS2/
In this directory, you will find 6 FASTQ formatted sequencing files:
• K562_H3K27ac.fastq
• K562_input.fastq
• K562_polII.fastq
• NHEK_H3K27ac.fastq
• NHEK_input.fastq
• NHEK_polII.fastq
As well as the hg19 index files for bowtie2
• hg19 index files (hg19.1.bt2, hg19.2.bt2, hg19.3.bt2, hg19.4.bt2, hg19.rev.1.bt2,
hg19.rev.2.bt2)
Remember to run and process all data in your project folder (symlinked in your home
directory) as your home directory does not have sufficient storage allocated for all your
intermediate files.
Remember, everything this problem set contains is computationally intensive! Do not run
on the login node! We highly recommend using slurm to run the jobs in parallel and take
advantage of more than one node.
1. Generate alignments for each dataset using Bowtie2.
Allow 1 mismatch in the seed sequence and use samtools to filter for reads with high mapping
quality. These are human cell lines, so be sure to align your reads to the hg19 index in the
DATA/PS2 directory.
ANS: For each dataset, report the commands you used.
ANS: Convert each SAM file to a BAM file using samtools while filtering for a MAPQ score
>= 10. Report the commands you used and the file size of the SAM vs. BAM files in bytes.
What is the bowtie2 MAPQ score and why do we filter for 10 or greater?
ANS: Sort each filtered BAM file using samtools. Report the commands you used and the
file size of these sorted files in bytes versus the originals. Is there a difference? Why or why
not?
ANS: Use samtools to create an index for each bam file. Report the commands you used.
What is the advantage of indexing a bam file?
ANS: Use samtools to report the number of mapped reads in each .bam file, using both
‘samtools idxstats’ and ‘samtools flagstat’. Are the numbers the same, why or why not?
Which is faster, flagstat or idxstats, and why?
Note: once you’re finished with this question, keep only the sorted .bam files and .bai indices.
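The sequence of steps in this question can be laid out as a small Python sketch that only builds the command lines and prints them, so they can be pasted into a slurm script. It is one possible arrangement, not the expected answer. It assumes single-end reads (one FASTQ per sample), an index prefix of DATA/PS2/hg19, and output file names such as K562_polII.q10.sorted.bam, all of which are illustrative choices rather than requirements of the problem set.

```python
from pathlib import Path


def alignment_commands(fastq, index_prefix, min_mapq=10, threads=4):
    """Return the per-sample command lines as lists of arguments.

    Output names are derived from the FASTQ name and are hypothetical.
    """
    stem = Path(fastq).stem  # e.g. "K562_polII"
    sam = f"{stem}.sam"
    bam = f"{stem}.q{min_mapq}.bam"
    sorted_bam = f"{stem}.q{min_mapq}.sorted.bam"
    return [
        # -N 1 allows one mismatch in the seed; -U marks unpaired reads (assumed).
        ["bowtie2", "-N", "1", "-p", str(threads),
         "-x", index_prefix, "-U", fastq, "-S", sam],
        # -b writes BAM; -q drops alignments whose MAPQ is below the cutoff.
        ["samtools", "view", "-b", "-q", str(min_mapq), "-o", bam, sam],
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
        ["samtools", "idxstats", sorted_bam],
        ["samtools", "flagstat", sorted_bam],
    ]


for cmd in alignment_commands("DATA/PS2/K562_polII.fastq", "DATA/PS2/hg19"):
    print(" ".join(cmd))
```

Building the commands as lists keeps them easy to hand to subprocess or to write into one job script per sample.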
2. Call H3K27ac enriched regions and RNA Pol II peaks.
We will be using MACS2 for calling peaks (https://github.com/taoliu/MACS). Consult the
README file for information about how to use MACS2. Remember, the parameters for
calling peaks will depend on whether you are calling a TF peak or a histone mark. MACS outputs
called peaks in an extended BED format. MACS will also output ChIP-seq signal
(_treat_pileup.bdg) and control (_control_lambda.bdg) tracks in bedGraph (.bdg) format.
Each MACS run will take around 30 min (4 total).
ANS: Generate a bed file from your peak calls (.broadpeak or .narrowpeak file) using the
cut command. Upload this bed file as custom tracks to the UCSC Genome Browser and save
as a session to share with the TAs (include the link to that session in your answers)
ANS: Report the total number of peaks called for H3K27ac and polII in each cell line in your
answer sheet
ANS: Using bedtools, determine and report the following in your answers:
The number of MACS H3K27ac and polII peaks called for each cell type at promoters
(defined as 1 kb upstream of an annotated transcription start site), exons, and
intergenic/intronic regions (potential enhancers). Note: Calling peaks as in one region
should mean excluding them from any other region (i.e. promoter peaks overlapping the first
exon should not also be called as exon peaks, exon peaks should not also be called
intronic/intergenic peaks).
Remember that when overlapping genes (the exons bed file), bedtools will not consider just
the exons by default. Also, some peaks overlap multiple features, so be sure to report unique
results only (make sure promoters + exons + inter = total_peaks)
BED files for promoters and exons can be found in DATA/PS2
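The exclusivity rule above (promoter first, then exon, then everything else) can be illustrated with a few lines of plain Python on toy intervals. This is only a sketch of the counting logic, not a replacement for bedtools: it scans every feature for every peak, which is far too slow for genome-wide data, and the coordinates below are made up.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_peaks(peaks, promoters, exons):
    """Assign each peak to exactly one class, in priority order."""
    counts = {"promoter": 0, "exon": 0, "intergenic_intronic": 0}
    for peak in peaks:
        if any(overlaps(peak, p) for p in promoters):
            counts["promoter"] += 1
        elif any(overlaps(peak, e) for e in exons):
            counts["exon"] += 1  # reached only if no promoter overlap
        else:
            counts["intergenic_intronic"] += 1
    return counts


# Toy data: the second peak touches both the promoter and the exon.
peaks = [("chr1", 100, 200), ("chr1", 950, 1100),
         ("chr1", 1500, 1600), ("chr1", 5000, 5100)]
promoters = [("chr1", 0, 1000)]
exons = [("chr1", 1000, 1700)]

counts = classify_peaks(peaks, promoters, exons)
print(counts)
# Each peak is counted once, so the classes add up to the total.
assert sum(counts.values()) == len(peaks)
```

With bedtools, a comparable effect can be had by chaining intersect calls that use -u (report a peak once if it overlaps anything) and -v (report peaks with no overlap), passing the leftover peaks from each step to the next.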
ANS: What do H3K27ac peaks at promoters mean? PolII at promoters?
ANS: What about H3K27ac at intergenic/intronic regions? PolII?
ANS: What does overlap of both H3K27ac and PolII at promoters or intergenic/intronic
regions mean?
3. ChIP and MACS quality control:
It is important to run some basic checks of your results when performing this sort of
genomics analysis – both to make sure everything is working, and to get an idea of the data
you’re working with. Write a python script that counts and reports the percentage of
ChIP-Seq reads that fall within the peaks called by MACS2.
Your script should take the following inputs from the command line: [Input bam file]
[Input bed file]
To count the total reads in a region of a BAM file, you should use the ‘pysam’ module
installed on Farnam and accessed through the module system. Documentation can be found
here: http://pysam.readthedocs.org/en/latest/
It is fair to assume that MACS2 peaks will not be close enough together to worry about
double counting a significant fraction of reads, though for some samples you may get over
100% because of double-counting.
ANS: Save your script as [Your_NetID]_QC.py
ANS: Report in your answers the percentage of reads that fall within peaks for each factor
in each cell type. Write a sentence interpreting the result
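One way to organize such a QC script is sketched below. The helper names (read_peaks, percent_in_peaks, main) are illustrative, and the sketch assumes the BAM file is sorted and indexed so that pysam can both fetch by region and report the mapped-read total from the index. The counting helper takes the read counter as an argument, so the arithmetic can be checked without a BAM file.

```python
def read_peaks(bed_lines):
    """Parse (chrom, start, end) from BED lines, skipping headers and blanks."""
    peaks = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        peaks.append((chrom, int(start), int(end)))
    return peaks


def percent_in_peaks(peaks, count_reads, total_reads):
    """Percentage of reads in peaks; count_reads(chrom, start, end) -> int.

    Reads spanning two peaks are counted twice, as the text above allows.
    """
    in_peaks = sum(count_reads(chrom, start, end) for chrom, start, end in peaks)
    return 100.0 * in_peaks / total_reads


def main(argv):
    """argv = [bam_path, bed_path]; requires pysam and an indexed BAM."""
    import pysam  # imported here so the helpers above work without pysam

    bam_path, bed_path = argv
    with open(bed_path) as bed:
        peaks = read_peaks(bed)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # bam.count counts reads overlapping a region; bam.mapped is the
        # number of mapped reads recorded in the index.
        pct = percent_in_peaks(peaks, bam.count, bam.mapped)
    print(f"{pct:.2f}% of mapped reads fall within peaks")
```

In a script file, the last line would be something like main(sys.argv[1:]) after importing sys.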
4. Normalize ChIP data
Write a python script that will normalize the signal files output by MACS relative to the
total number of million mapped reads in the experiment (reads per million) i.e. if your ChIP
has 10 million mapped reads after filtering, for a nucleotide covered by 10 reads, the
normalized value should be 1.
IMPORTANT: We only want to work with chr1 data moving forward, so use the grep
command to extract all chr1 entries from the ChIP and input bedGraph files for both
tissues and H3K27ac and PolII.
Your python script should take the following inputs from the command line:
[input chr1 bedGraph file] [Total mapped reads]
Your python script should output a normalized bedGraph file named
[inputBedGraphName]_Norm.bdg. Your script should skip over any positions that
contain fewer than 0.5 reads after normalization.
IMPORTANT: You will need to maintain proper structure of the bedGraph file, as
not all lines contain read coverage data. Read
https://genome.ucsc.edu/goldenpath/help/bedgraph.html for more information.
ANS: Save and name this script [Your_NetID]_norm.py
ANS: Compress (gzip) and upload the normalized bedGraph tracks for both factors
(H3K27ac & RNA PolII) and input for both tissue types to your previous UCSC browser
session. Save the session. Don’t forget to add the appropriate bedGraph header.
ANS: Why write such a script (i.e. why normalize signal files)?
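The reads-per-million arithmetic described in this question can be sketched as follows. The function names are illustrative. The sketch assumes data lines have exactly four whitespace-separated columns (chrom, start, end, value), and that [inputBedGraphName] means the input file name without its extension, which is a guess about the intended naming.

```python
from pathlib import Path


def normalize_bedgraph(lines, total_mapped_reads, min_value=0.5):
    """Yield bedGraph lines with values rescaled to reads per million."""
    scale = total_mapped_reads / 1_000_000  # mapped reads, in millions
    for line in lines:
        fields = line.split()
        is_header = line.startswith(("track", "browser", "#"))
        if is_header or len(fields) != 4:
            # Pass non-data lines through unchanged to keep the file valid.
            yield line.rstrip("\n")
            continue
        chrom, start, end, value = fields
        rpm = float(value) / scale
        if rpm >= min_value:  # skip positions below 0.5 after normalization
            yield f"{chrom}\t{start}\t{end}\t{rpm:.4f}"


def output_name(input_path):
    """Hypothetical naming: strip the extension, then append _Norm.bdg."""
    path = Path(input_path)
    return path.with_name(path.stem + "_Norm.bdg")


demo = [
    "track type=bedGraph name=demo\n",
    "chr1\t0\t100\t10\n",    # 10 reads at 10 M mapped -> 1.0
    "chr1\t100\t200\t4\n",   # 0.4 -> dropped
    "chr1\t200\t300\t5\n",   # 0.5 -> kept
]
for row in normalize_bedgraph(demo, 10_000_000):
    print(row)
```

Writing the normalizer as a generator keeps memory use flat, since the file can be streamed line by line instead of being loaded whole.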
5. Convert the bedGraph files to bigWig format using bedGraphToBigWig
BedGraph files are large, so they are frequently converted to a binary format, bigWig, which
is stored locally but can be visualized on the Genome Browser. Instructions are found at:
http://genome.ucsc.edu/goldenPath/help/bigWig.html, complete the first 5 steps.
To convert, use the bedGraphToBigWig executable in TOOLS/bedGraphToBigWig and
chromosome size annotations in ANNOTATION/chromInfo.txt. Remember to sort your
bedGraph files first! (using sort -k1,1 -k2,2n)
ANS: Save the output as [Your_NetID]_bigWig
ANS: Record the file size for the bedGraph vs bigWig file for each cell type and factor in your
answer file (treatment, not input DNA).
ANS: Why is bigWig format useful? Are there drawbacks?
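For reference, the ordering that sort -k1,1 -k2,2n produces (chromosome name as text, then start position as a number) can be mirrored in Python as below. This is an illustrative sketch, not a required step. It drops track, browser, and comment lines on the assumption that the converter expects data lines only, and the function names are made up for the example.

```python
def sort_key(line):
    """Order by chromosome name (as text), then by numeric start."""
    chrom, start = line.split()[:2]
    return chrom, int(start)


def sort_bedgraph(lines):
    """Return data lines sorted like `sort -k1,1 -k2,2n` in the C locale.

    Header lines (track/browser/#) and blank lines are dropped, which is
    an assumption about what the converter accepts.
    """
    data = [
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith(("track", "browser", "#"))
    ]
    return sorted(data, key=sort_key)


def conversion_command(sorted_bedgraph, out_bigwig,
                       tool="TOOLS/bedGraphToBigWig",
                       chrom_sizes="ANNOTATION/chromInfo.txt"):
    """Argument order: input bedGraph, chromosome sizes, output bigWig."""
    return [tool, sorted_bedgraph, chrom_sizes, out_bigwig]


rows = sort_bedgraph([
    "track type=bedGraph\n",
    "chr10\t0\t5\t1\n",
    "chr1\t50\t60\t2\n",
    "chr1\t5\t10\t3\n",
])
print("\n".join(rows))
print(" ".join(conversion_command("demo.sorted.bdg", "demo.bw")))
```

Python compares strings by code point, which matches the byte order of the C locale for ASCII chromosome names. The shell sort may order names differently under other locale settings.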
6. Compare H3K27ac and polII profiles between cell types.
ANS: Use BedTools to determine the number of (1) promoter and (2) intergenic/intronic
H3K27ac and polII peaks (MACS calls; treat H3K27ac and PolII separately) that are:
a) Shared between K562 and NHEK cells and
b) Unique to each cell type.
For each promoter marked by H3K27ac or polII in each cell type, identify the NCBI Entrez
Gene ID for that gene. To do this, you should create a simple python script that uses a
dictionary to translate RefSeq IDs in a bed file to Entrez IDs provided in a translation file.
We have created a file with RefSeq IDs and their associated Entrez Gene ID in
DATA/PS2/Prom_with_Entrez.txt. Note that there will not necessarily be an Entrez ID for
every RefSeq ID; if not, ignore that promoter.
Your script should take the following input from the command line: [Input bed file with
RefSeq IDs] [Input translation file]
The script should output a text file with a list of Entrez IDs corresponding to the RefSeq IDs
of the promoters in the input file, 1 per line. This will be necessary for the next problem.
ANS: Save your python script as [Your_NetID]_translate_IDs.py
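The dictionary lookup described above might be sketched as follows. The layout of Prom_with_Entrez.txt is not described here, so the column positions are assumptions exposed as parameters: the sketch treats the translation file as tab-separated with the RefSeq ID in the first column and the Entrez ID in the second, and expects the RefSeq ID in the fourth (name) column of the BED file. Check the real files before relying on these defaults.

```python
def load_translation(lines, refseq_col=0, entrez_col=1):
    """Build {RefSeq ID: Entrez ID}; column indices are assumptions."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(refseq_col, entrez_col):
            continue  # malformed or short line
        refseq = fields[refseq_col].strip()
        entrez = fields[entrez_col].strip()
        if refseq and entrez:  # skip RefSeq IDs with no Entrez ID
            table[refseq] = entrez
    return table


def translate(bed_lines, table, name_col=3):
    """Return Entrez IDs for the promoters in a BED file, in input order.

    Promoters without a known Entrez ID are ignored. Repeated genes are
    reported once, which is a design choice, not a stated requirement.
    """
    ids, seen = [], set()
    for line in bed_lines:
        fields = line.split()
        if len(fields) <= name_col:
            continue
        entrez = table.get(fields[name_col])
        if entrez is not None and entrez not in seen:
            seen.add(entrez)
            ids.append(entrez)
    return ids


# Made-up identifiers, for illustration only.
table = load_translation(["NM_000001\t101\n", "NM_000002\t\n", "NM_000003\t202\n"])
bed = ["chr1\t0\t1000\tNM_000001\n", "chr1\t5\t1005\tNM_000002\n",
       "chr1\t9\t1009\tNM_000003\n", "chr2\t0\t10\tNM_999999\n"]
print("\n".join(translate(bed, table)))
```

Collapsing duplicates is convenient when several RefSeq transcripts belong to the same gene, since the downstream gene list then contains each gene once.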
7. Use DAVID to determine Gene Ontology enrichments for promoters marked by
H3K27ac and polII in each cell type.
DAVID is a very useful set of tools for annotating functional genomics data. Go to
http://david.abcc.ncifcrf.gov/. DAVID accepts a list of gene names (in this case, it should be
the Entrez Gene IDs you obtained above) identified in gene expression or ChIP-seq analyses
and can be used to determine functional enrichments in those gene sets compared to all
genes. Read the DAVID FAQ page to understand how DAVID works:
http://david.abcc.ncifcrf.gov/content.jsp?file=FAQs.html
ANS: Use the DAVID Gene Functional Annotation Tool to determine GO Biological Process
(in the GOTERM_BP_FAT output table) enrichments for the following comparisons (use
MACS peak calls). Save the output of each analysis as a flat text file with the indicated name.
a) Promoters marked by polII in both K562 and NHEK ([Your_NetID]_DAVID_polII_both)
b) Promoters marked by polII only in K562 ([Your_NetID]_DAVID_polII_K562).
c) Promoters marked by polII only in NHEK ([Your_NetID]_DAVID_polII_NHEK)
ANS: Interpret your results in a, b, and c with respect to the two cell types.
8. Use GREAT to associate putative cell-type specific enhancers with target genes.
GREAT (great.stanford.edu) is a tool for functional enrichment analyses of distant-acting
regulatory elements. It can be used to identify potential target genes for enhancers and to
determine whether that target gene set is enriched for particular biological functions.
There are two ways to load data into GREAT: directly from the Table Browser, or uploading
BED files. In task 6 above, you generated BED files for H3K27ac peak calls in K562 and
NHEK cells.
ANS: Use GREAT to determine functional enrichments for:
a) Intergenic/intronic H3K27ac sites shared between both cell types;
b) Intergenic/intronic H3K27ac sites unique to each cell type.
ANS: For each comparison, export the GO Biological Process tables as .tsv files named
[Your_NetID]_GREAT_shared, [Your_NetID]_GREAT_K562, [Your_NetID]_GREAT_NHEK.
ANS: Interpret your results in a, b with respect to the two cell types.
ANS: Would you expect there to be greater divergence in gene ontology/functional
enrichments between cells when looking at promoters or enhancers? What do you observe
here?
Optional programming challenges:
For those of you with some programming experience looking to practice your python skills,
here are optional challenges. These will NOT be worth more than a point of extra credit
(negligible to your grade), and you should not complete them if you haven’t finished
the problem set already. If you decide to work on these, please use a separate file from
your main script when handing in.
Writing a wrapper
It is common for programmers to write wrappers around their scripts that are able to take
multiple files or user interaction and pipeline the scripts’ functions accordingly.
Medium: Write a wrapper for your QC and normalization scripts that will search a
directory for .bam files, and automatically run QC and normalization on every MACS output
file for those bam files (assuming you use consistent naming conventions).
Hint: you may find ‘import subprocess’ to be useful, as it will allow you to run system
commands (such as ls) and parse their output.
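A possible shape for such a wrapper is sketched below. Everything about the naming is hypothetical: it assumes BAM files are called <sample>.sorted.bam and that the matching peak and signal files sit in the same directory as <sample>_peaks.bed and <sample>_treat_pileup.bdg. The mapped-read total for normalization is taken from samtools view -c -F 4, and the planning step is kept separate from the running step so the plan can be inspected first.

```python
import subprocess
import sys
from pathlib import Path

BAM_SUFFIX = ".sorted.bam"  # hypothetical naming convention


def samtools_mapped_count(bam):
    """Count mapped reads: -c counts, -F 4 excludes unmapped reads."""
    result = subprocess.run(
        ["samtools", "view", "-c", "-F", "4", str(bam)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def plan_jobs(directory, qc_script, norm_script,
              count_mapped=samtools_mapped_count):
    """List the QC and normalization commands for every BAM found."""
    jobs = []
    for bam in sorted(Path(directory).glob("*" + BAM_SUFFIX)):
        sample = bam.name[: -len(BAM_SUFFIX)]
        peaks = bam.with_name(f"{sample}_peaks.bed")
        signal = bam.with_name(f"{sample}_treat_pileup.bdg")
        if peaks.exists():
            jobs.append([sys.executable, qc_script, str(bam), str(peaks)])
        if signal.exists():
            total = count_mapped(bam)
            jobs.append([sys.executable, norm_script, str(signal), str(total)])
    return jobs


def run_jobs(jobs):
    """Run each planned command in turn, stopping at the first failure."""
    for job in jobs:
        subprocess.run(job, check=True)
```

Calling plan_jobs and printing the result before run_jobs makes it easy to confirm that the file pairing is right. Samples without matching MACS output, such as input controls, are skipped.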