Download slides - Yin Lab @ NIU
Transcript
EBI web resources II: Ensembl and InterPro
Yanbin Yin
http://www.ebi.ac.uk/training/online/course/

Homework 3
• Go to http://www.ebi.ac.uk/interpro/training.html and finish the second online training course, “Introduction to protein classification at the EBI”, and then answer the following questions:
– What is the difference between a protein family and a protein domain?
– Can a protein belong to multiple families or contain multiple domains?
– What are protein sequence features? Examples?
– What is a protein signature? What is it used for?
– What are the major signature types?
– Is PROSITE a sequence pattern database or a profile database? What about Pfam?
– What is the definition of “annotation”?
• In your report, answer these questions and also include a screenshot of the page(s) that support your answer.
Due on 10/7 (send by email)
Office hour: Tue, Thu and Fri 2-4pm, MO325A
Or email: [email protected]

Outline
• Intro to genome annotation
• Protein family/domain databases
– InterPro, Pfam, Superfamily etc.
• Genome browser
– Ensembl
• Hands-on practice

Genome annotation
• Predict genes (where are the genes?)
– protein coding
– RNA coding
• Function annotation (what are these genes?)
– Search against UniProt or NCBI-nr (GenPept)
– Search against protein family/domain databases
– Search against pathway databases
Function vocabularies are defined in Gene Ontology.

Proteins can be classified into groups according to sequence or structural similarity. These groups often contain well-characterized proteins whose function is known. Thus, when a novel protein is identified, its functional properties can be proposed based on the group to which it is predicted to belong.

[Figure labels: Superfamily, Gene3D, SCOP, CATH, PDB]

InterPro components
1. CATH/Gene3D: University College, London, UK
2. PANTHER: University of Southern California, CA, USA
3. PIRSF: Protein Information Resource, Georgetown University, USA
4. Pfam: Wellcome Trust Sanger Institute, Hinxton, UK
5. PRINTS: University of Manchester, UK
6. ProDom: PRABI Villeurbanne, France
7. PROSITE: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
8. SMART: EMBL, Heidelberg, Germany
9. SUPERFAMILY: University of Bristol, UK
10. TIGRFAMs: J. Craig Venter Institute, Rockville, MD, US
11. HAMAP: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

CDD components
Pfam, SMART, TIGRFAM, COG, KOG, PRK, CD, LOAD

Most UniProt proteins are annotated with at least one InterPro signature.

Protein families are often arranged into hierarchies, with proteins that share a common ancestor subdivided into smaller, more closely related groups. The terms superfamily (describing a large group of distantly related proteins) and subfamily (describing a small group of closely related proteins) are sometimes used in this context.

Protein Classification
Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, described below.

Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater.

Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.

Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation.
Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
http://scop.mrc-lmb.cam.ac.uk/scop/intro.html

[Figure labels: PDB, Structure, Superfamily, Gene3D, Pfam, SMART, ProSite, Function (literature), SCOP, CATH, Protein, Sequence, UniProt, GenPept, Evolution]

http://www.cathdb.info/

fold ~ class – superfamily ~ clan – family – subfamily – domain sequence

Family- and domain-based classifications are not always straightforward and can overlap, since proteins are sometimes assigned to families by virtue of the domain(s) they contain. An example of this kind of complexity is outlined below.

Domain composition of phospholipase D1, which is an enzyme that breaks down phosphatidylcholine. The protein contains a PX (phox) domain that is involved in binding phosphatidylinositol, a PH (pleckstrin homology) domain that has a role in targeting the enzyme to particular locations within the cell, and two PLD (phospholipase D) domains responsible for the protein’s catalytic activity.

Sequence features differ from domains in that they are usually quite small (often only a few amino acids long), whereas domains represent entire structural or functional units of the protein (see Figure). Sequence features are often nested within domains; a protein kinase domain, for example, usually contains a protein kinase active site.

Sequence features are groups of amino acids that confer certain characteristics upon a protein, and may be important for its overall function. Such features include:
• active sites, which contain amino acids involved in catalytic activity.
• binding sites, containing amino acids that are directly involved in binding molecules or ions.
• post-translational modification (PTM) sites, which contain residues known to be chemically modified (phosphorylated, palmitoylated, acetylated, etc.) after the process of protein translation.
• repeats, which are typically short amino acid sequences that are repeated within a protein, and may confer binding or structural properties upon it.
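As a rough illustration of how a short sequence feature can be described and located, the sketch below converts a PROSITE-style pattern into a regular expression and scans a protein sequence with it. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern (PS00001), a PTM site. The function names and the example sequence are made up for illustration, and only a subset of the PROSITE pattern syntax is handled (no N- or C-terminal anchors).

```python
import re

# One pattern element: x, a single residue, [allowed], or {forbidden},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?$")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residues, repeat = match.groups()
        if residues == "x":
            regex += "."                           # any residue
        elif residues.startswith("{"):
            regex += "[^" + residues[1:-1] + "]"   # any residue except these
        else:
            regex += residues                      # literal or [allowed set]
        if repeat:
            regex += "{" + repeat + "}"
    return regex


def find_sites(pattern, sequence):
    """Return (1-based position, matched residues) for every match,
    including overlapping ones (hence the lookahead)."""
    compiled = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in compiled.finditer(sequence)]


# Hypothetical toy sequence, not a real protein.
print(find_sites("N-{P}-[ST]-{P}", "MANGSAPNPTQ"))  # [(3, 'NGSA')]
```

The second asparagine in the toy sequence is not reported because it is followed by a proline, which the {P} element forbids. A match of this kind is only a candidate site; whether the residue is actually modified is a separate, experimental question.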
Hands-on exercise 1: search against protein family databases

http://www.ebi.ac.uk/interpro/
http://cys.bios.niu.edu/yyin/teach/PBB/csl-pr.fa, put the first sequence in the search box
Hit Search; takes about 1 min
Read more about InterPro

http://www.ebi.ac.uk/interpro/release_notes.html

Click to link to InterPro page of this domain
Click to link to individual database website
These are individual family/domain matches not integrated in

This is linked from the previous page: the InterPro page to describe IPR029044
Scientific literature for this IPR family

http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
NCBI’s Conserved Domain Database (CDD): equivalent to InterPro of EBI, much faster, but integrates fewer member databases

Genome browser: ENSEMBL

http://www.ensembl.org/
The Ensembl project aims to automatically annotate genome sequences, integrate these data with other biological information and to make the results freely available to geneticists, molecular biologists, bioinformaticians and the wider research community. Ensembl is jointly headed by Dr Stephen Searle at the Wellcome Trust Sanger Institute and Dr Paul Flicek at the European Bioinformatics Institute (EBI).

Why do we need genome browsers?
To make the bare DNA sequence, its properties, and the associated annotations more accessible through a graphical interface.
Genome browsers provide access to large amounts of sequence data via a graphical user interface. They use a visual, high-level overview of complex data in a form that can be grasped at a glance and provide the means to explore the data in increasing resolution from megabase scales down to the level of individual elements of the DNA sequence.
Short tutorial videos introducing ENSEMBL
http://useast.ensembl.org/info/website/tutorials/index.html

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml

Nature 491, 56-65 (01 November 2012)
Nature 458, 719-724 (9 April 2009)
NATURE | Vol 464 | 15 April 2010

While a user may start browsing for a particular gene, the user interface will display the area of the genome containing the gene, along with a broader context of other information available in the region of the chromosome occupied by the gene. This information is shown in “tracks,” with each track showing either the genomic sequence from a particular species or a particular kind of annotation on the gene. The tracks are aligned so that the information about a particular base in the sequence is lined up and can be viewed easily. In modern browsers, the abundance of contextual information linked to a genomic region not only helps to satisfy the most directed search, but also makes available a depth of content that facilitates integration of knowledge about genes, gene expression, regulatory sequences, sequence conservation between species, and many other classes of data.

• Ensembl Genome Browsers: http://www.ensemblgenomes.org
• NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser: http://genome.ucsc.edu
Each uses a centralized model, where the web site provides access to a large public database of genome data for many species and also integrates specialized tools, such as BLAST at NCBI and Ensembl and BLAT at UCSC.
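The idea of tracks lined up by base position, described above, can be sketched as a simple interval-overlap lookup. This is only an illustrative model, not how Ensembl, NCBI or UCSC implement their browsers; the Feature class, the track names and all coordinates below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    track: str   # which track the feature is drawn on, e.g. "genes"
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def features_in_view(features, view_start, view_end):
    """Group the features that overlap the viewing window by track.

    Two closed intervals overlap when each one starts before the other ends.
    """
    visible = {}
    for feature in features:
        if feature.start <= view_end and feature.end >= view_start:
            visible.setdefault(feature.track, []).append(feature.name)
    return visible


# Made-up annotations on one chromosome region.
example = [
    Feature("genes", "geneA", 100, 200),
    Feature("genes", "geneB", 400, 500),
    Feature("regulatory", "enhancer1", 180, 220),
    Feature("conservation", "block1", 10, 90),
]

# Zooming in from a wide window to a narrow one changes what each track shows.
print(features_in_view(example, 1, 1000))
print(features_in_view(example, 150, 250))
```

Because every track uses the same coordinate system, narrowing the window is enough to move from a broad overview to the elements around a single gene.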
The public browsers provide a valuable service to the research community by providing tools for free access to whole genome data and by supporting the complex and robust informatics infrastructure required to make the data accessible.

Hands-on exercise 2: Ensembl gene search

http://www.ensembl.org/
Click to link to human page

Put “liver cancer” in the search box and Go

This keyword search gives everything that contains “liver cancer”
Click on Table to have a table view

Click on the numbers to only show gene entries
This column tells the category of the entry

This is the list of genes
The first two entries in this page are ncRNA genes. Let’s try the 2nd one
Click here to show the list and select Location and Score to show chromosome location info and score respectively
Score is calculated based on the query: how much the annotation description is similar to the searching keyword (liver cancer)

Now it’s showing the Gene; there are also other tabs
Many things can be explored
This is ENSEMBL Gene ID
Link to NCBI
This is ENSEMBL Transcript ID
This is a long intergenic non-coding RNA gene
Here is the graphical representation of the gene

Let’s try a protein-coding gene: LAT1, also known as SLC7A5

Click here

Click to view the sequence page
Different names of the gene
The three transcripts

Now check the expression
Click to open a help page to explain what these highlights mean

Different genome-wide expression studies

Links to other genome browsers
Zoomed in view
This is where the gene is located in the whole chromosome view
Further zoomed in view

This is the same region in the UCSC browser
PS: much faster and easier to use/understand than ENSEMBL (richer info?)

Next lecture: ExPASy and DTU tools