Download slides - Yin Lab @ NIU

EBI web resources II:
Ensembl and InterPro
Yanbin Yin
http://www.ebi.ac.uk/training/online/course/
1
Homework 3
• Go to http://www.ebi.ac.uk/interpro/training.html and finish the second online training course, “Introduction to protein classification at the EBI”, and then answer the following questions:
– What is the difference between a protein family and a protein domain?
– Can a protein belong to multiple families or contain multiple domains?
– What are protein sequence features? Examples?
– What is a protein signature? What is it used for?
– What are the major signature types?
– Is PROSITE a sequence pattern database or a profile database? What about Pfam?
– What is the definition of “annotation”?
• In your report, answer these questions and also include a screenshot of the page(s) that support your answer.
Due on 10/7 (send by email)
Office hour:
Tue, Thu and Fri 2-4 pm, MO325A
Or email: [email protected]
2
Outline
• Intro to genome annotation
• Protein family/domain databases
– InterPro, Pfam, Superfamily etc.
• Genome browser
– Ensembl
• Hands-on practice
3
Genome annotation
• Predict genes (where are the genes?)
– protein coding
– RNA coding
• Function annotation (what are these genes?)
– Search against UniProt or NCBI-nr (GenPept)
– Search against protein family/domain databases
– Search against pathway databases
Function vocabularies are defined in Gene Ontology
Proteins can be classified into groups according to sequence or structural similarity. These groups often contain well-characterized proteins whose function is known. Thus, when a novel protein is identified, its functional properties can be proposed based on the group to which it is predicted to belong.
4
Figure labels: Superfamily, Gene3D, SCOP, CATH, PDB
5
InterPro components
1. CATH/Gene3D: University College London, UK
2. PANTHER: University of Southern California, CA, USA
3. PIRSF: Protein Information Resource, Georgetown University, USA
4. Pfam: Wellcome Trust Sanger Institute, Hinxton, UK
5. PRINTS: University of Manchester, UK
6. ProDom: PRABI Villeurbanne, France
7. PROSITE: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
8. SMART: EMBL, Heidelberg, Germany
9. SUPERFAMILY: University of Bristol, UK
10. TIGRFAMs: J. Craig Venter Institute, Rockville, MD, US
11. HAMAP: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
CDD components
Pfam, SMART, TIGRFAM, COG, KOG, PRK, CD, LOAD
6
Most UniProt proteins are annotated with at least one InterPro signature
7
8
Protein families are often arranged into hierarchies, with proteins that share a common ancestor subdivided into smaller, more closely related groups. The terms superfamily (describing a large group of distantly related proteins) and subfamily (describing a small group of closely related proteins) are sometimes used in this context.
9
Protein Classification
Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, described below.
Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater.
Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
http://scop.mrc-lmb.cam.ac.uk/scop/intro.html
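The 30% pairwise identity guideline for families can be made concrete with a small sketch. This is only an illustration, assuming the two sequences have already been aligned (gaps written as "-") and that identity is counted over columns where neither sequence has a gap; other tools may use a different denominator. The function names and the toy sequences are hypothetical, not taken from SCOP.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def meets_family_guideline(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """True if the pair reaches the (assumed) 30% identity rule of thumb."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Toy, made-up aligned sequences: 5 identical residues out of 6 gap-free columns.
print(percent_identity("ACDE-FG", "ACDQKFG"))
```

Identity alone is only a rule of thumb; as the slide notes, superfamily members can fall below this threshold and still share an origin.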
10
Figure labels: Structure, Sequence, Function (literature), Evolution; PDB, Superfamily, Gene3D, Pfam, SMART, ProSite, SCOP, CATH, UniProt, GenPept
11
http://www.cathdb.info/
12
fold ~ class – superfamily ~ clan – family – subfamily – domain sequence
13
Family- and domain-based classifications are not always straightforward and can overlap, since proteins are sometimes assigned to families by virtue of the domain(s) they contain. An example of this kind of complexity is outlined below.
Domain composition of phospholipase D1, which is an enzyme that breaks down phosphatidylcholine. The protein contains a PX (phox) domain that is involved in binding phosphatidylinositol, a PH (pleckstrin homology) domain that has a role in targeting the enzyme to particular locations within the cell, and two PLD (phospholipase D) domains responsible for the protein’s catalytic activity.
14
Sequence features differ from domains in that they are usually quite small (often only a few amino acids long), whereas domains represent entire structural or functional units of the protein (see Figure). Sequence features are often nested within domains: a protein kinase domain, for example, usually contains a protein kinase active site.
Sequence features are groups of amino acids that confer certain characteristics upon a protein, and may be important for its overall function. Such features include:
• active sites, which contain amino acids involved in catalytic activity.
• binding sites, containing amino acids that are directly involved in binding molecules or ions.
• post-translational modification (PTM) sites, which contain residues known to be chemically modified (phosphorylated, palmitoylated, acetylated, etc.) after the process of protein translation.
• repeats, which are typically short amino acid sequences that are repeated within a protein, and may confer binding or structural properties upon it.
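Short sequence features like these are often described by sequence patterns. As a rough sketch (not how InterPro or PROSITE scan sequences internally), the code below converts a PROSITE-style pattern into a Python regular expression and reports where it matches. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the helper names and the example sequence are made up for illustration, and the converter handles only the basic syntax (x, [..], {..}, repeat counts, and the < and > anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any residue except P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4): repeat 2 to 4 times
        element = element.replace("x", ".")                      # x: any residue
        element = element.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
        parts.append(element)
    return "".join(parts)


def find_sites(sequence: str, pattern: str):
    """Return (1-based start, matched residues) for every match, overlaps included."""
    # A lookahead with a capture group lets overlapping matches be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


# Made-up toy sequence, not a real protein.
toy_sequence = "MKNGSAANPTLNCTW"
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites(toy_sequence, "N-{P}-[ST]-{P}"))
```

A pattern like this is exact-match only; profile-based signatures score each position instead, which is one way to think about the PROSITE pattern versus profile question in the homework.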
15
Hands-on exercise 1: search against protein family databases
16
http://www.ebi.ac.uk/interpro/
http://cys.bios.niu.edu/yyin/teach/PBB/csl-pr.fa, put the first sequence in the search box
Hit Search; it takes about 1 min
Read more about InterPro
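If the FASTA file holds many sequences, the first record can be pulled out with a few lines of code before pasting it into the search box. This is a minimal sketch that works on FASTA text already loaded into a string; the function names and the two toy records are hypothetical and do not come from csl-pr.fa.

```python
def read_fasta(text: str):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across several lines
    if header is not None:
        yield header, "".join(chunks)


def first_record(text: str):
    """Return the first (header, sequence) pair, or None if there is none."""
    return next(read_fasta(text), None)


# Made-up example records for illustration only.
toy_fasta = """>seq1 toy record
MKTAYIAK
QRQISFVK
>seq2 another toy record
MSSHEGGK
"""

header, sequence = first_record(toy_fasta)
print(">" + header)
print(sequence)
```

To use it on the course file, read the downloaded file's contents into a string and pass that string to first_record.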
17
http://www.ebi.ac.uk/interpro/release_notes.html
18
Click to link to the InterPro page of this domain
Click to link to the individual database website
These are individual family/domain matches not integrated in
19
This is linked from the previous page: the InterPro page describing IPR029044
Scientific literature for this IPR family
20
http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
NCBI’s Conserved Domain Database (CDD): equivalent to InterPro at EBI, much faster, but integrates fewer member databases
21
22
Genome browser: ENSEMBL
23
http://www.ensembl.org/
The Ensembl project aims to automatically annotate genome sequences, integrate these data with other biological information and to make the results freely available to geneticists, molecular biologists, bioinformaticians and the wider research community. Ensembl is jointly headed by Dr Stephen Searle at the Wellcome Trust Sanger Institute and Dr Paul Flicek at the European Bioinformatics Institute (EBI).
24
What do we need in genome browsers?
To make the bare DNA sequence, its properties, and the associated annotations more accessible through a graphical interface.
Genome browsers provide access to large amounts of sequence data via a graphical user interface. They use a visual, high-level overview of complex data in a form that can be grasped at a glance and provide the means to explore the data in increasing resolution from megabase scales down to the level of individual elements of the DNA sequence.
25
Short tutorial videos introducing ENSEMBL
http://useast.ensembl.org/info/website/tutorials/index.html
26
http://useast.ensembl.org/info/website/tutorials/index.html
27
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
28
Nature 491, 56-65 (01 November 2012)
29
Nature 458, 719-724 (9 April 2009)
NATURE | Vol 464 | 15 April 2010
30
While a user may start browsing for a particular gene, the user interface will display the area of the genome containing the gene, along with a broader context of other information available in the region of the chromosome occupied by the gene.
This information is shown in “tracks,” with each track showing either the genomic sequence from a particular species or a particular kind of annotation on the gene. The tracks are aligned so that the information about a particular base in the sequence is lined up and can be viewed easily.
In modern browsers, the abundance of contextual information linked to a genomic region not only helps to satisfy the most directed search, but also makes available a depth of content that facilitates integration of knowledge about genes, gene expression, regulatory sequences, sequence conservation between species, and many other classes of data.
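The idea of aligned tracks can be sketched as a simple interval query: every feature carries coordinates on the same axis, and the browser shows, per track, whatever overlaps the current view. The code below is a toy illustration under that assumption, not how Ensembl or UCSC store their data; the Feature type, the track names, and all coordinates are made up.

```python
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    track: str   # e.g. "genes", "regulatory", "conservation"
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def features_in_view(features: List[Feature], view_start: int, view_end: int) -> Dict[str, List[str]]:
    """Group the names of features overlapping [view_start, view_end] by track."""
    visible: Dict[str, List[str]] = {}
    for feature in features:
        # Two closed intervals overlap when each one starts before the other ends.
        if feature.start <= view_end and feature.end >= view_start:
            visible.setdefault(feature.track, []).append(feature.name)
    return visible


# Hypothetical features on one chromosome; coordinates are invented.
toy_features = [
    Feature("genes", "geneA", 100, 500),
    Feature("genes", "geneB", 900, 1200),
    Feature("regulatory", "enhancer1", 450, 480),
    Feature("conservation", "block1", 1500, 1600),
]

print(features_in_view(toy_features, 400, 950))
```

Zooming in or out then amounts to changing view_start and view_end; real browsers use indexed data structures so this query stays fast at megabase scale.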
31
• Ensembl Genome Browsers: http://www.ensemblgenomes.org
• NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser: http://genome.ucsc.edu
Each uses a centralized model, where the website provides access to a large public database of genome data for many species and also integrates specialized tools, such as BLAST at NCBI and Ensembl and BLAT at UCSC.
The public browsers provide a valuable service to the research community by providing tools for free access to whole genome data and by supporting the complex and robust informatics infrastructure required to make the data accessible.
32
Hands-on exercise 2: Ensembl gene search
33
http://www.ensembl.org/
Click to link to the human page
34
Put “liver cancer” in the search box and click Go
35
This keyword search gives everything that contains “liver cancer”
Click on Table to have a table view
36
Click on the numbers to show only gene entries
This column tells the category of the entry
37
This is the list of genes
The first two entries on this page are ncRNA genes. Let’s try the 2nd one
Click here to show the list and select Location and Score to show chromosome location info and score respectively
Score is calculated based on the query: how similar the annotation description is to the search keyword
38
(liver cancer)
Now it’s showing the Gene; there are also other tabs
Many things can be explored
This is the ENSEMBL Gene ID
Link to NCBI
This is the ENSEMBL Transcript ID
This is a long intergenic non-coding RNA gene
Here is the graphical representation of the gene
39
Let’s try a protein-coding gene: LAT1, also known as SLC7A5
40
Click here
41
Click to view the sequence page
Different names of the gene
The three transcripts
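The same gene record can also be retrieved programmatically. The sketch below is one possible way to do it, assuming the public Ensembl REST service at https://rest.ensembl.org and its lookup/symbol endpoint; it is not part of the original exercise, the function names are hypothetical, and the actual request needs network access.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server (assumed reachable)


def lookup_url(symbol: str, species: str = "homo_sapiens", expand: bool = True) -> str:
    """Build the lookup/symbol URL; expand=1 asks for transcripts to be included."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    query = urllib.parse.urlencode(
        {"content-type": "application/json", "expand": int(expand)}
    )
    return ENSEMBL_REST + path + "?" + query


def lookup_gene(symbol: str, species: str = "homo_sapiens") -> dict:
    """Fetch the gene record as a dictionary (requires network access)."""
    with urllib.request.urlopen(lookup_url(symbol, species), timeout=30) as response:
        return json.load(response)


# Example usage (not run here because it contacts the server):
#   gene = lookup_gene("SLC7A5")
#   print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
#   print(len(gene.get("Transcript", [])), "transcripts")
print(lookup_url("SLC7A5"))
```

The number of transcripts returned depends on the Ensembl release, so it may differ from the three shown on the slide.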
42
Now check the expression
Click to open a help page that explains what these highlights mean
43
Different genome-wide expression studies
44
Links to other genome browsers
Zoomed-in view
This is where the gene is located in the whole chromosome view
Further zoomed-in view
45
This is the same region in the UCSC browser
PS: much faster and easier to use/understand than ENSEMBL (richer info?)
46
Next lecture: ExPASy and DTU tools
47