all.joiner
A file that describes joinable fields in the UCSC Genome Databases
Basic example of an identifier
identifier softberryGeneName
"Link together Fshgene++ gene structure, peptide, and homolog"
$gbd.softberryGene.name
$gbd.softberryPep.name
$gbd.softberryHom.name
• The central concept of all.joiner is the identifier, which
appears in fields of multiple tables, sometimes even in multiple
databases.
• $gbd is a variable that contains a comma-separated list of
genome databases.
• An identifier consists of an identifier line, a required
comment in quotes, and a list of database.table.field entries where
the identifier is used. The first field listed is the master key. It
contains every identifier value. Later fields need not contain all of them.
Variables
• Variables are defined by the set keyword.
• In practice they are mostly used for comma-separated lists of databases.
set fish tetraodon,fugu,zebrafish
set worms elegans,briggsae
• After these two sets, typing $fish,$worms is
equivalent to typing:
tetraodon,fugu,zebrafish,elegans,briggsae
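The substitution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the parser that the UCSC tools use; the function names expand and parse_sets are made up for this example.

```python
import re

def expand(text, variables):
    """Replace each $name in text with its value from variables."""
    return re.sub(r"\$(\w+)", lambda m: variables[m.group(1)], text)

def parse_sets(lines):
    """Collect 'set name value' lines, expanding earlier variables in each value."""
    variables = {}
    for line in lines:
        words = line.split()
        if len(words) == 3 and words[0] == "set":
            variables[words[1]] = expand(words[2], variables)
    return variables

variables = parse_sets([
    "set fish tetraodon,fugu,zebrafish",
    "set worms elegans,briggsae",
])
print(expand("$fish,$worms", variables))
# prints: tetraodon,fugu,zebrafish,elegans,briggsae
```

Because each set line is expanded as it is read, a later variable such as gbd can be built from earlier ones.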
Databases by organism
# Define databases used for various organisms
set hg hg15,hg16,hg17,hg18
set mm mm3,mm4,mm5,mm6,mm7,mm8
set rn rn2,rn3,rn4
set fr fr1,fr2
set ce ce1,ce2,ce4
set cb cb1,cb3
set dm dm1,dm2,dm3
set dp dp2,dp3
set sc sc1
set sacCer sacCer1
set panTro panTro1,panTro2
set galGal galGal2,galGal3
# Define all genome databases.
set gbd $hg,$mm,$rn,$fr,$ce,$cb,$dm,$dp,$sc,$sacCer,$panTro,$galGa
# Only consider one of members of gbd at a time.
exclusiveSet $gbd
# Define other databases that we check
set otherDb visiGene,uniProt,go,proteome,hgFixed
…
# Set up list of databases we ignore and those we check. Program
# will complain about other databases.
databasesChecked $gbd,$otherDb
databasesIgnored mysql,lost+found,$proteinDb,$zooDb,hgcentraltest,hg
# Define databases that support known genes
set kgDb $hg,$mm,$rn
# Define databases that support Gene Sorter
# (which once was the gene family browser)
set familyDb $hg,$mm,$ce,$sacCer,$dm
Chains and nets are more complex than other identifiers
# Magic for tables split between chromosomes
set split splitPrefix=chr%_
# Stuff to link together self chains and nets
identifier chainSelf
"Link together self chain info"
$gbd.chainSelf.id $split
$gbd.chainSelfLink.chainId $split
$gbd.netSelf.chainId exclude=0
The splitPrefix= option allows one logical table to be split into
several physical tables. The % acts as a wildcard (SQL style).
The exclude=0 option says that the value 0 in this field need not
appear in the master key.
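A rough Python sketch of both options follows. It assumes that only % is a wildcard and that every other character in the prefix, including the underscore, is literal; the helper names and the sample table list and chain ids are invented for illustration and do not come from joinerCheck.

```python
import re

def prefix_to_regex(prefix):
    """Turn a split prefix into a regex; assumption: only % is a wildcard."""
    return ".*".join(re.escape(part) for part in prefix.split("%"))

def physical_tables(all_tables, logical_name, split_prefix):
    """Find the physical tables that make up one logical split table."""
    rx = re.compile(prefix_to_regex(split_prefix) + re.escape(logical_name) + r"\Z")
    return sorted(t for t in all_tables if rx.match(t))

def missing_keys(master_values, field_values, exclude=()):
    """Values used in a field that are absent from the master key,
    skipping any value listed in exclude."""
    master = set(master_values)
    return sorted(v for v in set(field_values)
                  if v not in master and v not in exclude)

# Made-up table names and ids, for illustration only.
tables = ["chr1_chainSelf", "chr2_chainSelf", "chrX_chainSelfLink", "netSelf"]
print(physical_tables(tables, "chainSelf", "chr%_"))
# prints: ['chr1_chainSelf', 'chr2_chainSelf']
print(missing_keys([1, 2, 3], [0, 1, 3, 7], exclude={0}))
# prints: [7]
```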
Other chains and nets use a macro expansion of sorts so we
don’t need to define a separate identifier for each one.
set chainDest Hg15,Hg16,Hg17,Mm4,Mm5,Mm6…
identifier chain[${chainDest}]Id
"Link together chain info"
$gbd.chain[].id $split
$gbd.chain[]Link.chainId $split
$gbd.net[].chainId
$gbd.allChain[].id
$gbd.netRxBest[].chainId exclude=0
$gbd.net[]NonGap.chainId exclude=0
$gbd.netSynteny[].chainId exclude=0
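One way to picture the macro expansion is the Python sketch below. It assumes each value of the variable replaces both the bracketed part of the identifier name and every [] in the field list; that reading, the function name expand_identifier, and the two-item chainDest value are assumptions made for this example.

```python
def expand_identifier(name, fields, variables):
    """Expand 'chain[${var}]Id' into one identifier per value of var.

    Returns a dict mapping each expanded identifier name to its field list.
    """
    start = name.index("[${")
    end = name.index("}]")
    var = name[start + 3:end]
    expanded = {}
    for value in variables[var].split(","):
        ident = name[:start] + value + name[end + 2:]
        expanded[ident] = [f.replace("[]", value) for f in fields]
    return expanded

# Shortened, illustrative variable value (the real list is longer).
variables = {"chainDest": "Hg16,Mm5"}
fields = ["$gbd.chain[].id", "$gbd.chain[]Link.chainId", "$gbd.net[].chainId"]
for ident, flds in expand_identifier("chain[${chainDest}]Id", fields, variables).items():
    print(ident, flds)
```

With the two values above, the sketch yields identifiers named chainHg16Id and chainMm5Id, each with its own copy of the field list.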
# Genbank/trEMBL Accessions and meaningful subsets thereof
identifier genbankAccession external=genbank
"Generic Genbank Accession. More specific Genbank accessions follow"
$gbd.seq.acc
identifier stsAccession external=genbank typeOf=genbankAccession
"Genbank accession of a Sequence Tag Site (STS) sequence."
$gbd.stsInfo2.genbank dupeOk
identifier bacEndAccession typeOf=genbankAccession
"Genbank accession of a BAC end read."
$gbd.all_bacends.qName dupeOk
$gbd.bacEndPairs.lfNames comma
$hg.fishClones.beNames comma minCheck=0.70
The typeOf line allows joins between parent and child, but not
between siblings.
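The parent/child rule can be modeled as a walk up the typeOf chain. The sketch below is illustrative only: it treats grandparents as joinable too, which is an assumption, and the names ancestors and can_join are not taken from the UCSC code.

```python
def ancestors(name, parent_of):
    """Follow typeOf links upward and return every ancestor of name."""
    found = []
    while name in parent_of and parent_of[name] not in found:
        name = parent_of[name]
        found.append(name)
    return found

def can_join(a, b, parent_of):
    """Allow a join when the identifiers are the same or one descends from the other."""
    return a == b or a in ancestors(b, parent_of) or b in ancestors(a, parent_of)

# typeOf relationships taken from the identifiers shown above.
parent_of = {
    "stsAccession": "genbankAccession",
    "bacEndAccession": "genbankAccession",
}
print(can_join("stsAccession", "genbankAccession", parent_of))  # prints: True
print(can_join("stsAccession", "bacEndAccession", parent_of))   # prints: False
```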
identifier hugoName external=HUGO fuzzy
"International Human Gene Identifier"
$hg.refLink.name
$hg.atlasOncoGene.locusSymbol
$hg.kgAlias.alias
$hg.kgXref.geneSymbol
$hg.refFlat.geneName
$hg.jaxOrtholog.humanSymbol
hg13,hg15.geneBands.name
“Biological” names for human genes are so messy, no
validation is done (note ‘fuzzy’ keyword).
identifier ensemblTranscriptId external=Ensembl dependency
"Ensembl Transcript ID"
$gbd.ensGene.name chopAfter=.
$gbd.superfamily.name
$gbd.ensGeneXref.transcript_name chopAfter=. minCheck=0.20
mm3,hg13.ensemblXref.transcript_name chopAfter=. minCheck=0.20
mm3.ensemblXref2.transcript_name chopAfter=. minCheck=0.20
$gbd.ensGtp.transcript chopAfter=. minCheck=0.98
$gbd.ensPep.name chopAfter=. minCheck=0.98
$gbd.ensTranscript.transcript_name chopAfter=. minCheck=0.20
$kgDb.knownToEnsembl.value chopAfter=.
$gbd.sfDescription.name chopAfter=.
mm3.superfamily.name chopAfter=.
Ensembl IDs aren’t ‘fuzzy’, but several fields need a relaxed ‘minCheck’.
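To make chopAfter and minCheck concrete, here is a small Python sketch. It assumes chopAfter removes the first occurrence of the given string and everything after it, and that minCheck is the minimum fraction of a field's values that must be found in the master key. Both readings are assumptions, and the transcript IDs are invented.

```python
def chop_after(value, sep):
    """Drop the first occurrence of sep and everything after it."""
    return value.split(sep, 1)[0]

def hit_fraction(master_values, field_values):
    """Fraction of field values that are present in the master key."""
    if not field_values:
        return 1.0
    master = set(master_values)
    hits = sum(1 for v in field_values if v in master)
    return hits / len(field_values)

def passes(master_values, field_values, min_check=1.0):
    """True when enough field values are found in the master key."""
    return hit_fraction(master_values, field_values) >= min_check

# Invented IDs: version suffixes differ between tables until they are chopped.
master = [chop_after(v, ".") for v in ["ENST00000001.1", "ENST00000002.3"]]
field = [chop_after(v, ".") for v in
         ["ENST00000001.2", "ENST00000002", "ENST00000009.1", "ENST00000001"]]

print(hit_fraction(master, field))            # prints: 0.75
print(passes(master, field, min_check=0.20))  # prints: True
print(passes(master, field, min_check=0.98))  # prints: False
```

Without the chopping step, differing version suffixes would make most of these values look like misses.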
# Table types - describe tables sharing a common format.
type genePred
$hg.acembly
$gbd.ECgene
$gbd.geneid
$gbd.genscan
$gbd.sgpGene
$gbd.softberryGene
$gbd.twinscan
$gbd.ensGene
$gbd.vegaGene
$gbd.refGene
$gbd.jgiFilteredModels
The table browser finds the genePred.as file based on this
and uses it to fill in descriptions in ‘describe schema’.
# Dependencies not already captured in identifiers.
# The joinerCheck program can quickly check times and
# dependencies sort of like make.
dependency $mm.affyGnfU74ADistance $mm.knownToU74 hgFixed.g
dependency $mm.affyGnfU74BDistance $mm.knownToU74 hgFixed.gn
dependency $mm.affyGnfU74CDistance $mm.knownToU74 hgFixed.gn
dependency $hg.gnfU95Distance $hg.knownToU95 hgFixed.gnfHumanU
dependency $ce.kimExpDistance hgFixed.kimWormLifeMedianRatio
dependency $dm.arbExpDistance $dm.bdgpToCanonical hgFixed.arbFly
dependency $sacCer.choExpDistance hgFixed.yeastChoCellCycle
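The make-like time check can be sketched as follows, assuming the first table on a dependency line is built from the tables listed after it. The function name and the timestamps are made up; the sketch only illustrates the comparison, not how joinerCheck reads update times from MySQL.

```python
from datetime import datetime

def stale_sources(target, sources, update_time):
    """Return the source tables that were updated after the target table."""
    return [s for s in sources if update_time[s] > update_time[target]]

# Made-up update times for two tables named in the dependency lines above.
update_time = {
    "mm8.affyGnfU74ADistance": datetime(2006, 3, 1, 12, 0),
    "mm8.knownToU74": datetime(2006, 3, 5, 9, 30),
}
problems = stale_sources("mm8.affyGnfU74ADistance", ["mm8.knownToU74"], update_time)
print(problems)  # prints: ['mm8.knownToU74']
```

A non-empty result means the target is older than something it depends on and may need rebuilding.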
# Ignored tables - no linkage here that we check at least.
tablesIgnored go
instance_data
source_audit
tablesIgnored $gbd
ancientRepeat
axtInfo
chromInfo
cpgIsland
…
trackDb%
chr%_mrna
joinerCheck squawks about any table (or database) that all.joiner does not mention.
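The coverage idea can be sketched in Python as below: a table is reported unless it is mentioned by an identifier or matches an ignored pattern. The sketch assumes % is the only wildcard; the function names and the tables trackDb_example and mysteryTable are hypothetical.

```python
import re

def matches(pattern, name):
    """Match name against a pattern in which % stands for any run of characters."""
    rx = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(rx, name) is not None

def unmentioned_tables(db_tables, mentioned, ignored_patterns):
    """Tables that are neither mentioned nor covered by an ignored pattern."""
    return sorted(
        t for t in db_tables
        if t not in mentioned and not any(matches(p, t) for p in ignored_patterns)
    )

db_tables = ["softberryGene", "chromInfo", "trackDb_example", "chr1_mrna", "mysteryTable"]
mentioned = {"softberryGene"}
ignored = ["chromInfo", "trackDb%", "chr%_mrna"]
print(unmentioned_tables(db_tables, mentioned, ignored))
# prints: ['mysteryTable']
```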
joinerCheck
• Checks database vs. all.joiner in various ways.
• Very handy for QA but…
– A full joinerCheck run takes a very long time
– Output is verbose because it complains about everything
that is missing
– The -times check is fast, but we sometimes build tables
out of order without it being a true error.
joinerCheck - the tool you’ll love to hate!
joinerCheck - Parse and check joiner file
usage:
joinerCheck file.joiner
options:
-identifier=name - Just validate given identifier.
-database=name - Just validate given database.
-fields - Check fields in joiner file exist, faster with -fieldListIn
-fieldListOut=file - List all fields in all databases to file.
-fieldListIn=file - Get list of fields from file rather than mysql.
-keys - Validate (foreign) keys. Takes at least an hour.
-tableCoverage - Check that all tables are mentioned in joiner file
-dbCoverage - Check that all databases are mentioned in joiner file
-times - Check update times of tables are after tables they depend on
-all - Do all tests: -fields -keys -tableCoverage -dbCoverage -times
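Since a full run is slow, it is common to narrow the check to one database or one identifier. The Python helper below is a hypothetical convenience wrapper that only assembles a command line from the options listed above; it does not run anything, and the option order it produces is an assumption.

```python
def joiner_check_command(joiner_file, checks, database=None, identifier=None):
    """Build an argument list for joinerCheck from the documented options."""
    allowed = {"fields", "keys", "tableCoverage", "dbCoverage", "times", "all"}
    unknown = set(checks) - allowed
    if unknown:
        raise ValueError(f"unknown checks: {sorted(unknown)}")
    cmd = ["joinerCheck", joiner_file]
    if database:
        cmd.append(f"-database={database}")
    if identifier:
        cmd.append(f"-identifier={identifier}")
    cmd.extend(f"-{check}" for check in checks)
    return cmd

cmd = joiner_check_command("all.joiner", ["keys"],
                           database="hg18", identifier="softberryGeneName")
print(" ".join(cmd))
# prints: joinerCheck all.joiner -database=hg18 -identifier=softberryGeneName -keys
```

The resulting list could be passed to subprocess.run on a machine where joinerCheck is installed.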
all.joiner in summary
• Together with the .as files, it describes our large, messy,
useful database.
• Missing info in all.joiner results in missing
functionality in table browser.
• QA can automatically catch many problems
with joinerCheck
• Full path - src/hg/makeDb/schema/all.joiner
• See also src/hg/makeDb/schema/joiner.doc