Explanation of columns from FINAL_REPORT_filtered_with_COSMIC.csv
Milica Krunic
10.08.2016
We used ANNOVAR for variant annotation; some of the text below is therefore taken from:
http://annovar.openbioinformatics.org/en/latest/user-guide/gene/ and
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/.
Columns description
Pos: variant genomic position
Function: indicates whether the variant hits an exon, an intron, an intergenic region, or a noncoding RNA gene.
If the variant is exonic/intronic/ncRNA, the second column of the ANNOVAR output gives the gene name (if
multiple genes are hit, the gene names are separated by commas); if not, the second column gives
the two neighboring genes and the distance to these neighboring genes.
The possible values are summarized below:
Value        Explanation
exonic       variant overlaps a coding exon
splicing     variant is within 2 bp of a splicing junction (in intron)
ncRNA        variant overlaps a transcript without coding annotation in the gene definition
UTR5         variant overlaps a 5' untranslated region
UTR3         variant overlaps a 3' untranslated region
intronic     variant overlaps an intron
upstream     variant overlaps 1-kb region upstream of transcription start site
downstream   variant overlaps 1-kb region downstream of transcription end site
intergenic   variant is in intergenic region
Details:
- "exonic" here refers only to the coding exonic portion, not the UTR portion, as there are two
keywords (UTR5, UTR3) that are specifically reserved for UTR annotations.
- "splicing" is defined as a variant that is within 2 bp of an exon/intron boundary. If
"exonic,splicing" is shown, it means that the variant is within an exon but close to an exon/intron
boundary. "splicing" on its own only refers to the 2 bp in the intron that are close to an exon.
- If a variant is located in both a 5' UTR and a 3' UTR region (possibly for two different genes), then
"UTR5,UTR3" will be printed as the output.
- The terms "upstream" and "downstream" are defined as within 1 kb of the transcription start site or
transcription end site, respectively, taking into account the strand of the mRNA.
- If a variant is located in both a downstream and an upstream region (possibly for two different genes),
then "upstream,downstream" will be printed as the output.
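Because a single Function cell can carry several keywords at once, it can help to split the cell before filtering. The following is a minimal sketch, assuming the keywords are comma-separated exactly as in the examples above; the function names are hypothetical and not part of ANNOVAR or of the report pipeline.

```python
def split_function(value):
    """Split a Function cell such as "exonic,splicing" into its keywords."""
    # Assumption: keywords are separated by commas, as in the examples above.
    return [part.strip() for part in value.split(",") if part.strip()]


def hits_region(value, keyword):
    """Return True if the Function cell contains the given keyword."""
    return keyword in split_function(value)


print(split_function("exonic,splicing"))           # ['exonic', 'splicing']
print(hits_region("upstream,downstream", "UTR5"))  # False
```

Splitting first avoids substring mistakes, for example treating "intronic" as a match when searching for "exonic" is not possible here because whole keywords are compared.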
Gene: gene name. If a transcript maps to multiple locations, all as "coding transcripts", but none has a
complete ORF, then this transcript will not be used in the exonic_variant_function annotation and the
corresponding annotation will be marked as "UNKNOWN".
ExonId: Lists the transcript's IDs and corresponding exon IDs
ExonicFunction: describes the effect of an exonic variant on the encoded amino acids. Note that only
exonic variants are annotated in this column. It contains the functional consequence of the variant
(possible values in this field include: nonsynonymous SNV, synonymous SNV, frameshift insertion,
frameshift deletion, nonframeshift insertion, nonframeshift deletion, frameshift block substitution,
nonframeshift block substitution).
Annotation                         Explanation
frameshift insertion               an insertion of one or more nucleotides that causes frameshift changes in the protein coding sequence
frameshift deletion                a deletion of one or more nucleotides that causes frameshift changes in the protein coding sequence
frameshift block substitution      a block substitution of one or more nucleotides that causes frameshift changes in the protein coding sequence
stopgain                           a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate creation of a stop codon at the variant site. For frameshift mutations, the creation of a stop codon downstream of the variant will not be counted as "stopgain"!
stoploss                           a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate elimination of the stop codon at the variant site
nonframeshift insertion            an insertion of 3 or multiples of 3 nucleotides that does not cause frameshift changes in the protein coding sequence
nonframeshift deletion             a deletion of 3 or multiples of 3 nucleotides that does not cause frameshift changes in the protein coding sequence
nonframeshift block substitution   a block substitution of one or more nucleotides that does not cause frameshift changes in the protein coding sequence
nonsynonymous SNV                  a single nucleotide change that causes an amino acid change
synonymous SNV                     a single nucleotide change that does not cause an amino acid change
unknown                            unknown function (due to various errors in the gene structure definition in the database file)
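The frameshift/nonframeshift distinction for insertions and deletions comes down to whether the net length change is a multiple of 3. The sketch below illustrates only that rule; it is not ANNOVAR's implementation, it ignores stopgain/stoploss and block substitutions, and it assumes that alleles are given as plain nucleotide strings with "-" standing for an empty allele. The function name is hypothetical.

```python
def indel_frame_effect(ref, alt):
    """Classify an insertion or deletion by its net length change modulo 3."""
    # Assumption: "-" denotes an empty allele.
    ref_len = len(ref.replace("-", ""))
    alt_len = len(alt.replace("-", ""))
    delta = alt_len - ref_len
    if delta == 0:
        raise ValueError("ref and alt have equal length: not an insertion or deletion")
    kind = "insertion" if delta > 0 else "deletion"
    # A length change divisible by 3 keeps the reading frame intact.
    prefix = "nonframeshift" if delta % 3 == 0 else "frameshift"
    return f"{prefix} {kind}"


print(indel_frame_effect("-", "T"))     # frameshift insertion
print(indel_frame_effect("ACGT", "A"))  # nonframeshift deletion
```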
AAChange: amino acid change (HGVS nomenclature, http://www.hgvs.org/mutnomen/recs-prot.html)
Ref: reference allele
Alt: alternative allele
Zyg: zygosity (homozygous or heterozygous)
CoverageHiQualRef : coverage of the reference allele (only high quality bases are included)
CoverageHiQualAlt : coverage of the alternative allele (only high quality bases are included)
dbSNPId: identification (ID) from dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/)
callQuality: bwa-GATK call quality, Phred-scaled: -10 * log10(P(ALT call is wrong))
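Since the call quality is -10 times the base-10 logarithm of the error probability, it can be converted back to a probability that the ALT call is wrong. This is a minimal sketch of that arithmetic; the function names are illustrative only.

```python
import math


def phred_to_error_prob(quality):
    """Convert a Phred-scaled quality to the probability that the call is wrong."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(prob):
    """Convert an error probability (0 < prob <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(prob)


# A quality of 20 corresponds to a 1 in 100 chance of a wrong call,
# a quality of 30 to 1 in 1000.
for q in (10, 20, 30):
    print(q, f"{phred_to_error_prob(q):.4f}")
```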
Coverage: total coverage at the position of a found variant (does not have to match the sum of
CoverageHiQualRef and CoverageHiQualAlt)
aaFreq1000g: allele frequency in the 1000 Genomes Project (at least 1%); for details see:
http://www.1000genomes.org.
pphProb and pphPrediction: PolyPhen-2 (Polymorphism Phenotyping v2) is a tool which predicts the
possible impact of an amino acid substitution on the structure and function of a human protein using
straightforward physical and comparative considerations. Please find details about these columns on
the tool's webpage: http://genetics.bwh.harvard.edu/pph2/dokuwiki/overview
and
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-polyphen-2-annotation.
MutationTasterScore and mutationTasterPrediction: MutationTaster employs a Bayes classifier to
eventually predict the disease potential of an identified variant (more information on:
http://www.mutationtaster.org/info/documentation.html). According to:
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-mutationtaster-annotation: there are
four possible predictions: "A" ("disease_causing_automatic"), "D" ("disease_causing"), "N"
("polymorphism") or "P" ("polymorphism_automatic").
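The four single-letter codes can be expanded with a small lookup table. The sketch below uses only the codes and labels listed above; the helper names are hypothetical, and grouping "A" and "D" as disease-causing simply follows the label names.

```python
# Codes and labels as listed in the ANNOVAR documentation cited above.
MUTATIONTASTER_LABELS = {
    "A": "disease_causing_automatic",
    "D": "disease_causing",
    "N": "polymorphism",
    "P": "polymorphism_automatic",
}


def mutationtaster_label(code):
    """Return the full label for a mutationTasterPrediction code."""
    return MUTATIONTASTER_LABELS[code]


def is_predicted_disease_causing(code):
    """True for the two disease-causing codes, "A" and "D"."""
    return code in ("A", "D")


print(mutationtaster_label("D"))          # disease_causing
print(is_predicted_disease_causing("P"))  # False
```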
PhyloPScore and phyloPPrediction: give the degree of conservation of a given nucleotide. A larger
score signifies higher conservation. "C" means that the position is predicted to be conserved; otherwise, the
prediction is "N" for non-conserved. For details see:
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-phylop-and-siphy-annotation
and
http://compgen.cshl.edu/phast/help-pages/phyloP.txt.
SIFTScore: SIFT is a sequence homology-based tool that sorts intolerant from tolerant amino acid
substitutions and predicts whether an amino acid substitution in a protein will have a phenotypic effect.
SIFT is based on the premise that protein evolution is correlated with protein function. Positions
important for function should be conserved in an alignment of the protein family, whereas unimportant
positions should appear diverse in an alignment. SIFT takes a query sequence and uses multiple
alignment information to predict tolerated and deleterious substitutions for every position of the query
sequence. SIFT is a multistep procedure that (1) searches for similar sequences, (2) chooses closely
related sequences that may share similar function to the query sequence, (3) obtains the alignment of
these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions from
the alignment. Positions with normalized probabilities less than 0.05 are predicted to be deleterious,
those greater than or equal to 0.05 are predicted to be tolerated (benign). For more details see:
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-sift-annotation and
http://sift.jcvi.org.
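The 0.05 cutoff described for SIFT can be applied to a SIFTScore value as follows. This is a minimal sketch of the stated rule only (below 0.05 is deleterious, 0.05 or above is tolerated); the function and constant names are hypothetical.

```python
SIFT_CUTOFF = 0.05  # threshold stated in the SIFTScore description


def sift_prediction(score):
    """Apply the SIFT cutoff to a normalized probability."""
    return "deleterious" if score < SIFT_CUTOFF else "tolerated"


print(sift_prediction(0.01))  # deleterious
print(sift_prediction(0.05))  # tolerated
```

Note that the comparison is strict: a score of exactly 0.05 is classified as tolerated.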
GERPScore: GERP identifies constrained elements in multiple alignments by quantifying substitution
deficits. These deficits represent substitutions that would have occurred if the element were neutral
DNA, but did not occur because the element has been under functional constraint. These deficits are
referred to as "Rejected Substitutions". Rejected substitutions are a natural measure of constraint that
reflects the strength of past purifying selection on the element (see:
http://mendel.stanford.edu/SidowLab/downloads/gerp/ for details). We made annotation databases for
all mutations with GERP++ > 2 in the human genome, as this threshold is typically regarded as
evolutionarily conserved and potentially functional. Anything less than 2 is not informative. More
details are on: http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-gerp-annotation.
Confidence score: We used a combination of 3 aligners and 2 variant callers, so a variant can be identified
by a minimum of 1 and a maximum of 6 aligner-caller combinations. We call the number of aligner-caller
combinations that found a variant its confidence score. Example: the value "6" in this field means that all
aligner-caller combinations found the variant, making it the most reliable.
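The confidence score is a count over the 3 x 2 = 6 possible aligner-caller pairs. The sketch below shows that counting; the aligner and caller names are placeholders, since the document does not list the tools used, and the function name is hypothetical.

```python
from itertools import product

# Placeholder names: the actual aligners and callers are not listed here.
ALIGNERS = ("aligner_1", "aligner_2", "aligner_3")
CALLERS = ("caller_1", "caller_2")

# All 3 x 2 = 6 possible aligner-caller combinations.
ALL_COMBINATIONS = set(product(ALIGNERS, CALLERS))


def confidence_score(detected_by):
    """Count the distinct aligner-caller combinations that found a variant."""
    # Duplicates and unknown pairs are ignored, so the result is at most 6.
    return len(set(detected_by) & ALL_COMBINATIONS)


found = [("aligner_1", "caller_1"), ("aligner_2", "caller_1")]
print(confidence_score(found))             # 2
print(confidence_score(ALL_COMBINATIONS))  # 6
```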
COSMIC id: ID from Catalogue of Somatic Mutations in Cancer (http://cancer.sanger.ac.uk/cosmic).