Explanation of columns from FINAL_REPORT_filtered_with_COSMIC.csv

Milica Krunic
10.08.2016

We used ANNOVAR for variant annotation. Thus, some of the text is taken from: http://annovar.openbioinformatics.org/en/latest/user-guide/gene/ and http://annovar.openbioinformatics.org/en/latest/user-guide/filter/.

Columns description

Pos: variant genomic position

Function: tells whether the variant hits an exon, an intergenic region, an intron, or a noncoding RNA gene. If the variant is exonic/intronic/ncRNA, the next column (Gene) gives the gene name (if multiple genes are hit, the gene names are separated by commas); if not, it gives the two neighboring genes and the distance to these neighboring genes. The possible values are summarized below:

Value       Explanation
exonic      variant overlaps a coding exon
splicing    variant is within 2 bp of a splicing junction (in intron)
ncRNA       variant overlaps a transcript without coding annotation in the gene definition
UTR5        variant overlaps a 5' untranslated region
UTR3        variant overlaps a 3' untranslated region
intronic    variant overlaps an intron
upstream    variant overlaps the 1-kb region upstream of the transcription start site
downstream  variant overlaps the 1-kb region downstream of the transcription end site
intergenic  variant is in an intergenic region

Details:
- "exonic" here refers only to the coding exonic portion, not the UTR portion, as there are two keywords (UTR5, UTR3) that are specifically reserved for UTR annotations.
- "splicing" is defined as a variant that is within 2 bp of an exon/intron boundary. If "exonic,splicing" is shown, the variant is within an exon but close to an exon/intron boundary. "splicing" on its own refers only to the 2 bp in the intron that are close to an exon.
- If a variant is located in both a 5' UTR and a 3' UTR region (possibly for two different genes), then "UTR5,UTR3" is printed as the output.
- The terms "upstream" and "downstream" are defined as within 1 kb of the transcription start site or transcription end site, respectively, taking into account the strand of the mRNA. If a variant is located in both a downstream and an upstream region (possibly for 2 different genes), then "upstream,downstream" is printed as the output.

Gene: gene name. If a transcript maps to multiple locations, all as "coding transcripts", but none has a complete ORF, then this transcript will not be used in exonic_variant_function annotation and the corresponding annotation will be marked as "UNKNOWN".

ExonId: lists the transcript IDs and the corresponding exon IDs

ExonicFunction: contains the functional consequence of the exonic variant, i.e. the kind of amino acid change it causes. Note that only exonic variants are annotated in this file. Possible values in this field include: nonsynonymous SNV, synonymous SNV, frameshift insertion, frameshift deletion, nonframeshift insertion, nonframeshift deletion, frameshift block substitution, nonframeshift block substitution.

Annotation                        Explanation
frameshift insertion              an insertion of one or more nucleotides that causes a frameshift in the protein coding sequence
frameshift deletion               a deletion of one or more nucleotides that causes a frameshift in the protein coding sequence
frameshift block substitution     a block substitution of one or more nucleotides that causes a frameshift in the protein coding sequence
stopgain                          a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate creation of a stop codon at the variant site. For frameshift mutations, the creation of a stop codon downstream of the variant is not counted as "stopgain"!
stoploss                          a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate elimination of the stop codon at the variant site
nonframeshift insertion           an insertion of 3 or a multiple of 3 nucleotides that does not cause a frameshift in the protein coding sequence
nonframeshift deletion            a deletion of 3 or a multiple of 3 nucleotides that does not cause a frameshift in the protein coding sequence
nonframeshift block substitution  a block substitution of one or more nucleotides that does not cause a frameshift in the protein coding sequence
nonsynonymous SNV                 a single nucleotide change that causes an amino acid change
synonymous SNV                    a single nucleotide change that does not cause an amino acid change
unknown                           unknown function (due to various errors in the gene structure definition in the database file)

AAChange: amino acid change (HGVS nomenclature, http://www.hgvs.org/mutnomen/recs-prot.html)

Ref: reference allele

Alt: alternative allele

Zyg: zygosity (homo- or heterozygote)

CoverageHiQualRef: coverage of the reference allele (only high quality bases are included)

CoverageHiQualAlt: coverage of the alternative allele (only high quality bases are included)

dbSNPId: identification (ID) from the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/)

callQuality: bwa-GATK call quality, Phred-scaled: -10 * log10(P(ALT call is wrong))

Coverage: total coverage at the position of a found variant (does not have to match the sum of CoverageHiQualRef and CoverageHiQualAlt)

aaFreq1000g: allelic frequency in the 1000g project (at least 1%), see details: http://www.1000genomes.org.

pphProb and pphPrediction: PolyPhen-2 (Polymorphism Phenotyping v2) is a tool which predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations.
Please find details about these columns on the tool's webpage: http://genetics.bwh.harvard.edu/pph2/dokuwiki/overview and http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-polyphen-2-annotation.

MutationTasterScore and mutationTasterPrediction: MutationTaster employs a Bayes classifier to predict the disease potential of an identified variant (more information on: http://www.mutationtaster.org/info/documentation.html). According to http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-mutationtaster-annotation, there are four possible predictions: "A" ("disease_causing_automatic"), "D" ("disease_causing"), "N" ("polymorphism") or "P" ("polymorphism_automatic").

PhyloPScore and phyloPPrediction: determine the grade of conservation of a given nucleotide. A larger score signifies higher conservation. "C" means that the position is predicted to be conserved; otherwise, the prediction is "N" for non-conserved. For details see: http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-phylop-and-siphy-annotation and http://compgen.cshl.edu/phast/help-pages/phyloP.txt.

SIFTScore: SIFT is a sequence homology-based tool that sorts intolerant from tolerant amino acid substitutions and predicts whether an amino acid substitution in a protein will have a phenotypic effect. SIFT is based on the premise that protein evolution is correlated with protein function. Positions important for function should be conserved in an alignment of the protein family, whereas unimportant positions should appear diverse in an alignment. SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence.
SIFT is a multistep procedure that (1) searches for similar sequences, (2) chooses closely related sequences that may share similar function with the query sequence, (3) obtains the alignment of these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions from the alignment. Positions with normalized probabilities less than 0.05 are predicted to be deleterious; those greater than or equal to 0.05 are predicted to be tolerated (benign). For more details see: http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-sift-annotation and http://sift.jcvi.org.

GERPScore: GERP identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. These deficits are referred to as "Rejected Substitutions". Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element (see: http://mendel.stanford.edu/SidowLab/downloads/gerp/ for details). We made annotation databases for all mutations with GERP++ > 2 in the human genome, as this threshold is typically regarded as evolutionarily conserved and potentially functional. Anything less than 2 is not informative. More details are on: http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#-gerp-annotation.

Confidence score: We used a combination of 3 aligners and 2 variant callers. A variant can be identified by a minimum of 1 and a maximum of 6 aligner-caller combinations. We named the number of aligner-caller combinations that identified a variant the confidence score. Example: the value "6" in this field means that all aligner-caller combinations found the variant and that the variant is the most reliable.
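As an illustration of the confidence score described above, the following minimal Python sketch counts the distinct aligner-caller combinations that reported each variant. The input layout, the variant keys and the aligner/caller names are hypothetical placeholders; this is not the pipeline that produced the report.

```python
from collections import defaultdict


def confidence_scores(calls):
    """Count distinct aligner-caller combinations per variant.

    calls: iterable of (variant_key, aligner, caller) tuples
           (a hypothetical layout, assumed for this sketch).
    Returns a dict mapping variant_key to its confidence score.
    """
    combos = defaultdict(set)
    for variant, aligner, caller in calls:
        # A set ensures each aligner-caller pair is counted only once.
        combos[variant].add((aligner, caller))
    return {variant: len(pairs) for variant, pairs in combos.items()}


# Made-up calls with placeholder aligner and caller names.
calls = [
    ("chr1:12345:A>G", "aligner1", "callerA"),
    ("chr1:12345:A>G", "aligner2", "callerA"),
    ("chr1:12345:A>G", "aligner2", "callerA"),  # duplicate pair, counted once
    ("chr2:500:C>T", "aligner3", "callerB"),
]
scores = confidence_scores(calls)
```

With 3 aligners and 2 callers, the count for any variant found at least once falls between 1 and 6, matching the range given for this field.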
COSMIC id: ID from Catalogue of Somatic Mutations in Cancer (http://cancer.sanger.ac.uk/cosmic).
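To show how some of the columns above can be interpreted together, here is a minimal Python sketch. It assumes that the CSV header names match the column names used in this description (Pos, Function, SIFTScore, GERPScore, callQuality) and that multi-valued Function entries such as "exonic,splicing" are quoted in the file; both are assumptions, and the sample rows are made up. It converts callQuality back to an error probability, applies the 0.05 SIFT cutoff, and flags GERP scores above 2.

```python
import csv
import io


def phred_to_error_prob(quality):
    """Invert Q = -10 * log10(P): return P, the probability the ALT call is wrong."""
    return 10 ** (-quality / 10)


def sift_prediction(score):
    """Apply the 0.05 cutoff: below it is deleterious, otherwise tolerated."""
    return "deleterious" if score < 0.05 else "tolerated"


def split_function(value):
    """Split a Function entry such as 'exonic,splicing' into its keywords."""
    return [v.strip() for v in value.split(",") if v.strip()]


# Made-up rows; the header names are assumed, not taken from the real file.
SAMPLE = '''Pos,Function,SIFTScore,GERPScore,callQuality
1001,"exonic,splicing",0.01,4.2,30
2002,exonic,0.40,5.1,50
3003,intronic,,,20
'''


def summarize(text):
    """Return one interpreted record per CSV row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Empty cells are treated as "no score available".
        sift = float(row["SIFTScore"]) if row["SIFTScore"] else None
        gerp = float(row["GERPScore"]) if row["GERPScore"] else None
        rows.append({
            "pos": row["Pos"],
            "functions": split_function(row["Function"]),
            "sift": sift_prediction(sift) if sift is not None else None,
            "conserved": gerp is not None and gerp > 2,
            "error_prob": phred_to_error_prob(float(row["callQuality"])),
        })
    return rows


summary = summarize(SAMPLE)
```

Treating an empty score as missing rather than as zero is a design choice of this sketch: a blank SIFTScore would otherwise be read as a deleterious prediction.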