CSC598BIL675-2016
Evolution
Aristotle:
  classification of animals
  theories on change (change is the actuality of the potential)
Darwin:
  descent with modification
  natural selection
There is no evolution without change

Evolving nomenclature
change in DNA code = genetic variation
change with respect to what? any consequence?
• Mutations
• Single Nucleotide Polymorphism: SNPs
• Deletion/insertion polymorphism: DIPs
• Short Nucleotide Polymorphism: SNPs
• Short Nucleotide Variants: SNVs
• Short Genetic Variants

Definitions
pol·y·mor·phism (pŏl′ē-môr′fĭz′əm) n.
1. Biology: The occurrence of different forms, stages, or types in individual organisms or in organisms of the same species, independent of sexual variations.
2. Chemistry: Crystallization of a compound in at least two distinct forms. Also called pleomorphism.

var·i·ant (vâr′ē-ənt, văr′-)
adj.
1. Having or exhibiting variation; differing.
2. Tending or liable to vary; variable.
3. Deviating from a standard, usually by only a slight difference.
n. Something that differs in form only slightly from something else, as a different spelling or pronunciation of the same word.

Human Genome Project
ENCODE project
HapMap project
SNP consortium
Individual human genomes: James Watson, Craig Venter, 3 Asian gentlemen

Evolving SNV analysis needs
• Single SNP
• Millions of SNPs
How to structure the analysis is based on the same theories… It’s a question of scale and heuristics
• Finding SNPs in single gene sequence
• Finding SNPs in GWAS studies, other exome sequencing etc…

Calling SNPs in NGS
• Polymorphisms with respect to a reference genome
• Challenging because of alignment errors, variable depth of coverage
• Accuracy is essential
  – diagnostics, risk assessment
• False positives and false negatives both a problem
  – Given 1% sequencing error, how many high quality reads do we need to call a variant?
  – Quality scores differ per experiment
  – The tools we use should have prior knowledge of known SNPs and their relevance to our question, i.e. causing disease or not

Prioritization of SNPs
• You have millions
• How do you know which are important for your research?
First let’s look at what SNPs can do…

So you have a SNP
• Is it associated with disease? If so, why?
  – Is it to do with protein function
  – or transcriptional regulation
  – or both, or none, or what?
• If none of the above,
  – then why is it associated with disease?
  – how do you begin to imagine its function?
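
The question from the "Calling SNPs in NGS" slide (given 1% sequencing error, how many high quality reads do we need to call a variant?) can be explored with a simple binomial model. The sketch below is only an illustration, not the method of any particular variant caller. Its assumptions are not from the slides: errors are independent between reads, every error is counted as supporting the alternate allele (a conservative simplification), and the cutoff alpha is an arbitrary choice. The function names are hypothetical.

```python
from math import comb


def prob_at_least(k, depth, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate).

    X models the number of reads that show the alternate base purely
    because of sequencing error (assumed independent between reads).
    """
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


def min_alt_reads(depth, error_rate=0.01, alpha=1e-6):
    """Smallest number of alternate-supporting reads that sequencing
    error alone would produce with probability below alpha.

    Returns None if even depth-many reads are not enough.
    alpha is an assumed, arbitrary false-positive cutoff per site.
    """
    for k in range(1, depth + 1):
        if prob_at_least(k, depth, error_rate) < alpha:
            return k
    return None


if __name__ == "__main__":
    # Show how the required support grows with depth of coverage.
    for depth in (10, 30, 100):
        print(depth, min_alt_reads(depth))
```

Under this model the required number of supporting reads depends on depth of coverage, which is one reason variable coverage makes calling harder. Real callers also use per-base quality scores and prior knowledge of known SNPs, as the slide notes.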

SNP function prediction (summary)
• (in coding sequence) Protein Function
  – Ligand binding affinity
  – Co-factor binding affinity
  – targeting to different cellular compartment
• (in coding or non-coding sequence) Gene Processing
  – Transcriptional regulation
  – Translational regulation
  – Splicing

Assessment of SNP Function
• Position of SNP
  – dbSNP or new SNP: first identify location
• In a coding sequence: non-synonymous
  – Protein Data Bank, PolyPhen
  – UniProt, PsiPred (secondary structure prediction tool)
  – ProSite, InterPro
Done individually, or incorporated into software to scale up for high throughput

Example: AGT & Hyperoxaluria
SNP mutation causes disease
CCA > CTA => Proline > Leucine (P11L)
[Figure: side-chain structures of Pro (P) and Leu (L)]

Two more in AGT
• Gly82Glu: blocks binding to cofactor
• Gly41Arg: disrupts intermonomer interactions
[Figure: side-chain structures of Gly (G), Glu (E) and Arg (R)]

Assessment of SNP Function - I
• Position of SNP
• In CDS: non-synonymous
  – Protein Data Bank, PolyPhen
  – UniProt, PsiPred
  – ProSite, InterPro
• Upstream of CDS or in CDS and synonymous
  – SignalP, ProSite, rate of processing?
  – TRANSFAC
  – DBTSS
  – NXSensor
Is it in a regulatory element?

[Figure: gene structure, 5’ to 3’: promoter with transcription factor binding sites (TFBSs), transcriptional start site (TSS), 5’UTR, translation initiation site (initiation codon ATG), Exon 1, Exon 2]

SNP in a regulatory element
TFBS:                           ACAGTCGTAAGGCTGATTGGCTGGATAGCAGTACG
Single nucleotide polymorphism: ACAGTCGTAAGGCTAATTGGCTGGATAGCAGTACG
May disrupt TF binding and therefore functionality

Example: CYP2E1
[Figure: track from DBTSS; nucleosomes]

Assessment of SNP Function - II
• In non-coding sequence
  – First, assess conservation
  – TRANSFAC
  – miRNA registry
  – Repeatmasker
  – Alternative splicing
  – HapMap
Is it in a regulatory element?

Prioritization of SNPs
• You have millions
• How do you know which are important for your research?
How do you (can you?) implement this into a pipeline so you can do thousands at once?
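
As a small step toward the pipeline question above, the first check on a coding SNP (synonymous or non-synonymous?) is easy to automate. The sketch below uses the standard genetic code and reproduces the CCA > CTA, Pro > Leu (P11L) change from the AGT example. The coding sequence TOY_CDS is a made-up placeholder, not the real AGT sequence, and the function name and 0-based position convention are assumptions for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(cds, pos, alt):
    """Classify a single-base substitution inside a coding sequence.

    cds: coding DNA sequence starting at the initiation codon ATG
    pos: 0-based position of the SNP within cds (assumed convention)
    alt: alternate base
    Returns (label, effect), e.g. ("P11L", "non-synonymous").
    """
    cds, alt = cds.upper(), alt.upper()
    if cds[pos] == alt:
        raise ValueError("alternate base equals reference base")
    offset = pos % 3
    start = pos - offset
    ref_codon = cds[start:start + 3]
    if len(ref_codon) < 3:
        raise ValueError("position falls in an incomplete final codon")
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    label = f"{ref_aa}{start // 3 + 1}{alt_aa}"
    effect = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, effect


# Hypothetical toy CDS: ATG, nine filler codons, CCA at codon 11, stop.
TOY_CDS = "ATG" + "GCT" * 9 + "CCA" + "TAA"

if __name__ == "__main__":
    # Batch mode: the same call applied to many (position, alt) pairs.
    for pos, alt in [(31, "T"), (32, "G")]:
        print(pos, alt, classify_coding_snp(TOY_CDS, pos, alt))
```

Only the non-synonymous hits would then go on to the protein-level resources listed above (Protein Data Bank, PolyPhen, UniProt and so on); synonymous and upstream SNPs go to the regulatory checks.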
How can you come up with strategies to prioritise?

Statistical genetics
• If a SNV is present in all members of the family, affected and not, then it is to do with something innocuous.
Some methods are based on how common these variants are in families, i.e. shared ancestral variants and genetic linkage co-segregation
Need pedigree haplotype information
Mostly used in GWAS studies
BEAGLE, GERMLINE, PLINK IBD, MERLIN

Several Tools Out There
• For example:
  – SeattleSeq
  – dbNSFP
• built into other NGS analysis software
• New ideas continue to emerge…

The Plot Thickens…
If you Google directly to dbSNP (10 Nov 2015)
The NCBI homepage: if you go to dbSNP from here
You get this: but no worries, both access the same underlying database.

Combining gene expr. & variations
eQTL: expression quantitative trait locus
• Correlation between gene expr. and freq. of variation
• Simple linear regression (Matrix eQTL)
  $g = \alpha + \beta s + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$
Significance is assessed by p-value
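
The regression on the eQTL slide can be tried on toy data with nothing but the Python standard library. This is a minimal sketch, not Matrix eQTL itself: it fits g = alpha + beta * s by ordinary least squares, where s is taken to be the genotype coded as 0, 1 or 2 copies of the variant allele and g the expression level (both codings are assumptions). To stay dependency-free, the p-value comes from a permutation test, which is a stand-in for the parametric test a real tool would normally use. All data and function names are hypothetical.

```python
import random


def fit_line(s, g):
    """Ordinary least squares fit of g = alpha + beta * s.

    Returns (alpha, beta). Requires at least two distinct values in s.
    """
    n = len(s)
    mean_s = sum(s) / n
    mean_g = sum(g) / n
    sxx = sum((x - mean_s) ** 2 for x in s)
    sxy = sum((x - mean_s) * (y - mean_g) for x, y in zip(s, g))
    beta = sxy / sxx
    alpha = mean_g - beta * mean_s
    return alpha, beta


def permutation_p_value(s, g, n_perm=2000, seed=0):
    """Two-sided permutation p-value for beta != 0.

    Expression values are shuffled against genotypes; the p-value is
    the share of shuffles whose |beta| is at least the observed one.
    The +1 terms keep the estimate away from exactly zero.
    """
    rng = random.Random(seed)
    _, observed = fit_line(s, g)
    shuffled = list(g)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, beta = fit_line(s, shuffled)
        if abs(beta) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Made-up genotypes and expression values for one SNP-gene pair.
    genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
    expression = [2.1, 1.9, 2.0, 2.2, 2.4, 2.6, 2.5, 2.7, 3.1, 2.9, 3.0, 3.2]
    alpha, beta = fit_line(genotypes, expression)
    print("alpha:", alpha, "beta:", beta)
    print("p-value:", permutation_p_value(genotypes, expression))
```

In a genome-wide setting the same fit is repeated for every SNP-gene pair, so the p-values need correction for multiple testing before any SNP is prioritised.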