CSC598BIL675-2016

Evolution
Aristotle: classification of animals
theories on change
(change is the actuality of the potential)
Darwin: descent with modification
natural selection
There is no evolution without change
Evolving nomenclature
change in DNA code = genetic variation
change with respect to what? any consequence?
• Mutations
• Single Nucleotide Polymorphism SNPs
• Deletion/insertion polymorphism DIPs
• Short Nucleotide Polymorphism SNPs
• Short Nucleotide Variants SNVs
• Short Genetic Variants
Definitions
pol·y·mor·phism n.
1. Biology The occurrence of different forms, stages, or types in individual organisms or in
organisms of the same species, independent of sexual variations.
2. Chemistry Crystallization of a compound in at least two distinct forms. Also called
pleomorphism.
var·i·ant adj.
1. Having or exhibiting variation; differing.
2. Tending or liable to vary; variable.
3. Deviating from a standard, usually by only a slight difference.
n. Something that differs in form only slightly from something else, as a different spelling
or pronunciation of the same word.
Human Genome Project
ENCODE project
HapMap project
SNP consortium
Individual human genomes
James Watson, Craig Venter,
three Asian individuals
Evolving SNV analysis needs
• Single SNP
• Millions of SNPs
How to structure the analysis is based on the same theories…
It’s a question of scale and heuristics
• Finding SNPs in single gene sequence
• Finding SNPs in GWAS, exome sequencing, etc.
Calling SNPs in NGS
• Polymorphisms with respect to a reference genome
• Challenging because of alignment errors, variable depth of
coverage
• Accuracy is essential – diagnostics, risk assessment
• False positives and false negatives both a problem
– Given 1% sequencing error, how many high-quality reads do we need
to call a variant?
– Quality scores differ per experiment
– The tools we use should have prior knowledge of known SNPs and
their relevance to our question, i.e. causing disease or not
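The read-count question above can be explored with a binomial model. The sketch below is only an illustration under stated assumptions: errors are independent, every read has the same 1% error rate, and every error produces the same wrong base (a conservative simplification). The function names and the `alpha` threshold are hypothetical, not taken from any variant caller.

```python
from math import comb


def prob_at_least_k_errors(n_reads, k, error_rate=0.01):
    """P(X >= k) for X ~ Binomial(n_reads, error_rate).

    This is the chance that k or more reads show the same non-reference
    base purely through sequencing error (assumed independent).
    """
    return sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k, n_reads + 1)
    )


def min_supporting_reads(n_reads, error_rate=0.01, alpha=1e-6):
    """Smallest k such that seeing k error reads has probability below alpha.

    Returns None if even n_reads supporting reads cannot reach alpha.
    """
    for k in range(n_reads + 1):
        if prob_at_least_k_errors(n_reads, k, error_rate) < alpha:
            return k
    return None


# With 30x coverage, how many alternate-allele reads make error unlikely?
print(min_supporting_reads(30, error_rate=0.01, alpha=1e-6))
```

Real callers also weigh per-base quality scores and mapping quality, which this sketch ignores.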
Prioritization of SNPs
• You have millions
• How do you know which are important for your research?
First let’s look at what SNPs can do…
So you have a SNP
• Is it associated with disease? If so, why?
– Is it to do with protein function
– or transcriptional regulation
– or both, or none, or what?
• If none of the above,
– then why is it associated with disease?
– how do you begin to imagine its function?
SNP function prediction (summary)
• (in coding sequence)
Protein Function
– Ligand binding affinity
– Co-factor binding affinity
– targeting to different cellular compartment
• (in coding or non-coding sequence)
Gene Processing
– Transcriptional regulation
– Translational regulation
– Splicing
Assessment of SNP Function
• Position of SNP
– dbSNP or new SNP: first identify location
• In a coding sequence: non-synonymous
– Protein Data Bank , PolyPhen
– UniProt, PsiPred (secondary structure prediction tool)
– ProSite, InterPro
Done individually, or incorporated into software to scale up
for high throughput
Example: AGT & Hyperoxaluria
SNP mutation causes disease
CCA > CTA => Proline > Leucine (P11L)
[Figure: chemical structures of the two residues. P: Pro, L: Leu]
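The CCA > CTA example can be checked by translating both codons with the standard genetic code. This is a minimal sketch; the function name `classify_codon_change` is hypothetical, and only single-codon substitutions are handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (ref_aa, alt_aa, effect) for a substitution within one codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "non-synonymous"
    return ref_aa, alt_aa, effect


print(classify_codon_change("CCA", "CTA"))  # ('P', 'L', 'non-synonymous')
```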
Two more in AGT
• Gly82Glu: blocks binding to cofactor
• Gly41Arg: disrupts intermonomer interactions
[Figure: chemical structures of the residues. G: Gly, E: Glu, R: Arg]
Assessment of SNP Function - I
• Position of SNP
• In CDS: non-synonymous
– Protein Data Bank , PolyPhen
– UniProt, PsiPred
– ProSite, InterPro
• Upstream of CDS or in CDS and synonymous: is it in a regulatory element?
– SignalP, ProSite, rate of processing?
– TRANSFAC
– DBTSS
– NXSensor
[Figure: gene structure, 5’ to 3’. Labels: promoter with transcription factor binding sites (TFBSs), transcriptional start site (TSS), 5’UTR, translation initiation site (initiation codon ATG), Exon 1, Exon 2]
SNP in a regulatory element
TFBS
ACAGTCGTAAGGCTGATTGGCTGGATAGCAGTACG
Single nucleotide polymorphism
ACAGTCGTAAGGCTAATTGGCTGGATAGCAGTACG
May disrupt TF binding and therefore functionality
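The two sequences above differ at a single position, which a position-by-position comparison makes explicit. This is a minimal sketch that assumes the sequences are already aligned and of equal length (no indels); `find_snvs` is a hypothetical helper name.

```python
def find_snvs(reference, sample):
    """List (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length (indels not handled)")
    return [
        (i + 1, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


reference = "ACAGTCGTAAGGCTGATTGGCTGGATAGCAGTACG"
sample = "ACAGTCGTAAGGCTAATTGGCTGGATAGCAGTACG"
print(find_snvs(reference, sample))  # [(15, 'G', 'A')]
```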
Example: CYP2E1
Track from DBTSS
Nucleosomes
Assessment of SNP Function - II
• In non-coding sequence
– First, assess conservation
– TRANSFAC
– miRNA registry
– Repeatmasker
– Alternative splicing
– HapMap
Is it in a regulatory element?
Prioritization of SNPs
• You have millions
• How do you know which are important for your research?
How do you (can you?) implement this in a pipeline so you can do
thousands at once?
How can you come up with strategies to prioritise?
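One possible starting point for such a pipeline is to route each SNP to a checklist according to where it falls, following the assessment slides. The sketch below only looks up the resource names listed on those slides; it does not call any of the tools, and the category labels, the `Variant` class and the example identifiers are hypothetical.

```python
from dataclasses import dataclass

# Resources per location class, as listed on the assessment slides.
CHECKS_BY_LOCATION = {
    "coding_nonsynonymous": [
        "Protein Data Bank", "PolyPhen", "UniProt", "PsiPred", "ProSite", "InterPro",
    ],
    "upstream_or_synonymous": [
        "SignalP", "ProSite", "TRANSFAC", "DBTSS", "NXSensor",
    ],
    "noncoding": [
        "conservation", "TRANSFAC", "miRNA registry", "Repeatmasker",
        "alternative splicing", "HapMap",
    ],
}


@dataclass
class Variant:
    identifier: str   # hypothetical label, e.g. a dbSNP-style ID
    location: str     # one of the keys in CHECKS_BY_LOCATION


def plan_checks(variants):
    """Map each variant identifier to the list of resources to consult."""
    plan = {}
    for variant in variants:
        if variant.location not in CHECKS_BY_LOCATION:
            raise ValueError(f"unknown location class: {variant.location}")
        plan[variant.identifier] = list(CHECKS_BY_LOCATION[variant.location])
    return plan


# Made-up identifiers, for illustration only.
batch = [Variant("snp_A", "coding_nonsynonymous"), Variant("snp_B", "noncoding")]
for identifier, checks in plan_checks(batch).items():
    print(identifier, "->", ", ".join(checks))
```

Because the routing is a plain lookup, the same function handles one SNP or millions; the expensive part in practice is running the tools themselves.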
Statistical genetics
• If an SNV is present in all members of the family, affected and
not, then it is probably innocuous.
Some methods are based on how common these variants
are in families,
i.e. shared ancestral variants and genetic linkage co-segregation.
Need pedigree haplotype information
Mostly used in GWAS
BEAGLE, GERMLINE, PLINK IBD, MERLIN
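The family-based rule can be sketched as a set comparison. This is a deliberately simplified illustration, assuming a fully penetrant dominant variant and known carrier status for every family member; it is not how BEAGLE, GERMLINE, PLINK or MERLIN work internally, and all names and data below are hypothetical.

```python
def classify_variant(carriers, affected, unaffected):
    """Classify a variant by how it segregates in one family.

    carriers, affected and unaffected are sets of family-member labels.
    """
    everyone = affected | unaffected
    if everyone <= carriers:
        # Present in all members, affected or not: probably innocuous.
        return "shared by all"
    if affected <= carriers and not (carriers & unaffected):
        # Carried by every affected member and by no unaffected member.
        return "co-segregates with disease"
    return "inconsistent with disease"


# Hypothetical family: two affected and two unaffected members.
affected = {"mother", "son"}
unaffected = {"father", "daughter"}

print(classify_variant({"mother", "son", "father", "daughter"}, affected, unaffected))
print(classify_variant({"mother", "son"}, affected, unaffected))
print(classify_variant({"mother", "father"}, affected, unaffected))
```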
Several Tools Out There
• For example:
– SeattleSeq
– dbNSFP
• built into other NGS analysis software
• New ideas continue to emerge…
The Plot Thickens…
If you Google your way directly to dbSNP (10 Nov 2015), you get one page. [screenshot]
If you go to dbSNP from the NCBI homepage, you get a different page. [screenshot]
But no worries, both access the same underlying database.
Combining gene expr. & variations
eQTL: expression quantitative trait locus
• Correlation between gene expr. and freq. of variation
• Simple linear regression (Matrix eQTL)
$g = \alpha + \beta s + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$
Significance is assessed by p-value
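The regression model above can be illustrated with a small least-squares fit of expression g on genotype s. This is a sketch under stated assumptions: genotypes are coded 0/1/2 (copies of the minor allele), the data are made up, and a permutation p-value stands in for an analytic test so that the example needs only the standard library. It is not the Matrix eQTL implementation.

```python
import random


def fit_line(s, g):
    """Ordinary least squares for g = intercept + slope * s."""
    n = len(s)
    mean_s = sum(s) / n
    mean_g = sum(g) / n
    sxx = sum((x - mean_s) ** 2 for x in s)
    sxy = sum((x - mean_s) * (y - mean_g) for x, y in zip(s, g))
    slope = sxy / sxx
    return mean_g - slope * mean_s, slope


def permutation_p_value(s, g, n_perm=1000, seed=0):
    """Two-sided p-value for the slope, obtained by shuffling expression values."""
    rng = random.Random(seed)
    _, observed = fit_line(s, g)
    shuffled = list(g)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, slope = fit_line(s, shuffled)
        if abs(slope) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up data: 12 samples, genotype coded as minor-allele count.
genotype = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
expression = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2, 3.0, 3.1, 2.9, 3.2]

intercept, slope = fit_line(genotype, expression)
p_value = permutation_p_value(genotype, expression)
print(f"intercept={intercept:.2f} slope={slope:.2f} p={p_value:.4f}")
```

In a real eQTL scan this test is repeated for every SNP-gene pair, so the resulting p-values need correction for multiple testing.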