Single nucleotide polymorphisms and applications
Usman Roshan
BNFO 601

SNPs
• DNA sequence variations that occur when a single nucleotide is altered.
• Must be present in at least 1% of the population to be a SNP.
• Occur every 100 to 300 bases along the 3 billion-base human genome.
• Many have no effect on cell function but some could affect disease risk and drug response.

Toy example

SNPs on the chromosome
[Figure: SNPs on the chromosome; labels: SNP, Chromosome, Gene]

Bi-allelic SNPs
• Most SNPs have one of two nucleotides at a given position
• For example:
  – A/G denotes the varying nucleotide as either A or G. We call each of these an allele
  – Most SNPs have two alleles (bi-allelic)

SNP genotype
• We inherit two copies of each chromosome (one from each parent)
• For a given SNP the genotype defines the type of alleles we carry
• Example: for the SNP A/G one's genotype may be
  – AA if both copies of the chromosome have A
  – GG if both copies of the chromosome have G
  – AG or GA if one copy has A and the other has G
  – The first two cases are called homozygous and the latter two are heterozygous

SNP genotyping

Real SNPs
• SNP consortium: snp.cshl.org
• SNPedia: www.snpedia.com

Application of SNPs: association with disease
• Experimental design to detect cancer associated SNPs:
  – Pick random humans with and without cancer (say breast cancer)
  – Perform SNP genotyping
  – Look for associated SNPs
  – Also called genome-wide association study

Case-control example
• Study of 100 people:
  – Case: 50 subjects with cancer
  – Control: 50 subjects without cancer
• Count number of alleles and form a contingency table

            #Allele1   #Allele2
  Case         10         90
  Control       2         98

Effect of population structure on genome-wide association studies
• Suppose our sample is drawn from a population of two groups, I and II
• Assume that group I has a majority of allele type I and group II has mostly the second allele.
• Further assume that most case subjects belong to group I and most control to group II
• This leads to the false association that the major allele is associated with the disease

Effect of population structure on genome-wide association studies
• We can correct this effect if case and control are equally sampled from all sub-populations
• To do this we need to know the population structure

Population structure prediction
• Treated as an unsupervised learning problem (i.e. clustering)

Clustering
• Suppose we want to cluster n vectors in $\mathbb{R}^d$ into two groups. Define $C_1$ and $C_2$ as the two groups.
• Our objective is to find $C_1$ and $C_2$ that minimize
\[
\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2
\]
where $m_i$ is the mean of cluster $C_i$

K-means algorithm for two clusters
Input: $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$
Algorithm:
1. Initialize: assign $x_i$ to $C_1$ or $C_2$ with equal probability and compute means:
\[
m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i, \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i
\]
2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign to $C_2$
3. Recompute means $m_1$ and $m_2$
4. Compute objective
\[
\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2
\]
5. Compute objective of new clustering. If the difference from the previous objective is smaller than a chosen threshold then stop, otherwise go to step 2.

K-means
• Is it guaranteed to find the clustering which optimizes the objective?
• It is guaranteed to find a local optimum
• We can prove that the objective does not increase with subsequent iterations

Proof sketch of convergence of k-means
Let $C_i^*$ and $m_i^*$ denote the clusters and means after one iteration. Then
\[
\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2
\;\ge\;
\sum_{i=1}^{2} \sum_{x_j \in C_i^*} \| x_j - m_i \|^2
\;\ge\;
\sum_{i=1}^{2} \sum_{x_j \in C_i^*} \| x_j - m_i^* \|^2
\]
Justification of first inequality: by assigning $x_j$ to the closest mean the objective decreases or stays the same
Justification of second inequality: for a given cluster its mean minimizes squared error loss
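
The two-cluster k-means steps and the convergence claim can be sketched in code. This is a minimal illustration, not the course's implementation: the names `two_means`, `sq_dist`, `tol`, `seed` and `max_iter` are assumptions introduced here, the stopping threshold is an arbitrary choice, and keeping the old mean when a cluster becomes empty is a convenience the slides do not discuss. It assumes at least two input points.

```python
import random


def sq_dist(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(d)]


def objective(points, labels, means):
    """Sum over clusters of squared distances to the cluster mean."""
    return sum(sq_dist(x, means[c]) for x, c in zip(points, labels))


def two_means(points, tol=1e-9, seed=0, max_iter=100):
    """Two-cluster k-means; labels 0 and 1 stand for C1 and C2.

    tol, seed and max_iter are hypothetical parameters for this sketch.
    Requires at least two points.
    """
    rng = random.Random(seed)
    # Step 1: assign each point to a cluster with equal probability
    # (redraw if one cluster is empty) and compute the means.
    labels = [rng.randrange(2) for _ in points]
    while len(set(labels)) < 2:
        labels = [rng.randrange(2) for _ in points]
    means = [mean([x for x, c in zip(points, labels) if c == k])
             for k in range(2)]
    history = [objective(points, labels, means)]
    for _ in range(max_iter):
        # Step 2: assign x to C1 if it is strictly closer to m1, else C2.
        labels = [0 if sq_dist(x, means[0]) < sq_dist(x, means[1]) else 1
                  for x in points]
        # Step 3: recompute means (an empty cluster keeps its old mean).
        for k in range(2):
            members = [x for x, c in zip(points, labels) if c == k]
            if members:
                means[k] = mean(members)
        # Steps 4-5: compute the new objective, stop on a small change.
        history.append(objective(points, labels, means))
        if history[-2] - history[-1] < tol:
            break
    return labels, means, history


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, means, history = two_means(data)
    print(labels, history)
```

The returned `history` holds the objective after each iteration; by the proof sketch it should never increase, although the final clustering may still be only a local optimum that depends on the random initialization.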