Download L17 preview - Computer Science and Engineering

CSE182-L17 Clustering Population Genetics: Basics Unsupervised Clustering • Given a set of points (in ndimensions), and k, compute the k “best clusters”. • In k-means, clustering is done by choosing k centers (means). • Each point is assigned to the closest center. • The notion of “best” is defined by distances to the center. • Question: How can we compute the k best centers? Cluster s Distance • Given a data point v and a set of points X, define the distance from v to X d(v, X) as the (Euclidean) distance from v to the closest point from X. • Given a set of n data points V={v1…vn} and a set of k points X, define the Squared Error Distortion d(V,X) = ∑d(vi, X)2 / n 1<i<n v K-Means Clustering Problem: Formulation • Input: A set, V, consisting of n points and a parameter k • Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X This problem is NP-complete in general. 1-Means Clustering Problem: an Easy Case • • Input: A set, V, consisting of n points. Output: A single point X that minimizes d(V,X) over all possible choices of X. This problem is easy. However, it becomes very difficult for more than one center. An efficient heuristic method for k-Means clustering is the Lloyd algorithm K-means: Lloyd’s algorithm • Choose k centers at random: – X’ = {x1,x2,x3,…xk} • Repeat – X=X’ – Assign each v  V to the closest cluster j • d(v,xj) = d(v,X)  Cj= Cj  {v} – Recompute X’ • x’j  (∑ v  Cj • until (X’ = X) v) /|Cj| expression in condition 2 5 4 x1 3 x2 2 1 x3 0 0 1 2 3 4 expression in condition 1 5 expression in condition 2 5 4 x1 3 x2 2 1 x3 0 0 1 2 3 4 expression in condition 1 5 expression in condition 2 5 4 x1 3 2 x3 x2 1 0 0 1 2 3 4 expression in condition 1 5 expression in condition 2 5 4 x1 3 2 x2 x3 1 0 0 1 2 3 4 expression in condition 1 5 Conservative K-Means Algorithm • Lloyd algorithm is fast but in each iteration it moves many data points, not necessarily causing better convergence. • A more conservative method would be to move one point at a time only if it improves the overall clustering cost • The smaller the clustering cost of a partition of data points is the better that clustering is • Different methods can be used to measure this clustering cost (for example in the last algorithm the squared error distortion was used) Microarray summary • Microarrays (like MS) are a technology for probing the dynamic state of the cell. • We answered questions like the following: – Which genes are coordinately regulated (They have similar expression patterns in different conditions)? – How can we reduce the dimensionality of the system? – Using gene expression values from a sample, can you predict if the sample is normal (state A) or diseased (state B) • The techniques employed for classification/clustering etc. are general and can be employed in a number of contexts. Microarray non-summary • We did not cover: – How are the gene expression values measured (the technology)? (CSE183) – How do you control variability across different experiments (normalization)? (CSE183) – What controls the expression of a gene (gene regulation), or a set of genes? (CSE 181) Population Genetics • The sequence of an individual does not say anything about the diversity of a population. • Small individual genetic differences can have a profound impact on “phenotypes” – Response to drugs – Susceptibility to diseases • Soon, we will have sequences of many individuals from the same species. Studying the differences will be a major challenge. Population Structure Africa Eurasia East Asia Oceania • 377 locations (loci) were sampled in 1000 people from 52 populations. • 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003) America Population Genetics • What is it about our genetic makeup that makes us measurably different? • These genetic differences are correlated with phenotypic differences • With cost reduction in sequencing and genotyping technologies, we will know the sequence for entire populations of individuals. • Here, we will study the basics of this polymorphism data, and tools that are being developed to analyze it. What causes variation in a population? • Mutations (may lead to SNPs) • Recombinations • Other genetic events (Ex: microsatellite repeats) • Deletions, inversions Single Nucleotide Polymorphisms Infinite Sites Assumption: Each site mutates at most once 00000101011 10001101001 01000101010 01000000011 00011110000 00101100110 Short Tandem Repeats GCTAGATCATCATCATCATTGCTAG GCTAGATCATCATCATTGCTAGTTA GCTAGATCATCATCATCATCATTGC GCTAGATCATCATCATTGCTAGTTA GCTAGATCATCATCATTGCTAGTTA GCTAGATCATCATCATCATCATTGC 4 3 5 3 3 5 STR can be used as a DNA fingerprint • Consider a collection of regions with variable length repeats. • Variable length repeats will lead to variable length DNA • Vector of lengths is a fingerprint 4 3 5 3 3 5 2 3 1 2 1 3 positions Recombination 00000000 11111111 00011111 What if there were no recombinations? • Life would be simpler • Each sequence would have a single parent • The relationship is expressed as a tree. The Infinite Sites Assumption 00000000 3 00100000 5 8 00100001 • • • 00101000 The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. Some phenotypes could be linked to the polymorphisms Some of the linkage is “destroyed” by recombination Infinite sites assumption and Perfect Phylogeny • Each site is mutated at most once in the history. • All descendants must carry the mutated value, and all others must carry the ancestral value i 1 in position i 0 in position i Perfect Phylogeny • Assume an evolutionary model in which no recombination takes place, only mutation. • The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny. • How can one reconstruct such a tree? The 4-gamete condition • A column i partitions the set of species into two sets i0, and i1 • A column is homogeneous w.r.t a set of species, if it has the same value for all species. Otherwise, it is heterogenous. • EX: i is heterogenous w.r.t {A,D,E} A i0 B C D i1 E F i 0 0 0 1 1 1 4 Gamete Condition • 4 Gamete Condition – There exists a perfect phylogeny if and only if for all pair of columns (i,j), either j is not heterogenous w.r.t i0, or i1. – Equivalent to – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the following 4 rows do not exist (0,0), (0,1), (1,0), (1,1) 4-gamete condition: proof • Depending on which edge the mutation j occurs, either i0, or i1 should be homogenous. • (only if) Every perfect phylogeny satisfies the 4-gamete condition • (if) If the 4-gamete condition is satisfied, does a prefect phylogeny exist? i i0 i1 An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later. • In any tree, each node (except the root) has a single parent. – It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop if all columns have been considered. Inclusion Property • For any pair of columns i,j – i < j if and only if i1  j1 • Note that if i<j then the edge containing i is an ancestor of the edge containing i i j Example 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 r A B Initially, there is a single clade r, and each node has r as its parent C D E Sort columns • Sort columns according to the inclusion property (note that the columns are already sorted here). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order A B C D E 1 1 0 1 0 1 2 1 0 1 0 0 3 0 1 0 1 0 4 0 0 1 0 0 5 0 0 0 1 0 Add first column A B C D E • In adding column i – Check each edge and decide which side you belong. – Finally add a node if you can resolve a clade 1 2 3 4 5 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 0 0 0 0 r u A C E B D Adding other columns A B C D E • Add other columns on edges using the ordering property 1 1 0 1 0 1 2 1 0 1 0 0 3 0 1 0 1 0 4 0 0 1 0 0 r 1 E 3 2 B 5 4 D C A 5 0 0 0 1 0 Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case Handling recombination • A tree is not sufficient as a sequence may have 2 parents • Recombination leads to loss of correlation between columns Linkage (Dis)-equilibrium (LD) • Consider sites A &B • Case 1: No recombination – Pr[A,B=0,1] = 0.25 • Linkage disequilibrium • Case 2:Extensive recombination – Pr[A,B=(0,1)=0.125 • Linkage equilibrium A 0 0 0 0 1 1 1 1 B 1 1 0 0 0 0 0 0

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download L17 preview - Computer Science and Engineering