COT6930 Course Project

Outline
• Gene Selection
• Sequence Alignment

Why Gene Selection
• Identify marker genes that characterize different tumor statuses.
• Many genes are redundant and introduce noise that lowers classification performance.
• Can eventually lead to a diagnosis chip ("breast cancer chip", "liver cancer chip").

Gene Selection
• Methods fall into three categories:
  – Filter methods
  – Wrapper methods
  – Embedded methods
• Filter methods are the simplest and the most frequently used in the literature.
• Wrapper methods are likely the most accurate ones.

Filter Method
• Features (genes) are scored according to the evidence of their predictive power and then ranked.
• The top s genes with the highest scores are selected and used by the classifier (a scoring sketch appears at the end of this section).
  – Scores: t-statistics, F-statistics, signal-to-noise ratio, …
  – The # of features selected, s, is then determined by cross-validation.
• Advantage: fast and easy to interpret.
[Figure: good versus bad features]

Filter Method: Problem
• Genes are considered independently.
  – Redundant genes may be included.
  – Genes that have strong discriminant power jointly but are individually weak will be ignored.
• Good single features do not necessarily form a good feature set.
• The filtering procedure is independent of the classification method.
  – The selected features can be used with any type of classifier.

Wrapper Method
• Iterative search: many feature subsets are scored based on classification performance, and the best one is used.
  – Goal: select a good subset of features.
• Subset selection: forward selection, backward selection, and their combinations (a forward-selection sketch appears at the end of this section).
  – Exhaustive search is infeasible.
  – Greedy algorithms are used instead.

Wrapper Method: Problem
• Computationally expensive.
  – For each feature subset considered, the classifier is built and evaluated.
• Exhaustive search is infeasible.
  – Greedy search only.
• Easy to overfit.

Embedded Method
• Attempts to jointly (simultaneously) train both a classifier and a feature subset.
• Often optimizes an objective function that rewards classification accuracy while penalizing the use of more features.
• Intuitively appealing.

Relief-F
• Relief-F is a filter approach to feature selection, an extension of the original Relief algorithm.
• The original Relief can only handle binary classification problems; Relief-F extends it to multi-class problems.
• Attribute distance (diff) functions:
  – Categorical attributes: diff = 0 if the two values are equal, 1 otherwise.
  – Numerical attributes: diff = |value₁ − value₂| / (max − min).
(A sketch of the weight update appears at the end of this section.)

Relief-F: Problem
• Time complexity:
  – m × (m·a + c·m·a + a) = O(c·m²·a), where m is the number of sampled instances, c the number of classes, and a the number of attributes.
  – Assume m = 100, c = 3, a = 10,000.
  – Time complexity ≈ 300 × 10⁶ operations.
• Considers only one attribute at a time; cannot select a subset of "good" genes.

Solution: Parallel Relief-F
• Version 1 (a parallel sketch appears at the end of this section):
  – Cluster nodes run Relief-F in parallel, and the updated weight values are collected at the master.
  – Theoretical time complexity: O(c·m²·a / p), where p is the # of cluster nodes.

Parallel Relief-F
• Version 2:
  – Cluster nodes run Relief-F in parallel, and each node directly updates the global weight values.
  – Each node also consults the current weight values when selecting nearest-neighbor instances.
  – Theoretical time complexity: O(c·m²·a / p), where p is the # of cluster nodes.

Parallel Relief-F
• Version 3:
  – Consider selecting a subset of important features.
  – Compare the difference between including and excluding a specific feature to understand the importance of a gene with respect to an existing subset of features.
  – Discussion in private!
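To make the filter method concrete, here is a minimal sketch of t-statistic scoring followed by top-s selection. The names (t_score, select_top_s, expr, labels) and the samples-by-genes data layout are illustrative assumptions, not part of the project code.

```python
import numpy as np

def t_score(expr, labels):
    """Per-gene two-sample t-statistic (absolute value).

    expr:   (n_samples, n_genes) expression matrix -- assumed layout
    labels: (n_samples,) binary array, e.g. 0 = normal, 1 = tumor
    """
    a, b = expr[labels == 0], expr[labels == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) +
                  b.var(axis=0, ddof=1) / len(b))
    return np.abs(num / den)

def select_top_s(expr, labels, s):
    """Rank genes by score and keep the top s (s itself is tuned by cross-validation)."""
    order = np.argsort(t_score(expr, labels))[::-1]
    return order[:s]   # indices of the s highest-scoring genes
```

For example, `expr[:, select_top_s(expr, labels, s=50)]` yields the reduced matrix. Note that every gene is scored on its own, which is exactly the weakness listed under "Filter Method: Problem".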
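The wrapper method's greedy search can be sketched as forward selection. The slides name no classifier or library; the sketch below assumes scikit-learn's k-NN and cross_val_score as stand-ins, and forward_select / max_features are hypothetical names.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_select(expr, labels, max_features=20):
    """Greedy forward selection: grow the subset one gene at a time,
    keeping the gene whose addition most improves cross-validated accuracy."""
    chosen, remaining = [], list(range(expr.shape[1]))
    best_acc = 0.0
    while remaining and len(chosen) < max_features:
        accs = []
        for g in remaining:
            cols = chosen + [g]
            acc = cross_val_score(KNeighborsClassifier(),
                                  expr[:, cols], labels, cv=5).mean()
            accs.append((acc, g))
        acc, g = max(accs)
        if acc <= best_acc:   # no remaining gene improves performance -> stop
            break
        best_acc = acc
        chosen.append(g)
        remaining.remove(g)
    return chosen
```

Each pass rebuilds and evaluates one classifier per candidate gene, which is the computational cost the "Wrapper Method: Problem" slide warns about.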
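Here is a minimal sketch of the Relief-F weight update for numerical attributes (multi-class, with nearest misses weighted by class priors), using the standard range-normalized diff from the slide above. The function name relieff and the parameters m and k are illustrative; the per-iteration neighbor search over all instances is what produces the O(c·m²·a) cost quoted above.

```python
import numpy as np

def relieff(X, y, m=100, k=5, rng=None):
    """Simplified Relief-F for numerical attributes.

    X: (n, a) data matrix; y: (n,) class labels.
    Returns one weight per attribute; higher means more relevant.
    """
    rng = np.random.default_rng(rng)
    n, a = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                         # avoid divide-by-zero
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    W = np.zeros(a)
    for _ in range(m):
        i = rng.integers(n)
        R = X[i]
        diff = np.abs(X - R) / span               # range-normalized per-attribute diff
        dist = diff.sum(axis=1)
        dist[i] = np.inf                          # exclude R itself from neighbors
        for c in classes:
            idx = np.where(y == c)[0]
            idx = idx[np.argsort(dist[idx])[:k]]  # k nearest instances in class c
            contrib = diff[idx].mean(axis=0) / m
            if c == y[i]:
                W -= contrib                      # nearest hits pull weights down
            else:                                 # nearest misses, prior-weighted
                W += prior[c] / (1 - prior[y[i]]) * contrib
    return W
```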
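Version 1 of the parallel scheme can be sketched with Python's multiprocessing as a stand-in for real cluster nodes: each worker runs an independent Relief-F pass over m/p sampled instances, and the master collects and averages the weight vectors. It reuses the relieff function from the previous sketch; parallel_relieff and _run_chunk are hypothetical names.

```python
from multiprocessing import Pool

import numpy as np

# Assumes `relieff` from the previous sketch is defined at module top level,
# since multiprocessing pickles worker functions by name.

def _run_chunk(args):
    X, y, m_local, k, seed = args
    return relieff(X, y, m=m_local, k=k, rng=seed)

def parallel_relieff(X, y, m=100, k=5, p=4):
    """Version 1: each of the p workers runs Relief-F on m/p sampled
    instances; the master averages the collected weight vectors,
    dropping wall-clock cost toward O(c * m^2 * a / p)."""
    jobs = [(X, y, m // p, k, seed) for seed in range(p)]
    with Pool(p) as pool:
        weights = pool.map(_run_chunk, jobs)
    return np.mean(weights, axis=0)
```

On spawn-based platforms the call must sit under an `if __name__ == "__main__":` guard; a real deployment would replace Pool with the cluster's own job dispatcher.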
Outline
• Gene Selection
• Sequence Alignment
  – Given a dataset D with N = 1000 sequences (e.g., 1000 each).
  – Given an input sequence x, do pair-wise global sequence alignment between x and every sequence in D (see the sketch below).
    • Dispatch the jobs to cluster nodes.
    • Aggregate the results.
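A sketch of the alignment step: a score-only Needleman-Wunsch global aligner, plus a dispatcher that farms one job per database sequence to p local workers and aggregates (index, score) pairs. The scoring parameters (match = 1, mismatch = -1, gap = -2) and the names nw_score / align_all are illustrative; a cluster deployment would swap Pool for its own scheduler.

```python
from multiprocessing import Pool

def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, computed row by row
    (score only, O(len(x) * len(y)) time, O(len(y)) memory)."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,    # match / substitution
                            prev[j] + gap,        # gap in y
                            curr[j - 1] + gap))   # gap in x
        prev = curr
    return prev[-1]

def align_all(x, D, p=4):
    """Dispatch one alignment job per database sequence across p workers,
    then aggregate (sequence index, score) pairs, best score first."""
    with Pool(p) as pool:
        scores = pool.starmap(nw_score, [(x, s) for s in D])
    return sorted(enumerate(scores), key=lambda t: t[1], reverse=True)
```

For example, `align_all("ACGTAC", D)[0]` returns the index and score of the best-matching sequence in D.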