Algorithms for Big Data Analysis (大数据分析中的算法)

Zaiwen Wen (文再文)
Course codes: 00136720 (undergraduate), 00100863 (graduate)
Peking University, Beijing International Center for Mathematical Research
http://bicmr.pku.edu.cn/~wenzw/bigdata2016.html
[email protected]

Course Information

- Algorithms for Big Data Analysis, with an emphasis on numerical linear algebra and optimization algorithms
- Course codes: 00136720 (undergraduate), 00100863 (graduate)
- Instructor: Zaiwen Wen, [email protected], WeChat: wendoublewen
- Teaching assistant: Jiang Hu (户将), [email protected]
- Location: Classroom Building 3 (三教), Room 301
- Time: weeks 1-16, Tuesdays, periods 10-12, 18:40-21:30
- Course homepage: http://bicmr.pku.edu.cn/~wenzw/bigdata2016.html

References

Class notes, and reference books or papers:
- "Introduction to Algorithms", Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, The MIT Press
- "Mining of Massive Datasets", Jure Leskovec, Anand Rajaraman, Jeff Ullman, Cambridge University Press
- "Convex Optimization", Stephen Boyd and Lieven Vandenberghe
- "Numerical Optimization", Jorge Nocedal and Stephen Wright, Springer
- "Optimization Theory and Methods", Wenyu Sun and Ya-Xiang Yuan

Websites Related to Big Data Analysis

- Mining Massive Data Sets, Stanford University
- Jure Leskovec @ Stanford - Stanford Computer Science
- Mining Massive Data Sets: Hadoop Labs, Stanford University
- Big Data Analytics, Columbia University
- Theoretical Foundations of Big Data Analysis, The Simons Institute for the Theory of Computing, UC Berkeley
- Introduction to Data Science, University of Washington
- Core Methods in Data Science, University of Washington
- Course homepage: http://bicmr.pku.edu.cn/~wenzw/bigdata2016.html

Course Plan

Emphasis on models and algorithms from numerical linear algebra and optimization:
- Linear programming, semidefinite programming
- Basic theory and algorithms of compressive sensing and sparse optimization
- Basic theory and algorithms of low-rank matrix recovery
- Graph and network flow problems: shortest path, maximum flow, etc.
- Submodular optimization and data mining
- Machine learning and data mining: cluster analysis, dimensionality reduction for high-dimensional data, link analysis, recommender systems
- Large-scale machine learning: support vector machines
- Modern medical imaging and high-dimensional image analysis: phase retrieval and cryo-electron microscopy
- Stochastic optimization algorithms for big data analysis; randomized algorithms for eigenvalue and singular value decomposition
- Parallel, distributed, and decentralized computing for big data analysis

Grading

- Teaching format: lectures, 80%; student group project presentations, 20%
- Grading: 4-5 major assignments, including exercises and programming, 35%; two course projects, 65%
- Requirements: assignments and course projects must be submitted on time; late submissions and plagiarized work receive no credit

Orders of Magnitude

http://zh.wikipedia.org/wiki/大数

Linear, Semidefinite, and Conic Programming

Primal:  min c^T x   s.t.  Ax = b,  x ∈ K
Dual:    max b^T y   s.t.  A^T y + s = c,  s ∈ K*

"Large" problems:
- LP: K is the nonnegative orthant; vector x ∈ R^n, millions of variables
- SOCP: K is the circular or Lorentz cone; vector x ∈ R^n, millions of variables
- SDP: K is the cone of positive semidefinite matrices; matrix X ∈ S^{n×n}, n up to 5000

Compressive Sensing

Find the sparsest solution. Given n = 256, m = 128:
A = randn(m,n); u = sprandn(n, 1, 0.1); b = A*u;

[Figure: recovered signals under the three formulations]
(a) ℓ0-minimization:  min_x ||x||_0  s.t.  Ax = b
(b) ℓ2-minimization:  min_x ||x||_2  s.t.  Ax = b
(c) ℓ1-minimization:  min_x ||x||_1  s.t.  Ax = b

Wavelets and Images (Thanks: Richard Baraniuk)

Wavelet Approximation (Thanks: Richard Baraniuk)

Compressive Sensing

Given (A, b, Ψ), find the sparsest point:
x* = arg min { ||Ψx||_0 : Ax = b }
From combinatorial to convex optimization:
x̄ = arg min { ||Ψx||_1 : Ax = b }
- The 1-norm is sparsity promoting
- Basis pursuit (Donoho et al., 1998)
- Many variants: ||Ax − b||_2 ≤ σ for noisy b
- Theoretical question: when does the ℓ0 problem have the same solution as the ℓ1 problem?

Restricted Isometry Property (RIP)

Definition (Candès and Tao [2005]). A matrix A obeys the restricted isometry property (RIP) with constant δ_s if
(1 − δ_s) ||c||_2^2 ≤ ||Ac||_2^2 ≤ (1 + δ_s) ||c||_2^2
for all s-sparse vectors c.

Theorem (Candès and Tao [2006]). If x is k-sparse and A satisfies δ_{2k} + δ_{3k} < 1, then x is the unique ℓ1 minimizer.

RIP essentially requires that every set of columns with cardinality at most s behaves like an orthonormal system.
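The ℓ1 problem above is a linear program, so the slide's experiment can be reproduced with a generic LP solver. Below is a minimal sketch, assuming Python with NumPy/SciPy (the slide's own snippet is MATLAB); the random seed and the variable split z = [x; t] are illustrative choices, not part of the course materials.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 256, 128
A = rng.standard_normal((m, n))            # random Gaussian measurement matrix
u = np.zeros(n)                            # ground-truth sparse signal, ~10% nonzeros
support = rng.choice(n, size=n // 10, replace=False)
u[support] = rng.standard_normal(support.size)
b = A @ u                                  # noiseless measurements

# l2: minimum-norm solution x = A^T (A A^T)^{-1} b -- dense, not sparse
x_l2 = A.T @ np.linalg.solve(A @ A.T, b)

# l1 (basis pursuit): min ||x||_1 s.t. Ax = b, written as an LP in z = [x; t]
# with -t <= x <= t and objective sum(t)
c = np.concatenate([np.zeros(n), np.ones(n)])
A_ub = np.block([[np.eye(n), -np.eye(n)],    #  x - t <= 0
                 [-np.eye(n), -np.eye(n)]])  # -x - t <= 0
A_eq = np.hstack([A, np.zeros((m, n))])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n), A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_l1 = res.x[:n]

print("l2 recovery error:", np.linalg.norm(x_l2 - u))  # large
print("l1 recovery error:", np.linalg.norm(x_l1 - u))  # typically near zero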
MRI: Magnetic Resonance Imaging

[Figure: (a) MRI scan; (b) Fourier coefficients; (c) image]

Is it possible to cut the scan time in half?

MRI (Thanks: Wotao Yin)

MR images often have sparse representations under some wavelet transform Φ. Solve
min_u ||Φu||_1 + (μ/2) ||Ru − b||_2^2
where R is a partial discrete Fourier transform. The higher the SNR (signal-to-noise ratio), the better the image quality.

[Figure: (a) full sampling; (b) 39% sampling, SNR = 32.2]

MRI: Magnetic Resonance Imaging

[Figure: (a) full sampling; (b) 39% sampling, SNR = 32.2; (c) 22% sampling, SNR = 21.4; (d) 14% sampling, SNR = 15.8]

The Netflix Prize

- Training data: 100 million ratings, 480,000 users, 17,770 movies; 6 years of data (2000-2005)
- Test data: last few ratings of each user (2.8 million)
- Evaluation criterion: root mean squared error (RMSE); Netflix's Cinematch RMSE: 0.9514
- Competition: 2700+ teams; $1 million prize for a 10% improvement over Cinematch

Netflix Problem: 1 Million Dollar Award

- Given m movies x ∈ X and n customers y ∈ Y, predict the "rating" W(x, y) of customer y for movie x
- Training data: known ratings of some customers for some movies
- Goal: complete the matrix
- Other applications: collaborative filtering, system identification, etc.

Matrix Rank Minimization

Given X ∈ R^{m×n}, A : R^{m×n} → R^p, b ∈ R^p, we consider:

matrix completion:
min rank(X)  s.t.  X_ij = M_ij, (i, j) ∈ Ω

nuclear norm minimization:
min ||X||_*  s.t.  A(X) = b
where ||X||_* = Σ_i σ_i and σ_i is the i-th singular value of X.

SDP reformulation:
min Tr(W_1) + Tr(W_2)
s.t. [ W_1  X ; X^T  W_2 ] ⪰ 0,  A(X) = b

Video Separation

Partition a video into moving and static parts.

Sparse and Low-Rank Matrix Separation

Given a matrix M, we want to find a low-rank matrix W and a sparse matrix E so that W + E = M. Convex approximation:
min_{W,E} ||W||_* + μ ||E||_1  s.t.  W + E = M
(robust PCA)

Graph and Network Flow Problems

2014 ACM SIGMOD Programming Contest:
- Shortest Distance over Frequent Communication Paths: define the edges of the social network as pairs of people who know each other and have sent each other at least x replies. Given two people p1 and p2 in the network and an integer x, find the path between p1 and p2 with the fewest nodes.
- Interests with Large Communities
- Socialization Suggestion
- Most Central People (all-pairs shortest paths): define a network on the forum members with tag t who know each other directly. Given an integer k and a tag t, find the k people with the highest closeness centrality values.

Combinatorial Optimization Problems

Maximum cut problem:
- Given weights W for all edges in a graph G, find a partition of the nodes of G into two sets V1 and V2 so that the sum of the weights of the edges between V1 and V2 is maximal.

Maximum k-cut problem (frequency assignment problem):
- Given a graph G, assign colors to the vertices in such a way that the number of non-defect edges (those with endpoints of different colors) is maximal.

Maximum stable set problem:
- ω(G): clique number of a graph G = size of the largest clique in G
- χ(G): coloring number of a graph G = smallest number of colors needed to color the graph so that vertices connected by an edge have different colors
- θ(G) (and θ+(G), a strengthening with the extra entrywise constraint X ≥ 0): Lovász theta function of G, defined by an SDP (written out on the SDP relaxation slide below; a small numerical sketch follows this slide)
- Sandwich theorem: ω(G) ≤ θ(Ḡ) ≤ χ(G)
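As the numerical sketch promised above: the theta function is a tractable SDP, shown here using the cvxpy modeling package (an assumption of this sketch, not part of the course materials). The 5-cycle is chosen because θ(C5) = √5 is known in closed form.

```python
import cvxpy as cp

n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # the 5-cycle C5

# theta(G) = max <ee^T, X>  s.t.  X_ij = 0 for (i,j) in E, <I, X> = 1, X PSD
X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0, cp.trace(X) == 1]
constraints += [X[i, j] == 0 for (i, j) in edges]
theta = cp.Problem(cp.Maximize(cp.sum(X)), constraints).solve()

print(theta)   # ~2.236 = sqrt(5)
```

Adding the entrywise constraint X >= 0 to the same program gives θ+(G).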
Applications in Social Networks: Max Flow, etc.

Community detection in social networks:
- A social network is a network of people connected to their "friends"
- Recommending friends is an important practical problem
- Solution 1: recommend friends of friends
- Solution 2: detect communities
  - Idea 1: use max-flow/min-cut algorithms to find a minimum cut; this fails when there are outliers with small degree
  - Idea 2: find a partition (A, B) that minimizes the conductance:
    min_{A,B} c(A, B) / (|A| |B|),  where c(A, B) = Σ_{i∈A} Σ_{j∈B} c_ij

DIMACS Implementation Challenges

http://dimacs.rutgers.edu/Challenges/
- 2014, 11th: Steiner Tree Problems
- 2012, 10th: Graph Partitioning and Graph Clustering
- 2005, 9th: The Shortest Path Problem
- 2001, 8th: The Traveling Salesman Problem
- 2000, 7th: Semidefinite and Related Optimization Problems
- 1998, 6th: Near Neighbor Searches
- 1995, 5th: Priority Queues, Dictionaries, and Multi-Dim. Point Sets
- 1994, 4th: Two Problems in Computational Biology: Fragment Assembly and Genome Rearrangements
- 1993, 3rd: Effective Parallel Algorithms for Combinatorial Problems
- 1992, 2nd: Maximum Clique, Graph Coloring, and Satisfiability
- 1991, 1st: Network Flows and Matching

Semidefinite Programming (SDP) Relaxations

Maximum cut problem:
min_{X ∈ S^n} ⟨C, X⟩  s.t.  X_ii = 1,  X ⪰ 0

Frequency assignment problem (maximum k-cut problem):
min ⟨ (1/(2k)) diag(We) + ((k−1)/(2k)) W, X ⟩
s.t. X_ij ≥ −1/(k−1), ∀(i, j) ∈ E\T,  X_ij = −1/(k−1), ∀(i, j) ∈ T,
     diag(X) = e,  X ⪰ 0

Maximum stable set problem:
θ(G) := max ⟨ee^T, X⟩  s.t.  X_ij = 0, (i, j) ∈ E,  ⟨I, X⟩ = 1,  X ⪰ 0
θ+(G) := max ⟨ee^T, X⟩  s.t.  X_ij = 0, (i, j) ∈ E,  ⟨I, X⟩ = 1,  X ⪰ 0,  X ≥ 0

Submodular Optimization: News Recommendation

Which set of articles satisfies most users?

Submodular Optimization: Sponsored Search

Which set of ads should be displayed to maximize revenue?

Dimensionality Reduction

Given n data points x_i ∈ R^p, assume this dataset has intrinsic dimensionality d, where d ≪ p. How do we transform the dataset X into a new dataset Y with dimensionality d?
max Σ_ij ||y_i − y_j||^2  s.t.  ||y_i − y_j||^2 = ||x_i − x_j||^2

Maximum variance unfolding (MVU):
max Tr(K)
s.t. K_ii + K_jj − 2K_ij = ||x_i − x_j||^2
     Σ_ij K_ij = 0
     K ⪰ 0
which is an SDP.

Support Vector Machine (SVM)

Suppose we have two-class discrimination data; assign the first class the label 1 and the second the label −1. A powerful discrimination method is the support vector machine (SVM). For ȳ_i = 1 (the first class), let the data points be a_i ∈ R^d, i = 1, ..., n1; for ȳ_j = −1 (the second class), let the data points be b_j ∈ R^d, j = 1, ..., n2. We wish to find a hyperplane in R^d separating the a_i from the b_j. Mathematically, we seek ω ∈ R^d and β ∈ R such that
a_i^T ω + β > 1, ∀i  and  b_j^T ω + β < −1, ∀j,
where {x : ω^T x + β = 0} is the desired separating hyperplane. This is a linear program!

When the data are noisy, we instead solve
min Σ_i |π_i| + Σ_j |σ_j|
s.t. a_i^T ω + β ≥ 1 − π_i, ∀i,  b_j^T ω + β ≤ −1 + σ_j, ∀j,  π, σ ≥ 0,
which is a linear program, or
min ω^T ω + μ (Σ_i π_i^2 + Σ_j σ_j^2)
s.t. a_i^T ω + β ≥ 1 − π_i, ∀i,  b_j^T ω + β ≤ −1 + σ_j, ∀j,  π, σ ≥ 0,
which is a quadratic program.

Phase Retrieval

Phase carries more information than magnitude.

[Figure: for two images Y and S, the panels show Y, |F(Y)|, phase(F(Y)), and the reconstruction iF(|F(Y)| .* phase(F(S))), and likewise S, |F(S)|, phase(F(S)), and iF(|F(S)| .* phase(F(Y))); each reconstruction resembles the image whose phase it uses.]

Question: can we recover a signal without knowing its phase?
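A small sketch of the phase-swap experiment in the figure, assuming NumPy; random arrays stand in for the two images Y and S, since the slide's images are not available.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.random((64, 64))   # stand-in for image Y
S = rng.random((64, 64))   # stand-in for image S

FY, FS = np.fft.fft2(Y), np.fft.fft2(S)
# magnitude of Y combined with the phase of S, then inverted ("iF" in the slide)
recon = np.real(np.fft.ifft2(np.abs(FY) * np.exp(1j * np.angle(FS))))

# the reconstruction typically correlates with S (whose phase it carries), not Y
corr = lambda U, V: np.corrcoef(U.ravel(), V.ravel())[0, 1]
print(corr(recon, S), corr(recon, Y))
```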
Classical Phase Retrieval

Feasibility problem: find x ∈ S ∩ M, or find x ∈ S+ ∩ M
- Given Fourier magnitudes: M := {x(r) : |x̂(ω)| = b(ω)}, where x̂(ω) = F(x(r)) and F is the Fourier transform
- Given a support estimate: S := {x(r) : x(r) = 0 for r ∉ D}, or
  S+ := {x(r) : x(r) ≥ 0 and x(r) = 0 if r ∉ D}

Ptychographic Phase Retrieval (Thanks: Chao Yang)

Given b_i = |F(Q_i ψ)| for i = 1, ..., k, can we recover ψ?

Ptychographic imaging, along with advances in detectors and computing, has resulted in X-ray microscopes with increased spatial resolution without the need for lenses.

Recent Phase Retrieval Models

Given A ∈ C^{m×n} and b ∈ R^m, find x such that |Ax| = b. (Candès et al. 2011b; d'Aspremont 2013)

SDP relaxation: |Ax|^2 is a linear function of X = xx*:
min_{X ∈ S^n} Tr(X)
s.t. Tr(a_i a_i* X) = b_i^2, i = 1, ..., m,
     X ⪰ 0

Exact recovery conditions are available.

Single Particle Cryo-Electron Microscopy

Projection images:
P_i(x, y) = ∫_{−∞}^{∞} φ(x R_i^1 + y R_i^2 + z R_i^3) dz + "noise",
where φ : R^3 → R is the electric potential of the molecule.

Cryo-EM problem: find φ and R_1, ..., R_n given P_1, ..., P_n.

A Toy Example

Single Particle Cryo-Electron Microscopy

Least squares approach:
min_{R_1,...,R_K ∈ SO(3)} Σ_{i≠j} w_ij || R_i (c̃_ij, 0)^T − R_j (c̃_ji, 0)^T ||^2

Since ||R_i (c̃_ij, 0)^T|| = ||R_j (c̃_ji, 0)^T|| = 1, we obtain
max_{R_1,...,R_K ∈ SO(3)} Σ_{i≠j} w_ij ⟨R_i (c̃_ij, 0)^T, R_j (c̃_ji, 0)^T⟩

SDP relaxation:
max_{G ∈ R^{2K×2K}} trace((W ∘ S) G)
s.t. G_ii = I_2, i = 1, 2, ..., K,
     G ⪰ 0

Randomized Linear Algebra Algorithms (Thanks: Petros Drineas and Michael W. Mahoney)

Goal: to develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large (and later, not so large!) matrices and tensors:
- Matrix multiplication
- Computation of the singular value decomposition
- Computation of the CUR decomposition
- Testing feasibility of linear programs
- Least squares approximation
- Tensor computations: SVD generalizations
- Tensor computations: CUR generalization

Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n × n matrices.

CPU vs. GPU

Prices as of March 1, 2015:
- Intel Xeon Processor E7-8880 v2, 15 cores, $5729.00
- Intel Xeon Phi Coprocessor 7120A, 61 cores, $4235.00
- Tesla K80, 4992 cores (2496 per GPU), $4959.99

Supercomputers

http://www.top500.org/lists/2014/11/
- Top 1: Tianhe-2 (MilkyWay-2): TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P, NUDT, 3,120,000 cores
- #24: Edison, NERSC's newest supercomputer, a Cray XC30 with a peak performance of 2.57 petaflops, 133,824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk

Optimization Formulation

A mathematical optimization problem has the form
min f(x)
s.t. c_i(x) = 0, i ∈ E
     c_i(x) ≥ 0, i ∈ I
- x = (x_1, ..., x_n)^T: the variable
- f : R^n → R: the objective function
- c_i : R^n → R: the constraints
- optimal solution x*: a feasible point with the smallest value of f
(A small numerical instance is sketched at the end of this section.)

Classification

- Continuous versus discrete optimization
- Unconstrained versus constrained optimization
- Global versus local optimization
- Stochastic versus deterministic optimization
- Linear/nonlinear/quadratic programming; convex/nonconvex optimization
- Least squares problems, equation solving
- Sparse optimization, PDE-constrained optimization, robust optimization
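To make the general formulation concrete, here is the minimal instance promised above, assuming SciPy; the objective and the two constraints are invented for illustration. SciPy's constraint convention matches the slide's split into c_i(x) = 0 (i ∈ E) and c_i(x) ≥ 0 (i ∈ I).

```python
import numpy as np
from scipy.optimize import minimize

# f(x) = (x1 - 1)^2 + (x2 - 2.5)^2, an illustrative smooth objective
f = lambda x: (x[0] - 1) ** 2 + (x[1] - 2.5) ** 2

cons = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 2},  # c(x) = 0, i in E
    {"type": "ineq", "fun": lambda x: x[0]},             # c(x) >= 0, i in I
]

res = minimize(f, x0=np.zeros(2), constraints=cons, method="SLSQP")
print(res.x)   # optimal solution x* = (0.25, 1.75): feasible, smallest f
```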