Algorithms for Big Data Analysis
文再文 (Zaiwen Wen)
Course codes: 00136720 (undergraduate), 00100863 (combined undergraduate/graduate)
Peking University
Beijing International Center for Mathematical Research
http://bicmr.pku.edu.cn/~wenzw/bigdata2016.html
[email protected]
Course Information
Algorithms for Big Data Analysis
Emphasis on numerical linear algebra and optimization algorithms
Course codes: 00136720 (undergraduate), 00100863 (combined undergraduate/graduate)
Instructor: 文再文 (Zaiwen Wen), [email protected], WeChat: wendoublewen
Teaching assistant: 户将 (Jiang Hu), [email protected]
Location: Classroom Building No. 3, Room 301
Time: weeks 1-16, Tuesdays, periods 10-12 (18:40-21:30)
Course homepage:
http://bicmr.pku.edu.cn/~wenzw/bigdata2016.html
References
Class notes, together with the following reference books and papers:
"Introduction to Algorithms", Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, The MIT Press
"Mining of Massive Datasets", Jure Leskovec, Anand Rajaraman, Jeff Ullman, Cambridge University Press
"Convex Optimization", Stephen Boyd and Lieven Vandenberghe, Cambridge University Press
"Numerical Optimization", Jorge Nocedal and Stephen Wright, Springer
"Optimization Theory and Methods", Wenyu Sun and Ya-Xiang Yuan, Springer
Websites Related to Big Data Analysis
Mining Massive Data Sets, Stanford University
Jure Leskovec @ Stanford - Stanford Computer Science
Mining Massive Data Sets: Hadoop Labs, Stanford University
Big Data Analytics, Columbia University
Theoretical Foundations of Big Data Analysis, The Simons Institute for the Theory of Computing, UC Berkeley
Introduction to Data Science, University of Washington
Core Methods in Data Science, University of Washington
Course homepage:
http://bicmr.pku.edu.cn/~wenzw/bigdata2016.html
Course Plan
Emphasis on models and algorithms from numerical linear algebra and optimization
Linear programming, semidefinite programming
Basic theory and algorithms for compressive sensing and sparse optimization
Basic theory and algorithms for low-rank matrix recovery
Graph and network flow problems: shortest path, maximum flow, etc.
Submodular optimization and data mining
Machine learning and data mining: clustering, dimensionality reduction for high-dimensional data, link analysis, recommender systems
Large-scale machine learning: support vector machines
Modern medical imaging and high-dimensional image analysis: phase retrieval and cryo-electron microscopy
Stochastic optimization algorithms for big data analysis; randomized algorithms for eigenvalue and singular value decompositions
Parallel, distributed, and decentralized computing for big data analysis
Course Information
Teaching format:
Lectures: 80%
Student group presentations on small projects: 20%
Grading:
4-5 homework sets, including exercises and programs: 35%
Two course projects: 65%
Requirements: homework and course projects must be submitted on time; late submissions and plagiarized work receive no credit
Orders of Magnitude
http://zh.wikipedia.org/wiki/大数
Linear Programming, Semidefinite Programming, Conic Programming
Primal:
    $\min\ c^\top x$  s.t. $Ax = b$, $x \in K$
Dual:
    $\max\ b^\top y$  s.t. $A^\top y + s = c$, $s \in K^*$
"Large" problems:
LP: K is the nonnegative orthant
    vector $x \in \mathbb{R}^n$, millions of variables
SOCP: K is the circular or Lorentz cone
    vector $x \in \mathbb{R}^n$, millions of variables
SDP: K is the cone of positive semidefinite matrices
    matrix $X \in S^{n \times n}$, n up to 5000
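For concreteness, a minimal MATLAB sketch of a tiny LP over the nonnegative orthant, using linprog (Optimization Toolbox assumed; the data are made up for illustration):
% Minimal LP sketch: min c'*x s.t. Ax = b, x >= 0 (K = nonnegative orthant).
c = [1; 2; 0];
A = [1 1 1]; b = 1;               % one equality constraint: x1 + x2 + x3 = 1
lb = zeros(3, 1);                 % x in the nonnegative orthant
x = linprog(c, [], [], A, b, lb, []);
disp(x')                          % all mass lands on x3, the zero-cost variable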
Compressive Sensing
Find the sparsest solution.
Given n = 256, m = 128:
A = randn(m,n); u = sprandn(n, 1, 0.1); b = A*u;
[Figure: recovered solutions of
(a) $\ell_0$-minimization: $\min_x \|x\|_0$ s.t. $Ax = b$;
(b) $\ell_2$-minimization: $\min_x \|x\|_2$ s.t. $Ax = b$;
(c) $\ell_1$-minimization: $\min_x \|x\|_1$ s.t. $Ax = b$.]
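A hedged sketch of how panel (c) can be computed: split x = u - v with u, v >= 0, so that $\|x\|_1 = \sum_i (u_i + v_i)$ and the $\ell_1$ problem becomes an LP (the slide does not prescribe a solver; linprog is used here for illustration):
% Basis pursuit recast as an LP via x = u - v, u, v >= 0.
n = 256; m = 128;
A = randn(m, n); u0 = sprandn(n, 1, 0.1); b = A*u0;   % data as above
f = ones(2*n, 1);                 % objective: sum(u) + sum(v)
Aeq = [A, -A];                    % A*(u - v) = b
lb = zeros(2*n, 1);
z = linprog(f, [], [], Aeq, b, lb, []);
x = z(1:n) - z(n+1:end);          % recovered sparse vector
fprintf('recovery error: %.2e\n', norm(x - u0));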
Wavelets and Images (Thanks: Richard Baraniuk)
Wavelet Approximation (Thanks: Richard Baraniuk)
Compressive sensing
Given $(A, b, \Psi)$, find the sparsest point:
    $x^* = \arg\min \{\|\Psi x\|_0 : Ax = b\}$
From combinatorial to convex optimization:
    $\bar{x} = \arg\min \{\|\Psi x\|_1 : Ax = b\}$
The 1-norm is sparsity promoting.
Basis pursuit (Donoho et al. 98)
Many variants: $\|Ax - b\|_2 \le \sigma$ for noisy $b$
Theoretical question: when is $\|\cdot\|_0 \leftrightarrow \|\cdot\|_1$, i.e., when do the two problems share the same solution?
Restricted Isometry Property (RIP)
Definition (Candes and Tao [2005])
Matrix $A$ obeys the restricted isometry property (RIP) with constant $\delta_s$ if
    $(1 - \delta_s)\|c\|_2^2 \le \|Ac\|_2^2 \le (1 + \delta_s)\|c\|_2^2$
for all $s$-sparse vectors $c$.
Theorem (Candes and Tao [2006])
If $x$ is $k$-sparse and $A$ satisfies $\delta_{2k} + \delta_{3k} < 1$, then $x$ is the unique $\ell_1$ minimizer.
RIP essentially requires that every set of columns with cardinality at most $s$ behaves like an orthonormal system.
MRI: Magnetic Resonance Imaging
[Figure: (a) MRI scan; (b) Fourier coefficients; (c) image]
Is it possible to cut the scan time in half?
MRI (Thanks: Wotao Yin)
MR images often have sparse representations under some wavelet transform $\Phi$.
Solve
    $\min_u\ \|\Phi u\|_1 + \frac{\mu}{2}\|Ru - b\|^2$
$R$: partial discrete Fourier transform.
The higher the SNR (signal-to-noise ratio), the better the image quality.
[Figure: (a) full sampling; (b) 39% sampling, SNR = 32.2]
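A hedged proximal-gradient (ISTA) sketch for this model, taking $\Phi = I$ and a random Gaussian matrix in place of the partial Fourier transform (all parameters are illustrative, not tuned):
% ISTA for min_u ||u||_1 + (mu/2)*||R*u - b||^2, with Phi = I.
n = 200; m = 80;
R = randn(m, n) / sqrt(m);                   % stand-in for the partial DFT
u0 = zeros(n, 1); u0(randperm(n, 10)) = randn(10, 1);  % sparse ground truth
b = R * u0;
mu = 100; t = 1 / (mu * norm(R)^2);          % step size 1/L
shrink = @(x, s) sign(x) .* max(abs(x) - s, 0);
u = zeros(n, 1);
for k = 1:500
    u = shrink(u - t * mu * (R' * (R*u - b)), t);  % gradient step + l1 prox
end
fprintf('relative error: %.2e\n', norm(u - u0) / norm(u0));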
MRI: Magnetic Resonance Imaging
[Figure: (a) full sampling; (b) 39% sampling, SNR = 32.2; (c) 22% sampling, SNR = 21.4; (d) 14% sampling, SNR = 15.8]
The Netflix Prize
Training data
100 million ratings, 480,000 users, 17,770 movies
6 years of data: 2000-2005
Test data
Last few ratings of each user (2.8 million)
Evaluation criterion: root mean squared error (RMSE)
Netflix Cinematch RMSE: 0.9514
Competition
2700+ teams
$1 million prize for 10% improvement on Cinematch
Netflix Problem: $1 million award
Given m movies $x \in X$ and n customers $y \in Y$,
predict the "rating" $W(x, y)$ of customer $y$ for movie $x$
training data: known ratings of some customers for some movies
Goal: complete the matrix
other applications: collaborative filtering, system identification, etc.
Matrix Rank Minimization
Given $X \in \mathbb{R}^{m \times n}$, $\mathcal{A}: \mathbb{R}^{m \times n} \to \mathbb{R}^p$, $b \in \mathbb{R}^p$, we consider the
matrix completion problem:
    $\min\ \mathrm{rank}(X)$, s.t. $X_{ij} = M_{ij}$, $(i, j) \in \Omega$
nuclear norm minimization:
    $\min\ \|X\|_*$ s.t. $\mathcal{A}(X) = b$
where $\|X\|_* = \sum_i \sigma_i$ and $\sigma_i$ is the $i$th singular value of $X$.
SDP reformulation:
    $\min\ \mathrm{Tr}(W_1) + \mathrm{Tr}(W_2)$
    s.t. $\begin{pmatrix} W_1 & X \\ X^\top & W_2 \end{pmatrix} \succeq 0$, $\mathcal{A}(X) = b$
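A minimal sketch of a singular value thresholding (SVT)-style iteration for the matrix completion problem above (tau and delta are illustrative, untuned choices; not the course's reference code):
% SVT-style iteration for min ||X||_* s.t. X_ij = M_ij on Omega.
m = 100; n = 100; r = 2;
M = randn(m, r) * randn(r, n);           % low-rank ground truth
Omega = rand(m, n) < 0.3;                % 30% observed entries
tau = 5 * sqrt(m*n); delta = 1.2 / 0.3;  % illustrative parameters
Y = zeros(m, n);
for k = 1:300
    [U, S, V] = svd(Y, 'econ');
    X = U * max(S - tau, 0) * V';        % shrink singular values by tau
    Y = Y + delta * (Omega .* (M - X));  % dual step on observed entries
end
fprintf('relative error: %.2e\n', norm(X - M, 'fro') / norm(M, 'fro'));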
Video separation
Partition the video into moving and static parts
Sparse and low-rank matrix separation
Given a matrix M, we want to find a low-rank matrix W and a sparse matrix E such that W + E = M.
Convex approximation:
    $\min_{W,E}\ \|W\|_* + \mu\|E\|_1$, s.t. $W + E = M$
Robust PCA
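A hedged alternating-direction sketch for this convex model (mu and rho are illustrative parameters; shrink denotes soft-thresholding):
% ADMM-style splitting for min ||W||_* + mu*||E||_1 s.t. W + E = M.
m = 80; n = 80;
W0 = randn(m, 2) * randn(2, n);              % low-rank part
E0 = full(sprandn(m, n, 0.05));              % sparse part
M = W0 + E0;
mu = 1 / sqrt(max(m, n)); rho = 1;           % illustrative parameters
W = zeros(m, n); E = zeros(m, n); Y = zeros(m, n);
shrink = @(X, t) sign(X) .* max(abs(X) - t, 0);
for k = 1:200
    [U, S, V] = svd(M - E + Y/rho, 'econ');
    W = U * shrink(S, 1/rho) * V';           % singular value thresholding
    E = shrink(M - W + Y/rho, mu/rho);       % elementwise soft-threshold
    Y = Y + rho * (M - W - E);               % dual update
end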
Graph and Network Flow Problems
Graph and Network Flow Problems
2014 ACM SIGMOD Programming Contest
Shortest Distance Over Frequent Communication Paths
Define the edges of the social network: two people are connected if they know each other and have sent each other at least x replies. Given two people p1 and p2 in the network and an integer x, find a path between p1 and p2 with the fewest nodes.
Interests with Large Communities
Socialization Suggestion
Most Central People (all-pairs shortest paths)
Define the network: forum members with tag t who know each other directly. Given an integer k and a tag t, find the k people with the highest closeness centrality values.
Combinatorial optimization problems
Maximum cut problem
given weights W for all edges in a graph G, find a partition of the nodes of G into two sets V1 and V2 so that the sum of the weights of the edges between V1 and V2 is maximal
Maximum k-cut problem (frequency assignment problem)
given a graph G, assign colors to the vertices in such a way that the number of non-defect edges (those with endpoints of different colors) is maximal
Maximum stable set problems
$\omega(G)$: clique number of a graph G = size of the largest clique in G
$\chi(G)$: coloring number of a graph G = smallest number of colors needed to color the graph so that vertices connected by an edge have different colors
$\theta(G)$ ($\theta_+(G)$): Lovász theta function of G, defined by an SDP
    $\omega(G) \le \theta(G) \le \theta_+(G) \le \chi(G)$
Applications in social networks: max flow etc.
Community detection in social networks
A social network is a network of people connected to their "friends"
Recommending friends is an important practical problem
solution 1: recommend friends of friends
solution 2: detect communities
idea 1: use max-flow/min-cut algorithms to find a minimum cut
    this fails when there are outliers with small degree
idea 2: find a partition (A, B) that minimizes the conductance
    $\min_{A,B}\ \frac{c(A, B)}{|A|\,|B|}$
where $c(A, B) = \sum_{i \in A} \sum_{j \in B} c_{ij}$
DIMACS Implementation Challenges
http://dimacs.rutgers.edu/Challenges/
2014, 11th: Steiner Tree Problems
2012, 10th: Graph Partitioning and Graph Clustering
2005, 9th: The Shortest Path Problem
2001, 8th: The Traveling Salesman Problem
2000, 7th: Semidefinite and Related Optimization Problems
1998, 6th: Near Neighbor Searches
1995, 5th: Priority Queues, Dictionaries, and Multi-Dim. Point Sets
1994, 4th: Two Problems in Computational Biology: Fragment
Assembly and Genome Rearrangements
1993, 3rd: Effective Parallel Algorithms for Combinatorial
Problems
1992, 2nd: Maximum Clique, Graph Coloring, and Satisfiability
1991, 1st: Network Flows and Matching
Semidefinite programming (SDP) relaxations
Maximum cut problem:
    $\min_{X \in S^n}\ \langle C, X \rangle$ s.t. $X_{ii} = 1$, $X \succeq 0$
Frequency assignment problem (maximum k-cut problem):
    $\min\ \langle \tfrac{1}{2k}\,\mathrm{diag}(We) + \tfrac{k-1}{2k}\,W,\ X \rangle$
    s.t. $X_{ij} \ge -\tfrac{1}{k-1}\ \forall (i, j) \in E \setminus T$, $X_{ij} = -\tfrac{1}{k-1}\ \forall (i, j) \in T$,
         $\mathrm{diag}(X) = e$, $X \succeq 0$
Maximum stable set problems:
    $\theta(G) := \max\ \langle ee^\top, X \rangle$ s.t. $X_{ij} = 0$, $(i, j) \in E$, $\langle I, X \rangle = 1$, $X \succeq 0$
    $\theta_+(G) := \max\ \langle ee^\top, X \rangle$ s.t. $X_{ij} = 0$, $(i, j) \in E$, $\langle I, X \rangle = 1$, $X \succeq 0$, $X \ge 0$
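A hedged CVX sketch of the max-cut relaxation followed by Goemans-Williamson random-hyperplane rounding (CVX assumed installed; the random graph is made up):
% Max-cut SDP relaxation, then random-hyperplane rounding.
n = 5;
Wadj = double(rand(n) > 0.5); Wadj = triu(Wadj, 1); Wadj = Wadj + Wadj';
C = -(diag(sum(Wadj, 2)) - Wadj) / 4;    % min <C,X> equals max cut value
cvx_begin sdp quiet
    variable X(n, n) symmetric
    minimize( trace(C * X) )
    subject to
        diag(X) == 1;
        X >= 0;                          % PSD constraint in sdp mode
cvx_end
L = chol(X + 1e-6 * eye(n), 'lower');    % factor X ~ L*L'
cut = sign(L * randn(n, 1));             % random hyperplane gives the two sets
fprintf('cut value: %g\n', (sum(Wadj(:)) - cut'*Wadj*cut) / 4);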
Submodular Optimization: News recommendation
Which set of articles satisfies most users?
Submodular Optimization: Sponsored search
Which set of ads should be displayed to maximize revenue?
Dimensionality Reduction
Given n data points $x_i \in \mathbb{R}^p$, assume the dataset has intrinsic dimensionality d with $d \ll p$. How do we transform the dataset X into a new dataset Y of dimensionality d?
    $\max\ \sum_{ij} \|y_i - y_j\|^2$  s.t. $\|y_i - y_j\|^2 = \|x_i - x_j\|^2$
Maximal variance unfolding (MVU):
    $\max\ \mathrm{Tr}(K)$
    s.t. $K_{ii} + K_{jj} - 2K_{ij} = \|x_i - x_j\|^2$,
         $\sum_{ij} K_{ij} = 0$,
         $K \succeq 0$,
which is an SDP.
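A hedged CVX sketch of MVU on a made-up curve in $\mathbb{R}^3$ (CVX assumed installed; the neighborhood radius is illustrative):
% MVU: maximize trace(K) subject to local distance preservation.
n = 20;
t = linspace(0, 2*pi*0.9, n)';
X = [cos(t), sin(t), 0.1*t];                   % toy curve in R^3
D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X');  % squared pairwise distances
NB = D2 > 0 & D2 < 0.5;                        % neighbor pairs (made-up radius)
cvx_begin sdp quiet
    variable K(n, n) symmetric
    maximize( trace(K) )
    subject to
        sum(K(:)) == 0;
        K >= 0;                                % PSD constraint in sdp mode
        for i = 1:n
            for j = i+1:n
                if NB(i, j)
                    K(i, i) + K(j, j) - 2*K(i, j) == D2(i, j);
                end
            end
        end
cvx_end
[V, D] = eig(K);    % top-d eigenvectors of K give the d-dimensional embedding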
Supporting Vector Machine (SVM)
Suppose we have two-class discrimination data. We assign the first class the label 1 and the second the label -1. A powerful discrimination method is the supporting vector machine (SVM). With $\bar{y}_i = 1$ (the first class), let data point i be given by $a_i \in \mathbb{R}^d$, $i = 1, \ldots, n_1$, and with $\bar{y}_j = -1$ (the second class), let data point j be given by $b_j \in \mathbb{R}^d$, $j = 1, \ldots, n_2$. We wish to find a hyperplane in $\mathbb{R}^d$ separating the $a_i$ from the $b_j$. Mathematically, we wish to find $\omega \in \mathbb{R}^d$ and $\beta \in \mathbb{R}$ such that
    $a_i^\top \omega + \beta > 1\ \forall i$  and  $b_j^\top \omega + \beta < -1\ \forall j$,
where $\{x : \omega^\top x + \beta = 0\}$ is the desired separating hyperplane. This is a linear program!
Supporting Vector Machine (SVM)
When the data are noisy, we solve
    $\min\ \sum_i |\pi_i| + \sum_j |\sigma_j|$
    s.t. $a_i^\top \omega + \beta \ge 1 - \pi_i\ \forall i$,
         $b_j^\top \omega + \beta \le -1 + \sigma_j\ \forall j$,
         $\pi, \sigma \ge 0$,
which is a linear program, or
    $\min\ \omega^\top \omega + \mu \big( \sum_i \pi_i^2 + \sum_j \sigma_j^2 \big)$
    s.t. $a_i^\top \omega + \beta \ge 1 - \pi_i\ \forall i$,
         $b_j^\top \omega + \beta \le -1 + \sigma_j\ \forall j$,
         $\pi, \sigma \ge 0$,
which is a quadratic program.
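A hedged sketch of the LP version above with linprog, stacking the variables as z = [w; beta; pi; sigma] (the 2-D Gaussian data are made up):
% Soft-margin separation as an LP: min sum(pi) + sum(sigma).
d = 2; n1 = 50; n2 = 50;
Apts = randn(n1, d) + 2;                  % class +1 points (rows a_i')
Bpts = randn(n2, d) - 2;                  % class -1 points (rows b_j')
f = [zeros(d+1, 1); ones(n1+n2, 1)];      % cost only on the slacks
Ain = [ -Apts, -ones(n1, 1), -eye(n1), zeros(n1, n2);   % a_i'*w + beta >= 1 - pi_i
         Bpts,  ones(n2, 1), zeros(n2, n1), -eye(n2) ]; % b_j'*w + beta <= -1 + sigma_j
bin = -ones(n1 + n2, 1);
lb = [ -inf(d+1, 1); zeros(n1+n2, 1) ];   % slacks nonnegative; w, beta free
z = linprog(f, Ain, bin, [], [], lb, []);
w = z(1:d); beta = z(d+1);                % hyperplane: w'*x + beta = 0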
Phase Retrieval
Phase carries more information than magnitude.
[Figure: images Y and S; Fourier magnitudes |F(Y)|, |F(S)|; phases phase(F(Y)), phase(F(S)); and the phase-swapped reconstructions iF(|F(Y)|.*phase(F(S))) and iF(|F(S)|.*phase(F(Y)))]
Question: can we recover a signal without knowing the phase?
Classical Phase Retrieval
Feasibility problem:
    find $x \in S \cap M$  or  find $x \in S_+ \cap M$
given Fourier magnitudes:
    $M := \{x(r) \mid |\hat{x}(\omega)| = b(\omega)\}$,
where $\hat{x}(\omega) = \mathcal{F}(x(r))$ and $\mathcal{F}$ is the Fourier transform;
given a support estimate:
    $S := \{x(r) \mid x(r) = 0 \text{ for } r \notin D\}$
or
    $S_+ := \{x(r) \mid x(r) \ge 0 \text{ and } x(r) = 0 \text{ if } r \notin D\}$
Ptychographic Phase Retrieval (Thanks: Chao Yang)
Given $b_i = |\mathcal{F}(Q_i \psi)|$ for $i = 1, \ldots, k$, can we recover $\psi$?
Ptychographic imaging, along with advances in detectors and computing, has produced X-ray microscopes with increased spatial resolution without the need for lenses.
Recent Phase Retrieval Model Problems
Given $A \in \mathbb{C}^{m \times n}$ and $b \in \mathbb{R}^m$,
    find $x$ s.t. $|Ax| = b$.
(Candes et al. 2011b, Alexandre d'Aspremont 2013)
SDP relaxation: $|Ax|^2$ is a linear function of $X = xx^*$:
    $\min_{X \in S^n}\ \mathrm{Tr}(X)$
    s.t. $\mathrm{Tr}(a_i a_i^* X) = b_i^2$, $i = 1, \ldots, m$,
         $X \succeq 0$
Exact recovery conditions
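A hedged CVX sketch of this trace relaxation on made-up complex data, recovering x up to a global phase from the leading eigenvector (CVX assumed installed; m = 6n is an illustrative oversampling, not a recovery guarantee):
% Trace minimization with measurements |a_i' x|^2 lifted to X = x*x'.
n = 10; m = 60;
A = (randn(m, n) + 1i*randn(m, n)) / sqrt(2);
x0 = randn(n, 1) + 1i*randn(n, 1);
b2 = abs(A * x0).^2;                     % squared magnitude measurements
cvx_begin sdp quiet
    variable X(n, n) hermitian
    minimize( real(trace(X)) )
    subject to
        for k = 1:m
            real(A(k, :) * X * A(k, :)') == b2(k);
        end
        X >= 0;                          % PSD constraint in sdp mode
cvx_end
[v, lam] = eigs(X, 1);                   % leading rank-one factor
xhat = sqrt(lam) * v;                    % matches x0 up to a global phase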
Single Particle Cryo-Electron Microscopy
Projection images: $P_i(x, y) = \int_{-\infty}^{\infty} \phi(x R_i^1 + y R_i^2 + z R_i^3)\, dz$ + "noise".
$\phi: \mathbb{R}^3 \to \mathbb{R}$ is the electric potential of the molecule.
Cryo-EM problem: find $\phi$ and $R_1, \ldots, R_n$ given $P_1, \ldots, P_n$.
A Toy Example
Single Particle Cryo-Electron Microscopy
Least squares approach:
    $\min_{R_1, \ldots, R_K \in SO(3)}\ \sum_{i \ne j} w_{ij}\, \|R_i (\vec{c}_{ij}, 0)^\top - R_j (\vec{c}_{ji}, 0)^\top\|^2$
Since $\|R_i (\vec{c}_{ij}, 0)^\top\| = \|R_j (\vec{c}_{ji}, 0)^\top\| = 1$, we obtain
    $\max_{R_1, \ldots, R_K \in SO(3)}\ \sum_{i \ne j} w_{ij}\, \langle R_i (\vec{c}_{ij}, 0)^\top, R_j (\vec{c}_{ji}, 0)^\top \rangle$
SDP relaxation:
    $\max_{G \in \mathbb{R}^{2K \times 2K}}\ \mathrm{trace}((W \circ S)\, G)$
    s.t. $G_{ii} = I_2$, $i = 1, 2, \ldots, K$,
         $G \succeq 0$
Randomized Linear Algebra Algorithms
(Thanks: Petros Drineas and Michael W. Mahoney)
Goal: To develop and analyze (fast) Monte Carlo algorithms for
performing useful computations on large (and later not so large!)
matrices and tensors.
Matrix Multiplication
Computation of the Singular Value Decomposition
Computation of the CUR Decomposition
Testing Feasibility of Linear Programs
Least Squares Approximation
Tensor computations: SVD generalizations
Tensor computations: CUR generalization
Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., $O(n^3)$ for $n \times n$ matrices.
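As one concrete instance of this flavor, a hedged sketch of a basic randomized SVD (random range finder plus a small deterministic SVD), following the well-known Halko-Martinsson-Tropp template rather than any specific algorithm from the lecture:
% Randomized SVD sketch: sample the range, project, then do a small SVD.
m = 2000; n = 1000; k = 10; p = 5;     % target rank k, oversampling p
A = randn(m, k) * randn(k, n);         % toy low-rank matrix
Omega = randn(n, k + p);               % random test matrix
Y = A * Omega;                         % sketch of the range of A
[Q, ~] = qr(Y, 0);                     % orthonormal basis for the sketch
B = Q' * A;                            % small (k+p) x n projected matrix
[Ub, S, V] = svd(B, 'econ');
U = Q * Ub;                            % approximate SVD: A ~ U*S*V'
fprintf('relative error: %.2e\n', norm(A - U*S*V', 'fro') / norm(A, 'fro'));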
CPU vs. GPU
March 1, 2015:
Intel Xeon Processor E7-8880 v2, 15 cores, $5729.00
Intel Xeon Phi Coprocessor 7120A, 61 cores, $4235.00
Tesla K80, 4992 (2496 per GPU) cores, $4959.99
Supercomputers
http://www.top500.org/lists/2014/11/
Top 1: Tianhe-2 (MilkyWay-2): TH-IVB-FEP Cluster, Intel Xeon
E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P,
NUDT, 3,120,000 cores
#24: Edison, NERSC's newest supercomputer, a Cray XC30 with a peak performance of 2.57 petaflops, 133,824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk.
Optimization Formulation
Mathematical optimization problem:
    $\min\ f(x)$
    s.t. $c_i(x) = 0$, $i \in \mathcal{E}$,
         $c_i(x) \ge 0$, $i \in \mathcal{I}$
$x = (x_1, \ldots, x_n)^\top$: variables
$f: \mathbb{R}^n \to \mathbb{R}$: objective function
$c_i: \mathbb{R}^n \to \mathbb{R}$: constraint functions
optimal solution $x^*$: a feasible point with the smallest value of $f$
Classification
Continuous versus discrete optimization
Unconstrained versus constrained optimization
Global and local optimization
Stochastic and deterministic optimization
Linear/nonlinear/quadratic programming, Convex/nonconvex
optimization
Least squares problems, equation solving
Sparse optimization, PDE-constrained optimization, robust optimization