Download Lecture Notes - L3S Research Center

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Large Scale Data Mining
Introduction, Logistics
credits:www.mmds.org
credits:
http://www.mmds.org
Who are we ?
•
Dr. Avishek Anand •
post-­‐doc researcher • Informa8on Retrieval, Text mining, Graph mining • [email protected] •
Jaspreet Singh (TA) •
PhD student • Informa8on Retrieval •
[email protected]
Avishek Anand
2
What is Data Mining?
•
Given lots of data •
Discover paHerns and models that are: •
•
•
•
Valid: hold on new data with some certainty Useful: should be possible to act on the item Unexpected: non-­‐obvious to the system Understandable: humans should be able to interpret the pa;ern
Avishek Anand
3
Data Mining
•
But to extract the knowledge data needs to be •
Stored Managed • And ANALYZED ! this class •
Data Mining ≈ Big Data ≈ Predic8ve Analy8cs ≈ Data Science
Avishek Anand
4
Data Mining Tasks
•
Descrip8ve methods •
Find human-­‐interpretable pa;erns that describe the data •
•
Example: Clustering Predic8ve methods •
Use some variables to predict unknown or future values of other variables •
Avishek Anand
Example: Recommender systems
5
PiEalls and Risks
•
A risk with “Data mining” is that an analyst can “discover” paHerns that are meaningless •
StaHsHcians call it Bonferroni’s principle: •
Roughly, if you look in more places for interesHng pa;erns than your amount of data will support, you are bound to find crap
Avishek Anand
6
What ma;ers when dealing with data?
Challenges
Usage
Quality
Context
Streaming
ct
ru
St
Reason
Visualize
Data Operators
Avishek Anand
ur
Ne ed
tw
or
ks
M Tex
ul
tim t
ed
ia
Si
gn
al
s
ie
og
ol
nt
O
Collect
Prepare
Represent
Model
s
Scalability
7
Data Modalities
This Lecture
•
This class overlaps with machine learning, sta8s8cs, ar8ficial intelligence, databases but more stress on •
•
•
•
Scalability (big data) Algorithms Compu8ng architectures AutomaHon for handling large data
Avishek Anand
8
What will we learn?
•
We will learn to solve real-­‐world problems: •
•
•
•
•
Finding Similar Items Market Basket Analysis Spam detecHon Duplicate document detecHon We will learn various “tools”: •
•
•
•
OpHmizaHon (stochasHc gradient descent) Dynamic programming (frequent itemsets) Hashing (LSH, Bloom filters) ProbabilisHc analysis
Avishek Anand
9
Finding Similar Items
•
What are the near duplicates / similar items / nearest neighbours ? How do we find this efficiently for large input ?
Avishek Anand
10
Clustering
•
What are the clusters given large input collecHons ? Avishek Anand
11
Mining Data Streams
•
How do we sample data in a stream ? •
How do we count disHnct elements in a stream ? Avishek Anand
12
Mining Graphs
•
How do we find communiHes in large graphs ? •
How do we find shortest paths in massive graphs in reasonable Hme ? Avishek Anand
13
What will we learn?
•
We will learn to mine different types of data: •
•
•
•
Data is high dimensional Data is a graph Data is infinite/never-­‐ending We will learn to use different models of computa8on: •
•
•
MapReduce Streams and online algorithms Single machine in-­‐memory
Avishek Anand
14
About the Course
Course LogisHcs
•
Course website: hHp://www.l3s.de/~anand/lsdm16/ •
•
Lecture slides (at least 30min before the lecture) Homeworks, soluHons •
Readings: Book Mining of Massive Datasets with A. Rajaraman and J. Ullman •
ITIS Students: part of the theoreHcal foundaHons Free online:
hHp://www.mmds.org Avishek Anand
16
LogisHcs: CommunicaHon
•
For e-­‐mailing us, always use: •
•
•
•
[email protected],[email protected] Use this prefix in the subject field [lsdm16] You can message via Stud.ip as well We will post course announcements to the course webpage -­‐ via twiHer (make sure you check it regularly)
Avishek Anand
17
Work for the Course
•
7-­‐8 Assignments: •
•
•
•
•
TheoreHcal and programming quesHons Assignments take 8me. Start early!! 5-­‐6 ques8ons per assignment, online on the lecture day Due date Tuesday ajer the lecture 11:59 am How to submit? •
Homework write-­‐up: • Submit @ Jaspreet’s office or (11:59 am), •
Avishek Anand
Scan and send to [email protected] (11:59 pm)
18
Tutorials
•
•
•
•
•
Afer the lecture (with Jaspreet) Successfully evaluated assignments will be offered for soluHon presentaHon Each presentaHon (whiteboard) should be ~10 minutes..max 15 minutes 3 successful presentaHons give you 1 bonus point 1 bonus point = 0.3 grade improvement in your final exam
Avishek Anand
19
Final Exam
•
•
•
•
•
Wri;en Exam DuraHon : 2 hours 1 bonus point = 0.3 grade improvement in your final exam • 1.3 + 1 bonus point = 1.0 • 5.0 + 1 bonus point = 5.0 Modelled on Assignments More applied and algorithmic aspects rather than memorising Avishek Anand
20
Prerequisites
•
Algorithms and Data Structures •
•
Basic probability •
•
Dynamic programming, basic data structures Moments, typical distribuHons, MLE, … KDD1 •
FoundaHon lecture running this semester
Avishek Anand
21
Quiz
Quiz
•
Anonymous •
Tear off your Quiz Number on the top right •
Save it with you if you want to know your scores •
Dura8on : 20 minutes •
Afer the test: Shuffle the deck and redistribute Crowdsourcing
•
Avishek Anand
23