Download link mining - University of Maryland

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Network science wikipedia , lookup

Data analysis wikipedia , lookup

Corecursion wikipedia , lookup

Transcript
Challenge Problem:
Link Mining
Lise Getoor
University of Maryland, College Park
Link Mining
• Data
– Structured Input: Mining graphs and networks
– Structured Output: Extracting entity and
relationships from unstructured data
• Making use of Links
– For ranking nodes
– For collective classification of nodes
• Discovering Links
– Predicting missing links
– Discovering new kinds of links and relationships
Link Mining Tasks
• Node Centric
– Labeling/ranking nodes (aka Collective
Classification/PageRank)
– Consolidating nodes (aka Entity Resolution)
– Discovering hidden nodes (aka Group Discovery)
• Edge Centric
–
–
–
–
Labeling/ranking edges
Predicting the existence of edges
Predicting the number of edges
Discovering new relations/paths
• Graph/Subgraph Centric
– Discovering frequent subpatterns
– Generative models
– Metadata discovery, extraction, and reformulation
Reference: SigKDD Explorations Special Issue on Link Mining, December 2005.
The Link Mining Challenge
• Current research mostly focus on a single
task, e.g., node ranking or link prediction
• In real data analysis scenarios, we need a mix
of all of these capabilities
• Many potential domains:
–
–
–
–
–
Bioinformatics
Social network analysis
Citation Analysis
Fraud detection
….
Challenge Problem Requirements
1.
2.
3.
4.
5.
Relevant to data mining and based on analysis of
large volumes of data (including web, text, images,
links, etc), preferably publicly available data.
Important and difficult so that its solution will
advance the field and benefit the society
Interesting and exciting to attract researchers,
public and press attention, and funding. This
requires a simple and concise problem statement
The required domain knowledge should be relatively
accessible.
Other groups are not actively working on this
problem already
Domain
Evangelists: “Goal to distribute
free encyclopedia to every single person
on the planet in their own language”
Jimmy Wales Wikipedia founder
“Disaster is not too strong a word for wikipedia… the site is
Collaboratively edited user contributed encyclopedia
infested with moonbats”
Largest example of participatory journalism to date.
Mantra: maintain a neutral point of view (NPOV)
Eric Raymond, Open-source movement figure
Detractors::”Wikipedia has gone from a
nearly perfect anarchy to an anarchy
with gang rule.”
Larry Sanger Wikipedia co-founder
Know It All: Can Wikipedia Conquer Expertise? Stacy Schiff, New Yorker, July 31, 2006
Task #1: Descriptive Modeling
Modeling Growth of Wikipedia
Task #2: User Classification
vs.
Gnome
Troll
• Wiki Gnome: user that keeps a low profile, fixing
typos, poor grammar and broken links
• Wiki Troll: disruptive user who persistently violates
the site’s guidelines
Task #3: Text Classification
Three Wikipedia Content Guidelines:
1.
NPOV: represent views fairly and without
bias
2. Verifiability
3. No original research
#4: Link Prediction/Completion
• Identify where links should exist
• As Wikipedia grows, it becomes harder for any given
author to know about other relevant stuff they
can/should link to from some article.
• Some method that could help with this (link
suggestion, auto linking, etc.) would potentially be
very useful.
• Evaluation: Generate a dataset by taking a given set
of wikipedia pages, removing some of the existing
links, and then see if a system could identify those
places and suggest appropriate links.
Other Link Mining Tasks
• Trust/Reputation analysis
– “Gives no privilege to those who know what they are talking
about”, William Connolley, climate modeler and Wikipedia
admin
• Social network analysis
– Identification of communities
• Accuracy
– Nature comparison with Britannica (4-3 error ratio)
• Misuse
– Vandalism and self-promotion
• Coverage
– Which areas aren’t covered, or are poorly covered/linked?
But none of these are grand challenges…
• According to wikipedia
The Wikipedia Grand Challenge
The Wikipedia Test:
Given a collection of entries constructed via
participatory journalism (PJ) vs. link mining (LM),
Can you distinguish between PJ and LM? Which is
better?
Evaluation:
Via a panel of human experts
Via page rank
Solution will require a variety of
integrated link mining capabilities
$$ Already Available…
• Hutter prize
http://prize.hutter1.net/
• 50,000 € ≈ $64,000
http://en.wikipedia.org/wiki/The_64%2C0
00_Dollar_Question