Shanxi HPC Research Center
NLP and Big Data
Xiaoge LI
WBDB2013, Xi’an, China
Introduction
• The Internet is a big knowledge base
  • Unstructured
• NLP & IE
  • "Understand" human language
• Unstructured data → Structured data
Problems
• Human language changes
  • "Let's Google it!"
  • Net language (LOL, 给力 "awesome")
  • Compound words (JFK airport)
• Domain knowledge
  • Domain-specific training sets
• Chinese tokenization ("Xiaoju's life is awesome")
  • 小菊/nr 的/u 生活/vn 很/d 给/v 力/vg (new word 给力 split in two)
  • 小菊/nr 的/u 生活/vn 很/d 给力/a (new word 给力 kept as one adjective)
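The effect of a missing new word on segmentation can be reproduced with a toy segmenter. The sketch below uses forward maximum matching against a small lexicon; this is a stand-in chosen for illustration, not the FST and statistical engine described in these slides, and the function name and lexicon contents are assumptions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring found in the lexicon,
    falling back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "小菊的生活很给力"
base_lexicon = {"小菊", "生活"}  # hypothetical lexicon without the new word

# Without the new word, 给力 falls apart into two characters.
print(forward_max_match(sentence, base_lexicon))
# With the new word added, it is kept as one token.
print(forward_max_match(sentence, base_lexicon | {"给力"}))
```

The only difference between the two calls is one lexicon entry, which is why automatic new-word acquisition matters for tokenization quality.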
NLP needs big data
• Unsupervised (weakly supervised) learning
  • Knowledge acquisition
  • Relationships
  • New words
  • NE gazetteer
System Architecture
[Architecture diagram]
• Components: knowledge acquisition, NLP & IE, information fusion, entity graph
• Platform: MapReduce, HDFS, HBase
• Linux cluster
Knowledge acquisition
• Large-scale corpus from the Web
  • Weakly supervised learning
  • Bootstrapping technique
  • MapReduce, HBase
  • Location NE and new words
  • P = 87.28%, 72.1%
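The bootstrapping idea can be sketched in a few lines: start from a few seed names, learn the contexts they appear in, then harvest other words that appear in the same contexts. The code below is a minimal single-machine illustration under assumed simplifications (pre-tokenized sentences, a context defined as the previous and next word, a fixed count threshold); the slides do not give the actual patterns, scoring, or MapReduce job layout, and the corpus and seeds here are made up.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=2, min_pattern_count=2):
    """Alternate between learning context patterns from known names
    and using trusted patterns to harvest new names."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: count (previous word, next word) contexts of known names.
        patterns = Counter()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if tokens[i] in known:
                    patterns[(tokens[i - 1], tokens[i + 1])] += 1
        trusted = {p for p, c in patterns.items() if c >= min_pattern_count}

        # Step 2: any word seen in a trusted context becomes a candidate.
        found = set()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in trusted:
                    found.add(tokens[i])

        if found <= known:  # nothing new was learned, stop early
            break
        known |= found
    return known


# Hypothetical toy corpus and seed list.
corpus = [
    "he works in Beijing now".split(),
    "she works in Shanghai now".split(),
    "they work in Qingdao now".split(),
    "we eat at noon today".split(),
]
print(sorted(bootstrap(corpus, {"Beijing", "Shanghai"})))
```

The pattern counting in step 1 is a per-sentence emit followed by a sum per key, so it would map naturally onto a map phase and a reduce phase on a large corpus.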

Chinese NLP & IE engine
• Pipeline
• Mixed FST and statistical model
• Input: plain text
• Output: structured XML
• MapReduce
• Speed: 500 KB/s on 10 nodes
Information object
Profile and Event
[Diagram: Information Object]
• Named Entity: Person, Organization, Location, Product, Time
• Event: Pre-defined Event, General Event
Example Profile
In a concept-based profile, the attributes are filled by its participant profiles.
Information Network
NLP
• Tokenization
• POS
• Shallow parsing
• Deep parsing
IE
• NE tag
• CE linkage
• NE Profile
• Profile Merge
Cross-document information fusion
• Information object network
• Vertex: profile
• Edge: relationship
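As a rough picture of the information object network, the sketch below stores profiles as vertices and labeled relationships as edges. The class names, fields, and the sample relationship label are all assumptions made for illustration; the slides do not describe the actual schema or its HBase layout.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """A vertex: one merged profile for an entity (hypothetical fields)."""
    name: str
    kind: str  # e.g. "Person", "Organization", "Location"
    attributes: dict = field(default_factory=dict)


class InformationNetwork:
    """Undirected graph with profiles as vertices and relationships as edges."""

    def __init__(self):
        self.profiles = {}  # name -> Profile
        self.edges = {}     # name -> {neighbor name: relationship label}

    def add_profile(self, profile):
        self.profiles[profile.name] = profile
        self.edges.setdefault(profile.name, {})

    def relate(self, a, b, relationship):
        # Store the edge in both directions so either end can be queried.
        self.edges[a][b] = relationship
        self.edges[b][a] = relationship

    def neighbors(self, name):
        return sorted(self.edges[name])


net = InformationNetwork()
net.add_profile(Profile("China Agricultural University", "Organization"))
net.add_profile(Profile("Beijing", "Location"))
# "located_in" is an illustrative label, not one taken from the slides.
net.relate("China Agricultural University", "Beijing", "located_in")
print(net.neighbors("Beijing"))
```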
Cross-Document Information Fusion
• Hierarchical clustering
• MapReduce, HBase
• Half a million profiles
• Computing complexity
• P = 94.65%, R = 88.24%, F = 91.33%
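The reported F value is consistent with the balanced F-measure, the harmonic mean of precision and recall. A quick check, assuming that is the definition used:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f = f_measure(0.9465, 0.8824)
print(f"{f:.2%}")  # 91.33%
```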
Information Graph, multi-dimension
[Figure: information graph. Node colors: orange = location, gray = organization, blue = person. Source: 2012 People's Daily. Query: China Agricultural University, expanded 1 level]
Organization-Organization Network
[Figure. Query: China Agricultural University, filter: Organization]
Location-Person Network
[Figure. Query: 青岛港 (Qingdao Port), filter: Location]
Person-Location Network
[Figure. Query: 金日成 (Kim Il-sung)]
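The queries in these figures (start from one profile, expand a number of levels, optionally filter by entity type) amount to a bounded breadth-first expansion. The sketch below shows one way this could work on an in-memory adjacency map; the function name, parameters, and the placeholder neighbors "Org A", "Person B", and "Location C" are hypothetical and do not come from the People's Daily data.

```python
def expand(graph, kinds, query, levels=1, keep=None):
    """Expand `levels` hops from `query`.

    graph: name -> iterable of neighbor names
    kinds: name -> entity type
    keep:  optional set of entity types; neighbors of other types are skipped
    Returns the sorted list of undirected edges that were traversed.
    """
    frontier = {query}
    seen = {query}
    edges = set()
    for _ in range(levels):
        next_frontier = set()
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if keep is not None and kinds[neighbor] not in keep:
                    continue
                # Sort the pair so (a, b) and (b, a) count as one edge.
                edges.add(tuple(sorted((node, neighbor))))
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return sorted(edges)


# Hypothetical toy graph with placeholder neighbors.
graph = {
    "China Agricultural University": ["Org A", "Person B", "Location C"],
    "Org A": ["China Agricultural University"],
    "Person B": ["China Agricultural University"],
    "Location C": ["China Agricultural University"],
}
kinds = {
    "China Agricultural University": "Organization",
    "Org A": "Organization",
    "Person B": "Person",
    "Location C": "Location",
}
print(expand(graph, kinds, "China Agricultural University", keep={"Organization"}))
```

Applying the filter during expansion, rather than after it, keeps the frontier small, which matters once each vertex has many neighbors.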
Future Work
• Query language
• Graph mining
• Enhance the NLP engine
• Visualization
Questions?
Thank you