Applying Semantics to Unstructured
Data (Big and Getting Bigger)
Wednesday, November 30, 2012
4:00 – 5:00
Bryan Bell
Vice President, Enterprise Solutions, Expert System
Lynda Moulton
Analyst & Consultant, LWM Technology Services
Peter O'Kelly
Principal Analyst, O'Kelly Associates
Overall Session Agenda
• Introduction and context-setting
• "Big Data" 101 for Business
• Semantics and the Big Data Opportunity
Big Data 101 Agenda
• Big data in context
• Recap
• Risks
• Recommendations
Big Data in Context
• What is “big data”?
– Unhelpfully, both “big data” and “NoSQL,” generally
considered a key part of the big data wave, are defined
more in terms of what they aren’t than what they are
– A typical big data definition (Wikipedia):
• “[…] data sets that grow so large that they become awkward to
work with using on-hand database management tools”
– Often associated with Gartner’s volume, variety (and
complexity), and velocity model
• Also value and veracity considerations
Big Data in Context
• Why is big data a big deal now?
– Commoditized hardware, software, and networking
• Capability and price/performance curves that continue to
defy all economic “laws”
• Cloud services with radical new capability/cost equations
– Maturation and uptake of related open source
software, especially Hadoop
• Powerful and often no- or low-cost
Big Data in Context
• Why is big data a big deal now (continued)?
– Market enthusiasm for “NoSQL” systems
– Useful and often “open source”/public domain data
sources and services
– Mainstreaming of semantic tools and techniques
A Prime Minicomputer, c1982
Fast-Forward to 2012
Google BigQuery
Hadoop
• Hadoop is often considered central to big data
– Inspired by Google’s MapReduce architecture,
Apache Hadoop is an open source framework for
distributed processing on networks of commodity
hardware
– From Wikipedia:
• “’Map’ step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes
• ‘Reduce’ step: The master node then collects the answers to
all the sub-problems and combines them in some way to
form the output – the answer to the problem it was
originally trying to solve”
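The map/reduce flow quoted above can be sketched in a few lines of Python. This is a single-process illustration of the classic word-count pattern only, not how Hadoop itself works: in Hadoop, the map and reduce phases run on distributed worker nodes.

```python
from collections import defaultdict
from itertools import chain

def map_step(document):
    """'Map': split one input document into (word, 1) pairs."""
    return [(word, 1) for word in document.split()]

def reduce_step(mapped_pairs):
    """'Reduce': combine all per-word pairs into the final counts."""
    counts = defaultdict(int)
    for word, count in mapped_pairs:
        counts[word] += count
    return dict(counts)

# The "master node" role: fan documents out to map, gather results into reduce.
documents = ["big data", "big data gets bigger"]
pairs = chain.from_iterable(map_step(d) for d in documents)
print(reduce_step(pairs))  # {'big': 2, 'data': 2, 'gets': 1, 'bigger': 1}
```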
Hadoop
• Commercial application domains include (from
Wikipedia)
– Log and/or clickstream analysis of various kinds
– Marketing analytics
– Machine learning and/or sophisticated data mining
– Image processing
– Processing of XML messages
– Web crawling and/or text processing
– General archiving, including of relational/tabular data, e.g. for compliance
Hadoop
• Hadoop is popular and rapidly evolving
– Most leading information management vendors
have embraced Hadoop
– There is now a Hadoop ecosystem
Meanwhile, Back in the Googleplex
• Dremel, BigQuery, Spanner, and other really
big data projects
Google Now
A NoSQL Taxonomy
• From the NoSQL Wikipedia article:
A View of the NoSQL Landscape
Another NoSQL Landscape View
NoSQL Perspectives
• The “NoSQL” meme confusingly conflates
– Document database requirements
• Best served by XML DBMS (XDBMS)
– Physical database model decisions on which only DBAs and
systems architects should focus
• And which are more complementary than competitive with DBMS
– Object databases, which have floundered for decades
• But with which some application developers are nonetheless
enamored, for minimized “impedance mismatch,” despite significant
information management compromises
– Semantic (e.g., RDF) models
• Also more complementary than competitive with RDBMS/XDBMS
• Also consider: the “traditional” DBMS players can leverage
the same underlying technology power curves
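The semantic (RDF) model mentioned above represents data as subject-predicate-object triples. A minimal sketch of that idea, using plain Python tuples rather than a real triple store; the facts shown are illustrative:

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("Hadoop", "isA", "distributed framework"),
    ("Hadoop", "inspiredBy", "MapReduce"),
    ("MapReduce", "createdBy", "Google"),
}

def query(subject=None, predicate=None, obj=None):
    """Pattern-match over the triples; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

print(query(subject="Hadoop"))  # the two facts about Hadoop, in arbitrary order
```

This wildcard-matching style is the essence of how SPARQL patterns query RDF graphs; a relational system answers the same questions, which is why the slide calls the models more complementary than competitive.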
Data as a Service
• The (single source of) truth is out there?...
– High-quality data sources are being commoditized
– Value is shifting to the ability to discern and leverage conceptual
connections, not just to manage big databases
• Some resources and developments to explore
– Social networking graphs and activities
– Data.com (Salesforce.com)
– Data.gov
– Google Knowledge Graph
– Linked Data
– Microsoft Windows Azure Data Marketplace
– Wikidata.org
– Wolfram Alpha
Mainstreaming Semantics
• Tools and techniques applied in search of
more meaning, e.g.,
– Vocabulary management
– Disambiguation and auto-categorization
– Text mining and analysis
– Context and relationship analysis
• It’s still ideal to help people capture and apply
data and metadata in context
– Semantic tools/techniques are complementary
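As a toy illustration of vocabulary management and auto-categorization: a document can be assigned to the categories whose controlled-vocabulary terms it mentions. The categories and terms below are invented for illustration, not drawn from any product discussed here.

```python
# Hypothetical controlled vocabulary: category -> set of indicative terms.
VOCABULARY = {
    "finance": {"revenue", "earnings", "profit"},
    "technology": {"hadoop", "database", "software"},
}

def categorize(text):
    """Return the sorted categories whose terms appear in the text."""
    words = set(text.lower().split())
    return sorted(cat for cat, terms in VOCABULARY.items() if words & terms)

print(categorize("Quarterly revenue rose as the software unit grew"))
# ['finance', 'technology']
```

Real semantic platforms layer disambiguation, stemming, and relationship analysis on top of this kind of term matching, but the basic vocabulary-driven pattern is the same.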
Mainstreaming Semantics
• The Semantic Web is still more vision than reality
– But Google, Microsoft, Yahoo, and Yandex, for
example, are improving Web searches by capturing
and applying more metadata and relationships via
schema.org schemas in Web pages
– And Google’s Knowledge Graph is about “things, not
strings,” with, as of mid-2012, “500 million objects, as
well as more than 3.5 billion facts about and
relationships between these different objects”
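As a sketch of the kind of schema.org metadata referred to above: structured data can be embedded in a Web page for search engines to consume, shown here serialized as JSON-LD, one of the formats schema.org supports. The article details are made up for illustration.

```python
import json

# Hypothetical schema.org "Article" metadata, built as a Python dict.
metadata = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Applying Semantics to Unstructured Data",
    "author": {"@type": "Person", "name": "Example Author"},
}

# Wrap it in the script tag a page would embed it with.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(metadata, indent=2)
    + "\n</script>"
)
print(snippet)
```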
Recap
• Commoditization and cloud
– Very significant new opportunities
• Hadoop and related frameworks
– Complementary to RDBMS and XDBMS
• NoSQL
– Likely headed for meme-bust…
• Data services
– Game-changing potential
• Semantic tools and techniques
– Rapidly gaining momentum
Risks
• The potential for an ever-expanding set of information silos
– Focus on minimized redundancy and optimized integration
• GIGO (garbage in, garbage out) at super-scale
– New opportunities for unprecedented self-inflicted damage, for
organizations that don’t model or query effectively
• Cognitive overreach
– The potential for information workers to create and act on
nonsensical queries based on poorly designed and/or
misunderstood information models
• Skills gaps can create competitive disadvantages
– Modeling, query formulation, and data analysis
– Critical thinking and information literacy
Recommendations
• Aim high: big data is in many respects just
getting started…
– A lot of technology recycling but also
significant and disruptive innovation
• Work to build consensus among stakeholders on the opportunities and risks
• Focus on human skills – e.g., critical
thinking and information literacy
– For now, an instance of the most creative and
powerful type of semantic big data processor
we know of is between your ears