Data and Applications Security
Developments and Directions
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Data Warehousing, Data Mining and Security
September 2014
Outline
 Background on Data Warehousing
 Security Issues for Data Warehousing
 Data Mining and Security
What is a Data Warehouse?
 A Data Warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions
- From: Building the Data Warehouse by W. H. Inmon, John Wiley and Sons
 Integration of heterogeneous data sources into a repository
 Summary reports, aggregate functions, etc.
Example Data Warehouse
[Figure: users query the warehouse. The data warehouse holds data correlating employees with medical benefits and projects, drawn from three sources: an Oracle DBMS for employees, a Sybase DBMS for projects, and an Informix DBMS for medical data. The warehouse itself could be any DBMS, usually based on the relational data model.]
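A minimal sketch of this example, assuming three tiny in-memory tables stand in for the Oracle, Sybase and Informix sources. The table names, column names and sample rows are hypothetical; only the idea of correlating employees with medical benefits and projects comes from the slide.

```python
import sqlite3

def build_warehouse():
    """Load three stand-in sources and derive one warehouse table from them."""
    # Hypothetical extracts; in the slide these live in separate DBMSs.
    employees = [(1, "Ann"), (2, "Bob")]
    projects = [(1, "Apollo"), (2, "Zephyr")]   # (employee_id, project)
    medical = [(1, "Plan A"), (2, "Plan B")]    # (employee_id, benefit plan)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE projects (employee_id INTEGER, project TEXT)")
    conn.execute("CREATE TABLE medical (employee_id INTEGER, plan TEXT)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", employees)
    conn.executemany("INSERT INTO projects VALUES (?, ?)", projects)
    conn.executemany("INSERT INTO medical VALUES (?, ?)", medical)

    # The warehouse table correlates employees with medical benefits and projects.
    conn.execute(
        """CREATE TABLE warehouse AS
           SELECT e.id, e.name, p.project, m.plan
           FROM employees e
           JOIN projects p ON p.employee_id = e.id
           JOIN medical m ON m.employee_id = e.id"""
    )
    return conn

def query_warehouse(conn):
    """Users query the warehouse rather than the individual sources."""
    return conn.execute(
        "SELECT name, project, plan FROM warehouse ORDER BY id"
    ).fetchall()
```

In a real deployment each source would be reached through its own driver and the load step would run on a schedule; the single in-memory database here only keeps the sketch self-contained.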
Some Data Warehousing Technologies
 Heterogeneous Database Integration
 Statistical Databases
 Data Modeling
 Metadata
 Access Methods and Indexing
 Language Interface
 Database Administration
 Parallel Database Management
Data Warehouse Design
 Appropriate Data Model is key to designing the Warehouse
 Higher Level Model in stages
- Stage 1: Corporate data model
- Stage 2: Enterprise data model
- Stage 3: Warehouse data model
 Middle-level data model
- A model, possibly one for each subject area in the higher-level model
 Physical data model
- Include features such as keys in the middle-level model
 Need to determine appropriate levels of granularity of data in order
to build a good data warehouse
Distributing the Data Warehouse
 Issues similar to distributed database systems
[Figure: two configurations. Non-distributed warehouse: Branch A, Branch B and the Central Bank all feed a single Central Warehouse. Distributed warehouse: Branch A has a Branch A Warehouse, Branch B has a Branch B Warehouse, and the Central Bank has the Central Warehouse.]
Multidimensional Data Model
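[Figure: multidimensional data model for projects. Project attributes: Project Name, Project Leader, Project Sponsor. Project Duration measured in Years, Months or Weeks. Project Cost measured in Dollars, Pounds or Yen.]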
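A small sketch of how a multidimensional model supports aggregation, assuming project cost is the measure and that project, year and currency are the dimensions. The sample cells and the `roll_up` helper are hypothetical illustrations, not part of the slides.

```python
from collections import defaultdict

# Hypothetical fact cells: (project name, year, currency) -> project cost.
FACTS = {
    ("Apollo", 2013, "Dollars"): 100.0,
    ("Apollo", 2014, "Dollars"): 150.0,
    ("Zephyr", 2014, "Dollars"): 80.0,
    ("Zephyr", 2014, "Pounds"): 40.0,
}
DIMENSIONS = ("project", "year", "currency")

def roll_up(facts, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for key, cost in facts.items():
        totals[tuple(key[i] for i in positions)] += cost
    return dict(totals)
```

For example, `roll_up(FACTS, ("year",))` collapses the project and currency dimensions and leaves one total per year. Note that it adds costs across currencies without conversion, which a real warehouse would need to handle.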
Indexing for Data Warehousing
 Bit-Maps
 Multi-level indexing
 Storing parts or all of the index files in main memory
 Dynamic indexing
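To illustrate the bit-map item, here is a minimal sketch of a bitmap index that uses Python integers as bit vectors. The rows and column names are made up for the example; production systems typically use compressed bitmaps.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bit vector."""
    index = {}
    for position, row in enumerate(rows):
        value = row[column]
        # Set the bit for this row position in the bitmap of its value.
        index[value] = index.get(value, 0) | (1 << position)
    return index

def matching_rows(bitmap):
    """Return the row positions whose bit is set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical warehouse rows.
ROWS = [
    {"dept": "sales", "region": "east"},
    {"dept": "eng", "region": "west"},
    {"dept": "sales", "region": "west"},
    {"dept": "eng", "region": "east"},
]
dept_index = build_bitmap_index(ROWS, "dept")
region_index = build_bitmap_index(ROWS, "region")

# dept = 'sales' AND region = 'west' becomes a bitwise AND of two bitmaps.
sales_west = dept_index["sales"] & region_index["west"]
```

Bitmaps suit warehouse columns with few distinct values, because multi-column predicates reduce to cheap bitwise operations.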
Metadata Mappings
[Figure: metadata for the warehouse is connected to the metadata for Data Source A, Data Source B and Data Source C, in each case through metadata for mappings and transformations.]
Data Warehousing and Security
 Security for integrating the heterogeneous data sources into
the repository
- e.g., Heterogeneous Database System Security, Statistical Database Security
 Security for maintaining the warehouse
- Query, Updates, Auditing, Administration, Metadata
 Multilevel Security
- Multilevel Data Models, Trusted Components
Example Secure Data Warehouse
[Figure: the user accesses a Secure Data Warehouse Manager, which manages the Secure Warehouse and draws on Secure DBMS A, Secure DBMS B and Secure DBMS C, each with its own Secure Database.]
Secure Data Warehouse Technologies
 Secure data modeling
 Secure heterogeneous database integration
 Database security
 Secure access methods and indexing
 Secure query languages
 Secure database administration
 Secure high performance computing technologies
 Secure metadata management
Security for Integrating Heterogeneous Data Sources
 Integrating multiple security policies into a single policy for the warehouse
- Apply techniques for federated database security?
- Need to transform the access control rules
 Security impact on schema integration and metadata
- Maintaining transformations and mappings
 Statistical database security
- Inference and aggregation
- e.g., the average salary in the warehouse could be unclassified while the individual salaries in the databases could be classified
 Administration and auditing
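The salary example can be made concrete with a short sketch. It assumes a warehouse that releases averages only over sufficiently large groups (a query-set-size restriction, which is an assumption added here) and shows that two permitted averages can still be combined to recover a single classified salary. All names, salaries and the threshold are hypothetical.

```python
# Hypothetical classified salaries held in the source databases.
SALARIES = {"ann": 50000, "bob": 60000, "cy": 70000, "di": 80000}
MIN_QUERY_SET = 3  # assumed threshold; a real system would set this by policy

def average_salary(names):
    """Release an average only if the query set is large enough."""
    known = {n for n in names if n in SALARIES}
    if len(known) < MIN_QUERY_SET:
        raise PermissionError("query set too small")
    return sum(SALARIES[n] for n in known) / len(known)

def infer_individual(target):
    """Differencing attack: combine two permitted averages into one salary."""
    everyone = set(SALARIES)
    avg_all = average_salary(everyone)
    avg_rest = average_salary(everyone - {target})
    # total(all) - total(all except target) = target's salary
    return avg_all * len(everyone) - avg_rest * (len(everyone) - 1)
```

Each call to `average_salary` is allowed on its own, yet `infer_individual("di")` reconstructs one person's salary. This is why statistical database security has to consider combinations of queries, not just individual ones.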
Security Policy for the Warehouse
[Figure: layered policy architecture. At the bottom, component policies for Components A, B and C. Above them, generic policies for Components A, B and C. Above those, export policies (for Components A, B, B and C). At the top, federated policies for Federations F1 and F2.]
Security Policy Integration and Transformation
Federated policies become warehouse policies?
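The slides leave the integration rule open. As one possible reading, the sketch below transforms per-source read rules onto warehouse attributes and combines them conservatively: a role may read a warehouse attribute only if every contributing source allows it. The function name, the data layout and the most-restrictive rule are all assumptions made for illustration.

```python
def integrate_policies(source_policies, mapping):
    """Transform per-source rules onto warehouse attributes and combine them.

    source_policies: {source: {source_attribute: set of roles allowed to read}}
    mapping: {(source, source_attribute): warehouse_attribute}
    """
    warehouse_policy = {}
    for (source, attribute), target in mapping.items():
        # A source attribute with no rule is treated as readable by nobody.
        allowed = source_policies[source].get(attribute, set())
        if target in warehouse_policy:
            # Most-restrictive combination: intersect the allowed roles.
            warehouse_policy[target] &= allowed
        else:
            warehouse_policy[target] = set(allowed)
    return warehouse_policy
```

Other combination rules (for example, a union of permissions or explicit priorities between sources) are equally possible; which one is appropriate depends on the policies being integrated.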
Security Policy for the Warehouse - II
[Figure: the policy for the warehouse is connected to the policies for Data Source A, Data Source B and Data Source C, in each case through a policy for mappings and transformations.]
Secure Data Warehouse Model
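[Figure: the multidimensional project model with a classification label on each element. Project Name (U), Project Leader (U), Project Sponsor (S). Project Duration (U): Year (U), Months (U), Weeks (U). Project Cost (S): Dollars (S), Pounds (S), Yen (S). U = Unclassified, S = Secret.]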
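A minimal sketch of how such labels might be enforced, assuming a simple two-level ordering (Unclassified below Secret) and read access only when the user's clearance is at least the attribute's label. The labels follow the slide; the attribute identifiers and the `visible_view` helper are hypothetical.

```python
LEVELS = {"U": 0, "S": 1}  # Unclassified < Secret

# Attribute labels as given in the slide.
ATTRIBUTE_LABELS = {
    "project_name": "U",
    "project_leader": "U",
    "project_sponsor": "S",
    "project_cost": "S",
    "project_duration": "U",
}

def visible_view(record, clearance):
    """Return only the attributes whose label the user's clearance dominates."""
    return {
        attribute: value
        for attribute, value in record.items()
        if LEVELS[ATTRIBUTE_LABELS[attribute]] <= LEVELS[clearance]
    }
```

An Unclassified user sees the project name, leader and duration, while a Secret user also sees the sponsor and cost.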
Methodology for Developing a Secure Data Warehouse
[Figure: process flow starting from the secure data sources. Integrate the secure data sources; clean/modify the data sources and integrate policies; build the secure data model, schemas, access methods, and index strategies for the secure warehouse.]
Multi-Tier Architecture
[Figure: stack of tiers. Tier 1: Secure Data Sources. Tier 2: builds on Tier 1. Intermediate tiers follow the same pattern. Tier N: Secure Data Warehouse, builds on Tier N-1. Each layer builds on the previous layer's schemas, metadata and policies.]
Administration
 Roles of Database Administrators, Warehouse Administrators, Database System Security Officers, and Warehouse System Security Officers?
 When databases are updated, can a trigger mechanism be used to automatically update the warehouse?
- i.e., will the individual database administrators permit such a mechanism?
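To picture the trigger question, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the source table and the warehouse summary table sit in the same database, which simplifies the setting in the slides where they are separate systems. Table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, salary REAL);
    CREATE TABLE warehouse_dept_totals (dept TEXT PRIMARY KEY, total REAL);

    -- Every insert into the source table updates the warehouse summary.
    CREATE TRIGGER propagate_insert AFTER INSERT ON employees
    BEGIN
        INSERT OR IGNORE INTO warehouse_dept_totals VALUES (NEW.dept, 0);
        UPDATE warehouse_dept_totals
        SET total = total + NEW.salary
        WHERE dept = NEW.dept;
    END;
    """
)
conn.execute("INSERT INTO employees VALUES (1, 'eng', 100.0)")
conn.execute("INSERT INTO employees VALUES (2, 'eng', 50.0)")
totals = conn.execute("SELECT dept, total FROM warehouse_dept_totals").fetchall()
```

The trigger is defined on the source table, so it needs the cooperation of whoever administers that database, which is exactly the administrative issue the slide raises.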
Auditing
 Should the Warehouse be audited?
- Advantage: keeps up-to-date information on access to the warehouse
- Disadvantage: may need to keep unnecessary data in the warehouse
- Disadvantage: may need a lower-level granularity of data
- Disadvantage: may cause changes to the timing of data entry to the warehouse as well as backup and recovery restrictions
 Need to determine the relationships between auditing the warehouse and auditing the databases
Multilevel Security
 Multilevel data models
- Extensions to the data warehouse model to support
classification levels
 Trusted Components
- How much of the warehouse should be trusted?
- Should the transformations be trusted?
 Covert channels, inference problem
Inference Controller
[Figure: the user's requests pass through an Inference Controller before reaching the Secure Data Warehouse Manager, which manages the Secure Warehouse and draws on Secure DBMS A, Secure DBMS B and Secure DBMS C, each with its own Secure Database.]
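As a rough illustration of what an inference controller might do, the sketch below sits between the user and the warehouse and keeps a history of what each user has already received. It refuses any request that would complete a combination of attributes declared sensitive when released together. The rule format, class name and attribute names are assumptions; the slides do not specify a design.

```python
class InferenceController:
    """Mediates requests and tracks which attributes each user has seen."""

    def __init__(self, forbidden_combinations):
        # Each entry is a set of attributes that must never all reach one user.
        self.forbidden = [frozenset(c) for c in forbidden_combinations]
        self.history = {}

    def request(self, user, attributes):
        """Record and allow the release, or refuse if it completes a forbidden set."""
        seen = self.history.get(user, set()) | set(attributes)
        if any(combination <= seen for combination in self.forbidden):
            return False  # refused; the user's history is left unchanged
        self.history[user] = seen
        return True
```

Because the check uses the accumulated history, a user cannot obtain a forbidden combination by splitting it across several individually harmless queries.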
Status and Directions
 Commercial data warehouse vendors are incorporating role-based security (e.g., Oracle)
 Many topics need further investigation
- Building a secure data warehouse
- Policy integration
- Secure data model
- Inference control
Data Mining for Counter-terrorism
[Figure: data mining for counterterrorism splits into two branches. Data mining for non-real-time threats: gather data, build terrorist profiles, mine data, prune results. Data mining for real-time threats: gather data in real-time, build real-time models, mine data, report results.]
Data Mining Needs for Counterterrorism: Non-real-time Data Mining
 Gather data from multiple sources
- Information on terrorist attacks: who, what, where, when, how
- Personal and business data: place of birth, ethnic origin,
religion, education, work history, finances, criminal record,
relatives, friends and associates, travel history, . . .
- Unstructured data: newspaper articles, video clips, speeches,
emails, phone records, . . .
 Integrate the data, build warehouses and federations
 Develop profiles of terrorists, activities/threats
 Mine the data to extract patterns of potential terrorists and predict
future activities and targets
 Find the “needle in the haystack” - suspicious needles?
 Data integrity is important
 Techniques have to SCALE
Data Mining for Non Real-time Threats
[Figure: process flow. Data sources with information about terrorists and terrorist activities feed the steps: integrate data sources; clean/modify data sources; build profiles of terrorists and activities; mine the data; examine results / prune results; report final results.]
Data Mining Needs for Counterterrorism: Real-time Data Mining
 Nature of data
- Data arriving from sensors and other devices: continuous data streams
- Breaking news, video releases, satellite images
- Some critical data may also reside in caches
 Rapidly sift through the data and set aside unwanted data for later use and analysis (non-real-time data mining)
 Data mining techniques need to meet timing constraints
 Quality of service (QoS) tradeoffs among timeliness, precision and
accuracy
 Presentation of results, visualization, real-time alerts and triggers
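A minimal sketch of the sifting idea, assuming numeric sensor readings, a caller-supplied relevance test, and an alert when the average of a small sliding window exceeds a threshold. The window size, threshold and function names are hypothetical; the bounded window is one simple way to keep per-item work constant under timing constraints.

```python
from collections import deque

def sift_stream(readings, is_relevant, window_size=3, alert_threshold=10.0):
    """Route each reading: relevant ones feed a bounded window, the rest are set aside."""
    window = deque(maxlen=window_size)  # recent relevant readings only
    deferred = []                       # kept for later, non-real-time mining
    alerts = []
    for reading in readings:
        if not is_relevant(reading):
            deferred.append(reading)
            continue
        window.append(reading)
        # Alert once the window is full and its average crosses the threshold.
        if len(window) == window_size and sum(window) / window_size > alert_threshold:
            alerts.append(reading)
    return alerts, deferred
```

A larger window gives a steadier average but reacts more slowly, which is one instance of the tradeoff between timeliness and accuracy noted above.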
Data Mining for Real-time Threats
[Figure: process flow. Data sources with information about terrorists and terrorist activities feed the steps: integrate data sources in real-time; rapidly sift through data and discard irrelevant data; build real-time models; mine the data; examine results in real-time; report final results.]
Data Mining Outcomes and Techniques for Counter-terrorism
 Classification: build profiles of terrorists and classify terrorists
 Association: John and James often seen together after an attack
 Link Analysis: follow chain from A to B to C to D
 Clustering: divide population; people from country X of a certain religion; people from country Y interested in airplanes
 Anomaly Detection: John registers at flight school but does not care about takeoff or landing
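Two of these techniques are easy to sketch on toy data. The code below shows link analysis as a breadth-first search for a chain between two entities, and association as counting which pairs of entities co-occur in enough events. The graph, the event lists and the function names are hypothetical, and real systems use far richer models.

```python
from collections import Counter, deque
from itertools import combinations

def follow_chain(links, start, goal):
    """Link analysis: breadth-first search for a shortest chain from start to goal."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parents to rebuild the chain.
            chain = []
            while node is not None:
                chain.append(node)
                node = parents[node]
            return chain[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # no chain exists

def frequent_pairs(events, min_count):
    """Association: pairs of entities that co-occur in at least min_count events."""
    counts = Counter()
    for entities in events:
        counts.update(combinations(sorted(set(entities)), 2))
    return {pair for pair, n in counts.items() if n >= min_count}
```

With links A to B, B to C and C to D, `follow_chain` returns the chain from A to D; `frequent_pairs` reports a pair such as John and James once they appear together often enough.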
Example Success Story - COPLINK
 COPLINK developed at University of Arizona
- Research transferred to an operational system currently in use by Law Enforcement Agencies
 What does COPLINK do?
- Provides integrated system for law enforcement; integrating law enforcement databases
- If a crime occurs in one state, this information is linked to similar cases in other states
- It has been stated that the sniper shooting case may have been solved earlier if COPLINK had been operational at that time
Where are we now?
 We have some tools for
- building data warehouses from structured data
- integrating structured heterogeneous databases
- mining structured data
- forming some links and associations
- information retrieval tools
- image processing and analysis
- pattern recognition
- video information processing
- visualizing data
- managing metadata
What are our challenges?
 Do the tools scale for large heterogeneous databases and petabyte-sized databases?
 Building models in real-time; need training data
 Extracting metadata from unstructured data
 Mining unstructured data
 Extracting useful patterns from knowledge-directed data mining
 Rapidly forming links and associations; get the big picture for real-time data mining
 Detecting/preventing cyber attacks
 Mining the web
 Evaluating data mining algorithms
 Conducting risk analysis / economic impact
 Building testbeds
IN SUMMARY:
 Data Mining is very useful to solve Security Problems
- Data mining tools could be used to examine audit data and flag abnormal behavior
- Much recent work in intrusion detection, e.g., neural networks to detect abnormal patterns
- Tools are being examined to determine abnormal patterns for national security: classification techniques, link analysis
- Fraud detection: credit cards, calling cards, identity theft, etc.
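As a very simple stand-in for examining audit data and flagging abnormal behavior, the sketch below flags users whose audit-event count lies far above the mean. It uses a plain z-score test rather than the neural networks mentioned above, and the counts, threshold and function name are hypothetical.

```python
from statistics import mean, pstdev

def flag_abnormal(counts_by_user, threshold=2.0):
    """Flag users whose event count is more than `threshold` std devs above the mean."""
    values = list(counts_by_user.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every user behaves identically; nothing stands out
    return sorted(
        user for user, count in counts_by_user.items()
        if (count - mu) / sigma > threshold
    )
```

A single extreme value inflates both the mean and the standard deviation, so a fixed threshold like this can miss anomalies in small samples; robust statistics or learned models are common alternatives.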
BUT CONCERNS FOR PRIVACY