Download The Weka Data Mining Software

Paper Presentation
Authors:
Xiaofang Li
Changzhou Institute Of Technology
Changzhou, China
Yingchi Mao
Hohai University
Nanjing, China
Guided By : Prof. Meiliu Lu
Presenting:
Pranavi Appana
Neelam Baviskar
Pallavi Vardhamane
Agenda
● Background and Problem Statement
● Challenges and Problem Solution
● Improvised ETL Framework
● Dynamic Mirror Replication Technology
● Performance Evaluation
● Opinion
● Project Proposal
Background & Problem Definition
 Why use a real-time data warehouse?
The load cycle of a traditional data warehouse is fixed and long,
so it cannot respond in time to rapid data changes. A real-time
data warehouse can capture rapid data changes and process real-time data.
 Problem statements:
1. To get real-time data access from the real-time data warehouse
without processing delay
2. To avoid query contention between OLAP queries and OLTP
updates
Challenges and Problem Solutions
 Challenges
- Enabling real-time ETL
- Data aggregation operations are not synchronized with the real-time data
 Solutions
- Improvised ETL framework
- Dynamic mirror replication technology
Improvised ETL framework
Fig 1: The pre-processing framework for real-time data warehouse
Dynamic mirror replication technology
 Dynamic mirror creation and allocation
- Create mirror files and initiate the bucket link
 Dynamic mirror release
- Load data into the warehouse and release the DSA
 The procedure of query processing
- Retrieve the data image from the dynamic data storage based on the
obtained data_id and process it
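The three steps above can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the class name DynamicMirrorStore, its methods, and the use of plain dictionaries are all assumptions made for illustration, and the DSA is read here as the dynamic data storage mentioned on the slide. The idea shown is only that updates land in a mirror area, queries look up the freshest image by data_id, and a release step loads the mirrored rows into the warehouse and frees the area.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicMirrorStore:
    """Toy model (hypothetical): a warehouse plus a dynamic storage area."""

    warehouse: dict = field(default_factory=dict)  # data_id -> stable row
    dsa: dict = field(default_factory=dict)        # data_id -> mirrored fresh row

    def create_mirror(self, data_id, row):
        # An incoming update is written as a mirror copy in the DSA, so the
        # warehouse row that OLAP queries may be reading is left untouched.
        self.dsa[data_id] = dict(row)

    def query(self, data_id):
        # Query processing: prefer the data image in the DSA for this
        # data_id, and fall back to the warehouse if no mirror exists.
        if data_id in self.dsa:
            return self.dsa[data_id]
        return self.warehouse.get(data_id)

    def release(self):
        # Mirror release: load mirrored rows into the warehouse in one
        # batch, then free the DSA. Returns the number of rows loaded.
        loaded = len(self.dsa)
        self.warehouse.update(self.dsa)
        self.dsa.clear()
        return loaded
```

In this sketch the warehouse only changes during release(), which is the property that keeps OLAP reads and OLTP-style updates from contending on the same rows.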
Performance Evaluation
 Experiment Settings
- The OLAP query response time at different update intervals
 Experimental Results
- The OLAP query response time for different DSA sizes
The query response time at different update intervals
Our opinion on the research paper
 We agree with the solutions the authors propose for
enabling real-time ETL and for the data/query contention problem.
 We suggest an additional solution: using MetaMatrix with
DataMigrator. This would help solve the above problems and improve
query efficiency.
References
 Xiaofang Li, Yingchi Mao. Real-Time Data ETL Framework for Big
Real-Time Data Analysis. 2015 IEEE International Conference on
Information and Automation, pp. 1289-1294, August 2015.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7279485
Team :
Pranavi Appana
Neelam Baviskar
Pallavi Vardhamane
Spring 2016
Agenda
 Background and Motivation
 Purpose and Scope
 Queries
 Objectives
 Resources
 Schedule
 References
Background and Motivation
 Dataset:
https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1
The Government of California provides a list of relevant financial
reports in the above dataset. The dataset details the expenditures,
revenues and state income of all departments, with income generated
in the form of fees, penalties and taxes.
Purpose and Scope : DW and DM
Purpose:
Develop a tool/web application for state employees and the public
that presents important financial information on the government’s funding and
income.
Scope:
Multiple relevant datasets are available, covering billions of data
records. We limit the scope to customer-specific requirements such as
city, county, department and yearly financial data.
Queries
1. What is the county-wise and city-wise state income?
2. What is the business category under which this state income (taxes, fees)
was generated?
3. Which sub-departments are responsible for the maximum income
collection?
4. Determine the expenditures for a particular department.
Example: sewage, water, safety, public departments
5. Estimate and determine which cities/counties are in profit or running in
debt due to high expenses.
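As an illustration of query 5, the sketch below computes revenue minus expenditure per city for one fiscal year. The table names (revenue, expenditure), their columns, the city names and the amounts are all made up for this example and are not taken from the actual dataset. SQLite is used only so the example runs without a server; the same SQL should carry over to MySQL or MariaDB with little change.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revenue (city TEXT, fiscal_year INTEGER, amount REAL);
CREATE TABLE expenditure (city TEXT, fiscal_year INTEGER, amount REAL);
INSERT INTO revenue VALUES
    ('Alpha', 2015, 500.0), ('Alpha', 2015, 250.0), ('Beta', 2015, 300.0);
INSERT INTO expenditure VALUES
    ('Alpha', 2015, 600.0), ('Beta', 2015, 450.0);
""")


def net_position(conn, fiscal_year):
    """Return (city, revenue minus expenditure) pairs for one fiscal year."""
    sql = """
    SELECT r.city, r.total - e.total AS net
    FROM (SELECT city, SUM(amount) AS total
          FROM revenue WHERE fiscal_year = ? GROUP BY city) AS r
    JOIN (SELECT city, SUM(amount) AS total
          FROM expenditure WHERE fiscal_year = ? GROUP BY city) AS e
      ON r.city = e.city
    ORDER BY r.city
    """
    return conn.execute(sql, (fiscal_year, fiscal_year)).fetchall()


for city, net in net_position(conn, 2015):
    print(city, "in profit" if net >= 0 else "in debt", net)
```

Each side is aggregated before the join so that a city with several revenue rows is not counted once per expenditure row.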
Objectives
 Analyze, clean and prune the data.
 Create sample of data marts for different generic purposes to solve the
problems.
 Design database schema.
 Design a data warehouse application.
 Load data to warehouse and perform user queries.
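The first objective, cleaning and pruning, might look like the sketch below for a CSV export. The column names county and amount are assumptions and would need to match the real files. Rows missing a key field or carrying a non-numeric amount are dropped, and currency formatting is stripped before conversion.

```python
import csv
import io


def clean_rows(csv_text):
    """Normalize county names and amounts; prune rows that cannot be used."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        county = (row.get("county") or "").strip().title()
        raw_amount = (row.get("amount") or "")
        raw_amount = raw_amount.replace("$", "").replace(",", "").strip()
        if not county or not raw_amount:
            continue  # prune rows missing a key field
        try:
            amount = float(raw_amount)
        except ValueError:
            continue  # prune rows whose amount is not numeric
        cleaned.append({"county": county, "amount": amount})
    return cleaned
```

The cleaned rows can then be bulk-inserted into the warehouse tables once the schema is designed.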
Resources
 Data visualization
- OffVis
 Database Development
- MySQL, MariaDB
 Data warehouse
- PHP, HTML5, CSS
 Data mining
- RapidMiner, WEKA
Schedule
 Week 1: Data analysis, cleaning and pruning. Designing Schema.
 Week 2: Creating data mart samples. Designing warehouse application.
 Week 3: Applying data mining. Applying Query processing.
 Week 4: Testing and Documentation.
References
 California State Controller’s Office, Government Financial Reports,
Datasets,
https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1
This website gives consolidated information about government
expenditures in the state of California.
 Xiaofang Li, Yingchi Mao. Real-Time Data ETL Framework for Big Real-Time
Data Analysis. 2015 IEEE International Conference on Information and
Automation, pp. 1289-1294, August 2015.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7279485
Thank You!