Download A Fast and Simple Method for Maintaining Staleness of Database Menda.Sravani ,Chanti.Suragala

International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 8- Dec 2013
A Fast and Simple Method for Maintaining Staleness of Database
Final M.Tech Student, Assistant Professor
Dept. of CSE, SISTAM College, Srikakulam, Andhra Pradesh
Abstract: In data warehousing, a large amount of data is
stored in the database, and keeping that database up to date
takes considerable time. As the system scales, the processing
time grows as well. We introduce a new framework to address
this problem. It uses grouping and partitioning to schedule
update tasks so that more jobs complete in less time with
respect to execution time and utilization time.
In data warehousing, dealing with a large database raises
many problems when the database is updated. Any transaction
performed on the database takes a considerable amount of time
to update it. If another transaction is processed while the
update is in progress, the two collide on the data. The result
is duplicated data or a violation of the constraints on the
data tables.
When a query is executed on a database table, it may affect
only the root tables, or both the root table and its derived
tables. In the latter case the query results affect both
tables, and propagating them takes more time. Many researchers
have studied this problem in different ways. The approaches
include deadlock-based solutions and processing-time-based
solutions.
A deadlock is the situation in which two or more actions are
waiting on one another to execute. In these databases,
deadlock handling imposes a sequential order: one transaction
must end before another transaction starts. Processing time is
the time taken to complete the execution of the query and the
refresh of the database. Only after this processing time is
over does another transaction start.
The goal of a streaming warehouse is to propagate new data
across all the relevant tables and views as quickly as
possible. Once new data are loaded, the applications and
triggers defined on the warehouse can take immediate action.
This allows businesses to make decisions in nearly real time,
which may lead to increased profits, improved customer
satisfaction, and the prevention of serious problems that
could develop if no action were taken.
In scheduling algorithms, each transaction is generally
considered a job. A job has three properties: utilization
time, processing time, and execution time.
Utilization time: the time taken by a piece of equipment to
complete a particular job.
Processing time: the time taken to execute a particular query
in the database.
Execution time: the time taken by the CPU to complete a
particular job, including the run time of the job.
Using these three properties, a schedule can be framed to
complete the jobs.
Scheduling is mainly used to execute more tasks in less time.
In our work, this means completing update jobs in the data
warehouse in less time. It keeps staleness under control while
the data tables in the database are updated. We can also
reduce CPU utilization time. If the CPU spends too long on a
query at run time, the next job is forced to wait in a
deadlock; that is, the next job can start processing only
after the first job completes.
The idea is to partition the update jobs by their expected
processing times and to partition the available computing
resources into tracks. A track logically represents a fraction
of the computing resources required by our complex jobs,
including CPU, memory, and disk I/O. When an update job is
released, it is placed in the queue corresponding to its
assigned partition, where scheduling decisions are made by a
local scheduler running a basic algorithm. We assume that each
job is executed on exactly one track, so tracks become a
mechanism for limiting concurrency and for separating long
jobs from short jobs (the number of tracks is the limit on the
number of concurrent jobs). For simplicity, we assume that the
same type of basic scheduling algorithm is used for each
track.
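The track mechanism can be sketched as follows. This is a minimal
illustration under stated assumptions, not the authors' implementation:
the UpdateJob class, the assign_track and partition_into_tracks
functions, and the boundary values are hypothetical, and each track's
local scheduler is assumed to be a plain FIFO queue.

```python
from bisect import bisect_left
from collections import deque
from dataclasses import dataclass


@dataclass
class UpdateJob:
    job_id: str
    expected_processing_time: float  # hypothetical unit: seconds


def assign_track(job, boundaries):
    """Return the index of the track whose processing-time range holds the job.

    `boundaries` is a sorted list of inclusive upper limits; jobs longer than
    the last limit go to the final track.
    """
    return bisect_left(boundaries, job.expected_processing_time)


def partition_into_tracks(jobs, boundaries):
    # One queue per track; the number of tracks bounds the concurrency.
    tracks = [deque() for _ in range(len(boundaries) + 1)]
    for job in jobs:
        tracks[assign_track(job, boundaries)].append(job)
    return tracks


jobs = [
    UpdateJob("j1", 2.0),
    UpdateJob("j2", 45.0),
    UpdateJob("j3", 0.5),
    UpdateJob("j4", 300.0),
]
tracks = partition_into_tracks(jobs, boundaries=[5.0, 60.0])
track_ids = [[job.job_id for job in track] for track in tracks]
print(track_ids)  # [['j1', 'j3'], ['j2'], ['j4']]
```

With two boundaries there are three tracks, so at most three jobs run
concurrently and the long job j4 never delays the short jobs j1 and j3.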
In our work we divide the overall process into three stages:
partitioning, grouping, and classification. Partitioning is
the process of ordering the jobs according to a particular
rule. Grouping combines similar data into one group.
Classification decides which group or class an item belongs
to.
Partitioning: We use the EDF (Earliest Deadline First)
partitioning algorithm. The algorithm first reads all jobs,
each with a starting time, an ending time, and an execution
time. It then orders the jobs by their ending times in
descending order of priority, so the job that ends first is
ordered first.
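The EDF ordering step can be sketched as below. The Job class, the
edf_order function, and the sample values are hypothetical; the ending
time is treated as the deadline, and breaking ties by shorter execution
time is an assumption of this sketch rather than something the paper
specifies.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    start_time: float
    end_time: float        # treated as the deadline
    execution_time: float


def edf_order(jobs):
    """Sort jobs so the earliest ending time (deadline) comes first.

    Ties are broken by shorter execution time (an assumption of this sketch).
    """
    return sorted(jobs, key=lambda j: (j.end_time, j.execution_time))


jobs = [
    Job("j1", 0, 30, 5),
    Job("j2", 0, 40, 8),
    Job("j3", 2, 20, 4),
    Job("jn", 1, 10, 3),
]
ordered_ids = [j.job_id for j in edf_order(jobs)]
print(ordered_ids)  # ['jn', 'j3', 'j1', 'j2']
```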
Grouping: Grouping is also called clustering. For clustering
we use BIRCH (balanced iterative reducing and clustering using
hierarchies), an unsupervised data mining algorithm used to
perform hierarchical clustering over particularly large data
sets. An advantage of BIRCH is its ability to incrementally
and dynamically cluster incoming, multi-dimensional metric
data points in an attempt to produce the best quality
clustering for a given set of resources (memory and time
constraints). Clustering features (CFs) are organized in a CF
tree, a height-balanced tree characterized by parameters such
as its branching factor and height.
Classification: For classification we use a naive Bayes
classifier. A naive Bayes classifier assumes that the presence
or absence of a particular feature is unrelated to the
presence or absence of any other feature, given the class
variable. For example, when deciding whether a fruit is an
apple from features such as its color, shape, and size, a
naive Bayes classifier considers each of these features to
contribute independently to the probability that the fruit is
an apple, regardless of the presence or absence of the other
features.
The algorithm is as follows:
1. Input the jobs.
2. Order the jobs based on their execution times and
deadlines, for example (jn, j3, j1, j2), in descending order
of priority.
3. After the jobs are ordered, the clustering process starts.
It calculates the centroid of a cluster of N points using
$\vec{x}_0 = \frac{1}{N} \sum_{i=1}^{N} \vec{x}_i$
and the distance between two clusters with centroids
$\vec{x}_0$ and $\vec{y}_0$ by
$D = \sqrt{\sum_{k} (x_{0k} - y_{0k})^2}$
The clustering proceeds in four phases:
- Phase 1: Scan the data set once and build a CF tree in
memory.
- Phase 2: (Optional) Condense the CF tree into a smaller CF
tree.
- Phase 3: Global clustering.
- Phase 4: (Optional) Cluster refining (requires another scan
of the data set).
Consider, as an example, the CF of the data point (3,4).
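A clustering feature is commonly written as the triple (N, LS, SS):
the number of points, their linear sum, and their sum of squares. The
sketch below computes it for the point (3,4) and shows the additivity
property used when entries are merged. The function names are
hypothetical, and storing SS as a single scalar is an assumption; some
presentations keep one squared sum per dimension instead.

```python
def cf_of_point(point):
    """CF = (N, LS, SS) for a single point; SS is a scalar sum of squares here."""
    return (1, tuple(point), sum(x * x for x in point))


def cf_merge(a, b):
    """CF additivity: the CF of a merged cluster is the component-wise sum."""
    n = a[0] + b[0]
    ls = tuple(x + y for x, y in zip(a[1], b[1]))
    return (n, ls, a[2] + b[2])


def cf_centroid(cf):
    """Centroid recovered from the CF alone, as LS / N."""
    n, ls, _ = cf
    return tuple(x / n for x in ls)


cf_a = cf_of_point((3, 4))
cf_ab = cf_merge(cf_a, cf_of_point((5, 6)))
print(cf_a)                 # (1, (3, 4), 25)
print(cf_ab)                # (2, (8, 10), 86)
print(cf_centroid(cf_ab))   # (4.0, 5.0)
```

Because the centroid follows from N and LS, the tree never needs to
revisit the original points when it merges entries.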
Phase 1: Insert a point into the tree. Find the path to a leaf
(based on D0, D1, D2, D3, or D4 between the CFs of the
children in each non-leaf node). Then modify the leaf: find
the closest leaf node entry (based on D0, D1, D2, D3, or D4 of
the CFs in the leaf node) and check whether it can “absorb”
the new data point. Next, modify the path to the leaf.
Finally, a splitting operation starts if the leaf node is
full: the leaf is split into two leaf nodes and one more entry
is added to the parent.
Phase2: Chose a larger T (threshold)
Consider entries in leaf nodes and Reinsert CF
entries in the new tree.
If new “path” is “before” original “path”, move it
to new “path”
If new “path” is the same as original “path”, leave
it unchanged
Phase 3: Consider the CF entries in the leaf nodes only, and
use the centroid as the representative of a cluster. Perform
traditional clustering (e.g. agglomerative hierarchy
(complete link == D2) or K-means or CL…) on the CF entries
instead of the data points.
Phase 4: Scan the data set one more time and use the clusters
found in Phase 3 as seeds. Then redistribute the data points
to their closest seeds to form new clusters, and remove
outliers.
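The “absorb” test of Phase 1 can be sketched in a simplified form. This
is not a full CF tree: a flat list of entries stands in for a single
leaf node, splitting and the non-leaf levels are omitted, the closest
entry is chosen by centroid Euclidean distance (D0), and the absorb test
compares the cluster radius with the threshold T. All function names and
sample values are hypothetical.

```python
import math


def centroid(cf):
    n, ls, _ = cf
    return [x / n for x in ls]


def radius(cf):
    """Root-mean-square distance of the member points from the centroid."""
    n, ls, ss = cf
    c2 = sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(ss / n - c2, 0.0))


def insert_point(entries, point, threshold):
    """Absorb `point` into the closest entry if the radius stays within T."""
    new = (1, list(point), sum(x * x for x in point))
    if entries:
        # Closest entry by distance between its centroid and the point (D0).
        i = min(range(len(entries)),
                key=lambda k: math.dist(centroid(entries[k]), point))
        n, ls, ss = entries[i]
        merged = (n + 1, [a + b for a, b in zip(ls, point)], ss + new[2])
        if radius(merged) <= threshold:
            entries[i] = merged
            return
    # Nothing could absorb the point, so it starts a new entry.
    entries.append(new)


entries = []
for p in [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.1, 7.9)]:
    insert_point(entries, p, threshold=0.5)
counts = [e[0] for e in entries]
print(counts)  # [2, 2]
```

Raising the threshold, as Phase 2 does, lets each entry absorb more
points and so produces a smaller set of entries.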
Classification: Naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many
practical applications, parameter estimation for naive Bayes
models uses the method of maximum likelihood; in other words,
one can work with the naive Bayes model without accepting
Bayesian probability or using any Bayesian methods.
Bayes' theorem plays a critical role in probabilistic learning
and classification. It uses the prior probability of each
category given no information about an item. Categorization
produces a posterior probability distribution over the
possible categories given a description of an item.
Product Rule:
$P(A \wedge B) = P(A \mid B) P(B) = P(B \mid A) P(A)$
Sum Rule:
$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$
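Equating the two right-hand forms of the product rule and dividing by
P(B) gives Bayes' theorem. The short worked step below is included for
illustration; the symbols c (a category) and x (an item description) are
notation introduced here, not taken from the paper.

```latex
\[
P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad P(B) > 0
\]
\[
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}
\]
```

Here P(c) is the prior probability of the category and P(c | x) is the
posterior probability after the item description is observed.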
Estimating $P(x_i \mid c_j)$ instead of
$P(x_1, \ldots, x_n \mid c_j)$ greatly reduces the number of
parameters (and the data sparseness). The learning step in
Naïve Bayes consists of estimating $P(x_i \mid c_j)$ and
$P(c_j)$ based on the frequencies in the training data. An
unseen instance is classified by computing the class that
maximizes the posterior. When conditional independence is
satisfied, Naïve Bayes corresponds to MAP classification.
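The learning and classification steps can be sketched with plain
frequency counts. This is a minimal sketch under stated assumptions: the
function names, the feature values, and the class labels are
hypothetical, and add-one smoothing with two possible values per feature
is assumed so that unseen feature values do not zero out the product.

```python
from collections import Counter, defaultdict


def train(samples):
    """Estimate P(c) and P(x_i | c) from frequencies in (features, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (label, position) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def classify(features, class_counts, feature_counts):
    """Return the class that maximizes P(c) * prod_i P(x_i | c)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(features):
            # Add-one smoothing (an assumption of this sketch); the "+ 2"
            # assumes each feature takes one of two values.
            score *= (feature_counts[(label, i)][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


samples = [
    (("short", "low"), "fast"),
    (("short", "low"), "fast"),
    (("short", "high"), "fast"),
    (("long", "high"), "slow"),
    (("long", "high"), "slow"),
    (("long", "low"), "slow"),
]
class_counts, feature_counts = train(samples)
print(classify(("short", "low"), class_counts, feature_counts))  # fast
print(classify(("long", "high"), class_counts, feature_counts))  # slow
```

Each feature contributes its own factor to the score, which is the
independence assumption in code form.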
Experimental Results:
The screen below shows the results of EDF partitioning based
on the execution times. The jobs are sorted according to the
earliest deadline first algorithm.
In the next step, clustering is performed using the BIRCH
algorithm, and the clusters are shown below. The jobs are
clustered using the partitioning results as input. The inputs
to these calculations are the job IDs and the utilization
times. After clustering, jobs with similar utilization times
are grouped together.
In our system we introduced a scheduling method that schedules
jobs running on the network. We first partition the jobs based
on their execution times and utilization times, and then group
the jobs. In this way we can keep the staleness of the
database under control, and refreshing the data takes less
time. We tested the method in simulation, and we expect it to
work efficiently in complex environments. The computational
complexity is also low, which we verified manually.
The scheduled jobs are shown below. Jobs with lower
utilization times are scheduled first. The jobs grouped in
each cluster are treated as a track. Within a track, the job
with the lowest utilization time is executed first.
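The per-track ordering can be sketched as follows. The schedule_tracks
function, the track names, and the utilization times are hypothetical;
the sketch only illustrates ordering each cluster's jobs by ascending
utilization time.

```python
def schedule_tracks(clusters):
    """Order each track's jobs by ascending utilization time.

    `clusters` maps a track name to a list of (job_id, utilization_time) pairs.
    """
    return {
        track: [job_id for job_id, _ in sorted(jobs, key=lambda job: job[1])]
        for track, jobs in clusters.items()
    }


clusters = {
    "track_1": [("j3", 4.0), ("jn", 3.0)],
    "track_2": [("j2", 8.0), ("j1", 5.0)],
}
schedule = schedule_tracks(clusters)
print(schedule)  # {'track_1': ['jn', 'j3'], 'track_2': ['j1', 'j2']}
```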
ISSN: 2231-5381