A universal concurrency control model for datasystems (DBMSs and
Data Mining Systems (e.g. DWs) embedded in a single system)
The current isolation levels (isolation-levels scale), from strongest to weakest:
• Serializable Isolation (SRI)
• Repeatable Read Isolation (RRI)
• Read Committed Isolation (RCI)
• Read Uncommitted Isolation (RUI)
Consideration requirements for DBMSs (in order of importance):
1. Correctness of data stored into and retrieved from DBs (increases by moving
up the isolation-levels scale)
2. Performance (increases by moving down the isolation-levels scale)
In practice, the two are traded off against each other: correctness is sacrificed for
performance to the point that even serializability (which should be the definition
of correctness) is not attained. Transactions here include updaters (which contain
both reads and writes) and queries (read-only transactions).
Consideration requirements for Data Mining systems (in order of importance):
1. Performance (speed of data mining for decision support information)
2. CORRECTNESS of the decision support information that the system produces (e.g.,
Association Rules). We care more about the correctness of the data mining process
(which gives us the correct decision support information) than about the
correctness of the data we use to get that information. Here CORRECTNESS
also includes a time factor, which current Data Mining Systems do not
account for because their data is not up-to-date (e.g. DWs). To sum
up, CORRECTNESS for Data Mining Systems includes both:
a. Correctness of the data being mined (only this aspect is
currently embedded in Data Mining Systems)
b. Age of the data being mined (we need up-to-date data)
Current DWs on which data mining is applied contain OLD data. In a theoretical
sense this data is correct; nevertheless, in this model we associate data
with CORRECTNESS (written in capitals throughout this paper) only if it is both:
• Correct (i.e. committed and consistent)
• Up-to-date
If we ensure that all the data we are mining has these two characteristics, then we
can be sure that the mined results are correct to a very high degree. In short, current
data mining systems ensure only correctness when they should ensure
CORRECTNESS (both correctness and up-to-date age).
In our model, we try to balance correctness and up-to-date age in
order to achieve CORRECTNESS (instead of focusing on only one aspect, namely
correctness). To do that, data mining systems must mine UP-TO-DATE data, which is
the data in our DBMSs.
I will not discuss how the data should be stored, i.e. in DBMS or DW format. That
depends on the situation and is irrelevant to the discussion of our model.
What does need to be discussed is how to create Data Systems that act as DBs
for storing data on the one hand and as Data Mining systems on the other. We should
keep in mind that our model must be built so as not to sacrifice any of the
requirements of DBMSs (i.e. correctness and performance) and/or Data
Mining Systems. Otherwise nobody would consider it!
Data Systems are to be used to execute two kinds of transactions:
• DB trans (database transactions)
• DM trans (data mining transactions)
Each of these two types of trans needs a special environment to execute in. DB
trans need to be executed in such a way as to ensure some predefined level of data
correctness (this correctness is application dependent; however, I will assume that
DB trans must ensure serializability). They must therefore be executed at SRI (see the
isolation-levels scale). Since we now have one system in which the data is up-to-date, one
aspect of Data Mining Systems’ CORRECTNESS is satisfied, namely the up-to-date
age. As for the second aspect, DM trans need not execute at a high isolation level
because this would degrade the performance of the data mining process. We allow
DM trans to run at a lower isolation level, but minimum correctness requirements
must be met; i.e., we can’t use RUI, since with DB trans in the system
that would imply that many data items read by DM trans might be wrong (written by
aborted trans). Running DM trans at RCI would also degrade the performance
of the mining process (we can’t tolerate any kind of waiting), so I suggest a new
isolation level (Read Last Committed Isolation level, or RLCI) between RCI and
RUI. RLCI states that each trans reads the last committed value of a data item x. (In
RUI we read the current value of x, be it committed or not; in RCI we wait
until x’s value is committed and then read it, and we don’t want any waiting.) In
RLCI we try to read x: if the current value of x is committed, we simply read it; if not,
we read the last committed value of x. (This value is either in the log file or in stable
database storage; it cannot be deleted, for recovery reasons at least, until the current value of
x commits.) This way, we know that all DM trans will read committed values (and
thus 100% correct data). The new isolation-levels scale now looks like:
• Serializable Isolation (SRI)
• Repeatable Read Isolation (RRI): suffers from phantoms
• Read Committed Isolation (RCI): suffers from phantoms and non-repeatable
reads
• Read Last Committed Isolation (RLCI): suffers from phantoms, non-repeatable
reads, and inconsistent retrieval (i.e. a DM transaction Ti reading a group of
data items reads some of them before a DB transaction Tj updates
them and some after Tj updates them); however, all values read were
written by committed transactions.
• Read Uncommitted Isolation (RUI)
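As a rough illustration of the inconsistent retrieval anomaly listed for RLCI, the following Python sketch replays the Ti/Tj interleaving described above on two data items. The Item class, its method names and the two-account transfer are assumptions made only for this example; they are not part of the model itself, and the sketch ignores logging and recovery.

```python
class Item:
    """A data item that keeps its last committed value next to a pending write."""

    def __init__(self, value):
        self.committed = value   # last committed value of the item
        self.pending = None      # uncommitted value written by an active DB trans
        self.dirty = False       # True while an uncommitted write exists

    def write(self, value):
        self.pending, self.dirty = value, True

    def commit(self):
        if self.dirty:
            self.committed, self.pending, self.dirty = self.pending, None, False

    def read_rlci(self):
        # RLCI: never wait and never return an uncommitted value. Because the
        # committed value is stored separately, this covers both cases of the
        # rule (current value committed, or a pending write exists).
        return self.committed

    def read_rui(self):
        # RUI: return the current value, committed or not.
        return self.pending if self.dirty else self.committed


# Two accounts whose balances sum to 200 in every consistent state.
a, b = Item(100), Item(100)

# DM transaction Ti reads a before Tj's update.
seen_a = a.read_rlci()

# DB transaction Tj moves 50 from a to b.
a.write(50)
b.write(150)
dirty_b = b.read_rui()    # an RUI reader would see Tj's uncommitted write
still_b = b.read_rlci()   # an RLCI reader still sees the old committed value
a.commit()
b.commit()

# Ti reads b after Tj has committed.
seen_b = b.read_rlci()

print(seen_a, seen_b, seen_a + seen_b)   # prints: 100 150 250
```

Both values Ti read were committed, yet their sum (250) matches no consistent state of the database. This is the price RLCI pays for never waiting.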
To achieve the separation between DB trans and DM trans we need some sort of
mixed integrated scheduler, which executes DB trans in an SR manner (e.g. 2PL),
ensuring that they are at the SRI level, and executes DM trans at the RLCI level. To do so,
each transaction tags each of its operations with its type (DB trans or DM trans).
When the scheduler receives an operation pi(x) it does the following:
If pi(x) is a DB trans operation then
    process it in a predefined SR manner (e.g. 2PL)
Else // pi(x) is a DM trans operation and must be a READ
    // Check whether the current value of x is committed
    If it is then
        return this value
    Else
        return the last committed value of x (from the log or stable storage)
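The dispatch rule above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the names Op and MixedScheduler are hypothetical, the SR protocol is replaced by a simplified stand-in (one exclusive lock per item, held until commit or abort, returning "BLOCKED" instead of queueing), and the last committed value is kept in an in-memory dictionary rather than in a log or stable storage.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    txn: str             # transaction id
    kind: str            # "DB" or "DM": the type tag carried by each operation
    action: str          # "r" (read) or "w" (write)
    item: str            # name of the data item x
    value: object = None


@dataclass
class MixedScheduler:
    committed: dict = field(default_factory=dict)  # item -> last committed value
    pending: dict = field(default_factory=dict)    # item -> (txn, uncommitted value)
    locks: dict = field(default_factory=dict)      # item -> txn holding the lock

    def submit(self, op):
        if op.kind == "DB":
            return self._db_op(op)
        # DM operations must be reads; they are served at RLCI without waiting.
        if op.action != "r":
            raise ValueError("DM transactions are read-only")
        return self.committed.get(op.item)

    def _db_op(self, op):
        # Simplified stand-in for 2PL: exclusive locks held until commit/abort.
        holder = self.locks.setdefault(op.item, op.txn)
        if holder != op.txn:
            return "BLOCKED"   # a real 2PL scheduler would queue the operation
        if op.action == "w":
            self.pending[op.item] = (op.txn, op.value)
            return None
        # A DB trans sees its own uncommitted write, else the committed value.
        if op.item in self.pending:
            return self.pending[op.item][1]
        return self.committed.get(op.item)

    def commit(self, txn):
        self._finish(txn, install=True)

    def abort(self, txn):
        self._finish(txn, install=False)

    def _finish(self, txn, install):
        for item, (owner, value) in list(self.pending.items()):
            if owner == txn:
                if install:
                    self.committed[item] = value
                del self.pending[item]
        for item in [i for i, t in self.locks.items() if t == txn]:
            del self.locks[item]


s = MixedScheduler(committed={"x": 1})
s.submit(Op("T1", "DB", "w", "x", 2))            # T1 writes x, not yet committed
before = s.submit(Op("M1", "DM", "r", "x"))      # DM read: last committed value
blocked = s.submit(Op("T2", "DB", "r", "x"))     # DB read conflicts with T1's lock
s.commit("T1")
after = s.submit(Op("M1", "DM", "r", "x"))       # DM read now sees T1's value
s.submit(Op("T3", "DB", "w", "x", 99))
s.abort("T3")
after_abort = s.submit(Op("M1", "DM", "r", "x")) # aborted write is never visible
```

In this sketch the DM path never touches the lock table, which is what lets DB trans run as if DM trans did not exist. Keeping the committed value in a separate map is one possible design choice; the text itself only requires that it be recoverable from the log or stable storage.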
Let’s revisit the consideration requirements for each system and see how they are met in
our new model:
Consideration requirements for DBMSs:
1. Correctness of data stored into and retrieved from DBs (DB trans run
at the SRI level and DM trans are read-only, so they won’t affect the correctness of
the data, since data correctness here is determined by DB trans only)
2. Performance (DB trans execute as if DM trans didn’t exist → no effect
on DB performance)
Consideration requirements for DM Systems:
1. Performance (DM trans execute as if DB trans didn’t exist, to a certain
extent: here we test whether the value being read is committed or not,
whereas in normal DWs we don’t have to. I guess it’s a small price
compared to the outcome → small effect on DM performance)
2. CORRECTNESS:
a. Age of data: all of the data read was the last committed data when we
read it, so it is up-to-date
b. Correctness of data: we may read some inconsistent data (remember
that here we have the problem of inconsistent retrieval), but nevertheless
everything we read is committed.
c. PS: Remember that in current DM systems we read consistent
(produced by an SR execution) but out-of-date data, whereas here
we read somewhat inconsistent but committed and up-to-date data,
which I view as more valuable for Data Mining purposes than the
former, because what really matters is the final decision support
information provided by the data mining system, and this is greatly
affected by the ‘up-to-datedness’ of the data being mined.