A universal concurrency control model for data systems (DBMSs and Data Mining Systems, e.g. DWs, embedded in a single system)

The current isolation levels (the isolation-levels scale):
Serializable Isolation (SRI)
Repeatable Reads Isolation (RRI)
Read Committed Isolation (RCI)
Read Uncommitted Isolation (RUI)

Consideration requirements for DBMSs (in order of importance):
1. Correctness of data stored into and retrieved from DBs (increases by moving up the isolation-levels scale)
2. Performance (increases by moving down the isolation-levels scale)

In practice, the outcome is a compromise between the two in which correctness is sacrificed for performance, to the point that even serializability (which should be the very definition of correctness) is often not attained. Transactions here include updaters (containing both reads and writes) and queries (read-only transactions).

Consideration requirements for Data Mining Systems (in order of importance):
1. Performance (the speed with which mining produces decision-support information)
2. CORRECTNESS of the decision-support information the system produces (e.g., association rules)

We care more about the correctness of the data mining process (which gives us correct decision-support information) than about the correctness of the data we use to obtain it. After all, CORRECTNESS here also includes a time factor that current Data Mining Systems lack, since the data in them is not up-to-date (e.g., DWs). To sum up, the correctness of Data Mining Systems includes both:
1. Correctness of the data being mined (only this aspect is currently embedded in Data Mining Systems)
2. Age of the data being mined (we need up-to-date data)

Current DWs on which data mining is applied contain OLD data. In a theoretical sense this data is correct, but in the model presented here we associate data with CORRECTNESS (the capitalized term I use throughout this paper) only if it is both:
1. Correct (i.e., committed and consistent)
2. Up-to-date

If we ensure that all the data we mine has these two characteristics, then we can be confident, to a high degree, that the mined results are correct. In short, current data mining systems ensure only correctness when they should ensure CORRECTNESS (both correctness and up-to-date age). In our model, we balance correctness and up-to-date age in order to achieve CORRECTNESS, instead of focusing on the single aspect of correctness. To do that, data mining systems must mine UP-TO-DATE data, which is the data in our DBMSs.

I will not discuss how the data should be stored, i.e., in DBMS or DW format; that depends on the situation and is irrelevant to the discussion of our data model. What does need to be discussed is how to create Data Systems that act as DBs for storing data, on one hand, and as Data Mining systems on the other. We should keep in mind that our model must be built so as not to sacrifice any of the consideration requirements of DBMSs (i.e., correctness and performance) or of Data Mining Systems; otherwise nobody would consider it.

Data Systems are to be used to execute two kinds of transactions:
1. DB trans
2. DM trans

Each of these two types of trans needs a special environment to execute in. DB trans must be executed so as to ensure some predefined notion of data correctness (this notion is application dependent, but I will assume that DB trans must ensure serializability), so they must be executed at the SRI level (see the isolation-levels scale). Since we now have one system, the data is up-to-date, so one aspect of Data Mining Systems' CORRECTNESS is satisfied, namely the up-to-date age. As for the second aspect, DM trans need not execute at a high isolation level, because that would degrade the performance of the data mining process.
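The isolation-levels scale and its two opposing requirements can be sketched as an ordered enumeration; this is a minimal illustration of the scale described above, not any particular DBMS's API (all names are mine):

```python
from enum import IntEnum

class IsolationLevel(IntEnum):
    """The isolation-levels scale: a higher value means a stronger guarantee."""
    RUI = 1  # Read Uncommitted Isolation
    RCI = 2  # Read Committed Isolation
    RRI = 3  # Repeatable Reads Isolation
    SRI = 4  # Serializable Isolation

def correctness_rank(level: IsolationLevel) -> int:
    """Correctness increases by moving UP the scale."""
    return int(level)

def performance_rank(level: IsolationLevel) -> int:
    """Performance increases by moving DOWN the scale."""
    return len(IsolationLevel) - int(level) + 1

# The trade-off: SRI maximizes correctness, RUI maximizes performance.
assert correctness_rank(IsolationLevel.SRI) > correctness_rank(IsolationLevel.RUI)
assert performance_rank(IsolationLevel.RUI) > performance_rank(IsolationLevel.SRI)
```

The two rank functions make the compromise explicit: no level maximizes both, which is exactly why DB trans and DM trans are placed at different points on the scale in what follows.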
We allow DM trans to run at lower isolation levels, but minimum correctness requirements must still be met: we cannot use RUI, since with DB trans present in the system much of the data read by DM trans might be wrong (written by transactions that later abort). Running DM trans at RCI, on the other hand, would degrade the performance of the mining process (we cannot tolerate any kind of waiting), so I suggest a new isolation level, Read Last Committed Isolation (RLCI), between RCI and RUI.

RLCI states that each transaction reads the last committed value of a data item x. (In RUI we read the current value of x, committed or not; in RCI we wait until x's value is committed and then read it, and we do not want any waiting.) Under RLCI we try to read x: if x is committed, we simply read it; if not, we read the last committed value of x. This value is either in the log file or in stable database storage, and it cannot be deleted (for recovery reasons, at the very least) until the current value of x commits. This way, all DM trans read committed values, and thus 100% correct data.

The new isolation-levels scale now looks like this:
Serializable Isolation (SRI)
Repeatable Reads Isolation (RRI): suffers from phantoms
Read Committed Isolation (RCI): suffers from phantoms and non-repeatable reads
Read Last Committed Isolation (RLCI): suffers from phantoms, non-repeatable reads, and inconsistent retrieval (i.e., a DM transaction Ti reading a group of data items reads some of them before a DB transaction Tj updates them and some after Tj updates them); however, all values read were written by committed transactions
Read Uncommitted Isolation (RUI)

To achieve this separation between DB trans and DM trans, we need a mixed integrated scheduler that executes DB trans in an SR manner (e.g., 2PL), ensuring they run at the SRI level, and executes DM trans at the RLCI level.
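The RLCI read rule can be sketched with a minimal versioned data item that keeps the last committed value alongside the (possibly uncommitted) current value; names such as `DataItem` and `read_rlci` are mine, for illustration only:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class DataItem:
    """A data item x: its current value plus the last committed value
    (retained anyway for recovery, e.g. in the log or in stable storage)."""
    current: Any
    committed: bool      # is the current value committed?
    last_committed: Any  # last committed value of x

def read_rui(x: DataItem) -> Any:
    # RUI: read the current value of x, committed or not.
    return x.current

def read_rlci(x: DataItem) -> Any:
    # RLCI: never wait. If the current value is committed, read it;
    # otherwise fall back to the last committed value.
    return x.current if x.committed else x.last_committed

# x was last committed at 10; an in-flight DB transaction has written 20.
x = DataItem(current=20, committed=False, last_committed=10)
assert read_rui(x) == 20    # might be wrong if the writer aborts
assert read_rlci(x) == 10   # always a committed value
```

Unlike an RCI read, `read_rlci` never blocks: the uncommitted case is resolved immediately from the retained committed version.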
To do so, each transaction tags its type (DB trans or DM trans) onto each of its operations. When the scheduler receives an operation pi(x), it does the following:

If pi(x) is a DB trans operation, process it in a predefined SR manner (e.g., 2PL).
Else (pi(x) is a DM trans operation, and so must be a READ):
    Check whether the current value of x is committed.
    If it is, return this value.
    Else, return the last committed value of x (from the log or stable storage).

Let us revisit the consideration requirements for each system and see how they are met in the new model.

Consideration requirements for DBMSs:
1. Correctness of data stored into and retrieved from DBs: DB trans run at the SRI level, and DM trans are read-only, so they cannot affect the correctness of the data; data correctness here is determined by DB trans alone.
2. Performance: DB trans execute as if DM trans did not exist, so there is no effect on DB performance.

Consideration requirements for DM Systems:
1. Performance: DM trans execute almost as if DB trans did not exist. (Almost, because here we test whether the value being read is committed, whereas in normal DWs we do not have to; I consider this a small price compared with the outcome, i.e., a small effect on DM performance.)
2. CORRECTNESS:
a. Age of data: every value read was the last committed value at the moment we read it, so it is up-to-date.
b. Correctness of data: we may read some inconsistent data (recall the inconsistent-retrieval problem above), but nevertheless everything we read is committed.
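The tag-and-dispatch procedure above can be sketched as follows. This is only an illustration of the dispatch logic: the SR path is reduced to a stub (a real scheduler would acquire and release locks under strict 2PL), and all names (`MixedScheduler`, `Item`, etc.) are my own:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Item:
    current: Any
    committed: bool
    last_committed: Any

@dataclass
class MixedScheduler:
    """Mixed integrated scheduler: DB trans run in an SR manner (e.g. 2PL),
    DM trans (read-only) run at the RLCI level."""
    store: Dict[str, Item] = field(default_factory=dict)

    def process(self, op: str, kind: str, key: str, value: Any = None) -> Any:
        if kind == "DB":
            # SR path (2PL stub): here a real scheduler would acquire locks,
            # possibly blocking, and hold them until commit (strict 2PL).
            item = self.store[key]
            if op == "read":
                return item.current
            item.current, item.committed = value, False  # write, not yet committed
            return None
        # DM path: the operation must be a READ; apply the RLCI rule, never wait.
        item = self.store[key]
        return item.current if item.committed else item.last_committed

    def commit(self, key: str) -> None:
        # On commit, the current value becomes the new last committed value.
        item = self.store[key]
        item.last_committed, item.committed = item.current, True

sched = MixedScheduler({"x": Item(10, True, 10)})
sched.process("write", "DB", "x", 20)          # a DB updater writes x = 20
assert sched.process("read", "DM", "x") == 10  # DM read sees the last committed value
sched.commit("x")
assert sched.process("read", "DM", "x") == 20  # after commit: the up-to-date value
```

Note how the DM path never blocks and never touches the lock machinery, which is what keeps DB performance unaffected and DM overhead down to the single committed-or-not test discussed above.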
PS: Remember that in current DM systems we read consistent (produced by an SR execution) but out-of-date data, whereas here we read somewhat inconsistent, committed, but up-to-date data. I view the latter as more valuable for data mining purposes, because what really matters is the final decision-support information the data mining system provides, and that is greatly affected by the up-to-dateness of the data being mined.