Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
ORACLE & VLDB Nilo Segura IT/DB - CERN VLDB • The real world is in the Tb range (British Telecom - 80Tb using Sun+Oracle) • Data consolidated from different sources to build Data Warehouses and using Data mining techniques to extract useful information • Data is always READ only • Physics data has similar characteristics VLDB • Current size limits : Solaris 64bits + Oracle 64bits = 4Pb per database • This is more or less 19240 SunStoredge A1000 units (216Gb per unit today) • The current technology does not allow us to store on-line all this data in a manageable way VLDB • A typical LHC experiment will get several Petabytes of raw data • But the end user ought not to have access to all this. • We need to process/group/summarize it following certain criterias • This also means more disk space (if we want to keep everything on-line) VLDB • Not to mention backup…we need to keep our data safe (have you devised your backup strategy?) • RAID technology to help us to increase the performance and availability • RAID 0+1(best) or RAID 5(cheaper) • Today, we should have raw data on tapes and reconstructed data on-line VLDB • Now some software tricks to deal with all this amount of data – – – – – Partitioning Parallel DML Materialized Views Parallel Server (Cluster configuration) Bitmapped indexes? VLDB • Partitioning • A large amount of data can be divided into physically independent structures according to a certain criteria • However the user continues to see the same logical structure • Partition keys can be defined by range or using a hash function • The system can discard partitions based on the user’s queries, reducing the amount of data to be processed VLDB • Parallel DML – A single select/insert/delete/update can be executed by several processes (slaves) coordinated by a master process – Naturally leads to a better use of SMP machines – The degree of parallelism can be set by the user of automatically by the system (testing is a must) – Parallel insert/update/delete does need partitioning – You need to plan carefully your I/O system VLDB • Materialized views – Normal views are just a name for a SQL query with no real data associated (until runtime) – This can be very costly if we run it regularly – MV is just a view with all the data that satisfy the query already there (like in a normal table) – It can be refreshed (manually or automatically) to reflect the dynamic nature of the view • Parallel Server – Several nodes are attacking the same database that is on a disk system shared by all the nodes in the cluster – Users can be assigned to different nodes (load balancing) – Intra parallelism – queries are executed across the different nodes – At CERN there are 3 Sun Clusters (2 for DB, 1 for Web) and 1 Compaq – There is no such thing for Linux (yet) Others • Another point is how to distribute data amongst the different Institutes • Network , Tapes + Post … • It would be nice to have a plug-in plug-out mechanism in the database • This is called transportable tablespaces • We may also use the database replication option but…. Conclusion • Do not despair by the size of problem • We are trying to solve tomorrow’s problem using today’s technology • So keep an eye open and be VERY flexible in your model to be able to adapt quickly and painlessly • Do never declare you model frozen, it is a mistake, try to improve it, adopt new technologies (if it is a benefit)