Efficiently preload large quantities of data in to IBM WebSphere eXtreme Scale

Author: Billy Newport May 13, 2011

Contents

Preloading data from a database into IBM WebSphere eXtreme Scale
  The first wrong way
  The second wrong way
  The third wrong way
  The right way
    Duplicate reference data in EVERY partition
    Collocating the master and child objects in a single partition
    Multiple Maps means multiple Loaders
    Preloading the Maps ONE by ONE is more efficient
  Conclusion

Preloading data from a database into IBM WebSphere eXtreme Scale

This is one of the most common use cases. The customer wants to place an IBM WebSphere eXtreme Scale (WXS) grid in front of a database to allow better scaling. Usually, the data involves several tables and the customer wants to use several WXS maps with relationships between them.

The first wrong way

Let’s say we have two tables in the database: a customer table and a customer address table. A customer can have many addresses. We decide to use the built-in JPA Loader to create a Customer entity and an Address entity, with a one-to-many relationship between them. The customer uses a single WXS Map for the Customer entity POJO. The key is the customer key and the value is an object graph consisting of one Customer POJO that has a collection of Address objects.

The customer then tries to write a preloader application. This runs one or more JPA queries to fetch the Customer/Address graphs using the JPA implementation. This usually requires the database to execute a SELECT statement with a join in which the Addresses are grouped together for a single customer. This query is usually very expensive. The preload is slow and the bottleneck is the database. The customer is loading the Customer graphs into the grid in parallel using agents, but the process takes a very long time because of the database.

The second wrong way

Convinced that the JPA implementation is the problem, the customer writes a JDBC Loader to do the same thing. Ultimately, executing the SELECT query with the group by is still the major bottleneck, so this suffers from more or less the same issue as the first wrong way.

The third wrong way

Here, the graph is split into two maps: a customer map and an address map. The customer map has the customer key as its key and the Customer POJO as its value. The address map has a key that contains the customer key, and the value is one address. Loaders are written using JPA or JDBC to fetch and store data in these maps independently. Preloading the grid here can be made efficient, as a simple table scan is sufficient to fetch all rows from both tables and store them in the grid using a bulk putAll (wxsutils). However, a major performance problem is discovered when accessing the data in the grid after the preload.
Some of the customers have several addresses. The customer and address objects for the same customer are stored in different partitions. The client application usually executes a get for each customer and each address, which results in a lot of gets for each logical operation. If a customer has 10 addresses then 11 get operations are required to fetch a single customer. The team will likely state that the grid is slower than the database, as fetching the data from the DBMS can be done with a single RPC and it’s hard to do 11 RPCs faster than one.

The right way

The above approaches are flawed in two ways:

• Preloading the data using SELECT statements that do joins, group bys or order bys will never work. It’s too slow, and the DBMS will struggle to execute this SQL query, especially when there are a lot of rows involved. I’ve seen use cases with 1.6 billion rows in the child table. I don’t care which database you have, sorting 1.6 billion rows is expensive.
• The master and child objects are not stored together in a single partition. This means that working with the customer and the associated addresses is very expensive, as many RPCs are needed both to read the data and to write changes to the grid.

We will now examine what needs to be done to solve these problems.

Duplicate reference data in EVERY partition

Let’s suppose our Address objects have a reference to the State. The state map is typically very small, 52 entries in the USA. We should make a Map for the states and then preload it with the same data in EVERY partition. This can readily be done as follows. First, run a query to fetch all the states from the database. Next, write an Agent that, when executed on every partition, inserts the states into a Map on that partition.
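
The two steps above (query the reference data once, then run an agent on every partition) can be sketched roughly as follows. This is a toy model only: each "partition" is a plain in-memory Map, and the class and method names (ReferenceDataPreload, loadStatesIntoPartition, preloadAllPartitions) are hypothetical stand-ins, not WXS APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: each "partition" is a plain in-memory Map; no WXS classes are used. */
final class ReferenceDataPreload {

    /** Stand-in for an agent that runs once per partition and copies every state row into it. */
    static void loadStatesIntoPartition(Map<String, String> partitionStateMap,
                                        Map<String, String> statesFromDb) {
        partitionStateMap.putAll(statesFromDb);
    }

    /** The states are fetched once (here: passed in), then the agent runs on every partition. */
    static List<Map<String, String>> preloadAllPartitions(int partitionCount,
                                                          Map<String, String> statesFromDb) {
        List<Map<String, String>> partitions = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            Map<String, String> stateMap = new HashMap<>();
            loadStatesIntoPartition(stateMap, statesFromDb);
            partitions.add(stateMap);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for the result of the state query.
        Map<String, String> states = Map.of("NY", "New York", "MN", "Minnesota");
        List<Map<String, String>> grid = preloadAllPartitions(4, states);
        // Every partition holds the full state table, so any local lookup succeeds.
        System.out.println(grid.get(3).get("MN")); // prints "Minnesota"
    }
}
```

The point of the design is that the copy is complete in each partition, so business logic running inside any partition never needs a remote call to resolve a state.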
This means that if business logic later requires data stored in the State map for a particular address or customer, it’s instantly available in the same partition. Clients will not be able to access the state map directly because such a map is not routable: the key to the Map does not determine which partition the entry will be found in.

Collocating the master and child objects in a single partition

This is absolutely HUGE from a performance point of view. The best news is that it’s extremely easy to do. A WXS Map has a key and a value. The customer Map has a key, let’s say it’s a customer id. The address Map has a key and a value also. Its key is a composite POJO that includes the customer key, the customer id. So, the maps look like:

    Customer: Map<String, Customer>
    Address: Map<AddressKey, Address>

    class AddressKey implements Serializable
    {
        String customerId;
        int addressId;
        public int hashCode() { … }
        public boolean equals(…) { … }
    }

With this key, the address entries for a single customer will be spread across many partitions. We need to modify the application so that WXS will store ALL address entries in the same partition as the one used for the master customer entry.

    class AddressKey implements Serializable, PartitionableKey
    {
        String customerId;
        …
        public Object ibmGetPartition()
        {
            return customerId;
        }
    }

This version of the key accomplishes that with very little work. When WXS needs to calculate the partition for a specific key, it typically just calls the hashCode method and then figures out the partition. The initial version of AddressKey will almost always produce a hashCode different from that of the customer key. The new version implements the PartitionableKey interface. If WXS sees a key that implements this interface then it will instead use the hashCode of the object returned from the ibmGetPartition method on that key.
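
To make the routing rule concrete, here is a small self-contained sketch. It does not use the WXS classes: RoutingKey is a hypothetical stand-in for PartitionableKey, and partitionFor assumes that the partition is simply the hash modulo the partition count, which may differ from what WXS does internally.

```java
import java.io.Serializable;
import java.util.Objects;

final class CollocationSketch {

    /** Hypothetical stand-in for the WXS PartitionableKey interface. */
    interface RoutingKey {
        Object routingValue();
    }

    /** Composite child key that routes on the parent's customer id. */
    static final class AddressKey implements Serializable, RoutingKey {
        final String customerId;
        final int addressId;

        AddressKey(String customerId, int addressId) {
            this.customerId = customerId;
            this.addressId = addressId;
        }

        @Override public Object routingValue() { return customerId; }

        @Override public boolean equals(Object o) {
            if (!(o instanceof AddressKey)) return false;
            AddressKey other = (AddressKey) o;
            return addressId == other.addressId && customerId.equals(other.customerId);
        }

        @Override public int hashCode() { return Objects.hash(customerId, addressId); }
    }

    /** Assumed rule: hash the routing value if the key supplies one, otherwise the key itself. */
    static int partitionFor(Object key, int partitionCount) {
        Object routed = (key instanceof RoutingKey) ? ((RoutingKey) key).routingValue() : key;
        return Math.floorMod(routed.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 13;
        int customerPartition = partitionFor("C42", partitions);
        for (int addressId = 0; addressId < 10; addressId++) {
            int addressPartition = partitionFor(new AddressKey("C42", addressId), partitions);
            // Every address key lands in the same partition as its customer key.
            System.out.println(addressPartition == customerPartition);
        }
    }
}
```

Note that equals and hashCode still use both fields, so distinct addresses remain distinct map entries; only the partition choice is delegated to the customer id.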
Our improved AddressKey returns the customerId string as the result. This guarantees that WXS will place the Address entries for a specific customer in the same partition as the associated parent Customer entry, because it calculates the hashCode in exactly the same way.

This should be done for all Maps that are one-to-many children of a ‘master’ map. The child keys should all employ the same technique. This will massively improve performance, as the application can now fetch a customer AND all of its addresses in a single operation using a very simple Agent call. Our previous example needed 11 RPCs to accomplish this. Now we only use one, and WXS will easily outperform the database.

Typically, agents are written for all operations for the highest speed. These agents can be thought of as stored procedures that fetch or manipulate customer data in some particular way using a single RPC. This also improves application performance enormously.

The Customer object will also typically have a Collection of Address keys. This collection allows the associated address objects to be retrieved efficiently rather than using an index to do the same thing. Both cost memory, so typically I will use a Collection in the Customer object, but then I have to make sure that the agents that add or remove Addresses also maintain this list in the associated Customer object. This is an extra cost but, performance-wise, it’s usually very cheap to do. The preloading code clearly needs to make sure the Collection is set correctly as well.

Multiple Maps means multiple Loaders

Sometimes, the application will decide to keep a single Map whose value includes the Customer and a Collection of Address objects. If the data is read-only as far as WXS is concerned then this is a workable solution. It’s fast: you can fetch the customer and addresses in a single operation without resorting to agents. But if a Loader is used then the Loader is more complex.
It has to figure out how to apply the List of Addresses in the customer object to the Address table in the database. Typically, this means reading all the addresses from the DBMS, comparing them to the list in memory and then issuing insert, delete and update SQL statements to account for any differences. This is expensive. WXS is unable to track the address changes because you are storing everything in a single Map and WXS does change detection map by map.

When a Loader is used, a better approach is to use separate maps as described above. Each Map gets its own Loader. WXS can now track changes to the Address Map and instruct the Loader to insert, update or delete Addresses, typically without needing to run the query first. This is simpler to code and more efficient in terms of SQL. The single Map approach is still more expensive even if write-behind is enabled, because the SQL is still complex.

Preloading the Maps ONE by ONE is more efficient

As explained earlier, trying to fetch the objects using a single query with joins and group/order bys is almost always too expensive in practice. Splitting the object into separate maps allows the following approach to be used.

First, preload the customer map. Do a simple SELECT * from Customer in blocks and use putAll to write the Customer objects to the Customer Map with empty Address collections. This is typically very fast.

Next, preload the address map. Do a simple SELECT * from Address in blocks. There is no guarantee that you’ll get the Addresses for one customer at a time, but that’s OK. We will write an agent that takes a block of addresses and splits it into one block per partition. Then, in each partition, the agent merges the new addresses for a particular customer into the existing set by updating the Collection of address keys in the customer as well as inserting the Address objects themselves.
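
The split-then-merge step might look something like the sketch below. It is an in-memory model rather than a real WXS agent: Partition, splitByPartition and mergeIntoPartition are hypothetical names, the composite address key is simplified to a string, and the partition is assumed to be the customer id hash modulo the partition count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MergeAgentSketch {

    static final class Address {
        final String customerId; final int addressId; final String street;
        Address(String customerId, int addressId, String street) {
            this.customerId = customerId; this.addressId = addressId; this.street = street;
        }
        String key() { return customerId + "#" + addressId; } // simplified composite key
    }

    static final class Customer {
        final String id;
        final List<String> addressKeys = new ArrayList<>(); // empty after the customer preload
        Customer(String id) { this.id = id; }
    }

    /** One partition holds both maps, so parent and children are collocated. */
    static final class Partition {
        final Map<String, Customer> customers = new HashMap<>();
        final Map<String, Address> addresses = new HashMap<>();
    }

    static int partitionFor(String customerId, int n) {
        return Math.floorMod(customerId.hashCode(), n);
    }

    /** Client side: split an unordered block of address rows into one sub-block per partition. */
    static Map<Integer, List<Address>> splitByPartition(List<Address> block, int n) {
        Map<Integer, List<Address>> out = new HashMap<>();
        for (Address a : block) {
            out.computeIfAbsent(partitionFor(a.customerId, n), k -> new ArrayList<>()).add(a);
        }
        return out;
    }

    /** "Agent" side: runs inside one partition and merges its whole sub-block in one call. */
    static void mergeIntoPartition(Partition p, List<Address> subBlock) {
        for (Address a : subBlock) {
            Customer c = p.customers.get(a.customerId);
            if (c == null) continue; // assumption: rows without a preloaded customer are skipped
            // Insert the address; record its key on the customer only if it is new.
            if (p.addresses.put(a.key(), a) == null) c.addressKeys.add(a.key());
        }
    }

    public static void main(String[] args) {
        int n = 3;
        List<Partition> grid = new ArrayList<>();
        for (int i = 0; i < n; i++) grid.add(new Partition());
        for (String id : List.of("C1", "C2")) {
            grid.get(partitionFor(id, n)).customers.put(id, new Customer(id));
        }
        List<Address> block = List.of(new Address("C2", 1, "1 Main St"),
                new Address("C1", 1, "9 Elm St"), new Address("C2", 2, "5 Oak Ave"));
        splitByPartition(block, n).forEach((p, sub) -> mergeIntoPartition(grid.get(p), sub));
        System.out.println(grid.get(partitionFor("C2", n)).customers.get("C2").addressKeys);
        // prints [C2#1, C2#2]
    }
}
```

Because the merge only appends a key when the address was not already present, replaying a block leaves the customer's key list unchanged in this sketch.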
This is much faster than before, as the SQL query is very efficient and we still bulk add the data into the grid. The key is using the merge Agent to add the unordered addresses to the existing customers efficiently. A great way to think about this is that we’re moving the join operations out of the database and into the grid.

Conclusion

Following the steps outlined in this article solves 99% of all the preloading performance issues that I’ve seen in customer situations. To summarize:

1. Use wxsutils for putAll or bulk implementations.
2. Map per table.
3. Collocate related map entries using PartitionableKey.
4. Preload a table at a time.
5. Then, use a merge agent to preload child tables.
6. Try preloading the master table using parallel chunks. Multi-thread it so that you use N threads, with each thread fetching an exclusive range of records from the database.
7. Try preloading the child tables using parallel chunks as well.
8. DO NOT try to fetch the child data at the same time as the master data from the DBMS.
9. Join data in the grid, not in the database.
10. Duplicate reference data in every partition.
11. Write agents to manipulate the main data using a single RPC, for reads, puts and deletes.

Following these techniques will ensure you preload data from a DBMS as efficiently as possible with the least load on the DBMS.
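
As an illustration of the parallel-chunk idea in items 6 and 7, the sketch below splits a numeric key range into exclusive sub-ranges and hands one to each thread. It assumes the table has a numeric id column with known minimum and maximum values; loadRange is a hypothetical placeholder for a ranged SELECT followed by a bulk putAll, and here it only counts ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelChunkPreload {

    /** Splits [minId, maxId] into at most `chunks` contiguous, non-overlapping [from, to) ranges. */
    static List<long[]> exclusiveRanges(long minId, long maxId, int chunks) {
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + chunks - 1) / chunks; // ceiling division
        for (long from = minId; from <= maxId; from += size) {
            ranges.add(new long[] {from, Math.min(from + size, maxId + 1)});
        }
        return ranges;
    }

    /**
     * Placeholder for "SELECT * FROM Customer WHERE id >= ? AND id < ?" plus a bulk putAll.
     * Here it pretends there is one row per id and returns the number of rows loaded.
     */
    static long loadRange(long from, long to) {
        return to - from;
    }

    /** Runs one loader task per range on a fixed pool and returns the total rows loaded. */
    static long preload(long minId, long maxId, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long[] r : exclusiveRanges(minId, maxId, threads)) {
                Callable<Long> task = () -> loadRange(r[0], r[1]);
                results.add(pool.submit(task));
            }
            long rows = 0;
            for (Future<Long> f : results) rows += f.get(); // wait for every chunk
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(preload(1, 1_000_000, 4)); // prints 1000000
    }
}
```

Because the ranges never overlap, no row is fetched twice and the threads need no coordination beyond waiting for completion.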