Download Paper - NDSU Computer Science

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Data analysis wikipedia , lookup

B-tree wikipedia , lookup

3D optical data storage wikipedia , lookup

Data vault modeling wikipedia , lookup

Disk formatting wikipedia , lookup

Information privacy law wikipedia , lookup

Business intelligence wikipedia , lookup

Open data in the United Kingdom wikipedia , lookup

Transcript
Otto Borchert
Computer Science 765
Maximizing the Effectiveness of Peano Trees
through the use of Distributed Computing and Data
Striping
Introduction
Peano trees are a new invention poised to change the way spatial data is recorded,
used, evaluated, and searched. This new technology is just coming to life, and the ability
to use it effectively is an important step to making the technology mainstream. An array
of tools will be necessary in order to exploit these Peano trees, also known as P-trees, to
their full potential. One of the first tools required for widespread use of the technology is
a way to increase the speed and efficiency of processing the information stored in them.
A combination of data striping and distributed computing is an excellent first step to
achieving these goals. This paper will start by giving a background of Peano trees, data
striping, and distributed computing and will describe how these can be used in tandem to
make P-trees a more useful technology. Finally, there will be a description of a prototype
system with an array of 16 nodes used to demonstrate the effectiveness of such a project.
Peano Trees - What are they?
The Peano Tree is a relatively new invention by Dr. William Perrizo from the
Computer Science department at North Dakota State University. Peano trees are used to
store spatial data in a lossless compressed form for which it is easy to perform various
operations to aid in data mining techniques, such as counts, ANDs, ORs, and
complements. [PER]
For an example of how a Peano tree is created, see figure one. To begin with, a
file of values is turned into its binary equivalent as shown in the first stage of the figure.
These equivalents are then split into different files based on bit position. The three "bit
sequential" files are shown in the second stage of the figure. Bit sequential files are a
related invention of Dr. Perrizo's. [PER] Next, these files are re-ordered into "Peano
order". If you follow the lines like a Z in the figure, you can see how the file is created
into Peano order. Finally this information is organized into a tree structure. The first child
represents the first quadrant of the whole file, the second child represents the second, etc.
The first child of the first child represents the first quadrant of the first quadrant and so
on. A one value node means that this is a "pure 1" or all of the values in this quadrant are
1. A zero value then means one of two things: if there are children in that node, that's
how the subquadrants of this quadrant are laid out; if there are no children, the quadrant
is pure 0; that is, all of the values in this quardrant are 0.
Figure 1. Peano Tree Example
Decimal Value
Binary Value
0
000
1
001
1
001
2
010
2
010
2
010
1
001
1
001
4
100
4
100
5
101
6
110
6
110
5
101
1
001
0
000
Bit Sequential Files (Raster order)
0000 0000 1111 1100
0001 1100 0001 1000
0110 0011 0010 0110
Bit position 1 in raster order. Notice the Z pattern in order to build the peano ordering
0 - 0
0 - 0
/
/ /
0 - 0 / 0 - 0
/
-------/
1 - 1
1 - 1
/
/ /
1 - 1 / 0 - 0
Peano Order
0000000011111100
Peano tree
0
|
0
0
1
0
|
1100
How exactly is this useful? You will notice the incredible compression this
allows, even in this simple example. Since spatial data is normally very continuous (as
opposed to having discrete locations that change drastically from point to point) this is
actually a fairly good example of a real world scenario. Also, by ordering data by bit
position you can gather more useful information. For example, the first bit position which
shows the largest amount of change can be more useful than the last bit position which
only shows differences in values of 1. There are also a number of recent presentations
that show the usefulness of peano trees in a variety of application including microarray
data, stock market data, genetic information, and the original use in crop analysis. [CLA]
Note that this is not the only way one could describe a peano tree. The particular
type of tree used in figure one is called the P1 tree. In an NZ tree, values of 0 are pure,
while 1 describes the mixed nodes. This project uses the fact that if you compare the two
different tree types, you can be sure of the underlying values. If both a P1 and NZ type
tree have the same value, all of the subquadrants of this quadrant have the same value,
where if they are different, the subquadrants are mixed.
Another way of describing a peano tree is through the use of a vector format. In
such a format, the quadrant in question is listed, followed by the P1 and NZ tree values.
The first pure 0 and pure 1 quadrants are displayed, but all children nodes are removed
because of redundancy. For example, see figure two below. The figure describes the ptree shown in figure 1. Such a system creates even smaller peano tree representations,
because there are no tree pointers involved.
Figure 2 - Peano Tree Vector Format
Quadrant
NZ
P1
[]
0011 0010
[11]
1100 1100
Data Striping - Spreading Data Out
Ramakrishnan describes data striping as segmenting data into partitions of equal
size for distribution across multiple disks. [RAM] With respect to peano trees, this would
be the distributing of pieces of the p-tree across multiple disk drives. There are a number
of well proven methods of data striping that already exist and are in practice today.
The most widely used method of data striping is in RAID arrays. RAID level 0
describes data striping when used to distribute data of any kind across multiple disks. In
RAID level 0, data is split into some segment known as a striping unit and placed on
multiple disks. So, for example, the first byte of a file is stored on one disk, the second
byte on a second disk and so on.
Distributed Computing - Putting data on different computers
Distributed computing is the final link of the three technologies necessary to build
a prototype system. By using computers in different, yet close, physical proximity can
allow for the distribution of processing for later tools that can do root counts, and
operations, and other data mining calculations.
According to Dr. Michael Stonebraker, and other sources, there are 3 main types
of distributed computing: Shared memory, shared disk, and shared nothing. Each has its
own advantages and disadvantages. [STO]
The shared memory architecture uses multiple processors that share a common
memory area. For example, a symmetric multiprocessor computer would fall into this
category, if it were running by itself. This architecture is the easiest to maintain with only
one central system, but is difficult to scale up to large sizes. For example, there are very
few computers built with more than 8 or 16 processors on one motherboard.
The shared disk architecture uses multiple processors that each has their own
memory area, but they share a hard disk. An example of such a system is a cluster of
computers connected to an NFS server. According to Stonebraker, This methodology is a
middle ground between shared memory and shared nothing when considering scalability
and maintenance difficulties. [STO]
The shared nothing architecture, as the name implies, involves using multiple
processors that share no disk or memory. Normally, this is accomplished using a network
connection between multiple computers. An example includes the Linux cluster, Beowulf
system. This system is the most difficult to maintain because each system has its own
operating system and set of hardware to maintain, but there is no single point of failure
and scales up very well.
The Project
By combining all three of these topics into one project, I hope to develop the best
way to store P-trees in a distributed environment. There has already been some work
done in developing the best strategy for laying out P-trees into vectors for use in
programs that calculate root counts, AND, and OR operations.
First, one needs to determine a way of splitting the p-trees into "striping units" so
that they can be distributed across multiple disks. There is an obvious point at which to
divide the p-trees, notably, each of their children can go to a different disk. In this theory,
information about any first quadrant data would be on disk 0, the second quadrant in disk
1, and so on.
Next, one must determine a way to use multiple computers to create a distributed
environment for the data to be stored. A shared memory architecture could be feasible, in
that you could use a multi-processor machine connected to an array of hard drives, but
this would not fit established systems that well. Another idea is to use a shared nothing
approach using TCP/IP to connect the server with a multitude of clients. This is the
approach I used when building the prototype system.
So based upon these ideas, one could build a system. First, a server would take a
tabular file and create a series of peano trees that need to be striped. Next, clients connect
to the server in order to perform the striping on each of their respective disks. Then the
server takes the p-trees, gives each of the clients a quadrant to work on, and keeps track
of what trees and quadrants each client controls. The clients then process the data,
striping information that needs to be striped and passes quadrants that need to be striped
to other clients.
One question that naturally arises from this algorithm is what would the best way
to divide work amongst the clients. My idea is to take a pseudo round robin approach. If
the server keeps track of what peano trees correspond to the first bit positions, you must
make an attempt to keep those quadrants on different computers as they will more than
likely be queried the most during calculations. For example, if the server parses three
tabular files, make sure that the quadrants from the first file created from each tabular file
are on separate systems.
A second question is what exactly does a client do when it "processes the data"
When a client receives a peano tree, it looks at an NZ representation of the tree and a P1
representation of the tree. It then writes these two values to its hard drive. Then it needs
to determine if more information needs to be stored. So, it compares the NZ and P1
values. If they are the same, no further processing is required because the quadrant is
either pure 1 or pure 0. If they are different however, that means that quadrant is mixed
and needs to be written to another disk. So, the client then sends that information to the
client responsible for storing that quadrant of data.
The Prototype System
In order to prove this would be a useful project, it would be best to build a
prototype system. I will start by giving an overview of what I built and then go into the
specifics of each module.
I chose to write the code in C++ for its modularity and my familiarity with
network programming in C. My overall goals were to make it as modular as possible so
that I could swap in and out implementations as much as possible. For example, I feel
that the P-tree implementation could be made much better. Because of the modularity of
the system, one would only need to swap in a new ptree.cpp and ptree.h file and change
how the P-trees are converted into network strings.
There is extensive use of sockets, which are a standard across many platforms
(including Windows and UNIX). This means that it may be feasible to re-write this
program for use in Windows systems as well, but that is beyond the scope of this project.
PTree Implementation
The P-tree was implemented by creating a tree structure that stored the NZ and P1
tree types in one structure, and assigning four children to it, which correspond to the four
children all P-trees have. This type of structure was mainly chosen for its ease of
checking NZ and P1 tree types. Also note that my representation of p-trees is very
inefficient because there are always four children, even if it is a pure node. I chose this
because I did not wish to take time away from the more interesting question of whether a
client server that striped data was feasible or not.
Sending a Ptree across the network
Next, I needed to create a method of creating a data stream to send across the
network so a client can understand and re-create the tree in question. To make a p-tree
into a network stream, I went through the tree recursively, giving each node an
identification number and writing the NZ and P1 giving a structure as follows:
nodeID|P1|NZ|child1nodeID|child2nodeID|child3nodeID|child4nodeID. On rebuilding
the tree at the other end, the program went backwards through this structure and rebuilt
the tree from the bottom up, returning the root node on completion.
The drawbacks of this choice was that it is incredibly large. I used characters to
send information instead of a bit stream, so the stream is at least 8 times larger than it
needs to be. Also, if I used peano vectors instead of sending a tree, the system may shrink
even further because of the lack of tree pointers. The advantages of the choice in network
implementation has the benefit of having easier to read code, and it was very easy to
debug because I could view packets as they went across the network to see if they were
correct or not.
Client/Server Implementation
There are 3 states that a server using this implementation can go through. First, it
waits for connections. Once the specified number of clients has connected, the server
begins a minimal amount of processing by converting the p-tree into a data stream as
prescribed above. This is then queued to be sent to the waiting clients. When data is
returned from the clients that resulting data is then requeued for further processing. This
loop is continued until each node returns a signal indicating that it has reached a leaf
node.
The clients similarly have three states to their life cycle. They wait for data from
the server, and then they process this data into a p-tree, write out the data necessary for
this tree sections storage, and return the children nodes to the server for distribution
among the other clients in the system.
Conclusion
I was able to compile the clients on two different FreeBSD systems, and using
NDSU's local Linux cluster, I was able to test the project as described. As expected,
results from the more remote FreeBSD systems came back slower than the local network
of Linux machines. However, it did prove the scalability of the system and ability to be
compiled on a variety of UNIX variant platforms.
There are a number of improvements to be made in the system. I would include a
better implementation of p-trees, some type of user interface, the ability to import any
tabular data you wanted to use, and using bit streams instead of characters to transport
and store data. Another improvement could be to give clients information about other
clients so that they could send information amongst themselves directly. This would
cause an increase of overhead, but would also make the system more stable, because it
would remove the single failure point that a server client paradigm of this nature
introduces.
Peano trees will be an important technology for the advancement of data mining
technology, and the development of tools to more efficiently use these trees will be
paramount to achieving their maximum potential. This project and prototype system has
proven the feasibility of data striping and distributed computing as a first step in building
a toolset for p-tree manipulation. Hopefully, many more projects will follow to continue
the excellent progress of this technology.
Note: The source code and presentation of this paper are available at:
http://www.cs.ndsu.nodak.edu/~oborcher/cs765
References
[STO] Stonebraker, Michael, The Case for Shared Nothing appears at URL
http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
[CLA] Class Presentations, Computer Science 765. North Dakota State University.
November 26, December 3, 10, and 17, 2001. Schedule and Topics appears at URL
http://www.cs.ndsu.nodak.edu/~perrizo/classes/765/topics.html
[RAM] Ramakrishnan, Raghu and Johannes Gehrke. Database Managment Systems:
Second Edition. Boston, McGraw-Hill Higher Education, 2000.
[PER] Perrizo, William. Bit Sequential (bSQ) Sata Model and Peano Count Trees (Ptrees), appears at URL
http://www.cs.ndsu.nodak.edu/~perrizo/classes/765/bSQptree.ppt