The Dawning of the Age
of
Infinite Storage
William Perrizo
Dept of Computer Science
North Dakota State Univ.
Tera Bytes are Here
• 1 TB costs about 1k$ to buy
• 1 TB costs 300k$/y to own
Management & curation are expensive
• Searching 1 TB takes hours
• I’m Terrified by TeraBytes
• I’m Petrified by PetaBytes
• I’ll soon be Exafied by ExaBytes
• I’m too old to ever be Zettafied by ZettaBytes
But you may be in your lifetime
You may even be Yottafied by YottaBytes
You probably won’t ever be Googified by GoogiBytes
But one should “never say never”.
Scale (largest to smallest):
Googol 10^100
...
Yotta 10^24
Zetta 10^21
Exa 10^18
Peta 10^15
Tera 10^12 (We are here)
Giga 10^9
Mega 10^6
Kilo 10^3
How much information is there?
 Soon everything can be
recorded and indexed.
 Most bytes will never be
seen by humans.
 Data summarization,
trend detection,
anomaly detection,
data mining,
are key technologies
[Figure: information scale from Kilo to Yotta (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta), marking A Book, A Photo, A Movie, All books (words), All Books MultiMedia and Everything Recorded!]
Small prefixes: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto
First Disk 1956
 IBM 305 RAMAC
 4 MB
 50x24” disks
 1200 rpm
 100 ms access
 35k$/y rent
 Included computer &
accounting software
(tubes not transistors)
Me, at 13.
1.6 meters
10 years later
30 MB
The Cost of Storage about 1K$/TB
[Figure: five "Price vs disk capacity" charts dated 12/1/1999, 9/1/2000, 9/1/2001, 4/1/2002 and 11/4/2003. Each plots price ($) and raw k$/TB against raw disk unit size (GB) for SCSI and IDE drives, with linear trend lines.]
E.g., A recent Purchase Order
Company: NDSU
Date: 8/7/03
System Board: Intel D865 GBFL system board w/LAN 800 MHz FSB
Processor: Intel Pentium 4 2.6 GHz
Hard Drives: 4 x 250 GB IDE (total = 1 TB)
Controller: Onboard IDE Controller
2nd IDE Controller:
Video: Integrated
Diskette Drive: 1.44 MB
Memory: 4 GB 400 MHz memory
CD/DVD Drive: DVD/CDRW
Sound: Integrated AC97 Audio w/Soundmax
Case: Performance Minitower ATX w/300 Watt PS
Keyboard: Microsoft 104 Internet keyboard
Mouse: Microsoft Intellimouse Optical
Operating System: none
Network Cards: Integrated Intel 10/100 Ethernet w/D845GEBV2L board
Price: $2,899.00
(Slide callout: Main expense is here)
Disk Evolution
[Figure: disk capacity timeline along the scale Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an
individual stores all his books, records,
and communications, and which is
mechanized so that it may be consulted
with exceeding speed and flexibility”
“yet if the user inserted 5000 pages of
material a day it would take him
hundreds of years to fill the repository,
so that he can enter material freely”
Trying to fill a terabyte in a year
Item                          Items/TB   Items/day
300 KB JPEG                   3M         9,800
1 MB Doc                      1M         2,900
1 hour 256 kb/s MP3 audio     9K         26
1 hour 1.5 Mb/s MPEG video    290        0.8
The Personal Terabyte
How Will We Find Anything?
 Need Queries, Indexing, Data Mining,
Pivoting, Scalability, Backup, Replication,
Online update, Set-oriented access.
 If you don’t use a DBMS, you will
implement one!
 Need Data Mining, Machine Learning!
 80% of data is personal/individual
 20% is Corporate, Governmental
SQL ++
DBMS
Why Mining Data?
 Parkinson’s Law (for data)
Data expands to fill available
storage (and then some)
 Disk-storage version of Moore’s law
Capacity ∝ 2^(t / 9 months)
 Available storage doubles every 9 months!
Another More’s Law: More is Less
The more volume, the less information. (AKA: Shannon’s Canon)
A simple illustration: Which phone book is more helpful?
BOOK-1
Name    Number
Smith   234-9816
Jones   231-7237
BOOK-2
Name    Number
Smith   234-9816
Smith   231-7237
Jones   234-9816
Jones   231-7237
EOS Data Mining example
This dataset is a 320 row and 320 column (102,400 pixels) spatial file with 5 feature
attributes (B,G,R,NIR,Y). The (B,G,R,NIR) features are in the TIFF image and the Y
(crop yield) feature is color coded in the Yield Map (blue=low; red=high)
TIFF image
Yield Map
What is the relationship between the color intensities and yield? We can hypothesize:
hi_green and low_red ⇒ hi_yield, which, while not a simple SQL query result, is not
surprising. We could analyze the data to confirm this hypothesis, but
Data Mining is more than just confirming hypotheses.
The stronger rule, hi_NIR and low_red ⇒ hi_yield, is not an SQL result and is
surprising. Data Mining includes suggesting new hypotheses.
Another Precision Agriculture Example
Grasshopper (or any pest) Infestation Prediction
• Grasshoppers cause significant economic loss each year.
• Early infestation prediction is key to damage control.
Association rule mining on remotely sensed imagery
holds significant promise to achieve early detection.
Can initial infestation be determined from RGB bands?
Gene Regulation Pathway Discovery
• Results of clustering may indicate, for instance, that nine
genes are involved in a metabolic pathway.
• High-confidence rule mining on that cluster may discover the
relationships among the genes in which the expression of one
gene (e.g., Gene2) is regulated by others. Other genes (e.g.,
Gene4 and Gene7) may not be directly involved in regulating
Gene2 and can therefore be excluded (more later).
[Figure: genes Gene1 ... Gene9, processed first by Clustering and then by ARM (association rule mining)]
Sensor Network Data Mining
• Micro and Nano scale sensor blocks are being developed for sensing
   • Biological agents
   • Chemical agents
   • Motion detection
   • Coatings deterioration
   • RF-tagging of inventory
   • Structural materials fatigue
• There will be trillions++ of individual sensors creating mountains of data.
• The data must be mined for its information.
Sensor Network Application:
CubE for Active Situation Replication (CEASR)
[Figure: nano-sensors dropped into the situation space; the soldier sees a replica of the sensed situation prior to entering the space.]
Drop or mortar “smart dust” sensors into the situation space
to detect armour, chemical, biological, thermal, ….
Wherever a threshold level is sensed, a ping is sent for that
location.
Using Alien Technology’s Fluidic Self-assembly (FSA)
technology, clear plastic layers with embedded nano-LEDs at each voxel are laminated into a viewing cube.
The pings are transmitted to the cube, using one
Ptree, where the pattern is displayed on the cube.
A more sophisticated CEASR device could sense and transmit intensity
levels, lighting up the display voxel with the appropriate intensity.
What data structure should be used? Standard horizontal record
structures may be infeasible. We suggest one vertical P-tree.
[Figure label: CARRIER]
Anthropology Application
Digital Archive Network for Anthropology (DANA)
(data mining anthropological artifacts: shape, color, discovery location, …)
Data Mining?
Querying is asking specific questions and expecting
specific answers.
Data Mining is going into the MOUNTAIN of DATA,
and returning with information gems.
But also, some fool’s gold?
Relevance and interestingness analysis serves to
assay those information and knowledge gems.
Data Mining Process
• Data mining: the core of the knowledge discovery process.
[Figure: the process as a pipeline from raw data to patterns, with loop backs between stages and a "Smart files" label]
1. Mountain of Raw Data
2. Data Cleaning/Integration: missing data, outliers, noise, errors
3. Data Warehouse: cleaned, integrated, read-only, periodic, historical raw database
4. Selection: feature extraction, tuple selection
5. Task-relevant Data
6. Data Mining: OLAP, Classification, Clustering, ARM
7. Pattern Evaluation and Assay, visualization
Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
Standard querying: SQL SELECT FROM WHERE
Complex queries (nested, EXISTS, ...)
Searching and Aggregating: FUZZY query, search engines, BLAST searches
OLAP (rollup, drilldown, slice/dice, ...)
Machine Learning and Data Mining:
   Supervised Learning: classification, regression
   Unsupervised Learning: clustering
   Association Rule Mining
   Data Prospecting
   Fractals, …
On the Query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record’02).
On the Data Mining end, the surface has barely been scratched.
But even those scratches had a great impact. One of the early scratchers recently became
the biggest corporation in the world; a non-scratcher filed for bankruptcy:
Walmart vs. KMart
Our Approach
• Vertical, compressed data structures, variously called either
Predicate-trees or Peano-trees (Ptrees in either case) [1],
processed horizontally
   • Ubiquitously, DBMSs process horizontal records vertically, through SCANs
   • We propose processing vertical data structures (Ptrees) horizontally, through ANDs
• Ptrees are data-mining-ready, compressed vertical data
structures, which attempt to address the curse of
scalability and the curse of dimensionality.
• How are Ptrees constructed? The next slides illustrate
the construction of a set of BASIC P-TREES which
represent a data file in a lossless, compressed,
data-mining-ready way.
[1] Ptree Technology is patent pending
by North Dakota State University
A file, R(A1..An), contains horizontal
structures (a set of horizontal records)
processed vertically (vertical scans)
Ptrees: vertically partition; then compress
each vertical bit slice into a basic Ptree;
horizontally process these basic Ptrees
using one multi-operand logical AND.
R(A1 A2 A3 A4): horizontal structures (records), scanned vertically
A1   A2   A3   A4
010  111  110  001
011  111  110  000
010  110  101  001
010  111  101  111
101  010  001  100
010  010  001  101
111  000  001  100
111  000  001  100
Vertical partition: R[A1] R[A2] R[A3] R[A4]
R11 (high-order bit slice of A1): 0 0 0 0 1 0 1 1
1-Dimensional Ptrees are built by recording the truth of the
predicate “pure 1” recursively on halves, until there is purity. P11:
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0 (but it is pure (pure0), so this branch ends)
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not → 0
5. 2nd half of 2nd half is → 1
6. 1st half of 1st of 2nd is → 1
7. 2nd half of 1st of 2nd is not → 0
Bit slices of R(A1 A2 A3 A4), one row per record:
R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
 0   1   0   1   1   1   1   1   0   0   0   1
 0   1   1   1   1   1   1   1   0   0   0   0
 0   1   0   1   1   0   1   0   1   0   0   1
 0   1   0   1   1   1   1   0   1   1   1   1
 1   0   1   0   1   0   0   0   1   1   0   0
 0   1   0   0   1   0   0   0   1   1   0   1
 1   1   1   0   0   0   0   0   1   1   0   0
 1   1   1   0   0   0   0   0   1   1   0   0
[Figure: each bit slice Rij is compressed into its basic Ptree Pij: P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43]
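E.g., to count occurrences of the tuple 111 000 001 100, use “pure111000001100”:
P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43
Result tree: 0 at the 2^3 level; 0 0 at the 2^2 level; 0 1 at the 2^1 level.
The single pure1 node at the 2^1 level gives a count of 2.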
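The construction and the counting AND described on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented Ptree implementation: the tree encoding (1 for a pure1 node, 0 for a pure0 node, a tuple `(0, left, right)` for a mixed node) and all function names are hypothetical, and the record count is assumed to be a power of two.

```python
# Relation R(A1, A2, A3, A4) from the slide; every attribute is a 3-bit value.
R = [
    ("010", "111", "110", "001"),
    ("011", "111", "110", "000"),
    ("010", "110", "101", "001"),
    ("010", "111", "101", "111"),
    ("101", "010", "001", "100"),
    ("010", "010", "001", "101"),
    ("111", "000", "001", "100"),
    ("111", "000", "001", "100"),
]

def bit_slice(rows, attr, bit):
    """Vertical bit slice R[attr][bit] as a list of 0/1 (attr and bit are 1-based)."""
    return [int(row[attr - 1][bit - 1]) for row in rows]

def build_ptree(bits):
    """Recursive halving: 1 = pure1, 0 = pure0, (0, left, right) = mixed (assumed encoding)."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def complement(tree):
    """P' : flip the purity of every leaf."""
    if isinstance(tree, tuple):
        return (0, complement(tree[1]), complement(tree[2]))
    return 1 - tree

def ptree_and(a, b):
    """AND with the shortcuts: pure0 wins, pure1 copies the other operand."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = ptree_and(a[1], b[1]), ptree_and(a[2], b[2])
    if left == 0 and right == 0:
        return 0
    return (0, left, right)

def root_count(tree, size):
    """Number of 1-bits represented by a tree covering `size` records."""
    if isinstance(tree, tuple):
        return root_count(tree[1], size // 2) + root_count(tree[2], size // 2)
    return tree * size

def tuple_ptree(rows, target):
    """AND of Pij for 1-bits of the target and P'ij for 0-bits."""
    result = 1
    for attr, value in enumerate(target, start=1):
        for bit, ch in enumerate(value, start=1):
            p = build_ptree(bit_slice(rows, attr, bit))
            result = ptree_and(result, p if ch == "1" else complement(p))
    return result

target = ("111", "000", "001", "100")
print(build_ptree(bit_slice(R, 1, 1)))                 # P11
print(root_count(tuple_ptree(R, target), len(R)))      # count of the target tuple
```

The sketch rebuilds each basic Ptree on demand for brevity; a real system would presumably build them once and store them.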
Can anyone build us a hardware
ANDer for this Ptree AND?
• A card for a Pentium-4 or Itanium (or Opteron or G5 or …)
• An active network device (e.g., a modified ATM switch in
which the inbuffer “load” code is modified to disable the
clear-to-1’s, assuming buffer-load micro-code is clear-to-1’s followed by AND)
• An all-optical device (ANDing on-the-fly with zero time
delay?)
• We envision a world-wide consortium of Beowulf clusters
of such machines, so that the WWW can be data mined in
parallel effectively.
Vertical Data Structures History
 In the 1980’s vertical data structures were proposed for
record-based workloads
 Decomposition Storage Model (DSM, Copeland et al)
 Attribute Transposed File (ATF)
 Bit Transposed File (BTF, Wang et al); Viper
 Band Sequential Format (BSQ) for Remotely Sensed Imagery
 DSM and BTF initiatives have disappeared. Why? (next slide)
 Vertical auxiliary and system structures
 Domain & Request Vectors (DVA/ROLL/ROCC Perrizo, Shi, et al)
 vertical system structures (query optimization & synchronization)
 Bit Mapped Indexes (BMIs - very popular in Data Warehouses)
 all indexes are vertical auxiliary structures really
 BMI’s use bit maps (positional approach to IDing records)
 other indexes use RID lists (keyword or value approach)
Horizontal Processing of Vertical Structures
for Record-based Workloads
• For record-based workloads (e.g., SQL), where the result is a set of records,
changing the horizontal record structure and then having to reconstruct it may
introduce too much post-processing?
• For data mining workloads, the result is often a bit (Yes/No, True/False) or another
unstructured result, where there is no reconstructive post-processing?
[Figure: the bit slices R11 ... R43 beside the relation R(A1 A2 A3 A4), as on the earlier slides]
Run Lists: Another way to handle vertical data.
Generalized Ptrees using standard run-length
compression of vertical bit files (alternatively, using Lempel-Ziv? Golomb? other?)
Run Lists: record the type and start-offset
of pure runs (truth:start).
E.g., RL11, for R11 = 0 0 0 0 1 0 1 1:
1. 1st run is Pure0 → 0:000
2. 2nd run is Pure1 → 1:100
3. 3rd run is Pure0 → 0:101
4. 4th run is Pure1 → 1:110
E.g., to count 111 000 001 100s, use “pure111000001100”:
RL11^RL12^RL13^RL’21^RL’22^RL’23^RL’31^RL’32^RL33^RL41^RL’42^RL’43
(to complement, flip purity bits)
Run lists for the bit slices of R(A1 A2 A3 A4):
RL11: 0:000 1:100 0:101 1:110
RL12: 1:000 0:100 1:101
RL13: 0:000 1:001 0:010 1:100 0:101 1:110
RL21: 1:000 0:100
RL22: 1:000 0:110
RL23: 1:000 0:010 1:011 0:100
RL31: 1:000 0:100
RL32: 1:000 0:010
RL33: 0:000 1:010
RL41: 0:000 1:011
RL42: 0:000 1:011 0:100
RL43: 1:000 0:001 1:010 0:100 1:101 0:110
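A run list in the truth:start form above is straightforward to build and to AND by walking the two lists in step. The following Python sketch illustrates this; the function names and the `(truth, start)` tuple representation are assumptions made for the example, not part of the original system.

```python
def to_run_list(bits):
    """[(truth, start_offset), ...], one entry per maximal pure run."""
    runs = []
    for i, b in enumerate(bits):
        if not runs or runs[-1][0] != b:
            runs.append((b, i))
    return runs

def fmt(runs, width=3):
    """Render runs in the slide's truth:start notation, e.g. '1:100'."""
    return [f"{t}:{s:0{width}b}" for t, s in runs]

def rl_complement(runs):
    """To complement, flip the purity bits; the start offsets are unchanged."""
    return [(1 - t, s) for t, s in runs]

def rl_and(a, b, n):
    """AND two run lists that both cover positions 0 .. n-1."""
    out, i, j, pos = [], 0, 0, 0
    while pos < n:
        # Move to the runs that contain position `pos`.
        while i + 1 < len(a) and a[i + 1][1] <= pos:
            i += 1
        while j + 1 < len(b) and b[j + 1][1] <= pos:
            j += 1
        truth = a[i][0] & b[j][0]
        if not out or out[-1][0] != truth:
            out.append((truth, pos))
        # Jump to the next position where either operand changes.
        next_a = a[i + 1][1] if i + 1 < len(a) else n
        next_b = b[j + 1][1] if j + 1 < len(b) else n
        pos = min(next_a, next_b)
    return out

def rl_count(runs, n):
    """Number of 1-bits represented by a run list of length n."""
    total = 0
    for k, (truth, start) in enumerate(runs):
        end = runs[k + 1][1] if k + 1 < len(runs) else n
        total += truth * (end - start)
    return total

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
R12 = [1, 1, 1, 1, 0, 1, 1, 1]
RL11, RL12 = to_run_list(R11), to_run_list(R12)
print(fmt(RL11))
print(fmt(rl_and(RL11, RL12, 8)), rl_count(rl_and(RL11, RL12, 8), 8))
```

The AND never expands the bit vectors; it only visits positions where one of the operands changes purity, which is where the compression pays off.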
Architecture for the DataMIME™ System
(DataMIME™ = data mining, NO NOISE)
(PDMS = P-tree Data Mining System)
[Figure: architecture diagram]
YOUR DATA → Data Integration Language (DIL) → DII (Data Integration Interface)
YOUR DATA MINING → Ptree (Predicates) Query Language (PQL) → DMI (Data Mining Interface)
Internet
Data Repository: lossless, compressed, distributed, vertically structured P-tree database
2-Dimensional Pure1-trees
Node is 1 iff that quadrant is purely 1-bits, e.g.,
A bit-file
(from, e.g., high-order bit of the RED band of a 2-D image)
1111110011111000111111001111111011110000111100001111000001110000
Which, in spatial raster order, looks like:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
Run-length compress it into a quadrant tree using Peano order:
Root: 0
Level 2: 1 0 0 0
Level 1: 0 0 1 0 (under the 2nd quadrant); 1 1 0 1 (under the 3rd quadrant)
Level 0: 1110 and 0010 (under the 2nd quadrant); 1101 (under the 3rd quadrant)
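As a rough illustration of the quadrant compression, the following Python sketch builds a count tree from the 64-bit file on this slide and derives the Pure1-tree from it. The node encodings and function names are assumptions chosen for the example; the image side is assumed to be a power of two.

```python
# Bit file from the slide: 8x8 pixels in raster order.
BITS = "1111110011111000111111001111111011110000111100001111000001110000"

def to_grid(bits, width=8):
    """Split the raster-ordered bit string into rows of `width` pixels."""
    return [[int(c) for c in bits[i:i + width]] for i in range(0, len(bits), width)]

def count_tree(grid, r0=0, c0=0, side=None):
    """Return (count, children); children is None for pure0/pure1 quadrants.
    Children are in Peano (Z) order: upper-left, upper-right, lower-left, lower-right."""
    side = len(grid) if side is None else side
    if side == 1:
        return (grid[r0][c0], None)
    half = side // 2
    kids = [count_tree(grid, r0 + dr, c0 + dc, half)
            for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))]
    total = sum(k[0] for k in kids)
    if total in (0, side * side):
        return (total, None)          # pure quadrant: the branch ends here
    return (total, kids)

def pure1_tree(node, pixels):
    """Pure1-tree from the count tree: a node is 1 iff its quadrant is all 1s."""
    count, kids = node
    flag = 1 if count == pixels else 0
    if kids is None:
        return flag
    return (flag, [pure1_tree(k, pixels // 4) for k in kids])

tree = count_tree(to_grid(BITS))
print(tree[0], [k[0] for k in tree[1]])   # root count and the four quadrant counts
print(pure1_tree(tree, 64))
```

For this particular bit file the fourth quadrant is pure0, so the root count printed here belongs to this image only.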
Count tree?
Counts are what’s needed in DM, but P1-trees are more compressed and produce counts quickly.
One can construct the Count-tree (in which each inode counts the 1s in that quadrant).
For an image whose 1st and 4th quadrants are pure1:
Level 3 (Root Count): 55
Level 2: 16 8 15 16
Level 1: 3 0 4 1 (under 8); 4 4 3 4 (under 15)
Level 0: 1110 and 0010 (under 8); 1101 (under 15)
• Pure-1/Pure-0 quadrants (e.g., counts 16, 4 and 0) are not expanded further
• Tree levels: 3, 2, 1, 0, with purity counts of 4^3, 4^2, 4^1, 4^0 respectively
• The Fan-out = 2^dim = 4
• QID (Quadrant ID): e.g., the pixel at (7, 1) = (111, 001) has QID 10.10.11 = 2.2.3
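The quadrant ID in the example can be obtained by interleaving the coordinate bits, one pair per tree level. A minimal sketch, assuming the first coordinate is the row and the second the column (the slide does not say which is which) and using a hypothetical function name:

```python
def qid(row, col, levels):
    """Quadrant ID of a pixel: one digit (2*row_bit + col_bit) per level, high-order bits first."""
    digits = []
    for level in range(levels - 1, -1, -1):
        r = (row >> level) & 1
        c = (col >> level) & 1
        digits.append(str(2 * r + c))
    return ".".join(digits)

print(qid(7, 1, 3))   # pixel (7, 1) = (111, 001) in an 8x8 image
```

Each digit selects one of the four quadrants at that level in Peano order, so the QID is also the path from the root of the tree to the pixel.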
Logical Operations on Ptrees
(are used to get counts of any pattern)
[Figure: Ptree 1, Ptree 2, AND result, OR result]
The AND operation is faster than a bit-by-bit AND since there are shortcuts
(any pure0 operand node means the result node is pure0; for any pure1 operand node, copy
the subtree of the other operand to the result),
e.g., only load quadrant 2 to AND Ptree1, Ptree2, etc.
The more operands there are in the AND, the greater the benefit due to this
shortcut (more pure0 nodes).
Ptree Algebra: And, Or, Complement, Other
Ptree (count form):
55
   16
   8: 3 0 4 1 (leaves 1110 under 3, 0010 under 1)
   15: 4 4 3 4 (leaf 1101 under 3)
   16
PM-tree1 (m = mixed):
m
   1
   m: m 0 1 m (leaves 1110, 0010)
   m: 1 1 m 1 (leaf 1101)
   1
Complement (count form):
9
   0
   8: 1 4 0 3 (leaves 0001, 1101)
   1: 0 0 1 0 (leaf 0010)
   0
PM-tree2:
m
   1
   0
   m: 1 1 1 m (leaf 0100)
   0
AND Result:
m
   1
   0
   m: 1 1 m m (leaves 1101, 0100)
   0
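The two shortcuts can be seen directly in a recursive AND over the PM-trees shown on this slide. In the sketch below a node is 0 (pure0), 1 (pure1) or a list of four children (mixed, "m"); that encoding and the function names are assumptions for illustration only.

```python
# PM-trees from the slide, children in quadrant order 0, 1, 2, 3.
PM1 = [1, [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]], [1, 1, [1, 1, 0, 1], 1], 1]
PM2 = [1, 0, [1, 1, 1, [0, 1, 0, 0]], 0]

def pm_and(a, b):
    """AND of two PM-trees using the pure0 and pure1 shortcuts."""
    if a == 0 or b == 0:
        return 0                      # any pure0 operand: result is pure0
    if a == 1:
        return b                      # pure1 operand: copy the other subtree
    if b == 1:
        return a
    kids = [pm_and(x, y) for x, y in zip(a, b)]
    return 0 if all(k == 0 for k in kids) else kids

def pm_complement(node):
    """Swap pure0 and pure1 everywhere; mixed nodes stay mixed."""
    if isinstance(node, list):
        return [pm_complement(k) for k in node]
    return 1 - node

def pm_count(node, pixels):
    """Number of 1-bits in a tree that covers `pixels` pixels."""
    if isinstance(node, list):
        return sum(pm_count(k, pixels // 4) for k in node)
    return node * pixels

print(pm_and(PM1, PM2))
print(pm_count(PM1, 64), pm_count(pm_complement(PM1), 64))
```

Note that quadrants 1 and 3 of the result are decided without ever descending into PM-tree1, which is the point of the "only load quadrant 2" remark.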
How to AND P-trees? Depth-first pure-1 path AND code
Operand 1: 0 100 101 102 12 132 20 21 220 221 223 23 3
Operand 2: 0 20 21 22 231
Matching paths:
0 & 0 → 0
20 & 20 → 20
21 & 21 → 21
220 221 223 & 22 → 220 221 223
23 & 231 → 231
RESULT: 0 20 21 220 221 223 231
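With each Ptree stored as the list of paths to its pure1 quadrants, the AND keeps the longer path of every pair in which one path is a prefix of the other. A small Python sketch of that rule follows; the function name is hypothetical and the quadratic pairwise loop is chosen for clarity, not speed.

```python
def and_paths(a, b):
    """AND of two Ptrees given as lists of pure1 quadrant paths (strings of quadrant digits)."""
    result = set()
    for p in a:
        for q in b:
            # One pure1 quadrant contains the other iff its path is a prefix.
            if p.startswith(q) or q.startswith(p):
                result.add(max(p, q, key=len))
    return sorted(result)

op1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
op2 = "0 20 21 22 231".split()
print(and_paths(op1, op2))
```

Because both operand lists are in depth-first order, a single merge pass over the two lists could replace the nested loop.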
Basic, Value and Tuple Ptrees
Basic Ptrees (a Pure1-tree, i.e., a predicate-tree for the target bit position of the target attribute)
e.g., P11, P12, …, P18, P21, …, P28, …, P71, …, P78 (first index: target attribute; second index: target bit position)
Value Ptrees, built by AND (predicate: quad is purely target value in target attribute)
e.g., P1,5 = P1,101 = P11 AND P12’ AND P13 (first index: target attribute; second index: target value)
Tuple Ptrees, built by AND (predicate: quad is purely target tuple)
e.g., P(1, 2, 3) = P(001, 010, 111) = P1,001 AND P2,010 AND P3,111
Cube Ptrees, built by AND/OR (predicate: quad is purely in target cube (product of intervals))
e.g., P([1..3], , [0..2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2)
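The value and cube formulas can be checked on the A1 and A3 columns of the relation R used earlier in the deck. The sketch below is an assumption-laden simplification: it uses plain Python integers as uncompressed bit vectors instead of compressed Ptrees, and all names are hypothetical.

```python
# Columns A1 and A3 of the example relation R (8 records, 3-bit values).
A1 = [0b010, 0b011, 0b010, 0b010, 0b101, 0b010, 0b111, 0b111]
A3 = [0b110, 0b110, 0b101, 0b101, 0b001, 0b001, 0b001, 0b001]
N = len(A1)
FULL = (1 << N) - 1          # bit vector with one 1 per record

def basic(column, bit, width=3):
    """Basic bit vector for bit position `bit` (1 = high-order); record 0 is the leftmost bit."""
    mask = 0
    for v in column:
        mask = (mask << 1) | ((v >> (width - bit)) & 1)
    return mask

def value(column, v, width=3):
    """Value vector: AND of the basic vector (for 1-bits of v) or its complement (for 0-bits)."""
    mask = FULL
    for bit in range(1, width + 1):
        p = basic(column, bit, width)
        mask &= p if (v >> (width - bit)) & 1 else (~p & FULL)
    return mask

def interval(column, lo, hi, width=3):
    """OR of the value vectors for every value in [lo, hi]."""
    mask = 0
    for v in range(lo, hi + 1):
        mask |= value(column, v, width)
    return mask

p1_5 = value(A1, 5)                                   # P1,5 = P11 AND P12' AND P13
cube = interval(A1, 1, 3) & interval(A3, 0, 2)        # P([1..3], , [0..2])
print(f"{p1_5:08b}", f"{cube:08b}", bin(cube).count("1"))
```

The same AND, OR and complement steps apply unchanged when the operands are compressed trees; only the representation of each operand differs.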
Hilbert Ordering?
• In 2 dimensions, Peano ordering is 2×2-recursive z-ordering (raster ordering)
• Hilbert ordering is 4×4-recursive tuning fork ordering (H-trees have fanout=16)
[Figure: tuning forks whose 16 cells are labelled 0123456789ABCDEF, each oriented down, right, left or up]
Coordinates of a tuning-fork (upper-left) depend on ancestry.
(x,y) = (ggrrbb, ggrrbb). If your parent points
Down and you are the H node in your tuning-fork,
your 2-bit contribution is given by:
H    row(x)  col(y)
0    00      00
1    00      01
2    01      01
3    01      00
4    10      00
5    11      00
6    11      01
7    10      01
8    10      10
9    11      10
A    11      11
B    10      11
C    01      11
D    01      10
E    00      10
F    00      11
Lookup tables for Up, Left and Right
parents are similar.
3-Dimensional Ptrees
(e.g., for the CEASR sensor network)
X  Y  Z  Intensity
0  0  0  15 (1111)
1  0  0  15 (1111)
0  1  0  15 (1111)
1  1  0  15 (1111)
0  0  1  15 (1111)
1  0  1  15 (1111)
0  1  1  15 (1111)
1  1  1  15 (1111)
2  0  0  15 (1111)
3  0  0  4 (0100)
2  1  0  1 (0001)
3  1  0  12 (1100)
2  0  1  12 (1100)
3  0  1  2 (0010)
2  1  1  12 (1100)
3  1  1  12 (1100)
0  2  0  15 (1111)
1  2  0  15 (1111)
0  3  0  2 (0010)
1  3  0  0 (0000)
0  2  1  15 (1111)
1  2  1  15 (1111)
0  3  1  2 (0010)
1  3  1  0 (0000)
2  2  0  12 (1100)
Ptree dimension
• The dimension of the Ptree structure is a
user-chosen parameter
• It can be chosen to fit the data dimension
   • Most datasets → 1-D Ptrees (recursive halving)
   • 2-D Images → 2-D Ptrees (recursive quartering)
   • 3-D Solids → 3-D Ptrees (recursive eighth-ing)
• Or the dimension can be chosen based on other
considerations
   • optimize compression
   • increase processing speed (next slide)
Generalized Raster and Peano Sorting: generalizes to any table with
numeric attributes (not just images).
Raster Sorting: attributes 1st, bit position 2nd
Peano Sorting: bit position 1st, attributes 2nd
[Figure: an unsorted relation, sorted both ways]
Generalized Peano Sorting
KNN speed improvement
(using 5 UCI Machine Learning Repository data sets)
[Figure: bar chart of time in seconds (axis 0 to 120) for Unsorted, Generalized Raster and Generalized Peano]
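The two orderings differ only in how the sort key is assembled from the attribute bits. A minimal Python sketch, assuming every attribute is an unsigned integer of the same bit width; the function names and the sample rows are made up for illustration:

```python
def raster_key(row):
    """Generalized raster order: compare whole attributes, first attribute first."""
    return tuple(row)

def peano_key(row, width):
    """Generalized Peano order: interleave bits, high-order bit position first,
    and within a bit position take the attributes in order."""
    key = 0
    for bit in range(width - 1, -1, -1):
        for v in row:
            key = (key << 1) | ((v >> bit) & 1)
    return key

rows = [(3, 0), (1, 2), (0, 3), (2, 1), (1, 1)]     # hypothetical 2-bit attributes
print(sorted(rows, key=raster_key))
print(sorted(rows, key=lambda r: peano_key(r, 2)))
```

In the Peano order, rows that agree on the high-order bits of all attributes end up adjacent, which is presumably what helps the compression and the KNN timings reported on the slide.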
Astronomy Application:
National Virtual Observatory data
• What Ptree dimension and what ordering should
be used for astronomical data?
   • All bodies are assumed to be on the
surface of a sphere, the celestial sphere
(which shares its equatorial plane with the earth and has no
specified radius)
• Peano Triangle Mesh Tree (PTM-tree)
• Peano Celestial Coordinate tree (PCCtree)
   • Uses (RA, dec) coordinates of the celestial sphere
   • RA = Right Ascension (longitudinal angle)
   • dec = declination (latitude angle)
Peano Triangular Mesh Tree (PTM-tree)
• Similar to the Hierarchical Triangular Mesh (HTM)
used in the Sloan Digital Sky Survey project.
In both:
• Sphere is divided into triangles
• Triangle sides are always great circle segments.
• The difference between HTM and PTM-trees is in the way in which the triangles are ordered.
[Figure: a triangle subdivided two levels deep (labels 1,0 ... 1,3 and 1,1,0 ... 1,3,3), shown once with the ordering of HTM and once with the ordering of the PTM-tree]
Why use a different ordering?
PTM Triangulation of the Celestial Sphere
Traverse the southern hemisphere in the reverse direction
(just the identical pattern pushed down), arriving at
the Southern neighbor of the start point: a
globe-filling curve?
[Figure: the traversal drawn on RA and dec axes]
This “Peano ordering” produces a sphere-surface filling curve with good continuity characteristics.
PTM triangulation: Next Level
[Figure: next-level triangles labelled LRLR]
PTM triangulation: Next Level
[Figure: next-level triangles labelled with alternating LRLR and RLRL orderings]
Peano Celestial Coordinate Trees (PCCtrees)
Unlike PTM-trees, which initially partition
the sphere into the 8 faces of an
octahedron, in PCCtrees
the sphere is transformed into a cylinder,
then into a rectangle,
then standard Peano ordering is used on the
Celestial Coordinates.
Celestial Coordinates
• RA is from 0° to 360°
• dec is from -90° to 90°
[Figure: PCCtree. Sphere → Cylinder → Plane. The rectangle spans RA 0° to 360° horizontally and dec -90° (South Plane) through 0° to 90° (North Plane) vertically, and is tiled by Z-shaped Peano curves.]
PUBLIC (Ptree Unified BioLogical
InformatiCs Data Cube and
Dimension Tables)
Gene Dimension Table (four genes, one value per gene):
SubCell-Location:  Myta  Ribo  Nucl  Ribo
Function:          apop  meio  mito  apop
StopCodonDensity:  .1    .1    .1    .9
PolyA-Tail:        1     1     0     0
Organism Dimension Table:
Organism  Species                   Vert  Genome Size (million bp)
human     Homo sapiens              1     3000
fly       Drosophila melanogaster   0     185
yeast     Saccharomyces cerevisiae  0     12.1
mouse     Mus musculus              1     3000
Experiment Dimension Table (MIAME)
[Table: one row per experiment e0 ... e3, with columns LAB, PI, UNV, STR, CTY, STZ, ED, AD, S, H, M, N]
Gene-Organism Dimension Table (chromosome, length)
[Table: one (chromosome, length) entry per gene g0 ... g3 and organism o0 ... o3]
Gene-Experiment-Organism Cube
(1 iff that gene from that organism
expresses at a threshold level in that
experiment.)
many-to-many-to-many relationship
Protein-Protein Interaction Pyramid
[Figure: pyramid of interaction bits over genes g0 ... g3]
Original Gene Dimension Table (four genes, one value per gene):
SubCellLocation:   Myta  Ribo  Nucl  Ribo
Function:          apop  meio  mito  apop
StopCodonDensity:  .1    .1    .1    .9
PolyA-Tail:        1     1     0     0
Boolean Gene Dimension Table (Binary)
[Table: one row per gene (g0 ... g3) with Boolean columns Myta, Ribo, Nucl, apop, Meio, Mito, SCD1, SCD2, SCD3, SCD4 and PolyA]
Association of Computing Machinery KDD-Cup-02
NDSU Team
Greyware PPI graph mining tool
• Visualize feature information using a glyph for each gene (PPI graph node)
• PPI Edge iff the 2 genes code for interacting proteins
[Figure: glyph for g1, and the Gene Dimension Table (non-binary) with columns including Myta, Ribo, Nucl, apop, Meio, Mito, SCD (stopcodondensity), length and essential]
(This visual data mining tool was effective in KDD-CUP ’02)
Network Security Application
(Network security through Vertically Structured data)
• Network layers do their own partitioning
   • Packets, frames, etc. (usually independent of any intrinsic data
structuring, e.g., record structure)
   • Fragmentation/Reassembly, Segmentation/Reassembly
• Data privacy is compromised when the horizontal (stream) message content
is eavesdropped upon at the reassembled level (in network)
• A standard solution is to host-encrypt the horizontal structure so that any
network-reassembled message is meaningless.
• Alt.: Vertically structure (decompose, partition) data (e.g., basic Ptrees).
   • Send one Ptree per packet
   • Send intra-message packets separately
   • Trick flow classifiers into thinking the multiple packets associated
with a particular message are unrelated.
   • The message is only meaningful after destination demux-ing
• Note: the only basic Ptree that holds actual information is the
high-order bit Ptree. Therefore encrypt it!
It seems like there ought to be a whole range of killer ideas associated with the
concept of using vertically structured data within network transmission units
• Active networking? (AND basic Ptrees (or just certain levels of them) at active net nodes?)
A very informal seminar
Dr. Michael Vogt
Principal Investigator and Project Engineer
Chemical Microsensor Division
Argonne National Laboratory
• Friday, November 7, 2003, 3:30 P.M.
• IACC 204N