The Dawning of the Age of Infinite Storage
William Perrizo
Dept of Computer Science, North Dakota State Univ.

TeraBytes are Here
• 1 TB costs 1k$ to buy
• 1 TB costs 300k$/y to own
• Management & curation are expensive
• Searching 1 TB takes hours
I'm Terrified by TeraBytes. I'm Petrified by PetaBytes. I'll soon be Exafied by ExaBytes. I'm too old to ever be Zettafied by ZettaBytes, but you may be in your lifetime. You may even be Yottafied by YottaBytes. You probably won't ever be Googified by GoogiBytes. But one should "never say never".
Scale (we are here: Tera): Googol 10^100, Yotta 10^24, Zetta 10^21, Exa 10^18, Peta 10^15, Tera 10^12, Giga 10^9, Mega 10^6, Kilo 10^3

How much information is there?
• Soon everything can be recorded and indexed.
• Most bytes will never be seen by humans.
• Data summarization, trend detection, anomaly detection, and data mining are key technologies.
(Figure: scale from Kilo to Yotta, marked with A Book, A Photo, A Movie, All books (words), All Books MultiMedia, and Everything Recorded.)
Small prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli

First Disk, 1956: IBM 305 RAMAC
• 4 MB
• 50 x 24" disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer & accounting software (tubes, not transistors)
(Photo captions: "Me, at 13"; 1.6 meters; 10 years later, 30 MB.)

The Cost of Storage: about 1K$/TB
(Figure: price vs. disk capacity charts for SCSI and IDE drives, dated 12/1/1999, 9/1/2000, 9/1/2001, 4/1/2002, and 11/4/2003. Axes: raw disk unit size in GB against price in $ and against raw k$/TB, each with a fitted linear trend per drive type.)

E.g., A recent Purchase Order
Company: NDSU
Date: 8/7/03
System Board: Intel D865 GBFL system board w/LAN, 800 MHz FSB
Processor: Intel Pentium 4 2.6 GHz
Hard Drives: 4 x 250 GB IDE (total = 1 TB) (main expense is here)
Controller: Onboard IDE Controller
2nd IDE Controller:
Video: Integrated
Diskette Drive: 1.44 MB
Memory: 4 GB 400 MHz memory
CD/DVD Drive: DVD/CDRW
Sound: Integrated AC97 Audio w/Soundmax
Case: Performance Minitower ATX w/300 Watt PS
Keyboard: Microsoft 104 Internet keyboard
Mouse: Microsoft Intellimouse Optical
Operating System: none
Network Cards: Integrated Intel 10/100 Ethernet w/D845GEBV2L board
Price: $2,899.00

Disk Evolution
(Figure: disk capacity on the scale Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta.)

Memex
As We May Think, Vannevar Bush, 1945
"A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
"yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can enter material freely"

Trying to fill a terabyte in a year
Item                           Items/TB   Items/day
300 KB JPEG                    3M         9,800
1 MB Doc                       1M         2,900
1 hour 256 kb/s MP3 audio      9K         26
1 hour 1.5 Mb/s MPEG video     290        0.8

The Personal Terabyte: How Will We Find Anything?
• Need Queries, Indexing, Data Mining, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access.
• If you don't use a DBMS, you will implement one!
• Need Data Mining, Machine Learning!
• 80% of data is personal/individual; 20% is Corporate, Governmental.
(Figure labels: SQL ++, DBMS.)

Why Mining Data?
Parkinson's Law (for data): Data expands to fill available storage (and then some).
Disk-storage version of Moore's law: Capacity ∝ 2^(t / 9 months). Available storage doubles every 9 months!
Another More's Law: More is Less. The more volume, the less information. (AKA: Shannon's Canon)
A simple illustration: Which phone book is more helpful?
BOOK-1
Name    Number
Smith   234-9816
Jones   231-7237
BOOK-2
Name    Number
Smith   234-9816
Smith   231-7237
Jones   234-9816
Jones   231-7237

EOS Data Mining example
This dataset is a 320-row by 320-column (102,400 pixels) spatial file with 5 feature attributes (B, G, R, NIR, Y). The (B, G, R, NIR) features are in the TIFF image, and the Y (crop yield) feature is color coded in the Yield Map (blue = low; red = high).
(Figure: TIFF image and Yield Map.)
What is the relationship between the color intensities and yield? We can hypothesize:
hi_green and low_red → hi_yield
which, while not simply an SQL query result, is not surprising. We could analyze the data to confirm this hypothesis, but Data Mining is more than just confirming hypotheses. The stronger rule,
hi_NIR and low_red → hi_yield
is not an SQL result and is surprising. Data Mining includes suggesting new hypotheses.
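To make the idea of a rule such as hi_NIR and low_red → hi_yield more concrete, here is a minimal sketch of how the support and confidence of one candidate rule could be measured over pixel records. Everything in it is an assumption for illustration: the pixel values are made up (they are not the EOS dataset), the threshold of 128 for "hi" versus "low" is arbitrary, and the names `rule_stats`, `antecedent`, and `consequent` are hypothetical.

```python
# Toy pixels (invented values, NOT the EOS data): NIR, red, and yield intensities.
PIXELS = [
    {"nir": 200, "red": 40,  "yield": 220},
    {"nir": 210, "red": 30,  "yield": 190},
    {"nir": 180, "red": 60,  "yield": 90},
    {"nir": 50,  "red": 200, "yield": 40},
    {"nir": 90,  "red": 150, "yield": 160},
]

HI = 128  # assumed cut-off between "low" and "hi" intensity


def antecedent(p):
    """hi_NIR and low_red."""
    return p["nir"] >= HI and p["red"] < HI


def consequent(p):
    """hi_yield."""
    return p["yield"] >= HI


def rule_stats(rows, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs over rows."""
    n_lhs = sum(1 for r in rows if lhs(r))
    n_both = sum(1 for r in rows if lhs(r) and rhs(r))
    support = n_both / len(rows)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


support, confidence = rule_stats(PIXELS, antecedent, consequent)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Confirming a hypothesis amounts to evaluating one such rule; suggesting new hypotheses means searching over many candidate antecedents and keeping those whose support and confidence clear chosen thresholds.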
Another Precision Agriculture Example: Grasshopper (or any pest) Infestation Prediction
• Grasshoppers cause significant economic loss each year.
• Early infestation prediction is key to damage control.
Association rule mining on remotely sensed imagery holds significant promise for early detection. Can initial infestation be determined from RGB bands?

Gene Regulation Pathway Discovery
Results of clustering may indicate, for instance, that nine genes are involved in a metabolic pathway. High-confidence rule mining on that cluster may discover the relationships among the genes in which the expression of one gene (e.g., Gene2) is regulated by others. Other genes (e.g., Gene4 and Gene7) may not be directly involved in regulating Gene2 and can therefore be excluded (more later).
(Figure: Gene1 through Gene9, first grouped by clustering and then analyzed with ARM.)

Sensor Network Data Mining
Micro and nano scale sensor blocks are being developed for sensing:
• Biological agents
• Chemical agents
• Motion detection
• Coatings deterioration
• RF-tagging of inventory
• Structural materials fatigue
There will be trillions++ of individual sensors creating mountains of data. The data must be mined for its information.

Sensor Network Application: CubE for Active Situation Replication (CEASR)
(Figure: nano-sensors dropped into the situation space; the soldier sees a replica of the sensed situation prior to entering the space.)
Drop or mortar "smart dust" sensors into the situation space to detect armour, chemical, biological, thermal, and so on. Wherever a threshold level is sensed, a ping is sent for that location. Using Alien Technology's Fluidic Self-assembly (FSA) technology, clear plastic layers with embedded nanoLEDs at each voxel are laminated into a viewing cube.
The pings are transmitted to the cube, using one Ptree, where the pattern is displayed on the cube. A more sophisticated CEASR device could sense and transmit intensity levels, lighting up the display voxel with the appropriate intensity.
What data structure should be used? Standard horizontal record structures may be infeasible. We suggest one vertical P-tree.

Anthropology Application
Digital Archive Network for Anthropology (DANA): data mine anthropological artifacts (shape, color, discovery location, …).

Data Mining?
Querying is asking specific questions and expecting specific answers. Data Mining is going into the MOUNTAIN of DATA and returning with information gems. But also some fool's gold? Relevance and interestingness analysis serves to assay those information and knowledge gems.

Data Mining Process
Data mining: the core of the knowledge discovery process. Stages, from raw data to result, with loop backs between stages:
• Mountain of Raw Data (raw database, smart files)
• Data Cleaning/Integration: missing data, outliers, noise, errors
• Data Warehouse: cleaned, integrated, read-only, periodic, historical
• Selection: feature extraction, tuple selection
• Task-relevant Data
• Data Mining: OLAP, Classification, Clustering, ARM
• Pattern Evaluation and Assay, visualization

Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
• Standard querying: SQL SELECT FROM WHERE; complex queries (nested, EXISTS, ...)
• Searching and Aggregating: FUZZY query, search engines, BLAST searches, OLAP (rollup, drilldown, slice/dice, ...)
• Machine Learning: Supervised Learning (classification, regression); Unsupervised Learning (clustering)
• Data Mining / Data Prospecting: Association Rule Mining, Fractals, …
On the Query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record '02). On the Data Mining end, the surface has barely been scratched. But even those scratches had a great impact: one of the early scratchers recently became the biggest corporation in the world.
A non-scratcher filed for bankruptcy. (Walmart vs. KMart)

Our Approach
Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees (Ptrees in either case) [1], processed horizontally.
Ubiquitously, DBMSs process horizontal records vertically, through SCANs. We propose processing vertical data structures (Ptrees) horizontally, through ANDs.
Ptrees are data-mining-ready, compressed vertical data structures, which attempt to address the curse of scalability and the curse of dimensionality.
How are Ptrees constructed? The next slides illustrate the construction of a set of BASIC P-TREES which represent a data file in a lossless, compressed, data-mining-ready way.
[1] Ptree Technology is patent pending by North Dakota State University.

A file, R(A1..An), contains horizontal structures (a set of horizontal records) processed vertically (vertical scans).
Ptrees: vertically partition; then compress each vertical bit slice into a basic Ptree; then horizontally process these basic Ptrees using one multi-operand logical AND.

R(A1  A2  A3  A4): horizontal structures (records), scanned vertically
   010 111 110 001
   011 111 110 000
   010 110 101 001
   010 111 101 111
   101 010 001 100
   010 010 001 101
   111 000 001 100
   111 000 001 100

Vertical bit slices (Rij = bit j of attribute Ai, listed over the 8 records):
R11: 0 0 0 0 1 0 1 1
R12: 1 1 1 1 0 1 1 1
R13: 0 1 0 0 1 0 1 1
R21: 1 1 1 1 0 0 0 0
R22: 1 1 1 1 1 1 0 0
R23: 1 1 0 1 0 0 0 0
R31: 1 1 1 1 0 0 0 0
R32: 1 1 0 0 0 0 0 0
R33: 0 0 1 1 1 1 1 1
R41: 0 0 0 1 1 1 1 1
R42: 0 0 0 1 0 0 0 0
R43: 1 0 1 1 0 1 0 0

1-Dimensional Ptrees are built by recording the truth of the predicate "pure 1" recursively on halves, until there is purity. For P11 (built from R11 = 0 0 0 0 1 0 1 1):
1. Whole file is not pure1: 0
2. 1st half is not pure1: 0 (but it is pure (pure0), so this branch ends)
3. 2nd half is not pure1: 0
4. 1st half of 2nd half is not pure1: 0
5. 2nd half of 2nd half is pure1: 1
6. 1st half of 1st half of 2nd half is pure1: 1
7. 2nd half of 1st half of 2nd half is not pure1: 0
(Figure: the twelve basic Ptrees P11 through P43, one per bit slice.)

E.g., to count the 111 000 001 100s, use "pure111000001100":
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
The result tree has 0 at the 2^3 level, 0 0 at the 2^2 level, and 0 1 at the 2^1 level, so the count is 2.

Can anyone build us a hardware ANDer for this Ptree AND?
• A card for a Pentium-4 or Itanium (or Opteron or G5 or …)
• An active network device (e.g., a modified ATM switch in which the inbuffer "load" code is modified to disable the clear-to-1's, assuming buffer-load micro-code is clear-to-1's followed by AND)
• All optical device (ANDing on-the-fly with zero time delay?)
We envision a world-wide consortium of Beowulf clusters of such machines, so that the WWW can be data mined in parallel effectively?

Vertical Data Structures History
In the 1980s vertical data structures were proposed for record-based workloads:
• Decomposition Storage Model (DSM, Copeland et al)
• Attribute Transposed File (ATF)
• Bit Transposed File (BTF, Wang et al); Viper
• Band Sequential Format (BSQ) for Remotely Sensed Imagery
DSM and BTF initiatives have disappeared. Why?
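Returning to the construction of basic Ptrees and the "pure111000001100" count described earlier, the following is a minimal sketch, not the NDSU implementation. It assumes a nested-list representation (1 for a pure1 run, 0 for a pure0 run, a two-element list for a mixed half) and the hypothetical helper names `bit_slices`, `build_ptree`, and `count_pattern`. For brevity the count is taken by ANDing the uncompressed bit slices, which gives the same answer as ANDing the compressed trees.

```python
# The 8 records of R(A1 A2 A3 A4) from the slide, 3 bits per attribute.
RECORDS = [
    "010111110001",
    "011111110000",
    "010110101001",
    "010111101111",
    "101010001100",
    "010010001101",
    "111000001100",
    "111000001100",
]


def bit_slices(records):
    """Vertically partition: slice j holds bit j of every record."""
    width = len(records[0])
    return [[int(r[j]) for r in records] for j in range(width)]


def build_ptree(bits):
    """Recursive halving: 1 = pure1, 0 = pure0, [left, right] = mixed."""
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0
    half = len(bits) // 2
    return [build_ptree(bits[:half]), build_ptree(bits[half:])]


def count_pattern(slices, pattern):
    """AND each slice (complemented where the pattern bit is 0) and count 1s."""
    hits = [1] * len(slices[0])
    for column, want in zip(slices, pattern):
        hits = [h & (b if want == "1" else 1 - b) for h, b in zip(hits, column)]
    return sum(hits)


slices = bit_slices(RECORDS)
p11 = build_ptree(slices[0])
print(p11)
print(count_pattern(slices, "111000001100"))
```

The complemented operands correspond to the primed Ptrees (P'21, P'22, and so on) in the AND expression.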
Vertical auxiliary and system structures
• Domain & Request Vectors (DVA/ROLL/ROCC; Perrizo, Shi, et al): vertical system structures (query optimization & synchronization)
• Bit Mapped Indexes (BMIs, very popular in Data Warehouses)
• All indexes are really vertical auxiliary structures: BMIs use bit maps (positional approach to IDing records); other indexes use RID lists (keyword or value approach)

Horizontal Processing of Vertical Structures for Record-based Workloads
For record-based workloads (e.g., SQL), where the result is a set of records, changing the horizontal record structure and then having to reconstruct it may introduce too much post processing?
For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, where there is no reconstructive post processing?
(Figure: the relation R(A1 A2 A3 A4) and its bit slices R11 through R43, as shown earlier.)

Run Lists: Another way to handle vertical data
Generalized Ptrees using standard run length compression of vertical bit files (alternatively, using Lempel-Ziv?, Golomb?, other?).
Run Lists record the type and start-offset of pure runs, written truth:start. E.g., RL11, built from R11 = 0 0 0 0 1 0 1 1:
1. 1st run is Pure0: 0:000
2. 2nd run is Pure1: 1:100
3. 3rd run is Pure0: 0:101
4. 4th run is Pure1: 1:110
RL11 = 0:000 1:100 0:101 1:110

E.g., to count the 111 000 001 100s, use "pure111000001100":
RL11 ^ RL12 ^ RL13 ^ RL'21 ^ RL'22 ^ RL'23 ^ RL'31 ^ RL'32 ^ RL33 ^ RL41 ^ RL'42 ^ RL'43
(to complement, flip purity bits)
(Figure: the relation R, its bit slices R11 through R43, and the corresponding run lists RL11 through RL43.)

Architecture for the DataMIME™ System
(DataMIME™ = data mining, NO NOISE) (PDMS = P-tree Data Mining System)
(Figure: YOUR DATA enters over the Internet through the DII (Data Integration Interface) using the Data Integration Language (DIL); YOUR DATA MINING goes through the DMI (Data Mining Interface) using the Ptree (Predicates) Query Language (PQL). Both sit on the Data Repository: a lossless, compressed, distributed, vertically-structured P-tree database.)

2-Dimensional Pure1-trees
A node is 1 iff that quadrant is purely 1-bits. E.g., a bit-file (from, e.g., the high-order bit of the RED band of a 2-D image):
1111110011111000111111001111111011110000111100001111000001110000
Which, in spatial raster order, looks like:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
Run-length compress it into a quadrant tree using Peano order.
(Figure: the resulting Pure1 quadrant tree.)

Count tree? Counts are what's needed in DM, but P1-trees are more compressed and produce counts quickly.
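The quadrant-tree compression above can be sketched as a recursive quartering of the raster image. This is an illustrative sketch only: it assumes the quadrant order upper-left, upper-right, lower-left, lower-right for Peano order, uses 1 and 0 for pure quadrants and a four-element list for mixed ones, and the function name `build_p1_tree` is hypothetical.

```python
# The 64-bit file from the slide, read as an 8x8 image in raster order.
BIT_FILE = "1111110011111000111111001111111011110000111100001111000001110000"
ROWS = [BIT_FILE[i * 8:(i + 1) * 8] for i in range(8)]


def build_p1_tree(rows, r0=0, c0=0, size=8):
    """Return 1 (pure1), 0 (pure0), or a list of 4 child quadrants (mixed)."""
    if size == 1:
        return int(rows[r0][c0])
    half = size // 2
    # Assumed Peano (Z) order of quadrants: UL, UR, LL, LR.
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    kids = [build_p1_tree(rows, r0 + dr, c0 + dc, half) for dr, dc in offsets]
    if all(k == 1 for k in kids):
        return 1
    if all(k == 0 for k in kids):
        return 0
    return kids


tree = build_p1_tree(ROWS)
print(tree)
```

Pure quadrants collapse to a single node, which is where the compression comes from: for this bit file the first quadrant is pure1 and the last is pure0, so only the two mixed quadrants carry subtrees.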
One can construct the Count-tree, in which each inode counts the 1s in that quadrant.
(Figure: an 8x8 bit image and its count tree, with root count 55.)
• QID (Quadrant ID): e.g., 2.2.3. The pixel at (7, 1) = (111, 001) has QID 10.10.11 = 2.2.3.
• Nodes mark Pure-1/Pure-0 quadrants; the root holds the Root Count.
• Tree levels: 3, 2, 1, 0 (level-3 at the root down to level-0), with purity counts of 4^3, 4^2, 4^1, 4^0 respectively.
• The Fan-out = 2^dim = 4.

Logical Operations on Ptrees (are used to get counts of any pattern)
(Figure: Ptree 1, Ptree 2, AND result, OR result.)
The AND operation is faster than the bit-by-bit AND since there are shortcuts: any pure0 operand node means the result node is pure0; for any pure1 operand node, copy the subtree of the other operand to the result. E.g., only load quadrant 2 to AND Ptree1, Ptree2, etc.
The more operands there are in the AND, the greater the benefit due to this shortcut (more pure0 nodes).

Ptree Algebra: And, Or, Complement, Other

Ptree (count tree):
55
   16
   8:  3 (1110), 0, 4, 1 (0010)
   15: 4, 4, 3 (1101), 4
   16

PM-tree1 (1 = pure1, 0 = pure0, m = mixed):
m
   1
   m: m (1110), 0, 1, m (0010)
   m: 1, 1, m (1101), 1
   1

Complement (count tree):
9
   0
   8: 1 (0001), 4, 0, 3 (1101)
   1: 0, 0, 1 (0010), 0
   0

PM-tree2:
m
   1
   0
   m: 1, 1, 1, m (0100)
   0

AND Result (PM-tree1 AND PM-tree2):
m
   1
   0
   m: 1, 1, m (1101), m (0100)
   0

How to AND P-trees?
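The two shortcuts described above (a pure0 operand forces a pure0 result; a pure1 operand lets the other operand's subtree be copied) can be sketched as a short recursion. This is a sketch under assumptions rather than the authors' code: PM-trees are encoded as nested lists (1 = pure1, 0 = pure0, a four-element list = mixed), and `ptree_and` and `root_count` are hypothetical names. The two operands are PM-tree1 and PM-tree2 as drawn on the slide.

```python
# PM-tree1 and PM-tree2 from the slide: 1 = pure1, 0 = pure0, list = mixed node.
PM1 = [1, [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]], [1, 1, [1, 1, 0, 1], 1], 1]
PM2 = [1, 0, [1, 1, 1, [0, 1, 0, 0]], 0]


def ptree_and(a, b):
    """AND two quadrant trees, using the pure0 and pure1 shortcuts."""
    if a == 0 or b == 0:
        return 0          # any pure0 operand: result is pure0, no descent
    if a == 1:
        return b          # pure1 operand: reuse the other subtree as-is
    if b == 1:
        return a
    kids = [ptree_and(x, y) for x, y in zip(a, b)]
    if all(k == 0 for k in kids):
        return 0          # collapse a node that turned out pure0
    if all(k == 1 for k in kids):
        return 1
    return kids


def root_count(tree, size):
    """Number of 1-bits under a node covering `size` pixels."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    return sum(root_count(k, size // 4) for k in tree)


result = ptree_and(PM1, PM2)
print(result)
print(root_count(PM1, 64), root_count(result, 64))
```

Only the third quadrant is actually descended into here; the other three are settled at the top level, which illustrates why more operands (and so more pure0 nodes) make the AND cheaper.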
Depth-first Pure-1 path AND code
Operand 1:  0 100 101 102 12 132 20 21 220 221 223 23 3
Operand 2:  0 20 21 22 231
RESULT:     0 20 21 220 221 223 231

Basic, Value and Tuple Ptrees
• Basic Ptrees (a Pure1-tree predicate-tree for the target bit of the target attribute), e.g., P11, P12, …, P18, P21, …, P28, …, P71, …, P78 (first index: target attribute; second index: target bit position).
• Value Ptrees, formed by AND (predicate: quad is purely the target value in the target attribute), e.g., P1,5 = P1,101 = P11 AND P12' AND P13 (first index: target attribute; second index: target value).
• Tuple Ptrees, formed by AND (predicate: quad is purely the target tuple), e.g., P(1, 2, 3) = P(001, 010, 111) = P1,001 AND P2,010 AND P3,111.
• Cube Ptrees, formed by AND/OR (predicate: quad is purely in the target cube, a product of intervals), e.g., P([1,3], , [0,2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2).

Hilbert Ordering?
• In 2 dimensions, Peano ordering is 2x2-recursive z-ordering (raster ordering).
• Hilbert ordering is 4x4-recursive tuning fork ordering (H-trees have fanout = 16).
(Figure: tuning forks with nodes 0 through F, each pointing down, right, left, or up.)
Coordinates of a tuning-fork (upper-left) depend on ancestry. (x, y) = (ggrrbb, ggrrbb).
If your parent points Down and you are the H node in your tuning-fork, your 2-bit contribution is given by:
H node   row(x)   col(y)
0        00       00
1        00       01
2        01       01
3        01       00
4        10       00
5        11       00
6        11       01
7        10       01
8        10       10
9        11       10
A        11       11
B        10       11
C        01       11
D        01       10
E        00       10
F        00       11
Lookup tables for Up, Left, and Right parents are similar.
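In contrast to the Hilbert lookup table, the 2x2-recursive Peano (Z) ordering needs no table: each level of the quadrant ID is just one bit from each coordinate. The sketch below assumes the convention implied by the QID example (7, 1) = (111, 001) giving 10.10.11 = 2.2.3, namely that the first coordinate supplies the high bit of each digit. The function names `qid` and `peano_index` are hypothetical.

```python
def qid(x, y, levels):
    """Quadrant ID digits, most significant level first."""
    digits = []
    for k in range(levels - 1, -1, -1):
        x_bit = (x >> k) & 1
        y_bit = (y >> k) & 1
        digits.append((x_bit << 1) | y_bit)  # one base-4 digit per tree level
    return digits


def peano_index(x, y, levels):
    """Position of (x, y) along the Peano (Z) curve, from its QID digits."""
    index = 0
    for d in qid(x, y, levels):
        index = index * 4 + d
    return index


print(".".join(str(d) for d in qid(7, 1, 3)))
print(peano_index(7, 1, 3))
```

Sorting pixels by `peano_index` places each quadrant's pixels contiguously, which is what lets pure quadrants be recorded as single tree nodes.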
3-Dimensional Ptrees
3-Dimensional Ptrees (e.g., for the CEASR sensor network)
X  Y  Z  Intensity
0  0  0  15 (1111)
1  0  0  15 (1111)
0  1  0  15 (1111)
1  1  0  15 (1111)
0  0  1  15 (1111)
1  0  1  15 (1111)
0  1  1  15 (1111)
1  1  1  15 (1111)
2  0  0  15 (1111)
3  0  0   4 (0100)
2  1  0   1 (0001)
3  1  0  12 (1100)
2  0  1  12 (1100)
3  0  1   2 (0010)
2  1  1  12 (1100)
3  1  1  12 (1100)
0  2  0  15 (1111)
1  2  0  15 (1111)
0  3  0   2 (0010)
1  3  0   0 (0000)
0  2  1  15 (1111)
1  2  1  15 (1111)
0  3  1   2 (0010)
1  3  1   0 (0000)
2  2  0  12 (1100)

Ptree dimension
The dimension of the Ptree structure is a user-chosen parameter. It can be chosen to fit the data dimension:
• Most datasets: 1-D Ptrees (recursive halving)
• 2-D Images: 2-D Ptrees (recursive quartering)
• 3-D Solids: 3-D Ptrees (recursive eighth-ing)
Or the dimension can be chosen based on other considerations:
• optimize compression
• increase processing speed (next slide)

Generalized Raster and Peano Sorting
Generalizes to any table with numeric attributes (not just images).
• Raster Sorting: attributes 1st, bit position 2nd
• Peano Sorting: bit position 1st, attributes 2nd
(Figure: an unsorted relation, raster sorted and Peano sorted.)

Generalized Peano Sorting: KNN speed improvement (using 5 UCI Machine Learning Repository data sets)
(Figure: bar chart of time in seconds, on a 0 to 120 scale, for Unsorted, Generalized Raster, and Generalized Peano orderings.)

Astronomy Application: National Virtual Observatory data
What Ptree dimension and what ordering should be used for astronomical data, where all bodies are assumed to be on the surface of a sphere, the celestial sphere (which shares its equatorial plane with the earth and has no specified radius)?
• Peano Triangular Mesh Tree (PTM-tree)
• Peano Celestial Coordinate tree (PCCtree): uses the (RA, dec) coordinates of the celestial sphere. RA = Right Ascension (longitudinal angle); dec = declination (latitude angle).

Peano Triangular Mesh Tree (PTM-tree)
Similar to the Hierarchical Triangular Mesh (HTM) used in the Sloan Digital Sky Survey project. In both:
• The sphere is divided into triangles.
• Triangle sides are always great circle segments.
PTM-trees differ from HTM in the way the triangles are ordered.
(Figure: ordering of HTM versus ordering of PTM-tree, with triangles labeled 1,0 through 1,3 at the first level and 1,1,0 through 1,3,3 at the next.)
Why use a different ordering?

PTM Triangulation of the Celestial Sphere
Traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down, arriving at the southern neighbor of the start point): a globe-filling curve?
(Figure: traversal of the celestial sphere, with axes RA and dec.)
This "Peano ordering" produces a sphere-surface filling curve with good continuity characteristics.

PTM triangulation: Next Level
(Figure: next-level triangles traversed in alternating LRLR and RLRL patterns.)

Peano Celestial Coordinate Trees (PCCtrees)
Unlike PTM-trees, which initially partition the sphere into the 8 faces of an octahedron, PCCtrees transform the sphere into a cylinder, then into a rectangle, and then use standard Peano ordering on the Celestial Coordinates.
Celestial Coordinates: RA is from 0° to 360°; dec is from -90° to 90°.
(Figure: Sphere to Cylinder to Plane. The plane spans RA from 0° to 360° and dec from -90° (South) through 0° to 90° (North), and is covered by Peano Z-ordering.)

PUBLIC (Ptree Unified BioLogical InformatiCs Data Cube and Dimension Tables)

Gene Dimension Table
                    g0     g1     g2     g3
SubCell-Location    Myta   Ribo   Nucl   Ribo
Function            apop   meio   mito   apop
StopCodonDensity    .1     .1     .1     .9
PolyA-Tail          1      1      0      0

Organism Dimension Table
Organism   Species                     Vert   Genome Size (million bp)
human      Homo sapiens                1      3000
fly        Drosophila melanogaster     0      185
yeast      Saccharomyces cerevisiae    0      12.1
mouse      Mus musculus                1      3000

(Figure: the Gene-Experiment-Organism cube over genes g0 to g3, organisms o0 to o3, and experiments e0 to e3, together with the Gene-Organism Dimension Table (chromosome, length) and the Experiment Dimension Table (MIAME).)
Gene-Experiment-Organism Cube: a cell is 1 iff that gene from that organism expresses at a threshold level in that experiment. This is a many-to-many-to-many relationship.

Protein-Protein Interaction Pyramid
(Figure: the Original Gene Dimension Table as above, the gene-by-gene interaction pyramid over g0 to g3, and the Boolean Gene Dimension Table (binary), with one column per value: Myta, Ribo, Nucl, apop, Meio, Mito, SCD1 to SCD4, PolyA, and one row per gene.)

Association for Computing Machinery KDD-Cup-02: NDSU Team
Greyware PPI graph mining tool
• Visualize feature information using a glyph for each gene (PPI graph node).
• PPI edge iff the 2 genes code for interacting proteins.
(Figure: Gene Dimension Table (non-binary), with columns including Myta, Ribo, Nucl, apop, Meio, Mito, SCD (stop codon density), length, essential, and others, and the glyph for g1.)
This visual data mining tool was effective in KDD-CUP '02.

Network Security Application (Network security through vertically structured data)
• Network layers do their own partitioning: packets, frames, etc. (usually independent of any intrinsic data structuring, e.g., record structure); Fragmentation/Reassembly, Segmentation/Reassembly.
• Data privacy is compromised when the horizontal (stream) message content is eavesdropped upon at the reassembled level (in network).
• A standard solution is to host-encrypt the horizontal structure so that any network-reassembled message is meaningless.
• Alternative: vertically structure (decompose, partition) the data (e.g., basic Ptrees). Send one Ptree per packet. Send intra-message packets separately. Trick flow classifiers into thinking the multiple packets associated with a particular message are unrelated.
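As a rough illustration of the vertical alternative, the sketch below splits a message into eight bit slices, one per hypothetical packet, and reassembles it at the destination. It is only a sketch under assumptions: the slices are left uncompressed (a real system would compress each into a basic Ptree), no actual networking or encryption is performed, and `to_bit_slices` and `from_bit_slices` are invented names.

```python
def to_bit_slices(message):
    """Vertical decomposition: slice k holds bit k (high-order first) of every byte."""
    return [[(byte >> (7 - k)) & 1 for byte in message] for k in range(8)]


def from_bit_slices(slices):
    """Destination-side demux: recombine the eight slices into the original bytes."""
    length = len(slices[0])
    return bytes(
        sum(slices[k][i] << (7 - k) for k in range(8))
        for i in range(length)
    )


message = b"Ptree"
packets = to_bit_slices(message)   # one slice per (hypothetical) packet
print(packets[0])                  # the high-order bit slice
print(from_bit_slices(packets))
```

No single slice reproduces the message; an eavesdropper would need all eight and the knowledge that they belong together.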
The message is only meaningful after destination demux-ing.
Note: the only basic Ptree that holds actual information is the high-order bit Ptree. Therefore encrypt it!
It seems like there ought to be a whole range of killer ideas associated with the concept of vertically structuring data within network transmission units.
Active networking? (AND basic Ptrees, or just certain levels of them, at active net nodes?)

A very informal seminar
Dr. Michael Vogt
Principal Investigator and Project Engineer
Chemical Microsensor Division, Argonne National Laboratory
Friday, November 7, 2003, 3:30 P.M.
IACC 204N