Data from Far and Wide: Finding IT, Managing IT, Using IT

Professor Robert Hollebeek
NSCP - University of Pennsylvania
7th International Conference on High Performance Computing, December 18, 2000, Bangalore, India

Outline
- The importance of Data Intensive Computing
- Data and Medicine
- Data and Maps
- Data Infrastructure
- Conclusions

12/18/00 R. Hollebeek

Data Intensive Computing: Particularly Interesting (hard) when Data
- comes from distributed sensors
- is controlled or stored in distributed databases or caches
- is secure or semi-private
- is large scale (terabyte to petabyte)
- is made of multi-component data

Difficulty Increases with data diversity, size, speed requirements
- Current projects explore all three dimensions
[Figure labels: Govt Data, Medical Data, Size]

The Power of Data Mining
- Network traffic on a 500 node LAN
[Figure: traffic matrix, Source Computer vs. Destination Computer]
- The network data shown here contains a lot of information but, displayed this way, yields little insight or knowledge about the underlying activity.

NSCP BlockNess Algorithm
[Figure: reordered traffic matrix, Source Node vs. Destination Node]
- Rearranged, sorted and clustered, we see that there are several major groups of processors with joint activities.

Data Mining Prerequisites
- Finding IT: Find Interesting Data
  - Data Intensive Applications: Social Science, Economics, Medicine, Science
- Managing IT: Data Infrastructure and Data Organization
  - Parallel Storage above the Terabyte Level
- Using IT: Finally you get to do Mining
  - Data Intensive -> Semi-automated

Talk Will Highlight
- Examples of Data Intensive Applications from NSCP@PENN (http://nscp.upenn.edu)
  - NDMA: National Digital Mammography Archive (Massive, Distributed, Secure)
  - NIS-P: Neighborhood Information System for Philadelphia (Diverse, Web enabled, Secure)
- Parallel Data Infrastructure: NSCP (Ultra high speeds for massive data)

Outline - Data and Medicine
- The importance of Data Intensive Computing
- Data and Medicine
  - Finding IT
  - Managing IT
  - Using IT
- Data and Maps
- Data Infrastructure
- Conclusions

Finding IT
- Hospitals: X-rays, mammograms, MRI, CAT scans, endoscopies, ...
  - Very large data sources: great clinical value to digital storage and manipulation, and significant cost savings
  - 7,000 Gigabytes per hospital per year
  - dominated by digital images
- Why we chose Mammography
  - clinical need for film recall
  - large volume (4,000 GB/year)
  - standards exist
  - great clinical value to this application

Managing IT

Major Components
- Hospital Portal Systems
- Network Infrastructure
- "RadAR": Large Scale Storage and Indexing

RadAR: NSCP@PENN
- Radiology Active Repository
- High capacity radiology storage developed by NSCP 1996-1999

RadAR Components
- Large Disks
- Parallel CPU
- Control (MAR)
- Hi-speed Interconnect

RadAR MetaData
- Large Disks
- MetaData

RadAR Contents (not to scale)
- Large Disks
- MetaData, Logs, Images, Records
- DICOM SR, BI-RADS

RadAR + Portals
- Portal Systems at HUP, UNC, UC, SWH
- MAP/MAQ
- NDMA/NSCP
- Large Disks, Parallel CPU, Control (MAR), Hi-speed Interconnect
- MetaData, Images, Logs, Records

MAP - MA system portal
- Hospital Network, VPN
- Win 2000, Linux
- Two Dual Processor IBM/Netfinity 5100 systems

Portals + RadAR
- Hospital Network, VPN
- Win 2000, Linux
- Large Disks, Parallel CPU, Control (MAR), Hi-speed Interconnect

NSCP High Capacity Archive
- RadAR: 100 TB, million records per day pilot system developed by NSCP and demonstrated at SC98

NSCP - IBM/SP2 Hardware Components
[Diagram labels: Control (MAR, spcw); Serial Ports; High Performance Switch (HPS); ATM; Primary Node and Backup Primary Node (sp01, sp02) with Disk Pool 1 and Disk Pool 2; Status Node; four further nodes (labelled sp03), each with its own disk pool of data disks]

Lab Tour

Scale of the Problem
- Recent FDA approval, cost and other advantages of digital devices will encourage digital radiology conversion
- 2000 Hospitals x 7 TB per year x 2 = 28 PetaBytes per year
  - (1 Petabyte = 1 Million Gigabytes)
- Pilot problem scale in NDMA
  - 4 x 7 x 2 = 56 Terabytes / year

Storage Hierarchy
- 7 R (Regional) @ 4,000 TB/yr
- 20 A (Area) @ 100 TB/yr
- 15 H (Hospital / Clinic) @ 7 TB/yr
- Goal: Distribute Storage Load and Balance Network and Query Loads

Networks
- 7 TB / yr in each hospital is ~2% of an OC3
- Typical T1 to DS-3 connections at clinics today are almost sufficient
- Study size and transmission time to a remote reader are a more important constraint, requiring higher speeds
  - 1.5 minutes at DS-3
  - 2 sec at OC48

NDMA
- NSCP@Penn: Digital Storage, Search and Retrieval
- Oak Ridge National Lab: Network (VPN) and Security
- Hospitals of
  - University of Pennsylvania
  - University of Chicago
  - University of North Carolina
  - University of Toronto
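
The scale and network figures above come down to rate arithmetic, which the following sketch reproduces. It is illustrative only: the line rates are the standard nominal values, while the 500 MB study size and the `efficiency` parameter (the share of line rate actually sustained) are assumptions that the talk does not state.

```python
# Nominal line rates in bits per second (standard values).
LINE_RATE_BPS = {
    "T1": 1.544e6,
    "DS-3": 44.736e6,
    "OC3": 155.52e6,
    "OC12": 622.08e6,
    "OC48": 2488.32e6,
}

SECONDS_PER_YEAR = 365 * 24 * 3600
TB = 1e12  # decimal terabyte, in bytes


def average_rate_bps(bytes_per_year):
    """Average bit rate needed to ship a yearly data volume continuously."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR


def link_fraction(bytes_per_year, link, efficiency=1.0):
    """Fraction of a link consumed; efficiency is an assumed usable share."""
    return average_rate_bps(bytes_per_year) / (LINE_RATE_BPS[link] * efficiency)


def transfer_seconds(num_bytes, link, efficiency=1.0):
    """Time to move num_bytes over a link at the assumed efficiency."""
    return num_bytes * 8 / (LINE_RATE_BPS[link] * efficiency)


hospital_volume = 7 * TB              # 7 TB per hospital per year (from the talk)
national_volume = 2000 * 7 * TB * 2   # 2000 hospitals x 7 TB x 2 (from the talk)
study_bytes = 500e6                   # hypothetical study size, not from the talk

print(f"national volume: {national_volume / 1e15:.0f} PB/yr")
print(f"hospital share of OC3: {link_fraction(hospital_volume, 'OC3'):.1%}")
print(f"study at DS-3: {transfer_seconds(study_bytes, 'DS-3'):.0f} s")
print(f"study at OC48: {transfer_seconds(study_bytes, 'OC48'):.1f} s")
```

At full line rate a 7 TB/yr hospital occupies a little over 1% of an OC3. The ~2% quoted in the talk would match a link sustaining about half its nominal rate, which is a guess on our part; passing `efficiency=0.5` shows that case. With the assumed 500 MB study, the transfer takes about a minute and a half at DS-3 and under two seconds at OC48.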

Large scale radiology testbed
- Regional and Area Archives (A)
[Diagram: several REGIONAL archives, each surrounded by Area archives (A)]
- Layout matches growth pattern of national networks

Portal Systems in the test lab at NSCP/PENN

First Hospital portal systems being installed at the Hospital of the University of Pennsylvania
- Portal NDMA01 in place in the communications closet
- Construction of the remaining Portal systems

Systems undergoing network tests in the server room
- 1200 Gigabyte fast disk under test in a joint program with Lucent and CyberStorage Systems

Using IT
- Store records for retrieval
  - typical request would retrieve 3-4 yrs
- Audit and log transmissions
- Parse, index and store incoming information
- Support Computer Assisted Diagnostics
- Support Radiologist Training and Evaluation

Training, Teaching, Evaluation

Network and Data Security
- Virtual Private Network: used to assure system security
- User Authentication: password + token or biometric
- Roles: Doctor, Administrator, Assistant, ...
- Client Authorization: required for Medical Records

NDMA Data Mining Challenges
- Fuzzy matching for records
- feature matching in images
- clustering: outcomes, other variables
- outlier search in many dimensions
- computer assisted diagnosis
- NDMA: http://nscp.upenn.edu/ndma

NSCP with Children's Hospital
- To provide fast parallel processing over high speed nets so that functional MRI can be used in real time clinically
- On the right: an individual noisy frame of a human brain

Functional MRI
- J. Yu, graduate student; degree in 2000; now on Wall Street
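
For the functional MRI work, voxels are grouped by the similarity of their time courses, with each voxel allowed partial membership in several groups. The talk does not name the algorithm used, so the sketch below assumes fuzzy c-means, a common choice. The function name, parameters and tiny synthetic time courses are all hypothetical.

```python
import random


def fuzzy_c_means(series, n_clusters, m=2.0, n_iter=50, seed=0):
    """Return (centers, memberships) for a list of equal-length time courses."""
    rng = random.Random(seed)
    n, dim = len(series), len(series[0])

    # Random initial memberships, normalised so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([x / total for x in row])

    centers = []
    for _ in range(n_iter):
        # Centers are membership-weighted means of the time courses.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append([
                sum(weights[i] * series[i][d] for i in range(n)) / wsum
                for d in range(dim)
            ])
        # Memberships fall off with squared distance to each center.
        for i in range(n):
            d2 = [
                sum((series[i][d] - centers[k][d]) ** 2 for d in range(dim)) + 1e-12
                for k in range(n_clusters)
            ]
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d2]
            total = sum(inv)
            u[i] = [x / total for x in inv]
    return centers, u


# Synthetic "voxel" time courses: two oscillating, two flat (made-up values).
voxels = [
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 0.9, 0.1, 1.1],
    [1.0, 1.0, 1.0, 1.0],
    [0.9, 1.1, 1.0, 0.9],
]
centers, memberships = fuzzy_c_means(voxels, n_clusters=2)
labels = [row.index(max(row)) for row in memberships]
print(labels)
```

Each row of `memberships` sums to 1, so a voxel with an ambiguous time course keeps comparable weight in both clusters instead of being forced into one. No anatomical information enters the calculation, only the temporal pattern.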

Fuzzy Clustering
- Cluster analysis is based on partitioning a collection of data points into a number of subgroups, where the data points inside a cluster (subgroup) show a certain degree of similarity.
- A fuzzy clustering algorithm was used to group resting brain voxels according to similarity in temporal pattern, without prior knowledge of brain anatomy.
- Fuzziness here was used to stress the fuzzy nature of the brain data set.

Outline
- Data and Medicine
- Data and Maps
  - Census, Economic Data
  - Government, City, Demographic
  - Neighborhood Information System in Philadelphia
- Data Infrastructure

Finding IT
- Federal Government Sources: Census files
- State Government Sources: Economic files
- City Government Sources: Revenue files, taxes, permits, land use, ...

History
- ES202 (federal economic data) with State of Pennsylvania: began 1997
- Census files: started 1998
- NIS - Geographical Information Systems: started 1999
- XML - Digital Government (with SDSC)
- Current Program: combine state economic activity data, census demographic data and city operational data

Managing IT
- Federal ES202 - State Economic Data
- 1990 Census
- Neighborhood Information System

NSCP Economic Database
- Step 1: data management
- ES202 Federal Data: raw data in raw format -> clean DB2 tables and clean, tabbed flat files

Collect, Organize, add crucial components
- Format of T89, .., T97, and T_Large
- Field groups marked on the slide: Government record, County & Township, SICs, Wages, Employment, Name & Address, Location
- Fields, in slide order:
  UIACCOUNTNB, RPINGUNITNB, AUXILIARYCODE, DTLASTRPCHANGE, FIRSTDTONUDB, UDBNB, EFFECTIVEDTIDDATA,
  STATECODE, COUNTYCODE, TOWNSHIPCODE, OWNERSHIPCODE, EIN,
  SICCODE, SICCHANGEDT, SICVERIFICATIONDT, SICVERIFICATIONRS,
  QTLYAVMTHLYWAGE1, .., 4, IMPUTEDWAGESFLAG1, .., 4, CALYRAVGMTHLYWAGE, MEEICODE,
  MTHLYEMPL1, .., 12, IMPUTEDEMPLFLAG1, .., 4, ANNUALAVEMPL, VARCALYEAREMPLS,
  ADDRESSSOURCE, ADDRESSCHANGEDT, PHONENB, TRADENAME, RPINGUNITDESCR, MSACODE, REFERENCEDT, ADDRESSTYPE, CITY, LEGALNAME, STREETADDRESS, STATEABBREVIATION, ZIPCODE,
  LOCX, LOCY, ZIPCODEEXPANSION

Derive new tables of interest
- Source table: T_LARGE (UDBNB, ..., REFERENCEDT)
- BIRTHS (UDBNB, YEAR): udbnb that does not exist in the previous year
- DEATHS (UDBNB, YEAR): udbnb that vanishes next year
- MOVE (UDBNB, YEAR, OLDCOUNTY, NEWCOUNTY, OLDZIP, NEWZIP, OLDLOCX, OLDLOCY, NEWLOCX, NEWLOCY): udbnb that changes location from previous year
- MOVEJOBS (UDBNB, YEAR, OLDCOUNTY, NEWCOUNTY, OLDZIP, NEWZIP, EMPL, WAGE): udbnb that changes location from previous year; also shows empl and wage

Data Input
- Users need easy methods to insert data into the complicated large scale parallel database system
- Input needs to be semi-automatic to avoid a huge administrative load
- Tools to help the user provide
  - data description (schema)
  - file uploads

Using IT

NSCP-SEPCHE Census Extract CDROM Version 1.0
- The primary census database tools have been used to create a CD ROM of census data extracts for individual regions of the country.
- CD used at SEPCHE colleges for demos, teaching, and research, and at Stanford for statistics.
- Hypertext index page (right) helps browse both raw census data and demonstrative processed data views
- Reformulated Census tables into an easy to use Philadelphia regional extract.

Census Demographic Information
- Median Incomes of Philadelphia Census Tracts
- The size of each 'bubble' is proportional to the median income of persons living in the Philadelphia County census tracts.
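
The derived BIRTHS, DEATHS and MOVE tables described for the economic database amount to set comparisons between consecutive yearly tables. A minimal sketch follows, assuming each year's table is held in memory as a dictionary from UDBNB to a (county, zip) pair. That representation and the sample values are hypothetical; the talk's tables live in DB2.

```python
def births(prev_year, this_year):
    """UDBNBs present this year that did not exist in the previous year."""
    return sorted(set(this_year) - set(prev_year))


def deaths(this_year, next_year):
    """UDBNBs present this year that vanish next year."""
    return sorted(set(this_year) - set(next_year))


def moves(prev_year, this_year):
    """UDBNBs whose location changed from the previous year."""
    result = []
    for udbnb in sorted(set(prev_year) & set(this_year)):
        old, new = prev_year[udbnb], this_year[udbnb]
        if old != new:
            result.append({
                "UDBNB": udbnb,
                "OLDCOUNTY": old[0], "NEWCOUNTY": new[0],
                "OLDZIP": old[1], "NEWZIP": new[1],
            })
    return result


# Made-up sample data: UDBNB -> (county code, zip code).
t96 = {1: ("101", "19104"), 2: ("101", "19103"), 3: ("045", "19063")}
t97 = {2: ("091", "19002"), 3: ("045", "19063"), 4: ("101", "19104")}

print(births(t96, t97))   # enterprises new in the later year
print(deaths(t96, t97))   # enterprises gone by the later year
print(moves(t96, t97))    # enterprises that changed location
```

A MOVEJOBS variant would carry the employment and wage fields along with each moved record, so that job and wage migration between counties can be summed directly.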

Fast preparation of underlying extracts from the large parallel systems
- Fast extraction from parallel DB2 on SP2 frame to spreadsheet on PC via self-installing CDROM
- Easy navigation of county/census tract level data for selected counties using spreadsheet tabs
- Data tables (in spreadsheets) ready for processing in formulas, tables, statistics etc.
- Samples and examples of data manipulations and data views included on the CDROM

Correlation of Poverty Rate with other Factors in Southeast Pennsylvania Counties
[Figure: correlation coefficient (axis -0.5 to 1) by factor: long commute, linguistic isolation, large household, single father, single mother, fraction Hispanic, fraction black, unemployment, education deprivation; one series each for Philadelphia, Bucks, Chester, Delaware and Montgomery counties]
- New 'object oriented' access and manipulation tools provide a straightforward approach to handling natural units of data.
- Sample above generated by iterating variations on a few lines of code, invoking generic 'methods' associated with 'census table' objects.

Data Mining in Economics
- Collaborate with the Pennsylvania State Department of Commerce and Economic Development to analyze economic activity in Pennsylvania from 1989 to 1997

Time Series Extractor
- An application that generates the time series of a user-specified cross-section of the economy
- Capable of using many computers to search through the database distributed on different machines (MPI)
- On the left: the time series of the total monthly wages paid to employees at all the Food Stores with 10 to 20 workers in Philadelphia county
- MPI based implementation allows single PC or cluster utilization

Job and Wage Migration

Pittsburgh

Database has a time sequence for each enterprise
[Figure: 10 x Percentage Employment Change; example of a single enterprise time sequence, Jan 89 to Dec 96]
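
The Time Series Extractor has a simple structure: filter enterprise records by a cross-section, then sum per month, with each machine handling its own share of the database. The sketch below shows that structure only and is not the NSCP implementation. The record layout is a hypothetical simplification of the ES202 fields, the sample values are made up, and plain Python lists stand in for the database partitions that an MPI version would assign to separate processes.

```python
def partial_series(records, sic_prefix, county, min_empl, max_empl, n_months):
    """Sum monthly wages over the matching records of one partition."""
    totals = [0.0] * n_months
    for rec in records:
        matches = (
            rec["sic"].startswith(sic_prefix)
            and rec["county"] == county
            and min_empl <= rec["empl"] <= max_empl
        )
        if matches:
            for month, wage in enumerate(rec["monthly_wages"]):
                totals[month] += wage
    return totals


def combine(partials):
    """Element-wise sum of per-partition results (the reduce step)."""
    return [sum(values) for values in zip(*partials)]


# Two hypothetical partitions; SIC codes starting with 54 are food stores.
partitions = [
    [
        {"sic": "5411", "county": "Philadelphia", "empl": 12,
         "monthly_wages": [100.0, 110.0, 120.0]},
        {"sic": "5812", "county": "Philadelphia", "empl": 15,
         "monthly_wages": [50.0, 50.0, 50.0]},
    ],
    [
        {"sic": "5461", "county": "Philadelphia", "empl": 18,
         "monthly_wages": [10.0, 20.0, 30.0]},
        {"sic": "5411", "county": "Bucks", "empl": 11,
         "monthly_wages": [5.0, 5.0, 5.0]},
        {"sic": "5411", "county": "Philadelphia", "empl": 40,
         "monthly_wages": [70.0, 70.0, 70.0]},
    ],
]

series = combine(
    partial_series(p, "54", "Philadelphia", 10, 20, 3) for p in partitions
)
print(series)
```

Because every partition is scanned independently and only the short per-month totals are exchanged, the same code runs on a single PC with one partition or on a cluster with many. In an MPI setting the `combine` step would correspond to a sum reduction.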

Example of a High Level Data Mining Query
- Find the most prevalent pattern
- And the three companies which follow that trend most closely
- An example of what you can do on a cluster system

Time Histories
- A particular class of data with special data mining needs
- How to stripe the data across processors: two primary choices
  - by location, or
  - by time
- Compare histories, cluster histories
- Use history (i.e. internal dynamics) to define clusters

Graphic techniques (GIS) are particularly useful in this type of data
- Graphic displays of location or economic activity
- State of Pennsylvania: employment increases and decreases

NIS-P: Neighborhood Information System
- 1990 Census
- Federal ES202
- City Data

City Data
- Philadelphia Census
- Philadelphia Bureau of Revenue and Taxes, Licenses, Water, Redevelopment, Gas, ...
- New York and Philadelphia housing abandonment data
- GIS stereo photography
- Build combined Philadelphia Neighborhood Information System

Using IT

Application Demonstration
- Map linked application for extracting data from the master database: not available in web version of talk. See also http://nscp.upenn.edu/nis

Community Access Web Site
- Web site provides operational access to data for the City
- Provides community access to selected City data

Data Mining in multi-dimensions
- Location (latitude/longitude)
- Market value
- Sale Price
- Zipcode
- 5 dimensions

Easy to locate geographic clusters
- But more valuable to simultaneously cluster in several variables

Segmentation based on location, sales price and market value
- Group 1: high sales price, high assessment
- Group 2: high sales price, low assessments, particular zipcode
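
Clustering in several variables at once can be treated as a search for density variations: points with unusually many neighbours form hot spots, and points with unusually few form cold ones. The sketch below is one simple way to do that and is not the NSCP tool. The radius `h`, the thresholds `hot_factor` and `cold_factor`, and the sample points are assumptions. In practice the variables would first be scaled to comparable units.

```python
import math


def neighbour_counts(points, h):
    """Number of other points within distance h of each point."""
    counts = []
    for i, p in enumerate(points):
        counts.append(sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= h
        ))
    return counts


def classify(points, h, hot_factor=1.25, cold_factor=0.5):
    """Label each point 'hot', 'normal' or 'cold' relative to the mean count."""
    counts = neighbour_counts(points, h)
    mean = sum(counts) / len(counts)
    labels = []
    for c in counts:
        if c > hot_factor * mean:
            labels.append("hot")
        elif c < cold_factor * mean:
            labels.append("cold")
        else:
            labels.append("normal")
    return labels


# Made-up points: a tight group of four, a loose pair, and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (1.0, 1.0), (1.2, 1.0),
    (5.0, 5.0),
]
labels = classify(points, h=0.5)
print(labels)
```

`math.dist` accepts tuples of any length, so the same function works unchanged for five-dimensional records such as latitude, longitude, market value, sale price and zipcode. Constraints such as rivers or borders would need an extra check before two points are counted as neighbours.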

Cluster Finding in multi-dimensions
- Looking for density variations with constraints and boundaries
- Load data from a file: we use the standard open dialog box etc.
- Load data from a database
  - Note that a database can be local or remote
  - NISMAIN is a DB2 parallel database which exists on the SP2 supercomputer
  - After the database is selected you must select the table of interest
- Data are selected. Now we can run by clicking on run
- Visualize
  - Choose the value of h by typing it in the edit box
  - The circle in the middle is to guide the eye for the value of h
  - Red is for hot spots, green for normal and blue for cold
- Boundaries: rivers, borders, data constraints, ...

Outline
- The importance of Data Intensive Computing
- Data and Medicine
- Data and Maps
- Data Infrastructure
- Conclusions

Parallel and Distributed Data Intensive Applications
- Some general principles gleaned from the Data Mining and Application Examples
- Three Lessons

Three Lessons about High Performance Data Mining and Data Intensive Computing
- PIOM: Parallelize I/O and Match
- EDP: Exploit Data Parallelism
- OPDLT: Optimize Physical Data Layout and Transforms

Time to move/scan a Terabyte sets the scale of what can be achieved
[Figure: TB / day (axis 0 to 12) for GIGABIT and OC3 links]

Lesson One: PIOM - Parallelize I/O and Match

General System Design
[Diagram labels: Striped Disks; SP Nodes; HPS; ATM; ATM or Ethernet Switch]

Data Flow Architecture
- Eliminate bottlenecks

Design for matching disks to fiber
[Diagram labels: Parallel Disk; Parallel nodes; Front end; Switch; fiber]
- Collaboration with Lucent: OC48 Drivers, Cards, Switches

NSCP Petabytes Design
[Diagram: four parallel chains, each labelled Parallel Disk, Parallel nodes, Front end, Switch, fiber]
- To move a Petabyte in one year requires approximately 75% of an OC12
- Disk $$ required dropping rapidly
- 4xOC48 scans a petabyte in about 2 1/2 weeks

Lesson Two: EDP - Exploit Data Parallelism
- Interesting data can often be segmented into independent units
- Shared Nothing clusters of computers are extremely effective for data intensive mining of such segmented data

Scalable Clusters
- Goal: Performance increase x N (without needing more people)

Picking Corn and Data Mining

Lesson Three: OPDLT - Optimize Physical Data Layout and Transforms
- Two goals with (sometimes) competing requirements
- Importance of Legacy Data Formats
  - storage in legacy format may be necessary
  - have a strategy to interface to legacy formats
- Importance of Optimized Data Layout
  - mining and query times depend critically on data layout

Example: Data re-Arrangement "Multifile"
- Multifile: column oriented rearrangement of data with metadata indices enables fast parallel search strategies

Finally - the Petabyte Test System
[Diagram labels: WAN; Lucent Edge Switch; 4x OC48; 4x IBM Netfinity 5100; Fast Interconnect; 4x Ultra SCSI 3; CyberBorg Petabyte Storage]
- Prototype for the NDMA Area Archive
- Joint Development
  - Lucent (www.lucent-optical.com/oan)
  - CyberStorage (www.cyberstorage.com)
  - Hubs Inc. (www.hubs-inc.com)
  - Penn

Storage/Application/Net Fabric
- Design for a merged storage, computation, and communication fabric
[Diagram labels: Communication; CPU; Storage; Link]
- Scalable to Petabyte Data
- Project Confidential
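
The column-oriented rearrangement behind "Multifile" can be illustrated in a few lines. This is not the NSCP Multifile format, whose details the talk does not give. The block size, the min/max block index and all function names are assumptions, chosen to show why a column layout with metadata lets a search skip most of the data.

```python
def to_columns(rows, fields):
    """Rearrange row records into one list per field."""
    return {f: [row[f] for row in rows] for f in fields}


def build_block_index(column, block_size):
    """Record (start, end, min, max) for each block of a column."""
    index = []
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]
        index.append((start, start + len(block), min(block), max(block)))
    return index


def search_range(column, index, low, high):
    """Row positions with low <= value <= high, skipping blocks via the index."""
    hits, scanned = [], 0
    for start, end, bmin, bmax in index:
        if bmax < low or bmin > high:
            continue  # metadata shows no value in this block can match
        scanned += 1
        hits.extend(i for i in range(start, end) if low <= column[i] <= high)
    return hits, scanned


# Made-up enterprise records with an employment count.
rows = [
    {"udbnb": i, "empl": e}
    for i, e in enumerate([5, 8, 12, 15, 40, 45, 300, 320])
]
columns = to_columns(rows, ["udbnb", "empl"])
empl_index = build_block_index(columns["empl"], block_size=2)
hits, scanned = search_range(columns["empl"], empl_index, low=10, high=20)
print(hits, scanned)
```

Only the queried column is read, and only the blocks whose metadata overlaps the range are scanned. Since each block is examined independently, blocks can also be handed to different nodes, which is what makes the layout suit parallel search.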

Conclusions
- NDMA: huge data, significant network requirements, parallel internal infrastructures to enable the data management
- NIS: high volume data from many sources, requires effective data management AND user tools
- Parallel and high-speed networks: the key to making the data move both internally and externally

Conclusions
- Data Intensive Computing is Interesting, Challenging, and Crucial
- Real Applications are the key
  - Examples
    - Data and Medicine
    - Data and Maps
- Data Infrastructure Conclusions
  - Parallel hardware, parallel software, parallel data

[email protected]
http://nscp.upenn.edu/hollebeek/talks/india