Data from Far and Wide:
Finding IT,
Managing IT,
Using IT
Professor Robert Hollebeek
NSCP - University of Pennsylvania
7th International Conference on High
Performance Computing, December 18, 2000
Bangalore, India
Outline
• The importance of Data Intensive Computing
• Data and Medicine
• Data and Maps
• Data Infrastructure Conclusions
12/18/00
R. Hollebeek
[Slide filled with the word "data" repeated]
Data Intensive Computing: Particularly Interesting (hard) when
• Data comes from distributed sensors
• is controlled or stored in distributed databases or caches
• is secure or semi-private
• is large scale (terabyte to petabyte)
• is made of multi-component data
Difficulty Increases with data diversity, size, speed requirements
Current Projects explore all three dimensions
[Figure labels: Govt Data, Medical Data; axis: Size]
The Power of Data Mining
Network Traffic on a 500 node LAN
[Figure: traffic plotted by Source Computer against Destination Computer; axes labeled Source Node and Destination Node]
The network data shown here contains a lot of information but, displayed this way, yields little insight or knowledge about the underlying activity.
NSCP BlockNess Algorithm
Once the data are rearranged, sorted and clustered, we see that there are several major groups of processors with joint activities.
Data Mining Prerequisites

• Finding IT: Find Interesting Data
– Data Intensive Applications
  • Social Science, Economics, Medicine, Science
• Managing IT: Data Infrastructure and Data Organization
– Parallel Storage above the Terabyte Level
• Using IT: Finally you get to do Mining
– Data Intensive -> Semi-automated
Talk Will Highlight Examples of Data Intensive Applications from NSCP@PENN (http://nscp.upenn.edu)
• NDMA: National Digital Mammography Archive (Massive, Distributed, Secure)
• NIS-P: Neighborhood Information System for Philadelphia (Diverse, Web enabled, Secure)
• Parallel Data Infrastructure: NSCP (Ultra high speeds for massive data)
Outline - Data and Medicine
• The importance of Data Intensive Computing
• Data and Medicine
– Finding IT
– Managing IT
– Using IT
• Data and Maps
• Data Infrastructure Conclusions
Finding IT

• Hospitals: X-rays, mammograms, MRI, CAT scans, endoscopies, ...
– Very large data sources - great clinical value to digital storage and manipulation and significant cost savings
– 7,000 Gigabytes per hospital per year
– dominated by digital images
• Why we chose Mammography
– clinical need for film recall
– large volume (4,000 GB/year)
– standards exist
– great clinical value to this application
Managing IT
Major Components
• Hospital Portal Systems
• Network Infrastructure
• “RadAR” Large Scale Storage and Indexing
RadAR : NSCP@PENN

• High capacity radiology storage developed by NSCP 1996-1999
• Radiology Active Repository
RadAR Components
Large Disks
Parallel CPU
Control (MAR)
Hi-speed Interconnect
RadAR MetaData
Large Disks
MetaData
RadAR Contents
Large Disks
Not to scale
MetaData
Logs
Images
Records
DICOM SR
BI-RADS
RadAR + Portals
Portal Systems at HUP, UNC, UC, SWH
MAP/MAQ
NDMA/NSCP
Large Disks
Parallel CPU
Control (MAR)
MetaData
Images
Logs
Records
Hi-speed Interconnect
Map - MA system portal
Hospital
Network
VPN
Win 2000
Linux
Two Dual Processor IBM/Netfinity 5100 systems
Portals + RadAR
Hospital
Network
VPN
Large Disks
Win 2000 Linux
Parallel CPU
Control (MAR)
Hi-speed Interconnect
NSCP High Capacity Archive
100 TB, million-record-per-day pilot system developed by NSCP and demonstrated at SC98
RadAR
NSCP – IBM/SP2 Hardware Components
[Diagram labels: Control (MAR, spcw), Serial Ports, High Performance Switch (HPS), ATM, nodes sp01, sp02 and sp03, Primary Node, Backup Primary Node, Status, and Disk Pools each holding many Data volumes]
Lab Tour
Scale of the Problem
• Recent FDA approval and cost and other advantages of digital devices will encourage digital radiology conversion
• 2000 Hospitals x 7 TB per year x 2
• 28 PetaBytes per year
– (1 Petabyte = 1 Million Gigabytes)
• Pilot Problem scale in NDMA
– 4 x 7 x 2 = 56 Terabytes / year
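The scale estimate above is simple enough to check by hand, but a short sketch makes the units explicit. The function name is hypothetical, and the factor of 2 is taken from the slide as given; the talk does not say what it represents.

```python
# Back-of-envelope check of the archive scale quoted on the slide.
TB_PER_PB = 1000  # decimal units: 1 PB = 1,000 TB = 1,000,000 GB

def yearly_volume_tb(hospitals, tb_per_hospital_year, factor):
    """Total archive growth in TB per year (hypothetical helper)."""
    return hospitals * tb_per_hospital_year * factor

national_tb = yearly_volume_tb(2000, 7, 2)  # national estimate
pilot_tb = yearly_volume_tb(4, 7, 2)        # four pilot hospitals

print(national_tb / TB_PER_PB, "PB/year nationally")
print(pilot_tb, "TB/year in the NDMA pilot")
```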
Storage Hierarchy
[Diagram: hierarchy of Hospital / Clinic (H), Area (A) and REGIONAL (R) archives]
• 15 H @ 7 TB/yr
• 20 A @ 100 TB/yr
• 7 R @ 4,000 TB/yr
Goal: Distribute Storage Load and Balance Network and Query Loads
Networks
• 7 TB / yr in each hospital is ~2% of an OC3
• Typical T1 to DS-3 connections at clinics today are almost sufficient
• Study size and transmission time to a remote reader are a more important constraint, requiring higher speeds
– 1.5 Minutes at DS-3
– 2 sec at OC48
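The network figures can be roughly reproduced from nominal line rates. This is a sketch under stated assumptions: decimal terabytes, links used at full nominal rate, and a study size of 500 MB, which is not given in the talk but is about what the two quoted transfer times imply. With these inputs the average load comes out a little over 1% of an OC3, the same order as the slide's ~2%, and slightly above a single T1.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Nominal line rates in bits per second.
LINE_RATE_BPS = {
    "T1": 1.544e6,
    "DS-3": 44.736e6,
    "OC3": 155.52e6,
    "OC48": 2488.32e6,
}

def average_rate_bps(terabytes_per_year):
    """Average bit rate needed to ship a yearly volume (decimal TB)."""
    return terabytes_per_year * 1e12 * 8 / SECONDS_PER_YEAR

def transfer_seconds(megabytes, link):
    """Time to send one study over a link at full nominal rate."""
    return megabytes * 1e6 * 8 / LINE_RATE_BPS[link]

hospital_bps = average_rate_bps(7)
oc3_fraction = hospital_bps / LINE_RATE_BPS["OC3"]

STUDY_MB = 500  # assumed study size, not stated in the talk
ds3_s = transfer_seconds(STUDY_MB, "DS-3")
oc48_s = transfer_seconds(STUDY_MB, "OC48")

print(f"7 TB/yr averages {hospital_bps / 1e6:.2f} Mbit/s, "
      f"{oc3_fraction:.1%} of an OC3")
print(f"{STUDY_MB} MB study: {ds3_s / 60:.1f} min at DS-3, {oc48_s:.1f} s at OC48")
```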
NDMA

NSCP@Penn:
– Digital Storage, Search and Retrieval

Oak Ridge National Lab:
– Network (VPN) and Security

Hospitals of
– University of Pennsylvania
– University of Chicago
– University of North Carolina
– University of Toronto
Large scale radiology testbed
Regional and Area Archives (A)
[Map: five REGIONAL archives, each surrounded by Area archives (A)]
Layout matches growth pattern of national networks
Portal Systems in the test lab at NSCP/PENN
First Hospital portal systems being installed at the Hospital of the University of Pennsylvania
Portal NDMA01 in place in the communications closet
Construction of the remaining Portal systems
Systems undergoing network tests in the server room
1200 Gigabyte fast disk under test in a joint program with Lucent and CyberStorage Systems.
Using IT





• Store Records for retrieval
– typical request would retrieve 3-4 yrs
• Audit and log transmissions
• Parse, Index and Store incoming information
• Support Computer Assisted Diagnostics
• Support Radiologist Training and Evaluation
Training, Teaching, Evaluation
Network and Data Security

Virtual Private Network
– used to assure system security

User Authentication
– password + token or biometric

Roles
– Doctor, Administrator, Assistant, ...

Client Authorization
– required for Medical Records
NDMA Data Mining Challenges
• Fuzzy matching for records
• feature matching in images
• clustering - outcomes, other variables
• outlier search in many dimensions
• computer assisted diagnosis
NDMA -
http://nscp.upenn.edu/ndma
NSCP with Children’s Hospital
• To provide fast parallel processing over high speed nets so that functional MRI can be used in real time clinically
• On the right: an individual noisy frame of a human brain
Functional MRI

J. Yu, graduate student. Degree in 2000. Now on Wall Street.
Fuzzy Clustering
Cluster analysis partitions a collection of data points into a number of subgroups (clusters), where the data points inside a cluster show a certain degree of similarity.
A fuzzy clustering algorithm was used to group resting brain voxels according to similarity in temporal pattern, without prior knowledge of brain anatomy. Fuzziness here reflects the fuzzy nature of the brain data set.
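The talk does not say which fuzzy clustering algorithm was used. As an illustration only, here is a minimal sketch of the textbook fuzzy c-means method, in which every point gets a degree of membership in every cluster instead of a single label. The toy "voxel time courses", the function names and the parameter values are all assumptions made for the example.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def memberships(points, centers, m=2.0):
    """Degree of membership of every point in every cluster (rows sum to 1)."""
    power = 2.0 / (m - 1.0)
    table = []
    for p in points:
        dists = [distance(p, c) for c in centers]
        if 0.0 in dists:
            # Point coincides with a center: full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((d_i / d_j) ** power for d_j in dists)
                   for d_i in dists]
        table.append(row)
    return table

def update_centers(points, table, n_clusters, m=2.0):
    """Each center is the membership-weighted mean of all points."""
    dim = len(points[0])
    centers = []
    for i in range(n_clusters):
        weights = [row[i] ** m for row in table]
        total = sum(weights)
        centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                        for d in range(dim)])
    return centers

def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        table = memberships(points, centers, m)
        centers = update_centers(points, table, len(centers), m)
    return centers, memberships(points, centers, m)

# Toy "voxel time courses": three samples each, two temporal patterns.
voxels = [
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.1], [0.0, 1.1, 0.0],  # peak in the middle
    [1.0, 0.0, 1.0], [0.9, 0.1, 1.0], [1.1, 0.0, 0.9],  # dip in the middle
]
centers, u = fuzzy_c_means(voxels, initial_centers=[voxels[0], voxels[3]])
for row in u:
    print([round(v, 3) for v in row])
```

Only the temporal pattern enters the distance, so no anatomical information is needed, which matches the description above.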
Outline
• Data and Medicine
• Data and Maps
– Census, Economic Data
– Government, City, Demographic
– Neighborhood Information System in Philadelphia
• Data Infrastructure
Finding IT
• Federal Government Sources: Census files
• State Government Sources: Economic files
• City Government Sources: Revenue files, taxes, permits, land use, …
History

• ES202 (federal economic data) with State of Pennsylvania
– began 1997
• Census files started 1998
• NIS - Geographical Information Systems started 1999
• XML - Digital Government (with SDSC)
Current Program: combine state economic activity data, census demographic data and city operational data
Managing IT
• Federal ES202 - State Economic Data
• 1990 Census
• Neighborhood Information System
NSCP Economic Database

Step 1: data management
ES202 Federal Data: raw data (raw format) -> clean -> DB2 tables and clean, tabbed flat files
Collect, Organize, add crucial components
Format of T89, .., T97, and T_Large
Field groups marked on the slide: Government record, County & Township, SICs, Wages, Employment, Name & Address, Location
Fields, in slide order:
UIACCOUNTNB, RPINGUNITNB, AUXILIARYCODE, DTLASTRPCHANGE, FIRSTDTONUDB, UDBNB, EFFECTIVEDTIDDATA,
STATECODE, COUNTYCODE, TOWNSHIPCODE, OWNERSHIPCODE, EIN,
SICCODE, SICCHANGEDT, SICVERIFICATIONDT, SICVERIFICATIONRS,
QTLYAVMTHLYWAGE1..4, IMPUTEDWAGESFLAG1..4, CALYRAVGMTHLYWAGE, MEEICODE,
MTHLYEMPL1..12, IMPUTEDEMPLFLAG1..4, ANNUALAVEMPL, VARCALYEAREMPLS,
ADDRESSSOURCE, ADDRESSCHANGEDT, PHONENB, TRADENAME, RPINGUNITDESCR, MSACODE, REFERENCEDT, ADDRESSTYPE,
CITY, LEGALNAME, STREETADDRESS, STATEABBREVIATION, ZIPCODE, LOCX, LOCY, ZIPCODEEXPANSION
Derive new tables of interest
Source: T_LARGE (UDBNB, REFERENCEDT)
• BIRTHS (UDBNB, YEAR): udbnb that does not exist in the previous year
• DEATHS (UDBNB, YEAR): udbnb that vanishes next year
• MOVE (UDBNB, YEAR, OLDCOUNTY, NEWCOUNTY, OLDZIP, NEWZIP, OLDLOCX, OLDLOCY, NEWLOCX, NEWLOCY): udbnb that changes location from previous year
• MOVEJOBS (UDBNB, YEAR, OLDCOUNTY, NEWCOUNTY, OLDZIP, NEWZIP, EMPL, WAGE): udbnb that changes location from previous year. Also show empl and wage
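The derived tables are defined by comparing the set of unit identifiers (udbnb) in adjacent years. The sketch below shows that logic on a tiny in-memory stand-in; the dictionary layout, the sample units and the function names are hypothetical, and the talk's actual tables live in DB2.

```python
# Hypothetical miniature of T_LARGE: year -> {udbnb: (county, zipcode)}.
t_large = {
    1995: {101: ("Philadelphia", "19104"), 102: ("Bucks", "18901")},
    1996: {101: ("Philadelphia", "19104"), 102: ("Chester", "19380"),
           103: ("Delaware", "19063")},
    1997: {101: ("Philadelphia", "19104"), 103: ("Delaware", "19063")},
}

def births(data, year):
    """udbnb present in `year` but not in the previous year."""
    return sorted(set(data[year]) - set(data.get(year - 1, {})))

def deaths(data, year):
    """udbnb present in `year` that vanishes the next year."""
    return sorted(set(data[year]) - set(data.get(year + 1, {})))

def moves(data, year):
    """(udbnb, old location, new location) for units whose location changed."""
    previous = data.get(year - 1, {})
    return [(u, previous[u], loc) for u, loc in sorted(data[year].items())
            if u in previous and previous[u] != loc]

print(births(t_large, 1996))
print(deaths(t_large, 1996))
print(moves(t_large, 1996))
```

The first and last years of the range need separate handling, since every unit looks like a birth in the first year and a death in the last.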
Data Input
• Users need easy methods to insert data into the complicated large scale parallel database
• system needs to be semi automatic to avoid huge administrative load
• Tools to help user provide
– data description (schema)
– file uploads
Using IT
• The primary census database tools have been used to create a CD ROM of census data extracts for individual regions of the country.
• CD used at SEPCHE colleges for demos, teaching, and research and at Stanford for statistics.
• Hypertext index page (right) helps browse both raw census data and demonstrative processed data views
NSCP-SEPCHE Census Extract CDROM Version 1.0
Reformulated Census tables into easy to use Philadelphia regional extract.
Census Demographic Information
Median Incomes of Philadelphia Census Tracts
The size of each ‘bubble’ is proportional to the median income of persons living in the Philadelphia County census tracts.
Fast preparation of underlying extracts from the large parallel systems
• Fast Extraction from parallel DB2 on SP2 frame to
Spreadsheet on PC via self installing CDROM
• Easy navigation of county/census tract level data
for selected counties using spreadsheet tabs
• Data tables (in spreadsheets) ready for processing
in formulas, tables, statistics etc
• Samples and examples of data manipulations and
data views included on the CDROM
Correlation of Poverty Rate with other Factors in Southeast Pennsylvania Counties
[Chart: Correlation Coefficient (axis from -0.5 to 1) for the factors Long commute, Linguistic isolation, Large Household, single father, single mother, fraction hispanic, fraction black, Unemployment, Education deprivation; one series per county: Philadelphia, Bucks, Chester, Delaware, Montgomery]
• New ‘object oriented’ access and manipulation tools provide a
straightforward approach to handling natural units of data.
• Sample above generated by iterating variations on a few lines of
code, invoking generic ‘methods’ associated with ‘census table’
objects.
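The talk does not show the object oriented tools themselves. As a rough idea of what "generic methods on census table objects" might look like, here is a sketch with a hypothetical `CensusTable` class that computes Pearson correlation coefficients between a target column and every other column. The class, its methods and the numbers are all invented for illustration; the values are not census data.

```python
import math

class CensusTable:
    """Hypothetical stand-in for a 'census table' object: named columns by tract."""

    def __init__(self, columns):
        self.columns = columns  # column name -> list of values, one per tract

    def correlation(self, a, b):
        """Pearson correlation coefficient between two columns."""
        xs, ys = self.columns[a], self.columns[b]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / math.sqrt(var_x * var_y)

    def correlations_with(self, target):
        """Correlate every other column with `target`."""
        return {name: self.correlation(target, name)
                for name in self.columns if name != target}

# Made-up numbers purely to exercise the code; not census values.
table = CensusTable({
    "poverty_rate": [0.10, 0.20, 0.30, 0.40],
    "unemployment": [0.04, 0.08, 0.12, 0.16],
    "long_commute": [0.40, 0.30, 0.20, 0.10],
})
result = table.correlations_with("poverty_rate")
for name, r in result.items():
    print(name, round(r, 3))
```

Repeating the last few lines with one table per county is the kind of "iterating variations on a few lines of code" the slide describes.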
Data Mining in Economics
• Collaborate with the Pennsylvania State Department of Commerce and Economic Development to analyze economic activity in Pennsylvania from 1989 to 1997
Time Series Extractor
- An application that generates the time series of a user-specified cross-section of the economy
- Capable of using many computers to search through the database distributed on different machines (MPI)
On the left: The time series of the total monthly wages paid to employees at all the Food Stores with 10 to 20 workers in Philadelphia county.
MPI based implementation allows single PC or cluster utilization
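The extractor's structure, as described, is a filter over establishment records followed by a per-month sum, with each machine working on its own part of the database. The sketch below imitates that in plain Python without MPI: each list in `partitions` plays the role of one machine's data, and the final loop stands in for a sum-reduce. Record fields, SIC codes and wage values are illustrative assumptions.

```python
def matches(record, sic_prefix, county, min_empl, max_empl):
    """Cross-section filter: industry, county and establishment size."""
    return (record["sic"].startswith(sic_prefix)
            and record["county"] == county
            and min_empl <= record["avg_empl"] <= max_empl)

def partial_series(records, n_months, **criteria):
    """What one worker would compute over its own slice of the database."""
    totals = [0.0] * n_months
    for rec in records:
        if matches(rec, **criteria):
            for month, wage in enumerate(rec["monthly_wages"]):
                totals[month] += wage
    return totals

def extract_series(partitions, n_months, **criteria):
    """Combine per-partition results; with MPI this would be a sum-reduce."""
    series = [0.0] * n_months
    for part in partitions:
        for month, value in enumerate(partial_series(part, n_months, **criteria)):
            series[month] += value
    return series

# Two hypothetical partitions, as if stored on two machines.
partitions = [
    [{"sic": "5411", "county": "Philadelphia", "avg_empl": 12,
      "monthly_wages": [100.0, 110.0, 120.0]},
     {"sic": "5812", "county": "Philadelphia", "avg_empl": 15,
      "monthly_wages": [900.0, 900.0, 900.0]}],
    [{"sic": "5411", "county": "Philadelphia", "avg_empl": 18,
      "monthly_wages": [50.0, 50.0, 50.0]},
     {"sic": "5411", "county": "Bucks", "avg_empl": 11,
      "monthly_wages": [70.0, 70.0, 70.0]}],
]
series = extract_series(partitions, 3, sic_prefix="54", county="Philadelphia",
                        min_empl=10, max_empl=20)
print(series)
```

Because the per-partition step needs no communication, the same code runs on a single PC (one partition) or a cluster (many).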
Job and Wage Migration
Pittsburgh
Database has Time sequence for each enterprise
Example of a single enterprise time sequence, Jan 89 to Dec 96
[Chart axis: 10 x Percentage Employment Change]
Example of a High Level Data Mining Query
Find the most prevalent pattern, and the three companies which follow that trend most closely
An Example of what you can do on a cluster system
Time Histories
• A particular class of data with special data mining needs
• How to stripe the data across processors: two primary choices, by location or by time
• Compare histories, cluster histories
• Use history (i.e. internal dynamics) to define clusters
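The two striping choices can be made concrete with a small sketch. The record layout and the modulo placement rule are assumptions for illustration; the talk does not describe how NSCP assigns data to processors.

```python
def stripe_by_location(records, n_procs):
    """All months for one location land on the same processor."""
    stripes = [[] for _ in range(n_procs)]
    for rec in records:
        stripes[rec["location"] % n_procs].append(rec)
    return stripes

def stripe_by_time(records, n_procs):
    """All locations for one month land on the same processor."""
    stripes = [[] for _ in range(n_procs)]
    for rec in records:
        stripes[rec["month"] % n_procs].append(rec)
    return stripes

# Hypothetical records: 4 locations observed over 6 months.
records = [{"location": loc, "month": m, "empl": 10 * loc + m}
           for loc in range(4) for m in range(6)]

by_loc = stripe_by_location(records, 2)
by_time = stripe_by_time(records, 2)
print(len(by_loc[0]), len(by_time[0]))
```

Striping by location keeps each complete history on one processor, which suits comparing and clustering histories. Striping by time keeps each snapshot together, which suits queries about a single period.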
Graphic techniques (GIS) are particularly useful for this type of data
Graphic displays of location or economic activity
State of Pennsylvania: Employment increases and decreases
NIS-P
Neighborhood Information System
• 1990 Census
• Federal ES202
• City Data
City Data
• Philadelphia Census
• Philadelphia Bureau of Revenue and Taxes, Licenses, Water, Redevelopment, Gas, ...
• New York and Philadelphia housing abandonment data
• GIS stereo photography
• Build combined Philadelphia Neighborhood Information System
Using IT
Application Demonstration
Map linked application for extracting data from the master database: Not available in web version of talk.
See also http://nscp.upenn.edu/nis
Community Access Web Site
Web site
• Provides Operational access to data for the City
• Provides Community access to selected City data
Data Mining in multi-dimensions
•Location (latitude/longitude)
•Market value
•Sale Price
•Zipcode
5 dimensions
Easy to locate geographic clusters
But more valuable to simultaneously cluster in several variables
Segmentation based on location, sales price and market value
Group 1, high sales price, high assessment
Group 2, high sales price, low assessments, particular zipcode
Cluster Finding in multi-dimensions
Looking for density variations with constraints and boundaries
Load data from a file: from a file we use the standard open dialog box etc
Load data from a database: note that a database can be local or remote. NISMAIN is a DB2 parallel database which exists on the SP2 supercomputer.
After database is selected you must select the table of interest
Data are selected. Now we can run by clicking on run
Visualize: choose the value of h by typing it in the edit box. The circle in the middle is to guide the eye for the value of h. Red are for hot spots, green for normal and blue for cold.
Boundaries: rivers, borders, data constraints, ...
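The talk does not define h or the rule that colors a point. One plausible reading, used here purely as an assumption, is that h is a search radius and that a point is hot or cold when the number of neighbors within h is well above or below the average. The thresholds and sample points below are hypothetical, and the sketch ignores the boundary constraints (rivers, borders) mentioned on the slide.

```python
import math

def neighbor_counts(points, h):
    """Number of other points within distance h of each point."""
    counts = []
    for i, (xi, yi) in enumerate(points):
        n = sum(1 for j, (xj, yj) in enumerate(points)
                if j != i and math.hypot(xi - xj, yi - yj) <= h)
        counts.append(n)
    return counts

def classify(points, h, hot_factor=1.25, cold_factor=0.5):
    """Label each point 'red' (hot), 'green' (normal) or 'blue' (cold)."""
    counts = neighbor_counts(points, h)
    mean = sum(counts) / len(counts)
    labels = []
    for c in counts:
        if c > hot_factor * mean:
            labels.append("red")
        elif c < cold_factor * mean:
            labels.append("blue")
        else:
            labels.append("green")
    return labels

points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),  # dense group
    (1.0, 1.0), (1.2, 1.0), (1.1, 1.1),                            # moderate group
    (3.0, 3.0),                                                    # isolated point
]
labels = classify(points, h=0.3)
print(labels)
```

Changing h changes the scale at which density is measured, which is presumably why the tool lets the user type it in and draws a guide circle of that size.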
Outline
• The importance of Data Intensive Computing
• Data and Medicine
• Data and Maps
• Data Infrastructure Conclusions
Parallel and Distributed Data Intensive Applications
• Some general principles gleaned from the Data Mining and Application Examples
• Three Lessons
Three Lessons about High Performance Data Mining and Data Intensive Computing
PIOM: Parallelize I/O and Match
EDP: Exploit Data Parallelism
OPDLT: Optimize Physical Data Layout and Transforms
Time to move/scan a Terabyte sets the scale of what can be achieved
[Chart: TB / day (axis 0 to 12) for GIGABIT and OC3 links]
Lesson One
PIOM: Parallelize I/O and Match
General System Design
[Diagram labels: Striped Disks, SP Nodes, HPS, ATM, ATM or Ethernet Switch]
Data Flow Architecture

Eliminate bottlenecks
Design for matching disks to fiber
[Diagram labels: Parallel Disk, Parallel nodes, Front end, Switch, fiber]
Collaboration with Lucent: OC48 Drivers, Cards, Switches
NSCP Petabytes Design
[Diagram: four parallel units, each with Parallel Disk, Parallel nodes, Front end, Switch and fiber]
• To move a Petabyte in one year requires approximately 75% of an OC12
• Disk $$ required dropping rapidly
• 4xOC48 scans a petabyte in about 2 1/2 weeks
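These throughput figures can be compared with nominal line rates. In the sketch below, a decimal petabyte and full nominal rate give about 41% of an OC12 and about 1.3 weeks on 4xOC48. The slide's 75% and 2 1/2 weeks are roughly what results if a little over half the nominal rate is sustained; that efficiency is an assumption made here, since the talk does not state one.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600
PETABYTE_BITS = 1e15 * 8  # decimal petabyte

# Nominal line rates in bits per second.
OC12_BPS = 622.08e6
OC48_BPS = 2488.32e6

def fraction_of_link(bits, seconds, link_bps):
    """Share of a link needed to move `bits` in `seconds`."""
    return bits / seconds / link_bps

def weeks_to_scan(bits, link_bps, efficiency=1.0):
    """Weeks to stream `bits` at a given fraction of nominal line rate."""
    return bits / (link_bps * efficiency) / SECONDS_PER_WEEK

oc12_share = fraction_of_link(PETABYTE_BITS, SECONDS_PER_YEAR, OC12_BPS)
scan_weeks_raw = weeks_to_scan(PETABYTE_BITS, 4 * OC48_BPS)
scan_weeks_half = weeks_to_scan(PETABYTE_BITS, 4 * OC48_BPS, efficiency=0.5)

print(f"1 PB/year uses {oc12_share:.0%} of an OC12 at nominal rate")
print(f"4xOC48 scan: {scan_weeks_raw:.1f} weeks at nominal rate, "
      f"{scan_weeks_half:.1f} weeks at 50% efficiency")
```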
Lesson Two
EDP Exploit Data Parallelism
Interesting Data can often be segmented into independent units
Shared Nothing Clusters of computers are extremely effective for data intensive mining of such segmented data
Scalable Clusters
Goal: Performance increase x N (without needing more people)
Picking Corn and Data Mining
Lesson Three
OPDLT - Optimize Physical Data Layout and Transforms
Two goals with (sometimes) competing requirements
• Importance of Legacy Data Formats
– storage in legacy format may be necessary
– have a strategy to interface to legacy formats
• Importance of Optimized Data Layout
– mining and query times depend critically on data layout
Example: Data re-Arrangement “Multifile”
• Multifile - column oriented rearrangement of data with metadata indices
• enables fast parallel search strategies
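The talk does not describe the Multifile format in detail. The sketch below shows the general idea it names, under assumptions of my own: rows are rearranged into per-column blocks, and each block carries a small metadata index (here just its minimum and maximum) so that a search can skip blocks that cannot match. Function names, block size and sample rows are hypothetical.

```python
def to_columns(rows, block_size):
    """Rearrange row records into per-column blocks plus min/max metadata."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        blocks = [values[i:i + block_size]
                  for i in range(0, len(values), block_size)]
        index = [(min(b), max(b)) for b in blocks]
        columns[name] = {"blocks": blocks, "index": index}
    return columns

def find_rows(columns, name, low, high):
    """Row numbers with low <= value <= high, skipping blocks the index rules out."""
    col = columns[name]
    size = len(col["blocks"][0])
    hits = []
    for b, (block, (lo, hi)) in enumerate(zip(col["blocks"], col["index"])):
        if hi < low or lo > high:
            continue  # metadata says nothing in this block can match
        hits.extend(b * size + i for i, v in enumerate(block) if low <= v <= high)
    return hits

rows = [{"year": 1989 + i // 2, "empl": e}
        for i, e in enumerate([12, 15, 40, 42, 11, 90, 95, 13])]
columns = to_columns(rows, block_size=2)
hits = find_rows(columns, "empl", 40, 95)
print(hits)
```

Each block can be searched independently of the others, which is what makes this kind of layout friendly to parallel search.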
Finally – the Petabyte Test System
[Diagram labels: WAN, Lucent Edge Switch, 4x OC48, 4x IBM Netfinity 5100, 4x Ultra SCSI 3, CyberBorg]
• Petabyte Storage
• Fast Interconnect
• Prototype for the NDMA Area Archive
Joint Development
• Lucent
• CyberStorage
• Hubs Inc.
• Penn
www.lucent-optical.com/oan
www.cyberstorage.com
www.hubs-inc.com
Storage/Application/Net Fabric
Design for a merged storage, computation, and communication fabric.
[Diagram labels: Communication, CPU, Storage, Link]
Scalable to Petabyte Data
Project Confidential
Conclusions

• NDMA: Huge Data, significant network requirements, parallel internal infrastructures to enable the data management
• NIS: high volume data from many sources, requires effective data management AND user tools
• Parallel and high-speed networks: the key to making the data move both internally and externally
Conclusions

• Data Intensive Computing is Interesting, Challenging, and Crucial
• Real Applications are the key
– Examples
  • Data and Medicine
  • Data and Maps
• Data Infrastructure Conclusions
– Parallel hardware, parallel software, parallel data
http://nscp.upenn.edu/hollebeek/talks/india