Download Introduction to Large Databases and Data Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Extensible Storage Engine wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Database wikipedia , lookup

Relational model wikipedia , lookup

Functional Database Model wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Transcript
Introduc)on
to
Large
Databases
&
Data
Mining
Tips
for
Assembling
Your
Data
Analysis
Toolbox
for
the
22nd
Century
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
1
Outline
‐
I
•  Rela)onal
Databases
&
BIG
DATA
–  Big
data
volumes
require
a
new
data
handling
paradigm
–  Advantages
of
a
rela)onal
database
•  Organiza)on
of
data
•  Data
integrity
•  SQL
‐‐
Structured
(and
almost
standard)
query
language
for
queries
–  What
a
database
is
not.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
2
Outline
‐
II
•  Data
mining
–  What
is
it?
–  Common
data
mining
tasks
–  (FREE)
Tools
available
to
you
to
perform
many
of
these
tasks.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
3
Outline
‐
III
•  Examples
–
Imagined
&
Real
–  If
we
only
had
)me
travel…
–  Things
one
might
start
to
do
with
PAN‐STARRS
data
(right
now).
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
4
RELATIONAL
DATABASES
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
5
Basic
Defini8ons
• 
Database:
• 
Data:
• 
Database
Management
System
(DBMS):
• 
–  A
collec)on
of
related
data
organized
to
provide
informa)on.
–  Known
facts
that
can
be
recorded
and
have
an
implicit
meaning.
–  Oben
integrated
from
several
sources.
–  Stored
in
a
standard
format
for
use
by
mul)ple
applica)ons.
–  A
sobware
package/
system
to
facilitate
the
crea)on
and
maintenance
of
a
computerized
database.
Database
System:
10/05/12
–  The
DBMS
sobware
together
with
the
data
itself
and
the
hardware
upon
which
it
runs.
Some)mes,
the
applica)ons
are
also
included.
Jim
Heasley,
Ins)turte
for
Astronomy
6
Two
approaches
–  Generally,
there
are
two
approaches
to
extract
informa)on
from
data:
•  file
processing
approach
–  file
based
sobware
programs
•  database
approach
–  DBMS
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
7
File
processing
approach
Application program 1
Data
Instructions
Application program n
.
.
.
Data
Instructions
•  Each application
program has a
specific purpose
•  Each program
uses its own data
–  Issues:
•  data
redundancy
•  redundant
processes/interfaces
•  data
integrity
–  ease
of
maintenance
–  consistency
•  Security
–  preserva)on
–
valuable
company
asset
–  access
control
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
8
Mo8va8on
for
databases
–  Data
is
a
very
important
asset
of
an
organiza)on
–  Mo)va)ons
for
databases
•  to
maintain
data
independent
from
applica)on
programs
•  to
avoid:
–  redundant
data
–  redundant
processes/interfaces
•  to
enable:
–  ease
of
maintenance
–  sharing
of
data
–  data
access
control
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
9
Database
approach
Application program 1
DBMS
Instructions
Data
.
.
.
Metadata
Application program n
Instructions
–  DBMS
‐
a
general
purpose sobware
•  is
self‐describing
•  contains
–  data
–  metadata
(i.e.
data
about
data)
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
10
Main
Characteris8cs
of
the
Database
Approach
• 
Self‐describing
nature
of
a
database
system:
–  A
DBMS
catalog
stores
the
descrip)on
of
a
par)cular
database
(e.g.
data
structures,
types,
and
constraints)
• 
• 
Insula8on
between
programs
and
data:
–  Called
program‐data
independence.
Data
Abstrac8on:
–  A
data
model
is
used
to
hide
storage
details
and
present
the
users
with
a
conceptual
view
of
the
database.
• 
Support
of
mul8ple
views
of
the
data:
–  Each
user
may
see
a
different
view
of
the
database,
which
describes
only
the
data
of
interest
to
that
user.
• 
Concurrent
Execu8ons
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
11
Characteris8cs
of
DBMS
–  Data
is:
•  integrated,
shared,
persistent
•  self‐describing
–  Abstrac)on
•  program
and
data
independence
–  Mul)ple
views
of
the
data
•  different
users
need
different
kinds
of
informa)on
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
12
Advantages
of
Using
the
Database
Approach
• 
Controlling
redundancy
–  Sharing
of
data
among
mul)ple
users.
• 
• 
• 
• 
• 
• 
• 
• 
Restric)ng
unauthorized
access
to
data.
Providing
persistent
storage
for
program
Objects
Providing
Storage
Structures
(e.g.
indexes)
for
efficient
Query
Processing
backup
and
recovery
services.
mul)ple
interfaces
to
different
classes
of
users.
complex
rela)onships
among
data.
integrity
constraints.
Drawing
inferences
and
ac)ons
from
the
stored
data
using
deduc)ve
and
ac)ve
rules
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
13
Addi8onal
advantages
of
the
database
approach
–  Re‐use
of
data
across
mul)ple
applica)ons
–  Data
structure
and
access
can
be
changed
without
changing
applica)ons
–  Enforcement
of
standards
and
computa)on
of
sta)s)cs
–  Improved
responsiveness,
produc)vity
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
14
Addi8onal
Implica8ons
of
Using
the
Database
Approach
Poten)al
for
enforcing
standards
Reduced
applica)on
development
)me
Flexibility
to
change
data
structures
Availability
of
current
informa)on
–  Extremely
important
for
on‐line
transac)on
systems
such
as
airline,
hotel,
car
reserva)ons.
•  Economies
of
scale
• 
• 
• 
• 
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
15
Disadvantages
of
the
database
approach
– 
– 
– 
– 
– 
10/05/12
Complexity
Size
(of
sobware
and
applica)on)
Cost
Performance
Risk
of
(spectacular!)
failures
Jim
Heasley,
Ins)turte
for
Astronomy
16
When
not
to
use
a
DBMS
•  Main
inhibitors
(costs)
of
using
a
DBMS:
–  High
ini)al
investment
and
possible
need
for
addi)onal
hardware.
–  Overhead
for
providing
generality,
security,
concurrency
control,
recovery,
and
integrity
func)ons.
•  When
a
DBMS
may
be
unnecessary:
–  If
the
database
and
applica)ons
are
simple,
well
defined,
and
not
expected
to
change.
–  If
access
to
data
by
mul)ple
users
is
not
required.
•  When
no
DBMS
may
suffice:
–  If
the
database
system
is
not
able
to
handle
the
complexity
of
data
because
of
modeling
limita)ons
–  If
the
database
users
need
special
opera)ons
not
supported
by
the
DBMS.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
17
Database
Logic
•  Opera)ons
within
the
database
are
governed
by
standard
set
theory
and
logic.
New
types
of
databases
that
are
built
upon
fuzzy
sets,
fuzzy
logic,
and
fuzzy
measure
are
currently
the
subject
of
ac)ve
research,
but
are
not
(as
yet)
widely
available.
•  The
two
key
set
opera)ons
of
interest
in
databases
are
INTERSECTION
(the
JOIN)
and
UNION
(called
the
same
in
the
DB
world).
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
18
Structured
Query
Language
•  The
user
usually
interacts
with
the
database
by
expressing
what
she/he
wants
to
accomplish
by
expressing
the
request
in
SQL.
Note
SQL
tells
the
database
what
you
want
to
do,
but
not
how
to
do
it.
•  There
are
many
helpful
tutorials
about
SQL
available
on
the
web.
An
excellent
introduc)on
is
available
at
www2.aao.gov.au/2dfgrs/Public/Release/Database/sql_intro.pdf
•  This
introduc)on
is
sufficiently
vanilla
it
will
get
you
started
despite
the
minor
varia)ons
between
different
flavors
of
SQL
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
19
The
Schema
•  The
logical
schema
defines
how
aoributes
are
assigned
to
various
tables
and
the
defini)on
of
keys
(indexes)
that
help
to
)e
tables
together.
A
user
must
have
understanding
of
the
logical
schema.
•  The
physical
schema
defines
how
the
data
tables
are
stored
on
the
physical
storage
media
(e.g.,
disks).
Generally,
users
do
not
need
to
know
the
physical
schema
although
the
system
developers
must
leverage
this
to
maximize
the
performance
of
their
system.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
20
User
Queries
•  Users
develop
queries
to
the
database
in
a
procedural
language,
usually
some
form
of
SQL,
that
builds
requests
for
informa)on
stored
in
the
databases
tables,
oben
making
use
of
internal
rela)onships
inherent
in
the
data
(e.g.,
intersec)ons
between
different
tables).
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
21
The
SQL
Select
Command
•  The
most
frequently
used
SQL
command
(by
the
typical
users)
is
the
SELECT
command.
This
is
used
to
get
(i.e.
select)
data
from
the
database
tables.
•  The
basic
syntax
of
the
SELECT
command
is
SELECT
(list
of
aoributes
you
want)
FROM
(list
of
tables
containing
them)
WHERE
(list
of
limi)ng/restric)ng
condi)ons)
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
22
What
a
Database
isn’t!
While
the
column
arrangement
of
aoributes
in
database
tables
might
remind
the
user
of
a
spreadsheet
program
like
Excel,
a
database
is
not
a
compu)ng
engine.
Further,
because
of
the
nature
of
SQL,
the
user’s
query
simply
defines
what
data
is
wanted,
not
how
to
get
it.
That
also
includes
how
the
database
may
choose
to
execute
numerical
opera)ons
the
user
embeds
in
the
query.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
23
Database
Technology
Machine
Learning
Statistics
Data Mining
Information
Science
Visualization
Other
Disciplines
DATA
MINING:
CONFLUENCE
OF
MULTIPLE
DISCIPLINES
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
24
The
purpose
of
compu)ng
is
insight,
not
numbers.
Richard
Hamming,
in
the
preface
to
his
1962
text
on
numerical
methods.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
25
What
is
Data
Mining?
•  Finding
(meaningful)
paoerns
in
data
– 
– 
– 
– 
– 
Classifica)on
Associa)on
Rules
Cluster
Analysis
Anomaly
Detec)on
Regression
•  Data
mining
tools
have
been
used
extensively
in
– 
– 
– 
– 
– 
– 
– 
10/05/12
Biology,
gene)cs,
medical
research
(Bioinforma)cs)
Business
and
Economics
Ecology
and
resource
management
Engineering
Literature
Music
Voice
and
facial
recogni)on
Jim
Heasley,
Ins)turte
for
Astronomy
26
Don’t
Re‐invent
the
Wheel!
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
27
Rela8onship
between
Databases
&
Data
Mining
•  Databases
are
oben
a
key
component
in
data
mining.
One
oben
finds
data
warehouses
providing
the
informa)on
needed
by
the
mining
tools.
•  However,
one
usually
finds
that
the
actual
data
mining
opera)ons
are
executed
outside
the
database
itself.
Databases
are
excellent
informa)on
severs
but
are
not
good
compute
engines!
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
28
Classifica8on:
Defini8on
•  Given
a
collec)on
of
records
(training
set
)
–  Each
record
contains
a
set
of
a<ributes,
one
of
the
aoributes
is
the
class.
•  Find
a
model
for
class
aoribute
as
a
func)on
of
the
values
of
other
aoributes.
•  Goal:
previously
unseen
records
should
be
assigned
a
class
as
accurately
as
possible.
–  A
test
set
is
used
to
determine
the
accuracy
of
the
model.
Usually,
the
given
data
set
is
divided
into
training
and
test
sets,
with
training
set
used
to
build
the
model
and
test
set
used
to
validate
it.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
29
Associa8on
Rule
Mining
•  Given
a
set
of
transac)ons,
find
rules
that
will
predict
the
occurrence
of
an
item
based
on
the
occurrences
of
other
items
in
the
transac)on
Market‐Basket
transac)ons
Example
of
Associa)on
Rules
{Diaper}
→
{Beer},
{Milk,
Bread}
→
{Eggs,Coke},
{Beer,
Bread}
→
{Milk},
Implica)on
means
co‐occurrence,
not
causality!
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
30
What
is
Cluster
Analysis?
•  Finding
groups
of
objects
such
that
the
objects
in
a
group
will
be
similar
(or
related)
to
one
another
and
different
from
(or
unrelated
to)
the
objects
in
other
groups
Inter-cluster
distances are
maximized
Intra-cluster
distances are
minimized
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
31
Anomaly/Outlier
Detec8on
•  What
are
anomalies/outliers?
–  The
set
of
data
points
that
are
considerably
different
than
the
remainder
of
the
data
•  Variants
of
Anomaly/Outlier
Detec)on
Problems
–  Given
a
database
D,
find
all
the
data
points
x
∈
D
with
anomaly
scores
greater
than
some
threshold
t
–  Given
a
database
D,
find
all
the
data
points
x
∈
D
having
the
top‐n
largest
anomaly
scores
f(x)
–  Given
a
database
D,
containing
mostly
normal
(but
unlabeled)
data
points,
and
a
test
point
x,
compute
the
anomaly
score
of
x
with
respect
to
D
•  Applica)ons:
–  Credit
card
fraud
detec)on,
telecommunica)on
fraud
detec)on,
network
intrusion
detec)on,
fault
detec)on
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
32
Regression
(Predic8on)
Regression
is
the
process
of
finding
a
func)on
that
describes
data
classes
for
the
purpose
of
being
able
to
predict
discrete
numerical
data
values.
Numerous
approaches
for
developing
the
desired
func)on
exist,
including
classifica)on
(IF‐THEN)
rules,
decision
trees,
mathema)cal
formulae,
or
neural
networks.
Predic)on
also
encompasses
the
iden)fica)on
of
distribu)on
trends
based
on
the
available
data.
Both
classifica)on
and
predic)on
may
need
to
be
preceded
by
relevance
analysis,
which
aoempts
to
iden)fy
those
aoributes
or
features
that
do
not
contribute
to
the
classifica)on
or
predic)on
process.
These
aoributes
can
then
be
excluded
from
the
analysis.
A
common
relevance
analysis
technique
is
principal
component
analysis.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
33
Machine
Learning
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
34
Data
Mining
Environments
There
are
a
large
number
of
data
mining
sobware
packages
available,
both
commercial
and
open
source.
A
search
of
the
internet
can
quickly
iden)fy
these.
A
comprehensive
review
of
these
packages
is
far
beyond
the
scope
of
what
we
can
deal
with
in
this
talk,
so
I
will
restrict
my
comments
here
to
several
well‐known
packages
used
for
data
analysis
and
mining:
the
R
sta)s)cal
analysis
package,
Matlab
(and
the
open
source
work‐alike
Octave),
and
data
mining
packages
Weka
and
Scikits.Learn.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
35
•  The
R
Project
for
Sta8s8cal
Compu8ng
www.r‐project.org/
•  R,
also
called
GNU
S,
is
a
strongly
func)onal
language
and
environment
to
sta)s)cally
explore
data
sets,
make
many
graphical
displays
of
data.
Very
strong
sta)sical
tools.
•  The
basic
system
has
been
greatly
expanded
by
the
addi)on
of
packages
developed
by
its
user
community
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
36
Matlab
(Octave)
•  MATLAB,
a
commercial
product
from
MathWorks,
is
a
high‐level
technical
compu)ng
language
and
interac)ve
environment
for
algorithm
development,
data
visualiza)on,
data
analysis,
and
numerical
modeling.
hop://www.mathworks.com/products/matlab/
•  GNU
Octave
is
a
high‐level
interpreted
language,
primarily
intended
for
numerical
computa)ons.
It
is
ian
open
source
work‐alike
version
of
MATLAB.
hop://www.gnu.org/sobware/octave/
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
37
Weka
(Waikato
Environment
for
Knowledge
Analysis)
is
a
well‐known
suite
of
machine
learning
sobware
that
supports
several
typical
data
mining
tasks,
par)cularly
data
preprocessing,
clustering,
classifica)on,
regression,
visualiza)on,
and
feature
selec)on.
Its
techniques
are
based
on
the
hypothesis
that
the
data
is
available
as
a
single
flat
file
or
rela)on,
where
each
data
point
is
labeled
by
a
fixed
number
of
aoributes.
Weka
provides
access
to
SQL
databases
u)lizing
Java
Database
Connec)vity
and
can
process
the
result
returned
by
a
database
query.
Its
main
user
interface
is
the
Explorer,
but
the
same
func)onality
can
be
accessed
from
the
command
line
or
through
the
component‐based
Knowledge
Flow
interface.
hop://www.cs.waikato.ac.nz/~ml/weka/
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
38
scikit‐learn
is
a
Python
module
integra)ng
classic
machine
learning
algorithms
in
the
)ghtly‐knit
scien)fic
Python
world
(numpy,
scipy,
matplotlib).
It
aims
to
provide
simple
and
efficient
solu)ons
to
learning
problems,
accessible
to
everybody
and
reusable
in
various
contexts:
machine‐learning
as
a
versa)le
tool
for
science
and
engineering.
Tools
are
available
for
supervised
&
unsupervised
learning,
model
selec)on,
datasets,
feature
extrac)on.
hop://scikit‐learn.org/stable/
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
39
Pluses,
Minuses,
Observa8ons
The
R
and
Weka
sobware
both
have
a
large
community
which
contributes
to
extending
their
func)onality
through
the
development
of
new
add‐on
packages.
Further
R
and
Weka
can
be
interfaced
via
the
RWeka
package.
There
are
many
excellent
on‐line
tutorials
for
these
packages,
and
Weka
itself
is
well
described
in
the
text
Data
Mining
–
PracBcal
Machine
Learning
Tools
and
Techniques
by
Wioen,
Frank,
&
Hall.
This
text
provides
both
a
good
underpinning
of
the
methods
and
prac)cal
tutorial
informa)on.
(The
text
is
available
as
an
e‐book.)
Scikits.learn,
while
s)ll
fairly
new
(current
release
is
version
0.7),
has
a
very
impressive
collec)on
of
tools
and
an
extensive
user
guide.
The
sobware
is
wrioen
in
Python.
My
main
reserva)on
about
this
sobware
is
that
while
the
user
guide
presents
many
examples,
there
is
an
implicit
assump)on
that
the
user
knows
a
great
deal
about
the
field
of
data
mining.
This
may
leave
the
new
user
somewhat
in
over
their
head
in
trying
to
determine
exactly
which
tool
best
serves
their
need.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
40
EXAMPLES
–
IMAGINARY
&
REAL
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
41
How
could
we
have
helped
this
lady?
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
42
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
43
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
44
Or
these
gentlemen?
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
45
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
46
Or
him?
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
47
Pan‐STARRS
Opportuni8es
•  The
PS1
Small
Area
Survey
(SAS),
covering
an
area
of
81
deg2,
overlaps
with
the
SDSS
Stripe
82.
In
addi)on
to
the
deep
Stripe
82
database,
the
images
from
this
region
have
been
examined
by
the
Ci)zen
Science
team
known
as
the
Galaxy
Zoo.
This
interes)ng
overlap
of
resources
provides
data
for
some
exci)ng
data
mining
experiments.
•  Star‐Galaxy
classifica)on
(or
more
precisely,
Star‐Galaxy‐QSO
classifica)on)
is
an
on‐going
challenge
for
the
PS1
science
teams.
While
this
work
has
been
reasonably
successful,
the
efforts
thus
far
seem
to
have
aoempted
to
get
by
with
the
simplest
possible
classifica)on
approach.
What
might
happen
if
we
performed
a
classifica)on
exercise
wherein
we
use
a
wide
range
of
IPP
measurements
(e.g.,
psf,
Kron,
Petrosian
magnitude,
Petrosian
radii,
various
moments
measured
in
individual
frames
and
stack)
with
SDSS
and
Galaxy
Zoo
data
providing
classifica)on
“truth?”
•  A
similar
analysis,
using
visual
inspec)on
of
the
images
to
iden)fy
ar)facts
in
the
PS1
images
and/or
stacks,
might
provide
a
robust
garbage
rejec)on
process.
Not
necessarily
glamorous
but
definitely
important.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
48
Empirical
Photo‐Z
Methods
• 
• 
• 
• 
• 
• 
• 
• 
• 
Ar)ficial
Neural
Networks
Support
Vector
Machines
Self‐Organizing
Maps
Gaussian
Process
Regression
Kernel
Regression
Linear/Nonlinear
polynomial
fixng
Instance
Based
Learning
&
Nearest
Neighbors
Boosted
Decision
Trees
Regression
Trees
And
these
are
just
the
ones
I’ve
found
so
far!
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
49
Galaxy
Clusters?
•  We
all
know
the
best
way
to
iden)fy
clusters
of
galaxies
is
from
their
x‐ray
emission.
Unfortunately,
current
x‐ray
surveys
don’t
provide
sufficient
sky
&
depth
coverage
to
do
this.
•  Op)cal
surveys
have
sufficient
depth
but
suffer
from
background
issues,
overlapping
foreground
&
background
clusters,
etc.
•  It
has
long
been
hoped
that
in
large
scale
op)cal
surveys
such
as
Pan‐STARRS
and
LSST,
we
will
be
able
to
use
Photo‐
Z
values
to
sort
out
real
clusters
from
accidental
clustering
of
galaxies,
and
overlapping
clusters
at
different
distances.
(Some
of
the
PS1
partners
in
Taiwan
are
working
on
this
problem.)
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
50
Galaxy
Clusters
–
Can
Data
Mining
Help?
•  While
there
is
a
plethora
of
data
mining
techniques
for
finding
clusters
within
data,
most
are
probably
not
well
suited
for
finding
galaxy
clusters.
Many
methods
start
off
by
assuming
that
in
a
given
region
that
one
knows
how
many
clusters
are
present.
Clearly
this
is
not
the
case
with
our
problem.
Further,
we
need
to
deal
with
the
fact
that
in
the
3‐D
representa)on,
we
have
much
larger
uncertainty
along
the
line
of
sight
due
to
the
accuracy
of
the
Photo‐Z
measures.
•  Some
interes)ng
work
in
this
area
has
made
use
of
a
friend‐of‐friends
approach.
I
think
this
could
be
generalized
to
include
beoer
background
discrimina)on
including
the
Photo‐Z
distribu)on.
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
51
PAU
10/05/12
Jim
Heasley,
Ins)turte
for
Astronomy
52