The 36th Annual Conference of
the German Classification Society (GfKl) on
Data Analysis, Machine Learning and
Knowledge Discovery
University of Hildesheim, Germany
August 1-3, 2012
Program & Abstracts
Photo title page: © Market place of Hildesheim (Hildesheim Marketing, Photographer: Obornik)
Preface
Message from the GfKl 2012 Chairs
We would like to cordially welcome you to the 36th Annual Conference of the
German Classification Society, taking place in Hildesheim, Germany.
The GfKl has become 36 years young. Over these years, we have seen the core topics
of the conference crystallize into thematic areas. This year, for the first
time, these areas were made explicit and their coordination was undertaken by dedicated Area Chairs. We are proud to host six Areas:
• Statistics and Data Analysis (SDA),
organized by Hans-Hermann Bock, Christian Hennig and Claus Weihs
• Machine Learning and Knowledge Discovery (MLKD),
organized by Lars Schmidt-Thieme and Myra Spiliopoulou
• Data Analysis and Classification in Marketing (DACMar),
organized by Daniel Baier and Reinhold Decker
• Data Analysis in Finance (DAFin),
organized by Michael Hanke and Krzysztof Jajuga
• Biostatistics and Bioinformatics,
organized by Anne-Laure Boulesteix and Hans Kestler
• Interdisciplinary Domains (InterDom),
organized by Andreas Hadjar, Sabine Krolak-Schwerdt and Claus Weihs
• and the Workshop on Library and Information Science (LIS’2012),
organized by Frank Scholze
Obviously, the subjects accommodated in the six areas are not sharply separated.
We have fostered links and interactions among them, culminating in the three plenary and six semi-plenary talks, and in the three invited sessions that cover research
advances spanning more than one area. Our invited sessions are:
• Ensemble Methods in Clustering and Classification,
organized by Berthold Lausen
• Applications in Empirical Educational Research Based on Secondary Data,
organized by Sabine Krolak-Schwerdt and Alexandra Schwarz
• Dynamic Cluster Analysis - Theory and Practice,
organized by Jozef Pociecha under the auspices of the Polish Classification Society
In addition to the plenary and semi-plenary talks, our scientific program accommodates
130 contributions, 16 of them in the LIS workshop. As expected, the lion's share
of the contributions comes from Germany, followed by Poland, but we have
contributions from all over the world, stretching from Portugal to Ukraine and
from Canada and the USA to Japan and Thailand.
Organizing such a conference, with its parallel, interleaved events, is not an easy
task. It requires the coordination of many individuals on many issues, and depends on
the tremendous effort of engaged scientists and of the dedicated teams in Hildesheim
and Magdeburg. We would like to thank the Area Chairs for their hard work in advertising the
conference, recruiting authors and evaluating submissions, and the three
Chairs of the invited sessions for winning renowned presenters to complement the
main areas with inter-area subjects of major interest. We are particularly indebted to
the Polish Classification Society for its involvement and presence at GfKl 2012.
We are proud to announce the best paper awards of GfKl 2011. This year, we
have honoured two papers:
1. Sarah Frost and Daniel Baier (University of Cottbus) elaborate on the performance of the Earth Mover's Distance for image clustering; and
2. Florent Domenach and Ali Tayari (University of Nicosia) discuss implications of
axiomatic consensus properties.
We would like to congratulate the prize winners and to thank the Best
Paper Awards Jury members H.H. Bock, R. Decker, B. Lausen, A. Ultsch, and C.
Weihs for their excellent work. The awarded papers appear as part of the GfKl 2011
post-conference proceedings.
We would like to thank the EasyChair GfKl 2012 administrator, Miriam Tödten
(master's student in the 'Data & Knowledge Engineering' program at the Otto-von-Guericke University Magdeburg), for her tireless troubleshooting and assistance during submission, evaluation and camera-ready preparation, and for her contribution to the abstracts volume; and Silke Reifgerste, the financial administrator of
the KMD research lab at the Otto-von-Guericke University Magdeburg, for her fast
and competent handling of all financial matters concerning the Magdeburg team.
Further, we would like to thank Kerstin Hinze-Melching (University of Hildesheim)
for her help with the local organization; Jörg Striewski and Uwe Oppermann, our
technicians in Hildesheim, for technical assistance; Selma Batur (master's student at
the University of Hildesheim) for her help with the abstract proceedings and the
preparation of the conference; and our conference assistants in Hildesheim: Fabian Brandes,
Christian Brauch, Lenja Busch, Sarina Flemnitz, Sophia Graefe, Stephan Reller,
Nicole Reuss and Kai Wedig.
GfKl 2012 does not end with the conference in Hildesheim. In keeping with our
long tradition of post-conference proceedings, we will open the EasyChair GfKl
2012 site again in August 2012 and invite the conference authors to submit the
full versions of their work for the peer-reviewing phase scheduled for fall 2012. Accepted papers will be published by Springer.
We wish you a productive, inspiring conference and a pleasant stay in Hildesheim!
Hildesheim,
August 2012
Lars Schmidt-Thieme and Ruth Janning,
GfKl 2012 Local Organizers
Myra Spiliopoulou and Lars Schmidt-Thieme,
GfKl 2012 Program Chairs
Claus Weihs, President of the GfKl
Sponsors
We thank our sponsors:
Information Systems and Machine Learning Lab (ISMLL)
Stiftung Universität Hildesheim
Microsoft
Conference Location
GfKl 2012 is hosted by the University of Hildesheim.
The conference location is the Domäne Marienburg.
The address is:
Domäne Marienburg
Domänenstraße
31141 Hildesheim
Map of the buildings of Domäne Marienburg
Program Committee Chairs
Myra Spiliopoulou, Otto-von-Guericke-Univ. Magdeburg, Germany
Lars Schmidt-Thieme, Univ. Hildesheim, Germany
Local Organizers
Lars Schmidt-Thieme, Univ. Hildesheim, Germany
Ruth Janning, Univ. Hildesheim, Germany
Scientific Program Committee
AREA Machine Learning and Knowledge Discovery (MLKD)
Myra Spiliopoulou, Otto-von-Guericke-Univ. Magdeburg, Germany (Area Chair)
Lars Schmidt-Thieme, Univ. Hildesheim, Germany (Area Chair)
Martin Atzmueller, Univ. Kassel, Germany
Eirini Ntoutsi, Ludwig-Maximilians-Univ. Munich, Germany
Georg Krempl, Otto-von-Guericke-Univ. Magdeburg, Germany
Joao Gama, Univ. Porto, Portugal
Eyke Hüllermeier, Univ. Marburg, Germany
Thomas Seidl, RWTH Aachen, Germany
Andreas Hotho, Univ. Wuerzburg, Germany
AREA Statistics and Data Analysis
Claus Weihs, TU Dortmund, Germany (Area Chair)
Hans-Hermann Bock, RWTH Aachen, Germany (Area Chair)
Christian Hennig, Univ. College London, UK (Area Chair)
Bettina Gruen, Johannes Kepler Univ. Linz, Austria
Patrick Groenen, Erasmus Univ. Rotterdam, Netherlands
AREA Data Analysis and Classification in Marketing
Daniel Baier, BTU Cottbus, Germany (Area Chair)
Reinhold Decker, Univ. Bielefeld, Germany (Area Chair)
AREA Data Analysis in Finance
Krzysztof Jajuga, Wroclaw Univ. of Economics, Poland (Area Chair)
Michael Hanke, Univ. of Liechtenstein, Liechtenstein (Area Chair)
AREA Data Analysis in Biostatistics and Bioinformatics
Anne-Laure Boulesteix, Ludwig Maximilian Univ. Munich, Germany (Area Chair)
Hans Kestler, Univ. Ulm, Germany (Area Chair)
Harald Binder, Univ. Mainz, Germany
Matthias Schmid, Univ. Erlangen, Germany
Friedhelm Schwenker, Univ. Ulm, Germany
AREA Data Analysis in Interdisciplinary Domains
Sabine Krolak-Schwerdt, Univ. Luxembourg, Luxembourg (Area Chair)
Claus Weihs, TU Dortmund, Germany (Area Chair)
Andreas Hadjar, Univ. Luxembourg, Luxembourg
Irmela Herzog, LVR, Bonn, Germany
Florian Klapproth, Univ. Luxembourg, Luxembourg
Hans-Joachim Mucha, WIAS, Berlin, Germany
LIS’2012
Frank Scholze, KIT Karlsruhe, Germany (Chair)
Stefan Gradmann, HU Berlin, Germany
Heidrun Wiesenmüller, HDM Stuttgart, Germany
Ewald Brahms, Univ. Hildesheim, Germany
Michael Mönnich, KIT Karlsruhe, Germany
Bernd Lorenz, FHÖV Munich, Germany
Hans-Joachim Hermes, TU Chemnitz, Germany
Andreas Geyer-Schulz, KIT Karlsruhe, Germany
Social Program
Tuesday, July 31
20:00 h Informal come-together in Knochenhauer Amtshaus
(http://www.knochenhaueramtshaus.com/) in the city center at the market place
Wednesday, August 1
16:30 h Guided tour of UB Hildesheim
17:30 h Guided tour of Dombibliothek
18:00 h Guided City tour of Hildesheim (German)
18:00 h Guided City tour of Hildesheim (English)
20:00 h Reception in the city hall at the market place
Greeting: Ruth Seefels, Mayor of Hildesheim, Claus Weihs, President of GfKl
Thursday, August 2
20:15 h Conference dinner in the Novotel
(Bahnhofsallee 38, 31134 Hildesheim, admittance: 19:30)
© City hall of Hildesheim (Photo: Hildesheim Marketing)
Invited Speakers
Wolfgang Gaul
Where Data Analysis meets Graph Theory
(01.08, 10:00-10:45, Building: HS 52, Room: 001 (Theater))
Katsutoshi Yada
Knowledge Discovery in Shopping Path Data
(01.08, 13:30-14:15, Building: HS 52, Room: 001 (Theater))
Thomas Seidl
Stream Data Mining and Anytime Algorithms
(01.08, 13:30-14:15, Building: HS 27, Room 003)
Joao Gama
Data Stream Mining for Ubiquitous Environments
(02.08, 09:00-09:45, Building: HS 52, Room: 001 (Theater))
Michele Sebag
Autonomous Robotics: Defining Instincts and Learning Systems of Values
(02.08, 14:00-14:45, Building: HS 52, Room: 001 (Theater))
Alex Weissensteiner
Arbitrage-Free Scenario Trees for Financial Optimization
(02.08, 14:00-14:45, Building: HS 27, Room 003)
Hillol Kargupta
Connected Cars, Machine-to-Machine Environments, and
Distributed Data Mining
(03.08, 09:00-09:45, Building: HS 52, Room: 001 (Theater))
Dirk Van den Poel
On the value of incorporating sequential information into
predictive analytics classification models for analytical CRM
(03.08, 09:00-09:45, Building: HS 27, Room 003)
Shai Ben-David
Universal Learning vs. No Free Lunch results - can there
be learners that do not require task-specific knowledge?
(03.08, 13:15-14:00, Building: HS 52, Room: 001 (Theater))
GfKl 2013
The next annual conference of the German Classification Society, GfKl 2013, will
take place in Luxembourg from July 10 to 13, 2013, under the title
European Conference on Data Analysis
Scientific Program Committee:
Prof. Dr. Dirk Van den Poel (Ghent University), Chair
Local Organizers:
Prof. Dr. Sabine Krolak-Schwerdt and Dr. Matthias Böhmer, Luxembourg
Program
Scientific Program of GfKl 2012 (Overview)
Tuesday, July 31, 2012
Social events
20:00 Informal come-together in Knochenhauer Amtshaus
(http://www.knochenhaueramtshaus.com/) in the city center at the market place
Wednesday, August 01, 2012
08:15 Registration (Building: HS 52, Foyer)
09:00 Opening (Building: HS 52, Room: 001 (Theater))
– Welcome by Prof. Dr. Wolfgang-Uwe Friedrich (President of the University
of Hildesheim)
– Welcome by Prof. Dr. Martin Sauerwein (Dean of the Faculty of Mathematics,
Natural Sciences, Economics and Computer Science, University of Hildesheim)
– Welcome and best paper awards by Prof. Dr. Claus Weihs (President of the
GfKl):
· Best Paper Award 2011 - methods: "Implications of Axiomatic Consensus
Properties", Florent Domenach and Ali Tayari (Department of Computer
Science, University of Nicosia)
· Best Paper Award 2011 - application: "Comparing Earth Mover's Distance
and its Approximations for Clustering Images", Sarah Frost and Daniel
Baier (Institute of Business Administration and Economics, Brandenburg
University of Technology Cottbus)
– Welcome by Prof. Dr. Myra Spiliopoulou (Program Chair)
– Welcome by Prof. Dr. Dr. Lars Schmidt-Thieme (Local Organizer)
10:00 Opening Plenary (Building: HS 52, Room: 001 (Theater)),
Wolfgang Gaul: Where Data Analysis meets Graph Theory (4),
Chair: L. Schmidt-Thieme
Coffee break 10:45 – 11:15
HS 52, Room 001 (Statistics & Data Analysis: Clustering 1, Chair: H. Bock)
11:15 Alexandrovich (30), 11:40 Tanioka (52), 12:05 Ayale (36)
HS 1, Room 007 (Data Analysis & Classification in Marketing, Chair: S. Voekler)
11:15 Baier (56), 11:40 Rese (64)
HS 52, Room 101 (Data Analysis in Finance, Chair: K. Jajuga)
11:15 Vogt (83), 11:40 Müller (79), 12:05 Feldman (75)
HS 27, Room 003 (Machine Learning & Knowledge Discovery: Recommenders & Multi-Criteria Optimization, Chair: L. Schmidt-Thieme)
11:15 Symeonidis (100), 11:40 Ntoutsi (95), 12:05 Cheng (90)
HS 2a, Room 004 (Interdisciplinary Domains: Education, Chair: S. Krolak-Schwerdt)
11:15 Trendtel (129), 11:40 Kasper (119)
Lunch break 12:30 – 13:30
13:30 Semi Plenary (Building: HS 52, Room: 001 (Theater)),
Katsutoshi Yada: Knowledge Discovery in Shopping Path Data (10),
Chair: D. Baier
13:30 Semi Plenary (Building: HS 27, Room 003),
Thomas Seidl: Stream Data Mining and Anytime Algorithms (9), Chair: C. Weihs
Break 14:15 – 14:30
HS 52, Room 001 (Statistics & Data Analysis: Classification 1, Chair: J. Schiffner)
14:30 Bischl (33), 14:55 Takai (51), 15:20 Lange (42)
HS 1, Room 007 (Data Analysis & Classification in Marketing, Chair: A. Sänn)
14:30 Bąk (59), 14:55 Rumstadt (65), 15:20 Voekler (70)
HS 52, Room 101 (Data Analysis in Finance, Chair: K. Jajuga)
14:30 Piontek (81), 14:55 Nagy (80), 15:20 Kaszuba (78)
HS 27, Room 003 (Machine Learning & Knowledge Discovery: Streams, Chair: C. Weihs)
14:30 Bolanos (88), 14:55 Tödten (101), 15:20 Matuszyk (93)
HS 2a, Room 004 (Interdisciplinary Domains: Psychology, Chair: F. Klapproth)
14:30 Hahn (114), 14:55 Hörstermann (118), 15:20 Geyer-Schulz (113)
Coffee break 15:45 – 16:15
HS 52, Room 001 (Statistics & Data Analysis: Statistics 1, Chair: F. Schwaiger)
16:15 Beige (31), 16:40 Voigt (53), 17:05 Joenssen (40)
HS 1, Room 007 (Statistics & Data Analysis: Statistics in Economics, Chair: A. Rybicka)
16:15 Brzezinska (34), 16:40 Biron (32), 17:05 Jefmanski (39)
HS 52, Room 101 (Data Analysis in Finance, Chair: M. Hanke)
16:15 Bessler (72), 16:40 Rutkowska-Ziarko (82), 17:05 Garsztka (76)
HS 27, Room 003 (Machine Learning & Knowledge Discovery: Clustering)
16:15 Mouysset (94), 16:40 Pelka (96), 17:05 Völkel (103)
HS 2a, Room 004 (Invited Session: Applications in Empirical Educational Research Based on Secondary Data, Chair: A. Schwarz)
16:15 Schwarz (19), 16:40 Makles (18), 17:05 Schwarz (127)
Social events
16:30 Guided tour of UB Hildesheim
17:30 Guided tour of Dombibliothek
18:00 Guided City tour of Hildesheim (German)
18:00 Guided City tour of Hildesheim (English)
20:00 Reception in the city hall at the market place (Greeting: Ruth Seefels,
Mayor of Hildesheim, Claus Weihs, President of GfKl)
Thursday, August 02, 2012
08:15 Registration (Building: HS 52, Foyer)
09:00 Plenary (Building: HS 52, Room: 001 (Theater)),
João Gama: Data Stream Mining for Ubiquitous Environments (3),
Chair: M. Spiliopoulou
Break 09:45 – 10:00
HS 52, Room 001 (Statistics & Data Analysis: Factor analysis, Chair: C. Hennig)
10:00 Schoonees (49), 10:50 Mucha (47)
HS 1, Room 007 (Biostatistics & Bioinformatics, Chair: H. Kestler)
10:00 Potapov (138), 10:25 Schmid (139), 10:50 Matuszyk (136)
HS 27, Room 003 (Invited Session: Dynamic Cluster Analysis - Theory and Practice, Chair: J. Pociecha)
10:00 Bock (22), 10:25 Najman (24), 10:50 Lula (23)
HS 2a, Room 004 (Interdisciplinary Domains: Music 1, Chair: C. Weihs)
10:00 Dittmar (111), 10:25 Hillewaere (117), 10:50 Bauer (108)
Coffee break 11:15 – 11:30
HS 52, Room 001 (Statistics & Data Analysis: Classification 2, Chair: B. Bischl)
11:30 Meyer (45), 11:55 Lange (43), 12:20 Schwenker (50), 12:40 Nguyen (124)
HS 1, Room 007 (Biostatistics & Bioinformatics, Chair: H. Kestler)
11:30 Heider (135), 11:55 Burkovski (134), 12:20 Maucher (137)
HS 27, Room 003 (Invited Session: Dynamic Cluster Analysis - Theory and Practice, Chair: J. Pociecha)
11:30 Sokolowski (25), 11:55 Voekler (27), 12:20 Stanimir (26)
HS 2a, Room 004 (Interdisciplinary Domains: Music 2, Chair: C. Weihs)
11:30 Vatolkin (130), 11:55 Eichhoff (112), 12:20 Lukashevich (123), 12:45 Krey (120)
12:45 Meeting AG Biostatistik (HS 1, Room 007)
13:30 Meeting AG DA-NK (HS 52, Room 101)
Lunch break 12:45 – 14:00
14:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)),
Michele Sebag: Autonomous Robotics: Defining Instincts and Learning Systems
of Values (8), Chair: W. Gaul
14:00 Semi Plenary (Building: HS 27, Room 003),
Alex Weissensteiner: Arbitrage-Free Scenario Trees for Financial Optimization
(5), Chair: K. Jajuga
Break 14:45 – 14:55
HS 52, Room 001 (Statistics & Data Analysis: Clustering 2, Chair: Ritter)
14:55 Wilk (54), 15:20 Schwaiger (38), 15:45 Hennig (37)
HS 1, Room 007 (Data Analysis & Classification in Marketing, Chair: R. Decker)
14:55 Steiner (67), 15:20 Lichtenthäler (61), 15:45 Tuma (69)
HS 52, Room 101 (Data Analysis in Finance, Chair: M. Hanke)
14:55 Geyer (77), 15:20 Bohlmann (74)
HS 27, Room 003 (Machine Learning & Knowledge Discovery: Classification & Ensembles)
14:55 Schwenker (97), 15:20 Senge (98), 15:45 Vatolkin (102)
Coffee break 16:10 – 16:40
HS 52, Room 001 (Statistics & Data Analysis: Model selection, Chair: C. Weihs)
16:40 Liebscher (44), 17:05 Mucha (46)
HS 1, Room 007 (Data Analysis & Classification in Marketing, Chair: R. Decker)
16:40 Minke (62), 17:05 Bak (57)
HS 2a, Room 004 (Interdisciplinary Domains: Language & Education, Chair: S. Krolak-Schwerdt)
16:40 Beica (109), 17:05 Nisioi (125), 17:30 Ünlü (131)
Break 17:30 – 18:00
18:00 General meeting of the German Classification Society
(Building: HS 52, Room: 001 (Theater), End: 19:30)
Social events
20:15 Conference dinner in the Novotel (Bahnhofsallee 38, 31134 Hildesheim,
admittance: 19:30)
Friday, August 03, 2012
08:15 Registration (Building: HS 52, Foyer)
09:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)),
Hillol Kargupta: Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining (6), Chair: L. Schmidt-Thieme
09:00 Semi Plenary (Building: HS 27, Room 003),
Dirk Van den Poel: On the value of incorporating sequential information into
predictive analytics classification models for analytical CRM (7),
Chair: W. Steiner
Break 09:45 – 09:55
HS 52, Room 001 (Statistics & Data Analysis: Applications, Chair: H. Mucha)
09:55 Santos (48), 10:20 Klapproth (41), 10:45 Carvalho (35)
HS 1, Room 007 (Data Analysis & Classification in Marketing, Chair: D. Van den Poel)
09:55 Kottemann (60), 10:20 Ballings (58), 10:45 Paetz (63)
HS 52, Room 101 (Machine Learning & Knowledge Discovery: Opinions and marketing)
09:55 Ahn (86), 10:20 Sinelnikova (99), 10:45 Wagner (104)
HS 27, Room 003 (Machine Learning & Knowledge Discovery: Distributed and Temporal Data Analysis, Chair: J. Gama)
09:55 Khan (92), 10:20 Dávid (91), 10:45 Bakhtyar (87)
HS 2a, Room 004 (Interdisciplinary Domains: Quality, Chair: C. Hennig)
09:55 Thorleuchter (128), 10:20 Hildebrand (116), 10:45 Rozkrut (126)
Coffee break 11:10 – 11:25
HS 52, Room 001 (Invited Session: Ensemble methods in clustering and classification, Chair: B. Lausen)
11:25 Ziegler (15), 11:50 Binder (13), 12:15 Janitza (14), 12:40 Adler (12)
HS 1, Room 007 (Data Analysis & Classification in Marketing, Chair: A. Rese)
11:25 Sänn (68), 11:50 Selka (66)
HS 27, Room 003 (Machine Learning & Knowledge Discovery: Social networks, Chair: P. Symeonidis)
11:25 Yakoubi (106), 11:50 Wartena (105), 12:40 Buza (89)
HS 2a, Room 004 (Interdisciplinary Domains: Maps & Images, Chair: I. Herzog)
11:25 Busche (110), 11:50 Herzog (115), 12:15 Loidl (121)
Break 13:05 – 13:15
13:15 Closing Plenary (Building: HS 52, Room: 001 (Theater)),
Shai Ben-David: Universal Learning vs. No Free Lunch results - can there be
learners that do not require task-specific knowledge? (2), Chair: C. Hennig
14:00 Farewell (Beverages/snacks, Building: HS 52, Foyer)
End 14:30
Full Scientific Program of GfKl 2012
Tuesday, July 31, 2012
Social events
20:00 Informal come-together in Knochenhauer Amtshaus
(http://www.knochenhaueramtshaus.com/) in the city center at the market place
Wednesday, August 01, 2012
08:15 Registration (Building: HS 52, Foyer)
09:00 Opening (Building: HS 52, Room: 001 (Theater))
– Welcome by Prof. Dr. Wolfgang-Uwe Friedrich (President of the University
of Hildesheim)
– Welcome by Prof. Dr. Martin Sauerwein (Dean of the Faculty of Mathematics,
Natural Sciences, Economics and Computer Science, University of Hildesheim)
– Welcome and best paper awards by Prof. Dr. Claus Weihs (President of the
GfKl):
· Best Paper Award 2011 - methods: "Implications of Axiomatic Consensus
Properties", Florent Domenach and Ali Tayari (Department of Computer
Science, University of Nicosia)
· Best Paper Award 2011 - application: "Comparing Earth Mover's Distance
and its Approximations for Clustering Images", Sarah Frost and Daniel
Baier (Institute of Business Administration and Economics, Brandenburg
University of Technology Cottbus)
– Welcome by Prof. Dr. Myra Spiliopoulou (Program Chair)
– Welcome by Prof. Dr. Dr. Lars Schmidt-Thieme (Local Organizer)
10:00 Opening Plenary (Building: HS 52, Room: 001 (Theater)),
Wolfgang Gaul: Where Data Analysis meets Graph Theory (4),
Chair: L. Schmidt-Thieme
Coffee break 10:45 – 11:15
Statistics and Data Analysis: Clustering 1 (HS 52, Room 001)
Chair: H. Bock
11:15 Grigory Alexandrovich: An exact Newton’s method for ML estimation in a
penalized Gaussian mixture model (30)
11:40 Kensuke Tanioka and Hiroshi Yadohisa: Three-way Subspace Hierarchical
Clustering based on Entropy Regularization Method (52)
12:05 Daher Ayale and Dhorne Thierry: Geographic clustering through aggregation control (36)
Data Analysis and Classification in Marketing (HS 1, Room 007)
Chair: S. Voekler
11:15 Daniel Baier, Wolfgang Polasek and Alexandra Rese: Spatial Modeling of
Dependencies Between Population, Education, and Economic Growth (56)
11:40 Alexandra Rese, Hans-Georg Gemünden and Daniel Baier: Rasch Models
for Analyzing Role Models in Inter-Organisational Innovation Processes (64)
Data Analysis in Finance (HS 52, Room 101)
Chair: K. Jajuga
11:15 Jonas Vogt: Sovereign Credit Spreads During the European Fiscal Crisis
(83)
11:40 Marlene Müller: Using generalized additive models to fit credit rating
scores (79)
12:05 Lukasz Feldman, Radoslaw Pietrzyk and Pawel Rokita: A practical method
of determining longevity and premature-death risk aversion in households and
some proposals of its application (75)
Machine Learning and Knowledge Discovery: Recommenders and
Multi-Criteria Optimization (HS 27, Room 003)
Chair: L. Schmidt-Thieme
11:15 Panagiotis Symeonidis: Recommendations in Time Evolving Multi-modal
Social Networks (100)
11:40 Eirini Ntoutsi, Kostas Stefanidis, Kjetil Norvag and Hans-Peter Kriegel:
gRecs: A collaborative filtering framework for group recommendations (95)
12:05 Weiwei Cheng and Eyke Hüllermeier: Label Ranking with Abstention:
Learning to Predict Partial Orders (90)
Interdisciplinary Domains: Education (HS 2a, Room 004)
Chair: S. Krolak-Schwerdt
11:15 Matthias Trendtel and Ali Ünlü: Using Latent Class Models with Random
Effects for Investigating Local Dependence (129)
11:40 Daniel Kasper and Ali Ünlü: Sensitivity Analyses for the Rasch Model (119)
Lunch break 12:30 – 13:30
13:30 Semi Plenary (Building: HS 52, Room: 001 (Theater)),
Katsutoshi Yada: Knowledge Discovery in Shopping Path Data (10),
Chair: D. Baier
13:30 Semi Plenary (Building: HS 27, Room 003),
Thomas Seidl: Stream Data Mining and Anytime Algorithms (9), Chair: C. Weihs
Break 14:15 – 14:30
Statistics and Data Analysis: Classification 1 (HS 52, Room 001)
Chair: J. Schiffner
14:30 Bernd Bischl, Julia Schiffner and Claus Weihs: Benchmarking classification
algorithms on high-performance computing clusters (33)
14:55 Keiji Takai and Kenichi Hayashi: Effects of Labeling Mechanisms on Classification Error in Linear Discriminant Analysis (51)
15:20 Tatjana Lange, Karl Mosler and Pavlo Mozharovskyi: DDα-classification
of asymmetric and fat-tailed data (42)
Data Analysis and Classification in Marketing (HS 1, Room 007)
Chair: A. Sänn
14:30 Andrzej Bąk and Tomasz Bartłomowicz: Microeconometrics Multinomial
Models and their Applications in Preferences Analysis using R (59)
14:55 Susanne Rumstadt and Daniel Baier: Variable Weighting and Selection Approaches for Market Segmentation: A Comparison (65)
15:20 Sascha Voekler and Daniel Baier: Solving Product Line Design Optimization Problems using Stochastic Programming (70)
Data Analysis in Finance (HS 52, Room 101)
Chair: K. Jajuga
14:30 Krzysztof Piontek: Value-at-Risk Backtesting Procedures Based on the Loss
Functions - Simulation Analysis of the Power of Tests (81)
14:55 Gabor I. Nagy and Krisztian Buza: Clustering Algorithms for Storage of
Tick Data (80)
15:20 Bartosz Kaszuba: Correlation of outliers in multivariate data (78)
Machine Learning and Knowledge Discovery: Streams (HS 27, Room 003)
Chair: C. Weihs
14:30 Matthew Bolanos, John Forrest and Michael Hahsler: A Study of the Efficiency and Accuracy of Data Stream Clustering for Large Data Sets (88)
14:55 Miriam Tödten, Zaigham Faraz Siddiqui and Myra Spiliopoulou: A Lightweight
CVFDT Classifier for Streams with Concept Drift (101)
15:20 Pawel Matuszyk: Framework for Storing and Processing Relational Entities in a Data Stream (93)
Interdisciplinary Domains: Psychology (HS 2a, Room 004)
Chair: F. Klapproth
14:30 Sonja Hahn: ANOVA and Alternatives for Causal Inferences (114)
14:55 Thomas Hörstermann and Sabine Krolak-Schwerdt: Comparing regression
approaches in modelling (non-)compensatory judgment formation (118)
15:20 Andreas Geyer-Schulz, Jonas Kunze and Andreas Sonnenbichler: Learning
in groups and exam performance (113)
Coffee break 15:45 – 16:15
Statistics and Data Analysis: Statistics 1 (HS 52, Room 001)
Chair: F. Schwaiger
16:15 Tim Beige, Thomas Terhorst, Claus Weihs and Holger Wormer: Which District of Dortmund is the Most Dangerous? (31)
16:40 Tobias Voigt, Roland Fried, Michael Backes and Wolfgang Rhode: Gamma-Hadron-Separation in the MAGIC-Experiment (53)
17:05 Dieter Joenssen and Udo Bankhofer: Zur Begrenzung der Verwendungshäufigkeit von Spenderobjekten bei der Imputation fehlender Daten mittels Hot-Deck-Verfahren (40)
Statistics and Data Analysis: Statistics in Economics (HS 1, Room 007)
Chair: A. Rybicka
16:15 Justyna Brzezinska: Visual models for categorical data in economic research (34)
16:40 Miguel Biron and Cristian Bravo: Empirically Measuring the Effect of Violating the Independence Assumption in Behavioral Scoring (32)
17:05 Bartlomiej Jefmanski and Marcin Pelka: Fuzzy Composite Index for Customer Satisfaction Evaluation: an Application for Public Sector Services (39)
Data Analysis in Finance (HS 52, Room 101)
Chair: M. Hanke
16:15 Wolfgang Bessler and Daniil Wagner: Sovereign Wealth Funds and Portfolio Choice (72)
16:40 Anna Rutkowska-Ziarko: Fundamental portfolio construction based on
semi-variance (82)
17:05 Przemysław Garsztka: Optimal portfolios of assets taking into account the
asymmetry of specific risk (76)
Machine Learning and Knowledge Discovery: Clustering (HS 27, Room 003)
16:15 Sandrine Mouysset, Joseph Noailles, Daniel Ruiz and Clovis Tauber: Spectral Clustering: interpretation and Gaussian parameter (94)
16:40 Marcin Pelka: Symbolic cluster ensemble based on co-association matrix
vs. noisy variables and outliers (96)
17:05 Gunnar Völkel, Uwe Schöning and Hans A. Kestler: Group-Based Ant
Colony Optimization (103)
Invited Session: Applications in Empirical Educational Research Based on
Secondary Data (HS 2a, Room 004)
Chair: A. Schwarz
16:15 Alexandra Schwarz: Applications in Empirical Educational Research Based
on Secondary Data (19)
16:40 Anna Makles and Kerstin Schneider: Does school choice increase ethnic
segregation in primary schools or only segregation indices? (18)
17:05 Alexandra Schwarz: The Impact of Student Loans on Personal Financing of
Higher Education in Germany (127)
Social events
16:30 Guided tour of UB Hildesheim
17:30 Guided tour of Dombibliothek
18:00 Guided City tour of Hildesheim (German)
18:00 Guided City tour of Hildesheim (English)
20:00 Reception in the city hall at the market place (Greeting: Ruth Seefels,
Mayor of Hildesheim, Claus Weihs, President of GfKl)
Thursday, August 02, 2012
08:15 Registration (Building: HS 52, Foyer)
09:00 Plenary (Building: HS 52, Room: 001 (Theater)),
João Gama: Data Stream Mining for Ubiquitous Environments (3),
Chair: M. Spiliopoulou
Break 09:45 – 10:00
Statistics and Data Analysis: Factor analysis (HS 52, Room 001)
Chair: C. Hennig
10:00 Pieter Schoonees, Michel Van de Velden and Patrick Groenen: Constrained
dual scaling of successive categories for detecting response styles (49)
10:50 Hans-Joachim Mucha, Hans-Georg Bartel and Jens Dolata: Dual Scaling
Classification and Its Application in Archaeometry (47)
Biostatistics and Bioinformatics (HS 1, Room 007)
Chair: H. Kestler
10:00 Sergej Potapov, Asma Gul, Werner Adler and Berthold Lausen: Decision
tree ensembles with different split criteria (138)
10:25 Florian Schmid, Ludwig Lausser and Hans A. Kestler: A Transductive Set
Covering Machine (139)
10:50 Pawel Matuszyk, Dominik Brammen, René Schult and Myra Spiliopoulou:
Prediction of Surgery Duration Using Data Mining Methods on Anaesthesia Protocols (136)
Invited Session: Dynamic Cluster Analysis - Theory and Practice (HS 27,
Room 003)
Chair: J. Pociecha
10:00 Hans-Hermann Bock: Old and new dynamic clustering methods (22)
10:25 Kamila Migdał Najman and Krzysztof Najman: Dynamical Clustering with
Self Learning Neural Networks (24)
10:50 Paweł Lula: Machine learning approach in information retrieval for real
estate offers analysis (23)
Interdisciplinary Domains: Music 1 (HS 2a, Room 004)
Chair: C. Weihs
10:00 Christian Dittmar, Daniel Gärtner, Kay F. Hildebrand and Florian Müller:
Evaluating Similarity Measures for Plagiarism Detection in Melody Transcriptions (111)
10:25 Ruben Hillewaere, Bernard Manderick and Darrell Conklin: Alignment
methods for folk tune classification (117)
10:50 Nadja Bauer, Klaus Friedrichs, Julia Schiffner and Claus Weihs: Onset detection using an auditory model (108)
Coffee break 11:15 – 11:30
Statistics and Data Analysis: Classification 2 (HS 52, Room 001)
Chair: B. Bischl
11:30 Oliver Meyer, Bernd Bischl and Claus Weihs: Support Vector Machines on
Large Data Sets: Simple Parallel Approaches (45)
11:55 Tatjana Lange and Pavlo Mozharovskyi: The Alpha-Procedure - a nonparametric invariant method for automatic classification of d-dimensional objects
(43)
12:20 Friedhelm Schwenker and Sascha Meudt: On Instance Selection in Multi
Classifier Systems (50)
12:40 Hoang Huy Nguyen, Stefan Frenzel and Christoph Bandt: Multi-Step Linear Discriminant Analysis for Classification of Event-Related Potentials (124)
Biostatistics and Bioinformatics (HS 1, Room 007)
Chair: H. Kestler
11:30 Dominik Heider, Christoph Bartenhagen, J. Nikolaj Dybowski, Sascha
Hauke, Martin Pyka and Daniel Hoffmann: Unsupervised dimension reduction
methods for protein sequence classification (135)
11:55 Andre Burkovski, Ludwig Lausser and Hans A. Kestler: Rank aggregation
for candidate gene selection (134)
12:20 Markus Maucher, Christian Wawra and Hans A. Kestler: The critical noise
level for learning Boolean functions (137)
12:45 Meeting AG Biostatistik
Invited Session: Dynamic Cluster Analysis - Theory and Practice (HS 27,
Room 003)
Chair: J. Pociecha
11:30 Andrzej Sokolowski: Classification of Three-Way Clustering Problems (25)
11:55 Sascha Voekler and Daniel Baier: Solving Product Line Design Optimization Problems using Stochastic Programming (27)
12:20 Agnieszka Stanimir: Studies in Lower Secondary Educational Level Outcomes Changes in Poland Using Correspondence Analysis (26)
Interdisciplinary Domains: Music 2 (HS 2a, Room 004)
Chair: C. Weihs
11:30 Igor Vatolkin, Günther Rötter and Claus Weihs: Music Genre Prediction by
High-Level Instrument and Harmony Characteristics (130)
11:55 Markus Eichhoff and Claus Weihs: From Single Tones to MIDI Remixes Detecting Families of Musical Instruments by High-Level Features (112)
12:20 Hanna Lukashevich: Confidence measures in automatic music classification
(123)
12:45 Sebastian Krey, Uwe Ligges and Friedrich Leisch: Music and Timbre Segmentation by efficient Order Constrained K-Means Clustering (120)
13:30 Meeting AG DA-NK (HS 52, Room 101)
Lunch break 12:45 – 14:00
14:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)),
Michele Sebag: Autonomous Robotics: Defining Instincts and Learning Systems
of Values (8), Chair: W. Gaul
14:00 Semi Plenary (Building: HS 27, Room 003),
Alex Weissensteiner: Arbitrage-Free Scenario Trees for Financial Optimization
(5), Chair: K. Jajuga
Break 14:45 – 14:55
Statistics and Data Analysis: Clustering 2 (HS 52, Room 001)
Chair: Ritter
14:55 Justyna Wilk and Marcin Pelka: Cluster Analysis of Symbolic Data with
Application of R Software (54)
15:20 Hajo Holzmann and Florian Schwaiger: Merging States in Hidden Markov
Models (38)
15:45 Christian Hennig: Some thoughts about the “number of clusters”-problem
(37)
Data Analysis and Classification in Marketing (HS 1, Room 007)
Chair: R. Decker
14:55 Winfried Steiner, Florian Siems, Anett Weber and Daniel Guhl: Exploring
Nonlinear Effects in the Relationship between Customer Satisfaction and Customer Retention (67)
15:20 Christina Lichtenthäler and Lars Schmidt-Thieme: Multinomial-SVM-ItemRecommender for Repeat Buying Scenarios (61)
15:45 Michael Tuma: Identifying Consumer Typologies from Online Product Reviews Using Finite Mixture Models (69)
Data Analysis in Finance (HS 52, Room 101)
Chair: M. Hanke
14:55 Alois Geyer, Michael Hanke and Alex Weissensteiner: A Simplex Rotation
Algorithm for the Factor Approach to Generate Financial Scenarios (77)
15:20 Daniel Bohlmann and Jarek Krajewski: Feature reduction and pattern classification for financial forecasting - A comparative study on different optimization strategies (74)
Machine Learning and Knowledge Discovery: Classification & Ensembles (HS
27, Room 003)
14:55 Friedhelm Schwenker, Michael Glodek and Martin Schels: Ensemble learning for density estimation (97)
15:20 Robin Senge and Eyke Hüllermeier: An Analysis of Classifier Chains for
Multi-Label Classification (98)
15:45 Igor Vatolkin, Bernd Bischl, Günter Rudolph and Claus Weihs: Statistical
Comparison of Classifiers for Multi-Objective Feature Selection in Instrument
Recognition (102)
Coffee break 16:10 – 16:40
Statistics and Data Analysis: Model selection (HS 52, Room 001)
Chair: C. Weihs
16:40 Eckhard Liebscher: A universal method for model selection in parametric
regression models based on statistical tests (44)
17:05 Hans-Joachim Mucha and Hans-Georg Bartel: Soft Bootstrapping and Its
Comparison with Other Resampling Methods (46)
Data Analysis and Classification in Marketing (HS 1, Room 007)
Chair: R. Decker
16:40 Anneke Minke and Klaus Ambrosi: Approach to Predicting Changes in
Market Segments Based on Customer Behavior (62)
17:05 Andrzej Bak, Marcin Pelka and Aneta Rybicka: Discrete Choice Methods
and Their Applications in Preference Analysis of Vodka Consumers (57)
Interdisciplinary Domains: Language & Education (HS 2a, Room 004)
Chair: S. Krolak-Schwerdt
16:40 Andreea Beica and Liviu P. Dinu: Computational Aspects of Natural Languages’ Similarities (109)
17:05 Sergiu Nisioi and Liviu P. Dinu: The Author in Translation: A Computational Method (125)
17:30 Ali Ünlü, Daniel Kasper and Matthias Trendtel: The OECD’s Programme
for International Student Assessment (PISA) Study: A Review of Its Basic Psychometric Concepts (131)
Break 17:30 – 18:00
18:00 General meeting of the German Classification Society
(Building: HS 52, Room: 001 (Theater), End: 19:30)
Social events
20:15 Conference dinner in the Novotel (Bahnhofsallee 38, 31134 Hildesheim,
admittance: 19:30)
Friday, August 03, 2012
08:15 Registration (Building: HS 52, Foyer)
09:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)),
Hillol Kargupta: Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining (6), Chair: L. Schmidt-Thieme
09:00 Semi Plenary (Building: HS 27, Room 003),
Dirk Van den Poel: On the value of incorporating sequential information into
predictive analytics classification models for analytical CRM (7),
Chair: W. Steiner
Break 09:45 – 09:55
Statistics and Data Analysis: Applications (HS 52, Room 001)
Chair: H. Mucha
09:55 Jaime Santos and Orlando Belo: Introducing Analytical Methods and Predictive Models in Project Management Activities (48)
10:20 Florian Klapproth, Sabine Krolak-Schwerdt and Thomas Hörstermann:
Predictive validity of tracking decisions: Application of a new validation criterion (41)
10:45 Mariana Carvalho, Paulo Sampaio and Orlando Belo: Discovering Process
Certification Tendencies (35)
Data Analysis and Classification in Marketing (HS 1, Room 007)
Chair: D. Van den Poel
09:55 Pascal Kottemann, Martin Meißner and Reinhold Decker: Measuring Consumers' Brand Associations in Online Market Research (60)
10:20 Michel Ballings and Dirk Van den Poel: The Dangers of using Intention as
a Surrogate for Retention in Brand Positioning Decision Support Systems (58)
10:45 Friederike Paetz and Winfried J. Steiner: Finite Mixture MNP vs. Finite
Mixture IP Models: An Empirical Study (63)
Machine Learning and Knowledge Discovery: Opinions and marketing (HS
52, Room 101)
09:55 Hyunsup Ahn, Markus Weinmann and Christoph Lofi: Classification and
definition of contextual vicinity from emotional words for sentiment analysis (86)
10:20 Alina Sinelnikova, Eirini Ntoutsi and Hans-Peter Kriegel: Sentiment analysis in the Twitter stream (99)
10:45 Ralf Wagner: The Dark Side of Marketing Communication: Grouping Consumers with Respect to Their Reactance Behavior (104)
Machine Learning and Knowledge Discovery: Distributed and Temporal Data
Analysis (HS 27, Room 003)
Chair: J. Gama
09:55 Umer Khan, Alexandros Nanopoulos and Lars Schmidt-Thieme: Experimental Evaluation of Communication Efficient Distributed Classification in Peer-to-Peer Networks (92)
10:20 István Dávid and Krisztian Buza: On the relation of cluster stability and
early classifiability of time series (91)
10:45 Maheen Bakhtyar, Lena Wiese, Katsumi Inoue and Nam Dang: Using Conceptual Inductive Learning for Cooperative Query Answering (87)
Interdisciplinary Domains: Quality (HS 2a, Room 004)
Chair: C. Hennig
09:55 Dirk Thorleuchter and Dirk Van den Poel: Espionage Risk Assessment for
Security of Defense based Research and Technology (128)
10:20 Kay F. Hildebrand: Supporting Selection of Statistical Techniques in Research (116)
10:45 Dominik Rozkrut: Differentiation of innovation strategies across regions
(126)
Coffee break 11:10 – 11:25
Invited Session: Ensemble methods in clustering and classification (HS 52,
Room 001)
Chair: B. Lausen
11:25 Andreas Ziegler and Jochen Kruppa: Probability Machines: Estimating individual probabilities using machine learning methods (15)
11:50 Harald Binder: Tailoring componentwise boosting for prediction with a
huge number of molecular measurements (13)
12:15 Silke Janitza and Anne-Laure Boulesteix: An AUC-based Permutation Variable Importance Measure for Random Forests (14)
12:40 Werner Adler, Zardad Khan, Sergej Potapov and Berthold Lausen: Diversity
Based Weighting to Improve the Performance of Classifier Ensembles (12)
Data Analysis and Classification in Marketing (HS 1, Room 007)
Chair: A. Rese
11:25 Alexander Sänn and Daniel Baier: Complex Product Development: Using a
Combined VoC Lead User Approach (68)
11:50 Sebastian Selka, Daniel Baier and Peter Kurz: A Validity Analysis of Recent Commercial Conjoint Analysis Studies (66)
Machine Learning and Knowledge Discovery: Social networks (HS 27, Room
003)
Chair: P. Symeonidis
11:25 Zied Yakoubi and Rushed Kanawati: Applying Leaders Driven Community
Detection Algorithms to Data Clustering (106)
11:50 Christian Wartena and Rogier Brussee: Evaluating Tag Similarity Measures
by Clustering Bibsonomy Tags (105)
12:40 Krisztian Buza: Feedback Prediction for Blogs (89)
Interdisciplinary Domains: Maps & Images (HS 2a, Room 004)
Chair: I. Herzog
11:25 Andre Busche, Ruth Janning, Tomas Horvath and Lars Schmidt-Thieme: A
Unifying Framework for GPR Image Reconstruction (110)
11:50 Irmela Herzog: Testing Models for Medieval Settlement Location (115)
12:15 Martin Loidl and Christoph Traun: The balance of value and space - Merging classification and regionalization to make more sense out of spatial data
(121)
Break 13:05 – 13:15
13:15 Closing Plenary (Building: HS 52, Room: 001 (Theater)),
Shai Ben-David: Universal Learning vs. No Free Lunch results - can there be
learners that do not require task-specific knowledge? (2), Chair: C. Hennig
14:00 Farewell (Beverages/snacks, Building: HS 52, Foyer)
End 14:30
Workshop on Classification and Subject Indexing in Library and
Information Science (LIS'2012)
held as part of the annual conference of the German Classification Society (GfKl)
Wednesday, August 01, 2012
09:00 Opening of the conference (HS 52)
10:45 Coffee break
Chair: Frank Scholze
11:15 Heidrun Wiesenmüller: Resource Discovery Systeme – Chance oder Verhängnis für die bibliothekarische Erschließung?
Karl Rädler: Instrumentalisierung der klassifikatorischen Sacherschließung im neuen Suchportal mit AquaBrowser in der Vorarlberger Landesbibliothek
12:30 Lunch break
13:30 Uwe Geith: Die sachliche Suche in Schweizer Online-Katalogen und Discovery-Systemen
14:15 Break
Chair: Michael Mönnich
14:30 Dominique Ritze, Kai Eckert: Data Enrichment in Discovery Systems using Linked Data
Elmar Haake: Verarbeitung von Sacherschliessungselementen in Discoverysystemen: auf dem Weg zu einer nutzergerechten Verwendung von inhaltlicher Erschließung in der E-LIB Bremen
15:45 Coffee break
16:15 Jan Frederik Maas: Entwicklung eines Werkzeugs zur Visualisierung der SWD/GND
Alice Spinnler: Sacherschliessung mit GND/RSWK im Verbund Basel: eine erste Bilanz
17:30 Library tours
20:30 Dinner (at participants' own expense)
Thursday, August 02, 2012
09:00 Plenary session (HS 52)
09:45 Break
Chair: Ewald Brahms
10:00 Debora Daberkow, Petra Mensing, Irina Sens, Claudia Todt: LinSearch – Effiziente Indizierung an der Technischen Informationsbibliothek, Hannover
Monika Lösse, Mathias Lösch: Sachliche Einordnung von Dokumenten in Bibliotheken: praktische Erfahrungen mit maschinellen Lernverfahren
11:15 Coffee break
11:30 Magnus Pfeffer: Abgleich von Titeldaten zur Übernahme von Sacherschließungsinformationen über Verbundgrenzen
Uwe Geith, Wolfgang Giella: Herausforderung "Neue Klassifikation für Freihandbestände" - 3 Praxis-Beispiele aus der Schweiz
12:45 Lunch break
14:00 Semi-plenary session (HS 52/27)
14:45 Break
Chair: Heidrun Wiesenmüller
14:55 Andreas Ledl: Blogs als Thesaurus-Datenbanken
Michael Schwantner, Silke Rehme, Helmut Müller, Elke Bubel, Mario Quilitz, Peter König, Nadejda Nikitina, Achim Rettinger, Nils Elsner: Semiautomatische Ontologiegenerierung – ein Erfahrungsbericht
16:10 Coffee break
16:40 Bernd Lorenz: AG Dezimalklassifikationen - Literaturbericht 2011
17:30 Break / meeting of the LIS PC
18:00 General meeting of the German Classification Society (HS 52)
19:30 End
The LIS’2012 workshop will take place at Building HS 31, Room 012.
Abstracts
Contents
Part I Keynote Speakers
Universal Learning vs. No Free Lunch results - can there be learners that do not require task-specific knowledge? (2)
Shai Ben-David, Nathan Srebro, and Ruth Urner
Data Stream Mining for Ubiquitous Environments (3)
João Gama
Where Data Analysis Meets Graph Theory (4)
Wolfgang Gaul
Arbitrage-free Scenario Trees for Financial Optimization (5)
Alois Geyer, Michael Hanke, and Alex Weissensteiner
Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining (6)
Hillol Kargupta
On the value of incorporating sequential information into predictive analytics classification models for analytical CRM (7)
Dirk Van den Poel
Autonomous Robotics: Defining Instincts and Learning Systems of Values (8)
Michèle Sebag
Stream Data Mining and Anytime Algorithms (9)
Thomas Seidl
Knowledge Discovery in Shopping Path Data (10)
Katsutoshi Yada
Part II Invited Session: Ensemble methods in clustering and classification
Diversity Based Weighting to Improve the Performance of Classifier Ensembles (12)
Werner Adler, Zardad Khan, Sergej Potapov, and Berthold Lausen
Tailoring Componentwise Boosting for Prediction with a Huge Number of Molecular Measurements (13)
Harald Binder
An AUC-based Permutation Variable Importance Measure for Random Forests for Unbalanced Data (14)
Silke Janitza and Anne-Laure Boulesteix
Probability Machines: Estimating individual probabilities using machine learning methods (15)
Andreas Ziegler and Jochen Kruppa
Part III Invited Session: Applications in Empirical Educational Research
Based on Secondary Data
Does school choice increase ethnic segregation in primary schools or only segregation indices? (18)
Anna Makles and Kerstin Schneider
Applications in Empirical Educational Research Based on Secondary Data (19)
Alexandra Schwarz
Part IV Invited Session: Dynamic Cluster Analysis - Theory and Practice
Old and new dynamic clustering methods (22)
Hans-Hermann Bock
Machine learning approach in information retrieval for real estate offers analysis (23)
Paweł Lula
Dynamical Clustering with Self Learning Neural Networks (24)
Kamila Migdał Najman and Krzysztof Najman
Classification of Three-Way Clustering Problems (25)
Andrzej Sokolowski
Studies in Lower Secondary Educational Level Outcomes Changes in Poland Using Correspondence Analysis (26)
Agnieszka Stanimir
Solving Product Line Design Optimization Problems using Stochastic Programming (27)
Sascha Voekler and Daniel Baier
Part V Statistics and Data Analysis
An exact Newton's method for ML estimation in a penalized Gaussian mixture model (30)
Grigory Alexandrovich
Which District of Dortmund is the Most Dangerous? (31)
Tim Beige, Thomas Terhorst, Claus Weihs and Holger Wormer
Empirically Measuring the Effect of Violating the Independence Assumption in Behavioural Scoring (32)
Miguel Biron and Cristián Bravo
Benchmarking classification algorithms on high-performance computing clusters (33)
Bernd Bischl, Julia Schiffner, Claus Weihs
Visual models for categorical data in economic research (34)
Justyna Brzezińska
Discovering Process Certification Tendencies (35)
Mariana Carvalho, Paulo Sampaio, and Orlando Belo
Geographic clustering through aggregation control (36)
Daher Ayale and Dhorne Thierry
Some thoughts about the “number of clusters”-problem (37)
Christian Hennig
Merging States in Hidden Markov Models (38)
Hajo Holzmann and Florian Schwaiger
Fuzzy Composite Index for Customer Satisfaction Evaluation: an Application for Public Sector Services (39)
Bartłomiej Jefmański and Marcin Pełka
Zur Begrenzung der Verwendungshäufigkeit von Spenderobjekten bei der Imputation fehlender Daten mittels Hot-Deck-Verfahren (40)
Dieter William Joenssen and Udo Bankhofer
Predictive validity of tracking decisions: Application of a new validation criterion (41)
Florian Klapproth, Sabine Krolak-Schwerdt, and Thomas Hörstermann
DDα-classification of asymmetric and fat-tailed data (42)
Tatjana Lange, Karl Mosler, and Pavlo Mozharovskyi
The Alpha-Procedure - a nonparametric invariant method for automatic classification of d-dimensional objects (43)
Tatjana Lange and Pavlo Mozharovskyi
A universal method for model selection in parametric regression models based on statistical tests (44)
Eckhard Liebscher
Support Vector Machines on Large Data Sets: Simple Parallel Approaches (45)
Oliver Meyer, Bernd Bischl, Claus Weihs
Soft Bootstrapping and Its Comparison with Other Resampling Methods (46)
Hans-Joachim Mucha and Hans-Georg Bartel
Dual Scaling Classification and Its Application in Archaeometry (47)
Hans-Joachim Mucha, Hans-Georg Bartel, and Jens Dolata
Introducing Analytical Methods and Predictive Models in Project Management Activities (48)
Jaime Santos and Orlando Belo
Constrained Dual Scaling of Successive Categories for Detecting Response Styles (49)
Pieter C. Schoonees, Michel van de Velden, and Patrick J. F. Groenen
On Instance Selection in Multi Classifier Systems (50)
Friedhelm Schwenker, Sascha Meudt
Effects of Labeling Mechanisms on Classification Error in Linear Discriminant Analysis (51)
Keiji Takai and Kenichi Hayashi
Three-way Subspace Hierarchical Clustering based on Entropy Regularization Method (52)
Kensuke Tanioka and Hiroshi Yadohisa
Gamma-Hadron-Separation in the MAGIC-Experiment (53)
Tobias Voigt, Roland Fried, Michael Backes, and Wolfgang Rhode
Cluster Analysis of Symbolic Data with Application of R Software (54)
Justyna Wilk and Marcin Pełka
Part VI Data Analysis and Classification in Marketing
Spatial Modeling of Dependencies Between Population, Education, and Economic Growth (56)
Daniel Baier, Wolfgang Polasek, and Alexandra Rese
Discrete Choice Methods and Their Applications in Preference Analysis of Vodka Consumers (57)
Andrzej Bąk, Marcin Pełka, and Aneta Rybicka
The Dangers of using Intention as a Surrogate for Retention in Brand Positioning Decision Support Systems (58)
Michel Ballings and Dirk Van den Poel
Microeconometrics Multinomial Models and their Applications in Preferences Analysis using R (59)
Andrzej Bąk and Tomasz Bartłomowicz
Measuring Consumers' Brand Associations in Online Market Research (60)
Pascal Kottemann, Martin Meißner and Reinhold Decker
Multinomial-SVM-Item-Recommender for Repeat-Buying Scenarios (61)
Christina Lichtenthäler and Lars Schmidt-Thieme
Approach to Predicting Changes in Market Segments Based on Customer Behavior (62)
Anneke Minke and Klaus Ambrosi
Finite Mixture MNP vs. Finite Mixture IP Models: An Empirical Study (63)
Friederike Paetz and Winfried J. Steiner
Rasch Models for Analyzing Role Models in Inter-Organisational Innovation Processes (64)
Alexandra Rese, Hans-Georg Gemünden, and Daniel Baier
Variable Weighting and Selection Approaches for Market Segmentation: A Comparison (65)
Susanne Rumstadt and Daniel Baier
A Validity Analysis of Recent Commercial Conjoint Analysis Studies (66)
Sebastian Selka, Daniel Baier, and Peter Kurz
Exploring Nonlinear Effects in the Relationship between Customer Satisfaction and Customer Retention (67)
Winfried J. Steiner, Florian U. Siems, Anett Weber and Daniel Guhl
Complex Product Development: Using a Combined VoC Lead User Approach (68)
Alexander Sänn and Daniel Baier
Identifying Consumer Typologies from Online Product Reviews Using Finite Mixture Models (69)
Michael N. Tuma
Solving Product Line Design Optimization Problems using Stochastic Programming (70)
Sascha Voekler and Daniel Baier
Part VII Data Analysis in Finance
Sovereign Wealth Funds and Portfolio Choice (72)
Wolfgang Bessler and Daniil Wagner
Feature reduction and pattern classification for financial forecasting - A comparative study on different optimization strategies (74)
Daniel Bohlmann and Jarek Krajewski
A practical method of determining longevity and premature-death risk aversion in households and some proposals of its application (75)
Lukasz Feldman, Radoslaw Pietrzyk, and Pawel Rokita
Optimal portfolios of securities taking into account the asymmetry of specific risk (76)
Przemysław Garsztka
A Simplex Rotation Algorithm for the Factor Approach to Generate Financial Scenarios (77)
Alois Geyer, Michael Hanke, and Alex Weissensteiner
Correlation of outliers in multivariate data (78)
Bartosz Kaszuba
Using generalized additive models to fit credit rating scores (79)
Marlene Müller
Clustering Algorithms for Storage of Tick Data (80)
Gabor I. Nagy and Krisztian Buza
Value-at-Risk Backtesting Procedures Based on the Loss Functions - Simulation Analysis of the Power of Tests (81)
Krzysztof Piontek
Fundamental portfolio construction based on semi-variance (82)
Anna Rutkowska-Ziarko
Sovereign Credit Spreads During the European Fiscal Crisis (83)
Jonas Vogt
Part VIII Machine Learning and Knowledge Discovery
Classification and definition of contextual vicinity from emotional words
for sentiment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Hyunsup Ahn, Markus Weinmann, and Christoph Lofi
Using Conceptual Inductive Learning for Cooperative Query Answering . 87
Maheen Bakhtyar, Lena Wiese, Katsumi Inoue, and Nam Dang
A Study of the Efficiency and Accuracy of
Data Stream Clustering for Large Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 88
Matthew Bolaños, John Forrest, and Michael Hahsler
Feedback Prediction for Blogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Krisztian Buza
Label Ranking with Abstention: Learning to Predict Partial Orders . . . . . 90
Weiwei Cheng, Willem Waegeman, Volkmar Welker, and Eyke Hüllermeier
On the relation of cluster stability
and early classifiability of time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
István Dávid and Krisztian Buza
Experimental Evaluation of Communication Efficient Distributed
Classification in Peer-to-Peer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Umer Khan, Alexandros Nanopoulos, and Lars Schmidt-Thieme
Framework for Storing and Processing Relational Entities in a Data
Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Pawel Matuszyk
Spectral clustering: interpretation and Gaussian parameter . . . . . . . . . . . . 94
Sandrine Mouysset, Joseph Noailles, Daniel Ruiz, and Clovis Tauber
gRecs: A collaborative filtering framework for group recommendations . . 95
Eirini Ntoutsi, Kostas Stefanidis, Kjetil Nørvåg, and Hans-Peter Kriegel
Symbolic cluster ensemble based on co-association matrix vs. noisy
variables and outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Marcin Pełka
Ensemble learning for density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Friedhelm Schwenker, Michael Glodek, and Martin Schels
An Analysis of Classifier Chains for Multi-Label Classification . . . . . . . . . 98
Robin Senge, Jose Barranquero, Juan José del Coz, and Eyke Hüllermeier
Sentiment analysis in the Twitter stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Alina Sinelnikova, Eirini Ntoutsi, and Hans-Peter Kriegel
Recommendations in Time Evolving Multi-modal Social Networks . . . . . . 100
Panagiotis Symeonidis
A Lightweight CVFDT Classifier for Streams with Concept Drift . . . . . . . 101
Miriam Tödten, Zaigham Faraz Siddiqui, and Myra Spiliopoulou
Statistical Comparison of Classifiers for Multi-Objective Feature
Selection in Instrument Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Igor Vatolkin, Bernd Bischl, Günter Rudolph, and Claus Weihs
Group-Based Ant Colony Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Gunnar Völkel, Uwe Schöning, and Hans A. Kestler
The Dark Side of Marketing Communication: Grouping Consumers
with Respect to Their Reactance Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Ralf Wagner
Evaluating Tag Similarity Measures by Clustering Bibsonomy Tags . . . . . 105
Christian Wartena and Rogier Brussee
Applying Leaders Driven Community Detection Algorithms to Data
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Zied Yakoubi and Rushed Kanawati
Part IX Interdisciplinary Domains
Onset detection using an auditory model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Nadja Bauer, Klaus Friedrichs, Julia Schiffner, and Claus Weihs
Computational Aspects of Natural Languages’ Similarities . . . . . . . . . . . . 109
Andreea Beica and Liviu P. Dinu
A Unifying Framework for GPR Image Reconstruction . . . . . . . . . . . . . . . 110
Andre Busche, Ruth Janning, Tomáš Horváth, and Lars Schmidt-Thieme
Evaluating Similarity Measures for Plagiarism Detection in Melody
Transcriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Christian Dittmar, Daniel Gärtner, Kay F. Hildebrand, and Florian Müller
From Single Tones to MIDI Remixes - Detecting Families of Musical
Instruments by High-Level Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Markus Eichhoff and Claus Weihs
Learning in groups and exam performance . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Andreas Geyer-Schulz, Jonas Kunze, and Andreas Sonnenbichler
ANOVA and Alternatives for Causal Inferences . . . . . . . . . . . . . . . . . . . . . . 114
Sonja Hahn
Testing Models for Medieval Settlement Location . . . . . . . . . . . . . . . . . . . . . 115
Irmela Herzog
Supporting Selection of Statistical Techniques in Research . . . . . . . . . . . . . 116
Kay F. Hildebrand
Alignment methods for folk tune classification . . . . . . . . . . . . . . . . . . . . . . . 117
Ruben Hillewaere, Bernard Manderick, and Darrell Conklin
Comparing regression approaches in modelling (non-)compensatory
judgment formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Thomas Hörstermann and Sabine Krolak-Schwerdt
Sensitivity Analyses for the Rasch Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Daniel Kasper and Ali Ünlü
Music and Timbre Segmentation by efficient Order Constrained
K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Sebastian Krey, Uwe Ligges, and Friedrich Leisch
The balance of value and space - Merging classification and regionalization
to make more sense out of spatial data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Martin Loidl and Christoph Traun
Confidence measures in automatic music classification . . . . . . . . . . . . . . . . 123
Hanna Lukashevich
Multi-Step Linear Discriminant Analysis for
Classification of Event-Related Potentials . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Nguyen Hoang Huy, Stefan Frenzel, and Christoph Bandt
The Author in Translation:
A Computational Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Sergiu Nisioi and Liviu P. Dinu
Differentiation of innovation strategies across regions . . . . . . . . . . . . . . . . 126
Dominik Antoni Rozkrut
The Impact of Student Loans on Personal Financing of Higher
Education in Germany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Alexandra Schwarz
Espionage Risk Assessment for Security of Defense based Research and
Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Dirk Thorleuchter and Dirk Van den Poel
Using Latent Class Models with Random Effects for Investigating Local
Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Matthias Trendtel and Ali Ünlü
Music Genre Prediction by High-Level Instrument and Harmony
Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Igor Vatolkin, Günther Rötter, and Claus Weihs
The OECD’s Programme for International Student Assessment (PISA)
Study: A Review of Its Basic Psychometric Concepts . . . . . . . . . . . . . . . . . . 131
Ali Ünlü, Daniel Kasper and Matthias Trendtel
Part X Biostatistics and Bioinformatics
Rank aggregation for candidate gene selection . . . . . . . . . . . . . . . . . . . . . . . 134
Andre Burkovski, Ludwig Lausser and Hans A. Kestler
Unsupervised dimension reduction methods for protein sequence
classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Dominik Heider, Christoph Bartenhagen, J. Nikolaj Dybowski, Sascha
Hauke, Martin Pyka, and Daniel Hoffmann
Prediction of Surgery Duration Using Data Mining Methods on
Anaesthesia Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Pawel Matuszyk, Dominik Brammen, René Schult, and Myra Spiliopoulou
The critical noise level for learning Boolean functions . . . . . . . . . . . . . . . . . 137
Markus Maucher, Christian Wawra, and Hans A. Kestler
Decision tree ensembles with different split criteria. . . . . . . . . . . . . . . . . . . . 138
Sergej Potapov, Asma Gul, Werner Adler, and Berthold Lausen
A Transductive Set Covering Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Florian Schmid, Ludwig Lausser and Hans A. Kestler
Part XI LIS’12 Workshop
LinSearch – Effiziente Indizierung an der Technischen
Informationsbibliothek, Hannover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Dr. Debora Daberkow, Dr. Petra Mensing, Dr. Irina Sens, Claudia Todt
Herausforderung ”Neue Klassifikation für Freihandbestände” - 3
Praxis-Beispiele aus der Schweiz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Uwe Geith and Dr. Wolfgang Giella
Die sachliche Suche in Schweizer Online-Katalogen und Discovery-Systemen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Uwe Geith
Verarbeitung von Sacherschliessungselementen in Discoverysystemen:
Auf dem Weg zu einer nutzergerechten Verwendung von inhaltlicher
Erschließung in der E-LIB Bremen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Dr. Elmar Haake
Der Blog als Thesaurus-Datenbank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Andreas Ledl
AUSZUG AUS DEM LITERATURBERICHT 2011 DEWEY DECIMAL
CLASSIFICATION (DDC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Bernd Lorenz
Entwicklung eines Werkzeugs zur Visualisierung der SWD/GND . . . . . . . 149
Dr.-Ing. Jan Frederik Maas
Practical Experiences with Machine Learning-based Text Categorization
for Library Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Elisabeth Mödden, Mathias Lösch, Monika Lösse, and Ulrike Junger
Abgleich von Titeldaten zur Übernahme von
Sacherschließungsinformationen über Verbundgrenzen . . . . . . . . . . . . . . . 151
Magnus Pfeffer
Data Enrichment in Discovery Systems using Linked Data . . . . . . . . . . . . . 152
Dominique Ritze and Kai Eckert
Instrumentalisierung der klassifikatorischen Sacherschließung im neuen
Suchportal mit AquaBrowser in der Vorarlberger Landesbibliothek . . . . . 153
Karl Rädler
Text Mining für den Ontologieaufbau . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Elke Bubel, Nils Elsner, Peter König, Helmut Müller, Nadejda Nikitina,
Mario Quilitz, Silke Rehme, Achim Rettinger, and Michael Schwantner
Sacherschliessung mit GND/RSWK im Verbund Basel: eine erste Bilanz . 155
Alice Spinnler
Resource Discovery Systeme – Chance oder Verhängnis für die
bibliothekarische Erschließung? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Heidrun Wiesenmüller
Inhaltliche Anpassung der RVK als Aufstellungsklassifikation – Projekt
Bibliotheksneubau Kleine Fächer der FU Berlin, Schwerpunkt Orient . . . 157
Helen Younansardaroud
List of Contributors
Prof. Dr. Myra Spiliopoulou (PC Chair)
Workgroup KMD: ”Knowledge Management and Discovery”
Faculty of Computer Science
Otto-von-Guericke-Universität Magdeburg
Universitätsplatz 2
39106 Magdeburg, Germany
Prof. Dr. Dr. Lars Schmidt-Thieme (PC Chair, Local Organizer)
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim
Marienburger Platz 22
D-31141 Hildesheim, Germany
Ruth Janning, M.Sc. (Local Organizer)
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim
Marienburger Platz 22
D-31141 Hildesheim, Germany
Part I
Keynote Speakers
Universal Learning vs. No Free Lunch Results: Can there be learners
that do not require task-specific knowledge?
Shai Ben-David1, Nathan Srebro2, and Ruth Urner3
1 University of Waterloo, Canada [email protected]
2 Toyota Technological Institute at Chicago, United States [email protected]
3 University of Waterloo, Canada [email protected]
Abstract. The so-called No-Free-Lunch principle is a basic insight of machine learning. It may be viewed as stating that in the absence of prior knowledge (or inductive bias), every learning algorithm may fail on some *learnable* task.
In recent years, several paradigms for ”universal learning” have been proposed and
advocated. These range from paradigms of almost science-fictional nature, like ”Automation of science”, through practically oriented Deep Belief Networks, to theoretical constructs like Universal Kernels, Universal Priors and Universal Coding for
MDL-based learning.
In this talk I address this apparent contradiction by examining and analyzing several
possible definitions of universal learning. I will show a basic no-free-lunch theorem for such generic learning and discuss how it applies to the above-mentioned universal learning paradigms.
Keywords
Theory, Universal Learning, No-Free-Lunch
Data Stream Mining for Ubiquitous
Environments
João Gama
1 LIAAD, INESC TEC
2 FEP, University of Porto, Portugal [email protected]
Abstract. Data stream mining is, nowadays, a mature topic in data mining. Nevertheless, most work focuses on centralized approaches that learn from sequences of instances generated by environments with unknown dynamics, which can be read only once or a small number of times, using limited computing and storage capabilities. The phenomenal growth of mobile and embedded devices coupled with
their ever-increasing computational and communications capacity presents an exciting new opportunity for real-time, distributed intelligent data analysis in ubiquitous
environments. In these contexts centralized approaches have limitations due to communication constraints, power consumption (e.g. in sensor networks), and privacy
concerns. Distributed online algorithms are highly needed to address the above concerns. The focus of this talk is on distributed stream mining algorithms that are
highly scalable, computationally efficient and resource-aware. These features enable the continued operation of data stream mining algorithms in highly dynamic
mobile and ubiquitous environments.
Keywords
Data Mining, Data Streams, Distributed Algorithms
Where Data Analysis Meets Graph Theory
Wolfgang Gaul
University of Karlsruhe, Germany [email protected]
Abstract. Based on the information by which objects are described, standard tasks of data analysis try to reveal peculiarities (features, structures, etc.) in the data that help to characterize the objects. When relations between objects belong to the information available, the objects can be interpreted as vertices of a graph and knowledge
about the relational structure between the vertices can be added to the underlying
data analysis situation with the help of (possibly weighted) links between pairs of
vertices. Graph clustering and Web data mining, among others, are examples that
will be used to demonstrate findings in which data analysis and graph theory overlap.
Keywords
Graph Clustering, Web Data Mining, Data Analysis in Marketing
Arbitrage-free Scenario Trees for Financial
Optimization
Alois Geyer1, Michael Hanke2, and Alex Weissensteiner3
1 Vienna University of Economics and Business, Austria [email protected]
2 University of Liechtenstein, Liechtenstein [email protected]
3 Free University of Bolzano/Bozen, Italy [email protected]
Abstract. This paper presents a method which is designed to generate arbitrage-free scenario trees representing multivariate return distributions. Our approach is
embedded in the setting of Arbitrage Pricing Theory (APT), and asset returns are
assumed to be driven by orthogonal factors. In a complete market setting we derive
no-arbitrage bounds for expected excess returns using the least possible number
of scenarios (i.e. the smallest dimension of the discrete state space) necessary to
match the first two moments and to exclude arbitrage at the outset. This not only
safeguards against the curse of dimensionality: Numerical results from solving two-stage asset allocation problems show that highly accurate results can be obtained
with the smallest possible scenario tree.
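In generic terms (our illustration, not necessarily the authors' exact formulation), a one-period tree with states s = 1, ..., S, probabilities p_s and return vectors r_s matches the first two moments and excludes arbitrage when

    \[
    \sum_{s=1}^{S} p_s = 1, \qquad
    \sum_{s=1}^{S} p_s\, r_s = \mu, \qquad
    \sum_{s=1}^{S} p_s\,(r_s - \mu)(r_s - \mu)^{\top} = \Sigma ,
    \]

and there exist strictly positive weights $q_s$ with $\sum_{s=1}^{S} q_s\,(1 + r_s) = (1 + r_f)\,\mathbf{1}$, i.e. a risk-neutral measure for the riskless rate $r_f$.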
Keywords
No-Arbitrage Bounds, Scenario Generation, Financial Optimization
Connected Cars, Machine-to-Machine
Environments, and Distributed Data Mining
Hillol Kargupta
1 Agnik
2 University of Maryland, Baltimore County, Computer Science & Electrical Engineering Department, USA
Abstract. Modern vehicles are embedded with a variety of sensors monitoring different functional components of the car and the driver behavior. With vehicles getting connected over wide-area wireless networks, much of this vehicle diagnostic data along with location and accelerometer information is now accessible to a wider audience through wireless aftermarket devices. These data offer a rich source of information about the vehicle and driver performance. Once this is combined
with other contextual data about the car, environment, location, and the driver, it can
offer exciting possibilities. Distributed data mining technology powered by onboard
analysis of data is changing the face of such vehicle telematics applications for the
consumer market, insurance industry, car repair chains and car OEMs. This talk will
offer an overview of the market, emerging product-types, and identify some of the
core technical challenges. It will describe how advanced data analysis has helped create new innovative products and made them commercially successful. The talk will offer a perspective on the algorithmic issues and describe their practical significance. It will end with remarks on future directions of the field of Machine-to-Machine (M2M) sensor networks and how the next generation of researchers can
play an important role in shaping that.
On the value of incorporating sequential
information into predictive analytics
classification models for analytical CRM
Dirk Van den Poel1
Ghent University, Department of Marketing, Tweekerkenstraat 2, 9000 Ghent,
Belgium [email protected]
Abstract. This keynote talk gives an overview of different methods to incorporate
sequential information into classification models for predictive analytics in marketing. More specifically, we zoom in on SAM (sequence alignment methods), Markov,
MTD, MTDg and Markov for Discrimination, and survival analysis.
It has been shown time and again that sequential data adds value to predictive
models in marketing. We discuss applications of these techniques in financial services, fast-moving consumer goods (FMCG), and home appliances. Sequential data
captures two aspects: 1. Order, 2. Timing. We show that sequential information is
useful for cross-sell modeling (PRINZIE et al. 2006b, 2007) as well as customer
churn modeling (MIGUEIS et al. 2012a, 2012b; PRINZIE et al. 2006a) in analytical
customer relationship management.
References
MIGUEIS V.L., VAN DEN POEL D., CAMANHO A.S., CUNHA J. F. (2012a): Predicting partial
customer churn: On the value of the purchasing sequence. under review.
MIGUEIS V.L., VAN DEN POEL D., CAMANHO A.S., CUNHA J. F. (2012b), Modeling partial
customer churn: On the value of first product-category purchase sequences. under review.
PRINZIE A. and VAN DEN POEL D. (2006a): Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM. Decision Support
Systems, 42 (2), 508-526.
PRINZIE A. and VAN DEN POEL D. (2006b): Investigating Purchasing Patterns for Financial Services using Markov, MTD and MTDg Models. European Journal of Operational Research,
170 (3), 710-734.
PRINZIE A. and VAN DEN POEL D. (2007): Predicting home-appliance acquisition sequences:
Markov/MTD/MTDg and survival analysis for modeling sequential information in NPTB
models. Decision Support Systems, 44 (1), 28-45.
Keywords
PREDICTIVE ANALYTICS, SEQUENCE ANALYSIS, ANALYTICAL CUSTOMER
RELATIONSHIP MANAGEMENT
Autonomous Robotics: Defining Instincts and
Learning Systems of Values
Michèle Sebag
University Paris-Sud, CNRS, France [email protected]
Abstract. Reinforcement learning aims at finding a good action policy, interacting
with the environment in such a way that the agent (the robot) optimizes its cumulative reward over time. Where does the reward come from? In the robotics simulation context, the ground truth is available and the designer can use it to steer the robot's learning toward the desired goals, through an appropriate reward function.
When reinforcement learning takes place on the robot, the ground truth is no
longer available. A first issue then becomes to design an intrinsic reward function,
or ”instinct”, providing the robot with internal incentives to act and explore its environment, visiting all states reachable on a given time budget. A second issue is to
provide the robot with ”values”, indicating that not all reachable states are equal,
and gradually steering the exploration toward the most promising behaviors.
Regarding the former issue, an intrinsic reward function will be discussed: viewing the robot as an information machine, a natural motivation is then to maximize the quantity of information in its sensory data stream.
Regarding the latter issue, preference-based approaches can be used. A first possibility is to ask the designer to rank the behaviors demonstrated by the robot, thus
enabling the robot to learn a policy return estimate. Iteratively, the robot builds a
new and expectedly better controller during an active reinforcement learning phase.
It thereafter demonstrates this controller to the designer, and uses the designer’s
feedback (it’s better / it’s worse) to update the policy return estimate.
Another possibility is to design ad hoc experiments to learn a preference-based
value function directly on the state space. For instance, the designer can set the
robot in a target position, and exploit the fact that the robot situation almost surely
deteriorates along time when using naive controllers.
Keywords
reinforcement learning, preference learning, active learning, robotics
Stream Data Mining and Anytime Algorithms
Thomas Seidl
Department of Computer Science 9 (Data Management and Data Exploration)
RWTH Aachen University, 52056 Aachen,
[email protected]
Abstract. Sensors pervade all areas of personal, environmental and industrial domains, and nearly all applications in engineering, telecommunication, business, and
life sciences produce tremendously increasing amounts of data. Though the availability of storage space grows at decreasing prices, many of the data require immediate analysis as they cannot be stored for reasons of their huge size or the fast
reaction they require. In contrast to static data mining algorithms, stream data mining techniques follow the data and, as an additional challenge, the evolution of their
concepts. In contrast to real-time algorithms which strictly obey fixed time budgets,
anytime algorithms are designed to exploit the available time between the arrival of
objects in a stream even for varying stream rates. Recently, various anytime classification techniques as well as anytime clustering algorithms have been proposed
which were integrated into the MOA framework.
References
KRANEN P., ASSENT I., BALDAUF C., SEIDL T. (2011): The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining. Knowl Inf Syst, 29(2), 249–272.
KRANEN P., KREMER H., JANSEN T., SEIDL T., BIFET A., HOLMES G., PFAHRINGER
B., READ J. (2012): Stream Data Mining using the MOA Framework. (Demo) DASFAA,
Springer, 309–313.
KRANEN P., SEIDL T. (2009): Harnessing the Strengths of Anytime Algorithms for Constant
Data Streams. Data Min Knowl Disc, 19(2), 245–260.
SEIDL T., ASSENT I., KRANEN P., KRIEGER R., HERRMANN J. (2009): Indexing Density
Models for Incremental Learning and Anytime Classification on Data Streams. EDBT/ICDT,
311–322.
Keywords
Data Mining, Stream Data Analysis, Clustering, Anytime Algorithms
Knowledge Discovery in Shopping Path Data
Katsutoshi Yada
Kansai University, Japan [email protected]
Abstract. The development of Radio Frequency Identification (RFID) has enabled
detailed tracking and electronic recording of customer positions and movements in
stores. We give the term Shopping Path Data to time-series data on customer movement paths in stores obtained in this way. This talk describes a model using
Shopping Path Data, and explains findings which are useful for in-store marketing
as a result of analysis using store experiment data in Japan. Shopping Path Data
provides us with new knowledge about customer movement in stores.
Keywords
Shopping Path Data, RFID, Customer Movement, Time-Series Data, Knowledge
Discovery
Part II
Invited Session: Ensemble methods in
clustering and classification
Diversity Based Weighting to Improve the
Performance of Classifier Ensembles
Werner Adler1, Zardad Khan2, Sergej Potapov1, and Berthold Lausen2
1 University Erlangen-Nuremberg, Germany {werner.adler,sergej.potapov}@imbe.med.uni-erlangen.de
2 University of Essex, United Kingdom {zkhan,blausen}@essex.ac.uk
Abstract. The performance of bootstrap aggregated classifier ensembles, e.g. bagged classification trees (Breiman, 1996) or random forests (Breiman, 2001), depends on the diversity of the base classifiers constituting the ensembles. Adler et al. (2011) proposed a modified approach to draw the bootstrap samples for the base classifiers in a repeated measurements setup. The bootstrap samples are drawn on the subject rather than the observation level, i.e. from a data set consisting of several observations from several subjects - as is the case in the repeated measurements setup - subjects are randomly selected into the bootstrap sample, while the subjects not selected make up the out-of-bag sample. Compared to the traditional approach,
where observations are drawn for the bootstrap sample irrespective of the subjects,
this leads to more diverse base classifiers and hence to an improved classification
performance of the ensemble.
In addition to indirectly increasing the diversity of the ensemble by modified
bootstrap strategies, we examine the effect of actively weighting the single base classifiers based on their similarity or dissimilarity to each other, as calculated by proposed similarity measures (examined e.g. by Tang et al., 2006). We report and
discuss the results obtained using simulated data as well as a clinical example data
set.
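The subject-level resampling can be sketched in R as follows (a minimal illustration, not the authors' implementation; the data frame and the column name are hypothetical):

    # Draw a bootstrap sample on the subject level: whole subjects are
    # sampled with replacement, all their observations enter the in-bag
    # set, and the subjects never drawn form the out-of-bag set.
    subject_bootstrap <- function(data, id = "subject") {
      ids <- unique(data[[id]])
      drawn <- sample(ids, length(ids), replace = TRUE)
      inbag <- do.call(rbind, lapply(drawn, function(s) data[data[[id]] == s, ]))
      oob <- data[!(data[[id]] %in% drawn), ]
      list(inbag = inbag, oob = oob)
    }

Each base classifier of the ensemble would then be trained on one such in-bag set and evaluated on the corresponding out-of-bag subjects.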
References
ADLER, W., POTAPOV, S., and LAUSEN, B. (2011): Classification of repeated measurements data using tree-based ensemble methods. Computational Statistics, 26(2), 355-369.
BREIMAN, L. (1996): Bagging Predictors. Machine Learning, 24(2), 123-140.
BREIMAN, L. (2001): Random forests. Machine Learning, 45, 5-32.
TANG, E.K., SUGANTHAN, P.N., YAO, X. (2006): An analysis of diversity measures. Machine Learning, 65, 247-271.
Keywords
Bootstrap, Classifier Ensembles, Diversity
Tailoring Componentwise Boosting for
Prediction with a Huge Number of Molecular
Measurements
Harald Binder
Institut für Medizinische Biometrie, Epidemiologie und Informatik,
Universitätsmedizin der Johannes-Gutenberg-Universität Mainz, 55101 Mainz,
Germany [email protected]
Abstract. When seeking prognostic information for patients or when attempting
classification, modern technologies provide a huge amount of molecular measurements as a starting point. For example, there may be more than one million single
nucleotide polymorphisms (SNPs) that need to be simultaneously considered with
respect to a clinical endpoint or class membership. Sparse multivariable regression
techniques have recently become available for automatically identifying molecular
signatures that comprise relatively few covariates and provide reasonable prediction
performance. To illustrate how such approaches can be adapted to the specific
features of molecular measurements, we propose different variants of a componentwise likelihood-based boosting approach for SNP data. The latter links SNP measurements to a class membership or a time-to-event endpoint by a regression model
that is built up in a large number of steps. The variants allow for strategic choices in
dealing with SNPs that differ in variance due to their variation in minor allele frequencies. In addition, we propose a heuristic that allows computationally efficient
handling of millions of covariates. The resulting models are judged according to prediction performance and signature stability in resampling data sets. By considering
these different aspects, a more general strategy is outlined for linking a huge number of molecular measurements to class membership or a time-to-event endpoint by
means of componentwise likelihood-based boosting.
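As an illustration of the componentwise principle, here is a sketch of componentwise least-squares boosting in R (a simplified stand-in for the likelihood-based variant discussed above, not the proposed method itself; X and y are hypothetical, with X assumed centered):

    # Componentwise L2-boosting: in each of a large number of steps, fit
    # every single covariate to the current residuals and update only the
    # best-fitting component by a small step size nu. Without
    # standardization the selection implicitly favors high-variance
    # covariates - the issue the variants above address for SNPs.
    cw_boost <- function(X, y, steps = 500, nu = 0.1) {
      beta <- numeric(ncol(X))
      f <- rep(mean(y), length(y))
      for (m in seq_len(steps)) {
        r <- y - f
        b <- colSums(X * r) / colSums(X^2)   # univariate LS coefficients
        j <- which.max(b^2 * colSums(X^2))   # largest reduction in RSS
        beta[j] <- beta[j] + nu * b[j]
        f <- f + nu * b[j] * X[, j]
      }
      beta
    }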
Keywords
PREDICTION, SNP, VARIABLE SELECTION, BOOSTING
An AUC-based Permutation Variable
Importance Measure for Random Forests
for Unbalanced Data
Silke Janitza1∗ and Anne-Laure Boulesteix2
1 Department for Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377 Munich, Germany [email protected]
2 Department for Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377 Munich, Germany [email protected]
Abstract. The random forest method is a commonly used tool for classification
with high dimensional data as well as for predictor ranking. It can handle complex
data structures including correlated predictors, interactions and heterogeneity and
offers inbuilt variable importance measures for the ranking of important predictors.
However, the classification performance of random forests is suboptimal in the case of extremely unbalanced data, i.e. data where response class sizes differ considerably. In
this case it tends to almost always predict the majority class, yielding a minimal
error rate. The standard random forest permutation variable importance measure
which is based on the error rate is directly affected by this problem and loses its
ability to discriminate between important and unimportant predictors in the case of
extreme class unbalance. This effect is more pronounced for small effects and small
sample sizes.
The area under the curve (AUC) is a promising alternative to the error rate for
unbalanced classes as it puts the same weight on both classes. A novel permutation variable importance measure in which the error rate is replaced by the AUC is
therefore a promising alternative for unbalanced data settings. It can be shown in
simulations that this measure outperforms the error-rate-based permutation variable
importance measure for strongly unbalanced classes.
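The idea can be sketched in R as follows (a schematic illustration only: for brevity the permutation is done on a hold-out set rather than on the out-of-bag observations, and the data and variable names are hypothetical):

    library(randomForest)

    # Rank-based AUC (Mann-Whitney); y: 0/1 labels, s: predicted scores.
    auc <- function(y, s) {
      r <- rank(s)
      (sum(r[y == 1]) - sum(y) * (sum(y) + 1) / 2) / (sum(y) * sum(1 - y))
    }

    # AUC-based permutation importance of variable v: decrease in AUC
    # after randomly permuting v in the evaluation data.
    auc_vim <- function(rf, test, v, target = "y") {
      y  <- as.numeric(test[[target]] == levels(test[[target]])[2])
      p0 <- predict(rf, test, type = "prob")[, 2]
      perm <- test
      perm[[v]] <- sample(perm[[v]])
      p1 <- predict(rf, perm, type = "prob")[, 2]
      auc(y, p0) - auc(y, p1)
    }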
References
BLAGUS, R. and LUSA, L. (2010): Class Prediction for high-dimensional Class-Imbalanced Data.
BMC Bioinformatics, 11, 523.
LIN, W.J. and CHEN, J. (2012): Class-imbalanced Classifiers for high-dimensional Data. Briefings
in Bioinformatics.
Keywords
RANDOM FOREST, VARIABLE IMPORTANCE MEASURE, AREA UNDER
THE CURVE, FEATURE SELECTION, UNBALANCED DATA, CLASS IMBALANCE
∗ First author SJ is a student
Probability Machines: Estimating individual
probabilities using machine learning methods
Andreas Ziegler and Jochen Kruppa
Universität zu Lübeck, Germany
{ziegler,kruppa}@imbs.uni-luebeck.de
Abstract. Machine learning (ML) is increasingly used for data mining in biomedicine,
credit scoring, weather forecasting and other areas of application. Recent work has
shown that machine learning can also be used for probability estimation by embedding the probability estimation problem in nonparametric regression estimation. As a result, the corresponding probability machines directly inherit the properties of nonparametric regression machines, such as consistency and convergence rate. Their advantage over standard parametric statistical approaches, such as logistic regression, is that probability machines do not require a correct specification of the functional relationship between the dependent variables and the independent variables.
Instead, these methods provide robust nonparametric modeling of the regression function with minimal assumptions about the form of the relationships. Probability
machines directly apply to assessing the probability of outcomes of interest based
on different characteristics of individuals. In therapeutic observational studies, they
can also be used for computing propensity scores for adjustment. They easily extend
to dependent variables with multiple categories. In this contribution we first embed
the probability estimation problem in nonparametric regression estimation. Next,
we explore some consistent probability machines, such as random forest, k-nearest
neighbors, and bagged nearest neighbors for the purpose of probability estimation.
We show how probabilities can be estimated with probability machines using standard software. Finally, we illustrate the approach using data from the literature as
well as from our own applications.
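A minimal sketch of the core idea in R (our illustration with simulated data, not the authors' code): treating the 0/1 class indicator as a regression target makes a random forest estimate P(Y = 1 | X) directly.

    library(randomForest)

    # Probability machine: nonparametric regression on the 0/1 indicator.
    set.seed(1)
    n  <- 500
    x1 <- rnorm(n); x2 <- rnorm(n)
    p  <- plogis(x1 - 0.5 * x2)      # true conditional probability
    y  <- rbinom(n, 1, p)

    rf    <- randomForest(x = data.frame(x1, x2), y = y)  # numeric y -> regression forest
    p_hat <- predict(rf)             # out-of-bag probability estimates
    summary(p_hat - p)               # estimation error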
References
MALLEY, J. D., KRUPPA, J., DASGUPTA, A., MALLEY, K. G. and ZIEGLER, A. (2012): Probability machines. Consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51, 74-81. doi: 10.3414/ME00-01-0052.
KRUPPA, J., ZIEGLER, A. and KÖNIG, I. R. (2012): Risk estimation and risk prediction using machine learning methods. Human Genetics, in press.
Keywords
Bagged nearest neighbor, Consistency, k-nearest neighbor, Nonparametric regression, Probability estimation, R software package, Random forest, Random jungle
Part III
Invited Session: Applications in Empirical
Educational Research Based on Secondary
Data
Does school choice increase ethnic segregation in
primary schools or only segregation indices?
Anna Makles1 and Kerstin Schneider2
1 University of Wuppertal, Schumpeter School of Business and Economics, Germany, [email protected]
2 University of Wuppertal, Schumpeter School of Business and Economics, Germany, [email protected]
Abstract. In 2006 the government of the federal state North Rhine-Westphalia
(NRW) in Germany passed a new school law abolishing binding primary school
catchment areas by the 2008/09 school year. Hence, parents in NRW - unlike their
counterparts in other German federal states - are now allowed to choose a primary
school independent of their place of residence. The political intention was to increase parental school choice and to foster competition between schools. The most
frequently-cited argument against free school choice, however, is the fear of increased ethnic segregation and educational disparity. In educational research school
segregation is mainly measured by the dissimilarity index given by Duncan and
Duncan (1955). But despite the popularity of this and related measures in empirical work, the indices nonetheless suffer from severe shortcomings. As segregation indices are particularly sensitive when group sizes and minority proportions are small, there is a need to account for changes in group size (e.g. Carrington and Troske (1997)). In our study, we show how biased the index of ethnic segregation in primary schools is and how easily, therefore, a negative effect of the policy reform can be detected. Finally, we calculate unbiased systematic segregation measures for different ethnic groups and show (a) that accounting for the drawbacks of the index within empirical studies is not that difficult and (b) that systematic segregation has not increased significantly since the abolishment of primary school catchment areas.
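For reference, the Duncan and Duncan (1955) index over schools i with minority counts m_i (total M) and majority counts n_i (total N) is D = (1/2) Σ_i |m_i/M − n_i/N|; a minimal R sketch with invented counts:

    # Dissimilarity index of Duncan and Duncan (1955); the instability for
    # small schools and small minority shares discussed above is easy to
    # reproduce with such toy counts.
    dissimilarity <- function(minority, majority) {
      0.5 * sum(abs(minority / sum(minority) - majority / sum(majority)))
    }
    dissimilarity(minority = c(12, 3, 0, 25), majority = c(88, 97, 40, 75))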
References
CARRINGTON, W. and TROSKE, K. (1997): On measuring segregation in samples with small
units. Journal of Business & Economic Statistics, 15, 402–409.
DUNCAN, O. and DUNCAN, B. (1955): A methodological analysis of segregation indexes. American Sociological Review, 20, 210–217.
Keywords
SEGREGATION, DISSIMILARITY, SCHOOL CHOICE, SCHOOL CATCHMENT
AREAS
Applications in Empirical Educational Research
Based on Secondary Data
Alexandra Schwarz
German Institute for International Educational Research, Schlossstr. 29, D-60486
Frankfurt am Main, Germany, [email protected]
Abstract. In general, empirical educational research is an area concerned with the
analysis and the evaluation of conditions and requirements, processes, and outcomes
of education. Typical problems in this area involve the assessment of individual student achievement and progress and the evaluation of professional practice, programs
and policies. Hence, empirical educational research is an interdisciplinary research
domain, bringing together theories and methods from education, psychology, sociology and economics.
Whereas evaluating effects of educational programs often requires primary surveys conducted in experimental designs, the analysis of national and regional data
bases is of great importance for investigating governance aspects and policy issues
of education and education systems. Especially data from official and semi-official
statistics offer the opportunity of evaluating conditions, processes and outcomes
of education on a much broader basis. Such data bases have become increasingly
available in recent years, but their usage for the work on scientific issues still needs
to be improved. Among other things, this can be attributed to the fact that dealing
with administrative and other secondary data requires special methods and a sound
methodology, e. g. techniques of statistical inference, weighting, and data fusion.
In this session, we consider methodological questions of analyzing secondary
data in education contexts as well as empirical papers dealing with the evaluation of
specific educational or organizational policies, programs or systems.
Keywords
EDUCATION, SECONDARY DATA, POLICY ANALYSIS
Part IV
Invited Session: Dynamic Cluster Analysis
- Theory and Practise
Old and new dynamic clustering methods
Hans-Hermann Bock
Institute of Statistics, RWTH Aachen University, [email protected]
Abstract. ’Clustering methods’ deal with the grouping of objects into (typically
disjoint) homogeneous classes on the basis of data that are recorded for all objects
and that characterize their mutual similarities or dissimilarities. Many clustering approaches try to attain an 'optimal' classification $C^*$ by minimizing a suitable clustering criterion $g(C)$ among all feasible groupings $C$. In many situations there is a two- or multi-variable criterion $G(C, \theta)$ with a suitable parameter vector $\theta$ such that $g(C) = \min_\theta G(C, \theta)$. Then the classical relaxation method from mathematics can be used for finding or approximating an optimum configuration just by iteratively minimizing $G(C, \theta)$ w.r.t. $C$ and $\theta$ in turn, thereby producing a dynamically varying sequence of steadily improving classifications. This algorithm is termed k-means method, dynamic clustering, iterated minimum distance method, etc. The classical case of least-squares clustering $G(C, \theta) = \sum_{i=1}^{k} \sum_{j \in C_i} \|x_j - \theta_i\|^2 \to \min$
has been generalized in many ways, e.g., in the framework of probabilistic clustering models, fuzzy clustering, subspace clustering, distance-based criteria (medoid
method), multimode clustering, entropy clustering, etc. The paper provides a survey on the resulting dynamic clustering approaches and also discusses briefly some
typical problems that are encountered with these methods: convergence, local optima, bias when estimating class-specific model parameters, choice of the number
of classes, etc.
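The relaxation scheme for the least-squares criterion is easily sketched in R (a plain k-means loop; an illustration only, empty classes are not handled):

    # Alternating minimization of G(C, theta): for fixed theta assign each
    # object to its nearest class center (optimal C), for fixed C set each
    # theta_i to the class mean (optimal theta); iterate until C is stable.
    dyn_cluster <- function(X, k, iter = 100) {
      theta <- X[sample(nrow(X), k), , drop = FALSE]
      C <- integer(nrow(X))
      for (it in seq_len(iter)) {
        d <- sapply(seq_len(k), function(i) colSums((t(X) - theta[i, ])^2))
        C_new <- max.col(-d)                 # nearest center per object
        if (all(C_new == C)) break
        C <- C_new
        theta <- t(sapply(seq_len(k), function(i)
          colMeans(X[C == i, , drop = FALSE])))
      }
      list(classes = C, centers = theta)
    }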
References
Bezdek, J.C. (1981): Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York.
Bock, H.-H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen.
Bock, H.-H. (2007): Clustering methods: a history of k-means algorithms. In: P. Brito, P. Bertrand,
G. Cucumel, F. de Carvalho (eds.): Selected contributions in data analysis and classification.
Springer, Heidelberg, 2007, 161-172.
Bock, H.-H. (2008): Origins and extensions of the k-means algorithm in cluster analysis. Journ@l
Electronique d’Histoire des Probabilités et de la Statistique. Numéro spécial ’Contributions à l’histoire
de l’analyse des données’ de la revue Electronic Journ@l for History of Probability and Statistics
(JEHPS), vol. 4 (2008), no. 2, 18pp. www.emis.de/journals/JEHPS/decembre2008.html
Dalenius T. (1950): The problem of optimum stratification I. Skandinavisk Aktuarietidskrift, 203-213.
Forgey, E.W. (1965): Cluster analysis of multivariate data: efficiency versus interpretability of
classifications. Biometric Society Meeting, Riverside, California, 1965. Abstract in Biometrics
21 (1965), 768.
Diday, E. (1971): Une nouvelle méthode de classification automatique et reconnaissance des
formes: la méthode des nuées dynamiques. Revue de Statistique Appliquée XIX (2), 1970, 19-33.
Diday, E., Schroeder, A. (1976): A new approach in mixed distribution detection. R.A.I.R.O. Recherche
Opérationnelle 10 (6) 75-106.
Steinhaus, H. (1956): Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des
Sciences, Classe III, vol. IV, no. 12, 801-804.
Steinley, D. (2006): K-means clustering: a half-century synthesis. British J. on Mathematical and Statistical Psychology 59, 1-34.
Machine learning approach in information
retrieval for real estate offers analysis
Paweł Lula
Cracow University of Economics, Poland [email protected]
Abstract. Information retrieval is a process which allows fundamental facts to be extracted automatically from text documents. This technique is widely used for processing web pages, forums and blogs. The main challenge for automatic text mining is the identification and extraction of crucial pieces of information. The rules for information extraction can be defined manually or can be built by a machine learning process.
The evaluation of different approaches used for information retrieval is the main
goal of the presentation. The following solutions are discussed:
• methods based on the vector space model,
• automatically defined regular expressions,
• methods based on the domain model.
All methods are learnt and evaluated on a set of real estate offers written in Polish.
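The regular-expression route, for instance, can be illustrated in R on a hypothetical offer string (the pattern here is hand-written; in the machine learning variant such patterns would be induced from annotated examples):

    # Extract floor area and price from a made-up Polish real estate offer.
    offer <- "Mieszkanie 3-pokojowe, 54 m2, Krakow, cena: 450000 zl"
    area  <- regmatches(offer, regexpr("[0-9]+(?= ?m2)", offer, perl = TRUE))
    price <- regmatches(offer, regexpr("[0-9]+(?= ?zl)", offer, perl = TRUE))
    c(area = area, price = price)    # "54", "450000"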
Keywords
text mining, information extraction, machine learning methods
Dynamical Clustering with Self Learning Neural
Networks
Kamila Migdał Najman and Krzysztof Najman
University of Gdansk, Poland {kmn,krzysztof.najman}@wzr.pl
Abstract. Together with constantly expanding IT knowledge, the amount of data collected in different database systems is increasing. One of the characteristics of modern databases is their increasing dynamism: the number of registered units and the group structure change dynamically.
In order to detect fast changes in the number and structure of clusters effectively, appropriate methods of cluster analysis are needed. The article presents the results of simulation research into the possibility of using self-learning neural networks for clustering data with a dynamically changing group structure.
References
FRITZKE, B., Growing cell structures - a self-organizing network for unsupervised and supervised learning, Neural Networks, 1994, vol. 7, no. 9, pages 1441-1460.
NAJMAN, K., Dynamical clustering with Growing Neural Gas Networks, Statistical Review, 2011, vol. 3-4, pages 231-242 (in Polish).
QIN, A. K., SUGANTHAN, P. N., Robust Growing Neural Gas algorithm with application in cluster analysis, Neural Networks, 2004, vol. 17, no. 8-9, pages 1135-1148.
PRUDENT, Y., ENNAJI, A., An incremental Growing Neural Gas learns topologies, Proceedings of the International Joint Conference on Neural Networks, 2005, pages 1211-1216.
Keywords
dynamical clustering, self-learning neural networks, classification
Classification of Three-Way Clustering
Problems
Andrzej Sokolowski
Cracow University of Economics [email protected]
Abstract. A special notation for clustering problems is proposed, in which the classification subject and the classification space are defined. We consider a set of objects Y, a set of variables Z and a set of time units T. Three types of clustering problems can be defined: simple, double and complex. The first one has one set as the classification objects, the second one has one set as the classification space, and in the complex one we are trying to find which objects, characterized by which variables and at which times, can be considered as homogeneous. In the paper special attention is paid to the double clustering problems, for which two strategies are proposed.
Studies in Lower Secondary Educational Level
Outcomes Changes in Poland Using
Correspondence Analysis
Agnieszka Stanimir
Wroclaw University of Economics, Poland
[email protected]
Abstract. The purpose of this paper is to indicate the possibility of using correspondence analysis to study changes in the level of learning outcomes at the lower secondary educational level. Assessment of student knowledge and skills is conducted on the
basis of exam results from the years 2003-2010. Results were collected for all students of the two Polish regions belonging to the same Regional Examination Board.
The analysis of knowledge and skills of a young person is an extremely important
task in the educational process. The use of different variables in the study provides
extensive analysis of the problem and allows for formulation of recommendations
for the development of the education system. In the analysis additional factors, such as gender, commune type, and competence and skill areas were taken into account. These factors are nominal; it is therefore natural to apply correspondence analysis to describe associations between the categories of these variables.
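A generic correspondence analysis of this kind can be run in R, e.g. with MASS::corresp (the counts below are invented for illustration):

    library(MASS)

    # Correspondence analysis of a synthetic table of exam score bands by
    # commune type; biplot of the first two dimensions.
    tab <- matrix(c(30, 55, 15,
                    45, 40, 15,
                    60, 30, 10),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(commune = c("urban", "mixed", "rural"),
                                  band = c("high", "medium", "low")))
    biplot(corresp(tab, nf = 2))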
References
BLASIUS, J. (2001): Korrespondenzanalyse. München: Oldenbourg Verlag.
The Education and Assessment System in Poland, Central Examination Board, 1999.
GREENACRE, M. J. (1984): Theory and applications of correspondence analysis. London: Academic Press.
Keywords
multiway correspondence analysis, knowledge and skills of young people, analysis
of changes over time, regional comparisons
Solving Product Line Design Optimization
Problems using Stochastic Programming
Sascha Voekler1 and Daniel Baier2
1 Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany [email protected]
2 Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany [email protected]
Abstract. In this paper, we try to apply stochastic programming methods to product line design optimization problems. Because of the estimated part-worths of the
product attributes in conjoint analysis, there is a need to deal with the uncertainty
caused by the underlying statistical data (Kall/Mayer 2011). Inspired by the work of
Georg B. Dantzig (Dantzig 1955), we developed an approach to use the methods of
stochastic programming for product line design issues. Therefore, four different approaches will be compared by using notional data of a yogurt market from Gaul and
Baier (2009). Stochastic programming methods like single- or two-stage programs are applied to the approach of Gaul, Aust and Baier (Gaul et al. 1995) and will be compared to the original approach, to Green and Krieger (Green/Krieger 1985) and to Kohli and Sukumar (Kohli/Sukumar 1990). Besides the theoretical work, these methods will
be realized by a self-written code with the help of the statistical software package
R.
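The standard two-stage form (cf. Kall/Mayer 2011) into which such problems can be cast reads, in generic notation (not the authors' specific model),

    \[
    \min_{x \in X} \; c^{\top} x + \mathbb{E}_{\xi}\big[ Q(x, \xi) \big], \qquad
    Q(x, \xi) = \min_{y \ge 0} \big\{\, q(\xi)^{\top} y \;:\; W y = h(\xi) - T(\xi)\, x \,\big\},
    \]

where the first-stage decision x (here, the product line) must be fixed before the uncertain data ξ (here, the estimated part-worths) are realized, and the recourse decision y is taken afterwards.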
References
Dantzig, G.B. (1955): Linear Programming Under Uncertainty. Management Science, 1(3/4), 197-206.
Gaul, W., Aust, E., Baier, D. (1995): Gewinnorientierte Produktliniengestaltung unter
Berücksichtigung des Kundennutzens. Zeitschrift für Betriebswirtschaftslehre, 65, 835-855.
Gaul, W., Baier, D. (2009): Simulations- und Optimierungsrechnungen auf Basis der Conjointanalyse. Conjointanalyse - Methoden-Anwendungen-Praxisbeispiele, D. Baier, M. Brusch (Hrsg.),
Berlin, Heidelberg, Springer 2009, 163–182.
Green, P.E., Krieger, A.M. (1985): Models and Heuristics for Product Line Selection. Marketing
Science, 4(1), 1-19.
Kall, P., Mayer, J. (2011): Linear Stochastic Programming - Models, Theory, and Computation.
International Series in Operations Research and Management Science, Springer New York,
Dordrecht, Heidelberg, London, 2011, 156.
Kohli, R., Sukumar, R. (1990): Heuristics for Product-line Design Using Conjoint Analysis. Management Science, 36(12), 1464-1478.
Keywords
Conjoint Analysis, Product Line Design Optimization, Stochastic Programming.
Part V
Statistics and Data Analysis
An exact Newton’s method for ML estimation in
a penalized Gaussian mixture model
Grigory Alexandrovich
Philipps-Universität Marburg
Abstract. We discuss the problem of computing the MLE of the parameters of a
multivariate Gaussian mixture.
The most widely used method for solving this problem is the EM-Algorithm. Although this method converges globally under some general assumptions, does not
require much storage and is simple to implement, it yields only a linear convergence
rate.
We introduce an alternative - an exact Newton's method, which converges locally quadratically and yields an estimate of the Fisher information matrix. To this end we discuss a parametrization of the mixture density which assures the adherence of several restrictions on the parameters during the Newton iterations. For the parametrization of the covariance matrices we use the Cholesky decomposition of the inverse: $\Sigma^{-1} = LL^\top$. We also discuss some aspects of computing the analytical
derivatives of the log-likelihood function.
Further we consider a penalization of the log-likelihood to avoid ”bad” solutions,
as suggested by Chen and Tan (Inference for multivariate normal mixtures, Journal
of Multivariate Analysis 100 (2009) 1367-1383).
Finally we consider some numerical experiments where we compare the EM-Algorithm with our implementation of the Newton method and discuss the possibility of computing the MLE with Newton's method in other elliptical mixture models.
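The parametrization idea can be sketched in R (a minimal illustration, not the author's code): any real parameter vector yields a valid covariance matrix, so the Newton iterations stay unconstrained.

    # Covariance matrix from the Cholesky factor of its inverse,
    # Sigma^{-1} = L L', with a log-parametrized diagonal so that every
    # real vector of length p*(p+1)/2 maps to a positive definite Sigma.
    par_to_sigma <- function(par, p) {
      L <- matrix(0, p, p)
      L[lower.tri(L, diag = TRUE)] <- par
      diag(L) <- exp(diag(L))      # strictly positive diagonal
      chol2inv(t(L))               # (L L')^{-1} = Sigma
    }
    par_to_sigma(c(0.2, -0.5, 0.1), p = 2)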
Which District of Dortmund is the Most
Dangerous?
Tim Beige1, Thomas Terhorst2, Claus Weihs1 and Holger Wormer2
1 Chair of Computational Statistics, TU Dortmund University [email protected], [email protected]
2 Institute of Journalism, TU Dortmund University [email protected], [email protected]
Abstract. In this paper the districts of Dortmund, a big German city, are ranked according to the risk of being involved in an offence. In order to measure this
risk the offences reported by police press reports in the year 2011 (Presseportal
(2011)) were analyzed and weighted by their maximum penalty provided by the
German criminal code. The resulting danger index was used to rank the districts.
Moreover, the socio-demographic influences on the different offences are studied.
The most probable influences appear to be traffic density (Sierau (2006)) and the
share of older people. Also, the inner city parts appear to be much more dangerous
than the outskirts of the city of Dortmund. However, can these results be trusted?
The head of the press office of Dortmund’s police argues that offences might not be
uniformly reported by the districts to his office, and that small offences like pocket
picking are never reported in police press reports.
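The weighting scheme behind such a danger index can be sketched in R (all counts and penalty weights below are invented for illustration):

    # Danger index: weight offence counts per district by the maximum
    # penalty of each offence type and rank the aggregated scores.
    weights <- c(theft = 5, assault = 10, robbery = 15)   # invented weights
    counts  <- rbind(Innenstadt = c(theft = 40, assault = 12, robbery = 5),
                     Outskirts  = c(theft = 22, assault = 7,  robbery = 2))
    index <- counts %*% weights[colnames(counts)]
    rank(-index)                                          # 1 = most dangerous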
References
PRESSEPORTAL: http://www.presseportal.de/polizeipresse/pm/4971/polizei-dortmund?start=0
SIERAU, U. (2006): Dortmunderinnen und Dortmunder unterwegs - Ergebnisse einer Befragung von Dortmunder Haushalten zu Mobilität und Mobilitätsverhalten, Ergebnisbericht,
Dortmund-Agentur/Graphischer Betrieb Dortmund, Stadt Dortmund, 09/2006.
Keywords
risk level, danger index, regression, variable selection
Empirically Measuring the Effect of Violating
the Independence Assumption in Behavioural
Scoring
Miguel Biron∗1 and Cristián Bravo2
1 Department of Industrial Engineering, Universidad de Chile. República 701, 8370439 Santiago, Chile. [email protected]
2 Finance Center, Department of Industrial Engineering, Universidad de Chile. Domeyko 2369, 8370397 Santiago, Chile. [email protected]
Abstract. Behavioural scorings are a well-known statistical technique used by financial institutions to predict whether new clients are in danger of not repaying a loan in the future. The aim of this work is to assess the importance of the independence assumption in logistic regression based behavioural scorings. The issue has been documented in the literature [1], but no assessment has been made of its real impact. We develop four sampling methods that control which observations associated with each client are to be included in the training set, avoiding a functional dependence between observations of the same client. We then calibrate regressions with variable selection on the samples created by each method, plus one using all the data in the training set (biased base method), and validate the models on an independent data set. We find that the regression built using all the observations shows the highest area under the ROC curve and Kolmogorov-Smirnov statistics, while the regression that uses the fewest observations shows the lowest performance and the highest variance of these indicators. Nevertheless, method four shows almost the same performance as the base method using fewer variables. We conclude that violating the independence assumption does not impact strongly on results and, furthermore, trying to control it by using less data can harm the performance of calibrated models.
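The strictest of such sampling schemes (keeping a single random observation per client) can be sketched in R as follows; the data frame and column names are hypothetical:

    # Keep one randomly chosen observation per client, so that no client
    # contributes dependent rows to the training set.
    one_per_client <- function(d, id = "client_id") {
      keep <- tapply(seq_len(nrow(d)), d[[id]],
                     function(i) i[sample(length(i), 1)])
      d[unlist(keep), ]
    }

    # Calibrate on the reduced and on the full training set and compare
    # test-set AUC / KS afterwards ('default' and 'train' are illustrative).
    fit_ind <- glm(default ~ ., data = one_per_client(train), family = binomial)
    fit_all <- glm(default ~ ., data = train, family = binomial)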
References
1.MEDEMA, L., KONING, H. R. and LENSINK, R. (2009): A practical approach to validating a
PD model. Journal of Banking & Finance, 33(4), 701–708.
Keywords
Behavioral Scoring, Sampling, Panel Logistic Regression
∗ Student author
Benchmarking classification algorithms on
high-performance computing clusters
Bernd Bischl, Julia Schiffner, Claus Weihs
Lehrstuhl fuer Computergestuetzte Statistik, Technische Universitaet Dortmund
Vogelpothsweg 87, 44227 Dortmund {bischl, schiffner,
weihs}@statistik.tu-dortmund.de
Abstract. Comparing and benchmarking classification algorithms is an important
topic in applied data analysis. Extensive and thorough studies of such a kind will
produce a considerable computational burden and are therefore best delegated to
high-performance computing clusters. This on the other hand is technically nontrivial, requires knowledge of the underlying architectures and naive approaches often
make experiments much harder to reproduce. We will demonstrate how to effectively and reproducibly perform these calculations on high-performance computing
clusters with minimal effort for the researcher. We build upon our recently developed R packages BatchJobs (Map, Reduce and Filter operations from functional
programming for clusters) and BatchExperiments (Parallelization and management
of statistical experiments). We will present benchmarking results for standard classification algorithms and study the influence of hyperparameters and pre-processing
steps on their performance.
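The basic BatchJobs workflow looks roughly as follows (simplified from the package's documented usage; argument details may differ between versions):

    library(BatchJobs)

    # Each mapped function call becomes one batch job on the cluster.
    reg <- makeRegistry(id = "benchmark")
    batchMap(reg, function(seed) {
      set.seed(seed)
      # ... resample, tune and evaluate a classifier here ...
      runif(1)                     # placeholder result
    }, seed = 1:100)
    submitJobs(reg)
    waitForJobs(reg)
    results <- loadResults(reg)    # list of the 100 job results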
References
BISCHL, B., LANG, M., MERSMANN, O., RAHNENFUEHRER, J. and WEIHS, C. (2012):
BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments. Technical report for Collaborative Research Center SFB 876, TU Dortmund.
Keywords
Classification, Benchmarking, Parallelization
Visual models for categorical data in economic
research
Justyna Brzezińska
Department of Statistics,
University of Economics in Katowice, 1 Maja 50, 40-287 Katowice
[email protected]
Abstract. This paper is concerned with the use of visualization of categorical data in qualitative data analysis [1],[2],[3]. Graphical methods for qualitative data and extensions using a variety of R packages will be presented. This paper outlines a general framework for data visualization methods. These ideas are illustrated with a variety of graphical methods for categorical data in large, multi-way contingency tables. Graphical methods are available in R in the vcd and vcdExtra libraries, including the mosaic plot, association plot, sieve plot, double-decker plot and agreement plot. These R packages include methods for the exploration of categorical data, such as fitting and graphing, plots and tests for independence, and visualization techniques for log-linear models. Some graphs, e.g. the sieve and mosaic displays, are well suited for detecting patterns of association in the process of model building; others are useful in model diagnosis and for graphical presentation and summaries. The use of log-linear analysis, as well as of visualizing categorical data in economics, will be presented in this paper.
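For instance, a shaded mosaic display and an association plot of a built-in contingency table (a generic illustration, not an example taken from the paper):

    library(vcd)

    # Mosaic display with residual-based shading for a three-way table.
    mosaic(~ Hair + Eye + Sex, data = HairEyeColor, shade = TRUE)

    # Association plot for the two-way margin Hair x Eye.
    assoc(margin.table(HairEyeColor, 1:2), shade = TRUE)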
Key words: Graphics, mosaic display, log-linear models, categorical data analysis
References
1.Friendly M., Visualizing Categorical Data, Cary, NC: SAS Institute, 2000.
2.Meyer D., Zeileis A., Hornik K., The strucplot framework: visualizing multi-way contingency
tables with vcd, Journal of Statistical Software, 17 (3), 1-48, 2006.
3.Meyer D., Zeileis A., Hornik K., VCD: Visualizing Categorical Data, R package http://CRAN.R-project.org, 2008.
Discovering Process Certification Tendencies
Mariana Carvalho, Paulo Sampaio, and Orlando Belo
Algoritmi R&D Centre, University of Minho, PORTUGAL
[email protected], [email protected],
[email protected]
Abstract. This was a case study especially designed to conceive a specifically oriented system for discovering certification tendencies, based on information about the certificates acquired by Portuguese companies during the period 2008-2010. Our main goal was to provide useful (and effective) information about what kind of certificate a company should apply for to guarantee the quality of its services and products, consequently gaining advantages over its most direct competitors. As we all know, certification is a voluntary process which, despite the time it takes, the costs, and the bureaucracy involved, is today quite crucial for the survival of any company. Certificates are pledges of a company's commitment to the quality of its services and products directly, and necessarily to the environment and to the health and security of its workers, just to name a few. Which certificates a company should apply for (and acquire) is one of the main questions facing any new company in the market. The answer is not easy and depends on several factors, such as the region where the company is located or its activity sector. A prior analysis was necessary to gain the knowledge about the state of the market that enables the best decisions to be taken. The application of data mining techniques in this case allowed us to get a clear description of the state of the certification market in Portugal for the referred period, which is very interesting, since it revealed quite particular characteristics of Portuguese business companies. Specifically, with clustering analysis we obtained very refined information that provides new companies with the necessary awareness of the state of the market, giving them an initial orientation in the competitive market. The information extracted from the several application data sets is enough to support decision making with respect to the set of certificates that best suits the needs of a company. Obviously, this set varies according to the company's competitiveness, and also depends on the number of other companies located in the same region and activity sector. In this work we will present a general overview of the certification process in Portugal, describe the main aspects of the case study - general features, data selection, and data preparation tasks -, the clustering techniques and models used, and, finally, a review of the entire set of results, their interpretation, and their impact on the certification field.
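As a rough sketch of the clustering step described above (the variables and their encoding are illustrative assumptions, not the study's actual feature set):

# Hypothetical company profiles: binary certificate indicators plus
# region and sector codes, one row per company.
set.seed(1)
companies <- data.frame(iso9001  = rbinom(200, 1, 0.6),
                        iso14001 = rbinom(200, 1, 0.3),
                        ohsas    = rbinom(200, 1, 0.2),
                        region   = sample(1:5, 200, replace = TRUE),
                        sector   = sample(1:8, 200, replace = TRUE))

# k-means on scaled features; each cluster is a certification profile.
km <- kmeans(scale(companies), centers = 4, nstart = 25)
aggregate(companies, by = list(cluster = km$cluster), FUN = mean)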
Keywords
Data Mining, Process Certification of Companies, Data Clustering, Discovery of
Patterns of Certification Processes
Geographic clustering through aggregation
control
Daher Ayale and Dhorne Thierry
Lab-STICC CNRS UMR 6285 / Université de Bretagne Sud
[email protected],[email protected]
Abstract. Actors in spatial decision-making are often led to define strategic zonings satisfying both constraints of structural homogeneity and spatial cohesion. In most cases, they use traditional clustering algorithms that do not take geographic information into account, and possibly correct the resulting excessive fragmentation with ad hoc heuristics.
Usually, in clustering algorithms, the number of clusters C is initially fixed. In this paper, it is considered that, equivalently, the number of geographically connected components (called regions) R is also initially fixed, with, of course, the constraint R ≥ C. Rather than seeking an absolute solution to this optimization problem, which is computationally difficult, the proposed method uses an algorithm to control the intra-class geographic aggregation in order to approach (and possibly achieve) the constraint set on the number of regions.
We present the proposed method in detail, and then show how the level of aggregation can be controlled by a parameter of the algorithm. Finally, we seek the parameter value that leads to the most appropriate solution, and we study the dynamic evolution of clusters and regions according to the control parameter.
The results are illustrated on a real example.
References
OLIVER, M.A. and WEBSTER, R. (1989): A Geostatistical Basis for Spatial Weighting in Multivariate Classification. Mathematical Geology, 21, 275–289.
Keywords
Geographic clustering, Connected components, Geographic Information.
Some thoughts about the “number of
clusters”-problem
Christian Hennig1
Department of Statistical Science, UCL, London WC1E 6BT, United Kingdom
[email protected]
Abstract. The problem of finding the number of clusters in a dataset is notoriously
difficult. This is at least partly due to a widely shared misconception that for any
given dataset (or at least for many of them) there is a unique “true” clustering which
“good” methodology should “estimate”. Such a true clustering, however, is rarely well defined. The view taken here is different. What a cluster is depends crucially on
the researcher’s concept of “observations belonging together”. This depends on the
given application, and it can easily be demonstrated that there are various legitimate
versions of it, which may lead to different numbers of clusters in different datasets.
Existing criteria for this task can be understood as translations of various different
cluster concepts, and can be used if they are found to agree with the concept required
in the given application. In this presentation I will discuss existing ideas about the
“true number of clusters” and the data analytic meaning of existing methods to find
the number of clusters, such as the BIC (Fraley and Raftery 2002), the average silhouette width (Kaufman and Rousseeuw 1990), the Calinski and Harabasz (1974)
index and the prediction strength (Tibshirani and Walther 2005).
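All four criteria are available in standard R packages; a minimal sketch (the package choices and the k-means base clustering are mine, not the author's):

library(mclust)    # BIC for Gaussian mixtures (Fraley and Raftery)
library(cluster)   # average silhouette width (Kaufman and Rousseeuw)
library(fpc)       # Calinski-Harabasz index, prediction strength

x <- scale(iris[, 1:4]); d <- dist(x)
Mclust(x, G = 1:8)$G                # number of components chosen by BIC
for (k in 2:8) {
  km  <- kmeans(x, centers = k, nstart = 25)
  asw <- mean(silhouette(km$cluster, d)[, 3])   # average silhouette width
  ch  <- calinhara(x, km$cluster)               # Calinski and Harabasz
  cat(k, round(asw, 3), round(ch, 1), "\n")
}
prediction.strength(x, Gmin = 2, Gmax = 8)      # Tibshirani and Walther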
References
CALINSKI, T. and HARABASZ, J. (1974): A dendrite method for cluster analysis. Communications in Statistics 3, 1–27.
FRALEY, C. and RAFTERY, A. E. (2002): How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41, 578-588.
KAUFMAN, L. and ROUSSEEUW, P. J. (1990): Finding Groups in Data: An Introduction to
Cluster Analysis. Wiley, New York.
TIBSHIRANI, R. and WALTHER, G. (2005): Cluster Validation by Prediction Strength, Journal
of Computational and Graphical Statistics, 14, 511-528.
Keywords
NUMBER OF CLUSTERS, BIC, AVERAGE SILHOUETTE WIDTH, PREDICTION STRENGTH, CALINSKI AND HARABASZ INDEX
Merging States in Hidden Markov Models
Hajo Holzmann1 and Florian Schwaiger2
1 Philipps-Universität Marburg, [email protected]
2 Philipps-Universität Marburg, [email protected]
Abstract. We analyse clustering problems in the case of dependent data. Specifically
we consider the observable part of a finite state hidden Markov model (HMM),
where the stationary distribution is a finite mixture of parametric distributions
(e.g. of multivariate normal distributions) and the hidden state process has a Markov
chain structure. Generally, for clustering data the estimates of the states can be used
as cluster assignments. In contrast to independent finite mixtures, the dependence
structure of the model plays an important role for estimating the non-observable
states of the HMM (see e.g. the Viterbi algorithm).
Baudry et al. (2010) model i.i.d. samples with finite mixtures and merge those components into clusters whose merged component distribution appears more like a cluster than the component distributions considered singly. Similarly, it is not
necessarily always the case that each state of the Markov chain corresponds to a cluster of its own. Thus, we analyse the merging of states to clusters for HMMs. In contrast to independent finite mixtures, where merging states does not affect the probabilistic structure of the model, merging states of an HMM changes this structure:
After merging, the dependence structure is influenced as the transition probability
matrix changes and the state dependent distribution is now a mixture itself. If the
dependence structure is not taken into account, it can occur that states are being
merged whose state dependent distributions imply a merging, although their transition probabilities are too distinct and hence a lot of dependence information would
be lost. Therefore, we employ an entropy based criterion which strongly involves
the dependence structure of the estimated model.
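A small base-R sketch of the bookkeeping involved when two hidden states are merged (it shows only the standard recombination of the transition matrix, not the entropy criterion proposed here):

# Merge states i and j of an HMM with transition matrix P and stationary
# distribution pi; the merged state gets a stationarity-weighted average
# of the outgoing probabilities, incoming probabilities are added.
merge_states <- function(P, pi, i, j) {
  keep <- setdiff(seq_len(nrow(P)), c(i, j))
  out  <- (pi[i] * P[i, ] + pi[j] * P[j, ]) / (pi[i] + pi[j])
  Pm   <- rbind(P[keep, , drop = FALSE], out)
  cbind(Pm[, keep, drop = FALSE], Pm[, i] + Pm[, j])
}

P  <- matrix(c(0.8, 0.1, 0.1,
               0.2, 0.6, 0.2,
               0.1, 0.3, 0.6), 3, 3, byrow = TRUE)
pi <- Re(eigen(t(P))$vectors[, 1]); pi <- pi / sum(pi)  # stationary distribution
merge_states(P, pi, 2, 3)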
References
BAUDRY, J.-P., RAFTERY, A., CELEUX, G., LO, K. and GOTTARDO R. (2010): Combining
Mixture Components for Clustering. Journal of Computational and Graphical Statistics, 19,
332–353.
Keywords
HIDDEN MARKOV MODEL, CLUSTERS, MERGING
Fuzzy Composite Index for Customer
Satisfaction Evaluation: an Application for
Public Sector Services
Bartłomiej Jefmański1 and Marcin Pełka1
1 Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]
Abstract. Customer satisfaction is a complex and latent concept whose direct measurement and analysis are impossible; it therefore requires estimation using directly observable variables. In the social sciences we deal with a number of such phenomena, and a popular approach to their analysis is the construction of composite indices. These are synthetic measures, useful in situations where a concept, given its complexity, cannot be expressed using a single indicator.
The purpose of this study is to apply the methodology of composite index construction to develop an index of customer satisfaction in one of the Polish Town Offices. To construct the index, data from a survey periodically conducted by the office were used. Since each of the attributes of service quality is assessed on a five-point ordinal scale and the items of the scale constitute linguistic values, fuzzy sets were applied in the study. Such a structure of the index made it possible to take the ambiguity and subjectivity in the opinions of the respondents into consideration. The index values for the years 2008-2010 were estimated in R.
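A toy sketch of one common construction of such an index (the triangular membership values are illustrative assumptions, not the scale actually used by the office):

# Map the five linguistic ratings to triangular fuzzy numbers (l, m, u)
# on [0, 1]; e.g. the neutral rating becomes (0.25, 0.50, 0.75).
tfn <- rbind(c(0.00, 0.00, 0.25), c(0.00, 0.25, 0.50),
             c(0.25, 0.50, 0.75), c(0.50, 0.75, 1.00),
             c(0.75, 1.00, 1.00))

set.seed(1)
ratings <- matrix(sample(1:5, 50 * 4, replace = TRUE), 50, 4)  # 4 attributes

# Average the fuzzy numbers over respondents and attributes, then
# defuzzify by the centroid (l + m + u) / 3.
agg <- colMeans(tfn[as.vector(ratings), ])
mean(agg)      # the composite satisfaction index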
References
KENETT, R.S., SALINI, S. (2012), Modern Analysis of Customer Surveys: with Applications using
R. Chichester, John Wiley & Sons, Ltd.
SMITHSON, M., VERKUILEN, J. (2006), Fuzzy set theory. Applications in the Social Sciences.
Thousand Oaks, Sage Publications, Inc.
ZIMMERMANN, H.J. (2001), Fuzzy Sets Theory and its Applications. Norwell, Kluwer Academic
Publishers.
Keywords
CUSTOMER SATISFACTION INDEX, FUZZY SETS, PUBLIC SECTOR SERVICES
On Limiting the Usage Frequency of Donor Objects in the Imputation of Missing Data with Hot-Deck Methods
Dieter William Joenssen1 and Udo Bankhofer2
1 Technische Universität Ilmenau, [email protected]
2 Technische Universität Ilmenau, [email protected]
Abstract. Hot-deck methods are special imputation methods based on imputation classes. The object that supplies the observed data for imputation is called the donor object. To ensure that missing data are replaced by the values of a similar donor object, the duplication process takes place within previously formed imputation classes. This duplication property of hot-deck imputation gives rise to the problem that, in the extreme case, a single donor object supplies all values for imputation. For this reason, some hot-deck methods limit the number of times an object may be used as a donor. This inevitably raises the question under which conditions such a limit makes sense at all. In this work, a simulation study is therefore conducted to answer this question. It shows that there are clear differences between hot-deck imputations in which the donor usage frequency is varied. Moreover, factors can be identified that speak for or against limiting the use of donor objects.
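A base-R sketch of hot-deck imputation with a cap on donor reuse (the random donor choice within classes and the cap value are illustrative):

# x: vector with NAs; class: imputation class per observation;
# limit: maximum number of times one donor may be used.
hot_deck_limited <- function(x, class, limit = 2) {
  used <- integer(length(x))                 # donor usage counter
  for (i in which(is.na(x))) {
    donors <- which(!is.na(x) & class == class[i] & used < limit)
    if (length(donors) == 0) next            # no admissible donor left
    d       <- donors[sample.int(length(donors), 1)]
    x[i]    <- x[d]                          # duplicate the donor's value
    used[d] <- used[d] + 1
  }
  x
}

set.seed(1)
hot_deck_limited(x = c(1.2, NA, 0.9, NA, 5.1, 4.8, NA, 5.0),
                 class = c(1, 1, 1, 1, 2, 2, 2, 2))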
References
ANDRIDGE, R.R. and LITTLE, R.J.A. (2010): A Review of Hot Deck Imputation for Survey Nonresponse. International Statistical Review, 78(1), 40–64.
BANKHOFER, U. (1995): Unvollständige Daten- und Distanzmatrizen in der Multivariaten Datenanalyse. Eul, Bergisch Gladbach.
KALTON, G. and KISH, L. (1981): Two Efficient Random Imputation Procedures. Proceedings of the Survey Research Methods Section 1981, 146–151.
SANDE, I. (1983): Hot-Deck Imputation Procedures. In: W. Madow, H. Nisselson, I. Olkin (Eds.): Incomplete Data in Sample Surveys, 3, Theory and Bibliographies. Academic Press, New York, 339–349.
Keywords
Hot-deck methods, missing data, imputation, simulation study
Predictive validity of tracking decisions:
Application of a new validation criterion.
Florian Klapproth1 , Sabine Krolak-Schwerdt2 , and Thomas Hörstermann3∗
1,2,3 University of Luxembourg, Route de Diekirch, L-7220 Walferdange, [email protected], [email protected], [email protected]
Abstract. Although tracking decisions are primarily based on students’ achievements, the distributions of academic competencies in secondary school strongly
overlap between school tracks. However, the correctness of tracking decisions usually is based on whether or not a student has kept the track he was initially assigned
to. To overcome the neglect of misclassified students, we proposed an alternative
validation criterion for tracking decisions. In the present study, we applied this criterion to a sample of n = 2,300 Luxembourgish 9th graders to examine the degree of misclassification due to tracking decisions. For all students, scores on academic
achievement tests were obtained at the beginning of 9th grade. The distributions of
test scores, when separated for the academic track and the vocational track, overlapped to a large degree. Based on the intersection of both distributions, we determined two competence levels. With respect to their individual test scores, we
assigned students to one of these levels. Students being assigned to the lower level
showed scores that were more likely to occur within the vocational than within the
academic track. The reverse was true for students assigned to the higher competence level. However, it turned out that about 20% of the students attended a track
that did not match their competence level. Whereas the agreement between tracking
decisions and actual tracks in 9th grade was fairly high (κ = .93), the agreement
between tracking decisions and competence levels was only moderate (κ = .56).
Keywords
TRACKING DECISIONS, VALIDATION CRITERION, MISCLASSIFICATIONS
∗ PhD student
DDα-classification of asymmetric and fat-tailed
data
Tatjana Lange1 , Karl Mosler2 , and Pavlo Mozharovskyi2,3
1 Hochschule Merseburg, Geusaer Straße, 06217 Merseburg, Germany, [email protected]
2 Universität zu Köln, Albertus-Magnus-Platz, 50923 Köln, Germany, {mosler,mozharovskyi}@statistik.uni-koeln.de
3 PhD student
Abstract. The DDα-procedure is a fast nonparametric method for supervised classification of d-dimensional objects into q ≥ 2 classes. It is based on q-dimensional
depth plots (Liu et al., 1999) and the α-procedure (Vasil’ev and Lange, 1998), which
is an efficient algorithm for discrimination in the depth space [0, 1]^q. Specifically,
we use two depth functions that are well computable in high dimensions, the zonoid
depth (Koshevoy and Mosler, 1997) and the random Tukey depth (Cuesta-Albertos
and Nieto-Reyes, 2008), and compare their performance for different simulated data
sets, in particular asymmetric elliptically and t-distributed data.
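For intuition, a base-R sketch of maximum-depth classification, using the simple Mahalanobis depth as a stand-in for the zonoid and random Tukey depths studied here:

# Mahalanobis depth of points x with respect to a training sample X.
m_depth <- function(x, X) 1 / (1 + mahalanobis(x, colMeans(X), cov(X)))

X1  <- as.matrix(iris[iris$Species == "setosa",     1:4])
X2  <- as.matrix(iris[iris$Species == "versicolor", 1:4])
new <- as.matrix(iris[c(10, 60), 1:4])

# Each object is mapped into the two-dimensional depth space (DD plot)
# and assigned to the class in which it lies deepest.
depths <- cbind(m_depth(new, X1), m_depth(new, X2))
ifelse(depths[, 1] > depths[, 2], "setosa", "versicolor")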
References
CUESTA-ALBERTOS J.A. and NIETO-REYES A. (2008): The random Tukey depth. Computational Statistics and Data Analysis, 52, 4979–4988.
KOSHEVOY G. and MOSLER K. (1997): Zonoid trimming for multivariate distributions, Annals
of Statistics, 25, 1998-2017.
LIU, R., PARELIUS, J. and SINGH, K. (1999): Multivariate analysis of the data depth: Descriptive statistics and inference. Annals of Statistics, 27, 783-858.
VASIL’EV V.I. and LANGE T. (1998): The duality principle in learning for pattern recognition (in
Russian). Kibernetika i Vytschislit’elnaya Technika, 121, 7-16.
Keywords
ALPHA-PROCEDURE, ZONOID DEPTH, DD-PLOT, LOCATION DEPTH, PATTERN RECOGNITION, SUPERVISED LEARNING
The Alpha-Procedure - a nonparametric
invariant method for automatic classification of
d-dimensional objects
Tatjana Lange1 and Pavlo Mozharovskyi2
1 Hochschule Merseburg, Geusaer Straße, 06217 Merseburg, Germany, [email protected]
2 Universität zu Köln, Albertus-Magnus-Platz, 50923 Köln, Germany, [email protected]
Abstract. The presentation describes the α-procedure, which is based on a geometric representation of the separation of two classes by a hyperplane within a d-dimensional rectifying feature space. The needed dimension of the space, i.e. the number of features necessary for classification, is built up step by step using a 2-dimensional repère (frame of vector space). Features are added depending on the values of the functions describing the discriminating power of both the feature and the repère. The transformation of the vectors (i.e. objects) within the 2-dimensional repère is carried out so as to increase the discriminating power while the invariant is preserved; here, the invariant is the object's affiliation with a class. The result of the repère's transformation forms a fictitious feature. A new repère is then built from this fictitious feature and, as its second dimension, the next real feature with the best discriminating power. The enrichment of the feature set and the transformation of the repères stop once the classes are separated. The advantage of the α-procedure is the robustness and clarity of the process, which separates the classes step by step using 2-dimensional repères. Finally, the results of investigations comparing advanced classification methods, such as SVM and others, will be discussed.
References
VASIL’EV, V.I. (1991): The reduction principle in pattern recognition learning (PRL) problem.
Pattern Recognition and Image Analysis, 1, 1.
VASIL’EV V.I. and LANGE T. (1998): The duality principle in learning for pattern recognition (in
Russian). Kibernetika i Vytschislit’elnaya Technika, 121, 7-16.
Keywords
ALPHA-PROCEDURE, PATTERN RECOGNITION, SUPERVISED LEARNING,
REPÈRE, INVARIANT
A universal method for model selection in
parametric regression models based on
statistical tests
Eckhard Liebscher
University of Applied Sciences Merseburg
[email protected]
Abstract. Let Y_{n1}, ..., Y_{nn} be a sample of observations of a response variable. We consider the following master regression model with fixed design:

Y_{ni} = ∑_{j=1}^{k} β_j x_{ij}^{(n)} + ε_i   for i = 1, ..., n,

where β_1, ..., β_k are the parameters and (x_{ij}^{(n)})_{i=1,...,n; j=1,...,k} is the deterministic design matrix containing the data of the regressor variables. Suppose that ε_1, ε_2, ... is a sequence of i.i.d. real random variables with E(ε_i) = 0 and Var(ε_i) = σ². Further, a
family F of submodels is determined. To each submodel, we assign a number d
assessing the complexity of the submodel M ∈ F . The aim is to search for a submodel which fits the data reasonably well and which is as simple as possible. For
the decision concerning the submodel with index ν, we employ a modified F-statistic M(ν).
The selection method goes as follows: for given numbers ψν, we search for a minimum of d(ν) subject to M(ν) ≤ ψν. If there is more than one admissible model
with the same minimum complexity, then we take the model with maximum p-value.
In the talk we present asymptotic bounds for the mis-selection error which imply
consistency of the rule under weak assumptions. One particular feature of the rule is
that subjective grading of the model complexity can be incorporated. This aspect is
of special interest from the point of view of model building. Typically, model builders have some preference rules for special types of functions in mind when selecting
the model. These ideas can be used in the definition of d. Results of a simulation
study show that by using the proposed selection rule, the mis-selection error can be
controlled uniformly, in contrast to well-known approaches such as the Akaike, Bayesian, and Hannan-Quinn criteria.
In the last part of the talk, we discuss some computational issues. The use of the
branch-and-bound method improves the performance of the search.
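A base-R sketch of the selection rule (the submodel family, the complexity measure d, and the common threshold ψ are illustrative choices):

set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
full <- lm(y ~ x1 + x2 + x3)

forms <- list(y ~ x1, y ~ x2, y ~ x1 + x2, y ~ x1 + x3)
d     <- sapply(forms, function(f) length(attr(terms(f), "term.labels")))
psi   <- qf(0.95, 1, n - 4)   # one common threshold, for simplicity

# F-statistic of each submodel against the full model.
M <- sapply(forms, function(f) anova(lm(f), full)$F[2])

admissible <- which(M <= psi)                   # submodels fitting well
forms[[admissible[which.min(d[admissible])]]]   # simplest admissible model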
References
LIEBSCHER, E. (2012): A universal selection method in linear regression models. Open Journal of Statistics, to appear.
HOFMANN, M., GATU, C. and KONTOGHIORGHES, E.J. (2007): Efficient algorithms for computing the best-subset regression models for large scale problems. Computational Statistics & Data Analysis, 52, 16-29.
Support Vector Machines on Large Data Sets:
Simple Parallel Approaches
Oliver Meyer, Bernd Bischl, Claus Weihs
Lehrstuhl fuer Computergestuetzte Statistik, Technische Universitaet Dortmund
Vogelpothsweg 87, 44227 Dortmund, {meyer, bischl, weihs}@statistik.uni-dortmund.de
Abstract. Support Vector Machines (SVMs) are well-known for their excellent performance in the field of statistical classification. Still, the high computational cost
due to the cubic runtime complexity is problematic for larger data sets. To mitigate
this, Graf et al. (2005) proposed the Cascade SVM. It is a simple, stepwise procedure, in which the SVM is iteratively trained on subsets of the original data set and
support vectors of resulting models are combined to create new training sets. The
general idea is to bound the size of all considered training sets and therefore obtain
a significant speedup. Another relevant advantage is that this approach can easily be
parallelized because a number of independent models have to be fitted during each
stage of the cascade. Initial experiments show that even moderate parallelization
can reduce the computation time considerably, with only minor loss in accuracy.
We compare the Cascade SVM to the standard SVM and a simple parallel bagging
method w.r.t. both classification accuracy and training time. Furthermore, some approaches to improve the performance of the Cascade SVM, e.g. specifically adapted
hyperparameter tuning will be discussed.
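A minimal sketch of one cascade stage with the e1071 package (the partitioning, kernel, and two-stage depth are illustrative simplifications of the scheme of Graf et al.):

library(e1071)

d <- iris[iris$Species != "setosa", ]            # binary toy problem
d$Species <- droplevels(d$Species)
parts <- split(seq_len(nrow(d)), rep(1:4, length.out = nrow(d)))

# Stage 1: train one SVM per subset (independently, hence parallelizable)
# and keep only the support vectors of each submodel.
sv <- unlist(lapply(parts, function(idx) {
  fit <- svm(Species ~ ., data = d[idx, ], kernel = "radial")
  idx[fit$index]                                 # support vector indices
}))

# Stage 2: retrain on the union of all support vectors.
final <- svm(Species ~ ., data = d[sv, ], kernel = "radial")
mean(predict(final, d) == d$Species)             # accuracy check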
References
GRAF, H.P., COSATTO, E., BOTTOU, L., DURDANOVIC, I. and VAPNIK, V. (2005): Parallel Support Vector Machines: The Cascade SVM. Advances in Neural Information Processing Systems, 17, 521–528.
CHAWLA, N.V., MOORE, T.E., HALL, L.O., BOWYER, K.W., KEGELMEYER, P. and SPRINGER, C. (2003): Distributed Learning with Bagging-Like Performance. Pattern Recognition Letters, 24, 455–471.
Keywords
Classification, Support Vector Machines, Cascade SVM, Parallelization
Soft Bootstrapping and Its Comparison with
Other Resampling Methods
Hans-Joachim Mucha1 and Hans-Georg Bartel2
1 WIAS, Germany, [email protected]
2 Humboldt University Berlin, Germany
Abstract. The bootstrap approach is resampling taken with replacement from the
original data. Concretely, the original bootstrap technique can be formulated by
choosing the following weights of observations: m(i) = k, if the corresponding object i is drawn k times, and m(i) = 0, otherwise. Here it is supposed that originally
m(i) = 1 for all observations. In clustering, the so-called sub-sampling (i.e., resampling taken without replacement from the original data) is another approach (see
Hartigan (1969)): m(i) = 1, if observation i is drawn randomly, and m(i) = 0, otherwise. Here we recommend another bootstrap method, called soft bootstrapping,
that consists of randomly changing the original masses m(i) = 1 to some degree. This resampling scheme of assigning randomized masses m(i) > 0 (under the constraint that the total sum of masses equals the original number of observations) is especially appropriate for small sample sizes because no object is excluded from the soft
bootstrap sample. We compare the applicability of different resampling techniques
with respect to cluster analysis.
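A base-R sketch of one way to draw such randomized masses (rescaling a Dirichlet draw is my illustrative choice; the paper's exact scheme may differ):

# Soft bootstrap masses: strictly positive and summing to n, so that
# no observation is excluded from the resample.
soft_masses <- function(n) {
  g <- rgamma(n, shape = 1)      # Dirichlet(1, ..., 1) via gamma draws
  n * g / sum(g)
}

set.seed(1)
round(soft_masses(10), 3)        # compare: classical bootstrap counts
tabulate(sample(1:10, replace = TRUE), nbins = 10)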
Keywords
bootstrap, sub-sampling, cluster analysis
Dual Scaling Classification and Its Application
in Archaeometry
Hans-Joachim Mucha1 , Hans-Georg Bartel2 , and Jens Dolata3
1 Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Germany, [email protected]
2 Department of Chemistry at Humboldt University, Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]
3 Head Office for Cultural Heritage Rhineland-Palatinate (GDKE), Große Langgasse 29, 55116 Mainz, Germany, [email protected]
Abstract. We consider binary classification based on the dual scaling technique. In
the case of more than two classes many binary classifiers can be considered. We call
this pairwise classification because we train a classifier for each pair of classes. The
proposed approach goes back to Mucha (2002) and it is based on the pioneering
book of Nishisato (1980). It is applicable to mixed data. First, numerical variables
have to be discretized into bins to become ordinal variables (data preprocessing).
Second, the ordinal variables are converted into categorical ones. Then the data is
ready for dual scaling of each individual variable based on the given two classes:
each category is transformed into a score. A classifier can then be derived from the scores simply in an additive manner over all variables. Examples and applications
to archaeometry (provenance studies of Roman ceramics) are presented.
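A toy base-R sketch of the additive scoring idea (the category score used here, a difference of within-class relative frequencies, is a simplified stand-in for the dual scaling scores):

train <- data.frame(glaze = c("a", "a", "b", "b", "c", "c"),
                    clay  = c("x", "y", "x", "y", "y", "y"),
                    class = c(1, 1, 1, 2, 2, 2))

# Score each category by its relative frequency in class 1 minus that
# in class 2; classify a new object by the sign of the summed scores.
score_var <- function(v, cl) {
  t1 <- prop.table(table(v[cl == 1])); t2 <- prop.table(table(v[cl == 2]))
  cats <- union(names(t1), names(t2))
  s <- setNames(numeric(length(cats)), cats)
  s[names(t1)] <- s[names(t1)] + t1; s[names(t2)] <- s[names(t2)] - t2
  s
}

scores <- lapply(train[, 1:2], score_var, cl = train$class)
total  <- scores$glaze["a"] + scores$clay["y"]   # a new object ("a", "y")
ifelse(total > 0, 1, 2)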
References
NISHISATO, S. (1980): Analysis of Categorical Data: Dual Scaling and Its Applications. University of Toronto Press, Toronto.
MUCHA, H.-J. (2002): An Intelligent Clustering Technique Based on Dual Scaling. In: S. Nishisato, Y. Baba, H. Bozdogan, and K. Kanefuji (Eds.): Measurement and Multivariate Analysis. Springer, Tokyo, 37–46.
Introducing Analytical Methods and Predictive
Models in Project Management Activities
Jaime Santos1 and Orlando Belo2
1 ISCTE/IUL, Portugal, [email protected]
2 Algoritmi R&D Centre, University of Minho, Portugal, [email protected]
Abstract. Independently of the nature of a project, in project management the control of variables like cost, quality, schedule, and scope is among the main decision factors for a good and successful execution of a project. In the context of software engineering, project planning and execution are highly influenced by the creative nature of what is intended to be created and of the individuals involved in the project. Additionally, projects are surrounded by environmental complexities that directly (and indirectly) influence team productivity. So, managing the risks related to the different project steps is a key task of extreme importance for project managers (and sponsors), who should focus on controlling and monitoring such variables, as well as others concerning the context around them. This work will present a set of analytical techniques that we used to estimate effort and to perform classification to predict the success of a project. In this study, we prepared a small cocktail of data mining techniques and methods to explore potential correlations and influences contained in some of the most relevant parameters related to experience, complexity, organization maturity, and project innovation, as well as some other execution constraints and sizing units. We developed a model that could be deployed in any project management process, assisting project managers in planning and monitoring the state of a project or program under their supervision.
Keywords
Project Management; Data Mining; Business Intelligence; Effort Estimation; Project
Success Classification.
Constrained Dual Scaling of Successive
Categories for Detecting Response Styles
Pieter C. Schoonees1,2 , Michel van de Velden1 , and Patrick J. F. Groenen1
1 Erasmus University Rotterdam, The Netherlands, [email protected]
Abstract. A constrained dual scaling method for detecting response styles in so-called successive categories categorical data is proposed. Response styles arise in
questionnaire research when respondents tend to use rating scales in a manner unrelated to the actual content of the survey question. Dual scaling for successive
categories is a technique related to correspondence analysis (CA) for analyzing categorical data. However, there are important differences, with one important aspect
of dual scaling for successive categories data being that it also provides optimal
scores for the rating scale.
This property is used together with the interpretation of a response style as a
nonlinear mapping of a group of respondents' latent preferences to a rating scale. It is
shown through simulation that the curvature properties of four well-known response
styles make it possible to use dual scaling to detect them. Also, the relationship
between dual scaling and CA in conjunction with nonnegative least squares is used
to restrict the detected mappings to conform to quadratic monotone splines. This
gives rise to simple diagnostic maps which can help researchers to determine both
the type of response style and the extent to which it is manifested in the data.
References
NISHISATO, S. (1980): Analysis of Categorical Data: Dual Scaling and its Applications. University of Toronto Press, Toronto.
VAN ROSMALEN, J., VAN HERK, H. and GROENEN, P.J.F. (2010): Identifying Response
Styles: A Latent-Class Bilinear Multinomial Logit Model. Journal of Marketing Research,
47, 157–172.
Keywords
RESPONSE STYLE, DUAL SCALING, CORRESPONDENCE ANALYSIS, SPLINES
On Instance Selection in Multi Classifier
Systems
Friedhelm Schwenker, Sascha Meudt
University of Ulm, Institute of Neural Information Processing, 89069 Ulm
[email protected]
Abstract. In any data mining application the training set design is the most important part of the overall data mining process. Designing a training set means preprocessing the raw data, selecting the relevant features, selecting the representative
instances (samples), and labeling the instances for the classification or regression
application at hand. Labeling data is usually time consuming, expensive (e.g. in
cases where more than one expert must be asked), and error-prone. Instance selection deals with searching for a subset S of the original training set T, such that a
classifier trained on S shows similar, or even better classification performance than
a classifier trained on the full data set T (Olvera-López et al. 2010). We will present
confidence-based instance selection criteria for k-nearest-neighbor classifiers and
probabilistic support vector machines. In particular we propose criteria for multi
classifier systems and discuss them in the context of classifier diversity. The statistical evaluation of the proposed selection methods has been performed on affect
recognition from speech and facial expressions. Classes are not defined very well in
this type of application, leading to data sets with high label noise. Numerical evaluations on these data sets show that classifiers can benefit from instance selection not
only in terms of computational costs, but even in terms of classification accuracy.
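A base-R sketch of a confidence-based selection criterion for a k-nearest-neighbor classifier (the confidence definition, label agreement among the k nearest neighbors, and the threshold are illustrative assumptions):

# Keep instance i only if at least a fraction 'conf' of its k nearest
# neighbors share its label; low-confidence (noisy) instances are dropped.
select_instances <- function(X, y, k = 5, conf = 0.6) {
  D <- as.matrix(dist(X))
  keep <- sapply(seq_len(nrow(X)), function(i) {
    nn <- order(D[i, -i])[1:k]         # k nearest other instances
    mean(y[-i][nn] == y[i]) >= conf
  })
  which(keep)
}

set.seed(1)
y <- iris$Species
y[sample(150, 15)] <- sample(levels(y), 15, replace = TRUE)  # label noise
S <- select_instances(iris[, 1:4], y)
length(S)                              # size of the reduced training set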
References
OLVERA-LÓPEZ, J.A., CARRASCO-OCHOA, J.A., MARTINEZ-TRINIDAD, J.F. and KITTLER, J. (2010): A review of instance selection methods. Artificial Intelligence Review, 34(2), 133-143.
Keywords
INSTANCE SELECTION, ACTIVE LEARNING, MULTI-CLASSIFIER-SYSTEMS,
SUPERVISED LEARNING
Effects of Labeling Mechanisms on
Classification Error in Linear Discriminant
Analysis
Keiji Takai1 and Kenichi Hayashi2
1 Kansai University, 3-3-35 Yamatecho, Suita, Osaka 564-8680, Japan, [email protected]
2 Osaka University, 2-2 Yamadaoka, Suita, Osaka 565-0871, Japan, [email protected]
Abstract. In machine learning literature as well as statistical method literature, it
is widely believed that unlabeled data in addition to labeled data are effective to
reduce the classification error or to make more precise estimation of the parameters
for the classification boundary. In our talk, we examine if this belief is true or not
by focusing our attention on the classification error in linear discriminant analysis. For the examination, we introduce the missing-data framework, because unlabeled data can be regarded as missing data. Using this framework, we classify the labeling mechanisms into two types: the feature-independent labeling mechanism and the feature-dependent labeling mechanism. The former corresponds to MCAR and the latter to MAR in the missing-data analysis context. The former mechanism has been implicitly assumed in many machine learning studies that deal with partially labeled data, while the latter has rarely been assumed, although many practical examples in which it is more suitable can be found, for instance in the medical sciences. Under each of the labeling mechanisms, there are
two ways to use the data for estimation of the parameters, that is, the estimation
based on the labeled data alone and the estimation based on the mixed data of labeled and unlabeled data. For each of the labeling mechanisms, we theoretically
derive the asymptotic classification error efficiency based on the asymptotic theory
for missing data and numerically show which is a better way to use the data.
References
EFRON, B. (1975): The efficiency of logistic regression compared to normal discriminant analysis,
Journal of the American Statistical Association, 70, 892–898.
Keywords
UNLABELED DATA, SEMI-SUPERVISED LEARNING, MISSING DATA, LDA
Three-way Subspace Hierarchical Clustering
based on Entropy Regularization Method
Kensuke Tanioka1 and Hiroshi Yadohisa2
1 Graduate School of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe City, Kyoto 610-0394, Japan, [email protected]
2 Department of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe City, Kyoto 610-0394, Japan, [email protected]
Abstract. Three-way three-mode data are defined as X ∈ R^{|I|×|J|×|K|}, where I, J, and K are sets of objects, variables, and occasions, respectively. Vichi et al. (2007) proposed a three-way clustering that considers the effects of variables and occasions. The subspace is described as a linear combination of all original variables and occasions. However, Lance et al. (2004) argue that such a subspace is affected by noise variables. In addition, the subspace is subject to complicated assumptions, and therefore it is hard to interpret the results. Tanioka and Yadohisa (2012) then proposed a subspace hierarchical clustering whose subspaces for each cluster are described by different subsets of variables and occasions. These methods eliminate noise variables and make it easier to interpret the results. However, the only means to evaluate each variable and occasion for each cluster is by calculating the subspace.
In this study, we propose using distribution concepts, such as the variation of each variable and each occasion, to reflect the structure of each variable and occasion in each cluster, and not merely as a means to evaluate them.
References
LANCE, P., EHTEASHAM, H. and HUAN, L. (2004): Subspace Clustering for High Dimensional Data: A Review.
TANIOKA, K. and YADOHISA, H. (2012): Three-mode Subspace Clustering for Considering Effects under Noise Variables and Occasions.
VICHI, M., ROCCI, R. and KIERS, H.A.L. (2007): Simultaneous Component and Clustering Models for Three-way Data: Within and Between Approaches.
Keywords
VARIABLE SELECTION, OCCASION SELECTION
Gamma-Hadron-Separation in the
MAGIC-Experiment
Tobias Voigt1,3 , Roland Fried1 , Michael Backes2 , and Wolfgang Rhode2
1 TU Dortmund, Faculty of Statistics, Vogelpothsweg 87, 44227 Dortmund, [email protected], [email protected]
2 TU Dortmund, Physics Faculty, Otto-Hahn-Straße 4, 44227 Dortmund, [email protected], [email protected]
3 PhD Student
Abstract. The MAGIC telescopes on the Canary Island of La Palma are the largest
Cherenkov telescopes in the world, operating in stereoscopic mode since 2009
(Aleksić, 2012). Their purpose is to detect very high energy gamma rays emitted
by various astrophysical sources. Due to characteristics of the detection process one
cannot avoid that besides the gamma ray signal also other particles are observed.
These background particles are summarized as hadrons. Before the gamma rays can
be further analyzed, they have to be separated from the hadronic background. In the
MAGIC experiment this classification is usually done using a random forest. In this
talk we introduce the data provided by the MAGIC telescopes, which have
some distinctive features. These features include high class imbalance and unknown
and unequal misclassification costs as well as the absence of reliably labeled training data. We introduce a method to deal with some of these features. The method is
based on a thresholding approach (Sheng and Ling, 2006) and aims at minimization
of the mean square error of an estimator, which is derived from the classification.
The method is designed to fit into the special requirements of the MAGIC data.
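A schematic base-R sketch of thresholding on classifier scores (the score distributions, the cost ratio, and the grid search are illustrative; the actual method tunes the threshold to minimize the mean square error of the derived estimator):

set.seed(1)
y <- c(rep(1, 200), rep(0, 9800))             # few gammas, many hadrons
p <- c(rbeta(200, 5, 2), rbeta(9800, 2, 5))   # scores, e.g. random forest votes

# Choose the threshold minimizing expected misclassification cost, with a
# missed gamma assumed 20 times as costly as a misclassified hadron.
cost <- function(t, c_fn = 20, c_fp = 1)
  c_fn * sum(y == 1 & p < t) + c_fp * sum(y == 0 & p >= t)

grid <- seq(0.05, 0.95, by = 0.01)
grid[which.min(sapply(grid, cost))]           # selected threshold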
References
ALEKSIĆ, J. et al. (2012): Performance of the MAGIC stereo system obtained with Crab Nebula
data. Astroparticle Physics, 35, 435–448
SHENG, V.S. and LING, C.X. (2006): Thresholding for making classifiers cost-sensitive. Proceedings of the 21st National Conference on Artificial Intelligence, AAAI Press, 1, 476–481
Keywords
MAGIC, THRESHOLDING, CLASS IMBALANCE, RANDOM FOREST, ASTROPHYSICS
Cluster Analysis of Symbolic Data with
Application of R Software
Justyna Wilk1 and Marcin Pełka1
1 Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]
Abstract. Cluster analysis ranks among the most important groups of exploratory data analysis methods. In a typical cluster analysis study four major steps are distinguished: selection of objects and variables; distance measurement and clustering of objects; determining the number of clusters and validating the clustering; and describing and profiling the clusters. Symbolic data analysis contributes significantly to the development of taxonomic methodology, but there are two main problems in symbolic data clustering: the majority of methods used in the procedure are implemented exclusively for the classical data situation, and the complexity of symbolic data prevents their direct application. The aim of this article is to present approaches to symbolic data clustering using R software. In the first part of the paper the symbolic data concept and the cluster analysis procedure are presented. In the second part alternative strategies of symbolic data clustering are discussed, namely methods based on a dissimilarity matrix and on a symbolic data table. A review of applications of these two approaches in empirical research is given. Afterwards, packages and functions of R that may be useful in symbolic data clustering, depending on the selected approach (with a special focus on the symbolicDA package), are presented.
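A minimal sketch of the dissimilarity-matrix strategy for interval-valued symbolic data (the Hausdorff-type distance and the toy data are illustrative; the symbolicDA package offers dedicated tools):

set.seed(1)
lo <- matrix(runif(20, 0, 5), 10, 2)         # interval lower bounds
hi <- lo + matrix(runif(20, 0, 2), 10, 2)    # interval upper bounds

# Hausdorff distance between intervals, summed over the two variables.
D <- matrix(0, 10, 10)
for (i in 1:10) for (j in 1:10)
  D[i, j] <- sum(pmax(abs(lo[i, ] - lo[j, ]), abs(hi[i, ] - hi[j, ])))

# Any dissimilarity-based method applies, e.g. Ward's hierarchical method.
hc <- hclust(as.dist(D), method = "ward.D2")
cutree(hc, k = 3)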
References
BOCK, H.-H. and DIDAY, E. (Eds.) (2000): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag, Berlin-Heidelberg.
DIDAY, E., NOIRHOMME-FRAITURE, M. (2008): Symbolic data analysis and the Sodas software. John Wiley & Sons, Chichester.
GATNAR, E., WALESIAK, M. (Eds.) (2011): Analiza danych jakościowych i symbolicznych z
wykorzystaniem programu R [Qualitative and symbolic data analysis with application of R
software]. C.H. Beck, Warszawa.
Keywords
SYMBOLIC DATA ANALYSIS, CLUSTERING, R SOFTWARE
Part VI
Data Analysis and Classification in
Marketing
Spatial Modeling of Dependencies Between
Population, Education, and Economic Growth
Daniel Baier1 , Wolfgang Polasek2 , and Alexandra Rese1
1 Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Germany, daniel.baier|[email protected]
2 Institute for Advanced Studies, Vienna, Austria, [email protected]
Abstract. Since the seminal work by Anselin (1988), spatial effects have become an important tool for predictions in econometrics. Spatial effects appear in a huge variety of analysis tasks in economics and business administration, e.g. when a comparative regional analysis (Zelias 1987) or the modeling of regional sales data (see, e.g., Baier, Polasek 2010) is under study, or when spatial effects have to be compared with other effects in success factor analysis (see, e.g., Rese, Baier 2011). The paper discusses different approaches to model spatial effects and shows the viability of this approach in a practical application where the dependencies between population, education, and economic growth are under study. The paper uses actual regional data from Germany to analyze these effects. The advantages and disadvantages of this spatial modeling approach are discussed.
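A minimal sketch of one standard specification, the spatial lag model, with the spdep package (the grid neighborhood and the variables are illustrative assumptions):

library(spdep)

set.seed(1)
nb <- cell2nb(5, 10)              # 50 regions on a hypothetical 5 x 10 grid
lw <- nb2listw(nb)                # row-standardized spatial weights

edu <- rnorm(50); pop <- rnorm(50)
growth <- 0.5 * edu - 0.2 * pop + 0.4 * lag.listw(lw, edu) + rnorm(50)
df <- data.frame(growth, edu, pop)

fit <- lagsarlm(growth ~ edu + pop, data = df, listw = lw)  # spatial lag model
summary(fit)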
References
Anselin, L. (1988): Spatial Econometrics. In: Baltagi, B.H. (Ed.): A Companion to Theoretical Econometrics. Blackwell Publishing, 310–330.
Baier, D., Polasek, W. (2010): Marketing and Regional Sales: Evaluation of Expenditure Strategies
by Spatial Sales Response Functions. Studies in Classification, Data Analysis, and Knowledge
Organization, 40, 673–682.
Rese, A., Baier, D. (2011): Success Factors for Innovation Management in Networks of Small and
Medium Enterprises. R&D Management, 41(2), 138–155.
Zelias, A.J. (1987): A Regression Approach to Regional Forecasting. Papers of the Regional Science Association, 61, 39–49.
Keywords
SPATIAL MODELS, ECONOMETRICS, MARKETING, INNOVATION MANAGEMENT
Discrete Choice Methods and Their Applications
in Preference Analysis of Vodka Consumers
Andrzej Ba̧k1 , Marcin Pełka1 , and Aneta Rybicka1
1 Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected], [email protected]
Abstract. Preference analysis is one of the key elements of marketing research and of economics in general. Preferences help to explain how and why consumers make their choices. There are two types of preferences – stated and revealed. Discrete choice methods allow stated preferences to be analyzed. They model choices made by people among a finite set of alternatives. Discrete choice models take many forms, including: binary logit, binary probit, multinomial logit, conditional logit, multinomial probit, nested logit, generalized extreme value models, mixed logit, and exploded logit.
The main aim of the paper is to apply the multinomial logit model to analyze vodka consumers' preferences in a discrete choice experiment, with application of R software. The article presents the basic terms of the multinomial logit model, the discrete choice experiment, and model estimation. The paper also presents estimation results that allow the worst and best vodka brands to be determined, as well as which attributes are most important for consumers.
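A minimal sketch of such an estimation with the mlogit package, using its built-in Fishing data as a stand-in (in the study, the alternatives would be vodka brands and the covariates their attributes):

library(mlogit)

data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, shape = "wide", varying = 2:9, choice = "mode")

# Multinomial/conditional logit: utility driven by alternative attributes.
fit <- mlogit(mode ~ price + catch, data = Fish)
summary(fit)    # coefficient signs show which attributes drive choice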
References
AGRESTI, A. (2002): Categorical Data Analysis. Second Edition. Wiley, New York.
CAMERON, A.C., TRIVEDI, P.K. (2005): Microeconometrics. Methods and Applications. Cambridge University Press, New York.
TRAIN, K. (2003): Discrete Choice Methods with Simulation. Cambridge University Press, New
York.
ZWERINA, K. (1997): Discrete Choice Experiments in Marketing. Physica-Verlag, Heidelberg-New York.
Keywords
DISCRETE CHOICE METHODS, PREFERENCE ANALYSIS, MULTINOMIAL LOGIT MODEL
The Dangers of using Intention as a Surrogate
for Retention in Brand Positioning Decision
Support Systems
Michel Ballings1 and Dirk Van den Poel2
1 Ghent University, Department of Marketing, Tweekerkenstraat 2, 9000 Ghent, Belgium, [email protected] (PhD student)
2 Ghent University, Department of Marketing, Tweekerkenstraat 2, 9000 Ghent, Belgium, [email protected]
Abstract. The purpose of this paper is to explore the dangers of using intention as
a surrogate for retention in a decision support system (DSS) for brand positioning.
An empirical study is conducted, using structural equation modeling and both data
from the internal transactional database and a survey. The results show that different product benefits are recommended for brand positioning when intention is used
as opposed to retention as a criterion variable. The findings also indicate that the
strength of the structural relationships is inflated when intention is used. This has
implications in that managers will not only underinvest in marketing campaigns but
will also invest in advertisements that promote the wrong product benefits. Although
this study is limited to only one industry, the newspaper industry, it provides guidance for brand managers in selecting the most appropriate product benefit for brand positioning and advises against the use of intention as opposed to retention in DSSs. Our contribution to the literature is that it is the first study that
challenges and refutes the commonly held belief that intention is a valid surrogate
for retention in a DSS for brand positioning. Moreover, a framework is provided
that addresses opportunities for integrating predictive and descriptive computational
systems through data.
Keywords
PRODUCT BENEFITS, BRAND POSITIONING, DECISION SUPPORT SYSTEM, INTENTION, OBSERVED CUSTOMER RETENTION
Microeconometrics Multinomial Models
and their Applications
in Preferences Analysis using R
Andrzej Ba̧k and Tomasz Bartłomowicz
Wrocław University of Economics, Department of Econometrics and Computer
Science, ul. Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]
[email protected]
Abstract. Measurement of consumer preferences is one of the most important elements of marketing research. It helps to explain the reasons for consumer choices among products and services. Microeconometric models are useful in the analysis of categorical data (microdata describing individuals) often collected in marketing research based on discrete choices. Among microeconometric models for unordered categories, the multinomial logit, conditional logit, and mixed logit models are used most frequently. The main aim of this paper is to present some types of discrete choice multinomial logit models and their applications in consumer preference analysis. The basis for distinguishing between types of multinomial models is mainly the character of the independent variables included in the model, and this distinction is not clearly interpreted in the microeconometrics literature. The paper shows the fundamental differences between these types of multinomial logit models used in the area of consumer preference analysis. The models are estimated using R, R packages, and user functions written in the R programming language.
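The distinction between variable types maps directly onto, for instance, the multi-part model formula of the mlogit package; a sketch on the package's built-in Fishing data:

library(mlogit)

data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, shape = "wide", varying = 2:9, choice = "mode")

# price, catch: alternative-specific variables with generic coefficients
#               (conditional logit part);
# income:       individual-specific variable with alternative-specific
#               coefficients (classical multinomial logit part).
fit <- mlogit(mode ~ price + catch | income, data = Fish)
summary(fit)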
References
AGRESTI, A. (2002): Categorical Data Analysis. Second Edition. Wiley, New York.
CAMERON, A.C. and TRIVEDI, P.K. (2005): Microeconometrics. Methods and Applications. Cambridge University Press, New York.
JACKMAN, S. (2007): Models for Unordered Outcomes. Political Science 150C/350C. http://jackman.stanford.edu/classes/350C/07/unordered.pdf (12.03.2012).
SO, Y. and KUHFELD, W.F. (1995): Multinomial Logit Models. http://support.sas.com/techsup/technote/mr2010g.pdf (12.03.2012).
WINKELMANN, R. and BOES, S. (2006): Analysis of Microdata. Springer, Berlin.
Keywords
MICROECONOMETRICS, PREFERENCES, R PROGRAM
Measuring Consumers’ Brand Associations in
Online Market Research
Pascal Kottemann∗ , Martin Meißner and Reinhold Decker
Bielefeld University, Department of Business Administration and Economics,
P.O. Box 10 01 31, 33501 Bielefeld, Germany
{pkottemann,mmeissner,rdecker}@wiwi.uni-bielefeld.de
Abstract. Understanding brand equity based on consumers’ brand associations is
important for both marketing research and practice. Assuming that human knowledge is stored in a network structure, association network analysis is often applied
for determining a brand’s image (see, e.g., Teichert and Schöntag 2010). In 2006,
John et al. introduced Brand Concept Maps (BCM) as a tool for identifying and
visualizing brand associations, the direct or indirect link of these associations to the
brand and the relationship between these associations. Up to now, applying BCM requires data collection in a laboratory setting where consumers express their associations on poster boards. However, generating representative and sufficiently large samples in such a setting is difficult and often comes along with high costs.
The aim of this paper is to investigate whether the BCM approach can be applied
in online market research (implying computerized interviews). We empirically compare the outcomes of the original BCM approach using face-to-face interviews with
results from an online adaptation of BCM. Furthermore, we discuss the extent to
which BCM data are suitable for market segmentation based on brand perception.
References
JOHN, D. R.; LOKEN, B.; KIM, K.; MONGA, A. B. (2006): Brand Concept Maps: A Methodology for Identifying Brand Association Networks. Journal of Marketing Research, 43(4),
549–563.
TEICHERT, T. A.; SCHÖNTAG, K. (2010): Exploring Consumer Knowledge Structures Using
Associative Network Analysis. Psychology and Marketing, 27(4), 369–398.
Keywords
BRAND CONCEPT MAPS, BRAND ASSOCIATION NETWORKS, ONLINE
MARKET RESEARCH
∗ Ph.D. student
Multinomial-SVM-Item-Recommender for
Repeat-Buying Scenarios
Christina Lichtenthaeler1 and Lars Schmidt-Thieme2
1 Institute for Advanced Study, Technische Universität München, Lichtenbergstrasse 2a, 85748 Garching, Germany, [email protected]
2 Information Systems and Machine Learning Lab, University of Hildesheim, Marienburger Platz 22, 31141 Hildesheim, Germany, [email protected]
Abstract. Most common recommender systems deal with the task of generating recommendations for assortments in which a product is usually bought only once, such as books or DVDs. However, there are plenty of online shops selling consumer goods, such as drugstore products, where the customer purchases the same product repeatedly. We call such scenarios repeat-buying scenarios (Böhm et al. (2001)). For our approach we applied results from information geometry (Amari and Nagaoka (2000)) and transformed customer data taken from a repeat-buying scenario into a multinomial space. Using the multinomial diffusion kernel of Lafferty and Lebanon (2005), we developed a multinomial SVM item recommender system, M-SVM-IR, to calculate personalized item recommendations for a repeat-buying scenario. We evaluated our SVM item recommender in a 10-fold cross-validation against the state-of-the-art recommender BPR-MF developed by Rendle et al. (2009). Evaluation was performed on a real-world dataset taken from the online drugstore of Rossmann. It shows that the M-SVM-IR outperforms the BPR-MF with statistical significance regarding the AUC.
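A sketch of the multinomial diffusion kernel on purchase-frequency profiles, following the closed form given by Lafferty and Lebanon (2005); feeding it to an SVM via kernlab's kernel-matrix interface is my illustrative choice:

library(kernlab)

# Diffusion kernel between multinomial parameter vectors p and q
# (based on the geodesic distance on the probability simplex).
diff_kernel <- function(p, q, t = 0.5) {
  s <- min(sum(sqrt(p * q)), 1)    # guard against rounding above 1
  exp(-acos(s)^2 / t)
}

set.seed(1)
counts <- matrix(rpois(40 * 10, 2) + 1, 40, 10)  # customers x products
theta  <- counts / rowSums(counts)               # multinomial profiles
y      <- factor(rep(c("buys", "skips"), 20))    # illustrative target

K <- outer(1:40, 1:40,
           Vectorize(function(i, j) diff_kernel(theta[i, ], theta[j, ])))
ksvm(as.kernelMatrix(K), y, kernel = "matrix")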
References
AMARI, S. and NAGAOKA, H. (2000): Methods of Information Geometry. Translations of Mathematical Monographs, Vol. 191, American Mathematical Society.
BÖHM, W., GEYER-SCHULZ, A., HAHSLER, M. and JAHN, M. (2001): Repeat-Buying Theory and Its Application for Recommender Services. In: Opitz, O. (Ed.): Studies in Classification, Data Analysis, and Knowledge Organization.
LAFFERTY, J. and LEBANON, G. (2005): Diffusion Kernels on Statistical Manifolds. Journal of Machine Learning Research, 6, 129-163.
RENDLE, S., FREUDENTHALER, C., GANTNER, Z. and SCHMIDT-THIEME, L. (2009): BPR: Bayesian Personalized Ranking from Implicit Feedback. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
Keywords
SVM, Item-Recommender, Diffusion Kernel, Information Geometry
Approach to Predicting Changes in Market
Segments Based on Customer Behavior
Anneke Minke1 and Klaus Ambrosi1
Institut für Betriebswirtschaft und Wirtschaftsinformatik,
Universität Hildesheim, Germany
{minke,ambrosi}@bwl.uni-hildesheim.de
Abstract. In modern marketing, knowing the development of different market segments is crucial. However, simply measuring the changes that have occurred is not sufficient when planning future marketing campaigns. Predictive models are needed to show trends and to forecast abrupt changes such as the elimination of segments, the splitting of a segment, or the like. For predicting changes, continuously collected data are needed. In internet marketplaces, data concerning customer behavior can easily be recorded; furthermore, these data are more adequate than demographic data for showing changes in the relationship between customers and a corporation. Therefore, behavioral data are suitable for spotting trends in customer segments. For detecting changes in a market structure, fuzzy clustering is used, since gradual changes in cluster memberships can indicate future abrupt changes. In this talk, we introduce different measures for the analysis of gradual changes that take the currentness of the data into account and can be used to predict abrupt changes.
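A base-R sketch of the underlying idea, tracking fuzzy memberships over time with the cmeans function of e1071 (the drifting data and the change measure, the mean absolute membership shift, are illustrative):

library(e1071)

set.seed(1)
t1 <- rbind(matrix(rnorm(100, 0), 50, 2),        # two behavioral segments
            matrix(rnorm(100, 4), 50, 2))
t2 <- t1 + rbind(matrix(0, 50, 2),
                 matrix(0.8, 50, 2))             # one segment drifts

cm1 <- cmeans(t1, centers = 2, m = 2)
cm2 <- cmeans(t2, centers = cm1$centers, m = 2)  # keep clusters comparable

# Gradual change: how far each customer's membership vector has moved;
# large shifts may foreshadow splits or eliminations of segments.
summary(rowMeans(abs(cm2$membership - cm1$membership)))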
References
BÖTTCHER, M. and HÖPPNER, F. and SPILIOPOULOU, M. (2008): On Exploiting the Power
of Time in Data Mining. ACM SIGKDD Explorations Newsletter, 10/2, 3–11.
MINKE, A. and AMBROSI, K. and HAHNE, F. (2009): Approach for Dynamic Problems in Clustering. In: I.N. Athanasiadis, P.A. Mitkas, A.E. Rizzoli, J.M. Gómez (Eds.): Proceedings of
the 4th International Symposium on Information Technologies in Environmental Engineering
(ITEE’09). Springer, Berlin, 373–386.
SONG, H.S. and KIM, J.K. and KIM, S.H. (2001): Mining the Change of Customer Behavior in
an Internet Shopping Mall. Expert Systems with Applications, 21, 157–168.
Keywords
CHANGE PREDICTION, CLUSTERING, MARKET SEGMENTATION
Finite Mixture MNP vs. Finite Mixture IP
Models: An Empirical Study
Friederike Paetz1 and Winfried J. Steiner2
1 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld, [email protected]
2 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld, [email protected]
Abstract. In the context of conjoint choice models, we propose a new finite mixture model framework for estimating two types of probit models: Finite Mixture Multinomial Probit (FM-MNP) models and Finite Mixture Independent Probit (FM-IP) models. While the FM-MNP both accommodates heterogeneity in consumers' preferences and allows dependencies between alternatives to be considered, the FM-IP assumes independence and suffers from a property similar to IIA (cf. Hausman and Wise (1978)). The models are estimated using an Expectation-Maximization algorithm. In an empirical study, we investigate how restrictive the independence assumption of the FM-IP model is, and whether FM-IP and FM-MNP models therefore lead to different implications concerning the number of segments and market share forecasts (following Haaijer et al. 1998). Not unexpectedly, the MNP performs much better on the aggregate market level. However, when heterogeneity is accounted for, the FM-IP outperforms the FM-MNP, with the FM-IP 2-segment solution turning out to be best overall. Obviously, the additional benefit from considering dependencies between alternatives is diluted when consumer heterogeneity is taken into account. We also find that market share predictions under the optimal 2-segment solution are rather close between the FM-MNP and FM-IP models, so that the higher complexity of the FM-MNP does not seem justified.
References
HAAIJER, R., WEDEL, M., VRIENS, M. and WANSBEEK, T.J. (1998): Utility Covariances and
Context Effects in Conjoint MNP Models. Marketing Science, 17 (3), 236–252
HAUSMAN, J. and WISE, D. (1978): A conditional probit model for qualitative choice: Discrete
decisions recognizing interdependence and heterogeneous preferences. Econometrica, 46 (2),
403–429
Keywords
FINITE MIXTURE MODELS, MULTINOMIAL PROBIT MODELS, INDEPENDENT PROBIT MODELS
Rasch Models for Analyzing Role Models in
Inter-Organisational Innovation Processes
Alexandra Rese1 , Hans-Georg Gemünden2 , and Daniel Baier1
1 Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Germany, [email protected], [email protected]
2 Chair for Innovation and Technology Management, Technical University of Berlin, Germany, [email protected]
Abstract. Rasch models have been used especially in psychometrics and human
sciences for the analysis of social behavior, abilities or personality traits (Rasch
1980). Rasch models are adapted and used here to examine a set of dichotomous
promoting behavior items in inter-organizational radical innovations with respect
to the dimensional structure and item fit. Empirical research in innovation management has shown that key people play an important role in initiating and implementing innovations (Walter et al. 2011). However, these key people have hardly been investigated in an inter-organisational context so far (Rese, Baier 2011). Rasch analysis allows several measurement issues to be taken into account which are required for validity, and supports scale improvement. For each person, a total raw score of the role trait can be calculated regarding the occupation of a role. The focus and aim of
this study is first of all to develop a scale to assess the behavior of several actors in
inter-organizational radical innovations. Besides the identification of roles the role
structure is analyzed: With respect to an assumed size of two to five cooperating
organizations the question answered is how many people work together and take on
roles.
References
Rasch, G. (1980): Probabilistic Models for Some Intelligence and Attainment Tests. Mesa Press,
Chicago, IL.
Rese, A. and Baier, D. (2011): Success factors for innovation management in networks of small and
medium enterprises. R&D Management, 41(2), 138–155.
Walter, A., Parboteeah, K. P., Riesenhuber, F. and Hoegl, M. (2011): Championship behaviors
and innovations success: an empirical investigation of university spinoffs. Journal of Product
Innovation Management, 28(4), 586–598.
Keywords
RASCH MODELS, ROLE MODELS, INNOVATION MANAGEMENT
64
Variable Weighting and Selection Approaches
for Market Segmentation: A Comparison
Susanne Rumstadt and Daniel Baier
Chair of Marketing and Innovation Management, BTU Cottbus, Postbox 101344,
03013 Cottbus, Germany, {susanne.rumstadt,
daniel.baier}@tu-cottbus.de
Abstract. The selection and weighting of variables play decisive roles in market segmentation. The inclusion or exclusion of variables, as well as the distribution of their possible values, affects the quality of the grouping. Some selection and weighting approaches suggest better groupings, some worse. For example, when grouping respondents on the basis of holiday, spare time, or apartment images uploaded in social networks or during online interviews (see, e.g., Baier and Daniel 2012 for a recent overview), a multitude of subsets of features extracted from these images (e.g. color histograms, edge histograms, high-level features like the number of persons, or categories like beach or mountain) can be used, resulting in different grouping results. To solve this problem, several feature saliency approaches have been proposed recently, based, e.g., on latent class analysis (see, e.g., Law et al. 2004) or k-means heuristics (see, e.g., Carmone et al. 1999, Brusco and Cradit 2001, Steinley and Brusco 2008).
In this paper we analyze which feature saliency approach is advantageous in which setting. We use real data from image clustering and simulated data, both in a Monte Carlo setting, for this purpose.
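As a minimal illustration of why variable selection matters here, the following sketch (hypothetical data, not one of the cited feature saliency procedures) compares k-means groupings obtained from informative features, all features, and noise features via the adjusted Rand index:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X_inf, y_true = make_blobs(n_samples=300, centers=3, n_features=2, random_state=1)
noise = np.random.default_rng(1).normal(size=(300, 8))
X = np.hstack([X_inf, noise])                  # 2 informative + 8 noisy features

subsets = {"informative only": [0, 1],
           "all features": list(range(10)),
           "noise only": list(range(2, 10))}
for name, cols in subsets.items():
    labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X[:, cols])
    print(f"{name:17s} ARI = {adjusted_rand_score(y_true, labels):.2f}")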
References
Baier, D., Daniel, I. (2012): Image Clustering for Marketing Purposes, to appear in: Studies in
Classification, Data Analysis, and Knowledge Organization, 43, 1.
Brusco, M.J., Cradit, J.D. (2001): A Variable-Selection Heuristic for K-means Clustering, Psychometrika, 66, 2, 249-270.
Steinley, D., Brusco, M.J. (2008): Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures, Psychometrika, 73, 1, 125-144.
Carmone, F.J., Kara, A., Maxwell, S. (1999): HINoV: A new Model to Improve Market Segment
Definition by Identifying Noisy Variables, Journal of Marketing Research, 36, 4, 501-509.
Law, M.H.C., Figueiredo, M.A.T., Jain, A.K. (2004): Simultaneous Feature Selection and Clustering Using Mixture Models, IEEE Transactions on Pattern Analysis and Machine Intelligence,
26, 9, 1154-1166.
Keywords
MARKET SEGMENTATION, LATENT CLASS ANALYSIS, K-MEANS, FEATURE SALIENCY, VARIABLE SELECTION
65
A Validity Analysis of Recent Commercial
Conjoint Analysis Studies
Sebastian Selka, Daniel Baier, and Peter Kurz
Institute of Business Administration and Economics,
Brandenburg University of Technology Cottbus,
Postbox 101344, 03013 Cottbus, Germany
{sebastian.selka,daniel.baier}@tu-cottbus.de
TNS Infratest GmbH,
Arnulfstrasse 205, Munich 6839, Germany
[email protected]
Abstract. Due to the growing use of online questionnaires and the possible distraction – e.g. by e-mails, social network messages, or news reading while completing a questionnaire in an uncontrolled environment – one can assume that the (internal and external) validity of conjoint studies decreases. We test this assumption by comparing the (internal and external) validity of commercial conjoint analysis studies over the last years. The research base consists of (disguised) recent commercial conjoint analysis studies of a leading international marketing research company in this field with about 1,000 conjoint studies per year. The validity information is analyzed w.r.t. research objective, product type, period, incentives, and other categories, as well as w.r.t. other outcomes like interview length and response rates. The results show some interesting changes in the validity of these studies. Additionally, new procedures to deal with these settings will be shown.
References
WITTINK, D. R. and VRIENS, M. and BURHENNE, W. (1994): Commercial use of conjoint
analysis in Europe: Results and critical reflections. International Journal of Research in Marketing, 11, 1, 41 - 52.
DEUTSKENS, E. and de RUYTER, K. and WETZELS, M. and OOSTERVELD, P. (2004): Response Rate and Response Quality of Internet-Based Surveys: An Experimental Study. Marketing Letters, 15, 1, 21-36.
GREEN, P. E., KRIEGER, A. M., and WIND, Y. J. (2001): Thirty Years of Conjoint Analysis:
Reflections and Prospects. Interfaces, 31, 56–73.
Keywords
MARKETING RESEARCH, CONJOINT ANALYSIS, VALIDITY DEVELOPMENT
66
Exploring Nonlinear Effects in the Relationship
between Customer Satisfaction and Customer
Retention
Winfried J. Steiner1 , Florian U. Siems2 , Anett Weber1 and Daniel Guhl1
1
2
Department of Marketing, Clausthal University of Technology,
38678 Clausthal-Zellerfeld [email protected],
[email protected],
[email protected]
Faculty of Business and Economics, RWTH Aachen University,
52072 Aachen [email protected]
Abstract. There is consensus in the marketing literature that the satisfaction of customers concerning (1) perceived quality and (2) pricing of products/services is critical for customer retention. In contrast, there is a lack of empirical evidence about the exact functional relationship. Using nonparametric regression, this contribution empirically investigates whether and to what extent nonlinear effects of each of those two satisfaction dimensions affect customer retention. Results from an empirical study not only reveal complex nonlinear effects for both satisfaction-retention relationships, but also indicate strong interaction effects of the two satisfaction dimensions on customer retention.
To estimate nonlinear effects, we follow Lang and Brezger (2004) who proposed
a Bayesian version of P-splines originally introduced by Eilers and Marx (1996).
Accordingly, nonlinear interaction effects are modeled via tensor products of unidimensional splines within this Bayesian framework. The P-spline models are estimated with and without interaction effects and clearly outperform parametric benchmark models in both fit and predictive validity. While the P-spline model with interaction effects shows the best performance across models, the P-spline model without interaction effects still outperforms the parametric model with interaction effects, which indicates the important role of nonlinearities in the satisfaction-retention context.
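For illustration, a minimal frequentist P-spline fit in the spirit of Eilers and Marx (1996) can be written as penalized least squares with a second-order difference penalty; the study itself uses the Bayesian variant of Lang and Brezger (2004). The data, knot grid and smoothing parameter below are hypothetical (SciPy >= 1.8 assumed for the basis construction):

import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.01, 0.99, 200))                # e.g. satisfaction scores
y = np.sin(3 * np.pi * x) + 0.3 * rng.normal(size=200)   # nonlinear retention effect

k, m = 3, 20                                             # cubic splines, 20 inner knots
t = np.r_[[0.0] * k, np.linspace(0, 1, m), [1.0] * k]
B = BSpline.design_matrix(x, t, k).toarray()             # n x (m + k - 1) basis
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)             # 2nd-order difference penalty
lam = 1.0                                                # smoothing parameter
coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
fit = B @ coef                                           # penalized least-squares fit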
References
Eilers P, Marx B (1996) Flexible smoothing using B-splines and penalized likelihood (with comment and rejoinder). Statistical Science 11(2):89-121
Lang S, Brezger A (2004) Bayesian P-Splines. Journal of Computational and Graphical Statistics
13(1):183-212
Keywords
CUSTOMER SATISFACTION, CUSTOMER RETENTION, NONPARAMETRIC
REGRESSION, P-SPLINES
67
Complex Product Development: Using a
Combined VoC Lead User Approach
Alexander Sänn1 and Daniel Baier2
1
IHP GmbH - Leibniz-Institut für innovative Mikroelektronik† ,
Im Technologiepark 25, 15236 Frankfurt (Oder), Germany
[email protected]
2
Institute of Business Administration and Economics,
Brandenburg University of Technology Cottbus,
Postbox 101344, 03013 Cottbus, Germany
[email protected]
Abstract. Nowadays, the lead user method is a state-of-the-art method for generating breakthrough innovations in new product development. While the lead user method has been applied successfully to generating simple products for business-to-consumer markets (e.g. Sänn and Baier 2012), the contribution of lead users in a complex product environment is highly controversial (e.g. Mahr and Lievens 2012; Magnusson 2009). This research adopts the view of SMEs to generate complex radical innovations for business-to-business contexts, employing the lead user method in combination with voice-of-the-customer techniques in the field of complex IT security products. The new approach is expected to lower the risk of developing a niche product. The empirical findings led to an adaptive lead user approach addressing common problems of lead userness in complex product environments. Overall, SMEs will be enabled to specify and parameterize future products according to reliable and user-verified data in an early stage of new product development.
References
MAGNUSSON, P. (2009): Exploring the Contributions of Involving Ordinary Users in Ideation of
Technology-Based Services. Journal of Product Innovation Management, 26, 5, 578-593.
MAHR, D. and LIEVENS, A. (2012): Virtual lead user communities: Drivers of knowledge creation for innovation. Research Policy, 41, 1, 167-177.
SÄNN, A. and BAIER, D. (2012): Lead User Identification in Conjoint Analysis Based Product Design. Studies in Classification, Data Analysis and Knowledge Organization, 43,
521-528.
Keywords
MARKETING RESEARCH, LEAD USER, NEW PRODUCT DEVELOPMENT
† Alexander Sänn is a PhD student at Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany.
68
Identifying Consumer Typologies from Online
Product Reviews Using Finite Mixture Models
Michael N. Tuma1∗
Department of Business Administration and Economics, Bielefeld University,
D-33615 Bielefeld, Germany. [email protected]
Abstract. Online product reviewing is an emerging phenomenon that is playing an
increasingly important role in consumers’ purchase decisions (Chen and Xie, 2008).
Recent empirical surveys show that people rely more and more on opinions posted
on blogs, online forums and opinion portals when making a variety of decisions,
ranging from which movies to watch to which products to purchase.
Despite this importance of opinion analysis, to the best of our knowledge, there
has been no attempt by marketing researchers to empirically identify different types
of consumers who post their opinions online. This study seeks to fill this gap. Using natural language processing techniques, we develop a novel approach – related
to that of Decker and Trusov (2010) – to identify those variables which, at least
partly, explain the articulated opinions. These variables are then used in a model-based clustering approach (Wedel and Kamakura, 2000) to identify homogeneous
segments of consumers that can then be targeted with the same marketing measures.
The results show that polarity in consumer opinions plays a significant role in segment formation.
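A minimal sketch of the model-based clustering step, with the number of segments chosen by BIC; the feature matrix below merely stands in for review-derived variables and is purely hypothetical:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# toy stand-in for per-reviewer opinion variables (e.g. polarity, topic scores)
R = np.vstack([rng.normal(-1, 0.5, (150, 4)), rng.normal(1, 0.5, (150, 4))])

models = [GaussianMixture(n_components=k, random_state=3).fit(R) for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(R))      # BIC picks the segment count
print("segments:", best.n_components, "sizes:", np.bincount(best.predict(R)))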
References
CHEN, Y. and XIE, J. (2008): Online Consumer Review: Word–of–Mouth as a New Element of
Marketing Communication Mix. Management Science, 54, 477-491.
DECKER, R. and TRUSOV, M. (2010): Estimating Aggregate Consumer Preferences from Online
Product Reviews. International Journal of Research in Marketing, 27, 293-307.
WEDEL, M. and KAMAKURA, W. (2000): Market Segmentation: Conceptual and Methodological Foundations. 2nd ed., Kluwer Academic Publishers, Dordrecht.
Keywords
MARKET SEGMENTATION, ONLINE CONSUMER REVIEWS, FINITE MIXTURE MODELS, SEGMENTATION VARIABLES
∗
PH.D. student
69
Solving Product Line Design Optimization
Problems using Stochastic Programming
Sascha Voekler1 and Daniel Baier2
1
2
Institute of Business Administration and Economics, Brandenburg University of
Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany
[email protected]
Institute of Business Administration and Economics, Brandenburg University of
Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany
[email protected]
Abstract. In this paper, we apply stochastic programming methods to product line design optimization problems. Because the part-worths of the product attributes are estimated in conjoint analysis, there is a need to deal with the uncertainty caused by the underlying statistical data (Kall/Mayer 2011). Inspired by the work of George B. Dantzig (Dantzig 1955), we developed an approach that uses the methods of stochastic programming for product line design issues. Four different approaches are compared using notional data of a yogurt market from Gaul and Baier (2009). Stochastic programming methods like single- or two-stage programs are applied to the model of Gaul, Aust and Baier (Gaul et al. 1995) and compared to the original approach, to Green and Krieger (Green/Krieger 1985), and to Kohli and Sukumar (Kohli/Sukumar 1990). Besides the theoretical work, these methods are implemented in self-written code using the statistical software package R.
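A minimal sketch of the scenario-based idea: part-worth uncertainty is represented by sampled utility scenarios, and the product line maximizing expected first-choice share is selected by enumeration. All numbers and the competitor utility are hypothetical, and the sketch is written in Python rather than the R code used in the paper:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n_scen, n_cand = 200, 6
mean_u = rng.normal(size=n_cand)                       # estimated candidate utilities
U = mean_u + 0.5 * rng.normal(size=(n_scen, n_cand))   # sampled part-worth scenarios
u_comp = 0.2                                           # fixed competitor utility

best_line, best_share = None, -1.0
for line in combinations(range(n_cand), 2):            # all 2-product lines
    own = U[:, line].max(axis=1)                       # best own product per scenario
    share = (own > u_comp).mean()                      # expected first-choice share
    if share > best_share:
        best_line, best_share = line, share
print(best_line, round(best_share, 3))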
References
Dantzig, G.B. (1955): Linear Programming Under Uncertainty. Management Science, 1(3/4), 197-206.
Gaul, W., Aust, E., Baier, D. (1995): Gewinnorientierte Produktliniengestaltung unter Berücksichtigung des Kundennutzens. Zeitschrift für Betriebswirtschaftslehre, 65, 835-855.
Gaul, W., Baier, D. (2009): Simulations- und Optimierungsrechnungen auf Basis der Conjointanalyse. In: D. Baier, M. Brusch (Hrsg.): Conjointanalyse: Methoden-Anwendungen-Praxisbeispiele. Springer, Berlin, Heidelberg, 163-182.
Green, P.E., Krieger, A.M. (1985): Models and Heuristics for Product Line Selection. Marketing
Science, 4(1), 1-19.
Kall, P., Mayer, J. (2011): Linear Stochastic Programming Models, Theory, and Computation.
International Series in Operations Research and Management Science, Springer New York,
Dordrecht, Heidelberg, London, 2011, 156.
Kohli, R., Sukumar, R. (1990): Heuristics for Product-line Design Using Conjoint Analysis. Management Science, 36(12), 1464-1478.
Keywords
CONJOINT ANALYSIS, PRODUCT LINE DESIGN OPTIMIZATION, STOCHASTIC PROGRAMMING
70
Part VII
Data Analysis in Finance
Sovereign Wealth Funds and Portfolio Choice
Wolfgang Bessler1 and Daniil Wagner, CFA2
1
2
Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394
Giessen, Germany
[email protected]
PhD Student, Center for Finance and Banking, University of Giessen, Licher
Strasse 74, 35394 Giessen, Germany
[email protected]
Abstract. In this paper we take the portfolio manager’s perspective and analyze
Sovereign Wealth Fund (SWF) portfolios from an investment management view.
Since more than 50 percent of SWF assets are funded by oil or gas revenues and two
of the main SWF goals are economic stabilization (e.g. in the case of resource risks
borne by a high dependence on resource revenues) and intergenerational wealth
transfer (e.g. from exhaustible resource revenues to financial assets), our approach is to include the funding source of the SWF in the portfolio choice problem as a background asset. In this way, the portfolio choice for countries endowed with natural resources should differ significantly depending on the type of resource involved, and also from a pure financial asset setting. Based on this assumption we
develop a portfolio choice framework using Markowitz mean-variance optimization
(MVO) and including the natural resource pool as a fixed optimization component.
We account for the shortcomings of sample-based MVO, such as the high sensitivity of optimal weights to small changes in input parameters and the poor out-of-sample performance relative to naive diversification strategies, by using the Black-Litterman model to calculate the input parameters. Our empirical results show that in the presence of background assets (natural resources like oil, gas or copper) the investment opportunities for the whole country shrink in terms of risk-return opportunities. Furthermore, in-sample as well as out-of-sample analysis indicates that for resource-rich economies a high allocation to low-correlated assets such as U.S. government bonds, real estate or hedge funds may be optimal.
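A minimal sketch of mean-variance optimization with the resource pool as a fixed background component; the expected returns, covariances, resource share and risk aversion below are hypothetical placeholders for the Black-Litterman inputs used in the paper:

import numpy as np
from scipy.optimize import minimize

mu = np.array([0.03, 0.04, 0.05, 0.06])          # financial assets + resource (last)
Sigma = np.diag([0.02, 0.03, 0.05, 0.08])
Sigma[3, :3] = Sigma[:3, 3] = 0.01               # resource correlated with assets
w_res, gamma = 0.5, 4.0                          # fixed resource share, risk aversion

def neg_utility(w_fin):
    w = np.append(w_fin, w_res)                  # resource weight is not optimized
    return -(w @ mu - 0.5 * gamma * w @ Sigma @ w)

cons = {"type": "eq", "fun": lambda w: w.sum() - (1 - w_res)}
res = minimize(neg_utility, np.full(3, (1 - w_res) / 3),
               bounds=[(0, 1)] * 3, constraints=cons)
print(res.x.round(3))                            # financial weights given the resource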
Keywords
SOVEREIGN WEALTH FUNDS, RESOURCE RISK, PORTFOLIO CHOICE,
MEAN-VARIANCE OPTIMIZATION, BLACK-LITTERMAN MODEL
72
Sovereign Wealth Funds and Portfolio Choice
73
References
BALDING, C. (2008): A portfolio analysis of Sovereign Wealth Funds. Working Paper University
of California.
BESSLER, W. and WOLFF, D. (2011): A Theoretical and Empirical Analysis of the Black-Litterman Model, in: GfKl 2011 Proceedings.
BEST, M.J. and GRAUER, R.R. (1991): On the Sensitivity of Mean-Variance-Efficient Portfolios
to Changes in Asset Means: Some Analytical and Computational Results, in: The Review of
Financial Studies, vol. 4, no. 2, pp. 315-342.
BLACK, F. and LITTERMAN, R. (1992): Global Portfolio Optimization, in: Financial Analysts
Journal, pp. 28-43.
BROWN, A., PAPAIOANNOU, M. and PETROVA, I. (2010): Macrofinancial linkages of the
strategic asset allocation of commodity-based Sovereign Wealth Funds, IMF Working Paper.
CORDEN, W.M. and NEARY, J.P. (1982): Booming sector and de-industrialisation in a small open
economy, in: Economic Journal, vol. 92, pp. 825-848.
DOSKELAND, T.M. (2007): Strategic asset allocation for a country: the Norwegian case, in: Financial Markets and Portfolio Management, vol. 21, pp. 167-201.
DEMIGUEL, V., GARLAPPI, L. and UPPAL, R. (2009): Optimal Versus Naive Diversification:
How Inefficient is the 1/N Portfolio Strategy? in: The Review of Financial Studies, vol. 22, no.
5, pp. 1915-1953.
GINTSCHEL, A. and SCHERER, B. (2008): Optimal asset allocation for sovereign wealth funds,
in: Journal of Asset Management, vol. 9, pp. 215-238.
HARTWICK, J.M. (1977): Intergenerational Equity and the Investing of Rents from Exhaustible
Resources, in: The American Economic Review, vol. 67, no. 5, pp. 972-974.
HE, G. and LITTERMAN, R. (1999): The Intuition Behind Black-Litterman Model Portfolios,
Goldman Sachs Investment Management.
HOTELLING, H. (1931): The Economics of Exhaustible Resources, in: Journal of Political Economy, vol. 39, no. 2, pp. 137-175.
IDZOREK, T.M. (2006): Developing Robust Asset Allocations, Working Paper.
LEE, B. and WANG, H. (2010): Reevaluating the roles of large public surpluses and Sovereign
Wealth Funds in Asia, Asian Development Bank Institute and Institute for South East
Asian Studies Working Paper.
MICHAUD, R.O. (1989): The Markowitz Optimization Enigma: Is Optimized Optimal? in: Financial Analysts Journal, vol. 45, pp. 31-42.
SCHERER, B. (2009): A note on portfolio choice for sovereign wealth funds, in: Financial Markets
and Portfolio Management, vol. 23, no. 2009, pp. 315-327.
SOLOW, R.M. (1974): Intergenerational equity and exhaustible resources, in: The Review of Economic Studies, vol. 41, pp. 29-45.
SOLOW, R.M. and WAN, F.Y. (1976): Extraction costs in the theory of exhaustible resources, in:
The Bell Journal of Economics, vol. 7, no. 2, pp. 359-370.
XIE, P. and C. CHEN (2008): Sovereign Wealth Funds, macroeconomic policy alignment and
financial stability, National Natural Science Fund Emergency Project Working Paper.
Feature reduction and pattern classification for financial forecasting - A comparative study on different optimization strategies
Daniel Bohlmann1 and Jarek Krajewski2
1
2
Bergische Universität Wuppertal, Germany
[email protected]
Bergische Universität Wuppertal, Germany [email protected]
Abstract. The aim of our contribution is to compare different feature reduction strategies for classifying patterns in the field of financial time series forecasting. Feature (space) reduction plays an important role in pattern classification and has gained increasing interest in recent years, as the number of features has grown enormously (large-scale data). The inclusion of irrelevant, redundant, and noisy attributes in the dataset can result in poor predictive performance, and finding the optimal feature subset requires efficient search strategies and evaluation criteria. Wrapper methods tend to achieve higher classification accuracy than filter approaches, but they also incur higher computational costs. Feature transformation strategies such as Principal Component Analysis (PCA) or Nonnegative Matrix Factorization (NMF) aim at reducing the space dimension without losing any of the included information.
In financial time series prediction, the number of features in previous work was relatively small and mostly focused on a few trend indicators and oscillators. Since
the velocity and the acceleration function might include more information about the
future development of the time series, we extend the number of features to the first
and second derivative of the technical indicators. The dataset in this study focuses
on the development of the Euro-Dollar exchange rate and consists of 3,382 trading
days from January 1999 to December 2011.
This study presents a benchmark comparison of several attribute reduction methods
for supervised classification. We use different Forward Selection strategies in combination with Wrapper, Filter, Transformation and Hybrid (filter-wrapper) models.
We apply support vector machines (SVM), back-propagation neural networks (BP), k-nearest neighbor and Naive Bayes as learning algorithms. Empirical results indicate that SVM deals best with the nonlinear and noisy environment and outperforms the other forecasting models as well as the random walk. Furthermore, it can be shown that the selected features clearly depend on the specific underlying classification algorithm.
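A minimal sketch of one of the compared strategies, a wrapper-style forward selection around an SVM, with first and second derivatives of a hypothetical indicator series added as features; the data and the number of selected features are illustrative only:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
ind = rng.normal(size=(400, 4)).cumsum(axis=0)          # toy technical indicators
feats = np.hstack([ind,
                   np.gradient(ind, axis=0),            # velocity (1st derivative)
                   np.gradient(np.gradient(ind, axis=0), axis=0)])  # acceleration
y = (feats[:, 0] + rng.normal(size=400) > 0).astype(int)

selected, remaining = [], list(range(feats.shape[1]))
for _ in range(3):                                      # greedily pick 3 features
    scores = {j: cross_val_score(SVC(), feats[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best); remaining.remove(best)
print("selected features:", selected)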
Keywords
FINANCIAL FORECASTING, PATTERN RECOGNITION, PATTERN CLASSIFICATION, FEATURE REDUCTION, FEATURE SELECTION, DIMENSION REDUCTION
74
A practical method of determining longevity and
premature-death risk aversion in households
and some proposals of its application
Lukasz Feldman1 , Radoslaw Pietrzyk2 , and Pawel Rokita3
1
2
3
Wroclaw University of Economics [email protected]
Wroclaw University of Economics [email protected]
Wroclaw University of Economics [email protected]
Abstract. This paper presents a technique facilitating the practical calibration of a household's utility function to support the choice of the optimal (or at least satisfying) cash flow term structure in retirement. A simplified model of a two-person household is adopted. It is suggested that household members choose from among a number of easy-to-understand graphical schemes of cumulated cash flow term structures to be realized in the distribution phase of the life cycle. On this basis an analyst (or personal financial advisor) assigns individualized utility function parameters to the household. The utility function, as well as any associated mortality-rate models, are joint models for the whole household. A utility function calibrated with the suggested algorithm may subsequently be used in the optimization of retirement spending, but also to support investment decisions in the accumulation phase. The resulting cash flow term structure that maximizes expected discounted utility depends (among other things) on the applied life-cycle and mortality-force model, the age and gender of the household members, the cumulated retirement capital, etc. The calibration technique might also be helpful in classifying households with respect to risk aversion.
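A minimal sketch of the optimization criterion: the expected discounted utility of a retirement cash flow plan, weighted by survival probabilities. The CRRA utility, cash flows and survival curve are hypothetical simplifications of the joint household model:

import numpy as np

cash = np.array([40.0, 40, 40, 35, 35, 30, 30, 25, 25, 20])  # planned withdrawals
surv = 0.97 ** np.arange(1, 11)        # toy joint survival probabilities
disc = 1.03 ** -np.arange(1, 11)       # discount factors

def crra(c, g):                        # CRRA utility, g = risk aversion parameter
    return np.log(c) if g == 1 else (c ** (1 - g) - 1) / (1 - g)

EU = (surv * disc * crra(cash, 3.0)).sum()
print(round(EU, 4))                    # alternative plans are compared by this value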
References
GONG, G. and WEBB, A. (2008): Mortality Heterogeneity and the Distributional Consequences
of Mandatory Annuitization. The Journal of Risk and Insurance, 75(4), 1055–1079.
MILEVSKY, M. A., HUANG, H. (2011): Spending Retirement on Planet Vulcan: The Impact of
Longevity Risk Aversion on Optimal Withdrawal Rates, Financial Analysts Journal, 67(2),
45–58
Keywords
LONGEVITY RISK, UTILITY, HOUSEHOLD, RETIREMENT
75
Optimal portfolios of securities taking into
account the asymmetry of specific risk
Garsztka Przemyslaw1
Poznan Univ. of Economics [email protected]
Abstract. The standard approach to portfolio construction by Sharpe has two sources of risk: market risk and the risk of a random factor. At the same time, it is assumed that the random component is normally distributed with zero expectation and constant variance. In empirical research, however, high kurtosis and skewness of the characteristic line residuals may be noted. Suppose that the return on an asset depends on the situation on the market: in the case of positive information the asset is attractive to buyers and they are willing to pay a premium in order to accelerate the asset purchase; in the case of negative information the asset is "attractive" to sellers and they are willing to make some concession. Additionally, suppose that the less liquid the asset, the more difficult it is to conclude a transaction and the greater the premium/concession must be (Amihud, Mendelson (1986)). The premium offered by buyers is the reason for the emergence of right-sided skewness of the random component of the observed characteristic lines in periods when the stock index increases. Similarly, the concession offered by sellers is the reason for the emergence of left-sided skewness. The paper proposes an empirical account of this fact by splitting the random component into two factors, one of which explains the asymmetric effect of the liquidity risk attached to the assets. The article uses the Battese and Coelli specification of a function which was originally used in stochastic frontier analysis. In addition, the author proposes the construction of a portfolio of shares listed on the Frankfurt Stock Exchange, taking into account three sources of risk.
References
AMIHUD, Y. and MENDELSON, H. (1986): Asset pricing and the bid-ask spread, Journal of
Financial Economics 17, 223-249.
BATTESE, G.E. and COELLI, T.J. (1995): A Model for Technical Inefficiency Effects in a
Stochastic Frontier Production Function for Panel Data, Empirical Economics 20, 325-332.
Keywords
PORTFOLIO SELECTION, SPECIFIC RISK, ASSET LIQUIDITY
76
A Simplex Rotation Algorithm for the Factor
Approach to Generate Financial Scenarios
Alois Geyer1 , Michael Hanke2 , and Alex Weissensteiner3
1
2
3
WU (Vienna University of Economics and Business), Austria
Vienna Graduate School of Finance (VGSF)
Institute for Financial Services
University of Liechtenstein
School of Economics and Management
Free University of Bolzano, Italy
Abstract. Scenario trees to be used for financial optimization must be free of arbitrage opportunities. We start from a factor approach which is explicitly designed
to generate arbitrage-free scenario trees while exactly matching the assets’ expected
excess returns and covariances. Here we present a new algorithm to implement the
factor approach which is based on rotations of simplexes. This algorithm offers two
major computational advantages: First, it does not require Cholesky decomposition,
but uses a deterministically constructed simplex as its starting point. Second, instead
of (potentially frequent) re-sampling, it ensures no-arbitrage for every single run by
purposefully rotating this simplex. Hence, the new algorithm completely avoids any
need for checking scenarios for arbitrage. As a by-product, the derivation of our
algorithm provides interesting geometrical insights.
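For contrast, the check that the new algorithm renders unnecessary can be sketched as a linear feasibility problem: a one-period scenario fan is arbitrage-free if and only if strictly positive state prices exist. The scenario returns below are hypothetical:

import numpy as np
from scipy.optimize import linprog

R = np.array([[1.10, 1.02],            # scenario 1: gross returns of 2 assets
              [0.95, 1.01],            # scenario 2
              [1.00, 1.03]])           # scenario 3
# No arbitrage <=> exist q_s > 0 with sum_s q_s * R[s, i] = 1 for every asset i
# (prices normalized to 1). Feasibility is checked with a zero-objective LP.
res = linprog(c=np.zeros(R.shape[0]), A_eq=R.T, b_eq=np.ones(R.shape[1]),
              bounds=[(1e-6, None)] * R.shape[0])
print("arbitrage-free" if res.success else "arbitrage opportunity")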
77
Correlation of outliers in multivariate data
Bartosz Kaszuba
Department of Financial Investments and Risk Management
Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland
[email protected]
Abstract. Conditional correlations of stock returns (also known as exceedance correlations) are commonly compared for downside moves and upside moves separately. The results so far have shown an increase of correlation when the market goes down, and hence investors' portfolios are less diversified. Unfortunately, when analysing empirical exceedance correlations in a multi-asset portfolio, each correlation may be based on different moments in time; thus, high exceedance correlations for downside moves do not imply a lack of diversification in a bear market.
This paper proposes calculating correlations under the assumption that the Mahalanobis distance is greater than a given quantile of the chi-square distribution. The main advantage of the proposed approach is that each correlation is calculated from the same moments in time. Furthermore, when the data come from an elliptical distribution, the proposed conditional correlation does not change, in contrast to the exceedance correlation. Empirical results for selected stocks from the DAX30 will show an increase of correlation in a bear market and a decrease of correlation in a bull market.
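A minimal sketch of the proposed conditioning (toy bivariate returns; the threshold quantile is illustrative):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1000)  # toy returns

mu, S = X.mean(0), np.cov(X.T)
d2 = np.einsum("ij,jk,ik->i", X - mu, np.linalg.inv(S), X - mu)  # squared distances
mask = d2 > chi2.ppf(0.95, df=X.shape[1])        # joint outliers, same time points
print("full-sample corr:", np.corrcoef(X.T)[0, 1].round(3))
print("outlier corr:    ", np.corrcoef(X[mask].T)[0, 1].round(3))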
References
CHUA, D. B., KRITZMAN, M. and PAGE, S. (2009): The Myth of Diversification. The Journal of
Portfolio Management, 36, 26–35.
HONG, Y.,TU, J. and ZHOU, G. (2007): Asymmetries in Stock Returns: Statistical Tests and
Economic Evaluation. Review of Financial Studies, 20, 1547–1581.
LONGIN, F., and SOLNIK, B. (2001): Extreme Correlation of International Equity Markets. Journal of Finance, 56, 649–676.
Keywords
MULTIVARIATE OUTLIERS, ASSET CORRELATION, MAHALANOBIS DISTANCE, EXCEEDANCE CORRELATION
∗
PhD student
78
Using generalized additive models to fit credit
rating scores
Marlene Müller
Beuth University of Applied Sciences Berlin
[email protected]
Abstract. We consider the estimation of credit scores by means of semiparametric logit models. In credit scoring, the fitted rating score shall not only provide an optimal classification result but also serve as a modular component of a (typically quite complex) rating system. This means in particular that a rating score should be given by a linearly weighted sum of rating factors. In this way the rating procedure can be easily interpreted and understood also by non-statisticians.
For that reason the logit model, or logistic regression approach, is one of the most popular models for estimating credit rating scores. The first step in fitting the rating model is usually a nonlinear transformation of the raw variables in order to obtain a linear predictor (rating score) in the final estimation. As an alternative to this two-step approach, generalized additive models (GAM) allow for a simultaneous estimation of both the initial transformation and the final logit fit. In this study we compare GAM estimation approaches with a focus on the specific structure of credit data: small default rates, mixed discrete and continuous explanatory variables, and possibly nonlinear dependencies between the regressors.
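A minimal sketch of such a simultaneous fit using the GAM support in statsmodels (version >= 0.9 assumed; exact API details may vary by version). The data, column names and penalty weights are hypothetical:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines

rng = np.random.default_rng(7)
df = pd.DataFrame({"leverage": rng.uniform(0, 1, 2000),
                   "age": rng.uniform(20, 70, 2000)})
lin = -4 + 10 * (df["leverage"] - 0.5) ** 2            # nonlinear raw-variable effect
df["default"] = (rng.random(2000) < 1 / (1 + np.exp(-lin))).astype(int)

bs = BSplines(df[["leverage", "age"]], df=[8, 8], degree=[3, 3])
model = GLMGam(df["default"], exog=np.ones((len(df), 1)),   # intercept only
               smoother=bs, alpha=[1.0, 1.0],
               family=sm.families.Binomial())
res = model.fit()
print(res.summary())   # smooth terms replace the manual pre-transformation step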
References
HÄRDLE, W., MÜLLER, M., SPERLICH, S. and WERWATZ, A. (2004): Nonparametric and
Semiparametric Modeling, Springer, New York.
HASTIE, T. J. and TIBSHIRANI, R. J. (1990): Generalized Additive Models, Chapman and Hall,
London.
R DEVELOPMENT CORE TEAM (2010): R: A Language and Environment for Statistical Computing, http://www.R-project.org
WOOD, S. N. (2006): Generalized Additive Models: An Introduction with R, Chapman and Hall,
London.
Keywords
SEMIPARAMETRIC LOGIT MODEL, GENERALIZED ADDITIVE MODEL,
CREDIT RATING
79
Clustering Algorithms for Storage of Tick Data
Gabor I. Nagy∗ and Krisztian Buza
Budapest University of Technology and Economics
Magyar tudósok körútja 2, H-1117 Budapest, Hungary
[email protected], [email protected]
Abstract. Tick data is one of the most prominent types of temporal data, as it can be used to represent data in various domains such as geophysics or finance. Storage of tick data is a challenging problem because two criteria have to be fulfilled simultaneously: the storage structure should allow fast execution of queries, and the data should not occupy too much space on the hard disk or in the main memory. We present two clustering-based solutions, in particular our recently developed clustering algorithms SOHAC and SOPAC. These algorithms are designed to support the storage of tick data and are under publication (see References). We evaluate our algorithms both on publicly available real-world datasets and on real-world tick data from the financial domain provided by one of the world's most renowned investment banks. In our experiments, we compare our approaches, SOHAC and SOPAC, against a large collection of conventional clustering algorithms from the literature. The experiments show that our algorithms substantially outperform – both in terms of statistical significance and practical relevance – the examined clustering algorithms for the tick data storage problem. Additionally, we present our most recent research directions related to clustering algorithms for tick data storage.
References
NAGY, G.I. and BUZA, K. (2012): Partitional Clustering of Tick Data to Reduce Storage Space.
IEEE 16th International Conference on Intelligent Engineering Systems, to appear.
NAGY, G.I. and BUZA, K. (2012): Efficient Storage of Tick Data That Supports Search and Analysis. 12th Industrial Conference on Data Mining, LNCS, Springer, to appear.
Keywords
TICK DATA, CLUSTERING, STORAGE, APPLICATION, FINANCE
∗
The first author is PhD-student
80
Value-at-Risk Backtesting Procedures Based on
the Loss Functions - Simulation Analysis of the
Power of Tests
Krzysztof Piontek
Department of Financial Investments and Risk Management
Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland
[email protected]
Abstract. The definition of Value at Risk is quite general, and there are many approaches which can give different VaR values. The challenge is not to suggest a new method but to distinguish between good and bad models. Backtesting is the necessary statistical procedure to evaluate VaR models and select the best one. There are three groups of methods for validating VaR models: those based on the frequency of failures, those based on the adherence of the model to asset return distributions, and those based on various loss functions. Usually risk managers are not concerned about the power of the tests used. If the power of a test is low, it is likely to misclassify an inaccurate VaR model as well-specified. This can be a threat to financial institutions.
The aim of the paper is to analyze some chosen backtesting methods (based on the idea of loss functions), focusing on the problem of the power of the tests and on the limited data sets usually observed in practice. The main attention is paid to the different kinds of loss functions and to the statistical evaluation of the most commonly applied tests. Simulated data representing asset returns are used. The last part summarizes the obtained results and gives hints for optimal backtesting. This paper is a continuation of earlier research by the author.
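A minimal sketch of the simulation idea for one test from the frequency-of-failures group, the Kupiec proportion-of-failures (POF) test: power is estimated as the rejection rate over samples simulated from a deliberately misspecified VaR model. Sample size, distributions and repetition count are illustrative:

import numpy as np
from scipy.stats import chi2, norm, t

rng = np.random.default_rng(8)
T, p, reps = 250, 0.01, 2000                 # one trading year, 1% VaR
var_wrong = norm.ppf(p)                      # normal VaR applied to fat-tailed returns
rejections = 0
for _ in range(reps):
    ret = t.rvs(df=4, size=T, random_state=rng) / np.sqrt(2)  # unit-variance Student-t
    x = (ret < var_wrong).sum()              # number of VaR violations
    pi_hat = max(x / T, 1e-9)
    lr = -2 * (x * np.log(p / pi_hat) + (T - x) * np.log((1 - p) / (1 - pi_hat)))
    rejections += lr > chi2.ppf(0.95, df=1)  # POF likelihood-ratio test at 5%
print("estimated power:", rejections / reps)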
References
CAMPBELL, S. (2005): A Review of Backtesting and Backtesting Procedures. Federal Reserve
Board. Washington
PIONTEK, K. (2010): Analysis of Power for Some Chosen VaR Backtesting Procedures - Simulation Approach, Advances in Data Analysis, Data Handling and Business Intelligence, Part 7,
Springer Verlag, 481-490
Keywords
RISK MEASUREMENT, VALUE-AT-RISK, VaR, BACKTESTING, POWER OF
TESTS
81
Fundamental portfolio construction based on
semi-variance
Anna Rutkowska-Ziarko
Abstract. In models for creating a fundamental portfolio based on the classical Markowitz model, the variance is usually used as the risk measure. However, the equal treatment of negative and positive deviations from the expected return rate is a shortcoming of variance as a risk measure. Markowitz defined semi-variance to measure the negative deviations only. However, finding the fundamental portfolio with minimum semi-variance is much more difficult than finding a fundamental portfolio with minimum variance. The fundamental portfolio introduces an additional condition aimed at ensuring that the portfolio is only composed of companies in good economic condition. A synthetic indicator is constructed for each company, describing its economic and financial situation. The method of constructing fundamental portfolios using semi-variance as the risk measure is presented. The differences between the semi-variance fundamental portfolios and the variance fundamental portfolios are analysed using the example of companies listed on the Warsaw Stock Exchange.
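A minimal sketch of the harder optimization step, minimizing portfolio semi-variance over a (here simulated) return history; the fundamental condition based on the synthetic indicator could enter as additional constraints:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
R = rng.normal(0.001, 0.02, size=(500, 4))           # toy return history, 4 stocks

def semi_variance(w):
    port = R @ w
    downside = np.minimum(port - port.mean(), 0.0)   # negative deviations only
    return (downside ** 2).mean()

cons = [{"type": "eq", "fun": lambda w: w.sum() - 1}]
res = minimize(semi_variance, np.full(4, 0.25), bounds=[(0, 1)] * 4, constraints=cons)
print("semi-variance-minimal weights:", res.x.round(3))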
Keywords
MARKOWITZ MODEL, FUNDAMENTAL PORTFOLIO, SEMI-VARIANCE, MAHALANOBIS DISTANCE
82
Sovereign Credit Spreads During the European
Fiscal Crisis
Jonas Vogt, PhD Student1
Fakultaet Statistik, Technische Universitaet Dortmund, 44221 Dortmund Germany
[email protected]
Abstract. During the European financial crisis, strong correlations of certain sovereigns' credit spreads were observed even though the respective economies are hardly connected. A possible explanation for this phenomenon might be that the increase of these spreads is induced not by an increase in the default probabilities themselves but by an increase of their implied variances. To analyze this hypothetical relation, we model the risk-neutral default probabilities implied in CDS spreads under both the risk-neutral and the historical measure, considering a default as the first jump of a Poisson process and the intensities as diffusion processes. By comparing the diffusion parameter estimates obtained under these two measures, we want to see whether an increase in spreads can be explained by an increase in risk premiums for the default intensity variance (see Pan and Singleton, 2008). We make use of the characteristics of affine processes to transform the Feynman-Kac differential equations resulting from the expectation terms in the common CDS-pricing formula in order to solve for the implied intensities (see Duffie et al., 2000). Moreover, we suggest an iterative procedure to numerically solve for the implied intensities based on the parameters of the underlying diffusions, and in turn to estimate the diffusion parameters based on the obtained default intensities.
References
DUFFIE, D., PAN, J. and SINGLETON, K. (2000): Transform Analysis and Asset Pricing for
Affine Jump Diffusions. Econometrica, 68, 1343–1376.
PAN, J. and SINGLETON, K. (2008): Default and Recovery Implicit in the Term Structure of
Sovereign CDS Spreads. Journal of Finance, LXIII, 2345–2384
Keywords
CREDIT-RISK, REDUCED-FORM MODEL, AFFINE PROCESSES, MACROFINANCE, EUROPEAN FINANCIAL CRISIS
83
Part VIII
Machine Learning and Knowledge
Discovery
Classification and definition of contextual
vicinity from emotional words for sentiment
analysis
Hyunsup Ahn, Markus Weinmann, and Christoph Lofi
TU-Braunschweig, Germany
{hs.ahn@,markus.weinmann@,[email protected].}tu-bs.de
Abstract. Much research has been carried out on collecting opinions from product reviews. Due to the number of channels and messages on the web where customers leave messages to express their opinions and product-related emotions, manually selecting relevant and meaningful opinions is harder than ever. For this reason, methods which automatically assign a polarity from dichotomized categories, negative and positive words, have become generally accepted. In this paper we propose to build an emotion-relevant lexicon that indicates the intensity of emotional words and the contextual vicinity among them. Instead of a manual classification of the word set, our suggestion adopts a self-verifiable method based on user ratings, which are highly congruent with the overall opinion posted by the person. Our results therefore show the further possibility that the common technique which extracts and summarizes people's emotional stances from the web as a corpus could be made more accurate by applying weighted scores from emotional words.
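A minimal sketch of the self-verification idea: each word is scored by the average user rating of the reviews containing it. The toy reviews and the 1-5 rating scale are hypothetical:

from collections import defaultdict

reviews = [("great battery, great screen", 5),
           ("poor battery, dull screen", 2),
           ("great value", 4)]
sums = defaultdict(float); counts = defaultdict(int)
for text, rating in reviews:
    for word in set(text.replace(",", "").split()):
        sums[word] += rating; counts[word] += 1
lexicon = {w: sums[w] / counts[w] for w in sums}   # word -> polarity intensity
print(lexicon["great"], lexicon["poor"])           # 4.5 vs 2.0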
Keywords
web mining, sentiment analysis, emotion in e-Business, product review
86
Using Conceptual Inductive Learning for
Cooperative Query Answering
Maheen Bakhtyar1 , Lena Wiese2 , Katsumi Inoue3 , and Nam Dang4
1
2
3
4
Asian Inst. of Technology Bangkok, Thailand
[email protected]
University of Hildesheim, Hildesheim, Germany [email protected]
National Inst. of Informatics, Tokyo, Japan [email protected]
Tokyo Inst. of Technology, Tokyo, Japan [email protected]
Abstract. A database system may not always be able to find correct answers for a query; such a query is called a failing query. Cooperative query answering systems produce informative answers for such queries. To obtain such informative answers we apply generalization operators that have long been studied in the area of Conceptual Inductive Learning (Michalski, 1983). In particular, Inoue and Wiese (2011) analyze three generalization operators: "Dropping Conditions", "Anti-Instantiation" and "Goal Replacement". We observed that some of the answers produced after query generalization are not related to what the user asked. Therefore, we extend the generalization operators by a mechanism to classify answers into those related and those unrelated to the original query intention. We determine the similarity between the user query and the produced answers based on a similarity function, acquiring the semantics of constants in the answers using WordNet (http://wordnet.princeton.edu). Only those answers classified as most related to the query will be returned to the user.
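A minimal sketch of the WordNet-based relatedness filter using nltk (the wordnet corpus must be downloaded first via nltk.download('wordnet'); the words, synset cutoff and threshold are illustrative):

from nltk.corpus import wordnet as wn

def relatedness(a, b):
    """Max path similarity over the first few synsets of two constants."""
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(a)[:3] for s2 in wn.synsets(b)[:3]]
    scores = [s for s in scores if s is not None]   # None for cross-POS pairs
    return max(scores, default=0.0)

query_term, answers = "illness", ["flu", "cough", "car"]
related = [ans for ans in answers if relatedness(query_term, ans) > 0.2]
print(related)   # keep only answers semantically close to the query intention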
References
Inoue, K. and Wiese, L. (2011): Generalizing conjunctive queries for informative answers. In Proceedings of the 9th International Conference on Flexible Query Answering Systems. Lecture
Notes in Artificial Intelligence, vol. 7022, pp. 1–12, Springer-Verlag.
Michalski, Ryszard S. (1983): A Theory and Methodology of Inductive Learning. In Machine Learning: An Artificial Intelligence Approach, pp. 111–161, TIOGA Publishing.
Keywords
INDUCTIVE CONCEPTUAL LEARNING, QUERY RELAXATION, SEMANTIC FILTERING
87
A Study of the Efficiency and Accuracy of
Data Stream Clustering for Large Data Sets
Matthew Bolaños1∗ , John Forrest2 , and Michael Hahsler1
1 Southern Methodist University, Dallas, Texas, USA.
2 Microsoft, Redmond, Washington, USA.
Abstract. Identifying groups in large data sets is important for many machine learning and knowledge discovery applications. In recent years, data stream clustering algorithms have been proposed which can deal efficiently with potentially unbounded streams of data. Obviously, these algorithms can also be used for large non-streaming data sets and, as such, present light-weight alternatives to conventional algorithms. The question is how accurate the results obtained via a data stream clustering algorithm are compared to conventional clustering methods. To investigate this among other questions, we have developed an R extension package called stream which provides the experimental infrastructure for data stream mining and currently focuses on data stream clustering and cluster evaluation. Using this infrastructure we will systematically compare the results obtained via conventional clustering algorithms (ranging from k-means and hierarchical clustering to BIRCH) with data stream clustering algorithms (CluStream, DenStream, kNN) on a set of synthetic and real-world data from several domains. We will evaluate efficiency (runtime and memory requirements), accuracy (e.g., by purity and precision given known ground truth), as well as sensitivity to parameters and other choices like the algorithm used to recluster micro-clusters.
References
AGGARWAL, C. (2007): Data Streams - Models and Algorithms. Springer.
BOLAÑOS, M., FORREST, J. and HAHSLER, M. (2012): stream: Infrastructure for Data Streams,
R package version 0.1-0, http://r-forge.r-project.org/projects/clusterds.
Keywords
CLUSTERING, DATA STREAM CLUSTERING ALGORITHMS, EVALUATION
∗
Student author
88
Feedback Prediction for Blogs
Krisztian Buza
Department of Computer Science and Information Theory
Budapest University of Technology and Economics, Hungary
[email protected]
Abstract. The last decade has led to an unbelievable growth of the importance of social media. While in the early days of social media, blogs, tweets, facebook, youtube, social tagging systems, etc. served more or less just as entertainment for a few enthusiastic users, nowadays news spreading over social media may govern the most important changes of our society, such as the revolutions in the Islamic world or US presidential elections. Due to the huge amount of documents appearing in social media, there is an enormous need for the automatic analysis of such documents.
One of the most important properties which distinguishes social media from classic media is the uncontrolled, dynamic and rapidly-changing content: e.g., when a blog entry appears, users may immediately comment on this document. In this work, we focus on the analysis of documents appearing in blogs. We present an industrial application which has the following major components: (i) the crawler, (ii) information extractors, (iii) the data store and (iv) analytic components. The analytic components allow one to explore trends and to predict the number of feedbacks that a document is expected to receive in the next 24 hours. This task is related to opinion mining; however, despite its relevance, there are just a few works on predicting the number of feedbacks that a blog entry is expected to receive, see e.g. Yano and Smith (2010). In contrast to them, we target various topics (we do not focus on political blogs) and perform experiments with many different models. We hope that our observations will motivate research to improve classification and regression algorithms.
References
YANO, T. and SMITH, N. A. (2010): What's Worthy of Comment? Content and Comment Volume
in Political Blogs. 4th International AAAI Conference on Weblogs and Social Media, 359–362
MISHNE, G. (2007): Using Blog Properties to Improve Retrieval. International Conference on
Weblogs and Social Media
Keywords
SOCIAL MEDIA, BLOGS, FEEDBACK PREDICTION
89
Label Ranking with Abstention: Learning to
Predict Partial Orders
Weiwei Cheng1 , Willem Waegeman2 , Volkmar Welker1 , and Eyke Hüllermeier1
1
2
Mathematics and Computer Science, Marburg University, 35032 Marburg,
Germany {cheng,eyke}@mathematik.uni-marburg.de
Department of Applied Mathematics, Biometrics and Process Control, Ghent
University [email protected]
Abstract. The prediction of structured outputs in general and rankings in particular has attracted considerable attention in machine learning in recent years, and
different types of ranking problems have already been studied (Fürnkranz and
Hüllermeier, 2011). Here, we propose a generalization or, say, relaxation of the standard setting of label ranking, allowing a model to make predictions in the form of
partial instead of total orders. We interpret such kind of prediction as a ranking with
partial abstention: If the model is not sufficiently certain regarding the relative order of two alternatives and, therefore, cannot reliably decide whether the former
should precede the latter or the other way around, it may abstain from this decision
and instead declare these alternatives as being incomparable. We propose a general
approach to ranking with partial abstention as well as evaluation metrics for measuring the correctness and completeness of predictions. Moreover, we introduce a
new method for learning to predict partial orders that improves on an existing approach (Cheng et al., 2010), both theoretically and empirically. Our method is based
on the idea of thresholding the probabilities of pairwise preferences between labels
as induced by a predicted (parameterized) probability distribution on the set of all
rankings (Marden 1995).
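A minimal sketch of the thresholding idea: pairwise preference probabilities (hypothetical here; in the paper they are induced by a predicted rank distribution) are turned into a partial order by abstaining near 1/2:

import numpy as np

labels = ["A", "B", "C"]
P = np.array([[0.5, 0.9, 0.55],     # P[i, j] = prob(label i precedes label j)
              [0.1, 0.5, 0.2],
              [0.45, 0.8, 0.5]])
tau = 0.6                            # threshold > 0.5 controls the abstention level

order = [(labels[i], labels[j]) for i in range(3) for j in range(3)
         if i != j and P[i, j] >= tau]
print(order)   # predicted strict preferences; omitted pairs are incomparable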
References
Fürnkranz, J. and Hüllermeier, E. (2011): Preference Learning. Springer-Verlag.
Cheng, W., Rademaker, M., De Baets, B. and Hüllermeier, E. (2010): Predicting partial orders: Ranking with abstention. In: Proc. ECML/PKDD 2010, 215–230, Barcelona, Spain.
Marden, J. (1995): Analyzing and Modeling Rank Data. Chapman and Hall.
Keywords
LABEL RANKING, PARTIAL ORDERS, PLACKETT-LUCE MODEL, MALLOWS MODEL
90
On the relation of cluster stability
and early classifiability of time series
István Dávid∗ and Krisztian Buza
Department of Computer Science and Information Theory
Budapest University of Technology and Economics, Hungary
{david,buza}@cs.bme.hu
Abstract. Although the classification of time series or sequences of observations is a well-studied topic, there are just a few works on their early classification. By early classifiability we mean the property that the class label (which refers to the entire time series) can often be predicted based on the first few observations (see e.g. Xing et al., 2009 and Xing et al., 2008). Some practical examples of early classification include fraud detection, applications in health care, and network engineering (e.g. the classification of TCP/IP packets based on the first few segments of data).
In our study, we examine the relation of early classifiability to the early identification of clusters and cluster (in)stability, which might be an indicator of concept drift. As the k-nearest neighbor algorithm (k-NN) with dynamic time warping (DTW) has become popular for time-series classification, we target the above question in the context of k-NN and k-medoids. Furthermore, we extend the concept of cluster stability introduced by Ackerman and Ben-David (2009) to time-series clustering.
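A minimal sketch of early classification with 1-NN and DTW on growing prefixes (toy series; the paper's data and cluster-stability analysis are not reproduced):

import numpy as np

def dtw(a, b):
    """Standard O(len(a)*len(b)) dynamic time warping distance."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

grid = np.linspace(0, 1, 50)
train = [(np.sin(6 * grid), 0), (np.cos(6 * grid), 1)]   # labeled reference series
query = np.sin(6 * grid + 0.1)                           # true class 0
for prefix in (5, 10, 25):                               # classify early prefixes
    pred = min(train, key=lambda s: dtw(query[:prefix], s[0][:prefix]))[1]
    print(f"after {prefix:2d} points -> class {pred}")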
References
XING, Z., PEI, J., YU, P.S. (2009): Early Prediction on Time Series: A Nearest Neighbor Approach. Proc. of the Twenty-First Int. Joint Conf. on Artificial Intelligence (IJCAI-09). AAAI
Press, Palo Alto, California, 1297-1302.
ACKERMAN, M., BEN-DAVID, S. (2009): Clusterability: A Theoretical Study. Journal of Machine Learning Research: Workshop and Conf. Proc., 5, 1-8.
XING, Z. et al. (2008): Mining sequence classifiers for early prediction. SDM’08 Proc. of the 2008
SIAM Int. Conf. on Data Mining, 644-655.
Keywords
TIME-SERIES CLASSIFICATION, CLUSTER STABILITY, EARLY PREDICTION, CLUSTERING, ALGORITHMS
∗
Student
91
Experimental Evaluation of Communication
Efficient Distributed Classification in
Peer-to-Peer Networks
Umer Khan1, Alexandros Nanopoulos2 and Lars Schmidt-Thieme1
1
2
University of Hildesheim, Information Systems and Machine Learning Lab
{khan, schmidt-thieme}@ismll.uni-hildesheim.de
University of Eichstätt, Ingolstadt Germany.
[email protected]
Abstract. Mining patterns from large-scale distributed networks, such as Peer-to-Peer (P2P) networks, is a challenging task, because the centralization of data is not feasible. The goal is to develop mining algorithms that are communication efficient, scalable, asynchronous, and robust to peer dynamism, and which achieve accuracy as close as possible to centralized ones. In this paper, we present a detailed experimental evaluation of classification algorithms in a P2P framework. We focus on two variants of Support Vector Machines (SVM), namely Reduced SVM (RSVM) (Lee et al. 2001) and Relevance Vector Machines (RVM) (Tipping 2001). RSVM are known for their ability to represent the whole data set using a very small subset of training instances (Ang et al. 2008). Based on a Bayesian probabilistic framework, RVM utilize dramatically fewer kernel functions. Nevertheless, both RSVM and RVM provide very good generalization performance, comparable to standard SVM. Additionally, their ability to provide compact and accurate models makes them both efficient for classification in P2P networks, due to the reduced communication cost resulting from the need to propagate local (i.e. within each peer) classification models to neighboring peers until all peers converge to a global model. We perform an extensive empirical comparison between RSVM and RVM, using several real data sets from the UCI repository. Our results provide useful conclusions about the suitability of RSVM and RVM for the task of classification in P2P networks, in terms of classification accuracy and communication overhead.
References
Ang, Hock H. et al.(2008): Cascade RSVM in Peer-to-Peer Networks.ECML PKDD
Lee, Y. and Mangasarian, Olvi L.(2001): RSVM: Reduced Support Vector Machines. First SIAM
International Conference on Data Mining,5-7
Tipping, Michael E.(2001): Sparse Bayesian Learning and the Relevance Vector Machine. Journal
of Machine Learning Research,211-244.
Keywords
DISTRIBUTED DATA MINING, RSVM, RVM, P2P NETWORKS
92
Framework for Storing and Processing
Relational Entities in a Data Stream
Pawel Matuszyk∗
Otto-von-Guericke-University, Faculty of Computer Science
Magdeburg, Germany
[email protected]
Abstract. Conventional stream mining algorithms rely on the assumption that every data instance can be seen only once in a stream [1]. Therefore, all data instances are considered statistically independent from each other; consequently, this assumption causes a loss of information. The problem can be solved by modelling data as reoccurring relational entities (e.g. a customer in an online shop) [2]. In this article an efficient, multithreaded framework which can handle such entities is proposed. For this framework a new architecture consisting of four layers was developed. One of these layers is the cache layer, for which a new, tailored cache structure with nearly linear complexity is proposed.
The framework and its components were evaluated using a self-implemented data generator that creates relational streams with changing speed and concept drift. The evaluation showed that the new framework reduces the computation time by up to 97.13%. A smoothing effect on the speed-ups of the stream has also been observed. This is especially relevant for many application scenarios on the internet (e.g. recommender systems).
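The tailored cache structure itself is not specified in this abstract; as a generic stand-in, a bounded least-recently-used cache for reoccurring entities can be sketched as follows (capacity and keys hypothetical):

from collections import OrderedDict

class EntityCache:
    def __init__(self, capacity=10000):
        self.capacity, self._data = capacity, OrderedDict()
    def get(self, entity_id):
        if entity_id in self._data:
            self._data.move_to_end(entity_id)    # mark as recently used
            return self._data[entity_id]
        return None                              # miss: load from a slower layer
    def put(self, entity_id, state):
        self._data[entity_id] = state
        self._data.move_to_end(entity_id)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)       # evict least recently used entity

cache = EntityCache(capacity=2)
cache.put("customer:1", {"clicks": 3}); cache.put("customer:2", {"clicks": 1})
cache.put("customer:3", {"clicks": 7})           # evicts customer:1
print(cache.get("customer:1"), cache.get("customer:3"))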
References
1. Guha, S.; Meyerson, A.; Mishra, N.; Motwani, R. & O’Callaghan, L. Clustering data streams:
Theory and practice, IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society, 2003, 515-528
2. Siddiqui, Z.; Spiliopoulou, M.; Winslett, M. (Ed.) Combining Multiple Interrelated Streams for
Incremental Clustering , Scientific and Statistical Database Management, Springer Berlin /
Heidelberg, 2009, 5566, 535-552
Keywords
RELATIONAL STREAM MINING, RELATIONAL ENTITIES PREPARATION, MINING MULTIPLE STREAMS
∗
PhD student
93
Spectral clustering: interpretation and Gaussian
parameter
Sandrine Mouysset1 , Joseph Noailles, Daniel Ruiz2 , and Clovis Tauber3
1
2
3
University of Toulouse, IRIT-UPS, 118 route de Narbonne, 31062 Toulouse,
University of Toulouse, IRIT-ENSEEIHT, 2 rue Camichel, 31071 Toulouse,
{sandrine.mouysset,joseph.noailles,
daniel.ruiz}@irit.fr
University of Tours, Hopital Bretonneau, 2 boulevard Tonnelle, 37044 Tours,
[email protected]
Abstract. Spectral clustering consists in creating, from the spectral elements of a Gaussian affinity matrix, a low-dimensional space in which the data are grouped into clusters. This unsupervised method is mainly based on the Gaussian affinity measure, its parameter and its spectral elements. However, questions about the separability of clusters in the projection space and about the choice of the spectral parameter remain open. By drawing back to a continuous formulation wherein clusters appear as disjoint subsets, we propose an interpretation of spectral clustering for a finite discrete data set via Partial Differential Equations and Finite Elements theory, which establishes good properties on how spectral clustering works. This approach develops particular geometrical properties inherent to the eigenfunctions of a specific eigenvalue problem. This leads to a study showing the role of the Gaussian affinity parameter: this geometrical property is proved to be preserved asymptotically in the Gaussian parameter when looking at the eigenvectors of the spectral clustering algorithm. With numerical experiments, we show the efficiency of the spectral clustering method in retrieving groups from several geometrical examples and with various refinements. More precisely, we focus on the behaviour of the method with respect to this new theoretical material.
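A minimal sketch of the parameter's practical effect, using the rbf affinity of scikit-learn (whose gamma corresponds to 1/(2*sigma^2) in the Gaussian kernel); the data set and gamma grid are illustrative:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=10)
for gamma in (0.1, 10, 1000):                    # too flat, reasonable, too peaked
    labels = SpectralClustering(n_clusters=2, affinity="rbf",
                                gamma=gamma, random_state=10).fit_predict(X)
    print(f"gamma={gamma:6}: ARI = {adjusted_rand_score(y, labels):.2f}")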
References
NG, A.Y. and JORDAN, M.I. and WEISS, Y. (2002): On spectral clustering: Analysis and an
algorithm. Advances in neural information processing systems, 849–856.
MOUYSSET, S. and NOAILLES, J. and RUIZ, D. (2010): On an interpretation of Spectral Clustering via Heat equation and Finite Elements theory. IAENG, 267–272.
Keywords
SPECTRAL CLUSTERING, GAUSSIAN KERNEL, HEAT EQUATION, EIGENVALUE PROBLEM
94
gRecs: A collaborative filtering framework for
group recommendations
Eirini Ntoutsi1 , Kostas Stefanidis2 , Kjetil Nørvåg2 , and Hans-Peter Kriegel1
1
2
Institute for Informatics, Ludwig-Maximilians University (LMU), Munich
{ntoutsi,kriegel}@dbs.ifi.lmu.de
Department of Computer and Information Science, Norwegian University of
Science and Technology, Trondheim
{kstef,Kjetil.Norvag}@idi.ntnu.no
Abstract. Recommendation systems provide suggestions to users about a variety
of items, such as movies and restaurants. The large majority of these systems are
designed to make recommendations for individual users. However, there are cases
in which the items to be suggested are intended for a group of users, e.g., a group
of friends planning to watch a movie or visit a restaurant. Recent approaches try to
satisfy the preferences of all group members either by creating a joint profile for
the group and suggesting items w.r.t. this profile or by aggregating the single user
recommendations into group recommendations [3]. We opt for the second approach,
since it is more flexible and offers opportunities for efficiency improvements.
We propose a framework for group recommendations following the collaborative
filtering approach. The most prominent items for each user of the group are identified based on items that similar users liked in the past. We efficiently aggregate the
single user recommendations into group recommendations by leveraging the power
of a top-k algorithm. We employ three different aggregation designs: least misery, where strong user preferences act as a veto; most optimistic, where the most satisfied member is the most influential one; and fair, for more democratic cases. The
main bottleneck in collaborative filtering is to locate the most similar users for a
given user. We model the user-item interactions in terms of clustering and use the
extracted clusters for predictions [1,2]. To deal with the high dimensionality and
sparsity of ratings, we envision subspace clustering to find clusters of similar users
and subsets of items where these users have similar ratings for the items.
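The three aggregation designs can be illustrated on a toy matrix of predicted ratings (a hedged sketch only; the actual framework couples the aggregation with a top-k algorithm and subspace clustering):

```python
import numpy as np

# Toy predicted ratings: rows = group members, columns = candidate items.
scores = np.array([[4.5, 2.0, 3.5],
                   [3.0, 4.0, 3.5],
                   [5.0, 1.0, 4.0]])

least_misery    = scores.min(axis=0)   # strong dislikes act as a veto
most_optimistic = scores.max(axis=0)   # the happiest member dominates
fair            = scores.mean(axis=0)  # democratic average

# Recommend the top item under each design.
for name, agg in [("least misery", least_misery),
                  ("most optimistic", most_optimistic),
                  ("fair", fair)]:
    print(name, "-> item", int(agg.argmax()))
```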
References
1. NTOUTSI, E., STEFANIDIS, K., NORVAG, K. and KRIEGEL, H.-P. (2012): Fast Group Recommendations by Applying User Clustering. In: ER.
2. NTOUTSI, E., STEFANIDIS, K., NORVAG, K. and KRIEGEL, H.-P. (2012): gRecs: A Group Recommendation System based on User Clustering (demo paper). In: DASFAA.
3. ROY, S.B., AMER-YAHIA, S., CHAWLA, A., DAS, G. and YU, C. (2010): Space Efficiency in Group Recommendation. VLDBJ 19(6), 877–900.
Keywords
GROUP RECOMMENDATIONS, COLLABORATIVE FILTERING, USER CLUSTERING
Symbolic cluster ensemble based on
co-association matrix vs. noisy variables and
outliers
Marcin Pełka1
Wroclaw University of Economics, Department of Econometrics and Computer
Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland,
[email protected]
Abstract. The ensemble approach, based on aggregating the information provided by different models, has proved to be a very useful tool in the context of supervised learning. The main goal is to increase the accuracy and stability of the final classification. Recently the same techniques have been applied in cluster analysis, where combining a set of different clusterings can yield a better solution. Ensemble clustering techniques are not a new problem, but their application to the symbolic data case is a quite new area. The article presents a proposal for applying co-association based functions in cluster analysis when dealing with symbolic data, which tends to form not well separated clusters of many different shapes. In the empirical part, simulation experiment results based on artificial data (containing noisy variables and/or outliers) are compared. Besides that, ensemble clustering results on real data sets are shown. In both cases the ensemble clustering results are compared with the application of a single clustering method.
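A minimal sketch of the co-association (evidence accumulation) idea of Fred and Jain (2005), here shown on classical label vectors for illustration rather than on symbolic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    C = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :])
    return C / len(partitions)

def evidence_accumulation(partitions, k):
    """Consensus clustering: single linkage on 1 - co-association."""
    D = 1.0 - co_association(partitions)
    Z = linkage(D[np.triu_indices_from(D, 1)], method="single")
    return fcluster(Z, t=k, criterion="maxclust")

# Three base clusterings of six objects, combined into k = 2 consensus clusters.
P = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1]]
print(evidence_accumulation(P, k=2))
```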
References
BOCK, H.-H. and DIDAY, E. (Eds.) (2000): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag, Berlin-Heidelberg.
FRED, A.L.N. (2001): Finding consistent clusters in data partitions. In: J. Kittler and F. Roli (Eds.): Multiple Classifier Systems, Vol. 1857 of Lecture Notes in Computer Science. Springer-Verlag, Berlin-Heidelberg, 78–86.
FRED, A.L.N. and JAIN, A.K. (2005): Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, 835–850.
Keywords
SYMBOLIC DATA ANALYSIS, ENSEMBLE CLUSTERING, CO-ASSOCIATION
MATRIX
Ensemble learning for density estimation
Friedhelm Schwenker, Michael Glodek, Martin Schels
University of Ulm, Institute of Neural Information Processing, 89069 Ulm
[email protected]
Abstract. Estimation of probability density functions (PDF) is a fundamental concept in statistics and machine learning and has various applications in pattern recognition. In this contribution ensemble learning approaches will be discussed in the context of density estimation; in particular, these methods will be applied to two PDF estimation methods: kernel density estimation (the Parzen window approach) and Gaussian mixture models (GMM) (Fukunaga, 1990). The idea of ensemble learning is to combine a set of L pre-trained models g_1, ..., g_L into an overall ensemble estimate g. Combining multiple models is a natural step to overcome shortcomings and problems appearing in the design of single models. Along with the design of the single models g_i, an aggregation mapping must be realized in order to achieve a final combined estimate; usually this mapping has to be fixed a priori, but trainable fusion mappings can be applied as well. Examples of fixed fusion schemes are the median or the (weighted) average of the model predictions (Kuncheva, 2004); e.g., the weighted average is defined through g_w(x) = \sum_{l=1}^{L} w_l g_l(x) with w_l ≥ 0 and \sum_{l=1}^{L} w_l = 1. For example, weighted averaging of kernel density estimates leads to a representation with a new kernel function. The proposed ensemble PDF approach will be analyzed by statistical evaluations on benchmark data sets. The behavior of these algorithms in classification and cluster analysis applications will be presented as well.
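A small sketch of such a fixed fusion scheme for kernel density estimates (uniform weights on bootstrap-trained members; an illustration of the weighted-average formula above, not the contribution's evaluation protocol):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(0, 1, size=500)

# Train L kernel density estimates on bootstrap resamples of the data.
L = 5
members = [gaussian_kde(rng.choice(data, size=len(data))) for _ in range(L)]

# Fixed fusion: weighted average with w_l >= 0 and sum(w_l) = 1.
w = np.full(L, 1.0 / L)
ensemble_pdf = lambda x: sum(wl * m(x) for wl, m in zip(w, members))

x = np.linspace(-4, 4, 9)
print(ensemble_pdf(x))  # fused density estimate on a grid
```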
References
KUNCHEVA, L. (2004): Combining pattern classifiers: Methods and algorithms, Wiley.
FUKUNAGA, K. (1990): Introduction to statistical pattern recognition. Academic press.
Keywords
ENSEMBLES, KERNEL DENSITY ESTIMATION, GAUSSIAN MIXTURE MODELS
An Analysis of Classifier Chains for Multi-Label
Classification
Robin Senge1, Jose Barranquero2, Juan José del Coz2, and Eyke Hüllermeier1
1 Mathematics and Computer Science, Marburg University, 35032 Marburg, Germany {senge,eyke}@mathematik.uni-marburg.de
2 Artificial Intelligence Center, University of Oviedo at Gijón, Campus de Viesques, 33204 Gijón, Spain [email protected]
Abstract. Multi-label classification (MLC) has attracted increasing attention in the
machine learning community during the past few years. Apart from being interesting theoretically, this is largely due to its practical relevance in many domains, such
as text classification and bioinformatics. The goal in MLC is to induce a model that
assigns a subset of labels to each example, rather than a single one as in multiclass classification. In order to exploit dependencies between the labels, so-called
classifier chains have been proposed as an appealing method for tackling the MLC
task (Read et al., 2011). In addition to several empirical studies showing it to be competitive with state-of-the-art methods, especially when used in its ensemble variant, there are also first results on theoretical properties of classifier
chains (Dembczyński et al., 2010). Continuing along this line, we analyze the influence of a potential pitfall of the learning process, namely the discrepancy between
the feature spaces used in training and testing: While true class labels are used as
supplementary attributes for training the binary models along the chain, the same
models need to rely on estimations of these labels when making a prediction. We
demonstrate under which circumstances the attribute noise thus created can affect
the overall prediction performance. As a result of our findings, we propose two variants of classifier chains that are designed to overcome this problem. Experimentally,
we show that these methods are indeed able to produce better results in cases where
the original chaining process is likely to fail.
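The train/test discrepancy can be made explicit in a minimal classifier chain sketch (logistic regression as base learner is an assumption for illustration, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ClassifierChain:
    """One binary model per label; label j sees x plus labels 1..j-1."""
    def fit(self, X, Y):
        self.models = []
        for j in range(Y.shape[1]):
            Xj = np.hstack([X, Y[:, :j]])      # TRUE labels as extra features
            self.models.append(LogisticRegression().fit(Xj, Y[:, j]))
        return self

    def predict(self, X):
        Y_hat = np.zeros((X.shape[0], len(self.models)))
        for j, m in enumerate(self.models):
            Xj = np.hstack([X, Y_hat[:, :j]])  # ESTIMATED labels: attribute noise
            Y_hat[:, j] = m.predict(Xj)
        return Y_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = (rng.random((100, 3)) < 0.4).astype(int)
print(ClassifierChain().fit(X, Y).predict(X).shape)
```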
References
J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification.
Machine Learning, 85(3):333–359, 2011.
K. Dembczyński, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pages 279–286, 2010.
Keywords
multi-label classification, classifier chains, attribute noise
Sentiment analysis in the Twitter stream
Alina Sinelnikova, Eirini Ntoutsi, and Hans-Peter Kriegel
Institute for Informatics, Ludwig-Maximilians University (LMU), Munich
[email protected], {ntoutsi,kriegel}@dbs.ifi.lmu.de
Abstract. Nowadays, more and more people publish their opinions online, so everyone has the possibility to catch up on the thoughts of millions of people without even knowing them. In this way, consumers have enormous power to influence each other by sharing their brand experiences, either positive or negative. Twitter is the most famous micro-blogging service and an opinion-rich resource that allows people to broadcast their opinions about politics, products, movies, etc. in real time. With 200 million tweets generated on a daily basis, there is a need for opinion mining and sentiment analysis in order to help business analysts in the decision making process.
In this work, we deal with the challenges posed by the Twitter stream, namely
size, unbalanced classes, changing class distributions, as well as with the specific
limitations of the Twitter language, namely, colloquial language, tweet length and
the difficult nature of the sentiment analysis problem due to the subjectivity of the
tweets. For the study, we use a dataset of predefined topics from the Twitter API
monitored over a period of three months. We experimented with a variety of classifiers such as Multinomial Naive Bayes, Adaptive Hoeffding Tree, Stochastic Gradient Descent, a hybrid Hoeffding Tree and Naive Bayes classifier and ensembles
of classifiers. For the evaluation, we used both holdout and prequential methods.
As a forgetting mechanism we used a sliding window. We evaluated the different
methods and also the impact of the different preprocessing steps. We implemented
a sentiment analysis tool that connects our methods to the Twitter API and identifies and monitors the changes in the sentiment distribution of the current opinions
regarding some user defined topic.
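A sketch of the prequential (test-then-train) evaluation with a sliding window as forgetting mechanism; the model interface (predict/learn) is an assumption for illustration:

```python
from collections import deque

def prequential_accuracy(stream, model, window=1000):
    """Test-then-train evaluation over a stream of (features, label) pairs,
    reporting accuracy over a sliding window of the most recent examples."""
    recent = deque(maxlen=window)
    for x, y in stream:
        recent.append(model.predict(x) == y)   # test first ...
        model.learn(x, y)                      # ... then train on the true label
        yield sum(recent) / len(recent)        # windowed accuracy so far
```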
References
PANG, B. and LEE, L. (2008): Opinion Mining and Sentiment Analysis. Found. Trends Inf. Retr., 2(1–2), 1–135.
BIFET, A., HOLMES, G. and PFAHRINGER, B. (2011): MOA-TweetReader: Real-Time Analysis in Twitter Streaming Data. In: T. Elomaa, J. Hollmén and H. Mannila (Eds.): Discovery Science. Springer, 46–60.
Keywords
DATA STREAMS, SENTIMENT ANALYSIS, TWITTER.
Recommendations in Time Evolving
Multi-modal Social Networks
Panagiotis Symeonidis
Department of Informatics, Aristotle University of Thessaloniki
[email protected]
Abstract. Online social networks (OSNs), such as Facebook and LinkedIn, have attracted huge attention after the widespread adoption of Web 2.0 technology. These systems contain gigabytes of data which can be mined and used for making personalized predictions and recommendations of products, users and digital content. In particular, OSNs collect information from users' social contacts and other interactions, build an interconnected multi-modal social network, and make suggestions of products or even people to users based on their common friends, common commenting on written posts, etc. People often belong to multiple explicit or implicit social networks because of different interpersonal interactions. For example, in Facebook, people add each other as friends, constructing a large unipartite friendship network. However, besides the explicit friendship relations between the users, there are also other implicit relations. For example, users can co-comment on the posts written by their friends, they can co-rate products, and co-like a user's photo. In this paper, we study (i) methods that combine information derived from heterogeneous explicit or implicit social networks and (ii) the evolution of user preferences over time. These two aspects of OSNs result in better personalized recommendations of users, products and services.
References
SYMEONIDIS, P. and TIAKAS, E. and MANOLOPOULOS, Y. (2011): Product recommendation
and rating prediction based on multi-modal social networks. In: Proceedings of the fifth ACM
conference on Recommender systems (RecSys 2011), ACM, Chicago, 61–68.
SIDDIQUI, Z.F. and SPILIOPOULOU, M. and SYMEONIDIS, P. and TIAKAS, E. (2011): A Data
Generator for Multi-Stream Data. In: Proceedings of the second International Workshop on
Mining Ubiquitous and Social Environments (MUSE 2011), Athens, 63–68.
Keywords
Social Networks, Data Streams, Recommender Systems
A Lightweight CVFDT Classifier for Streams
with Concept Drift
Miriam Tödten∗ , Zaigham Faraz Siddiqui† , and Myra Spiliopoulou
Otto-von-Guericke University Magdeburg, 39106 Magdeburg, Germany
{toedten@mail,siddiqui@iti,myra@iti}.cs.uni-magdeburg.de
Abstract. We have investigated the induction of decision trees over concept-drifting data streams. Whereas other approaches based on the concept-adapting CVFDT (Hulten et al., 2001) maintain alternate subtrees if there is sufficient statistical evidence for another test attribute in some decision node, our learner replaces the subtree in question by a leaf with a Naïve Bayes classifier if there is no longer sufficient evidence for the currently selected test attribute. This approach is based on the ability of Naïve Bayes leaves to improve the any-time property of Hoeffding trees (Gama et al., 2003). Since leaves of Hoeffding trees evolve by learning from subsequent training examples, the new Naïve Bayes leaf will be grown into a new subtree that reflects the new target concept. The results of the evaluation show that our lightweight approach can react fast to concept drift. For evaluation purposes, we suggest an approach that differs from the established test scenario for supervised learning tasks on data streams. Since the classification performance of a classifier depends on the speed at which query examples arrive, we distinguish between training and test examples and simulate test streams of different speeds. Experiments show that a learner that reacts fast is beneficial if the test/query instances arrive in a fast data stream.
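The statistical evidence mentioned above is typically quantified with the Hoeffding bound; a minimal sketch of that decision (R is the range of the split criterion, delta the allowed error probability):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """eps such that the true mean is within eps of the sample mean
    with probability at least 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Keep (or grow) a test attribute only if the observed gain difference
# between the two best attributes exceeds eps.
g_best, g_second, n = 0.42, 0.30, 500
eps = hoeffding_bound(value_range=1.0, delta=1e-6, n=n)
print(g_best - g_second > eps)  # sufficient statistical evidence?
```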
References
HULTEN, G., SPENCER, L. and DOMINGOS, P. (2001): Mining Time-Changing Data Streams. Proc. of KDD 2001, ACM Press.
GAMA, J., ROCHA, R. and MEDAS, P. (2003): Accurate Decision Trees for Mining High-Speed Data Streams. Proc. of KDD 2003, ACM Press.
Keywords
Decision Tree Stream Classifier, Stream Classification, Concept Drift, Streams
∗ The first author is a student of the Master 'Data & Knowledge Engineering'.
† Work of the second author was partially funded by the German Research Foundation project SP 572/11-1 "IMPRINT: Incremental Mining for Perennial Objects".
Statistical Comparison of Classifiers for
Multi-Objective Feature Selection in Instrument
Recognition
Igor Vatolkin1, Bernd Bischl2, Günter Rudolph1, and Claus Weihs2
1 TU Dortmund, Chair of Algorithm Engineering {igor.vatolkin;guenter.rudolph}@tu-dortmund.de
2 TU Dortmund, Chair of Computational Statistics {bernd.bischl;claus.weihs}@tu-dortmund.de
Abstract. Instrument identification is one of the most challenging tasks in Music Information Retrieval. With an increasing number of simultaneously playing
sources it becomes harder to distinguish between their spectral fractions, which are
built from fundamental frequencies, overtones, non-harmonic and resonant components. Also the intensity of these characteristics varies over time and is often
classified into attack, decay, sustain and release stages. A vast number of different
features are potentially available for instrument classification, and it is still unsolved which perform best. It is also acceptable, to a certain degree, to trade off prediction accuracy against a computationally simpler model with fewer features. Because this trade-off can in general not be specified a priori, we employ a multi-objective feature selection approach with two objectives: instrument recognition quality in polyphonic mixtures and the number of features in the model. The performance of several classifiers and their impact on the Pareto front are compared by means of statistical tests.
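For illustration, a minimal sketch of extracting the Pareto front over the two objectives (classification error and number of features, both minimized); the actual optimization in the contribution uses evolutionary multi-objective selection (cf. Beume et al. (2007)):

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated points; all objectives are minimized."""
    idx = []
    for i, p in enumerate(points):
        dominated = any((q <= p).all() and (q < p).any()
                        for j, q in enumerate(points) if j != i)
        if not dominated:
            idx.append(i)
    return idx

# (error rate, number of features) for some candidate feature subsets
cands = np.array([[0.12, 40], [0.15, 12], [0.11, 55], [0.15, 30], [0.25, 5]])
print(pareto_front(cands))  # -> [0, 1, 2, 4]; [0.15, 30] is dominated
```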
References
BEUME, N., NAUJOKS, B. and EMMERICH, M. (2007): SMS-EMOA: Multiobjective selection
based on dominated hypervolume. European Journal of Operational Research, 181(3), 1653–
1669.
GUYON, I., GUNN, S., NIKRAVESH, M. and ZADEH, L. (Eds.) (2006): Feature Extraction,
Foundations and Applications. Springer, Berlin - Heidelberg.
VATOLKIN, I., PREUß, M. and RUDOLPH, G. (2011): Multi-Objective Feature Selection in Music Genre and Style Recognition Tasks. In: N. Krasnogor and P.L. Lanzi (Eds.): Proceedings
of the 2011 Genetic and Evolutionary Computation Conference (GECCO). ACM Press, New
York, 411–418.
Keywords
INSTRUMENT RECOGNITION, FEATURE SELECTION
Group-Based Ant Colony Optimization
Gunnar Völkel1, Uwe Schöning1, and Hans A. Kestler2
1 Institute of Theoretical Computer Science, University of Ulm, [email protected] (PhD student), [email protected]
2 Institute of Neural Information Processing, University of Ulm, [email protected]
Abstract. Ant Colony Optimization (ACO) is a metaheuristic for combinatorial optimization problems. The main idea of ACO is that in each iteration a fixed number
of solutions is constructed probabilistically based on a pheromone matrix which
evolves between the iterations. In general a solution consists of a sequence of solution components.
For problems like the Traveling Salesman Problem (TSP) the linear solution encoding of ACO as a sequence of components works well since a sequence of customers is a natural representation of the visiting order of those customers. The solutions of the Capacitated Vehicle Routing Problem (CVRP), a descendant of the TSP,
usually consist of more than one route. A linear solution encoding for the CVRP
has to consist of the component sequences of the individual routes interleaved with
some end of route component. This is no natural encoding because it favors one
route over another whereas the problem does not state such a preference. Generally, this applies to problems with a solution that is sub-structured into independent
groups of components.
We propose Group-Based Ant Colony Optimization (GBACO) whose solution
encoding is sub-structured into groups each consisting of a sequence of components. The modified construction procedure selects one pair of group and component
probabilistically and adds the selected component to the selected group. First experiments comparing ACO and GBACO on the commonly used Solomon benchmark
instances (VRP with Time Windows) are presented.
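A hedged sketch of the modified construction procedure; the uniform weight function below is a placeholder assumption, whereas the actual algorithm derives the selection probabilities from the pheromone matrix:

```python
import random

def construct_solution(groups, components, weight):
    """GBACO-style construction: repeatedly pick one (group, component)
    pair by roulette selection and append the component to that group."""
    solution = {g: [] for g in groups}
    remaining = list(components)
    while remaining:
        pairs = [(g, c) for g in groups for c in remaining]
        weights = [weight(g, c, solution) for g, c in pairs]
        g, c = random.choices(pairs, weights=weights, k=1)[0]
        solution[g].append(c)   # the component joins the group's sequence
        remaining.remove(c)
    return solution

# Toy usage: 2 routes (groups), 4 customers, uniform "pheromone" weights.
print(construct_solution([0, 1], ["a", "b", "c", "d"], lambda g, c, s: 1.0))
```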
References
DORIGO, M. and STÜTZLE, T. (2009): Ant Colony Optimization: Overview and Recent Advances. Techreport, IRIDIA, Université Libre de Bruxelles.
GAMBARDELLA, L. M. and TAILLARD, E. and AGAZZI G. (1999): MACS-VRPTW - A Multiple Colony System For Vehicle Routing Problems With Time Windows. In: Corne, D. and
Dorigo, M. et al. (Eds.): New Ideas in Optimization. McGraw-Hill, Maidenhead, England,
63–76.
Keywords
GROUP-BASED ACO, VEHICLE ROUTING PROBLEM
The Dark Side of Marketing Communication:
Grouping Consumers with Respect to Their
Reactance Behavior
Ralf Wagner
SVI-Endowed Chair for International Direct Marketing
DMCC- Dialog Marketing Competence Center
University of Kassel, Germany
[email protected]
Abstract. Marketing practitioners as well as researchers are fascinated by the new opportunities of communicating to and with customers using social media and sophisticated mobile devices (e.g., Wagner, 2011). However, consumers' dialogue competence as well as their disposition are rarely challenged.
In this study the recipients' reactance (Brehm & Brehm, 1981) towards marketing communication is quantified by means of a Rasch model. This probabilistic test theory approach makes it possible to compute individual scores for the recipients. These scores provide the data basis for grouping the recipients into clusters with similar reactance behavior. Taking advantage of the Rasch framework's invariance of comparisons (Salzberger, 1999), we reveal patterns of unfavorable marketing communication consequences in different cultures.
References
ANDRICH, D. (2002): Understanding Resistance to the Data-Model Relationship in Rasch's Paradigm: A Reflection for the Next Generation. Journal of Applied Measurement, 3, 325–359.
BREHM, S. S. and BREHM, J. W. (1981): Psychological Reactance: A Theory of Freedom and Control. Academic Press, New York.
SALZBERGER, T. (1999): Interkulturelle Marktforschung - Methoden zur Überprüfung der Datenäquivalenz. Service, Wien.
WAGNER, R. (2011): Neue Medien im Kundendialog - Ein Überblick zu den Kommunikationsdiensten des Web 2.0. In: W. Lietzau, J. Bender and T. Richter (Eds.): Praxishandbuch Social Media in Verbänden. Grundlagen - Praxiswissen - Fallbeispiele. DGVM, Bonn, 90–104.
Keywords
INVARIANCE, MARKETING COMMUNICATION, REACTANCE
Evaluating Tag Similarity Measures by
Clustering Bibsonomy Tags
Christian Wartena1 and Rogier Brussee2
1 Hochschule Hannover, Expo Plaza 12, 30539 Hannover, Germany. [email protected]
2 Univ. of Applied Sciences Utrecht, Crossmedialab, PO Box 8611, 3503 RP Utrecht, The Netherlands. [email protected]
Abstract. Most approaches to determine semantic relatedness of collaborative tags
have been based on direct co-occurrence of tags (see Markines et al. (2009) for
an overview). In Wartena and Brussee (2008) we have studied a similarity measure based on comparison of contexts in which tags occur, and we could show that
this so-called second order co-occurrence outperforms similarity measures based
on direct co-occurrence in an ontology alignment task. Here we add evidence to
the superiority of second order co-occurrence by showing that the second order co-occurrence similarity measure is also superior in a tag clustering task. The topical coherence of clustering results largely depends on the quality of the distance measure used by the clustering algorithm. Thus, the effectiveness of clustering can be used as a method to evaluate distance measures for semantic relevance. For an empirical evaluation we used a dump with 2.6 million tag assignments from Bibsonomy, a tagging service for scientific papers. We chose 12 scientific disciplines and their main journals, and selected the 215 most typical tags. We then cluster the tags using distance measures based on Jaccard coefficients, on cosine similarity and on the above mentioned second-order co-occurrence. We evaluated the results against the predefined clustering by scientific discipline using F-scores and cluster purity. Clustering based on the second-order co-occurrence similarity measure consistently performed better, for both F-score and purity, than clustering using cosine similarity and the Jaccard coefficient.
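One way to formalize the second-order idea (an illustrative sketch; Wartena and Brussee (2008) define their measure via tag context distributions, which may differ in detail):

```python
import numpy as np

def direct_cooccurrence(T):
    """Cosine similarity between tag incidence vectors (first order);
    T is a binary tags x documents matrix."""
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    U = T / np.clip(norms, 1e-12, None)
    return U @ U.T

def second_order(T):
    """Two tags are similar if they co-occur with similar OTHER tags:
    compare the context profiles instead of the tags themselves."""
    C = direct_cooccurrence(T)       # context profile of each tag
    return direct_cooccurrence(C)    # similarity of the profiles

T = np.array([[1, 1, 0, 0],          # toy tag-document incidence matrix
              [0, 1, 1, 0],
              [0, 0, 1, 1]], float)
print(np.round(second_order(T), 2))
```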
References
MARKINES, B., CATTUTO, C., MENCZER, F., BENZ, D., HOTHO, A. and STUMME, G. (2009): Evaluating similarity measures for emergent semantics of social tagging. In: J. Quemada et al. (Eds.): Proceedings of the 18th International Conference on World Wide Web. ACM, 641–650.
WARTENA, C. and BRUSSEE, R. (2008): Instance-based mapping between thesauri and folksonomies. In: A. Sheth et al. (Eds.): Proceedings of the 7th International Conference on The Semantic Web. Springer, 356–370.
Keywords
TAGGING, CO-OCCURRENCE, SIMILARITY, CLUSTERING
Applying Leaders Driven Community Detection
Algorithms to Data Clustering
Zied Yakoubi and Rushed Kanawati
LIPN CNRS UMR 7030 University Paris Nord, Villetaneuse, France
[email protected]
Abstract. Leaders driven community detection algorithms (LDA hereafter) constitute a new trend in devising algorithms for community detection in complex networks. Unlike most existing community detection approaches, LDA algorithms are
not guided by the optimization of an objective function such as the modularity. In
this work, we show that LDA approaches can also be efficiently applied to data clustering. The clustering approach is organized into two steps: first the input dataset is
processed to generate a complex network. This is achieved by constructing the relative neighborhood graph using a similarity matrix induced by a given distance function defined over the dataset points (Toussaint (1980)). Then, we apply a community
detection algorithm on the produced graph in order to get the different clusters. We show through experimentation on different classical clustering dataset benchmarks that applying our LDA algorithm, called LICOD (Kanawati (2011)), provides better clustering results, evaluated in terms of purity and Rand index, than modularity optimization algorithms (Blondel et al. (2008)), label propagation community detection approaches (Raghavan et al. (2007)) and the K-means algorithm.
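A minimal sketch of the first step, constructing Toussaint's relative neighborhood graph from a precomputed distance matrix D (an edge connects u and v unless some w is strictly closer to both):

```python
import numpy as np

def relative_neighborhood_graph(D):
    """Toussaint's RNG on a symmetric distance matrix D."""
    n = D.shape[0]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            if not any(max(D[u, w], D[v, w]) < D[u, v]
                       for w in range(n) if w not in (u, v)):
                edges.append((u, v))
    return edges

pts = np.array([[0, 0], [1, 0], [2, 0], [1, 2]], float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(relative_neighborhood_graph(D))  # -> [(0, 1), (1, 2), (1, 3)]
```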
References
BLONDEL, V.D., GUILLAUME, J.-L., LAMBIOTTE, R. and LEFEBVRE, E. (2008): Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008.
KANAWATI, R. (2011): LICOD: Leaders Identification for Community Detection in Complex Networks. In: IEEE SocialCom'11, Boston, MA, 577–582.
RAGHAVAN, U.N., ALBERT, R. and KUMARA, S. (2007): Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 76, 036106.
TOUSSAINT, G.T. (1980): The Relative Neighbourhood Graph of a Finite Planar Set. Pattern Recognition, 12(4), 261–268.
Keywords
DATA CLUSTERING, COMMUNITY DETECTION.
Part IX
Interdisciplinary Domains
Onset detection using an auditory model
Bauer, Nadja, Friedrichs, Klaus, Schiffner, Julia, and Weihs, Claus
Chair of Computational Statistics, Faculty of Statistics, TU Dortmund
{bauer,friedrichs,schiffner,weihs}@statistik.tu-dortmund.de
Abstract. Onset detection is an important step for music transcription and other
applications like timbre or meter analysis. Although several approaches have been developed for this task, none of them works well under all circumstances. In our work, we will use a simple algorithm proposed by Bauer et al. (2010), which is
based on calculating the correlation index between spectra of neighboring signal
windows. In Bauer et al. (2012) this algorithm was tested on a special data set of
tone sequences, which are composed of recorded tones from real musical instruments. This data set was generated by using an experimental design, in which music
tempo and the set of instruments are considered as control variables. This particularly allows measuring the influence of each instrument on the onset detection rate.
In this work, the onset detection algorithm is extended by a computational model
of the human auditory periphery. Instead of on the original signal, the spectral analysis is evaluated on the outputs of the simulated auditory nerve fibres. The extension of
this simple algorithm with an auditory model leads to an essential improvement of
the onset detection rate compared to previous results. The main challenge here is
combining the outputs of all auditory nerve fibres to one feature for onset detection.
Different approaches are presented and compared.
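A sketch of the underlying correlation-index idea on the raw signal (the auditory-model front end of the contribution is omitted here; window and hop sizes are illustrative):

```python
import numpy as np

def onset_curve(x, win=1024, hop=512):
    """Correlation index between magnitude spectra of neighbouring windows;
    a drop in correlation (high 1 - corr) marks a candidate note onset."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win, hop)]
    spectra = [np.abs(np.fft.rfft(f)) for f in frames]
    corr = [np.corrcoef(spectra[i], spectra[i + 1])[0, 1]
            for i in range(len(spectra) - 1)]
    return 1.0 - np.array(corr)

sr = 22050
t = np.arange(sr) / sr
x = np.where(t < 0.5, np.sin(2*np.pi*440*t), np.sin(2*np.pi*660*t))  # tone change
print(onset_curve(x).round(2).max())  # the curve peaks near the 0.5 s boundary
```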
References
BAUER, N., SCHIFFNER, J., WEIHS, C. (2010): Einsatzzeiterkennung bei polyphonen
Musikzeitreihen. SFB 823 Discussion Paper 22/2010, TU Dortmund.
BAUER, N., SCHIFFNER, J., WEIHS, C. (2012): Einfluss der Musikinstrumente auf die Güte der
Einsatzzeiterkennung. SFB 823 Discussion Paper 10/2012, TU Dortmund.
Keywords
AUDITORY MODEL, DESIGN OF EXPERIMENTS, ONSET DETECTION
Computational Aspects of Natural Languages’
Similarities
Andreea Beica1∗ and Liviu P. Dinu2
1 University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei, Bucharest, Romania [email protected]
2 University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei, Bucharest, Romania [email protected]
Abstract. Natural languages can be classified into families - groups of languages
related through descent from a common ancestor, called the proto-language of that
family. Establishing language families is equivalent to the construction of phylogenetic language trees. What is more, words are the core of any language, and cognates are words that have a common etymological origin. Therefore, cognate identification, alongside phylogenetic inference (which aims to determine the existing genetic relationships between languages), represents the basis of discovering the evolutionary history of languages. In this thesis we have designed a system that uses different distances (like the rank distance [1] or the Hamming and alphabet-weight edit distances) to measure string similarity, and we have applied the system to the task of phylogenetic inference. We initially worked on a relatively small corpus, consisting of 200-word Swadesh lists, one for each of the 11 languages we analyse. We then extended our work to significantly larger parallel corpora: George Orwell's '1984' novel, translated into 8 of the 11 languages. We conducted our study using both the phonetic transcription and the Latin-alphabet form of our corpora. We used both a dataset consisting of syllables and one consisting of whole words. When applied to the Indo-European language family, our method estimated phylogenies that were compatible with the benchmark tree and correctly reproduced the established major language groups present in the dataset; thus, our results confirm the linguistic theories and support the correctness of our approach.
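A small sketch of the rank distance [1, 2] on occurrence-indexed strings, one of the distances used by the system:

```python
def indexed(s):
    """Annotate each character with its occurrence number: 'aba' -> a1, b1, a2;
    the rank of an element is its position in the string."""
    seen, out = {}, {}
    for pos, ch in enumerate(s, start=1):
        seen[ch] = seen.get(ch, 0) + 1
        out[(ch, seen[ch])] = pos
    return out

def rank_distance(s1, s2):
    """Sum of rank differences over both index sets; an element missing
    from one string contributes its rank in the other string."""
    r1, r2 = indexed(s1), indexed(s2)
    keys = set(r1) | set(r2)
    return sum(abs(r1.get(k, 0) - r2.get(k, 0)) for k in keys)

print(rank_distance("abc", "bca"))  # -> 4
```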
References
[1] DINU, L.P. (2003): On the classification and aggregation of hierarchies with different constitutive elements. Fundamenta Informaticae, 55(1), 39–50.
[2] DINU, L.P. (2005): Rank Distance with Applications in Similarity of Natural Languages. Fundamenta Informaticae, 65, 1–15.
Keywords
PHYLOGENETIC INFERENCE, LANGUAGE SIMILARITY, LANGUAGE FAMILIES, PHYLOGENIES
∗ Final year undergraduate student
A Unifying Framework for GPR Image
Reconstruction
Andre Busche, Ruth Janning, Tomáš Horváth, and Lars Schmidt-Thieme
University of Hildesheim, Information Systems and Machine Learning Lab
{busche,janning,horvath,schmidt-thieme}@ismll.uni-hildesheim.de
Abstract. Ground Penetrating Radar (GPR) is a widely used technique for detecting buried objects in subsoil. Exact localization of buried objects is required, e.g.
during environmental reconstruction works to both accelerate the overall process
and to reduce overall costs. Radar measurements are usually visualized as images,
so-called radargrams, that contain certain geometric shapes to be identified.
This paper introduces a component-based image reconstruction framework for the recognition process based on pixelwise image decomposition at position (x, y):

I(x, y) = \bigoplus_{k=1}^{K} f_k(\theta_k, x, y) + \sum_{l=1}^{L} g_l(x, y)    (1)
We assume an image to be generated out of K base component models f_k, individually parameterized through θ_k, e.g., an image representation from an FDTD-based simulation or some extracted pattern from a training dataset. These component models are aggregated through an operator ⊕, e.g., a summation ∑ in the case of a first-order Born approximation on simulated radargrams, or more complex convolutional operations. Integration of L different noise components g_l allows for capturing different noise types for better estimations and artifact suppression.
We present initial experimental results on a simple instantiation of this conceptual model using primitive object shapes, being a first step towards a pluggable,
robust image reconstruction mechanism for GPR data.
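A toy instantiation of Eq. (1) with summation as the aggregation operator (the hyperbola-shaped component and its parameters are hypothetical placeholders, not the framework's actual models):

```python
import numpy as np

def compose(shape, components, noises, agg=np.add):
    """Pixelwise instantiation of Eq. (1): aggregate K component models
    (here agg = summation, i.e. a first-order Born approximation) and
    add L noise components."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    img = np.zeros(shape)
    for f, theta in components:
        img = agg(img, f(theta, x, y))
    for g in noises:
        img += g(x, y)
    return img

# Hypothetical component: a hyperbola-like reflection with apex parameters.
hyperbola = lambda th, x, y: np.exp(-((y - np.sqrt((x - th[0])**2 + th[1])) ** 2))
rng = np.random.default_rng(0)
I = compose((64, 64), [(hyperbola, (32, 40.0))],
            [lambda x, y: 0.05 * rng.standard_normal(x.shape)])
print(I.shape)
```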
References
H. Chen and A.G. Cohn, Probabilistic robust hyperbola mixture model for interpreting ground
penetrating radar data, IJCNN IEEE, 2010, 1-8.
F. Yaman, Location and shape reconstructions of sound-soft obstacles in penetrable cylinders, Inverse Problems, 2009, 1-17.
Keywords
GPR, Image Reconstruction, Inverse Problem, Unifying Framework, template models
Evaluating Similarity Measures for Plagiarism
Detection in Melody Transcriptions
Christian Dittmar1, Daniel Gärtner1, Kay F. Hildebrand2, and Florian Müller3
1 Fraunhofer IDMT, Department Metadata, Ilmenau, Germany dmr|[email protected]
2 European Research Center for Information Systems (ERCIS), Münster [email protected]
3 zeb/information.technology [email protected]
Abstract. Plagiarism in the area of music is a problem that consumes a lot of resources. Lawsuits prosecuting acoustic plagiarism can last for decades. Consequently, plagiarism cases need a transparent and efficient approach to reduce the insecurity of judges and accelerate the decision process.
Instead of performing similarity analyses manually, software can be used. By automatically extracting relevant features from audio files, a working basis is created.
After correcting this input, musicology experts can apply pattern matching algorithms. Eventually, software can display identified similarities to enable evaluation
of individual importance and explanation to untrained audiences.
In this paper, we present a detailed empirical evaluation of algorithms that can
be used to compare transcribed melodies in pitch vector format. Pitch Vector Similarity (PVS), Recursive Alignment (RA), Geometric Alignment (GA) and Sequence
Alignment (SA) have been submitted to tests evaluating their ability to detect similarities with increasing difference in compared sequences. Results show that PVS
and SA deliver good detection rates and are most stable and flexible among tested
candidates. Under the given conditions, GA performed better than RA.
References
M. Ryynänen and A. Klapuri (2008): Query by humming of MIDI and audio using locality sensitive hashing. ICASSP 2008, 2249–2252.
X. Wu et al. (2006): A top-down approach to melody match in pitch contour for query by humming. Proceedings of ISCA 2006.
J. Urbano et al. (2011): Melodic Similarity through Shape Similarity. In: Proceedings of the 7th International Conference on Exploring Music Contents, 338–355.
Keywords
AUDIO PLAGIARISM DETECTION, SIMILARITY MEASURES, LOCAL ALIGNMENT, GLOBAL ALIGNMENT.
From Single Tones to MIDI Remixes - Detecting
Families of Musical Instruments by High-Level
Features
Eichhoff, Markus1 and Weihs, Claus1
1 Chair of Computational Statistics, Faculty of Statistics, TU Dortmund {eichhoff,weihs}@statistik.tu-dortmund.de
Abstract. Detecting musical instruments in pieces of polyphonic music given as mp3- or wav-files is a difficult task. Using source-filter models for sound separation, as done in Heittola et al. (2009), is one approach. In this study four families of musical instruments (strings, wind, piano, plucked strings) are classified by using the four high-level audio feature groups Pitchless Periodogram (PiP) (Weihs and Ligges (2003)), Absolute Amplitude Envelope, Mel-Frequency Cepstral Coefficients and Linear Predictive Coding, in order to also take physical properties of the instruments into account (Fletcher (2008)). These feature groups are calculated for consecutive time blocks. Statistical supervised classification methods such as LDA, MDA, Support Vector Machines, Random Forest, Boosting and variable selection are used for classification. This instrument recognition task is carried out for single tones, intervals, chords and MIDIs. MIDI samples have been replaced by real audio samples that are used for training the statistical models in the case of single tones, intervals and chords. Statistical tests confirm hypotheses on, e.g., which blocks are at least necessary or which statistical methods are best for each classification task.
References
FLETCHER, N.H. (2008): The physics of musical instruments. Springer, New York, 2008.
HEITTOLA, T., KLAPURI, A. and VIRTANEN, T. (2009): Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model For Sound Separation. 10th International Society
for Music Information Retrieval Conference, ISMIR 2009, Proceedings.
WEIHS, C. and LIGGES, U. (2003): Voice Prints as a Tool for Automatic Classification of Vocal
Performance. In: R. Kopiez, A. C. Lehmann, I. Wolther and C. Wolf (Eds.): Proceedings of
the 5th Triennial ESCOM Conference, Hanover University of Music and Drama. Germany,
September 8-13, 332-335.
Keywords
HIGH-LEVEL AUDIO FEATURES, MUSICAL INSTRUMENT RECOGNITION,
SUPERVISED CLASSIFICATION, PIP, MIDI
Learning in groups and exam performance
Andreas Geyer-Schulz1 , Jonas Kunze1 , and Andreas Sonnenbichler1
Informationsdienste und Elektronische Märkte, Karlsruhe Institute of Technology
(KIT)
{andreas.geyer-schulz|jonas.kunze|andreas.sonnenbichler}@kit.edu
Abstract. As learning in groups may increase learners' productivity (cf. Lam and Ching, 2001), exercise courses are a common standard amongst classes in universities. Besides the various aspects that may be considered while forming a learning group (Hsiung, 2010), the size of the group has shown to play a special role. Hunkeler and Sharp (1997) showed that four-member groups outperformed three-member groups in a statistically significant way.
In this article, we present an evaluation of 2010-2012 exercise course data and final exam performance with a special focus on the learner group size in different courses.
References
HSIUNG, C.-M. (2010): An experimental investigation into the efficiency of cooperative learning
with consideration of multiple grouping criteria. European Journal of Engineering Education,
35(6), 679–692.
HUNKELER, D. and SHARP, J. E. (1997): Assigning functional groups: The influence of group
size, academic record, practical experience, and learning style. Journal of Engineering Education, 86(4), 321–332.
LAM, M. and CHING, R. (2001): Effect of group learning on academic performance: A pilot study
for com-based classes. In: AMCIS 2001 Proceedings. Paper 21.
Keywords
EDUCATION, GROUP LEARNING, GROUP SIZE
ANOVA and Alternatives for Causal Inferences
Sonja Hahn1
Friedrich-Schiller-Universität Jena, Institut für Psychologie, Am Steiger 3, Haus 1,
07743 Jena [email protected]
Abstract. Analysis of variance (ANOVA) is one of the procedures most often used
for analyzing experimental and quasiexperimental data in psychology. Nonetheless there is sometimes confusion about which subtype to prefer when the data are unbalanced. Much of this confusion can be prevented when an adequate hypothesis is formulated first. In the present paper this is done by using a theory of causal effects. This is the starting point for the following simulation study on unbalanced two-way designs. The simulated data sets differed in the presence of an (average) effect, the degree of interaction, sample size (N = 30; 60; 90; 150; 300; 600; 900), stochasticity of the factors and whether there was confounding between the two factors (i.e. experimental vs. quasiexperimental design). Different subtypes of ANOVA as well as other
competing procedures from the tradition of causality research were compared with
regard to adherence to the nominal α-level and power. Results suggest that different
types of ANOVA should be used with care, especially in quasiexperimental designs
and when there is interaction. Procedures developed within the tradition of causality research are feasible alternatives that may serve better to answer meaningful
hypotheses.1
References
STEYER, R., GABLER, S., von DAVIER, A. A. and NACHTIGALL, C. (2000): Causal regression
models II: Unconfoundedness and causal unbiasedness. Methods of Psychological Research
Online, 5, 55–87.
STEYER, R., NACHTIGALL, C., WÜTHRICH-MARTONE, O. and KRAUS, K. (2002): Causal
Regression Models III: Covariates, Conditional, and Unconditional Average Causal Effects.
Methods of Psychological Research Online, 7, 41–68.
Keywords
ANOVA, CAUSALITY, SIMULATION STUDY, UNBALANCED DESIGNS
1 The author is a PhD student.
Testing Models for Medieval Settlement
Location
Irmela Herzog
The Rhineland Commission for Archaeological Monuments and Sites
The Rhineland Regional Council [email protected]
Abstract. Two models have been proposed for the spread of Medieval settlements
in the landscape known as Bergisches Land in Germany. Some archaeologists think
that the spread was closely connected with the ancient trade routes which were already in use before the population increase in Medieval times. An alternative hypothesis assumes that the settlements primarily developed in the valleys with good
soil. Focusing on an area covering 675 km² of the Bergisches Land, this contribution investigates the two hypotheses. For this study area, a publication is available
listing the years when the small hamlets and villages were first mentioned in historical sources, with a total of 513 locations mentioned between 950 and 1350 AD.
In a first step the patterns of movement in Medieval times are derived from the
trade routes of that time. The result is an adjusted distance measure, which takes
slope and wet soil into account. In the next step, simple accessibility maps are generated on the basis of this adjusted distance measure for both alternative targets,
i.e. the trade routes and the valleys with favourable soils. For each location, the
accessibility values in these maps correspond to the distance to the nearest trade
route or valley with good soils respectively. In a final step, for each alternative target a Kolmogorov-Smirnov test is applied to compare the adjusted distances of the
Medieval settlements with the reference distribution derived from the appropriate
accessibility map.
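The final step can be sketched with a two-sample Kolmogorov-Smirnov test; the gamma-distributed samples below are purely hypothetical stand-ins for the real adjusted distances:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Adjusted distances of the 513 dated settlement locations (synthetic here),
# and a reference sample drawn from the accessibility map of one target
# (trade routes or valleys with good soils).
settlement_dist = rng.gamma(2.0, 10.0, size=513)
reference_dist  = rng.gamma(2.5, 12.0, size=5000)

stat, p = ks_2samp(settlement_dist, reference_dist)
print(f"KS statistic = {stat:.3f}, p = {p:.3g}")
# A small p-value indicates that settlements are not placed like the
# reference distribution of that target.
```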
References
BORTZ, J. and LIENERT, G.A. (1998): Kurzgefasste Statistik für die klinische Forschung.
Springer, Berlin.
HERZOG, I. (2009): Berechnung von optimalen Wegen am Beispiel der Zeitstrasse.
Archäologische Informationen 31 (1&2), 87–96.
NICKE, H. (2001): Vergessene Wege. Martina Galunder Verlag, Nümbrecht.
Keywords
MEDIEVAL SETTLEMENTS, LEAST-COST PATHS, KOLMOGOROV-SMIRNOV
TEST
Supporting Selection of Statistical Techniques in
Research
Kay F. Hildebrand
European Research Center for Information Systems (ERCIS), Münster
[email protected]
Abstract. In this paper we describe the necessity for a more structured approach towards quantitative research. The number of available techniques has surpassed the
limit of possible comprehension by researchers. Deciding for one suitable technique
to work with a given dataset is a non-trivial and time-consuming task. Thus, structured support for choosing adequate data analysis techniques is required. We present a structural framework for organizing techniques and a description template to uniformly characterize them. We show that the former provides an overview of all available techniques on different levels of abstraction, while the latter offers a way to assess a single method as well as to compare it to others. Furthermore, we developed a set of guidelines for the process of data analysis that, if applied, will increase the overall quality of data analysis in research.
References
J. Becker et al. (2000). Guidelines of Business Process Modeling. Business Process Management,
1806, 30-49. Springer.
J. Jackson (2002). Data Mining: A Conceptual Overview. Communications of the Association for
Information Systems, 8(1), 267-296.
Keywords
RESEARCH METHODOLOGY, DATA ANALYSIS, STATISTICS, FRAMEWORK,
GUIDELINES.
Alignment methods for folk tune classification
Ruben Hillewaere1, Bernard Manderick1, and Darrell Conklin2,3
1 Computational Modeling Lab, Department of Computing, Vrije Universiteit Brussel, Brussels, Belgium {rhillewa,bmanderi}@vub.ac.be
2 Department of Computer Science and AI, Universidad del País Vasco UPV/EHU, San Sebastián, Spain darrell [email protected]
3 IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
Abstract. In folk song research, alignment methods have been widely used to retrieve highly similar tunes from a database. In a recent study (van Kranenburg,
2010), they have also been applied to the specific task of tune family classification,
a tune family being an ensemble of folk songs which are all variations of the same
tune. It is shown that they achieve remarkable classification accuracies in comparison with other types of models.
In this study, we investigate how alignment methods perform on two fundamentally different classification tasks. The first task is geographic region classification,
which we have thoroughly studied in our previous work (Hillewaere et al., 2009). A
second task is a folk tune genre classification, where the genres are the dance types
of the tunes. Given the excellent results with alignment methods on tune family classification, one could expect that they would also perform well on other classification
tasks.
To verify that hypothesis, a string edit distance method is applied to three folk
music datasets. Folk tunes are encoded in melodic and rhythmic representations: as
strings of pitch intervals, and as strings of inter onset intervals. All pairwise edit
distances are computed over the string representations and the classification is done
with a one-nearest-neighbour algorithm.
Classification accuracies of the alignment methods are compared with an n-gram model. Results confirm that alignment methods perform well on the tune family classification task, and suggest that n-gram models are better choices for the other two classification tasks.4
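A minimal sketch of the string edit distance used for the alignment, here on pitch-interval strings (the cost functions are illustrative defaults):

```python
def edit_distance(a, b, sub_cost=lambda x, y: 0 if x == y else 1, indel=1):
    """Classic edit (alignment) distance via dynamic programming; a and b
    may be strings of pitch intervals or of inter-onset intervals."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i * indel
    for j in range(n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel,            # deletion
                          d[i][j - 1] + indel,            # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

# Two tunes as pitch-interval sequences (semitones between successive notes).
print(edit_distance([2, 2, -4, 5], [2, 1, -4, 5]))  # -> 1
```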
References
HILLEWAERE, R., MANDERICK, B. and CONKLIN, D. (2009): Global feature versus event
models for folk song classification. In: Proceedings of the 10th International Society for Music
Information Retrieval Conference. Kobe, Japan, 729–733.
VAN KRANENBURG, P. (2010): A computational approach to content-based retrieval of folk
song melodies. SIKS dissertatiereeks, 43.
Keywords
MUSIC CLASSIFICATION, ALIGNMENT, MUSIC REPRESENTATION
4 Author Ruben Hillewaere is a PhD student.
Comparing regression approaches in modelling
(non-)compensatory judgment formation
Thomas Hörstermann1∗ and Sabine Krolak-Schwerdt2
1,2 University of Luxembourg, Route de Diekirch, L-7220 Walferdange
[email protected], [email protected]
Abstract. Research on judgment formation deals with the integration of multiple pieces of information into a mostly unidimensional judgment. Psychological theories and empirical results support the assumption of compensatory strategies, e.g. (weighted) additive models, as well as non-compensatory (heuristic) strategies as underlying decision rules. If a compensatory decision rule is assumed, multiple regression is frequently used to model the judgment formation process. An adequate fit of the regression model in turn leads to the conclusion that the cognitive process of judgment formation is compensatory, whereas an unsatisfactory fit leads to the rejection of a cognitive compensatory model. The conclusion's validity is impaired if regression models do not reliably identify an underlying compensatory decision rule, or if non-compensatory decision rules also lead to an adequate fit of the linear model. The study addresses this question by applying regression techniques to simulated sets of judgment data with underlying compensatory and non-compensatory decision rules. The simulated data sets are designed to reflect typical data sets from empirical educational research. Results indicate that non-compensatory decision rules, at least partially, may lead to an adequate fit, thus impairing the conclusion's validity.
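The core of such a simulation can be sketched as follows (a hypothetical conjunctive rule and plain least squares; the study's actual data-generating designs are richer):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))   # two cues

# Non-compensatory (conjunctive) rule: the judgment is high only if BOTH
# cues pass a threshold; a strong cue cannot compensate for a weak one.
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(float)

# Fit a linear (compensatory) model by least squares and inspect the fit.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"R^2 of the linear model on conjunctive data: {r2:.2f}")
```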
References
ANDERSON, N.H. and BUTZIN, C.A. (1974): Performance = Motivation × Ability. An
Integration-theoretical Analysis. Journal of Personality and Social Psychology, 30, 598–604.
GIGERENZER, G. (2008): Why Heuristics work. Perspectives on Psychological Science, 3, 20–
29.
Keywords
JUDGMENT FORMATION, JUDGMENT MODELLING, REGRESSION ANALYSIS, (NON-)COMPENSATORY JUDGMENTS
∗ PhD student
Sensitivity Analyses for the Rasch Model
Daniel Kasper∗ and Ali Ünlü
Chair for Methods in Empirical Educational Research, TUM School of Education,
Technische Universität München, Lothstrasse 17, 80335 Munich, Germany
{daniel.kasper,ali.uenlue}@tum.de
Abstract. For scaling items and persons in large scale assessment studies such as
Programme for International Student Assessment (PISA; OECD (2012)) or Progress
in International Reading Literacy Study (PIRLS; Martin et al. (2007)) variants of
the Rasch model (Fischer and Molenaar (1995)) are used. However, goodness-of-fit statistics for the overall fit of the models under varying conditions as well as
specific statistics for the various testable consequences of the models (Steyer and
Eid (2001)) are rarely, if at all, presented in the published reports.
In this paper, we apply the Rasch model to PISA data under varying conditions
(e.g., under different methods for dealing with missing data, different dichotomization procedures, or different software for performing the item response analyses).
On the basis of various overall and specific fit statistics, we compare how sensitive the Rasch model is across changing conditions. The results of our study will help in
quantifying how meaningful the findings from large scale assessment studies can be,
and we will be able to recommend under which conditions the Rasch model or its
variants can be used for scaling large scale assessment data. Finally, practical guides
are given to help the applied researcher interested in Rasch modeling in choosing
the appropriate psychometric software package for her/his intended research.
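For reference, the item characteristic curve of the dichotomous Rasch model, which underlies the scaling variants discussed above (a formula sketch, not the operational scaling software):

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch model item characteristic curve:
    P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b)),
    with person ability theta and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# A person of ability theta = 0.5 answering items of difficulty -1, 0, 1:
print(rasch_prob(0.5, np.array([-1.0, 0.0, 1.0])).round(3))
```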
References
FISCHER, G.H. and MOLENAAR, I.W. (Eds.) (1995): Rasch Models: Foundations, Recent Developments, and Applications. Springer-Verlag, New York.
MARTIN, M.O., MULLIS, I.V.S. and KENNEDY, A.M. (2007): PIRLS 2006 Technical Report.
TIMSS & PIRLS International Study Center, Chestnut Hill.
OECD (2012): PISA 2009 Technical Report. OECD Publishing, Paris.
STEYER, R. and EID, M. (2001): Messen und Testen [Measuring and Testing]. Springer-Verlag,
Berlin.
Keywords
RASCH MODEL, PISA, LARGE SCALE ASSESSMENT, SENSITIVITY ANALYSES, PSYCHOMETRIC SOFTWARE
∗ PhD student
Music and Timbre Segmentation by efficient
Order Constrained K-Means Clustering
Sebastian Krey1∗, Uwe Ligges1, and Friedrich Leisch2
1 Technische Universität Dortmund, Fakultät Statistik, Vogelpothsweg 87, 44221 Dortmund, Germany, Tel.: +49-231-755 3057, Fax: +49-231-755 4387, [email protected], [email protected]
2 Universität für Bodenkultur Wien, Institut für angewandte Statistik und EDV, Peter-Jordan-Straße 82, 1190 Wien, Austria, Tel.: +43-1-47 654 5061, Fax: +43-1-47 654 5069, [email protected]
Abstract. Clustering of features derived from musical sound recordings proved to
be beneficial for further classification tasks such as instrument recognition [1]. Using order constrained solutions in K-means clustering [2] the clustering results can
be stabilized and the interpretability of the clustering is improved. With this method
a further reduction of the misclassification error in the aforementioned instrument
recognition task is possible.
For an efficient calculation of the order constrained solutions in K-means clustering we use a dynamic programming approach implemented in the statistical programming language R. Using this efficient implementation the musical structure of
a whole piece of popular music can be extracted automatically. Visualizing the distances of the feature vectors through a self distance matrix allows for an easy visual
verification of the result.
For the estimation of the right number of clusters, we propose to calculate the
adjusted Rand indices of bootstrap samples of the data and base the decision on
the minimum of a robust version of the coefficient of variation. In addition to the
average stability, which is measured through the adjusted Rand index, this approach
takes the variation between the different bootstrap samples into account. This results
in favoring settings with little variation between the bootstrap samples, if average
stability is nearly identical.
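The dynamic program for order-constrained clustering can be sketched for a univariate series (the authors' efficient implementation is in R; this Python transcription is illustrative, cf. Steinley and Hubert (2008)):

```python
import numpy as np

def order_constrained_kmeans(x, k):
    """Optimal partition of a 1-d series into k contiguous segments,
    minimizing the within-cluster sum of squares by dynamic programming."""
    n = len(x)
    s = np.insert(np.cumsum(x), 0, 0)            # prefix sums
    s2 = np.insert(np.cumsum(np.square(x)), 0, 0)  # prefix sums of squares

    def sse(i, j):  # SSE of the segment x[i..j] (inclusive)
        m = j - i + 1
        return s2[j + 1] - s2[i] - (s[j + 1] - s[i]) ** 2 / m

    D = np.full((k + 1, n + 1), np.inf)   # D[c][j]: best cost of first j points
    back = np.zeros((k + 1, n + 1), int)  # start index of the last segment
    D[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cost = D[c - 1][i] + sse(i, j - 1)
                if cost < D[c][j]:
                    D[c][j], back[c][j] = cost, i
    cuts, j = [], n                       # recover the segment boundaries
    for c in range(k, 0, -1):
        cuts.append(back[c][j]); j = back[c][j]
    return sorted(int(c) for c in cuts[:-1])

x = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 9.0, 9.1])
print(order_constrained_kmeans(x, 3))  # -> [3, 6]
```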
References
1. LIGGES, U. and KREY, S. (2011): Feature Clustering for Instrument Classification. Computational Statistics, 26(2), 279–291.
2. STEINLEY, D. and HUBERT, L. (2008): Order-constrained solutions in k-means clustering: Even better than being globally optimal. Psychometrika, 73(5), 647–664.
Keywords
CLUSTERING, CONSTRAINTS, MUSIC, CLASSIFICATION
∗ PhD student
The balance of value and space - Merging classification and regionalization to make more sense out of spatial data
Martin Loidl1 and Christoph Traun1
Center for Geoinformatics, University of Salzburg, Hellbrunnerstr. 34, 5020
Salzburg [martin.loidl; christoph.traun]@sbg.ac.at
Abstract. Within the domain of geography there are basically two different approaches to reduce complexity and reveal underlying patterns of polygonally aggregated, univariate quantitative data like unemployment rates per administrative unit:
1. Regionalization (in its homogeneous variant) combines polygons into one region if they share a common border and are attributively similar. This approach
primarily aims to define boundaries between contiguous spatial aggregates based
on an attributive homogeneity criterion. Example: Dividing the EU into a set of
individual regions based on economic performance.
2. Classification in a geographic sense groups objects into mutually exclusive categories based on value ranges (class intervals) of a predetermined attribute. The
main purpose of classification in this context (e.g. applied in cartography) is to
reduce visual “noise” in the map and help the interpreter to extract meaningful information (Cromley and Cromley 1996), like spatial patterns formed by an
underlying geographic phenomenon.
In cartography, which can be seen as the visual output of geographic analysis, characteristic spatial patterns occur if polygons of the same class (resp. areal shading) are predominantly adjacent and therefore visually connected to larger figures.
However, commonly used cartographic classification techniques are solely based
on the attributive domain and completely ignore the spatial context. This ‘blindness’ for spatial configuration during a classification process leads to comparably
complex and fragmented spatial patterns, hampering visual perception and subsequent cognitive processes related to map interpretation. While cartographic classification therefore sometimes misses its target of removing “visual noise” from choropleth maps, “homogeneous” regions resulting from regionalization tend to hide too
much of the local variation in values to allow meaningful interpretation. Several
approaches conceptually between regionalization and classification have been developed (e.g. Murray and Shyy 2000). The main shortcoming of all the proposed
methods is the undefinable weight between the attributive and spatial properties of
data. This leads to vague results which cannot be reproduced or compared. In contrast, Autocorrelation-based Regioclassification (Traun and Loidl 2012) uses the
degree of spatial autocorrelation determined by Moran’s I statistics as a weight in
a bi-criterion classification process considering the attributive dimension as well as
the local neighborhood. Therefore it closes the gap between classification and re121
122
Martin Loidl and Christoph Traun
gionalization on a sound statistical basis. While the applicability of the method has
been proven in a cartographic context, potential fields of application comprise a
variety of space-sensitive questions, for example in image analysis and object extraction.
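The weighting quantity can be sketched directly; a minimal computation of global Moran's I for a univariate attribute and a spatial weights matrix W:

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I for spatial autocorrelation; W is a (binary or
    row-standardized) spatial weights matrix over the polygons."""
    z = values - values.mean()
    n = len(values)
    return n * (z @ W @ z) / (W.sum() * (z @ z))

# Toy example: 4 polygons in a row, rook contiguity, a smooth value gradient.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
print(round(morans_i(np.array([1.0, 2.0, 3.0, 4.0]), W), 2))  # positive I
```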
References
CROMLEY, E. K. and R. G. CROMLEY (1996): An Analysis of Alternative Classification
Schemes for Medical Atlas Mapping. European Journal of Cancer, 32(9), 1551–1559.
MURRAY, A. T. and T.-K. SHYY (2000): Integrating Attribute and Space Characteristics in
Choropleth Display and Spatial Data Mining. International Journal of Geographical Information Science, 14(7), 649-667.
TRAUN, C. and M. LOIDL (2012): Autocorrelation-Based Regioclassification - a self-calibrating
classification approach for choropleth maps explicitly considering spatial autocorrelation. International Journal of Geographical Information Science, iFirst, 1-17.
Keywords
REGIONALIZATION, CLASSIFICATION, AUTOCORRELATION-BASED REGIOCLASSIFICATION, CARTOGRAPHY, SPATIAL DATA
Confidence measures in automatic music
classification
Hanna Lukashevich
Fraunhofer IDMT, Ehrenbergstr. 31, 98693 Ilmenau, Germany
[email protected]
Abstract. Automatic music classification receives steady attention in the research
community. Music can be classified, for instance, according to music genre, style,
mood, or played instruments. Automatically retrieved class labels can be used for
searching and browsing within large digital music collections. State-of-the-art methods for music classification involve various machine learning techniques such as
Gaussian mixture models and support vector machines. Once trained, the classifiers
can predict class labels for unseen data. However, due to the variability and complexity of music data and to imprecise class definitions, the classification of real-world music remains error-prone. The goal of this work is to enhance the automatic class labels with confidence measures that estimate the probability of correct classification.
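One common way to obtain such confidence measures is to calibrate a classifier's decision values into class probabilities and report the probability of the predicted label. A minimal sketch (the data and the calibration setup are placeholders, not the author's method):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

# Stand-in data: rows play the role of audio feature vectors, labels of genres.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
clf = CalibratedClassifierCV(SVC(), cv=5).fit(X[:200], y[:200])
proba = clf.predict_proba(X[200:])
labels = clf.classes_[proba.argmax(axis=1)]   # automatic class labels
confidence = proba.max(axis=1)                # estimated P(label is correct)
print(labels[:5], np.round(confidence[:5], 2))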
Keywords
AUTOMATIC MUSIC CLASSIFICATION, CONFIDENCE MEASURES
Multi-Step Linear Discriminant Analysis for
Classification of Event-Related Potentials
Nguyen Hoang Huy, Stefan Frenzel, and Christoph Bandt
Institute for Mathematics and Informatics, University of Greifswald, 17487
Greifswald, Germany, [email protected]
Abstract. Event-related potentials (ERPs) are responses to stimuli in the electroencephalogram. By means of them it is possible to drive a brain-computer interface
(BCI). Determining the presence or absence of ERPs from the electroencephalogram can be considered a binary classification problem. Linear classifiers are probably the most popular algorithms for BCI applications with many of them being
based on linear discriminant analysis (LDA). In order to overcome the small sample
size problem of LDA, techniques such as regularization of the sample covariance
matrix have been applied, see Blankertz et al. (2011).
We introduce a multi-step machine learning approach and use it to classify data
from a visual ERP-based BCI, see Frenzel et al. (2011). Our approach is motivated
by the separability of the spatio-temporal covariance matrix. At first all features are
divided into disjoint subgroups and LDA is applied to each of them. This procedure is iterated until only one score remains, and this one is used for classification. Thereby we avoid estimating the high-dimensional covariance matrix of all
spatio-temporal features. We investigate the classification performance with special
attention to the small sample size case. We also present some theoretical results regarding the asymptotic error rate for the normal model with separable covariance
matrix. They give insight into the way the subgroups should be formed on each
level.
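A minimal sketch of the grouped, iterated LDA idea (binary labels assumed; the group size and the handling of training versus test trials are simplifications, not the authors' exact procedure):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def multistep_lda(X, y, group_size=8):
    # Level by level: split the current features into disjoint groups, run LDA
    # on each group, and keep the one-dimensional discriminant scores as the
    # features of the next level, until a single score remains.
    scores = X
    while scores.shape[1] > 1:
        groups = [scores[:, i:i + group_size]
                  for i in range(0, scores.shape[1], group_size)]
        scores = np.column_stack([LDA().fit(g, y).transform(g) for g in groups])
    return scores.ravel()   # final score, thresholded for binary classification

Because each LDA only sees group_size features at a time, no covariance matrix larger than group_size x group_size ever has to be estimated.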
References
BLANKERTZ, B., LEMM, S., TREDER, M., HAUFE, S. and MÜLLER, K. R. (2011): Single-trial analysis and classification of ERP components – A tutorial. NeuroImage, 56, 814–825.
FRENZEL, S. and NEUBERT, E. and BANDT, C. (2011): Two communication lines in a 3 × 3
matrix speller. Journal of Neural Engineering, 8, 036021.
Keywords
LINEAR DISCRIMINANT ANALYSIS, EVENT-RELATED POTENTIALS, BRAIN-COMPUTER INTERFACE
The Author in Translation:
A Computational Method
Sergiu Nisioi1 and Liviu P. Dinu2
1 University of Bucharest, Faculty of Mathematics and Computer Science, Academiei 14, Bucharest, Romania, [email protected]†
2 University of Bucharest, Faculty of Mathematics and Computer Science, Academiei 14, Bucharest, Romania, [email protected]
Abstract. In this article we discuss quantitative stylistic measurements that describe the evolution of an author's style, the differences which occur in translation and, implicitly, the possibility of using these parameters in authorship attribution. As a case study we have chosen Vladimir Nabokov's novels. We have analyzed his works alone, looking for a stylistic pattern. For this purpose we have selected the rankings of function words and the Spectrum kernel as similarity measuring tools. To identify the most relevant function words we have based our work on Foucault's study of the ”author function” and have therefore reduced the number of function words on the basis of empirical facts about Nabokov. Our results discriminate between translated and original texts, whether Russian or English. We also discuss the importance of lemmatizing Russian function words. Starting from the assumption that behind the pen-name of M. Ageyev lies V. Nabokov, we have carried out an authorship attribution investigation and concluded that there exists a resemblance between the two.
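For illustration, minimal versions of the two similarity tools named above (the footrule-style rank distance shown is a simple stand-in for the rank distance used in the study):

from collections import Counter

def spectrum_kernel(s, t, p=3):
    # p-spectrum kernel: dot product of the two texts' substring-count profiles.
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs.keys() & ct.keys())

def rank_distance(r1, r2):
    # Distance between two function-word rankings (best first); words absent
    # from one ranking are treated as ranked at the bottom.
    p1 = {w: i for i, w in enumerate(r1)}
    p2 = {w: i for i, w in enumerate(r2)}
    bottom = len(p1.keys() | p2.keys())
    return sum(abs(p1.get(w, bottom) - p2.get(w, bottom))
               for w in p1.keys() | p2.keys())

print(spectrum_kernel("the author function", "the author in translation"))
print(rank_distance(["the", "of", "and"], ["of", "the", "in"]))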
References
FOUCAULT, M. (1987): What Is an Author? Twentieth-Century Literary Theory. State University
Press of New York, Albany.
McKENNA, W., BURROWS, J. and ANTONIA, A. (1999): Beckett’s Trilogy: Computational
Stylistics and the Nature of Translation. RISSH. 35, 151-71.
POPESCU, M. and DINU, L. P. (2007): Kernel methods and string kernels for authorship identification: The Federalist Papers case. Proceedings of the International Conference RANLP, Borovets, Bulgaria, 484-487.
Keywords
AUTHOR FUNCTION, TRANSLATION, RANK DISTANCE, SPECTRUM KERNEL, VLADIMIR NABOKOV
† Student
Differentiation of innovation strategies across
regions
Dominik Antoni Rozkrut1,2
1 Statistical Office in Szczecin, Poland, [email protected]
2 University of Szczecin, Department of Statistics and Econometrics, Poland
Abstract. Classical indicators constructed from a single variable, such as the ”innovation rate”, are of limited information capacity. Such simple indicators, which combine information regardless of the way firms innovate, turn out to be oversimplified. Since innovation is a multidimensional process, the application of exploratory data analysis gives additional insight into its nature. The need to develop appropriate indicators of innovation practices, and to examine how these vary across regions and industries, has recently been stated by many authors. The goal of the study is to better exploit the potential of innovation studies by producing disaggregated indicators that identify how firms innovate, as illustrated by a real example based on data from the 2010 Community Innovation Survey. The study sheds light on this by applying multidimensional statistical analysis to group enterprises according to their innovation practices and to identify the resulting patterns. Specifically, factor analysis and k-means clustering are used to derive innovation modes and to group enterprises according to their practices.
The research reveals differences in innovation practices observed on the regional
level when compared with national results. These are particularly clear when the
tetrachoric correlation is used as an input to the factor analysis. The interpretation
of underlying modes of innovation activity increases the understanding of which innovation strategies are prevalent in the region as compared with the countrywide picture. The conclusions are of both a technical and a substantive nature.
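A minimal sketch of the analysis chain, factor analysis followed by k-means, on invented binary activity indicators (note that scikit-learn's FactorAnalysis consumes the raw indicator matrix, whereas the study feeds tetrachoric correlations into the factor analysis):

import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

# Stand-in binary innovation-activity indicators for 500 enterprises.
rng = np.random.default_rng(0)
X = (rng.random((500, 12)) < 0.4).astype(float)

modes = FactorAnalysis(n_components=3, random_state=0).fit_transform(X)
groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(modes)
print(np.bincount(groups))   # enterprises per innovation-practice group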
References
ARUNDEL, A. et al. (2007): How Europe's Economies Learn: A Comparison of Work Organization and Innovation Mode for the EU-15. Industrial and Corporate Change, 16(6).
Keywords
INNOVATION METRICS, EXPLORATORY ANALYSIS
The Impact of Student Loans on Personal
Financing of Higher Education in Germany
Alexandra Schwarz
German Institute for International Educational Research, Schlossstr. 29, D-60486
Frankfurt am Main, Germany, [email protected]
Abstract. Most colleges and universities in Germany are state-funded and free of
charge. As a consequence, Germany does not have a loan culture with respect to
financing higher education like the United States for example. The major part of
private expenses (living expenses during university studies, learning material etc.)
is financed by the parents. In addition, federal student support (so-called “BAfoeG”)
is granted to students whose parents cannot afford to fund their children's education.
Nonetheless, among the reasons to opt for vocational training instead of university
studies, financial motives prevail. Financial security is of great importance for graduates eligible to study; they do not feel up to the financial burden of studying, or
they are not willing to go into debt. This especially applies to school graduates from
low-income families, where often parents themselves do not have a university degree. In addition, problems in financing living expenses are one of the major reasons
for university drop-out.
To counteract social differentiation and enable students to study more efficiently,
the German government proposed to introduce a student loan program which serves
to finance student’s living expenses: In 2006 the state-owned KfW Bankengruppe
launched the “KfW student loan” which is an individual loan to be repaid plus interest. It is granted independently of the student’s and his/her parents’ income, without
collateral, and at a favorable interest. Based on an online survey the German Institute for International Educational Research evaluated this loan program with respect
to its effectiveness and its impact on personal financing of tertiary education. The
study describes the methods deployed as well as detailed results of this evaluation
where we focus on the individual funding requirements and financing structures of
borrowers. The results clearly indicate that the KfW student loan plays an important role in taking up and commencing studies, and this especially applies to students with working-class and middle-class backgrounds.
Keywords
HIGHER EDUCATION, EDUCATION FUNDING, STUDENT LOANS
Espionage Risk Assessment for Security of
Defense based Research and Technology
Dirk Thorleuchter1 and Dirk Van den Poel2
1 Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany, [email protected]
2 Ghent University, Faculty of Economics and Business Administration, Tweekerkenstraat 2, B-9000 Gent, Belgium, [email protected]
Abstract. Governmental and industrial espionage in security and defense based research and technology (R&T), in which sensitive information is collected without the permission of the information holder, is becoming an ever greater economic and security problem for companies and governments. We introduce a new methodology that investigates the information leakage risk of security and defense based R&T projects with respect to governmental or industrial espionage. The methodology extends the well-known risk assessment methodology and consists of two steps. In the first step, the sensitivity of a project is estimated by human experts: a qualitative assignment of projects to different sensitivity classes is done using an adapted risk assessment methodology. The second step supports this qualitative estimation
by a new quantitative methodology. Text and web mining is used to extract relevant
information from strategic documents and to crawl relevant technological information from the internet. Text classification is used to assign the provided information
to different aspects that are identified as relevant for the qualitative estimation.
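A minimal sketch of the text classification step (the snippets, aspect labels and the TF-IDF/SVM pipeline are illustrative placeholders, not the methodology's actual categories):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical snippets from strategic documents and the sensitivity-relevant
# aspect each one covers.
docs = ["stealth coating for airframe structures",
        "encrypted tactical radio protocols",
        "calibration of dual-use infrared sensors",
        "public conference on logistics software"]
aspects = ["materials", "communications", "sensors", "other"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(docs, aspects)
print(clf.predict(["infrared sensor housing materials"]))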
References
THORLEUCHTER, D. and VAN DEN POEL, D. (2011): Semantic Technology Classification. In:
Uncertainty Reasoning and Knowledge Engineering. IEEE Conference Publications Management Group, NJ, USA, 36–39.
THORLEUCHTER, D., VAN DEN POEL, D. and PRINZIE, A. (2010): A compared R&D-based
and patent-based cross impact analysis for identifying relationships between technologies.
Technological Forecasting and Social Change, 77 (7), 1037–1050.
THORLEUCHTER, D., GERICKE, W., WECK, G., REILAENDER, F. and LOSS, D. (2009): Vertrauliche Verarbeitung staatlich eingestufter Information – die Informationstechnologie im Geheimschutz. Informatik-Spektrum, 32 (2), 102–109.
Keywords
ESPIONAGE, TECHNOLOGY, TEXT CLASSIFICATION
Using Latent Class Models with Random Effects
for Investigating Local Dependence
Matthias Trendtel∗ and Ali Ünlü
Chair for Methods in Empirical Educational Research, TUM School of Education,
Technische Universität München, Lothstrasse 17, 80335 Munich, Germany
{matthias.trendtel,ali.uenlue}@tum.de
Abstract. Local independence, i.e., stochastic independence given the latent variable,
is one of the key assumptions underlying such latent variable modeling approaches
as item response theory (e.g., Hambleton et al. (1991)). It is a strong assumption,
which may not hold in realistic contexts. Generalizations are possible however. Latent class models with random effects (LCMRE; Qu et al. (1996)) allow for a general
local dependence structure among items, including as a special case local independence.
In this paper, we demonstrate how the LCMRE approach can be used to model
various local dependence structures among psychometric items. We derive a measure quantifying the degree of local dependence for pairs of items. This measure
can be viewed as a dissimilarity function in the sense of psychophysical scaling
(Dzhafarov and Colonius (2007)) and so allows representing the local dependence
structure of a set of items with pairwise psychophysical distances graphically in the
Euclidean 2D space. We illustrate our approach by simulations and by investigating
the local dependence structures in item types and instances of large scale assessment data from the Programme for International Student Assessment (PISA; OECD
(2012)).
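Assuming the pairwise local-dependence measure has already been collected into a symmetric dissimilarity matrix, the 2D representation can be sketched with multidimensional scaling (an illustrative stand-in for the psychophysical-scaling construction):

import numpy as np
from sklearn.manifold import MDS

# Hypothetical matrix: D[i, j] = local-dependence dissimilarity of items i, j.
D = np.array([[0.0, 0.2, 0.7, 0.6],
              [0.2, 0.0, 0.6, 0.5],
              [0.7, 0.6, 0.0, 0.1],
              [0.6, 0.5, 0.1, 0.0]])
xy = MDS(n_components=2, dissimilarity="precomputed",
         random_state=0).fit_transform(D)
print(np.round(xy, 2))   # one point per item; distances approximate D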
References
DZHAFAROV, E.N. and COLONIUS, H. (2007): Dissimilarity Cumulation Theory and Subjective
Metrics. Journal of Mathematical Psychology, 51, 290–304.
HAMBLETON, R.K., SWAMINATHAN, H. and ROGERS, H.J. (1991): Fundamentals of Item
Response Theory. Sage Publications, Newbury Park, CA.
OECD (2012): PISA 2009 Technical Report. OECD Publishing, Paris.
QU, Y., TAN, M. and KUTNER, M.H. (1996): Random Effects Models in Latent Class Analysis
for Evaluating Accuracy of Diagnostic Tests. Biometrics, 52, 797–810.
Keywords
LOCAL DEPENDENCE, LATENT CLASS ANALYSIS WITH RANDOM EFFECTS, 2D VISUALIZATION, LARGE SCALE ASSESSMENT, PISA
∗ PhD student
Music Genre Prediction by High-Level
Instrument and Harmony Characteristics
Igor Vatolkin1 , Günther Rötter2 , and Claus Weihs3
1 TU Dortmund, Chair of Algorithm Engineering, [email protected]
2 TU Dortmund, Institute for Music and Music Science, [email protected]
3 TU Dortmund, Chair of Computational Statistics, [email protected]
Abstract. For music genre prediction typically low-level audio signal features from
time, spectral or cepstral domains are taken into account. Another way is to use
community-based statistics such as Last.FM tags. Whereas the first feature group often cannot be clearly interpreted by listeners, the second suffers from erroneous or unavailable data for less popular songs. We propose a two-level approach combining the specific advantages of both groups: at first we create high-level descriptors which describe instrumental and harmonic characteristics of music content, some of them derived from low-level features by supervised classification (Vatolkin et al. (2012)) or from the analysis of extended chroma and chord features (Mauch and Dixon (2010)). Our previous study (Rötter et al. (2011)) demonstrated the high relevance of these high-level features for personal music categories, so that they can themselves be used as input to supervised genre classification. We discuss the performance with respect to classification error, feature set size and feature interpretability.
References
MAUCH, M. and DIXON, S. (2010): Approximate Note Transcription for the Improved Identification of Difficult Chords. In: Proceedings of the 11th International Society for Music
Information Retrieval Conference (ISMIR), 135–140.
RÖTTER, G., VATOLKIN, I. and WEIHS, C. (2011): Computational Prediction of High-Level
Descriptors of Music Personal Categories. Accepted for Proc. of the 2011 GfKl 35th Annual
Conf. of the German Classification Society (GfKl).
VATOLKIN, I., PREUß, M., RUDOLPH, G., EICHHOFF, M. and WEIHS, C. (2012): MultiObjective Evolutionary Feature Selection for Instrument Recognition in Polyphonic Audio
Mixtures. Accepted for Soft Computing, Special Issue on Evolutionary Music.
Keywords
HIGH-LEVEL MUSIC FEATURES, MUSIC GENRE CLASSIFICATION
The OECD’s Programme for International
Student Assessment (PISA) Study: A Review of
Its Basic Psychometric Concepts
Ali Ünlü, Daniel Kasper∗ and Matthias Trendtel†
Chair for Methods in Empirical Educational Research, TUM School of Education,
Technische Universität München, Lothstrasse 17, 80335 Munich, Germany
{ali.uenlue,daniel.kasper,matthias.trendtel}@tum.de
Abstract. The Programme for International Student Assessment (PISA; e.g., OECD,
2002, 2004, 2007, 2012) is an international large scale assessment study that aims to
assess the skills and knowledge of 15-year-old students, and based on those results,
to compare education systems across the participating (approximately 70) countries
(with a minimum number of circa 4,500 tested students per country). Initiator of
this Programme is the Organisation for Economic Co-operation and Development
(OECD)—see www.pisa.oecd.org.
We review the main methodological techniques of the PISA study. Primarily, we
focus on the psychometric procedures applied for scaling items and persons, and
we recapitulate the methods applied in PISA for longitudinal data analysis. PISA
proficiency scale construction and proficiency levels derived based on discretization
of the continua are discussed as well. Finally, questions and suggestions are raised,
and we hope that along these lines the PISA analyses can be better understood and
evaluated, and if necessary, possibly be improved.
References
OECD (2002): Sample Tasks from the PISA 2000 Assessment. OECD Publishing, Paris.
OECD (2004): Learning for Tomorrow’s World: First Results from PISA 2003. OECD Publishing,
Paris.
OECD (2007): PISA 2006: Science Competencies For Tomorrow’s World. OECD Publishing, Paris.
OECD (2012): PISA 2009 Technical Report. OECD Publishing, Paris.
Keywords
PROGRAMME FOR INTERNATIONAL STUDENT ASSESSMENT, LARGE SCALE
ASSESSMENT, ITEM RESPONSE THEORY, PSYCHOMETRICS
∗ PhD student
† PhD student
Part X
Biostatistics and Bioinformatics
Rank aggregation for candidate gene selection
Andre Burkovski1 , Ludwig Lausser1 and Hans A. Kestler1
Research Group Bioinformatics and Systems Biology, Institute of Neural
Information Processing, Ulm University, 89069 Ulm, Germany
{andre.burkovski, ludwig.lausser,
hans.kestler}@uni-ulm.de
Abstract. Molecular processes in biological systems are normally influenced or even determined by the gene expression levels (RNA concentrations) of the involved cells. It is believed that the higher the concentration of an RNA molecule, the higher its activating (or inhibiting) influence on the process. In order to explain changes in such systems, high-dimensional expression profiles are screened for differentially expressed genes. The top-scoring genes are then considered candidates for further analysis. The measured differences may, however, not be related to the biological process, as they can also be caused by variation in measurement or by other sources of noise.
An alternative approach is the analysis of relative ranks of gene expression within
a single profile. While measurements of a single profile can be considered comparable, measurements between profiles may vary considerably. Ranking the values
and aggregating profiles from the same group can extract stable relationships. The
aggregated ranking can be considered a consensus of single profiles. Comparison of
intra- and inter-class aggregated rankings reveals changes in gene activity. The intersection between these rankings results in specific feature sets. These sets provide
candidate genes for further investigation or analysis.
Rankings can be aggregated in several ways. We will compare statistical and
positional methods and their application to artificial as well as real world data. The
resulting consensus rankings may be used to identify specifically expressed genes
and differences between groups.
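As one concrete positional scheme, a minimal Borda-count aggregation of per-profile gene rankings (illustrative only; the talk compares several statistical and positional methods):

from collections import defaultdict

def borda_aggregate(rankings):
    # A gene at position pos in a ranking of length n gets n - pos points;
    # the consensus sorts genes by their total points.
    points = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for pos, gene in enumerate(ranking):
            points[gene] += n - pos
    return sorted(points, key=points.get, reverse=True)

profiles = [["g1", "g2", "g3", "g4"],
            ["g2", "g1", "g3", "g4"],
            ["g1", "g3", "g2", "g4"]]
print(borda_aggregate(profiles))   # consensus ranking, best first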
References
SCHALEKAMP, F. and VAN ZUYLEN, A. (2009): Rank aggregation: Together we're strong. In I.
Finocchi and J. Hershberger (Eds.): 11th Workshop on Algorithm Engineering and Experiments, ALENEX 2009, New York, New York, USA, 38–51.
Keywords
RANK AGGREGATION, RANKING, CANDIDATE GENE SELECTION
Andre Burkovski and Ludwig Lausser are PhD students.
Unsupervised dimension reduction methods for
protein sequence classification
Dominik Heider1 , Christoph Bartenhagen2 , J. Nikolaj Dybowski1 , Sascha Hauke3 ,
Martin Pyka4 , and Daniel Hoffmann1
1 Dept. of Bioinformatics, University of Duisburg-Essen, Universitaetsstr. 2, 45141 Essen, Germany, {dominik.heider, nikolaj.dybowski, daniel.hoffmann}@uni-due.de
2 Dept. of Medical Informatics, University of Münster, Domagkstr. 9, 48149 Münster, Germany, [email protected]
3 CASED, Technische Universität Darmstadt, Mornewegstr. 32, 64293 Darmstadt, Germany, [email protected]
4 Dept. of Psychiatry and Psychotherapy, Philipps-University Marburg, Rudolf-Bultmann-Str. 8, 35039 Marburg, Germany, [email protected]
Abstract. Feature extraction methods are widely applied in order to reduce the dimensionality of data for subsequent classification, thus decreasing the risk of noise
fitting. Principal Component Analysis (PCA) is a popular linear method for transforming high-dimensional data into a low-dimensional representation. Non-linear
and non-parametric methods for dimension reduction, such as Isomap and Stochastic Neighbor Embedding (SNE), are also used. In this study, we compare the performance of PCA, Isomap, t-SNE and Interpol as preprocessing steps for classification
of protein sequences. Using random forests, we compared the classification performance on two artificial and nineteen real-world protein data sets, including HIV
drug resistance, HIV-1 co-receptor usage and protein functional class prediction,
preprocessed with PCA, Isomap, t-SNE and Interpol. Significant differences between these feature extraction methods were observed. The prediction performance
of Interpol converges towards a stable and significantly higher value compared to
PCA, Isomap and t-SNE. This is probably due to the nature of protein sequences, where amino acids often depend on and affect each other to achieve, for instance, conformational stability. However, the visualization of data reduced with Interpol is rather unintuitive compared to the other methods. We conclude that Interpol is superior to PCA, Isomap and t-SNE for feature extraction prior to classification,
but is of limited use for visualization.
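A minimal sketch of the comparison protocol on stand-in data (Interpol is an R package for encoding protein sequences and is omitted here; also note the leakage caveat in the final comment, so this is only a rough approximation of the study's evaluation):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for numerically encoded protein sequences.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
reduced = {"PCA": PCA(n_components=10).fit_transform(X),
           "Isomap": Isomap(n_components=10).fit_transform(X),
           "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(X)}
for name, Z in reduced.items():
    acc = cross_val_score(RandomForestClassifier(random_state=0), Z, y, cv=5)
    print(name, round(acc.mean(), 3))
# Caveat: reducing before the CV split leaks information; a faithful benchmark
# refits each reduction inside the cross-validation loop.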
Keywords
MACHINE LEARNING, FEATURE EXTRACTION, PROTEINS, SEQUENCES
Prediction of Surgery Duration Using Data
Mining Methods on Anaesthesia Protocols
Pawel Matuszyk1∗ , Dominik Brammen2 , René Schult1 , and Myra Spiliopoulou1
1 Otto-von-Guericke-University, Faculty of Computer Science, Magdeburg, Germany, (pawel.matuszyk|myra)@ovgu.de, [email protected]
2 University Hospital Magdeburg, Department of Anesthesiology and Intensive Care, [email protected]
Abstract. Exact scheduling of surgeries plays an essential role in the efficiency of a hospital. It reduces waiting times for patients as well as for medical staff. A precise schedule does not only avoid frustration among the personnel at the workplace, but it also improves the utilisation of surgery rooms and reduces overtime costs. In order to create a realistic schedule, it is necessary to estimate the duration of a future surgery accurately. Popular methods for estimating the duration of a surgery are currently mean values and estimates based on the experience of the medical staff. However, these methods often reveal a high prediction error. In this paper we propose a method for estimating the duration of a future surgery based on data mining techniques. It encompasses discretisation of the class attribute, which is the duration of a surgery, and classification using decision trees. The data used for learning the data mining models are anaesthesia protocols, which have to be recorded in every hospital. We found that our method reduces the absolute prediction error by up to 31.03 percent compared with the average-based estimations.3
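A minimal sketch of the two core steps, discretising the class attribute and classifying with a decision tree, on synthetic stand-in data (the real models are learned from anaesthesia protocols):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for protocol features and surgery durations in minutes.
rng = np.random.default_rng(0)
X = rng.random((400, 5))
duration = 60 + 90 * X[:, 0] + 20 * rng.standard_normal(400)

cuts = np.quantile(duration, [0.25, 0.5, 0.75])   # discretise class attribute
y = np.digitize(duration, cuts)                   # four duration classes
Xtr, Xte, ytr, yte, dtr, dte = train_test_split(X, y, duration, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr, ytr)
typical = np.array([dtr[ytr == k].mean() for k in range(4)])  # class -> minutes
pred = typical[tree.predict(Xte)]
print("tree MAE:", np.abs(pred - dte).mean().round(1),
      "| mean-based MAE:", np.abs(dtr.mean() - dte).mean().round(1))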
References
1. Schult, R., Matuszyk, P., Spiliopoulou, M.: Prediction of Surgery Duration Using Empirical Anesthesia Protocols. In: KD-HCM 2011.
2. Schult, R., Matuszyk, P., Spiliopoulou, M.: Framework for Computer Aided Analysis of Medical Protocols in a Hospital. In: HEALTHINF 2012.
Keywords
Data Mining, Anaesthesia Protocols, Discretisation, Surgery Duration
∗ Ph.D. student
3 Other versions of the results have been presented in [1, 2].
The critical noise level for learning Boolean
functions
Markus Maucher, Christian Wawra, and Hans A. Kestler
Bioinformatics and Systems Biology Group, Ulm University, Ulm, Germany
[email protected], [email protected],
[email protected]
Abstract. The inference of gene regulatory systems from time series measurements
is a challenging task to reveal the global functionality of a cell. Among several
reconstruction methods Boolean networks have been successfully applied to such
data. As time-resolved gene expression measurements at different stages of a cell
are difficult and expensive, all reconstruction methods are faced with a relatively small
number of time points compared to the number of genes. In addition to this dimension problem, biological systems as well as measurement techniques are subject to
noise.
In this work, we present an analysis of the reconstructability of Boolean networks
in the case of noisy data. We introduce the notion of the critical noise level, a function characteristic which measures the complexity of the reconstruction of a function
from noisy time series data. This measure constitutes a natural upper bound for the
noise probability under which a function can still be reconstructed, but can also be
incorporated into the reconstruction process to improve reconstruction results. We
show how to efficiently compute the critical noise level of any given Boolean function and present experimental data that shows how it can be used to improve the
best-fit extension algorithm for the reconstruction of a Boolean network from noisy
time series data.
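As a toy illustration of why reconstruction breaks down as the noise probability approaches 1/2, the following sketch estimates how often a Boolean function is exactly recovered from output-noisy samples; the per-pattern majority vote is a simple stand-in for the best-fit extension, and the critical noise level itself is an analytic quantity not computed here:

import itertools, random

def recovery_rate(f, k=3, samples=200, noise=0.1, trials=300):
    # Empirical chance of exactly recovering the Boolean function f (a dict
    # pattern -> 0/1) by per-pattern majority vote, with each observed output
    # flipped independently with probability `noise`.
    random.seed(0)
    patterns = list(itertools.product((0, 1), repeat=k))
    hits = 0
    for _ in range(trials):
        vote = {p: 0 for p in patterns}
        for _ in range(samples):
            p = random.choice(patterns)
            out = f[p] ^ (random.random() < noise)   # noisy output bit
            vote[p] += 1 if out else -1
        hits += all((vote[p] > 0) == bool(f[p]) for p in patterns)
    return hits / trials

xor3 = {p: p[0] ^ p[1] ^ p[2] for p in itertools.product((0, 1), repeat=3)}
for q in (0.1, 0.3, 0.45):
    print(q, recovery_rate(xor3, noise=q))   # degrades towards q = 0.5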
References
KAUFFMAN, S.A. (1993). The Origins of Order: Self-Organization and Selection in Evolution.
Oxford University Press.
LÄHDESMÄKI, H., SHMULEVICH, I. and YLI-HARJA, O. (2003): On learning gene regulatory networks under the Boolean network model. Machine Learning, 52(1–2), 147–167.
Keywords
BOOLEAN FUNCTIONS, BOOLEAN NETWORKS, SYSTEMS BIOLOGY, RANDOM PERTURBATIONS
Decision tree ensembles with different split
criteria
Sergej Potapov1 , Asma Gul2 , Werner Adler1 , and Berthold Lausen2
1 University of Erlangen-Nuremberg, Germany, {sergej.potapov,werner.adler}@imbe.med.uni-erlangen.de
2 University of Essex, United Kingdom, {agul,blausen}@essex.ac.uk
Abstract. In recent years many papers have discussed boosting- and bagging-based methods for supervised learning. Both concepts aggregate sets of estimated trees, which are derived by split criteria without adjusting for variables measured on different scales. Breiman et al. (1984) observed that quantitative variables tend to be selected more often than binary variables. As a solution, Lausen et al. (1994, 2004) introduced p-value adjusted classification and regression trees, which use the p-value of maximally selected test statistics as split criterion. The p-value adjustment avoids the possible selection bias towards variables measured on different scales. The R package TWIX of Potapov and Theus (2012) offers p-value adjusted classification trees. In our paper we compare bagging and double-bagging (Hothorn and Lausen, 2003) with and without p-value adjustment by means of simulation. Moreover, we illustrate our approach using a clinical study involving microarray data.
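A minimal sketch of the split criterion's core idea: take the maximally selected test statistic over all cutpoints of a variable and adjust its p-value for the number of candidate cuts, so that many-valued variables gain no unfair advantage (a plain Bonferroni adjustment stands in here for the sharper bounds used by Lausen et al.; binary class labels with both classes present are assumed):

import numpy as np
from scipy import stats

def adjusted_split_pvalue(x, y):
    # Best chi-square statistic over all cutpoints of x for binary y,
    # Bonferroni-adjusted for the number of candidate cutpoints.
    cuts = np.unique(x)[:-1]
    pvals = [stats.chi2_contingency(
                 [[np.sum((x <= c) & (y == 0)), np.sum((x <= c) & (y == 1))],
                  [np.sum((x > c) & (y == 0)), np.sum((x > c) & (y == 1))]],
                 correction=False)[1]
             for c in cuts]
    return min(1.0, min(pvals) * len(cuts))

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
x_cont = rng.random(200)                        # many possible cutpoints
x_bin = rng.integers(0, 2, 200).astype(float)   # a single cutpoint
print(adjusted_split_pvalue(x_cont, y), adjusted_split_pvalue(x_bin, y))

Without the adjustment, the continuous variable would almost always yield the smaller raw p-value simply because it offers far more cutpoints.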
References
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984): Classification and regression trees.
Wadsworth Press.
Hothorn, T., Lausen, B. (2003): Double-bagging: Combining classifiers by bootstrap aggregation. Pattern Recognition 36(6), 1303–1309.
Lausen, B., Hothorn, T., Bretz, F., Schumacher, M. (2004): Assessment of optimal selected prognostic factors. Biometrical Journal 46, 364–374.
Lausen, B., Sauerbrei, W., Schumacher, M. (1994): Classification and regression trees (CART)
used for the exploration of prognostic factors measured on different scales, in: Dirschedl, P.,
and Ostermann, R. (eds.), Computational Statistics, Physica-Verlag, Heidelberg, 483–496.
Potapov, S., Theus, M. (2012): The TWIX package (Version 0.2.19.). http://cran.r-project.org
Keywords
ENSEMBLE LEARNING, CLASSIFICATION TREES, BAGGING
A Transductive Set Covering Machine
Florian Schmid∗1 , Ludwig Lausser†1 and Hans A. Kestler1
Research Group Bioinformatics and Systems Biology, Institute of Neural
Information Processing, Ulm University, 89069 Ulm, Germany
{florian-1.schmid, ludwig.lausser,
hans.kestler}@uni-ulm.de
Abstract. Classifying tissue samples according to genetic markers is one of the basic tasks in molecular medicine. Tissue samples are represented as high-dimensional gene expression profiles obtained from high-throughput experiments. It is often assumed that these profiles only contain a small set of predictive features.
A classifier based on such small marker combinations is the Set Covering Machine (SCM) with data dependent rays (Kestler et al. 2011). The SCM is an ensemble scheme which finds a minimal set of base classifiers covering all positive samples while minimizing the misclassifications of negative ones. Applied with univariate ray classifiers, the SCM can be utilized to construct a decision rule of very low dimensionality.
An SCM is normally trained in an inductive way, i.e. an SCM is initially adapted
to a set of previously labeled training samples. This kind of learning might not be
optimal in scenarios based on expensive or time consuming labeling processes; in
this setting labeled training samples are often rare. The small sample size limits
the training of an inductive classifier. The resulting models are likely to be affected
by overfitting. Transductive learning is an alternative in this scenario. This learning
scheme additionally incorporates unlabeled samples into the training process.
In this work we propose and characterize a transductive version of the SCM with
data dependent rays. The classifier is evaluated on partially labeled microarray data
with different ratios of labeled and unlabeled data.
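For orientation, a greedy sketch of the inductive SCM with single-feature threshold rays (the transductive variant proposed here would additionally feed unlabeled samples into training; the thresholds and the stopping rule are simplifications, not the authors' exact algorithm):

import numpy as np

def train_scm(X, y, max_rays=3):
    # Each ray keeps every positive sample and excludes as many of the
    # still-active negative samples as possible; the rays form a conjunction.
    pos, neg = X[y == 1], X[y == 0]
    active = np.ones(len(neg), dtype=bool)
    rays = []
    while active.any() and len(rays) < max_rays:
        best = (-1, None)
        for j in range(X.shape[1]):
            for thr, sign in ((pos[:, j].min(), 1), (pos[:, j].max(), -1)):
                excluded = int((sign * neg[active, j] < sign * thr).sum())
                if excluded > best[0]:
                    best = (excluded, (j, thr, sign))
        if best[0] <= 0:
            break
        j, thr, sign = best[1]
        rays.append((j, thr, sign))
        active &= ~(sign * neg[:, j] < sign * thr)
    return rays

def predict_scm(rays, X):
    out = np.ones(len(X), dtype=bool)
    for j, thr, sign in rays:
        out &= sign * X[:, j] >= sign * thr
    return out.astype(int)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 10)); y = (X[:, 0] > 0).astype(int)
rays = train_scm(X, y)
print(rays, (predict_scm(rays, X) == y).mean())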
References
Kestler, H., Lausser, L., Lindner, W., Palm, G.: On the fusion of threshold classifiers for categorization and dimensionality reduction. Computational Statistics 26(2), 321–340 (2011)
Keywords
TRANSDUCTIVE LEARNING, CLASSIFICATION, SET COVERING MACHINE,
SCM
∗ Student
† PhD student
Part XI
LIS’12 Workshop
LinSearch – Efficient Indexing at the Technische Informationsbibliothek, Hannover
Dr. Debora Daberkow, Dr. Petra Mensing, Dr. Irina Sens, Claudia Todt
Technische Informationsbibliothek Hannover, Welfengarten 1b, 30167 Hannover
[email protected]
Abstract. The Technische Informationsbibliothek Hannover (TIB) (http://www.tib-hannover.de/) is the German National Library of Science and Technology, covering engineering as well as architecture, chemistry, computer science, mathematics and physics. With GetInfo (https://getinfo.de/app), the TIB offers its users a specialised portal for science and technology. About 45 million objects (texts, research data, audio-visual media, 3D models) are indexed in the central index. Owing to the exponentially growing amount of available information, it is hardly possible any more to classify all objects manually. After an extensive project phase, the metadata in GetInfo are now classified with the help of (semi-)automatic procedures. The procedure for the automatic assignment of metadata consists of four stages in total. The first stage allows a broad assignment of records to one of the six core subjects of the TIB; in the second stage, all classification data present in the records, such as DDC or MSC notations and others, are used to enable an automatic subject assignment. In the subsequent stage, ISSN and conference data are drawn on for the assignment. If no processing has been possible up to this point, the records concerned are handed over to the averbis extraction platform for classification.
The goal of the procedure is the assignment of all records to the six TIB subjects architecture, chemistry, computer science, mathematics, physics and engineering. This yields an additional filter for searching in the GetInfo portal. The first three stages of this procedure are in-house developments of the TIB and use lexical resources created specifically for this purpose, which emerged from the LINSearch project. The fourth stage, by contrast, is based on machine learning methods and was built with the help of the company averbis.
Keywords
AUTOMATIC SUBJECT INDEXING, CLASSIFICATION, METADATA
The Challenge of a ”New Classification for Open-Stack Collections” – Three Practical Examples from Switzerland
Uwe Geith1 and Dr. Wolfgang Giella2
1 ZHAW Hochschulbibliothek, Winterthur, [email protected]
2 ZHAW Hochschulbibliothek, Winterthur, [email protected]
Abstract. The need to introduce a new shelving classification can have various causes. Yet the evaluation and introduction of a new classification always pose a challenge for a library, especially for smaller libraries.
Using three examples from Switzerland,
• Kantonsbibliothek Graubünden, Chur (Basisklassifikation)
• branch libraries of the ZHAW Hochschulbibliothek, Zurich site (RVK)
• branch libraries of the ZHAW Hochschulbibliothek, Winterthur site (DDC)
the background to the decision for a particular classification is disclosed and the execution of the corresponding projects is outlined.
Different goals, strategies and solution approaches are contrasted, showing that the decision for a particular classification can have not only content-related but also pragmatic reasons.
Keywords
SWITZERLAND, LIBRARY, SHELF CLASSIFICATION, SELECTION
Subject Search in Swiss Online Catalogues and Discovery Systems
Uwe Geith
ZHAW Hochschulbibliothek, Winterthur [email protected]
Abstract. Legions of librarians furnish bibliographic records with subject indexing information. They assign subject headings or classify, or even do both. But for whom do they take on all this work? It is well known that only a small percentage of library customers use the subject search entry points.
Swiss OPACs, with their classic search options, make plain the variety of applications in both verbal and classificatory subject indexing. The focus is on the Aleph OPACs of the Informationsverbund Deutschschweiz (IDS), which are classic union catalogues.
Attempts to replace the OPACs with RDS systems are strongly in vogue. Which subject searches can be carried out in the current Swiss discovery systems working with search engine technology? ”swissbib”, ”NEBIS recherche” and the web portal ”e-lib.ch” are presented.
The talk both reflects the current state of subject search in Switzerland and attempts to point out possible extensions and new approaches.
Keywords
SWITZERLAND, SUBJECT INDEXING, ONLINE SEARCH, OPAC, RESOURCE DISCOVERY SYSTEM
Processing Subject Indexing Elements in Discovery Systems:
Towards a User-Oriented Use of Subject Data in the E-LIB Bremen
Dr. Elmar Haake
Staats- und Universitätsbibliothek Bremen, Bibliothekstr., 28359 Bremen
[email protected]
Abstract. Current commercial internet services show that the success of a web service depends considerably on the quality and the up-to-date presentation of one's own offerings, and the practical value of the services should be immediately recognizable. Usability considerations, clear and simple structures, and a restriction to the essentials can make web offerings more attractive for untrained users. The online presentation of a library's various services, by contrast, is still influenced too strongly by questions of technical feasibility and by the professional perspective of librarians. This applies in particular to library search instruments. The metadata presentation of current library catalogues, in all its abundance, is hardly understood by untrained users. Within the E-LIB Bremen we experiment with new ways of presenting and evaluating, among other things, subject indexing metadata, in order to make it easier for users to modify their search path in an understandable way, or to motivate them to browse thematically. The statistical evaluation of the complete result set of a query provides the basis for developing numerous new recommendation functions. Internally, the same services also act as a suggestion function for simplifying classification.
References
J. Rochkind (2007): (Meta)search Like Google. The time has come for libraries, too, to negotiate
for rights to index full text, Library Journal
J. Wang und A. Lim (2009): Local touch and global reach: The next generation of network, Library
Management 30, No. 1/2, 25-34
M. Parry (2009): After Losing Users in Catalogs, Libraries Find Better Search Software, The
Chronicle of Higher Education, 28. Sept. 2009
M. Blenkle, R. Ellis, E. Haake (2009): Next-generation library catalogues: review of E-LIB Bremen, Serials 22(2)
W. Gödert (2004): Navigation und Konzepte für ein interaktives Retrieval im OPAC oder: von der Informationserschließung zur Wissenserkundung, Mitteilungen der Vereinigung Österreichischer Bibliothekarinnen & Bibliothekare 57, Nr. 1, 70-80
M. Blenkle, R. Ellis, E. Haake (2009): E-LIB Bremen – Automatische Empfehlungsdienste für Fachdatenbanken im Bibliothekskatalog / Metadatenpools als Wissensbasis für bestandsunabhängige Services, Bibliotheksdienst 43. Jg., 6, 618-627
E. Haake (2009): Erschliessen Sie immer noch oder lassen Sie auch schon indexieren? Vortrag, 8. Fortbildungstreffen der Arbeitsgruppe Fachreferat Naturwissenschaften, 21.09.2009
R. Siegmüller (2007): Verfahren der automatischen Indexierung in bibliotheksbezogenen Anwendungen. Berlin: Institut für Bibliotheks- und Informationswissenschaft der Humboldt-Universität zu Berlin, 2007. 106 S. (Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft; 214). ISSN: 1438-7662
Keywords
Intuitive use of subject indexing elements, serendipity, use of the catalogue as a knowledge base for subject indexing
The Blog as a Thesaurus Database
Andreas Ledl
University Library of Basel, Switzerland [email protected]
Abstract. Overviews of online thesauri, classifications and ontologies are currently offered scattered across the internet in more or less incomplete link lists. Until now there has been no central, virtual place that has set itself the task of providing an as-complete-as-possible, internationally oriented compilation of all freely accessible knowledge organization instruments. With the blog Thesaurusportal (http://thesaurusportal.blogspot.com), which will eventually contain over 500 freely accessible thesauri, classifications and ontologies in 47 languages, such an offering is now available for the first time.
It addresses interested laypersons, students and researchers who search for literature or information with the block building approach and need a controlled vocabulary to do so; (academic) librarians, university lecturers, teachers and other mediators of information literacy who want to bring exactly such professional approaches closer to their students or pupils and need inspiring material for it; ontology engineers, since thesauri serve as a basis for mapping complex knowledge relations and thus for designing semantic networks; and libraries, archives, museums, research institutions and similar bodies, where they can be used as indexing tools.
The talk presents the blog and shows that, besides the ambition regarding content, the form of presentation is also new. At present, blogs are mainly used by libraries to spread news about the institution or the collection. Yet precisely for smaller amounts of data they are ideally suited to replacing static, unwieldy link lists, and they can serve as sustainable open access databases with Web 2.0 functionality.
Keywords
Blog, Thesaurus, Database, Classification, Open Access
EXCERPT FROM THE 2011 LITERATURE REVIEW ON THE DEWEY DECIMAL CLASSIFICATION (DDC)
Bernd Lorenz
Fachhochschule für öffentliche Verwaltung und Rechtspflege in Bayern
Fachbereich Archiv- und Bibliothekswesen, München
[email protected]
Abstract. ”The 23rd edition of the DDC enhances the efficiency and accuracy of your classification work in ways no previous editions have done.”
Cf. http://www.oclc.org/dewey/
Effenberger, Claudia: Ein semantisches Netz für die Suche mit der Dewey-Dezimalklassifikation – Optimiertes Retrieval durch die Verwendung versionierter DDC-Klassen (= Mitteilungen der VÖB 64, 2011, S. 270-289)
Effenberger, Claudia – Hauser, Julia: Would an Explicit Versioning of the DDC Bring Advantages for Retrieval? In: Concepts in Context. Proceedings of the Cologne Conference on Interoperability and Semantics in Knowledge Organization, July 19th-20th, 2010. Ed. by Felix Boteram, Winfried Gödert, Jessica Hubrich. Würzburg: Ergon, 2011, S. 123-132
Golub, Koraljka: Automated Subject Classification of Textual Documents in the Context of Web-Based Hierarchical Browsing (= Knowledge Organization 38, 2011, S. 230-244) (uses the DDC)
Green, Rebecca: See-also Relationships in the Dewey Decimal Classification (= Knowledge Organization 38, 2011, S. 335-341)
Grüter, Doris – Kölbl, Andrea Pia – Villinger, Martin – Walger, Nicole: Genese, Aufgaben und Zukunft der Vifarom: Konzept und DFG-Förderung einer Virtuellen Fachbibliothek aus der Praxisperspektive (= ZfBB 58, 2011, S. 59-71) (pp. 61 f.: use of the DDC)
Schöning-Walter, Christa: Automatische Erschließungsverfahren für Netzpublikationen. Zum Stand der Arbeiten im Projekt PETRUS (= Dialog mit Bibliotheken 23, 2011, S. 31-36; incl. advertising section) (also covers work with the DDC-Sachgruppen)
Keywords
DEWEY DECIMAL CLASSIFICATION, LITERATURE REVIEW
Development of a Tool for Visualizing the SWD/GND
Dr.-Ing. Jan Frederik Maas
Staats- und Universitätsbibliothek Hamburg, Von Melle Park 3, 20146 Hamburg
[email protected]
Abstract. The subject indexing of media on the basis of cooperatively maintained authority files has established itself as a very flexible, because continuously extensible, component of subject cataloguing. In the German-speaking world, the basis for this indexing has so far been the Schlagwortnormdatei (SWD), which is being replaced step by step by the Gemeinsame Normdatei (GND). Beyond the subject headings, the GND also contains the holdings of the Personennamendatei (PND), the Gemeinsame Körperschaftsdatei (GKD) and the uniform-title file of the Deutsches Musikarchiv.
The SWD/GND primarily serves to standardise subject indexing. Beyond that, the structure of the SWD defines relations between subject headings that can greatly ease thematic searching; examples are the narrower-term/broader-term relations (hyponym/hypernym) or the similarity relation between concepts.
To ease working with the SWD/GND, a tool for searching the subject headings was built that visualizes the relations described above and, in addition, supports complex queries, e.g. by means of regular expressions. This facilitates the subject indexing of media, and errors arising when establishing subject headings become easier to avoid.
A worthwhile challenge is adapting the software to the structure of the GND, which will be discussed in perspective.
References
MAAS, Jan F. (2010): SWD-Explorer – Design und Implementierung eines Software-Tools zur erweiterten Suche und grafischen Navigation in der Schlagwortnormdatei. Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft 275.
Keywords
SWD, GND, Visualization, Subject Headings
Practical Experiences with Machine
Learning-based Text Categorization for Library
Applications
Elisabeth Mödden1 , Mathias Lösch2 , Monika Lösse1 , and Ulrike Junger1
1 Deutsche Nationalbibliothek, Adickesallee 1, D-60322 Frankfurt am Main, [email protected], [email protected], [email protected]
2 Universitätsbibliothek Bielefeld, Universitätsstr. 25, D-33619 Bielefeld, [email protected]
Abstract. In recent years, text mining has gained more and more attention in the
field of (digital) libraries. Potentially fruitful applications include tasks like automatic abstracting and automatic clustering or classification of documents. Our
contribution presents experiences from two projects that aim at the automation
of classification in the library domain using machine learning-based text categorization: The project “PETRUS”, carried out by the German National Library,
aims at the automatic classification of electronic publications by attributing DDC-Sachgruppen. These DDC-Sachgruppen, a scheme based on the Dewey Decimal Classification (DDC), comprise roughly one hundred subject classes and are used to structure the German National Bibliography. Coordinated at Bielefeld University
Library, the project “Automatic Enhancement of OAI Metadata” aims at the automatic classification of Dublin Core metadata records, as employed by most institutional and subject repositories. The target category space of this project is the DDC,
and in particular also the subset of the DDC-Sachgruppen, which is the prevalent
category system in the German repository landscape. A comparative analysis of
the experiences from both projects unveils promising classification accuracy rates
in both applications, but also similar challenges faced during the construction of
production-ready classifiers for library classification schemes. We therefore conclude with recommendations and best practices for the application of text categorization in the library context.
Keywords
SUBJECT INDEXING, MACHINE LEARNING, TEXT MINING
Matching Title Data to Transfer Subject Indexing Information across Union Catalogue Boundaries
Magnus Pfeffer
Hochschule der Medien, Stuttgart [email protected]
Abstract. Subject indexing in the German union catalogue databases is highly inconsistent, owing among other things to the barely coordinated cooperation. This also affects works that appear in many editions and printings with essentially unchanged content. Last year I presented a procedure which groups different editions and publication forms of a work together and transfers the subject indexing information from the indexed titles of a group to the non-indexed ones. The basic idea is to match a combination of author/creator data and the complete title.
The procedure was applied to two union catalogue databases (Südwestverbund and Hebis). The results are impressive: whereas before the matching 3,979,796 of the 12,777,191 monographs in the SWB carried SWD subject headings and 3,235,958 carried RVK notations, the matching with subsequent transfer indexed an additional 636,462 titles with SWD and 959,419 with RVK. The working groups of subject indexing experts in both consortia checked the results on random and systematic samples, attested a high quality to the results and recommended the transfer into the production databases, which has meanwhile taken place.
Data extracts from the catalogue databases of the BVB and the HBZ are currently being prepared for the procedure. The results of merging the four union catalogues will be presented for the first time at the workshop.
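A toy sketch of the matching idea, normalising author and title into a common key and propagating subject data within each group (the normalisation details of the published procedure may differ):

import re
import unicodedata
from collections import defaultdict

def match_key(author, title):
    # Lower-case, strip accents and punctuation, collapse whitespace.
    s = unicodedata.normalize("NFKD", f"{author} {title}".lower())
    s = "".join(c for c in s if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", s).strip()

records = [  # invented records from two union catalogues
    {"author": "Goldziher, Ignác",
     "title": "A short history of classical Arabic literature",
     "subjects": {"Arabic literature"}},
    {"author": "Goldziher, Ignac",
     "title": "A Short History of Classical Arabic Literature.",
     "subjects": set()},
]
groups = defaultdict(list)
for r in records:
    groups[match_key(r["author"], r["title"])].append(r)
for group in groups.values():            # propagate subject data per group
    merged = set().union(*(r["subjects"] for r in group))
    for r in group:
        r["subjects"] |= merged
print(records[1]["subjects"])            # {'Arabic literature'}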
Keywords
LIBRARY UNION CATALOGS, SUBJECT HEADINGS, TITLE MATCHING
Data Enrichment in Discovery Systems using
Linked Data
Dominique Ritze and Kai Eckert
Mannheim University Library, Germany
{dominique.ritze,eckert}@bib.uni-mannheim.de
Abstract. The Linked Data Web is an abundant source for information that can be
used to enrich information retrieval results. This can be helpful in many different
scenarios, for example to enable extensive multilingual semantic search or to provide additional information to the users.
In general, for the data enrichment two ways are possible: on the side of the client
and on the side of the server. With client side data enrichment, i.e., usually an enrichment by means of JavaScript in the browser, users can get additional information
related to the results they are provided with. This additional information is not stored
with the retrieval system and thus not available to improve the actual search. An example would be the provision of links to external sources like Wikipedia, merely for
convenience.
By contrast, an enrichment on the server side can be exploited to improve the retrieval directly, at the cost of data duplication and additional efforts to keep the data
up-to-date.
In this talk, we show various examples where discovery systems have been enriched
both on the client and the server side. We compare advantages and disadvantages of
both variants and briefly demonstrate the data enrichment in Primo at the Mannheim
University Library.
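As a server-side illustration (the endpoint, resource URI and property are chosen for the example and are not the portal's actual enrichment source), additional descriptions could be fetched from a public SPARQL endpoint and stored with the record:

import requests  # third-party HTTP library

query = """SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Hildesheim>
      <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en") }"""
resp = requests.get("https://dbpedia.org/sparql",
                    params={"query": query,
                            "format": "application/sparql-results+json"},
                    timeout=10)
bindings = resp.json()["results"]["bindings"]
print(bindings[0]["abstract"]["value"][:120] if bindings else "no result")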
Keywords
Linked Data, Data Enrichment, Improving Discovery Systems
Leveraging Classificatory Subject Indexing in the New AquaBrowser Search Portal of the Vorarlberger Landesbibliothek
Karl Rädler
Vorarlberger Landesbibliothek, Fluherstraße 4, 6900 Bregenz
www.vorarlberg.at/vlb
Abstract. The new search portal of the Vorarlberger Landesbibliothek with AquaBrowser actively brings the shelving classification into play in the search in several ways. First, search results can be refined thematically, step by step and top-down; the corresponding hierarchy levels of the classification are offered for selection via verbal labels. Analogously to the subject notations, the country codes are also exploited via their verbal representatives as a separate regional facet, again with the possibility of successive top-down restriction; a third facet, ”time/epoch”, is still to be added. In addition, the verbal labels of the individual classes come into play in a refinement category of their own, ”subject heading”. Via a ”word cloud”, hierarchical and associative cross-references between the classes are also actively offered for support and selection. Practical search examples will demonstrate that classificatory subject indexing, precisely also in search engines, is able to provide a new dimension of search quality in a certain sense, in particular by actively making a library's complete holdings transparent and, so to speak, enabling voyages of discovery in the library's multidimensional information space. To support this, the search has been integrated graphically and functionally into the homepage: individual tabs offer direct entry points via underlying search links, which, among other things, make it possible to search whole subject areas and then to refine them successively by the most diverse facets (media type, period, classification, ...). A main mission of the new search portal of the Vorarlberger Landesbibliothek was to present its information services and media offerings as actively as possible and thus to provide a digital, multidimensional ”shop window”. To what extent this has succeeded will be demonstrated and put up for discussion.
Keywords
VORARLBERGER LANDESBIBLIOTHEK, CATALOGUE, SEARCH PORTAL, AQUABROWSER, CLASSIFICATION, SEARCH
Text Mining for Ontology Construction
Elke Bubel2 , Nils Elsner1 , Peter König2 , Helmut Müller1 , Nadejda Nikitina3 ,
Mario Quilitz2 , Silke Rehme1 , Achim Rettinger3 , and Michael Schwantner1,4
1 FIZ Karlsruhe – Leibniz-Institut für Informationsinfrastruktur, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen
2 INM – Leibniz-Institut für Neue Materialien gGmbH, Campus D2 2, 66123 Saarbrücken
3 Institut AIFB, KIT-Campus Süd, 76128 Karlsruhe
4 corresponding author, [email protected]
Abstract. Within the research project NanOn: Semiautomatische Ontologiegenerierung – ein Beitrag zum Knowledge Sharing in der Nanotechnologie (semi-automatic ontology generation – a contribution to knowledge sharing in nanotechnology), funded by the Senate Competition Committee of the Wissenschaftsgemeinschaft Gottfried Wilhelm Leibniz, an ontology for chemical nanotechnology was built. Following the ontology engineering methodology of Gómez-Pérez and Suárez, a requirements analysis was drawn up first. Scientists and producers searching for materials, properties or processes (syntheses, applications) were identified as the target group. To support the intellectual conceptualisation, already existing ontologies (among others CMO and ChEBI) were also taken into account when building the ontology. The focus of the project, however, was to examine the suitability of text mining methods both for building the ontology and for the automatic annotation of scientific articles. For this purpose, prototypical tools were developed in-house, making use of various open source tools (e.g. GATE and OpenNLP). It could be shown that text mining methods which extract relevant terms from specialist texts are of great value for building an ontology: they support the process of intellectual conceptualisation and lead to greater completeness of the ontology. For the automatic annotation of terms and relations, for which first tests were carried out, the results were considerably more heterogeneous. As expected, the annotation of concepts depends strongly on the completeness of the ontology with respect to (quasi-)synonyms. The annotation of relations is considerably more difficult; here the quality is influenced above all by whether specific formulations can be determined for the relations.
Keywords
ONTOLOGY, NANOTECHNOLOGY, ONTOLOGY ENGINEERING METHODOLOGY, TEXT MINING, AUTOMATIC ANNOTATION
Subject Indexing with GND/RSWK in the Basel Network: A First Assessment
Alice Spinnler
Universität Basel Universitätsbibliothek, Straße am Forum 2 Karlsruhe 76131,
Germany [email protected]
Abstract. In April 2011 the Basel network switched from an in-house verbal subject indexing scheme to SWD/RSWK. A good year after the introduction, it is time to draw a first balance. After a summary overview of verbal subject indexing in the IDS – Basel is part of the Informationsverbund Deutschschweiz – the reasons for the change are set out. Have our expectations been fulfilled? What does everyday indexing look like for the subject specialists and the subject heading editorial team with the new rules and the SWD, and, from May 2012, with the GND? What effects does the new indexing practice have on thematic searching in the current OPAC and in swissbib Basel Bern, our future search platform? What happens to the holdings indexed according to the in-house rules? And finally: has the change been worthwhile?
Keywords
SWD, RSWK, GND
Resource Discovery Systems – Opportunity or Curse for Library Cataloguing?
Heidrun Wiesenmüller
Hochschule der Medien, Stuttgart [email protected]
Abstract. Heterogeneity in library catalogues is nothing new, but with the introduction of so-called resource discovery systems (e.g. EBSCO Discovery Service, Primo Central, Summon) it has reached a hitherto unknown level: these systems offer search indexes of enormous size, composed of very different data from commercial providers. Combining RDS data with data catalogued according to library standards leads to considerable problems. Even for simple formal facets, the benefit is severely limited by the lack of normalisation; more complex search functions, such as restricting a search by subject area, seem not to be realisable at all any more. The contribution presents typical effects as well as the solution approaches so far, which usually separate library data from RDS data. Beyond that, first strategic considerations are offered for a profitable interplay of library data and RDS data. The goal must be that our high-quality data do not drown in the sea of non-library data, which, for example, lack authority-file links, but instead contribute to their improvement. The role of librarians will change in the process: in future they will increasingly also act as ”metadata managers”.
Keywords
RESOURCE DISCOVERY SYSTEMS, LIBRARY DATA, HETEROGENEITY
Content Adaptation of the RVK as a Shelf Classification – the New Library Building Project for the ”Kleine Fächer” of the FU Berlin, with a Focus on Oriental Studies
Helen Younansardaroud
New library building project ”24 in 1” of the FU Berlin
Abstract. In connection with the library structure reform of the Freie Universität Berlin, an integrated library with a common location is to be created for the so-called ”Kleine Fächer” (small disciplines) of the Department of History and Cultural Studies. The holdings of the current departmental libraries of the ”Kleine Fächer” are to be retro-catalogued and, for open-stack shelving, indexed step by step and uniformly according to the Regensburger Verbundklassifikation. To reach this goal, the cross-concordance methodology is applied in order to answer the question of whether the Regensburger Verbundklassifikation (RVK) is sufficient, or expressive enough, for the holdings of the Orient cluster of the ”Kleine Fächer” of the FU Berlin, which are currently shelved according to an in-house classification. Possible extension proposals for optimising the RVK are to be pointed out. (My topic is based on my master's thesis entitled ”Inhaltliche Anpassung der RVK als Aufstellungsklassifikation: Projekt Bibliotheksneubau Kleine Fächer der FU Berlin, Islamwissenschaft”, published by the Institut für Bibliotheks- und Informationswissenschaft der Humboldt-Universität zu Berlin, 2010 (Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft; 287), http://edoc.hu-berlin.de/docviews/abstract.php?lang=ger&id=37367)
References
CARMEN = Content Analysis, Retrieval and MetaData: Effective Networking: Abschlussbericht des Arbeitspakets 12 (AP 12) Crosskonkordanzen von Klassifikationen und Thesauri. Als Online-Publikation aufbereitete Version (2002), http://www.opus-bayern.de/uni-regensburg/volltexte/2003/242/pdf/CARMENAP12_Abschlussbericht_Netz.pdf ; Zugriff am 30.04.2010.
Goldziher, Ignác: A short history of classical Arabic literature. Transl., rev., and enl. by Joseph DeSomogyi, Hildesheim: Olms (Olms Paperbacks; 23), 1966.
Mayr, Philipp; Walter, Anne-Kathrin: Einsatzmöglichkeiten von Crosskonkordanzen. In: Stempfhuber, Maximilian (Hg.): Lokal – Global: Vernetzung wissenschaftlicher Infrastrukturen: 12. Kongress der IuK-Initiative der Wissenschaftlichen Fachgesellschaft in Deutschland. Bonn: GESIS - IZ Sozialwissenschaften (Tagungsberichte), (2006), S. 149-166, http://www.ib.hu-berlin.de/~mayr/arbeiten/mayr-walter-IuK06.pdf ; Zugriff am 30.04.2010.
Oberhauser, Otto; Seidler, Wolfram: Reklassifizierung grösserer fachspezifischer Bibliotheksbestände. Durchführbarkeitsstudie für die Fachbibliothek für Germanistik an der Universität Wien. Wien, 2000, http://www.germ.univie.ac.at/fbg/Studie.pdf ; Zugriff am 05.03.2010.
Umlauf, Konrad: Einführung in die bibliothekarische Klassifikationstheorie und -praxis mit Übungen. Berlin: Institut für Bibliothekswissenschaft der Humboldt-Universität zu Berlin (Berliner Handreichungen zur Bibliothekswissenschaft; 67), 1999-2006, http://www.ib.hu-berlin.de/~kumlau/handreichungen/h67/ ; Stand: 20.12.2006; Zugriff am 02.05.2010.