Diversity in search: what, how, and what for?
Bettina Berendt, Dept. of Computer Science, KU Leuven
Thanks to Sebastian Kolbe-Nusser, Anett Kralisch, Siegfried Nijssen, Ilija Subašić, Mathias Verbeke, Hugo Zaragoza, ...

Diversity in natural language
– diverse (s#2), various: distinctly dissimilar or unlike ...
– diversity (s#1), ..., variety: noticeable heterogeneity (WordNet)
– "the fact that members of a set are different from one another"

Why is diversity interesting for search?
– "People like to see a range of different, non-redundant things/views/etc."
– "Different people search differently."
– How? When / under what conditions? (What) can we do?

What is diverse? Documents
– The relevance of a document must be determined considering the documents appearing before it (Goffman, 1964)
– E.g. MMR (Carbonell & Goldstein, 1998)
– Many further developments, e.g. for images
– Presentation choices, e.g. re-ranking or clustering?
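For reference, the MMR criterion from Carbonell & Goldstein (1998) selects the next document D_i from the retrieved set R by trading off relevance to the query Q against redundancy with the already selected documents S, with λ controlling the balance:

$$\mathrm{MMR} \;=\; \arg\max_{D_i \in R \setminus S} \Big[\, \lambda\, \mathrm{Sim}_1(D_i, Q) \;-\; (1-\lambda)\, \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \,\Big]$$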
What is diverse? Documents, people
– "The term diversity is a form of euphemistic shorthand to describe differences in racial or ethnic classifications, age, gender, religion, philosophy, physical abilities, socioeconomic background, sexual orientation, gender identity, intelligence, mental health, physical health, genetic attributes, behavior, attractiveness, place of origin, cultural values, or political view as well as other identifying features." (http://en.wikipedia.org/wiki/Diversity_(politics))

What is diverse? Documents, people, knowledge and its articulations (= documents in a wider sense?!)
– "Knowledge and its articulations are strongly influenced by diversity in, e.g., cultural backgrounds, schools of thought, geographical contexts."
– "LivingKnowledge will study the effect of diversity and time on opinions and bias."
– "The goal [is] to improve navigation and search in very large multimodal datasets (e.g., the Web itself)."

How we got here
– The impact of language and culture on Web usage behaviour → diversity of users
– Tools for sense-making in literature search → diversity of documents
– PORPOISE, STORIES: tools for graphical news summarization and understanding
– Collaborative re-use of literature search results → diversity of diversity

Why this talk?
– The impact of language and culture on Web usage behaviour (e.g. Information Retrieval J. 2009)
– Collaborative re-use of literature search results (Proceedings Living Web WS@ISWC 2009)
– Tools for sense-making in literature search (Inf. Processing & Management 2010)
– PORPOISE, STORIES tools for graphical news summarization and understanding (e.g. Knowledge and Information Systems J. 2009)
Towards an integrated understanding of diversity.

The impact of linguistic diversity on Web usage and thereby on the Web
Or: Why are non-English languages underrepresented on the Web?

A web-analysis approach asking for underlying cognitive-linguistic, behavioural, and attitude factors.

A simple expectation of how much content exists in which language.
But: dynamics of content creation, link setting, link following, attitudes, and use:
– People create less content
– People link less to content
– People use links less
– People think the content is bad ... and use it less
Result: under-representation!

Underlying data and methods
– Database of countries and official languages
– Distribution comparisons between:
  – worldwide proportions of native speakers of different languages
  – worldwide distribution of servers registered by country
  – crawler analysis of links to a multilingual site S
  – log analysis assigning each session a native language
  – log analysis of (user native language) vs. (S-entry-page language)
– Questionnaire/TAM analysis of native and non-native users of S: usability, ease of use, competence in English, beliefs about availability of content in one's native language

Some questions
– Does one find such dynamics also in search engines?
– What factors stop or reverse such language-marginalisation trends? Critical mass? Laws? Volunteers?
– Did / can Web 2.0/3.0 change this?
– (When) is it better to work without pre-defined labels for users?

Part 2: An approach that ...

Motivation (1): Diversity of people is ...
– Speaking different languages (etc.) → localisation / internationalisation
– Having different abilities → accessibility
– Liking different things → collaborative filtering
– Structuring the world in different ways → ?

Motivation (2): Diversity-aware applications ...
– Must have a (formal) notion of diversity
– Can follow a
  – "personalization approach": adapt to the user's value on the diversity variable(s). Transparently? Is this paternalistic?
  – "customization approach": show the space of diversity, allow choice / raise awareness / semi-automatic!

Measuring grouping diversity
– Diversity = 1 − similarity = 1 − normalized mutual information (NMI)
– [Figure: two example groupings of the same result set, one of them by colour; the compared pairs have NMI = 0 and NMI = 0.35]

Measuring user diversity
– "How similarly do two users group documents?"
– For each query q, consider their groupings gr → per-query grouping diversity gdiv(A, B, q)
– For various queries: aggregate → gdiv(A, B)
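To make this measure concrete, here is a minimal sketch in Python. It is illustrative only: the function names, the use of scikit-learn's NMI implementation, and averaging as the aggregation over queries are assumptions (the slides only say "aggregate"). It computes gdiv(A, B, q) = 1 − NMI of two users' groupings of the same result list and then averages over queries.

```python
# Minimal sketch (illustrative, not the talk's implementation):
# grouping diversity gdiv(A, B, q) = 1 - NMI of two users' groupings.
from statistics import mean
from sklearn.metrics import normalized_mutual_info_score

def gdiv_per_query(grouping_a, grouping_b):
    """Grouping diversity for one query: 1 - NMI of two label assignments.

    Each argument is a list with one group label per document, aligned on
    the same result list for query q.
    """
    return 1.0 - normalized_mutual_info_score(grouping_a, grouping_b)

def gdiv(user_a, user_b, queries):
    """Aggregate user diversity over several queries.

    user_a / user_b map each query to that user's grouping; averaging is an
    assumption here, the slides only say "aggregate".
    """
    return mean(gdiv_per_query(user_a[q], user_b[q]) for q in queries)

# Toy example: two users group the same five documents for one query.
grouping_by_topic = [0, 0, 1, 1, 1]  # user A: two topical groups
grouping_by_type  = [0, 1, 0, 1, 0]  # user B: groups the documents differently
print(gdiv_per_query(grouping_by_topic, grouping_by_topic))  # 0.0: identical groupings
print(gdiv_per_query(grouping_by_topic, grouping_by_type))   # close to 1: very different
```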
... and now: the application domain
... that's only the 1st step!

Workflow
1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use:
   1. Learn + present way(s) of grouping
   2. Transfer the constructed concepts

Concepts
– Extension: the instances in a group
– Intension:
  – ideally: "squares vs. circles"
  – pragmatically: defined via a classifier

Step 1: Retrieve
– CiteseerX via OAI
– Output: set of document IDs, document details, and their texts

Step 2: Cluster
– "the classic bibliometric solution"
– CiteseerCluster:
  – Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations
  – Clustering algorithm: k-means, hierarchical
– Damilicious: phrases (Lingo)
– How to choose the "best"? Experiments: Lingo better than k-means at reconstruction and extension-over-time

Step 3 (a): Re-organise & work on document groups

Step 3 (b): Visualising document groups

Steps 4+5: Re-use
Basic idea:
1. learn a classifier from the final grouping (Lingo phrases)
2. apply the classifier to a new search result → "re-use semantics"
(a minimal code sketch of this step follows after the last slide)
– Whose grouping? One's own; somebody else's
– Which search result?
  – "the same" (same query, structuring by somebody else)
  – "more of the same" (same query, later time → more documents)
  – "related" (... measured how? ...)
  – arbitrary

Visualising user diversity (1)
Simulated users with different strategies:
– U0: did not change anything ("System")
– U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings
– U2: attempted to move everything that did not fit well into the remainder group "Other topics", plus a better fit; 10 regroupings
– U3: attempted to move everything from "Other topics" into matching real groups; 5 regroupings
– U4: regrouping by author and institution; 5 regroupings
5×5 matrix of diversities gdiv(A, B, q) → multidimensional scaling

Visualising user diversity (2)
[Figure: diversity maps for the queries "Web mining", "Data mining", and "RFID", aggregated using gdiv(A, B)]

Evaluating the application
– Clustering only: does it generate meaningful document groups?
  – Yes (tradition in bibliometrics), but: data?
  – Small expert evaluation of CiteseerCluster
– Clustering & regrouping:
  – End-user experiment with CiteseerCluster
  – 5-person formative user study of Damilicious

The Damilicious tool: summary and (some) open questions
A tool that helps users in sense-making, exploring diversity, and re-using semantics.
– Diversity measures when queries and result sets are different?
– How best to present diversity? How to integrate it into an environment supporting user and community contexts? Incentives to use the functionalities?
– How to find the best balance between similarity and diversity?
– Which measures of grouping diversity are most meaningful? Extensional? Intensional? Structure-based? Hybrid? (cf. ontology matching)
– Which other sources of user diversity?
– Diversity and relevance: can we learn from user-dependent relevance judgements?

Some lessons learned (or questions raised?)
– We need to embrace diversity.
– We need to take into account:
  – the diversity of documents / knowledge
  – the diversity of people
  – the diversity of diversity.
– We need to be clear about what we mean.
– We need to ask whether / when "striving for diversity" is in itself A Good Thing.
– We need to ask whether / when "raising awareness of diversity" is in itself A Good Thing.

Thanks!
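As the closing sketch referred to under "Steps 4+5: Re-use" above, here is a minimal, hypothetical Python illustration of re-using semantics. The documents, group labels, and the TF-IDF plus logistic-regression classifier are stand-ins chosen for brevity; Damilicious itself works with Lingo phrases and its own classifier, so none of the names below come from the tool.

```python
# Minimal sketch of "re-use semantics" (illustrative; not the Damilicious code):
# 1. learn a classifier from a user's final grouping of one search result,
# 2. apply it to a new search result (e.g. the same query at a later time).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A user's final grouping after manual regrouping: (document text, group label).
regrouped = [
    ("frequent pattern mining on web logs", "web mining"),
    ("session reconstruction from server log data", "web mining"),
    ("association rule mining algorithms", "data mining"),
    ("clustering high-dimensional data sets", "data mining"),
    ("privacy and security of RFID tags", "rfid"),
]
texts, labels = zip(*regrouped)

# Step 4: learn a classifier that approximates the intension of the user's groups.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)

# Step 5: transfer the constructed concepts to a new search result
# ("more of the same": same query, later time, more documents).
new_results = [
    "mining clickstream data for navigation patterns",
    "supply chain tracking with RFID readers",
]
for doc, group in zip(new_results, classifier.predict(new_results)):
    print(f"{group}: {doc}")
```

The same pattern covers the other re-use variants on that slide: only the new result list changes (same query later, a related query, or an arbitrary one), and the learned grouping can come from one's own or somebody else's session.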