Research Problems & Topics (Web Domain)
CS598-CXZ Advanced Topics in IR Presentation, Jan. 25, 2005
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

Faculty Homepage Classification/Finding
• The problem is to classify faculty homepages from different universities according to their research field. If a student interested in data mining wants to apply to graduate programs at U.S. universities, he can input "data mining, U.S., university", and the search result would be the data mining faculty homepages from different universities in the U.S. This would help a lot, since people currently have to navigate to different university websites, go to the "faculty" list, and click on every faculty name to find out whether that person's interest is data mining or not.
• The challenge is how to summarize the homepages and classify them correctly.

Search by Relations
• Search by relations instead of words and phrases.
• A typical search query today is based on keyword matching, while semantically rich, structured data are hidden implicitly on the Web, so we need some "mining" technology to find these relations. For example, if we want to find Mike's address, we take "Mike" and "address" as keywords and search on Google. Typically all the sources containing these two keywords are shown as search results, and most of them actually have nothing to do with Mike's address; we then have to dig deeply into these results to find the information we want.
– Users: everyone who has an Internet connection
– Data involved: the whole Web
– Functions to be developed: search by relations

Google Dictionary
• As a non-native English speaker, I often want to find out the correct usage of a word, and even more often, the correct usage of a phrase. Online dictionaries usually show only a few examples, and for certain phrases they may not even have entries. If I search for the word or phrase on Google, it usually finds web pages where the word or phrase appears only in the title, which is not very helpful.
• It would be very useful to have a Google dictionary: you type in a word or a phrase, and it shows how the word or phrase is popularly used, with summarization and examples. The users of such a tool would be people for whom English is a second language, or kids who are still learning new words and phrases. It would also be useful for finding out the meanings of buzzwords.
• Since it is supposed to be an English dictionary, the system should filter out web pages that may contain improper usage of English. The data involved should most likely be news articles, online books, essays, and other well-written English text.
• The dictionary should ideally summarize the usage of the word or phrase into several categories, give examples for each category, and perhaps differentiate between formal and informal usage. Like other online dictionaries, it should also be able to correct the user's spelling, or find the best match if the user enters a phrase that does not exist.
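The usage-summarization step described above can be prototyped as a simple keyword-in-context (KWIC) concordance: collect sentences containing the query phrase from a trusted corpus and group them for display. The sketch below is a minimal illustration under that assumption; the tiny corpus, the grouping by following word, and the function names are all hypothetical and not part of the original proposal.

```python
import re
from collections import defaultdict

def concordance(phrase, documents):
    """Collect keyword-in-context example sentences for a word or phrase
    from a corpus of (assumed) well-written documents."""
    snippets = []
    pattern = re.compile(r'\b' + re.escape(phrase) + r'\b', re.IGNORECASE)
    for doc in documents:
        for sent in re.split(r'(?<=[.!?])\s+', doc):
            if pattern.search(sent):
                snippets.append(sent.strip())
    return snippets

def group_by_following_word(phrase, snippets):
    """Very rough 'usage categories': group examples by the word that
    follows the phrase (e.g., 'look forward to' vs. 'look forward with')."""
    groups = defaultdict(list)
    pattern = re.compile(r'\b' + re.escape(phrase) + r'\b\s+(\w+)', re.IGNORECASE)
    for sent in snippets:
        m = pattern.search(sent)
        key = m.group(1).lower() if m else '(end of sentence)'
        groups[key].append(sent)
    return groups

if __name__ == "__main__":
    corpus = [
        "We look forward to hearing from you. The committee will reply soon.",
        "Researchers look forward with optimism to the next decade of web search.",
    ]
    usage = group_by_following_word("look forward", concordance("look forward", corpus))
    for key, examples in usage.items():
        print(key, "->", examples[0])
```

A real system would add the filtering of poorly written pages and the formal/informal distinction mentioned above; this sketch only shows the example-grouping idea.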
Personalized Conceptual Search Engine
• Task 1: An identical query may have different latent meanings. In real-world searching, people usually have their own preference for a certain aspect of a query. For example, "apple" may mean "computer" or "fruit"; "Java" may mean "country", "coffee", or "programming language". In general search it is hard to tell which aspect of a concept a user wants, but in personalized search, one user is likely to prefer one particular aspect.
• Task 2: Sometimes it is hard to generate a good query from a certain need. For example, a user wants to know "what did the American government say about …". Most articles mention "Bush said …", "Bill Clinton said …", "Powell said …", so a query like "American government …" may not get satisfactory results. This is because "American government" is a concept, which stands for a group of terms. Again, in general search, modeling a concept is hard because each concept may have different meanings; in personalized search, however, each concept tends to map to a fairly stable set of terms for a given user.
• An even better scenario: when a user wants to know "the state of the art of NLP", an ideal personalized system should first figure out that NLP means Natural Language Processing for this user, and then figure out that it covers POS taggers, parsers, etc. These tasks are hard for a general search engine but doable for personalized search. Interestingly, a person's name can be a very good example of a concept.
• The training data could be any kind of text with a personalized property (not strictly limited to query history). For example, word-usage statistics from the user's articles, chat records, and other collections can be very useful. All of this can be done on the client side, which avoids the privacy problem.
– Users: common users of search engines
– Data: query history, personal collections of texts, articles, and chat records
– Functions: concept clustering and summarization from texts; personal preference learning; query modification by concept selection and splitting
– Challenges: how to cluster terms into concepts from personalized texts; how to represent a concept; how to do query expansion with the concept information
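The query-modification function above can be approximated by a per-user mapping from concept names to term groups, applied as query expansion on the client side. The sketch below shows only the expansion step under that assumption; the concept dictionary is hand-coded here, and every entry in it is a placeholder for what the learning component would actually produce.

```python
# Minimal sketch of concept-based query expansion on the client side.
# The per-user concept dictionary is assumed to be learned offline from
# the user's own texts; here it is hard-coded for illustration.
USER_CONCEPTS = {
    "nlp": ["natural language processing", "pos tagging", "parsing"],
    "american government": ["bush", "clinton", "powell", "white house"],
}

def expand_query(query, concepts=USER_CONCEPTS):
    """Replace each concept mentioned in the query with the group of
    terms this user associates with it; leave other words unchanged."""
    expanded = query.lower()
    for name, terms in concepts.items():
        if name in expanded:
            expanded = expanded.replace(name, " ".join(terms))
    return expanded

if __name__ == "__main__":
    print(expand_query("state of the art of nlp"))
    # state of the art of natural language processing pos tagging parsing
```

The hard parts flagged in the challenges, learning the concept dictionary and deciding when to expand, are deliberately left out of this sketch.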
Find In-depth Knowledge about an Entity
• To find in-depth knowledge about a particular entity on the web.
• Example: I type in "Microsoft", and I want to find the earnings, revenue, and locations of the Microsoft Corporation.
• Users: researchers, stock analysts, people who work in human resources, and anyone who wants to do research on a person, a company, or a particular topic.
• Data: the complete Web.
• Method: the semantic web may be the solution to this problem. Even so, it may still be useful to disambiguate among entities with common names, cluster the pages, and then summarize them.

Web Search for Information Seen Before
• Most people have had the experience of remembering that they saw something useful or interesting before but being unable to figure out how to access it again. For example, when planning dinner, someone may remember seeing an interesting recipe on a cookbook site and try to find it, but has forgotten the name of the site and cannot find it after trying several different queries on a search engine. This kind of information need emerges regularly in daily life.
• In such situations, we usually still have a rough idea of the information. Sometimes we can recall the original search context (when, or in what situation) and then figure out how to query and access it again; sometimes we have to give up after several unsuccessful trials.
• This kind of search is different from general Web search. The user is each individual who surfs and searches the Web, and the search content is the pages that he or she has ever accessed. The challenge is how to help the user clarify his memory so as to reconstruct the context and reach the target information. There are many possible approaches. For example, the browser front end (e.g., IE) can log the queries the user has submitted and interactively help the user refine the query. Another possibility is a personal agent that indexes all the pages accessed before and searches over these cached pages. Although the search space of cached pages is much smaller than the whole Web, the space available on the local machine limits the indexing capability, so it may not be possible to index all cached pages. How to index and search efficiently is critical.
• To summarize:
– User: each individual user
– Data: cached Web pages plus the Web
– Functions: efficient indexing and search on the local computer

Infer User Preferences over Websites
• Users: search engine users
• Data: search results
• Description: People have preferred websites for different kinds of searches. For example, if I am searching for a paper download, I would probably only look at CiteSeer and the ACM site, because they usually have a paper download link; when searching for news, I trust the NY Times over other news sources. It would be convenient to give more weight to results from these sites. The preference can be inferred from implicit or explicit feedback, but the challenge is that the preferred websites change with the search topic.

Structural Search
• If we submit a long sentence or a paragraph to Google, most of the time Google cannot handle it. Structural search differs from traditional keyword search in that it also uses the structural information in the query sentences.
• The users could be any general users.
• The data involved is the text of web pages on the Internet.
• The key functions are document indexing and document searching.

Price Extraction and Comparison
• Companies like Walmart, Best Buy, etc. usually price their merchandise based only on their buy-in cost and the amount of goods in stock. However, if competing companies offer a lower price at the same time, most customers will spend their money there; the consequence is that the goods cannot be sold and the company incurs more cost.
• If we can build an information retrieval system that collects the retail prices from all competing companies, then the merchandise can be priced more competitively.
• The challenges include how to find all prices of a given item on the web, and how to account for factors that make the price not directly visible, such as sale or coupon information.
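A first cut at the price-finding step above is pattern matching over page text, followed by a per-product comparison across stores. The sketch below assumes pages have already been fetched as plain text and grouped by store; the regular expression, page snippets, and store names are illustrative only, and handling sales and coupons would need much more than this.

```python
import re

# Matches strings like "$1,299.99" or "$45"; a real system would need
# locale handling, price ranges, and "was/now" sale annotations.
PRICE_RE = re.compile(r'\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)')

def extract_prices(page_text):
    """Return all dollar amounts found in a page, as floats."""
    return [float(m.replace(",", "")) for m in PRICE_RE.findall(page_text)]

def compare_stores(pages):
    """pages: dict mapping store name -> page text for the same product.
    Returns (store, lowest listed price) pairs sorted by price."""
    best = []
    for store, text in pages.items():
        prices = extract_prices(text)
        if prices:
            best.append((store, min(prices)))
    return sorted(best, key=lambda x: x[1])

if __name__ == "__main__":
    pages = {
        "store_a": "Deluxe Widget now only $1,299.99 (was $1,499.99)",
        "store_b": "Deluxe Widget: $1,350.00 with free shipping",
    }
    print(compare_stores(pages))   # [('store_a', 1299.99), ('store_b', 1350.0)]
```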
Domain-Specific Search
• Currently, search engines deal only with general search: for any query, they search the whole web. However, in many situations people know the answer lies in a specific domain. For example, a student looking for references on a particular problem, say "max flow of the network", knows the answer should be on a Web page in the .edu domain, but with Google it is hard to specify such a constraint. One solution is to index only the Web pages in that domain and build a search engine on top of them; this is the topic of domain-specific search. Another use of domain-specific search is to help employers find suitable employees: some organizations want to recruit new staff from among current graduate students, and they can rely on a .edu domain search engine for this purpose.
• Users: each domain has its own group of users; for example, .edu serves faculty, students, and potential employers.
• Data: the Web pages that belong to a domain.
• Functions: keyword search; course search for the .edu domain.
• Challenges: What are the characteristics of the domain? People will agree that .gov and .edu have different characteristics; how do we recognize these characteristics to help search? What specific functions should be defined for a domain? For example, .edu may support a "professor" or "course" search function.

Research Area Relation Mining
• A department has many research branches; for Computer Science, examples are Artificial Intelligence, Machine Learning, Data Mining, and Computer Vision. What is the relationship between these areas? For example, Machine Learning has strong relations with Data Mining and Computer Vision, and Data Mining is correlated with Information Retrieval. Could we find these relations from the Web? Could we find, or even anticipate, newly emerging or interdisciplinary areas?
• Users: students, faculty.
• Data: faculty homepages are good sources, since faculty members usually state their interests and publications there. If one professor has more than one interest, those areas are probably related; if two professors collaborate on a paper, their interests are probably related. Such an application may help faculty and students find new interests.
• Functions: research area relation mining.
• Challenges: how to recognize faculty members' interests, and how to mine the relations.
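The co-occurrence intuition above, that two areas are related if they appear on the same faculty homepage, can be prototyped as simple pair counting. The sketch below assumes each homepage has already been reduced to a list of stated interests; the interest-extraction step, which the proposal itself flags as the hard part, is not shown, and the example data is invented.

```python
from collections import Counter
from itertools import combinations

def area_cooccurrence(homepages):
    """homepages: list of interest lists, one per faculty homepage.
    Count how often each pair of research areas co-occurs on a page."""
    pair_counts = Counter()
    for interests in homepages:
        for a, b in combinations(sorted(set(interests)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

if __name__ == "__main__":
    homepages = [
        ["machine learning", "data mining", "computer vision"],
        ["data mining", "information retrieval"],
        ["machine learning", "computer vision"],
    ]
    for pair, count in area_cooccurrence(homepages).most_common(3):
        print(pair, count)
```

Co-authorship links between professors could be folded in the same way, by treating each paper's author-interest union as another "page".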
Fuzzy Matching for Web Search
• Topic: a more intelligent search engine.
• Description: Google, though powerful, is not "smart" enough: it can only conduct exact search with keyword matching. This works only under the assumption that the user can specify the "best" keywords; if the user has only a vague idea, Google may not be good enough. A "vague searching" component could therefore be added.
• An example: when a user wants to buy a cheap computer, he may input a batch of keywords such as "computer", "PC", "cheap", "personal computer". If we use Google directly, it may return results that merely contain "computer + PC + cheap + personal computer", whereas the desired result may be the sales page for Dell computers.
• The relevant techniques would be natural language processing and the semantic web.

More Expressive Query Languages
• One interesting research topic is how to let users query the web with a more sophisticated query language instead of just keyword queries. The keyword query has the advantage of simplicity, but it does not allow users to specify their information needs precisely. For example, suppose you are a new DAIS Ph.D. student preparing for the Qualifying Exam and you want to find all the courses related to the database and information systems area. You can send a query such as "related course database information system" to Google, but you will be very disappointed with the results, because Google supports only keyword queries.
• For this topic, the users can be any web users, and the data can be any indexed web pages.
• To solve this problem, many techniques will play a role, such as text summarization, text categorization, and information extraction. One of the most challenging problems is the query language itself: unlike a traditional database, web data has no schema, which makes defining a query language very hard. It is also worth discussing whether we should have one universal query language or several specialized query languages for different domains.

Automatic News and Information Extractor, Classifier, and Comparator
• User: ordinary Internet users who want to read news from all the sources they care about in a more classified and organized way, spending the minimum amount of time.
• Data involved: the web pages specified by the user, plus input entered by the user.
• Function: This problem is motivated by something I face almost every day. I check news in two domains: social and political news, and sports news. For each category, I check several online sources to get the desired breadth and depth of information, and this process takes a lot of my time every day. What I am looking for is software (or a web page) that is given, once, the links to the online news sources I use, and then, either on a regular basis or on demand, performs the following:
– Regular (e.g., daily) function: extract the mutually related news from all the sources and provide me with titles, sources, summaries, and links to the full articles. The software should do this separately for every interesting, hot, or commonly discussed issue that appears in the media.

Personalized News Alerts
• Many people now routinely read news online instead of buying a newspaper. The user generally reads not only general news such as politics and business, but also has several distinct interests, such as news from his professional community and news related to his hobby. At the same time, the user has to visit multiple web sites to browse what is happening daily, weekly, or monthly. If we provide personalized news alerts, i.e., desktop software that automatically retrieves, ranks, and presents the news of personal interest, the user can save a lot of time and will not miss important news. Everyone can benefit from such software.
• News articles (perhaps in RSS format) and other web pages from the World Wide Web would be crawled and filtered; for each person, only a subset of web sites needs to be crawled. The system would provide filtering (removing uninteresting news), organization (by topic, date, etc.), search (by topic, date, etc.), and mining (the user can do comparative news reading to get different opinions about the same event).
• There are some challenges. The first is how to model the user's interests. There are some clues about the user's personal
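One simple way to realize the retrieve-rank-present loop in the Personalized News Alerts idea is to score each incoming article against a bag-of-words interest profile and surface the top-scoring items. The sketch below assumes articles are already fetched as plain text and that the profile is a hand-specified term-weight dictionary; in the proposal this profile would instead be learned from the user's behavior, which is exactly the open challenge noted above.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def score(article_text, interest_profile):
    """Sum the profile weights of the profile terms that occur in the
    article, weighted by their in-article frequency."""
    counts = Counter(tokenize(article_text))
    return sum(weight * counts[term] for term, weight in interest_profile.items())

def rank_articles(articles, interest_profile, top_k=5):
    """articles: list of (title, text) pairs. Return the top_k titles by score."""
    ranked = sorted(articles, key=lambda a: score(a[1], interest_profile), reverse=True)
    return [title for title, _ in ranked[:top_k]]

if __name__ == "__main__":
    profile = {"retrieval": 2.0, "search": 1.5, "soccer": 1.0}
    articles = [
        ("IR workshop announced", "A workshop on information retrieval and web search ..."),
        ("Local election results", "The city council election results were announced ..."),
    ]
    print(rank_articles(articles, profile, top_k=1))   # ['IR workshop announced']
```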
Possible Web Topic Areas
• Improving search engines
  – Specialized search engines
    • Special collections (domain-specific, information seen before)
    • Specialized users / personalized search (infer user interests)
  – Advanced query capabilities
    • More powerful query languages (structured, relational, semantic)
• Comprehensive/complete news service
• Web information extraction
  – Price extraction
  – English usage extraction
  – In-depth knowledge about an entity

A possible group project
• Better CS website search
  – More powerful query language
    • Customized to the CS domain/ontology (e.g., courses, publications, projects, etc.)
  – Better browsing support
    • Adding structure to the collection
    • Automatic annotation (virtual links)
  – Academic ads
  – Automatic report of "what's new"
  – CS domain news service?

Two Papers to Consider Presenting
1. Adaptive Web Search Based on User Profile Constructed without Any Effort from Users, WWW 2004. (http://www.www2004.org/proceedings/docs/1p675.pdf)
2. Query-Free News Search, WWW 2003. (http://www2003.org/cdrom/papers/refereed/p707/p707-henzinger.html)

Assignment 2 (for Web Team)
• Read past WWW conference proceedings (e.g., www2002-www2005)
• Everyone identifies the one or two papers he or she finds most interesting and would like to present
• Send me your choices by this Sunday (Jan. 30)
• Need one volunteer to present a web paper on Feb. 3