Detecting Emerging Concepts in Textual Data Mining
William M. Pottenger, Ph.D. and David R. Gevry
Lehigh University
Recent advances in computer technology are fueling radical changes in the nature of information
management. Increasing computational capacities coupled with the ubiquity of networking have
resulted in widespread digitization of information, thereby creating fundamentally new
possibilities for managing information. One such opportunity lies in the budding area of textual
data mining. With roots in the fields of statistics, machine learning and information theory, data
mining is emerging as a field of study in its own right. The marriage of data mining techniques
to applications in textual information management has created unprecedented opportunity for the
development of automatic approaches to tasks heretofore considered intractable. This document
briefly summarizes our research to date in the automatic identification of emerging trends in
textual data. We also discuss the integration of trend detection in the development of
constructive, inquiry-based multimedia courseware.
The process of detecting emerging conceptual content that we envision is analogous to the
operation of a radar system. A radar system assists in the differentiation of mobile vs. stationary
objects, effectively screening out uninteresting reflections from stationary objects and preserving
interesting reflections from moving objects. In the same way, our proposed techniques will
identify regions of semantic locality in a set of collections and ‘screen out’ topic areas that are
stationary in a semantic sense with respect to time. As with a radar screen, the user of our
proposed prototype must then query the identified ‘hot topic’ regions of semantic locality and
determine their characteristics by studying the underlying literature automatically associated with
each such ‘hot topic’ region.
Applications of trend detection in textual data are numerous: the detection of such trends in
warranty repair claims, for example, is of genuine interest to industry. Technology forecasting is
another example with numerous applications of both academic and practical interest. In general,
trending analysis of textual data can be performed in any domain that involves written records of
human endeavors whether scientific or artistic in nature.
Trending of this nature is primarily based on human-expert analysis of sources (e.g., patent,
trade, and technical literature) combined with bibliometric techniques that employ both semi- and
fully automatic methods [White and McCain 1989]. Automatic approaches have not focused on
the actual content of the literature primarily due to the complexity of dealing with large numbers
of words and word relationships. With advances in computer communications, computational
capabilities, and storage infrastructure, however, the stage is set to explore complex
interrelationships in content as well as links (e.g., citations) in the detection of time-sensitive
patterns in distributed textual repositories.
Semantics are, however, difficult to identify unambiguously. Computer algorithms deal with a
digital representation of language – we do not have a precise interpretation of the semantics. The
challenge thus lies in mapping from this digital domain to the semantic domain in a temporally
sensitive environment. In fact, our approach to solving this problem attaches semantics to a
statistical abstraction of relationships that change over time in literature databases.
Our research objective is to design, implement, and validate a prototype for the detection of
emerging content through the automatic analysis of large repositories of textual data. In this
project in particular, we are interested in applying trend detection algorithms as a textual data
mining tool that will aid students in learning through constructive exercises.
The following steps are involved in the process: concept identification/extraction; concept co-occurrence matrix formation; knowledge base creation; identification of regions of semantic
locality; the detection of emerging conceptual content; and a visualization depicting the flow of
topics through time. For details on our approach, please see [Pottenger and Yang 2000] and
[Bouskila and Pottenger 2000].
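To make the pipeline concrete, below is a minimal sketch of the first two steps in Python, assuming
each document has already been reduced to the set of concepts extracted from it. It is an
illustration only, not the implementation described in the papers cited above; in particular, the
simple co-occurrence threshold is a simplistic stand-in for the actual region-identification
algorithm.

    # Illustrative sketch only: concept co-occurrence matrix formation and a
    # simplistic stand-in for identifying regions of semantic locality.
    from collections import defaultdict
    from itertools import combinations

    def cooccurrence_matrix(documents):
        """Count how often each pair of concepts appears in the same document."""
        counts = defaultdict(int)
        for concepts in documents:
            for a, b in combinations(sorted(set(concepts)), 2):
                counts[(a, b)] += 1
        return counts

    def semantic_locality(counts, threshold=2):
        """Group concepts whose pairwise co-occurrence meets a simple threshold."""
        neighbors = defaultdict(set)
        for (a, b), n in counts.items():
            if n >= threshold:
                neighbors[a].add(b)
                neighbors[b].add(a)
        return neighbors

    # Example: three documents, each a set of extracted concepts.
    docs = [{"data mining", "trend detection", "text"},
            {"data mining", "text", "information retrieval"},
            {"data mining", "trend detection", "text"}]
    print(dict(semantic_locality(cooccurrence_matrix(docs))))

Emerging-content detection would then examine how these regions shift from one time slice of the
collection to the next.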
The integration of our Hot Topics Data Mining System in constructive, inquiry-based
multimedia requires sophisticated lesson tracking and context construction mechanisms that are
described in more detail below.
Lesson Tracking and Context Enhancement
The research being done in this area is twofold. The first focus of this project is to track
users as they move through the lessons and determine how individual users as well as users as a
group approach the lessons. The goal is to use individual users’ contexts to enhance their
performance when conducting constructive, inquiry-based learning exercises that employ the Hot
Topics Textual Data Mining System to uncover trends in a given field of study.
The motivation for this tracking research comes from our current work with user profiling based
on temporal aspects of web access: how often a user visits a page and how long they stay on that
page.
The goal of the research is to link users’ temporal data with the semantic data of the
documents that they view. This temporal link will allow us to automatically filter a model of the
user’s interests based on their history of access to the material. The first step in this research is
thus to gather source data for individual user access.
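As an illustration of these two temporal features, the following sketch (our own, with a
hypothetical record format) derives per-page visit counts and approximate dwell times from a
user’s timestamped access records:

    # Illustrative sketch: derive visit frequency and dwell time per page from
    # one user's timestamped access records, ordered by time.
    from collections import defaultdict

    def temporal_profile(accesses):
        """accesses: list of (timestamp_seconds, page_url) pairs in time order.

        Dwell time on a page is approximated as the gap until the next access;
        the final access receives no dwell estimate.
        """
        visits = defaultdict(int)
        dwell = defaultdict(float)
        for (t, page), (t_next, _) in zip(accesses, accesses[1:]):
            visits[page] += 1
            dwell[page] += t_next - t
        if accesses:
            visits[accesses[-1][1]] += 1
        return visits, dwell

    visits, dwell = temporal_profile([(0, "/lesson1"), (40, "/lesson2"), (95, "/lesson1")])
    print(dict(visits), dict(dwell))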
Initially this user profiling research began by examining web server logs in order to profile
individual users; unfortunately, this data proved unsuitable for our purposes. The logs we were
using did not contain enough information to distinguish individual users. For example, given an
IP address it is hard to determine whether it represents a distinct person or a number of users
sharing the same address (e.g., a proxy server). In the logs we studied (drawn from a two-week
period of access to www.ncsa.uiuc.edu), the reported value for the operating system changed for an
individual address in many cases. IP address look-up revealed that the majority of the addresses
were proxy servers or similar gateways, hence invalidating them as individual users for our purposes.
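The check that exposed this problem can be sketched roughly as follows (a simplified
reconstruction with hypothetical data, not the actual scripts used): an IP address whose log
entries report more than one operating system is unlikely to belong to a single individual.

    # Illustrative sketch: flag IP addresses whose log entries report multiple
    # operating systems, suggesting a proxy or shared gateway rather than an
    # individual user.
    from collections import defaultdict

    def suspect_proxies(log_entries):
        """log_entries: iterable of (ip_address, reported_os) pairs."""
        os_by_ip = defaultdict(set)
        for ip, os_name in log_entries:
            os_by_ip[ip].add(os_name)
        return {ip for ip, systems in os_by_ip.items() if len(systems) > 1}

    entries = [("10.0.0.1", "Windows"), ("10.0.0.1", "MacOS"), ("10.0.0.2", "Windows")]
    print(suspect_proxies(entries))  # {'10.0.0.1'}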
Another reason why these log files were not useful for the research was the sparseness of
individual user access. Users did not seem to frequent the site for very long, and during the
two-week period we chose, few users made repeat visits to the site. Below is a chart that depicts
user access in a continuous five-day period. In order for temporal user profiling to be effective
there must be sufficient data to characterize the user’s browsing activities. The web server logs,
however, cannot provide us with this type of data because users do not frequent the site enough to
yield adequate temporal data.
[Chart: Users vs. Days Accessed; y-axis: Number of Users (0 to 90,000); x-axis: Days Accessed (1 to 5)]
These factors, compounded by the uncertainty of identifying individual users, caused us to abandon
these logs as a viable source of data for our research.
In response to this issue we devised an approach to track the usage of multimedia courseware. We
believe that tracking lessons in this way will yield a larger source of individual user data. The
nature of the lessons themselves will promote user access, and we will be able to track individual
users as they progress through the lessons. Although this data will not be representative of
general individual web access, it will show us how students use and access the material as well as
possible temporal relationships with user interest. Additionally, it will assist in locating spots
within the lessons where individuals or groups of users spend a significant amount of time,
allowing us to determine possible points of interest or confusion in understanding the material.
Users will have round-the-clock access to the lessons, and therefore tracking individual access
will give us a better picture of a user’s interest and of which parts of the lessons the user found
useful in studying the material, working homework exercises, studying for exams, etc.
Our focus will be to generate temporally sensitive contexts specific to individual users, and to
boost the performance of the Hot Topics Textual Data Mining System using these contexts. By
tracking the user we plan to build the context within which they are conducting constructive,
inquiry-based learning exercises that employ the Hot Topics Textual Data Mining System. The data
will be used to generate time-sensitive contexts to focus the nature of the detection of emerging
topics in the field being studied. Time-sensitive contexts will be compared to an unmodified
general context for the course to see if focusing on what the user has examined in a given
timeframe is more effective in identifying ‘hot topics’ relevant to the constructive,
inquiry-based exercises. To aid the Hot Topics Textual Data Mining System we will generate a
repository of documents related to the topic area under study. This will give a more focused
conceptual space from which to draw when performing ‘hot topics’ detection.
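One plausible way to realize such a time-sensitive context, sketched below with hypothetical
parameter choices, is to weight each concept the user has viewed by an exponential decay on its
age, so that recently examined material dominates the context supplied to the Hot Topics system:

    # Illustrative sketch: build a time-sensitive user context by weighting
    # viewed concepts with an exponential decay on their age.
    import math

    def time_sensitive_context(viewed, now, half_life_days=7.0):
        """viewed: list of (concept, timestamp_days) pairs.

        Returns a concept-to-weight map; half_life_days is an illustrative
        parameter, not a value from the project.
        """
        decay = math.log(2) / half_life_days
        weights = {}
        for concept, t in viewed:
            weights[concept] = weights.get(concept, 0.0) + math.exp(-decay * (now - t))
        return weights

    ctx = time_sensitive_context([("neural networks", 9.0), ("data mining", 1.0)], now=10.0)
    print(ctx)  # recently viewed concepts weigh near 1; older ones decay toward 0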
Finally, though not directly related to this project, an individual proxy system will be
implemented to gain a second, more general dataset of user profiles. This will involve routing
participants’ web browsers through a proxy server placed on their machine. The log files
generated by this system will have the benefit of being specific to a user and will give a better
picture of user browsing patterns and interests in a temporal sense.
Given below is a use-case diagram for the Lesson Tracking and Hot Topics Textual Data Mining
System as well as a timeline for our research. The use-case diagram shows the pathways through the
proposed system we will design. The user’s actions will be tracked through JavaScript functions
that will communicate with our database through a CGI script. The database will also contain
additional content information for the lessons, and this will be combined with the temporal data
to form a temporally sensitive contextual model that the Hot Topics Textual Data Mining System
will use to augment its performance. The timeline is broken into separate timelines for the Lesson
Tracking and Hot Topics Textual Data Mining System, proxy tracking, and collection development and
management for the repository encompassing the field of study.
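A minimal sketch of the receiving end of this pipeline appears below; Python stands in for
whatever CGI language is ultimately chosen, and the field names and log path are illustrative only:

    #!/usr/bin/env python
    # Illustrative sketch of a CGI script that receives tracking events sent by
    # the lesson pages' JavaScript functions. The fields (user, page, event)
    # are hypothetical, not the project's actual schema.
    import os
    import time
    from urllib.parse import parse_qs

    def main():
        params = parse_qs(os.environ.get("QUERY_STRING", ""))
        record = (
            time.time(),                          # server-side timestamp
            params.get("user", ["unknown"])[0],   # lesson login id
            params.get("page", [""])[0],          # lesson page viewed
            params.get("event", ["view"])[0],     # e.g. view, scroll, exit
        )
        # Append to a flat file; a real deployment would insert into the
        # lesson-tracking database instead.
        with open("/var/log/lesson_tracking.log", "a") as log:
            log.write("\t".join(str(field) for field in record) + "\n")
        print("Content-Type: text/plain\n")       # header plus blank line
        print("ok")

    if __name__ == "__main__":
        main()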