Possible Topics - NDSU Computer Science
Paper Topic Ideas to consider
You choose your own paper topic. Below are some areas, with discussion of important topics within those areas, meant to help you get
started thinking of and formulating your own RESEARCH topic. Please submit your topic to me for approval by emailing a title and abstract
to [email protected]. It must be a research topic (one that makes some new contribution to the body of knowledge, as opposed to simply
an exposition of what has already been developed by others).
Only one person per topic (first come, first served - email your request to me). Please check the schedule before emailing your topic choice to me
(to make sure it has not already been chosen by one of your colleagues). Your paper should be high quality in terms of style and correctness.
This research project is YOUR project. I am suggesting topics but if you choose one of my suggested topics, that makes it your topic. You
should choose one of my suggestions only if you understand it and think it has potential as a research topic for you. The suggestions are
meant to help you find a suitable topic but are not intended to limit you to these topics.
Be sure to include:
INTRODUCTION AND CONTEXT. Research the "area" of the topic: put it in the context of what has been done by others, what is still left to do,
and what you are contributing. As to what has been written by others, see our text, ACM Computing Reviews, online database searches, etc. Be
sure to include a paper written since 1990, so that you will have some assurance that you are tied in with the latest round of
research on the topic. This usually forms the Introduction section of your paper and should be about 300-2000 words in length or longer.
MAIN NEW CONTRIBUTION (KILLER IDEA). Detail your contribution so that it can be followed by a reader who is new to the area (this can
be an expansion of the "what you are contributing" part of your Introduction). It would be best to have just one killer idea and do it well.
This section should be about 800-3000 words in length or longer.
PROOF. Prove that your idea is correct and makes the contribution you claim it does (i.e., that it is a "killer" idea). How you do this varies by
topic and area. If the contribution or "killer idea" involves random variation (stochasticity), a simulation may be required. If not, an analytic model
(assumptions, formulas, and analysis of results) may do the job. Actual experimentation may also be possible, though that involves
prototyping the system itself. We can email-talk about this section on an individual basis. This section should be about 500-3000 words in
length or longer.
CONCLUSION. Summarize the most important points and contributions you have made. Note that you will be "telling us what you're going to
do" in the Intro, "doing it" in the Idea and Proof sections, and "telling us what you did" in the Conclusion section. Thus, you will say
the same thing three times - in different ways, for different purposes, and at different levels of depth. This section should be about 200-1000 words in
length or longer.
Thank you and good luck.
First a note: New topic ideas will be added at the end of this section during the term!
A very hot topic area, which overlaps with web search querying and analysis, software
engineering reachability-graph and control-flow-graph analysis, sales graph analysis,
and bioinformatics interaction analysis, is the need to analyze multiple interactions for
common "strong interaction cells".
Website interactions: Two websites interact if one contains the other's URL (this is a
"directional interaction" and is modeled by a directed graph in which the nodes of the graph
are websites (URLs), and there is a directed edge running from a URL to each of the URLs it
contains on its website). One can simply analyze the "existence" of references (an unlabeled
directed edge iff the source URL contains one or more instances of the destination URL), or
one can analyze the "strength" of reference (a labeled graph in which the label on any edge
records the number of times the destination URL occurs on the source URL's page).
Two websites interact iff they are referenced on the same webpage (an undirected graph, which can be
labeled with "strength = number of different pages referencing both" or just "existence = unlabeled
edge" iff they are co-referenced). Two websites interact iff a given user goes from one site
immediately to the other site during a web surfing session (this is "directional", so it is
modeled by a directed graph, which can have labeled edges (count of user traces) or just
existence (at least one user trace)).
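As a sketch of the labeled-directed-graph model described above (the crawl data here is invented for illustration): the edge label is the occurrence count, and dropping the counts gives the "existence" version.

```python
from collections import Counter

def build_link_graph(pages):
    """Build a labeled directed interaction graph from crawled pages.

    `pages` maps each source URL to the list of URLs appearing on its
    page (duplicates allowed).  Each edge label is the occurrence count
    (the "strength"); ignoring the counts gives the unlabeled
    "existence" graph.
    """
    return {src: Counter(links) for src, links in pages.items()}

# Hypothetical crawl data.
pages = {
    "a.com": ["b.com", "c.com", "b.com"],
    "b.com": ["c.com"],
    "c.com": [],
}
g = build_link_graph(pages)
print(g["a.com"]["b.com"])  # strength of the a.com -> b.com edge: 2
print(sorted(g["a.com"]))   # existence version: ['b.com', 'c.com']
```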
Bioinformatics:
Two genes (or proteins) interact iff their expression profiles in a microarray experiment are
similar enough (this would be an unlabeled undirected graph - it could include "strength" =
the level of similarity as an undirected edge label). Two genes (or proteins) interact iff the
proteins they code for interact (in some particular way - e.g., they occur in the same pathway,
combine into the same complex, etc.). Two genes (or proteins) interact iff they are co-referenced
in the same document in the PubMed literature (again, "existence" or "strength"
are possible). Actually, this third criterion is not reliable most of the time: the authors might refer to
other genes in contexts unrelated to the genes they are working on, so mention of two genes
in the same document does not by itself mean there is an interaction between them.
This is where the GENE ONTOLOGY (GO) comes into the picture. GO has several evidence
codes which signify how the functions of the genes/gene products were assigned. For the GO
evidence codes, refer to http://www.geneontology.org/GO.evidence.shtml#ic. Before confirming
an interaction among genes mentioned in the same document, it would be good to cross-check with
the GO evidence codes.
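A minimal sketch of the first criterion above (similar-enough expression profiles), using Pearson correlation as an assumed similarity measure; the gene names, profiles, and threshold are all made up for illustration:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_graph(profiles, threshold=0.9):
    """Undirected interaction graph: an edge joins two genes iff the
    correlation of their profiles meets the threshold; the edge label
    ("strength") is the correlation itself."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges[(g1, g2)] = r
    return edges

profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.0, 8.2],   # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0],   # unrelated
}
g = coexpression_graph(profiles, threshold=0.9)
print(sorted(g))  # only the geneA-geneB edge survives the threshold
```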
Software Engineering:
Two programs interact iff the same code segment occurs in both (undirected, either labeled with a
strength = e.g., the number of times that segment occurs, or existence = just whether the
segment co-occurs at least once or not). Two programs interact iff they call the same program.
Two programs interact iff they are called by the same program. Two programs interact iff they
contain roughly the same set of constants (variables) with respect to some ontology or
standards listing. Two programs interact iff they have the same aspect designation.
Sales Analysis:
Two products interact iff they co-occur at checkout 80% of the time (or with some other threshold
support = "% of market baskets"). Two products interact iff, when one occurs in a market
basket, 80% of the time the other does too (this is the "directed graph" version of the first
criterion and has to do with the "confidence" of the association). Two products interact iff the same
salesman sells both. Two products interact iff they are sold in the same region (at a threshold
level, or, as a labeled graph, with the edge labeled by the number sold). Two products interact iff
they are sold at a threshold level during the same season (e.g., in December).
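The support and confidence notions above can be sketched directly; the baskets are hypothetical, and the 80% figures become ordinary thresholds applied to these two numbers:

```python
def support(baskets, items):
    """Fraction of market baskets containing every item in `items`
    (the undirected, threshold-support version of the interaction)."""
    items = set(items)
    return sum(items <= set(b) for b in baskets) / len(baskets)

def confidence(baskets, a, b):
    """Of the baskets containing `a`, the fraction also containing `b`
    (the directed-edge, "confidence" version)."""
    with_a = [bk for bk in baskets if a in bk]
    return sum(b in bk for bk in with_a) / len(with_a)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
print(support(baskets, {"bread", "milk"}))   # 3 of 5 baskets -> 0.6
print(confidence(baskets, "bread", "milk"))  # 3 of the 4 bread baskets -> 0.75
```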
Security Applications:
Two persons interact iff they are from the same neighborhood (or city, state, or country). Two persons
interact iff they are in the same occupation. Two persons interact iff they have similar records
(employment records, criminal records, etc.). Two persons interact iff they belong to the same
organizations. Two events interact iff they are attended by similar sets of attendees. Two locations
interact iff they are visited by similar sets of people. Two locations interact iff they are associated
with similar events. In all these interactions, the graph model is central, and one is looking for strong
clusters of nodes (nodes that are strongly associated via the edge set). How do we find the
strong clusters? What do we mean, precisely, by a cluster?
Notes:
Using vertical technologies to search out common clusters or quasi-clusters or "cliques" should be very
valuable in bioinformatics as well as in web analysis. For instance, there are thousands of interaction
graphs of interest over a given set of genes (proteins). Using vertical technology, it is possible to
construct an index attribute and an order attribute for each interaction graph and to analyze them (using
Dr. Daxin Jiang's methods or other methods - e.g., OPTICS-like) directly. You can find Dr. Daxin Jiang's
work here: Jiang_TKDE_paper. A shorter version (preliminary work): Jiang_paper. You can get the
OPTICS paper, to gain some understanding of ordering-based algorithms, here: OPTICS paper. For each
data set (either a micro-array dataset or an interaction graph dataset), construct two derived attributes:
the step count attribute from the ordering, and the index attribute. For each pair of such added attributes,
we can quickly search for the pulses using vertical technology (just a matter of looking for those genes
where the index exceeds a threshold, and moving that threshold down until the user feels he/she has found
the appropriate pulses). These "pulse genes" have a step number in the ordering. For each pulse gene, we
can quickly extract the forward subinterval from that pulse to the next. Each such search will give us the
"flat region", from which the strong cluster associated with that pulse can be extracted. So we will have a
vertical "mask" defining each strong cluster from each dataset. We can quickly "AND" those to find
common strong clusters using vertical technology. With this minor extension to Dr. Daxin Jiang's
wonderful tool, and a re-coding of the tool for vertical data, one could analyze across multiple interaction
graphs. I think that is the main exciting application - and across multiple micro-arrays (maybe?) and
across multiple web graphs. When the dataset is very large, scalability becomes a very important
issue.
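The final AND step above can be sketched with one bitmask (a vertical "mask") per dataset; the masks and the 8-gene universe here are made up, and each bit stands for "this gene is in that dataset's strong cluster":

```python
def common_strong_cluster(masks):
    """AND the per-dataset cluster masks (one bit per gene) to find the
    genes that fall in a strong cluster in EVERY dataset."""
    common = masks[0]
    for m in masks[1:]:
        common &= m
    return common

# One int per dataset; bit i is set iff gene i is in that dataset's
# strong cluster (hypothetical masks over 8 genes).
masks = [0b10110110, 0b10100111, 0b11100110]
common = common_strong_cluster(masks)
genes = [i for i in range(8) if common >> i & 1]
print(genes)  # gene indices in a strong cluster in all three datasets
```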
More Paper Topic Ideas to consider
A new method of classification or clustering based on Derived Attributes that are "walk-based". The walks
can be based on Z-ordering, Hilbert ordering, or another ordering (or random walks?).
A new method of classification or clustering based on some statistic derived from the Covariance matrix
of a training space or a space to be clustered.
A new method of classification or clustering based on some combination of derived attributes that are
variation-based and walk-based.
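The first idea above (a walk-based derived attribute) can be sketched with a Morton/Z-order interleave: the single derived value places each 2-D point on the Z-order walk of the grid, and an ordinary 1-D method can then be applied to it. The points and bit width are illustrative.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of (x, y) to get the point's position on a
    Z-order (Morton) walk of the grid - a single "walk-based"
    derived attribute."""
    z = 0
    for i in range(bits):
        z |= (x >> i & 1) << (2 * i)      # x bits go to even positions
        z |= (y >> i & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]
print([z_order(x, y) for x, y in points])  # [0, 1, 2, 3, 4]
```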
An automatic alerter system, to be used by software engineers, which will automatically alert the
development team when some type of "bad situation" or "dangerous practice" is detected (e.g., within a
system such as CVS for storing development versions: when a version is checked in, the alerter
would immediately analyze it for the "exceptional situation", perhaps by classifying based on a
database of past development projects done with the CVS system).
Based on the alerter idea above, develop a specific alerter for an ASPECT.
Based on the alerter idea above, develop a specific alerter for "development shop coding rule violations".
Based on the alerter idea above, develop a specific alerter for "standards organization standards violations".
Based on the alerter idea above, develop a specific alerter for "highly risky situations".
Based on the alerter idea above, develop a specific alerter for "potential security problems (code that invites hackers to hack)".
New Method of Cluster Data Mining Support that DBMSs Should Use (with reasons). How do Oracle, IBM
DB2, and Microsoft SQL Server support clustering, and how would you improve on these methods?
Decision Tree Induction Classification Implementation and Performance Analysis for Numeric Data.
Implement a new method of Decision Tree Induction classification data mining. Prove that your method
performs well compared to ID3, C4.5, C5, or other known methods, for at least one type of data.
Decision Tree Induction Classification Implementation and Performance Analysis for Categorical Data.
Implement a new C4.5- or C5-like decision tree induction classification data mining method
and prove it compares well to C4.5, C5, or other known methods for categorical data.
Bayesian Classification Implementation and Performance Analysis. Implement a new method of Bayesian
classification data mining and prove it compares well to known methods.
Neural Network Classification Implementation and Performance Analysis. Implement a new method of
Neural Network classification data mining and prove it compares well to known methods (how well does
it scale to large datasets?)
K-Nearest Neighbor Classification Implementation and Performance Analysis or K-Most Similar
Classification Implementation and Performance Analysis. Implement a new method and prove it
compares well to known methods.
Density-Based Classification Implementation and Performance Analysis. Implement a new method of
Density-Based classification and prove it compares well to known methods.
Genetic-Algorithm-Based Data Mining Implementation and Performance Analysis. Implement a new method
of Genetic-Algorithm-Based classification and prove it compares well to known methods.
Simulated Annealing-Based Classification Implementation and Performance Analysis. Implement a new
method of Simulated-Annealing-Based classification and prove it compares well to known methods.
Tabu-Search-Based Classification Implementation and Performance Analysis. Implement a new method of
Tabu-Search-Based classification and prove it compares well to known methods.
Rough-Set-Based Classification Implementation and Performance Analysis. Implement a new method of
Rough-Set-Based classification and prove it compares well to known methods.
Fuzzy-Set-Based Classification Implementation and Performance Analysis. Implement a new method of
Fuzzy-Set-Based classification and prove it compares well to known methods.
Markov-Modeling-based Classification and Performance Analysis (Hidden Markov Model based, ...).
Implement a new method of Markov-Chain-Based classification and prove it compares well to known
methods. Reference: "Evaluation of Techniques for Classifying Biological Sequences", Deshpande and
Karypis, PAKDD Conf. 2002, Springer-Verlag Lecture Notes in Artificial Intelligence 2336, p. 417.
Multiple-Regression Data Mining Implementation and Performance Analysis. Implement a new method of
multiple-regression-based data mining and prove it compares well to known methods.
Non-linear-Regression Data Mining Implementation and Performance Analysis. Implement a new method of
non-linear-regression-based data mining and prove it compares well to known methods.
Poisson-Regression Data Mining Implementation and Performance Analysis. Implement a new method of
Poisson-regression-based data mining and prove it compares well to known methods.
Association Rule Mining Implementation and Performance Analysis. Implement a new method of
Association Rule Mining and prove it compares well to known methods (e.g., Frequent Pattern Trees).
Multilevel Association Rule Mining Implementation and Performance Analysis. Implement a new method of
Multilevel Association Rule Mining and prove it compares well to known methods.
Counts-count Association Rule Mining Implementation and Performance Analysis. Implement a new method
of Counts-count Association Rule Mining and prove it compares well to known methods. Counts-count
ARM means that the method takes into account the number of each item in a market basket, not just
whether or not the item is bought (1 or more).
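A minimal sketch of the counts-count notion of support just described, with hypothetical baskets stored as item-to-quantity maps; `min_count = 1` recovers the plain "existence" support:

```python
def counts_support(baskets, item, min_count):
    """Counts-count version of support: fraction of baskets holding at
    least `min_count` units of `item`, not merely one or more."""
    return sum(b.get(item, 0) >= min_count for b in baskets) / len(baskets)

# Hypothetical baskets: item -> quantity bought.
baskets = [{"beer": 6, "chips": 1}, {"beer": 1}, {"beer": 6}, {"milk": 2}]
print(counts_support(baskets, "beer", 1))  # 0.75 (plain "existence" support)
print(counts_support(baskets, "beer", 6))  # 0.5  (quantity-aware support)
```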
Partitioned Hash Functions. Devise a variation on the basic Partitioned Hash Function structure which you
can show will perform in a superior way in some particular multidimensional setting and workload.
Multi-key Index. Devise a variation on the basic Multi-key Index structure which you can show will perform
in a superior way in some particular multidimensional setting and workload.
kd-Trees: Devise a variation on the basic kd-Tree index structure which you can show will perform in a
superior way in some particular multidimensional setting and workload. Or, apply the kd-tree idea to
computer graphics (specifically, ray shooting); that is, develop and improve the kd-tree to get
good performance for scenes of different complexities.
R-Trees: Devise a variation on the basic R-Tree index structure which you can show will perform in a
superior way in some particular multidimensional setting and workload.
K-Means Clustering. Implement a new method of K-Means Clustering and prove it compares well to known
methods.
K-Medoids Clustering. Implement a new method of K-Medoids Clustering and prove it compares well to
known methods.
K-Nearest Neighbor Clustering. Implement a new method of K-Nearest Neighbor Clustering and prove it
compares well to known methods. A reference is "Clustering Using a Similarity Measure Based on
Shared Near Neighbors", Jarvis and Patrick, IEEE Transactions on Computers, Vol. c-22, No. 11,
November 1973.
Agglomerative Hierarchical Clustering. Implement a new method of Agglomerative Hierarchical Clustering
and prove it compares well to known methods such as AGNES.
Divisive Hierarchical Clustering. Implement a new method of Divisive Hierarchical Clustering and prove it
compares well to known methods such as DIANA.
Hierarchical clustering similar to BIRCH. Implement a new method similar to BIRCH clustering and prove it
compares well to known methods such as BIRCH itself.
Clustering similar to CURE. Implement a new method similar to CURE clustering and prove it compares
well to known methods such as CURE itself.
Clustering similar to OPTICS. Implement a new method similar to OPTICS clustering and prove it compares
well to known methods such as OPTICS itself.
Clustering similar to DB-SCAN. Implement a new method similar to DB-SCAN clustering and prove it
compares well to known methods such as DB-SCAN itself.
Grid-based clustering similar to STING. Implement a new method similar to STING clustering and prove it
compares well to known methods such as STING itself.
Grid-based clustering similar to CLIQUE. Implement a new method similar to CLIQUE clustering and prove
it compares well to known methods such as CLIQUE itself.
CLARANS partitioning clustering. Implement a new clustering method similar to CLARANS and prove it
compares well to known methods such as CLARANS itself.
Hierarchical clustering similar to ROCK. Implement a new method similar to ROCK clustering and prove it
compares well to known methods such as ROCK itself.
Hierarchical clustering similar to CHAMELEON. Implement a new method similar to CHAMELEON clustering
and prove it compares well to known methods such as CHAMELEON itself.
Density-based clustering similar to DENCLUE. Implement a new method similar to DENCLUE clustering
and prove it compares well to known methods such as DENCLUE itself.
Statistics-based clustering similar to COBWEB. Implement a new method similar to COBWEB clustering
and prove it compares well to known methods such as COBWEB itself.
Statistics-based clustering similar to CLASSIT. Implement a new method similar to CLASSIT clustering and
prove it compares well to known methods such as CLASSIT itself.
Statistics-based clustering similar to AutoClass. Implement a new method similar to AutoClass clustering and
prove it compares well to known methods such as AutoClass itself.
The ACID Properties: Testing and Performance. Devise an approach to test the support of the ACID
properties, which are enforced by the concurrency control and recovery methods, for various DBMSs you
have access to.
Concurrency control using ROLL-, ROCC-, and MVROCC-like methods for heterogeneous distributed
database systems. Work out a serializable concurrency control scheme based on the ROLL protocol
which could be implemented on a variety of systems, each autonomously running an off-the-shelf
database system.
ROLL Concurrency Control with Space Efficient Request Vectors. Develop and test a ROLL Concurrency
Control method in which the Request Vectors are compressed (such as with Ptrees).
New Deadlock Management Method. Especially for widely distributed data on a distributed DBMS, devise and
test a new deadlock management method. Devise a method particularly well suited to this environment
and give some justification for believing it is superior to existing methods.
Quad-Trees II (Kinked Quad Trees): Devise a variation on the basic Quad-Tree index structure which you
can show will perform in a superior way in some particular multidimensional setting and workload.
Consider the following. Instead of quartering the space using a vertical and a horizontal division line,
replace the horizontal line with a "kinked" line (two joined line segments) and the vertical line with a
"kinked" line, each carefully chosen to distribute the points evenly among the quadrants. The only
additional overhead is that the "cutlines" would be recorded using more parameters than in simple quad
trees. Another possibility, which might be compared to basic quadtrees and to "kinked" quadtrees, is
"Poly Quad Trees", in which a polynomial is used in each case instead of a line (e.g., a parabola - a
polynomial of degree 2 - or one of degree 3, ...).
BEGIN-order Recovery. The RECOVERY PROCESS with checkpointing is as follows:
1. Start at the checkpoint record in the LOG. Put the "active list" into an UNDO list.
2. Work forward in the LOG. For every BEGIN encountered, put that transaction in the UNDO list. For
every COMMIT encountered, move the transaction from the UNDO list to the REDO list.
3. When the end of the LOG is reached, redo all transactions in the REDO list, in REDO-list order; then
undo all transactions in the UNDO list.
Note: Since transactions are being redone in REDO-list order (COMMIT order in this case), it must be the
case that the serial order to which the execution is equivalent is COMMIT order. Messages may have gone
back to the users which were based on an execution order equivalent to SOME serial order (values
reported to users were generated by execution in that order). Thus, RECOVERY must regenerate the same
values. The only way the RECOVERY process can know what serial order the original execution was
equivalent to is for the initial execution to be equivalent to some serial order identifiable from the LOG.
One order identifiable from the LOG is COMMIT order. Therefore, it is common to demand that the
order of execution be equivalent to the serial COMMIT order (e.g., by using Strict 2PL). Are there other
possibilities? How about "arrival (or BEGIN) order"? How could that be made determinable from the
LOG? How does it compare to COMMIT order? Better? Worse?
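The three recovery steps above can be sketched as a single forward scan of the log; the transaction names, checkpoint active list, and log records are illustrative:

```python
def recover(active_at_checkpoint, log_after_checkpoint):
    """Build the UNDO and REDO lists by one forward scan of the LOG
    from the checkpoint (the redo-in-COMMIT-order scheme)."""
    undo = list(active_at_checkpoint)     # step 1: active list -> UNDO
    redo = []
    for op, txn in log_after_checkpoint:  # step 2: work forward
        if op == "BEGIN":
            undo.append(txn)
        elif op == "COMMIT":
            undo.remove(txn)
            redo.append(txn)              # REDO list keeps COMMIT order
    return undo, redo                     # step 3: redo in order, then undo

# Hypothetical log tail: T1, T2 active at the checkpoint.
log = [("BEGIN", "T3"), ("COMMIT", "T2"), ("BEGIN", "T4"), ("COMMIT", "T3")]
undo, redo = recover(["T1", "T2"], log)
print(undo)  # ['T1', 'T4'] - never committed, must be undone
print(redo)  # ['T2', 'T3'] - redone in COMMIT order
```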
"ITERATIVE DYNAMIC PROGRAMMING" and its high potential in QUERY OPTIMIZATION.
"The Application of P-trees in a power plant MIS".
"K-Clustering using P-trees":
1. Code statistics such as sum, mean, and variance using the DMI;
2. Focus on our k-clustering algorithm;
3. Compare performance with the K-means, mean-splitting, and variance-based algorithms.
"Addressing number in MBR ARM Implementation and Performance Analysis."
"SOM Clustering Using P-Trees".
A Support Vector Machine-like classification method.
Develop a process of structuring data into a vertical database. Prove that it has some good features. Try to
compare it either to other vertical database structuring processes and/or to horizontal database structuring
processes (e.g., ER diagramming...).
Develop some aspect of the query processor for a vertical database. Prove that it has some good features.
Try to compare it either to other vertical query processing approaches (e.g., the ones in the notes) and/or
to horizontal database query processing.
Develop some aspect of the Data Mining processor for a vertical database. Prove that it has some good
features. Try to compare it either to other vertical data mining approaches (e.g., the ones in the notes)
and/or to horizontal database data mining processing.
Wikipedia is a new hyperlinked database. It is developing very rapidly into a major source of online
information. Many topics spring to mind in and around Wikipedia, e.g.:
Analyze the link structure in Wikipedia (related to the "analyze interactions" group at the beginning of this
"topics" section).
Automated link setting in Wikipedia.