Download literature survey

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Data analysis wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Concurrency control wikipedia , lookup

SQL wikipedia , lookup

Data vault modeling wikipedia , lookup

Database wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

Information privacy law wikipedia , lookup

Open data in the United Kingdom wikipedia , lookup

Search engine indexing wikipedia , lookup

Forecasting wikipedia , lookup

Business intelligence wikipedia , lookup

SAP IQ wikipedia , lookup

Clusterpoint wikipedia , lookup

Relational model wikipedia , lookup

Versant Object Database wikipedia , lookup

Database model wikipedia , lookup

Transcript
International Journal of Research In Science & Engineering
e-ISSN: 2394-8299
p-ISSN: 2394-8280
TITLE: Optimized Prediction of Hard Keyword Queries Over Databases
Sunil Dombale1, Akshay Ingale2, Vasusaurabh Jadhav3
1
UG student,Dept.of Information Technology, JSCOE Hadapsar,[email protected]
UG student,Dept.of Information Technology, JSCOE Hadapsar, [email protected]
3
UG student,Dept.of Information Technology, JSCOE Hadapsar, [email protected]
2
ABSTRACT
Keyword Query Interface on databases give easy access to data, but undergo from low
ranking quality, i.e., low precision and recall. It would be constructive to recognize queries
that are likely to have low No of quality to improve the user achievement. For example, the
system may recommend to the user alternative queries for such hard queries.Aim of this
paper is to conclude the characteristics of hard queries and propose a novel framework to
measure the degree of difficulty for a keyword query over a database, grant for both the
structure and the content of the database and the result of query. There are query difficulty
prediction model but results displayed that even with structured data, searching the model
answers to keyword queries is still a hard task. Further, paper present a new approach is
nothing but extended output in which before measuring Structured Robustness score we are
applying the K-means clustering to dividing input dataset into number of clusters those
having validate information’s. As well, we will use relating to language features Such as
investigation features, syntactical features, and semantic features for effective prediction of
difficult queries over database. Due to this, Time required for guess the difficult keywords
over large dataset is minimized and process becomes robust and perfect.
Keywords:CheckQueryPerformance,QqueryEffectiveness,Keyword
Query,Robustness,Prediction,Database
I.INTRODUCTION
Keyword query interfaces (KQIS) for databases are much concern in the last decade due to
ease of use in searching and exploring the data. keyword queries characteristically have many
possible answers. KQIs must recognize the information needs behind keyword queries and
rate the answers so that the desired answers so that highest priority appear at the top of the
list. Databases contain entity,attributes and values. Some of the problems associated in
answering a query are likely to have contents users do not specify the preferred schema
element(s) for each query term. For e.g. keyword Godfather in the movie database does not
state that user is concern for title or Distributor Company. So a KQI must search for the
desired keyword associated with each term in the query if users does not provide enough
information about their desired entities. Consider a example keyword may return movies or
actors. One of effort called data-centric track of INEX Workshop where KQIS and are
IJRISE| www.ijrise.org|[email protected]
International Journal of Research In Science & Engineering
e-ISSN: 2394-8299
p-ISSN: 2394-8280
evaluated over the well-known IMDB data set which contains structured information about
movies and people. One more effort is the series of Semantic
Searched Challenges (SemSearch) at Semantic Search Workshop, where the data set is the
Billion Triple Challenge .The queries used are from Yahoo Keyword query log. Users have
provided relevance judegments for both benchmarks. Still the results indicate even with
structured data finding the preferred answers to keyword queries is still a hard task. Rating
quality of the methods used in both workshops were observed that they are performing poor
on a subset of queries. Consider the example of query like ancient Rome era over the IMDB
data set. Users would like to look for information about movies like ancient Rome. For this
query the XML search methods which we implemented return ratings of considerably lower
quality than their average rating quality over all queries. Therefore, some queries are much
difficult than others. Furthermore no matter which rating method is used we are unable to
deliver a reasonable rating for these queries. It is important for a KQI to look for such
queries and warn the user for alternative techniques like query reformulation or query
suggestions.No work is induce on predicting or analyzing the difficulties of queries over
databases. Recent study have proposed some methods to detect difficult queries over plain
text document collections. But these methods does not give the desired output to our problem
since they ignore the schema of the database. As mentioned a KQI must assign each term of
query to a schema element(s) in the database.
LITERATURE SURVEY
Some recent studies have presented different methods to predict hard queries over
unstructured documents or plain text collection. It can be classified into two groups: preretrieval and post-retrieval methods .Pre-retrieval methods predict the difficultness of a query
without computing its results. These methods usually use the statistical properties to measure
ambiguity or term-relatedness of the query to predict its difficultness. Examples like average
document frequency of the query terms or the number of documents which contain atleast
one query term. These methods generally predict that more the discriminative query terms
are the easier the query will be. Post retrieval methods make use of the results of a query to
display its difficulty and generally fall into one of the following categories.
Clarity-score based: In this concept of clarity score assumptions are made such that users are
concerned in a very few topics. Thus making sufficiently noticeable from other documents in
the collection. It is efficient than pre-retrieval based method. If these probability distributions
are relatively similar the query results contain information about as many topics as the
collection of whole thus the query is considered as difficult. Some methods were propose to
improve the efficiency of clarity score. However domain knowledge about the data sets to
extend idea of clarity score for queries over databases is required. Each topic in a database
contains the entities that are of similar subject.
Ranking-score based: The rating score of a document returned by the retrieval systems for an
input query may estimate the similar of the query and the document. Some recent methods
measure the difficulty of a query based on the score distribution of its results.
IJRISE| www.ijrise.org|[email protected]
International Journal of Research In Science & Engineering
e-ISSN: 2394-8299
p-ISSN: 2394-8280
Robustness-based: Another type of post-retrieval methods argue that the results of an easy
query are relatively stable against the queries, documents or ranking algorithms. Some
methods use machine learning techniques to study the properties and predict their hardness.
They have same limitations as the other approaches when applied to stable data. Moreover
their success depends on the quality of their available training data. High quality training data
is not available for many databases. Some recent study propose frameworks that theoretically
explain existing Predictions and combine them to achieve higher prediction accuracy. It
present novel learning methods for estimating the results returned by a search engine in
response to a query. Estimation conext is based on the agreement between the top results of
the full query and the top results of its sub-queries. It express the usefulness of quality
estimation for several applications among them like improvement of retrieval detecting
queries for which no relevant content exists in the document collection, and distributed
information retrieval. It describe two methods for learning an estimator of query difficulty.
Limitations:
1)The quality of query prediction strongly depends on the query length.
2)The restricted amount of training data.
The query-performance prediction task is estimating the effectiveness of a search performed
in response to a query in lack of relevant judgments. Post-retrieval predictors analyze the
result list of top-retrieved documents. Framework are based on using a pseudo effective and
ineffective raking as reference comparisons to the ranking at hand the quality of which it
want to predict.
limitations: It outperforms on large dataset.
Keyword queries over structured databases are distributed ambiguous. Understanding a single
keyword of query cannot satisfy all users and multiple interpretations may occur overlapping
results. It proposes a scheme to balance the relevance and novelty of keyword search results
over stable databases. Firstly, it displays a probabilistic model which effectively ranks the
possible interpretations of a keyword query over structured data. Forecast query difficulty
based on linguistic features, using Tree Tagger.
III .CONCLUSION
In this paper it introduces with the novel problem of predicting the effectiveness of keyword
queries over databases. It concludes with result that the current prediction methods for
queries over unstructured data sources cannot be effectively used for solution. It provides
new improve method for difficult keyword prediction by overcoming the drawbacks of
Capability. This approach is nothing but extended framework in which before going to
calculate SR score we are applying the K-means clustering to divide input data set into now
of clusters those having useful information’s. As well it will measure the degree of the
difficulty of a keyword query over a database, using the ranking, rating robustness principle.
IJRISE| www.ijrise.org|[email protected]
International Journal of Research In Science & Engineering
e-ISSN: 2394-8299
p-ISSN: 2394-8280
The algorithms predict the difficulty of a query with relatively low errors and negotiable time
overheads.
IV.References
[1] Shiwen Cheng, Arash Termehchy, and Vagelis Hristidis “Efficient Prediction of Difficult Keyword
Queries over Databases,” IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26,
NO. 6, JUNE 2014
[2] Hristidis, L. Gravano, and Y. Papakonstantinou.“Efficient IR style keyword search over relational
databases,” in Proc. 29th VLDB Conf., Berlin, Germany, 2003, pp. 850861.
[3] E. Yom-Tov, S. Fine, D. Carmel, and A. Darlow,“Learning to estimate query difficulty:Including
applications to missing content detection and distributed information retrieval,” in Proc.28th Annu.
Int. ACM SIGIR Conf. Research Development Information Retrieval, Salvador, Brazil, 2005, pp.
512519.
[4] O. Kurland, A. Shtok, S. Hummel, F. Raiber, D. Carmel, and O. Rom,“”Back to the roots: A
probabilistic framework for query performance prediction,,”” in Proc. 21st Int. CIKM, Maui, HI,
USA,2012, pp. 823832..
[5] O. Kurland, A. Shtok, D. Carmel, and S. Hummel,“”A Unified framework for postretrieval queryperformance prediction,””in Proc. 3rd Int. ICTIR, Bertinoro, Italy,2011, pp. 1526.
[6] S. C. Townsend, Y. Zhou, and B. Croft,“”Predicting query performance,””in Proc.SIGIR 02,
Tampere, Finland, pp. 299306.
[7] J. Han, M. Kamber, and J. Pei,“Data Mining: Concepts and Techniques,”San Francisco,CA: Morgan
Kaufmann, 2011.
[8] S. Cheng, A. Termehchy, and V. Hristidis,“Predicting the effectiveness of keyword queries on
databases,”in Proc. 21st ACM Int. CIKM, Maui, HI, 2012, pp. 1213-1222.
[9] Overview of the INEX 2011 Data-Centric Track Qiuyue Wang1,2, Georgina Ramírez3, Maarten
Marx4, Martin Theobald5, Jaap Kamps6
[10] T. Tran, P. Mika, H. Wang, and M. Grobelnik, “Semsearch ´S10,”in Proc. 3rd Int. WWW Conf.,
Raleigh, NC, USA, 2010
[11] Y. Zhou and B. Croft, “Ranking robustness: A novel framework to predict query performance,” in
Proc. 15th ACM Int. CIKM, Geneva, Switzerland, 2006, pp. 567–574.
[12]E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl, “DivQ: Diversification for keyword search
over structured databases,”
in Proc. SIGIR’ 10, Geneva, Switzerland, pp. 331–338
[13]Josiane Mothe , Ludovic Tanguy “Linguistic features to predict query difficulty - a case study on
previous TREC campaigns”
IJRISE| www.ijrise.org|[email protected]