Seminar on Optimized Research Techniques in Data Mining
R. S. Somasundaram
Associate Professor
Department of Computer Applications
SNS College of Engineering

To do a good Ph.D.
• 10% talent and 90% how hard you work
• A good professor to supervise you
• A good research team
• Collaborate on the projects of other students
• Choose a good research topic
• Get more feedback on your research project
• Get more funding for your research
• Get other opportunities such as reviewing articles for conferences, participating in committees, etc.
• Work on a significant problem and bring an original solution
• Plan well
• Set some goals and some deadlines for achieving them
• Make a plan of where you want to submit your papers and when
• Teach courses relevant to your research field during your Ph.D.

How to choose a Supervisor?
• Reputation of the university
• It is important to choose a professor who is working in a field or on a project that you like.
• Is the professor active in research?
• Funding (from various agencies)
• Time (to discuss your research with you)
• Good opportunities (e.g. participating in a book or in a conference committee)
• Get a co-advisor
• Research environment
• Does the potential supervisor have a good research team?
• Will there be opportunities to collaborate with other students and publish joint papers?
• Does the professor have a good social network with other professors in your field of interest?
• You should also consider the location of the university. Will you need to travel to another city? How much will it cost? Will you have to go far away from home?
• If you want to work on fundamental algorithms for data mining, you may want to find a Ph.D. supervisor who specializes in data mining.

How to choose a Topic?
• Have a passion for it
• Have basic knowledge of the area
• Improve your mathematical skills
• Be able to convert any problem into a mathematical model
• Have the ability, or the support, to solve the mathematical problem
• Decide which problems you want to solve or what you want to improve
• Decide whether you want to design or improve data mining techniques, apply data mining techniques, or do both
• Be aware that improving data mining techniques may require better algorithmic and/or mathematical skills
• What kind of techniques do you want to apply or design/improve?

How to choose a Conference/Journal?
• Does the conference have a good reputation?
• Who publishes the conference proceedings?
• The best conferences are sometimes very selective
• How difficult is it to get a paper accepted at a given conference?
• Does my paper have a good chance of getting accepted?
• The acceptance rate does not always reflect well how difficult it is to get a paper accepted
• Location of the conference
• The registration fee
• Opportunity to meet researchers who could be interested in your research
• Deadline of the conference and the review time
• The format
• Topics covered in the conference (should be specific)

Data Mining Research Area
• Traditional: preprocessing, classification, clustering, association, sequential patterns
• Pattern mining (sequential pattern mining, itemset mining, etc.)
• Social network analysis / social media analytics
• Sentiment analysis or social opinion analysis
• Data analysis patterns for software engineering
• Data mining for signal processing
• Video mining
• Data engineering for the cloud
• Medical science
• Finance analysis (stock prediction, customer behavior prediction, etc.)
• Text mining / stream mining / web mining
• Data mining on human driving behavior data, such as eco-driving, disaster evacuation, traffic congestion, etc.
• Outlier detection
• Distributed data clustering
• Dynamic query-result clustering
• Unsupervised feature selection
• For large-scale data / big data
• For linked data

Data Mining Research Area
• Text summarization - As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Many news-oriented applications rely on text summarization.
• Title recommendation, topic modeling - Predict the title for articles, websites, etc. This requires a learning-based system built with classification algorithms. In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.
• Semantic correction system - A little complex but interesting. Retrieved text often contains semantic errors, which lead to wrong results. Applying semantic correction as a preprocessing step leads to better outcomes.
• Syntactic correction system - Much needed nowadays. Non-native English speakers make many syntactic errors. It can also be used as a preprocessing step in many projects, so your algorithm should automatically detect such errors and suggest correct grammar.
• Search engine for Wikipedia - Wikipedia data is available as a dump file. Check DBpedia for reference. Apply indexing techniques and build a small search engine for wiki pages. Wikipedia already provides this functionality, but you can work on a better user experience and on result optimization.

Data Mining Research Topics
• Twitter tweet classifier - Fairly easy and interesting too. Create a learning system for categories such as sports, entertainment, business, politics, Hollywood, etc. Train the classifier (naive Bayes, SVM) and predict the category of incoming tweets.
• Sentiment analysis for Twitter, reviews, conversations - A few packages available in R can help with this job. One needs to add a few additional features on top of them to make the results more intuitive.
  NLTK, Stanford, and word2vec are also good open source tools for the same.
• Spam mail detection - Again a learning-based classification system. Train the classifier on mail that users have already marked as spam so that it can classify new incoming mail. If a user marks a new mail as spam, retrain (there may be a better option).
• Sarcasm detection - This can be a very interesting one. In sentiment analysis we identify users' sentiment about something; here we identify sarcasm expressed by users.
• Classifying fake users, classifying insincere posts - Mail service providers like Gmail, Yahoo, etc. work hard to keep their users away from spam mail and spam users. Admins of online discussion forums are also keen to automatically delete spam, fake, and irrelevant posts.

Data Mining Research Topics
• Fraud detection - Some users on social media intentionally create hype about particular products or stocks to drive them up. Identifying such fraudulent users and activity is also a challenging task.
• Market analysis - Coca-Cola continually hires third-party companies to process data related to it from Twitter and Facebook. It launches creative campaigns and wants to constantly monitor whether each campaign is being accepted by the audience. Many companies try to understand the flaws in their processes by trying to understand what their users/customers are saying about their products or services. Analysts are automating their work by building tools that read the news and try to predict the market situation for the next day. Sentiment analysis is still one of the hottest applications (and yours truly has been engaged in research on sentiment analysis for two years). You can read about risk analysis and predictive analysis to learn about the latest focus and advancements in these areas.
• Robotics - Robots are not simply pre-programmed toys anymore. They try to learn how to do their work from their previous experiences.
  From genetic algorithms to reinforcement learning, there are many areas of computer science that are trying to solve these problems from multiple perspectives. We would love to sit in a car that drives itself if it proves that it can think on the fly. We want missiles to hit the target despite being in an unknown land with a totally different climate and unexpectedly high wind speeds.

Data Mining Research Topics
• Manufacturing, automotive, aviation - The focus is on improving manufacturing processes to optimize time and material, and to ensure high-quality production on the assembly line. This extends beyond the factory and onto the road, where modern braking systems know how much pressure should be applied to each tyre to stop your car in the most comfortable way. The air and space industry is working on developing aircraft performance models.

Data Mining freeware tools

Data Mining Software
• Weka - open-source software for data mining
• RapidMiner - an open-source system for data and text mining
• KNIME - an open-source data integration, processing, analysis, and exploration platform
• The Mahout machine learning library - for mining large data sets. It supports recommendation mining, clustering, classification, and frequent itemset mining.
• Rattle - a GUI for data mining using R

Clustering
• CLUTO - a software package for clustering low- and high-dimensional datasets
• fastcluster - fast hierarchical clustering routines for R and Python

Association Rules
• arules - an R package for mining association rules and frequent itemsets
• ARMiner - a client-server data mining application specialized in finding association rules
• A C++ Frequent Itemset Mining Template Library
• Frequent Itemset Mining Implementations Repository

Sequence Analysis
• TraMineR - an R package for mining and visualizing sequence data

Social Network Analysis
• Gephi - an interactive visualization and exploration platform for networks and complex systems, dynamic and hierarchical graphs
• Pajek - a free tool for large network analysis and visualization
• CFinder - free software for finding and visualizing overlapping dense groups of nodes in networks, based on the Clique Percolation Method (CPM)

Process Mining
• ProM - free software for process mining

Spatial Data Analysis
• GeoDa - free software for spatial data analysis
• CLAVIN - an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution

Reputed Journals in Data Mining Field
• TKDE - IEEE Transactions on Knowledge and Data Engineering
• IPL - Information Processing Letters
• VLDB - The VLDB Journal
• DATAMINE - Data Mining and Knowledge Discovery
• SIGKDD Explorations
• CS&DA - Computational Statistics & Data Analysis
• Journal of Knowledge Management
• WWW - World Wide Web
• Journal of Classification
• INFFUS - Information Fusion
• KAIS - Knowledge and Information Systems
• IDA - Intelligent Data Analysis
• JECR - Journal of Electronic Commerce Research
• Transactions on Rough Sets
• TKDD - ACM Transactions on Knowledge Discovery from Data
• IJDMB - International Journal of Data Mining and Bioinformatics
• IJDWM - International Journal of Data Warehousing and Mining
• IJBIDM - International Journal of Business Intelligence and Data Mining
• Statistical Analysis and Data Mining
• IJICT - International Journal of Information and Communication Technology
• Advances in Data Analysis and Classification
• MLDM - Transactions on Machine Learning and Data Mining
• DQ - Data Quality Journal
• TGIS - Transactions in GIS
• OIR - Online Information Review
• ISJ-GP - Information Security Journal: A Global Perspective
• IT - IT Professional

Reputed Conferences in Data Mining Field
• KDD - Knowledge Discovery and Data Mining
• ICDE - International Conference on Data Engineering
• CIKM - International Conference on Information and Knowledge Management
• ICDM - IEEE International Conference on Data Mining
• SDM - SIAM International Conference on Data Mining
• PKDD - Principles of Data Mining and Knowledge Discovery
• PAKDD - Pacific-Asia Conference on Knowledge Discovery and Data Mining
• RIAO - Recherche d'Information Assistée par Ordinateur
• DMKD - Research Issues on Data Mining and Knowledge Discovery
• DASFAA - Database Systems for Advanced Applications
• DaWaK - Data Warehousing and Knowledge Discovery
• DOLAP - International Workshop on Data Warehousing and OLAP
• DS - Discovery Science
• ICWSM - International Conference on Weblogs and Social Media
• WSDM - Web Search and Data Mining
• DMDW - Design and Management of Data Warehouses
• PJW - Workshop on Persistence and Java
• FIMI - Workshop on Frequent Itemset Mining Implementations
• GRC - IEEE International Conference on Granular Computing
• IDEAL - Intelligent Data Engineering and Automated Learning
• MLDM - Machine Learning and Data Mining in Pattern Recognition
• Fuzzy Systems and Knowledge Discovery
• ADMA - Advanced Data Mining and Applications
• KDID - International Workshop on Knowledge Discovery in Inductive Databases
• ICDM - Industrial Conference on Data Mining
• MineNet - Mining Network Data
• ESF Exploratory Workshops
• TSDM - Temporal, Spatial, and Spatio-Temporal Data Mining
• ICETET - International Conference on Emerging Trends in Engineering & Technology
• WKDD - Workshop on Knowledge Discovery and Data Mining
• DMIN - Int. Conf. on Data Mining
• CINQ - cInQ project
• Japanese Discovery Science Project
• WebMine - Workshop on Web Mining
• ASONAM - Advances in Social Network Analysis and Mining
• DW - Data Warehousing
• MLG - Mining and Learning with Graphs
• AMINING - Active Mining
• KELSI - Knowledge Exploration in Life Science Informatics
• ICDM2 - Industrial Conference on Data Mining
• Actes d'IC - Journées Francophones d'Ingénierie des Connaissances
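
The tweet classifier topic listed under Data Mining Research Topics (train a naive Bayes classifier, then predict the category of incoming tweets) can be sketched roughly as follows. This is only a minimal illustration in plain Python, not a method presented in the seminar: the class name NaiveBayesClassifier, the tokenize helper, and the six training tweets are hypothetical and invented for the example. A real project would need far more labeled data and a proper tokenizer, for instance from one of the tools named earlier.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # training texts seen per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()                  # all words seen in training

    def train(self, labeled_texts):
        """Update the counts from an iterable of (text, label) pairs."""
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        total_texts = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, text_count in self.class_counts.items():
            score = math.log(text_count / total_texts)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen during training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, invented for illustration only.
TRAINING_TWEETS = [
    ("great goal in the final match", "sports"),
    ("the team won the cricket match", "sports"),
    ("stock market closes higher today", "business"),
    ("company shares fall after earnings report", "business"),
    ("new movie trailer released today", "entertainment"),
    ("actor wins award for best movie", "entertainment"),
]

classifier = NaiveBayesClassifier()
classifier.train(TRAINING_TWEETS)
print(classifier.predict("shares and stock prices fall"))
print(classifier.predict("who won the match"))
```

Because train only adds to the stored counts, calling it again with newly labeled messages is one simple way to approximate the retraining step mentioned under spam mail detection, although nothing in the seminar prescribes this approach.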