Seminar on Optimized Research Techniques in Data Mining
R. S. Somasundaram
Associate Professor
Department of Computer Applications
SNS College of Engineering
To do a good Ph.D.
• 10% talent and 90% how hard you work
• A good professor to supervise you
• A good research team
• Collaborate on the projects of other students
• Choose a good research topic
• Get more feedback on your research project
• Get more funding for your research
• Get other opportunities such as reviewing articles for conferences, participating in committees, etc.
• Work on a significant problem and bring an original solution
• Plan well
• Set some goals and some deadlines for achieving these goals
• Make a plan of where you want to submit your papers and when
• Teach courses relevant to your research field during your Ph.D.
How to choose a Supervisor?
• Reputation of the university
• It is important to choose a professor who is working in a field or on a project that you like.
• Is he or she active in research?
• Funding (from various agencies)
• Time (to discuss your research with you)
• Good opportunities (e.g. participating in a book or in a conference committee)
• Get a co-advisor
• Research environment
• Does the potential supervisor have a good research team?
• Will there be opportunities to collaborate with other students and publish joint papers?
• Does the professor have a good social network with other professors in your field of interest?
• You should also consider the location of the university. Will you need to travel to another city? How much will it cost? Will you have to go far away from home?
• If you want to work on fundamental algorithms for data mining, you may want to find a Ph.D. supervisor who specializes in data mining.
How to choose a Topic?
• Have a passion for it
• You should have basic knowledge of the area
• Improve your mathematical skills
• Ability to convert a problem into a mathematical model
• Ability, or support, to solve the mathematical problem
• Decide which problems you want to solve or what you want to improve
• Decide whether you want to design or improve data mining techniques, apply data mining techniques, or do both
• You need to be aware that improving data mining techniques may require better algorithmic and/or mathematical skills.
• What kind of techniques do you want to apply, design, or improve?
How to choose a Conference/Journal?
• Does the conference have a good reputation?
• Who publishes the conference proceedings?
• The best conferences are sometimes very selective
• How difficult is it to get a paper accepted at a given conference?
• Does my paper have a good chance of getting accepted?
• The acceptance rate does not always reflect very well the difficulty of having a paper accepted.
• Location of the conference
• The registration fee
• Opportunity to meet researchers who could be interested in your research
• Deadline of the conference and the review time
• The format
• Topics covered in the conference (should be specific)
Data Mining Research Area
• Traditional tasks: preprocessing, classification, clustering, association rules, sequential patterns
• Pattern mining (sequential pattern mining, itemset mining, etc.)
• Social network analysis/social media analytics
• Sentiment analysis or social opinion analysis
• Data analysis patterns for software engineering
• Data mining for signal processing
• Video mining
• Data engineering for the cloud
• Medical science
• Finance analysis (stock prediction, customer behavior prediction, etc.)
• Text mining/stream mining/web mining
• Data mining on human driving behavior data such as eco-driving, disaster evacuation, traffic congestion, etc.
• Outlier detection
• Distributed data clustering
• Dynamic query-result clustering
• Unsupervised feature selection
• For large-scale data/big data
• For linked data
Data Mining Research Area
• Text summarization - As the problem of information overload has grown and the quantity of data has increased, so has interest in automatic summarization. Many news-oriented applications rely on text summarization.
• Title recommendation, topic modeling - Predict the title for articles, websites, etc. This requires a learning-based system built on classification algorithms. In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.
• Semantic correction system - A little complex but interesting. Retrieved text generally contains semantic errors, which lead to wrong results. Applying semantic correction as a preprocessing step leads to better outcomes.
• Syntactic correction system - Much needed nowadays. Non-native English speakers make many syntactic errors. It can also be used as a preprocessing step in many projects. Your algorithm should automatically detect such errors and suggest the correct grammar.
• Search engine for Wikipedia - Wikipedia data is available as a dump file. Check DBpedia for reference. Apply indexing techniques and build a small search engine for wiki pages. Wikipedia already provides this functionality, but you can work on a better user experience and result optimization.
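As a rough illustration of the "indexing techniques" mentioned for the Wikipedia search engine, the sketch below builds a tiny inverted index and answers boolean AND queries. It is only one possible starting point: the `pages` dictionary is hypothetical toy data standing in for parsed dump entries, and the function names `tokenize`, `build_index`, and `search` are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page titles that contain it."""
    index = defaultdict(set)
    for title, body in pages.items():
        for token in tokenize(title + " " + body):
            index[token].add(title)
    return index


def search(index, query):
    """Return the titles that contain every query token, sorted by title."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical toy pages, not real Wikipedia content.
pages = {
    "Data mining": "Data mining extracts patterns from large data sets.",
    "Cluster analysis": "Clustering groups similar objects in data mining.",
    "Text mining": "Text mining derives information from text.",
}
index = build_index(pages)
print(search(index, "data mining"))
```

A real project would add ranking (for example TF-IDF scores) on top of this lookup, which is where the "result optimization" work would begin.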
Data Mining Research Topics
• Twitter tweets classifier - Fairly easy and interesting too. Create a learning system for categories such as sports, entertainment, business, politics, Hollywood, etc. Train the classifier (naive Bayes, SVM) and predict the category of incoming tweets.
• Sentiment analysis for Twitter, reviews, conversations - A few packages available in R can help with this job. You need to add a few additional features on top of them to make the results more intuitive. NLTK, Stanford NLP, and word2vec are also good open-source tools for the same purpose.
• Spam mail detection - Again a learning-based classification system. Train the classifier on mail that users have already marked as spam so that it can classify new incoming mail. If users mark a new mail as spam, retrain (there may be a better option).
• Sarcasm detection - This can be a very interesting one. In sentiment analysis we identify a user's sentiment about something; here we identify sarcasm expressed by users.
• Classifying fake users, classifying insincere posts - Mail service providers such as Gmail and Yahoo work hard to keep their users away from spam mail and spam users. Administrators of online discussion forums are also keen to automatically delete spam, fake, and irrelevant posts.
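The tweet classifier and the spam detector above both follow the same "train, then predict" pattern. The sketch below shows that pattern with a small multinomial naive Bayes classifier written from scratch, assuming add-one smoothing and a simple regex tokenizer. The class name `NaiveBayes`, its methods, and the training sentences are all hypothetical and only illustrate the idea; a real project would more likely use an existing library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()                       # all words seen in training

    def train(self, text, label):
        """Update the counts with one labelled example."""
        self.class_docs[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def predict(self, text):
        """Return the category with the highest log-probability score."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training tweets, invented for this example.
clf = NaiveBayes()
clf.train("great goal in the football match tonight", "sports")
clf.train("the team won the cricket final", "sports")
clf.train("stock market closes higher after earnings", "business")
clf.train("company shares fall as profits drop", "business")

print(clf.predict("who won the match"))
print(clf.predict("shares drop on the stock market"))
```

Because `train` only updates counts, calling it again when a user marks a new mail as spam is one simple way to "retrain" without starting over.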
Data Mining Research Topics
• Fraud detection - Some users on social media intentionally create hype about particular products or stocks to push them up. Identifying such fraudulent users and activity is a challenging task.
• Market analysis - Coca-Cola continually hires third-party companies to process data related to it from Twitter and Facebook. Such companies launch creative campaigns and want to constantly monitor whether a campaign is being accepted by the audience. Many companies try to find the flaws in their processes by understanding what their users and customers are saying about their products or services. Analysts are automating their work by building tools that read the news and try to predict the market situation for the next day. Sentiment analysis is still one of the hottest applications (and yours truly has been engaged in research on sentiment analysis for two years). You can read about risk analysis and predictive analysis to learn about the latest focus and advancements in these areas.
• Robotics - Robots are no longer simply pre-programmed toys. They try to learn how to do their work from previous experience. From genetic algorithms to reinforcement learning, many areas of computer science are trying to solve these problems from multiple perspectives. We would love to sit in a car that drives itself if it proves that it can think on the fly. We want missiles to hit the target despite being in an unknown land with a totally different climate and unexpectedly high wind speeds.
Data Mining Research Topics
• Manufacturing, automotive, aviation - The focus is on improving manufacturing processes to optimize time and material and to ensure high-quality production on the assembly line. This extends beyond the factory and onto the road, where modern braking systems know how much pressure should be applied to each tyre to stop your car in the most comfortable way. The air and space industry is working on developing aircraft performance models.
Data Mining freeware tools
Data Mining Software
Weka - an open-source software for data mining
RapidMiner - an open-source system for data and text mining
KNIME - an open-source data integration, processing, analysis, and exploration platform
The Mahout machine learning library - mining large data sets. It supports recommendation mining, clustering, classification and frequent itemset mining.
Rattle - a GUI for data mining using R
Clustering
CLUTO - a software package for clustering low- and high-dimensional datasets
fastcluster - fast hierarchical clustering routines for R and Python
Association Rules
arules - an R package for mining association rules and frequent itemsets
ARMiner - a client-server data mining application specialized in finding association rules
A C++ Frequent Itemset Mining Template Library
Frequent Itemset Mining Implementations Repository
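To make concrete what the association rule tools above compute, here is a minimal sketch of level-wise frequent itemset mining in the style of Apriori. It is not the algorithm of any particular tool listed here; the function name `frequent_itemsets`, the `min_support` parameter (an absolute transaction count), and the shopping baskets are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        items = sorted(set().union(*current))
        # Keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = [
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        ]
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Association rules such as "bread implies milk" are then derived from these counts by comparing the support of an itemset with the support of its subsets.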
Sequence Analysis
TraMineR - an R package for mining and visualizing sequence data
Social Network Analysis
Gephi - an interactive visualization and exploration platform for networks and complex systems, dynamic and hierarchical graphs
Pajek - a free tool for large network analysis and visualization
CFinder - a free software for finding and visualizing overlapping dense groups of nodes in networks, based on the Clique Percolation Method (CPM)
Process Mining
ProM - a free software for process mining
Spatial Data Analysis
GeoDa - a free software for spatial data analysis
CLAVIN - an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution
Reputed Journals in Data Mining Field
• TKDE - IEEE Transactions on Knowledge and Data Engineering
• IPL - Information Processing Letters
• VLDB - The VLDB Journal
• DATAMINE - Data Mining and Knowledge Discovery
• SIGKDD Explorations
• CS&DA - Computational Statistics & Data Analysis
• Journal of Knowledge Management
• WWW - World Wide Web
• Journal of Classification
• INFFUS - Information Fusion
• KAIS - Knowledge and Information Systems
• IDA - Intelligent Data Analysis
• JECR - Journal of Electronic Commerce Research
• Transactions on Rough Sets
• TKDD - ACM Transactions on Knowledge Discovery from Data
• IJDMB - International Journal of Data Mining and Bioinformatics
• IJDWM - International Journal of Data Warehousing and Mining
• IJBIDM - International Journal of Business Intelligence and Data Mining
• Statistical Analysis and Data Mining
• IJICT - International Journal of Information and Communication Technology
• Advances in Data Analysis and Classification
• MLDM - Transactions on Machine Learning and Data Mining
• DQ - Data Quality Journal
• TGIS - Transactions in GIS
• OIR - Online Information Review
• ISJ-GP - Information Security Journal: A Global Perspective
• IT - IT Professional
Reputed Conferences in Data Mining Field
• KDD - Knowledge Discovery and Data Mining
• ICDE - International Conference on Data Engineering
• CIKM - International Conference on Information and Knowledge Management
• ICDM - IEEE International Conference on Data Mining
• SDM - SIAM International Conference on Data Mining
• PKDD - Principles of Data Mining and Knowledge Discovery
• PAKDD - Pacific-Asia Conference on Knowledge Discovery and Data Mining
• RIAO - Recherche d'Information Assistée par Ordinateur
• DMKD - Research Issues on Data Mining and Knowledge Discovery
• DASFAA - Database Systems for Advanced Applications
• DaWaK - Data Warehousing and Knowledge Discovery
• DOLAP - International Workshop on Data Warehousing and OLAP
• DS - Discovery Science
• ICWSM - International Conference on Weblogs and Social Media
• WSDM - Web Search and Data Mining
• DMDW - Design and Management of Data Warehouses
• PJW - Workshop on Persistence and Java
• FIMI - Workshop on Frequent Itemset Mining Implementations
• GRC - IEEE International Conference on Granular Computing
• IDEAL - Intelligent Data Engineering and Automated Learning
• MLDM - Machine Learning and Data Mining in Pattern Recognition
• Fuzzy Systems and Knowledge Discovery
• ADMA - Advanced Data Mining and Applications
• KDID - International Workshop on Knowledge Discovery in Inductive Databases
• ICDM - Industrial Conference on Data Mining
• MineNet - Mining Network Data
• ESF Exploratory Workshops
• TSDM - Temporal, Spatial, and Spatio-Temporal Data Mining
• ICETET - International Conference on Emerging Trends in Engineering & Technology
• WKDD - Workshop on Knowledge Discovery and Data Mining
• DMIN - Int. Conf. on Data Mining
• CINQ - cInQ project
• Japanese Discovery Science Project
• WebMine - Workshop on Web Mining
• ASONAM - Advances in Social Network Analysis and Mining
• DW - Data Warehousing
• MLG - Mining and Learning with Graphs
• AMINING - Active Mining
• KELSI - Knowledge Exploration in Life Science Informatics
• ICDM2 - Industrial Conference on Data Mining
• Actes d'IC - Journées Francophones d'Ingénierie des Connaissances