Visualizing digital footprints of our complex life
Data mining
You don't have to be a rocket scientist to be a data scientist!
János Abonyi

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

[Figure: pyramid, top to bottom]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Papers, Files, Web documents, Scientific experiments, Database Systems

Knowledge Discovery (KDD) Process
[Figure: KDD process, from Databases through Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, and Pattern Evaluation to Knowledge]

Problem => Hypothesis
[Diagram labels: √, ?, Model Identification, Exploratory data analysis, Check the hypothesis, Generate hypothesis]

Supervised learning, Unsupervised learning
Classification
Clustering
Frequent itemset mining, Association rules
Anomaly detection
Regression
Recommender systems, collaborative filtering

Sentiment analysis
Sentiment analysis (opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

Confusion matrix (A\P: actual class \ predicted class)

A\P    C     ¬C
C      TP    FN    P
¬C     FP    TN    N
       P’    N’    All

Class Imbalance Problem:
One class may be rare, e.g. fraud, or HIV-positive
Significant majority of the negative class and minority of the positive class

Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
Sensitivity: True Positive recognition rate
Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N

Time-series mining
Clustering
Classification
Rule discovery
Content based search
Outlier detection
Motifs
[Figure: example time series A, B, C illustrating the tasks above; figure labels s = 0.5, c = 0.3]

Applications
What people think about EU?
What influences regional development?
How to measure the quality of life?
[Diagram labels: Decision, Operation, Information, Data; Predictive modeling, Early warning, Link analysis, Segmentation; Tools: Anomaly detection, Regression, Classification, Freq. itemset, Clustering]

Time demand
[Bar chart, axis from 0% to 30%: Problem analysis, Data analysis, Collection of data, Data cleaning, Data mining, Reporting, Application, Feedback]

The major challenge for data scientists: The Data-to-Knowledge (D2K) challenge
Big Data: Over 80% of our data is from text/natural language/social media, unstructured, noisy, dynamic, unreliable, …, but interconnected!
Keys from big data to big knowledge:
Structuring! Transforming unstructured text into structured, typed, interconnected entities/relationships
Networking: take advantage of massive, structured connections
Mining/reasoning effectively on massive, relatively structured, interconnected networks
D2K → D2N2K (Data to Network to Knowledge)
Construction and mining of typed, heterogeneous information networks

Teamwork – Big Data Workshop
Let’s see the details
You don't have to be a rocket scientist to be a data scientist!

Administrative datasets

EU Open Data Portal
http://www.europeandataportal.eu/
Single point of access to a growing range of data from the institutions and other bodies of the European Union (EU).
Data are free for you to use and reuse for commercial or noncommercial purposes.

Data.gov
https://www.data.gov/
The home of the U.S. Government’s open data

World Bank
http://data.worldbank.org
World Bank Open Data: free and open access to data about development in countries around the globe.

OECD Data
https://data.oecd.org/
OECD data: charts, maps, tables and related publications

OEDA
http://openeventdata.org
The prime objective of the OEDA is to provide reliable, open access, multi-sourced political event datasets that are updated at least weekly, are transparent and have documented source texts, and use one or more of the open coding ontologies supported by the organization.

EHPS
http://primary-sources.eui.eu
The purpose of EHPS is to provide an easily searchable index of scholarly digital repositories that contain primary sources for the history of Europe.

ENGAGE
http://www.engagedata.eu/opendatasites
ENGAGE is a door for researchers that leads them to the world of Open Government Data. By using the ENGAGE platform, researchers and citizens will be able to submit, acquire, search and visualize diverse, distributed and derived Public sector datasets from all the countries of the European Union.
EUROSTAT
http://ec.europa.eu/eurostat/data/database
European Statistics

http://atlas.media.mit.edu/
https://public.tableau.com
http://datausa.io

Linked open data

    PREFIX db: <http://dbpedia.org/resource/>
    PREFIX onto: <http://dbpedia.org/ontology/>
    SELECT * WHERE {
      ?s onto:birthPlace db:Kőszeg
    }

http://iwb.fluidops.com/
http://pantheon.media.mit.edu/

RSS (Rich Site Summary) uses a family of standard web feed formats to publish frequently updated information: blog entries, news headlines, audio, video.

2016_BD_Workshop\Demo\to CartoDB\Events-1
https://iask.cartodb.com/viz/3b678bfe-3c68-11e6-a3eb-0e3ff518bd15/public_map

Network of towns in Wikipedia
tables.googlelabs.com

http://analysis.gdeltproject.org/
Looking across the nearly 200 million articles from across the entire world in 65 languages monitored by GDELT in 2015, we wanted to explore geographic correlation. The map below draws a line between every pair of locations mentioned together in the same article at least 100 times across the entire 200 million article archive.
http://analysis.gdeltproject.org/module-gkg-tonetimeline.html

An application programming interface (API) is a set of routine definitions, protocols, and tools for building software and applications.
JSON (JavaScript Object Notation) is an open-standard format that uses human-readable text to transmit data objects.

OpenRefine
Extracting structured information from web forums
Nice reports …

Text mining
[Word cloud: Development, Sustainable, Country, Nature, Global, Goal, Financial, Human, World, Life, Social]

VOS viewer
C:\Users\János\Dropbox (Bigdata)\Bigdata Team Folder\Abonyi\Research\sciNET\VOS_viewer
VOS viewer - Map of Social Sciences
Since 1976, only 45 publications by Hungarian scientists have been related to "sustainable development" (according to abstracts in the Scopus database).

What is in the books?
https://books.google.com/ngrams/graph?content=migration%2CEU&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cmigration%3B%2Cc0%3B.t1%3B%2CEU%3B%2Cc0

https://www.google.hu/trends/
https://www.google.hu/trends/explore#q=big%20data%2C%20%2Fm%2F06n6p%2C%20Migration&cmpt=q&tz=Etc%2FGMT-2

Thank you …
[email protected]
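
Worked example for the classifier evaluation metrics slide: a minimal Python sketch that computes accuracy, error rate, sensitivity and specificity from the four confusion-matrix cells, using the formulas given on that slide. The class name ConfusionMatrix and the counts in the example are illustrative assumptions, not data from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Hypothetical helper holding the four cells of a binary confusion matrix."""
    tp: int  # actual C, predicted C
    fn: int  # actual C, predicted not-C
    fp: int  # actual not-C, predicted C
    tn: int  # actual not-C, predicted not-C

    @property
    def p(self) -> int:
        # P: number of actual positives (row total)
        return self.tp + self.fn

    @property
    def n(self) -> int:
        # N: number of actual negatives (row total)
        return self.fp + self.tn

    @property
    def total(self) -> int:
        # All: every test tuple
        return self.p + self.n

    @property
    def accuracy(self) -> float:
        # Accuracy = (TP + TN) / All
        return (self.tp + self.tn) / self.total

    @property
    def error_rate(self) -> float:
        # Error rate = (FP + FN) / All = 1 - accuracy
        return (self.fp + self.fn) / self.total

    @property
    def sensitivity(self) -> float:
        # Sensitivity = TP / P (undefined when there are no positives)
        return self.tp / self.p

    @property
    def specificity(self) -> float:
        # Specificity = TN / N (undefined when there are no negatives)
        return self.tn / self.n


# Made-up, strongly imbalanced counts: 100 positives among 10,000 tuples.
example = ConfusionMatrix(tp=30, fn=70, fp=100, tn=9800)

if __name__ == "__main__":
    for name in ("accuracy", "error_rate", "sensitivity", "specificity"):
        print(f"{name}: {getattr(example, name):.4f}")
```

With counts like these, accuracy is high while sensitivity is low, which is the class imbalance problem named on the slide: a rare positive class barely affects accuracy, so sensitivity and specificity have to be read alongside it.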