Government of the Russian Federation
Federal State Autonomous Educational Institution of Higher Professional
Education
National Research University "Higher School of Economics"
Faculty of Business Informatics
Discipline program
"Advanced methods of data analysis and big data in
business intelligence"
for direction 38.04.05 "Business Informatics", Master's training
Program’s author:
Nikolay V. Markov, [email protected]
Approved at the meeting of the Department of
information and business in the sphere of information technologies
Head of Department, Svetlana V. Maltseva
«____»____________ 2014
_____________________
Recommended by the EMS section of «Business Informatics» «____»____________ 2014
Chairman, Y. V. Taratukhina
____________________
Moscow, 2014
This program cannot be used by other departments of the university or by other institutions of higher
education without the permission of the department that developed the program.
1. Scope and normative references
This program of an academic discipline establishes minimum requirements for knowledge and
skills of the student and determines the content and types of studies and reports.
The program is intended for instructors teaching this discipline, teaching assistants, and students
of direction 38.04.05 "Business Informatics" (Master's training) enrolled in the master's program "Big Data
Systems".
The program is developed in accordance with:
• the working curriculum of the University for direction 080500.68 "Business Informatics" (Master's
training), master's program "Big Data Systems", approved in 2014.
2. Goals for studying

• Forming theoretical knowledge and practical skills in the collection, storage,
processing and analysis of big data.
• Developing practical skills in analyzing big data to tackle a wide range of applications,
including analysis of corporate data and of financial data from data warehouses of world markets,
modeling of data storage and processing, and prediction of complex indicators.
3. Student competences, generated as a result of studying
As a result of studying the discipline, a student should:
• Understand the theory and fundamentals of storage, processing and analysis of big data, and
advanced tools for collection, storage, transmission and visualization of big data.
• Be able to process and analyze large amounts of data using the modern software package IBM
InfoSphere.
• Have the skills to use neural networks and fuzzy models for compression, processing and
analysis of big data, and to assess their effectiveness.

As a result of mastering the discipline, the student acquires the following competences:

Competence | GEF/NRU code | Descriptors: the main features of the development (indicators of achievement results) | Forms and methods of teaching contributing to the formation and development of the competence
Ability to offer concepts, models, invent and test methods and tools of professional activity | СК-2 | Demonstrates | Lectures, workshops, homework
Ability to apply the methods of system analysis and modeling to evaluate and design | ПК-13 | Owns and uses | Lectures, workshops, homework
Ability to develop and apply mathematical models to justify the design decisions in the field of ICT | ПК-14 | Owns and uses | Lectures, workshops, homework
Ability to organize self and collective research work at the enterprise and manage it | ПК-16 | Owns and uses | Lectures, workshops, homework
4. Place of the discipline in the structure of the educational program
Within the master's program "Big Data Systems", this discipline is a compulsory subject.
To master the discipline, students should:
• Know the content of the following disciplines: numerical methods, optimization
methods, data analysis, discrete mathematics, theoretical foundations of computer
science, computer systems, networks, telecommunications, information systems
management and production company.
• Be able to use mathematical and IT tools for management tasks.
The main provisions of the discipline should be used in the further study of the discipline
"Elaboration and implementation of big data."
5. Topical plan of an academic discipline
No. | Topic name | Lectures | Seminars/Workshops | Homework
1 | Introduction to the analysis and management of big data | 2 | 2 | 7
2 | Data Management | 2 | 2 | 7
3 | Model of distributed file systems and databases computing | 4 | 4 | 7
4 | Search for similarities in the data | 2 | 2 | 8
5 | Analysis of streaming data | 2 | 2 | 8
6 | Link analysis | 2 | 2 | 8
7 | Frequent itemset analysis | 4 | 4 | 8
8 | Clustering algorithms and their applications | 4 | 4 | 8
9 | Neural networks and their applications | 4 | 4 | 8
10 | Advertising on the Web | 2 | 2 | 8
11 | Decision support system | 2 | 2 | 7
12 | Analysis of social network graphs | 4 | 4 | 8
13 | Reducing the dimension of data | 2 | 2 | 7
14 | Large scale machine learning | 2 | 2 | 7
Total (180 hours) | | 38 | 38 | 106
6. Forms of student knowledge control
Type of
control
Current
(week)
Total
(week)
1st year
Form of control
Thesis
Exam
1
1
Parameters
2
Volume 25-20 pp., result evaluation – 2
weeks
1
Oral exam, 20 min per student
6.1 Criteria for assessing knowledge and skills
The student should demonstrate knowledge of the sections of the discipline and the ability to
present the results of homework and tests in accordance with the required competencies.
All forms of monitoring are graded on a 10-point scale.
The final grade for the discipline is composed of the grades for:
• work in practical classes - O1
• control work - O2
• response to the competition - O3
according to the formula: O = 0.2*O1 + 0.4*O2 + 0.4*O3
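The grading formula above can be checked with a short calculation. The sketch below is illustrative only: the function name final_grade and the sample grades are assumptions, and the program does not say whether or how the result is rounded, so no rounding rule is applied.

```python
def final_grade(o1: float, o2: float, o3: float) -> float:
    """Weighted final grade on the 10-point scale: 0.2*O1 + 0.4*O2 + 0.4*O3."""
    for score in (o1, o2, o3):
        if not 0 <= score <= 10:
            raise ValueError("each component grade must lie in [0, 10]")
    return 0.2 * o1 + 0.4 * o2 + 0.4 * o3


# Hypothetical grades: practical classes 8, control work 6, third component 7.
grade = final_grade(8, 6, 7)
print(round(grade, 2))  # 0.2*8 + 0.4*6 + 0.4*7 = 6.8
```

Because the weights sum to 1, the final grade always stays within the same 10-point scale as its components.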
7. Program content
Topic 1. Introduction to the analysis and management of big data
What is big data? Characteristics of big data. Big data as one of the global challenges of our
time.
Data analysis: basic principles and methods. Statistical modeling and simulation based on
machine learning. Bonferroni's principle. Hash functions and indexes. The base of natural logarithms.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
1. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 2. Data Management
Foundations of data management. Stages of working with data: collection, compression, storage,
processing, analysis. Principles of data storage and management. Data compression methods.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 3. Model of distributed file systems and databases computing
Distributed file systems. Physical organization of computing nodes. The MapReduce approach: Map tasks
and Reduce tasks. Algorithms using MapReduce and their applications: matrix-vector multiplication,
relational-algebra operations on databases, grouping and aggregation.
Extensions to MapReduce. Workflow systems. Communication cost models. Complexity theory
for MapReduce: dimension reduction and graph models.
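As an illustration of the matrix-vector multiplication mentioned in this topic, here is a minimal single-process sketch of the Map and Reduce steps. It assumes the vector fits in memory and the matrix is stored as (row, column, value) triples; the function names are hypothetical and no real MapReduce framework is used.

```python
from collections import defaultdict


def map_task(entry, v):
    """Map: for matrix entry (i, j, m_ij) emit the pair (i, m_ij * v[j])."""
    i, j, m_ij = entry
    return (i, m_ij * v[j])


def reduce_task(key, values):
    """Reduce: sum all partial products that share the row index `key`."""
    return (key, sum(values))


def mapreduce_matvec(entries, v):
    # Shuffle phase: group the mapped values by key (row index).
    groups = defaultdict(list)
    for entry in entries:
        key, value = map_task(entry, v)
        groups[key].append(value)
    return dict(reduce_task(k, vals) for k, vals in groups.items())


# Sparse 2x3 matrix [[1, 0, 2], [0, 3, 0]] stored as (row, column, value) triples.
M = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
v = [1.0, 2.0, 3.0]
result = mapreduce_matvec(M, v)
print(result)  # {0: 7.0, 1: 6.0}
```

In a distributed setting the grouping by key is done by the framework between the Map and Reduce tasks; here it is simulated with a dictionary.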
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 4. Search for similarities in the data
Applications of near-neighbor search. Jaccard similarity of sets. Similarity of
documents. Collaborative filtering.
Shingling of documents. k-shingles: choosing the shingle size, hashing shingles, shingles
built from words.
Locality-sensitive hashing (LSH). Distance measures: Euclidean distance, Jaccard distance, cosine
distance. Locality-sensitive functions. LSH families and their applications.
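A small sketch may help connect document splitting with Jaccard similarity. It assumes character-level k-shingles and tiny strings chosen for illustration; the helper names are hypothetical.

```python
def shingles(text: str, k: int) -> set:
    """Return the set of all k-character substrings (k-shingles) of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; taken as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


s1 = shingles("abcd", 2)  # {"ab", "bc", "cd"}
s2 = shingles("bcde", 2)  # {"bc", "cd", "de"}
similarity = jaccard(s1, s2)
print(similarity, 1 - similarity)  # similarity 0.5, Jaccard distance 0.5
```

The Jaccard distance is one minus the similarity, which is why both values are printed.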
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 5. Analysis of streaming data.
The stream data model. Data-stream management systems. Sampling data in a stream.
Filtering streams. The Flajolet-Martin algorithm. The Alon-Matias-Szegedy algorithm. The Datar-Gionis-Indyk-Motwani (DGIM) algorithm.
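The idea behind the Flajolet-Martin estimate of the number of distinct stream elements can be sketched as follows. This is a simplified illustration with a single hash function (SHA-256 truncated to 64 bits, chosen here only for determinism); practical use combines many hash functions to reduce the variance of the estimate.

```python
import hashlib


def trailing_zeros(n: int) -> int:
    """Number of trailing zero bits in n (0 is treated as having none here)."""
    if n == 0:
        return 0
    return (n & -n).bit_length() - 1


def flajolet_martin(stream) -> int:
    """Estimate the number of distinct elements as 2**R, where R is the
    largest number of trailing zeros seen in any element's hash value."""
    max_zeros = 0
    for item in stream:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros


estimate = flajolet_martin(["a", "b", "c", "a", "b"])
print(estimate)
```

Repeated elements hash to the same value, so they never change the estimate; only the set of distinct elements matters.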
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 6. Link analysis.
The PageRank algorithm. Early search engines and the structure of the Web. The transition matrix. PageRank
iteration using MapReduce.
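The PageRank iteration can be illustrated on a toy graph. The sketch below is an assumption-laden example, not a prescribed implementation: it uses a damping factor beta = 0.85, a hypothetical three-page web, and assumes every page has at least one outgoing link.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Power iteration v <- beta * M v + (1 - beta) / n on a link graph.

    `links` maps every page to the list of pages it links to; this sketch
    assumes every page has at least one out-link (no dead ends).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for page, targets in links.items():
            # Each page splits its damped rank equally among its out-links.
            share = beta * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
print(ranks)
```

In this example C receives links from both A and B and ends up with the highest rank, followed by A and then B.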
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 7. Frequent itemset analysis
Definition and applications of frequent itemsets. The market-basket model. Association
rules. The A-Priori algorithm. Monotonicity of itemsets. Storing big data in memory. The Park-Chen-Yu (PCY) algorithm.
The multistage algorithm. The multihash algorithm. Limited-pass algorithms. The Savasere-Omiecinski-Navathe (SON) algorithm. The Toivonen algorithm. Counting frequent itemsets: sampling, hybrid methods.
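The two-pass idea of the A-Priori algorithm for frequent pairs can be sketched as follows. The baskets and the support threshold are made-up illustration data, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, support):
    """Two-pass A-Priori: find frequent items, then count only candidate
    pairs whose two items are both frequent (monotonicity)."""
    # Pass 1: count individual items.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    # Pass 2: count pairs made up of frequent items only.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= support}


baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk", "eggs"],
]
frequent_pairs = apriori_pairs(baskets, support=2)
print(frequent_pairs)  # {('bread', 'milk'): 3, ('butter', 'milk'): 2}
```

The item "eggs" appears only once, so no pair containing it is ever counted in the second pass; this pruning is what saves memory on large data.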
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 8. Clustering algorithms and their applications
Introduction to clustering algorithms: points, spaces, and distances. Clustering strategies. Hierarchical
clustering in Euclidean and non-Euclidean spaces, its efficiency. The k-means algorithm. The Bradley-Fayyad-Reina (BFR) algorithm. The CURE algorithm. Cluster trees. The GRGPF algorithm. Application of
clustering algorithms in streaming and parallel computing.
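The basic k-means iteration can be sketched on a few two-dimensional points. The points and the initial centroids below are invented for illustration; real use needs a strategy for choosing the initial centroids and the number of clusters.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's iteration) on 2-D points with given initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

With these well-separated points the assignment stabilizes after the first iteration: the first three points form one cluster and the last three form the other.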
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 9. Neural networks and their applications
Definition, structure and typology of neural networks. Kohonen maps. Backpropagation neural
networks. Application of neural networks in economics, logistics, and the IT sphere.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
5. Heaton J. Introduction to the Math of Neural Networks. Heaton Research, 2010.
Topic 10. Advertising on the Web
Online and offline algorithms. Greedy algorithms. Competitive ratio. The matching
problem. The Balance algorithm and the generalized Balance algorithm.
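The greedy online approach to the bipartite matching problem can be sketched as follows. The edge list is a made-up instance; edges are assumed to arrive one at a time and a decision on each is final.

```python
def greedy_online_matching(edges):
    """Process edges in arrival order and keep an edge whenever neither of
    its endpoints is already matched (the greedy online rule)."""
    matched_left, matched_right, matching = set(), set(), []
    for left, right in edges:
        if left not in matched_left and right not in matched_right:
            matching.append((left, right))
            matched_left.add(left)
            matched_right.add(right)
    return matching


# Hypothetical arrival order of edges between nodes 1..4 and nodes a..d.
edges = [(1, "a"), (1, "c"), (2, "b"), (3, "b"), (3, "d"), (4, "a")]
matching = greedy_online_matching(edges)
print(matching)  # [(1, 'a'), (2, 'b'), (3, 'd')]
```

Here the greedy rule finds 3 pairs, while an offline algorithm that sees all edges could match all four left nodes: (1, c), (2, b), (3, d), (4, a). The gap between the two is what the competitive ratio measures.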
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 11. Decision support system
Model of a decision support system. The utility matrix. Making decisions based on the content of the
data. Identification of the properties and parameters of the data. Collaborative filtering. Measuring
similarity. Dimensionality reduction. UV-decomposition. Standard deviation.
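To make UV-decomposition more concrete, the sketch below measures how well a product U*V approximates the known entries of a utility matrix M. It assumes root-mean-square error as the measure of fit, which is an assumption of this example, and the matrices are made-up values.

```python
import math


def rmse(M, U, V):
    """Root-mean-square error between the known entries of M and the product U*V.
    Unknown (blank) entries of M are represented by None and are skipped."""
    total, count = 0.0, 0
    for i, row in enumerate(M):
        for j, m_ij in enumerate(row):
            if m_ij is None:
                continue
            prediction = sum(U[i][k] * V[k][j] for k in range(len(V)))
            total += (m_ij - prediction) ** 2
            count += 1
    return math.sqrt(total / count)


# Hypothetical 2x3 utility matrix with two unknown ratings.
M = [[5, None, 3],
     [4, 2, None]]
U = [[1.0, 1.0], [1.0, 1.0]]            # 2 x 2, all ones as a starting point
V = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # 2 x 3, all ones as a starting point
error = rmse(M, U, V)
print(error)  # sqrt((9 + 1 + 4 + 0) / 4) = sqrt(3.5)
```

A UV-decomposition procedure would then adjust the entries of U and V step by step to reduce this error, and use the product U*V to fill in the unknown ratings.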
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
Topic 12. Analysis of social network graphs
What is a social network? Social networks as graphs. Types of social networks. Clustering of
social network graphs, distance in graphs. Girvan-Newman algorithm. Bipartite graphs and subgraphs.
Maximum likelihood algorithm.
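Distance in a social network graph is usually taken as the number of edges on a shortest path. A minimal breadth-first-search sketch, on a made-up friendship graph with hypothetical names, is:

```python
from collections import deque


def graph_distance(adjacency, source, target):
    """Length of a shortest path (number of edges) between two nodes of an
    unweighted graph, found by breadth-first search; None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


friends = {
    "Ann": ["Bob", "Cid"],
    "Bob": ["Ann", "Dan"],
    "Cid": ["Ann"],
    "Dan": ["Bob", "Eve"],
    "Eve": ["Dan"],
    "Zed": [],
}
print(graph_distance(friends, "Ann", "Eve"))  # 3 (Ann - Bob - Dan - Eve)
print(graph_distance(friends, "Ann", "Zed"))  # None (no connecting path)
```

Shortest paths found this way are also the building block for the edge betweenness used by the Girvan-Newman algorithm.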
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
8. Literature
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data.
Analytics for Enterprise Class Hadoop and Streaming Data
5. Heaton J. Introduction to the Math of Neural Networks. Heaton Research, 2010.
9. Knowledge control questions
1. Big data. The problem of big data today.
2. Data management. Methods of data collection and data preparation. Principles of
storage and data management.
3. Models of distributed file systems. The Google File System and Hadoop.
4. MapReduce. The paradigm and the essence of its structure.
5. Search for similarities. Jaccard similarity. Shingling. Locality-sensitive hashing (LSH).
6. The stream data model. The Flajolet-Martin algorithm. The Alon-Matias-Szegedy algorithm.
The Datar-Gionis-Indyk-Motwani (DGIM) algorithm.
7. Link analysis. PageRank.
8. Definition and applications of frequent itemsets. The market-basket model. Association
rules. The A-Priori algorithm.
9. Definition and applications of frequent itemsets. Monotonicity of itemsets. Storing big data in
memory. The Park-Chen-Yu (PCY) algorithm. The multistage algorithm.
10. Definition and applications of frequent itemsets. The multistage algorithm. The multihash
algorithm. Limited-pass algorithms. The Savasere-Omiecinski-Navathe (SON) algorithm.
The Toivonen algorithm. Counting frequent itemsets: sampling, hybrid methods.
11. Clustering algorithms. Clustering strategies. Hierarchical clustering in Euclidean and
non-Euclidean spaces, its efficiency. The k-means algorithm.
12. The Bradley-Fayyad-Reina (BFR) algorithm. The CURE algorithm. Cluster trees. The GRGPF
algorithm. Application of clustering algorithms in streaming and parallel computing.
13. Definition, structure and typology of neural networks. Kohonen maps. Backpropagation
neural networks. Application of neural networks in economics, logistics, and the IT sphere.
14. Online and offline algorithms. Greedy algorithms. Competitive ratio. The matching
problem. The Balance algorithm and the generalized Balance algorithm.
15. The model of a decision support system. The utility matrix. Collaborative filtering.
Measuring similarity. UV-decomposition. Standard deviation.
16. What is a social network? Social networks as graphs. Types of social networks.
Clustering of social network graphs, distance in graphs. The Girvan-Newman algorithm.
The maximum likelihood algorithm.
Developers:
NRU-HSE________ _______professor________ _____Nikolay V. Markov
(workplace)
(position)
(initials, surname)