Data Mining and Text Analytics in Music
Audi Sugianto and Nicholas Tawonezvi
Overview




Introduction
Building a ground truth set
Experiments
Results
Introduction






Purpose: Music mood classification through lyric text mining approaches
MIR (Music Information Retrieval)
Use of Audio Datasets:
- AMC (Audio Mood Classification)
- USPOP, USCRAP, etc.
Use of Social tags from last.fm
Challenges:
Natural subjectivity of music
Human perspectives on mood
Generating Ground Truth
Data Collection




Combination of in-house and public audio tracks
Collect songs with at least one social tag from last.fm
Lyrics are gathered mainly from Lyricwiki.org.

Use of Lingua::Ident, a statistical language identifier, to ensure data quality
Finalise songs that have both correct lyrics and tags
Generating Ground Truth
Algorithms, Resources and Techniques




WordNet-Affect
- Used to filter out junk tags
- Assignment of labels to concepts (emotions, moods, responses)
Use of human expertise to identify mood-related words in the music domain
- Affective Aspect
- Judgemental Tags
- Ambiguous Meanings
Use of WordNet to categorise into groups based on synonyms.
Use of music experts to merge groups by musical similarity
Generating Ground Truth
Selecting Songs



Approaches:
Tag identification
Lyric counts
Multi-label Classification
Mood Categories and Song Distributions
Experiments
Evaluation Measures and Classifiers

Use of 10-fold Cross Validation




Break data into 10 sets of size n/10.
Train on 9 sets and test on the remaining 1.
Repeat 10 times and take the mean accuracy.
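The three steps above can be sketched in plain Python. This is only an illustration: the names `k_fold_indices`, `cross_validate` and `majority_trainer` are hypothetical, and the majority-label classifier is a stand-in so that the example runs without any external library. It is not the classifier used in the experiments.

```python
import random
from statistics import mean

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(samples, labels, train_fn, k=10, seed=0):
    """Return the mean accuracy over k train/test rounds (assumes n >= k)."""
    folds = k_fold_indices(len(samples), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predict = train_fn([samples[j] for j in train_idx],
                           [labels[j] for j in train_idx])
        correct = sum(predict(samples[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return mean(accuracies)

def majority_trainer(train_x, train_y):
    """Hypothetical stand-in: always predicts the most common training label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda x: most_common
```

Any function that takes training samples and labels and returns a prediction function can be passed as `train_fn`, so an SVM trainer could be substituted for the stand-in.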
Classification with Support Vector Machines (SVM)

Algorithms to analyse data and recognise patterns
Experiments
Lyric Preprocessing


Facts:
Repetitions of words and sections:
- Lack of verbatim transcripts
Consisting of sections:
- Intro, interlude, verse, etc. in the annotations
- Notes about song and instrumentation
Possible solution:

Identifying and converting repetition and annotation patterns to actual repeated segments
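A minimal sketch of this kind of conversion is given below. The annotation conventions are assumptions made for illustration only: a bracketed label such as `[Chorus]` names a section, a bare label with no lines under it refers back to that section, and a trailing `(x2)` repeats a line. Real lyric transcripts vary, and the slides do not specify the patterns that were actually handled.

```python
import re

# Assumed annotation patterns (hypothetical, for illustration only).
SECTION = re.compile(r"^\[(?P<name>[^\]]+)\]$")
REPEAT = re.compile(r"^(?P<text>.*?)\s*\(x\s*(?P<n>\d+)\)\s*$", re.IGNORECASE)

def expand_line(line):
    """Turn 'La la la (x2)' into two copies of 'La la la'."""
    m = REPEAT.match(line)
    if m:
        return [m.group("text")] * int(m.group("n"))
    return [line]

def expand_lyrics(raw):
    """Replace section labels and repeat markers with the repeated text."""
    sections = {}   # section name -> expanded lines first seen under it
    output = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        label = SECTION.match(lines[0])
        body_src = lines[1:] if label else lines
        body = [out for ln in body_src for out in expand_line(ln)]
        if label:
            name = label.group("name").lower()
            if body:
                sections[name] = body          # remember the section text
            else:
                body = sections.get(name, [])  # bare label: reuse stored text
        output.extend(body)
    return output
```

The labels themselves are dropped from the output, so only sung text remains for feature extraction.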
Experiments
Lyrics Features


Common text classification features:
- Bag-of-words (BOW): collection of unordered words
- Part-of-Speech (POS): use of Stanford Tagger
- Function Words (the, a, etc.)
Assigning of values:
- Frequency
- Tf-idf weight
- Normalised frequency
- Boolean value
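The four value assignments can be illustrated on a bag-of-words representation as follows. This is a sketch under assumptions: the function names are hypothetical, and the tf-idf variant shown (raw count times log(N/df)) is one common definition, since the slides do not say which variant was used.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Raw count of each word in one lyric."""
    return dict(Counter(tokens))

def normalised_frequency(tokens):
    """Count of each word divided by the lyric length."""
    total = len(tokens)
    return {t: c / total for t, c in Counter(tokens).items()}

def boolean_value(tokens):
    """1 for every word that occurs at least once."""
    return {t: 1 for t in set(tokens)}

def tf_idf(docs):
    """Assumed variant: count * log(N / document frequency), per lyric."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]
```

For example, with two toy lyrics `["love", "love", "you"]` and `["cry", "you"]`, the word "you" occurs in both and so receives a tf-idf weight of zero, while "love" keeps a positive weight.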
Experiments
Stemming


Stemming – Merging words with the same morphological roots
Snowball Stemmer
- Supplemented with lists of irregular nouns and verbs as inputs
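One way to combine a stemmer with irregular-word lists is a lookup that runs before stemming, sketched below. The table `IRREGULAR` holds only a few hand-picked entries, and `toy_stem` is a deliberately crude stand-in so the example runs without dependencies. Neither reproduces the Snowball algorithm or the word lists used in the experiments.

```python
# Hypothetical sample of irregular forms mapped to their base forms.
IRREGULAR = {
    "children": "child",
    "men": "man",
    "feet": "foot",
    "went": "go",
    "gone": "go",
}

def toy_stem(word):
    """Crude suffix stripper used only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalise(word, stem, irregular=IRREGULAR):
    """Look up irregular forms first, then fall back to the stemmer."""
    w = word.lower()
    if w in irregular:
        return irregular[w]
    return stem(w)
```

If NLTK is installed, `SnowballStemmer("english").stem` from `nltk.stem.snowball` can be passed as the `stem` argument in place of `toy_stem`.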
Results


Text categorisation provides dimensionality and good generalisability
POS Boolean representation is poorer because of the high content of POS types in lyrics
Content words are more useful in mood classification
10th International Society for Music Information Retrieval Conference (ISMIR 2009)
Acknowledgement
Hu, X. et al. 2009. Lyric Text Mining in Music Mood Classification. International Music
Information Retrieval Systems Evaluation Laboratory, University of Illinois at Urbana-Champaign.
[Online]. pp.411-416. [Accessed 6 December 2013]. Available from:
http://ismir2009.ismir.net/proceedings/PS3-4.pdf
Training and Testing Data Sets. 2013. Training and Testing Data Sets. [Online].
[Accessed 5 December 2013]. Available from:
http://technet.microsoft.com/en-us/library/bb895173.aspx.
Kohavi, Ron (1995). A study of cross-validation and bootstrap for accuracy estimation and
model selection. Proceedings of the Fourteenth International Joint Conference on
Artificial Intelligence 2 (12): 1137–1143. (Morgan Kaufmann, San Mateo, CA)
D. Ellis, A. Berenzweig, and B. Whitman: The USPOP2002 Pop Music Data Set. Available
from: http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html.
Software & Additional Resources
http://www.music-ir.org/mirex/2007/index.php/AMC
http://en.wikipedia.org/wiki/MoodLogic
http://search.cpan.org/search%3fmodule=Lingua::Ident – Statistical language identifier
http://snowball.tartarus.org/
http://www.englishpage.com/irregularverbs/irregularverbs.htm - irregular verb list
http://www.esldesk.com/eslquizzes/irregular-nouns/irregular-nouns.htm - irregular noun list
http://nlp.stanford.edu/software/tagger.shtml - POS Tagger
http://www.music-ir.org/mirex/2007/abs/AI_CC_GC_MC_AS_tzanetakis.pdf
http://www.music-ir.org/archive/figs/18moodcat.htm - Mood Categories & Song Distributions
http://www.originlab.com/index.aspx?go=Products/Origin/Statistics/Nonparametric Tests&pid=1087 – Performance identifier