ICICS-PCM 2003
15-18 December 2003
Singapore
3B1.7
Music synthesis for home videos: An analogy based approach
Meera Nayak
Dept. of Computer Science
National Univ. of Singapore
Singapore
[email protected]

S H Srinivasan
Applied Research Group
Satyam Computer Services
Bangalore
[email protected]

Mohan S. Kankanhalli
Dept. of Computer Science
National Univ. of Singapore
Singapore
[email protected]
Abstract
There have been efforts in recent years to make home
videos more interesting and pleasing to viewers by mixing
them with music. Most existing software lets the user add
music of their own preference and assumes that the user has
sufficient knowledge of aesthetic mixing principles. In our
research, we propose a way of adding audio to video by
synthesizing appropriate music based on the video content.
This is a step toward semi-automatic music mixing that does
not totally exclude the user. We have developed a system
that takes in music examples selected by the user and
generates new music by applying the aesthetic rules of
audio-video mapping. The paper concentrates on pitch
generation for the synthesized music through contour-based
pitch matching, using string matching techniques. The
system helps the user understand what music would suit a
particular video and assists in choosing a matching piece of
music.
Keywords
Music synthesis, media aesthetics, string matching, music
analogies.
1. Introduction
The ease of use of digital camcorders has enabled many an
amateur to direct his or her own home video. These videos
are not professionally made and lack the aesthetic appeal
that movies have. Music can be used to improve the appeal
of such home videos, and automatic audio-video mixing is
one way to address this problem. In our earlier work [7],
certain features of the video and audio were extracted and,
based on the matching criteria presented in [5], the best
clip for the audio was chosen. Instead of mixing music by
feature extraction and subsequent matching, another
approach is to synthesize music by 'listening' to the
meaning inherent in the video, using the underlying
principles of computational media aesthetics [2] to generate
customized music for every video clip. This paper is
organized as follows. Section 2 briefly summarizes related
work on music synthesis. Section 3 covers the motivation
behind the work, the basics of music theory, and the system
architecture. Section 4 presents the analogy-based
composition algorithm, and Sections 5 and 6 report
experimental results and conclusions.
2. Related Work
Artificial intelligence techniques have often been used to
learn the expressive interpretation of music pieces, which
involves learning musical parameters such as dynamics
(variations in loudness) and rubato (variations of local
tempo) [4]. Many computational intelligence techniques,
among them neural networks, genetic algorithms, and genetic
programming, have been used to solve musical problems such
as music cognition, algorithmic composition, and sound
synthesis [1]. Genetic algorithms (GAs) have mainly been
used for compositional and synthesis tasks. These
algorithmic composers operate on musical knowledge such as
pitch, rhythm, and meter, and on a rule representation that
contains the set of rules determining how the composition
evolves [1]. GAs are especially useful for generating
improvisations and producing variations of existing music,
but the disadvantage of this method is that the rules must
be properly represented as constraints, since they prune
the search space. Neural networks have been used to
generate musical segments but are limited by the large
amount of training data they require.
3. Background
3.1 Motivation: Media Aesthetics
Media aesthetics is the study of visual and aural elements
and of their interaction and integration, with the aim of
understanding the semantic and semiotic content of video
[2]. It is based on computing lower-level elements and
constructing new expressive elements out of them to define
the higher-level semantics. For instance, shot lengths and
motion can be used to define the tempo of a video: if the
shot lengths are short, the tempo is staccato in nature,
but if they are long and the motion slow, the style is
legato. This is similar to tempo descriptions in music. It
is thus possible to extract the aesthetic elements of one
medium (a video clip), manipulate them, and give new
meaning to those of another medium (an audio clip). Rhythm
is derived from the inherent structure of the video. Video
rhythm is divided into shot rhythm and motion rhythm;
motion rhythm is further divided into metrical, attack,
decay, and free.
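To make the staccato/legato mapping concrete, the following
minimal Python sketch classifies the tempo of a shot from its
length and average motion. The threshold values and function
name are illustrative assumptions, not parameters taken from
the paper.

    # Illustrative sketch: label the video tempo of a shot, following
    # the staccato/legato description above. Thresholds are assumed.
    SHORT_SHOT_SEC = 2.0   # assumed cutoff for a "short" shot
    SLOW_MOTION = 0.2      # assumed cutoff for "slow" normalized motion

    def video_tempo(shot_length_sec, motion):
        """Return a coarse tempo label for one shot (motion in [0, 1])."""
        if shot_length_sec < SHORT_SHOT_SEC:
            return "staccato"   # short shots feel staccato
        if motion < SLOW_MOTION:
            return "legato"     # long shots with slow motion feel legato
        return "moderate"       # everything in between

    print(video_tempo(1.0, 0.5))   # staccato
    print(video_tempo(8.0, 0.1))   # legato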
The tempo and rhythm described above have corresponding
parallels in music. Musical rhythm is divided into groups
and groups into measures; meter is split into bars, and
every note is determined and defined by its onset time
(attack), the duration of the note, and the decay time.

Our work builds on these ideas to match video features with
those of audio. We extract primitive features of the video
and apply the rules of aesthetics as laid out in film
production to synthesize music that not only synchronizes
with the video tempo but also aesthetically blends with it.
3.2 Basics of Music Theory
Pitch: This indicates the relative highness or lowness of a
sound. Pitch can be used to create musical moods: a low
pitch signifies solemnness, while a high pitch signifies
brighter moods.
Dynamics: This indicates the loudness of music. A steady
increase in dynamics stirs up excitement and is usually
accompanied by an increase in pitch; a gradual decrease in
dynamics suggests calmness of mood.
Rhythm: This refers to how the music ebbs and flows against
the passage of time. It is expressed by the beats, meter,
accent, and tempo of the music. Note lengths are usually
varied by setting them against the timeline of beats.
Tempo: This refers to the speed at which the beats are
played. There is no absolute measure of tempo.
Melody: Melody is determined by the combination of a pitch
series and rhythm, resulting in a clearly defined shape.
The duration of notes and their ordered succession of
intervals define a melody. A melody may start on the note C,
rise up to a note an octave higher, then come down to the
starting pitch, thus following a melodic arch or contour.
Harmony: Harmony is composed of chords (tones that are
sounded simultaneously) and is based on the progression of
chords.
3.3 System Architecture
The system is divided into two layers, the sonification
layer and the aesthetics layer, as shown in Figure 1. These
layers are explained in greater detail below.

Figure 1: Architecture of the system [8]

3.3.1 Sonification Layer
Video features such as hue, saturation, and tempo are
mapped to corresponding audio features. The mapping is done
according to the table given by Zettl, reproduced in Figure
2; the sonification rules are derived from this structural
mapping. The audio features used are pitch, dynamics, and
tempo.

Figure 2: Audio/video structural mapping. Adapted from
Zettl [5].
3.3.2 Calculation of Audio Features
Pitch: The pitch values in MIDI range over [1..127]. The
MIDI pitch key of middle C is 60, corresponding to a
frequency of 261.625 Hz, and the frequency ratio from one
MIDI pitch to the next higher one is the twelfth root of 2.
The range R of MIDI pitch in hertz is [8.1758, 12543.8539].
If h is the normalized hue value of a frame, it is converted
to a hertz value by P_hertz = h * R, and the MIDI pitch note
is given by

    P = 60 + 12 * (ln(P_hertz) - ln(261.625)) / ln(2).

Volume: The volume, or velocity, in MIDI ranges over
[0..127] and is derived from the brightness of the video.
The conversion from brightness to MIDI velocity is linear:
if v is the brightness of the frame, V = v * 127.
Tempo: Tempo is derived from motion and is specified in
beats per minute. The duration of the notes and the motion
parameter are inversely related: T = α / m, where α is a
constant and m is the motion. The beat duration is given by
b = 60 / T, which decides the number of notes to be played
per beat.
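A minimal Python sketch of these three conversions follows.
The exact scaling of hue into the hertz range is not fully
specified above, so a linear map into [FMIN, FMAX] is assumed
here; the constant alpha is likewise a free parameter.

    import math

    FMIN, FMAX = 8.1758, 12543.8539    # Hz range covered by MIDI notes
    MIDDLE_C_HZ = 261.625              # frequency of MIDI note 60

    def hue_to_midi_pitch(h):
        """Map a normalized hue h in [0, 1] to a MIDI pitch number."""
        f = FMIN + h * (FMAX - FMIN)   # assumed linear scaling into range
        p = 60 + 12 * (math.log(f) - math.log(MIDDLE_C_HZ)) / math.log(2)
        return max(0, min(127, round(p)))

    def brightness_to_velocity(v):
        """Map normalized brightness v in [0, 1] linearly to velocity."""
        return max(0, min(127, round(v * 127)))

    def motion_to_tempo(m, alpha=120.0):
        """Tempo in beats per minute, inversely related to motion m."""
        return alpha / max(m, 1e-6)    # guard against zero motion

    t = motion_to_tempo(0.8)           # e.g. 150 bpm
    beat_duration = 60.0 / t           # seconds per beat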
3.3.3 Aesthetics Layer
The aesthetics layer is a compositional layer that gives
form and structure to the music generated by the lower-level
sonification layer. Music can be generated either through a
rule-based approach, where rules are used to train a system,
or by means of an example-based approach, where examples of
music are used as references for generation. This paper
deals with the second approach.
4. Analogy based Composition
Using the analogy-based method, we generate the pitch of
the new music by matching the pitch profile (contour)
derived from the video with the pitch contour of the chosen
example.
4.1 Contour based pitch matching
Experiments in music retrieval have shown that searching
based on melodic contour, which is a sequence of interval
directions, produces good results [6]. Here the matching
between the contours is done using sequence comparison
techniques on the Haar approximation (lower-resolution
form) of the music sample and on the pitch profile from the
video, so that the synthesized music emulates a particular
example. A sketch of the Haar approximation step is given
below.
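The sketch below shows one plausible reading of the Haar
approximation: each level replaces the pitch profile by the
pairwise averages of adjacent samples (the Haar low-pass
band). The number of decomposition levels is an assumption,
as the paper does not state it.

    import numpy as np

    def haar_approx(pitches, levels=1):
        """Lower-resolution Haar approximation of a pitch profile."""
        x = np.asarray(pitches, dtype=float)
        for _ in range(levels):
            if len(x) % 2:                  # pad odd-length input
                x = np.append(x, x[-1])
            x = (x[0::2] + x[1::2]) / 2.0   # pairwise averages
        return x

    print(haar_approx([60, 62, 64, 65, 67, 69, 71, 72]))
    # -> [61.  64.5 68.  71.5]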
To measure the similarity between the two sequences of
pitch segments, Pm (the Haar-approximated music pitch
profile) and Pv (the pitch profile from the video), we use
approximate string matching and calculate the local
transformations, which are replacement, insertion, and
deletion. The alphabet from which the pitch numbers are
drawn is Σ = {1, 2, ..., 127}, the range of MIDI notes. The
sequence Pnew can be obtained from Pv = {a1, a2, ..., an}
and Pm = {b1, b2, ..., bm} by a set of transformation steps
mapping the elements ai onto elements bj.

4.2 Evaluation of the edit distance
The edit distance is the minimum number of local
transformations required to transform Pv into Pm, and it
can be calculated using dynamic programming. The procedure
consists in constructing an integer matrix in which each
row corresponds to an event (note) in Pv and each column to
one in Pm. Each cell stores the distance d(i,j) between ai
and bj, where i ∈ {1, 2, ..., n} and j ∈ {1, 2, ..., m}.
The distance d(i,j) is based on the characters themselves
and not on the positions of the characters in the string;
this is the alphabet-weight edit distance, which is used to
build the weighted edit-distance graph.
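A dynamic-programming sketch of this computation follows. It
scores the three local transformations with the same weights
as the graph of Section 4.3 (deletion w1 = d(i,j),
substitution w2 = 2·d(i,j), insertion w3 = d(i,j)); treating
the alphabet-weight costs exactly this way is an assumption
about the formulation.

    def note_dist(a, b):
        """Element similarity distance; zero exactly on a match."""
        return abs(a - b) / (a + b)

    def weighted_edit_distance(pv, pm):
        """DP over the similarity matrix from cell (1,1) to (n,m)."""
        n, m = len(pv), len(pm)
        INF = float("inf")
        D = [[INF] * m for _ in range(n)]
        D[0][0] = note_dist(pv[0], pm[0])   # initial vertex d(1,1)
        for i in range(n):
            for j in range(m):
                c = note_dist(pv[i], pm[j])
                if i > 0:
                    D[i][j] = min(D[i][j], D[i - 1][j] + c)         # w1
                if j > 0:
                    D[i][j] = min(D[i][j], D[i][j - 1] + c)         # w3
                if i > 0 and j > 0:
                    D[i][j] = min(D[i][j], D[i - 1][j - 1] + 2 * c) # w2
        return D[n - 1][m - 1]              # cost at terminal cell d(n,m)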
4.3 Procedure to find the distance between the two pitch
series
For each note segment ai in Pv
  For each note segment bj in Pm
    d(i,j) = |ai - bj| / (ai + bj)

A runnable version of this procedure is sketched below.
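A runnable version of the nested-loop procedure, vectorized
with NumPy (an implementation choice, not something the
paper prescribes):

    import numpy as np

    def similarity_matrix(pv, pm):
        """Build d(i,j) = |a_i - b_j| / (a_i + b_j) for all note pairs."""
        a = np.asarray(pv, dtype=float)[:, None]   # notes of Pv as column
        b = np.asarray(pm, dtype=float)[None, :]   # notes of Pm as row
        return np.abs(a - b) / (a + b)             # zero exactly on match

    d = similarity_matrix([60, 64, 67], [60, 65, 67])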
A directed graph is constructed from this similarity
distance matrix, as shown in Figure 3, where d(1,1) is the
initial vertex and d(n,m) is the final, terminal one.
Weights along the graph are given as w1 = d(i,j),
w2 = 2 * d(i,j), and w3 = d(i,j).

Figure 3: A directed graph from the similarity distance
matrix.

The weights stand for the costs of the edit operations:
substitution (w2), insertion (w3), and deletion (w1), the
cost being zero for a match.
The shortest path Pmin from d(1,1) to d(n,m) is calculated
by an efficient graph search algorithm; here Dijkstra's
algorithm is used for fast computation of Pmin. This path
gives the new pitch profile Pnew, representing the new set
of synthesized notes derived from the original music.
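The sketch below runs Dijkstra's algorithm over the implicit
grid graph of the similarity matrix d built above, then reads
the shortest path back to produce Pnew. Emitting the example
note bj for each cell on the path is an assumption about how
the new pitch profile is read off; the paper does not spell
this step out.

    import heapq

    def shortest_path_notes(d, pm):
        """Dijkstra from cell (0,0) to (n-1,m-1) of distance matrix d;
        returns the matched example notes along the shortest path."""
        n, m = len(d), len(d[0])
        dist = {(0, 0): d[0][0]}
        prev = {}
        heap = [(d[0][0], (0, 0))]
        while heap:
            cost, (i, j) = heapq.heappop(heap)
            if (i, j) == (n - 1, m - 1):
                break
            if cost > dist[(i, j)]:
                continue                    # stale queue entry
            # moves: down = deletion (w1), right = insertion (w3),
            # diagonal = substitution (w2 = twice the cell distance)
            for (ni, nj), k in (((i + 1, j), 1), ((i, j + 1), 1),
                                ((i + 1, j + 1), 2)):
                if ni < n and nj < m:
                    nc = cost + k * d[ni][nj]
                    if nc < dist.get((ni, nj), float("inf")):
                        dist[(ni, nj)] = nc
                        prev[(ni, nj)] = (i, j)
                        heapq.heappush(heap, (nc, (ni, nj)))
        cell, path = (n - 1, m - 1), []     # walk the path back
        while cell != (0, 0):
            path.append(cell)
            cell = prev[cell]
        path.append((0, 0))
        return [pm[j] for _, j in reversed(path)]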
5. Experimental Results
The music data in our experiments are melodies selected
mainly from Western classical instrumental music. An
example from this collection is transcribed into a sequence
of notes using the MIDI IO library [10]. Figure 4 gives the
pitch contour of a melody: the profile in solid line
indicates the original pitch, and the profile in dotted
line is the Haar approximation of the music sample.

Figure 4: Pitch contour of a Bach melody
The pitch contour of the video, shown in Figure 5, is
derived from the hue of every frame of the video. The
sequence comparison method then gives us the notes that are
'similar' to the music example. The velocity of each note
is computed from the brightness of the video and assigned
to the note. The pitch and volume so generated are
reassembled in the MIDI format, as sketched below, and then
converted to MIDI music. The matched contour is shown in
Figure 6.
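As one possible reassembly step, the sketch below writes the
generated pitches and velocities to a MIDI file with the
Python mido library; mido is used here purely for
illustration, whereas the paper itself uses the MIDI IO
library [10]. Fixed one-beat note durations are an
assumption.

    import mido

    def write_midi(pitches, velocities, path="out.mid", beat_ticks=480):
        """Write a one-note-per-beat MIDI file from pitch/velocity lists."""
        mid = mido.MidiFile(ticks_per_beat=480)
        track = mido.MidiTrack()
        mid.tracks.append(track)
        for p, v in zip(pitches, velocities):
            track.append(mido.Message("note_on", note=p, velocity=v, time=0))
            track.append(mido.Message("note_off", note=p, velocity=0,
                                      time=beat_ticks))  # hold for one beat
        mid.save(path)

    write_midi([60, 64, 67], [90, 80, 100])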
A user survey done on the analogy results suggests that the
generated music is acceptable, though some parts may seem
repetitive or may not be musically pleasing. In general
this is preferred to the rule-based generation of chord
music that we had experimented with earlier. The results
can be found on the website
http://www.comp.nus.edu.sg/~meeragaj.

Figure 5: Pitch contour of the 'airplane' video clip
obtained from the sonification layer.

Figure 6: Pitch contour of the synthesized music
6. Conclusions
We have presented a novel approach for adding audio to
video that focuses on generating content-related music by
translating primitive elements of the video into audio
features and using sequence comparison to synthesize a new
pitch sequence. We have experimented with the generation of
melodies. The system can be expanded to include matching
along more dimensions, such as contour, interval, and
rhythm, to produce more variations. Existing music analysis
techniques of beat induction and rhythm tracking can be
explored for this. We also intend to make the synthesis
more versatile by including parameters such as scale and
rhythm.
References
[1] A. R. Burton and T. Vladimirova, "Generation of musical
sequences with genetic techniques", Computer Music Journal,
Vol. 23, No. 4, 1999.
[2] C. Dorai and S. Venkatesh, "Bridging the semantic gap
in content management systems: computational media
aesthetics", in International Conference on Computational
Semiotics in Games and New Media, pp. 94-99, 2001.
[3] D. Mazzoni and R. B. Dannenberg, "Melody Matching
Directly From Audio", ISMIR 2001.
[4] G. Widmer, "The Synergy of Music Theory and AI:
Learning Multi-Level Expressive Interpretation", in
Proceedings of the Twelfth National Conference on
Artificial Intelligence (AAAI-94), AAAI Press/MIT Press,
Cambridge, MA, pp. 114-119.
[5] H. Zettl, "Sight, Sound, Motion: Applied Media
Aesthetics", Wadsworth, 1998.
[6] K. Lemström, "String Matching Techniques for Music
Retrieval", PhD thesis, University of Helsinki, Finland,
Nov 1999.
[7] P. Mulhem, M. S. Kankanhalli, H. Hassan, and J. Yi,
"Pivot vector space approach for audio-video mixing", IEEE
Multimedia, Vol. 10, No. 2, pp. 28-40, Apr-Jun 2003.
[8] S. H. Srinivasan, Meera G. Nayak, and Mohan
Kankanhalli, "Music Synthesis for Home Videos", manuscript
under preparation, September 2003.
[9] Yuehu Liu et al., "A Method for Content-Based
Similarity Retrieval of Images Using a Two-Dimensional DP
Matching Algorithm", 11th International Conference on Image
Analysis and Processing, Sep 2001.
[10] MIDI IO library, http://midiio.sapp.org.