ICICS-PCM 2003, 15-18 December 2003, Singapore. 3B1.7
0-7803-8185-8/03/$17.00 © 2003 IEEE

Music Synthesis for Home Videos: An Analogy-Based Approach

Meera Nayak, Dept. of Computer Science, National Univ. of Singapore ([email protected])
S. H. Srinivasan, Applied Research Group, Satyam Computer Services, Bangalore ([email protected])
Mohan S. Kankanhalli, Dept. of Computer Science, National Univ. of Singapore ([email protected])

Abstract

In recent years there have been efforts to make home videos more interesting and pleasing to viewers by mixing them with music. Most existing software lets the user add music of their own preference and assumes that the user knows the aesthetic principles of mixing. In our research, we propose a way of adding audio to video by synthesizing appropriate music based on the video content. This is a step toward semi-automatic music mixing that does not exclude the user entirely. We have developed a system that takes music examples selected by the user and generates new music by applying aesthetic rules of audio-video mapping. The paper concentrates on pitch generation through contour-based pitch matching using string-matching techniques. The system helps the user understand what music suits a particular video and assists in choosing a matching piece of music.

Keywords: music synthesis, media aesthetics, string matching, music analogies.

1. Introduction

The ease of use of digital camcorders has turned many amateurs into directors of their own home videos. These videos are not professionally made and lack the aesthetic appeal that movies have. Music can be used to improve the appeal of such home videos, and automatic audio-video mixing is one way to address this problem. In our earlier work [7], certain features of the video and audio were extracted and, based on the matching criteria presented in [5], the best-matching audio clip was chosen. Instead of mixing music by feature extraction and subsequent matching, another approach is to synthesize music by 'listening' to the meaning inherent in the video, using the principles of computational media aesthetics [2] to generate customized music for every video clip.

This paper is organized as follows. Section 2 briefly summarizes related work on music synthesis; Section 3 covers the motivation, the basics of music theory, and the system architecture; Section 4 presents the analogy-based composition algorithm; Sections 5 and 6 report experimental results and conclusions.

2. Related Work

Artificial intelligence techniques have often been used to learn the expressive interpretation of music pieces, which involves learning musical parameters such as dynamics (variations in loudness) and rubato (variations in local tempo) [4]. Many computational intelligence techniques, among them neural networks, genetic algorithms, and genetic programming, have been applied to musical problems such as music cognition, algorithmic composition, and sound synthesis [1]. Genetic algorithms (GAs) have mainly been used for compositional and synthesis tasks. These algorithmic composers operate on musical knowledge such as pitch, rhythm, and meter, and on a rule representation containing the set of rules that determines how the composition evolves [1]. GAs are especially useful for generating improvisations and producing variations of existing music, but the rules must be properly represented as constraints, since they prune the search space. Neural networks have also been used to generate musical segments, but they are limited by the large amount of training data required.
3. Background

3.1 Motivation: Media Aesthetics

Media aesthetics is the study of visual and aural elements and of how these elements interact and integrate to convey the semantic and semiotic content of a video [2]. Lower-level elements are computed first, and new expressive elements are constructed from them to define higher-level semantics. For instance, shot lengths and motion can be used to define the tempo of a video: short shots give a staccato tempo, while long shots with slow motion give a legato style. This is similar to the tempo descriptions used in music. It is therefore possible to extract the aesthetic elements of one medium (a video clip), manipulate them, and give new meaning to another medium (an audio clip). Rhythm is derived from the inherent structure of the video: video rhythm is divided into shot rhythm and motion rhythm, and motion rhythm is further divided into metrical, attack, decay, and free. Sonification rules are derived from this structural mapping. The audio features used are pitch, dynamics, and tempo. The tempo and rhythm described above have corresponding parallels in music: musical rhythm is divided into groups and groups into measures; meter is split into bars; and every note is determined by its onset time (attack), its duration, and its decay time. Our work builds on these ideas to match video features with audio features. We extract primitive features of the video and apply the aesthetic rules laid out in film production to synthesize music that not only synchronizes with the video tempo but also blends with it aesthetically.

3.2 Basics of Music Theory

Pitch: The relative highness or lowness of a sound. Pitch can be used to create musical moods: low pitch signifies solemnity, while high pitch signifies brighter moods.

Dynamics: The loudness of music. A steady increase in dynamics stirs up excitement and is usually accompanied by an increase in pitch; a gradual decrease in dynamics suggests calmness of mood.

Rhythm: How the music ebbs and flows against the passage of time, expressed through beats, meter, accent, and tempo. Note lengths are usually varied by setting them against the timeline of beats.

Tempo: The speed at which the beats are played. There is no absolute measure of tempo.

Melody: A combination of a pitch series and rhythm resulting in a clearly defined shape. The durations of notes and their ordered succession of intervals define a melody. A melody may start on note C, rise to a note an octave higher, then come down to the starting pitch, following a melodic arch or contour.

Harmony: Composed of chords (tones sounded simultaneously) and based on the progression of chords.

3.3 System Architecture

The system is divided into two layers, the sonification layer and the aesthetics layer, as shown in Figure 1. These layers are explained in greater detail below.

Figure 1: Architecture of the system [8].

3.3.1 Sonification Layer

Video features such as hue, saturation, and tempo are mapped to corresponding audio features. The mapping is done according to the table given by Zettl (Figure 2).

Figure 2: Audio/video structural mapping. Adapted from Zettl [5].

Calculation of audio features

Pitch: MIDI pitch values range over [1, 127]. The MIDI pitch of middle C is 60, corresponding to a frequency of 261.625 Hz, and the frequencies of adjacent MIDI pitches differ by a factor of the twelfth root of 2. The range R of MIDI pitch in hertz is [8.1558, 12543.8539]. If h is the normalized hue of a frame, it is first converted to a frequency P_hertz = h * R, and the corresponding MIDI pitch is then

P_h = 60 + 12 * (ln(P_hertz) - ln(261.625)) / ln(2).
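The feature mappings used by the sonification layer (hue to pitch as just derived, plus the brightness-to-velocity and motion-to-tempo mappings of this subsection) can be sketched in Python. This is an illustrative sketch, not the authors' implementation: it reads "P_hertz = h * R" as linear interpolation over the MIDI frequency range, clamps the result to the MIDI pitch range, and the tempo constant ALPHA is an assumed value.

```python
import math

# Sketch of the sonification-layer mappings (assumptions noted below).

F_MIN, F_MAX = 8.1558, 12543.8539   # range R of MIDI pitch in hertz
MIDDLE_C_HZ = 261.625               # frequency of MIDI note 60 (middle C)

def hue_to_midi_pitch(h):
    """Map normalized hue h in [0, 1] to a MIDI pitch number in [1, 127].

    Interprets P_hertz = h * R as linear interpolation over [F_MIN, F_MAX]
    (an assumption; the paper leaves the exact reading open).
    """
    f = F_MIN + h * (F_MAX - F_MIN)                                  # P_hertz
    p = 60 + 12 * (math.log(f) - math.log(MIDDLE_C_HZ)) / math.log(2)
    return min(127, max(1, round(p)))                                # clamp to MIDI range

def brightness_to_velocity(v):
    """Linear map of normalized brightness v in [0, 1] to MIDI velocity [0, 127]."""
    return round(v * 127)

ALPHA = 60.0   # assumed proportionality constant for T = alpha / m

def motion_to_tempo(m):
    """Motion parameter m > 0 to tempo T in beats per minute (T = alpha / m)."""
    return ALPHA / m

def beat_duration(t):
    """Beat duration in seconds for tempo t in bpm (b = 60 / t)."""
    return 60.0 / t
```

With these mappings, a fully saturated warm frame (h near 1) lands at the top of the MIDI pitch range, and faster motion yields a quicker tempo and shorter beats.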
Volume: The MIDI volume (velocity) ranges over [0, 127] and is likewise derived from the brightness of the video. The conversion from brightness to MIDI velocity is linear: if v is the normalized brightness of the frame, V = v * 127.

Tempo: Tempo is derived from motion and is specified in beats per minute. The duration of the notes and the motion parameter are inversely related: T = α / m, where α is a constant and m is the motion parameter. The beat duration b = 60 / T determines the number of notes to be played per beat.

3.3.2 Aesthetics Layer

The aesthetics layer is a compositional layer that gives form and structure to the music generated by the lower-level sonification layer. Music can be generated either through a rule-based approach, in which rules are used to train the system, or through an example-based approach, in which examples of music serve as references for generation. This paper deals with the second approach.

4. Analogy-Based Composition

4.1 Contour-Based Pitch Matching

Experiments in music retrieval have shown that searching based on melodic contour, a sequence of interval directions, produces good results [6]. Here the matching between contours is done using sequence comparison techniques, applied to the Haar approximation (a lower-resolution form) of the music sample and to the pitch profile derived from the video, so that the synthesized music emulates a particular example.

To measure the similarity between the two sequences of pitch segments using approximate string matching, namely Pm = {b1, b2, ..., bm} (the Haar-approximated music pitch profile) and Pv = {a1, a2, ..., an} (the pitch profile from the video), we need to calculate the local transformations: replacement, insertion, and deletion. The alphabet from which the pitch numbers are drawn is Σ = {1, 2, ..., 127}, the range of MIDI notes. The sequence Pnew is obtained from Pv and Pm by a set of transformation steps a_i → b_j.

4.2 Analogy-Based Composition

Using the analogy-based method, we generate the pitch of the new music by matching the pitch profile (contour) derived from the video against the pitch contour of the example chosen.

Evaluation of the edit distance: The edit distance is the minimum number of local transformations required to transform Pv into Pm; it can be calculated using dynamic programming. The procedure constructs an integer matrix in which each row corresponds to an event (note) in Pv and each column to an event in Pm. Each cell stores the distance d(i,j) between a_i and b_j, where i ∈ {1, ..., n} and j ∈ {1, ..., m}. This distance is based on the characters themselves and not on their positions in the string; it is the alphabet-weight edit distance used to build the weighted edit-distance graph.

4.3 Procedure to Find the Distance Between the Two Pitch Series

For each note segment a_i in Pv:
    For each note segment b_j in Pm:
        d(i,j) = |a_i - b_j| / (a_i + b_j)

A directed graph is constructed from this similarity distance matrix, as shown in Figure 3, with d(1,1) as the initial vertex and d(n,m) as the final one. The edge weights are w1 = d(i,j), w2 = 2 * d(i,j), and w3 = d(i,j), standing for the costs of the edit operations deletion (w1), substitution (w2), and insertion (w3); the cost is zero for a match.

Figure 3: A directed graph from the similarity distance matrix.

The shortest path Pmin from d(1,1) to d(n,m) is calculated by an efficient graph algorithm; here Dijkstra's algorithm is used for fast computation of Pmin, which yields the new pitch profile Pnew, the new set of synthesized notes derived from the original music.
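The Haar approximation and the weighted edit distance above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the alignment is computed with dynamic programming (on this acyclic graph with non-negative weights it finds the same minimal-cost path Dijkstra's algorithm would), and the handling of the start cell and the extraction of Pnew from the optimal path are one reasonable reading of the text.

```python
def haar_approx(pitches):
    """One level of Haar approximation: average adjacent pitch pairs.

    Odd-length input is padded by repeating the last note (an assumption).
    """
    if len(pitches) % 2:
        pitches = pitches + [pitches[-1]]
    return [(pitches[i] + pitches[i + 1]) / 2 for i in range(0, len(pitches), 2)]

def note_distance(a, b):
    """Element similarity distance d(i,j) = |a - b| / (a + b)."""
    return abs(a - b) / (a + b)

def match_pitch_profiles(pv, pm):
    """Align the video pitch profile pv with the music profile pm.

    Edge costs follow Section 4.3: substitution 2*d (zero on a match),
    deletion d, insertion d. Returns (total cost, Pnew), where Pnew is
    the sequence of example notes visited along the optimal path.
    """
    n, m = len(pv), len(pm)
    d = [[note_distance(a, b) for b in pm] for a in pv]
    INF = float("inf")
    dist = [[INF] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    dist[0][0] = 0.0                       # start at cell d(1,1)
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:            # substitution w2 (free on exact match)
                w = 0.0 if pv[i] == pm[j] else 2 * d[i][j]
                candidates.append((dist[i - 1][j - 1] + w, (i - 1, j - 1)))
            if i > 0:                      # deletion w1
                candidates.append((dist[i - 1][j] + d[i][j], (i - 1, j)))
            if j > 0:                      # insertion w3
                candidates.append((dist[i][j - 1] + d[i][j], (i, j - 1)))
            dist[i][j], back[i][j] = min(candidates)
    # Backtrack from d(n,m) to d(1,1), collecting the example note at each step.
    new_notes = []
    i, j = n - 1, m - 1
    while True:
        new_notes.append(pm[j])
        if i == 0 and j == 0:
            break
        i, j = back[i][j]
    return dist[n - 1][m - 1], new_notes[::-1]
```

For identical profiles the alignment cost is zero and Pnew reproduces the example; for differing profiles, the relative distance penalizes pitch discrepancies more heavily in the low register, where a fixed interval is perceptually larger.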
5. Experimental Results

The music data in our experiments are melodies selected mainly from Western classical instrumental music. An example from this collection is transcribed into a sequence of notes using the MIDI IO library [10]. Figure 4 gives the pitch contour of a Bach melody: the solid line is the original pitch and the dotted line is its Haar approximation.

Figure 4: Pitch contour of a Bach melody.

The pitch contour of the video, shown in Figure 5, is derived from the hue of every frame of the video. The sequence comparison method gives us the notes that are 'similar' to the music example. The velocity of each note is computed from the brightness of the video and assigned to the note. The pitch and volume so generated are reassembled in the MIDI format and then converted to MIDI music. The matched contour is shown in Figure 6.

Figure 5: Pitch contour of the 'airplane' video clip obtained from the sonification layer.

A user survey of the analogy results suggests that the generated music is acceptable, though some parts may seem repetitive or not musically pleasing. In general it was preferred to the rule-based generation of chord music that we had experimented with earlier. The results can be found on the website http://www.comp.nus.edu.sg/~meeragaj.

6. Conclusions

We have presented a novel approach to adding audio to video that focuses on generating content-related music by translating primitive elements of the video into audio features and using sequence comparison to synthesize a new pitch sequence. We have experimented with the generation of melodies. The system can be expanded to match along more dimensions, such as contour, interval, and rhythm, to produce more variation; the existing music analysis techniques of beat induction and rhythm tracking could be applied here. We intend to make the synthesis more versatile by including parameters such as scale and rhythm.
Figure 6: Pitch contour of the synthesized music.

References

[1] A. R. Burton and T. Vladimirova, "Generation of musical sequences with genetic techniques", Computer Music Journal, Vol. 23, No. 4, 1999.
[2] C. Dorai and S. Venkatesh, "Bridging the semantic gap in content management systems: computational media aesthetics", in International Conference on Computational Semiotics in Games and New Media, pp. 94-99, 2001.
[3] D. Mazzoni and R. B. Dannenberg, "Melody Matching Directly From Audio", ISMIR 2001.
[4] G. Widmer, "The Synergy of Music Theory and AI: Learning Multi-Level Expressive Interpretation", in Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), AAAI Press/MIT Press, Cambridge, MA, pp. 114-119.
[5] H. Zettl, "Sight, Sound, Motion: Applied Media Aesthetics", Wadsworth, 1998.
[6] K. Lemström, "String Matching Techniques for Music Retrieval", PhD thesis, University of Helsinki, Finland, Nov 1999.
[7] P. Mulhem, M. S. Kankanhalli, H. Hassan, and Ji Yi, "Pivot vector space approach for audio-video mixing", IEEE Multimedia, Vol. 10, No. 2, pp. 28-40, Apr-Jun 2003.
[8] S. H. Srinivasan, Meera G. Nayak, and Mohan Kankanhalli, "Music Synthesis for Home Videos", manuscript under preparation, September 2003.
[9] Yuehu Liu et al., "A Method for Content-Based Similarity Retrieval of Images Using a Two-Dimensional DP Matching Algorithm", 11th International Conference on Image Analysis and Processing, Sep 2001.
[10] MIDI IO library, http://midiio.sapp.org.