Robust Audio Fingerprinting Based on Local Spectral Luminance Maxima Scheme

Yong-zhe Shi, Wei-Qiang Zhang, Jia Liu

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

[email protected], {wqzhang,liuj}@tsinghua.edu.cn

Abstract

This paper proposes a robust audio fingerprinting system based on a local spectral luminance maxima (LSLM) scheme that uses image processing approaches. Our approach treats the spectrogram of an audio clip as a 2-D image and extracts the local luminance maxima of the spectral image as the discriminative characteristics. LSLM are selected for their resilience against quantization, compression, noise addition, and similar distortions. Experimental results show that the proposed binary audio fingerprints outperform some state-of-the-art fingerprints in both robustness and reliability, especially in noisy environments.

Index Terms: audio fingerprints, local spectral luminance maxima

1. Introduction

With the advancement of computer and internet technology, information retrieval has become very popular in many fields. Audio retrieval, one such application, has attracted a great deal of attention over the last decade. Because of various distortions, such as quantization, compression, and ambient noise, retrieving audio effectively and reliably remains an important open issue. Audio fingerprints, one of the most promising solutions, have been proposed in recent years. They are defined as perceptual features of audio content and aim to provide fast and reliable methods for content identification [1, 2, 3]. A good fingerprint implies robustness, discrimination, and efficiency [1]. That is, the fingerprints of a distorted signal should be similar to those of the original, perceptually different audio clips should have different fingerprints, and the fingerprints should be conducive to retrieval.

Many methods have been proposed recently [2, 4, 5]; histograms based on the kd-tree algorithm [5] and binary fingerprints based on hashing algorithms [2, 4] are the two major solutions. Hashing algorithms, with their fast retrieval, have received more attention. In particular, the Philips scheme, a robust hashing algorithm proposed by Haitsma and Kalker, has proven to be a robust and accurate audio fingerprinting scheme [2, 6]. In this method, spectral features are extracted, and the differences between adjacent sub-bands and frames are then quantized to a binary string that serves as the audio fingerprint, which is a good low-level representation of audio content [2, 6]. However, it is susceptible to interference and sensitive to a single bit flip, which significantly degrades performance in noisy environments. To address this, many robust fingerprinting methods have been proposed [7, 8, 9], most of which are based on spectral-temporal information; they solve the problem only partly.

In this paper, audio fingerprinting based on a local spectral luminance maxima (LSLM) scheme is proposed. We treat the spectrogram of an audio clip as a 2-D image and extract the LSLM as the discriminative characteristics. The paper is organized as follows. Section 2 gives an overview of the audio fingerprinting system. Section 3 describes the proposed scheme, and Section 4 presents detailed experiments and performance evaluations. Finally, Section 5 gives the conclusions.
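The sign-of-difference quantization that the introduction attributes to the Philips scheme [2] can be illustrated with a short sketch. The function name `philips_bits`, the (frames, bands) array layout, and the toy energies are assumptions made for illustration; this is not the reference implementation.

```python
import numpy as np

def philips_bits(energy):
    """Sketch of sign-of-difference bits (assumed layout: rows = frames, cols = sub-bands).

    A bit is 1 when the energy difference between two adjacent sub-bands
    grows from the previous frame to the current one, and 0 otherwise.
    Returns an array of shape (frames - 1, bands - 1).
    """
    energy = np.asarray(energy, dtype=np.float64)
    band_diff = energy[:, :-1] - energy[:, 1:]    # difference of adjacent sub-bands
    frame_diff = band_diff[1:] - band_diff[:-1]   # change of that difference across frames
    return (frame_diff > 0).astype(np.uint8)

# Toy example (hypothetical energies): 2 frames x 3 sub-bands give 1 x 2 bits.
toy = np.array([[1.0, 2.0, 3.0],
                [3.0, 1.0, 0.0]])
bits = philips_bits(toy)
```

Because each bit depends only on the sign of a small difference, a slight perturbation of the energies near a tie can flip a bit, which is the sensitivity that motivates more robust local descriptors.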

2. Overview of Audio Fingerprinting System

An overview of the audio fingerprinting system is shown in Fig. 1. It generally consists of feature extraction, hash search, and fingerprint matching [1, 2]. Fingerprints are extracted from the queries, and candidates are then searched using a hash table built from the audio fingerprint database (DB). Finally, the targets are identified at the match stage. In this system, the robustness and discrimination of the features directly influence system performance. Robustness means that fingerprints are immune to all kinds of distortions and retain their similarity, which guarantees a high recall rate in the hash search stage. Discrimination refers to the capability of representing audio content so as to gain high precision in the match stage.

Many of the existing features used in audio retrieval [2, 4, 7] obtain high precision in the final stage. This usually rests on the hypothesis that the matched clips come from the same source and are only degraded by different distortions. A brute-force search yields high precision and recall rates, but it is time-consuming. A hashing algorithm provides a way to locate the targets quickly. However, it uses the local descriptors as keys, which requires a very robust description of local characteristics, because even a single bit error changes the key. A common remedy is to increase the frame size to improve robustness. This is not an effective solution, however, because it creates a dilemma for short audio clips. In this paper, we focus on the robustness of fingerprints and propose a novel robust fingerprint based on the LSLM scheme.

Figure 1: Overview of an audio fingerprinting system. [Block diagram: audio clips and queries → feature extraction → fingerprints → fingerprint database (DB); hash search in DB → candidates → fingerprint match → targets.]
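The hash search and match stages of Fig. 1 can be sketched as follows. This is a minimal illustration under stated assumptions: sub-fingerprints are modeled as small integers, the hash table is an exact-key dictionary, and the names `build_hash_table`, `bit_error_rate`, and `search` are hypothetical rather than taken from the paper.

```python
from collections import defaultdict

def build_hash_table(db):
    """db: dict clip_id -> list of integer sub-fingerprints (assumed representation)."""
    table = defaultdict(list)
    for clip_id, fps in db.items():
        for pos, fp in enumerate(fps):
            table[fp].append((clip_id, pos))
    return table

def bit_error_rate(a, b, bits):
    """Fraction of differing bits between two equal-length sub-fingerprint lists."""
    errors = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return errors / (bits * len(a))

def search(query, db, table, bits, threshold):
    """Hash search for candidates, then BER match. Returns (clip_id, offset, ber) or None."""
    candidates = set()
    for qpos, fp in enumerate(query):
        for clip_id, pos in table.get(fp, ()):        # exact-key lookup
            offset = pos - qpos                       # implied start of the query in the clip
            if 0 <= offset <= len(db[clip_id]) - len(query):
                candidates.add((clip_id, offset))
    best = None
    for clip_id, offset in candidates:
        ref = db[clip_id][offset:offset + len(query)]
        ber = bit_error_rate(query, ref, bits)
        if ber <= threshold and (best is None or ber < best[2]):
            best = (clip_id, offset, ber)
    return best

# Toy database with 4-bit sub-fingerprints (made-up values).
db = {"a": [1, 2, 3, 4, 5], "b": [9, 8, 7, 6, 5]}
table = build_hash_table(db)
result = search([2, 7, 4], db, table, bits=4, threshold=0.2)
```

In this sketch a candidate is found only if at least one query sub-fingerprint survives the distortion bit-exactly, which is why the robustness of each local descriptor determines the recall of the hash search stage.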

3. Proposed Audio Fingerprint

Robust audio fingerprints based on the LSLM scheme are extracted from the FBANK feature. First, audio clips are split into overlapping frames; each frame is windowed by a Hamming window and transformed into the spectral domain. Sub-band energies are then calculated through Mel filter banks, and finally a 32-dimensional FBANK feature is extracted. Fig. 2 shows an instance of a spectral image and the 32-dimensional spectrum of the same audio frame under different distortions. From this figure, we find that adjacent frames at the same sub-band are locally smooth in the spectral image, especially for relatively long frame lengths, for example 1024 or 2048 samples (sampling frequency fs = 8 kHz). However, the peaks with higher sub-band energy are resistant to distortions, and their adjacent sub-bands keep a similar gradient, which is a robust characteristic. Therefore, adjacent frames are locally less distinguishable than adjacent sub-bands, a fact that most existing algorithms do not consider. Based on these facts, we propose a simple but effective audio fingerprint based on LSLM; detailed illustrations are shown in Fig. 3 and Fig. 4.

Figure 2: Spectral image and 32-dimensional spectrum of one frame with different distortions. [Top panel: spectral image (FBANK feature) from an audio clip, frame versus sub-band. Bottom panel: comparison of the 250th frame with the degraded frames of the same audio clip, sub-band energy versus sub-band (32-dim FBANK feature), for no distortion, 10 dB noise, 5 dB noise, 0 dB noise, and MP3 8 kbps.]

Figure 3: Feature extraction in proposed scheme. [Diagram: spectral image (FBANK feature, frame versus sub-band) → 4×4 blocks → quantization → fingerprints.]

Figure 4: Detailed illustrations of LSLM quantization. [Diagram: partitioning the spectral image into 4×4 blocks, smoothing, and quantization of each block to one of the codes 11, 10, 01, 00.]

The FBANK feature, as a 2-D spectral image, is partitioned into blocks of 4 × 4 with an overlap of 50% in both sub-band and frame. We denote $E_k(m, n)$ as the spectral energy at frame $m$ and sub-band $n$ in block $k$. Then

$$\bar{e}_k(n) = \frac{1}{M} \sum_{m=0}^{M-1} E_k(m, n)$$

is the smoothed spectrum of block $k$, where $M$ is the block length. As shown in Fig. 4, the density of black indicates the luminance in the smoothed spectrum, which is divided into 4 categories. Quantization is performed according to the quadrant of the local luminance maximum:

$$i_k = \arg\max_{n} \{\bar{e}_k(n)\}, \quad n = 0, 1, \ldots, N-1 \tag{1}$$

where $N$ is the block height and $i_k \in [0, N-1]$ is the local peak, referred to as a local spectral luminance maximum. Then $i_k$ is converted into the corresponding binary string as the local descriptor of block $k$. In the proposed system, both $M$ and $N$ are empirically set to 4.

To improve the robustness of quantization, we examine the maximum and the second maximum to decide the quantization confidence: the closer the top two values, the less reliable the quantization. An intuitive idea is to quantize the blocks with low confidence to the same binary string.

$$\tilde{e}_k(n) = \frac{\bar{e}_k(n)}{\sum_{n'=0}^{N-1} \bar{e}_k(n')}, \quad n = 0, 1, \ldots, N-1 \tag{2}$$

where $\tilde{e}_k(n)$ is the normalized spectrum of block $k$. We define $\mathrm{diffmax2}(\cdot)$ as a function that calculates the absolute difference between the top two values in a vector and returns a non-negative real number.

$$i_k = \mathrm{diffmax2}(\tilde{e}_k(n)) > \delta \;?\; \arg\max_{n}\{\tilde{e}_k(n)\} : 0 \tag{3}$$

Eq. (3) is written in the style of the ternary conditional expression of the C language, and $\delta$ is an empirical threshold that controls the quantization ambiguity, which can improve robustness, especially for unstable high-frequency regions.

To investigate the effectiveness of the proposed fingerprints, we select 1000 undistorted 5 s audio clips as queries and degrade them with added white noise at an SNR of 5 dB to form the test clips. We calculate the minimum Hamming distance, expressed as the bit error rate (BER), between each pair of a query and a test clip. This gives 1000 × 1000 ($10^6$) trials in total, consisting of 1000 matches and 999000 mismatches. Based on this experiment, we give the distance probability distribution function (PDF) of the proposed scheme compared with that of the Philips scheme in Fig. 5.

Figure 5: Distance PDF of matches and mismatches for the Philips system and the proposed one; frame length is 1024. [Plot: probability density versus bit error rate (BER), with match and mismatch curves for the Philips and LSLM schemes.]

Both PDFs are bimodal, so the optimal decision threshold is easy to find. Additionally, the two peaks of the PDF of the proposed scheme are farther apart, which implies that the proposed audio fingerprints are more discriminative.

In addition, to investigate the robustness of the proposed scheme, the 1000 queries are used for retrieval in a hash table built from the fingerprints of the 1000 test clips above. We count the fingerprints that directly hit their targets through the hash table, without considering the final match result, and give the direct hit rate per query fingerprint in the hash search stage in Table 1.

Table 1: Hit rate per fingerprint in hashing search stage.

System             Hit rate
Philips System     0.016
Proposed System    0.090

From Table 1, we see that the direct hit rate per fingerprint in the hash search stage improves significantly, from 0.016 to 0.090. Further evaluations of the proposed fingerprints are given in the next section.
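The block quantization of Eqs. (1)-(3) in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the input is a (frames, sub-bands) array of non-negative energies, the value of `delta` is a placeholder (the paper only says the threshold is empirical), and the names `lslm_codes` and `code_to_bits` are hypothetical.

```python
import numpy as np

def lslm_codes(fbank, M=4, N=4, delta=0.05):
    """Quantize M x N blocks (50% overlap) of a spectral image to peak indices.

    fbank: array (frames, sub-bands) of non-negative energies (assumption).
    delta: confidence threshold; 0.05 is an illustrative value, not from the paper.
    Returns an array of codes in [0, N - 1], one per block.
    """
    fbank = np.asarray(fbank, dtype=np.float64)
    frames, bands = fbank.shape
    step_m, step_n = max(M // 2, 1), max(N // 2, 1)   # 50% overlap in frame and sub-band
    rows = []
    for m0 in range(0, frames - M + 1, step_m):
        row = []
        for n0 in range(0, bands - N + 1, step_n):
            block = fbank[m0:m0 + M, n0:n0 + N]       # E_k(m, n)
            smoothed = block.mean(axis=0)             # smoothed spectrum over the M frames
            total = smoothed.sum()
            if total <= 0:                            # empty block: treat as low confidence
                row.append(0)
                continue
            normalized = smoothed / total             # Eq. (2)
            second, first = np.sort(normalized)[-2:]  # two largest values
            confident = (first - second) > delta      # diffmax2(...) > delta
            row.append(int(np.argmax(normalized)) if confident else 0)  # Eq. (3)
        rows.append(row)
    return np.array(rows, dtype=np.uint8)

def code_to_bits(code, N=4):
    """Binary string of a block code, e.g. 2 -> '10' for N = 4."""
    width = max((N - 1).bit_length(), 1)
    return format(int(code), f"0{width}b")

# One 4 x 4 block with a clear peak in sub-band 2, and one flat block.
peak_block = np.tile([1.0, 1.0, 8.0, 1.0], (4, 1))
flat_block = np.ones((4, 4))
```

Note that, as in Eq. (3), the code 0 is shared by blocks whose true peak is at n = 0 and by low-confidence blocks; the sketch keeps that behavior rather than reserving a separate code.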

4. Performance Evaluation

The performance of the LSLM scheme was evaluated using a 10-hour phone-call recording as the database. To investigate the performance of the proposed scheme under different distortion conditions, the 10-hour recording was degraded with MP3 compression (32 kbps, 16 kbps, and 8 kbps) and with added white noise (SNR of 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB). These distortions were used to simulate the interference found in real environments. In total, the original recording and the nine degraded versions constituted the entire test database (a 100-hour DB). The queries were 1000 5 s clips selected randomly from the original audio database. All clips in the DB were sampled at 8 kHz with 16-bit linear quantization. The fingerprints were extracted using 32 critical bands from 80 Hz to 4 kHz based on the FBANK feature, with a frame step of 192 samples (24 ms per frame). System performance was measured using precision and recall rate.

We designed three experiments, each compared with the well-known Philips system, to study the impact of different factors on performance: frame size, distortion type (compression, noise, etc.), and noise at different SNRs. The same audio retrieval engine was used throughout, with only the fingerprints changed.

4.1. Evaluation of frame size

In this section, we investigate the effect of frame size on performance. We used the 1000 queries for retrieval in the degraded DB and studied the relation between the maximum recall rate and the frame size. From Fig. 6, we see that the recall rates of both schemes increase with the frame size. Additionally, the Philips scheme is more sensitive to frame size, especially for MP3 at 8 kbps. The proposed LSLM scheme is more robust and reliable for short frames.

Figure 6: Maximum recall rate for different frame sizes under different distortions, compared with the Philips scheme. 512, 1024 and 4096 in the chart are frame sizes in samples at sampling frequency Fs = 8 kHz.

Figure 7: Performance of the Philips system with frame size 1024 (Fs = 8 kHz). [Plot: precision ("pre") and recall ("rec") rate in % versus bit error rate (BER) for MP3 32 kbps, MP3 16 kbps, MP3 8 kbps, SNR 10 dB, and SNR 5 dB.]

Figure 8: Performance of the proposed LSLM system with frame size 1024 (Fs = 8 kHz). [Plot: precision ("pre") and recall ("rec") rate in % versus bit error rate (BER) for MP3 32 kbps, MP3 16 kbps, MP3 8 kbps, SNR 10 dB, and SNR 5 dB.]
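The additive-noise degradation and the frame timing used in the experimental setup can be illustrated with a small sketch. It assumes white Gaussian noise scaled to the average power of the whole clip, which the paper does not specify, and the names `add_white_noise` and `samples_to_ms` are hypothetical.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the clip-level SNR is approximately snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))   # SNR = 10 log10(Ps / Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def samples_to_ms(num_samples, fs=8000):
    """Duration in milliseconds of num_samples at sampling rate fs."""
    return 1000.0 * num_samples / fs

# A 5 s, 8 kHz test tone degraded to 5 dB SNR (the tone itself is a made-up input).
fs = 8000
t = np.arange(5 * fs) / fs
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = add_white_noise(clean, snr_db=5.0, rng=np.random.default_rng(0))
measured_snr = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

With these helpers, a frame step of 192 samples corresponds to `samples_to_ms(192)` = 24 ms, and frame sizes of 1024 and 4096 samples correspond to 128 ms and 512 ms at 8 kHz.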

4.2. Evaluation of all kinds of distortions

In this section, we investigate the overall performance of the proposed algorithm under all kinds of distortions: MP3 at 32 kbps, 16 kbps, and 8 kbps, and added noise with SNR of 10 dB and 5 dB. Detailed comparisons for frame size 1024 are shown in Fig. 7 and Fig. 8. "pre" and "rec" are short forms of precision and recall rate in the figures. We can see that the proposed scheme outperforms the Philips scheme, especially for SNR 10 dB, SNR 5 dB, and MP3 8 kbps. Moreover, both precision and recall rate are above 0.95 at a BER threshold of 0.30 for MP3 32 kbps, 16 kbps, 8 kbps, and SNR 10 dB.

4.3. Evaluation of noises with different SNR

In this section, we evaluate the performance of the proposed scheme in noisy environments with frame sizes of 1024 and 4096. The 1000 5 s queries were used for retrieval in different noisy environments, with added white noise at SNR 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB. Comparing Fig. 9 and Fig. 10, we see that the proposed scheme performs well for added noise with SNR 20 dB, 15 dB, and 10 dB. To study the ultimate performance of the system on this DB, we also evaluated frame size 4096, as shown in Fig. 11 and Fig. 12. Given an empirical BER threshold, the best precision and recall rate for 5 dB noise are 0.9988 and 0.9114 at BER = 0.30, and 0.9322 and 0.9302 at BER = 0.35. Compared with the Philips scheme, the performance is improved significantly.

Figure 9: Performance of the Philips system with frame size 1024 (Fs = 8 kHz). [Plot: precision ("pre") and recall ("rec") rate in % versus bit error rate (BER) for SNR 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB.]

Figure 10: Performance of the proposed LSLM system with frame size 1024 (Fs = 8 kHz). [Plot: precision ("pre") and recall ("rec") rate in % versus bit error rate (BER) for SNR 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB.]

Figure 11: Performance of the Philips system with frame size 4096 (Fs = 8 kHz). [Plot: precision ("pre") and recall ("rec") rate in % versus bit error rate (BER) for SNR 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB.]

Figure 12: Performance of the proposed LSLM system with frame size 4096 (Fs = 8 kHz). [Plot: precision ("pre") and recall ("rec") rate in % versus bit error rate (BER) for SNR 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB.]
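The trade-off between precision and recall as the BER threshold moves can be made concrete with a small sketch. It assumes the usual definitions (precision = correct returns / all returns, recall = correct returns / all queries), which the paper does not spell out; the function name `precision_recall` and the toy results are hypothetical.

```python
def precision_recall(results, threshold):
    """results: list of (best_ber, is_correct) pairs, one per query (assumed format).

    A query returns its best candidate only when best_ber <= threshold.
    Returns (precision, recall); precision is 0.0 by convention when nothing is returned.
    """
    returned = [is_correct for best_ber, is_correct in results if best_ber <= threshold]
    correct = sum(1 for is_correct in returned if is_correct)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(results) if results else 0.0
    return precision, recall

# Made-up outcomes for five queries: (BER of best candidate, whether it is the true target).
toy_results = [(0.10, True), (0.25, True), (0.32, False), (0.40, True), (0.45, False)]
strict = precision_recall(toy_results, threshold=0.30)
loose = precision_recall(toy_results, threshold=0.50)
```

In this toy data, raising the threshold admits more true targets and more false candidates at the same time, which mirrors the reported shift from higher precision at BER = 0.30 to higher recall at BER = 0.35.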

5. Conclusions

This paper has proposed a robust audio fingerprinting system based on LSLM. We treat the spectrogram of an audio clip as a 2-D image to extract robust fingerprints. Experimental results show that the proposed LSLM scheme is more robust and reliable, especially in noisy environments. Additionally, the LSLM scheme is less sensitive to frame size than the Philips scheme and remains effective for short frames.

6. Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61005019 and No. 60931160443, and in part by the National High Technology Development Program of China (863 Program) under Grant No. 2008AA040201.

7. References

[1] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of audio fingerprinting,” Journal of VLSI Signal Processing, vol. 41, no. 3, pp. 271–284, 2005.

[2] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system with an efficient search strategy,” Journal of New Music Research, vol. 32, no. 2, pp. 211–221, 2003.

[3] C. Bellettini and G. Mazzini, “A framework for robust audio fingerprinting,” Journal of Communications, vol. 5, no. 5, pp. 409–424, 2010.

[4] A. L. Wang, “An industrial-strength audio search algorithm,” in International Symposium Conference on Music Information Retrieval (ISMIR), 2003, pp. 7–13.

[5] K. Kashino, T. Kurozumi, and H. Murase, “A quick search method for audio and video signals based on histogram pruning,” IEEE Transactions on Multimedia, vol. 5, no. 3, pp. 348–357, 2003.

[6] F. Balado, N. J. Hurley, E. P. McCarthy, and G. C. M. Silvestre, “Performance analysis of robust audio hashing,” IEEE Transactions on Information Forensics and Security, vol. 2, no. 2, pp. 254–266, 2007.

[7] S. Baluja and M. Covell, “Audio fingerprinting: Combining computer vision and data stream processing,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007, pp. 213–216.

[8] J. S. Seo, M. Jin, S. Lee, D. Jang, S. Lee, and C. D. Yoo, “Audio fingerprinting based on normalized spectral sub-band centroids,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, 2005, pp. 213–216.

[9] ——, “Audio fingerprinting based on normalized spectral sub-band moments,” IEEE Signal Processing Letters, vol. 3, pp. 209–212, 2006.
