Publications

2026

  • M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, “Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2026, accepted for publication.
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2026_ICASSP,
    author = "Heikkinen, Mikko and Politis, Archontis and Drossos, Konstantinos and Virtanen, Tuomas",
    title = "Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention",
    year = "2026",
    booktitle = "IEEE International Conference on Acoustics, Speech, and Signal Processing, accepted for publication.",
    url = "https://arxiv.org/abs/2601.23196"
    }

  • M. Silaev, K. Drossos, and T. Virtanen, “Discriminating real and synthetic super-resolved audio samples using embedding-based classifiers,” in Joint Workshop on HSCMA and CHiME 2026, 2026, accepted for publication.
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2026_ICA,
    author = "Silaev, Mikhail and Drossos, Konstantinos and Virtanen, Tuomas",
    title = "Discriminating real and synthetic super-resolved audio samples using embedding-based classifiers",
    year = "2026",
    booktitle = "Joint Workshop on HSCMA and CHiME 2026, accepted for publication.",
    url = "https://arxiv.org/abs/2601.03443"
    }

2025

  • M. Airaksinen, O. Räsänen, and S. Vanhatalo, “Trade-Offs Between Simplifying Inertial Measurement Unit–Based Movement Recordings and the Attainability of Different Levels of Analyses: Systematic Assessment of Method Variations,” JMIR mHealth and uHealth, vol. 13, 2025. doi:10.2196/58078
    [BibTeX] [Abstract]

    Background: Human movement activity is commonly recorded with inertial measurement unit (IMU) sensors in many science disciplines. The IMU data can be used for an algorithmic detection of different postures and movements, which may support more detailed assessments of complex behaviors, such as daily activities. Studies on human behavior in real-life environments need to strike a balance between simplifying the recording settings and preserving sufficient analytic gains. It is poorly understood, however, what the trade-offs are between alternative recording configurations and the attainable analyses of naturalistic behavior at different levels of inspection, or with respect to achievable scientific questions. Objective: This study assessed systematically the effects of IMU recording configurations (placement and number of IMU sensors, sampling frequency, and sensor modality) on the high temporal resolution detections of postures and movements, and on their lower temporal resolution derivative statistics when the data represents naturalistic daily activity without excessively repetitive movements. Methods: We used a dataset from spontaneously moving infants (N=41; age range 4‐18 months) recorded with a multisensor wearable suit. The analysis benchmark was obtained using human annotations of postures and movements from a synchronously recorded video, and the reference IMU recording configuration included 4 IMU sensors collecting triaxial accelerometer and gyroscope modalities at 52 Hz. Then, we systematically tested how the algorithmic classification of postures (N=7), and movements (N=9), as well as their distributions and a derivative motor performance score, are affected by reducing IMU data sampling frequency, sensor modality, and sensor placement. Results: Our results show that reducing the number of sensors has a significant effect on classifier performance, and the single sensor configurations were nonfeasible (posture classification Cohen kappa<0.75; movement<0.45). Reducing sensor modalities to accelerometer only, that is, dropping gyroscope data, leads to a modest reduction in movement classification performance (kappa=0.50-0.53). However, the sampling frequency could be reduced from 52 to 6 Hz with negligible effects on the classifications (posture kappa=0.90-0.92; movement=0.56-0.58). Conclusions: The present findings highlight the significant trade-offs between IMU recording configurations and the attainability of sufficiently reliable analyses at different levels. Notably, the single-sensor recordings employed in most of the literature and wearable solutions are of very limited use when assessing the key aspects of real-world movement behavior at relevant temporal resolutions. The minimal configuration with an acceptable classifier performance includes at least a combination of one upper and one lower extremity sensor, at least 13 Hz sampling frequency, and at least an accelerometer, but preferably also a gyroscope (posture kappa=0.89-0.91; movement=0.50-0.53). These findings have direct implications for the design of future studies and wearable solutions that aim to quantify spontaneously occurring postures and movements in natural behaviors.

    @article{2025_c,
    author = {Airaksinen, Manu and R{\"a}s{\"a}nen, Okko and Vanhatalo, Sampsa},
    title = "Trade-Offs Between Simplifying Inertial Measurement Unit–Based Movement Recordings and the Attainability of Different Levels of Analyses: Systematic Assessment of Method Variations",
    abstract = "Background: Human movement activity is commonly recorded with inertial measurement unit (IMU) sensors in many science disciplines. The IMU data can be used for an algorithmic detection of different postures and movements, which may support more detailed assessments of complex behaviors, such as daily activities. Studies on human behavior in real-life environments need to strike a balance between simplifying the recording settings and preserving sufficient analytic gains. It is poorly understood, however, what the trade-offs are between alternative recording configurations and the attainable analyses of naturalistic behavior at different levels of inspection, or with respect to achievable scientific questions. Objective: This study assessed systematically the effects of IMU recording configurations (placement and number of IMU sensors, sampling frequency, and sensor modality) on the high temporal resolution detections of postures and movements, and on their lower temporal resolution derivative statistics when the data represents naturalistic daily activity without excessively repetitive movements. Methods: We used a dataset from spontaneously moving infants (N=41; age range 4‐18 months) recorded with a multisensor wearable suit. The analysis benchmark was obtained using human annotations of postures and movements from a synchronously recorded video, and the reference IMU recording configuration included 4 IMU sensors collecting triaxial accelerometer and gyroscope modalities at 52 Hz. Then, we systematically tested how the algorithmic classification of postures (N=7), and movements (N=9), as well as their distributions and a derivative motor performance score, are affected by reducing IMU data sampling frequency, sensor modality, and sensor placement. Results: Our results show that reducing the number of sensors has a significant effect on classifier performance, and the single sensor configurations were nonfeasible (posture classification Cohen kappa<0.75; movement<0.45). Reducing sensor modalities to accelerometer only, that is, dropping gyroscope data, leads to a modest reduction in movement classification performance (kappa=0.50-0.53). However, the sampling frequency could be reduced from 52 to 6 Hz with negligible effects on the classifications (posture kappa=0.90-0.92; movement=0.56-0.58). Conclusions: The present findings highlight the significant trade-offs between IMU recording configurations and the attainability of sufficiently reliable analyses at different levels. Notably, the single-sensor recordings employed in most of the literature and wearable solutions are of very limited use when assessing the key aspects of real-world movement behavior at relevant temporal resolutions. The minimal configuration with an acceptable classifier performance includes at least a combination of one upper and one lower extremity sensor, at least 13 Hz sampling frequency, and at least an accelerometer, but preferably also a gyroscope (posture kappa=0.89-0.91; movement=0.50-0.53). These findings have direct implications for the design of future studies and wearable solutions that aim to quantify spontaneously occurring postures and movements in natural behaviors.",
    keywords = "accelerometer, activity recognition, algorithm, balance, detection, gross motor development, human activity recognition, IMU, MAIJU, motility, motility assessment, motor, motor development, movement sensors, neural networks, neurodevelopment, posture, recording configuration, sensor, wearable",
    note = {Publisher Copyright: {\textcopyright} Manu Airaksinen, Okko R{\"a}s{\"a}nen, Sampsa Vanhatalo.},
    year = "2025",
    doi = "10.2196/58078",
    language = "English",
    volume = "13",
    journal = "JMIR Mhealth and Uhealth",
    issn = "2291-5222",
    publisher = "JMIR Publications"
    }

  • M. A. Cruz Blandón, N. Gonzalez-Gomez, M. Lavechin, and O. Räsänen, "Simulating prenatal language exposure in computational models: An exploration study," Cognition, vol. 256, 2025. doi:10.1016/j.cognition.2024.106044
    [BibTeX] [Abstract]

    Researchers have hypothesized that infant language learning starts from the third trimester of pregnancy. This is supported by studies with fetuses and newborns showing discrimination/preference for their native language. Jointly with empirical research, initial computational modeling studies have investigated whether learning language patterns from speech input benefits from auditory prenatal language exposure (PLE), showing some advantages for prior adaptation to speech-like patterns. However, these modeling studies have not modeled prenatal speech input in an ecologically representative manner regarding quality or quantity. This study describes an ecologically representative framework for modeling PLE for full-term and preterm infants. The approach is based on empirical estimates of the amount of prenatal speech input together with a model of speech signal attenuation from the external air to the fetus’ auditory system. Using this framework, we conduct language learning simulations with computational models that learn from acoustic speech input in an unsupervised manner. We compare the effects of PLE to standard learning from only postnatal input on various early language phenomena. The results show how incorporating PLE can affect models’ learning outcomes, including differences between full-term and preterm conditions. Moreover, PLE duration might influence model behavior, depending on the linguistic capability being tested. While the inclusion of PLE did not improve the compatibility of the tested models with empirical infant data, our study highlights the relevance of PLE as a factor in modeling studies. Moreover, it provides a basic framework for modeling the prenatal period in future computational studies.

    @article{2025_g,
    author = {Cruz Bland{\'o}n, Mar{\'i}a Andrea and Gonzalez-Gomez, Nayeli and Lavechin, Marvin and R{\"a}s{\"a}nen, Okko},
    title = "Simulating prenatal language exposure in computational models: An exploration study",
    abstract = "Researchers have hypothesized that infant language learning starts from the third trimester of pregnancy. This is supported by studies with fetuses and newborns showing discrimination/preference for their native language. Jointly with empirical research, initial computational modeling studies have investigated whether learning language patterns from speech input benefits from auditory prenatal language exposure (PLE), showing some advantages for prior adaptation to speech-like patterns. However, these modeling studies have not modeled prenatal speech input in an ecologically representative manner regarding quality or quantity. This study describes an ecologically representative framework for modeling PLE for full-term and preterm infants. The approach is based on empirical estimates of the amount of prenatal speech input together with a model of speech signal attenuation from the external air to the fetus{\textquoteright} auditory system. Using this framework, we conduct language learning simulations with computational models that learn from acoustic speech input in an unsupervised manner. We compare the effects of PLE to standard learning from only postnatal input on various early language phenomena. The results show how incorporating PLE can affect models{\textquoteright} learning outcomes, including differences between full-term and preterm conditions. Moreover, PLE duration might influence model behavior, depending on the linguistic capability being tested. While the inclusion of PLE did not improve the compatibility of the tested models with empirical infant data, our study highlights the relevance of PLE as a factor in modeling studies. Moreover, it provides a basic framework for modeling the prenatal period in future computational studies.",
    keywords = "Child language development, Computational modeling, Language acquisition, Prenatal language exposure",
    note = "Publisher Copyright: {\textcopyright} 2024 The Authors",
    year = "2025",
    month = "March",
    doi = "10.1016/j.cognition.2024.106044",
    language = "English",
    volume = "256",
    journal = "COGNITION",
    issn = "0010-0277",
    publisher = "Elsevier B.V."
    }

  • J. Garcia-Martinez, D. Diaz-Guerra, A. Politis, T. Virtanen, J. J. Carabias-Orti, and P. Vera-Candeas, "SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation," IEEE Open Journal of Signal Processing, 2025. doi:10.1109/OJSP.2025.3528361
    [BibTeX] [Abstract]

    Recent advancements in music source separation have significantly progressed, particularly in isolating vocals, drums, and bass elements from mixed tracks. These developments owe much to the creation and use of large-scale, multitrack datasets dedicated to these specific components. However, the challenge of extracting similarly sounding sources from orchestra recordings has not been extensively explored, largely due to a scarcity of comprehensive and clean (i.e bleed-free) multitrack datasets. In this paper, we introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic, musically motivated, and heterogeneous training set comprising different dynamics, natural tempo changes, styles, and conditions by employing high-quality digital libraries that define virtual instrument sounds for MIDI playback (a.k.a., soundfonts). Moreover, we demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset w.r.t to the well-known EnsembleSet, and evaluate its performance under both synthetic and real-world conditions.

    @article{2025_SP_a,
    author = "Garcia-Martinez, Jaime and Diaz-Guerra, David and Politis, Archontis and Virtanen, Tuomas and Carabias-Orti, Julio J. and Vera-Candeas, Pedro",
    title = "SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation",
    abstract = "Recent advancements in music source separation have significantly progressed, particularly in isolating vocals, drums, and bass elements from mixed tracks. These developments owe much to the creation and use of large-scale, multitrack datasets dedicated to these specific components. However, the challenge of extracting similarly sounding sources from orchestra recordings has not been extensively explored, largely due to a scarcity of comprehensive and clean (i.e bleed-free) multitrack datasets. In this paper, we introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic, musically motivated, and heterogeneous training set comprising different dynamics, natural tempo changes, styles, and conditions by employing high-quality digital libraries that define virtual instrument sounds for MIDI playback (a.k.a., soundfonts). Moreover, we demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset w.r.t to the well-known EnsembleSet, and evaluate its performance under both synthetic and real-world conditions.",
    keywords = "Classical music, dataset, deep learning, machine learning, music source separation, orchestra music",
    note = "Publisher Copyright: {\textcopyright} 2020 IEEE.",
    year = "2025",
    doi = "10.1109/OJSP.2025.3528361",
    language = "English",
    journal = "IEEE Open Journal of Signal Processing",
    issn = "2644-1322",
    publisher = "IEEE"
    }

  • M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, "Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays," in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2025, p. 1–5. doi:10.1109/ICASSP49660.2025.10887869
    [BibTeX] [Abstract]

    Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The method takes as inputs the MA geometry and MA signals and uses a multi-level encoder consisting of separate paths for geometry and signal data, where geometry features inform the signal encoder at each level. The method is validated in simulated anechoic and reverberant conditions with one and two sources. The results indicate improvement over conventional encoding across the whole frequency range for dry scenes, while for reverberant scenes the improvement is frequency-dependent.

    @inproceedings{2025_ICASSP,
    author = "Heikkinen, Mikko and Politis, Archontis and Drossos, Konstantinos and Virtanen, Tuomas",
    editor = "Rao, Bhaskar D and Trancoso, Isabel and Sharma, Gaurav and Mehta, Neelesh B.",
    title = "Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays",
    abstract = "Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The method takes as inputs the MA geometry and MA signals and uses a multi-level encoder consisting of separate paths for geometry and signal data, where geometry features inform the signal encoder at each level. The method is validated in simulated anechoic and reverberant conditions with one and two sources. The results indicate improvement over conventional encoding across the whole frequency range for dry scenes, while for reverberant scenes the improvement is frequency-dependent.",
    keywords = "Ambisonics, deep learning, microphone array, Spatial audio",
    note = "Publisher Copyright: {\textcopyright} 2025 IEEE.; IEEE International Conference on Acoustics, Speech, and Signal Processing ; Conference date: 06-04-2025 Through 11-04-2025",
    year = "2025",
    doi = "10.1109/ICASSP49660.2025.10887869",
    language = "English",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "1--5",
    booktitle = "ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States"
    }

  • K. Khorrami and O. Räsänen, "A model of early word acquisition based on realistic-scale audiovisual naming events," Speech Communication, vol. 167, 2025. doi:10.1016/j.specom.2024.103169
    [BibTeX] [Abstract]

    Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that solely learns from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception, demonstrating how learning can operate without assuming any prior linguistic capabilities.

    @article{2025_ICA,
    author = {Khorrami, Khazar and R{\"a}s{\"a}nen, Okko},
    title = "A model of early word acquisition based on realistic-scale audiovisual naming events",
    abstract = "Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that solely learns from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception, demonstrating how learning can operate without assuming any prior linguistic capabilities.",
    keywords = "Associative learning, Computational modeling, Statistical learning, Word acquisition",
    note = "Publisher Copyright: {\textcopyright} 2024 The Authors",
    year = "2025",
    month = "February",
    doi = "10.1016/j.specom.2024.103169",
    language = "English",
    volume = "167",
    journal = "Speech Communication",
    issn = "0167-6393",
    publisher = "Elsevier B.V."
    }

  • Y. Liu, M. Kiran Reddy, M. K. Yagnavajjula, O. Räsänen, P. Alku, T. Ikävalko, T. Hakanpää, A. Öyry, and A. Laukkanen, "Automatic Classification of Strain in the Singing Voice Using Machine Learning," Journal of Voice, 2025. doi:10.1016/j.jvoice.2025.03.040
    [BibTeX]
    @article{2025_a,
    author = {Liu, Yuanyuan and Kiran Reddy, Mittapalle and Yagnavajjula, Madhu Keerthana and R{\"a}s{\"a}nen, Okko and Alku, Paavo and Ik{\"a}valko, Tero and Hakanp{\"a}{\"a}, Tua and {\"O}yry, Aleksi and Laukkanen, Anne-Maria},
    title = "Automatic Classification of Strain in the Singing Voice Using Machine Learning",
    year = "2025",
    doi = "10.1016/j.jvoice.2025.03.040",
    language = "English",
    journal = "Journal of Voice",
    issn = "0892-1997",
    publisher = "Elsevier"
    }

  • A. Mesaros, R. Serizel, T. Heittola, T. Virtanen, and M. D. Plumbley, "A decade of DCASE: Achievements, practices, evaluations and future challenges," in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2025, p. 1–5. doi:10.1109/ICASSP49660.2025.10887673
    [BibTeX] [Abstract]

    This paper introduces briefly the history and growth of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, workshop, research area and research community. Created in 2013 as a data evaluation challenge, DCASE has become a major research topic in the Audio and Acoustic Signal Processing area. Its success comes from a combination of factors: the challenge offers a large variety of tasks that are renewed each year; and the workshop offers a channel for dissemination of related work, engaging a young and dynamic community. At the same time, DCASE faces its own challenges, growing and expanding to different areas. One of the core principles of DCASE is open science and reproducibility: publicly available datasets, baseline systems, technical reports and workshop publications. While the DCASE challenge and workshop are independent of IEEE SPS, the challenge receives annual endorsement from the AASP TC, and the DCASE community contributes significantly to the ICASSP flagship conference and the success of SPS in many of its activities.

    @inproceedings{2025_ICASSP_a,
    author = "Mesaros, Annamaria and Serizel, Romain and Heittola, Toni and Virtanen, Tuomas and Plumbley, Mark D.",
    editor = "Rao, Bhaskar D and Trancoso, Isabel and Sharma, Gaurav and Mehta, Neelesh B.",
    title = "A decade of DCASE: Achievements, practices, evaluations and future challenges",
    abstract = "This paper introduces briefly the history and growth of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, workshop, research area and research community. Created in 2013 as a data evaluation challenge, DCASE has become a major research topic in the Audio and Acoustic Signal Processing area. Its success comes from a combination of factors: the challenge offers a large variety of tasks that are renewed each year; and the workshop offers a channel for dissemination of related work, engaging a young and dynamic community. At the same time, DCASE faces its own challenges, growing and expanding to different areas. One of the core principles of DCASE is open science and reproducibility: publicly available datasets, baseline systems, technical reports and workshop publications. While the DCASE challenge and workshop are independent of IEEE SPS, the challenge receives annual endorsement from the AASP TC, and the DCASE community contributes significantly to the ICASSP flagship conference and the success of SPS in many of its activities.",
    keywords = "AASP Challenges, DCASE Challenge, DCASE Workshop",
    note = "Publisher Copyright: {\textcopyright} 2025 IEEE.; IEEE International Conference on Acoustics, Speech, and Signal Processing ; Conference date: 06-04-2025 Through 11-04-2025",
    year = "2025",
    doi = "10.1109/ICASSP49660.2025.10887673",
    language = "English",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "1--5",
    booktitle = "ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States"
    }

  • M. Neri and T. Virtanen, "Multi-channel Replay Speech Detection using an Adaptive Learnable Beamformer," IEEE Open Journal of Signal Processing, vol. 6, p. 530–535, 2025. doi:10.1109/OJSP.2025.3568758
    [BibTeX] [Abstract]

    Replay attacks belong to the class of severe threats against voice-controlled systems, exploiting the easy accessibility of speech signals by recorded and replayed speech to grant unauthorized access to sensitive data. In this work, we propose a multi-channel neural network architecture called M-ALRAD for the detection of replay attacks based on spatial audio features. This approach integrates a learnable adaptive beamformer with a convolutional recurrent neural network, allowing for joint optimization of spatial filtering and classification. Experiments have been carried out on the ReMASC dataset, which is a state-of-the-art multi-channel replay speech detection dataset encompassing four microphones with diverse array configurations and four environments. Results on the ReMASC dataset show the superiority of the approach compared to the state-of-the-art and yield substantial improvements for challenging acoustic environments. In addition, we demonstrate that our approach is able to better generalize to unseen environments with respect to prior studies.

    @article{2025_SP_c,
    author = "Neri, Michael and Virtanen, Tuomas",
    title = "Multi-channel Replay Speech Detection using an Adaptive Learnable Beamformer",
    abstract = "Replay attacks belong to the class of severe threats against voice-controlled systems, exploiting the easy accessibility of speech signals by recorded and replayed speech to grant unauthorized access to sensitive data. In this work, we propose a multi-channel neural network architecture called M-ALRAD for the detection of replay attacks based on spatial audio features. This approach integrates a learnable adaptive beamformer with a convolutional recurrent neural network, allowing for joint optimization of spatial filtering and classification. Experiments have been carried out on the ReMASC dataset, which is a state-of-the-art multi-channel replay speech detection dataset encompassing four microphones with diverse array configurations and four environments. Results on the ReMASC dataset show the superiority of the approach compared to the state-of-the-art and yield substantial improvements for challenging acoustic environments. In addition, we demonstrate that our approach is able to better generalize to unseen environments with respect to prior studies.",
    keywords = "Beamforming, Physical Access, Replay attack, Spatial Audio, Voice anti-spoofing",
    note = "Publisher Copyright: {\textcopyright} 2020 IEEE.",
    year = "2025",
    doi = "10.1109/OJSP.2025.3568758",
    language = "English",
    volume = "6",
    pages = "530--535",
    journal = "IEEE Open Journal of Signal Processing",
    issn = "2644-1322",
    publisher = "IEEE"
    }
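
    Illustrative sketch (not the paper's M-ALRAD implementation, whose beamformer is adaptive, i.e. input-dependent, and feeds a convolutional recurrent network): a filter-and-sum beamformer with learnable complex weights over a multichannel STFT, trained end-to-end with the downstream classifier. Names and shapes below are assumptions for illustration only.

    import torch

    class LearnableBeamformer(torch.nn.Module):
        """Fixed (non-adaptive) filter-and-sum beamformer: one learnable complex
        weight per (channel, frequency) bin, optimized jointly with the
        downstream replay-detection loss."""
        def __init__(self, n_channels: int, n_freq: int):
            super().__init__()
            self.weights = torch.nn.Parameter(
                torch.randn(n_channels, n_freq, dtype=torch.cfloat) / n_channels
            )

        def forward(self, stft: torch.Tensor) -> torch.Tensor:
            # stft: (batch, channels, freq, time), complex-valued
            w = self.weights.unsqueeze(0).unsqueeze(-1)   # (1, channels, freq, 1)
            return (w.conj() * stft).sum(dim=1)           # (batch, freq, time)

    # Example: beamform a 4-channel, 257-bin, 100-frame STFT into a single channel.
    bf = LearnableBeamformer(n_channels=4, n_freq=257)
    out = bf(torch.randn(2, 4, 257, 100, dtype=torch.cfloat))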

  • O. Räsänen and D. Kocharov, "A pipeline for stochastic and controlled generation of realistic language input for simulating infant language acquisition," Behavior Research Methods, vol. 57, iss. 10, 2025. doi:10.3758/s13428-025-02772-6
    [BibTeX] [Abstract]

    Computational models of early language development involve implementing theories of learning as functional learning algorithms, exposing these models to realistic language input, and comparing learning outcomes to those in infants. While recent research has made major strides in developing more powerful learning models and evaluation protocols grounded in infant data, models are still predominantly trained with non-naturalistic input data, such as crowd-sourced read speech or text transcripts. This is due to the lack of suitable child-directed speech (CDS) corpora in terms of scale and quality. In parallel, the question of how properties and individual variability in language input affect learning outcomes is an active area of empirical research, underlining the need for realistic yet controllable data for modeling such phenomena. This paper presents a solution to the training data problem through stochastic generation of naturalistic CDS data using statistical models, thereby enabling controlled computational simulations with naturalistic input. We provide a proof-of-concept demonstration of the approach by showing how naturalistic CDS transcripts can be generated with a language model conditioned on recipient information (here, infant age), and how text-to-speech systems can be used to convert the transcripts to high-quality speech with a controllable speaking style. We also conduct modeling experiments with generated speech corpora by varying different aspects of the data, showing how this maps into different learning outcomes, thereby demonstrating the feasibility of the approach for controlled language learning simulations. Finally, we discuss the limitations of using synthetic data in general, and of the present proof-of-concept pipeline in particular.

    @article{2025_d,
    author = {R{\"a}s{\"a}nen, Okko and Kocharov, Daniil},
    title = "A pipeline for stochastic and controlled generation of realistic language input for simulating infant language acquisition",
    abstract = "Computational models of early language development involve implementing theories of learning as functional learning algorithms, exposing these models to realistic language input, and comparing learning outcomes to those in infants. While recent research has made major strides in developing more powerful learning models and evaluation protocols grounded in infant data, models are still predominantly trained with non-naturalistic input data, such as crowd-sourced read speech or text transcripts. This is due to the lack of suitable child-directed speech (CDS) corpora in terms of scale and quality. In parallel, the question of how properties and individual variability in language input affect learning outcomes is an active area of empirical research, underlining the need for realistic yet controllable data for modeling such phenomena. This paper presents a solution to the training data problem through stochastic generation of naturalistic CDS data using statistical models, thereby enabling controlled computational simulations with naturalistic input. We provide a proof-of-concept demonstration of the approach by showing how naturalistic CDS transcripts can be generated with a language model conditioned on recipient information (here, infant age), and how text-to-speech systems can be used to convert the transcripts to high-quality speech with a controllable speaking style. We also conduct modeling experiments with generated speech corpora by varying different aspects of the data, showing how this maps into different learning outcomes, thereby demonstrating the feasibility of the approach for controlled language learning simulations. Finally, we discuss the limitations of using synthetic data in general, and of the present proof-of-concept pipeline in particular.",
    keywords = "Computational modeling, Language development, Language resources, Speech processing",
    note = "Publisher Copyright: {\textcopyright} The Author(s) 2025.",
    year = "2025",
    month = "October",
    doi = "10.3758/s13428-025-02772-6",
    language = "English",
    volume = "57",
    journal = "BEHAVIOR RESEARCH METHODS",
    issn = "1554-351X",
    publisher = "Springer Nature",
    number = "10"
    }

  • O. Räsänen, M. Airaksinen, V. Marchi, O. Chorna, A. Guzzetta, and F. Festante, "Motherese Directed at Prelinguistic Infants at Risk for Neurological Disorders: An Exploratory Study," Journal of Child Language, 2025. doi:10.1017/S0305000924000217
    [BibTeX] [Abstract]

    To investigate how a high risk for infant neurological impairment affects the quality of infant verbal interactions, and in particular properties of infant-directed speech, spontaneous interactions between 14 mothers and their 4.5-month-old infants at high risk for neurological disorders (7 female) were recorded and acoustically compared with those of 14 dyads with typically developing infants (8 female). Mothers of at-risk infants had proportionally less voicing, and the proportion of voicing decreased with increasing severity of the infants' long-term outcome. Follow-up analysis based on manual annotation of phonation style revealed breathy phonation as more common toward infants with more severe long-term outcomes (N=7; 44.7% of speech) than controls (N=14; 22.0%; p=0.005) or at-risk infants with typical or mildly abnormal long-term outcomes (N=7; 16.5%; p=0.002). The results indicate that maternal phonation style during early dyadic interactions is affected by the infant's neurological condition.

    @article{2025_f,
    author = {R{\"a}s{\"a}nen, Okko and Airaksinen, Manu and Marchi, Viviana and Chorna, Olena and Guzzetta, Andrea and Festante, Fabrizia},
    title = "Motherese Directed at Prelinguistic Infants at Risk for Neurological Disorders: An Exploratory Study",
    abstract = "To investigate how a high risk for infant neurological impairment affects the quality of infant verbal interactions, and in particular properties of infant-directed speech, spontaneous interactions between 14 mothers and their 4.5-month-old infants at high risk for neurological disorders (7 female) were recorded and acoustically compared with those of 14 dyads with typically developing infants (8 female). Mothers of at-risk infants had proportionally less voicing, and the proportion of voicing decreased with increasing severity of the infants' long-term outcome. Follow-up analysis based on manual annotation of phonation style revealed breathy phonation as more common toward infants with more severe long-term outcomes (N=7; 44.7\\% of speech) than controls (N=14; 22.0\\%; p=0.005) or at-risk infants with typical or mildly abnormal long-term outcomes (N=7; 16.5\\%; p=0.002). The results indicate that maternal phonation style during early dyadic interactions is affected by the infant's neurological condition.",
    keywords = "acoustic analysis, infant-directed speech, mother-infant interaction, neurological impairment, phonation style",
    note = "Publisher Copyright: {\textcopyright} The Author(s), 2025. Published by Cambridge University Press.",
    year = "2025",
    doi = "10.1017/S0305000924000217",
    language = "English",
    journal = "Journal of Child Language",
    issn = "0305-0009",
    publisher = "Cambridge University Press"
    }

  • P. Sudarsanam, I. Martin-Morato, A. Hakala, and T. Virtanen, "AVCaps: An Audio-Visual Dataset With Modality-Specific Captions," IEEE Open Journal of Signal Processing, vol. 6, p. 691–704, 2025. doi:10.1109/OJSP.2025.3578296
    [BibTeX] [Abstract]

    This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual content of each clip, crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that the modality-specific captions allow us to quantitatively assess how well a model understands audio and visual information from a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively compared to a model trained on crowdsourced audio-visual captions. This model achieves a 14% higher Sentence-BERT similarity on crowdsourced audio captions compared to a model trained on crowdsourced audio-visual captions, which are typically more biased towards visual information. We also discuss the possibilities in multimodal representation learning, question answering, developing new video captioning metrics, and generative AI that this dataset unlocks. The dataset is available publicly at Zenodo and Hugging Face.

    @article{2025_SP,
    author = "Sudarsanam, Parthasaarathy and Martin-Morato, Irene and Hakala, Aapo and Virtanen, Tuomas",
    title = "AVCaps: An Audio-Visual Dataset With Modality-Specific Captions",
    abstract = "This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual content of each clip, crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that the modality-specific captions allow us to quantitatively assess how well a model understands audio and visual information from a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively compared to a model trained on crowdsourced audio-visual captions. This model achieves a 14\\% higher Sentence-BERT similarity on crowdsourced audio captions compared to a model trained on crowdsourced audio-visual captions, which are typically more biased towards visual information. We also discuss the possibilities in multimodal representation learning, question answering, developing new video captioning metrics, and generative AI that this dataset unlocks. The dataset is available publicly at Zenodo and Hugging Face.",
    keywords = "Audio-visual, AVCaps, Captioning, Dataset, Multimodal, Retrieval",
    note = "Publisher Copyright: {\textcopyright} 2020 IEEE.",
    year = "2025",
    doi = "10.1109/OJSP.2025.3578296",
    language = "English",
    volume = "6",
    pages = "691--704",
    journal = "IEEE Open Journal of Signal Processing",
    issn = "2644-1322",
    publisher = "IEEE"
    }

  • A. Triantafyllopoulos, I. Tsangko, A. Gebhard, A. Mesaros, T. Virtanen, and B. W. Schuller, "Computer Audition: From Task-Specific Machine Learning to Foundation Models," Proceedings of the IEEE, vol. 113, iss. 4, p. 317–343, 2025. doi:10.1109/JPROC.2025.3593952
    [BibTeX] [Abstract]

    Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition—i.e., the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily available interaction with human users. Naturally, these promises have created substantial excitement in the audio community and have led to a wave of early attempts to build new, general-purpose FMs for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines toward auditory FMs. Our work highlights the key operating principles that underpin those models and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.

    @article{2025_b,
    author = {Triantafyllopoulos, Andreas and Tsangko, Iosif and Gebhard, Alexander and Mesaros, Annamaria and Virtanen, Tuomas and Schuller, Bj{\"o}rn W.},
    title = "Computer Audition: From Task-Specific Machine Learning to Foundation Models",
    abstract = "Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition—i.e., the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily available interaction with human users. Naturally, these promises have created substantial excitement in the audio community and have led to a wave of early attempts to build new, general-purpose FMs for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines toward auditory FMs. Our work highlights the key operating principles that underpin those models and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.",
    keywords = "Acoustic scene classification, artificial intelligence (AI), audio captioning (AC), computational audio analysis, computer audition, foundation models (FMs), large audio models, machine listening, sound event detection (SED)",
    note = "Publisher Copyright: {\textcopyright} 1963-2012 IEEE.",
    year = "2025",
    month = "August",
    doi = "10.1109/JPROC.2025.3593952",
    language = "English",
    volume = "113",
    pages = "317 -- 343",
    journal = "Proceedings of the IEEE",
    issn = "0018-9219",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "4"
    }

  • E. Vaaras, M. Airaksinen, and O. Räsänen, "IAR 2.0: An Algorithm for Refining Inconsistent Annotations for Time-Series Data Using Discriminative Classifiers," IEEE Access, vol. 13, p. 19979–19995, 2025. doi:10.1109/ACCESS.2025.3534637
    [BibTeX] [Abstract]

    The performance of discriminative machine-learning classifiers, such as neural networks, is limited by training label inconsistencies. Even expert-based annotations can suffer from label inconsistencies, especially in the case of ambiguous phenomena-to-annotate. To address this, we propose a novel algorithm, iterative annotation refinement (IAR) 2.0, for refining inconsistent annotations for time-series data. IAR 2.0 uses a procedure that utilizes discriminative classifiers to iteratively combine original annotations with increasingly accurate posterior estimates of classes present in the data. Unlike most existing label refinement approaches, IAR 2.0 offers a simpler yet effective solution for resolving ambiguities in training labels, working with real label noise on time-series data instead of synthetic label noise on image data. We demonstrate the effectiveness of our algorithm through five distinct classification tasks on two highly distinct data modalities. As a result, we show that the labels produced by IAR 2.0 systematically improve classifier performance compared to using the original labels or a previous state-of-the-art method for label refinement. We also conduct a set of controlled simulations to systematically investigate when IAR 2.0 fails to improve on the original training labels. The simulation results demonstrate that IAR 2.0 improves performance in nearly all tested conditions. We also find that the decrease in performance when IAR 2.0 fails is small compared to the average performance gain when IAR 2.0 succeeds, encouraging the use of IAR 2.0 even when the nature of data is unknown. The code is freely available at https://github.com/SPEECHCOG/IAR_2.

    @article{2025,
    author = {Vaaras, Einari and Airaksinen, Manu and R{\"a}s{\"a}nen, Okko},
    title = "IAR 2.0: An Algorithm for Refining Inconsistent Annotations for Time-Series Data Using Discriminative Classifiers",
    abstract = "The performance of discriminative machine-learning classifiers, such as neural networks, is limited by training label inconsistencies. Even expert-based annotations can suffer from label inconsistencies, especially in the case of ambiguous phenomena-to-annotate. To address this, we propose a novel algorithm, iterative annotation refinement (IAR) 2.0, for refining inconsistent annotations for time-series data. IAR 2.0 uses a procedure that utilizes discriminative classifiers to iteratively combine original annotations with increasingly accurate posterior estimates of classes present in the data. Unlike most existing label refinement approaches, IAR 2.0 offers a simpler yet effective solution for resolving ambiguities in training labels, working with real label noise on time-series data instead of synthetic label noise on image data. We demonstrate the effectiveness of our algorithm through five distinct classification tasks on two highly distinct data modalities. As a result, we show that the labels produced by IAR 2.0 systematically improve classifier performance compared to using the original labels or a previous state-of-the-art method for label refinement. We also conduct a set of controlled simulations to systematically investigate when IAR 2.0 fails to improve on the original training labels. The simulation results demonstrate that IAR 2.0 improves performance in nearly all tested conditions. We also find that the decrease in performance when IAR 2.0 fails is small compared to the average performance gain when IAR 2.0 succeeds, encouraging the use of IAR 2.0 even when the nature of data is unknown. The code is freely available at https://github.com/SPEECHCOG/IAR\\_2.",
    keywords = "Annotation refinement, daylong audio recordings, discriminative classifiers, human activity recognition, inconsistent labels, label refinement, movement sensors, multi-sensor inertial measurement unit, speech emotion recognition, time-series data",
    note = "Publisher Copyright: {\textcopyright} 2025 The Authors.",
    year = "2025",
    doi = "10.1109/ACCESS.2025.3534637",
    language = "English",
    volume = "13",
    pages = "19979--19995",
    journal = "IEEE Access",
    issn = "2169-3536",
    publisher = "Institute of Electrical and Electronics Engineers Inc."
    }
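
    Schematic sketch of the iterative refinement idea summarized above: train a discriminative classifier on the current labels, obtain class posteriors, and blend them with the original annotations before the next round. The blending rule, the classifier, and the fixed mixing weight alpha below are illustrative placeholders, not the published IAR 2.0 algorithm.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def refine_labels(features, onehot_labels, n_iter=5, alpha=0.5):
        """Iteratively mix original (possibly inconsistent) one-hot annotations
        with classifier posteriors; return refined hard labels.
        Assumes every class stays represented across iterations."""
        soft = onehot_labels.astype(float)
        for _ in range(n_iter):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(features, soft.argmax(axis=1))        # train on current labels
            posteriors = clf.predict_proba(features)      # posterior class estimates
            soft = alpha * onehot_labels + (1 - alpha) * posteriors
        return soft.argmax(axis=1)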

  • E. Vaaras, M. Airaksinen, and O. Räsänen, "PFML: Self-Supervised Learning of Time-Series Data Without Representation Collapse," IEEE Access, vol. 13, p. 60233–60244, 2025. doi:10.1109/ACCESS.2025.3556957
    [BibTeX] [Abstract]

    Self-supervised learning (SSL) is a data-driven learning approach that utilizes the innate structure of the data to guide the learning process. In contrast to supervised learning, which depends on external labels, SSL utilizes the inherent characteristics of the data to produce its own supervisory signal. However, one frequent issue with SSL methods is representation collapse, where the model outputs a constant input-invariant feature representation. This issue hinders the potential application of SSL methods to new data modalities, as trying to avoid representation collapse wastes researchers’ time and effort. This paper introduces a novel SSL algorithm for time-series data called Prediction of Functionals from Masked Latents (PFML). Instead of predicting masked input signals or their latent representations directly, PFML operates by predicting statistical functionals of the input signal corresponding to masked embeddings, given a sequence of unmasked embeddings. The algorithm is designed to avoid representation collapse, rendering it straightforwardly applicable to different time-series data domains, such as novel sensor modalities in clinical data. We demonstrate the effectiveness of PFML through complex, real-life classification tasks across three different data modalities: infant posture and movement classification from multi-sensor inertial measurement unit data, emotion recognition from speech data, and sleep stage classification from EEG data. The results show that PFML is superior to a conceptually similar SSL method and a contrastive learning-based SSL method. Additionally, PFML is on par with the current state-of-the-art SSL method, while also being conceptually simpler and without suffering from representation collapse.

    @article{2025_e,
    author = {Vaaras, Einari and Airaksinen, Manu and R{\"a}s{\"a}nen, Okko},
    title = "PFML: Self-Supervised Learning of Time-Series Data Without Representation Collapse",
    abstract = "Self-supervised learning (SSL) is a data-driven learning approach that utilizes the innate structure of the data to guide the learning process. In contrast to supervised learning, which depends on external labels, SSL utilizes the inherent characteristics of the data to produce its own supervisory signal. However, one frequent issue with SSL methods is representation collapse, where the model outputs a constant input-invariant feature representation. This issue hinders the potential application of SSL methods to new data modalities, as trying to avoid representation collapse wastes researchers{\textquoteright} time and effort. This paper introduces a novel SSL algorithm for time-series data called Prediction of Functionals from Masked Latents (PFML). Instead of predicting masked input signals or their latent representations directly, PFML operates by predicting statistical functionals of the input signal corresponding to masked embeddings, given a sequence of unmasked embeddings. The algorithm is designed to avoid representation collapse, rendering it straightforwardly applicable to different time-series data domains, such as novel sensor modalities in clinical data. We demonstrate the effectiveness of PFML through complex, real-life classification tasks across three different data modalities: infant posture and movement classification from multi-sensor inertial measurement unit data, emotion recognition from speech data, and sleep stage classification from EEG data. The results show that PFML is superior to a conceptually similar SSL method and a contrastive learning-based SSL method. Additionally, PFML is on par with the current state-of-the-art SSL method, while also being conceptually simpler and without suffering from representation collapse.",
    keywords = "EEG data, embedding masking, human activity recognition, multi-sensor inertial measurement unit data, representation collapse, self-supervised learning, sleep stage classification, speech emotion recognition, statistical functionals, time-series data",
    note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",
    year = "2025",
    doi = "10.1109/ACCESS.2025.3556957",
    language = "English",
    volume = "13",
    pages = "60233--60244",
    journal = "IEEE Access",
    issn = "2169-3536",
    publisher = "Institute of Electrical and Electronics Engineers Inc."
    }
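
    The sketch below illustrates the kind of training target described above: for masked frames of a time-series signal, predict a small set of statistical functionals of the underlying frames rather than the raw samples or their latents. The functional set (mean, std, min, max), shapes, and loss are assumptions for illustration, not the exact choices of the paper.

    import torch

    def frame_functionals(frames: torch.Tensor) -> torch.Tensor:
        """frames: (batch, n_frames, frame_len) -> (batch, n_frames, 4)
        Per-frame statistical functionals: mean, std, min, max (illustrative set)."""
        return torch.stack(
            [frames.mean(-1), frames.std(-1), frames.amin(-1), frames.amax(-1)],
            dim=-1,
        )

    def masked_functional_loss(predicted: torch.Tensor,
                               frames: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
        """MSE between predicted functionals and true functionals, computed only
        over masked frame positions (mask: boolean, shape (batch, n_frames))."""
        target = frame_functionals(frames)
        return torch.nn.functional.mse_loss(predicted[mask], target[mask])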

  • H. Xie, K. Khorrami, O. Räsänen, and T. Virtanen, "Text-based Audio Retrieval by Learning from Similarities between Audio Captions," IEEE Signal Processing Letters, vol. 32, p. 221–225, 2025. doi:10.1109/LSP.2024.3511414
    [BibTeX] [Abstract]

    This paper proposes to use similarities of audio captions for estimating audio-caption relevances to be used for training text-based audio retrieval systems. Current audio-caption datasets (e.g., Clotho) contain audio samples paired with annotated captions, but lack relevance information about audio samples and captions beyond the annotated ones. Besides, mainstream approaches (e.g., CLAP) usually treat the annotated pairs as positives and consider all other audio-caption combinations as negatives, assuming a binary relevance between audio samples and captions. To infer the relevance between audio samples and arbitrary captions, we propose a method that computes non-binary audio-caption relevance scores based on the textual similarities of audio captions. We measure textual similarities of audio captions by calculating the cosine similarity of their Sentence-BERT embeddings and then transform these similarities into audio-caption relevance scores using a logistic function, thereby linking audio samples through their annotated captions to all other captions in the dataset. To integrate the computed relevances into training, we employ a listwise ranking objective, where relevance scores are converted into probabilities of ranking audio samples for a given textual query. We show the effectiveness of the proposed method by demonstrating improvements in text-based audio retrieval compared to methods that use binary audio-caption relevances for training.

    @article{2025_SP_b,
    author = {Xie, Huang and Khorrami, Khazar and R{\"a}s{\"a}nen, Okko and Virtanen, Tuomas},
    title = "Text-based Audio Retrieval by Learning from Similarities between Audio Captions",
    abstract = "This paper proposes to use similarities of audio captions for estimating audio-caption relevances to be used for training text-based audio retrieval systems. Current audio-caption datasets (e.g., Clotho) contain audio samples paired with annotated captions, but lack relevance information about audio samples and captions beyond the annotated ones. Besides, mainstream approaches (e.g., CLAP) usually treat the annotated pairs as positives and consider all other audio-caption combinations as negatives, assuming a binary relevance between audio samples and captions. To infer the relevance between audio samples and arbitrary captions, we propose a method that computes non-binary audio-caption relevance scores based on the textual similarities of audio captions. We measure textual similarities of audio captions by calculating the cosine similarity of their Sentence-BERT embeddings and then transform these similarities into audio-caption relevance scores using a logistic function, thereby linking audio samples through their annotated captions to all other captions in the dataset. To integrate the computed relevances into training, we employ a listwise ranking objective, where relevance scores are converted into probabilities of ranking audio samples for a given textual query. We show the effectiveness of the proposed method by demonstrating improvements in text-based audio retrieval compared to methods that use binary audio-caption relevances for training.",
    keywords = "audio retrieval, Audio-caption relevance, listwise ranking, textual similarity",
    note = "Publisher Copyright: {\textcopyright} 1994-2012 IEEE.",
    year = "2025",
    doi = "10.1109/LSP.2024.3511414",
    language = "English",
    volume = "32",
    pages = "221--225",
    journal = "IEEE Signal Processing Letters",
    issn = "1070-9908",
    publisher = "Institute of Electrical and Electronics Engineers Inc."
    }
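
    The abstract above describes a concrete recipe: caption-caption cosine similarities from Sentence-BERT embeddings are mapped to non-binary audio-caption relevance scores with a logistic function. The snippet below is a minimal illustrative sketch of that idea, not the authors' code; the encoder name and the logistic steepness and midpoint are assumptions chosen for the example.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Illustrative sketch (not the paper's implementation): map textual
    # similarities between captions to soft audio-caption relevance scores.
    captions = [
        "Birds are chirping while a car passes by",
        "A vehicle drives past as birds sing",
        "Heavy rain falls on a metal roof",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed Sentence-BERT encoder
    emb = model.encode(captions, normalize_embeddings=True)
    sim = emb @ emb.T                                 # cosine similarities in [-1, 1]

    def logistic(x, k=10.0, x0=0.5):
        """Assumed logistic mapping from textual similarity to relevance in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-k * (x - x0)))

    relevance = logistic(sim)  # non-binary relevance linking items via their captions
    print(np.round(relevance, 2))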

2024

  • M. Airaksinen, E. Vaaras, L. Haataja, O. Räsänen, and S. Vanhatalo, "Automatic assessment of infant carrying and holding using at-home wearable recordings," Scientific Reports, vol. 14, iss. 1, 2024. doi:10.1038/s41598-024-54536-5
    [BibTeX] [Abstract]

    Assessing infant carrying and holding (C/H), or physical infant-caregiver interaction, is important for a wide range of contexts in development research. An automated detection and quantification of infant C/H is particularly needed in long term at-home studies where development of infants' neurobehavior is measured using wearable devices. Here, we first developed a phenomenological categorization for physical infant-caregiver interactions to support five different definitions of C/H behaviors. Then, we trained and assessed deep learning-based classifiers for their automatic detection from multi-sensor wearable recordings that were originally used for mobile assessment of infants' motor development. Our results show that an automated C/H detection is feasible at few-second temporal accuracy. With the best C/H definition, the automated detector shows 96% accuracy and 0.56 kappa, which is slightly less than the video-based inter-rater agreement between trained human experts (98% accuracy, 0.77 kappa). The classifier performance varies with C/H definition reflecting the extent to which infants' movements are present in each C/H variant. A systematic benchmarking experiment shows that the widely used actigraphy-based method ignores the normally occurring C/H behaviors. Finally, we show proof-of-concept for the utility of the novel classifier in studying C/H behavior across infant development. Particularly, we show that matching the C/H detections to individuals' gross motor ability discloses novel insights to infant-parent interaction.

    @article{2024,
    author = {Airaksinen, Manu and Vaaras, Einari and Haataja, Leena and R{\"a}s{\"a}nen, Okko and Vanhatalo, Sampsa},
    title = "Automatic assessment of infant carrying and holding using at-home wearable recordings",
    abstract = "Assessing infant carrying and holding (C/H), or physical infant-caregiver interaction, is important for a wide range of contexts in development research. An automated detection and quantification of infant C/H is particularly needed in long term at-home studies where development of infants{\textquoteright} neurobehavior is measured using wearable devices. Here, we first developed a phenomenological categorization for physical infant-caregiver interactions to support five different definitions of C/H behaviors. Then, we trained and assessed deep learning-based classifiers for their automatic detection from multi-sensor wearable recordings that were originally used for mobile assessment of infants{\textquoteright} motor development. Our results show that an automated C/H detection is feasible at few-second temporal accuracy. With the best C/H definition, the automated detector shows 96\\% accuracy and 0.56 kappa, which is slightly less than the video-based inter-rater agreement between trained human experts (98\\% accuracy, 0.77 kappa). The classifier performance varies with C/H definition reflecting the extent to which infants{\textquoteright} movements are present in each C/H variant. A systematic benchmarking experiment shows that the widely used actigraphy-based method ignores the normally occurring C/H behaviors. Finally, we show proof-of-concept for the utility of the novel classifier in studying C/H behavior across infant development. Particularly, we show that matching the C/H detections to individuals{\textquoteright} gross motor ability discloses novel insights to infant-parent interaction.",
    keywords = "MAIJU",
    note = "Publisher Copyright: {\textcopyright} The Author(s) 2024.",
    year = "2024",
    doi = "10.1038/s41598-024-54536-5",
    language = "English",
    volume = "14",
    journal = "Scientific Reports",
    issn = "2045-2322",
    publisher = "Nature Research",
    number = "1"
    }

  • J. Coffey, O. Räsänen, C. Scaff, and A. Cristia, "The difficulty and importance of estimating the lower and upper bounds of infant speech exposure," in Proceedings of the Interspeech 2024, 2024, p. 3615–3619.
    [BibTeX]
    @inproceedings{2024_Interspeech,
    author = {Coffey, Joseph and R{\"a}s{\"a}nen, Okko and Scaff, Camila and Cristia, Alejandrina},
    title = "The difficulty and importance of estimating the lower and upper bounds of infant speech exposure",
    year = "2024",
    language = "English",
    series = "Proceedings of the International Conference on Spoken Language Processing",
    publisher = "Interspeech",
    pages = "3615--3619",
    booktitle = "Proceedings of the Interspeech 2024",
    note = "Annual Conference of the International Speech Communication Association ; Conference date: 01-09-2024 Through 05-09-2024"
    }

  • R. Convey, T. Ihalainen, Y. Liu, O. Räsänen, S. Ylinen, and N. Penttilä, "A comparative study of automatic vowel articulation index and auditory-perceptual assessments of speech intelligibility in Parkinson’s disease," International Journal of Speech-Language Pathology, vol. 26, iss. 5, p. 663–673, 2024. doi:10.1080/17549507.2023.2251725
    [BibTeX]
    @article{2024_b,
    author = {Convey, Rachel and Ihalainen, Tiina and Liu, Yuanyuan and R{\"a}s{\"a}nen, Okko and Ylinen, Sari and Penttil{\"a}, Nelly},
    title = "A comparative study of automatic vowel articulation index and auditory-perceptual assessments of speech intelligibility in Parkinson{\textquoteright}s disease",
    year = "2024",
    doi = "10.1080/17549507.2023.2251725",
    language = "English",
    volume = "26",
    pages = "663--673",
    journal = "INTERNATIONAL JOURNAL OF SPEECH-LANGUAGE PATHOLOGY",
    issn = "1754-9507",
    publisher = "Informa Healthcare",
    number = "5"
    }

  • A. Cristia, L. Gautheron, Z. Zhang, B. Schuller, C. Scaff, C. Rowland, O. Räsänen, L. Peurey, M. Lavechin, W. Havard, C. M. Fausey, M. Cychosz, E. Bergelson, H. Anderson, N. Al Futaisi, and M. Soderstrom, "Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline," Behavior Research Methods, vol. 56, p. 8588–8607, 2024. doi:10.3758/s13428-024-02493-2
    [BibTeX] [Abstract]

    Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children's language input (typically speech from adults) and children's language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-/Spanish-, and Quechua-/Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, intraclass correlation coefficient attributed to the child identity [Child ICC], was < 50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.

    @article{2024_e,
    author = {Cristia, Alejandrina and Gautheron, Lucas and Zhang, Zixing and Schuller, Bj{\"o}rn and Scaff, Camila and Rowland, Caroline and R{\"a}s{\"a}nen, Okko and Peurey, Loann and Lavechin, Marvin and Havard, William and Fausey, Caitlin M. and Cychosz, Margaret and Bergelson, Elika and Anderson, Heather and Al Futaisi, Najla and Soderstrom, Melanie},
    title = "Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline",
    abstract = "Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children{\textquoteright}s language input (typically speech from adults) and children{\textquoteright}s language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-/Spanish-, and Quechua-/Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, intraclass correlation coefficient attributed to the child identity [Child ICC], was < 50\\% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.",
    keywords = "Accuracy, Big data, Daylong recordings, Speech technology",
    note = "Publisher Copyright: {\textcopyright} The Psychonomic Society, Inc. 2024.",
    year = "2024",
    doi = "10.3758/s13428-024-02493-2",
    language = "English",
    volume = "56",
    pages = "8588–8607",
    journal = "BEHAVIOR RESEARCH METHODS",
    issn = "1554-351X",
    publisher = "Springer Nature"
    }

  • W. Dai, X. Li, A. Politis, and T. Virtanen, "Reference Channel Selection by Multi-Channel Masking for End-to-End Multi-Channel Speech Enhancement," in 2024 32nd European Signal Processing Conference (EUSIPCO), United States, 2024, p. 241–245. doi:10.23919/EUSIPCO63174.2024.10715275
    [BibTeX] [Abstract]

    In end-to-end multi-channel speech enhancement, the traditional approach of designating one microphone signal as the reference for processing may not always yield optimal results. The limitation is particularly in scenarios with large distributed microphone arrays with varying speaker-to-microphone distances or compact, highly directional microphone arrays where speaker or microphone positions change over time. Current mask-based methods often fix the reference channel during training, which makes it not possible to adaptively select the reference channel for optimal performance. To address this problem, we introduce an adaptive approach for selecting the optimal reference channel. Our method leverages a multi-channel masking-based scheme, where multiple masked signals are combined to generate a single-channel output signal. This enhanced signal is then used for loss calculation, while the reference clean speech is adjusted based on the highest scale-invariant signal-to-distortion ratio (SI-SDR). The experimental results on the Spear challenge simulated dataset D4 demonstrate the superiority of our proposed method over the conventional approach of using a fixed reference channel with single-channel masking.

    @inproceedings{2024_EUSIPCO_a,
    author = "Dai, Wang and Li, Xiaofei and Politis, Archontis and Virtanen, Tuomas",
    title = "Reference Channel Selection by Multi-Channel Masking for End-to-End Multi-Channel Speech Enhancement",
    abstract = "In end-to-end multi-channel speech enhancement, the traditional approach of designating one microphone signal as the reference for processing may not always yield optimal results. The limitation is particularly in scenarios with large distributed microphone arrays with varying speaker-to-microphone distances or compact, highly directional microphone arrays where speaker or microphone positions change over time. Current mask-based methods often fix the reference channel during training, which makes it not possible to adaptively select the reference channel for optimal performance. To address this problem, we introduce an adaptive approach for selecting the optimal reference channel. Our method leverages a multi-channel masking-based scheme, where multiple masked signals are combined to generate a single-channel output signal. This enhanced signal is then used for loss calculation, while the reference clean speech is adjusted based on the highest scale-invariant signal-to-distortion ratio (SI-SDR). The experimental results on the Spear challenge simulated dataset D4 demonstrate the superiority of our proposed method over the conventional approach of using a fixed reference channel with single-channel masking.",
    keywords = "end-to-end multi-channel speech enhancement, multi-channel masking, reference channel selection",
    note = "Publisher Copyright: {\textcopyright} 2024 European Signal Processing Conference, EUSIPCO. All rights reserved.; European Signal Processing Conference ; Conference date: 26-08-2024 Through 30-08-2024",
    year = "2024",
    doi = "10.23919/EUSIPCO63174.2024.10715275",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "241--245",
    booktitle = "2024 32nd European Signal Processing Conference (EUSIPCO)",
    address = "United States"
    }

  • D. Diaz-Guerra Aparicio, A. Politis, A. Miguel, J. R. Beltran, and T. Virtanen, "Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications," in Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023, 2024, p. 2137. doi:10.48550/arXiv.2306.08510
    [BibTeX] [Abstract]

    Many multi-source localization and tracking models based on neural networks use one or several recurrent layers at their final stages to track the movement of the sources. Conventional recurrent neural networks (RNNs), such as the long short-term memories (LSTMs) or the gated recurrent units (GRUs), take a vector as their input and use another vector to store their state. However, this approach results in the information from all the sources being contained in a single ordered vector, which is not optimal for permutation-invariant problems such as multi-source tracking. In this paper, we present a new recurrent architecture that uses unordered sets to represent both its input and its state and that is invariant to the permutations of the input set and equivariant to the permutations of the state set. Hence, the information of every sound source is represented in an individual embedding and the new estimates are assigned to the tracked trajectories regardless of their order.

    @inproceedings{2024_d,
    author = "Diaz-Guerra Aparicio, David and Politis, Archontis and Miguel, Antonio and Beltran, Jose Ramon and Virtanen, Tuomas",
    title = "Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications",
    abstract = "Many multi-source localization and tracking models based on neural networks use one or several recurrent layers at their final stages to track the movement of the sources. Conventional recurrent neural networks (RNNs), such as the long short-term memories (LSTMs) or the gated recurrent units (GRUs), take a vector as their input and use another vector to store their state. However, this approach results in the information from all the sources being contained in a single ordered vector, which is not optimal for permutation-invariant problems such as multi-source tracking. In this paper, we present a new recurrent architecture that uses unordered sets to represent both its input and its state and that is invariant to the permutations of the input set and equivariant to the permutations of the state set. Hence, the information of every sound source is represented in an individual embedding and the new estimates are assigned to the tracked trajectories regardless of their order.",
    year = "2024",
    doi = "10.48550/arXiv.2306.08510",
    language = "English",
    publisher = "European Acoustics Association",
    pages = "2137",
    booktitle = "Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023",
    note = "Convention of the European Acoustics Association Forum Acusticum ; Conference date: 11-09-2023 Through 15-09-2023"
    }

  • D. Diaz-Guerra Aparicio, A. Politis, P. Ariyakulam Sudarsanam, K. Shimada, D. Krause, K. Uchida, Y. Koyama, N. Takahashi, S. Takahashi, T. Shibuya, Y. Mitsufuji, and T. Virtanen, "Baseline models and evaluation of sound event localization and detection with distance estimation in DCASE 2024 Challenge," in Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2024), 2024, p. 41–45.
    [BibTeX] [Download PDF]
    @inproceedings{2024_DCASE2024,
    author = "Diaz-Guerra Aparicio, David and Politis, Archontis and Ariyakulam Sudarsanam, Parthasaarathy and Shimada, Kazuki and Krause, Daniel and Uchida, Kengo and Koyama, Yuichiro and Takahashi, Naoya and Takahashi, Shusuke and Shibuya, Takashi and Mitsufuji, Yuki and Virtanen, Tuomas",
    title = "Baseline models and evaluation of sound event localization and detection with distance estimation in DCASE 2024 Challenge",
    year = "2024",
    language = "English",
    pages = "41--45",
    booktitle = "Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2024)",
    publisher = "DCASE",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE2024 ; Conference date: 23-10-2024 Through 25-10-2024",
    url = "https://dcase.community/workshop2024/"
    }

  • D. Dogan, H. Xie, T. Heittola, and T. Virtanen, "Multi-Label Zero-Shot Audio Classification with Temporal Attention," in 2024 18th International Workshop on Acoustic Signal Enhancement, IWAENC 2024 - Proceedings, United States, 2024, p. 250–254. doi:10.1109/IWAENC61483.2024.10694459
    [BibTeX] [Abstract]

    Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most of the existing zero-shot learning methods focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features without weighting, which treat all audio segments equally. We evaluate our approach on a subset of AudioSet against a zero-shot model using uniformly aggregated acoustic features, a zero-rule baseline, and the proposed method in the supervised scenario. Our results show that temporal attention enhances the zero-shot audio classification performance in multi-label scenario.

    @inproceedings{2024_IWAENC,
    author = "Dogan, Duygu and Xie, Huang and Heittola, Toni and Virtanen, Tuomas",
    title = "Multi-Label Zero-Shot Audio Classification with Temporal Attention",
    abstract = "Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most of the existing zero-shot learning methods focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features without weighting, which treat all audio segments equally. We evaluate our approach on a subset of AudioSet against a zero-shot model using uniformly aggregated acoustic features, a zero-rule baseline, and the proposed method in the supervised scenario. Our results show that temporal attention enhances the zero-shot audio classification performance in multi-label scenario.",
    keywords = "audio classification, audio tagging, multi-label zero-shot learning, temporal attention",
    note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; International Workshop on Acoustic Signal Enhancement ; Conference date: 09-09-2024 Through 12-09-2024",
    year = "2024",
    doi = "10.1109/IWAENC61483.2024.10694459",
    language = "English",
    publisher = "IEEE",
    pages = "250--254",
    booktitle = "2024 18th International Workshop on Acoustic Signal Enhancement, IWAENC 2024 - Proceedings",
    address = "United States"
    }

  • S. Drgas, L. Bramslow, A. Politis, G. Naithani, and T. Virtanen, "Dynamic Processing Neural Network Architecture for Hearing Loss Compensation," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 32, p. 203–214, 2024. doi:10.1109/TASLP.2023.3328285
    [BibTeX] [Abstract]

    This paper proposes neural networks for compensating sensorineural hearing loss. The aim of the hearing loss compensation task is to transform a speech signal to increase speech intelligibility after further processing by a person with a hearing impairment, which is modeled by a hearing loss model. We propose an interpretable model called dynamic processing network, which has a structure similar to band-wise dynamic compressor. The network is differentiable, and therefore allows to learn its parameters to maximize speech intelligibility. More generic models based on convolutional layers were tested as well. The performance of the tested architectures was assessed using spectro-temporal objective index (STOI) with hearing-threshold noise and hearing aid speech intelligibility (HASPI) metrics. The dynamic processing network gave a significant improvement of STOI and HASPI in comparison to popular compressive gain prescription rule Camfit. A large enough convolutional network could outperform the interpretable model with the cost of larger computational load. Finally, a combination of the dynamic processing network with convolutional neural network gave the best results in terms of STOI and HASPI.

    @article{2024_k,
    author = "Drgas, Szymon and Bramslow, Lars and Politis, Archontis and Naithani, Gaurav and Virtanen, Tuomas",
    title = "Dynamic Processing Neural Network Architecture for Hearing Loss Compensation",
    abstract = "This paper proposes neural networks for compensating sensorineural hearing loss. The aim of the hearing loss compensation task is to transform a speech signal to increase speech intelligibility after further processing by a person with a hearing impairment, which is modeled by a hearing loss model. We propose an interpretable model called dynamic processing network, which has a structure similar to band-wise dynamic compressor. The network is differentiable, and therefore allows to learn its parameters to maximize speech intelligibility. More generic models based on convolutional layers were tested as well. The performance of the tested architectures was assessed using spectro-temporal objective index (STOI) with hearing-threshold noise and hearing aid speech intelligibility (HASPI) metrics. The dynamic processing network gave a significant improvement of STOI and HASPI in comparison to popular compressive gain prescription rule Camfit. A large enough convolutional network could outperform the interpretable model with the cost of larger computational load. Finally, a combination of the dynamic processing network with convolutional neural network gave the best results in terms of STOI and HASPI.",
    keywords = "deep neural networks, Hearing loss, hearing loss compensation",
    note = "Publisher Copyright: {\textcopyright} 2014 IEEE.",
    year = "2024",
    doi = "10.1109/TASLP.2023.3328285",
    language = "English",
    volume = "32",
    pages = "203--214",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity"
    }

  • S. Gharib, M. Tran, D. Luong, K. Drossos, and T. Virtanen, "Adversarial Representation Learning for Robust Privacy Preservation in Audio," IEEE Open Journal of Signal Processing, vol. 5, p. 294–302, 2024. doi:10.1109/OJSP.2023.3349113
    [BibTeX] [Abstract]

    Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial training method for learning representations of audio recordings that effectively prevents the detection of speech activity from the latent features of the recordings. The proposed method trains a model to generate invariant latent representations of speech-containing audio recordings that cannot be distinguished from non-speech recordings by a speech classifier. The novelty of our work is in the optimization algorithm, where the speech classifier's weights are regularly replaced with the weights of classifiers trained in a supervised manner. This increases the discrimination power of the speech classifier constantly during the adversarial training, motivating the model to generate latent representations in which speech is not distinguishable, even using new speech classifiers trained outside the adversarial training loop. The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method, demonstrating a significant reduction in privacy violations compared to the baseline approach. Additionally, we show that the prior adversarial method is practically ineffective for this purpose.

    @article{2024_SP,
    author = "Gharib, Shayan and Tran, Minh and Luong, Diep and Drossos, Konstantinos and Virtanen, Tuomas",
    title = "Adversarial Representation Learning for Robust Privacy Preservation in Audio",
    abstract = "Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial training method for learning representations of audio recordings that effectively prevents the detection of speech activity from the latent features of the recordings. The proposed method trains a model to generate invariant latent representations of speech-containing audio recordings that cannot be distinguished from non-speech recordings by a speech classifier. The novelty of our work is in the optimization algorithm, where the speech classifier's weights are regularly replaced with the weights of classifiers trained in a supervised manner. This increases the discrimination power of the speech classifier constantly during the adversarial training, motivating the model to generate latent representations in which speech is not distinguishable, even using new speech classifiers trained outside the adversarial training loop. The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method, demonstrating a significant reduction in privacy violations compared to the baseline approach. Additionally, we show that the prior adversarial method is practically ineffective for this purpose.",
    keywords = "Acoustics, Adversarial machine learning, adversarial neural networks, adversarial representation learning, Feature extraction, Privacy, privacy preservation, sound event detection, Speech recognition, Task analysis, Training",
    note = "Publisher Copyright: Authors",
    year = "2024",
    doi = "10.1109/OJSP.2023.3349113",
    language = "English",
    volume = "5",
    pages = "294--302",
    journal = "IEEE Open Journal of Signal Processing",
    issn = "2644-1322",
    publisher = "IEEE"
    }

  • A. Hakala, T. Kincy, and T. Virtanen, "Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning," in 2024 32nd European Signal Processing Conference (EUSIPCO), United States, 2024, p. 31–35. doi:10.23919/EUSIPCO63174.2024.10715468
    [BibTeX] [Abstract]

    This paper studies the novel problem of automatic live music song identification, where the goal is, given a live recording of a song, to retrieve the corresponding studio version of the song from a music database. We propose a system based on similarity learning and a Siamese convolutional neural network-based model. The model uses cross-similarity matrices of multilevel deep sequences to measure musical similarity between different audio tracks. A manually collected custom live music dataset is used to test the performance of the system with live music. The results of the experiments show that the system is able to identify 87.4% of the given live music queries.

    @inproceedings{2024_EUSIPCO,
    author = "Hakala, Aapo and Kincy, Trevor and Virtanen, Tuomas",
    title = "Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning",
    abstract = "This paper studies the novel problem of automatic live music song identification, where the goal is, given a live recording of a song, to retrieve the corresponding studio version of the song from a music database. We propose a system based on similarity learning and a Siamese convolutional neural network-based model. The model uses cross-similarity matrices of multilevel deep sequences to measure musical similarity between different audio tracks. A manually collected custom live music dataset is used to test the performance of the system with live music. The results of the experiments show that the system is able to identify 87.4\\% of the given live music queries.",
    keywords = "cross-similarity matrice, live song identification, multi-level deep sequences, music information retrieval, Siamese network, similarity learning",
    note = "Publisher Copyright: {\textcopyright} 2024 European Signal Processing Conference, EUSIPCO. All rights reserved.; European Signal Processing Conference ; Conference date: 26-08-2024 Through 30-08-2024",
    year = "2024",
    doi = "10.23919/EUSIPCO63174.2024.10715468",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "31--35",
    booktitle = "2024 32nd European Signal Processing Conference (EUSIPCO)",
    address = "United States"
    }
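
    As a reading aid for the entry above: the core data structure it mentions is a cross-similarity matrix between two sequences of frame-level embeddings. The sketch below builds such a matrix from random stand-in embeddings using cosine similarity; it is a generic illustration, not the Siamese multi-level deep features of the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    live = rng.standard_normal((120, 64))    # hypothetical live-version frame embeddings
    studio = rng.standard_normal((100, 64))  # hypothetical studio-version frame embeddings

    def l2_normalize(x, eps=1e-9):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

    # Pairwise cosine cross-similarity between the two embedding sequences.
    cross_sim = l2_normalize(live) @ l2_normalize(studio).T   # shape (120, 100)
    print(cross_sim.shape)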

  • M. Heikkinen, A. Politis, and T. Virtanen, "Neural Ambisonics Encoding For Compact Irregular Microphone Arrays," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2024, p. 701–705. doi:10.1109/ICASSP48485.2024.10447425
    [BibTeX] [Abstract]

    Ambisonics encoding of microphone array signals can enable various spatial audio applications, such as virtual reality or telepresence, but it is typically designed for uniformly-spaced spherical microphone arrays. This paper proposes a method for Ambisonics encoding that uses a deep neural network (DNN) to estimate a signal transform from microphone inputs to Ambisonics signals. The approach uses a DNN consisting of a U-Net structure with a learnable preprocessing as well as a loss function consisting of mean average error, spatial correlation, and energy preservation components. The method is validated on two microphone arrays with regular and irregular shapes having four microphones, on simulated reverberant scenes with multiple sources. The results of the validation show that the proposed method can meet or exceed the performance of a conventional signal-independent Ambisonics encoder on a number of error metrics.

    @inproceedings{2024_ICASSP,
    author = "Heikkinen, Mikko and Politis, Archontis and Virtanen, Tuomas",
    title = "Neural Ambisonics Encoding For Compact Irregular Microphone Arrays",
    abstract = "Ambisonics encoding of microphone array signals can enable various spatial audio applications, such as virtual reality or telepresence, but it is typically designed for uniformly-spaced spherical microphone arrays. This paper proposes a method for Ambisonics encoding that uses a deep neural network (DNN) to estimate a signal transform from microphone inputs to Ambisonics signals. The approach uses a DNN consisting of a U-Net structure with a learnable preprocessing as well as a loss function consisting of mean average error, spatial correlation, and energy preservation components. The method is validated on two microphone arrays with regular and irregular shapes having four microphones, on simulated reverberant scenes with multiple sources. The results of the validation show that the proposed method can meet or exceed the performance of a conventional signal-independent Ambisonics encoder on a number of error metrics.",
    year = "2024",
    doi = "10.1109/ICASSP48485.2024.10447425",
    language = "English",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "701--705",
    booktitle = "ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States",
    note = "IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 14-04-2024 Through 19-04-2024"
    }

  • L. Hekanaho, M. Hirvonen, and T. Virtanen, "Language-based machine perception: linguistic perspectives on the compilation of captioning datasets," Digital Scholarship in the Humanities, vol. 39, iss. 3, p. 864–883, 2024. doi:10.1093/llc/fqae029
    [BibTeX] [Abstract]

    Over the last decade, a plethora of training datasets have been compiled for use in language-based machine perception and in human-centered AI, alongside research regarding their compilation methods. From a primarily linguistic perspective, we add to these studies in two ways. First, we provide an overview of sixty-six training datasets used in automatic image, video, and audio captioning, examining their compilation methods with a metadata analysis. Second, we delve into the annotation process of crowdsourced datasets with an interest in understanding the linguistic factors that affect the form and content of the captions, such as contextualization and perspectivation. With a qualitative content analysis, we examine annotator instructions with a selection of eleven datasets. Drawing from various theoretical frameworks that help assess the effectiveness of the instructions, we discuss the visual and textual presentation of the instructions, as well as the perspective-guidance that is an essential part of the language instructions. While our analysis indicates that some standards in the formulation of instructions seem to have formed in the field, we also identified various reoccurring issues potentially hindering readability and comprehensibility of the instructions, and therefore, caption quality. To enhance readability, we emphasize the importance of text structure, organization of the information, consistent use of typographical cues, and clarity of language use. Last, engaging with previous research, we assess the compilation of both web-sourced and crowdsourced captioning datasets from various perspectives, discussing factors affecting the diversity of the datasets.

    @article{2024_i,
    author = "Hekanaho, Laura and Hirvonen, Maija and Virtanen, Tuomas",
    title = "Language-based machine perception: linguistic perspectives on the compilation of captioning datasets",
    abstract = "Over the last decade, a plethora of training datasets have been compiled for use in language-based machine perception and in human-centered AI, alongside research regarding their compilation methods. From a primarily linguistic perspective, we add to these studies in two ways. First, we provide an overview of sixty-six training datasets used in automatic image, video, and audio captioning, examining their compilation methods with a metadata analysis. Second, we delve into the annotation process of crowdsourced datasets with an interest in understanding the linguistic factors that affect the form and content of the captions, such as contextualization and perspectivation. With a qualitative content analysis, we examine annotator instructions with a selection of eleven datasets. Drawing from various theoretical frameworks that help assess the effectiveness of the instructions, we discuss the visual and textual presentation of the instructions, as well as the perspective-guidance that is an essential part of the language instructions. While our analysis indicates that some standards in the formulation of instructions seem to have formed in the field, we also identified various reoccurring issues potentially hindering readability and comprehensibility of the instructions, and therefore, caption quality. To enhance readability, we emphasize the importance of text structure, organization of the information, consistent use of typographical cues, and clarity of language use. Last, engaging with previous research, we assess the compilation of both web-sourced and crowdsourced captioning datasets from various perspectives, discussing factors affecting the diversity of the datasets.",
    year = "2024",
    month = "September",
    doi = "10.1093/llc/fqae029",
    language = "English",
    volume = "39",
    pages = "864--883",
    journal = "Digital Scholarship in the Humanities",
    issn = "2055-7671",
    publisher = "Oxford University Press",
    number = "3"
    }

  • D. Kocharov and O. Räsänen, "Age-dependent intonational changes in child-directed speech," in Proceedings of Speech Prosody 2024, 2024, p. 225–229. doi:10.21437/SpeechProsody.2024-46
    [BibTeX] [Abstract]

    The linguistic properties of child-directed speech (CDS) change over time as children get older and their language skills develop. The focus of this research is on prosodic changes of CDS within the earliest years of children's life, especially on the changes in melody. We analyzed mothers' speech from Providence corpus, a collection of longitudinal (bi-monthly) recordings of mother-child spontaneous speech interactions from six English-speaking children between 1.0–3.5 years of age (363 h of audio). Raw prosodic features were extracted from speech using OpenSMILE toolkit. Timing of prosodic events with respect to segmental content was estimated with automatic alignment of orthographic transcripts and the speech signals. Analyses of prosodic features in the data show that mothers' voice in CDS changes during the second and the third years of their children life, as the mean fundamental frequency lowers significantly, while the within-utterance fundamental frequency variability doesn't change.

    @inproceedings{2024_j,
    author = {Kocharov, Daniil and R{\"a}s{\"a}nen, Okko},
    title = "Age-dependent intonational changes in child-directed speech",
    abstract = "The linguistic properties of child-directed speech (CDS) change over time as children get older and their language skills develop. The focus of this research is on prosodic changes of CDS within the earliest years of children{\textquoteright}s life, especially on the changes in melody. We analyzed mothers{\textquoteright} speech from Providence corpus, a collection of longitudinal (bi-monthly) recordings of mother-child spontaneous speech interactions from six English-speaking children between 1.0–3.5 years of age (363 h of audio). Raw prosodic features were extracted from speech using OpenSMILE toolkit. Timing of prosodic events with respect to segmental content was estimated with automatic alignment of orthographic transcripts and the speech signals. Analyses of prosodic features in the data show that mothers{\textquoteright} voice in CDS changes during the second and the third years of their children life, as the mean fundamental frequency lowers significantly, while the within-utterance fundamental frequency variability doesn{\textquoteright}t change.",
    year = "2024",
    doi = "10.21437/SpeechProsody.2024-46",
    language = "English",
    series = "Speech prosody",
    publisher = "ISCA",
    pages = "225--229",
    booktitle = "Proceedings of Speech Prosody 2024",
    note = "Speech prosody ; Conference date: 02-07-2024 Through 05-07-2024"
    }

  • D. Kocharov and O. Räsänen, "The effect of F0 measurements on prosody analysis in language development studies." 2024, p. 22–22.
    [BibTeX] [Abstract] [Download PDF]

    Prosodic research on child-directed speech (CDS) usually focuses on measurements of F0 mean and standard deviation (SD). In general, there is an agreement that CDS has higher pitch mean and variability than ADS (adult-directed speech). However, earlier studies have reported conflicting findings on how F0 of CDS changes with the recipient child's age, as there is evidence for both presence and absence of age-related change of pitch variability. One possibility is that the disagreement might originate from the variability of speech behaviors of speakers in different languages and cultures. Alternatively, it could be explained by the F0 measurement methodology as well. In this work, we investigate how the way we measure F0 influences the overall results on CDS change with child age. We investigated two different factors which could influence melodic analysis of speech: a) what sound segments are used to measure F0: vowels, sonorants, voiced obstruents, no account for sound segments; and b) whether the number of words within an utterance is taken into account in the analysis. We used a dataset of maternal utterances from the Providence Corpus (a collection of twice-monthly recordings of hour-long mother-child spontaneous interactions from six NA-English-speaking children) addressed at children in the age range of 1;0 to 3;0. The F0 values were calculated using OpenSMILE toolkit. The transcription-to-speech alignment was performed using WebMAUS online toolkit. The age-dependency of F0 feature was tested by means of the Spearman's rank correlation coefficient between quantized child age and the feature values associated with the age bin. The results show that the estimation procedure can affect the findings, potentially affecting developmental interpretations. In case of calculating F0 on either voiced consonants and vowels, or vowels only, there is a significant age-dependent increase of F0 SD. This is in contrast to the case of using all F0 values calculated within an utterance by F0 detection algorithm, when no significant age-dependency for F0 SD is found. Thus, it matters whether we take into account segments that are known to have proper voicing or measure F0 from all speech that an automatic F0 estimator considers as voiced irrespectively of the underlying segments. Second, we found no age-related dependencies of F0 SD (whether we took sound segment identities into account or not) for the scenario where the analysis was controlled for the number of words within an utterance, i.e. comparing one-word utterances across all ages, two-word utterances across all ages, etc. The revealed age-dependency of F0 SD for the first scenario might be explained constantly increasing length of utterances in CDS in terms of number of pronounced words along with a child age. This is since the number of pronounced words in an utterance might influence melodic variability within the utterance, where more complex intonational structure may be required to prosodically structure the longer utterances.

    @conference{2024_l,
    author = {Kocharov, Daniil and R{\"a}s{\"a}nen, Okko},
    title = "The effect of F0 measurements on prosody analysis in language development studies",
    abstract = "Prosodic research on child-directed speech (CDS) usually focuses on measurements of F0 mean and standard deviation (SD). In general, there is an agreement that CDS has higher pitch mean and variability than ADS (adult-directed speech). However, earlier studies have reported conflicting findings on how F0 of CDS changes with the recipient child{\textquoteright}s age, as there is evidence for both presence and absence of age-related change of pitch variability. One possibility is that the disagreement might originate from the variability of speech behaviors of speakers in different languages and cultures. Alternatively, it could be explained by the F0 measurement methodology as well. In this work, we investigate how the way we measure F0 influences the overall results on CDS change with child age. We investigated two different factors which could influence melodic analysis of speech: a) what sound segments are used to measure F0: vowels, sonorants, voiced obstruents, no account for sound segments; and b) whether the number of words within an utterance is taken into account in the analysis. We used a dataset of maternal utterances from the Providence Corpus (a collection of twice-monthly recordings of hour-long mother-child spontaneous interactions from six NA-English-speaking children) addressed at children in the age range of 1;0 to 3;0. The F0 values were calculated using OpenSMILE toolkit. The transcription-to-speech alignment was performed using WebMAUS online toolkit. The age-dependency of F0 feature was tested by means of the Spearman{\textquoteright}s rank correlation coefficient between quantized child age and the feature values associated with the age bin. The results show that the estimation procedure can affect the findings, potentially affecting developmental interpretations. In case of calculating F0 on either voiced consonants and vowels, or vowels only, there is a significant age-dependent increase of F0 SD. This is in contrast to the case of using all F0 values calculated within an utterance by F0 detection algorithm, when no significant age-dependency for F0 SD is found. Thus, it matters whether we take into account segments that are known to have proper voicing or measure F0 from all speech that an automatic F0 estimator considers as voiced irrespectively of the underlying segments. Second, we found no age-related dependencies of F0 SD (whether we took sound segment identities into account or not) for the scenario where the analysis was controlled for the number of words within an utterance, i.e. comparing one-word utterances across all ages, two-word utterances across all ages, etc. The revealed age-dependency of F0 SD for the first scenario might be explained constantly increasing length of utterances in CDS in terms of number of pronounced words along with a child age. This is since the number of pronounced words in an utterance might influence melodic variability within the utterance, where more complex intonational structure may be required to prosodically structure the longer utterances.",
    year = "2024",
    language = "English",
    pages = "22--22",
    note = {Fonetiikan p{\"a}iv{\"a}t 2024 (The 36th Finnic Phonetics Symposium) ; Conference date: 25-04-2024 Through 26-04-2024},
    url = "https://cs.ttu.ee/events/fp2024/"
    }
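
    The statistical test named in the abstract above (Spearman's rank correlation between quantized child age and an utterance-level F0 feature) can be reproduced in outline with a few lines of Python. The sketch below uses synthetic placeholder values, not data from the study.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(1)
    age_bins = np.repeat(np.arange(12, 37, 2), 50)            # child age in months, binned
    f0_sd = rng.normal(loc=30 + 0.2 * age_bins, scale=8.0)    # simulated per-utterance F0 SD (Hz)

    # Spearman's rank correlation between age bin and the F0 feature.
    rho, p_value = spearmanr(age_bins, f0_sd)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")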

  • K. Lahtinen, L. Mustanoja, and O. Räsänen, "Building a Naturalistic and Representative Affective Speech Corpus for the Finnish Language." 2024, p. 1–1.
    [BibTeX] [Abstract] [Download PDF]

    Spoken language contains affective (emotional) information, which is conveyed by suprasegmental variation (e.g prosody and phonation) in speech as well as by other situational variation such as word choices along with dialectal, syntactic and semantic variation. The information is ultimately perceived by the listener as subjective interpretations. While affect is part of everyday conversational communication, there is little existing research on expression and perception of affect in spoken Finnish [8][7], not to mention across different idiolectal subgroups such as speakers of different age or dialectal background. Since expression and interpretation of affect in language is known to depend on cultural and social conventions, better understanding of the expression of affect in Finnish would be desirable. The goal of our work is to research how affect is expressed in everyday spoken Finnish using large-scale data. A prerequisite for our research on affective language is a speech corpus containing unscripted audio recordings of speech paired with metadata (or annotations) containing information about the affective expression. However, we are aware of only two Finnish speech corpora related to affect, both consisting of acted emotional expressions while reading a pre-defined script and consisting only a small amount of speech in total [1][5]. In contrast, several large-scale datasets containing unscripted speech in Finnish exists [4][2][6], but they lack affect related metadata. Building an affective speech corpus can be done in several ways, typically by recording acted speech in a controlled setting or utilizing publicly available free speech audio sources from different medias such as podcasts, radio or television. The trade-off when building these types of datasets is typically between the richness and balance of affective expression present in the data and the level of information the dataset contains about the expression in the data [3]. In this presentation, we will describe our approach to compiling a spoken Finnish dataset for the study of affective expression by combining the LahjoitaPuhetta, HelPuhe and TamPuhe datasets. The dataset will be built by aligning the audio recordings with their respective text transcriptions and split into individual utterance samples (consisting of audio and text). Each utterance sample in the dataset will be augmented with a text sentiment, speech-to-noise ratio and audio based emotion estimates first by using automated tools and finally annotating a subset of samples manually. The final dataset can be used to build better tools for automated affect related annotation providing more options for researching affect and idiolectical variation using large-scale data. The work is a part of the CONVERGENCE-project at Tampere University, funded by the Jane and Aatos Erkko Foundation.

    @conference{2024_g,
    author = {Lahtinen, Kalle and Mustanoja, Liisa and R{\"a}s{\"a}nen, Okko},
    title = "Building a Naturalistic and Representative Affective Speech Corpus for the Finnish Language",
    abstract = "Spoken language contains affective (emotional) information, which is conveyed by suprasegmental variation (e.g prosody and phonation) in speech as well as by other situational variation such as word choices along with dialectal, syntactic and semantic variation. The information is ultimately perceived by the listener as subjective interpretations. While affect is part of everyday conversational communication, there is little existing research on expression and perception of affect in spoken Finnish [8][7], not to mention across different idiolectal subgroups such as speakers of different age or dialectal background. Since expression and interpretation of affect in language is known to depend on cultural and social conventions, better understanding of the expression of affect in Finnish would be desirable. The goal of our work is to research how affect is expressed in everyday spoken Finnish using large-scale data. A prerequisite for our research on affective language is a speech corpus containing unscripted audio recordings of speech paired with metadata (or annotations) containing information about the affective expression. However, we are aware of only two Finnish speech corpora related to affect, both consisting of acted emotional expressions while reading a pre-defined script and consisting only a small amount of speech in total [1][5]. In contrast, several large-scale datasets containing unscripted speech in Finnish exists [4][2][6], but they lack affect related metadata. Building an affective speech corpus can be done in several ways, typically by recording acted speech in a controlled setting or utilizing publicly available free speech audio sources from different medias such as podcasts, radio or television. The trade-off when building these types of datasets is typically between the richness and balance of affective expression present in the data and the level of information the dataset contains about the expression in the data [3]. In this presentation, we will describe our approach to compiling a spoken Finnish dataset for the study of affective expression by combining the LahjoitaPuhetta, HelPuhe and TamPuhe datasets. The dataset will be built by aligning the audio recordings with their respective text transcriptions and split into individual utterance samples (consisting of audio and text). Each utterance sample in the dataset will be augmented with a text sentiment, speech-to-noise ratio and audio based emotion estimates first by using automated tools and finally annotating a subset of samples manually. The final dataset can be used to build better tools for automated affect related annotation providing more options for researching affect and idiolectical variation using large-scale data. The work is a part of the CONVERGENCE-project at Tampere University, funded by the Jane and Aatos Erkko Foundation.",
    year = "2024",
    month = "April",
    language = "English",
    pages = "1--1",
    note = {Fonetiikan P{\"a}iv{\"a}t 2024 ; Conference date: 25-04-2024 Through 26-04-2024},
    url = "https://cs.ttu.ee/events/fp2024/"
    }

  • J. Martinsson, O. Mogren, M. Sandsten, and T. Virtanen, "From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning," in 2024 32nd European Signal Processing Conference (EUSIPCO), United States, 2024, p. 902–906. doi:10.23919/EUSIPCO63174.2024.10715098
    [BibTeX] [Abstract]

    We propose an adaptive change point detection method (A-CPD) for machine guided weak label annotation of audio recording segments. The goal is to maximize the amount of information gained about the temporal activations of the target sounds. For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guide annotation. The prediction model is initially pre-trained on available annotated sound event data with classes that are disjoint from the classes in the unlabeled dataset. The prediction model then gradually adapts to the annotations provided by the annotator in an active learning loop. We derive query segments to guide the weak label annotator towards strong labels, using change point detection on these probabilities. We show that it is possible to derive strong labels of high quality with a limited annotation budget, and show favorable results for A-CPD when compared to two baseline query segment strategies.

    @inproceedings{2024_EUSIPCO_b,
    author = "Martinsson, John and Mogren, Olof and Sandsten, Maria and Virtanen, Tuomas",
    title = "From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning",
    abstract = "We propose an adaptive change point detection method (A-CPD) for machine guided weak label annotation of audio recording segments. The goal is to maximize the amount of information gained about the temporal activations of the target sounds. For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guide annotation. The prediction model is initially pre-trained on available annotated sound event data with classes that are disjoint from the classes in the unlabeled dataset. The prediction model then gradually adapts to the annotations provided by the annotator in an active learning loop. We derive query segments to guide the weak label annotator towards strong labels, using change point detection on these probabilities. We show that it is possible to derive strong labels of high quality with a limited annotation budget, and show favorable results for A-CPD when compared to two baseline query segment strategies.",
    keywords = "Active learning, annotation, deep learning, sound event detection",
    note = "Publisher Copyright: {\textcopyright} 2024 European Signal Processing Conference, EUSIPCO. All rights reserved.; European Signal Processing Conference ; Conference date: 26-08-2024 Through 30-08-2024",
    year = "2024",
    doi = "10.23919/EUSIPCO63174.2024.10715098",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "902--906",
    booktitle = "2024 32nd European Signal Processing Conference (EUSIPCO)",
    address = "United States"
    }
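
    The entry above describes deriving query segments for the annotator from change points of a model's frame-wise probability curve. The NumPy sketch below illustrates that query-construction step in the simplest possible form; the peak-picking rule, the budget parameter, and all names are illustrative assumptions, not the authors' A-CPD implementation.

    import numpy as np

    def query_segments_from_probs(probs, hop_s, n_queries):
        """Derive annotation query segments from a frame-wise probability curve.

        probs: 1-D array of target-class probabilities, one value per frame.
        hop_s: frame hop in seconds.
        n_queries: number of query segments to produce (annotation budget).

        Returns a list of (start_s, end_s) tuples covering the recording, with
        boundaries placed at the strongest change points of the curve.
        """
        # Change-point strength: magnitude of the first difference of the curve.
        change = np.abs(np.diff(probs))

        # Pick the (n_queries - 1) strongest interior change points as boundaries.
        n_boundaries = max(n_queries - 1, 0)
        idx = np.argsort(change)[::-1][:n_boundaries]
        boundaries = np.sort(idx + 1)  # diff index i marks a change before frame i+1

        # Convert boundary frames to segment edges in seconds.
        edges = np.concatenate(([0], boundaries, [len(probs)])) * hop_s
        return [(float(edges[i]), float(edges[i + 1])) for i in range(len(edges) - 1)]

    # Example: a synthetic probability curve with one activity burst.
    rng = np.random.default_rng(0)
    p = np.clip(np.concatenate([rng.normal(0.1, 0.02, 50),
                                rng.normal(0.9, 0.02, 30),
                                rng.normal(0.1, 0.02, 70)]), 0, 1)
    print(query_segments_from_probs(p, hop_s=0.05, n_queries=3))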

  • M. Moritz, T. Olan, and T. Virtanen, "Noise-To-Mask Ratio Loss for Deep Neural Network Based Audio Watermarking," in IEEE 5th International Symposium on the Internet of Sounds, IS2 2024, United States, 2024. doi:10.1109/IS262782.2024.10704132
    [BibTeX] [Abstract]

    Digital audio watermarking consists in inserting a message into audio signals in a transparent way and can be used to allow automatic recognition of audio material and management of the copyrights. We propose a perceptual loss function to be used in deep neural network based audio watermarking systems. The loss is based on the noise-to-mask ratio (NMR), which is a model of the psychoacoustic masking effect characteristic of the human ear. We use the NMR loss between marked and host signals to train the deep neural models and we evaluate the objective quality with PEAQ and the subjective quality with a MUSHRA test.

    @inproceedings{2024_h,
    author = "Moritz, Martin and Olan, Toni and Virtanen, Tuomas",
    title = "Noise-To-Mask Ratio Loss for Deep Neural Network Based Audio Watermarking",
    abstract = "Digital audio watermarking consists in inserting a message into audio signals in a transparent way and can be used to allow automatic recognition of audio material and management of the copyrights. We propose a perceptual loss function to be used in deep neural network based audio watermarking systems. The loss is based on the noise-To-mask ratio (NMR), which is a model of the psychoacoustic masking effect characteristic of the human ear. We use the NMR loss between marked and host signals to train the deep neural models and we evaluate the objective quality with PEAQ and the subjective quality with a MUSHRA test.",
    note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; IEEE International Symposium on the Internet of Sounds ; Conference date: 30-09-2024 Through 02-10-2024",
    year = "2024",
    doi = "10.1109/IS262782.2024.10704132",
    language = "English",
    booktitle = "IEEE 5th International Symposium on the Internet of Sounds, IS2 2024",
    publisher = "IEEE",
    address = "United States"
    }
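
    The NMR loss described in the entry above penalizes watermark error energy relative to a psychoacoustic masking threshold derived from the host signal. The PyTorch sketch below only conveys the general shape of such a loss: the masking threshold here is a crude frequency-smoothing stand-in for a real psychoacoustic model, and the function and parameter names are invented for illustration, so this is not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def nmr_style_loss(marked, host, n_fft=1024, hop=256, eps=1e-8):
        """Perceptually weighted loss: ratio of watermark error energy to an
        approximate masking threshold of the host, averaged over T-F bins.

        marked, host: (batch, samples) waveforms. Returns a scalar loss.
        """
        window = torch.hann_window(n_fft, device=host.device)

        def power_spec(x):
            spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                              return_complex=True)
            return spec.abs() ** 2  # (batch, freq, frames)

        noise = power_spec(marked - host)   # watermark-induced error energy
        host_pow = power_spec(host)

        # Crude masking threshold: frequency-smoothed, attenuated host spectrum,
        # standing in for a Bark-domain spreading function and absolute threshold.
        b, f, t = host_pow.shape
        kernel = torch.ones(1, 1, 9, device=host.device) / 9.0
        flat = host_pow.permute(0, 2, 1).reshape(b * t, 1, f)  # one row per frame
        smoothed = F.conv1d(flat, kernel, padding=4)
        mask = 0.1 * smoothed.reshape(b, t, f).permute(0, 2, 1)

        # Mean noise-to-mask ratio drives the network to keep the error
        # below the (approximate) audibility threshold.
        return torch.mean(noise / (mask + eps))

    # Example: a random host signal and a slightly perturbed "marked" version.
    host = torch.randn(2, 16000)
    marked = host + 0.001 * torch.randn(2, 16000)
    print(nmr_style_loss(marked, host).item())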

  • M. Neri, A. Politis, D. Krause, M. Carli, and T. Virtanen, "Speaker Distance Estimation in Enclosures from Single-Channel Audio," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 32, p. 2242–2254, 2024. doi:10.1109/TASLP.2024.3382504
    [BibTeX] [Abstract]

    Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies predominantly center on employing a classification approach, where distances are discretized into distinct categories, enabling smoother model training and achieving higher accuracy but imposing restrictions on the precision of the obtained sound source position. Towards this direction, in this paper we propose a novel approach for continuous distance estimation from audio signals using a convolutional recurrent neural network with an attention module. The attention mechanism enables the model to focus on relevant temporal and spectral features, enhancing its ability to capture fine-grained distance-related information. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using audio recordings in controlled environments with three levels of realism (synthetic room impulse response, measured response with convolved speech, and real recordings) on four datasets (our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental results show that the model achieves an absolute error of 0.11 meters in a noiseless synthetic scenario. Moreover, the results showed an absolute error of about 1.30 meters in the hybrid scenario. The algorithm's performance in the real scenario, where unpredictable environmental factors and noise are prevalent, yields an absolute error of approximately 0.50 meters.

    @article{2024_f,
    author = "Neri, Michael and Politis, Archontis and Krause, Daniel and Carli, Marco and Virtanen, Tuomas",
    title = "Speaker Distance Estimation in Enclosures from Single-Channel Audio",
    abstract = "Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies predominantly center on employing a classification approach, where distances are discretized into distinct categories, enabling smoother model training and achieving higher accuracy but imposing restrictions on the precision of the obtained sound source position. Towards this direction, in this paper we propose a novel approach for continuous distance estimation from audio signals using a convolutional recurrent neural network with an attention module. The attention mechanism enables the model to focus on relevant temporal and spectral features, enhancing its ability to capture fine-grained distance-related information. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using audio recordings in controlled environments with three levels of realism (synthetic room impulse response, measured response with convolved speech, and real recordings) on four datasets (our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental results show that the model achieves an absolute error of 0.11 meters in a noiseless synthetic scenario. Moreover, the results showed an absolute error of about 1.30 meters in the hybrid scenario. The algorithm's performance in the real scenario, where unpredictable environmental factors and noise are prevalent, yields an absolute error of approximately 0.50 meters.",
    keywords = "Acoustics, Attention, Deep Learning, Direction-of-arrival estimation, Distance estimation, Estimation, Explainability, Feature extraction, Recording, Reverberation, Single-channel, Speech processing, Task analysis",
    note = "Publisher Copyright: Authors",
    year = "2024",
    doi = "10.1109/TASLP.2024.3382504",
    language = "English",
    volume = "32",
    pages = "2242--2254",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity"
    }
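
    As a rough illustration of the model family named in the entry above (a convolutional recurrent network with an attention module regressing a continuous distance from single-channel audio), here is a compact PyTorch sketch; the layer sizes, pooling scheme, and attention formulation are guesses for illustration and do not reproduce the published architecture.

    import torch
    import torch.nn as nn

    class CRNNDistanceRegressor(nn.Module):
        """Toy CRNN + attention for continuous distance estimation from a
        single-channel spectrogram of shape (batch, 1, freq, time)."""

        def __init__(self, n_freq=64, hidden=128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.MaxPool2d((4, 1)),                       # pool over frequency only
                nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d((4, 1)),
            )
            self.gru = nn.GRU(64 * (n_freq // 16), hidden, batch_first=True,
                              bidirectional=True)
            self.att = nn.Linear(2 * hidden, 1)             # frame-level attention scores
            self.head = nn.Linear(2 * hidden, 1)            # distance in metres

        def forward(self, spec):
            z = self.cnn(spec)                              # (B, C, F', T)
            b, c, f, t = z.shape
            z = z.permute(0, 3, 1, 2).reshape(b, t, c * f)  # time-major features
            h, _ = self.gru(z)                              # (B, T, 2*hidden)
            w = torch.softmax(self.att(h), dim=1)           # attention over frames
            ctx = (w * h).sum(dim=1)                        # attention-pooled embedding
            return self.head(ctx).squeeze(-1)               # (B,) distances

    # Shape check with random data: a 64-band spectrogram with 100 frames.
    model = CRNNDistanceRegressor()
    print(model(torch.randn(2, 1, 64, 100)).shape)          # torch.Size([2])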

  • O. Räsänen and D. Kocharov, "Age-dependent analysis and stochastic generation of child-directed speech," in Proceedings of the Annual Meeting of the Cognitive Science Society, United States, 2024, p. 5102–5108.
    [BibTeX] [Abstract] [Download PDF]

    Child-directed speech (CDS) is a particular type of speech that adults use when addressing young children. Its properties also change as a function of extralinguistic factors, such as age of the child being addressed. Access to large amounts of representative and varied CDS would be useful for child language research, as this would enable controlled computational modeling experiments of infant language acquisition with realistic input in terms of quality and quantity. In this study, we describe an approach to model age-dependent linguistic properties of CDS using a language model (LM) trained on CDS transcripts and ages of the recipient children, as obtained from North American English corpora of the CHILDES database. The created LM can then be used to stochastically generate synthetic CDS transcripts in an age-appropriate manner, thereby scaling beyond the original datasets in size. We compare characteristics of the generated CDS against the real speech addressed at children of different ages, showing that the LM manages to capture age-dependent changes in CDS, except for a slight difference in the effective vocabulary size. As a side product, we also provide a systematic characterization of age-dependent linguistic properties of CDS in CHILDES, illustrating how all measured aspects of the CDS change with children's age.

    @inproceedings{2024_a,
    author = {R{\"a}s{\"a}nen, Okko and Kocharov, Daniil},
    title = "Age-dependent analysis and stochastic generation of child-directed speech",
    abstract = "Child-directed speech (CDS) is a particular type of speech that adults use when addressing young children. Its properties also change as a function of extralinguistic factors, such as age of the child being addressed. Access to large amounts of representative and varied CDS would be useful for child language research, as this would enable controlled computational modeling experiments of infant language acquisition with realistic input in terms of quality and quantity. In this study, we describe an approach to model age-dependent linguistic properties of CDS using a language model (LM) trained on CDS transcripts and ages of the recipient children, as obtained from North American English corpora of the CHILDES database. The created LM can then be used to stochastically generate synthetic CDS transcripts in an age-appropriate manner, thereby scaling beyond the original datasets in size. We compare characteristics of the generated CDS against the real speech addressed at children of different ages, showing that the LM manages to capture age-dependent changes in CDS, except for a slight difference in the effective vocabulary size. As a side product, we also provide a systematic characterization of age-dependent linguistic properties of CDS in CHILDES, illustrating how all measured aspects of the CDS change with children's age.",
    year = "2024",
    language = "English",
    volume = "46",
    series = "Proceedings of the Annual Conference of the Cognitive Science Society",
    publisher = "University of California eScholarship",
    pages = "5102--5108",
    booktitle = "Proceedings of the Annual Meeting of the Cognitive Science Society",
    address = "United States",
    note = "Annual Meeting of the Cognitive Science Society, CogSci ; Conference date: 24-07-2024 Through 27-07-2024",
    url = "https://cognitivesciencesociety.org/cogsci-2024/"
    }

  • O. Räsänen, M. Cruz Blandon, K. Khorrami, and D. Kocharov, "Modeling Child Language Development using Naturalistic Data at a Scale." 2024, p. 5–5.
    [BibTeX] [Abstract]

    These include learning of the language’s phonetic units, segmentation of words from running speech, association word forms with their meanings, and acquisition of the syntax. Typical empirical research on child language development (CLD) consists of well-controlled focused studies conducted in laboratory conditions. In contrast, only a few high-level theories, like NLM-e (Kuhl et al., 2008) and PRIMIR (Werker & Curtin, 2005), have aimed to integrate the present understanding of CLD into unified frameworks. However, these frameworks have not gained unanimous acceptance in the field, largely since they are relatively abstract, and only qualitatively describe the mechanisms and representations that could be involved in CLD. As a result, we still have limited understanding of the basic mechanisms and their interactions driving early language acquisition. In principle, computational modelling of CLD is a potential solution to the so-called “integration problem” across empirical findings. This is because computational models can, and must, explicitly address all aspects of the information processing chain from input data to the resulting behavior. By formulating the theories as high-level computational goals and operations, implementing them as functional signal processing and machine learning algorithms, and finally exposing the models to realistic input data comparable to what real infants experience, ecological plausibility and validity of the underlying theories can be explicitly tested. However, the scientific impact of the existing modeling efforts has also been limited. Instead of addressing multiple aspects of language in one model, earlier models have usually focused on individual language phenomena (e.g., phonemic learning or word segmentation). Moreover, they have rarely investigated learning as a function of the developmental timeline of the learner. We argue that two main factors currently hinder the development of more comprehensive, ecologically valid, and thus influential models of CLD: 1) lack of ecologically valid and openly available large-scale speech data to simulate child language experiences at a realistic scale, where various adult speech corpora are currently used instead, and 2) limited validation of the models with respect to empirical data on real infant language learning, where the current standard approach is to evaluate the models against linguistic theory of how speech is formally structured. Without fixing the data and evaluation problems, it is difficult to develop models that could truly help us to understand the big picture of CLD. In this talk, we provide an overview of our ongoing “Modeling Child Language Development using Naturalistic Data at a Scale” (L-SCALE) project, where the aim is to enable development of more comprehensive and ecologically valid computational models of CLD. To achieve this, the L-SCALE project tackles the two challenges identified above: 1) solving the ecologically valid training data problem by creating a pipeline called Generator of Infant Language ExperienceS (GILES) for generation of ecologically relevant large-scale training data for computational models, and 2) solving the mismatch between human data and computational model evaluation by developing evaluation protocols that enable comparison of computational models against empirical data on CLD as a function of learner’s age.
    We will also showcase some recent developments of the project, including a proof-of-concept demonstration of the GILES pipeline and a meta-analytic approach to compare models to empirical data on infant language learning. The L-SCALE project is funded by Kone Foundation (2022–2026).

    @conference{2024_c,
    author = {R{\"a}s{\"a}nen, Okko and Cruz Blandon, Maria and Khorrami, Khazar and Kocharov, Daniil},
    title = "Modeling Child Language Development using Naturalistic Data at a Scale",
    abstract = "These include learning of the language{\textquoteright}s phonetic units, segmentation of words from running speech, association word forms with their meanings, and acquisition of the syntax. Typical empirical research on child language development (CLD) consists of well-controlled focused studies conducted in laboratory conditions. In contrast, only a few high-level theories, like NLM-e (Kuhl et al., 2008) and PRIMIR (Werker \\& Curtin, 2005), have aimed to integrate the present understanding of CLD into unified frameworks. However, these frameworks have not gained unanimous acceptance in the field, largely since they are relatively abstract, and only qualitatively describe the mechanisms and representations that could be involved in CLD. As a result, we still have limited understanding of the basic mechanisms and their interactions driving early language acquisition. In principle, computational modelling of CLD is a potential solution to the so-called “integration problem” across empirical findings. This is because computational models can, and must, explicitly address all aspects of the information processing chain from input data to the resulting behavior. By formulating the theories as high-level computational goals and operations, implementing them as functional signal processing and machine learning algorithms, and finally exposing the models to realistic input data comparable to what real infants experience, ecological plausibility and validity of the underlying theories can be explicitly tested. However, the scientific impact of the existing modeling efforts has also been limited. Instead of addressing multiple aspects of language in one model, earlier models have usually focused on individual language phenomena (e.g., phonemic learning or word segmentation). Moreover, they have rarely investigated learning as a function of the developmental timeline of the learner. We argue that two main factors currently hinder the development of more comprehensive, ecologically valid, and thus influential models of CLD: 1) lack of ecologically valid and openly available large-scale speech data to simulate child language experiences at a realistic scale, where various adult speech corpora are currently used instead, and 2) limited validation of the models with respect to empirical data on real infant language learning, where the current standard approach is to evaluate the models against linguistic theory of how speech is formally structured. Without fixing the data and evaluation problems, it is difficult to develop models that could truly help us to understand the big picture of CLD. In this talk, we provide an overview of our ongoing “Modeling Child Language Development using Naturalistic Data at a Scale” (L-SCALE) project, where the aim is to enable development of more comprehensive and ecologically valid computational models of CLD. To achieve this, the L-SCALE project tackles the two challenges identified above: 1) solving the ecologically valid training data problem by creating a pipeline called Generator of Infant Language ExperienceS (GILES) for generation of ecologically relevant large-scale training data for computational models, and 2) solving the mismatch between human data and computational model evaluation by developing evaluation protocols that enable comparison of computational models against empirical data on CLD as a function of learner{\textquoteright}s age. 
We will also showcase some recent developments of the project, including a proof-of-concept demonstration of the GILES pipeline and a meta-analytic approach to compare models to empirical data on infant language learning. The L-SCALE project is funded by Kone Foundation (2022–2026).",
    year = "2024",
    language = "English",
    pages = "5--5",
    note = "Finnic Phonetics Symposium ; Conference date: 25-04-2024 Through 26-04-2024"
    }

  • Y. Wang, A. Politis, and T. Virtanen, "Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2024, p. 11221–11225. doi:10.1109/ICASSP48485.2024.10448177
    [BibTeX]
    @inproceedings{2024_ICASSP_a,
    author = "Wang, Yuzhu and Politis, Archontis and Virtanen, Tuomas",
    title = "Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios",
    year = "2024",
    doi = "10.1109/ICASSP48485.2024.10448177",
    language = "English",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "11221--11225",
    booktitle = "ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States",
    note = "IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 14-04-2024 Through 19-04-2024"
    }

2023

  • M. Airaksinen, E. Taylor, A. Gallen, E. Ilén, A. Saari, U. Sankilampi, O. Räsänen, L. M. Haataja, and S. Vanhatalo, "Charting infants’ motor development at home using a wearable system: validation and comparison to physical growth charts," eBioMedicine, vol. 92, 2023. doi:10.1016/j.ebiom.2023.104591
    [BibTeX] [Abstract]

    Background: Early neurodevelopmental care and research are in urgent need of practical methods for quantitative assessment of early motor development. Here, performance of a wearable system in early motor assessment was validated and compared to developmental tracking of physical growth charts. Methods: Altogether 1358 h of spontaneous movement during 226 recording sessions in 116 infants (age 4–19 months) were analysed using a multisensor wearable system. A deep learning-based automatic pipeline quantified categories of infants' postures and movements at a time scale of seconds. Results from an archived cohort (dataset 1, N = 55 infants) recorded under partial supervision were compared to a validation cohort (dataset 2, N = 61) recorded at infants’ homes by the parents. Aggregated recording-level measures including developmental age prediction (DAP) were used for comparison between cohorts. The motor growth was also compared with respective DAP estimates based on physical growth data (length, weight, and head circumference) obtained from a large cohort (N = 17,838 infants; age 4–18 months). Findings: Age-specific distributions of posture and movement categories were highly similar between infant cohorts. The DAP scores correlated tightly with age, explaining 97–99% (94–99% CI 95) of the variance at the group average level, and 80–82% (72–88%) of the variance in the individual recordings. Both the average motor and the physical growth measures showed a very strong fit to their respective developmental models (R2 = 0.99). However, single measurements showed more modality-dependent variation that was lowest for motor (σ = 1.4 [1.3–1.5 CI 95] months), length (σ = 1.5 months), and combined physical (σ = 1.5 months) measurements, and it was clearly higher for the weight (σ = 1.9 months) and head circumference (σ = 1.9 months) measurements. Longitudinal tracking showed clear individual trajectories, and its accuracy was comparable between motor and physical measures with longer measurement intervals. Interpretation: A quantified, transparent and explainable assessment of infants' motor performance is possible with a fully automated analysis pipeline, and the results replicate across independent cohorts from out-of-hospital recordings. A holistic assessment of motor development provides an accuracy that is comparable with the conventional physical growth measures. A quantitative measure of infants’ motor development may directly support individual diagnostics and care, as well as facilitate clinical research as an outcome measure in early intervention trials. Funding: This work was supported by the Finnish Academy (314602, 335788, 335872, 332017, 343498), Finnish Pediatric Foundation (Lastentautiensäätiö), Aivosäätiö, Sigrid Jusélius Foundation, and HUS Children's Hospital/HUS diagnostic center research funds.

    @article{2023_f,
    author = {Airaksinen, Manu and Taylor, Elisa and Gallen, Anastasia and Il{\'e}n, Elina and Saari, Antti and Sankilampi, Ulla and R{\"a}s{\"a}nen, Okko and Haataja, Leena M. and Vanhatalo, Sampsa},
    title = "Charting infants{\textquoteright} motor development at home using a wearable system: validation and comparison to physical growth charts",
    abstract = {Background: Early neurodevelopmental care and research are in urgent need of practical methods for quantitative assessment of early motor development. Here, performance of a wearable system in early motor assessment was validated and compared to developmental tracking of physical growth charts. Methods: Altogether 1358 h of spontaneous movement during 226 recording sessions in 116 infants (age 4–19 months) were analysed using a multisensor wearable system. A deep learning-based automatic pipeline quantified categories of infants' postures and movements at a time scale of seconds. Results from an archived cohort (dataset 1, N = 55 infants) recorded under partial supervision were compared to a validation cohort (dataset 2, N = 61) recorded at infants{\textquoteright} homes by the parents. Aggregated recording-level measures including developmental age prediction (DAP) were used for comparison between cohorts. The motor growth was also compared with respective DAP estimates based on physical growth data (length, weight, and head circumference) obtained from a large cohort (N = 17,838 infants; age 4–18 months). Findings: Age-specific distributions of posture and movement categories were highly similar between infant cohorts. The DAP scores correlated tightly with age, explaining 97–99\\% (94–99\\% CI 95) of the variance at the group average level, and 80–82\\% (72–88\\%) of the variance in the individual recordings. Both the average motor and the physical growth measures showed a very strong fit to their respective developmental models (R2 = 0.99). However, single measurements showed more modality-dependent variation that was lowest for motor (σ = 1.4 [1.3–1.5 CI 95] months), length (σ = 1.5 months), and combined physical (σ = 1.5 months) measurements, and it was clearly higher for the weight (σ = 1.9 months) and head circumference (σ = 1.9 months) measurements. Longitudinal tracking showed clear individual trajectories, and its accuracy was comparable between motor and physical measures with longer measurement intervals. Interpretation: A quantified, transparent and explainable assessment of infants' motor performance is possible with a fully automated analysis pipeline, and the results replicate across independent cohorts from out-of-hospital recordings. A holistic assessment of motor development provides an accuracy that is comparable with the conventional physical growth measures. A quantitative measure of infants{\textquoteright} motor development may directly support individual diagnostics and care, as well as facilitate clinical research as an outcome measure in early intervention trials. Funding: This work was supported by the Finnish Academy ( 314602, 335788, 335872, 332017, 343498), Finnish Pediatric Foundation (Lastentautiens{\"a}{\"a}ti{\"o}), Aivos{\"a}{\"a}ti{\"o}, Sigrid Jus{\'e}lius Foundation, and HUS Children's Hospital/ HUS diagnostic center research funds.},
    keywords = "Cerebral palsy, Human activity recognition, Milestones, Motor development, Neurodevelopment, Out-of-hospital",
    note = {Funding Information: This work was supported by the Finnish Academy (314602, 335788, 335872, 332017, 343498), Finnish Pediatric Foundation (Lastentautiens{\"a}{\"a}ti{\"o}), Aivos{\"a}{\"a}ti{\"o}, Sigrid Jus{\'e}lius Foundation, and HUS Children's Hospital/HUS diagnostic center research funds. Publisher Copyright: {\textcopyright} 2023 The Author(s)},
    year = "2023",
    month = "June",
    doi = "10.1016/j.ebiom.2023.104591",
    language = "English",
    volume = "92",
    journal = "Ebiomedicine",
    issn = "2352-3964",
    publisher = "Elsevier BV"
    }

  • M. Airaksinen, S. Vanhatalo, and O. Räsänen, "Comparison of End-to-End Neural Network Architectures and Data Augmentation Methods for Automatic Infant Motility Assessment Using Wearable Sensors," Sensors, vol. 23, iss. 7, 2023. doi:10.3390/s23073773
    [BibTeX] [Abstract]

    Infant motility assessment using intelligent wearables is a promising new approach for assessment of infant neurophysiological development, and where efficient signal analysis plays a central role. This study investigates the use of different end-to-end neural network architectures for processing infant motility data from wearable sensors. We focus on the performance and computational burden of alternative sensor encoder and time series modeling modules and their combinations. In addition, we explore the benefits of data augmentation methods in ideal and nonideal recording conditions. The experiments are conducted using a dataset of multisensor movement recordings from 7-month-old infants, as captured by a recently proposed smart jumpsuit for infant motility assessment. Our results indicate that the choice of the encoder module has a major impact on classifier performance. For sensor encoders, the best performance was obtained with parallel two-dimensional convolutions for intrasensor channel fusion with shared weights for all sensors. The results also indicate that a relatively compact feature representation is obtainable for within-sensor feature extraction without a drastic loss to classifier performance. Comparison of time series models revealed that feedforward dilated convolutions with residual and skip connections outperformed all recurrent neural network (RNN)-based models in performance, training time, and training stability. The experiments also indicate that data augmentation improves model robustness in simulated packet loss or sensor dropout scenarios. In particular, signal- and sensor-dropout-based augmentation strategies provided considerable boosts to performance without negatively affecting the baseline performance. Overall, the results provide tangible suggestions on how to optimize end-to-end neural network training for multichannel movement sensor data.

    @article{2023_g,
    author = {Airaksinen, Manu and Vanhatalo, Sampsa and R{\"a}s{\"a}nen, Okko},
    title = "Comparison of End-to-End Neural Network Architectures and Data Augmentation Methods for Automatic Infant Motility Assessment Using Wearable Sensors",
    abstract = "Infant motility assessment using intelligent wearables is a promising new approach for assessment of infant neurophysiological development, and where efficient signal analysis plays a central role. This study investigates the use of different end-to-end neural network architectures for processing infant motility data from wearable sensors. We focus on the performance and computational burden of alternative sensor encoder and time series modeling modules and their combinations. In addition, we explore the benefits of data augmentation methods in ideal and nonideal recording conditions. The experiments are conducted using a dataset of multisensor movement recordings from 7-month-old infants, as captured by a recently proposed smart jumpsuit for infant motility assessment. Our results indicate that the choice of the encoder module has a major impact on classifier performance. For sensor encoders, the best performance was obtained with parallel two-dimensional convolutions for intrasensor channel fusion with shared weights for all sensors. The results also indicate that a relatively compact feature representation is obtainable for within-sensor feature extraction without a drastic loss to classifier performance. Comparison of time series models revealed that feedforward dilated convolutions with residual and skip connections outperformed all recurrent neural network (RNN)-based models in performance, training time, and training stability. The experiments also indicate that data augmentation improves model robustness in simulated packet loss or sensor dropout scenarios. In particular, signal- and sensor-dropout-based augmentation strategies provided considerable boosts to performance without negatively affecting the baseline performance. Overall, the results provide tangible suggestions on how to optimize end-to-end neural network training for multichannel movement sensor data.",
    keywords = "classifier architectures, human activity recognition, infant motility, wearable technology",
    note = {Funding Information: The research was funded by Academy of Finland grants no. 314602, 314573, 314450, 335778, 332017, and 343498, as well as project grants from Lastentautien tutkimuss{\"a}{\"a}ti{\"o}, Suomen Aivos{\"a}{\"a}ti{\"o} and Sigrid Juselius foundation. Open access funding provided by University of Helsinki. Publisher Copyright: {\textcopyright} 2023 by the authors.},
    year = "2023",
    month = "April",
    doi = "10.3390/s23073773",
    language = "English",
    volume = "23",
    journal = "Sensors",
    issn = "1424-8220",
    publisher = "MDPI",
    number = "7"
    }

  • M. Cruz Blandon, A. Cristia, and O. Räsänen, "Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research," in Proceedings of the Annual Meeting of the Cognitive Science Society, Vol 45, 2023, p. 2021–2028.
    [BibTeX] [Abstract]

    Modelling of early language acquisition aims to understand how infants bootstrap their language skills. The modelling encompasses properties of the input data used for training the models, the cognitive hypotheses and their algorithmic implementations being tested, and the evaluation methodologies to compare models to human data. Recent developments have enabled the use of more naturalistic training data for computational models. This also motivates development of more naturalistic tests of model behaviour. A crucial step towards such an aim is to develop representative speech datasets consisting of speech heard by infants in their natural environments. However, a major drawback of such recordings is that they are typically noisy, and it is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data. In this paper, we explore this aspect for the case of infant-directed speech (IDS) and adult-directed speech (ADS) analysis. First, we manually and automatically annotated audio quality of utterances extracted from two corpora of child-centred long-form recordings (in English and French). We then compared acoustic features of IDS and ADS in an in-lab dataset and across different audio quality subsets of naturalistic data. Finally, we assessed how the audio quality and recording environment may change the conclusions of a modelling analysis using a recent self-supervised learning model. Our results show that the use of modest and high audio quality naturalistic speech data result in largely similar conclusions on IDS and ADS in terms of acoustic analyses and modelling experiments. We also found that an automatic sound quality assessment tool can be used to screen out useful parts of long-form recordings for a closer analysis with comparable results to that of manual quality annotation.

    @inproceedings{2023_d,
    author = {Cruz Blandon, Maria and Cristia, Alejandrina and R{\"a}s{\"a}nen, Okko},
    title = "Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research",
    abstract = "Modelling of early language acquisition aims to understand how infants bootstrap their language skills. The modelling encompasses properties of the input data used for training the models, the cognitive hypotheses and their algorithmic implementations being tested, and the evaluation methodologies to compare models to human data. Recent developments have enabled the use of more naturalistic training data for computational models. This also motivates development of more naturalistic tests of model behaviour. A crucial step towards such an aim is to develop representative speech datasets consisting of speech heard by infants in their natural environments. However, a major drawback of such recordings is that they are typically noisy, and it is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data. In this paper, we explore this aspect for the case of infant-directed speech (IDS) and adult-directed speech (ADS) analysis. First, we manually and automatically annotated audio quality of utterances extracted from two corpora of child-centred long-form recordings (in English and French). We then compared acoustic features of IDS and ADS in an in-lab dataset and across different audio quality subsets of naturalistic data. Finally, we assessed how the audio quality and recording environment may change the conclusions of a modelling analysis using a recent self-supervised learning model. Our results show that the use of modest and high audio quality naturalistic speech data result in largely similar conclusions on IDS and ADS in terms of acoustic analyses and modelling experiments. We also found that an automatic sound quality assessment tool can be used to screen out useful parts of long-form recordings for a closer analysis with comparable results to that of manual quality annotation.",
    year = "2023",
    month = "July",
    language = "English",
    series = "Proceedings of the Annual Conference of the Cognitive Science Society",
    publisher = "COGNITIVE SCIENCE SOCIETY",
    pages = "2021--2028",
    booktitle = "Proceedings of the Annual Meeting of the Cognitive Science Society, Vol 45",
    note = "Annual Conference of the Cognitive Science Society ; Conference date: 26-07-2023 Through 29-07-2023"
    }

  • M. A. Cruz Blandón, A. Cristia, and O. Räsänen, "Introducing Meta-analysis in the Evaluation of Computational Models of Infant Language Development," Cognitive Science, vol. 47, iss. 7, p. e13307, 2023. doi:10.1111/cogs.13307
    [BibTeX] [Abstract]

    Computational models of child language development can help us understand the cognitive underpinnings of the language learning process, which occurs along several linguistic levels at once (e.g., prosodic and phonological). However, in light of the replication crisis, modelers face the challenge of selecting representative and consolidated infant data. Thus, it is desirable to have evaluation methodologies that could account for robust empirical reference data, across multiple infant capabilities. Moreover, there is a need for practices that can compare developmental trajectories of infants to those of models as a function of language experience and development. The present study aims to take concrete steps to address these needs by introducing the concept of comparing models with large-scale cumulative empirical data from infants, as quantified by meta-analyses conducted across a large number of individual behavioral studies. We formalize the connection between measurable model and human behavior, and then present a conceptual framework for meta-analytic evaluation of computational models. We exemplify the meta-analytic model evaluation approach with two modeling experiments on infant-directed speech preference and native/non-native vowel discrimination.

    @article{2023_b,
    author = {Cruz Bland{\'o}n, Mar{\'i}a Andrea and Cristia, Alejandrina and R{\"a}s{\"a}nen, Okko},
    title = "Introducing Meta-analysis in the Evaluation of Computational Models of Infant Language Development",
    abstract = "Computational models of child language development can help us understand the cognitive underpinnings of the language learning process, which occurs along several linguistic levels at once (e.g., prosodic and phonological). However, in light of the replication crisis, modelers face the challenge of selecting representative and consolidated infant data. Thus, it is desirable to have evaluation methodologies that could account for robust empirical reference data, across multiple infant capabilities. Moreover, there is a need for practices that can compare developmental trajectories of infants to those of models as a function of language experience and development. The present study aims to take concrete steps to address these needs by introducing the concept of comparing models with large-scale cumulative empirical data from infants, as quantified by meta-analyses conducted across a large number of individual behavioral studies. We formalize the connection between measurable model and human behavior, and then present a conceptual framework for meta-analytic evaluation of computational models. We exemplify the meta-analytic model evaluation approach with two modeling experiments on infant-directed speech preference and native/non-native vowel discrimination.",
    keywords = "Child language development, Computational modeling, Language acquisition, Meta-analysis, Model evaluation",
    note = "Publisher Copyright: {\textcopyright} 2023 The Authors. Cognitive Science published by Wiley Periodicals LLC on behalf of Cognitive Science Society (CSS).",
    year = "2023",
    month = "July",
    doi = "10.1111/cogs.13307",
    language = "English",
    volume = "47",
    pages = "e13307",
    journal = "COGNITIVE SCIENCE",
    issn = "0364-0213",
    publisher = "Wiley-Blackwell",
    number = "7"
    }
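
    The framework in the entry above rests on expressing model behaviour in the same effect-size units used by meta-analyses of infant studies. The NumPy snippet below shows the generic idea, computing a standardized effect size (Cohen's d) from paired model scores and checking it against a meta-analytic estimate; the numbers, variable names, and comparison rule are hypothetical and do not follow the paper's formal protocol.

    import numpy as np

    def cohens_d_paired(scores_a, scores_b):
        """Standardized mean difference for paired conditions, e.g. a model's
        native vs. non-native vowel discrimination scores on the same items."""
        diff = np.asarray(scores_a) - np.asarray(scores_b)
        return diff.mean() / diff.std(ddof=1)

    # Hypothetical model outputs for two conditions on the same 40 test items.
    rng = np.random.default_rng(1)
    native = rng.normal(0.75, 0.10, 40)
    nonnative = rng.normal(0.65, 0.10, 40)
    model_d = cohens_d_paired(native, nonnative)

    # Hypothetical meta-analytic reference (effect size and 95% CI) for infants
    # of the simulated age; the model is "consistent" if its d falls in the CI.
    meta_d, ci_low, ci_high = 0.45, 0.20, 0.70
    print(f"model d = {model_d:.2f}, within infant CI: {ci_low <= model_d <= ci_high}")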

  • D. Diaz-Guerra, A. Politis, and T. Virtanen, "Position Tracking of a Varying Number of Sound Sources with Sliding Permutation Invariant Training," in 2023 31st European Signal Processing Conference (EUSIPCO), United States, 2023, p. 251–255. doi:10.23919/eusipco58844.2023.10289897
    [BibTeX] [Abstract]

    Machine-learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios. However, little work has been done on adapting such methods to track consistently multiple sources appearing and disappearing, as would occur in reality. In this paper, we present a new training strategy for deep learning SSL models with a straightforward implementation based on the mean squared error of the optimal association between estimated and reference positions in the preceding time frames. It optimizes the desired properties of a tracking system: handling a time-varying number of sources and ordering localization estimates according to their trajectories, minimizing identity switches (IDSs). Evaluation on simulated data of multiple reverberant moving sources and on two model architectures proves its effectiveness in reducing identity switches without compromising frame-wise localization accuracy.

    @inproceedings{2023_EUSIPCO_c,
    author = "Diaz-Guerra, David and Politis, Archontis and Virtanen, Tuomas",
    title = "Position Tracking of a Varying Number of Sound Sources with Sliding Permutation Invariant Training",
    abstract = "Machine-learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios. However, little work has been done on adapting such methods to track consistently multiple sources appearing and disappearing, as would occur in reality. In this paper, we present a new training strategy for deep learning SSL models with a straightforward implementation based on the mean squared error of the optimal association between estimated and reference positions in the preceding time frames. It optimizes the desired properties of a tracking system: handling a time-varying number of sources and ordering localization estimates according to their trajectories, minimizing identity switches (IDSs). Evaluation on simulated data of multiple reverberant moving sources and on two model architectures proves its effectiveness in reducing identity switches without compromising frame-wise localization accuracy.",
    year = "2023",
    month = "September",
    day = "4",
    doi = "10.23919/eusipco58844.2023.10289897",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "251--255",
    booktitle = "2023 31st European Signal Processing Conference (EUSIPCO)",
    address = "United States",
    note = "European Signal Processing Conference ; Conference date: 04-09-2023 Through 08-09-2023"
    }
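
    The training strategy summarized above picks, for each frame, the assignment between estimated tracks and reference positions that minimizes the mean squared error over the preceding frames, and then applies that assignment to the current frame, which discourages identity switches. The PyTorch sketch below implements such a sliding permutation-invariant loss in a simplified form; the window length, the absence of source-activity masking, and the naming are simplifying assumptions rather than the published formulation.

    import itertools
    import torch

    def sliding_pit_loss(est, ref, window=10):
        """Sliding permutation invariant training loss for position tracking.

        est, ref: (frames, sources, 3) Cartesian position estimates/references.
        For each frame, the permutation of estimated tracks is chosen by the MSE
        accumulated over the preceding `window` frames, then applied to the
        current frame, penalizing identity switches between consecutive frames.
        """
        n_frames, n_src, _ = est.shape
        perms = [list(p) for p in itertools.permutations(range(n_src))]
        total = est.new_zeros(())

        for t in range(n_frames):
            lo = max(0, t - window)
            # Cumulative error of each permutation over the preceding frames.
            past_err = [((est[lo:t + 1, p, :] - ref[lo:t + 1]) ** 2).mean()
                        for p in perms]
            best = perms[int(torch.argmin(torch.stack(past_err)))]
            # Frame-level loss with the permutation chosen from the past window.
            total = total + ((est[t, best, :] - ref[t]) ** 2).mean()
        return total / n_frames

    # Example: two moving sources over 50 frames with noisy position estimates.
    ref = torch.cumsum(0.05 * torch.randn(50, 2, 3), dim=0)
    est = ref + 0.1 * torch.randn(50, 2, 3)
    print(sliding_pit_loss(est, ref).item())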

  • K. Khorrami, M. Cruz Blandon, and O. Räsänen, "Computational Insights to Acquisition of Phonemes, Words, and Word Meanings in Early Language: Sequential or Parallel Acquisition?," in Proceedings of the Annual Meeting of the Cognitive Science Society, 2023, p. 389–396.
    [BibTeX] [Abstract]

    Previous computational models of early language acquisition have shown how linguistic structure of speech can be acquired using auditory or audiovisual learning mechanisms. However, real infants have sustained access to both uni- and multimodal sensory experiences. Therefore, it is of interest how the uni- and multimodal learning mechanisms could operate in concert, and how their interplay might affect the acquisition dynamics of different linguistic representations. This paper explores these questions with a computational model capable of simultaneous auditory and audiovisual learning from speech and images. We study how the model{’}s latent representations reflect phonemic, lexical, and semantic knowledge as a function of language experience. We also test how the findings vary with differential emphasis on the two learning mechanisms. As a result, we find phonemic learning always starting to emerge before lexical learning, followed by semantics. However, there is also notable overlap in their development. The same pattern emerges irrespectively of the emphasis on auditory or audiovisual learning. The result illustrates how the acquisition dynamics of linguistic representations are decoupled from the primary learning objectives (mechanisms) of the learner, and how the emergence of phonemes and words can be facilitated by both auditory and audiovisual learning in a synergetic manner.

    @inproceedings{2023,
    author = {Khorrami, Khazar and Cruz Blandon, Maria and R{\"a}s{\"a}nen, Okko},
    title = "Computational Insights to Acquisition of Phonemes, Words, and Word Meanings in Early Language: Sequential or Parallel Acquisition?",
    abstract = "Previous computational models of early language acquisition have shown how linguistic structure of speech can be acquired using auditory or audiovisual learning mechanisms. However, real infants have sustained access to both uni- and multimodal sensory experiences. Therefore, it is of interest how the uni- and multimodal learning mechanisms could operate in concert, and how their interplay might affect the acquisition dynamics of different linguistic representations. This paper explores these questions with a computational model capable of simultaneous auditory and audiovisual learning from speech and images. We study how the model{\textquoteright}s latent representations reflect phonemic, lexical, and semantic knowledge as a function of language experience. We also test how the findings vary with differential emphasis on the two learning mechanisms. As a result, we find phonemic learning always starting to emerge before lexical learning, followed by semantics. However, there is also notable overlap in their development. The same pattern emerges irrespectively of the emphasis on auditory or audiovisual learning. The result illustrates how the acquisition dynamics of linguistic representations are decoupled from the primary learning objectives (mechanisms) of the learner, and how the emergence of phonemes and words can be facilitated by both auditory and audiovisual learning in a synergetic manner.",
    year = "2023",
    language = "English",
    volume = "45",
    series = "Proceedings of the Annual Conference of the Cognitive Science Society",
    publisher = "COGNITIVE SCIENCE SOCIETY",
    pages = "389--396",
    booktitle = "Proceedings of the Annual Meeting of the Cognitive Science Society",
    note = "Annual Meeting of the Cognitive Science Society ; Conference date: 26-07-2023 Through 29-07-2023"
    }

  • K. Khorrami, M. Cruz Blandon, T. Virtanen, and O. Räsänen, "Simultaneous or sequential training? How speech representations cooperate in a multi-task self-supervised learning system," in Proceedings of the 31st European Signal Processing Conference (EUSIPCO), United States, 2023, p. 431–435. doi:10.23919/EUSIPCO58844.2023.10290051
    [BibTeX] [Abstract]

    Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. The joint training with SSL and VGS mechanisms provides the opportunity to utilize both unlabeled speech and speech-related visual information based on data availability. This has shown to enhance the quality of learned representations, especially at encoding semantic- and lexical-level knowledge. In this work, we further study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multitask learning system. We explore a set of training scenarios to understand how speech representations are shared or transferred between the two tasks, and what is the optimal training strategy for cross-modal semantic retrieval and phoneme discrimination performance. As a result, we find that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms. However, the parallel SSL-VGS training reduces the effects of catastrophic forgetting when switching between optimization criteria. Moreover, the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets compared to those learned with SSL.

    @inproceedings{2023_EUSIPCO,
    author = {Khorrami, Khazar and Cruz Blandon, Maria and Virtanen, Tuomas and R{\"a}s{\"a}nen, Okko},
    title = "Simultaneous or sequential training? How speech representations cooperate in a multi-task self-supervised learning system",
    abstract = "Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. The joint training with SSL and VGS mechanisms provides the opportunity to utilize both unlabeled speech and speech-related visual information based on data availability. This has shown to enhance the quality of learned representations, especially at encoding semantic- and lexical-level knowledge. In this work, we further study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multitask learning system. We explore a set of training scenarios to understand how speech representations are shared or transferred between the two tasks, and what is the optimal training strategy for cross-modal semantic retrieval and phoneme discrimination performance. As a result, we find that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms. However, the parallel SSL-VGS training reduces the effects of catastrophic forgetting when switching between optimization criteria. Moreover, the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets compared to those learned with SSL.",
    year = "2023",
    doi = "10.23919/EUSIPCO58844.2023.10290051",
    language = "English",
    isbn = "978-9-4645-9360-0",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "431--435",
    booktitle = "Proceedings of the 31st European Signal Processing Conference (EUSIPCO)",
    address = "United States",
    note = "European Signal Processing Conference ; Conference date: 04-09-2023 Through 08-09-2023"
    }

  • K. Lahtinen, L. Mustanoja, J. Simko, and O. Räsänen, "Human and Machine Perception of Affect in Speech." 2023, p. 1–5.
    [BibTeX] [Abstract]

    This poster presents the motivation, methods and goals for a research project on modeling and studying expression of affect in Finnish language. The project combines machine learning and signal processing methods with linguistic knowledge on affect expression. The research aims at finding ways to recognize how affect is expressed by individual speakers in different speaking conditions (differentiating what is being said from how things are being said) and how affect expressed in spoken language relates to affect expressed in, and implied by, written language. The research is planned to span over a four year period, during which a corpus (dataset) of emotional Finnish speech will be collected, analyzed, and processed for modelling purposes. Based on this data, a speech emotion classifier for conversational Finnish will be developed and later further improved to enable the study and automatic recognition of idiolectical variation of affect expression. Results of the project can be used for improved affective AI systems that are able to understand the richness of human emotions in spoken communication in various circumstances by various speakers and listeners.

    @conference{2023_a,
    author = {Lahtinen, Kalle and Mustanoja, Liisa and Simko, Juraj and R{\"a}s{\"a}nen, Okko},
    title = "Human and Machine Perception of Affect in Speech",
    abstract = "This poster presents the motivation, methods and goals for a research project on modeling and studying expression of affect in Finnish language. The project combines machine learning and signal processing methods with linguistic knowledge on affect expression. The research aims at finding ways to recognize how affect is expressed by individual speakers in different speaking conditions (differentiating what is being said from how things are being said) and how affect expressed in spoken language relates to affect expressed in, and implied by, written language. The research is planned to span over a four year period, during which a corpus (dataset) of emotional Finnish speech will be collected, analyzed, and processed for modelling purposes. Based on this data, a speech emotion classifier for conversational Finnish will be developed and later further improved to enable the study and automatic recognition of idiolectical variation of affect expression. Results of the project can be used for improved affective AI systems that are able to understand the richness of human emotions in spoken communication in various circumstances by various speakers and listeners.",
    keywords = "speech processing, artificial inteligence, machine learning, affective computing, linguistic variation and change, audio, phonology",
    year = "2023",
    month = "October",
    language = "English",
    pages = "1--5",
    note = "26th International Academic Mindtrek Conference, ACADEMIC MINDTRICK 2023 ; Conference date: 03-10-2023 Through 06-10-2023"
    }

  • M. Lavechin, Y. Sy, H. Titeux, M. A. C. Blandón, O. Räsänen, H. Bredin, E. Dupoux, and A. Cristia, "BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023, p. 4588–4592. doi:10.21437/Interspeech.2023-978
    [BibTeX] [Abstract]

    Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels, both of which are compatible with the vocabulary typical of children's language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.

    @inproceedings{2023_InterSpecch_a,
    author = {Lavechin, Marvin and Sy, Yaya and Titeux, Hadrien and Bland{\'o}n, Mar{\'i}a Andrea Cruz and R{\"a}s{\"a}nen, Okko and Bredin, Herv{\'e} and Dupoux, Emmanuel and Cristia, Alejandrina},
    title = "BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models",
    abstract = "Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels, both of which are compatible with the vocabulary typical of children's language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.",
    keywords = "child language, language acquisition, self-supervised learning, spoken language modeling",
    note = "Publisher Copyright: {\textcopyright} 2023 International Speech Communication Association. All rights reserved.; Annual Conference of the International Speech Communication Association, INTERSPEECH ; Conference date: 20-08-2023 Through 24-08-2023",
    year = "2023",
    doi = "10.21437/Interspeech.2023-978",
    language = "English",
    series = "Interspeech",
    publisher = "International Speech Communication Association",
    pages = "4588--4592",
    booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH"
    }

  • Y. Liu, K. Mittapalle, N. Penttilä, T. Ihalainen, P. Alku, and O. Räsänen, "Automatic Assessment of Parkinson's Disease Using Speech Representations of Phonation and Articulation," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 31, p. 242–255, 2023. doi:10.1109/TASLP.2022.3212829
    [BibTeX] [Abstract]

    Speech from people with Parkinson's disease (PD) are likely to be degraded on phonation, articulation, and prosody. Motivated to describe articulation deficits comprehensively, we investigated 1) the universal phonological features that model articulation manner and place, also known as speech attributes, and 2) glottal features capturing phonation characteristics. These were further supplemented by, and compared with, prosodic features using a popular compact feature set and standard MFCC. Temporal characteristics of these features were modeled by convolutional neural networks. Besides the features, we were also interested in the speech tasks for collecting data for automatic PD speech assessment, like sustained vowels, text reading, and spontaneous monologue. For this, we utilized a recently collected Finnish PD corpus (PDSTU) as well as a Spanish database (PC-GITA). The experiments were formulated as regression problems against expert ratings of PD-related symptoms, including ratings of speech intelligibility, voice impairment, overall severity of communication disorder on PDSTU, as well as on the Unified Parkinson's Disease Rating Scale (UPDRS) on PC-GITA. The experimental results show: 1) the speech attribute features can well indicate the severity of pathologies in parkinsonian speech; 2) combining phonation features with articulatory features improves the PD assessment performance, but requires high-quality recordings to be applicable; 3) read speech leads to more accurate automatic ratings than the use of sustained vowels, but not if the amount of speech is limited to correspond to the sustained vowels in duration; and 4) jointly using data from several speech tasks can further improve the automatic PD assessment performance.

    @article{2023_e,
    author = {Liu, Yuanyuan and Mittapalle, Kiran and Penttil{\"a}, Nelly and Ihalainen, Tiina and Alku, Paavo and R{\"a}s{\"a}nen, Okko},
    title = "Automatic Assessment of Parkinson's Disease Using Speech Representations of Phonation and Articulation",
    abstract = "Speech from people with Parkinson's disease (PD) are likely to be degraded on phonation, articulation, and prosody. Motivated to describe articulation deficits comprehensively, we investigated 1) the universal phonological features that model articulation manner and place, also known as speech attributes, and 2) glottal features capturing phonation characteristics. These were further supplemented by, and compared with, prosodic features using a popular compact feature set and standard MFCC. Temporal characteristics of these features were modeled by convolutional neural networks. Besides the features, we were also interested in the speech tasks for collecting data for automatic PD speech assessment, like sustained vowels, text reading, and spontaneous monologue. For this, we utilized a recently collected Finnish PD corpus (PDSTU) as well as a Spanish database (PC-GITA). The experiments were formulated as regression problems against expert ratings of PD-related symptoms, including ratings of speech intelligibility, voice impairment, overall severity of communication disorder on PDSTU, as well as on the Unified Parkinson's Disease Rating Scale (UPDRS) on PC-GITA. The experimental results show: 1) the speech attribute features can well indicate the severity of pathologies in parkinsonian speech; 2) combining phonation features with articulatory features improves the PD assessment performance, but requires high-quality recordings to be applicable; 3) read speech leads to more accurate automatic ratings than the use of sustained vowels, but not if the amount of speech is limited to correspond to the sustained vowels in duration; and 4) jointly using data from several speech tasks can further improve the automatic PD assessment performance.",
    year = "2023",
    doi = "10.1109/TASLP.2022.3212829",
    language = "English",
    volume = "31",
    pages = "242--255",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity"
    }

  • D. Luong, M. Tran, S. Gharib, K. Drossos, and T. Virtanen, "Representation Learning for Audio Privacy Preservation Using Source Separation and Robust Adversarial Learning," in Proceedings of the 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023, United States, 2023. doi:10.1109/WASPAA58266.2023.10248153
    [BibTeX] [Abstract]

    Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment. In this study, we propose the integration of two commonly used approaches in privacy preservation: source separation and adversarial representation learning. The proposed system learns the latent representation of audio recordings such that it prevents differentiating between speech and non-speech recordings. Initially, the source separation network filters out some of the privacy-sensitive data, and during the adversarial learning process, the system will learn privacy-preserving representation on the filtered signal. We demonstrate the effectiveness of our proposed method by comparing our method against systems without source separation, without adversarial learning, and without both. Overall, our results suggest that the proposed system can significantly improve speech privacy preservation compared to that of using source separation or adversarial learning solely while maintaining good performance in the acoustic monitoring task.

    @inproceedings{2023_WASPAA_a,
    author = "Luong, Diep and Tran, Minh and Gharib, Shayan and Drossos, Konstantinos and Virtanen, Tuomas",
    title = "Representation Learning for Audio Privacy Preservation Using Source Separation and Robust Adversarial Learning",
    abstract = "Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment. In this study, we propose the integration of two commonly used approaches in privacy preservation: source separation and adversarial representation learning. The proposed system learns the latent representation of audio recordings such that it prevents differentiating between speech and non-speech recordings. Initially, the source separation network filters out some of the privacy-sensitive data, and during the adversarial learning process, the system will learn privacy-preserving representation on the filtered signal. We demonstrate the effectiveness of our proposed method by comparing our method against systems without source separation, without adversarial learning, and without both. Overall, our results suggest that the proposed system can significantly improve speech privacy preservation compared to that of using source separation or adversarial learning solely while maintaining good performance in the acoustic monitoring task.",
    keywords = "adversarial networks, privacy preservation, sound event detection, source separation",
    note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; IEEE Workshop on Applications of Signal Processing to Audio and Acoustics ; Conference date: 22-10-2023 Through 25-10-2023",
    year = "2023",
    doi = "10.1109/WASPAA58266.2023.10248153",
    language = "English",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    publisher = "IEEE",
    booktitle = "Proceedings of the 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023",
    address = "United States"
    }
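
    As a rough illustration of the adversarial representation learning component described in the abstract above (the paper's source-separation front end is omitted, and the encoder type, layer sizes, and class counts below are invented for the sketch, not taken from the paper), a gradient-reversal setup in PyTorch could look like this:

      import torch
      from torch import nn


      class GradReverse(torch.autograd.Function):
          """Identity in the forward pass; negated, scaled gradient in the backward
          pass, so the encoder is pushed to hinder the speech discriminator."""
          @staticmethod
          def forward(ctx, x, lam):
              ctx.lam = lam
              return x.view_as(x)

          @staticmethod
          def backward(ctx, grad_output):
              return -ctx.lam * grad_output, None


      class PrivacyPreservingNet(nn.Module):
          def __init__(self, n_feats=64, n_events=10):
              super().__init__()
              self.encoder = nn.GRU(n_feats, 128, batch_first=True)
              self.event_head = nn.Linear(128, n_events)    # monitoring task, e.g. sound events
              self.speech_head = nn.Linear(128, 2)          # adversary: speech vs. non-speech

          def forward(self, x, lam=1.0):
              _, h = self.encoder(x)                        # x: (batch, frames, n_feats)
              z = h[-1]                                     # latent representation
              return self.event_head(z), self.speech_head(GradReverse.apply(z, lam))

    Training would minimise both cross-entropy losses together; the reversed gradient drives the encoder to remove speech-related cues while the adversary keeps trying to recover them.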

  • P. Magron and T. Virtanen, "Spectrogram Inversion for Audio Source Separation via Consistency, Mixing, and Magnitude Constraints," in 31st European Signal Processing Conference, EUSIPCO 2023 - Proceedings, United States, 2023, p. 36–40. doi:10.23919/EUSIPCO58844.2023.10290068
    [BibTeX] [Abstract]

    Audio source separation is often achieved by estimating the magnitude spectrogram of each source, and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is treated as an optimization problem involving one or several terms in order to promote estimates that comply with a consistency property, a mixing constraint, and/or a target magnitude objective. Nonetheless, it is still unclear which set of constraints and problem formulation is the most appropriate in practice. In this paper, we design a general framework for deriving spectrogram inversion algorithm, which is based on formulating optimization problems by combining these objectives either as soft penalties or hard constraints. We solve these by means of algorithms that perform alternating projections on the subsets corresponding to each objective/constraint. Our framework encompasses existing techniques from the literature as well as novel algorithms. We investigate the potential of these approaches for a speech enhancement task. In particular, one of our novel algorithms outperforms other approaches in a realistic setting where the magnitudes are estimated beforehand using a neural network.

    @inproceedings{2023_EUSIPCO_b,
    author = "Magron, Paul and Virtanen, Tuomas",
    title = "Spectrogram Inversion for Audio Source Separation via Consistency, Mixing, and Magnitude Constraints",
    abstract = "Audio source separation is often achieved by estimating the magnitude spectrogram of each source, and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is treated as an optimization problem involving one or several terms in order to promote estimates that comply with a consistency property, a mixing constraint, and/or a target magnitude objective. Nonetheless, it is still unclear which set of constraints and problem formulation is the most appropriate in practice. In this paper, we design a general framework for deriving spectrogram inversion algorithm, which is based on formulating optimization problems by combining these objectives either as soft penalties or hard constraints. We solve these by means of algorithms that perform alternating projections on the subsets corresponding to each objective/constraint. Our framework encompasses existing techniques from the literature as well as novel algorithms. We investigate the potential of these approaches for a speech enhancement task. In particular, one of our novel algorithms outperforms other approaches in a realistic setting where the magnitudes are estimated beforehand using a neural network.",
    keywords = "alternating projections, Audio source separation, phase recovery, spectrogram inversion, speech enhancement",
    note = "Publisher Copyright: {\textcopyright} 2023 European Signal Processing Conference, EUSIPCO. All rights reserved.; European Signal Processing Conference ; Conference date: 04-09-2023 Through 08-09-2023",
    year = "2023",
    doi = "10.23919/EUSIPCO58844.2023.10290068",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "36--40",
    booktitle = "31st European Signal Processing Conference, EUSIPCO 2023 - Proceedings",
    address = "United States"
    }
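
    The paper above formulates spectrogram inversion as alternating projections onto consistency, mixing, and magnitude constraint sets. As a minimal sketch of that idea (one classic hard-constraint instance in the spirit of MISI, not the paper's full soft/hard framework; the STFT parameters, iteration count, and array shapes are assumptions), in Python with librosa:

      import numpy as np
      import librosa


      def misi_like_inversion(mix, target_mags, n_fft=1024, hop=256, n_iter=50):
          """Alternating projections for source separation: consistency (STFT of a
          time signal), mixing (estimates sum to the mixture), and target magnitudes.
          `target_mags` has shape (n_sources, n_freq, n_frames)."""
          n_src = target_mags.shape[0]
          X = librosa.stft(mix, n_fft=n_fft, hop_length=hop)
          # initialise every source with the mixture phase
          S = target_mags * np.exp(1j * np.angle(X))[None, ...]
          for _ in range(n_iter):
              # consistency projection: go to the time domain and back
              y = [librosa.istft(S[k], hop_length=hop, length=len(mix)) for k in range(n_src)]
              C = np.stack([librosa.stft(yk, n_fft=n_fft, hop_length=hop) for yk in y])
              # mixing projection: spread the mixture residual equally over the sources
              C = C + (X - C.sum(axis=0))[None, ...] / n_src
              # magnitude projection: keep the estimated phases, impose the target magnitudes
              S = target_mags * np.exp(1j * np.angle(C))
          return np.stack([librosa.istft(S[k], hop_length=hop, length=len(mix)) for k in range(n_src)])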

  • M. Neri, A. Politis, D. Krause, M. Carli, and T. Virtanen, "Single-Channel Speaker Distance Estimation in Reverberant Environments," in Proceedings of the 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023, United States, 2023. doi:10.1109/WASPAA58266.2023.10248087
    [BibTeX] [Abstract]

    We introduce the novel task of continuous-valued speaker distance estimation which focuses on estimating non-discrete distances between a sound source and microphone, based on audio captured by the microphone. A novel learning-based approach for estimating speaker distance in reverberant environments from a single omnidirectional microphone is proposed. Using common acoustic features, such as the magnitude and phase of the audio spectrogram, with a convolutional recurrent neural network results in errors on the order of centimeters in noiseless audios. Experiments are carried out by means of an image-source room simulator with convolved speeches from a public dataset. An ablation study is performed to demonstrate the effectiveness of the proposed feature set. Finally, a study of the impact of real background noise, extracted from the WHAM! dataset at different signal-to-noise ratios highlights the discrepancy between noisy and noiseless scenarios, underlining the difficulty of the problem.

    @inproceedings{2023_WASPAA,
    author = "Neri, Michael and Politis, Archontis and Krause, Daniel and Carli, Marco and Virtanen, Tuomas",
    title = "Single-Channel Speaker Distance Estimation in Reverberant Environments",
    abstract = "We introduce the novel task of continuous-valued speaker distance estimation which focuses on estimating non-discrete distances between a sound source and microphone, based on audio captured by the microphone. A novel learning-based approach for estimating speaker distance in reverberant environments from a single omnidirectional microphone is proposed. Using common acoustic features, such as the magnitude and phase of the audio spectrogram, with a convolutional recurrent neural network results in errors on the order of centimeters in noiseless audios. Experiments are carried out by means of an image-source room simulator with convolved speeches from a public dataset. An ablation study is performed to demonstrate the effectiveness of the proposed feature set. Finally, a study of the impact of real background noise, extracted from the WHAM! dataset at different signal-to-noise ratios highlights the discrepancy between noisy and noiseless scenarios, underlining the difficulty of the problem.",
    keywords = "Deep Learning, Distance estimation, Reverberation, Single-channel",
    note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; IEEE Workshop on Applications of Signal Processing to Audio and Acoustics ; Conference date: 22-10-2023 Through 25-10-2023",
    year = "2023",
    doi = "10.1109/WASPAA58266.2023.10248087",
    language = "English",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    publisher = "IEEE",
    booktitle = "Proceedings of the 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023",
    address = "United States"
    }
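
    The abstract above describes a convolutional recurrent network that regresses speaker distance from the magnitude and phase of a single-channel spectrogram. A toy PyTorch version of such a CRNN (layer sizes, pooling, and the two-channel magnitude/phase input layout are assumptions, not the paper's architecture) might be:

      import torch
      from torch import nn


      class DistanceCRNN(nn.Module):
          """Conv layers over a (magnitude, phase) spectrogram, a GRU over time,
          and a linear head regressing a single distance value."""
          def __init__(self, n_freq=257):
              super().__init__()
              self.conv = nn.Sequential(
                  nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 1)),
                  nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 1)))
              self.gru = nn.GRU(32 * (n_freq // 16), 64, batch_first=True)
              self.head = nn.Linear(64, 1)

          def forward(self, x):                        # x: (batch, 2, n_freq, n_frames)
              h = self.conv(x)                         # (batch, 32, n_freq // 16, n_frames)
              h = h.permute(0, 3, 1, 2).flatten(2)     # (batch, n_frames, 32 * (n_freq // 16))
              out, _ = self.gru(h)
              return self.head(out[:, -1])             # distance estimate from the last frame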

  • P. Peng, S. W. Li, O. Räsänen, A. Mohamed, and D. Harwath, "Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023, p. 391–395. doi:10.21437/Interspeech.2023-2044
    [BibTeX] [Abstract]

    In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the Zerospeech Challenge, in some cases beating the previous state-of-the-art.

    @inproceedings{2023_InterSpecch,
    author = {Peng, Puyuan and Li, Shang Wen and R{\"a}s{\"a}nen, Okko and Mohamed, Abdelrahman and Harwath, David},
    title = "Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model",
    abstract = "In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the Zerospeech Challenge, in some cases beating the previous state-of-the-art.",
    keywords = "self-supervised speech processing, speech segmentation, visually-grounded speech",
    note = "Publisher Copyright: {\textcopyright} 2023 International Speech Communication Association. All rights reserved.; Annual Conference of the International Speech Communication Association, INTERSPEECH ; Conference date: 20-08-2023 Through 24-08-2023",
    year = "2023",
    doi = "10.21437/Interspeech.2023-2044",
    language = "English",
    series = "Interspeech",
    publisher = "International Speech Communication Association",
    pages = "391--395",
    booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH"
    }

  • O. Räsänen, M. Cruz Blandon, and J. Leppänen, "Is Reliability of Cognitive Measures in Children Dependent on Participant Age? A Case Study with Two Large-Scale Datasets," in Proceedings of the 45th Annual Conference of the Cognitive Science Society, 2023, p. 1998–2004.
    [BibTeX] [Abstract]

    When assessing children in laboratory experiments, the measured responses also contain task-irrelevant participant-level variability (“noise”) and measurement noise. Since experimental data are used to make inferences of development of cognitive capabilities with age, it is important to know if reliability of the used measurements depends on child age. Any systematic age-dependent changes in reliability could result in misleading developmental trajectories, as lower reliability will necessarily result in smaller effect sizes. This paper examines age-dependency of task-independent measurement variability in early childhood (3–40 months) by analyzing two large-scale datasets of participant-level experimental responses: the ManyBabies infant-directed speech preference (MB-IDS) dataset and a saccadic reaction time (SRT) dataset collected from rural South Africa. Analysis of participant- and study-level data reveals that MB-IDS shows comparable reliability across the included age range. In contrast, SRTs reflect systematically increasing measurement consistency with increasing age. Potential reasons and implications of this divergence are briefly discussed.

    @inproceedings{2023_c,
    author = {R{\"a}s{\"a}nen, Okko and Cruz Blandon, Maria and Lepp{\"a}nen, Jukka},
    title = "Is Reliability of Cognitive Measures in Children Dependent on Participant Age? A Case Study with Two Large-Scale Datasets",
    abstract = "When assessing children in laboratory experiments, the measured responses also contain task-irrelevant participant-level variability (“noise”) and measurement noise. Since experimental data are used to make inferences of development of cognitive capabilities with age, it is important to know if reliability of the used measurements depends on child age. Any systematic age-dependent changes in reliability could result in misleading developmental trajectories, as lower reliability will necessarily result in smaller effect sizes. This paper examines age-dependency of task-independent measurement variability in early childhood (3–40 months) by analyzing two large-scale datasets of participant-level experimental responses: the ManyBabies infant-directed speech preference (MB-IDS) dataset and a saccadic reaction time (SRT) dataset collected from rural South Africa. Analysis of participant- and study-level data reveals that MB-IDS shows comparable reliability across the included age range. In contrast, SRTs reflect systematically increasing measurement consistency with increasing age. Potential reasons and implications of this divergence are briefly discussed.",
    year = "2023",
    month = "July",
    language = "English",
    volume = "45",
    series = "Proceedings of the Annual Conference of the Cognitive Science Society",
    publisher = "COGNITIVE SCIENCE SOCIETY",
    pages = "1998--2004",
    booktitle = "Proceedings of the 45th Annual Conference of the Cognitive Science Society",
    note = "Annual Conference of the Cognitive Science Society ; Conference date: 26-07-2023 Through 29-07-2023"
    }

  • K. Shimada, A. Politis, P. Ariyakulam Sudarsanam, D. Krause, K. Uchida, S. Adavanne, A. Hakala, Y. Koyama, N. Takahashi, S. Takahashi, T. Virtanen, and Y. Mitsufuji, "STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events," in Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023, p. 1–27.
    [BibTeX]
    @inproceedings{2023_NeurIPS_2023,
    author = "Shimada, Kazuki and Politis, Archontis and Ariyakulam Sudarsanam, Parthasaarathy and Krause, Daniel and Uchida, Kengo and Adavanne, Sharath and Hakala, Aapo and Koyama, Yuichiro and Takahashi, Naoya and Takahashi, Shusuke and Virtanen, Tuomas and Mitsufuji, Yuki",
    title = "STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events",
    year = "2023",
    language = "English",
    series = "Advances in neural information processing systems",
    publisher = "NeurIPS",
    pages = "1--27",
    booktitle = "Advances in Neural Information Processing Systems 36 (NeurIPS 2023)",
    note = "Conference on Neural Information Processing Systems ; Conference date: 10-12-2023 Through 16-12-2023"
    }

  • P. Sudarsanam and T. Virtanen, "Attention-Based Methods For Audio Question Answering," in 31st European Signal Processing Conference, EUSIPCO 2023 - Proceedings, 2023, p. 750–754. doi:10.23919/EUSIPCO58844.2023.10289751
    [BibTeX] [Abstract]

    Audio question answering (AQA) is the task of producing natural language answers when a system is provided with audio and natural language questions. In this paper, we propose neural network architectures based on self-attention and cross-attention for the AQA task. The self-attention layers extract powerful audio and textual representations. The cross-attention maps audio features that are relevant to the textual features to produce answers. All our models are trained on the recently proposed Clotho-AQA dataset for both binary yes/no questions and single-word answer questions. Our results clearly show improvement over the reference method reported in the original paper. On the yes/no binary classification task, our proposed model achieves an accuracy of 68.3% compared to 62.7% in the reference model. For the single-word answers multiclass classifier, our model produces a top-1 and top-5 accuracy of 57.9% and 99.8% compared to 54.2% and 93.7% in the reference model respectively. We further discuss some of the challenges in the Clotho-AQA dataset such as the presence of the same answer word in multiple tenses, singular and plural forms, and the presence of specific and generic answers to the same question. We address these issues and present a revised version of the dataset.

    @inproceedings{2023_EUSIPCO_a,
    author = "Sudarsanam, Parthasaarathy and Virtanen, Tuomas",
    title = "Attention-Based Methods For Audio Question Answering",
    abstract = "Audio question answering (AQA) is the task of producing natural language answers when a system is provided with audio and natural language questions. In this paper, we propose neural network architectures based on self-attention and cross-attention for the AQA task. The self-attention layers extract powerful audio and textual representations. The cross-attention maps audio features that are relevant to the textual features to produce answers. All our models are trained on the recently proposed Clotho-AQA dataset for both binary yes/no questions and single-word answer questions. Our results clearly show improvement over the reference method reported in the original paper. On the yes/no binary classification task, our proposed model achieves an accuracy of 68.3\% compared to 62.7\% in the reference model. For the single-word answers multiclass classifier, our model produces a top-1 and top-5 accuracy of 57.9\% and 99.8\% compared to 54.2\% and 93.7\% in the reference model respectively. We further discuss some of the challenges in the Clotho-AQA dataset such as the presence of the same answer word in multiple tenses, singular and plural forms, and the presence of specific and generic answers to the same question. We address these issues and present a revised version of the dataset.",
    keywords = "attention mechanism, Audio question answering, Clotho-AQA",
    note = "Publisher Copyright: {\textcopyright} 2023 European Signal Processing Conference, EUSIPCO. All rights reserved.; European Signal Processing Conference ; Conference date: 04-09-2023 Through 08-09-2023",
    year = "2023",
    doi = "10.23919/EUSIPCO58844.2023.10289751",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "European Signal Processing Conference, EUSIPCO",
    pages = "750--754",
    booktitle = "31st European Signal Processing Conference, EUSIPCO 2023 - Proceedings"
    }
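
    To make the self-attention plus cross-attention pipeline described above concrete, here is a minimal PyTorch sketch (the embedding size, head count, pooling, and answer-vocabulary size are placeholders, and the audio/text front ends that produce the token embeddings are not shown):

      import torch
      from torch import nn


      class CrossAttentionAQA(nn.Module):
          """Question tokens query audio frames via cross-attention; the pooled
          result is classified into a single-word answer."""
          def __init__(self, d_model=256, n_heads=4, n_answers=1000):
              super().__init__()
              self.audio_self = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
              self.text_self = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
              self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
              self.classifier = nn.Linear(d_model, n_answers)

          def forward(self, audio_emb, text_emb):      # (batch, T_audio, d), (batch, T_text, d)
              a = self.audio_self(audio_emb)
              q = self.text_self(text_emb)
              fused, _ = self.cross(query=q, key=a, value=a)
              return self.classifier(fused.mean(dim=1))  # average-pool over question tokens

    For the binary yes/no track, the same structure with a two-class head would apply.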

  • E. Vaaras, S. Ahlqvist-Björkroth, K. Drossos, L. Lehtonen, and O. Räsänen, "Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment," Speech Communication, vol. 148, p. 9–22, 2023. doi:10.1016/j.specom.2023.02.001
    [BibTeX] [Abstract]

    In order to study how early emotional experiences shape infant development, one approach is to analyze the emotional content of speech heard by infants, as captured by child-centered daylong recordings, and as analyzed by automatic speech emotion recognition (SER) systems. However, since large-scale daylong audio is initially unannotated and differs from typical speech corpora from controlled environments, there are no existing in-domain SER systems for the task. Based on existing literature, it is also unclear what is the best approach to deploy a SER system for a new domain. Consequently, in this study, we investigated alternative strategies for deploying a SER system for large-scale child-centered audio recordings from a neonatal hospital environment, comparing cross-corpus generalization, active learning (AL), and domain adaptation (DA) methods in the process. We first conducted simulations with existing emotion-labeled speech corpora to find the best strategy for SER system deployment. We then tested how the findings generalize to our new initially unannotated dataset. As a result, we found that the studied AL method provided overall the most consistent results, being less dependent on the specifics of the training corpora or speech features compared to the alternative methods. However, in situations without the possibility to annotate data, unsupervised DA proved to be the best approach. We also observed that deployment of a SER system for real-world daylong child-centered audio recordings achieved a SER performance level comparable to those reported in literature, and that the amount of human effort required for the system deployment was overall relatively modest.

    @article{2023_ICA,
    author = {Vaaras, Einari and Ahlqvist-Bj{\"o}rkroth, Sari and Drossos, Konstantinos and Lehtonen, Liisa and R{\"a}s{\"a}nen, Okko},
    title = "Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment",
    abstract = "In order to study how early emotional experiences shape infant development, one approach is to analyze the emotional content of speech heard by infants, as captured by child-centered daylong recordings, and as analyzed by automatic speech emotion recognition (SER) systems. However, since large-scale daylong audio is initially unannotated and differs from typical speech corpora from controlled environments, there are no existing in-domain SER systems for the task. Based on existing literature, it is also unclear what is the best approach to deploy a SER system for a new domain. Consequently, in this study, we investigated alternative strategies for deploying a SER system for large-scale child-centered audio recordings from a neonatal hospital environment, comparing cross-corpus generalization, active learning (AL), and domain adaptation (DA) methods in the process. We first conducted simulations with existing emotion-labeled speech corpora to find the best strategy for SER system deployment. We then tested how the findings generalize to our new initially unannotated dataset. As a result, we found that the studied AL method provided overall the most consistent results, being less dependent on the specifics of the training corpora or speech features compared to the alternative methods. However, in situations without the possibility to annotate data, unsupervised DA proved to be the best approach. We also observed that deployment of a SER system for real-world daylong child-centered audio recordings achieved a SER performance level comparable to those reported in literature, and that the amount of human effort required for the system deployment was overall relatively modest.",
    keywords = "Daylong audio, LENA recorder, Real-world audio, Speech analysis, Speech emotion recognition",
    note = "Funding Information: This research was funded by Academy of Finland grants no. 314573 , 314602 , 332962 , and 335872 , and EU Horizon-2020 grant no. 957337 MARVEL. The authors would like to thank the APPLE consortium for the help in the project. Publisher Copyright: {\textcopyright} 2023 The Author(s)",
    year = "2023",
    month = "March",
    doi = "10.1016/j.specom.2023.02.001",
    language = "English",
    volume = "148",
    pages = "9--22",
    journal = "Speech Communication",
    issn = "0167-6393",
    publisher = "Elsevier B.V."
    }

  • E. Vaaras, M. Airaksinen, S. Vanhatalo, and O. Räsänen, "Evaluation of self-supervised pre-training for automatic infant movement classification using wearable movement sensors," in 2023 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), United States, 2023, p. 1–6. doi:10.1109/EMBC40787.2023.10340118
    [BibTeX]
    @inproceedings{2023_EMBC,
    author = {Vaaras, Einari and Airaksinen, Manu and Vanhatalo, Sampsa and R{\"a}s{\"a}nen, Okko},
    title = "Evaluation of self-supervised pre-training for automatic infant movement classification using wearable movement sensors",
    year = "2023",
    month = "August",
    doi = "10.1109/EMBC40787.2023.10340118",
    language = "English",
    series = "Annual International Conference of the IEEE Engineering in Medicine and Biology Society",
    publisher = "IEEE",
    pages = "1--6",
    booktitle = "2023 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)",
    address = "United States",
    note = "Annual International Conference of the IEEE Engineering in Medicine \& Biology Society (EMBC) ; Conference date: 24-07-2023 Through 27-07-2023"
    }

  • H. Xie, O. Räsänen, and T. Virtanen, "On Negative Sampling for Contrastive Audio-Text Retrieval," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings, United States, 2023. doi:10.1109/ICASSP49357.2023.10095319
    [BibTeX] [Abstract]

    This paper investigates negative sampling for contrastive learning in the context of audio-text retrieval. The strategy for negative sampling refers to selecting negatives (either audio clips or textual descriptions) from a pool of candidates for a positive audio-text pair. We explore sampling strategies via model-estimated within-modality and cross-modality relevance scores for audio and text samples. With a constant training setting on the retrieval system from [1], we study eight sampling strategies, including hard and semi-hard negative sampling. Experimental results show that retrieval performance varies dramatically among different strategies. Particularly, by selecting semi-hard negatives with cross-modality scores, the retrieval system gains improved performance in both text-to-audio and audio-to-text retrieval. Besides, we show that feature collapse occurs while sampling hard negatives with cross-modality scores.

    @inproceedings{2023_ICASSP,
    author = {Xie, Huang and R{\"a}s{\"a}nen, Okko and Virtanen, Tuomas},
    title = "On Negative Sampling for Contrastive Audio-Text Retrieval",
    abstract = "This paper investigates negative sampling for contrastive learning in the context of audio-text retrieval. The strategy for negative sampling refers to selecting negatives (either audio clips or textual descriptions) from a pool of candidates for a positive audio-text pair. We explore sampling strategies via model-estimated within-modality and cross-modality relevance scores for audio and text samples. With a constant training setting on the retrieval system from [1], we study eight sampling strategies, including hard and semi-hard negative sampling. Experimental results show that retrieval performance varies dramatically among different strategies. Particularly, by selecting semi-hard negatives with cross-modality scores, the retrieval system gains improved performance in both text-to-audio and audio-to-text retrieval. Besides, we show that feature collapse occurs while sampling hard negatives with cross-modality scores.",
    keywords = "audio-text retrieval, contrastive learning, Cross-modal retrieval, negative sampling, triplet loss",
    note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; IEEE International Conference on Acoustics, Speech, and Signal Processing ; Conference date: 04-06-2023 Through 10-06-2023",
    year = "2023",
    doi = "10.1109/ICASSP49357.2023.10095319",
    language = "English",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    booktitle = "ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings",
    address = "United States"
    }
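
    The paper above studies how negatives are chosen for contrastive audio-text training. As a sketch of just one of the strategies it discusses (semi-hard negatives selected via cross-modality scores), assuming `scores` is a batch-wise audio-to-text similarity matrix with matched pairs on the diagonal and an arbitrary margin value:

      import torch


      def semi_hard_text_negatives(scores, margin=1.0):
          """For each positive pair i (diagonal), pick a text negative that is less
          similar than the positive but within the margin; fall back to the hardest
          remaining negative when no semi-hard candidate exists."""
          pos = scores.diag().unsqueeze(1)                       # (B, 1)
          eye = torch.eye(scores.size(0), dtype=torch.bool)
          semi_hard = (scores < pos) & (scores > pos - margin) & ~eye
          cand = torch.where(semi_hard, scores, torch.full_like(scores, float("-inf")))
          fallback = scores.masked_fill(eye, float("-inf"))
          return torch.where(semi_hard.any(dim=1), cand.argmax(dim=1), fallback.argmax(dim=1))


      def triplet_loss(scores, neg_idx, margin=1.0):
          pos = scores.diag()
          neg = scores[torch.arange(scores.size(0)), neg_idx]
          return torch.clamp(margin - pos + neg, min=0.0).mean()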

  • H. Xie, K. Khorrami, O. Räsänen, and T. Virtanen, "Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances," in Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023), 2023, p. 226–230.
    [BibTeX]
    @inproceedings{2023_DCASE_2023,
    author = {Xie, Huang and Khorrami, Khazar and R{\"a}s{\"a}nen, Okko and Virtanen, Tuomas},
    editor = "Fuentes, Magdalena and Heittola, Toni and Imoto, Keisuke and Mesaros, Annamaria and Politis, Archontis and Serizel, Romain and Virtanen, Tuomas",
    title = "Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances",
    year = "2023",
    language = "English",
    pages = "226--230",
    booktitle = "Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023)",
    publisher = "Tampere University",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events ; Conference date: 20-09-2023 Through 22-09-2023"
    }

  • W. Xie, Y. Li, Q. He, W. Cao, and T. Virtanen, "Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023, p. 301–305. doi:10.21437/Interspeech.2023-1380
    [BibTeX] [Abstract]

    New classes of sounds constantly emerge with a few samples, making it challenging for models to adapt to dynamic acoustic environments. This challenge motivates us to address the new problem of few-shot class-incremental audio classification. This study aims to enable a model to continuously recognize new classes of sounds with a few training samples of new classes while remembering the learned ones. To this end, we propose a method to generate discriminative prototypes and use them to expand the model's classifier for recognizing sounds of new and learned classes. The model is first trained with a random episodic training strategy, and then its backbone is used to generate the prototypes. A dynamic relation projection module refines the prototypes to enhance their discriminability. Results on two datasets (derived from the corpora of Nsynth and FSD-MIX-CLIPS) show that the proposed method exceeds three state-of-the-art methods in average accuracy and performance dropping rate.

    @inproceedings{2023_InterSpecch_b,
    author = "Xie, Wei and Li, Yanxiong and He, Qianhua and Cao, Wenchang and Virtanen, Tuomas",
    title = "Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes",
    abstract = "New classes of sounds constantly emerge with a few samples, making it challenging for models to adapt to dynamic acoustic environments. This challenge motivates us to address the new problem of few-shot class-incremental audio classification. This study aims to enable a model to continuously recognize new classes of sounds with a few training samples of new classes while remembering the learned ones. To this end, we propose a method to generate discriminative prototypes and use them to expand the model's classifier for recognizing sounds of new and learned classes. The model is first trained with a random episodic training strategy, and then its backbone is used to generate the prototypes. A dynamic relation projection module refines the prototypes to enhance their discriminability. Results on two datasets (derived from the corpora of Nsynth and FSD-MIX-CLIPS) show that the proposed method exceeds three state-of-the-art methods in average accuracy and performance dropping rate.",
    keywords = "audio classification, class-incremental learning, few-shot learning, meta-training",
    note = "Publisher Copyright: {\textcopyright} 2023 International Speech Communication Association. All rights reserved.; Annual Conference of the International Speech Communication Association, INTERSPEECH ; Conference date: 20-08-2023 Through 24-08-2023",
    year = "2023",
    doi = "10.21437/Interspeech.2023-1380",
    language = "English",
    series = "Interspeech",
    publisher = "International Speech Communication Association",
    pages = "301--305",
    booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH"
    }
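
    The core step described above, generating prototypes for new sound classes from a few labelled clips and expanding the classifier with them, can be sketched as follows (plain nearest-prototype classification; the paper's episodic training and dynamic relation projection refinement are not reproduced here):

      import numpy as np


      def class_prototypes(embeddings, labels):
          """Mean embedding per class from the few-shot support clips."""
          classes = np.unique(labels)
          return classes, np.stack([embeddings[labels == c].mean(axis=0) for c in classes])


      def expand_and_classify(old_protos, new_protos, queries):
          """Append the new prototypes to the learned ones and assign each query
          to the prototype with the highest cosine similarity."""
          protos = np.concatenate([old_protos, new_protos], axis=0)
          protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
          queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
          return (queries @ protos.T).argmax(axis=1)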

2022

  • M. Airaksinen, A. Gallen, A. Kivi, P. Vijayakrishnan, T. Häyrinen, E. Ilèn, O. Räsänen, L. Haataja, and S. Vanhatalo, "Intelligent wearable allows out-of-the-lab tracking of developing motor abilities in infants," Communications Medicine, vol. 2, 2022. doi:10.1038/s43856-022-00131-6
    [BibTeX]
    @article{2022_ICA,
    author = {Airaksinen, Manu and Gallen, Anastasia and Kivi, Anna and Vijayakrishnan, Pavithra and H{\"a}yrinen, Taru and Il{\`e}n, Elina and R{\"a}s{\"a}nen, Okko and Haataja, Leena and Vanhatalo, Sampsa},
    title = "Intelligent wearable allows out-of-the-lab tracking of developing motor abilities in infants",
    year = "2022",
    doi = "10.1038/s43856-022-00131-6",
    language = "English",
    volume = "2",
    journal = "Communications Medicine",
    issn = "2730-664X",
    publisher = "Springer"
    }

  • D. Dogan, H. Xie, T. Heittola, and T. Virtanen, "Zero-Shot Audio Classification using Image Embeddings," in European Signal Processing Conference 2022, United States, 2022. doi:10.23919/EUSIPCO55093.2022.9909701
    [BibTeX]
    @inproceedings{2022_EUSIPCO,
    author = "Dogan, Duygu and Xie, Huang and Heittola, Toni and Virtanen, Tuomas",
    title = "Zero-Shot Audio Classification using Image Embeddings",
    note = "jufoid=55867; European Signal Processing Conference ; Conference date: 29-08-2022 Through 02-09-2022",
    year = "2022",
    doi = "10.23919/EUSIPCO55093.2022.9909701",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    booktitle = "European Signal Processing Conference 2022",
    address = "United States"
    }

  • V. Eklund, A. Diment, and T. Virtanen, "Noise, Device and Room Robustness Methods for Pronunciation Error Detection," in European Signal Processing Conference 2022, United States, 2022. doi:10.23919/EUSIPCO55093.2022.9909625
    [BibTeX]
    @inproceedings{2022_EUSIPCO_a,
    author = "Eklund, Ville-Veikko and Diment, Aleksandr and Virtanen, Tuomas",
    title = "Noise, Device and Room Robustness Methods for Pronunciation Error Detection",
    note = "jufoid=55867; European Signal Processing Conference ; Conference date: 29-08-2022 Through 02-09-2022",
    year = "2022",
    doi = "10.23919/EUSIPCO55093.2022.9909625",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    booktitle = "European Signal Processing Conference 2022",
    address = "United States"
    }

  • Y. Li, W. Cao, K. Drossos, and T. Virtanen, "Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network," in 2022 IEEE 24th International Workshop on Multimedia Signal Processing, MMSP 2022, United States, 2022. doi:10.1109/MMSP55362.2022.9949512
    [BibTeX] [Abstract]

    Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost for nursing the elderly people. This study focuses on solving the problem of domestic activity clustering from audio. The target of domestic activity clustering is to cluster audio clips which belong to the same category of domestic activity into one cluster in an unsupervised way. In this paper, we propose a method of domestic activity clustering using a depthwise separable convolutional autoencoder network. In the proposed method, initial embeddings are learned by the depthwise separable convolutional autoencoder, and a clustering-oriented loss is designed to jointly optimize embedding refinement and cluster assignment. Different methods are evaluated on a public dataset (a derivative of the SINS dataset) used in the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) in 2018. Our method obtains the normalized mutual information (NMI) score of 54.46%, and the clustering accuracy (CA) score of 63.64%, and outperforms state-of-the-art methods in terms of NMI and CA. In addition, both computational complexity and memory requirement of our method is lower than that of previous deep-model-based methods. Codes: https://github.com/vinceasvp/domestic-activity-clustering-from-audio.

    @inproceedings{2022_SP_a,
    author = "Li, Yanxiong and Cao, Wenchang and Drossos, Konstantinos and Virtanen, Tuomas",
    title = "Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network",
    abstract = "Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost for nursing the elderly people. This study focuses on solving the problem of domestic activity clustering from audio. The target of domestic activity clustering is to cluster audio clips which belong to the same category of domestic activity into one cluster in an unsupervised way. In this paper, we propose a method of domestic activity clustering using a depthwise separable convolutional autoencoder network. In the proposed method, initial embeddings are learned by the depthwise separable convolutional autoencoder, and a clustering-oriented loss is designed to jointly optimize embedding refinement and cluster assignment. Different methods are evaluated on a public dataset (a derivative of the SINS dataset) used in the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) in 2018. Our method obtains the normalized mutual information (NMI) score of 54.46\%, and the clustering accuracy (CA) score of 63.64\%, and outperforms state-of-the-art methods in terms of NMI and CA. In addition, both computational complexity and memory requirement of our method is lower than that of previous deep-model-based methods. Codes: https://github.com/vinceasvp/domestic-activity-clustering-from-audio.",
    keywords = "depthwise separable convolutional autoencoder, domestic activity clustering, human activity estimation",
    note = "Funding Information: This work was supported by international scientific research collaboration project of Guangdong Province, China (2021A0505030003), national natural science foundation of China (62111530145, 61771200), Guangdong basic and applied basic research foundation, China (2021A1515011454). Publisher Copyright: {\textcopyright} 2022 IEEE. jufoid=70574; IEEE International Workshop on Multimedia Signal Processing ; Conference date: 26-09-2022 Through 28-09-2022",
    year = "2022",
    doi = "10.1109/MMSP55362.2022.9949512",
    language = "English",
    series = "IEEE International Workshop on Multimedia Signal Processing",
    publisher = "IEEE",
    booktitle = "2022 IEEE 24th International Workshop on Multimedia Signal Processing, MMSP 2022",
    address = "United States"
    }
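
    Two building blocks mentioned in the abstract above, depthwise separable convolutions for a compact autoencoder and a clustering-oriented loss that jointly refines embeddings and cluster assignments, can be sketched in PyTorch roughly as below (a generic DEC-style formulation; the paper's exact network and loss are not given here):

      import torch
      from torch import nn


      class DepthwiseSeparableConv(nn.Module):
          """Per-channel spatial filtering followed by a 1x1 pointwise mix,
          which keeps the parameter count low."""
          def __init__(self, in_ch, out_ch, k=3):
              super().__init__()
              self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
              self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

          def forward(self, x):
              return torch.relu(self.pointwise(self.depthwise(x)))


      def clustering_loss(z, centroids):
          """Student's-t soft assignments q, sharpened target distribution p,
          matched with a KL divergence (as in deep embedded clustering)."""
          d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # (batch, n_clusters)
          q = 1.0 / (1.0 + d2)
          q = q / q.sum(dim=1, keepdim=True)
          p = (q ** 2) / q.sum(dim=0)
          p = p / p.sum(dim=1, keepdim=True)
          return (p * (p.log() - q.log())).sum(dim=1).mean()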

  • S. Lipping, P. Sudarsanam, K. Drossos, and T. Virtanen, "Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering," in European Signal Processing Conference 2022, United States, 2022. doi:10.23919/EUSIPCO55093.2022.9909680
    [BibTeX]
    @inproceedings{2022_EUSIPCO_b,
    author = "Lipping, Samuel and Sudarsanam, Parthasaarathy and Drossos, Konstantinos and Virtanen, Tuomas",
    title = "Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering",
    note = "jufoid=55867; European Signal Processing Conference ; Conference date: 29-08-2022 Through 02-09-2022",
    year = "2022",
    doi = "10.23919/EUSIPCO55093.2022.9909680",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    booktitle = "European Signal Processing Conference 2022",
    address = "United States"
    }

  • I. Martin Morato, F. Paissan, A. Ancilotto, T. Heittola, A. Mesaros, E. Farella, A. Brutti, and T. Virtanen, "Low-Complexity Acoustic Scene Classification in DCASE 2022 Challenge," in Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022), 2022, p. 111–115.
    [BibTeX] [Abstract] [Download PDF]

    This paper presents an analysis of the Low-Complexity Acoustic Scene Classification task in DCASE 2022 Challenge. The task was a continuation from the previous years, but the low-complexity requirements were changed to the following: the maximum number of allowed parameters, including the zero-valued ones, was 128 K, with parameters being represented using INT8 numerical format; and the maximum number of multiply-accumulate operations at inference time was 30 million. Despite using the same previous year dataset, the audio samples have been shortened to 1 second instead of 10 second for this year challenge. The provided baseline system is a convolutional neural network which employs post-training quantization of parameters, resulting in 46.5 K parameters, and 29.23 million multiply-and-accumulate operations (MMACs). Its performance on the evaluation data is 44.2% accuracy and 1.532 log-loss. In comparison, the top system in the challenge obtained an accuracy of 59.6% and a log loss of 1.091, having 121 K parameters and 28 MMACs. The task received 48 submissions from 19 different teams, most of which outperformed the baseline system.

    @inproceedings{2022_DCASE_2022,
    author = "Martin Morato, Irene and Paissan, Francesco and Ancilotto, Alberto and Heittola, Toni and Mesaros, Annamaria and Farella, Elisabetta and Brutti, Alessio and Virtanen, Tuomas",
    editor = {Lagrange, Mathieu and Mesaros, Annamaria and Pellegrini, Thomas and Richard, Ga{\"e}l and Serizel, Romain and Stowell, Dan},
    title = "Low-Complexity Acoustic Scene Classification in DCASE 2022 Challenge",
    abstract = "This paper presents an analysis of the Low-Complexity Acoustic Scene Classification task in DCASE 2022 Challenge. The task was a continuation from the previous years, but the low-complexity requirements were changed to the following: the maximum number of allowed parameters, including the zero-valued ones, was 128 K, with parameters being represented using INT8 numerical format; and the maximum number of multiply-accumulate operations at inference time was 30 million. Despite using the same previous year dataset, the audio samples have been shortened to 1 second instead of 10 second for this year challenge. The provided baseline system is a convolutional neural network which employs post-training quantization of parameters, resulting in 46.5 K parameters, and 29.23 million multiply-and-accumulate operations (MMACs). Its performance on the evaluation data is 44.2\% accuracy and 1.532 log-loss. In comparison, the top system in the challenge obtained an accuracy of 59.6\% and a log loss of 1.091, having 121 K parameters and 28 MMACs. The task received 48 submissions from 19 different teams, most of which outperformed the baseline system.",
    year = "2022",
    month = "November",
    day = "3",
    language = "English",
    pages = "111--115",
    booktitle = "Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)",
    publisher = "DCASE",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE ; Conference date: 03-11-2022 Through 04-11-2022",
    url = "https://dcase.community/workshop2022/"
    }
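
    The complexity rules quoted above (at most 128 K parameters, zero-valued ones included, stored in INT8) translate into a simple model-size check; the 30 MMAC limit would additionally require a MAC profiler, which is not shown. The toy network below is only there to exercise the check and is not the challenge baseline:

      import torch
      from torch import nn


      def check_dcase_limits(model, max_params=128_000, bytes_per_param=1):
          """Count every parameter (including zeros) and report the INT8 size."""
          n_params = sum(p.numel() for p in model.parameters())
          return {"params": n_params,
                  "size_KiB": round(n_params * bytes_per_param / 1024, 1),
                  "within_limit": n_params <= max_params}


      toy = nn.Sequential(
          nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
          nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
      print(check_dcase_limits(toy))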

  • G. Naithani, K. Pietilä, R. Niemisto, E. Paajanen, T. Takala, and T. Virtanen, "Subjective Evaluation of Deep Neural Network Based Speech Enhancement Systems in Real-World Conditions," in 2022 IEEE 24th International Workshop on Multimedia Signal Processing, MMSP 2022, United States, 2022. doi:10.1109/MMSP55362.2022.9949148
    [BibTeX] [Abstract]

    Subjective evaluation results for two low-latency deep neural networks (DNN) are compared to a matured version of a traditional Wiener-filter based noise suppressor. The target use-case is real-world single-channel speech enhancement applications, e.g., communications. Real-world recordings consisting of additive stationary and non-stationary noise types are included. The evaluation is divided into four outcomes: speech quality, noise transparency, speech intelligibility or listening effort, and noise level w.r.t. speech. It is shown that DNNs improve noise suppression in all conditions in comparison to the traditional Wiener-filter baseline without major degradation in speech quality and noise transparency while maintaining speech intelligibility better than the baseline.

    @inproceedings{2022_SP_d,
    author = {Naithani, Gaurav and Pietil{\"a}, Kirsi and Niemisto, Riitta and Paajanen, Erkki and Takala, Tero and Virtanen, Tuomas},
    title = "Subjective Evaluation of Deep Neural Network Based Speech Enhancement Systems in Real-World Conditions",
    abstract = "Subjective evaluation results for two low-latency deep neural networks (DNN) are compared to a matured version of a traditional Wiener-filter based noise suppressor. The target use-case is real-world single-channel speech enhancement applications, e.g., communications. Real-world recordings consisting of additive stationary and non-stationary noise types are included. The evaluation is divided into four outcomes: speech quality, noise transparency, speech intelligibility or listening effort, and noise level w.r.t. speech. It is shown that DNNs improve noise suppression in all conditions in comparison to the traditional Wiener-filter baseline without major degradation in speech quality and noise transparency while maintaining speech intelligibility better than the baseline.",
    keywords = "Deep neural networks, Low latency, Speech enhancement, Subjective evaluation",
    note = "Publisher Copyright: {\textcopyright} 2022 IEEE. jufoid=70574; IEEE International Workshop on Multimedia Signal Processing ; Conference date: 26-09-2022 Through 28-09-2022",
    year = "2022",
    doi = "10.1109/MMSP55362.2022.9949148",
    language = "English",
    series = "International Workshop on Multimedia Signal Processing",
    publisher = "IEEE",
    booktitle = "2022 IEEE 24th International Workshop on Multimedia Signal Processing, MMSP 2022",
    address = "United States"
    }

  • M. Parviainen and P. Pertilä, "Time Difference of Arrival Estimation of Multiple Simultaneous Speakers Using Deep Clustering Neural Networks," in IEEE MMSP 2021 - 23rd Workshop on Multimedia Signal Processing, United States, 2022. doi:10.1109/MMSP53017.2021.9733535
    [BibTeX] [Abstract] [Download PDF]

    A novel multiple acoustic source localization approach is presented that is capable of providing spatial information about concurrent active speakers from a mixture signal captured by a microphone array. The proposed method first separates the observed array mixture signal into single speaker array signals using deep clustering (DC), which is a deep neural network (DNN) based method that maps source signal information into an embedding space, in which a clustering algorithm can be then used to separate the sources. Spatial information in terms of time difference of arrival (TDoA) can be then extracted from each separated signal. This approach is novel for TDoA estimation of multiple sources, since the state-of-the-art method first localizes multiple sources and then performs the separation. The inherent advantage of the proposed approach is that there is no need for data association of the measurements and the sources. The results with data from an actual room show that the proposed approach outperforms the current state-of-the-art in extracting the spatial information from two concurrent speakers mixture signal.

    @inproceedings{2022_SP_b,
    author = {Parviainen, Mikko and Pertil{\"a}, Pasi},
    title = "Time Difference of Arrival Estimation of Multiple Simultaneous Speakers Using Deep Clustering Neural Networks",
    abstract = "A novel multiple acoustic source localization approach is presented that is capable of providing spatial information about concurrent active speakers from a mixture signal captured by a microphone array. The proposed method first separates the observed array mixture signal into single speaker array signals using deep clustering (DC), which is a deep neural network (DNN) based method that maps source signal information into an embedding space, in which a clustering algorithm can be then used to separate the sources. Spatial information in terms of time difference of arrival (TDoA) can be then extracted from each separated signal. This approach is novel for TDoA estimation of multiple sources, since the state-of-the-art method first localizes multiple sources and then performs the separation. The inherent advantage of the proposed approach is that there is no need for data association of the measurements and the sources. The results with data from an actual room show that the proposed approach outperforms the current state-of-the-art in extracting the spatial information from two concurrent speakers mixture signal.",
    note = "JUFOID=70574; IEEE International Workshop on Multimedia Signal Processing, IEEE MMSP 2021 ; Conference date: 06-10-2021 Through 08-10-2021",
    year = "2022",
    doi = "10.1109/MMSP53017.2021.9733535",
    language = "English",
    series = "IEEE International Workshop on Multimedia Signal Processing",
    publisher = "IEEE",
    booktitle = "IEEE MMSP 2021 - 23rd Workshop on Multimedia Signal Processing",
    address = "United States",
    url = "https://attend.ieee.org/mmsp-2021/"
    }
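
    In the pipeline described above, deep clustering first separates the array mixture into per-speaker array signals, and the spatial information is then read out from each separated signal. The separation network is not reproduced here; the sketch below only covers the subsequent TDoA readout, using standard GCC-PHAT on a two-channel pair (the sampling rate and maximum-delay arguments are generic placeholders):

      import numpy as np


      def gcc_phat(x, y, fs, max_tau=None):
          """Time difference of arrival between two channels via GCC-PHAT."""
          n = len(x) + len(y)
          X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
          R = X * np.conj(Y)
          cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
          max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
          cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
          return (np.argmax(np.abs(cc)) - max_shift) / fs   # estimated TDoA in seconds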

  • A. Politis, K. Shimada, P. Ariyakulam Sudarsanam, S. Adavanne, D. Krause, Y. Koyama, N. Takahashi, S. Takahashi, Y. Mitsufuji, and T. Virtanen, "STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events," in Workshop on Detection and Classification of Acoustic Scenes and Events, 2022.
    [BibTeX] [Download PDF]
    @inproceedings{2022_DCASE,
    author = "Politis, Archontis and Shimada, Kazuki and Ariyakulam Sudarsanam, Parthasaarathy and Adavanne, Sharath and Krause, Daniel and Koyama, Yuichiro and Takahashi, Naoya and Takahashi, Shusuke and Mitsufuji, Yuki and Virtanen, Tuomas",
    title = "STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound event",
    year = "2022",
    language = "English",
    booktitle = "Workshop on Detection and Classification of Acoustic Scenes and Events",
    publisher = "DCASE",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE ; Conference date: 03-11-2022 Through 04-11-2022",
    url = "https://dcase.community/workshop2022/"
    }

  • B. W. Schuller, Y. Eldar, M. Pantic, S. Narayanan, T. Virtanen, and J. Tao, "Special Issue on Signal Analysis for Detection and Monitoring of Contagious Diseases," IEEE Journal on Selected Topics in Signal Processing, vol. 16, iss. 2, 2022.
    [BibTeX]
    @article{2022_SP,
    author = "Schuller, Bjorn W. and Eldar, Yonina and Pantic, Maja and Narayanan, Shrikanth and Virtanen, Tuomas and Tao, Jianhua",
    title = "Special Issue on Signal Analysis for Detection and Monitoring of Contagious Diseases",
    note = "Publisher Copyright: {\textcopyright} 2007-2012 IEEE.",
    year = "2022",
    month = "February",
    language = "English",
    volume = "16",
    journal = "IEEE Journal on Selected Topics in Signal Processing",
    issn = "1932-4553",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "2"
    }

  • B. W. Schuller, Y. Eldar, M. Pantic, S. Narayanan, T. Virtanen, and J. Tao, "Editorial: Intelligent Signal Analysis for Contagious Virus Diseases," IEEE Journal on Selected Topics in Signal Processing, vol. 16, iss. 2, p. 159–163, 2022. doi:10.1109/JSTSP.2022.3160023
    [BibTeX] [Abstract]

    COVID-19 infection's recent outbreak triggered by the SARS-CoV-2 Corona virus had already led to more than two million reported infected individuals when we first addressed the community by our call - by now, the number sadly rose to roughly half a billion cases worldwide. The outbreak of COVID-19 has also re-shaped and accelerated the scientific publication landscape in no time. One can observe a massive uprise in interest in work related to the topic of highly contagious virus diseases and potential contributions of digital health including intelligent signal processing. In addition, most publishers have reacted in one or the other way to the crisis such as by opening up to pre-prints, waiving publication fees for COVID-19-related research, providing search functions and tools for COVID-19 research, and many more. Here, we gathered 13 carefully selected novel contributions across signal types such as audio, speech, image, video, or symbolic information, as well as their multimodal combination for application in the risk assessment, diagnosis, and monitoring of contagious virus diseases.

    @article{2022_SP_e,
    author = "Schuller, Bjorn W. and Eldar, Yonina and Pantic, Maja and Narayanan, Shrikanth and Virtanen, Tuomas and Tao, Jianhua",
    title = "Editorial: Intelligent Signal Analysis for Contagious Virus Diseases",
    abstract = "COVID-19 infection-s recent outbreak triggered by the SARS-CoV-2 Corona virus had already led to more than two million reported infected individuals when we first addressed the community by our call - by now, the number sadly rose to roughly half a billion cases worldwide. The outbreak of COIVD-19 has also re-shaped and accelerated the scientific publication landscape in no time. One can observe a massive uprise in interest in work related to the topic of highly contagious virus diseases and potential contributions of digital health including intelligent signal processing. In addition, most publishers have reacted in one or the other way to the crises such as by opening up to pre-prints, waiving publication fees for COVID-19-related research, providing search functions and tools for COVID-19 research, and many more. Here, we gathered 13 carefully selected novel contributions across signal types such as audio, speech, image, video, or symbolic information, as well as their multimodal combination for application in the risk assessment, diagnosis, and monitoring of contagious virus diseases.",
    note = "Publisher Copyright: {\textcopyright} 2007-2012 IEEE.",
    year = "2022",
    month = "February",
    doi = "10.1109/JSTSP.2022.3160023",
    language = "English",
    volume = "16",
    pages = "159--163",
    journal = "IEEE Journal on Selected Topics in Signal Processing",
    issn = "1932-4553",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "2"
    }

  • E. Vaaras, M. Airaksinen, and O. Räsänen, "Analysis of Self-Supervised Learning and Dimensionality Reduction Methods in Clustering-Based Active Learning for Speech Emotion Recognition," in Proceedings of INTERSPEECH 2022, 2022, p. 1143–1147. doi:10.21437/Interspeech.2022-329
    [BibTeX]
    @inproceedings{2022_InterSpecch,
    author = {Vaaras, Einari and Airaksinen, Manu and R{\"a}s{\"a}nen, Okko},
    title = "Analysis of Self-Supervised Learning and Dimensionality Reduction Methods in Clustering-Based Active Learning for Speech Emotion Recognition",
    year = "2022",
    doi = "10.21437/Interspeech.2022-329",
    language = "English",
    series = "Interspeech",
    publisher = "International Speech Communication Association ISCA",
    pages = "1143--1147",
    booktitle = "Proceedings of INTERSPEECH 2022",
    note = "Interspeech ; Conference date: 01-01-1900"
    }

  • S. Wang, A. Politis, A. Mesaros, and T. Virtanen, "Self-Supervised Learning of Audio Representations from Audio-Visual Data Using Spatial Alignment," IEEE Journal on Selected Topics in Signal Processing, vol. 16, iss. 6, p. 1467–1479, 2022. doi:10.1109/JSTSP.2022.3180592
    [BibTeX] [Abstract]

    Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selection of visual objects using object detection, and beamforming of the audio signal towards the detected objects, attempting to learn the spatial alignment between objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10% improvement on AVSA for the first order ambisonics intensity vector (FOA-IV) in comparison with log-mel spectrogram features; the addition of object-oriented crops also brings significant performance increases for the human action recognition downstream task. A number of audio-only downstream tasks are devised for testing the effectiveness of the learnt audio feature representation, obtaining performance comparable to state-of-the-art methods on acoustic scene classification from ambisonic and binaural audio.

    @article{2022_SP_c,
    author = "Wang, Shanshan and Politis, Archontis and Mesaros, Annamaria and Virtanen, Tuomas",
    title = "Self-Supervised Learning of Audio Representations from Audio-Visual Data Using Spatial Alignment",
    abstract = "Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selection of visual objects using object detection, and beamforming of the audio signal towards the detected objects, attempting to learn the spatial alignment between objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10\\% improvement on AVSA for the first order ambisonics intensity vector (FOA-IV) in comparison with log-mel spectrogram features; the addition of object-oriented crops also brings significant performance increases for the human action recognition downstream task. A number of audio-only downstream tasks are devised for testing the effectiveness of the learnt audio feature representation, obtaining performance comparable to state-of-the-art methods on acoustic scene classification from ambisonic and binaural audio.",
    keywords = "Audio classification, audio-visual corres-pondence, audio-visual data, audio-visual spatial alignment, feature learning, self-supervised learning",
    note = "Publisher Copyright: {\textcopyright} 2007-2012 IEEE.",
    year = "2022",
    month = "October",
    day = "14",
    doi = "10.1109/JSTSP.2022.3180592",
    language = "English",
    volume = "16",
    pages = "1467--1479",
    journal = "IEEE Journal on Selected Topics in Signal Processing",
    issn = "1932-4553",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "6"
    }
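
    Illustrative sketch related to the entry above: the FOA-IV feature is commonly computed as the active intensity vector of a first-order Ambisonics STFT, normalized by the local sound-field energy. The channel ordering and normalization used here are assumptions for illustration, not the exact feature of the paper.

    import numpy as np

    def foa_intensity_vector(stft_foa, eps=1e-8):
        # stft_foa: complex STFT with shape (4, frames, bins), channels assumed ordered (W, X, Y, Z).
        w = stft_foa[0]
        xyz = stft_foa[1:]                                   # dipole channels
        intensity = np.real(np.conj(w)[None, ...] * xyz)     # active intensity per time-frequency bin
        energy = np.abs(w) ** 2 + 0.5 * np.sum(np.abs(xyz) ** 2, axis=0) + eps
        return intensity / energy[None, ...]                 # energy-normalized 3-channel feature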

  • H. Xie, S. Lipping, and T. Virtanen, "Language-based Audio Retrieval Task in DCASE 2022 Challenge," in Workshop on Detection and Classification of Acoustic Scenes and Events, 2022, p. 216–221.
    [BibTeX] [Download PDF]
    @inproceedings{2022_DCASE_a,
    author = "Xie, Huang and Lipping, Samuel and Virtanen, Tuomas",
    title = "Language-based Audio Retrieval Task in DCASE 2022 Challenge",
    year = "2022",
    language = "English",
    pages = "216--221",
    booktitle = "Workshop on Detection and Classification of Acoustic Scenes and Events",
    publisher = "DCASE",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE ; Conference date: 03-11-2022 Through 04-11-2022",
    url = "https://dcase.community/workshop2022/"
    }

  • H. Xie, O. Räsänen, K. Drossos, and T. Virtanen, "Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2022, p. 8867–8871. doi:10.1109/ICASSP43922.2022.9747336
    [BibTeX] [Abstract]

    We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.

    @inproceedings{2022_ICASSP,
    author = {Xie, Huang and R{\"a}s{\"a}nen, Okko and Drossos, Konstantinos and Virtanen, Tuomas},
    title = "Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases",
    abstract = "We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.",
    keywords = "Cross-modal learning, audio, caption, sound event, unsupervised learning",
    note = "jufoid=57409; IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 23-05-2022 Through 27-05-2022",
    year = "2022",
    month = "May",
    doi = "10.1109/ICASSP43922.2022.9747336",
    language = "English",
    isbn = "978-1-6654-0541-6",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "8867--8871",
    booktitle = "ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States"
    }
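
    Illustrative sketch of the scoring idea described in the entry above: clip-caption similarity as the average of frame-word similarities, trained with a hinge-style ranking loss. This is a schematic NumPy version with assumed shapes, not the paper's model or exact loss.

    import numpy as np

    def clip_caption_similarity(frame_embs, word_embs):
        # frame_embs: (T, D) audio frame embeddings; word_embs: (N, D) word embeddings.
        return float((frame_embs @ word_embs.T).mean())      # mean over all frame-word pairs

    def ranking_loss(sim_matched, sim_mismatched, margin=1.0):
        # Encourage matched clip-caption pairs to score at least `margin` above mismatched ones.
        return float(np.maximum(0.0, margin - sim_matched + sim_mismatched).mean())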

2021

  • S. Adavanne, A. Politis, and T. Virtanen, "Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers," in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), United States, 2021, p. 211–215. doi:10.48550/arXiv.2111.00030
    [BibTeX] [Abstract]

    Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressors without a clear training strategy up-to-date, that does not rely on auxiliary information such as simultaneous sound classification. We investigate end-to-end training of such methods with a technique recently proposed for video object detectors, adapted to the SSL setting. A differentiable network is constructed that can be plugged to the output of the localizer to solve the optimal assignment between predictions and references, optimizing directly the popular CLEAR-MOT tracking metrics. Results indicate large improvements over directly optimizing mean squared errors, in terms of localization error, detection metrics, and tracking capabilities.

    @inproceedings{2021_WASPAA,
    author = "Adavanne, Sharath and Politis, Archontis and Virtanen, Tuomo",
    title = "Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers",
    abstract = "Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressors without a clear training strategy up-to-date, that does not rely on auxiliary information such as simultaneous sound classification. We investigate end-to-end training of such methods with a technique recently proposed for video object detectors, adapted to the SSL setting. A differentiable network is constructed that can be plugged to the output of the localizer to solve the optimal assignment between predictions and references, optimizing directly the popular CLEAR-MOT tracking metrics. Results indicate large improvements over directly optimizing mean squared errors, in terms of localization error, detection metrics, and tracking capabilities.",
    keywords = "Training, Location awareness, Measurement, Deep learning, Direction-of-arrival estimation, Conferences, Training data, sound source localization, deep-learning acoustic processing, multi-target tracking",
    note = "jufoid=72074; IEEE Workshop on Applications of Signal Processing to Audio and Acoustics ; Conference date: 17-10-2021 Through 20-10-2021",
    year = "2021",
    doi = "10.48550/arXiv.2111.00030",
    language = "English",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    publisher = "IEEE",
    pages = "211--215",
    booktitle = "2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    address = "United States"
    }
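
    Illustrative sketch of the assignment problem that the training method above solves in a differentiable way. Here it is solved non-differentiably with the Hungarian algorithm, only to show what "optimal assignment between predictions and references" means; the function name and angular cost are assumptions, not the paper's network.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def assign_doa_predictions(pred_dirs, ref_dirs):
        # pred_dirs, ref_dirs: unit direction-of-arrival vectors with shapes (P, 3) and (R, 3).
        cos = np.clip(pred_dirs @ ref_dirs.T, -1.0, 1.0)
        cost = np.degrees(np.arccos(cos))                    # pairwise angular errors in degrees
        rows, cols = linear_sum_assignment(cost)             # minimum-cost one-to-one pairing
        return list(zip(rows, cols)), float(cost[rows, cols].mean())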

  • R. Convey, N. Penttilä, T. Ihalainen, Y. Liu, O. Räsänen, and J. Lintula, "Analysis of Automatic Vowel Articulation Index Software." 2021.
    [BibTeX]
    @conference{2021_d,
    author = {Convey, Rachel and Penttil{\"a}, Nelly and Ihalainen, Tiina and Liu, Yuanyuan and R{\"a}s{\"a}nen, Okko and Lintula, Juulia},
    title = "Analysis of Automatic Vowel Articulation Index Software",
    year = "2021",
    language = "English",
    note = "ASHA Convention ; Conference date: 15-11-2021 Through 22-11-2021"
    }

  • A. Cristia, M. Lavechin, C. Scaff, M. Soderstrom, C. Rowland, O. Räsänen, J. Bunce, and E. Bergelson, "A thorough evaluation of the Language Environment Analysis (LENA) system," Behavior Research Methods, vol. 53, p. 467–486, 2021. doi:10.3758/s13428-020-01393-5
    [BibTeX] [Abstract]

    In the previous decade, dozens of studies involving thousands of children across several research disciplines have made use of a combined daylong audio-recorder and automated algorithmic analysis called the LENAⓇ system, which aims to assess children's language environment. While the system's prevalence in the language acquisition domain is steadily growing, there are only scattered validation efforts on only some of its key characteristics. Here, we assess the LENAⓇ system's accuracy across all of its key measures: speaker classification, Child Vocalization Counts (CVC), Conversational Turn Counts (CTC), and Adult Word Counts (AWC). Our assessment is based on manual annotation of clips that have been randomly or periodically sampled out of daylong recordings, collected from (a) populations similar to the system's original training data (North American English-learning children aged 3-36 months), (b) children learning another dialect of English (UK), and (c) slightly older children growing up in a different linguistic and socio-cultural setting (Tsimane' learners in rural Bolivia). We find reasonably high accuracy in some measures (AWC, CVC), with more problematic levels of performance in others (CTC, precision of male adults and other children). Statistical analyses do not support the view that performance is worse for children who are dissimilar from the LENAⓇ original training set. Whether LENAⓇ results are accurate enough for a given research, educational, or clinical application depends largely on the specifics at hand. We therefore conclude with a set of recommendations to help researchers make this determination for their goals.

    @article{2021_i,
    author = {Cristia, Alejandrina and Lavechin, Marvin and Scaff, Camila and Soderstrom, Melanie and Rowland, Caroline and R{\"a}s{\"a}nen, Okko and Bunce, John and Bergelson, Elika},
    title = "A thorough evaluation of the Language Environment Analysis (LENA) system",
    abstract = "In the previous decade, dozens of studies involving thousands of children across several research disciplines have made use of a combined daylong audio-recorder and automated algorithmic analysis called the LENAⓇ system, which aims to assess children{\textquoteright}s language environment. While the system{\textquoteright}s prevalence in the language acquisition domain is steadily growing, there are only scattered validation efforts on only some of its key characteristics. Here, we assess the LENAⓇ system{\textquoteright}s accuracy across all of its key measures: speaker classification, Child Vocalization Counts (CVC), Conversational Turn Counts (CTC), and Adult Word Counts (AWC). Our assessment is based on manual annotation of clips that have been randomly or periodically sampled out of daylong recordings, collected from (a) populations similar to the system{\textquoteright}s original training data (North American English-learning children aged 3-36 months), (b) children learning another dialect of English (UK), and (c) slightly older children growing up in a different linguistic and socio-cultural setting (Tsimane{\textquoteright} learners in rural Bolivia). We find reasonably high accuracy in some measures (AWC, CVC), with more problematic levels of performance in others (CTC, precision of male adults and other children). Statistical analyses do not support the view that performance is worse for children who are dissimilar from the LENAⓇ original training set. Whether LENAⓇ results are accurate enough for a given research, educational, or clinical application depends largely on the specifics at hand. We therefore conclude with a set of recommendations to help researchers make this determination for their goals.",
    keywords = "Adult Word Count, Agreement, Child Vocalization Count, Conversational Turn Count, English, Human transcription, LENA, Measurement error, Method comparison, Reliability, Speech technology, Tsimane{\textquoteright}",
    year = "2021",
    doi = "10.3758/s13428-020-01393-5",
    language = "English",
    volume = "53",
    pages = "467–486",
    journal = "BEHAVIOR RESEARCH METHODS",
    issn = "1554-351X",
    publisher = "Springer Nature"
    }

  • S. Djukanovic, J. Matas, and T. Virtanen, "Acoustic vehicle speed estimation from single sensor measurements," IEEE Sensors Journal, vol. 21, iss. 20, p. 23317–23324, 2021. doi:10.1109/JSEN.2021.3110009
    [BibTeX] [Abstract]

    The paper addresses acoustic vehicle speed estimation using single sensor measurements. We introduce a new speed-dependent feature based on the attenuation of the sound amplitude. The feature is predicted from the audio signal and used as input to a regression model for speed estimation. For this research, we have collected, annotated, and published a dataset of audio-video recordings of single vehicles passing by the camera at a known constant speed. The dataset contains 304 urban-environment real-field recordings of ten different vehicles. The proposed method is trained and tested on the collected dataset. Experiments show that it is able to accurately predict the pass-by instant of a vehicle and to estimate its speed with an average error of 7.39 km/h. When the speed is discretized into intervals of 10 km/h, the proposed method achieves the average accuracy of 53.2% for correct interval prediction and 93.4% when misclassification of one interval is allowed. Experiments also show that sound disturbances, such as wind, severely affect acoustic speed estimation.

    @article{2021_SJ_a,
    author = "Djukanovic, Slobodan and Matas, Jiri and Virtanen, Tuomas",
    title = "Acoustic vehicle speed estimation from single sensor measurements",
    abstract = "The paper addresses acoustic vehicle speed estimation using single sensor measurements. We introduce a new speed-dependent feature based on the attenuation of the sound amplitude. The feature is predicted from the audio signal and used as input to a regression model for speed estimation. For this research, we have collected, annotated, and published a dataset of audio-video recordings of single vehicles passing by the camera at a known constant speed. The dataset contains 304 urban-environment real-field recordings of ten different vehicles. The proposed method is trained and tested on the collected dataset. Experiments show that it is able to accurately predict the pass-by instant of a vehicle and to estimate its speed with an average error of 7.39 km/h. When the speed is discretized into intervals of 10 km/h, the proposed method achieves the average accuracy of 53.2\\% for correct interval prediction and 93.4\\% when misclassification of one interval is allowed. Experiments also show that sound disturbances, such as wind, severely affect acoustic speed estimation.",
    keywords = "Acoustics, Automobiles, Cameras, Estimation, Feature extraction, log-mel spectrogram, neural network, Roads, Sensors, speed estimation dataset, support vector regression, vehicle speed estimation",
    note = "Publisher Copyright: IEEE",
    year = "2021",
    doi = "10.1109/JSEN.2021.3110009",
    language = "English",
    volume = "21",
    pages = "23317--23324",
    journal = "IEEE Sensors Journal",
    issn = "1530-437X",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "20"
    }
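
    Illustrative sketch only, loosely related to the entry above: the paper predicts a speed-dependent attenuation feature from audio and regresses speed from it, while the snippet below shows the generic log-mel-plus-support-vector-regression pattern named in the keywords, with clip-level statistics as a stand-in feature. The function name, feature choice, and hyperparameters are assumptions, not the paper's pipeline.

    import numpy as np
    import librosa
    from sklearn.svm import SVR

    def clip_feature(y, sr, n_mels=64):
        # Clip-level summary of a log-mel spectrogram (mean and standard deviation per mel band).
        logmel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
        return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

    # Hypothetical training data: train_clips as (waveform, sample_rate) pairs, train_speeds_kmh as labels.
    # X = np.stack([clip_feature(y, sr) for y, sr in train_clips])
    # regressor = SVR(kernel="rbf").fit(X, train_speeds_kmh)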

  • S. Djukanović, Y. Patel, J. Matas, and T. Virtanen, "Neural network-based acoustic vehicle counting," in 2021 29th European Signal Processing Conference (EUSIPCO), United States, 2021, p. 561–565. doi:10.23919/EUSIPCO54536.2021.9615925
    [BibTeX] [Abstract]

    This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of clipped vehicle-to-microphone distance. This distance is predicted from audio using a two-stage (coarse-fine) regression, with both stages realised via neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously proposed support vector regression. The 95% confidence interval for the mean of vehicle counting error is within [0.28%, −0.55%]. Besides the minima-based counting, we propose a deep learning counting that operates on the predicted distance without detecting local minima. Although outperformed in accuracy by the former approach, deep counting has a significant advantage in that it does not depend on minima detection parameters. Results also show that removing low frequencies in features improves the counting performance.

    @inproceedings{2021_EUSIPCO_d,
    author = "Djukanovi{\'c}, Slobodan and Patel, Yash and Matas, Jiri and Virtanen, T.",
    title = "Neural network-based acoustic vehicle counting",
    abstract = "This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of clipped vehicle-to-microphone distance. This distance is predicted from audio using a two-stage (coarse-fine) regression, with both stages realised via neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously proposed support vector regression. The 95\\% confidence interval for the mean of vehicle counting error is within [0.28\\%, −0.55\\%]. Besides the minima-based counting, we propose a deep learning counting that operates on the predicted distance without detecting local minima. Although outperformed in accuracy by the former approach, deep counting has a significant advantage in that it does not depend on minima detection parameters. Results also show that removing low frequencies in features improves the counting performance.",
    keywords = "Support vector machines, Deep learning, Europe, Artificial neural networks, Signal processing, Acoustics, Vehicle counting, log-mel spectrogram, neural network, peak detection, deep learning",
    note = "jufoid=55867; European Signal Processing Conference, EUSIPCO 2021 ; Conference date: 23-08-2021 Through 27-08-2021",
    year = "2021",
    doi = "10.23919/EUSIPCO54536.2021.9615925",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "561--565",
    booktitle = "2021 29th European Signal Processing Conference (EUSIPCO)",
    address = "United States"
    }
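
    Illustrative sketch of the minima-based counting step described in the entry above: pass-by instants are taken as local minima of the predicted, clipped vehicle-to-microphone distance. The thresholds below are placeholder parameters, not values from the paper.

    import numpy as np
    from scipy.signal import find_peaks

    def count_vehicles(distance_curve, min_separation_frames, max_passby_distance):
        # Local minima of the distance curve are peaks of its negation.
        minima, _ = find_peaks(-np.asarray(distance_curve),
                               distance=min_separation_frames,
                               height=-max_passby_distance)  # keep only minima below the clip level
        return len(minima), minima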

  • S. Drgas and T. Virtanen, "Joint speaker separation and recognition using non-negative matrix deconvolution with adaptive dictionary," Computer Speech and Language, vol. 70, 2021. doi:10.1016/j.csl.2021.101223
    [BibTeX] [Abstract]

    In this article, we propose a new method for joint cochannel speaker separation and recognition called adaptive-dictionary non-negative matrix deconvolution (DANMD). This method is an extension of non-negative matrix deconvolution (NMD) which models spectrogram matrix as a linear combination of dictionary elements (atoms). We propose a dictionary which is a linear combination of speaker-independent component and components representing speaker variability. The dictionary is parametric and all atoms depend on a small number of parameters. The speaker-independent component and components representing speaker variability are learned from recordings of tens or hundreds of speakers. We show that the proposed method can be applied to the single-channel speech separation task where two speakers of unknown identity are to be separated. In a scenario where the unknown speakers' recordings are in training dataset together with recordings of many other speakers, we show that the proposed method outperforms stacked NMD (NMD with a dictionary which contains atoms of all speakers in the dataset) in terms of signal-to-distortion ratio (SDR). DANMD was also tested in a scenario where recordings of the recognized speakers were not in the training dataset. In this case it brought clearly positive signal-to-distortion ratios. The proposed model was also tested for a co-channel speaker identification task, where the parameters of the adapted model are a basis for a decision about the identity of the speakers in the mixture. In this case, the accuracy was 81.2 in comparison to 84.1 in the case of stacked NMD. While the speaker recognition accuracy is lower for the new approach, we find the primary value in the improved SDR.

    @article{2021_h,
    author = "Drgas, Szymon and Virtanen, Tuomas",
    title = "Joint speaker separation and recognition using non-negative matrix deconvolution with adaptive dictionary",
    abstract = "In this article, we propose a new method for joint cochannel speaker separation and recognition called adaptive-dictionary non-negative matrix deconvolution (DANMD). This method is an extension of non-negative matrix deconvolution (NMD) which models spectrogram matrix as a linear combination of dictionary elements (atoms). We propose a dictionary which is a linear combination of speaker-independent component and components representing speaker variability. The dictionary is parametric and all atoms depend on a small number of parameters. The speaker-independent component and components representing speaker variability are learned from recordings of tens or hundreds of speakers. We show that the proposed method can be applied to the single-channel speech separation task where two speakers of unknown identity are to be separated. In a scenario where the unknown speakers{\textquoteright} recordings are in training dataset together with recordings of many other speakers, we show that the proposed method outperforms stacked NMD (NMD with a dictionary which contains atoms of all speakers in the dataset) in terms of signal-to-distortion ratio (SDR). DANMD was also tested in a scenario where recordings of the recognized speakers were not in the training dataset. In this case it brought clearly positive signal-to-distortion ratios. The proposed model was also tested for a co-channel speaker identification task, where the parameters of the adapted model are a basis for a decision about the identity of the speakers in the mixture. In this case, the accuracy was 81.2 in comparison to 84.1 in the case of stacked NMD. While the speaker recognition accuracy is lower for the new approach, we find the primary value in the improved SDR.",
    keywords = "Cochannel speaker identification, Non-negative matrix deconvolution, Speech separation",
    note = "Publisher Copyright: {\textcopyright} 2021 Elsevier Ltd",
    year = "2021",
    month = "November",
    doi = "10.1016/j.csl.2021.101223",
    language = "English",
    volume = "70",
    journal = "Computer Speech and Language",
    issn = "0885-2308",
    publisher = "Academic Press"
    }
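
    Illustrative sketch of the underlying factorization model in the entry above: plain multiplicative-update NMF with a Euclidean cost, V ≈ W H, where W is a dictionary of spectral atoms and H their activations. The paper's method (NMD with an adaptive, parametric dictionary) extends this with temporal context and speaker adaptation; this generic version is only meant to make the dictionary-times-activation idea concrete.

    import numpy as np

    def nmf(V, n_atoms, n_iter=200, eps=1e-9, seed=0):
        # V: non-negative magnitude spectrogram of shape (frequencies, frames).
        rng = np.random.default_rng(seed)
        F, T = V.shape
        W = rng.random((F, n_atoms)) + eps        # spectral dictionary
        H = rng.random((n_atoms, T)) + eps        # activations over time
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)  # multiplicative updates keep factors non-negative
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H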

  • X. Favory, K. Drossos, T. Virtanen, and X. Serra, "Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2021, p. 596–600. doi:10.48550/arXiv.2010.14171
    [BibTeX] [Abstract] [Download PDF]

    Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings, capable to be employed into various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable to generalize to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends on the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA and we evaluate the audio representations (i.e. the output of the encoder of AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.

    @inproceedings{2021_ICASSP,
    author = "Favory, Xavier and Drossos, Konstantinos and Virtanen, Tuomas and Serra, Xavier",
    title = "Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags",
    abstract = "Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings, capable to be employed into various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable to generalize to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embed-dings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends on the output of the WEM, pro-viding a contextualized representation of the tags associated with the audio, and we align the output of MHA with the out-put of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA and we evaluate the audio representations (i.e. the output of the encoder of AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.",
    note = "JUFOID=57409; IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 06-06-2021 Through 11-06-2021",
    year = "2021",
    doi = "10.48550/arXiv.2010.14171",
    language = "English",
    isbn = "978-1-7281-7606-2",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "596--600",
    booktitle = "2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States",
    url = "https://2021.ieeeicassp.org"
    }
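
    Illustrative sketch of aligning the two embedding spaces with a contrastive loss, as described in the entry above. This is a generic InfoNCE-style formulation over a batch in which row i of the audio and tag matrices correspond; it is not the paper's exact loss, and the temperature value is an assumption.

    import numpy as np

    def contrastive_alignment_loss(audio_emb, tag_emb, temperature=0.1):
        # audio_emb, tag_emb: (B, D) batch-aligned embeddings from the two encoders.
        a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
        t = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
        logits = (a @ t.T) / temperature                     # (B, B) similarity matrix
        logits -= logits.max(axis=1, keepdims=True)          # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))           # matched pairs lie on the diagonal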

  • K. Khorrami and O. Räsänen, "Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation," Language Development Research, vol. 1, iss. 1, p. 123–191, 2021. doi:10.34842/w3vw-s845
    [BibTeX] [Abstract]

    Decades of research has studied how language learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations yet remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question whether knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, and instead of the units ever being proximal learning goals for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent that the audiovisual aspect of LLH is supported by the existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, and comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect phonetic, syllabic, or lexical structure of input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated with phonetic, syllabic, and lexical units of speech indeed emerge from the audiovisual learning process. The finding is also robust against variations in model architecture or characteristics of model training and testing data. The results suggest that cross-modal and cross-situational learning may, in principle, assist in early language development much beyond just enabling association of acoustic word forms to their referential meanings.

    @article{2021_f,
    author = {Khorrami, Khazar and R{\"a}s{\"a}nen, Okko},
    title = "Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation",
    abstract = "Decades of research has studied how language learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations yet remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question whether knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, and instead of the units ever being proximal learning goals for the learner. In this study, formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent that the audiovisual aspect of LLH is supported by the existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, and comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect phonetic, syllabic, or lexical structure of input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated with phonetic, syllabic, and lexical units of speech indeed emerge from the audiovisual learning process. The finding is also robust against variations in model architecture or characteristics of model training and testing data. The results suggest that cross-modal and cross-situational learning may, in principle, assist in early language development much beyond just enabling association of acoustic word forms to their referential meanings.",
    keywords = "neural networks, language representation learning, visually grounded speech, computational modeling, early language acquisition",
    year = "2021",
    month = "December",
    day = "31",
    doi = "10.34842/w3vw-s845",
    language = "English",
    volume = "1",
    pages = "123--191",
    journal = "Language Development Research",
    issn = "2771-7976",
    number = "1"
    }

  • K. Khorrami and O. Räsänen, "Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models," in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, 2021, p. 2996–3000. doi:10.21437/Interspeech.2021-496
    [BibTeX] [Abstract]

    Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever been explicitly trained for object localization or word recognition. As the main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task utilizing cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We compare the alignment performance using our proposed evaluation metrics to the semantic retrieval task commonly used to evaluate VGS models. We show that cross-modal attention layer not only helps the model to achieve higher semantic cross-modal retrieval performance, but also leads to substantial improvements in the alignment performance between image object and spoken words.

    @inproceedings{2021_InterSpecch_a,
    author = {Khorrami, Khazar and R{\"a}s{\"a}nen, Okko},
    title = "Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models",
    abstract = "Systems that can find correspondences between multiple modal- ities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an un- supervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and fo- cuses on their recently demonstrated capability to extract spa- tiotemporal alignments between spoken words and the corre- sponding visual objects without ever been explicitly trained for object localization or word recognition. As the main contribu- tions, we formalize the alignment problem in terms of an au- diovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task utilizing cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We compare the alignment performance using our proposed evaluation metrics to the se- mantic retrieval task commonly used to evaluate VGS models. We show that cross-modal attention layer not only helps the model to achieve higher semantic cross-modal retrieval perfor- mance, but also leads to substantial improvements in the align- ment performance between image object and spoken words.",
    keywords = "cross-modal learning, audio-visual alignment, visual object localization, word segmentation",
    note = "jufoid=59094; Annual Conference of the International Speech Communication Association ; Conference date: 30-08-2021 Through 03-09-2021",
    year = "2021",
    doi = "10.21437/Interspeech.2021-496",
    language = "English",
    series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
    publisher = "International Speech Communication Association ISCA",
    pages = "2996--3000",
    booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021"
    }

  • Y. Liu, N. Penttilä, T. Ihalainen, J. Lintula, R. Convey, and O. Räsänen, "Language-Independent Approach for Automatic Computation of Vowel Articulation Features in Dysarthric Speech Assessment," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, p. 2228–2243, 2021. doi:10.1109/TASLP.2021.3090973
    [BibTeX] [Abstract]

    Imprecise vowel articulation can be observed in people with Parkinson's disease (PD). Acoustic features measuring vowel articulation have been demonstrated to be effective indicators of PD in its assessment. Standard clinical vowel articulation features of vowel working space area (VSA), vowel articulation index (VAI) and formants centralization ratio (FCR) are derived from the first two formants of the three corner vowels /a/, /i/ and /u/. Conventionally, manual annotation of the corner vowels from speech data is required before measuring vowel articulation. This process is time-consuming. The present work aims to reduce human effort in clinical analysis of PD speech by proposing an automatic pipeline for vowel articulation assessment. The method is based on automatic corner vowel detection using a language universal phoneme recognizer, followed by statistical analysis of the formant data. The approach removes the restrictions of prior knowledge of speaking content and the language in question. Experimental results on a Finnish PD speech corpus demonstrate the efficacy and reliability of the proposed automatic method in deriving VAI, VSA, FCR and F2i/F2u (the second formant ratio for vowels /i/ and /u/). The automatically computed parameters are shown to be highly correlated with features computed with manual annotations of corner vowels. In addition, automatically and manually computed vowel articulation features have comparable correlations with experts' ratings on speech intelligibility, voice impairment and overall severity of communication disorder. Language-independence of the proposed approach is further validated on a Spanish PD database, PC-GITA, as well as on TORGO corpus of English dysarthric speech.

    @article{2021_b,
    author = {Liu, Yuanyuan and Penttil{\"a}, Nelly and Ihalainen, Tiina and Lintula, Juulia and Convey, Rachel and R{\"a}s{\"a}nen, Okko},
    title = "Language-Independent Approach for Automatic Computation of Vowel Articulation Features in Dysarthric Speech Assessment",
    abstract = "Imprecise vowel articulation can be observed in people with Parkinson's disease (PD). Acoustic features measuring vowel articulation have been demonstrated to be effective indicators of PD in its assessment. Standard clinical vowel articulation features of vowel working space area (VSA), vowel articulation index (VAI) and formants centralization ratio (FCR), are derived the first two formants of the three corner vowels /a/, /i/ and /u/. Conventionally, manual annotation of the corner vowels from speech data is required before measuring vowel articulation. This process is time-consuming. The present work aims to reduce human effort in clinical analysis of PD speech by proposing an automatic pipeline for vowel articulation assessment. The method is based on automatic corner vowel detection using a language universal phoneme recognizer, followed by statistical analysis of the formant data. The approach removes the restrictions of prior knowledge of speaking content and the language in question. Experimental results on a Finnish PD speech corpus demonstrate the efficacy and reliability of the proposed automatic method in deriving VAI, VSA, FCR and F2i/F2u (the second formant ratio for vowels /i/ and /u/). The automatically computed parameters are shown to be highly correlated with features computed with manual annotations of corner vowels. In addition, automatically and manually computed vowel articulation features have comparable correlations with experts' ratings on speech intelligibility, voice impairment and overall severity of communication disorder. Language-independence of the proposed approach is further validated on a Spanish PD database, PC-GITA, as well as on TORGO corpus of English dysarthric speech.",
    keywords = "automatic corner vowels detection, dysarthria, Parkinson's diseases, phoneme recognition, vowel articulation",
    note = "Funding Information: Manuscript received August 26, 2020; revised March 26, 2021 and May 31, 2021; accepted June 15, 2021. Date of publication June 23, 2021; date of current version July 14, 2021. This work was supported by the Academy of Finland under Grant 314602. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mathew Magimai Doss. (Corresponding author: Yuanyuan Liu.) Yuanyuan Liu is with the Unit of Computing Sciences, Tampere University, Tampere 33720, Pirkanmaa, Finland (e-mail: yuanyuan.liu@tuni.fi). Publisher Copyright: {\textcopyright} 2014 IEEE.",
    year = "2021",
    doi = "10.1109/TASLP.2021.3090973",
    language = "English",
    volume = "29",
    pages = "2228--2243",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity"
    }
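
    Illustrative sketch of the clinical vowel articulation measures named in the entry above, using the commonly cited formulas based on the first two formants of the corner vowels /a/, /i/, /u/. The definitions below follow the standard literature formulations and should be checked against the paper before reuse; the function name is an assumption.

    def vowel_articulation_features(f1a, f2a, f1i, f2i, f1u, f2u):
        # Formant inputs in Hz for the corner vowels /a/, /i/, /u/.
        vai = (f2i + f1a) / (f1i + f1u + f2u + f2a)          # vowel articulation index
        fcr = 1.0 / vai                                      # formant centralization ratio
        vsa = 0.5 * abs(f1i * (f2a - f2u) + f1a * (f2u - f2i) + f1u * (f2i - f2a))  # triangle area
        return {"VAI": vai, "FCR": fcr, "VSA": vsa, "F2i/F2u": f2i / f2u}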

  • I. Martin Morato, T. Heittola, A. Mesaros, and T. Virtanen, "Low-complexity acoustic scene classification for multi-device audio: analysis of DCASE 2021 Challenge systems," in Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021), 2021, p. 85–89. doi:10.5281/zenodo.5770113
    [BibTeX] [Abstract]

    This paper presents the details of Task 1A Acoustic Scene Classification in the DCASE 2021 Challenge. The task targeted development of low-complexity solutions with good generalization properties. The provided baseline system is based on a CNN architecture and post-training quantization of parameters. The system is trained using all the available training data, without any specific technique for handling device mismatch, and obtains an overall accuracy of 47.7%, with a log loss of 1.473. The task received 99 submissions from 30 teams, and most of the submitted systems outperformed the baseline. The most used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70% accuracy, and log loss under 0.8. The acoustic scene classification task remained a popular task in the challenge, despite the increasing difficulty of the setup.

    @inproceedings{2021_DCASE_2021_a,
    author = "Martin Morato, Irene and Heittola, Toni and Mesaros, Annamaria and Virtanen, Tuomas",
    editor = "Font, Frederic and Mesaros, Annamaria and P.W. Ellis, Daniel and Fonseca, Eduardo and Fuentes, Magdalena and Elizalde, Benjamin",
    title = "Low-complexity acoustic scene classification for multi-device audio: analysis of DCASE 2021 Challenge systems",
    abstract = "This paper presents the details of Task 1A Acoustic Scene Classification in the DCASE 2021 Challenge. The task targeted development of low-complexity solutions with good generalization properties. The provided baseline system is based on a CNN architecture and post-training quantization of parameters. The system is trained using all the available training data, without any specific technique for handling device mismatch, and obtains an overall accuracy of 47.7\\%, with a log loss of 1.473. The task received 99 submissions from 30 teams, and most of the submitted systems outperformed the baseline. The most used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70\\% accuracy, and log loss under 0.8. The acoustic scene classification task remained a popular task in the challenge, despite the increasing difficulty of the setup.",
    year = "2021",
    month = "November",
    day = "15",
    doi = "10.5281/zenodo.5770113",
    language = "English",
    pages = "85--89",
    booktitle = "Proceedings of the 6th Workshop on Detection and Classication of Acoustic Scenes and Events (DCASE 2021)",
    publisher = "DCASE",
    note = "Detection and Classication of Acoustic Scenes and Events ; Conference date: 15-11-2021 Through 19-11-2021"
    }

  • A. Mesaros, T. Heittola, T. Virtanen, and M. D. Plumbley, "Sound Event Detection: A tutorial," IEEE Signal Processing Magazine, vol. 38, iss. 5, p. 67–83, 2021. doi:10.1109/MSP.2021.3090678
    [BibTeX] [Abstract]

    Imagine standing on a street corner in the city. With your eyes closed you can hear and recognize a succession of sounds: cars passing by, people speaking, their footsteps when they walk by, and the continuous falling of rain. The recognition of all these sounds and interpretation of the perceived scene as a city street soundscape comes naturally to humans. It is, however, the result of years of "training": encountering and learning associations among the vast varieties of sounds in everyday life, the sources producing these sounds, and the names given to them.

    @article{2021_SPM,
    author = "Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas and Plumbley, Mark D.",
    title = "Sound Event Detection: A tutorial",
    abstract = {Imagine standing on a street corner in the city. With your eyes closed you can hear and recognize a succession of sounds: cars passing by, people speaking, their footsteps when they walk by, and the continuous falling of rain. The recognition of all these sounds and interpretation of the perceived scene as a city street soundscape comes naturally to humans. It is, however, the result of years of {"}training{"}: encountering and learning associations among the vast varieties of sounds in everyday life, the sources producing these sounds, and the names given to them.},
    keywords = "Acoustics, Urban areas, Tutorials",
    year = "2021",
    doi = "10.1109/MSP.2021.3090678",
    language = "English",
    volume = "38",
    pages = "67--83",
    journal = "IEEE Signal Processing Magazine",
    issn = "1053-5888",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "5"
    }

  • S. I. Mimilakis, K. Drossos, and G. Schuller, "Unsupervised Interpretable Representation Learning for Singing Voice Separation," in 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 1412-1416. doi:10.23919/Eusipco47968.2020.9287352
    [BibTeX]
    @INPROCEEDINGS{2021_EUSIPCO,
    author = "Mimilakis, Stylianos I. and Drossos, Konstantinos and Schuller, Gerald",
    booktitle = "2020 28th European Signal Processing Conference (EUSIPCO)",
    title = "Unsupervised Interpretable Representation Learning for Singing Voice Separation",
    year = "2021",
    pages = "1412-1416",
    keywords = "Time-frequency analysis;Source separation;Fourier transforms;Noise reduction;Multiple signal classification;Signal representation;Task analysis;representation learning;unsupervised learning;denoising auto-encoders;singing voice separation",
    doi = "10.23919/Eusipco47968.2020.9287352"
    }

  • N. Nicodemo, G. Naithani, K. Drossos, T. Virtanen, and R. Saletti, "Memory Requirement Reduction of Deep Neural Networks for Field Programmable Gate Arrays Using Low-Bit Quantization of Parameters," in 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 466-470. doi:10.23919/Eusipco47968.2020.9287739
    [BibTeX] [Abstract] [Download PDF]

    Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the proposed scheme. Our method reduces the memory requirements, preserving the performance of the network. The performance of our method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the input noisy speech spectrum. A DNN is optimized and its memory footprint and performance are evaluated using the short-time objective intelligibility, STOI, metric. The application of the low-bit quantization allows a 50% reduction of the DNN memory footprint while the STOI performance drops only by 2.7%.

    @INPROCEEDINGS{2021_EUSIPCO_a,
    author = "Nicodemo, Niccolò and Naithani, Gaurav and Drossos, Konstantinos and Virtanen, Tuomas and Saletti, Roberto",
    booktitle = "2020 28th European Signal Processing Conference (EUSIPCO)",
    title = "Memory Requirement Reduction of Deep Neural Networks for Field Programmable Gate Arrays Using Low-Bit Quantization of Parameters",
    year = "2021",
    pages = "466-470",
    keywords = "Quantization (signal);Neural networks;Memory management;Speech enhancement;Logic gates;Table lookup;Field programmable gate arrays;neural network quantization;memory footprint reduction;FPGA;hardware accelerators",
    abstract = "Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the proposed scheme. Our method reduces the memory requirements, preserving the performance of the network. The performance of our method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the input noisy speech spectrum. A DNN is optimized and its memory footprint and performance are evaluated using the short-time objective intelligibility, STOI, metric. The application of the low-bit quantization allows a 50{\\%} reduction of the DNN memory footprint while the STOI performance drops only by 2.7{\\%}.",
    doi = "10.23919/Eusipco47968.2020.9287739",
    url = "https://arxiv.org/abs/1911.00527"
    }

  • P. Pertilä, E. Fagerlund, A. Huttunen, and V. Myllylä, "Online Own Voice Detection for a Multi-channel Multi-sensor In-Ear Device," IEEE Sensors Journal, vol. 21, iss. 24, p. 27686–27697, 2021. doi:10.1109/JSEN.2021.3122936
    [BibTeX] [Abstract]

    Voice activity detection (VAD) aims for detecting the presence of speech in a given input signal, and is often the first step in voice-based applications such as speech communication systems. In the context of personal devices, own voice detection (OVD) is a sub-task of VAD, since it targets speech detection of the person wearing the device, while ignoring other speakers in the presence of interference signals. This article first summarizes recent single and multi-microphone, multi-sensor, and hearing aids related VAD techniques. Then, a wearable in-ear device equipped with multiple microphones and an accelerometer is investigated for the OVD task using a neural network with input embedding and long short-term memory (LSTM) layers. The device picks up the user's speech signal through air as well as vibrations through the body. However, besides external sounds the device is sensitive to user's own non-speech vocal noises (e.g. coughing, yawning, etc.) and movement noise caused by physical activities. A signal mixing model is proposed to produce databases of noisy observations used for training and testing the frame-by-frame OVD method. The best model's performance is further studied in the presence of different recorded interference. An ablation study reports the model's performance on sub-sets of sensors. The results show that the OVD approach is robust towards both user motion and user generated vocal non-speech sounds in the presence of loud external interference. The approach is suitable for real-time operation and achieves 90–96% OVD accuracy in challenging use scenarios with a short 10 ms processing frame length.

    @article{2021_SJ,
    author = {Pertil{\"a}, Pasi and Fagerlund, Eemi and Huttunen, Anu and Myllyl{\"a}, Ville},
    title = "Online Own Voice Detection for a Multi-channel Multi-sensor In-Ear Device",
    abstract = "Voice activity detection (VAD) aims for detecting the presence of speech in a given input signal, and is often the first step in voice -based applications such as speech communication systems. In the context of personal devices, own voice detection (OVD) is a sub-task of VAD, since it targets speech detection of the person wearing the device, while ignoring other speakers in the presence of interference signals. This article first summarizes recent single and multi-microphone, multi-sensor, and hearing aids related VAD techniques. Then, a wearable in-ear device equipped with multiple microphones and an accelerometer is investigated for the OVD task using a neural network with input embedding and long short-term memory (LSTM) layers. The device picks up the user{\textquoteright}s speech signal through air as well as vibrations through the body. However, besides external sounds the device is sensitive to user{\textquoteright}s own non-speech vocal noises (e.g. coughing, yawning, etc.) and movement noise caused by physical activities. A signal mixing model is proposed to produce databases of noisy observations used for training and testing the frame-by-frame OVD method. The best model{\textquoteright}s performance is further studied in the presence of different recorded interference. An ablation study reports the model{\textquoteright}s performance on sub-sets of sensors. The results show that the OVD approach is robust towards both user motion and user generated vocal non-speech sounds in the presence of loud external interference. The approach is suitable for real-time operation and achieves 90-96 \\% OVD accuracy in challenging use scenarios with a short 10 ms processing frame length.",
    keywords = "Accelerometer Measurements, Acoustic Measurements, Artificial Neural Networks, Databases, Feature extraction, Hidden Markov models, Human Voice Processing, Sensors, Speech processing, Task analysis, Training",
    note = "Publisher Copyright: Author",
    year = "2021",
    doi = "10.1109/JSEN.2021.3122936",
    language = "English",
    volume = "21",
    pages = "27686--27697",
    journal = "IEEE Sensors Journal",
    issn = "1530-437X",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "24"
    }

  • P. Pertilä, E. Cakir, A. Hakala, E. Fagerlund, T. Virtanen, A. Politis, and A. Eronen, "Mobile Microphone Array Speech Detection and Localization in Diverse Everyday Environments," in 2021 29th European Signal Processing Conference (EUSIPCO), United States, 2021, p. 406–410. doi:10.23919/EUSIPCO54536.2021.9616168
    [BibTeX] [Abstract] [Download PDF]

    Joint sound event localization and detection (SELD) is an integral part of developing context awareness into communication interfaces of mobile robots, smartphones, and home assistants. For example, an automatic audio focus for video capture on a mobile phone requires robust detection of relevant acoustic events around the device and their direction. Existing SELD approaches have been evaluated using material produced in controlled indoor environments, or the audio is simulated by mixing isolated sounds to different spatial locations. This paper studies SELD of speech in diverse everyday environments, where the audio corresponds to typical usage scenarios of handheld mobile devices. In order to allow weighting the relative importance of localization vs. detection, we will propose a two-stage hierarchical system, where the first stage is to detect the target events, and the second stage is to localize them. The proposed method utilizes convolutional recurrent neural network (CRNN) and is evaluated on a database of manually annotated microphone array recordings from various acoustic conditions. The array is embedded in a contemporary mobile phone form factor. The obtained results show good speech detection and localization accuracy of the proposed method in contrast to a non-hierarchical flat classification model.

    @inproceedings{2021_EUSIPCO_f,
    author = {Pertil{\"a}, Pasi and Cakir, Emre and Hakala, Aapo and Fagerlund, Eemi and Virtanen, Tuomas and Politis, Archontis and Eronen, Antti},
    title = "Mobile Microphone Array Speech Detection and Localization in Diverse Everyday Environments",
    abstract = "Joint sound event localization and detection (SELD) is an integral part of developing context awareness into communication interfaces of mobile robots, smartphones, and home assistants. For example, an automatic audio focus for video capture on a mobile phone requires robust detection of relevant acoustic events around the device and their direction. Existing SELD approaches have been evaluated using material produced in controlled indoor environments, or the audio is simulated by mixing isolated sounds to different spatial locations. This paper studies SELD of speech in diverse everyday environments, where the audio corresponds to typical usage scenarios of handheld mobile devices. In order to allow weighting the relative importance of localization vs. detection, we will propose a two-stage hierarchical system, where the first stage is to detect the target events, and the second stage is to localize them. The proposed method utilizes convolutional recurrent neural network (CRNN) and is evaluated on a database of manually annotated microphone array recordings from various acoustic conditions. The array is embedded in a contemporary mobile phone form factor. The obtained results show good speech detection and localization accuracy of the proposed method in contrast to a non-hierarchical flat classification model.",
    note = "JUFOID=55867; European Signal Processing Conference, EUSIPCO ; Conference date: 23-08-2021 Through 27-08-2021",
    year = "2021",
    doi = "10.23919/EUSIPCO54536.2021.9616168",
    language = "English",
    isbn = "978-1-6654-0900-1",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "406--410",
    booktitle = "2021 29th European Signal Processing Conference (EUSIPCO)",
    address = "United States",
    url = "https://eusipco2021.org"
    }

  • A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, "A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection," in Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021), 2021, p. 125–129. doi:10.48550/arXiv.2106.06999
    [BibTeX]
    @inproceedings{2021_DCASE_2021,
    author = "Politis, Archontis and Adavanne, Sharath and Krause, Daniel and Deleforge, Antoine and Srivastava, Prerak and Virtanen, Tuomas",
    editor = "Font, Frederic and Mesaros, Mesaros and P. W. Ellis, Daniel and Fonseca, Eduardo and Fuentes, Magdalena and Elizalde, Benjamin",
    title = "A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection",
    year = "2021",
    doi = "10.48550/arXiv.2106.06999",
    language = "English",
    pages = "125--129",
    booktitle = "Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)",
    publisher = "DCASE",
    note = "Detection and Classication of Acoustic Scenes and Events ; Conference date: 15-11-2021 Through 19-11-2021"
    }

  • A. Politis, A. Mesaros, S. Adavanne, T. Heittola, and T. Virtanen, "Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, p. 684–698, 2021. doi:10.1109/TASLP.2020.3047233
    [BibTeX] [Abstract]

    Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.

    @article{2021_g,
    author = "Politis, Archontis and Mesaros, Annamaria and Adavanne, Sharath and Heittola, Toni and Virtanen, Tuomas",
    title = "Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019",
    abstract = "Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.",
    year = "2021",
    doi = "10.1109/TASLP.2020.3047233",
    language = "English",
    volume = "29",
    pages = "684--698",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity"
    }

  • O. Räsänen, S. Seshadri, M. Lavechin, A. Cristia, and M. Casillas, "ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings," BEHAVIOR RESEARCH METHODS, vol. 53, iss. 2, p. 818–835, 2021. doi:10.3758/s13428-020-01460-x
    [BibTeX] [Abstract]

    Recordings captured by wearable microphones are a standard method for investigating young children's language environments. A key measure to quantify from such data is the amount of speech present in children's home environments. To this end, the LENA recorder and software—a popular system for measuring linguistic input—estimates the number of adult words that children may hear over the course of a recording. However, word count estimation is challenging to do in a language-independent manner; the relationship between observable acoustic patterns and language-specific lexical entities is far from uniform across human languages. In this paper, we ask whether some alternative linguistic units, namely phone(me)s or syllables, could be measured instead of, or in parallel with, words in order to achieve improved cross-linguistic applicability and comparability of an automated system for measuring child language input. We discuss the advantages and disadvantages of measuring different units from theoretical and technical points of view. We also investigate the practical applicability of measuring such units using a novel system called Automatic LInguistic unit Count Estimator (ALICE) together with audio from seven child-centered daylong audio corpora from diverse cultural and linguistic environments. We show that language-independent measurement of phoneme counts is somewhat more accurate than syllables or words, but all three are highly correlated with human annotations on the same data. We share an open-source implementation of ALICE for use by the language research community, enabling automatic phoneme, syllable, and word count estimation from child-centered audio recordings.

    @article{2021_e,
    author = {R{\"a}s{\"a}nen, Okko and Seshadri, Shreyas and Lavechin, Marvin and Cristia, Alejandrina and Casillas, Marisa},
    title = "ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings",
    abstract = "Recordings captured by wearable microphones are a standard method for investigating young children{\textquoteright}s language environments. A key measure to quantify from such data is the amount of speech present in children{\textquoteright}s home environments. To this end, the LENA recorder and software—a popular system for measuring linguistic input—estimates the number of adult words that children may hear over the course of a recording. However, word count estimation is challenging to do in a language- independent manner; the relationship between observable acoustic patterns and language-specific lexical entities is far from uniform across human languages. In this paper, we ask whether some alternative linguistic units, namely phone(me)s or syllables, could be measured instead of, or in parallel with, words in order to achieve improved cross-linguistic applicability and comparability of an automated system for measuring child language input. We discuss the advantages and disadvantages of measuring different units from theoretical and technical points of view. We also investigate the practical applicability of measuring such units using a novel system called Automatic LInguistic unit Count Estimator (ALICE) together with audio from seven child-centered daylong audio corpora from diverse cultural and linguistic environments. We show that language-independent measurement of phoneme counts is somewhat more accurate than syllables or words, but all three are highly correlated with human annotations on the same data. We share an open-source implementation of ALICE for use by the language research community, enabling automatic phoneme, syllable, and word count estimation from child-centered audio recordings.",
    keywords = "Child-centered audio, Language development, LENA, Speaker diarization, Speech processing, Word count estimation",
    year = "2021",
    doi = "10.3758/s13428-020-01460-x",
    language = "English",
    volume = "53",
    pages = "818–835",
    journal = "BEHAVIOR RESEARCH METHODS",
    issn = "1554-351X",
    publisher = "Springer Nature",
    number = "2"
    }

  • B. W. Schuller, T. Virtanen, M. Riveiro, G. Rizos, J. Han, A. Mesaros, and K. Drossos, "Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence," in ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction, United States, 2021, p. 788–792. doi:10.1145/3462244.3479879
    [BibTeX] [Abstract]

    We are largely used to hearing explanations. For example, if someone thinks you are sad today, they might reply to your "why?" with "because you were so Hmmmmm-mmm-mmm". Today's Artificial Intelligence (AI), however, is - if at all - largely providing explanations of decisions in a visual or textual manner. While such approaches are good for communication via visual media such as in research papers or screens of intelligent devices, they may not always be the best way to explain; especially when the end user is not an expert. In particular, when the AI's task is about Audio Intelligence, visual explanations appear less intuitive than audible, sonified ones. Sonification has also great potential for explainable AI (XAI) in systems that deal with non-audio data - for example, because it does not require visual contact or active attention of a user. Hence, sonified explanations of AI decisions face a challenging, yet highly promising and pioneering task. That involves incorporating innovative XAI algorithms to allow pointing back at the learning data responsible for decisions made by an AI, and to include decomposition of the data to identify salient aspects. It further aims to identify the components of the preprocessing, feature representation, and learnt attention patterns that are responsible for the decisions. Finally, it targets decision-making at the model-level, to provide a holistic explanation of the chain of processing in typical pattern recognition problems from end-to-end. Sonified AI explanations will need to unite methods for sonification of the identified aspects that benefit decisions, decomposition and recomposition of audio to sonify which parts in the audio were responsible for the decision, and rendering attention patterns and salient feature representations audible. Benchmarking sonified XAI is challenging, as it will require a comparison against a backdrop of existing, state-of-the-art visual and textual alternatives, as well as synergistic complementation of all modalities in user evaluations. Sonified AI explanations will need to target different user groups to allow personalisation of the sonification experience for different user needs, to lead to a major breakthrough in comprehensibility of AI via hearing how decisions are made, hence supporting tomorrow's humane AI's trustability. Here, we introduce and motivate the general idea, and provide accompanying considerations including milestones of realisation of sonified XAI and foreseeable risks.

    @inproceedings{2021_a,
    author = {Schuller, Bj{\"o}rn W. and Virtanen, Tuomas and Riveiro, Maria and Rizos, Georgios and Han, Jing and Mesaros, Annamaria and Drossos, Konstantinos},
    title = "Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence",
    abstract = {We are largely used to hearing explanations. For example, if someone thinks you are sad today, they might reply to your {"}why?{"}with {"}because you were so Hmmmmm-mmm-mmm{"}. Today's Artificial Intelligence (AI), however, is - if at all - largely providing explanations of decisions in a visual or textual manner. While such approaches are good for communication via visual media such as in research papers or screens of intelligent devices, they may not always be the best way to explain; especially when the end user is not an expert. In particular, when the AI's task is about Audio Intelligence, visual explanations appear less intuitive than audible, sonified ones. Sonification has also great potential for explainable AI (XAI) in systems that deal with non-audio data - for example, because it does not require visual contact or active attention of a user. Hence, sonified explanations of AI decisions face a challenging, yet highly promising and pioneering task. That involves incorporating innovative XAI algorithms to allow pointing back at the learning data responsible for decisions made by an AI, and to include decomposition of the data to identify salient aspects. It further aims to identify the components of the preprocessing, feature representation, and learnt attention patterns that are responsible for the decisions. Finally, it targets decision-making at the model-level, to provide a holistic explanation of the chain of processing in typical pattern recognition problems from end-to-end. Sonified AI explanations will need to unite methods for sonification of the identified aspects that benefit decisions, decomposition and recomposition of audio to sonify which parts in the audio were responsible for the decision, and rendering attention patterns and salient feature representations audible. Benchmarking sonified XAI is challenging, as it will require a comparison against a backdrop of existing, state-of-the-art visual and textual alternatives, as well as synergistic complementation of all modalities in user evaluations. Sonified AI explanations will need to target different user groups to allow personalisation of the sonification experience for different user needs, to lead to a major breakthrough in comprehensibility of AI via hearing how decisions are made, hence supporting tomorrow's humane AI's trustability. Here, we introduce and motivate the general idea, and provide accompanying considerations including milestones of realisation of sonifed XAI and foreseeable risks.},
    keywords = "Explainable artificial intelligence, human computer interaction, multimodality, sonification, trustworthy artificial intelligence",
    note = "Funding Information: This project has received funding from the European Union{\textquoteright}s Horizon 2020 research and innovation programme under grant agreement No. 826506 (sustAGE). Publisher Copyright: {\textcopyright} 2021 ACM.; ACM International Conference on Multimodal Interaction ; Conference date: 18-10-2021 Through 22-10-2021",
    year = "2021",
    month = "October",
    day = "18",
    doi = "10.1145/3462244.3479879",
    language = "English",
    series = "ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction",
    publisher = "ACM",
    pages = "788--792",
    booktitle = "ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction",
    address = "United States"
    }

  • A. Tran, K. Drossos, and T. Virtanen, "WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information," in 2021 29th European Signal Processing Conference (EUSIPCO), United States, 2021, p. 576–580. doi:10.23919/EUSIPCO54536.2021.9616340
    [BibTeX] [Abstract]

    Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from image captioning or machine translation fields. In this work, we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the temporal and time-frequency information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase previously reported highest SPIDEr to 17.3, from 16.2 (higher is better).

    @inproceedings{2021_EUSIPCO_e,
    author = "Tran, An and Drossos, Konstantinos and Virtanen, Tuomas",
    title = "WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information",
    abstract = "Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from image captioning or machine translation fields. In this work, we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the temporal and time-frequency information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase previously reported highest SPIDEr to 17.3, from 16.2 (higher is better).",
    keywords = "Measurement, Time-frequency analysis, Neural networks, Europe, Transformers, Encoding, Decoding, automated audio captioning, wavetransformer, wavenet, transformer",
    note = "jufoid=55867; European Signal Processing Conference, EUSIPCO 2021 ; Conference date: 23-08-2021 Through 27-08-2021",
    year = "2021",
    doi = "10.23919/EUSIPCO54536.2021.9616340",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "576--580",
    booktitle = "2021 29th European Signal Processing Conference (EUSIPCO)",
    address = "United States"
    }

  • E. Vaaras, S. Ahlqvist-Björkroth, K. Drossos, and O. Räsänen, "Automatic analysis of the emotional content of speech in daylong child-centered recordings from a neonatal intensive care unit," in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, 2021, p. 3380–3384. doi:10.21437/Interspeech.2021-303
    [BibTeX] [Abstract]

    Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques to deploy a SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for a binary classification for valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.

    @inproceedings{2021_InterSpecch,
    author = {Vaaras, Einari and Ahlqvist-Bj{\"o}rkroth, Sari and Drossos, Konstantinos and R{\"a}s{\"a}nen, Okko},
    title = "Automatic analysis of the emotional content of speech in daylong child-centered recordings from a neonatal intensive care unit",
    abstract = "Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing indomain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques to deploy a SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4\\% unweighted average recall (UAR) and 73.2\\% UAR for a binary classification for valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.",
    keywords = "Daylong audio, Lena recorder, Real-world audio, Speech analysis, Speech emotion recognition",
    note = "Publisher Copyright: Copyright {\textcopyright} 2021 ISCA. jufoid=59094; Annual Conference of the International Speech Communication Association ; Conference date: 30-08-2021 Through 03-09-2021",
    year = "2021",
    doi = "10.21437/Interspeech.2021-303",
    language = "English",
    series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
    publisher = "International Speech Communication Association",
    pages = "3380--3384",
    booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021"
    }

  • S. Vanhatalo, M. Airaksinen, E. Ilen, T. Häyrinen, J. Ranta, O. Räsänen, and L. Haataja, "Vauvan älyvaatteet : hypeä ja lupausta paremmasta terveydenhoidosta," Duodecim, vol. 137, iss. 6, p. 596–604, 2021.
    [BibTeX] [Abstract]

    Wearable technology solutions have spread rapidly on the consumer market, following the life of nearly every adult through sensors carried in smartphones or wrist-worn devices. AI-based algorithms running in cloud services increasingly deliver results that are meaningful for everyday life. So far, these smart-seeming garments and other wearable devices have mainly been consumer products built for wellness technology needs. In recent years, their further development for medical use has also increased rapidly. The bottleneck is that the requirements for a medical device differ considerably from those for consumer products. New sensor and software solutions developed for open interfaces have made smart clothing development possible also as academic research and development work. In the coming years, several medical smart garments intended for clinical use in children are likely to appear. They make even long-term diagnostics and treatment monitoring possible in the child's natural living environment.

    @article{2021,
    author = {Vanhatalo, Sampsa and Airaksinen, Manu and Ilen, Elina and H{\"a}yrinen, Taru and Ranta, Jukka and R{\"a}s{\"a}nen, Okko and Haataja, Leena},
    title = {Vauvan {\"a}lyvaatteet : hype{\"a} ja lupausta paremmasta terveydenhoidosta},
    abstract = {Puettavan teknologian ratkaisut ovat levinneet nopeasti kuluttajamarkkinoilla seuraten l{\"a}hes jokaisen aikuisen el{\"a}m{\"a}{\"a} {\"a}lypuhelimissa tai rannelaitteissa kulkevilla antureilla. Pilvipalveluissa olevat teko{\"a}lypohjaiset algoritmit antavat yh{\"a} useammin arkiel{\"a}m{\"a}n kannalta merkityksellisi{\"a} tuloksia. N{\"a}m{\"a} {\"a}lykk{\"a}{\"a}n tuntuiset vaatteet ja muut puettavat laitteet ovat toistaiseksi l{\"a}hinn{\"a} hyvinvointiteknologian tarpeisiin tehtyj{\"a} kuluttajatuotteita. Viime vuosina on lis{\"a}{\"a}ntynyt nopeasti niiden jatkokehitys my{\"o}s l{\"a}{\"a}ketieteelliseen k{\"a}ytt{\"o}{\"o}n. Pullonkaulaksi muodostuu se, ett{\"a} l{\"a}{\"a}ketieteellisen laitteen vaatimukset eroavat huomattavasti kuluttajatuotteista. Uudet avoimeen rajapintaan kehitetyt sensori- ja ohjelmistoratkaisut ovat mahdollistaneet {\"a}lyvaatekehityksen my{\"o}s akateemisena tutkimus- ja kehitysty{\"o}n{\"a}. L{\"a}hivuosina n{\"a}hd{\"a}{\"a}n todenn{\"a}k{\"o}isesti useita kliiniseen k{\"a}ytt{\"o}{\"o}n lapsille suunnattuja l{\"a}{\"a}ketieteellisi{\"a} {\"a}lyvaatteita. Pitk{\"a}aikainenkin diagnosointi ja hoidon seuranta on niiden avulla mahdollista lapsen luonnollisessa elinymp{\"a}rist{\"o}ss{\"a}.},
    keywords = "Wearable Electronic Devices, Clothing, Infant, Monitoring, Physiologic, Artificial Intelligence, Algorithms, Biosensing Techniques",
    note = "Vertaisarvioitu.",
    year = "2021",
    language = "Suomi",
    volume = "137",
    pages = "596--604",
    journal = "Duodecim",
    issn = "0012-7183",
    publisher = "Laaketieteellinen Aikakauskirja Duodecim",
    number = "6"
    }

  • A. V. Venkatakrishnan, P. Pertilä, and M. Parviainen, "Tampere University Rotated Circular Array Dataset," in 2021 29th European Signal Processing Conference (EUSIPCO), United States, 2021, p. 201–205. doi:10.23919/EUSIPCO54536.2021.9616072
    [BibTeX] [Abstract]

    Advancements in deep learning have resulted in new techniques to address sophisticated audio processing tasks, such as sound localization and recognition. However, supervised training of deep neural networks (DNNs) requires a significant amount of training data. Existing datasets are either recorded or allow synthetic recordings through impulse responses (IRs) via convolution. Recorded datasets often lack sufficient and versatile material for supervised DNN training. On the other hand, impulse response databases allow large scale dataset creation, provided that suitable IRs are available. However, existing IR datasets do not cater to the data requirements of moving and crossing sources problem in sound localization, due to insufficient angular resolution. This work introduces a versatile room IR dataset to address this problem. Various diverse environments such as office rooms, meeting rooms, corridor, and an anechoic chamber are chosen for the data collection. The chosen rooms have varying characteristics, such as reverberation times (T60) and volumes. The data is collected by placing the speaker at three different distances from a rotated microphone array, thus mimicking the moving source condition. Direction of arrival (DoA) estimation is performed by spatializing the sound signal with the collected IRs to verify their quality. The dataset will be publicly available.

    @inproceedings{2021_EUSIPCO_g,
    author = {Venkatakrishnan, Arjun Venkat and Pertil{\"a}, Pasi and Parviainen, Mikko},
    title = "Tampere University Rotated Circular Array Dataset",
    abstract = "Advancements in deep learning have resulted in new techniques to address sophisticated audio processing tasks, such as sound localization and recognition. However, supervised training of deep neural networks (DNNs) requires a significant amount of training data. Existing datasets are either recorded or allow synthetic recordings through impulse responses (IRs) via convolution. Recorded datasets often lack sufficient and versatile material for supervised DNN training. On the other hand, impulse response databases allow large scale dataset creation, provided that suitable IRs are available. However, existing IR datasets do not cater to the data requirements of moving and crossing sources problem in sound localization, due to insufficient angular resolution. This work introduces a versatile room IR dataset to address this problem. Various diverse environments such as office rooms, meeting rooms, corridor, and an anechoic chamber are chosen for the data collection. The chosen rooms have varying characteristics, such as reverberation times (T60) and volumes. The data is collected by placing the speaker at three different distances from a rotated microphone array, thus mimicking the moving source condition. Direction of arrival (DoA) estimation is performed by spatializing the sound signal with the collected IRs to verify their quality. The dataset will be publicly available.",
    keywords = "Training, Location awareness, Deep learning, Direction-of-arrival estimation, Training data, Data collection, Microphone arrays, sound localization, deep learning, direction of arrival, impulse responses",
    note = "jufoid=55867; European Signal Processing Conference, EUSIPCO 2021 ; Conference date: 23-08-2021 Through 27-08-2021",
    year = "2021",
    doi = "10.23919/EUSIPCO54536.2021.9616072",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "201--205",
    booktitle = "2021 29th European Signal Processing Conference (EUSIPCO)",
    address = "United States"
    }

  • S. Wang, A. Mesaros, T. Heittola, and T. Virtanen, "A curated dataset of urban scenes for audio-visual scene analysis," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2021, p. 626–630. doi:10.48550/arXiv.2011.00030
    [BibTeX] [Abstract] [Download PDF]

    This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual modalities brings significant performance gain compared to state of the art uni-modal systems. Our approach obtained an 84.8% accuracy compared to 75.8% for the audio-only and 68.4% for the video-only equivalent systems.

    @inproceedings{2021_ICASSP_a,
    author = "Wang, Shanshan and Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas",
    title = "A curated dataset of urban scenes for audio-visual scene analysis",
    abstract = "This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual modalities brings significant performance gain compared to state of the art uni-modal systems. Our approach obtained an 84.8\\% accuracy compared to 75.8\\% for the audio-only and 68.4\\% for the video-only equivalent systems.",
    keywords = "Acoustic scene, Audio-visual data, Pattern recognition, Scene analysis, Transfer learning",
    note = "Funding Information: This work was supported in part by the European Research Council under the European Unions H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The authors wish to thank CSC-IT Centre of Science Ltd., Finland, for providing computational resources. Funding Information: This work was supported in part by the European Research Council under the European Unions H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The authors wish to thank CSCIT Centre of Science Ltd., Finland, for providing computational resources. Publisher Copyright: {\textcopyright} 2021 IEEE. jufoid=57409; IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 06-06-2021 Through 11-06-2021",
    year = "2021",
    doi = "10.48550/arXiv.2011.00030",
    language = "English",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "626--630",
    booktitle = "ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States",
    url = "https://2021.ieeeicassp.org"
    }

  • S. Wang, G. Naithani, A. Politis, and T. Virtanen, "Deep Neural Network Based Low-Latency Speech Separation with Asymmetric Analysis-Synthesis Window Pair," in European Signal Processing Conference 2021, United States, 2021, p. 301–305. doi:10.23919/EUSIPCO54536.2021.9616165
    [BibTeX] [Abstract] [Download PDF]

    Time-frequency masking or spectrum prediction computed via short symmetric windows are commonly used in low-latency deep neural network (DNN) based source separation. In this paper, we propose the use of an asymmetric analysis-synthesis window pair which allows for training with targets with better frequency resolution, while retaining the low-latency during inference suitable for real-time speech enhancement or assisted hearing applications. In order to assess our approach across various model types and datasets, we evaluate it with a speaker-independent deep clustering (DC) model and a speaker-dependent mask inference (MI) model. We report an improvement in separation performance of up to 1.5 dB in terms of source-to-distortion ratio (SDR) while maintaining an algorithmic latency of 8 ms.

    @inproceedings{2021_EUSIPCO_c,
    author = "Wang, Shanshan and Naithani, Gaurav and Politis, Archontis and Virtanen, Tuomas",
    title = "Deep Neural Network Based Low-Latency Speech Separation with Asymmetric Analysis-Synthesis Window Pair",
    abstract = "Time-frequency masking or spectrum prediction computed via short symmetric windows are commonly used in low-latency deep neural network (DNN) based source separation. In this paper, we propose the use of an asymmetric analysis-synthesis window pair which allows for training with targets with better frequency resolution, while retaining the low-latency during inference suitable for real-time speech enhancement or assisted hearing applications. In order to assess our approach across various model types and datasets, we evaluate it with a speaker-independent deep clustering (DC) model and a speaker-dependent mask inference (MI) model. We report an improvement in separation performance of up to 1.5 dB in terms of source-to-distortion ratio (SDR) while maintaining an algorithmic latency of 8 ms.",
    keywords = "Deep learning, Training, Time-frequency analysis, Source separation, Signal processing algorithms, Europe, Speech enhancement, Monaural speaker separation, Low latency, Asymmetric windows, Deep clustering",
    note = "jufoid=55867; European Signal Processing Conference, EUSIPCO ; Conference date: 23-08-2021 Through 27-08-2021",
    year = "2021",
    doi = "10.23919/EUSIPCO54536.2021.9616165",
    language = "English",
    series = "European Signal Processing Conference",
    publisher = "IEEE",
    pages = "301--305",
    booktitle = "European Signal Processing Conference 2021",
    address = "United States",
    url = "https://eusipco2021.org"
    }

  • S. Wang, T. Heittola, A. Mesaros, and T. Virtanen, "Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions," in Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2021), 2021, p. 45–49. doi:10.5281/zenodo.5770113
    [BibTeX] [Abstract]

    This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have better performance than the baseline. The common techniques among the top systems are the usage of large pretrained models such as ResNet or EfficientNet which are trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost the performance. More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams. The best system among all achieved a logloss of 0.195 and accuracy of 93.8%, compared to the baseline system with logloss of 0.662 and accuracy of 77.1%.

    @inproceedings{2021_DCASE2021,
    author = "Wang, Shanshan and Heittola, Toni and Mesaros, Annamaria and Virtanen, Tuomas",
    editor = "Font, Frederic and Mesaros, Annamaria and P.W. Ellis, Daniel and Fonseca, Eduardo and Fuentes, Magdalena and Elizalde, Benjamin",
    title = "Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions",
    abstract = "This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have better performance than the baseline. The common techniques among the top systems are the usage of large pretrained models such as ResNet or EfficientNet which are trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost the performance. More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams. The best system among all achieved a logloss of 0.195 and accuracy of 93.8\textbackslash{}\\%, compared to the baseline system with logloss of 0.662 and accuracy of 77.1\\%.",
    year = "2021",
    month = "November",
    day = "15",
    doi = "10.5281/zenodo.5770113",
    language = "English",
    pages = "45--49",
    booktitle = "Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2021)",
    publisher = "DCASE",
    note = "Detection and Classication of Acoustic Scenes and Events ; Conference date: 15-11-2021 Through 19-11-2021"
    }

  • H. Xie and T. Virtanen, "Zero-Shot Audio Classification Via Semantic Embeddings," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, p. 1233–1242, 2021. doi:10.1109/TASLP.2021.3065234
    [BibTeX] [Abstract]

    In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on a small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models. With their hybrid concatenations, the results are improved further.

    @article{2021_c,
    author = "Xie, Huang and Virtanen, Tuomas",
    title = "Zero-Shot Audio Classification Via Semantic Embeddings",
    abstract = "In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on a small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models. With their hybrid concatenations, the results are improved further.",
    keywords = "Audio classification, semantic embedding, zero-shot learning",
    note = "Funding Information: Manuscript received August 5, 2020; revised November 19, 2020 and February 11, 2021; accepted March 3, 2021. Date of publication March 11, 2021; date of current version March 26, 2021. This work was supported by the European Research Council under the European Union{\textquoteright}s H2020 Framework Program through ERC Grant Agreement 637422 EVERYSOUND. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Wenwu Wang. (Corresponding author: Huang Xie.) The authors are with the Faculty of Information Technology, and Communication Sciences, Tampere University, Tampere 33720, Finland (e-mail: huang.xie@tuni.fi; tuomas.virtanen@tuni.fi). Digital Object Identifier 10.1109/TASLP.2021.3065234 Funding Information: This work was supported by the European Research Council under the European Union's H2020 Framework Program through ERC Grant Agreement 637422 EVERYSOUND. Publisher Copyright: {\textcopyright} 2014 IEEE.",
    year = "2021",
    doi = "10.1109/TASLP.2021.3065234",
    language = "English",
    volume = "29",
    pages = "1233--1242",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity"
    }

  • H. Xie, O. Räsänen, and T. Virtanen, "Zero-shot audio classification with factored linear and nonlinear acoustic-semantic projections," in 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, United States, 2021, p. 326–330. doi:10.1109/ICASSP39728.2021.9414994
    [BibTeX] [Abstract]

    In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes. Zero-shot learning in audio classification refers to classification problems that aim at recognizing audio instances of sound classes, which have no available training data but only semantic side information. In this paper, we address zero-shot learning by employing factored linear and nonlinear acoustic-semantic projections. We develop factored linear projections by applying rank decomposition to a bilinear model, and use nonlinear activation functions, such as tanh, to model the non-linearity between acoustic embeddings and semantic embeddings. Compared with the prior bilinear model, experimental results show that the proposed projection methods are effective for improving classification performance of zero-shot learning in audio classification.

    @inproceedings{2021_ICASSP_b,
    author = {Xie, Huang and R{\"a}s{\"a}nen, Okko and Virtanen, Tuomas},
    title = "Zero-shot audio classification with factored linear and nonlinear acoustic-semantic projections",
    abstract = "In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes. Zero-shot learning in audio classification refers to classification problems that aim at recognizing audio instances of sound classes, which have no available training data but only semantic side information. In this paper, we address zero-shot learning by employing factored linear and nonlinear acoustic-semantic projections. We develop factored linear projections by applying rank decomposition to a bilinear model, and use nonlinear activation functions, such as tanh, to model the non-linearity between acoustic embeddings and semantic embeddings. Compared with the prior bilinear model, experimental results show that the proposed projection methods are effective for improving classification performance of zero-shot learning in audio classification.",
    keywords = "Acoustic-semantic projection, Audio classification, Zero-shot learning",
    note = "Funding Information: The research leading to these results has received funding from the European Research Council under the European Unions H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. OR was funded by Academy of Finland grant no. 314602. Publisher Copyright: {\textcopyright} 2021 IEEE JUFOID=57409; IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 01-01-1900 Through 01-01-2000",
    year = "2021",
    doi = "10.1109/ICASSP39728.2021.9414994",
    language = "English",
    isbn = "978-1-7281-7605-5",
    volume = "2021-June",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    publisher = "IEEE",
    pages = "326--330",
    booktitle = "2021 IEEE International Conference on Acoustics, Speech, and Signal Processing",
    address = "United States"
    }

2020

  • M. Airaksinen, O. Räsänen, E. Ilén, T. Häyrinen, A. Kivi, V. Marchi, A. Gallen, S. Blom, A. Varhe, N. Kaartinen, L. Haataja, and S. Vanhatalo, "Automatic Posture and Movement Tracking of Infants with Wearable Movement Sensors," Scientific Reports, vol. 10, iss. 1, 2020. doi:10.1038/s41598-019-56862-5
    [BibTeX] [Abstract]

    Infants' spontaneous and voluntary movements mirror developmental integrity of brain networks since they require coordinated activation of multiple sites in the central nervous system. Accordingly, early detection of infants with atypical motor development holds promise for recognizing those infants who are at risk for a wide range of neurodevelopmental disorders (e.g., cerebral palsy, autism spectrum disorders). Previously, novel wearable technology has shown promise for offering efficient, scalable and automated methods for movement assessment in adults. Here, we describe the development of an infant wearable, a multi-sensor smart jumpsuit that allows mobile accelerometer and gyroscope data collection during movements. Using this suit, we first recorded play sessions of 22 typically developing infants of approximately 7 months of age. These data were manually annotated for infant posture and movement based on video recordings of the sessions, and using a novel annotation scheme specifically designed to assess the overall movement pattern of infants in the given age group. A machine learning algorithm, based on deep convolutional neural networks (CNNs) was then trained for automatic detection of posture and movement classes using the data and annotations. Our experiments show that the setup can be used for quantitative tracking of infant movement activities with a human equivalent accuracy, i.e., it meets the human inter-rater agreement levels in infant posture and movement classification. We also quantify the ambiguity of human observers in analyzing infant movements, and propose a method for utilizing this uncertainty for performance improvements in training of the automated classifier. Comparison of different sensor configurations also shows that four-limb recording leads to the best performance in posture and movement classification.

    @article{2020_c,
    author = {Airaksinen, Manu and R{\"a}s{\"a}nen, Okko and Il{\'e}n, Elina and H{\"a}yrinen, Taru and Kivi, Anna and Marchi, Viviana and Gallen, Anastasia and Blom, Sonja and Varhe, Anni and Kaartinen, Nico and Haataja, Leena and Vanhatalo, Sampsa},
    title = "Automatic Posture and Movement Tracking of Infants with Wearable Movement Sensors",
    abstract = "Infants' spontaneous and voluntary movements mirror developmental integrity of brain networks since they require coordinated activation of multiple sites in the central nervous system. Accordingly, early detection of infants with atypical motor development holds promise for recognizing those infants who are at risk for a wide range of neurodevelopmental disorders (e.g., cerebral palsy, autism spectrum disorders). Previously, novel wearable technology has shown promise for offering efficient, scalable and automated methods for movement assessment in adults. Here, we describe the development of an infant wearable, a multi-sensor smart jumpsuit that allows mobile accelerometer and gyroscope data collection during movements. Using this suit, we first recorded play sessions of 22 typically developing infants of approximately 7 months of age. These data were manually annotated for infant posture and movement based on video recordings of the sessions, and using a novel annotation scheme specifically designed to assess the overall movement pattern of infants in the given age group. A machine learning algorithm, based on deep convolutional neural networks (CNNs) was then trained for automatic detection of posture and movement classes using the data and annotations. Our experiments show that the setup can be used for quantitative tracking of infant movement activities with a human equivalent accuracy, i.e., it meets the human inter-rater agreement levels in infant posture and movement classification. We also quantify the ambiguity of human observers in analyzing infant movements, and propose a method for utilizing this uncertainty for performance improvements in training of the automated classifier. Comparison of different sensor configurations also shows that four-limb recording leads to the best performance in posture and movement classification.",
    note = {EXT={"}Il{\'e}n, Elina{"}},
    year = "2020",
    month = "January",
    day = "13",
    doi = "10.1038/s41598-019-56862-5",
    language = "English",
    volume = "10",
    journal = "Scientific Reports",
    issn = "2045-2322",
    publisher = "Nature Research",
    number = "1"
    }

  • E. Cakir, K. Drossos, and T. Virtanen, "Multi-task Regularization Based on Infrequent Classes for Audio Captioning," in Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020), 2020, p. 6–10.
    [BibTeX] [Abstract] [Download PDF]

    Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, i.e. the function words (e.g. "a", "the"), and other words are infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In this paper we propose two methods to mitigate this class imbalance problem. First, in an autoencoder setting for audio captioning, we weigh each word's contribution to the training loss inversely proportional to its number of occurrences in the whole dataset. Secondly, in addition to multi-class, word-level audio captioning task, we define a multi-label side task based on clip-level content word detection by training a separate decoder. We use the loss from the second task to regularize the jointly trained encoder for the audio captioning task. We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset, and our results show an increase of 37% relative improvement with SPIDEr metric over the baseline method.

    @inproceedings{2020_DCASE2020_a,
    author = "Cakir, Emre and Drossos, Konstantinos and Virtanen, Tuomas",
    editor = "Ono, Nobutaka and Harada, Noboru and Kawaguchi, Yohei and Mesaros, Annamaria and Imoto, Keisuke and Koizumi, Yuma and Komatsu, Tatsuya",
    title = "Multi-task Regularization Based on Infrequent Classes for Audio Captioning",
    abstract = {Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, i.e. the function words (e.g. {"}a{"}, {"}the{"}), and other words are infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In this paper we propose two methods to mitigate this class imbalance problem. First, in an autoencoder setting for audio captioning, we weigh each word's contribution to the training loss inversely proportional to its number of occurrences in the whole dataset. Secondly, in addition to multi-class, word-level audio captioning task, we define a multi-label side task based on clip-level content word detection by training a separate decoder. We use the loss from the second task to regularize the jointly trained encoder for the audio captioning task. We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset, and our results show an increase of 37\\% relative improvement with SPIDEr metric over the baseline method.},
    keywords = "audio captioning, Clotho, multi-task, regularization, content words, infrequent classes",
    year = "2020",
    language = "English",
    pages = "6--10",
    booktitle = "Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020))",
    publisher = "Tokyo Metropolitan University",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2020 ; Conference date: 02-11-2020 Through 03-11-2020",
    url = "http://dcase.community/workshop2020/"
    }

  • M. A. Cruz Blandon and O. Räsänen, "Analysis of Predictive Coding Models for Phonemic Representation Learning in Small Datasets," in International Conference on Machine Learning (ICML), 2020.
    [BibTeX] [Abstract]

    Neural network models using predictive coding are interesting from the viewpoint of computational modelling of human language acquisition, where the objective is to understand how linguistic units could be learned from speech without any labels. Even though several promising predictive coding -based learning algorithms have been proposed in the literature, it is currently unclear how well they generalise to different languages and training dataset sizes. In addition, despite that such models have shown to be effective phonemic feature learners, it is unclear whether minimisation of the predictive loss functions of these models also leads to optimal phoneme-like representations. The present study investigates the behaviour of two predictive coding models, Autoregressive Predictive Coding and Contrastive Predictive Coding, in a phoneme discrimination task (ABX task) for two languages with different dataset sizes. Our experiments show a strong correlation between the autoregressive loss and the phoneme discrimination scores with the two datasets. However, to our surprise, the CPC model shows rapid convergence already after one pass over the training data, and, on average, its representations outperform those of APC on both languages.

    @inproceedings{2020_ICML_a,
    author = {Cruz Blandon, Maria Andrea and R{\"a}s{\"a}nen, Okko},
    title = "Analysis of Predictive Coding Models for Phonemic Representation Learning in Small Datasets",
    abstract = "Neural network models using predictive coding are interesting from the viewpoint of computational modelling of human language acquisition, where the objective is to understand how linguistic units could be learned from speech without any labels. Even though several promising predictive coding -based learning algorithms have been proposed in the literature, it is currently unclear how well they generalise to different languages and training dataset sizes. In addition, despite that such models have shown to be effective phonemic feature learners, it is unclear whether minimisation of the predictive loss functions of these models also leads to optimal phoneme-like representations. The present study investigates the behaviour of two predictive coding models, Autoregressive Predictive Coding and Contrastive Predictive Coding, in a phoneme discrimination task (ABX task) for two languages with different dataset sizes. Our experiments show a strong correlation between the autoregressive loss and the phoneme discrimination scores with the two datasets. However, to our surprise, the CPC model shows rapid convergence already after one pass over the training data, and, on average, its representations outperform those of APC on both languages.",
    year = "2020",
    language = "English",
    booktitle = "International Conference on Machine Learning (ICML)",
    note = "International Conference on Machine Learning ; Conference date: 31-12-1899"
    }

  • S. Djukanović, J. Matas, and T. Virtanen, "Robust Audio-Based Vehicle Counting in Low-to-Moderate Traffic Flow," in 2020 IEEE Intelligent Vehicles Symposium (IV), United States, 2020. doi:10.1109/IV47402.2020.9304600
    [BibTeX] [Abstract]

    The paper presents a method for audio-based vehicle counting (VC) in low-to-moderate traffic using one-channel sound. We formulate VC as a regression problem, i.e., we predict the distance between a vehicle and the microphone. Minima of the proposed distance function correspond to vehicles passing by the microphone. VC is carried out via local minima detection in the predicted distance. We propose to set the minima detection threshold at a point where the probabilities of false positives and false negatives coincide so they statistically cancel each other in total vehicle number. The method is trained and tested on a traffic-monitoring dataset comprising 422 short, 20-second one-channel sound files with a total of 1421 vehicles passing by the microphone. Relative VC error in a traffic location not used in the training is below 2% within a wide range of detection threshold values. Experimental results show that the regression accuracy in noisy environments is improved by introducing a novel high-frequency power feature.

    @inproceedings{2020_IV,
    author = "Djukanovi{\'c}, Slobodan and Matas, Jiri and Virtanen, Tuomas",
    title = "Robust Audio-Based Vehicle Counting in Low-to-Moderate Traffic Flow",
    abstract = "The paper presents a method for audio-based vehicle counting (VC) in low-to-moderate traffic using one-channel sound. We formulate VC as a regression problem, i.e., we predict the distance between a vehicle and the microphone. Minima of the proposed distance function correspond to vehicles passing by the microphone. V C is carried out via local minima detection in the predicted distance. We propose to set the minima detection threshold at a point where the probabilities of false positives and false negatives coincide so they statistically cancel each other in total vehicle number. The method is trained and tested on a traffic-monitoring dataset comprising 422 short, 20-second one-channel sound files with a total of 1421 vehicles passing by the microphone. Relative V C error in a traffic location not used in the training is below 2\\% within a wide range of detection threshold values. Experimental results show that the regression accuracy in noisy environments is improved by introducing a novel high-frequency power feature.",
    year = "2020",
    doi = "10.1109/IV47402.2020.9304600",
    language = "English",
    booktitle = "2020 IEEE Intelligent Vehicles Symposium (IV)",
    publisher = "IEEE",
    address = "United States",
    note = "IEEE Intelligent Vehicles Symposium ; Conference date: 19-10-2020 Through 13-11-2020"
    }

  • K. Drossos, S. I. Mimilakis, S. Gharib, Y. Li, and T. Virtanen, "Sound Event Detection with Depthwise Separable and Dilated Convolutions," in 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1-7. doi:10.1109/IJCNN48605.2020.9207532
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2020_IJCNN,
    author = "Drossos, Konstantinos and Mimilakis, Stylianos I. and Gharib, Shayan and Li, Yanxiong and Virtanen, Tuomas",
    booktitle = "2020 International Joint Conference on Neural Networks (IJCNN)",
    title = "Sound Event Detection with Depthwise Separable and Dilated Convolutions",
    year = "2020",
    pages = "1-7",
    keywords = "Feature extraction;Two dimensional displays;Kernel;Convolutional codes;Context modeling;Event detection;Convolution;sound event detection;depthwise separable convolution;dilated convolution",
    doi = "10.1109/IJCNN48605.2020.9207532",
    url = "https://arxiv.org/abs/2002.00476"
    }

  • K. Drossos, S. Lipping, and T. Virtanen, "Clotho: an Audio Captioning Dataset," in IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020, p. 736–740. doi:10.1109/ICASSP40776.2020.9052990
    [BibTeX] [Abstract] [Download PDF]

    Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.

    @inproceedings{2020_ICASSP_a,
    author = "Drossos, Konstantinos and Lipping, Samuel and Virtanen, Tuomas",
    abstract = "Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.",
    booktitle = "IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
    doi = "10.1109/ICASSP40776.2020.9052990",
    pages = "736--740",
    publisher = "IEEE",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    title = "Clotho: an Audio Captioning Dataset",
    year = "2020",
    url = "https://arxiv.org/abs/1910.09387"
    }

  • X. Favory, K. Drossos, T. Virtanen, and X. Serra, "COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations," in International Conference on Machine Learning (ICML), 2020.
    [BibTeX] [Abstract] [Download PDF]

    Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, and music genre and musical instrument classification), and investigate what type of characteristics the model captures. Our results are promising, sometimes on par with the state-of-the-art in the considered tasks and the embeddings produced with our method are well correlated with some acoustic descriptors.

    @inproceedings{2020_ICML,
    author = "Favory, Xavier and Drossos, Konstantinos and Virtanen, Tuomas and Serra, Xavier",
    abstract = "Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, and music genre and musical instrument classification), and investigate what type of characteristics the model captures. Our results are promising, sometimes in par with the state-of-the-art in the considered tasks and the embeddings produced with our method are well correlated with some acoustic descriptors.",
    booktitle = "International Conference on Machine Learning (ICML)",
    title = "{COALA}: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations",
    year = "2020",
    url = "https://arxiv.org/abs/2006.08386"
    }

  • T. Heittola, A. Mesaros, and T. Virtanen, "Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions," in Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020), 2020, p. 56–60.
    [BibTeX] [Download PDF]
    @inproceedings{2020_DCASE2020_b,
    author = "Heittola, Toni and Mesaros, Annamaria and Virtanen, Tuomas",
    editor = "Ono, Nobutaka and Harada, Noboru and Kawaguchi, Yohei and Mesaros, Annamaria and Imoto, Keisuke and Koizumi, Yuma and Komatsu, Tatsuya",
    title = "Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions",
    year = "2020",
    language = "English",
    pages = "56--60",
    booktitle = "Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020)",
    publisher = "Tokyo Metropolitan University",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2020 ; Conference date: 02-11-2020 Through 03-11-2020",
    url = "http://dcase.community/workshop2020/"
    }

  • A. Kivinummi, G. Naithani, O. Tammela, T. Virtanen, E. Kurkela, M. Alhainen, D. J. H. Niehaus, A. Lachman, J. M. Leppänen, and M. J. Peltola, "Associations Between Neonatal Cry Acoustics and Visual Attention During the First Year," Frontiers in Psychology, vol. 11, 2020. doi:10.3389/fpsyg.2020.577510
    [BibTeX] [Abstract]

    It has been suggested that early cry parameters are connected to later cognitive abilities. The present study is the first to investigate whether the acoustic features of infant cry are associated with cognitive development already during the first year, as measured by oculomotor orienting and attention disengagement. Cry sounds for acoustic analyses (fundamental frequency; F0) were recorded in two neonatal cohorts at the age of 0–8 days (Tampere, Finland) or at 6 weeks (Cape Town, South Africa). Eye tracking was used to measure oculomotor orienting to peripheral visual stimuli and attention disengagement from central stimuli at 8 months (Tampere) or at 6 months (Cape Town) of age. Only a marginal positive correlation between fundamental frequency of cry (F0) and visual attention disengagement was observed in the Tampere cohort, but not in the Cape Town cohort. This correlation indicated that infants from the Tampere cohort with a higher neonatal F0 were marginally slower to shift their gaze away from the central stimulus to the peripheral stimulus. No associations between F0 and oculomotor orienting were observed in either cohort. We discuss possible factors influencing the current pattern of results suggesting a lack of replicable associations between neonatal cry and visual attention and suggest directions for future research investigating the potential of early cry analysis in predicting later cognitive development.

    @article{2020_a,
    author = {Kivinummi, Aicha and Naithani, Gaurav and Tammela, Outi and Virtanen, Tuomas and Kurkela, Enni and Alhainen, Miia and Niehaus, Dana J. H. and Lachman, Anusha and Lepp{\"a}nen, Jukka M. and Peltola, Mikko J.},
    title = "Associations Between Neonatal Cry Acoustics and Visual Attention During the First Year",
    abstract = "It has been suggested that early cry parameters are connected to later cognitive abilities. The present study is the first to investigate whether the acoustic features of infant cry are associated with cognitive development already during the first year, as measured by oculomotor orienting and attention disengagement. Cry sounds for acoustic analyses (fundamental frequency; F0) were recorded in two neonatal cohorts at the age of 0–8 days (Tampere, Finland) or at 6 weeks (Cape Town, South Africa). Eye tracking was used to measure oculomotor orienting to peripheral visual stimuli and attention disengagement from central stimuli at 8 months (Tampere) or at 6 months (Cape Town) of age. Only a marginal positive correlation between fundamental frequency of cry (F0) and visual attention disengagement was observed in the Tampere cohort, but not in the Cape Town cohort. This correlation indicated that infants from the Tampere cohort with a higher neonatal F0 were marginally slower to shift their gaze away from the central stimulus to the peripheral stimulus. No associations between F0 and oculomotor orienting were observed in either cohort. We discuss possible factors influencing the current pattern of results suggesting a lack of replicable associations between neonatal cry and visual attention and suggest directions for future research investigating the potential of early cry analysis in predicting later cognitive development.",
    keywords = "attention, cry, eye tracking, fundamental frequency, infant",
    note = {Funding Information: This manuscript has been released as a pre-print at bioRxiv, https://doi.org/10.1101/658732. Funding. The preparation of the manuscript was supported by grants from the Academy of Finland (\\#258708 to TV and \\#307657 to MP) and the National Research Foundation of South Africa (to DN). Publisher Copyright: {\textcopyright} Copyright {\textcopyright} 2020 Kivinummi, Naithani, Tammela, Virtanen, Kurkela, Alhainen, Niehaus, Lachman, Lepp{\"a}nen and Peltola. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.},
    year = "2020",
    doi = "10.3389/fpsyg.2020.577510",
    language = "English",
    volume = "11",
    journal = "Frontiers in Psychology",
    issn = "1664-1078",
    publisher = "Frontiers Media"
    }

  • Y. Li, M. Liu, K. Drossos, and T. Virtanen, "Sound Event Detection Via Dilated Convolutional Recurrent Neural Networks," in 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020, p. 286–290. doi:10.1109/ICASSP40776.2020.9054433
    [BibTeX] [Abstract] [Download PDF]

    Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9%, 6.3% and 2.5% at F1 score and a maximum decrease of 1.7%, 4.1% and 3.9% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.

    @inproceedings{2020_ICASSP,
    author = "Li, Yanxiong and Liu, Mingle and Drossos, Konstantinos and Virtanen, Tuomas",
    abstract = "Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9\%, 6.3\% and 2.5\% at F1 score and a maximum decrease of 1.7\%, 4.1\% and 3.9\% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.",
    booktitle = "2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
    doi = "10.1109/ICASSP40776.2020.9054433",
    isbn = "978-1-5090-6632-2",
    pages = "286--290",
    publisher = "IEEE",
    series = "IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Sound Event Detection Via Dilated Convolutional Recurrent Neural Networks",
    year = "2020",
    url = "https://arxiv.org/abs/1911.10888"
    }

  • K. MacDonald, O. Räsänen, M. Casillas, and A. Warlaumont, "Measuring prosodic predictability in children's home language environments," in Proc. Annual Meeting of the Cognitive Science Society, 2020.
    [BibTeX]
    @inproceedings{2020_b,
    author = {MacDonald, Kyle and R{\"a}s{\"a}nen, Okko and Casillas, Marisa and Warlaumont, Anne},
    title = "Measuring prosodic predictability in children's home language environments",
    year = "2020",
    language = "English",
    isbn = "9781713818977",
    booktitle = "Proc. Annual Meeting of the Cognitive Science Society",
    publisher = "COGNITIVE SCIENCE SOCIETY",
    note = "Annual Meeting of the Cognitive Science Society ; Conference date: 29-07-2020 Through 01-08-2020"
    }

  • P. Magron and T. Virtanen, "Online Spectrogram Inversion for Low-Latency Audio Source Separation," IEEE Signal Processing Letters, vol. 27, p. 306–310, 2020. doi:10.1109/LSP.2020.2970310
    [BibTeX] [Abstract] [Download PDF]

    Audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a spectrogram inversion algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has been exploited successfully in several recent works. However, this algorithm suffers from two drawbacks, which we address in this letter. First, it has originally been introduced in a heuristic fashion: we propose here a rigorous optimization framework in which MISI is derived, thus proving the convergence of this algorithm. Besides, while MISI operates offline, we propose here an online version of MISI called oMISI, which is suitable for low-latency source separation, an important requirement for e.g., hearing aids applications. oMISI also allows one to use alternative phase initialization schemes exploiting the temporal structure of audio signals. Experiments conducted on a speech separation task show that oMISI performs as well as its offline counterpart, thus demonstrating its potential for real-time source separation.

    @article{2020_SP,
    author = "Magron, Paul and Virtanen, Tuomas",
    abstract = "Audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a spectrogram inversion algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has been exploited successfully in several recent works. However, this algorithm suffers from two drawbacks, which we address in this letter. First, it has originally been introduced in a heuristic fashion: we propose here a rigorous optimization framework in which MISI is derived, thus proving the convergence of this algorithm. Besides, while MISI operates offline, we propose here an online version of MISI called oMISI, which is suitable for low-latency source separation, an important requirement for e.g., hearing aids applications. oMISI also allows one to use alternative phase initialization schemes exploiting the temporal structure of audio signals. Experiments conducted on a speech separation task show that oMISI performs as well as its offline counterpart, thus demonstrating its potential for real-time source separation.",
    doi = "10.1109/LSP.2020.2970310",
    issn = "1070-9908",
    journal = "IEEE Signal Processing Letters",
    keywords = "Audio source separation; low-latency; online spectrogram inversion; phase recovery; sinusoidal modeling",
    pages = "306--310",
    publisher = "Institute of Electrical and Electronics Engineers",
    title = "Online Spectrogram Inversion for Low-Latency Audio Source Separation",
    volume = "27",
    year = "2020",
    url = "https://arxiv.org/abs/1911.03128"
    }

  • A. J. Muñoz-Montoro, A. Politis, K. Drossos, and J. J. Carabias-Orti, "Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CMNMF," in 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), 2020, pp. 1-6. doi:10.1109/MMSP48831.2020.9287068
    [BibTeX] [Abstract]

    This work addresses the problem of multichannel source separation combining two powerful approaches, multichannel spectral factorization with recent monophonic deep-learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser Twin Network (MaD TwinNet), able to model long-term temporal patterns of a musical piece. The monophonic source spectrograms are used within a spatial covariance mixing model based on Complex Non-Negative Matrix Factorization (CNMF) that predicts the spatial characteristics of each source. The proposed framework is evaluated on the task of singing voice separation with a large multichannel dataset. Experimental results show that our joint DL+CNMF method outperforms both the individual monophonic DL-based separation and the multichannel CNMF baseline methods.

    @INPROCEEDINGS{2020_MMSP,
    author = "Muñoz-Montoro, Antonio J. and Politis, Archontis and Drossos, Konstantinos and Carabias-Orti, Julio J.",
    booktitle = "2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)",
    title = "Multichannel Singing Voice Separation by Deep Neural Network Informed {DOA} Constrained {CMNMF}",
    year = "2020",
    pages = "1-6",
    keywords = "Time-frequency analysis;Direction-of-arrival estimation;Estimation;Channel estimation;Information filters;Task analysis;Spectrogram;Multichannel Source Separation;Singing Voice;Deep Learning;CMNMF;Spatial Audio",
    abstract = "This work addresses the problem of multichannel source separation combining two powerful approaches, multichannel spectral factorization with recent monophonic deep-learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser Twin Network (MaD TwinNet), able to model long-term temporal patterns of a musical piece. The monophonic source spectrograms are used within a spatial covariance mixing model based on Complex Non-Negative Matrix Factorization (CNMF) that predicts the spatial characteristics of each source. The proposed framework is evaluated on the task of singing voice separation with a large multichannel dataset. Experimental results show that our joint DL+CNMF method outperforms both the individual monophonic DL-based separation and the multichannel CNMF baseline methods.",
    doi = "10.1109/MMSP48831.2020.9287068"
    }

  • K. Nguyen, K. Drossos, and T. Virtanen, "Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning," in Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020), 2020, p. 110–114.
    [BibTeX] [Abstract] [Download PDF]

    Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics.

    @inproceedings{2020_DCASE2020,
    author = "Nguyen, Khoa and Drossos, Konstantinos and Virtanen, Tuomas",
    editor = "Ono, Nobutaka and Harada, Noboru and Kawaguchi, Yohei and Mesaros, Annamaria and Imoto, Keisuke and Koizumi, Yuma and Komatsu, Tatsuya",
    title = "Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning",
    abstract = "Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics.",
    keywords = "audio captioning, recurrent neural networks, temporal sub-sampling, hierarchical sub-sampling networks",
    year = "2020",
    language = "English",
    pages = "110--114",
    booktitle = "Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020)",
    publisher = "Tokyo Metropolitan University",
    note = "Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2020 ; Conference date: 02-11-2020 Through 03-11-2020",
    url = "http://dcase.community/workshop2020/"
    }

  • P. Pertilä, M. Parviainen, V. Myllylä, A. Huttunen, and P. Jarske, "Time Difference of Arrival Estimation with Deep Learning – From Acoustic Simulations to Recorded Data," in IEEE International Workshop on Multimedia Signal Processing (MMSP), United States, 2020. doi:10.1109/MMSP48831.2020.9287131
    [BibTeX] [Abstract] [Download PDF]

    The spatial information about a sound source is carried by acoustic waves to a microphone array and can be observed through estimation of phase and amplitude differences between microphones. Time difference of arrival (TDoA) captures the propagation delay of the wavefront between microphones and can be used to steer a beamformer or to localize the source. However, reverberation and interference can deteriorate the TDoA estimate. Deep neural networks (DNNs) through supervised learning can extract speech related TDoAs in more adverse conditions than traditional correlation-based methods. Acoustic simulations provide large amounts of data with annotations, while real recordings require manual annotations or the use of reference sensors with proper calibration procedures. The distributions of these two data sources can differ. When a DNN model that is trained using simulated data is presented with real data from a different distribution, its performance decreases if not properly addressed. For the reduction of DNN-based TDoA estimation error, this work investigates the role of different input normalization techniques, mixing of simulated and real data for training, and applying an adversarial domain adaptation technique. Results quantify the reduction in TDoA error for real data using the different approaches. It is evident that the use of normalization methods, domain-adaptation, and real data during training can reduce the TDoA error.

    @inproceedings{2020_MMSP_b,
    author = {Pertil{\"a}, Pasi and Parviainen, Mikko and Myllyl{\"a}, Ville and Huttunen, Anu and Jarske, Petri},
    title = "Time Difference of Arrival Estimation with Deep Learning -- From Acoustic Simulations to Recorded Data",
    abstract = "The spatial information about a sound source is carried by acoustic waves to a microphone array and can be observed through estimation of phase and amplitude differences between microphones. Time difference of arrival (TDoA) captures the propagation delay of the wavefront between microphones and can be used to steer a beamformer or to localize the source. However, reverberation and interference can deteriorate the TDoA estimate. Deep neural networks (DNNs) through supervised learning can extract speech related TDoAs in more adverse conditions than traditional correlation -based methods. Acoustic simulations provide large amounts of data with annotations, while real recordings require manual annotations or the use of reference sensors with proper calibration procedures. The distributions of these two data sources can differ. When a DNN model that is trained using simulated data is presented with real data from a different distribution, its performance decreases if not properly addressed. For the reduction of DNN –based TDoA estimation error, this work investigates the role of different input normalization techniques, mixing of simulated and real data for training, and applying an adversarial domain adaptation technique. Results quantify the reduction in TDoA error for real data using the different approaches. It is evident that the use of normalization methods, domain-adaptation, and real data during training can reduce the TDoA error.",
    note = "jufoid=70574; IEEE International Workshop on Multimedia Signal Processing ; Conference date: 21-09-2020 Through 24-09-2020",
    year = "2020",
    month = "September",
    day = "22",
    doi = "10.1109/MMSP48831.2020.9287131",
    language = "English",
    series = "IEEE International Workshop on Multimedia Signal Processing",
    publisher = "IEEE",
    booktitle = "IEEE International Workshop on Multimedia Signal Processing (MMSP)",
    address = "United States",
    url = "https://attend.ieee.org/mmsp-2020/"
    }

  • A. Politis, S. Adavanne, and T. Virtanen, "A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection," in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2020.
    [BibTeX]
    @inproceedings{2020_DCASE,
    author = "Politis, Archontis and Adavanne, Sharath and Virtanen, Tuomas",
    title = "A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection",
    year = "2020",
    month = "November",
    day = "2",
    language = "English",
    booktitle = "Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)",
    note = "Detection and Classification of Acoustic Scenes and Events Workshop ; Conference date: 01-01-2000"
    }

  • P. Pyykkönen, S. I. Mimilakis, K. Drossos, and T. Virtanen, "Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation," in 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), 2020, pp. 1-6. doi:10.1109/MMSP48831.2020.9287169
    [BibTeX] [Abstract] [Download PDF]

    Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior than other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of the typical convolutions. We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance, by utilizing the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that by replacing RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, 0.37 dB, respectively, while using only 20.57% of the amount of parameters of the RNN architecture.

    @INPROCEEDINGS{2020_MMSP_a,
    author = "Pyykkönen, Pyry and Mimilakis, Styliannos I. and Drossos, Konstantinos and Virtanen, Tuomas",
    booktitle = "2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)",
    title = "Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation",
    year = "2020",
    pages = "1-6",
    keywords = "Measurement;Training;Source separation;Recurrent neural networks;Convolution;Task analysis;Standards;Depthwise separable convolutions;recurrent neural networks;mad;madtwinnet;monaural singing voice separation",
    abstract = "Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior than other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of the typical convolutions. We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance, by utilizing the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that by replacing RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, 0.37 dB, respectively, while using only 20.57{\\%} of the amount of parameters of the RNN architecture.",
    doi = "10.1109/MMSP48831.2020.9287169",
    url = "https://arxiv.org/abs/2007.02683"
    }

  • O. Räsänen and M. Cruz Blandon, "Unsupervised Discovery of Recurring Speech Patterns using Probabilistic Adaptive Metrics," in Proceedings of the Annual Conference of the International Speech Communication Association, 2020, p. 4871–4875. doi:10.21437/Interspeech.2020-1738
    [BibTeX] [Abstract] [Download PDF]

    Unsupervised spoken term discovery (UTD) aims at finding recurring segments of speech from a corpus of acoustic speech data. One potential approach to this problem is to use dynamic time warping (DTW) to find well-aligning patterns from the speech data. However, automatic selection of initial candidate segments for the DTW-alignment and detection of “sufficiently good” alignments among those require some type of predefined criteria, often operationalized as threshold parameters for pair-wise distance metrics between signal representations. In the existing UTD systems, the optimal hyperparameters may differ across datasets, limiting their applicability to new corpora and truly low-resource scenarios. In this paper, we propose a novel probabilistic approach to DTW-based UTD named as PDTW. In PDTW, distributional characteristics of the processed corpus are utilized for adaptive evaluation of alignment quality, thereby enabling systematic discovery of pattern pairs that have similarity beyond what would be expected by coincidence. We test PDTW on Zero Resource Speech Challenge 2017 datasets as a part of 2020 implementation of the challenge. The results show that the system performs consistently on all five tested languages using fixed hyperparameters, clearly outperforming the earlier DTW-based system in terms of coverage of the detected patterns.

    @inproceedings{2020_InterSpecch,
    author = {R{\"a}s{\"a}nen, Okko and Cruz Blandon, Maria},
    title = "Unsupervised Discovery of Recurring Speech Patterns using Probabilistic Adaptive Metrics",
    abstract = "Unsupervised spoken term discovery (UTD) aims at finding recurring segments of speech from a corpus of acoustic speech data. One potential approach to this problem is to use dynamic time warping (DTW) to find well-aligning patterns from the speech data. However, automatic selection of initial candidate segments for the DTW-alignment and detection of “sufficiently good” alignments among those require some type of predefined criteria, often operationalized as threshold parameters for pair-wise distance metrics between signal representations. In the existing UTD systems, the optimal hyperparameters may differ across datasets, limiting their applicability to new corpora and truly low-resource scenarios. In this paper, we propose a novel probabilistic approach to DTW-based UTD named as PDTW. In PDTW, distributional characteristics of the processed corpus are utilized for adaptive evaluation of alignment quality, thereby enabling systematic discovery of pattern pairs that have similarity what would be expected by coincidence. We test PDTW on Zero Resource Speech Challenge 2017 datasets as a part of 2020 implementation of the challenge. The results show that the system performs consistently on all five tested languages using fixed hyperparameters, clearly outperforming the earlier DTW-based system in terms of coverage of the detected patterns.",
    note = "jufoid=59094; INTERSPEECH ; Conference date: 14-09-2020 Through 18-09-2020",
    year = "2020",
    doi = "10.21437/Interspeech.2020-1738",
    language = "English",
    series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
    publisher = "International Speech Communication Association ISCA",
    pages = "4871--4875",
    booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association",
    url = "http://www.interspeech2020.org/"
    }

  • S. Zhao, T. Heittola, and T. Virtanen, "Active Learning for Sound Event Detection," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, p. 2895–2905, 2020. doi:10.1109/TASLP.2020.3029652
    [BibTeX] [Abstract]

    This article proposes an active learning system for sound event detection (SED). It aims at maximizing the accuracy of a learned SED model with limited annotation effort. The proposed system analyzes an initially unlabeled audio dataset, from which it selects sound segments for manual annotation. The candidate segments are generated based on a proposed change point detection approach, and the selection is based on the principle of mismatch-first farthest-traversal. During the training of SED models, recordings are used as training inputs, preserving the long-term context for annotated segments. The proposed system clearly outperforms reference methods in the two datasets used for evaluation (TUT Rare Sound 2017 and TAU Spatial Sound 2019). Training with recordings as context outperforms training with only annotated segments. Mismatch-first farthest-traversal outperforms reference sample selection methods based on random sampling and uncertainty sampling. Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare: by annotating only 2% of the training data, the achieved SED performance is similar to annotating all the training data.

    @article{2020,
    author = "Zhao, Shuyang and Heittola, Toni and Virtanen, Tuomas",
    title = "Active Learning for Sound Event Detection",
    abstract = "This article proposes an active learning system for sound event detection (SED). It aims at maximizing the accuracy of a learned SED model with limited annotation effort. The proposed system analyzes an initially unlabeled audio dataset, from which it selects sound segments for manual annotation. The candidate segments are generated based on a proposed change point detection approach, and the selection is based on the principle of mismatch-first farthest-traversal. During the training of SED models, recordings are used as training inputs, preserving the long-term context for annotated segments. The proposed system clearly outperforms reference methods in the two datasets used for evaluation (TUT Rare Sound 2017 and TAU Spatial Sound 2019). Training with recordings as context outperforms training with only annotated segments. Mismatch-first farthest-traversal outperforms reference sample selection methods based on random sampling and uncertainty sampling. Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare: by annotating only 2\\% of the training data, the achieved SED performance is similar to annotating all the training data.",
    keywords = "Active learning, change point detection, mismatch-first farthest-traversal, sound event detection, weakly supervised learning",
    note = "Funding Information: Manuscript received February 12, 2020; revised July 3, 2020 and August 6, 2020; accepted September 3, 2020. Date of publication October 8, 2020; date of current version November 5, 2020. This work was supported by the European Research Council under the European Unions H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Isabel Barbancho. (Corresponding author: Shuyang Zhao.) The authors are with the Faculty of Information Technology and Communication Sciences, Tampere University, 33720 Tampere, Finland (e-mail: shuyang.zhao@tuni.fi; toni.heittola@tuni.fi; tuomas.virtanen@tuni.fi). Digital Object Identifier 10.1109/TASLP.2020.3029652 Publisher Copyright: {\textcopyright} 2014 IEEE. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.",
    year = "2020",
    doi = "10.1109/TASLP.2020.3029652",
    language = "English",
    volume = "28",
    pages = "2895--2905",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity"
    }

2019

  • S. Adavanne, A. Politis, and T. Virtanen, "Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019, p. 20–24.
    [BibTeX] [Abstract] [Download PDF]

    This paper investigates the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN). We use a CRNN previously proposed for the localization and detection of stationary sources, and show that the recurrent layers enable the spatial tracking of moving sources when trained with dynamic scenes. The tracking performance of the CRNN is compared with a stand-alone tracking method that combines a multi-source direction-of-arrival (DOA) estimator and a particle filter. Their respective performance is evaluated in various acoustic conditions such as anechoic and reverberant scenarios, stationary and moving sources at several angular velocities, and with a varying number of overlapping sources. The results show that the CRNN manages to track multiple sources more consistently than the parametric method across acoustic scenarios, but at the cost of higher localization error.

    @inproceedings{2019_DCASE2019_a,
    author = "Adavanne, Sharath and Politis, Archontis and Virtanen, Tuomas",
    abstract = "This paper investigates the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN). We use a CRNN previously proposed for the localization and detection of stationary sources, and show that the recurrent layers enable the spatial tracking of moving sources when trained with dynamic scenes. The tracking performance of the CRNN is compared with a stand-alone tracking method that combines a multi-source (DOA) estimator and a particle filter. Their respective performance is evaluated in various acoustic conditions such as anechoic and reverberant scenarios, stationary and moving sources at several angular velocities, and with a varying number of overlapping sources. The results show that the CRNN manages to track multiple sources more consistently than the parametric method across acoustic scenarios, but at the cost of higher localization error.",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)",
    month = "10",
    pages = "20--24",
    title = "Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network",
    year = "2019",
    url = "https://arxiv.org/abs/1904.12769"
    }

  • S. Adavanne, A. Politis, and T. Virtanen, "A Multi-room Reverberant Dataset for Sound Event Localization and Detection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019, p. 10–14.
    [BibTeX] [Abstract] [Download PDF]

    This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset where each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.

    @inproceedings{2019_DCASE2019_c,
    author = "Adavanne, Sharath and Politis, Archontis and Virtanen, Tuomas",
    abstract = "This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset where each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)",
    month = "10",
    pages = "10--14",
    title = "A Multi-room Reverberant Dataset for Sound Event Localization and Detection",
    year = "2019",
    url = "https://arxiv.org/abs/1905.08546"
    }

  • S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, iss. 1, p. 34–48, 2019. doi:10.1109/JSTSP.2018.2885636
    [BibTeX] [Abstract]

    In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

    @article{2019_JSTSP_a,
    author = "Adavanne, Sharath and Politis, Archontis and Nikunen, Joonas and Virtanen, Tuomas",
    title = "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks",
    abstract = "In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.",
    keywords = "Direction-of-arrival estimation, Estimation, Task analysis, Azimuth, Microphone arrays, Recurrent neural networks, Sound event detection, direction of arrival estimation, convolutional recurrent neural network",
    note = {EXT={"}Politis, Archontis{"}},
    year = "2019",
    month = "March",
    doi = "10.1109/JSTSP.2018.2885636",
    language = "English",
    volume = "13",
    pages = "34--48",
    journal = "IEEE Journal of Selected Topics in Signal Processing",
    issn = "1932-4553",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "1"
    }
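
    The localization branch described above regresses 3D Cartesian DOA coordinates per class; for reference, the standard conversion between an (azimuth, elevation) pair and a unit Cartesian vector is shown below (a generic textbook formula, not code from the paper):

    import numpy as np

    def angles_to_cartesian(azimuth_deg, elevation_deg):
        """Convert azimuth/elevation in degrees to a unit vector on the sphere."""
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        return np.array([np.cos(el) * np.cos(az),
                         np.cos(el) * np.sin(az),
                         np.sin(el)])

    def cartesian_to_angles(v):
        """Inverse mapping back to azimuth/elevation in degrees."""
        x, y, z = v / np.linalg.norm(v)
        return np.degrees(np.arctan2(y, x)), np.degrees(np.arcsin(z))

    print(cartesian_to_angles(angles_to_cartesian(30.0, 45.0)))   # approximately (30.0, 45.0)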

  • M. N. I. Ahsan, C. Kertesz, A. Mesaros, T. Heittola, A. Knight, and T. Virtanen, "Audio-Based Epileptic Seizure Detection," in 2019 27th European Signal Processing Conference (EUSIPCO), 2019. doi:10.23919/EUSIPCO.2019.8902840
    [BibTeX] [Abstract] [Download PDF]

    This paper investigates automatic epileptic seizure detection from audio recordings using convolutional neural networks. The labeling and analysis of seizure events are necessary in the medical field for patient monitoring, but the manual annotation by expert annotators is time-consuming and extremely monotonous. The proposed method treats all seizure vocalizations as a single target event class, and models the seizure detection problem in terms of detecting the target vs non-target classes. For detection, the method employs a convolutional neural network trained to detect the seizure events in short time segments, based on mel-energies as feature representation. Experiments carried out with different seizure types on 900 hours of audio recordings from 40 patients show that the proposed approach can detect seizures with over 80% accuracy, with a 13% false positive rate and a 22.8% false negative rate.

    @inproceedings{2019_EUSIPCO,
    author = "Ahsan, {M. N. Istiaq} and Kertesz, C. and Mesaros, A. and Heittola, T. and Knight, A. and Virtanen, T.",
    abstract = "This paper investigates automatic epileptic seizure detection from audio recordings using convolutional neural networks. The labeling and analysis of seizure events are necessary in the medical field for patient monitoring, but the manual annotation by expert annotators is time-consuming and extremely monotonous. The proposed method treats all seizure vocalizations as a single target event class, and models the seizure detection problem in terms of detecting the target vs non-target classes. For detection, the method employs a convolutional neural network trained to detect the seizure events in short time segments, based on mel-energies as feature representation. Experiments carried out with different seizure types on 900 hours of audio recordings from 40 patients show that the proposed approach can detect seizures with over 80{\\%} accuracy, with a 13{\\%} false positive rate and a 22.8{\\%} false negative rate.",
    booktitle = "2019 27th European Signal Processing Conference (EUSIPCO)",
    doi = "10.23919/EUSIPCO.2019.8902840",
    isbn = "978-1-5386-7300-3",
    keywords = "Epileptic seizure detection; convolutional neural network (CNN); sound event detection; audio processing and analysis.",
    month = "9",
    publisher = "IEEE",
    series = "European Signal Processing Conference",
    title = "Audio-Based Epileptic Seizure Detection",
    year = "2019",
    URL = "https://homepages.tuni.fi/tuomas.virtanen/papers/Audio\_based\_epileptic\_seizure\_detection\_EUSIPCO\_19.pdf"
    }
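
    The detector above operates on mel-energies computed from short audio segments; a minimal feature-extraction sketch using librosa is given below (the sampling rate, frame settings, and number of mel bands are illustrative guesses, not the study's configuration):

    import numpy as np
    import librosa

    def log_mel_energies(path, sr=16000, n_fft=1024, hop_length=512, n_mels=40):
        """Log mel-band energies for one audio file (parameter values are placeholders)."""
        y, sr = librosa.load(path, sr=sr, mono=True)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)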

  • M. Airaksinen, L. Juvela, P. Alku, and O. Räsänen, "Data augmentation strategies for neural network F0 estimation," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, p. 6485–6489. doi:10.1109/ICASSP.2019.8683041
    [BibTeX]
    @inproceedings{2019_ICASSP_c,
    author = {Airaksinen, Manu and Juvela, Lauri and Alku, Paavo and R{\"a}s{\"a}nen, Okko},
    title = "Data augmentation strategies for neural network F0 estimation",
    year = "2019",
    doi = "10.1109/ICASSP.2019.8683041",
    language = "English",
    isbn = "978-1-4799-8131-1",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    pages = "6485--6489",
    booktitle = "ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    note = "IEEE International Conference on Acoustics, Speech, and Signal Processing ; Conference date: 12-05-2019 Through 19-05-2019"
    }

  • H. L. Bear, T. Heittola, A. Mesaros, E. Benetos, and T. Virtanen, "City Classification from Multiple Real-World Sound Scenes," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, p. 11–15. doi:10.1109/WASPAA.2019.8937271
    [BibTeX] [Abstract] [Download PDF]

    The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, they inherently ignore some subtleties of the real-world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like `park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification to ask whether we can recognize a city from a set of sound scenes? In this problem each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve accuracy of 50%. This is less than the acoustic scene classification task baseline in the DCASE 2018 ASC challenge on the same data. With a simple adaptation to the class labels of pairing city labels with grouped scenes, accuracy increases to 52%, closer to the simpler scene classification task. Finally we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.

    @inproceedings{2019_WASPAA_a,
    author = "Bear, Helen L. and Heittola, Toni and Mesaros, Annamaria and Benetos, Emmanouil and Virtanen, Tuomas",
    abstract = "The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, they inherently ignore some subtleties of the real-world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like `park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification to ask whether we can recognize a city from a set of sound scenes? In this problem each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve accuracy of 50\%. This is less than the acoustic scene classification task baseline in the DCASE 2018 ASC challenge on the same data. A simple adaptation to the class labels of pairing city labels with grouped scenes, accuracy increases to 52\%, closer to the simpler scene classification task. Finally we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56\%, outperforming the aforementioned approaches.",
    booktitle = "2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    doi = "10.1109/WASPAA.2019.8937271",
    isbn = "978-1-7281-1124-7",
    keywords = "Acoustic scene classification; location identification; city classification; computational sound scene analysis",
    month = "10",
    pages = "11--15",
    publisher = "IEEE",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "City Classification from Multiple Real-World Sound Scenes",
    year = "2019",
    url = "https://arxiv.org/abs/1905.00979"
    }

  • A. Diment, E. Fagerlund, A. Benfield, and T. Virtanen, "Detection of Typical Pronunciation Errors in Non-native English Speech Using Convolutional Recurrent Neural Networks," in 2019 International Joint Conference on Neural Networks, IJCNN 2019, 2019. doi:10.1109/IJCNN.2019.8851963
    [BibTeX] [Abstract] [Download PDF]

    A machine learning method for the automatic detection of pronunciation errors made by non-native speakers of English is proposed. It consists of training word-specific binary classifiers on a collected dataset of isolated words with possible pronunciation errors, typical for Finnish native speakers. The classifiers predict whether the typical error is present in the given word utterance. They operate on sequences of acoustic features, extracted from consecutive frames of an audio recording of a word utterance. The proposed architecture includes a convolutional neural network, a recurrent neural network, or a combination of the two. The optimal topology and hyperpa-rameters are obtained in a Bayesian optimisation setting using a tree-structured Parzen estimator. A dataset of 80 words uttered naturally by 120 speakers is collected. The performance of the proposed system, evaluated on a well-represented subset of the dataset, shows that it is capable of detecting pronunciation errors in most of the words (46/49) with high accuracy (mean accuracy gain over the zero rule 12.21 percent points).

    @inproceedings{2019_IJCNN,
    author = "Diment, Aleksandr and Fagerlund, Eemi and Benfield, Adrian and Virtanen, Tuomas",
    abstract = "A machine learning method for the automatic detection of pronunciation errors made by non-native speakers of English is proposed. It consists of training word-specific binary classifiers on a collected dataset of isolated words with possible pronunciation errors, typical for Finnish native speakers. The classifiers predict whether the typical error is present in the given word utterance. They operate on sequences of acoustic features, extracted from consecutive frames of an audio recording of a word utterance. The proposed architecture includes a convolutional neural network, a recurrent neural network, or a combination of the two. The optimal topology and hyperpa-rameters are obtained in a Bayesian optimisation setting using a tree-structured Parzen estimator. A dataset of 80 words uttered naturally by 120 speakers is collected. The performance of the proposed system, evaluated on a well-represented subset of the dataset, shows that it is capable of detecting pronunciation errors in most of the words (46/49) with high accuracy (mean accuracy gain over the zero rule 12.21 percent points).",
    booktitle = "2019 International Joint Conference on Neural Networks, IJCNN 2019",
    day = "1",
    doi = "10.1109/IJCNN.2019.8851963",
    keywords = "Computer-assisted language learning; computer-assisted pronunciation training CNN; CRNN; GRU; pronunciation learning",
    month = "7",
    publisher = "IEEE",
    title = "Detection of Typical Pronunciation Errors in Non-native English Speech Using Convolutional Recurrent Neural Networks",
    year = "2019",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Diment19\_PL.pdf"
    }
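
    The abstract above tunes topology and hyperparameters with a tree-structured Parzen estimator; a minimal example of such a search with the hyperopt library is sketched below (the search space and the dummy objective are invented for illustration and are not the paper's):

    from hyperopt import fmin, tpe, hp, Trials

    # hypothetical search space: learning rate and number of recurrent units
    space = {
        "lr": hp.loguniform("lr", -9.2, -4.6),        # roughly 1e-4 .. 1e-2
        "units": hp.choice("units", [32, 64, 128]),
    }

    def objective(params):
        # stand-in for "train the word-specific classifier and return its validation error"
        return (params["lr"] - 1e-3) ** 2 + 1.0 / params["units"]

    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=50, trials=Trials())
    print(best)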

  • K. Drossos, P. Magron, and T. Virtanen, "Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019. doi:10.1109/WASPAA.2019.8937231
    [BibTeX] [Abstract] [Download PDF]

    A challenging problem in deep learning-based machine listening field is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.

    @inproceedings{2019_WASPAA_c,
    author = "Drossos, Konstantinos and Magron, Paul and Virtanen, Tuomas",
    abstract = "A challenging problem in deep learning-based machine listening field is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32{\\%} to 45{\\%}, using the TUT Acoustic Scenes dataset.",
    booktitle = "2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    day = "22",
    doi = "10.1109/WASPAA.2019.8937231",
    isbn = "978-1-7281-1124-7",
    keywords = "Acoustic scene classification; unsupervised domain adaptation; Wasserstein distance; adversarial training",
    month = "10",
    publisher = "IEEE",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification",
    year = "2019",
    url = "https://arxiv.org/abs/1904.10678"
    }
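
    The adaptation above is driven by a Wasserstein-distance critic; the core adversarial losses of such a setup are sketched below in PyTorch (the toy networks, random features, and the omission of a gradient penalty or weight clipping are simplifications, not the authors' model):

    import torch
    import torch.nn as nn

    feat = nn.Sequential(nn.Linear(64, 32), nn.ReLU())    # toy feature extractor
    critic = nn.Sequential(nn.Linear(32, 1))               # toy domain critic

    src = torch.randn(16, 64)   # features from the seen recording device (source domain)
    tgt = torch.randn(16, 64)   # features from the unseen recording device (target domain)

    # critic step: minimizing this trains the critic to score source above target,
    # i.e. to estimate the Wasserstein distance between the two feature distributions
    critic_loss = critic(feat(tgt)).mean() - critic(feat(src)).mean()

    # adaptation step: minimizing this updates the feature extractor so that
    # target features obtain source-like critic scores, shrinking the distance
    adapt_loss = -critic(feat(tgt)).mean()
    print(float(critic_loss), float(adapt_loss))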

  • K. Drossos, S. Gharib, P. Magron, and T. Virtanen, "Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
    [BibTeX] [Abstract] [Download PDF]

    A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets; the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 9% and 2% at the F1 (higher is better) and a decrease of 7% and 2% at ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. On the contrary, with our method there is a decrease of 4% at F1 score and an increase of 7% at ER for the TUT-SED Synthetic 2016 dataset.

    @inproceedings{2019_DCASE2019_b,
    author = "Drossos, Konstantinos and Gharib, Shayan and Magron, Paul and Virtanen, Tuomas",
    abstract = {A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a {"}car horn{"} will likely be followed by a {"}car passing by{"}. While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets; the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 9{\\%} and 2{\\%} at the F1 (higher is better) and a decrease of 7{\\%} and 2{\\%} at ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. On the contrary, with our method there is a decrease of 4{\\%} at F1 score and an increase of 7{\\%} at ER for the TUT-SED Synthetic 2016 dataset.},
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)",
    day = "25",
    keywords = "sound event detection; language modelling; sequence modelling; teacher forcing; scheduled sampling",
    month = "10",
    title = "Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling",
    year = "2019",
    url = "https://arxiv.org/abs/1907.08506"
    }
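
    The conditioning scheme above feeds the previous frame's class activities back to the RNN, switching between ground truth (teacher forcing) and the model's own predictions (scheduled sampling); the switching rule itself is tiny and is sketched below (the decaying schedule and the 0.5 threshold are arbitrary examples):

    import numpy as np

    def previous_activities(y_true_prev, y_pred_prev, teacher_forcing_prob, rng):
        """Conditioning vector for the current frame: ground-truth activities with
        probability p, otherwise the binarized prediction from the previous frame."""
        if rng.random() < teacher_forcing_prob:
            return y_true_prev
        return (y_pred_prev > 0.5).astype(float)

    rng = np.random.default_rng(0)
    for epoch in range(5):
        p = max(0.0, 1.0 - 0.2 * epoch)   # example schedule: decay teacher forcing over epochs
        cond = previous_activities(np.array([1.0, 0.0, 1.0]), np.array([0.8, 0.1, 0.4]), p, rng)
        print(epoch, p, cond)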

  • V. M. Garcia-Molla, P. S. Juan, T. Virtanen, A. M. Vidal, and P. Alonso, "Generalization of the K-SVD algorithm for minimization of β-divergence," Digital Signal Processing, vol. 92, p. 47–53, 2019. doi:10.1016/j.dsp.2019.05.001
    [BibTeX] [Abstract]

    In this paper, we propose, describe, and test a modification of the K-SVD algorithm. Given a set of training data, the proposed algorithm computes an overcomplete dictionary by minimizing the β-divergence (β>=1) between the data and its representation as linear combinations of atoms of the dictionary, under strict sparsity restrictions. For the special case β=2, the proposed algorithm minimizes the Frobenius norm and, therefore, for β=2 the proposed algorithm is equivalent to the original K-SVD algorithm. We describe the modifications needed and discuss the possible shortcomings of the new algorithm. The algorithm is tested with random matrices and with an example based on speech separation.

    @article{2019_DSP,
    author = "Garcia-Molla, Victor M. and Juan, Pablo San and Virtanen, Tuomas and Vidal, Antonio M. and Alonso, Pedro",
    abstract = "In this paper, we propose, describe, and test a modification of the K-SVD algorithm. Given a set of training data, the proposed algorithm computes an overcomplete dictionary by minimizing the β-divergence (β>=1) between the data and its representation as linear combinations of atoms of the dictionary, under strict sparsity restrictions. For the special case β=2, the proposed algorithm minimizes the Frobenius norm and, therefore, for β=2 the proposed algorithm is equivalent to the original K-SVD algorithm. We describe the modifications needed and discuss the possible shortcomings of the new algorithm. The algorithm is tested with random matrices and with an example based on speech separation.",
    day = "1",
    doi = "10.1016/j.dsp.2019.05.001",
    issn = "1051-2004",
    journal = "Digital Signal Processing",
    keywords = "Beta-divergence;K-SVD;Matching pursuit algorithms;NMF;Nonnegative K-SVD",
    month = "9",
    pages = "47--53",
    publisher = "Elsevier",
    title = "Generalization of the {K}-{SVD} algorithm for minimization of β-divergence",
    volume = "92",
    year = "2019"
    }
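
    For reference, the element-wise β-divergence minimized above is commonly written as follows (a standard textbook form; β = 2 recovers the squared Euclidean cost of the original K-SVD and β = 1 the generalized Kullback-Leibler divergence, while the paper restricts itself to β ≥ 1):

    \[
    d_\beta(x \mid y) =
    \begin{cases}
    \dfrac{1}{\beta(\beta-1)}\left(x^{\beta} + (\beta-1)\,y^{\beta} - \beta\, x\, y^{\beta-1}\right), & \beta \neq 0, 1,\\[2ex]
    x \log \dfrac{x}{y} - x + y, & \beta = 1,\\[1ex]
    \dfrac{x}{y} - \log \dfrac{x}{y} - 1, & \beta = 0.
    \end{cases}
    \]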

  • M. C. Green, S. Adavanne, D. Murphy, and T. Virtanen, "Acoustic Scene Classification Using Higher-Order Ambisonic Features," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, p. 328–332. doi:10.1109/WASPAA.2019.8937282
    [BibTeX] [Abstract]

    This paper investigates the potential of using higher-order Ambisonic features to perform acoustic scene classification. We compare the performance of systems trained using first-order and fourth-order spatial features extracted from the EigenScape database. Using both Gaussian mixture model and convolutional neural network classifiers, we show that features extracted from higher-order Ambisonics can yield increased classification accuracies relative to first-order features. Diffuseness-based features seem to describe scenes particularly well relative to direction-of-arrival based features. With specific feature subsets, however, differences in classification accuracy between first and fourth-order features become negligible.

    @inproceedings{2019_WASPAA,
    author = "Green, {Marc C.} and Adavanne, Sharath and Murphy, Damian and Virtanen, Tuomas",
    abstract = "This paper investigates the potential of using higher-order Ambisonic features to perform acoustic scene classification. We compare the performance of systems trained using first-order and fourth-order spatial features extracted from the EigenScape database. Using both Gaussian mixture model and convolutional neural network classifiers, we show that features extracted from higher-order Ambisonics can yield increased classification accuracies relative to first-order features. Diffuseness-based features seem to describe scenes particularly well relative to direction-of-arrival based features. With specific feature subsets, however, differences in classification accuracy between first and fourth-order features become negligible.",
    booktitle = "2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    doi = "10.1109/WASPAA.2019.8937282",
    isbn = "978-1-7281-1124-7",
    keywords = "acoustic scene classification; ambisonics; spatial audio; convolutional neural networks; gaussian mixture models",
    month = "10",
    pages = "328--332",
    publisher = "IEEE",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Acoustic Scene Classification Using Higher-Order Ambisonic Features",
    year = "2019"
    }

  • S. Lipping, K. Drossos, and T. Virtanen, "Crowdsourcing a Dataset of Audio Captions," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
    [BibTeX] [Abstract] [Download PDF]

    Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.

    @inproceedings{2019_DCASE2019,
    author = "Lipping, Samuel and Drossos, Konstantinos and Virtanen, Tuomas",
    abstract = {Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. {"}people talking in a big room{"}). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.},
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)",
    day = "26",
    keywords = "audio captioning; captioning; amt; crowdsourcing; Amazon Mechanical Turk",
    month = "10",
    title = "Crowdsourcing a Dataset of Audio Captions",
    year = "2019",
    url = "https://arxiv.org/abs/1907.09238"
    }
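
    The diversity figure quoted above is a Jaccard similarity between the word sets of two captions; the computation is a one-liner, sketched below (the toy captions are invented, and the word stemming implied in the abstract is omitted). Two ten-word captions sharing four words indeed give 4 / 16 = 0.25, close to the reported 0.24:

    def jaccard(caption_a, caption_b):
        """Jaccard similarity between the word sets of two captions."""
        a, b = set(caption_a.lower().split()), set(caption_b.lower().split())
        return len(a & b) / len(a | b)

    print(jaccard("people talking in a big room",
                  "a crowd of people is chatting in a large hall"))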

  • P. Maijala, T. Heittola, and T. Virtanen, "Ympäristömelun mittaaminen käyttäen automaattista lähteiden tunnistusta" [Measuring environmental noise using automatic source recognition], in Akustiikkapäivät 2019, 2019, p. 196–205.
    [BibTeX]
    @inproceedings{2019_b,
    author = "Maijala, Panu and Heittola, Toni and Virtanen, Tuomas",
    title = {Ymp{\"a}rist{\"o}melun mittaaminen k{\"a}ytt{\"a}en automaattista l{\"a}hteiden tunnistusta},
    year = "2019",
    language = "Suomi",
    series = {Akustiikkap{\"a}iv{\"a}t},
    publisher = "Akustinen seura",
    pages = "196--205",
    booktitle = {Akustiikkap{\"a}iv{\"a}t 2019},
    note = {AKUSTIIKKAP{\"A}IV{\"A}T ; Conference date: 01-01-1900}
    }

  • I. Martín-Morató, A. Mesaros, T. Heittola, T. Virtanen, M. Cobos, and F. J. Ferri, "Sound Event Envelope Estimation in Polyphonic Mixtures," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, p. 935–939. doi:10.1109/ICASSP.2019.8682858
    [BibTeX] [Abstract] [Download PDF]

    Sound event detection is the task of identifying automatically the presence and temporal boundaries of sound events within an input audio stream. In the last years, deep learning methods have established themselves as the state-of-the-art approach for the task, using binary indicators during training to denote whether an event is active or inactive. However, such binary activity indicators do not fully describe the events, and estimating the envelope of the sounds could provide more precise modeling of their activity. This paper proposes to estimate the amplitude envelopes of target sound event classes in polyphonic mixtures. For training, we use the amplitude envelopes of the target sounds, calculated from mixture signals and, for comparison, from their isolated counterparts. The model is then used to perform envelope estimation and sound event detection. Results show that the envelope estimation allows good modeling of the sounds activity, with detection results comparable to current state-of-the art.

    @inproceedings{2019_ICASSP_a,
    author = "Mart{\'i}n-Morat{\'o}, Irene and Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas and Cobos, Maximo and Ferri, {Francesc J.}",
    abstract = "Sound event detection is the task of identifying automatically the presence and temporal boundaries of sound events within an input audio stream. In the last years, deep learning methods have established themselves as the state-of-the-art approach for the task, using binary indicators during training to denote whether an event is active or inactive. However, such binary activity indicators do not fully describe the events, and estimating the envelope of the sounds could provide more precise modeling of their activity. This paper proposes to estimate the amplitude envelopes of target sound event classes in polyphonic mixtures. For training, we use the amplitude envelopes of the target sounds, calculated from mixture signals and, for comparison, from their isolated counterparts. The model is then used to perform envelope estimation and sound event detection. Results show that the envelope estimation allows good modeling of the sounds activity, with detection results comparable to current state-of-the art.",
    booktitle = "ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    day = "17",
    doi = "10.1109/ICASSP.2019.8682858",
    isbn = "978-1-4799-8132-8",
    keywords = "acoustic signal detection; acoustic signal processing; learning (artificial intelligence); sound event envelope estimation; polyphonic mixtures; sound event detection; input audio stream; deep learning methods; binary activity indicators; amplitude envelop",
    month = "4",
    pages = "935--939",
    publisher = "IEEE",
    series = "IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Sound Event Envelope Estimation in Polyphonic Mixtures",
    year = "2019",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/morato-sound\_event\_envelope\_estimation\_icassp2019.pdf"
    }

  • A. Mesaros, S. Adavanne, A. Politis, T. Heittola, and T. Virtanen, "Joint Measurement of Localization and Detection of Sound Events," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, p. 333–337. doi:10.1109/WASPAA.2019.8937220
    [BibTeX] [Abstract] [Download PDF]

    Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent development of sound event detection methods approach also the localization side, but lack a consistent way of measuring the joint performance of the system; instead, they measure the separate abilities for detection and for localization. This paper proposes augmentation of the localization metrics with a condition related to the detection, and conversely, use of location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. The comparison to the detection only and localization only performance shows that the proposed joint metrics operate in a consistent and logical manner, and characterize adequately both aspects.

    @inproceedings{2019_WASPAA_b,
    author = "Mesaros, Annamaria and Adavanne, Sharath and Politis, Archontis and Heittola, Toni and Virtanen, Tuomas",
    abstract = "Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent development of sound event detection methods approach also the localization side, but lack a consistent way of measuring the joint performance of the system; instead, they measure the separate abilities for detection and for localization. This paper proposes augmentation of the localization metrics with a condition related to the detection, and conversely, use of location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. The comparison to the detection only and localization only performance shows that the proposed joint metrics operate in a consistent and logical manner, and characterize adequately both aspects.",
    booktitle = "2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    doi = "10.1109/WASPAA.2019.8937220",
    isbn = "978-1-7281-1124-7",
    keywords = "Sound event detection and localization; performance evaluation",
    month = "10",
    pages = "333--337",
    publisher = "IEEE",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Joint Measurement of Localization and Detection of Sound Events",
    year = "2019",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/mesaros\_Joint\_localization\_and\_detection\_WASPAA2019.pdf"
    }
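
    A minimal sketch of the kind of location-aware true-positive counting discussed above (the 20° angular threshold and the matching rule are illustrative choices, not the exact metric definition of the paper):

    import numpy as np

    def angular_error_deg(u, v):
        """Angle between two DOA unit vectors, in degrees."""
        u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
        return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

    def location_aware_tp(ref_class, est_class, ref_doa, est_doa, threshold_deg=20.0):
        """Count a detection as a true positive only if its class matches the reference
        and its estimated DOA lies within the angular threshold."""
        err = angular_error_deg(np.asarray(ref_doa, float), np.asarray(est_doa, float))
        return ref_class == est_class and err <= threshold_deg

    print(location_aware_tp("speech", "speech", (1, 0, 0), (0.95, 0.1, 0.0)))   # True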

  • A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen, "Sound Event Detection in the DCASE 2017 Challenge," IEEE-ACM Transactions on Audio Speech and Language Processing, vol. 27, iss. 6, p. 992–1006, 2019. doi:10.1109/TASLP.2019.2907016
    [BibTeX] [Abstract] [Download PDF]

    Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3 having overlapping events annotated in real-life audio, and Task 4, in which only weakly labeled data were available for training. In this paper, we present three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks, and the still widely used mel frequency-based representations, with only few approaches standing out as radically different. Analysis of the systems behavior reveals that task-specific optimization has a big role in producing good performance; however, often this optimization closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant differences in performance between the top systems and the baseline for all tasks.

    @article{2019_TASLP_a,
    author = "Mesaros, Annamaria and Diment, Aleksandr and Elizalde, Benjamin and Heittola, Toni and Vincent, Emmanuel and Raj, Bhiksha and Virtanen, Tuomas",
    abstract = "Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3 having overlapping events annotated in real-life audio, and Task 4, in which only weakly labeled data were available for training. In this paper, we present three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks, and the still widely used mel frequency-based representations, with only few approaches standing out as radically different. Analysis of the systems behavior reveals that task-specific optimization has a big role in producing good performance; however, often this optimization closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure to perform statistical analysis of the challenge results. The analysis indicates that while the 95{\\%} confidence intervals for many systems overlap, there are significant differences in performance between the top systems and the baseline for all tasks.",
    day = "1",
    doi = "10.1109/TASLP.2019.2907016",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "confidence intervals;jackknife estimates;pattern recognition;Sound event detection;weak labels",
    month = "6",
    number = "6",
    pages = "992--1006",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Sound Event Detection in the {DCASE} 2017 Challenge",
    volume = "27",
    year = "2019",
    url = "https://hal.inria.fr/hal-02067935/file/mesaros\_TASLP19.pdf"
    }
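
    The statistical analysis above attaches confidence intervals to challenge metrics via jackknife resampling; a generic leave-one-out jackknife for an arbitrary statistic is sketched below (the mean statistic and the t-based interval are illustrative, not the exact procedure of the paper):

    import numpy as np
    from scipy import stats

    def jackknife_ci(samples, statistic, confidence=0.95):
        """Leave-one-out jackknife estimate of a statistic and its confidence interval."""
        samples = np.asarray(samples)
        n = len(samples)
        loo = np.array([statistic(np.delete(samples, i)) for i in range(n)])
        theta = statistic(samples)
        se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
        t = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
        return theta, (theta - t * se, theta + t * se)

    # usage: interval for the mean of per-file scores (random numbers stand in for real scores)
    scores = np.random.rand(30)
    print(jackknife_ci(scores, np.mean))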

  • A. Mesaros, T. Heittola, and T. Virtanen, "Acoustic scene classification in DCASE 2019 Challenge: closed and open set classification and data mismatch setups," in Proceedings of Workshop on Detection and Classification of Acoustic Scenes and Events, 2019, 2019. doi:10.33682/1syg-dy60
    [BibTeX] [Abstract] [Download PDF]

    Acoustic Scene Classification is a regular task in the DCASE Challenge, with each edition having it as a task. Throughout the years, modifications to the task have included mostly changing the dataset and increasing its size, but recently also more realistic setups have been introduced. In DCASE 2019 Challenge, the Acoustic Scene Classification task includes three subtasks: Subtask A, a closed-set typical supervised classification where all data is recorded with the same device; Subtask B, a closed-set classification setup with mismatched recording devices between training and evaluation data, and Subtask C, an open-set classification setup in which evaluation data could contain acoustic scenes not encountered in the training. In all subtasks, the provided baseline system was significantly outperformed, with top performance being 85.2% for Subtask A, 75.5% for Subtask B, and 67.4% for Subtask C. This paper presents the outcome of DCASE 2019 Challenge Task 1 in terms of submitted systems performance and analysis.

    @inproceedings{2019_DCASE,
    author = "Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas",
    abstract = "Acoustic Scene Classification is a regular task in the DCASE Challenge, with each edition having it as a task. Throughout the years, modifications to the task have included mostly changing the dataset and increasing its size, but recently also more realistic setups have been introduced. In DCASE 2019 Challenge, the Acoustic Scene Classification task includes three subtasks: Subtask A, a closed-set typical supervised classification where all data is recorded with the same device; Subtask B, a closed-set classification setup with mismatched recording devices between training and evaluation data, and Subtask C, an open-set classification setup in which evaluation data could contain acoustic scenes not encountered in the training. In all subtasks, the provided baseline system was significantly outperformed, with top performance being 85.2\% for Subtask A, 75.5\% for Subtask B, and 67.4\% for Subtask C. This paper presents the outcome of DCASE 2019 Challenge Task 1 in terms of submitted systems performance and analysis.",
    booktitle = "Proceedings of Workshop on Detection and Classification of Acoustic Scenes and Events, 2019",
    doi = "10.33682/1syg-dy60",
    title = "Acoustic scene classification in {DCASE} 2019 Challenge: closed and open set classification and data mismatch setups",
    year = "2019",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/DCASE2019Workshop\_Mesaros\_14.pdf"
    }

  • S. I. Mimilakis, K. Drossos, E. Cano, and G. Schuller, "Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation," IEEE-ACM Transactions on Audio Speech and Language Processing, vol. 28, p. 266–278, 2019. doi:10.1109/TASLP.2019.2952013
    [BibTeX] [Abstract]

    The goal of this article is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by the knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation; one with single-layer encoder and decoder, one with multi-layer encoder and single-layer decoder, and one using skip-filtering connections (SF) with a single-layer encoding and decoding. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominant diagonal structure in their corresponding mapping functions, limiting the exploitation of inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.

    @article{2019_TASLP,
    author = "Mimilakis, Stylianos Ioannis and Drossos, Konstantinos and Cano, Estefania and Schuller, Gerald",
    abstract = "The goal of this article is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by the knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation; one with single-layer encoder and decoder, one with multi-layer encoder and single-layer decoder, and one using skip-filtering connections (SF) with a single-layer encoding and decoding. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominant diagonal structure in their corresponding mapping functions, limiting the exploitation of inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.",
    doi = "10.1109/TASLP.2019.2952013",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "Music source separation;singing voice;denoising autoencoder;DAE;skip connections;neural couplings algorithm;NCA",
    pages = "266--278",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation",
    volume = "28",
    year = "2019"
    }

  • M. Norvasuo, H. Edelman, T. Joensuu, D. Hästbacka, M. Filppula, and P. Pertilä, Urbaani älykäs energia (USE): Yhteenveto ja suuntia tulevaisuuteen [Urban smart energy (USE): summary and directions for the future], 2019.
    [BibTeX]
    @book{2019_d,
    author = {Norvasuo, Markku and Edelman, Harry and Joensuu, Tuomo and H{\"a}stbacka, David and Filppula, Mikael and Pertil{\"a}, Pasi},
    title = {Urbaani {\"a}lyk{\"a}s energia (USE): Yhteenveto ja suuntia tulevaisuuteen},
    year = "2019",
    language = "Suomi"
    }

  • P. Pertilä and M. Parviainen, "Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking," in 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings, 2019, p. 436–440. doi:10.1109/ICASSP.2019.8682574
    [BibTeX] [Abstract]

    The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.

    @inproceedings{2019_ICASSP,
    author = {Pertil{\"a}, Pasi and Parviainen, Mikko},
    abstract = "The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.",
    booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",
    day = "1",
    doi = "10.1109/ICASSP.2019.8682574",
    keywords = "Acoustic Source Localization; Microphone Arrays; Recurrent Neural Networks; Time-Frequency Masking",
    month = "5",
    pages = "436--440",
    publisher = "IEEE",
    title = "Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking",
    year = "2019"
    }
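
    For background, the classical non-learned estimator of the TDoA for a microphone pair is the generalized cross-correlation with phase transform (GCC-PHAT); a compact version is sketched below purely for reference (it is the generic textbook method, not the DNN approach of the paper):

    import numpy as np

    def gcc_phat_tdoa(x, y, fs, interp=4):
        """TDoA estimate between signals x and y via GCC-PHAT, in seconds."""
        n = len(x) + len(y)
        X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
        R = X * np.conj(Y)
        R /= np.abs(R) + 1e-12                      # phase transform weighting
        cc = np.fft.irfft(R, n=interp * n)          # interpolated cross-correlation
        max_shift = interp * n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / float(interp * fs)

    fs = 16000
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(2 * fs)
    x, y = noise[8:8 + fs], noise[:fs]              # the two channels differ by 8 samples
    print(gcc_phat_tdoa(x, y, fs))                  # about 5e-4 s in magnitude (8 samples at 16 kHz)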

  • P. Pertilä, "Data-Dependent Ensemble of Magnitude Spectrum Predictions for Single Channel Speech Enhancement," in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), 2019. doi:10.1109/MMSP.2019.8901800
    [BibTeX] [Abstract]

    The time-frequency mask and the magnitude spectrum are two common targets for deep learning-based speech enhancement. Both the ensemble and the neural network fusion of magnitude spectra obtained with these approaches have been shown to improve the objective perceptual quality with synthetic mixtures of data. This work generalizes the ensemble approach by proposing neural network layers to predict time-frequency varying weights for the combination of the two magnitude spectra. In order to combine the best individual magnitude spectrum estimates, the weight prediction network is trained after the time-frequency mask and magnitude spectrum sub-networks have been separately trained for their corresponding objectives and their weights have been frozen. Using the publicly available CHiME3 -challenge data, which consists of both simulated and real speech recordings in everyday environments with noise and interference, the proposed approach leads to significantly higher noise suppression in terms of segmental source-to-distortion ratio over the alternative approaches. In addition, the approach achieves similar improvements in the average objective instrumentally measured intelligibility scores with respect to the best achieved scores.

    @inproceedings{2019_MMSP,
    author = {Pertil{\"a}, Pasi},
    abstract = "The time-frequency mask and the magnitude spectrum are two common targets for deep learning-based speech enhancement. Both the ensemble and the neural network fusion of magnitude spectra obtained with these approaches have been shown to improve the objective perceptual quality with synthetic mixtures of data. This work generalizes the ensemble approach by proposing neural network layers to predict time-frequency varying weights for the combination of the two magnitude spectra. In order to combine the best individual magnitude spectrum estimates, the weight prediction network is trained after the time-frequency mask and magnitude spectrum sub-networks have been separately trained for their corresponding objectives and their weights have been frozen. Using the publicly available CHiME3 -challenge data, which consists of both simulated and real speech recordings in everyday environments with noise and interference, the proposed approach leads to significantly higher noise suppression in terms of segmental source-to-distortion ratio over the alternative approaches. In addition, the approach achieves similar improvements in the average objective instrumentally measured intelligibility scores with respect to the best achieved scores.",
    booktitle = "2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)",
    doi = "10.1109/MMSP.2019.8901800",
    isbn = "978-1-7281-1818-5",
    month = "9",
    publisher = "IEEE",
    series = "IEEE International Workshop on Multimedia Signal Processing",
    title = "Data-Dependent Ensemble of Magnitude Spectrum Predictions for Single Channel Speech Enhancement",
    year = "2019"
    }

  • H. Purwins, B. Li, T. Virtanen, J. Schlüter, S. Chang, and T. Sainath, "Deep Learning for Audio Signal Processing," IEEE Journal of Selected Topics in Signal Processing, vol. 13, iss. 2, p. 206–219, 2019. doi:10.1109/JSTSP.2019.2908700
    [BibTeX] [Abstract] [Download PDF]

    Given the recent surge in developments of deep learning, this paper provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.

    @article{2019_JSTSP,
    author = {Purwins, Hendrik and Li, Bo and Virtanen, Tuomas and Schl{\"u}ter, Jan and Chang, Shuo-yiin and Sainath, Tara},
    abstract = "Given the recent surge in developments of deep learning, this paper provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.",
    day = "1",
    doi = "10.1109/JSTSP.2019.2908700",
    issn = "1932-4553",
    journal = "IEEE Journal of Selected Topics in Signal Processing",
    keywords = "audio enhancement;automatic speech recognition;connectionist temporal memory;Deep learning;environmental sounds;music information retrieval;source separation",
    month = "5",
    number = "2",
    pages = "206--219",
    publisher = "Institute of Electrical and Electronics Engineers",
    title = "Deep Learning for Audio Signal Processing",
    volume = "13",
    year = "2019",
    url = "https://arxiv.org/abs/1905.00078"
    }

  • O. Räsänen and K. Khorrami, "A computational model of early language acquisition from audiovisual experiences of young infants," in Proceedings of INTERSPEECH-2019, 2019, p. 3594–3598. doi:10.21437/Interspeech.2019-1523
    [BibTeX]
    @inproceedings{2019_InterSpecch,
    author = {R{\"a}s{\"a}nen, Okko and Khorrami, Khazar},
    title = "A computational model of early language acquisition from audiovisual experiences of young infants",
    year = "2019",
    doi = "10.21437/Interspeech.2019-1523",
    language = "English",
    series = "Interspeech",
    publisher = "International Speech Communication Association ISCA",
    pages = "3594--3598",
    booktitle = "Proceedings of INTERSPEECH-2019",
    note = "Interspeech ; Conference date: 01-01-1900"
    }

  • O. Räsänen, S. Seshadri, J. Karadayi, E. Riebling, J. Bunce, A. Cristia, F. Metze, M. Casillas, C. Rosemberg, E. Bergelson, and M. Soderstrom, "Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech," Speech Communication, vol. 113, p. 63–80, 2019. doi:10.1016/j.specom.2019.08.005
    [BibTeX] [Abstract]

    Automatic word count estimation (WCE) from audio recordings can be used to quantify the amount of verbal communication in a recording environment. One key application of WCE is to measure language input heard by infants and toddlers in their natural environments, as captured by daylong recordings from microphones worn by the infants. Although WCE is nearly trivial for high-quality signals in high-resource languages, daylong recordings are substantially more challenging due to the unconstrained acoustic environments and the presence of near- and far-field speech. Moreover, many use cases of interest involve languages for which reliable ASR systems or even well-defined lexicons are not available. A good WCE system should also perform similarly for low- and high-resource languages in order to enable unbiased comparisons across different cultures and environments. Unfortunately, the current state-of-the-art solution, the LENA system, is based on proprietary software and has only been optimized for American English, limiting its applicability. In this paper, we build on existing work on WCE and present the steps we have taken towards a freely available system for WCE that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data. Our system is based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts (and a number of other acoustic features) to the corresponding word count estimates. We evaluate our system on samples from daylong infant recordings from six different corpora consisting of several languages and socioeconomic environments, all manually annotated with the same protocol to allow direct comparison. We compare a number of alternative techniques for the two key components in our system: speech activity detection and automatic syllabification of speech. As a result, we show that our system can reach relatively consistent WCE accuracy across multiple corpora and languages (with some limitations). In addition, the system outperforms LENA on three of the four corpora consisting of different varieties of English. We also demonstrate how an automatic neural network-based syllabifier, when trained on multiple languages, generalizes well to novel languages beyond the training data, outperforming two previously proposed unsupervised syllabifiers as a feature extractor for WCE.

    @article{2019_ICA,
    author = {R{\"a}s{\"a}nen, Okko and Seshadri, Shreyas and Karadayi, Julien and Riebling, Eric and Bunce, John and Cristia, Alejandrina and Metze, Florian and Casillas, Marisa and Rosemberg, Celia and Bergelson, Elika and Soderstrom, Melanie},
    title = "Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech",
    abstract = "Automatic word count estimation (WCE) from audio recordings can be used to quantify the amount of verbal communication in a recording environment. One key application of WCE is to measure language input heard by infants and toddlers in their natural environments, as captured by daylong recordings from microphones worn by the infants. Although WCE is nearly trivial for high-quality signals in high-resource languages, daylong recordings are substantially more challenging due to the unconstrained acoustic environments and the presence of near- and far-field speech. Moreover, many use cases of interest involve languages for which reliable ASR systems or even well-defined lexicons are not available. A good WCE system should also perform similarly for low- and high-resource languages in order to enable unbiased comparisons across different cultures and environments. Unfortunately, the current state-of-the-art solution, the LENA system, is based on proprietary software and has only been optimized for American English, limiting its applicability. In this paper, we build on existing work on WCE and present the steps we have taken towards a freely available system for WCE that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data. Our system is based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts (and a number of other acoustic features) to the corresponding word count estimates. We evaluate our system on samples from daylong infant recordings from six different corpora consisting of several languages and socioeconomic environments, all manually annotated with the same protocol to allow direct comparison. We compare a number of alternative techniques for the two key components in our system: speech activity detection and automatic syllabification of speech. As a result, we show that our system can reach relatively consistent WCE accuracy across multiple corpora and languages (with some limitations). In addition, the system outperforms LENA on three of the four corpora consisting of different varieties of English. We also demonstrate how an automatic neural network-based syllabifier, when trained on multiple languages, generalizes well to novel languages beyond the training data, outperforming two previously proposed unsupervised syllabifiers as a feature extractor for WCE.",
    keywords = "Automatic syllabification, Daylong recordings, Language acquisition, Noise robustness, Word count estimation",
    year = "2019",
    month = "October",
    day = "1",
    doi = "10.1016/j.specom.2019.08.005",
    language = "English",
    volume = "113",
    pages = "63--80",
    journal = "Speech Communication",
    issn = "0167-6393",
    publisher = "Elsevier B.V."
    }

  • P. San Juan Sebastián, T. Virtanen, V. M. Garcia-Molla, and A. M. Vidal, "Analysis of an efficient parallel implementation of active-set Newton algorithm," Journal of Supercomputing, vol. 75, iss. 3, p. 1298–1309, 2019. doi:10.1007/s11227-018-2423-5
    [BibTeX] [Abstract]

    This paper presents an analysis of an efficient parallel implementation of the active-set Newton algorithm (ASNA), which is used to estimate the nonnegative weights of linear combinations of the atoms in a large-scale dictionary to approximate an observation vector by minimizing the Kullback–Leibler divergence between the observation vector and the approximation. The performance of ASNA has been proved in previous works against other state-of-the-art methods. The implementations analysed in this paper have been developed in C, using parallel programming techniques to obtain a better performance in multicore architectures than the original MATLAB implementation. Also a hardware analysis is performed to check the influence of CPU frequency and number of CPU cores in the different implementations proposed. The new implementations allow ASNA algorithm to tackle real-time problems due to the execution time reduction obtained.

    @article{2019,
    author = "San Juan Sebasti{\'a}n, Pablo and Virtanen, Tuomas and Garcia-Molla, Victor M. and Vidal, Antonio M.",
    title = "Analysis of an efficient parallel implementation of active-set Newton algorithm",
    abstract = "This paper presents an analysis of an efficient parallel implementation of the active-set Newton algorithm (ASNA), which is used to estimate the nonnegative weights of linear combinations of the atoms in a large-scale dictionary to approximate an observation vector by minimizing the Kullback–Leibler divergence between the observation vector and the approximation. The performance of ASNA has been proved in previous works against other state-of-the-art methods. The implementations analysed in this paper have been developed in C, using parallel programming techniques to obtain a better performance in multicore architectures than the original MATLAB implementation. Also a hardware analysis is performed to check the influence of CPU frequency and number of CPU cores in the different implementations proposed. The new implementations allow ASNA algorithm to tackle real-time problems due to the execution time reduction obtained.",
    keywords = "Convex optimization, Multicore, Newton algorithm, Parallel computing, Sparse representation",
    year = "2019",
    month = "March",
    doi = "10.1007/s11227-018-2423-5",
    language = "English",
    volume = "75",
    pages = "1298--1309",
    journal = "Journal of Supercomputing",
    issn = "0920-8542",
    publisher = "Springer Netherlands",
    number = "3"
    }

  • S. Seshadri, L. Juvela, O. Räsänen, and P. Alku, "Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning," IEEE Access, vol. 7, p. 17230–17246, 2019. doi:10.1109/ACCESS.2019.2895923
    [BibTeX] [Abstract]

    Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we aim to provide a general SSC system for converting styles with varying vocal effort and focus on normal-to-Lombard conversion as a case study of this problem. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using parallel machine learning models from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. A total of three vocoders (GlottDNN, STRAIGHT, and Pulse model in log domain (PML)) and three machine learning mapping methods (standard GMM, Bayesian GMM, and feed-forward DNN) were compared in the proposed normal-to-Lombard style conversion system. The conversion was evaluated using two subjective listening tests measuring perceived Lombardness and quality of the converted speech signals, and by using an instrumental measure called Speech Intelligibility in Bits (SIIB) for speech intelligibility evaluation under various noise levels. The results of the subjective tests show that the system is able to convert normal speech into Lombard speech and that there is a trade-off between quality and Lombardness of the mapped utterances. The GlottDNN and PML stand out as the best vocoders in terms of quality and Lombardness, respectively, whereas the DNN is the best mapping method in terms of Lombardness. PML with the standard GMM seems to give a good compromise between the two attributes. The SIIB experiments indicate that intelligibility of converted speech compared to that of normal speech improved in noisy conditions most effectively when DNN mapping was used with STRAIGHT and PML.

    @article{2019_a,
    author = {Seshadri, Shreyas and Juvela, Lauri and R{\"a}s{\"a}nen, Okko and Alku, Paavo},
    title = "Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning",
    abstract = "Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we aim to provide a general SSC system for converting styles with varying vocal effort and focus on normal-to-Lombard conversion as a case study of this problem. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using parallel machine learning models from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. A total of three vocoders (GlottDNN, STRAIGHT, and Pulse model in log domain (PML)) and three machine learning mapping methods (standard GMM, Bayesian GMM, and feed-forward DNN) were compared in the proposed normal-to-Lombard style conversion system. The conversion was evaluated using two subjective listening tests measuring perceived Lombardness and quality of the converted speech signals, and by using an instrumental measure called Speech Intelligibility in Bits (SIIB) for speech intelligibility evaluation under various noise levels. The results of the subjective tests show that the system is able to convert normal speech into Lombard speech and that there is a trade-off between quality and Lombardness of the mapped utterances. The GlottDNN and PML stand out as the best vocoders in terms of quality and Lombardness, respectively, whereas the DNN is the best mapping method in terms of Lombardness. PML with the standard GMM seems to give a good compromise between the two attributes. The SIIB experiments indicate that intelligibility of converted speech compared to that of normal speech improved in noisy conditions most effectively when DNN mapping was used with STRAIGHT and PML.",
    keywords = "Bayesian GMM, DNN, GlottDNN, Lombard speech, pulse model in log domain, speaking style conversion, vocal effort",
    year = "2019",
    doi = "10.1109/ACCESS.2019.2895923",
    language = "English",
    volume = "7",
    pages = "17230--17246",
    journal = "IEEE Access",
    issn = "2169-3536",
    publisher = "Institute of Electrical and Electronics Engineers Inc."
    }

  • S. Seshadri, L. Juvela, J. Yamagishi, O. Räsänen, and P. Alku, "Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United States, 2019, p. 6835–6839. doi:10.1109/ICASSP.2019.8682648
    [BibTeX] [Abstract]

    Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we propose the use of cycle-consistent adversarial networks (CycleGANs) for converting styles with varying vocal effort, and focus on conversion between normal and Lombard styles as a case study of this problem. We propose a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract speech features. These features are mapped using the CycleGAN from utterances in the source style to the corresponding features of target speech. Finally, the mapped features are converted to a Lombard speech waveform with the PML. The CycleGAN was compared in subjective listening tests with 2 other standard mapping methods used in conversion, and the CycleGAN was found to have the best performance in terms of speech quality and in terms of the magnitude of the perceptual change between the two styles.

    @inproceedings{2019_ICASSP_d,
    author = {Seshadri, Shreyas and Juvela, Lauri and Yamagishi, Junichi and R{\"a}s{\"a}nen, Okko and Alku, Paavo},
    title = "Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion",
    abstract = "Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we propose the use of cycle-consistent adversarial networks (CycleGANs) for converting styles with varying vocal effort, and focus on conversion between normal and Lombard styles as a case study of this problem. We propose a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract speech features. These features are mapped using the CycleGAN from utterances in the source style to the corresponding features of target speech. Finally, the mapped features are converted to a Lombard speech waveform with the PML. The CycleGAN was compared in subjective listening tests with 2 other standard mapping methods used in conversion, and the CycleGAN was found to have the best performance in terms of speech quality and in terms of the magnitude of the perceptual change between the two styles.",
    keywords = "feature extraction, speech processing, vocoders, cycle-consistent adversarial networks, nonparallel vocal effort, CycleGAN, PML, Lombard speech waveform, natural speech signal conversion, standard mapping methods, SSC technology, speaking style conversion technology, pulse model in log domain vocoder, speech feature extraction, Vocoders, Feature extraction, Training, Standards, Speech, Speech processing, style conversion, vocal effort, Lom-bard speech, pulse-model in log domain vocoder",
    year = "2019",
    month = "April",
    day = "17",
    doi = "10.1109/ICASSP.2019.8682648",
    language = "English",
    isbn = "978-1-4799-8132-8",
    series = "IEEE International Conference on Acoustics, Speech and Signal Processing",
    publisher = "IEEE",
    pages = "6835--6839",
    booktitle = "ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    address = "United States",
    note = "IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 01-01-1900 Through 01-01-2000"
    }

  • S. Seshadri and O. Räsänen, "SylNet: An Adaptable End-to-End Syllable Count Estimator for Speech," IEEE Signal Processing Letters, vol. 26, iss. 9, p. 1359–1363, 2019. doi:10.1109/LSP.2019.2929415
    [BibTeX] [Abstract]

    Automatic syllable count estimation (SCE) is used in a variety of applications ranging from speaking rate estimation to detecting social activity from wearable microphones or developmental research concerned with quantifying speech heard by language-learning children in different environments. The majority of previously utilized SCE methods have relied on heuristic digital signal processing (DSP) methods, and only a small number of bi-directional long short-term memory (BLSTM) approaches have made use of modern machine learning approaches in the SCE task. This letter presents a novel end-to-end method called SylNet for automatic syllable counting from speech, built on the basis of a recent developments in neural network architectures. We describe how the entire model can be optimized directly to minimize SCE error on the training data without annotations aligned at the syllable level, and how it can be adapted to new languages using limited speech data with known syllable counts. Experiments on several different languages reveal that SylNet generalizes to languages beyond its training data and further improves with adaptation. It also outperforms several previously proposed methods for syllabification, including end-to-end BLSTMs.

    @article{2019_SP_a,
    author = {Seshadri, Shreyas and R{\"a}s{\"a}nen, Okko},
    title = "SylNet: An Adaptable End-to-End Syllable Count Estimator for Speech",
    abstract = "Automatic syllable count estimation (SCE) is used in a variety of applications ranging from speaking rate estimation to detecting social activity from wearable microphones or developmental research concerned with quantifying speech heard by language-learning children in different environments. The majority of previously utilized SCE methods have relied on heuristic digital signal processing (DSP) methods, and only a small number of bi-directional long short-term memory (BLSTM) approaches have made use of modern machine learning approaches in the SCE task. This letter presents a novel end-to-end method called SylNet for automatic syllable counting from speech, built on the basis of a recent developments in neural network architectures. We describe how the entire model can be optimized directly to minimize SCE error on the training data without annotations aligned at the syllable level, and how it can be adapted to new languages using limited speech data with known syllable counts. Experiments on several different languages reveal that SylNet generalizes to languages beyond its training data and further improves with adaptation. It also outperforms several previously proposed methods for syllabification, including end-to-end BLSTMs.",
    keywords = "estimation theory, learning (artificial intelligence), natural language processing, neural net architecture, recurrent neural nets, speech processing, SylNet, adaptable end-to-end syllable count estimator, automatic syllable count estimation, wearable microphones, developmental research, language-learning children, heuristic digital signal processing methods, SCE task, automatic syllable counting, SCE error, training data, syllable level, speech data, end-to-end BLSTMs, machine learning approaches, SCE methods, speaking rate estimation, social activity detection, DSP methods, bi-directional short-term memory approaches, neural network architectures, limited speech data, Estimation, Training, Adaptation models, Speech processing, Signal processing algorithms, Training data, Channel estimation, syllable count estimation, end-to-end learning, deep learning",
    year = "2019",
    month = "September",
    doi = "10.1109/LSP.2019.2929415",
    language = "English",
    volume = "26",
    pages = "1359--1363",
    journal = "IEEE Signal Processing Letters",
    issn = "1070-9908",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "9"
    }

  • S. Seshadri, L. Juvela, P. Alku, and O. Räsänen, "Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion," in Proceedings of INTERSPEECH-2019, 2019, p. 2838–2842. doi:10.21437/Interspeech.2019-1681
    [BibTeX]
    @inproceedings{2019_InterSpecch_a,
    author = {Seshadri, Shreyas and Juvela, Lauri and Alku, Paavo and R{\"a}s{\"a}nen, Okko},
    title = "Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion",
    year = "2019",
    doi = "10.21437/Interspeech.2019-1681",
    language = "English",
    series = "Interspeech",
    publisher = "International Speech Communication Association ISCA",
    pages = "2838--2842",
    booktitle = "Proceedings of INTERSPEECH-2019",
    note = "Interspeech ; Conference date: 01-01-1900"
    }

  • S. Wang, G. Naithani, and T. Virtanen, "Low-latency Deep Clustering for Speech Separation," in 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings, 2019, p. 76–80. doi:10.1109/ICASSP.2019.8683437
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the usage of long-short-term-memory (LSTM) networks instead of their bidirectional variant used in the original work, b) using a short synthesis window (here 8 ms) required for low-latency operation, and, c) using a buffer in the beginning of audio mixture to estimate cluster centres corresponding to constituent speakers which are then utilized to separate speakers within the rest of the signal. The buffer duration would serve as an initialization phase after which the system is capable of operating with 8 ms algorithmic latency. We evaluate our proposed approach on two-speaker mixtures from Wall Street Journal (WSJ0) corpus. We observe that the use of LSTM yields around one dB lower SDR as compared to the baseline bidirectional LSTM in terms of source to distortion ratio (SDR). Moreover, using an 8 ms synthesis window instead of 32 ms degrades the separation performance by around 2.1 dB as compared to the baseline. Finally, we also report separation performance with different buffer durations noting that separation can be achieved even for buffer duration as low as 300 ms.

    @inproceedings{2019_ICASSP_b,
    author = "Wang, Shanshan and Naithani, Gaurav and Virtanen, Tuomas",
    abstract = "This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the usage of long-short-term-memory (LSTM) networks instead of their bidirectional variant used in the original work, b) using a short synthesis window (here 8 ms) required for low-latency operation, and, c) using a buffer in the beginning of audio mixture to estimate cluster centres corresponding to constituent speakers which are then utilized to separate speakers within the rest of the signal. The buffer duration would serve as an initialization phase after which the system is capable of operating with 8 ms algorithmic latency. We evaluate our proposed approach on two-speaker mixtures from Wall Street Journal (WSJ0) corpus. We observe that the use of LSTM yields around one dB lower SDR as compared to the baseline bidirectional LSTM in terms of source to distortion ratio (SDR). Moreover, using an 8 ms synthesis window instead of 32 ms degrades the separation performance by around 2.1 dB as compared to the baseline. Finally, we also report separation performance with different buffer durations noting that separation can be achieved even for buffer duration as low as 300 ms.",
    booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",
    day = "1",
    doi = "10.1109/ICASSP.2019.8683437",
    keywords = "Deep clustering; Low latency; Monaural speech separation",
    month = "5",
    pages = "76--80",
    publisher = "IEEE",
    title = "Low-latency Deep Clustering for Speech Separation",
    year = "2019",
    url = "http://arxiv.org/abs/1902.07033"
    }

  • H. Xie and T. Virtanen, "Zero-Shot Audio Classification Based On Class Label Embeddings," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, p. 264–267. doi:10.1109/WASPAA.2019.8937283
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a zero-shot learning approach for audio classification based on the textual information about class labels without any audio samples from target classes. We propose an audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information of audio classes, and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with small training dataset. It can achieve accuracy (26% on average) better than random guess (10%) on each audio category. Particularly, it reaches up to 39.7% for the category of natural audio classes.

    @inproceedings{2019_WASPAA_d,
    author = "Xie, Huang and Virtanen, Tuomas",
    abstract = "This paper proposes a zero-shot learning approach for audio classification based on the textual information about class labels without any audio samples from target classes. We propose an audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information of audio classes, and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with small training dataset. It can achieve accuracy (26 {\\%} on average) better than random guess (10 {\\%}) on each audio category. Particularly, it reaches up to 39.7 {\\%} for the category of natural audio classes.",
    booktitle = "2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    doi = "10.1109/WASPAA.2019.8937283",
    isbn = "978-1-7281-1124-7",
    keywords = "zero-shot learning; audio classification; class label embedding",
    month = "10",
    pages = "264--267",
    publisher = "IEEE",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Zero-Shot Audio Classification Based On Class Label Embeddings",
    year = "2019",
    url = "https://arxiv.org/abs/1905.01926"
    }

2018

  • Computational Analysis of Sound Scenes and Events, T. Virtanen, M. D. Plumbley, and D. Ellis, Eds., Springer, 2018. doi:10.1007/978-3-319-63450-0
    [BibTeX] [Abstract]

    This book presents computational methods for extracting the useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms.

    @book{2018_g,
    editor = "Virtanen, Tuomas and Plumbley, Mark D. and Ellis, Dan",
    title = "Computational analysis of sound scenes and events",
    abstract = "This book presents computational methods for extracting the useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms.",
    year = "2018",
    doi = "10.1007/978-3-319-63450-0",
    language = "English",
    isbn = "978-3-319-63449-4",
    publisher = "Springer"
    }

  • Audio Source Separation and Speech Enhancement, E. Vincent, T. Virtanen, and S. Gannot, Eds., Wiley, 2018.
    [BibTeX]
    @book{2018_o,
    editor = "Vincent, Emmanuel and Virtanen, Tuomas and Gannot, Sharon",
    title = "Audio Source Separation and Speech Enhancement",
    year = "2018",
    language = "English",
    isbn = "978-1-119-27989-1",
    publisher = "Wiley"
    }

  • S. Adavanne, A. Politis, and T. Virtanen, "Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network," in 2018 26th European Signal Processing Conference (EUSIPCO), 2018, p. 1462–1466. doi:10.23919/EUSIPCO.2018.8553182
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.

    @inproceedings{2018_EUSIPCO,
    author = "Adavanne, Sharath and Politis, Archontis and Virtanen, Tuomas",
    abstract = "This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.",
    booktitle = "2018 26th European Signal Processing Conference (EUSIPCO)",
    doi = "10.23919/EUSIPCO.2018.8553182",
    isbn = "978-1-5386-3736-4",
    keywords = "array signal processing; direction-of-arrival estimation; feature extraction; feedforward neural nets; recurrent neural nets; signal classification; spatial pseudospectrum; SPS; DOA estimates; explicit feature extraction step; DOAnet; multiple concurrently",
    month = "9",
    pages = "1462--1466",
    title = "Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network",
    year = "2018",
    url = "https://arxiv.org/abs/1710.10059"
    }

  • S. Adavanne, A. Politis, and T. Virtanen, "Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-channel Features," in 2018 International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1-7. doi:10.1109/IJCNN.2018.8489542
    [BibTeX] [Abstract]

    In this paper, we propose a stacked convolutional and recurrent neural network (CRNN) with a 3D convolutional neural network (CNN) in the first layer for the multichannel sound event detection (SED) task. The 3D CNN enables the network to simultaneously learn the inter-and intra-channel features from the input multichannel audio. In order to evaluate the proposed method, multichannel audio datasets with different number of overlapping sound sources are synthesized. Each of this dataset has a four-channel first-order Ambisonic, binaural, and single-channel versions, on which the performance of SED using the proposed method are compared to study the potential of SED using multichannel audio. A similar study is also done with the binaural and single-channel versions of the real-life recording TUT-SED 2017 development dataset. The proposed method learns to recognize overlapping sound events from multichannel features faster and performs better SED with a fewer number of training epochs. The results show that on using multichannel Ambisonic audio in place of single-channel audio we improve the overall F-score by 7.5%.

    @INPROCEEDINGS{2018_IJCNN,
    author = "Adavanne, Sharath and Politis, Archontis and Virtanen, Tuomas",
    booktitle = "2018 International Joint Conference on Neural Networks (IJCNN)",
    title = "Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-channel Features",
    year = "2018",
    volume = "",
    number = "",
    pages = "1-7",
    keywords = "Feature extraction;Three-dimensional displays;Two dimensional displays;Event detection;Task analysis;Recurrent neural networks",
    abstract = "In this paper, we propose a stacked convolutional and recurrent neural network (CRNN) with a 3D convolutional neural network (CNN) in the first layer for the multichannel sound event detection (SED) task. The 3D CNN enables the network to simultaneously learn the inter-and intra-channel features from the input multichannel audio. In order to evaluate the proposed method, multichannel audio datasets with different number of overlapping sound sources are synthesized. Each of this dataset has a four-channel first-order Ambisonic, binaural, and single-channel versions, on which the performance of SED using the proposed method are compared to study the potential of SED using multichannel audio. A similar study is also done with the binaural and single-channel versions of the real-life recording TUT-SED 2017 development dataset. The proposed method learns to recognize overlapping sound events from multichannel features faster and performs better SED with a fewer number of training epochs. The results show that on using multichannel Ambisonic audio in place of single-channel audio we improve the overall F-score by 7.5\%",
    doi = "10.1109/IJCNN.2018.8489542"
    }

  • S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks," IEEE Journal of Selected Topics in Signal Processing, 2018. doi:10.1109/JSTSP.2018.2885636
    [BibTeX] [Abstract] [Download PDF]

    In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

    @article{2018_JSTSP,
    author = "Adavanne, Sharath and Politis, Archontis and Nikunen, Joonas and Virtanen, Tuomas",
    abstract = "In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.",
    day = "7",
    doi = "10.1109/JSTSP.2018.2885636",
    issn = "1932-4553",
    journal = "IEEE Journal of Selected Topics in Signal Processing",
    keywords = "Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network",
    month = "12",
    publisher = "Institute of Electrical and Electronics Engineers",
    title = "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks",
    year = "2018",
    url = "https://arxiv.org/abs/1807.00129"
    }

  • R. Badeau and T. Virtanen, "Nonnegative Matrix Factorization," in Audio Source Separation and Speech Enhancement, E. Vincent, T. Virtanen, and S. Gannot, Eds., Wiley, 2018. doi:10.1002/9781119279860.ch8
    [BibTeX] [Abstract]

    Nonnegative matrix factorization (NMF) is a very powerful model for representing speech and music data. In this chapter, we present the mathematical foundations, and describe several probabilistic frameworks and various algorithms for computing an NMF. We also describe some advanced NMF models that are able to more accurately represent audio signals, by enforcing properties such as sparsity, harmonicity and spectral smoothness, and by taking the non‐stationarity of the data into account. We show that coupled factorizations make it possible to exploit some extra information we may have about the observed signal, including the musical score. Finally, we present several methods that perform dictionary learning for NMF, and we conclude about the main benefits and downsides of NMF models.

    @inbook{2018_e,
    author = "Badeau, Roland and Virtanen, Tuomas",
    editor = "Vincent, Emmanuel and Virtanen, Tuomas and Gannot, Sharon",
    title = "Nonnegative Matrix Factorization",
    abstract = "Nonnegative matrix factorization (NMF) is a very powerful model for representing speech and music data. In this chapter, we present the mathematical foundations, and describe several probabilistic frameworks and various algorithms for computing an NMF. We also describe some advanced NMF models that are able to more accurately represent audio signals, by enforcing properties such as sparsity, harmonicity and spectral smoothness, and by taking the non‐stationarity of the data into account. We show that coupled factorizations make it possible to exploit some extra information we may have about the observed signal, including the musical score. Finally, we present several methods that perform dictionary learning for NMF, and we conclude about the main benefits and downsides of NMF models.",
    year = "2018",
    month = "August",
    day = "3",
    doi = "10.1002/9781119279860.ch8",
    language = "English",
    isbn = "978-1-119-27989-1",
    booktitle = "Audio Source Separation and Speech Enhancement",
    publisher = "Wiley"
    }

  • L. Bramsløw, G. Naithani, A. Hafez, T. Barker, N. H. Pontoppidan, and T. Virtanen, "Improving competing voices segregation for hearing impaired listeners using a low-latency deep neural network algorithm," The Journal of the Acoustical Society of America, vol. 144, iss. 1, p. 172–185, 2018. doi:10.1121/1.5045322
    [BibTeX] [Abstract]

    Hearing aid users are challenged in listening situations with noise and especially speech-on-speech situations with two or more competing voices. Specifically, the task of attending to and segregating two competing voices is particularly hard, unlike for normal-hearing listeners, as shown in a small sub-experiment. In the main experiment, the competing voices benefit of a deep neural network (DNN) based stream segregation enhancement algorithm was tested on hearing-impaired listeners. A mixture of two voices was separated using a DNN and presented to the two ears as individual streams and tested for word score. Compared to the unseparated mixture, there was a 13%

    @article{2018_ICA,
    author = "Bramsl{\o}w, Lars and Naithani, Gaurav and Hafez, Atefeh and Barker, Tom and Pontoppidan, Niels Henrik and Virtanen, Tuomas",
    title = "Improving competing voices segregation for hearing impaired listeners using a low-latency deep neural network algorithm",
    abstract = "Hearing aid users are challenged in listening situations with noise and especially speech-on-speech situations with two or more competing voices. Specifically, the task of attending to and segregating two competing voices is particularly hard, unlike for normal-hearing listeners, as shown in a small sub-experiment. In the main experiment, the competing voices benefit of a deep neural network (DNN) based stream segregation enhancement algorithm was tested on hearing-impaired listeners. A mixture of two voices was separated using a DNN and presented to the two ears as individual streams and tested for word score. Compared to the unseparated mixture, there was a 13\\%",
    journal = "The Journal of the Acoustical Society of America",
    volume = "144",
    number = "1",
    pages = "172--185",
    year = "2018",
    publisher = "AIP Publishing",
    doi = "doi.org/10.1121/1.5045322"
    }

  • E. Cakir and T. Virtanen, "Musical Instrument Synthesis and Morphing in Multidimensional Latent Space Using Variational, Convolutional Recurrent Autoencoders," in Proceedings of the Audio Engineering Society 145th Convention, 2018.
    [BibTeX] [Abstract] [Download PDF]

    In this work we propose a deep learning based method—namely, variational, convolutional recurrent autoencoders (VCRAE)—for musical instrument synthesis. This method utilizes the higher level time-frequency representations extracted by the convolutional and recurrent layers to learn a Gaussian distribution in the training stage, which will be later used to infer unique samples through interpolation of multiple instruments in the usage stage. The reconstruction performance of VCRAE is evaluated by proxy through an instrument classifier and provides significantly better accuracy than two other baseline autoencoder methods. The synthesized samples for the combinations of 15 different instruments are available on the companion website.

    @inproceedings{2018_AES,
    author = "Cakir, Emre and Virtanen, Tuomas",
    abstract = "In this work we propose a deep learning based method—namely, variational, convolutional recurrent autoencoders (VCRAE)—for musical instrument synthesis. This method utilizes the higher level time-frequency representations extracted by the convolutional and recurrent layers to learn a Gaussian distribution in the training stage, which will be later used to infer unique samples through interpolation of multiple instruments in the usage stage. The reconstruction performance of VCRAE is evaluated by proxy through an instrument classifier and provides significantly better accuracy than two other baseline autoencoder methods. The synthesized samples for the combinations of 15 different instruments are available on the companion website.",
    booktitle = "Proceedings of the Audio Engineering Society 145th Convention",
    publisher = "AES Audio Engineering Society",
    title = "Musical Instrument Synthesis and Morphing in Multidimensional Latent Space Using Variational, Convolutional Recurrent Autoencoders",
    year = "2018",
    url = "https://trepo.tuni.fi/bitstream/handle/10024/129505/AES\_2018\_Musical\_Instrument\_Synthesis\_and\_Morphing\_in\_Multidimensional\_Latent\_Space\_Using\_Variational\_Convolutional\_Recurrent\_Autoencoders.pdf?sequence=1\&isAllowed=y"
    }

  • E. Cakir and T. Virtanen, "End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input," in 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings, 2018. doi:10.1109/IJCNN.2018.8489470
    [BibTeX] [Abstract] [Download PDF]

    Sound event detection systems typically consist of two stages: Extracting hand-crafted features from the raw audio waveform, and learning a mapping between these features and the target sound events using a classifier. Recently, the focus of sound event detection research has been mostly shifted to the latter stage using standard features such as mel spectrogram as the input for classifiers such as deep neural networks. In this work, we utilize end-to-end approach and propose to combine these two stages in a single deep neural network classifier. The feature extraction over the raw waveform is conducted by a feedforward layer block, whose parameters are initialized to extract the time-frequency representations. The feature extraction parameters are updated during training, resulting with a representation that is optimized for the specific task. This feature extraction block is followed by (and jointly trained with) a convolutional recurrent network, which has recently given state-of-the-art results in many sound recognition tasks. The proposed system does not outperform a convolutional recurrent network with fixed hand-crafted features. The final magnitude spectrum characteristics of the feature extraction block parameters indicate that the most relevant information for the given task is contained in 0 - 3 kHz frequency range, and this is also supported by the empirical results on the SED performance.

    @inproceedings{2018_IJCNN_a,
    author = "Cakir, Emre and Virtanen, Tuomas",
    abstract = "Sound event detection systems typically consist of two stages: Extracting hand-crafted features from the raw audio waveform, and learning a mapping between these features and the target sound events using a classifier. Recently, the focus of sound event detection research has been mostly shifted to the latter stage using standard features such as mel spectrogram as the input for classifiers such as deep neural networks. In this work, we utilize end-to-end approach and propose to combine these two stages in a single deep neural network classifier. The feature extraction over the raw waveform is conducted by a feedforward layer block, whose parameters are initialized to extract the time-frequency representations. The feature extraction parameters are updated during training, resulting with a representation that is optimized for the specific task. This feature extraction block is followed by (and jointly trained with) a convolutional recurrent network, which has recently given state-of-the-art results in many sound recognition tasks. The proposed system does not outperform a convolutional recurrent network with fixed hand-crafted features. The final magnitude spectrum characteristics of the feature extraction block parameters indicate that the most relevant information for the given task is contained in 0 - 3 kHz frequency range, and this is also supported by the empirical results on the SED performance.",
    booktitle = "2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings",
    day = "10",
    doi = "10.1109/IJCNN.2018.8489470",
    keywords = "convolutional recurrent neural networks; end-to-end; feature learning; neural networks",
    month = "10",
    publisher = "IEEE",
    title = "End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input",
    year = "2018",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/end\_to\_end\_sed\_with\_crnn\_ijcnn\_2018.pdf"
    }
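
    The entry above describes replacing fixed time-frequency features with a learnable front-end trained jointly with a convolutional recurrent classifier. Below is a minimal PyTorch sketch of that general idea; the layer sizes, the strided 1-D convolutional front-end, and its random (rather than time-frequency) initialization are illustrative assumptions, not the authors' exact architecture.

    # Minimal sketch (not the published architecture): learnable 1-D convolutional
    # front-end over the raw waveform, followed by a small convolutional recurrent
    # network with frame-wise sigmoid outputs for polyphonic (multi-label) SED.
    import torch
    import torch.nn as nn

    class LearnedFrontEndCRNN(nn.Module):
        def __init__(self, n_filters=40, win_length=400, hop_length=160, n_classes=6):
            super().__init__()
            # Learnable "time-frequency" front-end: strided 1-D convolution on raw audio.
            self.frontend = nn.Conv1d(1, n_filters, kernel_size=win_length,
                                      stride=hop_length, padding=win_length // 2)
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=(2, 1)),       # pool along the filter axis only
            )
            self.rnn = nn.GRU(input_size=32 * (n_filters // 2), hidden_size=64,
                              batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * 64, n_classes)

        def forward(self, waveform):                    # waveform: (batch, samples)
            x = self.frontend(waveform.unsqueeze(1))    # (batch, filters, frames)
            x = torch.log1p(torch.abs(x))               # magnitude-like compression
            x = self.conv(x.unsqueeze(1))               # (batch, 32, filters/2, frames)
            b, c, f, t = x.shape
            x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
            x, _ = self.rnn(x)
            return torch.sigmoid(self.classifier(x))    # frame-wise event activities

    # model = LearnedFrontEndCRNN(); activities = model(torch.randn(2, 16000))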

  • K. Drossos, P. Magron, S. I. Mimilakis, and T. Virtanen, "Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery," in 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, 2018, p. 421–425. doi:10.1109/IWAENC.2018.8521371
    [BibTeX] [Abstract] [Download PDF]

    Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-term Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the-art kernel additive model approach.

    @inproceedings{2018_IWAENC_a,
    author = "Drossos, Konstantinos and Magron, Paul and Mimilakis, Stylianos Ioannis and Virtanen, Tuomas",
    abstract = "Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-term Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the art kernel additive model approach.",
    booktitle = "16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018",
    doi = "10.1109/IWAENC.2018.8521371",
    month = "11",
    pages = "421--425",
    publisher = "IEEE",
    title = "Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery",
    year = "2018",
    url = "https://arxiv.org/pdf/1807.11298.pdf"
    }

  • K. Drossos, S. I. Mimilakis, D. Serdyuk, G. Schuller, T. Virtanen, and Y. Bengio, "MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation," in Proceedings of the IEEE World Congress on Computational Intelligence (WCCI)/International Joint Conference on Neural Networks (IJCNN), 2018.
    [BibTeX] [Download PDF]
    @inproceedings{2018_WCCI,
    author = "Drossos, Konstantinos and Mimilakis, Stylianos Ioannis and Serdyuk, Dmitriy and Schuller, Gerald and Virtanen, Tuomas and Bengio, Yoshua",
    booktitle = "Proceedings of the IEEE World Congress on Computational Intelligence (WCCI)/International Joint Conference on Neural Networks (IJCNN)",
    day = "10",
    month = "7",
    publisher = "IEEE",
    title = "{M}a{D} {T}win{N}et: {M}asker-{D}enoiser {A}rchitecture with {T}win {N}etworks for {M}onaural {S}ound {S}ource {S}eparation",
    year = "2018",
    url = "https://arxiv.org/abs/1802.00300"
    }

  • D. Ellis, T. Virtanen, M. D. Plumbley, and B. Raj, "Future Perspective," in Computational Analysis of Sound Scenes and Events, T. Virtanen, M. D. Plumbley, and D. Ellis, Eds., Springer, 2018, p. 401–415. doi:10.1007/978-3-319-63450-0_14
    [BibTeX] [Abstract]

    This book has covered the underlying principles and technologies of sound recognition, and described several current application areas. However, the field is still very young; this chapter briefly outlines several emerging areas, particularly relating to the provision of the very large training sets that can be exploited by deep learning approaches. We also forecast some of the technological and application advances we expect in the short-to-medium future.

    @inbook{2018_m,
    author = "Ellis, Dan and Virtanen, Tuomas and Plumbley, Mark D. and Raj, Bhiksha",
    editor = "Virtanen, Tuomas and Plumbley, Mark D. and Ellis, Dan",
    title = "Future Perspective",
    abstract = "This book has covered the underlying principles and technologies of sound recognition, and described several current application areas. However, the field is still very young; this chapter briefly outlines several emerging areas, particularly relating to the provision of the very large training sets that can be exploited by deep learning approaches. We also forecast some of the technological and application advances we expect in the short-to-medium future.",
    year = "2018",
    doi = "10.1007/978-3-319-63450-0\\_14",
    language = "English",
    isbn = "978-3-319-63449-4",
    pages = "401--415",
    booktitle = "Computational Analysis of Sound Scenes and Events",
    publisher = "Springer"
    }

  • S. Gharib, H. Derrar, D. Niizumi, T. Senttula, J. Tommola, T. Heittola, T. Virtanen, and H. Huttunen, "Acoustic Scene Classification: A Competition Review," in 2018 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2018, 2018. doi:10.1109/MLSP.2018.8517000
    [BibTeX] [Abstract] [Download PDF]

    In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course, by both the students and external participants. We identify the most suitable methods and study the impact of each by performing an ablation study of the mixture of approaches. We also compare the results with a neural network baseline, and show the improvement over that. Finally, we discuss the impact of using a competition as a part of a university course, and justify its importance in the curriculum based on student feedback.

    @inproceedings{2018_MLSP,
    author = "Gharib, Shayan and Derrar, Honain and Niizumi, Daisuke and Senttula, Tuukka and Tommola, Janne and Heittola, Toni and Virtanen, Tuomas and Huttunen, Heikki",
    abstract = "In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course; both by the students and external participants. We identify the most suitable methods and study the impact of each by performing an ablation study of the mixture of approaches. We also compare the results with a neural network baseline, and show the improvement over that. Finally, we discuss the impact of using a competition as a part of a university course, and justify its importance in the curriculum based on student feedback.",
    booktitle = "2018 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2018",
    doi = "10.1109/MLSP.2018.8517000",
    month = "9",
    publisher = "IEEE",
    title = "Acoustic Scene Classification: A Competition Review",
    year = "2018",
    url = "https://arxiv.org/pdf/1808.02357.pdf"
    }

  • S. Gharib, K. Drossos, E. Cakir, D. Serdyuk, and T. Virtanen, "Unsupervised Adversarial Domain Adaptation for Acoustic Scene Classification," in Detection and Classification of Acoustic Scenes and Events, 2018.
    [BibTeX] [Abstract] [Download PDF]

    A general problem in the acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduce the performance of the developed methods in terms of classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and, by using data from the other set of conditions, we adapt the model so that its output cannot be used for classifying the set of conditions that the input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where the annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model-agnostic method we can achieve a ∼10% increase in accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.

    @inproceedings{2018_DCASE_a,
    author = "Gharib, Shayan and Drossos, Konstantinos and Cakir, Emre and Serdyuk, Dmitriy and Virtanen, Tuomas",
    abstract = "A general problem in acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the performance of the developed methods on classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and by using data from other set of conditions, we adapt the model in order that its output cannot be used for classifying the set of conditions that input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where the annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model agnostic method we can achieve ∼10\% increase at the accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.",
    booktitle = "Detection and Classification of Acoustic Scenes and Events",
    isbn = "978-952-15-4262-6",
    keywords = "acoustic scene classification; domain adaptation; generative adversarial networks",
    publisher = "Tampere University of Technology",
    title = "{U}nsupervised {A}dversarial {D}omain {A}daptation for {A}coustic {S}cene {C}lassification",
    url = "https://arxiv.org/abs/1808.05777",
    year = "2018"
    }
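
    The abstract above describes adapting a model pre-trained on one recording device with unlabeled data from another device, so that a discriminator can no longer tell which device the features came from. Below is a minimal PyTorch sketch of one common way to realize this (an ADDA-style alternating scheme with a fixed source encoder); the encoder/discriminator interfaces, data loaders, optimizers, and schedule are hypothetical placeholders rather than the authors' setup.

    # Minimal sketch of unsupervised adversarial domain adaptation (ADDA-style).
    # Assumptions: source_loader yields (features, label) pairs, target_loader yields
    # unlabeled features, and the discriminator outputs a single logit per example.
    import torch
    import torch.nn as nn

    def adapt(source_encoder, target_encoder, discriminator,
              source_loader, target_loader, n_epochs=10, lr=1e-4):
        bce = nn.BCEWithLogitsLoss()
        opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
        opt_t = torch.optim.Adam(target_encoder.parameters(), lr=lr)
        source_encoder.eval()                 # the pre-trained source model stays fixed
        for _ in range(n_epochs):
            for (x_s, _), x_t in zip(source_loader, target_loader):
                # 1) Train the discriminator: source features -> 1, target features -> 0.
                with torch.no_grad():
                    f_s = source_encoder(x_s)
                f_t = target_encoder(x_t)
                logits = discriminator(torch.cat([f_s, f_t.detach()], dim=0))
                labels = torch.cat([torch.ones(len(f_s), 1), torch.zeros(len(f_t), 1)])
                loss_d = bce(logits, labels)
                opt_d.zero_grad()
                loss_d.backward()
                opt_d.step()
                # 2) Train the target encoder to make its features look like source ones.
                loss_t = bce(discriminator(target_encoder(x_t)), torch.ones(len(x_t), 1))
                opt_t.zero_grad()
                loss_t.backward()
                opt_t.step()
        return target_encoder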

  • T. Heittola, E. Cakir, and T. Virtanen, "The machine learning approach for analysis of sound scenes and events," in Computational Analysis of Sound Scenes and Events, Springer, 2018, p. 13–40. doi:10.1007/978-3-319-63450-0_2
    [BibTeX] [Abstract]

    This chapter explains the basic concepts in computational methods used for analysis of sound scenes and events. Even though the analysis tasks in many applications seem different, the underlying computational methods are typically based on the same principles. We explain the commonalities between analysis tasks such as sound event detection, sound scene classification, or audio tagging. We focus on the machine learning approach, where the sound categories (i.e., classes) to be analyzed are defined in advance. We explain the typical components of an analysis system, including signal pre-processing, feature extraction, and pattern classification. We also present an example system based on multi-label deep neural networks, which has been found to be applicable in many analysis tasks discussed in this book. Finally, we explain the whole processing chain involved in developing computational audio analysis systems.

    @inbook{2018,
    author = "Heittola, Toni and Cakir, Emre and Virtanen, Tuomas",
    abstract = "This chapter explains the basic concepts in computational methods used for analysis of sound scenes and events. Even though the analysis tasks in many applications seem different, the underlying computational methods are typically based on the same principles. We explain the commonalities between analysis tasks such as sound event detection, sound scene classification, or audio tagging. We focus on the machine learning approach, where the sound categories (i.e., classes) to be analyzed are defined in advance. We explain the typical components of an analysis system, including signal pre-processing, feature extraction, and pattern classification. We also preset an example system based on multi-label deep neural networks, which has been found to be applicable in many analysis tasks discussed in this book. Finally, we explain the whole processing chain that involves developing computational audio analysis systems.",
    booktitle = "Computational Analysis of Sound Scenes and Events",
    doi = "10.1007/978-3-319-63450-0\_2",
    editor2 = "Tuomas Virtanen and Plumbley, Mark D. and Dan Ellis",
    isbn = "978-3-319-63449-4",
    month = "9",
    pages = "13--40",
    publisher = "Springer",
    title = "The machine learning approach for analysis of sound scenes and events",
    year = "2018"
    }

  • G. Huang, T. Heittola, and T. Virtanen, "Using sequential information in polyphonic sound event detection," in 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, 2018, p. 291–295. doi:10.1109/IWAENC.2018.8521367
    [BibTeX] [Abstract] [Download PDF]

    To detect the class, and start and end times of sound events in real world recordings is a challenging task. Current computer systems often show relatively high frame-wise accuracy but low event-wise accuracy. In this paper, we attempted to merge the gap by explicitly including sequential information to improve the performance of a state-of-the-art polyphonic sound event detection system. We propose to 1) use delayed predictions of event activities as additional input features that are fed back to the neural network; 2) build N-grams to model the co-occurrence probabilities of different events; 3) use sequential loss to train neural networks. Our experiments on a corpus of real world recordings show that the N-grams could smooth the spiky output of a state-of-the-art neural network system, and improve both the frame-wise and the event-wise metrics.

    @inproceedings{2018_IWAENC_d,
    author = "Huang, Guangpu and Heittola, Toni and Virtanen, Tuomas",
    abstract = "To detect the class, and start and end times of sound events in real world recordings is a challenging task. Current computer systems often show relatively high frame-wise accuracy but low event-wise accuracy. In this paper, we attempted to merge the gap by explicitly including sequential information to improve the performance of a state-of-the-art polyphonic sound event detection system. We propose to 1) use delayed predictions of event activities as additional input features that are fed back to the neural network; 2) build N-grams to model the co-occurrence probabilities of different events; 3) use se-quentialloss to train neural networks. Our experiments on a corpus of real world recordings show that the N-grams could smooth the spiky output of a state-of-the-art neural network system, and improve both the frame-wise and the event-wise metrics.",
    booktitle = "16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018",
    day = "2",
    doi = "10.1109/IWAENC.2018.8521367",
    keywords = "Language modelling; Polyphonic sound event detection; Sequential information",
    month = "11",
    pages = "291--295",
    publisher = "IEEE",
    title = "Using sequential information in polyphonic sound event detection",
    year = "2018",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/huang\_iwaenc\_2018.pdf"
    }

  • P. Magron and T. Virtanen, "Expectation-maximization algorithms for Itakura-Saito nonnegative matrix factorization," in Interspeech 2018, 2018.
    [BibTeX] [Abstract] [Download PDF]

    This paper presents novel expectation-maximization (EM) algorithms for estimating the nonnegative matrix factorization model with Itakura-Saito divergence. Indeed, the common EM-based approach exploits the space-alternating generalized EM (SAGE) variant of EM but it usually performs worse than the conventional multiplicative algorithm. We propose to explore more exhaustively those algorithms, in particular the choice of the methodology (standard EM or SAGE variant) and the latent variable set (full or reduced). We then derive four EM-based algorithms, among which three are novel. Speech separation experiments show that one of those novel algorithms using a standard EM methodology and a reduced set of latent variables outperforms its SAGE variants and competes with the conventional multiplicative algorithm.

    @inproceedings{2018_InterSpecch,
    author = "Magron, Paul and Virtanen, Tuomas",
    abstract = "This paper presents novel expectation-maximization (EM) algorithms for estimating the nonnegative matrix factorization model with Itakura-Saito divergence. Indeed, the common EM-based approach exploits the space-alternating generalized EM (SAGE) variant of EM but it usually performs worse than the conventional multiplicative algorithm. We propose to explore more exhaustively those algorithms, in particular the choice of the methodology (standard EM or SAGE variant) and the latent variable set (full or reduced). We then derive four EM-based algorithms, among which three are novel. Speech separation experiments show that one of those novel algorithms using a standard EM methodology and a reduced set of latent variables outperforms its SAGE variants and competes with the conventional multiplicative algorithm.",
    booktitle = "Interspeech 2018",
    keywords = "source separation",
    publisher = "Interspeech",
    series = "Interspeech",
    title = "{E}xpectation-maximization algorithms for {I}takura-{S}aito nonnegative matrix factorization",
    year = "2018",
    url = "https://hal.archives-ouvertes.fr/hal-01632082/document"
    }
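
    For context, the conventional multiplicative-update algorithm that the entry above uses as its point of comparison can be sketched in a few lines of NumPy. This is a generic Itakura-Saito NMF baseline for a power spectrogram V ≈ WH; the component count, iteration count, and flooring constant are illustrative assumptions.

    # Minimal sketch of NMF with the Itakura-Saito divergence (beta = 0),
    # estimated with the conventional multiplicative updates.
    import numpy as np

    def is_nmf(V, n_components=10, n_iter=200, eps=1e-12, seed=0):
        """Factorize a power spectrogram V (freq x frames) as V ~ W @ H."""
        rng = np.random.default_rng(seed)
        F, N = V.shape
        W = rng.random((F, n_components)) + eps
        H = rng.random((n_components, N)) + eps
        for _ in range(n_iter):
            V_hat = W @ H + eps
            H *= (W.T @ (V * V_hat ** -2)) / (W.T @ V_hat ** -1 + eps)
            V_hat = W @ H + eps
            W *= ((V * V_hat ** -2) @ H.T) / (V_hat ** -1 @ H.T + eps)
        return W, H

    # Example: V = np.abs(stft_matrix) ** 2; W, H = is_nmf(V, n_components=20)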

  • P. Magron and T. Virtanen, "Bayesian anisotropic Gaussian model for audio source separation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018. doi:10.1109/ICASSP.2018.8461741
    [BibTeX] [Abstract] [Download PDF]

    In audio source separation applications, it is common to model the sources as circular-symmetric Gaussian random variables, which is equivalent to assuming that the phase of each source is uniformly distributed. In this paper, we introduce an anisotropic Gaussian source model in which both the magnitude and phase parameters are modeled as random variables. In such a model, it becomes possible to promote a phase value that originates from a signal model and to adjust the relative importance of this underlying model-based phase constraint. We conduct Bayesian inference of the model through the derivation of an expectation-maximization algorithm for estimating the parameters. Experiments conducted on realistic music songs for a monaural source separation task, in a scenario where the variance parameters are assumed known, show that the proposed approach outperforms state-of-the-art techniques.

    @inproceedings{2018_ICASSP_b,
    author = "Magron, Paul and Virtanen, Tuomas",
    abstract = "In audio source separation applications, it is common to model the sources as circular-symmetric Gaussian random variables, which is equivalent to assuming that the phase of each source is uniformly distributed. In this paper, we introduce an anisotropic Gaussian source model in which both the magnitude and phase parameters are modeled as random variables. In such a model, it becomes possible to promote a phase value that originates from a signal model and to adjust the relative importance of this underlying model-based phase constraint. We conduct Bayesian inference of the model through the derivation of an expectation-maximization algorithm for estimating the parameters. Experiments conducted on realistic music songs for a monaural source separation task, in an scenario where the variance parameters are assumed known, show that the proposed approach outperforms state-of-the-art techniques.",
    booktitle = "2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    doi = "10.1109/ICASSP.2018.8461741",
    keywords = "source separation",
    month = "4",
    publisher = "IEEE",
    title = "Bayesian anisotropic {G}aussian model for audio source separation",
    year = "2018",
    url = "https://hal.archives-ouvertes.fr/hal-01632081/document"
    }

  • P. Magron and T. Virtanen, "Towards Complex Nonnegative Matrix Factorization with the Beta-Divergence," in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 2018, p. 156–160. doi:10.1109/IWAENC.2018.8521317
    [BibTeX] [Abstract] [Download PDF]

    Complex nonnegative matrix factorization (NMF) is a powerful tool for decomposing audio spectrograms while accounting for some phase information in the time-frequency domain. While its estimation was originally based on the Euclidean distance, in this paper we propose to extend it to any beta-divergence, a family of functions widely used in audio to estimate NMF. To this end, we introduce the beta-divergence in a heuristic fashion within a phase-aware probabilistic model. Estimating this model results in performing an NMF with Itakura-Saito (IS) divergence on a quantity called the phase-corrected posterior power of the sources, which is both phase-dependent and nonnegative-valued. Therefore, we replace IS with the beta-divergence, so that the factorization uses an optimal distortion metric and remains phase-aware. Even though by doing so we lose theoretical convergence guarantees, the resulting algorithm demonstrates its potential for an audio source separation task, where it outperforms previous complex NMF approaches.

    @inproceedings{2018_IWAENC_e,
    author = "Magron, Paul and Virtanen, Tuomas",
    abstract = "Complex nonnegative matrix factorization (NMF) is a powerful tool for decomposing audio spectrograms while accounting for some phase information in the time-frequency domain. While its estimation was originally based on the Euclidean distance, in this paper we propose to extend it to any beta-divergence, a family of functions widely used in audio to estimate NMF. To this end, we introduce the beta-divergence in a heuristic fashion within a phase-aware probabilistic model. Estimating this model results in performing an NMF with Itakura-Saito (IS) divergence on a quantity called the phase-corrected posterior power of the sources, which is both phase-dependent and nonnegative-valued. Therefore, we replace IS with the beta-divergence, so that the factorization uses an optimal distortion metric and remains phase-aware. Even though by doing so we loose theoretical convergence guarantees, the resulting algorithm demonstrates its potential for an audio source separation task, where it outperforms previous complex NMFs approaches.",
    booktitle = "2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC)",
    doi = "10.1109/IWAENC.2018.8521317",
    isbn = "978-1-5386-8152-7",
    keywords = "source separation",
    month = "9",
    pages = "156--160",
    publisher = "IEEE",
    title = "Towards Complex Nonnegative Matrix Factorization with the Beta-Divergence",
    year = "2018",
    url = "https://hal.archives-ouvertes.fr/hal-01779664/document"
    }

  • P. Magron and T. Virtanen, "On modeling the STFT phase of audio signals with the von Mises distribution," in 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, 2018. doi:10.1109/IWAENC.2018.8521323
    [BibTeX] [Abstract] [Download PDF]

    In this paper, we study statistical models for the phase of the short-term Fourier transform (STFT) of audio signals. STFT phase globally appears as uniformly distributed, which has led researchers in this field to model it as a uniform random variable. However, some information about the phase can be obtained from a sinusoidal model, which reveals its local structure. Therefore, we propose to model the phase with a von Mises (VM) random variable, which enables us to favor the sinusoidal model-based phase value. We estimate the distribution parameters and we validate this model on real audio data. In particular, we observe that both models (uniform and VM) are relevant from a statistical perspective but they convey different information about the phase (global vs. local). We also apply this VM model to an audio source separation task, where it outperforms previous approaches.

    @inproceedings{2018_IWAENC_f,
    author = "Magron, Paul and Virtanen, Tuomas",
    abstract = "In this paper, we study statistical models for the phase of the short-term Fourier transform (STFT) of audio signals. STFT phase globally appears as uniformly distributed, which has led researchers in this field to model it as a uniform random variable. However, some information about the phase can be obtained from a sinusoidal model, which reveals its local structure. Therefore, we propose to model the phase with a von Mises (VM) random variable, which enables us to favor the sinusoidal model-based phase value. We estimate the distribution parameters and we validate this model on real audio data. In particular, we observe that both models (uniform and VM) are relevant from a statistical perspective but they convey different information about the phase (global vs. local). We also apply this VM model to an audio source separation task, where it outperforms previous approaches.",
    booktitle = "16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018",
    day = "2",
    doi = "10.1109/IWAENC.2018.8521323",
    keywords = "source separation",
    month = "11",
    publisher = "IEEE",
    title = "On modeling the {STFT} phase of audio signals with the von {M}ises distribution",
    year = "2018",
    url = "https://hal.archives-ouvertes.fr/hal-01763147v2/document"
    }
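
    For reference, the two ingredients named in the abstract, the von Mises density over the phase and the sinusoidal-model value it is centred on, can be written as follows (editor's notation; the paper's exact parameterization may differ):

    % Von Mises density over the phase \phi, with location \mu and concentration \kappa;
    % I_0 denotes the modified Bessel function of the first kind of order zero.
    \[
      p(\phi \mid \mu, \kappa) = \frac{\exp\!\big(\kappa \cos(\phi - \mu)\big)}{2\pi I_0(\kappa)} .
    \]
    % The location parameter can be taken from the sinusoidal model by unwrapping the
    % phase of the previous frame, with hop size l (samples), sampling rate f_s, and
    % instantaneous frequency \nu_{f,t} in frequency bin f and frame t:
    \[
      \mu_{f,t} = \phi_{f,t-1} + 2\pi \, \frac{l}{f_{\mathrm{s}}} \, \nu_{f,t} .
    \]
    % As \kappa approaches 0 the density tends to the uniform distribution; larger
    % \kappa concentrates the phase around the model-based value \mu.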

  • P. Magron and T. Virtanen, "Complex ISNMF: a phase-aware model for monaural audio source separation," IEEE-ACM Transactions on Audio Speech and Language Processing, vol. 27, iss. 1, p. 20–31, 2018. doi:10.1109/TASLP.2018.2869684
    [BibTeX] [Abstract] [Download PDF]

    This paper introduces a phase-aware probabilistic model for audio source separation. Classical source models in the short-term Fourier transform domain use circularly-symmetric Gaussian or Poisson random variables. This is equivalent to assuming that the phase of each source is uniformly distributed, which is not suitable for exploiting the underlying structure of the phase. Drawing on preliminary works, we introduce here a Bayesian anisotropic Gaussian source model in which the phase is no longer uniform. Such a model permits us to favor a phase value that originates from a signal model through a Markov chain prior structure. The variances of the latent variables are structured with nonnegative matrix factorization (NMF). The resulting model is called complex Itakura-Saito NMF (ISNMF) since it generalizes the ISNMF model to the case of non-isotropic variables. It combines the advantages of ISNMF, which uses a distortion measure adapted to audio and yields a set of estimates which preserve the overall energy of the mixture, and of complex NMF, which enables one to account for some phase constraints. We derive a generalized expectation-maximization algorithm to estimate the model parameters. Experiments conducted on a musical source separation task in a semi-informed setting show that the proposed approach outperforms state-of-the-art phase-aware separation techniques.

    @article{2018_TASLP_b,
    author = "Magron, Paul and Virtanen, Tuomas",
    abstract = "This paper introduces a phase-aware probabilistic model for audio source separation. Classical source models in the short-term Fourier transform domain use circularly-symmetric Gaussian or Poisson random variables. This is equivalent to assuming that the phase of each source is uniformly distributed, which is not suitable for exploiting the underlying structure of the phase. Drawing on preliminary works, we introduce here a Bayesian anisotropic Gaussian source model in which the phase is no longer uniform. Such a model permits us to favor a phase value that originates from a signal model through a Markov chain prior structure. The variance of the latent variables are structured with nonnegative matrix factorization (NMF). The resulting model is called complex Itakura-Saito NMF (ISNMF) since it generalizes the ISNMF model to the case of non-isotropic variables. It combines the advantages of ISNMF, which uses a distortion measure adapted to audio and yields a set of estimates which preserve the overall energy of the mixture, and of complex NMF, which enables one to account for some phase constraints. We derive a generalized expectation-maximization algorithm to estimate the model parameters. Experiments conducted on a musical source separation task in a semi-informed setting show that the proposed approach outperforms state-of-the-art phase-aware separation techniques.",
    day = "10",
    doi = "10.1109/TASLP.2018.2869684",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "source separation",
    month = "10",
    number = "1",
    pages = "20--31",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "{C}omplex {ISNMF}: a phase-aware model for monaural audio source separation",
    volume = "27",
    year = "2018",
    url = "https://arxiv.org/abs/1802.03156"
    }

  • P. Magron, K. Drossos, S. I. Mimilakis, and T. Virtanen, "Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation," in Interspeech, 2018.
    [BibTeX] [Abstract] [Download PDF]

    State-of-the-art methods for monaural singing voice separation consist in estimating the magnitude spectrum of the voice in the short-term Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that those algorithms are effective for reducing interference in the estimated voice compared to the baseline approach.

    @inproceedings{2018_InterSpecch_a,
    author = "Magron, Paul and Drossos, Konstantinos and Mimilakis, Stylianos Ioannis and Virtanen, Tuomas",
    abstract = "State-of-the-art methods for monaural singing voice separation consist in estimating the magnitude spectrum of the voice in the short-term Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate on recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency , a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that those algorithms are efficient for reducing interference in the estimated voice compared to the baseline approach.",
    booktitle = "Interspeech",
    issn = "2308-457X",
    keywords = "monaural singing voice separation; voice recovery; deep neural networks; MaD TwinNet; Wiener Filtering",
    title = "{R}educing {I}nterference with {P}hase {R}ecovery in {DNN}-based {M}onaural {S}inging {V}oice {S}eparation",
    url = "https://hal.archives-ouvertes.fr/hal-01741278v2",
    year = "2018"
    }
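
    The "consistency" constraint mentioned in the abstract reflects the fact that, because of the overlap between analysis windows, not every complex-valued matrix is the STFT of some time-domain signal. A classical way to exploit it is the Griffin-Lim iteration sketched below in Python/SciPy; this is meant only to illustrate the constraint, is not necessarily one of the specific algorithms evaluated in the paper, and uses illustrative STFT parameters.

    # Minimal sketch of consistency-based phase recovery (Griffin-Lim iteration):
    # alternate between projecting onto consistent STFTs (istft followed by stft)
    # and re-imposing the estimated magnitude.
    import numpy as np
    from scipy.signal import stft, istft

    def griffin_lim(magnitude, n_iter=50, fs=44100, nperseg=1024, noverlap=768, seed=0):
        """magnitude: one-sided magnitude spectrogram, shape (nperseg // 2 + 1, frames)."""
        rng = np.random.default_rng(seed)
        X = magnitude * np.exp(1j * rng.uniform(-np.pi, np.pi, size=magnitude.shape))
        for _ in range(n_iter):
            _, x = istft(X, fs=fs, nperseg=nperseg, noverlap=noverlap)
            _, _, X_consistent = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
            # Keep the (consistent) phase, re-impose the target magnitude.
            X = magnitude * np.exp(1j * np.angle(X_consistent[:, :magnitude.shape[1]]))
        _, x = istft(X, fs=fs, nperseg=nperseg, noverlap=noverlap)
        return x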

  • K. Mahkonen, T. Virtanen, and J. Kämäräinen, "Cascade of Boolean detector combinations," Eurasip Journal on Image and Video Processing, vol. 2018, p. 1–22, 2018.
    [BibTeX] [Abstract]

    This paper considers a scenario in which we have multiple pre-trained detectors for detecting an event and a small dataset for training a combined detection system. We build the combined detector as a Boolean function of thresholded detector scores and implement it as a binary classification cascade. The cascade structure is computationally efficient, as it provides the possibility of early termination. For the proposed Boolean combination function, the computational load of classification is reduced whenever the function becomes determinate before all the component detectors have been utilized. We also propose an algorithm that selects all the needed thresholds for the component detectors within the proposed Boolean combination. We present results on two audio-visual datasets, which prove the efficiency of the proposed combination framework. We achieve state-of-the-art accuracy with substantially reduced computation time in a laughter detection task, and our algorithm finds better thresholds for the component detectors within the Boolean combination than the other algorithms found in the literature.

    @article{2018_JIV,
    author = {Mahkonen, Katariina and Virtanen, Tuomas and K{\"a}m{\"a}r{\"a}inen, Joni},
    abstract = "This paper considers a scenario when we have multiple pre-trained detectors for detecting an event and a small dataset for training a combined detection system. We build the combined detector as a Boolean function of thresholded detector scores and implement it as a binary classification cascade. The cascade structure is computationally efficient by providing the possibility to early termination. For the proposed Boolean combination function, the computational load of classification is reduced whenever the function becomes determinate before all the component detectors have been utilized. We also propose an algorithm, which selects all the needed thresholds for the component detectors within the proposed Boolean combination. We present results on two audio-visual datasets, which prove the efficiency of the proposed combination framework. We achieve state-of-the-art accuracy with substantially reduced computation time in laughter detection task, and our algorithm finds better thresholds for the component detectors within the Boolean combination than the other algorithms found in the literature.",
    keywords = "Binary classification;Boolean combination;Classification cascade",
    title = "{C}ascade of {B}oolean detector combinations",
    journal = "Eurasip Journal on Image and Video Processing",
    volume = "2018",
    pages = "1--22",
    year = "2018",
    publisher = "Springer"
    }
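
    The core mechanism described above, a Boolean combination of thresholded detector scores evaluated as a cascade with early termination, can be sketched as follows. The disjunctive-normal-form combination, the detectors, and the thresholds are illustrative placeholders; the paper's algorithm for selecting the thresholds is not reproduced here.

    # Minimal sketch: evaluate a Boolean combination (in disjunctive normal form)
    # of thresholded detector scores, running each detector at most once and
    # stopping as soon as the function value is determined (early termination).
    def cascade_detect(x, detectors, thresholds, dnf):
        """detectors: callables returning a score for input x; thresholds: per-detector
        decision thresholds; dnf: e.g. [(0, 1), (2,)] meaning (d0 AND d1) OR d2."""
        outcomes = {}                          # detector index -> thresholded decision

        def decision(i):
            if i not in outcomes:              # lazy evaluation: run detectors on demand
                outcomes[i] = detectors[i](x) >= thresholds[i]
            return outcomes[i]

        for clause in dnf:
            # all() short-circuits at the first False literal within a clause, and
            # returning True here short-circuits the evaluation of remaining clauses.
            if all(decision(i) for i in clause):
                return True
        return False

    # Example with toy score functions:
    # detectors = [lambda x: x.mean(), lambda x: x.max(), lambda x: x.std()]
    # cascade_detect(signal, detectors, thresholds=[0.1, 0.8, 0.05], dnf=[(0, 1), (2,)])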

  • P. Maijala, Z. Shuyang, T. Heittola, and T. Virtanen, "Environmental noise monitoring using source classification in sensors," Applied Acoustics, vol. 129, p. 258–267, 2018. doi:10.1016/j.apacoust.2017.08.006
    [BibTeX] [Abstract]

    Environmental noise monitoring systems continuously measure sound levels without assigning these measurements to different noise sources in the acoustic scene, and are therefore incapable of identifying the main noise source. In this paper, a feasibility study is presented on a new monitoring concept in which an acoustic pattern classification algorithm running in a wireless sensor is used to automatically assign the measured sound level to different noise sources. A supervised noise source classifier is learned from a small amount of manually annotated recordings and the learned classifier is used to automatically detect the activity of the target noise source in the presence of interfering noise sources. The sensor is based on an inexpensive credit-card-sized single-board computer with a microphone and associated electronics and wireless connectivity. The measurement results and the noise source information are transferred from the sensors scattered around the measurement site to a cloud service and a noise portal is used to visualise the measurements to users. The proposed noise monitoring concept was piloted on a rock crushing site. The system ran reliably over 50 days on site, during which it was able to recognise more than 90% of the noise sources correctly. The pilot study shows that the proposed noise monitoring system can reduce the amount of required human validation of the sound level measurements when the target noise source is clearly defined.

    @article{2018_AA,
    author = "Maijala, Panu and Shuyang, Zhao and Heittola, Toni and Virtanen, Tuomas",
    title = "Environmental noise monitoring using source classification in sensors",
    abstract = "Environmental noise monitoring systems continuously measure sound levels without assigning these measurements to different noise sources in the acoustic scenes, therefore incapable of identifying the main noise source. In this paper a feasibility study is presented on a new monitoring concept in which an acoustic pattern classification algorithm running in a wireless sensor is used to automatically assign the measured sound level to different noise sources. A supervised noise source classifier is learned from a small amount of manually annotated recordings and the learned classifier is used to automatically detect the activity of target noise source in the presence of interfering noise sources. The sensor is based on an inexpensive credit-card-sized single-board computer with a microphone and associated electronics and wireless connectivity. The measurement results and the noise source information are transferred from the sensors scattered around the measurement site to a cloud service and a noise portal is used to visualise the measurements to users. The proposed noise monitoring concept was piloted on a rock crushing site. The system ran reliably over 50 days on site, during which it was able to recognise more than 90\\% of the noise sources correctly. The pilot study shows that the proposed noise monitoring system can reduce the amount of required human validation of the sound level measurements when the target noise source is clearly defined.",
    keywords = "Acoustic pattern classification, Cloud service, Environmental noise monitoring, Wireless sensor network",
    note = {EXT={"}Maijala, Panu{"}},
    year = "2018",
    doi = "10.1016/j.apacoust.2017.08.006",
    language = "English",
    volume = "129",
    pages = "258--267",
    journal = "Applied Acoustics",
    issn = "0003-682X",
    publisher = "Elsevier Limited"
    }

  • A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, "Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge," IEEE-ACM Transactions on Audio Speech and Language Processing, 2018. doi:10.1109/TASLP.2017.2778423
    [BibTeX] [Abstract] [Download PDF]

    Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016) has offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present in detail each task and analyse the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

    @article{2018_TASLP,
    author = "Mesaros, Annamaria and Heittola, Toni and Benetos, Emmanouil and Foster, Peter and Lagrange, Mathieu and Virtanen, Tuomas and Plumbley, Mark D.",
    abstract = "Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016) has offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present in detail each task and analyse the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.",
    doi = "10.1109/TASLP.2017.2778423",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "Acoustic scene classification;Acoustics;audio datasets;Event detection;Hidden Markov models;pattern recognition;sound event detection;Speech;Speech processing;Tagging;audio tagging",
    month = "11",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Detection and Classification of Acoustic Scenes and Events: Outcome of the {DCASE} 2016 Challenge",
    year = "2018",
    url = "https://trepo.tuni.fi//bitstream/handle/10024/126402/dcase2016\_taslp.pdf?sequence=1"
    }

  • A. Mesaros, T. Heittola, and T. Virtanen, "Acoustic scene classification: An overview of DCASE 2017 challenge entries," in 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, 2018, p. 411–415. doi:10.1109/IWAENC.2018.8521242
    [BibTeX] [Abstract] [Download PDF]

    We present an overview of the challenge entries for the Acoustic Scene Classification task of DCASE 2017 Challenge. Being the most popular task of the challenge, acoustic scene classification entries provide a wide variety of approaches for comparison, with a wide performance gap from top to bottom. Analysis of the submissions confirms once more the popularity of deep-learning approaches and mel frequency representations. Statistical analysis indicates that the top ranked system performed significantly better than the others, and that combinations of top systems are capable of reaching close to perfect performance on the given data.

    @inproceedings{2018_IWAENC_b,
    author = "Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas",
    abstract = "We present an overview of the challenge entries for the Acoustic Scene Classification task of DCASE 2017 Challenge. Being the most popular task of the challenge, acoustic scene classification entries provide a wide variety of approaches for comparison, with a wide performance gap from top to bottom. Analysis of the submissions confirms once more the popularity of deep-learning approaches and mel frequency representations. Statistical analysis indicates that the top ranked system performed significantly better than the others, and that combinations of top systems are capable of reaching close to perfect performance on the given data.",
    booktitle = "16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018",
    day = "2",
    doi = "10.1109/IWAENC.2018.8521242",
    keywords = "Acoustic scene classification; Audio classb ification; DCASE challenge",
    month = "11",
    pages = "411--415",
    publisher = "IEEE",
    title = "Acoustic scene classification: {A}n overview of {DCASE} 2017 challenge entries",
    year = "2018",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/mesaros-iwaenc2018-asc-in-dcase2017.pdf"
    }

  • A. Mesaros, T. Heittola, and T. Virtanen, "A multi-device dataset for urban acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop, 2018, p. 9–13.
    [BibTeX] [Abstract] [Download PDF]

    This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

    @inproceedings{2018_DCASE,
    author = "Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas",
    abstract = "This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop",
    keywords = "Acoustic scene classification",
    pages = "9--13",
    title = "A multi-device dataset for urban acoustic scene classification",
    year = "2018",
    url = "http://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop\_Mesaros\_8.pdf"
    }
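
    For readers reproducing this kind of setup, the typical input to such a CNN baseline is a log-mel spectrogram; a minimal librosa sketch is given below. The analysis parameters and the file name are illustrative assumptions, not necessarily those of the DCASE 2018 baseline system.

    # Minimal sketch: log-mel spectrogram features for acoustic scene classification.
    import numpy as np
    import librosa

    def log_mel_features(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=40):
        y, sr = librosa.load(path, sr=sr, mono=True)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(mel, ref=np.max)   # shape (n_mels, frames), in dB

    # features = log_mel_features("scene_example.wav")  # hypothetical file name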

  • A. Mesaros, T. Heittola, and D. Ellis, "Datasets and Evaluation," in Computational Analysis of Sound Scenes and Events, Springer, 2018, p. 147–179. doi:10.1007/978-3-319-63450-0_6
    [BibTeX] [Abstract]

    Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.

    @inbook{2018_a,
    author = "Mesaros, Annamaria and Heittola, Toni and Ellis, Dan",
    abstract = "Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.",
    booktitle = "Computational Analysis of Sound Scenes and Events",
    doi = "10.1007/978-3-319-63450-0\_6",
    editor2 = "Tuomas Virtanen and Mark D. Plumbley and Dan Ellis",
    isbn = "978-3-319-63449-4",
    month = "9",
    pages = "147--179",
    publisher = "Springer",
    title = "Datasets and Evaluation",
    year = "2018"
    }
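
    Two metrics commonly used for sound event detection evaluation, the frame-wise F-score and the segment-based error rate built from substitutions, deletions, and insertions, can be computed as sketched below. This is a generic NumPy illustration following common practice (e.g. the sed_eval toolbox), not a definition taken from the chapter.

    # Minimal sketch: frame-wise F-score and error rate for binary activity
    # matrices of shape (n_events, n_frames).
    import numpy as np

    def framewise_metrics(reference, estimate):
        tp = np.logical_and(reference == 1, estimate == 1).sum()
        fp = np.logical_and(reference == 0, estimate == 1).sum()
        fn = np.logical_and(reference == 1, estimate == 0).sum()
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

        # Per-frame substitutions, deletions and insertions for the error rate.
        fn_frame = np.logical_and(reference == 1, estimate == 0).sum(axis=0)
        fp_frame = np.logical_and(reference == 0, estimate == 1).sum(axis=0)
        substitutions = np.minimum(fn_frame, fp_frame).sum()
        deletions = np.maximum(0, fn_frame - fp_frame).sum()
        insertions = np.maximum(0, fp_frame - fn_frame).sum()
        error_rate = (substitutions + deletions + insertions) / max(reference.sum(), 1)
        return f_score, error_rate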

  • S. I. Mimilakis, K. Drossos, J. F. Santos, G. Schuller, T. Virtanen, and Y. Bengio, "Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, p. 721–725.
    [BibTeX] [Abstract] [Download PDF]

    Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.

    @inproceedings{2018_ICASSP,
    author = "Mimilakis, Stylianos Ioannis and Drossos, Konstantinos and Santos, Jo{\\textasciitilde a}o F. and Schuller, Gerald and Virtanen, Tuomas and Bengio, Yoshua",
    abstract = "Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.",
    title = "Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask",
    booktitle = "2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    pages = "721--725",
    year = "2018",
    organization = "IEEE",
    url = "https://arxiv.org/abs/1711.01437"
    }

  • S. I. Mimilakis, E. Cano, D. FitzGerald, K. Drossos, and G. Schuller, "Examining The Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation," in IEEE Asilomar Conference on Signals, Systems, and Computers, 2018.
    [BibTeX]
    @inproceedings{2018_b,
    author = "Mimilakis, Stylianos Ioannis and Cano, Estefania and FitzGerald, D. and Drossos, Konstantinos and Schuller, Gerald",
    booktitle = "IEEE Asilomar Conference on Signals, Systems, and Computers",
    publisher = "IEEE",
    title = "Examining The Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation",
    year = "2018"
    }

  • G. Naithani, J. Nikunen, L. Bramslow, and T. Virtanen, "Deep neural network based speech separation optimizing an objective estimator of intelligibility for low latency applications," in 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, 2018, p. 386–390. doi:10.1109/IWAENC.2018.8521379
    [BibTeX] [Abstract] [Download PDF]

    Mean square error (MSE) has been the preferred choice as loss function in the current deep neural network (DNN) based speech separation techniques. In this paper, we propose a new cost function with the aim of optimizing the extended short time objective intelligibility (ESTOI) measure. We focus on applications where low algorithmic latency (≤ 10 ms) is important. We use long short-term memory networks (LSTM) and evaluate our proposed approach on four sets of two-speaker mixtures from the extended Danish hearing in noise (HINT) dataset. We show that the proposed loss function can offer improved or on-par objective intelligibility (in terms of ESTOI) compared to an MSE-optimized baseline, while resulting in lower objective separation performance (in terms of the source to distortion ratio (SDR)). We then propose an approach where the network is first initialized with weights optimized for the MSE criterion and then trained with the proposed ESTOI loss criterion. This approach mitigates some of the losses in objective separation performance while preserving the gains in objective intelligibility.

    @inproceedings{2018_IWAENC,
    author = "Naithani, Gaurav and Nikunen, Joonas and Bramslow, Lars and Virtanen, Tuomas",
    abstract = "Mean square error (MSE) has been the preferred choice as loss function in the current deep neural network (DNN) based speech separation techniques. In this paper, we propose a new cost function with the aim of optimizing the extended short time objective intelligibility (ESTOI) measure. We focus on applications where low algorithmic latency (≤ 10 ms) is important. We use long short-term memory networks (LSTM) and evaluate our proposed approach on four sets of two-speaker mixtures from extended Danish hearing in noise (HINT) dataset. We show that the proposed loss function can offer improved or at par objective intelligibility (in terms of ESTOI) compared to an MSE optimized baseline while resulting in lower objective separation performance (in terms of the source to distortion ratio (SDR)). We then proceed to propose an approach where the network is first initialized with weights optimized for MSE criterion and then trained with the proposed ESTOI loss criterion. This approach mitigates some of the losses in objective separation performance while preserving the gains in objective intelligibility.",
    booktitle = "16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018",
    day = "2",
    doi = "10.1109/IWAENC.2018.8521379",
    keywords = "Deep neural networks; Low latency; Speech intelligibility; Speech separation",
    month = "11",
    pages = "386--390",
    publisher = "IEEE",
    title = "{D}eep neural network based speech separation optimizing an objective estimator of intelligibility for low latency applications",
    year = "2018",
    url = "https://arxiv.org/pdf/1807.06899.pdf"
    }
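
    A minimal sketch of the two-stage strategy described in the entry above: pre-train with MSE, then fine-tune with an intelligibility-based loss. The model, data loader, and estoi_loss (a user-supplied differentiable ESTOI approximation) are placeholders, and the hyperparameters are illustrative, not the paper's settings.

    import torch
    import torch.nn as nn

    def train_two_stage(model, loader, estoi_loss, epochs_mse=10, epochs_estoi=10):
        # Stage 1: initialise the separation network with the conventional MSE loss.
        # Stage 2: fine-tune the same weights with a differentiable, ESTOI-style loss
        # (estoi_loss is a placeholder supplied by the user, not a library function).
        optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
        mse = nn.MSELoss()
        for criterion, n_epochs in ((mse, epochs_mse), (estoi_loss, epochs_estoi)):
            for _ in range(n_epochs):
                for noisy, clean in loader:
                    optimiser.zero_grad()
                    loss = criterion(model(noisy), clean)
                    loss.backward()
                    optimiser.step()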

  • G. Naithani, J. Kivinummi, T. Virtanen, O. Tammela, M. J. Peltola, and J. M. Leppänen, "Automatic segmentation of infant cry signals using hidden Markov models," Eurasip Journal on Audio, Speech, and Music Processing, vol. 2018, iss. 1, 2018. doi:10.1186/s13636-018-0124-x
    [BibTeX] [Abstract] [Download PDF]

    Automatic extraction of acoustic regions of interest from recordings captured in realistic clinical environments is a necessary preprocessing step in any cry analysis system. In this study, we propose a hidden Markov model (HMM) based audio segmentation method to identify the relevant acoustic parts of the cry signal (i.e., expiratory and inspiratory phases) from recordings made in natural environments with various interfering acoustic sources. We examine and optimize the performance of the system by using different audio features and HMM topologies. In particular, we propose using fundamental frequency and aperiodicity features. We also propose a method for adapting the segmentation system trained on acoustic material captured in a particular acoustic environment to a different acoustic environment by using feature normalization and semi-supervised learning (SSL). The performance of the system was evaluated by analyzing a total of 3 h and 10 min of audio material from 109 infants, captured in a variety of recording conditions in hospital wards and clinics. The proposed system yields frame-based accuracy up to 89.2\%. We conclude that the proposed system offers a solution for automated segmentation of cry signals in cry analysis applications.

    @article{2018_JASM,
    author = {Naithani, Gaurav and Kivinummi, Jaana and Virtanen, Tuomas and Tammela, Outi and Peltola, Mikko J. and Lepp{\"a}nen, Jukka M.},
    abstract = "Automatic extraction of acoustic regions of interest from recordings captured in realistic clinical environments is a necessary preprocessing step in any cry analysis system. In this study, we propose a hidden Markov model (HMM) based audio segmentation method to identify the relevant acoustic parts of the cry signal (i.e., expiratory and inspiratory phases) from recordings made in natural environments with various interfering acoustic sources. We examine and optimize the performance of the system by using different audio features and HMM topologies. In particular, we propose using fundamental frequency and aperiodicity features. We also propose a method for adapting the segmentation system trained on acoustic material captured in a particular acoustic environment to a different acoustic environment by using feature normalization and semi-supervised learning (SSL). The performance of the system was evaluated by analyzing a total of 3 h and 10 min of audio material from 109 infants, captured in a variety of recording conditions in hospital wards and clinics. The proposed system yields frame-based accuracy up to 89.2\%. We conclude that the proposed system offers a solution for automated segmentation of cry signals in cry analysis applications.",
    doi = "10.1186/s13636-018-0124-x",
    issn = "1687-4714",
    journal = "Eurasip Journal on Audio, Speech, and Music Processing",
    keywords = "Acoustic analysis;Audio segmentation;Hidden Markov models;Infant cry analysis;Model adaptation",
    number = "1",
    publisher = "Springer Verlag",
    title = "Automatic segmentation of infant cry signals using hidden {M}arkov models",
    volume = "2018",
    year = "2018",
    url = "https://doi.org/10.1186/s13636-018-0124-x"
    }

  • J. Nikunen and T. Virtanen, "Estimation of time-varying room impulse responses of multiple sound sources from observed mixture and isolated source signals," in 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings, United States, 2018, p. 421–425. doi:10.1109/ICASSP.2018.8462535
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a method for online estimation of time-varying room impulse responses (RIR) between multiple isolated sound sources and a far-field mixture. The algorithm is formulated as adaptive convolutive filtering in short-time Fourier transform (STFT) domain. We use the recursive least squares (RLS) algorithm for estimating the filter parameters due to its fast convergence rate, which is required for modeling rapidly changing RIRs of moving sound sources. The proposed method allows separation of reverberated sources from the far-field mixture given that their close-field signals are available. The evaluation is based on measuring unmixing performance (removal of reverberated source) using objective separation criteria calculated between the ground truth recording of the preserved sources and the unmixing result obtained with the proposed algorithm. We compare online and offline formulations for the RIR estimation and also provide evaluation with blind source separation algorithm only operating on the mixture signal.

    @inproceedings{2018_ICASSP_a,
    author = "Nikunen, Joonas and Virtanen, Tuomas",
    abstract = "This paper proposes a method for online estimation of time-varying room impulse responses (RIR) between multiple isolated sound sources and a far-field mixture. The algorithm is formulated as adaptive convolutive filtering in short-time Fourier transform (STFT) domain. We use the recursive least squares (RLS) algorithm for estimating the filter parameters due to its fast convergence rate, which is required for modeling rapidly changing RIRs of moving sound sources. The proposed method allows separation of reverberated sources from the far-field mixture given that their close-field signals are available. The evaluation is based on measuring unmixing performance (removal of reverberated source) using objective separation criteria calculated between the ground truth recording of the preserved sources and the unmixing result obtained with the proposed algorithm. We compare online and offline formulations for the RIR estimation and also provide evaluation with blind source separation algorithm only operating on the mixture signal.",
    address = "United States",
    booktitle = "2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings",
    day = "10",
    doi = "10.1109/ICASSP.2018.8462535",
    isbn = "9781538646588",
    keywords = "Adaptive filtering; Informed source separation; Online room impulse response estimation; Source unmixing",
    month = "9",
    pages = "421--425",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
    title = "{E}stimation of time-varying room impulse responses of multiple sound sources from observed mixture and isolated source signals",
    volume = "2018-April",
    year = "2018",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Nikunen\_ICASSP2018\_rev"
    }
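
    For orientation, a generic sketch of the exponentially weighted complex RLS recursion on which this kind of online filter estimation is built, here for a single STFT frequency bin. Filter order, forgetting factor, and initialisation are illustrative assumptions; the paper's exact convolutive STFT formulation is not reproduced.

    import numpy as np

    def rls_track(x, d, order=8, lam=0.98, delta=1e-2):
        # x: complex close-field source STFT frames at one frequency bin (1-D, time)
        # d: complex far-field mixture STFT frames at the same bin (same length)
        # Returns per-frame filter taps, i.e. a time-varying RIR estimate for that bin.
        w = np.zeros(order, dtype=complex)            # current filter taps
        P = np.eye(order, dtype=complex) / delta      # inverse correlation matrix
        W = np.zeros((len(d), order), dtype=complex)
        for n in range(len(d)):
            u = np.zeros(order, dtype=complex)        # regressor: current + past frames
            m = min(order, n + 1)
            u[:m] = x[n::-1][:m]
            k = P @ u / (lam + np.conj(u) @ P @ u)    # gain vector
            e = d[n] - np.conj(w) @ u                 # a priori error
            w = w + k * np.conj(e)                    # tap update
            P = (P - np.outer(k, np.conj(u) @ P)) / lam
            W[n] = w
        return W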

  • J. Nikunen, A. Diment, and T. Virtanen, "Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking," IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 26, iss. 2, p. 281–295, 2018. doi:10.1109/TASLP.2017.2774925
    [BibTeX] [Abstract]

    In this paper we propose a method for separation of moving sound sources. The method is based on first tracking the sources and then estimation of source spectrograms using multichannel non-negative matrix factorization (NMF) and extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources denoted by spatial covariance matrices (SCM) and provide update equations for optimizing model parameters minimizing squared Frobenius norm. The SCMs of the model are obtained based on estimated directions of arrival of tracked sources at each time frame. The evaluation is based on established objective separation criteria and using real recordings of two and three simultaneous moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility towards tracking errors by comparing the separation quality achieved using annotated ground truth source trajectories.

    @article{2018_h,
    author = "Nikunen, Joonas and Diment, Aleksandr and Virtanen, Tuomas",
    title = "Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking",
    abstract = "In this paper we propose a method for separation of moving sound sources. The method is based on first tracking the sources and then estimation of source spectrograms using multichannel non-negative matrix factorization (NMF) and extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources denoted by spatial covariance matrices (SCM) and provide update equations for optimizing model parameters minimizing squared Frobenius norm. The SCMs of the model are obtained based on estimated directions of arrival of tracked sources at each time frame. The evaluation is based on established objective separation criteria and using real recordings of two and three simultaneous moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility towards tracking errors by comparing the separation quality achieved using annotated ground truth source trajectories.",
    keywords = "acoustic source tracking, Acoustics, Array signal processing, Direction-of-arrival estimation, Estimation, Mathematical model, microphone arrays, Microphones, moving sound sources, Sound source separation, Spectrogram, time-varying mixing model",
    year = "2018",
    doi = "10.1109/TASLP.2017.2774925",
    language = "English",
    volume = "26",
    pages = "281--295",
    journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity",
    number = "2"
    }
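
    The multichannel SCM model above does not fit in a few lines, but the plain single-channel NMF magnitude model it extends can be sketched with the standard squared-Frobenius multiplicative updates; the component and iteration counts below are arbitrary choices, not the paper's.

    import numpy as np

    def nmf(V, n_components=8, n_iter=200, eps=1e-9):
        # V: nonnegative magnitude spectrogram (freq x time)
        rng = np.random.default_rng(0)
        F, T = V.shape
        W = rng.random((F, n_components)) + eps   # spectral basis vectors
        H = rng.random((n_components, T)) + eps   # time-varying activations
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)  # multiplicative update for H
            W *= (V @ H.T) / (W @ H @ H.T + eps)  # multiplicative update for W
        return W, H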

  • M. Parviainen, P. Pertilä, T. Virtanen, and P. Grosche, "Time-frequency masking strategies for single-channel low-latency speech enhancement using neural networks," in 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, 2018, p. 51–55. doi:10.1109/IWAENC.2018.8521400
    [BibTeX] [Abstract] [Download PDF]

    This paper presents a low-latency neural network based speech enhancement system. Low-latency operation is critical for speech communication applications. The system uses the time-frequency (TF) masking approach to retain speech and remove the non-speech content from the observed signal. The ideal TF mask are obtained by supervised training of neural networks. As the main contribution different neural network models are experimentally compared to investigate computational complexity and speech enhancement performance. The proposed system is trained and tested on noisy speech data where signal-to-noise ratio (SNR) ranges from -5 dB to +5 dB and the results show significant reduction of non-speech content in the resulting signal while still meeting a low-latency operation criterion, which is here considered to be less than 20 ms.

    @inproceedings{2018_IWAENC_g,
    author = "Parviainen, Mikko and Pertila, Pasi and Virtanen, Tuomas and Grosche, Peter",
    abstract = "This paper presents a low-latency neural network based speech enhancement system. Low-latency operation is critical for speech communication applications. The system uses the time-frequency (TF) masking approach to retain speech and remove the non-speech content from the observed signal. The ideal TF mask are obtained by supervised training of neural networks. As the main contribution different neural network models are experimentally compared to investigate computational complexity and speech enhancement performance. The proposed system is trained and tested on noisy speech data where signal-to-noise ratio (SNR) ranges from -5 dB to +5 dB and the results show significant reduction of non-speech content in the resulting signal while still meeting a low-latency operation criterion, which is here considered to be less than 20 ms.",
    booktitle = "16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018",
    day = "2",
    doi = "10.1109/IWAENC.2018.8521400",
    keywords = "Neural networks; Speech enhancement; Speech separation",
    month = "11",
    pages = "51--55",
    publisher = "IEEE",
    title = "{T}ime-frequency masking strategies for single-channel low-latency speech enhancement using neural networks",
    year = "2018",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/iwaenc\_parviainen\_2018.pdf"
    }
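
    As a reference point for the masking strategies compared above, one common mask definition and its application (this sketch does not reproduce the paper's networks or its low-latency constraints):

    import numpy as np

    def ideal_ratio_mask(speech_stft, noise_stft, eps=1e-8):
        # Oracle training target: ratio of speech magnitude to total magnitude.
        s_mag, n_mag = np.abs(speech_stft), np.abs(noise_stft)
        return s_mag / (s_mag + n_mag + eps)

    def enhance(noisy_stft, mask):
        # Scale the noisy complex STFT with the (estimated) mask; noisy phase is kept.
        return mask * noisy_stft

    # The enhanced waveform is then obtained with an inverse STFT of enhance(...)'s
    # output (e.g. scipy.signal.istft).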

  • Z. Shuyang, T. Heittola, and T. Virtanen, "An Active Learning Method Using Clustering and Committee-Based Sample Selection for Sound Event Classification," in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 2018, p. 116–120. doi:10.1109/IWAENC.2018.8521336
    [BibTeX]
    @inproceedings{2018_IWAENC_c,
    author = "Shuyang, Zhao and Heittola, Toni and Virtanen, Tuomas",
    booktitle = "2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC)",
    title = "An Active Learning Method Using Clustering and Committee-Based Sample Selection for Sound Event Classification",
    year = "2018",
    volume = "",
    number = "",
    pages = "116-120",
    keywords = "Labeling;Acoustics;Training;Predictive models;Clustering algorithms;Measurement;Process control;active learning;K-medoids clustering;committee-based sample selection;sound event classification",
    doi = "10.1109/IWAENC.2018.8521336"
    }

  • E. Vincent, S. Gannot, and T. Virtanen, "Acoustics: Spatial Properties," in Audio Source Separation and Speech Enhancement, Wiley, 2018, p. 31–46.
    [BibTeX]
    @inbook{2018_j,
    author = "Vincent, Emmanuel and Gannot, Sharon and Virtanen, Tuomas",
    title = "Acoustics: Spatial Properties",
    year = "2018",
    language = "English",
    isbn = "978-1-119-27989-1",
    pages = "31--46",
    booktitle = "Audio Source Separation and Speech Enhancement",
    publisher = "Wiley"
    }

  • E. Vincent, T. Virtanen, and S. Gannot, "Perspectives," in Audio Source Separation and Speech Enhancement, Wiley, 2018.
    [BibTeX]
    @inbook{2018_q,
    author = "Vincent, Emmanuel and Virtanen, Tuomas and Gannot, Sharon",
    title = "Perspectives",
    year = "2018",
    language = "English",
    isbn = "978-1-119-27989-1",
    booktitle = "Audio Source Separation and Speech Enhancement",
    publisher = "Wiley"
    }

  • E. Vincent, S. Gannot, and T. Virtanen, "Introduction," in Audio Source Separation and Speech Enhancement, Wiley, 2018, p. 3–14.
    [BibTeX]
    @inbook{2018_r,
    author = "Vincent, Emmanuel and Gannot, Sharon and Virtanen, Tuomas",
    title = "Introduction",
    year = "2018",
    language = "English",
    isbn = "978-1-119-27989-1",
    pages = "3--14",
    booktitle = "Audio Source Separation and Speech Enhancement",
    publisher = "Wiley"
    }

  • T. Virtanen, M. D. Plumbley, and D. Ellis, "Introduction to sound scene and event analysis," in Computational Analysis of Sound Scenes and Events, T. Virtanen, M. D. Plumbley, and D. Ellis, Eds., Springer, 2018, p. 3–12. doi:10.1007/978-3-319-63450-0_1
    [BibTeX] [Abstract]

    Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.

    @inbook{2018_l,
    author = "Virtanen, Tuomas and Plumbley, Mark D. and Ellis, Dan",
    editor = "Virtanen, Tuomas and Plumbley, Mark D. and Ellis, Dan",
    title = "Introduction to sound scene and event analysis",
    abstract = "Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.",
    year = "2018",
    doi = "10.1007/978-3-319-63450-0\\_1",
    language = "English",
    isbn = "978-3-319-63449-4",
    pages = "3--12",
    booktitle = "Computational Analysis of Sound Scenes and Events",
    publisher = "Springer"
    }
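
    As a concrete example of the kind of metric the chapter surveys, a simplified frame-based F-score over binary event-activity matrices (a generic formulation, not the chapter's own definition):

    import numpy as np

    def frame_f_score(ref, est, eps=1e-12):
        # ref, est: binary matrices (frames x classes) of annotated / detected activity.
        tp = np.sum(ref * est)
        precision = tp / (np.sum(est) + eps)
        recall = tp / (np.sum(ref) + eps)
        return 2 * precision * recall / (precision + recall + eps)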

  • T. Virtanen, E. Vincent, and S. Gannot, "Time-Frequency Processing: Spectral Properties," in Audio Source Separation and Speech Enhancement, Wiley, 2018, p. 15–30.
    [BibTeX]
    @inbook{2018_n,
    author = "Virtanen, Tuomas and Vincent, Emmanuel and Gannot, Sharon",
    title = "Time-Frequency Processing: Spectral Properties",
    year = "2018",
    language = "English",
    isbn = "978-1-119-27989-1",
    pages = "15--30",
    booktitle = "Audio Source Separation and Speech Enhancement",
    publisher = "Wiley"
    }

  • S. Watanabe, D. Kolossa, and T. Virtanen, "Application of Source Separation to Robust Speech Analysis and Recognition," in Audio Source Separation and Speech Enhancement, Wiley, 2018, p. 377–412.
    [BibTeX]
    @inbook{2018_i,
    author = "Watanabe, Shinji and Kolossa, Dorothea and Virtanen, Tuomas",
    title = "Application of Source Separation to Robust Speech Analysis and Recognition",
    year = "2018",
    language = "English",
    isbn = "978-1-119-27989-1",
    pages = "377--412",
    booktitle = "Audio Source Separation and Speech Enhancement",
    publisher = "Wiley"
    }

  • J. J. Carabias Orti, J. Nikunen, T. Virtanen, and P. Vera-Candeas, "Multichannel Blind Sound Source Separation using Spatial Covariance Model with Level and Time Differences and Non-Negative Matrix Factorization," IEEE-ACM Transactions on Audio Speech and Language Processing, 2018. doi:10.1109/TASLP.2018.2830105
    [BibTeX] [Abstract] [Download PDF]

    This paper presents an algorithm for multichannel sound source separation using explicit modeling of level and time differences in source spatial covariance matrices (SCM). We propose a novel SCM model in which the spatial properties are modeled by the weighted sum of direction of arrival (DOA) kernels. DOA kernels are obtained as the combination of phase and level difference covariance matrices representing both time and level differences between microphones for a grid of predefined source directions. The proposed SCM model is combined with the NMF model for the magnitude spectrograms. Opposite to other SCM models in the literature, in this work, source localization is implicitly defined in the model and estimated during the signal factorization. Therefore, no localization pre-processing is required. Parameters are estimated using complex-valued non-negative matrix factorization (CNMF) with both Euclidean distance and Itakura Saito divergence. Separation performance of the proposed system is evaluated using the two-channel SiSEC development dataset and four channels signals recorded in a regular room with moderate reverberation. Finally, a comparison to other state-of-the-art methods is performed, showing better achieved separation performance in terms of SIR and perceptual measures.

    @article{2018_TASLP_a,
    author = "{Carabias Orti}, {Julio Jose} and Nikunen, Joonas and Virtanen, Tuomas and Vera-Candeas, Pedro",
    abstract = "This paper presents an algorithm for multichannel sound source separation using explicit modeling of level and time differences in source spatial covariance matrices (SCM). We propose a novel SCM model in which the spatial properties are modeled by the weighted sum of direction of arrival (DOA) kernels. DOA kernels are obtained as the combination of phase and level difference covariance matrices representing both time and level differences between microphones for a grid of predefined source directions. The proposed SCM model is combined with the NMF model for the magnitude spectrograms. Opposite to other SCM models in the literature, in this work, source localization is implicitly defined in the model and estimated during the signal factorization. Therefore, no localization pre-processing is required. Parameters are estimated using complex-valued non-negative matrix factorization (CNMF) with both Euclidean distance and Itakura Saito divergence. Separation performance of the proposed system is evaluated using the two-channel SiSEC development dataset and four channels signals recorded in a regular room with moderate reverberation. Finally, a comparison to other state-of-the-art methods is performed, showing better achieved separation performance in terms of SIR and perceptual measures.",
    day = "26",
    doi = "10.1109/TASLP.2018.2830105",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "Covariance matrices;direction of arrival estimation;Direction-of-arrival estimation;interaural level difference;interaural time difference;Kernel;Microphones;multichannel source separation;non-negative matrix factorization;Source separation;spati;spatial\_s",
    month = "4",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Multichannel Blind Sound Source Separation using Spatial Covariance Model with Level and Time Differences and Non-Negative Matrix Factorization",
    year = "2018",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/level-time\_SCM2018.pdf"
    }

2017

  • S. Adavanne, K. Drossos, E. Cakir, and T. Virtanen, "Stacked convolutional and recurrent neural networks for bird audio detection," in 2017 25th European Signal Processing Conference (EUSIPCO), 2017, p. 1729–1733. doi:10.23919/EUSIPCO.2017.8081505
    [BibTeX] [Download PDF]
    @inproceedings{2017_EUSIPCO,
    author = "Adavanne, Sharath and Drossos, Konstantinos and Cakir, Emre and Virtanen, Tuomas",
    booktitle = "2017 25th European Signal Processing Conference (EUSIPCO)",
    doi = "10.23919/EUSIPCO.2017.8081505",
    isbn = "978-0-9928626-7-1",
    pages = "1729--1733",
    publisher = "IEEE",
    title = "Stacked convolutional and recurrent neural networks for bird audio detection",
    year = "2017",
    url = "https://arxiv.org/abs/1706.02047"
    }

  • S. Adavanne and T. Virtanen, "Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017, p. 12–16.
    [BibTeX] [Download PDF]
    @inproceedings{2017_DCASE2017_a,
    author = "Adavanne, Sharath and Virtanen, Tuomas",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
    pages = "12--16",
    publisher = "Tampere University of Technology. Laboratory of Signal Processing",
    title = "Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network",
    year = "2017",
    url = "https://arxiv.org/abs/1710.02998"
    }

  • S. Adavanne, P. Pertilä, and T. Virtanen, "Sound event detection using spatial features and convolutional recurrent neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), 2017.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that instead of concatenating the features of each channel into a single feature vector the network learns sound events in multichannel audio better when they are presented as separate layers of a volume. Using the proposed spatial features over monaural features on the same network gives an absolute F-score improvement of 6.1\% on the publicly available TUT-SED 2016 dataset and 2.7\% on the TUT-SED 2009 dataset that is fifteen times larger

    @inproceedings{2017_ICASSP_2017,
    author = "Adavanne, Sharath and Pertilä, Pasi and Virtanen, Tuomas",
    abstract = "This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that instead of concatenating the features of each channel into a single feature vector the network learns sound events in multichannel audio better when they are presented as separate layers of a volume. Using the proposed spatial features over monaural features on the same network gives an absolute F-score improvement of 6.1\% on the publicly available TUT-SED 2016 dataset and 2.7\% on the TUT-SED 2009 dataset that is fifteen times larger",
    booktitle = "IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017)",
    keywords = "Sound event detection;multichannel audio;spatial features;convolutional recurrent neural network",
    title = "{S}ound event detection using spatial features and convolutional recurrent neural network",
    url = "https://arxiv.org/pdf/1706.02291.pdf",
    year = "2017"
    }

  • D. Caballero, R. Araya, H. Kronholm, J. Viiri, A. Mansikkaniemi, S. Lehesvuori, T. Virtanen, and M. Kurimo, "ASR in classroom today: Automatic visualization of conceptual network in science classrooms," in Data Driven Approaches in Digital Education - 12th European Conference on Technology Enhanced Learning, EC-TEL 2017, Proceedings, Germany, 2017, p. 541–544. doi:10.1007/978-3-319-66610-5_58
    [BibTeX] [Abstract] [Download PDF]

    Automatic Speech Recognition (ASR) field has improved substantially in the last years. We are in a point never saw before, where we can apply such algorithms in non-ideal conditions such as real classrooms. In these scenarios it is still not possible to reach perfect recognition rates, however we can already take advantage of these improvements. This paper shows preliminary results using ASR in Chilean and Finnish middle and high school to automatically provide teachers a visualization of the structure of concepts present in their discourse in science classrooms. These visualizations are conceptual networks that relate key concepts used by the teacher. This is an interesting tool that gives feedback to the teacher about his/her pedagogical practice in classes. The result of initial comparisons shows great similarity between conceptual networks generated in a manual way with those generated automatically.

    @inproceedings{2017_EX-TEL,
    author = "Caballero, Daniela and Araya, Roberto and Kronholm, Hanna and Viiri, Jouni and Mansikkaniemi, Andr{\'e} and Lehesvuori, Sami and Virtanen, Tuomas and Kurimo, Mikko",
    abstract = "Automatic Speech Recognition (ASR) field has improved substantially in the last years. We are in a point never saw before, where we can apply such algorithms in non-ideal conditions such as real classrooms. In these scenarios it is still not possible to reach perfect recognition rates, however we can already take advantage of these improvements. This paper shows preliminary results using ASR in Chilean and Finnish middle and high school to automatically provide teachers a visualization of the structure of concepts present in their discourse in science classrooms. These visualizations are conceptual networks that relate key concepts used by the teacher. This is an interesting tool that gives feedback to the teacher about his/her pedagogical practice in classes. The result of initial comparisons shows great similarity between conceptual networks generated in a manual way with those generated automatically.",
    address = "Germany",
    booktitle = "Data Driven Approaches in Digital Education - 12th European Conference on Technology Enhanced Learning, EC-TEL 2017, Proceedings",
    doi = "10.1007/978-3-319-66610-5\_58",
    isbn = "9783319666099",
    keywords = "Automatic speech recognition; Classroom dialogue; Conceptual network; Teacher discourse",
    pages = "541--544",
    publisher = "Springer Verlag",
    series = "Lecture Notes in Computer Science",
    title = "{ASR} in classroom today: {A}utomatic visualization of conceptual network in science classrooms",
    year = "2017",
    url = "https://jyx.jyu.fi/bitstream/handle/123456789/55458/ectel142\%201.pdf?sequence=1\&isAllowed=y"
    }

  • E. Cakir, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection," IEEE-ACM Transactions on Audio Speech and Language Processing, vol. 25, iss. 6, p. 1291–1303, 2017. doi:10.1109/TASLP.2017.2690575
    [BibTeX] [Abstract] [Download PDF]

    Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a convolutional recurrent neural network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.

    @article{2017_TASLP,
    author = "Cakir, Emre and Parascandolo, Giambattista and Heittola, Toni and Huttunen, Heikki and Virtanen, Tuomas",
    abstract = "Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a convolutional recurrent neural network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.",
    doi = "10.1109/TASLP.2017.2690575",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "deep neural networks;sound event detection",
    month = "6",
    number = "6",
    pages = "1291--1303",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection",
    volume = "25",
    year = "2017",
    url = "https://arxiv.org/abs/1702.06286"
    }
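
    A minimal PyTorch sketch of the generic CRNN pattern described above: convolutional blocks that pool over frequency only (preserving the frame rate), a recurrent layer over time, and frame-wise sigmoid outputs for multi-label event activity. Layer sizes, pooling factors, and the mel-band count are illustrative, not the paper's configuration.

    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        # Input: (batch, 1, frames, mel_bands); output: per-frame class probabilities.
        def __init__(self, n_mels=40, n_classes=6, n_filters=64, rnn_units=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, n_filters, 3, padding=1), nn.BatchNorm2d(n_filters), nn.ReLU(),
                nn.MaxPool2d((1, 5)),   # pool over frequency only
                nn.Conv2d(n_filters, n_filters, 3, padding=1), nn.BatchNorm2d(n_filters), nn.ReLU(),
                nn.MaxPool2d((1, 4)),
            )
            self.rnn = nn.GRU(n_filters * (n_mels // 20), rnn_units,
                              batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * rnn_units, n_classes)

        def forward(self, x):
            z = self.conv(x)                        # (B, C, T, F')
            z = z.permute(0, 2, 1, 3).flatten(2)    # (B, T, C * F')
            z, _ = self.rnn(z)
            return torch.sigmoid(self.out(z))       # frame-wise multi-label activity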

  • E. Cakir, S. Adavanne, G. Parascandolo, K. Drossos, and T. Virtanen, "Convolutional recurrent neural networks for bird audio detection," in European Signal Processing Conference, 2017, p. 1744–1748. doi:10.23919/EUSIPCO.2017.8081508
    [BibTeX] [Download PDF]
    @inproceedings{2017_EUSIPCO_a,
    author = "Cakir, Emre and Adavanne, Sharath and Parascandolo, Giambattista and Drossos, Konstantinos and Virtanen, Tuomas",
    booktitle = "European Signal Processing Conference",
    doi = "10.23919/EUSIPCO.2017.8081508",
    pages = "1744--1748",
    publisher = "IEEE",
    series = "European Signal Processing Conference",
    title = "Convolutional recurrent neural networks for bird audio detection",
    year = "2017",
    url = "https://arxiv.org/pdf/1703.02317.pdf"
    }

  • E. Cakir and T. Virtanen, "Convolutional Recurrent Neural Networks for Rare Sound Event Detection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017, p. 27–31.
    [BibTeX] [Download PDF]
    @inproceedings{2017_DCASE2017_b,
    author = "Cakir, Emre and Virtanen, Tuomas",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
    pages = "27--31",
    publisher = "Tampere University of Technology. Laboratory of Signal Processing",
    title = "Convolutional Recurrent Neural Networks for Rare Sound Event Detection",
    year = "2017",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/DCASE2017Workshop\_Cakir\_105.pdf"
    }

  • E. Cakir, K. Drossos, and T. Virtanen, "QMUL bird audio detection challenge 2016," Tampere University of Technology 2017.
    [BibTeX] [Abstract] [Download PDF]

    In this paper, we focus on bird audio detection in short audio segments (namely 10 seconds) using stacked convolutional and recurrent neural networks. The evaluation data for this task was recorded in an acoustic soundscape different from the development data, thus motivating to work on methods that are generic and context independent. Data augmentation and regularization methods are proposed and evaluated in this regard. Area under curve (AUC) measure is used to compare different results. Our best achieved AUC measure on five cross-validations of the development data is 95.3\% and 88.41\% on the unseen evaluation data.

    @techreport{2017_d,
    author = "Cakir, Emre and Drossos, Konstantinos and Virtanen, Tuomas",
    abstract = "In this paper, we focus on bird audio detection in short audio segments (namely 10 seconds) using stacked convolutional and recurrent neural networks. The evaluation data for this task was recorded in an acoustic soundscape different from the development data, thus motivating to work on methods that are generic and context independent. Data augmentation and regularization methods are proposed and evaluated in this regard. Area under curve (AUC) measure is used to compare different results. Our best achieved AUC measure on five cross-validations of the development data is 95.3\% and 88.41\% on the unseen evaluation data.",
    keywords = "Bird audio detection; convolutional recurrent neural network",
    title = "{QMUL} bird audio detection challenge 2016",
    url = "http://machine-listening.eecs.qmul.ac.uk/wp-content/uploads/sites/26/2017/01/adavanne.pdf",
    year = "2017",
    institution = "Tampere University of Technology"
    }

  • S. Delikaris-Manias and P. Pertilä, "Time–Frequency Domain Spatial Audio Enhancement," in Parametric Time‐Frequency Domain Spatial Audio, John Wiley & Sons, Ltd, 2017, p. 251–264. doi:10.1002/9781119252634.ch10
    [BibTeX] [Abstract]

    Multi-microphone devices enable flexible recording of sound sources in the presence of interferers, noise, and reverberation. The most common signal enhancement techniques for microphone arrays are based on the design of directional filters or beamforming. This chapter provides a brief overview of these techniques, with a focus on post-filtering techniques. In adaptive beamformers there is a trade-off between directional selectivity and noise amplification which can be observed in the directivity factor. A class of adaptive beamformers are described as part of the informed spatial filters that combine beamforming with noise reduction. Time-frequency masking is commonly applied at the output of the beamformer to adjust the spectrum to better match that of the desired source signal. The post-filters are only capable of reducing uncorrelated or correlated noise in the beamforming output, and rely on the output of the beamformer for the suppression of the interference.

    @inbook{2017_f,
    author = "Delikaris-Manias, Symeon and Pertilä, Pasi",
    publisher = "John Wiley \& Sons, Ltd",
    isbn = "9781119252634",
    title = "Time–Frequency Domain Spatial Audio Enhancement",
    booktitle = "Parametric Time‐Frequency Domain Spatial Audio",
    chapter = "10",
    pages = "251-264",
    doi = "https://doi.org/10.1002/9781119252634.ch10",
    year = "2017",
    keywords = "adaptive beamformers, multi-microphone devices, post-filtering techniques, signal enhancement techniques, spatial filters, time-frequency masking",
    abstract = "Abstract Multi-microphone devices enable flexible recording of sound sources in the presence of interferers, noise, and reverberation. The most common signal enhancement techniques for microphone arrays are based on the design of directional filters or beamforming. This chapter provides a brief overview of these techniques, with a focus on post-filtering techniques. In adaptive beamformers there is a trade-off between directional selectivity and noise amplification which can be observed in the directivity factor. A class of adaptive beamformers are described as part of the informed spatial filters that combine beamforming with noise reduction. Time-frequency masking is commonly applied at the output of the beamformer to adjust the spectrum to better match that of the desired source signal. The post-filters are only capable of reducing uncorrelated or correlated noise in the beamforming output, and rely on the output of the beamformer for the suppression of the interference."
    }
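
    For orientation, the simplest fixed beamformer of the family the chapter covers, a frequency-domain delay-and-sum; the adaptive designs and post-filters discussed in the chapter build on top of this. Array geometry and steering delays are assumed to be given.

    import numpy as np

    def delay_and_sum(stft, freqs, delays):
        # stft: complex array (mics, freq_bins, frames)
        # freqs: frequency of each bin in Hz; delays: per-mic steering delay in seconds
        steering = np.exp(-2j * np.pi * np.outer(delays, freqs))      # (mics, freq_bins)
        return np.mean(np.conj(steering)[:, :, None] * stft, axis=0)  # (freq_bins, frames)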

  • A. Diment and T. Virtanen, "Transfer Learning of Weakly Labelled Audio," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017, p. 6–10. doi:10.1109/WASPAA.2017.8169984
    [BibTeX] [Download PDF]
    @inproceedings{2017_WASPAA_a,
    author = "Diment, Aleksandr and Virtanen, Tuomas",
    booktitle = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    doi = "10.1109/WASPAA.2017.8169984",
    isbn = "978-1-5386-1631-4",
    pages = "6--10",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Transfer Learning of Weakly Labelled Audio",
    year = "2017",
    url = "http://diment.kapsi.fi/papers/Diment17\_TL.pdf"
    }

  • S. Drgas, T. Virtanen, J. Lücke, and A. Hurmalainen, "Binary Non-Negative Matrix Deconvolution for Audio Dictionary Learning," IEEE-ACM Transactions on Audio Speech and Language Processing, vol. 25, iss. 8, p. 1644–1656, 2017. doi:10.1109/TASLP.2017.2709909
    [BibTeX] [Abstract] [Download PDF]

    In this study, we propose an unsupervised method for dictionary learning in audio signals. The new method, called binary nonnegative matrix deconvolution (BNMD), is developed and used to discover patterns from magnitude-scale spectrograms. The BNMD models an audio spectrogram as a sum of delayed patterns having binary gains (activations). Only small subsets of patterns can be active for a given spectrogram excerpt. The proposed method was applied to speaker identification and separation tasks. The experimental results show that dictionaries obtained by the BNMD bring much higher speaker identification accuracies averaged over a range of SNRs from -6 dB to 9 dB (91.3\%) than the NMD-based dictionaries (37.8-75.4\%). The BNMD also gives a benefit over dictionaries obtained using vector quantization (87.8\%). For bigger dictionaries the difference between the BNMD and the vector quantization (VQ) is getting smaller. For the speech separation task the BNMD dictionary gave a slight improvement over the VQ.

    @article{2017_TASLP_c,
    author = {Drgas, Szymon and Virtanen, Tuomas and L{\"u}cke, J{\"o}rg and Hurmalainen, Antti},
    abstract = "In this study, we propose an unsupervised method for dictionary learning in audio signals. The new method, called binary nonnegative matrix deconvolution (BNMD), is developed and used to discover patterns from magnitude-scale spectrograms. The BNMD models an audio spectrogram as a sum of delayed patterns having binary gains (activations). Only small subsets of patterns can be active for a given spectrogram excerpt. The proposed method was applied to speaker identification and separation tasks. The experimental results show that dictionaries obtained by the BNMD bring much higher speaker identification accuracies averaged over a range of SNRs from -6 dB to 9 dB (91.3\%) than the NMD-based dictionaries (37.8-75.4\%). The BNMD also gives a benefit over dictionaries obtained using vector quantization (87.8\%). For bigger dictionaries the difference between the BNMD and the vector quantization (VQ) is getting smaller. For the speech separation task the BNMD dictionary gave a slight improvement over the VQ.",
    doi = "10.1109/TASLP.2017.2709909",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "Sparse coding;speaker recognition;speech separation",
    month = "8",
    number = "8",
    pages = "1644--1656",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Binary Non-Negative Matrix Deconvolution for Audio Dictionary Learning",
    volume = "25",
    year = "2017",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/BNMD.pdf"
    }

  • K. Drossos, S. Adavanne, and T. Virtanen, "Automated Audio Captioning with Recurrent Neural Networks," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017. doi:10.1109/WASPAA.2017.8170058
    [BibTeX] [Abstract] [Download PDF]

    We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

    @inproceedings{2017_WASPAA,
    author = "Drossos, Konstantinos and Adavanne, Sharath and Virtanen, Tuomas",
    abstract = "We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.",
    booktitle = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    doi = "10.1109/WASPAA.2017.8170058",
    isbn = "978-1-5386-1632-1",
    keywords = "audio captioning",
    publisher = "IEEE",
    title = "Automated Audio Captioning with Recurrent Neural Networks",
    year = "2017",
    url = "https://arxiv.org/abs/1706.10006"
    }
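
    A heavily simplified PyTorch sketch of the encoder-decoder idea above: a bi-directional GRU encoder over log mel-band energies and a GRU decoder over word embeddings, trained with teacher forcing. The paper's alignment model is replaced here by plain last-state conditioning, and all sizes are illustrative.

    import torch
    import torch.nn as nn

    class CaptionRNN(nn.Module):
        def __init__(self, n_mels=64, vocab=1000, hidden=128):
            super().__init__()
            self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
            self.embed = nn.Embedding(vocab, hidden)
            self.decoder = nn.GRU(hidden, 2 * hidden, batch_first=True)
            self.out = nn.Linear(2 * hidden, vocab)

        def forward(self, mel, words):
            # mel: (B, frames, n_mels); words: (B, caption_length) token indices
            _, h = self.encoder(mel)                      # h: (2, B, hidden)
            h0 = torch.cat([h[0], h[1]], dim=-1)[None]    # (1, B, 2 * hidden)
            z, _ = self.decoder(self.embed(words), h0)    # teacher forcing
            return self.out(z)                            # (B, caption_length, vocab) logits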

  • K. Drossos, S. I. Mimilakis, A. Floros, T. Virtanen, and G. Schuller, "Close Miking Empirical Practice Verification: A Source Separation Approach," in Audio Engineering Society Convention 142, 2017.
    [BibTeX] [Abstract]

    Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It is used by the audio engineering community for decades for audio recording, based on a number of empirical rules that were evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source multiple microphone types and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we calculate the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.

    @inproceedings{2017_AES,
    author = "Drossos, Konstantinos and Mimilakis, Stylianos Ioannis and Floros, Andreas and Virtanen, Tuomas and Schuller, Gerald",
    abstract = "Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It is used by the audio engineering community for decades for audio recording, based on a number of empirical rules that were evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source multiple microphone types and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we calculate the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.",
    booktitle = "Audio Engineering Society Convention 142",
    publisher = "AES Audio Engineering Society",
    title = "Close Miking Empirical Practice Verification: A Source Separation Approach",
    year = "2017"
    }

  • D. Ellis, T. Virtanen, M. D. Plumbley, and B. Raj, "Future Perspective," in Computational Analysis of Sound Scenes and Events, Springer, 2017, p. 401–415. doi:10.1007/978-3-319-63450-0_14
    [BibTeX] [Abstract]

    This book has covered the underlying principles and technologies of sound recognition, and described several current application areas. However, the field is still very young; this chapter briefly outlines several emerging areas, particularly relating to the provision of the very large training sets that can be exploited by deep learning approaches. We also forecast some of the technological and application advances we expect in the short-to-medium future.

    @inbook{2017_e,
    author = "Ellis, Dan and Virtanen, Tuomas and Plumbley, Mark D. and Raj, Bhiksha",
    abstract = "This book has covered the underlying principles and technologies of sound recognition, and described several current application areas. However, the field is still very young; this chapter briefly outlines several emerging areas, particularly relating to the provision of the very large training sets that can be exploited by deep learning approaches. We also forecast some of the technological and application advances we expect in the short-to-medium future.",
    booktitle = "Computational Analysis of Sound Scenes and Events",
    doi = "10.1007/978-3-319-63450-0\_14",
    isbn = "978-3-319-63449-4",
    month = "9",
    pages = "401--415",
    publisher = "Springer",
    title = "Future Perspective",
    year = "2017"
    }

  • P. Magron, R. Badeau, and A. Liutkus, "Lévy NMF : un modèle robuste de séparation de sources non-négatives," in Actes du XXVIème Colloque GRETSI, 2017.
    [BibTeX] [Abstract]

    In this paper, we address the problem of robust source separation of nonnegative data. We introduce the PαS distributions, which are a subclass of the stable distributions family, to model the nonnegative latent sources. Since those distributions are heavy-tailed, they are expected to be robust to outliers. Considering the Lévy distribution, the only PαS distribution whose density admits a closed form expression, we propose a mixture model called Lévy Nonnegative Matrix Factorization (Lévy NMF). The model is estimated in a maximum-likelihood sense. We also derive an estimator of the sources which extends the validity of the generalized Wiener filtering to the PαS case. Experiments on musical spectrograms and fluorescence spectra highlight the potential of the Lévy NMF model for decomposing nonnegative data.

    @inproceedings{2017_c,
    author = "Magron, Paul and Badeau, Roland and Liutkus, Antoine",
    abstract = "In this paper, we address the problem of robust source separation of nonnegative data. We introduce the PαS distributions, which are a subclass of the stable distributions family, to model the nonnegative latent sources. Since those distributions are heavy-tailed, they are expected to be robust to outliers. Considering the Lévy distribution, the only PαS distribution whose density admits a closed form expression, we propose a mixture model called Lévy Nonnegative Matrix Factorization (Lévy NMF). The model is estimated in a maximum-likelihood sense. We also derive an estimator of the sources which extends the validity of the generalized Wiener filtering to the PαS case. Experiments on musical spectrograms and fluorescence spectra highlight the potential of the Lévy NMF model for decomposing nonnegative data.",
    booktitle = "Actes du XXVI{\`e}me Colloque GRETSI",
    month = "9",
    title = "{L}{\'e}vy {NMF} : un mod{\`e}le robuste de s{\'e}paration de sources non-n{\'e}gatives",
    year = "2017"
    }

  • P. Magron, R. Badeau, and B. David, "Phase-dependent anisotropic Gaussian model for audio source separation," in 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, p. 531–535. doi:10.1109/ICASSP.2017.7952212
    [BibTeX] [Abstract]

    Phase reconstruction of complex components in the time-frequency domain is a challenging but necessary task for audio source separation. While traditional approaches do not exploit phase constraints that originate from signal modeling, some prior information about the phase can be obtained from sinusoidal modeling. In this paper, we introduce a probabilistic mixture model which allows us to incorporate such phase priors within a source separation framework. While the magnitudes are estimated beforehand, the phases are modeled by Von Mises random variables whose location parameters are the phase priors. We then approximate this non-tractable model by an anisotropic Gaussian model, in which the phase dependencies are preserved. This enables us to derive an MMSE estimator of the sources which optimally combines Wiener filtering and prior phase estimates. Experimental results highlight the potential of incorporating phase priors into mixture models for separating overlapping components in complex audio mixtures.

    @inproceedings{2017_ICASSP_b,
    author = "Magron, Paul and Badeau, Roland and David, Bertrand",
    abstract = "Phase reconstruction of complex components in the time-frequency domain is a challenging but necessary task for audio source separation. While traditional approaches do not exploit phase constraints that originate from signal modeling, some prior information about the phase can be obtained from sinusoidal modeling. In this paper, we introduce a probabilistic mixture model which allows us to incorporate such phase priors within a source separation framework. While the magnitudes are estimated beforehand, the phases are modeled by Von Mises random variables whose location parameters are the phase priors. We then approximate this non-tractable model by an anisotropic Gaussian model, in which the phase dependencies are preserved. This enables us to derive an MMSE estimator of the sources which optimally combines Wiener filtering and prior phase estimates. Experimental results highlight the potential of incorporating phase priors into mixture models for separating overlapping components in complex audio mixtures.",
    booktitle = "42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    doi = "10.1109/ICASSP.2017.7952212",
    month = "3",
    pages = "531--535",
    publisher = "IEEE",
    title = "{P}hase-dependent anisotropic {G}aussian model for audio source separation",
    year = "2017"
    }
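
    The phase-aware estimators studied in this and the related entries generalise the plain (isotropic) Wiener filter; as a baseline reference, assuming per-source power spectrograms have already been estimated:

    import numpy as np

    def wiener_filter(mix_stft, source_powers, eps=1e-12):
        # mix_stft: complex mixture STFT (freq x time)
        # source_powers: list of nonnegative power spectrograms, one per source
        # Returns one complex STFT estimate per source (soft-mask filtering).
        total = sum(source_powers) + eps
        return [p / total * mix_stft for p in source_powers]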

  • P. Magron, R. Badeau, and A. Liutkus, "Lévy NMF for robust nonnegative source separation," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, p. 259–263. doi:10.1109/WASPAA.2017.8170035
    [BibTeX] [Abstract]

    Source separation, which consists in decomposing data into meaningful structured components, is an active research topic in many fields including music signal processing. In this paper, we introduce the Positive α-stable (PαS) distributions to model the latent sources, which are a subclass of the stable distributions family. They notably permit us to model random variables that are both nonnegative and impulsive. Considering the Levy distribution, the only PαS distribution whose density is tractable, we propose a mixture model called Lévy Nonnegative Matrix Factorization (Lévy NMF). This model accounts for low-rank structures in nonnegative data that possibly has high variability or is corrupted by very adverse noise. The model parameters are estimated in a maximum-likelihood sense. We also derive an estimator of the sources, which extends the validity of the Wiener filtering to the PαS case. Experiments on synthetic data and realistic music signals show that Lévy NMF compares favorably with state-of-the art techniques in terms of robustness to impulsive noise and highlight its potential for decomposing nonnegative data.

    @inproceedings{2017_WASPAA_e,
    author = "Magron, P. and Badeau, R. and Liutkus, A.",
    abstract = "Source separation, which consists in decomposing data into meaningful structured components, is an active research topic in many fields including music signal processing. In this paper, we introduce the Positive α-stable (PαS) distributions to model the latent sources, which are a subclass of the stable distributions family. They notably permit us to model random variables that are both nonnegative and impulsive. Considering the Levy distribution, the only PαS distribution whose density is tractable, we propose a mixture model called L{\'e}vy Nonnegative Matrix Factorization (L{\'e}vy NMF). This model accounts for low-rank structures in nonnegative data that possibly has high variability or is corrupted by very adverse noise. The model parameters are estimated in a maximum-likelihood sense. We also derive an estimator of the sources, which extends the validity of the Wiener filtering to the PαS case. Experiments on synthetic data and realistic music signals show that L{\'e}vy NMF compares favorably with state-of-the art techniques in terms of robustness to impulsive noise and highlight its potential for decomposing nonnegative data.",
    booktitle = "2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    doi = "10.1109/WASPAA.2017.8170035",
    keywords = "Conferences; Cost function; Dispersion; Random variables; Robustness; Source separation; L{\'e}vy distribution; Positive alpha-stable distribution; audio source separation; nonnegative matrix factorization",
    month = "10",
    pages = "259--263",
    publisher = "IEEE",
    title = "{L}{\'e}vy {NMF} for robust nonnegative source separation",
    year = "2017"
    }

  • P. Magron, J. Le Roux, and T. Virtanen, "Consistent Anisotropic Wiener Filtering for Audio Source Separation," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017, p. 269–273. doi:10.1109/WASPAA.2017.8170037
    [BibTeX] [Download PDF]
    @inproceedings{2017_WASPAA_f,
    author = "Magron, Paul and Roux, Jonathan Le and Virtanen, Tuomas",
    booktitle = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    doi = "10.1109/WASPAA.2017.8170037",
    isbn = "978-1-5386-1631-4",
    pages = "269--273",
    series = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Consistent Anisotropic Wiener Filtering for Audio Source Separation",
    year = "2017",
    url = "https://hal.archives-ouvertes.fr/hal-01593126/document"
    }

  • P. Maijala, Z. Shuyang, T. Heittola, and T. Virtanen, "Environmental noise monitoring using source classification in sensors," Applied Acoustics, vol. 129, p. 258–267, 2017. doi:10.1016/j.apacoust.2017.08.006
    [BibTeX] [Abstract] [Download PDF]

    Environmental noise monitoring systems continuously measure sound levels without assigning these measurements to different noise sources in the acoustic scenes, therefore incapable of identifying the main noise source. In this paper a feasibility study is presented on a new monitoring concept in which an acoustic pattern classification algorithm running in a wireless sensor is used to automatically assign the measured sound level to different noise sources. A supervised noise source classifier is learned from a small amount of manually annotated recordings and the learned classifier is used to automatically detect the activity of target noise source in the presence of interfering noise sources. The sensor is based on an inexpensive credit-card-sized single-board computer with a microphone and associated electronics and wireless connectivity. The measurement results and the noise source information are transferred from the sensors scattered around the measurement site to a cloud service and a noise portal is used to visualise the measurements to users. The proposed noise monitoring concept was piloted on a rock crushing site. The system ran reliably over 50 days on site, during which it was able to recognise more than 90\% of the noise sources correctly. The pilot study shows that the proposed noise monitoring system can reduce the amount of required human validation of the sound level measurements when the target noise source is clearly defined.

    @article{2017_AA_a,
    author = "Maijala, Panu and Shuyang, Zhao and Heittola, Toni and Virtanen, Tuomas",
    abstract = "Environmental noise monitoring systems continuously measure sound levels without assigning these measurements to different noise sources in the acoustic scenes, therefore incapable of identifying the main noise source. In this paper a feasibility study is presented on a new monitoring concept in which an acoustic pattern classification algorithm running in a wireless sensor is used to automatically assign the measured sound level to different noise sources. A supervised noise source classifier is learned from a small amount of manually annotated recordings and the learned classifier is used to automatically detect the activity of target noise source in the presence of interfering noise sources. The sensor is based on an inexpensive credit-card-sized single-board computer with a microphone and associated electronics and wireless connectivity. The measurement results and the noise source information are transferred from the sensors scattered around the measurement site to a cloud service and a noise portal is used to visualise the measurements to users. The proposed noise monitoring concept was piloted on a rock crushing site. The system ran reliably over 50 days on site, during which it was able to recognise more than 90\% of the noise sources correctly. The pilot study shows that the proposed noise monitoring system can reduce the amount of required human validation of the sound level measurements when the target noise source is clearly defined.",
    doi = "10.1016/j.apacoust.2017.08.006",
    issn = "0003-682X",
    journal = "Applied Acoustics",
    keywords = "Acoustic pattern classification; Cloud service; Environmental noise monitoring; Wireless sensor network",
    month = "8",
    pages = "258--267",
    publisher = "Elsevier",
    title = "Environmental noise monitoring using source classification in sensors",
    volume = "129",
    year = "2017",
    url = "http://www.sciencedirect.com/science/article/pii/S0003682X17307533"
    }

  • M. Malik, S. Adavanne, K. Drossos, T. Virtanen, D. Ticha, and R. Jarina, "Stacked convolutional and recurrent neural networks for music emotion recognition," in Sound and Music Computing Conference, 2017.
    [BibTeX] [Abstract] [Download PDF]

    This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with state-of-the-art (SOTA) method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the “MediaEval2015 emotion in music” dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset

    @inproceedings{2017_SMC,
    author = "Malik, Miroslav and Adavanne, Sharath and Drossos, Konstantinos and Virtanen, Tuomas and Ticha, Dasa and Jarina, Roman",
    abstract = "This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with state-of-the-art (SOTA) method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the “MediaEval2015 emotion in music” dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset",
    booktitle = "Sound and Music Computing Conference",
    title = "{S}tacked convolutional and recurrent neural networks for music emotion recognition",
    url = "https://arxiv.org/pdf/1706.02292.pdf",
    year = "2017"
    }

  • A. Mesaros, T. Heittola, and T. Virtanen, "Assessment of human and machine performance in acoustic scene classification: DCASE 2016 case study," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), United States, 2017, p. 319–323. doi:10.1109/WASPAA.2017.8170047
    [BibTeX] [Abstract] [Download PDF]

    Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.

    @inproceedings{2017_WASPAA_b,
    author = "Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas",
    abstract = "Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.",
    address = "United States",
    booktitle = "2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    doi = "10.1109/WASPAA.2017.8170047",
    isbn = "978-1-5386-1631-4",
    keywords = "acoustic scene classification; machine learning; human performance; listening experiment",
    pages = "319–323",
    publisher = "IEEE Computer Society",
    title = "Assessment of human and machine performance in acoustic scene classification: {DCASE} 2016 case study",
    year = "2017",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/mesaros-waspaa2017-humans-vs-machines-asc.pdf"
    }

  • A. Mesaros, T. Heittola, A. Diment, B. M. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: tasks, datasets and baseline system," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017, p. 85–92.
    [BibTeX] [Abstract] [Download PDF]

    DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

    @inproceedings{2017_DCASE2017,
    author = "Mesaros, Annamaria and Heittola, Toni and Diment, Aleksandr and Elizalde, {Benjamin Martinez} and Shah, Ankit and Vincent, Emmanuel and Raj, Bhiksha and Virtanen, Tuomas",
    abstract = "DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
    keywords = "Sound scene analysis; Acoustic scene classification; Sound event detection; Audio tagging; Rare sound events",
    pages = "85--92",
    publisher = "Tampere University of Technology. Laboratory of Signal Processing",
    title = "{DCASE} 2017 challenge setup: tasks, datasets and baseline system",
    year = "2017",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/dcase-2017-challenge-paper.pdf"
    }

  • S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, "A Recurrent Encoder-Decoder Approach With Skip-Filtering Connections for Monaural Singing Voice Separation," in 27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2017. doi:10.1109/MLSP.2017.8168117
    [BibTeX] [Abstract] [Download PDF]

    The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.

    @inproceedings{2017_MLSP,
    author = "Mimilakis, Stylianos Ioannis and Drossos, Konstantinos and Virtanen, Tuomas and Schuller, Gerald",
    abstract = "The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.",
    booktitle = "27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP)",
    doi = "10.1109/MLSP.2017.8168117",
    keywords = "singing voice separation;",
    publisher = "IEEE",
    title = "A Recurrent Encoder-Decoder Approach With Skip-Filtering Connections for Monaural Singing Voice Separation",
    year = "2017",
    url = "https://arxiv.org/pdf/1709.00611.pdf"
    }

  • G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, and T. Virtanen, "Low latency sound source separation using convolutional recurrent neural networks," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 71-75. doi:10.1109/WASPAA.2017.8169997
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2017_WASPAA_d,
    author = "Naithani, Gaurav and Barker, Tom and Parascandolo, Giambattista and Bramsl⊘w, Lars and Pontoppidan, Niels Henrik and Virtanen, Tuomas",
    booktitle = "2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    title = "Low latency sound source separation using convolutional recurrent neural networks",
    year = "2017",
    pages = "71-75",
    keywords = "Convolution;Neural networks;Training data;Training;Source separation;Time-frequency analysis;Source Separation;Low-latency;Deep Neural Networks;Convolutional Recurrent Neural Networks",
    doi = "10.1109/WASPAA.2017.8169997",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/PID4978439.pdf"
    }

  • J. Nikunen and T. Virtanen, "Source Separation and Reconstruction of Spatial Audio Using Spectrogram Factorization," in Parametric time-frequency-domain spatial audio, John Wiley \\& Sons, 2017, p. 215–250. doi:10.1002/9781119252634.ch9
    [BibTeX] [Abstract]

    This chapter introduces methods for factorizing the spectrogram of multichannel audio into repetitive spectral objects and apply the introduced models to the analysis of spatial audio and modification of spatial sound through source separation. The purpose of decomposing an audio spectrogram using spectral templates is to learn the underlying structures (audio objects) from the observed data. The chapter discusses two main scenarios such as parameterization of multichannel surround sound and parameterization of microphone array signals. It explains the principles of source separation by time-frequency filtering using separation masks constructed from the spectrogram models. The chapter introduces a spatial covariance matrix model based on the directions of arrival of sound events and spectral templates, and discusses its relationship to conventional spatial audio signal processing. Source separation using spectrogram factorization models is achieved via time- frequency filtering of the original observation short-time Fourier transform (STFT) by a generalized Wiener filter obtained from the spectrogram model parameters.

    @inbook{2017,
    author = "Nikunen, Joonas and Virtanen, Tuomas",
    abstract = "This chapter introduces methods for factorizing the spectrogram of multichannel audio into repetitive spectral objects and apply the introduced models to the analysis of spatial audio and modification of spatial sound through source separation. The purpose of decomposing an audio spectrogram using spectral templates is to learn the underlying structures (audio objects) from the observed data. The chapter discusses two main scenarios such as parameterization of multichannel surround sound and parameterization of microphone array signals. It explains the principles of source separation by time-frequency filtering using separation masks constructed from the spectrogram models. The chapter introduces a spatial covariance matrix model based on the directions of arrival of sound events and spectral templates, and discusses its relationship to conventional spatial audio signal processing. Source separation using spectrogram factorization models is achieved via time- frequency filtering of the original observation short-time Fourier transform (STFT) by a generalized Wiener filter obtained from the spectrogram model parameters.",
    booktitle = "Parametric time-frequency-domain spatial audio",
    doi = "10.1002/9781119252634.ch9",
    editor2 = "Ville Pulkki and Symeon Delikaris-Manias and Archontis Politis",
    isbn = "978-1-119-25259-7",
    month = "10",
    pages = "215--250",
    publisher = "John Wiley {\\&} Sons",
    title = "Source Separation and Reconstruction of Spatial Audio Using Spectrogram Factorization",
    year = "2017"
    }

  • J. Nikunen, A. Diment, and T. Virtanen, "Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking," IEEE-ACM Transactions on Audio Speech and Language Processing, 2017. doi:10.1109/TASLP.2017.2774925
    [BibTeX] [Abstract] [Download PDF]

    In this paper we propose a method for separation of moving sound sources. The method is based on first tracking the sources and then estimation of source spectrograms using multichannel non-negative matrix factorization (NMF) and extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources denoted by spatial covariance matrices (SCM) and provide update equations for optimizing model parameters minimizing squared Frobenius norm. The SCMs of the model are obtained based on estimated directions of arrival of tracked sources at each time frame. The evaluation is based on established objective separation criteria and using real recordings of two and three simultaneous moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility towards tracking errors by comparing the separation quality achieved using annotated ground truth source trajectories.

    @article{2017_TASLP_a,
    author = "Nikunen, Joonas and Diment, Aleksandr and Virtanen, Tuomas",
    abstract = "In this paper we propose a method for separation of moving sound sources. The method is based on first tracking the sources and then estimation of source spectrograms using multichannel non-negative matrix factorization (NMF) and extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources denoted by spatial covariance matrices (SCM) and provide update equations for optimizing model parameters minimizing squared Frobenius norm. The SCMs of the model are obtained based on estimated directions of arrival of tracked sources at each time frame. The evaluation is based on established objective separation criteria and using real recordings of two and three simultaneous moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility towards tracking errors by comparing the separation quality achieved using annotated ground truth source trajectories.",
    doi = "10.1109/TASLP.2017.2774925",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "acoustic source tracking;Acoustics;Array signal processing;Direction-of-arrival estimation;Estimation;Mathematical model;microphone arrays;Microphones;moving sound sources;Sound source separation;Spectrogram;time-varying mixing model",
    month = "11",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Separation of Moving Sound Sources Using Multichannel {NMF} and Acoustic Tracking",
    year = "2017",
    url = "https://arxiv.org/pdf/1710.10005.pdf"
    }

  • J. Nikunen and T. Virtanen, "Time-difference of arrival model for spherical microphone arrays and application to direction of arrival estimation," in Proceedings of 25th European Signal Processing Conference, 2017, p. 1255–1259. doi:10.23919/EUSIPCO.2017.8081409
    [BibTeX] [Download PDF]
    @inproceedings{2017_EUSIPCO_b,
    author = "Nikunen, Joonas and Virtanen, Tuomas",
    abstract = "Summary form only given. Strong light-matter coupling has been recently successfully explored in the GHz and THz [1] range with on-chip platforms. New and intriguing quantum optical phenomena have been predicted in the ultrastrong coupling regime [2], when the coupling strength Ω becomes comparable to the unperturbed frequency of the system ω. We recently proposed a new experimental platform where we couple the inter-Landau level transition of an high-mobility 2DEG to the highly subwavelength photonic mode of an LC meta-atom [3] showing very large Ω/ωc = 0.87. Our system benefits from the collective enhancement of the light-matter coupling which comes from the scaling of the coupling Ω ∝ √n, were n is the number of optically active electrons. In our previous experiments [3] and in literature [4] this number varies from 104-103 electrons per meta-atom. We now engineer a new cavity, resonant at 290 GHz, with an extremely reduced effective mode surface Seff = 4 × 10-14 m2 (FE simulations, CST), yielding large field enhancements above 1500 and allowing to enter the few (<;100) electron regime. It consist of a complementary metasurface with two very sharp metallic tips separated by a 60 nm gap (Fig.1(a, b)) on top of a single triangular quantum well. THz-TDS transmission experiments as a function of the applied magnetic field reveal strong anticrossing of the cavity mode with linear cyclotron dispersion. Measurements for arrays of only 12 cavities are reported in Fig.1(c). On the top horizontal axis we report the number of electrons occupying the topmost Landau level as a function of the magnetic field. At the anticrossing field of B=0.73 T we measure approximately 60 electrons ultra strongly coupled (Ω/ω- ||",
    booktitle = "Proceedings of 25th European Signal Processing Conference",
    doi = "10.23919/EUSIPCO.2017.8081409",
    pages = "1255--1259",
    publisher = "IEEE",
    title = "Time-difference of arrival model for spherical microphone arrays and application to direction of arrival estimation",
    year = "2017",
    url = "http://www.eurasip.org/Proceedings/Eusipco/Eusipco2017/papers/1570346912.pdf"
    }

  • M. Parviainen and P. Pertilä, "Self-localization of dynamic user-worn microphones from observed speech," Applied Acoustics, vol. Volume 117, Part A, pp. 76-85, 2017. doi:http://dx.doi.org/10.1016/j.apacoust.2016.10.019
    [BibTeX] [Abstract]

    The increase of mobile devices and most recently wearables has raised the interest to utilize their sensors for various applications such as indoor localization. We present the first acoustic self-localization scheme that is passive, and is capable of operating when sensors are moving, and possibly unsynchronized. As a result, the relative microphone positions are obtained and therefore an ad hoc microphone array has been established. The proposed system takes advantage of the knowledge that a device is worn by its user e.g. attached to his/her clothing. A user here acts as a sound source and the sensor is the user-worn microphone. Such an entity is referred to as a node. Node-related spatial information is obtained from Time Difference of Arrival (TDOA) estimated from audio captured by the nodes. Kalman filtering is used for node tracking and prediction of spatial information during periods of node silence. Finally, the node positions are recovered using multidimensional scaling (MDS). The only information required by the proposed system is observations of sounds produced by the nodes such as speech to localize the moving nodes. The general framework for acoustic self-localization is presented followed by an implementation to demonstrate the concept. Real data collected by off-the-shelf equipment is used to evaluate the positioning accuracy of nodes in contrast to image based method. The presented system achieves an accuracy of approximately 10 cm in an acoustic laboratory.

    @article{2017_AA,
    author = {Parviainen, Mikko and Pertil{\"a}, Pasi},
    abstract = "Abstract The increase of mobile devices and most recently wearables has raised the interest to utilize their sensors for various applications such as indoor localization. We present the first acoustic self-localization scheme that is passive, and is capable of operating when sensors are moving, and possibly unsynchronized. As a result, the relative microphone positions are obtained and therefore an ad hoc microphone array has been established. The proposed system takes advantage of the knowledge that a device is worn by its user e.g. attached to his/her clothing. A user here acts as a sound source and the sensor is the user-worn microphone. Such an entity is referred to as a node. Node-related spatial information is obtained from Time Difference of Arrival (TDOA) estimated from audio captured by the nodes. Kalman filtering is used for node tracking and prediction of spatial information during periods of node silence. Finally, the node positions are recovered using multidimensional scaling (MDS). The only information required by the proposed system is observations of sounds produced by the nodes such as speech to localize the moving nodes. The general framework for acoustic self-localization is presented followed by an implementation to demonstrate the concept. Real data collected by off-the-shelf equipment is used to evaluate the positioning accuracy of nodes in contrast to image based method. The presented system achieves an accuracy of approximately 10 cm in an acoustic laboratory.",
    doi = "http://dx.doi.org/10.1016/j.apacoust.2016.10.019",
    issn = "0003-682X",
    journal = "Applied Acoustics",
    keywords = "Self-localization; Ad hoc networks; Microphone arrays; Acoustic measurements; Kalman filtering; Data association",
    month = "February",
    pages = "76 - 85",
    title = "{S}elf-localization of dynamic user-worn microphones from observed speech",
    volume = "Volume 117, Part A",
    year = "2017"
    }

  • M. Parviainen and P. Pertilä, "Obtaining an optimal set of head-related transfer functions with a small amount of measurements," in 2017 IEEE International Workshop on Signal Processing Systems (SiPS), 2017. doi:10.1109/SiPS.2017.8110008
    [BibTeX] [Abstract]

    This article presents a method to obtain personalized Head-Related Transfer Functions (HRTFs) for creating virtual soundscapes based on small amount of measurements. The best matching set of HRTFs are selected among the entries from publicly available databases. The proposed method is evaluated using a listening test where subjects assess the audio samples created using the best matching set of HRTFs against a randomly chosen set of HRTFs from the same location. The listening test indicates that subjects prefer the proposed method over random set of HRTFs.

    @inproceedings{2017_SiPS,
    author = {Parviainen, M. and Pertil{\"a}, P.},
    abstract = "This article presents a method to obtain personalized Head-Related Transfer Functions (HRTFs) for creating virtual soundscapes based on small amount of measurements. The best matching set of HRTFs are selected among the entries from publicly available databases. The proposed method is evaluated using a listening test where subjects assess the audio samples created using the best matching set of HRTFs against a randomly chosen set of HRTFs from the same location. The listening test indicates that subjects prefer the proposed method over random set of HRTFs.",
    booktitle = "2017 IEEE International Workshop on Signal Processing Systems (SiPS)",
    doi = "10.1109/SiPS.2017.8110008",
    keywords = "acoustic signal processing; audio signal processing; hearing; sound reproduction; transfer functions; HRTFs; head-related transfer functions; matching set; optimal set; publicly available databases; randomly chosen set; virtual soundscapes; Ear; Indexes; M",
    month = "10",
    publisher = "IEEE",
    title = "Obtaining an optimal set of head-related transfer functions with a small amount of measurements",
    year = "2017"
    }

  • J. M. Perez-Macias, S. Adavanne, J. Viik, A. Värri, S. Himanen, and M. Tenhunen, "Assessment of support vector machines and convolutional neural networks to detect snoring using Emfit mattress," in 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2017, p. 2883–2886. doi:10.1109/EMBC.2017.8037459
    [BibTeX] [Abstract]

    Snoring (SN) is an essential feature of sleep breathing disorders, such as obstructive sleep apnea (OSA). In this study, we evaluate epoch-based snoring detection methods using an unobtrusive electromechanical film transducer (Emfit) mattress sensor using polysomnography recordings as a reference. Two different approaches were investigated: a support vector machine (SVM) classifier fed with a subset of spectral features and convolutional neural network (CNN) fed with spectrograms. Representative 10-min normal breathing (NB) and SN periods were selected for analysis in 30 subjects and divided into thirty-second epochs. In the evaluation, average results over 10 fold Monte Carlo cross-validation with 80\% training and 20\% test split were reported. Highest performance was achieved using CNN, with 92\% sensitivity, 96\% specificity, 94\% accuracy, and 0.983 area under the receiver operating characteristics curve (AROC). Results showed a 6\% average increase of performance of the CNN over SVM and greater robustness, and similar performance to ambient microphones.

    @inproceedings{2017_EMBC,
    author = {Perez-Macias, Jose Martin and Adavanne, Sharath and Viik, Jari and V{\"a}rri, Alpo and Himanen, Sari-Leena and Tenhunen, Mirja},
    abstract = "Snoring (SN) is an essential feature of sleep breathing disorders, such as obstructive sleep apnea (OSA). In this study, we evaluate epoch-based snoring detection methods using an unobtrusive electromechanical film transducer (Emfit) mattress sensor using polysomnography recordings as a reference. Two different approaches were investigated: a support vector machine (SVM) classifier fed with a subset of spectral features and convolutional neural network (CNN) fed with spectrograms. Representative 10-min normal breathing (NB) and SN periods were selected for analysis in 30 subjects and divided into thirty-second epochs. In the evaluation, average results over 10 fold Monte Carlo cross-validation with 80\% training and 20\% test split were reported. Highest performance was achieved using CNN, with 92\% sensitivity, 96\% specificity, 94\% accuracy, and 0.983 area under the receiver operating characteristics curve (AROC). Results showed a 6\% average increase of performance of the CNN over SVM and greater robustness, and similar performance to ambient microphones.",
    booktitle = "2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)",
    doi = "10.1109/EMBC.2017.8037459",
    month = "9",
    pages = "2883--2886",
    publisher = "IEEE",
    title = "Assessment of support vector machines and convolutional neural networks to detect snoring using {E}mfit mattress",
    year = "2017"
    }

  • P. Pertilä and E. Cakir, "Robust direction estimation with convolutional neural networks based steered response power," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 6125-6129. doi:10.1109/ICASSP.2017.7953333
    [BibTeX]
    @INPROCEEDINGS{2017_ICASSP_a,
    author = "Pertilä, Pasi and Cakir, Emre",
    booktitle = "2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Robust direction estimation with convolutional neural networks based steered response power",
    year = "2017",
    pages = "6125-6129",
    keywords = "Speech;Interference;Reverberation;Training;Microphones;Estimation;Convolution;sound source localization;steered response power;convolutional neural networks;time-frequency masking",
    doi = "10.1109/ICASSP.2017.7953333"
    }

  • P. Pertilä, "Microphone-Array-Based Speech Enhancement Using Neural Networks," in Parametric Time‐Frequency Domain Spatial Audio, John Wiley & Sons, Ltd, 2017, pp. 291-325. doi:https://doi.org/10.1002/9781119252634.ch12
    [BibTeX] [Abstract]

    This chapter analyses the use of artificial neural networks (ANNs) in learning to predict time-frequency (TF) masks from the noisy input data. Artificial neural networks are inspired by the operation of biological neural networks, where individual neurons receive inputs from other connected neurons. The chapter focuses on TF mask prediction for speech enhancement in dynamic noise environments using artificial neural networks. It reviews the enhancement framework of microphone array signals using beamforming with post-filtering. The chapter presents an overview of the supervised learning framework used for the TF mask-based speech enhancement. It explores the effectiveness of feed-forward neural networks for a real-world enhancement application using recordings from everyday noisy environments, where a microphone array is used to capture the signals. Estimated instrumental intelligibility and signal-to-noise ratio (SNR) scores are evaluated to measure how well the predicted masks improve speech quality, using networks trained on different input features.

    @inbook{2017_g,
    author = "Pertilä, Pasi",
    publisher = "John Wiley \& Sons, Ltd",
    isbn = "9781119252634",
    title = "Microphone-Array-Based Speech Enhancement Using Neural Networks",
    booktitle = "Parametric Time‐Frequency Domain Spatial Audio",
    chapter = "12",
    pages = "291-325",
    doi = "https://doi.org/10.1002/9781119252634.ch12",
    year = "2017",
    keywords = "artificial neural networks, instrumental intelligibility, microphone array signals, post-filtering, signal-to-noise ratio, speech enhancement, time-frequency masks",
    abstract = "Abstract This chapter analyses the use of artificial neural networks (ANNs) in learning to predict time-frequency (TF) masks from the noisy input data. Artificial neural networks are inspired by the operation of biological neural networks, where individual neurons receive inputs from other connected neurons. The chapter focuses on TF mask prediction for speech enhancement in dynamic noise environments using artificial neural networks. It reviews the enhancement framework of microphone array signals using beamforming with post-filtering. The chapter presents an overview of the supervised learning framework used for the TF mask-based speech enhancement. It explores the effectiveness of feed-forward neural networks for a real-world enhancement application using recordings from everyday noisy environments, where a microphone array is used to capture the signals. Estimated instrumental intelligibility and signal-to-noise ratio (SNR) scores are evaluated to measure how well the predicted masks improve speech quality, using networks trained on different input features."
    }

  • G. Richard, T. Virtanen, J. P. Bello, N. Ono, and H. Glotin, "Introduction to the Special Section on Sound Scene and Event Analysis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, iss. 6, p. 1169–1171, 2017. doi:10.1109/TASLP.2017.2699334
    [BibTeX] [Abstract]

    The papers in this special section are devoted to the growing field of acoustic scene classification and acoustic event recognition. Machine listening systems still have difficulties to reach the ability of human listeners in the analysis of realistic acoustic scenes. If sustained research efforts have been made for decades in speech recognition, speaker identification and to a lesser extent in music information retrieval, the analysis of other types of sounds, such as environmental sounds, is the subject of growing interest from the community and is targeting an ever increasing set of audio categories. This problem appears to be particularly challenging due to the large variety of potential sound sources in the scene, which may in addition have highly different acoustic characteristics, especially in bioacoustics. Furthermore, in realistic environments, multiple sources are often present simultaneously, and in reverberant conditions.

    @article{2017_TASLP_b,
    author = "Richard, G. and Virtanen, T. and Bello, J. P. and Ono, N. and Glotin, H.",
    abstract = "The papers in this special section are devoted to the growing field of acoustic scene classification and acoustic event recognition. Machine listening systems still have difficulties to reach the ability of human listeners in the analysis of realistic acoustic scenes. If sustained research efforts have been made for decades in speech recognition, speaker identification and to a lesser extent in music information retrieval, the analysis of other types of sounds, such as environmental sounds, is the subject of growing interest from the community and is targeting an ever increasing set of audio categories. This problem appears to be particularly challenging due to the large variety of potential sound sources in the scene, which may in addition have highly different acoustic characteristics, especially in bioacoustics. Furthermore, in realistic environments, multiple sources are often present simultaneously, and in reverberant conditions.",
    issn = "2329-9290",
    journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
    month = "6",
    number = "6",
    pages = "1169--1171",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Introduction to the Special Section on Sound Scene and Event Analysis",
    volume = "25",
    year = "2017",
    doi = "10.1109/TASLP.2017.2699334"
    }

  • Z. Shuyang, T. Heittola, and T. Virtanen, "Active Learning for Sound Event Classification by Clustering Unlabeled Data," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, p. 751–755.
    [BibTeX] [Download PDF]
    @inproceedings{2017_ICASSP,
    author = "Shuyang, Zhao and Heittola, Toni and Virtanen, Tuomas",
    keywords = "active learning;sound event classification;K-medoids clustering",
    title = "Active Learning for Sound Event Classification by Clustering Unlabeled Data",
    url = "https://trepo.tuni.fi/handle/10024/129132",
    booktitle = "2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    pages = "751--755",
    year = "2017",
    organization = "IEEE"
    }

  • Z. Shuyang, T. Heittola, and T. Virtanen, "Learning vocal mode classifiers from heterogeneous data sources," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), United States, 2017, p. 16–20. doi:10.1109/WASPAA.2017.8169986
    [BibTeX] [Abstract] [Download PDF]

    This paper targets on a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary data source. However, previous studies on sound classification are commonly based on cross-validation using a single dataset, without considering the cases that training and testing data are recorded in mismatched condition. Experiments revealed a big difference between homogeneous recognition scenario and heterogeneous recognition scenario, using a new dataset TUT-vocal-2016. In the homogeneous recognition scenario, the classification accuracy using cross-validation on TUT-vocal-2016 was 95.5\%. In heterogeneous recognition scenario, seven existing datasets were used as training material and TUT-vocal-2016 was used for testing, the classification accuracy was only 69.6\%. Several feature normalization methods were tested to improve the performance in heterogeneous recognition scenario. The best performance (96.8\%) was obtained using the proposed subdataset-wise normalization.

    @inproceedings{2017_WASPAA_c,
    author = "Shuyang, Zhao and Heittola, Toni and Virtanen, Tuomas",
    abstract = "This paper targets on a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary data source. However, previous studies on sound classification are commonly based on cross-validation using a single dataset, without considering the cases that training and testing data are recorded in mismatched condition. Experiments revealed a big difference between homogeneous recognition scenario and heterogeneous recognition scenario, using a new dataset TUT-vocal-2016. In the homogeneous recognition scenario, the classification accuracy using cross-validation on TUT-vocal-2016 was 95.5\%. In heterogeneous recognition scenario, seven existing datasets were used as training material and TUT-vocal-2016 was used for testing, the classification accuracy was only 69.6\%. Several feature normalization methods were tested to improve the performance in heterogeneous recognition scenario. The best performance (96.8\%) was obtained using the proposed subdataset-wise normalization.",
    address = "United States",
    booktitle = "2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    doi = "10.1109/WASPAA.2017.8169986",
    isbn = "978-1-5386-1631-4",
    keywords = "sound classification; vocal mode; heterogeneous data sources; feature normalization",
    pages = "16–20",
    publisher = "IEEE Computer Society",
    title = "Learning vocal mode classifiers from heterogeneous data sources",
    year = "2017",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/vocal\_mode.pdf"
    }

  • M. Valenti, S. Squartini, A. Diment, G. Parascandolo, and T. Virtanen, "A convolutional neural network approach for acoustic scene classification," in 2017 International Joint Conference on Neural Networks, IJCNN 2017, 2017, p. 1547–1554. doi:10.1109/IJCNN.2017.7966035
    [BibTeX] [Abstract] [Download PDF]

    This paper presents a novel application of convolutional neural networks (CNNs) for the task of acoustic scene classification (ASC). We here propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrogram. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems which competed in the 'Detection and Classification of Acoustic Scenes and Events' (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0\% (development) and 86.2\% (evaluation), which constitute a 6.4\% and 9\% improvements with respect to the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system manages to reach a 77.0\% accuracy, improving by 1\% the challenge winner's score.

    @inproceedings{2017_IJCNN,
    author = "Valenti, Michele and Squartini, Stefano and Diment, Aleksandr and Parascandolo, Giambattista and Virtanen, Tuomas",
    abstract = "This paper presents a novel application of convolutional neural networks (CNNs) for the task of acoustic scene classification (ASC). We here propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrogram. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems which competed in the 'Detection and Classification of Acoustic Scenes and Events' (DCASE) challenges held in 20161 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0\% (development) and 86.2\% (evaluation), which constitute a 6.4\% and 9\% improvements with respect to the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system manages to reach a 77.0\% accuracy, improving by 1\% the challenge winner's score.",
    booktitle = "2017 International Joint Conference on Neural Networks, IJCNN 2017",
    doi = "10.1109/IJCNN.2017.7966035",
    month = "6",
    pages = "1547--1554",
    publisher = "IEEE",
    title = "A convolutional neural network approach for acoustic scene classification",
    year = "2017",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ijcnn\_paper\_valenti\_extended.pdf"
    }

  • T. Virtanen, M. D. Plumbley, and D. Ellis, Computational analysis of sound scenes and events, Springer, 2017. doi:10.1007/978-3-319-63450-0
    [BibTeX] [Abstract] [Download PDF]

    This book presents computational methods for extracting the useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms.

    @book{2017_a,
    author = "Virtanen, Tuomas and Plumbley, Mark D. and Ellis, Dan",
    abstract = "This book presents computational methods for extracting the useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms.",
    doi = "10.1007/978-3-319-63450-0",
    isbn = "978-3-319-63449-4",
    month = "9",
    publisher = "Springer",
    title = "Computational analysis of sound scenes and events",
    year = "2017",
    url = "http://www.springer.com/us/book/9783319634494"
    }

  • T. Virtanen, M. D. Plumbley, and D. Ellis, "Introduction to sound scene and event analysis," in Computational Analysis of Sound Scenes and Events, Springer, 2017, p. 3–12. doi:10.1007/978-3-319-63450-0_1
    [BibTeX] [Abstract]

    Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.

    @inbook{2017_b,
    author = "Virtanen, Tuomas and Plumbley, Mark D. and Ellis, Dan",
    abstract = "Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.",
    booktitle = "Computational Analysis of Sound Scenes and Events",
    doi = "10.1007/978-3-319-63450-0\_1",
    editor2 = "Tuomas Virtanen and Mark D. Plumbley and Dan Ellis",
    isbn = "978-3-319-63449-4",
    month = "9",
    pages = "3--12",
    publisher = "Springer",
    title = "Introduction to sound scene and event analysis",
    year = "2017"
    }

  • T. Virtanen, A. Mesaros, T. Heittola, A. Diment, E. Vincent, E. Benetos, and B. M. Elizalde, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Tampere University of Technology. Laboratory of Signal Processing, 2017.
    [BibTeX]
    @book{2017_h,
    author = "Virtanen, Tuomas and Mesaros, Annamaria and Heittola, Toni and Diment, Aleksandr and Vincent, Emmanuel and Benetos, Emmanouil and Elizalde, Benjamin Martinez",
    month = "11",
    publisher = "Tampere University of Technology. Laboratory of Signal Processing",
    title = "{P}roceedings of the {D}etection and {C}lassification of {A}coustic {S}cenes and {E}vents 2017 {W}orkshop ({DCASE}2017)",
    year = "2017"
    }

2016

  • S. Adavanne, G. Parascandolo, P. Pertilä, T. Heittola, and T. Virtanen, "Sound event detection in multichannel audio using spatial and harmonic features," in Detection and Classification of Acoustic Scenes and Events, 2016.
    [BibTeX] [Abstract] [Download PDF]

    In this paper, we propose the use of spatial and harmonic features in combination with long short term memory (LSTM) recurrent neural network (RNN) for automatic sound event detection (SED) task. Real life sound recordings typically have many overlapping sound events, making it hard to recognize with just mono channel audio. Human listeners have been successfully recognizing the mixture of overlapping sound events using pitch cues and exploiting the stereo (multichannel) audio signal available at their ears to spatially localize these events. Traditionally SED systems have only been using mono channel audio, motivated by the human listener we propose to extend them to use multichannel audio. The proposed SED system is compared against the state of the art mono channel method on the development subset of TUT sound events detection 2016 database [1]. The usage of spatial and harmonic features are shown to improve the performance of SED.

    @inproceedings{2016_DCASE,
    author = "Adavanne, Sharath and Parascandolo, Giambattista and Pertila, Pasi and Heittola, Toni and Virtanen, Tuomas",
    abstract = "In this paper, we propose the use of spatial and harmonic features in combination with long short term memory (LSTM) recurrent neural network (RNN) for automatic sound event detection (SED) task. Real life sound recordings typically have many overlapping sound events, making it hard to recognize with just mono channel audio. Human listeners have been successfully recognizing the mixture of overlapping sound events using pitch cues and exploiting the stereo (multichannel) audio signal available at their ears to spatially localize these events. Traditionally SED systems have only been using mono channel audio, motivated by the human listener we propose to extend them to use multichannel audio. The proposed SED system is compared against the state of the art mono channel method on the development subset of TUT sound events detection 2016 database [1]. The usage of spatial and harmonic features are shown to improve the performance of SED.",
    booktitle = "Detection and Classification of Acoustic Scenes and Events",
    keywords = "Sound event detection;multichannel;time difference of arrival;pitch;recurrent neural networks;long short term memory",
    title = "{S}ound event detection in multichannel audio using spatial and harmonic features",
    url = "https://dcase.community/documents/workshop2016/proceedings/Adavanne-DCASE2016workshop.pdf",
    year = "2016"
    }
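
    As an illustration of the spatial cues mentioned in the entry above (its keywords include time difference of arrival), the following minimal sketch estimates an inter-channel delay with GCC-PHAT. The frame length, sample rate and toy signals are assumptions for illustration only; this is not the paper's feature extraction pipeline.

    # Minimal GCC-PHAT sketch (illustrative, not the paper's pipeline):
    # estimate the inter-channel delay between two signals.
    import numpy as np

    def gcc_phat_tdoa(x_left, x_right, fs=44100, max_delay=0.001):
        n = len(x_left) + len(x_right)
        X1 = np.fft.rfft(x_left, n)
        X2 = np.fft.rfft(x_right, n)
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n)
        max_shift = int(fs * max_delay)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        lag = np.argmax(np.abs(cc)) - max_shift
        return lag / fs                          # estimated inter-channel delay in seconds

    fs = 44100
    rng = np.random.default_rng(0)
    left = rng.standard_normal(fs)
    right = np.roll(left, 10)                    # toy signal pair with a 10-sample offset
    print(gcc_phat_tdoa(left, right, fs))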

  • T. Barker and T. Virtanen, "Blind Separation of Audio Mixtures Through Nonnegative Tensor Factorisation of Modulation Spectograms," IEEE-ACM Transactions on Audio Speech and Language Processing, vol. 24, iss. 12, p. 2377–2389, 2016. doi:10.1109/TASLP.2016.2602546
    [BibTeX] [Abstract] [Download PDF]

    This paper presents an algorithm for unsupervised single-channel source separation of audio mixtures. The approach specifically addresses the challenging case of separation where no training data are available. By representing mixtures in the modulation spectrogram (MS) domain, we exploit underlying similarities in patterns present across frequency. A three-dimensional tensor factorization is able to take advantage of these redundant patterns, and is used to separate a mixture into an approximated sum of components by minimizing a divergence cost. Furthermore, we show that the basic tensor factorization can be extended with convolution in time being used to improve separation results and provide update rules to learn components in such a manner. Following factorization, sources are reconstructed in the audio domain from estimated components using a novel approach based on reconstruction masks that are learned using MS activations, and then applied to a mixture spectrogram. We demonstrate that the proposed method produces superior separation performance to a spectrally based nonnegative matrix factorization approach, in terms of source-to-distortion ratio. We also compare separation with the perceptually motivated interference-related perceptual score metric and identify cases with higher performance.

    @article{2016_TASLP,
    author = "Barker, Tom and Virtanen, Tuomas",
    abstract = "This paper presents an algorithm for unsupervised single-channel source separation of audio mixtures. The approach specifically addresses the challenging case of separation where no training data are available. By representing mixtures in the modulation spectrogram (MS) domain, we exploit underlying similarities in patterns present across frequency. A three-dimensional tensor factorization is able to take advantage of these redundant patterns, and is used to separate a mixture into an approximated sum of components by minimizing a divergence cost. Furthermore, we show that the basic tensor factorization can be extended with convolution in time being used to improve separation results and provide update rules to learn components in such a manner. Following factorization, sources are reconstructed in the audio domain from estimated components using a novel approach based on reconstruction masks that are learned using MS activations, and then applied to a mixture spectrogram. We demonstrate that the proposed method produces superior separation performance to a spectrally based nonnegative matrix factorization approach, in terms of source-to-distortion ratio. We also compare separation with the perceptually motivated interference-related perceptual score metric and identify cases with higher performance.",
    doi = "10.1109/TASLP.2016.2602546",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "Factorization; nonnegative matrix factorization (NMF); source separation; speech enhancement",
    month = "12",
    number = "12",
    pages = "2377--2389",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Blind Separation of Audio Mixtures Through Nonnegative Tensor Factorisation of Modulation Spectograms",
    volume = "24",
    year = "2016",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/blind-separation-audio.pdf"
    }
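
    The separation above relies on non-negative factorisation with multiplicative updates minimising a KL-type divergence. As a rough, hedged illustration of that update machinery only, the sketch below implements plain two-dimensional KL-NMF of a magnitude spectrogram; the paper's actual method factorises a three-dimensional modulation-spectrogram tensor and adds convolutive updates, which are not reproduced here. All names (V, W, H) and parameter values are illustrative.

    # Two-dimensional KL-NMF with multiplicative updates (Lee-Seung style),
    # shown only to illustrate the update machinery; the paper factorises a
    # 3-D modulation-spectrogram tensor instead. V is a non-negative
    # magnitude spectrogram of shape (frequencies, frames).
    import numpy as np

    def kl_nmf(V, n_components=10, n_iter=200, eps=1e-10):
        F, T = V.shape
        rng = np.random.default_rng(0)
        W = rng.random((F, n_components)) + eps      # spectral basis vectors
        H = rng.random((n_components, T)) + eps      # time-varying activations
        for _ in range(n_iter):
            WH = W @ H + eps
            W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        return W, H

    V = np.abs(np.random.default_rng(1).standard_normal((513, 100)))
    W, H = kl_nmf(V)
    print(W.shape, H.shape)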

  • T. Barker and T. Virtanen, "Blind Separation of Audio Mixtures Through Nonnegative Tensor Factorization of Modulation Spectrograms," Ieee-Acm transactions on audio speech and language processing, vol. 24, iss. 12, p. 2377–2389, 2016. doi:10.1109/TASLP.2016.2602546
    [BibTeX] [Abstract]

    This paper presents an algorithm for unsupervised single-channel source separation of audio mixtures. The approach specifically addresses the challenging case of separation where no training data are available. By representing mixtures in the modulation spectrogram (MS) domain, we exploit underlying similarities in patterns present across frequency. A three-dimensional tensor factorization is able to take advantage of these redundant patterns, and is used to separate a mixture into an approximated sum of components by minimizing a divergence cost. Furthermore, we show that the basic tensor factorization can be extended with convolution in time being used to improve separation results and provide update rules to learn components in such a manner. Following factorization, sources are reconstructed in the audio domain from estimated components using a novel approach based on reconstruction masks that are learned using MS activations, and then applied to a mixture spectrogram. We demonstrate that the proposed method produces superior separation performance to a spectrally based nonnegative matrix factorization approach, in terms of source-to-distortion ratio. We also compare separation with the perceptually motivated interference-related perceptual score metric and identify cases with higher performance.

    @article{2016_TASLP_a,
    author = "Barker, Tom and Virtanen, Tuomas",
    title = "Blind Separation of Audio Mixtures Through Nonnegative Tensor Factorization of Modulation Spectrograms",
    abstract = "This paper presents an algorithm for unsupervised single-channel source separation of audio mixtures. The approach specifically addresses the challenging case of separation where no training data are available. By representing mixtures in the modulation spectrogram (MS) domain, we exploit underlying similarities in patterns present across frequency. A three-dimensional tensor factorization is able to take advantage of these redundant patterns, and is used to separate a mixture into an approximated sum of components by minimizing a divergence cost. Furthermore, we show that the basic tensor factorization can be extended with convolution in time being used to improve separation results and provide update rules to learn components in such a manner. Following factorization, sources are reconstructed in the audio domain from estimated components using a novel approach based on reconstruction masks that are learned using MS activations, and then applied to a mixture spectrogram. We demonstrate that the proposed method produces superior separation performance to a spectrally based nonnegative matrix factorization approach, in terms of source-to-distortion ratio. We also compare separation with the perceptually motivated interference-related perceptual score metric and identify cases with higher performance.",
    keywords = "Factorization, nonnegative matrix factorization (NMF), source separation, speech enhancement",
    year = "2016",
    month = "December",
    day = "1",
    doi = "10.1109/TASLP.2016.2602546",
    language = "English",
    volume = "24",
    pages = "2377--2389",
    journal = "Ieee-Acm transactions on audio speech and language processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity",
    number = "12"
    }

  • E. Cakir, E. C. Ozan, and T. Virtanen, "Filterbank Learning for Deep Neural Network Based Polyphonic Sound Event Detection," in 2016 International Joint Conference on Neural Networks (IJCNN), 2016. doi:10.1109/IJCNN.2016.7727634
    [BibTeX] [Abstract] [Download PDF]

    Deep learning techniques such as deep feedforward neural networks and deep convolutional neural networks have recently been shown to improve the performance in sound event detection compared to traditional methods such as Gaussian mixture models. One of the key factors of this improvement is the capability of deep architectures to automatically learn higher levels of acoustic features in each layer. In this work, we aim to combine the feature learning capabilities of deep architectures with the empirical knowledge of human perception. We use the first layer of a deep neural network to learn a mapping from a high-resolution magnitude spectrum to smaller amount of frequency bands, which effectively learns a filterbank for the sound event detection task. We initialize the first hidden layer weights to match with the perceptually motivated mel filterbank magnitude response. We also integrate this initialization scheme with context windowing by using an appropriately constrained deep convolutional neural network. The proposed method does not only result with better detection accuracy, but also provides insight on the frequencies deemed essential for better discrimination of given sound events.

    @inproceedings{2016_IJCNN,
    author = "Cakir, Emre and Ozan, Ezgi Can and Virtanen, Tuomas",
    abstract = "Deep learning techniques such as deep feedforward neural networks and deep convolutional neural networks have recently been shown to improve the performance in sound event detection compared to traditional methods such as Gaussian mixture models. One of the key factors of this improvement is the capability of deep architectures to automatically learn higher levels of acoustic features in each layer. In this work, we aim to combine the feature learning capabilities of deep architectures with the empirical knowledge of human perception. We use the first layer of a deep neural network to learn a mapping from a high-resolution magnitude spectrum to smaller amount of frequency bands, which effectively learns a filterbank for the sound event detection task. We initialize the first hidden layer weights to match with the perceptually motivated mel filterbank magnitude response. We also integrate this initialization scheme with context windowing by using an appropriately constrained deep convolutional neural network. The proposed method does not only result with better detection accuracy, but also provides insight on the frequencies deemed essential for better discrimination of given sound events.",
    booktitle = "2016 International Joint Conference on Neural Networks (IJCNN)",
    day = "3",
    doi = "10.1109/IJCNN.2016.7727634",
    month = "11",
    publisher = "IEEE",
    title = "Filterbank Learning for Deep Neural Network Based Polyphonic Sound Event Detection",
    year = "2016",
    url = "https://tutcris.tut.fi/portal/files/13966417/filterbank\_learning\_ijcnn\_2016.pdf"
    }
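
    A minimal sketch of the initialisation idea described in the abstract above: the first layer of a network maps a high-resolution magnitude spectrum to a small number of bands, with weights starting from a mel filterbank magnitude response and remaining trainable. The sketch assumes librosa and PyTorch; layer sizes, sample rate and FFT length are illustrative, and the paper's exact architecture is not reproduced.

    # Hedged sketch: DNN whose first layer is initialised from a mel filterbank
    # (librosa) and kept trainable, per the idea in the abstract above.
    # sr, n_fft, n_mels, hidden size and n_classes are illustrative.
    import librosa
    import torch
    import torch.nn as nn

    sr, n_fft, n_mels = 44100, 2048, 40
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (n_mels, 1 + n_fft // 2)

    class MelInitNet(nn.Module):
        def __init__(self, n_classes=6):
            super().__init__()
            self.filterbank = nn.Linear(1 + n_fft // 2, n_mels, bias=False)
            with torch.no_grad():                 # start from the mel magnitude response
                self.filterbank.weight.copy_(torch.from_numpy(mel_fb).float())
            self.hidden = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                        nn.Linear(256, n_classes))

        def forward(self, magnitude_spectrum):    # (batch, 1 + n_fft // 2)
            x = torch.relu(self.filterbank(magnitude_spectrum))
            return torch.sigmoid(self.hidden(x))  # per-class event activities

    model = MelInitNet()
    print(model(torch.rand(8, 1 + n_fft // 2)).shape)   # torch.Size([8, 6])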

  • A. Diment, M. Parviainen, T. Virtanen, R. Zelov, and A. Glasman, "Noise-robust detection of whispering in telephone calls using deep neural networks," in 2016 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 2310-2314. doi:10.1109/EUSIPCO.2016.7760661
    [BibTeX] [Abstract] [Download PDF]

    Detection of whispered speech in the presence of high levels of background noise has applications in fraudulent behaviour recognition. For instance, it can serve as an indicator of possible insider trading. We propose a deep neural network (DNN)-based whispering detection system, which operates on both magnitude and phase features, including the group delay feature from all-pole models (APGD). We show that the APGD feature outperforms the conventional ones. Trained and evaluated on the collected diverse dataset of whispered and normal speech with emulated phone line distortions and significant amounts of added background noise, the proposed system performs with accuracies as high as 91.8%.

    @INPROCEEDINGS{2016_EUSIPCO_b,
    author = "Diment, Aleksandr and Parviainen, Mikko and Virtanen, Tuomas and Zelov, Roman and Glasman, Alex",
    booktitle = "2016 24th European Signal Processing Conference (EUSIPCO)",
    title = "Noise-robust detection of whispering in telephone calls using deep neural networks",
    year = "2016",
    volume = "",
    number = "",
    pages = "2310-2314",
    keywords = "Speech;Feature extraction;Noise measurement;Training;Signal processing;Neural networks;Speech recognition",
    abstract = "Detection of whispered speech in the presence of high levels of background noise has applications in fraudulent behaviour recognition. For instance, it can serve as an indicator of possible insider trading. We propose a deep neural network (DNN)-based whispering detection system, which operates on both magnitude and phase features, including the group delay feature from all-pole models (APGD). We show that the APGD feature outperforms the conventional ones. Trained and evaluated on the collected diverse dataset of whispered and normal speech with emulated phone line distortions and significant amounts of added background noise, the proposed system performs with accuracies as high as 91.8\%.",
    doi = "10.1109/EUSIPCO.2016.7760661",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Diment16\_WHI.pdf"
    }

  • K. Drossos, M. Kaliakatsos-Papakostas, A. Floros, and T. Virtanen, "On the Impact of The Semantic Content of Sound Events in Emotion Elicitation," Journal of the Audio Engineering Society, vol. 64, iss. 7/8, p. 525–532, 2016. doi:10.17743/jaes.2016.0024
    [BibTeX] [Abstract]

    Sound events are known to have an influence on the listener’s emotions, but the reason for this influence is less clear. Take for example the sound produced by a gun firing. Does the emotional impact arise from the fact that the listener recognizes that a gun produced the sound (semantic content) or does it arise from the attributes of the sound created by the firing gun? This research explores the relation between the semantic similarity of the sound events and the elicited emotions. Results indicate that the semantic content seems to have a limited role in the conformation of the listener’s affective states. However, when the semantic content is matched to specific areas in the Arousal-Valence space or when the source’s spatial position is considered, the effect of the semantic content is higher, especially for the cases of medium to low valence and medium to high arousal or when the sound source is at the lateral positions of the listener’s head.

    @article{2016_AES,
    author = "Drossos, Konstantinos and Kaliakatsos-Papakostas, Maximos and Floros, Andreas and Virtanen, Tuomas",
    abstract = "Sound events are known to have an influence on the listener’s emotions, but the reason for this influence is less clear. Take for example the sound produced by a gun firing. Does the emotional impact arise from the fact that the listener recognizes that a gun produced the sound (semantic content) or does it arise from the attributes of the sound created by the firing gun? This research explores the relation between the semantic similarity of the sound events and the elicited emotions. Results indicate that the semantic content seems to have a limited role in the conformation of the listener’s affective states. However, when the semantic content is matched to specific areas in the Arousal-Valence space or when the source’s spatial position is considered, the effect of the semantic content is higher, especially for the cases of medium to low valence and medium to high arousal or when the sound source is at the lateral positions of the listener’s head.",
    day = "11",
    doi = "10.17743/jaes.2016.0024",
    issn = "1549-4950",
    journal = "Journal of the Audio Engineering Society",
    month = "8",
    number = "7/8",
    pages = "525--532",
    publisher = "Audio Engineering Society",
    title = "On the Impact of The Semantic Content of Sound Events in Emotion Elicitation",
    volume = "64",
    year = "2016"
    }

  • K. Mahkonen, A. Hurmalainen, T. Virtanen, and J. Kämäräinen, "Cascade processing for speeding up sliding window sparse classification," in European Signal Processing Conference (EUSIPCO), 2016, 2016. doi:10.1109/EUSIPCO.2016.7760660
    [BibTeX] [Abstract] [Download PDF]

    Sparse representations have been found to provide high classification accuracy in many fields. Their drawback is the high computational load. In this work, we propose a novel cascaded classifier structure to speed up the decision process while utilizing sparse signal representation. In particular, we apply the cascaded decision process for noise robust automatic speech recognition task. The cascaded decision process is implemented using a feedforward neural network (NN) and time sparse versions of a non-negative matrix factorization (NMF) based sparse classification method of [1]. The recognition accuracy of our cascade is among the three best in the recent CHiME2013 benchmark and obtains six times faster the accuracy of NMF alone as in [1].

    @inproceedings{2016_EUSIPCO,
    author = {Mahkonen, Katariina and Hurmalainen, Antti and Virtanen, Tuomas and K{\"a}m{\"a}r{\"a}inen, Joni-Kristian},
    abstract = "Sparse representations have been found to provide high classification accuracy in many fields. Their drawback is the high computational load. In this work, we propose a novel cascaded classifier structure to speed up the decision process while utilizing sparse signal representation. In particular, we apply the cascaded decision process for noise robust automatic speech recognition task. The cascaded decision process is implemented using a feedforward neural network (NN) and time sparse versions of a non-negative matrix factorization (NMF) based sparse classification method of [1]. The recognition accuracy of our cascade is among the three best in the recent CHiME2013 benchmark and obtains six times faster the accuracy of NMF alone as in [1].",
    booktitle = "European Signal Processing Conference (EUSIPCO), 2016",
    doi = "10.1109/EUSIPCO.2016.7760660",
    publisher = "IEEE",
    title = "Cascade processing for speeding up sliding window sparse classification",
    year = "2016",
    url = "http://vision.cs.tut.fi/data/publications/eusipco2016\_cascaded\_nmf.pdf"
    }

  • A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," in 2016 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1128-1132. doi:10.1109/EUSIPCO.2016.7760424
    [BibTeX] [Abstract] [Download PDF]

    We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

    @INPROCEEDINGS{2016_EUSIPCO_a,
    author = "Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas",
    booktitle = "2016 24th European Signal Processing Conference (EUSIPCO)",
    title = "{TUT} database for acoustic scene classification and sound event detection",
    year = "2016",
    volume = "",
    number = "",
    pages = "1128-1132",
    keywords = "Event detection;Databases;Automobiles;Signal processing;Mel frequency cepstral coefficient;Europe",
    abstract = "We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting ofbinaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.",
    doi = "10.1109/EUSIPCO.2016.7760424",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/mesaros\_eusipco2016-dcase.pdf"
    }
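
    The abstract above mentions a baseline built on mel frequency cepstral coefficients and Gaussian mixture models. A hedged sketch of that kind of scene-classification baseline is given below; file paths, label lists and parameter values are placeholders, not the database's official protocol.

    # Hedged MFCC + GMM scene classification sketch: one diagonal-covariance
    # Gaussian mixture per scene class, classification by maximum average
    # frame log-likelihood. Paths, labels and parameters are placeholders.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_features(path, sr=44100, n_mfcc=20):
        y, sr = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T    # (frames, n_mfcc)

    def train_scene_models(train_items, n_components=16):
        # train_items: dict mapping scene label -> list of audio file paths (placeholder)
        models = {}
        for label, paths in train_items.items():
            feats = np.vstack([mfcc_features(p) for p in paths])
            models[label] = GaussianMixture(n_components=n_components,
                                            covariance_type="diag").fit(feats)
        return models

    def classify_scene(path, models):
        feats = mfcc_features(path)
        return max(models, key=lambda label: models[label].score(feats))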

  • A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, iss. 6, p. 162, 2016. doi:10.3390/app6060162
    [BibTeX] [Abstract] [Download PDF]

    This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

    @article{2016_AS,
    author = "Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas",
    abstract = "This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.",
    journal = "Applied Sciences",
    number = "6",
    pages = "162",
    title = "{M}etrics for polyphonic sound event detection",
    url = "http://www.mdpi.com/2076-3417/6/6/162",
    volume = "6",
    year = "2016",
    doi = "10.3390/app6060162"
    }
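
    A simplified, hedged illustration of the segment-based metrics discussed in the paper above (the authors' sed_eval toolbox is the reference implementation). The sketch takes binary activity matrices for reference and system output and computes micro-averaged precision, recall, F-score and a segment-based error rate built from substitutions, insertions and deletions; class-based averaging and event-based metrics are not shown.

    # Hedged sketch of segment-based evaluation for polyphonic sound event
    # detection. Inputs are binary activity matrices of shape
    # (n_segments, n_classes), where 1 marks an event class active in a segment.
    import numpy as np

    def segment_based_metrics(reference, estimated):
        reference = np.asarray(reference, dtype=bool)
        estimated = np.asarray(estimated, dtype=bool)
        tp = np.logical_and(reference, estimated).sum()
        fp = np.logical_and(~reference, estimated).sum()
        fn = np.logical_and(reference, ~estimated).sum()
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Error rate from per-segment substitutions, deletions and insertions,
        # normalised by the number of active reference events.
        seg_fp = np.logical_and(~reference, estimated).sum(axis=1)
        seg_fn = np.logical_and(reference, ~estimated).sum(axis=1)
        subs = np.minimum(seg_fp, seg_fn).sum()
        dels = np.maximum(0, seg_fn - seg_fp).sum()
        ins = np.maximum(0, seg_fp - seg_fn).sum()
        error_rate = (subs + dels + ins) / max(reference.sum(), 1)
        return {"precision": precision, "recall": recall,
                "f1": f1, "error_rate": error_rate}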

  • S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, "Deep neural networks for dynamic range compression in mastering applications," in Audio Engineering Society Convention 140, 2016.
    [BibTeX] [Abstract]

    The process of audio mastering often, if not always, includes various audio signal processing techniques such as frequency equalization and dynamic range compression. With respect to the genre and style of the audio content, the parameters of these techniques are controlled by a mastering engineer, in order to process the original audio material. This operation relies on musical and perceptually pleasing facets of the perceived acoustic characteristics, transmitted from the audio material under the mastering process. Modeling such dynamic operations, which involve adaptation regarding the audio content, becomes vital in automated applications since it significantly affects the overall performance. In this work we present a system capable of modelling such behavior focusing on the automatic dynamic range compression. It predicts frequency coefficients that allow the dynamic range compression, via a trained deep neural network, and applies them to unmastered audio signal served as input. Both dynamic range compression and the prediction of the corresponding frequency coefficients take place inside the time-frequency domain, using magnitude spectra acquired from a critical band filter bank, similar to humans’ peripheral auditory system. Results from conducted listening tests, incorporating professional music producers and audio mastering engineers, demonstrate on average an equivalent performance compared to professionally mastered audio content. Improvements were also observed when compared to relevant and commercial software.

    @inproceedings{2016_AES_a,
    author = "Mimilakis, Stylianos Ioannis and Drossos, Konstantinos and Virtanen, Tuomas and Schuller, Gerald",
    title = "Deep neural networks for dynamic range compression in mastering applications",
    abstract = "The process of audio mastering often, if not always, includes various audio signal processing techniques such as frequency equalization and dynamic range compression. With respect to the genre and style of the audio content, the parameters of these techniques are controlled by a mastering engineer, in order to process the original audio material. This operation relies on musical and perceptually pleasing facets of the perceived acoustic characteristics, transmitted from the audio material under the mastering process. Modeling such dynamic operations, which involve adaptation regarding the audio content, becomes vital in automated applications since it significantly affects the overall performance. In this work we present a system capable of modelling such behavior focusing on the automatic dynamic range compression. It predicts frequency coefficients that allow the dynamic range compression, via a trained deep neural network, and applies them to unmastered audio signal served as input. Both dynamic range compression and the prediction of the corresponding frequency coefficients take place inside the time-frequency domain, using magnitude spectra acquired from a critical band filter bank, similar to humans’ peripheral auditory system. Results from conducted listening tests, incorporating professional music producers and audio mastering engineers, demonstrate on average an equivalent performance compared to professionally mastered audio content. Improvements were also observed when compared to relevant and commercial software.",
    booktitle = "Audio Engineering Society Convention 140",
    year = "2016",
    organization = "Audio Engineering Society",
    publisher = "AES Audio Engineering Society"
    }

  • G. Naithani, G. Parascandolo, T. Barker, N. H. Pontoppidan, and T. Virtanen, "Low-latency sound source separation using deep neural networks," in 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 272-276. doi:10.1109/GlobalSIP.2016.7905846
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2016_GlobalSIP,
    author = "Naithani, Gaurav and Parascandolo, Giambattista and Barker, Tom and Pontoppidan, Niels Henrik and Virtanen, Tuomas",
    booktitle = "2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP)",
    title = "Low-latency sound source separation using deep neural networks",
    year = "2016",
    volume = "",
    number = "",
    pages = "272-276",
    keywords = "Source separation;Training;Context;Measurement;Neural networks;Speech;Acoustics;Source separation;Deep neural networks;Low-latency",
    doi = "10.1109/GlobalSIP.2016.7905846",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/naithani\_globalsip2016.pdf"
    }

  • J. Nikunen, A. Diment, T. Virtanen, and M. Vilermo, "Binaural rendering of microphone array captures based on source separation," Speech Communication, vol. 76, p. 157–169, 2016.
    [BibTeX] [Download PDF]
    @article{2016_ICA,
    author = "Nikunen, Joonas and Diment, Aleksandr and Virtanen, Tuomas and Vilermo, Miikka",
    journal = "Speech Communication",
    pages = "157--169",
    title = "Binaural rendering of microphone array captures based on source separation",
    url = "http://www.sciencedirect.com/science/article/pii/S0167639315001004",
    volume = "76",
    year = "2016"
    }

  • G. Parascandolo, H. Huttunen, and T. Virtanen, "Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, p. 6440–6444. doi:10.1109/ICASSP.2016.7472917
    [BibTeX] [Abstract] [Download PDF]

    In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map acoustic features of a mixture signal consisting of sounds from multiple classes, to binary activity indicators of each event class. Our method is tested on a large database of real-life recordings, with 61 classes (e.g. music, car, speech) from 10 different everyday contexts. The proposed method outperforms previous approaches by a large margin, and the results are further improved using data augmentation techniques. Overall, our system reports an average F1-score of 65.5% on 1 second blocks and 64.7% on single frames, a relative improvement over previous state-of-the-art approach of 6.8% and 15.1% respectively.

    @inproceedings{2016_ICASSP,
    author = "Parascandolo, Giambattista and Huttunen, Heikki and Virtanen, Tuomas",
    abstract = "In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map acoustic features of a mixture signal consisting of sounds from multiple classes, to binary activity indicators of each event class. Our method is tested on a large database of real-life recordings, with 61 classes (e.g. music, car, speech) from 10 different everyday contexts. The proposed method outperforms previous approaches by a large margin, and the results are further improved using data augmentation techniques. Overall, our system reports an average F1-score of 65.5\% on 1 second blocks and 64.7\% on single frames, a relative improvement over previous state-of-the-art approach of 6.8\% and 15.1\% respectively.",
    booktitle = "2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    doi = "10.1109/ICASSP.2016.7472917",
    isbn = "978-1-4799-9988-0",
    month = "3",
    pages = "6440--6444",
    title = "Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings",
    year = "2016",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/parascandolo-icassp2016.pdf"
    }
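
    A hedged sketch of the model family described in the abstract above: a bi-directional LSTM mapping frame-wise acoustic features to per-class activity probabilities, trained with binary cross-entropy. PyTorch is used for convenience, and the layer sizes, feature dimension and training snippet are illustrative assumptions rather than the paper's exact configuration.

    # Hedged multi-label BLSTM sketch: frame-wise features in, per-class
    # activity probabilities out, trained with binary cross-entropy.
    import torch
    import torch.nn as nn

    class MultiLabelBLSTM(nn.Module):
        def __init__(self, n_features=40, n_classes=61, hidden=128, layers=2):
            super().__init__()
            self.blstm = nn.LSTM(n_features, hidden, num_layers=layers,
                                 batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_classes)

        def forward(self, x):                    # x: (batch, frames, n_features)
            h, _ = self.blstm(x)                 # (batch, frames, 2 * hidden)
            return torch.sigmoid(self.out(h))    # frame-wise class activity probabilities

    # Illustrative training step with random data
    model = MultiLabelBLSTM()
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(4, 100, 40)                          # acoustic features
    y = torch.randint(0, 2, (4, 100, 61)).float()        # binary activity targets
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()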

  • M. Parviainen and P. Pertilä, "Self-localization of dynamic user-worn microphones from observed speech," Applied Acoustics, vol. 117, iss. Part A, p. 76–85, 2016. doi:10.1016/j.apacoust.2016.10.019
    [BibTeX] [Abstract]

    The increase of mobile devices and most recently wearables has raised the interest to utilize their sensors for various applications such as indoor localization. We present the first acoustic self-localization scheme that is passive, and is capable of operating when sensors are moving, and possibly unsynchronized. As a result, the relative microphone positions are obtained and therefore an ad hoc microphone array has been established. The proposed system takes advantage of the knowledge that a device is worn by its user e.g. attached to his/her clothing. A user here acts as a sound source and the sensor is the user-worn microphone. Such an entity is referred to as a node. Node-related spatial information is obtained from Time Difference of Arrival (TDOA) estimated from audio captured by the nodes. Kalman filtering is used for node tracking and prediction of spatial information during periods of node silence. Finally, the node positions are recovered using multidimensional scaling (MDS). The only information required by the proposed system is observations of sounds produced by the nodes such as speech to localize the moving nodes. The general framework for acoustic self-localization is presented followed by an implementation to demonstrate the concept. Real data collected by off-the-shelf equipment is used to evaluate the positioning accuracy of nodes in contrast to image based method. The presented system achieves an accuracy of approximately 10 cm in an acoustic laboratory.

    @article{2016_AA,
    author = {Parviainen, Mikko and Pertil{\"a}, Pasi},
    title = "Self-localization of dynamic user-worn microphones from observed speech",
    abstract = "Abstract The increase of mobile devices and most recently wearables has raised the interest to utilize their sensors for various applications such as indoor localization. We present the first acoustic self-localization scheme that is passive, and is capable of operating when sensors are moving, and possibly unsynchronized. As a result, the relative microphone positions are obtained and therefore an ad hoc microphone array has been established. The proposed system takes advantage of the knowledge that a device is worn by its user e.g. attached to his/her clothing. A user here acts as a sound source and the sensor is the user-worn microphone. Such an entity is referred to as a node. Node-related spatial information is obtained from Time Difference of Arrival (TDOA) estimated from audio captured by the nodes. Kalman filtering is used for node tracking and prediction of spatial information during periods of node silence. Finally, the node positions are recovered using multidimensional scaling (MDS). The only information required by the proposed system is observations of sounds produced by the nodes such as speech to localize the moving nodes. The general framework for acoustic self-localization is presented followed by an implementation to demonstrate the concept. Real data collected by off-the-shelf equipment is used to evaluate the positioning accuracy of nodes in contrast to image based method. The presented system achieves an accuracy of approximately 10 cm in an acoustic laboratory.",
    keywords = "Self-localization, Ad hoc networks, Microphone arrays, Acoustic measurements, Kalman filtering, Data association",
    year = "2016",
    month = "November",
    day = "9",
    doi = "10.1016/j.apacoust.2016.10.019",
    language = "English",
    volume = "117",
    pages = "76--85",
    journal = "Applied Acoustics",
    issn = "0003-682X",
    publisher = "Elsevier Limited",
    number = "Part A"
    }
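
    Of the processing chain described in the abstract above, the final step recovers relative node positions from pairwise distance information with multidimensional scaling. The sketch below shows classical MDS from a distance matrix; the TDOA estimation and Kalman tracking stages are not reproduced, and the example distances are placeholders.

    # Classical multidimensional scaling: recover relative coordinates
    # (up to rotation, reflection and translation) from pairwise distances.
    import numpy as np

    def classical_mds(D, dim=2):
        """D: (n, n) symmetric matrix of pairwise distances."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
        B = -0.5 * J @ (D ** 2) @ J              # double-centred squared distances
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1][:dim]  # keep the largest eigenvalues
        scale = np.sqrt(np.maximum(eigvals[order], 0.0))
        return eigvecs[:, order] * scale         # (n, dim) relative coordinates

    # Placeholder distances between three nodes (metres)
    D = np.array([[0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.4],
                  [1.0, 1.4, 0.0]])
    print(classical_mds(D))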

  • P. Pertilä and A. Brutti, "Increasing the environment-awareness of rake beamforming for directive acoustic sources," in 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), 2016, pp. 1-5. doi:10.1109/IWAENC.2016.7602932
    [BibTeX]
    @INPROCEEDINGS{2016_IWAENC,
    author = "Pertilä, Pasi and Brutti, Alessio",
    booktitle = "2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC)",
    title = "Increasing the environment-awareness of rake beamforming for directive acoustic sources",
    year = "2016",
    volume = "",
    number = "",
    pages = "1-5",
    keywords = "Array signal processing;Microphone arrays;Speech;Reverberation;Signal to noise ratio;Mirrors;Microphone arrays;Beamforming;Acoustic reflection;Speech enhancement;Speech intelligibility",
    doi = "10.1109/IWAENC.2016.7602932"
    }

  • M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, "DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
    [BibTeX] [Download PDF]
    @inproceedings{2016_DCASE2016,
    author = "Valenti, Michele and Diment, Aleksandr and Parascandolo, Giambattista and Squartini, Stefano and Virtanen, Tuomas",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)",
    keywords = "Acoustic scene classification; convolutional neural networks; DCASE; computational audio processing",
    month = "9",
    publisher = "Tampere University of Technology. Department of Signal Processing",
    title = "{DCASE} 2016 Acoustic Scene Classification Using Convolutional Neural Networks",
    year = "2016",
    url = "http://dcase.community/documents/workshop2016/proceedings/Valenti-DCASE2016workshop.pdf"
    }

  • T. Virtanen, A. Mesaros, T. Heittola, M. D. Plumbley, P. Foster, E. Benetos, and M. Lagrange, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Tampere University of Technology. Department of Signal Processing, 2016.
    [BibTeX]
    @book{2016,
    author = "Virtanen, Tuomas and Mesaros, Annamaria and Heittola, Toni and Plumbley, {Mark D.} and Foster, Peter and Benetos, Emmanouil and Lagrange, Mathieu",
    publisher = "Tampere University of Technology. Department of Signal Processing",
    title = "{P}roceedings of the {D}etection and {C}lassification of {A}coustic {S}cenes and {E}vents 2016 {W}orkshop ({DCASE}2016)",
    year = "2016"
    }

2015

  • D. Baby, T. Virtanen, J. Gemmeke, and H. Van Hamme, "Coupled dictionaries for exemplar-based speech enhancement and automatic speech recognition," IEEE-ACM Transactions on Audio Speech and Language Processing, vol. 23, iss. 11, p. 1788–1799, 2015. doi:10.1109/TASLP.2015.2450491
    [BibTeX] [Abstract] [Download PDF]

    Exemplar-based speech enhancement systems work by decomposing the noisy speech as a weighted sum of speech and noise exemplars stored in a dictionary and use the resulting speech and noise estimates to obtain a time-varying filter in the full-resolution frequency domain to enhance the noisy speech. To obtain the decomposition, exemplars sampled in lower dimensional spaces are preferred over the full-resolution frequency domain for their reduced computational complexity and the ability to better generalize to unseen cases. But the resulting filter may be sub-optimal as the mapping of the obtained speech and noise estimates to the full-resolution frequency domain yields a low-rank approximation. This paper proposes an efficient way to directly compute the full-resolution frequency estimates of speech and noise using coupled dictionaries: an input dictionary containing atoms from the desired exemplar space to obtain the decomposition and a coupled output dictionary containing exemplars from the full-resolution frequency domain. We also introduce modulation spectrogram features for the exemplar-based tasks using this approach. The proposed system was evaluated for various choices of input exemplars and yielded improved speech enhancement performances on the AURORA-2 and AURORA-4 databases. We further show that the proposed approach also results in improved word error rates (WERs) for the speech recognition tasks using HMM-GMM and deep-neural network (DNN) based systems.

    @article{2015_TASLP,
    author = "Baby, Deepak and Virtanen, Tuomas and Gemmeke, Jort and Hamme, Hugo Van",
    abstract = "Exemplar-based speech enhancement systems work by decomposing the noisy speech as a weighted sum of speech and noise exemplars stored in a dictionary and use the resulting speech and noise estimates to obtain a time-varying filter in the full-resolution frequency domain to enhance the noisy speech. To obtain the decomposition, exemplars sampled in lower dimensional spaces are preferred over the full-resolution frequency domain for their reduced computational complexity and the ability to better generalize to unseen cases. But the resulting filter may be sub-optimal as the mapping of the obtained speech and noise estimates to the full-resolution frequency domain yields a low-rank approximation. This paper proposes an efficient way to directly compute the full-resolution frequency estimates of speech and noise using coupled dictionaries: an input dictionary containing atoms from the desired exemplar space to obtain the decomposition and a coupled output dictionary containing exemplars from the full-resolution frequency domain. We also introduce modulation spectrogram features for the exemplar-based tasks using this approach. The proposed system was evaluated for various choices of input exemplars and yielded improved speech enhancement performances on the AURORA-2 and AURORA-4 databases. We further show that the proposed approach also results in improved word error rates (WERs) for the speech recognition tasks using HMM-GMM and deep-neural network (DNN) based systems.",
    doi = "10.1109/TASLP.2015.2450491",
    issn = "2329-9290",
    journal = "IEEE-ACM Transactions on Audio Speech and Language Processing",
    keywords = "Exemplar-based;Modulation envelope;Noise robust automatic speech recognition;Non-negative sparse coding",
    month = "11",
    number = "11",
    pages = "1788--1799",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "Coupled dictionaries for exemplar-based speech enhancement and automatic speech recognition",
    volume = "23",
    year = "2015",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/dbaby\_aslp2015.pdf"
    }
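
    A hedged sketch of the coupled-dictionary idea summarised in the abstract above: activations are estimated against a low-dimensional input dictionary and then applied to a coupled full-resolution output dictionary to form a Wiener-like enhancement filter. The decomposition here is plain multiplicative-update NMF with fixed bases rather than the paper's exact sparse solver, and all dictionary contents, sizes and the atom ordering are placeholders.

    # Hedged coupled-dictionary sketch: estimate activations H against a fixed
    # input dictionary, then map them through coupled full-resolution output
    # dictionaries to build a Wiener-like filter. Placeholder dictionaries;
    # speech atoms are assumed to come first in the dictionary.
    import numpy as np

    def estimate_activations(X_in, D_in, n_iter=100, eps=1e-10):
        """X_in: (d_in, frames) input features; D_in: (d_in, n_atoms) fixed dictionary."""
        rng = np.random.default_rng(0)
        H = rng.random((D_in.shape[1], X_in.shape[1])) + eps
        for _ in range(n_iter):
            DH = D_in @ H + eps
            H *= (D_in.T @ (X_in / DH)) / (D_in.sum(axis=0)[:, None] + eps)
        return H

    def enhancement_filter(H, D_out_speech, D_out_noise, n_speech_atoms, eps=1e-10):
        """Full-resolution speech and noise estimates and the resulting soft mask."""
        speech = D_out_speech @ H[:n_speech_atoms]
        noise = D_out_noise @ H[n_speech_atoms:]
        return speech / (speech + noise + eps)   # multiply with the noisy full-resolution spectrogram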

  • D. Baby, J. F. Gemmeke, T. Virtanen, and H. Van Hamme, "Exemplar-based speech enhancement for deep neural network based automatic speech recognition," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4485-4489. doi:10.1109/ICASSP.2015.7178819
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2015_ICASSP_b,
    author = "Baby, Deepak and Gemmeke, Jort F. and Virtanen, Tuomas and Van Hamme, Hugo",
    booktitle = "2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Exemplar-based speech enhancement for deep neural network based automatic speech recognition",
    year = "2015",
    volume = "",
    number = "",
    pages = "4485-4489",
    keywords = "Training;Speech recognition;Neural networks;Testing;Computational modeling;Speech;deep neural networks;non-negative matrix factorisation;coupled dictionaries;speech enhancement;modulation envelope",
    doi = "10.1109/ICASSP.2015.7178819",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/dbaby\_icassp2015.pdf"
    }

  • T. Barker, T. Virtanen, and N. H. Pontoppidan, "Low-latency sound-source-separation using non-negative matrix factorisation with coupled analysis and synthesis dictionaries," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 241-245. doi:10.1109/ICASSP.2015.7177968
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2015_ICASSP,
    author = "Barker, Tom and Virtanen, Tuomas and Pontoppidan, Niels Henrik",
    booktitle = "2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Low-latency sound-source-separation using non-negative matrix factorisation with coupled analysis and synthesis dictionaries",
    year = "2015",
    volume = "",
    number = "",
    pages = "241-245",
    keywords = "Dictionaries;Welding;Tin;Computational modeling;Analytical models;Discrete Fourier transforms;Mixture models;Non-negative matrix factorisation;NMF;source separation;real-time;low-latency",
    doi = "10.1109/ICASSP.2015.7177968",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/barker\_icassp2015.pdf"
    }

  • D. Battaglino, A. Mesaros, L. Lepauloux, L. Pilati, and N. Evans, "Acoustic context recognition for mobile devices using a reduced complexity SVM," in 2015 23rd European Signal Processing Conference (EUSIPCO), 2015, pp. 534-538. doi:10.1109/EUSIPCO.2015.7362440
    [BibTeX]
    @INPROCEEDINGS{2015_EUSIPCO,
    author = "Battaglino, Daniele and Mesaros, Annamaria and Lepauloux, Ludovick and Pilati, Laurent and Evans, Nicholas",
    booktitle = "2015 23rd European Signal Processing Conference (EUSIPCO)",
    title = "Acoustic context recognition for mobile devices using a reduced complexity SVM",
    year = "2015",
    volume = "",
    number = "",
    pages = "534-538",
    keywords = "Context;Support vector machines;Training;Training data;Mobile handsets;Complexity theory;Hidden Markov models;Acoustic Context Recognition;mobile devices contextualization;SVM;k-means;LDA",
    doi = "10.1109/EUSIPCO.2015.7362440"
    }

  • E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, "Polyphonic sound event detection using multi label deep neural networks," in 2015 International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1-7. doi:10.1109/IJCNN.2015.7280624
    [BibTeX] [Abstract] [Download PDF]

    In this paper, the use of multi label neural networks are proposed for detection of temporally overlapping sound events in realistic environments. Real-life sound recordings typically have many overlapping sound events, making it hard to recognize each event with the standard sound event detection methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi label classification in this work. The model is evaluated with recordings from realistic everyday environments and the obtained overall accuracy is 63.8%. The method is compared against a state-of-the-art method using non-negative matrix factorization as a pre-processing stage and hidden Markov models as a classifier. The proposed method improves the accuracy by 19% percentage points overall.

    @INPROCEEDINGS{2015_IJCNN,
    author = "Cakir, Emre and Heittola, Toni and Huttunen, Heikki and Virtanen, Tuomas",
    booktitle = "2015 International Joint Conference on Neural Networks (IJCNN)",
    title = "Polyphonic sound event detection using multi label deep neural networks",
    year = "2015",
    volume = "",
    number = "",
    pages = "1-7",
    keywords = "Sound event detection;deep neural networks",
    abstract = "In this paper, the use of multi label neural networks are proposed for detection of temporally overlapping sound events in realistic environments. Real-life sound recordings typically have many overlapping sound events, making it hard to recognize each event with the standard sound event detection methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi label classification in this work. The model is evaluated with recordings from realistic everyday environments and the obtained overall accuracy is 63.8\%. The method is compared against a state-of-the-art method using non-negative matrix factorization as a pre-processing stage and hidden Markov models as a classifier. The proposed method improves the accuracy by 19\% percentage points overall.",
    doi = "10.1109/IJCNN.2015.7280624",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/cakir\_ijcnn2015.pdf"
    }

  • E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, "Multi-label vs. combined single-label sound event detection with deep neural networks," in 2015 23rd European Signal Processing Conference (EUSIPCO), 2015, pp. 2551-2555. doi:10.1109/EUSIPCO.2015.7362845
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2015_EUSIPCO_a,
    author = "Cakir, Emre and Heittola, Toni and Huttunen, Heikki and Virtanen, Tuomas",
    booktitle = "2015 23rd European Signal Processing Conference (EUSIPCO)",
    title = "Multi-label vs. combined single-label sound event detection with deep neural networks",
    year = "2015",
    volume = "",
    number = "",
    pages = "2551-2555",
    keywords = "Training;Feature extraction;Signal processing;Europe;Event detection;Databases;Cost function;Sound event detection;deep neural networks;multi-label classification;binary classification;audio analysis",
    doi = "10.1109/EUSIPCO.2015.7362845",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/multi\_vs\_single\_eusipco-2015.pdf"
    }

  • A. Diment and T. Virtanen, "Archetypal analysis for audio dictionary learning," in 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015, pp. 1-5. doi:10.1109/WASPAA.2015.7336903
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes dictionary learning with archetypes for audio processing. Archetypes refer to so-called pure types, which are a combination of a few data points and which can be combined to obtain a data point. The concept has been found useful in various problems, but it has not yet been applied for audio analysis. The algorithm performs archetypal analysis that minimises the generalised Kullback-Leibler divergence, shown suitable for audio, between an observation and the model. The methodology is evaluated in a source separation scenario (mixtures of speech) and shows results, which are comparable to the state-of-the-art, with perceptual measures indicating its superiority over all of the competing methods in the case of medium-size dictionaries.

    @INPROCEEDINGS{2015_WASPAA,
    author = "Diment, Aleksandr and Virtanen, Tuomas",
    booktitle = "2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)",
    title = "Archetypal analysis for audio dictionary learning",
    year = "2015",
    volume = "",
    number = "",
    pages = "1-5",
    keywords = "Dictionaries;Speech;Algorithm design and analysis;Signal processing;Acoustics;Analytical models;Signal processing algorithms;archetypes;audio analysis;non-negative matrix factorisation;sparse representation",
    abstract = "This paper proposes dictionary learning with archetypes for audio processing. Archetypes refer to so-called pure types, which are a combination of a few data points and which can be combined to obtain a data point. The concept has been found useful in various problems, but it has not yet been applied for audio analysis. The algorithm performs archetypal analysis that minimises the generalised Kullback-Leibler divergence, shown suitable for audio, between an observation and the model. The methodology is evaluated in a source separation scenario (mixtures of speech) and shows results, which are comparable to the state-of-the-art, with perceptual measures indicating its superiority over all of the competing methods in the case of medium-size dictionaries.",
    doi = "10.1109/WASPAA.2015.7336903",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Diment15\_AA.pdf"
    }

  • A. Diment, E. Cakir, T. Heittola, and T. Virtanen, "Automatic recognition of environmental sound events using all-pole group delay features," in 2015 23rd European Signal Processing Conference (EUSIPCO), 2015, pp. 729-733. doi:10.1109/EUSIPCO.2015.7362479
    [BibTeX] [Abstract] [Download PDF]

    A feature based on the group delay function from all-pole models (APGD) is proposed for environmental sound event recognition. The commonly used spectral features take into account merely the magnitude information, whereas the phase is overlooked due to the complications related to its interpretation. Additional information concealed in the phase is hypothesised to be beneficial for sound event recognition. The APGD is an approach to inferring phase information, which has shown applicability for analysis of speech and music signals and is now studied in environmental audio. The evaluation is performed within a multi-label deep neural network (DNN) framework on a diverse real-life dataset of environmental sounds. It shows performance improvement compared to the baseline log mel-band energy case. In combination with the magnitude-based features, APGD demonstrates further improvement.

    @inproceedings{2015_EUSIPCO_b,
    author = "Diment, Aleksandr and Cakir, Emre and Heittola, Toni and Virtanen, Tuomas",
    abstract = "A feature based on the group delay function from all-pole models (APGD) is proposed for environmental sound event recognition. The commonly used spectral features take into account merely the magnitude information, whereas the phase is overlooked due to the complications related to its interpretation. Additional information concealed in the phase is hypothesised to be beneficial for sound event recognition. The APGD is an approach to inferring phase information, which has shown applicability for analysis of speech and music signals and is now studied in environmental audio. The evaluation is performed within a multi-label deep neural network (DNN) framework on a diverse real-life dataset of environmental sounds. It shows performance improvement compared to the baseline log mel-band energy case. In combination with the magnitude-based features, APGD demonstrates further improvement.",
    booktitle = "2015 23rd European Signal Processing Conference (EUSIPCO)",
    title = "Automatic recognition of environmental sound events using all-pole group delay features",
    doi = "10.1109/EUSIPCO.2015.7362479",
    pages = "729-733",
    volume = "",
    number = "",
    keywords = "Delays;Discrete cosine transforms;Feature extraction;Computational modeling;Signal processing;Europe;Neural networks;Phase spectrum;sound event recognition;audio classification;neural networks",
    year = "2015",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Diment15\_APGD4events.pdf"
    }
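
    A minimal sketch of the all-pole group delay (APGD) feature idea that appears in the abstract above: fit an all-pole model to a signal frame and take the group delay of the resulting filter. It assumes librosa and SciPy; the model order, frame length and the random test frame are illustrative, and any further processing used in the paper is omitted.

    # Hedged APGD sketch: all-pole (LPC) model of a frame, then the group delay
    # of the resulting all-pole filter as a phase-based feature vector.
    import numpy as np
    import librosa
    from scipy.signal import group_delay

    def apgd_frame(frame, order=20, n_points=128):
        a = librosa.lpc(frame.astype(float), order=order)    # all-pole coefficients, a[0] = 1
        _, gd = group_delay((np.array([1.0]), a), w=n_points)
        return gd                                             # group delay per frequency point

    frame = np.random.default_rng(0).standard_normal(2048)    # placeholder audio frame
    print(apgd_frame(frame).shape)                            # (128,)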

  • S. Drgas and T. Virtanen, "Speaker verification using adaptive dictionaries in non-negative spectrogram deconvolution," in 12th International Conference on Latent Variable Analysis and Signal Separation, 2015.
    [BibTeX] [Download PDF]
    @inproceedings{2015_LVA_ICA,
    author = "Drgas, Szymon and Virtanen, Tuomas",
    booktitle = "12th International Conference on Latent Variable Analysis and Signal Separation",
    title = "{S}peaker verification using adaptive dictionaries in non-negative spectrogram deconvolution",
    url = "http://link.springer.com/chapter/10.1007\%2F978-3-319-22482-4\_54",
    year = "2015"
    }

  • K. Drossos, A. Floros, A. Giannakoulopoulos, and N. Kanellopoulos, "Investigating the Impact of Sound Angular Position on the Listener Affective State," IEEE Transactions on Affective Computing, vol. 6, iss. 1, p. 27–42, 2015. doi:10.1109/TAFFC.2015.2392768
    [BibTeX] [Abstract]

    Emotion recognition from sound signals represents an emerging field of recent research. Although many existing works focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition from general sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the source relatively to the acoustic receiver. Existing studies that aim to investigate the relation of the latter source placement and the elicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener's emotional state, modeled in the well-established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores are likely to indicate a systematic change in the listener affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside of the visible field of the listener.

    @article{2015_TAC,
    author = "Drossos, K. and Floros, Andreas and Giannakoulopoulos, A. and Kanellopoulos, Nikolaos",
    abstract = "Emotion recognition from sound signals represents an emerging field of recent research. Although many existing works focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition from general sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the source relatively to the acoustic receiver. Existing studies that aim to investigate the relation of the latter source placement and the elicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener's emotional state, modeled in the well-established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores are likely to indicate a systematic change in the listener affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside of the visible field of the listener.",
    doi = "10.1109/TAFFC.2015.2392768",
    issn = "1949-3045",
    journal = "IEEE Transactions on Affective Computing",
    keywords = "acoustic receivers; audio signal processing; cognition; emotion recognition; IADS; SAM; acoustic receiver; elicited emotions; emotional categories; international affective digitized sound sound events library; listener affective state; self assessment Mani",
    month = "1",
    number = "1",
    pages = "27--42",
    publisher = "Institute of Electrical and Electronics Engineers",
    title = "Investigating the Impact of Sound Angular Position on the Listener Affective State",
    volume = "6",
    year = "2015"
    }

  • K. Drossos, A. Floros, and K. L. Kermanidis, "Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence," Journal of the Audio Engineering Society, vol. 63, iss. 3, p. 139–153, 2015.
    [BibTeX] [Abstract]

    While modern sound researchers generally focus on speech and music, mammalian hearing arose from the need to sense those events in the environment that produced sound waves. Such unorganized sound stimuli, referred to as Sound Events (SEs), can also produce an affective and emotional response. In this research, the investigators explore valence recognition of SEs utilizing rhythm-related acoustics cues. A well-known data set with emotionally annotated SEs was employed; various rhythm-related attributes were then extracted and several machine-learning experiments were conducted. The results portray that the rhythm of a SE can affect the listener’s valence up to an extent and, combined with previous works on SEs, could lead to a comprehensive recognition of the rhythm’s effect on the emotional state of the listener.

    @article{2015_AES,
    author = "Drossos, Konstantinos and Floros, Andreas and Kermanidis, Katia Lida",
    abstract = "While modern sound researchers generally focus on speech and music, mammalian hearing arose from the need to sense those events in the environment that produced sound waves. Such unorganized sound stimuli, referred to as Sound Events (SEs), can also produce an affective and emotional response. In this research, the investigators explore valence recognition of SEs utilizing rhythm-related acoustics cues. A well-known data set with emotionally annotated SEs was employed; various rhythm-related attributes were then extracted and several machine-learning experiments were conducted. The results portray that the rhythm of a SE can affect the listener’s valence up to an extent and, combined with previous works on SEs, could lead to a comprehensive recognition of the rhythm’s effect on the emotional state of the listener.",
    issn = "1549-4950",
    journal = "Journal of the Audio Engineering Society",
    number = "3",
    pages = "139--153",
    publisher = "Audio Engineering Society",
    title = "Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence",
    volume = "63",
    year = "2015"
    }

  • P. Foster, S. Dixon, and A. Klapuri, "Identifying Cover Songs Using Information-Theoretic Measures of Similarity," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, iss. 6, p. 993–1005, 2015. doi:10.1109/TASLP.2015.2416655
    [BibTeX] [Abstract]

    This paper investigates methods for quantifying similarity between audio signals, specifically for the task of cover song detection. We consider an information-theoretic approach, where we compute pairwise measures of predictability between time series. We compare discrete-valued approaches operating on quantized audio features, to continuous-valued approaches. In the discrete case, we propose a method for computing the normalized compression distance, where we account for correlation between time series. In the continuous case, we propose to compute information-based measures of similarity as statistics of the prediction error between time series. We evaluate our methods on two cover song identification tasks using a data set comprised of 300 Jazz standards and using the Million Song Dataset. For both datasets, we observe that continuous-valued approaches outperform discrete-valued approaches. We consider approaches to estimating the normalized compression distance (NCD) based on string compression and prediction, where we observe that our proposed normalized compression distance with alignment (NCDA) improves average performance over NCD, for sequential compression algorithms. Finally, we demonstrate that continuous-valued distances may be combined to improve performance with respect to baseline approaches. Using a large-scale filter-and-refine approach, we demonstrate state-of-the-art performance for cover song identification using the Million Song Dataset.

    @article{2015_TASLP_b,
    author = "Foster, Peter and Dixon, Simon and Klapuri, Anssi",
    title = "Identifying Cover Songs Using Information-Theoretic Measures of Similarity",
    abstract = "This paper investigates methods for quantifying similarity between audio signals, specifically for the task of cover song detection. We consider an information-theoretic approach, where we compute pairwise measures of predictability between time series. We compare discrete-valued approaches operating on quantized audio features, to continuous-valued approaches. In the discrete case, we propose a method for computing the normalized compression distance, where we account for correlation between time series. In the continuous case, we propose to compute information-based measures of similarity as statistics of the prediction error between time series. We evaluate our methods on two cover song identification tasks using a data set comprised of 300 Jazz standards and using the Million Song Dataset. For both datasets, we observe that continuous-valued approaches outperform discrete-valued approaches. We consider approaches to estimating the normalized compression distance (NCD) based on string compression and prediction, where we observe that our proposed normalized compression distance with alignment (NCDA) improves average performance over NCD, for sequential compression algorithms. Finally, we demonstrate that continuous-valued distances may be combined to improve performance with respect to baseline approaches. Using a large-scale filter-and-refine approach, we demonstrate state-of-the-art performance for cover song identification using the Million Song Dataset.",
    keywords = "Audio similarity measures, cover song identification, normalized compression distance, time series prediction, INDIVIDUAL SEQUENCES, DATA-COMPRESSION, BEAT TRACKING, MUSIC, CLASSIFICATION, IDENTIFICATION, PREDICTION, RETRIEVAL, FEATURES, ENTROPY",
    note = "publication\\_forum:81398",
    year = "2015",
    month = "June",
    doi = "10.1109/TASLP.2015.2416655",
    language = "English",
    volume = "23",
    pages = "993--1005",
    journal = "Ieee-Acm transactions on audio speech and language processing",
    issn = "2329-9290",
    publisher = "IEEE Advancing Technology for Humanity",
    number = "6"
    }
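
    For orientation (not part of the entry above): the normalized compression distance that the abstract refers to is conventionally defined, for a compressor C and sequences x and y with concatenation xy, as

    \[
    \mathrm{NCD}(x,y) \;=\; \frac{C(xy) - \min\{C(x),\,C(y)\}}{\max\{C(x),\,C(y)\}} .
    \]

    The alignment-based variant (NCDA) and the prediction-error-based continuous measures are defined in the paper itself.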

  • A. Hurmalainen, R. Saeidi, and T. Virtanen, "Similarity induced group sparsity for non-negative matrix factorisation," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4425-4429. doi:10.1109/ICASSP.2015.7178807
    [BibTeX] [Abstract] [Download PDF]

    Non-negative matrix factorisations are used in several branches of signal processing and data analysis for separation and classification. Sparsity constraints are commonly set on the model to promote discovery of a small number of dominant patterns. In group sparse models, atoms considered to belong to a consistent group are permitted to activate together, while activations across groups are suppressed, reducing the number of simultaneously active sources or other structures. Whereas most group sparse models require explicit division of atoms into separate groups without addressing their mutual relations, we propose a constraint that permits dynamic relationships between atoms or groups, based on any defined distance measure. The resulting solutions promote approximation with components considered similar to each other. Evaluation results are shown for speech enhancement and noise robust speech and speaker recognition.

    @INPROCEEDINGS{2015_ICASSP_a,
    author = "Hurmalainen, Antti and Saeidi, Rahim and Virtanen, Tuomas",
    booktitle = "2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Similarity induced group sparsity for non-negative matrix factorisation",
    year = "2015",
    volume = "",
    number = "",
    pages = "4425-4429",
    keywords = "Speech;Atomic measurements;Speech recognition;Noise measurement;Sparse matrices;Noise;Cost function;non-negative matrix factorization;group sparsity;sparse representations;speech recognition;speaker recognition",
    abstract = "Non-negative matrix factorisations are used in several branches of signal processing and data analysis for separation and classification. Sparsity constraints are commonly set on the model to promote discovery of a small number of dominant patterns. In group sparse models, atoms considered to belong to a consistent group are permitted to activate together, while activations across groups are suppressed, reducing the number of simultaneously active sources or other structures. Whereas most group sparse models require explicit division of atoms into separate groups without addressing their mutual relations, we propose a constraint that permits dynamic relationships between atoms or groups, based on any defined distance measure. The resulting solutions promote approximation with components considered similar to each other. Evaluation results are shown for speech enhancement and noise robust speech and speaker recognition.",
    doi = "10.1109/ICASSP.2015.7178807",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/hurmalainen\_icassp2015.pdf"
    }
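
    For readers skimming the entry above: the underlying optimisation is a sparsity-penalised non-negative matrix factorisation of the generic form

    \[
    \min_{\mathbf{W},\mathbf{H}\,\ge\,0}\; D\!\left(\mathbf{V}\,\|\,\mathbf{W}\mathbf{H}\right) \;+\; \lambda\,\Omega(\mathbf{H}),
    \]

    where V is the non-negative observation (e.g., a magnitude spectrogram), D a divergence such as generalised Kullback-Leibler, and Ω a sparsity penalty on the activations H. The similarity-induced group penalty that replaces the usual fixed group structure is specific to the paper and is not reproduced here.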

  • A. Hurmalainen, R. Saeidi, and T. Virtanen, "Noise Robust Speaker Recognition with Convolutive Sparse Coding," in Proceedings of 16th Interspeech, 2015.
    [BibTeX] [Abstract] [Download PDF]

    Recognition and classification of speech content in everyday environments is challenging due to the large diversity of real-world noise sources, which may also include competing speech. At signal-to-noise ratios below 0 dB, a majority of features may become corrupted, severely degrading the performance of classifiers built upon clean observations of a target class. As the energy and complexity of competing sources increase, their explicit modelling becomes integral for successful detection and classification of target speech. We have previously demonstrated how non-negative compositional modelling in a spectrogram space is suitable for robust recognition of speech and speakers even at low SNRs. In this work, the sparse coding approach is extended to cover the whole separation and classification chain to recognise the speaker of short utterances in difficult noise environments. A convolutive matrix factorisation and coding system is evaluated on 2nd CHiME Track 1 data. Over 98\% average speaker recognition accuracy is achieved for shorter than three second utterances at +9 ... -6 dB SNR, illustrating the system's performance in challenging conditions.

    @inproceedings{2015_InterSpecch,
    author = "Hurmalainen, Antti and Saeidi, Rahim and Virtanen, Tuomas",
    abstract = "Recognition and classification of speech content in everyday environments is challenging due to the large diversity of real-world noise sources, which may also include competing speech. At signal-to-noise ratios below 0 dB, a majority of features may become corrupted, severely degrading the performance of classifiers built upon clean observations of a target class. As the energy and complexity of competing sources increase, their explicit modelling becomes integral for successful detection and classification of target speech. We have previously demonstrated how non-negative compositional modelling in a spectrogram space is suitable for robust recognition of speech and speakers even at low SNRs. In this work, the sparse coding approach is extended to cover the whole separation and classification chain to recognise the speaker of short utterances in difficult noise environments. A convolutive matrix factorisation and coding system is evaluated on 2nd CHiME Track 1 data. Over 98\% average speaker recognition accuracy is achieved for shorter than three second utterances at +9 ... -6 dB SNR, illustrating the system's performance in challenging conditions.",
    booktitle = "Proceedings of 16th Interspeech",
    keywords = "speaker recognition; noise robustness; compositional models; sparse coding; non-negative matrix factorization",
    month = "September",
    title = "Noise Robust Speaker Recognition with Convolutive Sparse Coding",
    year = "2015",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/hurmalainen\_interspeech2015.pdf"
    }
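
    As a reminder of the model family used above (generic form only; the paper's exact formulation may differ), a convolutive non-negative factorisation approximates the spectrogram as

    \[
    \mathbf{V} \;\approx\; \sum_{\tau=0}^{T-1} \mathbf{W}(\tau)\; \overset{\tau\rightarrow}{\mathbf{H}},
    \]

    where each W(τ) holds one time slice of the spectro-temporal atoms and the arrow denotes shifting the activation matrix H to the right by τ frames; speaker identity is then inferred from the estimated activations.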

  • K. Mahkonen, J. Kämäräinen, and T. Virtanen, "Lifelog scene change detection using cascades of audio and video detectors," in 12th Asian Conference on Computer Vision, Singapore, November 1-5, 2014, Germany, 2015, p. 434–444. doi:10.1007/978-3-319-16634-6_32
    [BibTeX]
    @inproceedings{2015,
    author = {Mahkonen, Katariina and K{\"a}m{\"a}r{\"a}inen, Joni-Kristian and Virtanen, Tuomas},
    title = "Lifelog scene change detection using cascades of audio and video detectors",
    year = "2015",
    doi = "10.1007/978-3-319-16634-6\_32",
    language = "English",
    isbn = "9783319166339",
    series = "Lecture Notes in Computer Science",
    publisher = "Springer Verlag",
    pages = "434--444",
    booktitle = "12th Asian Conference on Computer Vision, Singapore, November 1-5, 2014",
    address = "Germany"
    }

  • A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, "Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 151-155. doi:10.1109/ICASSP.2015.7177950
    [BibTeX] [Abstract] [Download PDF]

    Methods for detection of overlapping sound events in audio involve matrix factorization approaches, often assigning separated components to event classes. We present a method that bypasses the supervised construction of class models. The method learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix. In testing, the dictionaries are used to estimate directly the class activations. For dealing with large amount of training data, two methods are proposed for reducing the size of the dictionary. The methods were tested on a database of real life recordings, and outperformed previous approaches by over 10\%.

    @inproceedings{2015_ICASSP_c,
    author = "Mesaros, Annamaria and Heittola, Toni and Dikmen, Onur and Virtanen, Tuomas",
    abstract = "Methods for detection of overlapping sound events in audio involve matrix factorization approaches, often assigning separated components to event classes. We present a method that bypasses the supervised construction of class models. The method learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix. In testing, the dictionaries are used to estimate directly the class activations. For dealing with large amount of training data, two methods are proposed for reducing the size of the dictionary. The methods were tested on a database of real life recordings, and outperformed previous approaches by over 10\%.",
    booktitle = "2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    keywords = "Dictionaries;Training;Testing;Event detection;Context;Acoustics;Accuracy;coupled non-negative matrix factorization;non-negative dictionaries;sound event detection",
    pages = "151-155",
    publisher = "IEEE",
    title = "Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations",
    doi = "10.1109/ICASSP.2015.7177950",
    year = "2015",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/mesaros\_icassp2015.pdf"
    }
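
    The coupled factorisation described in the abstract above can be sketched (paraphrasing the abstract, not the paper's notation) as two factorisations that share one activation matrix,

    \[
    \mathbf{S} \approx \mathbf{W}_{s}\mathbf{H}, \qquad \mathbf{A} \approx \mathbf{W}_{a}\mathbf{H},
    \]

    where S is the spectral representation, A the class-activity annotation matrix, and W_s, W_a the coupled non-negative dictionaries; at test time, activations estimated from the spectrogram with W_s yield class-activity estimates through W_a.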

  • P. Pertilä and J. Nikunen, "Distant speech separation using predicted time-frequency masks from spatial features," Speech Communication, vol. 68, p. 97–106, 2015. doi:10.1016/j.specom.2015.01.006
    [BibTeX] [Abstract]

    Speech separation algorithms are faced with a difficult task of producing high degree of separation without containing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask on top of the signal's spectrum to filter out unwanted components. The practical difficulty lies in the mask estimation. Often, using efficient masks engineered for separation performance leads to presence of unwanted musical noise artifacts in the separated signal. This lowers the perceptual quality and intelligibility of the output. Microphone arrays have been long studied for processing of distant speech. This work uses a feed-forward neural network for mapping microphone array's spatial features into a T-F mask. Wiener filter is used as a desired mask for training the neural network using speech examples in simulated setting. The T-F masks predicted by the neural network are combined to obtain an enhanced separation mask that exploits the information regarding interference between all sources. The final mask is applied to the delay-and-sum beamformer (DSB) output. The algorithm's objective separation capability in conjunction with the separated speech intelligibility is tested with recorded speech from distant talkers in two rooms from two distances. The results show improvement in instrumental measure for intelligibility and frequency-weighted SNR over complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and minimum variance distortionless response (MVDR).

    @article{2015_ICA,
    author = {Pertil{\"a}, Pasi and Nikunen, Joonas},
    abstract = "Speech separation algorithms are faced with a difficult task of producing high degree of separation without containing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask on top of the signal's spectrum to filter out unwanted components. The practical difficulty lies in the mask estimation. Often, using efficient masks engineered for separation performance leads to presence of unwanted musical noise artifacts in the separated signal. This lowers the perceptual quality and intelligibility of the output. Microphone arrays have been long studied for processing of distant speech. This work uses a feed-forward neural network for mapping microphone array's spatial features into a T-F mask. Wiener filter is used as a desired mask for training the neural network using speech examples in simulated setting. The T-F masks predicted by the neural network are combined to obtain an enhanced separation mask that exploits the information regarding interference between all sources. The final mask is applied to the delay-and-sum beamformer (DSB) output. The algorithm's objective separation capability in conjunction with the separated speech intelligibility is tested with recorded speech from distant talkers in two rooms from two distances. The results show improvement in instrumental measure for intelligibility and frequency-weighted SNR over complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and minimum variance distortionless response (MVDR).",
    doi = "10.1016/j.specom.2015.01.006",
    issn = "0167-6393",
    journal = "Speech Communication",
    keywords = "Beamforming; Microphone arrays; Neural networks; Speech separation; Time-frequency masking",
    pages = "97--106",
    publisher = "Elsevier",
    title = "Distant speech separation using predicted time-frequency masks from spatial features",
    volume = "68",
    year = "2015"
    }
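
    The Wiener filter used above as the training target for the mask-predicting network is the standard time-frequency mask

    \[
    M(t,f) \;=\; \frac{|S(t,f)|^{2}}{|S(t,f)|^{2} + |N(t,f)|^{2}},
    \]

    where S and N are the target-speech and interference-plus-noise spectrograms of the simulated training material; the predicted masks are then applied to the delay-and-sum beamformer output, as stated in the abstract.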

  • E. Räsänen, O. Pulkkinen, T. Virtanen, M. Zollner, and H. Hennig, "Fluctuations of Hi-Hat Timing and Dynamics in a Virtuoso Drum Track of a Popular Music Recording," PLoS ONE, vol. 10, iss. 6, 2015.
    [BibTeX] [Download PDF]
    @article{2015_PLOSONE,
    author = {R{\"a}s{\"a}nen, Esa and Pulkkinen, Otto and Virtanen, Tuomas and Zollner, Manfred and Hennig, Holger},
    journal = "PLoS ONE",
    number = "6",
    title = "Fluctuations of Hi-Hat Timing and Dynamics in a Virtuoso Drum Track of a Popular Music Recording",
    url = "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127902",
    volume = "10",
    year = "2015"
    }

  • U. Simsekli, T. Virtanen, and A. T. Cemgil, "Non-negative Tensor Factorization Models for Bayesian Audio Processing," Digital Signal Processing, 2015.
    [BibTeX] [Download PDF]
    @article{2015_DSP,
    author = "Simsekli, Umut and Virtanen, Tuomas and Cemgil, Ali Taylan",
    journal = "Digital Signal Processing",
    title = "{N}on-negative {T}ensor {F}actorization {M}odels for {B}ayesian {A}udio {P}rocessing",
    url = "http://www.sciencedirect.com/science/article/pii/S105120041500086X",
    year = "2015"
    }

  • T. Virtanen, J. F. Gemmeke, B. Raj, and P. Smaragdis, "Compositional Models for Audio Processing: Uncovering the structure of sound mixtures," IEEE Signal Processing Magazine, vol. 32, iss. 2, pp. 125-144, 2015. doi:10.1109/MSP.2013.2288990
    [BibTeX] [Download PDF]
    @ARTICLE{2015_SPM,
    author = "Virtanen, Tuomas and Gemmeke, Jort Florent and Raj, Bhiksha and Smaragdis, Paris",
    journal = "IEEE Signal Processing Magazine",
    title = "Compositional Models for Audio Processing: Uncovering the structure of sound mixtures",
    year = "2015",
    volume = "32",
    number = "2",
    pages = "125-144",
    keywords = "Matrix decomposition;Spectrogram;Time-frequency analysis;Principal component analysis;Atomic clocks;Acoustic signal processing;Signal resolution",
    doi = "10.1109/MSP.2013.2288990",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/virtanen-spm-compositional.pdf"
    }

2014

  • D. Baby, T. Virtanen, T. Barker, and H. Van hamme, "Coupled dictionary training for exemplar-based speech enhancement," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 2883-2887. doi:10.1109/ICASSP.2014.6854127
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2014_ICASSP_a,
    author = "Baby, Deepak and Virtanen, Tuomas and Barker, Tom and Van hamme, Hugo",
    booktitle = "2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Coupled dictionary training for exemplar-based speech enhancement",
    year = "2014",
    volume = "",
    number = "",
    pages = "2883-2887",
    keywords = "Discrete Fourier transforms;Speech;Noise;Dictionaries;Speech enhancement;Noise measurement;Non-negative matrix factorisation;coupled dictionary training;speech enhancement;modulation envelope",
    doi = "10.1109/ICASSP.2014.6854127",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Baby\_ICASSP2014.pdf"
    }

  • D. Baby, T. Virtanen, J. F. Gemmeke, T. Barker, and H. Van hamme, "Exemplar-based noise robust automatic speech recognition using modulation spectrogram features," in 2014 IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 519-524. doi:10.1109/SLT.2014.7078628
    [BibTeX]
    @INPROCEEDINGS{2014_SLT,
    author = "Baby, Deepak and Virtanen, Tuomas and Gemmeke, Jort F. and Barker, Tom and Van hamme, Hugo",
    booktitle = "2014 IEEE Spoken Language Technology Workshop (SLT)",
    title = "Exemplar-based noise robust automatic speech recognition using modulation spectrogram features",
    year = "2014",
    volume = "",
    number = "",
    pages = "519-524",
    keywords = "Speech;Noise;Discrete Fourier transforms;Modulation;Dictionaries;Databases;Spectrogram;modulation envelope;coupled dictionaries;non-negative matrix factorisation;automatic speech recognition",
    doi = "10.1109/SLT.2014.7078628"
    }

  • T. Barker, T. Virtanen, and O. Delhomme, "Ultrasound-Coupled Semi-Supervised Nonnegative Matrix Factorisation for Speech Enhancement," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 4-9, 2014, 2014, p. 2148–2152. doi:10.1109/ICASSP.2014.6853975
    [BibTeX] [Download PDF]
    @inproceedings{2014_ICASSP,
    author = "Barker, Tom and Virtanen, Tuomas and Delhomme, Olivier",
    booktitle = "2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP), Florence, Italy, May 4-9.2014",
    doi = "10.1109/ICASSP.2014.6853975",
    isbn = "978-1-4799-2893-4",
    pages = "2148--2152",
    publisher = "Institute of Electrical and Electronics Engineers IEEE",
    title = "Ultrasound-Coupled Semi-Supervised Nonnegative Matrix Factorisation for Speech Enhancement",
    year = "2014",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Barker\_ICASSP2014.pdf"
    }

  • T. Barker and T. Virtanen, "Semi-supervised non-negative tensor factorisation of modulation spectrograms for monaural speech separation," in Neural Networks (IJCNN), 2014 International Joint Conference on, 2014, p. 3556–3561. doi:10.1109/IJCNN.2014.6889522
    [BibTeX] [Abstract] [Download PDF]

    This paper details the use of a semi-supervised approach to audio source separation. Where only a single source model is available, the model for an unknown source must be estimated. A mixture signal is separated through factorisation of a feature-tensor representation, based on the modulation spectrogram. Harmonically related components tend to modulate in a similar fashion, and this redundancy of patterns can be isolated. This feature representation requires fewer parameters than spectrally based methods and so minimises overfitting. Following the tensor factorisation, the separated signals are reconstructed by learning appropriate Wiener-filter spectral parameters which have been constrained by activation parameters learned in the first stage. Strong results were obtained for two-speaker mixtures where source separation performance exceeded those used as benchmarks. Specifically, the proposed semi-supervised method outperformed both semi-supervised non-negative matrix factorisation and blind non-negative modulation spectrum tensor factorisation.

    @inproceedings{2014_IJCNN,
    author = "Barker, Tom and Virtanen, Tuomas",
    abstract = "This paper details the use of a semi-supervised approach to audio source separation. Where only a single source model is available, the model for an unknown source must be estimated. A mixture signal is separated through factorisation of a feature-tensor representation, based on the modulation spectrogram. Harmonically related components tend to modulate in a similar fashion, and this redundancy of patterns can be isolated. This feature representation requires fewer parameters than spectrally based methods and so minimises overfitting. Following the tensor factorisation, the separated signals are reconstructed by learning appropriate Wiener-filter spectral parameters which have been constrained by activation parameters learned in the first stage. Strong results were obtained for two-speaker mixtures where source separation performance exceeded those used as benchmarks. Specifically, the proposed semi-supervised method outperformed both semi-supervised non-negative matrix factorisation and blind non-negative modulation spectrum tensor factorisation.",
    booktitle = "Neural Networks (IJCNN), 2014 International Joint Conference on",
    doi = "10.1109/IJCNN.2014.6889522",
    isbn = "978-1-4799-1484-5",
    keywords = "Wiener filters; audio signal processing; matrix decomposition; signal reconstruction; source separation; speech processing; tensors; Wiener-filter spectral parameters; activation parameters; audio source separation; blind nonnegative modulation spectrum te",
    month = "7",
    pages = "3556--3561",
    title = "Semi-supervised non-negative tensor factorisation of modulation spectrograms for monaural speech separation",
    year = "2014",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/WCCI\_TomBarker\_Preprint.pdf"
    }
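
    The tensor factorisation referred to above follows, in its generic non-negative (PARAFAC-style) form,

    \[
    X_{f,m,t} \;\approx\; \sum_{k=1}^{K} a_{f,k}\, b_{m,k}\, c_{t,k}, \qquad a, b, c \ge 0,
    \]

    with acoustic-frequency, modulation-frequency and time modes of the modulation-spectrogram tensor X; in the semi-supervised setting of the paper, the factors of the known source come from its pre-trained model while those of the unknown source are estimated from the mixture. (Generic form only; the paper's exact parameterisation may differ.)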

  • T. Barker, H. V. Hamme, and T. Virtanen, "Modelling Primitive Streaming of Simple Tone Sequences Through Factorisation of Modulation Pattern Tensors," in INTERSPEECH2014, 15th Annual Conference of the International Speech Communication Association, 14-18 September 2014, Singapore, 2014, p. 1371–1375.
    [BibTeX] [Download PDF]
    @inproceedings{2014_InterSpecch,
    author = "Barker, Tom and Hamme, Hugo Van and Virtanen, Tuomas",
    booktitle = "INTERSPEECH2014, 15th Annual Conference of the International Speech Communication Association, 14-18 September 2014, Singapore",
    pages = "1371--1375",
    publisher = "International Speech Communication Association",
    title = "Modelling Primitive Streaming of Simple Tone Sequences Through Factorisation of Modulation Pattern Tensors",
    year = "2014",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Barker2014b.pdf"
    }

  • A. Diment, P. Rajan, T. Heittola, and T. Virtanen, "Group Delay Function from All-Pole Models for Musical Instrument Recognition," in Sound, Music, and Motion, M. Aramaki, O. Derrien, R. Kronland-Martinet, and S. Ystad, Eds., Springer International Publishing, 2014, pp. 606-618. doi:10.1007/978-3-319-12976-1_37
    [BibTeX] [Abstract]

    In this work, the feature based on the group delay function from all-pole models (APGD) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The APGD is an elegant approach to inferring phase information, which lacks of the issues related to interpreting the phase and does not require extensive parameter adjustment. Having shown applicability for speech-related problems, it is now explored in terms of instrument recognition. The evaluation is performed with various instrument sets and shows noteworthy absolute accuracy gains of up to 7\% compared to the baseline mel-frequency cepstral coefficients (MFCCs) case. Combined with the MFCCs and with feature selection, APGD demonstrates superiority over the baseline with all the evaluated sets.

    @incollection{2014,
    author = "Diment, Aleksandr and Rajan, Padmanabhan and Heittola, Toni and Virtanen, Tuomas",
    editor = "Aramaki, Mitsuko and Derrien, Olivier and Kronland-Martinet, Richard and Ystad, S{\o}lvi",
    abstract = "In this work, the feature based on the group delay function from all-pole models (APGD) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The APGD is an elegant approach to inferring phase information, which lacks of the issues related to interpreting the phase and does not require extensive parameter adjustment. Having shown applicability for speech-related problems, it is now explored in terms of instrument recognition. The evaluation is performed with various instrument sets and shows noteworthy absolute accuracy gains of up to 7\% compared to the baseline mel-frequency cepstral coefficients (MFCCs) case. Combined with the MFCCs and with feature selection, APGD demonstrates superiority over the baseline with all the evaluated sets.",
    booktitle = "Sound, Music, and Motion",
    doi = "10.1007/978-3-319-12976-1\_37",
    isbn = "978-3-319-12975-4",
    keywords = "Musical instrument recognition; Music information retrieval; All-pole group delay feature; Phase spectrum",
    pages = "606-618",
    publisher = "Springer International Publishing",
    title = "Group Delay Function from All-Pole Models for Musical Instrument Recognition",
    year = "2014"
    }
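
    For reference, the all-pole group delay feature named above is, in its usual form, the negative phase derivative of an all-pole model fitted to the analysis frame,

    \[
    \tau_{\mathrm{AP}}(\omega) \;=\; -\frac{d}{d\omega}\, \arg \frac{G}{A(e^{j\omega})},
    \]

    where G/A(e^{jω}) is the all-pole (linear-prediction) spectrum of the frame. (Standard definition; see the chapter for the exact parameterisation and feature post-processing used.)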

  • O. Gencoglu, T. Virtanen, and H. Huttunen, "Recognition of acoustic events using deep neural networks," in 2014 22nd European Signal Processing Conference (EUSIPCO), 2014, p. 506–510.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes the use of a deep neural network for the recognition of isolated acoustic events such as footsteps, baby crying, motorcycle, rain etc. For an acoustic event classification task containing 61 distinct classes, classification accuracy of the neural network classifier (60.3\%) excels that of the conventional Gaussian mixture model based hidden Markov model classifier (54.8\%). In addition, an unsupervised layerwise pretraining followed by standard backpropagation training of a deep network (known as a deep belief network) results in further increase of 2-4\% in classification accuracy. Effects of implementation parameters such as types of features and number of adjacent frames as additional features are found to be significant on classification accuracy.

    @inproceedings{2014_EUSIPCO,
    author = "Gencoglu, Oguzhan and Virtanen, Tuomas and Huttunen, Heikki",
    title = "Recognition of acoustic events using deep neural networks",
    booktitle = "2014 22nd European Signal Processing Conference (EUSIPCO)",
    pages = "506--510",
    year = "2014",
    organization = "IEEE",
    abstract = "This paper proposes the use of a deep neural network for the recognition of isolated acoustic events such as footsteps, baby crying, motorcycle, rain etc. For an acoustic event classification task containing 61 distinct classes, classification accuracy of the neural network classifier (60.3\%) excels that of the conventional Gaussian mixture model based hidden Markov model classifier (54.8\%). In addition, an unsupervised layerwise pretraining followed by standard backpropagation training of a deep network (known as a deep belief network) results in further increase of 2-4\% in classification accuracy. Effects of implementation parameters such as types of features and number of adjacent frames as additional features are found to be significant on classification accuracy.",
    keywords = "acoustic event classification;deep learning;deep neural networks;deep belief networks",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/dnn\_eusipco2014.pdf"
    }
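
    A minimal, hypothetical NumPy sketch of the "adjacent frames as additional features" idea mentioned in the abstract above (illustrative only, not the authors' code):

    import numpy as np

    def stack_context(features, context=2):
        # features: (n_frames, n_dims) array of per-frame features, e.g. MFCCs.
        # Returns (n_frames, n_dims * (2*context + 1)): each row is a frame
        # concatenated with `context` neighbouring frames on each side,
        # edge frames padded by repetition.
        padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
        shifted = [padded[i:i + len(features)] for i in range(2 * context + 1)]
        return np.concatenate(shifted, axis=1)

    # Example: 100 frames of 39-dimensional features become 100 x 195 inputs
    # for a feed-forward classifier.
    inputs = stack_context(np.random.randn(100, 39), context=2)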

  • D. Giannoulis, E. Benetos, A. Klapuri, and M. D. Plumbley, "Improving instrument recognition in polyphonic music through system integration," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-9 May 2014, Florence, Italy, 2014, p. 1–5. doi:10.1109/ICASSP.2014.6854599
    [BibTeX]
    @inproceedings{2014_ICASSP_d,
    author = "Giannoulis, Dimitrios and Benetos, Emmanouil and Klapuri, Anssi and Plumbley, Mark D.",
    title = "Improving instrument recognition in polyphonic music through system integration",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2015-01-21; IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 01-01-1900 Through 01-01-2000", year = "2014", doi = "10.1109/ICASSP.2014.6854599", language = "English", isbn = "978-1-4799-2893-4", pages = "1--5", booktitle = "2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-9 May 2014, Florence, Italy" }

  • T. Heittola, A. Mesaros, D. Korpi, A. Eronen, and T. Virtanen, "Method for creating location-specific audio textures," EURASIP Journal on Audio, Speech and Music Processing, vol. 2014, iss. 9, 2014. doi:10.1186/1687-4722-2014-9
    [BibTeX] [Abstract]

    An approach is proposed for creating location-specific audio textures for virtual location-exploration services. The presented approach creates audio textures by processing a small amount of audio recorded at a given location, providing a cost-effective way to produce a versatile audio signal that characterizes the location. The resulting texture is non-repetitive and conserves the location-specific characteristics of the audio scene, without the need of collecting large amount of audio from each location. The method consists of two stages: analysis and synthesis. In the analysis stage, the source audio recording is segmented into homogeneous segments. In the synthesis stage, the audio texture is created by randomly drawing segments from the source audio so that the consecutive segments will have timbral similarity near the segment boundaries. Results obtained in listening experiments show that there is no statistically significant difference in the audio quality or location-specificity of audio when the created audio textures are compared to excerpts of the original recordings. Therefore, the proposed audio textures could be utilized in virtual location-exploration services. Examples of source signals and audio textures created from them are available at www.cs.tut.fi/\textasciitilde heittolt/audiotexture.

    @article{2014_JASM,
    author = "Heittola, Toni and Mesaros, Annamaria and Korpi, Dani and Eronen, Antti and Virtanen, Tuomas",
    abstract = "An approach is proposed for creating location-specific audio textures for virtual location-exploration services. The presented approach creates audio textures by processing a small amount of audio recorded at a given location, providing a cost-effective way to produce a versatile audio signal that characterizes the location. The resulting texture is non-repetitive and conserves the location-specific characteristics of the audio scene, without the need of collecting large amount of audio from each location. The method consists of two stages: analysis and synthesis. In the analysis stage, the source audio recording is segmented into homogeneous segments. In the synthesis stage, the audio texture is created by randomly drawing segments from the source audio so that the consecutive segments will have timbral similarity near the segment boundaries. Results obtained in listening experiments show that there is no statistically significant difference in the audio quality or location-specificity of audio when the created audio textures are compared to excerpts of the original recordings. Therefore, the proposed audio textures could be utilized in virtual location-exploration services. Examples of source signals and audio textures created from them are available at www.cs.tut.fi/\textasciitilde heittolt/audiotexture.",
    journal = "EURASIP Journal on Audio, Speech and Music Processing",
    number = "9",
    title = "Method for creating location-specific audio textures",
    volume = "2014",
    year = "2014",
    doi = "10.1186/1687-4722-2014-9"
    }

  • J. Nikunen and T. Virtanen, "Multichannel audio separation by Direction of Arrival Based Spatial Covariance Model and Non-negative Matrix Factorization," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, p. 6727–6731. doi:10.1109/ICASSP.2014.6854892
    [BibTeX] [Abstract] [Download PDF]

    This paper studies multichannel audio separation using non-negative matrix factorization (NMF) combined with a new model for spatial covariance matrices (SCM). The proposed model for SCMs is parameterized by source direction of arrival (DoA) and its parameters can be optimized to yield a spatially coherent solution over frequencies, thus avoiding permutation ambiguity and spatial aliasing. The model constrains the estimation of SCMs to a set of geometrically possible solutions. Additionally we present a method for using a priori DoA information of the sources extracted blindly from the mixture for the initialization of the parameters of the proposed model. The simulations show that the proposed algorithm exceeds the separation quality of existing spatial separation methods.

    @inproceedings{2014_ICASSP_c,
    author = "Nikunen, Joonas and Virtanen, Tuomas",
    abstract = "This paper studies multichannel audio separation using non-negative matrix factorization (NMF) combined with a new model for spatial covariance matrices (SCM). The proposed model for SCMs is parameterized by source direction of arrival (DoA) and its parameters can be optimized to yield a spatially coherent solution over frequencies thus avoiding permutation ambiguity and spatial liasing. The model constrains the estimation of SCMs to a set of geometrically possible solutions. Additionally we present a method for using a priori DoA information of the sources extracted blindly from the mixture for the initialization of the parameters of the proposed model. The simulations show that the proposed algorithm exceeds the separation quality of existing spatial separation methods.",
    booktitle = "2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    keywords = "Spatial sound separation; non-negative matrix factorization; spatial covariance models; Complex-Valued NMF",
    pages = "6727--6731",
    title = "Multichannel audio separation by Direction of Arrival Based Spatial Covariance Model and Non-negative Matrix Factorization",
    year = "2014",
    doi = "10.1109/ICASSP.2014.6854892",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Nikunen\_ICASSP2014.pdf"
    }

  • J. Nikunen and T. Virtanen, "Direction of Arrival Based Spatial Covariance Model for Blind Sound Source Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, iss. 3, p. 727–739, 2014. doi:10.1109/TASLP.2014.2303576
    [BibTeX] [Abstract] [Download PDF]

    This paper addresses the problem of sound source separation from a multichannel microphone array capture via estimation of source spatial covariance matrix (SCM) of a short-time Fourier transformed mixture signal. In many conventional audio separation algorithms the source mixing parameter estimation is done separately for each frequency thus making them prone to errors and leading to suboptimal source estimates. In this paper we propose a SCM model which consists of a weighted sum of direction of arrival (DoA) kernels and estimate only the weights dependent on the source directions. In the proposed algorithm, the spatial properties of the sources become jointly optimized over all frequencies, leading to more coherent source estimates and mitigating the effect of spatial aliasing at high frequencies. The proposed SCM model is combined with a linear model for magnitudes and the parameter estimation is formulated in a complex-valued non-negative matrix factorization (CNMF) framework. Simulations consist of recordings done with a hand-held device sized array having multiple microphones embedded inside the device casing. Separation quality of the proposed algorithm is shown to exceed the performance of existing state of the art separation methods with two sources when evaluated by objective separation quality metrics.

    @article{2014_TASLP_a,
    author = "Nikunen, Joonas and Virtanen, Tuomas",
    abstract = "This paper addresses the problem of sound source separation from a multichannel microphone array capture via estimation of source spatial covariance matrix (SCM) of a short-time Fourier transformed mixture signal. In many conventional audio separation algorithms the source mixing parameter estimation is done separately for each frequency thus making them prone to errors and leading to suboptimal source estimates. In this paper we propose a SCM model which consists of a weighted sum of direction of arrival (DoA) kernels and estimate only the weights dependent on the source directions. In the proposed algorithm, the spatial properties of the sources become jointly optimized over all frequencies, leading to more coherent source estimates and mitigating the effect of spatial aliasing at high frequencies. The proposed SCM model is combined with a linear model for magnitudes and the parameter estimation is formulated in a complex-valued non-negative matrix factorization (CNMF) framework. Simulations consist of recordings done with a hand-held device sized array having multiple microphones embedded inside the device casing. Separation quality of the proposed algorithm is shown to exceed the performance of existing state of the art separation methods with two sources when evaluated by objective separation quality metrics.",
    journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
    keywords = "Complex-valued NMF",
    month = "March",
    number = "3",
    pages = "727--739",
    title = "Direction of Arrival Based Spatial Covariance Model for Blind Sound Source Separation",
    volume = "22",
    year = "2014",
    doi = "10.1109/TASLP.2014.2303576",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/nikunen\_taslp2014.pdf"
    }
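
    The ICASSP and TASLP entries above share the same spatial covariance model; schematically (following the abstracts, not the papers' exact notation), the SCM of source s at frequency f is a non-negative combination of fixed direction-of-arrival kernels,

    \[
    \mathbf{C}_{f,s} \;=\; \sum_{o} z_{s,o}\, \mathbf{W}_{f}(\mathbf{k}_{o}), \qquad z_{s,o} \ge 0,
    \]

    so that only the direction weights z are estimated and the spatial solution stays coherent across frequencies; the magnitude model and the complex-valued NMF estimation are as described in the papers.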

  • M. Parviainen, P. Pertilä, and M. S. Hämäläinen, "Self-localization of Wireless Acoustic Sensors in Meeting Rooms," in 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014.
    [BibTeX] [Abstract]

    This paper presents a passive acoustic self-localization and synchronization system, which estimates the positions of wireless acoustic sensors utilizing the signals emitted by the persons present in the same room. The system is designed to utilize common off-the-shelf devices such as mobile phones. Once devices are self-localized and synchronized, the system could be utilized by traditional array processing methods. The proposed calibration system is evaluated with real recordings from meeting scenarios. The proposed system builds on earlier work, with the added contributions of this work being i) increased positioning accuracy and ii) the introduction of data-driven data association. The results show improvement over the existing methods in all tested recordings with 10 smartphones.

    @inproceedings{2014_HSCMA,
    author = {Parviainen, Mikko and Pertil{\"a}, Pasi and H{\"a}m{\"a}l{\"a}inen, Matti S.},
    abstract = "This paper presents a passive acoustic self-localization and synchronization system, which estimates the positions of wireless acoustic sensors utilizing the signals emitted by the persons present in the same room. The system is designed to utilize common off-the-shelf devices such as mobile phones. Once devices are self-localized and synchronized, the system could be utilized by traditional array processing methods. The proposed calibration system is evaluated with real recordings from meeting scenarios. The proposed system builds on earlier work with the added contribution of this work is i) increasing the accuracy of positioning, and ii) introduction data-driven data association. The results show that improvement over the existing methods in all tested recordings with 10 smartphones.",
    booktitle = "4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA)",
    keywords = "self localization; self synchronization; microphone arrays",
    organization = "IEEE",
    title = "{S}elf-localization of {W}ireless {A}coustic {S}ensors in {M}eeting {R}ooms",
    year = "2014"
    }

  • P. Pertilä and J. Nikunen, "Microphone Array Post-Filtering Using Supervised Machine Learning for Speech Enhancement," in INTERSPEECH 2014 - 15th Annual Conference of the International Speech Communication Association, 2014.
    [BibTeX]
    @inproceedings{2014_InterSpecch_b,
    author = {Pertil{\"a}, Pasi and Nikunen, Joonas},
    booktitle = "INTERSPEECH 2014 - 15th Annual Conference of the International Speech Communication Association",
    keywords = "microphone array; speech enhancement; machine learning; beamforming",
    title = "{M}icrophone {A}rray {P}ost-{F}iltering {U}sing {S}upervised {M}achine {L}earning for {S}peech {E}nhancement",
    year = "2014"
    }

  • G. Sanchez, H. Silén, J. Nurminen, and M. Gabbouj, "Hierarchical modeling of F0 contours for voice conversion," in INTERSPEECH 2014, Proceedings of the 15th Annual Conference of the International Speech Communication Association, 14-18 September 2014, Singapore, 2014, p. 2318–2321.
    [BibTeX]
    @inproceedings{2014_InterSpecch_a,
    author = "Sanchez, Gerard and Sil{\'e}n, Hanna and Nurminen, Jani and Gabbouj, Moncef",
    booktitle = "INTERSPEECH 2014, Proceedings of the15th Annual Conference of the International Speech Communication Association, 14-18, September 2014, Singapore",
    pages = "2318--2321",
    publisher = "International Speech Communication Association",
    title = "Hierarchical modeling of {F}0 contours for voice conversion",
    year = "2014"
    }

  • C. Schörkhuber, A. Klapuri, N. Holighaus, and M. Dörfler, "A Matlab Toolbox for Efficient Perfect Reconstruction Time-Frequency Transforms with Log-Frequency Resolution," in AES 53rd International Conference on Semantic Audio, London, UK, January 27, 2014, 2014.
    [BibTeX]
    @inproceedings{2014_AES,
    author = {Sch{\"o}rkhuber, Christian and Klapuri, Anssi and Holighaus, Nicki and D{\"o}rfler, Monika},
    title = "A Matlab Toolbox for Efficient Perfect Reconstruction Time-Frequency Transforms with Log-Frequency Resolution,",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2014-09-05
    Publisher name: Audio Engineering Society; AES International Conference on Semantic Audio ; Conference date: 01-01-2014", year = "2014", language = "English", booktitle = "AES 53rd International Conference on Semantic Audio, London, UK, January 27, 2014", publisher = "Audio Engineering Society" }

  • T. Virtanen, B. Raj, J. F. Gemmeke, and H. Van hamme, "Active-set newton algorithm for non-negative sparse coding of audio," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3092-3096. doi:10.1109/ICASSP.2014.6854169
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2014_ICASSP_b,
    author = "Virtanen, Tuomas and Raj, Bhiksha and Gemmeke, Jort F. and Van hamme, Hugo",
    booktitle = "2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Active-set newton algorithm for non-negative sparse coding of audio",
    year = "2014",
    volume = "",
    number = "",
    pages = "3092-3096",
    keywords = "Dictionaries;Vectors;Speech;Source separation;Sparse matrices;Signal processing algorithms;Encoding;sound source separation;non-negative matrix factorization;Newton algorithm;convex optimization;sparse coding",
    doi = "10.1109/ICASSP.2014.6854169",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ASNA\_L1.pdf"
    }

  • Z. Wu, T. Virtanen, E. S. Chng, and H. Li, "Exemplar-Based Sparse Representation With Residual Compensation for Voice Conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, iss. 10, pp. 1506-1521, 2014. doi:10.1109/TASLP.2014.2333242
    [BibTeX] [Download PDF]
    @ARTICLE{2014_TASLP,
    author = "Wu, Zhizheng and Virtanen, Tuomas and Chng, Eng Siong and Li, Haizhou",
    journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
    title = "Exemplar-Based Sparse Representation With Residual Compensation for Voice Conversion",
    year = "2014",
    volume = "22",
    number = "10",
    pages = "1506-1521",
    keywords = "Speech;Speech processing;Spectrogram;Training;Vectors;IEEE transactions;Training data;Exemplar;nonnegative matrix factorization;residual compensation;sparse representation;voice conversion",
    doi = "10.1109/TASLP.2014.2333242",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/taslp\_voco\_exemplar\_2014.pdf"
    }

2013

  • T. Barker and T. Virtanen, "Non-negative Tensor Factorisation of Modulation Spectrograms for Monaural Sound Source Separation," in Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), 25-29 August, Lyon, France, 2013, p. 827 – 831.
    [BibTeX] [Download PDF]
    @inproceedings{2013_Interspeech_a,
    author = "Barker, Tom and Virtanen, Tuomas",
    booktitle = "Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), 25-29 August, Lyon, France",
    pages = "827 -- 831",
    publisher = "International Speech Communication Association",
    series = "Interspeech",
    title = "Non-negative Tensor Factorisation of Modulation Spectrograms for Monaural Sound Source Separation",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/barker\_ntf2013.pdf"
    }

  • E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, "Automatic Music Transcription: Challenges and Future Directions," Journal of Intelligent Information Systems, vol. 41, iss. 3, p. 407–434, 2013. doi:10.1007/s10844-013-0258-3
    [BibTeX]
    @article{2013_g,
    author = "Benetos, Emmanouil and Dixon, Simon and Giannoulis, Dimitrios and Kirchhoff, Holger and Klapuri, Anssi",
    title = "Automatic Music Transcription: Challenges and Future Directions",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2013-12-29
    Publisher name: Springer New York", year = "2013", doi = "10.1007/s10844-013-0258-3", language = "English", volume = "41", pages = "407--434", journal = "Journal of Intelligent Information Systems", issn = "0925-9902", publisher = "Springer Netherlands", number = "3" }

  • F. Briggs, Y. Huang, R. Raich, K. Eftaxias, Z. Lei, W. Cukierski, S. F. Hadley, A. Hadley, M. Betts, X. Z. Fern, J. Irvine, L. Neal, A. Thomas, G. Fodor, G. Tsoumakas, H. W. Ng, T. N. T. Nguyen, H. Huttunen, P. Ruusuvuori, T. Manninen, A. Diment, T. Virtanen, J. Marzat, J. Defretin, D. Callender, C. Hurlburt, K. Larrey, and M. Milakov, "The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment," in 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013, pp. 1-8. doi:10.1109/MLSP.2013.6661934
    [BibTeX]
    @INPROCEEDINGS{2013_MLSP,
    author = "Briggs, Forrest and Huang, Yonghong and Raich, Raviv and Eftaxias, Konstantinos and Lei, Zhong and Cukierski, William and Hadley, Sarah Frey and Hadley, Adam and Betts, Matthew and Fern, Xiaoli Z. and Irvine, Jed and Neal, Lawrence and Thomas, Anil and Fodor, Gábor and Tsoumakas, Grigorios and Ng, Hong Wei and Nguyen, Thi Ngoc Tho and Huttunen, Heikki and Ruusuvuori, Pekka and Manninen, Tapio and Diment, Aleksandr and Virtanen, Tuomas and Marzat, Julien and Defretin, Joseph and Callender, Dave and Hurlburt, Chris and Larrey, Ken and Milakov, Maxim",
    booktitle = "2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP)",
    title = "The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment",
    year = "2013",
    volume = "",
    number = "",
    pages = "1-8",
    keywords = "Birds;Spectrogram;Vectors;Rain;Histograms;Image segmentation;Noise",
    doi = "10.1109/MLSP.2013.6661934"
    }

  • A. Diment, R. Padmanabhan, T. Heittola, and T. Virtanen, "Modified Group Delay Feature for Musical Instrument Recognition," in 10th International Symposium on Computer Music Multidisciplinary Research (CMMR), 2013.
    [BibTeX] [Abstract] [Download PDF]

    In this work, the modified group delay feature (MODGDF) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features used in instrument recognition take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The MODGDF is a method of incorporating phase information, which lacks of the issues related to phase unwrapping. Having shown its applicability for speech-related problems, it is now explored in terms of musical instrument recognition. The evaluation is performed on separate note recordings in various instrument sets, and combined with the conventional mel frequency cepstral coefficients (MFCCs), MODGDF shows the noteworthy absolute accuracy gains of up to 5.1\% compared to the baseline MFCCs case.

    @inproceedings{2013_CMMR,
    author = "Diment, Aleksandr and Padmanabhan, Rajan and Heittola, Toni and Virtanen, Tuomas",
    abstract = "In this work, the modified group delay feature (MODGDF) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features used in instrument recognition take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The MODGDF is a method of incorporating phase information, which lacks of the issues related to phase unwrapping. Having shown its applicability for speech-related problems, it is now explored in terms of musical instrument recognition. The evaluation is performed on separate note recordings in various instrument sets, and combined with the conventional mel frequency cepstral coefficients (MFCCs), MODGDF shows the noteworthy absolute accuracy gains of up to 5.1\% compared to the baseline MFCCs case.",
    booktitle = "10th International Symposium on Computer Music Multidisciplinary Research (CMMR)",
    title = "Modified Group Delay Feature for Musical Instrument Recognition",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Diment13\_MODGDF.pdf"
    }
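
    The MODGDF itself is not spelled out in this entry. Purely as an illustrative sketch of the standard modified group delay computation (the common formulation with a cepstrally smoothed denominator and compression parameters alpha and gamma; the parameter values, frame length and DCT-based cepstra below are assumptions, not the paper's configuration), one frame could be processed roughly as follows:

    # Rough sketch of a modified group delay feature (MODGDF) frame, assuming the
    # common formulation tau(w) = sign(t) * |t|^alpha with
    # t = (X_R*Y_R + X_I*Y_I) / S(w)^(2*gamma); all parameter values are assumptions.
    import numpy as np
    from scipy.fft import dct

    def modgdf_frame(x, alpha=0.4, gamma=0.9, n_ceps=13, lifter=12):
        """Illustrative modified group delay cepstra for one windowed frame."""
        n = len(x)
        X = np.fft.rfft(x)                     # spectrum of x[k]
        Y = np.fft.rfft(np.arange(n) * x)      # spectrum of k * x[k]

        # Cepstrally smoothed magnitude spectrum used in the denominator.
        log_mag = np.log(np.abs(X) + 1e-12)
        ceps = np.fft.irfft(log_mag)
        ceps[lifter:n - lifter] = 0.0          # keep only the low-quefrency part
        S = np.exp(np.fft.rfft(ceps).real)

        t = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-12)
        tau = np.sign(t) * np.abs(t) ** alpha  # modified group delay spectrum
        return dct(tau, norm="ortho")[:n_ceps]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        print(modgdf_frame(np.hanning(1024) * rng.standard_normal(1024)).shape)  # (13,)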

  • A. Diment, T. Heittola, and T. Virtanen, "Semi-supervised Learning for Musical Instrument Recognition," in 21st European Signal Processing Conference 2013 (EUSIPCO 2013), 2013, pp. 1-5.
    [BibTeX] [Abstract] [Download PDF]

    In this work, the semi-supervised learning (SSL) techniques are explored in the context of musical instrument recognition. The conventional supervised approaches normally rely on annotated data to train the classifier. This implies performing costly manual annotations of the training data. The SSL methods enable utilising the additional unannotated data, which is significantly easier to obtain, allowing the overall development cost maintained at the same level while notably improving the performance. The implemented classifier incorporates the Gaussian mixture model-based SSL scheme utilising the iterative EM-based algorithm, as well as the extensions facilitating a simpler convergence criteria. The evaluation is performed on a set of nine instruments while training on a dataset, in which the relative size of the labelled data is as little as 15%. It yields a noteworthy absolute performance gain of 16% compared to the performance of the initial supervised models.

    @inproceedings{2013_EUSIPCO_2013,
    author = "Diment, Aleksandr and Heittola, Toni and Virtanen, Tuomas",
    abstract = "In this work, the semi-supervised learning (SSL) techniques are explored in the context of musical instrument recognition. The conventional supervised approaches normally rely on annotated data to train the classifier. This implies performing costly manual annotations of the training data. The SSL methods enable utilising the additional unannotated data, which is significantly easier to obtain, allowing the overall development cost maintained at the same level while notably improving the performance. The implemented classifier incorporates the Gaussian mixture model-based SSL scheme utilising the iterative EM-based algorithm, as well as the extensions facilitating a simpler convergence criteria. The evaluation is performed on a set of nine instruments while training on a dataset, in which the relative size of the labelled data is as little as 15\%. It yields a noteworthy absolute performance gain of 16\% compared to the performance of the initial supervised models.",
    booktitle = "21st European Signal Processing Conference 2013 (EUSIPCO 2013)",
    keywords = "Music information retrieval; musical instrument recognition; semi-supervised learning; instruments",
    month = "Sep",
    title = "Semi-supervised Learning for Musical Instrument Recognition",
    pages = "1-5",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Diment13\_SSL.pdf"
    }
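
    The paper's specific EM extensions and convergence criteria are not reproduced in this entry. As a rough illustration of the general idea only (class-conditional GMMs refitted after folding in unlabelled data according to the current class posteriors, assuming uniform class priors), a minimal sketch could look like this; the feature matrices, class count and component count are placeholders:

    # Minimal sketch of semi-supervised training of class-conditional GMMs by
    # iterative self-labelling of unannotated data; settings are illustrative only.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def semi_supervised_gmms(X_lab, y_lab, X_unlab, n_classes,
                             n_components=4, n_iter=5, seed=0):
        """Fit one GMM per class, iteratively adding currently assigned unlabelled data."""
        models = [GaussianMixture(n_components=n_components, random_state=seed)
                  .fit(X_lab[y_lab == c]) for c in range(n_classes)]

        for _ in range(n_iter):
            # E-like step: class posteriors of the unlabelled data from the current
            # class-conditional likelihoods (uniform class priors assumed).
            ll = np.stack([m.score_samples(X_unlab) for m in models], axis=1)
            post = np.exp(ll - ll.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)
            assigned = post.argmax(axis=1)

            # M-like step: refit each class model on its labelled data plus the
            # unlabelled points currently assigned to that class.
            models = [GaussianMixture(n_components=n_components, random_state=seed)
                      .fit(np.vstack([X_lab[y_lab == c], X_unlab[assigned == c]]))
                      for c in range(n_classes)]
        return models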

  • J. Geiger, F. Weninger, A. Hurmalainen, J. Gemmeke, M. Wöllmer, B. Schuller, G. Rigoll, and T. Virtanen, "The TUM+TUT+KUL Approach to the CHiME Challenge 2013: Multi-Stream ASR Exploiting BLSTM Networks and Sparse NMF," in Proceedings of the 2nd CHiME workshop, 2013, pp. 25-30.
    [BibTeX] [Abstract] [Download PDF]

    We present our joint contribution to the 2nd CHiME Speech Separation and Recognition Challenge. Our system combines speech enhancement by supervised sparse non-negative matrix factorisation (NMF) with a multi-stream speech recognition system. In addition to a conventional MFCC HMM recogniser, predictions by a bidirectional Long Short-Term Memory recurrent neural network (BLSTM-RNN) and from non-negative sparse classification (NSC) are integrated into a triple-stream recogniser. Experiments are carried out on the small vocabulary and the medium vocabulary recognition tasks of the Challenge. Consistent improvements over the Challenge baselines demonstrate the efficacy of the proposed system, resulting in an average word accuracy of 92.8% in the small vocabulary task and an average word error rate of 41.42% in the medium vocabulary task.

    @inproceedings{2013_CHiME,
    author = {Geiger, J{\"u}rgen and Weninger, Felix and Hurmalainen, Antti and Gemmeke, Jort and W{\"o}llmer, Martin and Schuller, Bj{\"o}rn and Rigoll, Gerhard and Virtanen, Tuomas},
    abstract = "We present our joint contribution to the 2nd CHiME Speech Separation and Recognition Challenge. Our system combines speech enhancement by supervised sparse non-negative matrix factorisation (NMF) with a multi-stream speech recognition system. In addition to a conventional MFCC HMM recogniser, predictions by a bidirectional Long Short-Term Memory recurrent neural network (BLSTM-RNN) and from non-negative sparse classification (NSC) are integrated into a triple-stream recogniser. Experiments are carried out on the small vocabulary and the medium vocabulary recognition tasks of the Challenge. Consistent improvements over the Challenge baselines demonstrate the efficacy of the proposed system, resulting in an average word accuracy of 92.8\% in the small vocabulary task and an average word error rate of 41.42\% in the medium vocabulary task.",
    awards = "Best paper award (the 2nd CHiME workshop, 2013)",
    booktitle = "proceedings of the 2nd CHiME workshop",
    journal = "Proceedings of the 2nd CHiME workshop",
    keywords = "Long Short-Term Memory;recurrent neural networks;non-negative matrix factorisation;dynamic Bayesian networks",
    month = "June",
    pages = "25-30",
    title = "{T}he {TUM}+{TUT}+{KUL} {A}pproach to the {CH}i{ME} {C}hallenge 2013: {M}ulti-{S}tream {ASR} {E}xploiting {BLSTM} {N}etworks and {S}parse {NMF}",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/pP1\_geiger.pdf"
    }

  • J. Gemmeke, A. Hurmalainen, and T. Virtanen, "HMM-regularization for NMF-based noise robust ASR," in The 2nd International Workshop on Machine Listening in Multisource Environments CHiME Workshop, 1st June 2013, Vancouver, Canada (in conjunction with ICASSP), 2013, p. 47–52.
    [BibTeX] [Download PDF]
    @inproceedings{2013_in_conjunction_with_ICASSP,
    author = "Gemmeke, Jort and Hurmalainen, Antti and Virtanen, Tuomas",
    booktitle = "The 2nd International Workshop on Machine Listening in Multisource Environments CHiME Workshop, 1st June 2013, Vancouver, Canada (in conjuction with ICASSP)",
    pages = "47--52",
    keywords = "speech enhancement;exemplar-based;noise robustness;Non-Negative Matrix Factorization;Hidden Markov Models",
    publisher = "International Workshop on Machine Listening in Multisource Environments",
    series = "International Workshop on Machine Listening in Multisource Environments",
    title = "{HMM}-regularization for {NMF}-based noise robust {ASR}",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/pP5\_gemmeke.pdf"
    }

  • J. F. Gemmeke, T. Virtanen, and K. Demuynck, "Exemplar-based joint channel and noise compensation," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 868-872. doi:10.1109/ICASSP.2013.6637772
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2013_ICASSP_b,
    author = "Gemmeke, Jort F. and Virtanen, Tuomas and Demuynck, Kris",
    booktitle = "2013 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Exemplar-based joint channel and noise compensation",
    year = "2013",
    volume = "",
    number = "",
    pages = "868-872",
    keywords = "Speech;Noise;Speech recognition;Dictionaries;Noise robustness;Iron;Hidden Markov models;Speech recognition;source separation;matrix factorization;noise robustness;channel compensation",
    doi = "10.1109/ICASSP.2013.6637772",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/3645\_postprint.pdf"
    }

  • D. Giannoulis and A. Klapuri, "Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, iss. 9, p. 1805–1817, 2013. doi:10.1109/TASL.2013.2248720
    [BibTeX]
    @article{2013_b,
    author = "Giannoulis, Dimitrios and Klapuri, Anssi",
    title = "Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2013-12-29
    Publisher name: Institute of Electrical and Electronics Engineers IEEE", year = "2013", doi = "10.1109/TASL.2013.2248720", language = "English", volume = "21", pages = "1805--1817", journal = "IEEE Transactions on Audio Speech and Language Processing", issn = "1558-7916", publisher = "Institute of Electrical and Electronics Engineers Inc.", number = "9" }

  • D. Giannoulis, A. Klapuri, and M. D. Plumbley, "Recognition of harmonic sounds in polyphonic audio using a missing feature approach," in 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 26-31 May 2013, Vancouver, Canada, 2013, p. 8658–8662. doi:10.1109/ICASSP.2013.6639356
    [BibTeX]
    @inproceedings{2013_ICASSP_f,
    author = "Giannoulis, Dimitrios and Klapuri, Anssi and Plumbley, Mark D.",
    title = "Recognition of harmonic sounds in polyphonic audio using a missing feature approach",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2013-12-29
    Publisher name: Institute of Electrical and Electronics Engineers IEEE; IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 01-01-1900 Through 01-01-2000", year = "2013", doi = "10.1109/ICASSP.2013.6639356", language = "English", isbn = "978-1-4799-0356-6", series = "IEEE International Conference on Acoustics, Speech, and Signal Processing", publisher = "IEEE", pages = "8658--8662", booktitle = "2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 26-31 May 2013, Vancouver, Canada", address = "United States" }

  • T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Context-Dependent Sound Event Detection," EURASIP Journal on Audio, Speech and Music Processing, 2013. doi:10.1186/1687-4722-2013-1
    [BibTeX] [Abstract]

    The work presented in this article studies how the context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans are using context information to make more accurate predictions about the sound events and ruling out unlikely events given the context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: automatic context recognition stage and sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is outputted by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance with various level of polyphony. This combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of the performance of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. In the block-level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.

    @article{2013_JASM,
    author = "Heittola, Toni and Mesaros, Annamaria and Eronen, Antti and Virtanen, Tuomas",
    abstract = "The work presented in this article studies how the context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans are using context information to make more accurate predictions about the sound events and ruling out unlikely events given the context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: automatic context recognition stage and sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is outputted by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance with various level of polyphony. This combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of the performance of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. In the block-level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.",
    journal = "EURASIP Journal on Audio, Speech and Music Processing",
    keywords = "CASA;Sound event detection",
    title = "Context-Dependent Sound Event Detection",
    doi = "10.1186/1687-4722-2013-1",
    year = "2013"
    }
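
    Of the two stages described above, only the context-recognition stage (which narrows the candidate event set) is easy to sketch compactly; the HMM-based event detection with restricted Viterbi passes is not shown. The context list, event map and GMM settings below are hypothetical placeholders rather than the paper's configuration:

    # Sketch of stage 1: recognise the audio context with per-context GMMs and
    # restrict the event classes considered by the detection stage accordingly.
    from sklearn.mixture import GaussianMixture

    # Hypothetical mapping from recognised context to the allowed event classes.
    CONTEXT_EVENTS = {
        "street": ["car", "footsteps", "speech"],
        "office": ["keyboard", "speech", "phone ringing"],
    }

    def train_context_models(features_per_context, n_components=16):
        """Fit one GMM per context on that context's training features."""
        return {name: GaussianMixture(n_components=n_components, random_state=0).fit(feats)
                for name, feats in features_per_context.items()}

    def recognise_context(features, context_gmms):
        """Pick the context whose GMM gives the highest average log-likelihood."""
        return max(context_gmms, key=lambda name: context_gmms[name].score(features))

    def allowed_events(features, context_gmms):
        """Context-specific event set handed to the event detection stage."""
        context = recognise_context(features, context_gmms)
        return context, CONTEXT_EVENTS[context]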

  • T. Heittola, A. Mesaros, T. Virtanen, and M. Gabbouj, "Supervised Model Training for Overlapping Sound Events Based on Unsupervised Source Separation," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013, pp. 8677-8681. doi:10.1109/ICASSP.2013.6639360
    [BibTeX] [Abstract] [Download PDF]

    Sound event detection is addressed in the presence of overlapping sounds. Unsupervised sound source separation into streams is used as a preprocessing step to minimize the interference of overlapping events. This poses a problem in supervised model training, since there is no knowledge about which separated stream contains the targeted sound source. We propose two iterative approaches based on EM algorithm to select the most likely stream to contain the target sound: one by selecting always the most likely stream and another one by gradually eliminating the most unlikely streams from the training. The approaches were evaluated with a database containing recordings from various contexts, against the baseline system trained without applying stream selection. Both proposed approaches were found to give a reasonable increase of 8 percentage units in the detection accuracy.

    @inproceedings{2013_ICASSP_a,
    author = "Heittola, Toni and Mesaros, Annamaria and Virtanen, Tuomas and Gabbouj, Moncef",
    abstract = "Sound event detection is addressed in the presence of overlapping sounds. Unsupervised sound source separation into streams is used as a preprocessing step to minimize the interference of overlapping events. This poses a problem in supervised model training, since there is no knowledge about which separated stream contains the targeted sound source. We propose two iterative approaches based on EM algorithm to select the most likely stream to contain the target sound: one by selecting always the most likely stream and another one by gradually eliminating the most unlikely streams from the training. The approaches were evaluated with a database containing recordings from various contexts, against the baseline system trained without applying stream selection. Both proposed approaches were found to give a reasonable increase of 8 percentage units in the detection accuracy.",
    address = "Vancouver, Canada",
    booktitle = "2013 IEEE International Conference on Acoustics, Speech and Signal Processing",
    pages = "8677-8681",
    keywords = "sound event detection;casa",
    publisher = "IEEE Computer Society",
    title = "Supervised Model Training for Overlapping Sound Events Based on Unsupervised Source Separation",
    year = "2013",
    doi = "10.1109/ICASSP.2013.6639360",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/icassp2013\_heittola.pdf"
    }
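
    As a loose illustration of the first selection strategy (always retraining on the separated stream that the current model finds most likely), one could iterate roughly as below. A single GMM stands in for the paper's HMM event models, and the per-recording stream features are placeholders:

    # Sketch of iterative stream selection for training on separated streams:
    # score every stream with the current model, keep the best-scoring stream of
    # each recording, refit, and repeat. A GMM replaces the paper's HMMs here.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_with_stream_selection(stream_features, n_components=8, n_iter=5):
        """stream_features: list over recordings, each a list of (frames, dims) arrays."""
        model = GaussianMixture(n_components=n_components, random_state=0)
        model.fit(np.vstack([s for rec in stream_features for s in rec]))

        for _ in range(n_iter):
            # Choose, per recording, the stream the current model explains best ...
            chosen = [max(rec, key=model.score) for rec in stream_features]
            # ... and refit the event model on the selected streams only.
            model = GaussianMixture(n_components=n_components, random_state=0)
            model.fit(np.vstack(chosen))
        return model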

  • A. Hurmalainen, J. Gemmeke, and T. Virtanen, "Modelling Non-stationary Noise with Spectral Factorisation in Automatic Speech Recognition," Computer Speech & Language, vol. 27, iss. 3, pp. 763-779, 2013. doi:10.1016/j.csl.2012.07.008
    [BibTeX] [Abstract]

    Speech recognition systems intended for everyday use must be able to cope with a large variety of noise types and levels, including highly non-stationary multi-source mixtures. This study applies spectral factorisation algorithms and long temporal context for separating speech and noise from mixed signals. To adapt the system to varying environments, noise models are acquired from the context, or learnt from the mixture itself without prior information. We also propose methods for reducing the size of the bases used for speech and noise modelling by 20-40 times for better practical applicability. We evaluate the performance of the methods both as a standalone classifier and as a signal-enhancing front-end for external recognisers. For the CHiME noisy speech corpus containing non-stationary multi-source household noises at signal-to-noise ratios ranging from +9 to -6 dB, we report average keyword recognition rates up to 87.8% using a single-stream sparse classification algorithm.

    @article{2013_a,
    author = "Hurmalainen, Antti and Gemmeke, Jort and Virtanen, Tuomas",
    abstract = "Speech recognition systems intended for everyday use must be able to cope with a large variety of noise types and levels, including highly non-stationary multi-source mixtures. This study applies spectral factorisation algorithms and long temporal context for separating speech and noise from mixed signals. To adapt the system to varying environments, noise models are acquired from the context, or learnt from the mixture itself without prior information. We also propose methods for reducing the size of the bases used for speech and noise modelling by 20-40 times for better practical applicability. We evaluate the performance of the methods both as a standalone classifier and as a signal-enhancing front-end for external recognisers. For the CHiME noisy speech corpus containing non-stationary multi-source household noises at signal-to-noise ratios ranging from +9 to -6 dB, we report average keyword recognition rates up to 87.8\% using a single-stream sparse classification algorithm.",
    journal = "Computer Speech {\\&} Language",
    keywords = "automatic speech recognition;noise robustness;non-stationary noise;non-negative spectral factorisation;exemplar-based",
    month = "May",
    number = "3",
    pages = "763-779",
    title = "Modelling Non-stationary Noise with Spectral Factorisation in Automatic Speech Recognition",
    doi = "10.1016/j.csl.2012.07.008",
    volume = "27",
    year = "2013",
    rul = "http://www.sciencedirect.com/science/article/pii/S0885230812000563"
    }
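
    The article's exemplar-based, long-temporal-context factorisation is not reproduced here. The following is only a generic single-frame sketch of supervised NMF enhancement with fixed speech and noise dictionaries and a Wiener-like mask; the dictionaries, iteration count and KL multiplicative updates are illustrative assumptions:

    # Generic supervised-NMF enhancement sketch: factorise the mixture magnitude
    # spectrogram against fixed speech and noise dictionaries, then mask.
    import numpy as np

    def nmf_activations(V, W, n_iter=100, eps=1e-12):
        """Multiplicative KL-divergence updates for H in V ~= W @ H (W kept fixed)."""
        H = np.random.default_rng(0).random((W.shape[1], V.shape[1]))
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.T.sum(axis=1, keepdims=True) + eps)
        return H

    def enhance(mixture_mag, W_speech, W_noise):
        """Estimate the speech magnitude spectrogram from the noisy mixture."""
        W = np.hstack([W_speech, W_noise])
        H = nmf_activations(mixture_mag, W)
        speech_part = W_speech @ H[:W_speech.shape[1]]
        noise_part = W_noise @ H[W_speech.shape[1]:]
        mask = speech_part / (speech_part + noise_part + 1e-12)  # Wiener-like mask
        return mask * mixture_mag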

  • A. Hurmalainen and T. Virtanen, "Learning State Labels for Sparse Classification of Speech with Matrix Deconvolution," in Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU), 2013.
    [BibTeX] [Abstract] [Download PDF]

    Non-negative spectral factorisation with long temporal context has been successfully used for noise robust recognition of speech in multi-source environments. Sparse classification from activations of speech atoms can be employed instead of conventional GMMs to determine speech state likelihoods. For accurate classification, correct linguistic state labels must be assigned to speech atoms. We propose using non-negative matrix deconvolution for learning the labels with algorithms closely matching a framework that separates speech from additive noises. Experiments on the 1st CHiME Challenge corpus show improvement in recognition accuracy over labels acquired from original atom sources or previously used least squares regression. The new approach also circumvents numerical issues encountered in previous learning methods, and opens up possibilities for new speech basis generation algorithms.

    @inproceedings{2013_ASRU,
    author = "Hurmalainen, Antti and Virtanen, Tuomas",
    abstract = "Non-negative spectral factorisation with long temporal context has been successfully used for noise robust recognition of speech in multi-source environments. Sparse classification from activations of speech atoms can be employed instead of conventional GMMs to determine speech state likelihoods. For accurate classification, correct linguistic state labels must be assigned to speech atoms. We propose using non-negative matrix deconvolution for learning the labels with algorithms closely matching a framework that separates speech from additive noises. Experiments on the 1st CHiME Challenge corpus show improvement in recognition accuracy over labels acquired from original atom sources or previously used least squares regression. The new approach also circumvents numerical issues encountered in previous learning methods, and opens up possibilities for new speech basis generation algorithms.",
    booktitle = "Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU)",
    keywords = "automatic speech recognition;noise robustness;non-negative matrix factorization;sparse classification",
    month = "December",
    organization = "IEEE",
    title = "Learning State Labels for Sparse Classification of Speech with Matrix Deconvolution",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Hurmalainen\_ASRU2013.pdf"
    }

  • A. Hurmalainen, J. Gemmeke, and T. Virtanen, "Compact Long Context Spectral Factorisation Models for Noise Robust Recognition of Medium Vocabulary Speech," in Proceedings of the 2nd CHiME workshop, 2013, pp. 13-18.
    [BibTeX] [Abstract] [Download PDF]

    In environments containing multiple non-stationary sound sources, it becomes increasingly difficult to recognise speech from its short-time spectra alone. Long-context speech and noise models, where phonetic patterns and noise events may span hundreds of milliseconds, have been found beneficial in such separation tasks. Thus far the majority of work employing non-negative matrix factorisation to long-context spectrogram separation has been conducted on small vocabulary tasks by exploiting large speech and noise dictionaries containing thousands of atoms. In this work we study whether the previously proposed factorisation methods are applicable to more natural speech and limited noise context while keeping the model sizes practically feasible. Results are evaluated on the WSJ0 5k -based 2nd CHiME Challenge Track 2 corpus, where we achieve approximately 4% absolute improvement in speech recognition rates compared to baseline using the proposed enhancement framework.

    @inproceedings{2013_CHiME_a,
    author = "Hurmalainen, Antti and Gemmeke, Jort and Virtanen, Tuomas",
    abstract = "In environments containing multiple non-stationary sound sources, it becomes increasingly difficult to recognise speech from its short-time spectra alone. Long-context speech and noise models, where phonetic patterns and noise events may span hundreds of milliseconds, have been found beneficial in such separation tasks. Thus far the majority of work employing non-negative matrix factorisation to long-context spectrogram separation has been conducted on small vocabulary tasks by exploiting large speech and noise dictionaries containing thousands of atoms. In this work we study whether the previously proposed factorisation methods are applicable to more natural speech and limited noise context while keeping the model sizes practically feasible. Results are evaluated on the WSJ0 5k -based 2nd CHiME Challenge Track 2 corpus, where we achieve approximately 4\% absolute improvement in speech recognition rates compared to baseline using the proposed enhancement framework.",
    booktitle = "Proceedings of the 2nd CHiME workshop",
    keywords = "spectral factorisation;speech recognition;noise robustness",
    month = "June",
    pages = "13-18",
    title = "Compact Long Context Spectral Factorisation Models for Noise Robust Recognition of Medium Vocabulary Speech",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/pS13\_hurmalainen.pdf"
    }

  • A. Hurmalainen and T. Virtanen, "Acquiring Variable Length Speech Bases for Factorisation-Based Noise Robust Speech Recognition," in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), 2013.
    [BibTeX] [Abstract] [Download PDF]

    Studies from multiple disciplines show that spectro-temporal units of natural languages and human speech perception are longer than short-time frames commonly employed in automatic speech recognition. Extended temporal context is also beneficial for separation of concurrent sound sources such as speech and noise. However, the length of patterns in speech varies greatly, making it difficult to model with fixed-length units. We propose methods for acquiring variable length speech atom bases for accurate yet compact representation of speech with a large temporal context. Bases are generated from spectral features, from assigned state labels, and as a combination of both. Results for factorisation-based speech recognition in noisy conditions show equal or better separation and recognition quality in comparison to fixed length units, while model sizes are reduced by up to 40%.

    @inproceedings{2013_EUSIPCO,
    author = "Hurmalainen, Antti and Virtanen, Tuomas",
    abstract = "Studies from multiple disciplines show that spectro-temporal units of natural languages and human speech perception are longer than short-time frames commonly employed in automatic speech recognition. Extended temporal context is also beneficial for separation of concurrent sound sources such as speech and noise. However, the length of patterns in speech varies greatly, making it difficult to model with fixed-length units. We propose methods for acquiring variable length speech atom bases for accurate yet compact representation of speech with a large temporal context. Bases are generated from spectral features, from assigned state labels, and as a combination of both. Results for factorisation-based speech recognition in noisy conditions show equal or better separation and recognition quality in comparison to fixed length units, while model sizes are reduced by up to 40\%.",
    booktitle = "Proceedings of the 21st European Signal Processing Conference (EUSIPCO)",
    keywords = "spectral factorization;speech recognition;noise robustness",
    month = "September",
    organization = "The European Association for Signal Processing (EURASIP)",
    title = "Acquiring Variable Length Speech Bases for Factorisation-Based Noise Robust Speech Recognition",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Hurmalainen\_EUSIPCO2013.pdf"
    }

  • J. Kauppinen, A. Klapuri, and T. Virtanen, "Music self-similarity modeling using augmented nonnegative matrix factorization of block and stripe patterns," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1-4. doi:10.1109/WASPAA.2013.6701855
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2013_WASPAA,
    author = "Kauppinen, Joonas and Klapuri, Anssi and Virtanen, Tuomas",
    booktitle = "2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Music self-similarity modeling using augmented nonnegative matrix factorization of block and stripe patterns",
    year = "2013",
    volume = "",
    number = "",
    pages = "1-4",
    keywords = "Feature extraction;Vectors;Signal processing algorithms;Music;Signal processing;Approximation methods;Music structure analysis;nonnegative matrix factorization;self-similarity",
    doi = "10.1109/WASPAA.2013.6701855",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Kauppinen-WASPAA2013-final.pdf"
    }

  • H. Kirchhoff, S. Dixon, and A. Klapuri, "Missing template estimation for user-assisted music transcription," in 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 26-31 May 2013, Vancouver, Canada, United States, 2013, p. 26–30. doi:10.1109/ICASSP.2013.6637602
    [BibTeX]
    @inproceedings{2013_ICASSP_e,
    author = "Kirchhoff, Holger and Dixon, Simon and Klapuri, Anssi",
    title = "Missing template estimation for user-assisted music transcription",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2013-12-29
    Publisher name: Institute of Electrical and Electronics Engineers IEEE; IEEE International Conference on Acoustics, Speech and Signal Processing ; Conference date: 01-01-1900 Through 01-01-2000", year = "2013", doi = "10.1109/ICASSP.2013.6637602", language = "English", isbn = "978-1-4799-0356-6", series = "IEEE International Conference on Acoustics, Speech, and Signal Processing", publisher = "IEEE", pages = "26--30", booktitle = "2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 26-31 May 2013, Vancouver, Canada", address = "United States" }

  • H. Kirchhoff, S. Dixon, and A. Klapuri, "Multiple instrument tracking based on reconstruction error, pitch continuity and instrument activity," in 10th International Symposium on Computer Music Multidisciplinary Research, 15- 18 October 2013, Marseille, France, 2013, p. 894–903.
    [BibTeX]
    @inproceedings{2013_d,
    author = "Kirchhoff, Holger and Dixon, Simon and Klapuri, Anssi",
    title = "Multiple instrument tracking based on reconstruction error, pitch continuity and instrument activity",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2013-12-29
    Publisher name: LMA", year = "2013", language = "English", isbn = "978-2-909669-23-6", series = "International Symposium on Computer Music Multidisciplinary Research", publisher = "LMA", pages = "894--903", booktitle = "10th International Symposium on Computer Music Multidisciplinary Research, 15- 18 October 2013, Marseille, France" }

  • D. Korpi, T. Heittola, T. Partala, A. Eronen, A. Mesaros, and T. Virtanen, "On the human ability to discriminate audio ambiances from similar locations of an urban environment," Personal and Ubiquitous Computing, vol. 17, iss. 4, 2013. doi:10.1007/s00779-012-0625-z
    [BibTeX]
    @article{2013_PUC,
    author = "Korpi, Dani and Heittola, Toni and Partala, T. and Eronen, A. and Mesaros, Annamaria and Virtanen, Tuomas",
    title = "On the human ability to discriminate audio ambiances from similar locations of an urban environment",
    note = "November 2012, Online first.Poistettu Portfolio13:sta tupla r=2009.
    Contribution: organisation=sgn,FACT1=1
    Publisher name: Springer-Verlag", year = "2013", doi = "10.1007/s00779-012-0625-z", language = "English", volume = "17", journal = "Personal and Ubiquitous Computing", issn = "1617-4909", publisher = "Springer London", number = "4" }

  • K. Mahkonen, A. Eronen, T. Virtanen, E. Helander, V. Popa, J. Leppänen, and I. Curcio, "Music Dereverberation by Spectral Linear Prediction in Live Recordings," in 16th International Conference on Digital Audio Effects, Ireland, 2-5 September 2013, 2013.
    [BibTeX] [Download PDF]
    @inproceedings{2013_DAFx,
    author = {Mahkonen, Katariina and Eronen, Antti and Virtanen, Tuomas and Helander, Elina and Popa, Victor and Lepp{\"a}nen, Jussi and Curcio, Igor},
    booktitle = "16th International Conference on Digital Audio Effects, Ireland, 2-5.9,2013",
    publisher = "International Conference on Digital Audio Effects",
    series = "International Conference on Digital Audio Effects",
    title = "Music Dereverberation by Spectral Linear Prediction in Live Recordings",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Mahkonen\_DAFX2013.pdf"
    }

  • A. Mesaros, T. Heittola, and K. Palomäki, "Analysis of acoustic-semantic relationship for diversely annotated real-world audio data," in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 813-817. doi:http://dx.doi.org/10.1109/ICASSP.2013.6637761
    [BibTeX] [Abstract]

    A common problem of freely annotated or user contributed audio databases is the high variability of the labels, related to homonyms, synonyms, plurals, etc. Automatically re-labeling audio data based on audio similarity could offer a solution to this problem. This paper studies the relationship between audio and labels in a sound event database, by evaluating semantic similarity of labels of acoustically similar sound event instances. The assumption behind the study is that acoustically similar events are annotated with semantically similar labels. Indeed, for 43% of the tested data, there was at least one in ten acoustically nearest neighbors having a synonym as label, while the closest related term is on average one level higher or lower in the semantic hierarchy.

    @inproceedings{2013_ICASSP,
    author = {Mesaros, Annamaria and Heittola, Toni and Palom{\"a}ki, Kalle},
    abstract = "A common problem of freely annotated or user contributed audio databases is the high variability of the labels, related to homonyms, synonyms, plurals, etc. Automatically re-labeling audio data based on audio similarity could offer a solution to this problem. This paper studies the relationship between audio and labels in a sound event database, by evaluating semantic similarity of labels of acoustically similar sound event instances. The assumption behind the study is that acoustically similar events are annotated with semantically similar labels. Indeed, for 43\% of the tested data, there was at least one in ten acoustically nearest neighbors having a synonym as label, while the closest related term is on average one level higher or lower in the semantic hierarchy.",
    booktitle = "Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing",
    doi = "http://dx.doi.org/10.1109/ICASSP.2013.6637761",
    isbn = "978-1-4799-0356-6",
    keywords = "audio similarity;semantic similarity;sound events",
    pages = "813-817",
    publisher = "IEEE Computer Society",
    title = "Analysis of acoustic-semantic relationship for diversely annotated real-world audio data",
    year = "2013"
    }

  • A. Mesaros, T. Heittola, and K. Palomäki, "Query-by-example retrieval of sound events using an integrated similarity measure of content and label," in 14th International Workshop on Image and Audio Analysis for Multimedia Interactive Services (WIA2MIS), 2013, pp. 1-4.
    [BibTeX] [Abstract]

    This paper presents a method for combining audio similarity and semantic similarity into a single similarity measure for query-by-example retrieval. The integrated similarity measure is used to retrieve sound events that are similar in content to the given query and have labels containing similar words. Through the semantic component, the method is able to handle variability in labels of sound events. Through the acoustic component, the method retrieves acoustically similar examples. On a test database of over 3000 sound event examples, the proposed method obtains a better retrieval performance than audio-based retrieval, and returns results closer acoustically to the query than a label-based retrieval.

    @inproceedings{2013_WIA2MIS,
    author = {Mesaros, Annamaria and Heittola, Toni and Palom{\"a}ki, Kalle},
    abstract = "This paper presents a method for combining audio similarity and semantic similarity into a single similarity measure for query-by-example retrieval. The integrated similarity measure is used to retrieve sound events that are similar in content to the given query and have labels containing similar words. Through the semantic component, the method is able to handle variability in labels of sound events. Through the acoustic component, the method retrieves acoustically similar examples. On a test database of over 3000 sound event examples, the proposed method obtains a better retrieval performance than audio-based retrieval, and returns results closer acoustically to the query than a label-based retrieval.",
    booktitle = "14th International Workshop on Image and Audio Analysis for Multimedia Interactive Services (WIA2MIS)",
    pages = "1-4",
    title = "Query-by-example retrieval of sound events using an integrated similarity measure of content and label",
    year = "2013"
    }
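
    The integrated measure itself is not given in this entry. As a rough sketch of the general idea (a single retrieval score combining acoustic distance between feature vectors with a label-similarity term), something along these lines could be used; the Euclidean acoustic distance, the WordNet path similarity and the equal weighting are assumptions rather than the paper's measure, and the WordNet corpus must be available (nltk.download('wordnet')):

    # Illustrative content+label similarity for query-by-example retrieval.
    import numpy as np
    from nltk.corpus import wordnet as wn

    def label_similarity(label_a, label_b):
        """Crude semantic similarity between two one-word labels via WordNet."""
        scores = [a.path_similarity(b)
                  for a in wn.synsets(label_a) for b in wn.synsets(label_b)]
        scores = [s for s in scores if s is not None]
        return max(scores) if scores else 0.0

    def integrated_similarity(query_feat, query_label, item_feat, item_label, weight=0.5):
        """Weighted combination of acoustic and label similarity."""
        acoustic = 1.0 / (1.0 + np.linalg.norm(np.asarray(query_feat) - np.asarray(item_feat)))
        semantic = label_similarity(query_label, item_label)
        return weight * acoustic + (1.0 - weight) * semantic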

  • J. Nurminen, H. Silén, E. Helander, and M. Gabbouj, "Evaluation of detailed modeling of the LP residual in statistical speech synthesis," in 2013 IEEE International Symposium on Circuits and Systems (ISCAS), 2013, pp. 313-316. doi:10.1109/ISCAS.2013.6571844
    [BibTeX]
    @INPROCEEDINGS{2013_ISCAS,
    author = "Nurminen, Jani and Silén, Hanna and Helander, Elina and Gabbouj, Moncef",
    booktitle = "2013 IEEE International Symposium on Circuits and Systems (ISCAS)",
    title = "Evaluation of detailed modeling of the LP residual in statistical speech synthesis",
    year = "2013",
    volume = "",
    number = "",
    pages = "313-316",
    keywords = "Hidden Markov models;Speech;Speech synthesis;Harmonic analysis;Databases;Cutoff frequency;Training;statistical speech synthesis;linear prediction;residual modeling",
    doi = "10.1109/ISCAS.2013.6571844"
    }

  • J. Nurminen, H. Silen, and M. Gabbouj, "Speaker-specific retraining for enhanced compression of unit selection text-to-speech databases," in Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), 25-29 August, Lyon, France, 2013, p. 388–391.
    [BibTeX]
    @inproceedings{2013_Interspeech_2013_b,
    author = "Nurminen, Jani and Silen, Hanna and Gabbouj, Moncef",
    booktitle = "Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), 25-29 August, Lyon, France",
    pages = "388--391",
    publisher = "International Speech Communication Association",
    series = "Interspeech",
    title = "{S}peaker-specific retraining for enhanced compression of unit selection text-to-speech databases",
    year = "2013"
    }

  • P. Pertilä, "Online Blind Speech Separation using Multiple Acoustic Speaker Tracking and Time-Frequency Masking," Computer Speech \\& Language, vol. 27, iss. 3, p. 683–702, 2013. doi:10.1016/j.csl.2012.08.003
    [BibTeX] [Abstract]

    Separating speech signals of multiple simultaneous talkers in a reverberant enclosure is known as the cocktail party problem. In real-time applications online solutions capable of separating the signals as they are observed are required in contrast to separating the signals offline after observation. Often a talker may move, which should also be considered by the separation system. This work proposes an online method for speaker detection, speaker direction tracking, and speech separation. The separation is based on multiple acoustic source tracking (MAST) using Bayesian filtering and time–frequency masking. Measurements from three room environments with varying amounts of reverberation using two different designs of microphone arrays are used to evaluate the capability of the method to separate up to four simultaneously active speakers. Separation of moving talkers is also considered. Results are compared to two reference methods: ideal binary masking (IBM) and oracle tracking (O-T). Simulations are used to evaluate the effect of number of microphones and their spacing.

    @article{2013,
    author = {Pertil{\"a}, Pasi},
    abstract = "Separating speech signals of multiple simultaneous talkers in a reverberant enclosure is known as the cocktail party problem. In real-time applications online solutions capable of separating the signals as they are observed are required in contrast to separating the signals offline after observation. Often a talker may move, which should also be considered by the separation system. This work proposes an online method for speaker detection, speaker direction tracking, and speech separation. The separation is based on multiple acoustic source tracking (MAST) using Bayesian filtering and time–frequency masking. Measurements from three room environments with varying amounts of reverberation using two different designs of microphone arrays are used to evaluate the capability of the method to separate up to four simultaneously active speakers. Separation of moving talkers is also considered. Results are compared to two reference methods: ideal binary masking (IBM) and oracle tracking (O-T). Simulations are used to evaluate the effect of number of microphones and their spacing.",
    journal = "Computer Speech {\\&} Language",
    keywords = "Blind source separation;Acoustic source tracking;Particle filtering;Time-frequency masking;Microphone arrays;Spatial Sound Source Separation",
    month = "May",
    doi = "10.1016/j.csl.2012.08.003",
    number = "3",
    pages = "683–702",
    title = "Online Blind Speech Separation using Multiple Acoustic Speaker Tracking and Time-Frequency Masking",
    volume = "27",
    year = "2013"
    }
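
    Only the time-frequency masking step of the described system is easy to sketch compactly; speaker detection and the Bayesian direction tracking are not shown, and the per-bin direction estimates plus tracked source directions below are assumed to come from such a front end:

    # Sketch of separation by time-frequency masking: assign each TF bin to the
    # nearest tracked source direction and mask the mixture STFT accordingly.
    import numpy as np

    def binary_masks(bin_doas, source_doas):
        """bin_doas: (freq, time) DOA estimate per TF bin in degrees;
        source_doas: tracked source directions in degrees."""
        diffs = np.stack([
            np.abs((bin_doas - doa + 180.0) % 360.0 - 180.0)  # wrapped angular error
            for doa in source_doas
        ])
        nearest = diffs.argmin(axis=0)
        return [(nearest == i).astype(float) for i in range(len(source_doas))]

    def separate(mixture_stft, bin_doas, source_doas):
        """Return one masked STFT per tracked source."""
        return [mask * mixture_stft for mask in binary_masks(bin_doas, source_doas)]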

  • P. Pertilä, M. S. Hämäläinen, and M. Mieskolainen, "Passive temporal offset estimation of multichannel recordings of an ad-hoc microphone array," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, iss. 11, pp. 2393-2402, 2013. doi:10.1109/TASLP.2013.2286921
    [BibTeX] [Abstract]

    In recent years ad-hoc microphone arrays have become ubiquitous, and the capture hardware and quality is increasingly more sophisticated. Ad-hoc arrays hold a vast potential for audio applications, but they are inherently asynchronous, i.e., temporal offset exists in each channel, and furthermore the device locations are generally unknown. Therefore, the data is not directly suitable for traditional microphone array applications such as source localization and beamforming. This work presents a least squares method for temporal offset estimation of a static ad-hoc microphone array. The method utilizes the captured audio content without the need to emit calibration signals, provided that during the recording a sufficient amount of sound sources surround the array. The Cramer-Rao lower bound of the estimator is given and the effect of limited number of surrounding sources on the solution accuracy is investigated. A practical implementation is then presented using non-linear filtering with automatic parameter adjustment. Simulations over a range of reverberation and noise levels demonstrate the algorithm's robustness. Using smartphones an average RMS error of 3.5 samples (at 48 kHz) was reached when the algorithm's assumptions were met.

    @article{2013_TASLP,
    author = {Pertil{\"a}, Pasi and H{\"a}m{\"a}l{\"a}inen, Matti S. and Mieskolainen, Mikael},
    abstract = "In recent years ad-hoc microphone arrays have become ubiquitous, and the capture hardware and quality is increasingly more sophisticated. Ad-hoc arrays hold a vast potential for audio applications, but they are inherently asynchronous, i.e., temporal offset exists in each channel, and furthermore the device locations are generally unknown. Therefore, the data is not directly suitable for traditional microphone array applications such as source localization and beamforming. This work presents a least squares method for temporal offset estimation of a static ad-hoc microphone array. The method utilizes the captured audio content without the need to emit calibration signals, provided that during the recording a sufficient amount of sound sources surround the array. The Cramer-Rao lower bound of the estimator is given and the effect of limited number of surrounding sources on the solution accuracy is investigated. A practical implementation is then presented using non-linear filtering with automatic parameter adjustment. Simulations over a range of reverberation and noise levels demonstrate the algorithm's robustness. Using smartphones an average RMS error of 3.5 samples (at 48 kHz) was reached when the algorithm's assumptions were met.",
    doi = "10.1109/TASLP.2013.2286921",
    issn = "1558-7916",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    keywords = "Acoustic measurement;Ad hoc networks;Calibration;Microphone arrays;Synchronization;self localization",
    month = "Nov.",
    number = "11",
    pages = "2393-2402",
    title = "Passive temporal offset estimation of multichannel recordings of an ad-hoc microphone array",
    volume = "21",
    year = "2013"
    }
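
    The estimator described above (built on sound sources surrounding the array, with a Cramer-Rao analysis and non-linear filtering) is not reproduced here. The sketch below only shows the generic idea of turning pairwise cross-correlation lags into per-channel offsets with a least-squares solve, under the simplifying assumption that a single dominant source is audible in every channel:

    # Generic offset sketch: estimate pairwise lags by cross-correlation, then
    # solve a least-squares system for per-channel offsets (channel 0 as reference).
    import numpy as np
    from scipy.signal import correlate

    def pairwise_lag(x, y):
        """Estimated lag of x relative to y, in samples, from the correlation peak."""
        corr = correlate(x, y, mode="full")
        return int(np.argmax(corr)) - (len(y) - 1)

    def channel_offsets(channels):
        """channels: list of 1-D sample arrays. Returns one offset per channel."""
        n = len(channels)
        rows, lags = [], []
        for i in range(n):
            for j in range(i + 1, n):
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0        # offset_i - offset_j ~ measured lag
                rows.append(row)
                lags.append(pairwise_lag(channels[i], channels[j]))
        rows.append(np.eye(n)[0])                  # pin channel 0 to zero offset
        lags.append(0.0)
        offsets, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(lags), rcond=None)
        return offsets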

  • P. Pertilä and A. Tinakari, "Time-of-arrival estimation for blind beamforming," in 2013 18th International Conference on Digital Signal Processing (DSP), 2013, pp. 1-6. doi:10.1109/ICDSP.2013.6622689
    [BibTeX]
    @INPROCEEDINGS{2013_DSP,
    author = "Pertilä, Pasi and Tinakari, Aki",
    booktitle = "2013 18th International Conference on Digital Signal Processing (DSP)",
    title = "Time-of-arrival estimation for blind beamforming",
    year = "2013",
    volume = "",
    number = "",
    pages = "1-6",
    keywords = "Vectors;Kalman filters;Sensors;Array signal processing;Microphone arrays;Smart phones;Time of arrival estimation;Beam steering;Kalman filter;Speech enhancement",
    doi = "10.1109/ICDSP.2013.6622689"
    }

  • C. Schörkhuber, A. Klapuri, and A. Sontacchi, "Audio Pitch Shifting Using the Constant-Q Transform," Journal of the Audio Engineering Society, vol. 61, iss. 7-8, p. 562–572, 2013.
    [BibTeX]
    @article{2013_AES,
    author = {Sch{\"o}rkhuber, Christian and Klapuri, Anssi and Sontacchi, Alois},
    title = "Audio Pitch Shifting Using the Constant-Q Transform",
    note = "Contribution: organisation=sgn,FACT1=1
    Portfolio EDEND: 2013-12-29
    Publisher name: Audio Engineering Society", year = "2013", language = "English", volume = "61", pages = "562--572", journal = "Journal of the Audio Engineering Society", issn = "1549-4950", publisher = "Audio Engineering Society", number = "7-8" }

  • H. Silen, J. Nurminen, E. Helander, and M. Gabbouj, "Voice Conversion for Non-Parallel Datasets Using Dynamic Kernel Partial Least Squares Regression," in Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), 25-29 August, Lyon, France, 2013, p. 373–377.
    [BibTeX]
    @inproceedings{2013_Interspeech_2013,
    author = "Silen, Hanna and Nurminen, Jani and Helander, Elina and Gabbouj, Moncef",
    booktitle = "Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), 25-29 August, Lyon, France",
    pages = "373--377",
    publisher = "International Speech Communication Association",
    series = "Interspeech",
    title = "{V}oice {C}onversion for {N}on-{P}arallel {D}atasets {U}sing {D}ynamic {K}ernel {P}artial {L}east {S}quares {R}egression",
    year = "2013"
    }

  • T. Virtanen, J. F. Gemmeke, and B. Raj, "Active-Set Newton Algorithm for Overcomplete Non-Negative Representations of Audio," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, iss. 11, pp. 2277-2289, 2013. doi:10.1109/TASL.2013.2263144
    [BibTeX] [Download PDF]
    @ARTICLE{2013_TASLP_a,
    author = "Virtanen, Tuomas and Gemmeke, Jort Florent and Raj, Bhiksha",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    title = "Active-Set Newton Algorithm for Overcomplete Non-Negative Representations of Audio",
    year = "2013",
    volume = "21",
    number = "11",
    pages = "2277-2289",
    keywords = "Large scale systems;Pattern recognition;Optimization;Acoustic signal analysis;Source separation;Acoustic signal analysis;audio source separation;convex optimization;Newton algorithm;non-negative matrix factorization;sparse coding;sparse representation;supervised source separation",
    doi = "10.1109/TASL.2013.2263144",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/asna.pdf"
    }

  • Z. Wu, T. Virtanen, T. Kinnunen, E. S. Chng, and H. Li, "Exemplar-based Voice Conversion using Non-negative Spectrogram Deconvolution," in in proc. 8th ISCA Speech Synthesis Workshop, 2013.
    [BibTeX] [Download PDF]
    @inproceedings{2013_ISCA,
    author = "Wu, Zhizheng and Virtanen, Tuomas and Kinnunen, Tomi and Chng, Eng Siong and Li, Haizhou",
    booktitle = "in proc. 8th ISCA Speech Synthesis Workshop",
    keywords = "voice conversion; NMF",
    title = "Exemplar-based Voice Conversion using Non-negative Spectrogram Deconvolution",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ssw\_2013\_nmf.pdf"
    }

  • Z. Wu, T. Virtanen, T. Kinnunen, E. S. Chng, and H. Li, "Exemplar-based unit selection for voice conversion utilizing temporal information," in In proc. Interspeech, 2013.
    [BibTeX] [Download PDF]
    @inproceedings{2013_InterSpecch,
    author = "Wu, Zhizheng and Virtanen, Tuomas and Kinnunen, Tomi and Chng, Eng Siong and Li, Haizhou",
    booktitle = "In proc. Interspeech",
    keywords = "voice conversion; NMF",
    title = "Exemplar-based unit selection for voice conversion utilizing temporal information",
    year = "2013",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/exemplar\_unit\_selection.pdf"
    }

2012

  • E. Helander, H. Silen, T. Virtanen, and M. Gabbouj, "Voice Conversion Using Dynamic Kernel Partial Least Squares Regression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, iss. 3, pp. 806-817, 2012. doi:10.1109/TASL.2011.2165944
    [BibTeX] [Download PDF]
    @ARTICLE{2012_TASLP,
    author = "Helander, Elina and Silen, Hanna and Virtanen, Tuomas and Gabbouj, Moncef",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    title = "Voice Conversion Using Dynamic Kernel Partial Least Squares Regression",
    year = "2012",
    volume = "20",
    number = "3",
    pages = "806-817",
    keywords = "Kernel;Speech;Hidden Markov models;Data models;Training data;Statistical analysis;Training;Kernel methods;partial least squares regression;voice conversion",
    doi = "10.1109/TASL.2011.2165944",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/voice-conversion-TASLP.pdf"
    }

  • A. Hurmalainen and T. Virtanen, "Modelling spectro-temporal dynamics in factorisation-based noise-robust automatic speech recognition," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4113-4116. doi:10.1109/ICASSP.2012.6288823
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2012_ICASSP,
    author = "Hurmalainen, Antti and Virtanen, Tuomas",
    booktitle = "2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Modelling spectro-temporal dynamics in factorisation-based noise-robust automatic speech recognition",
    year = "2012",
    volume = "",
    number = "",
    pages = "4113-4116",
    keywords = "Speech;Vectors;Noise;Speech recognition;Spectrogram;Feature extraction;Noise measurement;Automatic speech recognition;exemplar-based;spectral factorisation;noise robustness",
    doi = "10.1109/ICASSP.2012.6288823",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/hurmalainen\_icassp2012.pdf"
    }

  • A. Hurmalainen, J. Gemmeke, and T. Virtanen, "Detection, Separation and Recognition of Speech From Continuous Signals Using Spectral Factorisation," in 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 2012, pp. 2649-2653.
    [BibTeX] [Abstract] [Download PDF]

    In real world speech processing, the signals are often continuous and consist of momentary segments of speech over non-stationary background noise. It has been demonstrated that spectral factorisation using multi-frame atoms can be successfully employed to separate and recognise speech in adverse conditions. While in previous work full knowledge of utterance endpointing and speaker identity was used for noise modelling and speech recognition, this study proposes spectral factorisation and sparse classification techniques to detect, identify, separate and recognise speech from a continuous noisy input. Speech models are trained beforehand, but noise models are acquired adaptively from the input by using voice activity detection without prior knowledge of noise-only locations. The results are evaluated on the CHiME corpus, containing utterances from 34 speakers over highly non-stationary multi-source noise.

    @inproceedings{2012_EUSIPCO,
    author = "Hurmalainen, Antti and Gemmeke, Jort and Virtanen, Tuomas",
    abstract = "In real world speech processing, the signals are often continuous and consist of momentary segments of speech over non-stationary background noise. It has been demonstrated that spectral factorisation using multi-frame atoms can be successfully employed to separate and recognise speech in adverse conditions. While in previous work full knowledge of utterance endpointing and speaker identity was used for noise modelling and speech recognition, this study proposes spectral factorisation and sparse classification techniques to detect, identify, separate and recognise speech from a continuous noisy input. Speech models are trained beforehand, but noise models are acquired adaptively from the input by using voice activity detection without prior knowledge of noise-only locations. The results are evaluated on the CHiME corpus, containing utterances from 34 speakers over highly non-stationary multi-source noise.",
    address = "Bucharest, Romania",
    booktitle = "20th European Signal Processing Conference (EUSIPCO)",
    keywords = "Spectral factorization;speech recognition;speaker recognition;voice activity detection;speech separation",
    month = "August",
    organization = "European Association for Signal, Speech, and Image Processing (EURASIP)",
    pages = "2649-2653",
    title = "Detection, Separation and Recognition of Speech From Continuous Signals Using Spectral Factorisation",
    year = "2012",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/hurmalainen\_eusipco2012.pdf"
    }

  • A. Hurmalainen, R. Saeidi, and T. Virtanen, "Group Sparsity for Speaker Identity Discrimination in Factorisation-based Speech Recognition," in 13th Interspeech, 2012.
    [BibTeX] [Abstract] [Download PDF]

    Spectrogram factorisation using a dictionary of spectro-temporal atoms has been successfully employed to separate a mixed audio signal into its source components. When atoms from multiple sources are included in a combined dictionary, the relative weights of activated atoms reveal likely sources as well as the content of each source. Enforcing sparsity on the activation weights produces solutions, where only a small number of atoms are active at a time. In this paper we propose using group sparsity to restrict simultaneous activation of sources, allowing us to discover the identity of an unknown speaker from multiple candidates, and further to recognise the phonetic content more reliably with a narrowed down subset of atoms belonging to the most likely speakers. An evaluation on the CHiME corpus shows that the use of group sparsity improves the results of noise robust speaker identification and speech recognition using speaker-dependent models.

    @inproceedings{2012_InterSpecch_a,
    author = "Hurmalainen, Antti and Saeidi, Rahim and Virtanen, Tuomas",
    abstract = "Spectrogram factorisation using a dictionary of spectro-temporal atoms has been successfully employed to separate a mixed audio signal into its source components. When atoms from multiple sources are included in a combined dictionary, the relative weights of activated atoms reveal likely sources as well as the content of each source. Enforcing sparsity on the activation weights produces solutions, where only a small number of atoms are active at a time. In this paper we propose using group sparsity to restrict simultaneous activation of sources, allowing us to discover the identity of an unknown speaker from multiple candidates, and further to recognise the phonetic content more reliably with a narrowed down subset of atoms belonging to the most likely speakers. An evaluation on the CHiME corpus shows that the use of group sparsity improves the results of noise robust speaker identification and speech recognition using speaker-dependent models.",
    booktitle = "13th Interspeech",
    keywords = "group sparsity; speech recognition; speaker identification; spectrogram factorization",
    organization = "International Speech Communication Association (ISCA)",
    title = "Group Sparsity for Speaker Identity Discrimination in Factorisation-based Speech Recognition",
    year = "2012",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/hurmalainen\_interspeech2012.pdf"
    }

  • S. Kiranyaz, T. Mäkinen, and M. Gabbouj, "Dynamic and scalable audio classification by collective network of binary classifiers framework: An evolutionary approach," Neural Networks, vol. 34, pp. 80–95, 2012. doi:10.1016/j.neunet.2012.07.003
    [BibTeX]
    @article{2012_NN,
    author = {Kiranyaz, Serkan and M{\"a}kinen, Toni and Gabbouj, Moncef},
    doi = "10.1016/j.neunet.2012.07.003",
    issn = "0893-6080",
    journal = "Neural Networks",
    pages = "80--95",
    publisher = "Elsevier",
    title = "{D}ynamic and scalable audio classification by collective network of binary classifiers framework: {A}n evolutionary approach",
    volume = "34",
    year = "2012"
    }

  • D. Korpi, T. Heittola, T. Partala, A. Eronen, A. Mesaros, and T. Virtanen, "On the human ability to discriminate audio ambiances from similar locations of an urban environment," Personal and Ubiquitous Computing, vol. November 2012, 2012.
    [BibTeX] [Abstract] [Download PDF]

    When developing advanced location-based systems augmented with audio ambiances, it would be cost-effective to use a few representative samples from typical environments for describing a larger number of similar locations. The aim of this experiment was to study the human ability to discriminate audio ambiances recorded in similar locations of the same urban environment. A listening experiment consisting of material from three different environments and nine different locations was carried out with nineteen subjects to study the credibility of audio representations for certain environments which would diminish the need for collecting huge audio databases. The first goal was to study to what degree humans are able to recognize whether the recording has been made in an indicated location or in another similar location, when presented with the name of the place, location on a map, and the associated audio ambiance. The second goal was to study whether the ability to discriminate audio ambiances from different locations is affected by a visual cue, by presenting additional information in form of a photograph of the suggested location. The results indicate that audio ambiances from similar urban areas of the same city differ enough so that it is not acceptable to use a single recording as ambience to represent different yet similar locations. Including an image was found to increase the perceived credibility of all the audio samples in representing a certain location. The results suggest that developers of audio-augmented location-based systems should aim at using audio samples recorded on-site for each location in order to achieve a credible impression.

    @article{2012_PUC,
    author = "Korpi, Dani and Heittola, Toni and Partala, Timo and Eronen, Antti and Mesaros, Annamaria and Virtanen, Tuomas",
    abstract = "When developing advanced location-based systems augmented with audio ambiances, it would be cost-effective to use a few representative samples from typical environments for describing a larger number of similar locations. The aim of this experiment was to study the human ability to discriminate audio ambiances recorded in similar locations of the same urban environment. A listening experiment consisting of material from three different environments and nine different locations was carried out with nineteen subjects to study the credibility of audio representations for certain environments which would diminish the need for collecting huge audio databases. The first goal was to study to what degree humans are able to recognize whether the recording has been made in an indicated location or in another similar location, when presented with the name of the place, location on a map, and the associated audio ambiance. The second goal was to study whether the ability to discriminate audio ambiances from different locations is affected by a visual cue, by presenting additional information in form of a photograph of the suggested location. The results indicate that audio ambiances from similar urban areas of the same city differ enough so that it is not acceptable to use a single recording as ambience to represent different yet similar locations. Including an image was found to increase the perceived credibility of all the audio samples in representing a certain location. The results suggest that developers of audio-augmented location-based systems should aim at using audio samples recorded on-site for each location in order to achieve a credible impression.",
    journal = "Personal and Ubiquitous Computing",
    title = "On the human ability to discriminate audio ambiances from similar locations of an urban environment",
    volume = "November 2012",
    year = "2012",
    url = "http://link.springer.com/article/10.1007/s00779-012-0625-z"
    }

  • F. Mazhar, T. Heittola, T. Virtanen, and J. Holm, "Automatic Scoring of Guitar Chords," in Proc. AES 45th International Conference, 2012.
    [BibTeX]
    @inproceedings{2012_AES,
    author = "Mazhar, Fawad and Heittola, Toni and Virtanen, Tuomas and Holm, Jukka",
    booktitle = "Proc. AES 45th International Conference",
    keywords = "transcription",
    title = "Automatic Scoring of Guitar Chords",
    year = "2012"
    }

  • T. Mäkinen, S. Kiranyaz, J. Raitoharju, and M. Gabbouj, "An evolutionary feature synthesis approach for content-based audio retrieval," EURASIP Journal on Audio, Speech, and Music Processing, iss. 23, 2012.
    [BibTeX] [Abstract]

    A vast amount of audio features have been proposed in the literature to characterize the content of audio signals. In order to overcome specific problems related to the existing features (such as lack of discriminative power), as well as to reduce the need for manual feature selection, in this article, we propose an evolutionary feature synthesis technique with a built-in feature selection scheme. The proposed synthesis process searches for optimal linear/nonlinear operators and feature weights from a pre-defined multi-dimensional search space to generate a highly discriminative set of new (artificial) features. The evolutionary search process is based on a stochastic optimization approach in which a multi-dimensional particle swarm optimization algorithm, along with fractional global best formation and heterogeneous particle behavior techniques, is applied. Unlike many existing feature generation approaches, the dimensionality of the synthesized feature vector is also searched and optimized within a set range in order to better meet the varying requirements set by many practical applications and classifiers. The new features generated by the proposed synthesis approach are compared with typical low-level audio features in several classification and retrieval tasks. The results demonstrate a clear improvement of up to 15–20% in average retrieval performance. Moreover, the proposed synthesis technique surpasses the synthesis performance of evolutionary artificial neural networks, exhibiting a considerable capability to accurately distinguish among different audio classes.

    @article{2012_JASM,
    author = {M{\"a}kinen, Toni and Kiranyaz, Serkan and Raitoharju, Jenni and Gabbouj, Moncef},
    abstract = "A vast amount of audio features have been proposed in the literature to characterize the content of audio signals. In order to overcome specific problems related to the existing features (such as lack of discriminative power), as well as to reduce the need for manual feature selection, in this article, we propose an evolutionary feature synthesis technique with a built-in feature selection scheme. The proposed synthesis process searches for optimal linear/nonlinear operators and feature weights from a pre-defined multi-dimensional search space to generate a highly discriminative set of new (artificial) features. The evolutionary search process is based on a stochastic optimization approach in which a multi-dimensional particle swarm optimization algorithm, along with fractional global best formation and heterogeneous particle behavior techniques, is applied. Unlike many existing feature generation approaches, the dimensionality of the synthesized feature vector is also searched and optimized within a set range in order to better meet the varying requirements set by many practical applications and classifiers. The new features generated by the proposed synthesis approach are compared with typical low-level audio features in several classification and retrieval tasks. The results demonstrate a clear improvement of up to 15--20\% in average retrieval performance. Moreover, the proposed synthesis technique surpasses the synthesis performance of evolutionary artificial neural networks, exhibiting a considerable capability to accurately distinguish among different audio classes.",
    journal = "EURASIP Journal on Audio, Speech, and Music Processing",
    keywords = "CASA;general audio classification",
    number = "23",
    title = "{A}n evolutionary feature synthesis approach for content-based audio retrieval",
    year = "2012"
    }

  • T. Mäkinen, S. Kiranyaz, J. Pulkkinen, and M. Gabbouj, "Evolutionary Feature Generation for Content-based Audio Classification and Retrieval," in 20th European Signal Processing Conference (EUSIPCO), 2012.
    [BibTeX]
    @inproceedings{2012_EUSIPCO_a,
    author = {M{\"a}kinen, Toni and Kiranyaz, Serkan and Pulkkinen, Jenni and Gabbouj, Moncef},
    booktitle = "20th European Signal Processing Conference (EUSIPCO)",
    keywords = "CASA;general audio classification",
    title = "{E}volutionary Feature Generation for Content-based Audio Classification and Retrieval",
    year = "2012"
    }

  • J. Nikunen, T. Virtanen, P. Pertilä, and M. Vilermo, "Permutation Alignment Of Frequency-Domain ICA By The Maximization Of Intra-Source Envelope Correlations," in European Signal Processing Conference (EUSIPCO), 2012.
    [BibTeX] [Abstract] [Download PDF]

    This paper presents a novel method for solving the permutation ambiguity of frequency-domain independent component analysis based on source signal envelope correlation maximization. The proposed method is developed for blind source separation with high sampling frequency and significant spatial aliasing. We propose a method that analyzes the source envelope using a rank-one singular value decomposition (SVD) applied to an initial source magnitude spectrogram obtained by a time difference of arrival (TDoA) based permutation alignment method. The permutation for frequencies with incoherent TDoA are corrected by maximizing the cross-correlation of the SVD analyzed source activation vector and each independent component magnitude envelope. We evaluate the separation quality using real high sampling frequency speech captures and the proposed method is found to improve the separation over the baseline algorithm.

    @inproceedings{2012_EUSIPCO_b,
    author = {Nikunen, Joonas and Virtanen, Tuomas and Pertil{\"a}, Pasi and Vilermo, Miikka},
    abstract = "This paper presents a novel method for solving the permutation ambiguity of frequency-domain independent component analysis based on source signal envelope correlation maximization. The proposed method is developed for blind source separation with high sampling frequency and significant spatial aliasing. We propose a method that analyzes the source envelope using a rank-one singular value decomposition (SVD) applied to an initial source magnitude spectrogram obtained by a time difference of arrival (TDoA) based permutation alignment method. The permutation for frequencies with incoherent TDoA are corrected by maximizing the cross-correlation of the SVD analyzed source activation vector and each independent component magnitude envelope. We evaluate the separation quality using real high sampling frequency speech captures and the proposed method is found to improve the separation over the baseline algorithm.",
    booktitle = "European Signal Processing Conference (EUSIPCO)",
    keywords = "Blind Source Separation; Independent Component Analysis; Permutation Alignment",
    organization = "EURASIP",
    title = "Permutation Alignment Of Frequency-Domain {ICA} By The Maximization Of Intra-Source Envelope Correlations",
    year = "2012",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/nikunen\_eusipco2012.pdf"
    }

  • J. Nikunen, T. Virtanen, and M. Vilermo, "Multichannel Audio Upmixing by Time-Frequency Filtering Using Non-Negative Tensor Factorization," Journal of the Audio Engineering Society, vol. 60, iss. 10, pp. 794-806, 2012.
    [BibTeX]
    @article{2012_AES_a,
    author = "Nikunen, Joonas and Virtanen, Tuomas and Vilermo, Miikka",
    journal = "Journal of the Audio Engineering Society",
    keywords = "NMF",
    month = "October",
    number = "10",
    pages = "794-806",
    title = "Multichannel Audio Upmixing by Time-Frequency Filtering Using Non-Negative Tensor Factorization",
    volume = "60",
    year = "2012"
    }

  • J. Nurminen, H. Silen, V. Popa, E. Helander, and M. Gabbouj, "Voice Conversion," in Speech Enhancement, Modeling and Recognition: Algorithms and Applications, 2012, pp. 1–27. doi:10.5772/37334
    [BibTeX]
    @inproceedings{2012_ICA,
    author = "Nurminen, Jani and Silen, Hanna and Popa, Victor and Helander, Elina and Gabbouj, Moncef",
    editor = "Ramakrishnan, S.",
    booktitle = "Speech Enhancement, Modeling and Recognition: Algorithms and Applications",
    doi = "10.5772/37334",
    isbn = "978-953-51-0291-5",
    pages = "1--27",
    publisher = "InTech",
    title = "Voice Conversion",
    year = "2012"
    }

  • P. Pertilä, M. Mieskolainen, and M. Hämäläinen, "Passive Self-Localization of Microphones Using Ambient Sounds," in Proc. 20th European Signal Processing Conference (EUSIPCO-2012), 2012.
    [BibTeX] [Abstract]

    This work presents a method to localize a set of microphones using recorded signals from surrounding continuous sounds such as speech. When a sound wave travels through a microphone array a time difference of arrival (TDOA) can be extracted between each microphone pair. A sound wave impinging towards a microphone pair from the end-fire direction results in the extreme TDOA value, leading to information about microphone distance. In indoors the reverberation may cause TDOA outliers, and a set of non-linear techniques for estimating the distance is proposed. The multidimensional scaling (MDS) is used to map the microphone pairwise distances into Cartesian microphone locations. The accuracy of the method and the effect of the number of sources is evaluated using speech signals in simulated environment. A self-localization RMS error of 7 cm was reached using ten asynchronous smartphones in a meeting room from a recorded conversation with a maximum of 3.7 m device separation.

    @inproceedings{2012_EUSIPCO-2012,
    author = {Pertil{\"a}, Pasi and Mieskolainen, Mikael and H{\"a}m{\"a}l{\"a}inen, Matti},
    abstract = "This work presents a method to localize a set of microphones using recorded signals from surrounding continuous sounds such as speech. When a sound wave travels through a microphone array a time difference of arrival (TDOA) can be extracted between each microphone pair. A sound wave impinging towards a microphone pair from the end-fire direction results in the extreme TDOA value, leading to information about microphone distance. In indoors the reverberation may cause TDOA outliers, and a set of non-linear techniques for estimating the distance is proposed. The multidimensional scaling (MDS) is used to map the microphone pairwise distances into Cartesian microphone locations. The accuracy of the method and the effect of the number of sources is evaluated using speech signals in simulated environment. A self-localization RMS error of 7 cm was reached using ten asynchronous smartphones in a meeting room from a recorded conversation with a maximum of 3.7 m device separation.",
    booktitle = "Proc. 20th European Signal Processing Conference (EUSIPCO-2012)",
    keywords = "Microphone arrays; Array Shape Calibration; Self-Localization; TDOA estimation; Multidimensional Scaling; self localization",
    organization = "EURASIP",
    title = "{P}assive {S}elf-{L}ocalization of {M}icrophones {U}sing {A}mbient {S}ounds",
    year = "2012"
    }

  • V. Popa, H. Silén, J. Nurminen, and M. Gabbouj, "Local Linear Transformation for Voice Conversion," in ICASSP, Kyoto, 2012.
    [BibTeX] [Abstract]

    Many popular approaches to spectral conversion involve linear transformations determined for particular acoustic classes and compute the converted result as linear combination between different local transformations in an attempt to ensure a continuous conversion. These methods often produce over-smoothed spectra and parameter tracks. The proposed method computes an individual linear transformation for every feature vector based on a small neighborhood in the acoustic space thus preserving local details. The method effectively reduces the over-smoothing by eliminating undesired contributions from acoustically remote regions. The method is evaluated in listening tests against the well-known Gaussian Mixture Model based conversion, representative for the class of methods involving linear transformations. Perceptual results indicate a clear preference for the proposed scheme.

    @inproceedings{2012_ICASSP_a,
    author = "Popa, Victor and Sil{\'e}n, Hanna and Nurminen, Jani and Gabbouj, Moncef",
    abstract = "Many popular approaches to spectral conversion involve linear transformations determined for particular acoustic classes and compute the converted result as linear combination between different local transformations in an attempt to ensure a continuous conversion. These methods often produce over-smoothed spectra and parameter tracks. The proposed method computes an individual linear transformation for every feature vector based on a small neighborhood in the acoustic space thus preserving local details. The method effectively reduces the over-smoothing by eliminating undesired contributions from acoustically remote regions. The method is evaluated in listening tests against the well-known Gaussian Mixture Model based conversion, representative for the class of methods involving linear transformations. Perceptual results indicate a clear preference for the proposed scheme.",
    address = "Kyoto",
    booktitle = "ICASSP",
    keywords = "voice conversion;local linear transformation",
    month = "March",
    organization = "IEEE",
    title = "{L}ocal {L}inear {T}ransformation for {V}oice {C}onversion",
    year = "2012"
    }

  • A. B. Rad and T. Virtanen, "Phase spectrum prediction of audio signals," in 5th International Symposium on Communications, Control and Signal Processing, 2012.
    [BibTeX] [Download PDF]
    @inproceedings{2012_ISCCSP,
    author = "Rad, Ali Bahrami and Virtanen, Tuomas",
    booktitle = "5th International Symposium on Communications, Control and Signal Processing",
    title = "Phase spectrum prediction of audio signals",
    year = "2012",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/phaseprediction.pdf"
    }

  • B. Raj, T. Virtanen, and R. Singh, "Foundations 3. The Problem of Robustness in Automatic Speech Recognition," in Techniques for Noise Robustness in Automatic Speech Recognition, T. Virtanen, R. Singh, and B. Raj, Eds., United Kingdom: John Wiley & Sons, 2012, pp. 31–52.
    [BibTeX]
    @inbook{2012_d,
    author = "Raj, Bhiksha and Virtanen, Tuomas and Singh, Rita",
    editor = "Virtanen, Tuomas and Singh, Rita and Raj, Bhiksha",
    title = "Foundations 3. The Problem of Robustness in Automatic Speech Recognition",
    note = "ei ut-numeroa 29.8.2013
    Contribution: organisation=sgn,FACT1=1", year = "2012", language = "English", isbn = "978-1-1199-7088-0", pages = "31--52", booktitle = "Techniques for Noise Robustness in Automatic Speech Recognition", publisher = "John Wiley \\& Sons", address = "United Kingdom" }

  • F. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple Instrument Mixtures Source Separation Evaluation Using Instrument-Dependent NMF Models," in The 10th International Conference on Latent Variable Analysis and Source Separation, 2012.
    [BibTeX] [Download PDF]
    @inproceedings{2012_LVA_ICA,
    author = "Rodriguez-Serrano, F. and Orti, Julio Jos{\'e} Carabias and Vera-Candeas, P. and Virtanen, Tuomas and Ruiz-Reyes, N.",
    booktitle = "The 10th International Conference on Latent Variable Analysis and Source Separation",
    keywords = "instruments",
    title = "Multiple Instrument Mixtures Source Separation Evaluation Using Instrument-Dependent {NMF} Models",
    year = "2012",
    url = "http://www.springerlink.com/content/qm858t190w17533p/fulltext.pdf"
    }

  • R. Saeidi, A. Hurmalainen, T. Virtanen, and D. A. van Leeuwen, "Exemplar-based Sparse Representation and Sparse Discrimination for Noise Robust Speaker Identification," in Proc. Odyssey 2012: The Speaker and Language Recognition Workshop, 2012.
    [BibTeX] [Download PDF]
    @inproceedings{2012_Odyssey,
    author = "Saeidi, Rahim and Hurmalainen, Antti and Virtanen, Tuomas and van Leeuwen, D.A.",
    booktitle = "Proc. Odyssey 2012: The Speaker and Language Recognition Workshop",
    keywords = "speaker recognition",
    title = "Exemplar-based Sparse Representation and Sparse Discrimination for Noise Robust Speaker Identification",
    year = "2012",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Exemplar\_Odyssey12\_Final.pdf"
    }

  • H. Silen, E. Helander, J. Nurminen, and M. Gabbouj, "Ways to Implement Global Variance in Statistical Speech Synthesis," in Proceedings of 13th Annual Conference of the International Speech Communication Association, Interspeech 2012, September 9 - 13, Portland, Oregon, USA, 2012, pp. 1–4.
    [BibTeX]
    @inproceedings{2012_InterSpecch,
    author = "Silen, Hanna and Helander, Elina and Nurminen, Jani and Gabbouj, Moncef",
    booktitle = "Proceedings of 13th Annual Conference of the International Speech Communication Association, Interspeech 2012, September 9 - 13, Portland, Oregon, USA",
    pages = "1--4",
    publisher = "International Speech Communication Association ISCA",
    series = "Interspeech",
    title = "{W}ays to {I}mplement {G}lobal {V}ariance in {S}tatistical {S}peech {S}ynthesis",
    year = "2012"
    }

  • T. Virtanen, R. Singh, and B. Raj, Techniques for Noise Robustness in Automatic Speech Recognition, John Wiley & Sons, 2012.
    [BibTeX] [Abstract] [Download PDF]

    Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the commonplace environments where the systems are used are noisy, for example users calling up a voice search system from a busy cafeteria or a street. This can result in degraded speech recordings and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state-of-the-art in techniques to deal with such problems becomes critical to system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of the state-of-the-art in techniques used to improve the robustness of speech recognition systems to these degrading external influences. Key features: 1) Reviews all the main noise robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing data techniques and recognition of reverberant speech. 2) Acts as a timely exposition of the topic in light of more widespread use in the future of ASR technology in challenging environments. 3) Addresses robustness issues and signal degradation which are both key requirements for practitioners of ASR. 4) Includes contributions from top ASR researchers from leading research units in the field

    @book{2012,
    author = "Virtanen, Tuomas and Singh, Rita and Raj, Bhiksha",
    abstract = "Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the commonplace environments where the systems are used are noisy, for example users calling up a voice search system from a busy cafeteria or a street. This can result in degraded speech recordings and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state-of-the-art in techniques to deal with such problems becomes critical to system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of the state-of-the-art in techniques used to improve the robustness of speech recognition systems to these degrading external influences. Key features: 1) Reviews all the main noise robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing data techniques and recognition of reverberant speech. 2) Acts as a timely exposition of the topic in light of more widespread use in the future of ASR technology in challenging environments. 3) Addresses robustness issues and signal degradation which are both key requirements for practitioners of ASR. 4) Includes contributions from top ASR researchers from leading research units in the field",
    keywords = "speech recognition;noise robustness",
    publisher = "John Wiley {\\&} Sons",
    title = "{T}echniques for {N}oise {R}obustness in {A}utomatic {S}peech {R}ecognition",
    url = "http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1119970881.html",
    year = "2012"
    }

  • T. Virtanen, R. Singh, and B. Raj, "Foundations 2. The Basics of Automatic Speech Recognition," in Techniques for Noise Robustness in Automatic Speech Recognition, T. Virtanen, R. Singh, and B. Raj, Eds., United Kingdom: John Wiley & Sons, 2012, pp. 9–30.
    [BibTeX]
    @inbook{2012_a,
    author = "Virtanen, Tuomas and Singh, Rita and Raj, Bhiksha",
    editor = "Virtanen, Tuomas and Singh, Rita and Raj, Bhiksha",
    title = "Foundations 2. The Basics of Automatic Speech Recognition",
    note = "ei ut-numeroa 5.10.2013
    Contribution: organisation=sgn,FACT1=1", year = "2012", language = "English", isbn = "978-1-1199-7088-0", pages = "9--30", booktitle = "Techniques for Noise Robustness in Automatic Speech Recognition", publisher = "John Wiley \\& Sons", address = "United Kingdom" }

  • F. Weninger, M. Wöllmer, J. Geiger, B. Schuller, J. Gemmeke, A. Hurmalainen, T. Virtanen, and G. Rigoll, "Non-Negative Matrix Factorization for Highly Noise-Robust ASR: to Enhance or to Recognize?," in Proc. 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a multi-stream speech recognition system that combines information from three complementary analysis methods in order to improve automatic speech recognition in highly noisy and reverberant environments, as featured in the 2011 PASCAL CHiME Challenge. We integrate word predictions by a bidirectional Long Short-Term Memory recurrent neural network and non-negative sparse classification (NSC) into a multi-stream Hidden Markov Model using convolutive non-negative matrix factorization (NMF) for speech enhancement. Our results suggest that NMF-based enhancement and NSC are complementary despite their overlap in methodology, reaching up to 91.9% average keyword accuracy on the Challenge test set at signal-to-noise ratios from -6 to 9 dB, the best result reported so far on these data.

    @inproceedings{2012_ICASSP_b,
    author = {Weninger, Felix and W{\"o}llmer, Martin and Geiger, J{\"u}rgen and Schuller, Bj{\"o}rn and Gemmeke, Jort and Hurmalainen, Antti and Virtanen, Tuomas and Rigoll, Gerhard},
    abstract = "This paper proposes a multi-stream speech recognition system that combines information from three complementary analysis methods in order to improve automatic speech recognition in highly noisy and reverberant environments, as featured in the 2011 PASCAL CHiME Challenge. We integrate word predictions by a bidirectional Long Short-Term Memory recurrent neural network and non-negative sparse classification (NSC) into a multi-stream Hidden Markov Model using convolutive non-negative matrix factorization (NMF) for speech enhancement. Our results suggest that NMF-based enhancement and NSC are complementary despite their overlap in methodology, reaching up to 91.9\% average keyword accuracy on the Challenge test set at signal-to-noise ratios from -6 to 9 dB-the best result reported so far on these data.",
    booktitle = "In proc. 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
    keywords = "NMF;speech recognition",
    title = "Non-Negative Matrix Factorization for Highly Noise-Robust {ASR}: to Enhance or to Recognize?",
    year = "2012",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/Weninger\_ICASSP2012.pdf"
    }

2011

  • J. Gemmeke, A. Hurmalainen, T. Virtanen, and S. Yang, "Toward a Practical Implementation of Exemplar-Based Noise Robust ASR," in European Signal Processing Conference (EUSIPCO), Barcelona, Spain, 2011, pp. 1490-1494.
    [BibTeX] [Abstract]

    In previous work it was shown that, at least in principle, an exemplar-based approach to noise robust ASR is possible. The method, sparse representation based classification (SC), works by modelling noisy speech as a sparse linear combination of speech and noise exemplars. After recovering the sparsest possible linear combination of labelled exemplars, noise robust posterior likelihoods are estimated by using the weights of the exemplars as evidence of the state labels underlying exemplars. Although promising recognition accuracies at low SNRs were obtained, the method was impractical due to its slow execution speed. Moreover, the performance was not as good on noisy speech corrupted by noise types not represented by the noise exemplars. The importance of sparsity was poorly understood, and the influence of the size of the exemplar-dictionary was unclear. In this paper we investigate all these issues, and we show for example that speedups of a factor 28 can be obtained by using modern GPUs, bringing its execution speed within range to practical applications.

    @inproceedings{2011_EUSIPCO,
    author = "Gemmeke, Jort and Hurmalainen, Antti and Virtanen, Tuomas and Yang, Sun",
    abstract = "In previous work it was shown that, at least in principle, an exemplar-based approach to noise robust ASR is possible. The method, sparse representation based classification (SC), works by modelling noisy speech as a sparse linear combination of speech and noise exemplars. After recovering the sparsest possible linear combination of labelled exemplars, noise robust posterior likelihoods are estimated by using the weights of the exemplars as evidence of the state labels underlying exemplars. Although promising recognition accuracies at low SNRs were obtained, the method was impractical due to its slow execution speed. Moreover, the performance was not as good on noisy speech corrupted by noise types not represented by the noise exemplars. The importance of sparsity was poorly understood, and the influence of the size of the exemplar-dictionary was unclear. In this paper we investigate all these issues, and we show for example that speedups of a factor 28 can be obtained by using modern GPUs, bringing its execution speed within range to practical applications.",
    address = "Barcelona, Spain",
    booktitle = "European Signal Processing Conference (EUSIPCO)",
    keywords = "speech recognition",
    month = "August",
    organization = "EURASIP",
    pages = "1490-1494",
    title = "Toward a Practical Implementation of Exemplar-Based Noise Robust {ASR}",
    year = "2011"
    }

  • J. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-based Sparse Representations for Noise Robust Automatic Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, iss. 7, pp. 2067-2080, 2011.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes to use exemplar-based sparse representations for noise robust automatic speech recognition. First, we describe how speech can be modelled as a linear combination of a small number of exemplars from a large speech exemplar dictionary. The exemplars are time-frequency patches of real speech, each spanning multiple time frames. We then propose to model speech corrupted by additive noise as a linear combination of noise and speech exemplars, and we derive an algorithm for recovering this sparse linear combination of exemplars from the observed noisy speech. We describe how the framework can be used for doing hybrid exemplar-based/HMM recognition by using the exemplar-activations together with the phonetic information associated with the exemplars. As an alternative to hybrid recognition, the framework also allows us to take a source separation approach which enables exemplar-based feature enhancement as well as missing data mask estimation. We evaluate the performance of these exemplar-based methods in connected digit recognition on the AURORA-2 database. Our results show that the hybrid system performed substantially better than source separation or missing data mask estimation at lower SNRs, achieving up to 57.1% accuracy at SNR= -5 dB. Although not as effective as two baseline recognisers at higher SNRs, the novel approach offers a promising direction of future research on exemplar-based ASR.

    @article{2011_TASLP,
    author = "Gemmeke, Jort and Virtanen, Tuomas and Hurmalainen, Antti",
    abstract = "This paper proposes to use exemplar-based sparse representations for noise robust automatic speech recognition. First, we describe how speech can be modelled as a linear combination of a small number of exemplars from a large speech exemplar dictionary. The exemplars are time-frequency patches of real speech, each spanning multiple time frames. We then propose to model speech corrupted by additive noise as a linear combination of noise and speech exemplars, and we derive an algorithm for recovering this sparse linear combination of exemplars from the observed noisy speech. We describe how the framework can be used for doing hybrid exemplar-based/HMM recognition by using the exemplar-activations together with the phonetic information associated with the exemplars. As an alternative to hybrid recognition, the framework also allows us to take a source separation approach which enables exemplar-based feature enhancement as well as missing data mask estimation. We evaluate the performance of these exemplar-based methods in connected digit recognition on the AURORA-2 database. Our results show that the hybrid system performed substantially better than source separation or missing data mask estimation at lower SNRs, achieving up to 57.1\% accuracy at SNR= -5 dB. Although not as effective as two baseline recognisers at higher SNRs, the novel approach offers a promising direction of future research on exemplar-based ASR.",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    keywords = "speech recognition",
    month = "September",
    number = "7",
    pages = "2067-2080",
    title = "Exemplar-based Sparse Representations for Noise Robust Automatic Speech Recognition",
    volume = "19",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/gemmeke\_taslp11.pdf"
    }

  • J. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-Based Speech Enhancement and its Application to Noise-Robust Automatic Speech Recognition," in Proc. International Workshop on Machine Listening in Multisource Environments (CHiME), Florence, Italy, 2011, pp. 53-57.
    [BibTeX] [Abstract] [Download PDF]

    In this work an exemplar-based technique for speech enhancement of noisy speech is proposed. The technique works by finding a sparse representation of the noisy speech in a dictionary containing both speech and noise exemplars, and uses the activated dictionary atoms to create a time-varying filter to enhance the noisy speech. The speech enhancement algorithm is evaluated using measured signal to noise ratio (SNR) improvements as well as by using automatic speech recognition. Experiments on the PASCAL CHiME challenge corpus, which contains speech corrupted by both reverberation and authentic living room noise at varying SNRs ranging from 9 to -6 dB, confirm the validity of the proposed technique. Examples of enhanced signals are available at http://www.cs.tut.fi/~tuomasv/.

    @inproceedings{2011_CHiME_a,
    author = "Gemmeke, Jort and Virtanen, Tuomas and Hurmalainen, Antti",
    abstract = "In this work an exemplar-based technique for speech enhancement of noisy speech is proposed. The technique works by finding a sparse representation of the noisy speech in a dictionary containing both speech and noise exemplars, and uses the activated dictionary atoms to create a time-varying filter to enhance the noisy speech. The speech enhancement algorithm is evaluated using measured signal to noise ratio (SNR) improvements as well as by using automatic speech recognition. Experiments on the PASCAL CHiME challenge corpus, which contains speech corrupted by both reverberation and authentic living room noise at varying SNRs ranging from 9 to -6 dB, confirm the validity of the proposed technique. Examples of enhanced signals are available at http://www.cs.tut.fi/\textasciitilde tuomasv/.",
    address = "Florence, Italy",
    booktitle = "Proc. International Workshop on Machine Listening in Multisource Environments (CHiME)",
    keywords = "NMF;speech recognition",
    month = "September",
    pages = "53-57",
    title = "Exemplar-Based Speech Enhancement and its Application to Noise-Robust Automatic Speech Recognition",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/chime-enhancement/pP4\_gemmeke.pdf"
    }

  • T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, "Sound event detection and context recognition," in Akustiikkapäivät 2011, 2011, pp. 51–56.
    [BibTeX]
    @inproceedings{2011,
    author = "Heittola, Toni and Mesaros, Annamaria and Virtanen, Tuomas and Eronen, Antti",
    booktitle = {Akustiikkap{\"a}iv{\"a}t 2011},
    pages = "51--56",
    title = "Sound event detection and context recognition",
    year = "2011"
    }

  • T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, "Sound Event Detection in Multisource Environments Using Source Separation," in CHiME 2011 - Workshop on Machine Listening in Multisource Environments, 2011, pp. 36-40.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a sound event detection system for natural multisource environments, using a sound source separation front-end. The recognizer aims at detecting sound events from various everyday contexts. The audio is preprocessed using non-negative matrix factorization and separated into four individual signals. Each sound event class is represented by a Hidden Markov Model trained using mel frequency cepstral coefficients extracted from the audio. Each separated signal is used individually for feature extraction and then segmentation and classification of sound events using the Viterbi algorithm. The separation allows detection of a maximum of four overlapping events. The proposed system shows a significant increase in event detection accuracy compared to a system able to output a single sequence of events.

    @inproceedings{2011_CHiME_b,
    author = "Heittola, Toni and Mesaros, Annamaria and Virtanen, Tuomas and Eronen, Antti",
    abstract = "This paper proposes a sound event detection system for natural multisource environments, using a sound source separation front-end. The recognizer aims at detecting sound events from various everyday contexts. The audio is preprocessed using non-negative matrix factorization and separated into four individual signals. Each sound event class is represented by a Hidden Markov Model trained using mel frequency cepstral coefficients extracted from the audio. Each separated signal is used individually for feature extraction and then segmentation and classification of sound events using the Viterbi algorithm. The separation allows detection of a maximum of four overlapping events. The proposed system shows a significant increase in event detection accuracy compared to a system able to output a single sequence of events.",
    booktitle = "CHiME 2011 - Workshop on Machine Listening in Multisource Environments",
    keywords = "sound event detection",
    pages = "36-40",
    title = "Sound Event Detection in Multisource Environments Using Source Separation",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/chime2011\_heittola.pdf"
    }

  • A. Hurmalainen, T. Virtanen, J. Gemmeke, and K. Mahkonen, "Esimerkkipohjainen meluisan puheen automaattinen tunnistus" [Exemplar-based automatic recognition of noisy speech], in Akustiikkapäivät 2011, Tampere, 11.-12.5.2011, Akustinen Seura ry, 2011, pp. 1–5.
    [BibTeX]
    @inproceedings{2011_a,
    author = "Hurmalainen, Antti and Virtanen, Tuomas and Gemmeke, Jort and Mahkonen, Katariina",
    booktitle = {Akustiikkap{\"a}iv{\"a}t 2011, Tampere, 11.-12.5.2011, Akustinen Seura ry},
    pages = "1--5",
    publisher = "Akustinen seura",
    series = {Akustiikkap{\"a}iv{\"a}t},
    title = "{E}simerkkipohjainen meluisan puheen automaattinen tunnistus",
    year = "2011"
    }

  • A. Hurmalainen, J. Gemmeke, and T. Virtanen, "Non-negative matrix deconvolution in noise robust speech recognition," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 4588-4591. doi:10.1109/ICASSP.2011.5947376
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2011_ICASSP,
    author = "Hurmalainen, Antti and Gemmeke, Jort and Virtanen, Tuomas",
    booktitle = "2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    title = "Non-negative matrix deconvolution in noise robust speech recognition",
    year = "2011",
    volume = "",
    number = "",
    pages = "4588-4591",
    keywords = "Speech;Noise;Speech recognition;Dictionaries;Deconvolution;Noise measurement;Hidden Markov models;Automatic speech recognition;noise robustness;deconvolution;sparsity;exemplar-based",
    doi = "10.1109/ICASSP.2011.5947376",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/NMD\_icassp2011.pdf"
    }

  • A. Hurmalainen, K. Mahkonen, J. Gemmeke, and T. Virtanen, "Exemplar-based Recognition of Speech in Highly Variable Noise," in Proc. International Workshop on Machine Listening in Multisource Environments (CHiME), Florence, Italy, 2011, pp. 1-5.
    [BibTeX] [Abstract] [Download PDF]

    Robustness against varying background noise is a crucial requirement for the use of automatic speech recognition in everyday situations. In previous work, we proposed an exemplar-based recognition system for tackling the issue at low SNRs. In this work, we compare several exemplar-based factorisation and decoding algorithms in pursuit of higher noise robustness. The algorithms are evaluated using the PASCAL CHiME challenge corpus, which contains multiple speakers and authentic living room noise at six SNRs ranging from 9 to -6 dB. The results show that the proposed exemplar-based techniques offer a substantial improvement in the noise robustness of speech recognition.

    @inproceedings{2011_CHiME,
    author = "Hurmalainen, Antti and Mahkonen, Katariina and Gemmeke, Jort and Virtanen, Tuomas",
    abstract = "Robustness against varying background noise is a crucial requirement for the use of automatic speech recognition in everyday situations. In previous work, we proposed an exemplar-based recognition system for tackling the issue at low SNRs. In this work, we compare several exemplar-based factorisation and decoding algorithms in pursuit of higher noise robustness. The algorithms are evaluated using the PASCAL CHiME challenge corpus, which contains multiple speakers and authentic living room noise at six SNRs ranging from 9 to -6 dB. The results show that the proposed exemplar-based techniques offer a substantial improvement in the noise robustness of speech recognition.",
    address = "Florence, Italy",
    booktitle = "Proc. International Workshop on Machine Listening in Multisource Environments (CHiME)",
    keywords = "automatic speech recognition;exemplar-based;noise robustness;sparse representation",
    month = "September",
    pages = "1-5",
    title = "Exemplar-based Recognition of Speech in Highly Variable Noise",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/pS11\_hurmalainen.pdf"
    }

  • H. Kallasjoki, U. Remes, J. Gemmeke, T. Virtanen, and K. Palomäki, "Uncertainty measures for improving exemplar-based source separation," in 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011.
    [BibTeX] [Download PDF]
    @inproceedings{2011_InterSpecch_c,
    author = {Kallasjoki, Heikki and Remes, Ulpu and Gemmeke, Jort and Virtanen, Tuomas and Palom{\"a}ki, Kalle},
    address = "Florence, Italy",
    booktitle = "12th Annual Conference of the International Speech Communication Association",
    keywords = "noise robustness; speech recognition",
    title = "Uncertainty measures for improving exemplar-based source separation",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/kallasjoki\_interspeech2011.pdf"
    }

  • K. Mahkonen, A. Hurmalainen, T. Virtanen, and J. Gemmeke, "Mapping Sparse Representation to State Likelihoods in Noise-Robust Automatic Speech Recognition," in Interspeech, 2011, pp. 465–468.
    [BibTeX] [Download PDF]
    @inproceedings{2011_InterSpecch,
    author = "Mahkonen, Katariina and Hurmalainen, Antti and Virtanen, Tuomas and Gemmeke, Jort",
    title = "Mapping Sparse Representation to State Likelihoods in Noise-Robust Automatic Speech Recognition",
    booktitle = "Interspeech",
    pages = "465--468",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/mahkonen\_interspeech2011.pdf"
    }

  • A. Mesaros, T. Heittola, and A. Klapuri, "Latent Semantic Analysis in Sound Event Detection," in European Signal Processing Conference (EUSIPCO-2011), Barcelona, Spain, 2011, pp. 1307-1311.
    [BibTeX] [Abstract]

    This paper presents the use of probabilistic latent semantic analysis (PLSA) for modeling co-occurrence of overlapping sound events in audio recordings from everyday audio environments such as office, street or shop. Co-occurrence of events is represented as the degree of their overlapping in a fixed length segment of polyphonic audio. In the training stage, PLSA is used to learn the relationships between individual events. In detection, the PLSA model continuously adjusts the probabilities of events according to the history of events detected so far. The event probabilities provided by the model are integrated into a sound event detection system that outputs a monophonic sequence of events. The model offers a very good representation of the data, having low perplexity on test recordings. Using PLSA for estimating prior probabilities of events provides an increase of event detection accuracy to 35%, compared to 30% for using uniform priors for the events. There are different levels of performance increase in different audio contexts, with few contexts showing significant improvement.

    @inproceedings{2011_EUSIPCO-2011,
    author = "Mesaros, Annamaria and Heittola, Toni and Klapuri, Anssi",
    abstract = "This paper presents the use of probabilistic latent semantic analysis (PLSA) for modeling co-occurrence of overlapping sound events in audio recordings from everyday audio environments such as office, street or shop. Co-occurrence of events is represented as the degree of their overlapping in a fixed length segment of polyphonic audio. In the training stage, PLSA is used to learn the relationships between individual events. In detection, the PLSA model continuously adjusts the probabilities of events according to the history of events detected so far. The event probabilities provided by the model are integrated into a sound event detection system that outputs a monophonic sequence of events. The model offers a very good representation of the data, having low perplexity on test recordings. Using PLSA for estimating prior probabilities of events provides an increase of event detection accuracy to 35\%, compared to 30\% for using uniform priors for the events. There are different levels of performance increase in different audio contexts, with few contexts showing significant improvement.",
    address = "Barcelona, Spain",
    booktitle = "European Signal Processing Conference (EUSIPCO-2011)",
    keywords = "sound event detection",
    pages = "1307-1311",
    title = "Latent Semantic Analysis in Sound Event Detection",
    year = "2011"
    }

  • T. Mäkinen, S. Kiranyaz, and M. Gabbouj, "Content-based audio classification using collective network of binary classifiers," in 2011 IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS), 2011, pp. 116-123. doi:10.1109/EAIS.2011.5945911
    [BibTeX]
    @INPROCEEDINGS{2011_EAIS,
    author = "Mökinen, Toni and Kiranyaz, Serkan and Gabbouj, Moncef",
    booktitle = "2011 IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS)",
    title = "Content-based audio classification using collective network of binary classifiers",
    year = "2011",
    volume = "",
    number = "",
    pages = "116-123",
    keywords = "Training;Feature extraction;Neurons;Hidden Markov models;Databases;Accuracy;Computer architecture;audio content - based classification;evolutionary neural networks;particle swarm optimization;multilayer perceptron",
    doi = "10.1109/EAIS.2011.5945911"
    }

  • J. Nikunen, T. Virtanen, and M. Vilermo, "Multichannel audio upmixing based on non-negative tensor factorization representation," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2011.
    [BibTeX] [Download PDF]
    @inproceedings{2011_WASPAA,
    author = "Nikunen, Joonas and Virtanen, Tuomas and Vilermo, Miikka",
    booktitle = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    keywords = "NMF; object-based coding",
    title = "Multichannel audio upmixing based on non-negative tensor factorization representation",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/nikunen\_waspaa.pdf"
    }

  • J. J. Carabias-Orti, T. Virtanen, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada, "Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization," IEEE Journal of Selected Topics in Signal Processing, vol. 5, iss. 6, pp. 1144-1158, 2011. doi:10.1109/JSTSP.2011.2159700
    [BibTeX] [Download PDF]
    @ARTICLE{2011_JSTSP,
    author = "Orti, Julio Jos{\'e} Carabias and Virtanen, Tuomas and Vera-Candeas, P. and Ruiz-Reyes, N. and Canadas-Quesada, F.J.",
    journal = "IEEE Journal of Selected Topics in Signal Processing",
    title = "Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization",
    year = "2011",
    volume = "5",
    number = "6",
    pages = "1144-1158",
    keywords = "Instruments;Harmonic analysis;Adaptation model;Music;Computational modeling;Reliability;Training;Automatic music transcription;excitation-filter model;excitation modeling;non-negative matrix factorization (NMF);source-filter model;spectral analysis",
    doi = "10.1109/JSTSP.2011.2159700",
    url = "http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\&arnumber=5887381"
    }

  • P. Pertilä, M. Mieskolainen, and M. S. Hämäläinen, "Closed-Form Self-Localization of Asynchronous Microphone Arrays," in Proc. The Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA'11), 2011.
    [BibTeX] [Abstract]

    The utilization of distributed microphone arrays in many speech processing applications such as beamforming and speaker localization rely on the precise knowledge of microphone locations. Several self-localization approaches have been presented in the literature but still a simple, accurate, and robust method for asynchronous devices is lacking. This work presents an analytical solution for estimating the positions and rotations of asynchronous loudspeaker equipped microphone arrays or devices. The method is based on emitting and receiving calibration signals from each device, and extracting the time of arrival (TOA) values. Utilizing the knowledge of array geometry in the TOA estimation is proposed to improve accuracy of translation. Results with measurements using four devices on a table surface demonstrates a mean translation error of 11 mm with standard deviation of 6 mm and mean z-axis rotation error of 0.11 (rad) with a standard deviation of 0.14 (rad) in contrast to computer vision annotations with 200 rotations and translation estimates.

    @inproceedings{2011_HSCMA'11,
    author = {Pertil{\"a}, Pasi and Mieskolainen, Mikael and H{\"a}m{\"a}l{\"a}inen, Matti S.},
    abstract = "The utilization of distributed microphone arrays in many speech processing applications such as beamforming and speaker localization rely on the precise knowledge of microphone locations. Several self- localization approaches have been presented in the literature but still a simple, accurate, and robust method for asynchronous devices is lacking. This work presents an analytical solution for estimating the positions and rotations of asynchronous loudspeaker equipped microphone arrays or devices. The method is based on emitting and receiving calibration signals from each device, and extracting the time of arrival (TOA) values. Utilizing the knowledge of array geometry in the TOA estimation is proposed to improve accuracy of translation. Results with measurements using four devices on a table surface demonstrates a mean translation error of 11 mm with standard deviation of 6 mm and mean z-axis rotation error of 0.11 (rad) with a standard deviation of 0.14 (rad) in contrast to computer vision annotations with 200 rotations and translation estimates.",
    booktitle = "In Proc. The Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA'11)",
    keywords = "self-localization; self localization; microphone arrays; calibra- tion; localization; computer vision",
    title = "Closed-Form Self-Localization of Asynchronous Microphone Arrays",
    year = "2011"
    }

  • V. Popa, J. Nurminen, and M. Gabbouj, "A Study of Bilinear Models in Voice Conversion," Journal of Signal and Information Processing, vol. 2, iss. 2, p. 125–139, 2011. doi:10.4236/jsip.2011.22017
    [BibTeX]
    @article{2011_JSIP,
    author = "Popa, Victor and Nurminen, Jani and Gabbouj, Moncef",
    doi = "10.4236/jsip.2011.22017",
    issn = "2159-4465",
    journal = "Journal of Signal and Information Processing",
    number = "2",
    pages = "125--139",
    publisher = "Scientific Research Publishing",
    title = "{A} {S}tudy of {B}ilinear {M}odels in {V}oice {C}onversion",
    volume = "2",
    year = "2011"
    }

  • B. Raj, R. Singh, and T. Virtanen, "Phoneme-dependent NMF for speech enhancement in monaural mixtures," in Proc. 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011.
    [BibTeX] [Download PDF]
    @inproceedings{2011_InterSpecch_a,
    author = "Raj, Bhiksha and Singh, Rita and Virtanen, Tuomas",
    address = "Florence, Italy",
    booktitle = "In proc. 12th Annual Conference of the International Speech Communication Association",
    keywords = "NMF",
    title = "Phoneme-dependent {NMF} for speech enhancement in monaural mixtures",
    year = "2011",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/phoneme\_nmf.pdf"
    }

  • H. Silén, E. Helander, and M. Gabbouj, "Prediction of voice aperiodicity based on spectral representations in HMM speech synthesis," in Interspeech, Florence, Italy, 2011, pp. 105-108.
    [BibTeX] [Abstract]

    In hidden Markov model-based speech synthesis, speech is typically parameterized using source-filter decomposition. A widely used analysis/synthesis framework, STRAIGHT, decomposes the speech waveform into a framewise spectral envelope and a mixed mode excitation signal. Inclusion of an aperiodicity measure in the model enables synthesis also for signals that are not purely voiced or unvoiced. In the traditional approach employing hidden Markov modeling and decision tree-based clustering, the connection between speech spectrum and aperiodicities is not taken into account. In this paper, we take advantage of this dependency and predict voice aperiodicities afterwards based on synthetic spectral representations. The evaluations carried out for English data confirm that the proposed approach is able to provide prediction accuracy that is comparable to the traditional approach.

    @inproceedings{2011_InterSpecch_b,
    author = "Sil{\'e}n, Hanna and Helander, Elina and Gabbouj, Moncef",
    abstract = "In hidden Markov model-based speech synthesis, speech is typically parameterized using source-filter decomposition. A widely used analysis/synthesis framework, STRAIGHT, decomposes the speech waveform into a framewise spectral envelope and a mixed mode excitation signal. Inclusion of an aperiodicity measure in the model enables synthesis also for signals that are not purely voiced or unvoiced. In the traditional approach employing hidden Markov modeling and decision tree-based clustering, the connection between speech spectrum and aperiodicities is not taken into account. In this paper, we take advantage of this dependency and predict voice aperiodicities afterwards based on synthetic spectral representations. The evaluations carried out for English data confirm that the proposed approach is able to provide prediction accuracy that is comparable to the traditional approach.",
    address = "Florence, Italy",
    booktitle = "Interspeech",
    keywords = "aperiodicity prediction;hidden Markov model;speech synthesis",
    month = "August",
    pages = "105 - 108",
    title = "{P}rediction of voice aperiodicity based on spectral representations in {HMM} speech synthesis",
    year = "2011"
    }

2010

  • A. J. Eronen and A. P. Klapuri, "Music Tempo Estimation With k-NN Regression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, iss. 1, pp. 50-57, 2010. doi:10.1109/TASL.2009.2023165
    [BibTeX]
    @ARTICLE{2010_TASLP,
    author = "Eronen, Antti J. and Klapuri, Anssi P.",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    title = "Music Tempo Estimation With k-NN Regression",
    year = "2010",
    volume = "18",
    number = "1",
    pages = "50-57",
    keywords = "Performance analysis;Signal processing;Feature extraction;Acoustic signal detection;Harmonic analysis;Visual effects;Software libraries;Mood;Signal analysis;Acoustic measurements;Chroma features;$k$-nearest neighbor ($k$-NN) regression;music tempo estimation",
    doi = "10.1109/TASL.2009.2023165"
    }

  • J. Gemmeke and T. Virtanen, "Artificial and online acquired noise dictionaries for noise robust ASR," in Proceedings of the 11th Annual Conference of the International Speech Communication Association (Interspeech 2010), 2010, p. 2082–2085.
    [BibTeX] [Download PDF]
    @inproceedings{2010_Interspeech 2010,
    author = "Gemmeke, Jort and Virtanen, Tuomas",
    booktitle = "Proceedings of the 11th Annual Conference of the International Speech Communication Association (Interspeech 2010)",
    pages = "2082--2085",
    title = "Artificial and online acquired noise dictionaries for noise robust {ASR}",
    year = "2010",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/noise\_dictionaries.pdf"
    }

  • J. F. Gemmeke and T. Virtanen, "Noise robust exemplar-based connected digit recognition," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4546-4549. doi:10.1109/ICASSP.2010.5495580
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2010_ICASSP_b,
    author = "Gemmeke, Jort F. and Virtanen, Tuomas",
    booktitle = "2010 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Noise robust exemplar-based connected digit recognition",
    year = "2010",
    volume = "",
    number = "",
    pages = "4546-4549",
    keywords = "Noise robustness;Speech enhancement;Speech recognition;Automatic speech recognition;Hidden Markov models;Decoding;Signal to noise ratio;Background noise;Speech processing;Signal processing;Speech recognition;exemplar-based;noise robustness;non-negative matrix factorization;sparsity",
    doi = "10.1109/ICASSP.2010.5495580",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ICASSP2010\_final.pdf"
    }

  • T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Audio context recognition using audio event histograms," in Proc. European Signal Processing Conference, 2010.
    [BibTeX] [Abstract] [Download PDF]

    This paper presents a method for audio context recognition, meaning classification between everyday environments. The method is based on representing each audio context using a histogram of audio events which are detected using a supervised classifier. In the training stage, each context is modeled with a histogram estimated from annotated training data. In the testing stage, individual sound events are detected in the unknown recording and a histogram of the sound event occurrences is built. Context recognition is performed by computing the cosine distance between this histogram and event histograms of each context from the training database. Term frequency–inverse document frequency weighting is studied for controlling the importance of different events in the histogram distance calculation. An average classification accuracy of 89% is obtained in the recognition between ten everyday contexts. Combining the event based context recognition system with more conventional audio based recognition increases the recognition rate to 92%.

    @inproceedings{2010_EUSIPCO,
    author = "Heittola, Toni and Mesaros, Annamaria and Eronen, Antti and Virtanen, Tuomas",
    abstract = "This paper presents a method for audio context recognition, meaning classification between everyday environments. The method is based on representing each audio context using a histogram of audio events which are detected using a supervised classifier. In the training stage, each context is modeled with a histogram estimated from annotated training data. In the testing stage, individual sound events are detected in the unknown recording and a histogram of the sound event occurrences is built. Context recognition is performed by computing the cosine distance between this histogram and event histograms of each context from the training database. Term frequency--inverse document frequency weighting is studied for controlling the importance of different events in the histogram distance calculation. An average classification accuracy of 89\% is obtained in the recognition between ten everyday contexts. Combining the event based context recognition system with more conventional audio based recognition increases the recognition rate to 92\%.",
    booktitle = "In Proc. European Signal Processing Conference",
    keywords = "CASA;Context recognition",
    title = "Audio context recognition using audio event histograms",
    year = "2010",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/eusipco2010\_heittola.pdf"
    }

  • E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, "Voice Conversion Using Partial Least Squares Regression," IEEE Transactions on Audio, Speech, and Language Processing, 2010.
    [BibTeX] [Download PDF]
    @article{2010_TASLP_b,
    author = "Helander, Elina and Virtanen, Tuomas and Nurminen, Jani and Gabbouj, Moncef",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    keywords = "voice conversion",
    title = "Voice Conversion Using Partial Least Squares Regression",
    year = "2010",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/voco\_pls.pdf"
    }

  • E. Helander, H. Silén, J. Miguez, and M. Gabbouj, "Maximum a posteriori voice conversion using sequential Monte Carlo methods," in Interspeech, 2010.
    [BibTeX]
    @inproceedings{2010_InterSpecch_a,
    author = "Helander, Elina and Sil{\'e}n, Hanna and Miguez, Joaquin and Gabbouj, Moncef",
    booktitle = "Interspeech",
    keywords = "voice conversion;smoothing;particle filtering",
    month = "September",
    title = "{M}aximum a posteriori voice conversion using sequential {M}onte {C}arlo methods",
    year = "2010"
    }

  • M. Helen and T. Virtanen, "Audio query by example using similarity measures between probability density functions of features," Eurasip Journal on Audio, Speech, and Music Processing, vol. 2010, iss. 179303, p. 1–12, 2010. doi:10.1155/2010/179303
    [BibTeX]
    @article{2010_JASM_b,
    author = "Helen, Marko and Virtanen, Tuomas",
    title = "Audio query by example using similarity measures between probability density functions of features",
    note = "Contribution: organisation=sgn,FACT1=1",
    year = "2010",
    doi = "10.1155/2010/179303",
    language = "English",
    volume = "2010",
    pages = "1--12",
    journal = "Eurasip Journal on Audio, Speech, and Music Processing",
    issn = "1687-4714",
    publisher = "Springer Publishing Company",
    number = "179303"
    }

  • S. Keronen, U. Remes, K. Palomäki, T. Virtanen, and M. Kurimo, "Comparison of Noise Robust Methods in Large Vocabulary Speech Recognition," in Proc. European Signal Processing Conference, 2010.
    [BibTeX] [Download PDF]
    @inproceedings{2010_EUSIPCO_b,
    author = {Keronen, Sami and Remes, Ulpu and Palom{\"a}ki, Kalle and Virtanen, Tuomas and Kurimo, Mikko},
    booktitle = "n Proc. European Signal Processing Conference",
    title = "Comparison of Noise Robust Methods in Large Vocabulary Speech Recognition",
    year = "2010",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/keronen\_eusipco2010.pdf"
    }

  • A. Klapuri and T. Virtanen, "Representing Musical Sounds With an Interpolating State Model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, iss. 3, pp. 613-624, 2010. doi:10.1109/TASL.2010.2040781
    [BibTeX] [Download PDF]
    @ARTICLE{2010_TASLP_a,
    author = "Klapuri, Anssi and Virtanen, Tuomas",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    title = "Representing Musical Sounds With an Interpolating State Model",
    year = "2010",
    volume = "18",
    number = "3",
    pages = "613-624",
    keywords = "Hidden Markov models;Signal processing algorithms;Vector quantization;Clustering algorithms;Audio coding;Signal processing;Multidimensional systems;Instruments;Computational modeling;Computational complexity;Acoustic signal processing;audio coding;discrete cosine transforms (DCTs);interpolation;vector quantization",
    doi = "10.1109/TASL.2010.2040781",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/interpolating.pdf"
    }

  • A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 5510-5513. doi:10.1109/ICASSP.2010.5495216
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2010_ICASSP_a,
    author = "Klapuri, Anssi and Virtanen, Tuomas and Heittola, Toni",
    booktitle = "2010 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Sound source separation in monaural music signals using excitation-filter model and {EM} algorithm",
    year = "2010",
    volume = "",
    number = "",
    pages = "5510-5513",
    keywords = "Source separation;Multiple signal classification;Music;Instruments;Frequency estimation;Power harmonic filters;Maximum likelihood estimation;Image analysis;Humans;Layout;Sound source separation;excitation-filter model;maximum likelihood estimation;expectation maximization",
    doi = "10.1109/ICASSP.2010.5495216",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/em-nmf.pdf"
    }

  • A. Mesaros and T. Virtanen, "Recognition of phonemes and words in singing," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 2146-2149. doi:10.1109/ICASSP.2010.5495585
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2010_ICASSP_c,
    author = "Mesaros, Annamaria and Virtanen, Tuomas",
    booktitle = "2010 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Recognition of phonemes and words in singing",
    year = "2010",
    volume = "",
    number = "",
    pages = "2146-2149",
    keywords = "Speech recognition;Hidden Markov models;Natural languages;Databases;Music information retrieval;Automatic speech recognition;Text recognition;Signal processing;System testing;Information analysis;singing recognition;speech recognition;query-by-singing",
    doi = "10.1109/ICASSP.2010.5495585",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/singrec.pdf"
    }

  • A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, "Acoustic event detection in real life recordings," in Proc. European Signal Processing Conference, Aalborg, Denmark, 2010, pp. 1267-1271.
    [BibTeX] [Abstract] [Download PDF]

    This paper presents a system for acoustic event detection in recordings from real life environments. The events are modeled using a network of hidden Markov models; their size and topology is chosen based on a study of isolated events recognition. We also studied the effect of ambient background noise on event classification performance. On real life recordings, we tested recognition of isolated sound events and event detection. For event detection, the system performs recognition and temporal positioning of a sequence of events. An accuracy of 24% was obtained in classifying isolated sound events into 61 classes. This corresponds to the accuracy of classifying between 61 events when mixed with ambient background noise at 0dB signal-to-noise ratio. In event detection, the system is capable of recognizing almost one third of the events, and the temporal positioning of the events is not correct for 84% of the time.

    @inproceedings{2010_EUSIPCO_a,
    author = "Mesaros, Annamaria and Heittola, Toni and Eronen, Antti and Virtanen, Tuomas",
    abstract = "This paper presents a system for acoustic event detection in recordings from real life environments. The events are modeled using a network of hidden Markov models; their size and topology is chosen based on a study of isolated events recognition. We also studied the effect of ambient background noise on event classification performance. On real life recordings, we tested recognition of isolated sound events and event detection. For event detection, the system performs recognition and temporal positioning of a sequence of events. An accuracy of 24\% was obtained in classifying isolated sound events into 61 classes. This corresponds to the accuracy of classifying between 61 events when mixed with ambient background noise at 0dB signal-to-noise ratio. In event detection, the system is capable of recognizing almost one third of the events, and the temporal positioning of the events is not correct for 84\% of the time.",
    address = "Aalborg, Denmark",
    booktitle = "In Proc. European Signal Processing Conference",
    keywords = "CASA;sound event detection",
    pages = "1267-1271",
    title = "Acoustic event detection in real life recordings",
    year = "2010",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/acoustic\_event\_detection\_1406.pdf"
    }

  • A. Mesaros and T. Virtanen, "Automatic recognition of lyrics in singing," EURASIP Journal on Audio, Speech and Music Processing, vol. 2010, 2010.
    [BibTeX] [Download PDF]
    @article{2010_JASM,
    author = "Mesaros, Annamaria and Virtanen, Tuomas",
    journal = "EURASIP Journal on Audio, Speech and Music Processing",
    keywords = "lyrics",
    title = "Automatic recognition of lyrics in singing",
    url = "http://www.hindawi.com/journals/asmp/2010/546047.html",
    volume = "2010",
    year = "2010"
    }

  • T. Mäkinen and P. Pertilä, "Shooter localization and bullet trajectory, caliber, and speed estimation based on detected firing sounds," Applied Acoustics, vol. 71, iss. 10, p. 902–913, 2010.
    [BibTeX] [Abstract]

    Shooter localization and estimation of bullet trajectory, caliber and speed have become essential tasks for example in peacekeeping and police assignments. A novel approach for such estimation and localization is presented in this paper, as a numerical estimation method is applied to the problem. Both simulated and recorded gunshot data are considered, as a known bullet shock wave model and detected firing sounds are utilized in creating a likelihood function corresponding to different bullet states. For this, a state-space model of the underlying dynamic system is developed, and a well-known optimization algorithm is used to find the global maximum of the evaluated function. Two different criteria are used to measure the likelihood values, namely the Generalized Cross Correlation (GCC) and the Mean-Squared Error (MSE). The achieved localization and estimation results are accurate and applicable when considering the usability of the method against hostile snipers. The shooter position and bullet state estimation errors vary between 2% and 10%, depending on the estimated parameter at stake.

    @article{2010_AA,
    author = {M{\"a}kinen, Toni and Pertil{\"a}, Pasi},
    abstract = "Shooter localization and estimation of bullet trajectory, caliber and speed have become essential tasks for example in peacekeeping and police assignments. A novel approach for such estimation and localization is presented in this paper, as a numerical estimation method is applied to the problem. Both simulated and recorded gunshot data are considered, as a known bullet shock wave model and detected firing sounds are utilized in creating a likelihood function corresponding to different bullet states. For this, a state-space model of the underlying dynamic system is developed, and a well-known optimization algorithm is used to find the global maximum of the evaluated function. Two different criteria are used to measure the likelihood values, namely the Generalized Cross Correlation (GCC) and the Mean-Squared Error (MSE). The achieved localization and estimation results are accurate and applicable when considering the usability of the method against hostile snipers. The shooter position and bullet state estimation errors vary between 2\% and 10\%, depending on the estimated parameter at stake.",
    impact_factor = "0.784",
    journal = "Applied Acoustics",
    keywords = "Shooter localization; Sniper localization; Shock wave; Muzzle blast; Simulated annealing; Monte Carlo methods",
    month = "October",
    number = "71",
    pages = "902–913",
    title = "{S}hooter localization and bullet trajectory, caliber, and speed estimation based on detected firing sounds",
    volume = "10",
    year = "2010"
    }

  • J. Nikunen and T. Virtanen, "Noise-to-mask ratio minimization by weighted non-negative matrix factorization," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 25-28. doi:10.1109/ICASSP.2010.5496264
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2010_ICASSP_d,
    author = "Nikunen, Joonas and Virtanen, Tuomas",
    booktitle = "2010 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Noise-to-mask ratio minimization by weighted non-negative matrix factorization",
    year = "2010",
    volume = "",
    number = "",
    pages = "25-28",
    keywords = "Signal to noise ratio;Acoustic noise;Signal processing algorithms;Audio coding;Signal processing;Psychoacoustic models;Masking threshold;Noise measurement;Nuclear magnetic resonance;Filter bank;Non-negative matrix factorization;Noise-to-mask ratio;Audio coding;Signal representations",
    doi = "10.1109/ICASSP.2010.5496264",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/nikunen.pdf"
    }

  • J. Nikunen and T. Virtanen, "Object-Based Audio Coding Using Non-Negative Matrix Factorization for the Spectrogram Representation," in in proc. 128th Audio Engineering Society Convention, 2010.
    [BibTeX]
    @inproceedings{2010_AES,
    author = "Nikunen, Joonas and Virtanen, Tuomas",
    booktitle = "in proc. 128th Audio Engineering Society Convention",
    keywords = "object-based coding",
    title = "Object-Based Audio Coding Using Non-Negative Matrix Factorization for the Spectrogram Representation",
    year = "2010"
    }

  • P. Pertilä and M. S. Hämäläinen, "A track before detect approach for sequential Bayesian tracking of multiple speech sources," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4974-4977. doi:10.1109/ICASSP.2010.5495092
    [BibTeX]
    @INPROCEEDINGS{2010_ICASSP,
    author = "Pertilä, Pasi and Hämäläinen, Matti S.",
    booktitle = "2010 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "A track before detect approach for sequential Bayesian tracking of multiple speech sources",
    year = "2010",
    volume = "",
    number = "",
    pages = "4974-4977",
    keywords = "Bayesian methods;Speech;Acoustic signal detection;Particle filters;Target tracking;Particle tracking;Acoustic measurements;Signal to noise ratio;Filtering;Particle measurements;Acoustic Tracking;Multiple Sources;Particle Filters;Likelihood Ratio;Track Management",
    doi = "10.1109/ICASSP.2010.5495092"
    }

  • B. Raj, T. Virtanen, S. Chaudhuri, and R. Singh, "Non-negative matrix factorization based compensation of music for automatic speech recognition," in Proceedings of Interspeech 2010, Makuhari, Japan, 2010.
    [BibTeX] [Download PDF]
    @inproceedings{2010_InterSpecch_c,
    author = "Raj, Bhiksha and Virtanen, Tuomas and Chaudhure, Sourish and Singh, Rita",
    address = "Makuhari, Japan",
    booktitle = "Proceedings of Interspeech 2010",
    keywords = "NMF; non-negative matrix factorization; noise robustness; speech recognition",
    title = "Non-negative matrix factorization based compensation of music for automatic speech recognition",
    year = "2010",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/nmf\_compensation.pdf"
    }

  • H. Silén, E. Helander, J. Nurminen, K. Koppinen, and M. Gabbouj, "Using Robust Viterbi Algorithm and HMM-Modeling in Unit Selection TTS to Replace Units of Poor Quality," in Interspeech 2010, 2010.
    [BibTeX] [Abstract]

    In hidden Markov model-based unit selection synthesis, the benefits of both unit selection and statistical parametric speech synthesis are combined. However, conventional Viterbi algorithm is forced to do a selection also when no suitable units are available. This can drift the search and decrease the overall quality. Consequently, we propose to use robust Viterbi algorithm that can simultaneously detect bad units and select the best sequence. The unsuitable units are replaced using hidden Markov model-based synthesis. Evaluations indicate that the use of robust Viterbi algorithm combined with unit replacement increases the quality compared to the traditional algorithm.

    @inproceedings{2010_InterSpecch,
    author = "Sil{\'e}n, Hanna and Helander, Elina and Nurminen, Jani and Koppinen, Konsta and Gabbouj, Moncef",
    abstract = "In hidden Markov model-based unit selection synthesis, the benefits of both unit selection and statistical parametric speech synthesis are combined. However, conventional Viterbi algorithm is forced to do a selection also when no suitable units are available. This can drift the search and decrease the overall quality. Consequently, we propose to use robust Viterbi algorithm that can simultaneously detect bad units and select the best sequence. The unsuitable units are replaced using hidden Markov model-based synthesis. Evaluations indicate that the use of robust Viterbi algorithm combined with unit replacement increases the quality compared to the traditional algorithm.",
    booktitle = "Interspeech 2010",
    keywords = "speech synthesis;robust Viterbi algorithm;unit selection;hidden Markov models",
    title = "{U}sing {R}obust {V}iterbi {A}lgorithm and {HMM}-{M}odeling in {U}nit {S}election {TTS} to {R}eplace {U}nits of {P}oor {Q}uality",
    year = "2010"
    }

  • H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, "Analysis of Duration Prediction Accuracy in HMM-Based Speech Synthesis," in The Fifth International Conference on Speech Prosody, 2010.
    [BibTeX] [Abstract]

    Appropriate phoneme durations are essential for high quality speech synthesis. In hidden Markov model-based text-to-speech (HMM-TTS), durations are typically modeled statistically using state duration probability distributions and duration prediction for unseen contexts. Use of rich context features enables synthesis without high-level linguistic knowledge. In this paper we analyze the accuracy of state duration modeling against phone duration modeling using simple prediction techniques. In addition to the decision tree-based techniques, regression techniques for rich context features with high collinearity are discussed and evaluated.

    @inproceedings{2010_SP,
    author = "Sil{\'e}n, Hanna and Helander, Elina and Nurminen, Jani and Gabbouj, Moncef",
    abstract = "Appropriate phoneme durations are essential for high quality speech synthesis. In hidden Markov model-based text-to-speech (HMM-TTS), durations are typically modeled statistically using state duration probability distributions and duration prediction for unseen contexts. Use of rich context features enables synthesis without high-level linguistic knowledge. In this paper we analyze the accuracy of state duration modeling against phone duration modeling using simple prediction techniques. In addition to the decision tree-based techniques, regression techniques for rich context features with high collinearity are discussed and evaluated.",
    booktitle = "The Fifth International Conference on Speech Prosody",
    keywords = "speech synthesis",
    month = "May",
    title = "{A}nalysis of {D}uration {P}rediction {A}ccuracy in {HMM}-{B}ased {S}peech {S}ynthesis",
    year = "2010"
    }

  • S. Tervo and T. Korhonen, "Estimation of reflective surfaces from continuous signals," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP, Dallas, Texas, USA, March 14-19, 2010, p. 153–156. doi:10.1109/ICASSP.2010.5496104
    [BibTeX]
    @inproceedings{2010_ICASSP_e,
    author = "Tervo, Sakari and Korhonen, Teemu",
    booktitle = "Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP, Dallas, Texas, USA, March 14-19, 2010",
    doi = "10.1109/ICASSP.2010.5496104",
    isbn = "978-1-4244-4296-6",
    pages = "153--156",
    title = "Estimation of reflective surfaces from continuous signals",
    year = "2010"
    }

  • T. Virtanen, J. Gemmeke, and A. Hurmalainen, "State-based labelling for a sparse representation of speech and its application to robust speech recognition," in Interspeech 2010, 2010.
    [BibTeX] [Download PDF]
    @inproceedings{2010_InterSpecch_b,
    author = "Virtanen, Tuomas and Gemmeke, Jort and Hurmalainen, Antti",
    booktitle = "Interspeech 2010",
    title = "State-based labelling for a sparse representation of speech and its application to robust speech recognition",
    year = "2010",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/labeling.pdf"
    }

2009

  • D. D. S. Alves, J. Paulus, and J. Fonseca, "Drum transcription from multichannel recordings with non-negative matrix factorization," in Proc. of the 17th European Signal Processing Conference, Glasgow, Scotland, UK, 2009, p. 894–898.
    [BibTeX]
    @inproceedings{2009_EUSIPCO,
    author = "Alves, David Dos Santos and Paulus, Jouni and Fonseca, Jose",
    address = "Glasgow, Scotland, UK",
    booktitle = "Proc. of the 17th European Signal Processing Conference",
    keywords = "drums; transcription; NMF",
    month = "Aug",
    pages = "894--898",
    title = "{D}rum transcription from multichannel recordings with non-negative matrix factorization",
    year = "2009"
    }

  • T. Heittola, A. Klapuri, and T. Virtanen, "Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation," in Proc. 10th Int. Society for Music Information Retrieval Conf. (ISMIR 2009), Kobe, Japan, 2009, pp. 327-332.
    [BibTeX] [Abstract] [Download PDF]

    This paper proposes a novel approach to musical instrument recognition in polyphonic audio signals by using a source-filter model and an augmented non-negative matrix factorization algorithm for sound separation. The mixture signal is decomposed into a sum of spectral bases modeled as a product of excitations and filters. The excitations are restricted to harmonic spectra and their fundamental frequencies are estimated in advance using a multipitch estimator, whereas the filters are restricted to have smooth frequency responses by modeling them as a sum of elementary functions on the Mel-frequency scale. The pitch and timbre information are used in organizing individual notes into sound sources. In the recognition, Mel-frequency cepstral coefficients are used to represent the coarse shape of the power spectrum of sound sources and Gaussian mixture models are used to model instrument-conditional densities of the extracted features. The method is evaluated with polyphonic signals, randomly generated from 19 instrument classes. The recognition rate for signals having six note polyphony reaches 59%.

    @inproceedings{2009_ISMIR 2009,
    author = "Heittola, Toni and Klapuri, Anssi and Virtanen, Tuomas",
    abstract = "This paper proposes a novel approach to musical instrument recognition in polyphonic audio signals by using a source-filter model and an augmented non-negative matrix factorization algorithm for sound separation. The mixture signal is decomposed into a sum of spectral bases modeled as a product of excitations and filters. The excitations are restricted to harmonic spectra and their fundamental frequencies are estimated in advance using a multipitch estimator, whereas the filters are restricted to have smooth frequency responses by modeling them as a sum of elementary functions on the Mel-frequency scale. The pitch and timbre information are used in organizing individual notes into sound sources. In the recognition, Mel-frequency cepstral coefficients are used to represent the coarse shape of the power spectrum of sound sources and Gaussian mixture models are used to model instrument-conditional densities of the extracted features. The method is evaluated with polyphonic signals, randomly generated from 19 instrument classes. The recognition rate for signals having six note polyphony reaches 59\%.",
    address = "Kobe, Japan",
    booktitle = "in Proc. 10th Int. Society for Music Information Retrieval Conf. (ISMIR 2009)",
    keywords = "instruments;separation;source-filter model",
    organization = "International Society for Music Information Retrieval (ISMIR)",
    pages = "327-332",
    title = "Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation",
    year = "2009",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ismir09-heittola.pdf"
    }

  • M. Helén, T. Lahti, and A. Klapuri, "Tools for automatic audio management," in Open Information Management: applications of interconnectivity and collaboration, 2009, p. 244–265.
    [BibTeX]
    @inproceedings{2009_ICA,
    author = "Hel{\'e}n, Marko and Lahti, Tommi and Klapuri, Anssi",
    editor = "Niiranen, S.",
    booktitle = "Open Information Management: applications of interconnectivity and collaboration",
    isbn = "978-1-60566-246-6",
    pages = "244--265",
    title = "{T}ools for automatic audio management",
    year = "2009"
    }

  • M. Helén and T. Virtanen, "Audio Query by Example Using Similarity Measures between Probability Density Functions of Features," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, iss. 1, p. 1–12, 2009. doi:10.1155/2010/179303
    [BibTeX]
    @article{2009_JASM,
    author = "Hel{\'e}n, Marko and Virtanen, Tuomas",
    title = "Audio Query by Example Using Similarity Measures between Probability Density Functions of Features",
    journal = "EURASIP Journal on Audio, Speech, and Music Processing",
    volume = "2010",
    number = "1",
    pages = "1--12",
    year = "2009",
    doi = "10.1155/2010/179303"
    }

  • A. Klapuri, "A method for visualizing the pitch content of polyphonic music signals," in Proc. 10th Int. Society for Music Information Retrieval Conf. (ISMIR 2009), Kobe, Japan, 2009.
    [BibTeX]
    @inproceedings{2009_ISMIR 2009_a,
    author = "Klapuri, Anssi",
    address = "Kobe, Japan",
    booktitle = "Proc. 10th Int. Society for Music Information Retrieval Conf. (ISMIR 2009)",
    title = "{A} method for visualizing the pitch content of polyphonic music signals",
    year = "2009"
    }

  • A. Klapuri, "A classification approach to multipitch analysis," in 6th Sound and Music Computing Conference, Porto, Portugal, 2009.
    [BibTeX]
    @inproceedings{2009_SMC,
    author = "Klapuri, Anssi",
    address = "Porto, Portugal",
    booktitle = "6th Sound and Music Computing Conference",
    title = "{A} classification approach to multipitch analysis",
    year = "2009"
    }

  • A. Löytynoja and P. Pertilä, "A real-time talker localization implementation using multi-PHAT and particle filter," in Proceedings of the 17th European Signal Processing Conference, Eusipco, Glasgow, Scotland, UK, 2009, pp. 1418-1422.
    [BibTeX]
    @inproceedings{2009_EUSIPCO_c,
    author = {L{\"o}ytynoja, Antti and Pertil{\"a}, Pasi},
    address = "Glasgow, Scotland, UK",
    booktitle = "Proceedings of the 17th European Signal Processing Conference, Eusipco",
    keywords = "speaker tracking",
    month = "August",
    pages = "1418-1422",
    title = "{A} real-time talker localization implementation using multi-{PHAT} and particle filter",
    year = "2009"
    }

  • A. Mesaros and T. Virtanen, "Adaptation of a speech recognizer for singing voice," in Proceedings of the 17th European Signal Processing Conference, 2009, pp. 1779-1783.
    [BibTeX] [Download PDF]
    @inproceedings{2009_EUSIPCO_d,
    author = "Mesaros, Annamaria and Virtanen, Tuomas",
    booktitle = "Proceedings of the 17th European Signal Processing Conference",
    keywords = "singer identification",
    pages = "1779-1783",
    publisher = "Eurasip",
    title = "Adaptation of a speech recognizer for singing voice",
    year = "2009",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/sp\_adapt.pdf"
    }

  • M. Myllymäki and T. Virtanen, "Non-stationary noise model compensation in voice activity detection," in 2009 17th European Signal Processing Conference, 2009, pp. 2186-2190.
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2009_EUSIPCO_b,
    author = "Myllymäki, Mikko and Virtanen, Tuomas",
    booktitle = "2009 17th European Signal Processing Conference",
    title = "Non-stationary noise model compensation in voice activity detection",
    year = "2009",
    volume = "",
    number = "",
    pages = "2186-2190",
    keywords = "Noise;Estimation;Adaptation models;Robustness;Hidden Markov models",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/myllymaki\_eusipco2009.pdf"
    }

  • T. Mäkinen, P. Pertilä, and P. Auranen, "Supersonic bullet state estimation using particle filtering," in Proceedings of 2009 IEEE International Conference on Signal and Image Processing Applications, ICSIPA, Kuala Lumpur, Malaysia, 2009.
    [BibTeX]
    @inproceedings{2009_ICSIPA,
    author = {M{\"a}kinen, Toni and Pertil{\"a}, Pasi and Auranen, Pasi},
    address = "Kuala Lumpur, Malaysia",
    booktitle = "Proceedings of 2009 IEEE International Conference on Signal and Image Processing Applications, ICSIPA",
    keywords = "bullet state",
    month = "November",
    title = "{S}upersonic bullet state estimation using particle filtering",
    year = "2009"
    }

  • M. Parviainen, "Robust self-localization solutions for meeting room environments," in Proceedings of the 13th IEEE International Symposium on consumer Electronics, ISCE 2009, Kyoto, Japan, 25-28 May 2009, 2009, p. 237–240. doi:10.1109/ISCE.2009.5156957
    [BibTeX]
    @inproceedings{2009_ISCE,
    author = "Parviainen, M.",
    booktitle = "Proceedings of the 13th IEEE International Symposium on consumer Electronics, ISCE 2009, Kyoto, Japan, 25-28 May 2009",
    doi = "10.1109/ISCE.2009.5156957",
    isbn = "978-1-4244-2976-9",
    pages = "237--240",
    title = "Robust self-localization solutions for meeting room environments",
    year = "2009"
    }

  • J. Paulus and A. Klapuri, "Music Structure Analysis Using a Probabilistic Fitness Measure and a Greedy Search Algorithm," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, iss. 6, p. 1159–1170, 2009.
    [BibTeX]
    @article{2009_TASLP,
    author = "Paulus, Jouni and Klapuri, Anssi",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    keywords = "Music structure analysis",
    month = "Aug",
    number = "6",
    pages = "1159--1170",
    title = "{M}usic {S}tructure {A}nalysis {U}sing a {P}robabilistic {F}itness {M}easure and a {G}reedy {S}earch {A}lgorithm",
    volume = "17",
    year = "2009"
    }

  • J. Paulus and A. Klapuri, "Labelling the Structural Parts of a Music Piece with Markov Models," in Computer Music Modeling and Retrieval: Genesis of Meaning in Sound and Music - 5th International Symposium, CMMR 2008 Copenhagen, Denmark, May 19-23, 2008, Revised Papers, S. Ystad, R. Kronland-Martinet, and K. Jensen, Eds., Springer Berlin / Heidelberg, 2009, p. 166–176.
    [BibTeX]
    @incollection{2009,
    author = "Paulus, Jouni and Klapuri, Anssi",
    editor = "Ystad, S{\o}lvi and Kronland-Martinet, Richard and Jensen, Kristoffer",
    booktitle = "Computer Music Modeling and Retrieval: Genesis of Meaning in Sound and Music - 5th International Symposium, CMMR 2008 Copenhagen, Denmark, May 19-23, 2008, Revised Papers",
    keywords = "Music structure analysis",
    pages = "166--176",
    publisher = "Springer Berlin / Heidelberg",
    title = "{L}abelling the {S}tructural {P}arts of a {M}usic {P}iece with {M}arkov {M}odels",
    year = "2009"
    }

  • J. Paulus and A. Klapuri, "Music structure analysis with a probabilistic fitness function in MIREX2009," in Proc. of the Fifth Annual Music Information Retrieval Evaluation eXchange, Kobe, Japan, 2009.
    [BibTeX]
    @inproceedings{2009_MIREX,
    author = "Paulus, Jouni and Klapuri, Anssi",
    address = "Kobe, Japan",
    booktitle = "Proc. of the Fifth Annual Music Information Retrieval Evaluation eXchange",
    keywords = "Music structure analysis",
    month = "Oct",
    title = "{M}usic structure analysis with a probabilistic fitness function in {MIREX}2009",
    year = "2009"
    }

  • J. Paulus and A. Klapuri, "Drum sound detection in polyphonic music with hidden markov models," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2009, p. 1–9, 2009.
    [BibTeX] [Download PDF]
    @article{2009_JASM_a,
    author = "Paulus, Jouni and Klapuri, Anssi",
    title = "Drum sound detection in polyphonic music with hidden markov models",
    journal = "EURASIP Journal on Audio, Speech, and Music Processing",
    volume = "2009",
    pages = "1--9",
    year = "2009",
    publisher = "Springer",
    url = "https://link.springer.com/content/pdf/10.1155/2009/497292.pdf"
    }

  • V. Popa, J. Nurminen, and M. Gabbouj, "A novel technique for voice conversion based on style and content decomposition with bilinear models," in Proceedings of the 10th Annual Conference of the International Speech Communication Associationa, Interspeech 2009, Brighton, UK, 6-10 September 2009, 2009, p. 2655–2658.
    [BibTeX]
    @inproceedings{2009_InterSpecch,
    author = "Popa, Victor and Nurminen, Jani and Gabbouj, Moncef",
    booktitle = "Proceedings of the 10th Annual Conference of the International Speech Communication Associationa, Interspeech 2009, Brighton, UK, 6-10 September 2009",
    pages = "2655--2658",
    title = "A novel technique for voice conversion based on style and content decomposition with bilinear models",
    year = "2009"
    }

  • H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, "Parameterization of vocal fry in HMM-based speech synthesis," in Proceedings of the 10th Annual Conference of the International Speech Communication Association, Interspeech, Brighton, UK, 2009, pp. 1775-1778.
    [BibTeX] [Abstract] [Download PDF]

    HMM-based speech synthesis offers a way to generate speech with different voice qualities. However, sometimes databases contain certain inherent voice qualities that need to be parametrized properly. One example of this is vocal fry typically occurring at the end of utterances. A popular mixed excitation vocoder for HMM-based speech synthesis is STRAIGHT. The standard STRAIGHT is optimized for modal voices and may not produce high quality with other voice types. Fortunately, due to the flexibility of STRAIGHT, different F0 and aperiodicity measures can be used in the synthesis without any inherent degradations in speech quality. We have replaced the STRAIGHT excitation with a representation based on a robust F0 measure and a carefully determined two-band voicing. According to our analysis-synthesis experiments, the new parameterization can improve the speech quality. In HMM-based speech synthesis, the quality is significantly improved especially due to the better modeling of vocal fry.

    @inproceedings{2009_InterSpecch_a,
    author = "Sil{\'e}n, Hanna and Helander, Elina and Nurminen, Jani and Gabbouj, Moncef",
    abstract = "HMM-based speech synthesis offers a way to generate speech with different voice qualities. However, sometimes databases contain certain inherent voice qualities that need to be parametrized properly. One example of this is vocal fry typically occurring at the end of utterances. A popular mixed excitation vocoder for HMMbased speech synthesis is STRAIGHT. The standard STRAIGHT is optimized for modal voices and may not produce high quality with other voice types. Fortunately, due to the flexibility of STRAIGHT, different F0 and aperiodicity measures can be used in the synthesis without any inherent degradations in speech quality. We have replaced the STRAIGHT excitation with a representation based on a robust F0 measure and a carefully determined two-band voicing. According to our analysis-synthesis experiments, the new parameterization can improve the speech quality. In HMM-based speech synthesis, the quality is significantly improved especially due to the better modeling of vocal fry.",
    address = "Brighton, UK",
    booktitle = "Proceedings of the 10th Annual Conference of the International Speech Communication Associationa, Interspeech",
    keywords = "speech synthesis",
    month = "September",
    pages = "1775-1778",
    title = "{P}arameterization of vocal fry in {HMM}-based speech synthesis",
    url = "http://www.isca-speech.org/archive/interspeech\_2009/i09\_1775.html",
    year = "2009"
    }

  • T. Virtanen and T. Heittola, "Interpolating hidden Markov model and its application to automatic instrument recognition," in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 49-52. doi:10.1109/ICASSP.2009.4959517
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2009_ICASSP,
    author = "Virtanen, Tuomas and Heittola, Toni",
    booktitle = "2009 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Interpolating hidden Markov model and its application to automatic instrument recognition",
    year = "2009",
    volume = "",
    number = "",
    pages = "49-52",
    keywords = "Hidden Markov models;Instruments;Interpolation;Signal processing algorithms;Piecewise linear techniques;Parameter estimation;Acoustic signal processing;Pattern classification;Signal synthesis;Speech synthesis;Hidden Markov models;acoustic signal processing;musical instruments;pattern classification",
    doi = "10.1109/ICASSP.2009.4959517",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ihmm\_icassp09.pdf"
    }

  • T. Virtanen, "Spectral covariance in prior distributions of non-negative matrix factorization based speech separation," in 2009 17th European Signal Processing Conference, 2009, pp. 1933-1937.
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2009_EUSIPCO_a,
    author = "Virtanen, Tuomas",
    booktitle = "2009 17th European Signal Processing Conference",
    title = "Spectral covariance in prior distributions of non-negative matrix factorization based speech separation",
    year = "2009",
    volume = "",
    number = "",
    pages = "1933-1937",
    keywords = "Vectors;Speech;Training;Stability analysis;Testing",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/covariance/virtanen\_eusipco2009.pdf"
    }

  • T. Virtanen and A. T. Cemgil, "Mixtures of Gamma Priors for Non-Negative Matrix Factorization Based Speech Separation," in ICA, 2009.
    [BibTeX]
    @inproceedings{2009_ICA_a,
    author = "Virtanen, Tuomas and Cemgil, Ali Taylan",
    booktitle = "ICA",
    keywords = "NMF;speech",
    title = "{M}ixtures of {G}amma {P}riors for {N}on-{N}egative {M}atrix {F}actorization {B}ased {S}peech {S}eparation",
    year = "2009"
    }

2008

  • E. B. Bilcu and J. Astola, "A hybrid approach to bilingual text-to-phoneme mapping," Facta Universitatis, Series: Electronics and Energetics, vol. 21, iss. 1, p. 91–105, 2008.
    [BibTeX]
    @article{2008,
    author = {Bilcu, Enik{\"o} Beatrice and Astola, Jaakko},
    issn = "0353-3670",
    journal = "Facta Universitatis, Series: Electronics and Energetics",
    number = "1",
    pages = "91--105",
    title = "{A} hybrid approach to bilingual text-to-phoneme mapping",
    volume = "21",
    year = "2008"
    }

  • T. Heittola and A. Klapuri, "TUT acoustic event detection system 2007," Lecture Notes in Computer Science, vol. 4625, p. 364–370, 2008. doi:10.1007/978-3-540-68585-2\\_35
    [BibTeX]
    @article{2008_LNCS_b,
    author = "Heittola, T. and Klapuri, A.",
    title = "TUT acoustic event detection system 2007",
    note = "Contribution: organisation=sgn,FACT1=1",
    year = "2008",
    doi = "10.1007/978-3-540-68585-2\\_35",
    language = "English",
    volume = "4625",
    pages = "364--370",
    journal = "Lecture Notes in Computer Science",
    issn = "0302-9743",
    publisher = "Springer Science and Business Media Deutschland GmbH"
    }

  • E. Helander, J. Nurminen, and M. Gabbouj, "LSF mapping for voice conversion with very small training sets," in Proceedings of 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Las Vegas, Nevada, USA, 2008, pp. 4669-4672.
    [BibTeX]
    @inproceedings{2008_ICASSP_a,
    author = "Helander, Elina and Nurminen, Jani and Gabbouj, Moncef",
    address = "Las Vegas, Nevada, USA",
    booktitle = "Proceedings of 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP",
    keywords = "voice conversion",
    month = "March",
    pages = "4669-4672",
    title = "{LSF} mapping for voice conversion with very small training sets",
    year = "2008"
    }

  • E. Helander, J. Schwarz, J. Nurminen, H. Silén, and M. Gabbouj, "On the impact of alignment on voice conversion performance," in Proceedings of the 9th Annual Conference of the International Speech Communication Association, Interspeech, Brisbane, Australia, 2008, pp. 1453-1456.
    [BibTeX]
    @inproceedings{2008_InterSpecch,
    author = "Helander, Elina and Schwarz, Jan and Nurminen, Jani and Sil{\'e}n, Hanna and Gabbouj, Moncef",
    address = "Brisbane, Australia",
    booktitle = "roceedings of the 9th Annual Conference of the International Speech Communication Associationa, Interspeech",
    keywords = "voice conversion",
    month = "September",
    pages = "1453-1456",
    title = "{O}n the impact of alignment on voice conversion performance",
    year = "2008"
    }

  • A. Klapuri, "Multipitch Analysis of Polyphonic Music and Speech Signals Using an Auditory Model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, iss. 2, pp. 255-266, 2008. doi:10.1109/TASL.2007.908129
    [BibTeX]
    @ARTICLE{2008_TASLP,
    author = "Klapuri, Anssi",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    title = "Multipitch Analysis of Polyphonic Music and Speech Signals Using an Auditory Model",
    year = "2008",
    volume = "16",
    number = "2",
    pages = "255-266",
    keywords = "Speech analysis;Signal analysis;Multiple signal classification;Frequency estimation;Computational modeling;Music;Humans;Robustness;Signal processing;Interference;Acoustic signal analysis;fundamental frequency estimation;music information retrieval;pitch perception;Acoustic signal analysis;fundamental frequency estimation;music information retrieval;pitch perception",
    doi = "10.1109/TASL.2007.908129"
    }

  • A. Klapuri and T. Virtanen, "Automatic music transcription," in , Handbook of Signal Processing in Acoustics, 2008, p. 277–303.
    [BibTeX]
    @inproceedings{2008_SP,
    author = "Klapuri, Anssi and Virtanen, Tuomas",
    editor = "Havelock, D.",
    booktitle = ", Handbook of Signal Processing in Acoustics",
    isbn = "978-0-387-77698-9",
    pages = "277--303",
    publisher = "Springer",
    title = "Automatic music transcription",
    year = "2008"
    }

  • T. Korhonen and P. Pertilä, "TUT acoustic source tracking system 2007," in Lecture Notes in Computer Science, 2008, pp. 104-112.
    [BibTeX]
    @inproceedings{2008_LNCS,
    author = {Korhonen, Teemu and Pertil{\"a}, Pasi},
    booktitle = "Lecture Notes in Computer Science",
    pages = "104-112",
    title = "{TUT} acoustic source tracking system 2007",
    volume = "4625",
    year = "2008"
    }

  • T. Lahti, M. Helén, O. Vuorinen, E. Väyrynen, J. Partala, J. Peltola, and S. Mäkelä, "On Enabling Techniques for Personal Audio Content Management," in ACM International Conference on Multimedia Information Retrieval (MIR 2008), Vancouver, Canada, 2008.
    [BibTeX] [Abstract]

    State-of-the-art automatic analysis tools for personal audio content management are discussed in this paper. Our main target is to create a system, which has several co-operating management tools for audio database and which improve the results of each other. Bayesian networks based audio classification algorithm provides classification into four main audio classes (silence, speech, music, and noise) and serves as a first step for other subsequent analysis tools. For speech analysis we propose an improved Bayesian information criterion based speaker segmentation and clustering algorithm applying also a combined gender and emotion detection algorithm utilizing prosodic features. For the other main classes it is often hard to devise any general and well functional pre-categorization that would fit the unforeseeable types of user recorded data. For compensating the absence of analysis tools for these classes we propose the use of efficient audio similarity measure and query-by-example algorithm with database clustering capabilities. The experimental results show that the combined use of the algorithms is feasible in practice.

    @inproceedings{2008_MIR 2008,
    author = {Lahti, Tommi and Hel{\'e}n, Marko and Vuorinen, Olli and V{\"a}yrynen, Eero and Partala, Juha and Peltola, Johannes and M{\"a}kel{\"a}, Satu-Marja},
    abstract = "State-of-the-art automatic analysis tools for personal audio con-tent management are discussed in this paper. Our main target is to create a system, which has several co-operating management tools for audio database and which improve the results of each other. Bayesian networks based audio classification algorithm provides classification into four main audio classes (silence, speech, music, and noise) and serves as a first step for other subsequent analysis tools. For speech analysis we propose an improved Bayesian information criterion based speaker segmen-tation and clustering algorithm applying also a combined gender and emotion detection algorithm utilizing prosodic features. For the other main classes it is often hard to device any general and well functional pre-categorization that would fit the unforesee-able types of user recorded data. For compensating the absence of analysis tools for these classes we propose the use of efficient audio similarity measure and query-by-example algorithm with database clustering capabilities. The experimental results show that the combined use of the algorithms is feasible in practice.",
    address = "Vancouver, Canada",
    booktitle = "ACM International Conference on Multimedia Information Retrieval (MIR 2008)",
    keywords = "audio content management",
    month = "October",
    title = "{O}n {E}nabling {T}echniques for {P}ersonal {A}udio {C}ontent {M}anagement",
    year = "2008"
    }

  • A. Mesaros and T. Virtanen, "Automatic Alignment of Music Audio and Lyrics," in DAFx08, 2008.
    [BibTeX] [Download PDF]
    @inproceedings{2008_DAFx_a,
    author = "Mesaros, Annamaria and Virtanen, Tuomas",
    booktitle = "DAFx08",
    keywords = "Lyrics",
    title = "Automatic Alignment of Music Audio and Lyrics",
    year = "2008",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/autalign\_cr.pdf"
    }

  • M. Myllymäki and T. Virtanen, "Voice Activity Detection in the Presence of Breathing Noise Using Neural Network and Hidden Markov Model," in EUSIPCO 2008, 2008.
    [BibTeX] [Download PDF]
    @inproceedings{2008_EUSIPCO,
    author = {Myllym{\"a}ki, Mikko and Virtanen, Tuomas},
    booktitle = "EUSIPCO 2008",
    keywords = "voice activity; HMM; neural network",
    title = "Voice Activity Detection in the Presence of Breathing Noise Using Neural Network and Hidden Markov Model",
    year = "2008",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/eusipco2008.pdf"
    }

  • J. Paulus and A. Klapuri, "Labelling the Structural Parts of a Music Piece with Markov Models," in Proc. of the 2008 Computers in Music Modeling and Retrieval Conference, Copenhagen, Denmark, 2008, p. 137–147.
    [BibTeX]
    @inproceedings{2008_CMMR,
    author = "Paulus, Jouni and Klapuri, Anssi",
    editor = "Jensen, Kristoffer",
    address = "Copenhagen, Denmark",
    booktitle = "Proc. of the 2008 Computers in Music Modeling and Retrieval Conference",
    keywords = "Music structure analysis",
    month = "May",
    pages = "137--147",
    title = "{L}abelling the {S}tructural {P}arts of a {M}usic {P}iece with {M}arkov {M}odels",
    year = "2008"
    }

  • J. Paulus and A. Klapuri, "Acoustic Features for Music Piece Structure Analysis," in Proc. of the 11th International Conference on Digital Audio Effects, Espoo, Finland, 2008, p. 309–312.
    [BibTeX]
    @inproceedings{2008_DAFx,
    author = "Paulus, Jouni and Klapuri, Anssi",
    address = "Espoo, Finland",
    booktitle = "Proc. of the 11th International Conference on Digital Audio Effects",
    keywords = "Music structure analysis",
    month = "Sep",
    pages = "309--312",
    title = "{A}coustic {F}eatures for {M}usic {P}iece {S}tructure {A}nalysis",
    year = "2008"
    }

  • J. Paulus and A. Klapuri, "Music Structure Analysis Using a Probabilistic Fitness Measure and an Integrated Musicological Model," in Proc. of the Ninth International Conference on Music Information Retrieval, Philadelphia, PA, USA, 2008, p. 369–374.
    [BibTeX]
    @inproceedings{2008_ISMIR,
    author = "Paulus, Jouni and Klapuri, Anssi",
    address = "Philadelphia, PA, USA",
    booktitle = "Proc. of the Ninth International Conference on Music Information Retrieval",
    keywords = "Music structure analysis",
    month = "Sep",
    number = "369-374",
    series = {""},
    title = "{M}usic {S}tructure {A}nalysis {U}sing a {P}robabilistic {F}itness {M}easure {A}nd an {I}ntegrated {M}usicological {M}odel",
    year = "2008"
    }

  • P. Pertilä, "Array steered response time-alignment for propagation delay compensation for acoustic localization," in Proceedings of the Forty-Second Asilomar Conference on Signals, Systems and Computers, Pacific Grove, California, USA, 2008, pp. 298-302.
    [BibTeX]
    @inproceedings{2008_a,
    author = {Pertil{\"a}, Pasi},
    address = "Pacific Grove, California, USA",
    booktitle = "Proceedings of the Forty-Second Asilomar Conference on Signals, Systems and Computers",
    month = "October",
    pages = "298-302",
    title = "{A}rray steered response time-alignment for propagation delay compensation for acoustic localization",
    year = "2008"
    }

  • P. Pertilä, T. Korhonen, and A. Visa, "Measurement combination for acoustic source localization in a room environment," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, iss. 278185, 2008.
    [BibTeX] [Download PDF]
    @article{2008_JASM,
    author = {Pertil{\"a}, Pasi and Korhonen, Teemu and Visa, Ari},
    journal = "Eurasip Journal on Audio, Speech, and Music Processing",
    keywords = "Speaker tracking",
    number = "278185",
    title = "{M}easurement combination for acoustic source localization in a room environment",
    url = "http://www.hindawi.com/journals/asmp/aip.278185.html",
    volume = "2008",
    year = "2008"
    }

  • T. Pirinen, "A confidence statistic and an outlier detector for difference estimates in sensor arrays," IEEE Sensors Journal, vol. 8, iss. 12, p. 2008–2015, 2008. doi:10.1109/JSEN.2008.2007677
    [BibTeX]
    @article{2008_SJ,
    author = "Pirinen, Tuomo",
    doi = "10.1109/JSEN.2008.2007677",
    issn = "1530-437X",
    journal = "IEEE Sensors Journal",
    number = "12",
    pages = "2008--2015",
    publisher = "IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC",
    title = "{A} confidence statistic and an outlier detector for difference estimates in sensor arrays",
    volume = "8",
    year = "2008"
    }

  • T. Pirinen, "An experimental comparison of time delay weights for direction of arrival estimation," in Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 1-4 September 2008, 2008, p. 4 p.
    [BibTeX]
    @inproceedings{2008_DAFx-08,
    author = "Pirinen, Tuomo",
    editor = "Pakarinen, J.",
    booktitle = "Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 1-4 September 2008",
    isbn = "978-951-22-9516-6",
    pages = "4 p",
    title = "{A}n experimental comparison of time delay weights for direction of arrival estimation",
    year = "2008"
    }

  • M. Ryynänen and A. Klapuri, "Query by Humming of MIDI and Audio Using Locality Sensitive Hashing," in Proceedings of 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'08), Las Vegas, Nevada, USA, 2008.
    [BibTeX]
    @inproceedings{2008_ICASSP'08,
    author = {Ryyn{\"a}nen, Matti and Klapuri, Anssi},
    address = "Las Vegas, Nevada, USA",
    booktitle = "Proceedings of 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'08)",
    month = "April",
    title = "{Q}uery by {H}umming of {MIDI} and {A}udio {U}sing {L}ocality {S}ensitive {H}ashing",
    year = "2008"
    }

  • M. Ryynänen and A. Klapuri, "Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music," Computer Music Journal, vol. 32, iss. 3, 2008.
    [BibTeX]
    @article{2008_CMJ,
    author = {Ryyn{\"a}nen, Matti and Klapuri, Anssi},
    journal = "Computer Music Journal",
    keywords = "Transcription",
    number = "3",
    title = "{A}utomatic {T}ranscription of {M}elody, {B}ass {L}ine, and {C}hords in {P}olyphonic {M}usic",
    volume = "32",
    year = "2008"
    }

  • M. Ryynänen, T. Virtanen, J. Paulus, and A. Klapuri, "Accompaniment separation and karaoke application based on automatic melody transcription," in IEEE International Conf. on Multimedia and Expo, Hannover, Germany, 2008.
    [BibTeX] [Download PDF]
    @inproceedings{2008_ICME,
    author = {Ryyn{\"a}nen, Matti and Virtanen, Tuomas and Paulus, Jouni and Klapuri, Anssi},
    address = "Hannover, Germany",
    booktitle = "IEEE International Conf. on Multimedia and Expo",
    month = "June",
    title = "Accompaniment separation and karaoke application based on automatic melody transcription",
    year = "2008",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ryynanen\_icme08.pdf"
    }

  • H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, "Evaluation of Finnish unit selection and HMM-based speech synthesis," in Proceedings of the 9th Annual Conference of the International Speech Communication Association, Interspeech, Brisbane, Australia, 2008, pp. 1853-1856.
    [BibTeX]
    @inproceedings{2008_InterSpecch_a,
    author = "Sil{\'e}n, Hanna and Helander, Elina and Nurminen, Jani and Gabbouj, Moncef",
    address = "Brisbane, Australia",
    booktitle = "Proceedings of the 9th Annual Conference of the International Speech Communication Associationa, Interspeech",
    keywords = "speech synthesis",
    month = "September",
    pages = "1853-1856",
    title = "{E}valuation of {F}innish unit selection and {HMM}-based speech synthesis",
    year = "2008"
    }

  • T. Virtanen, A. Taylan Cemgil, and S. Godsill, "Bayesian extensions to non-negative matrix factorisation for audio signal modelling," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 1825-1828. doi:10.1109/ICASSP.2008.4517987
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2008_ICASSP,
    author = "Virtanen, Tuomas and Taylan Cemgil, A. and Godsill, Simon",
    booktitle = "2008 IEEE International Conference on Acoustics, Speech and Signal Processing",
    title = "Bayesian extensions to non-negative matrix factorisation for audio signal modelling",
    year = "2008",
    volume = "",
    number = "",
    pages = "1825-1828",
    keywords = "Bayesian methods;Instruments;Signal processing;Time frequency analysis;Signal generators;Source separation;Spectrogram;Multiple signal classification;Music;Laboratories;acoustic signal processing;matrix decomposition;MAP estimation;source separation",
    doi = "10.1109/ICASSP.2008.4517987",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/icassp08.pdf"
    }

  • T. Virtanen, A. Mesaros, and M. Ryynänen, "Combining Pitch-Based Inference and Non-Negative Spectrogram Factorization in Separating Vocals from Polyphonic Music," in SAPA@INTERSPEECH, 2008, p. 17–22.
    [BibTeX] [Download PDF]
    @inproceedings{2008_InterSpecch_b,
    author = {Virtanen, Tuomas and Mesaros, Annamaria and Ryyn{\"a}nen, Matti},
    keywords = "NMF; vocal; separation",
    title = "Combining Pitch-Based Inference and Non-Negative Spectrogram Factorization in Separating Vocals from Polyphonic Music",
    booktitle = "SAPA@ INTERSPEECH",
    pages = "17--22",
    year = "2008",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/pitchnmf.pdf"
    }

2007

  • E. B. Bilcu and J. Astola, "Improved hybrid approach for bilingual language recognition from text," in Proceedings of the 5th International Symposium on Image and Signal Processing and Analysis, ISPA 2007, Istanbul, Turkey, 27-29 September 2007, 2007, p. 190–195.
    [BibTeX]
    @inproceedings{2007_ISPA,
    author = {Bilcu, Enik{\"o} Beatrice and Astola, Jaakko},
    editor = "Petrou, M.",
    booktitle = "Proceedings of the 5th International Symposium on Image and Signal Processing and Analysis, ISPA 2007, Istanbul, Turkey, 27-29 September 2007",
    isbn = "978-953-184-117-7",
    pages = "190--195",
    title = "{I}mproved hybrid approach for bilingual language recognition from text",
    year = "2007"
    }

  • T. Heittola and A. Klapuri, "TUT Acoustic Event Detection System 2007," in Multimodal Technologies for Perception of Humans, Joint Proceedings of the CLEAR 2007 and RT 2007 Evaluation Workshops, Baltimore, MD, USA, 2007, pp. 364-370. doi:10.1007/978-3-540-68585-2_35
    [BibTeX] [Abstract]

    This paper describes a system used in acoustic event detection task of the CLEAR 2007 evaluation. The objective of the task is to detect acoustic events (door slam, steps, paper wrapping etc.) using acoustic data from a multiple microphone set up in the meeting room environment. A system based on hidden Markov models and multi-channel audio data was implemented. Mel-Frequency Cepstral Coefficients are used to represent the power spectrum of the acoustic signal. Fully-connected three-state hidden Markov models are trained for 12 acoustic events and one-state models are trained for speech, silence, and unknown events.

    @inproceedings{2007_b,
    author = "Heittola, Toni and Klapuri, Anssi",
    editor = "R. Stiefelhagen, R. Bowers, J. Fiscus",
    abstract = "This paper describes a system used in acoustic event detection task of the CLEAR 2007 evaluation. The objective of the task is to detect acoustic events (door slam, steps, paper wrapping etc.) using acoustic data from a multiple microphone set up in the meeting room environment. A system based on hidden Markov models and multi-channel audio data was implemented. Mel-Frequency Cepstral Coefficients are used to represent the power spectrum of the acoustic signal. Fully-connected three-state hidden Markov models are trained for 12 acoustic events and one-state models are trained for speech, silence, and unknown events.",
    address = "Baltimore, MD, USA",
    booktitle = "Multimodal Technologies for Perception of Humans,Joint Proceedings of the CLEAR 2007 and RT 2007 Evaluation Workshops",
    doi = "http://dx.doi.org/10.1007/978-3-540-68585-2\_35",
    keywords = "sound event detection;HMM",
    pages = "364-370",
    title = "{TUT} {A}coustic {E}vent {D}etection {S}ystem 2007",
    volume = "4625",
    year = "2007"
    }
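
    To make the classification pipeline in the abstract above concrete, the sketch below trains one Gaussian-emission HMM per event class on MFCC features and labels a test clip by maximum log-likelihood. It is only an illustrative approximation of the described detection system (it classifies isolated clips rather than segmenting continuous multi-channel recordings), and it assumes the third-party librosa and hmmlearn packages, hypothetical file lists, a 16 kHz sampling rate, 13 coefficients, and diagonal covariances.

    # Illustrative sketch only (not the CLEAR 2007 system itself): per-class
    # MFCC + HMM classification of isolated audio clips.
    # Assumes librosa and hmmlearn are installed; file lists are hypothetical.
    import numpy as np
    import librosa
    from hmmlearn.hmm import GaussianHMM

    def mfcc_features(path, sr=16000, n_mfcc=13):
        """Load one audio clip and return an (n_frames, n_mfcc) MFCC matrix."""
        y, sr = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def train_event_models(train_clips, n_states=3):
        """train_clips: dict mapping an event label to a list of training clip paths."""
        models = {}
        for label, paths in train_clips.items():
            feats = [mfcc_features(p) for p in paths]
            X, lengths = np.vstack(feats), [f.shape[0] for f in feats]
            model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
            model.fit(X, lengths)   # one three-state HMM per acoustic event class
            models[label] = model
        return models

    def classify_clip(path, models):
        """Return the event label whose HMM gives the highest log-likelihood."""
        X = mfcc_features(path)
        return max(models, key=lambda label: models[label].score(X))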

  • E. Helander, H. Silén, and M. Gabbouj, "The use of diphone variants in optimal text selection for Finnish unit selection speech synthesis," in Proceedings of the 12th International conference Speech and Computer, SPECOM, Moscow, Russia, 2007, pp. 293-298.
    [BibTeX]
    @inproceedings{2007_SPECOM,
    author = "Helander, Elina and Sil{\'e}n, Hanna and Gabbouj, Moncef",
    address = "Moscow, Russia",
    booktitle = "Proceedings of the 12th International conference Speech and Computer, SPECOM",
    keywords = "speech synthesis",
    month = "October",
    pages = "293-298",
    title = "{T}he use of diphone variants in optimal text selection for {F}innish unit selection speech synthesis",
    year = "2007"
    }

  • E. Helander and J. Nurminen, "A novel method for prosody prediction in voice conversion," in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Honolulu, Hawaii, USA, 2007, pp. 509-512.
    [BibTeX]
    @inproceedings{2007_ICASSP,
    author = "Helander, Elina and Nurminen, Jani",
    address = "Honolulu, Hawaii, USA",
    booktitle = "IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP",
    keywords = "voice conversion",
    month = "April",
    pages = "509-512",
    title = "{A} novel method for prosody prediction in voice conversion",
    volume = "4",
    year = "2007"
    }

  • E. Helander and J. Nurminen, "On the importance of pure prosody in the perception of speaker identity," in Proceedings of the 8th Annual Conference of the International Speech Communication Association, Interspeech, Antwerp, Belgium, 2007, pp. 2665-2668.
    [BibTeX]
    @inproceedings{2007_InterSpecch,
    author = "Helander, Elina and Nurminen, Jani",
    address = "Antwerp, Belgium",
    booktitle = "Proceedings of the 8th Annual Conference of the International Speech Communication Association, Interspeech",
    month = "August",
    pages = "2665-2668",
    title = "{O}n the importance of pure prosody in the perception of speaker identity",
    year = "2007"
    }

  • E. Helander, J. Nurminen, and M. Gabbouj, "Analysis of LSF frame selection in voice conversion," in Proceedings of the 12th International conference Speech and Computer, SPECOM, Moscow, Russia, 2007, pp. 651-656.
    [BibTeX]
    @inproceedings{2007_SPECOM_a,
    author = "Helander, Elina and Nurminen, Jani and Gabbouj, Moncef",
    address = "Moscow, Russia",
    booktitle = "Proceedings of the 12th International conference Speech and Computer, SPECOM",
    month = "October",
    pages = "651-656",
    title = "{A}nalysis of {LSF} frame selection in voice conversion",
    year = "2007"
    }

  • M. Helen and T. Virtanen, "Query by example of audio signals using Euclidean distance between Gaussian mixture models," in Proceedings of 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Honolulu, Hawaii, USA, 15-20 April 2007, 2007, p. 225–228.
    [BibTeX]
    @inproceedings{2007_ICASSP_e,
    author = "Helen, M. and Virtanen, T.",
    title = "Query by example of audio signals using Euclidean distance between Gaussian mixture models",
    note = "Contribution: organisation=sgn,FACT1=1",
    year = "2007",
    language = "English",
    isbn = "1-4244-0728-1",
    pages = "225--228",
    booktitle = "Proceedings of 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Honolulu, Hawaii, USA, 15-20 April 2007"
    }

  • M. Helén and T. Virtanen, "A Similarity Measure for Audio Query by Example Based on Perceptual Coding and Compression," in proc. 10th International Conference on Digital Audio Effects (DAFx-07), 2007.
    [BibTeX] [Download PDF]
    @inproceedings{2007_DAFx-07,
    author = "Hel{\'e}n, Marko and Virtanen, Tuomas",
    booktitle = "proc. 10th International Conference on Digital Audio Effects (DAFx-07)",
    month = "September",
    title = "A Similarity Measure for Audio Query by Example Based on Perceptual Coding and Compression",
    year = "2007",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/dafx\_helen4.pdf"
    }

  • A. Klapuri, "Analysis of musical instrument sounds by source-filter-decay model," in IEEE International Conference on Audio, Speech and Signal Processing (ICASSP), Hawaii, USA, 2007.
    [BibTeX]
    @inproceedings{2007_ICASSP_a,
    author = "Klapuri, Anssi",
    address = "Hawaii, USA",
    booktitle = "IEEE International Conference on Audio, Speech and Signal Processing (ICASSP)",
    keywords = "instruments; source-filter-decay model",
    title = "{A}nalysis of musical instrument sounds by source-filter-decay model",
    year = "2007"
    }

  • T. Korhonen and P. Pertilä, "TUT acoustic source tracking system 2007," in Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, USA, 2007, pp. 104-112.
    [BibTeX]
    @inproceedings{2007,
    author = {Korhonen, Teemu and Pertil{\"a}, Pasi},
    address = "Baltimore, USA",
    booktitle = "Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007",
    keywords = "Speaker tracking",
    month = "May",
    pages = "104-112",
    publisher = "LNCS",
    title = "{TUT} acoustic source tracking system 2007",
    year = "2007"
    }

  • M. McKinney, D. Moelants, M. E. P. Davies, and A. Klapuri, "Evaluation of audio beat tracking and music tempo extraction algorithms," Journal of New Music Research, vol. 36, iss. 1, pp. 1-16, 2007.
    [BibTeX] [Abstract]

    This is an extended analysis of eight different algorithms for musical tempo extraction and beat tracking. The algorithms participated in the 2006 Music Information Retrieval Evaluation eXchange (MIREX), where they were evaluated using a set of 140 musical excerpts, each with beats annotated by 40 different listeners. Performance metrics were constructed to measure the algorithms’ abilities to predict the most perceptually salient musical beats and tempi of the excerpts. Detailed results of the evaluation are presented here and algorithm performance is evaluated as a function of musical genre, the presence of percussion, musical meter and the most salient perceptual tempo of each excerpt.

    @article{2007_JNMR,
    author = "McKinney, M. and Moelants, D. and Davies, M. E. P. and Klapuri, Anssi",
    abstract = "This is an extended analysis of eight different algorithms for musical tempo extraction and beat tracking. The algorithms participated in the 2006 Music Informa- tion Retrieval Evaluation eXchange (MIREX), where they were evaluated using a set of 140 musical excerpts, each with beats annotated by 40 different listeners. Performance metrics were constructed to measure the algorithms’ abilities to predict the most perceptually salient musical beats and tempi of the excerpts. Detailed results of the evaluation are presented here and algorithm performance is evaluated as a function of musical genre, the presence of percussion, musical meter and the most salient perceptual tempo of each excerpt.",
    journal = "Journal of New Music Research",
    number = "1",
    pages = "1-16",
    title = "{E}valuation of audio beat tracking and music tempo extraction algorithms",
    volume = "36",
    year = "2007"
    }
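
    As a pointer to what this evaluation measures, the sketch below computes one very simple tolerance-window beat accuracy against a single listener's annotations. It is illustrative only: the paper defines perceptually weighted metrics over 40 annotators, and the 70 ms tolerance used here is an arbitrary assumption.

    # Illustrative only: a crude tolerance-window beat accuracy, far simpler than
    # the perceptually motivated metrics used in the MIREX 2006 evaluation above.
    import numpy as np

    def beat_hit_rate(estimated, annotated, tol=0.07):
        """Fraction of annotated beat times (seconds) that have at least one
        estimated beat within +/- tol seconds."""
        estimated = np.asarray(estimated, dtype=float)
        if len(annotated) == 0:
            return 0.0
        hits = sum(np.any(np.abs(estimated - t) <= tol) for t in annotated)
        return hits / len(annotated)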

  • A. Mesaros, T. Virtanen, and A. Klapuri, "Singer Identification in Polyphonic Music Using Vocal Separation and Pattern Recognition Methods," in International Conference on Music Information Retrieval, Vienna, Austria, 2007.
    [BibTeX] [Download PDF]
    @inproceedings{2007_ISMIR_a,
    author = "Mesaros, Annamaria and Virtanen, Tuomas and Klapuri, Anssi",
    address = "Vienna, Austria",
    booktitle = "International Conference on Music Information Retrieval",
    keywords = "singer identification; separation",
    title = "Singer Identification in Polyphonic Music Using Vocal Separation and Pattern Recognition Methods",
    year = "2007",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ismir2007mesaros.pdf"
    }

  • J. Paulus and A. Klapuri, "Combining temporal and spectral features in HMM-based drum transcription," in Proc. of the 8th International Conference on Music Information Retrieval, Vienna, Austria, 2007, p. 225–228.
    [BibTeX]
    @inproceedings{2007_ISMIR,
    author = "Paulus, Jouni and Klapuri, Anssi",
    address = "Vienna, Austria",
    booktitle = "Proc. of the 8th International Conference on Music Information Retrieval",
    keywords = "drums; HMM",
    month = "Sep",
    pages = "225--228",
    title = "{C}ombining temporal and spectral features in {HMM}-based drum transcription",
    year = "2007"
    }

  • P. Pertilä, T. Korhonen, T. Pirinen, and M. Parviainen, "TUT acoustic source tracking system 2006," in Lecture Notes in Computer Science, 2007, pp. 127-136.
    [BibTeX]
    @inproceedings{2007_LNCS,
    author = {Pertil{\"a}, Pasi and Korhonen, Teemu and Pirinen, Tuomo and Parviainen, Mikko},
    booktitle = "Lecture Notes in Computer Science",
    keywords = "Speaker tracking",
    pages = "127-136",
    title = "{TUT} acoustic source tracking system 2006",
    volume = "4122",
    year = "2007"
    }

  • P. Pertilä and M. Parviainen, "Robust speaker localization in meeting room domain," in Proceedings of 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Honolulu, Hawaii, USA, 2007, pp. 497-500.
    [BibTeX]
    @inproceedings{2007_ICASSP_b,
    author = {Pertil{\"a}, Pasi and Parviainen, Mikko},
    address = "Honolulu, Hawaii, USA",
    booktitle = "Proceedings of 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP",
    month = "April",
    pages = "497-500",
    title = "{R}obust speaker localization in meeting room domain",
    volume = "4",
    year = "2007"
    }

  • P. Pertilä, "Sound source localization in a Bayesian framework," in Digest of TISE Seminar 2007, Nokia, Finland, 5 June 2007, 2007, p. 64–69.
    [BibTeX]
    @inproceedings{2007_TISE,
    author = {Pertil{\"a}, P.},
    editor = "Koivisto, P.",
    title = "Sound source localization in a Bayesian framework",
    note = "Contribution: organisation=sgn,FACT1=1",
    year = "2007",
    language = "English",
    isbn = "978-952-15-1768-6",
    pages = "64--69",
    booktitle = "Digest of TISE Seminar 2007, Nokia, Finland, 5 June 2007"
    }

  • T. Pirinen, P. Pertilä, and T. Korhonen, "Seinille kasvaa korvat" [in Finnish: "The walls grow ears"], Prosessori, iss. 2, pp. 46-47, 2007.
    [BibTeX]
    @article{2007_a,
    author = {Pirinen, Tuomo and Pertil{\"a}, Pasi and Korhonen, Teemu},
    journal = "Prosessori",
    number = "2",
    pages = "46-47",
    title = "{S}einille kasvaa korvat",
    year = "2007"
    }

  • M. Ryynänen and A. Klapuri, "Automatic bass line transcription from streaming polyphonic audio," in IEEE International Conference on Audio, Speech and Signal Processing (ICASSP), Hawaii, USA, 2007.
    [BibTeX]
    @inproceedings{2007_ICASSP_c,
    author = {Ryyn{\"a}nen, Matti and Klapuri, Anssi},
    address = "Hawaii, USA",
    booktitle = "IEEE International Conference on Audio, Speech and Signal Processing (ICASSP)",
    title = "{A}utomatic bass line transcription from streaming polyphonic audio",
    year = "2007"
    }

  • H. Silén, E. Helander, K. Koppinen, and M. Gabbouj, "Building a Finnish unit selection TTS system," in Proceedings of the 6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 2007, pp. 310-315.
    [BibTeX]
    @inproceedings{2007_ISCA,
    author = "Sil{\'e}n, Hanna and Helander, Elina and Koppinen, Konsta and Gabbouj, Moncef",
    address = "Bonn, Gemany",
    booktitle = "Proceedings of the 6th ISCA Workshop on Speech Synthesis",
    keywords = "speech synthesis",
    month = "August",
    pages = "310-315",
    title = "{B}uilding a {F}innish unit selection {TTS} system",
    year = "2007"
    }

  • T. Virtanen, "Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, iss. 3, pp. 1066-1074, 2007. doi:10.1109/TASL.2006.885253
    [BibTeX] [Download PDF]
    @ARTICLE{2007_TASLP,
    author = "Virtanen, Tuomas",
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    title = "Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria",
    year = "2007",
    volume = "15",
    number = "3",
    pages = "1066-1074",
    keywords = "Source separation;Unsupervised learning;Multiple signal classification;Spectrogram;Machine learning algorithms;Music;Sparse matrices;Humans;Independent component analysis;Costs;Acoustic signal analysis;audio source separation;blind source separation;music;nonnegative matrix factorization;sparse coding;unsupervised learning",
    doi = "10.1109/TASL.2006.885253",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/virtanen\_taslp2007.pdf"
    }

  • T. Virtanen and M. Helen, "Probabilistic Model Based Similarity Measures for Audio Query-by-Example," in 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2007, pp. 82-85. doi:10.1109/ASPAA.2007.4393031
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2007_WASPAA,
    author = "Virtanen, Tuomas and Helen, Marko",
    booktitle = "2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    title = "Probabilistic Model Based Similarity Measures for Audio Query-by-Example",
    year = "2007",
    volume = "",
    number = "",
    pages = "82-85",
    keywords = "Hidden Markov models;Testing;Acoustic measurements;Distortion measurement;Signal generators;Speech;Databases;Humans;Conferences;Acoustic signal processing",
    doi = "10.1109/ASPAA.2007.4393031",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/virtanen\_waspaa07.pdf"
    }

2006

  • E. B. Bilcu and J. Astola, "Neural networks with random letter codes for text-to-phoneme mapping and small training dictionary," in Proceedings of the 14th European Signal Processing Conference, EUSIPCO, Florence, Italy, 2006.
    [BibTeX]
    @inproceedings{2006_EUSIPCO,
    author = {Bilcu, Enik{\"o} Beatrice and Astola, Jaakko},
    address = "Florence, Italy",
    booktitle = "Proceedings of the 14th European Signal Processing Conference, EUSIPCO",
    month = "September",
    title = "{N}eural networks with random letter codes for text-to-phoneme mapping and small training dictionary",
    year = "2006"
    }

  • E. B. Bilcu and J. Astola, "A Hybrid neural network for language identification from text," in Proceedings of the 2006 IEEE International Workshop on Machine Learning for Signal Processing, Maynooth, Ireland, 2006, pp. 253-258.
    [BibTeX]
    @inproceedings{2006_MLSP,
    author = {Bilcu, Enik{\"o} Beatrice and Astola, Jaakko},
    address = "Maynooth, Ireland",
    booktitle = "Proceedings of the 2006 IEEE International Workshop on Machine Learning for Signal Processing",
    month = "September",
    pages = "253-258",
    title = "{A} {H}ybrid neural network for language identification from text",
    year = "2006"
    }

  • A. Eronen, V. Peltonen, J. Tuomi, A. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, iss. 1, pp. 321-329, 2006. doi:10.1109/TSA.2005.854103
    [BibTeX]
    @ARTICLE{2006_TASLP,
    author = {Eronen, Antti and Peltonen, Vesa and Tuomi, Juha and Klapuri, Anssi and Fagerlund, Seppo and Sorsa, Timo and Lorho, Ga{\"e}tan and Huopaniemi, Jyri},
    journal = "IEEE Transactions on Audio, Speech, and Language Processing",
    title = "Audio-based context recognition",
    year = "2006",
    volume = "14",
    number = "1",
    pages = "321-329",
    keywords = "Humans;System testing;Hidden Markov models;Acoustic devices;Context awareness;Mobile handsets;Acoustic signal processing;Computational complexity;Vectors;Feature extraction;Audio classification;context awareness;feature extraction;hidden Markov models (HMMs)",
    doi = "10.1109/TSA.2005.854103"
    }

  • D. FitzGerald and J. Paulus, "Unpitched Percussion Transcription," in Signal Processing Methods for Music Transcription, A. Klapuri and M. Davy, Eds., Springer-Verlag, 2006, p. 131–162.
    [BibTeX] [Download PDF]
    @incollection{2006_b,
    author = "FitzGerald, Derry and Paulus, Jouni",
    editor = "Klapuri, Anssi and Davy, Manuel",
    booktitle = "Signal Processing Methods for Music Transcription",
    pages = "131--162",
    publisher = "Springer-Verlag",
    title = "Unpitched Percussion Transcription",
    url = "http://www.springerlink.com/content/w4175761l68h5t85",
    year = "2006"
    }

  • F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano, "An experimental comparison of audio tempo induction algorithms," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, iss. 5, pp. 1832-1844, 2006.
    [BibTeX] [Download PDF]
    @article{2006_TASLP_a,
    author = "Gouyon, Fabien and Klapuri, Anssi and Dixon, Simon and Alonso, Miguel and Tzanetakis, George and Uhle, Christian and Cano, Pedro",
    journal = "IEEE Trans. Audio, Speech, and Language Processing",
    keywords = "tempo",
    month = "Sept",
    number = "5",
    pages = "1832-1844",
    title = "{A}n experimental comparison of audio tempo induction algorithms",
    url = "http://ieeexplore.ieee.org/stamp/stamp.jsp?tp={{{\\&}}}arnumber=1678001{{{\\&}}}isnumber=35293",
    volume = "14",
    year = "2006"
    }

  • M. Helén and T. Lahti, "Query by Example Methods for Audio Signals," in 7th Nordic Signal Processing Symposium (NORSIG 2006), Reykjavik, Iceland, 2006.
    [BibTeX]
    @inproceedings{2006_NORSIG_2006,
    author = "Hel{\'e}n, Marko and Lahti, Tommi",
    address = "Reykjavik, Iceland",
    booktitle = "7th Nordic Signal Processing Symposium (NORSIG 2006)",
    keywords = "query by example",
    month = "June",
    title = "{Q}uery by {E}xample {M}ethods for {A}udio {S}ignals",
    year = "2006"
    }

  • P. Herrera-Boyer, A. Klapuri, and M. Davy, "Automatic classification of pitched musical instrument sounds," in Signal Processing Methods for Music Transcription, 2006, p. 163–200. doi:10.1007/0-387-32845-9_6
    [BibTeX]
    @inproceedings{2006_SPMMT,
    author = "Herrera-Boyer, P. and Klapuri, Anssi and Davy, Manuel",
    editor = "Klapuri, A. and Davy, M.",
    booktitle = "Signal Processing Methods for Music Transcription",
    doi = "10.1007/0-387-32845-9\_6",
    isbn = "978-0-387-30667-4",
    pages = "163--200",
    publisher = "Springer",
    title = "{A}utomatic classification of pitched musical instrument sounds",
    year = "2006"
    }

  • A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription, Berlin, Heidelberg: Springer-Verlag, 2006.
    [BibTeX]
    @book{2006,
    author = "Klapuri, Anssi and Davy, Manuel",
    title = "Signal Processing Methods for Music Transcription",
    year = "2006",
    isbn = "0387306676",
    publisher = "Springer-Verlag",
    address = "Berlin, Heidelberg"
    }

  • A. Klapuri, "Auditory model-based methods for multiple fundamental frequency estimation," in Signal Processing Methods for Music Transcription, 2006, p. 229–265. doi:10.1007/0-387-32845-9_8
    [BibTeX]
    @inproceedings{2006_SPMMT_a,
    author = "Klapuri, Anssi",
    editor = "Klapuri, A. and Davy, M.",
    booktitle = "Signal Processing Methods for Music Transcription",
    doi = "10.1007/0-387-32845-9\_8",
    isbn = "978-0-387-30667-4",
    pages = "229--265",
    publisher = "Springer",
    title = "{A}uditory model-based methods for multiple fundamental frequency estimation",
    year = "2006"
    }

  • A. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, iss. 1, 2006.
    [BibTeX]
    @article{2006_TASLP_b,
    author = "Klapuri, Anssi and Eronen, Antti and Astola, Jaakko",
    journal = "IEEE Trans. Audio, Speech, and Language Processing",
    keywords = "meter",
    number = "1",
    title = "Analysis of the meter of acoustic musical signals",
    volume = "14",
    year = "2006"
    }

  • A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amplitudes," in 7th International Conference on Music Information Retrieval (ISMIR-06), Victoria, Canada, 2006.
    [BibTeX]
    @inproceedings{2006_ISMIR-06,
    author = "Klapuri, Anssi",
    address = "Victoria, Canada",
    booktitle = "7th International Conference on Music Information Retrieval (ISMIR-06)",
    month = "Oct",
    title = "Multiple fundamental frequency estimation by summing harmonic amplitudes",
    year = "2006"
    }

  • A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amlitudes," in Proceedings of the 7th International Conference on Music Information Retrieval, ISMIR 2006, Victoria, BC, Canada, 8-12 October 2006, 2006, p. 216–221.
    [BibTeX]
    @inproceedings{2006_ISMIR_a,
    author = "Klapuri, A.",
    editor = "Dannenberg, R.",
    title = "Multiple fundamental frequency estimation by summing harmonic amlitudes",
    note = "Contribution: organisation=sgn,FACT1=1",
    year = "2006",
    language = "English",
    pages = "216--221",
    booktitle = "Proceedings of the 7th International Conference on Music Information Retrieval, ISMIR 2006, Victoria, BC, Canada, 8-12 October 2006"
    }

  • M. Parviainen, T. Pirinen, and P. Pertilä, "A speaker localization system for lecture room environment," in Machine Learning for Multimodal Interaction, the Third International Workshop, MLMI, 2006, pp. 225-235.
    [BibTeX]
    @inproceedings{2006_MLMI,
    author = {Parviainen, Mikko and Pirinen, Tuomo and Pertil{\"a}, Pasi},
    booktitle = "Machine Learning for Multimodal Interaction, the Third International Workshop, MLMI",
    keywords = "Speaker tracking",
    pages = "225-235",
    title = "{A} speaker localization system for lecture room environment",
    volume = "4299",
    year = "2006"
    }

  • J. Paulus, "Acoustic Modelling of Drum Sounds with Hidden Markov Models for Music Transcription," in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006, p. V-V. doi:10.1109/ICASSP.2006.1661257
    [BibTeX]
    @INPROCEEDINGS{2006_ICASSP,
    author = "Paulus, J.",
    booktitle = "2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings",
    title = "Acoustic Modelling of Drum Sounds with Hidden Markov Models for Music Transcription",
    year = "2006",
    volume = "5",
    number = "",
    pages = "V-V",
    keywords = "Hidden Markov models;Multiple signal classification;Music;Instruments;Pattern recognition;Source separation;Signal analysis;Acoustic signal processing;Acoustic signal detection;Taxonomy",
    doi = "10.1109/ICASSP.2006.1661257"
    }

  • J. Paulus and A. Klapuri, "Music Structure Analysis by Finding Repeated Parts," in Proc. of the 1st ACM Audio and Music Computing Multimedia Workshop, Santa Barbara, CA, USA, 2006, p. 59–68.
    [BibTeX]
    @inproceedings{2006_a,
    author = "Paulus, Jouni and Klapuri, Anssi",
    address = "Santa Barbara, CA, USA",
    booktitle = "Proc. of the 1st ACM Audio and Music Computing Multimedia Workshop",
    month = "Oct",
    pages = "59--68",
    title = "Music Structure Analysis by Finding Repeated Parts",
    year = "2006"
    }

  • P. Pertilä, "Sound source localization system," in Digest of TISE Seminar 2006, Siivikkala, Ylöjärvi, Finland, 31 May 2006. TISE publications, 2006, p. 28–33.
    [BibTeX]
    @inproceedings{2006_TISE,
    author = {Pertil{\"a}, P.},
    editor = "Koivisto, P.",
    title = "Sound source localization system",
    note = "ISBN 952-15-1591-0, ISSN 1458-8463
    Contribution: organisation=sgn,FACT1=1", year = "2006", language = "English", isbn = "952-15-1591-0", pages = "28--33", booktitle = {Digest of TISE Seminar 2006, Siivikkala, Yl{\"o}j{\"a}rvi, Finland, 31 May 2006. TISE publications} }

  • T. Pirinen and A. Visa, "Signal independent wideband activity detection features for microphone arrays," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2006, Toulouse, France, 14-19 May 2006, 2006, p. 1109–1112.
    [BibTeX]
    @inproceedings{2006_ICASSP_a,
    author = "Pirinen, Tuomo and Visa, Ari",
    booktitle = "Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2006, Toulouse, France, 14-19 May 2006",
    pages = "1109--1112",
    title = "{S}ignal independent wideband activity detection features for microphone arrays",
    year = "2006"
    }

  • T. Pirinen, "A Lattice viewpoint for direction of arrival estimation using quantized time differences of arrival," in Proceedings of the Fourth IEEE Workshop on Sensor Array and Multichannel Processing, SAM, Waltham, Massachusetts, USA, 12-14 July 2006, 2006, p. 50–54.
    [BibTeX]
    @inproceedings{2006_SAM,
    author = "Pirinen, Tuomo",
    booktitle = "Proceedings of the Fourth IEEE Workshop on Sensor Array and Multichannel Processing, SAM, Waltham, Massachusetts, USA, 12-14 July 2006",
    pages = "50--54",
    title = "{A} {L}attice viewpoint for direction of arrival estimation using quantized time differences of arrival",
    year = "2006"
    }

  • M. Ryynänen and A. Klapuri, "Transcription of the Singing Melody in Polyphonic Music," in Proc. 7th International Conference on Music Information Retrieval (ISMIR 2006), Victoria, Canada, 2006.
    [BibTeX]
    @inproceedings{2006_ISMIR_2006,
    author = {Ryyn{\"a}nen, Matti and Klapuri, Anssi},
    address = "Victoria, Canada",
    booktitle = "Proc. 7th International Conference on Music Information Retrieval (ISMIR 2006)",
    month = "October",
    title = "{T}ranscription of the {S}inging {M}elody in {P}olyphonic {M}usic",
    year = "2006"
    }

  • M. Ryynänen, "Singing transcription," in Signal Processing Methods for Music Transcription, 2006, p. 361–391.
    [BibTeX]
    @inproceedings{2006_SPMMT_b,
    author = {Ryyn{\"a}nen, Matti},
    editor = "Klapuri, A. and Davy, M.",
    booktitle = "Signal Processing Methods for Music Transcription",
    isbn = "0-837-30667-6",
    pages = "361--391",
    publisher = "Springer",
    title = "{S}inging transcription",
    year = "2006"
    }

  • T. Virtanen, "Speech Recognition Using Factorial Hidden Markov Models for Separation in the Feature Space," in proc. Interspeech, Pittsburgh, USA, 2006.
    [BibTeX] [Download PDF]
    @inproceedings{2006_InterSpecch,
    author = "Virtanen, Tuomas",
    address = "Pittsburgh, USA",
    booktitle = "proc. Interspeech",
    keywords = "speech recognition; HMM;",
    title = "Speech Recognition Using Factorial Hidden Markov Models for Separation in the Feature Space",
    year = "2006",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/speech/icslp06.pdf"
    }

  • T. Virtanen and A. Klapuri, "Analysis of polyphonic audio using source-filter model and non-negative matrix factorization," in Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop, 2006.
    [BibTeX] [Download PDF]
    @inproceedings{2006_c,
    author = "Virtanen, Tuomas and Klapuri, Anssi",
    booktitle = "Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop",
    keywords = "instruments; source-filter model; NMF",
    title = "Analysis of polyphonic audio using source-filter model and non-negative matrix factorization",
    year = "2006",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/virtanen-AMAC2006.pdf"
    }

  • T. Virtanen, "Unsupervised learning methods for source separation in monaural music signals," in Signal Processing Methods for Music Transcription, A. Klapuri and M. Davy, Eds., Springer, 2006, p. 267–296.
    [BibTeX]
    @inbook{2006_h,
    author = "Virtanen, T.",
    editor = "Klapuri, A. and Davy, M.",
    title = "Unsupervised learning methods for source separation in monaural music signals",
    note = "Contribution: organisation=sgn,FACT1=1",
    year = "2006",
    language = "English",
    isbn = "0-837-30667-6",
    pages = "267--296",
    booktitle = "Signal Processing Methods for Music Transcription",
    publisher = "Springer"
    }

2005

  • E. B. Bilcu, J. Astola, and J. Saarinen, "Comparative study of letter encoding for text-to-phoneme mapping," in Proceedings of 13. European Signal Processing Conference, EUSIPCO, Antalya, Turkey, 2005.
    [BibTeX] [Abstract]

    Text-to-phoneme mapping is a very important preliminary step in any text-to-speech synthesis system. In this paper, we study the performance of the multilayer perceptron (MLP) neural network for the problem of text-to-phoneme mapping. Specifically, we study the influence of the input letter encoding on the conversion accuracy of such a system. We show that for large network complexities the orthogonal binary codes (as introduced in NetTalk) give better performance. On the other hand in applications that require very small memory load and computational complexity other compact codes may be more suitable. This study is a first step toward implementing a neural network based text-to-phoneme mapping in mobile devices.

    @inproceedings{2005_EUSIPCO_a,
    author = {Bilcu, Enik{\"o} Beatrice and Astola, Jaakko and Saarinen, Jukka},
    abstract = "Text-to-phoneme mapping is a very important preliminary step in any text-to-speech synthesis system. In this paper, we study the performances of the multilayer perceptron (MLP) neural network for the problem of text-to-phoneme mapping. Specifically, we study the influence of the input letter encoding in the conversion accuracy of such system. We show, that for large network complexities the orthogonal binary codes (as introduced in NetTalk) gives better performance. On the other hand in applications that require very small memory load and computational complexity other compact codes may be more suitable. This study is a first step toward implementation a neural network based text-to-phoneme mapping in mobile devices.",
    address = "Antalya, Turkey",
    booktitle = "Proceedings of 13. European Signal Processing Conference, EUSIPCO",
    month = "September",
    title = "{C}omparative study of letter encoding for text-to-phoneme mapping",
    year = "2005"
    }
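
    To illustrate the encoding comparison in the abstract above, the sketch below contrasts an orthogonal one-hot letter code with a compact binary code and concatenates the codes over a letter-context window, which is the kind of input vector an MLP text-to-phoneme mapper receives. The alphabet, window size, and bit count are made-up assumptions, not values from the paper.

    # Illustrative sketch of the two letter-encoding styles compared above:
    # orthogonal one-hot codes vs. a compact binary code, concatenated over a
    # context window of letters as MLP input. Alphabet and window size are made up.
    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz_"   # '_' pads word boundaries

    def one_hot(letter):
        """Orthogonal (NetTalk-style) code: one input unit per letter."""
        v = np.zeros(len(ALPHABET))
        v[ALPHABET.index(letter)] = 1.0
        return v

    def compact_binary(letter, bits=5):
        """Compact code: the letter index written with `bits` binary digits."""
        idx = ALPHABET.index(letter)
        return np.array([(idx >> b) & 1 for b in range(bits)], dtype=float)

    def encode_window(word, pos, half=3, codec=one_hot):
        """Concatenate letter codes around position `pos`; this vector would be
        fed to the MLP that predicts the phoneme of the centre letter."""
        padded = "_" * half + word.lower() + "_" * half
        window = padded[pos : pos + 2 * half + 1]
        return np.concatenate([codec(ch) for ch in window])

    # Example: a 7-letter window around the first letter of "phoneme".
    x_orth = encode_window("phoneme", 0)                        # length 7 * 27 = 189
    x_comp = encode_window("phoneme", 0, codec=compact_binary)  # length 7 * 5 = 35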

  • M. Helén and T. Virtanen, "Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine," in 2005 13th European Signal Processing Conference, 2005, p. 1–4.
    [BibTeX]
    @inproceedings{2005_EUSIPCO_c,
    author = "Hel{\'e}n, Marko and Virtanen, Tuomas",
    title = "Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine",
    booktitle = "2005 13th European Signal Processing Conference",
    pages = "1--4",
    year = "2005",
    organization = "IEEE"
    }

  • A. Klapuri, T. Virtanen, and M. Helén, "Modeling musical sounds with an interpolating state model," in Proc. European signal processing conference, Antalya, Turkey, 2005.
    [BibTeX] [Download PDF]
    @inproceedings{2005_EUSIPCO_b,
    author = "Klapuri, Anssi and Virtanen, Tuomas and Hel{\'e}n, Marko",
    address = "Antalya, Turkey",
    booktitle = "Proc. European signal processing conference",
    keywords = "instruments; interpolating state model",
    title = "Modeling musical sounds with an interpolating state model",
    year = "2005",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ism.pdf"
    }

  • A. Klapuri, "A perceptually motivated multiple-F0 estimation method," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 2005.
    [BibTeX]
    @inproceedings{2005_WASPAA,
    author = "Klapuri, Anssi",
    address = "New Paltz, New York",
    booktitle = "Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    month = "Oct",
    title = "{A} perceptually motivated multiple-{F}0 estimation method",
    year = "2005"
    }

  • T. Korhonen, P. Pertilä, and A. Visa, "Particle filtering in high clutter environment," in Proceedings of the 2005 Finnish Signal Processing Symposium - FINSIG'05, Kuopio, Finland, 2005, pp. 12-15.
    [BibTeX]
    @inproceedings{2005_FINSIG,
    author = {Korhonen, Teemu and Pertil{\"a}, Pasi and Visa, Ari},
    address = "Kuopio, Finland",
    booktitle = "Proceedings of the 2005 Finnish Signal Processing Symposium - FINSIG'05",
    month = "August",
    pages = "12-15",
    title = "{P}article filtering in high clutter environment",
    year = "2005"
    }

  • A. Mesaros and J. Astola, "Inter-dependence of spectral measures for the singing voice," in Proceedings of International Symposium on Signal, Circuits and Systems, ISSCS, Iasi, Romania, 2005, pp. 307-310.
    [BibTeX]
    @inproceedings{2005_ISSCS,
    author = "Mesaros, Annamaria and Astola, Jaakko",
    address = "Iasi, Romania",
    booktitle = "Proceedings of International Symposium on Signal, Circuits and Systems, ISSCS",
    keywords = "singing",
    month = "July",
    pages = "307-310",
    title = "{I}nter-dependence of spectral measures for the singing voice",
    year = "2005"
    }

  • T. Mikkonen, "Homogeneous graph invariants," in International Conference on Discrete Mathematics and its Applications, Tamil Nadu, India, 9-11 December 2005, 2005, p. 4 p.
    [BibTeX]
    @inproceedings{2005_ICA_a,
    author = "Mikkonen, Tomi",
    booktitle = "International conference on Discrete Mathematics and ist applications, Tamil Nadu, India, 9-11 December 2005",
    pages = "4 p",
    title = "Homogeneous graph invariants",
    year = "2005"
    }

  • M. Parviainen, P. Pertilä, T. Korhonen, and A. Visa, "A spatiotemporal approach for passive sound source localization - real-world experiments," in Proceedings of International Workshop on Nonlinear Signal and Image Processing, NSIP, Sapporo, Japan, 2005, pp. 468-473.
    [BibTeX]
    @inproceedings{2005_NSIP,
    author = {Parviainen, Mikko and Pertil{\"a}, Pasi and Korhonen, Teemu and Visa, Ari},
    address = "Sapporo, Japan",
    booktitle = "Proceedings of International Workshop on Nonlinear Signal and Image Processing, NSIP",
    month = "May",
    pages = "468-473",
    title = "{A} spatiotemporal approach for passive sound source localization - real-world experiments",
    year = "2005"
    }

  • J. Paulus and T. Virtanen, "Drum Transcription with Non-negative Spectrogram Factorisation," in Proc. of the 13th European Signal Processing Conference, Antalya, Turkey, 2005.
    [BibTeX] [Download PDF]
    @inproceedings{2005_EUSIPCO,
    author = "Paulus, Jouni and Virtanen, Tuomas",
    address = "Antalya, Turkey",
    booktitle = "Proc. of the 13th European Signal Processing Conference",
    keywords = "drums; NMF",
    month = "Sep",
    title = "Drum Transcription with Non-negative Spectrogram Factorisation",
    year = "2005",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/eusipco05\_paulus.pdf"
    }

  • J. Paulus, "Drum Transcription from Polyphonic Music with Instrument-wise Hidden Markov Models," in Proc. of the First Annual Music Information Retrieval Evaluation eXchange, London, UK, 2005.
    [BibTeX]
    @inproceedings{2005_MIREX,
    author = "Paulus, Jouni",
    address = "London, UK",
    booktitle = "Proc. of the First Annual Music Information Retrieval Evaluation eXchange",
    keywords = "HMM",
    month = "Sep",
    title = "{D}rum {T}ranscription from {P}olyphonic {M}usic with {I}nstrument-wise {H}idden {M}arkov {M}odels",
    year = "2005"
    }

  • P. Pertilä, M. Parviainen, T. Korhonen, and A. Visa, "Moving sound source localization in large areas," in Proceedings of 2005 International Symposium on Intelligent Signal Processing and Communication Systems, ISPACS, Hong Kong, 2005, pp. 745-748.
    [BibTeX]
    @inproceedings{2005_ICA,
    author = {Pertil{\"a}, Pasi and Parviainen, Mikko and Korhonen, Teemu and Visa, Ari},
    address = "Hong Kong",
    booktitle = "Proceedings of 2005 International Symposium on Intelligent Signal Processing and Communication Systems, ISPACS",
    month = "December",
    pages = "745-748",
    title = "{M}oving sound source localization in large areas",
    year = "2005"
    }

  • P. Pertilä, "Sound source localization - a spatiotemporal approach," in Digest of TISE Seminar 2005, Terälahti, Tampere, Finland, 30 May 2005, 2005, p. 9–11.
    [BibTeX]
    @inproceedings{2005_TISE_a,
    author = {Pertil{\"a}, P.},
    editor = "Koivisto, P.",
    title = "Sound source localization - a spatiotemporal approach",
    note = "ISBN 952-15-1360-8, ISSN 1458-8463
    Contribution: organisation=sgn,FACT1=1", year = "2005", language = "English", pages = "9--11", booktitle = {Digest of TISE Seminar 2005, Ter{\"a}lahti, Tampere, Finland, 30 May 2005} }

  • A. Pertusa, A. Klapuri, and J. M. Inesta, "Recognition of note onsets in digital music using semitone bands," Lecture Notes in Computer Science, vol. 3773, p. 869–879, 2005.
    [BibTeX]
    @article{2005_LNCS_a,
    author = "Pertusa, A. and Klapuri, A. and Inesta, J.M.",
    title = "Recognition of note onsets in digital music using semitone bands",
    note = "Contribution: organisation=sgn,FACT1=1",
    year = "2005",
    language = "English",
    volume = "3773",
    pages = "869--879",
    journal = "Lecture Notes in Computer Science",
    issn = "0302-9743",
    publisher = "Springer Science and Business Media Deutschland GmbH"
    }

  • T. Pirinen, "Normalized confidence factors for robust direction of arrival estimation," in Proceedings of 2005 IEEE International Symposium on Circuits and Systems, ISCAS 2005, Kobe, Japan, 23-26 May 2005, 2005, p. 1429–1432.
    [BibTeX]
    @inproceedings{2005_ISCAS,
    author = "Pirinen, Tuomo",
    booktitle = "Proceedings of 2005 IEEE International Symposium on Circuits and Systems, ISCAS 2005, Kobe, Japan, 23-26 May 2005",
    pages = "1429--1432",
    title = "{N}ormalized confidence factors for robust direction of arrival estimation",
    year = "2005"
    }

  • T. Pirinen, P. Pertilä, and M. Parviainen, "The TUT 2005 source localization system," in Proceedings of the Rich Transcription 2005 Spring Meeting Recognition Evaluation, Edinburgh, UK, 2005, pp. 93-99.
    [BibTeX]
    @inproceedings{2005,
    author = {Pirinen, Tuomo and Pertil{\"a}, Pasi and Parviainen, Mikko},
    address = "Edinburgh, UK",
    booktitle = "Proceedings of the Rich Transcription 2005 Spring Meeting Recognition Evaluation",
    month = "July",
    pages = "93-99",
    title = "{T}he {TUT} 2005 source localization system",
    year = "2005"
    }

  • M. Ryynänen and A. Klapuri, "Polyphonic music transcription using note event modeling," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 2005.
    [BibTeX]
    @inproceedings{2005_WASPAA_a,
    author = {Ryyn{\"a}nen, Matti and Klapuri, Anssi},
    address = "New Paltz, New York",
    booktitle = "Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics",
    month = "Oct",
    title = "Polyphonic music transcription using note event modeling",
    year = "2005"
    }

  • T. Virtanen, "Methods for one-channel sound source separation," in Digest of TISE Seminar 2005, Terälahti, Tampere, Finland, 30 May 2005, 2005, p. 5–8.
    [BibTeX]
    @inproceedings{2005_TISE,
    author = "Virtanen, T.",
    editor = "Koivisto, P.",
    title = "Methods for one-channel sound source separation",
    note = "ISBN 952-15-1360-8, 1458-8463
    Contribution: organisation=sgn,FACT1=1", year = "2005", language = "English", pages = "5--8", booktitle = {Digest of TISE Seminar 2005, Ter{\"a}lahti, Tampere, Finland, 30 May 2005} }

  • C. Wooters, N. Mirghafori, A. Stolcke, T. Pirinen, I. Bulyko, D. Gelbart, M. Graciarena, S. Otterson, B. Peskin, and M. Ostendorf, "The 2004 ICSI-SRI-UW meeting recognition system," Lecture Notes in Computer Science, vol. 3361, p. 196–208, 2005.
    [BibTeX]
    @article{2005_LNCS,
    author = "Wooters, C. and Mirghafori, N. and Stolcke, A. and Pirinen, Tuomo and Bulyko, I. and Gelbart, D. and Graciarena, M. and Otterson, S. and Peskin, B. and Ostendorf, M.",
    issn = "0302-9743",
    journal = "Lecture Notes in Computer Science",
    pages = "196--208",
    publisher = "Springer Verlag",
    title = "{T}he 2004 {ICSI}-{SR}-{UW} meeting recognition system",
    volume = "3361",
    year = "2005"
    }

2004

  • E. B. Bilcu, J. Astola, and J. Saarinen, "Recurrent neural networks with both side input context dependence for text-to-phoneme mapping," in Proceedings of the 2004 First International Symposium on Control, Communications and Signal Processing, ISCCSP, Hammamet, Tunisia, 2004, pp. 599-602.
    [BibTeX]
    @inproceedings{2004_ISCCSP,
    author = {Bilcu, Enik{\"o} Beatrice and Astola, Jaakko and Saarinen, Jukka},
    address = "Hammamet, Tunisia",
    booktitle = "Proceedings of the 2004 First International Symposium on Control, Communications and Signal Processing, ISCCSP",
    month = "March",
    pages = "599-602",
    title = "Recurrent neural networks with both side input context dependence for text-to-phoneme mapping",
    year = "2004"
    }

  • A. Klapuri, A. Eronen, and J. Astola, Automatic estimation of the meter of acoustic musical signals, TTY-Paino, 2004.
    [BibTeX]
    @book{2004,
    author = "Klapuri, Anssi and Eronen, Antti and Astola, Jaakko",
    isbn = "952-15-1149-4",
    number = "1/2004",
    publisher = "TTY-Paino",
    series = "Tampere University of Technology, Institute of Signal Processing, Report",
    title = "{A}utomatic estimation of the meter of acoustic musical signals",
    year = "2004"
    }

  • A. Klapuri, "Automatic music transcription as we know it today," Journal of New Music Research, vol. 33, iss. 3, pp. 269-282, 2004.
    [BibTeX]
    @article{2004_JNMR,
    author = "Klapuri, Anssi",
    journal = "Journal of New Music Research",
    keywords = "music transcription",
    month = "September",
    number = "3",
    pages = "269-282",
    title = "{A}utomatic music transcription as we know it today",
    volume = "33",
    year = "2004"
    }

  • K. Koppinen, "Analysis of the asymptotic impulse and frequency responses of polynomial predictors," Signal Processing, vol. 84, iss. 3, p. 549–560, 2004.
    [BibTeX]
    @article{2004_SP,
    author = "Koppinen, Konsta",
    issn = "0165-1684",
    journal = "Signal Processing",
    number = "3",
    pages = "549--560",
    publisher = "Elsevier",
    title = "{A}nalysis of the asymptotic impulse and frequency responses of polynomial predictors",
    volume = "84",
    year = "2004"
    }

  • K. Koppinen, "Signal Processing," Signal Processing, pp. 549-560, 2004.
    [BibTeX]
    @article{2004_SP_a,
    author = "Koppinen, Konsta",
    journal = "Signal Processing",
    pages = "549-560",
    title = "{S}ignal {P}rocessing",
    year = "2004"
    }

  • N. Mirghafori, A. Stolcke, C. Wooters, T. Pirinen, I. Bulyko, D. Gelbart, M. Graciarena, S. Otterson, B. Peskin, and M. Ostendorf, "From switchboard to meetings: development of the 2004 ICSI-SRI-UW meeting recognition system," in Proceedings of the 8th International Conference on Spoken Language Processing, Interspeech 2004, ICSLP, Jeju Island, Korea, 4-8 October 2004, 2004, p. 4 p.
    [BibTeX]
    @inproceedings{2004_InterSpecch,
    author = "Mirfhafori, N. and Stolcke, A. and Wooters, C. and Pirinen, Tuomo and Bulyko, I. and Gelbart, D. and Graciarena, M. and Otterson, S. and Peskin, B. and Ostendorf, M.",
    editor = "Kim, S. H. and Youn, D. H.",
    booktitle = "Proceedings of the 8th International Conference on Spoken Language Processing, Interspeech 20004, ICSLP, Jeju Island, Korea, 4-8 October 2004",
    pages = "4 p",
    title = "{F}rom switchboard to meetings: development of the 2004 {ICSI}-{SRI}-{UW} meeting recognition system",
    year = "2004"
    }

  • P. Pertilä, M. Parviainen, T. Korhonen, and A. Visa, "A spatiotemporal approach to passive sound source localization," in Proceedings of International Symposium on Communications and Information Technologies 2004, ISCIT, Sapporo, Japan, 2004, pp. 1150-1154.
    [BibTeX]
    @inproceedings{2004_ICA,
    author = {Pertil{\"a}, Pasi and Parviainen, Mikko and Korhonen, Teemu and Visa, Ari},
    address = "Sapporo, Japan",
    booktitle = "Proceedings of International Symposium on Communications and Information Technologies 2004, ISCIT",
    month = "October",
    pages = "1150-1154",
    title = "{A} spatiotemporal approach to passive sound source localization",
    year = "2004"
    }

  • A. Pertusa, A. Klapuri, and J. M. Iñesta, "Recognition of note onsets in digital music using semitone bands," in Progress in Pattern Recognition, Image Analysis and Applications: 10th Iberoamerican Congress on Pattern Recognition, Havana, Cuba, 2004, pp. 869-879.
    [BibTeX] [Download PDF]
    @inproceedings{2004_ICA_a,
    author = "Pertusa, Antonio and Klapuri, Anssi and I{\ n}esta, J.M.",
    editor = "Alberto Sanfeliu, Manuel Lazo",
    address = "Havana, Cuba",
    booktitle = "Progress in Pattern Recognition, Image Analysis and Applications: 10th Iberoamerican Congress on Pattern Recognition",
    pages = "869-879",
    title = "{R}ecognition of note onsets in digital music using semitone bands",
    url = "http://www.springerlink.com/content/q61t720x6575/",
    volume = "3773/2005",
    year = "2004"
    }

  • T. Pirinen and J. Yli-Hietanen, "Time delay based failure-robust direction of arrival estimation," in Proceedings of 2004 IEEE Sensor Array and Multichannel Signal Processing Workshop, SAM 2004, Barcelona, Spain, 18-21 July 2004, 2004, p. 5 p.
    [BibTeX]
    @inproceedings{2004_SAM,
    author = "Pirinen, Tuomo and Yli-Hietanen, Jari",
    booktitle = "Proceedings of 2004 IEEE Sensor Array and Multichannel Signal Processing Workshop, SAM 2004, Barcelona, Spain, 18-21 July 2004",
    pages = "5 p",
    title = "{T}ime delay based failure-robust direction of arrival estimation",
    year = "2004"
    }

  • T. Pirinen, J. Yli-Hietanen, P. Pertilä, and A. Visa, "Detection and compensation of sensor malfunction in time delay based direction of arrival estimation," in Proceedings of 2004 IEEE International Symposium on Circuits and Systems, ISCAS, Vancouver, Canada, 2004, pp. 872-875.
    [BibTeX]
    @inproceedings{2004_ISCAS,
    author = {Pirinen, Tuomo and Yli-Hietanen, Jari and Pertil{\"a}, Pasi and Visa, Ari},
    address = "Vancouver, Canada",
    booktitle = "Proceedings of 2004 IEEE International Symposium on Circuits and Systems, ISCAS",
    month = "May",
    pages = "872 - 875",
    title = "Detection and compensation of sensor malfunction in time delay based direction of arrival estimation",
    year = "2004"
    }

  • M. Ryynänen and A. Klapuri, "Modelling of note events for singing transcription," in Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing, 2004.
    [BibTeX]
    @inproceedings{2004_ISCA,
    author = {Ryyn{\"a}nen, Matti and Klapuri, Anssi},
    booktitle = "Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing",
    title = "Modelling of note events for singing transcription",
    year = "2004"
    }

  • A. Stolcke, C. Wooters, N. Mirghafori, T. Pirinen, I. Bulyko, D. Gelbart, M. Graciarena, S. Otterson, B. Peskin, and M. Ostendorf, "Progress in meeting recognition: The ICSI-SRI-UW spring 2004 evaluation system," in Proceedings of NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, Canada, 17 May 2004, 2004, p. 7 p.
    [BibTeX]
    @inproceedings{2004_ICASSP,
    author = "Stolcke, A. and Wooters, C. and Mirghafori, N. and Pirinen, Tuomo and Bulyko, I. and Gelbart, D. and Graciarena, M. and Otterson, S. and Peskin, B. and Ostendorf, M.",
    booktitle = "Proceedings of NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, Canada, 17 May 2004",
    pages = "7 p",
    title = "{P}rogress in meeting recognition: {T}he {ICSI}-{SRI}-{UW} spring 2004 evaluation system",
    year = "2004"
    }

  • T. Virtanen, "Separation of Sound Sources by Convolutive Sparse Coding," in ISCA Tutorial and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing, 2004.
    [BibTeX] [Download PDF]
    @inproceedings{2004_ITRW,
    author = "Virtanen, Tuomas",
    editor = "Tutorial, ISCA and on Statistical, Research Workshop and Processing, Perceptual Audio",
    keywords = "sparse coding; NMF",
    title = "Separation of Sound Sources by Convolutive Sparse Coding",
    year = "2004",
    booktitle = "ISCA Tutorial and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/sapa2004.pdf"
    }

  • T. Virtanen, "Separation of sound sources," in Digest of TISE Seminar 2004, Kangasala, 1 June 2004, 2004, p. 38–39.
    [BibTeX]
    @inproceedings{2004_TISE,
    author = "Virtanen, T.",
    editor = "Koivisto, P.",
    title = "Separation of sound sources",
    note = "ISBN 952-15-1175-3, ISSN 1458-8463
    Contribution: organisation=sgn,FACT1=1", year = "2004", language = "English", pages = "38--39", booktitle = "Digest of TISE Seminar 2004, Kangasala, 1 June 2004" }

2003

  • A. Eronen and T. Heittola, "Discriminative training of unsupervised acoustic models for non-speech audio," in Proceedings of the 2003 Finnish Signal Processing Symposium, FINSIG'03, Tampere, Finland, 2003, pp. 54-58.
    [BibTeX]
    @inproceedings{2003_FINSIG_a,
    author = "Eronen, Antti and Heittola, Toni",
    address = "Tampere, Finland",
    booktitle = "Proceedings of the 2003 Finnish Signal Processing Symposium, FINSIG'03",
    number = "20",
    series = {""},
    pages = "54-58",
    title = "{D}iscriminative training of unsupervised acoustic models for non-speech audio",
    year = "2003"
    }

  • A. Eronen, "Musical instrument recognition using ICA-based transform of features and discriminatively trained HMMs," in Proceedings of the Seventh International Symposium on Signal Processing and its Applications, Paris, France, 2003, pp. 133-136.
    [BibTeX]
    @inproceedings{2003_ICA,
    author = "Eronen, Antti",
    address = "Paris, France",
    booktitle = "Proceedings of the Seventh International Symposium on Signal Processing and its Applications",
    keywords = "instruments; HMM; ICA",
    month = "July",
    pages = "133-136",
    title = "{M}usical instrument recognition using {ICA}-based transform of features and discriminatively trained {HMM}s",
    volume = "2",
    year = "2003"
    }

  • A. Eronen, J. Tuomi, A. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context awareness - acoustic modeling and perceptual evaluation," in IEEE Proceedings of 2003 International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003, Hong Kong, 6-10 April 2003, 2003, p. 529–532.
    [BibTeX]
    @inproceedings{2003_ICASSP_c,
    author = "Eronen, A. and Tuomi, J. and Klapuri, A. and Fagerlund, S. and Sorsa, T. and Lorho, G. and Huopaniemi, J.",
    title = "Audio-based context awareness - acoustic modeling and perceptual evaluation",
    note = "ISBN 0-7803-7663-3
    Contribution: organisation=sgn,FACT1=1", year = "2003", language = "English", isbn = "0-7803-7663-3", pages = "529--532", booktitle = "IEEE Proceedings of 2003 International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003, Hong Kong, 6-10 April 2003" }

  • E. Gómez, A. Klapuri, and B. Meudic, "Melody Description and Extraction in the Context of Music Content Processing," Journal of New Music Research, vol. 32, iss. 1, 2003.
    [BibTeX]
    @article{2003_JNMR,
    author = "G{\'o}mez, Emilia and Klapuri, Anssi and Meudic, Beno{\^i}t",
    journal = "Journal of New Music Research",
    number = "1",
    title = "{M}elody {D}escription and {E}xtraction in the {C}ontext of {M}usic {C}ontent {P}rocessing",
    volume = "32",
    year = "2003"
    }

  • M. Helén and T. Virtanen, "Perceptually motivated parametric representation for harmonic sounds for data compression purposes," in Proceedings of the 6th International Conference on Digital Audio Effects DAFx-03, London, England, 2003, pp. 249-253.
    [BibTeX] [Download PDF]
    @inproceedings{2003_DAFx_a,
    author = "Hel{\'e}n, Marko and Virtanen, Tuomas",
    address = "London, England",
    booktitle = "Proceedings of the 6th International Conference on Digital Audio Effects DAFx-03",
    keywords = "sinusoidal model",
    month = "September",
    pages = "249-253",
    title = "Perceptually motivated parametric representation for harmonic sounds for data compression purposes",
    year = "2003",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/DAFx03\_Helen.pdf"
    }

  • A. Klapuri, "Multiple fundamental frequency estimation by harmonicity and spectral smoothness," IEEE Trans. Speech and Audio Processing, vol. 11, iss. 6, pp. 804-816, 2003.
    [BibTeX]
    @article{2003_TASP,
    author = "Klapuri, Anssi",
    journal = "IEEE Trans. Speech and Audio Processing",
    keywords = "fundamental frequency estimation",
    number = "6",
    pages = "804-816",
    title = "{M}ultiple fundamental frequency estimation by harmonicity and spectral smoothness",
    volume = "11",
    year = "2003"
    }

  • A. P. Klapuri, "Multiple fundamental frequency estimation based on harmonicitiy and spectral smoothness," IEEE Transactions on Speech and Audio Processing, vol. 11, iss. 6, p. 804–816, 2003.
    [BibTeX]
    @article{2003,
    author = "Klapuri, A. P.",
    title = "Multiple fundamental frequency estimation based on harmonicitiy and spectral smoothness",
    note = "ISSN 1063-6676
    Contribution: organisation=sgn,FACT1=1", year = "2003", language = "English", volume = "11", pages = "804--816", journal = "IEEE Transactions on Speech and Audio Processing", issn = "1063-6676", publisher = "Institute of Electrical and Electronics Engineers Inc.", number = "6" }

  • K. Koppinen, "Design of narrowband fir filters with minimal noise gain using complex interpolation," in IEEE Proceedings of 2003 International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003, Hong Kong, 2003, pp. 265-268.
    [BibTeX]
    @inproceedings{2003_ICASSP,
    author = "Koppinen, Konsta",
    address = "Hong Kong",
    booktitle = "IEEE Proceedings of 2003 International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003",
    month = "April",
    pages = "265-268",
    title = "{D}esign of narrowband fir filters with minimal noise gain using complex interpolation",
    year = "2003"
    }

  • S. Kuja-Halkola and A. Eronen, "Simultaneous training and order selection of Gaussian mixture models for speaker recognition," in Proceedings of the 2003 Finnish Signal Processing Symposium, FINSIG'03, Tampere, Finland, 19 May 2003, 2003, p. 259–263.
    [BibTeX]
    @inproceedings{2003_FINSIG_b,
    author = "Kuja-Halkola, Sami and Eronen, Antti",
    editor = "Huttunen, H. and Gotchev, A. and Vasilache, A.",
    booktitle = "Proceedings of the 2003 Finnish Signal Processing Symposium, FINSIG'03, Tampere, Finland, 19 May 2003",
    pages = "259--263",
    publisher = "TICSP",
    title = "{S}imultaneous training and order selection of gaussian mixture models for speaker recognition",
    year = "2003"
    }

  • T. Mäkelä and R. Niemistö, "Effects of harmonic components generated by polynomial preprocessors in acoustic echo control," in Proceedings of the Eighth International Workshop on Acoustic Echo and Noise Control, IWAENC 2003, Kyoto, Japan, 8-11 September 2003, 2003, p. 139–142.
    [BibTeX]
    @inproceedings{2003_IWAENC_a,
    author = {M{\"a}kel{\"a}, Tuomo and Niemist{\"o}, Riitta},
    editor = "Makino, S. and Miyoshi, M.",
    booktitle = "Proceedings of the Eight International Workshop on Acoustic Echo and Noise Control, IWAENC 2003, Kyoto, Japan, 8-11 September 2003",
    pages = "139--142",
    title = "{E}ffects of harmonic components generated by polynomial preprocessors in acoustic echo control",
    year = "2003"
    }

  • R. Niemistö and T. Mäkelä, "On performance of linear adaptive filtering algorithms in acoustic echo control in presence of distorting loudspeakers," in Proceedings of the Eight International Workshop on Acoustic Echo and Noise Control, IWAENC 2003, Kyoto, Japan, 8-11 September 2003, 2003, p. 79–82.
    [BibTeX]
    @inproceedings{2003_IWAENC,
    author = {Niemist{\"o}, Riitta and M{\"a}kel{\"a}, Tuomo},
    editor = "Makino, S. and Miyoshi, M.",
    booktitle = "Proceedings of the Eight International Workshop on Acoustic Echo and Noise Control, IWAENC 2003, Kyoto, Japan, 8-11 September 2003",
    pages = "79--82",
    title = "{O}n performance of linear adaptive filtering algorithms in acoustic echo control in presence of distorting loudspeakers",
    year = "2003"
    }

  • M. Parviainen and T. Virtanen, "Two-channel separation of speech using direction-of-arrival estimation and sinusoids plus transients modeling," in Proceedings of 2003 IEEE International Symposium on Intelligent Signal Processing and Communication Systems, ISPACS 2003, Awaji Island, Japan, 2003, pp. 127-132.
    [BibTeX] [Download PDF]
    @inproceedings{2003_ISPACS,
    author = "Parviainen, Mikko and Virtanen, Tuomas",
    address = "Awaji Island, Japan",
    booktitle = "Proceedings of 2003 IEEE International Symposium on Intelligent Signal Processing and Communication Systems, ISPACS 2003",
    keywords = "sinusoidal model",
    month = "December",
    pages = "127-132",
    title = "Two-channel separation of speech using direction-of-arrival estimation and sinusoids plus transients modeling",
    year = "2003",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/ispacs03.pdf"
    }

  • J. Paulus and A. Klapuri, "Conventional and Periodic N-grams in the Transcription of Drum Sequences," in Proc. of the IEEE International Conference on Multimedia and Expo, Baltimore, Maryland, USA, 2003, p. 737–740.
    [BibTeX]
    @inproceedings{2003_ICME,
    author = "Paulus, Jouni and Klapuri, Anssi",
    address = "Baltimore, Maryland, USA",
    booktitle = "Proc. of the IEEE International Conference on Multimedia and Expo",
    month = "Jul",
    pages = "737--740",
    title = "Conventional and Periodic {N}-grams in the Transcription of Drum Sequences",
    volume = "2",
    year = "2003"
    }

  • J. Paulus and A. Klapuri, "Model-based Event Labeling in the Transcription of Percussive Audio Signals," in Proc. of the 6th International Conference on Digital Audio Effects, London, UK, 2003, p. 73–77.
    [BibTeX]
    @inproceedings{2003_DAFx,
    author = "Paulus, Jouni and Klapuri, Anssi",
    editor = "Davies, Mike",
    address = "London, UK",
    booktitle = "Proc. of the 6th International Conference on Digital Audio Effects",
    month = "Sep",
    pages = "73--77",
    title = "Model-based Event Labeling in the Transcription of Percussive Audio Signals",
    year = "2003"
    }

  • P. Pertilä, T. Pirinen, A. Visa, and T. Korhonen, "Comparison of three post-processing methods for acoustic localization," in Proceedings of SPIE, Unattended Ground Sensor Technologies and Applications V, Orlando, Florida, USA, 2003, pp. 9-17.
    [BibTeX]
    @inproceedings{2003_SPIE,
    author = {Pertil{\"a}, Pasi and Pirinen, Tuomo and Visa, Ari and Korhonen, Teemu},
    address = "Orlando, Florida, USA",
    booktitle = "Proceedings of SPIE, Unattended Ground Sensor Technologies and Applications V",
    month = "April",
    pages = "9-17",
    title = "{C}omparison of three post-processing methods for acoustic localization",
    year = "2003"
    }

  • T. Pirinen, P. Pertilä, and A. Visa, "Toward intelligent sensors - reliability for time delay based direction of arrival estimates," in IEEE Proceedings of 2003 International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003, Hong Kong, 2003.
    [BibTeX]
    @inproceedings{2003_ICASSP_a,
    author = {Pirinen, Tuomo and Pertil{\"a}, Pasi and Visa, Ari},
    address = "Hong Kong",
    booktitle = "IEEE Proceedings of 2003 International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003",
    month = "Hong Kong",
    title = "{T}oward intelligent sensors - reliability for time delay based direction of arrival estimates",
    year = "2003"
    }

  • T. Pirinen, P. Pertilä, and A. Visa, "A new method for outlier removal in time delay based direction of arrival estimates," in Proceedings of SPIE, Unattended Ground Sensor Technologies and Applications V, Orlando, Florida, USA, 2003, pp. 18-29.
    [BibTeX]
    @inproceedings{2003_SPIE_a,
    author = {Pirinen, Tuomo and Pertil{\"a}, Pasi and Visa, Ari},
    address = "Orlando, Florida, USA",
    booktitle = "Proceedings of SPIE, Unattended Ground Sensor Technologies and Applications V",
    month = "April",
    pages = "18-29",
    title = "{A} new method for outlier removal in time delay based direction of arrival estimates",
    year = "2003"
    }

  • T. Viitaniemi, A. Klapuri, and A. Eronen, "A probabilistic model for the transcription of single-voice melodies," in Proceedings of the 2003 Finnish Signal Processing Symposium, FINSIG'03, Tampere, Finland, 19 May 2003, 2003, p. 59–63.
    [BibTeX]
    @inproceedings{2003_FINSIG,
    author = "Viitaniemi, Timo and Klapuri, Anssi and Eronen, Antti",
    editor = "Huttunen, H. and Gotchev, A. and Vasilache, A.",
    booktitle = "Proceedings of the 2003 Finnish Signal Processing Symposium, FINSIG'03, Tampere, Finland, 19 May 2003",
    pages = "59--63",
    publisher = "TICSP",
    title = "{A} probabilistic model for the transcription of single-voice melodies",
    year = "2003"
    }

  • T. Virtanen, "Sound Source Separation Using Sparse Coding with Temporal Continuity Objective," in International Computer Music Conference, 2003.
    [BibTeX] [Download PDF]
    @inproceedings{2003_ICMC,
    author = "Virtanen, Tuomas",
    booktitle = "International Computer Music Conference",
    keywords = "NMF;sparseness",
    title = "Sound Source Separation Using Sparse Coding with Temporal Continuity Objective",
    year = "2003",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/icmc2003.pdf"
    }

  • T. Virtanen, "Algorithm for the separation of harmonic sounds with time-frequency smoothness constraint," in Proceedings of the 6th International Conference on Digital Audio Effects DAFx-03, London, England, 2003, pp. 35-40.
    [BibTeX] [Download PDF]
    @inproceedings{2003_DAFx_b,
    author = "Virtanen, Tuomas",
    address = "London, England",
    booktitle = "Proceedings of the 6th International Conference on Digital Audio Effects DAFx-03",
    keywords = "sinusoidal model",
    month = "September",
    pages = "35-40",
    title = "Algorithm for the separation of harmonic sounds with time-frequency smoothness constraint",
    year = "2003",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/dafx2003.pdf"
    }

  • T. Virtanen, "Separation of sounds," in Digest of TISE Seminar 2003, Nokia, Finland, 5 June 2003, 2003, p. 1–2.
    [BibTeX]
    @inproceedings{2003_TISE,
    author = "Virtanen, T.",
    editor = "Koivisto, P.",
    title = "Separation of sounds",
    note = "ISBN 952-15-1043-9
    Contribution: organisation=sgn,FACT1=1", year = "2003", language = "English", pages = "1--2", booktitle = "Digest of TISE Seminar 2003, Nokia, Finland, 5 June 2003" }

2002

  • E. B. Bilcu, J. Suontausta, and J. Saarinen, "A New Transform Domain Neural Network for Text-To-Phoneme Mapping," in Proceedings of the 6th WSEAS International Multiconference on Circuits, Systems, Communications and Computers, CSCC 2002, July 7-14, 2002, Crete, Greece, 2002, p. 4591–4596.
    [BibTeX]
    @inproceedings{2002_CSCC,
    author = {Bilcu, Enik{\"o} Beatrice and Suontausta, Janne and Saarinen, Jukka},
    booktitle = "Proceedings of the 6th WSEAS International Multiconference on Circuits, Systems, Communications and Computers, CSCC 2002, July 7-14, 2002, Grete, Greece",
    pages = "4591--4596",
    title = "A New Transform Domain Neural Network for Text-To-Phoneme Mapping",
    year = "2002"
    }

  • E. B. Bilcu, P. Salmela, J. Suontausta, and J. Saarinen, "Application of the Neural Networks for Text-to-Phoneme Mapping," in Proceedings of EUSIPCO 2002 the XI European Signal Processing Conference, September 3-6, 2002, Toulouse, France, 2002, p. 97–100.
    [BibTeX]
    @inproceedings{2002_EUSIPCO,
    author = {Bilcu, Enik{\"o} Beatrice and Salmela, Petri and Suontausta, Janne and Saarinen, Jukka},
    booktitle = "Proceedings of EUSIPCO 2002 the XI European Signal Processing Conference, September 3-6, 2002, Tolouse, France",
    pages = "97--100",
    title = "Application of the Neural Networks for Text-to-Phoneme Mapping",
    year = "2002"
    }

  • A. Eronen, J. Tuomi, A. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context awareness - Acoustic modeling and perceptual evaluation," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2002, pp. 1941-1944.
    [BibTeX]
    @inproceedings{2002_ICASSP_b,
    author = {Eronen, Antti and Tuomi, Juha and Klapuri, Anssi and Fagerlund, Seppo and Sorsa, Timo and Lorho, Ga{\"e}tan and Huopaniemi, Jyri},
    booktitle = "Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing",
    keywords = "context recognition",
    month = "May",
    pages = "1941-1944",
    title = "Audio-based context awareness - {A}coustic modeling and perceptual evaluation",
    year = "2002"
    }

  • T. Heittola and A. Klapuri, "Locating segments with drums in music signals," in International Conference on Music Information Retrieval (ISMIR), Paris, France, 2002, pp. 271-272.
    [BibTeX]
    @inproceedings{2002_ISMIR_a,
    author = "Heittola, Toni and Klapuri, Anssi",
    address = "Paris, France",
    booktitle = "International Conference on Music Information Retrieval (ISMIR)",
    keywords = "drums",
    month = "September",
    pages = "271-272",
    title = "Locating segments with drums in music signals",
    year = "2002"
    }

  • A. Klapuri and J. T. Astola, "Efficient calculation of a physiologically-motivated representation for sound," in DSP2002, 14th International Conference on Digital Signal Processing Proceedings, July 1-3, 2002, Santorini, Greece, 2002, p. 587–590.
    [BibTeX]
    @inproceedings{2002_DSP_a,
    author = "Klapuri, A. and Astola, J.T.",
    editor = "Skodras, A.N. and Constantinides, A.G.",
    title = "Efficient calculation of a physiologically-motivated representation for sound",
    note = "ISBN: 0-07803-7504-1
    Contribution: organisation=sgn,FACT1=1", year = "2002", language = "English", pages = "587--590", booktitle = "DSP2002, 14th International Conference on Digital Signal Processing Proceedings, July 1-3, 2002, Santorini, Greece" }

  • A. Klapuri, "Automatic transcription of music," in Digest of TISE Seminar 2002, June 10, 2002, Ylöjärvi, Finland, 2002, p. s. 27.
    [BibTeX]
    @inproceedings{2002_TISE,
    author = "Klapuri, A.",
    title = "Automatic transcription of music",
    note = "ISBN 952-15-0852-3, ISSN 1458-8463
    Contribution: organisation=sgn,FACT1=1", year = "2002", language = "English", isbn = "952-15-0852-3", pages = "s. 27", booktitle = {Digest of TISE Seminar 2002, June 10, 2002, Yl{\"o}j{\"a}rvi, Finland} }

  • R. Niemistö and T. Mäkelä, "Robust adaptive polynomial filters for acoustic echo cancellation," in Proceedings of the 5th Nordic Signal Processing Symposium, NORSIG 2002, October 4-7, 2002, on board Hurtigruten, Norway, 2002, p. 5 s.
    [BibTeX]
    @inproceedings{2002_NORSIG,
    author = {Niemist{\"o}, Riitta and M{\"a}kel{\"a}, Tuomo},
    booktitle = "Proceedings of the 5th Nordic Signal Processing Symposium, NORSIG 2002, October 4-7, 2002, on board Hurtigruten, Norway",
    isbn = "82-993158-4-0",
    pages = "5 s",
    title = "Robust adaptive polynomial filters for acoustic echo cancellation",
    year = "2002"
    }

  • R. Niemistö, T. Mäkelä, and V. Myllylä, "Robust fast affine projection algorithm for nonlinear acoustic echo cancellation," in Proceedings of EUSIPCO 2002, XI European Signal Processing Conference, September 3-6, 2002, Toulouse, France, 2002, p. 523–526.
    [BibTeX]
    @inproceedings{2002_EUSIPCO_a,
    author = {Niemist{\"o}, Riitta and M{\"a}kel{\"a}, Tuomo and Myllyl{\"a}, V.},
    booktitle = "Proceedings of EUSIPCO 2002, XI European Signal Processing Conference, September 3-6, 2002, Tolouse, France",
    pages = "523--526",
    title = "Robust fast affine projection algorithm for nonlinear acoustic echo cancellation",
    year = "2002"
    }

  • J. Paulus and A. Klapuri, "Measuring the Similarity of Rhythmic Patterns," in Proc. of the Third International Conference on Music Information Retrieval, Paris, France, 2002, p. 150–156.
    [BibTeX]
    @inproceedings{2002_ISMIR,
    author = "Paulus, Jouni and Klapuri, Anssi",
    editor = "Fingerhut, Michael",
    address = "Paris, France",
    booktitle = "Proc. of the Third International Conference on Music Information Retrieval",
    month = "Oct",
    pages = "150--156",
    title = "Measuring the Similarity of Rhythmic Patterns",
    year = "2002"
    }

  • V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa, "Computational auditory scene recognition," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, p. II-1941-II-1944. doi:10.1109/ICASSP.2002.5745009
    [BibTeX]
    @INPROCEEDINGS{2002_ICASSP_a,
    author = "Peltonen, Vesa and Tuomi, Juha and Klapuri, Anssi and Huopaniemi, Jyri and Sorsa, Timo",
    booktitle = "2002 IEEE International Conference on Acoustics, Speech, and Signal Processing",
    title = "Computational auditory scene recognition",
    year = "2002",
    volume = "2",
    number = "",
    pages = "II-1941-II-1944",
    keywords = "Roads;Libraries;Artificial neural networks;Mel frequency cepstral coefficient;Vehicles;Rail transportation",
    doi = "10.1109/ICASSP.2002.5745009"
    }

  • T. Virtanen and A. Klapuri, "Separation of harmonic sounds using linear models for the overtone series," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, p. II-1757-II-1760. doi:10.1109/ICASSP.2002.5744962
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2002_ICASSP,
    author = "Virtanen, Tuomas and Klapuri, Anssi",
    booktitle = "2002 IEEE International Conference on Acoustics, Speech, and Signal Processing",
    title = "Separation of harmonic sounds using linear models for the overtone series",
    year = "2002",
    volume = "2",
    number = "",
    pages = "II-1757-II-1760",
    keywords = "Laboratories;Transforms;Polynomials;Smoothing methods;Harmonic analysis",
    doi = "10.1109/ICASSP.2002.5744962",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/icassp2002.pdf"
    }

  • T. Virtanen, "Separation of harmonic sounds," in Digest of TISE Seminar 2002, June 10, 2002, Ylöjärvi, Finland, 2002, p. 28–29.
    [BibTeX]
    @inproceedings{2002_TISE_a,
    author = "Virtanen, T.",
    title = "Separation of harmonic sounds",
    note = "ISBN 952-15-0852-3, ISSN 1458-8463
    Contribution: organisation=sgn,FACT1=1", year = "2002", language = "English", isbn = "952-15-0852-3", pages = "28--29", booktitle = {Digest of TISE Seminar 2002, June 10, 2002, Yl{\"o}j{\"a}rvi, Finland} }

  • J. Yli-Hietanen and T. Saarelainen, "Analysis of robust time-delay based angle-of-arrival estimation methods," in DSP2002, 14th International Conference on Digital Signal Processing Proceedings, July 1-3, 2002, Santorini, Greece, 2002, p. 239–242.
    [BibTeX]
    @inproceedings{2002_DSP,
    author = "Yli-Hietanen, Jari and Saarelainen, Teemu",
    editor = "Skodras, A.N. and Constantinides, A.G.",
    booktitle = "DSP2002, 14th International Conference on Digital Signal Processing Proceedings, July 1-3, 2002, Santorini, Greece",
    pages = "239--242",
    title = "Analysis of robust time-delay based angle-of-arrival estimation methods",
    year = "2002"
    }

2001

  • A. Eronen, "Comparison of features for musical instrument recognition," in Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575), 2001, pp. 19-22. doi:10.1109/ASPAA.2001.969532
    [BibTeX]
    @INPROCEEDINGS{2001_WASPAA,
    author = "Eronen, A.",
    booktitle = "Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575)",
    title = "Comparison of features for musical instrument recognition",
    year = "2001",
    volume = "",
    number = "",
    pages = "19-22",
    keywords = "Instruments;Cepstral analysis;Steady-state;Mel frequency cepstral coefficient;Humans;Performance analysis;Brightness;Frequency synchronization;Acoustic testing;Filter bank",
    doi = "10.1109/ASPAA.2001.969532"
    }

  • A. Klapuri, "Multipitch estimation and sound separation by the spectral smoothness principle," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings, 2001, p. 3381-3384 vol.5. doi:10.1109/ICASSP.2001.940384
    [BibTeX]
    @INPROCEEDINGS{2001_ICASSP,
    author = "Klapuri, Anssi",
    booktitle = "2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings",
    title = "Multipitch estimation and sound separation by the spectral smoothness principle",
    year = "2001",
    volume = "5",
    number = "",
    pages = "3381-3384 vol.5",
    keywords = "Humans;Auditory system;Psychoacoustic models;Frequency;Databases;Instruments;Smoothing methods;Error correction;Error analysis;Computational modeling",
    doi = "10.1109/ICASSP.2001.940384"
    }

  • A. Klapuri, "Means of integrating audio content analysis algorithms," in 10th Audio Engineering Society Convention, Amsterdam, Netherlands, 2001.
    [BibTeX]
    @inproceedings{2001_AES_a,
    author = "Klapuri, Anssi",
    address = "Amsterdam, Netherlands",
    booktitle = "10th Audio Engineering Society Convention",
    title = "Means of integrating audio content analysis algorithms",
    year = "2001"
    }

  • A. Klapuri, T. Virtanen, A. Eronen, and J. Seppänen, "Automatic transcription of musical recordings," in Consistent & Reliable Acoustic Cues Workshop, CRAC-01, Aalborg, Denmark, 2001.
    [BibTeX]
    @inproceedings{2001_CRAC,
    author = {Klapuri, Anssi and Virtanen, Tuomas and Eronen, Antti and Sepp{\"a}nen, Jarno},
    address = "Aalborg, Denmark",
    booktitle = "Consistent {\\&} Reliable Acoustic Cues Workshop, CRAC-01",
    keywords = "transcription",
    month = "September",
    title = "Automatic transcription of musical recordings",
    year = "2001"
    }

  • A. Klapuri, A. Eronen, J. Seppänen, and T. Virtanen, "Automatic transcription of music," in Symposium on Stochastic Modeling of Music, 14th Meeting of the FWO Research Society on Foundations of Music Research, Ghent, Belgium, 2001.
    [BibTeX]
    @inproceedings{2001,
    author = {Klapuri, Anssi and Eronen, Antti and Sepp{\"a}nen, Jarno and Virtanen, Tuomas},
    address = "Ghent, Belgium",
    booktitle = "Symposium on Stochastic Modeling of Music, 14th Meeting of the FWO Research Society on Foundations of Music Research",
    month = "October",
    title = "Automatic transcription of music",
    year = "2001"
    }

  • V. Peltonen, A. Eronen, M. Parviainen, and A. Klapuri, "Recognition of everyday auditory scenes: potentials, latencies and cues," in 110th Audio Engineering Society Convention, Amsterdam, Netherlands, 2001.
    [BibTeX]
    @inproceedings{2001_AES,
    author = "Peltonen, Vesa and Eronen, Antti and Parviainen, Mikko and Klapuri, Anssi",
    address = "Amsterdam, Netherlands",
    booktitle = "110th Audio Engineering Society Convention",
    keywords = "context recognition",
    title = "Recognition of everyday auditory scenes: potentials, latencies and cues",
    year = "2001"
    }

  • T. Virtanen and A. Klapuri, "Separation of harmonic sounds using multipitch analysis and iterative parameter estimation," in Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575), 2001, pp. 83-86. doi:10.1109/ASPAA.2001.969548
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{2001_WASPAA_a,
    author = "Virtanen, Tuomas and Klapuri, Anssi",
    booktitle = "Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575)",
    title = "Separation of harmonic sounds using multipitch analysis and iterative parameter estimation",
    year = "2001",
    volume = "",
    number = "",
    pages = "83-86",
    keywords = "Harmonic analysis;Parameter estimation;Frequency estimation;Amplitude estimation;Acoustic signal processing;Iterative methods;Image analysis;Humans;Laboratories;Signal processing algorithms",
    doi = "10.1109/ASPAA.2001.969548",
    url = "https://homepages.tuni.fi/tuomas.virtanen/papers/waspaa2001.pdf"
    }

  • T. Virtanen, "Accure Sinusoidal Model Analysis and Parameter Redustion by Fusion of Componets," in Audio Engineering Society, Convention Paper, Presented at the 110th Convention, Amsterdam, The Netherlands, 2001.
    [BibTeX]
    @inproceedings{2001_AES_b,
    author = "Virtanen, Tuomas",
    address = "Amsterdam, The Netherlands",
    booktitle = "Audio Engineering Society, Convention Paper, Presented at the 110th Convention",
    keywords = "sinusoidal model",
    month = "May",
    title = "Accure Sinusoidal Model Analysis and Parameter Redustion by Fusion of Componets",
    year = "2001"
    }

2000

  • A. Eronen and A. Klapuri, "Musical instrument recognition using cepstral coefficients and temporal features," in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000, p. II753-II756 vol.2. doi:10.1109/ICASSP.2000.859069
    [BibTeX]
    @INPROCEEDINGS{2000_ICASSP_a,
    author = "Eronen, A. and Klapuri, A.",
    booktitle = "2000 IEEE International Conference on Acoustics, Speech, and Signal Processing",
    title = "Musical instrument recognition using cepstral coefficients and temporal features",
    year = "2000",
    volume = "2",
    number = "",
    pages = "II753-II756 vol.2",
    keywords = "Instruments;Cepstral analysis;Signal processing algorithms;Testing;Laboratories;Data mining;Algorithm design and analysis;Multiple signal classification;Music;Signal analysis",
    doi = "10.1109/ICASSP.2000.859069"
    }

  • S. Jakob, I. Korhonen, E. Ruokonen, T. Virtanen, A. Kogan, and J. Takala, "Detection of artifacts in monitored trends in intensive care," Computer Methods and Programs in Biomedicine, vol. 63, iss. 3, p. 203–209, 2000. doi:10.1016/S0169-2607(00)00110-3
    [BibTeX] [Abstract]

    In intensive care, decision-making is often based on trend analysis of physiological parameters. Artifact detection is a pre-requisite for interpretation of trends both for clinical and research purposes. In this study, we developed and tested three methods of artifact detection in physiological data (systolic, mean and diastolic artery and pulmonary artery pressures, central venous pressure, and peripheral temperature) using pre-filtered physiological signals (2-min median filtering) from 41 patients after cardiac surgery. These methods were: (1) the Rosner statistic; (2) slope detection with rules; and (3) comparison with a running median (median detection). After tuning the methods using data from 20 randomly chosen patients, the methods were tested using the data from the remaining patients. The results were compared with those obtained by manual identification of artifacts by three senior intensive care unit physicians. Out of an average of 22 480 data points for each variable, the three observers labelled 0.98% (220 data points) as artifacts. The inter-observer agreement was good. The average (range) sensitivity for artifact detection in all variables in the test database was 66% (33-92%) for the Rosner statistic, 64% (24-98%) for slope detection and 72% (41-98%) for median detection. All methods had a high specificity (greater than or equal to 94%). Slope detection had the highest mean positive prediction rate (53%; 21-85%). When the performance was measured by the cost function, slope detection and running median performed equally well and were superior to Rosner statistics for systemic arterial and central venous pressure and peripheral temperature. None of the methods produced acceptable results for pulmonary artery pressures. We conclude that median filtering of physiological variables is effective in removing artifacts. In post-operative cardiac surgery patients, the remaining artifacts are difficult to detect among physiological and pathophysiological changes. This makes large databases for tuning artifact algorithms mandatory. Despite these limitations, the performance of running median and slope detection were good in selected physiological variables. (C) 2000 Elsevier Science Ireland Ltd. All rights reserved.

    @article{2000,
    author = "Jakob, S and Korhonen, I and Ruokonen, E and Virtanen, T and Kogan, A and Takala, J",
    title = "Detection of artifacts in monitored trends in intensive care",
    abstract = "In intensive care, decision-making is often based on trend analysis of physiological parameters. Artifact detection is a pre-requisite for interpretation of trends both for clinical and research purposes. In this study, we developed and tested three methods of artifact detection in physiological data (systolic, mean and diastolic artery and pulmonary artery pressures, central venous pressure, and peripheral temperature) using pre-filtered physiological signals (2-min median filtering) from 41 patients after cardiac surgery. These methods were: (1) the Rosner statistic; (2) slope detection with rules; and (3) comparison with a running median (median detection). After tuning the methods using data from 20 randomly chosen patients, the methods were tested using the data from the remaining patients. The results were compared with those obtained by manual identification of artifacts by three senior intensive care unit physicians. Out of an average of 22 480 data points for each variable, the three observers labelled 0.98\\% (220 data points) as artifacts. The inter-observer agreement was good. The average (range) sensitivity for artifact detection in all variables in the test database was 66\\% (33-92\\%) for the Rosner statistic, 64\\% (24-98\\%) for slope detection and 72\\% (41-98\\%) for median detection. All methods had a high specificity (greater than or equal to 94\\%). Slope detection had the highest mean positive prediction rate (53\\%; 21-85\\%). When the performance was measured by the cost function, slope detection and running median performed equally well and were superior to Rosner statistics for systemic arterial and central venous pressure and peripheral temperature. None of the methods produced acceptable results for pulmonary artery pressures. We conclude that median filtering of physiological variables is effective in removing artifacts. In post-operative cardiac surgery patients, the remaining artifacts are difficult to detect among physiological and pathophysiological changes. This makes large databases for tuning artifact algorithms mandatory. Despite these limitations, the performance of running median and slope detection were good in selected physiological variables. (C) 2000 Elsevier Science Ireland Ltd. All rights reserved.",
    keywords = "detection, artifacts, monitored trends, intensive care, ALARM SYSTEM, MANAGEMENT",
    year = "2000",
    month = "November",
    doi = "10.1016/S0169-2607(00)00110-3",
    language = "English",
    volume = "63",
    pages = "203--209",
    journal = "Computer Methods and Programs in Biomedicine",
    issn = "0169-2607",
    publisher = "Elsevier",
    number = "3"
    }
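
    As a reading aid only, the running-median comparison described in the abstract above can be sketched in a few lines of Python. This is not the authors' implementation: the function name running_median_artifacts, the window length, and the threshold are illustrative assumptions.

    import numpy as np

    def running_median_artifacts(x, window=15, threshold=20.0):
        # Flag samples that deviate strongly from a centered running median.
        # x: 1-D array of a pre-filtered physiological trend (e.g. mean
        # arterial pressure); window: odd median window length (illustrative);
        # threshold: absolute deviation, in the signal's units, above which a
        # sample is treated as an artifact (illustrative).
        x = np.asarray(x, dtype=float)
        if window % 2 == 0:
            window += 1  # keep the median window centered
        half = window // 2
        padded = np.pad(x, half, mode="edge")
        windows = np.lib.stride_tricks.sliding_window_view(padded, window)
        baseline = np.median(windows, axis=1)
        return np.abs(x - baseline) > threshold

    The paper tunes and compares three detectors on separate patient sets; the sketch mirrors only the running-median idea, not the evaluation.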

  • J. Kivimäki, T. Lahti, and K. Koppinen, "A Phonetic Vocoder for Finnish," in Proceedings of the X European Signal Processing Conference (EUSIPCO), Tampere, Finland, 2000, pp. 1301-1304.
    [BibTeX]
    @inproceedings{2000_EUSIPCO_a,
    author = {Kivim{\"a}ki, Jukka and Lahti, Tommi and Koppinen, Konsta},
    address = "Tampere, Finland",
    booktitle = "Proceedings of the X European Signal Processing Conference (EUSIPCO)",
    month = "September",
    pages = "1301-1304",
    title = "{A} Phonetic Vocoder for Finnish",
    year = "2000"
    }

  • A. Klapuri, "Qualitative and quantitative aspects in the design of periodicity estimation algorithms," in Proceedings of the European Signal Processing Conference EUSIPCO, 2000.
    [BibTeX]
    @inproceedings{2000_EUSIPCO_b,
    author = "Klapuri, Anssi",
    booktitle = "Proceedings of the European Signal Processing Conference EUSIPCO",
    keywords = "transcription",
    title = "{Q}ualitative and quantitative aspects in the design of periodicity estimation algorithms",
    year = "2000"
    }

  • A. Klapuri, T. Virtanen, and J. Holm, "Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals," in Proc. COST-G6 Conference on Digital Audio Effects, DAFx-00, Verona, Italy, 2000.
    [BibTeX]
    @inproceedings{2000_DAFx,
    author = "Klapuri, Anssi and Virtanen, Tuomas and Holm, Jan-Markus",
    address = "Verona, Italy",
    booktitle = "In Proc. COST-G6 Conference on Digital Audio Effects, DAFx-00",
    keywords = "transcription",
    title = "{R}obust multipitch estimation for the analysis and manipulation of polyphonic musical signals",
    year = "2000"
    }

  • K. Koppinen and J. Astola, "Generalized IIR polynomial predictive filters," in Signal Processing X Theories and Applications, Proceedings of EUSIPCO 2000, 10th European Signal Processing Conference, Tampere, Finland, 2000, pp. 2457-2460.
    [BibTeX]
    @inproceedings{2000_EUSIPCO_c,
    author = "Koppinen, Konsta and Astola, Jaakko",
    address = "Tampere, Finland",
    booktitle = "Signal Processing X Theories and Applications, Proceedings of EUSIPCO 2000, 10th European Signal Processing Conference",
    month = "September",
    pages = "2457-2460",
    title = "{G}eneralized {IIR} polynomial predictive filters",
    year = "2000"
    }

  • T. Mikkonen and K. Koppinen, "Soft-Decision Decoding of Binary Block Codes in CELP Speech Coding," in EUSIPCO 2000, X European Signal Processing Conference, Tampere, Finland, 2000, pp. 825-828.
    [BibTeX]
    @inproceedings{2000_EUSIPCO_d,
    author = "Mikkonen, Tomi and Koppinen, Konsta",
    address = "Tampere, Finland",
    booktitle = "Eusipco 2000, X European Signal Processing Conference",
    month = "September",
    pages = "825-828",
    title = "{S}oft-{D}ecision {D}ecoding of {B}inary {B}lock {C}odes in {C}elp {S}peech {C}oding",
    year = "2000"
    }

  • A. Rosti and V. Koivunen, "Classification of MFSK modulated signals using the mean of complex envelope," in Signal Processing X Theories and Applications, Proceedings of EUSIPCO 2000, tenth European Signal Processing Conference, 4-8 September 2000, Tampere, Finland, 2000, p. 581–584.
    [BibTeX] [Abstract]

    Modulation classification has many important applications in communications, e.g., reconfigurable receivers, spectrum management and interference cancellation. In this paper we address the problem of classifying digitally modulated signals using cyclostationary statistics. We derive the first-order moments of the complex envelope of digitally modulated signals and verify their periodicity. A novel feature for the classification of the frequency shift keyed signals is proposed. The performance of this feature in distinguishing among different FSK constellations is studied in simulation. Some comparisons to commonly used features are performed.

    @inproceedings{2000_EUSIPCO,
    author = "Rosti, Antti-Veikko and Koivunen, Visa",
    editor = "Gabbouj, M. and Kuosmanen, P.",
    abstract = "Modulation classification has many important applications in communications, e.g., reconfigurable receivers, spectrum managment and interference cancellation. In this paper we address the problem of classifying digitally modulated signals using cyclostationary statistics. We derive the first-order moments of the complex envelope of digitally modulated signals and verify their periodicity. A novel feature for the classification of the frequency shift keyed signals is proposed. The performance of this feature in distinguishing among different FSK constellations is studied in simulation. Some comparisons to commonly used features are performed.",
    booktitle = "Signal Processing X Theories and Applications, Proceedings of EUSIPCO 2000, tenth European Signal processing Conference, 4-8 September 2000, Tampere, Finland",
    isbn = "952-15-0443-9",
    pages = "581--584",
    title = "Classification of mfsk modulated signals using the mean of complex envelope",
    year = "2000"
    }

  • T. Saarelainen and J. Yli-Hietanen, "A design method for small sensor arrays in angle of arrival estimation," in 2000 10th European Signal Processing Conference, 2000, pp. 1-4.
    [BibTeX] [Abstract]

    The use of small sensor arrays in modern signal processing systems has recently become more common due to the increase in computational processing power and interest in intelligent sensing and surveillance. However, not much information is available on the design of small sensor arrays having arbitrary geometry, that effectively can accomplish these tasks. In this paper we address the problem of designing such small sensor array systems for angle of arrival (AOA) estimation algorithms. Two different cost functions are derived and their applicability is demonstrated in simulation. The accuracy of the AOA estimates is also studied for two different array configurations.

    @inproceedings{2000_EUSIPCO_f,
    author = "Saarelainen, Teemu and Yli-Hietanen, Jari",
    abstract = "The use of small sensor arrays in modern signal processing systems has recently become more common due to the increase in computational processing power and interest in intelligent sensing and surveillance. However, not much information is available on the design of small sensor arrays having arbitrary geometry, that effectively can accomplish these tasks. In this paper we address the problem of designing such small sensor array systems for angle of arrival (AOA) estimation algorithms. Two different cost functions are derived and their applicability is demonstrated in simulation. The accuracy of the AOA estimates is also studied for two different array configurations.",
    booktitle = "2000 10th European Signal Processing Conference",
    pages = "1-4",
    title = "{A} design method for small sensor arrays in angle of arrival estimation",
    year = "2000"
    }

  • T. Saarelainen and J. Yli-Hietanen, "Design Method for Small Sensor Arrays in Angle of Arrival Estimation," in Signal Processing X, Theories and Applications, EUSIPCO 2000, 4-8 September 2000, Tampere, Finland, 2000, p. 1589–1592.
    [BibTeX]
    @inproceedings{2000_EUSIPCO_g,
    author = "Saarelainen, Teemu and Yli-Hietanen, Jari",
    booktitle = "Signal Processing X, Theories and Applications, EUSIPCO 2000, 4-8 September 2000, Tampere, Finland",
    pages = "1589--1592",
    title = "{D}esign {M}ethod for {S}mall {S}ensor {A}rrays in {A}ngle of {A}rrival {E}stimation",
    year = "2000"
    }

  • J. Sillanpää, A. Klapuri, J. Seppänen, and T. Virtanen, "Recognition of acoustic noise mixtures by combined bottom-up and top-down processing," in Proceedings of the European Signal Processing Conference EUSIPCO, 2000.
    [BibTeX]
    @inproceedings{2000_EUSIPCO_e,
    author = {Sillanp{\"a}{\"a}, Jukka and Klapuri, Anssi and Sepp{\"a}nen, Jarno and Virtanen, Tuomas},
    booktitle = "Proceedings of the European Signal Processing Conference EUSIPCO",
    title = "Recognition of acoustic noise mixtures by combined bottom-up and top-down processing",
    year = "2000"
    }

  • T. Virtanen and A. Klapuri, "Separation of Harmonic Sound Sources Using Sinusoidal Modeling," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, 2000.
    [BibTeX]
    @inproceedings{2000_ICASSP,
    author = "Virtanen, Tuomas and Klapuri, Anssi",
    address = "Istanbul, Turkey",
    booktitle = "IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    keywords = "Separation; sinusoidal model",
    title = "{S}eparation of {H}armonic {S}ound {S}ources {U}sing {S}inusoidal {M}odeling",
    year = "2000"
    }

  • J. Yli-Hietanen, T. Saarelainen, and J. Routakangas, "Robust Angle-of-Arrival Estimation of Transient Signals," in Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG), Kolmården, Sweden, 2000, pp. 65-68.
    [BibTeX]
    @inproceedings{2000_NORSIG,
    author = "Yli-Hietanen, Jari and Saarelainen, Teemu and Routakangas, Jussi",
    address = "Kolm{\aa}rden, Sweden",
    booktitle = "In Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG)",
    month = "June",
    pages = "65-68",
    title = "Robust Angle-of-Arrival Estimation of Transient Signals",
    year = "2000"
    }

1999

  • A. Klapuri, "Sound onset detection by applying psychoacoustic knowledge," in 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99, 1999, p. 3089-3092 vol.6. doi:10.1109/ICASSP.1999.757494
    [BibTeX]
    @INPROCEEDINGS{1999_ICASSP,
    author = "Klapuri, Anssi",
    booktitle = "1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99",
    title = "Sound onset detection by applying psychoacoustic knowledge",
    year = "1999",
    volume = "6",
    number = "",
    pages = "3089-3092 vol.6",
    keywords = "Psychology;Acoustic signal detection;Signal processing;Robustness;Psychoacoustic models;Frequency;Audio recording;Event detection;Acoustic signal processing;Laboratories",
    doi = "10.1109/ICASSP.1999.757494"
    }

  • A. Klapuri, "Pitch estimation using multiple independent time-frequency windows," in Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. WASPAA'99 (Cat. No.99TH8452), 1999, pp. 115-118. doi:10.1109/ASPAA.1999.810863
    [BibTeX]
    @INPROCEEDINGS{1999_WASPAA_a,
    author = "Klapuri, Anssi",
    booktitle = "Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. WASPAA'99 (Cat. No.99TH8452)",
    title = "Pitch estimation using multiple independent time-frequency windows",
    year = "1999",
    volume = "",
    number = "",
    pages = "115-118",
    keywords = "Time frequency analysis;Psychoacoustic models;Signal processing algorithms;Yield estimation;Frequency estimation;Acoustic signal processing;Psychology;Noise robustness;Interference;Laboratories",
    doi = "10.1109/ASPAA.1999.810863"
    }

  • A. Klapuri, "Wide-band Pitch Estimation for Natural Sound Sources with Inharmonicities," in 106th Audio Engineering Society Convention, 1999.
    [BibTeX]
    @inproceedings{1999_AES,
    author = "Klapuri, Anssi",
    booktitle = "106th Audio Engineering Society Convention",
    keywords = "Transcription",
    month = "May",
    title = "{W}ide-band {P}itch {E}stimation for {N}atural {S}ound {S}ources with {I}nharmonicities",
    year = "1999"
    }

  • J. Seppänen, S. Kananoja, J. Yli-Hietanen, K. Koppinen, and J. Sjöberg, "Maximization of the subjective loudness of speech with constrained amplitude," in Applications of Signal Processing to Audio and Acoustics, 1999 IEEE Workshop on, 1999, pp. 139-142. doi:10.1109/ASPAA.1999.810869
    [BibTeX] [Abstract]

    We introduce an adaptive algorithm for constraining the amplitude of speech signals while at the same time trying to maintain the subjective loudness and trying not to produce disturbing artifacts. The algorithm can be applied to compensate for the clipping distortion of amplifiers in speech reproduction devices. The algorithm analyzes the speech signal on multiple frequency bands and applies an internal audibility law in order to make inaudible changes to the signal. An example of the audibility law, presented in the form of a matrix, is described, associated with a specific speech reproduction device. Multiple band-pass signals are processed with a waveshaper to accomplish soft-clipping and to constrain the amplitude of the processed signal. When processed with the proposed algorithm, the computational loudness value of speech signals was found to diminish only slightly (approximately 6 sones) during processing, while at the same time the signal amplitude could be reduced by even 15 dB.

    @inproceedings{1999_WASPAA,
    author = {Sepp{\"a}nen, Jarno and Kananoja, Sami and Yli-Hietanen, Jari and Koppinen, Konsta and Sj{\"o}berg, Jari},
    abstract = "We introduce an adaptive algorithm for constraining the amplitude of speech signals while at the same time trying to maintain the subjective loudness and trying not to produce disturbing artifacts. The algorithm can be applied to compensate for the clipping distortion of amplifiers in speech reproduction devices. The algorithm analyzes the speech signal on multiple frequency bands and applies an internal audibility law in order to make inaudible changes to the signal. An example of the audibility law, presented in the form of a matrix, is described, associated with a specific speech reproduction device. Multiple band-pass signals are processed with a waveshaper to accomplish soft-clipping and to constrain the amplitude of the processed signal. When processed with the proposed algorithm, the computational loudness value of speech signals was found to diminish only slightly (approximately 6 sones) during processing, while at the same time the signal amplitude could be reduced by even 15 dB.",
    booktitle = "Applications of Signal Processing to Audio and Acoustics, 1999 IEEE Workshop on",
    doi = "https://doi.org/10.1109/ASPAA.1999.810869",
    keywords = "nonlinear distortion;speech processing;adaptive signal processing",
    pages = "139-142",
    title = "{M}aximization of the subjective loudness of speech with constrained amplitude",
    year = "1999"
    }
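
    Purely as an illustration of the soft-clipping step mentioned in the abstract above (not the paper's waveshaper or its audibility model), a smooth amplitude-constraining nonlinearity might look as follows; soft_clip, ceiling, and drive are hypothetical names with illustrative values.

    import numpy as np

    def soft_clip(x, ceiling=0.5, drive=2.0):
        # Smoothly constrain a (band-pass) signal's amplitude to +/- ceiling.
        # ceiling: maximum output amplitude (illustrative); drive: input gain
        # that pushes the signal into the saturating region (illustrative).
        x = np.asarray(x, dtype=float)
        return ceiling * np.tanh(drive * x / ceiling)

    In the paper the shaping is applied per frequency band under an audibility constraint; the sketch shows only the amplitude-limiting nonlinearity.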

  • J. Yli-Hietanen, K. Koppinen, and J. Astola, "Time-delay Selection for Robust Angle of Arrival Estimation," in Proceedings of the IASTED Internatioanl Conference Signal and Image Processing (SIP99), Nassau, Bahamas, 1999.
    [BibTeX]
    @inproceedings{1999_SIP99,
    author = "Yli-Hietanen, Jari and Koppinen, Konsta and Astola, Jaakko",
    address = "Nassau, Bahamas",
    booktitle = "Proceedings of the IASTED Internatioanl Conference Signal and Image Processing (SIP99)",
    month = "October",
    title = "{T}ime-delay {S}election for {R}obust {A}ngle of {A}rrival {E}stimation",
    year = "1999"
    }

1998

  • A. Klapuri, "Number theoretical means of resolving a mixture of several harmonic sounds," in 9th European Signal Processing Conference (EUSIPCO 1998), 1998, pp. 1-5.
    [BibTeX]
    @INPROCEEDINGS{1998_EUSIPCO_a,
    author = "Klapuri, Anssi",
    booktitle = "9th European Signal Processing Conference (EUSIPCO 1998)",
    title = "Number theoretical means of resolving a mixture of several harmonic sounds",
    year = "1998",
    volume = "",
    number = "",
    pages = "1-5",
    keywords = "Harmonic analysis;Signal resolution;Robustness;Mathematical model;Probability;Signal processing algorithms;Hidden Markov models"
    }

  • K. Koppinen, J. Yli-Hietanen, and P. Händel, "Design of Multi-Delay Predictive Filters Using Dynamic Programming," in Proceedings of EUSIPCO'98, 9th European Signal Processing Conference, 1998, pp. 161-164.
    [BibTeX]
    @inproceedings{1998_EUSIPCO,
    author = {Koppinen, Konsta and Yli-Hietanen, Jari and H{\"a}ndel, P.},
    editor = "Theodoridis, S. et al.",
    booktitle = "Proceedings of EUSIPCO'98, 9th European Signal Processing Conference",
    pages = "161-164",
    title = "{D}esign of {M}ulti-{D}elay {P}redictive {F}ilters {U}sing {D}ynamic {P}rogramming",
    volume = "1",
    year = "1998"
    }

  • J. Yli-Hietanen, K. Koppinen, and E. Paajanen, "Siren Sound Suppression for Speech Enhancement in Mobile Communications," in Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT98), Toronto, Canada, 1998, pp. 1277-1280.
    [BibTeX]
    @inproceedings{1998_ICSPAT,
    author = "Yli-Hietanen, Jari and Koppinen, Konsta and Paajanen, Erkki",
    address = "Toronto, Canada",
    booktitle = "ICSPAT98",
    month = "September",
    pages = "1277-1280",
    title = "{S}iren {S}ound {S}uppression for {S}peech {E}nhancement in {M}obile {C}ommunications",
    year = "1998"
    }

  • J. Yli-Hietanen, K. Koppinen, and K. Halonen, "Cluster filter," in 9th European Signal Processing Conference (EUSIPCO 1998), 1998, pp. 1905-1907.
    [BibTeX]
    @INPROCEEDINGS{1998_EUSIPCO_1998_a,
    author = "Yli-Hietanen, Jari and Koppinen, Konsta and Halonen, Katriina",
    booktitle = "9th European Signal Processing Conference (EUSIPCO 1998)",
    title = "Cluster filter",
    year = "1998",
    volume = "",
    number = "",
    pages = "1905-1907",
    keywords = "Delay effects;Robustness;Estimation;Sensor arrays;Signal processing;Delays"
    }

1997

  • K. Koppinen, J. Yli-Hietanen, and J. Astola, "Optimization of generalized predictors," in Proceedings of the IEEE Instrumentation and Measurement Technology Conference (IMTC), Ottawa, Canada, 1997, pp. 54-59.
    [BibTeX]
    @inproceedings{1997_IMTC,
    author = "Koppinen, Konsta and Yli-Hietanen, Jari and Astola, Jaakko",
    address = "Ottawa, Canada",
    booktitle = "IMTC Proceedings",
    pages = "54-59",
    title = "{O}ptimization of generalized predictors",
    volume = "1",
    year = "1997"
    }

1996

  • K. Koppinen, O. Vainio, and J. Astola, "Analysis and Design of Polynomial Predictors," in Proc. IEEE Nordic Signal Processing Symposium, Espoo, Finland, 1996, pp. 45-48.
    [BibTeX]
    @inproceedings{1996_NORSIG_a,
    author = "Koppinen, Konsta and Vainio, O. and Astola, Jaakko",
    address = "Espoo, Finland",
    booktitle = "Proc. IEEE Nordic Signal Processing Symposium",
    month = "Sep",
    pages = "45-48",
    title = "{A}nalysis and {D}esign of {P}olynomial {P}redictors",
    year = "1996"
    }

  • J. Yli-Hietanen, K. Kalliojärvi, and J. Astola, "Robust Time-Delay Based Angle of Arrival Estimation," in Proceedings of Norsig'96, 1996.
    [BibTeX]
    @inproceedings{1996_NORSIG,
    author = {Yli-Hietanen, Jari and Kallioj{\"a}rvi, Kari and Astola, Jaakko},
    booktitle = "Proceedings of Norsig'96",
    title = "{R}obust {T}ime-{D}elay {B}ased {A}ngle of {A}rrival {E}stimation",
    year = "1996"
    }

  • J. Yli-Hietanen, K. Kalliojärvi, and J. Astola, "Low-complexity angle of arrival estimation of wideband signals using small arrays," in Proceedings of 8th Workshop on Statistical Signal and Array Processing, 1996, pp. 109-112. doi:10.1109/SSAP.1996.534832
    [BibTeX] [Abstract]

    When the signal to noise ratio is relatively high, the angle of arrival of the strongest signal can be estimated with a very simple method and a small 3D sensor array. The differences in the arrival times of the wideband signal received by spatially separated sensors are estimated using the polarity coincidence correlation. These time differences, i.e. time delays, determine the angle of arrival. In this paper the effects of quantization of the time delays are studied. It is found out that this simple method gives comparable performance to the conventional direct correlation based methods in the case of a relatively high signal to noise ratio.

    @inproceedings{1996_ICA,
    author = {Yli-Hietanen, Jari and Kallioj{\"a}rvi, Kari and Astola, Jaakko},
    abstract = "When the signal to noise ratio is relatively high, the angle of arrival of the strongest signal can be estimated with a very simple method and a small 3D sensor array. The differences in the arrival times of the wideband signal received by spatially separated sensors are estimated using the polarity coincidence correlation. These time differences, i.e. time delays, determine the angle of arrival. In this paper the effects of quantization of the time delays are studied. It is found out that this simple method gives comparable performance to the conventional direct correlation based methods in the case of a relatively high signal to noise ratio.",
    booktitle = "Proceedings of 8th Workshop on Statistical Signal and Array Processing",
    doi = "10.1109/SSAP.1996.534832",
    keywords = "direction-of-arrival estimation;array signal processing",
    pages = "109-112",
    title = "{L}ow-complexity angle of arrival estimation of wideband signals using small arrays",
    year = "1996"
    }
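
    A minimal Python sketch of the polarity-coincidence time-delay estimation described in the abstract above, reduced to a single two-sensor pair. The pair spacing, sampling rate, speed of sound, and noise-free test signal are assumptions for illustration; the paper's 3D array geometry and time-delay quantization analysis are not reproduced.

    import numpy as np

    def polarity_coincidence_delay(x1, x2, max_lag):
        """Estimate the delay (in samples) of x2 relative to x1 from signal polarities only."""
        s1, s2 = np.sign(x1), np.sign(x2)
        lags = np.arange(-max_lag, max_lag + 1)
        scores = []
        for k in lags:
            if k >= 0:
                a, b = s1[:len(s1) - k], s2[k:]
            else:
                a, b = s1[-k:], s2[:len(s2) + k]
            scores.append(np.mean(a * b))  # +1 where polarities agree, -1 where they differ
        return lags[int(np.argmax(scores))]

    def angle_of_arrival(delay_samples, fs, spacing_m, c=343.0):
        """Broadside angle (degrees) for a two-sensor pair from an estimated time delay."""
        sin_theta = np.clip(c * (delay_samples / fs) / spacing_m, -1.0, 1.0)
        return float(np.degrees(np.arcsin(sin_theta)))

    # Illustrative usage: wideband noise arriving 12 samples later at the second sensor.
    rng = np.random.default_rng(0)
    fs, spacing, true_delay = 48000, 0.20, 12
    x1 = rng.standard_normal(fs)
    x2 = np.roll(x1, true_delay)
    d_hat = polarity_coincidence_delay(x1, x2, max_lag=40)
    print(d_hat, round(angle_of_arrival(d_hat, fs, spacing), 1))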