New Avenues in Audio Intelligence: Towards Holistic Real-life Audio Understanding

