Alexey Karpov
Rodmonga Potapova (Eds.)

LNAI 12335

Speech and Computer

22nd International Conference, SPECOM 2020
St. Petersburg, Russia, October 7–9, 2020
Proceedings
Lecture Notes in Artificial Intelligence 12335

Subseries of Lecture Notes in Computer Science

Series Editors
Randy Goebel
University of Alberta, Edmonton, Canada
Yuzuru Tanaka
Hokkaido University, Sapporo, Japan
Wolfgang Wahlster
DFKI and Saarland University, Saarbrücken, Germany

Founding Editor
Jörg Siekmann
DFKI and Saarland University, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/1244
Editors

Alexey Karpov
St. Petersburg Institute for Informatics and Automation
of the Russian Academy of Sciences
St. Petersburg, Russia

Rodmonga Potapova
Institute for Applied and Mathematical Linguistics
Moscow State Linguistic University
Moscow, Russia

ISSN 0302-9743    ISSN 1611-3349 (electronic)
Lecture Notes in Artificial Intelligence
ISBN 978-3-030-60275-8    ISBN 978-3-030-60276-5 (eBook)
https://doi.org/10.1007/978-3-030-60276-5
LNCS Sublibrary: SL7 – Artificial Intelligence

© Springer Nature Switzerland AG 2020


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The International Conference on Speech and Computer (SPECOM) has become a regular event since the first SPECOM was held in St. Petersburg, Russia, in October 1996. SPECOM was established 24 years ago by the St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS) and the Herzen State Pedagogical University of Russia, thanks to the efforts of Prof. Yuri Kosarev and Prof. Rajmund Piotrowski.
SPECOM is a conference with a long tradition that attracts researchers in the area of computer speech processing, including recognition, synthesis, and understanding, and related domains such as signal processing, language and text processing, computational paralinguistics, multi-modal speech processing, and human-computer interaction. The SPECOM international conference is an ideal platform for know-how exchange, especially for experts working on Slavic and other highly inflectional languages, including both under-resourced and well-resourced languages.
In its long history, the SPECOM conference was organized alternately by SPIIRAS
and by the Moscow State Linguistic University (MSLU) in their hometowns. Furthermore, in 1997 it was organized by the Cluj-Napoca Subsidiary of the Research
Institute for Computer Technique (Romania), in 2005 by the University of Patras (in
Patras, Greece), in 2011 by the Kazan Federal University (Russia), in 2013 by the
University of West Bohemia (in Pilsen, Czech Republic), in 2014 by the University of
Novi Sad (in Novi Sad, Serbia), in 2015 by the University of Patras (in Athens,
Greece), in 2016 by the Budapest University of Technology and Economics (in
Budapest, Hungary), in 2017 by the University of Hertfordshire (in Hatfield, UK), in
2018 by the Leipzig University of Telecommunications (in Leipzig, Germany), and in
2019 by the Boğaziçi University (in Istanbul, Turkey).
SPECOM 2020 was the 22nd event in the series, and this time it was organized by SPIIRAS in cooperation with MSLU during October 7–9, 2020, in an online format. In July 2020, SPIIRAS incorporated five other research institutions and was transformed into the St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS). The conference was sponsored by HUAWEI (Russian Research Center) as a general sponsor and by ASM Solutions (Moscow, Russia), and was supported by the International Speech Communication Association (ISCA) and the Saint Petersburg Convention Bureau. The official conference service agency was Monomax PCO.
SPECOM 2020 was held jointly with the 5th International Conference on Interactive
Collaborative Robotics (ICR 2020), where problems and modern solutions of
human-robot interaction were discussed.
During SPECOM and ICR 2020, three keynote lectures were given by Prof. Isabel
Trancoso (University of Lisbon and INESC-ID, Lisbon, Portugal) on “Profiling Speech
for Clinical Applications”, by Dr. Ilshat Mamaev (Karlsruhe Institute of Technology,
Germany) on “A Concept for a Human-Robot Collaboration Workspace using Proximity Sensors”, as well as by researchers of HUAWEI (Russian Research Center).
Due to the COVID-19 global pandemic, SPECOM 2020 was organized as a fully virtual conference for the first time. The virtual conference, held online via Zoom, had a number of advantages: an increased number of participants, because listeners could take part without any fees; substantially reduced registration fees for authors of the presented papers; no costs for travel and accommodation; a paperless green conference with electronic-only proceedings; free access to video presentations after the conference; and comfortable home conditions.
This volume contains the papers presented at the conference, each thoroughly reviewed by members of the Program Committee and additional reviewers, together comprising more than 100 top specialists in the conference topic areas. In total, 65 of the more than 160 papers submitted to SPECOM/ICR were selected by the Program Committee for presentation at the conference and for inclusion in this book. Theoretical and more general contributions were presented in common plenary sessions. Problem-oriented sessions as well as panel discussions brought together specialists in narrower problem areas with the aim of exchanging knowledge and skills resulting from research projects of all kinds.
We would like to express our gratitude to all authors for providing their papers on
time, to the members of the conference Program Committee and the organizers of the
special sessions for their careful reviews and paper selection, and to the editors and
correctors for their hard work in preparing this volume. Special thanks are due to the
members of the Organizing Committee for their tireless effort and enthusiasm during
the conference organization.

October 2020

Alexey Karpov
Rodmonga Potapova
Organization

The 22nd International Conference on Speech and Computer (SPECOM 2020) was
organized by the St. Petersburg Institute for Informatics and Automation of the Russian
Academy of Sciences (SPIIRAS, St. Petersburg, Russia) in cooperation with the
Moscow State Linguistic University (MSLU, Moscow, Russia). The conference
website is: http://specom.nw.ru/2020.

General Chairs

Alexey Karpov, SPIIRAS, Russia
Rodmonga Potapova, MSLU, Russia

Program Committee

Shyam Agrawal, India
Tanel Alumäe, Estonia
Elias Azarov, Belarus
Anton Batliner, Germany
Jerome Bellegarda, USA
Milana Bojanic, Serbia
Nick Campbell, Ireland
Eric Castelli, Vietnam
Josef Chaloupka, Czech Republic
Vladimir Chuchupal, Russia
Nicholas Cummins, Germany
Maria De Marsico, Italy
Febe De Wet, South Africa
Vlado Delić, Serbia
Anna Esposito, Italy
Yannick Estève, France
Keelan Evanini, USA
Vera Evdokimova, Russia
Nikos Fakotakis, Greece
Mauro Falcone, Italy
Philip Garner, Switzerland
Gábor Gosztolya, Hungary
Tunga Gungor, Turkey
Abualseoud Hanani, Palestine
Ruediger Hoffmann, Germany
Marek Hrúz, Czech Republic
Kristiina Jokinen, Japan
Oliver Jokisch, Germany
Denis Jouvet, France
Tatiana Kachkovskaia, Russia
Alexey Karpov, Russia
Heysem Kaya, The Netherlands
Tomi Kinnunen, Finland
Irina Kipyatkova, Russia
Daniil Kocharov, Russia
Liliya Komalova, Russia
Evgeny Kostyuchenko, Russia
Galina Lavrentyeva, Russia
Benjamin Lecouteux, France
Anat Lerner, Israel
Boris Lobanov, Belarus
Elena Lyakso, Russia
Joseph Mariani, France
Konstantin Markov, Japan
Jindřich Matoušek, Czech Republic
Yuri Matveev, Russia
Ivan Medennikov, Russia
Peter Mihajlik, Hungary
Wolfgang Minker, Germany
Iosif Mporas, UK
Ludek Muller, Czech Republic
Bernd Möbius, Germany
Sebastian Möller, Germany
Satoshi Nakamura, Japan
Jana Neitsch, Denmark
Stavros Ntalampiras, Italy
Dimitar Popov, Bulgaria
Branislav Popović, Serbia
Vsevolod Potapov, Russia
Rodmonga Potapova, Russia
Valeriy Pylypenko, Ukraine
Gerhard Rigoll, Germany
Fabien Ringeval, France
Milan Rusko, Slovakia
Sergey Rybin, Russia
Sakriani Sakti, Japan
Albert Ali Salah, The Netherlands
Maximilian Schmitt, Germany
Friedhelm Schwenker, Germany
Milan Sečujski, Serbia
Tatiana Sherstinova, Russia
Tatiana Shevchenko, Russia
Ingo Siegert, Germany
Vered Silber-Varod, Israel
Vasiliki Simaki, Sweden
Pavel Skrelin, Russia
Claudia Soria, Italy
Victor Sorokin, Russia
Tilo Strutz, Germany
Sebastian Stüker, Germany
Ivan Tashev, USA
Natalia Tomashenko, France
Laszlo Toth, Hungary
Isabel Trancoso, Portugal
Jan Trmal, USA
Charl van Heerden, South Africa
Vasilisa Verkhodanova, The Netherlands
Matthias Wolff, Germany
Zeynep Yucel, Japan
Miloš Železný, Czech Republic

Additional Reviewers

Gerasimos Arvanitis, Greece
Alexandr Axyonov, Russia
Cem Rıfkı Aydın, Turkey
Gözde Berk, Turkey
Tijana Delić, Serbia
Denis Dresvyanskiy, Germany
Bojana Jakovljević, Serbia
Uliana Kochetkova, Russia
Sergey Kuleshov, Russia
Olesia Makhnytkina, Russia
Danila Mamontov, Germany
Maxim Markitantov, Russia
Dragiša Mišković, Serbia
Dmitry Ryumin, Russia
Andrey Shulipa, Russia
Siniša Suzić, Serbia
Alena Velichko, Russia
Oxana Verkholyak, Russia

Organizing Committee

Alexey Karpov (Chair)
Andrey Ronzhin
Rodmonga Potapova
Daniil Kocharov
Irina Kipyatkova
Dmitry Ryumin
Natalia Kashina
Ekaterina Miroshnikova
Natalia Dormidontova
Margarita Avstriyskaya
Dmitriy Levonevskiy
Contents

Lightweight CNN for Robust Voice Activity Detection . . . . . . . . 1
Tanvirul Alam and Akib Khan

Hate Speech Detection Using Transformer Ensembles on the HASOC Dataset . . . . . . . . 13
Pedro Alonso, Rajkumar Saini, and György Kovács

MP3 Compression to Diminish Adversarial Noise in End-to-End Speech Recognition . . . . . . . . 22
Iustina Andronic, Ludwig Kürzinger, Edgar Ricardo Chavez Rosas, Gerhard Rigoll, and Bernhard U. Seeber

Exploration of End-to-End ASR for OpenSTT – Russian Open Speech-to-Text Dataset . . . . . . . . 35
Andrei Andrusenko, Aleksandr Laptev, and Ivan Medennikov

Directional Clustering with Polyharmonic Phase Estimation for Enhanced Speaker Localization . . . . . . . . 45
Sergei Astapov, Dmitriy Popov, and Vladimir Kabarov

Speech Emotion Recognition Using Spectrogram Patterns as Features . . . . . . . . 57
Umut Avci

Pragmatic Markers in Dialogue and Monologue: Difficulties of Identification and Typical Formation Models . . . . . . . . 68
Natalia Bogdanova-Beglarian, Olga Blinova, Tatiana Sherstinova, Daria Gorbunova, Kristina Zaides, and Tatiana Popova

Data Augmentation and Loss Normalization for Deep Noise Suppression . . . . . . . . 79
Sebastian Braun and Ivan Tashev

Automatic Information Extraction from Scanned Documents . . . . . . . . 87
Lukáš Bureš, Petr Neduchal, and Luděk Müller

Dealing with Newly Emerging OOVs in Broadcast Programs by Daily Updates of the Lexicon and Language Model . . . . . . . . 97
Petr Cerva, Veronika Volna, and Lenka Weingartova

A Rumor Detection in Russian Tweets . . . . . . . . 108
Aleksandr Chernyaev, Alexey Spryiskov, Alexander Ivashko, and Yuliya Bidulya

Automatic Prediction of Word Form Reduction in Russian Spontaneous Speech . . . . . . . . 119
Maria Dayter and Elena Riekhakaynen

Formant Frequency Analysis of MSA Vowels in Six Algerian Regions . . . . . . . . 128
Ghania Droua-Hamdani

Emotion Recognition and Sentiment Analysis of Extemporaneous Speech Transcriptions in Russian . . . . . . . . 136
Anastasia Dvoynikova, Oxana Verkholyak, and Alexey Karpov

Predicting a Cold from Speech Using Fisher Vectors; SVM and XGBoost as Classifiers . . . . . . . . 145
José Vicente Egas-López and Gábor Gosztolya

Toxicity in Texts and Images on the Internet . . . . . . . . 156
Denis Gordeev and Vsevolod Potapov

An Automated Pipeline for Robust Image Processing and Optical Character Recognition of Historical Documents . . . . . . . . 166
Ivan Gruber, Pavel Ircing, Petr Neduchal, Marek Hrúz, Miroslav Hlaváč, Zbyněk Zajíc, Jan Švec, and Martin Bulín

Lipreading with LipsID . . . . . . . . 176
Miroslav Hlaváč, Ivan Gruber, Miloš Železný, and Alexey Karpov

Automated Destructive Behavior State Detection on the 1D CNN-Based Voice Analysis . . . . . . . . 184
Anastasia Iskhakova, Daniyar Wolf, and Roman Meshcheryakov

Rhythmic Structures of Russian Prose and Occasional Iambs (a Diachronic Case Study) . . . . . . . . 194
Evgeny Kazartsev, Arina Davydova, and Tatiana Sherstinova

Automatic Detection of Backchannels in Russian Dialogue Speech . . . . . . . . 204
Pavel Kholiavin, Anna Mamushina, Daniil Kocharov, and Tatiana Kachkovskaia

Experimenting with Attention Mechanisms in Joint CTC-Attention Models for Russian Speech Recognition . . . . . . . . 214
Irina Kipyatkova and Nikita Markovnikov

Comparison of Deep Learning Methods for Spoken Language Identification . . . . . . . . 223
Can Korkut, Ali Haznedaroglu, and Levent Arslan

Conceptual Operations with Semantics for a Companion Robot . . . . . . . . 232
Artemiy Kotov, Liudmila Zaidelman, Anna Zinina, Nikita Arinkin, Alexander Filatov, and Kirill Kivva

Legal Tech: Documents’ Validation Method Based on the Associative-Ontological Approach . . . . . . . . 244
Sergey Kuleshov, Alexandra Zaytseva, and Konstantin Nenausnikov

Audio Adversarial Examples for Robust Hybrid CTC/Attention Speech Recognition . . . . . . . . 255
Ludwig Kürzinger, Edgar Ricardo Chavez Rosas, Lujun Li, Tobias Watzel, and Gerhard Rigoll

CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition . . . . . . . . 267
Ludwig Kürzinger, Dominik Winkelbauer, Lujun Li, Tobias Watzel, and Gerhard Rigoll

Stylometrics Features Under Domain Shift: Do They Really “Context-Independent”? . . . . . . . . 279
Tatiana Litvinova

Speech Features of 13–15 Year-Old Children with Autism Spectrum Disorders . . . . . . . . 291
Elena Lyakso, Olga Frolova, Aleksey Grigorev, Viktor Gorodnyi, Aleksandr Nikolaev, and Anna Kurazhova

Multi-corpus Experiment on Continuous Speech Emotion Recognition: Convolution or Recurrence? . . . . . . . . 304
Manon Macary, Martin Lebourdais, Marie Tahon, Yannick Estève, and Anthony Rousseau

Detection of Toxic Language in Short Text Messages . . . . . . . . 315
Olesia Makhnytkina, Anton Matveev, Darya Bogoradnikova, Inna Lizunova, Anna Maltseva, and Natalia Shilkina

Transfer Learning in Speaker’s Age and Gender Recognition . . . . . . . . 326
Maxim Markitantov

Interactivity-Based Quality Prediction of Conversations with Transmission Delay . . . . . . . . 336
Thilo Michael and Sebastian Möller

Graphic Markers of Irony and Sarcasm in Written Texts . . . . . . . . 346
Polina Mikhailova

Digital Rhetoric 2.0: How to Train Charismatic Speaking with Speech-Melody Visualization Software . . . . . . . . 357
Oliver Niebuhr and Jana Neitsch

Generating a Concept Relation Network for Turkish Based on ConceptNet Using Translational Methods . . . . . . . . 369
Arif Sırrı Özçelik and Tunga Güngör

Bulgarian Associative Dictionaries in the LABLASS Web-Based System . . . . . . . . 379
Dimitar Popov, Velka Popova, Krasimir Kordov, and Stanimir Zhelezov

Preliminary Investigation of Potential Steganographic Container Localization . . . . . . . . 389
Rodmonga Potapova and Andrey Dzhunkovskiy

Some Comparative Cognitive and Neurophysiological Reactions to Code-Modified Internet Information . . . . . . . . 399
Rodmonga Potapova and Vsevolod Potapov

The Influence of Multimodal Polycode Internet Content on Human Brain Activity . . . . . . . . 412
Rodmonga Potapova, Vsevolod Potapov, Nataliya Lebedeva, Ekaterina Karimova, and Nikolay Bobrov

Synthetic Speech Evaluation by Differential Maps in Pleasure-Arousal Space . . . . . . . . 424
Jiří Přibil, Anna Přibilová, and Jindřich Matoušek

Investigating the Effect of Emoji in Opinion Classification of Uzbek Movie Review Comments . . . . . . . . 435
Ilyos Rabbimov, Iosif Mporas, Vasiliki Simaki, and Sami Kobilov

Evaluation of Voice Mimicking Using I-Vector Framework . . . . . . . . 446
Rajeev Rajan, Abhijith Girish, Adharsh Sabu, and Akshay Prasannan Latha

Score Normalization of X-Vector Speaker Verification System for Short-Duration Speaker Verification Challenge . . . . . . . . 457
Ivan Rakhmanenko, Evgeny Kostyuchenko, Evgeny Choynzonov, Lidiya Balatskaya, and Alexander Shelupanov

Genuine Spontaneous vs Fake Spontaneous Speech: In Search of Distinction . . . . . . . . 467
Ekaterina Razubaeva and Anton Stepikhov

Mixing Synthetic and Recorded Signals for Audio-Book Generation . . . . . . . . 479
Meysam Shamsi, Nelly Barbot, Damien Lolive, and Jonathan Chevelu

Temporal Concord in Speech Interaction: Overlaps and Interruptions in Spoken American English . . . . . . . . 490
Tatiana Shevchenko and Anastasia Gorbyleva

Cognitively Challenging: Language Shift and Speech Rate of Academic Bilinguals . . . . . . . . 500
Tatiana Shevchenko and Tatiana Sokoreva

Toward Explainable Automatic Classification of Children’s Speech Disorders . . . . . . . . 509
Dima Shulga, Vered Silber-Varod, Diamanta Benson-Karai, Ofer Levi, Elad Vashdi, and Anat Lerner

Recognition Performance of Selected Speech Recognition APIs – A Longitudinal Study . . . . . . . . 520
Ingo Siegert, Yamini Sinha, Oliver Jokisch, and Andreas Wendemuth

Does A Priori Phonological Knowledge Improve Cross-Lingual Robustness of Phonemic Contrasts? . . . . . . . . 530
Lucy Skidmore and Alexander Gutkin

Can We Detect Irony in Speech Using Phonetic Characteristics Only? – Looking for a Methodology of Analysis . . . . . . . . 544
Pavel Skrelin, Uliana Kochetkova, Vera Evdokimova, and Daria Novoselova

Automated Compilation of a Corpus-Based Dictionary and Computing Concreteness Ratings of Russian . . . . . . . . 554
Valery Solovyev and Vladimir Ivanov

Increasing the Accuracy of the ASR System by Prolonging Voiceless Phonemes in the Speech of Patients Using the Electrolarynx . . . . . . . . 562
Petr Stanislav, Josef V. Psutka, and Josef Psutka

Leverage Unlabeled Data for Abstractive Speech Summarization with Self-supervised Learning and Back-Summarization . . . . . . . . 572
Paul Tardy, Louis de Seynes, François Hernandez, Vincent Nguyen, David Janiszek, and Yannick Estève

Uncertainty of Phone Voicing and Its Impact on Speech Synthesis . . . . . . . . 581
Daniel Tihelka, Zdeněk Hanzlíček, and Markéta Jůzová

Grappling with Web Technologies: The Problems of Remote Speech Recording . . . . . . . . 592
Daniel Tihelka, Markéta Jůzová, and Jakub Vít

Robust Noisy Speech Parameterization Using Convolutional Neural Networks . . . . . . . . 603
Ryhor Vashkevich and Elias Azarov

More than Words: Cross-Linguistic Exploration of Parkinson’s Disease Identification from Speech . . . . . . . . 613
Vass Verkhodanova, Dominika Trčková, Matt Coler, and Wander Lowie

Phonological Length of L2 Czech Speakers’ Vowels in Ambiguous Contexts as Perceived by L1 Listeners . . . . . . . . 624
Jitka Veroňková and Tomáš Bořil

Learning an Unsupervised and Interpretable Representation of Emotion from Speech . . . . . . . . 636
Siwei Wang, Catherine Soladié, and Renaud Séguier

Synchronized Forward-Backward Transformer for End-to-End Speech Recognition . . . . . . . . 646
Tobias Watzel, Ludwig Kürzinger, Lujun Li, and Gerhard Rigoll

KazNLP: A Pipeline for Automated Processing of Texts Written in Kazakh Language . . . . . . . . 657
Zhandos Yessenbayev, Zhanibek Kozhirbayev, and Aibek Makazhanov

Diarization Based on Identification with X-Vectors . . . . . . . . 667
Zbyněk Zajíc, Josef V. Psutka, and Luděk Müller

Different Approaches in Cross-Language Similar Documents Retrieval in the Legal Domain . . . . . . . . 679
Vladimir Zhebel, Denis Zubarev, and Ilya Sochenkov

Author Index . . . . . . . . 687


Lightweight CNN for Robust Voice Activity Detection

Tanvirul Alam and Akib Khan

BJIT Limited, Dhaka, Bangladesh
{tanvirul.alam,akib.khan}@bjitgroup.com

Abstract. Voice activity detection (VAD) is an important preprocessing step in many speech-related applications. Convolutional neural networks (CNN) are widely used for different audio classification tasks and have been adopted successfully for this. In this work, we propose a lightweight CNN architecture for real-time voice activity detection. We use strong data augmentation and regularization to improve the performance of the model. Using a knowledge distillation approach, we transfer knowledge from a larger CNN model, which leads to better generalization ability and robust performance of the CNN architecture in noisy conditions. The resulting network obtains a 62.6% relative improvement in EER compared to a deep feedforward neural network (DNN) of comparable parameter count on a noisy test dataset.

Keywords: Voice activity detection · Convolutional neural networks · Regularization · Knowledge distillation

1 Introduction

Voice activity detection (VAD) is the task of identifying the parts of a noisy speech signal that contain human speech activity. It is widely used in many speech processing applications, including automatic speech recognition, speech enhancement, speech synthesis, and speaker identification. Usually, VAD is applied in the initial phase of such an application to improve the efficacy of the later tasks. VAD algorithms are thus required to be robust to different kinds of environmental noise while having a low computational cost and a small memory footprint.
A number of techniques have been proposed for VAD in the literature. Early works were based on energy-based features, in combination with zero crossing rates and periodicity measures [15,32,33]. As these approaches are highly affected by additive noise, various other features have been proposed [2,7,23]. Different supervised [19,36] and unsupervised [23,34] learning algorithms have been adopted for the task as well.
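To make the classical approach concrete, a minimal frame-level detector combining energy and zero-crossing rate can be sketched as follows; the function name, frame length, and thresholds are illustrative choices, not taken from the cited works.

```python
import numpy as np

def simple_vad(signal, frame_len=400, energy_thresh=0.01, zcr_thresh=0.25):
    """Classic energy + zero-crossing-rate VAD sketch.

    A frame is labeled as speech when its energy is high and its
    zero-crossing rate is low (voiced speech crosses zero rarely,
    whereas wideband noise crosses it often)."""
    decisions = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                        # average power
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # crossings per sample
        decisions.append(bool(energy > energy_thresh and zcr < zcr_thresh))
    return decisions
```

Such fixed thresholds are exactly what breaks down under additive noise, which motivates the learned approaches discussed next.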
More recently, deep learning has been used for the VAD due to their abil-
ity to model complex functions and learn robust features from the dataset itself.

c Springer Nature Switzerland AG 2020


A. Karpov and R. Potapova (Eds.): SPECOM 2020, LNAI 12335, pp. 1–12, 2020.
https://doi.org/10.1007/978-3-030-60276-5_1
2 T. Alam and A. Khan

Convolutional neural networks (CNN) have been used for VAD as they are capa-
ble of learning rich, local feature representation and are more invariant to input
position and distortion [1,25,26,30]. Recurrent neural networks (RNN) and espe-
cially their variants, long short-term memory (LSTM) networks, are capable of
learning long range dependencies between inputs and are also used for VAD
[5,12,27]. Both CNN- and LSTM-based approaches perform better than multilayer
perceptrons under noisy conditions [31]. However, if the audio duration is
too long, an LSTM may become trapped in a dead state, and its performance can
degrade [1]. CNNs do not suffer from this while also being computationally more
efficient. CNN and RNN can also be combined to form convolutional recurrent
neural networks that can benefit from frequency modeling with CNN and tem-
poral modeling with RNN [35]. Denoising autoencoders are often used to learn
noise-robust features which are then used for the classification task [13,16,37].
Speech features are augmented with estimated noise information for noise aware
training in [31] to improve robustness.
We adopt a CNN in our study due to its computational efficiency and reliable
performance in VAD and other audio classification tasks [4,10,24]. Since our
focus is to develop a VAD which is robust to different kinds of noise, we synthesize
a training dataset under different noisy conditions and signal to noise ratio (SNR)
levels. To gauge the robustness of the learned model, we prepare a test dataset
by using speech and noise data obtained from a different source.
One of our primary goals is to improve the performance of the network without
incurring additional computational cost. For this, we design a lightweight
CNN architecture. We use SpecAugment [20] and DropBlock [6] regularization
to improve the generalization performance of the network, both of which improve
upon the baseline. Deeper CNN models trained with strong regularization
tend to have better generalization performance and perform better under
unseen noise sources. However, they are often not feasible for real-time use in
constrained settings (e.g., in mobile devices where low memory and fast infer-
ence time are often preferred). To address this issue, we train a larger model
that achieves better performance compared to the proposed CNN architecture.
We then use it to transfer knowledge to our lightweight CNN architecture using
knowledge distillation [11]. This further improves the network’s performance
under unseen noise types.
We organize the rest of the paper as follows. In Sect. 2, we describe our base-
line CNN architecture along with its improvements using SpecAugment, Drop-
Block and knowledge distillation. In Sect. 3, we describe the training and test
datasets. We provide details of the experiments, different hyperparameter set-
tings, and results obtained on development and test set with different approaches
in Sect. 4. Finally, in Sect. 5, we conclude our paper and provide an outlook on
future work.
Lightweight CNN for Robust Voice Activity Detection 3

2 Method
2.1 Deep Convolutional Neural Network
Deep convolutional neural network (CNN) architectures have been shown to
be successful for the voice activity detection task [1,30,31] as they are capa-
ble of learning rich hierarchical representations by utilizing local filters. Features
extracted from an extended, centered context window are used as network input
for frame-level voice activity detection. We design a CNN architecture while
keeping in mind the computational and memory constraints often encountered
by VAD applications. The network is composed of three convolution and pooling
layers followed by two fully connected layers. The first convolution layer and
pooling layer each use a (5 × 5) kernel. The second and third convolution layers
each use a (3 × 3) kernel, and each is followed by a (2 × 2) max pooling layer.
The three convolution layers consist of 32, 48 and 64 filters respectively.
layer has 64 hidden units and the second fully connected layer has 2 hidden
units corresponding to the two categories of interest, specifically voice and non-
voice. ReLU nonlinearity is used after each convolution and fully connected layer
except for the final layer, which uses a softmax activation function. Dropout [29] is
used after the first fully connected layer.
All audio is sampled at 16 kHz. We use librosa [18] to extract log-scaled
mel spectrogram features with 40 frequency bands. We use a window size of 512
and a hop length of 256, covering the frequency range of 300–8000 Hz. To stabilize
the mel spectrogram output we use log(melspectrogram + 0.001), where the off-
set is used to avoid taking the logarithm of zero. The log mel spectrograms are
normalized to have zero mean value. We use an extended context window of
40 frames, i.e., the input shape is 40 × 40. We treat the mel spectrogram feature
as a single-channel image and apply successive convolution and pooling
operations on it. Padding is used to keep the spatial dimension unchanged during
the convolution operation. The network has 58,994 trainable parameters and a
total of 2.7M multiply-accumulate operations. This network can run in real time
on most modern processors. See Table 1 for specific layer details.

Table 1. Proposed CNN architecture. Data shape represents the dimension in channel,
frequency, time. Each convolution and fully connected layer includes a bias term.

Layer    ksize   Filters  Stride  Data shape
Input                             (1, 40, 40)
conv1    (5, 5)  32       (1, 1)  (32, 40, 40)
pool1    (5, 5)           (5, 5)  (32, 8, 8)
conv2    (3, 3)  48       (1, 1)  (48, 8, 8)
pool2    (2, 2)           (2, 2)  (48, 4, 4)
conv3    (3, 3)  64       (1, 1)  (64, 4, 4)
pool3    (2, 2)           (2, 2)  (64, 2, 2)
fc1              64               (64,)
Dropout                           (64,)
fc2              2                (2,)

Fig. 1. SpecAugment used in this study. top-left: Log mel spectrogram for a sample
audio without augmentation, top-right: Frequency masking, bottom-left: Time masking,
bottom-right: A combination of three frequency and time masks applied.
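The layer specification in Table 1 maps directly onto a compact PyTorch module. The sketch below is a reconstruction rather than the authors' code; the padding values are inferred from the reported output shapes:

```python
import torch
import torch.nn as nn


class VADNet(nn.Module):
    """Lightweight CNN for frame-level VAD, following Table 1."""

    def __init__(self, dropout: float = 0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # -> (32, 40, 40)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=5, stride=5),        # -> (32, 8, 8)
            nn.Conv2d(32, 48, kernel_size=3, padding=1),  # -> (48, 8, 8)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> (48, 4, 4)
            nn.Conv2d(48, 64, kernel_size=3, padding=1),  # -> (64, 4, 4)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> (64, 2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # 64 * 2 * 2 = 256
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 2),                             # voice / non-voice logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```

With these padding choices the parameter count works out to exactly the 58,994 quoted above, which supports the reconstruction.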
We also train a deep feedforward neural network (DNN) consisting of three
hidden layers, and compare its performance to that of the CNN model. Each
hidden layer of the DNN has 36 units. All fully connected layers except the last
one are followed by ReLU non-linearity. Dropout is not added in this network
as it causes the network to underfit on the dataset. This network is designed to
have a parameter count comparable to that of the CNN architecture and has 60,374 total
trainable parameters. The DNN is trained using the same input features as in
the CNN architecture.
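The input pipeline shared by both networks can be sketched as follows. The mel spectrogram itself would come from librosa as described earlier; how window boundaries are handled is not specified in the text, so the edge-padding below is an assumption:

```python
import numpy as np


def log_mel_normalize(mel_spec: np.ndarray) -> np.ndarray:
    """Stabilized log compression followed by zero-mean normalization."""
    log_mel = np.log(mel_spec + 0.001)  # offset avoids log(0)
    return log_mel - log_mel.mean()


def context_windows(log_mel: np.ndarray, context: int = 40) -> np.ndarray:
    """One extended, centered window of `context` frames per input frame.

    Takes a (n_mels, n_frames) log-mel array and returns an array of shape
    (n_frames, n_mels, context); frames near the edges are edge-padded.
    """
    n_mels, n_frames = log_mel.shape
    left = context // 2
    padded = np.pad(log_mel, ((0, 0), (left, context - left - 1)), mode="edge")
    return np.stack([padded[:, t:t + context] for t in range(n_frames)])
```

Each resulting 40 × 40 window is then treated as the single-channel input image described above.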

2.2 SpecAugment

SpecAugment was introduced in [20] as a simple but effective augmentation
strategy for speech recognition. The paper introduced three types of deformations
that can be applied directly to spectrogram features: time warping,
frequency masking and time masking. Here, we apply a combination of frequency
and time masking (see Fig. 1). We omit time warping as it was reported in [20]
to be the least influential while being the most expensive. Frequency masking
is applied to f consecutive mel frequency channels [f0 , f0 + f ) where f is first
chosen from a uniform distribution from 0 to a selectable parameter F , and f0
is chosen uniformly from [0, v − f ] with v being the number of mel frequency
channels. Similarly, for time masking, t consecutive time steps [t0 , t0 + t) are
masked, where t is chosen from a uniform distribution from 0 to the time mask
parameter T and t0 is chosen uniformly from [0, τ − t), where τ is the number of
total time steps. We apply between 0 and n such masks randomly to each spectrogram
image, where n is a tunable parameter. For each mask, we randomly select

Fig. 2. Schematics of DropBlock regularization used in this study. left: Dropout which
drops activations at random, right: DropBlock drops continuous regions of features.

either frequency or time masking and set F = αv and T = ατ . Since the log mel
spectrogram is normalized to have zero mean value, setting the masked value
to zero is equivalent to setting it to the mean value. SpecAugment effectively
converts an overfitting to an underfitting problem. This enables us to train larger
networks with longer duration (i.e., more epochs) without overfitting and leads
to improved generalization performance of the model.
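The masking procedure can be sketched in NumPy as below, for a (v, τ)-shaped log-mel input. Since the input is mean-normalized, masked regions are simply set to zero; treating both mask-offset ranges as inclusive is a simplification:

```python
import numpy as np


def spec_augment(spec: np.ndarray, alpha: float = 0.2, n_max: int = 3, rng=None):
    """Apply between 0 and n_max random frequency/time masks to a copy of
    `spec`, with F = alpha * v and T = alpha * tau as in Sect. 2.2."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    v, tau = out.shape
    F, T = int(alpha * v), int(alpha * tau)
    for _ in range(rng.integers(0, n_max + 1)):
        if rng.random() < 0.5:                  # frequency mask
            f = rng.integers(0, F + 1)
            f0 = rng.integers(0, v - f + 1)
            out[f0:f0 + f, :] = 0.0             # 0 equals the mean after normalization
        else:                                   # time mask
            t = rng.integers(0, T + 1)
            t0 = rng.integers(0, tau - t + 1)
            out[:, t0:t0 + t] = 0.0
    return out
```

The defaults match the n = 3 and α = 0.2 settings reported in Sect. 4.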

2.3 DropBlock
Although dropout is commonly used as a regularization technique for fully connected
layers, it is not as effective for convolution layers. This is because features
are spatially correlated in convolution layers, and dropping units randomly does
not effectively prevent information from being sent to the next layers. To remedy
this, the authors in [6] introduced DropBlock regularization, which drops contiguous
regions of feature maps (see Fig. 2). As this discards features in a spatially
correlated area, the network needs to learn discriminative features from different
regions of the feature map for classification. It was shown in [6] to be a more
effective regularizer than dropout for image classification and detection tasks.
DropBlock has two main parameters: block size and γ. block size determines
the size of the square block to be dropped, and γ controls the number of
activation units to drop. In our experiments we apply DropBlock after the
first pooling layer only. We do not apply DropBlock to the subsequent layers,
as the resolution of the feature map becomes small.
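A simplified NumPy sketch of the DropBlock operation (training mode only). The exact seed probability derived in [6] is replaced here by a rough approximation, and the rescaling step is an assumption:

```python
import numpy as np


def drop_block(fmap: np.ndarray, block_size: int = 4, gamma: float = 0.3, rng=None):
    """Zero block_size x block_size regions of a (channels, h, w) feature map.

    Seed positions are sampled so that roughly a `gamma` fraction of units
    falls inside a dropped block (a crude stand-in for the formula in [6]).
    """
    rng = rng or np.random.default_rng()
    c, h, w = fmap.shape
    seed_prob = gamma / (block_size ** 2)
    mask = np.ones((c, h, w))
    seeds = rng.random((c, h, w)) < seed_prob
    for ci, yi, xi in zip(*np.nonzero(seeds)):
        y0 = max(0, yi - block_size // 2)
        x0 = max(0, xi - block_size // 2)
        mask[ci, y0:y0 + block_size, x0:x0 + block_size] = 0.0
    kept = mask.mean()
    return fmap * mask / max(kept, 1e-8)  # rescale to preserve expected activation
```

The defaults correspond to the block size = 4 and γ = 0.3 settings used in our experiments.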

2.4 Knowledge Distillation

Distillation is a method for transferring knowledge from an ensemble or from a
large, highly regularized model into a smaller model [11]. A side-effect of training
with a negative log-likelihood criterion is that the model assigns probabilities to all
classes including incorrect ones. While the probabilities of the incorrect answers
may be small, their relative probabilities may still be significant and provide
us insight into how larger models learn to generalize. This suggests that if a
model generalizes well, for example due to being the average of a large ensemble of
different models, we can train a smaller model in the same way to achieve better
generalization performance. To accomplish this, the probabilities produced by
the large model are used as soft targets for training the smaller model. Neural
networks used for classification tasks typically produce class probabilities by
using a softmax layer. This converts the logit z_i computed for each class into
a probability q_i using the following equation:

    q_i = exp(z_i / T) / Σ_j exp(z_j / T)                    (1)

Here, T is a temperature term that is usually set to 1. We can obtain a softer
probability distribution over classes by using a higher temperature. If the correct
labels are known, the smaller model can be trained to produce the correct labels
as well. The loss function then takes the following form:

    L_KD = α T² L_CE(Q_s^T, Q_t^T) + (1 − α) L_CE(Q_s, y_true)        (2)

Here, Q_s^T and Q_t^T are the softened probabilities of the smaller student and larger
teacher model respectively, generated using a higher temperature, and Q_s is
generated using T = 1. L_CE is the cross entropy loss function and α (0 ≤ α ≤ 1) is a
hyperparameter that controls the contribution of each part of the loss. The
first term in the loss is multiplied by T² since the magnitude of the gradients
produced by the soft targets of Eq. 1 scales as 1/T². Best results are obtained in
practice by using a considerably lower weight on the second objective, or
equivalently a large value of α.
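Equations (1) and (2) can be sketched in NumPy as follows. Using the teacher's softened probabilities as the cross-entropy target is our reading of the notation, and the default α and T match the best values reported in Sect. 4:

```python
import numpy as np


def softmax_T(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax of Eq. (1), applied along the last axis."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)


def kd_loss(z_s, z_t, y_true, alpha: float = 0.99, T: float = 8.0) -> float:
    """Distillation loss of Eq. (2) for batches of student/teacher logits."""
    q_s_T, q_t_T = softmax_T(z_s, T), softmax_T(z_t, T)
    q_s = softmax_T(z_s, 1.0)
    soft_ce = -(q_t_T * np.log(q_s_T)).sum(axis=-1).mean()    # L_CE(Q_s^T, Q_t^T)
    hard_ce = -np.log(q_s[np.arange(len(y_true)), y_true]).mean()
    return alpha * T ** 2 * soft_ce + (1 - alpha) * hard_ce
```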
CNNs designed for image classification tasks have been shown to be success-
ful in large scale audio classification tasks [10]. Inspired by this, we use Pre-
Act ResNet-18 [9] network as the teacher network to transfer knowledge to the
previously described CNN architecture. We use the implementation provided
in [3] which consists of 11.2M trainable parameters and 868M total multiply-
accumulate operations. This means it has roughly 190 times more parameters
compared to our baseline CNN architecture.

3 Dataset Description

We prepared the training dataset using MUSAN [28] corpus. The corpus consists
of 109 h of audio data partitioned into three categories: speech, music and noise.
The speech portion consists of about 60 h of data. It contains 20 h and 21 min of
read speech from LibriVox1, approximately half of which are in English and the
rest are from eleven different languages. The remainder of the speech portion
consists of 40 h and 1 min of US government hearings, committees and debates.
1 https://librivox.org/.
There are 929 noise files collected from different noise sources, ranging from
technical noises, such as DTMF tones, dial tones and fax machine noises, to
ambient sounds such as car idling, thunder, wind, footsteps and animal noises.
These were downloaded from Free Sound2 and Sound Bible3. The total noise
duration is about 6 h.
We used the MS-SNSD toolkit provided in [22] to generate our training and
test dataset. We removed the long silent parts from the train speech data using
pyAudioAnalysis [8]. We generated 60 h of training data with 30 h of noise and
speech data each. Noise was added to the speech data by selecting the SNR
randomly from a discrete uniform distribution over [−10, 20]. We used a minimum
audio length of 10 s when generating noisy speech. Five-fold cross validation was
performed during training.
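MS-SNSD performs the mixing itself; its core operation, scaling a noise signal so the mixture reaches a target SNR, can be sketched as follows (equal-length, single-channel signals are assumed):

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio
    equals `snr_db` (SNR_dB = 10 * log10(P_speech / P_noise))."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```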
We prepared an additional test dataset by using noise collected from 100 different
types4. We used the UW/NU corpus version 1.0 [17] for speech, which consists
of 20 speakers (10 male, 10 female) reciting 180 Harvard IEEE sentences each.
We resampled the audio to 16 kHz and used MS-SNSD to generate 10 h of test
data. We used 7 discrete SNR levels in the [−10, 20] range when preparing the test
data, in order to gauge model performance under different noisy conditions.

4 Experiments
4.1 Training Procedure

We trained the model using the cross entropy loss function and the Adam optimization
algorithm [14]. We used 256 samples per minibatch during training. The DNN
and baseline CNN networks were both trained for 20 epochs. We used an initial
learning rate of 0.001, which was decreased by a factor of 10 after 10 and 15
epochs. Dropout was applied after the first fully connected layer of the CNN
model with probability 0.5. We optimized the network hyperparameters using
the validation set. The network was checkpointed when the accuracy on the
validation set improved in an epoch and the network weights with the highest
validation set accuracy were applied to the test data. The network was trained
using the PyTorch [21] deep learning framework.
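The step schedule described above amounts to a piecewise-constant function of the epoch (0-indexed epochs are an assumption here); in PyTorch the same behavior is provided by `MultiStepLR`:

```python
def step_lr(epoch: int, base_lr: float = 1e-3, milestones=(10, 15),
            factor: float = 0.1) -> float:
    """Learning rate at a given epoch: base_lr decayed by `factor` at each
    milestone, matching the 20-epoch schedule described above."""
    return base_lr * factor ** sum(epoch >= m for m in milestones)
```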
When using SpecAugment and DropBlock, we noticed that training for more
epochs improved performance and so these networks were trained for 40 epochs.
The learning rate was initialized at 0.001 and decreased by a factor of 10 after 25 and
35 epochs. The validation set was used to identify the optimum parameter settings
for SpecAugment and DropBlock. We used n = 3 and α = 0.2 for SpecAugment.
For DropBlock, we applied block size = 4 and γ = 0.3. As mentioned in [6],

2 https://freesound.org/.
3 http://soundbible.com/.
4 http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/.

Table 2. AUC and EER (%) on validation and test set.

Method                          Validation      Test
                                AUC    EER      AUC    EER
DNN                             98.32  5.52     95.55  10.45
CNN                             99.57  2.44     99.07  4.72
PreAct ResNet-18                99.75  1.68     99.51  3.27
CNN + SpecAugment               99.63  2.29     99.22  4.37
CNN + DropBlock                 99.58  2.36     99.15  4.50
CNN + SpecAugment + DropBlock   99.65  2.24     99.31  4.14
CNN Distilled                   99.65  2.25     99.36  3.91

we also noticed that using a fixed value of γ did not work well, so we linearly
increased γ from 0 to 0.3, updating it at the end of each epoch.
The PreAct ResNet-18 model was trained for 20 epochs with a learning rate
schedule similar to that of the baseline CNN architecture. We did not apply dropout or
DropBlock for training this model but used SpecAugment with the configuration
mentioned above. The model with the best validation set performance was used
for transferring knowledge to our CNN model. SpecAugment and DropBlock
were used during distillation but dropout was removed as the network underfitted
the training data with dropout. The distilled network was trained for 40 epochs
using the same learning rate schedule as the regularized models. We experimented with
values of α ∈ {0.9, 0.95, 0.99} and T ∈ {2, 4, 8, 16}. Best results were obtained
for α = 0.99 and T = 8.

4.2 Results
We use the area under the ROC curve (AUC) and the equal error rate (EER) as
evaluation metrics, as they are commonly used in the literature. The reported results
are averages across 5 runs for both the validation and test sets. In this work, we
are only interested in frame-level performance, and so we have not conducted an
evaluation of segment-level performance. The results for the different models are
displayed in Table 2.
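The EER can be computed directly from frame-level scores by scanning the ROC operating points for the threshold where the miss and false-alarm rates cross; this is a generic sketch, not the authors' evaluation code:

```python
import numpy as np


def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate: the point where the false rejection rate (miss)
    equals the false acceptance rate, found by a scan over thresholds."""
    order = np.argsort(scores)[::-1]       # descending score
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tp = np.cumsum(labels)                 # positives accepted at each threshold
    fp = np.cumsum(1 - labels)             # negatives accepted at each threshold
    fnr = 1 - tp / n_pos                   # miss rate
    fpr = fp / n_neg                       # false alarm rate
    i = np.argmin(np.abs(fnr - fpr))
    return float((fnr[i] + fpr[i]) / 2)
```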
The baseline CNN model performs significantly better than the DNN model
on the validation and test datasets. PreAct ResNet-18 outperforms the baseline
CNN architecture. We investigate the effect of applying SpecAugment and
DropBlock in isolation, and then in combination. It is evident that both
SpecAugment and DropBlock improve upon the CNN baseline, although the
performance gain from SpecAugment is comparatively greater. Combining both
further improves the performance. This suggests that the applied regularization
is effective for improving the generalization performance of the network. Since the
PreAct ResNet-18 model has better performance than the CNN architecture,
it can serve as a suitable teacher model for distillation. Transferring

Fig. 3. left: AUC(%) on test data at different SNR levels. right: EER(%) on test data
at different SNR levels. Means and standard deviations of five runs are depicted by
solid lines and shaded areas respectively.

knowledge from this network further improves performance under unseen noisy
conditions. The distilled CNN model has a 62.6% lower EER than the DNN
architecture on the test dataset. It is interesting to note that the performance
gain from regularization and distillation is relatively greater on the test dataset,
which is constructed from unseen noise types. For example, compared to the
baseline CNN, the distilled CNN has a 7.8% relative EER reduction on the
validation dataset, but 17.2% on the test dataset.
Performance at 7 different SNR levels on the test dataset is plotted in
Fig. 3 for the different models. CNN + Regularization refers to the model trained
using SpecAugment and DropBlock. As expected, performance degrades as noise
increases. However, for the regularized and distilled models the drop is less severe
compared to the baseline CNN model. At high SNR levels (e.g., 20 SNR), the
different models perform comparably to each other. On the other hand, at the
extreme noisy condition of −10 SNR, we see a 1% absolute improvement in AUC
and a 1.68% absolute reduction in EER for the distilled CNN compared to the
baseline CNN architecture. This suggests that adding regularization and distillation
makes the model more robust to unseen and severe noisy conditions.

5 Conclusions
In this work, we have designed a lightweight CNN architecture for voice activity
detection. We have evaluated the trained model on a noisy test dataset and
shown that better results can be obtained using strong data augmentation and
regularization. We further demonstrated the effectiveness of knowledge distillation
for the task. Our proposed model is robust under severe and unseen noisy
conditions. We believe further improvements can be made by using an ensemble
of larger models instead of a single model to train the distilled model. In future
work, we plan to investigate the performance of the proposed approach on
audio recorded in noisy environments.

References
1. Chang, S.Y., et al.: Temporal modeling using dilated convolution and gating for
voice-activity-detection. In: 2018 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 5549–5553. IEEE (2018)
2. Chuangsuwanich, E., Glass, J.: Robust voice activity detector for real world appli-
cations using harmonicity and modulation frequency. In: Twelfth Annual Confer-
ence of the International Speech Communication Association (2011)
3. pytorch-cifar (2017). https://github.com/kuangliu/pytorch-cifar
4. Costa, Y.M., Oliveira, L.S., Silla Jr., C.N.: An evaluation of convolutional neural
networks for music classification using spectrograms. Appl. Soft Comput. 52, 28–38
(2017)
5. Eyben, F., Weninger, F., Squartini, S., Schuller, B.: Real-life voice activity detec-
tion with LSTM Recurrent Neural Networks and an application to Hollywood
movies. In: 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, pp. 483–487. IEEE (2013)
6. Ghiasi, G., Lin, T.Y., Le, Q.V.: Dropblock: a regularization method for convo-
lutional networks. In: Advances in Neural Information Processing Systems, pp.
10727–10737 (2018)
7. Ghosh, P.K., Tsiartas, A., Narayanan, S.S.: Robust voice activity detection using
long-term signal variability. IEEE Trans. Speech Audio Process. 19(3), 600–613
(2011)
8. Giannakopoulos, T.: Pyaudioanalysis: an open-source python library for audio sig-
nal analysis. PloS One 10(12), e0144610 (2015)
9. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In:
Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp.
630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0 38
10. Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017
IEEE International Conference on Acoustics, Speech and Signal Processing, pp.
131–135. IEEE (2017)
11. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural net-
work. In: NIPS Deep Learning and Representation Learning Workshop (2015),
arxiv:1503.02531
12. Hughes, T., Mierle, K.: Recurrent neural networks for voice activity detection. In:
2013 IEEE International Conference on Acoustics, Speech and Signal Processing,
pp. 7378–7382. IEEE (2013)
13. Jung, Y., Kim, Y., Choi, Y., Kim, H.: Joint learning using denoising variational
autoencoders for voice activity detection. In: INTERSPEECH, pp. 1210–1214
(2018)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd Inter-
national Conference on Learning Representations (2015)
15. Lamel, L., Rabiner, L., Rosenberg, A., Wilpon, J.: An improved endpoint detector
for isolated word recognition. IEEE Trans. Acoust. Speech Signal Process. 29(4),
777–785 (1981)
16. Lin, R., Costello, C., Jankowski, C., Mruthyunjaya, V.: Optimizing voice activity
detection for noisy conditions. In: Interspeech, pp. 2030–2034. ISCA (2019)
17. McCloy, D.R., Souza, P.E., Wright, R.A., Haywood, J., Gehani, N., Rudolph,
S.: The UW/NU corpus (2013). http://depts.washington.edu/phonlab/resources/
uwnu/, version 1.0

18. McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., Nieto,
O.: librosa: audio and music signal analysis in python. In: Proceedings of the 14th
Python in Science Conference, vol. 8 (2015)
19. Ng, T., et al.: Developing a speech activity detection system for the DARPA RATS
program. In: Thirteenth Annual Conference of the International Speech Commu-
nication Association (2012)
20. Park, D.S., et al.: Specaugment: a simple data augmentation method for automatic
speech recognition. In: Interspeech, pp. 2613–2617. ISCA (2019)
21. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning
library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035.
Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-pytorch-an-
imperative-style-high-performance-deep-learning-library.pdf
22. Reddy, C.K.A., Beyrami, E., Pool, J., Cutler, R., Srinivasan, S., Gehrke, J.: A
scalable noisy speech dataset and online subjective test framework. In: Interspeech,
pp. 1816–1820. ISCA (2019)
23. Sadjadi, S.O., Hansen, J.H.: Unsupervised speech activity detection using voicing
measures and perceptual spectral flux. IEEE Signal Process. Lett. 20(3), 197–200
(2013)
24. Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmen-
tation for environmental sound classification. IEEE Signal Process. Lett. 24(3),
279–283 (2017)
25. Saon, G., Thomas, S., Soltau, H., Ganapathy, S., Kingsbury, B.: The IBM speech
activity detection system for the DARPA RATS program. In: Interspeech, pp.
3497–3501. ISCA (2013)
26. Sehgal, A., Kehtarnavaz, N.: A convolutional neural network smartphone app for
real-time voice activity detection. IEEE Access 6, 9017–9026 (2018)
27. Shannon, M., Simko, G., Chang, S.Y., Parada, C.: Improved end-of-query detection
for streaming speech recognition. In: Interspeech, pp. 1909–1913 (2017)
28. Snyder, D., Chen, G., Povey, D.: MUSAN: a music, speech, and noise corpus. CoRR
abs/1510.08484 (2015), arxiv:1510.08484
29. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn.
Res 15(1), 1929–1958 (2014)
30. Thomas, S., Ganapathy, S., Saon, G., Soltau, H.: Analyzing convolutional neu-
ral networks for speech activity detection in mismatched acoustic conditions. In:
2014 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 2519–2523. IEEE (2014)
31. Tong, S., Gu, H., Yu, K.: A comparative study of robustness of deep learning
approaches for VAD. In: 2016 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 5695–5699. IEEE (2016)
32. Tucker, R.: Voice activity detection using a periodicity measure. IEEE Proc. I
(Commun. Speech Vision) 139(4), 377–380 (1992)
33. Woo, K.H., Yang, T.Y., Park, K.J., Lee, C.: Robust voice activity detection algo-
rithm for estimating noise spectrum. Electron. Lett. 36(2), 180–181 (2000)
34. Ying, D., Yan, Y., Dang, J., Soong, F.K.: Voice activity detection based on an
unsupervised learning framework. IEEE Trans. Audio Speech Lang. Process. 19(8),
2624–2633 (2011)
35. Zazo, R., Sainath, T.N., Simko, G., Parada, C.: Feature learning with raw-
waveform CLDNNs for voice activity detection. In: Interspeech, pp. 3668–3672
(2016)

36. Zhang, X.L., Wang, D.: Boosted deep neural networks and multi-resolution cochlea-
gram features for voice activity detection. In: Fifteenth Annual Conference of the
International Speech Communication Association (2014)
37. Zhang, X.L., Wu, J.: Denoising deep neural networks based voice activity detec-
tion. In: 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, pp. 853–857. IEEE (2013)
Hate Speech Detection Using Transformer
Ensembles on the HASOC Dataset

Pedro Alonso1, Rajkumar Saini1, and György Kovács1,2(B)

1 Embedded Internet Systems Lab, Luleå University of Technology, Luleå, Sweden
{pedro.alonso,rajkumar.saini,gyorgy.kovacs}@ltu.se
2 MTA-SZTE Research Group on Artificial Intelligence, Szeged, Hungary

Abstract. With the ubiquity and anonymity of the Internet, the spread
of hate speech has been a growing concern for many years now. The
language used for the purpose of dehumanizing, defaming or threatening
individuals and marginalized groups not only threatens the mental health
of its targets, as well as their democratic access to the Internet, but also
the fabric of our society. Because of this, much effort has been devoted
to manual moderation. The amount of data generated each day, particularly
on social media platforms such as Facebook and Twitter, however,
makes this a Sisyphean task. This has led to an increased demand for
automatic methods of hate speech detection.
Here, to contribute towards solving the task of hate speech detection,
we worked with a simple ensemble of transformer models on a Twitter-based
hate speech benchmark. Using this method, we attained a weighted
F1-score of 0.8426, which we managed to further improve by leveraging
more training data, achieving a weighted F1-score of 0.8504, thus
markedly outperforming the best performing system in the literature.

Keywords: Natural Language Processing · Hate speech detection · Transformers · RoBERTa · Ensemble

1 Introduction

There are many questions still surrounding the issue of hate speech. For one, it
is strongly debated whether hate speech should be prosecuted, or whether free
speech protections should extend to it [2,11,16,24]. Another question debated is
regarding the best counter-measure to apply, and whether it should be suppres-
sion (through legal measures, or banning/blocklists), or whether it should be
methods that tackle the root of the problem, namely counter-speech and educa-
tion [4]. These arguments, however, are fruitless without the ability to detect
hate speech en masse. And while manual detection may seem a simple (albeit
hardly scalable) solution, the burden of manual moderation [15], as well as the
sheer amount of data generated online, justify the need for an automatic solution
for detecting hateful and offensive content.

c Springer Nature Switzerland AG 2020


A. Karpov and R. Potapova (Eds.): SPECOM 2020, LNAI 12335, pp. 13–21, 2020.
https://doi.org/10.1007/978-3-030-60276-5_2
14 P. Alonso et al.

1.1 Related Work

The ubiquity of fast, reliable Internet access that enabled the sharing of information
and opinions at an unprecedented rate, paired with the opportunity for
anonymity [50], has been responsible for the increase in the spread of offensive
and hateful content in recent years. For this reason, the detection of hate speech
has been examined by many researchers [21,48]. These efforts date back to the
late nineties and Microsoft research, with the proposal of a rule-based system
named Smokey [36]. This has been followed by many similar proposals for rule-
based [29], template-based [27], or keyword-based systems [14,21].
In the meantime, many researchers have tackled this task using classical
machine learning methods. After applying the Bag-of-Words (BoW) method
for feature extraction, Kwok and Wang [19] used a Naïve Bayes classifier for
the detection of racism against black people on Twitter. Grevy et al. [13] used
Support Vector Machines (SVMs) on BoW features for the classification of racist
texts. However, since the BoW approach was shown to lead to high false positive
rates-[6], others used more sophisticated feature extraction methods to obtain
input for the classical machine learning methods (such as SVM, Naı̈ve Bayes and
Logistic Regression [5,6,39,40]) deployed for the detection of hateful content.
One milestone in hate speech detection was deep learning gaining traction
in Natural Language Processing (NLP) after its success in pattern recognition
and computer vision [44], propelling the field forward [31]. The introduction
of embeddings [26] had an important role in this process. For one, they provided
useful features to the same classical machine learning algorithms used for
hate speech detection [25,45], leading to significantly better results than those
attained with the BoW approach (both in terms of memory complexity and
classification scores [9]). Other deep learning approaches were also popular for
the task, including Recurrent Neural Networks [1,7,10,38], Convolutional Neural
Networks [1,12,32,51], and methods that combined the two [17,41,49].
The introduction of transformers was another milestone, in particular the
large improvement in text classification performance achieved by BERT [37]. What is
more, transformer models have proved highly successful in hate speech detection
competitions (with most of the top ten teams using a transformer in a recent
challenge [46]). Ensembles of transformers also proved to be successful in hate
speech detection [28,30]. So much so, that such a solution has attained the
best performance (i.e. on average the best performance over several sub-tasks)
recently in a challenge with more than fifty participants [35]. For this reason,
here, we also decided to use an ensemble of transformer models.

1.2 Contribution

Here, we apply a 5-fold ensemble training method using the RoBERTa model,
which enables us to attain state-of-the-art performance on the HASOC bench-
mark. Moreover, by proposing an additional fine-tuning step, we significantly
increase the performance of the models trained on the different folds.
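The k-fold ensembling scheme can be illustrated in a model-agnostic way: one model is trained per fold, and the per-model class probabilities are averaged at inference time. In the sketch below, a Bag-of-Words + logistic regression pipeline stands in for the fine-tuned RoBERTa models (this is not the authors' training code), and the tiny corpus and labels are invented:

```python
# Model-agnostic sketch of 5-fold ensemble training: one classifier
# per fold, probabilities averaged at test time. A BoW + logistic
# regression pipeline stands in here for fine-tuned RoBERTa models.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

texts = np.array([
    "have a nice day", "what a lovely post", "thanks for sharing",
    "you are all idiots", "get lost you fool", "I hate you people",
    "great work, friend", "shut up, moron", "so proud of you",
    "you people are awful",
])
labels = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])  # toy labels

models = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in folds.split(texts, labels):
    clf = make_pipeline(CountVectorizer(),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts[train_idx], labels[train_idx])
    models.append(clf)

def ensemble_predict(fold_models, docs):
    """Average the per-model class probabilities, then argmax."""
    probs = np.stack([m.predict_proba(docs) for m in fold_models])
    return probs.mean(axis=0).argmax(axis=1)

print(ensemble_predict(models, ["you are a fool", "lovely day"]))
```

Averaging probabilities rather than hard votes lets confident models outweigh uncertain ones, which is one common motivation for this style of ensembling.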
XV
ABRAHAM LINCOLN

REMARKS AT THE FUNERAL SERVICES HELD IN CONCORD, APRIL 19, 1865

“Nature, they say, doth dote,
And cannot make a man
Save on some worn-out plan,
Repeating us by rote:
For him her Old-World moulds aside she threw,
And, choosing sweet clay from the breast
Of the unexhausted West,
With stuff untainted shaped a hero new,
Wise, steadfast in the strength of God, and true.
How beautiful to see
Once more a shepherd of mankind indeed,
Who loved his charge, but never loved to lead;
One whose meek flock the people joyed to be,
Not lured by any cheat of birth,
But by his clear-grained human worth,
And brave old wisdom of sincerity!
They knew that outward grace is dust;
They could not choose but trust
In that sure-footed mind’s unfaltering skill,
And supple-tempered will
That bent, like perfect steel, to spring again and thrust.
...
Nothing of Europe here,
Or, then, of Europe fronting mornward still,
Ere any names of Serf and Peer
Could Nature’s equal scheme deface; ...
Here was a type of the true elder race,
And one of Plutarch’s men talked with us face to face.”

Lowell, Commemoration Ode.

ABRAHAM LINCOLN
We meet under the gloom of a calamity which darkens down over
the minds of good men in all civil society, as the fearful tidings travel
over sea, over land, from country to country, like the shadow of an
uncalculated eclipse over the planet. Old as history is, and manifold
as are its tragedies, I doubt if any death has caused so much pain to
mankind as this has caused, or will cause, on its announcement; and
this, not so much because nations are by modern arts brought so
closely together, as because of the mysterious hopes and fears
which, in the present day, are connected with the name and
institutions of America.
In this country, on Saturday, every one was struck dumb, and saw
at first only deep below deep, as he meditated on the ghastly blow.
And perhaps, at this hour, when the coffin which contains the dust of
the President sets forward on its long march through mourning
states, on its way to his home in Illinois, we might well be silent, and
suffer the awful voices of the time to thunder to us. Yes, but that first
despair was brief: the man was not so to be mourned. He was the
most active and hopeful of men; and his work had not perished: but
acclamations of praise for the task he had accomplished burst out
into a song of triumph, which even tears for his death cannot keep
down.
The President stood before us as a man of the people. He was
thoroughly American, had never crossed the sea, had never been
spoiled by English insularity or French dissipation; a quite native,
aboriginal man, as an acorn from the oak; no aping of foreigners, no
frivolous accomplishments, Kentuckian born, working on a farm, a
flatboatman, a captain in the Black Hawk War, a country lawyer, a
representative in the rural legislature of Illinois;—on such modest
foundations the broad structure of his fame was laid. How slowly,
and yet by happily prepared steps, he came to his place. All of us
remember—it is only a history of five or six years—the surprise and
the disappointment of the country at his first nomination by the
convention at Chicago. Mr. Seward, then in the culmination of his
good fame, was the favorite of the Eastern States. And when the
new and comparatively unknown name of Lincoln was announced
(notwithstanding the report of the acclamations of that convention),
we heard the result coldly and sadly. It seemed too rash, on a purely
local reputation, to build so grave a trust in such anxious times; and
men naturally talked of the chances in politics as incalculable. But it
turned out not to be chance. The profound good opinion which the
people of Illinois and of the West had conceived of him, and which
they had imparted to their colleagues, that they also might justify
themselves to their constituents at home, was not rash, though they
did not begin to know the riches of his worth.[174]
A plain man of the people, an extraordinary fortune attended him.
He offered no shining qualities at the first encounter; he did not
offend by superiority. He had a face and manner which disarmed
suspicion, which inspired confidence, which confirmed good will. He
was a man without vices. He had a strong sense of duty, which it
was very easy for him to obey. Then, he had what farmers call a long
head; was excellent in working out the sum for himself; in arguing his
case and convincing you fairly and firmly. Then, it turned out that he
was a great worker; had prodigious faculty of performance; worked
easily. A good worker is so rare; everybody has some disabling
quality. In a host of young men that start together and promise so
many brilliant leaders for the next age, each fails on trial; one by bad
health, one by conceit, or by love of pleasure, or lethargy, or an ugly
temper,—each has some disqualifying fault that throws him out of
the career. But this man was sound to the core, cheerful, persistent,
all right for labor, and liked nothing so well.
Then, he had a vast good nature, which made him tolerant and
accessible to all; fair-minded, leaning to the claim of the petitioner;
affable, and not sensible to the affliction which the innumerable visits
paid to him when President would have brought to any one else.[175]
And how this good nature became a noble humanity, in many a
tragic case which the events of the war brought to him, every one will
remember; and with what increasing tenderness he dealt when a
whole race was thrown on his compassion. The poor negro said of
him, on an impressive occasion, “Massa Linkum am eberywhere.”
Then his broad good humor, running easily into jocular talk, in
which he delighted and in which he excelled, was a rich gift to this
wise man. It enabled him to keep his secret; to meet every kind of
man and every rank in society; to take off the edge of the severest
decisions; to mask his own purpose and sound his companion; and
to catch with true instinct the temper of every company he
addressed. And, more than all, it is to a man of severe labor, in
anxious and exhausting crises, the natural restorative, good as
sleep, and is the protection of the overdriven brain against rancor
and insanity.
He is the author of a multitude of good sayings, so disguised as
pleasantries that it is certain they had no reputation at first but as
jests; and only later, by the very acceptance and adoption they find
in the mouths of millions, turn out to be the wisdom of the hour. I am
sure if this man had ruled in a period of less facility of printing, he
would have become mythological in a very few years, like Æsop or
Pilpay, or one of the Seven Wise Masters, by his fables and
proverbs. But the weight and penetration of many passages in his
letters, messages and speeches, hidden now by the very closeness
of their application to the moment, are destined hereafter to wide
fame. What pregnant definitions; what unerring common sense; what
foresight; and, on great occasion, what lofty, and more than national,
what humane tone! His brief speech at Gettysburg will not easily be
surpassed by words on any recorded occasion. This, and one other
American speech, that of John Brown to the court that tried him, and
a part of Kossuth’s speech at Birmingham, can only be compared
with each other, and with no fourth.
His occupying the chair of state was a triumph of the good sense
of mankind, and of the public conscience. This middle-class country
had got a middle-class president, at last. Yes, in manners and
sympathies, but not in powers, for his powers were superior. This
man grew according to the need. His mind mastered the problem of
the day; and as the problem grew, so did his comprehension of it.
Rarely was man so fitted to the event. In the midst of fears and
jealousies, in the Babel of counsels and parties, this man wrought
incessantly with all his might and all his honesty, laboring to find
what the people wanted, and how to obtain that. It cannot be said
there is any exaggeration of his worth. If ever a man was fairly
tested, he was. There was no lack of resistance, nor of slander, nor
of ridicule. The times have allowed no state secrets; the nation has
been in such ferment, such multitudes had to be trusted, that no
secret could be kept. Every door was ajar, and we know all that
befell.
Then, what an occasion was the whirlwind of the war. Here was
place for no holiday magistrate, no fair-weather sailor; the new pilot
was hurried to the helm in a tornado. In four years,—four years of
battle-days,—his endurance, his fertility of resources, his
magnanimity, were sorely tried and never found wanting. There, by
his courage, his justice, his even temper, his fertile counsel, his
humanity, he stood a heroic figure in the centre of a heroic epoch.
He is the true history of the American people in his time. Step by
step he walked before them; slow with their slowness, quickening his
march by theirs, the true representative of this continent; an entirely
public man; father of his country, the pulse of twenty millions
throbbing in his heart, the thought of their minds articulated by his
tongue.
Adam Smith remarks that the axe, which in Houbraken’s portraits
of British kings and worthies is engraved under those who have
suffered at the block, adds a certain lofty charm to the picture. And
who does not see, even in this tragedy so recent, how fast the terror
and ruin of the massacre are already burning into glory around the
victim? Far happier this fate than to have lived to be wished away; to
have watched the decay of his own faculties; to have seen—perhaps
even he—the proverbial ingratitude of statesmen; to have seen
mean men preferred. Had he not lived long enough to keep the
greatest promise that ever man made to his fellow men,—the
practical abolition of slavery? He had seen Tennessee, Missouri and
Maryland emancipate their slaves. He had seen Savannah,
Charleston and Richmond surrendered; had seen the main army of
the rebellion lay down its arms. He had conquered the public opinion
of Canada, England and France.[176] Only Washington can compare
with him in fortune.
And what if it should turn out, in the unfolding of the web, that he
had reached the term; that this heroic deliverer could no longer serve
us; that the rebellion had touched its natural conclusion, and what
remained to be done required new and uncommitted hands,—a new
spirit born out of the ashes of the war; and that Heaven, wishing to
show the world a completed benefactor, shall make him serve his
country even more by his death than by his life? Nations, like kings,
are not good by facility and complaisance. “The kindness of kings
consists in justice and strength.” Easy good nature has been the
dangerous foible of the Republic, and it was necessary that its
enemies should outrage it, and drive us to unwonted firmness, to
secure the salvation of this country in the next ages.
The ancients believed in a serene and beautiful Genius which
ruled in the affairs of nations; which, with a slow but stern justice,
carried forward the fortunes of certain chosen houses, weeding out
single offenders or offending families, and securing at last the firm
prosperity of the favorites of Heaven. It was too narrow a view of the
Eternal Nemesis. There is a serene Providence which rules the fate
of nations, which makes little account of time, little of one generation
or race, makes no account of disasters, conquers alike by what is
called defeat or by what is called victory, thrusts aside enemy and
obstruction, crushes everything immoral as inhuman, and obtains the
ultimate triumph of the best race by the sacrifice of everything which
resists the moral laws of the world.[177] It makes its own instruments,
creates the man for the time, trains him in poverty, inspires his
genius, and arms him for his task. It has given every race its own
talent, and ordains that only that race which combines perfectly with
the virtues of all shall endure.[178]
XVI
HARVARD COMMEMORATION SPEECH

JULY 21, 1865

“‘Old classmate, say
Do you remember our Commencement Day?
Were we such boys as these at twenty?’ Nay,
God called them to a nobler task than ours,
And gave them holier thoughts and manlier powers,—
This is the day of fruits and not of flowers!
These ‘boys’ we talk about like ancient sages
Are the same men we read of in old pages—
The bronze recast of dead heroic ages!
We grudge them not, our dearest, bravest, best,—
Let but the quarrel’s issue stand confest:
’Tis Earth’s old slave-God battling for his crown
And Freedom fighting with her visor down.”

Holmes.

“Many loved Truth, and lavished life’s best oil
Amid the dust of books to find her,
Content at last, for guerdon of their toil,
With the cast mantle she hath left behind her.
Many in sad faith sought for her,
Many with crossed hands sighed for her;
But these, our brothers, fought for her,
At life’s dear peril wrought for her,
So loved her that they died for her,
Tasting the raptured fleetness
Of her divine completeness:
Their higher instinct knew
Those love her best who to themselves are true,
And what they dare to dream of, dare to do;
They followed her and found her
Where all may hope to find,
Not in the ashes of the burnt-out mind,
But beautiful, with danger’s sweetness round her.
Where faith made whole with deed
Breathes its awakening breath
Into the lifeless creed,
They saw her plumed and mailed,
With sweet, stern face unveiled,
And all-repaying eyes, look proud on them in death.”

Lowell, Commemoration Ode.

HARVARD COMMEMORATION SPEECH


MR. CHAIRMAN, and Gentlemen: With whatever opinion we
come here, I think it is not in man to see, without a feeling of pride
and pleasure, a tried soldier, the armed defender of the right. I think
that in these last years all opinions have been affected by the
magnificent and stupendous spectacle which Divine Providence has
offered us of the energies that slept in the children of this country,—
that slept and have awakened. I see thankfully those that are here,
but dim eyes in vain explore for some who are not.
The old Greek Heraclitus said, “War is the Father of all things.” He
said it, no doubt, as science, but we of this day can repeat it as
political and social truth. War passes the power of all chemical
solvents, breaking up the old adhesions, and allowing the atoms of
society to take a new order. It is not the Government, but the War,
that has appointed the good generals, sifted out the pedants, put in
the new and vigorous blood. The War has lifted many other people
besides Grant and Sherman into their true places. Even Divine
Providence, we may say, always seems to work after a certain
military necessity. Every nation punishes the General who is not
victorious. It is a rule in games of chance that the cards beat all the
players, and revolutions disconcert and outwit all the insurgents.
The revolutions carry their own points, sometimes to the ruin of
those who set them on foot. The proof that war also is within the
highest right, is a marked benefactor in the hands of the Divine
Providence, is its morale. The war gave back integrity to this erring
and immoral nation. It charged with power, peaceful, amiable men, to
whose life war and discord were abhorrent. What an infusion of
character went out from this and other colleges! What an infusion of
character down to the ranks! The experience has been uniform that it
is the gentle soul that makes the firm hero after all. It is easy to recall
the mood in which our young men, snatched from every peaceful
pursuit, went to the war. Many of them had never handled a gun.
They said, “It is not in me to resist. I go because I must. It is a duty
which I shall never forgive myself if I decline. I do not know that I can
make a soldier. I may be very clumsy. Perhaps I shall be timid; but
you can rely on me. Only one thing is certain, I can well die, but I
cannot afford to misbehave.”
In fact the infusion of culture and tender humanity from these
scholars and idealists who went to the war in their own despite—God
knows they had no fury for killing their old friends and countrymen—
had its signal and lasting effect. It was found that enthusiasm was a
more potent ally than science and munitions of war without it. “It is a
principle of war,” said Napoleon, “that when you can use the
thunderbolt you must prefer it to the cannon.” Enthusiasm was the
thunderbolt. Here in this little Massachusetts, in smaller Rhode
Island, in this little nest of New England republics it flamed out when
the guilty gun was aimed at Sumter.
Mr. Chairman, standing here in Harvard College, the parent of all
the colleges; in Massachusetts, the parent of all the North; when I
consider her influence on the country as a principal planter of the
Western States, and now, by her teachers, preachers, journalists
and books, as well as by traffic and production, the diffuser of
religious, literary and political opinion;—and when I see how
irresistible the convictions of Massachusetts are in these swarming
populations,—I think the little state bigger than I knew. When her
blood is up, she has a fist big enough to knock down an empire. And
her blood was roused. Scholars changed the black coat for the blue.
A single company in the Forty-fourth Massachusetts Regiment
contained thirty-five sons of Harvard. You all know as well as I the
story of these dedicated men, who knew well on what duty they
went,—whose fathers and mothers said of each slaughtered son,
“We gave him up when he enlisted.” One mother said, when her son
was offered the command of the first negro regiment, “If he accepts
it, I shall be as proud as if I had heard that he was shot.”[179] These
men, thus tender, thus high-bred, thus peaceable, were always in the
front and always employed. They might say, with their forefathers the
old Norse Vikings, “We sung the mass of lances from morning until
evening.” And in how many cases it chanced, when the hero had
fallen, they who came by night to his funeral, on the morrow returned
to the war-path to show his slayers the way to death!
Ah! young brothers, all honor and gratitude to you,—you, manly
defenders, Liberty’s and Humanity’s bodyguard! We shall not again
disparage America, now that we have seen what men it will bear. We
see—we thank you for it—a new era, worth to mankind all the
treasure and all the lives it has cost; yes, worth to the world the lives
of all this generation of American men, if they had been demanded.[180]
XVII
ADDRESS

AT THE DEDICATION OF THE SOLDIERS’ MONUMENT IN CONCORD, APRIL 19, 1867

“They have shown what men may do,
They have proved how men may die,—
Count, who can, the fields they have pressed,
Each face to the solemn sky!”

Brownell.

“Think you these felt no charms
In their gray homesteads and embowered farms?
In household faces waiting at the door
Their evening step should lighten up no more?
In fields their boyish feet had known?
In trees their fathers’ hands had set,
And which with them had grown,
Widening each year their leafy coronet?
Felt they no pang of passionate regret
For those unsolid goods that seem so much our own?
These things are dear to every man that lives,
And life prized more for what it lends than gives.
Yea, many a tie, through iteration sweet,
Strove to detain their fatal feet;
And yet the enduring half they chose,
Whose choice decides a man life’s slave or king,
The invisible things of God before the seen and known:
Therefore their memory inspiration blows
With echoes gathering on from zone to zone;
For manhood is the one immortal thing
Beneath Time’s changeful sky,
And, where it lightened once, from age to age,
Men come to learn, in grateful pilgrimage,
That length of days is knowing when to die.”

Lowell, Concord Ode.

ADDRESS
DEDICATION OF SOLDIERS’ MONUMENT IN CONCORD, APRIL 19, 1867

Fellow citizens: The day is in Concord doubly our calendar day,
as being the anniversary of the invasion of the town by the British
troops in 1775, and of the departure of the company of volunteers for
Washington, in 1861. We are all pretty well aware that the facts
which make to us the interest of this day are in a great degree
personal and local here; that every other town and city has its own
heroes and memorial days, and that we can hardly expect a wide
sympathy for the names and anecdotes which we delight to record.
We are glad and proud that we have no monopoly of merit. We are
thankful that other towns and cities are as rich; that the heroes of old
and of recent date, who made and kept America free and united,
were not rare or solitary growths, but sporadic over vast tracts of the
Republic. Yet, as it is a piece of nature and the common sense that
the throbbing chord that holds us to our kindred, our friends and our
town, is not to be denied or resisted,—no matter how frivolous or
unphilosophical its pulses,—we shall cling affectionately to our
houses, our river and pastures, and believe that our visitors will
pardon us if we take the privilege of talking freely about our nearest
neighbors as in a family party;—well assured, meantime, that the
virtues we are met to honor were directed on aims which command
the sympathy of every loyal American citizen, were exerted for the
protection of our common country, and aided its triumph.
The town has thought fit to signify its honor for a few of its sons by
raising an obelisk in the square. It is a simple pile enough,—a few
slabs of granite, dug just below the surface of the soil, and laid upon
the top of it; but as we have learned that the upheaved mountain,
from which these discs or flakes were broken, was once a glowing
mass at white heat, slowly crystallized, then uplifted by the central
fires of the globe: so the roots of the events it appropriately marks
are in the heart of the universe. I shall say of this obelisk, planted
here in our quiet plains, what Richter says of the volcano in the fair
landscape of Naples: “Vesuvius stands in this poem of Nature, and
exalts everything, as war does the age.”
The art of the architect and the sense of the town have made
these dumb stones speak; have, if I may borrow the old language of
the church, converted these elements from a secular to a sacred and
spiritual use; have made them look to the past and the future; have
given them a meaning for the imagination and the heart. The sense
of the town, the eloquent inscriptions the shaft now bears, the
memories of these martyrs, the noble names which yet have
gathered only their first fame, whatever good grows to the country
out of the war, the largest results, the future power and genius of the
land, will go on clothing this shaft with daily beauty and spiritual life.
’Tis certain that a plain stone like this, standing on such memories,
having no reference to utilities, but only to the grand instincts of the
civil and moral man, mixes with surrounding nature,—by day with the
changing seasons, by night the stars roll over it gladly,—becomes a
sentiment, a poet, a prophet, an orator, to every townsman and
passenger, an altar where the noble youth shall in all time come to
make his secret vows.[181]
The old Monument, a short half-mile from this house, stands to
signalize the first Revolution, where the people resisted offensive
usurpations, offensive taxes of the British Parliament, claiming that
there should be no tax without representation. Instructed by events,
after the quarrel began, the Americans took higher ground, and
stood for political independence. But in the necessities of the hour,
they overlooked the moral law, and winked at a practical exception to
the Bill of Rights they had drawn up. They winked at the exception,
believing it insignificant. But the moral law, the nature of things, did
not wink at it, but kept its eye wide open. It turned out that this one
violation was a subtle poison, which in eighty years corrupted the
whole overgrown body politic, and brought the alternative of
extirpation of the poison or ruin to the Republic.[182]
This new Monument is built to mark the arrival of the nation at the
new principle,—say, rather, at its new acknowledgment, for the
principle is as old as Heaven,—that only that state can live, in which
injury to the least member is recognized as damage to the whole.
Reform must begin at home. The aim of the hour was to
reconstruct the South; but first the North had to be reconstructed. Its
own theory and practice of liberty had got sadly out of gear, and
must be corrected. It was done on the instant. A thunderstorm at sea
sometimes reverses the magnets in the ship, and south is north. The
storm of war works the like miracle on men. Every Democrat who
went South came back a Republican, like the governors who, in
Buchanan’s time, went to Kansas, and instantly took the free-state
colors. War, says the poet, is

“the arduous strife,
To which the triumph of all good is given.”[183]

Every principle is a war-note. When the rights of man are recited
under any old government, every one of them is a declaration of war.
War civilizes, rearranges the population, distributing by ideas,—the
innovators on one side, the antiquaries on the other. It opens the
eyes wider. Once we were patriots up to the town-bounds, or the
state-line. But when you replace the love of family or clan by a
principle, as freedom, instantly that fire runs over the state-line into
New Hampshire, Vermont, New York and Ohio, into the prairie and
beyond, leaps the mountains, bridges river and lake, burns as hotly
in Kansas and California as in Boston, and no chemist can
discriminate between one soil and the other. It lifts every population
to an equal power and merit.
As long as we debate in council, both sides may form their private
guess what the event may be, or which is the strongest. But the
moment you cry “Every man to his tent, O Israel!” the delusions of
hope and fear are at an end;—the strength is now to be tested by the
eternal facts. There will be no doubt more. The world is equal to
itself. The secret architecture of things begins to disclose itself; the
fact that all things were made on a basis of right; that justice is really
desired by all intelligent beings; that opposition to it is against the
nature of things; and that, whatever may happen in this hour or that,
the years and the centuries are always pulling down the wrong and
building up the right.
The war made the Divine Providence credible to many who did not
believe the good Heaven quite honest. Every man was an
abolitionist by conviction, but did not believe that his neighbor was.
The opinions of masses of men, which the tactics of primary
caucuses and the proverbial timidity of trade had concealed, the war
discovered; and it was found, contrary to all popular belief, that the
country was at heart abolitionist, and for the Union was ready to die.
As cities of men are the first effects of civilization, and also
instantly causes of more civilization, so armies, which are only
wandering cities, generate a vast heat, and lift the spirit of the
soldiers who compose them to the boiling point. The armies
mustered in the North were as much missionaries to the mind of the
country as they were carriers of material force, and had the vast
advantage of carrying whither they marched a higher civilization. Of
course, there are noble men everywhere, and there are such in the
South; and the noble know the noble, wherever they meet; and we
have all heard passages of generous and exceptional behavior
exhibited by individuals there to our officers and men, during the war.
But the common people, rich or poor, were the narrowest and most
conceited of mankind, as arrogant as the negroes on the Gambia
River; and, by the way, it looks as if the editors of the Southern press
were in all times selected from this class. The invasion of Northern
farmers, mechanics, engineers, tradesmen, lawyers and students did
more than forty years of peace had done to educate the South.[184]
“This will be a slow business,” writes our Concord captain home, “for
we have to stop and civilize the people as we go along.”
It is an interesting part of the history, the manner in which this
incongruous militia were made soldiers. That was done again on the
Kansas plan. Our farmers went to Kansas as peaceable, God-
fearing men as the members of our school committee here. But
when the Border raids were let loose on their villages, these people,
who turned pale at home if called to dress a cut finger, on witnessing
the butchery done by the Missouri riders on women and babes, were
so beside themselves with rage, that they became on the instant the
bravest soldiers and the most determined avengers.[185] And the first
events of the war of the Rebellion gave the like training to the new
recruits.
All sorts of men went to the war,—the roughs, men who liked
harsh play and violence, men for whom pleasure was not strong
enough, but who wanted pain, and found sphere at last for their
superabundant energy; then the adventurous type of New
Englander, with his appetite for novelty and travel; the village
politician, who could now verify his newspaper knowledge, see the
South, and amass what a stock of adventures to retail hereafter at
the fireside, or to the well-known companions on the Mill-dam; young
men, also, of excellent education and polished manners, delicately
brought up; manly farmers, skilful mechanics, young tradesmen,
men hitherto of narrow opportunities of knowing the world, but well
taught in the grammar-schools. But perhaps in every one of these
classes were idealists, men who went from a religious duty. I have a
note of a conversation that occurred in our first company, the
morning before the battle of Bull Run. At a halt in the march, a few of
our boys were sitting on a rail fence talking together whether it was
right to sacrifice themselves. One of them said, ‘he had been
thinking a good deal about it, last night, and he thought one was
never too young to die for a principle.’ One of our later volunteers, on
the day when he left home, in reply to my question, How can you be
spared from your farm, now that your father is so ill? said: “I go
because I shall always be sorry if I did not go when the country
called me. I can go as well as another.” One wrote to his father these
words: “You may think it strange that I, who have always naturally
rather shrunk from danger, should wish to enter the army; but there
is a higher Power that tunes the hearts of men, and enables them to
see their duty, and gives them courage to face the dangers with
which those duties are attended.” And the captain writes home of
another of his men, “B⸺ comes from a sense of duty and love of
country, and these are the soldiers you can depend upon.”[186]
None of us can have forgotten how sharp a test to try our peaceful
people with, was the first call for troops. I doubt not many of our
soldiers could repeat the confession of a youth whom I knew in the
beginning of the war, who enlisted in New York, went to the field, and
died early. Before his departure he confided to his sister that he was
naturally a coward, but was determined that no one should ever find
it out; that he had long trained himself by forcing himself, on the
suspicion of any near danger, to go directly up to it, cost him what
struggles it might. Yet it is from this temperament of sensibility that
great heroes have been formed.
Our first company was led by an officer who had grown up in this
village from a boy.[187] The older among us can well remember him
at school, at play and at work, all the way up, the most amiable,
sensible, unpretending of men; fair, blond, the rose lived long in his
cheek; grave, but social, and one of the last men in this town you
would have picked out for the rough dealing of war,—not a trace of
fierceness, much less of recklessness, or of the devouring thirst for
excitement; tender as a woman in his care for a cough or a chilblain
in his men; had troches and arnica in his pocket for them. The army
officers were welcome to their jest on him as too kind for a captain,
and, later, as the colonel who got off his horse when he saw one of
his men limp on the march, and told him to ride. But he knew that his
men had found out, first that he was captain, then that he was
colonel, and neither dared nor wished to disobey him. He was a man
without conceit, who never fancied himself a philosopher or a saint;
the most modest and amiable of men, engaged in common duties,
but equal always to the occasion; and the war showed him still
equal, however stern and terrible the occasion grew,—disclosed in
him a strong good sense, great fertility of resource, the helping hand,
and then the moral qualities of a commander,—a patience not to be
tired out, a serious devotion to the cause of the country that never
swerved, a hope that never failed. He was a Puritan in the army, with
traits that remind one of John Brown,—an integrity incorruptible, and
an ability that always rose to the need.
You will remember that these colonels, captains and lieutenants,
and the privates too, are domestic men, just wrenched away from
their families and their business by this rally of all the manhood in the
land. They have notes to pay at home; have farms, shops, factories,
affairs of every kind to think of and write home about. Consider what
sacrifice and havoc in business arrangements this war-blast made.
They have to think carefully of every last resource at home on which
their wives or mothers may fall back; upon the little account in the
savings bank, the grass that can be sold, the old cow, or the heifer.
These necessities make the topics of the ten thousand letters with
which the mail-bags came loaded day by day. These letters play a
great part in the war. The writing of letters made the Sunday in every
camp:—meantime they are without the means of writing. After the
first marches there is no letter-paper, there are no envelopes, no
postage-stamps, for these were wetted into a solid mass in the rains
and mud. Some of these letters are written on the back of old bills,
some on brown paper, or strips of newspaper; written by fire-light,
making the short night shorter; written on the knee, in the mud, with
pencil, six words at a time; or in the saddle, and have to stop
because the horse will not stand still. But the words are proud and
tender,—“Tell mother I will not disgrace her;” “tell her not to worry
about me, for I know she would not have had me stay at home if she
could as well as not.” The letters of the captain are the dearest
treasures of this town. Always devoted, sometimes anxious,
sometimes full of joy at the deportment of his comrades, they contain
the sincere praise of men whom I now see in this assembly. If
Marshal Montluc’s[188] Memoirs are the Bible of soldiers, as Henry
IV. of France said, Colonel Prescott might furnish the Book of
Epistles.
He writes, “You don’t know how one gets attached to a company
by living with them and sleeping with them all the time. I know every
man by heart. I know every man’s weak spot,—who is shaky, and
who is true blue.” He never remits his care of the men, aiming to hold
them to their good habits and to keep them cheerful. For the first
point, he keeps up a constant acquaintance with them; urges their
correspondence with their friends; writes news of them home, urging
his own correspondent to visit their families and keep them informed
about the men; encourages a temperance society which is formed in
the camp. “I have not had a man drunk, or affected by liquor, since
we came here.” At one time he finds his company unfortunate in
having fallen between two companies of quite another class,—“’tis
profanity all the time; yet instead of a bad influence on our men, I
think it works the other way,—it disgusts them.”
One day he writes, “I expect to have a time, this forenoon, with the
officer from West Point who drills us. He is very profane, and I will
not stand it. If he does not stop it, I shall march my men right away
when he is drilling them. There is a fine for officers swearing in the
army, and I have too many young men that are not used to such talk.
I told the colonel this morning I should do it, and shall,—don’t care
what the consequence is. This lieutenant seems to think that these
men, who never saw a gun, can drill as well as he, who has been at
West Point four years.” At night he adds: “I told that officer from West
Point, this morning, that he could not swear at my company as he
did yesterday; told him I would not stand it anyway. I told him I had a
good many young men in my company whose mothers asked me to
look after them, and I should do so, and not allow them to hear such
language, especially from an officer, whose duty it was to set them a
better example. Told him I did not swear myself and would not allow
him to. He looked at me as much as to say, Do you know whom you
are talking to? and I looked at him as much as to say, Yes, I do. He
looked rather ashamed, but went through the drill without an oath.”
So much for the care of their morals. His next point is to keep them
cheerful. ’Tis better than medicine. He has games of baseball, and
pitching quoits, and euchre, whilst part of the military discipline is
sham fights.
The best men heartily second him, and invent excellent means of
their own. When, afterwards, five of these men were prisoners in the
Parish Prison in New Orleans, they set themselves to use the time to
the wisest advantage,—formed a debating-club, wrote a daily or
weekly newspaper, called it “Stars and Stripes.” It advertises,
“prayer-meeting at 7 o’clock, in cell No. 8, second floor,” and their
own printed record is a proud and affecting narrative.
Whilst the regiment was encamped at Camp Andrew, near
Alexandria, in June, 1861, marching orders came. Colonel Lawrence
sent for eight wagons, but only three came. On these they loaded all
the canvas of the tents, but took no tent-poles.
“It looked very much like a severe thunderstorm,” writes the
captain, “and I knew the men would all have to sleep out of doors,
unless we carried them. So I took six poles, and went to the colonel,
and told him I had got the poles for two tents, which would cover
twenty-four men, and unless he ordered me not to carry them, I
should do so. He said he had no objection, only thought they would
be too much for me. We only had about twelve men [the rest of the
company being, perhaps, on picket or other duty], and some of them
have their heavy knapsacks and guns to carry, so could not carry
any poles. We started and marched two miles without stopping to
rest, not having had anything to eat, and being very hot and dry.” At
this time Captain Prescott was daily threatened with sickness, and
suffered the more from this heat. “I told Lieutenant Bowers, this
morning, that I could afford to be sick from bringing the tent-poles,
for it saved the whole regiment from sleeping outdoors; for they
would not have thought of it, if I had not taken mine. The major had
tried to discourage me;—said, ‘perhaps, if I carried them over, some
other company would get them;’—I told him, perhaps he did not think
I was smart.” He had the satisfaction to see the whole regiment
enjoying the protection of these tents.[189]
In the disastrous battle of Bull Run this company behaved well,
and the regimental officers believed, what is now the general
conviction of the country, that the misfortunes of the day were not so
much owing to the fault of the troops as to the insufficiency of the
combinations by the general officers. It happened, also, that the Fifth
Massachusetts was almost unofficered. The colonel was, early in the
day, disabled by a casualty; the lieutenant-colonel, the major and the