Punctuation Restoration using Transformer Models for High-and Low-Resource Languages

Published in Proceedings of the 6th Workshop on Noisy User-generated Text (W-NUT 2020)@EMNLP, 2020

Recommended citation: T. Alam, A. Khan, and F. Alam, “Punctuation Restoration using Transformer Models for High-and Low-Resource Languages,” in Proceedings of the 6th Workshop on Noisy User-generated Text (W-NUT 2020)@EMNLP, 2020. http://noisy-text.github.io/2020/pdf/2020.d200-1.18.pdf

Abstract. Punctuation restoration is a common postprocessing problem for Automatic Speech Recognition (ASR) systems. It is important for improving the readability of the transcribed text for human readers and for facilitating downstream NLP tasks. Current state-of-the-art approaches address this problem using different deep learning models. Transformer models have recently proven successful in downstream NLP tasks, but they have been explored very little for punctuation restoration. In this work, we explore different transformer-based models and propose an augmentation strategy for this task, focusing on a high-resource language (English) and a low-resource language (Bangla). For English, we obtain results comparable to the state of the art, while for Bangla, this is the first reported work and can serve as a strong baseline for future work. We have made the Bangla dataset we developed publicly available to the research community.
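
To make the task concrete, the following is a minimal sketch of how punctuation restoration can be framed as per-token labeling, a common setup for this problem. It is an illustration only, not the authors' implementation: the label set (O, COMMA, PERIOD, QUESTION) and the helper names `to_token_labels` and `restore` are assumptions that do not come from the abstract. The first helper strips punctuation and casing from text to mimic raw ASR output, and the second re-attaches punctuation from a label sequence, which in practice would be predicted by a model.

```python
import re

# Assumed label set for illustration; the abstract does not list the classes used.
PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_TO_LABEL.items()}


def to_token_labels(text):
    """Turn punctuated text into (lowercased token, label) pairs.

    Each token is labeled with the punctuation mark that follows it,
    or "O" when no mark follows. Lowercasing mimics unpunctuated ASR output.
    """
    pairs = []
    for word, mark in re.findall(r"([^\s,.?]+)([,.?]?)", text):
        pairs.append((word.lower(), PUNCT_TO_LABEL.get(mark, "O")))
    return pairs


def restore(tokens, labels):
    """Re-attach the punctuation mark named by each label to its token."""
    return " ".join(
        tok + LABEL_TO_PUNCT.get(lab, "") for tok, lab in zip(tokens, labels)
    )


pairs = to_token_labels("How are you? I am fine, thanks.")
tokens = [tok for tok, _ in pairs]
labels = [lab for _, lab in pairs]
print(pairs)
print(restore(tokens, labels))
```

In this sketch the labels come from the reference text itself; a trained model would instead predict one label per token from the unpunctuated input.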