Academia.edu

Dialect identification

13 papers
1 follower
About this topic
Dialect identification is the linguistic process of determining the specific regional or social variety of a language spoken by an individual or group. It involves analyzing phonetic, lexical, and syntactic features to distinguish between different dialects, often utilizing computational methods and acoustic analysis in sociolinguistic research.

Key research themes

1. How can acoustic and phonotactic features be leveraged for automatic dialect identification in closely related dialects?

This research area focuses on the extraction and utilization of acoustic and phonotactic features combined with advanced machine learning techniques for automatic dialect identification, especially in languages with multiple, closely related dialects such as Arabic. Accurate spoken dialect identification is critical for downstream speech technologies including speech recognition, dialect adaptation, and forensic applications.

by Maryam Najafian and 1 more
Key finding: This paper demonstrates that combining phonotactic, lexical, and acoustic features using classifiers like Support Vector Machines, Logistic Regression, and Convolutional Neural Networks can achieve state-of-the-art dialect...
Key finding: This study presents a deep learning approach using a Hamilton neural network (HNN) classifier integrated with multi-scale product analysis (MPA) feature extraction to effectively capture null spectral, temporal, and prosodic...
Key finding: Introducing a hierarchical universal background model (UBM) and mel-frequency cepstral coefficient (MFCC) features combined with Gaussian Mixture Models (GMMs), this paper addresses practical challenges of small training...
Key finding: This study employs sentence-level text classification using traditional machine learning classifiers (Logistic Regression, Multinomial Naïve Bayes, Support Vector Machines) on unigram and bigram feature sets for...
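The sentence-level text classification setup named in the last key finding (a Multinomial Naïve Bayes classifier over unigram and bigram features) can be illustrated with a minimal sketch. This is not the study's implementation: the class name `NaiveBayesDialectClassifier`, the helper `ngrams`, the add-one smoothing, the whitespace tokenization, and the toy British/American English sentences are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def ngrams(sentence):
    """Return the word unigrams and bigrams of one sentence (whitespace tokens)."""
    tokens = sentence.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes with add-one smoothing (hypothetical helper)."""

    def fit(self, sentences, labels):
        self.feature_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter(labels)         # label -> sentence count
        self.vocabulary = set()
        for sentence, label in zip(sentences, labels):
            feats = ngrams(sentence)
            self.feature_counts[label].update(feats)
            self.vocabulary.update(feats)
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n / total)  # log prior of the dialect label
            for feat in ngrams(sentence):
                if feat in self.vocabulary:  # skip n-grams never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration; real systems train on large corpora.
TOY_TRAINING_DATA = [
    ("what colour is the lorry", "en-GB"),
    ("the flat has a lift", "en-GB"),
    ("what color is the truck", "en-US"),
    ("the apartment has an elevator", "en-US"),
]

sentences, labels = zip(*TOY_TRAINING_DATA)
model = NaiveBayesDialectClassifier().fit(sentences, labels)
print(model.predict("the lorry is in the flat"))  # en-GB
```

In practice a library implementation with proper tokenization and held-out evaluation would replace this toy; the sketch only shows how unigram and bigram counts feed a per-dialect likelihood.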

2. What roles do segmental and prosodic cues play in human and automatic perception of dialects?

This theme investigates how listeners utilize segmental and prosodic information—such as vowels, tones, intonation patterns, and rhythm—in perceiving and identifying dialects. The integration of speech perception findings with computational approaches elucidates which acoustic cues carry the most dialect-specific information, informing both psycholinguistic theory and automatic dialect recognition systems.

Key finding: This paper provides empirical evidence from Swiss German indicating that both segmental and prosodic features contribute to dialect identification, with segmental cues being more diagnostic, followed by f0 and rhythmic...
Key finding: By focusing on three Chinese subregional dialects differing in vowel and tonal properties, this study reveals that vowels and lexical tones contribute differentially to dialect perception. The work shows that vowels play a...
Key finding: Through perceptual experiments investigating African-American English (AAE) and Mainstream American English (MAE), this study finds that intonation patterns (prosody) serve as salient cues for dialect identification. Listener...

3. How can computational and statistical methods cluster dialect varieties and uncover their defining linguistic features?

This line of research explores the use of graph-theoretic clustering methods and supervised machine learning to classify dialect varieties and reveal the key linguistic differences that characterize dialect clusters. By linking dialectometry with interpretable feature analysis, it bridges quantitative dialectology with linguistic insight, providing actionable tools for dialect classification and feature extraction.

Key finding: Applying hierarchical spectral partitioning of bipartite graphs to Dutch dialect data, this study simultaneously clusters dialect varieties and identifies their linguistic sound correspondences without the need for external...
Key finding: Using supervised machine learning on heritage Finnish speech data, this paper presents models that probabilistically assign individual heritage speakers to dialect groups by analyzing feature distributions rather than...
Key finding: This research applies supervised machine learning classification methods to Kurdish texts, focusing on two dominant dialects, Kurmanji and Sorani. It highlights the challenges posed by lack of standard orthography and...

All papers in Dialect identification

In daily life, dialect is the most widely used form of communication. Automatically identifying a dialect is a challenging task, particularly when dealing with similar dialects spoken in the same nation. In this study, we developed an...
In this paper we discuss several models we used to classify 25 city-level Arabic dialects in addition to Modern Standard Arabic (MSA) as part of MADAR shared task (sub-task 1). We propose an ensemble model of a group of experimentally...
Social media has become an integral part of people’s lives, resulting in a constant flow of information. However, a concerning trend has emerged with the rapid spread of fake news, attributed to the lack of verification mechanisms. Fake...
This paper describes Faheem (adj. of understand), our submission to NADI (Nuanced Arabic Dialect Identification) shared task. With so many Arabic dialects being understudied due to the scarcity of the resources, the objective is to...
Machine translation has been a major motivation of development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems, thanks to deep learning methods, parallel corpora...
Despite the recent advances in applying language-independent approaches to various natural language processing tasks thanks to artificial intelligence, some language-specific tools are still essential to process a language in a viable...
With the rapid expansion in internet technology and research, people are more vocal about their belongings and accomplishments. It could be due to social media and easy access to those websites where they can conveniently/freely share...
In this paper, we investigate a set of methods for textual Arabic Dialect Identification, where we considered word-level and sentence-level approaches. We used three classifiers, namely: Linear Support Vector Machine L-SVM, Bernoulli...
Typically native speakers of Arabic mix dialectal Arabic and Modern Standard Arabic in the same utterance. This phenomenon is known as linguistic code switching (LCS). It is a very challenging task to identify these LCS points in written...
Dialect identification is a prior requirement for learning lexical and morphological knowledge of a language variation that can be beneficial for natural language processing (NLP) and potential AI downstream tasks. In this paper, we present...
We present an experimental dataset, Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR), which we used in the first attempt in developing an automatic speech recognition for Sorani Kurdish. The objective of the...
Automatic dialect identification is a necessary Language Technology for processing multidialect languages in which the dialects are linguistically far from each other. Particularly, this becomes crucial where the dialects are mutually...
Kurdish belongs to the Iranian language group within the Indo-European language family, so there are many similarities between Kurdish and other Iranian languages. Such similarities among various languages lead to categorizing languages...
Automatic dialect identification is a necessary Language Technology for processing multi-dialect languages in which the dialects are linguistically far from each other. Particularly, this becomes crucial where the dialects are mutually...