Dialect identification

About this topic

Dialect identification is the linguistic process of determining the specific regional or social variety of a language spoken by an individual or group. It involves analyzing phonetic, lexical, and syntactic features to distinguish between different dialects, often utilizing computational methods and acoustic analysis in sociolinguistic research.


Key research themes

1. How can acoustic and phonotactic features be leveraged for automatic dialect identification in closely related dialects?

This research area focuses on the extraction and utilization of acoustic and phonotactic features combined with advanced machine learning techniques for automatic dialect identification, especially in languages with multiple, closely related dialects such as Arabic. Accurate spoken dialect identification is critical for downstream speech technologies including speech recognition, dialect adaptation, and forensic applications.

QMDIS: QCRI-MIT Advanced Dialect Identification System

by Maryam Najafian and

2017, Interspeech

Key finding: This paper demonstrates that combining phonotactic, lexical, and acoustic features using classifiers like Support Vector Machines, Logistic Regression, and Convolutional Neural Networks can achieve state-of-the-art dialect...

Arabic dialect classification using an adaptive deep learning model

by beei iaes

2025, Bulletin of Electrical Engineering and Informatics

Key finding: This study presents a deep learning approach using a Hamilton neural network (HNN) classifier integrated with multi-scale product analysis (MPA) feature extraction to effectively capture null spectral, temporal, and prosodic...

A Systematic Strategy For Robust Automatic Dialect Identification

by John Hansen

2022

Key finding: Introducing a hierarchical universal background model (UBM) and mel-frequency cepstral coefficient (MFCC) features combined with Gaussian Mixture Models (GMMs), this paper addresses practical challenges of small training...

Meeting Challenges of Modern Standard Arabic and Saudi Dialect Identification

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2022

Key finding: This study employs sentence-level text classification using traditional machine learning classifiers (Logistic Regression, Multinomial Naïve Bayes, Support Vector Machines) on unigram and bigram feature sets for...
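As a rough illustration of the kind of pipeline this finding describes, the sketch below trains a multinomial Naive Bayes classifier on word unigram and bigram counts. It is not the paper's implementation: the class name, the smoothing value, and the toy sentences (made-up placeholder tokens, not real Arabic) are assumptions chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def ngram_features(sentence):
    """Return the word unigrams followed by the word bigrams of a sentence."""
    tokens = sentence.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes over unigram + bigram counts (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing, an assumed default

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.feature_counts[label].update(ngram_features(sentence))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, sentence):
        n_sentences = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            total = sum(self.feature_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(count / n_sentences)  # log prior
            for feature in ngram_features(sentence):
                if feature in self.vocab:  # ignore features never seen in training
                    numer = self.feature_counts[label][feature] + self.alpha
                    score += math.log(numer / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data: placeholder tokens standing in for real sentences.
train_sentences = ["m1 m2 shared", "m1 m3 shared", "d1 d2 shared", "d1 d3 shared"]
train_labels = ["MSA", "MSA", "DIALECT", "DIALECT"]

clf = NaiveBayesDialectClassifier().fit(train_sentences, train_labels)
print(clf.predict("m1 m2"))
print(clf.predict("d1 d3 shared"))
```

Adding bigrams to the unigram features lets the classifier pick up short word sequences that are typical of one variety even when the individual words are shared.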


2. What roles do segmental and prosodic cues play in human and automatic perception of dialects?

This theme investigates how listeners utilize segmental and prosodic information—such as vowels, tones, intonation patterns, and rhythm—in perceiving and identifying dialects. The integration of speech perception findings with computational approaches elucidates which acoustic cues carry the most dialect-specific information, informing both psycholinguistic theory and automatic dialect recognition systems.

The role of segments and prosody in the identification of a speaker’s dialect

by Marie-José Kolly

2024, Journal of Phonetics

Key finding: This paper provides empirical evidence from Swiss German indicating that both segmental and prosodic features contribute to dialect identification, with segmental cues being more diagnostic, followed by f0 and rhythmic...

Vowels and tones as acoustic cues in Chinese subregional dialect identification

by Vincent Van Heuven

2025, Speech Communication

Key finding: By focusing on three Chinese subregional dialects differing in vowel and tonal properties, this study reveals that vowels and lexical tones contribute differentially to dialect perception. The work shows that vowels play a...

Dialect Identification from Prosodic Cues

by Christina Foreman

2022

Key finding: Through perceptual experiments investigating African-American English (AAE) and Mainstream American English (MAE), this study finds that intonation patterns (prosody) serve as salient cues for dialect identification. Listener...

3. How can computational and statistical methods cluster dialect varieties and uncover their defining linguistic features?

This line of research explores the use of graph-theoretic clustering methods and supervised machine learning to classify dialect varieties and reveal the key linguistic differences that characterize dialect clusters. By linking dialectometry with interpretable feature analysis, it bridges quantitative dialectology with linguistic insight, providing actionable tools for dialect classification and feature extraction.

Hierarchical bipartite spectral graph partitioning to cluster dialect varieties and determine their most important linguistic features

by John Nerbonne

2025

Key finding: Applying hierarchical spectral partitioning of bipartite graphs to Dutch dialect data, this study simultaneously clusters dialect varieties and identifies their linguistic sound correspondences without the need for external...
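To make the idea of partitioning a bipartite graph concrete, here is a minimal single-bisection sketch in the style of spectral co-clustering. Varieties are rows, linguistic features are columns, and both are split by the sign of the second singular vectors of the degree-normalized matrix. This is an assumed, simplified formulation rather than the paper's procedure (which is hierarchical and whose details are truncated above); the function name and the toy matrix are hypothetical.

```python
import math


def bipartite_bisect(A, iters=200):
    """Split rows and columns of a nonnegative matrix into two co-clusters.

    Assumes every row and column has a positive sum and that the matrix
    is not rank one (otherwise there is no second singular vector to use).
    """
    n_rows, n_cols = len(A), len(A[0])
    row_deg = [sum(row) for row in A]
    col_deg = [sum(A[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Degree-normalized matrix: An = D1^(-1/2) * A * D2^(-1/2).
    An = [[A[i][j] / math.sqrt(row_deg[i] * col_deg[j]) for j in range(n_cols)]
          for i in range(n_rows)]

    # The leading singular pair of An is known in closed form (singular
    # value 1), so subtract it and search for the next pair.
    total = sum(row_deg)
    u1 = [math.sqrt(d / total) for d in row_deg]
    v1 = [math.sqrt(d / total) for d in col_deg]
    B = [[An[i][j] - u1[i] * v1[j] for j in range(n_cols)] for i in range(n_rows)]

    def times_B(v):
        return [sum(B[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]

    def times_Bt(u):
        return [sum(B[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]

    # Power iteration on B^T B converges to the second right singular vector.
    v = [1.0 / (j + 1) for j in range(n_cols)]  # arbitrary deterministic start
    for _ in range(iters):
        w = times_Bt(times_B(v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = times_B(v)  # proportional to the second left singular vector

    # Undo the degree scaling and split rows and columns by sign.
    row_labels = [int(u[i] / math.sqrt(row_deg[i]) > 0) for i in range(n_rows)]
    col_labels = [int(v[j] / math.sqrt(col_deg[j]) > 0) for j in range(n_cols)]
    return row_labels, col_labels


# Toy matrix: rows 0-1 mostly use features 0-1, rows 2-3 mostly use features 2-3.
toy = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
]
rows, cols = bipartite_bisect(toy)
print(rows, cols)
```

Because rows and columns are embedded in the same space, each cluster of varieties comes paired with the features that characterize it. Which cluster receives label 0 or 1 is arbitrary.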


Identifying the dialectal background of American Finnish speakers using a supervised machine-learning model

by Ilmari Ivaska

2023, Nordic Journal of Linguistics

Key finding: Using supervised machine learning on heritage Finnish speech data, this paper presents models that probabilistically assign individual heritage speakers to dialect groups by analyzing feature distributions rather than...

Automatic Kurdish Dialects Identification

by Hossein Hassani

2021, Computer Science & Information Technology ( CS & IT )

Key finding: This research applies supervised machine learning classification methods to Kurdish texts, focusing on two dominant dialects, Kurmanji and Sorani. It highlights the challenges posed by lack of standard orthography and...

All papers in Dialect identification

Arabic dialect classification using an adaptive deep learning model

by beei iaes

2025, Bulletin of Electrical Engineering and Informatics

In daily life, dialect is the most widely used form of communication. Automatically identifying a dialect is a challenging task, particularly when dealing with similar dialects spoken in the same nation. In this study, we developed an...

The role of segments and prosody in the identification of a speaker’s dialect

by Marie-José Kolly

2024, Journal of Phonetics


Mawdoo3 AI at MADAR Shared Task: Arabic Fine-Grained Dialect Identification with Ensemble Learning

by Mostafa Samir

2023, Proceedings of the Fourth Arabic Natural Language Processing Workshop

In this paper, we discuss several models we used to classify 25 city-level Arabic dialects in addition to Modern Standard Arabic (MSA) as part of the MADAR shared task (sub-task 1). We propose an ensemble model of a group of experimentally...

Fake News Detection in Low Resource Languages using SetFit Framework

by Amin Abdedaiem

2023, Inteligencia Artificial

Social media has become an integral part of people’s lives, resulting in a constant flow of information. However, a concerning trend has emerged with the rapid spread of fake news, attributed to the lack of verification mechanisms. Fake...

Figure 1: The most common fake news detection methods.

... visual features such as extracted concepts, image captioning and clarity score. Both linguistic and visual features can suffer from limitations when the content is short in terms of text or does not contain any visual material. The knowledge-based approach uses fact-checking techniques to compare the news content with several predefined external sources and determine the trustworthiness of the input news. Fact checking can be done by an expert, who makes a final decision about the news by manually checking external sources such as Snopes and PolitiFact. It can also be done through crowdsourcing, which helps check the accuracy of the news but is less credible than a human expert [62].

Figure 2: A graphical illustration depicting the quantity of chosen articles categorized by year.

In this section, we survey the level reached in fake news detection for Arabic, whether for Modern Standard Arabic (MSA), for dialectal Arabic tied to a specific region, or for the Algerian dialect. We also focus on any existing datasets or systems that have been developed for fake news detection, and we try to identify the problems researchers faced on this topic and the techniques used to counter the spread of misinformation. We selected papers on fake news detection in Arabic from the last 7 years, covering approaches that range from machine learning to transformers and hybrid techniques. The diversity of the selected papers is shown in Figure 2.

Figure 3: A visual representation of SetFIT’s fine-tuning and training process

Figure 4: Visualization of the models’ accuracy which was assessed across three different sample sizes per class.

Figure 5: Visualizing the performance of the SetFIT framework across various pre-trained models on a normalized dataset with 1500 samples per class.

In contrast, MARBERTv2 achieved lower scores than the other models, primarily due to its pre-training dataset being exclusively focused on Maghrebi dialects. In comparison to the results from the previous experiment, these findings suggest that the normalization pre-processing step has a positive impact on models that saw MSA text during the pre-training phase. However, it has a negative impact on models that were primarily trained on Arabic dialects, leading to an increased occurrence of out-of-vocabulary words.

However, it is important to note that CAMeLBERT MSA experienced a decrease in performance when utilizing the normalized dataset. This decline can be attributed to the emergence of out-of-vocabulary words that were not encountered during the training phase, indicating that domain-specific terminology was absent from the training set. In contrast, ARBERT benefited from training on six distinct sources of text, enabling it to effectively handle a broader range of words and thereby improving its performance across diverse text domains. Figure 6 provides a visual comparison of both models’ performance on an unnormalized dataset.

Figure 6: Visualizing the performance of models trained solely on MSA data in the context of dialectal testing data.

Additionally, the evaluation of these models using various classifiers provides valuable insights into their adaptability to different scenarios, enhancing their practical applicability. For a thorough visual representation, Figure 7 illustrates a comprehensive comparison of both approaches across all metrics.

Figure 7: Visual comparison between the Few-Shot learning approach and the standard fine-tuning method using MARBERTv2 on a dataset of 1500 samples per class.

Table 4: The summary of Related work (Dataset).

Overall, this dataset provides a comprehensive and diverse set of news categories and sources, allowing for effective training and testing of machine learning models for the detection of fake news. It is also important to note that ethical considerations were taken into account during the data collection process, with all sensitive information anonymized and user consent obtained where necessary. Note that the Algerian dialect is very low-resource: Algerians make little use of Twitter and other major social media apart from Facebook, and news and publications are all written in MSA, which made it extremely hard to gather data on the Algerian dialect.

Table 6: Summary of the used Arabic pre-trained models in terms of dataset, size, vocabulary size, and training language.

Table 7: Hyperparameter summary.

The choice of the training sample sizes was determined following a rigorous testing process, in accordance with the methodology outlined in the SetFit research paper. These sample sizes were identified as optimal, taking into account the limitations of our initial dataset. Efforts to incorporate a larger number of samples were constrained by resource limitations and were therefore not pursued further.

Table 9: The performance score of the SetFIT framework on a normalized training set of 1500 samples per class.

Table 9 and Figure 5 summarize the results obtained from the experiment conducted on a normalized dataset. The results indicate that AraBERTv2 outperformed the other models, achieving an F1-score of 0.7004, closely followed by AraBART with a score of 0.6918. Importantly, both of these models were pre-trained on the same dataset, highlighting that this dataset contains a notably higher proportion of MSA text compared to dialectal text.

Table 11: Comparison between the performance score of the SetFIT framework and standard fine-tuning using the same Arabic pre-trained model and the training set of 1500 samples per class.

The results unequivocally indicate that SetFIT surpassed the performance of the fine-tuned model across all metrics, underscoring Few-Shot learning as the preferred approach for low-resource languages.

Faheem at NADI shared task: Identifying the dialect of Arabic tweet

by Aqil Azmi

2023

This paper describes Faheem (adj. of understand), our submission to NADI (Nuanced Arabic Dialect Identification) shared task. With so many Arabic dialects being understudied due to the scarcity of the resources, the objective is to...

Mawdoo3 AI at MADAR Shared Task: Arabic Fine-Grained Dialect Identification with Ensemble Learning

by Mostafa Samir

2023, Proceedings of the Fourth Arabic Natural Language Processing Workshop

Figure 2: Normalized confusion matrix of our baseline MNB model on the DEV dataset.

... hypothesize that its second layer managed to learn from the non-orthographic probability features of the first layer by detecting its biases and error distribution, thus enhancing upon it. We believe that a human benchmark might be useful for this fine-grained dialect detection problem, for which it would set a reasonable upper bound that shows the significance of the orthographic features in determining the writer’s dialect through the analysis of the human error.

Figure 1: Words distribution among the 25 dialects and MSA sorted by the percentage of exclusive words.

Table 1: Results in terms of macro F1-score (F1) and accuracy (Acc) of our experimental baseline, our three models (i.e., runs) which are Ensemble, LR and Ensemble + LMs respectively, the best model of (Salameh et al., 2018) (MNB), and the top ranked system in MADAR shared task (ArbDialectID).
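For readers unfamiliar with the two metrics named in Table 1, the following sketch computes accuracy and macro-averaged F1 from scratch. The labels and predictions are invented for illustration and are unrelated to the paper's results; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented gold labels and predictions for three placeholder dialect classes.
gold = ["A", "A", "B", "B", "C", "C"]
pred = ["A", "B", "B", "B", "C", "A"]

print(round(accuracy(gold, pred), 3))
print(round(macro_f1(gold, pred), 3))
```

Macro averaging matters for fine-grained dialect tasks with many classes, since a system cannot score well by doing well only on the most frequent labels.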



Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

by sina ahmadi

2023, ACM Transactions on Asian and Low-Resource Language Information Processing

Machine translation has been a major motivation of development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems, thanks to deep learning methods, parallel corpora...

KLPT – Kurdish Language Processing Toolkit

by sina ahmadi

2023, Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

Despite the recent advances in applying language-independent approaches to various natural language processing tasks thanks to artificial intelligence, some language-specific tools are still essential to process a language in a viable...

Table 1: A comparison of the Kurdish alphabets. Variations are specified with "/"

... Kurdish people that those are in fact different dialects of the Kurdish language (Haig and Matras, 2002; Matras, 2017). In this study, we remain with this theory and refer to them as Kurdish dialects. It is worth mentioning that despite the linguistic similarities of Zazaki, also known as Dimli, and Gorani languages and the popular belief that they are dialects of Kurdish, studies show that they belong to the Zaza-Gorani language family which is independent from the Kurdish language (Paul, 1998; Jugel, 2014; Ahmadi, 2020c).

Figure 2: Number of scientific publications directly related to Kurdish language processing per year

... computational linguistics, we reviewed the scientific publications that directly address an issue in those fields. A total of 53 publications were collected from widely used academic databases and search engines such as Google Scholar, and then classified based on their discussed sub-fields, which are illustrated in Figure 1. The Kurdish dialects are not evenly discussed in the previous studies, with Sorani making up a predominant proportion of almost 90%. Although a smaller proportion represents the Kurmanji dialect, no publication is found with respect to processing of the Southern Kurdish or Laki dialects. Regarding the research focus of the previous works, a range of NLP sub-fields has been addressed, particularly text mining, morphological and syntactic analysis, and creation of lexical resources. We exceptionally included optical character recognition as it is of importance for converting printed material to electronic forms (Ahmadi et al., 2019). The full list of the surveyed papers can be found in Appendix A.2.

Table A.2: Classification of the publications in the field of Kurdish language processing

A Supervised Leaning Technique for Language Identification

by Tauqeer Ahmad

2022

With the rapid expansion in internet technology and research, people are more vocal about their belongings and accomplishments. It could be due to social media and easy access to those websites where they can conveniently/freely share...

Mawdoo3 AI at MADAR Shared Task: Arabic Fine-Grained Dialect Identification with Ensemble Learning

by Ahmad mustafa

2022, Proceedings of the Fourth Arabic Natural Language Processing Workshop


Word-Level vs Sentence-Level Language Identification: Application to Algerian and Arabic Dialects

by Mohamed LICHOURI

2022, Procedia Computer Science

In this paper, we investigate a set of methods for textual Arabic Dialect Identification, where we considered word-level and sentence-level approaches. We used three classifiers, namely Linear Support Vector Machine (L-SVM), Bernoulli...

Token-Level Identification of Linguistic Code Switching

by Mona Diab

2022

Typically, native speakers of Arabic mix dialectal Arabic and Modern Standard Arabic in the same utterance. This phenomenon is known as linguistic code switching (LCS). It is a very challenging task to identify these LCS points in written...

Meeting Challenges of Modern Standard Arabic and Saudi Dialect Identification

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2022

Dialect identification is a prior requirement for learning lexical and morphological knowledge of a language variety, which can be beneficial for natural language processing (NLP) and potential AI downstream tasks. In this paper, we present...

Kurdish (Sorani) Speech to Text: Presenting an Experimental Dataset

by Hossein Hassani

2022

We present an experimental dataset, Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR), which we used in the first attempt at developing automatic speech recognition for Sorani Kurdish. The objective of the...


Token Level Identification of Linguistic Code Switching

by Heba EL MASRY

2022


Kurdish (Sorani) Speech to Text: Presenting an Experimental Dataset

by Hossein Hassani

2021

Figure 1: The Sorani sounds along with their phoneme representation.


Automatic Kurdish Dialects Identification

by Hossein Hassani

2021, Computer Science & Information Technology ( CS & IT )

Automatic dialect identification is a necessary Language Technology for processing multidialect languages in which the dialects are linguistically far from each other. Particularly, this becomes crucial where the dialects are mutually...

KURDISH LANGUAGE, ITS FAMILY AND DIALECTS

by Kurdiname International Academical Journal

2021

Kurdish belongs to the Iranian language group within the Indo-European language family, so there are many similarities between Kurdish and other Iranian languages. Such similarities among various languages lead to categorizing languages...

New Era -Modern Persian -Orthography

by Ali Akbar Abedian Kasgari

2020, Amazon, Kindle


AUTOMATIC KURDISH DIALECTS IDENTIFICATION

by Hossein Hassani and

2016

Automatic dialect identification is a necessary Language Technology for processing multi-dialect languages in which the dialects are linguistically far from each other. Particularly, this becomes crucial where the dialects are mutually...