NepBERTa: Nepali Language Model

Research Group: BBMMLL
Status: Inactive

This project developed a large-scale pre-trained Nepali language model to support NLP applications in Nepali, including text classification, named entity recognition, and other language-understanding tasks.

Background

Nepali NLP has historically suffered from limited high-quality language resources, restricting the development of robust NLP applications. Most existing language models are trained on English or other high-resource languages, leading to poor performance on Nepali text tasks. There is a need for a large, monolingual Nepali model that can understand language structure, context, and semantics to enable better NLP tools for Nepali users.

Research Aim

The project aimed to pre-train a large-scale Nepali language model, NepBERTa, on extensive monolingual Nepali corpora. It sought to provide a foundation for a wide range of Nepali NLP tasks, such as text classification, named entity recognition, and part-of-speech tagging, enabling downstream applications to perform reliably on Nepali text.
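As the name suggests, NepBERTa belongs to the BERT family, whose pretraining objective is masked language modeling: a fraction of input tokens is hidden and the model learns to predict them from context. The following is a minimal, self-contained sketch of the standard BERT masking rule (15% of tokens selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged). It is an illustration only, not the project's actual pipeline; the tiny vocabulary and token names are made up.

```python
import random

MASK = "[MASK]"
# Toy replacement vocabulary for illustration (Nepali words).
VOCAB = ["नेपाल", "भाषा", "मोडेल", "शब्द"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masked-language-model corruption.

    Selects ~mask_prob of the tokens as prediction targets; of those,
    80% become [MASK], 10% become a random vocabulary token, and 10%
    are kept unchanged. Returns (corrupted_tokens, labels) where labels
    holds the original token at target positions and None elsewhere.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)          # 80%: mask it
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)           # 10%: keep unchanged
        else:
            labels.append(None)  # not a prediction target
            corrupted.append(tok)
    return corrupted, labels
```

During pretraining, the loss is computed only at the positions where `labels` is not `None`, which is what forces the model to use bidirectional context.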

Outcomes

NepBERTa was successfully developed and evaluated on multiple Nepali NLP benchmarks, demonstrating strong performance compared to existing models. The model provides a foundation that researchers and developers can fine-tune for a variety of Nepali language applications, improving the accessibility and effectiveness of NLP solutions in Nepali.

Achievements & Outputs

Official Release: View

Publications
2022
NepBERTa: Nepali Language Model Trained in a Large Corpus