NepBERTa: Nepali Language Model

Research Group: BBMMLL
Status: Inactive

This project developed a large-scale pre-trained Nepali language model to support NLP applications in Nepali, including text classification, named entity recognition, and other language-understanding tasks.

Background

Nepali NLP has historically suffered from limited high-quality language resources, restricting the development of robust NLP applications. Most existing language models are trained on English or other high-resource languages, leading to poor performance on Nepali text tasks. There is a need for a large, monolingual Nepali model that can understand language structure, context, and semantics to enable better NLP tools for Nepali users.

Research Aim

The project aimed to pre-train a large-scale Nepali language model, NepBERTa, on extensive monolingual Nepali corpora. It sought to provide a foundation for a wide range of Nepali NLP tasks, such as text classification, named entity recognition, and part-of-speech tagging, enabling downstream applications to perform reliably on Nepali text.
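As the name suggests, NepBERTa belongs to the BERT family, whose pretraining objective is masked language modeling: a fraction of input tokens is hidden and the model learns to predict them from context. The following is a minimal, self-contained sketch of the standard BERT masking rule (15% of tokens selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged). It is an illustration only, not the project's actual pipeline; the tiny vocabulary and token names are made up.

```python
import random

MASK = "[MASK]"
# Toy replacement vocabulary for illustration (Nepali words).
VOCAB = ["नेपाल", "भाषा", "मोडेल", "शब्द"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masked-language-model corruption.

    Selects ~mask_prob of the tokens as prediction targets; of those,
    80% become [MASK], 10% become a random vocabulary token, and 10%
    are kept unchanged. Returns (corrupted_tokens, labels) where labels
    holds the original token at target positions and None elsewhere.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)          # 80%: mask it
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)           # 10%: keep unchanged
        else:
            labels.append(None)  # not a prediction target
            corrupted.append(tok)
    return corrupted, labels
```

During pretraining, the loss is computed only at the positions where `labels` is not `None`, which is what forces the model to use bidirectional context.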

Outcomes

NepBERTa was successfully developed and evaluated on multiple Nepali NLP benchmarks, demonstrating strong performance compared to existing models. The model provides a foundation that researchers and developers can fine-tune for a variety of Nepali language applications, improving the accessibility and effectiveness of NLP solutions in Nepali.

Achievements & Outputs

Official Release: View

Publications
2022
NepBERTa: Nepali Language Model Trained in a Large Corpus