MNLP Project
Maithili Natural Language Processing (MNLP) Toolkit
Overview
Maithili is an Indo-Aryan language native to the Indian subcontinent, spoken mainly in India and Nepal. Its speakers are spread across a large part of Bihar and the eastern Terai region of Nepal. Maithili has an estimated 30 to 35 million speakers.
The goal of MNLP is to build a natural language processing toolkit for the Maithili language. The toolkit will support tokenization of Maithili text, Maithili word embeddings, part-of-speech (POS) tagging, named entity recognition, and building neural models for Maithili. Maithili is a resource-constrained language with a very small digital footprint, which makes data collection, annotation, and building machine learning models difficult. This is an open source project, hosted on GitHub.
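To give a rough idea of what the tokenization step could look like, here is a minimal sketch in Python. It is not the MNLP API: the function name `tokenize_maithili` and the regular expression are assumptions made for illustration, and the sketch assumes Maithili text written in Devanagari script. It splits text into Devanagari words, danda punctuation, Latin/ASCII tokens, and other single symbols.

```python
import re

# Hypothetical helper, not part of the actual MNLP toolkit.
# Assumes Devanagari-script input. The danda (U+0964) and double danda
# (U+0965) sit inside the Devanagari block, so they are excluded from the
# word class and matched as separate punctuation tokens.
_TOKEN_PATTERN = re.compile(
    r"[\u0900-\u0963\u0966-\u097F\u200C\u200D]+"  # Devanagari word (with matras, ZWJ/ZWNJ)
    r"|[\u0964\u0965]"                            # danda / double danda
    r"|[A-Za-z0-9]+"                              # Latin words and ASCII digits
    r"|\S"                                        # any other single non-space symbol
)


def tokenize_maithili(text):
    """Return a list of word and punctuation tokens from the given text."""
    return _TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize_maithili("हम मैथिली बजैत छी।"))
    # ['हम', 'मैथिली', 'बजैत', 'छी', '।']
```

An explicit Unicode range is used instead of `\w` because Python's `\w` does not match Devanagari vowel signs, which would split words in the middle. A real tokenizer would also need to handle abbreviations, numerals with separators, and mixed-script text.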
Open Source Project
All code, data, and APIs will be publicly available, with the aim of increasing the digital footprint of the Maithili language. Note that this is a volunteer project; no paid employment or internship is available. All research will be published on arXiv/HAL with contributors as authors.
How you can contribute
- If you are a computer science student, please email me your CV along with your motivation for working on the project.
- If you are a researcher, scientist, or professor, please feel free to email me about collaborations.
- If you are a native Maithili speaker and want to volunteer for translation and annotation, you are more than welcome.
- If you own a Maithili corpus in digital format (PDF, Word, text, or JSON), for example as a newspaper, author, or blog, we would be grateful if you shared the data with us. I assure you that your data rights will be protected. If your corpus cannot be shared publicly, we also assure you that the data will be protected.