How to build an NER dataset for the Urdu language?

Prodigy annotation tool

Named Entity Recognition (NER) is one of the most common and important tasks in NLP. Many resources and prebuilt solutions are available for English. Urdu, however, is a low-resource language, and there are no readily usable datasets available for it. In this article, I'm going to show you how you can build your own NER dataset with minimal effort by annotating the entities yourself. I'm using the entity types from UNER (https://github.com/mirfan899/Urdu#uner-dataset) for this article.

Annotator:
Several good annotation tools are available for text data, such as brat (http://brat.nlplab.org/), Prodigy (https://prodi.gy/), LightTag (https://www.lighttag.io/), and tracy (https://github.com/YuukanOO/tracy). I want to build a dataset that can be used for a chatbot in the future, so I decided to use Prodigy. You need to purchase a license to use it, or you can apply for a license for educational research.

Commands:
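The annotations are collected into a Prodigy dataset named ner-ur-model, and the spaCy model "ur_model" is used to tokenize the text. First, build a JSONL file (one JSON object per line) with this structure: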

{"text": "پیپلزپارٹی کی حکومت اور مسلم لیگ ن کے مابین جاری دوستانہ کشمکش اب حقیقی تناؤ میں متشکل ہونا شروع ہوگئی ہے۔"}
{"text": "اس کھینچا تانی میں قومی سیاسی منظر نامے میں ایک بار پھر دائیں اور بائیں بازور کی سیاست کا ظہور ہوتا نظرآرہاہے۔"}
{"text": "اگر نواز شریف آڑے نہ آتے تو ساری قاف لیگ کب کی مسلم لیگ نون میں ضم ہوچکی ہوتی۔"}
{"text": "اب ایسا محسوس ہوتاہے کہ میاں صاحب کے اعصاب جواب دے رہے ہیں اور رفتہ رفتہ تلخ سیاسی حقائق کا ادراک کررہے ہیں۔"}
{"text": "حکومت کو کوئی بڑا ریلیف ملتانظر نہیں آتاہے لہٰذا یوسف رضا گیلانی کی ساری توجہ فوج کے ساتھ تعلقات کو ہموار رکھنے پر ہے۔"}
{"text": "اپنے ذاتی یا جماعتی مفادات کی خاطر بڑی سے بڑی مقدس روایت کو روندا لیاجاتاہے۔"}

After building the data/urdu.jsonl file, you need to provide a text file (data/entities.txt) that lists the entity labels, one per line.

PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME
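
The label file can also be generated from code, which helps keep the label set free of duplicates and stray whitespace. This is a small sketch under the assumption that labels are stored one per line in data/entities.txt; the helper name write_labels is made up for illustration.

```python
from pathlib import Path

# The seven entity labels listed above.
LABELS = ["PERSON", "LOCATION", "ORGANIZATION", "DATE", "NUMBER", "DESIGNATION", "TIME"]


def write_labels(labels, path):
    """Normalize labels, drop blanks and duplicates, and write one label per line.

    write_labels is a hypothetical helper, not a Prodigy function.
    """
    cleaned = []
    for label in labels:
        label = label.strip().upper()
        if label and label not in cleaned:
            cleaned.append(label)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# The extra "person" and blank entries show that duplicates and blanks are removed.
labels = write_labels(LABELS + ["person", " "], "data/entities.txt")
print(labels)
```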

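Now use the following Prodigy command to start annotating. Here ner-ur-model is the name of the Prodigy dataset that stores the annotations, and ur_model is the spaCy model used for tokenization.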

prodigy ner.manual ner-ur-model ur_model data/urdu.jsonl --label data/entities.txt

Prodigy saves the annotations in a SQLite database. The Prodigy documentation (https://prodi.gy/docs/) is a useful reference. Happy annotating! If you have any questions, feel free to ask in the comments.
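
Once some annotations are saved, they can be exported to JSONL with Prodigy's db-out command (for example, prodigy db-out ner-ur-model > annotations.jsonl, where the output file name is just an example). The sketch below shows one way to turn such records into (text, entities) pairs for later training. It assumes each record carries "text", an "answer" field, and a "spans" list with "start", "end", and "label" keys; check your own export before relying on it. The function name to_training_examples and the sample records are illustrative.

```python
import json


def to_training_examples(lines):
    """Convert exported annotation lines into (text, {"entities": [...]}) pairs.

    Assumes the record layout described above; records that were not accepted
    are skipped.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("answer") != "accept":
            continue  # keep only annotations the annotator accepted
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        examples.append((record["text"], {"entities": entities}))
    return examples


# Hand-built sample records standing in for a real export.
sample_text = "اگر نواز شریف آڑے نہ آتے"
name = "نواز شریف"
start = sample_text.index(name)  # compute character offsets instead of hard-coding them
sample_lines = [
    json.dumps(
        {
            "text": sample_text,
            "spans": [{"start": start, "end": start + len(name), "label": "PERSON"}],
            "answer": "accept",
        },
        ensure_ascii=False,
    ),
    json.dumps({"text": sample_text, "spans": [], "answer": "reject"}, ensure_ascii=False),
]

examples = to_training_examples(sample_lines)
for text, annotations in examples:
    for span_start, span_end, label in annotations["entities"]:
        print(label, text[span_start:span_end])
```

The character offsets index directly into the original text, so slicing the text with a span's start and end is a quick sanity check on any exported record.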

Comments

Unknown, 2 October 2019 at 03:35
Great work! Can you share how we can do dependency parsing in Urdu as well?
Muhammad Irfan, 10 November 2019 at 05:04
Dependency parsing is still under development, although you can use the spaCy Urdu model for this purpose.
Unknown, 16 March 2021 at 00:27
Greetings,
I tried to install the Urdu model with
pip install ur_model-0.0.0.tar.gz
but it showed me this error. Please help me.
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: 'C:\\Users\\FIM\\ur_model-0.0.0.tar.gz'

Muhammad Irfan, 17 March 2021 at 10:28
The model file should be in the directory where you run the pip command.
Umair, 28 February 2022 at 03:22
Can you share the steps to train spaCy?