Urdu Tokenization using SpaCy

SpaCy is an NLP library that supports many languages. It is fast and has deep neural networks built in for NLP tasks such as part-of-speech (POS) tagging and named entity recognition (NER). It has extensive support and good documentation, provides GPU support, and can be integrated with TensorFlow, PyTorch, scikit-learn, etc.

SpaCy makes it easy to add support for a new language: you simply follow the Adding Languages article. I’ve added the Urdu language with dictionary-based lemmatization, lexical support and Urdu stop words. Here is how you can use the tokenizer for the Urdu language.

First, install SpaCy.

$ pip install spacy

Now import spacy and create a blank pipeline for the Urdu language. I’m using blank because no trained model is available for Urdu yet, but tokenization support is available.

import spacy

nlp = spacy.blank('ur')

doc = nlp("کچھ ممالک ایسے بھی ہیں جہاں اس برس روزے کا دورانیہ 20 گھنٹے تک ہے۔")

print("Urdu Tokenization using SpaCy")

for word in doc:
    print(word)

Here is the output:

کچھ
ممالک
ایسے
بھی
ہیں
جہاں
اس
برس
روزے
کا
دورانیہ
20
گھنٹے
تک
ہے
۔
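
To make the result easier to work with, here is a minimal sketch that inspects what the tokenizer returns. It assumes a recent SpaCy release in which `spacy.blank("ur")` is available; the variable names `doc_type`, `token_types` and `words` are only for illustration. The call to `nlp(...)` returns a `Doc`, and iterating over it yields `Token` objects that carry lexical attributes such as `is_punct`, `like_num` and `is_stop`.

```python
import spacy

# Blank Urdu pipeline: tokenizer only, no trained components.
nlp = spacy.blank("ur")
doc = nlp("کچھ ممالک ایسے بھی ہیں جہاں اس برس روزے کا دورانیہ 20 گھنٹے تک ہے۔")

# nlp(...) returns a Doc; iterating over it yields Token objects.
doc_type = type(doc).__name__
token_types = {type(token).__name__ for token in doc}
print(doc_type, token_types)

# Each Token carries lexical attributes alongside its text.
for token in doc:
    print(token.text, token.is_punct, token.like_num, token.is_stop)

# Plain Python strings, if that is all you need downstream.
words = [token.text for token in doc]
```

If you only need strings, `token.text` (as in `words` above) is usually enough; the attribute flags are handy for filtering out punctuation or stop words.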

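Note that Urdu has its own punctuation symbols, such as ۔ (full stop) and ، (comma), and it also uses Western digits such as 12. For this sentence the tokenization is 100% accurate: every word, the number, and the final full stop come out as separate tokens.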
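
As a small illustration of these characters, the sketch below uses only Python's standard `unicodedata` module (no SpaCy involved) to look up the Urdu full stop and comma, and to show that a number written in Extended Arabic-Indic digits (۰-۹) can be converted with the built-in `int()`. The variable names are hypothetical, and this says nothing about how SpaCy itself tokenizes such digits.

```python
import unicodedata

# Urdu full stop and comma, written as escapes so the code points are unambiguous.
for ch in ("\u06d4", "\u060c"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# A year written in Extended Arabic-Indic digits (U+06F0 to U+06F9).
urdu_year = "\u06f1\u06f9\u06f9\u06f3"

# str.isdecimal() and int() both understand Unicode decimal digits.
is_decimal = urdu_year.isdecimal()
year = int(urdu_year)
print(is_decimal, year)
```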

If you have any questions, feel free to ask in the comments.

Comments

Anonymous, 23 January 2020 at 20:41
Can you do a tutorial on Urdu lemmatization using SpaCy, please?

Trial Blogs, 17 October 2020 at 07:04
Greetings, I am curious about how to create detection/annotation for numeric and date expressions in Urdu. Like ۱۹۹۳, اکتوبر۳ or ۹۹روپے ? In your blog the numbers are in English digits [0-9]; how about Urdu digits [۰-۹]? Please do tell me. Thanks for your blog; because of it I learned about Urdu natural language processing. Keep up the good work.

Muhammad Irfan, 22 October 2020 at 21:04
For detection or annotation, you need to train an NER model.

Oblivion, 30 December 2020 at 08:09
Thanks for sharing this useful information. I wanted to ask what the data type of the returned tokens is. Sorry for this question, I'm new to NLP.

Muhammad Irfan, 23 January 2021 at 10:26
It's a SpaCy Doc.

Unknown, 15 February 2021 at 10:44
Hi, can you do a tutorial on Urdu text summarization using SpaCy, please?

Muhammad Irfan, 16 February 2021 at 09:32
Yes, sure. I will do it in the future. I am currently working on a Q&A system.

Unknown, 30 January 2022 at 12:51
Hi, how can we handle missing white spaces like روزےکادورانیہ between Urdu words?

Muhammad Irfan, 10 February 2022 at 08:05
Use word segmentation. This is a very difficult problem and can only be done by training a model on a large corpus.

Anonymous, 19 September 2022 at 02:02
Can you please help with the Urdu word segmentation problem?

Muhammad Irfan, 20 September 2022 at 03:56
This is a challenging problem. You need to get a clean Urdu dataset first; then I can guide you through it.

Anonymous, 11 March 2023 at 09:55
A language model for Urdu is currently unavailable in SpaCy. I was wondering what more you can do, other than tokenization, on Urdu text with SpaCy.

Muhammad Irfan, 3 April 2023 at 04:54
Sentiment, NER, lemmatization, pretty much everything SpaCy provides.
