
Urdu POS Tagging using MLP

Urdu has far fewer natural language processing resources than English. Part-of-speech (POS) tagging is one of the simplest and most common NLP tasks, but datasets for training an Urdu POS tagger are scarce. Several POS tagsets are available in the literature, such as Muaz’s tagset and Sajjad’s tagset. Because datasets are either unavailable or restricted in use, much Urdu NLP work is still in progress.

I’ve developed a dataset for training a POS tagger for the Urdu language. It is available on GitHub. The dataset is small, but it is sufficient to train the POS tagger. It was built using Sajjad’s tagset because this tagset covers all the words in Urdu literature and has 39 tags. Here are some examples from this tagset.



I’ve used Keras to build the MLP model for POS tagging. The data is in tab-separated form and is converted to sentences and tags using utility functions.

Here is the format of data.txt:
[('اشتتیاق', 'NN'), ('اور', 'CC'), ('ملائکہ', 'NN'), ('ہی', 'I'), ('ببانگِ', 'NN'), ('دہل', 'PN'), ('موجود', 'ADJ'), ('ہیں', 'VB'), ('اس', 'PD'), ('وقت', 'NN'), ('تو', 'I'), ('۔', 'SM')]
The dataset used in this tutorial can be downloaded from Kaggle: https://www.kaggle.com/mirfan899/data-lstm
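Each line in the sample above is a Python list literal of (word, tag) pairs, so it can be split into two parallel sequences. The following is a minimal sketch of that step, using a shortened copy of the sample line; the variable names are illustrative and not taken from the original code.

```python
import ast

# One line of the data file, shortened from the sample shown above.
line = "[('اور', 'CC'), ('وقت', 'NN'), ('۔', 'SM')]"

# literal_eval parses the list literal without executing arbitrary code.
pairs = ast.literal_eval(line)

# zip(*pairs) separates the words from their tags.
words_in_line, tags_in_line = zip(*pairs)

print(words_in_line)
print(tags_in_line)
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to plain Python literals.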
import ast
import codecs
import numpy as np
from sklearn.model_selection import train_test_split
tagged_sentences = codecs.open("../data/pos.txt", encoding="utf-8").readlines()
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*ast.literal_eval(tagged_sentence))
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))
(train_sentences,
 test_sentences,
 train_tags,
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)
words = get_words(train_sentences)
tags = get_tags(train_tags)
Read the dataset and split it into train and test datasets.
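Word and tag indexes are then built so that sentences can be encoded as integer sequences. Index 0 is reserved for padding (-PAD-), which lets sentences of different lengths share one shape, and index 1 of the word index is reserved for out-of-vocabulary words (-OOV-).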
word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1

tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0
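To make the role of the two reserved indexes concrete, here is a small sketch with a toy vocabulary. The names `toy_words`, `toy_word2index` and `encode` are hypothetical and only serve the illustration; the article's own encoding is done by the utility functions listed further down.

```python
# Toy vocabulary; sorted() makes the index assignment reproducible,
# whereas iterating over a set of strings can change order between runs.
toy_words = {"اور", "وقت", "۔"}
toy_word2index = {w: i + 2 for i, w in enumerate(sorted(toy_words))}
toy_word2index['-PAD-'] = 0   # reserved for padding
toy_word2index['-OOV-'] = 1   # reserved for words not seen in training


def encode(sentence, word2index):
    """Map each word to its index, falling back to the -OOV- index."""
    return [word2index.get(w, word2index['-OOV-']) for w in sentence]


# The middle word is not in the toy vocabulary, so it maps to -OOV-.
encoded = encode(["وقت", "کتاب", "۔"], toy_word2index)
print(encoded)
```

Because the mapping depends on iteration order, it may be worth saving the word and tag indexes alongside the model if the model is to be reused later.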
Train and test sentences, as well as tags, are converted to a proper format to be used in the MLP model:
train_sentences_x = get_train_sentences_x(train_sentences, word2index)
test_sentences_x = get_test_sentences_x(test_sentences, word2index)

train_tags_y = get_train_tags_y(train_tags, tag2index)
test_tags_y = get_test_tags_y(test_tags, tag2index)
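Finally, padding is added to the train and test sets so that every sequence has the same length:
from keras.preprocessing.sequence import pad_sequences  # import path depends on the Keras version

MAX_LENGTH = len(max(train_sentences_x, key=len))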

train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')
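The effect of post-padding can be sketched in plain NumPy. The helper `pad_post` below is a hypothetical stand-in written for this illustration, not the Keras implementation. It follows the Keras default of dropping entries from the front when a sequence is longer than `maxlen`, which can happen for test sentences because `MAX_LENGTH` is computed from the training set only.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Right-pad each index list to maxlen; keep the last maxlen items if longer."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]            # truncate from the front, as Keras does by default
        padded[row, :len(trimmed)] = trimmed
    return padded


toy = [[5, 3, 8], [4, 2]]
print(pad_post(toy, maxlen=4))
```

Since the sentences and their tags are padded with the same settings, each word index stays aligned with its tag index.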
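Here is the MLP architecture:
from keras.models import Sequential
from keras.layers import InputLayer, Embedding, Dense, Activation
from keras.optimizers import Adam

model = Sequential()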
model.add(InputLayer(input_shape=(MAX_LENGTH,)))
model.add(Embedding(len(word2index), 128))
model.add(Dense(128))
model.add(Dense(len(tag2index)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
history = model.fit(train_sentences_x, to_categorical(train_tags_y, len(tag2index)), batch_size=32, epochs=10,
                    validation_split=0.2).history
model.save("../models/mlp.h5")

scores = model.evaluate(test_sentences_x, to_categorical(test_tags_y, len(tag2index)))
print("Test loss and accuracy:", scores)
model.summary()
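It can help to see what this stack of layers computes. The Embedding layer turns each word index into a vector, and each Dense layer is applied to the last axis only, so every position in the sentence is tagged from its own word vector. The NumPy sketch below mirrors that forward pass with toy sizes and random weights; the names `E`, `W1`, `W2` and `forward` are assumptions made for the illustration, and this is not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, much smaller than the real model.
vocab_size, embed_dim, hidden, n_tags = 10, 8, 6, 4

# Randomly initialised stand-ins for the learned weights.
E = rng.normal(size=(vocab_size, embed_dim))                     # embedding table
W1, b1 = rng.normal(size=(embed_dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n_tags)), np.zeros(n_tags)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x):
    """x: (batch, length) word indices -> (batch, length, n_tags) probabilities."""
    h = E[x]              # embedding lookup: (batch, length, embed_dim)
    h = h @ W1 + b1       # first Dense layer (no activation, as in the article's code)
    return softmax(h @ W2 + b2)


x = np.array([[2, 3, 4, 0, 0]])              # one padded sentence
probs = forward(x)
print(probs.shape)
```

In this sketch the two padded positions receive identical outputs, because the prediction for a position depends only on the word index at that position and not on its neighbours.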

The model reaches 98% accuracy on the training dataset and 97% accuracy on the test dataset.
[Figure: loss for the train and test datasets]
Evaluation on the test dataset shows 97% accuracy.
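One thing to keep in mind when reading these figures: the Keras accuracy metric, as configured above, counts every position, including the -PAD- positions that follow each sentence, so per-word accuracy may be lower than the reported number. The sketch below shows one way to measure accuracy on real tokens only. `masked_accuracy` is a hypothetical helper, not part of the original code, and the arrays are toy data.

```python
import numpy as np


def masked_accuracy(y_true, y_pred, pad_index=0):
    """Accuracy over non-padding positions only.

    y_true: (n_sentences, max_length) integer tag indices
    y_pred: (n_sentences, max_length, n_tags) predicted probabilities
    """
    predicted = np.argmax(y_pred, axis=-1)
    mask = y_true != pad_index               # True where a real token is present
    return float((predicted[mask] == y_true[mask]).mean())


# Toy example: two real tokens followed by two padding positions.
y_true = np.array([[1, 2, 0, 0]])
y_pred = np.eye(3)[np.array([[1, 1, 0, 0]])]     # one-hot "predictions"

plain = float((np.argmax(y_pred, axis=-1) == y_true).mean())
masked = masked_accuracy(y_true, y_pred)
print(plain, masked)
```

With the variables from this article, it could be called as `masked_accuracy(test_tags_y, model.predict(test_sentences_x))`.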

Here are utility functions that are used in building the model.
def logits_to_tokens(sequences, index):
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequence = []
        for categorical in categorical_sequence:
            token_sequence.append(index[np.argmax(categorical)])

        token_sequences.append(token_sequence)

    return token_sequences


def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)


def get_words(sentences):
    words = set([])
    for sentence in sentences:
        for word in sentence:
            words.add(word)
    return words


def get_tags(sentences_tags):
    tags = set([])
    for tag in sentences_tags:
        for t in tag:
            tags.add(t)
    return tags


def get_train_sentences_x(train_sentences, word2index):
    train_sentences_x = []
    for sentence in train_sentences:
        sentence_index = []
        for word in sentence:
            try:
                sentence_index.append(word2index[word])
            except KeyError:
                sentence_index.append(word2index['-OOV-'])

        train_sentences_x.append(sentence_index)
    return train_sentences_x


def get_test_sentences_x(test_sentences, word2index):
    test_sentences_x = []
    for sentence in test_sentences:
        sentence_index = []
        for word in sentence:
            try:
                sentence_index.append(word2index[word])
            except KeyError:
                sentence_index.append(word2index['-OOV-'])
        test_sentences_x.append(sentence_index)
    return test_sentences_x


def get_train_tags_y(train_tags, tag2index):
    train_tags_y = []
    for tags in train_tags:
        train_tags_y.append([tag2index[t] for t in tags])
    return train_tags_y


def get_test_tags_y(test_tags, tag2index):
    test_tags_y = []
    for tags in test_tags:
        test_tags_y.append([tag2index[t] for t in tags])
    return test_tags_y
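The `logits_to_tokens` function above converts model output back into tag names, given a mapping from index to tag. Here is a short sketch of how it might be used; `toy_tag2index` and `fake_predictions` are made-up stand-ins for `tag2index` and the output of `model.predict`, and the function is repeated in compact form so the sketch runs on its own.

```python
import numpy as np


def logits_to_tokens(sequences, index):
    # Same logic as the utility function above, written compactly.
    return [[index[np.argmax(c)] for c in seq] for seq in sequences]


toy_tag2index = {'-PAD-': 0, 'NN': 1, 'VB': 2}

# Invert the mapping so that an index can be turned back into a tag.
index2tag = {i: t for t, i in toy_tag2index.items()}

# Stand-in for model.predict(...): one sentence, three positions, three tags.
fake_predictions = np.array([[[0.1, 0.8, 0.1],
                              [0.2, 0.1, 0.7],
                              [0.9, 0.05, 0.05]]])

decoded = logits_to_tokens(fake_predictions, index2tag)
print(decoded)
```

For a real sentence, the decoded list can be cut to the sentence's original length to discard the padded positions.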
And that’s all. You can build an Urdu or an Arabic POS tagging MLP model by following this article.

Thank you for reading the article. Hit the clap button if you like the article. If you need help with this article, contact me on LinkedIn.

Comments

  1. How can I preprocess my Urdu text dataset for deep learning using your code?
    I am new to NLP, please guide me.
