Urdu POS Tagging using MLP

Urdu is a less developed language as compared to English for natural language processing applications. POS is a simple and most common natural language processing task but the dataset for training Urdu POS is in scarcity. There are different POS tagsets such as Muaz’s Tagset and Sajjad’s Tagset available in the literature. Due to the non-availability of the dataset and restriction to use dataset, much of NLP work is under progress.

I’ve developed a dataset of training POS for the Urdu language. It is available on Github. It is a small dataset more than enough to train the POS tagger. It has been build using Sajja’s Tagset because this tagset covers all the words in Urdu literature and has 39 tags. Here are some examples of this tag set.

I’ve used Keras to build the MLP model for POS. Data is in tab-separated form and converted to sentences and tags using utility functions.

Here is the format of data.txt

[('اشتتیاق', 'NN'), ('اور', 'CC'), ('ملائکہ', 'NN'), ('ہی', 'I'), ('ببانگِ', 'NN'), ('دہل', 'PN'), ('موجود', 'ADJ'), ('ہیں', 'VB'), ('اس', 'PD'), ('وقت', 'NN'), ('تو', 'I'), ('۔', 'SM')]

Here is the Kaggle link(https://www.kaggle.com/mirfan899/data-lstm) to download the dataset used in this tutorial.

import codecs
import numpy as np
from sklearn.model_selection import train_test_split
tagged_sentences = codecs.open("../data/pos.txt", encoding="utf-8").readlines()
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*ast.literal_eval(tagged_sentence))
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))
(train_sentences,
 test_sentences,
 train_tags,
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)
words = get_words(train_sentences)
tags = get_tags(train_tags)

Read the dataset and split it into train and test datasets.

Word indexes and tag indexes are built to handle different lengths of sentences and also maintain the OOV dictionary.

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1

tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0

Train and test sentences, as well as tags, are converted to a proper format to be used in the MLP model:

train_sentences_x = get_train_sentences_x(train_sentences, word2index)
test_sentences_x = get_test_sentences_x(test_sentences, word2index)

train_tags_y = get_train_tags_y(train_tags, tag2index)
test_tags_y = get_test_tags_y(test_tags, tag2index)

Finally adding the padding to train and test sets to be used in the model.

MAX_LENGTH = len(max(train_sentences_x, key=len))

train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')

Here is MLP architecture:

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH,)))
model.add(Embedding(len(word2index), 128))
model.add(Dense(128))
model.add(Dense(len(tag2index)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
history = model.fit(train_sentences_x, to_categorical(train_tags_y, len(tag2index)), batch_size=32, epochs=10,
                    validation_split=0.2).history
model.save("../models/mlp.h5")

scores = model.evaluate(test_sentences_x, to_categorical(test_tags_y, len(tag2index)))
model.summary()

The model obtains the 98% accuracy for the train dataset and 97% accuracy on the test dataset.

The loss for train and test dataset

Evaluation of test dataset shows 97% accuracy.

Here are utility functions that are used in building the model.

def logits_to_tokens(sequences, index):
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequence = []
        for categorical in categorical_sequence:
            token_sequence.append(index[np.argmax(categorical)])

        token_sequences.append(token_sequence)

    return token_sequences


def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)


def get_words(sentences):
    words = set([])
    for sentence in sentences:
        for word in sentence:
            words.add(word)
    return words


def get_tags(sentences_tags):
    tags = set([])
    for tag in sentences_tags:
        for t in tag:
            tags.add(t)
    return tags


def get_train_sentences_x(train_sentences, word2index):
    train_sentences_x = []
    for sentence in train_sentences:
        sentence_index = []
        for word in sentence:
            try:
                sentence_index.append(word2index[word])
            except KeyError:
                sentence_index.append(word2index['-OOV-'])

        train_sentences_x.append(sentence_index)
    return train_sentences_x


def get_test_sentences_x(test_sentences, word2index):
    test_sentences_x = []
    for sentence in test_sentences:
        sentence_index = []
        for word in sentence:
            try:
                sentence_index.append(word2index[word])
            except KeyError:
                sentence_index.append(word2index['-OOV-'])
        test_sentences_x.append(sentence_index)
    return test_sentences_x


def get_train_tags_y(train_tags, tag2index):
    train_tags_y = []
    for tags in train_tags:
        train_tags_y.append([tag2index[t] for t in tags])
    return train_tags_y


def get_test_tags_y(test_tags, tag2index):
    test_tags_y = []
    for tags in test_tags:
        test_tags_y.append([tag2index[t] for t in tags])
    return test_tags_y

and that’s all. You can build Urdu as well as the Arabic POS MLP model using this article.

Thank you for reading the article. Hit the clap button if you like the article. If you need help in this article, contact me on Linkedin.

Comments

Unknown30 November 2020 at 21:58
how can i preprocess my urdu text dataset for deep learning using your code ?
i am new to nlp please guide me
ReplyDelete
Replies

Add comment

UrduNLP

Search This Blog

Urdu POS Tagging using MLP

Comments

Post a Comment

Popular posts from this blog

Transformer Based QA System for Urdu

Text Summarization for Urdu: Part 1

Urdu News Classification