iNLTK supports most of the widely spoken Indian languages, each identified by a short code (for example, hi for Hindi and bn for Bengali). The following is the list of languages available in iNLTK, along with their codes:
When you use a language for the first time in a system or environment, you need to run the language setup, which downloads the models for that language. This is a one-time step; no setup is needed on subsequent uses. Here we set the language to Bengali (bn), but you can set up any language of your choice from the list of available languages. You can set up a language as follows:
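The setup step can be sketched as below. This is a minimal sketch that assumes iNLTK is installed (for example with pip install inltk) and that setup is importable from the inltk.inltk module; the helper name setup_language is only illustrative and is not part of iNLTK.

```python
def setup_language(language_code):
    """Download the iNLTK models for one language (illustrative helper)."""
    # Imported lazily so this file can be loaded even before iNLTK is installed.
    from inltk.inltk import setup

    # One-time step per environment: fetches the models for `language_code`.
    setup(language_code)
```

Calling setup_language('bn') once prepares Bengali. The Hindi examples later in this article would need setup_language('hi') in the same way.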
Now, let us perform some basic NLP tasks in Indian languages using iNLTK: tokenization, generating text embeddings, predicting the next words, generating similar sentences, and measuring the similarity between two sentences.
We tokenize the sentence 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।' (the Hindi translation of 'GeeksForGeeks is a great technology learning platform.').
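A minimal sketch of this step is shown below. It assumes iNLTK is installed, that the Hindi models have already been set up, and that tokenize(text, language_code) is importable from inltk.inltk; the wrapper name tokenize_text is only illustrative.

```python
def tokenize_text(text, language_code):
    """Split `text` into tokens with iNLTK (illustrative helper)."""
    # Lazy import: requires iNLTK and a completed setup for `language_code`.
    from inltk.inltk import tokenize

    return tokenize(text, language_code)


# The Hindi sentence used in this article.
hindi_sentence = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
```

With the Hindi models in place, print(tokenize_text(hindi_sentence, 'hi')) would display the list of tokens.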
Hence, we have tokenized a sentence using iNLTK.
In NLP, text embeddings are vectorized representations of text. Text must be converted to embeddings because machine learning and deep learning models cannot consume raw text directly. iNLTK provides get_embedding_vectors(text, language_code) for this, which takes the input text and its language code as arguments.
We generate text embeddings for the same sentence, 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।' (the Hindi translation of 'GeeksForGeeks is a great technology learning platform.').
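The call can be sketched as follows, assuming iNLTK is installed and the Hindi models are already set up. The wrapper name embed_text is illustrative; only get_embedding_vectors comes from iNLTK.

```python
def embed_text(text, language_code):
    """Return iNLTK's embedding vectors for `text` (illustrative helper)."""
    # Lazy import: requires iNLTK and a completed setup for `language_code`.
    from inltk.inltk import get_embedding_vectors

    # The result is a list of vectors, as in the output shown in this article.
    return get_embedding_vectors(text, language_code)


hindi_sentence = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
```

Here the call would be embed_text(hindi_sentence, 'hi'); the returned list appears to hold one vector per token.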
[array([-0.737411, 0.203377, 0.005537, -0.468718, ..., 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),
array([-0.012183, -0.036214, -0.412297, -0.546257, ..., 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),
array([ 0.021317, -0.130494, -0.248163, -0.203298, ..., 0.064852, 0.230874, -0.315259, 0.368123], dtype=float32),
array([-0.737411, 0.203377, 0.005537, -0.468718, ..., 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),
array([-0.012183, -0.036214, -0.412297, -0.546257, ..., 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),
array([ 0.526271, -0.111786, 0.024964, -0.413432, ..., -0.269101, 0.14501 , 0.139528, 0.036384], dtype=float32),
array([ 0.231323, -0.129719, -0.120698, -0.229107, ..., -0.207799, -0.144117, 1.09991 , 0.544219], dtype=float32),
array([ 0.408419, 0.320988, -0.380744, -0.563505, ..., -0.254394, -0.200471, 0.201553, -0.074097], dtype=float32),
array([-0.307099, -0.186613, 0.040754, -0.271758, ..., 0.477781, 0.759681, 0.485825, 0.222599], dtype=float32),
array([-0.0195 , -0.056414, 0.155854, -0.955072, ..., 0.127837, -0.161846, 0.381132, -0.233802], dtype=float32),
array([-0.063136, -0.16291 , -0.412124, -0.580033, ..., -0.468475, 0.246613, 0.661614, 0.354779], dtype=float32),
array([-0.182706, -0.237699, 0.478908, -0.567147, ..., 0.694749, 0.526647, 0.650397, 0.172727], dtype=float32),
array([-0.183833, -0.005238, -0.187345, -0.113823, ..., 0.062584, -1.36463 , 0.665604, -1.425032], dtype=float32),
array([ 0.792413, 0.01189 , -0.71231 , -0.313467, ..., 0.190676, 0.938687, 0.464781, 0.195361], dtype=float32)]
Thus, we have generated embeddings for Hindi text using iNLTK.
Here we supply a few initial words and predict the words that follow them. iNLTK provides the function predict_next_words(text, n, language_code), which takes the input text, the number of words to predict (n), and the language code as arguments.
We predict the next words for the phrase 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी' (the Hindi translation of 'GeeksForGeeks is a great technology').
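A minimal sketch of the prediction call is given below, assuming iNLTK is installed and the Hindi models are set up. The wrapper name predict_words is illustrative, and the predicted words depend on the downloaded model, so no output is shown.

```python
def predict_words(text, n, language_code):
    """Ask iNLTK to continue `text` by `n` words (illustrative helper)."""
    # Lazy import: requires iNLTK and a completed setup for `language_code`.
    from inltk.inltk import predict_next_words

    return predict_next_words(text, n, language_code)


# The Hindi phrase used in this article.
hindi_phrase = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी'
```

For this example the call would be predict_words(hindi_phrase, 4, 'hi').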
Here, we have predicted the next 4 words for a given phrase in Hindi.
A common NLP task is to generate sentences similar to a given input sentence. iNLTK's get_similar_sentences(text, n, language_code) does exactly that: it takes the input text, the number of sentences to generate (n), and the language code as arguments.
We generate similar sentences for the sentence 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।' (the Hindi translation of 'GeeksForGeeks is a great technology learning platform.').
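This step can be sketched as below, assuming iNLTK is installed and the Hindi models are set up. The wrapper name similar_sentences is illustrative, and the article does not state how many sentences were requested, so the count of 5 in the usage note is an assumption.

```python
def similar_sentences(text, n, language_code):
    """Return `n` iNLTK-generated variations of `text` (illustrative helper)."""
    # Lazy import: requires iNLTK and a completed setup for `language_code`.
    from inltk.inltk import get_similar_sentences

    return get_similar_sentences(text, n, language_code)


hindi_sentence = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
```

A call such as similar_sentences(hindi_sentence, 5, 'hi') would request five variations of the sentence.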
We can also check the similarity between two sentences using iNLTK. This is done with the get_sentence_similarity(text1, text2, language_code) function, which takes the two texts to be compared and the language code as arguments.
We check the similarity between the sentences 'Geeks For Geeks হল একটি দুর্দান্ত প্রযুক্তি শেখার প্ল্যাটফর্ম।' and 'Geeks For Geeks হল একটি দুর্দান্ত কম্পিউটার বিজ্ঞান শেখার প্ল্যাটফর্ম।' (the Bengali translations of 'GeeksForGeeks is a great technology learning platform.' and 'Geeks For Geeks is an awesome computer science learning platform.', respectively).
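A minimal sketch of the comparison follows, assuming iNLTK is installed and the Bengali models have been set up. The wrapper name compare_sentences and the variable names are illustrative; the score itself depends on the downloaded model, so no value is shown.

```python
def compare_sentences(text1, text2, language_code):
    """Return iNLTK's similarity score for two texts (illustrative helper)."""
    # Lazy import: requires iNLTK and a completed setup for `language_code`.
    from inltk.inltk import get_sentence_similarity

    return get_sentence_similarity(text1, text2, language_code)


# The two Bengali sentences used in this article.
bengali_sentence_1 = 'Geeks For Geeks হল একটি দুর্দান্ত প্রযুক্তি শেখার প্ল্যাটফর্ম।'
bengali_sentence_2 = 'Geeks For Geeks হল একটি দুর্দান্ত কম্পিউটার বিজ্ঞান শেখার প্ল্যাটফর্ম।'
```

The call for this example would be compare_sentences(bengali_sentence_1, bengali_sentence_2, 'bn').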
We can see that the similarity score of the two sentences is quite high, as expected.
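To interpret such a score, it can help to see how a similarity between two vectors is typically computed. The sketch below implements plain cosine similarity in pure Python. That iNLTK's score is cosine-based is an assumption here, not something this article states, and the function name and toy vectors are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError('vectors must have the same length')
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence encodings (not real iNLTK output).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Under this measure, vectors pointing in the same direction score close to 1 and unrelated (orthogonal) vectors score 0, which is why two sentences that differ by only a couple of words would be expected to score high.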