Part 1: NLP Pipeline in NodeJS with natural module

Hi Guys,

This is the first part of my two-part post on Natural Language Processing in NodeJS.

In the first part, we're going to touch upon NLP in NodeJS with a node module called 'natural'.
To learn more about the module, please visit this page. Alongside it we'll also need a couple more npm modules: stopword & contractions.

Now we're going to implement a simple NLP pipeline with Natural to get you started.

NLP is basically a branch of Machine Learning in which a machine is taught how to process natural language provided as input and produce a response to it. This relates to the Turing test, where, given a response, it should be hard to tell whether a human or a machine produced it.

So much of the research going on in the NLP field is aimed at passing the Turing test. There are many intricacies involved, such as finding the context of the input or resolving ambiguity.

In this & the following tutorial we're going to touch upon some of the basics of NLP. In every NLP project there are basic tasks you'll need to perform first. These tasks are called pre-processing. NLP isn't one big function but a cohesion of small tasks connected in such a way that they form a pipeline, in which the output of one task is passed on to the next to get a more refined result.
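The idea that each task's output feeds the next task can be sketched as plain function composition (the stage names below are just illustrative toys, not the real pipeline yet):

```javascript
// A pipeline is just function composition: the output of each stage
// becomes the input of the next.
function pipe(stages) {
  return function (input) {
    return stages.reduce(function (value, stage) {
      return stage(value);
    }, input);
  };
}

// Two toy stages:
function lowercase(text) { return text.toLowerCase(); }
function words(text) { return text.split(/\s+/); }

var toyPipeline = pipe([lowercase, words]);
console.log(toyPipeline('Hello NLP World')); // → [ 'hello', 'nlp', 'world' ]
```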

We're going to implement exactly this kind of pipeline to get the most frequently used terms in a given corpus.
Although seemingly easy, this task has a few caveats:

  • There are filler words in English which are used very frequently, like the, a, an etc., which don't carry any meaning in themselves, so we want to avoid those.
  • We want to avoid punctuation symbols.
  • The corpus may contain mixed-case spellings such as Product, proDuct, producT etc.
  • Numbers are meaningless without context, such as $500 or 1 Kg.
So in order to clean our corpus & get meaningful frequent words we'll follow a simple pipeline:
lowercase -> tokenize -> remove stopwords -> calculate frequency
Let's break down the pipeline further.

1. Lowercase:
This is the very first step in our pipeline & is a no-brainer. We first lowercase all the words to give every word an equal footing & remove inconsistencies between identical words. But always keep your target corpus in mind, because for some corpora this can actually lead to confusion, e.g. US to us, AM to am. In these cases we lose the actual meaning of the word itself. So always be aware of what your corpus is. We also expand contractions, i.e. phrases like I'm or isn't, in this step.
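A minimal sketch of this step without any external modules: lowercase the text, then expand a few common contractions via a lookup map. (The actual implementation later in this post uses the 'contractions' npm module; this map is just an illustrative subset.)

```javascript
// A tiny, illustrative contraction table — the real 'contractions'
// module covers far more cases than this.
var CONTRACTIONS = {
  "i'm": 'i am',
  "isn't": 'is not',
  "don't": 'do not'
};

function normalize(text) {
  var lowered = text.toLowerCase();
  Object.keys(CONTRACTIONS).forEach(function (short) {
    // split/join replaces every occurrence of the contraction
    lowered = lowered.split(short).join(CONTRACTIONS[short]);
  });
  return lowered;
}

console.log(normalize("I'm sure it isn't late"));
// → "i am sure it is not late"
```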

2. Tokenize:
This is the second step in our pipeline, wherein we break our corpus into individual entities called tokens. A token is nothing but a word in the corpus. To tokenize the corpus we remove all the punctuation marks, whitespace & numbers from it. This step gives us an array of tokens/words.
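A minimal sketch of tokenization using a single regex: keep only runs of letters, which drops punctuation, whitespace & digits in one go. (The actual implementation later in this post uses natural's WordTokenizer; this is a rough approximation of what it does.)

```javascript
// Assumes the text has already been lowercased in the previous step,
// so matching [a-z]+ is enough.
function tokenize(text) {
  // match() returns null when there are no matches, so fall back to []
  return text.match(/[a-z]+/g) || [];
}

console.log(tokenize('hello, world! order #42 shipped.'));
// → [ 'hello', 'world', 'order', 'shipped' ]
```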

3. Remove stopwords:
As discussed in the caveats, words like the, a, an are frequently used words with no actual meaning to them. These words are called stopwords. We want to avoid such words, hence this step. We'll also add some stopwords of our own to fit the corpus.
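A minimal sketch of stopword removal: filter the tokens against a combined list of generic English stopwords & corpus-specific ones. (The actual implementation later in this post uses the 'stopword' module's full English list; these tiny lists are illustrative only.)

```javascript
// Illustrative stopword lists — the real English list is much longer.
var BASE_STOPWORDS = ['the', 'a', 'an', 'is', 'of'];
var CUSTOM_STOPWORDS = ['product', 'order'];

// A Set gives O(1) membership checks while filtering
var stopSet = new Set(BASE_STOPWORDS.concat(CUSTOM_STOPWORDS));

function removeStopwords(tokens) {
  return tokens.filter(function (token) {
    return !stopSet.has(token);
  });
}

console.log(removeStopwords(['the', 'product', 'arrived', 'late']));
// → [ 'arrived', 'late' ]
```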

4. Calculate frequency:
Now that we've got data which is on an equal footing, tokenized & free of meaningless words, we calculate the frequency of each word to find the most frequently used ones.
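A minimal sketch of the frequency count: tally the tokens in a plain object, then sort the entries by count, highest first. (The actual implementation later in this post gets its counts from natural's TfIdf; a plain tally is enough for raw frequency.)

```javascript
function termFrequencies(tokens) {
  var counts = {};
  tokens.forEach(function (token) {
    counts[token] = (counts[token] || 0) + 1;
  });
  // Sort descending by count so the most frequent terms come first
  return Object.keys(counts)
    .map(function (term) { return { term: term, count: counts[term] }; })
    .sort(function (a, b) { return b.count - a.count; });
}

console.log(termFrequencies(['late', 'late', 'arrived', 'late', 'arrived']));
// → [ { term: 'late', count: 3 }, { term: 'arrived', count: 2 } ]
```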

Now let's implement above pipeline in Node with natural module:

 var natural = require('natural');
 var stopword = require('stopword');
 var contractions = require('contractions');
 var _ = require('lodash');

 var getTermFrequencies = function (text) {
   // Convert all the text into lowercase letters
   // & then expand all the contractions
   text = contractions.expand(text.toLowerCase());

   var tokenizer = new natural.WordTokenizer();

   // Corpus-specific stopwords, merged with the standard English list.
   // Note: concat() returns a new array, so the result must be assigned.
   var stopwords = ['will', 'customer', 'order', 'product', 'delivery', 'really',
     'good', 'deliveries', 'not'
   ];
   stopwords = stopwords.concat(stopword.en);

   // Break text into an array of words, then remove all the stopwords
   text = stopword.removeStopwords(tokenizer.tokenize(text), stopwords);

   // Convert the array back into a single string so TfIdf
   // can count the individual words
   var document = text.join(' ');

   var tfidf = new natural.TfIdf();
   tfidf.addDocument(document);

   // Get the count of individual words
   var result = tfidf.listTerms(0);

   // Sort by term frequency (ascending)...
   result = _.sortBy(result, function (word) {
     return word.tf;
   });

   // ...then reverse so the most frequent terms come first
   return _.reverse(result);
 };

 module.exports.getTermFrequencies = getTermFrequencies;

Thus, our NLP pre-processing pipeline is complete. We can now build more complex NLP on top of this.

Let me know how it works out for you.

Happy coding!
