Text Summarisation and Feature Engineering using TF-IDF

This article explains how textual data is modelled in natural language processing (NLP). Several techniques exist for modelling language in NLP.

In this article, bag-of-words, n-gram modelling and term frequency matrices will be discussed. However, before any meaningful language modelling, textual data needs to be preprocessed.

Text Preprocessing

Given textual data, the following preprocessing steps take place prior to any meaningful analytics:

Noise Removal using Regular Expressions

Regular expressions, often abbreviated as “regex” or “regexp”, are sequences of characters that define a search pattern used for pattern matching and text processing tasks.

import re  # library for regular expressions

text = "The $#quick brown fox #jumps over the lazy dog!!!"
pattern = r'[^a-zA-Z\s]'  # match unwanted characters (non-alphabetic and non-whitespace)

clean_text = re.sub(pattern, '', text)  # replace them with an empty string
print('initial text:', text)
print('\nafter cleaning:', clean_text)

Tokenisation

Tokenisation is the process of dividing text into a sequence of tokens, which roughly correspond to "words". The nltk package is a very rich Python package that can be used for word tokenisation as well as sentence tokenisation.

#pip install nltk
import nltk
#nltk.download('punkt')  # download necessary resources

from nltk.tokenize import word_tokenize

text = "Hello! How are you? I am doing well."

words = word_tokenize(text)
print(words)

from nltk.tokenize import sent_tokenize

text = "Hello! How are you? I am doing well. Let's learn NLP."

sentences = sent_tokenize(text)
print(sentences)

Stemming

Stemming is the process of transforming words into a root term to minimise redundancies. The root term is not necessarily a word. For instance, the words ‘caring’, ‘cares’, ‘cared’, ‘caringly’ and ‘carefully’ represent the same underlying reality in language understanding and therefore can be converted to the same root for the sake of concise representation of information in textual data analysis.

from nltk.stem import SnowballStemmer

words = 'caring cares cared caringly carefully'
# find the stem of each word in words
stemmer = SnowballStemmer('english')
for word in words.split():
    print(stemmer.stem(word))

Lemmatisation

A very similar operation to stemming is lemmatisation. Lemmatisation is the process of grouping together words of similar meaning under a root term that exists in the target vocabulary. Unlike stemming, whose roots are not necessarily existing words, lemmatisation ensures that the root terms are actual words in the language vocabulary.

import nltk 
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("programmed",pos="v"))
print(lemmatizer.lemmatize("programming",pos="v"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a")) 

Stop Words Removal

Stop words are words that carry little pertinent information towards the core meaning of natural language communication. They are usually filtered out of search queries because they match a vast amount of unnecessary information. Typical stop words include pronouns, prepositions, adverbs and auxiliary verbs.

import nltk
#nltk.download('stopwords')  # download the stopwords resource
from nltk.corpus import stopwords
print(stopwords.words('english'))  # list of english stopwords
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = 'the world is ending, i can see it in the air'
tokens = word_tokenize(text)#fetch tokens
eng_stopwords = stopwords.words('english')#get list of stopwords in english
tokens_stops_removed = [word for word in tokens if word not in eng_stopwords]#remove stop words from list
text_clean = " ".join(tokens_stops_removed)

print("text-->",text)
print("tokens-->",tokens,end="\n\n")
print("tokens [stopwords removed] -->",tokens_stops_removed)
print("text [stopwords removed]-->",text_clean) 

Text Preprocessing Function

Given your knowledge of the text preprocessing components, a user-defined function can be designed to preprocess a given text string.

import re  # library for regular expressions
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def text_preprocess(text):
    pattern = r'[^a-zA-Z\s]'  # match unwanted characters (non-alphabetic and non-whitespace)
    text = text.lower()  # convert to lower case
    clean_text = re.sub(pattern, '', text)  # replace them with an empty string
    tokens = word_tokenize(clean_text)  # fetch tokens
    eng_stopwords = stopwords.words('english')  # get list of english stopwords
    eng_stopwords.append('th')  # add user-defined additional stop words
    tokens_stops_removed = [word for word in tokens if word not in eng_stopwords]  # remove stop words
    text_clean = " ".join(tokens_stops_removed)
    return text_clean
     

Feature Engineering

Feature engineering is the process of building numerical features from textual data. Several feature engineering techniques exist, differing in the amount of semantic content they can capture.

In this tutorial, we will focus on the most basic feature engineering technique: the TFIDF method.

Bag-of-words and n-grams

In the bag-of-words model, the unique words in a text document are used as the features for language modelling. An n-gram model instead forms text features from the frequency counts of sequences of n adjacent words. The basic bag-of-words model thus corresponds to a 1-gram (unigram) model. Unlike unigrams, n-grams (n > 1) capture more of the contextual information in language modelling.
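As a toy illustration, n-grams can be generated from a token sequence with a few lines of pure Python (the sentence below is a made-up example, not from the dataset used later):

```python
# Minimal sketch of n-gram extraction from a token list (pure Python,
# no external libraries); the sentence is a made-up example.
def ngrams(tokens, n):
    # slide a window of size n across the token sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 1))  # unigrams (bag-of-words terms)
print(ngrams(tokens, 2))  # bigrams capture adjacent-word context
```

In practice, scikit-learn's CountVectorizer exposes the same idea through its ngram_range parameter.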

Term Frequency

The Term Frequency (TF) is a measure of token counts in a text document. It is a first-degree feature engineering process whereby each term is represented numerically by the number of times it occurs in the textual dataset. Term frequencies of bag-of-words or n-grams in general are used to form the frequency matrices.
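Before turning to scikit-learn, the idea can be sketched with the standard library alone; the sentence below is a made-up example:

```python
# Toy illustration of term frequency using only the standard library;
# the sentence is a made-up example.
from collections import Counter

tokens = "the cat sat on the mat".split()
tf_counts = Counter(tokens)  # raw count of each term
print(tf_counts["the"])      # 2
print(tf_counts["cat"])      # 1
```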

Let’s consider an excerpt from Wikipedia on globalisation, together with a corpus (i.e. a collection of documents about a subject) of related formal documents, also extracted from Wikipedia.

import requests

url_base = "https://raw.githubusercontent.com/mlinsights/freemium/refs/heads/main/datasets/text-analysis/globalisation/"
url = url_base+"globalisation.txt"
response = requests.get(url)#get from the web
text = response.text
print(text) 

The scikit-learn CountVectorizer can be used to generate a term frequency matrix from text.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

#preprocess input text
clean_text = text_preprocess(text)
# create count vectorizer
cvz = CountVectorizer()
# get token counts
result_cvz = cvz.fit_transform([clean_text])
#get feature list
feature_list = cvz.get_feature_names_out()
#get tokens count
tf_array = result_cvz.toarray()[0]

#tf dataframe
tf = pd.DataFrame({'term':feature_list, 'freq':tf_array})
tf.sort_values(by=['freq'],inplace=True,ascending=False)
tf.reset_index(drop=True, inplace=True)

tf.head() 
import matplotlib.pyplot as plt
top_n = 30

plt.figure()
plt.bar(tf.term[0:top_n], tf.freq[0:top_n])
plt.xticks(rotation=90)
plt.ylabel('Frequency')
plt.title('Term Frequency')
plt.show() 

WordCloud

A word cloud is a visual representation of word frequency counts. It is a good aid for gaining a visual appreciation of the information content in textual data. In the code below, the term frequency matrix is presented as a word cloud.

from wordcloud import WordCloud 

word_freq = dict(zip(tf.term, tf.freq))  # map each term to its frequency

wordcloud = WordCloud(max_font_size=50, max_words=top_n,
                      background_color="white").generate_from_frequencies(word_freq)

plt.figure(figsize = (8,8), facecolor = None) 
plt.imshow(wordcloud,interpolation="bilinear") 
plt.axis("off") 
plt.show()  

TFIDF - Term Frequency Inverse Document Frequency

The TFIDF is simply a measure of the number of occurrences of a word within a document, scaled by a scarcity weight of its use within the word's context (i.e. the corpus, a collection of documents).

It aims to assign a high numerical weight to words that occur often in a text but are less common within its context, so as to highlight pertinence or information content.
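To make the weighting concrete, here is a hand-computed sketch of the classic tf * log(N/df) formulation on a made-up three-document corpus. Note that scikit-learn's TfidfVectorizer uses a smoothed and normalised variant, so its exact numbers differ:

```python
# Hand-computed sketch of the classic TF-IDF weighting, tf * log(N / df);
# the three-document corpus is a made-up example.
import math

docs = ["global trade grows", "trade links markets", "local markets adapt"]
tokenised = [d.split() for d in docs]
N = len(tokenised)  # number of documents in the corpus

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)             # term frequency in the document
    df = sum(term in d for d in tokenised)  # number of documents containing the term
    return tf * math.log(N / df) if df else 0.0

# 'global' occurs in only one document, so it outweighs the more common 'trade'
print(tfidf("global", tokenised[0]))  # 1 * log(3/1)
print(tfidf("trade", tokenised[0]))   # 1 * log(3/2)
```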

The scikit-learn TfidfVectorizer object can be used to generate a TFIDF vector for given text data against a corpus. It is worth noting, however, that the TfidfVectorizer generates TFIDF vectors for the entire corpus, comparing each document against the remaining documents. A workaround is thus needed to extract the values for the textual data of interest only: pass its vocabulary list to the vectorizer and fetch only its vector from the vectorizer output.

# import required module
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
import numpy as np

vocabulary = np.unique(word_tokenize(clean_text)).tolist()#fetch bag of words
corpus = []
#corpus
for i in range(5):
    url = url_base+"corpus/corpus_%d.txt"%(i+1)
    response = requests.get(url)#get from the web
    corpus_i = response.text
    corpus.append(corpus_i)
#add document into the corpus
corpus.append(clean_text)

# create object
tfidf = TfidfVectorizer(vocabulary=vocabulary)
# get tf-df values
result_tfidf = tfidf.fit_transform(corpus)
#get feature list
feature_list = tfidf.get_feature_names_out() 
#get the TFIDF of the last document
tfidf_array = result_tfidf[-1].toarray()[0]

# tfidf dataframe
tfidf = pd.DataFrame({'term':feature_list, 'freq':tfidf_array})
tfidf.sort_values(by=['freq'],inplace=True,ascending=False)
tfidf.reset_index(drop=True, inplace=True)

tfidf.head() 
import matplotlib.pyplot as plt
top_n = 30

plt.figure()
plt.bar(tfidf.term[0:top_n], tfidf.freq[0:top_n])
plt.xticks(rotation=90)
plt.ylabel('Frequency')
plt.title('TFIDF')
plt.show() 
from wordcloud import WordCloud 

word_freq = dict(zip(tfidf.term, tfidf.freq))  # map each term to its TFIDF weight

wordcloud = WordCloud(max_font_size=50, max_words=top_n,
                      background_color="white").generate_from_frequencies(word_freq)

plt.figure(figsize = (8,8), facecolor = None) 
plt.imshow(wordcloud,interpolation="bilinear") 
plt.axis("off") 
plt.show()  

Conclusion

In this tutorial, text summarisation and feature engineering in NLP were discussed. Any meaningful analytics with textual data requires denoising, which involves regular expressions, stemming or lemmatisation, and stop word removal. Feature engineering in textual data typically involves finding a numerical representation of textual data that carries semantic information. The Term Frequency and Term Frequency Inverse Document Frequency vectors are the most fundamental numeric representations of textual data, albeit with very limited semantic flexibility. They nevertheless perform well in several text classification problems and other NLP tasks. More advanced and robust techniques, such as word embeddings and contextual sentence embeddings, can be investigated.

Author: Yves Matanga, PhD