{"id":3114,"date":"2025-02-17T09:22:02","date_gmt":"2025-02-17T09:22:02","guid":{"rendered":"https:\/\/mlinsightscentral.com\/?page_id=3114"},"modified":"2025-03-12T06:54:44","modified_gmt":"2025-03-12T06:54:44","slug":"text-summarisation-and-feature-engineering-using-tf-idf","status":"publish","type":"page","link":"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/","title":{"rendered":"Text Summarisation and Feature Engineering using TF-IDF"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-page\" data-elementor-id=\"3114\" class=\"elementor elementor-3114\">\n\t\t\t\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-78e7dc6 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"78e7dc6\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f664503\" data-id=\"f664503\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-98a19a4 elementor-widget elementor-widget-heading\" data-id=\"98a19a4\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.13.3 - 28-05-2023 *\/\n.elementor-heading-title{padding:0;margin:0;line-height:1}.elementor-widget-heading .elementor-heading-title[class*=elementor-size-]>a{color:inherit;font-size:inherit;line-height:inherit}.elementor-widget-heading .elementor-heading-title.elementor-size-small{font-size:15px}.elementor-widget-heading .elementor-heading-title.elementor-size-medium{font-size:19px}.elementor-widget-heading .elementor-heading-title.elementor-size-large{font-size:29px}.elementor-widget-heading .elementor-heading-title.elementor-size-xl{font-size:39px}.elementor-widget-heading .elementor-heading-title.elementor-size-xxl{font-size:59px}<\/style><h2 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Text_Summarisation_and_Feature_Engineering_using_TF-IDF\"><\/span>Text Summarisation and Feature Engineering using TF-IDF<span class=\"ez-toc-section-end\"><\/span><\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-92fd41a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"92fd41a\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ab92b06\" data-id=\"ab92b06\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ec9a8b2 elementor-widget elementor-widget-image\" data-id=\"ec9a8b2\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.13.3 - 28-05-2023 *\/\n.elementor-widget-image{text-align:center}.elementor-widget-image a{display:inline-block}.elementor-widget-image a img[src$=\".svg\"]{width:48px}.elementor-widget-image img{vertical-align:middle;display:inline-block}<\/style>\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"557\" height=\"286\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/wordcloud.png\" class=\"attachment-large size-large wp-image-3159\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/wordcloud.png 557w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/wordcloud-300x154.png 300w\" sizes=\"auto, (max-width: 557px) 100vw, 557px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-ca218d7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ca218d7\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-151b801\" data-id=\"151b801\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-448de76 elementor-widget elementor-widget-text-editor\" data-id=\"448de76\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.13.3 - 28-05-2023 *\/\n.elementor-widget-text-editor.elementor-drop-cap-view-stacked .elementor-drop-cap{background-color:#69727d;color:#fff}.elementor-widget-text-editor.elementor-drop-cap-view-framed .elementor-drop-cap{color:#69727d;border:3px solid;background-color:transparent}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap{margin-top:8px}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap-letter{width:1em;height:1em}.elementor-widget-text-editor .elementor-drop-cap{float:left;text-align:center;line-height:1;font-size:50px}.elementor-widget-text-editor .elementor-drop-cap-letter{display:inline-block}<\/style>\t\t\t\t<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_53 ez-toc-wrap-left counter-hierarchy ez-toc-counter ez-toc-light-blue ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title \" >Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\" role=\"button\"><label for=\"item-69e2fb87a5cf9\" ><span class=\"\"><span style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 
0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input aria-label=\"Toggle\" aria-label=\"item-69e2fb87a5cf9\"  type=\"checkbox\" id=\"item-69e2fb87a5cf9\"><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Text_Summarisation_and_Feature_Engineering_using_TF-IDF\" title=\"Text Summarisation and Feature Engineering using TF-IDF\">Text Summarisation and Feature Engineering using TF-IDF<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Text_Preprocessing\" title=\"Text Preprocessing\">Text Preprocessing<\/a><ul class='ez-toc-list-level-3'><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Noise_Removal_using_Regular_Expressions\" title=\"Noise Removal using Regular Expressions\">Noise Removal using Regular Expressions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Tokenisation\" title=\"Tokenisation\">Tokenisation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Stemming\" title=\"Stemming\">Stemming<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" 
href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Lemmatisation\" title=\"Lemmatisation\">Lemmatisation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Stop_Words_Removal\" title=\"Stop Words Removal\">Stop Words Removal<\/a><ul class='ez-toc-list-level-4'><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Text_Preprocessing_Function\" title=\"Text Preprocessing Function\">Text Preprocessing Function<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Feature_Engineering\" title=\"Feature Engineering\">Feature Engineering<\/a><ul class='ez-toc-list-level-3'><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Bag-of-words_and_n-grams\" title=\"Bag-of-words and n-grams\">Bag-of-words and n-grams<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Term_Frequency\" title=\"Term Frequency\">Term Frequency<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#WordCloud\" title=\"WordCloud\">WordCloud<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" 
href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#TFIDF_%E2%80%93_Term_Frequency_Inverse_Document_Frequency\" title=\"TFIDF &#8211; Term Frequency Inverse Document Frequency\">TFIDF &#8211; Term Frequency Inverse Document Frequency<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#Conclusion\" title=\"Conclusion\">Conclusion<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b0d09ea elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b0d09ea\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2ebeb47\" data-id=\"2ebeb47\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-30ec5d4 elementor-widget elementor-widget-text-editor\" data-id=\"30ec5d4\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>This article explains how textual data is modelled in natural language processing. 
Several techniques exist for modelling language in NLP:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-eebe48b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"eebe48b\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8f73f1c\" data-id=\"8f73f1c\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-06a3b3b elementor-icon-list--layout-traditional elementor-list-item-link-full_width elementor-widget elementor-widget-icon-list\" data-id=\"06a3b3b\" data-element_type=\"widget\" data-widget_type=\"icon-list.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<link rel=\"stylesheet\" href=\"https:\/\/mlinsightscentral.com\/wp-content\/plugins\/elementor\/assets\/css\/widget-icon-list.min.css\">\t\t<ul class=\"elementor-icon-list-items\">\n\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Bag of words and Term Frequency matrices<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">N-gram modelling<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li 
class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Word Embedding Techniques<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Transformer-based Embedding<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t<\/ul>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-7ee164a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7ee164a\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-aeb7ef5\" data-id=\"aeb7ef5\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5af3e3e elementor-widget elementor-widget-text-editor\" data-id=\"5af3e3e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>For the sake of this article, Bag of words, N-gram modelling and Term Frequency matrices will be discussed. 
However, textual data must be preprocessed before any language modelling can take place.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-fc72b90 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fc72b90\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6f49bdf\" data-id=\"6f49bdf\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-345812f elementor-widget elementor-widget-heading\" data-id=\"345812f\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Text_Preprocessing\"><\/span>Text Preprocessing<span class=\"ez-toc-section-end\"><\/span><\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-636639f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"636639f\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1984e29\" data-id=\"1984e29\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-12af5b7 
elementor-widget elementor-widget-text-editor\" data-id=\"12af5b7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Given textual data, the following preprocessing steps take place prior to any meaningful analytics:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-d6f6b69 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d6f6b69\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-49108cd\" data-id=\"49108cd\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4a22a4b elementor-icon-list--layout-traditional elementor-list-item-link-full_width elementor-widget elementor-widget-icon-list\" data-id=\"4a22a4b\" data-element_type=\"widget\" data-widget_type=\"icon-list.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<ul class=\"elementor-icon-list-items\">\n\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Unwanted characters removal using Regex<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span 
class=\"elementor-icon-list-text\">Tokenisation<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Stemming<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Lemmatisation<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Stop Words Removal<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t<\/ul>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-1513a57 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1513a57\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9685acc\" data-id=\"9685acc\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-927fdc3 elementor-widget elementor-widget-heading\" data-id=\"927fdc3\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Noise_Removal_using_Regular_Expressions\"><\/span>Noise Removal using Regular Expressions<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0b6a478 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0b6a478\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-123981d\" data-id=\"123981d\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5f4cff9 elementor-widget elementor-widget-text-editor\" data-id=\"5f4cff9\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Regular expressions, often abbreviated as &#8220;regex&#8221; or &#8220;regexp&#8221;, are sequences of characters that define a search pattern used for pattern matching and text processing tasks.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b1f4f21 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b1f4f21\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element 
elementor-element-30e2af8\" data-id=\"30e2af8\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-554a8df elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"554a8df\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>import re  # library for regular expressions\r\n\r\ntext = &quot;The $#quick brown fox #jumps over the lazy dog!!!&quot;\r\npattern = r'[^a-zA-Z\\s]'  # match unwanted characters (anything that is not a letter or whitespace)\r\n\r\nclean_text = re.sub(pattern, '', text)  # replace them with an empty string\r\nprint('initial text:', text)\r\nprint('\\nafter cleaning:',clean_text)  <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0ad5fc4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0ad5fc4\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b805ee9\" data-id=\"b805ee9\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f9570f1 elementor-widget elementor-widget-image\" data-id=\"f9570f1\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"68\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/regex_cleaning.png\" class=\"attachment-large size-large wp-image-3135\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/regex_cleaning.png 606w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/regex_cleaning-300x34.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-d9bcd00 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d9bcd00\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-892d40e\" data-id=\"892d40e\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d524002 elementor-widget elementor-widget-heading\" data-id=\"d524002\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Tokenisation\"><\/span>Tokenisation<span 
class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-725b722 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"725b722\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-19a0b4a\" data-id=\"19a0b4a\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1ffc9a6 elementor-widget elementor-widget-text-editor\" data-id=\"1ffc9a6\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Tokenisation is the process of dividing text into a sequence of tokens, which roughly correspond to &#8220;words&#8221;. 
The nltk package is a rich Python library that supports both word tokenisation and sentence tokenisation.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0ae3eb5 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0ae3eb5\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-dd184a3\" data-id=\"dd184a3\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-11dc112 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"11dc112\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>#pip install nltk\r\n#import nltk\r\n#nltk.download('punkt') #download necessary resources (recent NLTK releases may require 'punkt_tab' instead)\r\n\r\nfrom nltk.tokenize import word_tokenize\r\n\r\ntext = &quot;Hello! How are you? 
I am doing well.&quot;\r\n\r\nwords = word_tokenize(text)\r\nprint(words) <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-f4625da\" data-id=\"f4625da\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f2b956a elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"f2b956a\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>from nltk.tokenize import sent_tokenize\r\n\r\ntext = &quot;Hello! How are you? I am doing well. 
Let's learn NLP.&quot;\r\n\r\nsentences = sent_tokenize(text)\r\nprint(sentences) <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-29c06dc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"29c06dc\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-daf2556\" data-id=\"daf2556\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1723d43 elementor-widget elementor-widget-image\" data-id=\"1723d43\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"711\" height=\"26\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/word_tokens.png\" class=\"attachment-large size-large wp-image-3144\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/word_tokens.png 711w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/word_tokens-300x11.png 300w\" sizes=\"auto, (max-width: 711px) 100vw, 711px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-37f837a\" data-id=\"37f837a\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5f5c8be elementor-widget elementor-widget-image\" data-id=\"5f5c8be\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"635\" height=\"29\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/sent_tokens.png\" class=\"attachment-large size-large wp-image-3145\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/sent_tokens.png 635w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/sent_tokens-300x14.png 300w\" sizes=\"auto, (max-width: 635px) 100vw, 635px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-4563237 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4563237\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-32ca880\" data-id=\"32ca880\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-97eca64 elementor-widget elementor-widget-heading\" data-id=\"97eca64\" data-element_type=\"widget\" 
data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Stemming\"><\/span>Stemming<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-5644c08 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5644c08\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-24e4c1a\" data-id=\"24e4c1a\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-16ce27c elementor-widget elementor-widget-text-editor\" data-id=\"16ce27c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Stemming is the process of transforming words into a root term to minimise redundancies. The root term is not necessarily a word. 
For instance, the words &#8216;caring&#8217;, &#8216;cares&#8217;, &#8216;cared&#8217;, &#8216;caringly&#8217; and &#8216;carefully&#8217; all derive from the same base word and can therefore be reduced to the same root, giving a more concise representation of the information in textual data analysis.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-c18e2c1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c18e2c1\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-38ec5d7\" data-id=\"38ec5d7\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-cd1c86e elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"cd1c86e\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>from nltk.stem import SnowballStemmer, PorterStemmer, LancasterStemmer  # PorterStemmer and LancasterStemmer are alternative stemmers\r\nwords = 'caring cares cared caringly carefully'\r\n# find the stem of each word in words\r\nstemmer = SnowballStemmer('english')\r\nfor word in words.split():\r\n    print(stemmer.stem(word)) <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = 
document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-73a79b1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"73a79b1\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-87884c3\" data-id=\"87884c3\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4f22160 elementor-widget elementor-widget-image\" data-id=\"4f22160\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"343\" height=\"112\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/care_img.png\" class=\"attachment-large size-large wp-image-3155\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/care_img.png 343w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/care_img-300x98.png 300w\" sizes=\"auto, (max-width: 343px) 100vw, 343px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element 
elementor-element-08e7e0f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"08e7e0f\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-60f19da\" data-id=\"60f19da\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-98368df elementor-widget elementor-widget-heading\" data-id=\"98368df\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Lemmatisation\"><\/span>Lemmatisation<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-7d79eee elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7d79eee\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-dc09ab1\" data-id=\"dc09ab1\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d8fb7b4 elementor-widget elementor-widget-text-editor\" data-id=\"d8fb7b4\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>A very similar operation to stemming is called lemmatisation. 
Lemmatising is the process of mapping the inflected forms of a word to a root term (the lemma) that exists in the target vocabulary. Unlike stemming, whose roots are not necessarily existing words, lemmatisation ensures that the root terms are existing words in the language vocabulary.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-3b77747 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3b77747\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ac18723\" data-id=\"ac18723\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b0a4619 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"b0a4619\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>import nltk \r\nnltk.download('wordnet')\r\nfrom nltk.stem import WordNetLemmatizer\r\n\r\nlemmatizer = WordNetLemmatizer()\r\nprint(lemmatizer.lemmatize(&quot;cats&quot;))\r\nprint(lemmatizer.lemmatize(&quot;programmed&quot;,pos=&quot;v&quot;))\r\nprint(lemmatizer.lemmatize(&quot;programming&quot;,pos=&quot;v&quot;))\r\nprint(lemmatizer.lemmatize(&quot;better&quot;, pos=&quot;a&quot;))\r\nprint(lemmatizer.lemmatize(&quot;best&quot;, pos=&quot;a&quot;)) <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = 
document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-8ccb9b1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8ccb9b1\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d33ed29\" data-id=\"d33ed29\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-cacff82 elementor-widget elementor-widget-image\" data-id=\"cacff82\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"443\" height=\"107\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/lemma_.png\" class=\"attachment-large size-large wp-image-3166\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/lemma_.png 443w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/lemma_-300x72.png 300w\" sizes=\"auto, (max-width: 443px) 100vw, 443px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element 
elementor-element-ae96bb1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ae96bb1\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-0a7ac2c\" data-id=\"0a7ac2c\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8bca9fe elementor-widget elementor-widget-heading\" data-id=\"8bca9fe\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Stop_Words_Removal\"><\/span>Stop Words Removal<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-4a36d91 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4a36d91\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-823cc63\" data-id=\"823cc63\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9fd7d2a elementor-widget elementor-widget-text-editor\" data-id=\"9fd7d2a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Stop words are words which do not contain pertinent information in carrying the 
core meaning of natural language communication. These words are usually filtered out of search queries because they match a vast number of documents without adding useful information. Typical stop words are articles, pronouns, prepositions, conjunctions, auxiliary verbs and some common adverbs.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-6fa44bc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6fa44bc\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6c6a4b7\" data-id=\"6c6a4b7\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4a81f3f elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"4a81f3f\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>from nltk.corpus import stopwords\r\nprint(stopwords.words('english'))  # list of English stop words <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section 
elementor-top-section elementor-element elementor-element-3e65780 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3e65780\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-466104a\" data-id=\"466104a\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0eac2aa elementor-widget elementor-widget-image\" data-id=\"0eac2aa\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"217\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopwords-1024x217.png\" class=\"attachment-large size-large wp-image-3171\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopwords-1024x217.png 1024w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopwords-300x64.png 300w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopwords-768x163.png 768w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopwords.png 1226w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-02596a2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"02596a2\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 
elementor-top-column elementor-element elementor-element-3c9eccd\" data-id=\"3c9eccd\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c1da238 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"c1da238\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>from nltk.corpus import stopwords\r\nfrom nltk.tokenize import word_tokenize\r\n\r\ntext = 'the world is ending, i can see it in the air'\r\ntokens = word_tokenize(text)#fetch tokens\r\neng_stopwords = stopwords.words('english')#get list of stopwords in english\r\ntokens_stops_removed = [word for word in tokens if word not in eng_stopwords]#remove stop words from list\r\ntext_clean = &quot; &quot;.join(tokens_stops_removed)\r\n\r\nprint(&quot;text--&gt;&quot;,text)\r\nprint(&quot;tokens--&gt;&quot;,tokens,end=&quot;\\n\\n&quot;)\r\nprint(&quot;tokens [stopwords removed] --&gt;&quot;,tokens_stops_removed)\r\nprint(&quot;text [stopwords removed]--&gt;&quot;,text_clean) <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-9d9ce92 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9d9ce92\" 
data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5cb075c\" data-id=\"5cb075c\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6954de7 elementor-widget elementor-widget-image\" data-id=\"6954de7\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"906\" height=\"107\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopword_removal.png\" class=\"attachment-large size-large wp-image-3178\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopword_removal.png 906w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopword_removal-300x35.png 300w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/stopword_removal-768x91.png 768w\" sizes=\"auto, (max-width: 906px) 100vw, 906px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-1b46413 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1b46413\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9e1ddec\" data-id=\"9e1ddec\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ea57718 
elementor-widget elementor-widget-heading\" data-id=\"ea57718\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Text_Preprocessing_Function\"><\/span>Text Preprocessing Function<span class=\"ez-toc-section-end\"><\/span><\/h4>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-1d297fa elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1d297fa\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-846a98f\" data-id=\"846a98f\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-60cecad elementor-widget elementor-widget-text-editor\" data-id=\"60cecad\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>With these text preprocessing components in place, a helper function can be written to preprocess a given text string.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-878474c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"878474c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 
elementor-top-column elementor-element elementor-element-f80aa74\" data-id=\"f80aa74\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5d38178 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"5d38178\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>import re  # library for regular expressions\nfrom nltk.corpus import stopwords\nfrom nltk.tokenize import word_tokenize\n\n\ndef text_preprocess(text):\n    pattern = r'[^a-zA-Z\\s]'  # match unwanted characters (anything that is not a letter or whitespace)\n    text = text.lower()  # convert to lower case\n    clean_text = re.sub(pattern, '', text)  # replace unwanted characters with an empty string\n    tokens = word_tokenize(clean_text)  # fetch tokens\n    eng_stopwords = set(stopwords.words('english'))  # set of English stop words (fast membership tests)\n    eng_stopwords.add('th')  # add user-defined stop words\n    tokens_stops_removed = [word for word in tokens if word not in eng_stopwords]  # remove stop words\n    text_clean = &quot; &quot;.join(tokens_stops_removed)\n    return text_clean\n     <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-19563c4 
elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"19563c4\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-0b45512\" data-id=\"0b45512\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-101901a elementor-widget elementor-widget-heading\" data-id=\"101901a\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Feature_Engineering\"><\/span>Feature Engineering<span class=\"ez-toc-section-end\"><\/span><\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-6dfa168 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6dfa168\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9fecfaf\" data-id=\"9fecfaf\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7f34226 elementor-widget elementor-widget-text-editor\" data-id=\"7f34226\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Feature Engineering is the process of building numerical features from textual data. 
Several feature engineering techniques exist, and they differ in the amount of semantic content they can capture.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-202f08c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"202f08c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3cd09b9\" data-id=\"3cd09b9\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-952e106 elementor-icon-list--layout-traditional elementor-list-item-link-full_width elementor-widget elementor-widget-icon-list\" data-id=\"952e106\" data-element_type=\"widget\" data-widget_type=\"icon-list.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<ul class=\"elementor-icon-list-items\">\n\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Bag-of-words modelling and Term Frequency-Inverse Document Frequency (TF-IDF)<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Static Word Embeddings<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item\">\n\t\t\t\t\t\t\t\t\t\t\t<span 
class=\"elementor-icon-list-icon\">\n\t\t\t\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-check\"><\/i>\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text\">Transformer-based Contextual Embeddings<\/span>\n\t\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t\t\t<\/ul>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-041fc14 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"041fc14\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-be507cd\" data-id=\"be507cd\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-856f441 elementor-widget elementor-widget-text-editor\" data-id=\"856f441\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>In this tutorial, we will focus on the most basic feature engineering technique: the TF-IDF method.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-be9d2a8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"be9d2a8\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-540a926\" data-id=\"540a926\" data-element_type=\"column\">\n\t\t\t<div 
class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-692500e elementor-widget elementor-widget-heading\" data-id=\"692500e\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Bag-of-words_and_n-grams\"><\/span>Bag-of-words and n-grams<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-2a2ddcb elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2a2ddcb\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-608da03\" data-id=\"608da03\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-96bbf9c elementor-widget elementor-widget-text-editor\" data-id=\"96bbf9c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>A bag of words, the set of unique words in a text document, is the most basic set of features used in language modelling. An n-gram model forms text features from the frequency counts of sequences of n adjacent words. The bag of words used in basic feature engineering thus represents a 1-gram (unigram) model. 
Unlike the bag of words (unigrams), n-grams with n&gt;1 can capture more contextual information in language modelling.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-67c50e1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"67c50e1\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-988960f\" data-id=\"988960f\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5df94bd elementor-widget elementor-widget-image\" data-id=\"5df94bd\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"230\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.50.11-1024x230.png\" class=\"attachment-large size-large wp-image-3342\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.50.11-1024x230.png 1024w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.50.11-300x67.png 300w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.50.11-768x172.png 768w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.50.11.png 1311w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-bfb1c68 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"bfb1c68\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3329dfb\" data-id=\"3329dfb\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1433303 elementor-widget elementor-widget-heading\" data-id=\"1433303\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Term_Frequency\"><\/span>Term Frequency<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-9379a8f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9379a8f\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6a18483\" data-id=\"6a18483\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b519c20 elementor-widget elementor-widget-text-editor\" data-id=\"b519c20\" 
data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The Term Frequency (TF) is a measure of token counts in a text document. It is a first-degree feature engineering process whereby each term is converted to a number by taking <strong>the number of times it occurs in the textual dataset<\/strong>. Term frequencies of bag-of-words features, or of n-grams in general, are used to form frequency matrices.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-3b330af elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3b330af\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-54a83fa\" data-id=\"54a83fa\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-56fcbd1 elementor-widget elementor-widget-image\" data-id=\"56fcbd1\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"237\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.52.30-1024x237.png\" class=\"attachment-large size-large wp-image-3349\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.52.30-1024x237.png 1024w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.52.30-300x69.png 300w, 
https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.52.30-768x178.png 768w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.52.30.png 1184w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-4b3b44d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4b3b44d\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-0672a11\" data-id=\"0672a11\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6ef9a25 elementor-widget elementor-widget-text-editor\" data-id=\"6ef9a25\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Let&#8217;s consider a text excerpt from Wikipedia on <a href=\"https:\/\/raw.githubusercontent.com\/mlinsights\/freemium\/refs\/heads\/main\/datasets\/text-analysis\/globalisation\/globalisation.txt\">Globalisation<\/a> and a <a href=\"https:\/\/github.com\/mlinsights\/freemium\/tree\/main\/datasets\/text-analysis\/globalisation\/corpus\">corpus<\/a> (i.e. 
collection of documents about a subject) related to formal documents also extracted from Wikipedia.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0494c81 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0494c81\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2c5642c\" data-id=\"2c5642c\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8bed34d elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"8bed34d\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>import requests\n\nurl_base = &quot;https:\/\/raw.githubusercontent.com\/mlinsights\/freemium\/refs\/heads\/main\/datasets\/text-analysis\/globalisation\/&quot;\nurl = url_base+&quot;globalisation.txt&quot;\nresponse = requests.get(url)#get from the web\ntext = response.text\nprint(text) <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section 
class=\"elementor-section elementor-top-section elementor-element elementor-element-a90bccc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a90bccc\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2365ac0\" data-id=\"2365ac0\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-65e5b4c elementor-widget elementor-widget-image\" data-id=\"65e5b4c\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"307\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.10.12-1024x307.png\" class=\"attachment-large size-large wp-image-3224\" alt=\"globalisation\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.10.12-1024x307.png 1024w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.10.12-300x90.png 300w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.10.12-768x230.png 768w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.10.12-1536x460.png 1536w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.10.12.png 1961w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-570b242 
elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"570b242\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c47634b\" data-id=\"c47634b\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2330fb9 elementor-widget elementor-widget-text-editor\" data-id=\"2330fb9\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The scikit-learn CountVectorizer can be used to generate a term frequency matrix from text.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-c52d9e8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c52d9e8\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-93246ce\" data-id=\"93246ce\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3b2c74e elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"3b2c74e\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>from sklearn.feature_extraction.text import CountVectorizer\nimport pandas as pd\n\n#preprocess input text\nclean_text = 
text_preprocess(text)\n# create count vectorizer\ncvz = CountVectorizer()\n# get token counts\nresult_cvz = cvz.fit_transform([clean_text])\n#get feature list\nfeature_list = cvz.get_feature_names_out()\n#get tokens count\ntf_array = result_cvz.toarray()[0]\n\n#tf dataframe\ntf = pd.DataFrame({'term':feature_list, 'freq':tf_array})\ntf.sort_values(by=['freq'],inplace=True,ascending=False)\ntf.reset_index(drop=True, inplace=True)\n\ntf.head() <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-5bbc541 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5bbc541\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7cb4687\" data-id=\"7cb4687\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f71db94 elementor-widget elementor-widget-image\" data-id=\"f71db94\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"281\" height=\"326\" 
src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.18.07.png\" class=\"attachment-large size-large wp-image-3240\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.18.07.png 281w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.18.07-259x300.png 259w\" sizes=\"auto, (max-width: 281px) 100vw, 281px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0ca342f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0ca342f\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fe989a2\" data-id=\"fe989a2\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-eff98d2 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"eff98d2\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>import matplotlib.pyplot as plt\ntop_n = 30\n\nplt.figure()\nplt.bar(tf.term[0:top_n], tf.freq[0:top_n])\nplt.xticks(rotation=90)\nplt.ylabel('Frequency')\nplt.title('Term Frequency')\nplt.show() <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = 
document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-3405a83 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3405a83\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-be82512\" data-id=\"be82512\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-bea038b elementor-widget elementor-widget-image\" data-id=\"bea038b\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"563\" height=\"527\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tf_images.png\" class=\"attachment-large size-large wp-image-3245\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tf_images.png 563w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tf_images-300x281.png 300w\" sizes=\"auto, (max-width: 563px) 100vw, 563px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element 
elementor-element-6cee39a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6cee39a\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2c822d9\" data-id=\"2c822d9\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6837490 elementor-widget elementor-widget-heading\" data-id=\"6837490\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"WordCloud\"><\/span>WordCloud<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-55acda3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"55acda3\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-77370f3\" data-id=\"77370f3\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e4bb6e8 elementor-widget elementor-widget-text-editor\" data-id=\"e4bb6e8\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The WordCloud is a visual representation of word frequency counts. 
It is a useful aid for visually appreciating the information content of textual data. In the code below, the term frequencies are presented as a word cloud.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-c406b1a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c406b1a\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e58321f\" data-id=\"e58321f\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-43e15e5 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"43e15e5\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>from wordcloud import WordCloud\n\n# map each term to its frequency\nword_freq = dict(zip(tf.term, tf.freq))\n\n# keep only the top_n most frequent terms in the cloud\nwordcloud = WordCloud(max_font_size=50, max_words=top_n,\n                      background_color=&quot;white&quot;).generate_from_frequencies(word_freq)\n\nplt.figure(figsize=(8, 8), facecolor=None)\nplt.imshow(wordcloud, interpolation=&quot;bilinear&quot;)\nplt.axis(&quot;off&quot;)\nplt.show() <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = 
document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-2ec0b76 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2ec0b76\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6ca3cc7\" data-id=\"6ca3cc7\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-cff31ec elementor-widget elementor-widget-image\" data-id=\"cff31ec\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"329\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tf_wordcloud.png\" class=\"attachment-large size-large wp-image-3264\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tf_wordcloud.png 640w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tf_wordcloud-300x154.png 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element 
elementor-element-e7e6f35 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e7e6f35\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b396d55\" data-id=\"b396d55\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fe1a13b elementor-widget elementor-widget-heading\" data-id=\"fe1a13b\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">TFIDF - Term Frequency Inverse Document Frequency<\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-c86d05c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c86d05c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9dd2b0f\" data-id=\"9dd2b0f\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c83ad58 elementor-widget elementor-widget-text-editor\" data-id=\"c83ad58\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>TFIDF is simply a measure of the number of occurrences of a word within a document, scaled by a scarcity weight of its use within the 
wider context (i.e. the corpus, a collection of documents).<\/p><p>It aims to assign a high weight to <strong>words that often occur in a text but are less common within its context<\/strong>, highlighting their pertinence or information content.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-656a613 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"656a613\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-32f42b1\" data-id=\"32f42b1\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ce03719 elementor-widget elementor-widget-image\" data-id=\"ce03719\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"278\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.44.13-1024x278.png\" class=\"attachment-large size-large wp-image-3326\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.44.13-1024x278.png 1024w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.44.13-300x81.png 300w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.44.13-768x209.png 768w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-12-at-08.44.13.png 1219w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0aeba18 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0aeba18\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1b9ae07\" data-id=\"1b9ae07\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2f8f3fc elementor-widget elementor-widget-text-editor\" data-id=\"2f8f3fc\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The scikit-learn TfidfVectorizer object can be used to generate a TFIDF vector for a given text against a corpus. It is worth noting, however, that the TfidfVectorizer generates a TFIDF vector for every document in the corpus, by comparing each document against the remaining documents in the corpus. 
A workaround is therefore needed to extract only the vector of the text of interest: pass its vocabulary list to the vectorizer and fetch only its row from the vectorizer output.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-e24eea7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e24eea7\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3f53c6f\" data-id=\"3f53c6f\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-428eedb elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"428eedb\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'># import required modules\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom nltk.tokenize import word_tokenize\nimport numpy as np\n\n# bag of words (unique tokens) of the text of interest\nvocabulary = np.unique(word_tokenize(clean_text)).tolist()\ncorpus = []\n# download the five corpus documents\nfor i in range(5):\n    url = url_base+&quot;corpus\/corpus_%d.txt&quot;%(i+1)\n    response = requests.get(url)#get from the web\n    corpus_i = response.text\n    corpus.append(corpus_i)\n# add the text of interest as the last document of the corpus\ncorpus.append(clean_text)\n\n# create the vectorizer, restricted to the vocabulary of the text of interest\ntfidf = TfidfVectorizer(vocabulary=vocabulary)\n# get TF-IDF values for every document in the corpus\nresult_tfidf = tfidf.fit_transform(corpus)\n#get feature list\nfeature_list = tfidf.get_feature_names_out() <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar 
my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-fc33216 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fc33216\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fc7eb1e\" data-id=\"fc7eb1e\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e2bd9d7 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"e2bd9d7\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>import pandas as pd\n\n# get the TF-IDF vector of the last document (the text of interest)\ntfidf_array = result_tfidf[-1].toarray()[0]\n\n# TF-IDF dataframe: the 'freq' column holds the TF-IDF weight of each term\n# note: this rebinds the name tfidf from the vectorizer to the dataframe\ntfidf = pd.DataFrame({'term':feature_list, 'freq':tfidf_array})\ntfidf.sort_values(by=['freq'],inplace=True,ascending=False)\ntfidf.reset_index(drop=True, inplace=True)\n\ntfidf.head() <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = 
document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-788a43e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"788a43e\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6561393\" data-id=\"6561393\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9c028cb elementor-widget elementor-widget-image\" data-id=\"9c028cb\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"411\" height=\"321\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.53.09.png\" class=\"attachment-large size-large wp-image-3280\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.53.09.png 411w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-10-at-13.53.09-300x234.png 300w\" sizes=\"auto, (max-width: 411px) 100vw, 411px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section 
class=\"elementor-section elementor-top-section elementor-element elementor-element-e048d37 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e048d37\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-77254fe\" data-id=\"77254fe\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d639d92 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"d639d92\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>import matplotlib.pyplot as plt\ntop_n = 30\n\nplt.figure()\nplt.bar(tfidf.term[0:top_n], tfidf.freq[0:top_n])\nplt.xticks(rotation=90)\nplt.ylabel('Frequency')\nplt.title('TFIDF')\nplt.show() <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-e5dab5c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e5dab5c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container 
elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-cee2b5a\" data-id=\"cee2b5a\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e8624e7 elementor-widget elementor-widget-image\" data-id=\"e8624e7\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"567\" height=\"537\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tfidf_bar.png\" class=\"attachment-large size-large wp-image-3284\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tfidf_bar.png 567w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/tfidf_bar-300x284.png 300w\" sizes=\"auto, (max-width: 567px) 100vw, 567px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-3c51e09 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3c51e09\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-79b7205\" data-id=\"79b7205\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-cff5cd0 elementor-widget elementor-widget-elementor-syntax-highlighter\" data-id=\"cff5cd0\" data-element_type=\"widget\" data-widget_type=\"elementor-syntax-highlighter.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t<pre><code class='language-python'>from wordcloud import WordCloud \n\nword_freq = {} \nnum_terms = len(tfidf) \nfor i in range(num_terms): \n    freq = tfidf.iloc[i,1] #read data from the TFIDF matrix\n    term = tfidf.iloc[i,0] \n    word_freq[term] = freq \n\nwordcloud  = WordCloud(max_font_size=50,  \n                          max_words=top_n, background_color=&quot;white&quot;).generate_from_frequencies(word_freq) \n\nplt.figure(figsize = (8,8), facecolor = None) \nplt.imshow(wordcloud,interpolation=&quot;bilinear&quot;) \nplt.axis(&quot;off&quot;) \nplt.show()  <\/code><\/pre><script>\nif (!document.getElementById('syntaxed-prism')) {\n\tvar my_awesome_script = document.createElement('script');\n\tmy_awesome_script.setAttribute('src','https:\/\/mlinsightscentral.com\/wp-content\/plugins\/syntax-highlighter-for-elementor\/assets\/prism2.js');\n\tmy_awesome_script.setAttribute('id','syntaxed-prism');\n\tdocument.body.appendChild(my_awesome_script);\n} else {\n\twindow.Prism && Prism.highlightAll();\n}\n<\/script>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-2fa462c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2fa462c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-18ab238\" data-id=\"18ab238\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a82688f elementor-widget elementor-widget-image\" data-id=\"a82688f\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"329\" src=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/wordcloud_tfidf.png\" class=\"attachment-large size-large wp-image-3288\" alt=\"\" srcset=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/wordcloud_tfidf.png 640w, https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/03\/wordcloud_tfidf-300x154.png 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-ca404b7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ca404b7\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9acd6f0\" data-id=\"9acd6f0\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-39ce245 elementor-widget elementor-widget-heading\" data-id=\"39ce245\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h3>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0cd5725 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" 
data-id=\"0cd5725\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-59d8a83\" data-id=\"59d8a83\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7925fc8 elementor-widget elementor-widget-text-editor\" data-id=\"7925fc8\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>In this tutorial, text summarisation and feature engineering in NLP were discussed. Any meaningful analysis of textual data requires denoising, which typically involves regex-based cleaning, stemming or lemmatisation, and stop word removal. Feature engineering for textual data typically involves finding a numerical representation of the text that retains semantic information. The Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) vectors are the most fundamental numeric representations of textual data, although they offer very limited semantic flexibility. They nevertheless perform well in several text classification problems and other NLP tasks. 
More advanced and robust techniques, such as word embeddings and contextual sentence embeddings, can also be investigated.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-95d077c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"95d077c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1e30c59\" data-id=\"1e30c59\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t\t<div class=\"elementor-element elementor-element-260eb6a elementor-widget elementor-widget-heading\" data-id=\"260eb6a\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"elementor-heading-title elementor-size-default\"><b>Author: Yves Matanga, PhD<\/b><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-3120a3b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3120a3b\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5bac867\" data-id=\"5bac867\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap\">\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Text 
Summarisation and Feature Engineering using TF-IDF This article explains how textual data is modelled in natural language processing. Several modelling techniques exist to model language in NLP: Bag of words and Term Frequency matrices N-gram modelling Word Embedding Techniques Transformer-based Embedding For the sake of this article, Bag of words, N-gram modelling and Term &hellip;<\/p>\n<p class=\"read-more\"> <a class=\"\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/\"> <span class=\"screen-reader-text\">Text Summarisation and Feature Engineering using TF-IDF<\/span> Read More &raquo;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_uag_custom_page_level_css":"","site-sidebar-layout":"no-sidebar","site-content-layout":"page-builder","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"disabled","ast-breadcrumbs-content":"","ast-featured-img":"disabled","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","footnotes":""},"wf_page_folders":[8],"class_list":["post-3114","page","type-page","status-publish","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.11 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Text Summarisation and Feature Engineering using TF-IDF - MLInsightsCentral<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/\" \/>\n<meta 
property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Text Summarisation and Feature Engineering using TF-IDF - MLInsightsCentral\" \/>\n<meta property=\"og:description\" content=\"Text Summarisation and Feature Engineering using TF-IDF This article explains how textual data is modelled in natural language processing. Several modelling techniques exist to model language in NLP: Bag of words and Term Frequency matrices N-gram modelling Word Embedding Techniques Transformer-based Embedding For the sake of this article, Bag of words, N-gram modelling and Term &hellip; Text Summarisation and Feature Engineering using TF-IDF Read More &raquo;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/\" \/>\n<meta property=\"og:site_name\" content=\"MLInsightsCentral\" \/>\n<meta property=\"article:modified_time\" content=\"2025-03-12T06:54:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/wordcloud.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/\",\"url\":\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/\",\"name\":\"Text Summarisation and Feature Engineering using TF-IDF - MLInsightsCentral\",\"isPartOf\":{\"@id\":\"https:\/\/mlinsightscentral.com\/#website\"},\"datePublished\":\"2025-02-17T09:22:02+00:00\",\"dateModified\":\"2025-03-12T06:54:44+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/mlinsightscentral.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Text Summarisation and Feature Engineering using TF-IDF\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/mlinsightscentral.com\/#website\",\"url\":\"https:\/\/mlinsightscentral.com\/\",\"name\":\"MLInsightsCentral\",\"description\":\"Learn Machine Learning and AI for engineers, data scientists and AI practionners.\",\"publisher\":{\"@id\":\"https:\/\/mlinsightscentral.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/mlinsightscentral.com\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/mlinsightscentral.com\/#organization\",\"name\":\"MLInsightsCentral\",\"url\":\"https:\/\/mlinsightscentral.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/mlinsightscentral.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2023\/06\/cropped-1290538dccf74accb0ae585ff4e8586c-1.png\",\"contentUrl\":\"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2023\/06\/cropped-1290538dccf74accb0ae585ff4e8586c-1.png\",\"width\":200,\"height\":110,\"caption\":\"MLInsightsCentral\"},\"image\":{\"@id\":\"https:\/\/mlinsightscentral.com\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Text Summarisation and Feature Engineering using TF-IDF - MLInsightsCentral","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/","og_locale":"en_US","og_type":"article","og_title":"Text Summarisation and Feature Engineering using TF-IDF - MLInsightsCentral","og_description":"Text Summarisation and Feature Engineering using TF-IDF This article explains how textual data is modelled in natural language processing. 
Several modelling techniques exist to model language in NLP: Bag of words and Term Frequency matrices N-gram modelling Word Embedding Techniques Transformer-based Embedding For the sake of this article, Bag of words, N-gram modelling and Term &hellip; Text Summarisation and Feature Engineering using TF-IDF Read More &raquo;","og_url":"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/","og_site_name":"MLInsightsCentral","article_modified_time":"2025-03-12T06:54:44+00:00","og_image":[{"url":"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2025\/02\/wordcloud.png"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/","url":"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/","name":"Text Summarisation and Feature Engineering using TF-IDF - MLInsightsCentral","isPartOf":{"@id":"https:\/\/mlinsightscentral.com\/#website"},"datePublished":"2025-02-17T09:22:02+00:00","dateModified":"2025-03-12T06:54:44+00:00","breadcrumb":{"@id":"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/mlinsightscentral.com\/index.php\/text-summarisation-and-feature-engineering-using-tf-idf\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/mlinsightscentral.com\/"},{"@type":"ListItem","position":2,"name":"Text Summarisation and Feature Engineering using 
TF-IDF"}]},{"@type":"WebSite","@id":"https:\/\/mlinsightscentral.com\/#website","url":"https:\/\/mlinsightscentral.com\/","name":"MLInsightsCentral","description":"Learn Machine Learning and AI for engineers, data scientists and AI practionners.","publisher":{"@id":"https:\/\/mlinsightscentral.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/mlinsightscentral.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/mlinsightscentral.com\/#organization","name":"MLInsightsCentral","url":"https:\/\/mlinsightscentral.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/mlinsightscentral.com\/#\/schema\/logo\/image\/","url":"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2023\/06\/cropped-1290538dccf74accb0ae585ff4e8586c-1.png","contentUrl":"https:\/\/mlinsightscentral.com\/wp-content\/uploads\/2023\/06\/cropped-1290538dccf74accb0ae585ff4e8586c-1.png","width":200,"height":110,"caption":"MLInsightsCentral"},"image":{"@id":"https:\/\/mlinsightscentral.com\/#\/schema\/logo\/image\/"}}]}},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false},"uagb_author_info":{"display_name":"admin","author_link":"https:\/\/mlinsightscentral.com\/index.php\/author\/yvesm\/"},"uagb_comment_info":0,"uagb_excerpt":"Text Summarisation and Feature Engineering using TF-IDF This article explains how textual data is modelled in natural language processing. 
Several modelling techniques exist to model language in NLP: Bag of words and Term Frequency matrices N-gram modelling Word Embedding Techniques Transformer-based Embedding For the sake of this article, Bag of words, N-gram modelling and Term&hellip;","_links":{"self":[{"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/pages\/3114","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/comments?post=3114"}],"version-history":[{"count":199,"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/pages\/3114\/revisions"}],"predecessor-version":[{"id":3352,"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/pages\/3114\/revisions\/3352"}],"wp:attachment":[{"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/media?parent=3114"}],"wp:term":[{"taxonomy":"wf_page_folders","embeddable":true,"href":"https:\/\/mlinsightscentral.com\/index.php\/wp-json\/wp\/v2\/wf_page_folders?post=3114"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}