How to Automatically Categorise Keywords?

If you landed on this blog post, then you probably want to know the answers to the following questions:

  1. How to automatically categorise keywords using Python and save time?
  2. How the suggested solution is different to other keyword categorisation options?

There are indeed already solutions available for this task. For example, keyword tagging using Google Sheets or keywords categorisation for the buyers journey using Excel formula. In both cases, in order to categorise keywords, you will need to define a list of tags (or trigger words) and assign a category to each of them. You will then need to find these tags in your keywords. Once a tag is found, a keyword will be assigned to the same category as the tag that was found.

Let’s say we have the following keywords to analyse:

Keywords
why pokemon sword and shield is good
why pokemon sword and shield is bad

We’ve also compiled a list of tags we want to find in our keywords together with the categories assigned to them:

Tags Categories
why Questions
good Adjectives
bad Adjectives

If you use the Excel solution to categorise these keywords, you will have the following result:

Keywords Categories
why pokemon sword and shield is good Questions
why pokemon sword and shield is bad Questions

The issue that occurs here is if the inspected keyword contains more than one of your tags, this keyword will be categorised based on a tag found first in this keyword. Any other tags will be ignored. In short, this solution categorises keywords in the order in which the searched tags appear in them. Both keywords will be put into Questions category simply because why tag appears first in this keyword. The good and the bad tags will be ignored. It might be useful though to categorise our keywords based on all tags found in them, especially when we start to analyse our keyword pool together with such metrics as search volumes and/or impressions for these keywords.

There will be no clash when keywords are categorised based on tags from different categories (e.g.: Questions and Adjectives tags) when using the Google Sheet solution. However, we are facing a similar issue with keywords containing more than one tag of the same type but not sharing the same category name. For example, the keyword to categorise is is pokemon sword and shield good or bad, and the category names for the good and the bad tags are not Adjectives - not the same:

Tags Categories
good Good
bad Bad

In this case we will have the following result:

Keywords Categories
is pokemon sword and shield good or bad Good

The bad tag will be ignored as the good one appears first.

To overcome these issues I suggest adding one more step to this process, and extract all tags from the keywords before starting the actual categorisation process. This should insure that all keywords are labeled based on all suggested tags:

Keywords Tags Categories
why pokemon sword and shield is good why Questions
why pokemon sword and shield is good good Adjectives
why pokemon sword and shield is bad why Questions
why pokemon sword and shield is bad bad Adjectives

And also:

Keywords Categories
is pokemon sword and shield good or bad Good
is pokemon sword and shield good or bad Bad

The Code

The code below takes two CSV files as input. The first one (keywords.csv) contains keywords to parse, and the second file (tags.csv) includes tags you want to extract from the first one.

First of all, load Python’s pandas library (“Python Data Analysis Library”) that will be needed to analyse our data.

import pandas as pd

Load your input files into two separate data frames and do the following to clean up your data: 1. remove white spaces before and after the loaded string; 2. make everything to lower case; 3. remove all (‘) and any other special characters and multiple spaces between the words preserving only letters and numbers.

While the above steps might look unnecessary for our small example data pool, they will allow you to apply this script to larger data set, for example, to Search Console query data.

inputData = pd.read_csv("keywords.csv", encoding="latin-1")
inputData.Keywords = inputData.Keywords.str.strip(' ')
inputData.Keywords = inputData.Keywords.str.lower()
inputData.Keywords = inputData.Keywords.str.replace("'", '').str.replace('[^a-zA-Z0-9_]', ' ').str.replace(' +',' ')
inputData
Keywords
0 will pokemon sword and shield be good
1 why pokemon sword and shield is good
2 why pokemon sword and shield is bad
3 why pokemon sword and shield
4 what pokemon are in sword and shield

I do the same for the file with the tags to make sure neither of them include unnecessary characters.

tags_cats = pd.read_csv("tags_cats.csv", encoding="latin-1", header=None, names=["Tags", "Categories"])
tags_cats.Tags = tags_cats.Tags.str.strip(' ')
tags_cats.Tags = tags_cats.Tags.str.lower()
tags_cats.Tags = tags_cats.Tags.str.replace("'", '').str.replace('[^a-zA-Z0-9_]', ' ').str.replace(' +',' ')
tags_cats
Tags Categories
0 good Adjectives
1 bad Adjectives
2 why Questions

Next, I’m using vectorized string operations to compute the data instead of using nested for loops. 1. This allows us to execute batch operations on entire arrays, and not looping over all elements value-by-value from each of the files. The main goal for using vectorization is to make the process fast enough for larger data sets than our simple Pokemon example.

To avoid matching words that can be part of other words and get the exact results I’m using a regular expression that matches on word boundaries: \b (see Regex Boundaries and Delimiters for details).

The Tags values are passed inside braces, and with word boundary method by adding \bs on both sides of the searched term to make sure that only exact match words are extracted from our keywords.

For example, the regex \bcity\b would therefore match city in a city center, but it wouldn’t match it in simplicity or ethnicity. Therefore, if only .contains(searchTerm) is used in the for loop above instead of f’\b{Tags}\b’, sort will be returned for resort in the output file, but this is not what I want to achieve.

To allow a regex search in Python I also need to use two backslashes \ \b as a word boundary because \b in a Python string is shorthand for a backspace character and needs to be escaped for the word boundary to be interpreted correctly.

for Tags in tags_cats.Tags.values:
    inputData[Tags] = inputData['Keywords'].str.contains(f'\\b{Tags}\\b', regex=True, case=False)

As a result, columns are added for each searched tag to our data frame, and each of these tags is evaluated to either True or False depending on either they were found in a keyword:

inputData
Keywords good bad why
0 will pokemon sword and shield be good True False False
1 why pokemon sword and shield is good True False True
2 why pokemon sword and shield is bad False True True
3 why pokemon sword and shield False False True
4 what pokemon are in sword and shield False False False

Next, I’m creating an empty data frame result, and adding a for loop that appends the found tag to the Tags column next to the keywords in the Keywords column every time the row value was evaluated to True.

result = pd.DataFrame([], columns = ['Keywords', 'Tags'])
for tag in tags_cats.Tags.values:
    resultForCurrentSearchTerm = inputData.loc[inputData[tag] == True, ['Keywords']]
    resultForCurrentSearchTerm['Tags'] = tag
    result = result.append(resultForCurrentSearchTerm)
result.sort_values(by=['Tags'], inplace=True)
result
Keywords Tags
2 why pokemon sword and shield is bad bad
0 will pokemon sword and shield be good good
1 why pokemon sword and shield is good good
1 why pokemon sword and shield is good why
2 why pokemon sword and shield is bad why
3 why pokemon sword and shield why

The final part of the script does the actual categorisation of the keywords based on the extracted tags from the above data frame. We are combining two datasets into one by using the merge() function on Tags column.

result_cats = result.merge(tags_cats, on="Tags")
result_cats
Keywords Tags Categories
0 why pokemon sword and shield is bad bad Adjectives
1 will pokemon sword and shield be good good Adjectives
2 why pokemon sword and shield is good good Adjectives
3 why pokemon sword and shield is good why Questions
4 why pokemon sword and shield is bad why Questions
5 why pokemon sword and shield why Questions

Export the result_cat data frame to a CSV file:

result_cats.to_csv('final_result.csv', index = False)

The above code also works for real-life data with 200,000 keywords and 3,000 tags. The input and the Jupyter files can be found in the keyword-categorisation-with-python repository.

Footnotes

1. We can come to the same result using a nested for loop. See below. While this works fine for a small data set, it will be very inefficient when processing large amount of data.

import csv
import re
resultWithNestedForLoops = []
for keyword in inputData.Keywords:
    for tag in tags_cats.Tags:
        if re.search(r"\b%s\b" % tag, keyword):
            resultWithNestedForLoops.append((keyword, tag))
resultWithNestedForLoops
[('will pokemon sword and shield be good', 'good'),
 ('why pokemon sword and shield is good', 'good'),
 ('why pokemon sword and shield is good', 'why'),
 ('why pokemon sword and shield is bad', 'bad'),
 ('why pokemon sword and shield is bad', 'why'),
 ('why pokemon sword and shield', 'why')]
with open("resultWithNestedForLoops.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["Keywords", "Tags"])
    writer.writerows(resultWithNestedForLoops)