The following post is an attempt to find a way of analysing competitors’ backlinks using Python as the main analysis tool. It was inspired by Will Nye’s post How to undertake large-scale competitive link analysis.

In his post he gives us an Excel solution for this task, mentioning at the end that “it may be best to use SQL, R or Python.” While the suggested solution works perfectly if you follow his instructions, I wanted to look for a Python solution, mainly to overcome the scalability limitations of Excel.

Having said that, the example explained below is based on a tiny data set, to keep the explanation easy to follow. Let’s go through the solution step by step.

The set-up.

Create a folder, and place in it a file listing all referring domains to your site. All referring domains should be in column A, and the name of your domain needs to be added next to each of them in column B (no headers please). Let’s call this file “my-domain.csv”.


You will also need to create a folder within the above folder (e.g. “comps”) where you put each competitor’s backlinks in a separate CSV file, for example comp-1.csv, comp-2.csv, comp-3.csv and so on. Again, no headers please. In the same way as with your own file, the name of each competitor should be added next to each linking domain.
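If you want to follow along without exporting real data, here is a small sketch that generates files in the expected shape. The domain names are made up purely for illustration, and the files are written to the current working directory rather than the paths used later in this post:

```python
import csv
import os

os.makedirs('comps', exist_ok=True)

# Your own referring domains: linking domain in column A, your domain in column B
with open('my-domain.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for domain in ['blog-a.com', 'news-b.com', 'dir-c.org']:
        writer.writerow([domain, 'mysite.com'])

# One file per competitor: linking domain in column A, competitor name in column B
with open('comps/comp-1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for domain in ['blog-a.com', 'forum-d.net']:
        writer.writerow([domain, 'comp-1.com'])
```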


You now need to download the script from my GitHub repository here, and adjust the folder paths found in the script to those you created as described above.

Goal 1. Check whether or not the domains linking to your competitors already link to your site.

The way to find this out is to read each of the competitors’ files in your “comps” folder, compare it against your own referring domains, and filter out the duplicate linking domains for each competitor’s file. We obviously cannot just merge all competitors’ files with the file that contains your referring domains, as there might be duplicates shared across the competitors’ files that do not appear among your own referring domains. This is not what we need.

Therefore, we need to take the file with your linking domains, and compare it with each of the competitors’ files separately. Once shared domains are found in both files, we flag them, save them in a separate data frame, and move on to the next file.

Let’s have a look at it step by step.

First we need to import packages and modules that will allow us to navigate through the folders, and manipulate our files:

import pandas as pd
import numpy as np
import os
import csv

We then read the content of the file with your linking domains and load it into a “myBacklinks” dataframe:

myBacklinks = pd.read_csv("/home/karina/Documents/pyseo/scale-backlinks-audit/my-domain.csv", header=None)

The function “dupes_check” does all the hard work of comparing your file with each of your competitors’ files, and searches for the domains shared between both data sets:

def dupes_check(comp_file):
    comp = pd.read_csv(comp_file, header=None)
    # Stack your backlinks on top of the competitor's, then flag rows whose
    # linking domain (column 0) has already been seen; the flagged rows keep
    # the competitor's name in column 1
    combined = pd.concat([myBacklinks, comp])
    combined['dupes'] = combined.duplicated(subset=0)
    combined_dupes = combined[combined.dupes == True]
    return combined_dupes

The above function is called in a for loop that, as the name suggests, loops over all files in the “comps” folder. The result is first appended to a specially created empty dataframe called “data”, and after some clean-up (unnecessary rows deleted and column names added) saved to a “final_data” data frame:

data = pd.DataFrame([])
directory = '/home/karina/Documents/pyseo/scale-backlinks-audit/comps/'
for file in os.listdir(directory):
    data = pd.concat([data, dupes_check(directory + file)])
final_data = data.drop(columns='dupes')
final_data.rename(columns={0: 'shared_backlinks', 1: 'competitors'}, inplace=True)

We do further clean-up by grouping the referring domains we share with our competitors, and counting them. Each value in the “count” column is the number of competitors we share that backlink with:

final_output = final_data.groupby(['shared_backlinks']).size().to_frame('count').reset_index().sort_values('count', ascending=False)
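To make the grouping concrete, here is a toy example (domain and competitor names invented) showing the shape of the result:

```python
import pandas as pd

# Two competitors share a.com with us; only one shares b.com
toy = pd.DataFrame({'shared_backlinks': ['a.com', 'a.com', 'b.com'],
                    'competitors': ['comp-1', 'comp-2', 'comp-1']})
counts = (toy.groupby('shared_backlinks').size()
             .to_frame('count').reset_index()
             .sort_values('count', ascending=False))
# counts has one row per domain, most-shared domains first
```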

But I wanted to go a step further, and end up with a table that not only tells me the number of competitors I share a specific referring domain with, but also the names of those competitors. Here is the solution I found:

final_data = final_data.sort_values(['shared_backlinks'], ascending=True)
final_data['Count'] = final_data.groupby('shared_backlinks').cumcount()
out = final_data.pivot(index='shared_backlinks', columns='Count', values='competitors').reset_index()
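The cumcount-and-pivot trick is less obvious than the rest, so here is a toy illustration (invented names again): cumcount numbers each competitor within a domain, and the pivot then spreads the competitors for one domain across columns 0, 1, 2, …

```python
import pandas as pd

toy = pd.DataFrame({'shared_backlinks': ['a.com', 'a.com', 'b.com'],
                    'competitors': ['comp-1', 'comp-2', 'comp-1']})
# Number each competitor within its domain: 0, 1, 0
toy['Count'] = toy.groupby('shared_backlinks').cumcount()
# One row per domain, competitors spread across numbered columns
wide = toy.pivot(index='shared_backlinks', columns='Count',
                 values='competitors').reset_index()
# Domains with fewer competitors get NaN in the unused columns
```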

I then merged the above result with the “final_output” data frame, so that the total number of competitors is there as well. There may be no need for this with sample data sets like ours, but it can be useful when working with big data files:

out_merged = pd.merge(final_output, out, on='shared_backlinks')

The final result is ready now to be saved in your folder:

out_merged.to_csv('/home/karina/Documents/pyseo/scale-backlinks-audit/shared_domains.csv', index = False)

Goal 2. Check how many competitors each referring domain links to.

We basically want to know how many, and which, competitors each linking domain links to. To do so, I first combined all files from your “comps” folder as shown below:

directory = '/home/karina/Documents/pyseo/scale-backlinks-audit/comps/'
# Skip all.csv so a re-run does not concatenate the previous output into itself
filenames = [f for f in os.listdir(directory) if f != 'all.csv']
with open(directory + 'all.csv', 'w') as outfile:
    for fname in filenames:
        with open(directory + fname) as infile:
            outfile.write(infile.read())
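An alternative that skips the intermediate all.csv file entirely is to read each competitor file and concatenate the data frames in pandas. The snippet below creates its own throwaway sample files (made-up names) so it runs standalone; point `directory` at your real comps folder instead:

```python
import os
import pandas as pd

# Made-up sample files purely so this sketch is runnable as-is
directory = 'comps-demo/'
os.makedirs(directory, exist_ok=True)
pd.DataFrame([['a.com', 'comp-1'], ['b.com', 'comp-1']]).to_csv(
    directory + 'comp-1.csv', header=False, index=False)
pd.DataFrame([['a.com', 'comp-2']]).to_csv(
    directory + 'comp-2.csv', header=False, index=False)

# Read every competitor CSV and stack them into one backlink pool
frames = [pd.read_csv(directory + f, header=None)
          for f in sorted(os.listdir(directory)) if f.endswith('.csv')]
backlink_pool = pd.concat(frames, ignore_index=True)
```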

I then added column names:

backlink_pool = pd.read_csv("/home/karina/Documents/pyseo/scale-backlinks-audit/comps/all.csv", header=None)
backlink_pool.rename(columns={0: 'linking_domains', 1: 'competitors'}, inplace=True)

Then I want to find all duplicate entries within this data set:

backlink_pool['dupes'] = backlink_pool.duplicated(subset='linking_domains', keep=False)
backlink_pool_dupes = backlink_pool[backlink_pool.dupes == True]
final_data2 = backlink_pool_dupes.sort_values('linking_domains', ascending=False).drop(columns='dupes').reset_index(drop=True)
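Note that `keep=False` is what makes this work: it marks every occurrence of a repeated domain as a duplicate, not just the second and later ones, so we keep one row per competitor. A tiny illustration:

```python
import pandas as pd

s = pd.Series(['a.com', 'b.com', 'a.com'])
flags_first = s.duplicated()            # default keep='first': only the repeat is flagged
flags_all = s.duplicated(keep=False)    # every copy of a repeated value is flagged
```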

For the final result we pivot the above data frame, which lists all referring domains that link to at least two of your competitors:

final_data2['Count'] = final_data2.groupby('linking_domains').cumcount()
out2 = final_data2.pivot(index='linking_domains', columns='Count', values='competitors').reset_index()

Next I wanted to know which of the above linking domains link to my competitors but not to me. For that I first needed to find the domains that also link to my site:

myBacklinks.rename(columns={0: 'linking_domains', 1: 'my_domain'}, inplace=True)
out_merged2 = pd.merge(myBacklinks, out2, on='linking_domains')

“out_merged2” now holds the domains we share with our competitors. If we append these rows to “out2” and then get rid of all duplicate rows, only the domains that link exclusively to the competitors remain:

out_merged2_step2 = pd.concat([out2, out_merged2.drop(columns='my_domain')])
finalResult = out_merged2_step2.drop_duplicates(keep=False)
domainsToReview = finalResult.reset_index(drop=True)
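The append-and-drop-duplicates trick works, but the same question ("in the competitors' table and not in mine") can also be asked directly as a left anti-join, using the `indicator` parameter of `pd.merge`. A sketch with toy stand-ins for the data frames used in this post:

```python
import pandas as pd

# Toy stand-ins: out2_demo lists domains linking to 2+ competitors,
# mine_demo lists domains linking to my site
out2_demo = pd.DataFrame({'linking_domains': ['a.com', 'b.com']})
mine_demo = pd.DataFrame({'linking_domains': ['a.com'],
                          'my_domain': ['mysite.com']})

merged = pd.merge(out2_demo, mine_demo, on='linking_domains',
                  how='left', indicator=True)
# Rows tagged 'left_only' exist only in the competitors' table,
# i.e. those domains do not link to me
to_review = (merged[merged['_merge'] == 'left_only']
             .drop(columns=['my_domain', '_merge'])
             .reset_index(drop=True))
```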

Finally, we are saving our result data frame to a csv file:

domainsToReview.to_csv('/home/karina/Documents/pyseo/scale-backlinks-audit/domains-to-review.csv', index = False)

After you run the above script, you will have two new CSV files created and saved in your folder.

These should give you answers to the goals we set out in this post. First, whether or not the domains linking to your competitors already link to your site.

And second, how many competitors each referring domain links to. In our test example we only found one.

Now it’s time to test the above script on a real data set, provided of course that you know who your real search competitors are. But that, I guess, is a topic for a separate post.