Python for Custom SEO Tooling
Author: Alan Richardson
I’ve started experimenting with Python when creating custom SEO tooling.
Why Python?
Despite programming since childhood, I’ve only just started learning Python.
I tried to use it years ago, but it was hard to install on Windows machines, there were few libraries available to help, and the IDE support was poor.
That situation has changed. I primarily use a Mac, where Python is installed by default, and it is now easy to install on Windows. IDLE, the default Python IDE, has improved, and there are now far better options available, like Visual Studio Code, that make working with Python much easier.
Python is currently listed on the PopularitY of Programming Languages site as the #1 most searched-for programming language.
Why Python for SEO?
Python is often used for SEO, Data Analysis and Security Testing: domains which are surprisingly technical, and often populated by people without a formal Software Development background who need to create very custom, ad hoc applications.
This has resulted in a lot of libraries being released, available for free, to support these domains.
I’ve found that almost every fundamental task I’ve wanted to do to support my SEO activities has already been written and is available through a simple dependency.
- Pandas is a library for data manipulation which makes working with CSV or raw data simple.
- Advertools has many SEO support functions making it easy to download and parse sitemaps and crawl sites.
- Beautiful Soup makes working with HTML simple.
- WhoIs retrieves the whois information for a domain.
All I then have to do is write some simple code to create the functionality I need.
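For example, a minimal sketch of how little code these libraries need, assuming the python-whois package provides the WhoIs support, and a hypothetical keywords.csv file:

import pandas as pd
import whois

# load keyword data from a CSV file into a dataframe (keywords.csv is hypothetical)
keywords = pd.read_csv("keywords.csv")
print(keywords.head())

# look up the whois record for a domain
details = whois.whois("example.com")
print(details.expiration_date)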
Further reading on Python for SEO:
- Python for SEO
- SEO Pythonistas
- 8 Python Libraries for SEO
- 19 Python SEO projects
- Python SEO Resources twitter thread by Marco Giordano
A Simple Python Example
I’m going to explain a fairly simple Python script that uses some of the above libraries to load a sitemap.xml, then, for each URL listed, download the page and scan it for any iframe elements that do not have lazy loading enabled.
The script has several dependencies, e.g. advertools and Beautiful Soup. In Python, dependencies are added using a pip command; I’m using Python version 3 on a Mac, so the command is pip3.
pip3 install advertools
The above command would install advertools for Python so I could use the library in my script.
Dependencies are often listed in a file called requirements.txt, which is a list of the library names. This file can be used to install all the dependencies:
pip3 install -r requirements.txt
The requirements.txt file for this script would be:
advertools
beautifulsoup4
CacheControl
lockfile
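A requirements.txt file can also be generated from an already-configured environment with pip freeze, although that approach pins exact versions rather than listing only the names:

pip3 freeze > requirements.txt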
Python scripts are text files with a .py extension, e.g. iframechecker.py.
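The script is then run from the command line; with Python 3 on a Mac that would be:

python3 iframechecker.py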
At the head of the script are all the imports to pull in the libraries, classes and functions used:
import advertools
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache
from bs4 import BeautifulSoup
First I want to download the sitemap (change the siteMapUrl variable to match the URL of the sitemap being processed):
siteMapUrl = "https://mysite.com/my-sitemap-posts-url.xml"
sitemapDataFrame = advertools.sitemap_to_df(siteMapUrl, recursive=True)
Advertools made it easy to download the sitemap by using the sitemap_to_df function. This creates a Pandas dataframe. Pandas was not listed as a dependency in my project because it was installed as a dependency of Advertools.
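To see what was downloaded, the dataframe can be inspected; this is a quick check rather than part of the script, and the exact columns depend on what the sitemap provides:

print(sitemapDataFrame.columns)
print(sitemapDataFrame['loc'].head())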
I only want to download the HTML files from the sitemap once, so I’m going to create a cache for the HTML requests. To empty the cache I can delete the .web_cache folder that is created.
forever_cache = FileCache('.web_cache', forever=True)
session = CacheControl(requests.Session(), forever_cache)
I then want to loop over every URL in the sitemap and download the HTML page. I get all the loc values from the sitemap dataframe, then loop over them all and use the cached request session to get the URL and return the HTML (the text attribute) of the page.
urls = sitemapDataFrame['loc'].tolist()
for url in urls:
    print(url)
    html = session.get(url).text
I then want to use Beautiful Soup to parse the HTML, to make it easy to find all the iframe elements.
    parsedHtml = BeautifulSoup(html, 'html.parser')
    iframes = parsedHtml.find_all('iframe')
All the iframe elements are now in the iframes list.
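As an aside, Beautiful Soup can also filter on attributes directly, so a variant sketch that finds only the iframe elements missing the loading attribute entirely would be:

no_loading_iframes = parsedHtml.find_all('iframe', loading=False)

This does not replace the check below, because it would not catch iframes where a loading attribute exists but is set to something other than "lazy".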
At this point it is worth mentioning that Python uses indentation to control scope, so all code indented below the for line is included in the loop.
I then loop over all the iframe elements and find those that do not have lazy loading set; this is identifiable by looking at the loading attribute on the iframe element.
    for iframe in iframes:
        lazy = iframe.get('loading')
        if lazy is None or lazy != "lazy":
            print("")
            print(iframe)
            print("- has an iframe without lazy loading")
And that’s it.
The full script would be:
import advertools
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache
from bs4 import BeautifulSoup

# download the sitemap into a dataframe
siteMapUrl = "https://mysite.com/my-sitemap-posts-url.xml"
sitemapDataFrame = advertools.sitemap_to_df(siteMapUrl, recursive=True)

# cache downloaded pages on disk in .web_cache
forever_cache = FileCache('.web_cache', forever=True)
session = CacheControl(requests.Session(), forever_cache)

# visit every URL listed in the sitemap
urls = sitemapDataFrame['loc'].tolist()
for url in urls:
    print(url)
    html = session.get(url).text

    # parse the HTML and find all the iframe elements
    parsedHtml = BeautifulSoup(html, 'html.parser')
    iframes = parsedHtml.find_all('iframe')

    # report any iframe that does not have loading="lazy"
    for iframe in iframes:
        lazy = iframe.get('loading')
        if lazy is None or lazy != "lazy":
            print("")
            print(iframe)
            print("- has an iframe without lazy loading")
This could easily be amended to find other elements and process them in some way, for example to find all the img elements without an alt attribute:
    elements = parsedHtml.find_all('img')
    for element in elements:
        attribute = element.get('alt')
        if attribute is None or attribute == "":
            print("")
            print(element)
            print("- has no alt")
This is a fairly simple example, but you can see that we can create ad hoc scripts for a specific purpose, with very little code, by using existing libraries.
Code for this example can be found on GitHub.
Conclusion
It doesn’t take much Python to write a script that can add a lot of value.
Python is frequently used in SEO and there are a lot of helper libraries available.
As one of the most popular programming languages around, learning Python is a transferable skill for other roles and activities.
I have some Python SEO Tooling examples in this repository on GitHub.
I have also released some tools written in Python, using Streamlit to build the GUI: