Python for Custom SEO Tooling
Author: Alan Richardson
I’ve started experimenting with Python when creating custom SEO tooling.
Why Python?
Despite programming since childhood, I’ve only just started learning Python.
I tried to use it years ago, but it was hard to install on Windows machines, there were few libraries available to help, and the IDE support was poor.
That situation has changed. I primarily use a Mac, where Python is installed by default, and it is now easy to install on Windows. IDLE, the default Python IDE, has improved, and there are now far better options available, like Visual Studio Code, that make working with Python much easier.
Python is currently listed on the PopularitY of Programming Languages site as the #1 most searched-for programming language.
Why Python for SEO?
Python is often used for SEO, Data Analysis and Security Testing: domains which are surprisingly technical, and often populated by people without a formal Software Development background who need to create very custom, ad hoc applications.
This has resulted in a lot of libraries being released, available for free, to support these domains.
I’ve found that almost every fundamental task I’ve wanted to do to support my SEO activities has already been written and is available through a simple dependency.
- Pandas is a library for data manipulation which makes working with CSV or raw data simple.
- Advertools has many SEO support functions making it easy to download and parse sitemaps and crawl sites.
- Beautiful Soup makes working with HTML simple.
- WhoIs retrieves the whois information for a domain.
All I then have to do is write some simple code to create the functionality I need.
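For example, a minimal sketch of how little code these libraries need, assuming the python-whois package provides the WhoIs support, and a hypothetical keywords.csv file:

import pandas as pd
import whois

# load keyword data from a CSV file into a dataframe (keywords.csv is hypothetical)
keywords = pd.read_csv("keywords.csv")
print(keywords.head())

# look up the whois record for a domain
details = whois.whois("example.com")
print(details.expiration_date)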
Further reading on Python for SEO:
- Python for SEO
- SEO Pythonistas
- 8 Python Libraries for SEO
- 19 Python SEO projects
- Python SEO Resources twitter thread by Marco Giordano
A Simple Python Example
I’m going to explain a fairly simple Python script that uses some of the above libraries to load a sitemap.xml, then, for each URL listed, download the page and scan it for any iframe elements that do not have lazy loading enabled.
The script has several dependencies, e.g. advertools and Beautiful Soup. In Python, dependencies are added using a pip command; I’m using Python version 3 on a Mac, so the command is pip3.
pip3 install advertools
The above command would install advertools for Python so I could use the library in my script.
Dependencies are often listed in a file called requirements.txt, which is a list of the library names. This file can be used to install all the dependencies:
pip3 install -r requirements.txt
The requirements.txt file for this script would be:
advertools
beautifulsoup4
CacheControl
lockfile
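A requirements.txt file can also be generated from an already-configured environment with pip freeze, although that approach pins exact versions rather than listing only the names:

pip3 freeze > requirements.txt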
Python scripts are text files with a .py extension, e.g. iframechecker.py.
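The script is then run from the command line; with Python 3 on a Mac that would be:

python3 iframechecker.py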
At the head of the script are all the imports to pull in the libraries, classes and functions used:
import advertools
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache
from bs4 import BeautifulSoup
First I want to download the sitemap (change the siteMapUrl variable to match the URL of the sitemap being processed):
siteMapUrl = "https://mysite.com/my-sitemap-posts-url.xml"
sitemapDataFrame = advertools.sitemap_to_df(siteMapUrl, recursive=True)
Advertools made it easy to download the sitemap by using the sitemap_to_df function. This creates a Pandas dataframe. Pandas was not listed as a dependency in my project because it was installed as a dependency of Advertools.
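To see what was downloaded, the dataframe can be inspected; this is a quick check rather than part of the script, and the exact columns depend on what the sitemap provides:

print(sitemapDataFrame.columns)
print(sitemapDataFrame['loc'].head())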
I only want to download the HTML files from the sitemap once, so I’m going to create a cache for the HTML requests. To empty the cache I can delete the .web_cache folder that is created.
forever_cache = FileCache('.web_cache', forever=True)
session = CacheControl(requests.Session(), forever_cache)
I then want to loop over every URL in the sitemap and download the HTML page. I get all the loc values from the sitemap dataframe, then loop over them all and use the cached request session to get the URL and return the HTML (the text attribute) of the page.
urls = sitemapDataFrame['loc'].tolist()
for url in urls:
    print(url)
    html = session.get(url).text
I then want to use Beautiful Soup to parse the HTML, to make it easy to find all the iframe elements.
    parsedHtml = BeautifulSoup(html, 'html.parser')
    iframes = parsedHtml.find_all('iframe')
All the iframe elements are now in the iframes list.
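As an aside, Beautiful Soup can also filter on attributes directly, so a variant sketch that finds only the iframe elements missing the loading attribute entirely would be:

no_loading_iframes = parsedHtml.find_all('iframe', loading=False)

This does not replace the check below, because it would not catch iframes where a loading attribute exists but is set to something other than "lazy".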
At this point it is worth mentioning that Python uses indentation to control scope, so all code indented below the for line is included in the loop.
I then loop over all the iframe elements and find those that do not have lazy loading set; this is identifiable by looking at the loading attribute on the iframe element.
    for iframe in iframes:
        lazy = iframe.get('loading')
        if lazy is None or lazy != "lazy":
            print("")
            print(iframe)
            print("- has an iframe without lazy loading")
And that’s it.
The full script would be:
import advertools
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache
from bs4 import BeautifulSoup

# download the sitemap into a dataframe
siteMapUrl = "https://mysite.com/my-sitemap-posts-url.xml"
sitemapDataFrame = advertools.sitemap_to_df(siteMapUrl, recursive=True)

# cache downloaded pages on disk in .web_cache
forever_cache = FileCache('.web_cache', forever=True)
session = CacheControl(requests.Session(), forever_cache)

# visit every URL listed in the sitemap
urls = sitemapDataFrame['loc'].tolist()
for url in urls:
    print(url)
    html = session.get(url).text

    # parse the HTML and find all the iframe elements
    parsedHtml = BeautifulSoup(html, 'html.parser')
    iframes = parsedHtml.find_all('iframe')

    # report any iframe that does not have loading="lazy"
    for iframe in iframes:
        lazy = iframe.get('loading')
        if lazy is None or lazy != "lazy":
            print("")
            print(iframe)
            print("- has an iframe without lazy loading")
This could easily be amended to find other elements and process them in some way, for example to find all the img elements without an alt attribute:
    elements = parsedHtml.find_all('img')
    for element in elements:
        attribute = element.get('alt')
        if attribute is None or attribute == "":
            print("")
            print(element)
            print("- has no alt")
This is a fairly simple example, but you can see that we can create ad hoc scripts for a specific purpose, with very little code, by using existing libraries.
Code for this example can be found on GitHub.
Conclusion
It doesn’t take much Python to write a script that can add a lot of value.
Python is frequently used in SEO and there are a lot of helper libraries available.
As one of the most popular programming languages around, learning Python is a transferable skill for other roles and activities.
I have some Python SEO Tooling examples in this repository on GitHub.
I have also released some tools written in Python, using Streamlit to build the GUI: