
The 4 Pillars of Mastering Google Website Crawl


Foreword by Matt Diggity:

In a quick moment I’m going to hand things over to Rowan Collins, the featured guest author of this article.

Rowan is Head of Technical SEO at my agency The Search Initiative. He's one of our best technical SEOs.

Other than being an overall well-rounded SEO, Rowan is a beast when it comes to the technical side of things… as you'll soon learn.

Introduction: Rowan Collins

Without question, the most overlooked aspect of SEO is a site's crawlability and indexability: the secret art of sculpting your site for the Googlebot.

If you can do it right, then you’re going to have a responsive site. Every small change can lead to big gains in the SERPs. However, if done wrong, then you’ll be left waiting weeks for an update from the Googlebot.

I'm often asked how to force Googlebot to crawl specific pages. Just as often, people are struggling to get their pages indexed at all.

Well, today’s your lucky day – because that’s all about to change with this article.

I'm going to teach you the four main aspects of mastering site crawl, so you can take actionable measures to improve your site's standing in the SERPs.

Pillar #1: Page Blocking

Google assigns each website a "crawl budget": a rough limit on how much time and how many requests Googlebot will spend crawling it. To make sure Google is crawling the pages that you want, don't waste that budget on broken or unimportant pages.

This is where page blocking comes into play.

When it comes to blocking pages, you've got plenty of options, and it's up to you which ones to use. I'm going to give you the tools, but you'll need to audit your own site to decide where to apply them.

Robots.txt

Search engines crawl and sort through millions of pages so that we can easily find what we're looking for. There are a variety of ways to control what gets crawled on your site, and a simple technique that I like to use is blocking pages with robots.txt.

Originally created after an early crawler accidentally DDoS'd a website, the robots.txt standard has since become unofficially recognised by most well-behaved crawlers.

Whilst there's no ISO standard for robots.txt, Googlebot does have its preferences. You can find out more in Google's robots.txt documentation.

But the short version is that you can simply create a text file called robots.txt, place it in your site's root, and give it directives on how bots should behave. You will need to structure it so that each bot knows which rules apply to it.

Here’s an example:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://diggitymarketing.com/sitemap.xml

This is a short and sweet robots.txt file, and it's one that you'll find on many WordPress websites. Here it is broken down for you:

  • User-Agent – this specifies which robots should adhere to the rules that follow. Whilst good bots will generally follow these directives, bad bots are free to ignore them.
  • Disallow – this tells bots not to crawl your /wp-admin/ folder, which is where a lot of important files are kept for WordPress.
  • Allow – this tells bots that, despite sitting inside the /wp-admin/ folder, they are still allowed to crawl this file. The admin-ajax.php file is super important, because many themes and plugins rely on it, so you should keep it open for bots.
  • Sitemap – one of the most frequently left-out lines is the Sitemap directive. This helps Googlebot find your XML sitemap and improves crawlability and indexability.
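You can also give individual bots their own rule blocks. As a purely illustrative example (the bot and path here are arbitrary), the following lets every crawler roam freely while keeping Ahrefs' crawler out entirely:

User-agent: *
Disallow:

User-agent: AhrefsBot
Disallow: /

Each crawler follows the group that most specifically matches its user-agent and ignores the rest.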

If you're using Shopify then you'll know the hardships of not having control over your robots.txt file. Here's what yours will most likely resemble:

[Image: robots.txt tester]

However, the following techniques can still be applied to Shopify, and should help:

Meta Robots

Still part of the robots directive family, meta robots tags are HTML tags that can be used to specify crawl and index preferences for a page.

By default, all of your site's pages are treated as index, follow. This means they're eligible to appear in search engines and their links can be followed – even if you don't specify a preference.
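For reference, that default is equivalent to the following tag, so there's no need to add it yourself:

<meta name="robots" content="index,follow">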

Adding that tag won't help your page get crawled or indexed any faster, because it's the default. However, if you're looking to stop a page from being indexed, then you will need to specify it:

<meta name="robots" content="noindex,follow">

<meta name="robots" content="noindex,nofollow">

Whilst the above two tags are technically different from a directives perspective, they don't seem to function differently according to Google.

Previously, you would specify noindex to stop the page from being indexed, and you would separately choose whether the page's links should be followed.

Google has since stated that a noindexed page eventually gets treated like a soft 404, with its links treated as nofollow. Therefore, in the long run there's no technical difference between specifying follow and nofollow.

However, if you don't take everything that John Mueller says at face value, you can still use noindex, follow to signal that you'd like the page's links to keep being followed.

This is something that Yoast has taken on board, so you’ll notice in recent versions of the Yoast SEO plugin, the option to noindex pagination has been removed.

This is because if Googlebot treats the noindex tag like a 404, then applying it across your pagination is an awful idea. I would err on the side of caution and only use it on pages you genuinely don't want crawled, indexed, or followed.

[Image: noindex]

X-Robots Tags

A directive that people rarely ever use is the X-Robots-Tag HTTP header. It's powerful, but not many people understand why.

With robots.txt and meta robots directives, it's up to the robot whether it listens or not. This goes for Googlebot too: it can still ping all the pages to find out if they're present.

Using this server header, you can apply robots directives across your entire website at the server level. This means the bots won't have a choice in the matter; the directive is sent with the server response itself.

This can be done either with PHP or with Apache directives, because both are processed server-side – with .htaccess being the preferred method for blocking specific file types, and PHP for specific pages.

PHP Code

Here's an example of the code that you would use for blocking off a page with PHP. It's simple, but it's applied server-side instead of relying on a tag in the HTML for Google's crawlers to read.

header("X-Robots-Tag: noindex", true);
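As with any PHP header() call, this needs to run before the page sends any output, so place it at the very top of the template for the pages you want kept out of the index.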

Apache Directive

Here’s an example of the code that you could use for blocking off .doc and .pdf files from the SERPs without having to specify every PDF in your robots.txt file.

<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

Pillar #2: Understanding Crawl Behaviours

Many of the people who follow The Lab will know that there are lots of ways that robots can crawl your website. Here's the rundown on how it all works:

Crawl Budget

When it comes to crawl budget, this is something that only exists in principle, but not in practice. This means that there’s no way to artificially inflate your crawl budget.

For those unfamiliar, this is roughly how much time Googlebot will spend crawling your site. Megastores with thousands of products will be crawled more extensively than a microsite; however, the microsite will have its core pages crawled more often.

If you are having trouble getting Google to crawl your important pages, there's probably a reason for it. Either they've been blocked off, or Google considers them low value.

Rather than trying to force crawls on pages, you may need to address the root of the problem.

However, if you'd like a rough idea, you can check the average crawl rate of your site in Google Search Console > Crawl Stats.

[Image: crawl stats]

Depth First Crawling

One way that a bot can crawl your website is depth-first. This forces the crawler to go as deep down each branch of the hierarchy as possible before returning back up it.

This is an effective way to crawl if the goal is to reach deeply nested pages in as short a time as possible.

An effective internal link structure helps visitors navigate easily to the content they're looking for, and helps search engines understand what the site is about. It also helps to build page authority, which can result in better rankings for your pages.

However, with depth-first crawling, your core navigational pages can end up being crawled last.

Being aware that Google's crawler can behave in this way will help when monitoring your pages and auditing your internal link structure.

[Image: depth-first crawling]

Breadth First Crawling

This is the opposite of depth-first crawling, in that it preserves site structure. It will start by crawling every Level 1 page before crawling every Level 2 page.

The benefit of this type of crawl is that it will likely discover more unique URLs in a shorter period, because it travels across multiple categories of your website rather than staying inside one.

So, rather than digging deep into one rabbit hole, this method seeks to find every rabbit hole first.

However, whilst this is good for preserving site architecture, it can be slow if your category pages take a long time to respond and load.

[Image: breadth-first crawling]

Efficiency Crawling

There are many different ways of crawling, but the most notable are the two above plus a third: efficiency crawling. This is where the crawler doesn't observe breadth-first or depth-first order, but instead prioritises pages based on response times.

This means that if the crawler has an hour to spend on your site, it will pick the pages with the lowest response times. That way, it can crawl a larger number of pages in the same period of time. This is where the term 'crawl budget' comes from.

Essentially, you're trying to make your website respond as quickly as possible, so that more pages can be crawled in that allocated time frame.

[Image: testing]

Server Speed

Many people don’t recognize that the internet is physically connected. There are millions of devices connected across the globe to share and pass files.

Your website is hosted on a server somewhere, and for Google and your users to open your pages, they need a connection to that server.

The faster your server is, the less time Googlebot has to wait for the important files. If we recall the section above about efficiency crawling, it's clear why server speed matters so much.
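Beyond choosing a fast host, small server-side tweaks also cut down how long Googlebot waits on each file. One example (a minimal sketch, assuming an Apache server with mod_deflate available) is enabling gzip compression so the text files being crawled are smaller:

<IfModule mod_deflate.c>
# Compress common text-based responses before they are sent
AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>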

When it comes to crawling, it pays to get good quality hosting in a location near your target audience. This lowers the latency and the wait time for each file. However, if your audience is international, you may wish to use a CDN.

[Image: page load]

Content Distribution Networks (CDNs)

Since Googlebot crawls from Google's servers, these may be physically very far away from your website's server. This means that Google can perceive your website as slow, despite your users experiencing it as a fast site.

One way to work around this is by setting up a Content Distribution Network.

There are loads to choose from, but the idea is really straightforward: you are paying to have your site's content distributed across a network of servers around the internet.

That's what it does, but many people ask why that would help.

If your content is distributed across the internet, the physical distance between the end user (or crawler) and the files can be reduced.

This ultimately means less latency and faster load times for all of your pages.

[Image: CDN world map. Image Credit: MaxCDN]

Pillar #3: Page Funnelling

Once you understand the crawl behaviours above, the next question should be: how can I get Google to crawl the pages that I want?

Below you're going to find some great tips on tying up loose ends on your website, funnelling authority, and pointing crawlers at the pages that matter.

AHREFS Broken Links

Broken links can be a major problem when it comes to usability and search engine optimisation, and they can affect how your pages perform in the search results.

At the start of every campaign, it's essential to tie up any loose ends. To do this, we look for any broken backlinks that are picked up in Ahrefs.

Not only will this help to funnel authority through to your website; it will also surface any unintended 404s that are still being linked to across the internet so you can clean them up.

If you want to clean this up quickly, you can export a list of the broken URLs and then import them all into your favorite redirect plugin. We personally use Redirection and Simple 301 Redirects for our WordPress sites.

Whilst Redirection includes CSV import/export by default, you will need an additional add-on for Simple 301 Redirects. It's called Bulk Update and it's also free.
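If you'd rather not rely on a plugin, the same clean-up can be done with plain Apache redirect rules in your .htaccess file (the paths below are placeholders for your own URLs):

# Permanently redirect dead URLs to their closest live equivalents
Redirect 301 /old-broken-page/ https://example.com/closest-live-page/
Redirect 301 /discontinued-product/ https://example.com/replacement-product/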

[Image: broken backlinks report]

Screaming Frog Broken Links

Similar to the above, with Screaming Frog we're first looking to export all the 404 errors and then add redirects. This should turn all of your 404 errors into 301 redirects.

The next step in cleaning up your website is to sort out the internal links themselves, which can also boost your rankings.

Whilst a 301 can pass authority and relevancy signals, it's normally faster and more efficient if your server isn't processing lots of redirects. Get into the habit of pointing internal links straight at the destination URL, and remember to optimise those anchors!

[Image: Screaming Frog broken links]

Search Console Crawl Errors

Another place you can find errors to funnel is Google Search Console. This is a handy way to see exactly which crawl errors Googlebot itself has picked up.

Then do as you did above: export them all to CSV and bulk import the redirects. This will fix almost all of your 404 errors within a couple of days, so Googlebot will spend more time crawling your important pages and less time on your broken ones.

[Image: URL errors]

Server Log Analysis

Server log analysis is key to understanding how crawlers really behave on your site. With a log file analyser, even the most complex crawl data can be analysed and visualised to glean actionable insights.

Whilst all of the above tools are useful, they're not the absolute best way to check for inefficiency. By viewing your server logs through the Screaming Frog Log File Analyzer, you can find every error your server has recorded.

Screaming Frog filters out normal users and focuses primarily on search engine bots. This might seem like it would give the same results as the tools above, but it's normally far more detailed.

Not only does it include each of Google's crawlers; you can also pick up other search engines such as Bing and Yandex. Plus, since it's based on every error your server actually logged, you're not relying on Google Search Console to be accurate.
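If you just want a quick look before firing up a dedicated tool, you can pull Googlebot's failed requests straight out of a raw access log. Here's a rough sketch, assuming an Apache-style log in the standard combined format (the log path will vary by host):

# List the URLs where Googlebot hit a 404, most frequent first
grep "Googlebot" /var/log/apache2/access.log | awk '$9 == 404 {print $7}' | sort | uniq -c | sort -rn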

 

[Image: server errors]

Internal Linking

Internal links can make a difference in the performance of your website.

One way you can improve the crawl rate of a specific page is internal linking. It's a simple one, but here's how to sharpen your current approach.

Using the Screaming Frog Log File Analyzer from above, you can see which pages are getting the most hits from Googlebot. If a page is being crawled regularly throughout the month, there's a good chance you've found a candidate to link from.

That page can have links added to your other core posts, and this is going to help guide Googlebot to the right areas of your site.

You can see below an example of how Matt adds internal links regularly. This helps you guys find more awesome content, and it also helps Googlebot crawl and rank his site.

Internal links are great for improving user experience and boosting your rankings.

[Image: internal link example]

Pillar #4: Forcing a Crawl

If Googlebot is crawling your site and not finding your core pages, that's normally a big issue. Likewise, if your website is so big that Google isn't able to index all of it, this can hurt your rankings.

Search engines don't react well to either situation, and it can lead to a sudden drop in your rankings. Keep in mind, though, that forcing a crawl is no substitute for giving Google timely, relevant, and up-to-date content to crawl.

Thankfully, there are ways to force a crawl of your website. First, however, some words of warning about this approach:

If crawlers are not visiting your website regularly, there's normally a good reason for it. The most likely cause is that Google doesn't think your website is valuable.

Another common cause of poor crawlability and indexability is a bloated website. If you are struggling to get millions of pages indexed, your problem is the millions of pages – not the fact that they aren't indexed.

At our SEO agency, The Search Initiative, we have seen websites that were spared a Panda penalty only because their crawlability was so bad that Google couldn't find the thin content pages. If we had fixed the crawlability and indexability issue without first fixing the thin content, we would have ended up slapped with a penalty.

It’s important to fix all of your website’s problems if you want to enjoy long lasting rankings.

Sitemap.xml

This seems like a pretty obvious one, but since Google uses XML sitemaps to discover your site's URLs, the first method is to submit a sitemap.

Simply take all the URLs you want indexed and run them through Screaming Frog's list mode, by selecting List from the menu:

[Image: Screaming Frog list mode]

Then you can upload your URLs from one of the following options in the dropdown:

  • From File
  • Enter Manually
  • Paste
  • Download Sitemap
  • Download Sitemap Index

[Image: Screaming Frog upload options]

Then, once the crawl has finished and you're happy that all the URLs you want indexed are in the list, you can use the XML sitemap export feature to generate a sitemap.

[Image: Screaming Frog XML sitemap export]

Upload this to your site's root directory and then submit it in Google Search Console.
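For reference, the generated file is just a list of URLs marked up in the standard sitemap protocol. A minimal example (with placeholder URLs) looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2019-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/important-page/</loc>
  </url>
</urlset>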

[Image: Google Search Console]

Fetch & Request Indexing

If you only have a small number of pages that you want to index, then using the Fetch and Request Indexing tool is super useful.

It works great when combined with sitemap submissions to effectively recrawl your site in a short period of time. There's not much to say, other than you can find it in Google Search Console > Crawl > Fetch as Google.

[Image: Fetch as Google]

Link Building

It makes sense that if you are trying to make a page more visible and powerful, throwing some links at it will help you out.

Normally 1–2 decent links can help put your page on the map. This is because Google will be crawling another page and discover the anchor pointing towards yours, leaving Googlebot little choice but to come and crawl it.

Using low-quality pillow links can also work, but I would recommend aiming for some high quality links. They will ultimately improve your likelihood of being crawled, since the page will carry more authority.

[Image: example from Emily]

Indexing Tools

By the time you've resorted to indexing tools, you should have tried everything above and be running out of ideas.

If your pages are good quality, indexable, in your sitemap, fetched and requested, and have some external links pointing at them – and they've still not been indexed – there's another trick you can try.

Many people use indexing tools as a shortcut and default straight to them, but in most cases it's a waste of money. The results are often unreliable, and if you've done everything else right then you shouldn't really have a problem.

However, you can use indexing tools such as Lightspeed Indexer to try to nudge stubborn pages into being crawled. There are tons of others, and they all have their own benefits.

Most of these tools work by sending pings to search engines, similar to Pingomatic.

[Image: Pingomatic]

Summary

When it comes to crawling, there are tons of different ways to solve any problem that you face. The trick for long-term success is figuring out which approach best fits your website's requirements.

My advice to each individual would be this:

Make an effort to understand the basic construction and interconnectivity of the internet.

Without this foundation, the rest of SEO can look like a series of magic tricks. With it, everything else becomes demystified.

Try to remember that the algorithm is largely mathematical. Therefore, even the way your content and links are evaluated can be understood as a series of simple equations.

With this in mind, good luck in fixing your crawl issues, and if you're still having problems, you know where to find us: The Search Initiative.

 

Article by

Rowan Collins

When it comes to technical analysis and implementation for client websites, Rowan Collins is your guy.

 

28 thoughts on “The 4 Pillars of Mastering Google Website Crawl”

  1. Hello Rowan,

    Thank you very much for this insightful post.

    One other thing that affects crawling and indexing as you rightly said is the quality of the content on the page.

    If the contents doesn’t seem to provide value perhaps because they are spun, Google might crawl it but won’t index it

  2. Very useful article. I am just checking for my errors and I have found a lot of not found errors for pages!
    exp.:
    https://www.example.com/page/4/article/

    How do I solve this part or do I have to do redirects for all those URLs (there is over 200 of them)?

    Thanks
    B

    1. You can use the robots.txt file's wildcard feature to handle many at a time. Or use the Redirection plugin if you have a few hours free on a boring afternoon.
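      For example, assuming they all follow that /page/number/article/ pattern, a single wildcard rule would cover the lot:

      User-agent: *
      Disallow: /page/*/article/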

  3. Nice post!! complete and it has useful info.
    Days ago i read this post https://www.seroundtable.com/google-crawl-budget-overrated-25825.html where John Muller talks about crawl budget, maybe you can read it and let me know what do you think.
    Thank you very much for the effort

    1. Hey Harold,

      I think John Mueller is being very transparent on this matter. In the article I mention that people who have crawl budget problems often have site problems. Fixing the root of the problem will ultimately fix the crawl problem.

      If this is because your content is low quality, then fixing this will be a bigger help.

      If you lack links or have too many links, then fixing this will also have a positive impact. It’s about finding the right balances between these signals.

      My personal approach is that crawl budget doesn’t really exist in practice, but only in principle. It’s interesting conceptually, but the main thing is to improve value on the core pages that you want users to engage with.

  4. So the Yoast automatic sitemap isn’t good for google to recognize the important pages ?

    1. Hey Steev,

      The Yoast automatic sitemap gets the job done, but it doesn’t really hold any advantages over a simplified sitemap.

      This will depend largely on your website. The bigger it is, the more benefit you’re going to have from compartmentalised sitemaps.

      The main advantage is that it’s done server side and will adapt based on your page and posts. If you go down the road of using Screaming Frog, then you will need to do it manually whenever you make changes.

  5. Excellent write-up, Rowan.

    As far as distance from Google servers, if your website loads quickly otherwise but shows as slower in Google page speed, would this be an effective measure of being further away?

    Is MaxCDN your preferred content distribution network? Do you consider it best practice to have a CDN anyways?

    Thank You!

    1. Hey Kris,

      I personally use several different page speed tools, and load on 4G through my mobile phone.

      I’m mostly looking for site speed differences between regions, the client’s target location, and whether site speed is likely a problem for their website.

      If your website is consistently performing worse from the United States than any other region; you may wish to tackle this and see if it brings any uplift. Some people will benefit more than others.

      In regards to Content Distribution Networks, I work with clients that use tons of different providers. I’ve not seen that one outperforms others from an SEO perspective, but you should definitely look into where the nodes are located and if this aligns with your goals for the CDN.

      Remember that pricing and implementation may depend on each website, so this will also be a factor.

  6. This is great.

    I am just wondering whether it’s better/more efficient to control index and crawl from robot.txt/htaccess level, or just us a plugin like Yoast to control the pages manually.

    1. Hey Terry,

      I tackle each problem at the root, and look for whatever is going to achieve the best results with the minimal input and collateral.

      Sometimes I’ll use a plugin, other times I will edit the php or liquid files directly. It depends largely on which platform you’re using and the capabilities of it.

  7. Raghuveer Singh Rao

    Yes, I have faced the same issue with my wordpress website. Thank you so much Matt, You always comes out with solution. Once again thank you for your helpful insights.

  8. Hello Matt,

    Great post, I have learned too much today.

    Just a bit confused about the crawl budget, will research on this for sure.

  9. I was struggling for almost one year to get a 100,000 pages site fully indexed. While there is no short cut and it will naturally need time I found a lot in practice what Rowan is reporting

    1) speed, but not only server speed but fast loading in general. Especially make images as small as possible. Otherwise they will eat a lot of your crawling capacity. I could see in the search console that after I reduced the average page size the “Kilobytes downloaded per day” in the search console essentially stayed the same (same crawling budget) but the “Pages crawled per day” went up as each page had gotten smaller

    2) when I enhanced usability of the site and in general increased the value this site brought on the table for users I could see in analytics that the average time spent on site by USERs went up. At the same time also the number of pages indexed went up. Although there is certainly no simple correlation it is probably safe to say that the more users like your site the more pages google will be willing to crawl and index

    3) strong internal linking. Internal linking is even more important than a sitemap.

    Tests:

    a) new domain with 10,000 pages having a sitemap but no internal linking
    b) new domain with 10,000 pages having no sitemap but 1-5 internal links per page

    I performed this test several times. In ALL cases b) outperformed a)

    4) Rowan mentioned backlinks which is very important for sure. Also an expired domain with strong backlinks coming in will get its pages getting indexed much faster at the start than a new domain as of my findings

  10. Great article. What would be the best way to block sites with a specific word in it? My client has many duplicate pages with _copy in it….so i want to block all sites that contains “_copy”.
    Thanks

  11. Great write up,

    I’ve been spending a lot of time in the new search console. What are your suggestions for pages that are being crawled, but not indexed?

    I’ve seen a large increase in this across several websites over the past 90-120 days and I think this is tied into some of Google’s recent algo changes.

    I’m wondering if the best approach is to delete these from the website if they were designed as supplemental pages (blogs, newsletters) and are not the core keyword targets?

  12. Hello! Very interesting article. There was a question – pages with an attribute rel = “canonical” should be closed with a tag “noindex”? Will this save the crawling budget? Thank you!

  13. Nice post, very good article to learn webmaster,
    I have one question, what is cache error, when ever i checked my website cache its show 404 error,
    please guide me.

  14. I have two e-commerce websites, one is about 10,000 pages in the sitemap, have been indexed in google at the first month, but it was dropping the indexed only 300 in the sitemap at the Third month, and it was the same happened at my second websites.
    could you tell me what is the problem? and what should I do?
    thanks

  15. The most important crawling signal in 2019 is fantastic, quality content that the bots cannot miss. If you churn out fantastic content on a regular basis, it is a signal for the spiders to crawl.

    i had issues recently where more than half of my client’s pages were crawled, but not index. You can guess: thin, lack-of-effort content .

    1. Good input. I wouldn’t say its the most important… if you don’t link to a page or include it in a sitemap, it won’t get crawled, but you’re certainly onto something.
