
Crawl Budget: Everything You Need to Know for SEO



Crawl budget is commonly discussed in SEO and digital marketing communities, and just as commonly misunderstood.

Many people believe that they can use SEO to “hack” their way to the top of Google’s search results.

There is a lot of confusion among marketers and webmasters about the concept of crawl budget, despite all the content that has been written about how search engines work in general and the crawling process in particular.

The Problem

There seems to be a lot of confusion about how search engines work and the basics of how people use them to find information.

This phenomenon, often called “shiny object syndrome” in the business world, leads to confusion and usually means that marketers can’t tell good advice from bad because they don’t understand the basics.

The Solution

This article provides an overview of crawling and how it can help you determine whether “crawl budget” is something you should be concerned about.

You’ll learn the following:

  • How search engines work (a brief introduction).
  • How the crawling process works.
  • What crawl budget is and how it works.
  • How to track and optimize your crawl budget.
  • The future of crawling.

Let’s get started.

Definitions

Before we go any further with the concept of crawl budget and what it means for search engines, it is important to understand how the crawling process works.

How Search Engines Work

According to Google, there are three basic steps the search engine follows to generate results from webpages:

  • Crawling: Web crawlers access publicly available webpages.
  • Indexing: Google analyzes the content of each page and stores information it finds.
  • Serving (and Ranking): When a user types a query, Google presents the most relevant answers from its index.

If your content is not crawled, it will not be indexed by Google and will not appear in search results.
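
If you want to verify that a specific page is even reachable by crawlers, a quick check is whether it returns a 200 status and whether your robots.txt allows Googlebot to fetch it. Here is a minimal sketch in Python (the URLs are placeholders, not real pages):

    # Minimal crawlability check: does robots.txt allow Googlebot to fetch
    # the URL, and does the page respond with HTTP 200? URLs are placeholders.
    import urllib.error
    import urllib.request
    from urllib import robotparser

    URL = "https://www.example.com/some-page"
    ROBOTS_URL = "https://www.example.com/robots.txt"

    # 1. Check the robots.txt rules for Googlebot.
    rp = robotparser.RobotFileParser()
    rp.set_url(ROBOTS_URL)
    rp.read()
    allowed = rp.can_fetch("Googlebot", URL)

    # 2. Check the HTTP status of the page itself.
    req = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code

    print(f"robots.txt allows Googlebot: {allowed}")
    print(f"HTTP status: {status}")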

The Specifics of the Crawling Process

Google’s documentation on crawling and indexing explains the process roughly as follows:

Crawling starts with a list of web addresses. From there, crawlers follow links on those sites to discover other pages. The software prioritizes new sites, changes to existing sites, and dead links. Automated programs determine which webpages to visit (crawling), how often to visit them, and how many pages to fetch from each site.

What does this mean for SEO?

  • Crawlers use links on sites to discover other pages. (Your site’s internal linking structure is crucial; see the sketch after this list.)
  • Crawlers prioritize new sites, changes to existing sites, and dead links.
  • An automated process decides which sites to crawl, how often, and how many pages Google will fetch.
  • The crawling process is impacted by your hosting capabilities (server resources and bandwidth).
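
As a concrete illustration of the first point, a crawler discovers new URLs by extracting the links from pages it has already fetched. Here is a toy sketch in Python (the starting URL is a placeholder):

    # Toy illustration of link discovery: fetch one page and collect the URLs
    # a crawler could follow from it. The starting URL is a placeholder.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links the way a crawler would.
                        self.links.add(urljoin(self.base_url, value))

    START_URL = "https://www.example.com/"
    with urllib.request.urlopen(START_URL) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    collector = LinkCollector(START_URL)
    collector.feed(html)
    print(f"Discovered {len(collector.links)} links to crawl next")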

As you can see, the process of crawling the web is complicated and expensive for search engines due to the sheer size of the web.

In order for Google to be successful in making information universally accessible and useful, it needs an effective crawling process.

But how does Google ensure effective crawling?

By prioritizing pages and resources.

It would be prohibitively difficult and expensive for Google to crawl every single webpage, so it has to choose where to spend its crawling resources.

Now that we understand how the crawling process works, we can take a closer look at the idea of a crawl budget.

What Is Crawl Budget?

Crawl budget is the number of pages that a search engine’s crawlers are set to crawl within a certain period of time.

Once that budget is used up, the web crawler stops fetching information from your site and moves on to other sites instead.

The number of times Google crawls your website per day is, in effect, your crawl budget. This number is different for every website and is established automatically by Google.
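
To see how often Google is actually crawling your site, you can count Googlebot requests per day in your server’s access logs (log file analysis tools do this at scale). Here is a minimal sketch, assuming a combined-format log file named access.log; the filename and format are assumptions, and a rigorous analysis should also verify that the requests really come from Google:

    # Count Googlebot requests per day from a combined-format access log.
    # The filename and log format are assumptions; a rigorous analysis should
    # also verify that requests really originate from Google.
    import re
    from collections import Counter

    hits_per_day = Counter()
    # Typical line: 66.249.66.1 - - [29/Sep/2024:10:15:32 +0000] "GET /page HTTP/1.1" 200 ...
    date_pattern = re.compile(r"\[(\d{2}/\w{3}/\d{4})")

    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if "Googlebot" in line:
                match = date_pattern.search(line)
                if match:
                    hits_per_day[match.group(1)] += 1

    for day, hits in hits_per_day.items():  # days appear in log (chronological) order
        print(day, hits)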

How much crawling the search engine devotes to your site is based on a variety of factors.

In general, there are four main factors Google uses to allocate crawl budget:

  • Site Size: Bigger sites will require more crawl budget.
  • Server Setup: Your site’s performance and load times might have an effect on how much budget is allocated to it.
  • Update Frequency: How often are you updating your content? Google will prioritize content that gets updated on a regular basis.
  • Links: Internal linking structure and dead links.

It is true that crawl problems can stop Google from accessing your site’s most important content; however, a low crawl rate is not in itself a sign of low quality.

Crawling your site more often will not necessarily help you improve your ranking.

If your content is not meeting your audience’s standards, you will not be attracting new users.

This will not improve by having Googlebot visit your site more frequently.

Crawling is necessary for appearing in the results, but it is not a factor in ranking.

How Do I Optimize My Crawl Budget?

1. Preventing Google from crawling your non-canonical URLs

If you’re not sure what a canonical tag is, it’s essentially a way of telling Google which version of a page is the preferred one.

For example, say you have a product category page for “women’s jeans” located at /clothing/women/jeans, and that page lets visitors sort by price from low to high (i.e. faceted navigation). The sorted versions live at separate URLs but show the same products, so each of them should declare /clothing/women/jeans as its canonical URL.
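
To make that concrete, a sorted variant could include a canonical tag like the following in its <head> (the URLs are illustrative):

    <!-- On a sorted variant such as /clothing/women/jeans?sort=price-asc -->
    <link rel="canonical" href="https://www.example.com/clothing/women/jeans" />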

If you’re using Botify, you can tell when Google is crawling non-canonical pages by looking at the non-indexable indicator. In one study of a large e-commerce site, many pages could be reached by more than one URL, which can cause problems with search engines indexing the site. The study found that 97% of the one million pages crawled were non-canonical.

Even though Google’s crawl budget allowed for more than the site’s 25,000 indexable URLs, it only managed to crawl around half of them in a month; the remainder of the budget was spent on non-indexable URLs.

That is unfortunate, because the site has the potential to achieve a nearly 100% crawl ratio, which would make it more likely that more of its pages drive traffic. If Google weren’t crawling all those non-canonical URLs, it could crawl the other pages more often, and we’ve found that pages that are crawled more often tend to get more visits.

This wastes Google’s crawl budget, which makes it a major problem for SEO.

The solution? Use your robots.txt file to tell search engines what not to crawl

Pages that don’t have value can waste server resources and prevent Google from finding your good content.

The robots.txt file can be used to specify which parts of the site should be crawled by search engine bots, and which parts should be ignored.

If you want to learn more about how to create robots.txt files, Google has some helpful documentation on the subject.

How does the robots.txt file help preserve your crawl budget?

You can use your robots.txt to disallow search engines from crawling sort pages that duplicate the original page. For example, if you have a large e-commerce site with faceted navigation that lets visitors re-sort the content without changing it, you can block those sorted URLs so that search engines don’t waste time on pages you don’t want them to index anyway.
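
For instance, a robots.txt along these lines (the parameter names are illustrative) would keep compliant crawlers away from the sorted and filtered variants while leaving the category pages themselves crawlable:

    User-agent: *
    # Block sorted/filtered variants of category pages (parameter names are illustrative)
    Disallow: /*?sort=
    Disallow: /*?filter=

    Sitemap: https://www.example.com/sitemap.xml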

Ryan Ricketts, Technical SEO Manager at REI, shared a story along these lines at our Crawl2Convert conference: he reduced the number of crawlable URLs on his website from 34 million to 300,000 and saw a significant improvement in crawl budget. Similarly, Aja Frost from HubSpot saw an increase in traffic when she limited the number of pages Google had access to.

Your robots.txt file can help guide search engines to the most critical content on your site. By default, our crawler follows the rules defined for Google in your website’s robots.txt file, and you can also set up a virtual robots.txt file to override the default rules.

Keep in mind that disallowing pages in robots.txt doesn’t guarantee that search engines won’t index them. If there are hyperlinks pointing to those pages elsewhere on your website, search engines may still find and index them. See step #3 for more on that.

2. Improving page load times by optimizing your JavaScript 

If your website uses a lot of JavaScript, you may be using up your crawl budget on JS files and API calls.

Consider this example.

One customer’s website switched from client-side rendering to server-side rendering (SSR). Log file analysis showed that Google then spent more time on the website’s critical content: because it was receiving fully-rendered pages from the server, it no longer had to spend time on JavaScript files and API calls.

JavaScript is often the cause of slow page load times, and Google uses page speed as one criterion for deciding how much to crawl. If your pages load slowly because of JavaScript, Google may end up missing your important content.

The solution? Take the burden of rendering JavaScript off search engines 

If you use a server-side rendering (SSR) solution like SpeedWorkers, it can help improve your website’s performance because search engine bots won’t have to spend time rendering JavaScript when they visit your pages.

Page speed determines how long users wait for your content, so it is a key factor in providing a good user experience. It is also a ranking factor for search engines, and it influences how often a search engine crawls your site. If your website relies heavily on JavaScript and its content changes often, consider creating a version of the site specifically for search engine bots. This is called prerendering.
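
A tool like SpeedWorkers handles this for you, but to illustrate the general idea in miniature: prerendering boils down to detecting known bot user agents and serving them pre-built HTML instead of the JavaScript app. Here is a rough sketch, not a production setup, assuming a Python/Flask app and a directory of prerendered snapshots (all names, paths, and the bot list are illustrative assumptions):

    # Rough sketch of prerendering: serve pre-built HTML snapshots to known
    # bots and the normal JavaScript app shell to everyone else. The paths,
    # filenames, and bot list are illustrative assumptions.
    from pathlib import Path
    from flask import Flask, request, send_from_directory

    app = Flask(__name__)
    BOT_MARKERS = ("Googlebot", "Bingbot")   # simplistic user-agent check
    PRERENDERED_DIR = Path("prerendered")    # directory of HTML snapshots

    @app.route("/", defaults={"page_path": "index"})
    @app.route("/<path:page_path>")
    def serve(page_path):
        user_agent = request.headers.get("User-Agent", "")
        snapshot = PRERENDERED_DIR / f"{page_path}.html"
        if any(bot in user_agent for bot in BOT_MARKERS) and snapshot.exists():
            # Bots get fully rendered HTML, so there is no client-side JS to execute.
            return send_from_directory(PRERENDERED_DIR, f"{page_path}.html")
        # Regular visitors get the normal client-side rendered app shell.
        return send_from_directory("static", "app.html")

    if __name__ == "__main__":
        app.run()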

3. Minimizing crawl errors & non-200 status codes 

Remember the factors Google uses to allocate crawl budget? Google also looks at how many errors the crawler encounters while on your site to help determine how long to spend there.

If Googlebot finds a lot of errors while crawling your site, it may lower your crawl rate limit, which would reduce the amount of time it spends crawling your site. If you’re seeing a lot of 5xx errors, you might want to investigate ways to improve your server’s capabilities.

But non-200 status codes can also simply constitute waste. If your pages have been deleted or redirected, Google will waste time crawling them instead of focusing on your current live URLs.

The solution? Clean up your internal linking and make sure your XML sitemap is up-to-date

It’s a good idea to avoid linking to pages that don’t have a 200 status code so that search engine bots don’t crawl them.

Don’t waste your crawl budget by linking to old URL versions in your content. Instead, link to the live, preferred version of the URL. As a general rule, you should avoid linking to URLs if they’re not the final destination for your content. This is because linking to intermediate pages can cause problems for users who are trying to access your content.

For example, you should avoid linking to:

  • Redirected URLs
  • The non-canonical version of a page
  • URLs returning a 404 status code 

Don’t send search engine bots through multiple middlemen to find your content since this will waste your crawl budget. Instead, link to the ultimate destination.
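
If you want to spot-check a handful of internal links for redirects and 404s yourself (a crawler like Botify will do this at scale), a small script along these lines works; the URL list is a placeholder:

    # Spot-check a few internal links for non-200 responses (redirects, 404s).
    # The URL list is a placeholder; swap in links pulled from your own pages.
    import urllib.error
    import urllib.request

    urls_to_check = [
        "https://www.example.com/clothing/women/jeans",
        "https://www.example.com/old-product-page",
    ]

    for url in urls_to_check:
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "Mozilla/5.0"})
        try:
            # Redirects are followed automatically, so compare the final URL too.
            with urllib.request.urlopen(req) as resp:
                redirected = f" -> {resp.url}" if resp.url != url else ""
                print(f"{resp.status} {url}{redirected}")
        except urllib.error.HTTPError as err:
            print(f"{err.code} {url}")
        except urllib.error.URLError as err:
            print(f"ERROR {url} ({err.reason})")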

Also, avoid common XML sitemap mistakes such as:

  • Listing non-indexable pages like non-200s, non-canonicals, non-HTML, and no-indexed URLs.
  • Forgetting to update your sitemap after URLs change during a site migration
  • Omitting important pages, and more. 

It’s important to list only live, preferred URLs, and not to leave out any key pages that you want search engines to index. Have old product pages? Once their replacements are live on your website, be sure to expire the old pages and remove them from your sitemap.

Botify can be used to check your sitemap for errors so that less time is wasted during the crawling process.
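
For a quick do-it-yourself sanity check, you can also parse your sitemap and confirm that every listed URL returns a 200. Here is a minimal sketch (the sitemap URL is a placeholder):

    # Quick sitemap sanity check: parse the <loc> entries and confirm that
    # each listed URL returns HTTP 200. The sitemap URL is a placeholder.
    import urllib.error
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://www.example.com/sitemap.xml"
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urllib.request.urlopen(SITEMAP_URL) as resp:
        root = ET.fromstring(resp.read())

    for loc in root.findall(".//sm:loc", NS):
        url = loc.text.strip()
        try:
            with urllib.request.urlopen(url) as page:
                status = page.status
        except urllib.error.HTTPError as err:
            status = err.code
        if status != 200:
            print(f"Fix or remove from the sitemap: {url} ({status})")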

4. Checking your crawl rate limit in Google Search Console

You can control how often Googlebot crawls your site by changing your crawl rate in Google Search Console. This setting governs your site’s crawl rate limit, which is a determining factor in how Google allocates your crawl budget, so it’s worth understanding how it works.

By default, Google’s algorithms determine an appropriate crawl rate for your site, but you can override it.

You can limit Googlebot’s crawl rate in order to prevent it from putting too much strain on your server. Note that this could lead to Google finding less of your important content, so use caution.

The solution? Adjust your crawl rate in GSC

To change your crawl rate, go to the crawl rate settings page for the property you want to adjust. There are two choices: “Let Google optimize” and “Limit Google’s maximum crawl rate.”

If you want to increase your crawl rate, it’s a good idea to check whether the option to “Limit Google’s maximum crawl rate” has been selected by accident.

Closing Thoughts

Crawl budget is a relevant and useful concept for websites that want to optimize their search performance.

That said, the idea of a crawl budget might not stay relevant forever, as Google is constantly changing and testing new solutions.

Make sure you are focusing on the basics and key features that will be most valuable to your customers.

