Is this a brand new domain or will the “site” replace content on an established domain?
If it’s a new domain, start slowly. Focus on the content that is original or at least mad-libbed to the point where it’s not throwing any duplicate content filters. Then release sections or chunks of the site once you’ve hit 70%+ indexation.
If you’re replacing a site, there’s typically a pattern of spidering already in place. Check the logs and verify what Webmaster Tools is telling you; the numbers aren’t always what they seem. Take the three-week tally of unique pages spidered and increase it by 20-25%, and that’s a safe number to re-launch with.
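As a rough illustration of that tally, here is a minimal Python sketch. It assumes a combined-format access log and treats Googlebot as the spider of interest; the log path, regex, and date handling are illustrative and would need adjusting to your server's setup:

```python
import re
from datetime import datetime, timedelta

# Minimal sketch: count unique URLs crawled by Googlebot over the last
# three weeks from a combined-format access log. Path and regex are
# illustrative; adjust to your server's log configuration.
LOG_PATH = "access.log"  # hypothetical path
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] "GET (?P<url>\S+) [^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

cutoff = datetime.now() - timedelta(weeks=3)
unique_urls = set()

with open(LOG_PATH) as f:
    for line in f:
        m = LINE_RE.match(line)
        if not m or "Googlebot" not in m.group("ua"):
            continue
        when = datetime.strptime(m.group("date"), "%d/%b/%Y")
        if when >= cutoff:
            unique_urls.add(m.group("url"))

crawled = len(unique_urls)
# Increase the three-week tally by 20-25% for a safe re-launch size.
print(f"Unique pages spidered: {crawled}")
print(f"Safe re-launch range: {int(crawled * 1.20)}-{int(crawled * 1.25)} pages")
```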
That’s based on personal experience only, but I’ve had a number of sites that are well north of 50K pages.
Assuming all or most of those 50,000 pages are of decent quality (I doubt it, based on my experience with enterprise-level content production), I would opt to release them slowly over time.
I would also focus on making the site fast and easy to crawl by making use of robots.txt, meta robots tags, rel canonical, rel next/prev, and so on, and by generally improving the overall quality of the site and its content. Every situation is different, but here are a few specific ways you might make the most of your crawl budget (a sketch of the first item follows this list):

- Look into either blocking or noindexing the internal search results and blog tag pages.
- Provide up-to-date XML sitemaps that do not include blocked URLs, multiple URL versions of a page, or other common errors.
- Limit redirects with multiple “hops” by updating legacy regex code and internal links.
- Fix internal broken links.
- Fix duplicate content issues.
- Avoid publishing indexable thin content.
- And, above all, build authority!
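To make that first item concrete, here is a minimal sketch. The paths are hypothetical; adjust them to wherever your internal search results and tag pages actually live. Blocking via robots.txt stops crawling entirely, which is what saves crawl budget:

```
# robots.txt (hypothetical paths)
User-agent: *
# Keep crawlers out of internal search results
Disallow: /search/
Disallow: /*?q=
```

Alternatively, a meta robots tag on each tag page lets crawlers follow the links on the page but keeps the page itself out of the index:

```html
<!-- On each blog tag page -->
<meta name="robots" content="noindex, follow">
```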
Paddy Moogan has some excellent ideas for technical ways to optimize your crawl budget. I have found sitemap segmentation to be particularly useful in diagnosing which sections of the site have crawling and/or indexation issues.
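To illustrate what sitemap segmentation might look like, here is a hypothetical sitemap index with one child sitemap per site section (the file names and sections are made up). Submitting each child sitemap separately lets you compare submitted versus indexed counts section by section:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap_index.xml: one child sitemap per section (hypothetical names) -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-categories.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-articles.xml</loc></sitemap>
</sitemapindex>
```

If, say, the articles sitemap shows a far lower indexation rate than the others, you know which section of the site to investigate first.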
Concepts like optimizing your crawl budget, bot herding, distribution of internal PageRank, etc. are all important, but in this particular situation I might be more concerned about whether those 50K pages “should” be in the index at all. Crawl budget, so far as I know, is allocated in large part based on the overall PageRank of the site, as well as of individual pages. Domains with a lot of authority and a high average PageRank throughout the site will tend to get crawled more often and more deeply. Releasing a large amount of low-quality content (thin, duplicate, or stub pages, doorway pages, generally low-quality articles) will drag down the average PageRank per page of the site. When the ratio of high-quality to low-quality (or at least non-authoritative) content leans too heavily toward the low-quality end, you run the risk of being affected by algorithmic filters, such as Panda, which will harm rankings even for your best pages.
Ian Lurie did some small-scale testing of this hypothesis in 2011 using the log files from 40 websites. Though the sample size was small, the results were compelling enough to lead to this simple conclusion: Crawl budget increases with higher PR.
So, in a sense, I start with the premise that the best way to improve your crawl budget is to improve the authority of your site by publishing top-notch content, obtaining high-quality links, and keeping low-quality content out of the index. That last part implies that my answer to the original question could be “neither” if the quality of the content isn’t up to par.
One last thing I’d like to mention is that the average site on a shared hosting account is probably just as likely to have its crawl budget limited by the host itself as by Google. Matt Cutts touches on the concept of “host load” in this interview with Eric Enge from 2010.
One of the technical terms associated with many new businesses and new sites on the web is “cold start.” From out of nowhere comes a site that no one really knows about, with a topic that might not be very well known either. The site I write about above didn’t suffer from a cold start, but I was nervous that it was too much, too fast.
The answer to whether you should launch the whole site at once or publish it to the world in pieces might depend on what the site is about, how much traffic can be driven to it in a short period of time, and what forces are behind it in terms of promotion. Most sites around that size are likely to go through a cold start.
When a search engine crawls a site these days, it looks at features such as the PageRank the pages of the site might have, how frequently those pages are updated, and how many visits the server or servers the site is on can handle. There’s actually some documented history behind how Google crawls websites.
When Google first started out as a search engine, one of the challenges it faced was crawling websites to add to its index. The robots mailing list had already come up with the robots.txt protocol, which described how webmasters could keep some pages and directories on a site from being crawled by having a robot follow Disallow statements in a text file named robots.txt in the root directory. In the early 2000s, a page on the Stanford.edu website listed a number of whitepapers that Google followed in setting up its search engine. One of those documents, co-authored by Lawrence Page, was titled “Efficient Crawling Through URL Ordering.”
The document describes a number of importance metrics that set the priority for crawling pages across the Web. For example, one of the importance metrics described was closeness to the root directory. Given a choice between crawling one website with a million pages and crawling a million home pages on a million sites, a search engine would build a better index by covering the million home pages across a million sites instead of the single million-page site.
These importance metrics involve the following (a toy scoring sketch follows the list):
- A focused crawl for a specific query and similar queries, to make sure that the topics included within an index have a broad range of coverage.
- An estimate of backlink counts to a page.
- PageRank – considering the importance of the links pointing to a page.
- Forward link count – A page with a lot of links from it might be something like a directory, which can help a crawler find lots of new links.
- Location metrics – URLs in a root-level directory might be preferred over URLs that include more slashes in their path, since more important pages tend to be closer to the root directory. A URL that includes .com within it might be more important than other URLs that don’t. A URL that includes the string “home” might be more interesting than others.
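To make the idea concrete, here is a toy sketch of URL ordering by importance score, in the spirit of the paper. It is purely illustrative, not Google's actual implementation: the metrics are simplified, the weights are made up, and a real crawler would fold in PageRank and forward-link counts as well:

```python
import heapq
from urllib.parse import urlparse

def location_metric(url: str) -> float:
    """Location metrics: fewer slashes in the path, a .com host, and the
    string 'home' in the URL all nudge the score upward."""
    parsed = urlparse(url)
    depth = parsed.path.strip("/").count("/")  # 0 for root-level pages
    score = 1.0 / (1 + depth)
    if parsed.hostname and parsed.hostname.endswith(".com"):
        score += 0.25
    if "home" in url.lower():
        score += 0.25
    return score

def importance(url: str, backlink_estimate: int) -> float:
    """Combine an estimated backlink count with location metrics.
    The 0.7/0.3 weights are purely illustrative."""
    return 0.7 * backlink_estimate + 0.3 * location_metric(url)

# Hypothetical frontier: (URL, backlinks seen so far during the crawl)
frontier = [
    ("https://www.example.com/", 120),
    ("https://www.example.com/blog/archive/2011/post.html", 3),
    ("https://www.example.org/home", 15),
]

# Max-heap via negated scores: pop the most important URL first.
heap = [(-importance(url, links), url) for url, links in frontier]
heapq.heapify(heap)
while heap:
    neg_score, url = heapq.heappop(heap)
    print(f"crawl next: {url} (score {-neg_score:.2f})")
```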
It’s evident from these that one of the metrics that still holds a great deal of weight these days is PageRank. Another appears to be how much load a site or server can take when a search engine visits to index pages. If Google crawled too many pages from a server at the same time, it could slow down the sites on that server or cause them to crash.
Given a site filled with high-quality content, with canonical and pagination link elements set up correctly, with duplicate content issues taken care of, and with other crawling and indexing issues taken into account (good luck on all of those), I’d have to think about other issues, such as how quickly the site might attract traffic and links, how widely it might be shared socially, how much people might trust it and find it credible, and so on.
For a site whose launch might not be a cold start, I don’t think I would hesitate to show the world a 50,000-URL website. But those aren’t the circumstances around most sites.