November 8th, 2011 by Mike Ashenfelder
The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.
What is the average lifespan of a webpage? Predictably, estimates vary, and they have varied over time. A 1997 special report in Scientific American claimed 44 days. A subsequent 2001 academic study in IEEE Computer suggested 75 days. More recently, in 2003, a Washington Post article indicated that the number was 100 days.
While there appear to be fewer estimates of webpage longevity floating around than there are of, say, the amount of data stored in the Library of Congress, we can at least feel more assured that they’ve all come from someone who should know: Brewster Kahle, founder of the Internet Archive.
Determining the average lifespan of a webpage is complicated not just by the infrastructure required to analyze a plausibly representative sample of links across the web but also by how easy it is to conflate “the average lifespan of a webpage” with other closely related concepts that are, in actuality, much more difficult to measure. That is to say, we take for granted that we know what it means that a webpage has “died.”
For instance, is a “webpage” defined by its URI or by its contents? A non-resolving link doesn’t necessarily imply that the content once hosted there no longer exists (1); it may have been archived, or it may simply exist at a new location (albeit one mediated by a paywall) to which the web server was not configured to redirect page requests. Conversely, a resolving link doesn’t necessarily imply that the content hosted there is the same as it once was.
An automated link checker visiting a list of URIs and logging all ultimately successful and failed requests would miss these subtleties. A human being with a limitless amount of time who set out to manually check the same list might still get hung up on exhaustively concluding that a disappeared webpage did not, in fact, exist at a new URI or on the subjective determination of whether an extant webpage could be said to be the “same” webpage as before.
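To make the limitation concrete, here is a minimal sketch in Python of the kind of automated link checker described above. It is an illustration only, not the method behind any of the cited estimates; the names check_link and check_links, the ten-second timeout, and the two-valued "ok"/"failed" result are all assumptions made for this example.

```python
import urllib.error
import urllib.request


def check_link(uri, timeout=10.0):
    """Return "ok" if the request ultimately succeeds, else "failed".

    urlopen follows redirects, so only the final outcome is recorded.
    """
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as response:
            return "ok" if 200 <= response.status < 300 else "failed"
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, DNS failures, timeouts, and malformed URIs
        # all collapse into the same verdict.
        return "failed"


def check_links(uris, checker=check_link):
    """Log one verdict per URI (the checker is swappable, e.g. for testing)."""
    return {uri: checker(uri) for uri in uris}
```

Note what the log cannot say: a "failed" entry does not reveal whether the content moved to a new URI or was archived elsewhere, and an "ok" entry does not reveal whether the content is the same as before.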
There are additional complications in these sorts of analyses. While even the longest of the aforementioned webpage lifespans suggests that webpages are ephemeral, some are so fleeting that their lifespans are better measured in hours rather than days. Analyzing its web index, Google noticed that the median lifespan of malware-distributing domains decreased from one month in 2007 to a mere two hours by 2010. Since most commercial web search engines penalize listings from such domains, malware distributors are incentivized to churn quickly through massive numbers of new domains. The number of domains being created and their transience may skew average lifespan calculations by automated methods downward.
Finally, there’s no practical way of knowing precisely when a webpage disappears; we can only know the time difference between a previous visit when the webpage existed and a later visit when it didn’t. Depending on the breadth of the crawl and the infrastructure available, it may be days, weeks, or even months before the crawler visits the same webpage twice. This margin of variability may undermine the precision of webpage lifespans for which the appropriate scale of measurement likewise appears to be days, weeks, or months.
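The uncertainty described above can be sketched in a few lines of Python. This is a hypothetical illustration with made-up visit dates; the function name disappearance_window and the (timestamp, exists) record format are assumptions for the example, not part of any real crawler.

```python
from datetime import datetime


def disappearance_window(visits):
    """Given (timestamp, exists) crawl records for one URI, return the
    interval (last_seen_alive, first_seen_missing) in which the page
    must have disappeared, or None if no disappearance was observed.
    """
    last_alive = None
    for when, exists in sorted(visits):
        if exists:
            last_alive = when
        elif last_alive is not None:
            return last_alive, when
    return None


# Hypothetical crawl history for a single webpage.
visits = [
    (datetime(2011, 1, 1), True),
    (datetime(2011, 2, 1), True),
    (datetime(2011, 3, 15), False),
]
start, end = disappearance_window(visits)
uncertainty = end - start  # width of the window, as a timedelta
```

All the crawler can report is that the page vanished at some point between the two visits that bound the window. If that window is weeks wide, a lifespan stated in days carries more precision than the data supports.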
What this all means for calculations of the average longevity of a webpage is that, while the Internet Archive’s estimates may be the best available, there are key limitations and caveats behind any of the numbers proffered to date. Unfortunately, it’s unlikely that we’ll have objective measurements better than the gross methodologies permitted by automated link checking any time soon.