rentzsch.com: tales from the red shed

Google Cache Hacking

Notes
As Googlebots traverse the web, they stash away copies of the pages they download. Google then allows searchers to view these pages from Google's cache. For example, here's Google's cache of rentzsch.com:

http://google.com/search?q=cache:1Hotxp9-WHYC:rentzsch.com/

Note the URL pattern. The query begins with a cache: keyword, followed by some sort of hash, followed by another colon, and the cached page's real address.
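
To make the pattern concrete, here is a minimal Python sketch that pulls such a URL apart. The function name split_cache_query is my own invention for illustration, and the three-part layout it assumes (keyword, hash, address) is only what can be read off the example URL above, not a documented Google format.

```python
from urllib.parse import urlparse, parse_qs

def split_cache_query(url):
    """Split a cache URL into (hash, address), assuming the pattern cache:HASH:ADDRESS."""
    # The whole cache expression lives in the q= query parameter.
    query = parse_qs(urlparse(url).query)["q"][0]
    # Split on the first two colons only; the address itself may contain colons.
    keyword, page_hash, address = query.split(":", 2)
    if keyword != "cache":
        raise ValueError("not a cache: query")
    return page_hash, address

print(split_cache_query("http://google.com/search?q=cache:1Hotxp9-WHYC:rentzsch.com/"))
# prints ('1Hotxp9-WHYC', 'rentzsch.com/')
```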

Now, let's suppose rentzsch.com were temporarily down while I upgraded my server hardware to 39 Xserves + an Xserve RAID in my new 42U rack (hey -- I can dream), and you really wanted information on PSIG, the programming group I host. You modify the cache URL:

http://google.com/search?q=cache:1Hotxp9-WHYC:rentzsch.com/psig/

This is interesting. The original page's address (rentzsch.com/psig/) is ignored, and the same page is returned as before. We can verify that the original page address is being ignored by replacing it and trying again:

http://google.com/search?q=cache:1Hotxp9-WHYC:apple.com/

Obviously it's the hash (1Hotxp9-WHYC) that is really identifying the page cache entry. Now here is where it gets interesting. We can corrupt the hash to force Google to rely on the given page address. For example, let's morph the hash's initial 1 into a 2:

http://google.com/search?q=cache:2Hotxp9-WHYC:apple.com/

Bingo, a cache of Apple's front page is returned. Likewise, we can look up the PSIG cache:

http://google.com/search?q=cache:2Hotxp9-WHYC:rentzsch.com/psig/
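
The corrupt-the-hash trick can be sketched in a few lines of Python. The helper names corrupt_hash and cache_url_for are hypothetical, and the sketch assumes that changing the first character is enough to break the match, since that is the only corruption tried above.

```python
def corrupt_hash(page_hash):
    """Return the hash with its first character changed so it no longer matches."""
    # Assumption: swapping in a "2" (or a "3" if it already starts with "2")
    # is enough to break the match, as the 1 -> 2 change was.
    replacement = "2" if page_hash[0] != "2" else "3"
    return replacement + page_hash[1:]

def cache_url_for(address, known_hash="1Hotxp9-WHYC"):
    """Build a cache URL that should fall back to the given page address."""
    return "http://google.com/search?q=cache:%s:%s" % (corrupt_hash(known_hash), address)

print(cache_url_for("apple.com/"))
# prints http://google.com/search?q=cache:2Hotxp9-WHYC:apple.com/
print(cache_url_for("rentzsch.com/psig/"))
# prints http://google.com/search?q=cache:2Hotxp9-WHYC:rentzsch.com/psig/
```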

Using this technique, it should be possible to write a frontend to Google that allows you to fully surf the web, exclusively through their cache. All that would be required is to rewrite the URLs in the returned page to point back into Google's cache. I'm actually tempted to write such a frontend, but the low volume of traffic to Googlefone tells me my time is better spent elsewhere.
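
For the curious, the URL-rewriting step of such a frontend might look roughly like this in Python. It is only a sketch under loud assumptions: rewrite_links and BOGUS_HASH are names I made up, the regular expression only catches double-quoted href attributes, and fetching the cached page (plus handling images, forms, and whatever Google wraps around it) is left out entirely.

```python
import re
from urllib.parse import urljoin, urlparse, quote

BOGUS_HASH = "2Hotxp9-WHYC"  # assumption: a deliberately corrupted hash
HREF_RE = re.compile(r'href="([^"]*)"')  # naive: double-quoted hrefs only

def rewrite_links(html, base_url):
    """Point every http(s) link in html back into Google's cache."""
    def to_cache(match):
        # Resolve relative links against the address of the page being viewed.
        absolute = urljoin(base_url, match.group(1))
        if urlparse(absolute).scheme not in ("http", "https"):
            return match.group(0)  # leave mailto:, javascript:, etc. untouched
        # Drop the scheme, since the cache URLs use bare addresses.
        address = absolute.split("://", 1)[1]
        # Escape characters such as ? and & that would confuse the outer query.
        return 'href="http://google.com/search?q=cache:%s:%s"' % (
            BOGUS_HASH, quote(address, safe=":/"))
    return HREF_RE.sub(to_cache, html)

print(rewrite_links('<a href="/psig/">PSIG</a>', "http://rentzsch.com/"))
# prints <a href="http://google.com/search?q=cache:2Hotxp9-WHYC:rentzsch.com/psig/">PSIG</a>
```

A real frontend would want a proper HTML parser rather than a regular expression, but the idea is the same: resolve each link, then wrap it in a cache query.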

Update: Avi points out it's far easier to access Google's cache than I mention here.

Update #2: Inspired by Jon Udell's LibraryLookup, I wrote my first bookmarklet: Google Cache Lookup.

Now, when you stumble upon a page that's been deleted, suddenly changed for the worse, or temporarily inaccessible, you can invoke this bookmarklet, which will spawn a new window with Google's cache of the current page (hopefully).

Update #3: Fuse thinks it would be nice if browsers, when faced with an error, would automatically attempt to display an archived copy of the requested page. I agree. Also, he wrote a better version of my Google Cache Lookup bookmarklet: Google Cache Lookerupper. Nice work.

Update #4: Ryan Shaw wrote that the bookmarklet didn't work for him in Mozilla 1.3 or Galeon 1.3.4, so he wrote another version. Meanwhile, Matt Schneider related how web archives allow persistent fine-grained URLs. (Check out Purple Slurple.)

Wednesday, April 16, 2003
12:00 AM