Google Alchemy: Word into HTML

Google quietly updated all of its distributed indexes this weekend with a major change: the company now extracts and indexes the text from many file formats, including Word, PowerPoint, Excel, PostScript, RTF, Lotus files, WordStar 2000, RFT (an old IBM format), MacWrite, and on and on. In most cases the extracted information is also converted to HTML, which you can view by clicking a link next to the result. I was holding off on this item until my brief ran in the New York Times (2nd item).

The addition of these file formats starts out slow: only about 12 million documents out of 35 million non-HTML files indexed are from this set. PDF represents the balance; the company has indexed PDFs since February, starting with a collection of about 10 million, now at 23 million. PDFs are now represented in HTML instead of text, preserving as much of the formatting as practical.

The number two format after PDF is Word; number three is PostScript. This might seem odd if you're not in an academic environment. Many academics have long published PS versions of their files, both before and after Acrobat PDF took off, as it was easier across many systems to send PostScript to a printer or view it with freeware than to handle any other format.

Google expects their initial number to grow quickly. If you can imagine the number of Word documents linked to Web pages, it must be in the tens of millions alone. Couple that with the other formats, legacies of past ages in some cases, and a lot of the hidden Web will be revealed.

One key use I've already found for this feature is viewing PowerPoint files as HTML through Google's View as HTML link. PowerPoint files are huge, especially if they have embedded graphics or movies. Google strips everything but basic position, type size, and color. This 230K PowerPoint file, for instance, is just 9K when viewed as text. (Given the nature of that last URL, it's possible that the link I provide will expire.) Other files will compress much more, from multiple megabytes to a few tens of K.

This feature also opens up these documents to people who don't own the original program. Despite the tens of millions of people and business "seats" that use the Microsoft suite, for instance, there are hundreds of millions who don't own all the applications. Most people in the U.S. and worldwide still access the Net via dial-up, too, and this squeezing of information could provoke a trend to bypass the original files except after reviewing the condensed, extracted version.

Google tells me that entering information in the Properties dialog box for Office files will assist them in keywording and titling results. Google will also follow the various forms of exclusion, and will omit any files they are asked to leave out, whether individually or by a general rule.
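
As a rough illustration of exclusion, assuming the robots.txt convention is one of the forms meant here: the sketch below uses Python's standard urllib.robotparser to check which document URLs a crawler identifying itself as Googlebot would be allowed to fetch. The rules, the paths, and the example.com URLs are all hypothetical, and this shows only how such rules are read, not how Google's crawler actually works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one general rule (a whole directory)
# and one specific rule (a single Word file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/budget.doc
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URLs only; none of these are real documents.
    for url in (
        "http://example.com/private/slides.ppt",
        "http://example.com/drafts/budget.doc",
        "http://example.com/public/report.doc",
    ):
        verdict = "allowed" if is_crawlable(ROBOTS_TXT, "Googlebot", url) else "excluded"
        print(url, verdict)
```

A site owner worried about forgotten Word or PowerPoint files could run a check like this against their own rules before those files start turning up in results.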

I've thought about some scenarios where people have linked documents for download via obscure pages that are still in their site hierarchies. Where these files were invisible until now, suddenly they could emerge as the top match for a given search query and be exposed to the world. Potential embarrassment there. This happened a bit when Google opened up its Usenet news archives to searching back to the mid-90s and people's old, forgotten posts surfaced like an old pet burial in a heavy rain.

