Easy Data Retrieval

It's increasingly easy to put large quantities of information into online services, helped by the fact that disk capacities, processor speeds, and main memory sizes all continue to increase at 50% per year at fixed prices. As a result, it's becoming increasingly difficult to find material. This problem has been known to librarians for centuries, and to PC users for years ("just where did I save that file?"). Continuing the email example, anyone who has been getting tens of messages a day for several years tends to accumulate a large store of old messages that "might be useful one day". The problem is how to find things.

The traditional answer has been to get the end-user to apply structure to the data, by filing items in hierarchies or by attaching keywords. It's long been recognized that this approach doesn't scale arbitrarily, because users (especially users not specially trained in information retrieval) find it too difficult to apply such strategies consistently. Many users find that hierarchies fail even on the scale of organizing their home PC, let alone a 100 MByte email archive spanning 10 years.

Concretely, the solution here is to use automatic indexes. More specifically, we propose to leverage the technology used for the AltaVista web site. AltaVista currently maintains a full-text index of roughly 100 million documents, totalling roughly 100 GBytes. We know it's effective for finding web documents (see the hit rate statistics). We know that it scales up well. We also know that it scales down well (see the AltaVista product for indexing your own PC, or the Pachyderm email system). The numbers are compelling: very few applications have enough data to challenge the underlying indexing libraries, and we know that millions of users like the results they get.
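To make the idea of an automatic full-text index concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not the AltaVista indexing library: the class name FullTextIndex, its methods, and the sample messages are all hypothetical. It also assumes that a query means "documents containing every query word", which is a simplification.

```python
import re
from collections import defaultdict


class FullTextIndex:
    """Toy inverted index (hypothetical, not the AltaVista library)."""

    def __init__(self):
        # word -> set of ids of the documents that contain it
        self._postings = defaultdict(set)

    @staticmethod
    def _tokenize(text):
        # Lowercase the text and keep runs of letters and digits as words.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        # Indexing is automatic: the user supplies no folders or keywords.
        for word in self._tokenize(text):
            self._postings[word].add(doc_id)

    def search(self, query):
        # Assumed semantics: return documents containing all query words.
        words = self._tokenize(query)
        if not words:
            return []
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return sorted(result)


# Hypothetical sample data standing in for an email archive.
index = FullTextIndex()
index.add("msg-1", "Budget review moved to Thursday")
index.add("msg-2", "Notes from the budget meeting")
index.add("msg-3", "Lunch on Thursday?")

print(index.search("budget thursday"))
```

Every message is indexed as it arrives, and retrieval is by content rather than by where the message was filed. A production index would add ranking, compact on-disk posting lists, and incremental updates, none of which are shown here.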

From the point of view of the user, he has been freed from the filing and organizing tasks required for manual schemes (such as file system hierarchies). Further, his data has just become more valuable: not only does full-text indexing work, but it works much better than most explicit organizations, and he can now find the data he needs for his task.

From the point of view of the system manager, things get larger. Users will now be able to deal with much more data than before, and will tend to keep more "just in case" it might be useful. Disks and file systems will inevitably get more numerous and bigger.

From the point of view of system design, life is in many ways simpler. The designer no longer needs to invent new schemes to let the customers organize their data. A query-based email system is much simpler than one that uses hierarchic folders or other structuring techniques, provided only that the implementor has the ability to link the AltaVista indexing libraries into his application.