Department of Computer Science,
SUNY at Stony Brook,
Stony Brook, NY 11794, U.S.A.
June 2002
Abstract
| Given that commercial search engines cover billions of web pages, efficiently managing the corresponding volumes of disk-resident data needed to answer user queries quickly is a formidable data manipulation challenge. We present a general technique for efficiently carrying out large sets of simple transformation or querying operations over external-memory data tables. It greatly reduces the number of performed disk accesses and seeks by maximizing the temporal locality of data access and organizing most of the necessary disk accesses into long sequential reads or writes of data that is reused many times while in memory. This technique is based on our experience from building a functionally complete and fully operational web search engine called Yuntis. As such, it is in particular well suited for most data manipulation tasks in a modern web search engine and is employed throughout Yuntis. The key idea of this technique is coordinated partitioning of related data tables and corresponding partitioning and delayed batched execution of the transformation and querying operations that work with the data. This data and processing partitioning is naturally compatible with distributed data storage and parallel execution on a cluster of workstations. Empirical measurements on the Yuntis prototype demonstrate that our technique can improve the performance of external-memory data preparation runs by a factor of 100 versus a straightforward implementation. |
| The paper (presented at VLDB 2002 -- 14% acceptance rate): pdf, ps, or dvi (12 letter pages; as of June 10, 2002) |
@inproceedings{LifantsevChiueh02:vldb,
author = "Maxim Lifantsev and Tzi-cker Chiueh",
title = "{I/O}-Conscious Data Preparation for
Large-Scale Web Search Engines.",
booktitle = "VLDB 2002, Proceedings of 28th International Conference on Very
Large Data Bases, August 20-23, 2002, Hong Kong, China",
publisher = "Morgan Kaufmann",
year = "2002",
}
|
|
Research Papers |
| Last updated on Oct. 18, 2002 by Maxim Lifantsev | |
| Comments, Suggestions? |