

Yuntis: Collaborative Web Resource Categorization and Ranking Project


People

Professor Tzi-cker Chiueh
Maxim Lifantsev


Project Description

The Web is becoming an ever more extensive source of information resources and services, used by a substantial part of the world's population. Consequently, tools and services that help people quickly and accurately locate the web resources they are interested in are becoming more crucial, both for the usability of the Web and for making resources published somewhere on the Web genuinely accessible to the users who want them. Such tools, presently represented by web search engines and directories, try to bridge the gap between a user's information need (usually expressed as a keyword-based query with additional restrictions based on, for instance, a category of resources) and the set of all web pages, which have some textual content and are loosely connected by hyperlinks but have no built-in mechanism for mapping user queries to a set of relevant web resources.

Yuntis is a project to develop a mechanism (and a software system supporting it) that allows web page authors and general web surfers to collaboratively construct a mapping from keyword-based queries to (ordered) sets of web resources, so that this mapping best satisfies, in a democratic sense, the expressed collective desires of all the web authors and surfers.

The main idea of the project is to support a democratic, open voting model with power delegation, in which interested actors (that is, web page authors and web surfers) can express their opinions (desires) on how a web search and navigation service should answer different types of queries, and can also delegate the power to do so to any other actor. They do this by publishing metadata on the Web, which is then crawled and processed by the search engine. This metadata can state such things as the use of a given fraction of voting power to associate a given short textual description with a given web resource, to associate a universal "goodness" rank on some common scale with a web resource, or to transfer that fraction of power to another given actor. More realistically, such associations and power transfers are to be stated with respect to a category in a classification directory. In addition, metadata specifying which subcategories a given category should have, and what the textual descriptions of (sub)categories in the directory should be, can be used to collaboratively construct the classification directory itself. All this crawled metadata, combined with a given distribution of initial amounts of power among the actors, is to be used to construct a classification directory structure and associations of text and rank values with web resources, as collectively desired by the actors that have provided the metadata. This information, along with various statistics on it, can then be used to answer user queries and to serve the directory structure and various information about web resources and actors to users of our web search engine service.
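
To make the voting model more concrete, here is a minimal Python sketch of how rank votes and power transfers might be tallied. It is an illustration under stated assumptions, not the Yuntis implementation: the statement types (RankVote, Transfer), the tally function, the 0..1 rank scale, the round-based propagation, and the example URLs are all hypothetical. Description votes and categories are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RankVote:
    """Spend `fraction` of the actor's power rating `resource` at `value`."""
    resource: str
    value: float      # position on an assumed common "goodness" scale, 0..1
    fraction: float


@dataclass(frozen=True)
class Transfer:
    """Hand `fraction` of the actor's power to another actor."""
    to_actor: str
    fraction: float


def tally(initial_power, statements, max_rounds=50, eps=1e-9):
    """Return {resource: power-weighted mean rank} for the given metadata.

    Power that an actor receives through a Transfer is spent according to
    the receiving actor's own statements in the next round. The round limit
    keeps delegation cycles from looping forever (an assumed policy).
    """
    weight = defaultdict(float)      # total power spent on each resource
    weighted = defaultdict(float)    # sum of power * value per resource
    pending = dict(initial_power)    # power still to be applied, per actor
    for _ in range(max_rounds):
        if sum(pending.values()) < eps:
            break
        forwarded = defaultdict(float)
        for actor, power in pending.items():
            for st in statements.get(actor, ()):
                share = power * st.fraction
                if isinstance(st, Transfer):
                    forwarded[st.to_actor] += share
                else:
                    weight[st.resource] += share
                    weighted[st.resource] += share * st.value
        pending = forwarded
    return {r: weighted[r] / weight[r] for r in weight if weight[r] > 0}


# Hypothetical metadata: alice rates one page and delegates the rest to bob.
statements = {
    "alice": [RankVote("http://a.example/", 1.0, 0.5), Transfer("bob", 0.5)],
    "bob": [RankVote("http://a.example/", 0.0, 0.5),
            RankVote("http://b.example/", 1.0, 0.5)],
}
initial_power = {"alice": 1.0, "bob": 1.0}
ranks = tally(initial_power, statements)
```

In this sketch bob's opinion of http://a.example/ outweighs alice's, because bob applies both his own power and the half that alice delegated to him.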

In the absence of the proposed kind of explicit metadata, one can (inaccurately) convert various available data sources, such as the textual composition of web pages, the linkage among web pages, and the structure of an existing classification directory such as ODP, into a common (meta)data format that a system can use alongside explicit metadata while the latter becomes more widespread.
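
As a rough illustration of such a conversion, the Python sketch below turns hyperlinks into implicit vote statements. The record layout (ImplicitVote), the function name, the rule that a page splits its voting power equally among its outgoing links, and the use of anchor text as the description are all assumptions made for this example, not the conversion Yuntis actually performs.

```python
from collections import defaultdict
from typing import NamedTuple


class ImplicitVote(NamedTuple):
    actor: str         # the linking page, treated as the voting actor
    resource: str      # the page being linked to
    description: str   # anchor text, used as a short textual description
    fraction: float    # share of the actor's voting power behind this vote


def links_to_statements(links):
    """Convert (source_url, target_url, anchor_text) triples into votes.

    Each source page divides its power equally among its outgoing links
    (an assumed policy chosen only to keep the sketch simple).
    """
    by_source = defaultdict(list)
    for source, target, anchor in links:
        by_source[source].append((target, anchor))
    votes = []
    for source, outgoing in by_source.items():
        fraction = 1.0 / len(outgoing)
        for target, anchor in outgoing:
            votes.append(ImplicitVote(source, target, anchor, fraction))
    return votes


# Hypothetical crawl output: one page with two outgoing links.
links = [
    ("http://p.example/", "http://q.example/", "Q home page"),
    ("http://p.example/", "http://r.example/", "R documentation"),
]
votes = links_to_statements(links)
```

Statements produced this way are only a guess at what the page author meant, which is why the text calls the conversion inaccurate.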

A more abstract view of the project is the development (and application to the creation of a web searching service) of a mechanism for the collaborative construction of various relations, in which the power to influence the relations can be delegated among the collaborating actors (people) and the initial amount of influencing power given to each actor is an independent parameter of the system. This allows one to get results that range, for example, from completely democratic (when all actors are given an equal amount of initial power) to "elitist" (when a selected set of actors is initially given more power than the others) to dictatorial or personalized (when one actor is initially given most or all of the power). All these cases can utilize the same power delegation and application network among the actors, which can, for instance, allow a "dictator" to build a hierarchy of trusted subordinates, or allow a population of democratically cooperating actors to "elect" representatives by delegating to them the power to influence the constructed relations. Thus, the power delegation mechanism should allow each actor to provide a relatively small amount of metadata and still accurately express its wishes on how the constructed relations should be built.
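
The effect of treating initial power as an independent parameter can be sketched in a few lines of Python. The example below keeps one delegation network and one set of preferences fixed and changes only the initial power distribution. The actor names, the two options "X" and "Y", the function names, and the rule that an actor exercises whatever power it does not delegate are hypothetical choices for illustration only.

```python
from collections import defaultdict


def effective_power(initial, delegation, rounds=100):
    """Power each actor ends up exercising directly after delegation.

    `delegation[a][b]` is the fraction of whatever power `a` holds that is
    passed on to `b`; the remainder is exercised by `a` itself. The round
    limit bounds the work if the delegation network contains cycles.
    """
    exercised = defaultdict(float)
    pending = dict(initial)
    for _ in range(rounds):
        forwarded = defaultdict(float)
        for actor, power in pending.items():
            out = delegation.get(actor, {})
            exercised[actor] += power * (1.0 - sum(out.values()))
            for target, fraction in out.items():
                forwarded[target] += power * fraction
        pending = forwarded
        if not pending:
            break
    return dict(exercised)


def outcome(initial, delegation, preference):
    """Total power behind each option, given every actor's own preference."""
    totals = defaultdict(float)
    for actor, power in effective_power(initial, delegation).items():
        totals[preference[actor]] += power
    return dict(totals)


# The same delegation network and preferences are used in all three cases.
delegation = {"ben": {"ann": 1.0}, "cat": {"ann": 0.5}}
preference = {"ann": "X", "ben": "X", "cat": "Y", "dan": "Y"}

democratic = {"ann": 1.0, "ben": 1.0, "cat": 1.0, "dan": 1.0}
elitist = {"ann": 1.0, "ben": 1.0, "cat": 4.0, "dan": 4.0}
dictatorial = {"ann": 0.0, "ben": 0.0, "cat": 0.0, "dan": 1.0}

for name, initial in [("democratic", democratic),
                      ("elitist", elitist),
                      ("dictatorial", dictatorial)]:
    totals = outcome(initial, delegation, preference)
    print(name, max(totals, key=totals.get), totals)
```

With equal initial power, ann acts as an elected representative and option X prevails. Giving cat and dan more initial power, or giving dan all of it, tips the result to Y without any change to the delegation network.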


Prototype Implementation

The current prototype implementation of the Yuntis engine can be accessed at http://yuntis.ecsl.cs.sunysb.edu.

The prototype consists of three different data sets powered by three different engines: http://yuntis-edu.ecsl.cs.sunysb.edu, http://yuntis-usb.ecsl.cs.sunysb.edu, and http://yuntis-wrl.ecsl.cs.sunysb.edu.

The yuntis-edu server is based on a crawl of over nine million pages located at universities, mainly English-speaking ones, and some research labs.

The yuntis-wrl server is based on a crawl of over four million pages listed in a recent snapshot of ODP.

The yuntis-usb server is based on a recent exhaustive crawl of the .sunysb.edu domain.
You can compare it with Google's SUNY SB search.
(This is not a very fair comparison because Google's search uses linkage information outside of the .sunysb.edu domain, but returns results only inside of the .sunysb.edu domain, whereas Yuntis has data only from the .sunysb.edu domain, but can return results from other domains too.)
Another point of comparison is the official, Inktomi-powered SUNY SB search.

For the most recent description of the main features of the prototype see http://yuntis-usb.ecsl.cs.sunysb.edu/about/.


Papers and Technical Reports

I/O-Conscious Data Preparation for Large-Scale Web Search Engines. Maxim Lifantsev and Tzi-cker Chiueh. Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 20-23, 2002, Morgan Kaufmann.
A System for Collaborative Web Resource Categorization and Ranking. Maxim Lifantsev. Ph.D. Dissertation Proposal, Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY, October 2001.
Voting Model for Ranking Web Pages. Maxim Lifantsev. In Peter Graham and Muthucumaru Maheswaran, editors, Proceedings of the International Conference on Internet Computing (Las Vegas, Nevada, U.S.A.), CSREA Press, pages 143-148, Las Vegas, June 2000.
Open Peer-Review as Web's Self-Organization Force. Maxim Lifantsev. Technical Report TR-78, ECSL, Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY, February 2000.
Rank Computation Methods for Web Documents. Maxim Lifantsev. Technical Report TR-76, ECSL, Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY, November 1999.

Items are listed in reverse chronological order.


Downloadable Software

The OpenGRiD project code is an early version of our prototype implementation codebase. It contains C++ libraries for building event-driven, non-blocking, single-threaded applications. Sample applications, such as a simple web server and a minimal web proxy, are included.
The OGProxy code implements a web proxy that does some filtering of HTTP and HTML data.


Related Resources

Google Search Engine

Implementation: http://www.google.com/
Current (Known) Researchers:
Sergey Brin, Google's Co-founder & President, Technology (President until Aug. 2001), Ph.D. candidate at Stanford University, DBLP Publications List
Larry Page, Google's Co-founder & President, Products (Chief Executive Officer until Aug. 2001), Ph.D. candidate at Stanford University, DBLP Publications List
Craig Silverstein, Google's Director of Technology, Ph.D. candidate at Stanford University, DBLP Publications List
Monika Henzinger, Google's Director of Research, formerly a Research Scientist at Compaq's Systems Research Center, DBLP Publications List
Krishna Bharat, Senior Research Scientist at Google, formerly a Research Scientist at Compaq's Systems Research Center, DBLP Publications List
See also Database group at Stanford University.
Main Publications:
Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7, 1998.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, The PageRank Citation Ranking: Bringing Order to the Web. Stanford University, 1998. (slides)
Sergey Brin, Rajeev Motwani, Lawrence Page, and Terry Winograd, What can you do with a Web in your Pocket? Data Engineering Bulletin, 1998.
Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Efficient crawling through URL ordering. WWW7, 1998.
Publication Collections:
By Sergey Brin
By Craig Silverstein
By Monika Henzinger

Stanford University Database Group

Group's home page: WebBase Project
Implementation: Google's prototype has been developed there as part of the WebBase Project
Current (Known) Researchers:
Hector Garcia-Molina, Professor at Stanford University, DBLP Publications List
Rajeev Motwani, Associate Professor at Stanford University, DBLP Publications List
Terry Winograd, Professor at Stanford University, DBLP Publications List
Jeffrey Ullman, Professor at Stanford University, DBLP Publications List
Andreas Paepcke, Senior Research Scientist at Stanford University, DBLP Publications List
Junghoo Cho, Ph.D. candidate at Stanford University, DBLP Publications List
Sergey Melnik, Research scholar at Stanford University, DBLP Publications List
Taher Haveliwala, Ph.D. candidate at Stanford University, DBLP Publications List
Sriram Raghavan, Ph.D. candidate at Stanford University, DBLP Publications List
Former Group Members:
Sergey Brin
Larry Page
Craig Silverstein
Main Publications:
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina, Building a Distributed Full-Text Index for the Web. WWW10, 2001.
Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke, WebBase: A repository of web pages. WWW9, 2000.
Junghoo Cho and Hector Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler. VLDB, 2000.
Andreas Paepcke, Hector Garcia-Molina, Gerard Rodríguez-Mulà, and Junghoo Cho, Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies. SIGMOD, 2000.
Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina, Finding replicated web collections. SIGMOD, 2000.
Taher Haveliwala, Efficient Computation of PageRank. Stanford University, 1999.
Narayanan Shivakumar, Hector Garcia-Molina, Finding near-replicas of documents and servers on the web. WebDB, 1998.
See also early Google publications.
Publication Collections:
By the Stanford DB group
By Rajeev Motwani
By Jeffrey Ullman
By Junghoo Cho
By Sergey Melnik
By Sriram Raghavan

Kleinberg and (former) IBM's Group at Almaden

Projects' Home Pages:
Text and Information
CLEVER
Campfire
Main Researchers:
Jon Kleinberg, Assistant Professor at Cornell University, DBLP Publications List
Soumen Chakrabarti, Assistant Professor at Indian Institute of Technology Bombay, DBLP Publications List
Byron Dom, Research Scientist at IBM's Almaden Research Center, DBLP Publications List
Ravi Kumar, Visiting? Research Scientist at IBM's Almaden Research Center, DBLP Publications List
Prabhakar Raghavan, Vice President and Chief Technology Officer at Verity, Consulting Professor at Stanford University, DBLP Publications List
Sridhar Rajagopalan, Postdoc at DIMACS Rutgers University, DBLP Publications List
D. Sivakumar, Visiting? Research Scientist at IBM's Almaden Research Center?, DBLP Publications List
Andrew Tomkins, Ph.D. from Carnegie Mellon University, Research Scientist at IBM's Almaden Research Center, DBLP Publications List
David Gibson, Ph.D. candidate at University of California, Berkeley, DBLP Publications List
Dharmendra Modha, Research Scientist at IBM's Almaden Research Center, DBLP Publications List
Main Publications:
Soumen Chakrabarti, Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. WWW10, 2001.
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, On semi-automated Web taxonomy construction. SIGMOD WebDB, 2001.
Dharmendra Modha and Scott Spangler, Clustering Hypertext with Applications to Web Searching. ACM HT, 2000.
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and Eli Upfal, Stochastic models for the Web graph. IEEE FCS, 2000.
Soumen Chakrabarti, Martin van den Berg, and Byron Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. WWW8, 1999.
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for emerging cyber-communities. WWW8, 1999.
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Extracting large-scale knowledge bases from the web. VLDB, 1999.
Jon Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, The Web as a graph: measurements, models and methods. ICCC, LNCS, 1999.
Soumen Chakrabarti, Byron Dom, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg, Mining the Link Structure of the World Wide Web. IEEE Computer, 1999.
Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment. ACM-SIAM DA, 1998.
David Gibson, Jon Kleinberg, Prabhakar Raghavan, Inferring Web Communities from Link Topology. ACM HT, 1998.
Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg, Automatic resource compilation by analyzing hyperlink structure and associated text. WWW7, 1998.
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan, Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
Soumen Chakrabarti, Byron Dom, and Piotr Indyk, Enhanced hypertext categorization using hyperlinks. SIGMOD, 1998.
Soumen Chakrabarti, Byron Dom, David Gibson, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Spectral filtering for resource discovery. SIGIR, 1998.
Soumen Chakrabarti, Byron Dom, David Gibson, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Experiments in Topic Distillation. SIGIR, 1998.
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan, Using taxonomy, discriminants, and signatures for navigating in text databases. VLDB, 1997.
Publication Collections:
For the IBM's CLEVER Project
For the Text and Information Research at Almaden
By Jon Kleinberg
By Soumen Chakrabarti
By Byron Dom
By Prabhakar Raghavan
By Andrew Tomkins

Compaq's Systems Research Center Web Archaeology Group

Group's home page: Web Archaeology Project
Implementation: The Mercator web crawler and other search-engine-related prototypes have been created.
Various parts of the AltaVista search engine have also been developed here.
Main Researchers:
Marc Najork, Manager of Programming Technology at Compaq's Systems Research Center, DBLP Publications List
Raymie Stata, Research Scientist at Compaq's Systems Research Center, DBLP Publications List
Former Group Members:
Monika Henzinger
Krishna Bharat
Andrei Broder, Vice President of Research at AltaVista, DBLP Publications List
Main Publications:
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener, Graph structure in the Web: experiments and models. WWW9, 2000.
Raymie Stata, Krishna Bharat, Farzin Maghoul, The Term Vector Database: fast access to indexing terms for Web pages. WWW9, 2000.
Monika Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork, On near-uniform URL sampling. WWW9, 2000.
Jeffrey Dean and Monika Henzinger, Finding Related Pages in the World Wide Web. WWW8, 1999.
Krishna Bharat, Andrei Broder, Jeffrey Dean, and Monika Henzinger, A Comparison of Techniques to Find Mirrored Hosts on the WWW. ACM DL, 1999.
Allan Heydon, Marc Najork, Mercator: A scalable, extensible Web crawler. WWW Journal, 1999.
Krishna Bharat and Monika Henzinger, Improved Algorithms for Topic Distillation in a Hyperlinked Environment. SIGIR, 1998.
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian, The Connectivity Server: fast access to linkage information on the Web. WWW7, 1998.
Publication Collections:
By the Web Archaeology group
By Marc Najork
By Raymie Stata
By Monika Henzinger

NEC Research Institute Group

Systems:
ResearchIndex (CiteSeer) CS bibliography search and navigation engine
Inquirus metasearch engine (demo)
Inquirus2 preference based metasearch engine
Main Researchers:
C. Lee Giles, Professor at Pennsylvania State University, Consulting Scientist at NEC Research Institute, DBLP Publications List
Steve Lawrence, Research Scientist at NEC Research Institute, DBLP Publications List
Kurt Bollacker, Ph.D. from University of Texas?, DBLP Publications List
Eric Glover, Ph.D. candidate at University of Michigan, DBLP Publications List
Main Publications:
Steve Lawrence, C. Lee Giles, and Kurt Bollacker, Digital Libraries and Autonomous Citation Indexing. IEEE Computer, 1999.
Steve Lawrence, Kurt Bollacker, and C. Lee Giles, Indexing and Retrieval of Scientific Literature. CIKM, 1999.
Steve Lawrence and C. Lee Giles, Inquirus, The NECI Meta Search Engine. WWW7, 1998.
Steve Lawrence and C. Lee Giles, Context and Page Analysis for Improved Web Search. IEEE Computer, 1998.
Gary Flake, Steve Lawrence, and C. Lee Giles, Efficient identification of Web communities. KDD, 2000.
Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori, Focused Crawling Using Context Graphs. VLDB, 2000.
David M. Pennock, Eric Horvitz, and C. Lee Giles, Social Choice Theory and Recommender Systems: Analysis of the Axiomatic Foundations of Collaborative Filtering. AAAI, 2000.
Steve Lawrence and C. Lee Giles, Accessibility of Information on the Web. Nature, 1999.
Steve Lawrence and C. Lee Giles, Searching the World Wide Web. Science, 1998.
Publication Collections:
By C. Lee Giles
Web search-related by Steve Lawrence
Digital libraries-related by Steve Lawrence
By Eric Glover

Teoma/DiscoWeb Project

Implementation: Teoma prototype search engine
Project's home page: DiscoWeb Project
Current (Known) Researchers:
Apostolos Gerasoulis, Professor at Rutgers University, DBLP Publications List
Tao Yang, Associate Professor at University of California Santa Barbara, DBLP Publications List
Brian Davison, Ph.D. candidate at Rutgers University, DBLP Publications List
Main Publications:
Brian Davison, Apostolos Gerasoulis, Konstantinos Kleisouris, Yingfang Lu, Hyun-ju Seo, Wei Wang, and Baohua Wu, DiscoWeb: Applying Link Analysis to Web Search. WWW8 poster, 1999.
Brian Davison, Topical Locality in the Web. SIGIR, 2000.
Brian Davison, Recognizing Nepotistic Links on the Web. AAAI workshop, 2000.
Publication Collections:
By Brian Davison
News Coverage:
Make Room For Teoma, SearchEngineWatch.com, July 2001.

Conferences and other Web Information Retrieval Resources

Web Information Retrieval Resources Site maintained by Einat Amitay, including
Web IR-related Upcoming Conferences
Web IR-related Conference Proceedings
WebIR Yahoo Groups mailing list


Acknowledgments

This research project has been supported in part by the National Science Foundation under grants IRI-9711635 and MIP-9710622.


Last updated by Maxim Lifantsev