This document is somewhat outdated and should be considered as complementary to the Open GRiD Project Research Papers.
In any case be sure to read the document describing the Open GRiD project before reading this page!
This document is a draft outlining the proposed architecture for the Open GRiD project and some other implementational and development issues.
The parts in need of expansion or work are indicated by [[double square brackets]].
You are highly encouraged to review this proposal and submit your opinions, suggestions, and comments to the development discussion mailing list or e-mail them to me directly (maxim@cs.sunysb.edu).
Opinions about all the work (RE's) of an AE contribute to the overall opinion about this AE as the author of all this work. But what is ranked and categorized in the first place are the individual RE's (though AE's can also be ranked directly).
Hence, both individual RE's and their AE are ranked and categorized,
but the opinions made by an AE on any of its RE's (Web pages)
are weighted according to the ranks of AE as a whole,
not according to the ranks of that individual RE.
Justification: Ranks in very different fields of work (RE's) will not
influence each other much, but ranks of all the work in categories
(or qualities) similar to all the work
(e.g. qualities like "clarity", "web page design", etc.)
must (and will) influence each other and the weight of opinions
of this AE in these categories (or qualities).
Ranks of a group AE influence and are influenced by the ranks of the member AE's according to the categorized participation quotas. The quotas describe the degree of participation of each individual AE in the work of the group AE.
Open GRiD should provide both a validation service that validates the AA description when pages are submitted, and a notification service that informs AE's about pages that merely claim to be in the authorship area of an AE.
Abuse note: Since all these AE relationships are public, declaring a part of your site to be a separate virtual AE might not be a good idea: it does not prevent other AE's from ranking you as an AE directly, taking into account the knowledge that you are playing with split AE's.
An AE can state or provide (by making it a part of one or many of RE's in its authorship area) the following kinds of opinions and information:
The way to provide a mirror page for the current page is to put
<LINK rel="alternate" href="$mirror-uri$"> in the HEAD section of the page.
The document at $mirror-uri$ should have a similar link back to this document for such links to count.
Note that we can put many such mirror defining links on one page. This can be used for example when a particular page can be accessed using many different URI's: we can (should) put all these URI's as mirror links on that page.
"alternate" is a part of HTML 4.0 standard:
http://www.w3.org/TR/REC-html40/types.html#type-links.
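The reciprocity rule for mirror links can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, pages are supplied as already-fetched HTML text keyed by absolute URI, and relative href values are not resolved.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collects href values of <LINK rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "alternate" in rel_values and attr.get("href"):
            self.alternates.add(attr["href"])


def declared_mirrors(html_text):
    """Returns the set of mirror URIs declared by one page."""
    parser = AlternateLinkParser()
    parser.feed(html_text)
    return parser.alternates


def confirmed_mirrors(pages):
    """pages maps URI -> HTML text (hypothetical input format).

    A mirror pair counts only when each page declares the other.
    """
    declared = {uri: declared_mirrors(text) for uri, text in pages.items()}
    confirmed = set()
    for uri, targets in declared.items():
        for target in targets:
            if target != uri and uri in declared.get(target, ()):
                confirmed.add(frozenset((uri, target)))
    return confirmed
```

A one-sided declaration (a page naming a mirror that does not link back) is simply dropped in this sketch.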
The method to indicate that a particular page is
(a mirror of) the root document of an AE is
to put
Since the set of allowed values of name in META
is not fixed by the standard
(see http://www.w3.org/TR/REC-html40/struct/global.html#edef-META),
such HTML code should be okay with the standard.
An AE can and should provide its name by putting
Where $author-name$ is the human-readable name of the AE.
Similarly an AE can provide a link to its description using
Where $description-uri$
must end with #$tag-name$
and $tag-name$
must be defined as follows in the document
pointed by $description-uri$:
Also an AE must either directly provide a contact e-mail to receive
Open GRiD-related messages using
The choice is here between making e-mail very easily extractable
(note that if one wants to get somebody's e-mail
or collect a number of e-mails, it is not that difficult to do so
without such explicit information)
and having all messages pass through the Open GRiD server.
To add new pages to the authorship area one can put
Where the $new-page-uri$ points to
a page to be included in the AA of this AE.
For $new-page-uri-prefix$ all its continuations
are declared to be included in the AA.
(This is for CGI-script URI's.)
Note that these links are generally not rendered by browsers.
Also one can use visible links for the same purpose
(in the BODY of a document already established to be in the AA):
But note that rel
for AREA, FORM, FRAME,
and IFRAME is not in the HTML 4.0 standard.
We can also use equivalent comment style for things not in the standard,
for example:
Generally we will use this form of post-comments
or replacing comments (for non-standard HTML elements)
when what we need is not in the standard.
The above links must be complemented by links going in the opposite
direction in order to count as valid.
These towards-the-root links again can be put invisibly
in the HEAD section of a non root document:
Where $to-parent_uri$
should (transitively) lead to (a mirror of) the root document of the AE.
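The requirement that from-root links be matched by towards-root links can be sketched as a fixed-point computation. This is a simplified illustration, not the proposed implementation: the function name and input format are hypothetical, mirrors are ignored, and same-author-cgi prefixes are not handled.

```python
def authorship_area(root, same_author, to_author):
    """Computes the set of pages confirmed to be in the AA rooted at `root`.

    same_author: URI -> set of URIs that page claims via rel="same-author".
    to_author:   URI -> set of URIs that page points to via rel="to-author".
    Both mappings are assumed to be extracted beforehand.
    """
    area = {root}
    changed = True
    while changed:
        changed = False
        for page in list(area):
            for claimed in same_author.get(page, ()):
                # A claim counts only if the claimed page links back to a
                # page that is already confirmed, so by induction its
                # to-author links lead transitively to the root.
                if claimed not in area and to_author.get(claimed, set()) & area:
                    area.add(claimed)
                    changed = True
    return area
```

Iterating until nothing changes makes the result independent of the order in which claimed pages are visited.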
To categorize and rank an RE (a Web page) one can use opinion links (OL's).
Opinion links use these extra attributes of A (anchor) element
(all of them but href are not in the standard):
href and cat are the only required attributes.
Attribute definitions
An opinion link says that its authoring entity
places the referenced resource in the specified categorization
with the specified rank.
The rank reflects merits/value/importance of the resource
w.r.t. the specified categorization.
The description should briefly reflect the contents of the referenced
resource w.r.t. the specified categorization.
The RE specified by the href attribute
can be a part of authorship area of some AE known
to the Open GRiD engine or just a regular web page, not annotated
to be a part of an authorship area.
In short OL's will be written as tuples:
ol = < re, ca, r, co, d, e >.
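As an illustration of the tuple form, the sketch below extracts opinion links from a page and turns each one into an < re, ca, r, co, d, e > record. The class and function names are hypothetical, the time attributes are ignored, and rank and confidence are parsed loosely without range checks; the defaults of 0% for rank and 50% for confidence are the ones given in this document.

```python
from html.parser import HTMLParser
from typing import NamedTuple, Optional


class OpinionLink(NamedTuple):
    re: str            # URI of the ranked resource (href)
    ca: str            # categorization (cat)
    r: int             # rank in percent, default 0
    co: int            # confidence in percent, default 50
    d: Optional[str]   # description URI (descr)
    e: Optional[str]   # explanation URI (expl)


def _percent(text, default):
    """Parses '75', '75%' or '-40%'; returns the default when absent."""
    if text is None:
        return default
    return int(text.strip().rstrip("%"))


class OpinionLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # The presence of cat tells an opinion link from a normal link.
        if tag != "a" or not attr.get("href") or not attr.get("cat"):
            return
        # rank-of in rel marks a direct opinion link; those are skipped here.
        if "rank-of" in (attr.get("rel") or "").lower().split():
            return
        self.links.append(OpinionLink(
            re=attr["href"],
            ca=attr["cat"],
            r=_percent(attr.get("rank"), 0),
            co=_percent(attr.get("conf"), 50),
            d=attr.get("descr"),
            e=attr.get("expl"),
        ))


def extract_opinion_links(html_text):
    parser = OpinionLinkParser()
    parser.feed(html_text)
    return parser.links
```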
To rank an AE (author) directly one can use direct opinion links (DOL's).
Direct opinion links use these extra
attributes of A (anchor) element
(all of them but href and rel
are not in the standard):
href, rel and cat
are the only required attributes.
Presence of rank-of in rel
indicates that this is a direct opinion link.
A direct opinion link says that its authoring entity
ranks the referenced AE in the specified categorization
with the specified rank.
The description should briefly reflect the
main information about the referenced authoring entity
w.r.t. the specified categorization.
In short DOL's will be written as tuples:
dol = < ae, ca, r, co, d, e >.
In order to reduce the needed amount of text
to specify a set of opinions in some subcategory one can use
category prefix specification syntax,
which is to use this non-standard attribute of SPAN element:
Attribute definitions
In order to provide several (direct) opinion links sharing the same
ranked entity, but having different categorizations, one can use opinion bags:
Opinion Bag is bounded by start and end tags of non-standard
element OPBAG;
start tag of OPBAG
must be immediately followed by A element
that provides the href for all opinion links in the bag
(it can be a regular link or an opinion link);
OPBAG attributes can provide default values for
catpref, rel, rank, conf,
descr, expl
and %timeattrs;
for the opinion links in the bag.
The (additional) opinion links in the bag are defined by non-standard
elements OPI
(all of the attributes are not in the standard):
Attribute definitions
To provide descriptions of categorizations
one can use category description opinions (CDO's).
CDO's are specified using these non-standard attributes
of non-standard element OPI:
otype, cat, and descr
are the required attributes here.
For example if we have the following categorization/description pairs
specified
In short, elementary CDO's (i.e. the ones without *)
will be written as tuples:
cdo = < ca, co, d, e >.
To express an opinion about how much rank of an RE in one category
should influence its rank in another category
one can use category influence opinions (CIO's),
which are specified by using these non-standard attributes
of non-standard element OPI:
otype, cat, dest-cat,
and infl are the required attributes here.
Attribute definitions
In total, we have 7 types of original/influenced category specifications:
Here are some examples:
This opinion creates destination categorization as a
regular categorization and populates it with some RE's
from the original category.
(Anything can be put directly in either of the categorizations.)
otype="cat-equiv" means that we have two symmetrical
category influence opinions.
It does not look like we need to have some explicit
Category Alias Opinions: CIO's subsume those.
We also will not have explicit opinions reflecting
subcategory/RE properness w.r.t. a category:
these relations can be captured by opinion links and CIO's.
In short form CIO's will be written as tuples:
cio = < oc, ic, i, co, e >.
One can use comment opinion (CO) "links"
in order to provide non-author comments to an RE;
CO's are specified using these extra attributes of A
(anchor) element
(all of them but href and rel are not in the standard):
href, rel, descr,
and cbody are the required attributes here.
Presence of comment-to in rel
indicates that this is a comment.
Attribute definitions
In short, CO's will be written as tuples:
co = < ce, ct, cb, ca, cc >,
where ce combines href and hrefext.
Groups actively discussing some topic might wish to
set up and use a partial OpenGRiD server to
update and serve, in a timely manner, comments on the
set of RE's discussed by the group.
Maybe some time later HTTP servers will maintain and serve
information about such third-party comments
for the pages on them.
To specify composition of a group authorship entity
one can use composition links on the pages of the group AE
together with corresponding participation links on the pages
of participating AE's.
These are specified using these extra attributes of A
(anchor) element
(all of them but href and rel
are not in the standard):
href and rel
are the required attributes here.
Attribute definitions
The values of quotas for the same categorization for all
participants should sum up to 100.
These are specified using these extra attributes of A
(anchor) element
(all of them but href and rel
are not in the standard):
href and rel
are the required attributes here.
The corresponding participation quotas should be the same
for both kinds of links in order to count.
If they are not, then that portion of work in that category
is credited to (or blamed on) no one.
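The matching rule for quotas can be sketched as below for a single categorization. The function name and the dictionary inputs are hypothetical; the sketch assumes the quotas stated on the group AE's pages sum up to 100 and rejects input that does not.

```python
def credited_quotas(composition, participation):
    """Returns (credited, uncredited) for one categorization.

    composition:   member -> quota stated by the group AE (composition links).
    participation: member -> quota stated by the member AE (participation links).
    A quota counts only when both sides state the same value.
    """
    if sum(composition.values()) != 100:
        raise ValueError("quotas stated by the group AE should sum up to 100")
    credited = {member: quota for member, quota in composition.items()
                if participation.get(member) == quota}
    # Whatever is not confirmed by both sides is credited to no one.
    return credited, 100 - sum(credited.values())
```

For example, if the group states 60 for one member and 40 for another, but the second member states 50, only the first quota counts and 40 remains uncredited.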
Group AE functionality implementation can be added later.
Both kinds of group participation links can be used with
OPBAG's specifying
many categorizations, quotas, descriptions, and explanations
for the same initial link specified only once.
Categorization describes a category of knowledge and
possibly a particular quality in that category and a level of presentation
of the information (and hence the expected proficiency of the reader).
Where the level names are case insensitive,
whereas category and quality names are case sensitive.
[[Maybe we can have style sheets for $level-point$ names.]]
Qualities are good to have because things tend to have
many (but not too many) important qualities
contributing to the overall rating;
also RE's should automatically get similar ranks
for same qualities in close categories
(which can be achieved only by having qualities as a notion known
to the Open GRiD engine).
Presentation scale is an important general "quality"
or search criterion.
Examples:
Absence of the quality means rating in the overall quality.
The default is "/::/@", that is, no categorization,
no quality specification, and unspecified level (i.e. all levels).
The initial implementation might disregard qualities
and levels for simplicity.
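A categorization string could be split into its three parts roughly as follows. The shape assumed here (category path, optional "::" plus quality path, optional "@" plus level) is inferred from the default "/::/@" and from examples such as /Computers::/Quality@intro elsewhere in this document; it is not a complete grammar, and the names are hypothetical.

```python
from typing import NamedTuple


class Categorization(NamedTuple):
    category: str   # case-sensitive path, "/" when unspecified
    quality: str    # case-sensitive path, "/" means the overall quality
    level: str      # lower-cased level name, "" means all levels


def parse_categorization(text="/::/@"):
    """Splits 'category::quality@level' into its parts (assumed syntax)."""
    body, _, level = text.partition("@")
    category, _, quality = body.partition("::")
    # Level names are case insensitive; category and quality names are not.
    return Categorization(category or "/", quality or "/", level.strip().lower())
```

Normalizing only the level to lower case reflects the rule that level names are case insensitive while category and quality names are case sensitive.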
The syntax for the categorization is extended by the ability to
have * at the end of category and/or quality path:
The syntax for the categorization is extended by the ability to
skip category/quality paths and level specification:
The syntax is as of $cat-path$
from the definition of categorization.
The rank is an integer ranging from 0 to 100
with an optional % after it and an optional preceding -
to specify a negative value.
The default value is 0%.
The rank ranges from total disapproval of the merits of the RE
in the category to the complete approval
(i.e. from a statement that RE has completely misleading/harmful
information/work/ideas/etc.
to a statement that RE has absolutely important/necessary
information/work/ideas/etc.
w.r.t. this categorization);
0% means only that the RE belongs to the category.
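A minimal sketch of a parser for the rank syntax described above; the function name is hypothetical, and rejecting malformed values with an exception is an assumption, since the document does not say how invalid ranks should be treated.

```python
import re

# Optional leading "-", an integer, and an optional trailing "%".
_RANK = re.compile(r"^(-?)(\d{1,3})%?$")


def parse_rank(text=None):
    """Returns the rank as an integer in [-100, 100]; the default is 0."""
    if text is None or not text.strip():
        return 0
    match = _RANK.match(text.strip())
    if not match:
        raise ValueError(f"malformed rank: {text!r}")
    value = int(match.group(2))
    if value > 100:
        raise ValueError(f"rank out of range: {text!r}")
    return -value if match.group(1) else value
```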
[[We can make the default to be a small positive value;
this will make all links into opinion links and
achieve Google-like ratings in overall category based on regular links.]]
Confidence is an integer ranging from 0 to 100
with optional % after it.
The default value is 50%.
Confidence provides an AE with a way to voluntarily reduce
the weight of an opinion:
other AE's will not think badly of an AE if it has a wrong
but low-confidence opinion.
Quota is an integer ranging from 0 to 100
with optional % after it.
The default value is 50%.
The influence is an integer ranging
from 0 to 100
with optional % after it
and optional - (to specify negative-only influence)
or + (to specify positive-only influence)
or i (case insensitive;
to specify "inverted" influence).
A number from 0% to 100% states that positive (negative)
rank in the first category
implies positive (negative) rank in the second one with the stated strength.
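The influence syntax and its effect on a rank might be modelled as in the sketch below. Several points are assumptions rather than part of the proposal: the modifier (-, + or i) is taken to be a prefix, "the stated strength" is modelled as simple multiplication, and all names are hypothetical.

```python
import re
from typing import NamedTuple


class Influence(NamedTuple):
    strength: int   # 0..100
    mode: str       # "both", "negative", "positive" or "inverted"


# Assumed form: optional modifier, an integer, and an optional "%".
_INFL = re.compile(r"^([-+iI]?)(\d{1,3})%?$")
_MODES = {"": "both", "-": "negative", "+": "positive", "i": "inverted"}


def parse_influence(text):
    match = _INFL.match(text.strip())
    if not match:
        raise ValueError(f"malformed influence: {text!r}")
    strength = int(match.group(2))
    if strength > 100:
        raise ValueError(f"influence out of range: {text!r}")
    return Influence(strength, _MODES[match.group(1).lower()])


def propagate(rank, influence):
    """Rank implied in the influenced category (multiplicative model, assumed)."""
    if influence.mode == "negative" and rank > 0:
        return 0.0   # negative-only influence ignores positive ranks
    if influence.mode == "positive" and rank < 0:
        return 0.0   # positive-only influence ignores negative ranks
    sign = -1 if influence.mode == "inverted" else 1
    return sign * rank * influence.strength / 100
```

Under this model a rank of 50 passed through an influence of 80% yields 40, while the same rank passed through an inverted influence of 30% yields -15.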
Attribute definitions
The times specified by an AE should be the actual times
to the best of the knowledge of the AE.
The client software should include "expires soon"
notification/search services
as well as easy postpone-expiration button,
so that people are urged/prompted to review and update/correct their opinions
if necessary.
The OpenGRiD engine stores and shows both
the value claimed by the AE and the value it can confirm
(because of the dates and times of crawls of the page)
for creation and update times.
But we can and should provide the user with a way
to combine the opinions of experts and non-experts
with different weights in order to just get such search results.
When we present a category listing to a user
we have the following components in it:
When we present search results to a user
they can be sorted taking into account search word occurrences and
their rank in (for RE's) or their influence on
(for categories) the category to which the search is restricted.
When we present comments to a given page to a user
the comments can be sorted taking into account
Initially when you find some new author/resource
that is not ranked in a category you think it should be ranked,
you should use regular opinion links
to rank/describe the resource in the category or
you can use direct opinion links to rank/describe the author
of the resource directly.
Making an opinion link to an RE/AE w.r.t. a category
means that you think this RE/AE is an important
item in the category (either as a good resource
or a bad example).
When you find some resource/author that is already
ranked in a particular way in a category and you
The steps to perform a category movement
(i.e. restructuring of the category tree) are described below.
[[Is this process too complicated? Can it be simplified?]]
Here is the proposed way to determine default category influences
using just the proximity of
different categories in the current structure of
the categorization "tree":
Note that we will have one overall categorization tree,
the same presentation level scale for all categories,
but the quality trees are going to depend on a
position in the category tree.
The quality tree for an RE is the combination of all
categorizations of this RE that are explicitly stated on some page of some AE.
The quality tree for a category is the combination of all
the quality trees for all RE's that are explicitly categorized
into this category by some AE.
Let's first define influences for two category paths,
two quality paths (in some quality tree), and two levels:
The category influence for two arbitrary categories
is determined using the transitive closure
(defined below)
of the above default CIO's
along the shortest path from the first category to the second one in the
tree.
The intuition is that if something is good (bad) in all immediate subcategories
(subqualities), then it is good (bad) in the category (quality).
We assume here that the subcategories and subqualities are mostly disjoint.
(Note that for a tree with 10-splits the default influence between
two nodes more than 1 arc apart is less than 1%.
Since we are going to disregard opinions with ranks
less than certain threshold (for performance reasons),
the default influences will be very local.)
li = < l1, l2, 10% + (1 - | l1 -l2 | / (lmax-lmin)) * 90%, 20%, "default" >
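The level formula above can be evaluated as in this small sketch. It assumes that presentation levels are mapped to numbers on a common scale from lmin to lmax; the function name and the numeric encoding of levels are hypothetical.

```python
def default_level_influence(l1, l2, lmin, lmax):
    """Returns the default CIO tuple < l1, l2, influence %, confidence %, expl >."""
    if lmax <= lmin:
        raise ValueError("the level scale must have lmax > lmin")
    # 1.0 for identical levels, 0.0 for levels at opposite ends of the scale.
    closeness = 1 - abs(l1 - l2) / (lmax - lmin)
    return (l1, l2, 10 + closeness * 90, 20, "default")
```

On a scale from 0 to 4, identical levels get an influence of 100%, levels two steps apart get 55%, and the two extremes get 10%, always with a confidence of 20%.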
Then we have
< ca1::q1@l1, ca2::q2@l2, rc*rq*rl, coc*coq*col, "default" >
if we have
A more complicated thing might be to take into account the depth in
the directory tree (deeper means more influence; higher, less),
but maybe this is not good/necessary:
if you look at the current web directories, it does not seem right
to have a big influence between all categories except maybe
the parent and some children for all levels of the tree.
Potentially, there might be many default category influence schemes
(An ultimate extension is to allow for some category influence
scheme description language.)
[[We might try to make all the uses of e-mail to be an optional
subscription of an AE to the services of Open GRiD,
but these options maybe should be made public to everyone.]]
We also need "transitive closure" of this information:
transitive closure of the CIO graph as well as new OL's and CO's
induced by the existing OL's, CO's and the closed CIO graph.
Here are the computation rules:
When we have
ol1 = < re, ca, r1, co1, d1, e1 >
and
ol2 = < re, ca, r2, co2, d2, e2 >
i.e. different opinions about the same object
(similarly for CDO's, CIO's, and CO's)
we keep both and filter such cases later.
[[It has to be decided what is better, to compute (and store) this data
explicitly, or to compute it on the fly when it is needed
(or just cache it for a while after computing
together with trying to make the use of this data
local w.r.t. this cache).
Actually (persistent) caching might be the best solution
for such cases: we can regulate the amount of storage/recomputation
by changing the size of the cache
(but we need to invalidate/recalculate the cached data when needed.)]]
It should be a service of Open GRiD engine
to provide all (a part according to the
user specified criteria) of the information about both explicit
and derived opinions.
From the web we get the following info:
for each AE
We want the following data to be computed:
For each AE
[[ CDO data ]]
[[ Group AE data ]]
[[ CO data ]]
For all of these we also want the important information on how this rank
or influence was generated: which RE's of this AE
contributed most to this rank of the AE;
which and whose opinions were most important;
what are the most important negative opinions;
how many AE's contributed to this opinion; and other important
statistics:
for example distribution of AE's ranks (influences)
for the AE's that contributed to a rank
(rank as x axis, and rank*number_of_AE's as y axis);
such statistics might help to see how dependable the rank can be.
We also want the ranks assigned by the AE itself to its RE's.
We compute (maybe on demand) the following data
that does depend on the current global ranks we have:
Rank opinion provided
by AE ae about RE re in category ca:
Justification: This way to generate rank_op's from original OL's through
the derived OL's seems to be the best way to handle
the following kind of situation:
Influence opinion provided by AE ae about categories ca and ca':
It might be better to use minimum instead of the average
in order to prevent the following spamming scenario:
an AE creates a category and makes itself an expert there
(maybe using other fake AE's)
and then tries to influence ranks in some other category
by making a CIO from its "own" category.
Open GRiD engine should also provide to users
all these rank and influence opinions for any given AE.
[[Maybe we should use just the sum of weights (expertness degrees)
of the contributing AE's,
not their average, for coverage degree (maybe normalized with
respect to some current maximum).]]
It's actually good to store
those _den and _num values:
they are good to have for change computations.
It's nice to have some overall statistics like average rank or confidence
of an RE or AE in a category.
[[ We need to specify the algorithms for computing this data
and the needed data structures. ]]
[[Extension]]: We need a way to find RE's that are related to (parts of) a given "root" RE
(and some means for AE to describe it)
in order to index those sub-RE's and return them when a word
we are searching for occurs in them but not in the root RE,
and the categorization we are looking for is met by the "root" RE.
(A default for this can be the URL-path subtree under the "root" RE.)
When returning the search results Open GRiD should indicate this
structuring information (e.g. the search result is this and that,
it is a part of a structure with the root this and that).
I guess we need some redirection/mirror handling methods:
we need URL redirection database to
handle "moved to
(Maybe from web-performance point of view
some fetcher/fetching manager/indexer functionality
for retrieving connected pages like authorship areas
should be combined into one unit
in order to retrieve such connected pages in one crawl,
so that DNS (and routing?) caches are used.)
Initially the overall set of crawled pages is
only the pages with voting links
(i.e. authorship areas) provided by URL submissions to the search engine
(or discovered through links)
and the ranked sites for indexing their data and then searching through it.
[[This list needs to be expanded...]]
Do we need database software to handle storage and retrieval of our data?
This question is not trivial since potentially the project will
have a quite large database that might get too big for the
initially chosen third-party database software.
(In any case the interface with storage and retrieval should be
a clearly defined module.)
Browseable directory structure with directory/site descriptions
(With nice hooks and quick links to other services/functionality.)
Search for words
in such classes as: category/quality name,
description text, explanation text,
the RE text itself.
We'll need some good default
and a flexible access to these advanced features.
Information about an RE:
which AE is the author;
ranks and descriptions of this RE in the categories it is categorized in;
which AE's are the main contributors to a given rank.
Information about an AE:
main page, its opinions and ranks (by category, influence, etc.),
information similar to that of an RE
for the overall ranks of this AE.
A user should have some kind of equalizer-like method
to specify that he/she wants results
ranked using a weighting scheme where
different classes of AE's (according to their ranks in
the category used) have weights different from their ranks.
A service to validate an AE, page, etc. for conformance with the standard
used (this AE may be unknown to the engine
or already indexed by the engine).
This can/should be provided as both a web service
and a standalone program that does the fetching and processing itself:
see the next section.
A service to inform AE's (or just anyone) about important
news as specified by some preferences of the person
interested in such news.
Examples of such news include:
Such notifications should be provided in both
push mode (subscription for certain change information
to be sent every such and such interval)
and in pull mode (tell me what has changed since this and that date
in this and that area).
I think it's nice to have some client side software
that interfaces with the server and provides some
other advanced functionality
reducing the load of the server and possibly the network traffic.
(We can try to move as many tasks from the server into the client
as possible without increasing the required network traffic
and server processing.)
A good way to implement it is, I think, to have some software
acting as a proxy between the browser and the Web.
This allows for natural interception of browser requests
(we do not have to change the displayed HTML's
since all the requests will go through the proxy anyway);
for readily providing additional information from
the Open GRiD server about the current document;
for providing a flexible easily constructible unified interface
by simply generating HTML's on the fly;
and it allows such proxy to be used virtually with any browser
on any OS;
the proxy will also have the access to the user's file system
to store and retrieve fast the needed data.
Let's see what kinds of functionality such client software can provide:
Tools for users to easily download (from Open GRiD server,
or directly from AA of some AE, or convert from their bookmark file),
manage (comfortably browse as a directory, change,
reorganize, search for, etc.),
create (like bookmarks while browsing),
and upload (into special mini-directory place in the AA of this AE)
their opinions.
This is going to also act as a nice bookmark management package.
Note: We will need some syntax to define and use personal category name
shortcuts for each user on that proxy client-side software.
(Like e.g. Networks for comp/internet/structure/networks/lans).
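One possible shape for such shortcuts is sketched below. The syntax is purely an assumption, since the proposal leaves it open: a shortcut is taken to replace the first component of a category path, and the function name is hypothetical.

```python
def expand_shortcut(category, shortcuts):
    """Expands a personal shortcut at the start of a category path.

    shortcuts maps a shortcut name to the full category path it stands for.
    Paths that do not start with a known shortcut are returned unchanged.
    """
    head, sep, rest = category.partition("/")
    if head not in shortcuts:
        return category
    return shortcuts[head] + sep + rest
```

With the shortcut Networks defined as comp/internet/structure/networks/lans, the path Networks/ethernet would expand to comp/internet/structure/networks/lans/ethernet.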
Such software can naturally include, manage, and store any
user preferences regarding the use of the Open GRiD server.
Having nicely managed personal bookmark-like directory
with shortcut names can be extended into having
a personal custom version of the whole search engine functionality.
The customization consists of weighting the set of opinions of this AE
higher than the opinions of others, and then propagating
changes induced by such custom weights.
This can be implemented by downloading the ranks that are going to change
from the Open GRiD server then changing them, storing them,
propagating these changes,
and then using the custom ranks when available instead of the
server ranks while making all the traditional requests for the Open GRiD
server services.
This way of providing custom searching view implies
that the more different, influential, and numerous
are the personal opinions the more storage, bandwidth, and processing power
such user will need to create, use, and maintain his/her custom searching view.
By installing such proxy on a host with a web server,
a user will be able to provide his/her personal searching view to others.
A client proxy can also naturally perform the following functions:
See
The Software Download Page
for currently available code.
The development is (proposed to be) done in C++
(for efficiency, portability, and object-orientedness reasons;
also most existing libraries and searching software are in C or C++).
The server software should run on Linux and other Unix's
(porting should not be a big problem here).
The client should run on both Linux (and Unix's)
and Windows (possibly on Mac's).
Since it's going to be a proxy, accomplishing this should not
be a big problem.
(Junkbuster does it.)
The main development platform is going to be Linux.
I guess the project might need some uniform style
and commenting guidelines, possibly supported by some
tool automatically extracting documentation from source code files.
(E.g. LXR provides
source code browsing and searching without requiring any special format.)
[[Does anybody know about some other nice tools of this sort?]]
Here is the list of some libraries, software, and articles that
can (should) be used as development resources:
Complete, "verify", and adjust the proposal above.
Create a more detailed specification of the system:
parts;
their interactions;
data structures;
algorithms.
Write the interfaces and the code.
Here are descriptions of some coding mini-projects to be done
(see
The Software Download Page
for the current code):
Copyright (C) 1999 Maxim L. Lifantsev
The license for this document is the same as
the
one for the Open GRiD project proposal.
Authorship Entity Root Document Identification
<META name="root-document" content="yes">
in the HEAD section of the page.
Authorship Entity Information
AE Name
<META name="author-name" content="$author-name$">
in the HEAD section of (a mirror of) its root document.
AE Description
<META name="author-descr" content="$description-uri$">
<A name="$tag-name$"> <$element$ $element-attrs$> $some-html-code$ </$element$>
for some element $element$ (e.g. for SPAN);
the
(Such requirement allows us to both extract the description text
and point to it exactly by a link.)
We should issue a warning if $description-uri$
points to a document not
in the authorship area of this AE.
AE E-mail
<META name="author-email" content="$e-mail$">
or register with the Open GRiD server obtaining some login name
and providing to the system non-disclosable e-mail
and then using
<META name="author-id" content="$id_name$">
to let others send e-mails to the author through the
Open GRiD server.
Authorship Area Defining Links
From-Root-Document Links
<LINK rel="same-author" href="$new-page-uri$">
<LINK rel="same-author-cgi" href="$new-page-uri-prefix$">
in the HEAD section of a document already established
to be in the authorship area (AA).
<A rel="same-author" href="$new-page-uri$">
<AREA rel="same-author" href="$new-page-uri$">
<FORM rel="same-author-cgi" action="$new-page-uri-prefix$">
<FRAME rel="same-author" src="$new-page-uri$">
<IFRAME rel="same-author" src="$new-page-uri$">
<AREA href="$new-page-uri$">
<!-- $OpenGridTag$ AREA rel="same-author" -->
Where $OpenGridTag$ is OpenGrid (case insensitive).
The code between <!-- $OpenGridTag$
and --> is parsed as the code between
< and > in HTML.
There can be white-space characters between
<!-- and $OpenGridTag$.
Note that the syntax without comments is okay with the current HTML browsers
because they are required to be built so that
they simply ignore what they do not understand.
Towards-Root-Document Links
<LINK rel="to-author" href="$to-parent-uri$">
or visibly in the BODY of a non root document:
<A rel="to-author" href="$to-parent-uri$">
Opinion Link
<!ATTLIST A
href %URI; -- URI of the RE (the page that is categorized and ranked) --
cat %Category; -- categorization of the RE --
rank %Rank; -- rank of the RE in the categorization --
conf %Confidence; -- confidence of the categorizing/ranking opinion --
descr %URI; -- RE description URI --
expl %URI; -- explanation URI --
%timeattrs; -- time specifying attributes --
>
cat is used here as the indicator to tell
an opinion link from a normal link.
The URI should satisfy the requirements
defined for $description-uri$
in section AE Description.
The URI should satisfy the requirements
defined for $description-uri$
in section AE Description.
The description is to be used in a directory listing
or search result listing to give the user short description
of the resource.
Different descriptions will be extracted and
the ones whose copies are provided by AE's with highest
combined weight in the categorization are to be displayed.
(Hence descriptions can be mirrored or referenced.)
[[Describe how exactly descriptions determine the
description in the directory listing.]]
Direct Opinion Link
<!ATTLIST A
href %URI; -- URI of some RE of the ranked AE --
rel %LinkTypes; -- must include "rank-of" --
cat %Category; -- categorization of the rank --
rank %Rank; -- rank of the AE in the categorization --
conf %Confidence; -- confidence of the categorizing/ranking opinion --
descr %URI; -- AE description URI --
expl %URI; -- explanation URI --
%timeattrs; -- time specifying attributes --
>
The description is to be used in a directory listing
or search result listing to give the user short description
of the AE.
Category Prefix
<!ATTLIST SPAN
catpref %CatPrefix; -- category prefix --
>
Opinion Bag
Opinion bags cannot be nested.
<!ATTLIST OPI
otype CDATA -- must be equal to "link" --
cat %Category; -- categorization of the RE --
rank %Rank; -- rank of the RE in the categorization --
conf %Confidence; -- confidence of the categorizing/ranking opinion --
descr %URI; -- RE description URI --
expl %URI; -- explanation URI --
%timeattrs; -- time specifying attributes --
>
Category Description Opinion
<!ATTLIST OPI
otype CDATA -- must be equal to "cat-descr" --
cat %ExtCategory; -- categorization --
descr %URI; -- categorization description URI --
conf %Confidence; -- confidence of the description opinion --
expl %URI; -- explanation URI --
%timeattrs; -- time specifying attributes --
>
/Computers - comps
/*::/*@intro - intro
/Computers::/*@intro - comp-intro
/Hardware - hardw
/*::/Quality - qual
/Computers::/Quality - comp-qual
Then we will have the following categorization/description pairs
presented to the user
/Computers - comps
/Computers@intro - comp-intro
/Hardware - hardw
/Hardware@intro - hardw, intro
/Hardware/Quality - hardw, qual
/Hardware/Quality@intro - hardw, qual, intro
/Computers::/Quality - comp-qual
/Computers::/Quality@intro - comp-qual, intro
Category Influence Opinion
<!ATTLIST OPI
otype CDATA -- must be equal to "cat-infl" or "cat-equiv" --
cat %ExtCategory; -- original categorization --
dest-cat %InfCategory; -- influenced categorization --
infl %Infl; -- degree of influence --
conf %Confidence; -- confidence of the influence opinion --
expl %URI; -- explanation URI --
%timeattrs; -- time specifying attributes --
>
Each of the 3 parts of the original categorization
(category, quality, and level)
can be a point or a subtree/interval (provided at least one of them
is a point).
When something is a point, it can be any other point in the
influenced category.
When something is a set, each element of the set will be mapped
into the same element of the set in the influenced category
(and the specification of that part of categorization should be missing
in the influenced category).
/Comp/Soft/Games::/absence_of_violence -->
/Comp/Soft/Education::/wide_applicability
means that /absence_of_violence in /Comp/Soft/Games
influences /wide_applicability in /Comp/Soft/Education.
/Comp/Soft/*::/reliability --> ::/
means that for the whole subtree of /Comp/Soft
/reliability influences / (the overall rank)
/Comp/Soft/Databases::/speed/* --> /Comp/Soft/Web_Commerce
means that the whole quality subtree /speed/*
in /Comp/Soft/Databases influences the corresponding
qualities in /Comp/Soft/Web_Commerce
But, individual bookmark mini-directories can support aliases
for convenience of the only user and maintainer of the mini-directory.
[[Or can they be modeled by symmetric non-exported CIO's?]]
Comment Opinion
<!ATTLIST A
href %URI; -- URI of the RE (Web page) commented upon --
hrefext CDATA -- information to target the href more precisely --
rel %LinkTypes; -- must include "comment-to" --
ctype NAME -- type of the comment --
descr %URI; -- comment title/description URI --
cbody %URI; -- comment body URI --
cat %Category; -- categorization topic for the comment --
conf %Confidence; -- confidence of the comment opinion --
%timeattrs; -- time specifying attributes --
>
The URI should satisfy the requirements
defined for $description-uri$
in section AE Description.
Group Authorship Entity Composition/Participation
Composition Link
<!ATTLIST A
href %URI; -- URI of some page of the participating AE --
rel %LinkTypes; -- must include "participant" --
cat %Category; -- categorization of participation --
quota %Quota; -- participation quota in the specified categorization --
descr %URI; -- participation description URI --
expl %URI; -- explanation URI --
%timeattrs; -- time specifying attributes --
>
Participation Link
<!ATTLIST A
href %URI; -- URI of some page of the group AE --
rel %LinkTypes; -- must include "workgroup" --
cat %Category; -- categorization of participation --
quota %Quota; -- participation quota in the specified categorization --
descr %URI; -- participation description URI --
expl %URI; -- explanation URI --
%timeattrs; -- time specifying attributes --
>
This is to prevent making an AE responsible or credited for
a work it did not participate in.
HTML Data Type Definitions
Categorization
<!ENTITY % Category "CDATA"
-- category description
-->
This is represented as a path in directory structure,
a path in quality structure,
and a point or an interval on the level scale:
$Category$ ::= $cat-path$ [ :: $qual-path$ ] [ @ $pres-level$ ]
$cat-path$ ::= [ / ] $path-name$ [{ / $path-name$ }] [ / ]
$qual-path$ ::= [ / ] $path-name$ [{ / $path-name$ }] [ / ]
$pres-level$ ::= $level-point$ [ .. $level-point$]
$level-point$ ::= $digit$
| intro | begnr | interm | advncd | expert
$path-name$ ::= . | / | { $alpha$ | $digit$ | _ | - }
$level-point$ scale is 1..9;
default $level-point$ values:
intro = 1, begnr = 3, interm = 5,
advncd = 7, expert = 9.
/Computers/Internet/Search_Engines
/Computers/Software::/reliability
/Computers/Software::/productivity@intro
/Computers/Software/OS_Shells::/interface/ease_of_learning
/Computers/Software/Web_Browsers::/interface/ease_of_use
/Computers/Software/Web_Browsers::/standard_compliance@interm..expert
/Computers/Software/Operating_Systems::/quality/crash_rate
/Computers/Software/Games::/price
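To make the grammar above more concrete, here is a minimal parsing sketch in Python. It is only an illustration under stated assumptions, not part of the proposal: the names `Categorization`, `parse_level_point`, and `parse_categorization` are hypothetical, no whitespace is allowed around `::`, `@`, and `..`, the digit 0 is rejected because the scale is 1..9, and the path parts are kept as plain strings without further validation.

```python
from typing import NamedTuple, Optional, Tuple

# Default numeric values of the named presentation levels (scale 1..9).
LEVEL_NAMES = {"intro": 1, "begnr": 3, "interm": 5, "advncd": 7, "expert": 9}


class Categorization(NamedTuple):
    cat_path: str                     # path in the directory structure
    qual_path: Optional[str]          # path in the quality structure, if given
    level: Optional[Tuple[int, int]]  # (low, high) presentation levels, if given


def parse_level_point(text: str) -> int:
    """Turn a level name or a single digit 1..9 into its numeric value."""
    if text in LEVEL_NAMES:
        return LEVEL_NAMES[text]
    if len(text) == 1 and text in "123456789":
        return int(text)
    raise ValueError(f"bad level point: {text!r}")


def parse_categorization(text: str) -> Categorization:
    """Split 'cat-path [:: qual-path] [@ pres-level]' into its three parts."""
    body, at, level_text = text.partition("@")
    cat_path, sep, qual_path = body.partition("::")
    if not cat_path:
        raise ValueError("the category path is required")
    level = None
    if at:
        # A single point is treated as the interval from that point to itself.
        low_text, dots, high_text = level_text.partition("..")
        low = parse_level_point(low_text)
        high = parse_level_point(high_text) if dots else low
        if low > high:
            raise ValueError(f"empty level interval: {level_text!r}")
        level = (low, high)
    return Categorization(cat_path, qual_path if sep else None, level)
```

For example, `parse_categorization("/Computers/Software::/productivity@intro")` yields the category path `/Computers/Software`, the quality path `/productivity`, and the level interval `(1, 1)`.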
Extended Categorization
<!ENTITY % ExtCategory "CDATA"
-- extended category description
-->
$ExtCategory$ ::= $cat-path$ [ * ] [ :: $qual-path$ [ * ] ] [ @ $pres-level$ ]
Star means that we are specifying the whole category or quality
subtree under the specified path.
Inferred Categorization
<!ENTITY % InfCategory "CDATA"
-- category description with possibly missing parts
-->
$InfCategory$ ::= [ $cat-path$ ] [ :: [ $qual-path$ ] ] [ @ [ $pres-level$ ] ]
Category Prefix
<!ENTITY % CatPrefix "CDATA"
-- category prefix
-->
Rank
<!ENTITY % Rank "CDATA"
-- rank of a ranked entity
-->
Confidence
<!ENTITY % Confidence "CDATA"
-- author confidence in an opinion
-->
Quota
<!ENTITY % Quota "CDATA"
-- participation quota
-->
Influence
<!ENTITY % Infl "CDATA"
-- influence of rank in one categorization on the rank in another
-->
The value of the influence states the amount/strength of the influence.
A number from -0% to -100% states that positive rank in the first
category implies negative rank in the second one with the stated strength.
A number from +0% to +100% states that negative rank in the first
category implies positive rank in the second one with the stated strength.
A "number" from i0% to i100% states that positive (negative) rank
in the first category implies negative (positive) rank
in the second one with the stated strength.
Time Attributes
<!ENTITY % timeattrs
"add-time %Datetime; -- creation time --
upd-time %Datetime; -- last update time --
exp-time %Datetime; -- expiration time --
vis-time %Datetime; -- last visit time --"
>
The last visited time is the time the AE last looked at that page.
The expiration time should be determined by a default expiration interval
(say 4 months) from the last review of the entity about
which the opinion is expressed.
The last visited time for the engine is the time
a user visited the page using a link provided by the engine.
The expiration time is obtained from the value supplied by the AE
by capping it with a maximal validity interval (of say 5 months).
[[The weight of expired opinions should be depreciated.]]
Time Information about a Page that should be Provided by the Engine
Notes and Issues
I think this and that AE should have this and that opinion.
(But in reality that AE does not have such opinion expressed.)
If we were to provide this, the current "experts" (i.e. AE's with high ranks)
in the field could monopolize the field by stating that opinions of experts
in this field should be valued more than opinions of non-experts.
Use of Ranks for Viewing
Guide on When to Provide Which Opinions
Initial Ranks
Follow-up Opinions
then you can support that rank by making a similar one
or/and by ranking positively the author(s) who contribute to the rank
the way you think is right, w.r.t. this category.
The first way will strengthen the rank in question
and increase/decrease expertness of the author of the resource
changing the weight of his/her opinions in this category.
The second way will increase expertness of the author(s)
that ranked the resource in question,
thus changing the weight of all his/her/their present and future opinions
in this category
and hence in particular influencing the rank of the original resource.
then you can rank the RE/AE in question as you think it should be done
explaining your opinion.
If you also think that making such an opinion does not go together with
the expert status in the category, you can rank the
author(s) of the offending opinion(s).
It might be also a good idea to contact the author(s) of the
opinion(s) and try to argue that the opinion(s) is/are wrong.
then that means you disagree with the expert status
of the author(s) of the opinion(s)
and you can proceed as described in the previous item.
Note that in this situation
you should not rank the original RE/AE negatively in the
category as it will *increase* the degree this site belongs
to this category!
Category Movement Process
Note that it should be performed more or less together by the majority of
experts in the category.
Default/General Category Influence Schemes
(All such quality trees supposedly will be rather small,
but they are going to be different for different parts of the
category tree.
We need "local" quality trees because
the default quality influences will be determined by
the degree of splitting of the quality tree;
hence, a combined quality tree would dilute the important influences.)
ci = < ca, ca/ca1, 100% / m, 20%, "default" >
ci = < ca/ca1, ca, 100% / m, 20%, "default" >
where ca is a category (or quality) path;
ca1 is a name of subcategory (subquality);
m is the number of subcategories (subqualities in that given tree)
of ca;
20% is the tentative "low confidence value".
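The default category (or quality) influences above can be sketched as follows. This is only an illustration: the function name `default_influences` and the dictionary representation of the tree (a parent path mapped to the names of its subcategories) are assumptions made for the example, and percentages are written as fractions.

```python
from typing import Dict, List, Tuple

LOW_CONFIDENCE = 0.20  # the tentative "low confidence value" from the text

# One default influence: (from path, to path, influence, confidence, explanation)
Influence = Tuple[str, str, float, float, str]


def default_influences(tree: Dict[str, List[str]]) -> List[Influence]:
    """Generate the default ci tuples in both directions for every parent."""
    result: List[Influence] = []
    for parent, children in tree.items():
        m = len(children)  # number of subcategories (subqualities) of ca
        for name in children:
            child = f"{parent}/{name}"
            strength = 1.0 / m  # 100% / m
            result.append((parent, child, strength, LOW_CONFIDENCE, "default"))
            result.append((child, parent, strength, LOW_CONFIDENCE, "default"))
    return result
```

With two subcategories under `/Comp`, each of the four generated influences has strength 0.5 and confidence 0.2, which shows how a wider split dilutes each individual default influence.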
li = < "all", l, 100% / n, 20%, "default" >
li = < l, "all", 100% / n, 20%, "default" >
where l, l1, l2 are presentation levels;
| l1 -l2 | is the numerical distance
between l1 and l2;
lmax-lmin is the maximal numerical
distance between two presentation levels;
10% is the tentative value for the influence between
the maximal and the minimal levels;
n is the number of levels.
ci = < ca1, ca2, rc, coc, "default" >,
qi = < q1, q2, rq, coq, "default" >, and
li = < l1, l2, rl, col, "default" >,
where quality influence is computed in
the quality tree that is the result of merging
of the quality trees for categories ca1 and ca2.
[[Or maybe we should increase the confidence here a bit?]]
Such schemes can be presented on RE's which are to be ranked
in special subcategories named say ".../Subtree_Influence_Scheme";
the ranks of a scheme will influence how much the scheme is used
in the given subtree of the categorization in order to determine
the category influences.
But most probably just one default scheme with low confidence
and some specific category influence opinions provided
by different people will be enough.
Raw Data Collected from the Web
Text (HTML/XML) Data for Indexing and Searching
Should at first include at least the whole authorship areas of
the known AE's as well as all the _ranked_ RE's
(not only the ones located in some authorship area)
and possibly (some) "surrounding" pages of (some) of these RE's
(these might be by default the/some pages "under" the URL path
of the initial RE reachable from it).
Authoring Entity Data
Data about each AE:
what is its authoring area
and the information about AE:
name, description, e-mail,
main Web page (only this we will get for sure),
Authoring Entity Structure Data
The data for all "group AE's":
their structure as a collection of individual AE's.
And the inverted data for each AE: in which "group AE's" does
it participate and how.
(Group AE functionality might be missing in the initial implementation.)
Opinion Data for Ranking
For each AE all the data in its authorship area
regarding the Opinion Links, Category Description Opinions,
Category Influence Opinions, and Comment Opinions.
Note that we have to account for the fact that there might
be "subtree CIO's" that might (partially) overlap
with each other and with OL's and CO's.
generate cio = < oc1, ic2, i1 * i2, co1 * co2, e1 since e2 >
if ic1 = oc2
(This shows that internally Open GRiD needs a more complex representation for
explanations than just one URL);
generate ol = < re1, ic2, r1 * i2, co1 * co2, d1, e1 since e2 >
if ca1 = oc2
generate co = < ce1, ct1, cb1, ic2, cc1 * co2 >
if ca1 = oc2
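The first of the generation rules above (composing two CIO's) can be sketched like this. It is a sketch under assumptions: the `CIO` tuple and the `compose` function are hypothetical names, influences and confidences are fractions, and "e1 since e2" is modeled as a list of explanation URIs, which is one possible form of the more complex explanation representation mentioned above.

```python
from typing import List, NamedTuple, Optional


class CIO(NamedTuple):
    orig_cat: str    # oc: original categorization
    infl_cat: str    # ic: influenced categorization
    infl: float      # i: degree of influence, as a fraction
    conf: float      # co: confidence of the opinion, as a fraction
    expl: List[str]  # e: chain of explanation URIs ("e1 since e2")


def compose(first: CIO, second: CIO) -> Optional[CIO]:
    """Derive < oc1, ic2, i1 * i2, co1 * co2, e1 since e2 > if ic1 = oc2."""
    if first.infl_cat != second.orig_cat:
        return None  # the two opinions do not chain
    return CIO(
        first.orig_cat,
        second.infl_cat,
        first.infl * second.infl,
        first.conf * second.conf,
        first.expl + second.expl,
    )
```

Since both the influence and the confidence are multiplied, every additional step in a chain can only weaken the derived opinion, so long chains fade out naturally.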
(I don't think that keeping only the one with the highest confidence
or with the highest absolute value of the rank is right
- see below on how derived OL's are used.)
The point here is that creation of multiple opinions
about the same thing by an AE should not increase the weight
of the opinion of this AE about this thing.
Rank Computation Scheme
Input Data
Output Data
For each AE:
denoted rank (ae, ca), rank_conf (ae, ca),
and rank_covg (ae, ca).
The higher the positive rank the better expert this AE is in the category;
the higher the negative rank the better
example of a bad expert this AE is in the category.
The confidence shows the aggregated confidence
of the opinion creators in the above rank.
The coverage degree is high if many high-ranked
experts have contributed to this rank.
(It reflects reliability of the rank based on
the number and the expertness of the AE's that have contributed to it.)
We can use it as a threshold to disregard/not display
weakly supported opinions.
These are to be used to weight the opinions of this AE.
[[Currently we use only ranks to weight the opinions
(this is the feedback in the system).
How can we use the other data?
Coverage degree for an AE rank should increase the
importance of its opinions, shouldn't it?]]
For each RE:
denoted rank (re, ca), rank_conf (re, ca),
and rank_covg (re, ca).
(The meanings of rank, confidence, and coverage degree
are similar to those for an AE.)
These are to be used to influence the ranks of the authoring AE
and to determine how relevant this RE is for a particular query.
For each two categorizations:
denoted infl (ca1, ca2), infl_conf (ca1, ca2),
and infl_covg (ca1, ca2).
These are to be used to determine how much ranks in one category
should influence ranks in another category.
Obviously, infl (ca,ca) = 100%,
infl_conf (ca,ca) = 100%,
and infl_covg (ca,ca) = "max".
It would be nice to turn conditions on these kinds of statistical information
into additional search criteria.
Intermediate Help Data
rank_op (ae, re, ca) = < r, w, co, d, e >
(rank, weight, confidence, description, explanation),
where
r = r_i
w = rank (ae, ca_i) * infl (ca_i, ca)
co = co_i * infl_conf (ca_i, ca)
d = d_i
e = e_i since "infl (ca_i, ca)"
for such i
that ol_i = < re_i, ca_i, r_i, co_i, d_i, e_i >
is a derived OL of AE ae such that
re_i = re
and co is the maximal one over all OL's of this AE that
satisfy the previous requirements.
Certainly, we are interested in considering (and possibly storing)
only such rank opinions for which each of
r, w, co is greater than a certain
threshold.
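The definition of rank_op above can be read as the following selection procedure. This is a sketch only: `DerivedOL`, `RankOpinion`, and `rank_opinion` are hypothetical names, the description and explanation fields are left out for brevity, and the current values of rank, infl, and infl_conf are assumed to be available as plain functions.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class DerivedOL(NamedTuple):
    re: str    # re_i: the ranked RE
    ca: str    # ca_i: categorization of the opinion
    r: float   # r_i: rank given by the AE
    co: float  # co_i: confidence of the opinion


class RankOpinion(NamedTuple):
    r: float   # rank
    w: float   # weight
    co: float  # confidence


def rank_opinion(
    ols: Iterable[DerivedOL],
    re: str,
    ca: str,
    ae_rank: Callable[[str], float],           # rank (ae, ca_i)
    infl: Callable[[str, str], float],         # infl (ca_i, ca)
    infl_conf: Callable[[str, str], float],    # infl_conf (ca_i, ca)
) -> Optional[RankOpinion]:
    """Pick, among the AE's derived OL's about re, the one with maximal co."""
    best: Optional[RankOpinion] = None
    for ol in ols:
        if ol.re != re:
            continue
        candidate = RankOpinion(
            ol.r,
            ae_rank(ol.ca) * infl(ol.ca, ca),
            ol.co * infl_conf(ol.ca, ca),
        )
        if best is None or candidate.co > best.co:
            best = candidate
    return best
```

Taking a single best opinion per AE, rather than summing them, keeps an AE from gaining weight merely by stating many opinions about the same RE.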
ae has an OL about re in ca1;
and a CIO from ca1 to ca2;
the system has influence ranks from ca1 to ca2
and from both ca1 and ca2 to ca3;
and we are interested in the opinion
of ae about re in ca3.
infl_op (ae, ca, ca') = < i, w, co, e >
(influence, weight, confidence, explanation),
where
i = i_i
w = (rank (ae, ca) + rank (ae, ca')) / 2
co = co_i * (rank_conf (ae, ca) + rank_conf (ae, ca')) / 2
e = e_i
for such i that
cio_i = < oc_i, ic_i, i_i, co_i, e_i >
is a closure CIO of AE ae such that
oc_i = ca and
ic_i = ca'.
Certainly, we are interested in considering (and possibly storing)
only such influence opinions for which each of
i, w, co
is greater than a certain threshold.
But it might not be necessary if we can prevent creation
of expertness out of thin air.
Or we can have second- or higher-power averaging
((x^n + y^n)/2)^(1/n)
or computationally cheaper similar averaging
like (x + y + n*max(x,y)) / (n + 2)
if we want to have easy expertness status export,
provided the original expertness status can be earned only fairly.
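The two averaging alternatives mentioned above can be compared with a small sketch. The function names are hypothetical, and the sketch assumes non-negative x and y (for negative ranks the power mean would need separate treatment).

```python
def power_mean(x: float, y: float, n: float) -> float:
    """n-th power averaging: ((x^n + y^n) / 2)^(1/n), for x, y >= 0."""
    return ((x ** n + y ** n) / 2) ** (1 / n)


def cheap_mean(x: float, y: float, n: float) -> float:
    """Cheaper averaging biased toward the maximum: (x + y + n*max) / (n + 2)."""
    return (x + y + n * max(x, y)) / (n + 2)
```

Both reduce to the plain average at the low end (n = 1 for the power mean, n = 0 for the cheaper form) and move toward max(x, y) as n grows, which is what makes expertness easier to export from one category into another.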
The Formulas to Compute Output Data
where the sums are over all AE's ae having
rank_op (ae, re, ca) = < r, w, co, d, e >.
where rank_num (re, ca) = sum (r * w * co)
and rank_den (re, ca) = sum (w * co)
where rank_conf_num (re, ca) = rank_den (re, ca)
and rank_conf_den (re, ca) = sum (w)
where rank_covg_num (re, ca) = rank_conf_den (re, ca)
and rank_covg_den (re, ca) = sum (1)
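The sums above can be sketched as a single pass over the rank opinions about one RE in one categorization. One assumption must be stated loudly: the sketch takes each output value to be the ratio of its numerator to its denominator (rank = rank_num / rank_den, and likewise for confidence and coverage), which is how I read the definitions above. `RankOp` and `aggregate_rank` are hypothetical names.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class RankOp(NamedTuple):
    r: float   # rank given by one AE
    w: float   # weight of that AE's opinion
    co: float  # confidence of that opinion


def aggregate_rank(ops: Iterable[RankOp]) -> Optional[Tuple[float, float, float]]:
    """Return (rank, rank_conf, rank_covg), or None if nothing can be computed."""
    ops = list(ops)
    rank_num = sum(o.r * o.w * o.co for o in ops)
    rank_den = sum(o.w * o.co for o in ops)  # also rank_conf_num
    conf_den = sum(o.w for o in ops)         # also rank_covg_num
    covg_den = len(ops)                      # sum (1)
    if rank_den == 0 or conf_den == 0 or covg_den == 0:
        return None
    # Assumption: each value is the ratio of its numerator to its denominator.
    return rank_num / rank_den, rank_den / conf_den, conf_den / covg_den
```

Read this way, the rank is a weighted average of the individual ranks, the confidence is the weight-averaged confidence, and the coverage degree is the average weight of the contributing opinions.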
where the sums are over all RE's re
authored by this ae.
where rank_num (ae, ca) = sum (rank_num (re, ca))
and rank_den (ae, ca) = sum (rank_den (re, ca))
where rank_conf_num (ae, ca) = rank_den (ae, ca)
and rank_conf_den (ae, ca) = sum (rank_conf_den (re, ca))
where rank_covg_num (ae, ca) = rank_conf_den (ae, ca)
and rank_covg_den (ae, ca) = sum (rank_covg_den (re, ca))
That is, the rank of an AE is the rank of a virtual RE,
which is the union of all the RE's authored by this AE.
[[Add handling of direct opinion links!]]
where the sums are over all AE's ae having
infl_op (ae, ca1, ca2) = < i, w, co, e >;
sys_w is some small weight
(say 10%) for the default influence;
sys_i and sys_co
are the default influence and confidence
values for ca1 and ca2 computed as described in
section
"Default/General Category Influence Schemes"
above.
where infl_num (ca1, ca2) = sum (i * w * co) + sys_i * sys_w * sys_co
and infl_den (ca1, ca2) = sum (w * co) + sys_w * sys_co
where infl_conf_num (ca1, ca2) = infl_den (ca1, ca2)
and infl_conf_den (ca1, ca2) = sum (w) + sys_w
where infl_covg_num (ca1, ca2) = infl_conf_den (ca1, ca2)
and infl_covg_den (ca1, ca2) = sum (1) + 1
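The influence sums differ from the rank sums only in the extra system default term, which the following sketch makes explicit. As before, it assumes that each output value is the ratio of its numerator to its denominator; `aggregate_influence` is a hypothetical name, and each influence opinion is passed as a plain (i, w, co) tuple of fractions.

```python
from typing import Iterable, Optional, Tuple


def aggregate_influence(
    ops: Iterable[Tuple[float, float, float]],  # (i, w, co) per contributing AE
    sys_i: float,         # default influence for ca1 and ca2
    sys_co: float,        # default confidence for ca1 and ca2
    sys_w: float = 0.10,  # small weight of the default influence (say 10%)
) -> Optional[Tuple[float, float, float]]:
    """Return (infl, infl_conf, infl_covg) including the system default term."""
    ops = list(ops)
    infl_num = sum(i * w * co for i, w, co in ops) + sys_i * sys_w * sys_co
    infl_den = sum(w * co for _, w, co in ops) + sys_w * sys_co
    conf_den = sum(w for _, w, _ in ops) + sys_w
    covg_den = len(ops) + 1  # sum (1) + 1
    if infl_den == 0 or conf_den == 0:
        return None
    # Assumption: each value is the ratio of its numerator to its denominator.
    return infl_num / infl_den, infl_den / conf_den, conf_den / covg_den
```

With no opinions at all the result is simply the default influence and confidence; a single well-weighted opinion already dominates the default because sys_w is small.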
A good way to compute/adjust these ranks seems to be
some change-propagation algorithm:
see which data items change because of the new/updated information we have;
update these data items and if they change propagate the changes.
Hence we need stored or easily computable lists of what might change
if a certain data item changes.
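A change-propagation pass of this kind could look like the worklist sketch below. It is an illustration under assumptions, not a worked-out algorithm: `propagate` is a hypothetical name, data items are identified by strings, `dependents` is the stored list of what might change when an item changes, and the `tolerance` and `max_updates` parameters are added here so that feedback loops between ranks cannot keep the loop running forever.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def propagate(
    values: Dict[str, float],
    dependents: Dict[str, List[str]],
    recompute: Callable[[str, Dict[str, float]], float],
    changed: Iterable[str],
    tolerance: float = 1e-6,
    max_updates: int = 100000,
) -> int:
    """Recompute items reachable from the changed ones; return the update count."""
    queue = deque(changed)
    queued = set(queue)
    updates = 0
    while queue and updates < max_updates:
        item = queue.popleft()
        queued.discard(item)
        for dep in dependents.get(item, []):
            new_value = recompute(dep, values)
            # Propagate further only if the dependent item really changed.
            if abs(new_value - values.get(dep, 0.0)) > tolerance:
                values[dep] = new_value
                updates += 1
                if dep not in queued:
                    queue.append(dep)
                    queued.add(dep)
    return updates
```

The tolerance doubles as the stopping rule: once an update changes a value by less than the tolerance, its dependents are left alone.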
[[A better way is another type of links,
"inherit links":
the linked document (authored by the same AE) inherits
ranks of the linking document in
the specified "subtree" of the categorization structure.
These links must form trees only.]]
The incentive for an AE to make such links is that
its RE's will have a higher chance to be found;
the incentive to avoid making improper links of this kind is the danger
of being voted down for that.
A way to do it without a new type of links is for the AE
to have opinion links to such sub-pages stating that they are good
in this and that categories;
then if some page of the AE has high rank in the category,
then the AE will have high rank in the category,
then this will make AE's OL count and propagate
this high rank to the subpage.
[[This will not necessarily ensure correct
rank, confidence, and coverage degree inheritance;
further consideration is required!]]
Open GRiD Components and Their Functions
Fetcher
Gets pages using a given set of URL's and puts (compresses)
the pages into a database.
Fetching Manager
Handles indexing requests (which initially can come from
the user, the expiration-based re-crawling strategy, the indexing
component, or other components): orders and buffers them,
removes duplicates and requests for recently crawled pages,
and then distributes the requests to the crawler(s).
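The duties of the fetching manager can be sketched as a small priority queue with duplicate and recency checks. Everything here is an assumption made for illustration: the class and method names, the numeric priorities (lower is served first), the `recrawl_interval` parameter, and the choice to record a URL as fetched at the moment it is handed to a crawler.

```python
import heapq
import time
from typing import Dict, List, Optional, Set, Tuple


class FetchingManager:
    def __init__(self, recrawl_interval: float = 3600.0) -> None:
        self._heap: List[Tuple[int, int, str]] = []  # (priority, sequence, url)
        self._pending: Set[str] = set()              # URLs currently buffered
        self._last_fetched: Dict[str, float] = {}    # url -> dispatch time
        self._seq = 0                                # keeps equal priorities FIFO
        self._recrawl_interval = recrawl_interval

    def request(self, url: str, priority: int = 0,
                now: Optional[float] = None) -> bool:
        """Buffer a request unless it is a duplicate or was crawled recently."""
        now = time.time() if now is None else now
        if url in self._pending:
            return False
        last = self._last_fetched.get(url)
        if last is not None and now - last < self._recrawl_interval:
            return False
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1
        self._pending.add(url)
        return True

    def next_batch(self, size: int, now: Optional[float] = None) -> List[str]:
        """Hand out up to `size` URLs to a crawler, in priority order."""
        now = time.time() if now is None else now
        batch: List[str] = []
        while self._heap and len(batch) < size:
            _, _, url = heapq.heappop(self._heap)
            self._pending.discard(url)
            self._last_fetched[url] = now
            batch.append(url)
        return batch
```

A user request could be given a lower priority number than a routine re-crawl request, so that it is distributed to the crawlers first.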
Parser
Gets newly fetched pages from the database
and processes them
generating new fetching requests for fetching manager;
extracting indexing, link, opinion, author, and other information
from pages;
and generating the needed update requests to
the component managing the searching/ranking/categorization database.
Currently I think that The Berkeley Database (Berkeley copyright)
(see http://www.sleepycat.com/)
should do fine for us.
The Functionality Provided by the Search Engine Interface (Server)
Main Searching Service
This way the user will be able to get "public" rankings,
"expert" rankings, etc. in a given category.
Validation Service
Notification Service
Client Side Functionality
Client software is required for scalable merging of comments
into documents.
Opinion Management Helper
Preference Management Helper
Custom Searching View
General Client Proxy Functionality
(User will be able to easily turn these features on/off
for individual URI's or patterns of URI's,
because we can provide fast access to such functionality
from an "added-functionality-bar" inserted into each page.)
Have combined earliest-match-used white/black list of patterns
for blocking (as in TCP wrappers).
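A combined earliest-match-used list can be sketched in a few lines. The rule format, the names `is_blocked`, "allow", and "block", and the use of shell-style wildcard patterns are assumptions made for this example; the `example.com` URLs in the usage note are placeholders.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Tuple


def is_blocked(url: str, rules: Iterable[Tuple[str, str]],
               default: bool = False) -> bool:
    """Apply the first rule whose pattern matches; later rules are ignored."""
    for action, pattern in rules:
        if fnmatchcase(url, pattern):
            return action == "block"
    return default  # no rule matched
```

Because the earliest match wins, an exception must be listed before the general rule it overrides: with the rules `[("allow", "http://ads.example.com/ok/*"), ("block", "http://ads.example.com/*")]`, pages under `/ok/` pass through while the rest of that host is blocked.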
(for now we will rely on external cache such as Squid,
but (optional) integrated cache is good for
the users who do not already have a separate cache set up).
(can be automatic or explicitly activated,
can be restricted to the pages from the same host, etc.,
should handle images, should not steal bandwidth
from explicit browsing).
(We can restrict bandwidth used by a big-file download
stream passing through the proxy.)
(I think sometimes one needs to cancel the visited-link coloring
for a particular link.
Also, having visited-link information in a proxy
will make it browser independent if one uses many
(incompatible) browsers over time.)
(Ensure that a copy of a page is stored locally,
so that we can still get it in case the original disappears.)
Development Notes
It's good to have some common base libraries
helping in managing runtime assertion checks,
debugging dumps, etc.
(Such libraries are part of the code now.)
It's the W3C library (C, OO style) of basic www functionality;
it includes a crawler code.
They describe Google architecture in reasonable detail.
Crawler and search engine for a (small) specified domain (C/C++).
Indexer and search engine (C/C++).
Proxy (http,https) to filter ads and cookies (C).
Junkbuster's code has already been used as the model
for the initial proxy principles.
Proxy (http,https,ftp,etc.) cache (C/C++).
Converts bookmarks into searchable Yahoo-like mini-directory
with descriptions, new-item flags, aliases, and click-through counters (C++).
A server to create, store, and view comments to html's (Perl, Java);
it also provides some information on backlinks to a document.
WWW'95 paper about architecture for annotations (comments);
it covers group/public/individual annotations; scalability issues;
polling and notification.
Another WWW'95 paper on annotations; an architecture for serving annotations;
their use as trail marks; and about seals of approval (SOAPs),
which are ratings by some reputable source.
WWW'96 paper proposing to use proxy instead of modifying browsers
in order to merge annotations into documents
and otherwise enrich the documents;
also discusses annotations classified by groups and by topics
and filtering of annotations to display by these criteria.
WWW'97 paper on methods to organize and share (within small groups)
bookmarks; also has some ideas on bookmark categorization and ranking.
Things to be Done
Copyright Notice
First posted on Mar. 6, 1999
and last updated on Dec. 17, 2000 by
Maxim Lifantsev