The Open GRiD Project Architecture Proposal

Maxim L. Lifantsev


This document is somewhat outdated and should be considered as complementary to the Open GRiD Project Research Papers.

In any case be sure to read the document describing the Open GRiD project before reading this page!

This document is a draft outlining the proposed architecture for the Open GRiD project and some other implementation and development issues.

The parts in need of expansion or work are indicated by [[double square brackets]].

You are highly encouraged to review this proposal and submit your opinions, suggestions, and comments to the development discussion mailing list, or e-mail them to me directly (maxim@cs.sunysb.edu).




Design Goals for the Open GRiD Architecture


Definitions


Kinds of opinions and information that are handled

An AE can state or provide (by making it a part of one or more RE's in its authorship area) the following kinds of opinions and information:


HTML Elements Design goals


The syntax we use to define the grammar

We use


HTML Notes


Mirror Defining Link

The way to provide a mirror page for the current page is to put

  <LINK rel="alternate" href="$mirror-uri$">
in the HEAD section of the page.

The document at $mirror-uri$ should have a similar link back to this document for such links to count.

Note that we can put many such mirror defining links on one page. This can be used for example when a particular page can be accessed using many different URI's: we can (should) put all these URI's as mirror links on that page.
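
The reciprocity requirement for mirror links can be checked mechanically. The following is only a minimal sketch in Python: the names mirror_links and confirmed_mirrors are hypothetical, the pages dictionary stands in for whatever the crawler has fetched, and URI's are compared as exact strings (a real engine would presumably normalize them first).

```python
from html.parser import HTMLParser


class MirrorLinkParser(HTMLParser):
    """Collects href values of <LINK rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.mirrors = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "alternate" in rel_values and attr.get("href"):
            self.mirrors.append(attr["href"])


def mirror_links(html_text):
    """Return all mirror URI's declared on a page, in document order."""
    parser = MirrorLinkParser()
    parser.feed(html_text)
    return parser.mirrors


def confirmed_mirrors(uri, pages):
    """Return the mirrors of `uri` that link back to it.

    `pages` is a hypothetical dict mapping URI -> HTML text.
    """
    result = []
    for candidate in mirror_links(pages.get(uri, "")):
        if uri in mirror_links(pages.get(candidate, "")):
            result.append(candidate)
    return result
```

A mirror declared on only one of the two pages is simply dropped by confirmed_mirrors.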

"alternate" is a part of HTML 4.0 standard: http://www.w3.org/TR/REC-html40/types.html#type-links.


Authorship Entity Root Document Identification

The method to indicate that a particular page is (a mirror of) the root document of an AE is to put

  <META name="root-document" content="yes">
in the HEAD section of the page.

Since the value of name in META is not a part of the standard (see http://www.w3.org/TR/REC-html40/struct/global.html#edef-META), such HTML code should be okay with the standard.


Authorship Entity Information

AE Name

An AE can and should provide its name by putting

  <META name="author-name" content="$author-name$">
in the HEAD section of (a mirror of) its root document.

Where $author-name$ is the human-readable name of the AE.

AE Description

Similarly an AE can provide a link to its description using

  <META name="author-descr" content="$description-uri$">

Where $description-uri$ must end with #$tag-name$, and $tag-name$ must be defined as follows in the document pointed to by $description-uri$:

  <A name="$tag-name$"> <$element$ $element-attrs$> $some-html-code$ </$element$>
for some element $element$ (e.g. for SPAN); the $some-html-code$ is the body of the description.
(Such requirement allows us to both extract the description text and point to it exactly by a link.)
We should issue a warning if $description-uri$ points to a document not in the
authorship area of this AE.

AE E-mail

Also an AE must either directly provide a contact e-mail to receive Open GRiD-related messages using

  <META name="author-email" content="$e-mail$">
or register with the Open GRiD server (obtaining a login name and giving the system an e-mail address that will not be disclosed) and then use
  <META name="author-id" content="$id_name$">
to let others send e-mails to the author through the Open GRiD server.

The choice here is between making the e-mail very easily extractable (note that if one wants to get somebody's e-mail or collect a number of e-mails, it is not that difficult to do so without such explicit information) and having all messages pass through the Open GRiD server.
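
As an illustration of how the AE information could be read from a root document, here is a small Python sketch. The names ae_info and ae_problems are hypothetical, META names are matched case-insensitively as an assumption, and the check that $description-uri$ stays inside the authorship area is left out.

```python
from html.parser import HTMLParser

# META names defined in the sections above.
AE_META_NAMES = {"root-document", "author-name", "author-descr",
                 "author-email", "author-id"}


class AEMetaParser(HTMLParser):
    """Collects the first value of each Open GRiD META element."""

    def __init__(self):
        super().__init__()
        self.info = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in AE_META_NAMES and attr.get("content") is not None:
            self.info.setdefault(name, attr["content"])


def ae_info(html_text):
    parser = AEMetaParser()
    parser.feed(html_text)
    return parser.info


def ae_problems(info):
    """Return human-readable complaints about missing or malformed data."""
    problems = []
    if info.get("root-document", "").lower() != "yes":
        problems.append("not marked as a root document")
    if "author-name" not in info:
        problems.append("author-name is recommended")
    if "author-email" not in info and "author-id" not in info:
        problems.append("either author-email or author-id is required")
    descr = info.get("author-descr")
    if descr is not None and ("#" not in descr or descr.endswith("#")):
        problems.append("author-descr must end with #tag-name")
    return problems
```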


Authorship Area Defining Links

From-Root-Document Links

To add new pages to the authorship area one can put

  <LINK rel="same-author" href="$new-page-uri$">
  <LINK rel="same-author-cgi" href="$new-page-uri-prefix$">
in the HEAD section of a document already established to be in the authorship area (AA).

Where the $new-page-uri$ points to a page to be included in the AA of this AE.

For $new-page-uri-prefix$ all its continuations are declared to be included in the AA. (This is for CGI-script URI's.)

Note that these links are generally not rendered by browsers.

Also one can use visible links for the same purpose (in the BODY of a document already established to be in the AA):

  <A rel="same-author" href="$new-page-uri$">
  <AREA rel="same-author" href="$new-page-uri$">
  <FORM rel="same-author-cgi" action="$new-page-uri-prefix$">
  <FRAME rel="same-author" src="$new-page-uri$">
  <IFRAME rel="same-author" src="$new-page-uri$">

But note that rel for AREA, FORM, FRAME, and IFRAME is not in the HTML 4.0 standard.

We can also use equivalent comment style for things not in the standard, for example:

  <AREA href="$new-page-uri$">
  <!-- $OpenGridTag$ AREA rel="same-author" -->
Where $OpenGridTag$ is OpenGrid (case insensitive). The code between <!-- $OpenGridTag$ and --> is parsed as the code between < and > in HTML. There can be white-space characters between <!-- and $OpenGridTag$.

Generally we will use such post-comments or replacing comments (for non-standard HTML elements) when what we need is not in the standard.
Note that the syntax without comments is acceptable to current HTML browsers, because they are required to be built so that they simply ignore what they do not understand.
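
The comment style can be handled by re-parsing the comment body as if it were a tag. The Python sketch below is one possible reading, not a definitive implementation: the class name OpenGridParser is hypothetical, and merging a post-comment into the immediately preceding element of the same name is our assumption about how "post-comments" attach.

```python
import re
from html.parser import HTMLParser

# Optional white space, the OpenGrid tag (case insensitive), then the tag body.
OPENGRID_COMMENT = re.compile(r"\s*opengrid\s+(.*)", re.IGNORECASE | re.DOTALL)


class _TagCollector(HTMLParser):
    """Parses a single reconstructed tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


class OpenGridParser(HTMLParser):
    """Collects (tag, attrs) pairs from real tags and OpenGrid comments."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_comment(self, data):
        match = OPENGRID_COMMENT.fullmatch(data)
        if match is None:
            return  # an ordinary comment
        inner = _TagCollector()
        inner.feed("<" + match.group(1).strip() + ">")
        inner.close()
        for tag, attrs in inner.tags:
            if self.tags and self.tags[-1][0] == tag:
                # Post-comment: extend the element just before the comment.
                self.tags[-1][1].update(attrs)
            else:
                # Replacing comment: stands for an element of its own.
                self.tags.append((tag, attrs))
```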

Towards-Root-Document Links

The above links must be complemented by links going in the opposite direction in order to count as valid.

These towards-the-root links again can be put invisibly in the HEAD section of a non root document:

  <LINK rel="to-author" href="$to-parent-uri$">
or visibly in the BODY of a non root document:
  <A rel="to-author" href="$to-parent-uri$">

Where $to-parent-uri$ should (transitively) lead to (a mirror of) the root document of the AE.
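
Taken together, the two link directions suggest a simple fixpoint computation of an authorship area. The Python sketch below assumes the links have already been extracted into two dictionaries. The function name authorship_area is hypothetical, mirrors and same-author-cgi prefixes are ignored, and we read "transitively lead to the root" as "points to a page already accepted into the area".

```python
def authorship_area(root, same_author, to_author):
    """Compute the set of pages in the authorship area of `root`.

    `same_author` maps a page URI to the set of URI's it names with
    rel="same-author"; `to_author` maps a page URI to the set of URI's it
    names with rel="to-author".
    """
    area = {root}
    changed = True
    while changed:
        changed = False
        for page in list(area):
            for target in same_author.get(page, ()):
                if target in area:
                    continue
                # A forward link counts only if the target links back to a
                # page that is already known to lead to the root.
                if to_author.get(target, set()) & area:
                    area.add(target)
                    changed = True
    return area
```

A page that is linked from the area but does not link back, or that links back without being linked from the area, stays outside.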


Opinion Link

To categorize and rank an RE (a Web page) one can use opinion links (OL's).

Opinion links use these extra attributes of A (anchor) element (all of them but href are not in the standard):

<!ATTLIST A
  href     %URI;         -- URI of the RE (the page that is categorized and ranked) --
  cat      %Category;    -- categorization of the RE --
  rank     %Rank;        -- rank of the RE in the categorization --
  conf     %Confidence;  -- confidence of the categorizing/ranking opinion --
  descr    %URI;         -- RE description URI --
  expl     %URI;         -- explanation URI --
  %timeattrs;            -- time specifying attributes --
>

href and cat are the only required attributes.
cat is used here as the indicator to tell an opinion link from a normal link.

Attribute definitions

cat = categorization [CT]
This attribute specifies the categorization of a resource.
rank = rank [CT]
This attribute specifies the rank of the referenced resource in the specified categorization.
conf = confidence [CT]
This attribute indicates the confidence of the AE in this opinion.
descr = uri [CT]
This attribute specifies the URI pointing to some short one-sentence resource description text.
The URI should satisfy the requirements defined for $description-uri$ in section AE Description.
expl = uri [CT]
This attribute specifies the URI pointing to some text explaining/justifying the given opinion.
The URI should satisfy the requirements defined for $description-uri$ in section AE Description.

An opinion link says that its authoring entity places the referenced resource in the specified categorization with the specified rank.

The rank reflects merits/value/importance of the resource w.r.t. the specified categorization.

The description should briefly reflect the contents of the referenced resource w.r.t. the specified categorization.
The description is to be used in a directory listing or search result listing to give the user a short description of the resource.
Different descriptions will be extracted, and the ones whose copies are provided by AE's with the highest combined weight in the categorization are to be displayed. (Hence descriptions can be mirrored or referenced.)
[[Describe how exactly descriptions determine the description in the directory listing.]]

The RE specified by the href attribute can be a part of authorship area of some AE known to the Open GRiD engine or just a regular web page, not annotated to be a part of an authorship area.

In short OL's will be written as tuples: ol = < re, ca, r, co, d, e >.
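
To make the tuple form concrete, here is a Python sketch that extracts opinion links from a page as < re, ca, r, co, d, e > tuples. The names OpinionLink and opinion_links are hypothetical. Rank and confidence are kept as written, category prefixes and opinion bags are not applied, and A elements whose rel marks them as another kind of Open GRiD link are skipped, which is our reading of how the link kinds are told apart.

```python
from html.parser import HTMLParser
from typing import NamedTuple, Optional

# rel values that mark the other kinds of links described in this document.
OTHER_LINK_RELS = {"rank-of", "comment-to", "participant", "workgroup"}


class OpinionLink(NamedTuple):
    re: str             # URI of the ranked RE (href)
    ca: str             # categorization (cat)
    r: Optional[str]    # rank, as written
    co: Optional[str]   # confidence, as written
    d: Optional[str]    # description URI (descr)
    e: Optional[str]    # explanation URI (expl)


class OpinionLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rels = set((attr.get("rel") or "").lower().split())
        # href and cat are required; cat is what marks an opinion link.
        if not attr.get("href") or not attr.get("cat") or rels & OTHER_LINK_RELS:
            return
        self.links.append(OpinionLink(
            attr["href"], attr["cat"], attr.get("rank"),
            attr.get("conf"), attr.get("descr"), attr.get("expl")))


def opinion_links(html_text):
    parser = OpinionLinkParser()
    parser.feed(html_text)
    return parser.links
```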


Direct Opinion Link

To rank an AE (author) directly one can use direct opinion links (DOL's).

Direct opinion links use these extra attributes of A (anchor) element (all of them but href and rel are not in the standard):

<!ATTLIST A
  href     %URI;         -- URI of some RE of the ranked AE --
  rel      %LinkTypes;   -- must include "rank-of" --
  cat      %Category;    -- categorization of the rank --
  rank     %Rank;        -- rank of the AE in the categorization --
  conf     %Confidence;  -- confidence of the categorizing/ranking opinion --
  descr    %URI;         -- AE description URI --
  expl     %URI;         -- explanation URI --
  %timeattrs;            -- time specifying attributes --
>

href, rel and cat are the only required attributes.

Presence of rank-of in rel indicates that this is a direct opinion link.

A direct opinion link says that its authoring entity ranks the referenced AE in the specified categorization with the specified rank.

The description should briefly reflect the main information about the referenced authoring entity w.r.t. the specified categorization.
The description is to be used in a directory listing or search result listing to give the user a short description of the AE.

In short DOL's will be written as tuples: dol = < ae, ca, r, co, d, e >.


Category Prefix

In order to reduce the needed amount of text to specify a set of opinions in some subcategory one can use category prefix specification syntax, which is to use this non-standard attribute of SPAN element:

<!ATTLIST SPAN
  catpref   %CatPrefix;    -- category prefix --
>

Attribute definitions

catpref = cat_prefix [CT]
This attribute specifies the category prefix to be prepended to all categorizations (and category prefixes) starting with . and which are located within the SPAN element (hence, it is possible to have nested category prefixes).
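
Nested prefixes can be resolved from the innermost SPAN outwards. The Python sketch below assumes that the leading . of a categorization (or of an inner prefix) is replaced by the enclosing prefix; the text above does not spell this out, so treat it as one possible interpretation. The function name resolve is hypothetical.

```python
def resolve(cat, enclosing_prefixes):
    """Expand a categorization that starts with "." using catpref values.

    `enclosing_prefixes` lists the catpref values of the enclosing SPAN
    elements from the outermost to the innermost one.
    """
    current = cat
    for prefix in reversed(enclosing_prefixes):
        if not current.startswith("."):
            break  # already absolute, outer prefixes do not apply
        # Assumption: the leading "." stands for the enclosing prefix.
        current = prefix.rstrip("/") + current[1:]
    return current
```

With this reading, ./Games::/price inside a SPAN with catpref="./Software", itself inside a SPAN with catpref="/Computers", resolves to /Computers/Software/Games::/price.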


Opinion Bag

In order to provide several (direct) opinion links sharing the same ranked entity, but having different categorizations one can use opinion bags:

An opinion bag is bounded by the start and end tags of the non-standard element OPBAG. The start tag of OPBAG must be immediately followed by an A element that provides the href for all opinion links in the bag (it can be a regular link or an opinion link). OPBAG attributes can provide default values for catpref, rel, rank, conf, descr, expl and %timeattrs; for the opinion links in the bag.
Opinion bags cannot be nested.

The (additional) opinion links in the bag are defined by non-standard elements OPI (all of the attributes are not in the standard):

<!ATTLIST OPI
  otype    CDATA         -- must be equal to "link" --
  cat      %Category;    -- categorization of the RE --
  rank     %Rank;        -- rank of the RE in the categorization --
  conf     %Confidence;  -- confidence of the categorizing/ranking opinion --
  descr    %URI;         -- RE description URI --
  expl     %URI;         -- explanation URI --
  %timeattrs;            -- time specifying attributes --
>

Attribute definitions

otype = cdata [CI]
This attribute specifies the type of OPI element; the allowed values are link, cat-descr, cat-infl, cat-equiv.


Category Description Opinion

To provide descriptions to categorizations one can use category description opinions (CDO's).

CDO's are specified using these non-standard attributes of non-standard element OPI:

<!ATTLIST OPI
  otype    CDATA          -- must be equal to "cat-descr" --
  cat      %ExtCategory;  -- categorization --
  descr    %URI;          -- categorization description URI --
  conf     %Confidence;   -- confidence of the description opinion --
  expl     %URI;          -- explanation URI --
  %timeattrs;             -- time specifying attributes --
>

otype, cat, and descr are the required attributes here.

For example if we have the following categorization/description pairs specified

  /Computers             - comps
  /*::/*@intro           - intro
  /Computers::/*@intro   - comp-intro
  /Hardware              - hardw
  /*::/Quality           - qual
  /Computers::/Quality   - comp-qual
Then we will have the following categorization/description pairs presented to the user
  /Computers                  - comps
  /Computers@intro            - comp-intro
  /Hardware                   - hardw
  /Hardware@intro             - hardw, intro
  /Hardware/Quality           - hardw, qual
  /Hardware/Quality@intro     - hardw, qual, intro
  /Computers::/Quality        - comp-qual
  /Computers::/Quality@intro  - comp-qual, intro

In short, elementary CDO's (i.e. the ones without *) will be written as tuples: cdo = < ca, co, d, e >.


Category Influence Opinion

To express an opinion about how much rank of an RE in one category should influence its rank in another category one can use category influence opinions (CIO's), which are specified by using these non-standard attributes of non-standard element OPI:

<!ATTLIST OPI
  otype     CDATA          -- must be equal to "cat-infl" or "cat-equiv" --
  cat       %ExtCategory;  -- original categorization --
  dest-cat  %InfCategory;  -- influenced categorization --
  infl      %Infl;         -- degree of influence --
  conf      %Confidence;   -- confidence of the influence opinion --
  expl      %URI;          -- explanation URI --
  %timeattrs;              -- time specifying attributes --
>

otype, cat, dest-cat, and infl are the required attributes here.

Attribute definitions

infl = infl [CT]
This attribute specifies the degree in which ranks from the first category influence the ranks in the second (destination) category.
dest-cat = inferred-categorization [CT]
This attribute specifies the categorization in which the ranks are influenced by the ranks in the original categorization.

In total we have 7 types of original/influenced category specifications:
Each of the 3 parts of the original categorization (category, quality, and level) can be a point or a subtree/interval, provided at least one of them is a point (2^3 = 8 combinations minus the one in which all three parts are sets).
When a part is a point, the corresponding part of the influenced categorization can be any point.
When a part is a set, each element of the set will be mapped into the same element of the set in the influenced category (and the specification of that part of the categorization should be missing in the influenced category).

Here are some examples:

  /Comp/Soft/Games::/absence_of_violence  -->
    /Comp/Soft/Education::/wide_applicability
means that /absence_of_violence in /Comp/Soft/Games influences /wide_applicability in /Comp/Soft/Education.
  /Comp/Soft/*::/reliability  -->  ::/
means that for the whole subtree of /Comp/Soft /reliability influences / (the overall rank)
  /Comp/Soft/Databases::/speed/*  -->  /Comp/Soft/Web_Commerce
means that the whole quality subtree /speed/* in /Comp/Soft/Databases influences the corresponding qualities in /Comp/Soft/Web_Commerce

This opinion creates destination categorization as a regular categorization and populates it with some RE's from the original category. (Anything can be put directly in either of the categorizations.)

otype="cat-equiv" means that we have two symmetrical category influence opinions.

It does not look like we need to have some explicit Category Alias Opinions: CIO's subsume those.
But, individual bookmark mini-directories can support aliases for convenience of the only user and maintainer of the mini-directory. [[Or can they be modeled by symmetric non-exported CIO's?]]

We also will not have explicit opinions reflecting subcategory/RE properness w.r.t. a category: these relations can be captured by opinion links and CIO's.

In short form CIO's will be written as tuples: cio = < oc, ic, i, co, e >.


Comment Opinion

One can use comment opinion (CO) "links" in order to provide non-author comments to an RE; CO's are specified using these extra attributes of A (anchor) element (all of them but href and rel are not in the standard):

<!ATTLIST A
  href     %URI;         -- URI of the RE (Web page) commented upon --
  hrefext  CDATA         -- information to target the href more precisely --
  rel      %LinkTypes;   -- must include "comment-to" --
  ctype    NAME          -- type of the comment --
  descr    %URI;         -- comment title/description URI --
  cbody    %URI;         -- comment body URI --
  cat      %Category;    -- categorization topic for the comment --
  conf     %Confidence;  -- confidence of the comment opinion --
  %timeattrs;            -- time specifying attributes --
>

href, rel, descr, and cbody are the required attributes here.

Presence of comment-to in rel indicates that this is a comment.

Attribute definitions

hrefext = cdata [CI]
This attribute specifies additional information to target href more precisely into a changing document of a different AE. [[The exact syntax is to be specified yet; see http://crit.org and http://lists.w3.org/Archives/Public/www-talk.new/msg01983.html for proposals.]]
ctype = name [CI]
Type of the comment; here are some predefined types: support, addition, comment, correction (comment is the default).
cbody = uri [CT]
This attribute specifies the URI pointing to the comment body.
The URI should satisfy the requirements defined for $description-uri$ in section AE Description.

In short CO's will be written as tuples: co = < ce, ct, cb, ca, cc >, where ce combines href and hrefext.

Groups actively discussing some topic might wish to set up and use a partial OpenGRiD server to update and serve, in a timely manner, comments on the set of RE's discussed by the group.

Maybe some time later HTTP servers will maintain and serve information about such third-party comments for the pages on them.


Group Authorship Entity Composition/Participation

To specify composition of a group authorship entity one can use composition links on the pages of the group AE together with corresponding participation links on the pages of participating AE's.

Composition Link

These are specified using these extra attributes of A (anchor) element (all of them but href and rel are not in the standard):

<!ATTLIST A
  href     %URI;         -- URI of some page of the participating AE --
  rel      %LinkTypes;   -- must include "participant" --
  cat      %Category;    -- categorization of participation --
  quota    %Quota;       -- participation quota in the specified categorization --
  descr    %URI;         -- participation description URI --
  expl     %URI;         -- explanation URI --
  %timeattrs;            -- time specifying attributes --
>

href and rel are the required attributes here.

Attribute definitions

quota = quota [CT]
This attribute specifies the participation quota of the referenced AE in the specified categorization.

The values of quotas for the same categorization for all participants should sum up to 100.

Participation Link

These are specified using these extra attributes of A (anchor) element (all of them but href and rel are not in the standard):

<!ATTLIST A
  href     %URI;         -- URI of some page of the group AE --
  rel      %LinkTypes;   -- must include "workgroup" --
  cat      %Category;    -- categorization of participation --
  quota    %Quota;       -- participation quota in the specified categorization --
  descr    %URI;         -- participation description URI --
  expl     %URI;         -- explanation URI --
  %timeattrs;            -- time specifying attributes --
>

href and rel are the required attributes here.

The corresponding participation quotas must be the same in both kinds of links in order to count. If they are not, then that portion of the work in that category is credited to (or blamed on) no one.
This is to prevent making an AE responsible or credited for a work it did not participate in.
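
The matching rule can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical, and quotas are assumed to have been parsed into integers and collected per (participant, categorization) pair from both sides.

```python
def credited_quotas(composition, participation):
    """Return the quotas confirmed by both sides.

    Both arguments map (participant, categorization) to a quota in percent:
    `composition` as stated on the pages of the group AE, `participation`
    as stated on the pages of the participating AE's.
    """
    return {key: quota for key, quota in composition.items()
            if participation.get(key) == quota}


def uncredited_share(composition, participation, categorization):
    """Percentage of the work in `categorization` credited to no one."""
    confirmed = credited_quotas(composition, participation)
    credited = sum(quota for (_, cat), quota in confirmed.items()
                   if cat == categorization)
    return 100 - credited
```

For example, if the group page states 60 for one participant and 40 for another, but the second participant states 30 on its own page, only the first quota counts and 40% of the work in that categorization is credited to no one.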

Group AE functionality implementation can be added later.

Both kinds of group participation links can be used with OPBAG's specifying many categorizations, quotas, descriptions, and explanations for the same initial link specified only once.


HTML Data Type Definitions

Categorization

<!ENTITY % Category "CDATA"
    -- category description
    -->

Categorization describes a category of knowledge and possibly a particular quality in that category and a level of presentation of the information (and hence the expected proficiency of the reader).
This is represented as a path in directory structure, a path in quality structure, and a point or an interval on the level scale:

  $Category$  ::=  $cat-path$ [ :: $qual-path$ ] [ @ $pres-level$ ]
  $cat-path$  ::=  [ / ] $path-name$ [{ / $path-name$ }] [ / ]
  $qual-path$  ::=  [ / ] $path-name$ [{ / $path-name$ }] [ / ]
  $pres-level$  ::=  $level-point$ [ .. $level-point$]
  $level-point$  ::=  $digit$
                    | intro | begnr | interm | advncd | expert
  $path-name$  ::=  . | / | { $alpha$ | $digit$ | _ | - }
$level-point$ scale is 1..9; default $level-point$ values: intro = 1, begnr = 3, interm = 5, advncd = 7, expert = 9.

Where the level names are case insensitive, whereas category and quality names are case sensitive.

[[Maybe we can have style sheets for $level-point$ names.]]

Qualities are good to have because things tend to have many (but not too many) important qualities contributing to the overall rating; also RE's should automatically get similar ranks for same qualities in close categories (which can be achieved only by having qualities as a notion known to the Open GRiD engine).

Presentation scale is an important general "quality" or search criterion.

Examples:

  /Computers/Internet/Search_Engines
  /Computers/Software::/reliability
  /Computers/Software::/productivity@intro
  /Computers/Software/OS_Shells::/interface/ease_of_learning
  /Computers/Software/Web_Browsers::/interface/ease_of_use
  /Computers/Software/Web_Browsers::/standard_compliance@interm..expert
  /Computers/Software/Operating_Systems::/quality/crash_rate
  /Computers/Software/Games::/price

Absence of the quality means rating in the overall quality. The default is "/::/@", that is, no categorization, no quality specification, and an unspecified level (i.e. all levels).
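
The grammar above is simple enough to split with string operations. The following Python sketch (parse_category and parse_level_point are hypothetical names) returns the category path, the quality path, and the level interval; it does not validate individual path names, and it only accepts the digits 1 to 9 as numeric levels, following the stated scale.

```python
LEVEL_NAMES = {"intro": 1, "begnr": 3, "interm": 5, "advncd": 7, "expert": 9}


def parse_level_point(text):
    """Convert a level name (case insensitive) or a digit 1..9 to an int."""
    key = text.strip().lower()
    if key in LEVEL_NAMES:
        return LEVEL_NAMES[key]
    if len(key) == 1 and key in "123456789":
        return int(key)
    raise ValueError(f"bad level point: {text!r}")


def parse_category(text):
    """Split a categorization into (cat_path, qual_path, (low, high))."""
    rest, at, level = text.partition("@")
    cat_path, _, qual_path = rest.partition("::")
    if at and level.strip():
        low_text, dots, high_text = level.partition("..")
        low = parse_level_point(low_text)
        high = parse_level_point(high_text) if dots else low
    else:
        low, high = 1, 9  # unspecified level means all levels
    # A missing path means the root category or the overall quality.
    return (cat_path or "/", qual_path or "/", (low, high))
```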

The initial implementation may disregard qualities and levels for simplicity.

Extended Categorization

<!ENTITY % ExtCategory "CDATA"
    -- extended category description
    -->

The syntax for the categorization is extended by the ability to have * at the end of category and/or quality path:

  $ExtCategory$  ::=  $cat-path$ [ * ] [ :: $qual-path$ [ * ] ] [ @ $pres-level$ ]
Star means that we are specifying the whole category or quality subtree under the specified path.

Inferred Categorization

<!ENTITY % InfCategory "CDATA"
    -- category description with possibly missing parts
    -->

The syntax for the categorization is extended by the ability to skip category/quality paths and level specification:

  $InfCategory$  ::=  [ $cat-path$ ] [ :: [ $qual-path$ ] ] [ @ [ $pres-level$ ] ]

Category Prefix

<!ENTITY % CatPrefix "CDATA"
    -- category prefix
    -->

The syntax is as of $cat-path$ from the definition of categorization.

Rank

<!ENTITY % Rank "CDATA"
    -- rank of a ranked entity
    -->

The rank is an integer ranging from 0 to 100 with optional % after it and preceding - to specify negative value. The default value is 0%.

The rank ranges from total disapproval of the merits of the RE in the category to the complete approval (i.e. from a statement that RE has completely misleading/harmful information/work/ideas/etc. to a statement that RE has absolutely important/necessary information/work/ideas/etc. w.r.t. this categorization); 0% means only that the RE belongs to the category.

[[We can make the default to be a small positive value; this will make all links into opinion links and achieve Google-like ratings in overall category based on regular links.]]

Confidence

<!ENTITY % Confidence "CDATA"
    -- author confidence in an opinion
    -->

Confidence is an integer ranging from 0 to 100 with optional % after it. The default value is 50%.

Confidence provides an AE with a way to voluntarily reduce the weight of an opinion: other AE's will not think badly of an AE if it has a wrong but low-confidence opinion.
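
The Rank and Confidence value syntax, including the defaults of 0% and 50%, can be captured as follows. This is a small Python sketch with hypothetical function names; rejecting values above 100 with an exception is our choice, since the text does not say how malformed values should be treated.

```python
import re

_RANK = re.compile(r"(-?)(\d{1,3})%?")
_CONFIDENCE = re.compile(r"(\d{1,3})%?")


def parse_rank(text=None):
    """Return the rank as an int in -100..100; a missing attribute means 0."""
    if text is None:
        return 0
    match = _RANK.fullmatch(text.strip())
    if match is None or int(match.group(2)) > 100:
        raise ValueError(f"bad rank: {text!r}")
    value = int(match.group(2))
    return -value if match.group(1) else value


def parse_confidence(text=None):
    """Return the confidence as an int in 0..100; the default is 50."""
    if text is None:
        return 50
    match = _CONFIDENCE.fullmatch(text.strip())
    if match is None or int(match.group(1)) > 100:
        raise ValueError(f"bad confidence: {text!r}")
    return int(match.group(1))
```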

Quota

<!ENTITY % Quota "CDATA"
    -- participation quota
    -->

Quota is an integer ranging from 0 to 100 with optional % after it. The default value is 50%.

Influence

<!ENTITY % Infl "CDATA"
    -- influence of rank in one categorization on the rank in another
    -->

The influence is an integer ranging from 0 to 100 with optional % after it and optional - (to specify negative-only influence) or + (to specify positive-only influence) or i (case insensitive; to specify "inverted" influence).
The value of the influence states the amount/strength of the influence.

A number from 0% to 100% states that positive (negative) rank in the first category implies positive (negative) rank in the second one with the stated strength.
A number from -0% to -100% states that positive rank in the first category implies negative rank in the second one with the stated strength.
A number from +0% to +100% states that negative rank in the first category implies positive rank in the second one with the stated strength.
A "number" from i0% to i100% states that positive (negative) rank in the first category implies negative (positive) rank in the second one with the stated strength.
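
The four influence modes can be summarized in code. In the Python sketch below the function names are hypothetical, and "with the stated strength" is modeled as multiplying the original rank by the influence percentage, which is an assumption rather than something fixed by the text above.

```python
import re

_INFL = re.compile(r"([+\-iI]?)(\d{1,3})%?")


def parse_influence(text):
    """Return (mode, strength): mode is "", "-", "+" or "i"."""
    match = _INFL.fullmatch(text.strip())
    if match is None or int(match.group(2)) > 100:
        raise ValueError(f"bad influence: {text!r}")
    return match.group(1).lower(), int(match.group(2))


def influenced_rank(rank, influence):
    """Rank implied in the destination category by `rank` in the original."""
    mode, strength = parse_influence(influence)
    scaled = rank * strength / 100  # assumption: strength scales the rank
    if mode == "":
        return scaled                            # same sign in both
    if mode == "i":
        return -scaled                           # inverted in both directions
    if mode == "-":
        return -scaled if rank > 0 else 0.0      # positive implies negative
    return -scaled if rank < 0 else 0.0          # "+": negative implies positive
```

For example, a rank of 80 under an influence of -50% yields -40 in the destination category, while a rank of -80 under the same influence yields nothing.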

Time Attributes

<!ENTITY % timeattrs 
 "add-time  %Datetime;  -- creation time --
  upd-time  %Datetime;  -- last update time --
  exp-time  %Datetime;  -- expiration time --
  vis-time  %Datetime;  -- last visit time --"
  >

Attribute definitions

add-time = datetime [CS]
This attribute specifies the date and time the opinion expressed by the element enclosing this attribute was first expressed.
upd-time = datetime [CS]
This attribute specifies the date and time the opinion expressed by the element enclosing this attribute was last updated.
exp-time = datetime [CS]
This attribute specifies the date and time the opinion expressed by the element enclosing this attribute expires.
vis-time = datetime [CS]
This attribute specifies the date and time the page referenced by the GRiD link specified by the element enclosing this attribute was last visited.

The times specified by AE should be the actual times to the best of the knowledge of the AE.
The last visited time is the time the AE last looked at that page.
The expiration time should be determined by a default expiration interval (say 4 months) from the last review of the entity about which the opinion is expressed.

The client software should include "expires soon" notification/search services as well as an easy postpone-expiration button, so that people are urged/prompted to review and update/correct their opinions if necessary.

The OpenGRiD engine stores and shows both the value claimed by the AE and the value it can confirm (because of the dates and times of crawls of the page) for creation and update times.
The last visited time for the engine is the time a user visited the page using a link provided by the engine.
The expiration time is obtained from the value supplied by the AE by imposing a maximal validity interval (of, say, 5 months).
[[The weight of expired opinions should be depreciated.]]
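
One way the engine could apply the maximal validity interval is sketched below in Python. The function name is hypothetical, 150 days stands in for the tentative "5 months", and counting the interval from the update time the engine can confirm is our assumption; the text above does not fix the starting point.

```python
from datetime import datetime, timedelta

MAX_VALIDITY = timedelta(days=150)  # roughly 5 months; tentative value


def effective_expiration(claimed_exp, confirmed_update):
    """Clamp the AE-supplied exp-time to the maximal validity interval.

    `claimed_exp` is the exp-time given by the AE (or None if absent);
    `confirmed_update` is the update time the engine can confirm.
    """
    limit = confirmed_update + MAX_VALIDITY
    if claimed_exp is None:
        return limit
    return min(claimed_exp, limit)
```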


Time Information about a Page that should be Provided by the Engine


Notes and Issues


Use of Ranks for Viewing

When we present a category listing to a user we have the following components in it:

When we present search results to a user they can be sorted taking into account search word occurrences and their rank in (for RE's) or their influence on (for categories) the category to which the search is restricted.

When we present comments to a given page to a user the comments can be sorted taking into account


Guide on When to Provide Which Opinions

Initial Ranks

Initially, when you find some new author/resource that is not ranked in a category in which you think it should be ranked, you should use regular opinion links to rank/describe the resource in the category, or you can use direct opinion links to rank/describe the author of the resource directly.

Making an opinion link to an RE/AE w.r.t. a category means that you think this RE/AE is an important item in the category (either as a good resource or a bad example).


Follow-up Opinions

When you find some resource/author that is already ranked in a particular way in a category and you


Category Movement Process

The steps to perform a category movement (i.e. restructuring of the category tree) are described below.
Note that it should be performed more or less together by the majority of experts in the category.

[[Is this process too complicated? Can it be simplified?]]


Default/General Category Influence Schemes

Here is the proposed way to determine default category influences using just the proximity of different categories in the current structure of the categorization "tree":

Note that we will have one overall categorization tree, the same presentation level scale for all categories, but the quality trees are going to depend on a position in the category tree.
(All such quality trees supposedly will be rather small, but they are going to be different for different parts of the category tree. We need "local" quality trees because the default quality influences will be determined by the degree of splitting of the quality tree; hence, combined quality tree would dilute the important influences.)

The quality tree for an RE is the combination of all categorizations of this RE that are explicitly stated on some page of some AE. The quality tree for a category is the combination of all the quality trees for all RE's that are explicitly categorized into this category by some AE.

Let's first define influences for two category paths, two quality paths (in some quality tree), and two levels:
ci = < ca, ca/ca1, 100% / m, 20%, "default" >
ci = < ca/ca1, ca, 100% / m, 20%, "default" >
where ca is a category (or quality) path; ca1 is a name of subcategory (subquality); m is the number of subcategories (subqualities in that given tree) of ca; 20% is the tentative "low confidence value".

The category influence for two arbitrary categories is determined using the transitive closure (defined below) of the above default CIO's along the shortest path from the first category to the second one in the tree.

The intuition is that if something is good (bad) in all immediate subcategories (subqualities), then it is good (bad) in the category (quality). We assume here that the subcategories and subqualities are mostly disjoint. (Note that for a tree with 10-splits the default influence between two nodes more than 1 arc apart is less than 1%. Since we are going to disregard opinions with ranks less than certain threshold (for performance reasons), the default influences will be very local.)

li = < l1, l2, 10% + (1 - | l1 -l2 | / (lmax-lmin)) * 90%, 20%, "default" >
li = < "all", l, 100% / n, 20%, "default" >
li = < l, "all", 100% / n, 20%, "default" >
where l, l1, l2 are presentation levels; | l1 -l2 | is the numerical distance between l1 and l2; lmax-lmin is the maximal numerical distance between two presentation levels; 10% is the tentative value for the influence between the maximal and the minimal levels; n is the number of levels.
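
The default level influence formula is easy to evaluate directly. The Python sketch below uses the 1..9 level scale from the Categorization section, so n = 9; the function name level_influence is hypothetical, and the case where both arguments are "all" is not defined by the formulas above.

```python
LMIN, LMAX = 1, 9
N_LEVELS = LMAX - LMIN + 1  # n in the formulas above


def level_influence(l1, l2):
    """Default influence (in percent) of level l1 on level l2.

    Either argument may be the string "all"; otherwise both are ints in
    LMIN..LMAX.
    """
    if l1 == "all" or l2 == "all":
        return 100.0 / N_LEVELS
    distance = abs(l1 - l2) / (LMAX - LMIN)
    return 10.0 + (1 - distance) * 90.0
```

For example, the influence between the extreme levels 1 and 9 is 10%, between levels 1 and 5 it is 55%, and a level influences itself with 100%.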

Then we have < ca1::q1@l1, ca2::q2@l2, rc*rq*rl, coc*coq*col, "default" > if we have
ci = < ca1, ca2, rc, coc, "default" >,
qi = < q1, q2, rq, coq, "default" >, and
li = < l1, l2, rl, col, "default" >,
where quality influence is computed in the quality tree that is the result of merging of the quality trees for categories ca1 and ca2. [[Or maybe we should increase the confidence here a bit?]]
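
The combination rule multiplies the three component influences and, separately, the three confidences. A minimal Python sketch follows (combine is a hypothetical name; percentages are converted to fractions before multiplying and back afterwards):

```python
def combine(*percentages):
    """Multiply percentages as fractions and return the product in percent."""
    result = 1.0
    for value in percentages:
        result *= value / 100.0
    return result * 100.0
```

With the tentative 20% confidence for each component, the combined default confidence is combine(20, 20, 20), i.e. about 0.8%, which illustrates why the bracketed question about raising the confidence arises. Component influences of 10%, 100% and 55% combine to about 5.5%.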

A more complicated thing might be to take into account the depth in the directory tree (deeper means more influence; higher, less), but maybe this is not good/necessary: if you look at the current web directories it does not seem to be right to have a big influence between all categories except maybe the parent and some children for all levels of the tree.

Potentially, there might be many default category influence schemes. (An ultimate extension is to allow for some category influence scheme description language.)
Such schemes can be presented on RE's which are to be ranked in special subcategories named say ".../Subtree_Influence_Scheme"; the ranks of a scheme will influence how much the scheme is used in the given subtree of the categorization in order to determine the category influences.
But most probably just one default scheme with low confidence and some specific category influence opinions provided by different people will be enough.


Raw Data Collected from the Web

Text (HTML/XML) Data for Indexing and Searching

Should at first include at least the whole authorship areas of the known AE's as well as all the _ranked_ RE's (not only the ones located in some authorship area) and possibly (some) "surrounding" pages of (some) of these RE's (these might be by default the/some pages "under" the URL path of the initial RE reachable from it).

Authoring Entity Data

Data about each AE: its authoring area and the information about the AE itself: name, description, e-mail, and main Web page (only the last of these we will get for sure).

[[We might try to make all the uses of e-mail to be an optional subscription of an AE to the services of Open GRiD, but these options maybe should be made public to everyone.]]

Authoring Entity Structure Data

The data for all "group AE's": their structure as a collection of individual AE's. And the inverted data for each AE: in which "group AE's" does it participate and how. (Group AE functionality might be missing in the initial implementation.)

Opinion Data for Ranking

For each AE all the data in its authorship area regarding the Opinion Links, Category Description Opinions, Category Influence Opinions, and Comment Opinions.

We also need "transitive closure" of this information: transitive closure of the CIO graph as well as new OL's and CO's induced by the existing OL's, CO's and the closed CIO graph.

Here are the computation rules:

Note that we have to account for the fact that there might be "subtree CIO's" that might (partially) overlap with each other and with OL's and CO's.

When we have ol1 = < re, ca, r1, co1, d1, e1 > and ol2 = < re, ca, r2, co2, d2, e2 > i.e. different opinions about the same object (similarly for CDO's, CIO's, and CO's) we keep both and filter such cases later.
(I don't think that keeping only the one with the highest confidence or with the highest absolute value of the rank is right - see below on how derived OL's are used.)
The point here is that creation of multiple opinions about the same thing by an AE should not increase the weight of the opinion of this AE about this thing.

[[It has to be decided what is better, to compute (and store) this data explicitly, or to compute it on the fly when it is needed (or just cache it for a while after computing together with trying to make the use of this data local w.r.t. this cache). Actually (persistent) caching might be the best solution for such cases: we can regulate the amount of storage/recomputation by changing the size of the cache (but we need to invalidate/recalculate the cached data when needed.)]]

It should be a service of the Open GRiD engine to provide all (or a part, according to the user-specified criteria) of the information about both explicit and derived opinions.


Rank Computation Scheme

Input Data

From the web we get the following info: for each AE


Output Data

We want the following data to be computed:

For each AE

For each RE

For each two categorizations:

[[ CDO data ]]

[[ Group AE data ]]

[[ CO data ]]

For all of these we also want the important information on how the rank or influence was generated: which RE's of this AE contributed most to this rank of the AE; which and whose opinions were most important; what the most important negative opinions are; how many AE's contributed to this opinion; and other important statistics, for example the distribution of the ranks (influences) of the AE's that contributed to a rank (rank as the x axis, and rank*number_of_AE's as the y axis); such statistics might help to see how dependable the rank is. We also want the ranks assigned by the AE itself to its RE's.
It would be nice to make conditions on these kinds of statistical information into additional search criteria.


Intermediate Help Data

We compute (maybe on demand) the following data that does depend on the current global ranks we have:

Rank opinion provided by AE ae about RE re in category ca:
rank_op (ae, re, ca) = < r, w, co, d, e > (rank, weight, confidence, description, explanation), where

  r = r_i
  w = rank (ae, ca_i) * infl (ca_i, ca)
  co = co_i * infl_conf (ca_i, ca)
  d = d_i
  e = e_i since "infl (ca_i, ca)"
for such i that ol_i = < re_i, ca_i, r_i, co_i, d_i, e_i > is a derived OL of AE ae such that re_i = re and co is the maximal one over all OL's of this AE that satisfy the previous requirements.
Certainly, we are interested in considering (and possibly storing) only such rank opinions for which each of r, w, co is greater than a certain threshold.
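The selection of a rank opinion can be sketched as follows. The record types and the function signature are hypothetical; rank, infl, and infl_conf stand for the current global values and are passed in as plain functions. The threshold is applied literally as stated (to r itself, not to its absolute value), which is an assumption about how negative ranks are to be treated.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class OL(NamedTuple):
    re: str
    ca: str
    r: float
    co: float
    d: str
    e: str


class RankOp(NamedTuple):
    r: float
    w: float
    co: float
    d: str
    e: str


def rank_op(ae: str, derived_ols: Iterable[OL], re: str, ca: str,
            rank: Callable[[str, str], float],
            infl: Callable[[str, str], float],
            infl_conf: Callable[[str, str], float],
            threshold: float = 0.0) -> Optional[RankOp]:
    """Pick, among the derived OL's of ae about re, the one that gives
    the maximal confidence co for the target category ca."""
    best = None
    for ol in derived_ols:
        if ol.re != re:
            continue
        cand = RankOp(
            r=ol.r,
            w=rank(ae, ol.ca) * infl(ol.ca, ca),
            co=ol.co * infl_conf(ol.ca, ca),
            d=ol.d,
            e=f'{ol.e} since "infl ({ol.ca}, {ca})"',
        )
        if best is None or cand.co > best.co:
            best = cand
    # Drop opinions that are too weak to be worth considering or storing.
    if best is None or min(best.r, best.w, best.co) <= threshold:
        return None
    return best
```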

Justification: This way to generate rank_op's from original OL's through the derived OL's seems to be the best way to handle the following kind of situation:
ae has an OL about re in ca1; and a CIO from ca1 to ca2; the system has influence ranks from ca1 to ca2 and from both ca1 and ca2 to ca3; and we are interested in the opinion of ae about re in ca3.

Influence opinion provided by AE ae about categories ca and ca':
infl_op (ae, ca, ca') = < i, w, co, e > (influence, weight, confidence, explanation), where

  i = i_i
  w = (rank (ae, ca) + rank (ae, ca')) / 2
  co = co_i * (rank_conf (ae, ca) + rank_conf (ae, ca')) / 2
  e = e_i
for such i that cio_i = < oc_i, ic_i, i_i, co_i, e_i > is a closure CIO of AE ae such that oc_i = ca and ic_i = ca'.
Certainly, we are interested in considering (and possibly storing) only such influence opinions for which each of i, w, co is greater than a certain threshold.

It might be better to use the minimum instead of the average in order to prevent the following spamming scenario: an AE creates a category and makes itself an expert there (maybe using other fake AE's) and then tries to influence ranks in some other category by making a CIO from its "own" category.
But it might not be necessary if we can prevent creation of expertness out of thin air.
Or we can have 2nd or more power averaging ((x^n +y^n)/2)^(1/n) or computationally-cheaper similar averaging like (x + y + n*max(x,y)) / (n + 2) if we want to have easy expertness status export, provided the original expertness status can be earned only fairly.
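To compare the averaging options mentioned above, here is a small sketch (the function names are hypothetical; x and y stand for the nonnegative ranks of the AE in the two categories):

```python
def mean(x, y):
    """Plain average, as used in the weight formula above."""
    return (x + y) / 2


def power_mean(x, y, n=2):
    """n-th power averaging ((x^n + y^n) / 2)^(1/n); leans towards the
    larger value as n grows."""
    return ((x ** n + y ** n) / 2) ** (1 / n)


def max_biased_mean(x, y, n=2):
    """Computationally cheaper variant (x + y + n*max(x, y)) / (n + 2)."""
    return (x + y + n * max(x, y)) / (n + 2)
```

For x = 0.9 and y = 0.1 the minimum gives 0.1, the plain average 0.5, the 2nd power averaging about 0.64, and the cheaper variant with n = 2 gives 0.7. So the minimum makes exporting expertness hardest and the last two make it easiest.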

Open GRiD engine should also provide to users all these rank and influence opinions for any given AE.


The Formulas to Compute Output Data

where the sums are over all AE's ae having rank_op (ae, re, ca) = < r, w, co, d, e >.

[[Maybe we should use just the sum of weights (expertness degrees) of the contributing AE's, not their average, for coverage degree (maybe normalized with respect to some current maximum).]]

where the sums are over all RE's re authored by this ae.
That is, the rank of an AE is the rank of a virtual RE, which is the union of all the RE's authored by this AE.
[[Add handling of direct opinion links!]]

where the sums are over all AE's ae having infl_op (ae, ca1, ca2) = < i, w, co, e >; sys_w is some small weight (say 10%) for the default influence; sys_i and sys_co are the default influence and confidence values for ca1 and ca2 computed as described in section "Default/General Category Influence Schemes" above.

It's actually good to store those _den and _num values: they are good to have for change computations.

It's nice to have some overall statistics like average rank or confidence of an RE or AE in a category.

[[ We need to specify the algorithms for computing this data and the needed data structures. ]]
A good way to compute/adjust these ranks seems to be some change-propagation algorithm: see which data items change because of the new/updated information we have; update these data items and if they change propagate the changes. Hence we need stored or easily computable lists of what might change if a certain data item changes.
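A change-propagation loop of this kind can be sketched with a simple work queue. Everything here is an assumption made for illustration: the names, the representation of the "what might change" lists as a dictionary of dependents, the numeric tolerance eps, and the max_steps guard against non-converging cycles.

```python
from collections import deque


def propagate(values, dependents, recompute, changed,
              eps=1e-9, max_steps=100000):
    """Propagate changes through dependent data items.

    values:     dict item -> current value (updated in place)
    dependents: dict item -> items that might change if it changes
    recompute:  function (item, values) -> new value of the item
    changed:    items whose values have just been updated
    Returns the set of items whose values were changed by propagation.
    """
    queue = deque(changed)
    queued = set(changed)
    updated = set()
    steps = 0
    while queue:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("propagation did not converge")
        item = queue.popleft()
        queued.discard(item)
        for dep in dependents.get(item, ()):
            new = recompute(dep, values)
            # Propagate further only if the value really changed.
            if abs(new - values[dep]) > eps:
                values[dep] = new
                updated.add(dep)
                if dep not in queued:
                    queue.append(dep)
                    queued.add(dep)
    return updated
```

The tolerance plays the same role as the rank thresholds above: small changes are not propagated, which keeps the updates local.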

[[Extension]]: We need a way to find RE's that are related to (parts of) a given "root" RE (and some means for AE to describe it) in order to index those sub-RE's and return them when a word we are searching for occurs in them but not in the root RE, and the categorization we are looking for is met by the "root" RE. (A default for this can be the URL-path subtree under the "root" RE.) When returning the search results Open GRiD should indicate this structuring information (e.g. the search result is this and that, it is a part of a structure with the root this and that).
[[A better way is another type of links, "inherit links": the linked document (authored by the same AE) inherits ranks of the linking document in the specified "subtree" of the categorization structure. These links must form trees only.]]
The incentive for AE to make such links is that his RE's will have a higher chance to be found; the incentive to avoid making improper links of this kind is the danger of being voted down for that.
A way to do it without a new type of links is for the AE to have opinion links to such sub-pages stating that they are good in such and such categories; then if some page of the AE has a high rank in the category, the AE will have a high rank in the category, and this will make the AE's OL count and propagate this high rank to the subpage. [[This will not necessarily ensure correct rank, confidence, and coverage degree inheritance; further consideration is required!]]


Open GRiD Components and Their Functions

Fetcher

Gets pages using a given set of URL's and puts (compresses) the pages into a database.

Fetching Manager

Handles indexing requests (which initially can come from the user, from the expiration-based re-crawling strategy, or from the indexing or other components): it orders and buffers them, removes duplicates and requests for recently crawled pages, and then distributes the requests to the crawler(s).
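A minimal sketch of such a manager is given below. The class and method names, the numeric priorities (smaller means more urgent), and the single recrawl_interval are all hypothetical; a page is treated as crawled at the moment it is handed to a crawler, which is a simplification.

```python
import heapq
import itertools


class FetchingManager:
    def __init__(self, recrawl_interval):
        self.recrawl_interval = recrawl_interval
        self._heap = []              # (priority, sequence number, url)
        self._pending = set()        # URLs currently buffered
        self._last_fetched = {}      # url -> time it was handed out
        self._seq = itertools.count()

    def request(self, url, priority, now):
        """Buffer a request; return False if it is a duplicate or the
        page was crawled too recently."""
        if url in self._pending:
            return False
        last = self._last_fetched.get(url)
        if last is not None and now - last < self.recrawl_interval:
            return False
        heapq.heappush(self._heap, (priority, next(self._seq), url))
        self._pending.add(url)
        return True

    def next_batch(self, size, now):
        """Hand out up to size URLs to a crawler, most urgent first."""
        batch = []
        while self._heap and len(batch) < size:
            _, _, url = heapq.heappop(self._heap)
            self._pending.discard(url)
            self._last_fetched[url] = now
            batch.append(url)
        return batch
```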

Parser

Gets newly fetched pages from the database and processes them generating new fetching requests for fetching manager; extracting indexing, link, opinion, author, and other information from pages; and generating the needed update requests to the component managing the searching/ranking/categorization database.

I guess we need some redirection/mirror handling methods: we need URL redirection database to handle "moved to " redirecting type of documents, and we have to use the information about mirror documents as specified by AE's.
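The redirection part could look roughly like this. The sketch assumes the redirection database is just a mapping from an old URL to the URL it was "moved to"; the function name and the max_hops limit are hypothetical. Mirror handling is not covered here.

```python
def resolve(url, redirects, max_hops=10):
    """Follow recorded redirections from url to its final location."""
    seen = {url}
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None:
            return url          # no further redirection is known
        if target in seen:
            raise ValueError("redirect loop at " + target)
        seen.add(target)
        url = target
    raise ValueError("too many redirects")
```

Opinions about any URL in a redirection chain can then be attributed to the final document.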

(Maybe from web-performance point of view some fetcher/fetching manager/indexer functionality for retrieving connected pages like authorship areas should be combined into one unit in order to retrieve such connected pages in one crawl, so that DNS (and routing?) caches are used.)

Initially the overall set of crawled pages consists only of the pages with voting links (i.e. authorship areas), provided by URL submissions to the search engine (or discovered through links), and of the ranked sites, which are fetched in order to index their data and then search through it.

[[This list needs to be expanded...]]

Do we need database software to handle storage and retrieval of our data? This question is not trivial, since the project will potentially have a quite large database that might get too big for the initially chosen third-party database software. (In any case the interface with storage and retrieval should be a clearly defined module.)
Currently I think that The Berkeley Database (Berkeley copyright) (see http://www.sleepycat.com/) should do fine for us.


The Functionality Provided by the Search Engine Interface (Server)

Main Searching Service

Browseable directory structure with directory/site descriptions (With nice hooks and quick links to other services/functionality.)

Search for words in such classes as: category/quality name, description text, explanation text, the RE text itself. We'll need some good default and a flexible access to these advanced features.

Information about an RE: which AE is the author; ranks and descriptions of this RE in the categories it is categorized in; which AE's are the main contributors to a given rank.

Information about an AE: main page, its opinions and ranks (by category, influence, etc.), information similar to that of an RE for the overall ranks of this AE.

A user should have some kind of equalizer-like method to specify that he/she wants results ranked using a weighting scheme where different classes of AE's (according to their ranks in the category used) have weights different from their ranks.
This way the user will be able to get "public" rankings, "expert" rankings, etc. in a given category.

Validation Service

A service to validate an AE, page, etc. for conformance with the standard used (the AE may be either unknown to the engine or already indexed by it).

This can/should be provided as both a web service and a standalone program that does the fetching and processing itself: see the next section.

Notification Service

A service to inform AE's (or just anyone) about important news as specified by some preferences of the person interested in such news.

Examples of such news include:

Such notifications should be provided in both push mode (subscription for certain change information to be sent every such and such interval) and in pull mode (tell me what has changed since this and that date in this and that area).


Client Side Functionality

I think it's nice to have some client side software that interfaces with the server and provides some other advanced functionality, reducing the load of the server and possibly the network traffic. (We can try to move as many tasks from the server into the client as possible without increasing the required network traffic and server processing.)
Client software is required for scalable merging of comments into documents.

A good way to implement it, I think, is to have some software acting as a proxy between the browser and the Web. This allows for natural interception of browser requests (we do not have to change the displayed HTML's since all the requests will go through the proxy anyway); for readily providing additional information from the Open GRiD server about the current document; and for providing a flexible, easily constructible unified interface by simply generating HTML's on the fly. It also allows such a proxy to be used with virtually any browser on any OS, and the proxy will have access to the user's file system to quickly store and retrieve the needed data.

Let's see what kinds of functionality such client software can provide:

Opinion Management Helper

Tools for users to easily download (from the Open GRiD server, or directly from the AA of some AE, or by converting from their bookmark file), manage (comfortably browse as a directory, change, reorganize, search for, etc.), create (like bookmarks while browsing), and upload (into a special mini-directory place in the AA of this AE) their opinions. This is also going to act as a nice bookmark management package.

Note: We will need some syntax to define and use personal category name shortcuts for each user on that proxy client-side software. (Like e.g. Networks for comp/internet/structure/networks/lans).
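One possible reading of such shortcuts is sketched below. The function name and the rule that a shortcut may only replace the first component of a category path are assumptions; the actual syntax is still to be defined.

```python
def expand_category(name, shortcuts):
    """Expand a personal shortcut at the start of a category path.

    shortcuts maps a shortcut name to a full category path. A name that
    does not start with a known shortcut is returned unchanged.
    """
    head, sep, rest = name.partition("/")
    base = shortcuts.get(head)
    if base is None:
        return name
    return base + sep + rest
```

With the shortcut Networks defined as above, Networks/ethernet would expand to comp/internet/structure/networks/lans/ethernet.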

Preference Management Helper

Such software can naturally include, manage, and store any user preferences regarding the use of the Open GRiD server.

Custom Searching View

Having nicely managed personal bookmark-like directory with shortcut names can be extended into having a personal custom version of the whole search engine functionality.

The customization consists of weighting the set of opinions of this AE higher than the opinions of others, and then propagating the changes induced by such custom weights.

This can be implemented by downloading the ranks that are going to change from the Open GRiD server, changing them, storing them, propagating these changes, and then using the custom ranks when available instead of the server ranks while making all the usual requests for the Open GRiD server services.
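The "use the custom rank when available" part amounts to a local overlay on top of the server data, roughly as in the sketch below. The class name and the fetch_server_rank callback are hypothetical stand-ins for the real proxy and server interfaces, and the propagation of changes is not shown.

```python
class CustomView:
    def __init__(self, fetch_server_rank):
        # fetch_server_rank(re, ca) asks the Open GRiD server for a rank.
        self._fetch = fetch_server_rank
        self._custom = {}   # (re, ca) -> locally stored custom rank

    def set_custom_rank(self, re, ca, rank):
        """Store a rank that differs from the server's one."""
        self._custom[(re, ca)] = rank

    def rank(self, re, ca):
        """Return the custom rank if there is one, else the server rank."""
        key = (re, ca)
        if key in self._custom:
            return self._custom[key]
        return self._fetch(re, ca)
```

Only the ranks that actually differ from the server ones are stored locally, so the storage needed grows with the number of changed ranks.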

This way of providing custom searching view implies that the more different, influential, and numerous are the personal opinions the more storage, bandwidth, and processing power such user will need to create, use, and maintain his/her custom searching view.

By installing such proxy on a host with a web server, a user will be able to provide his/her personal searching view to others.

General Client Proxy Functionality

A client proxy can also naturally perform the following functions:


Development Notes

See The Software Download Page for currently available code.

The development is (proposed to be) done in C++ (for efficiency, portability, and object-orientedness reasons; also most existing libraries and searching software are in C or C++).

The server software should run on Linux and other Unix's (porting should not be a big problem here).

The client should run on both Linux (and Unix's) and Windows (possibly on Mac's). Since it's going to be a proxy, accomplishing this should not be a big problem. (Junkbuster does it.)

The main development platform is going to be Linux.

I guess the project might need some uniform style and commenting guidelines, possibly supported by some tool automatically extracting documentation from source code files. (E.g. LXR provides source code browsing and searching without requiring any special format.) [[Does anybody know about some other nice tools of this sort?]]
It's good to have some common base libraries helping in managing runtime assertion checks, debugging dumps, etc. (Such libraries are part of the code now.)

Here is the list of some libraries, software, and articles that can (should) be used as development resources:


Things to be Done

Complete, "verify", and adjust the proposal above.

Create a more detailed specification of the system: parts; their interactions; data structures; algorithms.

Write the interfaces and the code.

Here are descriptions of some coding mini-projects to be done (see The Software Download Page for the current code):


Copyright Notice

Copyright (C) 1999 Maxim L. Lifantsev

The license for this document is the same as the one for the Open GRiD project proposal.




First posted on Mar. 6, 1999 and last updated on Dec. 17, 2000 by Maxim Lifantsev