Philippe Martin and Peter Eklund
Griffith University, School of Information Technology, PMB 50 Gold Coast MC, QLD 9726 Australia
Tel: +61 7 5594 8271; Fax: +61 7 5594 8066; E-mail: pm .@. phmartin dot info
Proceedings of the
"Virtual Documents, Hypertext Functionality and the Web"
at the 8th International World Wide Web Conference.
Web search engines - such as Altavista1 or Infoseek2 - retrieve entire documents based on the keywords they include. They exploit undirected Web robots to periodically traverse and index internet/intranet documents. Directed Web robots - such as Harvest3, WebSQL4 and WebLog5 - apply string-matching and structure-matching commands (e.g. hypertext path expressions) to explore an intranet or a small subset of the internet and retrieve entire documents or parts of them. However, people are generally not looking for lists of documents but either for a precise answer to a precise query, or for a structured presentation of information related to a certain object such as a particular event, technique, software, idea or person. For example, someone looking for "large-scale deductive database systems" does not want a giant list of references to conferences, articles and courses on database systems, or home pages and user manuals of specific database systems. S/he first wants a classification of the features that such systems may have, and may then ask for a classification of existing tools according to some of these features, e.g. the kinds of query language, exploited techniques, API, memory and performance characteristics, multi-user support, reliability, license.
Though such precise information and comparisons are important for each person interested in using deductive database systems, collecting the information just by reading documents is a long and difficult task for that person. However, it is not necessarily difficult for each provider of information on an object to represent this information in a document or a shared knowledge repository so that it can be retrieved - and to a certain extent, merged or composed - via conceptual commands. As opposed to string-matching and structure-matching commands, conceptual commands rely on logical inferences (e.g. exploitation of subsumption relations between terms in the knowledge statements) and improve both precision and recall in information retrieval. They may also be combined with other commands within scripts or ordinary documents to create virtual documents.
Various kinds of applications of knowledge representation, indexation and queries are illustrated by examples on the WebKB site. Here is how some information on the Aditi database system could be represented in one of the structured text notations accepted by WebKB (this information is extracted from the "Catalog of free database systems"13). Relations between each term used in this knowledge statement and other terms may be similarly defined elsewhere (in other documents or shared knowledge repositories) by one or several other users. Then, for example, subsumption relations between terms may be exploited for conceptual retrieval.
[Aditi. isa: large-scale deductive database system; user interface: NU-Prolog, graphical interface (implemented with: Motif); index method: B-trees, multi-level signature files; ports: SunOS, IRIX; ](representation date: 1992/12/17; representation author: email@example.com).
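To make the idea of conceptual retrieval via subsumption more concrete, here is a minimal Python sketch. It is not WebKB code: the taxonomy, the dictionary layout and the function names (`is_subsumed_by`, `retrieve`) are assumptions made for illustration, and only the Aditi attributes come from the statement above.

```python
# Hypothetical toy taxonomy: each term maps to its direct supertypes.
# These links are illustrative; they are not taken from any WebKB ontology.
SUPERTYPES = {
    "large-scale deductive database system": ["deductive database system"],
    "deductive database system": ["database system"],
    "database system": ["software"],
    "software": [],
}

# Statements loosely mirroring the Aditi example: object -> relation -> values.
STATEMENTS = {
    "Aditi": {
        "isa": ["large-scale deductive database system"],
        "user interface": ["NU-Prolog", "graphical interface"],
        "index method": ["B-trees", "multi-level signature files"],
        "ports": ["SunOS", "IRIX"],
    },
}


def is_subsumed_by(term, ancestor):
    """Return True if `term` is `ancestor` or one of its direct or indirect subtypes."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(SUPERTYPES.get(current, []))
    return False


def retrieve(query_type):
    """Return the names of objects whose declared type specializes `query_type`."""
    return sorted(
        name
        for name, relations in STATEMENTS.items()
        if any(is_subsumed_by(t, query_type) for t in relations.get("isa", []))
    )
```

With this sketch, a query on the general term "database system" retrieves Aditi even though that string never appears in its statement, which is the kind of recall improvement that string matching alone cannot give.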
It is handy for an information provider to store and structure knowledge inside Web documents, especially if the duplication of information into machine-readable statements and human-only readable statements can be avoided (e.g. by using a controlled language14 for sentences and a visual language15 for graphics) or at least reduced by the possibility of mixing and linking the two kinds of statements. To allow this, WebKB exploits the convention that each group of knowledge statements or commands in a document must be delimited by the two special HTML tags "<KR>" and "</KR>" or the strings "$(" and ")$". The knowledge representation language used in each group must be specified at its beginning, e.g.: "<KR language="CG">". Each group is visible unless the document's author hides it with HTML comment tags. Furthermore, various notations allow people to use knowledge statements for indexing any part of any Web document (not just parts which can be referred to by URLs). Thus, knowledge statements may be retrieved and handled via document-based commands, and conversely indexed parts of documents may be retrieved and handled via knowledge-based commands.
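The delimiting convention described above can be illustrated with a short Python sketch that pulls knowledge groups out of a document. This is only an assumption about how such extraction might be done, not WebKB's parser: the function name `extract_kr_groups` is hypothetical, and since the text does not say how the language is declared inside a "$(" ... ")$" group, the sketch reports it as `None` for those groups.

```python
import re

# Matches <KR language="..."> ... </KR>, following the example given in the text.
KR_TAG = re.compile(
    r'<KR\s+language\s*=\s*"([^"]*)"\s*>(.*?)</KR>',
    re.IGNORECASE | re.DOTALL,
)
# Matches $( ... )$ groups; the language declaration is left unparsed here.
KR_STRING = re.compile(r'\$\((.*?)\)\$', re.DOTALL)


def extract_kr_groups(html):
    """Return (language, body) pairs for each knowledge group, in document order."""
    groups = []
    for match in KR_TAG.finditer(html):
        groups.append((match.start(), match.group(1), match.group(2).strip()))
    for match in KR_STRING.finditer(html):
        groups.append((match.start(), None, match.group(1).strip()))
    groups.sort(key=lambda group: group[0])  # restore document order
    return [(language, body) for _, language, body in groups]
```

Because the sketch scans the raw text, it also finds groups that the author has hidden inside HTML comments, which matches the idea that hiding a group affects only its display.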
However, as with any other directed Web robot, the scalability and efficiency of the current WebKB are limited by the facts that (i) users must know which documents contain (or may contain) the knowledge to exploit, and (ii) these documents must be accessed and parsed each time their content has to be exploited. Pieces of knowledge, like Web documents, may be provided by all Web users, and need to be inter-related or integrated to allow each user to benefit from the knowledge of users they do not know. For that, cooperatively built knowledge repositories are necessary.
Some Web servers, called ontology servers, support shared knowledge
repositories, e.g. the
Ontolingua ontology server17 and
However, they are not usable for managing large quantities of knowledge and,
apart from AI-Trader19,
they do not allow the indexation and retrieval of parts of documents.
Finally, support of cooperation between the users is essentially limited to
consistency enforcement, annotations and structured dialogues, as in
We are extending WebKB to handle a
cooperatively built knowledge repository
which addresses scalability via the following five points23:
(i) a scalable multi-user persistent object repository to support the storage and exploitation of knowledge structures (we have chosen the Shore24 system);
(ii) algorithms allowing the efficient exploitation of large-scale dynamic taxonomies (we have chosen Fall's algorithms25);
(iii) visualisation techniques (mainly the handling of aliases for terms and the generation of views) to avoid lexical conflicts and enable users to focus on certain kinds of knowledge;
(iv) protocols to allow users to solve semantic conflicts via the insertion of new terms and relations in the common ontology and, in some cases, in the knowledge of other users;
(v) conventions for representing knowledge to improve the automatic comparison of knowledge from different users and hence their consistency and retrieval.
Though these five points permit the exploitation of a large knowledge
repository (which is essential for efficiency and practical use),
it is also clear that, for efficiency and reliability reasons, a single
server cannot handle a universal knowledge repository for all
Web users. Knowledge has to be distributed and mirrored on various knowledge
servers. However, since there are no static conceptual schemas in knowledge
bases, the techniques of distributed database systems - such as
and TSIMMIS29 -
cannot all be reused.
A first step toward distributing a knowledge repository is to duplicate it on several servers, with updates made on one server automatically duplicated on the other servers. Some servers may be dedicated to searches and others to updates.
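This first step can be sketched in Python as follows. The sketch is a simplification under stated assumptions: the class names `KnowledgeServer` and `ReplicatedRepository` are hypothetical, servers are in-memory objects rather than networked processes, statements are plain strings, and replication is synchronous with no failure handling.

```python
class KnowledgeServer:
    """Hypothetical in-memory stand-in for one replica of the repository."""

    def __init__(self, name):
        self.name = name
        self.statements = set()

    def apply(self, statement):
        self.statements.add(statement)

    def search(self, keyword):
        return sorted(s for s in self.statements if keyword in s)


class ReplicatedRepository:
    """Routes updates to update servers and queries to search servers."""

    def __init__(self, update_servers, search_servers):
        self.update_servers = list(update_servers)
        self.search_servers = list(search_servers)
        self._next_search = 0

    def update(self, statement):
        """Accept the update on one server, then duplicate it on all the others."""
        primary = self.update_servers[0]
        primary.apply(statement)
        for server in self.update_servers[1:] + self.search_servers:
            server.apply(statement)

    def search(self, keyword):
        """Answer from the search servers in round-robin order."""
        index = self._next_search % len(self.search_servers)
        self._next_search += 1
        return self.search_servers[index].search(keyword)
```

Since every replica holds the same statements, any search server returns the same answer, so queries can be spread over them without composing partial results.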
A second step is to have general servers and specialized servers. A specialized server would store the same knowledge as general servers plus knowledge related to a well-defined set of objects, e.g. knowledge expressed with the subtypes of certain types. Since these sets of objects are well-defined (extensionally or via definitions), a general server would store the URLs of these servers and, when answering a query, would delegate the query to the relevant servers if more precision is required. These sets of objects might be determined by the managers of specialized servers, or according to the frequency of accesses to objects in knowledge repositories. Whichever specialized server a user updates, if the knowledge s/he enters is relevant to other servers (e.g. if the knowledge is expressed with general terms), it should be automatically duplicated on these servers. The rationale for all these duplications is to speed up searches and simplify the query mechanisms by avoiding, whenever possible, parallel searches on various servers followed by the composition of the results.
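The delegation part of this second step might look like the following Python sketch. All names (`SpecializedServer`, `GeneralServer`, the `need_precision` flag, the example URL) are hypothetical, the servers are in-process objects rather than remote services, and the automatic duplication of general knowledge between servers is deliberately left out.

```python
class SpecializedServer:
    """Hypothetical server covering a well-defined (here: enumerated) set of types."""

    def __init__(self, url, covered_types):
        self.url = url
        self.covered_types = set(covered_types)
        self.facts = {}  # type name -> list of statements

    def covers(self, type_name):
        return type_name in self.covered_types

    def add(self, type_name, statement):
        self.facts.setdefault(type_name, []).append(statement)

    def query(self, type_name):
        return list(self.facts.get(type_name, []))


class GeneralServer:
    """Hypothetical general server that knows which specialized servers exist."""

    def __init__(self):
        self.facts = {}
        self.specialized = []  # registered specialized servers

    def register(self, server):
        self.specialized.append(server)

    def add(self, type_name, statement):
        self.facts.setdefault(type_name, []).append(statement)

    def query(self, type_name, need_precision=False):
        """Answer locally; delegate to relevant servers only if more precision is needed."""
        results = list(self.facts.get(type_name, []))
        if need_precision:
            for server in self.specialized:
                if server.covers(type_name):
                    for statement in server.query(type_name):
                        if statement not in results:
                            results.append(statement)
        return results
```

The general server answers from its own store by default, so the common case involves a single server; the specialized servers are contacted only when the caller asks for more detail on a type they cover.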
Other steps may be necessary, but what should be avoided in this knowledge-based (hence precision-oriented) approach is letting the specialized servers develop independently of each other instead of being part of a unique consistent virtual knowledge repository. Otherwise, conceptual queries and cooperation across the repositories are no longer possible and, as in current traders, the most relevant repository to answer a query has to be automatically "guessed".
Finally, knowledge servers should not be limited to the storage of knowledge statements: they should also allow the storage and handling of knowledge-based and document-based commands, similar to the storage and handling we described for documents.
The more precisely a piece of information is represented, the more adequately it can be retrieved and exploited. General and intuitive knowledge representation languages seem best suited to that. WebKB allows the use of Conceptual Graphs and also of simpler notations when less expressivity or precision is needed. Ambiguities due to declared terms are partially resolved according to the constraints in the ontologies used.
Storing knowledge within documents is handy but the scalability of this approach is limited. Ultimately, we believe a knowledge-based Web relies on scalable, distributed, cooperatively built knowledge repositories. We have proposed (and are working on) some directions toward that goal. In this view, knowledge-annotated documents can be used as isolated modules of knowledge on which a user can work before submitting them to a knowledge server for integration. A document including commands can also be sent to a knowledge server as a template for generating virtual documents. Of course, scripts of commands could also be stored in a repository handled by a knowledge server and referred to from a document. We are currently extending WebKB to allow these combinations.
In the same way as today we register a Web site, we will probably register knowledge representations (or documents including knowledge representations) and complement or refine each other's knowledge.