Toward cooperatively-built knowledge repositories;
example on ontology-related tools

Dr Philippe Martin, Dr Michael Blumenstein, A.Prof. Peter Deer
Griffith University - School of I.C.T. - PMB 50 Gold Coast MC, QLD 9726 Australia
The first author began this article at the Laboratory for Applied Ontology (LOA), Trento, Italy.


Abstract

After noting that neither informal documents nor fully formal knowledge bases are good media for people to share, compare or discuss technical knowledge, we propose mechanisms to support the sharing, re-use and cooperative update of semi-formal semantic networks, and to assign values to the contributions and credits to the contributors. We then propose ontological elements to guide and normalize the construction of such knowledge repositories, and an approach permitting the comparison of tools or techniques.

Table of Contents

  1. Introduction
  2. Support of cooperation between knowledge providers
    1. Making Knowledge Explicit and Sharing It
    2. Valuating contributions and contributors
    3. Combining the advantages of centralization and distribution
  3. Some ontological elements
    1. Domains and Theories
    2. Tasks and Methodologies
    3. Structures and Languages
    4. Tools
    5. Journals, Conferences, Publishers and Mailing Lists
    6. Articles, Books and other Documents
    7. People: Researchers, Specialists, Teams/Projects, ...
  4. Example of comparison of two ontology-related tools
  5. Conclusion
  6. Acknowledgments
  7. References



1. Introduction

The majority of technical information is currently published in mostly unstructured form within documents such as articles, e-mails and user manuals. Therefore, finding and comparing tools or techniques to solve a problem is a lengthy process (most often with sub-optimal results) that involves reading many documents partly redundant with each other. This process relies heavily on human memory and manual cross-checking, and its outcomes, even if published, are lost to many people with similar goals. Writing documents is also a lengthy process that involves summarizing what has been described elsewhere and making choices and compromises on which ideas or techniques to describe and how: level of detail, order, etc.

To sum up, whatever the field of study, there is currently no well-structured semantic network of techniques or ideas that a Web user could (i) navigate to get a synthetic view of a subject or, as in a decision tree, quickly find a path to relevant information, and (ii) easily update to publish a new idea (or a new way to explain an idea) and link it to other ideas via semantic relations. Such a system is indeed difficult to build and initialize. However, it is part of a vision for a semi-formal "standardized online archive of computer science knowledge" (Smith, 1998) and dwarfed by the much more ambitious visions of a "Digital Aristotle" which would be capable of teaching much of the world's scientific knowledge by (i) adapting to its students' knowledge and preferences (Hillis, 2004), and (ii) preparing and answering (with explanations) test questions for them (this implies encoding the knowledge in a formal way and meta-reasoning on the problem-solving strategies). The current approaches related to the above-cited problems can be divided into three groups.

First, the approaches indexing (parts of) documents by metadata, such as Dublin Core metadata, DocBook metadata, topics (generated, e.g., by Latent Semantic Analysis or other keyword co-occurrence analysis techniques, or manually decided as in the Open Directory Project and the topic hierarchies of Yahoo), categories from ontologies (e.g. WordNet or a specialized lightweight ontology as in the KA2 project and some other Semantic Web projects), or, more rarely, formal summaries (e.g. in Conceptual Graphs). These approaches are useful for retrieving or exploiting a large base of documents (e.g. Iridescent helps researchers and companies find keyword relationships within the 13 million abstracts of the MEDLINE database) but do not lead to any browsable/updatable semantic network synthesizing and comparing facts or ideas. The same can be said about most document-related query answering systems (e.g. those exploiting Natural Language Understanding techniques).

Second, the approaches aiming to represent elements of a domain in formal or semi-formal knowledge bases (KBs). Examples are Open GALEN (an ontology of medical knowledge), the KBs of Fact Guru (one on Canadian Animals, one on Astronomy, one on Java, and one on Object-Oriented Software Engineering), the QED Project (which aims to build a formal KB of all important, established mathematical knowledge), and the KBs of the Halo project (the long-term goal of this project is the design of a "Digital Aristotle"; in its first phase, three research teams have each represented the content of a 70-page chemistry textbook and used this KB to design a system that answers questions from an Advanced Placement exam and explains the provided answers). The first two KBs essentially contain term definitions (formal in Open GALEN, semi-formal in Fact Guru) and hence are interesting to re-use for representing or indexing parts of documents, but would be insufficient for learning about a domain or for finding and comparing techniques to solve a problem. The last two are formal and interesting for automatic inferencing but are not meant to be directly read and browsed.

Third, the hypertext-based Web sites describing and organizing or comparing resources (researchers, discussion lists, journals, concepts, theories, tools, etc.) of a domain, e.g. MathWorld and the American Mathematical Society. Some Web sites permit their users to collaborate or discuss by adding or updating documents, e.g. via wiki systems or annotation systems. Because these systems do not use semantic relations, the resulting information is often as poorly structured as in mailing lists and hence includes many redundancies, and arguments and counter-arguments for an idea are hard to find and compare. However, Wikipedia, an on-line hypertext encyclopedia which is also a wiki, albeit a very controlled one, has good quality articles on a wide variety of domains. These articles are well connected and permit their readers to get an overview of a subject and explore it to find information. Yet, Wikipedia's content structure and its support for collaboration and information retrieval (IR) could be improved. An easy-to-use and easy-to-implement feature would be support for certain semantic relations (e.g. subtypeOf, instanceOf, and partOf) and especially argumentation relations (e.g. proof, example, hypothesis, argument, correction), e.g. as in pre-Web hypertext systems like AAA (Schuler & Smith, 1990), but also allowing the introduction and use of additional relations in an ontology. A semi-formal English-like syntax such as ClearTalk (the notation used in CODE4 and Fact Guru) would support more knowledge processing while still being user-friendly.

It would be utopian to think that even motivated knowledge engineers would be (in the near future) able or willing to represent their research ideas completely in a formal, shared, well-structured, readable semantic network that can be explored like a decision tree: there are too many things to enter, too many ways to describe or represent the same thing, and too many ways to group and compare these things. On the other hand, representing the most important structures in such a semantic network and interconnecting them with informal representations seems achievable and extremely interesting for education and IR purposes. Section 2 proposes some mechanisms to support the sharing, re-use and cooperative update of such semantic networks, including some mechanisms to assign values to the contributions and credits to the contributors. Section 3 proposes some ontological elements to guide and normalize the construction of these knowledge repositories. Section 4 presents an approach permitting the comparison of tools or techniques. The domain of ontology-related tools is used as an example.



2. Support of cooperation between knowledge providers

2.1 Making Knowledge Explicit and Sharing It

Here, we only consider asynchronous cooperation since it underlies, and is more scalable than, synchronous exchanges of information between co-temporal users of a system.

The most decentralized knowledge sharing strategy is the one the W3C envisages for the "Semantic Web": many small ontologies, more or less independently developed and thus partially redundant, competing and very loosely interconnected. Hence, these ontologies have problems similar to those we listed for documents: (i) finding the relevant ontologies, choosing between them and combining them requires common sense (and hence is difficult and sub-optimal even for a knowledge engineer, let alone for a machine), (ii) a knowledge provider cannot simply add one concept or statement "at the right place" and is not guided by a large ontology (and a system exploiting it) into providing precise concepts and statements that complement existing ones and are more easily re-used, and (iii) the result is not only more or less lost to others but increases the amount of "data" to search.

A more knowledge-oriented strategy is to have a knowledge server permitting registered users to access and update a single large ontology on a domain and to upload files mixing natural language sentences with knowledge representations (e.g. in a controlled language). WebKB-1, WebKB-2, OntoWeb/Ontobroker and Fact Guru are examples of servers allowing this. This was also the strategy used in the well-publicized KA2 project (Benjamins & al., 1998), which re-used Ontobroker and aimed to let Knowledge Acquisition researchers index their resources, but (i) the provided ontology was extremely small (more details in Section 3.1) and could not be directly updated by users, and (ii) the formal statements had to be stored within an invented attribute (named "onto") of HTML hyperlink tags via a poorly expressive language. Thus, the approach was limiting, which may be one of the reasons why the project achieved only limited success.

We know of only two knowledge servers having special protocols to support cooperation between users: Co4 and WebKB-2 (note: most servers, including WebKB-2, support concurrency control (e.g. via KB locking) and several, like Ontolingua, support user permissions on files/KBs; cooperation support is not so basic: it is about helping knowledge re-use, conflict prevention and the solving of each conflict once it has been detected by the system or a user). The approach of Co4 is based on peer-reviewing; the result is a hierarchy of KBs, the uppermost ones containing the most consensual knowledge while the lowermost ones are the private KBs of contributing users. We believe the approach of WebKB-2, which has a KB shared by all its users, leads to more relations between categories (types or individuals) or statements from the different users and may be easier to handle (by the system and the users) for a large amount of knowledge and a large number of users. Details can be found in Martin (2003a) but here is a summary of its principles.

To avoid lexical problems, each category identifier is prefixed by a short identifier of its creator (who is also represented by a category and thus may have associated statements). Each statement also has an associated creator and hence, if it is not a definition, may be considered as a belief. Both this namespace mechanism and the embedding of statements can be seen as ways to represent explicit modules, i.e. modules that can be reasoned upon (as opposed to file-based modules). Any object (category or statement) may be re-used by any user within his/her statements. The removal of an object can only be done by its creator, but a user may "correct" a belief by connecting it to another belief via a "corrective relation" (e.g. pm#corrective_restriction). (Definitions cannot be corrected since they cannot be false.) If entering a new belief introduces a redundancy or an inconsistency that is detected by the system, the belief is rejected. The user may then either correct his/her belief or re-enter it connected by a "corrective relation" to each belief it is redundant or inconsistent with: this allows and makes explicit the disagreement of one user with (her interpretation of) the belief of another user. This also technically removes the cause of the problem: a proposition A may be inconsistent with a proposition B, but a belief that "A is a correction of B" is not technically inconsistent with a belief in B. (Definitions from different users cannot be inconsistent with each other, they simply define different categories/meanings; a system of "category cloning" could be used to handle this situation automatically but the resulting ontology would be much more complex than via the manual handling of the situation by each category creator that is occasionally faced with it; hence, such a system has not been implemented in WebKB-2.)
Choices between beliefs may have to be made by people re-using the KB for an application but then they can exploit the explicit relations between beliefs, e.g. by always selecting the most specialized ones. The query engine of WebKB-2 always returns a statement with its meta-statements, hence with the associated corrective relations. Finally, in order to avoid seeing the objects of certain creators during browsing or within query results, a user may set filters on these creators, based on their identifiers, types or descriptions.
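The belief-entering protocol above can be sketched in a few lines of code. This is an illustrative Python sketch, not actual WebKB-2 code: the Belief and SharedKB classes, their fields and the pluggable detect_conflict check are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    creator: str
    text: str
    corrects: list = field(default_factory=list)  # beliefs this one "corrects"

class SharedKB:
    """Single KB shared by all users; conflict detection is pluggable."""
    def __init__(self, detect_conflict):
        self.beliefs = []
        self.detect_conflict = detect_conflict

    def add(self, belief):
        # Existing beliefs the new one is redundant or inconsistent with
        conflicting = [b for b in self.beliefs if self.detect_conflict(belief, b)]
        uncorrected = [b for b in conflicting if b not in belief.corrects]
        if uncorrected:
            # Rejected: the user must either change the belief or re-enter it
            # connected by a corrective relation to each conflicting belief.
            return False, uncorrected
        self.beliefs.append(belief)
        return True, []
```

With a toy conflict check (two statements conflict when one negates the other), a conflicting belief is rejected until it is re-entered with a corrective relation, which is exactly the behaviour described above.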

For the construction of knowledge repositories, an interesting aspect of this approach to encouraging re-use, precision and object connectivity is that it also works for semi-formal KBs. Here, regarding a statement, semi-formal means that if it is written in a natural language (whether it uses formal terms or not) it must at least be related to another statement by a formal relation, e.g. a generalization relation (pm#corrective_generalization, pm#summary, etc.) or an argumentation relation. Thus, to minimize redundancies and to help information retrieval within information repositories, this minimal semantic structure (which in many cases is the only one tolerable to many people) could be used to organize ideas that are otherwise repeated in many documents. For instance, for a Web site that centralizes and organizes/represents in a formal, semi-formal and informal way resources (tools, techniques, publications, mailing lists, teams, etc.) related to a domain, it would be very interesting to have some space where discussions could be conducted in this minimal semi-formal way, and hence index or partly replace the mailing list: this would help avoid recurring discussions or presentations of arguments, show the tree of arguments and counter-arguments for an idea, permit incremental additions, encourage deeper or more systematic explorations of each idea, and record the various status quos reached.

Below is an extract from the beginning of a semi-formal discussion about the statement "a Knowledge Representation Language (KRL) should (also) have an XML notation to ease knowledge sharing". This example shows how three important constructs can be represented: the relation from a statement, the relation on a relation (or more exactly, the relation from a statement connecting two statements), and the conjunctive set of statements. These constructs are important for representing structured discussions even though few argumentation-oriented hypertext systems offer them (ArguMed is one of the exceptions; see also this analysis of Toulmin's argumentation structures).
Notes.
1) The following structures are not expected to be the direct result of a discussion but they may be the result of a semi-automatic re-organization of discussions and then they may be refined by further semi-formal discussions;
2) Relations such as "specialization" or "corrective_restriction" may seem odd to use between informal statements but they are essential for checking the updates of the argumentation structures and hence guiding or exploiting them; specialization relations are used in several argumentation systems (for example, the (counter-)arguments for a statement are valid for its specializations and the (counter-)arguments of the specializations are (counter-)examples for their generalizations);
3) The author of each statement (and hence also of each relation between statements) is not shown below but is recorded (the next section illustrates an exploitation of this meta-information and others);
4) Each of the statements can be re-used independently in various structures and hence cannot refer to some other statement implicitly (the keyword this used below is a shortcut that is automatically generated by the system when displaying a structure: the actual statements do not contain such a shortcut);
5) The statements do not systematically begin with a capital letter in order not to limit their re-use; for example, if parts of these structures are directly re-used to generate English sentences, the problem of converting (or not) the initial uppercase into a lowercase does not have to be solved.

"a KRL (Knowledge Representation Language) can have an XML notation"
   extended_specialization: "a KRL should have an XML notation" (pm),
   argument: ("the data model of a KRL can be stored into a tree-based structure"
                 argument: - "a graph-based model can be stored into a tree-based
                              structure" (pm)
                           - "the data model of a KRL has to be graph-based" (pm)
             )(pm);


   "a KRL should (also) have an XML notation"
      specialization: "the Semantic Web KRL should have an XML notation" (pm),
      argument: "an XML notation permits a KRL to use URIs and Unicode" (fg,
        objection: ("most syntaxes can easily be adapted to have
                     object identifiers using URIs and Unicode"
                       argument_by_authority: "this was noted by Berners-Lee" (pm)
                   )(pm)),
      argument: "XML can be used for knowledge exchange or storage" (fg,
        objection: "XML is useless or detrimental for knowledge representation, exchange or storage" (pm)),
      argument: "a KRL may have various notations in addition to an 
                 XML-based notation" (pm,
        objection: "the more notations there are the less one of them is going to be
                    commonly adopted for knowledge exchange" (pm)),
      argument: "not using XML for a notation implies that a plug-in has to be installed
                 for each syntax" (pm,
        objection: "XML tools need to be complemented for the semantics of
                    the knowledge representation to be handled" (pm),
        objection: "installing a plug-in is likely to take less time than 
                    always loading XML files" (Sowa));



"the data model of a KRL has to be graph-based"
   argument_by_popularity: "this is acknowledged by about everyone" (pm),
   argument_by_authority: "this is acknowledged by the W3C" (pm);


"XML can be used for knowledge exchange or storage"
   argument: - "an XML notation permits classic XML tools (parsers, XSLT, ...) to
                be re-used" (pm)
             - "classic XML tools are usable even if a graph-based model is used" (pm);


"classic XML tools are usable even if a graph-based model is used"
   specialization: "classic XML tools work on RDF/XML" (pm);




"XML is useless or detrimental for knowledge representation, exchange or storage"
   argument: ("using XML tools for KBSs is a useless additional task"
                 argument: "KBSs do not use XML internally" (pm,
                   objection: "XML can be used for knowledge exchange or storage" (fg,
                     objection: "it is as easy to use other formats for
                                 knowledge exchange or storage" (pm),
                      objection: "a KBS (also) has to use other formats for
                                  knowledge exchange or storage" (pm)))
             )(pm),
   argument: "XML is not a good format for knowledge exchange or storage" (pm);


"XML is not a good format for knowledge exchange or storage"
   argument: - ("XML-based knowledge representations are hard to understand"
                   argument_by_popularity: "this is acknowledged by about everyone" (pm),
                   argument_by_authority: "this is acknowledged by the W3C" (pm)
               )(pm)
             - "a knowledge interchange format should be easy to read and understand
                with a simple editor, by trained people" (pm);

My home page for structured discussions gives access to other examples.

We shall have to do many experiments to see if most of the content of mails can be directly organized into an argumentation tree for each idea and thus permit people to compare and evaluate arguments and counter-arguments (this is very difficult when they are spread across many emails), or if the result will still be difficult to follow and useless because the participants have different goals, assumptions or terminologies (e.g. many discussions on the CG and SUO lists occur because some Semantic Webers use words such as "knowledge", "semantic" and "logic inferencing" when, for the same referred concepts, others would only use words such as "data", "structured" and "data exploitation"). Thus, it may be that the above approach requires or leads to deeper discussions (and possibly using some formal terms) and hence that most of the content of mails cannot be directly organized.


2.2 Valuating contributions and contributors

The above-described knowledge sharing mechanism of WebKB-2 records and exploits annotations by individual users on statements but does not record and exploit any measure of the "usefulness" of each statement, a value representing its "global interest", acceptance, popularity, originality, etc. Yet, this seems interesting for a knowledge repository and especially for semi-formal discussions: statements that are obvious, un-argued, or for which each argument has been counter-argued, should be marked as such (e.g. via darker colors or smaller fonts) in order to make them less visible (or invisible, depending on the selected display options) and discourage the entering of such statements. More generally, the presentation of the combined efforts from the various contributors may then take into account the usefulness of each statement. Furthermore, given that the creator of each statement is recorded, (i) a value of usefulness may also be calculated for each creator (and displayed), and (ii) in return, this value may be taken into account to calculate the usefulness of the creator's contributions; these are two additional refinements to both detect and encourage argued and interesting contributions, and hence regulate them.

Ideally, the system would accept user-defined measures of usefulness for a statement or a creator, and adapt its display of the repository accordingly. Below, we present the default measures that we shall soon implement in WebKB-2 (or more exactly, its successor and open-source version, AnyKB). We may try to support user-defined measures but since each step of the user's browsing would imply dynamically re-calculating the usefulness of all statements (except those from WordNet) and all creators, the result is likely to be very slow. For now, we only consider beliefs: we have not yet defined the usefulness of a definition.

To calculate the usefulness of a belief, we first associate two more basic attributes to the belief: 1) its "state of confirmation" and 2) its "global interest".
1) The first is equal to 0 if the belief has no argument or counter-argument connected to it (examples of counter-argument relation names: "counter-example", "counter-fact", "corrective-specialization"). It is equal to 1 (i.e. the belief is "confirmed") if (i) each argument has a state of confirmation of 0 or 1, and (ii) there exists no confirmed counter-argument. It is equal to -1 if the belief has at least one confirmed counter-argument. It is also equal to 0 in the remaining case: no confirmed counter-argument but each of the arguments has a state of confirmation of -1. All this is independent of who authored the (counter-)arguments.
2) Each user may give a value to the interest of a belief, say between -5 and 5 (the maximum value that the creator of the belief may give is, say, 2). Multiplied by the usefulness of the valuating user, this gives an "individual interest" (thus, this may be seen as a particular multi-valued vote). The "global interest" of a belief is defined as the average of its individual interests (thus, this is a voting system where more competent people in the domain of interest are given more weight). A belief that does not deserve to be visible, e.g. because it is clearly a particular case of a more general belief, is likely to receive a negative global interest. We prefer letting each user explicitly give an interest value rather than taking into account the way the belief is generalized by or connected to (or included in) other beliefs because interpreting an interest from such relations is difficult. For example, a belief that is used as a counter-example may be a particular case of another belief but is nevertheless very interesting as a counter-example.
Finally, the usefulness of a belief is equal to the value of the global interest if the state of confirmation is equal to 1, and otherwise is equal to the value of the state of confirmation (i.e. -1 or 0: a belief without argument has no usefulness, whether it is itself an argument or not).
In argumentation systems, it is traditional to give a type to each statement, e.g. fact, hypothesis, question, affirmation, argument, proof. This could be used in our repositories too (even though the connected relations often already give that information) and we could have used it as a factor to calculate the usefulness (e.g. by considering that an affirmation is worth more than an argument) but we prefer a simpler measure only based on explicit valuations by the users.
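The rules of point 1 can be sketched as a small recursive function. The Python sketch below is purely illustrative: the Belief class and its field names are assumptions of this example, not an actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Belief:
    text: str
    arguments: List["Belief"] = field(default_factory=list)
    counter_arguments: List["Belief"] = field(default_factory=list)

def confirmation_state(b: Belief) -> int:
    """Return -1, 0 or 1 following the rules of point 1 (author-independent)."""
    if any(confirmation_state(c) == 1 for c in b.counter_arguments):
        return -1                       # at least one confirmed counter-argument
    if not b.arguments and not b.counter_arguments:
        return 0                        # nothing yet connected to the belief
    if all(confirmation_state(a) in (0, 1) for a in b.arguments):
        return 1                        # confirmed
    return 0                            # remaining case: every argument refuted
```

For instance, a lone belief gets state 0, a belief with one (unrefuted) argument gets state 1, and a belief attacked by a confirmed counter-argument gets state -1.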

Our formula for a user's usefulness is: the sum of the usefulness of the user's beliefs + the square root of the number of times the user voted on the interest of beliefs. The second term acknowledges the participation of the user in votes while decreasing its weight as the number of votes increases. (Functions growing more slowly than the square root may perhaps better balance originality and participation effort.)
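The two measures (usefulness of a belief, usefulness of a user) amount to a few lines each. The function names below are hypothetical; this is only a sketch of the formulas given in the text.

```python
import math
from typing import Iterable

def belief_usefulness(confirmation_state: int, global_interest: float) -> float:
    """Usefulness of a belief: its global interest if confirmed (state 1),
    otherwise its state of confirmation (-1 or 0)."""
    return global_interest if confirmation_state == 1 else confirmation_state

def user_usefulness(own_beliefs_usefulness: Iterable[float], nb_votes: int) -> float:
    """Sum of the usefulness of the user's beliefs, plus a participation
    bonus that grows only as the square root of the number of votes cast."""
    return sum(own_beliefs_usefulness) + math.sqrt(nb_votes)
```

For example, a user whose beliefs have usefulness 3.5, -1 and 0, and who voted 9 times, gets a usefulness of 2.5 + 3 = 5.5.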

These measures are simple but should incite the users to be careful and precise in their contributions (affirmations, arguments, counter-arguments, etc.) and to give arguments for them: unlike in traditional discussions or anonymous reviews, careless statements here penalise their authors. Thus, this should lead users not to make statements outside their domain of expertise or without verifying their facts. (Using a different pseudonym when providing low-quality statements does not seem to be a helpful strategy for escaping the above approach since it reduces the number of statements authored under the first pseudonym.) On the other hand, the above measures should hopefully not lead "correct but outside-the-main-stream contributions" to be under-rated since counter-arguments must be justified. Finally, when a belief is counter-argued, the usefulness of its author decreases, and hence he/she is incited to deepen the discussion or remove the faulty belief.

In his description of a "Digital Aristotle", Hillis (2004) describes a "Knowledge Web" to which researchers could add ideas or explanations of ideas "at the right place", and suggests that this Knowledge Web could and should "include the mechanisms for credit assignment, usage tracking, and annotation that the [current] Web lacks", thus supporting a much better re-use and evaluation of the work of a researcher than via the current system of article publishing and reviewing. However, Hillis does not give any indication on such mechanisms. Although the mechanisms we proposed in this sub-section and the previous one were intended for one knowledge repository/server, they seem usable for the Knowledge Web too. To complement the approach with respect to the Knowledge Web, the next sub-section proposes a strategy to achieve knowledge sharing between knowledge servers.

Again, an alternative (or, in the long term, complementary) approach is the one of Co4 which, via its hierarchy of KBs generated by peer-reviewing of statements from the users' private KBs, supports knowledge sharing and makes explicit various consensuses. However, assuming there are N statements shared by the users of Co4, there could in the worst case be 2^N possible KBs (one per subset of statements) if the protocols accept all groupings. Even though this is surely not the case, which KBs should a person look at for finding relations between statements or evaluating the usefulness of a statement/author? Furthermore, the uppermost KBs only represent consensus, not usefulness.

Although independently developed, our approach appears to be an extension of the version of SYNVIEW designed in 1985. In this hypertext system, statements had to be connected by (predefined or user-invented) relations and each statement was valuated by users (this value and another one calculated from the values of arguments and counter-arguments for the statement was simply displayed near the statement so as to "summarize the strengths assigned to the various items of evidence within the given contexts"). In 1986, to ease information entering and thus hopefully permit the collaborative work of a small community to create an information repository large enough to interest other people and lead them to participate and store information too, the authors of SYNVIEW removed the constraint of using explicit relations between statements (the statements must be organized hierarchically but the relations linking them are unknown) and replaced the possibility of grading each statement by the possibility of ranking statements within the list of sibling statements having a same direct super-statement. A similar move away from structured representations was made by Shum, Motta and Domingue (1999) for the same reason and the idea of making the approach more "scalable". Although such a move clearly makes information entering easier, in our view it makes the system far less likely to scale because the information is far less retrievable and exploitable, and hence of less interest for people to search or complement. Such moves have apparently failed to attract more interest than the original, more structured approaches. Since unstructured approaches have strong inherent limitations, we are opting for a move towards improving the entering and sharing of structured forms.


2.3 Combining the advantages of centralization and distribution

One knowledge server cannot support the knowledge sharing of all researchers. It has to be specialized or act as a broker for more specialized servers. If competing servers had equivalent content (today, Web search engines already have "similar" content), a Web user could query or update any general server and, if necessary, be redirected to a more specialized server, and so on recursively (at each level, only one of the competing servers has to be tried since they mirror each other). If a Web user directly tried a specialized server, it could redirect him/her to a more appropriate server or indicate which other servers may provide more information for his/her query (or directly forward this query to these other servers).

To permit this, our idea is that each server periodically checks related servers (more general servers, competing servers and slightly more specialized servers) and
1) integrates (and hence mirrors) all the objects (categories and statements) generalizing the objects in a reference collection that it uses to define its "domain" (if this is a general server, this collection is reduced to pm#thing, the uppermost concept type),
2) integrates either all the objects that are more specialized than the objects in the reference collection, or if a certain depth of specialization is fixed, associates to its most specialized objects the URLs of the servers that can provide specializations for these objects (note: classifying servers according to fields/domains is far too coarse to index/retrieve knowledge from distributed knowledge servers, e.g. knowledge about "neurons" or "hands" can be relevant to many domains; thus, a classification by objects is necessary), and
3) also associates the URLs of more general servers to the direct specializations of the generalizations of the objects in the reference collection (this is needed since the specializations of some of these specializations do not generalize nor specialize the objects in the reference collection).
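The three integration rules above amount to classifying each object with respect to the server's reference collection. The sketch below illustrates this decision; the toy hierarchy, the reference collection and the function names are illustrative assumptions, not WebKB-2's actual API.

```python
# Hedged sketch of rules (1)-(3) above, on a toy generalization hierarchy.
parents = {                      # object -> direct generalizations
    "thing": [],
    "process": ["thing"],
    "task": ["process"],
    "KM_task": ["task"],
    "knowledge_retrieval": ["KM_task"],
    "FCA_task": ["task"],        # a sibling branch of KM_task
}

def generalizations(obj):
    """All (transitive) generalizations of obj."""
    seen, todo = set(), [obj]
    while todo:
        o = todo.pop()
        for p in parents.get(o, []):
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen

def classify(obj, reference):
    """Decide what a server whose domain is `reference` does with `obj`."""
    if any(obj == r or obj in generalizations(r) for r in reference):
        return "mirror"             # rule 1: generalizes the reference collection
    if any(r in generalizations(obj) for r in reference):
        return "store_or_delegate"  # rule 2: specializes it (a depth limit may apply)
    return "point_to_other_server"  # rule 3: sibling branch, keep only a server URL

ref = {"KM_task"}
print(classify("task", ref), classify("knowledge_retrieval", ref),
      classify("FCA_task", ref))   # mirror store_or_delegate point_to_other_server
```

Rule 3 applies to objects such as "FCA_task" above: they neither generalize nor specialize the reference collection, so only the URLs of more general servers are associated to them.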

Integrating knowledge from other servers is certainly not obvious but this is a more scalable and exploitable approach than letting people and machines select and re-use or integrate dozens or hundreds of (semi-)independently designed small ontologies. A more fundamental obstacle to the widespread use of this approach is that many industry-related servers are likely to make it difficult or illegal to mirror their KBs. However, other approaches will likely suffer from that too.



3. Some ontological elements

By default, the shared KB of WebKB-2 includes an ontology derived from the noun-related part of WordNet and various top-level ontologies (Martin, 2003b). A large general ontology like this is necessary to ease and normalize the cooperative construction of knowledge repositories but is still insufficient: an initial ontology on the domain to which the repository will be dedicated is also necessary. As a proof of concept for our tools to support a cooperatively-built knowledge repository, we initially chose to model two related domains: (i) Conceptual Graphs (CGs), since this is the domain best known to us, and (ii) ontology-related tools, since Michael Denny's "Ontology editor survey" attracted some interest (or at least the idea did, because the result was frustratingly superficial and poorly structured: it did not permit the comparison of the tools and was probably misleading for non-specialists). Indeed, ontology tool authors -- including us, regarding WebKB-2 -- were given a short list of rather vague criteria to use, and the answers, instead of being analysed and used to construct various tables comparing similar tools on the same criteria, were abbreviated and directly put into one big table.

Modelling these two domains implies partially modelling other related domains, and we soon had to modularize the information into several files to support readability, search, checking and systematic input. These files are also expected to be updatable by users when our knowledge-oriented wiki is completed. Although the users of WebKB-2 can directly update the KB one statement at a time, the documentation discourages them from doing so because this is not a scalable way to represent a domain (by analogy, a command line interface is not a scalable way to develop a program). Instead, they are encouraged to create files mixing formal and informal statements, ask WebKB-2 to parse these files, and in the end, when the modelling is complete and if they wish to, integrate them into the shared KB. To keep the modelling generic, we have created six files: Fields of study, Systems of logic, Information Sciences, Knowledge Management, Conceptual Graph and Formal Concept Analysis. The last three files specialize the others. Each of the last four files is divided into sections, the uppermost ones being "Domains and Theories", "Tasks and Methodologies", "Structures and Languages", "Tools", "Journals, Conferences, Publishers and Mailing Lists", "Articles, Books and other Documents" and "People: Researchers, Specialists, Teams/Projects, ...". This is a work in progress: the content and number of files will increase but the sections seem stable. We now give examples of their content.


3.1. Domains and Theories

Names used for domains ("fields of study") are very often also names for tasks. Task categories are more convenient for representing knowledge than domain categories because (i) organizing them is easier and less arbitrary, and (ii) many relations (e.g. case relations) can then be used. Since for simplicity and normalization purposes a choice must be made, whenever suitable we have represented tasks instead of domains. When names are shared by domain categories and task categories (in WebKB-2, categories can share names but not identifiers), we advise using the task categories in indexations or representations.

When studying how to represent and relate document subjects/topics (e.g. technical domains), Welty & Jenkins (1999) concluded that representing them as types was not semantically correct but that mereo-topological relations between individuals were appropriate. Our own analysis confirmed this and we opted for (i) an interpretation of theories and fields of study as large "propositions" composed of many sub-propositions (this seems the simplest, most precise and most flexible way to represent these notions), and (ii) a particular part relation that we named ">part" (instead of "subdomain") for several reasons: to be generic, as a reminder that it can be used in WebKB-2 as if it were a specialization relation (e.g. the destination category need not be already declared), and to make clear that our replacement of WordNet hyponym relations between synsets about fields of study by ">part" relations refines WordNet without contradicting it. Our file on "Fields of study" details these choices. Our file on "Systems of logic" illustrates how, for some categories, the represented field of study is a theory (rather than a reference to one), thus simplifying and normalizing the categorization. Below is an example (in the FT notation) of relations from the WordNet category #computer_science, followed by an example about logical domains/theories. When introducing general categories in Information Sciences and Knowledge Management, we used the generic users "is" and "km". In WebKB-2, a generic user is a special kind of user that has no password: anyone can create or connect categories in its name but then cannot remove them.

#computer_science__computational_science 
  (^branch of engineering science that studies computable processes and structures^)
  >part:    #artificial_intelligence,  //according to WordNet, AI is ">part:" of CS
  >part:    is#software_engineering_science (is), //"(is)": relation created by "is"
  >part:    is#database_management_science (is),
  >part of: #engineering_science__engineering__applied_science__technology,
  part:     #information_theory,  //relation coming from WordNet: "(wn)" is implicit
  part of:  #information_science; //WordNet has some part relations between domains

km#substructural_logic 
 (^system of propositional calculus that is weaker than the conventional one^)
 >part of:  km#non-classical_logic__intuitionist_logic,
 >part:  km#relevance_logic  km#linear_logic,
 url: http://en.wikipedia.org/wiki/Intuitionistic_logic;

km#CG_domain__Conceptual_Graphs__Conceptual_Structures
 >part of: km#knowledge_management_science,
 object: km#CG_task  km#CG_structure  km#CG_tool  km#CG_mailing_list,
 url: http://www.cs.uah.edu/~delugach/CG/  http://www.jfsowa.com/cg/;

For guiding the sharing, indexation or representation of techniques in Knowledge Management, hundreds of domains, theories or tasks need to be represented in a shared ontology that anyone can easily complement. We have begun this work.
On the other hand, as noted earlier, the ontology of the KA2 project was small and additions had to be suggested by e-mail. Most of this ontology is shown below in FT (a lossless translation from the source in Frame Logic): the whole subtype hierarchy is shown (types in italics are concept types with no subtype); the relations that can be associated to an instance of the type organization are shown, but such relations have been omitted for the other general concept types.
Furthermore, most of this "ontology" is composed of about 36 "domains" organized by subtype relations. The names of these domains also denote tasks, structures, methods and experiments (e.g. "reuse_in_KA > ontologies PSMs;  PSMs > Sysiphus-III_experiment"). Not representing these notions as objects prevents their use in knowledge representations. Finally, this domain decomposition is far from being a decision tree, and what some domains refer to is quite ambiguous. The comments in this example are ours.

root > organization  project  event  person  publication  product  research_topic;

  organization
    > enterprise  university  department  institute  research_group,
    name: string,  location: string,  employs: person,  publishes: publication,
    technical_report: technical_report,  carries_out: project,
    develops: product,  finances: project;

  project > research_project  development_project; 
    development_project > software_project;

  event > conference  workshop  activity  special_issue  meeting;

  person > student  employee;
    student > PhD_student;
    employee > academic_staff  administrative_staff; //should be named "..._staff_member"
      academic_staff  > lecturer  researcher;   researcher > PhD_student;
      administrative_staff > secretary  technical_staff;

  publication > book  article  journal  online_publication;
    article > technical_report  journal_article  article_in_book
              conference_paper  workshop_paper;
    journal > special_issue;

  research_topic  //this "specialization" hierarchy is far from being a decision tree
   > KA_through_machine_learning  reuse_in_KA
     KA_methodologies  specification_languages
     validation_and_verification  KA_evaluation  //difference between the two??
     knowledge_management  knowledge_elicitation;

     KA_through_machine_learning  //machine learning techniques
      > abduction  case_base_reasoning__CBR
        cooperative_KA //what does that refer to?
        knowledge_based_refinement  knowledge_discovery_in_datasets
        data_mining  learning_apprentice_systems  reinforcement_learning;

     reuse_in_KA > ontologies  PSMs;
       ontologies > theoretical_foundations  software_applications  methodologies;
       PSMs > PSM_evaluation  PSM_libraries  PSM_notations  automated_PSM_generation 
              Sysiphus-III_experiment  Web_mediated_PSM_selection  software_reuse;

    specification_languages
      > specification_methodology  specification_of_control_knowledge
        support_tools_for_formal_methods  automated_code_generation_from_specification
        executable_specification_languages;

    validation_and_verification 
     > anomaly_detection  anomaly_repair_and_knowledge_revision 
       formalisms  methodology  validation_and_verification_of_MAS;



3.2. Tasks and Methodologies

In most model libraries in Knowledge Acquisition, each non-primitive task is linked to techniques that can be used for achieving it and, conversely, each technique combines the results of more primitive tasks. We tried this organization but, at the level of generality of our current modelling, it turned out to be inadequate: it led (i) to arbitrary choices about whether to represent something as a task (a kind of process) or as a technique (a kind of process description), or (ii) to the representation of both notions, and thus to the introduction of categories with names such as "KA_by_classification_from_people"; both cases are problematic for readability and normalization. Similarly, instead of representing methodologies directly, i.e. as another kind of process description, it seems better to represent the tasks advocated by a methodology (including their supertask: following that methodology). Furthermore, with tasks, many relations can be used directly: similar relations do not have to be introduced for techniques or methodologies (the relation hierarchy should be kept small, if only for normalization purposes). Hence, we represented all these things as tasks and used multi-inheritance. This considerably simplified the ontology and the source files. Here are some extracts.

km#KM_task__knowledge_management__KM  (^a K.M. (sub)task^)
 < is#information_sciences_task,
 > km#knowledge_representation  km#knowledge_extraction_and_modelling
   km#knowledge_comparison  km#knowledge_retrieval_task 
   km#knowledge_creation  km#classification  km#KB_sharing_management 
   km#mapping/merging/federation_of_KBs  km#knowledge_translation  
   km#knowledge_validation  
   {km#monotonic_reasoning  km#non_monotonic_reasoning}
   {km#consistent_inferencing km#inconsistent_inferencing}
   {km#complete_inferencing km#incomplete_inferencing}
   {km#structure-only_based_inferencing km#rule_based_inferencing}
   km#teaching_a_KM_related_subject  
   km#language/structure_specific_task  km#KM_methodology_task,
 object of: km#knowledge_management_science,
 object: km#KM_structure;  //between types, the default cardinality is 0..N 
  //The general relation "object" has different (more specialized) meanings depending on
  // the connected categories: in the last relation, the meaning is "task object" 
  // (object worked on or generated by the task) not "domain object".

   km#knowledge_retrieval_task  < is#IR_task,
    > {km#specialization_retrieval  km#generalization_retrieval}
      km#analogy_retrieval  km#structure_only_based_retrieval 
      {km#complete_knowledge_retrieval km#incomplete_knowledge_retrieval}
      {km#consistent_knowledge_retrieval km#inconsistent_knowledge_retrieval}; 

km#CG_task  < km#language/structure_specific_task,
 > km#CG_extraction_by_NLP  km#CG-based_KR  km#CG_matching  
   km#mapping/merging/federation_of_CG-based_KBs
   km#conversion_between_CG_and_other_models_or_notations
   km#teaching_CGs;

   km#conversion_between_CG_and_other_models_or_notations
    > km#conversion_between_RDF_and_CG  fca#FCA-based_storage_of_CGs;

   km#teaching_CGs  object: km#CGs;



3.3. Structures and Languages

In our top-level ontology (Martin, 2003b), pm#description_medium (supertype for languages, data structures, ontologies, ...) and pm#description_content (supertype for fields of study, theories, document contents, software, ...) have for supertype pm#description__information because (i) such a general type grouping both notions is needed for the signatures of many basic relations, and is actually much more used in these signatures than its subtypes, and (ii) classifying WordNet categories according to the two notions would often have led to arbitrary choices. Although we devoted a section to each notion, we represented the default ontology of WebKB-2 as a part of WebKB-2 (see below) and hence allowed part relations between any kind of information. To ease knowledge entering and certain exploitations of it, we allow the use of generic relations such as "part", "object" and "support" when, given the types of the connected objects, the relevant relations (e.g. pm#subtask or pm#physical_part) can automatically be found.

For similar reasons, to represent "sub-versions" of ontologies, softwares, or more generally, documents, we use types connected by subtype relations. Thus, for example, km#WebKB-2 is a type and can be used with quantifiers.
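The automatic selection of a specialized relation from a generic one, given the types of the connected objects, can be sketched as follows. The signature table, the type hierarchy and the helper names are illustrative assumptions; only the relation names pm#subtask and pm#physical_part come from the text above.

```python
# Hedged sketch: resolving a generic relation name via a signature table.
supertypes = {
    "km#KM_task": ["pm#process"],
    "km#knowledge_retrieval_task": ["km#KM_task"],
    "#hand": ["pm#physical_entity"],
    "#finger": ["pm#physical_entity"],
}

def is_a(t, super_t):
    """True if type t equals or transitively specializes super_t."""
    return t == super_t or any(is_a(p, super_t) for p in supertypes.get(t, []))

# (generic name, source supertype, destination supertype) -> specialized relation
signatures = [
    ("part", "pm#process", "pm#process", "pm#subtask"),
    ("part", "pm#physical_entity", "pm#physical_entity", "pm#physical_part"),
]

def resolve(generic, src_type, dst_type):
    """Return the unique specialized relation matching the endpoint types."""
    matches = [r for g, s, d, r in signatures
               if g == generic and is_a(src_type, s) and is_a(dst_type, d)]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or unknown: {generic}({src_type}, {dst_type})")
    return matches[0]

print(resolve("part", "km#KM_task", "km#knowledge_retrieval_task"))  # pm#subtask
print(resolve("part", "#hand", "#finger"))                           # pm#physical_part
```

If several signatures matched, the generic relation would be ambiguous for these endpoint types and the user would have to use a specialized relation explicitly.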

km#KM_structure  < is#symbolic_structure,
 > {km#base_of_facts/beliefs  km#ontology   km#KB_category  km#statement}
   km#KB  km#KA_model  km#KR_language  km#language_specific_structure;

   km#KB__knowledge_base  part: km#ontology  km#base_of_facts/beliefs;

   km#ontology__set_of_category_definitions/constraints
    > km#lexical_ontology  km#language_ontology  km#domain_ontology
      km#top_level_ontology  km#concept_ontology  km#relation_ontology
      km#multi_source_ontology,
    part: 1..* km#KB_category  1..* km#category_definition;

      km#top_level_ontology
       > km#DOLCE_light  km#SUMO  km#top_level_of_ontology_of_John_Sowa;

      km#multi_source_ontology 
       (^ontology where the creator of each category and statement is recorded and
         represented via a category^)
       > km#default_MSO_of_WebKB-2;

         km#default_MSO_of_WebKB-2
          (^an ontology provided as default by a version of WebKB-2^)
          part of: km#WebKB-2,
          part: km#DOLCE_light km#top_level_of_ontology_of_John_Sowa;
                //km#DOLCE  km#SUMO  /*an adaptation of*/km#WordNet;

   km#KR_language__KRL__KR_model_or_notation
    > {km#KR_model/structure  km#KR_notation} //not km#semantics: not a structure
      km#predicate_logic_oriented_language  km#frame_oriented_language
      km#graph_oriented_language  km#KR_language_with_query_commands
      km#KR_language_with_scripting_capabilities,
    attribute: km#semantics;

km#CG_structure  < km#language_specific_structure,
 > km#CG_statement  km#CG_language  km#CG_ontology;



3.4. Tools

We first illustrate some specialization relations between tools and then use the FCG notation to give some details on WebKB-2 and Ontolingua. (The FT notation does not yet permit entering such details. As in FT, relation names may be used in FCG instead of relation identifiers when there is no ambiguity.)

km#CG_related_tool  < km#language/structure_specific_tool,
 > km#CG-based_KBMS  km#CG_graphical_editor  km#NL_parser_with_CG_output;

   km#CG-based_KBMS < km#KBMS,
    > {km#CGWorld  km#PROLOG\+CG  km#CoGITaNT  km#Notio  km#WebKB};

      km#WebKB  > {km#WebKB-1  km#WebKB-2},  url: http://www.webkb.org;

km#input_language (*x,*y) = [*x, may be support of: (a km#parsing,
                                       input: (a statement, formalism: *y))];
[any pm#WebKB-2,
  part: (a is#user_interface, part: {a is#API, a is#HTML_based_interface, 
                                     a is#CGI-accessible_command_interface,
                                     no is#graph_visualization_interface}),
  part: {a is#FastDB, a km#default_MSO_of_WebKB-2},
  input_language: a km#FCG,   output_language: {a km#FCG, a km#RDF},
  support of: a is#regular_expression_based_search,
  support of: a km#specialization_structural_retrieval,
  support of: a km#generalization_structural_retrieval,
  support of: (a km#specialization_structural_retrieval,
                  kind: {km#complete_inferencing, km#consistent_inferencing},
                  input: (a km#query, expressivity: km#PCEF_logic),
                  object: (several km#statement, expressivity: km#PCEF_logic)
              )];          //"PCEF": positive conjunctive existential formula

[any km#Ontolingua, 
  part: {a is#HTML_based_interface, no is#graph_visualization_interface},
  input_language: a km#KIF,  output_language: a km#KIF,
  part: {a km#ontolingua_library, no DBMS}, support of: a is#lexical_search];

To permit the comparison of tools, many more details should be entered, and similar structures or relations should be used by the various contributors, for example when expressing what the input languages of a tool can be. To that end, we re-used basic relations as much as possible (we did not introduce relations with names such as "re-used_DBMS" or "default_ontology"). The above examples show that for many features a simple normalized form can be found. However, for many other features this is more difficult. For example, we have not yet found a satisfactory way to represent (i) that WebKB-2 provides special support (two attributes plus three classes, special notations, lots of code) for storing, searching and exploiting relations between categories and their creators or various names, and (ii) that Ontolingua supports those relations but the user has to define them in KIF and then their exploitation in Lisp. Representing this in detail is time-consuming, and representations from different persons are unlikely to be matchable and are also very difficult to use for comparing the tools via a generated table (as illustrated in Section 4). Less detailed descriptions using the same relations should (instead or in addition) be provided. For our example, a short representation could be [any WebKB-2, special_support: a support_for_link_from_category_to_names], even though this would lead to the introduction of many categories for such "supports" in the ontology; from other viewpoints, it would have been preferable to re-use existing relations such as km#category_name.



3.5. Journals, Conferences, Publishers and Mailing Lists

Just a few examples.

km#CG_mailing_list < km#KM_mailing_list,
 url: majordomo@cs.uah.edu;

km#ICCS__International_Conference_on_Conceptual_Structures
 instance: km#ICCS_2001 km#ICCS_2002 km#ICCS_2003 km#ICCS_2004 km#ICCS_2005;

is#publisher_in_IS  < #publishing_house,
 instance: is#Springer_Verlag  is#AAAI/MIT_Press  is#Cambridge_University_Press,
 object of: #information_science;



3.6. Articles, Books and other Documents

This example shows a simple document indexation using Dublin Core relations (we have done this for all the articles of ICCS 2002). Representing the ideas from the article would be more valuable.

[an #article, dc#Coverage: km#knowledge_representation,
  pm#title: "What Is a Knowledge Representation?",
  dc#Creator: "Randall Davis, Howard E. Shrobe and Peter Szolovits",
  pm#object of: (a #publishing, pm#time: 1993,
                      pm#place: (the #object_section "14:1 p17-33",
                                        pm#part of: is#AI_Magazine)),
  pm#url: http://medg.lcs.mit.edu/ftp/psz/k-rep.html];



3.7. People: Researchers, Specialists, Teams/Projects, ...

We have not yet dealt with this section in our files. However, this example serves as a reminder that every introduced domain category is connected by a supertype relation to a category from WordNet.

is#researcher_in_IS  < #researcher;    is#team_in_IS  < #team;



4. Example of comparison of two ontology-related tools

Fact Guru (a frame-based system) permits the comparison of two objects by generating a table with the object identifiers as column headers, the identifiers of all their attributes as row headers, and in each cell either a mark signalling that the attribute does not exist for this object or a description of the destination object. The common generalizations of the two objects are also given. However, this strategy is insufficient for comparing tools. Even for people, creating detailed tool comparison tables is often a presentation challenge and relies on their knowledge of which features are difficult or important and which are not. A solution could be to propose predefined tables for easing the entering of tool features and then compare them. However, this is restrictive. Instead, or as a complement, we think that a mechanism to generate good comparison tables is necessary and can be found.
The following query and generated table illustrate the approach that we propose. The idea is that a specialization hierarchy of features is generated according to (i) the uppermost relations and destination types specified in the query, such that (ii) only the descriptions used in at least one of the tools, and the common generalizations of these descriptions, are shown. To that end, some FCG-like descriptions of types can be generated. In the cells, '+' means "yes" (the tool has the feature), '-' means "no", and '.' means that the information has not been represented. We invite the reader to compare the content of this table with the representations given above; its meaning, and the possibility of generating it automatically, should then hopefully be clear. A maximum depth of automatic exploration may be given; past this depth, the manual exploration of certain branches (like the opening or closing of sub-folders) should permit the user to give the comparison table a presentation better suited to his/her interests. Any number of tools could be compared, not just two.

> compare pm#WebKB-2 km#Ontolingua on 
    (support of: a is#IR_task, output_language: a KR_notation,
     part: a is#user_interface), maxdepth 5


                                           WebKB-2  Ontolingua
support of:
is#IR_task                                    +         +
  is#lexical_search                           +         + 
    is#regular_expression_based_search        +         .   
  km#knowledge_retrieval_task                 +         .
    km#specialization_structural_retrieval    +         .
      (kind: {km#complete_inferencing, km#consistent_inferencing},
       input: (a km#query, expressivity: km#PCEF_logic),
       object: (several statement, expressivity: km#PCEF_logic))
                                              +         .
    km#generalization_structural_retrieval    +         .

output_language: 
km#KR_notation                                +         +
  (expressivity: km#FOL)                      +         +          
    km#FCG                                    +         .
    km#KIF                                    .         +
  km#XML-based notation                       +         .
    km#RDF                                    +         -

part:
is#user_interface                             +         +
  is#HTML_based_interface                     +         + 
  is#CGI-accessible_command_interface         +         .
  is#OKBC_interface                           .         .
  is#API                                      +         .         
  is#graph_visualization_interface            -         -            

In the general case, the above approach, where the descriptions are put in the rows and organized in a hierarchy, is likely to be more readable, more scalable and easier to specify via a command than putting the descriptions in the cells as, e.g., Fact Guru does. However, the latter may be envisaged as a complement for simple cases, e.g. to display {FCG, KIF} instead of '+' for the output_language relation. In addition to generalization relations, "part" relations could also be used, at least the ">part" relation. For example, if CoGITaNT were a third entry in the above table, since it has a complete and consistent structure-based and rule-based mechanism to retrieve the specializations of a simple CG in a base of simple CGs and rules using simple CGs, we would expect the entry ending with km#PCEF_logic to be specialized by an entry ending with km#PCEF_with_rules_logic.
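The core of the table generation can be sketched as follows: keep only the feature types used by at least one tool (plus their generalizations), and print '+', '-' or '.' cells. The feature hierarchy, the per-tool feature sets and the explicit "lacks" sets below are illustrative assumptions, not the actual representations.

```python
# Hedged sketch of the comparison-table generation described above.
feature_parents = {                 # feature type -> direct generalizations
    "IR_task": [],
    "lexical_search": ["IR_task"],
    "regexp_search": ["lexical_search"],
    "knowledge_retrieval": ["IR_task"],
}

tools = {
    "WebKB-2":    {"has": {"regexp_search", "knowledge_retrieval"}, "lacks": set()},
    "Ontolingua": {"has": {"lexical_search"}, "lacks": {"knowledge_retrieval"}},
}

def ancestors(f):
    out = set()
    for p in feature_parents.get(f, []):
        out |= {p} | ancestors(p)
    return out

def cell(tool, feature):
    """'+' if the tool has the feature (directly or via a specialization),
       '-' if it is represented as absent, '.' if nothing is represented."""
    closure = tool["has"] | {a for f in tool["has"] for a in ancestors(f)}
    if feature in closure:
        return "+"
    return "-" if feature in tool["lacks"] else "."

# rows: features used by at least one tool, plus their generalizations
used = set()
for t in tools.values():
    for f in t["has"] | t["lacks"]:
        used |= {f} | ancestors(f)

children = {}
for f, ps in feature_parents.items():
    for p in ps:
        children.setdefault(p, []).append(f)

def print_rows(f, indent=0):     # depth-first walk of the feature hierarchy
    if f in used:
        row = "  " * indent + f
        print(f"{row:40}" + "   ".join(cell(t, f) for t in tools.values()))
    for c in sorted(children.get(f, [])):
        print_rows(c, indent + 1)

print(f"{'':40}" + "   ".join(tools))
print_rows("IR_task")
```

Note how the '+' for is#lexical_search in the table above is obtained for WebKB-2 without an explicit assertion: having a specialization of a feature implies having the feature, whereas the converse does not hold, hence the '.' cells.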



5. Conclusion

Knowledge repositories, as we have presented them, have many of the advantages of the "Knowledge Web" and the "Digital Aristotle" but seem much more achievable. To that end, we have described some techniques and ontological elements, and we are: (i) implementing a knowledge-oriented wiki to complement our current interfaces, (ii) experimenting on how to best support and guide semi-formal discussions and, more generally, organize technical ideas into a semantic network, (iii) implementing and refining our measures of statement/user usefulness, (iv) completing the ontology presented above to permit at least the representation of the information collected in Michael Denny's "Ontology editor survey" (we tend to think that our current ontology on knowledge management will only need to be specialized, even though we have not yet explored the categorization of the basic features of multi-user support such as concurrency control, transactions, CVS, file permissions, file importation, etc.), (v) permitting the comparison of tools as indicated above, and (vi) providing forms or tables to help tool creators represent the features of their tools.

Once implemented, the presented techniques, especially those supporting semi-formal discussions, will be applicable to many domains, including PORT. The ontology on knowledge management, in addition to its above-cited applications, might guide work on the automatic extraction and organization of technical information from documents in Information Sciences.



6. Acknowledgments

The first author thanks the members of the LOA and Christopher Welty for the interesting discussions related to some ideas mentioned in this article.



7. References

V. R. Benjamins, D. Fensel, A. Gomez-Perez, S. Decker, M. Erdmann, E. Motta and M. Musen (1998). Knowledge Annotation Initiative of the Knowledge Acquisition Community: (KA)2. Proc. of the 11th Banff Knowledge Acquisition for Knowledge Based System Workshop (KAW98), Banff, Canada, April 18-23, 1998.
The ontology is at http://ontobroker.semanticweb.org/ontologies/ka2-onto-2000-11-07.flo.

W.D. Hillis (2004). "Aristotle" (The Knowledge Web). Edge Foundation, Inc., No 138, May 6, 2004.

Ph. Martin (2003a). Knowledge Representation, Sharing and Retrieval on the Web. Chapter of a book titled "Web Intelligence", (Eds.: N. Zhong, J. Liu, Y. Yao), Springer-Verlag, Jan. 2003.

Ph. Martin (2003b). Correction and Extension of WordNet 1.7. Proc. of ICCS 2003 (Dresden, Germany, July 2003), Springer Verlag, LNAI 2746, 160-173.

W. Schuler and J.B. Smith (1990). Author's Argumentation Assistant (AAA): A Hypertext-Based Authoring Tool for Argumentative Texts. Proc. of ECHT'90 (INRIA, France, Nov. 1990), Cambridge University Press, 137-151.

D. Skuce and T.C. Lethbridge (1995). CODE4: A Unified System for Managing Conceptual Knowledge. International Journal of Human-Computer Studies, 42, 413-451.
See also the successor / commercial version: Fact Guru.

D.A. Smith (1998). Computerizing computer science. Communications of the ACM, 41(9), 21-23.

C.A. Welty & J. Jenkins (1999). Formal Ontology for Subject. Journal of Knowledge and Data Engineering (Sept. 1999), 31(2), 155-182.