Message-id: <2845064325-3667069@KSL-Mac-69>
Date: Mon, 26 Feb 90  15:38:45 PST
From: Tom Gruber <Gruber@sumex-aim.stanford.edu>
To: shared-kr@sumex-aim.stanford.edu
Subject: draft proposal for standard representations workshop
Colleagues,

Following is the "long version" of a straw man proposal I will be presenting 
at the upcoming Knowledge Representation Standards workshop in Santa Barbara.
This version contains a lot of information about the activities of "The KIF
Group", which is what I call the participants in the dialog on this list
(you know who you are).  The "short version", to be distributed shortly to the
workshop participants, contains essentially the same arguments with less
discussion and without reference to the Stanford KIF initiative per se.
The document below actually argues for things the KIF group ought to
do, plans to do, is doing, or has done.

I hope that this stimulates useful dialog.  Please excuse the manifesto tone.

                                                        --tom


             The Role of Standard Knowledge Representation 
                for Sharing Knowledge-Based Technology
                
                  Thomas Gruber, Stanford University

1. INTRODUCTION

AI has its thousand points of light.  While a large community of
researchers invents new knowledge representations, reasoning methods,
and knowledge-based systems that make use of these innovations, very
little of this work gets shared in machine-intelligible form.  This
is ironic, since AI theories and experiments are especially
well-suited for delivery and examination in the computational medium.
Furthermore, the ability to build comprehensive bodies of knowledge,
shared among cooperating systems, could have great practical utility
in industrial applications, particularly in engineering domains.
What's missing?

Part of the answer is the fact that the field is young, lacking a
strong paradigm such as biology's theory of evolution or chemistry's
periodic table.  Some would say that AI is pretheoretical or
prescientific.  As a result, goes the argument, researchers have no
common theoretical language for communicating, comparing, and building
upon results.

One solution to the problem of youth is to get old (well, mature).  In
the meantime, there is something to be done about "the language
thing".  If the research community could agree on a form in which some
useful subset of what it produces could be shared with other
researchers, and used by other people's programs, that would be a step
in the right direction.  Today, we can share Common Lisp
programs.  This is not enough, since it typically forces the consumer
of the program to buy into the whole architecture of the contributor's
program, rather than some modular piece that can be incorporated into
an existing environment.  In addition, Common Lisp is a programming
language, and what we produce is not necessarily a program.  One
option that has come to many minds is to share knowledge bases, rather
than programs.

Toward this end, a group of people at Stanford has gotten together to
try to agree on a language in which they can share knowledge bases.
The group has several research projects with complementary technology
and knowledge to contribute, with a common focus on modeling
engineered devices.  There is a working proposal for a language,
called KIF (Knowledge Interchange Format), a strategy for transferring
knowledge bases among programs, a plan for coordinating the namespace,
and discussions underway toward a set of very general primitives upon
which to build richer shared ontologies.

Based on the discussions within that group and with others in the
community, this note will describe possible ways in which knowledge
bases can be shared, requirements for a shared representation
language, mechanisms for sharing knowledge bases among people and
programs, and some of the technology that will facilitate the
development and use of shared KBs.  Examples from the KIF experience
will be used to illustrate possible approaches to some of the issues.

2. MODELS OF KB SHARING - What does it take to share knowledge bases?

The basic goal is to share knowledge bases, for scientific and
scholarly communication, and for practical use by programs that can
make use of knowledge.  The goal admits several interpretations,
depending on the purposes and degree of commitment of the
participants.  Therefore I'll organize some of the issues about what
it means to share KBs according to levels of increasing commitment
and agreement.  The utility of KB sharing and the
difficulties of doing so can then be grounded in the context of what
is, exactly, being shared.

Level 1: Sharing SYNTAX

As a minimum condition for sharing formally represented knowledge, one must
agree on a common syntax for transmitting and interpreting it.  Ignoring
the question of storage and communication media, sharing syntax
includes specifying in complete detail:

  * the syntax of standard logical expressions (e.g., how to say p => q)

  * the syntax of extensions to FOPC (default reasoning, meta-level
operators, modals, truth value extensions)

  * the rules for printing and parsing sentences (for converting
streams of text to sentences and back, character sets, case
sensitivity, etc.)
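
To make the level-1 agreement concrete, here is a small Common Lisp
sketch of the print/parse round trip that a shared syntax must pin
down.  The prefix rendering of "p => q" and the operator name => are
assumptions made for this illustration, not the agreed KIF syntax.

    ;; Illustrative only: a Lisp-style prefix rendering of "p => q" and
    ;; the round trip between sentences and streams of text.  The
    ;; operator name => is an assumption, not the agreed KIF notation.

    (defparameter *example-sentence* '(=> (p ?x) (q ?x)))

    (defun sentence-to-string (sentence)
      "Render a sentence as text for transmission."
      (let ((*print-case* :downcase))
        (prin1-to-string sentence)))

    (defun string-to-sentence (string)
      "Parse transmitted text back into a sentence."
      (values (read-from-string string)))

    ;; Both parties must recover the same structure:
    ;; (equal *example-sentence*
    ;;        (string-to-sentence (sentence-to-string *example-sentence*)))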
                                                       
Level 2: Sharing VOCABULARY - one symbol one entity

In order for two programs to communicate using an interlingua, they
must be acquainted with a common vocabulary.  Since all contemporary
knowledge bases are written by humans, they, too, must come to
agreement on the names of things.  Sharing vocabulary requires
agreement on:

  * the names of standard logical, arithmetic, set-theoretic
primitives, such as "implication" and "addition".  These may get
folded into the syntax.

  * a mechanism for defining *primitives* -- symbols with definitions
for people but not for the machine.

  * a mechanism for managing the *namespace* - conflicts among names for
things

  * some domain-independent "content" words, such as standard
relations like "subclass" and "instance"

  * moderately more controversial primitives, such as "terminal",
"component-of", and "temporal-subabstraction", without giving them any
machine-intelligible definitions

Level 3: Sharing ONTOLOGY - explicitly stated/computationally enforced semantics

To share Facts (statements that are extensionally true of individuals
with no inference), one need "only" agree to the form of the facts and
how to interpret their referents in the world.  To share Knowledge
requires coming to agreement on the meanings of symbols used
compositionally.  Sharing an understanding of the *content*, rather
than the form, of a knowledge base means sharing an
understanding of certain primitive symbols.  Since it is common for AI
knowledge bases to be established on a foundation of such symbols
naming ontological categories, often in hierarchical relation to one another,
this level of consent is called "sharing ontology".

Agreeing on the "Meaning" of symbols in a knowledge base can be a
tricky epistemological business, but the formal, computational nature
of the medium gives us a leg up.  Since our interlingua is formal, we
can design things such that a given symbol is supposed to mean the
same thing in all contexts.  Since KBs are computational by nature --
interpreted by programs -- we can build programs that enforce the
way symbols are used to help avoid saying meaningless or ambiguous
things.  Assuming that this is possible, one can expect to share

  * ONTOLOGIES: networks of categories/classes, and their
relationships in an inheritance network.  These make stronger claims
about how to carve up the world, including classes such as
"physical-object", "process", "event", etc.

  * An ontology may contain not only the names of classes and their
subclass/instance relationships, but also a) constraints on class
membership (necessary conditions), b) properties to inherit to
instances, and sometimes c) sufficient defining conditions for class
membership (where class membership is inferred from other information
about instances).

  We anticipate that ontologies will be used as skeletal frameworks
that are extended by application developers with more specific
knowledge.  By "extend" we mean both using the vocabulary of terms and
relations in the ontology to define additional terms and relations,
and populating the extended knowledge base with instances of classes
described by both the original and extended ontologies.  It is
important to the success of this extension business that the terms be
accompanied by "principled" and "enforceable" semantics.  "Principled"
means the meanings are unambiguous and comprehensible to humans.
"Enforceable" means that the meanings can be checked by some mechanical
process, to make sure that the terms and relations are used in ways
that make sense -- that are consistent with the intended semantics of
the original ontology.  Such an ontology constitutes
a vehicle for delivering a theory of how to represent some class of
things, as we will illustrate with the PENMAN example described later.
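
As a concrete (and purely illustrative) example of such an ontology
fragment, the following Common Lisp data uses a KIF-like prefix
notation.  The relation and class names, and the convention of writing
the class as the first argument of "instance" (following section 4.5
below), are assumptions of this sketch, not agreed vocabulary.

    ;; A hypothetical ontology fragment, written as Lisp data in a
    ;; KIF-like prefix notation.  All names, and the class-first
    ;; argument order of "instance", are illustrative assumptions.

    (defparameter *device-ontology*
      '(;; taxonomy
        (subclass electrical-device physical-object)
        (subclass resistor electrical-device)

        ;; (a) a necessary condition on class membership:
        ;; every electrical device has at least one terminal
        (=> (instance electrical-device ?d)
            (exists (?t) (terminal-of ?t ?d)))

        ;; (b) a property inherited by instances
        (=> (instance resistor ?r)
            (has-quantity ?r resistance))

        ;; (c) a sufficient condition for class membership
        (=> (and (has-quantity ?x resistance)
                 (terminal-count ?x 2))
            (instance resistor ?x))))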

Level 4: Sharing INFERENCE MECHANISMS

Whereas ontologies provide a language with certain primitive terms,
it is also possible to share procedures that can perform certain
classes of inference over KBs that use the corresponding ontologies.  A
theorem prover is an interpreter for general deductive inference, and
it only depends on the most universal vocabulary of level 2, such as
"implication" and "negation".  Other kinds of inference mechanisms
that might be useful to share include:

  * interpreters that make some class of inferences efficiently, such
as term subsumption or property inheritance.

  * special-purpose interpreters that make use of domain-specific
ontologies, such as a diagnosis engine that works on models couched in
the vocabulary for modeling digital circuits at the logic level.

  * task-specific interpreters of special domain-independent languages
composed on top of the interlingua, such as the ACCORD language for
representing control strategies.

  In some cases, procedures can be written in declarative form and become part
of a shared KB.  In most cases, procedures would be shared as Common Lisp
code.  This requires a functional interface to the interlingua.
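
The following Common Lisp fragment is one possible shape for such a
functional interface; the function names (tell, ask) and the
representation of a KB as a list of interchange-format sentences are
assumptions made for this sketch, not anyone's actual API.

    ;; A hedged sketch of a functional interface to the interlingua.
    ;; The names tell and ask, and the list-of-sentences KB, are
    ;; assumptions; a real interface would call an inference procedure.

    (defstruct kb
      (sentences '()))                  ; sentences in interchange form

    (defun tell (kb sentence)
      "Add an interchange-format sentence to the knowledge base."
      (pushnew sentence (kb-sentences kb) :test #'equal)
      kb)

    (defun ask (kb predicate)
      "Return the stored sentences whose leading symbol is PREDICATE.
    A genuine implementation would invoke inference, not just lookup."
      (remove-if-not (lambda (s) (and (consp s) (eq (first s) predicate)))
                     (kb-sentences kb)))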

  The dependency between an inference procedure and a KB can be
described by a) the list of primitive vocabulary assumed by the
procedure, and b) the assumptions under which the vocabulary is
appropriate for modeling something.  For instance, a diagnostic
engine for discrete lumped-parameter circuits has to make it clear to
the user when it is valid to model something as a lumped parameter.
We will discuss this requirement in more detail as a mechanism for
sharing.

Level 5: Sharing HEURISTIC (symbol-level) knowledge bases

If two groups share ALL the namespace and agree on all the mutual
ontology, then they are in a position to also share a common internal
form for the knowledge base and the Lisp form of the inference
procedures operating on the internal form.  This is "sharing" in the
strongest and most committed sense.   


These levels lay out a space of possible positions to which two or
more parties could commit.  The immediate task is to choose a position
or positions at which there is a relatively stable body of knowledge
to contribute, or there is high payoff in converging.  
The KIF group has begun the work on levels 1 and 2, with the declared
intention of sharing at level 3.  Two of the participating research
groups (MCC's Cyc group and the KSL's How Things Work project) are
collaborating at level 5, to explore the shared knowledge base
hypothesis in depth. 

3. LANGUAGE REQUIREMENTS

The idea of agreeing on a representation language is a strategy toward
the goal of sharing formally-represented knowledge bases, not an end in
itself.  In this light we discuss some design requirements for a
language to support the useful sharing of knowledge bases.

3.1 EXPLICIT, DECLARATIVE SEMANTICS

The basic problem in using someone else's knowledge base is trying to
understand what statements in the KB mean.  The use of a
fully-declarative language is one approach to this problem.
"Declarative" is often *defined* as the property that the meaning of
sentences can be understood without reference to an interpreter.  In
practice, "declarative" means that if the reader is thoroughly
acquainted with the compositional semantics of the logic, and is told
what the primitive predicates, object constants, and functions "mean",
then he/she/it can determine what the sentence means.  Assuming that
this is the case for the researchers and their programs, then
requiring that the language be fully declarative can facilitate
sharing KBs.  That's what the KIF proposal does.

What is not declarative?  Lisp code, data structures, and operators in
the language that require sharing a piece of code in order to be used.


3.2 COMMON REPRESENTATION VERSUS INTERCHANGE FORMAT

If two parties share at level 5, then they literally share the same
language and the tools that come with it.  However, the very fact that
people do not usually share at level 5 means that the "interlingua"
cannot be the same on everyone's desk.  Based on the observation that
people's inference mechanisms differ much more than the
"knowledge-level" content of their KBs, the KIF group agreed to take
the strategy of an *interchange format*, rather than a common language.
That means that if two parties want to share KBs, they translate their
KBs into the interchange language and deliver them as text.  On the
receiving end, one translates from KIF into a favorite language.

The idea of an interchange format for knowledge bases is analogous to
the Postscript language for graphics and text.  A word processor or
drawing program generates a text file in Postscript that presumably
captures the "content" of the document, such that different Postscript
printers, with different resolutions and rendering algorithms, can
display the same picture on paper or a video monitor.  Each word
processor can employ its own *internal* representation, designed for
efficiency.  (Some drawing applications can read Postscript and
manipulate it internally, but that is not a requirement for using the
interchange format.) It is instructive to note that applications often
generate Postscript code that would be considered poor software
engineering if written by a human (e.g., redundant definition of
functions, inline-expanded macros, etc.).  This is a common property of
interchange languages, and is a problem only for programs that attempt
to recover information not explicitly present in the sentences, such as
the definition of the macro that was expanded.  It's worth keeping
this in mind when proposing syntactic extension mechanisms for the KIF.

3.3 EXPRESSIVENESS, COMPLETENESS, TRACTABILITY

If the goal is to facilitate sharing knowledge for machine
consumption, then it makes sense to consider the computational
properties of procedures that can use the knowledge.  It is well known
that deduction over KBs in first-order predicate calculus is
semidecidable, which is not very reassuring.  Some inference
procedures that operate on more restricted languages have more
satisfying computational properties.  So there is a tradeoff in the
design of a language between expressiveness and the speed by which
certain inferences can be made using the language.

Based on the assumptions that (1) it is more likely that people will
be able to share the sentences in a knowledge base than specialized
inference procedures that operate over them, and (2) that the
intellectual value-added of someone else's knowledge base is often in the
way they represent things (which can require expressive power), the
KIF group finds itself on the expressiveness end of the tradeoff
curve.  Thus the bias is toward a language that allows a large class
of things to be stated.  (In particular, there is general agreement on
the basics of first order predicate calculus, convergence on the
ability to reify sentences with a QUOTE operator and refer to multiple
"meta-levels", and agnostic ascent for the hooks for default reasoning
and modal operators).

The obvious consequence is that one cannot write a general-purpose
interpreter that can make arbitrary deductions over KBs in the
interchange language and still be real time for redwoods.  To cope
with this situation, several approaches can be taken.  One is for
producers of KBs to generate sentences that do not use the full power
of the interchange language, and provide inference procedures that
operate on a restricted language.  For example, one could deliver a
knowledge base containing only statements about the properties of
classes and rules defining classes in terms of other classes and
additional properties.  Then the consumer of such a terminological KB
could compile these FOPC sentences into an internal form that is used
by an efficient subsumption procedure.  Two consumers could have two
different implementations of the subsumption engine, but share the KB.
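
As a minimal sketch of this compilation step, consider the following
Common Lisp fragment.  It assumes, purely for illustration, that the
delivered KB contains sentences of the form (subclass <class>
<superclass>) and that the taxonomy is acyclic; the internal table and
the subsumption test are inventions of this example, not anyone's
actual engine.

    ;; Compile terminological sentences into an internal form and run a
    ;; trivial subsumption test over it.  The sentence form
    ;; (subclass <class> <superclass>) and the acyclic taxonomy are
    ;; assumptions of this sketch.

    (defun compile-taxonomy (sentences)
      "Build a table mapping each class to its direct superclasses."
      (let ((parents (make-hash-table :test #'eq)))
        (dolist (s sentences parents)
          (when (and (consp s) (eq (first s) 'subclass))
            (pushnew (third s) (gethash (second s) parents))))))

    (defun subsumes-p (taxonomy general specific)
      "Does GENERAL subsume SPECIFIC in the compiled taxonomy?"
      (or (eq general specific)
          (some (lambda (parent) (subsumes-p taxonomy general parent))
                (gethash specific taxonomy))))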

A second approach is a generalization of the first.  That is to have a
set of "fast" inference procedures in a KB consumer's machine, where
each can recognize sentences in the KIF that satisfy its particular
requirements.  Cyc is an example of this approach; it has a few dozen
inference procedures to date.  The inference procedures operate on
structures at the "heuristic level", such as cached inverse pointers
and compiled Lisp functions, whereas the corresponding KIF sentences
lie at the first-order "epistemological level".  Note that some
statements in the KIF may not have corresponding "fast" procedures,
and thus arbitrary inference over these may be prohibitively slow.
(Better to plant forests than destroy the soil with short-term crops.)

Thus, for each specialized inference procedure there would be a way to
translate from KIF to the internal format, and for each inference
procedure there should be a translation back into KIF.  There is
nothing in principle that precludes the mechanical translation of
logically-equivalent representations of knowledge (cf. Pat Hayes's
"The Logic of Frames" paper).  The Cyc project is developing the very
technology for doing this sort of translation in general.  Cyc is
supposed to be able to translate anything in its "heuristic level",
where one sees frames and constraint-language expressions, into
sentences at the "epistemological level", where one sees KIF-like
expressions; the translation is promised to work in the other
direction as well.

A third approach is to rely on general knowledge compilation
techniques such as partial evaluation and meta-level control to
achieve efficient inference over certain classes of sentences in the
KB. 

The important property of all of these approaches is that they don't
*require* a special syntax in the interchange format, at least not a
restriction on expressive power.  Some syntactic annotation might be
useful, as long as it is transparent to the logical interpretation of
the contents of the knowledge base.

4. MECHANISMS FOR SHARING KNOWLEDGE BASES

Given the general strategy of an interchange format for sharing the
content of knowledge bases, several important details need to be
worked out.  We need to devise mechanisms for coming to agreement on
the names and meanings of entities, techniques to enforce the
semantics of KBs, and methods to help identify the assumptions
underlying models.

4.1 BILATERAL AND UNIVERSAL CONSENSUS

It is a lot of work to get two parties to converge on a set of
conventions.  It can be almost impossible to get N parties to agree.
However, this does not imply that any mechanism to facilitate
consensus need be specific to the bilateral case, or that the results
be limited to two parties.  If two parties agree to share a
vocabulary, then the result is a *public* vocabulary in a well-defined
interchange language, which others are free to pick up and use.  It is
possible that "positive returns" can occur in the world of portable
knowledge bases, in the same way that technological "conventions" catch on
in the economic marketplace.

4.2 LOCAL CONSENSUS VERSUS GLOBAL STANDARD

While it is often desirable for a set of KB producers and consumers to
converge on a *common* set of content terms, it is too much to demand
that of a "knowledge representation standard".  However, much of the
leverage in agreeing to an interchange language comes in the ability
to "read" and "make sense of" KBs that were not invented here.  This
requires understanding the meanings of content terms in the KB which
are not defined from other terms. 

4.3 NAME REGISTRIES

To share at any of the levels of section 2, the parties involved need
to agree to a set of fixed primitives and a mechanism for defining new
primitives.  The KIF group is working out a set of universal fixed
primitives for level 1 (syntax) by email discussions of written
proposals.  At level 2, the problem is to set up mechanisms for
defining primitive "content" terms, and managing the namespace.  The
meaning of a primitive is not given (completely) in other sentences in
the KB.  For example, the terms "instances" and "subclass" (#%specs in
Cyc) are primitives.  Even though they can be (and are) defined with
reference to set theory, their semantics as relations in the KB are
not *derived* from some more primitive axioms about sets in the KB (or
if they were, then "set" would be primitive).  The term
"allInstances", on the other hand, can be defined completely in terms
of "instances" and "subclass".  The meaning of "allInstances(x, y)" is
given by axioms in KIF.  (It turns out that the axioms are never
*run*; special inference mechanisms with the same semantics handle
closures over relations.)
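
For illustration only, axioms of roughly the following kind could give
the meaning of "allInstances".  The rendering below is a KIF-like
prefix notation written as Lisp data; the exact notation and the
class-first argument order are assumptions of this sketch, not the
actual KIF axioms.

    ;; A hypothetical rendering of axioms defining allInstances in terms
    ;; of the primitives "instances" and "subclass".  Notation and
    ;; argument order (class first) are assumptions of this sketch.

    (defparameter *all-instances-axioms*
      '(;; a direct instance of a class is among all its instances
        (=> (instances ?c ?x)
            (allInstances ?c ?x))
        ;; anything among all the instances of a subclass is among all
        ;; the instances of its superclasses
        (=> (and (subclass ?sub ?c)
                 (allInstances ?sub ?x))
            (allInstances ?c ?x))))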

The vast majority of symbols in a knowledge base are not like
"allInstances"--they are primitive.  Some are important to
understanding and using the KB being shared, and some are not.  Those
that contribute to the intellectual content of the knowledge base,
particularly those that serve as ontological primitives for the KB
users, need to be defined in a public place.  This is akin to
exporting function and variable names from Common Lisp packages.
Names that serve only "internal" roles in a KB, such as Skolem
functions (analogous to anonymous lambda expressions in Lisp programs),
need not be publicly defined.  Of the "important" terms, some are
common to a larger class of knowledge bases than others, and therefore
worth more effort defining.

One mechanism for defining such primitives is to make an entry in a
name registry, where the name is accompanied by English text and
examples.  The name registry can be implemented by various database
strategies, depending on the concurrency requirements and the physical
distribution of the participants.
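
One very simple realization of such a registry, sketched in Common
Lisp, might look like the following; the field names and the in-core
single-table implementation are assumptions of this example.

    ;; A minimal sketch of a name registry.  The fields and the
    ;; hash-table implementation are assumptions; a real registry might
    ;; be a shared database.

    (defstruct registry-entry
      name              ; the public symbol, e.g. a relation name
      contributor       ; who registered it
      english-text      ; prose definition for human readers
      examples)         ; example sentences using the term

    (defparameter *registry* (make-hash-table :test #'eq))

    (defun register-term (entry)
      "Add an entry, refusing to silently overwrite an existing name."
      (let ((name (registry-entry-name entry)))
        (when (gethash name *registry*)
          (error "The name ~S is already registered." name))
        (setf (gethash name *registry*) entry)))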

The strategy we advocate is to start from a core of general terms upon
which we are most likely to find agreement (or terms most important to
agree upon).  Second, the vocabularies can be partitioned according to
the bullets under the levels of section 2.  For instance, there can be
a registry for standard frame-system terms such as "instance" and
"subclass", and a separate registry for device-modeling terms such as
"terminal" and "component".  A strategy for working on the latter is
for each participating group to model the same device or process in
their own favorite language, and propose general names for the
important relations in the model.  Another strategy is for one
participant to recast an existing model in a different participant's
existing terminology.  Experience in this enterprise is lacking ;-).

Note that it does not suffice to "partition the namespace" via
mechanisms (such as Common Lisp's package system) that ensure that no
two KB developers use the same names.  Users of shared knowledge bases
must have some vocabulary in common, or they cannot interpret the
other's KB.  Note also that the ability to write translation tables
among disjoint namespaces is equivalent to managing a single
namespace.

4.4 STRATEGIES FOR NEGOTIATION

Due to the formalism of predicate calculus, sharing knowledge bases
requires sharing a common namespace of predicate, function, and
literal constants.  Therefore negotiation will be necessary when
clashes in the names or definitions of these terms occur.  Mechanisms for
negotiation include person-to-person discussions, public email,
autocratic arbitration, etc.  We still have no empirical data on what
works in this area.

The means by which people work out a vocabulary depend on the mode in
which conflict is uncovered.  If one party is importing the ontology
of another, then the importing party can maintain local translations
from the exporter.  If two parties are merging ontologies in which
both have made prior commitment, then each can come up with a list of
bothersome terms to bring to the other.  If all parties are
dynamically extending a shared ontology, it helps to provide dynamic
feedback on conflicts (the other cases can be handled in a batch
mode).  The name registry can play dual roles here: to help people
understand the meanings of primitives, and to help them manage a
common namespace.  The support of groups of people contributing to and
using shared knowledge bases is a fruitful topic for research in
computer-supported collaborative work (CSCW).

4.5 CONSTRAINTS - A WELL-MEANING POLICE FORCE

The term primitive [sic] can be misleading.  To say that "instance" is
a primitive does not imply that there is *no* machine-intelligible
knowledge in the KB about the term.  If one peered into a shared KB
containing "instance", one might find *constraints* that prescribe the
ways in which that term can and cannot be used.  One would find sentences
stating that "instance" is a two-place relation, and that its first
argument must be a class.  (There might also be information about how
the term is described in English, but this is not a constraint in the
same sense as the domain restriction.)  Constraints impose a semantic
check on what "makes sense" in a given knowledge base: what is a
consistent use of terminology.  When the primitive is a class in a
hierarchy, one could think of constraints as defining necessary
conditions for class membership.  Constraints can also be written in
an interchange language, and are included in a shared knowledge base.
An interpreter can use these constraints to notify the KB builder or
user when something is amiss, and a knowledge acquisition tool can use
the constraints to restrict the form of what is added.  (Such tools
can range in power from trivial name checkers to full-blown theorem
provers.) The interpreters and constraint checkers need not be
included with a KB, and they can be written to work over a large class
of KBs, as long as the KBs are exchanged in the standard format.
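
A hedged Common Lisp sketch of what such constraints and a trivial
checker might look like follows.  The constraint vocabulary (arity,
argument-class) and the checker itself are assumptions of this
example; the rule that the first argument of "instance" is a class
follows the prose above.

    ;; Constraints on the primitive "instance", stated as data, and a
    ;; trivial checker that enforces them.  The constraint vocabulary is
    ;; an assumption of this sketch.

    (defparameter *instance-constraints*
      '((arity instance 2)
        (argument-class instance 1 class)))   ; first argument: a class

    (defun check-instance-sentence (sentence known-classes)
      "Warn if an (instance ...) sentence violates the constraints."
      (when (and (consp sentence) (eq (first sentence) 'instance))
        (let ((args (rest sentence)))
          (unless (= (length args) 2)
            (warn "instance used with ~D arguments, expected 2: ~S"
                  (length args) sentence))
          (unless (member (first args) known-classes)
            (warn "First argument of ~S should be a known class." sentence))))
      sentence)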

4.6 IDENTIFYING THE ASSUMPTIONS OF SHARED MODELING ONTOLOGIES
AND INFERENCE PROCEDURES

If two parties wish to share an *inference method* (level 4) as well
as an ontology (level 3), then it is important that the assumptions
and scope of applicability of that method be made as explicit as
possible.  How to represent the assumptions of models, and associated
special-purpose inference mechanisms, is an open research problem.  A
special case is worth attention when considering the issues of sharing
knowledge bases: how to represent the assumptions of ontologies that
are intended for use with specific inference mechanisms.  One
possibility is for the KB producer to restrict the KB user to a fixed
relational vocabulary, where only the names of objects could stray
from the predefined terms.  Another option is to restrict the user to
extending the ontology through a few relation types, such as
"instance" and "subset", and to write inheritance rules in the
ontology that use those relations.  This is a topic for innovative
discussion.

5. USES OF SHARED KNOWLEDGE BASES

Here are a few practical motivations for sharing machine-intelligible
knowledge bases.

  * To support cumulative research.  Although AI isn't "big science",
it can often take years of infrastructure-building to conduct PhD
research in knowledge-based AI.  There is no "E. coli" to study
(blocks worlds notwithstanding), and much reinvention of the
ontological wheel.  Much of the work is evaluated in toy worlds that
do not address issues of scale and complexity.  Shared knowledge bases
can be public resources for research.

  * To support the acquisition and use of knowledge by communities of
cooperating "agents", including people, who need to know about and
talk about entities modelled in a shared, machine-intelligible form.
Engineering, in particular, has enormous knowledge sharing problems,
and approaches to concurrent engineering and active institutional
memory require an interlingua with which to record and communicate
knowledge.

  * To support applications that need comprehensive, shared bodies of
knowledge to work.  Natural language systems, in particular, depend on
large knowledge bases of "background knowledge" or "consensus reality"
to make sense of human language.

  * To avoid brittleness and duplication of work in building
knowledge-based systems.  In the absence of shared knowledge bases,
each knowledge system is competent in a narrow domain of expertise,
and fails disgracefully when given problems requiring knowledge
outside its domain.  Furthermore, an institution may have tens to
hundreds of knowledge systems built for different aspects of a common
physical system or product.  Without shared knowledge bases, these
systems are inherently isolated.

There has been some experience with building and using shared
knowledge bases.  One is Cyc (Lenat and Guha, 1990).  The goal of the
Cyc project is to represent a large body of "common sense".  Although
Cyc is still in its early childhood in terms of the coverage of this
ambitious domain, it is already being used for word sense
disambiguation by natural language workers at MCC.  The problem for
the NL system is to determine whether some parse "makes sense".  Cyc
can answer some questions about what makes sense because it contains
an ontology of things in the world along with constraints on how
entities in the ontology can be plausibly related.  

Another example is the Helios project in reasoning about digital
circuits.  Helios provides a vocabulary of primitives for modeling
discrete devices such as digital circuits.  Using this vocabulary on
top of a standard first-order logic, different programs can perform
different problem-solving tasks on the same models in a shared
knowledge base.

A third case is the Penman natural language generation project.
Penman is a program that can take "explanations" in a logical
representation and generate polished English text.  A key source of
power for the technology is knowledge about distinctions in the world
that have linguistic consequence.  For example, if the input to Penman
is a description of the structure of a machine and how it works,
Penman needs to know what objects in the description have to do with
part relations, spatial relations, and causal relations.  This kind of
knowledge is organized in an ontology called the "Penman Upper Model".
The Upper Model is a shared knowledge base of categories and
inheritable properties which is instantiated by users of Penman.

Several research projects are pursuing strategies that will capitalize
on the leverage of sharable knowledge bases.  In the KIF initiative
alone, there are several instances of ongoing or planned
collaborations based on sharing KBs:

  * The How Things Work (HTW) group (Gruber and Iwasaki et al.) is
using Cyc for representing and reasoning about electromechanical
devices, for research in model formulation, abstract simulation and
analysis, and explanation.  Much of the formalization already done in
Cyc, such as the representation of histories ("temporal
subabstractions") and the representation of abstract physical
quantities ("interval-based quantities") are being used "off the
shelf".  In turn, some of the requirements of reasoning about multiple
models and qualitative simulation are stimulating work in Cyc on
branching time and microtheories.

  * The DesignWorld project (Genesereth et al.) is committed to a
strategy of declarative, shared, multiuse knowledge bases for
representing the models of its devices.  DesignWorld is an integrated
design, manufacturing, and repair system.  The project aims at the
capability to design and actually build compact disk players using a
microfactory and CAD workstations.  Each part of the overall system
accesses and communicates through a shared KB.  The DesignWorld and HTW
projects will hopefully benefit from sharing their representation
research and models of physical devices in the form of KBs in KIF
format.

  * The Manufacturing Knowledge Systems (MKS) project (Tenenbaum et al.) is
building a network of cooperating knowledge-based systems to support
the many phases and activities of engineering.  It is critical to
these systems that they share a language for knowledge about physical
devices and processes.  A large amount of engineering knowledge is or
will be available online, yet it is currently in the form of passive
databases (e.g., part descriptions, CAD models).  The MKS project is
developing vocabulary and representational frameworks for capturing
and using this knowledge.  It is hoped that the MKS, HTW, and DesignWorld
projects will be able to take advantage of mutual modelling efforts
via the mechanism of shared knowledge bases.  

  * Finally, the HTW project is importing the Penman natural language
generator, and testing the viability of the KIF idea at the same time.
In a proposed experiment, the Penman Upper Model (currently in LOOM)
will be mechanically translated into KIF.  Then it will be read into
Cyc, which will translate it from the epistemological level to the
heuristic level (e.g., recognizing classes, subclass relations, rules
that represent inherited properties, etc.).  Then, if all goes well,
it will be mechanically translated back into KIF, for possible testing
by the DesignWorld group, which has already developed a KIF reader.  The
problems of how to resolve name and ontology conflicts, and the
techniques developed to handle them, should provide useful data for
understanding the process of sharing knowledge representation and
ontology.

6. TECHNOLOGY TO FACILITATE THE DEVELOPMENT AND USE OF SHARED
KNOWLEDGE BASES

Several categories of tools can facilitate the development and use of
shared knowledge bases.  

  * Technology for compiling efficient internal representations from
general logical knowledge bases, as described in section 3.3.  MCC's Cyc
project has developed a technology for converting from first-order
descriptions ("epistemological level") to efficient internal ("heuristic
level") representations and inference mechanisms.  The translation works in
both directions, between Cyc and an interlingua-class language.
Mike Genesereth's group has developed techniques for partial evaluation and
meta-level control that also address efficiency concerns with declarative
knowledge bases.

  * Knowledge acquisition tools that know about ontologies.  Modern
knowledge acquisition tools (for building knowledge-based systems) are
themselves knowledge-based.  Typically they are built to acquire
knowledge for a generic class of problem-solving tasks, such as
diagnosis or parametric design, or for a particular problem-solving
method, such as heuristic classification or skeletal planning, and
sometimes for a domain as well.  With each task comes a modeling
vocabulary (e.g., hypothesis, evidence, data abstraction, etc., for
HC) which serves as a representation language for a knowledge
system architecture.  Given a standard representation language and
mechanisms for sharing higher-level knowledge bases, intelligent
knowledge acquisition tools could be written for, and packaged with,
ontologies (a la Penman) and skeletal knowledge bases customizing the
tool to task, method, and domain.

  * Representations of multiple models and microtheories.  Systems
that reside in an environment of large, comprehensive knowledge bases
need to focus on small subsets or views of the KBs at any given time.
An important technical problem is how to represent multiple views or
models that are potentially inconsistent, that share some of the
larger knowledge infrastructure, and that are themselves represented
as objects with explicit properties such as modeling assumptions.

   * Classifiers for managing the terminological space - KBs as name
registries.  We can expect that it will be a very common activity to
extend a given knowledge base by using existing terms and relations to
define new ones.  When that knowledge base is quite large, and was
produced by others, it is difficult to recognize and explicitly declare
all of the ramifications of a new definition.  Furthermore, being
confronted with the inferences that follow from a proposed new
definition can often be quite useful in helping knowledge base
developers decide whether they have said what they meant.  These
functions are provided by automated classifiers, such as those in the
KL-ONE family of languages.  The most useful classifiers have
capabilities to answer questions about why something was classified
as it was.

  * Specification by reformulation.  Another problem facing developers
reusing an unfamiliar knowledge base is that of determining what the
knowledge base offers them as the vocabulary for describing their
concepts.  This is necessary for developers to find or create the
terms/relations they have in mind.  Specification by reformulation
interfaces (e.g., Williams' RABBIT, Patel-Schneider's ARGON, Fischer's
HELGON, Neches' BACKBORD) address this issue by providing a style of
interaction in which a definition is iteratively refined.  At each step,
the interface generates examples by presenting a set of existing objects
or instances consistent with the description but more specific.  These
serve as "memory joggers" that help the developer realize what they can
say (or what they want to avoid saying).  The interface provides a set
of operations (e.g., require, prohibit, generalize, specialize) that
allow the user to point at features in the examples and say how they
should be used to modify their evolving specification.  Having such an
interface is potentially a very valuable browsing and editing aid.  As a
minimum requirement, it necessitates that the knowledge representation
system be able to return answers about subsumption relationships between
terms and between relations.

  * Approximate retrieval.  A specification by reformulation interface
would facilitate the process of finding an object that is close to the
intended definition, copying the object's definition and editing it.
This would be greatly enhanced if it could find objects that were not
strictly consistent with the user's current specification, but were
close approximations.  Brad Allen of Inference Corporation has built a
system with a relatively efficient algorithm for this purpose; others
have developed systems with more general but less efficient algorithms.

   * NL paraphrases.  For developers to have an understanding of terms
and relations in a very large knowledge base, simple inspection of the
definition is rarely adequate.  In spite of the good intentions of a
declarative semantics, the problem is that it is hard to follow layers
of embeddings in which definitions reference other definitions.  Natural
language definitions to go with the formal definitions are therefore
extremely important.  At the minimum, these must be manually generated
and maintained by the original creators of each knowledge base entry.
Eventually, it would be desirable to have the NL definitions
automatically generated from the formal definition, both because this
would allow tailoring to user needs and because it would ease the
problem of keeping the documentation consistent with the specification.
The state of the art today permits automated generation of simple
dictionary-like definitions of KB entries, assuming that they fit within
a particular ontology.  Generation of more sophisticated definitions,
which include summarization of embedded definitions, requires support
for further research.

   * Hypermedia documentation.  We strongly believe that it is essential
to have environments which allow original creators to use hypermedia to
document their Interlingua knowledge bases.  Developers assimilating
those knowledge bases into their own systems must have the ability to
read in the hypermedia annotations along with the formal
specifications, and to examine them interactively.

ACKNOWLEDGEMENTS

Many thanks to Bob Neches, who carefully reviewed the drafts of the
document and contributed to its revised content.  Many of the issues
and suggestions have arisen from discussions among a group of
Stanford researchers attempting to converge on an interlingua called
KIF (Knowledge Interchange Format).  Input from Mike Genesereth,
R. V. Guha, Reid Letsinger, and Marty Tenenbaum has been particularly
helpful.