Mark Tuttle Position Statement ...
From: umls.UUCP!tuttle@mis.ucsf.edu
Date: Thu, 21 Mar 91 12:04:03 PST
Message-id: <9103212004.AA17349@lti3>
To: srkb@isi.edu
Subject: Mark Tuttle Position Statement ...
[Sorry. No local access to LATEX.]
Workshop on Shared, Reusable Knowledge Bases
March 24-25, 1991
Pajaro Dunes, CA
Position Statement
Mark S. Tuttle, Vice-President
Lexical Technology, Inc.
1000 Atlantic Ave., #106
Alameda, CA 94501-1366
(415) 865-8500 (voice)
(415) 865-1312 (FAX)
tuttle@lxt.com "soon";
tuttle@mis.ucsf.edu, until then.
o What knowledge would it be useful to share in a machine-readable form?
Even if nothing else is shared, I believe it would be useful to
share the NAMES of the things that the knowledge is about. This
quickly leads to requirements for human-readable DEFINITIONS of
the NAMES, so I know what YOUR name means, and formal notions of
the CONTEXT of the NAMES, e.g. RELATIONSHIPS among the NAMES.
The latter can be simple stuff, e.g. NAME_A "is a" NAME_B, or even
NAME_A "is narrower than" NAME_B, or more complex relationships,
e.g. NAME_A "is a conceptual part of" NAME_B, or NAME_A "is a
manifestation of" NAME_B.
Note that while we all know that knowledge tends to be context
specific, it is conceivable that a NAME SERVER could represent
names and relationships among names in some uniform way without
doing terrible violence to the knowledge so abstracted.
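As a minimal sketch (in Python, and entirely hypothetical in its
names, definitions, and relationship labels; this is not the UMLS
format), such a NAME SERVER might hold something like:

    # Minimal sketch of a NAME SERVER: names, human-readable definitions,
    # and labeled relationships among names. Names, definitions, and
    # relationship labels are hypothetical illustrations.
    name_space = {
        "pellagra": {
            "definition": "A disease caused by niacin deficiency.",
            "relationships": [
                ("is a", "nutritional deficiency disease"),
                ("is narrower than", "disease"),
            ],
        },
        "nutritional deficiency disease": {
            "definition": "A disease caused by inadequate intake of a nutrient.",
            "relationships": [("is a", "disease")],
        },
    }

    def related_names(name, relation):
        # Return the names one hop away from `name` via `relation`.
        entry = name_space.get(name, {})
        return [target for rel, target in entry.get("relationships", [])
                if rel == relation]

    print(related_names("pellagra", "is a"))
    # -> ['nutritional deficiency disease']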
Management of such NAME SPACES is one of the things good librarians
do. It is also a proto-science view of knowledge, more commonly
used during the 19th Century, when NAMING things, i.e. CATEGORIZING
them, in a way that "made sense" to humans was a major advance.
While I claim that NAME sharing is a necessary (pre-)condition to
sharing any kind of symbolic information, it is even more important
in biomedicine, which is NOUN rich, and VERB poor.
Skipping ahead, I will argue that it is "useful" to share this
kind of knowledge ONLY if it is maintained. E.g., suppose you got
the shared knowledge when whales were thought to be fish. Would
you be ready to detect and effect the appropriate update? Suppose
you got the knowledge when all diseases, e.g. pellagra, were thought
to be infectious. Then it was discovered that some diseases were
caused by nutritional deficiencies. (Note that this update is
different from the problem with whales.) Then some of these diseases
were found to be inherited, rather than environmental, etc. This
goes on right up to the present. As a rule of thumb, about 10% of
the knowledge in the project outlined below changes in some
non-trivial way each year. And, these are only the evolutionary
changes. Revolutionary changes are much less frequent but also
more difficult to characterize.
In summary, success is critical, i.e. we need to learn from
successful attempts to share knowledge. And, without success, no
one will care about maintenance. But without maintenance, success
will be short lived.
o What is the purpose or goal of knowledge sharing in your work, and what
form does shared/reusable knowledge take?
Lexical Technology, Inc. (LTI) is supported by the National Library
of Medicine (NLM), to help develop the Unified Medical Language
System (UMLS). A 10-year project begun in 1986, the UMLS is a
multi-institutional attempt to provide a uniform interface to the
world's biomedical knowledge. The first major deliverable of this
project was the first Metathesaurus of biomedicine (Meta-1)
released by the NLM last Fall. Meta-1 consists of about 30,000
entries each containing information about a single meaning (one
meaning, one entry), meaning that ambiguous names show up in more
than one entry. Built by LTI, it helps answer the question, for
humans and programs, "What is it called?" Typically, "it" is
found either by lexical matching, because you called "it" by a
name similar to one of the names in the Metathesaurus, or by
matching a related name and then navigating semantically to the
desired entry. Meta-1 supports five different kinds of semantic
locality. A related deliverable is a large-granule semantic
network, which covers all of biomedicine with 133 nodes, e.g.
"Disease or Syndrome", "Immunologic Factor", and 34
kinds of relationships among those nodes, e.g. "Causes",
"Complicates", "Manifestation Of". Each of the ~30,000
entries (meanings) has been assigned one or more semantic types,
e.g. "Disease or Syndrome", by a domain expert. A hypothesis
to be proven is that the Metathesaurus and the accompanying
Semantic Network can be reused locally to improve access to local
information resources. There is no question in my mind that the
project will improve access to the forty-some
information resources maintained at the NLM.
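As a rough sketch of the two access paths just described, lexical
matching followed by semantic navigation, here is a toy version in
Python; the entries, field names, and matching rule are hypothetical
illustrations, not the Meta-1 record format:

    # Toy sketch of "What is it called?" against a one-meaning, one-entry
    # store: lexical matching on names, then navigation to semantically
    # related entries. Entries and field names are hypothetical.
    entries = [
        {"id": "M0001", "names": ["myocardial infarction", "heart attack"],
         "semantic_type": "Disease or Syndrome", "related": ["M0002"]},
        {"id": "M0002", "names": ["chest pain"],
         "semantic_type": "Sign or Symptom", "related": ["M0001"]},
    ]

    def lexical_match(query):
        # Crude matching: an entry matches if any of its names shares a
        # word with the query.
        words = set(query.lower().split())
        return [e for e in entries
                if any(words & set(name.lower().split()) for name in e["names"])]

    def navigate(entry):
        # Follow the entry's semantic links to related entries.
        return [e for e in entries if e["id"] in entry["related"]]

    for hit in lexical_match("heart attack"):
        print(hit["id"], hit["semantic_type"],
              [r["id"] for r in navigate(hit)])
    # -> M0001 Disease or Syndrome ['M0002']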
o What are the barriers to knowledge sharing?
As Minsky said, "If you can't solve the simplest non-trivial
version of your problem, it's unlikely that you will solve the
general case." Thus, while in computer science we learn that we
can exchange all the TRUEs and FALSEs, and NODEs with ARCs, with
impunity, the notion of the objects of interest (the NOUNs) seems
pretty critical and immutable. If the VERBs involved seem natural
and immutable too, so much the better. Further, if we can't share
knowledge about finite sets, then it's a calculated risk (which we
may decide to take) to work on sharing knowledge about denumerable
sets. In summary, the field of AI has a mind-set about what IS a
worthy problem which prevents work which might lead to sharing.
But what is the simplest non-trivial version of a problem?
Here are three examples.
During the mid '60's the most exciting problem in control theory
was that of providing assistance to pilots landing on an aircraft
carrier during periods of near zero visibility. However, the first
thing the control theorists found was that each plane had a
different amount of slack in its controls. Pilots adjusted to the
idiosyncrasies of each plane without thinking much about it, except
that each plane had its own control "signature". (This wasn't each
different model of plane, but each plane.) One alternative, the "AI
solution", would have been to try to replicate the pilots'
adaptability. Fortunately, the control theorists were also good
engineers, so the first thing they did was take all the slack out
of some planes, for experimental purposes, and get the Navy to work
on developing planes with reproducible control systems. All this was
done, and THEN they solved the remaining control problems.
LTI has a parallel problem within the UMLS. Unlike the perfectly
reasonable assumption with the Knowledge Interchange Format (KIF),
there is no standard character set for the UMLS project. Because
the NLM uses a "non-standard" EBCDIC character set, and "everyone
else" uses ASCII, the only standard is the notion of an eight-bit
character. But, this is not pure perversity on the part of the NLM.
It took more than a century to negotiate all the agreements the NLM
has with foreign medical journals (typically, the journals outlive
the governments), and, in 1966, when the NLM went "on the air" with
interactive bibliographic searching, they had to guarantee that
diacritical marks from a number of languages using the Roman
alphabet were preserved. And, this is a BIG deal. (Imagine
telling the French that their diacritical marks were to be
discarded!) Thus, the whole notion of a character set has been a
struggle from the beginning. Should we try to solve this as if it
were an AI problem? Obviously not, but if we (LTI) don't solve
it, who will? So, is combining bibliographic citations in a
single database knowledge sharing? Whatever we think, sharing
cannot take place without first solving the character set problem,
and like matters. Thus, we're forced back to the dilemma: what
is a valid sharing problem?
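To make the character set problem concrete: the same eight-bit byte
reads as different characters under an EBCDIC code page and under an
ASCII-derived one. A small Python illustration (cp037 and Latin-1
merely stand in for the two worlds; they are not necessarily the code
pages actually involved):

    raw = bytes([0xC1])
    print(raw.decode("cp037"))    # EBCDIC reading:  'A'
    print(raw.decode("latin-1"))  # Latin-1 reading: 'Á' (A with acute accent)

Merge records byte-for-byte, or strip the high bit, and you silently
destroy exactly the diacritical marks the NLM guaranteed to preserve.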
Suppose we assume away the character set problem (as will happen
eventually). If you and I use the same knowledge should we expect
to get comparable results? One of the first problems you and I
will discover is that we can't scan the same text fragments and
come up with the same set of words. Reducing a sequence of
characters to a sequence of words is a process of abstraction,
which gets reinvented, sometimes repeatedly, WITHIN THE SAME
PROJECT. Is this a knowledge sharing problem? If success is our
objective, it certainly is. Is the DARPA Knowledge Representation
Standards Effort prepared to deal with this kind of stuff?
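To illustrate how two reasonable scanners diverge, here is a toy
Python example; both rules are plausible, hypothetical choices, and
they produce different word sets from the same fragment:

    import re

    text = "Vitamin B-12 deficiency (non-infectious)"

    def scan_a(s):
        # Treat hyphens as word-internal characters.
        return re.findall(r"[a-z0-9-]+", s.lower())

    def scan_b(s):
        # Treat hyphens as separators.
        return re.findall(r"[a-z0-9]+", s.lower())

    print(scan_a(text))
    # -> ['vitamin', 'b-12', 'deficiency', 'non-infectious']
    print(scan_b(text))
    # -> ['vitamin', 'b', '12', 'deficiency', 'non', 'infectious']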
In the case of the character set problem, it's pre-KIF (we'd have
to define "escape" sequences), and in the case of the "scanning"
(word extraction) problem, it may suggest that reusable knowledge
may REQUIRE reusable tools. More bluntly, if success means real
sharing, then real problems often come with a large amount of
unanticipated baggage. This is neither a bug nor a feature. It's
just the way it is.
In summary, knowledge sharing may require definition of a
sharable INFRA-STRUCTURE on which the ability to share will depend,
even before we can get to Minsky's "simplest non-trivial version of
the problem". Alternatively, with each body of knowledge to be
shared may have to come with its own infra-structure. There are
probably many other "barriers", but this is a big one for LTI.
o How can AI help?
I think AI is the most trend-sensitive part of computer science.
E.g., this workshop constitutes a trend. If every one of us says,
by DEFINITION, that knowledge representation doesn't count unless
someone else uses it in some alien application, or unless new
insight is produced AND validated, this will go a long way toward
focusing the issues. Perfectly honorable work can be done that's
called "engineering", or "experimental work"; but we should stone wall
everyone and say "It ain't knowledge until its reused, usefully, in
an alien application, except under the following narrow
circumstances." Note that a paper that no one reads might not be
considered a "publication," and a paper that no one cites might not be
considered (publicly) useful. (The latter might be very useful to
the author, but only to the world at large if it leads the author to
something which is cited.) A related observation which is most
troublesome is the problem of papers which are cited frequently
but which describe programs which didn't last as long as the students
who wrote them. Thus, Terry Winograd's thesis would now have to be
considered an ESSAY on "understanding".
Though it's much less important than the propaganda efforts
above, I think AI-approaches to the problem of managing name spaces
and discovering synonymy within them, i.e. when two lexically
distinct names name the same meaning, would be very useful.
They would certainly help in our project, and would increase
sharing directly, were "name sharing" adopted as a sub-goal.
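For instance, a crude first pass might group names by a normalized
form (lowercased, punctuation stripped, words sorted) and treat each
group as synonym candidates; the Python sketch below, with hypothetical
names, also shows the limit of such a pass, since lexically unrelated
names for the same meaning are exactly what it misses and what AI
could help find:

    import re
    from collections import defaultdict

    def normalize(name):
        # Lowercase, strip punctuation, and sort the words.
        return " ".join(sorted(re.findall(r"[a-z0-9]+", name.lower())))

    names = ["Heart Attack", "Attack, Heart", "Myocardial Infarction"]

    groups = defaultdict(list)
    for n in names:
        groups[normalize(n)].append(n)

    for variants in groups.values():
        if len(variants) > 1:
            print("synonym candidates:", variants)
    # -> synonym candidates: ['Heart Attack', 'Attack, Heart']
    # "Myocardial Infarction" names the same meaning but is not caught.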
On a deeper level, two other ideas come to mind. First, I think
it's a myth that most programs encapsulate all important knowledge
declaratively. Given that most don't, this lack of separation
seems to be a clear impediment to sharing and reuse. Second,
demons which monitored drifts and evolution in knowledge would be
extremely useful. For instance, to the NLM, it's a FEATURE that the
meanings of the names it uses in its naming system drift with
time. At a practical level, any large, sharable ontology may
evolve more or less continuously. An interesting AI problem is to
develop a demon which monitors such changes relative to an
application, e.g. if I'm an application, "Do I care about any of
today's changes?" On an average day, I expect the answer to be
"No."
o What should be the unifying vision for knowledge sharing and reuse?
First, I believe that "systems which are used tend to get better".
Therefore trying to create a path which would both allow early
success and at the same time lay the groundwork for more ambitious
success later should be a high priority. Thus "incrementalism"
should be an important part of the vision.
Second, I believe the vision should accommodate a homogeneous
view of the problem, e.g. that there should be a NAME SERVER.
Third, I believe the vision should accommodate a heterogeneous view
of the problem, e.g. that knowledge tends to be context specific
and that any given context may have its own weirdnesses, but these
idiosyncrasies may be worth sharing nevertheless.
Fourth, the vision should foster, nurture and enhance an
infrastructure of otherwise mundane artifacts including character
sets, physical and logical formats, tools, documentation management
systems, browsers, etc. It's hard to see how any attempt to share
widely can succeed without such things.
Fifth, the vision should address the many aspects of the
maintenance issue head on, e.g. there will be those who maintain
and those who consume maintenance and the needs of both groups
need to be accommodated from the beginning.
Finally, AI needs to get its own house in order. To wit, someone
needs to develop an operational definition of "knowledge" so that,
for instance, the highly successful process of code sharing and
reuse, and emerging standards for various multi-media
representations of information are either distinguished or
integrated. E.g., is knowledge sharing different from code
sharing? Or not? Are any worthy distinctions shades of gray,
or qualitative differences? Are we going to reinvent software
engineering, build on it, or fight with it?
o How can the community cooperate to reach the vision?
Propaganda, as described above, and the formulation and pursuit of
achievable sub-goals are probably the two most important things.
o How should reusable KBs be developed and disseminated?
I think this question is best approached operationally.
Ideas about how to do this are easy to come up with, and
hard to execute. Vigorous execution may lead to more
insights than thought experiments.
o What are some useful pilot experiments?
I'd like to see two kinds, one narrow and one broad: The
first kind is the obvious one, namely to try to reuse some
specialized knowledge in an alien environment. E.g., one
definition of a good tool is that someone uses it for
something important that the tool's designer never thought
of. The second kind relates to the name server proposal
above. This kind of pilot experiment won't be worth much
until its coverage is broad and its database large.
o What are the critical research issues that need to get solved?
Until knowledge sharing gets past the "easy" stage,
meaning that most of our efforts are going into
infrastructure, I don't see any critical research problems except
those related to maintenance. These have been outlined
above. Once we get past the "easy" stage, meaning we have a
working infra-structure for sharing, then I expect all the classic
problems to appear along with some new ones stemming from emergent
complexity resulting from the large scale.