Belated reply to Hovy and Neches ...

umls.UUCP!tuttle@mis.ucsf.edu (Mark Tuttle)
Date: Sun, 24 Feb 91 21:12:23 PST
From: umls.UUCP!tuttle@mis.ucsf.edu (Mark Tuttle)
Message-id: <9102250512.AA22673@umls.lti.com>
To: srkb@isi.edu
Subject: Belated reply to Hovy and Neches ...
Cc: erlbaum@mis.ucsf.edu, fuller@mis.ucsf.edu, nelson@mis.ucsf.edu,
        olson@mis.ucsf.edu, sherertz@mis.ucsf.edu, sperzel@mis.ucsf.edu
	Dear "Shared Ontology-ers":
	
	My apologies for weighing in late.  I look forward to seeing you
	all at the Miami meeting.  I will follow with my replies to Tom's
	imperatives later today or tomorrow.
	
	Until our router box arrives ("Real soon now.") my tuttle@lxt.com
	address will not work.  If your mailer barfs on the UUCP syntax
	you will see on the return address, just use tuttle@mis.ucsf.edu, or
	the alias(?) I have at sumex below.
	
	My responses below are based on my digestion of the 40+ pages of
	e-mail I have received regarding our upcoming meetings.  I wouldn't
	have predicted how timely this all is, but that's another story.
	
[]To: Tom Gruber <Gruber@sumex-aim.stanford.edu>,
[]        John Kunz <kunz@intellicorp.com>, Doug Lenat <lenat@mcc.com>,
[]        Chris Overton <overt@prc.unisys.com>,
[]        Mark Tuttle <tuttle@sumex-aim.stanford.edu>
[]Cc: Tim Finin <Finin@prc.unisys.com>,
[]        Mike Genesereth <mrg@sunburn.stanford.edu>,
[]        Bill Mark <Mark@sumex-aim.stanford.edu>,
[]        Marty Tenenbaum <Marty@cis.stanford.edu>, neches@venera.isi.edu
[]Subject: Position statement for CAIA-91 panel on KR standardization 
[]Date: Mon, 04 Feb 91 17:12:09 PST
[]From: Eduard Hovy <hovy@venera.isi.edu>
[]
[]
[]My statement, contentiously phrased: 
[]
[]
[]1. What is the purpose of knowledge sharing in your project? 
[]
[]The Penman project is building general-purpose Natural Language 
[]tools.
	Until recently, I would have thought that the notion of "general 
	purpose natural language tools" was an oxymoron, the "general" and 
	"natural" being in conflict, as you note below.  The reason I no longer
	think this is that tools needn't carry the semantic burdens of
	models to be useful.  For instance we built the Metathesaurus for the 
	Unified (not Uniform) Medical Language System (UMLS) using UNIX 
	(tools) and a relational database system (Ingres), and edited it by 
	putting domain experts in front of a HyperCard application.  If these 
	things hadn't existed we would have had to try to invent each of 
	them in order to get the work done.  Thus, we take the virtue 
	(necessity?) of tools for granted.  Surprisingly, few outside 
	computer science do.  They confuse the development of tools with 
	masturbation (or something); in any case, tool building is an
	activity they are suspicious of.  So, I've learned to be tolerant of
	the nominal tool-building phases of others.  I look forward to
	hearing about yours.
	
[]We have an extensive sentence generator, two multisentence 
[]text planners, and are extending a parser from prototype to full 
[]size.
	Though I wouldn't have admitted it until recently, we are engaged in 
	sentence generation of a very primitive kind.  We are trying to 
	determine, empirically, the pairs of biomedical nouns which, to use
	your word (below), "collocate".  Work done on this to date has been
	very revealing.  We look forward to extending this work and
	leveraging the results.  (Related to your points below is the
	observation that empirically collocated nouns and noun phrases in
	biomedicine tend to suggest the verbs that might be used with them
	-- though cynics would point out that medicine doesn't have very 
	many important verbs.)
	
[]The current major focus of the project is Machine Translation, 
[]in a joint project with the Center for MT at CMU and the CRL at New 
[]Mexico State University.
[]
[]For us, knowledge sharing enables domain-independent applicability.
	I thought this was the hypothesis to be proven.  Can you offer 
	evidence now?
	
[]To make use of Penman (the sentence generator), for example, a user 
[]must simply define his or her domain model concepts (whether already 
[]organized into a taxonomy or not) in terms of Penman's general 
[]concept ontology (which we call the Upper Model).  The user can 
[]then include domain concepts in the input to Penman, while Penman 
[]makes its decisions in terms of the generic Upper Model concepts 
[]that subsume the input ones.  Penman has been distributed to over 
[]50 research and university sites worldwide, and we find general 
[]satisfaction with this model.
	Although your word "simply" makes me want to hang on to my wallet,
	I look forward to hearing about this.
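A minimal sketch of the subsumption idea described above, assuming a toy parent-link representation. Only THING, OBJECT, PROCESS, and QUALITY appear in the message; the domain concepts and the function names `ancestors` and `upper_model_class` are hypothetical, and this is not how Penman or Loom is actually implemented.

```python
# Hypothetical miniature of an "Upper Model": each concept names its parent.
# Only THING, OBJECT, PROCESS, and QUALITY come from the message; the
# domain concepts below are invented for illustration.
UPPER_MODEL = {
    "THING": None,
    "OBJECT": "THING",
    "PROCESS": "THING",
    "QUALITY": "THING",
}

# A user's domain model, attached by declaring an Upper Model parent.
DOMAIN_MODEL = {
    "Drug": "OBJECT",
    "Aspirin": "Drug",
    "Infusion": "PROCESS",
}


def ancestors(concept, parents):
    """Return the chain of concepts above `concept`, nearest first."""
    chain = []
    parent = parents[concept]
    while parent is not None:
        chain.append(parent)
        parent = parents[parent]
    return chain


def upper_model_class(concept):
    """Find the nearest Upper Model concept that subsumes a domain concept."""
    parents = {**UPPER_MODEL, **DOMAIN_MODEL}
    if concept in UPPER_MODEL:
        return concept
    for candidate in ancestors(concept, parents):
        if candidate in UPPER_MODEL:
            return candidate
    raise KeyError(f"{concept} is not attached to the Upper Model")
```

Under these assumptions, a generator could accept "Aspirin" in its input and make its decisions using only the generic class returned by `upper_model_class("Aspirin")`.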
[]
[]
[]
[]2. What form does the ontology take in your work? 
[]
[]The Upper Model is a taxonomy of approximately 200 nodes organized 
[]in a property inheritance network.
	The semantic network which accompanies the Metathesaurus has 
	133 nodes.  Does this order of magnitude represent some kind of 
	cognitive or consistency maintenance invariant?
	
[]It is implemented in Loom.  
[]The principle of taxonomy construction is based on the structure 
[]of English (rather than essentially on intuition, which is the 
[]normal case in KR).
	I find this distinction very interesting, though only a linguist would 
	be arrogant enough to point it out.  One reason I think it's interesting 
	is that in any environment where queries get stated linguistically, 
	one may make more headway in the short run (at least) by making the 
	taxonomy linguistic.  In the end it may be more natural.  In any case, 
	the attempts to make non-linguistic taxonomies in our project 
	failed.  [Being a non-linguistic thinker, it pains me to agree, but I 
	think you've got a point -- again, if only because so much of what 
	people want to retrieve is (at least vaguely) linguistic.]
	
[]Each node represents an abstraction of some 
[]distinction made in English;
	This sounds like some of my software engineering lectures.  Clear, 
	concise and hard to apply.
	
[]for example, the top node THING 
[]immediately dominates the three nodes OBJECT, PROCESS, and QUALITY.
	One version of the UMLS semantic network started out this way.  It 
	didn't survive the domain experts (the project linguist suggested it, 
	of course), but some of that flavor did survive.
	
[]The Upper Model thus functions as a type of mapping from semantic 
[](domain) terms to English classes: since English tends to treat 
[]entities of similar semantic type fairly similarly (e.g., objects 
[]are generally nouns, and actions verbs), the Upper Model need not 
[]be tremendously large.
	Again, I find this an interesting distinction (having the top level 
	model be about properties of things in English, and the next level be
	about the domain).  One reason is that the first version of the
	Metathesaurus contains 30,000 concepts (one per entry), almost all 
	of which are nouns (or noun phrases), with another 20,000 or so 
	chemicals.  If I understand correctly, the Upper Model would be very 
	simple indeed.
[]
[]We have found that the language-dependence is not, as one might 
[]expect, a hindrance; on the contrary, many people find it very 
[]useful for applications that are non-linguistic.  The generality 
[]of English and the fact that it has been so extensively studied 
[]provides the Upper Model with breadth and coverage not encountered 
[]in typical AI/KR endeavors.  This is the main point I'm going to 
[]argue, when I get to point 4 below.
	I guess this is another one of your hypotheses.  Because so much of 
	medical data is not "linguistic" we will be in a position to test the 
	hypothesis.
[]
[]
[]
[]3. What role does the ontology play in knowledge sharing? 
[]
[]See 1. 
[]
[]
[]
[]4. What is easy and hard about building ontologies? What 
[]   methodologies work and don't work? What are design tradeoffs? 
[]   What are the open research problems? 
[]
[]To be contentious, and to only very slightly overstate what I really 
[]believe, I want to claim here that ontology building for the purpose 
[]of general sharing across many domains (as currently conceived and 
[]practised) is wasted effort.
	I think this is just a definitional problem.  I know of no requirement 
	that an ontology has to be consistent and deterministic.  Clearly, 
	humans do not function this way, and many algorithms avoid 
	exponential running times by avoiding consistency and determinism.
	(On the other hand, as I will try to bring out in subsequent 
	discussions, we're certainly not ready for a relational database 
	system that's inconsistent.  I.e., I think tools are one thing, and 
	domain semantics another.)
	
[]In all but a handful of cases, there's 
[]no way you can achieve generality outside your particular domain area, 
[]since you don't have the time to do an exhaustive (and exhausting! -- 
[]given how hard modeling is) analysis of things as diverse as, say, 
[]cooking and drama and dentistry and Greek mythology.
	If anyone disagrees, they should speak now.  I think it's possible for
	tools, methods and representations to be consistent, but not models 
	of (interesting) content.
[]
[]People who have actually had the money to do more general ontology 
[]building (and this is not aimed specifically at Doug; I really mean 
[]anybody in the KR community I can think of), have run into the next 
[]problem: they've never been able to come up with ways of enforcing 
[]consistency of modeling interpretation across the project's lifetime 
[]and the modellers' intuitions.  We all know the feeling: one day you 
[]have the great insight and model some tough thing one way and are 
[]completely convinced it's ok; and next month you discover that it 
[]was all wrong because now you're looking at it from a different 
[]point of view.  And the month after that again.  Repeat.
	As you observe below about lexicographers, I think the answer is not 
	resolution but discipline.  We've watched those who maintain MeSH 
	(Medical Subject Headings), a 15,000 concept naming system (and 
	taxonomy).  They have to divine each new truth as best they can infer
	it, shoehorn it as consistently as possible into the rest of the naming
	system, and then declare it the truth until it's changed, usually many
	years later.  (MeSH is about 100 years old.)  It's like navigating out 
	of sight of land.  You make your fix as best you can, and mark the 
	map, and, by definition, that's where you are.  The fact that you could 
	make another fix a few minutes later and decide you're somewhere 
	else is to be ignored.  Thus, discipline really is a critical component.
[]
[]But it turns out that there are people who actually *have* come up 
[]with ways of enforcing consistency over large enterprises of this 
[]type: lexicographers.
	I agree!
	
[]There are scads of them working for the Oxford 
[]English Dictionary, for example, dedicated and careful and methodical, 
[]and they follow a plan.
	Well, if you're going to use this example, then you've got to deal 
	with the fact that they've accomplished one revision in 150 years!
	I don't find that at all encouraging.
	
[]They produce something people actually pay 
[]for.
	Amen.
	
[]If you want to have any kind of general success with an 
[]ontology-building project of any magnitude, I claim, you're going
[]to have to do what lexicographers do: predefine a smallish set 
[]of terms in terms of which you describe your world, state them very 
[]clearly to a team of intelligent and diligent types, and feed them 
[]reams of real data to work from, to minimize intuitions.  And you 
[]need a plan so that you systematically cover the areas you're 
[]interested in.  And you need consistency checkers.
	This is a very good description of what we attempted to do with the 
	Metathesaurus, and it's why we could build one in a little more than a
	year, editing and all.  Clarity was not always in evidence, and we did 
	sweep a few problems under the rug, for later resolution.  The key 
	was leveraging existing (computer-readable) biomedical naming 
	systems with the best software we could find.
	
[]
[]Now it's hard enough to agree on syntactic features; how much more so 
[]on semantic ones.
	Nah.  This is where the discipline comes in.  The notion of a 
	thesaurus entry template proved to be a way to unify the user model 
	and the system model.  The template was designed to be the simplest 
	thing to build which we "knew" would be minimally useful.  The 
	engine which computed it could have been built using the frame-
	slot-value triples of frame-based systems.  In fact, we had a lot of
	long skinny tables of 2, 3, and 4 columns.
	As for the semantic ones, people were paid by the "brick", and a
	hierarchy of referees adjudicated difficulties.  By prior agreement
	some recalcitrant entries were hammered into submission.
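To illustrate how an entry template could be computed from "long skinny tables", here is a rough sketch that treats each row as a frame-slot-value triple. The rows, slot names, and the function `build_entries` are assumptions made for illustration; they are not actual Metathesaurus data or code.

```python
from collections import defaultdict

# Each row is one (frame, slot, value) triple: a "long skinny table".
# The rows are invented examples, not Metathesaurus data.
TRIPLES = [
    ("aspirin", "synonym", "acetylsalicylic acid"),
    ("aspirin", "semantic_type", "Pharmacologic Substance"),
    ("aspirin", "source", "MeSH"),
    ("fever", "synonym", "pyrexia"),
    ("fever", "semantic_type", "Sign or Symptom"),
]


def build_entries(triples):
    """Group triples by frame, collecting every value under its slot."""
    entries = defaultdict(lambda: defaultdict(list))
    for frame, slot, value in triples:
        entries[frame][slot].append(value)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {frame: dict(slots) for frame, slots in entries.items()}
```

Calling `build_entries(TRIPLES)` yields one filled-in template per frame, which is roughly what a three-column relational table grouped by its first column would give.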
	
[]This type of project will be dead in the water before 
[]it even starts, given people's familiar inability to see the world the 
[]same way all the time.
	One of my favorite all-time software lectures was about non-
	determinism -- computers can entertain multiple models of the 
	world simultaneously.  If every single person has to have his or her 
	own model, we ARE dead, as you suggest.  But as people agree then 
	the models can be coalesced.  This is big talk, but we allow terms in 
	the Metathesaurus to have up to TWO (wow) semantically distinct 
	meanings, and future versions will handle non-determinism more 
	gracefully.  Ironically, this may be what kills the linguistic approach.  
	Initially, the fundamental unit of the Metathesaurus was that of a
	term (a blessed string of characters) with a meaning or meanings as 
	an attribute.  But the current model evolved, clumsily, into a 
	Metathesaurus of meanings, with terms (their names) as attributes.  
	The next version will handle this much more cleanly.
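The shift from terms-with-meanings to meanings-with-terms can be sketched as an inversion of a mapping. The terms, the meaning identifiers, and the function `invert` below are hypothetical; only the limit of two meanings per term comes from the message.

```python
from collections import defaultdict

# Term-centric view: a term (a string) carries its meanings as an attribute.
# Terms and meaning identifiers are invented for illustration.
TERM_TO_MEANINGS = {
    "cold": ["M1:common-cold", "M2:low-temperature"],
    "common cold": ["M1:common-cold"],
    "acute coryza": ["M1:common-cold"],
}

MAX_MEANINGS_PER_TERM = 2  # the limit mentioned in the message


def invert(term_to_meanings):
    """Build the meaning-centric view: each meaning lists its terms (names)."""
    meaning_to_terms = defaultdict(list)
    for term, meanings in term_to_meanings.items():
        if len(meanings) > MAX_MEANINGS_PER_TERM:
            raise ValueError(
                f"{term!r} has more than {MAX_MEANINGS_PER_TERM} meanings"
            )
        for meaning in meanings:
            meaning_to_terms[meaning].append(term)
    return {m: sorted(ts) for m, ts in meaning_to_terms.items()}
```

In the inverted view an ambiguous string such as "cold" simply appears as a name under two different meanings, so the ambiguity no longer has to be stored as an attribute of the string.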
	
[]So, I argue, what you need is to have computers 
[]do the first few steps of this data collection and initial sorting.
	Amen, again!  Where were you when we needed you!
	
[]For each domain, you should collect all the texts you can lay your 
[]hands on -- newspaper articles, ads, letters to mom, laws, whatever 
[]-- and feed them into programs that sort them and cluster them by a 
[]collection of criteria: spelling, syntactic type, collocation (that 
[]is, co-occurrence with other words) within narrow windows, etc. In 
[]fact, this last is as close as you can come to semantics -- if two 
[]words appear reliably in the context of a third over the millions of 
[]words you've seen, then you can assume that they are semantically 
[]fairly closely related, both to the third word and to each other. 
[]Having done this clustering, you then have a team of trained 
[]lexicographers inspect the clusters and induce the underlying factors 
[]that make them pseudo-synonyms.  Then you recur the process, making 
[]clusters of the pseudo-synonyms, and so forth.  (In fact, it turns 
[]out that lexicographers are currently starting to make use of computers 
[]for much of their initial data-gathering and clustering.)
	These are old ideas, which I've never seen applied in full generality.  
	In comparison, our empiricism is wimpy because we start with
	relatively sanitized sources, and then further sanitize them.  To state
	this in the positive, to the extent we have succeeded, it has been 
	because of the "rigorous implementation of simplicity", and not "the 
	elegant implementation of complexity".  (If anyone can remember 
	who said this first, please let me know.)  Again, the Metathesaurus 
	was developed using a simplified version of exactly what you 
	suggest.
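The windowed collocation idea discussed above can be sketched in a few lines. This is a toy version under stated assumptions: the corpus is invented, the window size is arbitrary, and Jaccard overlap of context sets is swapped in as one simple way to score "appearing in the context of the same third words". It is not the method used for the Metathesaurus or by lexicographers.

```python
from collections import defaultdict


def cooccurrences(tokens, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word].add(tokens[j])
    return contexts


def shared_context(contexts, a, b):
    """Jaccard overlap of the context sets of two words (0.0 to 1.0)."""
    union = contexts[a] | contexts[b]
    if not union:
        return 0.0
    return len(contexts[a] & contexts[b]) / len(union)


# A toy corpus, invented for illustration.
TEXT = "aspirin reduces fever . ibuprofen reduces fever . aspirin thins blood"
CONTEXTS = cooccurrences(TEXT.split())
```

With this corpus, `shared_context(CONTEXTS, "aspirin", "ibuprofen")` scores higher than `shared_context(CONTEXTS, "ibuprofen", "blood")`, since the two drug names share neighbors such as "reduces" and "fever". Pairs scoring above some threshold would then be handed to human editors as candidate clusters.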
[]
[]Before actually starting this type of project, one can learn from
[]lexicographers' experience that it doesn't work in general either beyond a 
[]certain point.  Not only does it get hard toward the top of the hierarchy 
[]of "synonyms",
	The MeSH maintainers tend to ignore the top of their taxonomy for 
	this reason, and because they don't see it as important, forgetting 
	its importance to processibility.  (It's been a bone of contention.)
	
[]but more important, word meanings drift as the domains 
[]become more distant from one another.  A word that in the past might 
[]have been metaphorically applied to a new domain has over time acquired 
[]the new meaning as another sense.  By dint of being similar enough to 
[]inspire the metaphor, the senses share some aspects of meaning, but by 
[]the way the world develops, they have dissimilar connotations as well.
	Biomedicine is filled with examples of this.
	
[]This is all to say that it can be very hard to compartmentalize word 
[]senses, and thus to construct "canonical" meanings for a general 
[]ontology.
[]I believe this type of computer-assisted ontology building, useful though 
[]it may be, is still not going to provide a high-level general ontology 
[]that can be used across various domains.
	Well, the UMLS will be a helluva test of this hypothesis.
	
[]
[]So my final claim: for the top portions of the general-purpose ontology, 
[]the only sensible thing to do is to look at linguistically inspired 
[]ontologies.  Language (in its wide sense, as a semiotic system) is 
[]by far the only general-purpose representation medium we have; it 
[]suffices to carry meanings about almost any domain (only for such 
[]enterprises as quantum mechanics and music and emotions does it not 
[]do too well, but those are enterprises I doubt we'd want to spend too 
[]much time on in this century in any case).
	Sometimes linguistic approaches are poor at complexity management.
	Sometimes a picture is just better.
	
[]By "linguistically inspired" 
[]I mean exactly the kind of reasoning that went into building the Upper 
[]Model: since English treats some entities as objects (by referring to 
[]them using nouns) and others as processes (by referring to them using 
[]verbs), then we need the notions OBJECT and PROCESS. And since it treats 
[]different kinds of processes differently, we need different ontological 
[]types as well, correspondingly.
	Well this is easy enough to try out.
[]
[]Hold it!, you say.  What about Japanese?  Algonquin Eskimo?  Why not 
[]use their Upper Models?  Maybe we don't need four kinds of processes, 
[]but two, or seven!  No, I say, it doesn't matter: you choose your 
[]principle of ontology construction and you stick to it.  You seem 
[]to think there's a "correct" ontology, a "true" way of looking at 
[]the world.  No.  Only to the extent that English is a more compact 
[]way of describing the world than, say, the Khoi-San Bushman tongue, 
[]only to that extent does the English Upper Model provide a "better" 
[]ontology than the Khoi-San one.
	Agreed.
[]
[]So what are the implications?  I believe that you should adopt a 
[]fairly small top-level ontology -- something in the order of a 
[]few hundred to a thousand concepts -- and then build underneath 
[]that various (partially overlapping, and frequently mutually 
[]inconsistent) mid-level ontologies, one for each major area of 
[]concern.  An electrical engineer will use the one appropriate 
[]for EE and a historian the one appropriate for History, and the 
[]fact that many concepts are mutually inconsistent while only 
[]their distant superclasses are not is of no concern.
	I'm not so sure because, for instance, biomedicine is so noun rich
	and verb poor.  Computer science seems to be exactly the opposite.
	Nouns are few and far between.  Verbs are everything.
	
[]In this 
[]way, to use Tom's phrase, "a thousand flowers can bloom" but 
[]there can still be communication across enterprises, communication 
[]that grows steadily less useful as the enterprises drift apart 
[]in underlying conception.  Sound familiar?
	Well, obviously, one must plan for maintenance from the beginning.
	Your heroes, lexicographers, do it and SOME name-system managers do
	it, so maintenance assumptions must be built in.  Another reason for
	tools, environments, etc.
[]
[]=================================================================
[]
[]To: Eduard Hovy <hovy@venera.isi.edu>
[]Cc: Tom Gruber <Gruber@sumex-aim.stanford.edu>,
[]        John Kunz <kunz@intellicorp.com>, Doug Lenat <lenat@mcc.com>,
[]        Chris Overton <overt@prc.unisys.com>,
[]        Mark Tuttle <tuttle@sumex-aim.stanford.edu>,
[]        Tim Finin <Finin@prc.unisys.com>,
[]        Mike Genesereth <mrg@sunburn.stanford.edu>,
[]        Bill Mark <Mark@sumex-aim.stanford.edu>,
[]        Marty Tenenbaum <Marty@cis.stanford.edu>, neches@venera.isi.edu,
[]        neches@venera.isi.edu
[]Reply-To: neches@venera.isi.edu
[]Subject: Re: Position statement for CAIA-91 panel on KR standardization 
[]Date: Mon, 04 Feb 91 22:07:57 PST
[]From: Robert Neches <neches@venera.isi.edu>
[]
[]Ed,
[]
[]Interesting position statement.  I'm not on the panel but I'll throw my $0.02
[]in, anyway: I'm inclined to agree with your conclusion that we'll end up with
[]hierarchies of increasingly topic-specific ontologies (with decreasing degrees
[]of overlap).  However, I don't think that they will be arrived at in the
[]manner you seem to be implying -- by building top-down from linguistic 
[]ontologies -- even though linguistic ontologies may end up near the top.
[]
[]  Rather, I think there's a great extent to which ALL ontologies will
[]have to emerge from bottom-up efforts to reconcile different knowledge bases.
	We agree.  It's nice, maybe even necessary to have a top-down vision
	to guide one, but bottom-up using existing resources seems to be the
	way to go.  Also, the marketplace might support the reconciliation
	of two existing resources.  This reminds me of a Mike Stonebraker
	"database" story.  He said he's never seen a bank merger yet that
	didn't require the development of a "third" schema so that two
	differing bank db schemas could talk to one another.
	In principle (no pun intended), the financial world is simpler than
	most we propose to deal with (though the need for referential
	integrity is very high!).

[]When I try to put my engineering or logistics KBs under PENMAN, I'm sure I'm
[]going to find cases which require extending/modifying the PENMAN ontology in
[]addition to the situations requiring that I adapt mine.
[]
[]  An example (which may be wrong, since I haven't had a chance to ask you
[]about it): Marty's MKS system has a hierarchy of manufacturing operations,
[]or steps, such as Processing-Steps, Testing-Steps and Decision-Steps.
[]Processing-Steps break into Material-Processing and Data-Processing, etc.
[]Now, Material-Processing-Steps look like they fit very neatly in the PENMAN
[]hierarchy as Directed-Actions (directly under PENMAN's concept of
[]Material-Process).  On the other hand, (even with Richard Whitney's help)
[]I had a lot of trouble figuring out where to put MKS' notion of
[]Data-Processing-Steps (e.g., computing yield of a wafer).  It's not quite
[]concrete enough to seem to fit PENMAN's model of a Material-Process, but the
[]only alternative in PENMAN seems to be to model Data-Processing-Steps as
[]Mental-Processes.  The catch is that PENMAN's model says that
[]Mental-Processes require a "Conscious-Agent".  To reconcile these two
[]ontologies, I seem to be faced with the choice of modifying the PENMAN
[]ontology, modifying the MKS ontology, or modelling anything that does
[]Data-Processing-Steps as conscious -- the latter of which seems dubious.
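The conflict described above can be made concrete with a small constraint check. This is a sketch under assumptions: the parent links are a guess at the fragment of the two hierarchies mentioned in the message, concept names are singularized, and `is_a`, `violations`, and the actor types passed in are hypothetical. It does not reflect how PENMAN, Loom, or MKS actually represent these constraints.

```python
# Hypothetical fragment of the two hierarchies discussed in the message.
# Parent links are illustrative guesses, not the real PENMAN or MKS models.
PARENT = {
    "Material-Process": None,
    "Mental-Process": None,
    "Directed-Action": "Material-Process",
    "Material-Processing-Step": "Directed-Action",
    "Data-Processing-Step": "Mental-Process",  # the contested placement
}

# Constraint attached to a concept: the type its actor must have.
REQUIRED_ACTOR = {"Mental-Process": "Conscious-Agent"}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in PARENT."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def violations(step_type, actor_types):
    """List inherited actor constraints that `actor_types` fails to meet."""
    problems = []
    for concept, required in REQUIRED_ACTOR.items():
        if is_a(step_type, concept) and required not in actor_types:
            problems.append(
                f"{step_type} inherits {concept}: actor must be a {required}"
            )
    return problems
```

Under these assumptions, `violations("Data-Processing-Step", {"Program"})` reports the inherited Conscious-Agent requirement, while a Material-Processing-Step performed by a machine passes. The three options in the message correspond to editing `REQUIRED_ACTOR`, editing `PARENT`, or adding "Conscious-Agent" to the actor's types.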
[]
[]  Where does this sort of negotiation between modellers fit into the scheme?
[]
[]-- Bob