Mark Tuttle Position Statement ...

From: umls.UUCP!
Date: Thu, 21 Mar 91 12:04:03 PST
Message-id: <9103212004.AA17349@lti3>
Subject: Mark Tuttle Position Statement ...
[Sorry.  No local access to LATEX.]

		Workshop on Shared, Reusable Knowledge Bases
			March 24-25, 1991
			Pajaro Dunes, CA

			Position Statement

			Mark S. Tuttle, Vice-President

			Lexical Technology, Inc.
			1000 Atlantic Ave., #106
			Alameda, CA 94501-1366
			(415) 865-8500 (voice)
			(415) 865-1312 (FAX) "soon", until then.

o What knowledge would it be useful to share in a machine-readable form?

	Even if nothing else is shared I believe it would be useful to
	share the NAMES of the things that the knowledge is about.  This
	quickly leads to requirements for human-readable DEFINITIONS of
	the NAMES, so I know what YOUR name means, and formal notions of
	the CONTEXT of the NAMES, e.g.  RELATIONSHIPS among the NAMES.
	The latter can be simple stuff, e.g. NAME_A "is a" NAME_B, or even
	NAME_A "is narrower than" NAME_B, or more complex relationships,
	e.g. NAME_A "is a conceptual part of" NAME_B, or NAME_A "is a
	manifestation of" NAME_B.
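	As a rough illustration only, the NAME / DEFINITION / RELATIONSHIP
	triple described above could be held in a structure like the one
	below.  The class name, field names, and the sample entries are
	assumptions made for this sketch; they are not taken from any
	existing name server.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NameEntry:
    """One shared NAME with a human-readable DEFINITION and typed RELATIONSHIPS."""
    name: str
    definition: str
    # Maps a relationship label (e.g. "is a") to the set of target names.
    relationships: Dict[str, Set[str]] = field(default_factory=dict)

    def relate(self, label: str, target: str) -> None:
        """Record that this name stands in relationship `label` to `target`."""
        self.relationships.setdefault(label, set()).add(target)


def related(entry: NameEntry, label: str) -> List[str]:
    """Return the names reachable from `entry` via `label`, sorted."""
    return sorted(entry.relationships.get(label, set()))


# Hypothetical sample entries, written for this sketch only.
pellagra = NameEntry("Pellagra", "A disease caused by a deficiency of niacin.")
pellagra.relate("is a", "Nutritional Deficiency Disease")

dermatitis = NameEntry("Dermatitis", "Inflammation of the skin.")
dermatitis.relate("is a manifestation of", "Pellagra")

print(related(pellagra, "is a"))
print(related(dermatitis, "is a manifestation of"))
```

	Keeping the relationship label as plain data means that simple
	labels ("is a") and more complex ones ("is a manifestation of")
	are stored the same way, which is the uniformity argued for here.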

	Note that while we all know that knowledge tends to be context
	specific, it is conceivable that a NAME SERVER could represent
	names and relationships among names in some uniform way without
	doing terrible violence to the knowledge so abstracted.

	Management of such NAME SPACES is one of the things good librarians
	do.  It is also a proto-science view of knowledge, more commonly
	used during the 19th Century, when NAMING things, i.e. CATEGORIZING
	them, in a way that "made sense" to humans was a major advance.
	While I claim that NAME sharing is a necessary (pre-)condition to
	sharing any kind of symbolic information, it is even more important
	in biomedicine, which is NOUN rich, and VERB poor.

	Skipping ahead, I will argue that it is "useful" to share this
	kind of knowledge ONLY if it is maintained.  E.g., suppose you got
	the shared knowledge when whales were thought to be fish.  Would
	you be ready to detect and effect the appropriate update?  Suppose
	you got the knowledge when all diseases, e.g. pellagra, were thought
	to be infectious.  Then it was discovered that some diseases were
	caused by nutritional deficiencies.  (Note that this update is
	different than the problem with whales.)  Then some of these diseases
	were found to be inherited, rather than environmental, etc.  This
	goes on right up to the present.  As a rule of thumb, about 10% of
	the knowledge in the project outlined below changes in some
	non-trivial way each year.  And, these are only the evolutionary
	changes.  Revolutionary changes are much less frequent but also
	more difficult to characterize.

	In summary, success is critical, i.e. we need to learn from
	successful attempts to share knowledge.  And, without success, no
	one will care about maintenance.  But without maintenance, success
	will be short lived.

o What is the purpose or goal of knowledge sharing in your work, and what
form does shared/reusable knowledge take?

	Lexical Technology, Inc. (LTI) is supported by the National Library
	of Medicine (NLM), to help develop the Unified Medical Language
	System (UMLS).  A 10-year project begun in 1986, the UMLS is a
	multi-institutional attempt to provide a uniform interface to the
	world's biomedical knowledge.  The first major deliverable of this
	project was the first Metathesaurus of biomedicine (Meta-1)
	released by the NLM last Fall.  Meta-1 consists of about 30,000
	entries each containing information about a single meaning (one
	meaning, one entry), meaning that ambiguous names show up in more
	than one entry.  Built by LTI, it helps answer the question, for
	humans and programs, "What is it called?"  Typically, "it" is
	found either by lexical matching, because you called "it" by a
	name similar to one of the names in the Metathesaurus, or by
	matching a related name and then navigating semantically to the
	desired entry.  Meta-1 supports five different kinds of semantic
	locality.  A related deliverable is a large-granule semantic
	network, which covers all of biomedicine with 133 nodes, e.g.
	"Disease or Syndrome", "Immunologic Factor", and 34
	kinds of relationships among those nodes, e.g. "Causes",
	"Complicates", "Manifestation Of".  Each of the ~30,000
	entries (meanings) has been assigned one or more semantic types,
	e.g.  "Disease or Syndrome", by a domain expert.  A hypothesis
	to be proven is that the Metathesaurus and the accompanying
	Semantic Network can be reused locally to improve access to local
	information resources.  There is no question in my mind that the
	project will improve information access to the forty-some
	information resources maintained at the NLM.
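	To make "one meaning, one entry" concrete, here is a minimal
	sketch of the two ways of finding "it": lexical matching on a
	name, then semantic navigation to a neighbor.  The entry IDs,
	names, links, and all semantic types other than "Disease or
	Syndrome" are invented for the illustration and are not Meta-1
	content or the Meta-1 format.

```python
from collections import defaultdict

# Each entry stands for exactly one meaning. All contents are hypothetical.
ENTRIES = {
    "M1": {"names": ["Common Cold", "Cold"],
           "semantic_types": ["Disease or Syndrome"],
           "related": ["M3"]},
    "M2": {"names": ["Cold Temperature", "Cold"],
           "semantic_types": ["Natural Phenomenon"],  # placeholder type
           "related": []},
    "M3": {"names": ["Rhinovirus"],
           "semantic_types": ["Virus"],               # placeholder type
           "related": ["M1"]},
}


def normalize(name):
    """Crude lexical normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def build_index(entries):
    """Map each normalized name to the set of entries (meanings) that use it."""
    index = defaultdict(set)
    for entry_id, entry in entries.items():
        for name in entry["names"]:
            index[normalize(name)].add(entry_id)
    return index


INDEX = build_index(ENTRIES)


def lookup(name):
    """Lexical matching: an ambiguous name returns more than one entry."""
    return sorted(INDEX.get(normalize(name), set()))


def neighbors(entry_id):
    """Semantic navigation: entries linked to the given one."""
    return list(ENTRIES[entry_id]["related"])


def preferred_name(entry_id):
    """Treat the first listed name as the preferred one (an assumption)."""
    return ENTRIES[entry_id]["names"][0]


print(lookup("cold"))
print([preferred_name(e) for e in neighbors("M3")])
```

	The ambiguous name "Cold" appears in two entries, so a lexical
	lookup on it returns both meanings and leaves the choice to the
	human or program that asked.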

o What are the barriers to knowledge sharing?

	As Minsky said, "If you can't solve the simplest non-trivial
	version of your problem, it's unlikely that you will solve the
	general case."  Thus, while in computer science we learn that we
	can exchange all the TRUEs and FALSEs, and NODEs with ARCs, with
	impunity, the notion of the objects of interest (the NOUNs) seems
	pretty critical and immutable.  If the VERBs involved seem natural
	and immutable too so much the better.  Further, if we can't share
	knowledge about finite sets, then it's a calculated risk (which we
	may decide to take) to work on sharing knowledge about denumerable
	sets.  In summary, the field of AI has a mind-set about what IS a
	worthy problem which prevents work which might lead to sharing.
	But what is the simplest non-trivial version of a problem?

	Here are three examples.

	During the mid-'60s the most exciting problem in control theory
	was that of providing assistance to pilots landing on an aircraft
	carrier during periods of near zero visibility.  However, the first
	thing the control theorists found was that each plane had a
	different amount of slack in its controls.  Pilots adjusted to the
	idiosyncrasies of each plane without thinking much about it, except
	that each plane had its own control "signature".  (This wasn't each
	different model of plane, but each plane.)  One alternative, the "AI
	solution", would have been to try to replicate the pilots'
	adaptability.  Fortunately, the control theorists were also good
	engineers, so the first thing they did was take all the slack out
	of some planes, for experimental purposes, and get the Navy to work
	on developing planes with reproducible control systems.  All this was
	done, and THEN they solved the remaining control problems.

	LTI has a parallel problem within the UMLS.  Unlike the perfectly
	reasonable assumption with the Knowledge Interchange Format (KIF),
	there is no standard character set for the UMLS project.  Because
	the NLM uses a "non-standard" EBCDIC character set, and "everyone
	else" uses ASCII, the only standard is the notion of an eight-bit
	character.  But, this is not pure perversity on the part of the NLM.
	It took more than a century to negotiate all the agreements the NLM
	has with foreign medical journals (typically, the journals outlive
	the governments), and, in 1966, when the NLM went "on the air" with
	interactive bibliographic searching, they had to guarantee that
	diacritical marks from a number of languages using the Roman
	alphabet were preserved.  And, this is a BIG deal.  (Imagine
	telling the French that their diacritical marks were to be
	discarded!)  Thus, the whole notion of a character set has been a
	struggle from the beginning.  Should we try to solve this as if it
	were an AI problem?  Obviously not, but if we (LTI) don't solve
	it, who will?  So, is combining bibliographic citations in a
	single database knowledge sharing?  Whatever we think, sharing
	cannot take place without first solving the character set problem,
	and like matters.  Thus, we're forced back to this dilemma: what
	is a valid sharing problem?
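	A small sketch of why the character set matters, using Python's
	standard codecs.  The "cp037" code page is used here only as a
	stand-in for an EBCDIC encoding; it is an assumption and is not
	the NLM's own "non-standard" character set.  The sample text is
	likewise invented.

```python
text = "Société médicale"

# The same characters become different bytes under different eight-bit sets.
ebcdic_bytes = text.encode("cp037")     # an EBCDIC code page (stand-in only)
latin1_bytes = text.encode("latin-1")   # an ASCII-compatible eight-bit set
same_bytes = ebcdic_bytes == latin1_bytes

# Reading EBCDIC bytes as if they were Latin-1 yields garbage, not an error.
misread = ebcdic_bytes.decode("latin-1")

# Converting with knowledge of BOTH character sets preserves the diacritics.
converted = ebcdic_bytes.decode("cp037").encode("latin-1")

# Forcing the text into seven-bit ASCII silently discards the diacritics.
stripped = text.encode("ascii", errors="ignore").decode("ascii")

print(same_bytes)
print(converted.decode("latin-1"))
print(stripped)
```

	The point of the sketch is that the only safe exchange is the
	third one, where both parties have agreed on what each byte means.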

	Suppose we assume away the character set problem (as will happen
	eventually).  If you and I use the same knowledge should we expect
	to get comparable results?  One of the first problems you and I
	will discover is that we can't scan the same text fragments and
	come up with the same set of words.  Reducing a sequence of
	characters to a sequence of words is a process of abstraction,
	which gets reinvented, sometimes repeatedly, WITHIN THE SAME
	PROJECT.  Is this a knowledge sharing problem?  If success is our
	objective, it certainly is.  Is the DARPA Knowledge Representation
	Standards Effort prepared to deal with this kind of stuff?
	In the case of the character set problem, it's pre-KIF (we'd have
	to define "escape" sequences), and in the case of the "scanning"
	(word extraction) problem, it may suggest that reusable knowledge
	may REQUIRE reusable tools.  More bluntly, if success means real
	sharing, then real problems often come with a large amount of
	unanticipated baggage.  This is neither a bug nor a feature.  It's
	just the way it is.
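	The "scanning" problem is easy to reproduce.  The sketch below
	applies two plausible word-extraction rules to the same text
	fragment and gets two different sets of words.  Both rules and
	the sample text are assumptions chosen for the illustration;
	neither is the rule used in the UMLS project.

```python
import re

TEXT = "Non-insulin-dependent diabetes (Type II)"


def words_by_whitespace(text):
    """Rule A: a word is anything between runs of whitespace."""
    return text.split()


def words_by_alnum_runs(text):
    """Rule B: a word is a run of letters or digits, lowercased."""
    return re.findall(r"[a-z0-9]+", text.lower())


words_a = words_by_whitespace(TEXT)
words_b = words_by_alnum_runs(TEXT)

# The only word the two rules agree on for this fragment.
shared = set(words_a) & set(words_b)

print(words_a)
print(words_b)
print(shared)
```

	Two sites that share the same knowledge but not the same rule
	will index and match this fragment differently, which is the
	sense in which reusable knowledge may require reusable tools.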

	In summary, knowledge sharing may require definition of a
	sharable INFRA-STRUCTURE on which the ability to share will depend,
	even before we can get to Minsky's "simplest non-trivial version of
	the problem".  Alternatively, each body of knowledge to be
	shared may have to come with its own infra-structure.  There are
	probably many other "barriers", but this is a big one for LTI.

o How can AI help?

	I think AI is the most trend sensitive part of computer science.
	E.g., this workshop constitutes a trend.  If every one of us says,
	by DEFINITION, that knowledge representation doesn't count unless
	someone else uses it in some alien application, or unless new
	insight is produced AND validated, this will go a long way toward
	focusing the issues.  Perfectly honorable work can be done that's
	called "engineering", or "experimental work"; but we should stonewall
	everyone and say "It ain't knowledge until it's reused, usefully, in
	an alien application, except under the following narrow
	circumstances."  Note that a paper that no one reads might not be
	considered a "publication," and a paper that no one cites might not be
	considered (publicly) useful.  (The latter might be very useful to
	the author, but only to the world at large if it leads the author to
	something which is cited.)  A related observation which is most
	troublesome is the problem of papers which are cited frequently
	but which describe programs which didn't last as long as the students
	who wrote them.  Thus, Terry Winograd's thesis would now have to be
	considered an ESSAY on "understanding".

	Though it's much less important than the propaganda efforts
	above, I think AI-approaches to the problem of managing name spaces
	and discovering synonymy within them, i.e. when two lexically
	distinct names name the same meaning, would be very useful.
	They would certainly help in our project, and would increase
	sharing directly, were "name sharing" adopted as a sub-goal.
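	For scale, here is what a purely lexical baseline for proposing
	synonym candidates might look like.  It is NOT an AI approach and
	not a description of LTI's method; it is a simple normalization
	(lowercase, drop punctuation, ignore word order) offered as an
	assumption-laden sketch, with invented sample names.

```python
import re
from collections import defaultdict


def lexical_key(name):
    """Lowercase, split on non-alphanumerics, and sort the words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(words))


def synonym_candidates(names):
    """Group names sharing a lexical key; return groups with two or more names."""
    groups = defaultdict(list)
    for name in names:
        groups[lexical_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample names for the sketch.
NAMES = [
    "Diabetes Mellitus, Insulin-Dependent",
    "Insulin-dependent diabetes mellitus",
    "Pellagra",
    "Common Cold",
]

candidates = synonym_candidates(NAMES)
print(candidates)
```

	A baseline like this only finds names that differ in case,
	punctuation, or word order.  Names that share no words but name
	the same meaning are exactly the cases it misses.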

	On a deeper level, two other ideas come to mind.  First, I think
	it's a myth that most programs encapsulate all important knowledge
	declaratively.  Given that this is so, a lack of such separation
	seems to be a clear impediment to sharing and reuse.  Second,
	demons which monitored drifts and evolution in knowledge would be
	extremely useful.  For instance, to the NLM, it's a FEATURE that the
	meanings of the names it uses in its naming system drift with
	time.  At a practical level, any large, sharable ontology may
	evolve more or less continuously.  An interesting AI problem is to
	develop a demon which monitors such changes relative to an
	application, e.g. if I'm an application, "Do I care about any of
	today's changes?"  On an average day, I expect the answer to be

o What should be the unifying vision for knowledge sharing and reuse?

	First, I believe that "systems which are used tend to get better".
	Therefore trying to create a path which would both allow early
	success and at the same time lay the groundwork for more ambitious
	success later should be a high priority.  Thus "incrementalism"
	should be an important part of the vision.

	Second, I believe the vision should accommodate a homogeneous
	view of the problem, e.g. that there should be a NAME SERVER.

	Third, I believe the vision should accommodate a heterogeneous view
	of the problem, e.g. that knowledge tends to be context specific
	and that any given context may have its own weirdnesses, but these
	idiosyncrasies may be worth sharing nevertheless.

	Fourth, the vision should foster, nurture and enhance an
	infrastructure of otherwise mundane artifacts including character
	sets, physical and logical formats, tools, documentation management
	systems, browsers, etc.  It's hard to see how any attempt to share
	widely can succeed without such things.

	Fifth, the vision should address the many aspects of the
	maintenance issue head on, e.g. there will be those who maintain
	and those who consume maintenance and the needs of both groups
	need to be accommodated from the beginning.

	Finally, AI needs to get its own house in order.  To wit, someone
	needs to develop an operational definition of "knowledge" so that,
	for instance, the highly successful process of code sharing and
	reuse, and emerging standards for various multi-media
	representations of information are either distinguished or
	integrated.  E.g., is knowledge sharing different from code
	sharing?  Or not?  Are any worthy distinctions shades of gray,
	or qualitative differences?  Are we going to reinvent software
	engineering, build on it, or fight with it?

o How can the community cooperate to reach the vision?

	Propaganda, as described above, and the formulation and pursuit of
	achievable sub-goals are probably the two most important things.

	o How should reusable KBs be developed and disseminated?

		I think this question is best approached operationally.
		Ideas about how to do this are easy to come up with, and
		hard to execute.  Vigorous execution may lead to more
		insights than thought experiments.

	o What are some useful pilot experiments?

		I'd like to see two kinds, one narrow and one broad:  The
		first kind is the obvious one, namely to try to reuse some
		specialized knowledge in an alien environment.  E.g., one
		definition of a good tool is that someone uses it for
		something important that the tool's designer never thought
		of.  The second kind relates to the name server proposal
		above.  This kind of pilot experiment won't be worth much
		until its coverage is broad and its database large.

o What are the critical research issues that need to get solved?

	Until knowledge sharing gets past the "easy" stage,
	meaning that most of our efforts are going into
	infrastructure, I don't see any critical research problems except
	those related to maintenance.  These have been outlined
	above.  Once we get past the "easy" stage, meaning we have a
	working infra-structure for sharing, then I expect all the classic
	problems to appear along with some new ones stemming from emergent
	complexity resulting from the large scale.