Shared ontologies
From: Patrick Cassidy <email@example.com>
Subject: Shared ontologies
Date: Wed, 7 Jun 95 9:36:51 EDT
X-Mailer: ELM [version 2.3 PL11]
[This note was posted to the conceptual graphs listserver, in response
to a comment from Fritz Lehman. It is reposted here.]
In his response to my suggestion about a triadic relation
for the sign relation, Fritz amplified on my preliminary comments:
PatC>> neither the sign relation nor identity can be completely defined formally in a
>> way that all "interpretants" will interpret in an identical fashion,
>> as long as the set of interpretants includes humans. If we confine
[additional material deleted]
Fritz> Agreed, that the sign relation can't be "completely defined"
> and that "we can't completely control" human interpretations. But
> there are broad areas of intersubjective agreement. Nobody considers
> the Eiffel Tower a kumquat. As we've discussed earlier, the fact that
> ontological classes have disputed definitions at the borderline
> doesn't mean there can't be full agreement on core meanings. That
> is, necessary (maybe not sufficient) axioms for membership in a
> class. This applies as much to the sign-relation as to anything else,
> as far as I can see.
Yes, I do agree with that. I feel certain that there can be substantial
areas of agreement between groups building and using ontologies,
even if the top levels do not look much alike. If we want our
computer systems to talk to and understand each other, it seems that
it is not merely desirable but necessary to find the maximum areas
Fritz> By the way, Pat, I noticed that your revision of Roget's
> Thesaurus is now the official version in the Gutenberg Project
> and is referenced by thousands of people's and institutions'
> World Wide Web homepages around the world. Is there an article
> which describes your revisions and further proposed improvements?
The version I sent to Gutenberg several years ago is little more
than an electronic version of the original 1911 book, cleaned up
to be "machine readable", and with perhaps a thousand newer words added
simply in passing, during the proof-reading stage. I have not
sent them any revisions. The version I accessed from Gutenberg last
year seemed substantially the same as the one I submitted, although
Professor Hart thought that there was a version with some modifications
made by someone else; but I don't know anything about that.
The version I am still working on has not been placed in any
electronic depository, and it is still very crude and incomplete.
But if anyone wants to look at the thesaurus, or at the much shorter
ontological hierarchy (about 2,000 terms, extending only down to the
headwords of Roget-like entries), I will be happy to provide a copy.
The thesaurus/semantic network is still under construction in a simple
word-processor format file (Microsoft Word), total size about
3.7 megabytes, or about 2.8 megabytes in ASCII format. Less than half
of the size is taken up by the semantic relation notation, plus a few
comments about the reasons for specific choices of hierarchy or
semantic relations. This file can be parsed and indexed, allowing
rapid viewing of the contexts of individual words, by a program
written by a colleague in Moscow, designed specifically for the format
and notation used in the word-processor file. That program runs
under DOS on an Intel-based PC.
Although it is still quite incomplete, I did in fact write
a description of the work in progress, and submitted it for
presentation at the ACL-95 meeting, being held in Boston at the
end of this month; but the paper was rejected. Rejection seemed
to be based primarily on the fact that there was no evaluation,
which is quite true. When I get a first, preliminary version
finished (I hope by the end of this summer), the next step will
be to evaluate its utility for natural language understanding,
the purpose for which I am building this semantic network.
Writing the evaluation program will, I expect, be a nontrivial
research effort, and I hope I can find someone who has ideas on
how best to do that, and would like to work together on that
task. My present feeling is that a meaningful evaluation
of its utility would be, for example, to compare its usefulness
to that of WordNet, and to some statistically based method, in a word
This work has two objectives: (1) to construct a skeletal
semantic network, with a complete hierarchy of the basic English
language, for use in natural language understanding, and
(2) to find the minimum set of semantic relations needed to
express the relations between semantically related words.
For the latter purpose, we assume that the words collected
in each of the 1,000 Roget main entries are semantically
related to each other, and we must define explicitly what
that semantic relation is. So far there are about 150 semantic
relations (not including inverses) defined. When the first version
is finished, perhaps by the end of summer, each word will be
connected to the network by at least one semantic relation.
But to define words (with at least their necessary conditions)
will probably require at least ten times as many links.
And there are almost no technical words included. The
natural question to ask is whether this specific structure
will actually be useful, even after supplementation with necessary
detail. The problem of how to evaluate such an artifact may
not be entirely trivial. Unfortunately, my appreciation of
the history of Natural Language Understanding research suggests that
at least a minimal network linking the whole basic language must be
completed before the proposed structure can be properly tested.
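As a rough illustration of the kind of structure described above, the following is a minimal sketch only. The class name, the relation names ("is_a", "part_of") and the sample words are hypothetical and are not taken from the actual thesaurus file or its notation. It shows words joined by named relations, each relation paired with an inverse, and a check for words not yet connected by at least one relation.

```python
from collections import defaultdict


class SemanticNetwork:
    """Toy network of words joined by named, directed semantic relations."""

    def __init__(self):
        self.inverses = {}             # relation name -> name of its inverse
        self.links = defaultdict(set)  # word -> {(relation, other word)}
        self.words = set()

    def define_relation(self, name, inverse):
        # Register a relation together with its inverse (hypothetical names).
        self.inverses[name] = inverse
        self.inverses[inverse] = name

    def add_word(self, word):
        self.words.add(word)

    def relate(self, source, relation, target):
        # Store the link in both directions so either word can be looked up.
        if relation not in self.inverses:
            raise KeyError(f"undefined relation: {relation}")
        self.words.update((source, target))
        self.links[source].add((relation, target))
        self.links[target].add((self.inverses[relation], source))

    def unconnected(self):
        # Words that do not yet have at least one semantic relation.
        return sorted(w for w in self.words if not self.links.get(w))


net = SemanticNetwork()
net.define_relation("is_a", "has_kind")
net.define_relation("part_of", "has_part")
net.relate("kumquat", "is_a", "fruit")
net.relate("peel", "part_of", "kumquat")
net.add_word("tower")          # entered, but not yet linked to anything
print(net.unconnected())
```

Storing each inverse automatically keeps the count of defined relations at the "not including inverses" figure while still allowing lookup from either end of a link.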
The end purpose of this network is language understanding,
and for those who feel that the purpose must be carefully defined,
my minimal definition is "to provide enough understanding of
English so that the system can build its own new concepts by
interacting with an informant in English." The "understanding"
will presumably require, in addition to a network of static nodes
and links, procedural code to "define" at least the meanings of
the semantic relations and 1000 to 3000 "primitive" concepts.
This is not a goal likely to be achieved by one person.
I am only hoping to create an outline of such a network, with most
of the links for most of the top 1000 concepts specified, and perhaps
enough of the most important procedures to be able to test
some aspects of utility for this type of network.
I plan to attend the workshop on Basic Ontological Issues in
Knowledge Sharing, to be held in conjunction with IJCAI95 in Montreal
at the end of August, and I hope that some progress will be made
at that meeting in moving toward some common ground in building
ontologies. Unfortunately, I haven't been able to attend other
conferences specifically devoted to such topics.
By coincidence, in the same batch of mail with Fritz's message,
there was a note from Lenat touching on the same point:
[some discussion about how their four presentations on CYC were booked
into a single time slot at a symposium several years ago]:
Lenat> When we asked (okay,
> complained) about this, we were told that surely no one was REALLY
> interested in hearing about the finished ontologies, or even the
> experiences and lessons learned in actually codifying and testing them;
> no, we were told, "real scientists" don't care about artifacts like
> that, they want to talk ABOUT the theoretical philosophical/logical
> problems in building/sharing... ontologies, things at the level of the
> Yale Shooting Problem, and when to treat extensional failure as
> intensional negation, etc.
Building useful artifacts is indeed more of an engineering task than
a fundamental science, although in the case of Natural Language
Understanding some fundamental scientific questions must be answered
for the effort to succeed. Typically, the success of engineering projects is
measured by how much the ultimate product sells, which is reasonable.
But for the complicated tasks of mimicking human knowledge and
reasoning, the lessons we have learned include the appreciation that
no small group (and certainly no individual) will create a full
replica of [formalized] human knowledge in a few short years.
The CYC group is large enough to have (according to Lenat) written
down a large fraction of what is needed. Those who, for whatever
reason, believe that a lot more work needs to be done, recognize
the value of increased sharing of knowledge bases, as evidenced
by the number of conferences on that topic. Meanwhile,
the "sales" measure of success must be replaced by something else
suited to artifacts still too incomplete to test on the general
market. Among such artifacts, ontologies seem to be a particular
problem. According to Lenat, there is little interest in such things:
Lenat> The CYC member companies okayed our release of CycL
> and most of the Cyc ontology (including everything that philosophers
> mean by the term), and several groups have partaken of them, but
> surprisingly few. But this is not my point -- let me get back to it.
Even if the desire is present, is it easy to get funding to evaluate
a commercial program? I am not in a position to know, never having
tried, but I suspect not. If we want to build
a true "common" ontology, used by everyone, commercial restrictions,
if any, will have to be very light. My suggestion would be that
everyone building an ontology should allow free use of at least the
top 2,000 concepts in the hierarchy (actually, this is Fritz's
suggestion, with which I agree). Then, knowing that they can use
it in any way they choose, there may be a lot more interest in
finding a common (higher-level) ontology, which will be a compromise
among the various individually constructed ontologies, unless one
seems so superior that it attracts a lot of users. The alternative
would seem to be to continue at the present level of sharing of
ideas, until one system becomes so overwhelmingly popular that
it sets a standard (the stated goal of CYC, and presumably the
dream of anyone working on such a task). If no standard is
agreed upon or imposed, or evolves, our application programs will
continue to have great difficulty sharing knowledge. Even with
the top 2000 concepts agreed upon, there will still be enormous
scope for proprietary details and variations in the programs that
will use the ontology.
To be really standard, an ontology should also include
some level of agreement on semantic relations (slots) that are
used in defining terms. Without some commonality in relational
links, it will be hard to verify that the meanings of the
individual nodes in a semantic network have the same
significance to all systems. Can we rely on the hope
that system builders, looking at a node in a network labeled with a
particular English word, will all spontaneously agree on how
that is to be handled by the logical inference mechanisms?
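To make the concern concrete, here is a small sketch under stated assumptions. The two toy ontologies, their concept labels, and the relation names are invented for illustration and do not come from CYC or any other real system. Each ontology maps a concept label to its set of (relation, target) links, and the function reports, for every label the two share, what fraction of the links they agree on.

```python
def compare_ontologies(a, b):
    """Return {concept: fraction of links both ontologies agree on}
    for every concept label that appears in both ontologies."""
    report = {}
    for concept in sorted(a.keys() & b.keys()):
        links_a, links_b = a[concept], b[concept]
        union = links_a | links_b
        # Shared links divided by all links either side asserts;
        # two concepts with no links at all count as full agreement.
        report[concept] = len(links_a & links_b) / len(union) if union else 1.0
    return report


# Hypothetical fragments of two independently built ontologies.
onto_a = {
    "tower": {("is_a", "structure"), ("made_of", "material")},
    "fruit": {("is_a", "plant_part")},
    "sign": {("is_a", "relation")},
}
onto_b = {
    "tower": {("is_a", "structure"), ("located_in", "place")},
    "fruit": {("is_a", "plant_part")},
    "symbol": {("is_a", "relation")},
}

report = compare_ontologies(onto_a, onto_b)
for concept, agreement in report.items():
    print(f"{concept}: {agreement:.2f}")
```

In this toy case both systems have a node labeled "tower", yet they agree on only one of its three links. A comparison of this kind is only meaningful if the relation names themselves are shared, which is the point of asking for agreement on slots as well as on concepts.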
Some additional comment from Lenat's note:
[ deleted material suggesting that FOL, CycL or KIF are all adequate for
Lenat> So my
> advice to all of you is: stop tinkering with KIF, start sharing your
> ontologies, and you will get hit in the face with what's really
> important, namely sharing most but not all of the meaning of most but
> not all of the terms most but not all of the time.
Definitely it's a good idea to try agreeing on meanings (although there
is also something to be said for finding a "KIF" that everyone can agree
is adequate). Now how do we go about this admirable task? The
CYC database and CycL have been made available to universities
(and at a price to commercial organizations), but they are still
proprietary. Since CYC is predominantly an engineering artifact,
interest in it would be primarily to examine it with a view to adopting
it in one's own programs. It cannot be viewed as a natural phenomenon
worthy of investigation in the manner of basic science. For those
who would like to obtain such a database for possible use in their
own programs, the prospect of high price and/or severe restrictions
on usage can be discouraging, and in my own case entirely prohibitive.
The "not invented here" syndrome is not entirely irrational under
all circumstances. On the other hand, no one expects a commercial
organization to simply give away their products developed at great cost,
but the necessary proprietary restrictions will always make adopting
someone else's commercial technology a risk that one doesn't jump
at. Is there nevertheless some way that *some* aspects of CYC can be
made sufficiently available (without seriously compromising the
developers' financial interests) that they can actually be incorporated into
other systems without the painful, drawn out, and frustrating
negotiations often required for licensing?
My own perspective may be too unusual to be of interest to
other workers in this field. Are there in fact any others who
think it would make an important contribution to knowledge
sharing if each group (including CYC) could provide free or
minimally restricted distribution of, say, the top 2000 concepts
in everyone's ontology, with the necessary semantic
links to provide "most but not all of the meaning of most but
not all of the terms most but not all of the time"?
In his book, Doug Lenat says that his philosophy is to get
the top levels right, and then fill in the details. I completely
agree. Is it not more likely that we will "get the top levels
right" if there is a free and uninhibited exchange of ideas about
these top levels, in a public forum?
I hope that this will be among the topics to be discussed
at the Montreal workshop. See you there.
MICRA, Inc. || (908) 561-3416
735 Belvidere Ave. || (908) 668-5252 (if no answer)
Plainfield, NJ 07062-2054 || (908) 668-5904 (fax)