Ontology Based Document Understanding

Notate 96 Conference


Paul S. Prueitt, PhD

Senior Scientist
Knowledge Processing Project
Highland Technologies Inc.

paul@htech.com


    This paper is modified from a chapter in "Knowledge Processing, A new toolset for Data Mining", an internal Highland Technologies document. The paper begins by delineating the basis for ontology based document understanding. The first two sections address hard questions in knowledge representation. We have tried to present this material in such a way as to enable the first-time reader to move ahead to the later sections; these sections can be skipped on first reading, but their content is required to understand the full import of the material that follows. The third section maps out the extraction of theme vectors from Oracle's ConText NLP product and links natural language processing to the formation of a class of natural kind. The fourth section is on the use of formal models as defined by Dmitri Pospelov. The fifth section walks the path required to produce a formal model from the linguistic analysis of text. The next two sections briefly treat the issues of (1) multiple ontologies and (2) knowledge processing units as classification engines. The last section introduces the procedure for constructing a periodic table of subfeatures for a given class of situations.


What is an ontology?


A machine readable ontology is a sophisticated version of a semantic net, in which concepts are identified with nodes and relationships between concepts are identified with links. Semantic nets are sufficient for understanding a class of well formed text whose semantics concerns very static situations. Comparative terminology science, under development by Faina Citkina, identifies two sets of conditions under which text understanding cannot be achieved by a static ontology. The first condition is where a shift in meaning, away from the static case, is required to account for the circumstances of a specific situation. The second condition is where the interpretative rules given by a selected static ontology, even when slightly modified, will always produce an error in interpretation. The architectural design for text understanding must take both of these possible conditions into account, and does so with a theory about the natural compartmentalization of processes in the world.


The issue of translatability from one natural language into another identifies the types of issues that machine understanding systems face. Some of these issues may be addressed if we have machine readable ontologies. A single formal representation of ontology is not, however, sufficient. We can describe conditions for semantic alignment under circumstances where the target and source text are both "interpreted" by correspondences to the combination of two, perhaps distinct, ontologies. What is required is the integration of two or more compartmentalized sets of rules and objects into a new compartment. This is like the joining of two opposing points of view into one that subsumes both. Integration is more likely to occur if it takes place within a larger context and the opposing views are entangled in this context. Tracing the entanglement of two ontologies within a specific situation is a difficult matter. The joining of distinct ontologies will trigger a finite number of paradoxes that must be resolved by known facts about the situation. A system that resolves these paradoxes will produce information complementarity and the emergence of a new system for understanding both ontologies and their natural inter-relationships. A theory of differential ontology helps handle these critical problems.


Differential ontology aids text summarization and generation systems as well as text translation and situational modeling. The theory of process compartments, each compartment having its own ontology, provides a means to ground differential ontology in compartmentalized network dynamics. A mathematical framework based on weakly coupled oscillators illustrates the variety of structural outcomes from differential geometry. If an ontology is associated with a compartment, and multiple compartments are possible, then the theory of process compartments provides a means to understand why some concepts are easily translated while others might not be translatable without significant effort. However, the assumption that multiple compartments exist is not easily justified.


The nature of paradox, complementarity and emergence has physical correlates that are studied at the quantum level by physicists. It is not too much to expect help from this community. Quantum physics is a mature science that has faced a number of hard problems of this type. We can borrow some of the formal tools developed to study elementary particle interaction, and extend quantum mechanical analytical methods to address the hard problems found in computational document understanding. First, we borrow the notion that a finite and quantal distribution of possible interpretations is driven by an underlying, and knowable, organization of the world. This enables the disambiguation of meaning in most cases. In cases where novel emergence must occur in order to find an appropriate container for representation, we hope to use the notion of entanglement and the formation of new compartments through complementarity and observation.


Dmitri Pospelov identified, in the early 1960s, a flaw in modern control theory based on formal systems. Independent Western researchers, like Robert Rosen, have also identified this flaw. Formal systems require a closure on what can be produced from rules of inference about signs and relationships between signs. This means that the formal system, no matter how well constructed, will not always be able to perfectly model the changes in a non-stationary world. Biological systems, however, are capable of constructing new observables through a process of perceptual measurement. How is this accomplished?


Peter Kugler defines perceptual measurement as the construction of world views by biological systems. His recent work addresses the issues of observation, complementarity, emergence and entanglement. He concludes that the origin of semantic functions, the relationships between symbols and the external world, is perceptual measurement. Kugler's experimental results have shown examples where a change in a point of reference will cause a shift in perceptual measurement and thus in the semantics of things observed. He hypothesized that the point of reference uses epistemology to localize facts about the world into a synthesized ontology specific to a situation. Kugler's work with the Russian neuroscientist Juri Kropotov tracks this localization to structural properties seen in human cortical architecture and recently implemented in pattern recognition software.


Classification of issues regarding computational document understanding:


Faina Citkina has created a classification for treating issues of translatability. In her classification schema, there are three types of terminological relativity: referential, pragmatic and interlingual. The Citkina classification will be used to discuss issues that arise in computational document understanding.


Special texts, like product manuals, often have one-to-one correspondences to devices or processes. The issue of their understanding, and thus their translatability, is included in the class of interlingual type terminological relativity, since there is a clear external object for each concept expressed. Technical jargon would appear to have the same distinction, at least on the surface. A poem might have a less clear reference to external objects, and minimalist art would have even less correspondence to a finite and specific set of things in the world.


The Citkina class of interlingual issues can only be resolved if a knowledge domain has been encoded to allow automated checking procedures between the source text and the target text. The knowledge domain can be something like an expert system or object database, but these knowledge sources are not open systems and thus will fail unpredictably. Since the system may not tell us about these failures, it will, as it were, lie to us on a fairly regular basis. The knowledge domain can instead be a semantic net or an ontology like a semiotic table, in which case the possibility for document understanding, and thus translation of meaning, is enhanced.


The Citkina class of pragmatic issues is also related to a theory of interlingua where the situation addressed is dynamic. The Highland approach assumes the necessary existence of a periodic table where the system states that a compartment can assume are all specified and related to a database of subfeatures. The properties of this periodic table are representable in the form of a second database plus a situational language and logic. Developing the table, the situational language and the logic would be an almost inconceivable task, were it not for the work of the semioticians D. Pospelov and V. Finn. However, in the case of Finn's system for structural pharmacology and several other proprietary systems, this work has been done and can be demonstrated. The Pospelov-Finn systems have the ability to produce an "emergent" ontology for situations where pragmatic and interlingual issues characterize the hard tasks. In this case, when the tools are available, the emergent ontology is computable.


An underlying ontology, as expressed in a semantic net or table, can assume different system states and thus the sense of the terms may drift. This would imply that the rules that govern an ontology would allow a modification of the sense of the target term so that the text would be understood in a sense that is consistent with the source term. Here the translation process must import some of the knowledge that tracks this drift in sense, but the target representation would be (almost) semantically invariant to the source representation.


Thus pragmatics is, as it should be, related only to a specific situation at a specific time (or state of the ontology). Interlingua type relativism is a condition of equality, i.e., this word is that word. Pragmatic type relativism is a condition of system transitions from one state into another, but under a uniform set of rules. As demonstrated by Pospelov and Finn, this set of rules can be captured in the special semiotic logics of applied semiotics.


The Citkina class of referential type includes issues arising where a term's meaning in the source language has an ontology that does not exist in the target language. Here the process compartment that shapes the source term's meaning, in the world of someone's experience, does not correspond to a neural processing compartment responsible for generating signs in the target language.


An example would be the ontology created by scientific deference to Marx and Pavlov's scientific materialism in post World War II USSR. In the West there was no such deference, or at least the deferences were of a different type. Much of the ontology of Russian Information Operations (RIO) technology has no corresponding English signs. We can predict, therefore, that RIO will continue to be a mystery to American IO, and vice versa. A second example is the deference given to two valued logic by Western philosophers and scientists. This deference is deeply grounded in our culture. In the West, the notion that non Boolean logic would be of "ontological" value is ridiculed. A third example would be the structure and form of Hopi sand (medicine) drawings. Most people unaware of the Indian "Old Way" would never imagine that a relationship could be made between colored sand designs and the healing process. In each of these examples, the problem with translatability is that there are no containers in which to place meaning in the target language, unless that language has a similar referential type.

Mining for raw resources


The first two sections communicate the general principles that shape the theory and practice of knowledge processing. In the next section, we develop the notation and architecture for extracting concepts from the thematic analysis of document collections.


Let C be a collection of documents, T the set of ConText computed theme vectors, and I the inverted index for T (see Figure 1).


T contains a set of theme vectors,


T = { (n, wp)_j | d_j ∈ C }


where d_j is a document, n = { n_1 , . . . , n_16 } and wp = { wp_1 , . . . , wp_16 }. The positive integer n_i is the semantic weight of the theme wp_i.





Figure 1: Each document in a collection C is represented by a vector of weights and phrases. The full set of theme vectors T is represented as an inverted index I of merged Oracle ConText Option (OCO) classifications.
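
As a concrete illustration, this representation can be held in a few lines of code. The Python sketch below is ours; the ThemeVector type, its field names, and the sample themes are illustrative assumptions, not part of the OCO interface:

```python
from dataclasses import dataclass

@dataclass
class ThemeVector:
    """One document's themes: parallel lists of weights n_i and phrases wp_i."""
    doc_id: str
    weights: list    # n_1, ..., n_16 (positive integer semantic weights)
    phrases: list    # wp_1, ..., wp_16 (theme phrases)

# T, the set of theme vectors for a collection C (sample data is hypothetical):
T = [
    ThemeVector("d1", [12, 9, 4], ["trade policy", "tariffs", "exports"]),
    ThemeVector("d2", [15, 7], ["trade policy", "currency reform"]),
]
```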


The ConText Knowledge Catalog supports an automated procedure that uses an index of OCO classifications to produce a means to classify a collection of documents based on computed themes and user defined views of a knowledge domain. User views are expressed as traditional hierarchical taxonomies or as semantic nets.




Figure 2: User views can take the form of a simple hierarchy (a) or of a more complex semantic net (b).


To add value to OCO classifications, we first refine the representation architecture to produce a set of distinct, situation specific ontologies, and then derive a theory of differential ontology to manage these as data sources. We are interested here in a computational architecture for knowledge extraction and representation. A neural network architecture for simulating selective attention, attraction to novelty and the production of choice is presented in publications with the mathematician Dan Levine (see, for example, Levine, Parks & Prueitt, 1993, Methodological and Theoretical Issues in Neural Network Models of Frontal Cognitive Function, International Journal of Neuroscience, 72, 209-233). A basis, in the research of neuroscientist Karl Pribram, for a multilevel, biologically feasible architecture for compartmental processing of information has been developed but not yet implemented.


The method presented below is a modification of several published methods for identifying concepts using vector representations of documents. It borrows features from Hecht-Nielsen's method based on word stemming plus vector clustering, the Oracle ConText Option (OCO) method for constructing Knowledge Catalogs, and the D. Pospelov and V. Finn methods for situational representation.


A schematic diagram, showing the architecture for knowledge extraction, is drawn in Figure 3. C and T have been introduced above. S is the representational space for the collection's theme vectors. S is formally a simple Euclidean space with, for moderate size collections in one subject field, about 1500 dimensions. Each dimension is created to delineate a single theme phrase. Subject fields with more than 1500 themes should be compartmentalized into a small number of topic areas. The topic areas, and the relationships between them, need to be separated into manageable groups.


Figure 3: Schematic diagram for knowledge extraction and situational representation


Suppose that we have a document collection C about a small number of narrow topics. Let T be the set of ConText generated theme vectors. The phrase components of every theme vector in T can be sorted into bins. New bins are created when necessary so that each bin is representative of a single theme and every theme has been placed in a bin. This process creates an "inverted index" of the themes. The inverted index is ranked by the number of documents having each theme. This ranked index is denoted by the symbol I.


I = { t_i | i ∈ J0 }, where J0 is the index set for the full set of themes
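
A minimal sketch of this binning and ranking step, continuing the hypothetical ThemeVector structure above:

```python
from collections import defaultdict

def inverted_index(theme_vectors):
    """Bin every theme phrase, then rank the bins by document count."""
    bins = defaultdict(set)            # theme phrase -> ids of documents having it
    for tv in theme_vectors:
        for phrase in tv.phrases:
            bins[phrase].add(tv.doc_id)
    # I: themes ranked by the number of documents carrying each theme
    return sorted(bins.items(), key=lambda kv: len(kv[1]), reverse=True)

I = inverted_index(T)   # e.g. [("trade policy", {"d1", "d2"}), ("tariffs", {"d1"}), ...]
```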


Now a user can fix a view of the document collection by marking as "valid" all themes having relevance to that view. This procedure is, of course, based on an intuitive judgment by a user. Software can make the judgment easy to execute, and can allow multiple views to be specified. As we will see, changes in defined views are computationally easy since these changes do not depend on a recomputation of theme vectors.


The valid themes for a view define a subspace S_view. This subspace can be used for trending of synthetic concepts as defined in Chapters 1 and 2 of "Knowledge Processing, A new toolset for Data Mining". The inverted index I can be restricted to the valid themes for a specific view. The result is a new inverted index denoted here by the symbol J.


J = { t_i | i ∈ K }, where K is a subset of the index set J0
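
In code, restricting I to a view is a simple filter; the set of valid themes below stands in for the expert's markings:

```python
def restrict_to_view(ranked_index, valid_themes):
    """J: the inverted index I restricted to themes marked valid for one view."""
    return [(theme, docs) for theme, docs in ranked_index if theme in valid_themes]

valid = {"trade policy", "currency reform"}   # hypothetical user markings
J = restrict_to_view(I, valid)
# The valid themes span S_view, one Euclidean axis per theme:
S_view_axes = [theme for theme, _ in J]
```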


An assertion can be made about the completeness of S_view as a representational space with respect to a knowledge domain. If the collection of documents is comprehensive, then additional computation of theme vectors for new documents will not increase the dimension of S_view. This is because the process of creating new bins for themes will saturate if the size of N is finite.
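
The saturation claim can be checked empirically. A sketch, assuming documents arrive in some order, simply watches whether each new document opens a new bin:

```python
def dimension_growth(theme_vectors, valid_themes):
    """Count how many new valid-theme bins each successive document adds.
    For a comprehensive collection the increments should fall to zero."""
    seen, growth = set(), []
    for tv in theme_vectors:
        new = {p for p in tv.phrases if p in valid_themes} - seen
        seen |= new
        growth.append(len(new))
    return growth   # e.g. [2, 1, 0, 0, ...] once S_view has saturated
```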


N denotes a class of natural kind. It is an inventory of all of the things that are constituents of events that arise in a specific arena. For example, a list of all man-made pharmacological agents would be a class of natural kind. The set of all atomic elements is another class of natural kind. And under certain circumstances, the set of validated themes J is also a class of natural kind. A theory of how the elements of a class are created requires a list of subfeatures and a theory about how the parts of an element are aggregated to form a whole. We will return to this point later when we discuss the so called "derived model" in the last section.


N can be thought of as the situations that arise from J. These situations are often associated with concepts in the form:


concept = { t_i | i = 1, . . . , k }


where t_i is associated with an element in the set of subfeatures J. This is our model for document understanding.


The use of formal models and semiotic models:

In Chapter 1, "Trending Analysis with Concept Trends" of "Knowledge Processing", we introduced the general notion of a synthetic concept with n components:


concept = { a_i | i = 1, . . . , n }


where a_i is a theme phrase. With this very simple construction it is possible to view the occurrence of the full concept, or even of individual themes, within a single synthetic concept. The problem is, of course, that the concept has not been validated as meaningful. In what follows, we will integrate the notion of a formal system with some techniques for refining the description of a meaningful concept.
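
As a sketch of what viewing these occurrences might look like in code, the function below counts, per time bucket, the documents whose theme vectors contain the full synthetic concept; the `timestamp` mapping from document id to bucket is a hypothetical helper, not part of any existing tool:

```python
from collections import Counter

def concept_trend(theme_vectors, concept, timestamp):
    """Occurrence counts of a synthetic concept over time buckets."""
    trend = Counter()
    for tv in theme_vectors:
        if set(concept) <= set(tv.phrases):   # the full concept is present
            trend[timestamp(tv.doc_id)] += 1
    return dict(trend)                        # bucket -> document count
```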


A history of formal and semiotic systems is given in D. Pospelov's book, Situational Control: Theory and Practice, published in Russian by Nauka in 1986. We have worked from a translation produced in 1991, and from a number of articles by Pospelov and Victor Finn and their students. We have also been able to discuss some issues in person with Pospelov and Finn. Our effort has been to synthesize Russian applied semiotics with Western artificial intelligence and the theory of natural language processing. Our purpose is to build a natural language pre-processor for computational knowledge processing based on ontology. The semiotic systems demonstrated to us by the Pospelov-Finn group have shown us that ontologies can be created using the extensive logical theory and heuristic methods developed by the Russian community. In what follows, we will make reference by name to individuals who either communicated a technique to us or published material describing this technique. However, a presentation of the research literature will be left for future work.


The notion of a synthetic concept arises rather naturally once the very hard task of generating theme vectors is completed. For us this is done using the Oracle ConText Option (OCO). ConText computes a set of linguistically weighted theme vectors T. Once this is done, the question of where the theme vectors came from is not important. However, this step is accomplished with an advanced natural language processor (NLP) having a rich knowledge of the world and language. If used in a certain way, the NLP can replace a time intensive step that Pospelov was required to make to produce his formal models. In Situational Control (pg 32) Pospelov states: "situational control demands great expenditures for the creation of a preliminary base of data about the object of control, its functioning and methods of controlling it." This step can be automated using ConText type technology.


Consider the case where our document collection contains diplomatic messages regarding the internal affairs of a country, industry, company, computer network, social situation, medical case, etc. These messages are collected into a document database which is then indexed using ConText. A set T of theme vectors is produced. From this set of theme vectors we can easily merge the individual theme phrases into a table and count the number of messages having a specific theme. This produces an occurrence ranked inverted index I on the message themes. An expert on the internal affairs of the country is asked to mark those themes that are of most interest from a certain point of view. This produces a new inverted index containing only themes of interest. These themes can be quickly converted into either a hierarchical taxonomy, perhaps with "see also" hyperlinks, or a semantic net. One purpose for constructing ConText knowledge catalogs is to provide a means to classify documents by user validated themes. The nodes of the semantic net, or taxonomy, are linked to names of files by HighView, our document management software. The resulting system has a means for displaying the collection of documents by thematic content and for retrieving documents from that display.


We have developed an additional capability. To understand this capability, we return to the definition of Pospelov's formal systems (Situational Control, pg 36):


Definition: The term formal system refers to a four-term expression:


M = < T, P, A, R >


where T = set of basic elements, P = syntactic rules, A = set of axioms, and R = semantic rules. The interested reader can refer to Situational Control for a more detailed treatment of formal systems. For our purposes we need only refer to a figure.

Figure 4: Taken from Figure 1.8, Situational Control (pg 37)


In Figure 4, the set of base elements T is combined in various ways to produce three types of sets: axioms, semantically correct aggregates, and syntactically correct aggregates. We should remember that mathematical logic is founded on a similar construction and therefore that most of the results of mathematical logic will somehow apply later on to the theory of knowledge processing that we are constructing. For example, the set of axioms can be specified to consist of independent, non-contradictory and self evident statements about the set of base elements T. Rules of inference can also be formulated to maintain notions of true or false inference from an assignment of truth to the axioms. The set of syntactically correct aggregations of base elements can be defined either by listing or by some implicit set of rules. The semantically correct aggregations could then be interpreted as those syntactically correct aggregations that have an assignment of true as a consequence of the inference rules. However, this interpretation is only one of a number of interpretations for the formal relationships between T, P, A, and R.


We have found a simplification of the theory and reflected this in our software. Pospelov himself notes that the axiom set can be the same as the set of semantically correct aggregations. In this case the rules of inference need not be known, but we certainly lose the property that the axioms be independent and the axiom set be minimal in size. In the case where the system under study is a natural complex system, such as a national economy, there is no fully understood set of inference rules. One can only observe that the economy experiences modal changes from one condition into another. Each condition can be defined "phenomenologically" as a semantically correct aggregation of an unknown or partially unknown set of base elements. We view only the surface features. Given this caveat, we will define the following formal model.


Definition of a formal model from theme phrases:


Let T = J. T is now the set of theme phrases that have been chosen by an expert as representative of the expert's view of the situations addressed by the messages. The size of T, denoted by |T|, is finite and small, perhaps less than 300 elements. Let P be the set containing a single syntactic rule stating that any subset of T will be considered a syntactically correct aggregation. Of course the size of the set of syntactically correct aggregations is 2^|T|, as large as 2^300, which is a very large number. At this point we have a lower and an upper envelope on the semantic rules. Any possible semantic rule must assign the possibility of being meaningful to elements of the power set P(T).


It is noted that one way to specify a set of semantic rules is to explicitly list the semantically correct aggregations. Let A, the set of axioms, be defined to be equal to the set of semantically correct aggregations.
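
One way to hold the four-term expression in code, reflecting the simplification just described (A given by listing, and R defined as membership in A). The class layout below is our assumption, not Pospelov's notation:

```python
from dataclasses import dataclass, field

@dataclass
class FormalModel:
    """M = <T, P, A, R>, with A specified by listing rather than by inference rules."""
    T: frozenset                              # base elements (validated theme phrases)
    A: list = field(default_factory=list)     # semantically correct aggregations

    def P(self, s):
        """Syntactic rule: any subset of T is syntactically correct."""
        return frozenset(s) <= self.T

    def R(self, s):
        """Semantic rule: meaningful iff listed among the aggregations in A."""
        return frozenset(s) in self.A

themes = frozenset(["trade policy", "tariffs", "currency reform"])   # hypothetical J
M1 = FormalModel(T=themes)    # A is filled by the compound semantic rule described below
```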


T, P and A so defined leave only one remaining definition. The definition for the set of axioms is a bootstrap, since at this point there is no means for identifying which of the syntactically correct aggregations of base elements are meaningful with respect to the view under consideration. We need to create the semantic rules.


In Chapter 2 of "Knowledge Processing", we introduced the notion of stochastic clustering of theme vector representations of documents. We define a semantic rule that states the following: if a subset of T is grouped together by a clustering procedure, then the subset is meaningful. Such a rule would reduce the number of "candidate" semantically correct aggregations from 2^300 to a much smaller number, perhaps 2,000. However, such a rule is dependent on pairwise measures of similarity based on theme vector distance. Selecting good pairwise measures of distance is an interesting problem that has been worked on by a number of researchers. This problem is equivalent to the construction of a good axiom set and proper rules of inference. We are interested in bypassing this problem by employing any reasonable pairwise measure and then employing "checking" procedures to validate potentially meaningful aggregations. What results is a compound semantic rule with two parts, (1) clustering and (2) checking.


To summarize our compound semantic rule: a set of themes serves as subfeatures to be aggregated using an algorithm to cluster theme vectors, as in Chapter 2 of "Knowledge Processing". When vector clustering identifies a collection of theme vectors as being close, then the individual themes within those theme vectors are grouped together as a syntactically correct aggregation. The aggregation is treated as a synthetic concept and checked to see if the synthetic concept is meaningful. Checking for meaning can be as simple as asking the expert if the synthetic concept is suggestive of the situations known to exist and referenced by the message collection.
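
A minimal sketch of the two-part rule, with a pairwise cosine grouping standing in for the stochastic clustering of Chapter 2 (which is not reproduced here) and an expert callback standing in for the checking step:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse theme vectors held as phrase -> weight dicts."""
    dot = sum(w * v.get(p, 0) for p, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def compound_semantic_rule(theme_vectors, expert_check, threshold=0.6):
    """(1) cluster: group theme vectors that lie close together;
       (2) check: keep an aggregation only if the expert accepts it."""
    vecs = [dict(zip(tv.phrases, tv.weights)) for tv in theme_vectors]
    accepted = []
    for i, u in enumerate(vecs):
        for v in vecs[i + 1:]:
            if cosine(u, v) >= threshold:        # step (1): clustering
                candidate = frozenset(u) | frozenset(v)
                if expert_check(candidate):      # step (2): checking
                    accepted.append(candidate)
    return accepted    # these aggregations become the axiom set A

# T here is the ThemeVector list from the first sketch; the stub accepts everything.
M1.A = compound_semantic_rule(T, expert_check=lambda s: True)
```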


At least one automated checking procedure has been identified. Synthetic concepts can be trended over feature sets, such as a time sequence, to see if temporal distributions reveal locally normal profiles (see Figure 18, page 12 of "Knowledge Processing"). Other visualization methods are clearly possible and have the advantage that a user is able to use human intuition to organize and structure the set A of axioms.


In his book, Situational Control, Pospelov describes methods for deconstructing the set A and reconstructing a new, minimal size axiom set A' along with rules of inference that will generate from A' a copy of A. In this case the formal model has a good axiom set, and the inference rules are able to generate conjectures about new aggregations not originally in A, but from the same situation. This enables computational knowledge generation as demonstrated by the Russian semiotic systems.



Working with multiple classes of natural kind:


Each of the seven objects in Figure 3 has an independent role and can be stored separately. For example, J, S_view, and N can be stored in a small computer space and called into being when the original view of a document collection is appropriate. This enables a process compartment approach towards text understanding. The compartment in this case is called a Knowledge Processing Unit (KPU), Figure 6.


It is important to note that a compartmentalization of document views into classes of natural kind is operationally independent of the ConText lexicon and knowledge catalog resources. An iterative feedback between a KPU and ConText would focus the linguistic analysis and produce better results.

Figure 6: Iterative feedback between a Knowledge Processing Unit (KPU) and ConText


This focus will improve the performance of a particular KPU.


The use of a KPU as a classification engine:


A single KPU can be used as a classification engine. The computation involved is minimal, except for the ConText computation of a theme vector. The theme vector can be placed into a visual representation of the classes of natural kind, for example using the Spires software package. If Spires is not available, the vector distance between the center of the concept for each natural kind and the new theme vector is computed, and the classification is made accordingly. The center of a concept can be defined to be the normalized vector sum of all basis elements associated with the themes aggregated together to produce the synthetic concept.
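
A sketch of this fallback computation, assuming each concept is a set of themes and the S_view axes from the earlier sketches; the Euclidean distance and the dictionary of named concepts are our illustrative choices:

```python
import math

def concept_center(concept_themes, axes):
    """Normalized vector sum of the basis elements for a concept's themes."""
    v = [1.0 if a in concept_themes else 0.0 for a in axes]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def classify(theme_vector, concepts, axes):
    """Assign a new document to the concept with the nearest center in S_view."""
    doc = dict(zip(theme_vector.phrases, theme_vector.weights))
    v = [float(doc.get(a, 0.0)) for a in axes]

    def dist(center):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, center)))

    return min(concepts, key=lambda name: dist(concept_center(concepts[name], axes)))
```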


Classification methods based on simple associative neural networks are also possible. Once one or more KPUs are created, a training set of documents can be used to encode a distributed relationship between the classes of natural kind and individual documents. By altering the training presentation sequence and rules, a single document can be associated with multiple concepts. After training, new documents would be classified as concepts within a specific view. Almost no computation, and almost no computer memory, is required for classification using a trained classifier engine, and thus the user, with proper software, could quickly see the conceptual relationships that a document might have in multiple views.
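
A sketch of one such associative scheme, a Hebbian outer-product memory; this is our illustration of the idea, not a description of the Highland implementation:

```python
import numpy as np

def train_associative(doc_vectors, concept_vectors):
    """Hebbian training: sum outer products linking documents to concepts.
    Repeating or reordering presentations lets one document strengthen
    its association with several concepts."""
    W = np.zeros((concept_vectors.shape[1], doc_vectors.shape[1]))
    for d, c in zip(doc_vectors, concept_vectors):
        W += np.outer(c, d)
    return W

def recall(W, doc_vector):
    """Classification after training is one matrix-vector product:
    almost no computation and almost no memory."""
    return W @ doc_vector   # activation per concept; take the argmax
```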


The derived formal model:


The procedure outlined above uses the power of the Oracle ConText Option (OCO) to bypass the time-intensive first step in constructing a Pospelov type formal model. We defined the formal model:


M1 = < T, P, A, R >


where T = a set of themes, P = { the power set operator P(·) on T }, A = { semantically correct elements of P(T) }, and R = { the compound semantic rule }.


We can now define the so called "derived model":


Md = < Td, Pd, Ad, Rd >


This derived model can be developed by following a procedure for deconstructing examples as outlined in Situational Control.


By referring to Figure 5 the reader can follow the creation of the formal model M1. J is computed using ConText and a brief interaction with one or more human experts. J is the base set of elements T for the formal model M1. Through validation procedures, a class of natural kind N is identified by selecting from the power set P(T). Initially this class is simply the axiom set A.

Figure 5: Knowledge extraction and situational representation using a user defined view.


The formal model M1 can be constructed using existing software systems: Oracle's ConText plus some software developed by Highland. However, more can be done once M1 exists. M1 contains a description, A, of meaningful aggregations of subfeature representations of situations in the world. Using this set of descriptions, it is possible to create a theory of natural kind and a new set of subfeatures, Td = F, where each of the elements of a natural class is modeled as the emergent combination of subfeatures. The natural class is initially modeled as being isomorphic to the set A.


The theory of natural kind is specified as a set, Rd, of inference rules for determining the meaningfulness of synthetic concepts, as well as the logico-transformation rules governing how referential objects are formed in an external world. A theory of natural kind is a deep result that can be appreciated by examination of the work of M. Zabezhailo and V. Finn on structural pharmacology. The logico-transformation rules are a meta formalism that can be combined with the theory of plausible reasoning as developed by the logician Victor Finn.


Note that the logico-transformation rules are not part of any formal model. Logico-transformation rules play an important role in moving from a single formal model into a more powerful semiotic model where transition between formal models will be allowed. Logico-transformation rules are intended to explain why a situation would arise as an example of a natural kind.


The semantic rules, R, are a surface result that provides a pragmatic way to delineate all, or most of, the natural kinds in a situation. Highland's strategic plan is to apply R to build a classification engine for document management. This will not require the development of a derived model where logico-transformation rules are specified. The Knowledge Processing Project will, of course, continue to be interested in the derived model, but this system is simply more powerful than is required for vertical markets needing advanced document management methods based on ontology. Constructing N out of Rd and Td may not be far away.



Copyright © Paul S. Prueitt (and with his permission, ACSA). All Rights Reserved.
Email comments to: ACSA at 72662.133@compuserve.com