Calculating the semantic distance between two documents using a hierarchical thesaurus
The semantic distance between two documents is a measure of how dissimilar they are in terms of meaning or semantic content: in other words, how far apart the themes or subjects dealt with in the two documents are from each other in semantic terms.
For the record, there are several different methods of defining and calculating such a measure without the aid of a thesaurus:
TF-IDF with Cosine Similarity: TF-IDF (Term Frequency-Inverse Document Frequency) is a method for transforming documents into numerical vectors, where each dimension corresponds to a specific term in the corpus, and the value weights the frequency of the term in the document by its rarity across the corpus (the inverse document frequency). The cosine of the angle between two such vectors is then used as a measure of the similarity between two documents (a minimal sketch is given after this list).
Latent Semantic Analysis (LSA): LSA, sometimes called Latent Semantic Indexing (LSI), is a text analysis method that discovers hidden semantic relationships between words in a corpus using singular value decomposition (SVD). LSA transforms each document into a vector in a space of latent concepts, where the distance between two vectors can be used as a measure of semantic distance.
Word Embeddings (e.g. Word2Vec, GloVe): Word embedding models transform each word into a dense vector in a high-dimensional space. The semantic distance between two documents can be calculated by transforming each document into a vector (for example, by averaging the word embeddings of all the words in the document), and then measuring the distance between these vectors.
Contextual language models (e.g. BERT, GPT): These pre-trained language models generate word embeddings that take into account the context of each word. As with word embeddings, these embeddings can be used to transform each document into a vector and calculate the distance between them.
Topic Modeling (e.g. LDA): Topic models assign a distribution of topics to each document. The distance between these distributions can be used as a measure of the semantic distance between documents.
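As an illustration of the first of these methods, here is a minimal sketch of TF-IDF with cosine similarity, assuming scikit-learn is available; the three example documents are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()          # one dimension per corpus term
tfidf = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Cosine similarity lies in [0, 1] for non-negative TF-IDF vectors;
# 1 - similarity can then serve as a distance.
sim = cosine_similarity(tfidf)
print(sim.round(2))  # docs 0 and 1 are much closer to each other than to doc 2
```

Note that this similarity is purely lexical: two documents that discuss the same topic with different vocabularies will score poorly, which is precisely the gap that thesaurus-based measures aim to fill.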
In this text, we will look more specifically at possible definitions and the calculation of semantic distance between documents using a (hierarchical) thesaurus.
Using a thesaurus: general approach
A thesaurus is a collection of words and phrases (or terms) which are organised according to their relationship to each other. The relationships are generally of three types: equivalence (synonyms or quasi-synonyms), association (semantic links between terms) and hierarchy (more general and more specific terms). A thesaurus is often used to help information retrieval and indexing by providing a standardised list of terms.
To calculate the semantic distance between two documents using a hierarchical thesaurus, three steps are needed:
Document representation: Each document must first be represented in terms of concepts from the thesaurus. This involves identifying each term or concept in the document that corresponds to an entry in the thesaurus. It can be done manually, by experts in the field, or partly or fully automated using processes such as term extraction, word vectorisation or lemmatisation. Each document is then represented as a set or list of these concepts; if a document matches several entries in the thesaurus, all these entries are included in the representation.
Calculating the semantic distance between individual concepts: This calculation can be carried out using the hierarchical structure of the thesaurus. Most methods are based on the principle that the closer the closest common concept (in other words, the closest common ancestor) in the hierarchy is to both concepts, the smaller the semantic distance. One such method is the Wu-Palmer measure, which calculates semantic similarity as twice the number of nodes shared by the two paths from the concepts to the root node, divided by the total length of the two paths. Another is Lin's measure, which is based on the information content of concepts estimated from a corpus of texts. These two measures, and others, are defined and discussed in more detail below.
Aggregation: Once the semantic distances between individual concepts have been calculated, they need to be aggregated to obtain a semantic distance between documents. There are various ways of doing this, such as taking the average, the minimum or the maximum of the pairwise distances, or matching each concept to its closest counterpart in the other document; this last strategy is sketched below.
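As a sketch of the aggregation step, the following function implements the "best match average" strategy just mentioned, assuming a pairwise concept similarity function is supplied (for instance the Wu-Palmer measure defined in the next section); the function names and signatures are illustrative, not standard.

```python
def document_similarity(concepts_a, concepts_b, concept_sim):
    """Best match average: match each concept to its closest counterpart
    in the other document, then average the matches symmetrically."""
    if not concepts_a or not concepts_b:
        return 0.0
    best_a = [max(concept_sim(a, b) for b in concepts_b) for a in concepts_a]
    best_b = [max(concept_sim(a, b) for a in concepts_a) for b in concepts_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2

def document_distance(concepts_a, concepts_b, concept_sim):
    # Turn a similarity in [0, 1] into a distance in [0, 1].
    return 1.0 - document_similarity(concepts_a, concepts_b, concept_sim)
```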
Wu-Palmer distance
The Wu-Palmer measure, introduced in 1994, is a semantic similarity measure based on the depth of nodes in a hierarchical thesaurus, generally one that represents "is-a" relationships between terms. It is widely used in natural language processing.
The Wu-Palmer similarity measure is based on the idea that the more nodes two concepts have in common on their way to the root of the thesaurus, the more similar they are.
For two concepts c1 and c2, the Wu-Palmer similarity is calculated as follows:
sim(c1, c2) = 2 * depth(LCS(c1, c2)) / (depth(c1) + depth(c2))
where:
- LCS(c1, c2) refers to the most specific concept (Least Common Subsumer) that is an ancestor of both c1 and c2 (i.e. the deepest common ancestor in the hierarchy).
- depth(c) refers to the depth of a concept c, i.e. the number of nodes on the path from c to the root of the thesaurus (the root itself having depth 1).
The Wu-Palmer similarity value varies between 0 and 1, where 1 means that the two concepts are identical and values close to 0 mean that the concepts share only the most general ancestors. A short implementation on a toy taxonomy is sketched below.
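Here is a self-contained sketch of the measure on a small invented "is-a" taxonomy, encoded as child-to-parent links; depth is counted in nodes, with the root at depth 1, as in the definition above.

```python
parent = {
    "animal": None,          # root
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}

def path_to_root(c):
    path = []
    while c is not None:
        path.append(c)
        c = parent[c]
    return path  # e.g. ["dog", "mammal", "animal"]

def depth(c):
    return len(path_to_root(c))

def lcs(c1, c2):
    ancestors1 = set(path_to_root(c1))
    # The first ancestor of c2 that also lies on c1's path is the deepest common one.
    for a in path_to_root(c2):
        if a in ancestors1:
            return a

def wu_palmer(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

print(wu_palmer("dog", "cat"))      # 2*2 / (3+3) = 0.67: close concepts
print(wu_palmer("dog", "sparrow"))  # 2*1 / (3+3) = 0.33: only the root is shared
```

For a real thesaurus such as WordNet, libraries like NLTK provide an equivalent wup_similarity function on synsets.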
Note that this measure is highly dependent on the structure of the thesaurus. For example, it assumes that all links in the thesaurus are of equal importance, which is not necessarily the case. In addition, it does not take into account co-occurrence information for terms in documents, which can also provide important information about similarity.
Practical applications
Once a semantic distance (or several) has been defined, it can be used in a variety of knowledge engineering applications, including:
Content recommendation: Whether for articles, books, films or music, semantic distance can be used to recommend items that are semantically similar to those that a user has already liked or searched for.
Document classification (supervised learning): The semantic distance between a new document and documents that have already been labelled can be used to assign the new document to a predefined category, for example with a k-nearest-neighbours classifier. This is useful for organising large sets of documents into an existing set of categories.
Document clustering (unsupervised learning): By calculating the semantic distance between documents, they can be grouped into clusters based on their semantic content. Each cluster contains documents that are semantically similar to each other and distinct from documents in other clusters. This can be extremely useful for organising and exploring large sets of documents. Popular algorithms for document clustering include K-means, DBSCAN, or topic modelling techniques such as Latent Dirichlet Allocation (LDA); a minimal sketch using agglomerative clustering follows this list.
Question and Answer Systems (FAQ): Semantic distance can help to understand the similarity between a question asked by a user and previously asked and answered questions in a knowledge base, enabling relevant answers to be found more quickly.
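As an illustration of the clustering application, here is a minimal sketch assuming a pairwise semantic distance matrix has already been computed (for example with the aggregation function sketched earlier). Agglomerative clustering is used because, unlike K-means, scikit-learn's implementation accepts a precomputed distance matrix; the 4x4 matrix below is invented.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical semantic distances between four documents:
# documents 0-1 are close, documents 2-3 are close.
D = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])

clustering = AgglomerativeClustering(
    n_clusters=2,
    metric="precomputed",  # named "affinity" in scikit-learn < 1.2
    linkage="average",     # "ward" does not work with precomputed distances
)
labels = clustering.fit_predict(D)
print(labels)  # e.g. [0 0 1 1]
```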
Limitations of the Wu-Palmer distance
The Wu-Palmer similarity measure, while useful in many cases, has certain limitations, including:
Dependence on the structure of the thesaurus: Wu-Palmer similarity is highly dependent on the structure of the thesaurus used. If the thesaurus hierarchy is unevenly distributed or biased in some way, this can affect the similarity measure. In addition, thesauri often have limited coverage and may not contain domain-specific terms or more recent terms.
Equal importance of all links: The Wu-Palmer measure assumes that all links in the thesaurus have equal importance, which is not always true. In many cases, some relationships may be more significant than others. As an improvement, different weights can be assigned to relationships depending on their type or level in the hierarchy. For example, parent-child relationships could be given a higher weighting than relationships between distant cousins (one possible scheme is sketched after this list).
Lack of co-occurrence information: The Wu-Palmer measure does not take into account co-occurrence information for terms in documents, which can also provide important information about similarity. A possible area for improvement is therefore the use of word embedding methods such as Word2Vec or GloVe, which are based on the assumption that words that appear in similar contexts have similar meanings.
Difficulty with non-hierarchical thesauri: If the thesaurus used is not strictly hierarchical (i.e. if it is a graph and not a tree), it can be difficult to apply the Wu-Palmer measure.
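To make the weighting idea above concrete, here is one possible scheme (purely illustrative, not a standard measure), reusing the parent, depth and lcs helpers from the Wu-Palmer sketch: each edge is weighted by decay**depth, so that edges deep in the hierarchy contribute less to the distance than edges near the root, reflecting that specific siblings (e.g. "dog"/"cat") are closer than general ones (e.g. "mammal"/"bird").

```python
def weighted_distance(c1, c2, decay=0.5):
    a = lcs(c1, c2)
    dist = 0.0
    for c in (c1, c2):
        # Walk up from each concept to the common ancestor,
        # counting each edge with a weight that shrinks with depth.
        while c != a:
            dist += decay ** depth(c)
            c = parent[c]
    return dist

print(weighted_distance("dog", "cat"))      # 0.25: two deep edges
print(weighted_distance("mammal", "bird"))  # 0.5: two shallow edges, larger distance
```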
Several measures derived from or similar to the Wu-Palmer measure can also be introduced, for example (simple sketches of these measures follow the list):
Resnik's similarity measure (1995): Resnik proposed another similarity measure which uses the information content of concepts, estimated from their frequencies in a corpus (IC(c) = -log p(c)), to quantify the similarity between concepts. For two given concepts, Resnik defines their similarity as the information content of their most specific (i.e. least general) common concept, the intuition being that the more informative the concept the two share, the more similar they are.
Jiang and Conrath's (1997) similarity measure: Jiang and Conrath proposed a measure that combines Resnik's information content with a path-based view of the hierarchy. The JCN measure defines the distance between two concepts as the information contained in each concept minus twice the information contained in their most specific common concept: dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(LCS(c1, c2)).
The coefficient of Leacock and Chodorow (1998): This semantic similarity measure also uses an approach based on depth in a concept hierarchy (such as that used by Wu-Palmer), but it introduces a distance component between concepts. The idea is that the further apart concepts are in the hierarchy, the less similar they are. The LCH measure is defined as the negative logarithm of the length of the shortest path between the two concepts divided by twice the maximum depth of the hierarchy: sim(c1, c2) = -log(path_length(c1, c2) / (2 * max_depth)).
Lin's similarity measure (1998): Lin proposed another similarity measure that combines the Resnik and Wu-Palmer approaches. Lin's similarity is defined as twice the information contained in the most specific common concept, divided by the sum of the information contained in each concept: sim(c1, c2) = 2 * IC(LCS(c1, c2)) / (IC(c1) + IC(c2)). It has the same form as the Wu-Palmer formula, with information content in place of depth.
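The following sketch implements these four measures under simple assumptions: the invented concept probabilities below stand in for corpus-estimated frequencies (with each concept's probability covering its descendants, as in Resnik's estimation), and the depth and lcs helpers come from the Wu-Palmer sketch above.

```python
import math

# Hypothetical concept probabilities for the toy taxonomy; p(root) = 1.
prob = {"animal": 1.0, "mammal": 0.6, "bird": 0.3,
        "dog": 0.3, "cat": 0.2, "sparrow": 0.2}

def ic(c):
    return -math.log(prob[c])  # information content

def resnik(c1, c2):
    return ic(lcs(c1, c2))

def lin(c1, c2):
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))

def jcn_distance(c1, c2):
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))

def lch(c1, c2, max_depth=3):  # max_depth = 3 in the toy taxonomy
    # Shortest path between c1 and c2, counted in nodes, via their LCS.
    path_len = depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1
    return -math.log(path_len / (2 * max_depth))

print(resnik("dog", "cat"), lin("dog", "cat"))        # 0.51, 0.36
print(jcn_distance("dog", "cat"), lch("dog", "cat"))  # 1.79, 0.69
```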
In practice, the Wu-Palmer measure is still widely used due to its simplicity and ease of implementation. In addition to the areas for improvement mentioned above, it is also possible, where useful, to combine the Wu-Palmer measure or one of its derivatives mentioned above with other semantic similarity measures to obtain a more robust measure.
Conclusion
In conclusion, the realisation of a knowledge engineering application, such as a recommendation or clustering engine, based on the annotation of a corpus of documents using a thesaurus depends on several factors:
Quality and relevance of the thesaurus: A well-constructed thesaurus that is relevant to the corpus domain is crucial. It must cover all the concepts and terms present in the corpus so that semantic similarity can be calculated efficiently. The structure of the thesaurus chosen (hierarchical or non-hierarchical, for example) can influence the algorithms to be implemented downstream.
Accurate annotation of documents: Documents must be correctly annotated with the terms in the thesaurus. This stage can be tricky, as correctly identifying the relevant terms and concepts in a document often requires an understanding of the context, which can be difficult to automate.
Choosing the semantic similarity measure: Different measures, such as Wu-Palmer similarity, have different strengths and weaknesses. The choice of which measure to use depends on the specifics of the problem and the corpus.
Data processing and analysis: The data analysis techniques used, such as clustering or recommendation techniques, must be chosen and parameterised correctly to give useful results. This also depends on the specifics of the problem.
Updating and maintaining the thesaurus: Information is constantly evolving, so it is important to keep the thesaurus up to date with new knowledge and terms. Various automatic language processing, machine learning and text mining techniques can help with these tasks, including: named entity extraction, term and relationship extraction, word sense disambiguation techniques.
From an organisational point of view, building this type of application successfully requires a multi-disciplinary team with a good understanding of the domain, expertise in natural language processing and data analysis, and continuous attention to the accuracy and currency of the information.
Brief bibliography
- R. Rada, H. Mili, E. Bicknell, and M. Blettner, "Development and application of a metric on semantic nets," IEEE Transactions on Systems, Man, and Cybernetics, 19(1), pp. 17-30, 1989.
- Z. Wu and M. Palmer, "Verb semantics and lexical selection," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138, 1994.
- P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," arXiv preprint cmp-lg/9511007, 1995.
- J. J. Jiang and D. W. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," arXiv preprint cmp-lg/9709008, 1997.
- C. Leacock and M. Chodorow, "Combining local context and WordNet similarity for word sense identification," in WordNet: An Electronic Lexical Database, 49(2), pp. 265-283, 1998.
- D. Lin, "An information-theoretic definition of similarity," in ICML, vol. 98, pp. 296-304, 1998.
- H. Zargayouna and S. Salotti, "Similarity measurement in an ontology for semantic indexing of XML documents," 15èmes Journées francophones d'Ingénierie des Connaissances, Lyon, France, pp. 249-260, May 2004. hal-00380573.
- M. K. Shenoy, K. Shet, and U. D. Acharya, "A new similarity measure for taxonomy based on edge counting," Int. J. Web Semant. Technol., 3, 23, 2012.
- D. Guessoum, M. Miraoui, and C. Tadj, "A modification of Wu and Palmer semantic similarity measure," in The Tenth International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, pp. 42-46, 2016.
- J. Subercaze, C. Gravier, and F. Laforest, "Metric folding for semantic similarity computation at scale," 16èmes Journées Francophones Extraction et Gestion des Connaissances (EGC 2016), Reims, France, Jan 2016. hal-01254852.