SemanticDistance

Ontologies, or controlled vocabularies, are increasingly used in biology to describe gene functions and other concepts (e.g., illnesses, anatomy; for a comprehensive list, see the Open Biomedical Ontologies and the Biological Ontology Databases). A common task is to evaluate the similarity of two objects given the ontology terms that annotate them. For example, given two genes and their functional annotations in Gene Ontology, should their functions be considered close or not?

A first, crude approach to quantifying this similarity is to consider the respective positions of the terms in the ontology (seen as a directed acyclic graph of terms and their relationships). The distance between two terms can be calculated, for example, as the number of nodes separating them. SemanticDistance implements a better approach proposed by Lord et al., itself based on previous work on ontologies such as WordNet.
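To make the crude graph-based approach concrete, here is a minimal Python sketch. It is not code from SemanticDistance: the toy ontology `PARENTS` and the function `edge_distance` are hypothetical names introduced for illustration, and the sketch counts edges on the shortest path (ignoring edge direction), which is one possible reading of "number of nodes separating them".

```python
from collections import deque

# Hypothetical toy ontology (an assumption, not real Gene Ontology data):
# each term maps to the list of its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

def edge_distance(parents, a, b):
    """Number of edges on the shortest path between a and b,
    treating the ontology graph as undirected."""
    # Build an undirected adjacency map from the child -> parents map.
    neighbours = {}
    for term, term_parents in parents.items():
        neighbours.setdefault(term, set())
        for parent in term_parents:
            neighbours[term].add(parent)
            neighbours.setdefault(parent, set()).add(term)
    # Breadth-first search from a until b is reached.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for n in neighbours[term]:
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 1))
    return None  # the two terms are not connected

print(edge_distance(PARENTS, "dna_binding", "rna_binding"))
print(edge_distance(PARENTS, "dna_binding", "catalysis"))
```

A weakness of this measure is that it treats every edge as equally meaningful, regardless of how specific the terms involved are.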

The idea is to use the notion of 'information content': for each term in an ontology, we can calculate a value reflecting how informative this term is. The more a term is used, the less informative it is. The distance between two terms is then calculated from the information content of the parent terms they share. For example, if two terms share only the root of the ontology (the least informative of all terms), their distance is high. But if they share a rarely used parent, their distance is low (their similarity is high). For a complete (and better!) explanation of this algorithm, you can read either the articles below or appendix C of my PhD thesis (in French).
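The information-content idea can be sketched as follows. This is an illustrative sketch, not the SemanticDistance API. It makes several assumptions: the toy data `PARENTS` and `ANNOTATIONS` are invented; the information content of a term is taken as the negative logarithm of its usage frequency, where an annotation to a term also counts for all of its ancestors; and the similarity of two terms is taken as the highest information content among their shared ancestors. How the package turns such a similarity into a distance is not described here, so the sketch stops at the similarity.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical annotations: each gene maps to the terms describing it.
ANNOTATIONS = {
    "gene1": ["dna_binding"],
    "gene2": ["rna_binding"],
    "gene3": ["catalysis"],
    "gene4": ["dna_binding"],
}

def ancestors(parents, term):
    """Return the term itself plus all of its ancestors."""
    found = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def information_content(parents, annotations):
    """Map each used term to log(total / count), i.e. -log(frequency)."""
    counts = {term: 0 for term in parents}
    total = 0
    for terms in annotations.values():
        for term in terms:
            total += 1
            # An annotation to a term also counts for its ancestors.
            for t in ancestors(parents, term):
                counts[t] += 1
    return {t: math.log(total / c) for t, c in counts.items() if c > 0}

def similarity(parents, ic, a, b):
    """Highest information content among the ancestors shared by a and b."""
    shared = ancestors(parents, a) & ancestors(parents, b)
    return max((ic[t] for t in shared if t in ic), default=0.0)

ic = information_content(PARENTS, ANNOTATIONS)
print(similarity(PARENTS, ic, "dna_binding", "rna_binding"))
print(similarity(PARENTS, ic, "dna_binding", "catalysis"))
```

In this toy data, the two binding terms share the parent "binding", which is used less often than the root, so they score as more similar than a pair that shares only the root (whose information content is zero).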

References:

Version

1.1 (Jun 8, 2009)

Documentation

The documentation is included in the package below, in format.

Download