Ontologies – What are they and what are they good for?
By Ocean Haghighi-Daly
I create resilient data solutions that bring clarity and value from big, complex, and often messy data. I am continuously learning and using cutting edge cloud-based technologies to help my clients process, maintain, and analyse their data better. I like all things big-data and machine learning, especially when they create meaningful insights and valuable predictions from a company’s often least-utilised asset – data!
Recently I helped create an ontology for Highways England as part of a wider Data as a Service (DaaS) platform.
In this article I will sacrifice complete accuracy in favour of normal words so I am not bombarding you with new terminology. In this way I hope to give you a clear view of the main concepts related to ontologies and impress upon you their usefulness.
What is an Ontology?
An ontology is a model. The model includes things that exist (Entities) and how they are related to each other (Relationships). Ontologies are usually seen in the biological and medical sciences where they help ensure all possible research avenues related to an organism or disease are investigated (e.g. by systematically documenting all the ways an organism/disease interacts with the world). The model allows you to see if you have missed out Relationships, since you have given order to what previously might have been just a list of things and some of the ways they interact. By systematically and logically splitting things up in an ordered way (by ‘modelling’ it, i.e. creating an ontology of it) you can be more confident that you have covered all aspects of whatever it is you are looking at or researching.
For Highways England, for example, the model could include:
- People who perform Roles
- Buildings where the People work
- Assets, notably the motorways and major A roads in England
- Stakeholders (i.e. the Government)
- Maintenance activities on the roads
- Incidents which occur on the roads.
The capitalised terms above (People, Roles, Buildings, Assets, Stakeholders, Maintenance, Incidents) are examples of things that could be Entities in the model – things that exist. Relationships are then defined between these Entities to show how everything is related. For example:
- Roles are Performed by People
- Maintenance is Done by Role
- Assets are Owned by Highways England.
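One way to picture this is as a set of (subject, relationship, object) triples. The following is a minimal Python sketch rather than how the Highways England ontology is actually stored: the Entity and Relationship names come from the examples above, while the `Relationship` type and the `relationships_of` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Relationship(NamedTuple):
    subject: str
    name: str
    obj: str


# Entities: the types of thing that exist in the model.
ENTITIES = {"Role", "Person", "Maintenance", "Asset", "Highways England"}

# Relationships taken from the examples above.
RELATIONSHIPS = [
    Relationship("Role", "performed by", "Person"),
    Relationship("Maintenance", "done by", "Role"),
    Relationship("Asset", "owned by", "Highways England"),
]


def relationships_of(entity: str) -> list[Relationship]:
    """Return every Relationship in which the entity takes part."""
    return [r for r in RELATIONSHIPS if entity in (r.subject, r.obj)]


for rel in relationships_of("Role"):
    print(f"{rel.subject} --{rel.name}--> {rel.obj}")
```

Asking for everything connected to Role returns both the Relationship where Role is the subject and the one where it is the object, which is the kind of "what else does this touch?" question the model is meant to answer.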
For us, the ontology should be simple enough to be understood by all employees of Highways England (and most people with some domain experience) without the need for specialist knowledge or training.
What’s the Point?
To answer the question straight away, the use cases can be broadly split into “business cases”, where the focus is around enabling smoother business operations, and “data cases”, where the focus is on data quality, efficiency, and use. Both cases can have large impacts on an organisation’s bottom line – put simply, there is potential for significant saving by doing things more efficiently. The use cases:
- Common terminology – so we know what everyone is talking about.
- Informing Logical and Physical Data Models – so our data is more easily understandable and can be compared and combined more easily.
- Starting a new project, role, or company – understanding where your role, project, and team fit into the organisation and seeing what other parts of the business you affect.
- High level company structure – insight into structure and interaction within the organisation.
- Gap analysis – seeing where different business areas have overlapping functionality and hold overlapping data, to both reduce redundancy (saving money in storage and maintenance costs) and improve data effectiveness (with more complete data showing a more complete picture).
- Schemas for graph databases – ontologies can act as graph database schemas that can be set up and queried to expose new insights from the data.
- Synonyms – ability to understand all possible synonyms of terms by creating a store linked to the uniquely defined common language.
- Interconnected web of data – understand the interconnectedness of data better (e.g. don’t want to delete a dataset not knowing that another part of the organisation is somehow dependent on it).
- Data location – develop a better understanding of where different data is located (this is more implementation dependent and depends on the ontology-to-data setup).
- Third party suppliers – facilitate clear communication and understanding of external partners by providing definitions, organisational diagrams, and context for their work.
- Foundational data models for digital twins – ontologies can act as schemas for digital twins of systems (further detail of this is given by the Centre for Digital Built Britain).
Smoother business operations can be facilitated by clearer communication. Employee time is an expensive resource for most organisations and, when coupled with distributed teams who use different terminology (with different meanings), a surprisingly large amount of time can be wasted on misunderstandings. Wasted time not only means wasted money through lost employee hours, but also delayed projects leading to loss of revenue or, potentially, fines. In the worst-case scenario, a miscommunication can lead to the wrong decision being made, the wrong project being started, or the wrong asset being built, incurring greater costs to fix. It may sound trivial, but think about how many times you’ve talked to someone (perhaps by email) and realised halfway through the conversation that they mean something different to what you originally thought.
Organisations often restructure to improve efficiency and save money. Among other things this can manifest itself by reducing friction in workflows by bringing relevant teams closer together, removing redundant duplication of work (freeing up valuable staff to work on improving other areas), and reducing bureaucracy that slows down projects and inhibits positive change. In order to structure (or restructure) an organisation efficiently, management must understand how the business works, operates, and interacts with itself. An ontology can store this information and provide these insights.
Staff come, go, and change roles all the time. Introducing new joiners to the organisation, their team, and their role can be facilitated with better illustrations of how the organisation works and where their role fits into it. Again, employee time corresponds to business cost, so improvements that make employees’ lives better and easier will save the organisation money and help it be more efficient.
Nowadays data is being recognised as a standalone asset, as more businesses see the value their data can provide. The benefits that good quality data can bring when utilised effectively can lead to strategic growth, as well as reduced costs, that can provide sizeable monetary benefits (which can be significant percentages of the organisation’s revenue).
The issue of understanding data is one I have often experienced personally. Huge amounts of software developers’ and data analysts’, scientists’, engineers’, and modellers’ time can be wasted trying to find out the true meaning of some recorded data. Often data is labelled with something generic and you have to deep-dive into the code that creates the data to find out what it actually is. Taking it one step further, the data can be labelled with something that seems understandable but, because of some odd naming choice or team-specific meaning, actually relates to something slightly (or entirely) different. Data labelled ambiguously can waste days of time and, in the worst case, lead to systems being built and insights being drawn from incorrectly interpreted data, which then get used by the business to make bad decisions, costing money.
Even when data is appropriately named, inconsistencies in how different teams record their data can lead to incomplete, disparate, and less valuable data. It takes time to join up different datasets when the names of table columns don’t match, there is extra or missing data in some of the datasets, or the data is stored in different formats. More value is lost when related datasets are not connected because they are created, maintained, and analysed by separate teams who are unaware of each other’s existence. The duplication of work, as well as the duplication of storage and maintenance, is an unnecessary cost to the business.
Who would use an Ontology?
Various people in the organisation would use the ontology but the exact users would depend on the depth of complexity the ontology has stored and the visualisation tools made available.
In order of most-likely to least-likely:
- Data modellers – would use the ontology to structure Logical and Physical Data Models to store the data coherently and cleanly
- Senior management – to understand the structure of the organisation and how the different areas interact to both understand the business more clearly but also find efficiency improvements and enhancements
- Cross functional business teams and project teams – to understand what other teams their team affects, to help reduce unexpected side effects from their work on other business areas
- Data Scientists/Engineers – if the ontology is strongly linked to the data storage mechanism (e.g. by using Data Vault on Azure) then finding and accessing the desired data could be made easier by linking all instances of e.g. vehicle data to a single ‘vehicle’ entity that comes from the ontology
- Data owners – to better understand their data
- Suppliers, external parties, and new joiners – to understand the business and how it interacts.
What actually is an ontology? What does it look like?
Above this text is an ontology (don’t get put off by the messy diagram, better visualisations are to follow). Both this and the following diagram only show Entities (and corresponding Relationships) in the top three levels of the (Highways England ontology) hierarchy.
There are two concepts in the ontology that will allow you to make more sense of the above mess: Hierarchies and Relationships. The ontology is made up of Entities which are given a hierarchical structure – think of folders on your computer being inside other folders, but instead of storing documents the hierarchy represents Parent-Child Relationships (Jim Jr is the child of Jim Sr) and, for us, Type-Subtype Relationships (a VW Golf is a type of Car). The solid blue lines represent the hierarchy, whereas all other lines represent other Relationships between Entities.
On the right there is a diagram showing just the hierarchy.
How do you start? Use a predefined standard
The very top level Entity of the ontology is Thing – everything that exists in the universe is characterised as a “Thing”. From there, we used a predefined standard called the “Basic Formal Ontology” which broadly splits up the universe into “things that exist” and “things that happen” (In ontology jargon, these are called Continuants and Occurrents respectively). The “things that exist” are further split into Material things (things you can touch) and Immaterial things (concepts). We therefore have three items we categorise the entire universe into – “material things that exist”, “immaterial things that exist”, and “things that happen”. These are the Entities you see “Thing” being split up into in the diagrams above.
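The hierarchy described above can be sketched as a simple child-to-parent map. This is a minimal illustration in Python, not the Basic Formal Ontology itself: only the top levels follow the text, the intermediate Continuant level is kept explicitly, and the placement of Car, VW Golf, and Accident is assumed for the sake of the example. The `ancestors` and `is_a` helpers are hypothetical names.

```python
# Child -> parent links. Only the top levels follow the article; the
# placement of "Car", "VW Golf" and "Accident" is assumed for illustration.
PARENT = {
    "Continuant": "Thing",       # things that exist
    "Occurrent": "Thing",        # things that happen
    "Material": "Continuant",    # things you can touch
    "Immaterial": "Continuant",  # concepts
    "Car": "Material",
    "VW Golf": "Car",
    "Accident": "Occurrent",
}


def ancestors(entity: str) -> list[str]:
    """Walk up the hierarchy from an entity to the root, "Thing"."""
    path = []
    while entity in PARENT:
        entity = PARENT[entity]
        path.append(entity)
    return path


def is_a(entity: str, candidate: str) -> bool:
    """True if the entity is the candidate or one of its subtypes."""
    return entity == candidate or candidate in ancestors(entity)


print(ancestors("VW Golf"))
print(is_a("VW Golf", "Continuant"))
print(is_a("Accident", "Continuant"))
```

Because every Entity has exactly one parent in this sketch, a Type-Subtype question such as "is a VW Golf a thing that exists?" reduces to walking up the chain until "Thing" is reached.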
The “Basic Formal Ontology” we used is one of a large number of standard “starting points” you can base your ontology on. These “starting points” (called “Upper Ontologies” or “Top Level Ontologies”) vary in the way they choose to logically separate the world, and can vary hugely in their complexity. Which standard you choose as your base will depend on the use case of your ontology. We wanted the ontology to be easily understandable by all employees of Highways England, the idea being you should not need to be an ontology expert to understand how the company works. Therefore, in this project we sacrificed the ability to handle certain situations and descriptions of the world in an “ontologically elegant” manner, which other (more complex) standards allowed, in order to achieve the desired simplicity. An example of a benefit of some more complex ontologies is a more “elegant” handling of the concept of “time”.
How do you actually create it?
- Choose a way of creating and storing your ontology (e.g. an ontology software tool).
- Get a good understanding of the domain you wish to represent.
- Decide who will use the ontology (and how) so you can choose an appropriate standard (“Upper Ontology”) to use.
- Split the different areas of your domain into the standard’s structure using your chosen tool.
At Highways England we started with a high-level business diagram of the organisation, which we were provided. The diagram (shown to the right) is the “Information Vision and Strategy” diagram detailing the planned future structure and flow of the organisation and its data. Starting with this diagram allowed us to ensure the ontology would be representing (and supporting) the future structure that Highways England is creating. The diagram below is a high-level overview of the flow of physical assets (outer ring) and data assets (inner blue ring) and how they interact with the different business areas. To do our work we used a much more detailed version of the diagram to the right.
We decided to split the organisation into its 11 business areas and work sequentially on those – this allowed us to break up and understand a huge organisation in more manageable chunks. For each business area we started by reading all documents we could find related to that sector online, on the organisation’s website, and provided directly by the client. We created diagrams (using draw.io) as we went along to illustrate what “Things” existed in those business areas and how they were related. Once the diagrams were as complete as we could get them, we engaged directly with stakeholders who worked in those business areas in order to get feedback, corrections, and a better understanding of the business. The diagrams were improved from this feedback and often new documents with further details would be provided. The process was then iterated – more stakeholders would be interviewed and their feedback implemented until the stakeholders felt the diagram was an accurate representation of their sector. The Entities and Relationships defined in the diagram were then added to the ontology, making the ontology a more complete representation of the organisation.
During stakeholder engagements, special significance was placed on getting the terminology in the diagrams correct. There is little use in an ontology that uses words the business is unfamiliar with. It is much better to take your lead from the business’ terminology because it is these business users who will use the ontology. If they don’t understand it, or it is too contrived or confusing, they simply won’t use it.
After a few business areas were covered, a lot of the key Entities existed in the model, so the creation of diagrams and updates to the master ontology involved less work. Person, for example, appeared in all business areas so was a common feature in all diagrams. After covering five of the key Highways England business areas we found we were adding very few new Entities.
How much information do you add?
The level of granularity to include in the ontology is not simple to decide. If you add too little information and stay very high level, your ontology doesn’t provide much value as it is quite abstract. With too much information, the ontology can quickly become unwieldy with seemingly endless lists of items to add. “Instances” of Entities (e.g. the specific car you own and drive is an instance of a “Car”) are not included in the ontology – we are not trying to list every actual item that exists, but rather specify all types of item that exist and how those types of item interact. In the ontology we do not capture the information that your blue 2012 VW Golf hit a crash barrier on the M4 – we only capture the information that a Car can have an Accident with a Road Restraint. Specific examples are instances of the generic relationships captured in the ontology and would be the data actually stored in databases that the ontology helps structure.
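The split between types (held in the ontology) and instances (held in databases) can be sketched as follows. This is an illustrative Python example only: the Car, Accident, and Road Restraint relationship comes from the text, while the instance names, the "Road" type, and the `conforms` helper are made up for the example. For simplicity each instance is mapped straight to its top-level type.

```python
# Type-level knowledge: this is what the ontology stores.
ALLOWED = {("Car", "has accident with", "Road Restraint")}

# Instance-level data: this lives in databases, not in the ontology.
# The instance names and the "Road" type are made up for illustration.
INSTANCE_TYPE = {
    "blue 2012 VW Golf": "Car",
    "M4 crash barrier": "Road Restraint",
    "M4": "Road",
}


def conforms(subject: str, relationship: str, obj: str) -> bool:
    """Check an instance-level fact against the type-level ontology."""
    triple = (INSTANCE_TYPE.get(subject), relationship, INSTANCE_TYPE.get(obj))
    return triple in ALLOWED


# A specific event is valid data because its types match the ontology.
print(conforms("blue 2012 VW Golf", "has accident with", "M4 crash barrier"))
# The same relationship between a Car and a Road is not defined above.
print(conforms("blue 2012 VW Golf", "has accident with", "M4"))
```

The ontology never grows when a new car or a new accident is recorded; only the instance data does.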
The level of detail to be included in an ontology depends on use case, requirement, and time available. While there aren’t strict rules, some things considered when deciding if Entities were too granular were:
- If Entities were more than three levels below “Thing” in the hierarchy. If so, they would usually require a reason (business use-case) to be included.
- Whether a complete list of “sibling” Entities (Entities at the same level in the Parent-Child hierarchy) could be created. If the required list of sibling Entities was very large, or the relationships that all the sibling Entities would have were identical, the Entity would usually be considered too granular.
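The first rule, and the "very large list" half of the second, are mechanical enough to check automatically. The sketch below is a hedged Python illustration, not the process we followed: the sibling threshold of 20 is an assumed number (none was fixed in the project), the example hierarchy is hypothetical, and the check for siblings with identical relationships is left out.

```python
MAX_DEPTH = 3      # levels below "Thing", from the first rule above
MAX_SIBLINGS = 20  # assumed threshold; the article gives no number

# Hypothetical child -> parent hierarchy used only for this example.
PARENT = {
    "Continuant": "Thing",
    "Material": "Continuant",
    "Car": "Material",
    "VW Golf": "Car",
}


def depth(entity: str, parent: dict[str, str]) -> int:
    """Number of levels between the entity and the root, "Thing"."""
    levels = 0
    while entity in parent:
        entity = parent[entity]
        levels += 1
    return levels


def granularity_flags(
    entity: str, parent: dict[str, str], justified: set[str]
) -> list[str]:
    """Return the reasons, if any, why an entity may be too granular."""
    flags = []
    if depth(entity, parent) > MAX_DEPTH and entity not in justified:
        flags.append("more than three levels below Thing, needs a business use case")
    # Entities sharing this entity's parent (the entity itself included).
    siblings = [e for e, p in parent.items() if p == parent.get(entity)]
    if len(siblings) > MAX_SIBLINGS:
        flags.append("very large list of sibling Entities")
    return flags


print(granularity_flags("VW Golf", PARENT, justified=set()))
print(granularity_flags("VW Golf", PARENT, justified={"VW Golf"}))
```

Passing an Entity in `justified` records that a business use case exists, so the depth rule no longer flags it.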
How do you visualise an ontology?
With difficulty. There are not many good visualisation tools available for ontologies, so much so that a separate development team was spun up to investigate and improve some of the existing tools. In the end we landed on using a modified version of WebProtégé (a free, open-source ontology creation tool) with improved filtering and visualisation features.
When ontologies get large they become more difficult to visualise cleanly. This became evident when we got WebProtégé to visualise the ontology:
Being able to add tags to Entities (and Relationships) and have the visualisation tool filter on those tags allowed us to create much more meaningful diagrams:
Subsection of the ontology related to the “Maintain” business area of Highways England:
Once you’ve got an ontology, how do you use it?
This really depends on what you use it for. Business users could query it in a web browser with a search bar to find terms, their definitions, synonyms, business areas, and how they are related to other Entities. Managers might like to see a graph database style diagram of their business area (like the “Maintain business area” diagram above) with links to other areas of the organisation. Data modellers would use the ontology to structure the storage and naming of data – this is the use case I will focus on now.
Data structuring: Basic and brittle
The ontology can inform the naming of tables and columns in databases, as well as their potential structure (what information goes in which database in which table). This would be done by creating a “Logical Data Model” (LDM) based on the ontology for a specific dataset. The LDM includes names of tables and columns (which are taken from the ontology where they can) but does not include datatypes. The LDM is supposed to show the structure and layout of the information to be stored without fixing the implementation too strongly. From the LDM a “Physical Data Model” (PDM) is created which includes the datatypes and any “normalisation” that is required for the specific database structure that will be used (for example, some keywords might be restricted in your particular database – e.g. SELECT or NONE – such words, which may appear in the ontology and get passed down to the LDM, get filtered out in the PDM). At this point it is good to mention that each Entity and Relationship in the ontology should have a unique identifier associated with it – a URI (Uniform Resource Identifier) or IRI (Internationalized Resource Identifier) – and that these identifiers should be linked to each column in a table in the LDM and PDM. This allows easier tracing back and identification of what each column is storing data about.
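As a rough illustration of the LDM-to-PDM step, the Python sketch below carries an identifier from the ontology through to each physical column. Everything in it is an assumption made for the example: the `example.org` IRIs are placeholders, the reserved-word set is a tiny subset, and appending `_value` is just one possible way of handling a restricted keyword.

```python
from dataclasses import dataclass

# Assumed subset of reserved words; a real list depends on the database.
RESERVED = {"SELECT", "NONE"}


@dataclass(frozen=True)
class LogicalColumn:
    name: str  # taken from the ontology where possible
    iri: str   # identifier of the Entity or Relationship the column describes


@dataclass(frozen=True)
class PhysicalColumn:
    name: str
    iri: str
    datatype: str


def to_physical(col: LogicalColumn, datatype: str) -> PhysicalColumn:
    """Derive a PDM column: add a datatype and avoid reserved keywords."""
    name = col.name.lower().replace(" ", "_")
    if name.upper() in RESERVED:
        name = f"{name}_value"  # assumed renaming convention
    return PhysicalColumn(name, col.iri, datatype)


# Placeholder IRIs; real ones would come from the ontology tool.
ldm = [
    LogicalColumn("Person", "https://example.org/ontology/Person"),
    LogicalColumn("None", "https://example.org/ontology/None"),
]
pdm = [to_physical(col, "VARCHAR(100)") for col in ldm]

for col in pdm:
    print(col.name, col.datatype, col.iri)
```

Even when the physical column name has to differ from the ontology term, the IRI still identifies what the column is storing data about.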
The reason I call this structure “Basic and brittle” is because it is the simplest way to use the ontology to “inform” your data models, but it fixes your data models to a specific instance of your ontology. Databases have to be changed to use your new naming and structure. If your ontology changes at any point in the future (e.g. you realise something should be named or structured differently) you would need to rename the tables and columns of any underlying databases, and any code that accesses them. This is a lot of work and not at all ideal. In theory your ontology shouldn’t change too much but in practice if you are creating the ontology at the same time as it is being used to inform data models, updates to the ontology will happen. The ontology can also be expected to change as the organisation itself evolves, develops, and grows.
Data structuring: Advanced and adaptable
A much more appealing solution is to use a structure like a “data vault” where you can associate tables and columns within those tables with “hubs” (i.e. Entities) connected together by “links” (i.e. Relationships). Instead of renaming tables and columns left, right, and centre, you simply associate parts of the ontology with columns in tables that already exist. When the name of an Entity changes (e.g. you decide to rename Person to Individual) you do not touch the underlying database table names (which, for all you care, could be named “Data about Steve”) but just rename your “hub”. Databases do not need to be changed to use new terminology or have a new structure to take advantage of your ontology, as you now simply apply the ontology as a kind of “conformance layer” on top of your underlying data, which can be stored in as messy a set of databases as you wish (although a well named and structured database will make life easier, and can definitely affect access performance).
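The "conformance layer" idea can be sketched as a mapping from hubs to the physical columns they describe. This is a deliberately simplified Python illustration and not a Data Vault implementation: the `Hub` class, the `find_columns` helper, the placeholder IRI, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hub:
    iri: str    # stable identifier from the ontology
    label: str  # human-readable Entity name, free to change
    columns: list[tuple[str, str]] = field(default_factory=list)  # (table, column)


hubs: dict[str, Hub] = {}


def register(hub: Hub) -> None:
    """Key each hub by its IRI so that renaming never breaks a lookup."""
    hubs[hub.iri] = hub


def find_columns(label: str) -> list[tuple[str, str]]:
    """Find the physical columns behind an Entity name."""
    for hub in hubs.values():
        if hub.label == label:
            return hub.columns
    return []


person = Hub(
    iri="https://example.org/ontology/Person",  # placeholder IRI
    label="Person",
    columns=[("Data about Steve", "full_name"), ("hr_staff", "emp_nm")],
)
register(person)

# Renaming the Entity touches only the hub label, never the tables.
person.label = "Individual"
print(find_columns("Individual"))
```

The underlying tables keep whatever names they already had; only the label in the layer above them changes.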
A quick note on ontology tools
WebProtégé: The tool we ended up using most because of the inbuilt version control that allowed the team to work and collaborate on the same model simultaneously. Visualisations were not sufficient, but were some of the best we found. A BJSS development team worked to improve the features of WebProtégé for two weeks and successfully built improvements that allowed us to create more meaningful diagrams.
Desktop Protégé: While in theory related to WebProtégé, Desktop Protégé is a completely separate offering with entirely different features. Some aspects of the visualisation are better with Desktop Protégé so we used it for some specific tasks but generally we stopped using it once we found WebProtégé.
Draw.io: A free online (and desktop) tool to create diagrams. We used draw.io extensively when understanding different business areas for quick notes and prototyping.
Graph plotting tools: There are numerous graph database tools which could be used to represent ontologies; however, there is a technical barrier to entry with most of these.
In this article I’ve introduced the concept of an ontology and what it’s useful for. We looked at who users of an ontology would be, how you could go about creating an ontology, and finally delved into a more data specific use case.
The “Data as a Service” (DaaS) platform Highways England is creating will centralise data, allowing for better insights and reduced duplication – part of the future vision and strategy of the organisation. The ontology discussed here is being used to inform the data structure within the DaaS platform, helping to democratise data by using a single language that is shared by all.
In future posts I hope to explain some intricacies of the ontology created and the way we solved certain problems, including:
- Relationships – how to structure them
- Testing – creating an automated testing framework for ontologies in Python
- How to use Annotations to create useful structures and aid filtering.