Using the Deep Narrative Analysis (DNA) Ontology for ESG
A previous blog post discussed the design of the DNA Ontology and its focus on events and situations (e.g., verbs). The ontology is designed to be general and flexible, so the OntoInsights team sought to evaluate its usability in a totally different domain: capturing and fusing Environmental, Social and Governance (ESG) data.
To this end, we created and hosted a sample knowledge graph as part of the Hanken Quantum Hackathon 2021. The graph fused information about 1900+ companies -- including their industries, profits, environmental impacts, and country of headquarters -- with data about each headquarters country.
This post explores our experiences in creating that knowledge graph.
The company data provided monetary amounts (in US dollars) for various types of environmental impacts. This data was extracted (using Python code) from a spreadsheet based on the Harvard study, Corporate Environmental Impact: Measurement, Data and Information. The country data, on the other hand, provided statistics on a country's electricity production, environmental issues, demographics and many other details. It was assembled by web scraping the pages of the CIA World Factbook.
Similar to the design of the DNA Ontology, the Hackathon data was organized based on the “6Ws” - Who, What, Where, When, Why and hoW:
- Happenings/occurrences/states and conditions (what and how)
  - Examples are the generation of pollutants or electrical energy, and the perpetuation of conditions such as food insecurity or lack of sanitation in a country
  - Note that the various types of pollutants define how the environment is impacted, how electricity is produced, or how the people of a country live (e.g., in what conditions)
  - To codify the details of these occurrences and conditions, data is reported as types of Measurements and Assessments
- Agents/actors/organizations (who) and Locations (where)
  - Specifically, the data examines the environmental impact of over 1900 organizations, across 72 countries (where the organizations are headquartered)
  - In addition, the industries of the organizations are captured (as classified by the Global Industry Classification Standard, GICS)
- Time/sequences (when)
  - The fused data references only annual information, so the KG reports time at the granularity of a year
  - Impact, economic and environmental data is reported for multiple years for both organizations and countries
  - Measurements indicate both a :reported_value and a :reported_year
- Goal/intent (why)
  - The impact of organizations on several of the UN Sustainable Development Goals is available
- Causation, precondition and prevention (why)
  - Although the ontology can support cause/effect, enablement/prevention and similar relationships between events and conditions, none of the source data made use of these associations
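The Measurement pattern in the list above (a value paired with the year it is reported for) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Hackathon's actual code: the `Measurement` class, the `to_triples` and `to_turtle` helpers, the `:measurement_of` predicate, and every sample name and number are hypothetical. Only `:reported_value` and `:reported_year` come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One yearly measurement about an organization or country (illustrative)."""
    identifier: str        # hypothetical local name for the measurement node
    kind: str              # hypothetical Measurement sub-class name
    subject: str           # the organization or country being measured
    reported_value: float
    reported_year: int

    def to_triples(self):
        """Return (subject, predicate, object) tuples with ':'-prefixed names."""
        node = f":{self.identifier}"
        return [
            (node, "a", f":{self.kind}"),
            (node, ":measurement_of", f":{self.subject}"),  # assumed predicate name
            (node, ":reported_value", repr(self.reported_value)),
            (node, ":reported_year", str(self.reported_year)),
        ]


def to_turtle(triples):
    """Write one Turtle-style statement per line (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Invented sample values, purely for demonstration.
example = Measurement(
    "ExampleCo_Impact_2019", "ExampleImpactMeasurement", "ExampleCo", -1250000.0, 2019
)
print(to_turtle(example.to_triples()))
```

Keeping the year as a plain literal on the measurement node means that no separate time instance has to be created for each data point.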
All the above concepts were condensed from the various modules in the DNA GitHub ontologies directory into a single file, background-ontology.ttl. This condensing was not strictly necessary, but it simplified the HTML documentation tree that was generated. (The file, hackathon-ontology-tree.html, is a searchable HTML tree view of all the concepts -- classes and properties -- in the Hackathon ontology modules. The file can be downloaded and opened in any browser to show the inheritance (i.e., generalization-specialization) hierarchy of the classes and properties.)
For the Hackathon, it was also necessary to extend the DNA Ontology to define the specific Measurements and Assessments that were captured from the data sources. The new Measurement and Assessment sub-classes are specified in the hackathon-esg-ontology.ttl file in GitHub (and are included in the HTML tree documentation discussed above). Beyond the new classes, it was also necessary to define a few new properties. The new properties are listed below along with the reasons for their definition:
- about_industry – Disambiguated the multiple "topics" of the AvgSalesByIndustry Measurement
  - This relationship identifies the particular industry for which average sales are reported, in the Country for which the Measurement is defined
- environmental_issues – Captured the unstructured text from the CIA World Factbook describing the general environmental conditions for a Country
- has_headquarters – Clarified the has_location property to indicate that the Location/Country is where an Organization is headquartered
- is_in_industry – Semantically identical to the has_line_of_business predicate from the backing ontology, but renamed to align with the vocabulary of GICS (the Global Industry Classification Standard)
- land_area_sq_kms – Defined an additional property for a Country indicating its total land area (from the CIA World Factbook)
  - Note that the backing ontology only defines a general area_sq_kms property
- localized – When true, indicates that a FoodInsecurityAssessment (of “High” or “Very High”) is localized rather than a general condition
  - This information was obtained from the CIA World Factbook
- reported_year – Simplified the backing ontology’s has_time property to avoid creating a time-related instance, just to indicate the year for which a Measurement is reported
  - Note that the more complex semantics of the backing ontology are needed for general time declarations (and indeed can be reconstructed from this simplification)
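The note on reported_year says that the richer time semantics can be reconstructed from the simplified property. A minimal sketch of one way to do that over plain (subject, predicate, object) tuples follows. The `expand_reported_year` function, the `:PointInTime` class, the `:year` property and the `:Year<N>` node naming are placeholders chosen for illustration and are not taken from the DNA ontology; only `:reported_year` and `:has_time` are named in the text.

```python
def expand_reported_year(triples):
    """Rewrite :reported_year shortcuts into explicit time instances.

    Each (s, ":reported_year", year) triple becomes (s, ":has_time", time_node),
    and every distinct time node is described exactly once.
    ":PointInTime" and ":year" are assumed names, not DNA ontology terms.
    """
    expanded = []
    described = set()
    for s, p, o in triples:
        if p != ":reported_year":
            expanded.append((s, p, o))  # pass other triples through unchanged
            continue
        time_node = f":Year{o}"
        expanded.append((s, ":has_time", time_node))
        if time_node not in described:
            described.add(time_node)
            expanded.append((time_node, "a", ":PointInTime"))
            expanded.append((time_node, ":year", str(o)))
    return expanded


# Invented sample triples, purely for demonstration.
sample = [
    (":m1", ":reported_value", "3.0"),
    (":m1", ":reported_year", 2019),
    (":m2", ":reported_year", 2019),
]
for triple in expand_reported_year(sample):
    print(triple)
```

Because measurements for the same year share a single time node, the expansion adds only a small number of extra statements to the graph.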
Lastly, reviewing all the files in the hackathon-extensions directory of the DNA ontology on GitHub, one finds a few more files beyond the background-ontology and hackathon-esg-ontology modules. These are:
- geonames_countries.ttl – This file was directly copied from the DNA ontologies directory and provides information on Countries, their currency and neighbors, and the Continents that contain them
- lob_industry_extensions.ttl – This module extended the LineOfBusiness classes defined in DNA’s agent.ttl
  - These extensions address the industries from GICS that are referenced in the Harvard study
To better understand this discussion, it helps to visualize the Hackathon concepts and data.
For companies, the knowledge graph contained data specifying:
- The industry of the company
- The country where the company is headquartered
- Total operating income and sales in a given year
- Total environmental impact, which is a composite score based on all contributing environmental and social issues that are related to that company's individual activities
- Individual impact measurements
For countries, the knowledge graph held a variety of economic data (GDP, unemployment rate, inflation rate, …), land use data (total land area and the amount of arable land, forest, …), information on waste generated and recycled, data on electrical production and consumption as well as the production and consumption of natural resources such as crude oil and natural gas, data on CO2, methane and particulate emissions, information on the percentages of the population in poverty and without potable water or sanitation facilities, and much more.
These concepts are shown in Figure 2.
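The fusion of company and country data described above can be illustrated with a small in-memory sketch that follows has_headquarters from a company to the facts recorded for its country. The `match`, `companies_by_country` and `country_facts_for` helpers are hypothetical, and all company names, country names and numbers are invented placeholders rather than Hackathon data. The property names `:has_headquarters`, `:is_in_industry` and `:land_area_sq_kms` are the ones listed earlier in this post.

```python
from collections import defaultdict

# Invented sample triples; none of these names or values come from the real KG.
TRIPLES = [
    (":ExampleCoA", ":has_headquarters", ":CountryX"),
    (":ExampleCoA", ":is_in_industry", ":IndustryOne"),
    (":ExampleCoB", ":has_headquarters", ":CountryX"),
    (":ExampleCoB", ":is_in_industry", ":IndustryTwo"),
    (":ExampleCoC", ":has_headquarters", ":CountryY"),
    (":CountryX", ":land_area_sq_kms", 1000.0),
    (":CountryY", ":land_area_sq_kms", 250.0),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def companies_by_country(triples):
    """Group companies by the country in which they are headquartered."""
    grouped = defaultdict(list)
    for company, _, country in match(triples, p=":has_headquarters"):
        grouped[country].append(company)
    return dict(grouped)


def country_facts_for(triples, company):
    """Follow has_headquarters from a company to its country's properties."""
    facts = {}
    for _, _, country in match(triples, s=company, p=":has_headquarters"):
        for _, prop, value in match(triples, s=country):
            facts[prop] = value
    return facts


print(companies_by_country(TRIPLES))
print(country_facts_for(TRIPLES, ":ExampleCoC"))
```

A real deployment would more likely run SPARQL queries against the hosted graph; the list-based matching here only shows the shape of the join between the two data sources.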