Why Integrity is a Directed Knowledge Graph
Hello everyone! Let me begin today’s blog post with a note of thanks: The response to Integrity has been incredibly encouraging in the few weeks since we announced it to the scholarly and academic world. Many of the questions posed to us will be addressed in future blog posts, if not sooner.
I’d like to update you on the progress of Integrity, a directed knowledge graph of scholarly and scientific publishing information that uses reputable sources of open access metadata.
Integrity is working to link up disparate data sources; record the provenance of that data; and find the relationships contained within – for instance, among Institutes, Funders, Contributors and Articles.
In particular, I’ll focus on the “directed knowledge graph” part for this post.
A “directed” graph means that a relationship between two data points flows in a specific direction. This directionality gives the graph structure and can help when measuring centrality, because the source and target nodes of every relationship are defined. In other words, relationships between data can be symmetric or asymmetric, and an asymmetric (directed) relationship should be encoded as clearly as possible, so that the structure of the graph expresses the meaning.
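To make the idea of direction concrete, here is a minimal sketch in plain Python. It is not Integrity's actual data model: the node names and relationship types (FUNDED, AUTHORED, AFFILIATED_WITH) are hypothetical and chosen only for illustration. Each relationship is stored as a (source, relationship, target) triple, and counting incoming and outgoing relationships gives the simplest form of centrality, degree centrality.

```python
from collections import Counter

# Hypothetical directed relationships: (source, relationship, target).
# The names and relationship types are illustrative assumptions only.
EDGES = [
    ("Funder A", "FUNDED", "Article 1"),
    ("Contributor X", "AUTHORED", "Article 1"),
    ("Contributor X", "AUTHORED", "Article 2"),
    ("Contributor X", "AFFILIATED_WITH", "Institute I"),
    ("Contributor Y", "AUTHORED", "Article 1"),
]


def in_degree(edges):
    """Count the relationships that point at each node (node is the target)."""
    return Counter(target for _source, _rel, target in edges)


def out_degree(edges):
    """Count the relationships that start at each node (node is the source)."""
    return Counter(source for source, _rel, _target in edges)


in_deg = in_degree(EDGES)
out_deg = out_degree(EDGES)

# Because direction is recorded, "who points at Article 1" and
# "what does Contributor X point at" are different questions.
for node in sorted(set(in_deg) | set(out_deg)):
    print(f"{node}: in={in_deg[node]}, out={out_deg[node]}")
```

In an undirected graph the two counts would collapse into one number, and the difference between an Article that many Contributors point to and a Contributor who points to many Articles would be lost.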
As Amy Holder of Neo4j writes:
"At a high level, knowledge graphs are interlinked sets of facts that describe real-world entities, facts or things and their interrelations in a human understandable form. Unlike a simple knowledge base with flat structures and static content, a knowledge graph acquires and integrates adjacent information using data relationships to derive new knowledge."
Amy also notes in her post (https://neo4j.com/blog/ai-graph-technology-knowledge-graphs/) that properly interlinked data provides a sound basis for running AI over the graph, which is what we intend to do.
(Full disclosure: Integrity is part of Neo4j’s amazing and extremely generous Startup Program, which gives our team access to their Enterprise-class software.)
At the moment we’re melding DOAJ, ORCID and Crossref records. Some of the DOAJ and ORCID data are without DOIs, so we are using Crossref, custom Python code and Neo4j Cypher to find credible and distinct matches that fill in the missing data.
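As a rough illustration of what “credible and distinct” matching can look like, here is a small Python sketch. It is not our production code: the record fields (title, year, doi), the matching key (normalized title plus publication year) and the sample DOIs are all assumptions made for the example. The idea is to accept a DOI only when exactly one Crossref candidate matches, and to leave the field empty when the match is missing or ambiguous.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def build_index(crossref_records):
    """Map (normalized title, year) to the set of DOIs seen for that key."""
    index = defaultdict(set)
    for rec in crossref_records:
        index[(normalize_title(rec["title"]), rec["year"])].add(rec["doi"])
    return index


def fill_missing_dois(records, index):
    """Return copies of records; fill a DOI only on a single, unambiguous match."""
    filled = []
    for rec in records:
        if rec.get("doi"):
            filled.append(dict(rec))  # already has a DOI, keep it as-is
            continue
        key = (normalize_title(rec["title"]), rec["year"])
        candidates = index.get(key, set())
        doi = next(iter(candidates)) if len(candidates) == 1 else None
        filled.append({**rec, "doi": doi})
    return filled


# Hypothetical sample data; the DOIs are placeholders, not real identifiers.
crossref = [
    {"title": "Open Metadata for Scholarly Graphs", "year": 2019, "doi": "10.5555/example.1"},
    {"title": "Editorial", "year": 2019, "doi": "10.5555/example.2"},
    {"title": "Editorial", "year": 2019, "doi": "10.5555/example.3"},
]
doaj = [
    {"title": "Open metadata for scholarly graphs.", "year": 2019, "doi": None},
    {"title": "Editorial", "year": 2019, "doi": None},  # ambiguous: two candidates
    {"title": "An Unindexed Article", "year": 2018, "doi": None},  # no candidate
]

matched = fill_missing_dois(doaj, build_index(crossref))
for rec in matched:
    print(rec["title"], "->", rec["doi"])
```

Refusing to guess on ambiguous titles such as “Editorial” trades some coverage for accuracy, which matters when the matched DOI then becomes a node that other relationships hang from.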
The next step is to run the AI over the keywords and the contributors to disambiguate – systematically and at scale. You can expect a blog post on those efforts too.