sotools.common package

Methods for common operations when reading and interpreting schema.org markup.

getDatasetIdentifiers(g) – Return a list of SO:Dataset.identifier entries from the provided Graph
getDatasetMetadataLinks(g) – Extract links to metadata documents describing SO:Dataset
getDatasetMetadataLinksFromAbout(g) – Extract a list of metadata links via SO:about(SO:Dataset)
getDatasetMetadataLinksFromEncoding(g) – Extract link to metadata from SO:Dataset.encoding
getDatasetMetadataLinksFromSubjectOf(g) – Extract list of metadata links from SO:Dataset.subjectOf
getLiteralDatasetIdentifiers(g) – Retrieve literal SO:Dataset.identifier entries
getStructuredDatasetIdentifiers(g) – Extract structured SO:Dataset.identifier entries
getSubgraph(g, subject[, max_depth]) – Retrieve the subgraph of g rooted at subject
hasDataset(g) – Number of SO:Dataset graphs in g
inflateSubgraph(g, sg, ts[, depth, max_depth]) – Inflate the subgraph sg to contain all children of sg appearing in g
loadSOGraph([filename, data, publicID, …]) – Load RDF string or file to an RDFLib ConjunctiveGraph
loadSOGraphFromHtml(html, url) – Extract json-ld entries from provided HTML text
loadSOGraphFromUrl(url) – Load a graph from json-ld contained in a landing page
renderGraph(g) – Render an rdflib graph in Jupyter notebooks
validateSHACL(shape_graph, data_graph) – Validate data against a SHACL shape using common options

Method Descriptions

sotools.common.getDatasetIdentifiers(g)[source]

Return a list of SO:Dataset.identifier entries from the provided Graph

Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {value:, url:, propertyId:}
Return type:list

Example:

# Load graph and show SO:Dataset.identifier entries
import sotools
import json
from pprint import pprint

json_source = "examples/data/id_structured_01.json"
g = sotools.loadSOGraph(filename=json_source)
identifiers = sotools.getDatasetIdentifiers(g)
print("The json-ld source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe identifier(s) used in the dataset:")
pprint(identifiers, indent=2)
The json-ld source graph:
{
  "@context": {
    "@vocab": "http://schema.org",
    "datacite": "http://purl.org/spar/datacite/"
  },
  "@type": "Dataset",
  "identifier": {
    "@type": [
      "PropertyValue",
      "datacite:ResourceIdentifier"
    ],
    "datacite:usesIdentifierScheme": {
      "@id": "datacite:doi"
    },
    "propertyID": "DOI",
    "url": "https://doi.org/10.1575/1912/bco-dmo.665253",
    "value": "10.1575/1912/bco-dmo.665253"
  }
}

The identifier(s) used in the dataset:
[ { 'propertyId': 'DOI',
    'url': 'https://doi.org/10.1575/1912/bco-dmo.665253',
    'value': '10.1575/1912/bco-dmo.665253'}]

sotools.common.getDatasetMetadataLinks(g)[source]

Extract links to metadata documents describing SO:Dataset

Metadata documents can be referenced in several ways:

  • as SO:Dataset.subjectOf
  • the inverse of the first, SO:CreativeWork.about(SO:Dataset)
  • as SO:Dataset.encoding
Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

# Note: in this case two entries are returned because the single
# link is recognized with two different encodingFormats
json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source)
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@id": "ds-02",
  "url": "https://my.server.org/data/ds-02",
  "@type": "Dataset",
  "identifier": "dataset-02",
  "name": "Dataset subjectOf metadata",
  "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "subjectOf": {
    "@id": "ds-02/metadata.xml",
    "@type": "CreativeWork",
    "name": "Dublin Core Metadata Document Describing the Dataset",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
    "encodingFormat": [
      "application/rdf+xml",
      "http://ns.dataone.org/metadata/schema/onedcx/v1.0"
    ]
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
    'subjectOf': 'file:///home/docs/checkouts/readthedocs.org/user_builds/so-tools/checkouts/latest/docsource/source/examples/data/ds-02'},
  { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'application/rdf+xml',
    'subjectOf': 'file:///home/docs/checkouts/readthedocs.org/user_builds/so-tools/checkouts/latest/docsource/source/examples/data/ds-02'}]
sotools.common.getDatasetMetadataLinksFromAbout(g)[source]

Extract a list of metadata links via SO:about(SO:Dataset)

Parameters:g (Graph) – Graph containing an SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_about.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@graph": [
    {
      "@type": "Dataset",
      "@id": "./",
      "identifier": "dataset-01",
      "name": "Dataset with metadata about",
      "description": "Dataset snippet with metadata and data components indicated by hasPart and the descriptive metadata through an about association.",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "hasPart": [
        {
          "@id": "./metadata.xml"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./metadata.xml",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/metadata.xml",
      "dateModified": "2019-10-10T12:43:11+00:00.000",
      "description": "A metadata document describing the Dataset and the data component",
      "encodingFormat": "http://www.isotc211.org/2005/gmd",
      "about": [
        {
          "@id": "./"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./data_part_a.csv",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/data_part_a.csv"
    }
  ]
}

The links to external metadata:
[ { 'contentUrl': 'https://example.org/my/data/1/metadata.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'A metadata document describing the Dataset and the data '
                   'component',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/'}]
sotools.common.getDatasetMetadataLinksFromEncoding(g)[source]

Extract link to metadata from SO:Dataset.encoding

Parameters:g (ConjunctiveGraph) – Graph containing SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_encoding.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@id": "ds_m_encoding",
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@type": "Dataset",
  "name": "Dataset with metadata encoding",
  "description": "Dataset snippet using SO:Encoding pattern for associated XML metadata.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "identifier": "dataset-00",
  "encoding": {
    "@id": "ds_m_encoding#media-object",
    "@type": "MediaObject",
    "contentUrl": "https://my.server.net/datasets/00.xml",
    "dateModified": "2019-10-10T12:43:11+00:00.000",
    "description": "ISO TC211 XML rendering of metadata",
    "encodingFormat": "http://www.isotc211.org/2005/gmd"
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.net/datasets/00.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'ISO TC211 XML rendering of metadata',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/ds_m_encoding'}]
sotools.common.getDatasetMetadataLinksFromSubjectOf(g)[source]

Extract list of metadata links from SO:Dataset.subjectOf

Parameters:g (Graph) – Graph containing the SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@id": "ds-02",
  "url": "https://my.server.org/data/ds-02",
  "@type": "Dataset",
  "identifier": "dataset-02",
  "name": "Dataset subjectOf metadata",
  "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "subjectOf": {
    "@id": "ds-02/metadata.xml",
    "@type": "CreativeWork",
    "name": "Dublin Core Metadata Document Describing the Dataset",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
    "encodingFormat": [
      "application/rdf+xml",
      "http://ns.dataone.org/metadata/schema/onedcx/v1.0"
    ]
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'application/rdf+xml',
    'subjectOf': 'https://my.server.net/data/ds-02'},
  { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
    'subjectOf': 'https://my.server.net/data/ds-02'}]
sotools.common.getLiteralDatasetIdentifiers(g)[source]

Retrieve literal SO:Dataset.identifier entries

Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {value:, url:, propertyId:} with url=None and propertyId="Literal"
Return type:list
sotools.common.getStructuredDatasetIdentifiers(g)[source]

Extract structured SO:Dataset.identifier entries

Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {value:, url:, propertyId:}
Return type:list
sotools.common.getSubgraph(g, subject, max_depth=100)[source]

Retrieve the subgraph of g rooted at subject.

Given the graph g, extract the subgraph whose root node is the provided subject, recursing up to max_depth.

Parameters:
  • g (Graph) – Source graph
  • subject (URIRef) – Subject of the root of the subgraph to retrieve
  • max_depth (integer) – Maximum recursion depth
Returns:

(Graph) The subgraph of g with subject.

Example:

import rdflib
import rdflib.compare
import sotools

expected_json = """{
    "@context": {
        "@vocab":"https://example.net/"
    },
    "@id":"./sub",
    "property_0": "literal_0",
    "property_1": ["literal_1-0", "literal_1-1"],
    "property_2": {
        "property_3":"Anonymous subgraph"
    }
}
"""

test_json = """{
    "@context": {
        "@vocab":"https://example.net/"
    },
    "@id":"./parent",
    "sub":""" + expected_json + """,
    "parent_property":"Should not appear in extracted"
}
"""

# Load the full graph, setting the base to "https://example.net/"
g_full = rdflib.Graph()
g_full.parse(data=test_json, format="json-ld", publicID="https://example.net/")
print("### Full:")
print(g_full.serialize(format="turtle").decode())

g_expected = rdflib.ConjunctiveGraph()
g_expected.parse(data=expected_json, format="json-ld", publicID="https://example.net/")
print("### Expected:")
print(g_expected.serialize(format="turtle").decode())

#Extract the subgraph that is the object of the subject "https://example.net/sub"
g_sub = sotools.getSubgraph(g_full, rdflib.URIRef("https://example.net/sub"))
print("### Extracted:")
print(g_sub.serialize(format="turtle").decode())

#Direct comparison of the graphs, will fail if there are BNodes
print(f"Extracted subgraph is equal to the expected graph: {g_sub == g_expected}")

# Use isomorphic comparison. This operation can be very expensive if either of
# the graphs are large and degenerate with lots of BNodes.
print((f"Extracted subgraph is isomorphic with the expected: "
      f"{rdflib.compare.isomorphic(g_sub, g_expected)}"))
### Full:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:parent :parent_property "Should not appear in extracted" ;
    :sub :sub .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .


### Expected:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .


### Extracted:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .


Extracted subgraph is equal to the expected graph: False
Extracted subgraph is isomorphic with the expected: True
sotools.common.hasDataset(g)[source]

Number of SO:Dataset graphs in g

Parameters:g (Graph) – The graph to evaluate
Returns:Number of SO:Dataset graphs in g
Return type:integer

Example:

# Load a graph and evaluate if it contains a SO:Dataset
import sotools

g = sotools.loadSOGraph(
    filename="examples/data/ds_bad_namespace.json",
    publicID="https://my.data.net/data/"
)
sotools.hasDataset(g)
3
sotools.common.inflateSubgraph(g, sg, ts, depth=0, max_depth=100)[source]

Inflate the subgraph sg to contain all children of sg appearing in g.

Parameters:
  • g (Graph) – The master graph from which the subgraph is extracted
  • sg (Graph) – The subgraph, modified in place
  • ts (iterable of triples) – list of triples, the objects of which identify subjects to copy from g
  • depth (integer) – tracks depth of recursion
  • max_depth (integer) – maximum recursion depth for retrieving terms
Returns:

None

sotools.common.loadSOGraph(filename=None, data=None, publicID=None, normalize=True, deslop=True, format='json-ld')[source]

Load RDF string or file to an RDFLib ConjunctiveGraph

Creates a ConjunctiveGraph from the provided file or text. If both are provided, the data string is used.

NOTE: Namespace use of <http://schema.org>, <https://schema.org>, or <http://schema.org/> is normalized to <https://schema.org/> if normalize is True.

NOTE: Case of SO: properties in SO_TERMS is adjusted for consistency if deslop is True

Parameters:
  • filename (string) – path to RDF file on disk
  • data (string) – RDF text
  • publicID (string) – (from rdflib) The logical URI to use as the document base. If None is specified, the document location is used.
  • normalize (boolean) – Normalize the use of schema.org namespace
  • deslop (boolean) – Adjust schema.org terms for case consistency
  • format (string) – The serialization format of the RDF to load
Returns:

The loaded graph

Return type:

ConjunctiveGraph

Example:

# Load a Dataset from json-ld, normalize schema.org namespace, and dump as ttl.
import sotools
import json
json_source = "examples/data/ds_bad_namespace.json"
g = sotools.loadSOGraph(filename=json_source,
                        publicID="https://my.data.net/data/",
                        normalize=True,
                        deslop=True)

print("Loaded JSON:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nNormalized schema.org namespace and serialized to ttl:\n")
print(g.serialize(format="ttl").decode())
Loaded JSON:
[
  {
    "@context": {
      "@vocab": "https://schema.org"
    },
    "@id": "demo_0",
    "@type": "Dataset",
    "name": "https, no trailing slash"
  },
  {
    "@context": {
      "@vocab": "http://schema.org"
    },
    "@id": "demo_1",
    "@type": "Dataset",
    "name": "http, no trailing slash"
  },
  {
    "@context": {
      "@vocab": "http://schema.org/"
    },
    "@id": "demo_2",
    "@type": "Dataset",
    "name": "http only"
  }
]

Normalized schema.org namespace and serialized to ttl:

@prefix SO: <https://schema.org/> .
@prefix ns1: <https://my.data.net/data/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:demo_0 a SO:Dataset ;
    SO:name "https, no trailing slash" .

ns1:demo_1 a SO:Dataset ;
    SO:name "http, no trailing slash" .

ns1:demo_2 a SO:Dataset ;
    SO:name "http only" .


sotools.common.loadSOGraphFromHtml(html, url)[source]

Extract jsonld entries from provided HTML text

Parameters:
  • html (string) – HTML text to be parsed
  • url (string) – URL of the source page, used as the document base for the loaded graph
Returns:Graph loaded from html
Return type:ConjunctiveGraph
sotools.common.loadSOGraphFromUrl(url)[source]

Loads graph from json-ld contained in a landing page.

Parameters:url (string) – Url to process
Returns:Graph of instance
Return type:ConjunctiveGraph

Example:

# Load graph from a URL and print the SO:Dataset.identifier values found
import sotools
from pprint import pprint

url = "https://www.bco-dmo.org/dataset/679374"
g = sotools.loadSOGraphFromUrl(url)
pprint(sotools.getDatasetIdentifiers(g), indent=2)
[ { 'propertyId': 'Literal',
    'url': None,
    'value': 'http://lod.bco-dmo.org/id/dataset/679374'}]
sotools.common.renderGraph(g)[source]

For rendering an rdflib graph in Jupyter notebooks

Parameters:g (Graph) – The graph to render
Returns:Output for rendering directly in the notebook
Return type:Jupyter cell

Example:

# Load a graph and render the output (for jupyter notebooks)
import sotools
g = sotools.loadSOGraph(filename="examples/data/ds_m_subjectof.json")
sotools.renderGraph(g)
[rendered graph image: sotools.common_9_0.svg]
sotools.common.validateSHACL(shape_graph, data_graph)[source]

Validate data against a SHACL shape using common options.

Parameters:
  • shape_graph (ConjunctiveGraph) – A SHACL shape graph
  • data_graph (ConjunctiveGraph) – Data graph to be validated with shape_graph

Returns (tuple): Conformance (boolean), result graph (Graph), and result text (string)

Example:

import sotools
import rdflib

data_source = "examples/data/ds_bad_namespace.json"
data_graph = rdflib.ConjunctiveGraph()
data_graph.parse(data_source, format="json-ld", publicID="https://example.net/data/")
shape_source = "examples/shapes/test_namespace.ttl"
shape_graph = rdflib.ConjunctiveGraph()
shape_graph.parse(shape_source, format="turtle")
conforms, result_graph, result_text = sotools.validateSHACL(shape_graph, data_graph)
print(f"Data shape conforms: {conforms}")
print(f"Results text: \n{result_text}")
print("Results graph:")
sotools.renderGraph(result_graph)
Data shape conforms: False
Results text: 
Validation Report
Conforms: False
Results (3):
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
	Severity: sh:Violation
	Source Shape: d1:DatasetBad3Shape
	Focus Node: <https://example.net/data/demo_1>
	Value Node: <https://example.net/data/demo_1>
	Message: Expecting SO namespace of <https://schema.org/> not <http://schema.org>
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
	Severity: sh:Violation
	Source Shape: d1:DatasetBad1Shape
	Focus Node: <https://example.net/data/demo_0>
	Value Node: <https://example.net/data/demo_0>
	Message: Expecting SO namespace of <https://schema.org/> not <https://schema.org>
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
	Severity: sh:Violation
	Source Shape: d1:DatasetBad2Shape
	Focus Node: <https://example.net/data/demo_2>
	Value Node: <https://example.net/data/demo_2>
	Message: Expecting SO namespace of <https://schema.org/> not <http://schema.org/>

Results graph:
[rendered graph image: sotools.common_10_1.svg]

Running code on this page

All examples on this page can be run live in Binder. To do so:

  1. Click on the “Activate Binder” button
  2. Wait for Binder to become active. This can take a while; you can watch progress in your browser’s JavaScript console. When a line like Kernel: connected (89dfd3c8... appears, Binder should be ready to go.
  3. Run the following before any other script on the page. This sets the right path context for loading examples etc.
import os
try:
    os.chdir("docsource/source")
except OSError:
    # Already running from the correct directory, or the path is absent
    pass
print("Page is ready. You can now run other code blocks on this page.")