sotools.common package

Methods for common operations when reading and interpreting schema.org markup.

getDatasetIdentifiers(g) – Return a list of SO:Dataset.identifier entries from the provided Graph
getDatasetMetadataLinks(g) – Extract links to metadata documents describing SO:Dataset
getDatasetMetadataLinksFromAbout(g) – Extract a list of metadata links via SO:about(SO:Dataset)
getDatasetMetadataLinksFromEncoding(g) – Extract link to metadata from SO:Dataset.encoding
getDatasetMetadataLinksFromSubjectOf(g) – Extract list of metadata links from SO:Dataset.subjectOf
getLiteralDatasetIdentifiers(g) – Retrieve literal SO:Dataset.identifier entries
getStructuredDatasetIdentifiers(g) – Extract structured SO:Dataset.identifier entries
getSubgraph(g, subject[, max_depth]) – Retrieve the subgraph of g rooted at subject
hasDataset(g) – Number of SO:Dataset graphs in g
inflateSubgraph(g, sg, ts[, depth, max_depth]) – Inflate the subgraph sg to contain all children of sg appearing in g
loadSOGraph([filename, data, publicID, …]) – Load RDF string or file to an RDFLib ConjunctiveGraph
loadSOGraphFromHtml(html, url) – Extract json-ld entries from provided HTML text
loadSOGraphFromUrl(url) – Load a graph from json-ld contained in a landing page
renderGraph(g) – Render an rdflib graph in Jupyter notebooks
validateSHACL(shape_graph, data_graph) – Validate data against a SHACL shape using common options

Method Descriptions

sotools.common.getDatasetIdentifiers(g)[source]

Return a list of SO:Dataset.identifier entries from the provided Graph

Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {value:, url:, propertyId:}
Return type:list

Example:

# Load graph and show SO:Dataset.identifier entries
import sotools
import json
from pprint import pprint

json_source = "examples/data/id_structured_01.json"
g = sotools.loadSOGraph(filename=json_source)
identifiers = sotools.getDatasetIdentifiers(g)
print("The json-ld source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe identifier(s) used in the dataset:")
pprint(identifiers, indent=2)
The json-ld source graph:
{
  "@context": {
    "@vocab": "http://schema.org",
    "datacite": "http://purl.org/spar/datacite/"
  },
  "@type": "Dataset",
  "identifier": {
    "@type": [
      "PropertyValue",
      "datacite:ResourceIdentifier"
    ],
    "datacite:usesIdentifierScheme": {
      "@id": "datacite:doi"
    },
    "propertyID": "DOI",
    "url": "https://doi.org/10.1575/1912/bco-dmo.665253",
    "value": "10.1575/1912/bco-dmo.665253"
  }
}

The identifier(s) used in the dataset:
[ { 'propertyId': 'DOI',
    'url': 'https://doi.org/10.1575/1912/bco-dmo.665253',
    'value': '10.1575/1912/bco-dmo.665253'}]

sotools.common.getDatasetMetadataLinks(g)[source]

Extract links to metadata documents describing SO:Dataset

Metadata documents can be referenced in several ways:

  • as SO:Dataset.subjectOf
  • the inverse of the first, SO:CreativeWork.about(SO:Dataset)
  • as SO:Dataset.encoding
Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

# Note: in this case two entries are returned because the single
# link is recognized with two different encodingFormats
json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source)
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@id": "ds-02",
  "url": "https://my.server.org/data/ds-02",
  "@type": "Dataset",
  "identifier": "dataset-02",
  "name": "Dataset subjectOf metadata",
  "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "subjectOf": {
    "@id": "ds-02/metadata.xml",
    "@type": "CreativeWork",
    "name": "Dublin Core Metadata Document Describing the Dataset",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
    "encodingFormat": [
      "application/rdf+xml",
      "http://ns.dataone.org/metadata/schema/onedcx/v1.0"
    ]
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
    'subjectOf': 'file:///home/docs/checkouts/readthedocs.org/user_builds/so-tools/checkouts/latest/docsource/source/examples/data/ds-02'},
  { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'application/rdf+xml',
    'subjectOf': 'file:///home/docs/checkouts/readthedocs.org/user_builds/so-tools/checkouts/latest/docsource/source/examples/data/ds-02'}]
sotools.common.getDatasetMetadataLinksFromAbout(g)[source]

Extract a list of metadata links via SO:about(SO:Dataset)

Parameters:g (Graph) – Graph containing an SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_about.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@graph": [
    {
      "@type": "Dataset",
      "@id": "./",
      "identifier": "dataset-01",
      "name": "Dataset with metadata about",
      "description": "Dataset snippet with metadata and data components indicated by hasPart and the descriptive metadata through an about association.",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "hasPart": [
        {
          "@id": "./metadata.xml"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./metadata.xml",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/metadata.xml",
      "dateModified": "2019-10-10T12:43:11+00:00.000",
      "description": "A metadata document describing the Dataset and the data component",
      "encodingFormat": "http://www.isotc211.org/2005/gmd",
      "about": [
        {
          "@id": "./"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./data_part_a.csv",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/data_part_a.csv"
    }
  ]
}

The links to external metadata:
[ { 'contentUrl': 'https://example.org/my/data/1/metadata.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'A metadata document describing the Dataset and the data '
                   'component',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/'}]
sotools.common.getDatasetMetadataLinksFromEncoding(g)[source]

Extract link to metadata from SO:Dataset.encoding

Parameters:g (ConjunctiveGraph) – Graph containing SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_encoding.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@id": "ds_m_encoding",
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@type": "Dataset",
  "name": "Dataset with metadata encoding",
  "description": "Dataset snippet using SO:Encoding pattern for associated XML metadata.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "identifier": "dataset-00",
  "encoding": {
    "@id": "ds_m_encoding#media-object",
    "@type": "MediaObject",
    "contentUrl": "https://my.server.net/datasets/00.xml",
    "dateModified": "2019-10-10T12:43:11+00:00.000",
    "description": "ISO TC211 XML rendering of metadata",
    "encodingFormat": "http://www.isotc211.org/2005/gmd"
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.net/datasets/00.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'ISO TC211 XML rendering of metadata',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/ds_m_encoding'}]
sotools.common.getDatasetMetadataLinksFromSubjectOf(g)[source]

Extract list of metadata links from SO:Dataset.subjectOf

Parameters:g (Graph) – Graph containing the SO:Dataset
Returns:A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:}
Return type:list

Example:

# Get links to metadata documents referenced from an SO:Dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)
The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@id": "ds-02",
  "url": "https://my.server.org/data/ds-02",
  "@type": "Dataset",
  "identifier": "dataset-02",
  "name": "Dataset subjectOf metadata",
  "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "subjectOf": {
    "@id": "ds-02/metadata.xml",
    "@type": "CreativeWork",
    "name": "Dublin Core Metadata Document Describing the Dataset",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
    "encodingFormat": [
      "application/rdf+xml",
      "http://ns.dataone.org/metadata/schema/onedcx/v1.0"
    ]
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'application/rdf+xml',
    'subjectOf': 'https://my.server.net/data/ds-02'},
  { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
    'subjectOf': 'https://my.server.net/data/ds-02'}]
sotools.common.getLiteralDatasetIdentifiers(g)[source]

Retrieve literal SO:Dataset.identifier entries

Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {value:, url:, propertyId:} with url=None and propertyId="Literal"
Return type:list
sotools.common.getStructuredDatasetIdentifiers(g)[source]

Extract structured SO:Dataset.identifier entries

Parameters:g (Graph) – Graph containing SO:Dataset
Returns:A list of {value:, url:, propertyId:}
Return type:list
sotools.common.getSubgraph(g, subject, max_depth=100)[source]

Retrieve the subgraph of g rooted at subject.

Given the graph g, extract the subgraph whose root node is the provided subject, recursing up to max_depth.

Parameters:
  • g (Graph) – Source graph
  • subject (URIRef) – Subject of the root of the subgraph to retrieve
  • max_depth (integer) – Maximum recursion depth
Returns:

(Graph) The subgraph of g with subject.

Example:

import rdflib
import rdflib.compare
import sotools

expected_json = """{
    "@context": {
        "@vocab":"https://example.net/"
    },
    "@id":"./sub",
    "property_0": "literal_0",
    "property_1": ["literal_1-0", "literal_1-1"],
    "property_2": {
        "property_3":"Anonymous subgraph"
    }
}
"""

test_json = """{
    "@context": {
        "@vocab":"https://example.net/"
    },
    "@id":"./parent",
    "sub":""" + expected_json + """,
    "parent_property":"Should not appear in extracted"
}
"""

# Load the full graph, setting the base to "https://example.net/"
g_full = rdflib.Graph()
g_full.parse(data=test_json, format="json-ld", publicID="https://example.net/")
print("### Full:")
print(g_full.serialize(format="turtle").decode())

g_expected = rdflib.ConjunctiveGraph()
g_expected.parse(data=expected_json, format="json-ld", publicID="https://example.net/")
print("### Expected:")
print(g_expected.serialize(format="turtle").decode())

#Extract the subgraph that is the object of the subject "https://example.net/sub"
g_sub = sotools.getSubgraph(g_full, rdflib.URIRef("https://example.net/sub"))
print("### Extracted:")
print(g_sub.serialize(format="turtle").decode())

#Direct comparison of the graphs, will fail if there are BNodes
print(f"Extracted subgraph is equal to the expected graph: {g_sub == g_expected}")

# Use isomorphic comparison. This operation can be very expensive if either of
# the graphs are large and degenerate with lots of BNodes.
print((f"Extracted subgraph is isomorphic with the expected: "
      f"{rdflib.compare.isomorphic(g_sub, g_expected)}"))
### Full:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:parent :parent_property "Should not appear in extracted" ;
    :sub :sub .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .


### Expected:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .


### Extracted:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .


Extracted subgraph is equal to the expected graph: False
Extracted subgraph is isomorphic with the expected: True
sotools.common.hasDataset(g)[source]

Number of SO:Dataset graphs in g

Parameters:g (Graph) – The graph to evaluate
Returns:Number of SO:Dataset graphs in g
Return type:integer

Example:

# Load a graph and evaluate if it contains a SO:Dataset
import sotools

g = sotools.loadSOGraph(
    filename="examples/data/ds_bad_namespace.json",
    publicID="https://my.data.net/data/"
)
sotools.hasDataset(g)
3
sotools.common.inflateSubgraph(g, sg, ts, depth=0, max_depth=100)[source]

Inflate the subgraph sg to contain all children of sg appearing in g.

Parameters:
  • g (Graph) – The master graph from which the subgraph is extracted
  • sg (Graph) – The subgraph, modified in place
  • ts (iterable of triples) – list of triples, the objects of which identify subjects to copy from g
  • depth (integer) – tracks depth of recursion
  • max_depth (integer) – maximum recursion depth for retrieving terms
Returns:

None

sotools.common.loadSOGraph(filename=None, data=None, publicID=None, normalize=True, deslop=True, format='json-ld')[source]

Load RDF string or file to an RDFLib ConjunctiveGraph

Creates a ConjunctiveGraph from the provided file or text. If both are provided, the data string is used.

NOTE: Namespace use of <http://schema.org>, <https://schema.org>, or <http://schema.org/> is normalized to <https://schema.org/> if normalize is True.

NOTE: Case of SO: properties in SO_TERMS is adjusted for consistency if deslop is True

Parameters:
  • filename (string) – path to RDF file on disk
  • data (string) – RDF text
  • publicID (string) – (from rdflib) The logical URI to use as the document base. If None is specified, the document location is used.
  • normalize (boolean) – Normalize the use of schema.org namespace
  • deslop (boolean) – Adjust schema.org terms for case consistency
  • format (string) – The serialization format of the RDF to load
Returns:

The loaded graph

Return type:

ConjunctiveGraph

Example:

# Load a Dataset from json-ld, normalize schema.org namespace, and dump as ttl.
import sotools
import json
json_source = "examples/data/ds_bad_namespace.json"
g = sotools.loadSOGraph(filename=json_source,
                        publicID="https://my.data.net/data/",
                        normalize=True,
                        deslop=True)

print("Loaded JSON:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nNormalized schema.org namespace and serialized to ttl:\n")
print(g.serialize(format="ttl").decode())
Loaded JSON:
[
  {
    "@context": {
      "@vocab": "https://schema.org"
    },
    "@id": "demo_0",
    "@type": "Dataset",
    "name": "https, no trailing slash"
  },
  {
    "@context": {
      "@vocab": "http://schema.org"
    },
    "@id": "demo_1",
    "@type": "Dataset",
    "name": "http, no trailing slash"
  },
  {
    "@context": {
      "@vocab": "http://schema.org/"
    },
    "@id": "demo_2",
    "@type": "Dataset",
    "name": "http only"
  }
]

Normalized schema.org namespace and serialized to ttl:

@prefix SO: <https://schema.org/> .
@prefix ns1: <https://my.data.net/data/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:demo_0 a SO:Dataset ;
    SO:name "https, no trailing slash" .

ns1:demo_1 a SO:Dataset ;
    SO:name "http, no trailing slash" .

ns1:demo_2 a SO:Dataset ;
    SO:name "http only" .


sotools.common.loadSOGraphFromHtml(html, url)[source]

Extract jsonld entries from provided HTML text

Parameters:
  • html (string) – HTML text to be parsed
  • url (string) – URL of the source page, used as the document base for the loaded graph
Returns:Graph loaded from html
Return type:ConjunctiveGraph
sotools.common.loadSOGraphFromUrl(url)[source]

Loads graph from json-ld contained in a landing page.

Parameters:url (string) – Url to process
Returns:Graph of instance
Return type:ConjunctiveGraph

Example:

# Load graph from a URL and print the SO:Dataset.identifier values found
import sotools
from pprint import pprint

url = "https://www.bco-dmo.org/dataset/679374"
g = sotools.loadSOGraphFromUrl(url)
pprint(sotools.getDatasetIdentifiers(g), indent=2)
[ { 'propertyId': 'Literal',
    'url': None,
    'value': 'http://lod.bco-dmo.org/id/dataset/679374'}]
sotools.common.renderGraph(g)[source]

For rendering an rdflib graph in Jupyter notebooks

Parameters:g (Graph) – The graph to render
Returns:Output for rendering directly in the notebook
Return type:Jupyter cell

Example:

# Load a graph and render the output (for jupyter notebooks)
import sotools
g = sotools.loadSOGraph(filename="examples/data/ds_m_subjectof.json")
sotools.renderGraph(g)
[rendered graph image: sotools.common_9_0.svg]
sotools.common.validateSHACL(shape_graph, data_graph)[source]

Validate data against a SHACL shape using common options.

Parameters:
  • shape_graph (ConjunctiveGraph) – A SHACL shape graph
  • data_graph (ConjunctiveGraph) – Data graph to be validated with shape_graph

Returns (tuple): Conformance (boolean), result graph (Graph), and result text (string)

Example:

import sotools
import rdflib

data_source = "examples/data/ds_bad_namespace.json"
data_graph = rdflib.ConjunctiveGraph()
data_graph.parse(data_source, format="json-ld", publicID="https://example.net/data/")
shape_source = "examples/shapes/test_namespace.ttl"
shape_graph = rdflib.ConjunctiveGraph()
shape_graph.parse(shape_source, format="turtle")
conforms, result_graph, result_text = sotools.validateSHACL(shape_graph, data_graph)
print(f"Data shape conforms: {conforms}")
print(f"Results text: \n{result_text}")
print("Results graph:")
sotools.renderGraph(result_graph)
Data shape conforms: False
Results text: 
Validation Report
Conforms: False
Results (3):
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
	Severity: sh:Violation
	Source Shape: d1:DatasetBad3Shape
	Focus Node: <https://example.net/data/demo_1>
	Value Node: <https://example.net/data/demo_1>
	Message: Expecting SO namespace of <https://schema.org/> not <http://schema.org>
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
	Severity: sh:Violation
	Source Shape: d1:DatasetBad1Shape
	Focus Node: <https://example.net/data/demo_0>
	Value Node: <https://example.net/data/demo_0>
	Message: Expecting SO namespace of <https://schema.org/> not <https://schema.org>
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
	Severity: sh:Violation
	Source Shape: d1:DatasetBad2Shape
	Focus Node: <https://example.net/data/demo_2>
	Value Node: <https://example.net/data/demo_2>
	Message: Expecting SO namespace of <https://schema.org/> not <http://schema.org/>

Results graph:
[rendered graph image: sotools.common_10_1.svg]

Running code on this page

All examples on this page can be run live in Binder. To do so:

  1. Click on the “Activate Binder” button
  2. Wait for Binder to become active. This can take a while; you can watch progress in your browser’s JavaScript console. When a line like Kernel: connected (89dfd3c8... appears, Binder should be ready to go.
  3. Run the following before any other script on the page. This sets the right path context for loading examples etc.
import os
try:
    os.chdir("docsource/source")
except OSError:
    # Already running from the correct directory, or the path is absent
    pass
print("Page is ready. You can now run other code blocks on this page.")