sotools.common package
Methods for common operations when reading and interpreting schema.org markup.
getDatasetIdentifiers(g) | Return a list of SO:Dataset.identifier entries from the provided Graph
getDatasetMetadataLinks(g) | Extract links to metadata documents describing SO:Dataset
getDatasetMetadataLinksFromAbout(g) | Extract a list of metadata links from SO:about(SO:Dataset)
getDatasetMetadataLinksFromEncoding(g) | Extract link to metadata from SO:Dataset.encoding
getDatasetMetadataLinksFromSubjectOf(g) | Extract list of metadata links from SO:Dataset.subjectOf
getLiteralDatasetIdentifiers(g) | Retrieve literal SO:Dataset.identifier entries
getStructuredDatasetIdentifiers(g) | Extract structured SO:Dataset.identifier entries
getSubgraph(g, subject[, max_depth]) | Retrieve the subgraph of g with subject.
hasDataset(g) | Number of SO:Dataset graphs in g
inflateSubgraph(g, sg, ts[, depth, max_depth]) | Inflate the subgraph sg to contain all children of sg appearing in g.
loadSOGraph([filename, data, publicID, …]) | Load RDF string or file to an RDFLib ConjunctiveGraph
loadSOGraphFromHtml(html, url) | Extract jsonld entries from provided HTML text
loadSOGraphFromUrl(url) | Loads graph from json-ld contained in a landing page.
renderGraph(g) | For rendering an rdflib graph in Jupyter notebooks
validateSHACL(shape_graph, data_graph) | Validate data against a SHACL shape using common options.
Method Descriptions
sotools.common.getDatasetIdentifiers(g)
Return a list of SO:Dataset.identifier entries from the provided Graph
Parameters: g (Graph) – Graph containing SO:Dataset
Returns: A list of {value:, url:, propertyId:}
Return type: list
Example:
# Load graph and show SO:Dataset.identifier entries
import sotools
import json
from pprint import pprint

json_source = "examples/data/id_structured_01.json"
g = sotools.loadSOGraph(filename=json_source)
identifiers = sotools.getDatasetIdentifiers(g)
print("The json-ld source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe identifier(s) used in the dataset:")
pprint(identifiers, indent=2)

The json-ld source graph:
{
  "@context": {
    "@vocab": "http://schema.org",
    "datacite": "http://purl.org/spar/datacite/"
  },
  "@type": "Dataset",
  "identifier": {
    "@type": [
      "PropertyValue",
      "datacite:ResourceIdentifier"
    ],
    "datacite:usesIdentifierScheme": {
      "@id": "datacite:doi"
    },
    "propertyID": "DOI",
    "url": "https://doi.org/10.1575/1912/bco-dmo.665253",
    "value": "10.1575/1912/bco-dmo.665253"
  }
}

The identifier(s) used in the dataset:
[ { 'propertyId': 'DOI',
    'url': 'https://doi.org/10.1575/1912/bco-dmo.665253',
    'value': '10.1575/1912/bco-dmo.665253'}]
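The returned list mixes identifier kinds, so callers often want only one scheme. The following is a minimal sketch of such a filter, written against plain dictionaries shaped like the output above; `select_identifiers` is a hypothetical helper and not part of sotools, and the second sample entry is copied from the loadSOGraphFromUrl example further down this page.

```python
# Sample entries shaped like the getDatasetIdentifiers output on this page.
identifiers = [
    {
        "propertyId": "DOI",
        "url": "https://doi.org/10.1575/1912/bco-dmo.665253",
        "value": "10.1575/1912/bco-dmo.665253",
    },
    {
        "propertyId": "Literal",
        "url": None,
        "value": "http://lod.bco-dmo.org/id/dataset/679374",
    },
]


def select_identifiers(entries, property_id):
    """Return the values whose propertyId matches, ignoring case.

    Hypothetical helper for illustration; not a sotools function.
    """
    wanted = property_id.lower()
    return [
        entry["value"]
        for entry in entries
        if (entry.get("propertyId") or "").lower() == wanted
    ]


dois = select_identifiers(identifiers, "doi")
print(dois)
```

Matching case-insensitively is a design choice made here because the source JSON-LD uses `propertyID` values that publishers may capitalize differently.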
sotools.common.getDatasetMetadataLinks(g)
Extract links to metadata documents describing SO:Dataset
Metadata documents can be referenced in different ways:
1. as SO:Dataset.subjectOf
2. the inverse of 1, SO:CreativeWork.about(SO:Dataset)
3. SO:Dataset.encoding
Parameters: g (Graph) – Graph containing SO:Dataset
Returns: A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:,}
Return type: list
Example:
# Get links to metadata documents referenced from a SO:dataset
import sotools
import json
from pprint import pprint

# Note: in this case two entries are returned because the single
# link is recognized with two different encodingFormats
json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source)
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)

The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@id": "ds-02",
  "url": "https://my.server.org/data/ds-02",
  "@type": "Dataset",
  "identifier": "dataset-02",
  "name": "Dataset subjectOf metadata",
  "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "subjectOf": {
    "@id": "ds-02/metadata.xml",
    "@type": "CreativeWork",
    "name": "Dublin Core Metadata Document Describing the Dataset",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
    "encodingFormat": [
      "application/rdf+xml",
      "http://ns.dataone.org/metadata/schema/onedcx/v1.0"
    ]
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
    'subjectOf': 'file:///home/docs/checkouts/readthedocs.org/user_builds/so-tools/checkouts/latest/docsource/source/examples/data/ds-02'},
  { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'application/rdf+xml',
    'subjectOf': 'file:///home/docs/checkouts/readthedocs.org/user_builds/so-tools/checkouts/latest/docsource/source/examples/data/ds-02'}]
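Because one metadata document can appear once per encodingFormat, as in the output above, it may be convenient to collapse the list by contentUrl. This is a small sketch under that assumption; `group_by_content_url` is a hypothetical helper and not part of sotools, and the sample entries are trimmed copies of the output above.

```python
from collections import defaultdict

# Trimmed copies of the two entries returned in the example above.
links = [
    {
        "contentUrl": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": "http://ns.dataone.org/metadata/schema/onedcx/v1.0",
    },
    {
        "contentUrl": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": "application/rdf+xml",
    },
]


def group_by_content_url(entries):
    """Map each contentUrl to the sorted list of its encodingFormat values.

    Hypothetical helper for illustration; not a sotools function.
    """
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["contentUrl"]].append(entry["encodingFormat"])
    return {url: sorted(formats) for url, formats in grouped.items()}


for url, formats in group_by_content_url(links).items():
    print(url, formats)
```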
sotools.common.getDatasetMetadataLinksFromAbout(g)
Extract a list of metadata links from SO:about(SO:Dataset)
Parameters: g (Graph) – Graph containing an SO:Dataset
Returns: A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:,}
Return type: list
Example:
# Get links to metadata documents referenced from a SO:dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_about.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)

The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@graph": [
    {
      "@type": "Dataset",
      "@id": "./",
      "identifier": "dataset-01",
      "name": "Dataset with metadata about",
      "description": "Dataset snippet with metadata and data components indicated by hasPart and the descriptive metadata through an about association.",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "hasPart": [
        {
          "@id": "./metadata.xml"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./metadata.xml",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/metadata.xml",
      "dateModified": "2019-10-10T12:43:11+00:00.000",
      "description": "A metadata document describing the Dataset and the data component",
      "encodingFormat": "http://www.isotc211.org/2005/gmd",
      "about": [
        {
          "@id": "./"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./data_part_a.csv",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/data_part_a.csv"
    }
  ]
}

The links to external metadata:
[ { 'contentUrl': 'https://example.org/my/data/1/metadata.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'A metadata document describing the Dataset and the data '
                   'component',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/'}]
sotools.common.getDatasetMetadataLinksFromEncoding(g)
Extract link to metadata from SO:Dataset.encoding
Parameters: g – ConjunctiveGraph
Returns: A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:,}
Return type: list
Example:
# Get links to metadata documents referenced from a SO:dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_encoding.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)

The source graph:
{
  "@id": "ds_m_encoding",
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@type": "Dataset",
  "name": "Dataset with metadata encoding",
  "description": "Dataset snippet using SO:Encoding pattern for associated XML metadata.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "identifier": "dataset-00",
  "encoding": {
    "@id": "ds_m_encoding#media-object",
    "@type": "MediaObject",
    "contentUrl": "https://my.server.net/datasets/00.xml",
    "dateModified": "2019-10-10T12:43:11+00:00.000",
    "description": "ISO TC211 XML rendering of metadata",
    "encodingFormat": "http://www.isotc211.org/2005/gmd"
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.net/datasets/00.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'ISO TC211 XML rendering of metadata',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/ds_m_encoding'}]
sotools.common.getDatasetMetadataLinksFromSubjectOf(g)
Extract list of metadata links from SO:Dataset.subjectOf
Parameters: g (Graph) – Graph containing the SO:Dataset
Returns: A list of {dateModified:, encodingFormat:, contentUrl:, description:, subjectOf:,}
Return type: list
Example:
# Get links to metadata documents referenced from a SO:dataset
import sotools
import json
from pprint import pprint

json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
print("The source graph:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nThe links to external metadata:")
pprint(links, indent=2)

The source graph:
{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@id": "ds-02",
  "url": "https://my.server.org/data/ds-02",
  "@type": "Dataset",
  "identifier": "dataset-02",
  "name": "Dataset subjectOf metadata",
  "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "subjectOf": {
    "@id": "ds-02/metadata.xml",
    "@type": "CreativeWork",
    "name": "Dublin Core Metadata Document Describing the Dataset",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
    "encodingFormat": [
      "application/rdf+xml",
      "http://ns.dataone.org/metadata/schema/onedcx/v1.0"
    ]
  }
}

The links to external metadata:
[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'application/rdf+xml',
    'subjectOf': 'https://my.server.net/data/ds-02'},
  { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
    'subjectOf': 'https://my.server.net/data/ds-02'}]
sotools.common.getLiteralDatasetIdentifiers(g)
Retrieve literal SO:Dataset.identifier entries
Parameters: g (Graph) – Graph containing SO:Dataset
Returns: A list of {value:, url:, propertyId:} with url=None and propertyId="Literal"
Return type: list
sotools.common.getStructuredDatasetIdentifiers(g)
Extract structured SO:Dataset.identifier entries
Parameters: g (Graph) – Graph containing SO:Dataset
Returns: A list of {value:, url:, propertyId:}
Return type: list
sotools.common.getSubgraph(g, subject, max_depth=100)
Retrieve the subgraph of g with subject.
Given the graph g, extract the subgraph rooted at subject: the triples that have subject as their subject, together with the triples reachable through their objects (see the example below).
Parameters:
- g (Graph) – Source graph
- subject (URIRef) – Subject of the root of the subgraph to retrieve
- max_depth (integer) – Maximum recursion depth
Returns: (Graph) The subgraph of g with subject.
Example:
import rdflib
import rdflib.compare
import sotools

expected_json = """{
  "@context": {
    "@vocab":"https://example.net/"
  },
  "@id":"./sub",
  "property_0": "literal_0",
  "property_1": ["literal_1-0", "literal_1-1"],
  "property_2": {
    "property_3":"Anonymous subgraph"
  }
}
"""
test_json = """{
  "@context": {
    "@vocab":"https://example.net/"
  },
  "@id":"./parent",
  "sub":""" + expected_json + """,
  "parent_property":"Should not appear in extracted"
}
"""

# Load the full graph, setting the base to "https://example.net/"
g_full = rdflib.Graph()
g_full.parse(data=test_json, format="json-ld", publicID="https://example.net/")
print("### Full:")
print(g_full.serialize(format="turtle").decode())

g_expected = rdflib.ConjunctiveGraph()
g_expected.parse(data=expected_json, format="json-ld", publicID="https://example.net/")
print("### Expected:")
print(g_expected.serialize(format="turtle").decode())

# Extract the subgraph that is the object of the subject "https://example.net/sub"
g_sub = sotools.getSubgraph(g_full, rdflib.URIRef("https://example.net/sub"))
print("### Extracted:")
print(g_sub.serialize(format="turtle").decode())

# Direct comparison of the graphs, will fail if there are BNodes
print(f"Extracted subgraph is equal to the expected graph: {g_sub == g_expected}")

# Use isomorphic comparison. This operation can be very expensive if either of
# the graphs are large and degenerate with lots of BNodes.
print((f"Extracted subgraph is isomorphic with the expected: "
       f"{rdflib.compare.isomorphic(g_sub, g_expected)}"))

### Full:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:parent :parent_property "Should not appear in extracted" ;
    :sub :sub .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .

### Expected:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .

### Extracted:
@prefix : <https://example.net/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:sub :property_0 "literal_0" ;
    :property_1 "literal_1-0",
        "literal_1-1" ;
    :property_2 [ :property_3 "Anonymous subgraph" ] .

Extracted subgraph is equal to the expected graph: False
Extracted subgraph is isomorphic with the expected: True
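The last two lines of the output above differ because blank nodes receive arbitrary labels each time a document is parsed. The sketch below illustrates that idea with plain tuples instead of rdflib graphs. It is not how rdflib.compare.isomorphic is implemented; the brute-force `isomorphic` function, the `_:` label convention, and the sample triples are assumptions made for illustration only.

```python
from itertools import permutations


def is_bnode(term):
    """Treat any term starting with '_:' as a blank node (assumed convention)."""
    return term.startswith("_:")


def isomorphic(a, b):
    """Brute-force check: try every relabelling of the blank nodes in a.

    The number of candidate mappings grows factorially with the number of
    blank nodes, which hints at why this kind of comparison can be costly.
    """
    bnodes_a = sorted({t for triple in a for t in triple if is_bnode(t)})
    bnodes_b = sorted({t for triple in b for t in triple if is_bnode(t)})
    if len(a) != len(b) or len(bnodes_a) != len(bnodes_b):
        return False
    for candidate in permutations(bnodes_b):
        mapping = dict(zip(bnodes_a, candidate))
        relabelled = {tuple(mapping.get(t, t) for t in triple) for triple in a}
        if relabelled == b:
            return True
    return False


# The same two statements, with different blank node labels.
extracted = {
    ("ex:sub", "ex:property_2", "_:b0"),
    ("_:b0", "ex:property_3", "Anonymous subgraph"),
}
expected = {
    ("ex:sub", "ex:property_2", "_:n42"),
    ("_:n42", "ex:property_3", "Anonymous subgraph"),
}

print(extracted == expected)
print(isomorphic(extracted, expected))
```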
sotools.common.hasDataset(g)
Number of SO:Dataset graphs in g
Parameters: g (Graph) – The graph to evaluate
Returns: Number of SO:Dataset graphs in g
Return type: integer
Example:
# Load a graph and evaluate if it contains a SO:Dataset
import sotools

g = sotools.loadSOGraph(
    filename="examples/data/ds_bad_namespace.json",
    publicID="https://my.data.net/data/"
)
sotools.hasDataset(g)

3
sotools.common.inflateSubgraph(g, sg, ts, depth=0, max_depth=100)
Inflate the subgraph sg to contain all children of sg appearing in g.
Parameters:
- g (Graph) – The master graph from which the subgraph is extracted
- sg (Graph) – The subgraph, modified in place
- ts (iterable of triples) – list of triples, the objects of which identify subjects to copy from g
- depth (integer) – tracks depth of recursion
- max_depth (integer) – maximum recursion depth for retrieving terms
Returns: None
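The recursive inflation described above can be sketched with plain Python sets of (subject, predicate, object) tuples. This is an illustrative sketch only and not the sotools implementation: `inflate` and `get_subgraph` are hypothetical names, the exact depth semantics are an assumption (here `max_depth` counts levels below the root), and the triples mirror the getSubgraph example on this page.

```python
def inflate(triples, subgraph, frontier, depth=0, max_depth=100):
    """Copy into subgraph the triples whose subject is an object in frontier.

    subgraph is modified in place; recursion stops once depth reaches max_depth.
    """
    if depth >= max_depth:
        return
    for _subject, _predicate, obj in frontier:
        # Skip triples already copied so that cycles terminate.
        children = [t for t in triples if t[0] == obj and t not in subgraph]
        subgraph.update(children)
        inflate(triples, subgraph, children, depth + 1, max_depth)


def get_subgraph(triples, subject, max_depth=100):
    """Return the triples rooted at subject, following objects recursively."""
    subgraph = {t for t in triples if t[0] == subject}
    inflate(triples, subgraph, list(subgraph), max_depth=max_depth)
    return subgraph


triples = {
    ("parent", "parent_property", "Should not appear in extracted"),
    ("parent", "sub", "sub"),
    ("sub", "property_0", "literal_0"),
    ("sub", "property_1", "literal_1-0"),
    ("sub", "property_1", "literal_1-1"),
    ("sub", "property_2", "_:b0"),
    ("_:b0", "property_3", "Anonymous subgraph"),
}

for triple in sorted(get_subgraph(triples, "sub")):
    print(triple)
```

Checking `t not in subgraph` before copying is what keeps the recursion finite on cyclic graphs; the depth limit is a second safeguard for very deep structures.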
sotools.common.loadSOGraph(filename=None, data=None, publicID=None, normalize=True, deslop=True, format='json-ld')
Load RDF string or file to an RDFLib ConjunctiveGraph
Creates a ConjunctiveGraph from the provided file or text. If both are provided, the text is used.
NOTE: Namespace use of <http://schema.org>, <https://schema.org>, or <http://schema.org/> is normalized to <https://schema.org/> if normalize is True.
NOTE: Case of SO: properties in SO_TERMS is adjusted for consistency if deslop is True.
Parameters:
- filename (string) – path to RDF file on disk
- data (string) – RDF text
- publicID (string) – (from rdflib) The logical URI to use as the document base. If None specified the document location is used.
- normalize (boolean) – Normalize the use of schema.org namespace
- deslop (boolean) – Adjust schema.org terms for case consistency
- format (string) – The serialization format of the RDF to load
Returns: The loaded graph
Return type: ConjunctiveGraph
Example:
# Load a Dataset from json-ld, normalize schema.org namespace, and dump as ttl.
import sotools
import json

json_source = "examples/data/ds_bad_namespace.json"
g = sotools.loadSOGraph(filename=json_source,
                        publicID="https://my.data.net/data/",
                        normalize=True,
                        deslop=True)
print("Loaded JSON:")
print(json.dumps(json.load(open(json_source, 'r')), indent=2))
print("\nNormalized schema.org namespace and serialized to ttl:\n")
print(g.serialize(format="ttl").decode())

Loaded JSON:
[
  {
    "@context": {
      "@vocab": "https://schema.org"
    },
    "@id": "demo_0",
    "@type": "Dataset",
    "name": "https, no trailing slash"
  },
  {
    "@context": {
      "@vocab": "http://schema.org"
    },
    "@id": "demo_1",
    "@type": "Dataset",
    "name": "http, no trailing slash"
  },
  {
    "@context": {
      "@vocab": "http://schema.org/"
    },
    "@id": "demo_2",
    "@type": "Dataset",
    "name": "http only"
  }
]

Normalized schema.org namespace and serialized to ttl:

@prefix SO: <https://schema.org/> .
@prefix ns1: <https://my.data.net/data/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:demo_0 a SO:Dataset ;
    SO:name "https, no trailing slash" .

ns1:demo_1 a SO:Dataset ;
    SO:name "http, no trailing slash" .

ns1:demo_2 a SO:Dataset ;
    SO:name "http only" .
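The effect of normalize and deslop on a single IRI can be sketched as a string rewrite. This is an assumption-laden illustration, not the sotools implementation: `normalize_so_iri` is a hypothetical function, and `SO_TERMS_EXAMPLE` is a made-up stand-in for the library's SO_TERMS, whose actual contents are not shown on this page.

```python
import re

SO_CANONICAL = "https://schema.org/"
# Matches the http or https form, with or without the trailing slash.
_SO_VARIANT = re.compile(r"^https?://schema\.org/?")

# Hypothetical stand-in for SO_TERMS; the real list is not shown here.
SO_TERMS_EXAMPLE = ["identifier", "subjectOf", "encodingFormat", "contentUrl"]
_CANONICAL_CASE = {term.lower(): term for term in SO_TERMS_EXAMPLE}


def normalize_so_iri(iri, deslop=True):
    """Rewrite a schema.org IRI to the https://schema.org/ namespace.

    IRIs outside schema.org are returned unchanged. Caveat: a host such as
    schema.organization.example would also match this simple pattern.
    """
    match = _SO_VARIANT.match(iri)
    if match is None:
        return iri
    term = iri[match.end():]
    if deslop:
        # Restore the listed capitalization when the term is known.
        term = _CANONICAL_CASE.get(term.lower(), term)
    return SO_CANONICAL + term


# "@vocab": "http://schema.org" expands "Dataset" to this slash-less IRI.
print(normalize_so_iri("http://schema.orgDataset"))
print(normalize_so_iri("https://schema.org/SubjectOf"))
```

Terms are only re-cased when they appear in the list, since schema.org defines names that differ only by case (for example the Dataset type and the dataset property), so blanket case folding would be unsafe.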
sotools.common.loadSOGraphFromHtml(html, url)
Extract jsonld entries from provided HTML text
Parameters: html (string) – HTML text to be parsed
Returns: Graph loaded from html
Return type: ConjunctiveGraph
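Extracting JSON-LD from a landing page usually means collecting the contents of its script elements of type application/ld+json. The sketch below shows that step with the standard library only; it is not the sotools implementation, `JsonLdExtractor` and `extract_jsonld` are hypothetical names, and the sample HTML is invented for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                # Skip blocks that are not valid JSON.
                pass


def extract_jsonld(html):
    """Return the list of JSON-LD documents embedded in html."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


sample_html = """<html><head>
<script type="application/ld+json">
{"@context": {"@vocab": "https://schema.org/"}, "@type": "Dataset", "name": "demo"}
</script>
<script>var unrelated = 1;</script>
</head><body></body></html>"""

print(extract_jsonld(sample_html))
```

Each extracted document could then be serialized with json.dumps and passed as the data argument of loadSOGraph, using the page URL as publicID.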
sotools.common.loadSOGraphFromUrl(url)
Loads graph from json-ld contained in a landing page.
Parameters: url (string) – Url to process
Returns: Graph of instance
Return type: ConjunctiveGraph
Example:
# Load graph from a URL and print the SO:Dataset.identifier values found
import sotools
from pprint import pprint

url = "https://www.bco-dmo.org/dataset/679374"
g = sotools.loadSOGraphFromUrl(url)
pprint(sotools.getDatasetIdentifiers(g), indent=2)

[ { 'propertyId': 'Literal',
    'url': None,
    'value': 'http://lod.bco-dmo.org/id/dataset/679374'}]
sotools.common.renderGraph(g)
For rendering an rdflib graph in Jupyter notebooks
Parameters: g (Graph) – The graph to render
Returns: Output for rendering directly in the notebook
Return type: Jupyter cell
Example:
# Load a graph and render the output (for jupyter notebooks)
import sotools

g = sotools.loadSOGraph(filename="examples/data/ds_m_subjectof.json")
sotools.renderGraph(g)
sotools.common.validateSHACL(shape_graph, data_graph)
Validate data against a SHACL shape using common options.
Parameters:
- shape_graph (ConjunctiveGraph) – A SHACL shape graph
- data_graph (ConjunctiveGraph) – Data graph to be validated with shape_graph
Returns (tuple): Conformance (boolean), result graph (Graph) and result text
Example:
import sotools
import rdflib

data_source = "examples/data/ds_bad_namespace.json"
data_graph = rdflib.ConjunctiveGraph()
data_graph.parse(data_source, format="json-ld", publicID="https://example.net/data/")

shape_source = "examples/shapes/test_namespace.ttl"
shape_graph = rdflib.ConjunctiveGraph()
shape_graph.parse(shape_source, format="turtle")

conforms, result_graph, result_text = sotools.validateSHACL(shape_graph, data_graph)
print(f"Data shape conforms: {conforms}")
print(f"Results text: \n{result_text}")
print("Results graph:")
sotools.renderGraph(result_graph)

Data shape conforms: False
Results text:
Validation Report
Conforms: False
Results (3):
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
    Severity: sh:Violation
    Source Shape: d1:DatasetBad3Shape
    Focus Node: <https://example.net/data/demo_1>
    Value Node: <https://example.net/data/demo_1>
    Message: Expecting SO namespace of <https://schema.org/> not <http://schema.org>
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
    Severity: sh:Violation
    Source Shape: d1:DatasetBad1Shape
    Focus Node: <https://example.net/data/demo_0>
    Value Node: <https://example.net/data/demo_0>
    Message: Expecting SO namespace of <https://schema.org/> not <https://schema.org>
Constraint Violation in NotConstraintComponent (http://www.w3.org/ns/shacl#NotConstraintComponent):
    Severity: sh:Violation
    Source Shape: d1:DatasetBad2Shape
    Focus Node: <https://example.net/data/demo_2>
    Value Node: <https://example.net/data/demo_2>
    Message: Expecting SO namespace of <https://schema.org/> not <http://schema.org/>

Results graph:
Running code on this page
All examples on this page can be run live in Binder. To do so:
- Click on the “Activate Binder” button.
- Wait for Binder to become active. This can take a while; you can watch progress in your browser’s JavaScript console. When a line like Kernel: connected (89dfd3c8... appears, Binder should be ready to go.
- Run the following before any other script on the page. This sets the right path context for loading examples etc.
import os

try:
    os.chdir("docsource/source")
except OSError:
    # The directory could not be entered; stay in the current directory
    pass
print("Page is ready. You can now run other code blocks on this page.")