Linking to metadata documents¶
Overview¶
A dataset may have associated metadata serialized in formats other than the SO:Dataset [1], and it is beneficial to indicate how those metadata may be retrieved.
There are several options for providing links to resources associated with a SO:Dataset. Each option must provide a link to the resource, indicate the type of the linked resource, and indicate the relationship between the linked resource and the SO:Dataset and its components.
Three options for linking to external metadata documents are described here:
1. Using subjectOf metadata links to indicate that the SO:Dataset is the subject of an SO:CreativeWork or one of its derivatives.
2. Using the inverse of 1, about metadata links.
3. Using encoding metadata links to indicate that the referenced SO:MediaObject is an alternative encoding of the SO:Dataset document.
These are more fully described below, with examples.
subjectOf metadata links¶
The subjectOf [2] property indicates that the current SO: entity is the subject of the linked entity. In the following example, the SO:Dataset with id ds-02 is the subject of the SO:CreativeWork [3] document located at the URL https://my.server.org/data/ds-02/metadata.xml.
The type of the linked metadata document is indicated by the encodingFormat [4] list. In this case, the document has an encodingFormat of application/rdf+xml and also http://ns.dataone.org/metadata/schema/onedcx/v1.0, which is a value from the DataONE vocabulary of object formats [5].
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@id": "ds-02",
    "url": "https://my.server.org/data/ds-02",
    "@type": "Dataset",
    "identifier": "dataset-02",
    "name": "Dataset subjectOf metadata",
    "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "subjectOf": {
        "@id": "ds-02/metadata.xml",
        "@type": "CreativeWork",
        "name": "Dublin Core Metadata Document Describing the Dataset",
        "url": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": ["application/rdf+xml", "http://ns.dataone.org/metadata/schema/onedcx/v1.0"]
    }
}
import sotools
json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
sotools.renderGraph(g)
Hence:
link: | https://my.server.org/data/ds-02/metadata.xml |
---|---|
type: | Dublin Core in RDF-XML |
relationship: | The SO:Dataset is the subject of the metadata |
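The link, type, and relationship summarized above can also be read straight from the JSON-LD with the Python standard library. The following is a minimal sketch, not part of sotools: the helper names `as_list` and `subject_of_links` are hypothetical, the document is a trimmed copy of the example above, and the code assumes a compact JSON-LD document whose subjectOf targets are embedded objects rather than bare IRIs.

```python
import json

# Trimmed copy of the subjectOf example above.
DOC = json.loads("""
{
    "@context": {"@vocab": "https://schema.org/"},
    "@id": "ds-02",
    "@type": "Dataset",
    "subjectOf": {
        "@id": "ds-02/metadata.xml",
        "@type": "CreativeWork",
        "url": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": ["application/rdf+xml", "http://ns.dataone.org/metadata/schema/onedcx/v1.0"]
    }
}
""")


def as_list(value):
    """JSON-LD allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def subject_of_links(doc):
    """Hypothetical helper: one dict per embedded subjectOf target."""
    links = []
    for target in as_list(doc.get("subjectOf")):
        if not isinstance(target, dict):
            continue  # a bare IRI carries no url or encodingFormat to report
        links.append({
            "link": target.get("url"),
            "type": as_list(target.get("encodingFormat")),
            "relationship": "subjectOf",
        })
    return links


links = subject_of_links(DOC)
for entry in links:
    print(entry["relationship"], entry["link"], entry["type"])
```

This reads the compact form only; the SPARQL approach described on this page works on the parsed graph and is not affected by how the JSON-LD happens to be laid out.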
The links (and some other information) can be extracted using the following SPARQL query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO: <https://schema.org/>
SELECT ?dateModified ?encodingFormat ?url ?description ?about
WHERE {
    ?about rdf:type SO:Dataset .
    ?about SO:subjectOf ?y .
    ?y SO:url ?url .
    ?y SO:encodingFormat ?encodingFormat .
    OPTIONAL {
        ?y SO:dateModified ?dateModified .
        ?y SO:description ?description .
    }
}
This query is implemented in the method sotools.common.getDatasetMetadataLinksFromSubjectOf().
For example:
# Get links to metadata documents referenced from a SO:dataset
import sotools
from pprint import pprint
json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
pprint(links, indent=2)
[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
'dateModified': None,
'description': 'None',
'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
'subjectOf': 'https://my.server.net/data/ds-02'},
{ 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
'dateModified': None,
'description': 'None',
'encodingFormat': 'application/rdf+xml',
'subjectOf': 'https://my.server.net/data/ds-02'}]
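Note that the output above contains one row per encodingFormat value, so a single metadata document with two formats appears twice. If one entry per document is preferred, the rows can be grouped afterwards. The following is a minimal sketch on plain dictionaries shaped like the output above; `group_formats` is a hypothetical helper, not part of sotools.

```python
from collections import defaultdict

# Rows shaped like the output above (only the keys used here are kept).
rows = [
    {
        "contentUrl": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": "http://ns.dataone.org/metadata/schema/onedcx/v1.0",
        "subjectOf": "https://my.server.net/data/ds-02",
    },
    {
        "contentUrl": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": "application/rdf+xml",
        "subjectOf": "https://my.server.net/data/ds-02",
    },
]


def group_formats(rows):
    """Hypothetical helper: collect formats per (dataset, document) pair."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["subjectOf"], row["contentUrl"])
        grouped[key].append(row["encodingFormat"])
    # Sort so the result does not depend on the order of the input rows.
    return {key: sorted(formats) for key, formats in grouped.items()}


grouped = group_formats(rows)
for (dataset, document), formats in grouped.items():
    print(dataset, document, formats)
```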
Conformance to the SO:subjectOf
approach can be evaluated using the SHACL shape:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix SO: <https://schema.org/> .
@prefix d1: <http://ns.dataone.org/schema/SO/1.0/> .

d1:rdfPrefix
    sh:declare [
        sh:namespace "http://www.w3.org/1999/02/22-rdf-syntax-ns#"^^xsd:anyURI ;
        sh:prefix "rdf" ;
    ] .

d1:schemaPrefix
    sh:declare [
        sh:namespace "https://schema.org/"^^xsd:anyURI ;
        sh:prefix "SO" ;
    ] .

d1:subjectofCreativeWorkShape
    a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:prefixes d1:rdfPrefix, d1:schemaPrefix ;
        sh:select """
            SELECT ?this
            WHERE {
                ?DS rdf:type SO:Dataset .
                ?DS SO:subjectOf ?this .
                ?this rdf:type SO:CreativeWork .
            }
        """ ;
    ] ;
    sh:property [
        sh:message "url is required for subjectOf as a CreativeWork" ;
        sh:path SO:url;
        sh:minCount 1;
    ] .

d1:subjectofMediaObjectShape
    a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:prefixes d1:rdfPrefix, d1:schemaPrefix ;
        sh:select """
            SELECT ?this
            WHERE {
                ?DS rdf:type SO:Dataset .
                ?DS SO:subjectOf ?this .
                ?this rdf:type SO:MediaObject .
            }
        """ ;
    ] ;
    sh:message "url or contentUrl is required for subjectOf as a MediaObject" ;
    sh:xone (
        [
            sh:path SO:contentUrl;
            sh:minCount 1;
        ]
        [
            sh:path SO:url;
            sh:minCount 1;
        ]
    ) .
For example:
import rdflib
import sotools
import pyshacl

json_source = "examples/data/ds_m_subjectof.json"
data_graph = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
shape_graph = rdflib.ConjunctiveGraph()
shape_graph.parse("examples/shapes/test_dataset_subjectof.ttl", format="turtle")
conforms, result_graph, result_text = pyshacl.validate(
    data_graph,
    shacl_graph=shape_graph,
    inference="rdfs",
    meta_shacl=True,
    abort_on_error=False,
    debug=False,
    advanced=True
)
print(result_text)
Validation Report
Conforms: True
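The two constraints in the shape above can be restated in plain Python to make their intent explicit. This is only an illustrative sketch on dictionaries, not a substitute for SHACL validation: `count_values` and `check_subject_of` are hypothetical names, and the input is assumed to be a single subjectOf target in compact JSON-LD form. The MediaObject check follows the sh:xone semantics, under which exactly one of the listed alternatives must hold.

```python
def count_values(node, key):
    """Number of values present for a property (0, 1, or the list length)."""
    value = node.get(key)
    if value is None:
        return 0
    return len(value) if isinstance(value, list) else 1


def check_subject_of(node):
    """Return the violation messages for one subjectOf target."""
    types = node.get("@type", [])
    types = types if isinstance(types, list) else [types]
    problems = []
    # CreativeWork shape: sh:minCount 1 on SO:url.
    if "CreativeWork" in types and count_values(node, "url") < 1:
        problems.append("url is required for subjectOf as a CreativeWork")
    # MediaObject shape: sh:xone over contentUrl and url (exactly one may hold).
    if "MediaObject" in types:
        satisfied = sum(
            1 for key in ("contentUrl", "url") if count_values(node, key) >= 1
        )
        if satisfied != 1:
            problems.append(
                "url or contentUrl is required for subjectOf as a MediaObject"
            )
    return problems


creative_work = {
    "@type": "CreativeWork",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
}
print(check_subject_of(creative_work))
```

One consequence worth noting: because sh:xone is an exclusive choice, a MediaObject that carries both url and contentUrl does not conform. If accepting both is intended, sh:or would be the inclusive alternative.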
about metadata links¶
The about [6] property is the inverse of the subjectOf property, and so asserts that the current SO: entity is about the linked entity.
In the following example, a composite dataset is described. The SO:MediaObject [7] with id ./metadata.xml is about the SO:Dataset with id ./ and the SO:MediaObject with id ./data_part_a.csv.
The type of the metadata document, as indicated by the encodingFormat property, is http://www.isotc211.org/2005/gmd, a value drawn from the DataONE vocabulary of object formats.
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@graph": [
        {
            "@type": "Dataset",
            "@id": "./",
            "identifier": "dataset-01",
            "name": "Dataset with metadata about",
            "description": "Dataset snippet with metadata and data components indicated by hasPart and the descriptive metadata through an about association.",
            "license": "https://creativecommons.org/publicdomain/zero/1.0/",
            "hasPart": [
                {
                    "@id": "./metadata.xml"
                },
                {
                    "@id": "./data_part_a.csv"
                }
            ]
        },
        {
            "@id": "./metadata.xml",
            "@type": "MediaObject",
            "contentUrl": "https://example.org/my/data/1/metadata.xml",
            "dateModified": "2019-10-10T12:43:11+00:00.000",
            "description": "A metadata document describing the Dataset and the data component",
            "encodingFormat": "http://www.isotc211.org/2005/gmd",
            "about": [
                {
                    "@id": "./"
                },
                {
                    "@id": "./data_part_a.csv"
                }
            ]
        },
        {
            "@id": "./data_part_a.csv",
            "@type": "MediaObject",
            "contentUrl": "https://example.org/my/data/1/data_part_a.csv"
        }
    ]
}
import sotools
json_source = "examples/data/ds_m_about.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
sotools.renderGraph(g)
Hence:
link: | https://example.org/my/data/1/metadata.xml |
---|---|
type: | ISO TC211 XML Metadata |
relationship: | The metadata is about the SO:Dataset , and hence the SO:Dataset is the subject of the metadata |
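The inverse relationship noted above (the metadata is about the dataset, hence the dataset is the subject of the metadata) can be made concrete by inverting the about links of a JSON-LD graph. The following is a minimal sketch using only the standard library; `invert_about` is a hypothetical helper, the graph is a trimmed copy of the example above, and identifiers are left as the relative values written in the document rather than resolved against a base URL.

```python
import json

# Trimmed copy of the about example above.
GRAPH = json.loads("""
{
    "@context": {"@vocab": "https://schema.org/"},
    "@graph": [
        {"@type": "Dataset", "@id": "./"},
        {
            "@id": "./metadata.xml",
            "@type": "MediaObject",
            "contentUrl": "https://example.org/my/data/1/metadata.xml",
            "about": [{"@id": "./"}, {"@id": "./data_part_a.csv"}]
        },
        {"@id": "./data_part_a.csv", "@type": "MediaObject"}
    ]
}
""")


def invert_about(graph_doc):
    """Map each about target to the @ids of the nodes that are about it."""
    subject_of = {}
    for node in graph_doc.get("@graph", []):
        about = node.get("about", [])
        about = about if isinstance(about, list) else [about]
        for target in about:
            # A target may be an embedded object or a bare identifier string.
            target_id = target["@id"] if isinstance(target, dict) else target
            subject_of.setdefault(target_id, []).append(node["@id"])
    return subject_of


subject_of = invert_about(GRAPH)
for target_id, documents in subject_of.items():
    print(target_id, "is the subject of", documents)
```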
The links and other information can be extracted using the following SPARQL query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO: <https://schema.org/>
SELECT ?dateModified ?encodingFormat ?contentUrl ?description ?about
WHERE {
    ?about rdf:type SO:Dataset .
    ?y SO:about ?about .
    ?y SO:contentUrl ?contentUrl .
    ?y SO:encodingFormat ?encodingFormat .
    OPTIONAL {
        ?y SO:dateModified ?dateModified .
        ?y SO:description ?description .
    }
}
This query is implemented in the method sotools.common.getDatasetMetadataLinksFromAbout().
For example:
# Get links to metadata documents referenced from a SO:dataset
import sotools
from pprint import pprint
json_source = "examples/data/ds_m_about.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
pprint(links, indent=2)
[ { 'contentUrl': 'https://example.org/my/data/1/metadata.xml',
'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
'description': 'A metadata document describing the Dataset and the data '
'component',
'encodingFormat': 'http://www.isotc211.org/2005/gmd',
'subjectOf': 'https://my.server.net/data/'}]
encoding metadata links¶
The encoding property is defined [8] as:
A media object that encodes this CreativeWork. This property is a synonym for associatedMedia.
In this approach, the SO:Dataset document is considered to describe a dataset, as does the associated metadata document (ISO or EML, for example). As such, the XML and SO:Dataset are alternate encodings of the same thing.
In the following example, the SO:Dataset with id ds_m_encoding has an encoding of type SO:MediaObject with an id of ds_m_encoding#media-object and an encodingFormat of http://www.isotc211.org/2005/gmd, which is a value drawn from the DataONE vocabulary of object formats. The media object is located at the URL https://my.server.net/datasets/00.xml.
{
    "@id": "ds_m_encoding",
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "name": "Dataset with metadata encoding",
    "description": "Dataset snippet using SO:Encoding pattern for associated XML metadata.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "identifier": "dataset-00",
    "encoding": {
        "@id": "ds_m_encoding#media-object",
        "@type": "MediaObject",
        "contentUrl": "https://my.server.net/datasets/00.xml",
        "dateModified": "2019-10-10T12:43:11+00:00.000",
        "description": "ISO TC211 XML rendering of metadata",
        "encodingFormat": "http://www.isotc211.org/2005/gmd"
    }
}
import sotools
json_source = "examples/data/ds_m_encoding.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
sotools.renderGraph(g)
Hence:
link: | https://my.server.net/datasets/00.xml |
---|---|
type: | ISO TC211 XML Metadata |
relationship: | The metadata is an encoding of the SO:Dataset document |
The links and other information can be extracted using the following SPARQL query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO: <https://schema.org/>
SELECT ?dateModified ?encodingFormat ?contentUrl ?description ?x
WHERE {
    ?x rdf:type SO:Dataset .
    ?x SO:encoding ?y .
    ?y SO:encodingFormat ?encodingFormat .
    ?y SO:dateModified ?dateModified .
    ?y SO:contentUrl ?contentUrl .
    ?y SO:description ?description .
}
This query is implemented in the method sotools.common.getDatasetMetadataLinksFromEncoding().
For example:
# Get links to metadata documents referenced from a SO:dataset
import sotools
from pprint import pprint
json_source = "examples/data/ds_m_encoding.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
pprint(links, indent=2)
[ { 'contentUrl': 'https://my.server.net/datasets/00.xml',
'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
'description': 'ISO TC211 XML rendering of metadata',
'encodingFormat': 'http://www.isotc211.org/2005/gmd',
'subjectOf': 'https://my.server.net/data/ds_m_encoding'}]
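Unlike the subjectOf and about queries, the encoding query shown above places no pattern in an OPTIONAL block, so every listed property must be present for a media object to be returned. The sketch below mimics that effect on plain dictionaries; it does not run SPARQL, and `encoding_links` is a hypothetical helper rather than the sotools implementation.

```python
# Properties the encoding query requires on the media object.
REQUIRED = ("encodingFormat", "dateModified", "contentUrl", "description")


def encoding_links(dataset):
    """Return one row per encoding that has every required property."""
    encodings = dataset.get("encoding", [])
    encodings = encodings if isinstance(encodings, list) else [encodings]
    rows = []
    for media in encodings:
        # A missing property means the query pattern would not match at all.
        if all(key in media for key in REQUIRED):
            rows.append({key: media[key] for key in REQUIRED})
    return rows


complete = {
    "@type": "Dataset",
    "encoding": {
        "@type": "MediaObject",
        "contentUrl": "https://my.server.net/datasets/00.xml",
        "dateModified": "2019-10-10T12:43:11+00:00.000",
        "description": "ISO TC211 XML rendering of metadata",
        "encodingFormat": "http://www.isotc211.org/2005/gmd",
    },
}

# Same dataset, but the media object has no description.
partial = {
    "@type": "Dataset",
    "encoding": {
        key: value
        for key, value in complete["encoding"].items()
        if key != "description"
    },
}

print(len(encoding_links(complete)), len(encoding_links(partial)))
```

Publishers using the encoding pattern may therefore want to supply dateModified and description on the media object so that the link is discoverable with this query.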
Footnotes¶
[1] | https://schema.org/Dataset |
[2] | https://schema.org/subjectOf |
[3] | https://schema.org/CreativeWork |
[4] | https://schema.org/encodingFormat |
[5] | https://cn.dataone.org/cn/v2/formats |
[6] | https://schema.org/about |
[7] | https://schema.org/MediaObject |
[8] | https://schema.org/encoding |
Running code on this page¶
All examples on this page can be run live in Binder. To do so:
- Click on the “Activate Binder” button.
- Wait for Binder to be active. This can take a while; you can watch progress in your browser’s JavaScript console. When a line like
Kernel: connected (89dfd3c8...
appears, Binder should be ready to go.
- Run the following before any other script on the page. This sets the right path context for loading examples etc.
import os

try:
    os.chdir("docsource/source")
except OSError:
    # The directory is not present here, so stay in the current directory.
    pass
print("Page is ready. You can now run other code blocks on this page.")