Linking to metadata documents¶

Contents

Overview
subjectOf metadata links
about metadata links
encoding metadata links
Footnotes
Running code on this page

Overview ¶

A dataset may have associated metadata serialized in formats other than SO:Dataset [1], and it is useful to indicate how that metadata may be retrieved.

There are several options for providing links to resources associated with a SO:Dataset. Each option must provide a link to the resource, indicate the type of the linked resource, and state the relationship between the linked resource and the SO:Dataset and its components.

Three options for linking to external metadata documents are described here:

1. Using subjectOf metadata links to indicate that the SO:Dataset is the subject of an SO:CreativeWork or one of its derivatives.
2. Using about metadata links, the inverse of option 1.
3. Using encoding metadata links to indicate that the referenced SO:MediaObject is an alternative encoding of the SO:Dataset document.

These are more fully described below, with examples.

subjectOf metadata links ¶

The subjectOf [2] property indicates that the current SO: entity is the subject of the linked resource. In the following example, the SO:Dataset with id ds-02 is the subject of the SO:CreativeWork [3] document located at the URL https://my.server.org/data/ds-02/metadata.xml.

The type of the linked metadata document is indicated by the encodingFormat [4] list. In this case, the document has an encodingFormat of application/rdf+xml and also http://ns.dataone.org/metadata/schema/onedcx/v1.0, which is a value from the DataONE vocabulary of object formats [5].

{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@id": "ds-02",
    "url": "https://my.server.org/data/ds-02",
    "@type": "Dataset",
    "identifier": "dataset-02",
    "name": "Dataset subjectOf metadata",
    "description": "Dataset snippet with descriptive metadata indicated through subjectOf relation.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "subjectOf": {
        "@id": "ds-02/metadata.xml",
        "@type": "CreativeWork",
        "name": "Dublin Core Metadata Document Describing the Dataset",
        "url": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": ["application/rdf+xml", "http://ns.dataone.org/metadata/schema/onedcx/v1.0"]
    }
}

import sotools
json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
sotools.renderGraph(g)

Hence:

link:	`https://my.server.org/data/ds-02/metadata.xml`
type:	Dublin Core in RDF-XML
relationship:	The `SO:Dataset` is the subject of the metadata

The links (and some other information) can be extracted using the SPARQL:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO:   <https://schema.org/>

SELECT ?dateModified ?encodingFormat ?url ?description ?about
WHERE {
    ?about rdf:type SO:Dataset .
    ?about SO:subjectOf ?y .
    ?y SO:url ?url .
    ?y SO:encodingFormat ?encodingFormat .
    OPTIONAL {
      ?y SO:dateModified ?dateModified .
      ?y SO:description ?description .
    }
}

This query is implemented in the method sotools.common.getDatasetMetadataLinksFromSubjectOf().

For example:

# Get links to metadata documents referenced from a SO:Dataset
import sotools
from pprint import pprint

json_source = "examples/data/ds_m_subjectof.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
pprint(links, indent=2)

[ { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'http://ns.dataone.org/metadata/schema/onedcx/v1.0',
    'subjectOf': 'https://my.server.net/data/ds-02'},
  { 'contentUrl': 'https://my.server.org/data/ds-02/metadata.xml',
    'dateModified': None,
    'description': 'None',
    'encodingFormat': 'application/rdf+xml',
    'subjectOf': 'https://my.server.net/data/ds-02'}]
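The two records above differ only in encodingFormat: the query returns one row per combination of matched values, so a subjectOf target that lists two formats yields two links. The following is a minimal sketch of that expansion using only the Python standard library. It treats the JSON-LD as plain JSON, an assumption that holds for the compact form shown on this page but not for arbitrary JSON-LD, and the function name subject_of_links is hypothetical and not part of sotools.

```python
import json

# Trimmed copy of the subjectOf example document from this page.
DOC = json.loads("""
{
    "@context": {"@vocab": "https://schema.org/"},
    "@id": "ds-02",
    "@type": "Dataset",
    "subjectOf": {
        "@id": "ds-02/metadata.xml",
        "@type": "CreativeWork",
        "url": "https://my.server.org/data/ds-02/metadata.xml",
        "encodingFormat": ["application/rdf+xml",
                           "http://ns.dataone.org/metadata/schema/onedcx/v1.0"]
    }
}
""")


def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def subject_of_links(doc):
    """Return one record per (subjectOf target, encodingFormat) pair.

    Hypothetical helper: mirrors the required and optional parts of the
    SPARQL query above, without any RDF processing.
    """
    links = []
    if doc.get("@type") != "Dataset":
        return links
    for target in as_list(doc.get("subjectOf")):
        if not target.get("url"):
            continue  # url is required by the query
        for fmt in as_list(target.get("encodingFormat")):
            links.append({
                "url": target["url"],
                "encodingFormat": fmt,
                "dateModified": target.get("dateModified"),  # optional
                "description": target.get("description"),    # optional
                "about": doc.get("@id"),
            })
    return links


links = subject_of_links(DOC)
for link in links:
    print(link["encodingFormat"], "->", link["url"])
```

Unlike the sotools output, this sketch does not resolve relative identifiers against a base URI, so the dataset is reported as ds-02 rather than as a full URL.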

Conformance to the SO:subjectOf approach can be evaluated using the following SHACL shapes:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix SO: <https://schema.org/> .
@prefix d1: <http://ns.dataone.org/schema/SO/1.0/> .

d1:rdfPrefix
  sh:declare [
    sh:namespace "http://www.w3.org/1999/02/22-rdf-syntax-ns#"^^xsd:anyURI ;
    sh:prefix "rdf" ;
  ] .

d1:schemaPrefix
  sh:declare [
    sh:namespace "https://schema.org/"^^xsd:anyURI ;
    sh:prefix "SO" ;
  ] .

d1:subjectofCreativeWorkShape
    a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:prefixes d1:rdfPrefix, d1:schemaPrefix ;
        sh:select """
            SELECT ?this
            WHERE {
                ?DS rdf:type SO:Dataset .
                ?DS SO:subjectOf ?this .
                ?this rdf:type SO:CreativeWork .
            }
        """ ;
    ] ;
    sh:property [
        sh:message "url is required for subjectOf as a CreativeWork" ;
        sh:path SO:url;
        sh:minCount 1;
    ] .

d1:subjectofMediaObjectShape
    a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:prefixes d1:rdfPrefix, d1:schemaPrefix ;
        sh:select """
            SELECT ?this
            WHERE {
                ?DS rdf:type SO:Dataset .
                ?DS SO:subjectOf ?this .
                ?this rdf:type SO:MediaObject .
            }
        """ ;
    ] ;
    sh:message "url or contentUrl is required for subjectOf as a MediaObject" ;
    sh:xone (
        [
            sh:path SO:contentUrl;
            sh:minCount 1;
        ]
        [
            sh:path SO:url;
            sh:minCount 1;
        ]
    ) .

For example:

import rdflib
import sotools
import pyshacl
json_source = "examples/data/ds_m_subjectof.json"
data_graph = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
shape_graph = rdflib.ConjunctiveGraph()
shape_graph.parse("examples/shapes/test_dataset_subjectof.ttl", format="turtle")
conforms, result_graph, result_text = pyshacl.validate(
  data_graph,
  shacl_graph=shape_graph,
  inference="rdfs",
  meta_shacl=True,
  abort_on_error=False,
  debug=False,
  advanced=True
)
print(result_text)

Validation Report
Conforms: True
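The MediaObject shape uses sh:xone, which is satisfied only when exactly one of the listed alternatives holds, so a MediaObject carrying both url and contentUrl would not conform. The sketch below restates the two rules in plain Python to make that behaviour visible. It is an illustration rather than a replacement for SHACL validation: check_subject_of is a hypothetical name, the node is a plain dictionary with a single string @type, and no subclass inference is attempted.

```python
def count_present(node, keys):
    """Count how many of the given keys have a non-empty value."""
    return sum(1 for key in keys if node.get(key))


def check_subject_of(node):
    """Return a list of violation messages; an empty list means conforming.

    Hypothetical helper that mirrors the two shapes above for a node
    referenced through subjectOf.
    """
    node_type = node.get("@type")
    if node_type == "CreativeWork":
        # sh:minCount 1 on SO:url
        if count_present(node, ("url",)) < 1:
            return ["url is required for subjectOf as a CreativeWork"]
    elif node_type == "MediaObject":
        # sh:xone: exactly one of contentUrl or url must be present
        if count_present(node, ("contentUrl", "url")) != 1:
            return ["url or contentUrl is required for subjectOf as a MediaObject"]
    return []


creative_work = {
    "@type": "CreativeWork",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
}
both_urls = {
    "@type": "MediaObject",
    "url": "https://my.server.org/data/ds-02/metadata.xml",
    "contentUrl": "https://my.server.org/data/ds-02/metadata.xml",
}

print(check_subject_of(creative_work))
print(check_subject_of(both_urls))
```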

about metadata links ¶

The about [6] property is the inverse of the subjectOf property, and so asserts that the current SO: entity is about the linked resource.

In the following example, a composite dataset is described. The SO:MediaObject [7] with id ./metadata.xml is about the SO:Dataset with id ./ and the SO:MediaObject with id ./data_part_a.csv.

The type of the metadata document as indicated by the encodingFormat property is http://www.isotc211.org/2005/gmd, a value which is drawn from the DataONE vocabulary of object formats.

{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@graph": [
    {
      "@type": "Dataset",
      "@id": "./",
      "identifier": "dataset-01",
      "name": "Dataset with metadata about",
      "description": "Dataset snippet with metadata and data components indicated by hasPart and the descriptive metadata through an about association.",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "hasPart": [
        {
          "@id": "./metadata.xml"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./metadata.xml",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/metadata.xml",
      "dateModified": "2019-10-10T12:43:11+00:00.000",
      "description": "A metadata document describing the Dataset and the data component",
      "encodingFormat":"http://www.isotc211.org/2005/gmd",
      "about": [
        {
          "@id": "./"
        },
        {
          "@id": "./data_part_a.csv"
        }
      ]
    },
    {
      "@id": "./data_part_a.csv",
      "@type": "MediaObject",
      "contentUrl": "https://example.org/my/data/1/data_part_a.csv"
    }
  ]
}

import sotools
json_source = "examples/data/ds_m_about.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
sotools.renderGraph(g)

Hence:

link:	`https://example.org/my/data/1/metadata.xml`
type:	ISO TC211 XML Metadata
relationship:	The metadata is about the `SO:Dataset`, and hence the `SO:Dataset` is the subject of the metadata

The links and other information can be extracted using the SPARQL:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO:   <https://schema.org/>

SELECT ?dateModified ?encodingFormat ?contentUrl ?description ?about
WHERE {
    ?about rdf:type SO:Dataset .
    ?y SO:about ?about .
    ?y SO:contentUrl ?contentUrl .
    ?y SO:encodingFormat ?encodingFormat .
    OPTIONAL {
      ?y SO:dateModified ?dateModified .
      ?y SO:description ?description .
    }
}

This query is implemented in the method sotools.common.getDatasetMetadataLinksFromAbout().

For example:

# Get links to metadata documents referenced from a SO:Dataset
import sotools
from pprint import pprint

json_source = "examples/data/ds_m_about.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
pprint(links, indent=2)

[ { 'contentUrl': 'https://example.org/my/data/1/metadata.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'A metadata document describing the Dataset and the data '
                   'component',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/'}]
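Only one record is returned even though the metadata document is about two resources, because the query requires the target of about to be a SO:Dataset and ./data_part_a.csv is a SO:MediaObject. A minimal standard-library sketch of that reverse lookup follows. It walks the @graph as plain JSON, which is an assumption that suits the compact form shown here, and about_links is a hypothetical name rather than a sotools function.

```python
# Trimmed copy of the about example document from this page.
DOC = {
    "@context": {"@vocab": "https://schema.org/"},
    "@graph": [
        {"@type": "Dataset", "@id": "./",
         "hasPart": [{"@id": "./metadata.xml"}, {"@id": "./data_part_a.csv"}]},
        {"@id": "./metadata.xml", "@type": "MediaObject",
         "contentUrl": "https://example.org/my/data/1/metadata.xml",
         "encodingFormat": "http://www.isotc211.org/2005/gmd",
         "about": [{"@id": "./"}, {"@id": "./data_part_a.csv"}]},
        {"@id": "./data_part_a.csv", "@type": "MediaObject",
         "contentUrl": "https://example.org/my/data/1/data_part_a.csv"},
    ],
}


def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def about_links(doc):
    """Find nodes whose about property points at a Dataset in the graph."""
    nodes = doc.get("@graph", [])
    dataset_ids = {n.get("@id") for n in nodes if n.get("@type") == "Dataset"}
    links = []
    for node in nodes:
        # contentUrl and encodingFormat are required by the query
        if not node.get("contentUrl") or not node.get("encodingFormat"):
            continue
        for ref in as_list(node.get("about")):
            ref_id = ref.get("@id") if isinstance(ref, dict) else ref
            if ref_id not in dataset_ids:
                continue  # about target is not a Dataset
            for fmt in as_list(node["encodingFormat"]):
                links.append({
                    "contentUrl": node["contentUrl"],
                    "encodingFormat": fmt,
                    "about": ref_id,
                })
    return links


links = about_links(DOC)
for link in links:
    print(link["about"], "<-", link["contentUrl"])
```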

encoding metadata links ¶

The encoding property is defined [8] as:

A media object that encodes this CreativeWork. This property is a synonym for associatedMedia.

In this approach it is considered that the SO:Dataset document describes a dataset, as does the associated metadata document (ISO or EML for example). As such, the XML and SO:Dataset are alternate encodings of the same thing.

In the following example, the SO:Dataset with id ds_m_encoding has an encoding of type SO:MediaObject with an id of ds_m_encoding#media-object and an encodingFormat of http://www.isotc211.org/2005/gmd, which is a value drawn from the DataONE vocabulary of object formats. The media object is located at the URL https://my.server.net/datasets/00.xml.

{
   "@id":"ds_m_encoding",
   "@context": {
       "@vocab": "https://schema.org/"
   },
   "@type":"Dataset",
   "name": "Dataset with metadata encoding",
   "description": "Dataset snippet using SO:Encoding pattern for associated XML metadata.",
   "license": "https://creativecommons.org/publicdomain/zero/1.0/",
   "identifier": "dataset-00",
   "encoding": {
       "@id":"ds_m_encoding#media-object",
       "@type":"MediaObject",
       "contentUrl":"https://my.server.net/datasets/00.xml",
       "dateModified":"2019-10-10T12:43:11+00:00.000",
       "description":"ISO TC211 XML rendering of metadata",
       "encodingFormat":"http://www.isotc211.org/2005/gmd"
   }
}

import sotools
json_source = "examples/data/ds_m_encoding.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
sotools.renderGraph(g)

Hence:

link:	`https://my.server.net/datasets/00.xml`
type:	ISO TC211 XML Metadata
relationship:	The metadata is an encoding of the `SO:Dataset` document

The links and other information can be extracted using the SPARQL:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO:   <https://schema.org/>

SELECT ?dateModified ?encodingFormat ?contentUrl ?description ?x
WHERE {
    ?x rdf:type SO:Dataset .
    ?x SO:encoding ?y .
    ?y SO:encodingFormat ?encodingFormat.
    ?y SO:dateModified ?dateModified .
    ?y SO:contentUrl ?contentUrl .
    ?y SO:description ?description .
}

which is implemented with the method sotools.common.getDatasetMetadataLinksFromEncoding().

For example:

# Get links to metadata documents referenced from a SO:Dataset
import sotools
from pprint import pprint

json_source = "examples/data/ds_m_encoding.json"
g = sotools.loadSOGraph(filename=json_source, publicID="https://my.server.net/data/")
links = sotools.getDatasetMetadataLinks(g)
pprint(links, indent=2)

[ { 'contentUrl': 'https://my.server.net/datasets/00.xml',
    'dateModified': rdflib.term.Literal('2019-10-10T12:43:11+00:00.000'),
    'description': 'ISO TC211 XML rendering of metadata',
    'encodingFormat': 'http://www.isotc211.org/2005/gmd',
    'subjectOf': 'https://my.server.net/data/ds_m_encoding'}]
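Unlike the subjectOf and about queries, the encoding query shown above has no OPTIONAL clause, so as written it matches only when encodingFormat, dateModified, contentUrl and description are all present on the SO:MediaObject. The sketch below illustrates that consequence with the standard library only. It reads the JSON-LD as plain JSON (an assumption valid for the compact form on this page), and encoding_links is a hypothetical name, not the sotools implementation.

```python
# Properties the encoding query binds without OPTIONAL.
REQUIRED = ("encodingFormat", "dateModified", "contentUrl", "description")

# Trimmed copy of the encoding example document from this page.
DOC = {
    "@id": "ds_m_encoding",
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "encoding": {
        "@id": "ds_m_encoding#media-object",
        "@type": "MediaObject",
        "contentUrl": "https://my.server.net/datasets/00.xml",
        "dateModified": "2019-10-10T12:43:11+00:00.000",
        "description": "ISO TC211 XML rendering of metadata",
        "encodingFormat": "http://www.isotc211.org/2005/gmd",
    },
}


def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def encoding_links(doc):
    """Return encoding records that carry every required property."""
    if doc.get("@type") != "Dataset":
        return []
    links = []
    for enc in as_list(doc.get("encoding")):
        if not all(enc.get(key) for key in REQUIRED):
            continue  # a missing property means the query has no match
        record = {key: enc[key] for key in REQUIRED}
        record["dataset"] = doc.get("@id")
        links.append(record)
    return links


complete = encoding_links(DOC)

# Same document with the description removed from the media object.
partial_doc = dict(DOC)
partial_doc["encoding"] = {
    k: v for k, v in DOC["encoding"].items() if k != "description"
}
partial = encoding_links(partial_doc)

print(len(complete), len(partial))
```

If records with a missing dateModified or description should still be reported, those two patterns would need to move into OPTIONAL clauses, as in the other two queries.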

Footnotes ¶

[1]	https://schema.org/Dataset

[2]	https://schema.org/subjectOf

[3]	https://schema.org/CreativeWork

[4]	https://schema.org/encodingFormat

[5]	https://cn.dataone.org/cn/v2/formats

[6]	https://schema.org/about

[7]	https://schema.org/MediaObject

[8]	https://schema.org/encoding

Running code on this page ¶

All examples on this page can be run live in Binder. To do so:

1. Click on the “Activate Binder” button.
2. Wait for Binder to be active. This can take a while; you can watch progress in your browser’s JavaScript console. When a line like Kernel: connected (89dfd3c8... appears, Binder should be ready to go.
3. Run the following before any other script on the page. This sets the right path context for loading examples etc.

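import os

# Move into the documentation source directory so relative example paths resolve.
# If the directory does not exist, stay in the current directory.
try:
    os.chdir("docsource/source")
except OSError:
    pass
print("Page is ready. You can now run other code blocks on this page.")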
