Thursday, December 9, 2021

Using Wikidata APIs to regularize findspots

Combining attributes of two different pipelines, I have made a substantial update to the RDF ingestion process in the Kerameikos.org XForms back-end. As previously discussed in the development of a Linked Art JSON-LD harvester in fall 2019, findspot gazetteer URIs that match the Getty Thesaurus of Geographic Names, the UK's Ordnance Survey, and Geonames.org are reconciled to Wikidata URIs. A SPARQL query is then issued to the Wikidata endpoint to extract the coordinates, feature type/class, and parent geographic entity, if applicable.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>

CONSTRUCT {
  ?place a skos:Concept ;
         rdfs:label ?placeLabel ;
         skos:closeMatch ?osgeo ;
         skos:closeMatch ?tgn ;
         skos:closeMatch ?geonames ;
         skos:closeMatch ?pleiades ;
         skos:broader ?parent ;
         dct:coverage ?coord ;
         dct:type ?type .
}
WHERE {
  ?place wdt:P1667 "7015539" . # TGN ID for Vulci
  OPTIONAL {?place wdt:P3120 ?osgeoid .
    BIND (uri(concat("http://data.ordnancesurvey.co.uk/id/", ?osgeoid)) as ?osgeo)}
  OPTIONAL {?place wdt:P1667 ?tgnid .
    BIND (uri(concat("http://vocab.getty.edu/tgn/", ?tgnid)) as ?tgn)}
  OPTIONAL {?place wdt:P1566 ?geonamesid .
    BIND (uri(concat("https://sws.geonames.org/", ?geonamesid, "/")) as ?geonames)}
  OPTIONAL {?place wdt:P1584 ?pleiadesid .
    BIND (uri(concat("https://pleiades.stoa.org/places/", ?pleiadesid)) as ?pleiades)}
  OPTIONAL {?place p:P625/ps:P625 ?coord}
  OPTIONAL {?place wdt:P131 ?parent}
  OPTIONAL {?place wdt:P31/wdt:P279+ ?type . FILTER (?type = wd:Q486972)} # human settlement
  OPTIONAL {?place wdt:P31 ?type . FILTER (?type = wd:Q839954)} # archaeological site
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en"
  }
}
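
The first line of the WHERE clause is the only part of the query that changes from one findspot to the next. As a minimal sketch (not the XForms implementation), the following Python shows how a gazetteer URI might be turned into that triple pattern. The Wikidata property IDs come from the query above; the function name and the regular expressions are illustrative assumptions modelled on the URI namespaces the query concatenates.

```python
import re

# Wikidata property IDs are taken from the query above. The regular
# expressions are assumptions based on the namespaces used in its BIND clauses.
GAZETTEER_PATTERNS = [
    (re.compile(r"^https?://vocab\.getty\.edu/tgn/(\d+)$"), "P1667"),
    (re.compile(r"^https?://sws\.geonames\.org/(\d+)/?$"), "P1566"),
    (re.compile(r"^https?://data\.ordnancesurvey\.co\.uk/id/(\w+)$"), "P3120"),
]


def wikidata_lookup_pattern(gazetteer_uri):
    """Return the triple pattern that binds ?place, or None if unrecognized."""
    for regex, prop in GAZETTEER_PATTERNS:
        match = regex.match(gazetteer_uri.strip())
        if match:
            # The gazetteer ID is stored in Wikidata as a plain string literal.
            return f'?place wdt:{prop} "{match.group(1)}" .'
    return None
```

Under these assumptions, passing http://vocab.getty.edu/tgn/7015539 to wikidata_lookup_pattern() produces the same pattern that opens the WHERE clause above, and an unrecognized URI returns None so that the caller can skip the lookup.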

An iterative process generates RDF for each place (crm:E53_Place), its spatial feature (dually crmgeo:SP5_Geometric_Place_Expression and geo:SpatialThing, to be compatible with both CIDOC-CRM and the WGS84 ontology), and its parent region. A spatial feature is only attached to a place that is a human settlement or archaeological site, so regions and nations never receive coordinates that merely represent their central point.
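
The rule for attaching spatial features can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production code: the function name, the "#this" fragment for the spatial node, and the use of crm:P89_falls_within for the parent region are hypothetical, and the geo:location and crm:P168_place_is_defined_by links are borrowed from the description of production places in the December 2 post below.

```python
# Wikidata classes named in the query above.
SITE_TYPES = {
    "http://www.wikidata.org/entity/Q486972",  # human settlement
    "http://www.wikidata.org/entity/Q839954",  # archaeological site
}


def place_triples(place, label, types, lat_long=None, parent=None):
    """Return (subject, predicate, object) triples for one findspot."""
    triples = [
        (place, "rdf:type", "crm:E53_Place"),
        (place, "rdfs:label", label),
    ]
    if parent:
        # Assumption: the post does not name the property used for the parent.
        triples.append((place, "crm:P89_falls_within", parent))
    # Only settlements and sites receive a point; regions and nations do not.
    if lat_long and SITE_TYPES & set(types):
        node = place + "#this"  # hypothetical URI for the spatial feature
        lat, long = lat_long
        triples += [
            (place, "geo:location", node),
            (place, "crm:P168_place_is_defined_by", node),
            (node, "rdf:type", "geo:SpatialThing"),
            (node, "rdf:type", "crmgeo:SP5_Geometric_Place_Expression"),
            (node, "geo:lat", lat),
            (node, "geo:long", long),
        ]
    return triples
```

Calling the function once for the findspot and once for its parent yields the small hierarchy described above, with coordinates appearing only on the findspot when its types include one of the two Wikidata classes.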

Until now, this workflow applied only to Linked Art JSON-LD ingestion, which had been prototyped with a handful of vases from the Indianapolis Museum of Art at Newfields. Subsequently, we have ingested several other collections, where CSV or JSON exports were loaded into OpenRefine for further reconciliation and exported into the CIDOC-CRM model through OpenRefine's templating system. Before implementing the templating system for the Tampa Museum of Art, I had written a PHP script to turn the British Museum's CSV export from OpenRefine (following my own cleanup) into RDF. The script performed the Wikidata SPARQL lookups illustrated above in order to incorporate the place hierarchy directly into the RDF/XML file alongside the BM's objects, which I then uploaded into the Kerameikos.org SPARQL endpoint. I also applied this workflow to the Getty collection.

Now that the Wikidata reconciliation and SPARQL-based lookups have been integrated directly into the RDF ingestion system in the Kerameikos.org XForms engine, I have eliminated any need for creating bespoke PHP scripts to perform findspot hierarchy lookups for any collection that we integrate into the project.


Essentially, museums can either provide Linked Art JSON-LD for harvesting (if the JSON-LD includes the necessary Kerameikos or Getty URIs) or supply a spreadsheet, which can be cleaned up in OpenRefine (with findspots reconciled directly to Wikidata URIs) and exported directly into RDF/XML following the templating principles outlined above. The Kerameikos.org ingestion workflow will fill in any gaps in findspot coverage and geographic hierarchy without further software intervention. This is a significant advancement in the sustainability of our data integration workflow and allows us to fully standardize the data model for findspot places.

I plan to implement these updates in the Nomisma.org ingestion engine next.

Thursday, December 2, 2021

Aligning Kerameikos.org more directly with CIDOC-CRM

When the Kerameikos.org project was founded in 2013, our intent was for the LOD thesaurus system to be modeled primarily in SKOS, with instances in certain categories assigned subject-specific RDF classes from our own ontology (e.g., kon:Shape) or classes from existing ontologies (for example, foaf:Person and foaf:Group).

Our thesaurus is still built around SKOS, but since we have aligned our vase aggregation RDF model with Linked Art (a community-built CIDOC-CRM profile serialized as JSON-LD), I have subsequently made some alterations to the classes we use for concept URIs and updated our ontology.

These changes affect the RDF concepts themselves, and I have also searched and replaced the classes throughout the Kerameikos codebase.

  • foaf:Person has been replaced with crm:E21_Person
  • foaf:Group has been replaced with crm:E74_Group
  • kon:ProductionPlace has been replaced with crm:E53_Place and has been deprecated in the Kerameikos ontology.
    • Spatial expressions are dually compatible with both CIDOC-CRM and the WGS84 ontology in that the crm:E53_Place concept includes both geo:location and crm:P168_place_is_defined_by properties linking to the same node URI, which carries both the geo:SpatialThing and crmgeo:SP5_Geometric_Place_Expression classes. These spatial features may include geo:lat and geo:long (for points) or osgeo:asGeoJSON as before, but now also include the crmgeo:asWKT property with a datatype of http://www.opengis.net/ont/geosparql#wktLiteral, which should make these points and polygons compatible with endpoints that support the GeoSPARQL protocol. See the machine-readable data underlying http://kerameikos.org/id/athens, for example.
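
To illustrate the point serializations mentioned in the list above, here is a minimal Python sketch. The function names are hypothetical and this is not the code that generates the Kerameikos data; it only shows that both WKT and GeoJSON place longitude before latitude, which is an easy detail to get wrong when starting from geo:lat and geo:long.

```python
import json

# Datatype URI quoted in the text above.
WKT_LITERAL = "http://www.opengis.net/ont/geosparql#wktLiteral"


def point_as_wkt(lat, long):
    """Serialize a point as WKT, which lists x (longitude) before y (latitude)."""
    return f"POINT({long} {lat})"


def point_as_wkt_literal(lat, long):
    """Wrap the WKT string as a typed literal in Turtle syntax."""
    return f'"{point_as_wkt(lat, long)}"^^<{WKT_LITERAL}>'


def point_as_geojson(lat, long):
    """Serialize the same point as a GeoJSON geometry (also longitude first)."""
    return json.dumps({"type": "Point", "coordinates": [long, lat]})
```

The typed literal returned by point_as_wkt_literal() is the kind of value that would serve as the object of crmgeo:asWKT, and the string from point_as_geojson() corresponds to osgeo:asGeoJSON.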

The Kerameikos.org ontology page has been significantly revised to make it more transparent than before, in line with improvements we have made to the Nomisma page in recent years. The ontology URI now supports content negotiation, so RDF/XML or Turtle can be requested as alternatives by sending the relevant MIME type in the Accept header. We have also implemented ontology versions, so that you can compare the 2015 edition with the current 2021 revision.
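
Content negotiation of this kind can be exercised with a short Python sketch using only the standard library. The ontology URI below is an assumption, since the post does not spell it out, and the function names are hypothetical; application/rdf+xml and text/turtle are the standard MIME types for the two serializations.

```python
from urllib.request import Request, urlopen

# Assumption: the post does not give the ontology URI explicitly.
ONTOLOGY_URI = "http://kerameikos.org/ontology"

# Standard MIME types for the two serializations named in the text.
MIME_TYPES = {
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
}


def ontology_request(fmt, uri=ONTOLOGY_URI):
    """Build (but do not send) a request asking for one serialization."""
    return Request(uri, headers={"Accept": MIME_TYPES[fmt]})


def fetch_ontology(fmt):
    """Send the request and return the response body as text."""
    with urlopen(ontology_request(fmt), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling fetch_ontology("turtle") would request the Turtle serialization, while fetch_ontology("rdfxml") would ask the same URI for RDF/XML.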

The ontology has been tightened up with better definitions of our few custom ceramic-oriented RDF classes (Shape, Technique, and Style), all of which are subclasses of crm:E55_Type. There is one property, kon:hasShape, a subproperty of crm:P2_has_type, which links a Human-Made Object (the vase, as its rdfs:domain) to a kon:Shape (as its rdfs:range). This expression is therefore fully compatible with CIDOC-CRM's own domains and ranges while also conforming to the standard intellectual vocabulary of pottery specialists. We may implement a "Fabric" class as a subclass of crm:E57_Material in order to make technical distinctions between the clay from Corinth and Attica, for example. We will expand the scope of our ontology, and its relationship to CIDOC-CRM, as use cases arise.
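
The compatibility claim can be made concrete with a toy RDFS-style inference in Python. This is a sketch under assumptions rather than a description of any Kerameikos software: kon:Technique and kon:Style are inferred from the class names listed above, crm:E22_Human-Made_Object is assumed to be the local name of the Human-Made Object class, and the ex: identifiers in the usage note are placeholders.

```python
# Ontology statements described in the text (local names partly assumed).
SUBPROPERTY_OF = {"kon:hasShape": "crm:P2_has_type"}
DOMAIN = {"kon:hasShape": "crm:E22_Human-Made_Object"}
RANGE = {"kon:hasShape": "kon:Shape"}
SUBCLASS_OF = {
    "kon:Shape": "crm:E55_Type",
    "kon:Technique": "crm:E55_Type",
    "kon:Style": "crm:E55_Type",
}


def entail(triples):
    """Apply the four rules repeatedly until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        new = set()
        for s, p, o in inferred:
            if p in SUBPROPERTY_OF:
                new.add((s, SUBPROPERTY_OF[p], o))
            if p in DOMAIN:
                new.add((s, "rdf:type", DOMAIN[p]))
            if p in RANGE:
                new.add((o, "rdf:type", RANGE[p]))
            if p == "rdf:type" and o in SUBCLASS_OF:
                new.add((s, "rdf:type", SUBCLASS_OF[o]))
        changed = not new <= inferred
        inferred |= new
    return inferred
```

Given the single statement ("ex:vase", "kon:hasShape", "ex:amphora"), entail() also yields the generic crm:P2_has_type link and types the shape as both kon:Shape and crm:E55_Type, which is what allows a consumer that only understands CIDOC-CRM to read the data.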