Thursday, December 9, 2021

Using Wikidata APIs to regularize findspots

Combining attributes of two different pipelines, I have made a substantial update to the RDF ingestion process in the Kerameikos.org XForms back-end. As previously discussed during the development of a Linked Art JSON-LD harvester in fall 2019, findspot gazetteer URIs that match the Getty Thesaurus of Geographic Names (TGN), the UK's Ordnance Survey, or Geonames.org are reconciled to Wikidata URIs. A SPARQL query is then issued to the Wikidata endpoint to extract each place's coordinates, feature type/class, and, where applicable, its parent geographic entity.

CONSTRUCT {
  ?place a skos:Concept ;
         rdfs:label ?placeLabel ;
         skos:closeMatch ?osgeo ;
         skos:closeMatch ?tgn ;
         skos:closeMatch ?geonames ;
         skos:closeMatch ?pleiades ;
         skos:broader ?parent ;
         dct:coverage ?coord ;
         dct:type ?type .
}
WHERE {
  ?place wdt:P1667 "7015539" . # TGN ID for Vulci
  OPTIONAL { ?place wdt:P3120 ?osgeoid .
    BIND (URI(CONCAT("http://data.ordnancesurvey.co.uk/id/", ?osgeoid)) AS ?osgeo) }
  OPTIONAL { ?place wdt:P1667 ?tgnid .
    BIND (URI(CONCAT("http://vocab.getty.edu/tgn/", ?tgnid)) AS ?tgn) }
  OPTIONAL { ?place wdt:P1566 ?geonamesid .
    BIND (URI(CONCAT("https://sws.geonames.org/", ?geonamesid, "/")) AS ?geonames) }
  OPTIONAL { ?place wdt:P1584 ?pleiadesid .
    BIND (URI(CONCAT("https://pleiades.stoa.org/places/", ?pleiadesid)) AS ?pleiades) }
  OPTIONAL { ?place p:P625/ps:P625 ?coord }
  OPTIONAL { ?place wdt:P131 ?parent }
  OPTIONAL { ?place wdt:P31/wdt:P279+ ?type . FILTER (?type = wd:Q486972) } # human settlement
  OPTIONAL { ?place wdt:P31 ?type . FILTER (?type = wd:Q839954) } # archaeological site
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en"
  }
}
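A query like this can be issued programmatically against the Wikidata Query Service. The sketch below is a minimal illustration in Python, not the actual XForms implementation: the query is a trimmed version of the one above, parameterized by TGN ID, and the function and User-Agent names are my own.

```python
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def build_findspot_query(tgn_id: str) -> str:
    """Build a CONSTRUCT query for a place identified by its Getty TGN ID (P1667)."""
    return """CONSTRUCT {
  ?place a skos:Concept ;
         rdfs:label ?placeLabel ;
         skos:closeMatch ?tgn ;
         skos:broader ?parent ;
         dct:coverage ?coord .
}
WHERE {
  ?place wdt:P1667 "%s" .
  OPTIONAL { ?place wdt:P1667 ?tgnid .
    BIND (URI(CONCAT("http://vocab.getty.edu/tgn/", ?tgnid)) AS ?tgn) }
  OPTIONAL { ?place p:P625/ps:P625 ?coord }
  OPTIONAL { ?place wdt:P131 ?parent }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}""" % tgn_id

def fetch_turtle(query: str) -> str:
    """POST the query to the Wikidata endpoint and return the Turtle serialization."""
    data = urllib.parse.urlencode({"query": query}).encode()
    req = urllib.request.Request(
        WDQS_ENDPOINT,
        data=data,
        headers={"Accept": "text/turtle",
                 "User-Agent": "findspot-lookup-sketch/0.1"})  # hypothetical agent string
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_turtle(build_findspot_query("7015539"))` would return the RDF for Vulci; the Wikidata endpoint predefines the wd/wdt/p/ps prefixes as well as skos, rdfs, and dct, so no PREFIX declarations are required.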

An iterative process generates RDF for each place (crm:E53_Place) and its spatial feature (typed as both crmgeo:SP5_Geometric_Place_Expression and geo:SpatialThing, for compatibility with both CIDOC-CRM and the WGS84 ontology), as well as its parent region. A spatial feature is attached to a place only if that place is a human settlement or archaeological site, so that coordinates representing the central point of a region or nation are excluded.
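To illustrate the dual typing, here is a minimal sketch that emits Turtle for a place and its spatial feature. The URIs, the "#this" feature convention, and the linking predicates (crm:P89_falls_within, geo:location) are my own illustrative assumptions; the post does not specify which predicates the Kerameikos.org model uses.

```python
def place_triples(place_uri, label, lat, lon, parent_uri=None):
    """Emit illustrative Turtle for a crm:E53_Place and its spatial feature.

    The feature URI convention (place URI + "#this") and the linking
    predicates are hypothetical, chosen only for this sketch.
    """
    feature = place_uri + "#this"
    lines = [
        f"<{place_uri}> a crm:E53_Place ;",
        f'    rdfs:label "{label}"@en ;',
    ]
    if parent_uri:
        lines.append(f"    crm:P89_falls_within <{parent_uri}> ;")
    lines += [
        f"    geo:location <{feature}> .",
        "",
        # Dual typing keeps the feature compatible with both CIDOC-CRM
        # (via CRMgeo) and the W3C WGS84 vocabulary.
        f"<{feature}> a crmgeo:SP5_Geometric_Place_Expression, geo:SpatialThing ;",
        f'    geo:lat "{lat}"^^xsd:decimal ;',
        f'    geo:long "{lon}"^^xsd:decimal .',
    ]
    return "\n".join(lines)
```

In the real workflow this generation happens inside the XForms engine; the sketch only shows the shape of the output graph, with the feature omitted entirely when the place is not a settlement or archaeological site.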

This workflow previously applied only to Linked Art JSON-LD ingestion, which had been prototyped with a handful of vases from the Indianapolis Museum of Art at Newfields. We have since ingested several other collections, loading CSV or JSON exports into OpenRefine for further reconciliation and exporting them into the CIDOC-CRM model through OpenRefine's templating system. Before implementing the template system for the Tampa Museum of Art, I had written a PHP script to transform the British Museum's CSV export from OpenRefine (following my own cleanup) into RDF. That script performed the Wikidata SPARQL lookups illustrated above in order to incorporate the place hierarchy directly into the RDF/XML file alongside the BM's objects, which I then uploaded into the Kerameikos.org SPARQL endpoint. I also applied this workflow to the Getty collection.

Now that Wikidata reconciliation and SPARQL-based lookups are integrated directly into the RDF ingestion system in the Kerameikos.org XForms engine, there is no longer any need to write bespoke PHP scripts to perform findspot hierarchy lookups for each collection we integrate into the project.


Essentially, museums can either provide Linked Art JSON-LD for harvesting (provided the JSON-LD includes the necessary Kerameikos or Getty URIs), or supply a spreadsheet that we clean up in OpenRefine (with findspots reconciled directly to Wikidata URIs) and export into RDF/XML following the templating principles outlined above. The Kerameikos.org ingestion workflow fills in any gaps in findspot coverage and geographic hierarchy without further software intervention. This is a significant advance in the sustainability of our data integration workflow, and it allows us to fully standardize the data model for findspot places.

I plan to implement these updates into the Nomisma.org ingestion engine next.
