Friday, October 25, 2019

Linked Art data harvesting and aligning to ARIADNE for archaeological context

As mentioned in the related numismatic blog post, "First pass at processing Linked Art JSON-LD to Nomisma RDF," and in the slides that the Smithsonian's Adam Soroka presented on my behalf at the Linked Art showcase last month at the Victoria and Albert Museum in London, Linked Art JSON-LD harvesting is now functional in the kerameikos.org back-end. The workflow was built around test data provided by Sami Norling at the Indianapolis Museum of Art at Newfields, supplemented with some additional properties and Getty URIs. The JSON-LD is processed by the XForms engine in Orbeon (which powers both the Nomisma and Kerameikos frameworks). Getty vocabulary URIs are mapped to the applicable Kerameikos ones, and the JSON-LD is distilled into its essential graph form as RDF/XML and posted to the Kerameikos SPARQL endpoint.

For each JSON-LD GET operation, the following three tasks are initiated:

Automatic reconciliation of URIs to Kerameikos

Distinct entities related to each vase (shapes, materials, styles, techniques, artists, production places, etc.) are aggregated into a list. A SPARQL query is executed for each one (that isn't already a Kerameikos URI) in order to get the equivalent Kerameikos URI via skos:exactMatch. These mappings are stored so that SPARQL queries do not need to be executed multiple times for the same URI.
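The caching behaviour described here can be sketched roughly as follows. This is a minimal Python illustration, not the XForms code that Kerameikos runs: `lookup_exact_match` is a hypothetical stand-in for the skos:exactMatch SPARQL query, and the `http://kerameikos.org/id/` namespace check is an assumption.

```python
from typing import Callable, Dict, Iterable, Optional

# Assumed namespace for Kerameikos concept URIs.
KERAMEIKOS_NS = "http://kerameikos.org/id/"


def reconcile(
    uris: Iterable[str],
    lookup_exact_match: Callable[[str], Optional[str]],
    cache: Dict[str, Optional[str]],
) -> Dict[str, Optional[str]]:
    """Map each entity URI to a Kerameikos URI, querying once per distinct URI.

    lookup_exact_match is a hypothetical callable that would run the
    skos:exactMatch SPARQL query and return the matching URI, or None.
    """
    result: Dict[str, Optional[str]] = {}
    for uri in uris:
        if uri.startswith(KERAMEIKOS_NS):
            # Already a Kerameikos URI: no query needed.
            result[uri] = uri
            continue
        if uri not in cache:
            # First time this URI is seen: run the query and store the mapping.
            cache[uri] = lookup_exact_match(uri)
        result[uri] = cache[uri]
    return result
```

Passing the cache in as an argument lets the same mappings be reused across several harvested objects, which is the point of storing them.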

Normalizing findspot URIs to Wikidata entities

Findspot URIs follow the proposed ARIADNE Plus data model (more details below) and may come from Geonames, Pleiades, the Getty Thesaurus of Geographic Names (TGN), Ordnance Survey, or Wikidata. Each one is first queried in the Kerameikos endpoint to see whether it has already been normalized and harvested. If not, a SPARQL query is sent to the Wikidata.org endpoint to find the Wikidata Q entity related to the gazetteer URI. The Wikidata entity URI therefore serves as the primary URI scheme for findspots, regardless of which gazetteer a dataset uses locally. The same SPARQL query also gathers the skos:exactMatch URIs from the Getty TGN, Pleiades, Ordnance Survey, and Geonames, when available, and extracts latitudes and longitudes.

CONSTRUCT {
  ?place a skos:Concept ;
         skos:prefLabel ?placeLabel ;
         skos:exactMatch ?osgeo ;
         skos:exactMatch ?tgn ;
         skos:exactMatch ?geonames ;
         skos:exactMatch ?pleiades ;
         dct:coverage ?coord .
}
WHERE {
  ?place wdt:P1667 "7015539" . # TGN ID for Vulci
  OPTIONAL { ?place wdt:P3120 ?osgeoid .
    BIND (uri(concat("http://data.ordnancesurvey.co.uk/id/", ?osgeoid)) AS ?osgeo) }
  OPTIONAL { ?place wdt:P1667 ?tgnid .
    BIND (uri(concat("http://vocab.getty.edu/tgn/", ?tgnid)) AS ?tgn) }
  OPTIONAL { ?place wdt:P1566 ?geonamesid .
    BIND (uri(concat("http://sws.geonames.org/", ?geonamesid, "/")) AS ?geonames) }
  OPTIONAL { ?place wdt:P1584 ?pleiadesid .
    BIND (uri(concat("https://pleiades.stoa.org/places/", ?pleiadesid)) AS ?pleiades) }
  OPTIONAL { ?place p:P625/ps:P625 ?coord }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en"
  }
}
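
The first triple pattern in the query depends on which gazetteer the incoming URI belongs to. As a rough sketch of that step in Python (not the actual XForms logic): the Wikidata property IDs come from the query above, while the regular expressions are assumptions modelled on the URI templates in its BIND clauses, and `gazetteer_triple` is a hypothetical helper name.

```python
import re
from typing import Optional

# (URI pattern, Wikidata property) pairs. The properties are those used in
# the query above; the patterns are assumed from its BIND URI templates.
GAZETTEER_PATTERNS = [
    (re.compile(r"^https?://vocab\.getty\.edu/tgn/(\d+)$"), "P1667"),
    (re.compile(r"^https?://sws\.geonames\.org/(\d+)/?$"), "P1566"),
    (re.compile(r"^https?://pleiades\.stoa\.org/places/(\d+)$"), "P1584"),
    (re.compile(r"^https?://data\.ordnancesurvey\.co\.uk/id/(\d+)$"), "P3120"),
]


def gazetteer_triple(uri: str) -> Optional[str]:
    """Return the SPARQL triple pattern that finds the Wikidata entity for a
    gazetteer URI, or None if the URI matches no known gazetteer."""
    for pattern, prop in GAZETTEER_PATTERNS:
        match = pattern.match(uri)
        if match:
            return f'?place wdt:{prop} "{match.group(1)}" .'
    return None
```

For the TGN URI for Vulci, `gazetteer_triple("http://vocab.getty.edu/tgn/7015539")` produces the same pattern that opens the WHERE clause above.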

Furthermore, a second SPARQL query is sent to Wikidata to get the geographic hierarchy and ingest simple RDF for these places as well. This makes it possible to query for all vases found in Lazio regardless of whether they have been linked directly to Vulci or Veii. Note: this hierarchy is based on modern administrative divisions, not historical boundaries (Vulci and Veii are historically in Etruria). It might be possible to use a combination of deposit date and place to derive a historical region once projects like the World Historical Gazetteer become more developed with regard to both time and space.
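
To illustrate why the ingested hierarchy makes region-level queries possible, here is a small Python sketch that walks parent links upward. The parent links are simplified and purely illustrative, not the data harvested from Wikidata, and the vase identifiers are hypothetical.

```python
from typing import Dict, List

# Illustrative, simplified child -> parent links; NOT the harvested data.
PARENT: Dict[str, str] = {
    "Vulci": "Province of Viterbo",
    "Veii": "Metropolitan City of Rome",
    "Province of Viterbo": "Lazio",
    "Metropolitan City of Rome": "Lazio",
    "Lazio": "Italy",
}


def ancestors(place: str, parent: Dict[str, str]) -> List[str]:
    """Return every place above `place`, nearest first."""
    chain: List[str] = []
    seen = set()
    current = place
    # The `seen` set guards against accidental cycles in the parent links.
    while current in parent and current not in seen:
        seen.add(current)
        current = parent[current]
        chain.append(current)
    return chain


def found_within(findspots: Dict[str, str], region: str, parent: Dict[str, str]) -> List[str]:
    """Return the objects whose findspot is `region` or lies anywhere below it."""
    return sorted(
        obj
        for obj, place in findspots.items()
        if place == region or region in ancestors(place, parent)
    )
```

With these links, a vase recorded at Vulci and another recorded at Veii are both returned for Lazio, even though neither is linked to Lazio directly.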

Transforming JSON-LD to CIDOC-CRM RDF/XML

After the pre-processing URI reconciliation tasks, each Human-Made Object in the JSON response is processed into RDF/XML. Much of the cruft that helps developers build human-readable interfaces is eliminated, such as labels for entities and other sorts of textual statements. Date-times are converted into xsd:gYear. Relevant Getty (or other) URIs are mapped to the Kerameikos URIs that have been created so far. Measurements are converted to metric. In order to better conform to the way in which pottery specialists model and query information, several classifications are mapped into Kerameikos.org pottery-specific RDF properties rather than following the Linked Art CIDOC CRM profile explicitly. The Kerameikos model is otherwise nearly identical to Linked Art; the exceptions are the use of kon:hasShape (instead of a generic crm:P2_has_type for an object type) and kon:hasStyle (instead of an artistic genre of a Visual Item).
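
Two of these normalizations can be sketched in a few lines of Python. This is an illustration only, not the transformation Kerameikos runs: it assumes the incoming date-times are ISO-style strings with an optional leading minus sign for BCE years, and that a measurement arrives in inches. Both function names are hypothetical.

```python
import re

# Leading year of an ISO-style timestamp, with an optional minus sign.
_YEAR = re.compile(r"^(-?)(\d{4,})")

INCH_TO_CM = 2.54  # exact by definition


def to_gyear(timestamp: str) -> str:
    """Reduce an ISO-style date-time string to its year in xsd:gYear form."""
    match = _YEAR.match(timestamp.strip())
    if not match:
        raise ValueError(f"no year found in {timestamp!r}")
    sign, digits = match.groups()
    return f"{sign}{digits}"


def inches_to_cm(value: float) -> float:
    """Convert a length in inches to centimetres, rounded to two decimals."""
    return round(value * INCH_TO_CM, 2)
```

Parsing the year with a regular expression rather than a date library is deliberate here, since Python's `datetime` cannot represent years before 1.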

A final product (still a prototype, as the Linked Art data model is still evolving) can be seen here.

Joining Linked Art and ARIADNE

Many museum vases with provenance cite only the place or site name, with no further context about the precise location within the site. Of course, modern excavations will have this level of detail, and the ARIADNE implementation of the CRMarchaeo extension is fully capable of exploiting this fine granularity. Our use cases are much simpler, and many coin findspots follow a similar pattern. However, some finds databases, such as the Portable Antiquities Scheme, might include more precise latitude and longitude as well as the lowest-level parish URI from Ordnance Survey. I think the ARIADNE-based find model should work for both use cases in Kerameikos and Nomisma.

I have put forth a proposal to the Linked Art community, https://github.com/linked-art/linked.art/issues/285, which has not yet received any feedback. It includes some extensions that use the CRMsci and CRMgeo ontologies. I drafted the proposal after consulting the ARIADNE data specialist Achille Felicetti, to whom I was introduced by Holly Wright.

Things to note:

1. A Human-Made Object (HMO) is sci:O19i_was_object_found_by an S19_Encounter_Event. This Encounter might involve individual agents, techniques (metal detecting, as defined in the English Heritage FISH taxonomy), and a place.

2. The place might have known geographic coordinates, but may not. This place might have additional context expressed by P2_has_type (e.g., a tomb, expressed by a Getty AAT URI). A findspot should always point to a parent place defined by a gazetteer URI. A findspot for a vase might be somewhere within Vulci, but is never Vulci directly.

3. I have decided to add a second RDF class to the crmgeo:SP5_Geometric_Place_Expression that encapsulates the WKT coordinates associated with an E53_Place: http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing. This opens the door to splitting the WKT point into geo:lat and geo:long properties, which are much more widely used within the broader LOD ecosystem than the CRMgeo extension. As a result, the E53_Place has two properties pointing to the same SpatialThing node, a geo:location and a crm:P168_place_is_defined_by, so the model remains conformant to CIDOC CRM.
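
Splitting a WKT point into the two values needed for geo:lat and geo:long is straightforward; a minimal Python sketch follows. It assumes a simple two-dimensional `POINT(x y)` literal with longitude first, which is the WKT axis order, and `split_wkt_point` is a hypothetical helper name.

```python
import re
from typing import Tuple

# A plain 2D WKT point such as "Point(11.5 42.25)": x (longitude), then y (latitude).
_WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def split_wkt_point(wkt: str) -> Tuple[float, float]:
    """Return (latitude, longitude) from a WKT point literal."""
    match = _WKT_POINT.match(wkt)
    if not match:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    longitude = float(match.group(1))
    latitude = float(match.group(2))
    return latitude, longitude
```

The one detail worth care is the ordering: WKT puts longitude first, whereas most people say "lat, long", so the function swaps the values on the way out.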

As discussed above, the JSON-LD harvesting workflow normalizes this gazetteer URI to a Wikidata Q entity, extracts the skos:exactMatch URIs, coordinates, and modern geographic hierarchy, and ingests these into the Kerameikos SPARQL endpoint for query and visualization.

A getFindspots API has been implemented in Kerameikos, e.g., http://kerameikos.org/apis/getFindspots?id=stamnos, which yields GeoJSON serialized from a SPARQL query that gets all of the unique findspots for a particular concept.
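
The serialization step might look roughly like the Python sketch below. It is not the API's actual implementation: the row keys (`uri`, `label`, `lat`, `lon`) and the property names in the output are assumptions chosen for illustration. Only the GeoJSON structure itself (a FeatureCollection of Point features with longitude-first coordinates) is standard.

```python
from typing import Any, Dict, Iterable, List


def to_geojson(rows: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Turn findspot rows into a GeoJSON FeatureCollection, one feature per unique URI.

    Each row is assumed to carry 'uri', 'label', 'lat' and 'lon' keys,
    as a SPARQL result might after being flattened into dictionaries.
    """
    features: List[Dict[str, Any]] = []
    seen = set()
    for row in rows:
        if row["uri"] in seen:
            continue  # keep findspots unique
        seen.add(row["uri"])
        features.append(
            {
                "type": "Feature",
                # GeoJSON positions are [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
                "properties": {"uri": row["uri"], "label": row["label"]},
            }
        )
    return {"type": "FeatureCollection", "features": features}
```

The resulting dictionary can be passed to `json.dumps` and loaded directly by most web mapping libraries.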

Geographic distribution of stamnoi.


A stamnos from Newfields is the first object in Kerameikos.org with a findspot (Vulci).

Due to the inherent hierarchy extracted from Wikidata, it is possible to query all vases found in the country of Italy, for example:


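# Prefix declarations added for completeness (standard CIDOC CRM and CRMsci namespaces)
PREFIX crm:    <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX crmsci: <http://www.ics.forth.gr/isl/CRMsci/>

SELECT ?object ?title WHERE {
  ?object crmsci:O19i_was_object_found_by ?encounter ;
          crm:P1_is_identified_by ?id .
  ?id crm:P2_has_type <http://vocab.getty.edu/aat/300404670> ;
      crm:P190_has_symbolic_content ?title .
  # Follow one or more falls_within links from the findspot up to Italy (Q38)
  ?encounter a crmsci:S19_Encounter_Event ;
             crm:P7_took_place_at/crm:P89_falls_within+ <http://www.wikidata.org/entity/Q38>
}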

This advancement is the tip of the iceberg for what's possible once we begin to aggregate a larger corpus of materials with archaeological context.