Thursday, November 10, 2016

Distribution visualization with SPARQL and d3js

After more than a year of dormancy, I have picked up Kerameikos.org development again in preparation for a collaboration with the Beazley Archive of the University of Oxford and, hopefully, a grant application. We hope to publish the entire array of identifiers necessary for Archaic and Classical Greek pottery and to develop more advanced analysis and visualization systems built upon open vase data we can acquire from a variety of sources (e.g., the British Museum and the Harvard Art Museums).

Aside from some minor stylistic updates to the site, I implemented two major changes:

1. I rewrote the geographic visualizations to serialize the SPARQL response into GeoJSON for rendering in Leaflet. This replaces the OpenLayers-based Timemap library, which has not seen active development in at least five years. I really like being able to scroll through a timeline of objects, but I will have to wait until another Leaflet plugin can do something similar.

2. I implemented SPARQL-based distribution visualization with the d3plus plugin to d3js. The code was almost entirely ported from the Nomisma.org distribution analysis features I have recently been working on.
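
The first change, turning a SPARQL response into GeoJSON that Leaflet can load, can be sketched roughly as follows. This is a Python illustration of the idea, not the serialization Kerameikos actually uses, and the variable names place, label, lat, and long (as well as the sample row) are assumptions about the shape of the response rather than names taken from the real queries.

```python
def sparql_json_to_geojson(response):
    """Convert a SPARQL JSON results document into a GeoJSON FeatureCollection.

    Assumes (hypothetically) that each row binds ?place, ?label, ?lat and ?long.
    """
    features = []
    for binding in response["results"]["bindings"]:
        # Rows without coordinates cannot be mapped, so skip them.
        if "lat" not in binding or "long" not in binding:
            continue
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON positions are ordered [longitude, latitude].
                "coordinates": [
                    float(binding["long"]["value"]),
                    float(binding["lat"]["value"]),
                ],
            },
            "properties": {
                "uri": binding["place"]["value"],
                "name": binding["label"]["value"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Invented sample response, for illustration only.
sample = {
    "results": {
        "bindings": [
            {
                "place": {"type": "uri", "value": "http://example.org/place/1"},
                "label": {"type": "literal", "value": "Example place"},
                "lat": {"type": "literal", "value": "37.97"},
                "long": {"type": "literal", "value": "23.72"},
            },
            {
                # No coordinates: this row is dropped.
                "place": {"type": "uri", "value": "http://example.org/place/2"},
                "label": {"type": "literal", "value": "Unlocated place"},
            },
        ]
    }
}

geojson = sparql_json_to_geojson(sample)
```

The resulting dictionary can be dumped with json.dumps and handed to Leaflet's L.geoJSON layer on the client side.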

This builds on the previously established model where request parameters are parsed within Orbeon's XML Pipeline Language and constructed into an XML object that is then transformed with XSLT into a textual SPARQL query. The difference here is that the example vases from the Getty and British Museum are represented as Linked Open Data with CIDOC-CRM, and are defined by the typological URIs in their own vocabulary systems (AAT/ULAN/TGN and the British Museum's own internal LOD thesaurus, respectively). As a result, the XML model that represents the query is significantly more complex than the one behind the Nomisma visualizations, which are built on a simpler RDF model and only a single vocabulary system.

In the query below, we are getting the distribution of shapes for Red Figure pottery:

SELECT DISTINCT ?concept ?label (count(?concept) as ?count) WHERE {
  {
    SELECT ?1 WHERE { kid:red_figure skos:exactMatch ?1}
  }
  ?object crm:P32_used_general_technique ?1.
  ?object kon:hasShape ?dist.
  {
    SELECT ?dist ?label ?concept WHERE {
      ?concept skos:exactMatch ?dist;
               skos:prefLabel ?label FILTER langMatches(lang(?label), "en")}
  }
} GROUP BY ?concept ?label ORDER BY ?label

As you can see, a subselect gathers all of the URIs that are SKOS exact matches for the Kerameikos URI, and the outer query then finds the objects created with this technique. We use kon:hasShape, a simplified property that better represents knowledge organization specifically within ceramics studies, to get the shape URIs. Like techniques, these URIs may be in the AAT or BM thesaurus, so a second subselect finds the matching Kerameikos URI and extracts its English label. Here is the full query. Here are the results of the SPARQL query in HTML.
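
To make the logic of the query easier to follow, here is a small in-memory imitation of its three steps in Python. It does not talk to a SPARQL endpoint, and every URI, label, and object in it is invented toy data; only kid:red_figure comes from the query above.

```python
from collections import Counter

# Toy stand-in for skos:exactMatch: Kerameikos concept -> external URIs.
EXACT_MATCH = {
    "kid:red_figure": {"aat:technique_1", "bm:technique_1"},
    "kid:amphora": {"aat:shape_1"},
    "kid:kylix": {"bm:shape_2"},
}
# Toy stand-in for the English skos:prefLabel of each shape concept.
LABELS = {"kid:amphora": "Amphora", "kid:kylix": "Kylix"}
# Toy vases, each described with external technique and shape URIs.
OBJECTS = [
    {"technique": "aat:technique_1", "shape": "aat:shape_1"},
    {"technique": "bm:technique_1", "shape": "bm:shape_2"},
    {"technique": "bm:technique_1", "shape": "aat:shape_1"},
    {"technique": "aat:other", "shape": "bm:shape_2"},
]


def shape_distribution(technique_id):
    # Step 1 (first subselect): external URIs matching the technique.
    techniques = EXACT_MATCH[technique_id]
    # Step 3 (second subselect): map external shape URIs back to the
    # Kerameikos concept that declares them as exact matches.
    shape_to_concept = {
        external: concept
        for concept, externals in EXACT_MATCH.items()
        if concept in LABELS
        for external in externals
    }
    # Step 2: count objects made with the technique, grouped by shape concept.
    counts = Counter()
    for obj in OBJECTS:
        if obj["technique"] in techniques and obj["shape"] in shape_to_concept:
            counts[shape_to_concept[obj["shape"]]] += 1
    # Mirror GROUP BY ?concept ?label ORDER BY ?label.
    rows = [(concept, LABELS[concept], n) for concept, n in counts.items()]
    return sorted(rows, key=lambda row: row[1])


distribution = shape_distribution("kid:red_figure")
```

The fourth toy object is excluded because its technique URI is not an exact match for kid:red_figure, which is the role the first subselect plays in the real query.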

With regard to the XML model that forms the SPARQL query, the XPL/XSLT stylesheet is on GitHub. Below is an example, where $object is the object in the triple. The $id variable must be unique within the query, so it is derived from the position of that piece of the query within the HTTP request parameter. The parameter, in this case, is 'compare=technique kid:red_figure'. Queries can be made more precise by concatenating multiple predicate-object pairs with a semicolon.

<statements>
    <select id="{$id}">
        <triple s="{$object}" p="skos:exactMatch" o="?{$id}"/>
    </select>
    <triple s="?object" p="crm:P32_used_general_technique" o="?{$id}"/>
    <triple s="?object" p="kon:hasShape" o="?dist"/>
</statements>

This XML is transformed with XSLT into SPARQL and executed in the XPL. As in Nomisma, you can compare multiple query sets.
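
The two stages described above, building the statements model from the request parameter and rendering it as SPARQL text, can be approximated in Python as a rough sketch. The real implementation is XPL and XSLT; the function names build_statements and render, and the keyword-to-property table, are hypothetical, and only the 'technique' keyword is taken from this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup from parameter keyword to RDF property.
PROPERTIES = {"technique": "crm:P32_used_general_technique"}


def build_statements(compare):
    """Build a <statements> element from e.g. 'technique kid:red_figure'."""
    statements = ET.Element("statements")
    # Semicolons separate predicate-object pairs; the 1-based position of
    # each pair becomes its id, which keeps variable names unique.
    for position, pair in enumerate(compare.split(";"), start=1):
        keyword, obj = pair.strip().split(" ", 1)
        var = str(position)
        select = ET.SubElement(statements, "select", id=var)
        ET.SubElement(select, "triple", s=obj, p="skos:exactMatch", o="?" + var)
        ET.SubElement(statements, "triple", s="?object",
                      p=PROPERTIES[keyword], o="?" + var)
    ET.SubElement(statements, "triple", s="?object", p="kon:hasShape", o="?dist")
    return statements


def render(element, indent="  "):
    """Recursively turn <select> and <triple> elements into SPARQL lines."""
    lines = []
    for child in element:
        if child.tag == "select":
            lines.append(indent + "{ SELECT ?" + child.get("id") + " WHERE {")
            lines.extend(render(child, indent + "  "))
            lines.append(indent + "} }")
        elif child.tag == "triple":
            lines.append(
                f"{indent}{child.get('s')} {child.get('p')} {child.get('o')} ."
            )
    return lines


model = build_statements("technique kid:red_figure")
sparql_lines = render(model)
print("\n".join(sparql_lines))
```

Keeping the intermediate XML model separate from the text rendering means a second query set for comparison only needs a second call to build_statements with a different parameter value.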

Distribution of shapes for Red vs. Black Figure Greek pottery (from a limited sample size)

On Kerameikos ID pages, charts are generated via AJAX. On the distribution page, they are generated from request parameters instead, which makes it possible to copy and paste charts. Furthermore, you can download a CSV file that represents the datasets, which will include geographic coordinates if Production Place is the distribution category.
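
As a minimal sketch of the CSV download, assuming the distribution rows have already been collected into dictionaries, the serialization could look like this in Python. The column names uri, label, count, lat, and long are assumptions for illustration, not the headers of the real export.

```python
import csv
import io


def distribution_to_csv(rows, include_coordinates=False):
    """Serialize distribution rows to CSV text.

    Coordinates are only written when requested, mirroring the case where
    Production Place is the distribution category.
    """
    fieldnames = ["uri", "label", "count"]
    if include_coordinates:
        fieldnames += ["lat", "long"]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # silently drop keys that are not columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented rows, for illustration only.
csv_text = distribution_to_csv(
    [{"uri": "kid:amphora", "label": "Amphora", "count": 2}]
)
```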
