Querying Oregon Digital RDF, Part 1

Oregon Digital is a digital asset management system jointly developed by staff at the Oregon State University Libraries and Press and the University of Oregon Libraries. It is built using the Samvera Hyrax digital repository framework, which was chosen (among other reasons) because of its support for Resource Description Framework (RDF) data. Uniform Resource Identifiers (URIs) can be recorded as values in Oregon Digital metadata descriptions, and collection metadata can be exported by administrators in the RDF N-Triples format. In this blog post I look at the RDF metadata description sets that can be exported from Oregon Digital and share two SPARQL queries that can be run on this data.

An RDF triple with subject, predicate, and object

You might ask why I’d use RDF and SPARQL when Solr queries can be run against all our metadata without any need to generate exports or manage individual RDF files for collections. I see the value of RDF and SPARQL in the ability to make use of other data sources. RDF – a foundational model for sharing linked open data – makes use of unambiguous identifiers for resources of interest, and these resources can be shared and reused across the web. So, for example, SPARQL queries can be run against Oregon Digital RDF and information in other linked open data repositories using federated queries, a powerful extension of the SPARQL query language.

Very brief technical information

The queries shown below use the SPARQL 1.1 query language. All of the queries and other code snippets shown here, as well as a Jupyter notebook which can be used to run the queries, are available in this GitHub Gist. To run queries on RDF data, a SPARQL endpoint or query interface is needed. The GitHub Gist where these queries are shared includes Python code for running these queries against the data, because this is the method I use. The materials don’t provide a detailed tutorial on setting up the software needed to use SPARQL for querying RDF data, but they do include some information and links to additional resources.

Looking closely at the data

Even though I can include URIs in Oregon Digital metadata and export metadata as RDF, I don’t describe it as linked open data for a few reasons:

Subject URIs are not persistent or dereferenceable: I expect subject URIs to look different in RDF exports from other Hyrax instances where they are available, and Oregon Digital subject URIs (and other aspects of the RDF) will change in the future when Oregon Digital data storage changes
It isn’t possible to language-tag text values in our Samvera Hyrax instance, and there are no language tags in exported RDF
Oregon Digital RDF isn’t currently available to all users

Another interesting point for me—someone still relatively new to the Samvera Hyrax user community—is that RDF exported from Oregon Digital seems very “noisy” from the perspective of an end-user interested in descriptive metadata. Many or most of the triples in each description set aren’t descriptive—technical-administrative metadata takes up a lot of space. For example, each resource is classed in six distinct ways (that is, has six distinct values for the rdf:type predicate), as can be seen in the snippet of Oregon Digital RDF available in the online materials—see od_rdf_excerpt.ttl.
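One way to check how many classes each resource carries is to tally rdf:type values per subject in an N-Triples export. The following is a minimal sketch using only the Python standard library. The sample triples, the example.org subject URI, and the function name count_types are all made up for illustration, and the line-matching regular expression is a shortcut that only handles simple one-line triples with IRI subjects. It is not a full N-Triples parser; a library such as rdflib would be the more robust choice for real exports.

```python
import re
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Naive matcher for "<subject> <predicate> object ." on a single line.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$")


def count_types(ntriples_text):
    """Return a dict mapping each subject IRI to its set of rdf:type IRIs."""
    types = defaultdict(set)
    for line in ntriples_text.splitlines():
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == RDF_TYPE:
            types[match.group(1)].add(match.group(3).strip("<>"))
    return dict(types)


# Hypothetical sample data, not copied from an Oregon Digital export.
SAMPLE = """\
<http://example.org/obj1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://pcdm.org/models#Object> .
<http://example.org/obj1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/ldp#Container> .
<http://example.org/obj1> <http://purl.org/dc/terms/title> "A hypothetical title" .
"""

result = count_types(SAMPLE)
for subject, classes in result.items():
    print(subject, len(classes))
```

Run against a real export, a tally like this would make it easy to see which classes repeat on every resource and could be ignored when reading for descriptive metadata.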

To be clear, despite this, I think Hyrax’s implementation of elements of RDF and the ability to record URIs in metadata descriptions have benefits for both users and administrators!

Getting a useful subset of the metadata…

When I’m investigating RDF from a source I’ve never used before, I often download some data in Turtle serialization just to look at it in a text editor. Sometimes I need to convert whatever serialization is available to Turtle—which is much easier for humans to read—myself, but the Python rdflib library and many other tools make this easy to do.

I knew from looking at the data that a portion of subject URIs—the nine-character persistent identifier (PID)—is present in the URL for viewing the object and metadata in a web browser. Viewing metadata in the browser helps me understand how metadata descriptions look and function for a user, so this seems like a valuable piece of information to create from the RDF. The following query can be run against RDF for a collection to yield a title, PID, and detailed web view URL for each object in it.

title_pid_showpage.rq

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX fedrep: <http://fedora.info/definitions/v4/repository#> 
PREFIX ldp: <http://www.w3.org/ns/ldp#> 
PREFIX pcdm: <http://pcdm.org/models#> 
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?title ?pid ?showpage 
WHERE { 
   ?iri rdf:type fedrep:Container, fedrep:Resource, pcdm:Object, ldp:Container, 
      ldp:RDFSource, <http://projecthydra.org/works/models#Work> ; 
   <info:fedora/fedora-system:def/model#hasModel> ?model ; 
   dcterms:title ?title .
   BIND (REPLACE(str(?iri), 
      "http://fcrepo\\.od2-prod\\.svc\\.cluster\\.local\\.:8080/fcrepo/rest/prod/([\\S]{2}\\/){4}(\\S{9})", 
      "$2") AS ?pid) 
   BIND (CONCAT("https://www.oregondigital.org/concern/", LCASE(?model), "s/", ?pid) AS ?showpage) 
   }
ORDER BY ?title 
LIMIT 5 # (!) delete this line to see all query results

Details

For this to function with RDF coming from a different Samvera Hyrax instance it would be necessary to change at least two components:

The regular expression used as the second argument for the REPLACE function has been written to match the structure of subject URIs coming from our instance, and would need to be changed
The use of an object-model name (?model in this query) in web view URLs is expected to be common across Hyrax instances, but the location of this data in exported RDF may vary; here it appears as the value of the property with URI info:fedora/fedora-system:def/model#hasModel
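
To make the two BIND steps in the query easier to adapt, here is a rough Python equivalent that applies the same regular expression and string concatenation to a single subject URI. This is only an illustrative sketch: the sample URI, the model name "Image", and the function name pid_and_showpage are assumptions chosen to fit the pattern, not values taken from an actual export.

```python
import re

# The pattern from the SPARQL query, written as a Python raw string
# (SPARQL string literals need doubled backslashes; raw strings do not).
PID_PATTERN = (
    r"http://fcrepo\.od2-prod\.svc\.cluster\.local\.:8080"
    r"/fcrepo/rest/prod/(\S{2}/){4}(\S{9})"
)


def pid_and_showpage(iri, model):
    """Mirror the query's REPLACE and CONCAT steps for one subject URI."""
    # re.sub with \2 plays the role of SPARQL REPLACE with "$2".
    pid = re.sub(PID_PATTERN, r"\2", iri)
    showpage = "https://www.oregondigital.org/concern/" + model.lower() + "s/" + pid
    return pid, showpage


# Hypothetical subject URI and model name, shaped to fit the pattern above.
iri = (
    "http://fcrepo.od2-prod.svc.cluster.local.:8080"
    "/fcrepo/rest/prod/ab/cd/ef/gh/abcdefgh1"
)
pid, showpage = pid_and_showpage(iri, "Image")
print(pid)
print(showpage)
```

Testing the pattern on a handful of URIs this way before editing the query can help when adapting it to subject URIs from another Hyrax instance.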

…for resources of interest

Now that I have the query syntax needed to distill a PID and create a web view URL for RDF resources, I can add some search terms to retrieve this information for the objects I’m interested in. In this query, I use the SPARQL UNION keyword to retrieve objects for which metadata contains either a particular string, or one of two subject URIs.

match_on_string_or_uri.rq

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX fedrep: <http://fedora.info/definitions/v4/repository#> 
PREFIX ldp: <http://www.w3.org/ns/ldp#> 
PREFIX pcdm: <http://pcdm.org/models#> 
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?title ?pid ?showpage 
WHERE {
   ?iri rdf:type fedrep:Container, fedrep:Resource, pcdm:Object, ldp:Container,
      ldp:RDFSource, <http://projecthydra.org/works/models#Work> ;
   <info:fedora/fedora-system:def/model#hasModel> ?model ;
   dcterms:title ?title .
   { ?iri ?p1 ?value1 . FILTER regex(?value1, "quilt(s)?", "i") }
   UNION
   { ?iri ?p2 ?value2 . FILTER (?value2 IN (<http://id.loc.gov/vocabulary/ethnographicTerms/afset014799>, # quilting
      <http://id.loc.gov/vocabulary/ethnographicTerms/afset014804> )) } # quilts
   BIND (REPLACE(str(?iri),
      "http://fcrepo\\.od2-prod\\.svc\\.cluster\\.local\\.:8080/fcrepo/rest/prod/([\\S]{2}\\/){4}(\\S{9})",
      "$2") AS ?pid)
   BIND (CONCAT("https://www.oregondigital.org/concern/", LCASE(?model), "s/", ?pid) AS ?showpage)
   }
ORDER BY ?title
LIMIT 5 # (!) delete this line to see all query results

Details

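This query doesn’t come close to taking full advantage of SPARQL’s regex function, but the pattern matches “quilt” or “quilts” anywhere in a text value, and the “i” flag makes the match case-insensitive. Because regex matches substrings, longer words such as “quilting” match as well. SELECT DISTINCT keeps an object from appearing more than once when several of its values match.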
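
The either/or logic of the UNION can be tried out on plain strings in Python before running the query. The sketch below is a simplification under stated assumptions: every metadata value is treated as a string (in SPARQL, the regex filter only applies to literals, not URIs), the sample values are invented, and the function name matches is hypothetical.

```python
import re

# Same pattern and case-insensitive flag as the SPARQL regex filter.
QUILT = re.compile(r"quilt(s)?", re.IGNORECASE)

# The two subject URIs listed in the query.
SUBJECT_URIS = {
    "http://id.loc.gov/vocabulary/ethnographicTerms/afset014799",  # quilting
    "http://id.loc.gov/vocabulary/ethnographicTerms/afset014804",  # quilts
}


def matches(values):
    """True if any value contains the string or equals one of the URIs."""
    return any(bool(QUILT.search(v)) or v in SUBJECT_URIS for v in values)


# Invented sample values for three hypothetical objects.
print(matches(["Crazy Quilt, 1890"]))
print(matches(["http://id.loc.gov/vocabulary/ethnographicTerms/afset014799"]))
print(matches(["Barn, exterior"]))
```

An object described only with one of the subject URIs is found even when the word never appears in its text values, which is the reason for combining both branches.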
