*And some other RDF, too
Last year around this time I published a blog post about querying RDF digital-collection metadata description sets from Oregon Digital, a digital collections repository run by Oregon State University and the University of Oregon. It was fun to write and, as a relatively new Oregon Digital team member, it helped me gain a basic understanding of Oregon Digital metadata expressed as Resource Description Framework (RDF) triples [RDF] (we don’t use RDF in any production workflows right now, and one could work full-time loading and remediating metadata without ever looking at it).
The queries I used for that post don’t do anything that couldn’t be done some other way, and often much more quickly and easily, like by using our Solr search index. So, I wanted to return to the topic and show some queries which do a bit more to demonstrate why we go to the trouble of recording URIs in metadata description sets and configuring predicate URIs for metadata fields in the first place—you know, linked-data stuff, like aggregating metadata about the same resources from different data stores! Looking back at last year’s post I realize I said, as if to set the stage for further exploration:
I think [Samvera] Hyrax’s implementation of elements of RDF and the ability to record URIs in metadata descriptions have benefits for both users and administrators!
Tools
I would like to use only implementations of RDF standards, like a SPARQL query processor [SPARQL], to accomplish these tasks. To my mind, one benefit of implementing RDF technologies for creating, storing, and/or serving metadata is relying more on the related standards and spending less time on tool- or product-specific details. (Of course, I’ve spoken to library software developers who point out drawbacks of using RDF for some of these purposes, too.) Another benefit, as I mentioned above, is the ability to aggregate resource descriptions from different sources.
But between the absence of a SPARQL endpoint for some data (including Oregon Digital, where RDF is currently only available for download by administrators) and my novice-level skill with SPARQL and programming, it was helpful for me to use Python tools [PYTHON] to process SPARQL queries on data stored on my computer, send queries to an endpoint where available, and pass results from one set of queries to the next. I ran my code in a Jupyter Notebook (.ipynb) file [JUPYTER].
Data
The goal for this demonstration was to gather data of interest which could be used to create something like a browse-by-topic interface for the collection. Note that I didn’t build anything with the data I collected (I resisted the temptation to go down a rabbit hole and begin learning Python web templating). I’m currently working with subject specialists here to expand our Gertrude Bass Warner Collection of Japanese Votive Slips (nōsatsu) 1850s to 1930s, so I selected Library of Congress Subject Headings (LCSH) present in this collection as a test set for RDF data aggregation.
Queries and some details
This first query of collection metadata yielded the LCSH URIs recorded as subjects and, for each heading, the number of times it occurs plus the persistent identifier and model information for each resource where it appears, which can be used to construct display-page URLs [ODRDF]. Running this query in Python code, I narrowed down the results (arbitrarily, to make managing my data a little easier) to only those headings recorded between 10 and 100 times in the collection.
import rdflib

g = rdflib.Graph().parse("gb-warner-nosatsu_0.nt")
# your RDF file in any serialization that can be parsed by rdflib goes here
# see details at https://rdflib.readthedocs.io/en/stable/plugins/#plugin-parsers

data = {}
with open("odrdf.rq", "r") as query1:
    result = g.query(query1.read())
for row in result:
    # keep only headings recorded between 10 and 100 times (inclusive)
    if int(row.lcshCount) > 9 and int(row.lcshCount) <= 100:
        data.update({row.lcsh: {
            "count": row.lcshCount,
            "odworks": str(row.odWorks).split("|")
        }})
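The contents of odrdf.rq aren’t reproduced in this post (the file is in the Gist linked below), but as a rough, hedged sketch of the shape such a query takes, assuming works point at LCSH URIs via dct:subject triples and that work URIs are concatenated with a "|" separator to match the Python above (the real query also picks apart URI strings for identifier and model details, which I’ve left out):

# a hedged sketch, not the actual odrdf.rq
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?lcsh (COUNT(?work) AS ?lcshCount)
       (GROUP_CONCAT(STR(?work); SEPARATOR="|") AS ?odWorks)
WHERE {
  ?work dct:subject ?lcsh .
  FILTER STRSTARTS(STR(?lcsh), "http://id.loc.gov/authorities/subjects/")
}
GROUP BY ?lcsh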
Having a list of LCSH URIs, I next retrieved and queried data from the Library of Congress Linked Data Service. This query retrieves human-readable labels for each heading and, where they are available, Wikidata items and subject headings from the National Diet Library (NDL) of Japan which have been mapped as equivalent.
The syntax here and for the query that follows differs slightly from what is actually executed. It’s not valid SPARQL as written (it would raise an error if passed to a SPARQL endpoint as-is, and I probably shouldn’t really use the .rq file extension) because it includes some Python string-formatting accommodations that allow the LCSH and Wikidata URIs to be passed into the query strings.
import requests

# assumption on my part: request RDF/XML from id.loc.gov via content
# negotiation, since the response is parsed with format="xml" below
headers = {"Accept": "application/rdf+xml"}

for iri in data:
    response = requests.get(iri, headers=headers)
    g = rdflib.Graph().parse(data=response.text, format="xml")
    with open("lcsh.rq", "r") as rqfile:
        result = g.query(rqfile.read().format(iri, iri, iri))  # passing in URI here
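To make that “not actually SPARQL” caveat concrete, here is a hedged sketch of the shape lcsh.rq might take; the skos:prefLabel and skos:closeMatch predicates are my assumptions about the id.loc.gov data, the doubled braces {{ }} keep str.format() from treating SPARQL group delimiters as placeholders, and the three {} placeholders line up with .format(iri, iri, iri):

# a hedged sketch, not the actual lcsh.rq
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?label ?wikidata ?ndl
WHERE {{
  <{}> skos:prefLabel ?label .
  OPTIONAL {{
    <{}> skos:closeMatch ?wikidata .
    FILTER STRSTARTS(STR(?wikidata), "http://www.wikidata.org/")
  }}
  OPTIONAL {{
    <{}> skos:closeMatch ?ndl .
    FILTER STRSTARTS(STR(?ndl), "http://id.ndl.go.jp/")
  }}
}}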
I think the retrieved NDL subject headings have the potential to connect to even more data of interest for users of this collection, but they are included as a bit of a placeholder for now, as I haven’t had the opportunity to dive into the documentation on Web NDL Authorities and do more with them yet.
Next, I passed the list of Wikidata URIs to the Wikidata SPARQL endpoint for one final set of queries to gather more information. This yielded, where available, English- and Japanese-language labels for the mapped Wikidata items and URLs for English- and Japanese-language Wikipedia articles covering these topics.
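Again, what follows is a hedged sketch of this step rather than the file from the Gist: it binds a single hypothetical item URI directly instead of passing one in with str.format(). The schema:about/schema:isPartOf pattern is the standard way to find sitelinked Wikipedia articles on the Wikidata endpoint.

# a hedged sketch of the Wikidata step, not the actual query file
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>

SELECT ?enLabel ?jaLabel ?enArticle ?jaArticle
WHERE {
  # hypothetical item URI for illustration; the real code substitutes
  # each mapped Wikidata URI in turn
  BIND (<http://www.wikidata.org/entity/Q0> AS ?item)
  OPTIONAL { ?item rdfs:label ?enLabel . FILTER (LANG(?enLabel) = "en") }
  OPTIONAL { ?item rdfs:label ?jaLabel . FILTER (LANG(?jaLabel) = "ja") }
  OPTIONAL { ?enArticle schema:about ?item ;
                        schema:isPartOf <https://en.wikipedia.org/> . }
  OPTIONAL { ?jaArticle schema:about ?item ;
                        schema:isPartOf <https://ja.wikipedia.org/> . }
}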
Results
The diagram below outlines the data I aggregated and the RDF property relationships I used to do so, and the JSON code snippet shows aggregated data for one example LCSH heading. This example heading ("Bonsai"@en) had relationships to everything I looked for, but results varied in terms of which LCSH terms had been mapped to NDL subject headings and to Wikidata, and, of those Wikidata entities, which had both English- and Japanese-language labels and corresponding English- and/or Japanese-language Wikipedia articles.
[Diagram: aggregated data and RDF property relationships; JSON snippet: aggregated data for the example heading "Bonsai"@en]
This code is available in this GitHub Gist.
Notes
[RDF] The Resource Description Framework is a data model that “can be used to publish and interlink data on the Web,” according to the W3C RDF 1.1 Primer.
[SPARQL] SPARQL is a query language (and more!) for use with RDF data. As the SPARQL 1.1 Overview puts it, “SPARQL 1.1 is a set of specifications that provide languages and protocols to query and manipulate RDF graph content on the Web or in an RDF store.”
[PYTHON] Alongside some modules from the Python standard library and the Requests HTTP library, I rely heavily here—and in general, for working with RDF—on the rdflib package. I also often use the rdflib-endpoint module to work on queries over local data in a GUI interface. Thank you to the developers of these and other open-source software tools!
[JUPYTER] See this very brief README for some information about running Jupyter notebooks and installing the rdflib and requests modules.
[ODRDF] The workflow here is written specifically for data coming out of our digital collections platform—for example, in odrdf.rq, I’m retrieving LCSH URIs recorded as values for triples with predicate dct:subject, but they might not be there in other datasets. Also, I’m getting the information I need to construct URLs (?odWorks) by picking out part of subject URI strings and combining these with other information. I don’t expect this pattern would work for other data. For a general-purpose query to pull and count LCSH URIs from any metadata description set where they are recorded as values, see simplelcsh.rq.
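As a hedged sketch of what such a general-purpose query could look like (simplelcsh.rq itself is in the Gist), matching any predicate rather than assuming dct:subject:

# a hedged sketch of a general-purpose LCSH-counting query
SELECT ?lcsh (COUNT(?s) AS ?lcshCount)
WHERE {
  ?s ?p ?lcsh .
  FILTER STRSTARTS(STR(?lcsh), "http://id.loc.gov/authorities/subjects/")
}
GROUP BY ?lcsh
ORDER BY DESC(?lcshCount)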