
Linked Open History: Using RDF and Linked Open Data to Connect Primary Source Materials

The white paper below is my final assignment for Week 2, RDF and Linked Open Data, as part of the Graduate Certificate in Digital Humanities at DHSI 2015.


Introduction

Historians rely on primary source materials for their research and for teaching students of history. The Internet holds a wealth of digitized and born-digital primary sources covering many regions and periods of the past. However, these digital historical materials are often not cataloged or referenced in meaningful ways. The methods of RDF (Resource Description Framework) and Linked Open Data could be used to mark up and annotate online documents that contain primary source information for particular historical periods. Many instructors of history collect primary sources in digital formats for use in the classroom, yet these records often end up in dark archives or personal collections. This is not so much because of copyright, since fair use and fair dealing provisions enable portions of these materials to be shared legally.[1] The main reason many of these materials are locked away in private digital repositories is that rendering these files as Linked Open Data is especially difficult. Even under the best circumstances, the process of creating Linked Open Data is time-consuming and labor-intensive. Despite these difficulties, though, incremental progress should be made toward methods of using Linked Open Data in historical research. Primary source materials embedded with the proper metadata would be easier for scholars and students to find online. In addition, data visualization methods could be used to show relationships between texts, photographs, and videos across time and place, adding new richness to historical scholarship.

Defining RDF and Linked Open Data (LOD)

RDF (Resource Description Framework) is a method of sharing or exchanging information across the web. As its name implies, RDF is only a framework for exchanging data, and there is no single set standard for creating RDF information. RDF expresses information as triples: three-part statements broken into a “subject,” a “predicate,” and an “object.” The power of RDF is that it can change semantic information that is readable by humans into loosely structured data that is readable by machines. With this transformation from semantic content to RDF information, machines can take action on the data, ideally without human intervention. RDF information is not itself a database, but it can be housed in database formats, such as SQL, NoSQL, and graph databases. This flexibility allows RDF to be used in many ways and across different information structures. Linked Open Data, or LOD, is the idea put forth by Tim Berners-Lee that information on the web should be linked together in meaningful ways — as stated by Berners-Lee, “With linked data, when you have some of it, you can find other, related, data.”[2] The “open” portion of LOD works toward ensuring that materials on the web are freely available for linking, so that the web of linked data is not broken by restrictive licensing or other impediments. Once semantic information is turned into RDF data, it can be linked together, creating powerful networks of information for use by people and machines alike.
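As a minimal sketch of what a triple looks like in practice, the snippet below uses the Python rdflib library (one RDF toolkit among many; choosing it here is my assumption, not something drawn from the paper) to assert a single subject-predicate-object statement and print it as Turtle. The example.org names are placeholders.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # placeholder namespace for illustration

g = Graph()
# One triple: subject (a document), predicate (its creator), object (a name).
g.add((EX.letter_001, EX.creator, Literal("Theodore Roosevelt")))

# Serialize the graph as Turtle (rdflib 6+ returns a string).
print(g.serialize(format="turtle"))
```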

Authoritative Description

In order for RDF and Linked Open Data to work, there must be an authoritative file that nodes can link to, one that gives definitive information on the subject. Such a collection of authority files is referred to as an ontology or vocabulary for referencing data. An example of a widely used ontology is dbpedia.org, which extracts structured information from wikipedia.org in order to collect information about connected data points. For example, the Wikipedia page for Theodore Roosevelt contains a wealth of information about the American president, and also a link for disambiguation. Many people, as well as schools, buildings, and ships, have been named Theodore Roosevelt, and making clear which Theodore Roosevelt is being referenced is essential for both human clarity and machine action. If a node within an RDF triple contains “Theodore Roosevelt” referring to the president, the DBpedia page can be used as the definitive reference: http://dbpedia.org/page/Theodore_Roosevelt. Using a link on the Internet, or URL (Uniform Resource Locator), as a nodal reference helps all other connected nodes create further connections across the web of data.
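A sketch of how this looks in data, again with rdflib: a local record is tied to the DBpedia entry so that machines can disambiguate it. (One detail worth noting: DBpedia’s machine-readable URIs live under /resource/, while /page/ serves the human-readable view.) The local example.org URIs are hypothetical.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS, OWL

EX = Namespace("http://example.org/collection/")   # hypothetical local collection
DBR = Namespace("http://dbpedia.org/resource/")    # DBpedia's resource namespace

g = Graph()
# Declare that our local "Theodore Roosevelt" is the entity DBpedia describes,
# disambiguating the president from schools, buildings, and ships.
g.add((EX.theodore_roosevelt, OWL.sameAs, DBR["Theodore_Roosevelt"]))
# A primary source document can then point at the same authority.
g.add((EX.letter_001, DCTERMS.subject, DBR["Theodore_Roosevelt"]))
print(g.serialize(format="turtle"))
```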

LOD in Libraries, Archives, and Museums

Many libraries, archives, and museums have finding aids that help scholars locate materials. RDF and Linked Open Data could be used to make these finding aids more extensible, which would help scholars work between collections and also between institutions.[3] At the moment, many finding aids and archival reference materials are in PDF format, which is no longer proprietary since Adobe released the standard to ISO in 2008.[4] However, PDF files are not as open to data exchange as HTML or other forms of writing for the web. While they might contain descriptive information about materials in the collections, PDF files and other closed or proprietary formats prevent the connectivity of LOD. Moving finding aids from PDF to HTML or other markup languages, and assigning each a URI (Uniform Resource Identifier, or its internationalized form, the IRI), would enable linkage between finding aids, as sketched below. Archives and other academic institutions are also well positioned to share ontologies that describe their collections. These ontologies could be either centralized or independent, as long as they contain a stable URL or link for other institutions to reference. Some institutions have already begun to extend their collections with LOD, but more work is definitely needed.[5]
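Here is a hedged sketch of what a linked finding aid might look like, using Dublin Core terms; the institutional URIs and collection names are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

ARCHIVE = Namespace("http://archive.example.edu/findingaids/")  # hypothetical

g = Graph()
aid = ARCHIVE["roosevelt-family-papers"]
g.add((aid, DCTERMS.title, Literal("Guide to the Roosevelt Family Papers")))
g.add((aid, DCTERMS.description, Literal("Correspondence and photographs, 1890-1919")))
# Linking to a finding aid at another institution is just one more triple.
g.add((aid, DCTERMS.relation,
       URIRef("http://other.example.org/findingaids/roosevelt-correspondence")))
print(g.serialize(format="turtle"))
```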

LOD and Primary Sources in the Classroom

In my own teaching practice I have created primary source materials for students using text from Google Books and other online primary source repositories, such as the Internet History Sourcebook.[6] These primary source texts may be excerpted or left in their original form. The most important adaptation for the classroom, though, is to use responsive HTML5 so that students can read the materials on their laptops, tablets, and especially smartphones. None of these devices are required in the classroom, and printed materials are also welcome, of course. HTML5 elements can also be used to ensure that printed resources are clear and well formatted. My collection of primary sources at my project website, SGA Historical Materials [http://sgahm.org/primary-sources], is open on the web; however, no RDF or LOD metadata is attached to the HTML5 files. Through the RDF and Linked Open Data class at DHSI 2015 I have definitely seen the power of linking these materials together, and to the open web of linked data. Little by little, I hope to work on my historical materials project and add more metadata to the documents. I will also need to do more research to ensure that the materials I create remain as accessible as possible. If the inclusion of RDFa or Microdata within the HTML file hinders screen readers or other assistive devices, different methods will need to be used. Perhaps maintaining multiple documents linked together, just as CSS style sheets operate, could enable both accessibility and LOD connections.
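One way the stylesheet analogy could work in practice is a sidecar metadata file: the HTML stays clean for assistive devices, while a separate Turtle file carries the LOD. The sketch below assumes a hypothetical page URL on sgahm.org, and the link pattern in the final comment is just one possible convention, not an established requirement.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

# Hypothetical primary source page on the project site.
page = URIRef("http://sgahm.org/primary-sources/example-document")

g = Graph()
g.add((page, DCTERMS.title, Literal("Example Primary Source")))
g.add((page, DCTERMS.date, Literal("1901")))
g.serialize(destination="example-document.ttl", format="turtle")

# The HTML head could then reference the metadata file, CSS-style:
# <link rel="alternate" type="text/turtle" href="example-document.ttl">
```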

Visualizing Primary Sources with LOD

Primary sources used in the classroom are often disconnected from one another. Some may be on the web as HTML, some as PDFs containing either text or images, and still other materials may be in printed books. RDF and Linked Open Data could be used to draw linkages between text-based documents, visual materials, and other historical primary sources. These linkages would form a web of interconnectivity that could be visualized much as the Linked Jazz project has done with jazz musicians and their social networks.[7] As instructors of history move away from providing facts and toward helping students with the context of historical documents, a way to see how texts and ideas have interacted with each other in the past can be a powerful tool. I can imagine something of a timeline slider, with curated primary source materials connected to one another. Zooming in or out on a specific area could show relationships between documents, illustrating how historical figures and events were connected in the past. Such a project would require quite a bit of innovative programming, as well as preparing documents with LOD-ready metadata. That work can be saved for the future, but the vision of LOD and the networks of knowledge it can create will help make current metadata practices more effective.
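The data side of such a timeline is already within reach once documents carry LOD metadata. Here is a sketch of the kind of SPARQL query that could feed the slider; the dataset file and its contents are hypothetical.

```python
from rdflib import Graph

g = Graph()
g.parse("primary-sources.ttl", format="turtle")  # hypothetical curated dataset

# Pull every document that has a title and a date, ordered for the timeline.
results = g.query("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?doc ?title ?date WHERE {
        ?doc dcterms:title ?title ;
             dcterms:date  ?date .
    }
    ORDER BY ?date
""")
for doc, title, date in results:
    print(date, title, doc)  # points to plot along the timeline slider
```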

Conclusion

RDF and Linked Open Data are very powerful ideas that are still in their formative stages of development and implementation. Historians and other academics in the Humanities can contribute to LOD in meaningful ways by adding pertinent metadata to historical materials and using open digital formats and licenses. Researchers and students alike can benefit from LOD practices, which can help make archival research more efficient and instruction more compelling. Moving information into the realm of LOD does have a high cost of time and labor, but the power of LOD is such that current practices must begin to shift toward more open and functional methods. Efforts should focus on the projects that will have the most immediate impact, while also keeping an eye toward future technical developments and innovations.


Footnotes

[1] “Code of Best Practices in Fair Use,” Association of Research Libraries http://www.arl.org/focus-areas/copyright-ip/fair-use/code-of-best-practices#.VX3vrmDtJUR

[2] Tim Berners-Lee, “Linked Data,” W3.org http://www.w3.org/DesignIssues/LinkedData.html (2006, revised 2009)

[3] The #lodlam hashtag on Twitter is a great place to discover the newest innovations in Linked Open Data within Libraries, Archives, and Museums, as is the http://lodlam.net/ website.

[4] “PDF Format Becomes ISO Standard,” ISO News, July 2008 http://www.iso.org/iso/home/news_index/news_archive/news.htm?refid=Ref1141

[5] Kate Theimer, “Archives who have implemented linked data?” Archives Next blog, March 2013 http://www.archivesnext.com/?p=3450

[6] Paul Halsall ed., “Internet History Sourcebook Project,” Fordham University http://legacy.fordham.edu/halsall/index.asp

[7] “About the Project,” Linked Jazz https://linkedjazz.org/about-the-project/

Journal 5 – RDF and Linked Open Data

Discussion today included the difficulties of teaching coding and computer literacy. Graphical user interfaces (GUIs) have enabled non-programmers to become more familiar with computers and their possibilities. However, these GUI (sometimes pronounced “gooey”) interfaces also change frequently depending on the application and its design aesthetics. Innovations in hardware, such as the trackpad and touch screens, can also affect computer literacy and usage. The command line interface (CLI) seems more timeless, with the logic of the program perhaps becoming more apparent through the typed commands themselves. Our instructor Jim mentioned that people learning to code for the first time have often spent years writing and communicating in academia or business, with sentences as the basic structure. For some, though (including myself), writing sentences for computers instead of people is terribly difficult, despite the many similarities in structure and language. Programming classes for humanists, whether in RDF and Linked Open Data, Ruby on Rails, or Python, can help scholars learn the language and logic of computers, which helps us use our own devices in new ways and understand how social media and the wider Internet operate.

RDF Graph example worksheet for the Canadian Women’s World Cup team.

Our class project for the week was a demonstration of RDF and Linked Open Data that was hand-written on construction paper. We chose the Women’s World Cup as our subject, with the Canadian team as our primary node. From the Canadian team outwards, we connected various bits of information in a series of triples, or three-part data references. These triples were then written in Turtle to abbreviate the code. We used dbpedia.org as our authoritative ontology, circling in blue the nodes that would reference its database. For example, in the triple [“Canadian team” (subject) <--> “sponsored by” (predicate) <--> “Umbro” (object)], Umbro would link out to its dbpedia.org page: http://dbpedia.org/page/Umbro. This gives the Umbro node a definitive reference, and the other information on the page, such as Umbro’s website, brands, and location, would also be accessible through the link. A query could then be run: “Is the Canadian women’s soccer or futbol team sponsored by any European companies?” Even though “Europe” was not on our node worksheet, the dbpedia.org reference would allow an extension of the query into the wider web, resulting in an answer of “Yes, by Umbro in the UK.” This is a simple example of linked data, but with the further extensions provided by authoritative ontologies, more complex queries would certainly be possible.

RDF triples in Turtle, composed from the graph information in the image above.
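Re-creating the construction-paper exercise in code makes the query concrete. The sketch below encodes the sponsorship triple from our worksheet and then asks the UK question with SPARQL via rdflib; the ex: predicates are invented for the exercise, and a real version would follow the DBpedia links rather than storing the location locally.

```python
from rdflib import Graph

turtle = """
@prefix ex:  <http://example.org/worldcup/> .
@prefix dbr: <http://dbpedia.org/resource/> .

ex:CanadianTeam ex:sponsoredBy dbr:Umbro .
dbr:Umbro ex:locatedIn dbr:United_Kingdom .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# "Is the Canadian team sponsored by a company located in the UK?"
result = g.query("""
    PREFIX ex:  <http://example.org/worldcup/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    ASK {
        ex:CanadianTeam ex:sponsoredBy ?company .
        ?company ex:locatedIn dbr:United_Kingdom .
    }
""")
print(result.askAnswer)  # True
```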

Journal 4 – RDF and Linked Open Data

This morning we discussed some steps to open up data. These suggestions were distilled from the Open Knowledge Foundation’s Open Data Handbook by instructor Jim Smith: https://www.dhdata.org/dhdata/datasets/1-how-to-open-up-data/index.html

One of the main points was that DH projects need to start small, and there should be a series of steps or stages for development, even in the smallest projects. Part of this idea of starting small includes limiting the size of the initial dataset. Small datasets can still be helpful to wider communities. If the dataset isn’t moving in the correct direction, its small size can allow for redirection before too much time has been invested.

The dataset for Theatre Finder (a DH project about historic theaters) is written in JSON, but it could be altered to JSON-LD without destroying or rewriting the original database. Coding the dataset as JSON-LD would make the information in Theatre Finder much more usable, connecting it to the wider networks of linked data across the web. Theatre Finder is a good candidate for becoming an authority dataset, or ontology. It didn’t begin with this in mind, but its dataset is comprehensive and widely used.
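A sketch of why no rewrite is needed: JSON-LD works by adding an @context that maps existing keys onto URIs, leaving the original record untouched. The keys and mappings below are hypothetical, not Theatre Finder’s actual schema.

```python
import json

# An existing plain-JSON record (invented for illustration).
record = {"name": "Teatro Olimpico", "city": "Vicenza", "opened": "1585"}

# Adding @context turns the same record into JSON-LD.
record_ld = {
    "@context": {
        "name":   "http://purl.org/dc/terms/title",
        "city":   "http://example.org/vocab/city",
        "opened": "http://purl.org/dc/terms/date",
    },
    **record,  # the original data is untouched
}
print(json.dumps(record_ld, indent=2))
```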

It’s important to consider that database ontologies are generally created by the people who use and need the data on a regular basis. Before creating a dataset, DH professionals need to ask the question, “Who can and will use this data?” This question is important because these groups will also be the ones to contribute to the dataset. Stability also depends on community involvement. Datasets are constantly changing, not just through the addition of information, but also through the reconsideration of definitions and vocabularies in line with social and cultural change.

Example data from Theatre Finder. The information under “Overview,” such as country and city, could be connected to dbpedia.org as linked data.

Journal 3 – RDF and Linked Open Data

This morning we worked with Turtle(s?), a nickname for Terse RDF Triple Language. Turtle allows the long text of complex triples to be written in an abbreviated format. For the purposes of the class, or at least for my own benefit as a beginner with RDF, writing out the triples in long form is best. Once the entire triple is there in its extended form, it’s easier to see the connection between the full triple and the abbreviated Turtle version. Going over Turtle this morning was helpful because it made us think about the triples we were using, their composition, and how they could be better structured.
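To see the long form and the terse form side by side, the sketch below serializes the same single triple both ways with rdflib (my tooling assumption): N-Triples spells out every URI, while Turtle abbreviates them with prefixes.

```python
from rdflib import Graph

turtle = """
@prefix ex: <http://example.org/> .
ex:a ex:b ex:c .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

print(g.serialize(format="nt"))      # long form: full URIs in angle brackets
print(g.serialize(format="turtle"))  # terse form: prefixes stand in for URIs
```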

RDF (Resource Description Framework) is not a programming language, so there’s no way for it to throw an error if something is missing or incorrect. An error message could come later from another program that can’t find information, or if information isn’t presented as expected. However, this delay can make working with RDF triples a bit tricky.

JSON-LD is a way to use both JSON (JavaScript Object Notation; the format is no longer tied to JavaScript, though the name remains) and Linked Data. JSON is relatively new, widely used by tech startups, and its popularity is growing because of the many web applications being developed. JSONLint is a validator for JSON, and the playground at json-ld.org can be used for further testing. JSON-LD is designed to build on existing APIs (Application Programming Interfaces) in semantic ways to make them more usable and data rich. This enhancement isn’t for human readability but for machines, so that APIs are more easily accessed and actionable.

JSON-LD is a relatively new way of transporting data. Its initial development began in 2010, and it became a W3C Recommendation in January 2014. There’s quite a bit of controversy surrounding JSON-LD, but it’s not really about the method or the technical specifications. Instead, the argument is whether JSON-LD is for the Semantic Web (human readable) or for API enhancement (machine readable). On the surface the discourse surrounding JSON-LD might appear to be trivial, only a matter for technical debate. The deeper question, though, is how information is represented in digital space, and what role people have as developers and consumers in our modern and machine-actionable world.

This example of Turtle at w3.org contains errors; it should read ( :a :b :c ), with no spaces between the colons and the b and c values.

Journal 2 – RDF and Linked Open Data

In the morning today we talked about RDF and how its data is composed. RDF is about sharing and exchanging information, but not necessarily about sharing the tools to interpret the information. RDF can be like NoSQL in its flexibility: just add more properties. When the project becomes more mature, though, things need to be locked down and standardized. Eventually, the information about “blank node” connections would need to be published so that all connections can be clear outside the project.

An informal graph of sample triples by W3C: http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/

In the afternoon we worked with markers and construction paper, laying out a physical example of the materials we could work with. In our case this was ancient pottery, along with the variables or attributes of possible pottery fragments. We had a specialist in our group, an academic who works with ancient Mediterranean pottery fragments, and she was able to give us a wide variety of attributes. Each fragment of pottery has multiple data points, such as shape, type, place found, date of creation, and type of glaze. Each of these attributes requires further deconstruction; place, for example, requires both a name as a string or text value and a latitude and longitude value that is numeric and geographical. RDF information needs to be very granular and specific. For example, a price is not just a dollar amount but two specifications: a dollar amount field that would be a number reference, and a currency reference that would point to a web-hosted ontology.
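Here is a sketch of that two-part price, with the numeric amount and the currency split into separate nodes; the fragment URI and the ex: predicates are invented, and DBpedia stands in for the web-hosted currency ontology.

```python
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/pottery/")  # placeholder vocabulary

g = Graph()
price = BNode()  # structured price value
g.add((EX.fragment42, EX.price, price))
g.add((price, RDF.value, Literal("120.00", datatype=XSD.decimal)))
# The currency points to a web-hosted authority rather than a bare string.
g.add((price, EX.currency, URIRef("http://dbpedia.org/resource/United_States_dollar")))
print(g.serialize(format="turtle"))
```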

The value of “glaze” would point to an additional table containing information such as the elemental makeup and the percentage of each element contained within the fragment. It’s possible to use the RDF triple (subject<-->predicate<-->object) of glaze<-->element<-->percentage, but this would not necessarily be machine readable. People could understand that the percentage was a feature of the element, but machines might get stuck at the element value. It’s not certain that machines would read an element and then also look for a percentage; most often the machine reading would stop at the element itself. If a blank node were used, perhaps titled “has components,” this blank node could point to both the element and the value. This would relate the element and the percentage together without requiring one value to be privileged over the other. Using the title “has components” would also make this blank node understandable for people.

Our RDF Graph for the term “glaze.” On the left of “glaze” is the blank node with the name “has components.”
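Here is a small code sketch of that blank node; the element name and percentage are invented values, and the ex: predicates are placeholders for whatever vocabulary the project would settle on.

```python
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/pottery/")  # placeholder vocabulary

g = Graph()
component = BNode()  # the unnamed "has components" node
g.add((EX.glaze, EX.hasComponents, component))
# Element and percentage hang off the blank node together,
# so neither value is privileged over the other.
g.add((component, EX.element, Literal("tin")))
g.add((component, EX.percentage, Literal("12.5", datatype=XSD.decimal)))
print(g.serialize(format="turtle"))
```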

Journal 1 – RDF and Linked Open Data

In the morning portion of class today we analyzed and critiqued projects that have occurred over the years, including the Indiana Philosophy Ontology Project [https://inpho.cogs.indiana.edu] and the 1995 Cervantes Project [http://cervantes.tamu.edu/V2/CPI/index.html]. We discussed RDF ontologies, or vocabularies, and we looked at a few databases that house these descriptors, such as dbpedia.org. Both scientific and humanistic data have been increasing exponentially over the last decade, and the need to link these resources together in an open format is very apparent. However, many academics are unaware that the methods they use to create and distribute data are closed systems and formats, such as Word documents and PDFs. Using the frameworks of linked open data can ensure that web-based projects become connected to the scholarly record, instead of being siloed and possibly forgotten in lonely corners of the Internet.

In the afternoon we covered database types, including SQL, NoSQL, Graph, and LDAP. Each of these database types comes with benefits and pitfalls, but the key takeaway is to use the database type you’re most familiar with to help get projects off the ground. SQL databases are more rigid than NoSQL, but the additional flexibility of NoSQL can help projects without a clear idea of their datasets begin building while initial development is still in process. RDF is itself a framework, not a standard. This is obvious in the RDF acronym, Resource Description Framework, but the ubiquity of the term can make it appear as though RDF is fully fleshed out and set in stone. Overall, what RDF and Linked Open Data really work toward is interconnecting web resources in such a way that they benefit knowledge creation by humans, and this can only be done if they are inherently readable and actionable by machines.

Screen capture of the Cervantes Project showing multiple problems, including character encoding.
Screen capture from the Shelley-Godwin Archive; images and text match, with additional viewing options.