Linked Open History: Using RDF and Linked Open Data to Connect Primary Source Materials

The white paper below is my final assignment for Week 2, RDF and Linked Open Data, as part of the Graduate Certificate in Digital Humanities at DHSI 2015.


Introduction

Historians rely on primary source materials for their research, and also for teaching students of history. The Internet holds a wealth of digitized and born-digital primary sources from all periods of the past. However, these digital historical materials are often not cataloged or referenced in meaningful ways. The methods of RDF (Resource Description Framework) and Linked Open Data could be used to mark up and annotate online documents that contain primary source information for particular historical periods. Many instructors of history collect primary sources in digital formats for use in the classroom, yet these records often end up in dark archives or personal collections. This is not so much because of copyright, as fair use or fair dealing provisions enable portions of these materials to be shared legally.[1] The main reason many of these materials are locked away in private digital repositories is that rendering these files as Linked Open Data is especially difficult. Even under the best circumstances, the process of creating Linked Open Data is time-consuming and labor-intensive. Despite these difficulties, incremental progress should be made toward methods of using Linked Open Data in historical research. Primary source materials that are embedded with the proper metadata would be easier for scholars and students to find online. In addition, data visualization methods could be used to show relationships between texts, photographs, and videos across time and place, adding new richness to historical scholarship.

Defining RDF and Linked Open Data (LOD)

RDF (Resource Description Framework) is a method of sharing or exchanging information across the web. As its name implies, RDF is a framework, or data model, for exchanging data rather than a single file format: the same RDF information can be written in several syntaxes, such as RDF/XML, Turtle, or JSON-LD. RDF expresses information as statements of three parts, called triples, each made up of a “subject,” a “predicate,” and an “object.” The power of RDF is that it can turn information whose meaning is readable by humans into loosely structured data that is readable by machines. With this transformation from semantic content to RDF information, machines can take action on the data, ideally without human intervention. RDF information itself is not a database, but it can be stored in SQL, NoSQL, and graph databases. This flexibility allows RDF to be used in many ways and across different information structures. Linked Open Data, or LOD, is the idea put forth by Tim Berners-Lee that information on the web should be linked together in meaningful ways. As Berners-Lee states, “With linked data, when you have some of it, you can find other, related, data.”[2] The “open” portion of LOD works toward ensuring that materials on the web are freely available for linking, so that the web of linked data is not broken by restrictive licensing or other impediments. Once semantic information is turned into RDF data, it can be linked together, creating powerful networks of information for use by people and machines alike.
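To make the idea of a triple more concrete, here is a minimal sketch in plain Python that writes two statements about a primary source in the N-Triples syntax. The document address under example.org and the title are hypothetical placeholders, and the choice of the Dublin Core "title" and "subject" properties is my own assumption about how such a source might be described, not a requirement of RDF.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool = False  # True for plain text values, False for URIs


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a single line of N-Triples."""
    if triple.obj_is_literal:
        # Escape backslashes and double quotes inside the literal text.
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


# Hypothetical URI for a digitized primary source (not a real address).
DOCUMENT = "http://example.org/sources/sample-speech"

triples = [
    # The source has a human-readable title (a literal value).
    Triple(DOCUMENT, "http://purl.org/dc/terms/title", "Sample speech excerpt", True),
    # The source is about a person, identified by a URI rather than a name.
    Triple(DOCUMENT, "http://purl.org/dc/terms/subject",
           "http://dbpedia.org/resource/Theodore_Roosevelt"),
]

ntriples_lines = [to_ntriples(t) for t in triples]

for line in ntriples_lines:
    print(line)
```

Each printed line is one complete statement, so a file of such lines can be merged with statements from other collections simply by appending them.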

Authoritative Description

In order for RDF and Linked Open Data to work, there must be an authoritative file that nodes can link to that gives definitive information on the subject. Such a collection of authority files is referred to as an ontology or vocabulary for referencing data. A widely used example is DBpedia (dbpedia.org), which draws on Wikipedia (wikipedia.org) in order to collect information about connected data points. For example, the Wikipedia page for Theodore Roosevelt contains a wealth of information about the American president, and also a link for disambiguation. Many people have been named Theodore Roosevelt, as have schools, buildings, and ships, and establishing which Theodore Roosevelt is being referenced is essential for both human clarity and machine action. If a node within an RDF triple contains “Theodore Roosevelt” referring to the president, the DBpedia page at http://dbpedia.org/page/Theodore_Roosevelt can be used as the definitive reference. Using a link on the Internet, or URL (Uniform Resource Locator), as a nodal reference helps all other connected nodes create further connections across the web of data.
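One practical detail is worth a short sketch. DBpedia serves a human-readable page under /page/ and uses the matching address under /resource/ as the identifier for the thing itself, so a triple would normally point at the /resource/ form. The helper below, whose name dbpedia_resource_uri is my own invention, converts one into the other using only the Python standard library. It is an illustration under that assumption, not an official DBpedia tool.

```python
from urllib.parse import urlsplit, urlunsplit


def dbpedia_resource_uri(url: str) -> str:
    """Return the /resource/ identifier for a DBpedia page, data, or resource URL."""
    parts = urlsplit(url)
    if parts.netloc != "dbpedia.org":
        raise ValueError(f"not a dbpedia.org address: {url}")
    # Split "/page/Theodore_Roosevelt" into "page" and "Theodore_Roosevelt".
    prefix, _, name = parts.path.lstrip("/").partition("/")
    if prefix not in {"page", "resource", "data"} or not name:
        raise ValueError(f"unrecognized DBpedia path: {parts.path}")
    return urlunsplit((parts.scheme, parts.netloc, f"/resource/{name}", "", ""))


# The browsable page cited in the text and the identifier a triple would use.
page = "http://dbpedia.org/page/Theodore_Roosevelt"
identifier = dbpedia_resource_uri(page)
print(identifier)
```

Because every mention of the president resolves to the same identifier, two documents prepared by different people can be recognized by a machine as being about the same person.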

LOD in Libraries, Archives, and Museums

Many libraries, archives, and museums have finding aids that help scholars locate materials. RDF and Linked Open Data could be used to make these finding aids more extensible, which would help scholars work between collections and also between institutions.[3] At the moment, many finding aids and archival reference materials are in PDF format, which is no longer proprietary since Adobe released the standard to ISO in 2008.[4] However, PDF files are not as open to data exchange as HTML or other forms of writing for the web. While they might contain descriptive information about materials in the collections, PDF files, like closed or proprietary formats, hinder the connectivity of LOD. Moving finding aids from PDF to HTML or other markup languages, and including URIs (Uniform Resource Identifiers, or their internationalized form, IRIs, Internationalized Resource Identifiers), would enable linkage between finding aids. Archives and other academic institutions are also well placed to share ontologies that describe their collections. These ontologies could be either centralized or independent, as long as they contain a stable URL or link for other institutions to reference. Some institutions have already begun to extend their collections with LOD, but more work is definitely needed.[5]

LOD and Primary Sources in the Classroom

In my own teaching practices I have created primary source materials for students using text from Google Books and other online primary source repositories, such as the Internet History Sourcebook.[6] These primary source texts may be changed into excerpts, or left in their original form. The most important change for the classroom, though, is to use responsive HTML5 so that students can read the materials on their laptops, tablets, and especially smartphones. None of these devices are required in the classroom, and printed materials are also welcome, of course. HTML5 elements can also be used to ensure that printed resources are clear and well formatted. My collection of primary sources at my project website, SGA Historical Materials [http://sgahm.org/primary-sources], is open on the web; however, there is no RDF or LOD metadata attached to the HTML5 files. Through the RDF and Linked Open Data class at DHSI 2015 I have seen the power of linking these materials together, and to the open web of linked data. Little by little, I hope to work on my historical materials project and add more metadata to the documents. I will also need to do more research to ensure that the materials I create remain as accessible as possible. If the inclusion of RDFa or Microdata within the .html file hinders screen readers or other assistive devices, different methods will need to be used. Perhaps maintaining multiple documents linked together, much as an HTML page links to a separate CSS style sheet, could enable both accessibility and LOD connections.
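The idea of keeping metadata in a separate, linked document can be sketched as follows. This is only one possible approach, and I have not tested it against screen readers. The metadata is written as a JSON-LD file using schema.org terms, and the HTML page points to it with a link element, leaving the visible markup untouched. The page address, the file name, and the function names build_sidecar and sidecar_link_tag are all hypothetical.

```python
import html
import json


def build_sidecar(page_url: str, title: str, subject_uris: list) -> dict:
    """Describe one HTML primary source as a JSON-LD document."""
    return {
        "@context": "https://schema.org",
        "@type": "CreativeWork",
        "@id": page_url,
        "name": title,
        # Each subject is a link to an authoritative identifier, not a text label.
        "about": [{"@id": uri} for uri in subject_uris],
    }


def sidecar_link_tag(sidecar_href: str) -> str:
    """Return the tag that would go in the HTML head to point at the sidecar."""
    href = html.escape(sidecar_href, quote=True)
    return f'<link rel="alternate" type="application/ld+json" href="{href}">'


# Hypothetical page and subject, for illustration only.
sidecar = build_sidecar(
    "http://example.org/primary-sources/sample.html",
    "Sample speech excerpt",
    ["http://dbpedia.org/resource/Theodore_Roosevelt"],
)
sidecar_text = json.dumps(sidecar, indent=2)
link_tag = sidecar_link_tag("sample.jsonld")

print(sidecar_text)
print(link_tag)
```

The trade-off is that the metadata and the text can drift apart if one file is edited without the other, whereas RDFa or Microdata keep the two together in a single document.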

Visualizing Primary Sources with LOD

Primary sources used in the classroom are often disconnected from one another. Some may be on the web as HTML, some as PDFs either as text or images, and still other materials may be in printed books. RDF and Linked Open Data could be used to draw linkages between text-based documents, visual materials, and other historical primary sources. These linkages would form a web of interconnectivity that could be visualized much as the Linked Jazz project has done with jazz musicians and their social networks.[7] As instructors of history move away from providing facts and more toward helping students with the context of historical documents, providing a way to see how texts and ideas have interacted with each other in the past can be a powerful tool. I can imagine something of a timeline slider, with curated primary source materials connected to one another. Zooming in or out on a specific area could show relationships between documents, illustrating how historical figures and events were connected in the past. This project would require quite a bit of innovative programming, as well as preparing documents with LOD-ready metadata. This project can be saved for the future, but the vision of LOD and the networks of knowledge it can create will help make current metadata practices more effective.
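The data behind such a timeline could be quite simple. The sketch below is a rough outline rather than a working visualization. It assumes each source carries a year and a set of entity URIs, filters the sources to the window chosen on the slider, and connects any two sources that share an entity. The three sources, their years, and their addresses are invented for illustration.

```python
from dataclasses import dataclass
from itertools import combinations

ROOSEVELT = "http://dbpedia.org/resource/Theodore_Roosevelt"
CANAL = "http://dbpedia.org/resource/Panama_Canal"


@dataclass(frozen=True)
class Source:
    uri: str              # where the primary source lives
    year: int             # position on the timeline
    entities: frozenset   # URIs of people, places, or events it mentions


def in_window(sources, start, end):
    """Sources whose year falls inside the slider window, oldest first."""
    selected = [s for s in sources if start <= s.year <= end]
    return sorted(selected, key=lambda s: (s.year, s.uri))


def shared_entity_edges(sources):
    """Connect every pair of sources that mention at least one common entity."""
    edges = []
    for a, b in combinations(sources, 2):
        shared = a.entities & b.entities
        if shared:
            edges.append((a.uri, b.uri, sorted(shared)))
    return edges


# Invented sample data (hypothetical addresses and years).
sources = [
    Source("http://example.org/sources/letter", 1903, frozenset({ROOSEVELT, CANAL})),
    Source("http://example.org/sources/photograph", 1906, frozenset({CANAL})),
    Source("http://example.org/sources/speech", 1912, frozenset({ROOSEVELT})),
]

visible = in_window(sources, 1900, 1910)
edges = shared_entity_edges(visible)

for left, right, shared in edges:
    print(left, "<->", right, "via", shared)
```

Widening the window to include 1912 would bring in the speech and add a second connection through the shared person, which is the kind of relationship the zooming interaction is meant to reveal.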

Conclusion

RDF and Linked Open Data are very powerful ideas that are still in their formative stages of development and implementation. Historians and other academics in the Humanities can contribute to LOD in meaningful ways by adding pertinent metadata to historical materials and using open digital formats and licenses. Researchers and students alike can benefit from LOD practices, which can help make archival research more efficient and instruction more compelling. Moving information into the realm of LOD does have a high cost of time and labor, but the power of LOD is such that current practices must begin to shift toward more open and functional methods. Efforts should focus on the projects that will have the most immediate impact, while also keeping an eye toward future technical developments and innovations.


Footnotes

[1] “Code of Best Practices in Fair Use,” Association of Research Libraries http://www.arl.org/focus-areas/copyright-ip/fair-use/code-of-best-practices#.VX3vrmDtJUR

[2] Tim Berners-Lee, “Linked Data,” W3.org http://www.w3.org/DesignIssues/LinkedData.html (2006, revised 2009)

[3] The #lodlam hashtag on Twitter is a great place to discover the newest innovations in Linked Open Data within Libraries, Archives, and Museums, as is the http://lodlam.net/ website.

[4] “PDF Format Becomes ISO Standard,” ISO, News, July 2008 http://www.iso.org/iso/home/news_index/news_archive/news.htm?refid=Ref1141

[5] Kate Theimer, “Archives who have implemented linked data?” Archives Next blog, March 2013 http://www.archivesnext.com/?p=3450

[6] Paul Halsall ed., “Internet History Sourcebook Project,” Fordham University http://legacy.fordham.edu/halsall/index.asp

[7] “About the Project,” Linked Jazz https://linkedjazz.org/about-the-project/