Bulking Up
This post is the first of a two-part look at bulk upload and data remediation in Hyku. Part one looks at the background of bulk operations and why they are difficult to do well. Part two will cover our specific work to address some of these needs in the Hyku for Consortia project. (Photo above by Ryoji Iwata on Unsplash)
Bulk operations in Hyku have a long history. In the initial user survey, conducted way back in 2015, one of the main findings was that Hyku needed to support the “easy in and out” of metadata. Metadata migration, remediation, and transformation have always been major activities in libraries. Think back to what an enormous task the retrospective conversion of card catalogs to MARC was. Any library system containing metadata has to be able to manage that data at scale.
The design team for Hyku knew that bulk operations would be a key element in allowing potential users to commit to migrating out of their current tools. Hyku entered a market with a number of existing repositories. This new solution might have been able to solve many of the community’s frustrations with those tools, but only if there was an easy way to migrate to it. The initial requirements and personas therefore both reflected the need for tools to upload and transform metadata from one system to another. Mockups reflected the need both to migrate data and to remediate it.
This work was then reflected in GitHub issues during the project development (see: https://github.com/samvera/hyku/issues?q=is%3Aissue+is%3Aopen+bulk), but other, more basic needs for repository development (you need a repository to migrate data to, after all) took higher priority. So a new grant project called Bridge2Hyku picked up where development left off and explored the issue of migration in more depth (https://bridge2hyku.github.io/). Our colleagues on the Bridge2Hyku project did great work analyzing not only how to upload data and objects to Hyku, but also how to get them out of some of the major repository systems currently in use.
All of this work raises the question: why is metadata migration and bulk creation/upload so difficult?
The nature of structured data is what makes it so powerful: you can index and search it, you can compare like to like, you can organize and sort. In short, it makes order out of chaos. And that is what we humans naturally do: recognize patterns. But we each see the world slightly differently, and so do metadata schemas and repository systems: each has its own way of carving up the world. Some are quite simple and allow for the same basic type of description of everything. Others are quite granular, allowing for more nuanced description of subtle details that can be important and powerful. So migrating or converting from one system to another typically relies on a lot of human intelligence to see the patterns and make the connections.
But human capacity only goes so far. How do you analyze thousands of records? Analytical tools like OpenRefine can be helpful. So can general guidelines for the major categories of migration, as captured in crosswalks from other projects. But, as these examples perhaps show, these tools are not simple and not necessarily easy to pick up and learn. So any migration process is either going to require a lot of manual intellectual effort, or the creation of new tools to help with this business of organizing and translating, as the sketch below suggests.
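To make the idea of a crosswalk concrete, here is a minimal sketch in Python. The field names and the mapping are hypothetical, chosen only to show the shape of the problem rather than any particular system’s crosswalk:

```python
# A minimal crosswalk sketch: map fields from a hypothetical source
# schema to a hypothetical target schema. Real crosswalks also have to
# handle repeated values, vocabulary differences, and fields with no
# equivalent -- the parts that still need human judgment.

CROSSWALK = {
    "dc:title": "title",
    "dc:creator": "creator",
    "dc:date": "date_created",
    "dc:subject": "keyword",
}

def crosswalk_record(source: dict) -> dict:
    """Translate one source record into the target schema."""
    target = {}
    unmapped = {}
    for field, values in source.items():
        if field in CROSSWALK:
            # Collect repeated fields into lists in the target record.
            target.setdefault(CROSSWALK[field], []).extend(values)
        else:
            # Anything without a mapping needs a human decision.
            unmapped[field] = values
    return {"record": target, "needs_review": unmapped}

record = {
    "dc:title": ["Introduction to Chemistry"],
    "dc:creator": ["Doe, Jane"],
    "dc:rights": ["CC BY 4.0"],  # no mapping defined above
}
print(crosswalk_record(record))
```

Even in this toy version, the interesting part is the needs_review bucket: the fields with no obvious mapping are exactly where the human intelligence comes in.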
The quirks of particular systems can also present barriers. You may come up with a great crosswalk that works for one system but doesn’t capture the nuance of another. Within Hyku, for example, all works are sorted into worktypes. These types define the metadata schema used, the relationships between objects that can be created, and in some cases, the way that the object itself is presented and handled within the repository.
Data from other systems that don’t use this type of organization then requires an extra step to define the worktype that data should be migrated to. The system the data is coming from can also prove a barrier. Some systems are opaque, making it hard to know exactly how data is stored. Others make it difficult to export data at all. Many systems can provide an XML feed of records through the OAI-PMH protocol, but these are just metadata records, not the objects themselves. Others might use a newer protocol like ResourceSync for export, but then be incompatible with tools still relying on OAI-PMH.
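As a rough illustration of what harvesting looks like, here is a hedged sketch of an OAI-PMH request using only the Python standard library. The endpoint URL is hypothetical; the ListRecords verb, the oai_dc metadata prefix, and the namespaces are standard parts of the protocol:

```python
# A sketch of an OAI-PMH harvest. The endpoint URL is hypothetical;
# ListRecords and oai_dc are standard OAI-PMH verbs/prefixes.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

url = ("https://repository.example.edu/oai"
       "?verb=ListRecords&metadataPrefix=oai_dc")

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

for record in tree.iter(f"{OAI}record"):
    header = record.find(f"{OAI}header")
    identifier = header.findtext(f"{OAI}identifier")
    titles = [t.text for t in record.iter(f"{DC}title")]
    # These are descriptive records only: no files, and no worktype.
    # Deciding which Hyku worktype each record becomes is an extra,
    # usually human-driven, step.
    print(identifier, titles)
```

Note what is missing from the result: the records describe objects, but the files themselves, and the worktype each record should become in Hyku, still have to be handled separately.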
Finally, issues can come from the very nature of the materials themselves. A particular challenge we’ve had with migration relates to the inter-relationships between objects. As I’ve talked about before, and will likely write about on this blog in the future, one of the key needs we found for encouraging the uptake of Open Educational Resources (OER) is the availability of related teaching tools or ancillary materials. A freely available textbook is great, but if there are also related quizzes, videos, or lecture slides, an educator has all they need to make the switch. In order to make these materials visible in an OER repository, we need the ability to define many different types of relationships, like “translation of”, “part of”, or “replaced by” (for new editions).
Creating these relationships may be easy when materials are uploaded on an ad-hoc basis as they are created. But migrating them to a new environment presents a new challenge: how do you create a relationship to a work that may still be waiting in the queue to be created? There isn’t a simple solution. For us, it meant writing some new code to handle the creation of relationships as a second step in the data migration process. The point of this example isn’t necessarily the solution we found to this problem, but the acknowledgment that many other types of materials may present their own unique needs. While uniformity and standardization are good, it’s the balance between standardization and diversity that makes a repository useful.
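The general shape of that second step can be sketched like this. The helper functions here (create_work, create_relationship) are hypothetical stand-ins for whatever the target repository’s API provides, not the actual Hyku for Consortia code; the point is only that relationship creation is deferred until every work exists:

```python
# A sketch of two-pass migration. create_work() and
# create_relationship() are hypothetical callables standing in for the
# target repository's API.

def migrate(source_records, create_work, create_relationship):
    # Pass 1: create every work and remember how source IDs map to
    # the IDs the new repository assigns.
    id_map = {}
    for rec in source_records:
        new_id = create_work(rec["metadata"])
        id_map[rec["source_id"]] = new_id

    # Pass 2: now that every work exists, wire up relationships such
    # as "part of" or "replaced by".
    for rec in source_records:
        for rel_type, target_source_id in rec.get("relationships", []):
            create_relationship(
                subject=id_map[rec["source_id"]],
                predicate=rel_type,
                obj=id_map[target_source_id],
            )
```

The key design choice is the id_map: because relationships are only wired up in the second pass, a work can safely point “forward” to another work that was later in the queue during the first pass.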
So bulk operations in repositories are a hard nut to crack. There are similarities in any migration or conversion, but there are also a lot of specific challenges to every situation. In our next post, we’ll talk about the development of bulk upload functionality for Hyku Commons and how we addressed challenges in our own work.