How do we make it easy for data from every possible source in a city to work together?

That was essentially the question posed at the second “camp” session at the Fall NNIP Partnership meeting. Pretty ambitious for an hour-long session, but a dozen or so people from around the country jumped headlong into a question that plagues many data exchange and open data efforts: how do we make all this stuff work together?

A typical response is to create a standard structure for information and then implement it, or force it, on everyone in a community. That set of rules can include what data should be recorded, how it should be recorded, what structure it is stored in, and a dictionary that explains all of it.

Since the question is often asked of data scientists, the response is usually to look at existing, evidence-based models that have been implemented elsewhere. So the group talked data standards while I listened, letting the scope of the conversation sharpen as different standards came up.

And then, lightning hit.

See, because these conversations are usually had by data scientists who are excellent at interpreting, analyzing and visualizing data, people tend to want to create a precious system, or the “one system to rule them all.” Fortunately, because of the work done with the Milwaukee Community Database project, I had a different perspective to offer.

To address the ongoing sustainability of that project, which is a joint effort between the Milwaukee Neighborhood News Service, Marquette University and the Milwaukee Data Initiative, we interviewed people from more than 40 organizations to determine what data they wanted and what data they could share. When we started, we focused our interviews on journalists and data scientists, many of whom suggested we talk to their local or institutional librarians. So we interviewed research librarians and archivists at Milwaukee Public Library, Marquette’s Raynor Library and the University of Wisconsin-Milwaukee.

After each conversation, I had a palm-to-forehead moment. Let me tell you, the message sank in: librarians have data down.

Librarians work in big and small data every day. Their jobs are essentially to take massive amounts of content, assign metadata (data that provides information about other data) to that content, and then connect that information to other information, sometimes in other institutions. So when we interviewed them, they didn’t ask “what is open data?” or “what are data sharing standards?” No, these people were responding “Oh yeah, RDF? We use that.” and “We’ve developed our own metadata standards, too.”

Did I mention they were also familiar with “linked data”, the semantic web and the SKOS metadata schema?

If this isn’t a set of terms you are familiar with, that’s completely understandable. It’s a difficult concept to grasp and is not yet implemented on a wide scale outside of the research and library world. A concept introduced by the brilliant Tim Berners-Lee, who also happened to invent HTML and the World Wide Web itself (the core technology that made the web as we know it possible), “linked data” is a set of practices for publishing metadata that makes it easy to connect distributed data across the web.

Without going into too much detail, SKOS (the Simple Knowledge Organization System, a W3C vocabulary built on RDF) allows an address in one dataset from the State to be related to an address at the County, and to the mailing address list published by the City, all the way down to the version of the address that a resident gives you when you are doing your survey. It not only links those things together, it accounts for the “fuzziness” of those data points.
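To make that concrete, here is a minimal sketch using Python’s rdflib library. Every namespace, URI and address below is hypothetical; the point is simply to show how SKOS’s matching properties can assert that records held by different agencies describe the same, or roughly the same, real-world address.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

g = Graph()
g.bind("skos", SKOS)

# Hypothetical namespaces: each agency publishes addresses under its own URIs.
STATE = Namespace("http://example.org/state/address/")
COUNTY = Namespace("http://example.org/county/parcel/")
CITY = Namespace("http://example.org/city/mailing/")

state_addr = STATE["1234-w-national-ave"]
county_addr = COUNTY["tax-key-4390125000"]
city_addr = CITY["1234-W-NATIONAL-AVE-53204"]

# skos:exactMatch asserts that two records describe the same thing;
# skos:closeMatch hedges when the match is probable but fuzzy.
g.add((state_addr, SKOS.exactMatch, county_addr))
g.add((state_addr, SKOS.closeMatch, city_addr))

print(g.serialize(format="turtle"))
```

No agency had to change how it stores its addresses; the links live alongside the original records and can be queried, challenged or refined later.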

Let’s unpack that last sentence and the term “data fuzziness”. If I start recording my own attendance at a community event for my own purposes, do I need to record it the same way or with the same information as the community organizer or sponsoring organization? Every point of data collected about something can be collected in a variety of ways, sometimes for very good reasons.

So instead of forcing a data standard that everyone must use, linked data lets Organizations A, B, C and D record attendance however they want, link those numbers together, and account for which of those is the “authority” and for whom, and how each figure relates to the fields from the other organizations.
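Here is another hedged sketch in the same vein, again with rdflib, where every organization, property name and number is invented for illustration: two groups keep their own attendance fields, the two event records are linked, and a made-up project-level property flags which record is treated as authoritative.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS, XSD

g = Graph()
g.bind("skos", SKOS)

# Hypothetical vocabularies: each organization records attendance its own way.
ORGA = Namespace("http://example.org/org-a/")
ORGB = Namespace("http://example.org/org-b/")
MCD = Namespace("http://example.org/mcd/")  # invented project vocabulary

event_a = ORGA["events/2015-block-party"]
event_b = ORGB["programs/summer/block-party"]

# Org A logs a door headcount; Org B logs sign-in sheet totals.
g.add((event_a, ORGA.headcount, Literal(212, datatype=XSD.integer)))
g.add((event_b, ORGB.signInTotal, Literal(187, datatype=XSD.integer)))

# Link the two records as descriptions of the same event, without forcing
# either organization to change how it collects or stores its numbers.
g.add((event_a, SKOS.exactMatch, event_b))

# An invented project-specific property can flag which record a given
# consumer of the data treats as authoritative.
g.add((event_a, MCD.authoritativeSource, Literal(True)))

print(g.serialize(format="turtle"))
```

Neither organization’s workflow changes; the reconciliation happens entirely in the links.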

There are other methods out there as well, from designing a data translation matrix to using standards like HSDS (the Human Services Data Specification) from openreferral.org.

There are some inherent and significant challenges associated with linked data. First, all software systems that are “linked” have to be “interoperable,” or able to share data. There are also the normal concerns about linking identifiable data and the difficulty of developing data sharing agreements. Then there is the challenge of convening and coordinating all of the people who manage the data systems. The Milwaukee Data Initiative estimated that doing linked data in Milwaukee would be a 10-year, multi-million-dollar project that would require convening local stakeholders with County, State and Federal agencies along with private and public foundations. No small feat.

But the potential payoffs are huge: massive de-duplication of effort, the possibility of real-time performance monitoring, data-driven decisions about program and funding efforts, and so much more.

Could linked data be an opportunity for investment in the long-term future of Milwaukee and Wisconsin? We will only know if we make the effort. If we do, we’ll have the attention of data scientists around the world as they watch what could unfold here, in Milwaukee.