Severely puzzled by encountering frequent references to PIDs in information about storing research data, and thus wondering how “pelvic inflammatory diseases” relate to data storage, EUDAT’s intrepid roving reporter set out north once more to interview Carl Johan Håkansson, the product manager for EUDAT’s B2SHARE service, in Stockholm. We are glad to report that the trip to the PDC Center for High Performance Computing at the KTH Royal Institute of Technology relieved those worries. Read on and you can discover what data-related PIDs really are, and also learn a nifty shortcut for sharing your stored research data with colleagues around the world: welcome to the wonderful world of PIDs and handles!
Good afternoon, Carl Johan. I wonder if you could help with a little mystery involving EUDAT. I keep coming across references to PIDs and I’m dreadfully puzzled as to why pelvic inflammatory diseases are so important for storing research data…
Ah, ok, when working with data management, we often use things known in full as “persistent identifiers”. This is commonly shortened to either “PI” or “PID”. In Europe “PID” is generally the preferred form, and so that is what we use in EUDAT. Perhaps I should also point out that, in some contexts, the term “handle” can be used to mean the same thing, so sometimes in EUDAT a PID may be called a handle.
Thanks, Carl Johan. What are these PIDs or handles actually used for in EUDAT?
Well, as you know, EUDAT provides services for managing European research data files. We work with lots of different research communities and institutions across Europe, and hence the data files managed by our services could potentially be stored in a data centre anywhere in Europe. So, to keep track of where the data files are stored physically, we assign a persistent identifier, or PID, to each set of data that is stored. This means that each data set has its own unique identifier.
Right, then what do these mysterious PIDs look like?
The PIDs that we use in EUDAT have two parts, separated by a forward slash, “/”. The first part or prefix indicates where the data set is actually stored (that is, which data centre it is stored at) and the suffix is basically a string of characters (mostly letters and digits) that is unique to that particular data set.
Er, Carl Johan... I notice you keep saying data set rather than data file. Is there a reason for that?
Yes, well spotted. Since EUDAT is about making it easy to use research data, we allow researchers to store related data files together as a “data set” or “data collection”. This means that you can keep all the data relevant to a particular lot of research together. So, in actual fact, a data set could be a single file or it could consist of multiple data files. And, by the way, you may also find “data files” referred to as “data objects” or even “digital objects”.
Thanks, now getting back to the PIDs themselves, do I need to arrange for them to be created for my data if I store a data set from my research via an EUDAT service?
No, not at all. There’s nothing to worry about with PIDs. As a researcher storing data through EUDAT’s services, you can actually forget all about PIDs if you like. They will be created and used automatically by the software. Likewise, if someone is searching for data using EUDAT’s services, that can be done without knowing about PIDs. However, I should add that PIDs can be very useful, if you would like to use them…
First, let me show you some examples of PIDs. Here are two real PIDs.
11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f
11304/9fb5e092-7018-11e4-ac7e-860aa0063d1f
The first part or prefix, “11304”, indicates where the data set is stored, and the part after the “/” is a unique identifier for the particular data set. You can see from the prefixes that both of these data collections are stored at the same data centre, although it is not obvious to us where that is.
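For readers who like to see this concretely, the two-part structure described above can be sketched in a few lines of Python. This is only an illustration: the names `Pid` and `parse_pid` are invented for this example and are not part of any EUDAT or Handle System software.

```python
from typing import NamedTuple


class Pid(NamedTuple):
    prefix: str  # indicates where the data set is stored
    suffix: str  # string that is unique to the data set


def parse_pid(pid: str) -> Pid:
    """Split a PID of the form 'prefix/suffix' at the first forward slash."""
    prefix, slash, suffix = pid.strip().partition("/")
    if not slash or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix PID: {pid!r}")
    return Pid(prefix, suffix)


# The two real PIDs quoted in the interview.
first = parse_pid("11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f")
second = parse_pid("11304/9fb5e092-7018-11e4-ac7e-860aa0063d1f")

print(first.prefix, first.suffix)
# Both PIDs share a prefix, so this comparison is True.
print(first.prefix == second.prefix)
```

Splitting at the first slash only is a deliberate choice in this sketch, so that any further slashes would stay in the suffix.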
Now, the neat thing about these PIDs is that we can actually use them to go directly to the data, although we do need to use some other software to do that. For data that is accessible via the internet, such as the data stored in EUDAT's B2SHARE service, you can simply use your web browser. All you need to do is to put a little something in front of the PID to give you the web addresses for the two data sets.
http://hdl.handle.net/11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f
http://hdl.handle.net/11304/9fb5e092-7018-11e4-ac7e-860aa0063d1f
Both of these data sets happen to be stored in B2SHARE, so if you enter either of these addresses into your web browser, it should bring up the page for the relevant data set in B2SHARE, and you would be able to download the data. This is great for sharing your data with colleagues.
Yes, I can see that would be handy. I guess that I could also use that to make a bookmark in my web browser so I could go directly to the webpage for my data later on if I needed to do that.
Absolutely. However, bear in mind that, in those examples, we were talking about data that is directly accessible via the world wide web. For other types of data, such as the extremely large data sets stored through EUDAT’s B2SAFE service, you would need to use some other kind of software, rather than your web browser, to access the data, but the PIDs essentially work in much the same way.
Now, as I said, you don’t need to know about actually creating PIDs, but you might want to find out the PID for some data that you’ve stored via an EUDAT service. It is quite easy to find the PID for data that you store through B2SHARE as the PID is always shown on the web page for the data set. If you have just stored some data in B2SHARE, you can simply click the link that takes you to the page for that data and the PID will be displayed there. And it will be similar for any data that you find by searching B2SHARE.
The situation is a bit different for the large research data sets that are stored using EUDAT's service B2SAFE. Since these sets of data are so huge, they cannot be directly accessed via a web page like the B2SHARE data. PIDs are still created automatically for the enormous B2SAFE data collections and used to keep track of where the data is stored, but you would use other methods to share that type of “big data”, rather than using the actual PID.
So, you don’t need to know about PIDs, but if you have or find the PID for data that is stored online, then you can use the PID to go directly to the data set. There are actually two ways to do this. Both of these methods use what is usually called a handle service. (Remember that we sometimes use “handle” as another word for PID.) Handle services work out where the data with a particular PID, or handle, is stored and take you there.
The web address for the handle service that EUDAT uses is http://hdl.handle.net. If you enter that in your web browser, you will arrive at this site:
[Screenshot of the webpage found at http://hdl.handle.net]
Partway down the page, you can see a box where you can enter a handle or PID. If you type one of the PIDs that I mentioned earlier into that box, you should be taken straight to the web page for the data.
The other way to find data using the PID involves using the web address that is created from the address of the handle service followed by a “/” and then the PID. I gave you two examples of these kinds of addresses earlier on. The web address of the handle service is the “little something” that I mentioned is added to the start of the PID.
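The second method, handle service address plus “/” plus PID, is simple enough to sketch in Python. The function name `pid_to_url` and the constant `HANDLE_SERVICE` are made up for this illustration; only the address http://hdl.handle.net and the PID come from the interview. The sketch builds the address as text and does not contact the handle service.

```python
from urllib.parse import quote

# Handle service address mentioned in the interview.
HANDLE_SERVICE = "http://hdl.handle.net"


def pid_to_url(pid: str, service: str = HANDLE_SERVICE) -> str:
    """Join the handle service address, a '/', and the PID."""
    # quote() leaves letters, digits, hyphens and '/' untouched and
    # percent-encodes anything that is not safe in a web address.
    return service.rstrip("/") + "/" + quote(pid.strip(), safe="/")


print(pid_to_url("11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f"))
```

The resulting address is the kind of link that can be bookmarked or sent to a colleague.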
Before we continue, I'd just like to clarify a few details about how the PID system works. EUDAT uses EPIC PIDs, that is, PIDs that are based on the services and software provided by the European Persistent Identifier Consortium, EPIC (http://www.pidconsortium.eu). The EPIC PID system is in turn based on another handle system, known simply as the “Handle System” (www.handle.net). This system is used globally on the internet and is what makes your data accessible via the world wide web. The web address we used before, http://hdl.handle.net, is actually part of this underlying Handle System.
The Handle System can also be used for other types of handles or PIDs, for example “Digital Object Identifiers” (or DOIs) are a very common type of handle that will be familiar to many people. Both DOIs and EPIC PIDs use the Handle System and therefore work in exactly the same way on a technical level. In fact, you can even use a DOI service, such as http://dx.doi.org, with your EPIC PID to find your data in exactly the same way as we did before with the Handle System.
Great, thanks! So I could just store my data using B2SHARE and then email the web address (which is created by adding the PID onto the end of the handle service address) to some colleagues in another country or university and thus share my data with them?
Yes, precisely. It really is useful to be able to do that rather than trying to email data files. By the way, there is one more thing I should mention for anyone who uses an EUDAT service where one or more copies of the data are created. For example, EUDAT's service B2SAFE can copy, or replicate, a data set from one data centre to another, either to make a backup copy of the information for the sake of safety or to move the information nearer a large supercomputer system where it will be used for calculations. Having the data geographically located at the same data centre as the supercomputer improves the performance of the calculations as massive amounts of data are not being sent back and forth over long distances.
In such cases where data sets are being replicated, PIDs are used to keep track of the various copies of the same data; a separate PID is created for each copy of the data. That might sound a bit confusing since I said PIDs were unique, but think of it in terms of us needing to keep track of each of the separate copies. So we need a unique identifier for each copy of the data. However, the PIDs of the copies and the PID of the original version are all connected together by the handle service. This means that as long as you have the PID for one of the copies you will always be able to access the data somehow. The PIDs for the different copies are associated with each other, so if one copy of the data is lost or damaged but you still have its PID, you can find another copy of the same data via the handle system.
There’s quite an advantage in using PIDs in this way: if you had the direct web address for where some data was stored, and then the data was moved to another data centre, the old web address would be useless. However, a web address that is based on the PID will continue to work, even if the data is moved.
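The reason a PID-based address keeps working is that the handle service sits between the PID and the current storage location. The toy model below illustrates that idea in Python. Everything in it is hypothetical: the class `ToyHandleService`, the PID `00000/example-data-set` and the `.example` addresses are invented for this sketch, and it does not reflect how the real Handle System or EUDAT's services are implemented.

```python
class ToyHandleService:
    """In-memory stand-in for a handle service (illustration only)."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def register(self, pid: str, location: str) -> None:
        # Record where the data with this PID is currently stored.
        self._locations[pid] = location

    def move(self, pid: str, new_location: str) -> None:
        # The data moves; the PID itself stays exactly the same.
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_location

    def resolve(self, pid: str) -> str:
        # Look up the current location for a PID.
        return self._locations[pid]


service = ToyHandleService()
pid = "00000/example-data-set"  # hypothetical PID

service.register(pid, "https://datacentre-a.example/data/42")
old_address = service.resolve(pid)

# The data set is moved to another data centre.
service.move(pid, "https://datacentre-b.example/archive/42")

print(old_address)           # direct address, no longer where the data is
print(service.resolve(pid))  # same PID now leads to the new location
```

Anyone who saved the direct address would be left with a dead link, whereas anyone who saved the PID only needs the handle service to look up the current location.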
Thanks very much, Carl Johan. We’ve really got a handle on handles and PIDs today, not to mention learning about the convenience of PID-based web addresses for accessing research data directly.