Digital Object Identifiers and Metadata for Electronic Documents
The first aim of the research is to examine the main issues in interfacing Digital Object Identifiers (DOIs) to metadata. Software is already being written to create a corpus of metadata for an archive of journal articles in PDF format. Indexing into this metadata is to be accomplished via a DOI for each article and a local Handle resolver; if the final PDF documents are freely available, then users who click on a DOI in a Web page index will first be guided to the metadata and then to the full paper if they so require. Another facet of the research is to endow every bibliographic citation in the References section of all the PDF papers with a metadata descriptor accessed via a DOI. A future is envisaged where publishers gradually provide, at the very minimum, a DOI and metadata descriptor for everything they have published.
The second aim of this project is to research problems of accessibility and scalability in a future where a next-generation Web could locate any entity referred to by a DOI embedded in a Web page. The Handle/URN mechanism, if built into the Web, could find the nearest copy of a document on the most accessible Web site (or could prompt the user with a choice of mirror sites). With such a mechanism most of the problems caused by broken links would vanish, but the problems of scaling up URNs and making the Web URN-aware should not be underestimated.
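The choice between "nearest copy" and "prompt the user" can be illustrated with a small sketch. It assumes that some earlier step has already measured a response time for each mirror (with None meaning the site did not answer); the function names and the example URLs are hypothetical and are not part of the Handle software.

```python
from typing import Dict, List, Optional


def reachable_mirrors(latencies_ms: Dict[str, Optional[float]]) -> List[str]:
    """Return the mirrors that answered, ordered from fastest to slowest.

    This ordered list is what a browser could show when prompting the user.
    """
    answered = {url: ms for url, ms in latencies_ms.items() if ms is not None}
    return sorted(answered, key=answered.get)


def choose_mirror(latencies_ms: Dict[str, Optional[float]]) -> Optional[str]:
    """Pick the fastest reachable mirror, or None if no copy is reachable."""
    ordered = reachable_mirrors(latencies_ms)
    return ordered[0] if ordered else None


# Hypothetical measurements for three copies of the same document.
measured = {
    "http://mirror-a.example.org/paper.pdf": 180.0,
    "http://mirror-b.example.org/paper.pdf": None,   # did not answer
    "http://mirror-c.example.org/paper.pdf": 45.0,
}
best = choose_mirror(measured)
```

Response time is only one possible measure of "nearest"; a real resolver might equally use network topology or a site preference stored with the name.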
For many years now, metadata about books, journals and other publications has resided in the databases of the world's major publishers. By metadata we mean data about data, and in today's Web-based world of electronic publications and electronic libraries this takes on new significance. Metadata about a published work might include such items as Title, Author, Abstract, Keywords, Publication date and so on. Within the Dublin Core metadata standard there is now a set of a dozen or so standardised fields of this sort. Many secondary publishers (ISI, MedLine etc.) have made a useful business out of providing an abstracts service for journals of interest, thereby re-creating metadata which might previously have existed only in an implicit form in the original publisher's database.
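As an illustration of what such a descriptor might look like, the sketch below holds a handful of fields in a small record and writes them out as HTML meta tags using Dublin Core element names (Author maps to creator, Abstract to description, Keywords to subject). The class name, the function name, and the example values, including the identifier, are invented for this sketch; only the element names come from Dublin Core.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class MetadataRecord:
    """A hypothetical descriptor using a few Dublin Core-style fields."""
    title: str
    creators: List[str]
    description: str = ""                               # the abstract
    subjects: List[str] = field(default_factory=list)   # the keywords
    date: str = ""                                      # publication date
    identifier: str = ""                                # e.g. a DOI


def to_meta_tags(record: MetadataRecord) -> List[str]:
    """Render the record as HTML <meta> lines, one per field value."""
    pairs = [("DC.title", record.title)]
    pairs += [("DC.creator", name) for name in record.creators]
    if record.description:
        pairs.append(("DC.description", record.description))
    pairs += [("DC.subject", keyword) for keyword in record.subjects]
    if record.date:
        pairs.append(("DC.date", record.date))
    if record.identifier:
        pairs.append(("DC.identifier", record.identifier))
    return [
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    ]


# Invented example values, for illustration only.
example = MetadataRecord(
    title="An Example Article",
    creators=["A. N. Author"],
    description="A short abstract.",
    subjects=["metadata", "PDF"],
    date="1999",
    identifier="10.9999/example.1",
)
tags = to_meta_tags(example)
```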
For electronic documents, metadata can be extraordinarily useful whenever the format of the original document is hard to search or the document itself is unavailable for some reason. To take a specific example: many files on the Web are present not in HTML form but in Acrobat/PDF. This latter format is searchable (via search engines such as Verity) but so far none of the commonly used Web search engines are prepared to search PDF files. One way around this is to create metadata for each PDF file which summarises its content, thereby enabling Web users to gain preliminary information about a PDF document before committing themselves to downloading the PDF and activating Acrobat Reader. Equally, a publisher may wish to restrict access (or to charge for access) to a full PDF file. The metadata descriptor is a handy gathering of 'free information' which in some sense acts as a place marker for the full document.
Finally, a metadata descriptor can be created for entities which are not yet available in any electronic form (for example: rare manuscripts in archives that have yet to be scanned).
The idea of a Uniform Resource Name (URN) has been circulating within the Web community for some time. The idea is that documents on the Web should be capable of being referred to by a URN which is unique over all of the Web and persists for an indefinite length of time. At present documents can only be referred to by identifiers such as http://www.cs.nott.ac.uk/mydoc.htm which says where a document is rather than what it is. If the document moves location then the infuriating "404 Not Found" error message appears again and again.
If a persistent name such as davidsdocument could be attached to a document and registered with a central authority (and disallowed for use by anyone else other than the owner of that URN) then, in principle, Web software and browsers could be aware of where a copy of that document resides and could track its location(s) on the Web even if these change from time to time.
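A toy version of such a central authority is sketched here to make the two requirements concrete: a name belongs to exactly one owner, and its location(s) may change while the name stays the same. The class and method names are hypothetical, and an in-memory dictionary stands in for whatever storage a real registry would use.

```python
from typing import Dict, List


class NameRegistry:
    """Hypothetical registry mapping persistent names to current locations."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self._locations: Dict[str, List[str]] = {}

    def _check_owner(self, name: str, owner: str) -> None:
        # A name that is already registered may only be changed by its owner.
        if name in self._owners and self._owners[name] != owner:
            raise PermissionError(f"{name!r} is registered to another owner")

    def register(self, name: str, owner: str, url: str) -> None:
        """Claim a name (or add another location to a name already owned)."""
        self._check_owner(name, owner)
        self._owners[name] = owner
        locations = self._locations.setdefault(name, [])
        if url not in locations:
            locations.append(url)

    def relocate(self, name: str, owner: str, old_url: str, new_url: str) -> None:
        """Record that one copy of the document has moved."""
        self._check_owner(name, owner)
        locations = self._locations[name]
        locations[locations.index(old_url)] = new_url

    def resolve(self, name: str) -> List[str]:
        """Return the current locations of the named document."""
        return list(self._locations.get(name, []))


registry = NameRegistry()
registry.register("davidsdocument", "david", "http://www.cs.nott.ac.uk/mydoc.htm")
# The file moves; links that use the name keep working.
registry.relocate(
    "davidsdocument", "david",
    "http://www.cs.nott.ac.uk/mydoc.htm",
    "http://www.cs.nott.ac.uk/archive/mydoc.htm",
)
current = registry.resolve("davidsdocument")
```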
In 1996 the Association of American Publishers initiated a set of meetings to discuss the idea that a published work should have a unique and persistent identifier -- the Digital Object Identifier (DOI). In many ways the DOI is an extension of the ideas of ISBN (International Standard Book Number) and ISSN (International Standard Serial Number) but, in the new context of Web accessibility of information, spirited debate has continued as to just what the DOI should (and should not) be.
The DOI initiative moved into an important new phase with the setting up of the DOI organisation, currently under the chairmanship of Norman Paskin (seconded from Elsevier UK), and with the adoption of a goal that the DOI should be one of the valid forms of a URN. However, the technical issues surrounding URNs were never going to be easy to resolve and the DOI organisation has chosen, for its pilot scheme, to adopt an already working technology from the research organisation CNRI (in Reston, Virginia). This technology is called Handles; a Handle is one form of a URN which adopts the Unicode character set and which offers a handle resolution service via CNRI.
The initial idea of a DOI was, principally, to aid publishers in conducting e-commerce for electronic copies of papers, books etc. A secondary aim was to help librarians, search engines and other end users to locate material on the Web. Gradually it was realised that a DOI could also act as a placeholder for a saleable entity that was not yet (or perhaps never could be) in electronic form. This led, in turn, to the important idea that the DOI should at the very minimum be resolvable into metadata about the object that the DOI represents.
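The "metadata at the very minimum" rule can be sketched as a resolver in which every DOI has a metadata descriptor, while a link to the full document is optional and is handed out only when the document exists in electronic form and is freely available. All names, DOIs and URLs in this sketch are hypothetical; it is not the interface of the CNRI Handle software.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DoiEntry:
    """What a DOI resolves to: always metadata, sometimes a document."""
    metadata: Dict[str, str]
    full_text_url: Optional[str] = None   # None: not in electronic form
    freely_available: bool = False        # False: access is restricted


class LocalResolver:
    """Hypothetical in-memory stand-in for a local resolver."""

    def __init__(self) -> None:
        self._entries: Dict[str, DoiEntry] = {}

    def add(self, doi: str, entry: DoiEntry) -> None:
        self._entries[doi] = entry

    def metadata_for(self, doi: str) -> Dict[str, str]:
        """First step for every user: the metadata descriptor."""
        return self._entries[doi].metadata   # KeyError if the DOI is unknown

    def full_text_for(self, doi: str) -> Optional[str]:
        """Second step, only if a freely available electronic copy exists."""
        entry = self._entries[doi]
        if entry.full_text_url and entry.freely_available:
            return entry.full_text_url
        return None


resolver = LocalResolver()
resolver.add(
    "10.9999/example.1",
    DoiEntry(
        metadata={"title": "An Example Article"},
        full_text_url="http://example.org/paper.pdf",
        freely_available=True,
    ),
)
# A placeholder for an object with no electronic form: metadata only.
resolver.add(
    "10.9999/manuscript.7",
    DoiEntry(metadata={"title": "An Unscanned Manuscript"}),
)
```

Here the first DOI leads to metadata and then to the PDF, whereas the second leads to metadata alone.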
Professor David F. Brailsford
School of Computer Science and IT
University of Nottingham
Tel: 0115 951 4251.
The NCC contact for this project is: Tim Ring (Editor) E-mail: email@example.com