Metadata harvesting is the automatic collection of metadata from individual repositories using metadata extraction systems or generators. It occurs through analyzing tags and elements like Dublin Core to gather descriptive, technical, and administrative information without human intervention. However, inconsistencies in metadata practices across repositories can cause confusion and insufficient data for service providers harvesting metadata through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Improving guidelines, local standards, evaluation, communication, and data quality can help address these harvesting problems.
2. What is Metadata Harvesting? An automatic metadata generating method Occurs when metadata is automatically collected from META tags Automatically gathers metadata from individual repositories
3. Example Metadata Generators Metadata generators are also known as metadata extraction systems Sample metadata extraction systems available for libraries include: DC-dot MarcEdit Metaextract IBM Magic System Some are available via open source
4. DC-dot DC-dot is open source and it can be redistributed or modified DC-dot creates Dublin Core metadata Metadata creation is initiated by submitting a URL Generates keywords by analyzing hyperlinked concepts and presentation encoding Does not produce description metadata Generates type, format and date metadata
5. MarcEdit MarcEdit is freely available MarcEdit was initially conceived as a graphical user interface designed as a batch MARC editing tool. An application suite of metadata editing tools that includes character set conversion, XML crosswalking, and metadata harvesting. It allows users to: Customize the existing data conversion rules or create new data conversion rules Harvest metadata from a supported metadata format Create conversion templates for additional metadata formats Customize existing conversion templates to reflect many variations in best practices used among projects
6. Metaextract Designed for metadata extraction in the domain of math and science education for K-12 Also designed to extract Dublin Core and Gateway to Educational Materials metadata on both the item and collection levels Collection-level metadata is generated based on a collection-specific configuration Item-level metadata is extracted from the content of educational documents using three extraction modules: eQuery HTML-based modules Keyword generator module
7. IBM Magic System Includes various content analytic modules for metadata generation: Audiovisual analysis modules that recognize semantic sound categories Text analysis modules that extract title, keywords, and summary from text documents Facilitates content reuse and repurposing Improves interoperability Creates more timely registration of content
8. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) Version 2.0 released in June 2002 Provides an application-independent interoperability framework based on metadata harvesting Two levels of participants in the OAI-PMH: Data providers: Administer the systems Service providers: Use the metadata harvested to build their digital collections
9. OAI-PMH Key terms Harvester Operated by a service provider as a way to collect metadata from a repository Repository A network accessible server that is able to process OAI-PMH requests Managed by the data provider to allow harvesters access to its metadata
10. Harvesting Problems Lack of consistency Different collections using different DC elements and controlled vocabularies Repositories may have missing data within their metadata The repository may decline to fill out elements Incorrect data Data in the wrong element Harvested metadata can be confusing Strings of names can be ordered in an inconsistent manner or ambiguously separated with commas instead of semicolons Insufficient data
11. Recommendations for Improving Harvesting Establish guidelines and best practices Develop local standards Evaluate metadata Check to see if there are certain elements where you have local metadata that would not be useful in an aggregated environment. Check to see if any fields are populated with unknown or N/A Communicate with the service provider
12. Conclusion Evidence suggests that OAI-PMH is a successful endeavor Increase in number of repositories Many funded projects based on OAI eprints.org Metadata Harvesting Initiative of the Mellon Foundation NSF National Science Digital Library (NSDL) The importance of metadata is one of the reasons that the Open Archives Initiative created the Protocol for Metadata Harvesting
Editor's Notes
Metadata harvesting and the Open Archives Initiative Protocol for Metadata Harvesting by Andrew Schenck and Pamela Russell
Metadata harvesting is an automatic metadata generating method. Harvesting occurs when metadata is automatically collected from META tags found in the “header” source code of an HTML resource or encoded from another resource format. Metadata harvesting automatically gathers metadata from individual repositories where it has been produced by either automatic or manual approaches.
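As a rough illustration of collecting metadata from META tags in an HTML header, here is a minimal Python sketch using only the standard library. The class name, the function name, and the sample page are assumptions made for this example; they are not taken from any of the tools discussed in these notes.

```python
from html.parser import HTMLParser


class MetaTagHarvester(HTMLParser):
    """Collect name/content pairs from <meta> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content:
                # A name can repeat (several DC.creator tags), so keep a list.
                self.metadata.setdefault(name.lower(), []).append(content.strip())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def harvest_meta(html_text):
    """Return {meta name (lowercased): [content values]} for one HTML page."""
    parser = MetaTagHarvester()
    parser.feed(html_text)
    return parser.metadata


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head>
<meta name="DC.title" content="Metadata Harvesting">
<meta name="DC.creator" content="Doe, Jane">
<meta name="DC.creator" content="Roe, Richard">
<meta name="keywords" content="metadata, OAI-PMH">
</head><body><meta name="ignored" content="x"></body></html>"""

record = harvest_meta(SAMPLE_PAGE)
print(record)
```

Only tags inside the header are read, which matches the description above; a real harvester would also need to handle pages with missing or malformed headers.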
Many metadata generators are available. These generators, also known as metadata extraction systems, can be extremely helpful for libraries wishing to extract metadata from various repositories. Metadata extraction systems available for libraries include DC-dot, MarcEdit, Metaextract, and IBM Magic System. Some of these systems are open source and free, although the staff needed to run them must usually be paid. Many of the systems were created to harvest all types of metadata, and some were created to harvest metadata for very specific objects or areas of study.
DC-dot was developed by Andy Powell at UKOLN at the University of Bath. DC-dot is open source and can be redistributed or modified under the terms of the GNU General Public License as published by the Free Software Foundation. DC-dot creates Dublin Core metadata and can format output according to a number of different metadata schemas. In DC-dot, metadata creation is initiated by submitting a URL. The resource identifier is copied from the Web browser’s address prompt, and metadata in the title, keywords, description, and type fields is then harvested from the resource’s META tags. DC-dot will automatically generate keywords by analyzing hyperlinked concepts and presentation encoding (bolding and font size), but it will not generate description metadata. DC-dot also automatically generates type, format, and date metadata.
MarcEdit was created by Terry Reese in 1998 and was initially conceived as a graphical user interface designed as a batch MARC editing tool. Currently, MarcEdit is an application suite of metadata editing tools that includes character set conversion, XML crosswalking, and metadata harvesting. Unlike other metadata extraction systems, MarcEdit allows users to customize the existing data conversion rules or create new ones. Users can therefore harvest metadata from a supported metadata format as well as create conversion templates for additional metadata formats. They can also customize existing conversion templates to reflect the many variations in best practices used among projects.
Metaextract is an extraction system designed for metadata extraction in the domain of K-12 math and science education. It was designed to extract Dublin Core and Gateway to Educational Materials metadata on both the item and collection levels using natural language processing techniques. The collection-level metadata is generated based on a collection-specific configuration, and the item-level metadata is extracted from the content of educational documents using three extraction modules: eQuery, HTML-based modules, and a keyword generator module.
IBM Magic System was presented in 2005 and includes various content analytic modules for metadata generation. Its audiovisual analysis modules recognize semantic sound categories and identify narrators and informative text segments, and its text analysis modules extract title, keywords, and summary from text documents. The IBM Magic System can facilitate content reuse and repurposing, improve interoperability, and create more timely registration of content by course developers and authors.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides an application-independent interoperability framework that is based on metadata harvesting. There are two levels of participants in the OAI-PMH: data providers and service providers. Data providers administer the systems that support the OAI-PMH as a means of supplying metadata. Service providers use the metadata harvested through the OAI-PMH to help build their digital collections.
Two other key terms necessary to understand OAI-PMH are harvester and repository. A harvester is a client application that issues OAI-PMH requests. The harvester is operated by a service provider as a way to collect metadata from a repository. A repository is a network-accessible server that is able to process OAI-PMH requests. A repository is managed by the data provider to allow harvesters access to its metadata.
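To make the idea of customizable conversion rules concrete, the following Python sketch shows a crosswalk driven by a rule table that a project can override. It is not MarcEdit's rule format or code. The rule table is an illustrative assumption that loosely follows commonly published Dublin Core to MARC mappings, and the function name is hypothetical.

```python
# Hypothetical rule table: Dublin Core element -> (MARC tag, subfield code).
DEFAULT_RULES = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
}


def crosswalk(dc_record, rules=DEFAULT_RULES):
    """Convert {element: [values]} into sorted (tag, subfield, value) triples."""
    fields = []
    for element, values in dc_record.items():
        if element not in rules:
            continue  # no rule for this element, so it is left out
        tag, code = rules[element]
        for value in values:
            fields.append((tag, code, value))
    return sorted(fields)


# A project with different local practice overrides one rule and adds another.
local_rules = {**DEFAULT_RULES, "creator": ("100", "a"), "language": ("546", "a")}

sample = {"title": ["Sample item"], "creator": ["Doe, Jane"], "language": ["en"]}
print(crosswalk(sample))
print(crosswalk(sample, rules=local_rules))
```

Keeping the rules in data rather than in code is what lets one conversion routine serve projects with different best practices.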
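The exchange between a harvester and a repository can be sketched in Python as follows. The sketch builds a ListRecords request URL and parses a response without touching the network. The base URL, the function names, and the sample response are assumptions made for illustration; the verb, argument names, and XML namespaces are those defined by OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a repository's base URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        fields = {}
        for el in rec.iterfind(".//oai_dc:dc/*", NS):
            name = el.tag.split("}", 1)[-1]  # drop the namespace part of the tag
            fields.setdefault(name, []).append((el.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Hypothetical response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
          <dc:type>Text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE_RESPONSE)
print(list_records_url("http://example.org/oai"))
print(records, token)
```

A working harvester would fetch each URL, repeat the request with the returned resumption token until none is returned, and handle protocol error responses; those steps are omitted here.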
The most common problem with harvested metadata is a lack of consistency. For example, inconsistencies across collections can occur when data providers use some Dublin Core elements and controlled vocabularies in one collection but not in another. On a larger scale, some data providers use Dublin Core elements in different ways throughout their repository, so similar kinds of metadata end up in different fields when harvested. Metadata harvested through OAI-PMH has other significant problems. Many repositories have missing data within their metadata. For example, if an entire collection consists of materials of the same format or type, the repository may decline to fill out the “format” or “type” element in Dublin Core because the information is deemed unnecessary for the collection’s local purposes: every item is the same type, so why fill out that field? This causes problems when an OAI-PMH service provider wants to limit a search by format or type, because that field was left empty by the repository. Examples of incorrect data in a repository include creator names repeated in the language element and the identifier for the metadata record repeated in the Dublin Core identifier element. Incorrect data also includes misspelled words and stray characters such as dashes or hyphens. Another problem with harvested metadata is that it can be confusing. Strings of names can be ordered inconsistently or ambiguously separated with commas instead of semicolons. This type of confusing data can occur when entries are dumped without revision into a metadata record, for instance when records are cut and pasted from Web HTML text. Insufficient data can also cause problems with harvesting, because the metadata present in the repositories is not useful for limiting searches and retrieving specific information.
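Several of these problems can be detected mechanically. The Python sketch below audits one harvested record, represented as a dictionary that maps Dublin Core element names to lists of values. The function name, the choice of required elements, and the comma heuristic are assumptions made for this example, not an established validation standard.

```python
def audit_record(fields, required=("title", "type", "format")):
    """Return a list of human-readable problems found in one harvested record."""
    problems = []

    # Missing data: a required element is absent or holds only blank values.
    for element in required:
        values = [v for v in fields.get(element, []) if v.strip()]
        if not values:
            problems.append(f"missing {element}")

    # Confusing data: more than one comma and no semicolon suggests several
    # "Last, First" names run together with comma separators.
    for name in fields.get("creator", []):
        if name.count(",") > 1 and ";" not in name:
            problems.append(f"ambiguous creator string: {name!r}")

    # Incorrect data: a creator name repeated in the language element.
    creators = set(fields.get("creator", []))
    for value in fields.get("language", []):
        if value in creators:
            problems.append(f"language holds a creator name: {value!r}")

    return problems


# Hypothetical record used only to exercise the checks.
record = {
    "title": ["Sample item"],
    "creator": ["Smith, John, Doe, Jane"],
    "language": ["Smith, John, Doe, Jane"],
}
for problem in audit_record(record):
    print(problem)
```

A report like this is most useful when run per collection, since an empty type or format element is often a collection-wide decision rather than a one-off omission.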
Recommendations for improving harvesting: As a repository, use established guidelines and develop local standards. Either use a guideline and best practices resource that already exists, or develop and document standards to meet your local needs. Evaluate your metadata to determine whether there is some that you do not want or need to share. Check to see if there are certain elements where you have local metadata that would not be useful in an aggregated environment. If you find unnecessary elements, unmap those fields before allowing them to be harvested. While checking for necessary and unnecessary fields, check to see if any fields are populated with "unknown" or "N/A". In an aggregated environment this should not be done. It is better to leave a field blank than to use "unknown" or "N/A" in fields where harvesters might interpret them as meaningful data. Most importantly, communicate with the service provider who is harvesting your records. Review your metadata and determine if there are ways to make it cleaner and easier to understand.
Although the OAI-PMH is far from perfect, there is ample evidence to suggest that it is a successful endeavor. The number of repositories that make their metadata available through OAI-PMH has grown since the initial release in January of 2001. Another way to gauge success is the level of attention garnered from funding agencies. Examples of funded projects and programs that promote or are based on the OAI include eprints.org, the Metadata Harvesting Initiative of the Mellon Foundation, and the NSF National Science Digital Library (NSDL). The importance of metadata is one of the reasons that the Open Archives Initiative created the Protocol for Metadata Harvesting. Although it is not a perfect process, it has been very successful in helping libraries of all types, both large and small, to create and offer Web access to digital collections.
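The unmapping and placeholder recommendations can be sketched in Python as a small clean-up step applied to a record before it is exposed for harvesting. A record is represented here as a dictionary mapping element names to lists of values. The local field names, the placeholder list, and the function name are hypothetical and would need to reflect each repository's own practice.

```python
LOCAL_ONLY = {"shelf_location", "staff_note"}  # hypothetical local field names
PLACEHOLDERS = {"unknown", "n/a"}              # values better left blank


def prepare_for_harvest(fields, local_only=LOCAL_ONLY, placeholders=PLACEHOLDERS):
    """Return a copy of the record that is safe to expose to harvesters."""
    shared = {}
    for element, values in fields.items():
        if element in local_only:
            continue  # unmapped: not useful in an aggregated environment
        kept = [
            v for v in values
            if v.strip() and v.strip().lower() not in placeholders
        ]
        if kept:  # an element with no real values is left out entirely
            shared[element] = kept
    return shared


# Hypothetical local record used only to exercise the sketch.
local_record = {
    "title": ["Sample item"],
    "date": ["Unknown"],
    "format": ["N/A", "image/jpeg"],
    "staff_note": ["Rescan page 3"],
}
print(prepare_for_harvest(local_record))
```

The clean-up leaves the local record untouched and returns a separate shared copy, so local practice does not have to change in order to serve cleaner metadata to service providers.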