Metadata Harvesting via OAI-PMH

  • 1. Metadata Harvesting and the OAI-PMH. Andrew Schenck and Pamela Russell, LIS 688
  • 2. What is Metadata Harvesting? An automatic metadata generating method. Occurs when metadata is automatically collected from META tags. Automatically gathers metadata from individual repositories.
  • 3. Example Metadata Generators. Metadata generators are also known as metadata extraction systems. Sample metadata extraction systems available for libraries include DC-dot, MarcEdit, Metaextract, and IBM Magic System. Some are available via open source.
  • 4. DC-dot. DC-dot is open source and can be redistributed or modified. It creates Dublin Core metadata. Metadata creation is initiated by submitting a URL. Generates keywords by analyzing hyperlinked concepts and presentation encoding. Does not produce description metadata. Generates type, format, and date metadata.
  • 5. MarcEdit. MarcEdit is freely available. It was initially conceived as a graphical user interface for batch MARC editing. It is now an application suite of metadata editing tools that includes character set conversion, XML crosswalking, and metadata harvesting. It allows users to: customize the existing data conversion rules or create new ones; harvest metadata from a supported metadata format; create conversion templates for additional metadata formats; and customize existing conversion templates to reflect the many variations in best practices used among projects.
  • 6. Metaextract. Designed for metadata extraction in the domain of math and science education for K-12. Also designed to extract Dublin Core and Gateway to Educational Materials metadata on both the item and collection levels. Collection-level metadata is generated based on a collection-specific configuration. Item-level metadata is extracted from the content of educational documents using three extraction modules: eQuery, HTML-based modules, and a keyword generator module.
  • 7. IBM Magic System. Includes various content analytic modules for metadata generation: audiovisual analysis modules that recognize semantic sound categories, and text analysis modules that extract title, keywords, and summary from text documents. Facilitates content reuse and repurposing. Improves interoperability. Creates more timely registration of content.
  • 8. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Version 2.0 released in June 2002. Provides an application-independent interoperability framework based on metadata harvesting. Two levels of participants in the OAI-PMH: data providers, who administer the systems that supply metadata, and service providers, who use the harvested metadata to build their digital collections.
  • 9. OAI-PMH Key Terms. Harvester: operated by a service provider as a way to collect metadata from a repository. Repository: a network-accessible server that is able to process OAI-PMH requests; managed by the data provider to allow harvesters access to its metadata.
  • 10. Harvesting Problems. Lack of consistency: different collections use different DC elements and controlled vocabularies. Missing data: repositories may have missing data within their metadata, and a repository may decline to fill out elements. Incorrect data: data in the wrong element. Confusing data: strings of names can be ordered in an inconsistent manner or ambiguously separated with commas instead of semicolons. Insufficient data.
  • 11. Recommendations for Improving Harvesting. Establish guidelines and best practices. Develop local standards. Evaluate metadata: check whether there are elements where you have local metadata that would not be useful in an aggregated environment, and check whether any fields are populated with "unknown" or "N/A". Communicate with the service provider.
  • 12. Conclusion. Evidence suggests that OAI-PMH is a successful endeavor: the number of repositories has increased, and many funded projects are based on OAI, including eprints.org, the Metadata Harvesting Initiative of the Mellon Foundation, and the NSF National Science Digital Library (NSDL). The importance of metadata is one of the reasons that the Open Archives Initiative created the Protocol for Metadata Harvesting.
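
The harvester and repository roles on slides 8 and 9 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the presenters' method: the base URL `https://repository.example.org/oai`, the function names, and the sample response are all made up for this example. The `ListRecords` verb, the `oai_dc` metadata prefix, and the two XML namespaces are the ones defined by OAI-PMH 2.0 and unqualified Dublin Core. No network request is made; the sketch only builds the request URL and parses a hand-written response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and the Dublin Core element set.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (hypothetical repository and record).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL a harvester would request to list a repository's records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return one dict per record: its OAI identifier and its Dublin Core fields."""
    root = ET.fromstring(xml_text)
    dc_prefix = "{" + NS["dc"] + "}"
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        fields = {}
        metadata = rec.find("oai:metadata", NS)
        if metadata is not None:  # deleted records carry no metadata element
            for el in metadata.iter():
                if el.tag.startswith(dc_prefix):
                    name = el.tag[len(dc_prefix):]
                    fields.setdefault(name, []).append((el.text or "").strip())
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        records.append({"identifier": identifier, "fields": fields})
    return records
```

Calling `list_records_url("https://repository.example.org/oai")` yields the request a harvester would send, and `parse_records(SAMPLE)` returns the record's identifier together with its `title` and `creator` values. A real harvester would also need to follow resumption tokens and handle protocol errors, which this sketch leaves out.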

Editor's Notes

  1. Metadata harvesting and the Open Archives Initiative Protocol for Metadata Harvesting by Andrew Schenck and Pamela Russell
  2. Metadata harvesting is an automatic metadata generating method. Harvesting occurs when metadata is automatically collected from META tags found in the “header” source code of an HTML resource or encoded from another resource format. Metadata harvesting automatically gathers metadata from individual repositories where it has been produced by either automatic or manual approaches.
  3. As with other automated tasks, a multitude of metadata generators is available. These generators, also known as metadata extraction systems, can be extremely helpful for libraries wishing to extract metadata from various repositories. Metadata extraction systems available to libraries include DC-dot, MarcEdit, Metaextract, and IBM Magic System. Some of these systems are available via open source and are free, although the people needed to run them must usually be paid. Many of the systems were created to harvest all types of metadata, and some were created to harvest metadata for very specific objects or areas of study.
  4. DC-dot was developed by Andy Powell at UKOLN at the University of Bath. DC-dot is open source and can be redistributed or modified under the terms of the GNU General Public License as published by the Free Software Foundation. DC-dot creates Dublin Core metadata and can format output according to a number of different metadata schemas. In DC-dot, metadata creation is initiated by submitting a URL. The resource identifier metadata from the Web browser’s address prompt is copied, and metadata included in the title, keywords, description, and type fields is then harvested from the resource META tags. DC-dot will automatically generate keywords by analyzing hyperlinked concepts and presentation encoding (bolding and font size), but will not produce description metadata. DC-dot also automatically generates type, format, and date metadata.
  5. MarcEdit was created by Terry Reese in 1998 and was initially conceived as a graphical user interface designed as a batch MARC editing tool. Currently, MarcEdit is an application suite of metadata editing tools that includes character set conversion, XML crosswalking, and metadata harvesting. Unlike other metadata extraction systems, MarcEdit allows users to customize the existing data conversion rules or create new data conversion rules. This allows users to harvest metadata from a supported metadata format as well as create conversion templates for additional metadata formats. It also allows users to customize existing conversion templates to reflect many variations in best practices used among projects.
  6. Metaextract is an extraction system that was designed for metadata extraction in the domain of math and science education for K-12. It was designed to extract Dublin Core and Gateway to Educational Materials metadata on both the item and collection levels using natural language processing techniques. The collection-level metadata is generated based on a collection-specific configuration, and the item-level metadata is extracted from the content of educational documents using three extraction modules: eQuery, HTML-based modules, and a keyword generator module.
  7. IBM Magic System was presented in 2005 and includes various content analytic modules for metadata generation. Its audiovisual analysis modules recognize semantic sound categories and identify narrators and informative text segments, and its text analysis modules extract title, keywords, and summary from text documents. The IBM Magic System can facilitate content reuse and repurposing, improve interoperability, and create more timely registration of content by course developers and authors.
  8. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides an application-independent interoperability framework that is based on metadata harvesting. There are two levels of participants in the OAI-PMH: data providers and service providers. Data providers administer the systems that support the OAI-PMH as a means of supplying metadata. Service providers use the metadata harvested through the OAI-PMH to help build their digital collections.
  9. Two other key terms necessary to understand OAI-PMH are harvester and repository. A harvester is a client application that issues OAI-PMH requests. The harvester is operated by a service provider as a way to collect metadata from a repository. A repository is a network-accessible server that is able to process OAI-PMH requests. A repository is managed by the data provider to allow harvesters access to its metadata.
  10. The most common problem with harvested metadata is a lack of consistency. For example, inconsistencies across collections can occur when data providers use some Dublin Core elements and controlled vocabularies in one collection but not in another. On a larger scale, some data providers use different Dublin Core elements in different ways throughout their repository. This can lead to similar kinds of metadata ending up in different fields when harvested. The metadata harvested through OAI-PMH has other significant problems. Many repositories have missing data within their metadata. For example, if an entire collection consists of materials of the same format or type, the repository may decline to fill out the “format” or “type” element in Dublin Core because the information is deemed unnecessary for the collection’s local purposes: every item is the same type, so why fill out that field? This causes problems when an OAI-PMH service provider wants to limit a search by the format or type element, because that field was left empty by the repository. Examples of incorrect data in a repository include creator names repeated in the language element, or the identifier for the metadata record repeated in the Dublin Core identifier element. Incorrect data also includes misspelled words and stray characters such as dashes or hyphens. Another problem with harvested metadata is that it can be confusing. Strings of names can be ordered in an inconsistent manner or ambiguously separated with commas instead of semicolons. This type of confusing data can occur when entries are dumped without revision into a metadata record, for example when records are cut and pasted from Web HTML text. Insufficient data can also cause problems with harvesting because the metadata present in the repositories is not useful when trying to limit searches and retrieve specific information.
  11. Recommendations for improving harvesting: As a repository, use established guidelines and develop local standards. Either use a guideline and best practices resource that already exists, or develop and document standards to meet your local needs. Evaluate your metadata to determine whether there is some that you do not want or need to share. Check to see if there are certain elements where you have local metadata that would not be useful in an aggregated environment. If you find that there are some unnecessary elements, unmap the fields before allowing them to be harvested. While checking for necessary and unnecessary fields, check to see if any fields are populated with “unknown” or “N/A”. In an aggregated environment this should not be done: it is better to leave a field blank than to use “unknown” or “N/A” in fields where harvesters might interpret them as meaningful data. Most importantly, communicate with the service provider who is harvesting your records. Review your metadata and determine if there are ways to make it cleaner and easier to understand.
  12. Although the OAI-PMH is far from perfect, there is ample evidence to suggest that it is a successful endeavor. The number of repositories that make their metadata available through OAI-PMH has grown since the initial release in January of 2001. Another way to gauge success is the level of attention garnered from funding agencies. Some examples of funded projects and programs that promote or are based on the OAI are eprints.org, the Metadata Harvesting Initiative of the Mellon Foundation, and the NSF National Science Digital Library (NSDL). The importance of metadata is one of the reasons that the Open Archives Initiative created the Protocol for Metadata Harvesting. Although it is not a perfect process, it has been very successful in helping many libraries of all types, both large and small, to create and offer Web access to digital collections.
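
The problems in note 10 and the checks recommended in note 11 can be turned into a small pre-harvest audit. The Python sketch below is a minimal illustration under stated assumptions, not a procedure from the presentation: the function name `audit_record`, the list of expected elements, and the comma-counting heuristic for ambiguous name strings are all hypothetical choices that a repository would adapt to its own local standards. A record is assumed to be a mapping from Dublin Core element name to a list of string values.

```python
# Placeholder strings named in the notes; compared case-insensitively.
PLACEHOLDERS = {"unknown", "n/a"}

# Assumed local standard: elements every shared record should carry.
EXPECTED = ("title", "creator", "date", "type", "format")


def audit_record(fields):
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Missing data: an expected element is absent or holds only blank values.
    for name in EXPECTED:
        values = [v for v in fields.get(name, []) if v.strip()]
        if not values:
            problems.append(f"missing element: {name}")

    # Placeholder values that a harvester might treat as meaningful data.
    for name, values in fields.items():
        for value in values:
            if value.strip().lower() in PLACEHOLDERS:
                problems.append(f"placeholder value in {name}: {value!r}")

    # Heuristic (an assumption): several commas and no semicolon in one
    # creator value suggests multiple names run together ambiguously.
    for value in fields.get("creator", []):
        if ";" not in value and value.count(",") > 1:
            problems.append(f"ambiguous name list in creator: {value!r}")

    return problems
```

For example, a record whose `type` is "N/A", whose `format` is absent, and whose `creator` is the single string "Doe, Jane, Roe, Richard" would be reported with three problems, one of each kind. Blank fields are reported as missing rather than silently accepted, so the repository can decide whether to fill them or unmap them before harvesting.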