Loading Data via OAI-PMH

Polymatheia supports accessing metadata records via the OAI-PMH protocol. OAI-PMH is a generic protocol designed for harvesting large amounts of metadata from an archive and was initially intended primarily for archive-to-archive metadata exchange. As a result of this focus, OAI-PMH provides only very limited facilities for filtering data on the server side. Instead, the standard pattern is to download complete datasets from the server and then filter and transform them locally.
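The download-then-filter pattern can be illustrated with plain Python, independent of any OAI-PMH library. This is only a sketch: the record dictionaries and the `language` field are hypothetical stand-ins for harvested metadata, not the structure that Polymatheia returns.

```python
from typing import Iterable, Iterator


def filter_by_language(records: Iterable[dict], language: str) -> Iterator[dict]:
    """Yield only the records whose (hypothetical) 'language' field matches."""
    for record in records:
        if record.get('language') == language:
            yield record


# Stand-in data; in practice the records would come from a harvest.
harvested = [
    {'id': 'a', 'language': 'de'},
    {'id': 'b', 'language': 'en'},
    {'id': 'c', 'language': 'de'},
]

german = list(filter_by_language(harvested, 'de'))
print([record['id'] for record in german])
```

Because the filter is a generator, it can be chained onto any iterable of records without first loading the whole harvest into memory.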

Finding the available Sets

OAI-PMH supports server-side filtering through sets, which are optional, server-defined groupings of records. To find the sets a server offers, use the OAISetReader:

from polymatheia.data.reader import OAISetReader

reader = OAISetReader('http://www.digizeitschriften.de/oai2/')
for setSpec in reader:
    print(setSpec)
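For readers curious about what happens at the protocol level, the following sketch parses a ListSets response using only the standard library. The sample XML is a hand-written stand-in with invented set names, not output from the server above, and the sketch makes no claim about how OAISetReader is implemented.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response elements.
OAI_NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

# Hand-written stand-in for a ListSets response; the sets are invented.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>maps</setSpec><setName>Historical maps</setName></set>
    <set><setSpec>maps:europe</setSpec><setName>Maps of Europe</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [
        (node.findtext('oai:setSpec', namespaces=OAI_NS),
         node.findtext('oai:setName', namespaces=OAI_NS))
        for node in root.iterfind('.//oai:set', OAI_NS)
    ]


for spec, name in parse_sets(SAMPLE_RESPONSE):
    print(spec, '-', name)
```

The setSpec value is the identifier a harvester sends back to the server to restrict a request to that set.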

Finding the available MetadataFormats

An OAI-PMH server can provide the same metadata in different formats. To find out which formats a server provides, use the OAIMetadataFormatReader:

from polymatheia.data.reader import OAIMetadataFormatReader

reader = OAIMetadataFormatReader('http://www.digizeitschriften.de/oai2/')
for metadata_format in reader:  # avoid shadowing the built-in format()
    print(metadata_format)

Fetching records

To retrieve all records, use the OAIRecordReader:

from polymatheia.data.reader import OAIRecordReader

reader = OAIRecordReader('http://www.digizeitschriften.de/oai2/')
for record in reader:
    print(record)

Warning

This will retrieve ALL records provided by the OAI-PMH server, which can take a significant amount of time.
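One reason a full harvest takes time is that OAI-PMH servers usually return long lists in pages, each ending with a resumption token that must be sent back to request the next page. The following sketch shows that loop with the standard library. The `harvest` function, the `PAGES` dictionary, and the identifiers are hypothetical stand-ins that replace real HTTP requests; this is not Polymatheia's implementation.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

# Clark-notation prefix for the OAI-PMH 2.0 namespace.
OAI = '{http://www.openarchives.org/OAI/2.0/}'


def harvest(fetch_page: Callable[[Optional[str]], str]) -> Iterator[str]:
    """Yield record identifiers, following resumption tokens until exhausted."""
    token = None
    while True:
        root = ET.fromstring(fetch_page(token))
        for header in root.iter(OAI + 'header'):
            yield header.findtext(OAI + 'identifier')
        token = root.findtext('.//' + OAI + 'resumptionToken')
        if not token:  # a missing or empty token marks the last page
            break


# Two hand-written pages standing in for server responses.
PAGES = {
    None: '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
          '<record><header><identifier>oai:example:1</identifier></header></record>'
          '<resumptionToken>page-2</resumptionToken></ListRecords></OAI-PMH>',
    'page-2': '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
              '<record><header><identifier>oai:example:2</identifier></header></record>'
              '<resumptionToken/></ListRecords></OAI-PMH>',
}

identifiers = list(harvest(PAGES.get))
print(identifiers)
```

Each page costs one round trip to the server, so the total time grows with the number of pages rather than with a single request.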

Limiting the number of records

To stop after the first n records, pass the max_records parameter:

from polymatheia.data.reader import OAIRecordReader

reader = OAIRecordReader('http://www.digizeitschriften.de/oai2/', max_records=10)
for record in reader:
    print(record)

Note

Whether running this code repeatedly returns the same records depends on the OAI-PMH server implementation. Polymatheia cannot guarantee any order.

Selecting by set

To retrieve only those records that are in a set, provide the set specifier via the set_spec parameter:

from polymatheia.data.reader import OAIRecordReader

reader = OAIRecordReader('http://www.digizeitschriften.de/oai2/', max_records=10, set_spec='EU')
for record in reader:
    print(record)

Selecting the metadata format

To retrieve the records in a specific metadata format, provide the format identifier via the metadata_prefix parameter:

from polymatheia.data.reader import OAIRecordReader

reader = OAIRecordReader('http://www.digizeitschriften.de/oai2/', max_records=10, metadata_prefix='mets')
for record in reader:
    print(record)
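At the protocol level, set and format selection are expressed as the `set` and `metadataPrefix` arguments of a ListRecords request. The sketch below builds such a request URL with the standard library. The helper `list_records_url` is hypothetical, and the assumption that Polymatheia's set_spec and metadata_prefix parameters map onto these arguments is not confirmed by this document.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix='oai_dc', set_spec=None):
    """Build a ListRecords request URL (hypothetical helper for illustration)."""
    # metadataPrefix is a required argument of ListRecords; set is optional.
    params = {'verb': 'ListRecords', 'metadataPrefix': metadata_prefix}
    if set_spec is not None:
        params['set'] = set_spec
    return base_url + '?' + urlencode(params)


url = list_records_url(
    'http://www.digizeitschriften.de/oai2/',
    metadata_prefix='mets',
    set_spec='EU',
)
print(url)
# http://www.digizeitschriften.de/oai2/?verb=ListRecords&metadataPrefix=mets&set=EU
```

Opening such a URL in a browser can be a quick way to check what a server returns for a given combination before starting a harvest.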