Filtering Data

Polymatheia provides a simple filtering language for removing records that are not needed for further processing. All filtering is performed by the polymatheia.filter.RecordsFilter class, and every filter expression is written as a tuple.

Basic filters

The basic filters provided by Polymatheia test individual values, typically by comparing a value in a record to a fixed value:

  • ('true',): Lets any record pass.

  • ('false',): Lets no record pass.

  • ('eq', a, b): Lets the record pass if the value of a is equal to the value of b.

  • ('neq', a, b): Lets the record pass if the value of a is not equal to the value of b.

  • ('gt', a, b): Lets the record pass if the value of a is greater than the value of b.

  • ('gte', a, b): Lets the record pass if the value of a is greater than or equal to the value of b.

  • ('lt', a, b): Lets the record pass if the value of a is less than the value of b.

  • ('lte', a, b): Lets the record pass if the value of a is less than or equal to the value of b.

  • ('contains', a, b): Lets the record pass if the value of a contains the value of b.

  • ('exists', a): Lets the record pass if the value of a is not None.

Where the filter expression contains a and b, either of these can be one of:

  • A dotted string: the value is looked up in the record, with the dotted string giving the path to it.

  • A list: the value is looked up in the record, with the list elements giving the path to it.

  • Anything else: the value is compared as is.

from polymatheia.data.reader import LocalReader
from polymatheia.filter import RecordsFilter

reader = LocalReader('europeana_json')
fltr = ('eq', ['type'], 'IMAGE')
images = RecordsFilter(reader, fltr)
for record in images:
    print(record)
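
The three ways of specifying a and b can be illustrated with a small, self-contained sketch. It is not Polymatheia's implementation: the helper name resolve is hypothetical, records are modelled as plain nested dictionaries, and it assumes that only strings containing a dot are treated as paths (which matches the example above, where 'IMAGE' is compared as is).

```python
from typing import Any


def resolve(record: dict, ref: Any) -> Any:
    """Return the value that ref stands for (hypothetical helper, illustration only)."""
    if isinstance(ref, str) and '.' in ref:
        path = ref.split('.')   # dotted string: split into a list of keys
    elif isinstance(ref, list):
        path = ref              # list: already a list of keys
    else:
        return ref              # anything else: use the value as is
    value = record
    for key in path:
        # Assumption: a path that does not exist resolves to None.
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Made-up sample record, not real Europeana data.
record = {'type': 'IMAGE', 'title': {'en': 'Map of Berlin'}}

print(resolve(record, ['type']))     # IMAGE
print(resolve(record, 'title.en'))   # Map of Berlin
print(resolve(record, 'IMAGE'))      # IMAGE
print(resolve(record, 'title.de'))   # None
```

Under these assumptions, ('eq', ['type'], 'IMAGE') compares the record's type value with the literal string 'IMAGE'.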

Compound filters

Filters can be combined into more complex filter expressions using the following compound filters:

  • ('not', filter_expression): Lets the record pass if filter_expression does not evaluate to True.

  • ('or', filter_expression_1, ..., filter_expression_n): Lets the record pass if one or more of the filter_expression_1 to filter_expression_n is True.

  • ('and', filter_expression_1, ..., filter_expression_n): Lets the record pass only if all filter_expression_1 to filter_expression_n are True.

The 'not' filter is primarily needed together with the 'contains' filter, as each comparison filter already has an explicit opposite ('neq' for 'eq', 'lte' for 'gt', 'lt' for 'gte'):

from polymatheia.data.reader import LocalReader
from polymatheia.filter import RecordsFilter

reader = LocalReader('europeana_json')
fltr = ('not', ('contains', ['dcLanguage'], 'de'))
not_german = RecordsFilter(reader, fltr)
for record in not_german:
    print(record)

The 'or' and 'and' filters use standard boolean logic for evaluation:

from polymatheia.data.reader import LocalReader
from polymatheia.filter import RecordsFilter

reader = LocalReader('europeana_json')
fltr = ('or', ('contains', ['dcLanguage'], 'de'), ('contains', ['dcLanguage'], 'ger'))
full_german = RecordsFilter(reader, fltr)
for record in full_german:
    print(record)

from polymatheia.data.reader import LocalReader
from polymatheia.filter import RecordsFilter

reader = LocalReader('europeana_json')
fltr = ('and', ('contains', ['dcLanguage'], 'de'), ('eq', ['type'], 'IMAGE'))
german_images = RecordsFilter(reader, fltr)
for record in german_images:
    print(record)

The compound filters can themselves be nested:

from polymatheia.data.reader import LocalReader
from polymatheia.filter import RecordsFilter

reader = LocalReader('europeana_json')
fltr = ('and',
            ('or',
                ('contains', ['dcLanguage'], 'de'),
                ('contains', ['dcLanguage'], 'ger')),
            ('eq', ['type'], 'IMAGE'))
full_german_images = RecordsFilter(reader, fltr)
for record in full_german_images:
    print(record)
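
To make the evaluation order of nested expressions concrete, the following sketch evaluates tuple filters recursively over plain dictionaries. It is an illustration under stated assumptions, not the code RecordsFilter runs: the functions value_of and matches are hypothetical, only list paths are resolved, only the 'eq' and 'contains' basic filters are covered for brevity, and the sample records are made up.

```python
def value_of(record, ref):
    """Resolve a list path against nested dicts; return anything else as is."""
    if not isinstance(ref, list):
        return ref
    value = record
    for key in ref:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(record, fltr):
    """Return True if the record passes the filter expression (hypothetical helper)."""
    op, *args = fltr
    if op == 'true':
        return True
    if op == 'false':
        return False
    if op == 'not':
        return not matches(record, args[0])
    if op == 'or':
        # Passes as soon as one sub-expression is True.
        return any(matches(record, sub) for sub in args)
    if op == 'and':
        # Passes only if every sub-expression is True.
        return all(matches(record, sub) for sub in args)
    if op in ('eq', 'contains'):
        a, b = (value_of(record, arg) for arg in args)
        if op == 'eq':
            return a == b
        return a is not None and b in a
    raise ValueError(f'Unsupported filter in this sketch: {op}')


# Made-up sample records standing in for the reader's output.
records = [
    {'type': 'IMAGE', 'dcLanguage': ['de']},
    {'type': 'TEXT', 'dcLanguage': ['ger']},
    {'type': 'IMAGE', 'dcLanguage': ['en']},
]
fltr = ('and',
        ('or',
         ('contains', ['dcLanguage'], 'de'),
         ('contains', ['dcLanguage'], 'ger')),
        ('eq', ['type'], 'IMAGE'))
full_german_images = [r for r in records if matches(r, fltr)]
print(full_german_images)   # [{'type': 'IMAGE', 'dcLanguage': ['de']}]
```

The inner 'or' is evaluated first for each record, and its result is then combined with the 'eq' test by the outer 'and', so only the first sample record passes.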