Item Validation

One useful feature when monitoring a spider is being able to validate your returned items against a defined schema.

Spidermon provides a mechanism that allows you to define an item schema and validation rules that will be executed for each item returned. To enable the item validation feature, the first step is to enable the built-in item pipeline in your project settings:

# tutorial/settings.py
ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}

After that, you need to choose which validation library will be used. Spidermon accepts schemas defined using schematics or JSON Schema.

With schematics

Schematics is a validation library based on ORM-like models. These models include some common data types and validators, but they can also be extended to define custom validation rules.

Warning

You need to install schematics to use this feature.

# Usually placed in validators.py file
from schematics.models import Model
from schematics.types import URLType, StringType, ListType

class QuoteItem(Model):
    quote = StringType(required=True)
    author = StringType(required=True)
    author_url = URLType(required=True)
    tags = ListType(StringType)

Check schematics documentation to learn how to define a model and how to extend the built-in data types.

With JSON Schema

JSON Schema is a powerful tool for validating the structure of JSON data. You can define which fields are required, the type assigned to each field, a regular expression to validate the content and much more.

Warning

You need to install jsonschema to use this feature.

This guide explains the main keywords and how to generate a schema. Here we have an example of a schema for the quotes item from the tutorial.

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "properties": {
    "quote": {
      "type": "string"
    },
    "author": {
      "type": "string"
    },
    "author_url": {
      "type": "string",
      "pattern": ""
    },
    "tags": {
      "type"
    }
  },
  "required": [
    "quote",
    "author",
    "author_url"
  ]
}

Settings

These are the settings used for configuring item validation:

SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS

Default: False

When set to True, this adds a field called _validation to the item that contains any validation errors. You can change the name of the field by assigning a name to SPIDERMON_VALIDATION_ERRORS_FIELD:

{
    '_validation': defaultdict(<class 'list'>, {'author_url': ['Invalid URL']}),
    'author': 'C.S. Lewis',
    'author_url': 'invalid_url',
    'quote': 'Some day you will be old enough to start reading fairy tales '
        'again.',
    'tags': ['age', 'fairytales', 'growing-up']
}

SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS

Default: False

Whether to drop items that contain validation errors.

SPIDERMON_VALIDATION_ERRORS_FIELD

Default: _validation

The name of the field added to the item when a validation error happens and SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS is enabled.

SPIDERMON_VALIDATION_MODELS

Default: None

A list containing the schematics models that contain the definition of the items that need to be validated.

# settings.py

SPIDERMON_VALIDATION_MODELS: [
    'myproject.validators.DummyItemModel'
]

If you are working on a spider that produces multiple items types, you can define it as a dict:

# settings.py

SPIDERMON_VALIDATION_MODELS: {
    DummyItem: 'myproject.validators.DummyItemModel',
    OtherItem: 'myproject.validators.OtherItemModel',
}

SPIDERMON_VALIDATION_SCHEMAS

Default: None

A list containing the location of the item schema. Could be a local path or a URL.

# settings.py

SPIDERMON_VALIDATION_SCHEMAS: [
    '/path/to/schema.json',
    's3://bucket/schema.json',
    'https://example.com/schema.json',
]

If you are working on a spider that produces multiple items types, you can define it as a dict:

# settings.py

SPIDERMON_VALIDATION_SCHEMAS: {
    DummyItem: '/path/to/dummyitem_schema.json',
    OtherItem: '/path/to/otheritem_schema.json',
}

Validation in Monitors

You can build a monitor that checks the validation problems and raises errors if there are too many. You can base it on spidermon.contrib.monitors.mixins.ValidationMonitorMixin which provides methods that can be useful for this. There are 2 groups of methods, for checking all validation errors and specifically for checking missing_required_field errors. All of these methods rely on the job stats, reading spidermon/validation/fields/errors/* entries.

  • check_missing_required_fields, check_missing_required_field - check that number of missing_required_field errors is less than the specified threshold.
  • check_missing_required_fields_percent, check_missing_required_field_percent - check that percent of missing_required_field errors is less than the specified threshold.
  • check_fields_errors, check_field_errors - check that the number of specified (or all) errors is less than the specified threshold.
  • check_fields_errors_percent, check_field_errors_percent - check that the percent of specified (or all) errors is less than the specified threshold.

All *_field method take a name of one field, while all *_fields method take a list of field names.

Warning

The default behavior for *_fields methods when no field names is passed is to combine error counts for all fields instead of checking each field separately. This is usually not very useful and inconsistent with the behavior when a list of fields is passed, so you should set the correct_field_list_handling monitor attribute to get the correct behavior. This will be the default in some later version.

Some examples:

# checks that each of field2 and field3 is missing in no more than 10 items
self.check_missing_required_fields(field_names=['field2', 'field3'], allowed_count=10)

# checks that field2 has errors in no more than 15% of items
self.check_field_errors_percent(field_name='field2', allowed_percent=15)

# checks that no errors is present in any fields
self.check_field_errors_percent()