Item Validation

One useful feature when monitoring a spider is being able to validate your returned items against a defined schema.

Spidermon provides a mechanism that allows you to define an item schema and validation rules that will be executed for each item returned. To enable the item validation feature, the first step is to enable the built-in item pipeline in your project settings:

# tutorial/settings.py
ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}

After that, you need to choose which validation library will be used. Spidermon accepts schemas defined using schematics or JSON Schema.

With schematics

Schematics is a validation library based on ORM-like models. These models include some common data types and validators, but they can also be extended to define custom validation rules.

Warning

You need to install schematics to use this feature.

# Usually placed in validators.py file
from schematics.models import Model
from schematics.types import URLType, StringType, ListType

class QuoteItem(Model):
    quote = StringType(required=True)
    author = StringType(required=True)
    author_url = URLType(required=True)
    tags = ListType(StringType)

Check the schematics documentation to learn how to define a model and how to extend the built-in data types.

With JSON Schema

JSON Schema is a powerful tool for validating the structure of JSON data. You can define which fields are required, the type assigned to each field, a regular expression to validate the content and much more.

Warning

You need to install jsonschema to use this feature.

The JSON Schema guide explains the main keywords and how to generate a schema. The following is an example schema for the quotes item from the tutorial.

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "properties": {
    "quote": {
      "type": "string"
    },
    "author": {
      "type": "string"
    },
    "author_url": {
      "type": "string",
      "pattern": ""
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "quote",
    "author",
    "author_url"
  ]
}
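To see what such a schema accepts and rejects, the following minimal sketch checks two items directly with the jsonschema library, outside of Scrapy. It is an illustration, not Spidermon's own code path: the helper name collect_errors, the author_url pattern, and the tags definition used here are assumptions made for the example.

```python
from jsonschema import Draft7Validator

# Python dict version of the quotes schema. The author_url pattern and the
# tags definition are illustrative assumptions, not values from the tutorial.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "object",
    "properties": {
        "quote": {"type": "string"},
        "author": {"type": "string"},
        "author_url": {"type": "string", "pattern": "^https?://"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["quote", "author", "author_url"],
}

validator = Draft7Validator(schema)


def collect_errors(item):
    """Return a dict mapping each failing field to its error messages.

    Errors that are not tied to a single field (for example a missing
    required field) are grouped under the hypothetical key "(item)".
    """
    errors = {}
    for error in validator.iter_errors(item):
        field = ".".join(str(part) for part in error.path) or "(item)"
        errors.setdefault(field, []).append(error.message)
    return errors


valid_item = {
    "quote": "Some day you will be old enough to start reading fairy tales again.",
    "author": "C.S. Lewis",
    "author_url": "http://example.com/author/C-S-Lewis",
    "tags": ["age", "fairytales", "growing-up"],
}
invalid_item = {"quote": "A quote without an author.", "author_url": "invalid_url"}

print(collect_errors(valid_item))
print(collect_errors(invalid_item))
```

The first call reports no errors. The second reports a pattern failure for author_url and a missing required field at the item level.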

Settings

These are the settings used for configuring item validation:

SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS

Default: False

When set to True, this adds a field called _validation to the item that contains any validation errors. You can change the name of the field by assigning a name to SPIDERMON_VALIDATION_ERRORS_FIELD. An item that fails validation looks like this:

{
    '_validation': defaultdict(<class 'list'>, {'author_url': ['Invalid URL']}),
    'author': 'C.S. Lewis',
    'author_url': 'invalid_url',
    'quote': 'Some day you will be old enough to start reading fairy tales '
        'again.',
    'tags': ['age', 'fairytales', 'growing-up']
}

SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS

Default: False

Whether to drop items that contain validation errors.

SPIDERMON_VALIDATION_ERRORS_FIELD

Default: _validation

The name of the field added to the item when a validation error happens and SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS is enabled.
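The interaction of the three settings above can be sketched in plain Python. This is a simplified illustration, not Spidermon's implementation: the function apply_validation_settings and the exception DroppedItem are hypothetical names, and checking the drop setting before the add-errors setting is an assumption of this sketch.

```python
from collections import defaultdict


class DroppedItem(Exception):
    """Hypothetical stand-in for Scrapy's DropItem, so the sketch runs without Scrapy."""


def apply_validation_settings(
    item,
    errors,
    add_errors_to_items=False,      # mirrors SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS
    drop_items_with_errors=False,   # mirrors SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS
    errors_field="_validation",     # mirrors SPIDERMON_VALIDATION_ERRORS_FIELD
):
    """Return the item, optionally annotated with errors, or raise DroppedItem."""
    if not errors:
        return item  # valid items pass through unchanged
    if drop_items_with_errors:
        # Assumption: dropping is checked first, so a dropped item is never annotated.
        raise DroppedItem(f"validation failed: {dict(errors)}")
    if add_errors_to_items:
        annotated = dict(item)  # copy so the caller's item is left untouched
        field_errors = defaultdict(list)
        for field, messages in errors.items():
            field_errors[field].extend(messages)
        annotated[errors_field] = field_errors
        return annotated
    return item


item = {"author": "C.S. Lewis", "author_url": "invalid_url"}
errors = {"author_url": ["Invalid URL"]}

annotated = apply_validation_settings(item, errors, add_errors_to_items=True)
print(annotated["_validation"]["author_url"])
```

With both flags left at their default of False, the item is returned unchanged even though it has errors. In that case the validation results are not attached to the item.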

SPIDERMON_VALIDATION_MODELS

Default: None

A list of schematics models, given as import paths, that define the items to be validated.

# settings.py

SPIDERMON_VALIDATION_MODELS = [
    'myproject.spiders.validators.DummyItemModel'
]

If your spider produces multiple item types, you can define the setting as a dict that maps each item class to its model:

# settings.py

# DummyItem and OtherItem are your project's item classes; import them here first.
SPIDERMON_VALIDATION_MODELS = {
    DummyItem: 'myproject.spiders.validators.DummyItemModel',
    OtherItem: 'myproject.spiders.validators.OtherItemModel',
}

SPIDERMON_VALIDATION_SCHEMAS

Default: None

A list containing the locations of the item schemas. Each location can be a local path or a URL.

# settings.py

SPIDERMON_VALIDATION_SCHEMAS = [
    '/path/to/schema.json',
    's3://bucket/schema.json',
    'https://example.com/schema.json',
]

If your spider produces multiple item types, you can define the setting as a dict that maps each item class to its schema:

# settings.py

# DummyItem and OtherItem are your project's item classes; import them here first.
SPIDERMON_VALIDATION_SCHEMAS = {
    DummyItem: '/path/to/dummyitem_schema.json',
    OtherItem: '/path/to/otheritem_schema.json',
}
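The dict form of SPIDERMON_VALIDATION_MODELS and SPIDERMON_VALIDATION_SCHEMAS keys each validator by item class. The idea behind that lookup can be sketched in plain Python as follows. This is a toy illustration, not Spidermon's code: the item classes, the required_fields_validator helper, and the choice to leave unmapped item types unvalidated are all assumptions of the sketch.

```python
class DummyItem(dict):
    """Hypothetical item type."""


class OtherItem(dict):
    """Hypothetical item type."""


def required_fields_validator(*fields):
    """Build a toy validator that reports which of the given fields are missing."""
    def validate(item):
        return {
            field: ["This field is required."]
            for field in fields
            if field not in item
        }
    return validate


# One validator per item class, mirroring the shape of the dict-style settings.
VALIDATORS_BY_TYPE = {
    DummyItem: required_fields_validator("title"),
    OtherItem: required_fields_validator("name", "url"),
}


def validate_by_type(item):
    """Look up the validator for the item's class and return its errors."""
    validator = VALIDATORS_BY_TYPE.get(type(item))
    if validator is None:
        return {}  # assumption: item types without an entry are not validated
    return validator(item)


print(validate_by_type(DummyItem(title="A title")))
print(validate_by_type(OtherItem(name="A name")))
```

Each item is checked only against the rules registered for its own class, so a field required for OtherItem is never demanded of a DummyItem.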