Item Validation
===============

One useful feature when monitoring a spider is being able to validate the
returned items against a defined schema.

Spidermon provides a mechanism that allows you to define an item schema and
validation rules that are executed for each item returned.

To enable item validation, first enable the built-in item pipeline in your
project settings:

.. code-block:: python

    # tutorial/settings.py
    ITEM_PIPELINES = {
        'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
    }

.. warning::

    Preferably, make it the last pipeline to run, so that no subsequent
    pipeline changes the content of an item after it has been validated.

Using JSON Schema
-----------------

`JSON Schema`_ is a powerful tool for validating the structure of JSON data.
You can define which fields are required, the type assigned to each field, a
regular expression to validate the content, and much more.

.. warning::

    You need to install `jsonschema`_ to use this feature.

This `guide`_ explains the main keywords and how to generate a schema. Here is
an example of a schema for the quotes item from the :doc:`tutorial `.

.. code-block:: json

    {
      "$schema": "http://json-schema.org/draft-07/schema",
      "type": "object",
      "properties": {
        "quote": {
          "type": "string"
        },
        "author": {
          "type": "string"
        },
        "author_url": {
          "type": "string",
          "pattern": ""
        },
        "tags": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      },
      "required": [
        "quote",
        "author",
        "author_url"
      ]
    }

Settings
--------

These are the settings used for configuring item validation:

SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``False``

When set to ``True``, this adds a field called ``_validation`` to the item,
containing any validation errors found:

.. code-block:: python

    {
        '_validation': {'author_url': ['Invalid URL']},
        'author': 'C.S. Lewis',
        'author_url': 'invalid_url',
        'quote': 'Some day you will be old enough to start reading fairy tales '
                 'again.',
        'tags': ['age', 'fairytales', 'growing-up']
    }

You can change the name of the field by assigning a name to
`SPIDERMON_VALIDATION_ERRORS_FIELD`_.

SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``False``

Whether to drop items that contain validation errors.

SPIDERMON_VALIDATION_ERRORS_FIELD
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``_validation``

The name of the field added to the item when a validation error happens and
`SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS`_ is enabled.

Nested fields are supported by using ``.`` as a separator:

.. code-block:: python

    # settings.py
    SPIDERMON_VALIDATION_ERRORS_FIELD = "top_level.second_level._validation"

SPIDERMON_VALIDATION_SCHEMAS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``None``

A ``list`` containing the locations of the item schemas. Each entry can be a
local path or a URL.

.. code-block:: python

    # settings.py
    SPIDERMON_VALIDATION_SCHEMAS = [
        '/path/to/schema.json',
        's3://bucket/schema.json',
        'https://example.com/schema.json',
    ]

If you are working on a spider that produces multiple item types, you can
define the setting as a ``dict`` that maps each item class to its schema:

.. code-block:: python

    # settings.py
    from tutorial.items import DummyItem, OtherItem

    SPIDERMON_VALIDATION_SCHEMAS = {
        DummyItem: '/path/to/dummyitem_schema.json',
        OtherItem: '/path/to/otheritem_schema.json',
    }

Validation in Monitors
----------------------

You can build a monitor that checks the validation problems and raises errors
if there are too many. You can base it on
``spidermon.contrib.monitors.mixins.ValidationMonitorMixin``, which provides
methods that are useful for this.

There are two groups of methods: one for checking all validation errors and
one specifically for checking ``missing_required_field`` errors. All of these
methods rely on the job stats, reading
``spidermon/validation/fields/errors/*`` entries.

* ``check_missing_required_fields``, ``check_missing_required_field``: check
  that the number of ``missing_required_field`` errors is less than the
  specified threshold.

* ``check_missing_required_fields_percent``,
  ``check_missing_required_field_percent``: check that the percentage of
  ``missing_required_field`` errors is less than the specified threshold.

* ``check_fields_errors``, ``check_field_errors``: check that the number of
  specified (or all) errors is less than the specified threshold.

* ``check_fields_errors_percent``, ``check_field_errors_percent``: check that
  the percentage of specified (or all) errors is less than the specified
  threshold.

All ``*_field`` methods take the name of one field, while all ``*_fields``
methods take a list of field names.

.. warning::

    By default, when no field names are passed, the ``*_fields`` methods
    combine the error counts for all fields instead of checking each field
    separately. This is usually not very useful, and it is inconsistent with
    the behavior when a list of fields is passed, so you should set the
    ``correct_field_list_handling`` monitor attribute to get the correct
    behavior. This will be the default in some later version.

.. note::

    The ``*_percent`` methods receive the ratio, not the percent number, so
    for 15% you need to pass 0.15.

Some examples:

.. code-block:: python

    # checks that each of field2 and field3 is missing in no more than 10 items
    self.check_missing_required_fields(field_names=['field2', 'field3'], allowed_count=10)

    # checks that field2 has errors in no more than 15% of items
    self.check_field_errors_percent(field_name='field2', allowed_percent=0.15)

    # checks that no errors are present in any field
    self.check_fields_errors_percent()

.. _`JSON Schema`: https://json-schema.org/
.. _`guide`: http://json-schema.org/learn/getting-started-step-by-step.html
.. _`jsonschema`: https://pypi.org/project/jsonschema/
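
The ratio convention used by the ``*_percent`` methods can be illustrated
without Spidermon itself. The following is a minimal sketch, not the library's
implementation: the ``error_ratio`` and ``within_threshold`` helpers are
hypothetical, and the exact stat key names (``spidermon/validation/items`` and
the per-error, per-field suffix under ``spidermon/validation/fields/errors/``)
are assumptions made for illustration. Check the stats of one of your own jobs
for the real keys.

```python
# Assumed stat key layout; verify against the stats of a real job.
ERRORS_PREFIX = "spidermon/validation/fields/errors"
ITEMS_KEY = "spidermon/validation/items"


def error_ratio(stats, error, field_name):
    """Return the share of validated items with `error` on `field_name`."""
    total = stats.get(ITEMS_KEY, 0)
    if total == 0:
        # No validated items, so there is nothing to compare against.
        return 0.0
    count = stats.get(f"{ERRORS_PREFIX}/{error}/{field_name}", 0)
    return count / total


def within_threshold(stats, error, field_name, allowed_ratio):
    """True when the error ratio is at most `allowed_ratio` (0.15 means 15%)."""
    return error_ratio(stats, error, field_name) <= allowed_ratio


# Hypothetical stats: 200 validated items, 20 of them missing author_url.
stats = {
    ITEMS_KEY: 200,
    f"{ERRORS_PREFIX}/missing_required_field/author_url": 20,
}

ratio = error_ratio(stats, "missing_required_field", "author_url")
print(ratio)
print(within_threshold(stats, "missing_required_field", "author_url", 0.15))
print(within_threshold(stats, "missing_required_field", "author_url", 0.05))
```

Here 20 of 200 items are affected, so the ratio is compared directly against
``0.15`` or ``0.05``. Passing ``15`` instead would allow far more errors than
intended.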