Monitoring your jobs¶
Monitors¶
Monitors are the classes where you include your monitoring logic. After defining them, you need to include them in a MonitorSuite so they can be executed.
As spidermon.core.monitors.Monitor inherits from Python unittest.TestCase, you can use all existing assertion methods in your monitors.
In the following example, we define a monitor that verifies whether a minimum number of items were extracted, failing if the count is below the expected threshold.
from spidermon import Monitor, monitors

@monitors.name('Item count')
class ItemCountMonitor(Monitor):

    @monitors.name('Minimum items extracted')
    def test_minimum_number_of_items_extracted(self):
        minimum_threshold = 100
        item_extracted = getattr(self.data.stats, 'item_scraped_count', 0)
        self.assertFalse(
            item_extracted < minimum_threshold,
            msg='Extracted less than {} items'.format(minimum_threshold)
        )
A Monitor instance defines a monitor that includes your monitoring logic and has the following properties that can be used to help you implement your monitors:
data.stats
dict-like object containing the stats of the spider execution
data.crawler
instance of actual Crawler object
data.spider
instance of actual Spider object
- class spidermon.core.monitors.Monitor(methodName='runTest', name=None)¶
Monitor Suites¶
A Monitor Suite groups a set of Monitor classes and allows you to specify which actions must be executed at specified moments of the spider execution.
Here is an example of how to configure a new monitor suite in your project:
# monitors.py
from spidermon.core.suites import MonitorSuite
# Monitor definition above...
class SpiderCloseMonitorSuite(MonitorSuite):
monitors = [
# (your monitors)
]
monitors_finished_actions = [
# actions to execute when suite finishes its execution
]
monitors_failed_actions = [
# actions to execute when suite finishes its execution with a failed monitor
]
# settings.py
SPIDERMON_SPIDER_OPEN_MONITORS = (
# list of monitor suites to be executed when the spider starts
)
SPIDERMON_SPIDER_CLOSE_MONITORS = (
# list of monitor suites to be executed when the spider finishes
)
- class MonitorSuite(name=None, monitors=None, monitors_finished_actions=None, monitors_passed_actions=None, monitors_failed_actions=None, order=None, crawler=None)¶
An instance of MonitorSuite defines a set of monitors and actions to be executed after the job finishes its execution.
name
suite name
monitors
list of Monitor classes that will be executed if this suite is enabled.
monitors_finished_actions
list of action classes that will be executed when all monitors finish their execution.
monitors_passed_actions
list of action classes that will be executed if all monitors passed.
monitors_failed_actions
list of action classes that will be executed if at least one of the monitors failed.
order
if you have more than one suite enabled in your project, this integer defines the order of execution of the suites.
crawler
crawler instance
- on_monitors_finished(result)¶
Executed right after the monitors finished their execution and before any other action is executed.
result
stats of the spider execution
- on_monitors_passed(result)¶
Executed after the actions defined in monitors_finished_actions have run, and only if all monitors passed.
result
stats of the spider execution
- on_monitors_failed(result)¶
Executed after the actions defined in monitors_finished_actions have run, and only if at least one monitor failed.
result
stats of the spider execution
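The dispatch order of these hooks and action lists can be summarized with a small plain-Python model. This is a simplification for illustration only, not Spidermon's actual implementation, and the ordering of each hook relative to its matching action list is an assumption based on the descriptions above:

```python
def run_suite_hooks(all_monitors_passed, log=None):
    """Simplified model of the order in which a MonitorSuite runs its
    hooks and action lists after the monitors finish."""
    log = [] if log is None else log
    # 1. on_monitors_finished runs before any action is executed
    log.append("on_monitors_finished")
    # 2. the monitors_finished_actions always run next
    log.append("monitors_finished_actions")
    if all_monitors_passed:
        # 3a. passed hook, then the passed actions
        log.append("on_monitors_passed")
        log.append("monitors_passed_actions")
    else:
        # 3b. failed hook, then the failed actions
        log.append("on_monitors_failed")
        log.append("monitors_failed_actions")
    return log
```

So a failing suite yields the sequence on_monitors_finished, then monitors_finished_actions, then on_monitors_failed, then monitors_failed_actions.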
Base Stat Monitor¶
Most of the monitors we create validate a numerical value from the job stats against a configurable threshold. This common pattern leads to nearly identical code for every new monitor we add to our projects.
To reduce this boilerplate, Spidermon provides a base class that your custom monitor can inherit from; with a few attributes you end up with a fully functional monitor that only needs to be added to your Monitor Suite to be used.
- class spidermon.contrib.scrapy.monitors.base.BaseStatMonitor(methodName='runTest', name=None)
Base Monitor class for stat-related monitors.
Create a monitor class inheriting from this class to have a custom monitor that validates numerical stats from your job execution against a configurable threshold. If this threshold is passed in via command line arguments (and not in the spider settings), the setting is read as a string and converted to the
threshold_datatype
type (default is float).
As an example, we will create a new monitor that will check if the value obtained in a job stat 'numerical_job_statistic' is greater than or equal to the value configured in the
CUSTOM_STAT_THRESHOLD
project setting:
class MyCustomStatMonitor(BaseStatMonitor):
    stat_name = "numerical_job_statistic"
    threshold_setting = "CUSTOM_STAT_THRESHOLD"
    assert_type = ">="
For the
assert_type
property you can select one of the following:
>
Greater than
>=
Greater than or equal
<
Less than
<=
Less than or equal
==
Equal
!=
Not equal
Sometimes we don't want a fixed threshold but a dynamic one, based on more than one stat or on data external to the job execution (e.g., you want the threshold to be related to another stat, or you want to get the value of a stat from a previous job).
As an example, the following monitor will use as its threshold a variable number of allowed errors based on the number of items scraped. This monitor will pass only if the number of errors is less than 1% of the number of items scraped:
class MyCustomStatMonitor(BaseStatMonitor):
    stat_name = "log_count/ERROR"
    assert_type = "<"

    def get_threshold(self):
        item_scraped_count = self.stats.get("item_scraped_count")
        return item_scraped_count * 0.01
By default, if the stat can't be found in the job statistics, the monitor will fail. If you want the monitor to be skipped in that case, you should set the
fail_if_stat_missing
attribute to
False
.
The following monitor will not fail if the job doesn't have a
numerical_job_statistic
value in its statistics:
class MyCustomStatMonitor(BaseStatMonitor):
    stat_name = "numerical_job_statistic"
    threshold_setting = "CUSTOM_STAT_THRESHOLD"
    assert_type = ">="
    fail_if_stat_missing = False
- threshold_datatype
alias of
float
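Because settings passed on the command line arrive as strings, the datatype conversion step matters. The following is a minimal plain-Python illustration of what this conversion amounts to, not Spidermon's internal code; the helper name is hypothetical:

```python
def convert_threshold(raw_value, threshold_datatype=float):
    """Coerce a threshold that may arrive as a string (e.g. from a
    command-line setting) into the monitor's numeric datatype."""
    if isinstance(raw_value, str):
        return threshold_datatype(raw_value)
    return raw_value

print(convert_threshold("100"))        # 100.0 (default datatype is float)
print(convert_threshold("100", int))   # 100
```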
The Basic Monitors¶
Spidermon has some batteries included :)
- class spidermon.contrib.scrapy.monitors.monitors.CriticalCountMonitor(methodName='runTest', name=None)¶
Check for critical errors in the spider log.
You can configure it using the
SPIDERMON_MAX_CRITICALS
setting. There's NO default value for this setting; if you try to use this monitor without setting it, a
NotConfigured
exception will be raised.
If the job doesn't have any critical error, the monitor will be skipped.
- class spidermon.contrib.scrapy.monitors.monitors.DownloaderExceptionMonitor(methodName='runTest', name=None)¶
This monitor checks if the number of downloader exceptions (timeouts, rejected connections, etc.) is less than or equal to a specified threshold.
This amount is provided by the
downloader/exception_count
value of your job statistics. If the value is not available in the statistics (i.e., no exception was raised), the monitor will be skipped.
Configure the threshold using the
SPIDERMON_MAX_DOWNLOADER_EXCEPTIONS
setting. There's NO default value for this setting. If you try to use this monitor without a value specified, a
NotConfigured
exception will be raised.
- class spidermon.contrib.scrapy.monitors.monitors.ErrorCountMonitor(methodName='runTest', name=None)¶
Check for errors in the spider log.
You can configure it using the
SPIDERMON_MAX_ERRORS
setting. There's NO default value for this setting; if you try to use this monitor without setting it, a
NotConfigured
exception will be raised.
If the job doesn't have any error, the monitor will be skipped.
- class spidermon.contrib.scrapy.monitors.monitors.FieldCoverageMonitor(methodName='runTest', name=None)¶
Validate if field coverage rules are met.
To use this monitor you need to enable the
SPIDERMON_ADD_FIELD_COVERAGE
setting, which will add information about field coverage to your spider statistics.
To define your field coverage rules, create a dictionary containing the expected coverage for each field you want to monitor.
As an example, if the items you are returning from your spider are Python dictionaries with the following format:
{
    "field_1": "some_value",
    "field_2": "some_value",
    "field_3": {
        "field_3_1": "some_value",
        "field_3_2": "some_value",
    }
}
A set of rules may be defined as follows:
# project/settings.py
SPIDERMON_FIELD_COVERAGE_RULES = {
    "dict/field_1": 0.4,  # Expected 40% coverage for field_1
    "dict/field_2": 1.0,  # Expected 100% coverage for field_2
    "dict/field_3": 0.8,  # Expected 80% coverage for parent field_3
    "dict/field_3/field_3_1": 0.5,  # Expected 50% coverage for nested field_3_1
}
You are not obligated to set rules for every field, just for the ones in which you are interested. Also, you can monitor nested fields if available in your returned items.
If a field returned by your spider is a list of dicts (or objects) and you want to check their coverage, that is also possible. You need to set the
SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS
setting. This value represents how many levels inside the list the coverage will be computed (if the objects inside the list also have fields that are objects/lists). The coverage for list fields is computed in two ways: with respect to the total number of items scraped (these values can be greater than 1) and with respect to the total number of items in the list. The stats have the following form:
{
    "spidermon_field_coverage/dict/field2/_items/nested_field1": "some_value",
    "spidermon_field_coverage/dict/field2/nested_field1": "other_value",
}
The stat containing _items is calculated based on the total number of list items, while the other is based on the total number of scraped items.
If the objects in the list contain another list field, that coverage is also computed in both ways, with the _items stat computed against the total number of items in the innermost list.
In case you have a job without items scraped and you want to skip this test, enable the
SPIDERMON_FIELD_COVERAGE_SKIP_IF_NO_ITEM
setting to avoid the field coverage monitor error.
Warning
Rules for nested fields will be validated against the total number of items returned.
In the example above, the rule for
dict/field_3/field_3_1
will validate whether 50% of all items returned contain
field_3_1
, not just the ones that contain the parent
field_3
.
Note
If you are returning an item type other than a dictionary, replace dict with the class name of the item you are returning.
Considering you have an item defined as:
class MyCustomItem(scrapy.Item):
    field_1 = scrapy.Field()
    field_2 = scrapy.Field()
You must define the field coverage rules as follows:
SPIDERMON_FIELD_COVERAGE_RULES = {
    "MyCustomItem/field_1": 0.4,
    "MyCustomItem/field_2": 1.0,
}
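To make the coverage ratios concrete, here is a small plain-Python sketch of how a coverage value like the ones in the rules above can be computed for a flat field. This is an illustration of the concept under the assumption that coverage means "fraction of scraped items carrying the field", not Spidermon's internal implementation:

```python
def field_coverage(items, field):
    """Fraction of scraped items that contain `field` with a non-None value."""
    if not items:
        return 0.0
    covered = sum(1 for item in items if item.get(field) is not None)
    return covered / len(items)

items = [
    {"field_1": "a", "field_2": "b"},
    {"field_2": "c"},
]
print(field_coverage(items, "field_1"))  # 0.5 -> passes a 0.4 rule, fails a 0.6 rule
print(field_coverage(items, "field_2"))  # 1.0
```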
- class spidermon.contrib.scrapy.monitors.monitors.FinishReasonMonitor(methodName='runTest', name=None)¶
Check if a job has an expected finish reason.
You can configure the expected reasons with the
SPIDERMON_EXPECTED_FINISH_REASONS
setting; it should be an
iterable
of valid finish reasons.
The default value of this setting is
['finished', ]
.
- class spidermon.contrib.scrapy.monitors.monitors.ItemCountMonitor(methodName='runTest', name=None)¶
Check if the spider extracted the minimum number of items.
You can configure it using the
SPIDERMON_MIN_ITEMS
setting. There's NO default value for this setting; if you try to use this monitor without setting it, a
NotConfigured
exception will be raised.
- class spidermon.contrib.scrapy.monitors.monitors.ItemValidationMonitor(methodName='runTest', name=None)¶
This monitor checks if the number of validation errors is less than or equal to a specified threshold.
This amount is provided by the
spidermon/validation/fields/errors
value of your job statistics. If the value is not available in the statistics (i.e., no validation errors), the monitor will be skipped.
Warning
You need to enable item validation in your project so this monitor can be used.
Configure the threshold using the
SPIDERMON_MAX_ITEM_VALIDATION_ERRORS
setting. There's NO default value for this setting. If you try to use this monitor without a value specified, a
NotConfigured
exception will be raised.
- class spidermon.contrib.scrapy.monitors.monitors.PeriodicExecutionTimeMonitor(methodName='runTest', name=None)¶
Check whether the spider's runtime exceeds a target maximum.
You can configure the maximum runtime (in seconds) using
SPIDERMON_MAX_EXECUTION_TIME
as a project setting or spider attribute.
- class spidermon.contrib.scrapy.monitors.monitors.PeriodicItemCountMonitor(methodName='runTest', name=None)¶
Check for increase in item count.
You can configure the threshold for the increase using
SPIDERMON_ITEM_COUNT_INCREASE
as a project setting or spider attribute. Use an int value to require a fixed number of new items on each check, or a float value to require a percentage increase in items.
- class spidermon.contrib.scrapy.monitors.monitors.RetryCountMonitor(methodName='runTest', name=None)¶
Check if any requests have reached the maximum amount of retries and the crawler had to drop those requests.
You can configure it using the
SPIDERMON_MAX_RETRIES
setting. The default is
-1
, which disables the monitor.
- class spidermon.contrib.scrapy.monitors.monitors.SuccessfulRequestsMonitor(methodName='runTest', name=None)¶
Check the amount of successful requests.
You can configure it using the
SPIDERMON_MIN_SUCCESSFUL_REQUESTS
setting.
- class spidermon.contrib.scrapy.monitors.monitors.TotalRequestsMonitor(methodName='runTest', name=None)¶
Check the total amount of requests.
You can configure it using the
SPIDERMON_MAX_REQUESTS_ALLOWED
setting. The default is
-1
, which disables the monitor.
- class spidermon.contrib.scrapy.monitors.monitors.UnwantedHTTPCodesMonitor(methodName='runTest', name=None)¶
Check for a maximum number of unwanted HTTP codes. You can configure it using the
SPIDERMON_UNWANTED_HTTP_CODES_MAX_COUNT
setting or the
SPIDERMON_UNWANTED_HTTP_CODES
setting.
This monitor fails if, during the spider execution, we receive more responses than the number defined in the
SPIDERMON_UNWANTED_HTTP_CODES_MAX_COUNT
setting for at least one of the HTTP status codes in the list defined in the
SPIDERMON_UNWANTED_HTTP_CODES
setting.
Default values are:
SPIDERMON_UNWANTED_HTTP_CODES_MAX_COUNT = 10
SPIDERMON_UNWANTED_HTTP_CODES = [400, 407, 429, 500, 502, 503, 504, 523, 540, 541]
SPIDERMON_UNWANTED_HTTP_CODES
can also be a dictionary with the HTTP status code as key and the maximum number of accepted responses with that code as value.
With the following setting, the monitor will fail if more than 100 responses are 400 errors, or if there is at least one 500 error:
SPIDERMON_UNWANTED_HTTP_CODES = {
    400: 100,
    500: 0,
}
Furthermore, instead of being a numeric value, each code's value can be a dictionary containing either or both of two keys:
max_count
and
max_percentage
. The former is an absolute value and works the same way as setting an integer value. The latter is a maximum percentage of the total number of requests the spider made. If both are set, the monitor will fail if either condition is met. If neither is set, it defaults to
DEFAULT_UNWANTED_HTTP_CODES_MAX_COUNT
.
With the following setting, the monitor will fail if there is at least one 500 error, or if there are more than
min(100, 0.5 * total requests)
400 responses:
SPIDERMON_UNWANTED_HTTP_CODES = {
    400: {"max_count": 100, "max_percentage": 0.5},
    500: 0,
}
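The per-code rules described above can be summarized in a small plain-Python sketch of the check logic. This is an illustration of the documented behavior, not the monitor's actual code, and the helper name is hypothetical:

```python
DEFAULT_UNWANTED_HTTP_CODES_MAX_COUNT = 10  # default mentioned above

def code_exceeds_limit(count, rule, total_requests):
    """Return True if `count` responses with a given status code violate `rule`.

    `rule` is either an int (maximum accepted count) or a dict with
    optional "max_count" / "max_percentage" keys, as described above.
    """
    if isinstance(rule, int):
        return count > rule
    max_count = rule.get("max_count")
    max_percentage = rule.get("max_percentage")
    if max_count is None and max_percentage is None:
        # neither key set: fall back to the default maximum count
        return count > DEFAULT_UNWANTED_HTTP_CODES_MAX_COUNT
    if max_count is not None and count > max_count:
        return True
    if max_percentage is not None and count > max_percentage * total_requests:
        return True
    return False

# 150 responses with code 400 out of 1000 requests, max_count=100 -> fails
print(code_exceeds_limit(150, {"max_count": 100, "max_percentage": 0.5}, 1000))  # True
```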
- class spidermon.contrib.scrapy.monitors.monitors.WarningCountMonitor(methodName='runTest', name=None)¶
Check for warnings in the spider log.
You can configure it using the
SPIDERMON_MAX_WARNINGS
setting. There's NO default value for this setting; if you try to use this monitor without setting it, a
NotConfigured
exception will be raised.
If the job doesn't have any warning, the monitor will be skipped.
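Putting several of the monitors above together, a project's settings might configure their thresholds like this. The values are illustrative examples chosen for this sketch, not recommended defaults:

```python
# settings.py -- illustrative thresholds for the monitors described above
SPIDERMON_MIN_ITEMS = 100                         # ItemCountMonitor
SPIDERMON_MAX_ERRORS = 0                          # ErrorCountMonitor
SPIDERMON_MAX_WARNINGS = 10                       # WarningCountMonitor
SPIDERMON_MAX_EXECUTION_TIME = 3600               # PeriodicExecutionTimeMonitor (seconds)
SPIDERMON_EXPECTED_FINISH_REASONS = ['finished']  # FinishReasonMonitor
SPIDERMON_MAX_DOWNLOADER_EXCEPTIONS = 50          # DownloaderExceptionMonitor
```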
- class spidermon.contrib.scrapy.monitors.monitors.ZyteJobsComparisonMonitor(methodName='runTest', name=None)¶
Note
This monitor is useful when running jobs in Zyte’s Scrapy Cloud.
Check for a drop in scraped item count compared to previous jobs.
You need to set the number of previous jobs to compare using
SPIDERMON_JOBS_COMPARISON
. The default is
0
, which disables the monitor. We use the average of the scraped items count.
You can configure which percentage of the previous item count is the minimum acceptable using the
SPIDERMON_JOBS_COMPARISON_THRESHOLD
setting. We expect a float number greater than
0.0
with no upper limit (meaning we can check whether the item count is increasing at a certain rate). If not set, a
NotConfigured
error will be raised.
You can filter which jobs to compare based on their states using the
SPIDERMON_JOBS_COMPARISON_STATES
setting. The default value is
("finished",)
.
You can also filter which jobs to compare based on their tags using the
SPIDERMON_JOBS_COMPARISON_TAGS
setting. Among the defined tags, we consider only those that are also present in the current job.
Is there a Basic Scrapy Suite ready to use?¶
Of course, there is! We really want to make it easy for you to monitor your spiders ;)
- class spidermon.contrib.scrapy.monitors.suites.SpiderCloseMonitorSuite(name=None, monitors=None, monitors_finished_actions=None, monitors_passed_actions=None, monitors_failed_actions=None, order=None, crawler=None)
This Monitor Suite implements the following monitors:
You can easily enable this monitor after enabling Spidermon:
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    'spidermon.contrib.scrapy.monitors.SpiderCloseMonitorSuite',
)
If you want only some of these monitors, it's easy to create your own suite with your own list of monitors, similar to this one.
Periodic Monitors¶
Sometimes we don't want to wait until the end of the spider execution to monitor it. For example, we may want to be notified as soon as the number of errors reaches a given value, or close the spider if the elapsed time is greater than expected.
You define your Monitors and Monitor Suites the same way as before, but you need to provide the time interval (in seconds) between each run of the Monitor Suite.
In the following example, we define a periodic monitor suite that will be executed every minute and will verify whether the number of errors found is below a threshold. If not, the spider will be closed.
First we define a new action that will close the spider when executed:
# tutorial/actions.py
from spidermon.core.actions import Action

class CloseSpiderAction(Action):

    def run_action(self):
        spider = self.data['spider']
        spider.logger.info("Closing spider")
        spider.crawler.engine.close_spider(spider, 'closed_by_spidermon')
Then we create our monitor and monitor suite, which verify the number of errors and take an action if the check fails:
# tutorial/monitors.py
from spidermon import Monitor, MonitorSuite, monitors
from spidermon.contrib.monitors.mixins import StatsMonitorMixin

from tutorial.actions import CloseSpiderAction

@monitors.name('Periodic job stats monitor')
class PeriodicJobStatsMonitor(Monitor, StatsMonitorMixin):

    @monitors.name('Maximum number of errors reached')
    def test_number_of_errors(self):
        accepted_num_errors = 20
        num_errors = self.data.stats.get('log_count/ERROR', 0)
        msg = 'The job has exceeded the maximum number of errors'
        self.assertLessEqual(num_errors, accepted_num_errors, msg=msg)

class PeriodicMonitorSuite(MonitorSuite):
    monitors = [PeriodicJobStatsMonitor]
    monitors_failed_actions = [CloseSpiderAction]
The last step is to configure the suite to be executed every 60 seconds:
# tutorial/settings.py
SPIDERMON_PERIODIC_MONITORS = {
    'tutorial.monitors.PeriodicMonitorSuite': 60,  # time in seconds
}
What to monitor?¶
These are some of the usual metrics used in the monitors:
the number of items extracted by the spider.
the number of successful responses received by the spider.
the number of failed responses (server-side errors, network errors, proxy errors, etc.).
the number of requests that reached the maximum number of retries and were finally discarded.
the amount of time it took to finish the crawl.
the number of errors in the log (spider errors, generic errors detected by Scrapy, etc.).
the number of bans.
the job outcome (whether it finished without major issues or was closed prematurely, for example because it detected too many bans).
the number of items that don't contain a specific field or set of fields.
the number of items with validation errors (missing required fields, incorrect format, values that don't match a specific regular expression, strings that are too long/short, numeric values that are too high/low, etc.).