Re: [exim] Monitoring and Reporting Eximstats

Author: Chris Siebenmann
Date:  
To: Christian K
CC: exim-users, cks
Subject: Re: [exim] Monitoring and Reporting Eximstats
> I'd like to hear about what techniques you use to monitor your setups,
> especially with setups beyond one or two mail servers.


Our current approach for Prometheus-based monitoring is to process
mailq output to get a snapshot of queue-based information and to use
Linux's 'ss' to get information on SMTP connection counts (we looked
at using exiwhat and processing its output, but exiwhat is fairly
heavyweight; see the manpage).
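
As a rough illustration of the mailq-processing idea (a sketch, not the actual tooling described above), the Python below summarizes a queue listing. It assumes the traditional `exim -bp` layout, where each message starts with a header line of age, size, message ID and `<sender>`, optionally followed by `*** frozen ***`, with recipients on indented lines after it. The function name and the keys of the returned dictionary are made up for this example.

```python
import re

# One header line per queued message, assuming the traditional
# `exim -bp` layout: age, size, message id, <sender>, optional flags.
HEADER_RE = re.compile(r"^\s*(\d+)([mhd])\s+(\S+)\s+(\S+)\s+<([^>]*)>(.*)$")
AGE_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def summarize_queue(mailq_text):
    """Return message count, frozen count, and oldest age in seconds.

    Hypothetical helper: the dictionary keys are illustrative only.
    """
    summary = {"messages": 0, "frozen": 0, "oldest_seconds": 0}
    for line in mailq_text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            continue  # recipient lines and blank lines carry no header
        amount, unit, _size, _msgid, _sender, flags = match.groups()
        age = int(amount) * AGE_UNIT_SECONDS[unit]
        summary["messages"] += 1
        summary["oldest_seconds"] = max(summary["oldest_seconds"], age)
        if "*** frozen ***" in flags:
            summary["frozen"] += 1
    return summary
```

The text passed in would come from running mailq (for example through `subprocess.run`), and the resulting numbers could then be exposed as gauges in whatever way your Prometheus setup prefers.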

I hadn't considered running commands or sending messages from the ACLs
to keep a running count of messages delivered or the like; it's a neat
approach. The drawback is that you'll definitely
be building relatively custom tools for this. There are apparently some
general programs to basically 'tail' logs and convert the information
there into running counters and so on; the ones for Prometheus I know of
are:

    https://github.com/google/mtail
    https://github.com/fstab/grok_exporter
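
To show the kind of conversion these tools do, here is a minimal Python sketch (a stand-in for illustration, not mtail or grok_exporter configuration) that turns Exim main log lines into counters. It relies on the standard Exim log flags (`<=` arrival, `=>` and `->` delivery, `**` failure, `==` deferral) and assumes the default line layout of date, time, message ID, flag, with an optional timezone and `[pid]`. The event names and the function name are made up for this example.

```python
import re
from collections import Counter

# Illustrative labels for Exim's log flags; the names are not Exim's.
EVENT_NAMES = {
    "<=": "received",
    "=>": "delivered",
    "->": "delivered",  # additional address in the same delivery
    "**": "failed",
    "==": "deferred",
}

# Assumed layout: date, time, optional timezone, optional [pid],
# message id, then the flag.
FLAG_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"(?:[+-]\d{4} )?(?:\[\d+\] )?\S+ (<=|=>|->|\*\*|==) "
)


def count_events(lines):
    """Count log lines per event type; lines without a flag are skipped."""
    counts = Counter()
    for line in lines:
        match = FLAG_RE.match(line)
        if match is not None:
            counts[EVENT_NAMES[match.group(1)]] += 1
    return counts
```

A real log tailer would also have to follow the file across rotations and keep its counters between reads, which is much of what the tools above take care of.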


It occurs to me that one obvious and very simple thing to do in ACLs
and so on is simply to update a timestamp of 'I most recently accepted
a message at ...' or 'I most recently delivered a message at ...'. This
could be used to drive very basic but useful health checks: if you see
either timestamp fall too far out of date, something may be wrong.
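
The checking side of such a timestamp could be sketched as below, assuming the 'most recently accepted' or 'most recently delivered' marker is kept as the modification time of a file. How the ACL or transport updates that file is not shown here, and the function names, path and thresholds are all hypothetical.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Age of the marker file's mtime in seconds; None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return max(0.0, current - mtime)


def is_stale(path, max_age_seconds, now=None):
    """True if the marker is missing or older than the allowed age."""
    age = seconds_since_update(path, now)
    return age is None or age > max_age_seconds
```

What counts as 'too old' depends on how busy the machine is; a quiet server can legitimately go a long time without accepting or delivering anything, so the threshold needs to be set per host.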

(You can also use ${run} and so on outside of ACLs, but I'm not sure if
there's a good place to put them so that they happen after a transport
succeeds. If you work really hard you can probably detect bounces by
detecting the expansion of the bounce text, through embedding a ${run}
or the like in it, but perhaps there's a better way[*].)

    - cks
[*: We sort of care about a spike in the bounce rate because it's often
    an indicator of a problem, such as yet another compromised user account
    being used to send out spam.
]