Ted Cooper wrote:
> There are already companies that make their money out of scanning the
> internet looking at who is running what web server, how come there isn't
> one out there scanning the internet for all the mail servers?
>
> I have some bandwidth to waste and some time to spare, I wonder how much
> of a storm I would bring down upon myself if I started scanning the
> internet at random and collating the results. There would of course have
> to be some rules in place to prevent legal action and such.
I was thinking about this as well. If I were doing it, firstly I would
completely forget about the relay scanning part. Let the RBLs worry
about that. You can always just query them for each mail server if you
wanted to correlate open-relay statistics with MTA use. Reliably
testing for relaying is tricky, given that there are so many elaborate
tricks to check for -- or at least that's what I've read. It would
also slow things down considerably.
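Checking a given server against an RBL is just a DNS lookup anyway, so
the correlation step would be cheap. A minimal sketch in Python
(untested; the zone name is only an example, any DNSBL works the same
way):

    import socket

    def dnsbl_listed(ip, zone="zen.spamhaus.org"):
        # DNSBLs are queried by reversing the IPv4 octets and
        # appending the zone: 1.2.3.4 -> 4.3.2.1.zen.spamhaus.org.
        # An A record back means "listed"; NXDOMAIN means "clean".
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)
            return True
        except socket.gaierror:
            return False

    print(dnsbl_listed("127.0.0.2"))  # conventional "always listed" test address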
Then, I would sign up for zone file access with both Verisign[1] (for
com and net) and PIR[2] (for org). You could add any other TLD zones,
but I don't know their policies and com/net/org covers a great deal of
territory.
That gets you access to a list of all domain names and their
corresponding nameservers. Phase 1 of the project would involve walking
these zones and doing an MX lookup for each domain, which allows you to
map domain names to mail server IP addresses. Given that the vast
majority of domains out there are parked or virtual-hosted, you would
end up with a list of mail server IP addresses vastly smaller than the
number of domains. But you want to save that mapping information, as it
allows you to report "MTA use by physical server" as well as "MTA use
by domain name", similar to how Netcraft does it for HTTP servers.
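With the dnspython library, phase 1 boils down to something like this
(rough sketch; the two hard-coded domains just stand in for the zone
walk):

    import dns.resolver

    def mx_hosts(domain):
        # MX hostnames for a domain, empty list on any failure
        try:
            return [str(r.exchange).rstrip(".")
                    for r in dns.resolver.resolve(domain, "MX")]
        except Exception:
            return []

    mail_ips = {}    # mail server IP -> set of domains it serves
    for domain in ["example.com", "example.net"]:
        for host in mx_hosts(domain):
            try:
                for r in dns.resolver.resolve(host, "A"):
                    mail_ips.setdefault(r.address, set()).add(domain)
            except Exception:
                pass

That mail_ips dict is exactly the mapping you'd save: invert it and you
have "domains per physical server".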
Note that you would want to take care not to overload any DNS servers
here. Naturally, you wouldn't want to just feed these domain lists into
an automated "dig $DOMAIN mx", as that would end up querying the TLD
nameservers for each second-level domain, of which there are tens of
millions. (This is prohibited in the zone file agreement, anyway.)
But since you have the zone files you don't need to query the TLD
servers at all; you can query each domain's listed nameservers
directly, so this is a highly distributed task. The program should have
controls so that it doesn't query any one server (or /24) more than a
certain amount, but it should otherwise be able to truck right along.
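Something like this, say -- a crude per-/24 throttle in front of a
direct query to the domain's listed nameserver (dnspython again; in
reality the nameserver IPs come from the zone file's glue records, and
you'd shard the work across many nameservers at once rather than
sleep):

    import time
    import dns.message, dns.query

    last_query = {}        # /24 prefix -> time of last query to it
    MIN_INTERVAL = 1.0     # seconds between queries to any one /24

    def polite_mx_query(domain, ns_ip):
        prefix = ns_ip.rsplit(".", 1)[0]    # crude /24 key
        wait = last_query.get(prefix, 0) + MIN_INTERVAL - time.time()
        if wait > 0:
            time.sleep(wait)
        last_query[prefix] = time.time()
        # ask the authoritative server directly, bypassing the TLDs
        q = dns.message.make_query(domain, "MX")
        return dns.query.udp(q, ns_ip, timeout=5)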
By my estimates of the current zone file sizes[3], if given a month of
time you're looking at approximately 13 domain MX lookups per second.
To get IP addresses you'd need at least one A record lookup for each of
those MX records, but I figure that there are so many parked domains and
such that there will be a great deal of caching of those A records, so
hopefully that step wouldn't add nearly as much load.
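The arithmetic behind that figure, using the zone sizes from [3]:

    domains = 26e6 + 4.4e6 + 2.8e6    # com + net + org
    seconds = 30 * 24 * 3600          # one month
    print(domains / seconds)          # ~12.8 MX lookups/second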
Finally, you conduct your survey of the mail servers, which you have
now reduced down to a set of unique IP addresses. I have no idea how
many you'd be left with, but it would be a pretty good swath of the
internet. (com/net/org really covers a lot.) I don't know what order
you'd go in (random, probably) or any of the other details, since I
have no idea how many records we're talking about here. But essentially
you just connect, HELP, QUIT, and move on. Record all responses and
process them afterwards.
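The probe itself is nearly trivial: plain sockets and a short timeout
so dead hosts don't cost you much. Untested sketch:

    import socket

    def probe_smtp(ip, port=25, timeout=10):
        # one brief session: read banner, HELP, QUIT
        with socket.create_connection((ip, port), timeout=timeout) as s:
            s.settimeout(timeout)
            banner = s.recv(1024).decode(errors="replace")
            s.sendall(b"HELP\r\n")
            help_resp = s.recv(4096).decode(errors="replace")
            s.sendall(b"QUIT\r\n")
            return banner, help_resp

A single recv() may not catch all of a multi-line HELP reply, but for
fingerprinting purposes the first chunk is plenty.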
The advantage of starting with the zone files and working your way
through MX records is that you're not scanning random IP addresses
looking for mail servers, so I don't see how anyone could label it
abuse. As long as your bot/spider knows not to pester any given server
(or /24) with more than a few packets per hour, I don't think anyone
would object. Since every IP address you're connecting to is
guaranteed to be an MX for at least one domain, no one could accuse you
of unauthorized cracking / hacking / scanning / whatever. Certainly, I
can't see how any mail admin would object to a brief connection that
just says HELP and QUIT one time and is never heard from again.
If you were really slick you could incorporate deltas from the updated
zone files after a month has passed, and do a progressive scan of just
the things that have changed.
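That part is just a set difference over the two months' domain lists
(file names made up, and I'm assuming the domains have been
pre-extracted from the raw zone files, one per line):

    def domains(path):
        with open(path) as f:
            return set(line.strip().lower() for line in f if line.strip())

    old = domains("zones-last-month.txt")
    new = domains("zones-this-month.txt")
    added, removed = new - old, old - new
    # re-walk `added`, drop `removed`, and maybe spot-check a
    # sample of the unchanged majority for new MX records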
Brian
[1] http://www.verisign.com/nds/naming/tld/
[2] http://www.pir.org/registrars/zone_file_access
[3] .com = 26m, .net = 4.4m, .org = 2.8m