machina-policy: A Common Lisp parser for robots.txt files

Because you perform responsible web scraping.

What machina-policy Is

machina-policy is a Common Lisp library for parsing and querying robots.txt files, so your web-scraping bot can play nicely with the other kids.

Where to Get machina-policy

machina-policy has been converted from Darcs to Git and is now available on GitHub.

All dependencies are available via Quicklisp.
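
Assuming the repository is cloned into Quicklisp's local-projects directory (the library itself may not be in the Quicklisp dist; only its dependencies are promised above), loading looks like this:

    ;; Clone into ~/quicklisp/local-projects/ first; Quicklisp will
    ;; then resolve the dependencies from the dist.  The system name
    ;; :MACHINA-POLICY is an assumption based on the project name.
    (ql:quickload :machina-policy)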

How to Use machina-policy

See the README for more information.
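
For a quick taste, here is a minimal sketch of the intended workflow. Only #'URI-ALLOWED-P is named in this document; the PARSE-ROBOTS.TXT entry point and its arguments are assumptions, and Drakma is used purely to fetch the file, so consult the README for the actual API.

    ;; Sketch only: PARSE-ROBOTS.TXT is an assumed name for the
    ;; parsing entry point, not a documented one.
    (let* ((txt (drakma:http-request "http://example.com/robots.txt"))
           (policy (machina-policy:parse-robots.txt txt)))
      ;; #'URI-ALLOWED-P is the query function this document names.
      (machina-policy:uri-allowed-p policy "http://example.com/private/"))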

Known Limitations

machina-policy is not intended as a complete web-scraping solution (cl-web-crawler may get you closer to that goal). It is intended to make handling robots.txt files easy, not to make it transparent.

Currently, machina-policy is very much geared towards single-domain usage. For instance, #'URI-ALLOWED-P does not check to ensure the hostname of the given URI actually falls under the jurisdiction of the given POLICY. If crawling multiple domains, it is your responsibility to keep the policies for those domains separate.
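
One way to keep per-domain policies separate is a hash table keyed on hostname. This is only a sketch: PURI is used to extract the host, and PARSE-ROBOTS.TXT is the same assumed entry point as in the example above.

    ;; Sketch: cache one policy per hostname so a URI is only ever
    ;; checked against the policy fetched from its own domain.
    (defvar *policies* (make-hash-table :test #'equal))

    (defun robots-allowed-p (uri-string)
      (let* ((host (puri:uri-host (puri:parse-uri uri-string)))
             (policy (or (gethash host *policies*)
                         (setf (gethash host *policies*)
                               (machina-policy:parse-robots.txt
                                (drakma:http-request
                                 (format nil "http://~a/robots.txt" host)))))))
        (machina-policy:uri-allowed-p policy uri-string)))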

Contacting the Author

My e-mail address is pix@kepibu.org. Questions, comments, patches, beratings, and bug reports are all welcome.