Because scraping content from webpages will drive you quackers!
Oh, Ducks! is a handy way to get data out of HTML documents using CSS selectors.
You can get the development version from the darcs repository:
darcs get 'http://repo.kepibu.org/Oh, Ducks!/'
You'll probably also want my development version of cl-unification. It might work with the official version, but I make no guarantees.
See the notes file for more information, or the tests.lisp file for examples.
Here's an example of using Oh, Ducks! taken from one of my personal projects. It extracts the first paragraph of real content from pages on a particular wikia wiki and replaces a more brittle klacks parser.
(defun parse-wiki-page (page)
  (unify:match (#t(oh-ducks:html (:model oh-ducks:dom)
                                 ("#article > #bodyContent" . #t(list ?article)))
                page
                :errorp nil)
    (labels …
      (let ((nodes (unify:match-case (article :errorp nil)
                     ;; <p> before <dl> is junk like "This is a featured article."
                     (#t(oh-ducks:html ("> dl ~ p" . ?p)) p)
                     ;; No <dl>, but .infobox
                     (#t(oh-ducks:html ("> .infobox ~ p" . ?p))
                      (without-featured-article p))
                     ;; No <dl> or .infobox
                     (#t(oh-ducks:html ("> p" . ?p))
                      (without-featured-article p)))))
        …))))
Documents represented as xmls-style lists don't always work properly, especially with selectors that depend on parent or sibling elements.
See also the notes file in the repo.
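The xmls caveat is easier to picture with a small sketch. The code below is an illustration only and is not part of Oh, Ducks!: it assumes the conventional xmls list layout of (name attributes . children), and *doc*, node-children, and find-parent are hypothetical names made up for this example. A node in that layout holds no pointer back to its parent, so anything that needs a parent or sibling has to search down from the root. That may be part of why such selectors are harder to support, though the cause is my guess and is not stated anywhere in the project.

```lisp
;; Hypothetical xmls-style tree: each element is (name attributes . children).
(defparameter *doc*
  '("div" (("id" "article"))
    ("dl" () ("dt" () "Term"))
    ("p" () "First paragraph.")))

(defun node-children (node)
  "Return the children (elements and text) of an xmls-style NODE."
  (cddr node))

(defun find-parent (child root)
  "Search ROOT's subtree for the element whose children include CHILD.
Identity is tested with EQ, since the lists carry no parent links."
  (cond ((not (consp root)) nil)          ; text nodes have no children
        ((member child (node-children root) :test #'eq) root)
        (t (some (lambda (kid) (find-parent child kid))
                 (node-children root)))))
```

Calling (find-parent (fourth *doc*) *doc*) walks down from the root to recover the parent of the p element. A DOM-style model can answer the same question directly from the node itself.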
My e-mail address is pix@kepibu.org. Questions, comments, patches, beratings, and bug reports are all welcome.