Oh, Ducks!: Unification Over HTML Documents Using CSS Selectors

Because scraping content from webpages will drive you quackers!

What Oh, Ducks! Is

Oh, Ducks! is a handy way to get data out of HTML documents using CSS selectors.

Where to Get Oh, Ducks!

You can get the development version from my darcs repository.

darcs get 'http://repo.kepibu.org/Oh, Ducks!/'

You'll probably also want my development version of cl-unification. Oh, Ducks! might work with the official version, but I make no guarantees.

How to Use Oh, Ducks!

See the notes file for more information, or the tests.lisp file for examples.

Example

Here's an example of using Oh, Ducks!, taken from one of my personal projects. It extracts the first paragraph of real content from pages on a particular Wikia wiki, and it replaces a more brittle klacks parser.

(defun parse-wiki-page (page)
  (unify:match (#t(oh-ducks:html (:model oh-ducks:dom)
                                 ("#article > #bodyContent" . #t(list ?article)))
                page :errorp nil)
    (labels …
      (let ((nodes (unify:match-case (article :errorp nil)
                     ;; <p> before <dl> is junk like "This is a featured article."
                     (#t(oh-ducks:html ("> dl ~ p" . ?p)) p)
                     ;; No <dl>, but .infobox
                     (#t(oh-ducks:html ("> .infobox ~ p" . ?p))
                      (without-featured-article p))
                     ;; No <dl> or .infobox
                     (#t(oh-ducks:html ("> p" . ?p))
                      (without-featured-article p)))))
        …))))
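
Reduced to its simplest form, the pattern in the example above is a single selector paired with a single variable. The sketch below is not taken from the library or its tests. It only reuses the template syntax shown above, and the function name page-paragraphs is made up for illustration. It assumes Oh, Ducks! and cl-unification are already loaded, so the #t reader macro is available. Going by the example above, where selector results are unified against a list template, the variable is presumably bound to the matching elements rather than to their text.

```lisp
;; Sketch only: assumes Oh, Ducks! and cl-unification are already loaded,
;; so the OH-DUCKS and UNIFY packages and the #t reader macro exist.
;; PAGE-PARAGRAPHS is a hypothetical name, not part of the library.
(defun page-paragraphs (page)
  "Return whatever the selector \"p\" matches in PAGE, or NIL if nothing does."
  (unify:match (#t(oh-ducks:html (:model oh-ducks:dom)
                                 ;; ?paragraphs becomes the variable
                                 ;; PARAGRAPHS in the body below.
                                 ("p" . ?paragraphs))
                page :errorp nil)
    paragraphs))
```

As in the original example, :errorp nil makes a failed match return NIL instead of signalling an error, so callers can test the result directly.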

Known Limitations

xmls-style lists don't always work properly, especially when selecting on parent or sibling elements.

See also the notes file in the repo.

Contacting the Author

My e-mail address is pix@kepibu.org. Questions, comments, patches, beratings, and bug reports are all welcome.