Senate Votes in XML

One of my personal annoyances came to a quiet end last week, when the U.S. Senate decided to begin publishing vote information in XML rather than the HTML that had been its format for years.

The House, usually the institutionally more nimble of the chambers, began publishing vote information in XML back in 2003 (view the source on this page to see an example). Here's a Senate vote - it has information on the date and time of the vote, plus all of the individual positions.

This makes it easier to parse the information into a spreadsheet or database manager for two reasons. First, there are excellent XML parsers available for nearly every language. Second, and more importantly, scraping HTML is brittle by comparison: a single change to a page's markup could break your scraper. Well-structured XML beats HTML scraping any day.
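
To make that concrete, here is a minimal sketch in Python of turning a vote file into spreadsheet-ready rows using only the standard library. The sample document is made up for illustration, and the element names in it (roll_call_vote, vote_date, member, vote_cast and so on) are assumptions about the layout, not a documented schema, so check them against a real file before relying on them. The function names vote_rows and rows_to_csv are likewise hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample data. The element names are assumptions about the
# vote file layout; verify them against an actual Senate XML file.
SAMPLE_XML = """<roll_call_vote>
  <vote_date>January 1, 2009, 12:00 PM</vote_date>
  <members>
    <member>
      <last_name>Doe</last_name>
      <party>D</party>
      <state>MO</state>
      <vote_cast>Yea</vote_cast>
    </member>
    <member>
      <last_name>Roe</last_name>
      <party>R</party>
      <state>KS</state>
      <vote_cast>Nay</vote_cast>
    </member>
  </members>
</roll_call_vote>"""

FIELDS = ["vote_date", "last_name", "party", "state", "vote_cast"]


def vote_rows(xml_text):
    """Return one dict per member position found in the vote document."""
    root = ET.fromstring(xml_text)
    vote_date = root.findtext("vote_date", default="")
    rows = []
    for member in root.iter("member"):
        row = {"vote_date": vote_date}
        # Missing child elements become empty strings instead of errors.
        for field in FIELDS[1:]:
            row[field] = member.findtext(field, default="")
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(vote_rows(SAMPLE_XML)))
```

Because the parser looks elements up by name rather than by position on the page, a cosmetic redesign of the site would not affect it. Only a change to the element names themselves would.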

Up until now, there have been essentially three main repositories of congressional vote data outside of academia: The Washington Post, The New York Times and Govtrack.us, which predates the other two in its efforts to parse congressional information.

Now, it's much easier for individual sites or news organizations to grab and collect only the vote information they might be interested in rather than repeating the time-consuming process of scraping. So fire up your favorite XML parser (two of mine are Beautiful Soup for Python and Hpricot for Ruby) and build your own dataset!

The National Institute for Computer-Assisted Reporting is a joint program of
Investigative Reporters and Editors, Inc., and the Missouri School of Journalism.

141 Neff Annex, Missouri School of Journalism, Columbia MO, 65211, Tel. 573-882-2042, Fax 573-884-5544