Derek Willis's blog

EveryBlock Goes Open Source

By now you may have read that EveryBlock, a Knight Foundation-funded project, has released its source code to the public (here's a browsable version). Getting a chance to look under the hood is a great opportunity to see how other folks tackle some of the tasks we all face, or are likely to.

Yahoo! Placemaker

The process of geolocating information isn't new to journalists; producing maps has long been a key part of what we do. But when it comes to our stories, extracting mappable entities like cities from text is a relatively new concept.

There are commercial services that do this task, and researchers have created software for academic pursuits as well. Widespread free availability of geolocation services, however, has been mostly wishful thinking until last month.

Senate Votes in XML

One of my personal annoyances came to a quiet end last week, when the U.S. Senate decided to begin publishing vote information in XML rather than the HTML that had been its format for years.

Computing Environments Built For You

One of the biggest hurdles we all face in trying new software or utilities is the lack of a sandbox, a machine we can just use when we want to without having to worry if something goes wrong.

This is particularly true for new open-source technologies, like the fast-growing field of open source GIS software. Sure, it would be great to try out OpenLayers or other mapping utilities, but it's not like we can just turn our main computer into a development box overnight.

Data, APIs and TimesOpen

On Feb. 20, a group of my colleagues at The New York Times gathered for a daylong series of presentations on a set of APIs that we've been releasing during the past few months. TimesOpen, as it was called, gathered about 140 developers and other folks interested in working with Times data.

Toward better political data

Early this past fall, our group of journalists and developers at The New York Times began to assemble the data necessary to produce our election guide, which would not only focus on the presidential race but also include races for the U.S. House and Senate.

Sharing code snippets

Folks doing CAR have a wealth of tools at their disposal, which is both a blessing and a curse. For example, I frequently use two database programs, MySQL and PostgreSQL, at work. While similar in most respects, they have slightly different syntax for some common tasks such as string functions.

You see this situation played out on the NICAR-L listserv all the time, when someone asks a question that usually starts with, "I know I've done this before, but I can't seem to remember the right syntax."
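To make the "slightly different syntax" point concrete, here is a minimal sketch of a personal snippet lookup, written in Python. The table and column names (people, first_name, last_name, name) and the lookup helper are made up for illustration, and the two tasks shown are just examples of where MySQL and PostgreSQL string functions differ.

```python
# A tiny snippet store keyed by task, then by database.
# Table and column names are hypothetical, for illustration only.
SNIPPETS = {
    "concatenate": {
        # MySQL joins strings with the CONCAT() function.
        "mysql": "SELECT CONCAT(first_name, ' ', last_name) FROM people",
        # PostgreSQL uses the standard || operator.
        "postgresql": "SELECT first_name || ' ' || last_name FROM people",
    },
    "find_substring": {
        # MySQL: LOCATE(substring, string)
        "mysql": "SELECT LOCATE('son', name) FROM people",
        # PostgreSQL: POSITION(substring IN string)
        "postgresql": "SELECT POSITION('son' IN name) FROM people",
    },
}


def lookup(task, database):
    """Return the saved SQL snippet for a task in the given database."""
    return SNIPPETS[task][database.lower()]


if __name__ == "__main__":
    print(lookup("concatenate", "MySQL"))
    print(lookup("concatenate", "PostgreSQL"))
```

Keeping both dialects side by side under one task name is the design choice here: you look up what you want to do, not which function you half remember.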

Magic/Replace for Data Cleaning

Everybody loves cleaning data, right?

Well, OK, it's probably one of the more onerous tasks that CAR people face, and for the most part it hasn't gotten dramatically easier, even while other technologies have made things like online mapping simpler.

Scraping vs. Parsing

I almost never do any Web scraping any more.

That's not because scraping isn't useful - it's one of the most powerful tools I've picked up - but because when it comes to making data out of HTML, scraping by matching patterns of characters doesn't always make a lot of sense.

So instead of relying strictly on regular expressions to match patterns within a Web page, I now use HTML parsers to locate and extract information. The difference is precision.
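Here is a small sketch of that difference, using only Python's standard library. The page fragment, the "name" class and the function names are all invented for illustration; this is not the code from the original post, just one way the two approaches can diverge when the markup is reflowed.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment. Note the second row splits its <td> tag
# across two lines, which is still valid HTML.
PAGE = """
<table>
  <tr><td class="name">Smith</td><td>Yea</td></tr>
  <tr><td
      class="name">Jones</td><td>Nay</td></tr>
</table>
"""


def names_by_regex(html):
    """Scraping: match an exact run of characters in the raw text."""
    return re.findall(r'<td class="name">(.*?)</td>', html)


class NameCellParser(HTMLParser):
    """Parsing: react to tags and attributes, not to character layout."""

    def __init__(self):
        super().__init__()
        self.in_name_cell = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "td" and dict(attrs).get("class") == "name":
            self.in_name_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_name_cell = False

    def handle_data(self, data):
        if self.in_name_cell:
            self.names.append(data.strip())


def names_by_parser(html):
    parser = NameCellParser()
    parser.feed(html)
    parser.close()
    return parser.names


if __name__ == "__main__":
    print(names_by_regex(PAGE))   # misses the reflowed row
    print(names_by_parser(PAGE))  # finds both rows
```

The regular expression depends on the exact spacing inside the tag, so it silently skips the second row. The parser only cares that a td element carries class "name", so it picks up both.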

The upside to open

As NICAR conference sessions go, it was a first: "World-class CAR on 99 cents a day: Linux and open-source software in the newsroom."

That demo room session by Aron Pilhofer at the 2002 computer-assisted reporting conference in Philadelphia was the first time that open source made an appearance on the program.

The National Institute for Computer-Assisted Reporting is a joint program of
Investigative Reporters and Editors, Inc., and the Missouri School of Journalism.

141 Neff Annex, Missouri School of Journalism, Columbia MO, 65211, Tel. 573-882-2042, Fax 573-884-5544
