Three (more) good reasons to learn Linux

The other day, I was talking to one of my colleagues about how I rarely use ArcView anymore. Since becoming a full-fledged Linux jockey, I've found so many tools that process GIS data better than Arc ever has. Sure, it still provides a pretty graphical interface, which definitely comes in handy, but most of the time I don't need it. The green and black of the Linux terminal window suits me just fine.

Before you dismiss my flagrant geekdom, let me make this case: Whenever a question comes up on NICAR-L about how to accomplish some uncommon task, I can almost always think of a command-line tool that will get the job done. For free. If that isn't a good reason to start making friends with Linux, nothing is.

So if you're looking for an excuse to dive into the Linux terminal, I trolled back through the listserv archives and found a few examples of problems that were easily solved with old-school command-line tools:

Question: How do I convert (insert GIS data type here) into a shapefile? Or a shapefile into KML?
Answer: ogr2ogr

Not only will ogr2ogr help you convert MapInfo, CAD, PostGIS, KML or any other spatial data type into anything else, it's also great for piping DBF files (the NICAR data library's file format of choice) into MySQL. If I ever need to turn a shapefile into KML, or a TIGER file into a shapefile, this is where I turn.

As for how to use it, this guide explains it better than I ever could.
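For a rough idea of what a conversion looks like, here is a minimal sketch that builds two ogr2ogr calls from Python. The file names are hypothetical and the helper function is mine, added only for illustration; the one thing it relies on is ogr2ogr's basic form, `ogr2ogr -f <format> <destination> <source>`, with "KML" and "ESRI Shapefile" as the format names.

```python
import os
import shlex
import shutil
import subprocess


def ogr2ogr_command(src, dst, driver):
    """Build an ogr2ogr call of the form: ogr2ogr -f <driver> <dst> <src>."""
    # Note the argument order: the destination comes before the source.
    return ["ogr2ogr", "-f", driver, dst, src]


# Hypothetical file names, used only for illustration.
shp_to_kml = ogr2ogr_command("counties.shp", "counties.kml", "KML")
kml_to_shp = ogr2ogr_command("counties.kml", "counties_out.shp", "ESRI Shapefile")

if __name__ == "__main__":
    # Show the commands as you would type them in a terminal.
    print(shlex.join(shp_to_kml))
    print(shlex.join(kml_to_shp))
    # Only attempt the conversion if ogr2ogr is installed and the input exists.
    if shutil.which("ogr2ogr") and os.path.exists("counties.shp"):
        subprocess.run(shp_to_kml, check=True)
```

The printed lines are exactly what you would type at the prompt, so you can skip Python entirely once the pattern is familiar.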

Question: How do I split/merge/convert to text/unlock/do anything else to a PDF?
Answer: pdftk

Pdftk is the Swiss Army knife of PDF processing. Sure, Xpdf and various Windows tools do a great job. But pdftk does more. In fact, if you mix pdftk with Ghostscript, you'll basically have the fully functioning equivalent of Adobe Acrobat Pro.
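To make the split and merge cases concrete, here is a small sketch that assembles three common pdftk calls from Python. The helper names and file names are hypothetical, chosen for illustration; the pdftk operations themselves (`cat`, `output` and `burst`) are the tool's standard ones.

```python
import shlex


def pdftk_merge(inputs, output):
    """Concatenate several PDFs into one: pdftk a.pdf b.pdf cat output out.pdf"""
    return ["pdftk", *inputs, "cat", "output", output]


def pdftk_extract(src, page_range, output):
    """Pull a page range out of a PDF: pdftk in.pdf cat 1-5 output out.pdf"""
    return ["pdftk", src, "cat", page_range, "output", output]


def pdftk_burst(src):
    """Split a PDF into one file per page: pdftk in.pdf burst"""
    return ["pdftk", src, "burst"]


# Hypothetical file names, used only for illustration.
merge_cmd = pdftk_merge(["budget.pdf", "appendix.pdf"], "combined.pdf")
extract_cmd = pdftk_extract("combined.pdf", "1-5", "first_five.pdf")
burst_cmd = pdftk_burst("combined.pdf")

if __name__ == "__main__":
    # Print each command as it would be typed in a terminal.
    for cmd in (merge_cmd, extract_cmd, burst_cmd):
        print(shlex.join(cmd))
```

Pass any of these lists to `subprocess.run(cmd, check=True)` to actually run it, assuming pdftk is installed and the input files exist.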

Because it's open source, pdftk also comes with an added bonus: the ability to modify the source code (which I believe is C++). Say you have an encrypted PDF that Acrobat won't let you modify. A few strategically placed lines of code in the pdftk source can get you around that. I'll save the specifics for the truly interested nerds, but trust me: It works.

Question: Where can I find good OCR software on the cheap?
Answer: ocropus

Ocropus is based on the old Tesseract OCR engine, which was designed more than 10 years ago by Hewlett-Packard and has evolved into one of the most accurate open source OCR engines available. Google's coders, who are working on Ocropus, are slowly integrating layout analysis, noise reduction and other features critical to accurate OCR.

If you need to convert an image to text, this is about the best OCR package you'll find for free.

I've said this before, but taking the time to learn the command line will completely change the way you look at CAR. A lot of the strange quirks that pop up in file systems and programming languages will suddenly make sense.

If you want to get started, read Matt Waite's post, download Ubuntu and check out this guide. You'll be glad you did.

The National Institute for Computer-Assisted Reporting is a joint program of
Investigative Reporters and Editors, Inc., and the Missouri School of Journalism.

141 Neff Annex, Missouri School of Journalism, Columbia MO, 65211, Tel. 573-882-2042, Fax 573-884-5544

All Rights Reserved