Web scraping: How journalists get their own data – Canadian Journalism Foundation

Don't rely on others to get you the information you need. Glen McGregor explains in the latest issue of Media magazine that the ability to program a basic script can allow you the freedom to no longer rely solely on Access to Information laws and give you the liberty to analyze data in new ways. And a foreword by David McKie, J-Source's Ideas editor, explains why this is important.

There is a way to get data even when bureaucrats say you can’t have the information

By David McKie

Governments of all stripes, and at all levels, are declaring themselves open for business. Some major cities are uploading an increasing number of data sets online, allowing journalists to find potential stories that would have been locked away in city filing cabinets or stored on computer hard drives.

The federal government has recently announced that it, too, is committed to open government through a series of initiatives that have had mixed success. Despite the hopeful talk of openness, there are still obstacles.

As they’ve done in the past, journalists may prevail in winning access to information that departments are less than willing to share. But these battles take time in an era when the news cycle quickens with each passing day.

Sometimes, there’s a need to go after these data sets even when bureaucrats are saying, ‘no you can’t have them.’ And this is where a concept called “scraping” comes in handy.

Admittedly, it’s not for everyone. If you have trouble organizing information on your hard drive, or navigating the web with the help of a search engine, then scraping may not be for you.

Still, it’s useful to know what can be done so if you lack the skill, you can at least find a more savvy journalist or computer programmer to get the job done.

Luckily, we have the Ottawa Citizen’s Glen McGregor, who is using innovative techniques to find and tell interesting and important stories. His latest series examined Health Canada’s Medical Marijuana program.

In his column for the most recent edition of Media magazine, which has been online for a few weeks now, Glen walks us through the process of scraping.

It’s also worth mentioning that Glen has kindly agreed to help me organize and run some data journalism workshops during the upcoming Canadian Association of Journalists conference to be held in Toronto from April 27 to April 29. We’re working on the details for what we hope will be dynamic sessions. In the meantime, happy reading.

***

Scrape your way to happiness

By Glen McGregor

In the film The Social Network, an early scene shows how Mark Zuckerberg extracted pictures of other Harvard students from the university’s web servers to run head‑to‑head beauty contests on his own website.

The beer‑fuelled project got Zuckerberg in trouble with Harvard administrators and vilified by students, but ultimately inspired the creation of a little site that eventually went on to be called Facebook.

The technique the Zuckerberg character demonstrated was called “web scraping” or “screen scraping.” It is the use of computer programs to robotically download large amounts of data from the web that could not easily be obtained manually. Pointing‑and‑clicking the link to every student’s photo and saving the file would be impossibly slow and boring for a human. For a computer, the job is child’s play.

Web scraping is commonly used by programmers to extract records from other websites. If you’ve ever Googled for the best price on airfare to Florida, or a deal on an LCD television, you’ve probably come across comparison sites that rely on price quotes scraped from, say, American Airlines and Air Canada, or Best Buy and Amazon.

For data journalists, web scraping can be a powerful tool that allows them to assemble electronic records to find stories that could not be easily obtained otherwise.

Government agencies are increasingly putting data online. But rarely do they provide the data in a way that can be downloaded in raw form, the way journalists like to use data ‑‑ to analyse in Microsoft Excel, crunch in a MySQL database or upload to a Google Map.

Usually, the data is hidden behind a web interface that the ministry or agency has created and that allows searches on a particular term.

Type Toyota into the search function of Transport Canada’s database of vehicle recalls and you’ll find the matching records. But nowhere does Transport Canada allow you to download all recall records for all makes and models of vehicles for all years. (McKie's note: This would have come in handy after stories broke about problems with Toyota’s massive recalls in Canada and the United States.)

If your story can wait 30 days or longer, you might be able to get the same electronic data by filing a freedom-of-information or access-to-information request. However, there’s a chance the agency will turn it down, sometimes claiming the data is already available through their web interface. Even if you get the data, it starts to get stale the moment it’s burned onto a CD.

But by web scraping, journalists can extract the data using their own custom programs that will send repeated search requests (to use our example) to Transport Canada’s server, for not just Toyotas, but Hondas, Fords, Chevies, Ferraris and the hundreds of other vehicles in their database. The script will capture the results returned and save them in a nice, tidy text file on your hard drive that will load into Excel. And it will keep doing it right up until publication.
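To make the idea of repeated search requests concrete, here is a minimal sketch in Python. It only builds one search address per vehicle make and writes rows to a comma-separated file that Excel can open. The address, the "make" parameter name and the column names are assumptions for illustration; the real ones would have to be read from the Transport Canada site itself.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical search address and parameter name, used only for illustration.
SEARCH_URL = "https://example.org/recalls/search"


def build_search_urls(makes):
    """Return one search URL per vehicle make."""
    return [SEARCH_URL + "?" + urlencode({"make": make}) for make in makes]


def write_rows(rows, handle):
    """Write scraped records as CSV, a format Excel opens directly."""
    writer = csv.writer(handle)
    # Assumed column names; a real scrape would use the site's own fields.
    writer.writerow(["make", "model", "year", "recall"])
    writer.writerows(rows)


makes = ["Toyota", "Honda", "Ford"]
urls = build_search_urls(makes)

# Placeholder row standing in for whatever the scraper actually collects.
sample_rows = [("Toyota", "ExampleModel", "2010", "placeholder text")]
buffer = io.StringIO()
write_rows(sample_rows, buffer)
```

In a working script, each address in urls would be fetched in turn and the results written to a real file opened with open("recalls.csv", "w", newline="") rather than to an in-memory buffer.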

Another example: the City of Ottawa posts restaurant health inspections online. You can search by the name of the restaurant to see every inspection, good and bad. The current design of the EatSafe database, however, doesn’t allow searches based on other criteria. One cannot see, for instance, all the restaurants on Somerset Street, near the city’s downtown core, that failed inspections in the last week.

Using web-scraping techniques, the Ottawa Citizen in 2010 extracted all the records in the EatSafe database and ran a series of stories that explored the concentration of health problems in Chinatown and the high failure rate of shawarma spots. A similar script was used to download lists of Order of Canada recipients, resulting in a story about under‑representation of Westerners in the national honour. A scrape of crime reports showed the areas of Ottawa with the most stolen bicycle complaints. Scraping Craigslist’s local “Missed Connections” page led to a blog post that showed men were searching for lost love far more than women ‑‑ and mostly on the bus.

Every online data set is configured differently, so each requires a unique approach to scraping.

Options for doing it?

Some data can be grabbed with something as simple as DownThemAll!, a free plug‑in for the Firefox web browser that will download all the linked files on a page. It can be configured using filters for file names and types, so one could capture only JPEGs with the word “Harper” or PDFs called “report.”
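The same kind of filtering can be sketched in a few lines of Python. This is not how DownThemAll! works internally; it is an illustration, using only the standard library, of collecting every link on a page and keeping those whose file name matches a type and a keyword. The sample page and the example.org address are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the address of every <a href="..."> link on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into full addresses.
                self.links.append(urljoin(self.base_url, href))


def filter_links(links, extension, keyword=""):
    """Keep links whose file name ends with extension and contains keyword."""
    kept = []
    for link in links:
        name = link.rsplit("/", 1)[-1].lower()
        if name.endswith(extension.lower()) and keyword.lower() in name:
            kept.append(link)
    return kept


# Made-up page content standing in for a downloaded web page.
page = (
    '<a href="/img/harper-speech.jpg">photo</a>'
    '<a href="/img/logo.png">logo</a>'
    '<a href="/docs/report.pdf">report</a>'
)
collector = LinkCollector("https://example.org/")
collector.feed(page)

jpegs = filter_links(collector.links, ".jpg", "harper")
pdfs = filter_links(collector.links, ".pdf", "report")
```

Here jpegs holds only the Harper photo and pdfs only the report; the logo is skipped by both filters.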

In The Social Network, the Zuckerberg character refers to open‑source software called Wget, which runs from the command line on Macs or PCs and has powerful customization options. It can extract files based on an itemized list of file names. You’ll need to play around with the command line functions to learn how to use Wget.

Web-scraping is the ultimate

But the most effective approach to real web‑scraping is to write your own custom computer scripts. Often, these are the only way to extract data from online databases that require user input, such as the vehicle recalls list or restaurant inspections site.

To do this, you will need to learn a little bit of computer programming using a language such as Python, Ruby, Perl or PHP. You only need to choose one.

Python, named after Monty not the snake, is my favourite for its simple syntax and great online support from Pythonistas. Ruby is also popular with data journalists.

Most Chapters stores have entire shelves devoted to programming guides, with lots of great entry‑level stuff. You can also work through free online tutorials that will guide you from installation to writing complex routines. Once you’re comfortable writing simple programs, also called “scripts”, you can soon graduate to using web interfaces available for these languages to download online data.

A program to scrape the vehicle recalls database would be written to submit a search term to the Transport website from a list of vehicle makes. It would capture the list of links the web server returns, then another part of the program would open each of these links, read the data, strip out all the HTML tags, and save the good stuff to a file.
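The "strip out all the HTML tags" step is the easiest part to show on its own. The sketch below uses Python's built-in HTML parser to keep only the text between tags. The sample table row is invented, and a real page would likely need extra handling, for instance to skip the contents of script and style blocks.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep the text between tags and discard the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the visible text of an HTML fragment, separated by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented fragment standing in for one row of a results page.
sample = "<tr><td><b>Make:</b></td><td>Toyota</td></tr>"
print(strip_tags(sample))  # prints "Make: Toyota"
```

A full scraper would call strip_tags on each record page it opens and append the result to its output file.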

Depending on the number of records and the speed of the server, it might take hours to run the program and assemble all the data in a single file. (For journalists not inclined to learn a computer language, Scraperwiki.com brings together programmers with people who need scraping work done.)

Some websites can be difficult to scrape, even for skilled programmers. Many use cookies, session IDs or captchas that can complicate attempts to scrape. Watch that scene in The Social Network to get a sense of the variability and unique problem solving required for each site.
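Cookies and session IDs are often the most manageable of these obstacles. As a rough sketch, Python's standard library can keep a "cookie jar" that remembers whatever the server sets on the first visit and sends it back on later requests, much as a browser does. Nothing here contacts a server, and the function name make_session is only illustrative; captchas are a different problem and are not addressed.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session()

# In a real scrape, opener.open(first_url) would store any cookies the
# server sets in jar, and later opener.open(...) calls would send them
# back automatically, so the site sees one continuous session.
```

The design choice is to create the opener once and reuse it for every request in the scrape, rather than opening each address independently.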

Keys to success

The key to effective scraping is to understand how your web browser communicates with the web server. You can eavesdrop on this communication using another Firefox plug‑in called Firebug, or other similar utilities available for the Google Chrome browser. These show the instructions passed back and forth with each search request. Your scraping script will need to replicate these to work effectively.
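Once you have seen what the browser sends, replicating it might look something like the following Python sketch. It builds, but does not send, a form submission with the same fields and headers a browser would use. The address, the field names "make" and "year", and the browser identification string are all assumptions; the real values are whatever the eavesdropping tool shows for the site in question.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(url, form_fields, user_agent):
    """Build a POST request that mimics a browser submitting a search form."""
    body = urlencode(form_fields).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical address, form fields and browser string for illustration.
request = build_search_request(
    "https://example.org/recalls/search",
    {"make": "Toyota", "year": "2010"},
    "Mozilla/5.0 (example)",
)

# urllib.request.urlopen(request) would send it; that step is left out here.
```

If the server's response differs from what the browser receives, the usual cause is a header or form field that the script has not copied.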

Make no mistake, learning to web scrape is time consuming in a way that doesn’t always show up in your published or broadcast story. Your editor will be baffled when you start describing why you need to do it, so you’ll probably have to learn to scrape on your own time.

But if you can develop web scraping skills, no longer will you have to rely solely on FOIs and ATIPs to get data. You take the data governments already put online and leapfrog over their clumsy interfaces to create your own copies.

When the media relations flack won’t give you the electronic records you need and instead directs you to the online search engine, you can scrape the data out and start reporting on it immediately.

Glen McGregor is a national affairs reporter with the Ottawa Citizen. He is available to give web-scraping and data-journalism seminars to your newsroom or classroom.

