Craig Silverman

Are newsrooms more interested in protecting their online content from theft than in making sure it isn’t already stolen? Craig Silverman examines plagiarism detection services.

Six years ago, in the wake of the Jayson Blair scandal at the New York Times, Peter Bhatia, then the president of the American Society of Newspaper Editors, gave a provocative speech at the organization’s 2004 conference.

“One way to define the past ASNE year is to say it began with Jayson Blair and ended with Jack Kelley,” he said.

Bhatia’s message was that it was time for the industry and profession to take new measures to prevent serious breaches such as plagiarism and fabrication:

Isn’t it time to say enough? Isn’t it time to attack this problem in our newsrooms head-on and put an end to the madness? … Are we really able to say we are doing everything possible to make sure our newspaper and staffs are operating at the highest ethical level?

Today, six years after his speech, another plagiarism scandal has erupted at the New York Times (though it’s certainly not on the scale of Blair’s transgressions). A separate incident also recently unfolded at the Daily Beast. Once again, the profession is engaged in discussion and debate about how to handle this issue. One suggestion I made in my weekly column for Columbia Journalism Review was for newsrooms to start using plagiarism detection software to perform random checks on articles. New York Times public editor Clark Hoyt followed up with a blog post on this idea, and there has been other discussion.

Many people are left wondering how effective these services are, why they aren’t being used in newsrooms, and which ones might be the best candidates for use in journalism. Surprisingly, it turns out that newsrooms are more interested in finding out who’s stealing their content online than making sure the content they publish is original.

Why Newsrooms Don’t Use Them

  1. Cost: The idea of spending potentially thousands of dollars on one of these services is a tough sell in today’s newsrooms. “We’ve had a lot of conversation with media outlets, particularly after a major issue comes up, but the conversation is ultimately what is the cost and whatever cost I give them, they think it’s nuts,” Robert Creutz, the general manager of the iThenticate plagiarism detection service, told me. He estimated his service, which is the most expensive one out there, would charge between $5,000 and $10,000 per year to a large newspaper that was doing random checks on a select number of articles every day. Many other detection services would charge far less, but it seems that any kind of cost is prohibitive these days.
  2. Workflow: When New York Times public editor Clark Hoyt asked the paper’s “editor for information and technology” about these services, he was told the paper has concerns about the reliability of the services. Hoyt also wrote that “they would present major workflow issues because they tend to turn up many false-positive results, like quotes from newsmakers rendered verbatim in many places.” News organizations are, of course, hesitant to introduce anything new into their processes that will take up more time and therefore slow down the news. They currently see these services as a time suck on editors, and think the delay isn’t worth the results.
  3. Catch-22: In basic terms, these services compare a work against web content and/or against works contained in a database of previously published material. (Many services only check against web content.) Major news databases such as Nexis and Factiva are not part of the proprietary plagiarism detection databases, which means the sample group is not ideal for news organizations. As a result, news organizations complain that the services will miss potential incidents of plagiarism. But here’s the flip side: If they signed up with these services and placed their content in the database, it would instantly improve the quality of plagiarism detection. Their unwillingness to work with the services is a big reason why the databases aren’t of better quality.
  4. Complicated contracts: The Hartford Courant used the iThenticate service a few years ago to check op-ed pieces. “It’s worth the cost,” Carolyn Lumsden, the paper’s commentary editor, told American Journalism Review back in 2004. “It doesn’t catch absolutely everything, but it catches enough that you’re alerted if there’s a problem.” When I followed up with her a few weeks back, she told me the paper had ended its relationship with the company. “We had a really good deal with them … But then iThenticate wanted us to sign a complicated multipage legal contract. Maybe that’s what they do with universities (their primary client). But we weren’t interested in such entanglements.”

The Strange Double Standard

So, as a result of the above concerns, almost no traditional news organizations use a plagiarism detection service to check their work either before or after publication. (On the other hand, Demand Media, a company that has attracted a lot of criticism for its lack of quality content and low pay, is a customer of iThenticate.) Here’s the strange twist: Many of these same organizations are happy to invest the money, time and other resources required to use services that check if someone else is plagiarizing their work.

Jonathan Bailey, the creator of Plagiarism Today and president of CopyByte, a consultancy that helps advise organizations about detecting and combating plagiarism, said he’s aware of many media companies that pay to check and see if anyone’s stealing their content.

“It’s fascinating because one of the companies I work with is Attributor … and I’m finding lots of newspapers and major publications are on board with [that service], but they are not using it to see if the content they’re receiving is original,” he said. “It’s a weird world for me in that regard. A lot of news organizations are interested in protecting their work from being stolen, but not in protecting themselves from using stolen work.”

(In a separate article on J-Source, Ira Basen looks deeper at media companies using Attributor.)

How They Work

Bailey compares these services to search engines. Just as Google will take a query and check it against its existing database of web content, plagiarism detection services check a submitted work against an existing database of content.

“They work fundamentally with the same principles as search engines,” he said. “They all take data from various sources and fingerprint it and compress it and store it in a database. When they find potential matches, they do a comparison.”
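To make Bailey's description concrete, here is a minimal Python sketch of the fingerprint-store-compare idea. It is not the algorithm of iThenticate, Copyscape or any other named service, since each uses its own proprietary method. The function names (`fingerprint`, `build_database`, `find_matches`), the five-word window and the 20 per cent threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def fingerprint(text, k=5):
    """Return a set of short hashes, one per run of k consecutive words.

    Hypothetical helper: hashing overlapping word runs is one common way to
    "fingerprint and compress" text, not any vendor's actual method.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    runs = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {hashlib.sha1(run.encode("utf-8")).hexdigest()[:16] for run in runs}


def build_database(documents, k=5):
    """Store a fingerprint for every previously published document."""
    return {name: fingerprint(body, k) for name, body in documents.items()}


def find_matches(submission, database, k=5, threshold=0.2):
    """List stored documents that share enough word runs with the submission.

    The score is the share of the submission's runs found in the stored
    document. The 0.2 threshold is an arbitrary illustrative value.
    """
    submitted = fingerprint(submission, k)
    if not submitted:
        return []  # too short to fingerprint
    matches = []
    for name, stored in database.items():
        share = len(submitted & stored) / len(submitted)
        if share >= threshold:
            matches.append((name, share))
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

An editor's spot check would call `build_database` once on the archive, then `find_matches` on each article under review. The sketch also shows why database size matters: a submission can only match material that was fingerprinted and stored in the first place.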

Each service has its own algorithm to locate and compare content, and they also differ in terms of the size of their databases. Many of the free or less expensive services only search the current web. That means they don’t compare material against archived web pages or proprietary content databases.

Bailey said that another big difference between services is the amount of work they require a person to undertake in order to examine any potential matches. (This is the concern voiced by the editor at the New York Times.) Some services return results that list a series of potential matches, but don’t explain which specific sentences or paragraphs seem suspect. This forces a person to spend time eliminating false positives.

Bailey also said some of the web-only services are unable to distinguish between content that is properly quoted or attributed and material that is stolen. This, too, can waste a person’s time. However, he said that iThenticate, for example, does a decent job of eliminating the more obvious false positives, and that it has an API that enables it to be integrated into any web-based content management system.
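One simple way to cut down on the false positives described above is to drop directly quoted passages before a text is checked, so that a newsmaker's widely reported words do not register as copying. The Python sketch below illustrates that idea only. It does not describe how iThenticate or any other service filters results, and the name `strip_quoted` is hypothetical.

```python
import re

# Matches a passage wrapped in curly or straight double quotation marks.
QUOTED = re.compile(r'“[^”]*”|"[^"]*"')


def strip_quoted(text):
    """Remove directly quoted passages and tidy the leftover whitespace.

    Assumption for illustration: quotation marks are balanced. Unmatched
    marks are left alone, and attributed paraphrase is not detected.
    """
    without_quotes = QUOTED.sub(" ", text)
    return " ".join(without_quotes.split())
```

The trade-off is plain: anything a writer wraps in quotation marks escapes the check, so a filter like this suits a first pass rather than a final verdict.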

Where They’re Most Effective

Bailey has used and tested a wide variety of the plagiarism detection services available, and said they vary widely in terms of quality. Separately, Dr. Debora Weber-Wulff, a professor of Media and Computing at Hochschule für Technik und Wirtschaft Berlin, has conducted two tests of these services.

Her 2007 study is available here, and Bailey also wrote about her 2008 research on his website.

Asked to summarize the effectiveness of these services, Dr. Weber-Wulff offered a blunt reply by email: “Not at all.”

“They don’t detect plagiarism at all,” she wrote. “They detect copies, or near copies. And they are very bad at doing so. They miss a lot, since they don’t test the entire text, just samples usually. And they don’t understand footnoting conventions, so they will flag false positives.”

Her tests involved taking 31 academic papers that included plagiarized elements and running them through different services. Her data is important and worth looking at, though journalists should note that academic papers and news articles are not going to elicit the same results. The Hartford Courant seemed happy with its service, as was the Fort Worth Star-Telegram when it used one a few years ago, according to Clark Hoyt’s blog post. On the other hand, the New York Times continues to have concerns.

For his part, Bailey mentioned a few services that might work for journalism.

“IThenticate do a very good job, they provide a good service but it is pricey,” he said, “and it is very difficult to motivate newspapers to sign up when they’re putting second mortgages on their buildings.”

He also mentioned Copyscape.

“Copyscape is another one that is very popular and it’s very cheap at only 5 cents a search,” he said, noting it took the top spot in Dr. Weber-Wulff’s latest study. “It’s very good at matching — it uses Google — and it does a thorough job, though the problem is that it only searches the current web, so you have a limited database.”

He recommends using Copyscape if the plan is to perform spot checks for a news organization. Bailey also mentioned SafeAssign as a third offering to look at.

In his view, it’s unacceptable that news organizations are ignoring these services.

“The risks of not running a [plagiarism] check are incredibly high,” he said, citing the damage that can be done to a brand, and the potential for lawsuits. “At the very least, they should be doing post-publication checking rather than letting the public-at-large or competitors do it for them, because that’s when things get particularly ugly.”


This article was originally published on PBS MediaShift. J-Source and MediaShift have a content-sharing arrangement to broaden the audience of both sites.