Are newsrooms more interested in protecting their online content from theft than in making sure it isn’t already stolen? Craig Silverman examines plagiarism detection services.
Six years ago, in the wake of the Jayson Blair scandal at the New York Times, Peter Bhatia, then the president of the American Society of Newspaper Editors, gave a provocative speech at the organization’s 2004 conference.
“One way to define the past ASNE year is to say it began with Jayson Blair and ended with Jack Kelley,” he said.
Bhatia’s message was that it was time for the industry and profession to take new measures to prevent serious breaches such as plagiarism and fabrication:
“Isn’t it time to say enough? Isn’t it time to attack this problem in our newsrooms head-on and put an end to the madness? … Are we really able to say we are doing everything possible to make sure our newspaper and staffs are operating at the highest ethical level?”
Today, six years after his speech, another plagiarism scandal has erupted at the New York Times (though it’s certainly not on the scale of Blair’s transgressions). A separate incident also recently unfolded at the Daily Beast. Once again, the profession is debating how to handle the issue. One suggestion I made in my weekly column for Columbia Journalism Review was for newsrooms to start using plagiarism detection software to perform random checks on articles. New York Times public editor Clark Hoyt followed up with a blog post on this idea, and there has been other discussion.
Many people are left wondering how effective these services are, why they aren’t being used in newsrooms, and which ones might be the best candidates for use in journalism. Surprisingly, it turns out that newsrooms are more interested in finding out who’s stealing their content online than making sure the content they publish is original.
So, as a result of the above concerns, almost no traditional news organizations use a plagiarism detection service to check their work either before or after publication. (On the other hand, Demand Media, a company that has attracted a lot of criticism for its lack of quality content and low pay, is a customer of iThenticate.) Here’s the strange twist: Many of these same organizations are happy to invest the money, time and other resources required to use services that check if someone else is plagiarizing their work.
Jonathan Bailey, the creator of Plagiarism Today and president of CopyByte, a consultancy that helps advise organizations about detecting and combating plagiarism, said he’s aware of many media companies that pay to check and see if anyone’s stealing their content.
“It’s fascinating because one of the companies I work with is Attributor … and I’m finding lots of newspapers and major publications are on board with [that service], but they are not using it to see if the content they’re receiving is original,” he said. “It’s a weird world for me in that regard. A lot of news organizations are interested in protecting their work from being stolen, but not in protecting themselves from using stolen work.”
(In a separate article on J-Source, Ira Basen looks deeper at media companies using Attributor.)
Bailey compares these services to search engines. Just as Google will take a query and check it against its existing database of web content, plagiarism detection services check a submitted work against an existing database of content.
“They work fundamentally with the same principles as search engines,” he said. “They all take data from various sources and fingerprint it and compress it and store it in a database. When they find potential matches, they do a comparison.”
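To make Bailey's description concrete, here is a minimal sketch of the fingerprint-and-compare idea in Python. It is an illustration only, not how any of the services named here actually work: the `FingerprintIndex` class, the five-word window, and the 0.2 threshold are all assumptions chosen for the example.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, k=5):
    """Return the set of overlapping k-word windows ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=5):
    """Hash each shingle to a compact integer (the 'fingerprint and compress' step)."""
    return {
        int(hashlib.sha1(s.encode("utf-8")).hexdigest()[:12], 16)
        for s in shingles(text, k)
    }


class FingerprintIndex:
    """Hypothetical in-memory stand-in for a service's content database."""

    def __init__(self, k=5):
        self.k = k
        self.postings = defaultdict(set)  # shingle hash -> ids of documents containing it

    def add(self, doc_id, text):
        for h in fingerprint(text, self.k):
            self.postings[h].add(doc_id)

    def query(self, text, threshold=0.2):
        """Return (doc_id, share of the submission's shingles found in that doc)."""
        fp = fingerprint(text, self.k)
        if not fp:
            return []
        hits = defaultdict(int)
        for h in fp:
            for doc_id in self.postings.get(h, ()):
                hits[doc_id] += 1
        scored = [(doc_id, n / len(fp)) for doc_id, n in hits.items()]
        return sorted((p for p in scored if p[1] >= threshold), key=lambda p: -p[1])


index = FingerprintIndex()
index.add("council-story", "The city council voted on Tuesday to approve a new budget "
                           "that raises property taxes by three percent.")
index.add("battery-story", "Scientists at the university announced a breakthrough in battery "
                           "technology that could double the range of electric cars.")

submitted = ("In a surprise move, the city council voted on Tuesday to approve "
             "a new budget that raises property taxes, officials said.")
matches = index.query(submitted)
print(matches)
```

A submission that lifts a long run of words from an indexed story shares many hashed shingles with it and surfaces as a potential match; a human still has to decide whether the overlap is plagiarism, a quotation, or coincidence.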
Each service has its own algorithm to locate and compare content, and they also differ in terms of the size of their databases. Many of the free or less expensive services only search the current web. That means they don’t compare material against archived web pages or proprietary content databases.
Bailey said that another big difference between services is the amount of work a person must do to examine potential matches. (This is the concern voiced by the editor at the New York Times.) Some services return a list of potential matches but don’t indicate which specific sentences or paragraphs seem suspect, which forces a person to spend time eliminating false positives.
Bailey said some of the web-only services are also unable to distinguish between content that is properly quoted or attributed and material that is stolen. This, too, can waste a person’s time. However, he said that iThenticate, for example, does a decent job of eliminating the more obvious false positives, and that it has an API that enables it to be integrated into any web-based content management system.
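The two time-savers Bailey describes, pointing to the specific suspect sentences and ignoring properly quoted material, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not iThenticate's method or API: the function names, the four-word window, and the 0.5 overlap cutoff are invented for the example, and text inside double quotation marks is simply treated as attributed.

```python
import re


def strip_quotes(text):
    """Drop spans inside straight or curly double quotes (assumed to be attributed)."""
    return re.sub(r'"[^"]*"|“[^”]*”', " ", text)


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_ngrams(text, n=4):
    """Return the set of overlapping n-word sequences in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def suspect_sentences(submission, source, n=4, min_overlap=0.5):
    """Return (sentence, overlap) for unquoted sentences that largely repeat the source."""
    source_grams = word_ngrams(source, n)
    flagged = []
    for sentence in split_sentences(strip_quotes(submission)):
        grams = word_ngrams(sentence, n)
        if not grams:
            continue  # too short to judge
        overlap = len(grams & source_grams) / len(grams)
        if overlap >= min_overlap:
            flagged.append((sentence, round(overlap, 2)))
    return flagged


source = ("The company reported record profits this quarter thanks to strong overseas sales. "
          "Analysts expect the trend to continue into next year.")
submission = ("The company reported record profits this quarter thanks to strong overseas sales. "
              "The chief executive said, “Analysts expect the trend to continue into next year.” "
              "Shares rose sharply in early trading on Wednesday morning.")
flagged = suspect_sentences(submission, source)
print(flagged)
```

Here only the first sentence is flagged; the second repeats the source too, but inside quotation marks, so it is skipped. A real service would need far more care, since quotation marks alone do not prove proper attribution and the naive splitter mishandles abbreviations.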
Bailey has used and tested a wide variety of the plagiarism detection services available, and said they vary widely in quality. Separately, Dr. Debora Weber-Wulff, a professor of Media and Computing at Hochschule für Technik und Wirtschaft Berlin, has conducted two tests of these services.
Asked to summarize the effectiveness of these services, Dr. Weber-Wulff offered a blunt reply by email: “Not at all.”
“They don’t detect plagiarism at all,” she wrote. “They detect copies, or near copies. And they are very bad at doing so. They miss a lot, since they don’t test the entire text, just samples usually. And they don’t understand footnoting conventions, so they will flag false positives.”
Her tests involved taking 31 academic papers that included plagiarized elements and running them through different services. Her data is important and worth looking at, though journalists should note that academic papers and news articles are unlikely to produce the same results. The Hartford Courant seemed happy with its service, as was the Fort Worth Star-Telegram when it used one a few years ago, according to Clark Hoyt’s blog post. On the other hand, the New York Times continues to have concerns.
For his part, Bailey mentioned a few services that might work for journalism.
“IThenticate do a very good job, they provide a good service but it is pricey,” he said, “and it is very difficult to motivate newspapers to sign up when they’re putting second mortgages on their buildings.”
He also mentioned Copyscape.
“Copyscape is another one that is very popular and it’s very cheap at only 5 cents a search,” he said, noting it took the top spot in Dr. Weber-Wulff’s latest study. “It’s very good at matching — it uses Google — and it does a thorough job, though the problem is that it only searches the current web, so you have a limited database.”
He recommends using Copyscape if the plan is to perform spot checks for a news organization. Bailey also mentioned SafeAssign as a third offering to look at.
In his view, it’s unacceptable that news organizations are ignoring these services.
“The risks of not running a [plagiarism] check are incredibly high,” he said, citing the damage that can be done to a brand, and the potential for lawsuits. “At the very least, they should be doing post-publication checking rather than letting the public-at-large or competitors do it for them, because that’s when things get particularly ugly.”
This article was originally published on PBS Mediashift. J-Source and MediaShift have a content-sharing arrangement to broaden the audience of both sites.