In my last blog post, I discussed the list of the 100 most-cited papers of all time compiled by Thomson-Reuters for Nature to celebrate the 50th anniversary of the Science Citation Index. In the same Nature article, there is a brief mention of a similar list compiled by Google based on their Google Scholar database. Unlike the Thomson-Reuters/Science Citation Index (SCI) list, the Google list includes books. This is partly a byproduct of the way the two databases are structured—Thomson-Reuters has separate databases for journals and books while Google has a single database that includes journal articles, books and “selected web pages”1—and, I suspect, partly a conscious choice by the Nature editors to focus on the most-cited papers. Certainly, the article focuses on the SCI list rather than the Google list, which, as mentioned above, differs in composition. This provides us with an interesting opportunity to think a little harder about why things get cited and how we go about the business of counting citations and thereby trying to measure impact.
The most striking thing in the Google list is the number of books among the most highly cited works: 64 of the 100, according to Google. Many of these books are technique-oriented, as one might expect from the kinds of papers that made the SCI list discussed in my last post. For example, the most highly cited book on Google’s list, and 4th most cited work overall, is Molecular Cloning: A Laboratory Manual by Sambrook, Fritsch and Maniatis. The same book, but with a different permutation of authors (Maniatis, Fritsch and Sambrook), also shows up as number 15 on Google’s list. How can this be? This book has gone through a number of editions, with changing authorship. The book at #4 on Google’s list is the second edition, while #15 is the first edition. This highlights one of the key difficulties in compiling such a list: books are often inconsistently cited, and changing editions pose a challenge in deciding whether or not to combine citations.

Since the various editions, despite the permuted author order, are really stages in the evolution of the same book, it seems to me that we should combine the citation counts for entries 4 and 15 (as well as for later editions that don’t show up on this list). That would vault Molecular Cloning to #1 on Google’s list. If we took citations as a measure of impact, this book would be the most important scientific work of all time (so far). However, I think we can all agree that there is something funny about that statement. The number of citations indicates that this is clearly a very useful book, but it’s a compendium of methods developed by many, many other groups. It is highly cited because of its convenience. The original papers are not cited as often as this book (at least by Google’s count), but clearly it’s the original scientific work, used by thousands of labs around the world, that has had the impact, not this manual.
So here we have a work that is very highly cited (and therefore, by any reasonable definition, important) but where it’s obvious that the very large citation count is not measuring scientific impact so much as utility as a reference.
The same sort of argument could be applied to scientific papers. Take, for example, the density functional theory papers discussed in my previous post. I would argue that the two papers by Walter Kohn in the SCI list have had more impact than any of the other DFT papers in this list since they enabled all the subsequent work developing the theory into practical methods. But they are not cited as often as some of the papers that describe functionals used in quantum chemical calculations. Citations therefore measure something—utility?—but it isn’t impact as I would understand the term.
There are some books on Google’s list that do describe original contributions to the literature. Among other things, there are those I would characterize as “big idea” books, in which new, influential ideas were first described. Number 7 on Google’s list is Thomas Kuhn’s The Structure of Scientific Revolutions. This is not a book that contains practical advice on carrying out particular experiments or particular types of calculations. It’s a contribution to the history and philosophy of science. But Kuhn’s ideas about the way in which science progresses have really struck a chord, so this book is cited a lot, across a wide range of fields, most of which have nothing to do with the history or philosophy of science.
The Google list also contains works from fields outside of hard-core science, which we don’t see in the Science Citation Index list. Thus, number 6 on Google’s list is Case Study Research: Design and Methods by Robert K. Yin, a book covering research methods used mostly in business studies. The Google list includes a number of other works from business studies, as well as from the social sciences. It’s sometimes useful to be reminded that “research” and “scientific research” are not synonymous.
But this is a blog about science, so back to science we go. An interesting question we could ask is how the books on Google’s list would have fared if they had been included in the Thomson-Reuters effort. To try to answer this question, I looked at another highly cited book, #5 in the Google list, Numerical Recipes by Press, Teukolsky, Vetterling and Flannery. Looking up citations to books in the Science Citation Index is not trivial. Because books don’t have records in the SCI database, there is no standard format for the citation extracted from citing papers. Moreover, people often make mistakes in formatting citations. Authors are left out, or the order of authorship is permuted. Additionally, people often cite a particular chapter or page when citing a book, and each of these specific citations is treated as a citation of a different work in the database. Anyhow, here’s what I did: I searched for citations to works by “Press W*” entitled “Num*”. This generated a list of 4761 cited works. This large number of distinct hits to the Numerical Recipes books makes it impossible to complete the search for citing articles. All we can tell is that there are more than 4761 citations to Numerical Recipes in the Web of Science database. In fact, the number must be much larger since it’s plain to see even from the small sample I looked at that some of the variations are cited dozens or even hundreds of times. But an accurate method of counting them in the Web of Science evades me.
Numerical Recipes is a bad case: there are many editions with slightly different titles (“in Fortran”, “in C”, etc.), the subtitle is sometimes included (“The Art of Scientific Computing”), there are multiple authors, and so on. Maybe if we try a book with one author and a limited number of editions? I then tried to do a citation search for Kuhn’s The Structure of Scientific Revolutions. Here, we find a different problem: the results are highly sensitive to details such as whether or not the “The” from the title is included. And, although there are far fewer hits than for Numerical Recipes, there are still hundreds of them to sift through. Again, I’ve had to admit defeat: there does not appear to be a simple way to count citations to heavily cited books in the Web of Science.
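The kind of merging that defeats the Web of Science interface is easy to sketch in code, at least in principle. Here is a minimal sketch in Python: the cited-reference variants and their citation counts are made up for illustration, and the normalization rules (dropping author initials, subtitles, language and edition tags, and a leading “The”) are just one plausible set of heuristics, not how any real citation database works.

```python
import re
from collections import defaultdict

# Hypothetical cited-reference variants, as they might appear in a
# citation database, with made-up citation counts.
variants = [
    ("Press WH, Numerical Recipes in C", 900),
    ("Press W, Numerical Recipes", 4500),
    ("Press WH, Numerical recipes: the art of scientific computing", 320),
    ("Kuhn TS, The Structure of Scientific Revolutions", 1200),
    ("Kuhn T, Structure of scientific revolutions", 450),
]

# Strip subtitles (": ..."), language tags ("in C", "in Fortran"),
# and edition markers ("2nd ed ...") from a lower-cased title.
SUBTITLE_OR_EDITION = re.compile(
    r"(:.*| in (c|fortran|pascal)\b.*|\b\d+(st|nd|rd|th) ed\b.*)"
)

def normalize(ref: str) -> str:
    """Collapse superficial differences between citation variants."""
    author, _, title = ref.partition(",")
    surname = author.split()[0].lower()          # drop initials: "Press WH" -> "press"
    title = title.strip().lower()
    title = SUBTITLE_OR_EDITION.sub("", title)   # drop subtitle / language / edition
    title = re.sub(r"^the\s+", "", title)        # a leading "The" is often omitted
    title = re.sub(r"[^a-z ]", "", title).strip()
    return f"{surname}|{title}"

# Group the variants under a normalized key and sum their counts.
totals = defaultdict(int)
for ref, count in variants:
    totals[normalize(ref)] += count

for key, total in sorted(totals.items()):
    print(key, total)
```

Of course, the hard part in practice is not the grouping but deciding which variants really are the same work and, as the Molecular Cloning example shows, whether different editions should be merged at all.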
Of course, citation counting is a tricky business at the best of times, and the problem afflicts both the Thomson-Reuters and Google Scholar databases. Errors in citations, which are fairly frequent, may deflate the citation count of a paper unless one is very careful about structuring the search. But beyond that, some papers are just hard to chase down in the database. Take the first of Claude Shannon’s two classic 1948 papers on information theory, number 9 on the Google list and nowhere to be found on the SCI list. It’s actually very difficult to find this paper in the Google database. I have found many lightly cited variants of this citation, but the version that Google Scholar reports as having been highly cited is actually a 2001 corrected reprint in the ACM journal Mobile Computing and Communications Review. It’s not clear to me that this is correct—has this paper really been cited more often than the 1948 original?—but then I’m not sure how Google’s database is structured. For the record, the Web of Science reports that the 1948 paper has been cited 9847 times, while the 2001 reprint has been cited 278 times. Quirks of a database can make the apparently simple act of counting citations tricky, all the more so for highly cited papers.
We all wish that we could quantify scientific output so that we could make better decisions about funding, prizes, and so on. It would sure make all of our lives much easier if this were possible. However, the problems that plague the apparently simple task of trying to round up and interpret a list of the most cited work—high citation rates for work that provides a convenient reference for an idea or technique but is not particularly original (books on methods, review papers), inconsistent database entries and citation errors—also affect citation counts for work that has accumulated a more normal number of citations. None of this is to deny the importance of a good book or review paper, nor are my comments intended to mean that there isn’t a clear difference in impact between two research papers in the same field whose citation counts differ by an order of magnitude. But there are enough subtleties in counting citations and in interpreting the results that I would not want to try to rank papers or scientists on this basis.
1. R. Vine (2006) Google Scholar. J. Med. Libr. Assoc. 94, 97–99.
What is the most cited work you have found? According to Google Scholar, the book Diagnostic and Statistical Manual of Mental Disorders (DSM-5) by the American Psychiatric Association has been cited 240,000+ times.
The Google Scholar list referenced in this post gives U.K. Laemmli’s article “Cleavage of structural proteins during the assembly of the head of bacteriophage T4” (Nature 227, 680–685, 1970, http://dx.doi.org/10.1038/227680a0) as the most cited item in their survey, with a bit over 213,000 citations at the time they compiled their list. Interestingly, the DSM doesn’t show up in their list. The Nature article in which these citation lists were discussed doesn’t give enough information about their methodology to figure out why the DSM was missed. I agree that it almost certainly should have shown up in the Google Scholar list. This omission shows how careful we have to be in interpreting the results of all of these citation counting experiments.