Passing arguments to a Matlab script

It’s sometimes useful to run the same program with different parameters many times, for example when we want to systematically vary a parameter in a model to see how it affects the behavior of the model. This is really easy in C and related languages, which have excellent command-line processing facilities. It turns out to be easy in Matlab as well. Since I had to hunt around to figure out how to make this trick work, I thought I would present it here.

Matlab has a -r command-line switch that allows you to execute a Matlab command right after Matlab starts up. For example, here is a Matlab command-line version of the classic “Hello, world!” program:

matlab -r "disp('Hello, world.')"

(As a parenthetical remark, the mysteries of bash command-line interpretation defeated my attempts to put an exclamation mark at the end of the sentence, as is traditional; a possible workaround is sketched after the next example.) If you type this command at a shell prompt, Matlab will print “Hello, world.” just before the Matlab command-line prompt appears. You can of course imagine initializing one or more variables in a similar manner:

matlab -r "x=1; y=2;"
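As for the exclamation mark, the culprit is almost certainly bash history expansion, which treats ! specially even inside double quotes in interactive shells (it is off by default in shell scripts). A minimal workaround, assuming an interactive bash session, is to turn history expansion off before invoking Matlab:

set +H   # disable bash history expansion for this shell session
matlab -r "disp('Hello, world!')"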

OK, but what if you want to initialize some variables used by a Matlab program? There are a couple of options here. One is to include the program name in the quoted command string. If your program is called prog.m, for example, you could type

matlab -r "x=1; y=2; prog"

Alternatively, you could redirect standard input:

matlab -r "x=1; y=2;" < prog.m

The above will work provided the variables x and y are not defined inside prog.m. If you always want to define these variables from the command line, then that’s fine. However, you might want variables with default values that you can override from the command line. This is also fairly easy. Consider the following program, which we will assume has been saved into a file called prog2.m:

if ~exist('x','var')  % 'var' restricts the test to variables; a bare exist('x') would also match files and functions
    x = 0.1;          % default value, used only if x wasn't set from the command line
end
x                     % display the value actually used

The command line matlab -r "x=1; prog2" displays an x value of 1, while the command line matlab -r prog2 sets x to 0.1, as you might have expected.
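Finally, to return to the original motivation of systematically varying a parameter: here is a minimal sketch of a bash loop that runs prog2.m once per parameter value, assuming a standard Unix Matlab installation. The -nodisplay and -nosplash switches keep each run in the terminal, the trailing exit makes Matlab quit when the script finishes, and the log-file names are purely illustrative:

for x in 0.1 0.2 0.5 1.0; do
    matlab -nodisplay -nosplash -r "x=$x; prog2; exit" > "sweep_$x.log"
done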

A possibly useful way to use impact factors

Journal impact factors are much abused. Originally developed to help librarians make rational decisions about subscriptions, they are increasingly used to judge the worth of a scientist’s output. If we can place a paper in a high-impact-factor journal, we bask in the reflected glory of those who have gone before us, whether our paper is really any good or not. On the other hand, if we publish in lower-impact-factor journals, it’s guilt by association.

If you write grant applications, or have to apply for tenure or promotion, someone is likely to look at the impact factors (IFs) of the journals you have published in, particularly if the papers were published relatively recently and haven’t had time to accumulate many citations. They are particularly likely to do that if they aren’t experts in your field, and aren’t sure about the quality of the journals you publish in. Like it or not, you are going to have to run the IF gauntlet. The problem is that IFs vary widely by field. What you need to do is provide some perspective to the people reading your file so that they don’t assume that the standards of their field apply to yours.

I recently reviewed a grant application whose author found a nice way to address this issue: Each journal in the Thomson-Reuters database is assigned to one or more categories based on the area(s) of science it covers. For each category, the Journal Citation Reports provides a median impact factor as well as an aggregate impact factor, the latter being the impact factor you would calculate for all the articles published in the journals concerned as if they came from a single journal. To put the impact factor of a particular journal in perspective, you compare that impact factor either to the median or to the aggregate impact factor for the category (or categories) to which the journal belongs.

If you’re going to do this, I would suggest that you, first, be consistent about which statistic you use and, second, give this statistic for all the categories that a given journal belongs to. This will avoid accusations that you are cherry-picking statistics.

For example, my most recent paper was published in Mathematical Modelling of Natural Phenomena (MMNP), a journal with an impact factor of 0.8, which doesn’t seem impressive on the surface. This journal has been classified by Thomson-Reuters as belonging to the following categories:

Category                                      Median IF   Aggregate IF   MMNP quartile
Mathematical & Computational Biology             1.5           2.5             4
Mathematics, Interdisciplinary Applications      1.1           1.5             3
Multidisciplinary Sciences                       0.7           5.3             2

This, I think, puts Mathematical Modelling of Natural Phenomena in perspective: It’s not a top-of-the-table journal, but its 0.8 impact factor isn’t ridiculously small either.

A closely related strategy is to indicate which quartile of its category’s impact-factor distribution a journal falls into. This information is also available in Journal Citation Reports, and I have provided these data for MMNP in the table above.

The main point I’m trying to make is that, if at all possible, you should provide an interpretation of your record and not let others impose an interpretation on your file. If you are in a position to fight the IF fire with fire, i.e. with category data from the Journal Citation Reports, it may be wise to do that.

All of that being said, some of the statistics for MMNP shown above demonstrate how crazy IF statistics are. The quartile placements of this journal in its different categories range from the 2nd quartile, which suggests a pretty good journal, to the 4th, which makes it look pretty weak. In an ideal world, I would not suggest that you include such flaky statistics in your grant applications. But we don’t live in an ideal world. Referees and grant panel members discuss IFs all the time, so if it happens that you can tell a positive story based on an analysis of IFs, it’s just smart to do so.

How should we decide whether or not to accept a peer-review invitation?

In a recent commentary published in the journal Science and Engineering Ethics, José Derraik has proposed two criteria for deciding whether one should accept a peer-review invitation. Quoting directly from his article, these are

  1. If a given scientist is an author in x manuscripts submitted for publication in peer-reviewed journals over y months, they must agree to peer-review at least x manuscripts over the same y months.
  2. The perceived status of the journal requesting input into the peer-review process must not be the primary factor affecting the decision to accept or decline the invitation.

As a member of the editorial board of a small open-access journal that is trying to do some good in the world, BIOMATH, I fully concur with Derraik’s second point. If someone has submitted a paper in good faith to a scientific journal, and that journal is seeking expert advice on the quality of the paper, that advice should not be withheld without good reason. Prestige of the journal shouldn’t even be a consideration. I’m not talking about shady journals here, and in any event, the shady journals don’t typically look for peer reviewers.

I also have some sympathy for Derraik’s first point. We all receive too many requests to referee papers. At some point, you have to decide that you have done enough. I’m not sure about the simple equality between published papers and refereed papers that Derraik suggests. I think this is likely to lead to an undersupply of qualified referees. His argument relies on the fact that most papers have multiple authors, but at least in the fields I follow closely, most of those authors are students. While a student can co-referee a paper with a senior scientist as a training exercise, the senior scientist still has to take primary responsibility for the review. In order for the system to work properly, I suspect that most of us have to referee two or three times as many papers as we write. The multiplier might be smaller (perhaps as small as 1) for people who write a lot of papers with many coauthors, but those folks are outliers. Nevertheless, I think Derraik is right that there has to be some proportionality between output and contribution to refereeing.

I think there’s another principle that we should add to Derraik’s list:

3. If you can’t think of very many alternative referees who are as qualified as you are to review the submission you have received, then you should accept the invitation.

This happens more often than you might think. Authors suggest referees based on the people they know in the field doing similar work. Editors similarly work hard to match the paper with appropriate referees, so it does happen fairly often that you’re the ideal referee for something you have received. In those cases, you should assume your responsibilities and do the work if it’s at all possible.

The flip side of Derraik’s list, which he doesn’t tackle directly, is the question of when you should refuse a referee assignment. To me, it comes down to a few things:

  1. I do consider whether I have been doing too much refereeing lately. There is only so much time, and at some point you need to write papers rather than read other people’s stuff all the time.
  2. I always ask myself if I can easily think of other qualified referees. If the answer is yes, I’m more inclined to decline the invitation. That doesn’t mean I automatically decline such invitations, only that I worry less if I feel I have to decline based on other considerations. And of course, I always pass along a list of potential referees to the journal when I do decide to decline an invitation on this basis.
  3. Sometimes, you receive papers you’re just not that qualified to review. Then you should definitely turn down the invitation.
  4. On occasion, you receive something and realize that other time commitments will make it impossible for you to complete the refereeing assignment in a reasonable span of time. Note that journals increasingly request referee reports on unreasonably short timetables. (Two weeks? Get real!) I have to admit that I sometimes turn down refereeing requests because the journal is proposing unreasonable timelines. I simply refuse to jump just because somebody says so. In other cases, I ask the editor if he/she would be willing to receive a report within x weeks, where x is a value chosen to work around other commitments, with x typically less than or equal to 4. They almost always say yes to these requests. There are times, though, when I’m so busy that I really could not read the paper and return the report for many, many weeks. In these cases, it’s best to decline the invitation right away.

Refereeing papers is a largely thankless job (although you may want to check out Publons, which is working to change that). That doesn’t make it less important, but it does mean that you have to balance the time you put into that against other commitments. To me, the overriding consideration is expertise: Am I the right person for the job? If the answer is yes, and you’re not completely overwhelmed with other duties, you really should accept the assignment.

Why you should join a scientific society

Again, this post is mostly addressed to students, since I assume that most scientists with a permanent job are already members of at least one scientific society. I will keep these comments general, although I will mention specific societies as examples from time to time.

Scientific societies vary greatly in focus, size, organization, and, yes, personality. Some, like the American Association for the Advancement of Science (AAAS), cover all the sciences and are, accordingly, massive—Wikipedia says that the AAAS had over 125,000 members in 2008. Others, such as the Canadian Society for Chemistry, target a major scientific discipline. Still others, like the 750-member Society for Mathematical Biology, narrow their focus to a specialized field. As some of their names suggest, scientific societies tend to be organized at the national level, although almost all of them will have significant numbers of foreign members, and many explicitly think of themselves as international societies. The larger societies tend to be run much more like businesses, with large complements of staff responsible for day-to-day operations. The smaller societies typically have few or no staff, and run on the labor of volunteers.

Scientific societies serve many, many purposes. Selfishly, they are conduits for information and provide networking opportunities for their members. In the case of societies organized at the national level, especially the larger ones, they are often important lobbying organizations that make sure that governments regularly hear scientists’ perspectives on various issues. The good societies are communities whose conferences are gatherings of people with common interests, even when those interests are uncommon as in the case of the small, specialized societies.

That last point is the one I want to emphasize: Joining a scientific society, in the ideal case, is joining a community. A member of a scientific society, whether a student or a famous professor, is “one of us” to other members of that society. And there are benefits to being a member, in the full sense of the word, of a group.

Some of the benefits are obvious: Every professional society has some kind of regular newsletter. These can vary from very simply reproduced amateur newsletters to professional-looking newspapers or magazines. These newsletters typically contain news stories about what is going on in the field, society news, profiles of members, conference announcements and job ads. Modern scientific societies will also have mailing lists that are restricted to their members. These are typically used to disseminate more time-sensitive information including, again, job ads and conference announcements, but can contain a variety of other content, as is the case for the public mailing lists discussed in my last blog post. Scientific societies usually hold conferences, and members always get discounted conference fees. And of course, attending a society conference is an ideal way to meet other members of the society.

In fact, society conferences can be invaluable networking opportunities. The people you meet there may one day be in a position to offer you a job. Even if that doesn’t happen, being known within your field means that people who make decisions about your career, about things like scholarship applications or grants for example, are likely to know you. Now we try really hard to screen out our biases when we’re refereeing grants or papers, but the truth is that it’s much easier to be a harsh judge when we don’t know the person whose file we’re judging.

In addition, many scientific societies have mentorship programs, as well as a variety of professional development events, often during their conferences or in the days immediately preceding or following a conference. These events can be technical seminars (for example, chemical safety mini-courses run by chemical societies), or they can be oriented toward career building, such as workshops on job interviews. Either way, they can be extremely worthwhile for young scholars.

But, you might say, I’m not interested in a career in academia. Then you should think hard about which society you join, but you should still join a society. Find one that has many non-academic members. Chemical societies, for example, typically have many members from industry. Some societies, like the Canadian Applied and Industrial Mathematics Society, try really hard to bridge the academic-industrial divide, and could be expected to have a number of industrial members, or at the very least some programs intended to help connect those two worlds.

Hopefully, I have convinced you that you should join a scientific society. But how do you choose one? Do you join your national society or a larger American society, for instance? Low student membership fees and reciprocal membership arrangements, in which members of one national society can join another at a reduced fee, may make this a false dichotomy. However, money is tight for many students, so you may have to make an initial decision. Advice from your supervisor can be helpful here, but you should do a bit of thinking, too. What are your career objectives, and how can one society or another help you get there? What society do most of the people in your field, and in the geographic area where you would eventually like to work, belong to? What conferences do you want to attend? These are all factors you should consider. In the end, you are looking for a society in which you will feel comfortable, and find fellow travelers.

Beyond the society itself, the larger societies (and even some of the not-so-large ones) often have divisions to create smaller communities within the large community. For example, the Society for Industrial and Applied Mathematics (SIAM) has several very active activity groups. Often, the real community is found at the level of these divisions, which typically have their own mailing lists and conferences. Most societies with divisions will allow you to join one division at no extra cost as part of your membership. So if the really big societies seem intimidating to you, they need not be, provided they have a strong division in your area of interest.

The good news for students is that student membership fees are usually really low. Some societies, like SIAM, even allow their full members to nominate a certain number of students for free memberships. Scientific societies really want student members, because today’s student member is tomorrow’s full member.

So talk to your boss, and do a bit of research and thinking on your own. Join a society. It’s a small step towards building your career, but potentially a really pivotal one.

Useful mailing lists for mathematical biology

One of the toughest things to do as an academic is to keep informed about what is going on out there, in a professional sense. An earlier blog post addressed the issue of keeping up with the literature. But there are other things you need to know about: upcoming conferences and workshops, calls for papers, funding programs, new software, books published in the field, and of course, especially for younger scholars, Ph.D., postdoctoral, and job opportunities. So where do you find all of this stuff? A great place to start is to get on a few key mailing lists in your field. Yes, it’s old fashioned, but it’s a really effective way to have important announcements come to you. Perhaps in a few years, Twitter or other social media mechanisms will replace mailing lists. For now though, a lot of the people who have information you need are of my generation, and they’re sending their postings to mailing lists.

There’s a bit of vocabulary to learn to make effective use of mailing lists. Some mailing lists allow postings to flow directly to users as soon as they are received (or approved, in the case of moderated lists). Others function as digests, which means that contents are collected for a certain period of time (which can vary according to the list) and are then sent out in one larger email. Some lists offer the option of either getting postings immediately or as daily or weekly digests. Do look at the options when you subscribe to a mailing list.

I will be focusing particularly here on mailing lists for people in mathematical biology, since that is the community I most closely associate with. If you work in another field, ask your supervisor about mailing lists you should join. He or she should be a good resource person on this topic.

With that out of the way, here are some mailing lists I recommend for mathematical biologists:

  • SMB Digest: SMB Digest is a mailing list of the Society for Mathematical Biology. It is easily the most useful mailing list for mathematical biologists. It’s also highly unusual in that it’s a society mailing list that is open to non-members. (Most societies treat their mailing lists as a perk of membership. I will have more to say on joining scientific societies in a later blog post.) As a result of the SMB Digest being open to anyone, almost everyone will post items of interest to the community here. To join this mailing list, go to https://www.smb.org/smb-digest-community-forum-how-to/ for instructions. If you’re in mathematical biology, you simply must subscribe to this mailing list.
  • Non Linear Science Network Digest: Strictly speaking, this isn’t a mathematical biology mailing list, but many of us work on biological problems for which the appropriate methods come from nonlinear dynamics, so there is a lot of overlap between the audience for this list and the mathematical biology community. You can join this mailing list at http://www.maia.ub.es/cgi-bin/mailman/listinfo/nls-net.
  • NIMBIOS Newsletter: This one is a bit different. The other mailing lists mentioned above are intended to distribute information of general interest. The NIMBIOS Newsletter on the other hand is a publication of the National Institute for Mathematical and Biological Synthesis (hence the acronym) whose purpose is to publicize NIMBIOS activities and programs. This is however a very active institute with many interesting programs (visiting fellowships, postdoctoral fellowships, workshops, etc.), so I think it’s worthwhile being on this mailing list even if you have no direct plans and no immediate interest in visiting them. You can join this list here: http://www.nimbios.org/press/newsletter.
  • University of Lethbridge theoretical biology mailing list: This mailing list will only be of interest to people at or near the UofL. We use it to distribute information about seminars, courses, or other items of strictly local interest. If you want to join this list, go to http://listserv.uleth.ca/mailman/listinfo/theor-biol-l. The volume on this list is very low, although I always hope that more list members will share what’s going on in their area through this list.

If you know of other mailing lists that are useful for mathematical biologists, let me know and I may add them to this post.

Farewell, Oktay Sinanoğlu (1935–2015)

I’m a long-time fan of Oktay Sinanoğlu. I use the word “fan” quite deliberately: I don’t think there’s any other way to describe my relationship to the man. We’ve never met, or even exchanged emails. But I read some of his papers in graduate school and was immediately drawn in. I was therefore sad when I learned recently that he had died. One more scientific hero I’ll never meet…

Sinanoğlu had a long and productive career at Yale. Nevertheless, he was almost certainly better known in Turkey, where he became something of a national hero, than in the Western world. His papers covered a very wide cross-section of theoretical chemistry, including electronic structure, atomic clusters, solvent effects on chemical reactions, spectroscopy, automated generation of synthetic pathways, irreversible thermodynamics, dissipative structures, graph theoretical methods for studying the stability of reaction networks, and model reduction methods. It was the latter two topics that attracted my attention to Sinanoğlu when I was a graduate student. They intersected nicely with my interests at the time, which revolved around the dynamical systems approach to chemical kinetics.

My main research interest at the time was model reduction. Sinanoğlu, with his student Ariel Fernández, was among the first to consider the construction of attracting manifolds for reaction-diffusion systems.1,2 This is a difficult problem that is still a very active area of research. When I look back on the Fernández-Sinanoğlu papers on this topic, it seems to me that they anticipate later work on inertial manifolds.3 Because there weren’t many people following the field at the time, I don’t think that these papers are as well known as they deserve to be. Fernández and Sinanoğlu were just a bit ahead of their time. Had this work been published in the 1990s rather than the mid-1980s, I’m sure these papers would have received a great deal more attention.

Although I wasn’t working on these problems myself at the time, I became very interested in applications of graph theory in chemical kinetics while still a graduate student. It would be many years before I made any contributions to this topic myself, in association with my then-postdoc Maya Mincheva.4–6 Among the papers I read way back then were a pair written by Sinanoğlu in which chemical reaction networks were conceptualized as graphs.7,8 This allowed Sinanoğlu to enumerate all graphs corresponding to reactions with given numbers of reactions and species.7 A subsequent paper contained a conjecture about a topological feature of the graphs of chemical mechanisms capable of oscillations,8 thus attempting to tie together the structural features of his graphs and the dynamics generated by the rate equations. This is the theme we picked up many years later, although we followed a line of research initiated by Clarke9 and Ivanova10 rather than Sinanoğlu’s theory.

So, Oktay, thanks for inspiring a young graduate student. Rest in peace.

1A. Fernández and O. Sinanoğlu (1984) Global attractors and global stability for closed chemical systems. J. Math. Phys. 25, 406–409.
2A. Fernández and O. Sinanoğlu (1984) Locally attractive normal modes for chemical process. J. Math. Phys. 25, 2576–2581.
3A. N. Yannacopoulos, A. S. Tomlin, J. Brindley, J. H. Merkin and M. J. Pilling (1995) The use of algebraic sets in the approximation of inertial manifolds and lumping in chemical kinetic systems. Physica D 83, 421–449.
4M. Mincheva and M. R. Roussel (2006) A graph-theoretic method for detecting potential Turing bifurcations. J. Chem. Phys. 125, 204102.
5M. Mincheva and M. R. Roussel (2007) Graph-theoretical methods for the analysis of chemical and biochemical networks. I. Multistability and oscillations in ordinary differential equation models. J. Math. Biol. 55, 61–86.
6M. Mincheva and M. R. Roussel (2007) Graph-theoretical methods for the analysis of chemical and biochemical networks. II. Oscillations in networks with delays. J. Math. Biol. 55, 87–104.
7O. Sinanoğlu (1981) 1- and 2-topology of reaction networks. J. Math. Phys. 22, 1504–1512.
8O. Sinanoğlu (1993) Autocatalytic and other general networks for chemical mechanisms, pathways, and cycles: their systematic and topological generation. J. Math. Chem. 12, 319–363.
9B. L. Clarke (1974) Graph theoretic approach to the stability analysis of steady state chemical reaction networks. J. Chem. Phys. 60, 1481–1492.
10A. N. Ivanova (1979) Conditions for the uniqueness of the stationary states of kinetic systems, connected with the structures of their reaction mechanisms. 1. Kinet. Katal. 20, 1019–1023.

The most-cited work of all time

In my last blog post, I discussed the list of the 100 most-cited papers of all time compiled by Thomson-Reuters for Nature to celebrate the 50th anniversary of the Science Citation Index. In the same Nature article, there is a brief mention of a similar list compiled by Google based on their Google Scholar database. Unlike the Thomson-Reuters/Science Citation Index (SCI) list, the Google list includes books. This is partly a byproduct of the way the two databases are structured—Thomson-Reuters has separate databases for journals and books while Google has a single database that includes journal articles, books and “selected web pages”1—and, I suspect, partly a conscious choice by the Nature editors to focus on the most-cited papers. Certainly, the article focuses on the SCI list rather than the Google list which, as mentioned above, is different in composition. This provides us with an interesting opportunity to think a little harder about why things get cited and how we go about the business of counting citations and thereby trying to measure impact.

The most striking thing in the Google list is the number of books among the most highly cited work. 64 of the 100 most highly cited works are books, according to Google. Many of these books are technique-oriented, as one might expect from the kinds of papers that made the SCI list discussed in my last post. For example, the most highly cited book on Google’s list, and 4th most cited work in the overall list, is Molecular Cloning: A Laboratory Manual by Sambrook, Fritsch and Maniatis. The same book, but with a different permutation of authors (Maniatis, Fritsch and Sambrook), also shows up as number 15 on Google’s list. How can this be? This book has gone through a number of editions, with changing authorship. The book at #4 on Google’s list is the second edition, while #15 is the first edition. This highlights one of the key difficulties in compiling such a list: Books are often inconsistently cited, and changing editions pose a challenge in terms of combining or not combining citations. Since different editions with a simple permutation of authorship are actually an evolution of the same book, it seems to me that we should combine the citation counts for entries 4 and 15 (and later editions that don’t show up on this list as well). That would vault Molecular Cloning to #1 on Google’s list. If we take citations as a measure of impact, this book would be the most important scientific work of all time (so far). However, I think we can all agree that there is something funny about that statement. The number of citations indicates that this is clearly a very useful book, but it’s a compendium of methods developed by many, many other groups. It is highly cited because of its convenience. The original papers are not cited as often as this book (at least by Google’s count), but clearly it’s the original scientific work used by thousands of labs around the world that has had the impact, not this manual. So here we have a work that is very highly cited (and therefore, by any reasonable definition, important) but where it’s obvious that the very large citation count is not measuring scientific impact so much as utility as a reference.

The same sort of argument could be applied to scientific papers. Take, for example, the density functional theory papers discussed in my previous post. I would argue that the two papers by Walter Kohn in the SCI list have had more impact than any of the other DFT papers in this list since they enabled all the subsequent work developing the theory into practical methods. But they are not cited as often as some of the papers that describe functionals used in quantum chemical calculations. Citations therefore measure something—utility?—but it isn’t impact as I would understand the term.

There are some books on Google’s list that do describe original contributions to the literature. Among other things, there are those I would characterize as “big idea” books, in which new, influential ideas were first described. Number 7 on Google’s list is Thomas Kuhn’s The Structure of Scientific Revolutions. This is not a book that contains practical advice on carrying out particular experiments or particular types of calculations. It’s a contribution to the history and philosophy of science. But Kuhn’s ideas about the way in which science progresses have really struck a chord, so this book is cited a lot, across a wide range of fields, most of which have nothing to do with the history or philosophy of science.

The Google list also contains works from fields outside of hard-core science, which we don’t see in the Science Citation Index list. Thus, number 6 on Google’s list is Case Study Research: Design and Methods by Robert K. Yin, a book covering research methods used mostly in business studies. The Google list includes a number of other works from business studies, as well as from the social sciences. It’s sometimes useful to be reminded that “research” and “scientific research” are not synonymous.

But this is a blog about science, so back to science we go. An interesting question we could ask is how the books on Google’s list would have fared if they had been included in the Thomson-Reuters effort. To try to answer this question, I looked at another highly cited book, #5 in the Google list, Numerical Recipes by Press, Teukolsky, Vetterling and Flannery. Looking up citations to books in the Science Citation Index is not trivial. Because books don’t have records in the SCI database, there is no standard format for the citation extracted from citing papers. Moreover, people often make mistakes in formatting citations. Authors are left out, or the order of authorship is permuted. Additionally, people often cite a particular chapter or page when citing a book, and each of these specific citations is treated as a citation of a different work in the database. Anyhow, here’s what I did: I searched for citations to works by “Press W*” entitled “Num*”. This generated a list of 4761 cited works. This large number of distinct hits to the Numerical Recipes books makes it impossible to complete the search for citing articles. All we can tell is that there are more than 4761 citations to Numerical Recipes in the Web of Science database. In fact, the number must be much larger since it’s plain to see even from the small sample I looked at that some of the variations are cited dozens or even hundreds of times. But an accurate method of counting them in the Web of Science evades me.

Numerical Recipes is a bad case. There are many editions with slightly different titles (“in Fortran”, “in C”, etc.), the subtitle is sometimes included (“The Art of Scientific Computing”), multiple authors, and so on. Maybe if we try a book with one author and a limited number of editions? I then tried to do a citation search for Kuhn’s The Structure of Scientific Revolutions. Here, we find a different problem: The results are highly sensitive to details such as whether or not we include the word “The” from the title. And, although there are far fewer hits than for Numerical Recipes, there are still hundreds of them to sift through. Again, I’ve had to admit defeat: There does not appear to be a simple way to count citations to heavily cited books in the Web of Science.

Of course, citation counting is a tricky business at the best of times, and the problem afflicts both the Thomson-Reuters and Google Scholar databases. Errors in citations, which are fairly frequent, may deflate the citation count of a paper unless one is very careful about structuring the search. But beyond that, some papers are just hard to chase down in the database. Take the first of Claude Shannon’s two classic 1948 papers on information theory, number 9 on the Google list and nowhere to be found on the SCI list. It’s actually very difficult to find this paper in the Google database. I have found many lightly cited variants on this citation, but the version that Google Scholar reports as having been highly cited is actually a 2001 corrected reprint in the ACM journal Mobile Computing and Communications Review. It’s not clear to me that this is correct—has this paper really been cited more often than the 1948 original?—but then I’m not sure how Google’s database is structured. For the record, the Web of Science reports that the 1948 paper has been cited 9847 times, while the 2001 reprint has been cited 278 times. Quirks of a database can make the apparently simple act of counting citations tricky, all the more so for highly cited papers.

We all wish that we could quantify scientific output so that we could make better decisions about funding, prizes, and so on. It would sure make all of our lives much easier if this were possible. However, the problems that plague the apparently simple task of trying to round up and interpret a list of the most cited work—high citation rates for work that provides a convenient reference for an idea or technique but is not particularly original (books on methods, review papers), inconsistent database entries and citation errors—also affect citation counts for work that has accumulated a more normal number of citations. None of this is to deny the importance of a good book or review paper, nor are my comments intended to mean that there isn’t a clear difference in impact between two research papers in the same field whose citation counts differ by an order of magnitude. But there are enough subtleties in counting citations and in interpreting the results that I would not want to try to rank papers or scientists on this basis.

1R. Vine (2006) Google Scholar. J. Med. Libr. Assoc. 94, 97–99.

The top 100 most-cited papers of all time

I wrote earlier about the 50th anniversary of the Science Citation Index. Recently, Nature got together with Thomson-Reuters, the publishers of the Science Citation Index (now usually known as the Web of Science), to come up with a list of the 100 most-cited papers of all time.1 It’s an interesting list, which I encourage you to take a look at. Let’s face it: top-100 lists are always fun. Who is in there? Who is not? The Nature article provides a few reflections on this. For my part, I’m going to look at what this list tells us about citation patterns in different areas of science, focusing particularly on an area of science I know well, namely density functional theory, and one with which I have a tangential acquaintance, NMR.

There are, as the Nature article pointed out, a large number of papers in the top 100 from the field of density-functional theory (DFT). I may have missed some, but here are the ones I noticed: Lee, Yang and Parr (1988)2 at #7, Becke (1993)3 at #8, Perdew, Burke and Ernzerhof (1996)4 at #16, Becke (1988)5 at #25, Kohn and Sham (1965)6 at #34, Hohenberg and Kohn (1964)7 at #39, Perdew and Wang (1992)8 at #93, and Vosko, Wilk and Nusair (1980)9 at #96.

So what is DFT, anyway? One of the great problems in electronic structure calculations for molecules is electron correlation. Electrons repel, so they tend to stay away from each other. Classic methods of electronic structure calculation don’t properly take electron correlation into account. There are ways to put electron correlation back in after the fact, but they’re either not very accurate, or they take a huge amount of computing. Another problem arises because of exchange, a strange quantum mechanical effect that causes identical electrons with the same spin to stay away from each other more so than simple electrostatics would dictate (i.e. more than would be the case for electrons with opposite spin). DFT is based on theory developed by Kohn in the 1960s (in papers #34 and #39 from Nature‘s list) that essentially states that there is a functional of the electron density that describes electron correlation and the exchange interaction exactly. Modern DFT is based on approximating this functional (usually using separate correlation and exchange parts) semi-empirically. Using good DFT exchange and correlation functionals allows us to do very accurate electronic structure calculations much more quickly than is the case with older methods. The one catch is that we don’t really know what the exchange and correlation functionals should be, so there’s a lot of work to be done coming up with good functionals and validating them. Nevertheless, the current crop of functionals does a pretty good job in many cases of chemical interest.
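For readers who want a little more detail, here is the standard Kohn-Sham decomposition; this is textbook material, not anything specific to the papers above. The total electronic energy is written as a functional of the electron density:

E[\rho] = T_s[\rho] + \int v_{\mathrm{ext}}(\mathbf{r})\,\rho(\mathbf{r})\,d\mathbf{r} + J[\rho] + E_{xc}[\rho]

Here T_s[\rho] is the kinetic energy of a non-interacting reference system, the integral is the interaction of the density with the external (nuclear) potential, J[\rho] is the classical Coulomb repulsion of the density with itself, and everything that cannot be written down exactly, namely exchange and correlation, is collected in E_{xc}[\rho]. The functional papers in the top-100 list are, in effect, proposed approximations to this one term.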

To understand the DFT citation patterns a bit better, I used the Web of Science to count up the number of times each of these papers was cited with one of the others. Here’s what I found:

          LYP 88    Becke 93  PBE 96    Becke 88  KS 65     HK 64     PW 92     VWN 80
LYP 88    48653     33303     3498      17608     3305      2917      2114      5320
Becke 93            48041     3266      11118     2718      2499      2469      4284
PBE 96                        38281     2948      5405      5040      2576      1647
Becke 88                                27370     2734      2332      2246      5821
KS 65                                             23840     15129     2028      1955
HK 64                                                       22608     1750      1656
PW 92                                                                 13173     1260
VWN 80                                                                          12862

Hopefully the code I’m using here is clear enough: LYP 88, for example, is Lee, Yang and Parr (1988). The entries on the diagonal are the total numbers of citations to the corresponding papers. This matrix is necessarily symmetric about its diagonal, so I didn’t fill in the entries below the diagonal. Note that the total citations for each paper differ somewhat from those reported in Nature‘s spreadsheet because I performed my analysis at a later point in time, and these papers continue to accumulate citations at an astonishing rate.

A few numbers jump out from this table: The top two DFT papers, Lee, Yang and Parr (1988) and Becke (1993), are cited together with very high frequency: 68% of the papers citing Lee, Yang and Parr (1988) also cite Becke (1993). Although cited together slightly less often, Becke (1988) is also frequently co-cited with Lee, Yang and Parr (1988): 36% of the papers citing the latter also cite Becke (1988). Now if we ask how many of the papers citing Lee, Yang and Parr (1988) also cite at least one of the Becke papers, we find that an astonishing 85% do. This is, of course, not a random occurrence. One of the most popular exchange-correlation functionals around, B3LYP, combines Becke’s 1988 exchange functional, which was further studied in his 1993 paper, with the Lee, Yang and Parr correlation functional. People who use the B3LYP functional in calculations will usually cite Lee, Yang and Parr (1988) along with at least one of the Becke papers. So if one of these papers were to appear in the top-100 list, it was likely that all three would, as they do. The appearance of these papers in the top-100 list is therefore a testament to the heavy use made of the exchange-correlation functionals developed by these authors in the chemical literature. In fact, all of the DFT papers in the top-100 list describe functionals that are heavily used in applications, except for the Kohn papers, which provided the underlying theory.
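If you want to check these percentages yourself, here is a minimal Matlab sketch (Matlab purely for continuity with the first post; any language would do), with the table above re-entered as an upper-triangular matrix:

% Upper-triangular co-citation matrix: C(i,i) is the total citation count
% of paper i, and C(i,j) for i < j is the number of papers citing both i
% and j. Paper order: LYP 88, Becke 93, PBE 96, Becke 88, KS 65, HK 64,
% PW 92, VWN 80.
C = [48653 33303  3498 17608  3305  2917  2114  5320
         0 48041  3266 11118  2718  2499  2469  4284
         0     0 38281  2948  5405  5040  2576  1647
         0     0     0 27370  2734  2332  2246  5821
         0     0     0     0 23840 15129  2028  1955
         0     0     0     0     0 22608  1750  1656
         0     0     0     0     0     0 13173  1260
         0     0     0     0     0     0     0 12862];
% Fraction of the papers citing paper i that also cite paper j:
cofrac = @(i,j) C(min(i,j),max(i,j))/C(i,i);
cofrac(1,2)   % LYP 88 co-cited with Becke 93: about 0.68
cofrac(1,4)   % LYP 88 co-cited with Becke 88: about 0.36

Note that the 85% figure quoted above cannot be recovered from pairwise counts alone, since it involves papers citing at least one of the two Becke papers; that requires a three-way query of the database.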

One of the points made by the authors of the Nature article is that papers that describe methods get cited much more than papers that introduce new ideas into science. So why do the Kohn papers appear in this list? I would argue that this is due to a quirk of citation among people who do DFT calculations. The vast majority of citations to these papers are by people who do DFT calculations, not by people further developing the Hohenberg-Kohn-Sham theory. To fully understand how strange this is, we have to consider that the overwhelming majority of people doing DFT calculations and citing these papers use software written by someone else, usually commercial software like Gaussian. Ordinary users of a computational method don’t usually “dig down” to the theory layer in their citations in this way. For example, the vast majority of modern quantum chemical calculations (including most DFT calculations) are based on Roothaan’s classic work on self-consistent-field calculations.10 Those two papers have been cited, respectively, 4535 and 1828 times. This is an extremely high citation rate, but it represents a tiny fraction of the literature reporting calculations based on Roothaan’s algorithms. So it’s a bit strange that Kohn’s work gets cited by DFT users at this high rate, particularly since we can find other foundational papers in quantum chemistry, such as Roothaan’s, that are not as routinely cited.

Now let’s contrast the citation record of DFT with that of NMR. NMR is nuclear magnetic resonance. NMR spectroscopy is used on a daily basis by every synthetic chemistry group in the world, and by many physical and analytical chemistry laboratories as well. Although they will typically back up NMR measurements with other techniques, NMR is how chemists identify the compounds they have made, and determine their structures. One would think that we would see papers that describe fundamental NMR techniques or popular experiments make this list. They don’t. There is a single NMR-related paper in the list, one that describes a software program for analyzing both crystallography and NMR data, showing up at #69. That’s it. So why is that? It’s certainly not that there are more DFT papers than there are papers that use NMR. In fact the reverse is certainly true. However, when experiments become sufficiently common, chemists stop citing their original sources. I was just looking at a colleague’s paper in which he mentioned six different NMR experiments in addition to the usual single-nucleus spectra. A literature reference was given for only one of these experiments, presumably because he felt the others were sufficiently well-known that they didn’t need references. The equivalent practice in DFT would be not to cite anything when using the B3LYP functional, on the basis that everybody knows this functional. That’s quite a difference in citation practices between two different areas of chemistry! And the fascinating thing is that these two fields have overlapping membership: There are lots of synthetic chemists who do DFT calculations to support their experimental work. And for some reason, they behave differently when describing DFT methods than when describing NMR methods.

To understand the vast difference in citation practices between these two areas, let’s look at a specific example. In many ways, the history of two-dimensional NMR experiments, in which signals are spread along a second dimension that encodes additional molecular information, parallels that of DFT: the methods were developed at about the same time, hardware that could carry them out routinely became available to ordinary chemists around the same time in both fields, and both opened up what could be done in their respective fields. The first two-dimensional NMR experiment, COSY, was proposed in 1971 by Jean Jeener.11 It’s not entirely trivial to hunt down citations to papers in conference proceedings in the Web of Science because they are not cited in any consistent format. However, after doing a bit of work, and including the reprinting of these lecture notes in a collection a few decades later, I found approximately 352 citations to Jeener’s epoch-making paper. Compare that to the 23840 citations to the Kohn-Sham (1965) paper. One could argue that Jeener’s paper was published in an obscure venue, and that this depressed the number of citations to it, which is certainly plausible. Jeener’s proposal was implemented by Aue, Bartholdi and Ernst in 1976.12 That paper has been cited 2919 times, which is a far cry from the number of citations accumulated by the Kohn papers, or by the “applied” DFT papers in which practical functionals are described. Kohn shared the 1998 Nobel Prize in Chemistry. Ernst was awarded the 1991 Nobel Prize in Chemistry. There are a lot of ways in which the two contributions are comparable. But not in citation counts. And clearly, it’s not a matter of the popularity of the methods: I used the ACS journal web site to see how many papers in the Journal of Organic Chemistry mentioned the COSY experiment. The Journal of Organic Chemistry is a journal that, by its nature, contains mostly papers reporting the synthesis and characterization of compounds, so it’s a good place to gauge the extent to which an experimental method is used. In that one journal alone, 6351 papers mention COSY. To be fair, some of these references will be to descendants of the original COSY experiment (of which there are many), but the very large number of COSY papers and the relatively small number of citations to the early papers on COSY still speaks to wildly different citation cultures between NMR and DFT practitioners.

None of this is intended to denigrate the work of the excellent scientists whose papers have made the top-100 list. They clearly deserve a very large pat on the back. However, it does show that we have to be extraordinarily careful in comparing citation rates even between very closely related fields. And these rates will of course also affect citation-based metrics like the h-index, perhaps not in extreme cases like the highly cited papers mentioned here, but certainly in the case of authors whose papers are well cited, if not insanely well cited.

In the interests of full disclosure: Axel Becke, whose name features so prominently in the top-100 list and in this blog post, supervised my senior research project when I was an undergraduate student at Queen’s. My first scientific paper was coauthored with Axel.13 In fact, I may have benefited from the higher citation rates in DFT as this paper is by far my most cited paper. I sometimes joke that my career has all been downhill since this very first scientific contribution. But to figure out if that was true, we would have to take the citation practices of the various areas I’ve worked in into account…

1R. van Noorden, B. Maher and R. Nuzzo (2014) The top 100 papers. Nature 514, 550–553.

2C. Lee, W. Yang and R. G. Parr (1988) Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density. Phys. Rev. B 37, 785–789.

3A. D. Becke (1993) Density-functional thermochemistry. III. The role of exact exchange. J. Chem. Phys. 98, 5648–5652.

4J. P. Perdew, K. Burke and M. Ernzerhof (1996) Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868.

5A. D. Becke (1988) Density-functional exchange-energy approximation with correct asymptotic behaviour. Phys. Rev. A 38, 3098–3100.

6W. Kohn and L. J. Sham (1965) Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133–A1138.

7P. Hohenberg and W. Kohn (1964) Inhomogeneous electron gas. Phys. Rev. 136, B864–B871.

8J. P. Perdew and Y. Wang (1992) Accurate and simple analytic representation of the electron-gas correlation-energy. Phys. Rev. B 45, 13244–13249.

9S. H. Vosko, L. Wilk and M. Nusair (1980) Accurate spin-dependent electron liquid correlation energies for local spin-density calculations — a critical analysis. Can. J. Phys. 58, 1200–1211.

10C. C. J. Roothaan (1951) New developments in molecular orbital theory. Rev. Mod. Phys. 23, 69–89; (1960) Self-consistent field theory for open shells of electronic systems. Rev. Mod. Phys. 32, 179–185.

11J. Jeener (1971) Lecture notes from the Ampère Summer School, Basko Polje, Yugoslavia. Reprinted in NMR and More in Honour of Anatole Abragam, Eds. M. Goldman and M. Porneuf, Les Éditions de Physique (1994).

12W. P. Aue, E. Bartholdi and R. R. Ernst (1976) Two-dimensional spectroscopy. Application to nuclear magnetic resonance. J. Chem. Phys. 64, 2229–2246.

13A. D. Becke and M. R. Roussel (1989) Exchange holes in inhomogeneous systems: A coordinate-space model. Phys. Rev. A 39, 3761–3767.

How to not find a graduate supervisor

Several times a week, I receive emails from prospective graduate students. The overwhelming majority of these emails get a boilerplate “no thanks” response from me. (I have actually automated these responses so I can send them with just a few quick mouse clicks.) Most of my colleagues don’t even bother to respond. Why? Because the emails I (and my colleagues) get almost always look like mass emails sent to (probably) hundreds of scientists worldwide without any indication that the student knows what I do or, worse, with clear indications that they don’t know what I do.

To those of you sending these emails: If you don’t want to go to graduate school, stop reading this post and keep sending those emails. My colleagues and I will keep deleting them.

Here’s what a typical one of these emails looks like, with my comments in square brackets:

Dear Professor, [What, you couldn’t be bothered to find out my name?]

I have read your website, and I am really excited about your research. [It would be a nice touch if you actually included some text in your email that showed that you knew what that research was.] I would like to join your group as a Ph.D. student starting in September.

I have an M.Sc. from U. of Wherever, where I completed a project in organic synthesis with Professor Whoever. [First you tell me you looked at my website. Now you tell me that you have experience in organic synthesis which is completely irrelevant to me. What you’re really saying is that you have not read my website and have no idea what I do. The email started off badly. Now I’m annoyed at you for wasting my time.] I think this background prepares me to contribute to your research.

I look forward to a positive response.

Sincerely,

A. Student

Look, students. It’s never been easier to figure out what a professor does. We all have websites that contain detailed descriptions of our research because we all want to find good graduate students. All you have to do is to look at those websites and write emails that contain specific details relating to a particular professor’s interests. Sending out several hundred generic emails won’t get you a response even from people who might otherwise be interested. If you’re too lazy to look at my website and to write an email that has been customized to my interests, I’m not likely to take your email very seriously.

If you like rejection, go ahead and send those generic emails. If you actually want to go to graduate school, do some research, write a few targeted emails to people who are actually in your area of interest, explain how you’re excited about their research (mentioning actual details of what they do), and explain why you think your background is good preparation for work in that person’s lab. Professors actually answer emails that have been written to them, and not just written to a professor. So if you’re not getting answers to your emails to professors, the problem isn’t the professors. It’s your emails.

The Science Citation Index turns 50

The Science Citation Index, which many of you will now know as the Web of Science, turned 50 this year.1 Hopefully, you are already familiar with this tool, which allows you to, in essence, search forward in time from a particular paper. This is an important way to search the literature, and has a myriad of uses, good and bad.

I first discovered the Science Citation Index when I was a student in the 1990s. Back then, the Science Citation Index existed as a paper index. Every year, a new set of volumes would come out that contained all the citations indexed in that year. Every five years, a five-year cumulation would come out that would replace the annual volumes. You would look up a cited paper in these volumes by the lead author. If the paper had been cited in the range of time covered by a volume set, that paper would appear under this author’s name with the citing papers listed under it. If the paper was more than a few years old, you often had to go through several volumes to pick up all the citations, but it was still well worth doing when you wanted to track what had happened to an idea over time. This process is described in some detail in one of Eugene Garfield’s articles.

Of course, the Science Citation Index has other uses. It occurred to me fairly early on that I could use it to check if people were citing my papers. This was often directly useful by making me aware of related work that might not otherwise have come to my attention. Of course, there’s also an ego-stroking aspect to this exercise, at least if your papers get cited. My own work took a while to catch on, so citations were few and far between for several years.

Over the years, paper gave way to CD-ROMs, and eventually to a web interface. Computer-based tools allow for more sophisticated analyses, but the core functionality is the same: the ability to track ideas through time, and to discover inter-related papers. One of the most intriguing (and under-used) features of the web system is the “View Related Records” link, which shows papers that share citations with a chosen paper. If people are working on related problems but aren’t aware of each other (which happens quite a lot) and are therefore not citing each other, this is often a useful way to discover another branch of research on a given problem, since the two branches are likely to cite many of the same references. If you’re a young scientist starting out in a field, I would strongly suggest that you use this feature starting from a few papers that are key to your research.

What is perhaps the most remarkable aspect of the Science Citation Index is that it was largely the brainchild of one person, Eugene Garfield. No idea is without its precedents, but there is no doubt that Garfield was the prime mover behind the Science Citation Index. We all owe Dr Garfield a huge debt.

So happy 50th Science Citation Index, and many happy returns!

1Eugene Garfield (1964) “Science Citation Index”—A New Dimension in Indexing. Science 144, 649–654.