Sharing data: a step forward

You would think that scientists would be eager to share data. After all, the myth of science that is taught to students is that we build on each other’s work, so of course if we have an interesting data set, we will let anyone have it who wants it, right?

It turns out that the truth is somewhat other than we would like it to be. There are both bad and worse reasons why data is not routinely shared. Probably the worst reason of all is wanting to sit on the data so that one extracts the maximum benefit from the data set while shutting out others. A variation on this theme is only allowing people access to your data if they will agree to make you a coauthor. I once collaborated with a scientist who wanted to use a crystal structure obtained by another lab. (I will leave the names out of it since they’re not relevant.) This was in the mid-1990s when the requirement to deposit structures with the Protein Data Bank (PDB) prior to publication was not yet universal. She was told by her colleagues (and I use the word loosely here) that she could only have their coordinate files if she agreed to include them as coauthors on any paper in which those coordinates were used for the following five years. My colleague and I were astonished by this. Beyond providing coordinates from an already-published structure, they would have made no intellectual contribution to her work, and yet they wanted to be treated as coauthors for an extended period of time. This is no longer possible with protein structures due to the now-universal requirement to deposit structures at the PDB as a condition of publication, but clearly people who hold data sometimes feel this gives them power they can use to further their careers. This is just wrong.

A not-so-good reason for not sharing data is that doing so takes time. A data set that may be perfectly OK for your use may not be suitable for sharing as is. I won’t get into issues of confidentiality with human subjects because I’m not an expert in this area, but clearly anonymizing medical data prior to sharing is important, and then there’s the tricky issue of consent: If the participants in a study did not explicitly agree to have their data used in other studies, is it OK to share the data set with others, even with suitable safeguards in place to protect the privacy of the study participants? Even for data not involving human subjects, sharing data takes time because you have to make sure you provide enough information about the data set for users to be able to make sense of it. This includes (obviously) a full description of what the various data fields represent, but also the conditions under which the data were obtained, any post-processing of the data, etc. Many scientists opt to just keep their data to themselves rather than generating all the necessary metadata. This situation is made worse by the fact that one gets very little credit for putting together a usable data set: It doesn’t count as a publication, so it won’t help a student land a scholarship, tenure and promotion committees are unlikely to give a data set much weight in their deliberations, and granting agencies won’t give you a grant solely because you generate high-quality reusable data.

A significant step forward has been taken with the launch of a new online journal by the Nature Publishing Group entitled Scientific Data. (Incidentally, I learned about this new journal from an article in The Scientist.) This journal is dedicated to the publication of data sets with proper metadata so that they can be used widely. Hopefully, the clout of the Nature Publishing Group will make the various bodies that make decisions about what scientific activities are valued pay attention, and will lead to an increase in the sharing of data sets.

In case you’re wondering whether I put my money where my mouth is: My web site includes a small section of data sets that I have generated that others might find of interest. Could I do more? Sure. Making it worth my while to do so via a journal like Scientific Data might be just the push I need.

Matlab and integers

In computer programming, when a calculation involves variables of different types, we say that these calculations are performed in mixed-mode arithmetic. In most common typed programming languages, mixed-mode arithmetic involves promotion of variables in such a way as not to lose precision. The most common examples of mixed-mode arithmetic would be calculations involving integer and floating-point variables. As a rule, the integer variables are converted to floating-point values before the arithmetic operations are carried out. Consider for example the following C program:

#include <stdio.h>

int main(void)
{
    int two = 2;
    double x = 2.2;
    double y = two*x;   /* two is promoted to double before the multiplication */
    printf("%f\n",y);
    return 0;
}

When computing y, the integer two is converted to a double-precision value in order not to degrade the precision of the mixed-mode multiplication with the double-precision value x. The multiplication is then carried out, and the computer prints out 4.400000 (that is, 4.4), as expected. Fortran functions in an essentially identical way. Note that the promotion rules are independent of the variable to which the result will be assigned. If necessary, the result of the computation is converted to another type before storage.
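To make the last point concrete, here is a minimal sketch in C. The two function names are mine, chosen for illustration only. Both functions perform the same double-precision multiplication; they differ only in the type of the variable that receives the result.

```c
/* n * x is evaluated in double precision whatever the destination type. */
double product_as_double(int n, double x)
{
    return n * x;
}

/* Same arithmetic; only the final store narrows the result.
   Converting a double to int in C discards the fractional part. */
int product_as_int(int n, double x)
{
    int result = n * x;   /* computed as a double, then truncated on storage */
    return result;
}
```

With these definitions, product_as_double(3, 2.5) returns 7.5 and product_as_int(3, 2.5) returns 7. Had the destination type dictated the arithmetic, x would have been truncated to 2 first and the integer version would have returned 6.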

Of all the computer languages I have worked with, Matlab is the first one that goes the other way. Consider the following Matlab code:

two = int8(2);
x = 2.2;
y = two*x

Just assigning a value to x, whether or not it has a decimal point, implicitly makes x a double, as reported by class(x). (Most of us who started out with Fortran eventually came to the conclusion that implicit typing is a great evil. Too bad the Matlab folks don’t feel the same way.) If you run this code in Matlab, you will get the following result: y = 4. If you query Matlab about the class of y, it will return int8. Not what most of us would expect…

This unusual behavior means that you have to be very careful when using integers in Matlab. It is very easy to write a line of mixed-mode arithmetic that won’t trigger any warnings, but that will generate results different from those expected by the programmer.
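For readers who think more easily in C, the following sketch mimics what Matlab appears to do when an int8 is multiplied by a double. It rests on my understanding of Matlab's documented behavior: the arithmetic is done in double precision, and the result is then converted back to the integer class by rounding to the nearest integer and saturating at the limits of the type. The function name is hypothetical and the code is an emulation for illustration, not anything taken from Matlab itself.

```c
#include <stdint.h>

/* Hypothetical helper: an emulation of Matlab's int8 * double rule. */
int8_t matlab_like_int8_times_double(int8_t a, double x)
{
    double r = (double)a * x;            /* arithmetic in double precision */

    if (r != r) return 0;                /* NaN: assumed to be stored as 0 */
    if (r >= INT8_MAX) return INT8_MAX;  /* saturate at the top of the range */
    if (r <= INT8_MIN) return INT8_MIN;  /* saturate at the bottom */

    int t = (int)r;                      /* truncate toward zero */
    double frac = r - t;                 /* signed fractional part */
    if (frac >= 0.5) t += 1;             /* round half away from zero */
    else if (frac <= -0.5) t -= 1;
    return (int8_t)t;
}
```

Called with (2, 2.2), this returns 4, matching the Matlab result above. Called with (100, 2.0), it returns 127 rather than 200, which illustrates a second way in which integer results can differ silently from what a programmer expects.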

So what is the rational response to Matlab’s unusual mixed-mode arithmetic rules? One might be tempted to simply avoid integers, but the fact is that there are good reasons to use integer variables from time to time. Here are a few guidelines that might help:

  1. Minimize the use of integers in Matlab programs. You probably don’t need an integer variable for a simple accumulator (e.g. a variable that counts up how many times something happened). (Note the use of the word “probably” in the previous sentence. I know that there are clever people who will cook up counterexamples. As with all things, you have to think a bit about what your data will look like. Since we’re talking about scientific programming here, we probably don’t need to worry about malicious users, although simple mistakes in the input are always a possibility and you do have to plan for those. But that’s a matter of testing the inputs to the program, and not what happens when reasonable inputs have passed through those filters.) You may not need an integer array to store data that, conceptually, ought to be integer in nature, provided you’re careful not to do things that are risky with floating-point numbers, such as comparing them for equality. (For a brief introduction to some of the issues you can run into with floating-point numbers, see http://floating-point-gui.de.) I have to say right up front that this is contrary to the advice I would give for any other computer language I know. We would normally think of appropriate typing of variables as a smart, safe practice that prevents, for example, variables representing whole numbers from accidentally being corrupted into non-integer values. Here though, the risk of numbers being unintentionally rounded to integers during a mixed-mode calculation is the greater risk.
  2. Make a habit of searching for instances of any integer variables in your Matlab programs. Be aware that code such as
    x = int16(1);
    y = x;
    

    implicitly makes y of class int16. (Again with the implicit typing!) In fact,

    x = int16(1);
    y(1) = x;
    y(2) = 4.8
    

    generates the output

    y =
    
          1      5
    

    since the entire array y is typed int16 by the assignment of the first element, and the floating-point value 4.8 is converted to an integer (by rounding) prior to storage.

    Comment any implicitly typed integer variables to make their types clear to anyone reading the code. Look for places where integer variables, whether implicit or explicit, are involved in mixed-mode arithmetic and explicitly convert the integers to double in the arithmetic expressions (unless you require some other behavior, of course).

  3. Watch for whole-number outputs to programs whose outputs are not expected to be whole numbers. This can be a sign that a mixed-mode calculation has resulted in an integer result.
  4. Do use integers for tags. Tags, in the sense I intend here, are numbers we associate with each of several possibilities, to be used perhaps in a case statement, or to symbolically represent positions in an array associated with particular objects. Here is a simple-minded example, which we might use in a program that handled employee data in a restaurant:
    Unclassified = int8(0);
    Cook = int8(1);
    ShiftSupervisor = int8(2);
    

    The idea is that we would have a field in the employee record that contained his or her job classification. Rather than store a word (where maintaining consistent spelling might be a long-term maintenance headache), we store a number that represents the job classification, and rather than using those numbers explicitly anywhere, we use the tags defined above, e.g.

    if (employee(i).jobclass == Cook)
    ...
    

    This makes the code easy to read and easy to maintain. Since we are going to compare tags from time to time for equality (as in the above code snippet), we need a type where these comparisons can be made exactly, i.e. an integer type.
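The same idea carries over to other languages. Here is a rough C sketch of the restaurant tags, with a small function added to show why equality tests on floating-point values are not a safe substitute. All of the names are hypothetical, and the floating-point result assumes ordinary IEEE 754 double-precision arithmetic.

```c
/* Hypothetical job-classification tags, mirroring the Matlab tags above. */
enum JobClass { UNCLASSIFIED = 0, COOK = 1, SHIFT_SUPERVISOR = 2 };

/* Integer tags compare exactly, so a switch (or ==) is always safe. */
const char *job_title(enum JobClass jobclass)
{
    switch (jobclass) {
    case COOK:             return "Cook";
    case SHIFT_SUPERVISOR: return "Shift supervisor";
    default:               return "Unclassified";
    }
}

/* By contrast, floating-point values that "should" be equal may not be.
   Returns 1 if 0.1 + 0.2 compares equal to 0.3, and 0 otherwise. */
int float_tags_match(void)
{
    double a = 0.1, b = 0.2;
    return (a + b) == 0.3;
}
```

Here job_title(COOK) returns "Cook", whereas float_tags_match() returns 0 in IEEE double precision, because 0.1 + 0.2 and 0.3 are stored as slightly different binary values.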

I generally like Matlab because it provides a large set of functions to carry out just about every common numerical procedure you might run across, which is great from the point of view of not having to reinvent the wheel. It’s also a very easy programming environment for students to learn. Here though, we have run into two bad (in my opinion) language design decisions:

  1. Mixed-mode arithmetic rules that are likely to generate hard-to-find bugs because they violate most programmers’ expectations.
  2. Implicit typing, which was a common source of bugs in Fortran before the IMPLICIT NONE declaration was made standard.

Both of these design decisions are built into Matlab and it would be hard to fix them without breaking a lot of existing code. I guess we’ll all have to learn to be extremely careful with integers in Matlab.

How to read a scientific paper

OK, so you’ve identified some papers that you should read (or just want to read). Now what? A scientific paper isn’t like a novel. You can’t usually make one quick pass through it and expect to get all the details. However, you may not need to. In fact, you may not need to read most of it at all. How do you decide where to focus your reading? That is the question I will tackle in this blog post.

Why are you reading this paper?

That’s the very first question you need to answer. You will read a paper very differently depending on what you are trying to get from it. Here are some reasons for reading a paper, and some notes on how that might affect your approach to the paper:

  • You’re “reading around” your research topic, to get some perspective. If that’s the case, you might be able to skip a lot of the technical stuff and focus on the parts of the paper (introduction, discussion, conclusions) where the authors explain what they have learned and how it fits into the bigger picture.
  • You’re trying to look up a particular piece of information. Depending on the nature of that piece of information, you might be able to just quickly scan the paper for it. In the end, you might look at the figures and tables, and maybe read a few paragraphs where a particular result is presented and discussed.
  • You want to find out about a technique. Unless the whole paper is about that technique, you might be able to focus relatively narrowly on the methods section and the few paragraphs where the authors explain how they used this technique and what they got from it.

Types of papers

There are lots of different types of papers, but a useful first division is to separate the primary research literature from reviews and commentaries. The primary research literature provides reports of (hopefully) new scientific studies. Reviews and commentaries on the other hand try to put previously published research into some sort of context. Typically, if you’re going to read a review paper, it’s because you want to get a well-rounded picture of some area of research. You would usually read a review from front to back. Because they are intended to explain and summarize some area of research, review papers are usually easier to read than the primary literature. The trade-off is that you’re getting one person’s (or a group’s) view of the field, which at times can be quite biased. In any event, reading review papers doesn’t pose the kinds of challenges or require the kinds of decisions you have to make when reading the primary literature. The rest of this post focuses on issues that arise when reading papers from the primary literature.

Anatomy of a paper

A scientific paper consists of several, more-or-less standard parts. There’s a bit more variation in theoretical papers than in experimental papers, but at least some of this structure remains in the former. It’s important to understand the purpose of each of these sections in order to read papers efficiently.

All papers start with a title and authorship information. A good title ought to tell you a lot about the main message of the paper, but the truth is that there are plenty of good and useful papers with bad or useless titles. A good title might catch your attention, but an uninformative title shouldn’t necessarily stop you from reading on. In other words, don’t judge a paper by its title.

The title and authors are generally followed by an abstract (not always labeled as such) which is a (usually) one-paragraph summary of the article. Here, the authors are telling you what they think is important in their article. Always read the abstract, no matter why you’re reading the paper. The abstract helps you understand the authors’ perspective on their work, and may point out additional points of interest you can look for in the paper other than what you were hunting for.

Almost all papers start with an introduction which, again, may not be titled as such. The introduction sets the stage for the rest of the paper. It provides background, and therefore typically contains a lot of references, which can be useful when you’re coming up to speed on a topic. It explains the problem the authors hope to solve. Most introductions also provide a brief overview of the paper, right at the end before the main body of the paper starts. As a rule, you want to read the introduction, if for no other reason than that it often functions as a mini-review of the field. The one exception might be the case where you’re looking for a very specific piece of technical information, in which case you might skip the introduction. Even in the latter case, you might want to look for the overview at the end of the introduction since this might help you find what you are looking for a bit faster.

Traditionally, the introduction of an experimental paper was followed by a materials and methods section. Some journals put this section at the end of the paper rather than after the introduction, and some only contain a summary of the methods, with the details appearing in online supplementary materials. Theory papers often don’t have a methods section. Rather, the methods are discussed in the body of the paper since they’re not easily separated from other aspects of the work. You will often skip the methods section, unless you’re there to learn one of the methods used in the paper. Why? If the authors have done a good job of writing the rest of the paper, they will tell you (at least in general terms) what experiments they did as they are describing their results. Moreover, the methods often don’t make all that much sense disconnected from the reason for performing the experiment. If you find that the descriptions of the experiments in the text are not sufficient for your purposes, you can always go look at the methods later.

The main part of the paper will be taken up with a description of what was done and what we can conclude from that, possibly intermixed with a broader discussion of significance. While some journals insist on a clearly labeled Results section, others allow more latitude in dividing up this part of the paper into sections with titles that describe the topic(s) they attack. Depending on why you are reading a paper, this may be a section you read really carefully, one that you skim to find a specific fact, or one that you skip altogether. Nine times out of ten you will read most of this section, since this is where the new science is described, and reading about the advance made in a paper is the usual reason for reading the primary literature. When would you skip the results section(s)? Mostly this would be the case when you’re just acquiring background information, in which case you might just read the introduction and conclusions. Of course, a good review paper or two might be a better way to pick up background information, but there isn’t a review paper to cover every possible topic.

The conclusions section is where the authors wrap up: They present their major findings, indicate how those findings do or don’t support one or another hypothesis, and often present some idea of how they think the area of research might evolve from there. Like the introduction, this section usually tries to give us a broader context, so it typically contains quite a few references. The conclusions are often worth reading for that reason alone. More than that, they summarize the technical work that came before, often in much simpler language, so the conclusions can help you understand what, from the authors’ perspective, their work contributes to the field. You will almost always read this section carefully.

Hard papers

Many scientific papers are hard to read, for a variety of reasons. One of the things that often hampers readability is that the work presented is highly technical, so it’s hard for a non-expert to follow for lack of relevant expertise. It’s important not to get discouraged when you feel a bit lost while reading a paper. It’s a common feeling and, unless you really need to understand a paper in depth, which is sometimes necessary for key papers very close to your area of research, it’s often just fine not to get every detail. If you’re really having trouble following the detailed argument in the results section, at least try to understand the general flow of the argument. Failing that, concentrate on the introduction and conclusions, which will tell you what the authors wanted to do and what they claim to have accomplished.

Papers containing mathematical content are a very special subset of hard papers. Here’s a little secret: Unless you are reading a paper to find out how someone proved a particular theorem, you can almost always skip all the proofs. Moreover, you can often skim the equations rather than reading them carefully. A well-written mathematical paper will tell you why certain equations are being derived, which ones are important, what they tell you, and so on. There may only be one or two particularly important equations you need to look at carefully. There may not be any equations worthy of careful scrutiny. Again, it all depends on what you’re hoping to get from the paper. The conclusions obtained from the equations, which will be described in text, may be all you need to know. Alternatively, you may be interested in the final equation obtained, but not in the detailed derivation or proof. We thus come back to the point that you have to have a fairly clear idea of why you are reading a paper.

Figures and tables

Figures and tables, especially if they are accompanied by a good caption, are sometimes worthy of your attention, and sometimes much less so. In some papers, the figures practically tell the whole story: how variable A is related to variable B, the results of key experiments, and so on. They’re generally worth a look, with the caveat that it may happen that you hit a figure you simply don’t understand, perhaps because the interpretation of the figure requires technical knowledge you simply don’t have. When that happens, deal with it as you do with text you don’t understand: If it’s really important to you, work at it or find someone to explain it to you. If not, skip it.

Results tables (as opposed to tables that give the parameters of an experiment) are usually worth a look, too. Papers contain many more figures than tables. If the authors included a table, it’s probably because the data in that table are inherently important or, at least, interesting. Even if you don’t need it now, make a note of papers that contain data tables since you may need those data at some point in the future.

Summary

I would summarize this blog post as follows:

  • Think about what you want to get from a paper before you sit down to read it, and focus your reading on the part(s) of the paper most likely to give you what you want.
  • Don’t worry about parts of the paper you don’t understand, unless the part you don’t understand specifically addresses your reason for reading the paper.
  • The parts of the paper where the authors give you their perspective, mostly the introduction and conclusions, but possibly other sections with extensive interpretive comments, are often the most valuable parts of a paper.

Keeping up with the literature

If you’re a researcher, whether you’re a graduate student starting out or a seasoned scientist, keeping up with the literature is hard work. Especially in today’s world, you probably need to keep an eye not only on your own narrow specialty, but on a number of related areas of research. The problem is particularly bad for those of us who engage in multidisciplinary research. In this blog post, I’m going to suggest some techniques you can use to keep up with relevant developments. This post is mostly intended for my student readers. The experienced scientists who read this blog will have their own strategies for keeping abreast of the latest developments, and I would invite them to share their ideas here.

Before I get into the mechanics however, let me address an important question: Why should you read scientific papers? Why not just lock yourself into your lab and devote yourself completely to your research work? Well, you could try that, but you probably wouldn’t turn out to be a very effective scientist. By reading relevant papers, you can of course avoid repeating work that has already been done. You can learn techniques that you can apply in your own research. You can learn about ideas that might impact the interpretation of your results, or even open up new directions in your own research. On a purely selfish level, having a broad awareness of the science going on around you will help you get through your comprehensive exam and oral thesis defence. So you need to read.

How, then, do you find relevant papers to read?

The references in papers

When you started your research, your supervisor probably handed you a handful of papers and told you to read those. And of course, you will read many other papers as your program unfolds. When you read a paper, you are not always expected to master every detail, but you should be on the lookout for points that are particularly relevant to your research. Often, the things that will catch your attention in a paper are not the primary results, but points that are brought up in the introduction or discussion relating the current work to some earlier research, or a method borrowed from an earlier study. You should be on the lookout for particularly relevant references, and read these. To be blunt: your supervisor expects this of you, although many won’t say so directly.

Searches and alerts

You will often need to search the literature for information on specific topics. That’s probably obvious to you. You may also discover a key author whose work is particularly close to yours, and you will likely want to see what else this person has published, which you can find out using an author search. There’s another useful type of search that you need to know about called a citation search. Suppose that you have identified a key paper, and you want to know if anyone has followed up on the ideas in this paper. A citation search tells you about any papers that cite the paper you started with, i.e. papers that include your starting paper in their list of references. A citation search is, in essence, a mechanism for following an idea forward in time.

There are a few different systems that allow you to do citation searches. I like the Web of Science, but it’s hardly the only game in town. Talk to your librarian about what tools are available on your campus.

A lot of the time, you will do a topic, author or citation search just once, and then look through the results to find a few papers that look particularly interesting. However, you may at some point have a topic that is so central to your research, or an author whose work is so relevant, or a paper that is so important to the field, that you want to know anytime a paper appears meeting one of these search criteria. In these cases, you would set up an alert in a relevant search engine. (Again, talk to your librarian about what alerting systems are available on your campus.) There are some variations on this theme, but usually an alert would be a search that is run automatically on a weekly schedule, with results (if any) emailed to you. In order to set up an alert, you usually need to set up an account with the database provider. If your institution subscribes to the database, this would normally be free.

Reading journals

When I was a graduate student, I used to go to the departmental library on Fridays and see what new journals had arrived. In the Chemistry library at the University of Toronto, new journals were piled on a table in the reading area. I would flip through the tables of contents of a few journals that were particularly relevant to me, and maybe browse one or two others each week as the spirit moved me. As a result, I would sometimes run into useful articles that I might not have found any other way.

I doubt that very many people browse physical journals in quite this way anymore. I certainly don’t. However, it’s useful to browse a few journals to facilitate the serendipitous discoveries of interesting work.

The best way to “browse” journals now is probably to have the journals email you their tables of contents. Almost every journal has some mechanism for this. Just go to the journal’s home page and look around for a link to their email alerting service. These services are always free. I would encourage you to get the tables of contents of a few important generalist journals (e.g. Science, Nature, Proceedings of the National Academy of Sciences) as well as a few specialist journals in your area. You don’t want to have a hundred journals send you their tables of contents, so how do you decide which ones to get? Well, what have you been reading? Your supervisor’s suggestions as well as the results of your searches will likely have turned up a few key journals in which a significant amount of work in your area is published. Get those journals to send you their tables of contents. It may only be one or two journals at first, but as you read more, you will find additional journals whose tables of contents are worth adding to your list.

Hopefully these suggestions will help you locate relevant papers. In my next blog post, I will write about how you should read a scientific paper.

“Following the treatment of”: How to avoid reinventing the wheel

In my last blog post, I wrote about clean-room writing as a way of avoiding plagiarism. In today’s post, I will talk about how you can avoid charges of plagiarism while providing background information that follows a plan established by someone else.

Here’s a common writing problem: You’re writing background material for a larger work. The background material runs to several paragraphs, and you have found a source (often a book) that explains the issue you need to include in your background material particularly well. In science, we don’t quote long passages. As we discussed earlier, plagiarism is unacceptable, and copying someone’s organization is generally a form of plagiarism. Note the word generally in the last sentence. There’s a small loophole, which you have to use carefully, but which is available to you for cases like this one.

Here’s the loophole: If you explicitly say that you’re presenting something the way it was presented elsewhere, it’s OK. To do this, we often use words like “Following the treatment of…”, or “The organization of the material in this section follows…”. The explicit acknowledgment that you are borrowing someone else’s way of organizing a certain topic (and perhaps their notation and/or terminology) makes this OK. Note that you have to be very clear that you are doing this. A simple citation won’t do here.

Now we have to be clear about something else: Explicit acknowledgment does not provide a blanket exemption from the normal rules of plagiarism.

  • You still have to write your own text, i.e. you can’t just use someone else’s words, even if you have borrowed their organization. I would still write text like this using the clean-room technique described in my previous post. The only difference is that my notes in a case like this might be a little more detailed, laying out the logical sequence of ideas to be covered. I would still avoid writing notes in complete sentences to avoid inadvertent reuse of the original author’s wording.
  • You can only do this for a well-defined portion of your work, e.g. a few paragraphs or, at most, a short section on a specific topic. You can’t compose (for example) entire chapters of a thesis this way.

If you use this loophole, it becomes much more difficult to avoid other forms of plagiarism, and of course there’s always the question of whether you have used too much of someone else’s work in constructing your own. This is therefore something that is to be done with considerable caution, and likely with someone else going through your work to see that it has been done properly. However, it’s often useful to avoid reinventing the wheel by simply acknowledging that it has been done, and then just using the darned thing.

Clean-room writing

How to commit plagiarism

Over the last few years, I have noticed more and more problems with student plagiarism. I’ve spent a lot of time thinking about where these problems come from. I don’t generally think that they are due to deliberate attempts to cheat. Rather, I think that modern tools create situations where plagiarism becomes almost inevitable unless you are both conscious of the issue and careful. In this blog post, I am going to suggest a method for avoiding plagiarism that I think most of us should adopt. It’s not a panacea, but it’s better than what most people are doing right now.

First of all, we should all agree on what plagiarism is. There are lots of sources on plagiarism, of which my favorite was produced by the Office of Research Integrity of the U.S. Department of Health and Human Services. Roughly speaking, I tend to think of plagiarism in terms of a hierarchical classification, which goes from the grossest to the most subtle forms.

The first form involves simply cutting and pasting from a source. Most people would agree that is wrong, although a remarkable number of students appear not to entirely get that. When we think of plagiarism as cheating, this is generally what we are thinking about.

The second form, which has a very similar effect, is using someone else’s words in your text. While this sounds like cutting and pasting, it often happens in a different way, when we write something while we are looking at a source, maybe so we get some details right. People who commit this form of plagiarism are often not even aware they are doing it. The net effect, though, is text that generally looks an awful lot like the original source, with only a few words changed here and there. If you don’t believe that you are prone to this problem, try reading Wikipedia’s historical summary of the third law of thermodynamics. (I picked this topic because few people know much about it. Plagiarism becomes all the more likely when writing about topics with which one is not intimately familiar.) Then immediately try to write your own text on this topic. While you are writing, keep the Wikipedia article open and look back at it for details from time to time. Most people find it very difficult to write sentences and paragraphs that differ significantly from the original text under these conditions. This is plagiarism.

If you somehow avoid writing sentences that look like those in the original text, you are likely to at least mimic the structure of the original text, with the same facts presented in the same order. This is the third and most insidious form of plagiarism. It is hard to detect, and people who commit this form of plagiarism will often deny vehemently that they have done anything wrong. However, when you do this, you have not told us what you think about a subject. You have just told us what the writer of, in this case, the Wikipedia article thinks. You may have done it in different words, but you did not organize the facts yourself, which is the key difference between original writing and plagiarism. (There are occasions when it’s appropriate to write something that is organized like someone else’s coverage of a topic. I will come back to this in a later blog post.)

Thinking about the exercise I proposed above, I would suggest that there are at least three distinct issues leading to plagiarism:

  1. Writing while looking at a source, or immediately after reading a source. It is almost impossible to come up with your own words and an original way to organize the facts when you are doing this. The perfectly good words and organization chosen by the original writer become “the obvious way” to write about a topic, and it’s virtually impossible to break out of that.
  2. Excessive reliance on a single source. If you use multiple sources to inform your thinking, the particular way any one author organized his or her writing is much less likely to have a dominant effect on how you write about something.
  3. Not having clearly distinguished research, outline, writing and revision phases in the writing process. This is perhaps the greatest failing of modern students in their approach to writing. If you start by doing some research, taking brief notes as you go, then write an outline as a way of organizing your thoughts, then write text based on your outline, and finally go through several rounds of revisions, it becomes difficult to commit plagiarism because you will really be writing about a topic from your perspective, and not from that of another writer.

How not to commit plagiarism

Having talked about these issues with many students, and given the pervasive nature of information technology, which puts sources at our fingertips almost anywhere, anytime, I have concluded that the best way to fight plagiarism is to adopt what I call clean-room writing techniques. Much of what I am going to describe is essentially the research, outline, writing, revision cycle described above. However, I think that we need to go a little further given how easy it is to unconsciously plagiarize material.

There is a similar problem in the software industry. Let’s say you want to write a piece of software that does the same thing as another existing piece of software. Because software is protected by copyright, you’re not allowed to copy someone else’s software. You may want to peek, but in the end you have to write your own, original implementation. The way this is done is to have people write the software (even if they previously peeked at the other company’s software) in a “clean room”, which is a room where you have everything you need to do your work except the other company’s software. Depending on how paranoid you are, such a room might not have a direct connection to the Internet. Sometimes, the people who peek are different from the people who write the new software. Sometimes, peeking just isn’t allowed.

To avoid plagiarism, you need to write in something like a clean-room environment. What I mean by this is that looking at your sources and writing should not occur at the same time. Your research for whatever you are writing (term paper, thesis introduction, article manuscript, …) should happen at a different time than the actual composition of text. Text should be written from notes which were not generated by simple cutting-and-pasting from a source. Yes, I know, cutting-and-pasting is quick and efficient. The problem, as explained above, is that you’re almost certain to use someone else’s words if you do that. When you take notes, write in your own words what you thought was important or interesting in some particular text you are reading.

Once you have composed a draft of a text, you can of course fact-check against the original sources. Having given the ideas your own form by writing a draft without direct access to the sources, you are much less likely to unintentionally borrow someone’s words during revision.

Note that clean-room writing does not relieve you of the responsibility to cite your sources. Your notes should clearly link content to references, so you should be able to cite your sources as you write. Occasionally, you will need to make a note to yourself to chase down a reference later. You can still add references during the revision stage if you at least note the places in your text that will need to cite sources.

This may seem a bit radical, but I’ve seen too many students get in trouble for plagiarism over the last few years, and I know that many of them did not intend to plagiarize. It just happened, for the reasons I explained in the first part of this blog post. Eventually, you can relax this strict approach a little, but if you’re an inexperienced writer, it’s best to go into the metaphorical clean room anytime you’re writing new text.

Don’t be stupid: get vaccinated

There’s a lot of stuff in the media about vaccines right now, particularly here in Southern Alberta where we’re going through a measles epidemic because we have an unusual number of people in our region who don’t get vaccinated. It’s also flu shot time.

Look, it’s simple: Getting vaccinated protects you from getting sick. It also protects people around you from getting sick because, once you have been vaccinated, you are much less likely to spread the disease around. Vaccines are among the safest and most effective methods we have for fighting illness. I know a lot of people think that measles and the flu aren’t serious diseases, but they are. The mortality rate from measles in the developed world is between 1 and 3 per 1000 cases. I don’t know about you, but I definitely would not want to bet my life against those odds, particularly when the alternative, the vaccine, involves little more than a minor inconvenience. Flu mortality varies wildly with age and strain, but it’s a serious risk, too. And of course there are lots of complications that are less dramatic than death, but still very serious. Similar comments could be made about any of the diseases for which we get vaccinated.

But, you say, what about the risk of vaccine side-effects? If you’re allergic to some of the vaccine ingredients (which include things like eggs), then of course you might have a reaction. The people who administer vaccines know about these things, they ask about them, and then they make sure you stay around for a while after getting the vaccine, just in case. In some cases, they can provide an alternative formulation that excludes a particular allergen. Most of the other side-effects of vaccines are mild (muscle soreness, low-grade fever), and much less disruptive of your daily activities than the diseases they protect you against. The sensational side-effects you hear about from time to time are mostly urban legends. The health care provider administering the vaccine ought to be able to tell you about any realistic side-effects if you’re concerned.

There’s a nice Ph.D. Comics video about vaccination that has just been posted. (Ph.D. Comics does some serious stuff, and they’re really good at explaining things in language we can all understand, regardless of education.) If you need further convincing, watch it, and get vaccinated.

The big five-oh, Part 2

When I arrived in Lethbridge in the summer of 1995, my first job was to write an NSERC research grant proposal. This proposal used delay systems, a topic I had first explored in detail while I was a postdoc at McGill, as a connecting theme. It’s interesting to go back to this first proposal, because some of the ideas are recognizable in my current research program, but others were dropped long ago. It included a proposal to develop a detailed model of the lac operon, something I never quite got around to doing, but which is clearly related to my current research interests in gene expression. There was a proposal to work on the equivalence between various types of differential equations, including master equations, which I’m still working on. There were also some ideas for stochastic optimizers, which led to some work on the structures of ion clusters,1 but which I didn’t pursue for long.

So what did I busy myself with? My very first paper in Lethbridge was on competitive inhibition oscillations,2 a phenomenon I had first discovered in the final stages of my Ph.D. This line of thought eventually led to the discovery of sustained stochastic oscillations in this system many years later.3

I’m not going to go through all the work I’ve done since those early years, so maybe I’ll just mention a few major themes that emerged over time, and take the opportunity to formulate a bit of advice to young scientists.

I remained interested in model reduction, a topic I continue to work on to this day. After leaving Toronto, I had thought that I would stop working on these problems. I wasn’t sure that I had all that much more to say about the theory of slow invariant manifolds. But colleagues in the field encouraged me to keep working on these problems, and from time to time I had some new idea that I thought would contribute something to the field. I am no longer under any illusion that I’m going to stop working on these problems anytime soon. What is the lesson to young scientists here? If you work on a sufficiently interesting set of problems during your Ph.D., this work is likely to follow you throughout your career, and that’s not a bad thing.

While I was finishing my Ph.D., I remember having a talk with Ray Kapral in which I said, with the certainty that only a young, inexperienced scientist can muster, that the problems involved in modelling chemical systems with ordinary differential equations were sufficient to keep me occupied, and that I would never (I actually remember using this word) work on partial differential equation or stochastic models. By 2002, I was studying reaction-diffusion (partial differential equation) models with my then postdoc, Jichang Wang. By 2004, I was working on stochastic models with Rui Zhu, also a postdoc at the time. In fact, most of my research effort is currently directed to stochastic systems. It was silly of me to say I would never work in one modelling framework or another. What I had the wisdom to do as I matured was to pick the correct modelling paradigm at the appropriate moment to tackle the problems I wanted to solve.

One of the things that, I think, has kept my research program relevant and vital over the years is that we’ve done a lot of different things: in addition to the topics mentioned above, there were projects on dynamical systems with stochastic switching, on stochastic modelling of gene expression, on photosynthesis, and on graph-theoretical approaches to bifurcation theory, to name just a few. Most of these topics connect to each other in some way, or at least they do in my head.

Looking back on my first 50 papers, much as it’s fun to think about the research, it’s the people that stand out. I’ve worked with many fine supervisors, colleagues, postdocs and students. I have learned something from each and every one of them. In fact, if I have one piece of advice for young scientists, it’s to find good people to work with, and to pay attention to what they do and how they do it. You can’t necessarily do things exactly the same way as someone else does, but you ought to be able to derive some general lessons you can use to guide your own research career and interactions with other scientists.

Be brave in choosing research topics. Work hard. Find good people to work with. I can’t guarantee that doing these things will lead to success, but not doing them will, at best, lead to mediocrity.

1 Richard A. Beekman, Marc R. Roussel and P. J. Wilson (1999) Equilibrium configurations of systems of trapped ions. Phys. Rev. A 59, 503–511. Taunia L.L. Closson and Marc R. Roussel (2009) The flattening phase transition in systems of trapped ions. Can. J. Chem. 87, 1425–1435.
2 Lan G. Ngo and Marc R. Roussel (1997) A new class of biochemical oscillator models based on competitive binding. Eur. J. Biochem. 245, 182–190.
3 Kevin L. Davis and Marc R. Roussel (2006) Optimal observability of sustained stochastic competitive inhibition oscillations at organellar volumes. FEBS J. 273, 84–95.

The big five-oh, Part 1

No, I haven’t turned 50 yet. However, my 50th refereed paper has now appeared in print. This therefore seems like an appropriate time to look back on some of the research I have done since I started out as a graduate student at the University of Toronto. (I had two prior papers from my undergraduate work, but these were both in areas of science I didn’t pursue.) My intention here isn’t to write a scholarly review paper, so you won’t find a detailed set of citations here. My full list of publications is, in any event, available on my web site.

My M.Sc. and Ph.D. theses were both on the application of invariant manifold theory to steady-state kinetics. I was introduced to these problems by my supervisor at the University of Toronto, Simon J. Fraser. Simon is a great person to work for. He is supportive, and full of ideas, but he also lets you pursue your own ideas. I had a great time working for him, and learned an awful lot of nonlinear dynamics from him.

One way to think of the evolution of a chemical system is as a motion in a phase space, typically a space whose axes are the concentrations of the various chemical species involved, but sometimes including other relevant variables like temperature. The phase space of a chemical system is typically very high-dimensional. The reactions that transform one species into another occur on many different time scales. The net result is that we can picture the motion in phase space as involving a hierarchy of collapse processes onto surfaces of lower and lower dimension, the fastest processes being responsible for the first collapse events, followed by slower and slower processes.1 These surfaces are invariant manifolds of the differential equations, and we developed methods to compute them. Given the equation of a low-dimensional manifold, we obtain a reduced model of the motion in phase space, i.e. one involving the few variables necessary to describe motion on this manifold.
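
To make the picture of collapse onto a lower-dimensional surface concrete, here is a minimal Python sketch. The two-variable system, the value GAMMA = 50, and all of the function names are illustrative assumptions, chosen only because the slow manifold of this toy system can be written down exactly. It is not a model or a method taken from the papers cited in this post.

```python
import math

# Toy fast-slow system (an assumption for illustration only):
#   dx/dt = -x                      (slow variable)
#   dy/dt = -GAMMA * (y - x**2)     (fast variable, relaxes at rate GAMMA)
# For GAMMA > 2, the curve y = A * x**2 with A = GAMMA / (GAMMA - 2)
# is an invariant manifold: trajectories that start on it stay on it.
GAMMA = 50.0
A = GAMMA / (GAMMA - 2.0)


def rhs(x, y):
    """Right-hand side of the full two-variable model."""
    return -x, -GAMMA * (y - x * x)


def rk4_step(x, y, dt):
    """Advance (x, y) by one classical fourth-order Runge-Kutta step."""
    k1x, k1y = rhs(x, y)
    k2x, k2y = rhs(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
    k3x, k3y = rhs(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
    k4x, k4y = rhs(x + dt * k3x, y + dt * k3y)
    x_new = x + dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
    y_new = y + dt * (k1y + 2.0 * k2y + 2.0 * k3y + k4y) / 6.0
    return x_new, y_new


def simulate(x0, y0, t_end, dt=1e-3):
    """Integrate the full model from (x0, y0) up to time t_end."""
    x, y = x0, y0
    for _ in range(round(t_end / dt)):
        x, y = rk4_step(x, y, dt)
    return x, y


def distance_from_manifold(x, y):
    """Signed vertical distance of the point (x, y) from y = A * x**2."""
    return y - A * x * x


def reduced_model(x0, t):
    """Reduced model: solve for x alone, then read y off the manifold."""
    x = x0 * math.exp(-t)
    return x, A * x * x


if __name__ == "__main__":
    x, y = simulate(1.0, 3.0, 0.5)
    print("full model:   ", x, y)
    print("reduced model:", *reduced_model(1.0, 0.5))
    print("distance from manifold:", distance_from_manifold(x, y))
```

In this toy system, the distance from the curve decays like exp(-GAMMA t) while x decays only like exp(-t), so a trajectory started off the manifold at (1, 3) collapses onto it after a short transient. From then on, the one-variable reduced model describes the motion about as well as the full two-variable model does.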

Invariant manifold theory has been a fertile area of research for me over the years. I continue to publish in this area from time to time. In fact, one of my current M.Sc. students, Blessing Okeke, is working on a set of problems in this area. Expect more work on these problems in the future!

Toward the end of my time in Toronto, my supervisor, Simon J. Fraser, allowed me to spend some time working with Carmay Lim who, at the time, was cross-appointed to several departments at the University of Toronto, and worked out of the Medical Sciences Building. This was a very productive time, and I learned a lot from Carmay, particularly about doing research efficiently.

We worked on a set of applied problems on the lignification of wood using an interesting piece of hardware called a cellular automata machine. This was a special-purpose computer built to efficiently simulate two-dimensional cellular automata. The machine was programmed in Forth, a programming language most of you have probably never heard of, with some bits written in assembly language for extra efficiency. For a geek like me, programming this machine was great fun. I think we did some useful work, too, as our work on lignification kinetics still gets cited from time to time.

I had been to the 1992 SIAM Conference on Applications of Dynamical Systems in Snowbird which, I think, was just the second of what would become a long-lived series of conferences. There, I had discovered that there was a lot of interest in delay-differential equations (DDEs), as the tools necessary to analyze these equations were being sharpened. I had thought about the possibility of applying DDEs to chemical modelling, and decided to apply to work with Michael Mackey at McGill University, who was an expert on the application of DDEs in biological modelling. McGill was a great environment, and I learned a lot from Michael and his students. The most significant outcome of my time in Montreal was a paper published in the Journal of Physical Chemistry on the use of DDEs in chemical modelling.2

I pursued this style of modelling in a handful of papers. Eventually, I got interested in the use of delays to simplify models that can’t be described by differential equations, namely stochastic systems.3 This is another one of those ideas that I have kept following down through the years.

In my next blog post, I will reflect on some of the work I have done since arriving in Lethbridge.

1 Marc R. Roussel and Simon J. Fraser (1991) On the geometry of transient relaxation. J. Chem. Phys. 94, 7106–7113.
2 Marc R. Roussel (1996) The use of delay-differential equations in chemical kinetics. J. Phys. Chem. 100, 8323–8330.
3 Marc R. Roussel and Rui Zhu (2006) Validation of an algorithm for delay stochastic simulation of transcription and translation in prokaryotic gene expression. Phys. Biol. 3, 274–284.

“Hey” as an email greeting

[Comic strip: phd042215s]

Reproduced with permission from “Piled Higher and Deeper” by Jorge Cham, www.phdcomics.com.

Like every other teacher, from time to time, I receive an email from a student written in an excessively informal style. Cham’s comic, reproduced above, is pretty close to some of the emails we receive. About a year ago, after receiving yet another such email, I decided to send the following to one of my classes (copied unedited in its entirety):

There’s more to university than just learning formal material, so I hope you won’t mind if, from time to time, I give you some life advice.

Recently, I have received a number of emails from students that start with “hey”, or other equally informal forms of address. Now, I don’t insist that you call me “Professor Roussel” (although you certainly can), but you should watch how you write emails to people who aren’t in your close circle of friends. At this point, you are training to become professionals. You will (hopefully) all end up in responsible positions where you will eventually be interacting both with people supervising you and with people you are supervising. Both your superiors and the people who report to you expect professionals to maintain a certain level of decorum, and people sometimes get very offended when an email starts with “hey”. It’s just not a respectful form of address.

You’re now at a place where we expect a bit more formality. It’s time to start writing emails that look like they were written by a grownup. Take a moment to write a proper greeting when you’re writing one of your instructors. “Dear Marc” is OK, if you’re comfortable with that. Otherwise, “Professor Roussel” will do fine. Most of your emails will be asking your instructors for something: information, help with something, an appointment. It’s best in those cases to put your best foot forward and to be polite. Moreover, this is where you form the habits that will carry you through your first few years of work. You just don’t want to write your boss an email that starts with “hey”.

By the way, this isn’t directed at anyone in particular. I don’t require an apology from anyone. I just want to help you take the next step in your development as young adults.

Some months later, I was approached by a fellow faculty member who told me that a young person she knew had told her about this email, and that she liked it so much that she was planning to show it to some of her students (with my permission, which I was only too happy to give). And then there’s a blog post by Chris Blattman on email etiquette for students that contains somewhat similar comments, as well as the brilliant Ph.D. Comics strip (which was published, and added to this post, long after the post was originally written). (Thanks to Paul Hayes for bringing Blattman’s blog post to my attention.) Clearly I’m not the only person who thinks this way.

I wrote the above email to my students thinking that I would get them thinking about appropriate levels of formality for different situations. (Other than the indirect report mentioned above and one other indirect report, I have no idea whether I made any impact. There was one student who commented on their teaching evaluation that they thought this email wasn’t treating them “as adults”, but of course the point of my email was that some of them weren’t writing adult emails.) As I wrote at the time, university isn’t just about learning specific subject matter, and I worry that the current generation of students is badly equipped for the social aspects of the work world. A lot of the rapid communication methods we have now assume a certain level of informality. (You’re not going to have elaborate greetings in a 140-character tweet or, generally, in a text message composed on a phone.) The problem is that this informal style of communication doesn’t translate well to other media, to many social situations, across age groups, or across cultures. When, where and how are students supposed to learn this? It seems to me that electronic etiquette has to be woven into the informal curriculum from a young age, and that it needs to be reinforced all the way through the education system. Right now, we have a generation of teachers (and I include myself in this group) who mostly came to electronic communications after they had learned other means of communication. While, in many contexts, this is viewed as a disadvantage, in this case I think that we older folk are actually better equipped to navigate the multiple levels of formality needed to get through a day, including the correct levels of formality to use in various forms of electronic communications. When I was learning to write letters in school, they first taught us how to write formal letters. Having learned what a formal letter looked like, relaxing some of these rules when communicating with friends or loved ones became a conscious choice, making it unlikely that we would, say, write an overly informal memo to the boss. Going the other way, starting with informal communication styles and then trying to raise the level of formality as required, is, I suspect, much harder.