Sunday, July 30, 2017

Can we measure science?

I was writing a couple grants recently, some with page limits and some with word limits. Which of course got me thinking about the differences in how to game these two constraints. If you have a word limit, you definitely don’t want to use up your limit on a bunch of little words, which might lead to a bit more long-wordiness. With the page limit, though, you spend endless time trying to use shorter words to get that one pesky paragraph one little line shorter (and hope the figures don’t jump around). Each of these constraints has its own little set of games we play trying to obey the letter of the law while seemingly breaking its spirit. But here’s the thing: no amount of "gaming the system" will ever allow me to squeeze a 10 page grant into 5 pages. While there’s always some gamesmanship, in the end, it is hard to break the spirit of the metric, at least in a way that really matters. [Side note, whoever that reviewer was who complained that I left 2-3 inches of white space at the end of my last NIH grant, that was dumb—and yes, turns out the whole method does indeed work.]

I was thinking about this especially in the context of metrics in science, which is predicated on the idea that we can measure science. You know, things like citations and h-index and impact factor and RCR (NIH’s relative citation ratio) and so forth. All of which many (if not most) scientists these days declare as being highly controversial and without any utility or merit—"Just read the damn papers!" is the new (and seemingly only) solution to everything that ails science. Gotta say, this whole thing strikes me as surprisingly unscientific. I mean, we spend our whole lives predicated on the notion that carefully measuring things is the way to understand the world around us, and yet as soon as we turn the lens on ourselves, it’s all “oh, it’s so horribly biased, it’s a popularity contest, all these metrics are gamed, it’s there’s no way to measure someone’s science other than just reading their papers. Oh, and did I mention that time so and so didn’t cite my paper? What a jerk.” Is everyone and every paper a special snowflake? Well, turns out you can measure snowflakes, too (Libbrecht's snowflake work is pretty cool, BTW 1 2).

I mean, seriously, I think most of us wish we had the sort of nice quantitative data in biology that we have with bibliometrics. And I think it’s reasonably predictive as well. Overall, better papers end up with more citations, and I would venture to say that the predictive power is better than most of what we find in biology. Careers have certainly been made on worse correlations. But, unlike the rest of biomedical science, any time someone even insinuates that metrics might be useful, out come the anecdotes:
  • “What about this undercited gem?” [typically one of your own papers]
  • “What about this overhyped paper that ended up being wrong?” [always someone else’s paper]
  • “What about this bubble in this field?” [most certainly not your own field]
Ever see the movie “Minority Report”, where there are these trio of psychics that can predict virtually every murder, leading to a virtually murder-free society? And it’s all brought down because of a single case the system gets wrong about Tom Cruise? Well, sign me up for the murder-free society and send Tom Cruise to jail, please. I think most scientists would agree that self-driving cars will lead to statistically far fewer accidents than human-driven cars, and so even if there’s an accident here and there, it’s the right thing to do. Why doesn’t this rational approach translate to how we think about measuring the scientific enterprise?

Some will say these metrics are all biased. Like, some fields are more hot than others, certain types of papers get more citations, and so forth. Since when does this mean we throw our hands up in the air and just say “Oh well, looks like we can’t do anything with these data!”? What if we said, oh, got more reads with this sequencing library than that sequencing library, so oh well, let’s just drop the whole thing? Nope, we try to correct and de-bias the data. I actually think NIH did a pretty good job of this with their relative citation ratio, which generally seems to identify the most important papers in a given area. Give it a try. (Incidentally, for those who maintained that NIH was simplistic and thoughtless in how it was trying to measure science during the infamous "Rule of 21" debate, I think this paper explaining how RCR works belies that notion. Let's give these folks some credit.)

While I think that citations are generally a pretty good indicator, the obvious problem is that for evaluating younger scientists, we can't wait for citations to accrue, which brings us to the dreaded Impact Factor. The litany of perceived problems with impact factor is too long and frankly too boring to reiterate here, but yes, they are all valid points. Nevertheless, the fact remains that there is a good amount of signal along with the noise. Better journals will typically have better papers. I will spend more time reading papers in better journals. Duh. Look, part of the problem is that we're expecting too much out of all these metrics (restriction of range problem). Here's an illustrative example. Two papers published essentially simultaneously, one in Nature and one in Physics Review Letters, with essentially the same cool result: DNA overwinds when stretched. As of this writing, the Nature paper has 280 citations, and the PRL paper has 122. Bias! The system is rigged! Death to impact factor! Or, more rationally, two nice papers in quality journals, both with a good number of citations. And I'm guessing that virtually any decent review on the topic is going to point me to both papers. Even in our supposedly quantitative branch of biology, aren't we always saying "Eh, factor of two, pretty much the same, it's biology…"? Point is, I view it as a threshold. Sure, if you ONLY read papers in the holy triumvirate of Cell, Science and Nature, then yeah, you're going to miss out on a lot of awesome science—and I don't know a single scientist who does that. (It would also be pretty stupid to not read anything in those journals, can we all agree to that as well?) And there is certainly a visibility boost that comes with those journals that you might not get otherwise. But if you do good work, it will more often than not publish well and be recognized.

Thing is that we keep hearing these "system is broken" anecdotes about hidden gems while ignoring all the times when things actually work out. Here's a counter-anecdote from my own time in graduate school. Towards the end of my PhD, I finally wrapped up my work on stochastic gene expression in mammalian cells, and we sent it to Science, Nature and PNAS (I think), with editorial rejections from all three (yes, this journal shopping is a demoralizing waste of time). Next stop was PLoS Biology, which was a pretty new journal at the time, and I remember liking the whole open access thing. Submitted, accepted, and then there it sat. I worked at a small institute (Public Health Research Institute), and my advisor Sanjay Tyagi, while definitely one of the most brilliant scientists I know, was not at all known in the single cell field (which, for the record, did actually exist before scRNA-seq). So nobody was criss-crossing the globe giving talks at international conferences on this work, and I was just some lowly graduate student. And yet even early on, it started getting citations, and now 10+ years later, it is my most cited primary research paper—and, I would say, probably my most influential work, even compared to other papers in "fancier" journals. And, let me also say that there were several other similar papers that came out around the same time (Golding et al. Cell 2005, Chubb et al. Curr Biol 2006, Zenklusen and Larson et al. Nat Struct Mol Bio 2008), all of which have fared well over time. Cool results (at least within the field), good journals, good recognition, great! By the way, I can't help but wonder if we had published this paper in the hypothetical preprint-only journal-less utopia that seems all the rage these days, would anyone have even noticed, given our low visibility in the field?

So what should we do with metrics? To be clear, I'm not saying that we should only use metrics in evaluation, and I agree that there are some very real problems with them (in particular, trainees' obsession with the fanciest of journals—chill people!). But I think that the judicious use of metrics in scientific evaluation does have merit. One area I've been thinking about is more nefarious forms of bias, like gender and race, which came up in a recent Twitter discussion with Anne Carpenter. Context was whether women face bias in citation counts. And the answer, perhaps unsurprisingly, is yes—check out this careful study in astrophysics (also 1 2 with similar effects). So again, should we just throw our hands up and say "Metrics are biased, let's toss them!"? I would argue no. The paper concludes that the bias in citation count is about 10% (actually 5% raw, then corrected to 10%). Okay, let's play this out in the context of hiring. Let's say you have two men, one with 10% fewer citations than the other. I'm guessing most search committees aren't going to care much whether one has 500 cites on their big paper instead of 550. But now let's keep it equal and put a woman's name on one of the applications. Turns out there are studies on that as well, showing a >20% decrease in hireability, even for a technician position, and my guess is that this would be far worse in the context of faculty hiring. I've know of at least two stories of people combating bias—effectively, I might add—in these higher level academic selection processes by using hard metrics. Even simple stuff like counting the number of women speakers and attendees at a conference can help. Take a look at the Salk gender discrimination lawsuit. Yes, the response from Salk about how the women scientists in question had no recent Cell, Science, or Nature papers or whatever is absurd, but notice that the lawsuits themselves mention various metrics: percentages, salary, space, grants, not to mention "glam" things like being in the National Academies as proxies for reputation. Don't these hard facts make their case far stronger and harder to dismiss? Indeed, isn't the fact that we have metrics to quantify bias critical here? Rather than saying "citations are biased, let's not use them", how about we just boost women's cites by 10% in any comparison involving citations, adjusting as new data comes in?

Another interesting aspect of the metric debate is that people tend to use them when it suits their agenda and dismiss them when they don't. This became particularly apparent in the Rule of 21 debate, which was cast as having two sides: those with lots of grants and seemingly low per dollar productivity per Lauer's graphs, and those with not much money and seemingly high per dollar productivity. At the high end were those complaining that we don't have a good way to measure science, presumably to justify their high grant costs because the metrics fail to recognize just how super-DUPER important their work is. Only to turn around and say that actually, upon reanalysis, their output numbers actually justify their high grant dollars. So which is it? On the other end, we have the "riff-raff" railing against metrics like citation counts for measuring science, only to embrace them wholeheartedly when they show that those with lower grant funding yielded seemingly more bang for the buck. Again, which is it? (The irony is that the (yes, correlative) data seem to argue most for increasing those with 1.5 grants to 2.5 or so, which probably pleases neither side, really.)

Anyway, metrics are flawed, data are flawed, methodologies are flawed, that's all of science. Nevertheless, we keep at it, and try to let the data guide us to the truth. I see no reason that the study of the scientific enterprise itself should be any different. Oh, and in case I still have your attention, you know, there's this one woefully undercited gem from our lab that I'd love to tell you about… :)

5 comments:

  1. Arjun,

    It's time for the backlash against the backlash. Of course there is signal in bibliometric data and the question is how that signal is used, misused or abused. You did a great job presenting many of the caveats, and I have only two things to add:

    1) Research is easier assessed in the rearview mirror, i.e., its reproducibility, its potential for opening up new fields and so on. I think we should be much more modest in our contemporaneous assessment. Much of the urgency for contemporaneous assessment is driven by ego needs and overly hyped news rather than of genuine use for science and research assessment.

    2) I do think that the signal in bibliometrics is much, much more often exaggerated than underestimated. Claiming that there is no signal is just an emotional response of colleagues exasperated by abuses of bibliometrics or a reflection of self-serving biases as you alluded. Bibliometrics is one of those things that tend to strongly polarize opinions.

    ReplyDelete
    Replies
    1. I agree with the rear-view mirror notion. In fact, I think it might be best to count the number of citations starting only 5 years after: http://rajlaboratory.blogspot.com/2014/05/a-proposal-for-measuring-paper-impact.html
      I think that there's some signal for contemporaneous assessment based on an (admittedly soft) threshold, where papers above a certain threshold are far more likely to be of good quality than those below. I'm not so sure, also, that bibliometrics are overly exaggerated. I find citations to be a bit more low key in scientific discussions than journal names, which is generally quite distasteful.

      Delete
    2. I also think of papers and journals in terms of soft thresholds, i.e., a few tiers. In my perception, the nature and the PRL papers from your example fall in the same tier. However, my impression is that many colleagues will not perceive them that way, especially if the papers are on the CVs of faculty applications for a bio oriented department.

      Of course, the citations to a particular paper are far more informative for the paper than the journal in which it was published. The citations are important but certainly incomplete metric for the paper.


      Delete
  2. "Judge not, lest ye be judged."

    Part of the difficulty in academics and science is that assessment is a constant - for us, for our students, for everyone. Too many zero-sum games. Metrics should be used with a dose of kindness, and the judger should be self-aware enough to ask how s/he would see the data if it were her/his own and not the applicant's. Unfortunately, self-awareness is not a criterion for this job.

    ReplyDelete
    Replies
    1. Hmm. I would agree it's not healthy to live our day to day obsessing about metrics and impact factors. I'm not quite sure where kindness comes into this, though. Context?

      Delete