
[Note: Mike asked me to scrape a couple of comments on his last post – this one and this one – and turn them into a post of their own. I’ve edited them lightly to hopefully improve the flow, but I’ve tried not to tinker with the guts.]

This is the fourth in a series of posts on how researchers might better be evaluated and compared. In the first post, Mike introduced his new paper and described the scope and importance of the problem. Then in the next post, he introduced the idea of the LWM, or Less Wrong Metric, and the basic mathematical framework for calculating LWMs. Most recently, Mike talked about choosing parameters for the LWM, and drilled down to a fundamental question: (how) do we identify good research?

Let me say up front that I am fully convinced of the seriousness of the problem of evaluating researchers fairly. It is a question of direct and timely importance to me. I serve on the Promotion & Tenure committees of two colleges at Western University of Health Sciences, and I want to make good decisions that can be backed up with evidence. But anyone who has been in academia for long knows of people who have had their careers mangled by getting caught in institutional machinery that is not well suited to fairly evaluating scholarship. So I desperately want better metrics to catch on, to improve my own situation and that of researchers everywhere.

For all of those reasons and more, I admire the work that Mike has done in conceiving the LWM. But I’m pretty pessimistic about its future.

I think there is a widespread misapprehension that we got here because people and institutions were looking for good metrics, like the LWM, and we ended up with things like impact factors and citation counts because no-one had thought up anything better. Implying a temporal sequence of:

1. Deliberately looking for metrics to evaluate researchers.
2. Finding some.
3. Trying to improve those metrics, or replace them with better ones.

I’m pretty sure this is exactly backwards: the metrics that we use to evaluate researchers are mostly simple – easy to explain, easy to count (the hanky-panky behind impact factors notwithstanding) – and therefore they spread like wildfire, and therefore they became used in evaluation. Implying a very different sequence:

1. A metric is invented, often for a reason completely unrelated to evaluating researchers (impact factors started out as a way for librarians to rank journals, not for administration to rank faculty!).
2. Because a metric is simple, it becomes widespread.
3. Because a metric is both simple and widespread, it makes it easy to compare people in wildly different circumstances (whether or not that comparison is valid or defensible!), so it rapidly evolves from being trivia about a researcher, to being a defining character of a researcher – at least when it comes to institutional evaluation.

If that’s true, then any metric aimed at wide-scale adoption needs to be as simple as possible. I can explain the h-index or i10-index in one sentence. “Citation count” is self-explanatory. The fundamentals of the impact factor can be grasped in about 30 seconds, and even the complicated backstory can be conveyed in about 5 minutes.
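To see just how low that bar is, here is a minimal sketch of the h-index in Python (the citation counts are invented for illustration) – the entire definition fits comfortably in a few lines:

```python
def h_index(citation_counts):
    """The largest h such that at least h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical example: five papers with these citation counts give h = 3.
print(h_index([10, 8, 5, 2, 1]))  # -> 3
```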

In addition to being simple, the metric needs to work the same way across institutions and disciplines. I can compare my h-index with that of an endowed chair at Cambridge, a curator at a small regional museum, and a postdoc at Podunk State, and it Just Works without any tinkering or subjective decisions on the part of the user (other than What Counts – but that affects all metrics dealing with publications, so no one metric is better off than any other on that score).

I fear that the LWM as conceived in Taylor (2016) is doomed, for the following reasons:

  • It’s too complex. It would probably be doomed if it had just a single term with a constant and an exponent (which I realize would defeat the purpose of having either a constant or an exponent), because that’s more math than either an impact factor or an h-index requires (in perception, anyway – in the real world, most people’s eyes glaze over when the exponents come out).
  • Worse, it requires loads of subjective decisions and assignments of importance on the part of the users.
  • And fatally, it would require a mountain of committee work to sort all of that out. I doubt if I could get the faculty in just one department to agree on a set of terms, constants, and exponents for the LWM, much less a college, much less a university, much less all of the universities, museums, government and private labs, and other places where research is done. And without the promise of universal applicability, there’s no incentive for any institution to put itself through the hellish amount of work it would take to implement.

Really, the only way I think the LWM could get into place is by fiat, by a government body. If the EPA comes up with a more complicated but also more accurate way to measure, say, airborne particle output from car exhausts, they can theoretically say to the auto industry, “Meet this standard or stop selling cars in the US” (I know there’s a lot more legislative and legal push and pull than that, but it’s at least possible). And such a standard might be adopted globally, either because it’s a good idea so it spreads, or because the US strong-arms other countries into following suit.

Even if I trusted the US Department of Education to fill in all of the blanks for an LWM, I don’t know that they’d have the same leverage. I doubt that the DofE has enough sway to get an LWM adopted even across all of the educational institutions. Who would want that fight, for such a nebulous pay-off? And even if it could be successfully inflicted on educational institutions (which sounds negative, but that’s precisely how the institutions would see it), what about the numerous and in some cases well-funded research labs and museums that don’t fall under the DofE’s purview? And that’s just in the US. The culture of higher education and scholarship varies a lot among countries. Which may be why the one-size-fits-all solutions suck – I am starting to wonder if a metric needs to be broken to be globally applicable.

The problem here is that the user base is so diverse that the only way metrics get adopted is voluntarily. So the challenge for any LWM is to be:

  1. Better than existing metrics – this is the easy part – and,
  2. Simple enough to be both easily grasped and applied with minimal effort. In Malcolm Gladwell’s Tipping Point terms, it needs to be “sticky”. Although a better adjective for passage through the intestines of academia might be “smooth” – that is, having no rough edges, like exponents or overtly subjective decisions*, that would cause it to snag.

* Calculating an impact factor involves plenty of subjective decisions, but it has the advantages that (a) the users can pretend otherwise, because (b) ISI does the ‘work’ for them.

At least from my point of view, the LWM as Mike has conceived it is awesome and possibly unimprovable on the first point (in that practically any other metric could be seen as a degenerate case of the LWM), but dismal and possibly pessimal on the second one, in that it requires mounds of subjective decision-making to work at all. You can’t even get a default number and then iteratively improve it without investing heavily in advance.

An interesting thought experiment would be to approach the problem from the other side: invent as many new simple metrics as possible, and then see if any of them offer advantages over the existing ones. Although I have a feeling that people are already working on that, and have been for some time.

Simple, broken metrics like impact factor are the prions of scholarship. Yes, viruses are more versatile and cells more versatile still, by orders of magnitude, but compared to prions, cells take an awesome amount of effort to build and maintain. If you just want to infect someone and you don’t care how, prions are very hard to beat. And they’re so subtle in their machinations that we only became aware of them comparatively recently – much like the emerging problems with “classical” (i.e., non-alt) metrics.

I’d love to be wrong about all of this. I proposed the strongest criticism of the LWM I could think of, in hopes that someone would come along and tear it down. Please start swinging.

You’ll remember that in the last installment (before Matt got distracted and wrote about archosaur urine), I proposed a general schema for aggregating scores in several metrics, terming the result an LWM or Less Wrong Metric. Given a set of n metrics that we have scores for, we introduce a set of n exponents eᵢ which determine how we scale each kind of score as it increases, and a set of n factors kᵢ which determine how heavily we weight each scaled score. Then we sum the scaled results:

LWM = k₁·x₁^e₁ + k₂·x₂^e₂ + … + kₙ·xₙ^eₙ
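In code form, the schema is nearly a one-liner. Here is a minimal sketch in Python; the metric values, weights, and exponents are invented purely for illustration:

```python
def lwm(xs, ks, es):
    """Less Wrong Metric: the sum of k_i * x_i ** e_i over all n metrics."""
    return sum(k * x**e for x, k, e in zip(xs, ks, es))

# Hypothetical example with two metrics: h-index and Twitter followers,
# with the followers heavily down-weighted and square-rooted.
xs = [20, 10000]        # metric scores
ks = [1.0, 0.1]         # weighting factors k_i
es = [1.5, 0.5]         # scaling exponents e_i
print(lwm(xs, ks, es))  # 20**1.5 + 0.1 * 10000**0.5 = ~99.4
```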

“That’s all very well”, you may ask, “But how do we choose the parameters?”

Here’s what I proposed in the paper:

One approach would be to start with subjective assessments of the scores of a body of researchers – perhaps derived from the faculty of a university confidentially assessing each other. Given a good-sized set of such assessments, together with the known values of the metrics x₁, x₂, …, xₙ for each researcher, techniques such as simulated annealing can be used to derive the values of the parameters k₁, k₂, …, kₙ and e₁, e₂, …, eₙ that yield an LWM formula best matching the subjective assessments.

Where the results of such an exercise yield a formula whose results seem subjectively wrong, this might flag a need to add new metrics to the LWM formula: for example, a researcher might be more highly regarded than her LWM score indicates because of her fine record of supervising doctoral students who go on to do well, indicating that some measure of this quality should be included in the LWM calculation.

I think as a general approach that is OK: start with a corpus of well understood researchers, or papers, whose value we’ve already judged a priori by some means; then pick the parameters that best approximate that judgement; and let those parameters control future automated judgements.
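As a rough sketch of what that fitting could look like in practice – assuming Python with NumPy and SciPy, invented stand-in data, and SciPy’s dual_annealing as the annealing scheme:

```python
import numpy as np
from scipy.optimize import dual_annealing

# Invented stand-in data: 50 researchers, 3 metrics each, plus a
# subjective score for each researcher (e.g. averaged confidential
# peer assessments on a 0-10 scale).
rng = np.random.default_rng(42)
X = rng.uniform(1, 100, size=(50, 3))       # metric values x_1..x_3
subjective = rng.uniform(0, 10, size=50)    # target assessments

def lwm(params, X):
    """LWM scores given params = [k_1..k_n, e_1..e_n]."""
    n = X.shape[1]
    ks, es = params[:n], params[n:]
    return (ks * X**es).sum(axis=1)

def loss(params):
    """Squared mismatch between LWM scores and subjective assessments."""
    return ((lwm(params, X) - subjective) ** 2).sum()

n = X.shape[1]
bounds = [(0.0, 10.0)] * n + [(0.1, 3.0)] * n   # ranges for k_i, then e_i
result = dual_annealing(loss, bounds, seed=1)
print(result.x)   # fitted k_1..k_n and e_1..e_n
```

Whether the fitted parameters mean anything, of course, depends entirely on the quality of those subjective scores – which is exactly the problem the rest of this post turns to.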

The problem, really, is how we make that initial judgement. In the scenario I originally proposed, where, say, the 50 members of a department each assign a confidential numeric score to all the others, you can rely to some degree on the wisdom of crowds to give a reasonable judgement. But I don’t know how politically difficult it would be to conduct such an exercise. Even if the individual scorers were anonymised, the person collating the data would know the total scores awarded to each person, and it’s not hard to imagine that data being abused. In fact, it’s hard to imagine it not being abused.

In other situations, the value of the subjective judgement may be close to zero anyway. Suppose we wanted to come up with an LWM that indicates how good a given piece of research is. We choose LWM parameters based on the scores that a panel of experts assign to a corpus of existing papers, and derive our parameters from that. But we know that experts are really bad at assessing the quality of research. So what would our carefully parameterised LWM be approximating? Only the flawed judgement of flawed experts.

Perhaps this points to an even more fundamental problem: do we even know what “good research” looks like?

It’s a serious question. We all know that “research published in high-Impact Factor journals” is not the same thing as good research. We know that “research with a lot of citations” is not the same thing as good research. For that matter, “research that results in a medical breakthrough” is not necessarily the same thing as good research. As the new paper points out:

If two researchers run equally replicable tests of similar rigour and statistical power on two sets of compounds, but one of them happens to have in her batch a compound that turns out to have useful properties, should her work be credited more highly than the similar work of her colleague?

What, then? Are we left only with completely objective measurements, such as statistical power, adherence to the COPE code of conduct, open-access status, or indeed correctness of spelling?

If we accept that (and I am not arguing that we should, at least not yet), then I suppose we don’t even need an LWM for research papers. We can just count these objective measures and call it done.

I really don’t know what my conclusions are here. Can anyone help me out?

[Photos: an ostrich, a cormorant, and an alligator peeing]

Stand by . . . grumpy old man routine compiling . . . 

So, someone at Sony decided that an Angry Birds movie would be a good idea, about three years after the Angry Birds “having a moment” moment was over. There’s a trailer for it now, and at the end of the trailer, a bird pees for like 17 seconds (which is about 1/7 of my personal record, but whatever).

And now I see these Poindexters all over the internet pushing their glasses up their noses and typing, “But everyone knows that birds don’t pee! They make uric acid instead! That’s the white stuff in ‘bird poop’. Dur-hur-hur-hurrr!” I am reasonably sure these are the same people who harped on the “inaccuracy” of the peeing Postosuchus in Walking With Dinosaurs two decades ago. (Honestly, how I didn’t get this written and posted in our first year of blogging is quite beyond me.)

Congratulations, IFLScientists, on knowing One Fact about nature. Tragically for you, nature knows countless facts, and among them are that birds and crocodilians can pee. And since extant dinosaurs can and do pee, extinct ones probably could as well.

So, you know . . . try to show a little respect.

Now, it is true that crocs (mostly) and birds (always?) release more of their nitrogenous waste as uric acid than as urea. But their bodies produce both compounds. So does yours. We mammals are just shifted waaaay more heavily toward urea than uric acid, and extant archosaurs – and many (but not all) other reptiles to boot – are shifted waaaay more heavily toward uric acid than urea. Alligators also make a crapload of ammonia, but that’s a story for another time.

BUT, crucially, birds and crocs almost always release some clear, watery, urea-containing fluid when they dump the whitish uric acid, as shown in this helpful diagram that I stole from International Cockatiel Resource:

International Cockatiel Resource bird pee guide

If you’ve never seen this, you’re just not getting to the bird poop fast enough – the urine is drying up before you notice it. Pick up the pace!

Sometimes birds and crocs save up a large quantity of fluid, and then flush everything out of their cloacas and lower intestines in one shot, as shown in the photos dribbled through this post. That has led to some erroneous reports that ostriches have urinary bladders. They don’t; they just back up lots of urine into their colons. Many birds recapture some water and minerals that way, and thereby concentrate their wastes and save water – basically using the colon as a sort of second-stage kidney (Skadhauge 1976).

Rhea peeing by Markus Bühler

Many thanks to Markus Bühler for permission to post his well-timed u-rhea photo.

[UPDATE the next day: To be perfectly clear, all that’s going on here is that the birds and crocs keep their cloacal sphincters closed. The kidneys keep on producing urine and uric acid, and with no way out (closed sphincter) and nowhere else to go (no bladder – although urinary bladders have evolved repeatedly in lizards), the pee backs up into the colon. So if you’re wondering if extinct dinosaurs needed some kind of special adaptation to be able to pee, the answer is no. Peeing is an inherent possibility, and in fact the default setting, for any reptile that can keep its cloaca shut.]

Aaaanyway, all those white urate solids tend to make bird pee more whitish than yellow, as shown in the photos. I have seen a photo of an ostrich making a good solid stream from cloaca to ground that was yellow, but that was years ago and frustratingly I haven’t been able to relocate it. Crocodilians seem to have no problem making a clear, yellowish pee-stream, as you can see in many hilarious YouTube videos of gators peeing on herpetologists and reporters, which I am putting at the bottom of this post so as not to break up the flow of the rant.

ostrich excreting

You can explore this “secret history” of archosaur pee by entering the appropriate search terms into Google Scholar, where you’ll find papers with titles like:

  • “Technique for the collection of clear urine from the Nile crocodile (Crocodylus niloticus)” (Myburgh et al. 2012)
  • “Movement of urine in the lower colon and cloaca of ostriches” (Duke et al. 1995)
  • “Plasma homeostasis and cloacal urine composition in Crocodylus porosus caught along a salinity gradient” (Grigg 1981)
  • “Cloacal absorption of urine in birds” (Skadhauge 1976)
  • “The cloacal storage of urine in the rooster” (Skadhauge 1968)

I’ve helpfully highlighted the operative term, to reinforce the main point of the post. Many of these papers are freely available – get the links from the References section below. A few are paywalled – really, Elsevier? $31.50 for a half-century-old paper on chicken pee? – but I’m saving them up, and I’ll be happy to lend a hand to other scholars who want to follow this stream of inquiry. If you’re really into the physiology of birds pooling pee in their poopers, the work of Erik Skadhauge will be a gold mine.

Now, to be fair, I seriously doubt that any bird has ever peed for 17 seconds. But the misinformation abroad on the net seems to be more about whether birds and other archosaurs can pee at all, rather than whether a normal amount of bird pee was exaggerated for comedic effect in the Angry Birds trailer.

ostrich excreting 3

In conclusion, birds and crocs can pee. Go tell the world.

And now, those gator peeing videos I promised:

UPDATE

Jan. 30, 2016: I just became aware that I had missed one of the best previous discussions of this topic, with one of the best videos, and the most relevant citations! The post is this one, by Brian Switek, which went up almost two years ago; the video is this excellent shot of an ostrich urinating and then defecating immediately after:

…and the citations are McCarville and Bishop (2002) – an SVP poster about a possible sauropod pee-scour, which I knew about but didn’t mention because I was saving it for a post of its own – and Fernandes et al. (2004) on some very convincing trace fossils of dinosaurs peeing on sand, from the Lower Cretaceous of Brazil. In addition to being cogent and well-illustrated, the Fernandes et al. paper has the lovely attribute of being freely available, here.

So, sorry, Brian, that I’d missed your post!

And for everyone else, stand by for another dinosaur pee post soon. And here’s one more video of an ostrich urinating (not pooping as the video title implies). The main event starts about 45 seconds in.

References

I said last time that my new paper on Better ways to evaluate research and researchers proposes a family of Less Wrong Metrics, or LWMs for short, which I think would at least be an improvement on the present ubiquitous use of impact factors and H-indexes.

What is an LWM? Let me quote the paper:

The Altmetrics Manifesto envisages no single replacement for any of the metrics presently in use, but instead a palette of different metrics laid out together. Administrators are invited to consider all of them in concert. For example, in evaluating a researcher for tenure, one might consider H-index alongside other metrics such as number of trials registered, number of manuscripts handled as an editor, number of peer-reviews submitted, total hit-count of posts on academic blogs, number of Twitter followers and Facebook friends, invited conference presentations, and potentially many other dimensions.

In practice, it may be inevitable that overworked administrators will seek the simplicity of a single metric that summarises all of these.

This is a key problem of the world we actually live in. We often bemoan the fact that people evaluating research will apparently do almost anything other than actually read the research. (To paraphrase Dave Barry, these are important, busy people who can’t afford to fritter away their time in competently and diligently doing their job.) There may be good reasons for this; there may only be bad reasons. But what we know for sure is that, for good reasons or bad, administrators often do want a single number. They want it so badly that they will seize on the first number that comes their way, even if it’s as horribly flawed as an impact factor or an H-index.

What to do? There are two options. One is to change the way these overworked administrators function, to force them to read papers and consider a broad range of metrics — in other words, to change human nature. Yeah, it might work. But it’s not where the smart money is.

So perhaps the way to go is to give these people a better single number. A less wrong metric. An LWM.

Here’s what I propose in the paper.

In practice, it may be inevitable that overworked administrators will seek the simplicity of a single metric that summarises all of these. Given a range of metrics x₁, x₂, …, xₙ, there will be a temptation to simply add them all up to yield a “super-metric”, x₁ + x₂ + … + xₙ. Such a simply derived value will certainly be misleading: no-one would want a candidate with 5,000 Twitter followers and no publications to appear a hundred times stronger than one with an H-index of 50 and no Twitter account.

A first step towards refinement, then, would be to weight each of the individual metrics using a set of constant parameters k₁, k₂, …, kₙ to be determined by judgement and experiment. This yields another metric, k₁·x₁ + k₂·x₂ + … + kₙ·xₙ. It allows the down-weighting of less important metrics and the up-weighting of more important ones.

However, even with well-chosen kᵢ parameters, this better metric has problems. Is it really a hundred times as good to have 10,000 Twitter followers as 100? Perhaps we might decide that it’s only ten times as good – that the value of a Twitter following scales with the square root of the count. Conversely, in some contexts at least, an H-index of 40 might be more than twice as good as one of 20. In a search for a candidate for a senior role, one might decide that the value of an H-index scales with its square; or perhaps it scales somewhere between linearly and quadratically – with H-index^1.5, say. So for full generality, the calculation of the “Less Wrong Metric”, or LWM for short, would be configured by two sets of parameters: factors k₁, k₂, …, kₙ, and exponents e₁, e₂, …, eₙ. Then the formula would be:

LWM = k₁·x₁^e₁ + k₂·x₂^e₂ + … + kₙ·xₙ^eₙ

So that’s the idea of the LWM — and you can see now why I refer to this as a family of metrics. Given n metrics that you’re interested in, you pick 2n parameters to combine them with, and get a number that to some degree measures what you care about.
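To make the progression concrete with the toy numbers from the quoted passages (a sketch; the particular weights and exponents are invented for illustration):

```python
# Candidate A: 5,000 Twitter followers, no publications.
# Candidate B: H-index of 50, no Twitter account.
followers_a, h_a = 5000, 0
followers_b, h_b = 0, 50

# Naive super-metric: A looks a hundred times stronger than B.
print(followers_a + h_a, followers_b + h_b)   # 5000 vs 50

# Weighted and exponent-scaled LWM: square-root the follower count and
# down-weight it, up-scale the H-index. Now B comes out far ahead.
lwm_a = 0.1 * followers_a**0.5 + 1.0 * h_a**1.5   # ~7.1
lwm_b = 0.1 * followers_b**0.5 + 1.0 * h_b**1.5   # ~353.6
print(lwm_a, lwm_b)
```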

(How do you choose your 2n parameters? That’s the subject of the next post. Or, as before, you can skip ahead and read the paper.)

References

Like Stephen Curry, we at SV-POW! are sick of impact factors. That’s not news. Everyone now knows what a total disaster they are: how they are significantly correlated with retraction rate but not with citation count; how they are higher for journals whose studies are less statistically powerful; how they incentivise bad behaviour including p-hacking and over-hyping. (Anyone who didn’t know all that is invited to read Brembs et al.’s 2013 paper Deep impact: unintended consequences of journal rank, and weep.)

It’s 2016. Everyone who’s been paying attention knows that impact factor is a terrible, terrible metric for the quality of a journal, a worse one for the quality of a paper, and not even in the ballpark as a metric for the quality of a researcher.

Unfortunately, “everyone who’s been paying attention” doesn’t seem to include such figures as search committees picking people for jobs, department heads overseeing promotion, tenure committees deciding on researchers’ job security, and I guess granting bodies. In the comments on this blog, we’ve been told time and time and time again — by people who we like and respect — that, however much we wish it weren’t so, scientists do need to publish in high-IF journals for their careers.

What to do?

It’s a complex problem, not well suited to discussion on Twitter. Here’s what I wrote about it recently:

The most striking aspect of the recent series of Royal Society meetings on the Future of Scholarly Scientific Communication was that almost every discussion returned to the same core issue: how researchers are evaluated for the purposes of recruitment, promotion, tenure and grants. Every problem that was discussed – the disproportionate influence of brand-name journals, failure to move to more efficient models of peer-review, sensationalism of reporting, lack of replicability, under-population of data repositories, prevalence of fraud – was traced back to the issue of how we assess works and their authors.

It is no exaggeration to say that improving assessment is literally the most important challenge facing academia.

This is from the introduction to a new paper which came out today: Taylor (2016), Better ways to evaluate research and researchers. In eight short pages — six, really, if you ignore the appendix — I try to get to grips with the historical background that got us to where we are, I discuss some of the many dimensions we should be using to evaluate research and researchers, and I propose a family of what I call Less Wrong Metrics — LWMs — that administrators could use if they really absolutely have to put a single number on things.

(I was solicited to write this by SPARC Europe, I think in large part because of things I have written around this subject here on SV-POW! My thanks to them: this paper becomes part of their Briefing Papers series.)

Next time I’ll talk about the LWM and how to calculate it. Those of you who are impatient might want to read the actual paper first!

References

Notocolossus is a beast

January 20, 2016

Notocolossus skeletal recon - Gonzalez Riga et al 2016 fig 1

(a) Type locality of Notocolossus (indicated by star) in southern-most Mendoza Province, Argentina. (b) Reconstructed skeleton and body silhouette in right lateral view, with preserved elements of the holotype (UNCUYO-LD 301) in light green and those of the referred specimen (UNCUYO-LD 302) in orange. Scale bar, 1 m. (González Riga et al. 2016: figure 1)

This will be all too short, but I can’t let the publication of a new giant sauropod pass unremarked. Yesterday Bernardo González Riga and colleagues published a nice, detailed paper describing Notocolossus gonzalezparejasi, “Dr. Jorge González Parejas’s southern giant”, a new titanosaur from the Late Cretaceous of Mendoza Province, Argentina (González Riga et al. 2016). The paper is open access and freely available to the world.

As you can see from the skeletal recon, there’s not a ton of material known from Notocolossus, but among giant sauropods it’s actually not bad, being better represented than Argentinosaurus, Puertasaurus, Argyrosaurus, and Paralititan. In particular, one hindfoot is complete and articulated, and a good chunk of the paper and supplementary info are devoted to describing how weird it is.

But let’s not kid ourselves – you’re not here for feet, unless it’s to ask how many feet long this monster was. So how big was Notocolossus, really?

Well, it wasn’t the world’s largest sauropod. And to their credit, no-one on the team that described it has made any such superlative claims for the animal. Instead they describe it as “one of the largest terrestrial vertebrates ever discovered”, and that’s perfectly accurate.

Notocolossus limb bones - Gonzalez Riga et al 2016 fig 4

(a) Right humerus of the holotype (UNCUYO-LD 301) in anterior view. Proximal end of the left pubis of the holotype (UNCUYO-LD 301) in lateral (b) and proximal (c) views. Right tarsus and pes of the referred specimen (UNCUYO-LD 302) in (d) proximal (articulated, metatarsus only, dorsal [=anterior] to top), (e) dorsomedial (articulated), and (f) dorsal (disarticulated) views. Abbreviations: I–V, metatarsal/digit number; 1–2, phalanx number; ast, astragalus; cbf, coracobrachialis fossa; dpc, deltopectoral crest; hh, humeral head; ilped, iliac peduncle; of, obturator foramen; plp, proximolateral process; pmp, proximomedial process; rac, radial condyle; ulc, ulnar condyle. Scale bars, 20 cm (a–c), 10 cm (d–f). (González Riga et al. 2016: figure 4)

Any discussions of the size of Notocolossus will be driven by one of two elements: the humerus and the anterior dorsal vertebra. The humerus is 176 cm long, which is shorter than those of Giraffatitan (213 cm), Brachiosaurus (204 cm), and Turiasaurus (179 cm), but longer than those of Paralititan (169 cm), Dreadnoughtus (160 cm), and Futalognkosaurus (156 cm). Of course we don’t have a humerus for Argentinosaurus or Puertasaurus, but based on the 250-cm femur of Argentinosaurus, the humerus was probably somewhere around 200 cm. Hold that thought.

Notocolossus and Puertasaurus dorsals compared

Top row: my attempt at a symmetrical Notocolossus dorsal, made by mirroring the left half of the fossil from the next row down. Second row: photos of the Notocolossus dorsal with missing bits outlined, from González Riga et al. (2016: fig. 2). Scale bar is 20 cm (in original). Third row: the only known dorsal vertebra of Puertasaurus, scaled to about the same size as the Notocolossus vertebra, from Novas et al. (2005: fig. 2).

The anterior dorsal tells a similar story, and this is where I have to give González Riga et al. some props for publishing such detailed sets of measurements in their supplementary information. They Measured Their Damned Dinosaur. The dorsal has a preserved height of 75 cm – it’s missing the tip of the neural spine and would have been a few cm taller in life – and by measuring the one complete transverse process and doubling it, the authors estimate that when complete it would have been 150 cm wide. That is 59 inches, almost 5 feet. The only wider vertebra I know of is the anterior dorsal of Puertasaurus, at a staggering 168 cm wide (Novas et al. 2005). The Puertasaurus dorsal is also quite a bit taller dorsoventrally, at 106 cm, and it has a considerably larger centrum: 43 x 60 cm, compared to 34 x 43.5 cm for Notocolossus (anterior centrum diameters, height x width).

Centrum size is an interesting parameter. Because centra are so rarely circular, arguably the best way to compare across taxa would be to measure the max area (or, since centrum ends are also rarely flat, the max cross-sectional area). It’s late and this post is already too long, so I’m not going to do that now. But I have been keeping an informal list of the largest centrum diameters among sauropods – and, therefore, among all Terran life – and here they are (please let me know if I missed anyone):

  • 60 cm – Argentinosaurus dorsal, MCF-PVPH-1, Bonaparte and Coria (1993)
  • 60 cm – Puertasaurus dorsal, MPM 10002, Novas et al. (2005)
  • 51 cm – Ruyangosaurus cervical and dorsal, 41HIII-0002, Lu et al. (2009)
  • 50 cm – Alamosaurus cervical, SMP VP−1850, Fowler and Sullivan (2011)
  • 49 cm – Apatosaurus ?caudal, OMNH 1331 (pers. obs.)
  • 49 cm – Supersaurus dorsal, BYU uncatalogued (pers. obs.)
  • 46 cm – Dreadnoughtus dorsal, MPM-PV 1156, Lacovara et al. (2014: Supplementary Table 1) – thanks to Shahen for catching this one in the comments!
  • 45.6 cm – Giraffatitan presacral, Fund no 8, Janensch (1950: p. 39)
  • 45 cm – Futalognkosaurus sacral, MUCPv-323, Calvo et al. (2007)
  • 43.5 cm – Notocolossus dorsal, UNCUYO-LD 301, González Riga et al. (2016)

(Fine print: I’m only logging each taxon once, by its largest vertebra, and I’m not counting the dorsoventrally squashed Giraffatitan cervicals which get up to 47 cm wide, and the “uncatalogued” Supersaurus dorsal is one I saw back in 2005 – it almost certainly has been catalogued in the interim.) Two things impress me about this list: first, it’s not all ‘exotic’ weirdos – look at the giant Oklahoma Apatosaurus hanging out halfway down the list. Second, Argentinosaurus and Puertasaurus pretty much destroy everyone else by a wide margin. Notocolossus doesn’t seem so impressive in this list, but it’s worth remembering that the “max” centrum diameter here is from one vertebra, which was likely not the largest in the series – then again, the same is true for Puertasaurus, Alamosaurus, and many others.

Notocolossus phylogeny - Gonzalez Riga et al 2016 fig 5

(a) Time-calibrated hypothesis of phylogenetic relationships of Notocolossus with relevant clades labelled. Depicted topology is that of the single most parsimonious tree of 720 steps in length (Consistency Index = 0.52; Retention Index = 0.65). Stratigraphic ranges (indicated by coloured bars) for most taxa follow Lacovara et al.[4]: fig. 3 and references therein. Additional age sources are as follows: Apatosaurus[55], Cedarosaurus[58], Diamantinasaurus[59], Diplodocus[35], Europasaurus[35], Ligabuesaurus[35], Neuquensaurus[60], Omeisaurus[55], Saltasaurus[60], Shunosaurus[55], Trigonosaurus[35], Venenosaurus[58], Wintonotitan[59]. Stratigraphic ranges are colour-coded to also indicate geographic provenance of each taxon: Africa (excluding Madagascar), light blue; Asia (excluding India), red; Australia, purple; Europe, light green; India, dark green; Madagascar, dark blue; North America, yellow; South America, orange. (b–h) Drawings of articulated or closely associated sauropod right pedes in dorsal (=anterior) view, with respective pedal phalangeal formulae and total number of phalanges per pes provided (the latter in parentheses). (b) Shunosaurus (ZDM T5402, reversed and redrawn from Zhang[45]); (c) Apatosaurus (CM 89); (d) Camarasaurus (USNM 13786); (e) Cedarosaurus (FMNH PR 977, reversed from D’Emic[32]); (f) Epachthosaurus (UNPSJB-PV 920, redrawn and modified from Martínez et al.[22]); (g) Notocolossus; (h) Opisthocoelicaudia (ZPAL MgD-I-48). Note near-progressive decrease in total number of pedal phalanges and trend toward phalangeal reduction on pedal digits II–V throughout sauropod evolutionary history (culminating in phalangeal formula of 2-2-2-1-0 [seven total phalanges per pes] in the latest Cretaceous derived titanosaur Opisthocoelicaudia). Abbreviation: Mya, million years ago. For institutional abbreviations, see Supplementary Information. (González Riga et al. 2016: figure 5)

As for the estimated mass of Notocolossus, González Riga et al. (2016) did their due diligence. The sections on mass estimation in the main text and supplementary information are very well done – lucid, modest, and fair. Rather than try to summarize the good bit, I’ll just quote it. Here you go, from page 7 of the main text:

The [humeral] diaphysis is elliptical in cross-section, with its long axis oriented mediolaterally, and measures 770 mm in minimum circumference. Based on that figure, the consistent relationship between humeral and femoral shaft circumference in associated titanosaurian skeletons that preserve both of these dimensions permits an estimate of the circumference of the missing femur of UNCUYO-LD 301 at 936 mm (see Supplementary Information). (Note, however, that the dataset that is the source of this estimate does not include many gigantic titanosaurs, such as Argentinosaurus[5], Paralititan[16], and Puertasaurus[11], since no specimens that preserve an associated humerus and femur are known for these taxa.) In turn, using a scaling equation proposed by Campione and Evans[20], the combined circumferences of the Notocolossus stylopodial elements generate a mean estimated body mass of ~60.4 metric tons, which exceeds the ~59.3 and ~38.1 metric ton masses estimated for the giant titanosaurs Dreadnoughtus and Futalognkosaurus, respectively, using the same equation (see Supplementary Information). It is important to note, however, that subtracting the mean percent prediction error of this equation (25.6% of calculated mass[20]) yields a substantially lower estimate of ~44.9 metric tons for UNCUYO-LD 301. Furthermore, Bates et al.[21] recently used a volumetric method to propose a revised maximum mass of ~38.2 metric tons for Dreadnoughtus, which suggests that the Campione and Evans[20] equation may substantially overestimate the masses of large sauropods, particularly giant titanosaurs. Unfortunately, however, the incompleteness of the Notocolossus specimens prohibits the construction of a well-supported volumetric model of this taxon, and therefore precludes the application of the Bates et al.[21] method. The discrepancies in mass estimation produced by the Campione and Evans[20] and Bates et al.[21] methods indicate a need to compare the predictions of these methods across a broad range of terrestrial tetrapod taxa[21]. Nevertheless, even if the body mass of the Notocolossus holotype was closer to 40 than 60 metric tons, this, coupled with the linear dimensions of its skeletal elements, would still suggest that it represents one of the largest land animals yet discovered.
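To put rough numbers on that passage: the sketch below uses the widely quoted form of the Campione and Evans (2012) stylopodial equation – an assumption on my part, since their paper offers more than one implementation and González Riga et al. don’t spell out which variant they ran – together with the circumferences and the 25.6% prediction error from the quote:

```python
import math

# Stylopodial circumferences from the quoted passage (mm).
c_humerus = 770
c_femur_estimated = 936   # the femur itself is missing

# Commonly quoted Campione & Evans (2012) scaling equation (assumed form):
# log10(body mass in g) = 2.754 * log10(C_humerus + C_femur) - 1.097
log_mass_g = 2.754 * math.log10(c_humerus + c_femur_estimated) - 1.097
print(10**log_mass_g / 1e6)   # ~64 t; in the neighbourhood of the ~60.4 t reported

# Subtracting the equation's mean percent prediction error (25.6%),
# as in the quote, from the paper's own estimate:
print(60.4 * (1 - 0.256))     # ~44.9 t, matching the quoted lower figure
```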

So, nice work all around. As always, I hope we get more of this critter someday, but until then, González Riga et al. (2016) have done a bang-up job describing the specimens they have. Both the paper and the supplementary information will reward a thorough read-through, and they’re free, so go have fun.

References

I was a bit disappointed to hear David Attenborough on BBC Radio 4 this morning, while trailing a forthcoming documentary, telling the interviewer that you can determine the mass of an extinct animal by measuring the circumference of its femur.

We all know what he was alluding to, of course: the idea first published by Anderson et al. (1985) that if you measure the life masses of lots of animals, then measure their long-bone circumferences when they’ve died, you can plot the two measurements against each other, find a best-fit line, and extrapolate it to estimate the masses of dinosaurs based on their limb-bone measurements.

[Figure 1 from Anderson et al. (1985): dinosaur masses plotted against limb-bone circumference]

This approach has been extensively refined since 1985, most recently by Benson et al. (2014), but the principle is the same.
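The principle is simple enough to sketch in a few lines of Python with NumPy (the numbers below are invented stand-ins, not Anderson et al.’s data):

```python
import numpy as np

# Invented example: known masses (kg) and femoral circumferences (mm)
# for a handful of extant animals.
circumference = np.array([80, 150, 300, 450, 600])
mass = np.array([25, 180, 1500, 5000, 12000])

# Best-fit line in log-log space: log10(mass) = a * log10(C) + b.
a, b = np.polyfit(np.log10(circumference), np.log10(mass), 1)

# Extrapolate to a hypothetical dinosaur-sized femur.
dino_c = 900   # mm
print(10 ** (a * np.log10(dino_c) + b))   # estimated mass, kg
```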

But the thing is, as Anderson et al. and other authors have made clear, the error-bars on this method are substantial. It’s not super-clear in the image above (Fig. 1 from the Anderson et al. paper) because log-10 scales are used, but the 95% confidence interval is about 42 pixels tall, compared with 220 pixels for an order of magnitude (i.e. an increment of 1.0 on the log-10 scale). That means the interval is 42/220 ≈ 0.2 of an order of magnitude. That’s a factor of 10^0.2 ≈ 1.58. In other words you could have two animals with equally robust femora, one of them nearly 60% heavier than the other, and they would both fall within the 95% confidence interval.
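For what it’s worth, the arithmetic in that paragraph checks out (pixel counts as given in the text):

```python
# 95% confidence interval height vs. one order of magnitude, in pixels.
interval_px, decade_px = 42, 220

log10_factor = interval_px / decade_px   # ~0.19 of an order of magnitude
print(10 ** log10_factor)                # ~1.55; rounding to 0.2 gives 1.58
```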

I’m surprised that someone as experienced and knowledgeable as Attenborough would perpetuate the idea that you can measure mass with any precision in this way (even more so when using only a femur, rather than the femur+humerus combo of Anderson et al.).

More: when the presenter told him that not all scientists buy the idea that the new titanosaur is the biggest known, he said that came as a surprise. Again, it’s disappointing that the documentary researchers didn’t make Attenborough aware of, for example, Paul Barrett’s cautionary comments or Matt Wedel’s carefully argued dissent. Ten minutes of simple research would have found this post — for example, it’s Google’s fourth hit for “how big is the new argentinian titanosaur”. I can only hope that the actual documentary, which screens on Sunday 24 January, doesn’t present the new titanosaur’s mass as a known and agreed number.

(To be clear, I am not blaming Attenborough for any of this. He is a presenter, not a palaeontologist, and should have been properly prepped by the researchers for the programme he’s fronting. He is also what can only be described as 89, so should be forgiven if he’s not quite as quick on his feet when confronted with an interviewer as he used to be.)

Update 1 (the next day)

Thanks to Victoria Arbour for pointing out an important reference that I missed: it was Campione and Evans (2012) who expanded Anderson et al.’s dataset and came up with the revised equation that Benson et al. used.

Update 2 (same day as #1)

It seems most commenters are inclined to go with Attenborough on this. That’s a surprise to me — I wonder whether he’s getting a free pass because of who he is. All I can say is that as I listened to the segment it struck me as really misleading. You can listen to it for yourself here if you’re in the UK; otherwise you’ll have to make do with this transcript:

“It’s surprising how much information you can get from just one bone. I mean for example that thigh bone, eight feet or so long, if you measure the circumference of that, you will be able to say how much weight that could have carried, because you know what the strength of bone is. So the estimate of weight is really pretty accurate and the thought is that this is something around over seventy tonnes in weight.”

(Note also that the Anderson et al./Campione and Evans method has absolutely nothing to do with the strength of bone.)

Also of interest was this segment that followed immediately:

How long it was depends on whether you think it held its neck out horizontally or vertically. If it held it out horizontally, well then it would be about half as big again as the Diplodocus, which is the dinosaur that’s in the hall of the Natural History Museum. It would be absolutely huge.

Interviewer: And how tall, if we do all the dimensions?

Ah well that is again the question of how it holds its neck, and it could have certainly reached up about to the size of a four or five storey building.

Needless to say, the matter of neck posture is very relevant to our interests. I don’t want to read too much into a couple of throwaway comments, but the implication does seem to be that this is an issue that the documentary might spend some time on. We’ll see what happens.

References