I’ve begun the process of paring down the database into the form for final analysis (but remember, of course, that the full, unaltered data live on). So far, I have deleted entries that fit into the following categories:
- Extraneous entries for “combined” specimen data sets. In other words, only the combined entry will be used for analysis. One specimen, one data point.
- Specimens comprising isolated elements (e.g., isolated humeri or femora). Because we’re analyzing skeletal proportions primarily, isolated elements aren’t terribly useful.
- Individual measurements marked with a “+”. These represent measurements “as preserved” rather than estimates of original length.
Very soon we will want to make a final decision on what to do with species represented by multiple specimens. One strategy (which I think was used by Carrano 2006, if I remember correctly) is to use only the largest specimen. Here, the benefits are that juvenile specimens are pretty automatically excluded. The downside is that the largest specimens may not be the most complete.
A second strategy is to average all of the entries together. Of course, we would have to be careful on this. For instance, when we’re using ratios, we’ll want to calculate the ratio first, and then average the ratio. We don’t want to calculate, for instance, a humerus:femur ratio from the averaged measurements. Here’s why.
Let’s pretend we have specimens A and B. Specimen A has a humerus and femur length of 100 and 50, respectively. This gives a humerus:femur ratio of 2.0. Specimen B has a humerus and femur length of 50 and 100, respectively, for a humerus:femur ratio of 0.50. If we were to average the humeri and femora first, we would get average lengths of 75 each, which then results in a ratio of 1! I hope it’s apparent that this “ratio of averages” doesn’t accurately reflect what’s going on. Furthermore, it’s quite different from the “average of ratios,” which weighs in at 1.25.
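For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two approaches, using the made-up specimens A and B from the paragraph above. The function names are just illustrative, not part of any project tooling.

```python
from statistics import mean

# (humerus length, femur length) for the two pretend specimens in the text
specimens = {"A": (100.0, 50.0), "B": (50.0, 100.0)}


def ratio_of_averages(pairs):
    """Average each bone first, then take a single humerus:femur ratio."""
    humeri = [h for h, _ in pairs]
    femora = [f for _, f in pairs]
    return mean(humeri) / mean(femora)


def average_of_ratios(pairs):
    """Take the humerus:femur ratio per specimen, then average the ratios."""
    return mean(h / f for h, f in pairs)


pairs = list(specimens.values())
print(ratio_of_averages(pairs))  # 1.0
print(average_of_ratios(pairs))  # 1.25
```

The second function is the one that matches the strategy proposed here: one ratio per specimen, then a species mean.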
The advantage of averaging values for all specimens in a species is that we can better incorporate individual variation, and also better deal with incomplete specimens. We need to be cautious of averaging in cases of extreme size variation for a single species (hence, part of why it’s desirable to use ratios). Here, it may be worthwhile still to discard known juveniles.
Want to see the work in progress? Check it out here.
Thoughts? Let’s hear from you in the comments section!
A huge thank you to all of the volunteers who helped to organize multiple entries for individual specimens into a single combined entry. This was an unglamorous task, but an important one for subsequent analysis. We got 243 combined specimen entries in under two weeks’ time. In no particular order (other than alphabetically by first name), Doug Henning, Jr., Henrique Niza, Jay Fitzsimmons, John Dziak, Rob Taylor, and William Banks Miller did a fantastic job heading up the effort. As a result, we now have 1,898 individual verified lines of data (including combined entries). That’s pretty darned amazing!
Also, kudos to all of the individuals who have been participating in the discussion on the blog over the past weeks. This sort of environment is exactly what we (Matt, Mike, and I) wanted to foster when we started the project. The sheer breadth and depth of expertise among the contributors is pretty impressive! And for those who don’t feel like “experts”, it’s certainly OK to chime in. Everyone’s comments are welcomed. Anyone can do science.
Anyone can do science – this firm belief is part of why we started the Open Dinosaur Project. In fact, as Matt noted some time ago, there is a whole world of “citizen science” opportunities out there! If you’re addicted to the idea of citizen science, and want to learn about other projects in this vein, head on over to scienceforcitizens.net. They’ve got a whole directory of opportunities in all scientific fields in which you can participate!
Even better, there is a project page for the ODP, and a nifty little blog post by John Ohab with a video message from Matt and me (Mike’s over in England, and Matt and I practically live next door, so you’re stuck with only 2/3 of the project leads). John mentioned that scienceforcitizens.net (which is still in the beta stage, but looking quite nice) encourages all of us to create an account and even member blog posts about our experiences as citizen scientists. If you have a moment, go check it out!
As we finish up combining the data, it’s time to start thinking about the specific analyses that we’re going to do. What are the specific questions we’re asking? What are the techniques that we need to address the questions? Some excellent discussions between ODPers have been happening in one of the recent posts, and I was hoping to continue that here. In particular, I wanted to refocus the discussion on the project’s essential questions, and consider the types of analyses that we can use to answer each question. I’m just thinking out loud here (this is open notebook science, after all), and invite suggestions and discussion in the comments section. In particular, I’m referencing the “big questions” outlined in one of our first posts.
Why did ornithischians evolve quadrupedality multiple times?
I think this one is going to have to simply rely upon our interpretation of the data. After all, we can perhaps answer “how,” but the “why” can’t really be answered in this setting. So, it’s something to consider in the “discussion” section of the paper. But, see the next question. . .
Was the evolution of quadrupedality consistently associated with an increase in body size?
Here, we’re basically looking at evolutionary trends. In other words, can we detect a trend in body size within various ornithischian lineages? The more I think about this, the less I’m convinced we can directly answer the question (if you disagree, and have a solution, pipe up in the comments, please). One problem is the difficulty in knowing whether or not certain taxa were truly quadrupedal. So, where do you make the cut-off for quadrupedal vs. bipedal vs. both? In many cases we just don’t know. There’s a danger in circular reasoning, too (the limb bones look like it’s quadrupedal, so we call it quadrupedal, and then use it as an example of a quadrupedal taxon for analysis of limb bones).
But, I think we can detect trends across Ornithischia as a whole, and within specific lineages. For instance, is there a trend for increasing body size across ornithischians? Is there a trend for increasing body size within Ornithopoda? Ceratopsia? Thyreophora? In fact, Matt Carrano found a consistent and statistically significant increase in body size within ornithischians (and indeed, within most dinosaurs) when considering femoral measurements (go here to download a free PDF of Carrano, 2006). So, that makes this question a little less interesting (and indeed, less publishable, because it’s already been done). Do you think we should move it to the back burner? Or should we spin it in another way? Thoughts are welcome.
Did different groups of quadrupedal ornithischians arrive at this body form in similar ways, or did they have different strategies?
Here (as far as I know) is a genuinely novel question, and I think it’s the core of the ODP’s current phase. What we’re really saying (I think) is this: We know that thyreophorans, hadrosaurs, and ceratopsids independently evolved quadrupedal locomotion. Did each group have similar limb proportions, or were they different? I think this is where we’ll want to look at principal components analysis, at least as a starting point for data visualization. And, we’ll have to do that within a phylogenetic context. A recent paper by Liam Revell (2009) addressed how to do this (thanks to ODPer Randy Irmis for bringing up this paper; you can download it for free here – it’s well worth a read).
A second way to look at this question is to look for trends in certain structures – for instance, do the metacarpals tend to get elongated in each group (relative to the rest of the arm) as different clades became quadrupedal? Here, we might use a simple non-parametric correlation of the ratio with patristic distances (see the Carrano paper, again, and references therein, for a brief introduction to this method), to investigate that question within different lineages. Basically, patristic distance estimates the distance of a particular species from the base of the tree (by the number of branching points leading up to it). A taxon that split off early in a group’s evolution would have a low patristic distance, and vice versa for one that split off late in a group’s evolution. So, we might look at the correlation of metacarpal:arm length ratio to patristic distance for thyreophorans, hadrosaurs, and ceratopsians.
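To make the idea concrete, here is a rough sketch of a non-parametric (Spearman rank) correlation in plain Python. The patristic distances and metacarpal:arm ratios below are invented purely for illustration and are not ODP data. In practice a statistics package (for instance, SciPy’s `spearmanr`) would also give a p-value.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five taxa in one lineage (NOT real measurements)
patristic_distance = [1, 2, 3, 4, 5]
metacarpal_arm_ratio = [0.20, 0.22, 0.21, 0.25, 0.27]

rho = spearman(patristic_distance, metacarpal_arm_ratio)
print(round(rho, 3))  # 0.9
```

A rho near +1 in a given lineage would suggest the ratio tends to increase in later-branching taxa; running the same calculation separately for each clade would let us compare them.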
I think I’ll end here for now! Please add thoughts, suggestions, corrections, and anything else you think relevant in the comments. Next time, I’ll move on to the final issue, quantifying morphological disparity in ornithischian evolution.
Carrano, M. T. 2006. Body-size evolution in the Dinosauria. In M. T. Carrano, R. W. Blob, T. J. Gaudin & J. R. Wible (eds.), Amniote Paleobiology: Perspectives on the Evolution of Mammals, Birds, and Reptiles. University of Chicago Press, Chicago:225-268. Freely available here.
Revell, L. J. 2009. Size-correction and principal components for interspecific comparative studies. Evolution 63: 3258-3268. Freely available here.
We’re on the home stretch for combining specimen data. . .I just updated the spreadsheet (accessible, as always, here); feel free to edit as appropriate to combine all of the final entries. Note that I have temporarily removed the already combined entries, as well as the singletons.
The first combined entry has been left as an example. As before, please color the original data orange, and the combined line that you insert yellow.
Those who have contributed to the ODP over the last few months know that a single specimen might have measurements featured in 2, 3, 4, or more separate scientific papers. In order to keep data entry and verification as transparent as possible, we’ve included the presentation from each scientific paper as a separate entry. Now, though, it’s time to combine these separate entries into composite entries that can be analyzed as a single unit (see this post for how you can help).
But, we do face some real challenges in cobbling this information together. One major problem concerns different specimen numbers or museum abbreviations for the same specimen. For those who aren’t familiar with the museum world, every specimen in a museum gets a unique number. This helps us to keep track of the data with each specimen (not just measurements, but locality information, storage location, etc.). Rather than saying “that big T. rex skull on display in that big New York museum,” we just say “AMNH 5027”. This means that it’s specimen number 5027 at the American Museum of Natural History; there’s only one specimen with that number. Believe it or not, some people memorize such minutiae (maybe you’re one of them). I know the specimen numbers for most of the well-known ceratopsian skulls (just mention the phrase “YPM 1822”, and Triceratops prorsus springs to mind), but still have a tough time remembering my wife’s birthday. Believe me, I catch grief for that one.
At any rate. . .in some cases, it’s pretty easy to figure out multiple presentations of the same specimen. AMNH FR5240 (American Museum of Natural History Fossil Reptile #5240) is pretty certainly the same as AMNH 5240. There are just a few extra letters (to distinguish 5240 in the fossil reptile collection from 5240 in the modern fish collection, for instance).
Sometimes things get complicated. For instance, museums change names. The old “Geological Survey of Canada” specimens eventually became “National Museum of Canada” specimens, which then morphed into “Canadian Museum of Nature” specimens when the institution changed its name. So, the Chasmosaurus skeleton that started out as GSC 2245 became NMC 2245 became CMN 2245. “CMN” seems to be the abbreviation of choice nowadays, and luckily the specimen numbers stayed the same. Sometimes historic abbreviations are carried on through sheer inertia. For instance, “USNM” stands for “United States National Museum.” Yet, it hasn’t been called that in decades – today we know it as the “National Museum of Natural History” (or just “The Smithsonian” to most of the general public). But, for various reasons (including overlap in abbreviations with all of the other countries’ national museums), “USNM” still stands. When different publications use different abbreviations, we still have to sort out what’s going on.
Sometimes things get really complicated. Did you know that the Protoceratops skeleton listed as AMNH 6471 by Brown and Schlaikjer’s 1940 paper is now known as CM 9185? This happened when the specimen was sent from the American Museum of Natural History to the Carnegie Museum in Pittsburgh. The only reason I know of this is because Matt Carrano had noted this in one of his data entries, and also through a chance reading of a 1981 publication on dinosaurs of the Carnegie by Jack McIntosh.
And sometimes things get just flat-out twisted. Back in the day, the Royal Ontario Museum completely renumbered their fossil collection. What was once known as the Corythosaurus ROM 5505 is now ROM 845. The Lambeosaurus ROM 6474 is now called ROM 1218. Thankfully, some papers indicate the old and the new catalog numbers. But not always. There are measurements from old papers of certain specimens (e.g., ROM 5167 and ROM 5971, specimens of Edmontosaurus regalis and Prosaurolophus maximus, respectively) whose current catalog numbers just aren’t clear. So, we’ll either hope that someone out there reading this knows the current specimen number, or we’ll have to contact a curator at the museum to find out. (Feel free to chime in in the comments if you know the answer.)
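One way to keep track of all this is a simple lookup table. The sketch below is only an illustration of the idea, built from the examples mentioned above; the table and function names are made up, and treating “FR” as a droppable prefix is an assumption that applies only to AMNH here.

```python
import re

# Old institutional abbreviations and their current equivalents (from the text)
ABBREVIATION_UPDATES = {"GSC": "CMN", "NMC": "CMN"}

# Collection codes that can be dropped without ambiguity (assumption: AMNH FR only)
REDUNDANT_COLLECTION_CODES = {("AMNH", "FR")}

# Specimens that were transferred or renumbered (from the text)
SPECIMEN_ALIASES = {
    "AMNH 6471": "CM 9185",
    "ROM 5505": "ROM 845",
    "ROM 6474": "ROM 1218",
}

# Institution, optional collection code, then the catalog number
LABEL = re.compile(r"\s*([A-Za-z]+)\s+([A-Za-z]*)\s*(\d+)\s*")


def canonical_specimen(label):
    """Return the current specimen number for a label, where it is known."""
    match = LABEL.fullmatch(label)
    if match is None:
        return label.strip()  # unrecognized format: leave for manual review
    institution, collection, number = match.groups()
    institution, collection = institution.upper(), collection.upper()
    if (institution, collection) in REDUNDANT_COLLECTION_CODES:
        collection = ""
    institution = ABBREVIATION_UPDATES.get(institution, institution)
    canonical = f"{institution} {collection}{number}"
    return SPECIMEN_ALIASES.get(canonical, canonical)


print(canonical_specimen("AMNH FR5240"))  # AMNH 5240
print(canonical_specimen("GSC 2245"))     # CMN 2245
print(canonical_specimen("AMNH 6471"))    # CM 9185
```

Labels with no known alias (ROM 5167, for example) pass through unchanged, so they still need a human to track down the current number.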
These sorts of things are hugely important for the utility of our dataset, and we’re depending on each other to get these details ironed out. That’s the real strength of an open project like the ODP – anyone can contribute!
Have you been featured in the news, on a blog, or elsewhere? Let us know!
In order to streamline things during this time in the project (and in order to keep important notes from getting lost in other comment threads or email inboxes), I’ve created an “Errata” page. As it says there, this is an excellent place to post taxonomic suggestions, museum abbreviation updates, potential typos in the data, duplicate entries, etc. You can access it on the side bar, under Resources, with the link name of “Tasks: Found an Error?”
Thanks to contributor David Dreisigmeyer for the suggestion!
Thank you to everyone for an excellent discussion going on over at the previous post. It’s really helping to clarify a number of issues – and I appreciate all of the expertise being tossed in. This is what open science is all about. Of course, the discussion continues – keep the comments rolling in!
As mentioned, we want to have a way to combine duplicate (and non-duplicate) measurements from all of the different sources for each specimen into a single entry. For instance, the Ankylosaurus magniventris specimen AMNH 5214 has four separate entries. One entry presents humerus, femur, fibula, and metatarsal lengths, another one presents only femur lengths, and so on. And, there are multiple different values given for some measurements. For instance, the femur length is given variously as 560, 536, and 542 mm (whether referring to left, right, or an unspecified side). So, we want to condense those four entries into one for the sake of further analysis (keeping the original data safe and sound, in case anyone wants to go back to them).
There is no perfect strategy, but based on our previous discussions it’s looking like the best approach is to “average and combine.” As another example, let’s consider how this would work for the Psittacosaurus mongoliensis specimen AMNH 6538.
There are two entries for this specimen, and we’ll only take a look at subsets of these entries. Two tibia lengths are presented: one at 129 mm and the other at 125 mm. So, our combined entry would use the average of these, 127 mm. Only one of the two entries presents the fibula length (given as 121.4e). In this case, we’ll assume that the estimated measurement is accurate, and enter 121.4 as the combined value (in my general experience, most of these estimated values seem to be pretty darned close, and reflect a little bit missing at the end of the bone or a similar condition; of course, it’s up to everyone to keep their eyes on exceptions to this and flag them accordingly).
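Here is a small Python sketch of “average and combine” for this example. The column names are made up, and which of the two entries carries the fibula length is an arbitrary choice for illustration. Estimated values (marked “e”) are treated as valid, and anything marked “+” (an “as preserved” length) is skipped.

```python
from statistics import mean


def parse_measurement(raw):
    """Convert a spreadsheet cell to a length in mm, or None if unusable."""
    text = str(raw).strip()
    if not text or "+" in text:
        return None  # blank cell, or an "as preserved" minimum length
    text = text.strip("eE").strip()  # drop the "estimated" marker, keep the value
    return float(text)


def combine(entries):
    """Average each measurement across all entries for one specimen."""
    combined = {}
    columns = sorted({column for entry in entries for column in entry})
    for column in columns:
        values = [parse_measurement(entry.get(column, "")) for entry in entries]
        values = [v for v in values if v is not None]
        if values:
            combined[column] = mean(values)
    return combined


# Subset of the two AMNH 6538 entries described above (hypothetical layout)
entries = [
    {"tibia": "129", "fibula": "121.4e"},
    {"tibia": "125"},
]
print(combine(entries))  # {'fibula': 121.4, 'tibia': 127.0}
```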
I’ve begun to modify the spreadsheet, so that all specimens which can have combined entries have a line for this (thanks to John Dziak for noting this). As the ceratopsians and ankylosaurs are mostly together in terms of taxonomic updating (unless anyone else spots additional problems – please flag them if you do!), they’re the first targets for combination.
Here’s a proposed set of guidelines; if any other situations crop up, please post a comment and we can amend as appropriate. This is the sort of thing that will probably go into a Materials & Methods section in the paper.
Guidelines for Combining Multiple Entries for a Single Specimen
- If values for various sides are included, please average them all into a single measurement. For instance, if a left and right humerus (in the L L and R L columns) are noted, the average would go into the “L” column. If two measurements for left humeri are included (in the L L column), the average again should go into the “L” column. And so on. . .
- If a value is indicated as estimated (with an “e” before or after the number), it is appropriate to treat the measurement as valid (unless information indicates that the restoration is too extensive to trust the measurement).
- If, in a set of measurements, one or more values seem to be “off” (e.g., a case where femur length is given as 342, 339, and 402 mm, respectively), flag this entry. Here, we will probably go with the more likely values (342 and 339) and dump the 402 as an outlier.
- Each combined entry is indicated by yellow in the first few columns, and the word “combined” in the Reference column.
- The metatarsal and metacarpal columns are in “text” format (to avoid funky autoformatting of the L/R measurements to dates). So, you will have to adjust techniques accordingly.
- Once an entry is finished, the person who combined it puts their name in column CM (“Entry 1”) and marks the entire row as yellow.
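For the “flag this entry” guideline, something like the following could help spot suspicious values before averaging. It is only a sketch: the rule (flag anything more than 10% away from the median) and the 10% threshold are assumptions for illustration, not an agreed project standard.

```python
from statistics import median


def flag_outliers(values, tolerance=0.10):
    """Return values that differ from the median by more than `tolerance`.

    `tolerance` is a fraction of the median (0.10 means 10%); the default is
    an arbitrary starting point, not a project rule.
    """
    mid = median(values)
    return [v for v in values if abs(v - mid) > tolerance * mid]


print(flag_outliers([342, 339, 402]))  # [402]
```

With only two values the median is just their mean, so either both get flagged or neither does; in those cases a person still has to decide which (if either) to trust.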
How to Contribute
So, we’re looking for some volunteers to help combine entries. In the true spirit of crowdsourcing, the fully editable document is available here. Please make edits directly on the document (rather than downloading and resending it to me). Right now, the data down to row 531 are prepped and ready to combine, and I’ve taken care of the first few entries. As we resolve and clean up other areas of the database, those will pop up as available. As always, it’s important to check your work frequently and alert someone if you notice an error or inconsistency.
Thank you, and good luck!
Thanks to the hard work of a number of individuals, our big old measurement spreadsheet is nearly complete. Almost all of the relevant entries have been verified (aside from a few stragglers), and we’re ready to get serious about data analysis. It’s not too late to contribute to the verification effort – as a reminder, I’d like to close off submission of new entries (except for previously unpublished, original measurements), unless there is a very, very good reason. Don’t worry – you’ll get another chance to contribute later this year when we start Phase II!
Today I began the task of sorting out synonymies and specimen numbers in the database. As always, the latest version is available here (note: there is also a bare-bones CSV format snapshot, current as of 7 February 2010, available here). Before we can begin working any sort of statistical magic, we need to get the data into order. This includes:
- Making sure that all genus/species names are up to date. In general, we’ll use the latest taxonomic authority. The 2004 Dinosauria is a good start, and any more recent papers are also helpful for sorting things out. In some cases, it’s just going to take going to an expert. If you think a genus or species name should be updated, please post it in the comments.
- Making sure that all museum abbreviations are up to date. There is some variation from paper to paper in how museum abbreviations are listed, so we’ll want to get all of those clarified. For instance, all of the instances of NMC (National Museum of Canada) and GSC (Geological Survey of Canada) should get changed over to CMN (Canadian Museum of Nature).
- Combining duplicate entries for a single specimen into one. How do you think we should do this one? I’m thinking of doing an average of all measurements, but maintaining some leeway to discard a measurement that doesn’t seem right. For instance, if two sources cite femur length as 520 and 523 mm, and a third cites femur length as 783, I think we can safely toss out the latter. Thoughts or opinions? This is important, and is something that we’ll have to write up for the materials and methods portion of the paper.
- Combining duplicate entries for a single species into one. Again, how should we deal with this? We don’t really want to include multiple data points for a single species when doing our analyses (or do we?), because it adds erroneous degrees of freedom (bad from a statistical standpoint), among other things. There is a case for taking species means in some analyses, but again we need to be careful about how we average things. For instance, we probably want to toss out juveniles (in most cases). Does this mean only using the very largest specimen for a species? Or use only the specimen with the most complete data appropriate for a given analysis? Thoughts or opinions?
- Types of analyses. We should start thinking about the kinds of regressions, PCAs, etc. that we want to run. I expect that some bivariate plots similar to what we posted earlier might make their way in (e.g., humerus vs. femur length).
At this stage, it’s quite possible that we might catch some little errors that have crept into the data here and there. As always, please let someone know if this is the case! A comment on the blog is certainly appropriate.
So, please offer any input or advice that you might have. This might include species synonymies, museum abbreviation adjustments, opinions on data combination, etc. Every opinion counts!