Thanks to the hard work of a number of individuals, our big old measurement spreadsheet is nearly complete. Almost all of the relevant entries have been verified (aside from a few stragglers), and we’re ready to get serious about data analysis. It’s not too late to contribute to the verification effort. As a reminder, though, I’d like to close off submission of new entries, except for previously unpublished, original measurements, unless there is a very, very good reason. Don’t worry – you’ll get another chance to contribute later this year when we start Phase II!
Today I began the task of sorting out synonymies and specimen numbers in the database. As always, the latest version is available here (note: there is also a bare-bones CSV format snapshot, current as of 7 February 2010, available here). Before we can begin working any sort of statistical magic, we need to get the data in order. This includes:
- Making sure that all genus/species names are up to date. In general, we’ll use the latest taxonomic authority. The 2004 Dinosauria is a good start, and any more recent papers are also helpful for sorting things out. In some cases, it’s just going to take going to an expert. If you think a genus or species name should be updated, please post it in the comments.
- Making sure that all museum abbreviations are up to date. There is some variation from paper to paper in how museum abbreviations are listed, so we’ll want to get all of those clarified. For instance, all of the instances of NMC (National Museum of Canada) and GSC (Geological Survey of Canada) should get changed over to CMN (Canadian Museum of Nature).
- Combining duplicate entries for a single specimen into one. How do you think we should do this one? I’m thinking of taking an average of all measurements, but maintaining some leeway to discard a measurement that doesn’t seem right. For instance, if two sources cite femur length as 520 and 523 mm, and a third cites femur length as 783 mm, I think we can safely toss out the third value. Thoughts or opinions? This is important, and is something that we’ll have to write up for the materials and methods portion of the paper.
- Combining duplicate entries for a single species into one. Again, how should we deal with this? We don’t really want to include multiple data points for a single species when doing our analyses (or do we?), because, among other things, it inflates the degrees of freedom (bad from a statistical standpoint). There is a case for taking species means in some analyses, but again we need to be careful about how we average things. For instance, we probably want to toss out juveniles (in most cases). Does this mean using only the very largest specimen for a species? Or using only the specimen with the most complete data appropriate for a given analysis? Thoughts or opinions?
- Types of analyses. We should start thinking about the kinds of regressions, PCAs, etc. that we want to run. I expect that some bivariate plots similar to what we posted earlier might make their way in (e.g., humerus vs. femur length).
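To make the museum-abbreviation and duplicate-specimen items a bit more concrete, here is a minimal Python sketch of one way they could work. It is only an illustration to get the discussion going, not a settled method: the function names, the "NMC 1234" specimen number, and the 10% tolerance are all made up for the example, and the median-based rule for discarding a measurement is just one possible reading of "doesn’t seem right".

```python
from statistics import mean, median

# Outdated museum abbreviations mapped to the current one.
# Only the NMC/GSC -> CMN change comes from the post; extend as needed.
MUSEUM_SYNONYMS = {"NMC": "CMN", "GSC": "CMN"}


def normalize_specimen(specimen_id):
    """Swap an outdated museum prefix for the current one.

    Assumes IDs look like '<abbreviation> <number>', e.g. 'NMC 1234'
    (a made-up specimen number).
    """
    prefix, _, number = specimen_id.strip().partition(" ")
    current = MUSEUM_SYNONYMS.get(prefix, prefix)
    return f"{current} {number}".strip()


def combine_measurements(values, tolerance=0.10):
    """Average duplicate measurements of one specimen (same units).

    Values further than `tolerance` (a fraction of the median) from
    the median are discarded before averaging. The 10% default is an
    arbitrary placeholder. Returns None when nothing survives, which
    flags the specimen for a manual check.
    """
    if not values:
        return None
    mid = median(values)
    kept = [v for v in values if abs(v - mid) <= tolerance * mid]
    return mean(kept) if kept else None
```

With the femur example from the list, combine_measurements([520, 523, 783]) discards 783 and returns 521.5. When only two sources exist and they disagree badly, such as [520, 783], neither value is close to the median, so the function returns None rather than guessing which one is wrong.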
At this stage, it’s quite possible that we might catch some little errors that have crept into the data here and there. As always, please let someone know if this is the case! A comment on the blog is certainly appropriate.
So, please offer any input or advice that you might have. This might include species synonymies, museum abbreviation adjustments, opinions on data combination, etc. Every opinion counts!