Notes on Data Processing

I’ve begun the process of paring down the database into the form for final analysis (but remember, of course, that the full, unaltered data live on). So far, I have deleted entries that fit into the following categories:

  • Extraneous entries for “combined” specimen data sets. In other words, only the combined entry will be used for analysis. One specimen, one data point.
  • Specimens comprising isolated elements (e.g., isolated humeri or femora). Because we’re analyzing skeletal proportions primarily, isolated elements aren’t terribly useful.
  • Individual measurements marked with a “+”. These represent measurements “as preserved” rather than estimates of original length.
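
For anyone who wants to see the three rules spelled out, here is a minimal sketch in Python. The field names (specimen, superseded_by_combined, lengths) and the sample values are made up for illustration and are not the actual database columns. I'm also assuming that a “+” can appear anywhere in the measurement text, and that an “isolated element” specimen is one with fewer than two elements recorded.

```python
from typing import Optional


def parse_length(raw: str) -> Optional[float]:
    """Return a usable length, or None for blanks and '+' (as-preserved) values."""
    raw = raw.strip()
    if not raw or "+" in raw:
        return None
    return float(raw)


def pare_down(entries: list) -> list:
    """Apply the three deletion rules and return a new, pared-down list."""
    kept = []
    for entry in entries:
        # Rule 1: skip entries already folded into a "combined" entry.
        if entry["superseded_by_combined"]:
            continue
        # Rule 2: skip specimens consisting of a single isolated element.
        if len(entry["lengths"]) < 2:
            continue
        # Rule 3: drop individual "+" measurements, keep the rest.
        lengths = {}
        for element, raw in entry["lengths"].items():
            value = parse_length(raw)
            if value is not None:
                lengths[element] = value
        kept.append({"specimen": entry["specimen"], "lengths": lengths})
    return kept


# Hypothetical entries, for illustration only.
entries = [
    {"specimen": "A", "superseded_by_combined": False,
     "lengths": {"humerus": "100", "femur": "50+"}},
    {"specimen": "B", "superseded_by_combined": True,
     "lengths": {"humerus": "90", "femur": "45"}},
    {"specimen": "C", "superseded_by_combined": False,
     "lengths": {"femur": "80"}},
]
pared = pare_down(entries)
print(pared)
```

The function builds a new list rather than editing the input, in keeping with the idea that the full, unaltered data live on.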

Very soon we will want to make a final decision on what to do with species represented by multiple specimens. One strategy (which I think was used by Carrano 2006, if I remember correctly) is to use only the largest specimen. The benefit here is that juvenile specimens are more or less automatically excluded. The downside is that the largest specimen may not be the most complete.

A second strategy is to average all of the entries together. Of course, we would have to be careful with this. For instance, when we’re using ratios, we’ll want to calculate the ratio for each specimen first, and then average the ratios. We don’t want to calculate, for instance, a humerus:femur ratio from the averaged measurements. Here’s why.

Let’s pretend we have specimens A and B. Specimen A has a humerus and femur length of 100 and 50, respectively, which gives a humerus:femur ratio of 2.0. Specimen B has a humerus and femur length of 50 and 100, respectively, for a humerus:femur ratio of 0.50. If we were to average the humeri and femora first, we would get average lengths of 75 each, which results in a ratio of 1.0! I hope it’s apparent that this “ratio of averages” doesn’t accurately reflect what’s going on: neither specimen has proportions anywhere near 1:1. Furthermore, it’s quite different from the “average of ratios,” (2.0 + 0.50) / 2, which weighs in at 1.25.
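
The same arithmetic can be checked in a few lines of Python. This is only a sketch using the two pretend specimens from the paragraph above; the variable names are mine.

```python
from statistics import mean

# (humerus length, femur length) for the two pretend specimens
specimens = {"A": (100.0, 50.0), "B": (50.0, 100.0)}

# Ratio of averages: average each element first, then divide.
mean_humerus = mean(h for h, _ in specimens.values())
mean_femur = mean(f for _, f in specimens.values())
ratio_of_averages = mean_humerus / mean_femur

# Average of ratios: compute each specimen's ratio, then average.
average_of_ratios = mean(h / f for h, f in specimens.values())

print(ratio_of_averages)  # 1.0
print(average_of_ratios)  # 1.25
```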

The advantage of averaging values for all specimens in a species is that we can better incorporate individual variation, and also better deal with incomplete specimens. We need to be cautious about averaging in cases of extreme size variation within a single species (which is part of why it’s desirable to use ratios). Here, it may still be worthwhile to discard known juveniles.
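
As a rough sketch of how the averaging strategy might look in practice, the Python below computes a ratio per specimen, skips known juveniles and any specimen missing one of the two elements, and only then averages within each species. The record layout (species, juvenile, lengths), the function name, and the sample values are all hypothetical, and nothing here is settled methodology for the project.

```python
from collections import defaultdict
from statistics import mean


def species_mean_ratio(specimens, numerator="humerus", denominator="femur"):
    """Average per-specimen ratios within each species."""
    ratios = defaultdict(list)
    for s in specimens:
        # Assumption: known juveniles are discarded before averaging.
        if s.get("juvenile"):
            continue
        num = s["lengths"].get(numerator)
        den = s["lengths"].get(denominator)
        # A specimen missing either element contributes nothing to this
        # ratio; lengths are never mixed across specimens.
        if num is None or den is None:
            continue
        ratios[s["species"]].append(num / den)
    return {species: mean(values) for species, values in ratios.items()}


# Hypothetical records, for illustration only.
records = [
    {"species": "Species X", "lengths": {"humerus": 100.0, "femur": 50.0}},
    {"species": "Species X", "lengths": {"humerus": 50.0, "femur": 100.0}},
    {"species": "Species X", "juvenile": True,
     "lengths": {"humerus": 10.0, "femur": 30.0}},
    {"species": "Species X", "lengths": {"humerus": 80.0}},
    {"species": "Species Y", "lengths": {"humerus": 60.0, "femur": 120.0}},
]
result = species_mean_ratio(records)
print(result)  # {'Species X': 1.25, 'Species Y': 0.5}
```

An incomplete specimen is left out of only the ratios it cannot support, so it can still contribute to any other ratio for which it preserves both elements.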

Want to see the work in progress? Check it out here.

Thoughts? Let’s hear from you in the comments section!


11 Responses to Notes on Data Processing

  1. Rob Taylor says:

    Had I a bit more time this weekend I might see if I could back this up with a look at the actual data, but I certainly have a sense that using the largest specimen approach would not work to our advantage given our particular data set. If I had to cast a vote today (based on the exposure I’ve had to the data thus far), I’d say drop the juvies/sub-adults and go with the average method. (Will be curious to learn whether others are in agreement, though!)

  2. John Dziak says:

    I think that you and Dr. Taylor are probably right to suggest using species averages — I’d hate for you to have to throw out all but the largest of each species! — but I’m a little concerned about how to combine measurements if some are missing. If Specimen A has a humerus measurement but not a femur, and Specimen B has a femur measurement but not a humerus, do you just put them together? Might that not give you ratios that would not be found in a single individual — sort of a digital bone bed? If each species had either one or many specimens I wouldn’t worry about it, but what if you just had one big male and one small female, etc.?

  3. William Miller says:

    For the purpose of the main questions, discarding juveniles definitely sounds very important (though they might still be useful for later projects) – not just that their measurements will be much smaller but their proportions will also be quite different.

    So perhaps, in addition to excluding things that are explicitly listed as juveniles, we should drop ones that are more than 50% (or whatever number, I’m not sure what would be good) smaller than the biggest specimen of that species?

    And how do we ensure what’s the same species, anyway (and thus what to average)? It seems that a lot of them are taxonomically in question; several times in the combining phase the same specimen was assigned to different species. Do all the [Genus X] sp.’s get averaged together, or does each Stegosaurus sp. specimen (for example) get treated as a separate ‘species’ (and thus not averaged with anything else)?

  4. Andy Farke says:

    @Rob – I think you are correct; using only the largest individual will skew our data set pretty badly.
    @John – yes, we want to avoid “Frankenstein ratios” in our analysis. So, we’d want to calculate the ratios for individual specimens and then average the ratios. An alternative, I suppose, would be to impute data for an individual specimen by scaling from the same element in a conspecific taxon. I want to avoid creating data as much as possible, and we may not want to incorporate it into the main analysis, but perhaps this would be a way to look at incomplete specimens that occupy “interesting” portions of the cladogram.

  5. Andy Farke says:

    William Miller :

    And how do we ensure what’s the same species, anyway (and thus what to average)? It seems that a lot of them are taxonomically in question; several times in the combining phase the same specimen was assigned to different species. Do all the [Genus X] sp.’s get averaged together, or does each Stegosaurus sp. specimen (for example) get treated as a separate ‘species’ (and thus not averaged with anything else)?

    This will be a perpetual issue, and we’re guaranteed that at least one specimen will get reassigned to another species at some point in the future. I would say that we go with the most recent credible opinion on the species for some of these, and also use personal judgement. As you noted, Stegosaurus has been a particularly problematic taxon. The latest published opinion (by Maidment et al.) is that within North America, we have only two species: S. mjosi and S. armatus. Stegosaurus stenops is, by the latest published opinion, subsumed into S. armatus. For items just identified to genus, we’ll probably deal with them on a case-by-case basis. Usually, I expect we’ll exclude them. Going back to Stegosaurus, many of those are juvenile specimens, so they’ll get weeded out anyhow.

    Of course, we’re relying on everyone’s expertise (whether based on reading or unpublished knowledge) to ensure the accuracy of this whole operation.

  6. William Miller says:

    There are an awful lot of identified-only-to-genus specimens, though; it seems like we’d lose a lot of data by excluding them. Still, I can’t think of anything better.

    What about some of the specimens that are [Genus] sp. according to one paper and identified to species by another? Go with the more detailed identification, or the more recent, or what?

  7. Mike Taylor says:

    I think we’d do much better aggregating to genus level than to species level. Although an argument can be made that, for extant taxa, species are “real” in a sense that genera are not, that is not true at all of extinct animals: both species and genera are defined only on the basis of morphological similarity, so there is no conceptual difference between them. Something like 80% of Mesozoic dinosaur genera are monospecific anyway, so in most cases it’ll make little or no practical difference, but aggregating to genus level means that we can use all those specimens not identified to the species level.

  8. Andy Farke says:

    William Miller :

    There are an awful lot of identified-only-to-genus specimens, though; it seems like we’d lose a lot of data by excluding them. Still, I can’t think of anything better.

    What about some of the specimens that are [Genus] sp. according to one paper and identified to species by another? Go with the more detailed identification, or the more recent, or what?

    As you’ll see on the (to-be-posted) final averaged list, the number of genus-only specimens is actually quite small, once we exclude all of those isolated femora, etc.
    For specimens identified only to genus in one paper but to species in another: case-by-case, but probably the most recent identification (unless we have compelling evidence otherwise).

  9. Andy Farke says:

    @Mike: I’m partially in agreement with you here, but I think we’ll want to be very wary of “dustbin” taxa like Camptosaurus or Iguanodon. But, a lot of such specimens are out of the mix once we exclude specimens known only from isolated elements.

  10. John Dziak says:

    Andy Farke :
    @John – yes, we want to avoid “Frankenstein ratios” in our analysis. So, we’d want to calculate the ratios for individual specimens and then average the ratios. An alternative, I suppose, would be to impute data for an individual specimen by scaling from the same element in a conspecific taxon. I want to avoid creating data as much as possible, and we may not want to incorporate it into the main analysis, but perhaps this would be a way to look at incomplete specimens that occupy “interesting” portions of the cladogram.

    Thanks very much for the clarification! I didn’t understand at first what you were getting at by averaging ratios, but now I fully agree with your approach. I don’t like the idea of imputing data either, except in a principled multiple imputation approach, which might be too unwieldy for this paper.

  11. Pingback: Paring Down the Data « The Open Dinosaur Project
