What’s Next?

Thanks to the hard work of a number of individuals, our big old measurement spreadsheet is nearly complete. Almost all of the relevant entries have been verified (aside from a few stragglers), and we’re ready to get serious about data analysis. It’s not too late to contribute to the verification effort – as a reminder, I’d like to close off submission of new entries (except for previously unpublished, original measurements), unless there is a very, very good reason. Don’t worry – you’ll get another chance to contribute later this year when we start Phase II!

Today I began the task of sorting out synonymies and specimen numbers in the database. As always, the latest version is available here (note: there is also a bare-bones CSV format snapshot, current as of 7 February 2010, available here). Before we can begin working any sort of statistical magic, we need to get the data into order. This includes:

  • Making sure that all genus/species names are up to date. In general, we’ll use the latest taxonomic authority. The 2004 Dinosauria is a good start, and any more recent papers are also helpful for sorting things out. In some cases, it will simply take consulting an expert. If you think a genus or species name should be updated, please post it in the comments.
  • Making sure that all museum abbreviations are up to date. There is some variation from paper to paper in how museum abbreviations are listed, so we’ll want to get all of those clarified. For instance, all of the instances of NMC (National Museum of Canada) and GSC (Geological Survey of Canada) should get changed over to CMN (Canadian Museum of Nature).
  • Combining duplicate entries for a single specimen into one. How do you think we should do this one? I’m thinking of doing an average of all measurements, but maintaining some leeway to discard a measurement that doesn’t seem right. For instance, if two sources cite femur length as 520 and 523 mm, and a third cites femur length as 783, I think we can safely toss out the latter. Thoughts or opinions? (A rough sketch of one possible combining rule appears after this list.) This is important, and is something that we’ll have to write up for the materials and methods portion of the paper.
  • Combining duplicate entries for a single species into one. Again, how should we deal with this? We don’t really want to include multiple data points for a single species when doing our analyses (or do we?), because it adds erroneous degrees of freedom (bad from a statistical standpoint), among other things. There is a case for taking species means in some analyses, but again we need to be careful about how we average things. For instance, we probably want to toss out juveniles (in most cases). Does this mean only using the very largest specimen for a species? Or use only the specimen with the most complete data appropriate for a given analysis? Thoughts or opinions?
  • Types of analyses. We should start thinking about the kinds of regressions/PCAs/etc. that we want to run. I expect that some bivariate plots similar to what we posted earlier might make their way in (e.g., humerus vs. femur length).
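
For the duplicate-specimen question above, here is a rough sketch of one possible combining rule in Python. The specimen number, column names, and the 20%-of-median cutoff are purely hypothetical placeholders, not a settled protocol – this is exactly the sort of thing we need to decide together:

    import pandas as pd

    def combine_duplicates(df, id_col="Specimen", value_col="FemurLength_mm"):
        """Collapse multiple measurements of one specimen into a single value,
        discarding entries that stray too far from the median."""
        def robust_mean(values, tol=0.2):
            values = values.dropna()
            med = values.median()
            # Keep only measurements within tol (here 20%) of the median
            keep = values[(values - med).abs() <= tol * med]
            return keep.mean()
        return df.groupby(id_col)[value_col].apply(robust_mean)

    # Hypothetical example: two close femur lengths and one implausible one
    data = pd.DataFrame({"Specimen": ["XYZ 123"] * 3,
                         "FemurLength_mm": [520, 523, 783]})
    print(combine_duplicates(data))  # 783 is dropped; result is the mean of 520 and 523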

Note: As discussed in the comments, no data will truly be “tossed out” – we’re maintaining the primary archive of data as is. Any deletions or combinations will be done on a second copy of the data.

At this stage, it’s quite possible that we might catch some little errors that have crept into the data here and there. As always, please let someone know if this is the case! A comment on the blog is certainly appropriate.

So, please offer any input or advice that you might have. This might include species synonymies, museum abbreviation adjustments, opinions on data combination, etc. Every opinion counts!


44 Responses to What’s Next?

  1. emanuel tschopp says:

    Hi OPDs

    I don’t know exactly where to put comments on the spreadsheet, so I try here 😉
    There are some double entries of the same specimen with different specimen numbers in Stegosauria. The SMA specimen numbers Maidment used in her 2008 paper are wrong. SMA S01 = SMA RCR0603; SMA V03 = SMA 0018; and SMA L02, I guess, should be SMA 0092, but I still have to verify that. I’ve been working at the SMA for quite some years now, and there’s a paper in review on SMA 0018 (Christiansen & Tschopp) in which we point out the incorrect number Maidment used, at least for this specimen.

    By the way: where are the sauropods?

  2. Mike Taylor says:

    I would have expected to keep each specimen’s data distinct, at least for most analyses. Apart from anything else, that is one way that we might catch taxonomic mistakes, e.g. discovering that the proportions of a hindlimb assigned to Corythosaurus actually match those of Lambeosaurus specimens.

  3. David Dreisigmeyer says:

    First off, my background in data analysis is mostly exploratory and data-driven. I don’t care what the pattern is in the data, as long as there is a useful pattern I can find to achieve whatever my current goal is (typically some sort of anomaly or fault condition). While potentially interesting for uncovering unexpected relationships, this viewpoint may not align with specific hypothesis testing.

    Here’s what I would suggest:

    1) Combining duplicate entries for a single specimen into one
    I probably would not do this (beyond throwing out obviously bad measurements) unless there is a compelling reason to believe that one measurement is correct. All that this represents is ‘experimental’ error and should be handled by whoever is doing the data analysis. Maybe someone wants to do a study on how bone measurements can vary on a single specimen (pulling an example out of thin air here).

    2) Combining duplicate entries for a single species into one
    I would strongly recommend not doing this. Someone may be interested in comparing single-species measurements, or juveniles versus adults. Also, if they wish to combine or remove single-species measurements a priori, the burden should fall on the analyst. When you have raw data, the preprocessing should be done at analysis time. One never knows now what will be of interest in the future, and once that data is gone, it is likely gone forever. For me, it’s been frustrating in the past dealing with this sort of missing information that was originally deemed unimportant but would now really help out on my current (always non-paleontological) analysis.

    3) Types of analysis
    One very interesting option would be to perform a Nonnegative Matrix Factorization (versus say a PCA) on this data. This is a nice way to cluster data and remove important ‘pieces’ of information one by one. PCA orders the data from most to least energy assuming your data is Gaussian. So it attempts to find projections of your data onto some normal distribution. But one might suspect that the interesting aspects would be contained in the departures from normality. This is the idea behind, say, factor analysis. One could also consider Support Vector Machines. I am not arguing that PCA shouldn’t be tried. In fact, owing to its simplicity and wide use, it would be the first thing I would try, progressing onto increasingly more ‘sophisticated’ techniques as needed.

  4. John Dziak says:

    Some statistical analyses allow data to be nested (e.g., individuals within species). For example, GEE is a further generalization of generalized linear models (e.g., multiple linear regression, logistic regression, etc.) which allows the data to be in clusters. I don’t know of a clustered principal component analysis though.
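
    For instance, a nested (specimens-within-species) regression with GEE might look something like this in statsmodels – just a rough sketch, with hypothetical file and column names:

        import pandas as pd
        import statsmodels.api as sm
        import statsmodels.formula.api as smf

        df = pd.read_csv("odp_measurements.csv")   # hypothetical: one row per specimen
        # Specimens are clustered within species; use an exchangeable working correlation
        model = smf.gee("Tibia ~ Femur", groups="Species", data=df,
                        cov_struct=sm.cov_struct.Exchangeable(),
                        family=sm.families.Gaussian(),
                        missing="drop")            # drop rows with missing values
        print(model.fit().summary())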

  5. John Dziak says:

    P.S. I think that the disagreements between you and David Dreisigmeyer about what to combine or throw out might represent different ideas about what it means to throw out (delete permanently vs. exclude in analysis). You might need to have two versions of the data file, one with all the information and one with only “clean” data (e.g., combining duplicates and getting rid of obviously wrong information).

  6. David Dreisigmeyer says:

    John’s likely correct on having different ideas about throwing out data. I like his suggestion of having two data files.

  7. Andy Farke says:

    Yes, that’s absolutely correct – we will never, ever throw out data. No matter what data handling protocol is adopted, the originally submitted data will always remain intact. Edits will all happen in a second file. But, I’m glad you brought that up.

  8. Andy Farke says:

    Thanks, Emanuel! That is really, really helpful!!!! We’ll be able to update the sheet accordingly in the next day or two.

    Sauropods should be following later this year. . .we’re just doing one clade at a time (aside from a few non-ornithischians that we’ve included in the data for comparison). 🙂

  9. Andy Farke says:

    Quick question for David & John – I’m not terribly familiar with all of the methods you bring up, but how well do they play with missing data? One of the problems with the data set is that there are a *lot* of missing entries (both on account of the fossil record and on account of authors who published only the measurements they were interested in). Even if we go with something basic, like only specimens with humerus, ulna, MC III, femur, tibia, and MT III, we might have 20 or fewer specimens with all of the relevant measurements. I know that it’s possible to use various techniques (regression, geometric mean, etc.) to fill empty data cells, but we probably don’t want to do this too much! Any thoughts?

  10. Andy Farke says:

    There are definite arguments both ways, and in some respects it depends on the type of analysis we’re doing. A clade-wide, uncorrected plot of femur vs. tibia, for instance, could probably be done with the specimen-level data. But here’s an extreme hypothetical example that reflects my concerns. . .say we’re plotting femur vs. tibia, and we have 25 Stegosaurus armatus specimens, 5 ankylosaurs, 10 hadrosaurs, 6 ceratopsians, and 10 other ornithischians. Wouldn’t the results be driven largely by the S. armatus data?

    And in reference to the analyses that account for phylogeny (which we definitely should do), we really don’t have any choice but to use species-level data points. I suppose one could make each specimen an OTU and set it with a very small branch length from the species node. . .anyone know of precedent for doing this?

  11. John Dziak says:

    I’ll look into the missing data issue. It’s no problem for GEE. By the way, a more manageable alternative to having two datasets would be to have an extra column with codes, like:
    1.) Unique entry to be used in analyses
    2.) Aggregate of multiple entries on the same SPECIMEN (not original data, but to be used in analyses)
    3.) Duplicate entry on a specimen, to be kept but not used in analyses
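
    Then anyone running an analysis could just filter on that column, e.g. (the file name, column name, and codes here are hypothetical):

        import pandas as pd

        df = pd.read_csv("odp_measurements.csv")          # hypothetical working copy of the sheet
        analysis_rows = df[df["EntryCode"].isin([1, 2])]  # unique entries + specimen aggregates
        archived_rows = df[df["EntryCode"] == 3]          # duplicates kept for the record only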

  12. Andy Farke says:

    I like it! That way we aren’t juggling multiple spreadsheets, but we still maintain the uniqueness of the original entries.

  13. David Dreisigmeyer says:

    Missing data could be a problem for most of the methods I listed. I know PCA has been extended to this case, but I’m not quite sure how effective this is. I’ll look into this problem and contact some people.

    Question: Should we consider the effects of missing not at random here? Knowing that one measurement is missing may allow you to predict if another one will be missing because, e.g., the individual bones were close to each other in life. This may be nitpicking though.

  14. John Dziak says:

    There could also be a column for Juvenile, I suppose — or perhaps even for “life stage” (perinate, juvenile, late juvenile — maybe embryo someday)

  15. David Dreisigmeyer says:

    While their data are different, they seem to have the same problem with respect to missing data, even after binning:

    http://www.sciencemag.org/cgi/content/abstract/321/5895/1485

    http://rsbl.royalsocietypublishing.org/content/4/6/733.abstract

  16. Rob Taylor says:

    I agree that this is an excellent suggestion, and in fact I often use this approach when there’s a need to produce aggregate records for an analysis. The only question I’d have is around standardizing specimen numbers. For example, if you have an essentially unique entry that features an outdated specimen number, would you create a new entry? Or would it be better to simply add an additional column to the spreadsheet where the current specimen numbers could be stored? (It occurs to me that the latter might actually facilitate the aggregation process.)

  17. John Dziak says:

    I wouldn’t worry too much about the missing data in the context of a regression-type analysis. The “not at random” missingness that causes serious bias is the kind where whether Y is missing depends on the value of Y itself. If whether Y is missing depends only on X, then it is still considered “missing at random,” although not “missing completely at random,” and you can still use standard techniques.

    I guess one could argue that smaller dinosaurs have smaller bones which are more likely to be missing, so trying to find, say, “an average tibia size for all dinosaurs” would be tricky and couldn’t be done with just an unweighted sample mean. But it seems to me that wouldn’t be a problem if you were trying to get a model to predict tibia size from fibula size, or to correlate them.

    If you do have missingness not at random then a multiple imputation technique can be used to reduce bias. Some advisers of mine, Joe Schafer and Linda Collins, have studied these techniques although they are not my area of expertise. I don’t think you really need one here.

    I don’t know very much about how missing data affects a PCA or factor analysis. That might depend on what assumptions you make. I’m sure people have written about this, at least in the social sciences.

  18. William Miller says:

    Possibly a stupid question: but for Euoplocephalus tutus, the GenusFinal and SpeciesFinal are listed as Dyoplosaurus acutosquameus. I find Dyoplosaurus described in 1924, Euoplocephalus in 1910, so shouldn’t the GenusFinal be Euoplocephalus – which has priority – if we are synonymizing them? A paper (at http://www.bioone.org/doi/abs/10.1671/039.029.0405) suggests that they are likely different genera after all, but I don’t know whether we should go with it – probably someone who works with thyreophorans should see if it’s convincing. (It looks good to me — no holotype overlap, for example — but I’m a very long way from expert.)

    (how do you do italics in these comments?)

  19. William Miller says:

    Edit: crud, the link should be: http://www.bioone.org/doi/full/10.1671/039.029.0405 for full text.

    The parentheses got incorporated into it and I don’t know if you can edit comments here…

  20. William Miller says:

    OK, think I see what’s going on now – only that specimen (ROM 784) is being moved to Dyoplosaurus, not the whole genus Euoplocephalus. Sorry.

    Is Acanthopholis platypus a ‘valid’ name and not a nomen dubium? Dinodata.com claims the metatarsal of A. platypus is from a sauropod; is this somewhere in the actual literature? If so it would be important.

  21. Andy Farke says:

    Ah, good finds both! And it looks like they might be a good way to handle our planned analysis on morphological disparity. It looks like Brusatte and colleagues used binary characters as a disparity metric, and we can go one better with actual linear measurements!

  22. Andy Farke says:

    Good call. We’ve got that in the notes, where people have noted it, but we certainly will need to make it more explicit for some specimens.

  23. Andy Farke says:

    You’re right, I think it is a nomen dubium. We’ll maybe reclassify it as “Dinosauria” for now.

  24. 220mya says:

    You pretty much have to combine multiple values for one species into one datapoint for most phylogenetically corrected analyses, because they require a fully resolved tree. If you made every specimen an OTU, the only honest way to do it would be to have a big polytomy at the node that is the common ancestor of all the specimens. You could also re-run the analysis a bunch of times with randomly resolved trees, but that is a lot of work for little pay-off. Many methods now allow incorporation of standard deviations, etc., so you won’t lose all the information by combining to one datapoint. I particularly suggest COMPARE, which can implement pGLS in a maximum likelihood framework. It’s free and runs using Java, so it should work no matter what OS you have. Check it out: http://www.indiana.edu/~martinsl/compare/

  25. William Miller says:

    Ok, sure.

    I did find a paper claiming it’s a sauropod: http://jgslegacy.lyellcollection.org/cgi/content/abstract/48/1-4/375 This might be where dinodata.com got the idea from.

  26. Andy Farke says:

    Cool, thanks for tracking that down!

  27. Andy Farke says:

    I think a separate column would be best, and I’ve begun to take steps to enact this.

  28. Pingback: Time to Get to Work « The Open Dinosaur Project

  29. 220mya says:

    All specimens with the “MNA P1” prefix need to be changed to “MNA V”

    Don’t know if there are any UUVP numbers, but these specimens have all been completely renumbered with *new and different* UMNH VP numbers.

  30. Andy Farke says:

    Noted on the MNA issue. I don’t see any UUVP numbers, thankfully – tracking down the ROM ones has been enough of a pain as it is!

  31. William Miller says:

    Thanks!

    Another taxonomy issue; the BYU specimen of Othnielia has been renamed as Othnielosaurus (because the Othnielia type is supposedly not diagnostic; I have no personal knowledge if that is true). Is this currently accepted? The BYU specimen is in our data, what should we do?

  32. christian foth says:

    Hello,

    Today I asked my boss how to deal with our missing values for the analyses. He proposed that we should first compute a correlation matrix between all bone lengths across all individuals of one species (and do this for every species).

    Then we have to look at whether there are any allometries or just point clouds. The latter is perfect.

    In the next step, we can use these data to do a PCA for every species without regard to the individuals. In this way we can pool the data and get a ‘prototype’ for every species with all bone lengths.

    In the last step we can do a nested PCA including the ‘prototype’ data from all species and compare them to each other.

    In the case of allometry (see the first step), you have to do a PCA at the individual level and look at how the allometry behaves in every single species. You can also delete all smaller individuals to get the point cloud.

  33. christian foth says:

    The specimen number for Stenopelix is now GZG 741/2 (Butler & Sullivan 2009).

  34. David Dreisigmeyer says:

    I’m really looking forward to working with all of you! I can already see that this is going to be a great learning experience for me — always a good thing.

  35. John Dziak says:

    I agree with Christian Foth’s boss. Some PCA software will probably do listwise deletion (remove a whole case if any of its variables are missing) automatically if you give them the raw data — and you don’t want this because it would mean almost all of the dataset gets deleted. So I would recommend first computing the correlation matrix (or the covariance matrix after rescaling the data to give each variable the same variance — that would be about the same as the correlation matrix) using pairwise deletion (which is probably the default or at least available for most software that would give you a correlation matrix). Then use the correlation matrix as input to the PCA software, ignoring the raw data (this should be allowed by most software, since after all it is the correlation matrix, not the raw data, which is analyzed in PCA).
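
    In code, the two steps might look roughly like this (a sketch with pandas/numpy; the file and column names are made up):

        import numpy as np
        import pandas as pd

        df = pd.read_csv("odp_measurements.csv")   # hypothetical raw data, with gaps
        bones = ["Humerus", "Ulna", "MC_III", "Femur", "Tibia", "MT_III"]  # placeholder columns

        # Step 1: correlation matrix with pairwise deletion -- each coefficient uses
        # every specimen that has BOTH measurements, so incomplete rows still contribute
        # (min_periods=5 requires at least 5 overlapping specimens per pair of bones)
        corr = df[bones].corr(min_periods=5)

        # Step 2: PCA straight from the correlation matrix (no raw data needed):
        # eigenvectors are the components, eigenvalues give the variance explained
        eigenvalues, eigenvectors = np.linalg.eigh(corr.values)
        order = np.argsort(eigenvalues)[::-1]
        print(eigenvalues[order] / eigenvalues.sum())  # proportion of variance per component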

    I wouldn’t worry about bias being introduced by missing data. You aren’t claiming to have a random sample of all dinosaurs ever to live, anyway — that would be ridiculous. And within a given specimen, I think it wouldn’t be too bad to treat the missingness of individual bones or measurements as being “completely at random” (i.e., independent of their actual values). I don’t think that it is necessary to assume that the missingness of one bone be independent of the missingness of other bones — even without that I think you can still treat the missingness as noninformative. (I’m not an expert on this though.) The only other options would be using regression or an ad-hoc rule to fill in data (I don’t like this at all!) or using multiple imputation (potentially valid but more complicated, particularly for something like a PCA).

  36. John Dziak says:

    (Actually, I meant that I agreed with the part about using the correlation matrix. I don’t know anything about allometry.)

  37. This is a good discussion – I’m always keen to improve my statistics knowledge.

    For those of you who would like to learn more about how to deal with missing data, I’d recommend the following review article. It explains things including “missing at random” vs “missing completely at random”, and steps to take to deal with missingness.

    Nakagawa S, Freckleton RP (2008) Missing inaction: the dangers of ignoring missing data. Trends in Ecology and Evolution, 23, 592-596.

    Regarding PCAs and other methods of data reduction. It seems that most of the discussion on employing PCAs and their kin is on within-species data as opposed to inter-species data. That’s good. If PCAs are used on inter-species data then we should consider whether to account for phylogenetic autocorrelation prior to PCA analysis. A recent paper (below) discusses the Type I error that can be associated with PCAs on inter-species data prior to methods that account for phylogenetic autocorrelation (e.g., independent contrasts), and has R code for a program that incorporates phylogeny before such analyses. At least that’s what the abstract says the paper finds – I haven’t read it yet (it has now been bumped up to the top of my “to-read” stack!).

    Revell LJ (2009) Size-correction and principal components for interspecific comparative studies. Evolution, 63, 3258-3268.

    Finally, on the organization of the data spreadsheets: I agree with everyone else that the original data should be preserved, and that other columns should be added to facilitate certain analyses (e.g., interspecific comparisons). Some researchers with more stats skills than me may find this database useful for questions we haven’t considered. It is possible to use Bayesian hierarchical analyses to, as John Dziak noted, nest layers of variation at different scales. It should be possible for someone to have a “specimen” layer that incorporates error estimates for within-specimen measurements, a “species” layer that incorporates within-species error and predictors of its variation, and finally the inter-species layer. For a paper that explains hierarchical models, in the context of methods for exploratory analyses of huge data sets (developed for another citizen science project database – of bird observations with the Cornell Lab of Ornithology), see:

    Fink D, Hochachka W (2009) Gaussian semiparametric analysis using hierarchical predictive models. In: Modeling Demographic Processes in Marked Populations (eds Thomson DL, Cooch EG, Conroy MJ). New York: Springer.

    Unfortunately, although I’m learning about these stats methods from reading, I haven’t actually performed any of the ones I talk about here (other than simple PCA).

  38. David Dreisigmeyer says:

    I’ve been looking at the multiple imputation (MI) methods and another idea would be to use the MI method to fill the data matrix followed by a non-negative matrix factorization (NMF) of the resulting completed data matrix. If this is possible the NMF offers some attractive features:

    1) My experience is that it is fairly robust to small changes in the data, so similar results could be expected with the different imputations

    2) It extracts different unique features of the data. It’s closely related to clustering algorithms in this respect.

    3) The result is a basis of nonnegative vectors (unlike PCA), which in the present case has a clear interpretation as ‘archetypal’ limbs. So it seems like a fairly natural algorithm to use here.

    4) This method has been used to analyze biological (mainly microarray) data, e.g.:

    http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1000029

    Jill Mesirov is the pioneer here. Originally the method was used for image processing.

    I’ve had really good success with this method in the past. That said, I’m not sure how it would play with MI. Not only would any results be novel, but so would the method itself, if it works (double bang for the buck).

  39. John Dziak says:

    I don’t think you would want to use multiple imputation to fill in one dataset. That would be single imputation, which isn’t a good idea (it gives you no way to assess sensitivity to the imputed data — in other words it’s too much like making up data). I think that people who use MI use it to create many datasets, analyze each, and average the results. This would be good for calculating a mean or a regression coefficient but I don’t know how it would work for a principal components analysis or factor analysis. I don’t really like the idea of having 20 different factor structures and averaging them together to get an answer.
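
    To illustrate the analyze-and-pool idea for something simple like a regression slope, a rough sketch might look like this (scikit-learn’s IterativeImputer is just a stand-in for proper MI software, and the file and column names are made up):

        import numpy as np
        import pandas as pd
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer
        from sklearn.linear_model import LinearRegression

        df = pd.read_csv("odp_measurements.csv")   # hypothetical raw data with gaps
        cols = ["Femur", "Tibia"]                  # placeholder column names

        slopes = []
        for seed in range(20):                     # 20 imputed datasets, not just one
            imputer = IterativeImputer(sample_posterior=True, random_state=seed)
            filled = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
            model = LinearRegression().fit(filled[["Femur"]], filled["Tibia"])
            slopes.append(model.coef_[0])

        # Pool across imputations: average the slope; the spread shows how sensitive
        # the answer is to the imputed values (the part single imputation hides)
        print(np.mean(slopes), np.std(slopes))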

  40. John Dziak says:

    But again, I don’t think you NEED to fill in all the missing data to do a principal components analysis or factor analysis. Just estimate the correlation matrix using as much data as is available for each pairwise coefficient, and then feed this matrix into the analysis. That would be the most straightforward thing to do, and I would feel more comfortable with it than with any single-imputation method like plugging in means or regression coefficients. It probably does involve treating the missingness as being completely at random, but I can’t really imagine how that would hurt you in your context.

  41. David Dreisigmeyer says:

    John, do you know anything about MI with respect to clustering algorithms? NMF is closely related to these. For NMF, what I’ve seen in the past is that certain directions would be fairly robust to perturbations of the data. (For us these perturbations would be the different values we use to fill in the data matrix with each imputation.) *If* that would be the case here, then you would have groups of distinct vectors that would represent ‘archetypal’ limbs. From this perspective it may not be too bad to average these within each group, provided they are sufficiently distinct. It’s really the non-negativity of the directions that interests me since they have an actual physical interpretation, which PCA and FA would seem to lack. However, PCA is guaranteed to at least produce something.

    I may be missing some crucial idea about MI here though. A Google search also turns up very little for MI and PCA or clustering. So probably the method should be developed and tested before using it in primetime…

    Geometrically, it seems that we are looking at roughly the same thing in different ways. If we let D be the data matrix with columns corresponding to individuals and rows to bones, PCA would find the subspace that contains the most energy for the bone variations — ‘eigen-limbs’. These eigen-limbs themselves would not necessarily be physical (unless you have a 1-D subspace, which is very possible here depending on the individuals included); only the intersection of the subspace with the positive orthant would be (actually it would have to lie in the interior of it). What NMF attempts to find is the vertices of the (strictly positive) convex cone that contains the same data as the PCA subspace (but here the data would need to be a convex combination of the vertices, which would be the ‘archetypal bones’).

    But, for PCA there is a well-developed method of averaging the subspaces by finding the Karcher mean over a Grassmann manifold. This works really well for facial recognition problems and is a numerically solved problem.

  42. David Dreisigmeyer says:

    John, what do you think about this paper:

    http://www.hindawi.com/journals/cin/2009/785152.html

  43. John Dziak says:

    For PROC LCA, the model-based clustering software we maintain at the center I am at, I think that we ignore missing data on the response dimensions, using the assumption that it is missing at random, in which case it does not change the likelihood function. We don’t omit cases with missing data, of course.

  44. Pingback: The Analyses Ahead « The Open Dinosaur Project
