Dear Dave,

First off, my apologies for being slow to return my comments to you. I've spent a lot of time over the past months playing with the parts of the Neptune data that I have, so your paper felt very relevant and timely to me, and I needed a bit of time to absorb the wealth of ideas in it (and the shock to the system of some of it!). I clearly did not have a complete understanding of the nature of the Neptune data, and I'm very grateful that you have written this paper to clarify things for the community at large.

I wasn't sure how much general vs. specific feedback you were after, so I'll start with the extremes! On the most specific front, I am returning the PDF you sent me with annotations of typos and other small (e.g. wording) suggestions I came across. The Preview app doesn't have the greatest annotation tools, but (in case you haven't used them before) if you choose the menu item View -> Sidebar -> Annotations, you'll see a list of them so you can quickly find each one.

Jumping right from the specific to the general: if I could make just one overall critical comment about the paper, it would be that it sometimes paints a more pessimistic view of the record than I would put forward. One of my favorite sentences in the paper is: "It may be time to acknowledge a broader view of life and establish the evolution of unicellular organisms as one of the 'main' subjects of paleobiologic research." My impression on first reading was that the paper's focus on the (clearly very real and important) shortcomings of the microfossil record left it looking less attractive "as one of the 'main' subjects of paleobiology" next to the rest of the fossil record. Now, I am well aware that this impression may just be based on ignorance and naiveté on my part, and because you have many more decades of experience with deep-sea microfossils than I do, I thought long and hard about my comments. I've tried to provide below specific instances where I think the paper might benefit from highlighting the advantages that microfossil data offer over the kind of macrofossil data that populate the PBDB, for example. Throughout your paper I kept comparing Neptune to the PBDB in my mind, as I think that's the implicit comparison (though you don't explicitly say so), since that's where so much of the macroevolutionary paleobiological research focus has been in recent years.

Pages 1-3 read well; I really like the distinction you set up here between the marine animal record and the microfossil record in terms of the proportion of high-level groups preserved vs. the proportion of species preserved.

On page 5 you refer to table X. I think it would be helpful to provide absolute numbers (and references) for the preservable species, rather than just percentages.

At the bottom of page 6 you make the point that unpreservable microfossil taxa may have evolved and gone extinct without leaving a trace. While I agree this is entirely possible, it doesn't strike me as a significant worry, and it certainly isn't something that distinguishes the microfossil record from any other kind of fossil record. The same, I think, applies to the points you make on page 7.

Regarding time-averaging: while it's true that you do often get bed-level ecological structure of animals preserved on the shelves in obrution deposits (and I think it's a fair distinction to make that these sorts of deposits are absent from the microfossil record), most deposits of marine invertebrates are just as time-averaged, if not more so
(e.g. Kidwell, S.M., 1998, Time-averaging in the marine fossil record: overview of strategies and uncertainties: Geobios, v. 30, p. 977-995; or a recent book chapter: Kowalewski, M., and Bambach, R.K., 2008, The Limits of Paleontological Resolution: Topics in Geobiology, v. 21, p. 1-48. The latter places the time-averaging of shelly shelf invertebrate assemblages at 10s to 100,000s of years.)

Regarding the recognition of hiatuses: again, I think you're absolutely right to point out the major difference between deep-sea and shallow-water sediments, in that the former rarely, if ever, shows variations in lithology. However, I'm not sure this necessarily makes the recognition of hiatuses much easier in shallower-water sediments. I don't have any good references at hand to back this up, but I'm thinking, e.g., of the "coordinated stasis" debate in the 90s, where what had previously been identified as pulsed turnover events turned out to be stratigraphic artifacts due to hiatuses/sequence boundaries (maybe a Holland paper?). Even changes in lithology can represent either rather little or very much time; and, conversely, you sometimes get very significant hiatuses within homogeneous lithologies in shelf sediments, too.

I found your subsection "Incomplete Data" particularly interesting. From what you've written, my understanding is that most faunal lists in Neptune consist of a "model A" component, some more or less randomly chosen taxa, and a "model B" component, those taxa on a pre-determined list of biostratigraphically relevant taxa also found to be present in the sample. There are a couple of places where I'm not entirely clear on what you're saying, though.

On page 11, you write that "the differences in the average reported diversity per sample/study simply reflect the average practical size of a taxonomic list, and do not have a necessary relationship to actual real sample diversity." Does this mean that each study has a different taxonomic list, and that this, more than the underlying diversity, determines list length? If so, this should be a fairly easy prediction to test (and, if it holds, would back up what you're saying and be an interesting thing to add to the paper, if you have time). If the Neptune database has publication information (I imagine it does), you should be able to parse the data by time bin as well as by publication, and see whether the variability in list length is better explained by the time bin a list falls in or by the publication it comes from. If you wanted to get statistical, testing the relative support for the two predictors with something like Akaike weights seems like a good way to go.
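In case it's helpful, here's a minimal sketch of what I have in mind (Python; the table and all its column names are simulated stand-ins, since I obviously don't have the real Neptune export at hand):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Simulated stand-in for Neptune: one row per faunal list, with the
# time bin it falls in and the publication it comes from.
df = pd.DataFrame({
    "time_bin": rng.integers(0, 10, n),
    "publication": rng.integers(0, 25, n),
})
# In this simulation, list length is set by the publication, not the bin.
pub_effect = rng.normal(30.0, 10.0, 25)
df["list_length"] = pub_effect[df["publication"]] + rng.normal(0.0, 3.0, n)

# One model per candidate predictor, both treated as categorical.
m_bin = smf.ols("list_length ~ C(time_bin)", data=df).fit()
m_pub = smf.ols("list_length ~ C(publication)", data=df).fit()

# Akaike weights: the relative support for each model, given the data.
aic = np.array([m_bin.aic, m_pub.aic])
delta = aic - aic.min()
w = np.exp(-0.5 * delta)
w /= w.sum()
print(dict(zip(["time-bin model", "publication model"], np.round(w, 3))))
```

With the real data, the publication model winning decisively would directly support your claim that list length reflects the practical size of a taxonomic list rather than underlying diversity.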
In the next paragraph, I got confused. You write that "data collected under model C will generally show a good correlation between sample availability and total diversity, but this is due to the strong correlation, at least in deep-sea drilling material, between taxonomic effort (and thus total reported diversity) and sample availability". What I take this to mean is that in stratigraphic sections with more samples, the observed diversity goes up, but not for the reason we might think (i.e., moving up a collector curve: the more things you look at, the more different kinds of things you see), but rather because sample availability is correlated with taxonomic effort, meaning that longer (biostratigraphic) taxonomic lists are used to check presence/absence than in stratigraphic sections with fewer samples. If I've understood that right (and I'm not sure I have!), it seems to me that this reduces to essentially the same thing as a collector curve, albeit via the detour of constructing a taxonomic list: the more diverse-seeming assemblages seem so because they have longer "model B" lists, not because they've had more random samples taken. But the reason they have longer taxonomic lists is that there is more "sample availability", as you put it, which I think means they have been more extensively (randomly?!) sampled.

At the bottom of page 11 you refer to Figure 8, which I found a little hard to read. It was not clear to me what the inset vs. the main plot show, and whether they show the same data. If they do, it strikes me that the main plot (i.e., not the inset one) must be missing some data, because the largest number of samples in which a taxon appears on that plot is <100, whereas several percent of the taxa in the inset plot appear in 100-600 samples. If this is true, maybe you could indicate that some data are left out; either way, it would be helpful to label and describe in the legend what the two plots show. It would also help to note on the plot (or in the legend) how many taxa there are in total.

I also couldn't quite follow the calculation you make in the figure legend. What I understood it to say was that, given what is known about preservation and species durations, you would expect each taxon to appear in more samples than it does, lending support to the idea that most taxa are undersampled (because of the predominance of "model B" data collection). But I couldn't understand why you expected the mode in plot 8a to be 300-400 taxa. From my reading of the plot (the y-axis is the number of rad taxa that appear in the x-axis's number of samples, right?), a higher value for the mode would imply that there are even more taxa that appear in only a few samples, which isn't what I'd expect to see. But maybe you actually mean "mean", not mode; then again, isn't the y-average of the histogram just a function of the total number of taxa, i.e., won't it be unchanged by the shape of the distribution? Or perhaps you're talking about the x-average, although the main plot doesn't look like it has an average anywhere near 100, since that's its maximum value. I think I may be missing the point here.

A more general observation with regard to the characteristics of the existing microfossil dataset: while what you describe is undoubtedly true and possibly quite problematic for data analysis, I'm not sure it really distinguishes the data from the macrofossil record in any deep way. The published studies that go into the PBDB are probably rarely as systematic in recording taxa as the ODP/DSDP reports (go through the "model B" list, then add a few random "model A" taxa), but I don't think the record as a whole is all that different. In some sense, the vast majority of workers in the macrofossil record are following some sort of "model B" list-checking, in many cases for the same reasons as in Neptune (report ammonites X, Y, and Z, following the taxon list for Upper Jurassic index fossils; or report trilobites X and Y, which tell us the formation is of Ordovician stage Z). Other studies will focus on one very specific group (turritellid gastropods, say, or some specific group of vertebrates) and describe everything they find in that group, ignoring everything else, which is also essentially a "model B" sort of process. Then there are some "model A" types of papers, monographs, etc.,
describing more exhaustively what is found (though often with a distinct and problematic bias toward new things, rather than occurrences of common or already-described taxa). Either way, separate additions of "model A" and "model B" lists would, I think, lead to a record that looks fairly similar to a record consisting of "model C" lists. My main point here is not for a minute to suggest that the incompleteness of the Neptune data isn't real or problematic, but rather that the situation is no better in the macrofossil record, and I think that bears mentioning.

It might be worth mentioning in your section on reworking that this sort of long-timescale process is known to occur in the macrofossil record, too, although there I understand it's referred to as remanié, and it isn't known how common it is (though I agree that most 'reworking' for macrofossil people is on the 1,000s-100,000s of years timescale). Useful references might be Craig, G.Y., 1966, Concepts in palaeoecology: Earth-Science Reviews, v. 2, p. 127-155; and, for an example of multi-million-year reworking, Flessa, K.W., Time-averaging and temporal resolution in Recent marine shelly faunas, Paleontological Society Short Course no. 6.

At the risk of sounding like a broken record: it struck me when reading the section "Age Model Problems" that, again, this isn't something unique to Neptune data, and that the salient characteristic distinguishing the dating of deep-sea paleontological data is that it's much more tightly constrained than in the macrofossil world. Alroy's time bins in the PBDB publications are 10 my long for a reason; so if we consider our age model errors to typically throw us off by 1 my, as you suggest, then we can be completely safe and happy with 2 my time bins in Neptune for macroevolutionary studies, which is still five times better. Also, if this source of error (unlike reworking) is unbiased, it shouldn't matter much for macroevolutionary studies. As long as it affects samples equally, and more or less evenly through time, we should be fine as long as the signal we're trying to see is strong enough, and I'd hope that for at least some of the most important paleobiological questions that can be addressed with these data, it would be.

On page 15 you suggest that the scarcity of most taxa in Neptune (the low modal number of samples in figure 8a) favors the use of range-through methods for determining species ranges. You mention that range-through methods are susceptible to range-extension errors, but I think it's also worth noting that the macroevolutionary paleobiology field (i.e., the post-Sepkoski, PBDB crowd) has largely abandoned range-through for studies of diversity, for (I think) a different set of reasons.

Firstly, range-through leads to fairly ugly edge effects. If you imagine a hypothetical diversity history with constant diversity, some constant non-zero turnover rate, and constant but imperfect preservation, you would get a convex diversity curve: while you'd be able to range things through in the middle, you can't (by definition) range through the beginning or end points of your time series. If you compare your fig. 10 plots for in-bin vs. range-through Neptune data, I think you can see this problem in action.
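In case it's useful, here's a quick simulation of that convexity (Python; all parameter values invented purely for illustration). True diversity is held exactly constant, with constant turnover and imperfect sampling, and the range-through tally still sags at both ends of the time series:

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins = 50      # length of the time series, in bins
standing = 200   # constant true standing diversity
turnover = 0.2   # per-bin extinction (= origination) probability
p_find = 0.5     # chance a living taxon is sampled in any given bin

# Build true ranges: every extinction is immediately balanced by an
# origination, so true diversity is exactly `standing` in every bin.
ranges = []                                # completed (FAD, LAD) pairs
alive = [[0, None] for _ in range(standing)]
for t in range(n_bins):
    for sp in list(alive):
        if rng.random() < turnover:
            sp[1] = t                      # extinction in bin t...
            ranges.append(tuple(sp))
            alive.remove(sp)
            alive.append([t, None])        # ...replaced by a new taxon
for sp in alive:                           # survivors reach the last bin
    ranges.append((sp[0], n_bins - 1))

# Imperfect sampling, then a range-through tally per bin.
rt = np.zeros(n_bins, dtype=int)
for fad, lad in ranges:
    hits = [t for t in range(fad, lad + 1) if rng.random() < p_find]
    if hits:                               # range through first-to-last find
        rt[min(hits):max(hits) + 1] += 1

print(rt)  # sags toward both ends despite constant true diversity
```

None of the numbers mean anything in themselves; the point is just that the edge effect falls straight out of range-through counting plus imperfect sampling.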
Secondly, range-through ignores the many biases, mostly related to uneven sampling (in the broadest sense) through time. This goes back to Raup's 1976 criticism of the earliest Sepkoski curves (here: http://www.cornellcollege.edu/geology/courses/Greenstein/paleo/raup76.pdf), and I think this is what ultimately spurred the development of the occurrence-level databases, because they allow for a correction of unequal sampling. [How well these corrections actually work is another matter, and they all have their own biases, but range-through ignores the problem altogether.]

Also starting on page 15, you go through the example of comparing the range-through Neptune diversity of forams in the 5-6 Ma bin to the Plankrange list. While I think this is a really interesting exercise, there are a number of reasons why I would be more hesitant to interpret the results as a 65/140 species error rate in Neptune.

[As a side note before I get to those reasons: I found it hard to keep the numbers of the different categories of taxa (in Plankrange but misplaced; in Plankrange and in the right bin; etc.) in my head, so I found myself scribbling down a little table as I was reading through this section. I still didn't understand how you ended up with 65 "valid" taxa and 40 to 50 "invalid" taxa; that makes 105-110, which is neither the total of 140 you cite for range-through nor the 102 that are actually in the bin. I think it would make the section easier to read if the numbers were also given in a small table rather than only described in the main text. Ideally, I'd like to see a big table listing all of the species in both databases, lined up in adjacent columns so one can see which ones match in which categories, maybe in the supplementary information, if the journal has it. I think that would make it much easier to see what's going on.]
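Just to show the kind of thing I mean, here's a sketch of how that species-by-species table could be generated (Python; the data frames, the column names, and the bin-overlap rule are all my invented placeholders, not your actual data structures):

```python
import pandas as pd

# Placeholder inputs: Neptune taxa ranged-through into the 5-6 Ma bin,
# and Plankrange total ranges (FAD/LAD in Ma, so bigger = older).
neptune = pd.DataFrame({"taxon": ["sp. A", "sp. B", "sp. C", "sp. D"]})
plankrange = pd.DataFrame({
    "taxon": ["sp. A", "sp. B", "sp. C"],
    "fad_ma": [9.0, 4.5, 12.0],
    "lad_ma": [2.0, 0.0, 7.5],
})

BIN_TOP, BIN_BASE = 5.0, 6.0  # the 5-6 Ma bin

merged = neptune.merge(plankrange, on="taxon", how="left")

def categorize(row):
    if pd.isna(row["fad_ma"]):
        return "not matched to Plankrange"
    # The range [lad, fad] overlaps the bin iff the FAD is at least as
    # old as the bin top and the LAD at least as young as the bin base.
    if row["fad_ma"] >= BIN_TOP and row["lad_ma"] <= BIN_BASE:
        return "in Plankrange, range includes bin (valid)"
    return "in Plankrange, range excludes bin (misplaced?)"

merged["category"] = merged.apply(categorize, axis=1)
print(merged[["taxon", "category"]])      # the species-by-species table
print(merged["category"].value_counts())  # the summary counts
```

Even with my made-up overlap rule, a table like the first printout would give exactly the species-level transparency I'm hoping for.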
Firstly, I think it's conceptually problematic to use the Plankrange diversity as an indicator of "known", or true, foram diversity against which to measure Neptune. From what I can see, that database is compiled overwhelmingly from biostratigraphic publications, and so it suffers from precisely the "model B" bias that you identified as so pernicious above. Each of those publications is, again, only going to record the minimal set of stratigraphically useful taxa needed to build a zonation. Neptune, consisting by your description of "model C" data, at least has some component of "model A" data in addition to the "model B" data; so a priori I would expect Neptune to capture a greater proportion of the true diversity on the slides than Plankrange. The 30 taxa you couldn't match to Plankrange at all, for example, could therefore all be valid taxa that just haven't found their way into biostratigraphic schemes. I don't know if this is reasonable or not, but if it is, you could turn things around and, rather than seeing this as 30/140 erroneous taxa in Neptune, see it as 30/71 taxa missing from Plankrange.

Secondly, you do acknowledge that some of the Neptune occurrences in the 5-6 Ma bin could be legitimate, non-erroneous range extensions, but you suggest that most of them are there as a result of reworking or age model errors. Maybe I'm just being hopelessly optimistic on Neptune's behalf, but rather than the assurance that "several of these … were examined", I'd prefer to see a table with each of those taxa and a more convincing assessment of whether they're really displaced or just new, real data. And I only say this because I was very convinced (as well as troubled) by your description on page 12 (under "Reworking") of the difficulty of establishing the range of a species in the first place; the trap of circular reasoning looms large here, to my mind.

In the "Solutions" section, I did find the promise of an essentially perfect fossil record as an attainable research goal quite inspiring. I would like to see a bit more calculation, even totally back-of-the-envelope, of how much work you think that would entail. Can it be done by one micropaleontologist in five years? By ten in ten years? By fifty people in twenty years? Or, perhaps more appropriately, how many (wo-)man-hours would it take? If it's not reasonable to guess for the other groups, I'd love to see that number at least for rads. That said, I'm not convinced that just because the data we have are imperfect, and perfect data are within a generation's reach, we can't do any macroevolutionary studies with Neptune data at all. Perhaps that was not the intended implication?

In the "Analytic methods" section (p. 17-19), I had a few more comments.

1) I'm not sure that the boundary-crosser method would eliminate the range-extension problem. It essentially tallies range-through diversity at a boundary, i.e. it eliminates singletons and taxa that may not have coexisted within a bin, but erroneous range extensions will still inflate boundary-crosser counts whenever they lie on the other side of a boundary from the true LAD of the taxon (see the sketch after these points).

2) I understand your logic with regard to subsampling: randomly removing occurrences is likely to trim off the relatively rarer instances of erroneous range extension. However, I think this will work much better when subsampling is done by occurrence (i.e., by classical rarefaction); when subsampling by list, as is the case with most of the techniques Rabosky & Sorhannus used (UW, OW, O2W), and if the errors are evenly distributed among lists, you may actually still be stuck with most of the problem, because you won't be able to throw out the few erroneous occurrences without also throwing out all of the valid occurrences on the same list. However, that's a bit beside the point. My main issue here is that this paragraph seems to imply that subsampling is carried out in order to remove these erroneous data points, which I think is not accurate. As I understand it, the approach of Rabosky & Sorhannus was to standardize for the exponential increase in the number of occurrences through the Cenozoic, and thus to attempt to remove a potentially very powerful sampling-intensity bias from the diversity data, not to trim off erroneous occurrences.

3) As regards losing the diversity of rare taxa by subsampling strongly: this is undoubtedly true, but I think advocates of sampling standardization would argue that it's necessary in order to compare apples to apples. And while it's also true that the resulting diversity curves give only relative changes in diversity rather than absolute ones, I think the Alroys of the world would again argue that a reliable relative diversity curve is better than an absolute one subject to strong sampling bias.

4) I also disagree with your wording that "resampling does not actually identify the sources of error, i.e., the samples in the database from which the problematic data come". It's true that it doesn't address the errors you outline in the iRAT section, but it does address a source of error you describe on page 9 and in figure 6, namely the uneven intensity of sampling through time. I think you're right that it's insufficient to subsample without considering the RAT sources of error, but subsampling does address a real source of error.
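On point 1, here's a toy illustration of why I don't think boundary-crossing escapes the problem (Python; taxa and ages wholly invented). A single erroneous occurrence that drags a taxon's apparent LAD across a boundary adds one to every boundary tally it crosses:

```python
# Ranges as (FAD, LAD) in Ma; a taxon crosses a boundary if its FAD is
# older and its LAD younger than the boundary age.
ranges_true = {"sp1": (10.0, 7.0), "sp2": (8.0, 2.0), "sp3": (6.5, 5.5)}

# Same data, except one erroneous occurrence of sp1 at 4.5 Ma (reworked,
# say) extends its apparent LAD across the 6 Ma and 5 Ma boundaries.
ranges_err = dict(ranges_true, sp1=(10.0, 4.5))

def boundary_crossers(ranges, boundary_ma):
    return sum(fad > boundary_ma > lad for fad, lad in ranges.values())

for b in (6.0, 5.0):
    print(f"{b} Ma: true = {boundary_crossers(ranges_true, b)}, "
          f"with error = {boundary_crossers(ranges_err, b)}")
# 6.0 Ma: true = 2, with error = 3
# 5.0 Ma: true = 1, with error = 2
```

So boundary-crossing removes singletons and within-bin non-coexistence, but any RAT-type extension that happens to straddle a boundary still inflates the count.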
I really like the pacman method! I'm very pleased the paper builds up to such a practical and positive end.

To sum up, I think it's tremendously helpful to have such a comprehensive summary of the nature of the Neptune data, the sources of error, and a method for addressing those sources of error. The one source of error you don't really discuss much beyond the mention in figure 6 and on page 9 is uneven sampling intensity, and I do think it's worth addressing in addition to the RAT sources of error. What occurred to me upon reading your paper was that a combination of data filtering by a method like pacman (to address RAT error) followed by some form of subsampling (to address uneven sampling) would be a comprehensive approach covering both classes of error. But that's just a thought.

Alright. My apologies again for having been so slow to get this back to you, but it really got me thinking and I wanted to do it right. I hope some of my comments are useful. I look forward to seeing the paper in print! What journal is it going to?

All the best,
- Ben