Discussing data

Henrick’s Hawaiian metabarcoding

  • data summary
    • raw amplicon data
    • rarified otu and haplotype tables
    • three main products:
      • across chronosequence
      • across elevation (temp and precip gradient)
      • collembola invasion
    • would be great to have phylogeny but we’re not there yet: there’s hope for using reference collections and grafting COI trees onto better supported backbone trees
    • reference collection forthcoming
  • motivation
    • Dimensions project sampling across the Hawaiian chronosequence and stratified across the forest profile (from floor to canopy) yielded $> 10^6$ specimens.
    • How to deal with this massive amount of material? Too expensive for keyout each specimen. This leads to metabarcoding
  • metabarcoding
    • issue: body size and DNA content: DNA content is super linear with body length
    • solution:
      • for Hawaii arthropods have a limited size range, so we can sort into a few size categories to mostly control for DNA content
      • size classes are 0–2, 2–4, 4–7, >7 mm in length
    • issue: exponential PCR amplification based on primer mismatch
    • solutions:
      • some patterns are interesting within close relatives where this bias does not matter, e.g. within lineage response to climate/elevation
      • taking average of multiple markers averages out the bias of each
      • statistical approaches for correcting for bias
    • we have a high throughput pipeline fro taking sorted specimens to sequencing to analysis
    • issue: sorting is slow
  • for Hawaii beating samples
    • barcode by size by site by plant separately
    • cluster OTUs
    • make an OTU table by rarifying each pooled sequencing run to standard number of sequences
    • also make a haplotype table by clustering “OTUs” with 0 divergence
  • similar data also exist for Hawaii environmental gradients
  • Collembola invasion
    • 44% of abundance of arthropods on Big Island
    • could be predator release: based on gut content no spiders eat them, but we haven’t looked at carabids yet
    • abundance of collembola relates with age: most abundance of collembola are on young and old substrates
  • Something to look at: does copy number vary with genetic diversity? We might expect that more copy number allows for more diversity

Petr’s data assembly

  • folder on Google drive called data hosting all the data
  • so far we have 4 relevant data sets (all data are abundance only unless otherwise noted)
    • Paulo’s Azores arthropods
    • Dan Gruner’s Hawaiian arthropds
    • Jon’s Hawaii trees
    • Brent’s (via Isaac) Reunion spiders; includes sequence data
  • potential data
    • Jairo has data but is not ready for sharing yet, though narrow personal collaborations are possible
    • Joaquin has mainland European carrabid data; includes phylogeny and traits
    • Petr compiled global mammal database: no abundance; phylo, geo distribution, traits
    • Christine has snail data in the process of being assembled including phylo, trait, spatial occurrences as abundance proxies
  • data documentation is key, especially permisions
    • there’s a README template to be completed by each data providers; it includes an important section on permissions
    • maybe we should make consistent deffinitions of terms of use
    • but we don’t want to make the form more complex because it’s already cumberson for data providers to fill it out
    • we’ll get more into this
  • How should we proceed?
    • Jon proposes a project first approach where project leaders solicit specific data
    • wouldn’t it be nice if project driven data could be sharred more broadly
    • the proposal for now is that each project get’s its own folder it’s responsible for putting data in and maintaining it