Discussing data

13 Aug 2018

Henrick’s Hawaiian metabarcoding

data summary
- raw amplicon data
- rarefied OTU and haplotype tables
- three main products:
  - across chronosequence
  - across elevation (temp and precip gradient)
  - collembola invasion
- would be great to have a phylogeny, but we’re not there yet; there’s hope for using reference collections and grafting COI trees onto better-supported backbone trees
- reference collection forthcoming
motivation
- Dimensions project sampling across the Hawaiian chronosequence and stratified across the forest profile (from floor to canopy) yielded $> 10^6$ specimens.
- How to deal with this massive amount of material? It is too expensive to key out each specimen, which leads to metabarcoding
metabarcoding
- issue: body size and DNA content: DNA content scales superlinearly with body length
- solution:
  - Hawaiian arthropods have a limited size range, so we can sort into a few size categories to mostly control for DNA content
  - size classes are 0–2, 2–4, 4–7, >7 mm in length
- issue: PCR amplification is exponential, so bias from primer mismatch is compounded across cycles
- solutions:
  - some patterns are interesting within close relatives, where this bias does not matter, e.g. within-lineage response to climate/elevation
  - taking average of multiple markers averages out the bias of each
  - statistical approaches for correcting for bias
- we have a high-throughput pipeline for taking sorted specimens to sequencing to analysis
- issue: sorting is slow
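
A minimal sketch of the size sorting described above, assuming lengths are measured in mm. The function name `size_class` and the handling of specimens that fall exactly on a class boundary are assumptions for illustration; the notes only give the four ranges.

```python
from bisect import bisect_left

# Upper bounds (mm) of the first three size classes from the notes;
# anything above the last bound falls into the ">7 mm" class.
SIZE_BREAKS_MM = [2.0, 4.0, 7.0]
SIZE_LABELS = ["0-2 mm", "2-4 mm", "4-7 mm", ">7 mm"]


def size_class(length_mm: float) -> str:
    """Return the size-class label for one specimen's body length."""
    if length_mm < 0:
        raise ValueError("body length cannot be negative")
    # Assumption: a specimen exactly on a boundary (e.g. 2.0 mm)
    # is placed in the smaller class.
    return SIZE_LABELS[bisect_left(SIZE_BREAKS_MM, length_mm)]


if __name__ == "__main__":
    for length in (1.5, 3.2, 6.9, 12.0):
        print(length, "->", size_class(length))
```

Each size class would then be pooled and sequenced separately, so that large-bodied specimens do not swamp the reads from small ones.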
for Hawaii beating samples
- barcode by size by site by plant separately
- cluster OTUs
- make an OTU table by rarefying each pooled sequencing run to a standard number of sequences
- also make a haplotype table by clustering “OTUs” with 0 divergence
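
The last two steps can be sketched roughly as follows. This is an illustration, not the actual pipeline: `rarefy` and `haplotype_counts` are hypothetical names, the rarefaction depth is left to the caller because the notes do not state it, and the haplotype step assumes reads are already trimmed to the same region so that "0 divergence" means identical sequence strings.

```python
import random
from collections import Counter
from typing import Dict, Iterable


def rarefy(counts: Dict[str, int], depth: int, seed: int = 0) -> Dict[str, int]:
    """Subsample one run's OTU read counts to `depth` reads without replacement."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"run has {total} reads, fewer than depth {depth}")
    # Expand counts into one label per read; fine for a sketch,
    # but memory-hungry for real sequencing runs.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(pool, depth)
    return dict(Counter(drawn))


def haplotype_counts(reads: Iterable[str]) -> Dict[str, int]:
    """Group reads at 0 divergence, i.e. count identical sequences."""
    return dict(Counter(reads))


if __name__ == "__main__":
    run = {"OTU_1": 500, "OTU_2": 120, "OTU_3": 5}  # made-up counts
    print(rarefy(run, depth=100))
    print(haplotype_counts(["ACGT", "ACGT", "ACGA"]))
```

Rarefying every run to the same depth makes richness comparable across runs, at the cost of discarding reads from the deeper runs.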
similar data also exist for Hawaii environmental gradients
Collembola invasion
- collembola make up 44% of arthropod abundance on the Big Island
- could be predator release: based on gut content no spiders eat them, but we haven’t looked at carabids yet
- collembola abundance varies with substrate age: it is highest on young and old substrates
Something to look at: does copy number vary with genetic diversity? We might expect that higher copy number allows for more diversity

Petr’s data assembly

a folder on Google Drive called “data” hosts all the data
so far we have 4 relevant data sets (all data are abundance only unless otherwise noted)
- Paulo’s Azores arthropods
- Dan Gruner’s Hawaiian arthropods
- Jon’s Hawaii trees
- Brent’s (via Isaac) Reunion spiders; includes sequence data
potential data
- Jairo has data but is not ready for sharing yet, though narrow personal collaborations are possible
- Joaquin has mainland European carabid data; includes phylogeny and traits
- Petr compiled a global mammal database: no abundance; includes phylogeny, geographic distribution, traits
- Christine has snail data in the process of being assembled, including phylogeny, traits, and spatial occurrences as abundance proxies
data documentation is key, especially permissions
- there’s a README template to be completed by each data provider; it includes an important section on permissions
- maybe we should make consistent definitions of terms of use
- but we don’t want to make the form more complex because it’s already cumbersome for data providers to fill out
- we’ll get more into this
How should we proceed?
- Jon proposes a project-first approach where project leaders solicit specific data
- wouldn’t it be nice if project-driven data could be shared more broadly?
- the proposal for now is that each project gets its own folder, which it is responsible for putting data in and maintaining