Processing Nature Overheard Exports in R

Introduction

The NHMDE package for R can be used to pre-process the NHM Data Ecosystem exports for Nature Overheard to make them suitable for initial exploratory analyses in R.

Installation

# If required, install the devtools package
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install the NHMDE package from GitHub
devtools::install_github("edwbaker/NHMDE")

# Load the NHMDE package
library(NHMDE)

Comparison of NHMDE with raw data

The cs_get_data() function from NHMDE is used to read data from community science exports from the Data Ecosystem. Reading data using cs_get_data() is similar to using the standard R function read.csv(), but you must also specify the project.

raw_data <- read.csv("data/nature_overheard__samples.csv")
nhmde_data <- cs_get_data("data/nature_overheard__samples.csv", project="natureoverheard")

cs_get_data() will return a data frame with the data from the file, but unlike read.csv(), it will also check that the data has the expected columns and set appropriate data types on each column.

For factors (categorical data types) it will also recode the data with shorter labels suitable for use in plots, and reorder the levels of the factor to give expected outputs when using standard R functions.

The checks and transformations applied to the data comprise of those applicable to all community science samples, and those specific to the project. This is handled internally by the cs_get_data() function, but will require new definitions to be created for additional projects.

Examples of the benefits of using cs_get_data() are illustrated below.

Habitat types

The export from the Data Ecosystem has all habitat types recorded for a sample in a single column.

head(raw_data$habitat_features)

## [1] ""                                                         
## [2] "Trees,Shrubs,Hedgerow"                                    
## [3] "Trees,Shrubs,Hedgerow,Mown lawn,Wildflowers,Garden plants"
## [4] "Trees,Shrubs,Hedgerow,Wildflowers"                        
## [5] "Trees,Shrubs,Wildflowers,Garden plants"                   
## [6] "Shrubs,Mown lawn,Wildflowers"

The NHMDE package will split these into separate columns, with a column for each habitat type.

head(nhmde_data$habitat_feature_Trees)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE

This allows for easier analysis of the data, as the habitat types are now in separate columns.

As all of the habitat feature columns start with the same prefix (habitat_feature_), it is easy to select them all for analysis from the data frame.

# Identify the columns starting with "habitat_feature_"
cols <- startsWith(colnames(nhmde_data), "habitat_feature_")

# Sum the counts of each habitat feature
counts <- colSums(nhmde_data[, cols])

# Get the names of included columns
names <- colnames(nhmde_data)[cols]

# Remove the `habitat_feature_` prefix
names <- gsub("habitat_feature_", "", names)

# Replace space character with newline character to save space in plot
names <- gsub(" ", "\n", names)

# 
barplot(counts, names.arg=names, cex.names=0.8)

Plotting distance from the road

Using raw data

barplot(table(raw_data$distance_from_road), cex.names=0.8)

Using NHMDE

barplot(table(nhmde_data$distance_from_road), cex.names=0.8)

The pre-processing provided by NHMDE has given the factors shorter labels, allowing them to be clearly displayed, and has reordered the levels of the factor to give a more meaningful plot.

References

Baker E (2024). NHMDE R Package. https://github.com/edwbaker/NHMDE.