---
title: "spThin example"
author: "Matthew E. Aiello-Lammens"
date: "November 14, 2019"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{spThin example}
---

# Introduction

This vignette goes through the spatial thinning example presented in
"spThin: An R package for spatial thinning of species occurrence records
for use in ecological niche models". Here we demonstrate how `spThin` can
be used to spatially thin species occurence records, we test how many
repetitions of the thinning algorithm are necessary to achieve the optimal
number of thinned records for a dataset previously thinned "by hand", and 
we examine whether there is a notable increase in efficiency if an occurence
dataset is thinned as multiple smaller groups of occurrences, 
rather than a single large set of occurrences.

# Load the `spThin` R package

Here we load the R package from source code. This source code will soon be
submitted to CRAN, so that this package can be loaded using standard 
package management methods

```{r load_package}
library( spThin )
```

# Example dataset

To demonstrate the use of `spThin` we used a set of 201 verified, georeferenced
occurrence records for the Caribbean spiny pocket mouse *Heteromys anomalus*. 
These occurrences are from Columbia, Venezuela, and
three Caribbean islands: Trinidad, Tobago, and Margarita. This dataset
is included as part of the `spThin` package.

#### Load *H. anomalus* dataset


```{r}
data( Heteromys_anomalus_South_America )
head( Heteromys_anomalus_South_America )
```

Here we load and examine the dataset. The name assigned to this dataset
is `Heteromys_anomalus_South_America`.
Note that this dataset includes a column indicating which REGION the 
occurrences was collected. Regions here refer to either the mainland or
three islands in which an occurrence was collected. We can see that
there are many more occurrences collected for the mainland than for
the three islands. Note that Trinidad has been shortened to 'trin'
an Margarita has been shortened to 'mar'.

```{r}
table( Heteromys_anomalus_South_America$REGION )
```

# Run `spThin::thin` on the full dataset

`thin` involves multiple settings. This allows for extensive 
flexibility in how the user spatially thins a dataset.
However, many have
default values. See `?thin` for further information.

```{r} 
thinned_dataset_full <-
  thin( loc.data = Heteromys_anomalus_South_America, 
        lat.col = "LAT", long.col = "LONG", 
        spec.col = "SPEC", 
        thin.par = 10, reps = 100, 
        locs.thinned.list.return = TRUE, 
        write.files = FALSE, 
        write.log.file = FALSE)
```

Below is the same call, but in this case we are writing a number of files to disk.
This files include a set of *.csv files of the thinned data and a log file.

```{r, eval=FALSE} 
thinned_dataset_full <-
  thin( loc.data = Heteromys_anomalus_South_America, 
        lat.col = "LAT", long.col = "LONG", 
        spec.col = "SPEC", 
        thin.par = 10, reps = 100, 
        locs.thinned.list.return = TRUE, 
        write.files = TRUE, 
        max.files = 5, 
        out.dir = "hanomalus_thinned_full/", out.base = "hanomalus_thinned", 
        write.log.file = TRUE,
        log.file = "hanomalus_thinned_full_log_file.txt" )
```


In the case above, we found that 10 repetitions were sufficient
to return spatially thinned datasets with the optimal number of 
occurrence records (124). Because this is a random process, 
it is possible that a similarly repeated run would not return **any**
datasets with the optimal number of occurrence records. 
To visually assess whether we are using enough `reps` to 
approach the optimal number we use the function `plotThin`,
This function produces three plots: 1) the cumulative number of 
records retained versus the number of repetitions, 2) the log
cumulative number of records retained versus the log number of
repetitions, and 3) a histogram of the maximum number of records
retained for each thinned dataset.

```{r}
plotThin( thinned_dataset_full )
```

Looking at the plot of cumulative maximum records retained 
versus number of repetitions, we see that in this run, this value
is constant through out the dataset creation process, indicating
that a single repetition would have sufficed to reach 124. This is
likely not always the case, but this plot can be examined to assess
whether a given number of repetitions is sufficient to achieve
a plateau (*sensu* species accumulation curves in Ecology).

# Run `spThin::thin` on datasets separated by region

#### Coastal mainland

```{r} 
thinned_dataset_mainland <-
  thin( loc.data = Heteromys_anomalus_South_America[ which( Heteromys_anomalus_South_America$REGION == "mainland" ) , ], 
        lat.col = "LAT", long.col = "LONG", 
        spec.col = "SPEC", 
        thin.par = 10, reps = 100, 
        locs.thinned.list.return = TRUE, 
        write.files = FALSE, 
        write.log.file = FALSE)
```

### Trinidad

```{r} 
thinned_dataset_trin <-
  thin( loc.data = Heteromys_anomalus_South_America[ which( Heteromys_anomalus_South_America$REGION == "trin" ) , ], 
        lat.col = "LAT", long.col = "LONG", 
        spec.col = "SPEC", 
        thin.par = 10, reps = 10, 
        locs.thinned.list.return = TRUE, 
        write.files = FALSE, 
        write.log.file = FALSE)
```

### Margarita

```{r} 
thinned_dataset_mar <-
  thin( loc.data = Heteromys_anomalus_South_America[ which( Heteromys_anomalus_South_America$REGION == "mar" ) , ], 
        lat.col = "LAT", long.col = "LONG", 
        spec.col = "SPEC", 
        thin.par = 10, reps = 10, 
        locs.thinned.list.return = TRUE, 
        write.files = FALSE, 
        write.log.file = FALSE )
```

### Tobago

```{r} 
thinned_dataset_tobago <-
  thin( loc.data = Heteromys_anomalus_South_America[ which( Heteromys_anomalus_South_America$REGION == "tobago" ) , ], 
        lat.col = "LAT", long.col = "LONG", 
        spec.col = "SPEC", 
        thin.par = 10, reps = 10, 
        locs.thinned.list.return = TRUE, 
        write.files = FALSE, 
        write.log.file = FALSE )
```