Contributing to palaeoverse: structure and standards
Introduction
The palaeoverse R package is a community-driven software library providing generic tools for palaeobiological analysis. The core principles of palaeoverse are to: (1) streamline analyses, (2) enhance code readability, and (3) improve reproducibility of results.
This document describes the essential structure and conventions of the palaeoverse R package. Naturally, there are always disagreements regarding best practices and conventions, and palaeoverse is no exception. Despite this, the conventions adopted in palaeoverse are intended to make life easier for both developers and users. It is worth noting that the core structure and conventions adopted in palaeoverse are heavily influenced by Hadley Wickham and Jenny Bryan’s R Packages and the tidyverse style guide, on which Google’s R style guide is currently based.
“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” – tidyverse style guide
Files
File names should always be concise, meaningful and end in '.R'. Avoid using special characters whenever possible (i.e. stick to letters and numbers), and use '_' or '-' instead of spaces in file names. The use of lowercase is also strongly encouraged, and file names should never differ only in capitalization. Note: the preferred separator is '_', to be consistent with the ‘lowercase snake_case’ convention for developing functions (more on that later!).
# No!
Sup3rAw3sumFuncti0n.r
# Better
super_awesome_function.R
# That is good!
time_bins.R
R style guide
We recommend using the following conventions to make your code easier for us (and the community) to understand and use:
Assigning: use arrows (<-) rather than equals sign to assign.
Comments: please explain your code comprehensively using comments (#) as this will help us to understand your intentions and to review the code.
Packages: please use as few external packages (i.e. dependencies) as possible, as the package is more likely to break when alterations are made to these; if your code uses packages, load them all at the start of the script (i.e. in tutorials or examples). This is more transparent for the user than having various packages sprinkled throughout the code.
Progress: if your code may take a long time to run (more than a minute), include a progress bar such as 'txtProgressBar()' in your function.
Sections: use lines of ‘-’ or ‘=’ to indicate section breaks. It is hard to provide exact details on how long a section break should be. However, we tend to think of them as paragraphs in our code. Tip: in RStudio, you can use '# SECTION NAME ####' to make a section which is interpreted by the outline feature (at least four trailing '#', '-' or '=' characters are needed). You can also make subsections by adding more '#' before the section name. If you are using RStudio, you can also insert a section with the keyboard shortcut Cmd/Ctrl + Shift + R.
Spacing: add spaces to your code to make it more human-readable.
Objects: use object names that are short but relevant to their contents, ideally in ‘lowercase snake_case’.
Wrapping: break long commands into multiple lines; conventionally, no line should be longer than ~75–80 characters.
Language: use plain English (British spelling preferred, but not required) for your code and documentation.
# Nope!
x = 1
y<-x+1
# Awesome!
x <- 1
y <- x + 1
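As a rough illustration of the progress, comment, and spacing conventions above, the sketch below wraps a deliberately slow loop in a text progress bar. The function name `slow_square()` and its `pause` argument are hypothetical and exist only for this example.

```r
# Hypothetical example function: squares each element of `x`,
# pausing on each iteration to mimic a slow computation.
slow_square <- function(x, pause = 0) {
  # Preallocate the output vector
  out <- numeric(length(x))
  # Set up a text progress bar spanning the number of iterations
  pb <- txtProgressBar(min = 0, max = length(x), style = 3)
  # Close the progress bar when the function exits, even after an error
  on.exit(close(pb))
  for (i in seq_along(x)) {
    Sys.sleep(pause)
    out[i] <- x[i]^2
    # Advance the progress bar to the current iteration
    setTxtProgressBar(pb, i)
  }
  out
}

squares <- slow_square(x = 1:5)
```

Closing the bar via `on.exit()` is one possible design choice: it keeps the console tidy if the loop fails part-way through.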
Data
Often we might want to include data in palaeoverse. This might come in the form of example datasets for testing functions (e.g. fossil occurrences), reference datasets such as the Geological Timescale 2020, or even data that is fundamental for a function to run. In palaeoverse’s structure, we currently recognize four main ways of including data in the package, depending on its usage.
- Raw data
- Internal data
- Exported data
- External data (preferred option)
However, for the sake of file size efficiency, please only include the data and variables that are absolutely vital for your function. Note: CRAN policy asks that packages be of the minimum necessary size and, as a general rule, that neither data nor documentation exceed 5 MB. The preferred way to include data in palaeoverse functions is to call the data via an API or download URL (i.e. external data).
Raw data
Raw data should always be included in inst/extdata. If you want to include cleaned/processed data in data/, it is generally a good idea to include the code used to process the raw data. If you ever need to reproduce or update your cleaned data, this will save you precious time. The code for processing your data should be included in data-raw/. Strictly speaking, raw data does not need to be documented. However, it is a good idea to include the original source and version (including a download date) in your code.
Internal data
Data you do not wish to directly make available to the user should be saved as R/sysdata.rda. This is the best option for pre-computed data tables that are needed for a function to run. Strictly speaking, internal data does not need to be documented. However, it is a good idea to document the internal data in the function documentation.
Exported data
Package data you wish to make available to the user should be stored in data/. Each file in this directory should be either a '.rda' or '.RData' file. This file type is fast, small and explicit. The most appropriate way to include exported data is to use usethis::use_data().
When using large datasets, we want to ensure that our files are not bloated and taking up too much space on our users’ machines. As such, you may want to experiment with the compression settings in usethis::use_data(). The default is bzip2, but xz or gzip can sometimes create smaller files. You can also implement several ‘hacks’ to generate smaller files, which you may want to consider for large datasets (depending on whether your data is sensitive to such changes). Data with many decimal places take up more space when saved, so consider how many significant figures are relevant for your data, and round() accordingly. You can also experiment with your file size by multiplying your data by X (e.g. 1,000) to remove decimal places altogether, so that the values can be stored as integers. Note: Remember to undo any transformations when calling or working with the data.
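The following is a minimal sketch of how such an experiment might look, using base R's save() on simulated data rather than usethis::use_data(). The data, the helper `saved_size()`, and the choice of three decimal places are all assumptions made for illustration; actual file sizes depend on your data.

```r
# Simulated data with many decimal places (hypothetical example data)
set.seed(1)
ages <- data.frame(max_ma = runif(10000, min = 0, max = 540))

# Hypothetical helper: save an object with a given compression type
# and return the resulting file size in bytes
saved_size <- function(object, compress) {
  path <- tempfile(fileext = ".rda")
  # Remove the temporary file once its size has been measured
  on.exit(unlink(path))
  save(object, file = path, compress = compress)
  file.size(path)
}

# Compare compression types on the same data
sizes <- vapply(c("gzip", "bzip2", "xz"), saved_size, numeric(1),
                object = ages)

# Round to three decimal places and store as integers (value * 1000)
ages_int <- data.frame(max_ma = as.integer(round(ages$max_ma * 1000)))
size_int <- saved_size(ages_int, compress = "xz")

# Undo the transformation when working with the data
restored <- ages_int$max_ma / 1000
```

Inspecting `sizes` and `size_int` shows which option is smallest for this dataset. The restored values differ from the originals only by the rounding applied.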
External data
External data is the preferred approach for contributors to include data in palaeoverse and will be required in almost all cases. This is to ensure that the package does not become unnecessarily bloated for all users when the data can just be called by a download link. Currently, external data are stored in a GitHub repo (reconstruction files for palaeorotate). It is likely that in the future this will be migrated elsewhere, as GitHub is not a data repository (hopefully a dedicated server when funding allows!). If you wish to include external data, please get in touch with one of the palaeoverse developers, and we can discuss what might be the best solution.
# Generate a path for the download in the session's temporary directory
destfile <- file.path(tempdir(), "mgoo.csv")
# Download the file ("www.goo.com" is a placeholder, not a real data URL)
download.file(url = "www.goo.com", destfile = destfile)
# Run some kind of function using the download
# REMEMBER: remove the downloaded file afterwards
unlink(x = destfile)
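One way to make the download-and-clean-up pattern harder to get wrong is to wrap it in a function that removes the temporary file automatically. The sketch below is only an illustration: `read_external_csv()` is a hypothetical helper, not a palaeoverse function, and a local file addressed through a `file://` URL stands in for a remote download so that the example runs offline.

```r
# Hypothetical helper: download a CSV to a temporary file, read it,
# and remove the temporary file afterwards
read_external_csv <- function(url) {
  destfile <- tempfile(fileext = ".csv")
  # Clean up the download when the function exits, even after an error
  on.exit(unlink(destfile))
  tryCatch(
    download.file(url = url, destfile = destfile, quiet = TRUE),
    error = function(e) {
      stop("Download failed. Check your internet connection and the URL.",
           call. = FALSE)
    }
  )
  read.csv(destfile)
}

# Stand-in for a remote file (assumption: real use would pass an
# https:// download link instead of a local file:// URL)
local_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(taxon = c("A", "B"), age = c(10, 20)),
          file = local_csv, row.names = FALSE)
local_url <- paste0("file://", normalizePath(local_csv, winslash = "/"))

occdf <- read_external_csv(local_url)
unlink(local_csv)
```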
Data documentation
Objects in data/ are always exported by default, and should be documented accordingly. In order to properly document data, you must document the name of the dataset and save the documentation in the R/ directory. For example, the documentation block used to document GTS2020 is saved in R/data.R and is similar to the following (simplified here):
#' Geological Time Scale 2020
#'
#' A dataset of the Geological Time Scale 2020. Age data from:
#' \url{https://stratigraphy.org/timescale/}.
#' Supplementary information is also included in the dataset for
#' plotting functionality (e.g. GTS2020 colour scheme).
#'
#' @format A data frame with 189 rows and 20 variables:
#' \describe{
#' \item{index}{Index number for the order of all intervals in the dataset}
#' \item{stage_number}{Index number for stages}
#' \item{series_number}{Index number for series}
#' \item{system_number}{Index number for system}
#' \item{interval_name}{Names of intervals in the dataset}
#' ...
#' }
#' @section References:
#' Gradstein, F.M., Ogg, J.G., Schmitz, M.D. and Ogg, G.M. eds. (2020).
#' Geologic time scale 2020. Elsevier.
#' @source Compiled by Lewis A. Jones. See item descriptions for details.
"GTS2020"
What does it mean to document your data?
Documenting your data is to provide a thorough description of the data and any information relevant to understanding it. Good documentation is key to reproducible science, and will also help us to ensure that we acknowledge all data collators who have provided data for the palaeoverse package. When providing us with datasets, please give the following information:
Author(s): Who collected the data, and prepared its current format? Please provide citations if relevant. Have the authors given permission for the dataset to be included in the palaeoverse package?
Description: Brief description of the dataset.
Provenance: When and how was the data collected? When was the dataset finalised in its current form?
Size: Please state the full, uncompressed size of the file.
Variables: Describe each of the columns in your dataset, providing as much information as you can on the full name of the variable, data type (e.g. continuous, discrete, categoric, etc.), units, and how it was collected.
Functions
Conventions
Functions are saved as '.R' files in the 'R' folder. The name of the file needs to correspond to the function, e.g. the file time_bins.R contains the function time_bins(). Function names and arguments should be informative and aim to keep in line with available functions in palaeoverse. For example, all functions that bin data, whether it be by time or space, start with bin_, whereas taxonomic-related functions start with tax_ (e.g. tax_unique, tax_check). Where possible, argument names should also be consistent with functions in the package. For example, if a function requires an occurrence dataframe as input, the argument name should be occdf. It is difficult to give a complete static summary of all the conventions as palaeoverse will undoubtedly evolve over time. We also wish to remain flexible for contributors, provided that it does not compromise the user-friendliness of the package. We welcome contributors to check through the source code of palaeoverse for function examples.
Documentation
For documentation, we use roxygen2. The title, a brief description, every argument (including the input class and default input) and the output of the function need to be documented in roxygen2 style, for example:
#' An exemplary function
#'
#' This function is used to demonstrate the documentation of functions.
#'
#' @param example \code{character}. Arguments are the function inputs.
#' @param another_example \code{logical}. All arguments need to be documented.
#' @return A \code{list} is returned as output in this example function.
#' @details Describe more details if necessary, and list sources if applicable.
#' @section Developer:
#' Your name
#' @section Reviewer:
#' Name(s)
#' @examples
#' # Show off the example function
#' example_function(example = "documentation", another_example = TRUE)
#' @export
Add the '@export' namespace tag to make the function available to users.
To get started with roxygen2, set your working directory to the package directory, or to the directory where you store your function as a '.R' file. The R command devtools::document() creates a ‘man’ folder in the package directory, which contains a '.Rd' file corresponding to your documented function. Opening that file in RStudio, you can create a preview to see what the documentation looks like. After you have implemented changes, rebuild your documentation file with Ctrl/Cmd + Shift + D in RStudio or with devtools::document().
Efficiency
When possible, you should make coding decisions which will ensure that your code is maximally efficient - this could make a big difference to users who want to apply your function to a large dataset. A few general examples include:
- Using functions from the 'apply' family can be faster than for-loops. However, a for-loop is almost always more reader-friendly, so try to balance the two!
- Storing objects as lists, or lists of lists, rather than data frames inside functions.
- Vectorising, when possible. This way a function can operate on all elements of a vector without needing to loop through each element.
- Avoiding the use of rbind() and cbind() to compile objects row-wise or column-wise within for-loops; specifying the row or column number using the iteration number is usually a faster alternative.
However, please don’t let this deter you - we welcome submissions from R users of all experience levels, and our team of in-house code evaluators can help you with any concerns about efficiency.
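To make the points about rbind() and vectorisation concrete, the sketch below builds the same small table in three ways. The object names and the toy calculation are assumptions for illustration only; wrapping each approach in system.time() on your own data is one way to see how much the choice matters.

```r
n <- 1000

# Slower pattern: growing a data frame with rbind() inside a for-loop
grow <- data.frame()
for (i in seq_len(n)) {
  grow <- rbind(grow, data.frame(id = i, value = i^2))
}

# Usually faster: preallocate the object, then fill it by row number
filled <- data.frame(id = integer(n), value = numeric(n))
for (i in seq_len(n)) {
  filled$id[i] <- i
  filled$value[i] <- i^2
}

# Vectorised: operate on all elements at once, no loop required
ids <- seq_len(n)
vectorised <- data.frame(id = ids, value = ids^2)
```

All three approaches produce the same values; only the amount of copying and looping differs.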
Error messages
To ensure that the functions are used appropriately, error messages should be generated when the function is receiving input it is not designed for. Error messages consist of a brief description of what went wrong. Sometimes it makes sense to specify where it went wrong, and what kind of input was expected. Optionally, error messages can include hints to guide the user towards correct input.
Examples of error messages include:
- input of the wrong format, e.g. “Error: 'x' must be a numeric vector.”
- input of the wrong dimension, e.g. “Error: 'x' must be a data.frame with two columns.”
- input without mandatory names, e.g. “Error: 'x' does not have a column 'stage'. 'x' must be a data.frame with the columns 'stage' and 'age'.”
To implement error messages in R, the stop() function can be used:
# Generate error message (stop() adds the "Error:" prefix itself)
if (!is.numeric(x)) {
  stop("'x' must be a numeric vector.", call. = FALSE)
}
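Several checks of this kind are often gathered at the top of a function. The sketch below shows one possible arrangement covering the three kinds of input problem listed above; `check_stage_data()` is a hypothetical helper written for this example, not part of palaeoverse.

```r
# Hypothetical helper that validates an input data.frame before use
check_stage_data <- function(x) {
  # Wrong format
  if (!is.data.frame(x)) {
    stop("'x' must be a data.frame.", call. = FALSE)
  }
  # Missing mandatory column names
  required <- c("stage", "age")
  missing_cols <- setdiff(required, colnames(x))
  if (length(missing_cols) > 0) {
    stop("'x' does not have the column(s): ",
         paste0("'", missing_cols, "'", collapse = ", "),
         ". 'x' must be a data.frame with the columns 'stage' and 'age'.",
         call. = FALSE)
  }
  # Wrong column type
  if (!is.numeric(x$age)) {
    stop("'age' must be a numeric column.", call. = FALSE)
  }
  invisible(TRUE)
}

# Capture the error message rather than halting the script
msg <- tryCatch(
  check_stage_data(data.frame(stage = "Albian")),
  error = function(e) conditionMessage(e)
)
```

Here `msg` holds the text of the error raised for the missing 'age' column, which names both the problem and the expected input.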
Warning messages
In general, we try to minimise the use of warning messages in palaeoverse as these can be easily ignored by the user, or completely skipped in pipeline analyses. However, as with most things in life, there is a time and a place. So, while the use of warning() is generally discouraged if throwing an error will suffice, we do allow the use of warning messages in palaeoverse, where appropriate. Warning messages can be useful for providing additional information to the user about function output. For example, in the palaeorotate function we provide a warning to users if one or more points could not be reconstructed due to the georeferenced plate not existing at the desired time of reconstruction.
To implement warning messages in R, the warning() function can be used:
# Generate warning message (warning() adds the "Warning message:" header itself)
if (!is.numeric(x)) {
  warning("'x' must be a numeric vector.", call. = FALSE)
}
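A warning fits best when the function can still return useful output but the user should know that something was skipped. The sketch below illustrates that situation with a hypothetical function, `parse_ages()`, which is an assumption for this example and not part of palaeoverse.

```r
# Hypothetical function: converts ages to numeric and warns, rather than
# fails, when some values cannot be converted
parse_ages <- function(x) {
  ages <- suppressWarnings(as.numeric(x))
  # Count values that were present in the input but failed to convert
  failed <- sum(is.na(ages) & !is.na(x))
  if (failed > 0) {
    warning(failed, " value(s) could not be converted and were set to NA.",
            call. = FALSE)
  }
  ages
}

# The result is still returned; the handler shows what the user would see
ages <- withCallingHandlers(
  parse_ages(c("66", "145.0", "unknown")),
  warning = function(w) {
    message("Caught warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```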
Tests
Testing is perhaps one of the most important parts of developing a function. If you have not already, we recommend reading through Chapter 12–Testing of ‘R packages’ before continuing here, as palaeoverse follows this guidance.
Before a function can be added to palaeoverse, it needs to go through formal testing. This is required as hopefully your function will be very popular, and we need to ensure that it behaves as expected to avoid any issues. To do so, we make use of the R package testthat.
The initial setup for function testing with testthat is already established in palaeoverse, under the directory 'tests/testthat/'. The organisation of test files must match that of 'R/' files in palaeoverse. For example, the function time_bins() is saved as time_bins.R in the R/ directory, and has an associated test file of 'test-time_bins.R' in the 'tests/testthat' directory. This ensures that associated function testing is clear.
Tip: the usethis package provides a helpful pair of functions for creating/alternating between these files:
usethis::use_r()
usethis::use_test()
Make sure to create enough tests within your test file to cover all of the possible variants of a function. This includes creating tests that cover most or all optional arguments and the majority of options for those arguments (and the required arguments). Remember, even if you personally would not use a function for a particular reason, you must attempt to cover the majority of edge cases that may arise that are allowed by the function.
Related tests should be bundled within test_that() calls, combined with strings of text to identify the broad reason for each bundle of tests (e.g., testing a function works with a particular type of data). Finally, if tests rely on data or packages beyond those that palaeoverse itself requires, they should be skipped if those data or packages are not available.
Ultimately, we aim to have >90% code coverage (95% preferred), which means that more than 90% of the lines of code in our codebase should be run by at least one test. Pushing code to GitHub will trigger a code coverage check, which will alert you as to whether you need to write more tests.
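As an illustration of how tests might be bundled, the sketch below defines a small hypothetical function, `count_unique()`, and tests it with testthat. In a real contribution the function would live in R/ and the tests in the matching file under tests/testthat/; the function, the test descriptions, and the use of dplyr as an optional package are assumptions made for this example.

```r
library(testthat)

# Hypothetical function under test: counts unique, non-missing names
count_unique <- function(x) {
  if (!is.character(x)) {
    stop("'x' must be a character vector.", call. = FALSE)
  }
  length(unique(x[!is.na(x)]))
}

# Bundle related expectations and describe what the bundle covers
test_that("count_unique() works with character input", {
  expect_equal(count_unique(c("Rex", "Rex", "Tri")), 2)
  expect_equal(count_unique(c("Rex", NA)), 1)
  expect_equal(count_unique(character(0)), 0)
})

test_that("count_unique() rejects invalid input", {
  expect_error(count_unique(1:3), "must be a character vector")
})

test_that("count_unique() agrees with an optional package", {
  # Skip, rather than fail, if the optional package is not installed
  skip_if_not_installed("dplyr")
  taxa <- c("a", "b", "a")
  expect_equal(count_unique(taxa), dplyr::n_distinct(taxa))
})
```

Each test_that() call covers one broad reason for testing (valid input, invalid input, optional dependency), which keeps failures easy to interpret.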
Contributing to palaeoverse
At palaeoverse we have adopted a set of structures and standards to follow for contributing to the development of palaeoverse. If you would like to contribute to the palaeoverse toolkit, we strongly advise reading this document first. If you plan to contribute a function to palaeoverse, you should first raise an issue via the GitHub repository (see below). This way the development team can assess whether the function is suitable or needed in the palaeoverse toolkit prior to submission.
Git and GitHub
We use Git via GitHub (under the palaeoverse GitHub umbrella) to manage our R code and data. If you are not familiar with these tools, there are some excellent free resources available online.
How to contribute
Minor changes
You can fix typos, spelling mistakes, or grammatical errors in the documentation directly on GitHub, provided the change is made in the source file. This means you’ll need to edit roxygen2 comments in the .R file, not the .Rd file.
Substantial changes
If you would like to make a substantial change, you should first file an issue and make sure someone from the development team agrees that it’s needed. If you’ve found a bug, please file an issue that illustrates the bug with a reproducible example.
Pull request process
You (the contributor) should clone the desired repository (i.e. the palaeoverse R package) to your personal computer. Before changes are made, you should switch to a new git branch (i.e., not the main branch). When your changes are complete, you can submit your changes for merging via a pull request (“PR”) on GitHub. Note that a complete pull request should include a succinct description (see function template) of what the code changes do, proper documentation (via roxygen2), and unit tests (via testthat). Only the description is required for the initial pull request and code review (see below), but pull requests will not be merged until they contain complete documentation and tests.
If you are not comfortable with git/GitHub, you can reach out to one of the core developers (see collaborators) via email and they can make a pull request on your behalf. However, you will be expected to respond to any reviewer comments on GitHub (see below).
If you don’t feel comfortable implementing changes yourself, you can submit a bug report or feature request as a GitHub issue in the proper repository (e.g. palaeoverse issues).
Code review
All pull requests must be reviewed by two core developers of palaeoverse (see collaborators) before merging. The review process will ensure that contributions 1) meet the standards and expectations as described above, 2) successfully perform the functions that they claim to perform, and 3) don’t break any other parts of the codebase.
Submitting a pull request to the palaeoverse package will automatically initiate an R CMD check, lintr check, and test coverage check via GitHub Actions. While these checks will conduct some automatic review to ensure the package has not been broken by the new code and that the code matches the style guide (see above), a manual review is still required before the pull request can be merged.
Reviewers may have questions while reviewing your pull request. You are expected to respond to any of these questions via GitHub. If fixes and/or changes are required, you are expected to make these changes. If the required changes are minor enough, reviewers may make them for you, but this should not be expected. If you have any questions or lack the background to make the required changes, you should work with the reviewer to determine a plan of attack.