# install all required packages
pkgs = c("bench", "microbenchmark", "sf", "stars", "terra")
install.packages(pkgs)

Introduction

Benchmarking code in R can be done in several ways – using a built-in function or using more advanced functions from dedicated packages. The simplest way is the system.time() function. First, we should read the documentation to see how this function works and what it returns. To view the documentation, use the question mark “?” before the function name.

?system.time

We will test it on an example.

# sample numbers with replacement
# (n = 1,000,000)
t = system.time(sample(1:100, 1e6, replace = TRUE))
t

##    user  system elapsed 
##    0.06    0.00    0.07

We are most interested in the third value returned "elapsed", which is the actual execution time of the tested function in seconds (lower value is better/faster). Of course, we can refer to this value directly using the index in square brackets.

t[[3]]
## [1] 0.07
t[["elapsed"]]
## [1] 0.07

Now let’s check out the more advanced functions from the {bench} and {microbenchmark} packages on other data. For the next examples, we can generate a data frame with 3 columns.

df = data.frame(x = rnorm(1e5), y = rnorm(1e5), z = rnorm(1e5))
str(df)

## 'data.frame':    100000 obs. of  3 variables:
##  $ x: num  -1.1 -0.446 0.239 -0.69 0.656 ...
##  $ y: num  0.568 -0.395 0.346 -0.555 0.896 ...
##  $ z: num  -1.215 -0.975 0.454 -0.857 -0.194 ...

Benchmarking in the {bench} package can be done using the mark() function. It is quite intuitive to call the function explicitly from the package like this: bench::mark() (without having to load the entire package).

As a test, we will see how fast can be calculated the sum for each row of a data frame using two different functions: rowSums() and apply().

t = bench::mark(
  iterations = 10, check = TRUE,
  rowSums = rowSums(df),
  apply = apply(df, MARGIN = 1, FUN = sum) # 1 indicates rows
)
t[, 1:5]

## # A tibble: 2 x 5
##   expression      min   median `itr/sec` mem_alloc
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
## 1 rowSums       3.3ms   3.57ms    241.      3.15MB
## 2 apply       283.8ms 316.37ms      2.99    6.87MB

In the {microbenchmark} package, we have an analogous microbenchmark() function that works similarly, but returns slightly different parameters.

library(microbenchmark)
t = microbenchmark(
  times = 10, check = "identical",
  rowSums = rowSums(df),
  apply = apply(df, 1, sum)
)
t

## Unit: milliseconds
##     expr        min         lq       mean     median       uq        max neval
##  rowSums   3.017799   3.038325   3.290923   3.224911   3.2908   4.307269    10
##    apply 258.939062 286.609943 317.185685 315.195688 346.8460 370.976920    10

We can also easily prepare a visualization (in this case a boxplot). Moreover, there is autoplot() function in the {ggplot2} package with different types of plots.

boxplot(t)

As you can see, both functions allow for timings aggregation (mean, median, etc.) from multiple iterations and results validation, which is a significant improvement over system.time().

This section does not exhaust the topic. A more complex approach to benchmarks is code profiling, that is, testing the performance of large blocks of code. This allows us to find out which function is bottleneck. In R, this can be done using the Rprof() function or {profvis} package. However, for internal C/C++ code, this is not possible (this is true for most packages that are based on GDAL and GEOS).

For more information, see the books:

Advanced R by Hadley Wickham (chapter 23 Measuring performance)
Efficient R programming by Colin Gillespie and Robin Lovelace (chapter 1.6 Benchmarking and profiling)

Exercise

Compare computing the mean value of the vector using mean() and sum() / length() functions.

Part I: Raster Data

Data Source

We will use Digital Surface Model (DSM) from satellite Shuttle Radar Topography Mission (SRTM). You can download the data in 5° x 5° tiles from this website: https://srtm.csi.cgiar.org/srtmdata/. You can choose any tile, but we will use a tile with ID srtm_39_02 as an example. The data is available as a single layer GeoTIFF with a resolution of approximately 90 m (6000 x 6000 pixels). The coordinate reference system (CRS) of the raster is WGS84 (EPSG:4326), and the datatype is int16.

Please be aware that this dataset is quite small and consists of only one layer, so the results are not entirely representative (especially for large-scale applications).

Benchmarks

In the first step, let’s load all the necessary spatial packages.

library(sf)
library(stars)
library(terra)
sf_use_s2(FALSE) # use planar GEOS engine by default

Next, let’s create an empty folder into which we will download the raster.

url = "https://srtm.csi.cgiar.org/wp-content/uploads/files/srtm_5x5/TIFF/srtm_39_02.zip"

if (!dir.exists("data")) {
  dir.create("data")
  download.file(url, "data/srtm.zip")
  unzip("data/srtm.zip", exdir = "data") # ~70 MB tif
}

# list files after extracting the archive
list.files("data")

## [1] "points.gpkg"    "readme.txt"     "srtm.zip"       "srtm_39_02.hdr"
## [5] "srtm_39_02.tfw" "srtm_39_02.tif"

# or save path to raster as variable
f = list.files("data", pattern =  "\\.tif$", full.names = TRUE)

In the following sections, we will define the tasks in which we will test our packages.

Data Loading

Let’s start with the simplest task – loading raster data into R. In the {stars} package, the read_stars() function is used for this, and in the {terra} package, the rast() function.

x = read_stars("data/srtm_39_02.tif", proxy = FALSE)
x

## stars object with 2 dimensions and 1 attribute
## attribute(s), summary of first 1e+05 cells:
##                 Min. 1st Qu. Median     Mean 3rd Qu. Max.  NA's
## srtm_39_02.tif    -6       4     12 21.12648      35  122 85974
## dimension(s):
##   from   to offset        delta refsys point values x/y
## x    1 6000     10  0.000833333 WGS 84 FALSE   NULL [x]
## y    1 6000     55 -0.000833333 WGS 84 FALSE   NULL [y]

plot(x)

y = rast("data/srtm_39_02.tif")
y

## class       : SpatRaster 
## dimensions  : 6000, 6000, 1  (nrow, ncol, nlyr)
## resolution  : 0.0008333333, 0.0008333333  (x, y)
## extent      : 10, 15, 50, 55  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## source      : srtm_39_02.tif 
## name        : srtm_39_02

plot(y)

Before we go any further, we need to note one very important point. In the rast() documentation, we can read:

When a SpatRaster is created from a file, it does not load the cell (pixel) values into memory (RAM). It only reads the parameters that describe the geometry of the SpatRaster, such as the number of rows and columns and the coordinate reference system. The actual values will be read when needed.

So, in fact, when using the rast() function, we do not load the raster into memory, but only the metadata describing it. This is of great importance, because if we do not load it immediately, all the functions tested, will include overhead caused by loading the raster into memory. To check if the raster has been loaded into memory by {terra}, we can use the inMemory() function, which will return a logical value.

# is the raster in memory?
inMemory(y)

## [1] FALSE

As we noted earlier, the raster is not loaded into memory. To actually load the raster into memory, we need to use the set.values() function.

set.values(y)
inMemory(y)

## [1] TRUE

The situation is similar in the {stars} package, but in this case the argument in the read_stars() function takes logical values that define whether the file will be directly loaded into memory or not. See the documentation excerpt below:

?read_stars
proxy: logical; if TRUE, an object of class stars_proxy is read which contains array metadata only; if FALSE the full array data is read in memory. (...)

After this explanation, we can move on to the benchmark. The argument check in {bench} must be obligatorily set to FALSE, because the tested packages return different data structures.

t = bench::mark(
  iterations = 5, check = FALSE, memory = FALSE,
  stars = read_stars("data/srtm_39_02.tif", proxy = FALSE),
  terra = set.values(rast("data/srtm_39_02.tif"))
)
t[, 1:4]

## # A tibble: 2 x 4
##   expression      min   median `itr/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl>
## 1 stars         796ms    831ms      1.20
## 2 terra         807ms    898ms      1.14

Note that {stars} always loads the data into memory as a double type (although the raster is integer). Since {stars} objects are kept as arrays in R, you can convert them to integer type yourself. The situation is different for {terra}. The rasters are kept in an external C++ structure (outside R) as a double and all calculations are performed on this type. The difference in datatypes is related to the efficiency of operations and the amount of RAM required.

Check out the following example in {stars}:

typeof(x[[1]]) # select first attribute (srtm_39_02)

## [1] "double"

format(object.size(x), units = "auto") # print object size

## [1] "274.7 Mb"

x[[1]] = as.integer(x[[1]]) # convert double to integer
format(object.size(x), units = "auto") # size after conversion

## [1] "137.3 Mb"

For comparison {terra}:

format(object.size(y), units = "auto")

## [1] "1.3 Kb"

Cropping

In the next task, we will test the raster cropping to the defined range. We loaded the packages and data in the previous section. First, let’s check the extent of our raster (note that the coordinates in the WGS84 system are in degrees).

st_bbox(x) # {stars}

## xmin ymin xmax ymax 
##   10   50   15   55

ext(y) # {terra}

## SpatExtent : 10, 15, 50, 55 (xmin, xmax, ymin, ymax)

Next, let’s define a new smaller extent to which we will crop the raster. Note that objects of class bbox ({stars}) and ext ({terra}) have different coordinate orders.

x_ext = st_bbox(c(xmin = 11, ymin = 51, xmax = 14, ymax = 53),
             crs = st_crs(4326))
y_ext = ext(11, 14, 51, 53)

Visualize the new extent:

plot(x, axes = TRUE, main = NULL, reset = FALSE)
# bbox object must be converted to sfc geometry
plot(st_as_sfc(x_ext), add = TRUE, border = "red", lwd = 2)

Use the st_crop() function to crop the raster in {stars}, and the crop() function in {terra}.

x_crop = st_crop(x, x_ext)
st_bbox(x_crop)

## xmin ymin xmax ymax 
##   11   51   14   53

y_crop = crop(y, y_ext)
ext(y_crop)

## SpatExtent : 11, 14, 51, 53 (xmin, xmax, ymin, ymax)

Everything works fine, so we can move on to the benchmark. However, earlier we can remove unnecessary objects from the environment using the rm() function so that they don’t take up space in memory.

rm(x_crop, y_crop)

t = bench::mark(
  iterations = 5, check = FALSE, memory = FALSE,
  stars = st_crop(x, x_ext),
  terra = crop(y, y_ext)
)
t[, 1:4]

## # A tibble: 2 x 4
##   expression      min   median `itr/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl>
## 1 stars        38.8ms   45.5ms      21.5
## 2 terra        94.9ms   96.3ms      10.2

Downsampling

The last task will be to reduce the spatial resolution of the raster from higher to smaller (i.e. less detail). Let’s recall the resolution of our input raster.

# values in degrees
sapply(st_dimensions(x), "[[", "delta") # {stars}

##             x             y 
##  0.0008333333 -0.0008333333

res(y) # {terra}

## [1] 0.0008333333 0.0008333333

Next, let’s use the st_warp() from {stars}, and resample() from {terra}, but before we can do that, we need to prepare the rasters with the target geometry.

x_dest = st_as_stars(st_bbox(x), dx = 0.01, dy = 0.01, values = 0L)
x_small = st_warp(x, x_dest, method = "average", use_gdal = TRUE)
x_small

## stars object with 2 dimensions and 1 attribute
## attribute(s):
##                           Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## file26833b55d51.tif  -5.548611 43.85417 84.54514 168.7611 260.9462 1195.875
##                       NA's
## file26833b55d51.tif  38548
## dimension(s):
##   from  to offset delta refsys point values x/y
## x    1 500     10  0.01 WGS 84 FALSE   NULL [x]
## y    1 500     55 -0.01 WGS 84 FALSE   NULL [y]

y_dest = rast(extent = ext(y), resolution = c(0.01, 0.01), crs = "epsg:4326")
y_small = resample(y, y_dest, method = "average")
y_small

## class       : SpatRaster 
## dimensions  : 500, 500, 1  (nrow, ncol, nlyr)
## resolution  : 0.01, 0.01  (x, y)
## extent      : 10, 15, 50, 55  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## source      : memory 
## name        :  srtm_39_02 
## min value   :   -5.548611 
## max value   : 1195.875000

Now let’s perform the benchmark.

rm(x_small, y_small)

t = bench::mark(
  iterations = 5, check = FALSE, memory = FALSE,
  stars = st_warp(x, x_dest, method = "average", use_gdal = TRUE),
  terra = resample(y, y_dest, method = "average")
)
t[, 1:4]

## # A tibble: 2 x 4
##   expression      min   median `itr/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl>
## 1 stars         3.33s    3.52s     0.285
## 2 terra         1.52s    1.53s     0.652

There are other ways to reduce spatial resolution using aggregation. In {terra} you can use the aggregate() function which takes the aggregation factor. In {stars} you can use the same function (in fact, this is the equivalent of terra::extract()), but you must specify a polygon as an argument – st_make_grid() function can be useful for this.

Exercise

Prepare a task in which you check the performance of pixel classification using the height condition. For example, all pixels above the threshold of 200 m will become 1, and all pixels below 0. This task can be done in many ways, below are some ideas:

# {stars}
## method I
xx = x # duplicate object
xx[xx < 200] = 0
xx[xx >= 200] = 1
typeof(xx[[1]])

## method II (returns factor type, not numeric)
xx = x
xx = cut(xx, breaks = c(-50, 200, 1500), labels = c("0", "1"))
typeof(xx[[1]])

plot(xx, col = c("blue", "red"))
table(xx[[1]]) # frequency table

# {terra}
## method I
yy = ifel(y >= 200, 1, 0)

## method II
mat = matrix(c(-50, 200, 0,
               200, 1500, 1),
             nrow = 2, byrow = TRUE)
yy = classify(y, mat)

plot(yy, col = c("blue", "red"))
table(values(yy))

Part II: Vector Data

Attention! If you are lacking RAM after the first part, then you can turn on this notebook directly from the second part (it is independent).

library(sf)
library(stars)
library(terra)
sf_use_s2(FALSE) # use planar GEOS engine by default

Before starting, let’s remove all unnecessary objects from the environment.

rm(list = ls())

Data Source

In this section, we will use synthetic (i.e. generated by us) data as input. However, if you would like to use real data, OpenStreetMap is a great resource. This data can be obtained using the {osmdata} package.

As an example dataset, let’s generate a set of 200,000 points in the metric coordinate system (EPSG: 3857). The extent of the vector layer can be the same as the raster used in the previous section. Finally, we save our data to GeoPackage format to use exactly the same dataset in all tasks.

n = 200000
set.seed(123) # define random seed to make the results reproducible
coords = data.frame(x = runif(n, min = 10, max = 15),
                    y = runif(n, min = 50, max = 55))
pts = st_as_sf(coords, coords = c("x", "y"), crs = 4326)
pts = st_transform(pts, crs = 3857) # transform to planar CRS
pts[1:5, ]

## Simple feature collection with 5 features and 0 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1273260 ymin: 6869425 xmax: 1636657 ymax: 7329359
## Projected CRS: WGS 84 / Pseudo-Mercator
##                  geometry
## 1 POINT (1273260 6987154)
## 2 POINT (1551964 6909108)
## 3 POINT (1340830 7329359)
## 4 POINT (1604680 7173921)
## 5 POINT (1636657 6869425)

write_sf(pts, "data/points.gpkg") # ~82 MB

Benchmarks

Data Loading

First, let’s check how fast we can load vector data in GeoPackage format. Depending on the file format, the times can vary. It is also worth mentioning a new spatial data format – GeoParquet, which is more effective than the previous ones (check out this interesting blogpost from Dewey Dunnington).

For vector data, the functions are analogous to those for raster data. So in the case of {sf} this is read_sf(), and in the case of {terra} this is vect(). {sf} loads data into data.frame / tibble, while {terra} stores it in an external structure.

x = read_sf("data/points.gpkg")
x[1:5, ]

## Simple feature collection with 5 features and 0 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1273260 ymin: 6869425 xmax: 1636657 ymax: 7329359
## Projected CRS: WGS 84 / Pseudo-Mercator
## # A tibble: 5 x 1
##                geom
##         <POINT [m]>
## 1 (1273260 6987154)
## 2 (1551964 6909108)
## 3 (1340830 7329359)
## 4 (1604680 7173921)
## 5 (1636657 6869425)

y = vect("data/points.gpkg")
y

##  class       : SpatVector 
##  geometry    : points 
##  dimensions  : 2e+05, 0  (geometries, attributes)
##  extent      : 1113197, 1669792, 6446280, 7361860  (xmin, xmax, ymin, ymax)
##  source      : points.gpkg
##  coord. ref. : WGS 84 / Pseudo-Mercator (EPSG:3857)

Now let’s do the benchmark.

t = bench::mark(
  iterations = 5, check = FALSE, memory = FALSE,
  sf = read_sf("data/points.gpkg"),
  terra = vect("data/points.gpkg")
)
t[, 1:4]

## # A tibble: 2 x 4
##   expression      min   median `itr/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl>
## 1 sf            705ms    705ms      1.42
## 2 terra         564ms    576ms      1.73

Buffers

We will use the st_buffer() from {sf}, and buffer() from {terra} to create the buffers. These functions take the width as an argument and the number of segments to create the polygon. We need to make sure that the geometries will consist of an equal number of segments (the more segments, the rounder the polygon, but it takes more time and memory).

x_buff = st_buffer(x, dist = 50000, nQuadSegs = 5)
x_buff[1:5, ]

## Simple feature collection with 5 features and 0 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 1223260 ymin: 6819425 xmax: 1686657 ymax: 7379359
## Projected CRS: WGS 84 / Pseudo-Mercator
## # A tibble: 5 x 1
##                                                                             geom
##                                                                    <POLYGON [m]>
## 1 ((1323260 6987154, 1320813 6971703, 1313711 6957765, 1302649 6946703, 1288711~
## 2 ((1601964 6909108, 1599516 6893657, 1592414 6879719, 1581353 6868657, 1567414~
## 3 ((1390830 7329359, 1388383 7313908, 1381281 7299969, 1370220 7288908, 1356281~
## 4 ((1654680 7173921, 1652233 7158470, 1645131 7144532, 1634069 7133470, 1620131~
## 5 ((1686657 6869425, 1684209 6853974, 1677107 6840036, 1666046 6828974, 1652107~

y_buff = buffer(y, width = 50000, quadsegs = 5)
y_buff

##  class       : SpatVector 
##  geometry    : polygons 
##  dimensions  : 2e+05, 0  (geometries, attributes)
##  extent      : 1063197, 1719792, 6396280, 7411860  (xmin, xmax, ymin, ymax)
##  coord. ref. : WGS 84 / Pseudo-Mercator (EPSG:3857)

Simple visualization of selected objects (n = 100).

plot(x[1:100, ], col = "blue", pch = 20, axes = TRUE)
plot(x_buff[1:100, ], add = TRUE)

rm(x_buff, y_buff)

t = bench::mark(
  iterations = 4, check = FALSE, memory = FALSE,
  sf = st_buffer(x, dist = 50000, nQuadSegs = 5),
  terra = buffer(y, width = 50000, quadsegs = 5)
)
t[, 1:4]

## # A tibble: 2 x 4
##   expression      min   median `itr/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl>
## 1 sf            6.68s     6.8s     0.147
## 2 terra         6.18s    6.21s     0.158

Distance

As the last task in this part, we can calculate the distance between all points. The result returns an \(n * n\) matrix (this takes up a lot of space), so we will select a limited number of points (n = 3000). For this purpose, we will use the st_distance() from {sf}, and distance() from {terra}.

n = seq_len(3000) # define number of points
x_dist = st_distance(x[n, ], dist = "Euclidean")
x_dist[1:5, 1:5] # print only part

## Units: [m]
##          1         2        3        4         5
## 1      0.0 289425.24 348811.8 380422.5 381991.38
## 2 289425.2      0.00 470306.1 270009.3  93528.95
## 3 348811.8 470306.07      0.0 306231.1 546856.64
## 4 380422.5 270009.28 306231.1      0.0 306170.58
## 5 381991.4  93528.95 546856.6 306170.6      0.00

In {terra} the operation is identical, but an object of type dist must be converted to a matrix.

y_dist = as.matrix(distance(y[n, ]))
y_dist[1:5, 1:5]

##          1         2        3        4         5
## 1      0.0 289425.24 348811.8 380422.5 381991.38
## 2 289425.2      0.00 470306.1 270009.3  93528.95
## 3 348811.8 470306.07      0.0 306231.1 546856.64
## 4 380422.5 270009.28 306231.1      0.0 306170.58
## 5 381991.4  93528.95 546856.6 306170.6      0.00

The latest benchmark:

rm(x_dist, y_dist)

t = bench::mark(
  iterations = 5, check = FALSE, memory = FALSE,
  sf = st_distance(x[n, ], dist = "Euclidean"),
  terra = as.matrix(distance(y[n, ]))
)
t[, 1:4]

## # A tibble: 2 x 4
##   expression      min   median `itr/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl>
## 1 sf            908ms    913ms      1.09
## 2 terra         797ms    822ms      1.22

Excercise

An interesting idea for testing can be the use of both raster and vector data. As an example, you can check the sampling performance of raster pixels with the point layer we used earlier. The functions you need are:

terra::extract()
stars::st_extract()

Spatial Processing Benchmarks in R

Krzysztof Dyba

Introduction

Part I: Raster Data

Data Source

Benchmarks

Data Loading

Cropping

Downsampling

Exercise

Part II: Vector Data

Data Source

Benchmarks

Data Loading

Buffers

Distance

Excercise