# install all required packages
pkgs = c("bench", "microbenchmark", "sf", "stars", "terra")
install.packages(pkgs)
Benchmarking code in R can be done in several ways – using a built-in
function or using more advanced functions from dedicated packages. The
simplest way is the system.time()
function. First, we
should read the documentation to see how this function works and what it
returns. To view the documentation, use the question mark “?” before the
function name.
?system.time
We will test it on an example.
# sample numbers with replacement
# (n = 1,000,000)
t = system.time(sample(1:100, 1e6, replace = TRUE))
t
## user system elapsed
## 0.06 0.00 0.07
We are most interested in the third value returned
"elapsed"
, which is the actual execution time of the tested
function in seconds (lower value is better/faster). Of course, we can
refer to this value directly using the index in square brackets.
t[[3]]
## [1] 0.07
t[["elapsed"]]
## [1] 0.07
Now let’s check out the more advanced functions from the {bench} and {microbenchmark} packages on other data. For the next examples, we can generate a data frame with 3 columns.
df = data.frame(x = rnorm(1e5), y = rnorm(1e5), z = rnorm(1e5))
str(df)
## 'data.frame': 100000 obs. of 3 variables:
## $ x: num -1.1 -0.446 0.239 -0.69 0.656 ...
## $ y: num 0.568 -0.395 0.346 -0.555 0.896 ...
## $ z: num -1.215 -0.975 0.454 -0.857 -0.194 ...
Benchmarking in the {bench}
package can be done using
the mark()
function. It is quite intuitive to call the
function explicitly from the package like this:
bench::mark()
(without having to load the entire
package).
As a test, we will see how fast can be calculated the sum for each
row of a data frame using two different functions:
rowSums()
and apply()
.
t = bench::mark(
iterations = 10, check = TRUE,
rowSums = rowSums(df),
apply = apply(df, MARGIN = 1, FUN = sum) # 1 indicates rows
)
t[, 1:5]
## # A tibble: 2 x 5
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
## 1 rowSums 3.3ms 3.57ms 241. 3.15MB
## 2 apply 283.8ms 316.37ms 2.99 6.87MB
In the {microbenchmark}
package, we have an analogous
microbenchmark()
function that works similarly, but returns
slightly different parameters.
library(microbenchmark)
t = microbenchmark(
times = 10, check = "identical",
rowSums = rowSums(df),
apply = apply(df, 1, sum)
)
t
## Unit: milliseconds
## expr min lq mean median uq max neval
## rowSums 3.017799 3.038325 3.290923 3.224911 3.2908 4.307269 10
## apply 258.939062 286.609943 317.185685 315.195688 346.8460 370.976920 10
We can also easily prepare a visualization (in this case a boxplot).
Moreover, there is autoplot()
function in the
{ggplot2}
package with different types of plots.
boxplot(t)
As you can see, both functions allow for timings aggregation (mean,
median, etc.) from multiple iterations and results validation, which is
a significant improvement over system.time()
.
This section does not exhaust the topic. A more complex approach to
benchmarks is code profiling, that is, testing the performance of large
blocks of code. This allows us to find out which function is bottleneck.
In R, this can be done using the Rprof()
function or {profvis} package.
However, for internal C/C++ code, this is not possible (this is true for
most packages that are based on GDAL and GEOS).
For more information, see the books:
Exercise
Compare computing the mean value of the vector using
mean()
and sum() / length()
functions.
We will use Digital
Surface Model (DSM) from satellite Shuttle Radar
Topography Mission (SRTM). You can download the data in 5° x 5°
tiles from this website: https://srtm.csi.cgiar.org/srtmdata/. You can choose any
tile, but we will use a tile with ID srtm_39_02
as an example. The data is available as a single layer GeoTIFF with a
resolution of approximately 90 m (6000 x 6000 pixels). The coordinate
reference system (CRS) of the raster is WGS84 (EPSG:4326), and the
datatype is int16
.
Please be aware that this dataset is quite small and consists of only one layer, so the results are not entirely representative (especially for large-scale applications).
In the first step, let’s load all the necessary spatial packages.
library(sf)
library(stars)
library(terra)
sf_use_s2(FALSE) # use planar GEOS engine by default
Next, let’s create an empty folder into which we will download the raster.
url = "https://srtm.csi.cgiar.org/wp-content/uploads/files/srtm_5x5/TIFF/srtm_39_02.zip"
if (!dir.exists("data")) {
dir.create("data")
download.file(url, "data/srtm.zip")
unzip("data/srtm.zip", exdir = "data") # ~70 MB tif
}
# list files after extracting the archive
list.files("data")
## [1] "points.gpkg" "readme.txt" "srtm.zip" "srtm_39_02.hdr"
## [5] "srtm_39_02.tfw" "srtm_39_02.tif"
# or save path to raster as variable
f = list.files("data", pattern = "\\.tif$", full.names = TRUE)
In the following sections, we will define the tasks in which we will test our packages.
Let’s start with the simplest task – loading raster data into R. In
the {stars}
package, the read_stars()
function
is used for this, and in the {terra}
package, the
rast()
function.
x = read_stars("data/srtm_39_02.tif", proxy = FALSE)
x
## stars object with 2 dimensions and 1 attribute
## attribute(s), summary of first 1e+05 cells:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## srtm_39_02.tif -6 4 12 21.12648 35 122 85974
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 6000 10 0.000833333 WGS 84 FALSE NULL [x]
## y 1 6000 55 -0.000833333 WGS 84 FALSE NULL [y]
plot(x)
y = rast("data/srtm_39_02.tif")
y
## class : SpatRaster
## dimensions : 6000, 6000, 1 (nrow, ncol, nlyr)
## resolution : 0.0008333333, 0.0008333333 (x, y)
## extent : 10, 15, 50, 55 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## source : srtm_39_02.tif
## name : srtm_39_02
plot(y)
Before we go any further, we need to note one very important point.
In the rast()
documentation, we can read:
When a SpatRaster is created from a file, it does not load the cell (pixel) values into memory (RAM). It only reads the parameters that describe the geometry of the SpatRaster, such as the number of rows and columns and the coordinate reference system. The actual values will be read when needed.
So, in fact, when using the rast()
function, we do not
load the raster into memory, but only the metadata describing it. This
is of great importance, because if we do not load it immediately, all
the functions tested, will include overhead caused by loading the raster
into memory. To check if the raster has been loaded into memory by
{terra}
, we can use the inMemory()
function,
which will return a logical value.
# is the raster in memory?
inMemory(y)
## [1] FALSE
As we noted earlier, the raster is not loaded into memory. To
actually load the raster into memory, we need to use the
set.values()
function.
set.values(y)
inMemory(y)
## [1] TRUE
The situation is similar in the {stars}
package, but in
this case the argument in the read_stars()
function takes
logical values that define whether the file will be directly loaded into
memory or not. See the documentation excerpt below:
?read_stars
proxy: logical; if TRUE, an object of class stars_proxy is read which contains array metadata only; if FALSE the full array data is read in memory. (...)
After this explanation, we can move on to the benchmark. The argument
check
in {bench}
must be obligatorily set to
FALSE
, because the tested packages return different data
structures.
t = bench::mark(
iterations = 5, check = FALSE, memory = FALSE,
stars = read_stars("data/srtm_39_02.tif", proxy = FALSE),
terra = set.values(rast("data/srtm_39_02.tif"))
)
t[, 1:4]
## # A tibble: 2 x 4
## expression min median `itr/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl>
## 1 stars 796ms 831ms 1.20
## 2 terra 807ms 898ms 1.14
Note that {stars}
always loads the data into memory as a
double type (although the raster is integer). Since {stars}
objects are kept as arrays in R, you can convert them to integer type
yourself. The situation is different for {terra}
. The
rasters are kept in an external C++ structure (outside R) as a double
and all calculations are performed on this type. The difference in
datatypes is related to the efficiency of operations and the amount of
RAM required.
Check out the following example in {stars}
:
typeof(x[[1]]) # select first attribute (srtm_39_02)
## [1] "double"
format(object.size(x), units = "auto") # print object size
## [1] "274.7 Mb"
x[[1]] = as.integer(x[[1]]) # convert double to integer
format(object.size(x), units = "auto") # size after conversion
## [1] "137.3 Mb"
For comparison {terra}
:
format(object.size(y), units = "auto")
## [1] "1.3 Kb"
In the next task, we will test the raster cropping to the defined range. We loaded the packages and data in the previous section. First, let’s check the extent of our raster (note that the coordinates in the WGS84 system are in degrees).
st_bbox(x) # {stars}
## xmin ymin xmax ymax
## 10 50 15 55
ext(y) # {terra}
## SpatExtent : 10, 15, 50, 55 (xmin, xmax, ymin, ymax)
Next, let’s define a new smaller extent to which we will crop the
raster. Note that objects of class bbox
({stars}
) and ext
({terra}
) have
different coordinate orders.
x_ext = st_bbox(c(xmin = 11, ymin = 51, xmax = 14, ymax = 53),
crs = st_crs(4326))
y_ext = ext(11, 14, 51, 53)
Visualize the new extent:
plot(x, axes = TRUE, main = NULL, reset = FALSE)
# bbox object must be converted to sfc geometry
plot(st_as_sfc(x_ext), add = TRUE, border = "red", lwd = 2)
Use the st_crop()
function to crop the raster in
{stars}
, and the crop()
function in
{terra}
.
x_crop = st_crop(x, x_ext)
st_bbox(x_crop)
## xmin ymin xmax ymax
## 11 51 14 53
y_crop = crop(y, y_ext)
ext(y_crop)
## SpatExtent : 11, 14, 51, 53 (xmin, xmax, ymin, ymax)
Everything works fine, so we can move on to the benchmark. However,
earlier we can remove unnecessary objects from the environment using the
rm()
function so that they don’t take up space in
memory.
rm(x_crop, y_crop)
t = bench::mark(
iterations = 5, check = FALSE, memory = FALSE,
stars = st_crop(x, x_ext),
terra = crop(y, y_ext)
)
t[, 1:4]
## # A tibble: 2 x 4
## expression min median `itr/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl>
## 1 stars 38.8ms 45.5ms 21.5
## 2 terra 94.9ms 96.3ms 10.2
The last task will be to reduce the spatial resolution of the raster from higher to smaller (i.e. less detail). Let’s recall the resolution of our input raster.
# values in degrees
sapply(st_dimensions(x), "[[", "delta") # {stars}
## x y
## 0.0008333333 -0.0008333333
res(y) # {terra}
## [1] 0.0008333333 0.0008333333
Next, let’s use the st_warp()
from {stars}
,
and resample()
from {terra}
, but before we can
do that, we need to prepare the rasters with the target geometry.
x_dest = st_as_stars(st_bbox(x), dx = 0.01, dy = 0.01, values = 0L)
x_small = st_warp(x, x_dest, method = "average", use_gdal = TRUE)
x_small
## stars object with 2 dimensions and 1 attribute
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## file26833b55d51.tif -5.548611 43.85417 84.54514 168.7611 260.9462 1195.875
## NA's
## file26833b55d51.tif 38548
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 500 10 0.01 WGS 84 FALSE NULL [x]
## y 1 500 55 -0.01 WGS 84 FALSE NULL [y]
y_dest = rast(extent = ext(y), resolution = c(0.01, 0.01), crs = "epsg:4326")
y_small = resample(y, y_dest, method = "average")
y_small
## class : SpatRaster
## dimensions : 500, 500, 1 (nrow, ncol, nlyr)
## resolution : 0.01, 0.01 (x, y)
## extent : 10, 15, 50, 55 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## source : memory
## name : srtm_39_02
## min value : -5.548611
## max value : 1195.875000
Now let’s perform the benchmark.
rm(x_small, y_small)
t = bench::mark(
iterations = 5, check = FALSE, memory = FALSE,
stars = st_warp(x, x_dest, method = "average", use_gdal = TRUE),
terra = resample(y, y_dest, method = "average")
)
t[, 1:4]
## # A tibble: 2 x 4
## expression min median `itr/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl>
## 1 stars 3.33s 3.52s 0.285
## 2 terra 1.52s 1.53s 0.652
There are other ways to reduce spatial resolution using aggregation.
In {terra}
you can use the aggregate()
function which takes the aggregation factor. In {stars}
you
can use the same function (in fact, this is the equivalent of
terra::extract()
), but you must specify a polygon as an
argument – st_make_grid()
function can be useful for
this.
Prepare a task in which you check the performance of pixel classification using the height condition. For example, all pixels above the threshold of 200 m will become 1, and all pixels below 0. This task can be done in many ways, below are some ideas:
# {stars}
## method I
xx = x # duplicate object
xx[xx < 200] = 0
xx[xx >= 200] = 1
typeof(xx[[1]])
## method II (returns factor type, not numeric)
xx = x
xx = cut(xx, breaks = c(-50, 200, 1500), labels = c("0", "1"))
typeof(xx[[1]])
plot(xx, col = c("blue", "red"))
table(xx[[1]]) # frequency table
# {terra}
## method I
yy = ifel(y >= 200, 1, 0)
## method II
mat = matrix(c(-50, 200, 0,
200, 1500, 1),
nrow = 2, byrow = TRUE)
yy = classify(y, mat)
plot(yy, col = c("blue", "red"))
table(values(yy))
Attention! If you are lacking RAM after the first part, then you can turn on this notebook directly from the second part (it is independent).
library(sf)
library(stars)
library(terra)
sf_use_s2(FALSE) # use planar GEOS engine by default
Before starting, let’s remove all unnecessary objects from the environment.
rm(list = ls())
In this section, we will use synthetic (i.e. generated by us) data as input. However, if you would like to use real data, OpenStreetMap is a great resource. This data can be obtained using the {osmdata} package.
As an example dataset, let’s generate a set of 200,000 points in the metric coordinate system (EPSG: 3857). The extent of the vector layer can be the same as the raster used in the previous section. Finally, we save our data to GeoPackage format to use exactly the same dataset in all tasks.
n = 200000
set.seed(123) # define random seed to make the results reproducible
coords = data.frame(x = runif(n, min = 10, max = 15),
y = runif(n, min = 50, max = 55))
pts = st_as_sf(coords, coords = c("x", "y"), crs = 4326)
pts = st_transform(pts, crs = 3857) # transform to planar CRS
pts[1:5, ]
## Simple feature collection with 5 features and 0 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1273260 ymin: 6869425 xmax: 1636657 ymax: 7329359
## Projected CRS: WGS 84 / Pseudo-Mercator
## geometry
## 1 POINT (1273260 6987154)
## 2 POINT (1551964 6909108)
## 3 POINT (1340830 7329359)
## 4 POINT (1604680 7173921)
## 5 POINT (1636657 6869425)
write_sf(pts, "data/points.gpkg") # ~82 MB
First, let’s check how fast we can load vector data in GeoPackage format. Depending on the file format, the times can vary. It is also worth mentioning a new spatial data format – GeoParquet, which is more effective than the previous ones (check out this interesting blogpost from Dewey Dunnington).
For vector data, the functions are analogous to those for raster
data. So in the case of {sf}
this is
read_sf()
, and in the case of {terra}
this is
vect()
. {sf}
loads data into data.frame /
tibble, while {terra}
stores it in an external
structure.
x = read_sf("data/points.gpkg")
x[1:5, ]
## Simple feature collection with 5 features and 0 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1273260 ymin: 6869425 xmax: 1636657 ymax: 7329359
## Projected CRS: WGS 84 / Pseudo-Mercator
## # A tibble: 5 x 1
## geom
## <POINT [m]>
## 1 (1273260 6987154)
## 2 (1551964 6909108)
## 3 (1340830 7329359)
## 4 (1604680 7173921)
## 5 (1636657 6869425)
y = vect("data/points.gpkg")
y
## class : SpatVector
## geometry : points
## dimensions : 2e+05, 0 (geometries, attributes)
## extent : 1113197, 1669792, 6446280, 7361860 (xmin, xmax, ymin, ymax)
## source : points.gpkg
## coord. ref. : WGS 84 / Pseudo-Mercator (EPSG:3857)
Now let’s do the benchmark.
t = bench::mark(
iterations = 5, check = FALSE, memory = FALSE,
sf = read_sf("data/points.gpkg"),
terra = vect("data/points.gpkg")
)
t[, 1:4]
## # A tibble: 2 x 4
## expression min median `itr/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl>
## 1 sf 705ms 705ms 1.42
## 2 terra 564ms 576ms 1.73
We will use the st_buffer()
from {sf}
, and
buffer()
from {terra}
to create the buffers.
These functions take the width as an argument and the number of segments
to create the polygon. We need to make sure that the geometries will
consist of an equal number of segments (the more segments, the rounder
the polygon, but it takes more time and memory).
x_buff = st_buffer(x, dist = 50000, nQuadSegs = 5)
x_buff[1:5, ]
## Simple feature collection with 5 features and 0 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 1223260 ymin: 6819425 xmax: 1686657 ymax: 7379359
## Projected CRS: WGS 84 / Pseudo-Mercator
## # A tibble: 5 x 1
## geom
## <POLYGON [m]>
## 1 ((1323260 6987154, 1320813 6971703, 1313711 6957765, 1302649 6946703, 1288711~
## 2 ((1601964 6909108, 1599516 6893657, 1592414 6879719, 1581353 6868657, 1567414~
## 3 ((1390830 7329359, 1388383 7313908, 1381281 7299969, 1370220 7288908, 1356281~
## 4 ((1654680 7173921, 1652233 7158470, 1645131 7144532, 1634069 7133470, 1620131~
## 5 ((1686657 6869425, 1684209 6853974, 1677107 6840036, 1666046 6828974, 1652107~
y_buff = buffer(y, width = 50000, quadsegs = 5)
y_buff
## class : SpatVector
## geometry : polygons
## dimensions : 2e+05, 0 (geometries, attributes)
## extent : 1063197, 1719792, 6396280, 7411860 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / Pseudo-Mercator (EPSG:3857)
Simple visualization of selected objects (n = 100).
plot(x[1:100, ], col = "blue", pch = 20, axes = TRUE)
plot(x_buff[1:100, ], add = TRUE)
rm(x_buff, y_buff)
t = bench::mark(
iterations = 4, check = FALSE, memory = FALSE,
sf = st_buffer(x, dist = 50000, nQuadSegs = 5),
terra = buffer(y, width = 50000, quadsegs = 5)
)
t[, 1:4]
## # A tibble: 2 x 4
## expression min median `itr/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl>
## 1 sf 6.68s 6.8s 0.147
## 2 terra 6.18s 6.21s 0.158
As the last task in this part, we can calculate the distance between
all points. The result returns an \(n *
n\) matrix (this takes up a lot of space), so we will select a
limited number of points (n = 3000). For this purpose, we will use the
st_distance()
from {sf}
, and
distance()
from {terra}
.
n = seq_len(3000) # define number of points
x_dist = st_distance(x[n, ], dist = "Euclidean")
x_dist[1:5, 1:5] # print only part
## Units: [m]
## 1 2 3 4 5
## 1 0.0 289425.24 348811.8 380422.5 381991.38
## 2 289425.2 0.00 470306.1 270009.3 93528.95
## 3 348811.8 470306.07 0.0 306231.1 546856.64
## 4 380422.5 270009.28 306231.1 0.0 306170.58
## 5 381991.4 93528.95 546856.6 306170.6 0.00
In {terra}
the operation is identical, but an object of
type dist
must be converted to a matrix.
y_dist = as.matrix(distance(y[n, ]))
y_dist[1:5, 1:5]
## 1 2 3 4 5
## 1 0.0 289425.24 348811.8 380422.5 381991.38
## 2 289425.2 0.00 470306.1 270009.3 93528.95
## 3 348811.8 470306.07 0.0 306231.1 546856.64
## 4 380422.5 270009.28 306231.1 0.0 306170.58
## 5 381991.4 93528.95 546856.6 306170.6 0.00
The latest benchmark:
rm(x_dist, y_dist)
t = bench::mark(
iterations = 5, check = FALSE, memory = FALSE,
sf = st_distance(x[n, ], dist = "Euclidean"),
terra = as.matrix(distance(y[n, ]))
)
t[, 1:4]
## # A tibble: 2 x 4
## expression min median `itr/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl>
## 1 sf 908ms 913ms 1.09
## 2 terra 797ms 822ms 1.22
An interesting idea for testing can be the use of both raster and vector data. As an example, you can check the sampling performance of raster pixels with the point layer we used earlier. The functions you need are:
terra::extract()
stars::st_extract()