Automatic acquisition and processing of satellite data in R

Author

Krzysztof Dyba

Introduction

Nowadays, knowledge of three basic technologies is necessary to process huge sets of satellite data:

  • SpatioTemporal Asset Catalogs (STAC) – a standardized way of organizing and describing spatial datasets (including satellite imagery), facilitating their search and access. It uses the JSON format to represent spatial data, but doesn’t store it. It only informs what data is available and what it covers (https://stacspec.org).
  • Cloud Optimized Geotiff (COG) – a special GeoTiff-based file format designed to be optimized for efficient access in cloud computing environments. The most important feature is that it allows streaming and access to only those parts of the image that are needed without having to download the entire huge file. In addition, COG also includes several pre-computed lower-resolution versions of the image, called overviews or pyramids. When displaying a large area at low zoom, these previews can be quickly downloaded instead of processing the full-resolution data (https://www.cogeo.org).
  • Virtual File Systems (VFS) – abstraction layers that enable GDAL to seamlessly read and write geospatial data from various sources such as remote locations, compressed archives, or cloud storage systems. For instance, /vsicurl/ is a major protocol for cloud-native geospatial workflows, which allows reading files directly from URLs using HTTP range requests. For a detailed description, see here.

In practice, the STAC catalog can describe Sentinel 2 data stored as COG files on Amazon Web Service. The user can search this catalog using the rstac package for images from a specific area and time range. Then, the terra package, which utilizes GDAL, can be used to read these files without fully downloading them (only the required part) and perform the appropriate analysis.

Data acquisition

After we selected the appropriate scenes, we can proceed to download them. Basically, there are two ways to do this:

  1. Downloading entire files to disk:
    • assets_download() function from the rstac package,
    • download.file() function built into R.
  2. Downloading a specific spatial range:
    • /vsicurl/ protocol available in programs utilizing GDAL, e.g., terra or gdalwarp.

Let’s start with the first method. Once again, we will select three specific layers from the scene (blue, green and red spectral bands) and return links to these files. They will then be passed to the download.file() function.

scenes |>
  items_select(1) |>
  assets_select(asset_names = c("blue", "green", "red", "nir")) |>
  assets_url() |>
  sort() -> urls
urls
[1] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B02.tif"
[2] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B03.tif"
[3] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B04.tif"
[4] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B08.tif"

Please note that bands don’t necessarily have to be sorted in the expected order. It’s a good idea to sort them to avoid confusing the order.

Let’s create a new directory called sentinel using the dir.create() function, where the files will be saved. To create the save paths, we need to specify the file name along with the extension. To do this, we can use the basename() function to extract these names from the links. Then, combine the directory path with the file path using the file.path() function.

dir.create("sentinel")
files = file.path("sentinel", basename(urls))

Finally, we will use a loop that will download all files. It is necessary to specify the links from which the data will be downloaded and the save paths. It is worth increasing the download timeout before doing so.

options(timeout = 600) # increase timeout

for (i in seq_along(urls)) {
  download.file(urls[i], files[i], mode = "wb")
}

The second option, based on the /vsicurl/ protocol, is simple. All we need to do is add the prefix /vsicurl/ to the raster links using the paste0() function, and then load the data as a raster object.

urls = paste0("/vsicurl/", urls)

Data processing

terra is a crucial package for processing spatial data — both raster and vector.

# install.packages("terra")
library("terra")

In the previous section, we defined vector urls with links (including the /vsicurl/ prefix) to three spectral bands. Let’s now load them using the rast() function.

r = rast(urls)
r
class       : SpatRaster 
size        : 10980, 10980, 4  (nrow, ncol, nlyr)
resolution  : 10, 10  (x, y)
extent      : 499980, 609780, 5590200, 5700000  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 34N (EPSG:32634) 
sources     : B02.tif  
              B03.tif  
              B04.tif  
              B08.tif  
names       :     B02,     B03,     B04,     B08 
min values  : -0.0999, -0.0999, -0.0527, -0.0353 
max values  :  1.7640,  1.6728,  1.6104,  1.5576 

For clarity, we can change band names by referring to the names attribute as follows:

names(r) = c("blue", "green", "red", "nir")

The basic metadata for this scene has been returned. The complete Sentinel scene consists of 10,000 columns and 10,000 rows. If we are interested in a smaller area, we can specify the appropriate spatial extent using the win argument and the SpatExtent class object. Of course, the coordinates must be specified in the same coordinate system as the raster (in this case, it is EPSG:32634).

w = ext(c(570000, 600000, 5560000, 5620000)) # xmin, xmax, ymin, ymax
r = rast(urls, win = w)
r
class       : SpatRaster 
size        : 2980, 3000, 4  (nrow, ncol, nlyr)
resolution  : 10, 10  (x, y)
window      : 570000, 6e+05, 5590200, 5620000  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 34N (EPSG:32634) 
sources     : B02.tif  
              B03.tif  
              B04.tif  
              B08.tif  
names       :      B02,      B03,      B04,      B08 
min values  : >-0.0999, >-0.0999, >-0.0527, >-0.0353 
max values  :  1.7640<,  1.6728<,  1.6104<,  1.5576< 

We can display the RGB composition using the plotRGB() function and setting the bands in the correct order. It is also necessary to stretch the colors using the stretch argument.

plotRGB(r, 3, 2, 1, stretch = "lin")

Using the summary() function, we can calculate basic statistics for the raster based on a defined sample.

summary(r, size = 10000)
      B02                B03                B04                B08         
 Min.   :-0.01050   Min.   :-0.00210   Min.   :-0.00190   Min.   :-0.0002  
 1st Qu.: 0.00710   1st Qu.: 0.02240   1st Qu.: 0.01670   1st Qu.: 0.1422  
 Median : 0.01150   Median : 0.02890   Median : 0.02400   Median : 0.1748  
 Mean   : 0.01597   Mean   : 0.03520   Mean   : 0.03345   Mean   : 0.1810  
 3rd Qu.: 0.02120   3rd Qu.: 0.04403   3rd Qu.: 0.04570   3rd Qu.: 0.2106  
 Max.   : 0.81600   Max.   : 0.60920   Max.   : 0.80720   Max.   : 0.7472  

Satellite data is usually stored as integers to reduce its size on disk. Based on the table above, we can see that the spectral band values have been scaled from integer to floating point, i.e., to a reflection range from 0 to 1. Data scaling is performed automatically in the rast() function if the file contains scaling and offset factors. These can be inspected using the scoff() function. If the factors are not defined, we must perform the transformations ourselves according to the proper formulas.

However, this dataset has outliers below 0. The simplest method to remove them is the clamp() function. We define the minimum and maximum threshold values, and values outside this range will be removed or overwritten (default) depending on the values argument.

r = clamp(r, lower = 0, upper = 1, values = TRUE)
summary(r, size = 10000)
      B02               B03               B04               B08        
 Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00710   1st Qu.:0.02240   1st Qu.:0.01670   1st Qu.:0.1422  
 Median :0.01150   Median :0.02890   Median :0.02400   Median :0.1748  
 Mean   :0.01604   Mean   :0.03520   Mean   :0.03346   Mean   :0.1810  
 3rd Qu.:0.02120   3rd Qu.:0.04403   3rd Qu.:0.04570   3rd Qu.:0.2106  
 Max.   :0.81600   Max.   :0.60920   Max.   :0.80720   Max.   :0.7472  

Using the red (BO4) and near-infrared (B08) bands, we can calculate the normalized difference vegetation index (NDVI) with the following formula:

\[ \text{NDVI} = \frac{\text{NIR} - \text{Red}}{\text{NIR} + \text{Red}} \]

Let’s define the above formula as a function.

ndvi = function(red, nir) {
  (nir - red) / (nir + red)
}

Now we can apply it in the lapp() function, which is used to perform calculations on selected layers from the raster. We need to select bands 3 (red) and 4 (ndvi).

ndvi_raster = lapp(r[[c(3, 4)]], fun = ndvi)

After calculating the index, we can display it on the map using the plot() function. To create a unique color palette (e.g., red-yellow-green), we can use the colorRampPalette() function.

colors = colorRampPalette(c("red", "yellow", "darkgreen"))(100)
plot(ndvi_raster, col = colors, main = "NDVI")

The results can be saved to disk using the writeRaster() function and then opened in another program, such as QGIS. We must specify the path on the disk where the file will be saved and its extension, e.g., .tif.

writeRaster(ndvi_raster, "ndvi.tif")

Summary

In this short tutorial, we learned what technologies are used to acquire and process satellite data, and we completed a practical workflow in the R programming language using two packages: rstac and terra. Obviously, this is just an introduction and doesn’t exhaust the subject. Further topics are primarily related to the processing of time series for the entire globe using data from various sensors and machine learning and deep learning methods.

Task

  1. Find a satellite scene from any area and time. However, keep in mind that cloud cover should be low.
  2. Load two spectral bands: B08 (NIR) and B11 (SWIR 1) using the /vsicurl/ protocol and the terra package. Please note that these images are in different spatial resolutions, so you need to resample one using the resample() function.
  3. Randomly sample images using the SpatSample() and set.seed() functions.
  4. Create a scatter plot and calculate Pearson’s correlation coefficient. You can set the appropriate transparency for the points and include the \(x = y\) line.