# install.packages("rstac")
library("rstac")
Automatic acquisition and processing of satellite data in R
Introduction
Nowadays, knowledge of three basic technologies is necessary to process huge sets of satellite data:
- SpatioTemporal Asset Catalogs (STAC) – a standardized way of organizing and describing spatial datasets (including satellite imagery), facilitating their search and access. It uses the JSON format to represent spatial data, but doesn’t store it. It only informs what data is available and what it covers (https://stacspec.org).
- Cloud Optimized Geotiff (COG) – a special GeoTiff-based file format designed to be optimized for efficient access in cloud computing environments. The most important feature is that it allows streaming and access to only those parts of the image that are needed without having to download the entire huge file. In addition, COG also includes several pre-computed lower-resolution versions of the image, called overviews or pyramids. When displaying a large area at low zoom, these previews can be quickly downloaded instead of processing the full-resolution data (https://www.cogeo.org).
- Virtual File Systems (VFS) – abstraction layers that enable GDAL to seamlessly read and write geospatial data from various sources such as remote locations, compressed archives, or cloud storage systems. For instance,
/vsicurl/
is a major protocol for cloud-native geospatial workflows, which allows reading files directly from URLs using HTTP range requests. For a detailed description, see here.
In practice, the STAC catalog can describe Sentinel 2 data stored as COG files on Amazon Web Service. The user can search this catalog using the rstac package for images from a specific area and time range. Then, the terra package, which utilizes GDAL, can be used to read these files without fully downloading them (only the required part) and perform the appropriate analysis.
Data search
The rstac package is used to handle STAC catalogs in R, allowing to search for spatial data in accordance with this standard. Let’s start by loading this package (it must be installed beforehand). The next step is to connect to a data source made available by the data provider. A list of sources can be found on STAC Index. In our example, we will use the popular Earth Search service, which provides free satellite images from Sentinel and doesn’t require registration. To connect to this service, we need to use the stac()
function.
= stac("https://earth-search.aws.element84.com/v1")
stac_source stac_source
###rstac_query
- url: https://earth-search.aws.element84.com/v1/
- params:
- field(s): version, base_url, endpoint, params, verb, encode
As a result, we receive the rstac_query
object containing information about the HTTP request that will be sent to the server. There are two methods for sending requests:
- GET (
get_request()
) – mainly used to retrieve data from the server. The data is sent via the URL as query parameters. - POST (
post_request()
) – mainly used to send data to the server. The data is sent in the body of the HTTP request.
The appropriate method depends on the server. Let’s now try to query the server about what datasets (collections) are stored on it.
It will be very helpful to use the pipe operator (|>
) to combine several operations into one coherent sequence. We can use the keyboard shortcut CTRL
+ SHIFT
+ M
, but this requires first select the Use native pipe operator
option (Code
> Editing
tab) in RStudio.
= stac_source |>
collections collections() |>
get_request()
collections
###Collections
- collections (9 item(s)):
- sentinel-2-pre-c1-l2a
- cop-dem-glo-30
- naip
- cop-dem-glo-90
- landsat-c2-l2
- sentinel-2-l2a
- sentinel-2-l1c
- sentinel-2-c1-l2a
- sentinel-1-grd
- field(s): collections, links, context
In response, we received an object that is a multi-level list of lists. You can check its structure yourself using the View()
function. Exploring such objects is more complicated compared to one-dimensional objects (vectors). For instance, if we want to see the available datasets, we should use the lapply()
function, which iterates over each element of the input data (in this case, a list) applying the specified function. The result is a list, which can then be simplified to a vector using the unlist()
function. Additionally, we can use a lambda function (also known as an anonymous function). This allows us to replace the function()
keyword with the \()
character.
= unlist(lapply(collections$collections, \(x) x$id))
collection_names collection_names
[1] "sentinel-2-pre-c1-l2a" "cop-dem-glo-30" "naip"
[4] "cop-dem-glo-90" "landsat-c2-l2" "sentinel-2-l2a"
[7] "sentinel-2-l1c" "sentinel-2-c1-l2a" "sentinel-1-grd"
Sentinel 2 data is available in two collections, that is, sentinel-2-l2a
(older) and sentinel-2-c1-l2a
(newer). Currently, products are being updated to the newer sentinel-2-c1-l2a
collection, which should be preferred.
Now let’s search for available products within the sentinel-2-c1-l2a
collection. The stac_search()
function is used to search for data and allows us to define parameters such as:
collections
,- spatial extent in WGS 84 (
bbox
), - time interval in RFC 3339 standard (
datetime
), - maximum number of products (
limit
).
In this case, we must send the request using the POST method.
|>
stac_source stac_search(
collections = "sentinel-2-c1-l2a",
bbox = c(22.5, 51.1, 22.6, 51.2), # xmin, ymin, xmax, ymax (WGS84)
datetime = "2023-01-01T00:00:00Z/2023-12-31T00:00:00Z", # RFC 3339
limit = 5) |>
post_request() -> scenes
scenes
###Items
- matched feature(s): 288
- features (5 item(s) / 283 not fetched):
- S2A_T34UEB_20231230T094411_L2A
- S2A_T34UFB_20231230T094411_L2A
- S2A_T34UEB_20231227T093411_L2A
- S2A_T34UFB_20231227T093411_L2A
- S2B_T34UEB_20231225T094326_L2A
- assets:
aot, blue, cloud, coastal, granule_metadata, green, nir, nir08, nir09, preview, product_metadata, red, rededge1, rededge2, rededge3, scl, snow, swir16, swir22, thumbnail, tileinfo_metadata, visual, wvp
- item's fields:
assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type
288 products were found that match the query conditions. As a result, we obtained various layers (assets
) and attributes (fields
) that we can inspect. Let’s see some sample product properties.
# first 15 properties
names(scenes$features[[1]]$properties)[1:15]
[1] "created" "platform" "constellation"
[4] "instruments" "eo:cloud_cover" "proj:epsg"
[7] "proj:centroid" "mgrs:utm_zone" "mgrs:latitude_band"
[10] "mgrs:grid_square" "grid:code" "view:azimuth"
[13] "view:incidence_angle" "view:sun_azimuth" "view:sun_elevation"
In total, we have access to 44 features that describe the product, including acquisition date, satellite platform name, scene cloud cover percentage, coordinate system, scene ID, etc.
Fortunately, we can simplify the process of inspecting and selecting the data we are interested in. Instead of looking through these complex lists, we can use the items_as_sf()
function, which will return the results in a spatial data frame, making things much easier.
= items_as_sf(scenes)
df # View(df)
The resulting data frame is a spatial object (sf data frame
), so scene geometries can be displayed using the plot()
function.
One of the critical attributes in this data frame is cloud cover (eo:cloud_cover
). Let’s check these values.
$`eo:cloud_cover` df
[1] 71.93396 68.93441 83.81743 94.01084 99.99993
As we can see, these are quite high values, which limits the usefulness of these scenes for analysis. We can modify our previous query by adding a filter rule to return only scenes below a specific cloud cover value. The ext_query()
function is designed for this purpose.
|>
stac_source stac_search(
collections = "sentinel-2-c1-l2a",
bbox = c(22.5, 51.1, 22.6, 51.2),
datetime = "2023-01-01T00:00:00Z/2023-12-31T00:00:00Z",
limit = 5) |>
ext_query(`eo:cloud_cover` < 10) |>
post_request() -> scenes
= items_as_sf(scenes)
df $`eo:cloud_cover` df
[1] 0.067071 0.043523 1.478151 0.678711 0.000757
Satellite scenes usually contain low-resolution thumbnails in RGB composition. In the case of Sentinel 2 data, they are saved as thumbnail.jpg
. For a quick preview, we must first select the appropriate layer using the assets_select()
function, and then return the link using the assets_url()
function. To load the .jpg
file, an additional package is required, e.g., imager, which has the load.image()
function.
# install.packages("imager")
library("imager")
par(mar = c(0, 0, 0, 0)) # remove margins
|>
scenes items_select(1) |>
assets_select(asset_names = "thumbnail") |>
assets_url() |>
load.image() |>
plot(axes = FALSE)
Data acquisition
After we selected the appropriate scenes, we can proceed to download them. Basically, there are two ways to do this:
- Downloading entire files to disk:
assets_download()
function from the rstac package,download.file()
function built into R.
- Downloading a specific spatial range:
/vsicurl/
protocol available in programs utilizing GDAL, e.g., terra or gdalwarp.
Let’s start with the first method. Once again, we will select three specific layers from the scene (blue
, green
and red
spectral bands) and return links to these files. They will then be passed to the download.file()
function.
|>
scenes items_select(1) |>
assets_select(asset_names = c("blue", "green", "red", "nir")) |>
assets_url() |>
sort() -> urls
urls
[1] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B02.tif"
[2] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B03.tif"
[3] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B04.tif"
[4] "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/34/U/EB/2023/11/S2A_T34UEB_20231107T093339_L2A/B08.tif"
Please note that bands don’t necessarily have to be sorted in the expected order. It’s a good idea to sort them to avoid confusing the order.
Let’s create a new directory called sentinel
using the dir.create()
function, where the files will be saved. To create the save paths, we need to specify the file name along with the extension. To do this, we can use the basename()
function to extract these names from the links. Then, combine the directory path with the file path using the file.path()
function.
dir.create("sentinel")
= file.path("sentinel", basename(urls)) files
Finally, we will use a loop that will download all files. It is necessary to specify the links from which the data will be downloaded and the save paths. It is worth increasing the download timeout before doing so.
options(timeout = 600) # increase timeout
for (i in seq_along(urls)) {
download.file(urls[i], files[i], mode = "wb")
}
The second option, based on the /vsicurl/
protocol, is simple. All we need to do is add the prefix /vsicurl/
to the raster links using the paste0()
function, and then load the data as a raster object.
= paste0("/vsicurl/", urls) urls
Data processing
terra is a crucial package for processing spatial data — both raster and vector.
# install.packages("terra")
library("terra")
In the previous section, we defined vector urls
with links (including the /vsicurl/
prefix) to three spectral bands. Let’s now load them using the rast()
function.
= rast(urls)
r r
class : SpatRaster
size : 10980, 10980, 4 (nrow, ncol, nlyr)
resolution : 10, 10 (x, y)
extent : 499980, 609780, 5590200, 5700000 (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 34N (EPSG:32634)
sources : B02.tif
B03.tif
B04.tif
B08.tif
names : B02, B03, B04, B08
min values : -0.0999, -0.0999, -0.0527, -0.0353
max values : 1.7640, 1.6728, 1.6104, 1.5576
For clarity, we can change band names by referring to the names
attribute as follows:
names(r) = c("blue", "green", "red", "nir")
The basic metadata for this scene has been returned. The complete Sentinel scene consists of 10,000 columns and 10,000 rows. If we are interested in a smaller area, we can specify the appropriate spatial extent using the win
argument and the SpatExtent
class object. Of course, the coordinates must be specified in the same coordinate system as the raster (in this case, it is EPSG:32634
).
= ext(c(570000, 600000, 5560000, 5620000)) # xmin, xmax, ymin, ymax
w = rast(urls, win = w)
r r
class : SpatRaster
size : 2980, 3000, 4 (nrow, ncol, nlyr)
resolution : 10, 10 (x, y)
window : 570000, 6e+05, 5590200, 5620000 (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 34N (EPSG:32634)
sources : B02.tif
B03.tif
B04.tif
B08.tif
names : B02, B03, B04, B08
min values : >-0.0999, >-0.0999, >-0.0527, >-0.0353
max values : 1.7640<, 1.6728<, 1.6104<, 1.5576<
We can display the RGB composition using the plotRGB()
function and setting the bands in the correct order. It is also necessary to stretch the colors using the stretch
argument.
plotRGB(r, 3, 2, 1, stretch = "lin")
Using the summary()
function, we can calculate basic statistics for the raster based on a defined sample.
summary(r, size = 10000)
B02 B03 B04 B08
Min. :-0.01050 Min. :-0.00210 Min. :-0.00190 Min. :-0.0002
1st Qu.: 0.00710 1st Qu.: 0.02240 1st Qu.: 0.01670 1st Qu.: 0.1422
Median : 0.01150 Median : 0.02890 Median : 0.02400 Median : 0.1748
Mean : 0.01597 Mean : 0.03520 Mean : 0.03345 Mean : 0.1810
3rd Qu.: 0.02120 3rd Qu.: 0.04403 3rd Qu.: 0.04570 3rd Qu.: 0.2106
Max. : 0.81600 Max. : 0.60920 Max. : 0.80720 Max. : 0.7472
Satellite data is usually stored as integers to reduce its size on disk. Based on the table above, we can see that the spectral band values have been scaled from integer to floating point, i.e., to a reflection range from 0 to 1. Data scaling is performed automatically in the rast()
function if the file contains scaling and offset factors. These can be inspected using the scoff()
function. If the factors are not defined, we must perform the transformations ourselves according to the proper formulas.
However, this dataset has outliers below 0. The simplest method to remove them is the clamp()
function. We define the minimum and maximum threshold values, and values outside this range will be removed or overwritten (default) depending on the values
argument.
= clamp(r, lower = 0, upper = 1, values = TRUE)
r summary(r, size = 10000)
B02 B03 B04 B08
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.00710 1st Qu.:0.02240 1st Qu.:0.01670 1st Qu.:0.1422
Median :0.01150 Median :0.02890 Median :0.02400 Median :0.1748
Mean :0.01604 Mean :0.03520 Mean :0.03346 Mean :0.1810
3rd Qu.:0.02120 3rd Qu.:0.04403 3rd Qu.:0.04570 3rd Qu.:0.2106
Max. :0.81600 Max. :0.60920 Max. :0.80720 Max. :0.7472
Using the red (BO4
) and near-infrared (B08
) bands, we can calculate the normalized difference vegetation index (NDVI) with the following formula:
\[ \text{NDVI} = \frac{\text{NIR} - \text{Red}}{\text{NIR} + \text{Red}} \]
Let’s define the above formula as a function.
= function(red, nir) {
ndvi - red) / (nir + red)
(nir }
Now we can apply it in the lapp()
function, which is used to perform calculations on selected layers from the raster. We need to select bands 3 (red) and 4 (ndvi).
= lapp(r[[c(3, 4)]], fun = ndvi) ndvi_raster
After calculating the index, we can display it on the map using the plot()
function. To create a unique color palette (e.g., red-yellow-green), we can use the colorRampPalette()
function.
= colorRampPalette(c("red", "yellow", "darkgreen"))(100)
colors plot(ndvi_raster, col = colors, main = "NDVI")
The results can be saved to disk using the writeRaster()
function and then opened in another program, such as QGIS. We must specify the path on the disk where the file will be saved and its extension, e.g., .tif
.
writeRaster(ndvi_raster, "ndvi.tif")
Summary
In this short tutorial, we learned what technologies are used to acquire and process satellite data, and we completed a practical workflow in the R programming language using two packages: rstac and terra. Obviously, this is just an introduction and doesn’t exhaust the subject. Further topics are primarily related to the processing of time series for the entire globe using data from various sensors and machine learning and deep learning methods.
Task
- Find a satellite scene from any area and time. However, keep in mind that cloud cover should be low.
- Load two spectral bands: B08 (NIR) and B11 (SWIR 1) using the
/vsicurl/
protocol and the terra package. Please note that these images are in different spatial resolutions, so you need to resample one using theresample()
function. - Randomly sample images using the
SpatSample()
andset.seed()
functions. - Create a scatter plot and calculate Pearson’s correlation coefficient. You can set the appropriate transparency for the points and include the \(x = y\) line.