Data loading

library("terra")
library("ranger") # Random Forest model
library("yardstick") # validation metrics
set.seed(1) # define a random seed

Let’s start by listing the rasters to be loaded from the Landast folder. In the list.files() function, we need to define the file extension pattern (pattern = "\\.TIF$") and request full paths (full.names = TRUE). Then we load these files using the rast() function.

landsat_f = list.files("Landast", pattern = "\\.TIF$", full.names = TRUE)
landsat_f = landsat_f[-1] # remove panchromatic band (15 m)
landsat = rast(landsat_f)

For ease of use, we can abbreviate the band names.

names(landsat) = paste0("B", 1:7) # change band names
landsat
## class       : SpatRaster 
## dimensions  : 3901, 2043, 7  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 246375, 307665, 5830995, 5948025  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 34N (EPSG:32634) 
## sources     : LC09_L2SP_190023_20230622_20230624_02_T1_SR_B1.TIF  
##               LC09_L2SP_190023_20230622_20230624_02_T1_SR_B2.TIF  
##               LC09_L2SP_190023_20230622_20230624_02_T1_SR_B3.TIF  
##               ... and 4 more source(s)
## names       :    B1,   B2,   B3,   B4,   B5,   B6, ... 
## min values  :   164,   ? ,   ? ,   ? ,   ? ,   ? , ... 
## max values  : 22365,   ? ,   ? ,   ? ,   ? ,   ? , ...

Next, we need to rescale the raw integer values to surface reflectance by applying the scale factor (2.75e-05) and offset (-0.2); as a side effect, the datatype changes from integer to float. Note that the result contains outliers (values below 0 and above 1).

# scale data to reflectance
landsat = landsat * 2.75e-05 - 0.2
summary(landsat) # print statistics
##        B1                 B2                 B3                  B4           
##  Min.   :-0.16631   Min.   :-0.11503   Min.   :-0.001917   Min.   :-0.009975  
##  1st Qu.: 0.01541   1st Qu.: 0.02454   1st Qu.: 0.053055   1st Qu.: 0.039195  
##  Median : 0.02215   Median : 0.03218   Median : 0.068455   Median : 0.057785  
##  Mean   : 0.02532   Mean   : 0.03576   Mean   : 0.069585   Mean   : 0.062563  
##  3rd Qu.: 0.03191   3rd Qu.: 0.04288   3rd Qu.: 0.082755   3rd Qu.: 0.078493  
##  Max.   : 0.46355   Max.   : 0.48296   Max.   : 0.507245   Max.   : 0.530620  
##        B5                B6                 B7          
##  Min.   :0.01708   Min.   :0.005947   Min.   :0.003665  
##  1st Qu.:0.26280   1st Qu.:0.131045   1st Qu.:0.064083  
##  Median :0.31420   Median :0.162395   Median :0.088420  
##  Mean   :0.31503   Mean   :0.168040   Mean   :0.099067  
##  3rd Qu.:0.36793   3rd Qu.:0.199547   3rd Qu.:0.123620  
##  Max.   :0.68468   Max.   :0.624065   Max.   :0.536863
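
If we wanted to remove these outliers, one option is to clip the values to the physically valid [0, 1] reflectance range with the clamp() function from terra. A minimal sketch (not applied in the rest of this tutorial; landsat_clamped is a hypothetical name):

# optional: clip reflectance to the valid [0, 1] range
landsat_clamped = clamp(landsat, lower = 0, upper = 1, values = TRUE)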

We can also display an RGB composite using the plotRGB() function. We need to assign the spectral bands to the correct channels, and it is worth improving the contrast using the stretch argument.

plotRGB(landsat, r = 4, g = 3, b = 2, stretch = "lin")
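
The same function can be used for other band combinations. For example, a near-infrared false-color composite (NIR, red, green) makes vegetation stand out:

# false-color composite: vegetation appears bright red
plotRGB(landsat, r = 5, g = 4, b = 3, stretch = "lin")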

Now, let’s load our reference data with land cover categories. We have two datasets – a training set (to train the classification model) and a validation set (to independently evaluate its performance).

training_f = "training_set.tif"
training = rast(training_f)
training
## class       : SpatRaster 
## dimensions  : 1484, 1293, 1  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 386041.9, 424831.9, 559757.5, 604277.5  (xmin, xmax, ymin, ymax)
## coord. ref. : ETRF2000-PL / CS92 (EPSG:2180) 
## source      : training_set.tif 
## categories  : category 
## name        :    category 
## min value   :       water 
## max value   : undeveloped

Using the levels() function, we can see which categories occur in the area.

levels(training)[[1]]
##   value    category
## 1     1       water
## 2     2   buildings
## 3     3  vegetation
## 4     4    cropland
## 5     5 undeveloped

We can also display a map.

colors = c("#1445f9", "#d20000", "#29a329", "#fdd327", "#d9d9d9")
plot(training, main = "Training dataset", col = colors)

We can load the validation data in exactly the same way.

validation_f = "validation_set.tif"
validation = rast(validation_f)
validation
## class       : SpatRaster 
## dimensions  : 1295, 1163, 1  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 383432.6, 418322.6, 601390.4, 640240.4  (xmin, xmax, ymin, ymax)
## coord. ref. : ETRF2000-PL / CS92 (EPSG:2180) 
## source      : validation_set.tif 
## categories  : category 
## name        :    category 
## min value   :       water 
## max value   : undeveloped
plot(validation, main = "Validation dataset", col = colors)

Note two things:

  1. The satellite data (EPSG:32634) and land cover data (EPSG:2180) have different coordinate reference systems (CRS).
  2. The rasters are not aligned (the pixels are slightly shifted).
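
Both issues can be checked programmatically; a quick sketch using terra (the expected results follow from the raster summaries above):

# compare the EPSG codes of both rasters
crs(landsat, describe = TRUE)$code  # "32634"
crs(training, describe = TRUE)$code # "2180"
same.crs(landsat, training)         # FALSE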

In the last step, we load the CSV file with the points for which we have to make predictions and submit the results. For this purpose, we can use the read.csv() function.

submission = read.csv("submission.csv")
head(submission)
##   ID        X        Y category
## 1  1 409130.5 532574.0       NA
## 2  2 418181.4 540287.0       NA
## 3  3 433302.5 565382.2       NA
## 4  4 401445.3 558185.0       NA
## 5  5 432859.9 549953.8       NA
## 6  6 422146.8 551109.9       NA

The X and Y columns contain the point coordinates in EPSG:2180 (a projected CRS, so these are easting and northing in meters rather than longitude and latitude). The category column is empty – we have to fill it with the predicted land cover categories.

Then let’s convert the data frame to a vector layer (SpatVector) and reproject it to the CRS of the satellite data (EPSG:32634).

submission = vect(submission, geom = c("X", "Y"), crs = "EPSG:2180")
submission = project(submission, crs(landsat))
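
We can verify that the reprojection succeeded:

# the points should now share the CRS of the satellite data
same.crs(submission, landsat) # TRUE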

Model training

To train the model, we do not have to use all available data (pixels); a random sample will significantly speed up training. The spatSample() function is used to draw the sample below.

train_smp = spatSample(training, size = 20000, method = "random",
                       as.points = TRUE, values = FALSE, na.rm = TRUE)

Now we have sampled the point locations, so we need to obtain the values of the spectral bands (landsat) and the land cover categories (training, and later validation). The extract() function is used to extract pixel values. Remember that the points and the raster must have an identical CRS, which is why the points are reprojected before extracting from landsat.

x = extract(training, train_smp, ID = FALSE)
y = extract(landsat, project(train_smp, crs(landsat)), ID = FALSE)
train_smp = cbind(x, y) # combine columns

Let’s check what the data distribution looks like using the table() and prop.table() functions.

prop.table(table(train_smp$category)) * 100
## 
##       water   buildings  vegetation    cropland undeveloped 
##       1.810       2.795      44.165      49.810       1.420

As we can see, our dataset is strongly unbalanced: most pixels represent cropland and vegetation, while the other categories appear very rarely.
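
One way to mitigate such imbalance (not used further in this tutorial) would be stratified sampling, which draws points per category instead of completely at random. A sketch, assuming that with method = "stratified" the size argument of spatSample() is interpreted per stratum:

# optional: draw 1000 points from each land cover category
train_strat = spatSample(training, size = 1000, method = "stratified",
                         as.points = TRUE)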

We can use the same functions to create a validation dataset.

validation_smp = spatSample(validation, size = 20000, method = "random",
                            as.points = TRUE, values = FALSE, na.rm = TRUE)
x = extract(validation, validation_smp, ID = FALSE)
y = extract(landsat, project(validation_smp, crs(landsat)), ID = FALSE)
validation_smp = cbind(x, y)
rm(x, y) # remove unnecessary variables

The key step is to train the classification model on the training data. In this example, we will use a random forest from the ranger package with its default hyperparameters. As the first argument of the function, we must provide the classification formula, i.e., specify which variable is the response (dependent) and which are the predictors (independent). Our model should classify land cover categories based on the spectral bands (B1-B7).

category ~ B1 + B2 + B3 + B4 + B5 + B6 + B7

The above notation can be simplified by using a dot instead of the names of explanatory variables.

# ranger uses all threads by default
mdl = ranger(category ~ ., data = train_smp, importance = "impurity")

Using the importance argument (and later the importance() function), we can examine variable importance, i.e., check which spectral bands are most useful for the classification and which matter less.

barplot(sort(importance(mdl)), xlab = "Spectral band", main = "Variable importance")
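
ranger also stores an internal out-of-bag (OOB) error estimate computed during training, which offers a quick first check of model quality before the formal validation in the next section:

# OOB misclassification error (fraction of wrongly classified samples)
mdl$prediction.error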

Model evaluation

After the model is trained, its performance should be checked on an independent dataset, i.e., one that was not used for training. There are many ways to validate, but in this example we will use the holdout method (validation_smp data).

We use the predict() function to make predictions on a new dataset for which we know the actual land cover categories (of course, the category column must be removed before prediction – hence validation_smp[, -1] below).

validation_pr = predict(mdl, validation_smp[, -1])
validation_pr = validation_pr$predictions # select predictions only
head(validation_pr)
## [1] cropland   cropland   vegetation vegetation vegetation vegetation
## Levels: water buildings vegetation cropland undeveloped

Now we need to calculate the model performance metrics – there are many of them. The simplest is accuracy, which is the ratio of correct classifications to all classifications (correct and incorrect), but in our case it will be unreliable because our dataset is unbalanced. A better alternative would be balanced accuracy or Cohen’s kappa.

# accuracy is overly optimistic here (don't use this)
accuracy_vec(validation_smp$category, validation_pr)
## [1] 0.8615

# balanced accuracy
bal_accuracy_vec(validation_smp$category, validation_pr)
## [1] 0.7389896

# Cohen's kappa
kap_vec(validation_smp$category, validation_pr)
## [1] 0.7391059
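
yardstick provides many more metrics with the same *_vec interface. For example, a macro-averaged F1 score, which weights all classes equally regardless of their frequency (a sketch):

# macro-averaged F1: unweighted mean of the per-class F1 scores
f_meas_vec(validation_smp$category, validation_pr, estimator = "macro")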

Moreover, we can perform a more thorough analysis using the confusion matrix and see which classes are classified correctly and which are confused with each other.

# confusion matrix
table(prediction = validation_pr, true = validation_smp$category)
##              true
## prediction    water buildings vegetation cropland undeveloped
##   water         358         1         52       15           0
##   buildings       3        80         36       22          17
##   vegetation    148        66       6965     1195          44
##   cropland       40       245        780     9824          88
##   undeveloped     0        13          1        4           3
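
From the same matrix we can derive per-class metrics directly, e.g., the producer's accuracy (recall) for each category; a minimal base-R sketch:

cm = table(prediction = validation_pr, true = validation_smp$category)
# producer's accuracy: correctly classified / all true pixels per class
round(diag(cm) / colSums(cm), 3)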

Submission

Finally, we need to make a prediction for the submission object. We can crop the satellite scene to the extent of the point layer (or alternatively select only the necessary pixels). Note: do not use more than 1 core in the predict() function, because the ranger model already uses all threads by default, and adding more cores would oversubscribe the CPU.

landsat_pr = crop(landsat, ext(submission))
landsat_pr = predict(landsat_pr, mdl, index = 1, na.rm = TRUE, cores = 1)

Using the levels() function, we can assign the category labels to the predicted class IDs.

levels(landsat_pr) = levels(training)
plot(landsat_pr, main = "Prediction", col = colors)

The final step is to extract the land cover categories for the given coordinates and save the result to a .csv file.

pts_pr = extract(landsat_pr, submission)
write.csv(pts_pr, "test_submission.csv", row.names = FALSE)
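
Before submitting, it is worth making sure that every point actually received a prediction (points falling on NA pixels would yield NA). A minimal check, assuming the predicted class sits in the second column returned by extract():

# sanity check: no point should be left without a predicted category
stopifnot(!anyNA(pts_pr[[2]]))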