library("terra")
library("ranger") # Random Forest model
library("yardstick") # validation metrics
set.seed(1) # define a random seed
Let’s start by listing the rasters to be loaded from the Landast folder. We need to define the file extension (pattern = "\\.TIF$") and full paths (full.names = TRUE) in the list.files() function. Then we load these files using the rast() function.
landsat_f = list.files("Landast", pattern = "\\.TIF$", full.names = TRUE)
landsat_f = landsat_f[-1] # remove panchromatic band (15 m)
landsat = rast(landsat_f)
For ease of use, we can abbreviate the band names.
names(landsat) = paste0("B", 1:7) # change band names
landsat
## class : SpatRaster
## dimensions : 3901, 2043, 7 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 246375, 307665, 5830995, 5948025 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 34N (EPSG:32634)
## sources : LC09_L2SP_190023_20230622_20230624_02_T1_SR_B1.TIF
## LC09_L2SP_190023_20230622_20230624_02_T1_SR_B2.TIF
## LC09_L2SP_190023_20230622_20230624_02_T1_SR_B3.TIF
## ... and 4 more source(s)
## names : B1, B2, B3, B4, B5, B6, ...
## min values : 164, ? , ? , ? , ? , ? , ...
## max values : 22365, ? , ? , ? , ? , ? , ...
Next, we need to scale the values to surface reflectance. In effect, the data type changes from integer to float. Note that there are outliers (below 0 and above 1) in this dataset.
# scale data to reflectance
landsat = landsat * 2.75e-05 - 0.2
summary(landsat) # print statistics
## B1 B2 B3 B4
## Min. :-0.16631 Min. :-0.11503 Min. :-0.001917 Min. :-0.009975
## 1st Qu.: 0.01541 1st Qu.: 0.02454 1st Qu.: 0.053055 1st Qu.: 0.039195
## Median : 0.02215 Median : 0.03218 Median : 0.068455 Median : 0.057785
## Mean : 0.02532 Mean : 0.03576 Mean : 0.069585 Mean : 0.062563
## 3rd Qu.: 0.03191 3rd Qu.: 0.04288 3rd Qu.: 0.082755 3rd Qu.: 0.078493
## Max. : 0.46355 Max. : 0.48296 Max. : 0.507245 Max. : 0.530620
## B5 B6 B7
## Min. :0.01708 Min. :0.005947 Min. :0.003665
## 1st Qu.:0.26280 1st Qu.:0.131045 1st Qu.:0.064083
## Median :0.31420 Median :0.162395 Median :0.088420
## Mean :0.31503 Mean :0.168040 Mean :0.099067
## 3rd Qu.:0.36793 3rd Qu.:0.199547 3rd Qu.:0.123620
## Max. :0.68468 Max. :0.624065 Max. :0.536863
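If you want to remove these outliers, one option is to clip the reflectance to the valid [0, 1] range with the clamp() function. This is a minimal sketch and is not applied in the rest of this tutorial; the landsat_clamped name is just an example.
# clip values below 0 and above 1 to the range limits
# (values = TRUE keeps the clipped cells instead of setting them to NA)
landsat_clamped = clamp(landsat, lower = 0, upper = 1, values = TRUE)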
We can also display an RGB composite using the plotRGB() function. We need to assign the spectral bands correctly, and it is worth improving the contrast using the stretch argument.
plotRGB(landsat, r = 4, g = 3, b = 2, stretch = "lin")
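Other band combinations can also be useful. For example, a near-infrared false-color composite, in which vegetation appears red, could be drawn as follows (a sketch using the same stretch as above).
# false-color composite: NIR (B5), red (B4), green (B3)
plotRGB(landsat, r = 5, g = 4, b = 3, stretch = "lin")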
Now, let’s load our reference data with land cover categories. We have two datasets: a training set (to train the classification model) and a validation set (to independently evaluate its performance).
training_f = "training_set.tif"
training = rast(training_f)
training
## class : SpatRaster
## dimensions : 1484, 1293, 1 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 386041.9, 424831.9, 559757.5, 604277.5 (xmin, xmax, ymin, ymax)
## coord. ref. : ETRF2000-PL / CS92 (EPSG:2180)
## source : training_set.tif
## categories : category
## name : category
## min value : water
## max value : undeveloped
Using the levels() function, we can see which categories occur in the area.
levels(training)[[1]]
## value category
## 1 1 water
## 2 2 buildings
## 3 3 vegetation
## 4 4 cropland
## 5 5 undeveloped
We can also display a map.
colors = c("#1445f9", "#d20000", "#29a329", "#fdd327", "#d9d9d9")
plot(training, main = "Training dataset", col = colors)
We can load the validation data in exactly the same way.
validation_f = "validation_set.tif"
validation = rast(validation_f)
validation
## class : SpatRaster
## dimensions : 1295, 1163, 1 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 383432.6, 418322.6, 601390.4, 640240.4 (xmin, xmax, ymin, ymax)
## coord. ref. : ETRF2000-PL / CS92 (EPSG:2180)
## source : validation_set.tif
## categories : category
## name : category
## min value : water
## max value : undeveloped
plot(validation, main = "Validation dataset", col = colors)
Note two things: the training and validation sets cover different, non-overlapping areas, and their coordinate reference system (EPSG:2180) differs from that of the Landsat scene (EPSG:32634), so the reference data will have to be reprojected before pixel values are extracted.
In the last data-loading step, we load the csv file with the points for which we have to make a prediction and finally submit the results. For this purpose, we can use the read.csv() function.
submission = read.csv("submission.csv")
head(submission)
## ID X Y category
## 1 1 409130.5 532574.0 NA
## 2 2 418181.4 540287.0 NA
## 3 3 433302.5 565382.2 NA
## 4 4 401445.3 558185.0 NA
## 5 5 432859.9 549953.8 NA
## 6 6 422146.8 551109.9 NA
The X and Y columns contain the coordinates of the points in EPSG:2180 (a projected CRS, so these are easting and northing in metres rather than longitude and latitude). The category column is empty; we have to fill it with the predicted land cover category.
Then let’s convert the data frame to a vector (SpatVector) and reproject it to the CRS of the Landsat scene (EPSG:32634).
submission = vect(submission, geom = c("X", "Y"), crs = "EPSG:2180")
submission = project(submission, crs(landsat))
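A quick check that the reprojection worked (a small sketch; crs() with describe = TRUE reports the EPSG code):
crs(submission, describe = TRUE)$code # should now report "32634"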
To train the model, we do not have to use all available data (pixels); a random sample is enough and significantly speeds up training. The spatSample() function is used to draw the sample below.
train_smp = spatSample(training, size = 20000, method = "random",
as.points = TRUE, values = FALSE, na.rm = TRUE)
Now we have sampled the points (coordinates), so we need to obtain the values of the spectral bands (landsat) and land cover categories (training and validation). The extract() function is used to extract pixel values. Remember that all objects must have an identical CRS.
x = extract(training, train_smp, ID = FALSE)
y = extract(landsat, project(train_smp, crs(landsat)), ID = FALSE)
train_smp = cbind(x, y) # combine columns
Let’s check what the data distribution looks like using the table() and prop.table() functions.
prop.table(table(train_smp$category)) * 100
##
## water buildings vegetation cropland undeveloped
## 1.810 2.795 44.165 49.810 1.420
As we can see, our dataset is strongly unbalanced, i.e., most pixels represent cropland and vegetation, while the other categories appear very rarely.
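One possible way to reduce this imbalance is stratified sampling, in which roughly the same number of cells is drawn per class. This is a hedged sketch, assuming that with method = "stratified" the size argument of spatSample() is interpreted per stratum (see ?spatSample for details); in this tutorial we keep the plain random sample, also for the validation set below.
# draw about 2000 cells per land cover class instead of 20000 cells in total
train_strat = spatSample(training, size = 2000, method = "stratified",
                         as.points = TRUE, values = FALSE, na.rm = TRUE)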
We can use the same functions to create a validation dataset.
validation_smp = spatSample(validation, size = 20000, method = "random",
as.points = TRUE, values = FALSE, na.rm = TRUE)
x = extract(validation, validation_smp, ID = FALSE)
y = extract(landsat, project(validation_smp, crs(landsat)), ID = FALSE)
validation_smp = cbind(x, y)
rm(x, y) # remove unnecessary variables
The key step is to train the classification model based on the training data. In this example, we will use the random forest model from the ranger package with its default hyperparameters. As the first argument of the function, we must provide the classification formula, i.e., indicate which variable is explained (dependent) and which are explanatory (independent). Our model should classify land cover categories based on the spectral bands (B1-B7).
category ~ B1 + B2 + B3 + B4 + B5 + B6 + B7
The above notation can be simplified by using a dot instead of the names of explanatory variables.
# ranger uses all threads by default
mdl = ranger(category ~ ., data = train_smp, importance = "impurity")
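Printing the fitted model object gives a quick summary, including ranger’s out-of-bag (OOB) error estimate, which is a useful internal sanity check before the independent validation below (the exact value depends on the drawn sample).
mdl # print the model summary with the OOB prediction error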
Using the importance argument (and later the importance() function), we can examine variable importance, i.e., see which spectral bands are most useful for classification and which are less important.
barplot(sort(importance(mdl)), xlab = "Spectral band", main = "Variable importance")
After the model is trained, its performance should be checked on an independent dataset, i.e., one that was not used for training. There are many ways to validate, but in this example we will use the holdout method (the validation_smp data). We use the predict() function to make a prediction on a new dataset for which we know the actual land cover categories (of course, the category column must be removed before prediction, which is what validation_smp[, -1] does).
validation_pr = predict(mdl, validation_smp[, -1])
validation_pr = validation_pr$predictions # select predictions only
head(validation_pr)
## [1] cropland cropland vegetation vegetation vegetation vegetation
## Levels: water buildings vegetation cropland undeveloped
Now we need to calculate the model performance metrics – there are many of them. The simplest is accuracy, which is the ratio of correct classifications to all classifications (correct and incorrect), but in our case it will be unreliable because our dataset is unbalanced. A better alternative would be balanced accuracy or Cohen’s kappa.
# accuracy is over-optimistic here (don't rely on it)
accuracy_vec(validation_smp$category, validation_pr)
## [1] 0.8615
# balanced accuracy
bal_accuracy_vec(validation_smp$category, validation_pr)
## [1] 0.7389896
# Cohen's kappa
kap_vec(validation_smp$category, validation_pr)
## [1] 0.7391059
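If a single score that treats all classes equally is preferred, a macro-averaged F1 score from yardstick is another option (a sketch, not part of the original workflow; the value is not shown here).
# macro-averaged F1 score (unweighted mean over classes)
f_meas_vec(validation_smp$category, validation_pr, estimator = "macro")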
Moreover, we can do a more thorough analysis using the confusion matrix and see which classes are classified correctly and which are confused with others.
# confusion matrix
table(prediction = validation_pr, true = validation_smp$category)
## true
## prediction water buildings vegetation cropland undeveloped
## water 358 1 52 15 0
## buildings 3 80 36 22 17
## vegetation 148 66 6965 1195 44
## cropland 40 245 780 9824 88
## undeveloped 0 13 1 4 3
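From this matrix we can also derive per-class metrics, for example the producer’s accuracy (recall) and user’s accuracy (precision). A small sketch in base R; cm simply re-creates the table above.
cm = table(prediction = validation_pr, true = validation_smp$category)
diag(cm) / colSums(cm) # producer's accuracy: correct / all reference pixels per class
diag(cm) / rowSums(cm) # user's accuracy: correct / all predicted pixels per class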
Finally, we need to make a prediction for the submission object. We can crop the satellite scene to the extent of the point layer (or alternatively select only the necessary pixels). Note: do not use more than 1 core in the predict() function, because the ranger model already uses all threads by default.
landsat_pr = crop(landsat, ext(submission))
landsat_pr = predict(landsat_pr, mdl, index = 1, na.rm = TRUE, cores = 1)
Using the levels() function, we can assign categories to the IDs.
levels(landsat_pr) = levels(training)
plot(landsat_pr, main = "Prediction", col = colors)
The final step is to extract the land cover categories for the given coordinates and save the result to a .csv file.
pts_pr = extract(landsat_pr, submission)
write.csv(pts_pr, "test_submission.csv", row.names = FALSE)
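Before submitting, a quick sanity check can confirm that there is exactly one prediction per point and that the categories look sensible (a minimal sketch).
stopifnot(nrow(pts_pr) == nrow(submission)) # one prediction per point
head(pts_pr) # columns: ID and category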