In this section, we present the methodologies of a large-scale data process composed of two major blocks. In the second block, together with local residents, we defined a central question: What environmental impacts does Pollença’s “dispersed” tourism model have compared with the more compact tourism patterns along the coast?
There, two concerns stood out: water and mobility.
Residents repeatedly emphasized water consumption. Many rural houses are not connected to the public sewer network or the potable water system, meaning their actual consumption remains unknown. In a context of growing scarcity and recurrent drought risk, estimating how much water these houses use becomes essential for designing new policies and interventions.
Mobility emerged as the second blind spot. These houses are scattered across the landscape, generating a form of car-dependent tourism that has never been quantified or visualized. Understanding how much mobility this dispersed model produces—especially compared to the more centralized coastal tourism areas—is a necessary step toward rethinking infrastructure, emissions, and territorial impacts.
To answer these questions in the absence of direct data, we develop predictive models for both water consumption and mobility, using machine-learning techniques for the former and spatial analysis for the latter. These models allow us to approximate the invisible, to give shape to what has remained unmeasured, and ultimately to inform a political debate grounded in evidence rather than assumption.
This document is organized following a clear sequence. First, we describe the methodology used to construct the Atlas and present its principal findings. Second, we outline the methodologies and results behind the estimation of water consumption and mobility patterns. Finally, we offer a brief synthesis of the methodologies.
To comply with data-protection regulations and research ethics, no dataset is presented here and no data on individuals is disclosed.
How many houses stand on Mallorca’s rural land? When were they built? How large are they? These are deceptively simple questions. In practice, they are remarkably difficult to answer. Many rural houses obscure their origins—how they were built, expanded, or transformed over time. No single dataset captures all of them or their characteristics with precision.
Yet, all houses, formal or informal, are recorded in the cadastre, even if with incomplete, outdated, or intentionally vague information. This makes the cadastral system our primary source, though it requires significant cross-referencing and cleaning. And because our goal is also to identify which of these houses operate as tourist rentals, we must complement cadastral sources with administrative data from the Government of the Balearic Islands.
To build a complete and reliable dataset, we integrate four main data sources:
The cadastral information is taken from the website of the Spanish Cadastre (https://www.sedecatastro.gob.es/) and from the Spanish INSPIRE Cadastre plug-in for QGIS. The tourism-rental data is taken from the IDEIB - Infraestructura de Dades Espacials de les Illes Balears (http://ideib.caib.es/cataleg/).
Together, these sources allow us to construct the most complete and detailed dataset ever assembled on Mallorca’s rural houses—formal, informal, and recreational—and provide the analytical foundation for the Atlas of Houses That Do Not Exist. Here is how we build the data:
We begin by downloading and cleaning all cadastral information for every municipality in Mallorca. Once the parcel-level data is processed, we cross-reference it with the Cadastral Construction Details—the georeferenced shapefile that contains the full set of building footprints across the island. This layer includes every construction component that forms part of a building. We filter this data to detect 1) the buildings classified as containing a house (constructions with a “V” in their classification), 2) how many levels they have (as a proxy for height), 3) whether they have pools and, if so, their area, and 4) the total area of the whole building and the area of the part of the building used as a house.
Before merging both sources, we filter the dataset to retain only those constructions located on rural land, excluding all buildings situated within urban classifications. This ensures that our analysis focuses exclusively on the rural housing stock that is central to the Atlas.
CatastreRural_Filtrat_s <- CatastreRural_Brut %>%
  filter(`1_tipo_reg` == 14)

CatastreRural_Filtrat <- CatastreRural_Brut %>%
  filter(`1_tipo_reg` == 14, str_detect(`71_cd`, "V"))

CatastreRural_AnySuper <- CatastreRural_Filtrat %>%
  select(`1_tipo_reg`, `31_pc`, `79_aec`, `84_stl`, `71_cd`)

CatastreRural_Complet <- CatastreRural_AnySuper %>%
  group_by(`31_pc`) %>%
  summarise(Any = max(`79_aec`),
            Buildings = sum(str_detect(`71_cd`, "V"), na.rm = TRUE),
            SuperB = sum(as.numeric(`84_stl`), na.rm = TRUE)) %>%
  rename(REFCAT = `31_pc`)
roman_map <- c(I = 1L, II = 2L, III = 3L, IV = 4L)

get_plantas <- function(x) {
  m <- str_match_all(x, "(?<=^|\\+)(IV|III|II|I)(?=\\+|$)")
  vals <- roman_map[unlist(lapply(m, function(mm) mm[, 2]))]
  vals <- vals[!is.na(vals)]
  if (length(vals) == 0) NA_integer_ else max(vals)
}

Constru_amb_plantes <- Constru_complet %>%
  mutate(PLANTES = map_int(CONSTRU, get_plantas))
Constru_Recompte <- Constru_amb_plantes %>%
  st_drop_geometry() %>%
  group_by(REFCAT) %>%
  summarise(
    PI = sum(if_else(CONSTRU == "PI", 1, 0), na.rm = TRUE),
    AREA_PI = sum(if_else(CONSTRU == "PI", AREA, 0), na.rm = TRUE),
    AREA_Plantes = sum(if_else(!is.na(PLANTES), AREA, 0), na.rm = TRUE),
    MAX_Plantes = if (all(is.na(PLANTES))) NA_integer_ else max(PLANTES, na.rm = TRUE),
    N_Plantes = sum(if_else(!is.na(PLANTES), 1, 0)),
    .groups = "drop"
  )

Catastre_Complet <- Constru_Recompte %>%
  full_join(CatastreRural_Complet, by = "REFCAT")
However, we detect some mismatches between the two datasets: when cross-referencing them by parcel reference, some cases do not fully match and are therefore absent from the joined result. For now, we leave them as they are, pending further investigation of why this happens.
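A lightweight way to surface these mismatches is to compare the join keys of the two sources before merging. This is only a sketch on invented REFCAT values, not the real cadastral references:

```r
# Toy parcel references (invented) standing in for the two sources.
refs_parcels    <- c("07001A00100001", "07001A00100002", "07001A00100003")
refs_footprints <- c("07001A00100001", "07001A00100003", "07001A00100004")

# Keys present in one source but missing from the other:
missing_in_footprints <- setdiff(refs_parcels, refs_footprints)
missing_in_parcels    <- setdiff(refs_footprints, refs_parcels)
```

Running both directions matters: a reference can be missing from the footprint layer (no geometry) or from the parcel table (no attributes), and the two cases call for different fixes.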
Once the cadastral parcel data is merged with the georeferenced Cadastral Building Footprints, we begin a long and detailed cleaning process. This stage is time-consuming and involves manual verification, often building by building, to ensure that only actual rural houses remain in the final dataset. The cleaning follows this sequence of steps:
In QGIS, once all previous steps are completed, we join the resulting dataset with the Tourism Rentals Dataset from the Government of the Balearic Islands. Some transformations are required, since the georeferenced points in this dataset are not located at the centroid of the house but rather at the centroid (or in close proximity) of the plot of land. In other words, the tourism-rental point is positioned on the corresponding parcel, not directly on the building.
For this reason, we perform a location-based join to assign each tourism-rental point the cadastral reference of the parcel in which it falls. This allows us to determine, for every parcel, whether it contains a tourism rental and how many beds it offers. As a result, we generate two variables: a binary indicator (0/1) where 1 indicates a registered rental, and the number of beds when applicable.
Afterward, we join the parcel-level rental information to the corresponding building, allowing us to identify which houses are rentals and the number of beds they have.
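A minimal sketch of this location-based join with sf, using toy geometries and assumed field names (REFCAT for the parcel reference, beds for rental capacity), could look like this:

```r
library(sf)  # sketch only: toy squares and points; the real layers come from the cadastre and the GOIB

# Two square parcels; two rental points, both falling inside parcel P1.
parcels <- st_sf(
  REFCAT = c("P1", "P2"),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
    st_polygon(list(rbind(c(20, 0), c(30, 0), c(30, 10), c(20, 10), c(20, 0))))
  )
)
rentals <- st_sf(
  beds = c(6, 4),
  geometry = st_sfc(st_point(c(5, 5)), st_point(c(6, 2)))
)

# Assign each rental point the REFCAT of the parcel it falls in.
joined <- st_join(rentals, parcels, join = st_within)

# Aggregate beds per parcel and derive the 0/1 rental indicator.
beds_per_parcel <- tapply(joined$beds, joined$REFCAT, sum)
parcels$rental  <- as.integer(parcels$REFCAT %in% names(beds_per_parcel))
parcels$beds    <- as.numeric(beds_per_parcel[parcels$REFCAT])
```

The same logic carries over to the real layers: `st_within` resolves each point to the parcel polygon containing it, so the point does not need to sit on the building itself.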
However, it is important to note that this method is not fully accurate, since the GOIB has not georeferenced all rural rental licences. Moreover, many Airbnb rentals are under-reported or missing from the official dataset. For this reason, we use Airbnb data as a reference point to cross-check and validate the counts at the municipal level.
The first step is identifying which buildings are located in areas classified as flood-risk zones. For this, we use the official determination made in 2002 by the Government of the Balearic Islands (https://ideib.caib.es/cataleg/srv/cat/catalog.search#/metadata/A85E9070-8C41-4090-AFF3-9AC582E0965F), which maps the territories considered to be at high risk of flooding. Although these maps are now outdated—especially given advancements in GIS techniques and the increasing severity of the climate emergency—they still represent the most authoritative delineation of risk zones and are known to be conservative rather than expansive. For this reason, we retain them as the baseline reference, understanding that actual risk areas today would likely be larger. After intersecting these layers with our cleaned dataset, we generate a simple dummy variable (0/1) indicating whether each building is located within a flood-risk area.
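The flood-risk flag can be sketched with sf on toy geometries (the square below stands in for the 2002 GOIB layer; only the output field name Build_Innu follows our dataset):

```r
library(sf)  # illustrative sketch with toy geometries, not the real 2002 layer

# One square flood-risk polygon; two buildings, one inside and one outside.
flood_zones <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0))))
)
buildings <- st_sf(
  id = 1:2,
  geometry = st_sfc(st_point(c(5, 5)), st_point(c(50, 50)))
)

# Dummy = 1 when a building intersects any flood-risk polygon.
buildings$Build_Innu <- as.integer(lengths(st_intersects(buildings, flood_zones)) > 0)
```

`st_intersects` returns a sparse list of matches per building; taking `lengths(...) > 0` converts it to the 0/1 indicator without materialising the full intersection geometry.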
After identifying buildings located in flood-risk zones, we delimit the areas that fall under highly protected rural land. To do this, we rely on the classification of the Pla Territorial de Mallorca (PTM), the territorial plan that distinguishes between different categories of rural land, including highly protected areas and general or common rural land. We focus specifically on Articles 19.3 and 20 of the regulatory section, and intersect their legal definitions with the corresponding cartography. Once this spatial layer is prepared, we determine whether any building falls within these protected categories.
The first step is assigning each building the rural-land category in which it is located. If a building falls into any of the following classifications, we label it as being in a Strong Environmental Protection Area, since these are zones where the construction of new isolated houses is explicitly prohibited:
AANP – Àrees Naturals d’Especial Interès d’Alt Nivell de Protecció
ANEI / ANEU – Àrees Naturals d’Especial Interès
ARIP-B – Àrees Rurals d’Interès Paisatgístic (Boscoses)
APT – Àrees de Protecció Territorial
Next, we create a second classification of strong protection. Buildings located in SRG-F (Suelo Rústico de Régimen General Forestal) or ARIP areas are also considered highly protected when the parcel on which they sit is smaller than 50,000 m². According to Article 20, parcels below this size cannot legally host new isolated single-family dwellings.
After this, we classify the remaining rural land as general rural protection, where the minimum parcel size for building a house is 14,000 m². Buildings located on parcels smaller than this threshold are not necessarily illegal, because municipal regulations before the 1990s used to allow smaller parcels, but they would not be permitted under today’s rules. For this reason, we categorize them as potentially informal, though confirming this would require a case-by-case review.
In summary, when we refer to houses on protected land, we combine:
Buildings located in the first group of highly protected categories (AANP, ANEI/ANEU, ARIP-B, APT).
Buildings in SRG-F or ARIP areas located on parcels smaller than 50,000 m².
And, when relevant, buildings located on the rest of rural land on parcels smaller than 14,000 m², which we consider potentially informal given the evolution of planning regulations.
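These three rules can be expressed as a single classification. The sketch below uses an invented toy table (the column names `categoria` and `parcel_m2` and the example rows are hypothetical); the category codes and the 50,000 / 14,000 m² thresholds are the ones stated above:

```r
library(dplyr)  # sketch of the three-tier rule on toy data

toy <- data.frame(
  categoria = c("AANP", "ARIP", "SRG-F", "SRG", "SRG"),
  parcel_m2 = c(80000, 30000, 60000, 12000, 20000)
)

toy <- toy %>%
  mutate(proteccio = case_when(
    # Rule 1: categories where new isolated houses are prohibited outright.
    categoria %in% c("AANP", "ANEI", "ANEU", "ARIP-B", "APT") ~ "strong protection",
    # Rule 2: SRG-F / ARIP with parcels below the 50,000 m2 threshold (Art. 20).
    categoria %in% c("SRG-F", "ARIP") & parcel_m2 < 50000 ~ "strong protection (Art. 20)",
    # Rule 3: general rural land below the 14,000 m2 minimum parcel size.
    parcel_m2 < 14000 ~ "potentially informal",
    TRUE ~ "general rural"
  ))
```

Because `case_when` evaluates rules in order, a parcel is only tested against the 14,000 m² threshold if it escaped the two stronger-protection rules, mirroring the hierarchy described above.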
When interpreting these classifications, we must keep in mind that our computation focuses solely on whether a house exists in a specific location. Many cases of informality do not stem from the physical characteristics of the house itself (such as height or volume), but rather from whether the dwelling was permitted to be built in that location under the relevant planning regulations. For example, a house on protected land may be legal, with the right size of plot, yet have added a pool afterwards without permission. Such cases are not counted here.
Finally, we join each building by location to the parcel on which it stands and record the parcel’s area in square metres.
At the end of this process, we obtain a complete count of all buildings located on rural land, combining the results from both QGIS and R. These constitute the full set of buildings used in our analysis. The resulting variables are:
names(buildings_complet)
## [1] "gml_id" "lowerCorne" "upperCorne" "beginLifes" "conditionO"
## [6] "beginning" "end" "endLifespa" "informatio" "reference"
## [11] "localId" "namespace" "horizontal" "horizont_1" "horizont_2"
## [16] "referenceG" "currentUse" "numberOfBu" "numberOfDw" "numberOfFl"
## [21] "documentLi" "format" "sourceStat" "officialAr" "value"
## [26] "value_uom" "layer" "path" "Cat_REFCAT" "Cat_PI"
## [31] "Cat_AREA_P" "Cat_AREA_1" "Cat_MAX_Pl" "Cat_N_Plan" "Cat_Any"
## [36] "Cat_Buildi" "Cat_SuperB" "Parc_areaV" "TuriGOIB_N" "TuriGOIB_P"
## [41] "Allotj_n" "Allotj_p" "PrtAbsolut" "Ptc50" "PrtGen"
## [46] "Build_Innu" "Munic_NOM" "Catastre_1" "Catastre_3" "Catastre_7"
## [51] "Any_num" "PrtAbsNum" "CODIMUIB_G" "geometry"
And here are the final statistics for all the houses and the main variables:
buildings_complet_totals <- buildings_complet %>%
  mutate(Cat_Any = as.numeric(Cat_Any),
         Cat_SuperB = as.numeric(Cat_SuperB),
         Cat_PI = as.numeric(Cat_PI)) %>%
  group_by() %>%
  summarise(Cases = n(),
            PI = sum(Cat_PI, na.rm = TRUE),
            mean_any = mean(Cat_Any, na.rm = TRUE),
            med_any = median(Cat_Any, na.rm = TRUE),
            mean_superB = mean(Cat_SuperB, na.rm = TRUE),
            med_superB = median(Cat_SuperB, na.rm = TRUE),
            mean_parc = mean(as.numeric(Parc_areaV), na.rm = TRUE),
            med_parc = median(as.numeric(Parc_areaV), na.rm = TRUE),
            sum_14K = sum(PrtGen),
            sum_protecc = sum(as.numeric(PrtAbsolut)),
            sum_protecc_amb50 = sum_protecc + sum(Ptc50), # high but relative protection
            sum_innun = sum(as.numeric(Build_Innu), na.rm = TRUE)
  ) %>%
  mutate(perc_14K = (sum_14K / Cases) * 100,
         perc_proteccioabsoluta = (sum_protecc / Cases) * 100,
         perc_proteccioamb50 = (sum_protecc_amb50 / Cases) * 100,
         perc_innundable = (sum_innun / Cases) * 100)

knitr::kable(
  buildings_complet_totals
)
| Cases | PI | mean_any | med_any | mean_superB | med_superB | mean_parc | med_parc | sum_14K | sum_protecc | sum_protecc_amb50 | sum_innun | perc_14K | perc_proteccioabsoluta | perc_proteccioamb50 | perc_innundable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55256 | 21897 | 1984.594 | 1996 | 182.7571 | 135 | 29389.56 | 5712 | 30134 | 4591 | 7965 | 5066 | 54.53525 | 8.3086 | 14.41472 | 9.168235 |
Now, using the dataset, we count how many houses there are in each municipality and what their characteristics are. The code we originally use is the following:
muni_count <- buildings_complet %>%
  mutate(TuriGOIB_N = as.numeric(TuriGOIB_N),
         Cat_Buildi = as.numeric(Cat_Buildi),
         Cat_SuperB = as.numeric(Cat_SuperB),
         Parc_areaV = as.numeric(Parc_areaV),
         value = as.numeric(value),
         Cat_AREA_1 = as.numeric(Cat_AREA_1),
         illegal_75_7K = ifelse(CODIMUIB_G == 1 & Parc_areaV < 7000 & Cat_Any > 1975 &
                                  Cat_Any <= 1997, 1, 0),
         illegal_75_14K = ifelse(CODIMUIB_G == 1 & Parc_areaV < 14000 & Cat_Any > 1975 &
                                   Cat_Any <= 1997, 1, 0),
         illegal_97_14K = ifelse(CODIMUIB_G == 1 & Parc_areaV < 14000 & Cat_Any > 1997, 1, 0)) %>%
  group_by(Munic_NOM) %>%
  summarise(
    any_mean = mean(Any_num, na.rm = TRUE),
    any_med = median(Any_num, na.rm = TRUE),
    n_hab = sum(as.numeric(Cat_Buildi), na.rm = TRUE),
    n_general = n(),
    Rental_n_GOIB = sum(TuriGOIB_N, na.rm = TRUE),
    Rental_pl_GOIB = sum(TuriGOIB_P, na.rm = TRUE),
    n_Innund = sum(as.numeric(Build_Innu) * Cat_Buildi, na.rm = TRUE),
    n_Innund_gen = sum(Build_Innu, na.rm = TRUE),
    perc_innund = (n_Innund_gen / n_general) * 100,
    n_prtabs = sum(as.numeric(PrtAbsolut) * Cat_Buildi, na.rm = TRUE),
    n_prtabs_general = sum(as.numeric(PrtAbsolut), na.rm = TRUE),
    perc_prtabs = (n_prtabs_general / n_general) * 100,
    areaHous_mean = mean(as.numeric(Cat_SuperB), na.rm = TRUE),
    areaHous_med = median(as.numeric(Cat_SuperB), na.rm = TRUE),
    areagfa_mean = mean(as.numeric(value), na.rm = TRUE),
    areagfa_med = median(as.numeric(value), na.rm = TRUE),
    areaTOT_mean = mean(as.numeric(Cat_AREA_1), na.rm = TRUE),
    areaPar_mean = mean(Parc_areaV, na.rm = TRUE),
    areaPar_med = median(Parc_areaV, na.rm = TRUE),
    n_PI = sum(as.numeric(Cat_PI), na.rm = TRUE),
    areaPI_mean = mean(Cat_AREA_P, na.rm = TRUE),
    n_prt50 = sum(Ptc50 * Cat_Buildi, na.rm = TRUE),
    n_prt50_general = sum(Ptc50, na.rm = TRUE),
    n_prtg_tot = n_prt50_general + n_prtabs_general,
    SRG = sum(CODIMUIB_G, na.rm = TRUE),
    illegal_75_7K = sum(illegal_75_7K, na.rm = TRUE),
    illegal_75_14K = sum(illegal_75_14K, na.rm = TRUE),
    illegal_97_14K = sum(illegal_97_14K, na.rm = TRUE),
    illegal = illegal_75_14K + illegal_97_14K,
    perc_illegal = (illegal / n_general) * 100,
    perc_prt50 = (n_prt50_general / n_general) * 100) %>%
  mutate(
    perc_hab = (n_hab / sum(n_hab)) * 100,
    perc_general = (n_general / sum(n_general)) * 100) %>%
  full_join(airbnb) %>%
  mutate(diffTuri = airbnb_25 - HabT_n,
         perc_Turi_general = (HabT_n / n_general) * 100,
         perc_Turi_genANDall = ((HabT_n + Allotj_n) / n_general) * 100,
         perc_airb_general = (airbnb_25 / n_general) * 100,
         perc_Turi_habANDall = ((HabT_n + Allotj_n) / n_hab) * 100,
         perc_prtg_tot = (n_prtg_tot / n_general) * 100,
         perc_airb_hab = (airbnb_25 / n_hab) * 100,
         perc_illegal_SRG = (illegal / localId_co) * 100) %>%
  mutate(
    dplyr::across(
      .cols = where(is.numeric),
      .fns = ~ tidyr::replace_na(., 0)
    ))
total_illegals <- muni_count %>%
  group_by() %>%
  summarise(
    illegal = sum(illegal, na.rm = TRUE),
    illegal_75_7K = sum(illegal_75_7K, na.rm = TRUE),
    illegal_97_14K = sum(illegal_97_14K, na.rm = TRUE),
    illegal_75_14K = sum(illegal_75_14K, na.rm = TRUE),
    n = sum(n_general, na.rm = TRUE),
    SRG = sum(SRG, na.rm = TRUE),
    airbnb = sum(airbnb_25, na.rm = TRUE)
  ) %>%
  mutate(
    perc_illegal_SRG = (illegal / SRG) * 100,
    perc_illegal = (illegal / n) * 100,
    perc_illegal_97_14K = (illegal_97_14K / n) * 100,
    perc_illegal_75_14K = (illegal_75_14K / n) * 100,
    perc_illegal_75_7K = (illegal_75_7K / n) * 100,
    perc_illegal_97_14K_SRG = (illegal_97_14K / SRG) * 100,
    perc_illegal_75_14K_SRG = (illegal_75_14K / SRG) * 100,
    perc_illegal_75_7K_SRG = (illegal_75_7K / SRG) * 100,
    perc_airbnb = (airbnb / n) * 100
  )
With this, we can visualize the totals for each municipality.
The short answer is: clustering did not work. Our goal was to use all available cadastral features to identify potential typologies of houses—groups that would reflect different patterns of size, shape, age, parcel characteristics, or tourist-rental intensity. If such clusters existed, they could have served as a useful way of summarizing the morphological diversity of rural housing.
Before applying clustering, we tested whether the data had the necessary internal structure:
We created an index to uniquely identify each building.
We selected all numeric cadastral variables and harmonized missing values.
We removed variables with no variation (constant values).
We checked correlations and found several pairs of highly correlated variables (>0.9).
We ran an outlier analysis using IQR thresholds to detect extreme values.
All of the results suggested that the dataset is dominated by a high number of outliers and lacks clear internal substructure.
buildings_complet_index <- as.data.frame(buildings_complet) %>%
  mutate(index = row_number())

buildings_complet_cl <- as_tibble(buildings_complet_index) %>%
  select(index, 31:52) %>%
  select(-c("Cat_N_Plan", "Catastre_1", "Catastre_3",
            "Catastre_7", "Any_num")) %>%
  mutate(
    TuriGOIB_N = ifelse(is.na(TuriGOIB_N), 0, TuriGOIB_N),
    TuriGOIB_P = ifelse(is.na(TuriGOIB_P), 0, TuriGOIB_P),
    Allotj_n = ifelse(is.na(Allotj_n), 0, Allotj_n),
    Allotj_p = ifelse(is.na(Allotj_p), 0, Allotj_p),
    Build_Innu = ifelse(is.na(Build_Innu), 0, Build_Innu)
  ) %>%
  filter(
    !is.na(Cat_Buildi),
    !is.na(Cat_AREA_1),
    !is.na(Cat_MAX_Pl),
    Cat_Any > 1000
  ) %>%
  select(-Munic_NOM) %>%
  as.data.frame() %>%
  select(-Cat_Buildi, -Allotj_n, -Allotj_p, -TuriGOIB_P, -PrtAbsolut, -Ptc50, -PrtGen, -Build_Innu) %>%
  mutate(across(-index, as.numeric))

buildings_complet_cl_si <- buildings_complet_cl %>%
  select(-index) %>% # drop the index column
  mutate(across(everything(), as.numeric))

cor_dta <- cor(buildings_complet_cl_si)
cor_dta2 <- as.data.frame(as.table(cor_dta))
subset(cor_dta2, Freq > 0.9)
## Var1 Var2 Freq
## 1 Cat_AREA_P Cat_AREA_P 1
## 10 Cat_AREA_1 Cat_AREA_1 1
## 19 Cat_MAX_Pl Cat_MAX_Pl 1
## 28 Cat_Any Cat_Any 1
## 37 Cat_SuperB Cat_SuperB 1
## 46 Parc_areaV Parc_areaV 1
## 55 TuriGOIB_N TuriGOIB_N 1
## 64 PrtAbsNum PrtAbsNum 1
cor_mat <- cor(buildings_complet_cl_si, use = "pairwise.complete.obs")
corrplot(cor_mat, method = "color", type = "upper", tl.cex=0.7)
# Outliers
outliers_iqr_cl <- sapply(buildings_complet_cl, function(col) {
  q1 <- quantile(col, 0.25)
  q3 <- quantile(col, 0.75)
  iqr <- q3 - q1
  col < (q1 - 1.5 * iqr) | col > (q3 + 1.5 * iqr)
})

outlier_rows <- which(apply(outliers_iqr_cl, 1, any))
length(outlier_rows)
## [1] 14112
We applied a hierarchical clustering model (Ward.D2) on the scaled numerical features. We tested cluster solutions between k = 2 and k = 10, computing the mean silhouette width for each value of k. The silhouette values remained consistently low, and no value of k produced a meaningful improvement.
Even when forcing a solution (we tested k = 7 for illustration purposes), the clusters were numerically unbalanced, overlapping, and indistinguishable in their main characteristics (house area, parcel size, number of floors, construction year, number of rental beds, etc.). The dendrogram and silhouette plots confirmed the absence of well-defined cluster boundaries.
df_num <- buildings_complet_cl %>%
  select(index, where(is.numeric))

df_num <- df_num %>% drop_na()

df_num <- df_num[, sapply(df_num, function(x) length(unique(x)) > 1)]

cols_ok <- sapply(df_num, function(x) length(unique(x)) > 1)
cols_ok["index"] <- TRUE
df_num <- df_num[, cols_ok]

idx <- df_num$index
df_scaled <- scale(df_num %>% select(-index))

d <- dist(df_scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE, main = "Hierarchical clustering dendrogram")
ks <- 2:10
sil_means <- sapply(ks, function(k) {
  cl_k <- cutree(hc, k = k)
  sil <- silhouette(cl_k, d)
  mean(sil[, 3])
})
sil_means
## [1] 0.2277950 0.2978560 0.3146154 0.3158251 0.3436693 0.3441119 0.3337026
## [8] 0.3537014 0.3538284
plot(ks, sil_means, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Mean silhouette",
     main = "Silhouette vs k (Ward.D2)")
abline(v = ks[which.max(sil_means)], lty = 2)
clusters <- cutree(hc, k = 7)
table(clusters)
## clusters
## 1 2 3 4 5 6 7
## 3875 252 11611 20545 11949 9 2
cut_height <- hc$height[length(hc$height) - (7 - 1)]

plot(hc, labels = FALSE, main = "Dendrogram with cut line for k = 7")
abline(h = cut_height, col = "blue", lty = 2, lwd = 2)

clusters_df <- data.frame(
  index = idx,
  cluster = as.integer(clusters)
)

buildings_complet_index <- buildings_complet_index %>%
  left_join(clusters_df, by = "index")

buildings_complet_index$cluster[is.na(buildings_complet_index$cluster)] <- 0
names(buildings_complet_index)
## [1] "gml_id" "lowerCorne" "upperCorne" "beginLifes" "conditionO"
## [6] "beginning" "end" "endLifespa" "informatio" "reference"
## [11] "localId" "namespace" "horizontal" "horizont_1" "horizont_2"
## [16] "referenceG" "currentUse" "numberOfBu" "numberOfDw" "numberOfFl"
## [21] "documentLi" "format" "sourceStat" "officialAr" "value"
## [26] "value_uom" "layer" "path" "Cat_REFCAT" "Cat_PI"
## [31] "Cat_AREA_P" "Cat_AREA_1" "Cat_MAX_Pl" "Cat_N_Plan" "Cat_Any"
## [36] "Cat_Buildi" "Cat_SuperB" "Parc_areaV" "TuriGOIB_N" "TuriGOIB_P"
## [41] "Allotj_n" "Allotj_p" "PrtAbsolut" "Ptc50" "PrtGen"
## [46] "Build_Innu" "Munic_NOM" "Catastre_1" "Catastre_3" "Catastre_7"
## [51] "Any_num" "PrtAbsNum" "CODIMUIB_G" "geometry" "index"
## [56] "cluster"
table(buildings_complet_index$cluster)
##
## 0 1 2 3 4 5 6 7
## 7013 3875 252 11611 20545 11949 9 2
cluster_summary <- buildings_complet_index %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    # Cat_AREA_P
    mean_Cat_AREA_P = mean(as.numeric(Cat_AREA_P), na.rm = TRUE),
    median_Cat_AREA_P = median(as.numeric(Cat_AREA_P), na.rm = TRUE),
    sd_Cat_AREA_P = sd(as.numeric(Cat_AREA_P), na.rm = TRUE),
    # Cat_AREA_1
    mean_Cat_AREA_1 = mean(as.numeric(Cat_AREA_1), na.rm = TRUE),
    median_Cat_AREA_1 = median(as.numeric(Cat_AREA_1), na.rm = TRUE),
    sd_Cat_AREA_1 = sd(as.numeric(Cat_AREA_1), na.rm = TRUE),
    # Cat_SuperB
    mean_Super = mean(as.numeric(Cat_SuperB), na.rm = TRUE),
    median_Super = median(as.numeric(Cat_SuperB), na.rm = TRUE),
    sd_Super = sd(as.numeric(Cat_SuperB), na.rm = TRUE),
    # Cat_MAX_Pl
    mean_Cat_MAX_Pl = mean(as.numeric(Cat_MAX_Pl), na.rm = TRUE),
    median_Cat_MAX_Pl = median(as.numeric(Cat_MAX_Pl), na.rm = TRUE),
    sd_Cat_MAX_Pl = sd(as.numeric(Cat_MAX_Pl), na.rm = TRUE),
    # Parc_areaV
    mean_Parc = mean(as.numeric(Parc_areaV), na.rm = TRUE),
    median_Parc = median(as.numeric(Parc_areaV), na.rm = TRUE),
    sd_Parc = sd(as.numeric(Parc_areaV), na.rm = TRUE),
    # TuriGOIB_N
    mean_Turi = mean(as.numeric(TuriGOIB_N), na.rm = TRUE),
    median_Turi = median(as.numeric(TuriGOIB_N), na.rm = TRUE),
    sd_Turi = sd(as.numeric(TuriGOIB_N), na.rm = TRUE),
    # Cat_Any
    mean_Cat_Any = mean(as.numeric(Cat_Any), na.rm = TRUE),
    median_Cat_Any = median(as.numeric(Cat_Any), na.rm = TRUE),
    sd_Cat_Any = sd(as.numeric(Cat_Any), na.rm = TRUE),
    # Most common municipality
    top_munic = names(which.max(table(Munic_NOM))),
    top_munic_n = max(table(Munic_NOM)),
    second_munic_n = sort(table(Munic_NOM), decreasing = TRUE)[2],
    top_advantage = max(table(Munic_NOM)) - sort(table(Munic_NOM), decreasing = TRUE)[2]
  )
cluster_summary_flipped <- cluster_summary %>%
  as.data.frame() %>%
  column_to_rownames("cluster") %>%
  t() %>%
  as.data.frame()
plot_data <- buildings_complet_index %>%
  filter(!cluster %in% c(0, 4, 6, 7)) %>%
  mutate(cluster = factor(cluster)) %>%
  select(
    cluster,
    Cat_AREA_P,
    Cat_AREA_1,
    Cat_SuperB,
    Cat_MAX_Pl,
    Parc_areaV,
    TuriGOIB_N,
    Cat_Any) %>%
  mutate(across(-cluster, as.numeric)) %>%
  pivot_longer(
    cols = -cluster,
    names_to = "variable",
    values_to = "value"
  )

ggplot(plot_data, aes(x = cluster, y = value)) +
  geom_boxplot(outlier.alpha = 0.3) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(
    x = "Cluster",
    y = "Value of Variable",
    title = "Distribution by Cluster"
  )
plot(hc, labels = FALSE, main = "Hierarchical clustering dendrogram (k = 7)")
rect.hclust(hc, k = 7, border = "red")
Because clustering failed to reveal any additional structure or insight, we decided not to use it as part of the typology analysis. Instead, we focus on individual variables and spatial patterns, which provide more substantive information than the attempted cluster models.
During the month of January we carried out a Participatory Action Research (PAR) process in which the people of Pollença defined the research priorities themselves. After collecting 133 research questions that reflected the community’s concerns, three emerged as the final ones, decided in a closing workshop with the main stakeholders. All of them centred on a comparison between rentals on rural land and other forms of tourism in the municipality (mainly hotels and urban rentals). The resulting questions are:
In this analysis we focus on the second question: the environmental consequences of rural houses. To address it, we work through two complementary perspectives:
Water consumption: We build a model capable of predicting the water consumption of rural houses and compare it with the consumption patterns of urban households.
Mobility impact: We develop a model to estimate the mobility generated by rural houses, focusing on the distance to key amenities and the extent to which residents must rely on cars compared to those living in urban areas.
With this, we aim to inspire stakeholders and policymakers to agree on shared action in those spaces. By visualizing this data we can put the extent of the problems on the table and discuss potential interventions in the municipality.
However, before proceeding, a short clarification is needed. We argue that informality is not primarily a matter of policy or legalization, but a projectual issue—something that must be addressed spatially, through design and planning interventions. Each place may require a different response. And as every planning proposal needs a clear spatial delimitation, we first sought to understand whether there are emerging neighbourhood-like clusters in rural land—areas where the density of houses is high enough to constitute a coherent unit for planning. Identifying such clusters would allow us to examine their specific patterns of water consumption or mobility and use them as the basis for discussing context-sensitive interventions.
For this reason, we built two models of clusters based on the idea of proximity and density of houses among each other that could allow us to determine what “neighborhoods” exist. We used two methodologies: K-Means and DBSCAN.
Ultimately, each house can be treated as a point (we compute the centroid of each footprint polygon using QGIS). Each point is defined by two coordinates, X and Y, so we can use these two coordinates directly to build a k-means clustering:
coords_utm_bp <- st_transform(buildings_complet_p, 25831)
coords_centroids <- st_centroid(coords_utm_bp)
coords_mat_cases <- st_coordinates(coords_centroids)
coords_scaled <- scale(coords_mat_cases)

set.seed(1000)
max_k <- 20
wss <- numeric(max_k)
for (k in 1:max_k) {
  km <- kmeans(coords_scaled, centers = k, nstart = 25)
  wss[k] <- km$tot.withinss # within-cluster sum of squares
}

plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster SS",
     main = "Elbow plot for k-means")

sil_values <- numeric(20)
for (k in 2:20) {
  km_temp <- kmeans(coords_scaled, centers = k, nstart = 25)
  sil_values[k] <- mean(silhouette(km_temp$cluster, dist(coords_scaled))[, 3])
}

# Silhouette plot
plot(2:20, sil_values[2:20], type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Mean silhouette",
     main = "Silhouette for k-means")
Using the silhouette and WSS curves, we see that the potential cut could be at 6 or 10 clusters. So we build both models and observe that both return a sensible distribution of clusters.
km6 <- kmeans(coords_scaled, centers = 6, nstart = 25)
km10 <- kmeans(coords_scaled, centers = 10, nstart = 25)
coords_utm_bp$cluster6 <- factor(km6$cluster)
coords_utm_bp$cluster10 <- factor(km10$cluster)
knitr::kable(table(coords_utm_bp$cluster6),
caption = "K-mean neighb. (k=6)")
| Var1 | Freq |
|---|---|
| 1 | 669 |
| 2 | 693 |
| 3 | 143 |
| 4 | 332 |
| 5 | 375 |
| 6 | 422 |
knitr::kable(table(coords_utm_bp$cluster10),
caption = "K-mean neighb. (k=10)")
| Var1 | Freq |
|---|---|
| 1 | 335 |
| 2 | 266 |
| 3 | 235 |
| 4 | 346 |
| 5 | 321 |
| 6 | 10 |
| 7 | 338 |
| 8 | 113 |
| 9 | 123 |
| 10 | 547 |
Here is the result mapped in QGIS (aggregated into hexagons of 200 square metres):
When we asked different people from the town which clustering best reflected the lived realities of rural Pollença, they consistently pointed to the 10-cluster solution. According to them, it most closely aligns with what they commonly recognize as the “areas” of the rural landscape—categories rooted in historic land divisions, traditional parcel structures, and popular geographic terminology.
Yet there is an important insight that residents pointed out, and one we fully agree with: houses in rural Pollença are not randomly distributed across the landscape. Their spatial logic is shaped by mountains, valleys, historic paths, and physical barriers. In other words, the “distance” between two houses is not the straight line between their coordinates but the network of roads, trails, and passes that actually connects them; these paths and trails are the true accesses to the houses and reflect the orography of the land. Two houses that appear close in Euclidean space (the space k-means uses) may, in reality, be separated by a mountain, while others that seem far apart on a map are effectively neighbours because they share a direct path.
This becomes clear in areas such as Siller and Cala Sant Vicenç, where simple coordinate-based proximity fails to capture the real spatial relationships of daily life: the two areas are separated by a mountain (el Coll de Siller), so the paths that connect them are very long.
Because of this, we sought to rebuild the clusterization based on network distance rather than on raw geographic coordinates, allowing us to reflect the true connectivity patterns that shape how people move, interact, and access the territory.
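The gap between straight-line and network distance can be made concrete with a toy example. This is a Python sketch with invented coordinates and path lengths (the actual network distances in this work come from the path network, computed in QGIS): two houses sit 500 m apart on the map, but the only path between them detours around an obstacle.

```python
import heapq
import math

def dijkstra(graph, source):
    """Shortest network distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# Invented coordinates in metres: houses A and B look like neighbours,
# but a "mountain" forces the only path through junction C.
coords = {"A": (0, 0), "B": (0, 500), "C": (3000, 250)}
paths = {
    "A": [("C", 3100)],
    "B": [("C", 3100)],
    "C": [("A", 3100), ("B", 3100)],
}

euclid_ab = math.dist(coords["A"], coords["B"])  # what k-means "sees"
network_ab = dijkstra(paths, "A")["B"]           # what residents actually travel
print(euclid_ab, network_ab)  # 500.0 6200.0
```

Clustering on the second kind of distance instead of the first is precisely the motivation for moving away from k-means.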
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a widely used algorithm in spatial analysis. It offers a straightforward way to identify density-based groups of points while also distinguishing outliers: it clusters by density and discards "noise", which in our case corresponds to isolated houses. One key advantage is that DBSCAN can operate directly on a distance matrix, meaning it can use the actual network distances between points (our paths), unlike k-means, which requires raw coordinates, assumes roughly spherical clusters in geometric space, and forces every outlier into a cluster.
As a result, for our case, where houses are separated by orography, paths, and complex access routes, DBSCAN is a more suitable method than k-means. We therefore apply DBSCAN to identify meaningful clusters of rural houses. The procedure requires two main steps:
Building the dataset: We construct the full distance matrix, measuring the network-based distance from every house to every other house. This allows DBSCAN to work with the real distances that define how people actually move across the territory.
Running DBSCAN: instead of specifying the number of clusters (as in k-means), DBSCAN requires two parameters: ε (epsilon), the maximum distance between two houses for them to be considered "neighbours"; and MinPts, the minimum number of houses required to form a cluster.
These two parameters shape the clusters: low values generate small, tight clusters; high values merge areas into larger units. Choosing ε and MinPts is an iterative and interpretive process, which we refine based on both statistical indicators and local knowledge.
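These two parameters can be seen in action on a tiny example. The project runs DBSCAN in R with fpc::dbscan; the sketch below uses Python's scikit-learn instead, with an invented 5×5 network-distance matrix, passing metric="precomputed" so the algorithm consumes distances directly rather than coordinates:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Invented symmetric network distances (metres): points 0-2 are mutually
# close, points 3-4 are close, and the two groups are far apart.
D = np.array([
    [   0,  100,  150, 5000, 5200],
    [ 100,    0,  120, 5100, 5300],
    [ 150,  120,    0, 5050, 5250],
    [5000, 5100, 5050,    0,   90],
    [5200, 5300, 5250,   90,    0],
], dtype=float)

# eps plays the role of epsilon, min_samples that of MinPts.
db = DBSCAN(eps=200, min_samples=2, metric="precomputed").fit(D)
print(db.labels_)  # [0 0 0 1 1]: two clusters, no noise (noise would be -1)
```

Raising eps toward 5000 would merge the two groups into one cluster; raising min_samples above a group's size would turn its points into noise.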
To build the dataset we use QNEAT3, an algorithm toolbox for QGIS. Since we have all the buildings as points (the same ones used for k-means), we simply run the algorithm over the routes that IDEIB (from the Government of the Balearic Islands) provides as open data on its website. These routes form the complete network of public routes, paths, and trails of the Balearic Islands. From here, we run the algorithm "OD-Matrix from Points as Lines (n:n)", which looks for the shortest path based on distance (that is, the path that minimizes metres, the unit of our geographical projection). Costs are split into Entry-Cost (from the house of origin to the path), Network-Cost (the shortest distance along the network to the destination), Exit-Cost (the distance from the path at its final point to the destination), and Total Cost (the sum of the other components). More information on the algorithm is available here: https://root676.github.io/OdMatrixAlgs.html.
By taking the total_cost we can build a matrix of distances covering every pair of the more than 2,500 houses in the municipality (2,634 points, giving 2,634² = 6,937,956 origin-destination combinations). From here, we build the matrix, which looks like this:
|  | House 1 | House 2 | House 3 | … | House k |
|---|---|---|---|---|---|
| House 1 | 0 | 2 | 5 | … | X |
| House 2 | 2 | 0 | 8 | … | Y |
| House 3 | 5 | 8 | 0 | … | Z |
| … | … | … | … | … | … |
| House k | X | Y | Z | … | 0 |
Here is the code to build that matrix:
library(sf)     # st_drop_geometry()
library(dplyr)
library(tidyr)  # pivot_wider()
# "matrix" is the OD layer exported from QNEAT3
df <- st_drop_geometry(matrix)
# Keep the minimum cost per origin-destination pair
df_clean <- df %>%
group_by(origin_id, destinatio) %>%
summarise(total_cost = min(total_cost, na.rm = TRUE)) %>%
ungroup()
# Pivot the long table into a square origin x destination table
matrix_wider <- df_clean %>%
pivot_wider(
names_from = destinatio,
values_from = total_cost
)
# Convert to a numeric matrix, with origins as row names
rownames(matrix_wider) <- matrix_wider$origin_id
matrix_wider_2 <- matrix_wider %>% select(-origin_id)
distance_matrix <- as.matrix(matrix_wider_2)
distance_matrix <- apply(distance_matrix, 2, as.numeric)
distance_matrix <- as.matrix(distance_matrix)
It is important that the matrix contains no NA or infinite values, as DBSCAN cannot operate with them. In QNEAT3, these values indicate houses that are too far from the others for a route to be computed. For this reason, instead of eliminating these cases, we replace them with a number large enough that they will not be assigned to any cluster (DBSCAN then treats them as noise).
sum(is.na(distance_matrix))
## [1] 0
sum(is.infinite(distance_matrix))
## [1] 110674
finite_vals <- distance_matrix[is.finite(distance_matrix)]
max_val <- max(finite_vals, na.rm = TRUE)
big_val <- max_val * 2
distance_matrix[!is.finite(distance_matrix)] <- big_val
From here we can operate, yet first we have to decide the parameters. What we do in this case is run different combinations of them and compare their distributions, using the kNNdist function. We generated a set of candidate ε values based on the distribution of k-nearest-neighbour distances between houses (from the 80th to the 98th percentile) and combined them with a range of MinPts values commonly used in density-based clustering. For each combination, DBSCAN was run on the full distance matrix of Pollença's rural houses, and we recorded both the number of clusters produced and the proportion of houses classified as noise. We then filtered the results to retain only those combinations that produced a reasonable number of clusters (between 5 and 50) and a manageable amount of noise (under 30%).
From there we select the combination that gives us a high number of clusters (similar to the k-means result) and, at the same time, a meaningful distribution among the clusters.
library(dbscan)  # provides kNNdist()
dist_obj <- as.dist(distance_matrix)
k <- 10
d <- kNNdist(dist_obj, k = k)
eps_candidates <- as.numeric(quantile(d, probs = seq(0.80, 0.98, by = 0.03)))
MinPts_values <- c(5, 10, 15, 20, 30)
results <- expand.grid(
eps = eps_candidates,
MinPts = MinPts_values
)
results$n_clusters <- NA
results$noise_ratio <- NA
for (i in seq_len(nrow(results))) {
eps_i <- results$eps[i]
MinPts_i <- results$MinPts[i]
db_i <- fpc::dbscan(
distance_matrix,
eps = eps_i,
MinPts = MinPts_i,
method = "dist"
)
tab <- table(db_i$cluster)
results$n_clusters[i] <- sum(names(tab) != "0")
results$noise_ratio[i] <- if ("0" %in% names(tab)) tab["0"] / sum(tab) else 0
}
choices_results <- subset(results, n_clusters >= 5 & n_clusters <= 50 & noise_ratio <= 0.3)
knitr::kable(results,
caption = "Grid search of eps and MinPts candidates for DBSCAN")
| eps | MinPts | n_clusters | noise_ratio |
|---|---|---|---|
| 589.0909 | 5 | 13 | 0.0503876 |
| 629.8899 | 5 | 11 | 0.0430233 |
| 689.4329 | 5 | 9 | 0.0383721 |
| 775.2191 | 5 | 10 | 0.0321705 |
| 932.2440 | 5 | 8 | 0.0205426 |
| 1177.2114 | 5 | 6 | 0.0147287 |
| 2222.8941 | 5 | 3 | 0.0089147 |
| 589.0909 | 10 | 7 | 0.0972868 |
| 629.8899 | 10 | 7 | 0.0798450 |
| 689.4329 | 10 | 5 | 0.0697674 |
| 775.2191 | 10 | 4 | 0.0550388 |
| 932.2440 | 10 | 4 | 0.0403101 |
| 1177.2114 | 10 | 4 | 0.0298450 |
| 2222.8941 | 10 | 1 | 0.0155039 |
| 589.0909 | 15 | 5 | 0.1317829 |
| 629.8899 | 15 | 6 | 0.1100775 |
| 689.4329 | 15 | 5 | 0.0856589 |
| 775.2191 | 15 | 3 | 0.0720930 |
| 932.2440 | 15 | 2 | 0.0596899 |
| 1177.2114 | 15 | 2 | 0.0445736 |
| 2222.8941 | 15 | 1 | 0.0155039 |
| 589.0909 | 20 | 7 | 0.1829457 |
| 629.8899 | 20 | 4 | 0.1546512 |
| 689.4329 | 20 | 4 | 0.1197674 |
| 775.2191 | 20 | 3 | 0.0984496 |
| 932.2440 | 20 | 2 | 0.0701550 |
| 1177.2114 | 20 | 2 | 0.0519380 |
| 2222.8941 | 20 | 1 | 0.0166667 |
| 589.0909 | 30 | 15 | 0.3294574 |
| 629.8899 | 30 | 11 | 0.2724806 |
| 689.4329 | 30 | 6 | 0.1658915 |
| 775.2191 | 30 | 3 | 0.1306202 |
| 932.2440 | 30 | 3 | 0.0937984 |
| 1177.2114 | 30 | 2 | 0.0573643 |
| 2222.8941 | 30 | 1 | 0.0220930 |
Having reviewed these results, we choose two combinations and run them:
db_500_30 <- fpc::dbscan(distance_matrix, eps = 589, MinPts = 30, method = "dist")
db_630_30 <- fpc::dbscan(distance_matrix, eps = 630, MinPts = 30, method = "dist")
knitr::kable(table(db_500_30$cluster),
caption = "Cluster frequencies for DBSCAN (eps = 589, MinPts = 30)")
| Var1 | Freq |
|---|---|
| 0 | 850 |
| 1 | 577 |
| 2 | 115 |
| 3 | 72 |
| 4 | 80 |
| 5 | 33 |
| 6 | 255 |
| 7 | 54 |
| 8 | 31 |
| 9 | 37 |
| 10 | 116 |
| 11 | 39 |
| 12 | 39 |
| 13 | 33 |
| 14 | 132 |
| 15 | 117 |
knitr::kable(table(db_630_30$cluster),
caption = "Cluster frequencies for DBSCAN (eps = 630, MinPts = 30)")
| Var1 | Freq |
|---|---|
| 0 | 703 |
| 1 | 613 |
| 2 | 175 |
| 3 | 78 |
| 4 | 86 |
| 5 | 35 |
| 6 | 588 |
| 7 | 59 |
| 8 | 42 |
| 9 | 31 |
| 10 | 34 |
| 11 | 136 |
Now we export the results to a shapefile and map them in QGIS. Here are the code and the final cartography:
matrix_wider_cl <- matrix_wider %>%
mutate(cluster_500 = db_500_30$cluster,
cluster_600 = db_630_30$cluster) %>%
select(cluster_500,cluster_600, origin_id)
Compared to k-means (k = 10), DBSCAN leaves out some observations as noise, yet it produces clusters that align more naturally with the town's lived neighbourhoods. The overall spatial pattern does not differ substantially from k-means, but DBSCAN offers a finer articulation of local densities and transitions. To assess whether these clusters resonated with local knowledge, we interviewed three residents of Pollença and a local geographer familiar with the area. While they initially recognised the larger, more contiguous areas produced by k-means, they related more strongly to the DBSCAN results once we narrowed the scale. At this finer resolution, DBSCAN captured boundaries and micro-territories that corresponded more closely to their everyday spatial experience. For this reason, DBSCAN will be used in the subsequent analysis.
Furthermore, our aim is to interpret density as a proxy for how policies might connect households to water infrastructure and consumption patterns. Density in a networked rural landscape is a more meaningful indicator for policy design than the broad, somewhat arbitrary partitions generated by k-means. DBSCAN offers us that density-driven lens. The results can be consulted in the “Pollenca” portal of this website.
How can we estimate the water consumption of houses that, in official terms, do not even exist? That is the paradox we are trying to confront.
Rural houses in Mallorca are not connected to the potable water network—neither to the municipal system nor to private providers. They depend on their own water wells, cisterns, tanks, or other improvised systems. These sources of water are often as irregular as the houses themselves. And yet, every summer, pools are filled, gardens are watered, and thousands of visitors stay in these properties as tourist rentals. Despite this, no one has ever answered how much water these houses actually use, nor how this hidden consumption compares to the officially measured supply of the municipality—the “visible city,” the one that is connected. Our aim is to predict, at least as a proxy, the water consumption of houses located on rural land.
How do we plan to do this? By leaning on what is visible. We have access to detailed water-consumption data for houses in urban areas that share similar typologies and morphological patterns with rural informal houses. In Pollença, two neighbourhoods, Gotmar and Llenaire, mirror the density, size, and architectural logic of rural houses. In addition, Siller provides a key bridge case: historically single-family houses on rural land that, for contingent reasons, are connected to the public water network. We were thus able to access the dataset for all those areas, containing the bimonthly consumption of 143 houses from 2019 until the end of 2024.
For all these areas, we can recreate the same dataset we have already built for the rural houses: cadastral variables, rental status, and spatial characteristics. The only extra step is to add their real water-consumption records. With this combined dataset, we can train a predictive model that learns the relationship between house characteristics and water usage. Once trained, the model can be applied to the rural houses we have detected and filtered in our case study. The result will not be a perfect measurement, but it will be a robust estimate: a way to illuminate the hydrological shadows of Pollença's countryside.
From the full dataset, to ensure that we are modelling houses with actual water use, we retain only those cases whose average annual consumption exceeds what would correspond to roughly two months of occupancy. Based on the dataset, this threshold is around 60 m³, which we use as our minimum cutoff.
# Total consumption per water meter (Idemser)
Consum_Idemser <- Consum_REF %>%
group_by(Idemser) %>%
summarise(CONSUMO_TOTAL = sum(CONSUMO, na.rm = TRUE), .groups = "drop")
hist(Consum_Idemser$CONSUMO_TOTAL, breaks = 30)
# Meters below the 60 m3 cutoff, to be excluded
Consum_REF_ERROR0 <- Consum_REF %>%
group_by(Idemser) %>%
summarise(CONSUMO_TOTAL = sum(CONSUMO, na.rm = TRUE), .groups = "drop") %>%
filter(CONSUMO_TOTAL < 60)
Consum_REF_ERROR0_cadena <- Consum_REF_ERROR0 %>%
pull(Idemser)
The previous histogram shows a highly skewed distribution of consumption, with some very extreme outliers. We treat them not as errors in the dataset but as realities: larger houses with pools do consume large amounts of water. For now, we keep them and will decide later whether to remove them.
We also create a new variable: season. This variable groups the months in which the tourism season is booming, which we expect to show higher consumption than the rest of the year. We also flag 2020 as the COVID year, as we expect that year to have behaved differently from the others. We do not yet know whether these variables will be relevant, but we believe they may influence the models we are going to build; as descriptive proxies, they may also reveal how the data behaves. The final water-consumption dataset looks like this:
## [1] "Barrio" "Idemser" "CONSUMO" "MESES" "ANY" "Calle"
## [7] "NumCalle" "Bloque" "Escalera" "Piso" "Puerta" "Resto"
## [13] "REFCAT"
What we observe is that seasonality has a clear and consistent impact on water consumption, regardless of the year. The temporal pattern repeats itself: the same months tend to produce similar peaks and troughs, suggesting that seasonal dynamics drive consumption in a way that is stable across time.
Looking at the COVID year (2020), we see that summer is indeed when less water was consumed, a trend that continued into early 2021, when COVID was still hitting and the season started more slowly. Moreover, overall consumption in 2020 was lower than in the previous year, indicating that COVID had an effect:
In summary, the data shows several clear trends: strong seasonality, a temporal pattern that repeats across years, and a distinct COVID effect in 2020.
Now that we have the water-consumption data for the houses, the next step is to merge it with the cadastral information. This linkage is essential, because our prediction relies on the idea that water use is shaped by housing typology: whether a dwelling has a pool, whether it is used for tourist rentals, the number of rooms it contains, the size of the parcel, and so on. With that logic in mind, we need to replicate—for the houses with water data—the same variables and structure that we built for the rural houses in Pollença.
A small reminder: we already processed the data for Siller, as it is in rural land, but the remaining cases are located on urban land. Since cadastral information is organised separately for urban and rural parcels, we need to rerun the entire preprocessing pipeline we used earlier: extracting the urban parcels, cleaning identifiers, joining attributes, and reconstructing the same dataset structure. Only then can we combine everything into a unified table that is ready for modelling.
##
## FALSE TRUE
## 1587 301
| Barrio | Idemser | CONSUMO | MESES | ANY | Calle | NumCalle | Bloque | Escalera | Piso | Puerta | Resto | REFCAT | season | covid | PI | AREA_PI | AREA_Plantes | MAX_Plantes | N_Plantes | Any | Buildings | SuperB | AREA_Parc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SILLER | S61875 | 16 | ENE.FEB | 2019 | PG POLIGON 9, 150 | 150 | NA | NA | NA | NA | NA | 07042A00900150 | 0 | 0 | 1 | 29.98 | 154.36 | 2 | 4 | 2002 | 1 | 315 | 3517 |
| SILLER | S61875 | 11 | MAR.ABR | 2019 | PG POLIGON 9, 150 | 150 | NA | NA | NA | NA | NA | 07042A00900150 | 0 | 0 | 1 | 29.98 | 154.36 | 2 | 4 | 2002 | 1 | 315 | 3517 |
| SILLER | S61875 | 50 | MAY.JUN | 2019 | PG POLIGON 9, 150 | 150 | NA | NA | NA | NA | NA | 07042A00900150 | 1 | 0 | 1 | 29.98 | 154.36 | 2 | 4 | 2002 | 1 | 315 | 3517 |
| SILLER | S61875 | 13 | JUL.AGO | 2019 | PG POLIGON 9, 150 | 150 | NA | NA | NA | NA | NA | 07042A00900150 | 1 | 0 | 1 | 29.98 | 154.36 | 2 | 4 | 2002 | 1 | 315 | 3517 |
| SILLER | S61875 | 12 | SEP.OCT | 2019 | PG POLIGON 9, 150 | 150 | NA | NA | NA | NA | NA | 07042A00900150 | 1 | 0 | 1 | 29.98 | 154.36 | 2 | 4 | 2002 | 1 | 315 | 3517 |
## # A tibble: 10 × 10
## REFCAT PI AREA_PI AREA_Plantes MAX_Plantes N_Plantes `31_pc` Any
## <chr> <dbl> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 1043901EE0114N 1 34.8 145. 2 4 104390… 2022
## 2 1043902EE0114S 0 0 0 NA 0 <NA> <NA>
## 3 1043903EE0114S 0 0 109. 2 5 104390… 1980
## 4 1043904EE0114S 0 0 59.1 3 3 104390… 1995
## 5 1043905EE0114S 1 13.4 51.2 2 2 104390… 1953
## 6 1043906EE0114S 0 0 76.6 2 2 104390… 1960
## 7 1043907EE0114S 0 0 64 2 3 104390… 1975
## 8 1043908EE0114S 1 4.5 75.9 2 1 104390… 1970
## 9 1043909EE0114S 0 0 74.2 3 3 104390… 1980
## 10 1043910EE0114S 0 0 80.0 2 3 104391… 1982
## # ℹ 2 more variables: Buildings <int>, SuperB <dbl>
| Barrio | Idemser | CONSUMO | MESES | ANY | Calle | NumCalle | Bloque | Escalera | Piso | Puerta | Resto | REFCAT | season | covid | PI | AREA_PI | AREA_Plantes | MAX_Plantes | N_Plantes | Any | Buildings | SuperB | AREA_Parc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GOTMAR | S62820 | 32 | ENE.FEB | 2019 | CL PINSA | NA | NA | NA | NA | NA | NA | 5870602EE0157S | 0 | 0 | 0 | 0 | 0 | NA | 0 | NA | NA | NA | 437 |
| GOTMAR | S62820 | 106 | MAR.ABR | 2019 | CL PINSA | NA | NA | NA | NA | NA | NA | 5870602EE0157S | 0 | 0 | 0 | 0 | 0 | NA | 0 | NA | NA | NA | 437 |
| GOTMAR | S62820 | 145 | MAY.JUN | 2019 | CL PINSA | NA | NA | NA | NA | NA | NA | 5870602EE0157S | 1 | 0 | 0 | 0 | 0 | NA | 0 | NA | NA | NA | 437 |
| GOTMAR | S62820 | 237 | JUL.AGO | 2019 | CL PINSA | NA | NA | NA | NA | NA | NA | 5870602EE0157S | 1 | 0 | 0 | 0 | 0 | NA | 0 | NA | NA | NA | 437 |
| GOTMAR | S62820 | 1081 | SEP.OCT | 2019 | CL PINSA | NA | NA | NA | NA | NA | NA | 5870602EE0157S | 1 | 0 | 0 | 0 | 0 | NA | 0 | NA | NA | NA | 437 |
## [1] "000901800EE01E" "002202600EE01H" "002206100EE01H" "07042A00900108"
## [5] "07042A00900142" "07042A00900244" "07042A00900276" "07042A00900310"
## [9] "07042A00900586"
## [1] "5870602EE0157S" "6367101EE0166N" "6553203EE0165N" "6553205EE0165N"
## [5] "6652601EE0165N" "6667927EE0166N"
## REFCAT n
## Length:143 Min. : 11.00
## Class :character 1st Qu.: 36.00
## Mode :character Median : 36.00
## Mean : 38.08
## 3rd Qu.: 36.00
## Max. :180.00
Now we only need to add the rental-status information and adjust all variables so they can be used properly in the model. This means converting each field to its correct format—whether as factors, integers, numeric values, or logical variables—so that the dataset is fully consistent and ready for the predictive analysis.
## [1] "Idemser" "CONSUMO" "season" "covid" "PI"
## [6] "AREA_PI" "AREA_Plantes" "MAX_Plantes" "Buildings" "SuperB"
## [11] "AREA_Parc" "Rental_is" "Rental_bed" "Any_constru" "Any_consum"
## [16] "Bimonth"
The first step is to create a train/test split of our dataset. Because our data is longitudinal, this split cannot be done by simply randomizing individual observations. If we did so, the same house could appear in both the training and test sets—just in different years or months—which would create data leakage. In that scenario, the model would indirectly “see” part of the house’s history during training, and then be evaluated on another part of the same house, leading to artificially inflated performance.
To properly evaluate the model’s ability to predict new houses—not just new time periods—the split must be done at the house level. In other words, all observations associated with a given house must be assigned either to the training set or to the test set, but never to both. This ensures a realistic assessment of how well the model generalizes to houses it has never encountered.
We use an 80/20 split, as we have a limited amount of data and need enough observations for the training set.
set.seed(10, sample.kind = "Rejection")
library(dplyr)
unique_cases <- unique(Consum_Ref_Complet$Idemser)
train_cases <- sample(unique_cases, size = floor(0.8 * length(unique_cases)))
consum_train <- Consum_Ref_Complet %>%
filter(Idemser %in% train_cases)
consum_test <- Consum_Ref_Complet %>%
filter(!Idemser %in% train_cases)
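A quick way to see why the house-level split prevents leakage is to verify that the resulting house sets are disjoint. This is a Python sketch with hypothetical house IDs, mirroring (not reproducing) the R split above:

```python
import random

# Hypothetical panel: 100 houses, 36 bimonthly observations each.
observations = [(house, period) for house in range(100) for period in range(36)]

random.seed(10)
houses = sorted({h for h, _ in observations})
train_houses = set(random.sample(houses, int(0.8 * len(houses))))

train = [obs for obs in observations if obs[0] in train_houses]
test = [obs for obs in observations if obs[0] not in train_houses]

# No house appears on both sides, so the model never "sees" a test house.
assert {h for h, _ in train}.isdisjoint({h for h, _ in test})
print(len(train), len(test))  # 2880 720
```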
Then we build functions to calculate the statistics for the training and test sets and to compare the performance of different models. We also create a comparison matrix to summarise the results.
rm_id <- function(df) {
if ("Idemser" %in% names(df)) dplyr::select(df, -Idemser) else df
}
r2_train_generic <- function(model, data, y, transform = c("identity","log1p")) {
transform <- match.arg(transform)
y_true <- data[[y]]
y_pred <- predict(model, newdata = rm_id(data))
if (transform == "log1p") {
y_true <- log1p(y_true) # log(y+1)
}
sse <- sum((y_true - y_pred)^2, na.rm = TRUE)
sst <- sum((y_true - mean(y_true, na.rm = TRUE))^2, na.rm = TRUE)
1 - sse/sst
}
osr2_generic <- function(model, train, test, y, transform = c("identity","log1p")) {
transform <- match.arg(transform)
y_tr <- train[[y]]
y_te <- test[[y]]
y_pred_te <- predict(model, newdata = rm_id(test))
if (transform == "log1p") {
y_tr <- log1p(y_tr)
y_te <- log1p(y_te)
}
sse <- sum((y_te - y_pred_te)^2, na.rm = TRUE)
denom <- sum((y_te - mean(y_tr, na.rm = TRUE))^2, na.rm = TRUE)
1 - sse/denom
}
rmse_train_generic <- function(model, data, y, transform = c("identity","log1p")) {
transform <- match.arg(transform)
y_true <- data[[y]]
y_pred <- predict(model, newdata = rm_id(data))
if (transform == "log1p") {
y_true <- log1p(y_true)
y_pred <- log1p(y_pred)
}
sqrt(mean((y_true - y_pred)^2, na.rm = TRUE))
}
rmse_test_generic <- function(model, test, y, transform = c("identity","log1p")) {
transform <- match.arg(transform)
y_true <- test[[y]]
y_pred <- predict(model, newdata = rm_id(test))
if (transform == "log1p") {
y_true <- log1p(y_true)
y_pred <- log1p(y_pred)
}
sqrt(mean((y_true - y_pred)^2, na.rm = TRUE))
}
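One detail of the out-of-sample R² helper is worth spelling out: its denominator baselines the test-set errors against the training mean, not the test mean. A small numeric check of that formula (a Python sketch with invented values, not the project's data):

```python
import numpy as np

def osr2(y_train, y_test, y_pred_test):
    """Out-of-sample R^2, baselined against the training-set mean."""
    sse = np.sum((y_test - y_pred_test) ** 2)
    sst = np.sum((y_test - np.mean(y_train)) ** 2)
    return 1 - sse / sst

y_train = np.array([10.0, 20.0, 30.0, 40.0])  # training mean = 25
y_test = np.array([20.0, 30.0])

perfect = osr2(y_train, y_test, y_test.copy())   # perfect predictions
naive = osr2(y_train, y_test, np.full(2, 25.0))  # always predict the train mean
print(perfect, naive)  # 1.0 0.0
```

A model that only learns the training mean scores exactly 0; negative values mean it does worse than that baseline on unseen houses.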
results <- data.frame(
name = rep(NA, 10),
r2 = rep(NA, 10),
rmse = rep(NA, 10),
osr2 = rep(NA, 10),
rmse_out = rep(NA, 10)
)
results
## name r2 rmse osr2 rmse_out
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## 7 NA NA NA NA NA
## 8 NA NA NA NA NA
## 9 NA NA NA NA NA
## 10 NA NA NA NA NA
We begin with a linear regression using a naïve approach: we include all variables and observe how the model behaves. As expected, the model produces counter-intuitive coefficients and a low R². This is consistent with what we have already observed in the consumption dataset: the relationship between predictors and water use is not linear, and the data does not behave in a way that linear models can capture effectively.
Here we can observe the non-linearity and the correlation between the variables (in some cases high, which might influence the model):
library(corrplot)  # correlation matrix visualisation
num_vars <- consum_train %>%
select(where(is.numeric))
cor_mat <- cor(num_vars, use = "pairwise.complete.obs", method = "pearson")
corrplot(cor_mat, method = "color", type = "upper",
tl.cex = 0.7, tl.col = "black", addCoef.col = "black")
plot(consum_train$AREA_Plantes,consum_train$CONSUMO)
plot(consum_train$Any_consum,consum_train$CONSUMO)
plot(consum_train$Bimonth,consum_train$CONSUMO)
plot(consum_train$AREA_PI,consum_train$CONSUMO)
plot(consum_train$SuperB,consum_train$CONSUMO)
plot(consum_train$Rental_bed,consum_train$CONSUMO)
plot(as_factor(consum_train$season),consum_train$CONSUMO)
plot(consum_train$Any_constru,consum_train$CONSUMO)
Expecting a poor result, we nonetheless run the regression:
lm_1 <- lm(data=consum_train, CONSUMO~season+
covid+
PI+
AREA_PI+
SuperB+
Buildings+
AREA_Plantes+
AREA_Parc+
MAX_Plantes+
Rental_is+
Rental_bed+
Bimonth+
Any_consum+
Any_constru)
stargazer::stargazer(lm_1, type = "html", title = "Linear Model (Naive)")
| Dependent variable: | |
| CONSUMO | |
| season | 57.146*** |
| (2.483) | |
| covid | -12.487*** |
| (3.223) | |
| PI | -21.440*** |
| (4.467) | |
| AREA_PI | 0.699*** |
| (0.092) | |
| SuperB | -0.089*** |
| (0.012) | |
| Buildings | -1.367* |
| (0.723) | |
| AREA_Plantes | 0.197*** |
| (0.017) | |
| AREA_Parc | -0.001** |
| (0.0003) | |
| MAX_Plantes | 14.415*** |
| (2.683) | |
| Rental_is | 0.797 |
| (11.284) | |
| Rental_bed | -1.556 |
| (1.607) | |
| Bimonth | -0.027 |
| (0.726) | |
| Any_consum | -1.334* |
| (0.747) | |
| Any_constru | 0.748*** |
| (0.093) | |
| Constant | 1,195.974 |
| (1,517.459) | |
| Observations | 4,380 |
| R2 | 0.185 |
| Adjusted R2 | 0.183 |
| Residual Std. Error | 78.440 (df = 4365) |
| F Statistic | 70.993*** (df = 14; 4365) |
| Note: | p<0.1; p<0.05; p<0.01 |
What we observe is that adapting the variables and transforming the dependent variable using a logarithmic scale (given its skewed distribution) improves the model’s stability but does not meaningfully improve its performance. The regression still fails to capture the underlying patterns in the data, and the overall results remain weak:
lm_2 <- lm(log(CONSUMO+1) ~
as.factor(season)+
as.factor(covid)+
PI+
AREA_PI+
AREA_Parc+
SuperB+
AREA_Plantes+
MAX_Plantes+
Buildings+
as.factor(Rental_is)+
as.factor(Any_consum)+
Any_constru, data = consum_train)
stargazer::stargazer(lm_2, type = "html", title = "Linear Model (log transf.)")
| Dependent variable: | |
| log(CONSUMO + 1) | |
| as.factor(season)1 | 1.300*** |
| (0.041) | |
| as.factor(covid)1 | -0.003 |
| (0.083) | |
| PI | -0.166** |
| (0.076) | |
| AREA_PI | 0.008*** |
| (0.002) | |
| AREA_Parc | -0.00003*** |
| (0.00001) | |
| SuperB | -0.002*** |
| (0.0002) | |
| AREA_Plantes | 0.004*** |
| (0.0003) | |
| MAX_Plantes | 0.362*** |
| (0.045) | |
| Buildings | -0.012 |
| (0.012) | |
| as.factor(Rental_is)1 | -0.133** |
| (0.058) | |
| as.factor(Any_consum)2020 | -0.387*** |
| (0.099) | |
| as.factor(Any_consum)2021 | -0.370*** |
| (0.075) | |
| as.factor(Any_consum)2022 | -0.170** |
| (0.070) | |
| as.factor(Any_consum)2023 | -0.133* |
| (0.070) | |
| as.factor(Any_consum)2024 | -0.001 |
| (0.070) | |
| Any_constru | 0.010*** |
| (0.002) | |
| Constant | -17.955*** |
| (3.148) | |
| Observations | 4,380 |
| R2 | 0.258 |
| Adjusted R2 | 0.255 |
| Residual Std. Error | 1.331 (df = 4363) |
| F Statistic | 94.737*** (df = 16; 4363) |
| Note: | p<0.1; p<0.05; p<0.01 |
For this reason, the statistics suggest that it is better to look for models that can capture non-linearity:
| name | r2 | rmse | osr2 | rmse_out |
|---|---|---|---|---|
| Linear Regression (Naive) | 0.1854685 | 78.305866 | 0.1418402 | 77.293666 |
| Linear Regression (log) | 0.2578410 | 2.312389 | 0.1994494 | 2.230438 |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
The first non-linear model we introduce is a Random Forest, a tree-based ensemble method capable of capturing complex, non-additive relationships between predictors and water consumption. Random Forests are particularly useful in our context because they do not require linearity, they handle interactions automatically, and they are robust to skewed distributions and outliers. Given the clear non-linear behaviour observed in the consumption data, this method provides a more appropriate starting point for modelling than traditional linear regression.
For the Random Forest, we first adapt our data to fit the model:
# TRAIN
consum_train <- consum_train %>%
mutate(
season = factor(season),
covid = factor(covid),
PI = factor(PI),
Rental_is = factor(Rental_is),
Bimonth = factor(Bimonth),
Any_consum = as.numeric(as.character(Any_consum)),
Any_constru= as.numeric(as.character(Any_constru))
)
# TEST:
consum_test <- consum_test %>%
mutate(
season = factor(season, levels = levels(consum_train$season)),
covid = factor(covid, levels = levels(consum_train$covid)),
PI = factor(PI, levels = levels(consum_train$PI)),
Rental_is = factor(Rental_is, levels = levels(consum_train$Rental_is)),
Bimonth = factor(Bimonth, levels = levels(consum_train$Bimonth)),
Any_consum = as.numeric(as.character(Any_consum)),
Any_constru= as.numeric(as.character(Any_constru))
)
# TRAIN
consum_train2 <- consum_train %>%
mutate(
season = factor(season),
covid = factor(covid),
PI = factor(PI),
Rental_is = factor(Rental_is),
Bimonth = factor(Bimonth),
Any_consum = as.numeric(as.character(Any_consum)),
Any_constru= as.numeric(as.character(Any_constru))
) %>%
select( -Idemser)
# TEST:
consum_test2 <- consum_test %>%
mutate(
season = factor(season, levels = levels(consum_train2$season)),
covid = factor(covid, levels = levels(consum_train2$covid)),
PI = factor(PI, levels = levels(consum_train2$PI)),
Rental_is = factor(Rental_is, levels = levels(consum_train2$Rental_is)),
Bimonth = factor(Bimonth, levels = levels(consum_train2$Bimonth)),
Any_consum = as.numeric(as.character(Any_consum)),
Any_constru= as.numeric(as.character(Any_constru))
) %>%
select( -Idemser)
Once we have done this, we continue by running cross-validation to find the best model. We run two cross-validations to determine mtry, one with 500 trees and one with 1,000 trees.
library(caret)         # provides train() and trainControl()
library(randomForest)  # backend for method = "rf"
set.seed(10, sample.kind = "Rejection")
rf.params = expand.grid(
mtry=seq(2, 12, 1)
)
rf.cv = train(
y = consum_train2$CONSUMO,
x = subset(consum_train2, select = -c(CONSUMO)),
method = "rf",
ntree = 1000,
trControl=trainControl(method="cv", number=5),
tuneGrid=rf.params
)
rf.cv2 = train(
y = consum_train2$CONSUMO,
x = subset(consum_train2, select = -c(CONSUMO)),
method = "rf",
ntree = 500,
trControl=trainControl(method="cv", number=5),
tuneGrid=rf.params
)
rf.cv
## Random Forest
##
## 4380 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3504, 3504, 3503, 3504, 3505
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 63.67785 0.4757492 32.51130
## 3 62.30715 0.4863784 30.96505
## 4 62.27954 0.4865058 30.80385
## 5 62.51995 0.4833830 30.87135
## 6 62.54897 0.4836419 30.93448
## 7 63.02192 0.4770948 31.10582
## 8 63.06952 0.4768367 31.17572
## 9 63.11061 0.4765660 31.23137
## 10 63.18176 0.4760250 31.33210
## 11 63.26988 0.4748759 31.41810
## 12 63.59930 0.4708025 31.56948
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
rf.cv2
## Random Forest
##
## 4380 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3504, 3504, 3504, 3504, 3504
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 63.17794 0.4831725 32.22701
## 3 61.65024 0.4957907 30.56841
## 4 61.74894 0.4937977 30.39507
## 5 61.83573 0.4932712 30.36770
## 6 62.03169 0.4907224 30.44677
## 7 62.24010 0.4883876 30.64051
## 8 62.30726 0.4879025 30.70963
## 9 62.49027 0.4854993 30.82750
## 10 62.72172 0.4825507 30.92220
## 11 62.69594 0.4834006 31.00261
## 12 63.05113 0.4785861 31.12078
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
From there we build two models: one with 500 trees (mtry = 3) and another with 1,000 trees (mtry = 4):
set.seed(10,sample.kind = "Rejection")
rf_best <- randomForest(
CONSUMO ~ covid + AREA_PI +
SuperB + AREA_Plantes + AREA_Parc + MAX_Plantes +
Bimonth + Any_constru + Any_consum,
data = consum_train2,
mtry = 3,
ntree = 500,
na.action = na.roughfix,
importance = TRUE
)
set.seed(10,sample.kind = "Rejection")
rf_best2 <- randomForest(
CONSUMO ~ covid + AREA_PI +
SuperB + AREA_Plantes + AREA_Parc + MAX_Plantes +
Rental_bed + Bimonth + Any_constru + Any_consum,
data = consum_train2,
mtry = 4,
ntree = 1000,
na.action = na.roughfix,
importance = TRUE
)
To avoid bias and ensure a realistic evaluation of the model, we apply stratified sampling3 at the house level when training the Random Forest.
We work with longitudinal data, where each house (subject) contributes a block of repeated measurements. Since our objective is ultimately to predict consumption at the house level, we want the model to learn stable, house-specific patterns and not be dominated by random fluctuations within a particular house.
However, standard Random Forests perform observation-level bootstrap sampling, which can cause some bootstrap samples to include many observations from certain houses and very few (or none) from others. This imbalance increases noise and reduces stability, because the model may fail to capture the full longitudinal behavior of under-sampled houses.
To address this, we perform stratified sampling by house ID, which forces every tree to receive observations from all houses in proportion to the original data. Conceptually, this moves the sampling strategy closer to subject-level bootstrapping, where houses—not individual observations—are treated as the fundamental sampling units. This reduces within-house noise and helps preserve the structure of longitudinal trajectories.
At the same time, stratifying by house reduces the randomness between trees, so tree diversity decreases. To compensate for this and retain some variability in the forest, we use a sampling fraction of 0.8. This ensures that, although all houses are represented in each tree, their repeated measurements are still randomly subsampled, maintaining enough variation to avoid overfitting while preserving the subject-level structure.
In summary, stratifying by house allows each tree to learn from complete and balanced information about every house, making the model more appropriate for longitudinal prediction, while still retaining diversity through the sampling fraction.
set.seed(10,sample.kind = "Rejection")
rf <- ranger(
formula = CONSUMO ~ covid + AREA_PI +
SuperB + AREA_Plantes + AREA_Parc + MAX_Plantes +
Bimonth + Any_constru + Any_consum,
data = consum_train,
num.trees = 500,
mtry = 3,
min.node.size = 10,
replace = TRUE,
sample.fraction = 0.8,
case.weights = NULL,
holdout = FALSE,
importance = "impurity",
respect.unordered.factors = TRUE,
sample.stratify = consum_train$Idemser
)
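The same house-stratified bootstrap can also be constructed explicitly and passed to `ranger` through its `inbag` argument (a list of per-tree in-bag counts). A minimal base-R sketch of building such counts, using hypothetical house IDs and, for simplicity, subsampling without replacement:

```r
# Build per-tree in-bag count vectors stratified by house ID:
# every tree receives 80% of each house's rows, so no house is ever left out.
stratified_inbag <- function(ids, num_trees, frac = 0.8) {
  lapply(seq_len(num_trees), function(t) {
    inbag <- integer(length(ids))
    for (h in unique(ids)) {
      rows <- which(ids == h)
      take <- rows[sample.int(length(rows), max(1, floor(frac * length(rows))))]
      inbag[take] <- 1L  # each sampled row used once in this tree
    }
    inbag
  })
}

set.seed(10)
ids <- rep(c("A", "B", "C"), times = c(10, 6, 4))  # toy house IDs, not Idemser
inbag_list <- stratified_inbag(ids, num_trees = 5)
# Every tree contains rows from every house:
all(sapply(inbag_list, function(v) all(tapply(v, ids, sum) > 0)))  # TRUE
```

The resulting list could be supplied as `inbag = inbag_list` in a `ranger()` call with a matching `num.trees`.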
Analyzing the variables with the highest importance scores, we see that the most influential predictors are those related to the size and characteristics of the house, the presence of a pool, and the time of the year (bimonthly period). This aligns with our expectations: larger houses, properties with pools, and months associated with tourism and higher occupancy all exert a strong influence on water consumption. At first glance, it might seem surprising that the rental variable shows relatively low importance. However, this is not unexpected in our context. The rental data we use comes from the government registry, which is known to underestimate the real volume of tourist rentals. As a result, the model receives a noisy and incomplete signal for rental activity, reducing its ability to learn its true effect on consumption:
importance_df <- as.data.frame(rf_best$importance)
importance_df$Feature <- rownames(importance_df)
importance_sorted <- importance_df[order(-importance_df$IncNodePurity), ]
ggplot(importance_sorted, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
geom_bar(stat = "identity", fill = "steelblue") + # Bar plot with custom color
coord_flip() + # Flip to make horizontal bars
labs(x = "Features", y = "Importance Score", title = "Feature Importance (Random Forest)") +
theme_classic()
Now we compare the performance statistics across all these models:
| name | r2 | rmse | osr2 | rmse_out |
|---|---|---|---|---|
| Linear Regression (Naive) | 0.1854685 | 78.305866 | 0.1418402 | 77.293666 |
| Linear Regression (log) | 0.2578410 | 2.312389 | 0.1994494 | 2.230438 |
| Random Forest (mtry=3, ntree=500) | 0.7655280 | 42.013207 | 0.2549841 | 72.018257 |
| Random Forest (mtry=4, ntree=1000) | 0.7803432 | 40.664243 | 0.2578987 | 71.877246 |
| Random Forest (stratified) | 0.6778578 | 49.245231 | 0.2490435 | 71.493031 |
What we observe is that we are still far from the performance needed for reliable prediction: no OSR² exceeds 0.25–0.27, which indicates low out-of-sample predictability. We also see that stratifying does not help.
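For reference, the in-sample R² and out-of-sample OSR² in the table follow the usual definitions, where OSR² benchmarks test-set predictions against the training-set mean (the same computation used later for the GBM). A minimal base-R sketch with toy vectors, not our consumption data:

```r
# OSR^2 = 1 - SSE_test / SST_test, with SST computed against the TRAIN mean,
# so a model scores negative if it predicts worse than that naive baseline.
osr2 <- function(y_test, pred_test, y_train) {
  1 - sum((y_test - pred_test)^2) / sum((y_test - mean(y_train))^2)
}
r2 <- function(y, pred) 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

# Toy example (hypothetical numbers):
y_tr <- c(10, 20, 30, 40)
y_te <- c(15, 25, 35)
pred <- c(14, 26, 33)
round(osr2(y_te, pred, y_tr), 3)  # 0.97
```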
An alternative approach is to shift from using the full base of longitudinal observations to using seasonal aggregates of water consumption. By summarising consumption at the seasonal level, we reduce the number of observations and therefore the longitudinal complexity of the dataset, while still preserving the core temporal dynamics that matter for prediction. This has several advantages. First, seasonal data smooths out many of the zero-consumption entries that appear in individual bimonthly periods, which often introduce noise and instability in the model. Second, aggregating reduces the influence of occasional measurement errors or missing values that disproportionately affect the row-level dataset. As a result, the model works with cleaner, more representative signals of household behaviour.
Consum_Ref_Complet_season <- Consum_Ref_Complet %>%
group_by(Idemser, Any_consum, season) %>%
summarise(
CONSUMO = sum(CONSUMO),
SuperB = first(SuperB),
AREA_PI = first(AREA_PI),
AREA_Plantes = first(AREA_Plantes),
AREA_Parc = first(AREA_Parc),
MAX_Plantes = first(MAX_Plantes),
Any_constru = first(Any_constru),
covid = first(covid),
Rental_bed = first(Rental_bed)
)
set.seed(10, sample.kind = "Rejection")
unique_cases_season <- unique(Consum_Ref_Complet_season$Idemser)
train_cases_season <- sample(unique_cases_season,
size = floor(0.8 * length(unique_cases_season)))
consum_train_season <- Consum_Ref_Complet_season %>%
filter(Idemser %in% train_cases_season)
consum_test_season <- Consum_Ref_Complet_season %>%
filter(!Idemser %in% train_cases_season)
consum_train_season_id <- consum_train_season %>%
mutate(
season = factor(season),
covid = factor(covid),
Any_consum = as.numeric(as.character(Any_consum)),
Any_constru= as.numeric(as.character(Any_constru))) %>%
ungroup()
consum_test_season_id <- consum_test_season %>%
mutate(
season = factor(season, levels = levels(consum_train2$season)),
covid = factor(covid, levels = levels(consum_train2$covid)),
Any_consum = as.numeric(as.character(Any_consum)),
Any_constru= as.numeric(as.character(Any_constru))) %>%
ungroup()
consum_train_season <- consum_train_season_id %>%
select(-Idemser)
consum_test_season <- consum_test_season_id %>%
select(-Idemser)
set.seed(10, sample.kind = "Rejection")
rf.params = expand.grid(
mtry=seq(2, 12, 1)
)
folds <- groupKFold(consum_train_season_id$Idemser, k = 5)
rf.cv = train(
y = consum_train_season_id$CONSUMO,
x = subset(consum_train_season_id, select = -c(CONSUMO, Idemser)),
method = "rf",
ntree = 500,
trControl=trainControl(method="cv", number=5, index = folds),
tuneGrid=rf.params
)
rf.cv
## Random Forest
##
## 1456 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 318, 226, 306, 330, 284
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 221.2472 0.1385404 129.6742
## 3 219.9208 0.1442473 128.4204
## 4 220.8301 0.1401384 129.0043
## 5 221.2201 0.1388178 129.6195
## 6 221.7516 0.1373542 130.6938
## 7 223.2052 0.1330314 132.1619
## 8 224.1601 0.1295247 133.0675
## 9 225.5859 0.1260610 134.2985
## 10 226.8968 0.1222347 135.6166
## 11 226.5350 0.1236820 135.4160
## 12 226.5755 0.1239163 135.6808
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
And here we run the model using the seasonal dataset and examine the resulting feature importance. The ranking of variables closely resembles the one obtained from the bimonthly observations, suggesting that the key drivers of consumption remain consistent even after aggregation.
set.seed(10)
rf.final <- randomForest(
CONSUMO ~ .,
data = consum_train_season,
ntree = 500,
mtry = 3
)
rf.final
##
## Call:
## randomForest(formula = CONSUMO ~ ., data = consum_train_season, ntree = 500, mtry = 3)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 22433.33
## % Var explained: 54.32
importance_df_final <- as.data.frame(rf.final$importance)
importance_df_final$Feature <- rownames(importance_df_final)
importance_sorted_final <- importance_df_final[order(-importance_df_final$IncNodePurity), ]
ggplot(importance_sorted_final, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
geom_bar(stat = "identity", fill = "steelblue") + # Bar plot with custom color
coord_flip() + # Flip to make horizontal bars
labs(x = "Features", y = "Importance Score", title = "Feature Importance (Random Forest)") +
theme_classic()
Now we run the statistics:
| name | r2 | rmse | osr2 | rmse_out |
|---|---|---|---|---|
| Linear Regression (Naive) | 0.1854685 | 78.305866 | 0.1418402 | 77.293666 |
| Linear Regression (log) | 0.2578410 | 2.312389 | 0.1994494 | 2.230438 |
| Random Forest (mtry=3, ntree=500) | 0.7655280 | 42.013207 | 0.2549841 | 72.018257 |
| Random Forest (mtry=4, ntree=1000) | 0.7803432 | 40.664243 | 0.2578987 | 71.877246 |
| Random Forest (stratified) | 0.6778578 | 49.245231 | 0.2490435 | 71.493031 |
| Random Forest (season) | 0.8038174 | 98.153565 | 0.4438662 | 132.805355 |
With the seasonal model the OSR² reaches roughly 0.44. Still not an extraordinary value, but the best so far.
Before ending, we also wanted to test whether Boosted Trees could improve on what we achieved with Random Forest. Here is the full code used to build those models:
set.seed(10, sample.kind = "Rejection")
grid <- expand.grid(
n.trees = c(100,500,1000,3000, 5000),
interaction.depth = c(1,2, 3, 4, 5,7,9,10),
shrinkage = c(0.1),
n.minobsinnode = c(10)
)
gbm.cv <- train(y = consum_train2$CONSUMO,
x = subset(consum_train2, select=-c(CONSUMO)),
verbose=FALSE,
method= "gbm",
tuneGrid= grid,
trControl= trainControl(method="cv", number= 5, savePredictions = "final")
)
#gbm.cv$results[, c("interaction.depth", "n.trees", "Rsquared")]
gbm_plot <- gbm.cv$results %>% mutate(Rsquared = round(Rsquared, 4))
ggplot(gbm_plot, aes(x = as.factor(n.trees), y = as.factor(interaction.depth), fill = Rsquared)) +
geom_tile() +
geom_text(aes(label = Rsquared), color = "white", size = 4) +
scale_fill_gradient(low = "blue", high = "red") +
labs(x = "Number of Trees (n.trees)", y = "Interaction Depth", fill = "R²") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
set.seed(10,sample.kind = "Rejection")
opt_boost = gbm(data=consum_train2, CONSUMO ~ season + covid + AREA_PI +
SuperB + Buildings + AREA_Plantes + AREA_Parc + MAX_Plantes +
Rental_bed + Bimonth + Any_constru + Any_consum, distribution = "gaussian",
n.trees = 1000, interaction.depth = 9,
shrinkage = 0.1, n.minobsinnode = 10)
pred_tr_gbm <- predict(opt_boost, newdata = consum_train2, n.trees = 1000, type = "response")
pred_te_gbm <- predict(opt_boost, newdata = consum_test2, n.trees = 1000, type = "response")
y_tr <- consum_train2$CONSUMO
y_te <- consum_test2$CONSUMO
ok <- is.finite(pred_te_gbm) & is.finite(y_te)
r2_gbm <- 1 - sum((y_tr - pred_tr_gbm)^2, na.rm = TRUE) /
sum((y_tr - mean(y_tr, na.rm = TRUE))^2, na.rm = TRUE)
osr2_gbm <- 1 - sum((y_te[ok] - pred_te_gbm[ok])^2) /
sum((y_te[ok] - mean(y_tr, na.rm = TRUE))^2)
rmse_gbm_train <- sqrt(mean((y_tr - pred_tr_gbm)^2, na.rm = TRUE))
rmse_gbm_test <- sqrt(mean((y_te[ok] - pred_te_gbm[ok])^2, na.rm = TRUE))
results[7, ] <- data.frame(
name = 'GBM (month)',
r2 = r2_gbm,
rmse_train = rmse_gbm_train,
osr2 = osr2_gbm,
rmse_test = rmse_gbm_test
)
set.seed(10, sample.kind = "Rejection")
grid_season <- expand.grid(
n.trees = c(100,500,1000,3000, 5000),
interaction.depth = c(3, 4, 5,7,9,10),
shrinkage = c(0.1),
n.minobsinnode = c(10)
)
gbm.cv_season <- train(y = consum_train_season_id$CONSUMO,
x = subset(consum_train_season_id, select=-c(CONSUMO, Idemser)),
verbose=FALSE,
method= "gbm",
tuneGrid= grid_season,
trControl= trainControl(method="cv", number= 5, index = folds, savePredictions = "final")
)
gbm.cv_season$results[, c("shrinkage", "interaction.depth", "n.trees", "Rsquared")]
## shrinkage interaction.depth n.trees Rsquared
## 1 0.1 3 100 0.06894525
## 6 0.1 4 100 0.06483845
## 11 0.1 5 100 0.05414223
## 16 0.1 7 100 0.06977046
## 21 0.1 9 100 0.06074108
## 26 0.1 10 100 0.05448118
## 2 0.1 3 500 0.05700852
## 7 0.1 4 500 0.04793739
## 12 0.1 5 500 0.03880238
## 17 0.1 7 500 0.07920235
## 22 0.1 9 500 0.08700046
## 27 0.1 10 500 0.05996333
## 3 0.1 3 1000 0.03616863
## 8 0.1 4 1000 0.05809100
## 13 0.1 5 1000 0.04323289
## 18 0.1 7 1000 0.08377307
## 23 0.1 9 1000 0.07234957
## 28 0.1 10 1000 0.06451369
## 4 0.1 3 3000 0.02550142
## 9 0.1 4 3000 0.06228877
## 14 0.1 5 3000 0.04982732
## 19 0.1 7 3000 0.07670388
## 24 0.1 9 3000 0.08688015
## 29 0.1 10 3000 0.08119689
## 5 0.1 3 5000 0.02606177
## 10 0.1 4 5000 0.05856836
## 15 0.1 5 5000 0.05579936
## 20 0.1 7 5000 0.07888999
## 25 0.1 9 5000 0.08315281
## 30 0.1 10 5000 0.08152309
gbm.cv_season$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 17 500 7 0.1 10
gbm_plot_season <- gbm.cv_season$results %>% mutate(Rsquared = round(Rsquared, 4)) # Round Rsquared to 4 decimal places
# Create the heatmap
ggplot(gbm_plot_season, aes(x = as.factor(n.trees), y = as.factor(interaction.depth), fill = Rsquared)) +
geom_tile() +
geom_text(aes(label = Rsquared), color = "white", size = 4) +
scale_fill_gradient(low = "blue", high = "red") +
labs(x = "Number of Trees (n.trees)", y = "Interaction Depth", fill = "R²") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
| name | r2 | rmse | osr2 | rmse_out |
|---|---|---|---|---|
| Linear Regression (Naive) | 0.1854685 | 78.305866 | 0.1418402 | 77.293666 |
| Linear Regression (log) | 0.2578410 | 2.312389 | 0.1994494 | 2.230438 |
| Random Forest (mtry=3, ntree=500) | 0.7655280 | 42.013207 | 0.2549841 | 72.018257 |
| Random Forest (mtry=4, ntree=1000) | 0.7803432 | 40.664243 | 0.2578987 | 71.877246 |
| Random Forest (stratified) | 0.6778578 | 49.245231 | 0.2490435 | 71.493031 |
| Random Forest (season) | 0.8038174 | 98.153565 | 0.4438662 | 132.805355 |
| GBM (month) | 0.8152655 | 37.291889 | 0.1768161 | 75.702159 |
| GBM (season) | 0.7220038 | 116.841008 | 0.3985741 | 138.107416 |
Here we observe that our OSR² is lower than that of the Random Forest based on seasons, so we are going to use the latter to build our model.
Before using the model for prediction (Random Forest based on seasons), one precaution is needed. Our OSR² remains relatively low when we attempt to predict the consumption of every individual house for every individual year. To soften this turbulence, we shift our strategy from predicting isolated points to predicting groups of houses, where errors tend to compensate when aggregated. Once grouped, we work with the average consumption across years, which further stabilizes the signal by smoothing annual fluctuations. In practice, we average the predictions across all years and then consistently use aggregations of these house-level averages, which substantially reduces prediction error.
Ultimately, our interest is not in forecasting the precise behavior of a single house but in understanding patterns at the scale of areas, since policy interventions are designed and implemented at that collective level. This rationale will be developed further when we connect these predictions to the neighborhood clusters.
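The error-compensation argument can be illustrated with a quick simulation (synthetic numbers, purely illustrative): when independent house-level errors are averaged within groups, the group-level error shrinks roughly with the square root of the group size.

```r
set.seed(10)
n_houses <- 400
truth <- runif(n_houses, 50, 350)           # synthetic "true" annual consumption
pred  <- truth + rnorm(n_houses, sd = 60)   # noisy house-level predictions
group <- rep(1:20, each = 20)               # 20 hypothetical areas of 20 houses

rmse <- function(a, b) sqrt(mean((a - b)^2))
rmse_house <- rmse(pred, truth)
rmse_group <- rmse(tapply(pred, group, mean), tapply(truth, group, mean))
c(house = rmse_house, group = rmse_group)   # group-level error is markedly smaller
```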
From here, we prepare the dataset that includes all houses in Pollença in rural land, covering every available year and separating each year into two distinct seasons. This allows us to work with a complete temporal–seasonal panel of consumption for the entire housing stock that we need to predict.
buildings_complet_water <- buildings_complet_df %>%
mutate(SuperB=as.numeric(Cat_SuperB),
AREA_PI= as.numeric(Cat_AREA_P),
AREA_Plantes=as.numeric(Cat_AREA_1),
AREA_Parc=as.numeric(Parc_areaV),
MAX_Plantes=as.numeric(Cat_MAX_Pl),
Any_constru=as.numeric(Cat_Any),
Rental_bed=ifelse(is.na(TuriGOIB_P), 0, TuriGOIB_P)) %>%
filter(Munic_NOM == "Pollença") %>%
select(c("Cat_REFCAT","SuperB","AREA_PI","AREA_Plantes","AREA_Parc","MAX_Plantes","Any_constru","Rental_bed")) %>%
filter(!is.na(AREA_Plantes))
buildings_complet_water_long <- buildings_complet_water %>%
tidyr::crossing(
Any_consum = 2019:2024,
season = c(0, 1),
) %>%
mutate(covid=as.factor(ifelse(Any_consum==2020, 1, 0)),
season = as.factor(season))
Now we can make the predictions:
x_new <- buildings_complet_water_long %>% select(-Cat_REFCAT)
pred <- predict(rf.final, newdata = x_new)
buildings_complet_water_pred <- buildings_complet_water_long %>%
mutate(pred = pred)
To strengthen our confidence in the model, especially given the low OSR², we construct confidence intervals around the predictions. We do this using the full distribution of predictions produced by the Random Forest: for each observation we extract the 5th percentile of the tree-level predictions as the lower bound and the 95th percentile as the upper bound. For this we use a Quantile Random Forest, as implemented in the ranger package4. These two percentiles form an empirical 90% confidence interval around the predicted value, capturing the range within which the model expects consumption to fall once the uncertainty within the forest is accounted for.
rq_final <- ranger(
formula = CONSUMO ~ .,
data = consum_train_season,
num.trees = 500,
mtry = 3,
quantreg = TRUE)
pred_q <- predict(
rq_final,
data = x_new,
type = "quantiles",
quantiles = c(0.05, 0.5, 0.95)
)
low90 <- pred_q$predictions[, 1] # 5th percentile
med <- pred_q$predictions[, 2] # 50th percentile (median)
high90 <- pred_q$predictions[, 3] # 95th percentile
buildings_complet_water_pred <- buildings_complet_water_pred %>%
mutate(
pred_low90 = low90,
pred_med = med,
pred_high90 = high90
)
summary(buildings_complet_water_pred)
## Cat_REFCAT SuperB AREA_PI AREA_Plantes
## Length:30768 Min. : 17.0 Min. : 0.00 Min. : 10.15
## Class :character 1st Qu.: 108.0 1st Qu.: 0.00 1st Qu.: 113.43
## Mode :character Median : 164.0 Median : 34.27 Median : 160.85
## Mean : 204.6 Mean : 32.10 Mean : 201.90
## 3rd Qu.: 249.0 3rd Qu.: 49.35 3rd Qu.: 234.24
## Max. :5327.0 Max. :883.88 Max. :3405.61
## AREA_Parc MAX_Plantes Any_constru Rental_bed
## Min. : 44 Min. :1.000 Min. :1700 Min. : 0.000
## 1st Qu.: 2437 1st Qu.:1.000 1st Qu.:1980 1st Qu.: 0.000
## Median : 4538 Median :2.000 Median :1993 Median : 0.000
## Mean : 37854 Mean :1.619 Mean :1989 Mean : 1.812
## 3rd Qu.: 11515 3rd Qu.:2.000 3rd Qu.:2003 3rd Qu.: 4.000
## Max. :12719556 Max. :4.000 Max. :2024 Max. :22.000
## Any_consum season covid pred pred_low90
## Min. :2019 0:15384 0:25640 Min. : 6.306 Min. : 0.000
## 1st Qu.:2020 1:15384 1: 5128 1st Qu.: 76.660 1st Qu.: 0.000
## Median :2022 Median :117.092 Median : 0.000
## Mean :2022 Mean :146.536 Mean : 6.545
## 3rd Qu.:2023 3rd Qu.:213.896 3rd Qu.: 9.000
## Max. :2024 Max. :670.575 Max. :137.850
## pred_med pred_high90
## Min. : 0.00 Min. : 26.0
## 1st Qu.: 42.00 1st Qu.: 222.0
## Median : 74.00 Median : 370.1
## Mean : 98.49 Mean : 456.8
## 3rd Qu.:147.00 3rd Qu.: 667.0
## Max. :584.00 Max. :2020.0
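The interval construction itself is plain order statistics over the per-tree predictions. A base-R sketch with hypothetical tree-level outputs for a single house-season (synthetic numbers standing in for the 500 tree predictions):

```r
# Hypothetical predictions of 500 trees for one observation:
set.seed(10)
tree_preds <- rgamma(500, shape = 2, scale = 60)

# The 5th and 95th percentiles bound an empirical 90% interval;
# the 50th percentile is the median prediction.
bounds <- quantile(tree_preds, probs = c(0.05, 0.5, 0.95))
round(bounds, 1)
```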
Now we aggregate the predictions at the municipal and neighborhood levels and present the results.
We now have the predicted annual water-consumption averages for each house. The distribution shows two clear peaks: one around 125 m³ and another around 300 m³, suggesting two dominant consumption profiles in the rural-house system. The shape is generally healthy: the median and mean sit close together, hinting that extreme values are not excessively distorting the distribution. Altogether, it offers a coherent landscape from which to trace patterns of use, behaviour, and typology.
buildings_water_pred_anual <- buildings_complet_water_pred %>%
group_by(Cat_REFCAT, Any_consum) %>%
summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
sum_pred=sum(pred, na.rm = TRUE),
sum_pred_med=sum(pred_med, na.rm = TRUE),
sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>%
ungroup() %>%
group_by(Cat_REFCAT) %>%
summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
m_pred=mean(sum_pred, na.rm = TRUE),
m_pred_med=mean(sum_pred_med, na.rm = TRUE),
m_pred_high=mean(sum_pred_high, na.rm = TRUE))
ggplot(buildings_water_pred_anual, aes(x = m_pred)) +
geom_histogram(bins = 30, fill = "#4C72B0", color = "white", alpha = 0.7) +
geom_vline(
xintercept = mean(buildings_water_pred_anual$m_pred, na.rm = TRUE),
color = "red",
linewidth = 1) +
geom_vline(
xintercept = median(buildings_water_pred_anual$m_pred, na.rm = TRUE),
color = "red",
linewidth = 1,
linetype = "dashed") +
theme_classic() +
labs(
title = "Histogram of predicted annual consumption",
x = "Predicted consumption (annual mean)",
y = "Number of houses"
)
buildings_water_pred_turistic <- buildings_complet_water_pred %>%
mutate(rental_is=ifelse(Rental_bed==0,0,1)) %>%
group_by(Cat_REFCAT, Any_consum) %>%
summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
sum_pred=sum(pred, na.rm = TRUE),
sum_pred_med=sum(pred_med, na.rm = TRUE),
sum_pred_high=sum(pred_high90, na.rm = TRUE),
rental=first(rental_is)) %>%
ungroup() %>%
group_by(rental) %>%
summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
m_pred=mean(sum_pred, na.rm = TRUE),
m_pred_med=mean(sum_pred_med, na.rm = TRUE),
m_pred_high=mean(sum_pred_high, na.rm = TRUE))
buildings_water_pred_pool_med <- buildings_complet_water_pred %>%
mutate(pool=ifelse(AREA_PI==0,0,1)) %>%
group_by(Cat_REFCAT, Any_consum) %>%
summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
sum_pred=sum(pred, na.rm = TRUE),
sum_pred_med=sum(pred_med, na.rm = TRUE),
sum_pred_high=sum(pred_high90, na.rm = TRUE),
pool=first(pool)) %>%
ungroup() %>%
group_by(pool) %>%
summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
m_pred=mean(sum_pred, na.rm = TRUE),
m_pred_med=mean(sum_pred_med, na.rm = TRUE),
m_pred_high=mean(sum_pred_high, na.rm = TRUE))
# For the whole municipality
buildings_water_pred_anual_gl <- buildings_complet_water_pred %>%
group_by(Any_consum) %>%
summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
sum_pred=sum(pred, na.rm = TRUE),
sum_pred_med=sum(pred_med, na.rm = TRUE),
sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>%
ungroup() %>%
group_by() %>%
summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
m_pred=mean(sum_pred, na.rm = TRUE),
m_pred_med=mean(sum_pred_med, na.rm = TRUE),
m_pred_high=mean(sum_pred_high, na.rm = TRUE))
buildings_water_pred_anual_season <- buildings_complet_water_pred %>%
group_by(Any_consum, season) %>%
summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
sum_pred=sum(pred, na.rm = TRUE),
sum_pred_med=sum(pred_med, na.rm = TRUE),
sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>%
ungroup() %>%
group_by(season) %>%
summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
m_pred=mean(sum_pred, na.rm = TRUE),
m_pred_med=mean(sum_pred_med, na.rm = TRUE),
m_pred_high=mean(sum_pred_high, na.rm = TRUE))
buildings_water_pred_anual_pools <- buildings_complet_water_pred %>%
mutate(pool=ifelse(AREA_PI==0,0,1)) %>%
group_by(Any_consum, pool) %>%
summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
sum_pred=sum(pred, na.rm = TRUE),
sum_pred_med=sum(pred_med, na.rm = TRUE),
sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>%
ungroup() %>%
group_by(pool) %>%
summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
m_pred=mean(sum_pred, na.rm = TRUE),
m_pred_med=mean(sum_pred_med, na.rm = TRUE),
m_pred_high=mean(sum_pred_high, na.rm = TRUE))
buildings_water_pred_anual_par <- buildings_complet_water_pred %>%
mutate(parcela14=ifelse(AREA_Parc>14000,0,1)) %>%
group_by(Any_consum, parcela14) %>%
summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
sum_pred=sum(pred, na.rm = TRUE),
sum_pred_med=sum(pred_med, na.rm = TRUE),
sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>%
ungroup() %>%
group_by(parcela14) %>%
summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
m_pred=mean(sum_pred, na.rm = TRUE),
m_pred_med=mean(sum_pred_med, na.rm = TRUE),
m_pred_high=mean(sum_pred_high, na.rm = TRUE))
consum_mitan_Pollenca <- 1815000
Our final prediction estimates a total annual water consumption of 751,434.5 m³ for all houses located on rural land. The associated 90% confidence interval ranges from 33,564 to 2,342,317 m³, reflecting the heterogeneity and uncertainty intrinsic to these dispersed and informal dwellings, as well as our low OSR². Even so, we take the prediction as adequate5.
For comparison, official data from the GOIB indicate that the town's metered water consumption for the same years was 1,815,000 m³, covering the houses on urban land connected to the public provider. The predicted rural consumption is therefore equivalent to 41.4% of that metered urban figure, despite these houses being located outside the consolidated urban fabric where permanent residents live year-round. Put another way: once rural houses are counted, the town's total water consumption is 41.4% higher than the officially recorded figure.
This proportion is striking: close to a third of the municipality's combined water demand originates in landscapes that, officially, are not supposed to host intense residential or touristic activity. It underscores how deeply rural land has been transformed into a major water-consuming territory, and how the patterns of tourism-led informality ripple through the municipal infrastructure like a hidden river beneath the cadastral surface.
| m_pred_low | m_pred | m_pred_med | m_pred_high |
|---|---|---|---|
| 33564.14 | 751434.5 | 505069 | 2342317 |
## [1] "% over total of the municipality: 41.4%"
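The percentage above is plain arithmetic over the document's two totals:

```r
rural_pred    <- 751434.5  # predicted rural consumption (m^3, annual mean)
urban_metered <- 1815000   # GOIB figure for the urban public network (m^3)

share_vs_urban <- rural_pred / urban_metered                  # ratio to metered use
share_of_total <- rural_pred / (rural_pred + urban_metered)   # share of combined total
round(100 * share_vs_urban, 1)  # 41.4
round(100 * share_of_total, 1)  # 29.3
```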
But the real contrast emerges when we look at types of houses. The presence of a pool is the great hydraulic divide.
Houses with pools show an average annual consumption of 340 m³, summing up to 598,526.4 m³ in total. This means that this single group accounts for nearly 80% (79.65%) of all predicted water use on rural land. In other words, a minority of houses — the ones equipped for leisure, rental turnover, and high-comfort tourism — draw the overwhelming share of the aquifer.
By contrast, houses without pools consume far less: an average of 190 m³, which is only 56% of the average consumption of houses with pools. Their total predicted consumption is 152,908 m³, a modest fraction in comparison.
The landscape revealed by these numbers is stark. They signal a shift from rural residential logics to a tourism-driven, amenity-heavy occupation of land that presses disproportionately on municipal water systems.
| pool | m_pred_low | m_pred | m_pred_med | m_pred_high |
|---|---|---|---|---|
| 0 | 3843.142 | 152908.1 | 96298.5 | 527469.4 |
| 1 | 29721.000 | 598526.4 | 408770.5 | 1814847.9 |
| pool | m_pred_low | m_pred | m_pred_med | m_pred_high |
|---|---|---|---|---|
| 0 | 4.780027 | 190.1842 | 119.7743 | 656.0565 |
| 1 | 16.886932 | 340.0718 | 232.2560 | 1031.1636 |
A similar, though less dramatic, difference appears when we look at the tourism rentals data. However, we know that identifying these houses through official georeferenced records leads to a clear undercount: many rentals operate outside the formal registry, and the Government has not georeferenced all the licensed rentals. Thus, the official database we use, as stated several times, captures only a slice of the real market.
In this context, the presence of a pool becomes a far more reliable proxy for identifying tourism-oriented houses. Pools function as material signals of a rental-ready dwelling — an amenity strongly associated with short-term stays, comfort-driven tourism, and the economic logics that shape rural accommodation.
| rental | m_pred_low | m_pred | m_pred_med | m_pred_high |
|---|---|---|---|---|
| 0 | 11.88440 | 277.8137 | 185.7042 | 882.4118 |
| 1 | 16.31498 | 333.8599 | 227.1418 | 996.7578 |
Another way of detecting the imprint of tourism on rural water use is through seasonality. During the high season alone (essentially spring and summer), predicted consumption reaches 517,167 m³, which represents 68% of all annual water use by rural houses. When placed against the municipal figures, this seasonal spike alone accounts for about 29% of the town's recorded annual water consumption: a remarkable concentration of demand in just a few months, driven overwhelmingly by tourism-related occupation.
| season | m_pred_low | m_pred | m_pred_med | m_pred_high |
|---|---|---|---|---|
| 0 | 1687.35 | 234267.9 | 122332.6 | 816253.4 |
| 1 | 31876.79 | 517166.6 | 382736.4 | 1526064.0 |
It is also interesting to examine the consumption of all houses located on parcels below 14,000 m², which, as stated before, has been the legal minimum parcel size for building a single-family home in the town since the 1990s. This gives us a proxy for identifying potential informal constructions, as opposed to those that could have been legally built under current regulations. The houses that could not be built today represent 71.7% of the water consumption of the entire rural area and add a further 29.7% to the town's recorded consumption (similar to the pool figures, as these variables are correlated: small parcels with pools are the dominant house typology).
| parcela14 | m_pred_low | m_pred | m_pred_med | m_pred_high |
|---|---|---|---|---|
| 0 | 9057.833 | 212160.6 | 137697.6 | 652264.9 |
| 1 | 24506.308 | 539273.9 | 367371.4 | 1690052.4 |
Taken together, these predictions allow us — for the first time — to quantify the hidden hydraulic landscape of rural houses in Pollença. When compared to the municipality as a whole, rural houses — officially peripheral and often invisible in planning — represent over 40% of total water use, and almost a third during summer months alone. This, in a context of climate emergency, reflects a clear issue that needs to be addressed by policymakers and the community.
How can we show that rural houses are structurally car-dependent, and thus that their tourism is less sustainable and creates more congestion?
One of the central questions raised was whether the rural-land tourism model of our Houses that Do Not Exist produces far greater mobility needs, and therefore higher car dependency, than tourism rentals or hotels located in denser, better-served areas (that is, the town and the urban coastal zones). Our hypothesis is intuitive: dispersed houses on rural land require longer trips for almost every daily or leisure activity, while hotels and apartments in consolidated urban areas benefit from walkable distances and better transit access. But we need to demonstrate this with data.
How will we demonstrate it? To make the comparison as fair and as illustrative as possible, we propose a simple but powerful thought experiment combined with spatial analysis: we assume that every tourist travels to the nearest facility of each type (public transport stop, supermarket, restaurant, beach) and measure the network distance to it.
This is an idealized assumption (tourists may prefer a specific restaurant or supermarket), but assigning the closest option makes the comparison conservative: we measure the minimum possible travel needed.
To build the data we used the QNEAT3 plugin in QGIS together with the roads layer from the IDEIB (the Balearic government's spatial data infrastructure) to build a network, as we did to generate the distance matrix for our clustering. We then ran the "OD-Matrix from Points as Lines (m:n)" algorithm and obtained the following layers:
| name | description | type |
|---|---|---|
| publictrans_rural | Distances from all houses on rural land to all public transport stops or stations (m) | shp (line) |
| publictrans_urba | Distances from all houses in urban areas to all public transport stops or stations (m) | shp (line) |
| super_rural | Distances from all houses on rural land to all supermarkets (m) | shp (line) |
| super_urba | Distances from all houses in urban areas to all supermarkets (m) | shp (line) |
| rest_rural | Distances from all houses on rural land to all restaurants (m) | shp (line) |
| rest_urba | Distances from all houses in urban areas to all restaurants (m) | shp (line) |
| platges_rural | Distances from all houses on rural land that the government lists as tourism rentals to all official beach access points (m) | shp (line) |
| platges_urba | Distances from all hotels on urban land to all official beach access points (m) | shp (line) |
From here, we keep only the nearest destination for each house, that is, the minimum network distance per house.
# The nearest-distance filter is identical for all eight layers,
# so we define it once and apply it to each layer.
keep_nearest <- function(od) {
  od %>%
    filter(!is.na(total_cost)) %>%
    group_by(destinatio) %>%   # one row per house
    slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
    ungroup()
}
publictrans_rural <- keep_nearest(publictrans_rural)
publictrans_urba  <- keep_nearest(publictrans_urba)
super_rural       <- keep_nearest(super_rural)
super_urba        <- keep_nearest(super_urba)
rest_rural        <- keep_nearest(rest_rural)
rest_urba         <- keep_nearest(rest_urba)
platges_rural     <- keep_nearest(platges_rural)
platges_urba      <- keep_nearest(platges_urba)
Now we plot the distance distributions (density curves) for each activity, including both the rural and the urban cases. On each plot, we add reference lines indicating the approximate walkable-distance threshold and the point at which trips become car-dependent.
thresholds <- data.frame(
xintercept = c(800, 1500),
type = c("10-min. walk", "Car threshold")
)
publictrans_rural$area_type <- "Rural"
publictrans_urba$area_type <- "Urban"
publictrans_all <- bind_rows(publictrans_rural, publictrans_urba)
ggplot(publictrans_all, aes(x = total_cost, fill = area_type, color = area_type)) +
geom_density(alpha = 0.5, linewidth = 1) +
scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_x_continuous(limits = c(0, 5000)) +
geom_vline(
data = thresholds,
aes(xintercept = xintercept, linetype = type),
color = "grey50",
linewidth = 0.7,
inherit.aes = FALSE
) +
scale_linetype_manual(
name = NULL, # legend title (blank = cleaner)
values = c("10-min. walk" = "dotted",
"Car threshold" = "dashed")
) +
labs(
title = "Distance to Nearest Public Transport Station or Stop",
x = "Distance (meters)",
y = "Density",
fill = "Area type",
color = "Area type"
) +
theme_classic()
super_rural$area_type <- "Rural"
super_urba$area_type <- "Urban"
super_all <- bind_rows(super_rural, super_urba)
ggplot(super_all, aes(x = total_cost, fill = area_type, color = area_type)) +
geom_density(alpha = 0.5, linewidth = 1) +
scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_x_continuous(limits = c(0, 5000)) +
geom_vline(
data = thresholds,
aes(xintercept = xintercept, linetype = type),
color = "grey50",
linewidth = 0.7,
inherit.aes = FALSE
) +
scale_linetype_manual(
name = NULL, # legend title (blank = cleaner)
values = c("10-min. walk" = "dotted",
"Car threshold" = "dashed")
) +
labs(
title = "Distance to Nearest Groceries",
x = "Distance (meters)",
y = "Density",
fill = "Area type",
color = "Area type"
) +
theme_classic()
rest_rural$area_type <- "Rural"
rest_urba$area_type <- "Urban"
rest_all <- bind_rows(rest_rural, rest_urba)
ggplot(rest_all, aes(x = total_cost, fill = area_type, color = area_type)) +
geom_density(alpha = 0.5, linewidth = 1) +
scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_x_continuous(limits = c(0, 5000)) +
geom_vline(
data = thresholds,
aes(xintercept = xintercept, linetype = type),
color = "grey50",
linewidth = 0.7,
inherit.aes = FALSE
) +
scale_linetype_manual(
name = NULL,
values = c("10-min. walk" = "dotted",
"Car threshold" = "dashed")
) +
labs(
title = "Distance to Nearest Restaurant",
x = "Distance (meters)",
y = "Density",
fill = "Area type",
color = "Area type"
) +
theme_classic()
platges_rural$area_type <- "Rural"
platges_urba$area_type <- "Urban"
platges_urba$destinatio <- as.character(platges_urba$destinatio)
platges_all <- bind_rows(platges_rural, platges_urba)
ggplot(platges_all, aes(x = total_cost, fill = area_type, color = area_type)) +
geom_density(alpha = 0.5, linewidth = 1) +
scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
scale_x_continuous(limits = c(0, 5000)) +
geom_vline(
data = thresholds,
aes(xintercept = xintercept, linetype = type),
color = "grey50",
linewidth = 0.7,
inherit.aes = FALSE
) +
scale_linetype_manual(
name = NULL, # legend title (blank = cleaner)
values = c("10-min. walk" = "dotted",
"Car threshold" = "dashed")
) +
labs(
title = "Distance to Nearest Beach (from hotel or rental)",
x = "Distance (meters)",
y = "Density",
fill = "Area type",
color = "Area type"
) +
theme_classic()
We can also summarize in a table how many houses exceed the car-dependency threshold (1,500 m), distinguishing between urban and rural areas for every activity:
platges_all <- platges_all %>%
mutate(threshold=ifelse(total_cost>1500, 1, 0),
type = "Platja")
rest_all <- rest_all %>%
mutate(threshold=ifelse(total_cost>1500, 1, 0),
type = "Restaurant")
super_all <- super_all %>%
mutate(threshold=ifelse(total_cost>1500, 1, 0),
type = "Supermercat")
publictrans_all <- publictrans_all %>%
mutate(threshold=ifelse(total_cost>1500, 1, 0),
type = "Transport Public")
mobility_all <- platges_all %>%
bind_rows(rest_all) %>%
bind_rows(super_all) %>%
bind_rows(publictrans_all)
mobility_table <- mobility_all %>%
pivot_wider(names_from = area_type, values_from = threshold) %>%
mutate(Urban=as.numeric(Urban),
Rural=as.numeric(Rural),
Urban_total=ifelse(!is.na(Urban),1,0),
Rural_total=ifelse(!is.na(Rural),1,0))%>%
group_by(type) %>%
summarise(n_Urban=sum(Urban_total, na.rm = TRUE),
n_Rural=sum(Rural_total, na.rm = TRUE),
car_Urban=sum(Urban, na.rm = TRUE),
car_Rural=sum(Rural, na.rm = TRUE),
perc_car_urba=(car_Urban/n_Urban)*100,
perc_car_rural=(car_Rural/n_Rural)*100) %>%
st_drop_geometry()
knitr::kable(mobility_table)
| type | n_Urban | n_Rural | car_Urban | car_Rural | perc_car_urba | perc_car_rural |
|---|---|---|---|---|---|---|
| Platja | 57 | 731 | 13 | 717 | 22.807018 | 98.08482 |
| Restaurant | 4412 | 2551 | 75 | 1511 | 1.699909 | 59.23167 |
| Supermercat | 4412 | 2551 | 150 | 1622 | 3.399819 | 63.58291 |
| Transport Public | 4412 | 2551 | 53 | 1364 | 1.201269 | 53.46923 |
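The percentage columns can be re-derived from the raw counts; for example, for beaches ("Platja") in rural areas:

```r
# 717 of the 731 rural houses with a beach OD line exceed the
# 1,500 m car-dependency threshold.
perc_car_rural_platja <- 717 / 731 * 100
round(perc_car_rural_platja, 5)  # 98.08482, as in the table
```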
We can also visualize it using a pictogram of dots:
mobility_picto <- mobility_table %>%
select(type, perc_car_urba, perc_car_rural) %>%
pivot_longer(
cols = starts_with("perc_car"),
names_to = "area_type",
values_to = "perc_car"
) %>%
mutate(
area_type = case_when(
area_type == "perc_car_urba" ~ "Urban",
area_type == "perc_car_rural" ~ "Rural",
TRUE ~ area_type
),
# 1 dot = 5 percentage points (you can change this)
units = round(perc_car / 5)
) %>%
filter(units > 0) %>% # in case some type has 0%
uncount(units, .id = "dot_id")
a <- ggplot(mobility_picto, aes(x = type, y = dot_id, color = area_type)) +
geom_point(
size = 5,
position = position_dodge(width = 0.5)
) +
coord_flip() +
scale_color_manual(
values = c("Urban" = "#efe1d0", "Rural" = "#faedcd"),
name = "Area type"
) +
scale_y_continuous(
breaks = seq(0, max(mobility_picto$dot_id), by = 2),
labels = function(x) paste0(x * 5, "%")
) +
labs(
x = NULL,
y = "% of houses that need a car to move to..."
)
a + theme_classic()
These visualizations show a systematic difference between urban and rural areas in their mobility patterns: rural houses require car use at much higher rates across all types of daily movements.
The framework presented here is designed to be adaptable to other contexts and datasets, offering a replicable approach to understanding complex territorial dynamics through participatory and data-driven tools.
If this methodology is used or adapted in academic or non-academic work, it should be cited accordingly, naming Miquel Rosselló Xamena as the author. Also, the website should be referenced as the main source of the methodology and its results.
Its development has been shaped by the guidance and insights of Cong Cong and Eric Huntley (DUSP, MIT), as well as by the analytical foundations provided in the course Analytics Edge at the MIT Sloan School of Management. Acknowledging these contributions ensures transparency, intellectual integrity, and the continued strengthening of collaborative research. Thanks to them all.
By rural land we refer to all areas classified as non-buildable under Mallorca’s zoning and territorial regulations. To identify these areas, we intersect all construction polygons with the MUIB (Mapa Urbanístic de les Illes Balears), a comprehensive cartographic compilation of planning regulations across the archipelago. The MUIB allows us to spatially distinguish between built, buildable, and non-buildable rural classifications and qualifications. Through this overlay, we can determine which constructions are located on rural land and therefore fall within the scope of our analysis.↩︎
We could instead have done this step in R. However, in R we face a problem: some houses are not tight to or aligned with a selected path. In many cases a house is set back considerably from the path, since it sits at the opposite edge or at the center of its parcel. This makes it difficult to attribute a path to each house and then connect it to the network. QNEAT3 solves this problem by also calculating an “entry_cost” (the distance from the point to the path) and an “exit_cost” (the distance from the path to the destination). Moreover, QNEAT3 computes costs along the shortest path.↩︎
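The cost decomposition described in this footnote can be sketched as follows (column names follow the QNEAT3 OD-Matrix output; the values are illustrative):

```r
# QNEAT3 reports, for each origin-destination line:
#   entry_cost   : distance from the origin point to the network
#   network_cost : shortest path along the road network
#   exit_cost    : distance from the network to the destination
od <- data.frame(
  entry_cost   = c(35.2, 120.8),
  network_cost = c(2140.0, 980.5),
  exit_cost    = c(12.4, 48.1)
)
od$total_cost <- od$entry_cost + od$network_cost + od$exit_cost
od$total_cost  # 2187.6 1149.4
```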
As we are working with longitudinal data, we looked for references in the literature. We followed the review by Hu and Szymczak (2023) (accessed at https://academic.oup.com/bib/article/24/2/bbad002/6991123), which suggests subject-based sampling. Yet we were concerned because we have few observations and they differ greatly from one another. After further research, we found that with unbalanced data or few observations many studies use stratified sampling, so we decided to try it and evaluate the results.↩︎
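A minimal sketch of a stratified split of the kind described in this footnote, in base R (the data are simulated stand-ins, not our dataset; packages such as caret offer the same idea via createDataPartition()):

```r
# Stratified 80/20 split: bin the outcome into quintiles and sample
# 80% within each bin, so both sets cover the whole outcome range.
set.seed(123)
y <- runif(100)  # stand-in for our consumption outcome

bins <- cut(y, quantile(y, probs = seq(0, 1, 0.2)), include.lowest = TRUE)
idx <- unlist(lapply(split(seq_along(y), bins), function(i)
  sample(i, size = round(0.8 * length(i)))))

train <- y[idx]
test  <- y[-idx]
c(length(train), length(test))  # 80 20
```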
We followed r-bloggers.com/2021/04/quantile-regression-forests-for-prediction-intervals/. Due to bugs, we ran the code through an AI assistant, which suggested rearranging the QRF into range().↩︎
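For reference, a hedged sketch of a quantile regression forest using the ranger package (quantreg = TRUE); the data and variable names are illustrative, not our actual model:

```r
library(ranger)  # assumed available

set.seed(1)
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- 100 * df$x1 + rnorm(200, sd = 10)

# Grow a forest that stores node-level outcome distributions
qrf <- ranger(y ~ x1 + x2, data = df, quantreg = TRUE)

# 5th, 50th and 95th percentiles -> a 90% prediction interval
pred <- predict(qrf, df, type = "quantiles",
                quantiles = c(0.05, 0.5, 0.95))
head(pred$predictions)
```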
In all the following tables, m_pred is the predicted consumption, and the interval (m_pred_low, m_pred_high) represents the 90% confidence interval.↩︎