Introduction

In this section, we present the methodologies of a large-scale data process composed of two major blocks.

First, we conduct a complete GIS and spatial analysis of all houses located on rural land across Mallorca. For the first time in the island’s history, every rural house has been identified, mapped, and classified. This work enables the creation of an Atlas of Houses That Do Not Exist, a foundational tool for understanding the scope, geography, and dynamics of informal rural urbanization.
Second, we go to the town and we address one of the key research questions that emerged from the Participatory Action Research (PAR) process carried out in Pollença in January.

In this second one, together with local residents, we defined a central question: What environmental impacts does Pollença’s “dispersed” tourism model have compared with the more compact tourism patterns along the coast?

There, two concerns stood out: water and mobility.

Residents repeatedly emphasized water consumption. Many rural houses are not connected to the public sewer network or the potable water system, meaning their actual consumption remains unknown. In a context of growing scarcity and recurrent drought risk, estimating how much water these houses use becomes essential for designing new policies and interventions.
Mobility emerged as the second blind spot. These houses are scattered across the landscape, generating a form of car-dependent tourism that has never been quantified or visualized. Understanding how much mobility this dispersed model produces—especially compared to the more centralized coastal tourism areas—is a necessary step toward rethinking infrastructure, emissions, and territorial impacts.

To answer these questions in the absence of direct data, we develop predictive models for both water consumption and mobility using machine-learning techniques in the former and sptial analysis in the latter. These models allow us to approximate the invisible, to give shape to what has remained unmeasured, and ultimately to inform a political debate grounded in evidence rather than assumption.

This document is organized following a clear sequence. First, we describe the methodology used to construct the Atlas and present its principal findings. Second, we outline the methodologies and results behind the estimation of water consumption and mobility patterns. Finally, we offer a brief synthesis of the methodologies.

In order to comply with the protection of data regulations and ethics, no data set will be preesented and all the data from individuals will not be disclosed.

Atlas of Houses That Do Not Exist

How many houses stand on Mallorca’s rural land? When were they built? How large are they? These are deceptively simple questions. In practice, they are remarkably difficult to answer. Many rural houses obscure their origins—how they were built, expanded, or transformed over time. No single dataset captures all of them or their characteristics with precision.

Yet, all houses, formal or informal, are recorded in the cadastre, even if with incomplete, outdated, or intentionally vague information. This makes the cadastral system our primary source, though it requires significant cross-referencing and cleaning. And because our goal is also to identify which of these houses operate as tourist rentals, we must complement cadastral sources with administrative data from the Government of the Balearic Islands.

To build a complete and reliable dataset, we integrate four main data sources:

Cadastral Building Footprints: This spatial layer provides the geometry of every building on rural land. It allows us to identify the physical presence, location, and spatial arrangement of all structures considered “houses,” regardless of their legal status.
Cadastral Construction Details: This dataset breaks down each building into its component parts—types of construction, structural characteristics, annexes, pools, garages, agricultural buildings, expansions, etc. It is essential for understanding the morphology and evolution of each house.
Cadastral Data Information: A construction-based dataset that complements the previous two. It includes key attributes such as the official year of construction, total built area, land-use classification, surface areas, and additional descriptors that help characterize each property more precisely.
Geolocation of Tourism Rentals (GOIB Dataset):The dataset from the Government of the Balearic Islands listing all legally declared tourist rentals. By spatially matching this data with rural houses, we can identify which informal rural houses operate as tourist accommodations and differentiate them from non-rental properties.

The Cadastral inforamtion is taken from the website of the Spanish Cadastral (https://www.sedecatastro.gob.es/) and from the plug-in of the Spanish INSPIRE Cadastral in QGIS. Regarding the Tourism Rentals, it is taken from the IDEIB - Infraestructura de Dades Espacials de les Illes Balears (http://ideib.caib.es/cataleg/).

Together, these sources allow us to construct the most complete and detailed dataset ever assembled on Mallorca’s rural houses—formal, informal, and recreational—and provide the analytical foundation for the Atlas of Houses That Do Not Exist. Here is how we will be building the data:

Building the dataset

1. Downloading the data and uniting the diferent data of sources

We begin by downloading and cleaning all cadastral information for every municipality in Mallorca. Once the parcel-level data is processed, we cross-reference it with the Cadastral Construction Details—the georeferenced shapefile that contains the full set of building footprints across the island. This layer includes every construction component that forms part of a building. We filter this data so we can detect 1) the buildings that are classified as having a house (constructions that have a “V” in their classification), 2) how many levels they have (as a proxy of height), 3) if they have pools and what area they have and 4) what is the total amount of the area of the whole building and the area of just the part of the building that correspond to the uses as house.

Before merging both sources, we filter the dataset to retain only those constructions located on rural land¹, excluding all buildings situated within urban classifications. This ensures that our analysis focuses exclusively on the rural housing stock that is central to the Atlas.

CatastreRural_Filtrat_s <- CatastreRural_Brut %>% 
  filter(`1_tipo_reg` == 14)  

CatastreRural_Filtrat <- CatastreRural_Brut %>%
  filter(`1_tipo_reg` == 14, str_detect(`71_cd`, "V"))

CatastreRural_AnySuper <-  CatastreRural_Filtrat %>%
  select(`1_tipo_reg`, `31_pc`, "79_aec", "84_stl",`71_cd`)


CatastreRural_Complet <- CatastreRural_AnySuper %>% 
  group_by(`31_pc`) %>% 
  summarise(Any=max(`79_aec`),
            Buildings = sum(str_detect(`71_cd`, "V"), na.rm = TRUE),
            SuperB = sum(as.numeric(`84_stl`), na.rm = TRUE)) %>% 
  rename(REFCAT = `31_pc`)




roman_map <- c(I = 1L, II = 2L, III = 3L, IV = 4L)
get_plantas <- function(x) {
  m <- str_match_all(x, "(?<=^|\\+)(IV|III|II|I)(?=\\+|$)")
  vals <- roman_map[ unlist(lapply(m, function(mm) mm[,2])) ]
  vals <- vals[!is.na(vals)]
  if (length(vals) == 0) NA_integer_ else max(vals)
}


Constru_amb_plantes <- Constru_complet %>%
  mutate(PLANTES = map_int(CONSTRU, get_plantas))


Constru_Recompte  <- Constru_amb_plantes %>% 
  st_drop_geometry() %>%                 
  group_by(REFCAT) %>% 
  summarise(
    PI            = sum(if_else(CONSTRU == "PI", 1, 0), na.rm = TRUE),
    AREA_PI       = sum(if_else(CONSTRU == "PI", AREA, 0), na.rm = TRUE),
    AREA_Plantes  = sum(if_else(!is.na(PLANTES), AREA, 0), na.rm = TRUE),
    MAX_Plantes   = if (all(is.na(PLANTES))) NA_integer_ else max(PLANTES, na.rm = TRUE),
    N_Plantes     = sum(if_else(!is.na(PLANTES), 1, 0)),
    .groups = "drop"
  )

Catastre_Compelt <- Constru_Recompte %>% 
  full_join(CatastreRural_Complet, by= "REFCAT")

However, we are detecting some mismatches from the two datasets. This means that, by using the parcel reference to cross the two datasets, there are some cases that do not fully match and, hence, are not present. For now, we leave them there to further investigate why this happens.

2. Cleaning and Curating the Rural Housing Dataset (QGIS Workflow)

Once the cadastral parcel data is merged with the georeferenced Cadastral Building Footprints, we begin a long and detailed cleaning process. This stage is time-consuming and involves manual verification, often building by building, to ensure that only actual rural houses remain in the final dataset. The cleaning follows this sequence of steps:

Before merging, excluding Buildings Outside Rural Land: Although we have already done it for the two previous data set, we take the Footprints and we first classify them according to Mallorca’s zoning and territorial regulations. To do so, we intersect the building-footprint layer with the MUIB – Mapa Urbanístic de les Illes Balears, which provides the official zoning categories for the entire archipelago. Under this system, rural land refers exclusively to areas designated as non-buildable. All buildings located in built or buildable land are excluded from the Atlas. We also exlude the one’s that are in rural land but are classified as “Nucli Rural”, an exceptional form of urbanization that the law contemplates as a way of legalizing forms of urbanization in rural land. We decide to exclude them too as they resemble more to a low-density urbanization than our “Houses that do not exist”, more disperse in the territory.
Joining our Cadastrial Information Dataset with the Buildings Footprints by the parcel catastral data. By doing this, we filter by those that have a “V” in some of their constructions in the Cadastral Information, hence, are considered houses for taxes porposes.
Removing Buildings with Incompatible Uses. Many buildings contain a construction element classified as “V” (vivienda), which in theory denotes a house, yet their main cadastral use appears as industrial, public, or agricultural. To avoid misclassification, we exclude agricultural uses, which are generally accurate, and manually verify all remaining cases. First, we review satellite imagery: if the structure clearly resembles a dwelling, we retain it. If the image is ambiguous, we consult Google Maps and Street View to look for signs of habitation—such as a satellite antenna, chimney, fenced entrance, mailbox, or other residential features. We also use the conetxt: are there other residential units around? Only when none of these indicators are present do we eliminate the building from the dataset. We also eliminate by uses. Many houses are actually military bases, for example. They are eliminated as they do not fit our definition of “Houses”, but have other purposes.
Validating Outliers in Surface Area: Another way to verify whether a structure is truly a house is by examining its size. Rural houses rarely fall below 25 m² or exceed 700 m², so we treat buildings outside this range as outliers requiring manual review. For each case, we inspect the building individually—following the same procedure used when validating uses—by checking satellite imagery and, when needed, Google Maps or Street View for clear indicators of habitation. Only when no evidence suggests that the structure functions as a house do we remove it from the dataset.

3. Joining Tourism Rentals (QGIS Workflow)

In QGIS, once all previous steps are completed, we join the resulting dataset with the Tourism Rentals Dataset from the Government of the Balearic Islands. Some transformations are required, since the georeferenced points in this dataset are not located at the centroid of the house but rather at the centroid (or in close proximity) of the plot of land. In other words, the tourism-rental point is positioned on the corresponding parcel, not directly on the building.

For this reason, we perform a location-based join to assign each tourism-rental point the cadastral reference of the parcel in which it falls. This allows us to determine, for every parcel, whether it contains a tourism rental and how many beds it offers. As a result, we generate two variables: a binary indicator (0/1) where 1 indicates a registered rental, and the number of beds when applicable.

Afterward, we join the parcel-level rental information to the corresponding building, allowing us to identify which houses are rentals and the number of beds they have.

However, it is important to note that this method is not fully accurate, since the GOIB has not georeferenced all rural rental licences. Moreover, many Airbnb rentals are under-reported or missing from the official dataset. For this reason, we use Airbnb data as a reference point to cross-check and validate the counts at the municipal level.

4. Crossing Protection Categories (QGIS Workflow)

The first step is identifying which buildings are located in areas classified as flood-risk zones. For this, we use the official determination made in 2002 by the Government of the Balearic Islands (https://ideib.caib.es/cataleg/srv/cat/catalog.search;jsessionid=725358ED20EEAEDA207B4878DD0F457A#/metadata/A85E9070-8C41-4090-AFF3-9AC582E0965F), which maps the territories considered to be at high risk of flooding. Although these maps are now outdated—especially given advancements in GIS techniques and the increasing severity of the climate emergency—they still represent the most authoritative delineation of risk zones and are known to be conservative rather than expansive. For this reason, we retain them as the baseline reference, understanding that actual risk areas today would likely be larger. After intersecting these layers with our cleaned dataset, we generate a simple dummy variable (0/1) indicating whether each building is located within a flood-risk area.

After identifying buildings located in flood-risk zones, we delimit the areas that fall under highly protected rural land. To do this, we rely on the classification of the Pla Territorial de Mallorca (PTM), the territorial plan that distinguishes between different categories of rural land, including highly protected areas and general or common rural land. We focus specifically on Articles 19.3 and 20 of the regulatory section, and intersect their legal definitions with the corresponding cartography. Once this spatial layer is prepared, we determine whether any building falls within these protected categories.

The first step is assigning each building the rural-land category in which it is located. If a building falls into any of the following classifications, we label it as being in a Strong Environmental Protection Area, since these are zones where the construction of new isolated houses is explicitly prohibited:

AANP – Àrees Naturals d’Especial Interès d’Alt Nivell de Protecció
ANEI / ANEU – Àrees Naturals d’Especial Interès
ARIP-B – Àrees Rurals d’Interès Paisatgístic (Boscoses)
APT – Àrees de Protecció Territorial

Next, we create a second classification of strong protection. Buildings located in SRG-F (Suelo Rústico de Régimen General Forestal) or ARIP areas are also considered highly protected when the parcel on which they sit is smaller than 50,000 m². According to Article 20, parcels below this size cannot legally host new isolated single-family dwellings.

After this, we classify the remaining rural land as general rural protection, where the minimum parcel size for building a house is 14,000 m². Buildings located on parcels smaller than this threshold are not necessarily illegal—because municipal regulations before the 1990s used to allow smaller parcels—but they would not be permitted under today’s rules. For this reason, we categorize them as potentially informal, yet this would require go case by case.

In summary, when we refer to houses on protected land, we combine:

Buildings located in the first group of highly protected categories (AANP, ANEI/ANEU, ARIP-B, APT).
Buildings in SRG-F or ARIP areas located on parcels smaller than 50,000 m².

And, when relevant, buildings located on the rest of rural land on parcels smaller than 14,000 m², we consider potentially informal given the evolution of planning regulations.

When interpreting these classifications, we must keep in mind that our computation focuses solely on whether a house exists in a specific location. Many cases of informality do not stem from the physical characteristics of the house itself (such as height or volume), but rather from whether the dwelling was permitted to be built in that location under the relevant planning regulations. For example, houses that are legal, in protected land, yet with the right size of plot, but that they build a pool afterwards. This cases are not counted here.

5. We count the surface of the parcels where they are located

At the end we join by location on the parcel they are located and we count the amount of sq meters the parcel have.

Issues we have observed in QGIS and how we solved them: Hidden Houses And Mismatches

Some houses have building footprints that fall outside their parcels, and have another reference not included in the cadastral information dataset. This corresponds to the mismatch we identified in the first count between constructions and their associated cadastral attributes. This is precisely where those missing cases appear. Here is an example, that we see (light line) how the house falls outside the parcel limits and is, in itself, another parcel:

To resolve this, we start by taking the building footprints without intersecting them with the mismatched cadastral dataset. We identify which footprints located in rural land are not selected in the cadastral merge. From these, we keep only those classified as residential and in good physical condition (i.e., not in ruins). Then, for each of these remaining cases, we manually verify whether they are truly houses, repeating the full filtering process described earlier: checking satellite imagery, using Google Maps when necessary, and applying the same criteria used for all other buildings. We tried to find the cadastral inforamtion in the urban dataset of houses, yet it was not found. So, we decide to leave this more than 2000 houses with some missing information, yet counting them.

We detected these missing cases because many of them appear as tourism rentals in a parcel where we did not have any house marked (although it was there). To correctly match them as rentals, as we used the parcel as reference to classify them as such and now this do not work, we change the methodology. Just for this houses when they have no cadastral parcel reference matching the inforamtion, we adjust the previous methodology by applying a proximate-neighbour. Through this method, we identify the closest building footprint to each unmatched rental point and assign it as a rental (value = 1) when the mismatch is confirmed. These corrected matches are then added to the dataset. All other rental points continue to be matched using the standard parcel-based join, as explained in point 3.

To correct these cases, we use an R workflow. First, we take all the buildings identified in QGIS as “not detected” by the cadastral dataset—specifically those classified as residential—and cross them with the available cadastral information to recover any attributes that may have been mismatched (although, as explained earlier, not all of them can be recovered). After this merge, we incorporate the rental information to determine which of these buildings operate as tourist rentals. The remaining unmatched cases—those that could not be linked to any cadastral record through R—are then processed in QGIS using the most proximate-neighbour method described above.

CatastreRural_Filtrat_s <- CatastreRural_Brut %>% 
  filter(`1_tipo_reg` == 14)  

CatastreRural_AnySuper_s <-  CatastreRural_Filtrat_s %>%
  select(`1_tipo_reg`, `31_pc`, "79_aec", "84_stl",`71_cd`)


CatastreRural_Complet_s <- CatastreRural_AnySuper_s %>%
  group_by(`31_pc`) %>%
  summarise(
    Any = max(`79_aec`, na.rm = TRUE),
    Buildings = ifelse(
      sum(str_detect(`71_cd`, "V"), na.rm = TRUE) > 0,
      sum(str_detect(`71_cd`, "V"), na.rm = TRUE),
      1
    ),
    SuperB = sum(as.numeric(`84_stl`), na.rm = TRUE)
  ) %>%
  rename(REFCAT = `31_pc`) %>% 
  mutate(Cat_REFCAT = REFCAT)

Rentals <- Rentals %>% 
  mutate(Cat_REFCAT=localId) %>% 
  select(OBJECTID, PLACES, Cat_REFCAT)
Rentals_count <- as_tibble(Rentals) %>% 
  group_by(Cat_REFCAT) %>% 
  summarise(TuriGOIB_N=n(),
            TuriGOIB_P=sum(PLACES))

Buildings_ForaMethod <- Buildings_ForaMethod %>% 
  left_join(Rentals_count, by = "Cat_REFCAT") 

Buildings_ForaMethod_PI <- Buildings_ForaMethod %>% 
  left_join(CatastreRural_Complet_s, by = "Cat_REFCAT") %>% 
  mutate(Cat_Any=Any,Cat_SuperB=SuperB, Cat_Buildi=Buildings ) %>% 
  select(-Any, -SuperB, -Buildings) %>% 
  select (-REFCAT) %>% 
  filter(Cat_PI==1 & !is.na(Cat_Any)&conditionO=="functional" )

Buildings_ForaMethod_Rentals <- Buildings_ForaMethod %>% 
  left_join(CatastreRural_Complet_s, by = "Cat_REFCAT") %>% 
  mutate(Cat_Any=Any,Cat_SuperB=SuperB, Cat_Buildi=Buildings ) %>% 
  select(-Any, -SuperB, -Buildings) %>% 
  select (-REFCAT) %>% 
  filter(!is.na(Cat_Any) & !is.na(TuriGOIB_N) &conditionO=="functional")

Buildings_ForaMethod_F <- bind_rows(Buildings_ForaMethod_PI, Buildings_ForaMethod_Rentals)

We return that into QGIS and we just merge the layers with the one we got from before, adding just the houses that were not detected originally

Also, some pools sometimes fall outside the parcel of the house, as in here:

This creates a small methodological limitation: because we use the parcel reference to match buildings and pools (that are constructions), some end up being under-detected or not associated with the correct house. In many cases it is extremely difficult to match a specific pool to a specific dwelling, especially on large parcels with multiple structures. This makes that strategy that we used in the rentals of the most proximate-neighbour very difficult to be accurate.

To manage this, we accept a certain degree of under-detection at the building level. However, this does not affect municipal-level aggregations. When producing the final counts of houses for each municipality and pools in rural land, we could treat pools as a separate layer and aggregate them independently by municipality, avoiding the need for a precise one-to-one match between house and pool.

Final Dataset of all the Houses That Do Not Exist

At the end of this process, we obtain a complete count of all buildings located on rural land, combining the results from both QGIS and R. These constitute the full set of buildings used in our analysis. Those are the variables that result.

names(buildings_complet)

##  [1] "gml_id"     "lowerCorne" "upperCorne" "beginLifes" "conditionO"
##  [6] "beginning"  "end"        "endLifespa" "informatio" "reference" 
## [11] "localId"    "namespace"  "horizontal" "horizont_1" "horizont_2"
## [16] "referenceG" "currentUse" "numberOfBu" "numberOfDw" "numberOfFl"
## [21] "documentLi" "format"     "sourceStat" "officialAr" "value"     
## [26] "value_uom"  "layer"      "path"       "Cat_REFCAT" "Cat_PI"    
## [31] "Cat_AREA_P" "Cat_AREA_1" "Cat_MAX_Pl" "Cat_N_Plan" "Cat_Any"   
## [36] "Cat_Buildi" "Cat_SuperB" "Parc_areaV" "TuriGOIB_N" "TuriGOIB_P"
## [41] "Allotj_n"   "Allotj_p"   "PrtAbsolut" "Ptc50"      "PrtGen"    
## [46] "Build_Innu" "Munic_NOM"  "Catastre_1" "Catastre_3" "Catastre_7"
## [51] "Any_num"    "PrtAbsNum"  "CODIMUIB_G" "geometry"

And here all the final stats of all the houses and the main variables:

buildings_complet_totals <- buildings_complet %>%
  mutate(Cat_Any=as.numeric(Cat_Any),
         Cat_SuperB=as.numeric(Cat_SuperB),
         Cat_PI=as.numeric(Cat_PI)) %>% 
  group_by() %>% 
  summarise(Cases=n(),
            PI=sum(Cat_PI, na.rm = TRUE),
            mean_any=mean(Cat_Any, na.rm=TRUE),
            med_any=median(Cat_Any, na.rm=TRUE),
            mean_superB=mean(Cat_SuperB, na.rm=TRUE),
            med_superB=median(Cat_SuperB, na.rm=TRUE),
            mean_parc = mean(as.numeric(Parc_areaV), na.rm=TRUE),
            med_parc = median(as.numeric(Parc_areaV), na.rm=TRUE),
            sum_14K = sum(PrtGen),
            sum_protecc = sum(as.numeric(PrtAbsolut)),
            sum_protecc_amb50=sum_protecc+sum(Ptc50), #HEM POSAT AQUEST QUE ES PROTECCIO ALTA PERO RELATIVA 
            sum_innun=sum(as.numeric(Build_Innu), na.rm = TRUE)
            ) %>% 
  mutate(perc_14K=(sum_14K/Cases)*100,
         perc_proteccioabsoluta=(sum_protecc/Cases)*100,
        perc_proteccioamb50=(sum_protecc_amb50/Cases)*100,
        perc_innundable=(sum_innun/Cases)*100)

knitr::kable(
  buildings_complet_totals
)

Cases	PI	mean_any	med_any	mean_superB	med_superB	mean_parc	med_parc	sum_14K	sum_protecc	sum_protecc_amb50	sum_innun	perc_14K	perc_proteccioabsoluta	perc_proteccioamb50	perc_innundable
55256	21897	1984.594	1996	182.7571	135	29389.56	5712	30134	4591	7965	5066	54.53525	8.3086	14.41472	9.168235

Counts of The Houses That Do Not Exist by Municipalities

Now, using the dataset, we build the data to count how many houses there are in each municipality and what are their characteristics. The code we use originally si the following:

muni_count <- buildings_complet %>%

  mutate(TuriGOIB_N=as.numeric(TuriGOIB_N),
         Cat_Buildi=as.numeric(Cat_Buildi),
         Cat_SuperB=as.numeric(Cat_SuperB),
         Parc_areaV=as.numeric(Parc_areaV),
         value = as.numeric(value),
         Cat_AREA_1 = as.numeric(Cat_AREA_1),
         illegal_75_7K=ifelse(CODIMUIB_G==1 & Parc_areaV < 7000 & Cat_Any > 1975 & 
                       Cat_Any <= 1997, 1, 0),
         illegal_75_14K = ifelse(CODIMUIB_G==1 & Parc_areaV < 14000 & Cat_Any > 1975 & 
                       Cat_Any <= 1997, 1, 0),
         illegal_97_14K =ifelse(CODIMUIB_G==1 & Parc_areaV < 14000 & Cat_Any > 1997, 1, 0)) %>% 
  group_by(Munic_NOM) %>%
  summarise(
    any_mean = mean(Any_num, na.rm = TRUE),
    any_med=median(Any_num, na.rm = TRUE),
    n_hab = sum(as.numeric(Cat_Buildi), na.rm = TRUE),
    n_general = n(),
    Rental_n_GOIB = sum(TuriGOIB_N, na.rm = TRUE),
    Rental_pl_GOIB = sum(TuriGOIB_P, na.rm = TRUE),
    n_Innund = sum(as.numeric(Build_Innu) * Cat_Buildi, na.rm = TRUE),
    n_Innund_gen = sum(Build_Innu, na.rm = TRUE),
    perc_innund = (n_Innund_gen/n_general)*100,
    n_prtabs = sum(as.numeric(PrtAbsolut) * Cat_Buildi, na.rm = TRUE),
    n_prtabs_general = sum(as.numeric(PrtAbsolut), na.rm = TRUE),
    perc_prtabs = (n_prtabs_general/n_general)*100,
    areaHous_mean = mean(as.numeric(Cat_SuperB), na.rm = TRUE),
    areaHous_med = median(as.numeric(Cat_SuperB), na.rm = TRUE),
    areagfa_mean = mean(as.numeric(value), na.rm = TRUE),
    areagfa_med = median(as.numeric(value), na.rm = TRUE),
    areaTOT_mean = mean(as.numeric(Cat_AREA_1), na.rm = TRUE),
    areaPar_mean = mean(Parc_areaV, na.rm = TRUE),
    areaPar_med = median(Parc_areaV, na.rm = TRUE),
    n_PI = sum(as.numeric(Cat_PI), na.rm = TRUE),
    areaPI_mean = mean(Cat_AREA_P, na.rm = TRUE),
    n_prt50 = sum(Ptc50 * Cat_Buildi, na.rm = TRUE),
    n_prt50_general = sum(Ptc50, na.rm = TRUE),
    n_prtg_tot = n_prt50_general+n_prtabs_general,
    SRG = sum(CODIMUIB_G, na.rm=TRUE),
    illegal_75_7K = sum(illegal_75_7K, na.rm = TRUE),
    illegal_75_14K = sum(illegal_75_14K, na.rm = TRUE),
    illegal_97_14K = sum(illegal_97_14K, na.rm = TRUE),
    illegal= illegal_75_14K+illegal_97_14K,
    perc_illegal = (illegal/n_general)*100,
    perc_prt50 = (n_prt50_general/n_general)*100) %>% 
  mutate(
    perc_hab = (n_hab / sum(n_hab)) * 100,
    perc_general = (n_general / sum(n_general)) * 100) %>% 
  full_join(airbnb) %>% 
  mutate(diffTuri= airbnb_25-HabT_n,
         perc_Turi_general = (HabT_n / n_general) * 100,
         perc_Turi_genANDall= ((HabT_n+Allotj_n) / n_general) * 100,
         perc_airb_general = (airbnb_25 / n_general) * 100,
         perc_Turi_habANDall = ((HabT_n+Allotj_n) / n_hab) * 100,
         perc_prtg_tot = (n_prtg_tot/n_general)*100,
         perc_airb_hab = (airbnb_25 / n_hab) * 100,
         perc_illegal_SRG = (illegal/localId_co)*100) %>% 
  mutate(
    dplyr::across(
      .cols = where(is.numeric),
      .fns  = ~ tidyr::replace_na(., 0)
    ))
  
total_illegals <- muni_count %>% 
  group_by() %>% 
  summarise(
    illegal = sum(illegal, na.rm = TRUE),
    illegal_75_7K = sum(illegal_75_7K, na.rm = TRUE),
    illegal_97_14K = sum(illegal_97_14K, na.rm = TRUE),
    illegal_75_14K = sum(illegal_75_14K, na.rm = TRUE),
    n = sum(n_general, na.rm = TRUE),
    SRG = sum(SRG, na.rm = TRUE),
    airbnb = sum(airbnb_25, na.rm = TRUE),
  ) %>% 
  mutate(
    perc_illegal_SRG = (illegal/SRG)*100,
    perc_illegal = (illegal/n)*100,
    perc_illegal_97_14K = (illegal_97_14K/n)*100,
    perc_illegal_75_14K = (illegal_75_14K/n)*100,
    perc_illegal_75_7K = (illegal_75_7K/n)*100,
    perc_illegal_97_14K_SRG = (illegal_97_14K/SRG)*100,
    perc_illegal_75_14K_SRG = (illegal_75_14K/SRG)*100,
    perc_illegal_75_7K_SRG = (illegal_75_7K/SRG)*100,
    perc_airbnb = (airbnb/n)*100
  )

With this, we can visualize the totals for each municipality.

Is Possible to Built a Cluster Based on The Characteristics Of Houses?

The short answer is: clustering did not work. Our goal was to use all available cadastral features to identify potential typologies of houses—groups that would reflect different patterns of size, shape, age, parcel characteristics, or tourist-rental intensity. If such clusters existed, they could have served as a useful way of summarizing the morphological diversity of rural housing.

Before applying clustering, we tested whether the data had the necessary internal structure:

We created an index to uniquely identify each building.
We selected all numeric cadastral variables and harmonized missing values.
We removed variables with no variation (constant values).
We checked correlations and found several pairs of highly correlated variables (>0.9).
We ran an outlier analysis using IQR thresholds to detect extreme values.

All of the results suggested that the dataset is dominated by a high number of outliers and lacks clear internal substructure.

buildings_complet_index <- as.data.frame(buildings_complet) %>%
  mutate(index = row_number())


buildings_complet_cl <- as_tibble(buildings_complet_index) %>%  select(index, 31:52) %>% 
  select(-c("Cat_N_Plan","Catastre_1","Catastre_3",
            "Catastre_7","Any_num")) %>% 
  mutate(
    TuriGOIB_N = ifelse(is.na(TuriGOIB_N), 0, TuriGOIB_N),
    TuriGOIB_P = ifelse(is.na(TuriGOIB_P), 0, TuriGOIB_P),
    Allotj_n   = ifelse(is.na(Allotj_n), 0, Allotj_n),
    Allotj_p   = ifelse(is.na(Allotj_p), 0, Allotj_p),
    Build_Innu = ifelse(is.na(Build_Innu), 0, Build_Innu)
  ) %>% 
  filter(
    !is.na(Cat_Buildi),
    !is.na(Cat_AREA_1),
    !is.na(Cat_MAX_Pl),
    Cat_Any > 1000
  ) %>%
  select(-Munic_NOM) %>% 
  as.data.frame() %>% 
  select( -Cat_Buildi, -Allotj_n, -Allotj_p, -TuriGOIB_P, -PrtAbsolut, -Ptc50, -PrtGen, -Build_Innu) %>% 
  mutate(across(-index, as.numeric))

buildings_complet_cl_si <- buildings_complet_cl %>%
  select(-index) %>%     # treus index si vols
  mutate(across(everything(), as.numeric))
  
 

cor_dta <- cor(buildings_complet_cl_si)
cor_dta2 <- as.data.frame(as.table(cor_dta))
subset(cor_dta2, Freq>0.9)

##          Var1       Var2 Freq
## 1  Cat_AREA_P Cat_AREA_P    1
## 10 Cat_AREA_1 Cat_AREA_1    1
## 19 Cat_MAX_Pl Cat_MAX_Pl    1
## 28    Cat_Any    Cat_Any    1
## 37 Cat_SuperB Cat_SuperB    1
## 46 Parc_areaV Parc_areaV    1
## 55 TuriGOIB_N TuriGOIB_N    1
## 64  PrtAbsNum  PrtAbsNum    1

cor_mat <- cor(buildings_complet_cl_si, use = "pairwise.complete.obs")

corrplot(cor_mat, method = "color", type = "upper", tl.cex=0.7)

#Outliers
outliers_iqr_cl <- sapply(buildings_complet_cl, function(col){
  q1 <- quantile(col, 0.25)
  q3 <- quantile(col, 0.75)
  iqr <- q3 - q1
  col < (q1 - 1.5*iqr) | col > (q3 + 1.5*iqr)
})

outlier_rows <- which(apply(outliers_iqr_cl, 1, any))
length(outlier_rows)

## [1] 14112

Hierarchical Clustering Attempt

We applied a hierarchical clustering model (Ward.D2) on the scaled numerical features. We tested cluster solutions between k = 2 and k = 10, computing the mean silhouette width for each value of k. The silhouette values remained consistently low, and no value of k produced a meaningful improvement.

Even when forcing a solution (we tested k = 7 for illustration purposes), the clusters were numerically unbalanced, overlapping, and indistinguishable in their main characteristics (house area, parcel size, number of floors, construction year, number of rental beds, etc.). The dendrogram and silhouette plots confirmed the absence of well-defined cluster boundaries.

df_num <- buildings_complet_cl %>% 
  select(index, where(is.numeric))

df_num <- df_num %>% drop_na()

df_num <- df_num[, sapply(df_num, function(x) length(unique(x)) > 1)]


cols_ok <- sapply(df_num, function(x) length(unique(x)) > 1)
cols_ok["index"] <- TRUE 

df_num <- df_num[, cols_ok]
idx <- df_num$index
df_scaled <- scale(df_num %>% select(-index))


d <- dist(df_scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE, main = "Hierarchical clustering dendrogram")

ks <- 2:10

sil_means <- sapply(ks, function(k) {
  cl_k <- cutree(hc, k = k)         
  sil   <- silhouette(cl_k, d)      
  mean(sil[, 3])                    
})


sil_means

## [1] 0.2277950 0.2978560 0.3146154 0.3158251 0.3436693 0.3441119 0.3337026
## [8] 0.3537014 0.3538284

plot(ks, sil_means, type = "b",
     xlab = "Número de clusters (k)",
     ylab = "Media silhouette",
     main = "Silhouette vs k (Ward.D2)")
abline(v = ks[which.max(sil_means)], lty = 2)

clusters <- cutree(hc, k = 7)
table(clusters)

## clusters
##     1     2     3     4     5     6     7 
##  3875   252 11611 20545 11949     9     2

cut_height <- hc$height[length(hc$height) - (7 - 1)]
plot(hc, labels = FALSE, main = "Dendrograma amb línia de tall per k=7")
abline(h = cut_height, col = "blue", lty = 2, lwd = 2)

clusters_df <- data.frame(
  index   = idx,
  cluster = as.integer(clusters)
)

buildings_complet_index <- buildings_complet_index %>%
  left_join(clusters_df, by = "index")


buildings_complet_index$cluster[is.na(buildings_complet_index$cluster)] <- 0
names(buildings_complet_index)

##  [1] "gml_id"     "lowerCorne" "upperCorne" "beginLifes" "conditionO"
##  [6] "beginning"  "end"        "endLifespa" "informatio" "reference" 
## [11] "localId"    "namespace"  "horizontal" "horizont_1" "horizont_2"
## [16] "referenceG" "currentUse" "numberOfBu" "numberOfDw" "numberOfFl"
## [21] "documentLi" "format"     "sourceStat" "officialAr" "value"     
## [26] "value_uom"  "layer"      "path"       "Cat_REFCAT" "Cat_PI"    
## [31] "Cat_AREA_P" "Cat_AREA_1" "Cat_MAX_Pl" "Cat_N_Plan" "Cat_Any"   
## [36] "Cat_Buildi" "Cat_SuperB" "Parc_areaV" "TuriGOIB_N" "TuriGOIB_P"
## [41] "Allotj_n"   "Allotj_p"   "PrtAbsolut" "Ptc50"      "PrtGen"    
## [46] "Build_Innu" "Munic_NOM"  "Catastre_1" "Catastre_3" "Catastre_7"
## [51] "Any_num"    "PrtAbsNum"  "CODIMUIB_G" "geometry"   "index"     
## [56] "cluster"

table(buildings_complet_index$cluster)

## 
##     0     1     2     3     4     5     6     7 
##  7013  3875   252 11611 20545 11949     9     2

cluster_summary <- buildings_complet_index %>%
  group_by(cluster) %>%
  summarise(
    n = n(),

    # Cat_AREA_P
    mean_Cat_AREA_P   = mean(as.numeric(Cat_AREA_P),  na.rm = TRUE),
    median_Cat_AREA_P = median(as.numeric(Cat_AREA_P), na.rm = TRUE),
    sd_Cat_AREA_P     = sd(as.numeric(Cat_AREA_P),    na.rm = TRUE),

    # Cat_AREA_1
    mean_Cat_AREA_1   = mean(as.numeric(Cat_AREA_1),  na.rm = TRUE),
    median_Cat_AREA_1 = median(as.numeric(Cat_AREA_1), na.rm = TRUE),
    sd_Cat_AREA_1     = sd(as.numeric(Cat_AREA_1),    na.rm = TRUE),

    # Cat_SuperB
    mean_Super        = mean(as.numeric(Cat_SuperB),  na.rm = TRUE),
    median_Super      = median(as.numeric(Cat_SuperB), na.rm = TRUE),
    sd_Super          = sd(as.numeric(Cat_SuperB),    na.rm = TRUE),

    # Cat_MAX_Pl
    mean_Cat_MAX_Pl   = mean(as.numeric(Cat_MAX_Pl),  na.rm = TRUE),
    median_Cat_MAX_Pl = median(as.numeric(Cat_MAX_Pl), na.rm = TRUE),
    sd_Cat_MAX_Pl     = sd(as.numeric(Cat_MAX_Pl),    na.rm = TRUE),

    # Parc_areaV
    mean_Parc         = mean(as.numeric(Parc_areaV),  na.rm = TRUE),
    median_Parc       = median(as.numeric(Parc_areaV), na.rm = TRUE),
    sd_Parc           = sd(as.numeric(Parc_areaV),    na.rm = TRUE),

    # TuriGOIB_N
    mean_Turi         = mean(as.numeric(TuriGOIB_N),  na.rm = TRUE),
    median_Turi       = median(as.numeric(TuriGOIB_N), na.rm = TRUE),
    sd_Turi           = sd(as.numeric(TuriGOIB_N),    na.rm = TRUE),

    # Cat_Any
    mean_Cat_Any      = mean(as.numeric(Cat_Any),     na.rm = TRUE),
    median_Cat_Any    = median(as.numeric(Cat_Any),    na.rm = TRUE),
    sd_Cat_Any        = sd(as.numeric(Cat_Any),       na.rm = TRUE),

    # Municipi més comú
    top_munic         = names(which.max(table(Munic_NOM))),
    top_munic_n       = max(table(Munic_NOM)),
    second_munic_n    = sort(table(Munic_NOM), decreasing = TRUE)[2],
    top_advantage     = max(table(Munic_NOM)) - sort(table(Munic_NOM), decreasing = TRUE)[2]
  )

cluster_summary_flipped <- cluster_summary %>%
  as.data.frame() %>%
  column_to_rownames("cluster") %>%  
  t() %>%                             
  as.data.frame()


plot_data <- buildings_complet_index %>%
   filter(!cluster %in% c(0, 4, 6, 7)) %>%
  mutate(cluster = factor(cluster)) %>%
  select(
    cluster,
    Cat_AREA_P,
    Cat_AREA_1,
    Cat_SuperB,
    Cat_MAX_Pl,
    Parc_areaV,
    TuriGOIB_N,
    Cat_Any  ) %>%
  mutate(across(-cluster, as.numeric)) %>%
  pivot_longer(
    cols = -cluster,
    names_to = "variable",
    values_to = "value"
  )

ggplot(plot_data, aes(x = cluster, y = value)) +
  geom_boxplot(outlier.alpha = 0.3) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(
    x = "Cluster",
    y = "Value of Variable",
    title = "Distribution by Cluster"
  )

plot(hc, labels = FALSE, main = "Hierarchical clustering dendrogram (k = 7)")
rect.hclust(hc, k = 7, border = "red")

Because clustering failed to reveal any additional structure or insight, we decided not to use it as part of the typology analysis. Instead, we focus on individual variables and spatial patterns, which provide more substantive information than the attempted cluster models.

Answering The Questions Of The PAR Process: Sustainability and Environmental Impact

During the month of January we carried out a Participatory Action Research (PAR) process in which the people of Pollença defined the research priorities themselves. After collecting 133 research questions that reflected the community’s concerns, three questions emerged as the final ones (decided in a final workshop with the main stakeholders). All of them where centering on a comparaison between rentals on rural land and the realities of other forms of tourism in the municipality (mainly hotels and urban rentals). So, the questions that resulted are:

What are the differences between the rural-land vacation rental tourism model and other models, especially regarding construction quality, urban morphology, and the characteristics of economic activity?
What environmental impacts does the “dispersed” tourism model of rural Pollença have compared to the more compact coastal tourism models? (Mainly water and mobility)
Has the profitability of vacation rentals increased or decreased? Does it generate a concentration of capital in fewer hands, or does it promote a more equitable distribution of resources?

In this analysis we focus on the second question: the environmental consequences of rural houses. To address it, we work through two complementary perspectives:

Water consumption: We build a model capable of predicting the water consumption of rural houses and compare it with the consumption patterns of urban households.
Mobility impact: We develop a model to estimate the mobility generated by rural houses, focusing on the distance to key amenities and the extent to which residents must rely on cars compared to those living in urban areas.

With this, we pretend to inspire stakeholders and policymakers to agree on shared action on those spaces. Visualizing this data we can put on the table the extend of the problems and discuss potential interventions in the municipality.

Clustering the neighborhoods

However, before proceeding, a short clarification is needed. We argue that informality is not primarily a matter of policy or legalization, but a projectual issue—something that must be addressed spatially, through design and planning interventions. Each place may require a different response. And as every planning proposal needs a clear spatial delimitation, we first sought to understand whether there are emerging neighbourhood-like clusters in rural land—areas where the density of houses is high enough to constitute a coherent unit for planning. Identifying such clusters would allow us to examine their specific patterns of water consumption or mobility and use them as the basis for discussing context-sensitive interventions.

For this reason, we built two models of clusters based on the idea of proximity and density of houses among each other that could allow us to determine what “neighborhoods” exist. We used two methodologies: K-Means and DBSCAN.

K-Means to determine the neigborhoods

At the end of the day, we can consider each house a point (thus we do the centroid of each polygon of footprint we have using QGIS). And this point is situated into two coordinated: X and Y. Thus, we could just use this two coordinates to build a k-means cluster:

coords_utm_bp <- st_transform(buildings_complet_p, 25831)
coords_centroids <- st_centroid(coords_utm_bp)
coords_mat_cases <- st_coordinates(coords_centroids)
coords_scaled <- scale(coords_mat_cases)



set.seed(1000)




max_k <- 20
wss <- numeric(max_k)

for (k in 1:max_k) {
  km <- kmeans(coords_scaled, centers = k, nstart = 25)
  wss[k] <- km$tot.withinss   # within-cluster sum of squares
}

plot(1:max_k, wss, type = "b",
     xlab = "Nombre de clusters (k)",
     ylab = "Total within-cluster SS",
     main = "Elbow plot per k-means")

sil_values <- numeric(20)

for (k in 2:20) {
  km_temp <- kmeans(coords_scaled, centers = k, nstart = 25)
  sil_values[k] <- mean(silhouette(km_temp$cluster, dist(coords_scaled))[, 3])
}

# Silhouette plot
plot(2:20, sil_values[2:20], type = "b",
     xlab = "Nombre de clusters (k)",
     ylab = "Silhouette mitjà",
     main = "Silhouette per k-means")

By using Silhouette and WSS we see that the potential cut on number of k could be in 6 or 10 clusters. So, we build both models and observe that both returns us a logical value of clusters distribution

km6 <- kmeans(coords_scaled, centers = 6, nstart = 25)
km10 <- kmeans(coords_scaled, centers = 10, nstart = 25)

coords_utm_bp$cluster6 <- factor(km6$cluster)
coords_utm_bp$cluster10 <- factor(km10$cluster)

knitr::kable(table(coords_utm_bp$cluster6),
             caption = "K-mean neighb. (k=6)")

K-mean neighb. (k=6)
Var1	Freq
1	669
2	693
3	143
4	332
5	375
6	422

knitr::kable(table(coords_utm_bp$cluster10),
             caption = "K-mean neighb. (k=10)")

K-mean neighb. (k=10)
Var1	Freq
1	335
2	266
3	235
4	346
5	321
6	10
7	338
8	113
9	123
10	547

Here the result on a map of them when we add them in QGIS (and sum them in hexagons of 200 sq meters):

When we asked different people from the town which cluster best reflected the lived realities of rural Pollença, they consistently pointed to cluster 10. According to them, this cluster most closely aligns with what they commonly recognize as the “areas” of the rural landscape—categories rooted in historic land divisions, traditional parcel structures, and popular geographic terminology.

Yet there is an important insight that residents pointed out—and one we fully agree with: houses in rural Pollença are not randomly distributed across the landscape. Their spatial logic is shaped by mountains, valleys, historic paths, and physical barriers. In other words, the “distance” between two houses is not the straight line between their coordinates but the network of roads, trails, and passes that actually connect them. Paths and trails that actually are the true accessess to houses and that reflect the orography of the land. Two houses that appear close in Euclidean space (the one that K-means use) may, in reality, be separated by a mountain, while others that seem far apart on a map are effectively neighbours because they share a direct path.

This becomes clear in areas such as Siller and Cala Sant Vicenç, where simple coordinate-based proximity fails to capture the real spatial relationships of daily life: they are separated by a mountain (el Coll de Siller) and thus, the paths that connect them are very far away.

Because of this, we sought to rebuild the clusterization based on network distance rather than on raw geographic coordinates, allowing us to reflect the true connectivity patterns that shape how people move, interact, and access the territory.

DBSCAN to determine the neigborhoods

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a widely used algorithm in spatial analysis. It offers a straightforward way to identify density-based groups of points while also distinguishing outliers. Thus, is based on density and also elimiantes “noise”, this being the isolated houses in or case. One key advantage is that DBSCAN operates on a distance matrix, meaning it can use the actual network between points (our paths)—unlike k-means, which requires raw coordinates and assumes spherical clusters in geometric space and fit all the outliers in the clusters.

AS a result, for our case, where houses are separated by orography, paths, and complex access routes, DBSCAN is a more suitable method than the k-means. Because of this, we apply DBSCAN to identify meaningful clusters of rural houses. The procedure requires two main steps:

Building the dataset: We construct the full distance matrix, measuring the network-based distance from every house to every other house. This allows DBSCAN to work with the real distances that define how people actually move across the territory.
Running DBSCAN – Instead of specifying the number of clusters (as in k-means), DBSCAN requires two parameters: ε (epsilon): the maximum distance between two houses for them to be considered “neighbours.”; MinPts: the minimum number of houses required to form a cluster.

These two parameters shape the clusters: low values generate small, tight clusters; high values merge areas into larger units. Choosing ε and MinPts is an iterative and interpretive process, which we refine based on both statistical indicators and local knowledge.

Building the dataset: OD-Matrix (Origin-Destination-Matrix) computation

To build the dataset we will be using QNEAT3, an algortihm in QGIS². As we have all the buildings as points (the sames we used for the k-means) we jsut run the algorithm when taking the routes that IDEIB (From the Goverment of the Balearic Islands) provide as Open Data in their website. These routes are the complete network of public routes, paths and trails of all the Balearic Islands. Thus, from here, we run the algorithm “OD-Matrix from Points as Lines (n:n)” and looks for the shortest-path based on distance (this is what path optimizes the minimum meters,as it is the units we use in the geographical projection, runned). Costs components are split up into Entry-Cost (from a house of origin to the path), Network-Cost (the shortest distance of the path to the destination), Exit-Cost (the distance from the path at its final point to the destination) and Total Cost (sum of all other cost components). Here one can have more information on the algorithm: https://root676.github.io/OdMatrixAlgs.html.

By taking the total_cost we can build a matrix of distances that reflect the entire distance among the more than 2500 houses that the municipality has. A total combination of 6,937,956 combination of distances. From here, we can build the matrix that it looks like this:

The OD-Matrix structure
	House 1	House 2	House 3	House k
House 1	0	2	5	X
House 2	2	0	8	Y
House 3	5	8	0	Z
…
House k	X	Y	Z	0

Here there is the code to build that matrix:

df <- st_drop_geometry(matrix)

df_clean <- df %>%
  group_by(origin_id, destinatio) %>%
  summarise(total_cost = min(total_cost, na.rm = TRUE)) %>%
  ungroup()

matrix_wider <- df_clean %>% 
  pivot_wider(
    names_from = destinatio,
    values_from = total_cost
  )

rownames(matrix_wider) <- matrix_wider$origin_id
matrix_wider_2 <- matrix_wider %>% select(-origin_id)

distance_matrix <- as.matrix(matrix_wider_2)
distance_matrix <- apply(distance_matrix, 2, as.numeric)
distance_matrix <- as.matrix(distance_matrix)

It is important that there is no NA or any Infinite numbers, as DBSCAN cannot operate with them in the matrix. They reflect in QNEAT3 that houses are too far compared to the others to compute. For this reason, what we do in this cases is, instead of eliminating them, returning a number that is big enough so it will not be counted to any of the clusters (as DBSCAN eliminates the noise).

sum(is.na(distance_matrix))

## [1] 0

sum(is.infinite(distance_matrix))

## [1] 110674

finite_vals <- distance_matrix[is.finite(distance_matrix)]

max_val <- max(finite_vals, na.rm = TRUE)
big_val <- max_val * 2

distance_matrix[!is.finite(distance_matrix)] <- big_val

Operating the DBSCAN Cluster

From here we can operate, yet before we have to decide the parameters. What we will do, in this case, is run different combinations of them and compare their distributions. We use, in this case the formula KNNdist. We generated a set of candidate ε values based on the distribution of k-nearest-neighbour distances between houses (from the 80th to the 98th percentile), and combined them with a range of MinPts values commonly used in density-based clustering. For each combination, DBSCAN was run on the full distance matrix of Pollenca’s rural houses, and we recorded both the number of clusters produced and the proportion of houses classified as noise. We then filtered the results to retain only those combinations that produced a reasonable number of clusters (between 5 and 50) and a manageable amount of noise (under 30%).

From there we will be selecting the one that gives us a high number of clusters (similar tot eh K-means) and, at the same time, has a meaningful distribution among the clusters.

dist_obj <- as.dist(distance_matrix)


k <- 10
d <- kNNdist(dist_obj, k = k)
eps_candidates <- as.numeric(quantile(d, probs = seq(0.80, 0.98, by = 0.03)))
MinPts_values <- c(5, 10, 15, 20, 30)

results <- expand.grid(
  eps = eps_candidates,
  MinPts = MinPts_values
)

results$n_clusters <- NA
results$noise_ratio <- NA

for (i in seq_len(nrow(results))) {
  eps_i    <- results$eps[i]
  MinPts_i <- results$MinPts[i]
  
  db_i <- fpc::dbscan(
    distance_matrix,
    eps    = eps_i,
    MinPts = MinPts_i,
    method = "dist"
  )
  
  tab <- table(db_i$cluster)
  results$n_clusters[i]  <- sum(names(tab) != "0")
  results$noise_ratio[i] <- if ("0" %in% names(tab)) tab["0"] / sum(tab) else 0
}



choices_results <- subset(results, n_clusters >= 5 & n_clusters <= 50 & noise_ratio <= 0.3)
knitr::kable(results,
             caption="Best results of KNNdist to determine parameters DBSCAN")

Best results of KNNdist to determine parameters DBSCAN
eps	MinPts	n_clusters	noise_ratio
589.0909	5	13	0.0503876
629.8899	5	11	0.0430233
689.4329	5	9	0.0383721
775.2191	5	10	0.0321705
932.2440	5	8	0.0205426
1177.2114	5	6	0.0147287
2222.8941	5	3	0.0089147
589.0909	10	7	0.0972868
629.8899	10	7	0.0798450
689.4329	10	5	0.0697674
775.2191	10	4	0.0550388
932.2440	10	4	0.0403101
1177.2114	10	4	0.0298450
2222.8941	10	1	0.0155039
589.0909	15	5	0.1317829
629.8899	15	6	0.1100775
689.4329	15	5	0.0856589
775.2191	15	3	0.0720930
932.2440	15	2	0.0596899
1177.2114	15	2	0.0445736
2222.8941	15	1	0.0155039
589.0909	20	7	0.1829457
629.8899	20	4	0.1546512
689.4329	20	4	0.1197674
775.2191	20	3	0.0984496
932.2440	20	2	0.0701550
1177.2114	20	2	0.0519380
2222.8941	20	1	0.0166667
589.0909	30	15	0.3294574
629.8899	30	11	0.2724806
689.4329	30	6	0.1658915
775.2191	30	3	0.1306202
932.2440	30	3	0.0937984
1177.2114	30	2	0.0573643
2222.8941	30	1	0.0220930

Having seen this, we chose different combinations and we operate:

db_500_30 <- fpc::dbscan(distance_matrix, eps = 589, MinPts = 30, method = "dist")
db_630_30 <- fpc::dbscan(distance_matrix, eps = 630, MinPts = 30, method = "dist")

knitr::kable(table(db_500_30$cluster),
             caption = "Cluster frequencies for DBSCAN (eps = 589, MinPts = 30)")

Cluster frequencies for DBSCAN (eps = 589, MinPts = 30)
Var1	Freq
0	850
1	577
2	115
3	72
4	80
5	33
6	255
7	54
8	31
9	37
10	116
11	39
12	39
13	33
14	132
15	117

knitr::kable(table(db_630_30$cluster),
             caption = "Cluster frequencies for DBSCAN (eps = 630, MinPts = 30)")

Cluster frequencies for DBSCAN (eps = 630, MinPts = 30)
Var1	Freq
0	703
1	613
2	175
3	78
4	86
5	35
6	588
7	59
8	42
9	31
10	34
11	136

Now we export the results to a shapefile and map them in QGIS. Here the code and the final cartography:

matrix_wider_cl <- matrix_wider %>%
  mutate(cluster_500 = db_500_30$cluster,
         cluster_600 = db_630_30$cluster) %>% 
  select(cluster_500,cluster_600, origin_id)

Compared to k-means (k = 10), DBSCAN leaves out some observations as noise, yet it produces clusters that align more naturally with the town’s lived neighbourhoods. The overall spatial pattern does not differ substantially from k-means, but DBSCAN offers a finer articulation of local densities and transitions. To assess whether these clusters resonated with local knowledge, we interviewed three residents of Pollença, and a local geographer familiar with the area. While they initially recognised the larger, more contiguous areas produced by k-means, they related more strongly to the DBSCAN results once we narrowed the scale. At this finer resolution, DBSCAN captured boundaries and micro-territories that corresponded more closely to their everyday spatial experience. For this reason, DBSCAN will be used in the subsequent analysis.

Furthermore, our aim is to interpret density as a proxy for how policies might connect households to water infrastructure and consumption patterns. Density in a networked rural landscape is a more meaningful indicator for policy design than the broad, somewhat arbitrary partitions generated by k-means. DBSCAN offers us that density-driven lens. The results can be consulted in the “Pollenca” portal of this website.

Water Consumption: How Much Water Do the Houses That Do Not Exist Consume?

How can we estimate the water consumption of houses that, in official terms, do not even exist? That is the paradox we are trying to confront.

Rural houses in Mallorca are not connected to the potable water network—neither to the municipal system nor to private providers. They depend on their own water wells, cisterns, tanks, or other improvised systems. These sources of water are often as irregular as the houses themselves. And yet, every summer, pools are filled, gardens are watered, and thousands of visitors stay in these properties as tourist rentals. Despite this, no one has ever answered how much water these houses actually use, nor how this hidden consumption compares to the officially measured supply of the municipality—the “visible city,” the one that is connected. Our aim is to predict, at least as a proxy, the water consumption of houses located on rural land.

Building the Dataset of Water Consumption

How do we plan to do this? By leaning on what is visible. We have access to detailed water-consumption data for houses in urban areas that share similar typologies and morphological patterns with rural informal houses. In Pollença, two neighborhoods—Gotmar and Llenaire—mirror the density, size, and architectural logic of rural houses. In addition, Siller provides a key bridge case: historically single-family houses in rural land that, for contingent reasons, are connected to the public water network. Thus, We could access to the dataset of all those spaces, having the consumption of 143 houses for every two months since 2019 until the end of 2024.

For all these areas, we can recreate on those places the same dataset we have already built for rural houses: cadastral variables, if they are rental variables, and spatial characteristics. The only extra step is to add their real water-consumption records. With this combined dataset, we can train a predictive model that learns the relationship between house characteristics and water usage. Once trained, the model can be applied to the rural houses we have detected and filtered in our case study. The result will not be a perfect measurement, but it will be a robust estimate—a way to illuminate the hydrological shadows of Pollença’s countryside.

From all the dataset, to ensure that we are modeling houses with actual water use, we retain only those cases whose average annual consumption exceeds what would correspond to roughly two months of occupancy. Based on the dataset, this threshold is around 60 m³, which we use as our minimum cutoff.

Consum_Idemser <- Consum_REF %>% 
  group_by(Idemser) %>% 
  summarise(CONSUMO_TOTAL=sum(CONSUMO, na.rm = TRUE), .groups = "drop")

hist(Consum_Idemser$CONSUMO_TOTAL,breaks = 30)

Consum_REF_ERROR0 <- Consum_REF %>% 
  group_by(Idemser) %>% 
  summarise(CONSUMO_TOTAL=sum(CONSUMO, na.rm = TRUE), .groups = "drop") %>% 
  filter(CONSUMO_TOTAL<60) 
Consum_REF_ERROR0_cadena <- Consum_REF_ERROR0 %>% 
  pull(Idemser)

In the previous histogram we can see a distribution of consumption really skewed, as there are some very extreme outliers. We will treat them not as errors in the dataset, but as realities: many houses that might be bigger and have pool do consume large amounts of water. So, for now, we will maintain them. We will decide if later we eliminate them or not.

We also are creating a new varaiable: season. This variables groups the months were the tourism season is booming, as we expect to be the months with more consumption than the others. We also add 2020 as the year of the covid, as we expect that this year behaved differently than the others. We do not know, as of now, if this variables are going to be relevant, but we believe that, at the end of the day, they might have an influence in the models we are going to build. Also, just as a description proxy, they might unveil how data behaves. So, the final dataset of the water consumption looks like this:

##  [1] "Barrio"   "Idemser"  "CONSUMO"  "MESES"    "ANY"      "Calle"   
##  [7] "NumCalle" "Bloque"   "Escalera" "Piso"     "Puerta"   "Resto"   
## [13] "REFCAT"

What we observe is that seasonality has a clear and consistent impact on water consumption, regardless of the year. The temporal pattern repeats itself: the same months tend to produce similar peaks and troughs, suggesting that seasonal dynamics drive consumption in a way that is stable across time.

If we see the years of the Covid-19 (2020) we see that, yes, during summer is when less water is consumed. A trend followed by the beginning of 2021, when Covid was also hitting and the season started more slow. Moreover, we see how in 2020 the overall consumption was lower than in last year, indication COVID had an effect:

In summary, we can see several clear trends:

Seasonality is the dominant force shaping water consumption. Across all years, summer shows a marked increase compared to winter, reflecting the intensity of tourist rentals and the associated uses—pools, irrigation, higher occupancy, and overall seasonal pressure. Although each year has its nuances, the overall pattern repeats itself, with 2019 and 2020 presenting the strongest deviations.
Covid-19 produced a reduction in water consumption. Mobility restrictions in 2020–2021 created a kind of “baseline” year, useful for understanding what water use looks like under minimal tourism pressure. Even then, summer consumption remained relatively high.

Joining the Dataset of Water Consumption with the Dataset of Cadastral Information

Now that we have the water-consumption data for the houses, the next step is to merge it with the cadastral information. This linkage is essential, because our prediction relies on the idea that water use is shaped by housing typology: whether a dwelling has a pool, whether it is used for tourist rentals, the number of rooms it contains, the size of the parcel, and so on. With that logic in mind, we need to replicate—for the houses with water data—the same variables and structure that we built for the rural houses in Pollença.

A small reminder: we already processed the data for Siller, as it is in rural land, but the remaining cases are located on urban land. Since cadastral information is organised separately for urban and rural parcels, we need to rerun the entire preprocessing pipeline we used earlier: extracting the urban parcels, cleaning identifiers, joining attributes, and reconstructing the same dataset structure. Only then can we combine everything into a unified table that is ready for modelling.

## 
## FALSE  TRUE 
##  1587   301

Barrio	Idemser	CONSUMO	MESES	ANY	Calle	NumCalle	Bloque	Escalera	Piso	Puerta	Resto	REFCAT	season	PI	AREA_PI	AREA_Plantes	MAX_Plantes	N_Plantes	Any	Buildings	SuperB	AREA_Parc
SILLER	S61875	16	ENE.FEB	2019	PG POLIGON 9, 150	150	NA	NA	NA	NA	NA	07042A00900150	0	1	29.98	154.36	2	4	2002	1	315	3517
SILLER	S61875	11	MAR.ABR	2019	PG POLIGON 9, 150	150	NA	NA	NA	NA	NA	07042A00900150	0	1	29.98	154.36	2	4	2002	1	315	3517
SILLER	S61875	50	MAY.JUN	2019	PG POLIGON 9, 150	150	NA	NA	NA	NA	NA	07042A00900150	1	1	29.98	154.36	2	4	2002	1	315	3517
SILLER	S61875	13	JUL.AGO	2019	PG POLIGON 9, 150	150	NA	NA	NA	NA	NA	07042A00900150	1	1	29.98	154.36	2	4	2002	1	315	3517
SILLER	S61875	12	SEP.OCT	2019	PG POLIGON 9, 150	150	NA	NA	NA	NA	NA	07042A00900150	1	1	29.98	154.36	2	4	2002	1	315	3517

## # A tibble: 10 × 10
##    REFCAT            PI AREA_PI AREA_Plantes MAX_Plantes N_Plantes `31_pc` Any  
##    <chr>          <dbl>   <dbl>        <dbl>       <int>     <dbl> <chr>   <chr>
##  1 1043901EE0114N     1    34.8        145.            2         4 104390… 2022 
##  2 1043902EE0114S     0     0            0            NA         0 <NA>    <NA> 
##  3 1043903EE0114S     0     0          109.            2         5 104390… 1980 
##  4 1043904EE0114S     0     0           59.1           3         3 104390… 1995 
##  5 1043905EE0114S     1    13.4         51.2           2         2 104390… 1953 
##  6 1043906EE0114S     0     0           76.6           2         2 104390… 1960 
##  7 1043907EE0114S     0     0           64             2         3 104390… 1975 
##  8 1043908EE0114S     1     4.5         75.9           2         1 104390… 1970 
##  9 1043909EE0114S     0     0           74.2           3         3 104390… 1980 
## 10 1043910EE0114S     0     0           80.0           2         3 104391… 1982 
## # ℹ 2 more variables: Buildings <int>, SuperB <dbl>

Barrio	Idemser	CONSUMO	MESES	ANY	Calle	NumCalle	Bloque	Escalera	Piso	Puerta	Resto	REFCAT	season	MAX_Plantes	Any	Buildings	SuperB	AREA_Parc
GOTMAR	S62820	32	ENE.FEB	2019	CL PINSA	NA	NA	NA	NA	NA	NA	5870602EE0157S	0	NA	NA	NA	NA	437
GOTMAR	S62820	106	MAR.ABR	2019	CL PINSA	NA	NA	NA	NA	NA	NA	5870602EE0157S	0	NA	NA	NA	NA	437
GOTMAR	S62820	145	MAY.JUN	2019	CL PINSA	NA	NA	NA	NA	NA	NA	5870602EE0157S	1	NA	NA	NA	NA	437
GOTMAR	S62820	237	JUL.AGO	2019	CL PINSA	NA	NA	NA	NA	NA	NA	5870602EE0157S	1	NA	NA	NA	NA	437
GOTMAR	S62820	1081	SEP.OCT	2019	CL PINSA	NA	NA	NA	NA	NA	NA	5870602EE0157S	1	NA	NA	NA	NA	437

## [1] "000901800EE01E" "002202600EE01H" "002206100EE01H" "07042A00900108"
## [5] "07042A00900142" "07042A00900244" "07042A00900276" "07042A00900310"
## [9] "07042A00900586"

## [1] "5870602EE0157S" "6367101EE0166N" "6553203EE0165N" "6553205EE0165N"
## [5] "6652601EE0165N" "6667927EE0166N"

##     REFCAT                n         
##  Length:143         Min.   : 11.00  
##  Class :character   1st Qu.: 36.00  
##  Mode  :character   Median : 36.00  
##                     Mean   : 38.08  
##                     3rd Qu.: 36.00  
##                     Max.   :180.00

Now we only need to add the rental-status information and adjust all variables so they can be used properly in the model. This means converting each field to its correct format—whether as factors, integers, numeric values, or logical variables—so that the dataset is fully consistent and ready for the predictive analysis.

##  [1] "Idemser"      "CONSUMO"      "season"       "covid"        "PI"          
##  [6] "AREA_PI"      "AREA_Plantes" "MAX_Plantes"  "Buildings"    "SuperB"      
## [11] "AREA_Parc"    "Rental_is"    "Rental_bed"   "Any_constru"  "Any_consum"  
## [16] "Bimonth"

Prediction models

The first step is to create a train/test split of our dataset. Because our data is longitudinal, this split cannot be done by simply randomizing individual observations. If we did so, the same house could appear in both the training and test sets—just in different years or months—which would create data leakage. In that scenario, the model would indirectly “see” part of the house’s history during training, and then be evaluated on another part of the same house, leading to artificially inflated performance.

To properly evaluate the model’s ability to predict new houses—not just new time periods—the split must be done at the house level. In other words, all observations associated with a given house must be assigned either to the training set or to the test set, but never to both. This ensures a realistic assessment of how well the model generalizes to houses it has never encountered.

We will be using a split 80/20, as we have a little amount of data and we need to have enough for the training set.

set.seed(10, sample.kind = "Rejection")

library(dplyr)


unique_cases <- unique(Consum_Ref_Complet$Idemser)

train_cases <- sample(unique_cases, size = floor(0.8 * length(unique_cases)))

consum_train <- Consum_Ref_Complet %>% 
  filter(Idemser %in% train_cases)

consum_test <- Consum_Ref_Complet %>% 
  filter(!Idemser %in% train_cases)

Then we build a function to calculate the statistics for the training and test sets, and to compare the performance of different models. We also create a comparison matrix to summarise the results.

rm_id <- function(df) {
  if ("Idemser" %in% names(df)) dplyr::select(df, -Idemser) else df
}
r2_train_generic <- function(model, data, y, transform = c("identity","log1p")) {
  transform <- match.arg(transform)
  y_true <- data[[y]]
  y_pred <- predict(model, newdata = rm_id(data))

  if (transform == "log1p") {
    y_true <- log1p(y_true)  # log(y+1)
  }
  sse <- sum((y_true - y_pred)^2, na.rm = TRUE)
  sst <- sum((y_true - mean(y_true, na.rm = TRUE))^2, na.rm = TRUE)
  1 - sse/sst
}


osr2_generic <- function(model, train, test, y, transform = c("identity","log1p")) {
  transform <- match.arg(transform)
  y_tr <- train[[y]]
  y_te <- test[[y]]
  y_pred_te <- predict(model, newdata = rm_id(test))

  if (transform == "log1p") {
    y_tr <- log1p(y_tr)
    y_te <- log1p(y_te)
  }
  sse <- sum((y_te - y_pred_te)^2, na.rm = TRUE)
  denom <- sum((y_te - mean(y_tr, na.rm = TRUE))^2, na.rm = TRUE)
  1 - sse/denom
}


rmse_train_generic <- function(model, data, y, transform = c("identity","log1p")) {
  transform <- match.arg(transform)
  y_true <- data[[y]]
  y_pred <- predict(model, newdata = rm_id(data))

  if (transform == "log1p") {
    y_true <- log1p(y_true)
    y_pred <- log1p(y_pred)
  }

  sqrt(mean((y_true - y_pred)^2, na.rm = TRUE))
}

rmse_test_generic <- function(model, test, y, transform = c("identity","log1p")) {
  transform <- match.arg(transform)
  y_true <- test[[y]]
  y_pred <- predict(model, newdata = rm_id(test))

  if (transform == "log1p") {
    y_true <- log1p(y_true)
    y_pred <- log1p(y_pred)
  }

  sqrt(mean((y_true - y_pred)^2, na.rm = TRUE))
}


results <- data.frame(
  name = c(NA, NA,NA,NA,NA, NA, NA, NA, NA,NA), 
  r2 = c(NA, NA,NA,NA,NA, NA, NA, NA, NA,NA),
  rmse = c(NA, NA,NA,NA,NA, NA, NA, NA, NA, NA),
  osr2 = c(NA, NA,NA,NA,NA, NA, NA, NA, NA, NA),
  rmse_out = c(NA, NA,NA,NA,NA, NA, NA, NA, NA, NA)
)

results

##    name r2 rmse osr2 rmse_out
## 1    NA NA   NA   NA       NA
## 2    NA NA   NA   NA       NA
## 3    NA NA   NA   NA       NA
## 4    NA NA   NA   NA       NA
## 5    NA NA   NA   NA       NA
## 6    NA NA   NA   NA       NA
## 7    NA NA   NA   NA       NA
## 8    NA NA   NA   NA       NA
## 9    NA NA   NA   NA       NA
## 10   NA NA   NA   NA       NA

Linear Regression (A Naive Model)

We begin with a linear regression using a naïve approach: we include all variables and observe how the model behaves. As expected, the model produces counter-intuitive coefficients and a low R². This is consistent with what we have already observed in the consumption dataset: the relationship between predictors and water use is not linear, and the data does not behave in a way that linear models can capture effectively.

Here we can observe the nonlinerality and the correaltion between the varaibles (in some cases high, that might influence the model):

num_vars <- consum_train %>%
  select(where(is.numeric))

cor_mat <- cor(num_vars, use = "pairwise.complete.obs", method = "pearson")
corrplot(cor_mat, method = "color", type = "upper", 
         tl.cex = 0.7, tl.col = "black", addCoef.col = "black")

plot(consum_train$AREA_Plantes,consum_train$CONSUMO)
plot(consum_train$Any_consum,consum_train$CONSUMO)
plot(consum_train$Bimonth,consum_train$CONSUMO)
plot(consum_train$AREA_PI,consum_train$CONSUMO)
plot(consum_train$SuperB,consum_train$CONSUMO)
plot(consum_train$Rental_bed,consum_train$CONSUMO)
plot(as_factor(consum_train$season),consum_train$CONSUMO)
plot(consum_train$Any_constru,consum_train$CONSUMO)

Here, expecting a bad result, we run the regression:

lm_1 <- lm(data=consum_train, CONSUMO~season+ 
             covid+ 
             PI+
             AREA_PI+ 
             SuperB+ 
               Buildings+
             AREA_Plantes+
               AREA_Parc+
             MAX_Plantes+
             Rental_is+
               Rental_bed+
               Bimonth+
             Any_consum+
             Any_constru)
stargazer::stargazer(lm_1, type = "html", title = "Linear Model (Naive)")

**Linear Model (Naive)**

	Dependent variable:

	CONSUMO

season	57.146^***
	(2.483)

covid	-12.487^***
	(3.223)

PI	-21.440^***
	(4.467)

AREA_PI	0.699^***
	(0.092)

SuperB	-0.089^***
	(0.012)

Buildings	-1.367^*
	(0.723)

AREA_Plantes	0.197^***
	(0.017)

AREA_Parc	-0.001^**
	(0.0003)

MAX_Plantes	14.415^***
	(2.683)

Rental_is	0.797
	(11.284)

Rental_bed	-1.556
	(1.607)

Bimonth	-0.027
	(0.726)

Any_consum	-1.334^*
	(0.747)

Any_constru	0.748^***
	(0.093)

Constant	1,195.974
	(1,517.459)


Observations	4,380
R²	0.185
Adjusted R²	0.183
Residual Std. Error	78.440 (df = 4365)
F Statistic	70.993^*** (df = 14; 4365)

Note:	p<0.1; p<0.05; p<0.01

Linear Regression (Log Transforamtion)

What we observe is that adapting the variables and transforming the dependent variable using a logarithmic scale (given its skewed distribution) improves the model’s stability but does not meaningfully improve its performance. The regression still fails to capture the underlying patterns in the data, and the overall results remain weak:

lm_2 <- lm(log(CONSUMO+1) ~ 
             as.factor(season)+ 
             as.factor(covid)+ 
             PI+
             AREA_PI+ 
             AREA_Parc+
             SuperB+ 
             AREA_Plantes+
             MAX_Plantes+
             Buildings+
             as.factor(Rental_is)+
             as.factor(Any_consum)+
             Any_constru,  data = consum_train)

stargazer::stargazer(lm_2, type = "html", title = "Linear Model (log transf.)")

**Linear Model (log transf.)**

	Dependent variable:

	log(CONSUMO + 1)

as.factor(season)1	1.300^***
	(0.041)

as.factor(covid)1	-0.003
	(0.083)

PI	-0.166^**
	(0.076)

AREA_PI	0.008^***
	(0.002)

AREA_Parc	-0.00003^***
	(0.00001)

SuperB	-0.002^***
	(0.0002)

AREA_Plantes	0.004^***
	(0.0003)

MAX_Plantes	0.362^***
	(0.045)

Buildings	-0.012
	(0.012)

as.factor(Rental_is)1	-0.133^**
	(0.058)

as.factor(Any_consum)2020	-0.387^***
	(0.099)

as.factor(Any_consum)2021	-0.370^***
	(0.075)

as.factor(Any_consum)2022	-0.170^**
	(0.070)

as.factor(Any_consum)2023	-0.133^*
	(0.070)

as.factor(Any_consum)2024	-0.001
	(0.070)

Any_constru	0.010^***
	(0.002)

Constant	-17.955^***
	(3.148)


Observations	4,380
R²	0.258
Adjusted R²	0.255
Residual Std. Error	1.331 (df = 4363)
F Statistic	94.737^*** (df = 16; 4363)

Note:	p<0.1; p<0.05; p<0.01

For this reason, the statistics point out that maybe is better try to find models that can fit non-linearity better:

name	r2	rmse	osr2	rmse_out
Linear Regression (Naive)	0.1854685	78.305866	0.1418402	77.293666
Linear Regression (log)	0.2578410	2.312389	0.1994494	2.230438
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA

Random Forest Models

Random Forest (based on months)

The first non-linear model we introduce is a Random Forest, a tree-based ensemble method capable of capturing complex, non-additive relationships between predictors and water consumption. Random Forests are particularly useful in our context because they do not require linearity, they handle interactions automatically, and they are robust to skewed distributions and outliers. Given the clear non-linear behaviour observed in the consumption data, this method provides a more appropriate starting point for modelling than traditional linear regression.

For the Random Forest we first of all adapt our data to fit the model:

# TRAIN
consum_train <- consum_train %>%
  mutate(
    season     = factor(season),
    covid      = factor(covid),
    PI         = factor(PI),
    Rental_is  = factor(Rental_is),
    Bimonth    = factor(Bimonth),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))
  ) 

# TEST: 
consum_test <- consum_test %>%
  mutate(
    season     = factor(season,    levels = levels(consum_train$season)),
    covid      = factor(covid,     levels = levels(consum_train$covid)),
    PI         = factor(PI,        levels = levels(consum_train$PI)),
    Rental_is  = factor(Rental_is, levels = levels(consum_train$Rental_is)),
    Bimonth    = factor(Bimonth,   levels = levels(consum_train$Bimonth)),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))
  ) 

# TRAIN
consum_train2 <- consum_train %>%
  mutate(
    season     = factor(season),
    covid      = factor(covid),
    PI         = factor(PI),
    Rental_is  = factor(Rental_is),
    Bimonth    = factor(Bimonth),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))
  ) %>% 
  select( -Idemser)

# TEST: 
consum_test2 <- consum_test %>%
  mutate(
    season     = factor(season,    levels = levels(consum_train2$season)),
    covid      = factor(covid,     levels = levels(consum_train2$covid)),
    PI         = factor(PI,        levels = levels(consum_train2$PI)),
    Rental_is  = factor(Rental_is, levels = levels(consum_train2$Rental_is)),
    Bimonth    = factor(Bimonth,   levels = levels(consum_train2$Bimonth)),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))
  ) %>% 
  select( -Idemser)

Once we have done this, we later continue by developing cross-validation to find the best model. We run to cross-validations to determine mtry, one with 500 trees and one with 1000 trees.

# TRAIN
consum_train2 <- consum_train %>%
  mutate(
    season     = factor(season),
    covid      = factor(covid),
    PI         = factor(PI),
    Rental_is  = factor(Rental_is),
    Bimonth    = factor(Bimonth),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))
  ) %>% 
  select( -Idemser)

# TEST: 
consum_test2 <- consum_test %>%
  mutate(
    season     = factor(season,    levels = levels(consum_train2$season)),
    covid      = factor(covid,     levels = levels(consum_train2$covid)),
    PI         = factor(PI,        levels = levels(consum_train2$PI)),
    Rental_is  = factor(Rental_is, levels = levels(consum_train2$Rental_is)),
    Bimonth    = factor(Bimonth,   levels = levels(consum_train2$Bimonth)),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))
  ) %>% 
  select( -Idemser)

set.seed(10, sample.kind = "Rejection")

rf.params = expand.grid( 
  mtry=seq(2, 12, 1)
)


rf.cv = train(
  y = consum_train2$CONSUMO,
  x = subset(consum_train2, select = -c(CONSUMO)),
  method = "rf",
  ntree = 1000,
  trControl=trainControl(method="cv", number=5),
  tuneGrid=rf.params
  )

rf.cv2 = train(
  y = consum_train2$CONSUMO,
  x = subset(consum_train2, select = -c(CONSUMO)),
  method = "rf",
  ntree = 500,
  trControl=trainControl(method="cv", number=5),
  tuneGrid=rf.params
  )


rf.cv

## Random Forest 
## 
## 4380 samples
##   14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3504, 3504, 3503, 3504, 3505 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    63.67785  0.4757492  32.51130
##    3    62.30715  0.4863784  30.96505
##    4    62.27954  0.4865058  30.80385
##    5    62.51995  0.4833830  30.87135
##    6    62.54897  0.4836419  30.93448
##    7    63.02192  0.4770948  31.10582
##    8    63.06952  0.4768367  31.17572
##    9    63.11061  0.4765660  31.23137
##   10    63.18176  0.4760250  31.33210
##   11    63.26988  0.4748759  31.41810
##   12    63.59930  0.4708025  31.56948
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.

rf.cv2

## Random Forest 
## 
## 4380 samples
##   14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3504, 3504, 3504, 3504, 3504 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    63.17794  0.4831725  32.22701
##    3    61.65024  0.4957907  30.56841
##    4    61.74894  0.4937977  30.39507
##    5    61.83573  0.4932712  30.36770
##    6    62.03169  0.4907224  30.44677
##    7    62.24010  0.4883876  30.64051
##    8    62.30726  0.4879025  30.70963
##    9    62.49027  0.4854993  30.82750
##   10    62.72172  0.4825507  30.92220
##   11    62.69594  0.4834006  31.00261
##   12    63.05113  0.4785861  31.12078
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.

From there we build the two models: One with 500 trees and another with 1000 trees:

set.seed(10,sample.kind = "Rejection")

rf_best <- randomForest(
  CONSUMO ~ covid + AREA_PI +
            SuperB + AREA_Plantes + AREA_Parc + MAX_Plantes +
            Bimonth  + Any_constru + Any_consum,
  data  = consum_train2,
  mtry  = 3,
  ntree = 500,
  na.action = na.roughfix,   
  importance = TRUE
)
set.seed(10,sample.kind = "Rejection")

rf_best2 <- randomForest(
  CONSUMO ~ covid + AREA_PI +
            SuperB + AREA_Plantes + AREA_Parc + MAX_Plantes +
            Rental_bed + Bimonth  + Any_constru + Any_consum,
  data  = consum_train2,
  mtry  = 4,
  ntree = 1000,
  na.action = na.roughfix,   
  importance = TRUE
)

To avoid bias and ensure a realistic evaluation of the model, we apply stratified sampling³ at the house level when training the Random Forest.

We work with longitudinal data, where each house (subject) contributes a block of repeated measurements. Since our objective is ultimately to predict consumption at the house level, we want the model to learn stable, house-specific patterns and not be dominated by random fluctuations within a particular house.

However, standard Random Forests perform observation-level bootstrap sampling, which can cause some bootstrap samples to include many observations from certain houses and very few (or none) from others. This imbalance increases noise and reduces stability, because the model may fail to capture the full longitudinal behavior of under-sampled houses.

To address this, we perform stratified sampling by house ID, which forces every tree to receive observations from all houses in proportion to the original data. Conceptually, this moves the sampling strategy closer to subject-level bootstrapping, where houses—not individual observations—are treated as the fundamental sampling units. This reduces within-house noise and helps preserve the structure of longitudinal trajectories.

At the same time, stratifying by house reduces the randomness between trees, so tree diversity decreases. To compensate for this and retain some variability in the forest, we use a sampling fraction of 0.8. This ensures that, although all houses are represented in each tree, their repeated measurements are still randomly subsampled, maintaining enough variation to avoid overfitting while preserving the subject-level structure.

In summary, stratifying by house allows each tree to learn from complete and balanced information about every house, making the model more appropriate for longitudinal prediction, while still retaining diversity through the sampling fraction.

set.seed(10,sample.kind = "Rejection")

rf <- ranger(
  formula = CONSUMO ~ covid + AREA_PI +
            SuperB + AREA_Plantes + AREA_Parc + MAX_Plantes +
            Bimonth + Any_constru + Any_consum,
  data = consum_train,
  num.trees = 500,
  mtry = 3,
  min.node.size = 10,
  replace = TRUE,
  sample.fraction = 0.8,
  case.weights = NULL,
  holdout = FALSE,
  importance = "impurity",
  respect.unordered.factors = TRUE,
  sample.stratify = consum_train$Idemser  
)

Analyzing the variables with the highest importance scores, we see that the most influential predictors are those related to the size and characteristics of the house, the presence of a pool, and the time of the year (bimonthly period). This aligns with our expectations: larger houses, properties with pools, and months associated with tourism and higher occupancy all exert a strong influence on water consumption. At first glance, it might seem surprising that the rental variable shows relatively low importance. However, this is not unexpected in our context. The rental data we use comes from the government registry, which is known to underestimate the real volume of tourist rentals. As a result, the model receives a noisy and incomplete signal for rental activity, reducing its ability to learn its true effect on consumption:

importance_df <- as.data.frame(rf_best$importance)
importance_df$Feature <- rownames(importance_df)
importance_sorted <- importance_df[order(-importance_df$IncNodePurity), ]

ggplot(importance_sorted, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
  geom_bar(stat = "identity", fill = "steelblue") +  # Bar plot with custom color
  coord_flip() +  # Flip to make horizontal bars
  labs(x = "Features", y = "Importance Score", title = "Feature Importance (Random Forest)") +
  theme_classic()

Now we just need to see the statistics on all these models:

name	r2	rmse	osr2	rmse_out
Linear Regression (Naive)	0.1854685	78.305866	0.1418402	77.293666
Linear Regression (log)	0.2578410	2.312389	0.1994494	2.230438
Random Forest (mtry=3, ntree=500)	0.7655280	42.013207	0.2549841	72.018257
Random Forest (mtry=4, ntree=1000)	0.7803432	40.664243	0.2578987	71.877246
Random Forest (stratified)	0.6778578	49.245231	0.2490435	71.493031
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA

What we observed is that we are far still from the desirable outcomes we might need for doing predictions. No OSR2 goes beyond 0.25-0.27, which indicates still a low predictability. Also we see stratifing does not help.

Random Forest (based on seasons)

An alternative approach is to shift from using the full base of longitudinal observations to using seasonal aggregates of water consumption. By summarising consumption at the seasonal level, we reduce the number of observations and therefore the longitudinal complexity of the dataset, while still preserving the core temporal dynamics that matter for prediction. This has several advantages. First, seasonal data smooths out many of the zero-consumption entries that appear in individual bimonthly periods, which often introduce noise and instability in the model. Second, aggregating reduces the influence of occasional measurement errors or missing values that disproportionately affect the row-level dataset. As a result, the model works with cleaner, more representative signals of household behaviour.

Consum_Ref_Complet_season <- Consum_Ref_Complet %>% 
  group_by(Idemser, Any_consum, season) %>% 
  summarise(
    CONSUMO = sum(CONSUMO),
    SuperB = first(SuperB),
    AREA_PI = first(AREA_PI),
    AREA_Plantes = first(AREA_Plantes),
    AREA_Parc = first(AREA_Parc),
    MAX_Plantes = first(MAX_Plantes),
    Any_constru = first(Any_constru),
    covid = first(covid),
    Rental_bed = first(Rental_bed)
  )


set.seed(10, sample.kind = "Rejection")

unique_cases_season <- unique(Consum_Ref_Complet_season$Idemser)
train_cases_season <- sample(unique_cases_season,
                             size = floor(0.8 * length(unique_cases_season)))
consum_train_season <- Consum_Ref_Complet_season %>% 
  filter(Idemser %in% train_cases_season)
consum_test_season <- Consum_Ref_Complet_season %>% 
  filter(!Idemser %in% train_cases_season)



consum_train_season_id <- consum_train_season %>%
  mutate(
    season     = factor(season),
    covid      = factor(covid),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))) %>%
  ungroup()


consum_test_season_id <- consum_test_season %>%
  mutate(
    season     = factor(season,    levels = levels(consum_train2$season)),
    covid      = factor(covid,     levels = levels(consum_train2$covid)),
    Any_consum = as.numeric(as.character(Any_consum)),
    Any_constru= as.numeric(as.character(Any_constru))) %>% 
  ungroup()

consum_train_season <- consum_train_season_id %>%
  select(-Idemser)


consum_test_season <- consum_test_season_id %>%
  select(-Idemser)

set.seed(10, sample.kind = "Rejection")

rf.params = expand.grid( 
  mtry=seq(2, 12, 1)
)

folds <- groupKFold(consum_test_season_id$Idemser, k = 5)

rf.cv = train(
  y = consum_train_season_id$CONSUMO,
  x = subset(consum_train_season_id, select = -c(CONSUMO, Idemser)),
  method = "rf",
  ntree = 500,
  trControl=trainControl(method="cv", number=5, index = folds),
  tuneGrid=rf.params
  )


rf.cv

## Random Forest 
## 
## 1456 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 318, 226, 306, 330, 284 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    221.2472  0.1385404  129.6742
##    3    219.9208  0.1442473  128.4204
##    4    220.8301  0.1401384  129.0043
##    5    221.2201  0.1388178  129.6195
##    6    221.7516  0.1373542  130.6938
##    7    223.2052  0.1330314  132.1619
##    8    224.1601  0.1295247  133.0675
##    9    225.5859  0.1260610  134.2985
##   10    226.8968  0.1222347  135.6166
##   11    226.5350  0.1236820  135.4160
##   12    226.5755  0.1239163  135.6808
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.

And here we run the model using the seasonal dataset and examine the resulting feature importance. The ranking of variables closely resembles the one obtained from the bimonthly observations, suggesting that the key drivers of consumption remain consistent even after aggregation.

set.seed(10)

rf.final <- randomForest(
  CONSUMO ~ .,
  data = consum_train_season,
  ntree = 500,
  mtry = 3
)

rf.final

## 
## Call:
##  randomForest(formula = CONSUMO ~ ., data = consum_train_season,      ntree = 500, mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 22433.33
##                     % Var explained: 54.32

importance_df_final <- as.data.frame(rf.final$importance)
importance_df_final$Feature <- rownames(importance_df_final)
importance_sorted_final <- importance_df_final[order(-importance_df_final$IncNodePurity), ]

ggplot(importance_sorted_final, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
  geom_bar(stat = "identity", fill = "steelblue") +  # Bar plot with custom color
  coord_flip() +  # Flip to make horizontal bars
  labs(x = "Features", y = "Importance Score", title = "Feature Importance (Random Forest)") +
  theme_classic()

Now we run the statistics:

name	r2	rmse	osr2	rmse_out
Linear Regression (Naive)	0.1854685	78.305866	0.1418402	77.293666
Linear Regression (log)	0.2578410	2.312389	0.1994494	2.230438
Random Forest (mtry=3, ntree=500)	0.7655280	42.013207	0.2549841	72.018257
Random Forest (mtry=4, ntree=1000)	0.7803432	40.664243	0.2578987	71.877246
Random Forest (stratified)	0.6778578	49.245231	0.2490435	71.493031
Random Forest (season)	0.8038174	98.153565	0.4438662	132.805355
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA

We see that here we are reaching the OSR2 of 0.45. Still not an extraordinary OSR2, but the best so far.

Boosted Trees

Before ending, we wanted to make sure to also test if the Boosted Trees could build a model that could improve betterwhat we achieved with Random Forest. Here, the entire code of how we build those models

Boosted Trees (Based on months)

set.seed(10, sample.kind = "Rejection") 

grid <- expand.grid(
  n.trees = c(100,500,1000,3000, 5000), 
  interaction.depth = c(1,2, 3, 4, 5,7,9,10),
  shrinkage = c(0.1),
  n.minobsinnode = c(10)
)

gbm.cv <- train(y = consum_train2$CONSUMO,
                      x = subset(consum_train2, select=-c(CONSUMO)),
                      verbose=FALSE,
                      method= "gbm",
                      tuneGrid= grid,
                      trControl= trainControl(method="cv", number= 5, savePredictions = "final")
                  )

#gbm.cv$results[, c("interaction.depth", "n.trees", "Rsquared")]

gbm_plot <- gbm.cv$results %>% mutate(Rsquared = round(Rsquared, 4))  
ggplot(gbm_plot, aes(x = as.factor(n.trees), y = as.factor(interaction.depth), fill = Rsquared)) +
  geom_tile() +
  geom_text(aes(label = Rsquared), color = "white", size = 4) + 
  scale_fill_gradient(low = "blue", high = "red") +
  labs(x = "Number of Trees (n.trees)", y = "Interaction Depth", fill = "R²") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

set.seed(10,sample.kind = "Rejection")

opt_boost = gbm(data=consum_train2, CONSUMO ~ season + covid + AREA_PI +
            SuperB + Buildings + AREA_Plantes + AREA_Parc + MAX_Plantes +
            Rental_bed + Bimonth + Any_constru + Any_consum, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 9,
                shrinkage = 0.1, n.minobsinnode = 10)

pred_tr_gbm <- predict(opt_boost, newdata = consum_train2,    n.trees = 1000, type = "response")
pred_te_gbm <- predict(opt_boost, newdata = consum_test2, n.trees = 1000, type = "response")

y_tr <- consum_train2$CONSUMO
y_te <- consum_test2$CONSUMO
ok   <- is.finite(pred_te_gbm) & is.finite(y_te)

r2_gbm   <- 1 - sum((y_tr - pred_tr_gbm)^2, na.rm = TRUE) /
                sum((y_tr - mean(y_tr, na.rm = TRUE))^2, na.rm = TRUE)

osr2_gbm <- 1 - sum((y_te[ok] - pred_te_gbm[ok])^2) /
                sum((y_te[ok] - mean(y_tr, na.rm = TRUE))^2)
rmse_gbm_train <- sqrt(mean((y_tr - pred_tr_gbm)^2, na.rm = TRUE))
rmse_gbm_test <- sqrt(mean((y_te[ok] - pred_te_gbm[ok])^2, na.rm = TRUE))

results[7, ] <- data.frame(
  name = 'GBM (month)',
  r2   = r2_gbm,
  rmse_train = rmse_gbm_train,
  osr2 = osr2_gbm,
  rmse_test  = rmse_gbm_test
)

Boosted Trees (based on seasons)

set.seed(10, sample.kind = "Rejection") 

grid_season <- expand.grid(
  n.trees = c(100,500,1000,3000, 5000), 
  interaction.depth = c(3, 4, 5,7,9,10),
  shrinkage = c(0.1),
  n.minobsinnode = c(10)
)


gbm.cv_season <- train(y = consum_test_season_id$CONSUMO,
                      x = subset(consum_test_season_id, select=-c(CONSUMO, Idemser)),
                      verbose=FALSE,
                      method= "gbm",
                      tuneGrid= grid_season,
                      trControl= trainControl(method="cv", number= 5, index = folds, savePredictions = "final")
                  )

gbm.cv_season$results[, c("shrinkage", "interaction.depth", "n.trees", "Rsquared")]

##    shrinkage interaction.depth n.trees   Rsquared
## 1        0.1                 3     100 0.06894525
## 6        0.1                 4     100 0.06483845
## 11       0.1                 5     100 0.05414223
## 16       0.1                 7     100 0.06977046
## 21       0.1                 9     100 0.06074108
## 26       0.1                10     100 0.05448118
## 2        0.1                 3     500 0.05700852
## 7        0.1                 4     500 0.04793739
## 12       0.1                 5     500 0.03880238
## 17       0.1                 7     500 0.07920235
## 22       0.1                 9     500 0.08700046
## 27       0.1                10     500 0.05996333
## 3        0.1                 3    1000 0.03616863
## 8        0.1                 4    1000 0.05809100
## 13       0.1                 5    1000 0.04323289
## 18       0.1                 7    1000 0.08377307
## 23       0.1                 9    1000 0.07234957
## 28       0.1                10    1000 0.06451369
## 4        0.1                 3    3000 0.02550142
## 9        0.1                 4    3000 0.06228877
## 14       0.1                 5    3000 0.04982732
## 19       0.1                 7    3000 0.07670388
## 24       0.1                 9    3000 0.08688015
## 29       0.1                10    3000 0.08119689
## 5        0.1                 3    5000 0.02606177
## 10       0.1                 4    5000 0.05856836
## 15       0.1                 5    5000 0.05579936
## 20       0.1                 7    5000 0.07888999
## 25       0.1                 9    5000 0.08315281
## 30       0.1                10    5000 0.08152309

gbm.cv_season$bestTune

##    n.trees interaction.depth shrinkage n.minobsinnode
## 17     500                 7       0.1             10

gbm_plot_season <- gbm.cv_season$results %>% mutate(Rsquared = round(Rsquared, 4))  # Round Rsquared to 2 decimal places

# Create the heatmap
ggplot(gbm_plot_season, aes(x = as.factor(n.trees), y = as.factor(interaction.depth), fill = Rsquared)) +
  geom_tile() +
  geom_text(aes(label = Rsquared), color = "white", size = 4) + 
  scale_fill_gradient(low = "blue", high = "red") +
  labs(x = "Number of Trees (n.trees)", y = "Interaction Depth", fill = "R²") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

name	r2	rmse	osr2	rmse_out
Linear Regression (Naive)	0.1854685	78.305866	0.1418402	77.293666
Linear Regression (log)	0.2578410	2.312389	0.1994494	2.230438
Random Forest (mtry=3, ntree=500)	0.7655280	42.013207	0.2549841	72.018257
Random Forest (mtry=4, ntree=1000)	0.7803432	40.664243	0.2578987	71.877246
Random Forest (stratified)	0.6778578	49.245231	0.2490435	71.493031
Random Forest (season)	0.8038174	98.153565	0.4438662	132.805355
GBM (month)	0.8152655	37.291889	0.1768161	75.702159
GBM (season)	0.7220038	116.841008	0.3985741	138.107416
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA

Here we observe that our OSR² is lower than that of the Random Forest based on seasons, so we are going to use the latter to build our model.

Final Prediction

Before using the model for prediction (Random Forest based on seasons), one precaution is needed. Our OSR² remains relatively low when we attempt to predict the consumption of every individual house for every individual year. To soften this turbulence, we shift our strategy from predicting isolated points to predicting groups of houses, where errors tend to compensate when aggregated. Once grouped, we work with the average consumption across years, which further stabilizes the signal by smoothing annual fluctuations. In practice, we average the predictions across all years and then consistently use aggregations of these house-level averages, which substantially reduces prediction error.

Ultimately, our interest is not in forecasting the precise behavior of a single house but in understanding patterns at the scale of areas, since policy interventions are designed and implemented at that collective level. This rationale will be developed further when we connect these predictions to the neighborhood clusters.

From here, we prepare the dataset that includes all houses in Pollença in rural land, covering every available year and separating each year into two distinct seasons. This allows us to work with a complete temporal–seasonal panel of consumption for the entire housing stock that we need to predict.

buildings_complet_water <- buildings_complet_df %>% 
  mutate(SuperB=as.numeric(Cat_SuperB),
         AREA_PI= as.numeric(Cat_AREA_P),
         AREA_Plantes=as.numeric(Cat_AREA_1),
         AREA_Parc=as.numeric(Parc_areaV),
         MAX_Plantes=as.numeric(Cat_MAX_Pl),
         Any_constru=as.numeric(Cat_Any),
         Rental_bed=ifelse(is.na(TuriGOIB_P), 0, TuriGOIB_P)) %>% 
  filter(Munic_NOM == "Pollença") %>% 
  select(c("Cat_REFCAT","SuperB","AREA_PI","AREA_Plantes","AREA_Parc","MAX_Plantes","Any_constru","Rental_bed")) %>% 
  filter(!is.na(AREA_Plantes))

buildings_complet_water_long <- buildings_complet_water %>%
  tidyr::crossing(
    Any_consum = 2019:2024,
    season     = c(0, 1),
  ) %>% 
  mutate(covid=as.factor(ifelse(Any_consum==2020, 1, 0)),
         season = as.factor(season))

Now we can make the predictions:

x_new  <- buildings_complet_water_long %>% select(-Cat_REFCAT)
pred   <- predict(rf.final, newdata = x_new)


buildings_complet_water_pred <- buildings_complet_water_long %>%
  mutate(pred = pred)

Building IC with QRF

To strengthen our confidence in the model—especially given the low OSR²—we construct confidence intervals around the predictions. We do this by using the full distribution of predictions produced by the Random Forest. Specifically, for each observation we extract the 5th percentile (the lower bound) from all tree-level predictions, and then the 90th percentile (the upper bound). For doing this we use a Quantile Random Forest formula (defined in range)⁴. These two percentiles (5th and 90th) form an empirical confidence interval for the averaged predicted value of 90% (the difference of the two bounds), capturing the range within which the model expects consumption to fall when accounting for uncertainty within the forest.

rq_final <- ranger(
  formula = CONSUMO ~ .,
  data = consum_train_season,
  num.trees = 500,
  mtry = 3,
  quantreg = TRUE)


pred_q <- predict(
  rq_final,
  data      = x_new,
  type      = "quantiles",
  quantiles = c(0.05, 0.5, 0.95)
)

low90  <- pred_q$predictions[, 1]   # 5%
med    <- pred_q$predictions[, 2]   # 50% (mediana)
high90 <- pred_q$predictions[, 3]   # 95%

buildings_complet_water_pred <- buildings_complet_water_pred %>%
    mutate(
    pred_low90  = low90,
    pred_med    = med,
    pred_high90 = high90
  )

summary(buildings_complet_water_pred)

##   Cat_REFCAT            SuperB          AREA_PI        AREA_Plantes    
##  Length:30768       Min.   :  17.0   Min.   :  0.00   Min.   :  10.15  
##  Class :character   1st Qu.: 108.0   1st Qu.:  0.00   1st Qu.: 113.43  
##  Mode  :character   Median : 164.0   Median : 34.27   Median : 160.85  
##                     Mean   : 204.6   Mean   : 32.10   Mean   : 201.90  
##                     3rd Qu.: 249.0   3rd Qu.: 49.35   3rd Qu.: 234.24  
##                     Max.   :5327.0   Max.   :883.88   Max.   :3405.61  
##    AREA_Parc         MAX_Plantes     Any_constru     Rental_bed    
##  Min.   :      44   Min.   :1.000   Min.   :1700   Min.   : 0.000  
##  1st Qu.:    2437   1st Qu.:1.000   1st Qu.:1980   1st Qu.: 0.000  
##  Median :    4538   Median :2.000   Median :1993   Median : 0.000  
##  Mean   :   37854   Mean   :1.619   Mean   :1989   Mean   : 1.812  
##  3rd Qu.:   11515   3rd Qu.:2.000   3rd Qu.:2003   3rd Qu.: 4.000  
##  Max.   :12719556   Max.   :4.000   Max.   :2024   Max.   :22.000  
##    Any_consum   season    covid          pred           pred_low90     
##  Min.   :2019   0:15384   0:25640   Min.   :  6.306   Min.   :  0.000  
##  1st Qu.:2020   1:15384   1: 5128   1st Qu.: 76.660   1st Qu.:  0.000  
##  Median :2022                       Median :117.092   Median :  0.000  
##  Mean   :2022                       Mean   :146.536   Mean   :  6.545  
##  3rd Qu.:2023                       3rd Qu.:213.896   3rd Qu.:  9.000  
##  Max.   :2024                       Max.   :670.575   Max.   :137.850  
##     pred_med       pred_high90    
##  Min.   :  0.00   Min.   :  26.0  
##  1st Qu.: 42.00   1st Qu.: 222.0  
##  Median : 74.00   Median : 370.1  
##  Mean   : 98.49   Mean   : 456.8  
##  3rd Qu.:147.00   3rd Qu.: 667.0  
##  Max.   :584.00   Max.   :2020.0

Now we just need to do the aggregation in the municipal and in the neighborhood level and have the results presented

Municipal Results

We now have the predicted annual water-consumption averages for each house. The distribution shows two clear elevations: one around 125 m³ and another around 300 m³, suggesting two dominant consumption profiles in the rural-house system. The shape is generally healthy: the median and mean sit close together, hinting that extreme values are not excessively distorting the distribution. Altogether, it offers a coherent landscape from which to trace patterns of use, behaviour, and typology.

buildings_water_pred_anual <- buildings_complet_water_pred %>%
  group_by(Cat_REFCAT, Any_consum) %>% 
  summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
            sum_pred=sum(pred, na.rm = TRUE),
            sum_pred_med=sum(pred, na.rm = TRUE),
            sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>% 
  ungroup() %>% 
  group_by(Cat_REFCAT) %>% 
  summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
            m_pred=mean(sum_pred, na.rm = TRUE),
            m_pred_med=mean(sum_pred_med, na.rm = TRUE),
            m_pred_high=mean(sum_pred_high, na.rm = TRUE))


ggplot(buildings_water_pred_anual, aes(x = m_pred)) +
  geom_histogram(bins = 30, fill = "#4C72B0", color = "white", alpha = 0.7) +
  geom_vline(
    xintercept = mean(buildings_water_pred_anual$m_pred, na.rm = TRUE),
    color = "red",
    linewidth = 1) +
    geom_vline(
    xintercept = median(buildings_water_pred_anual$m_pred, na.rm = TRUE),
    color = "red",
    linewidth = 1,
    linetype = "dashed") +
  theme_classic() +
  labs(
    title = "Histograma del consum anual predit",
    x = "Consum predit (mitjana anual)",
    y = "Nombre de cases"
  )

buildings_water_pred_turistic <- buildings_complet_water_pred %>%
  mutate(rental_is=ifelse(Rental_bed==0,0,1)) %>% 
  group_by(Cat_REFCAT, Any_consum) %>% 
  summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
            sum_pred=sum(pred, na.rm = TRUE),
            sum_pred_med=sum(pred_med, na.rm = TRUE),
            sum_pred_high=sum(pred_high90, na.rm = TRUE),
            rental=rental_is) %>% 
  ungroup() %>% 
  group_by(rental) %>% 
  summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
            m_pred=mean(sum_pred, na.rm = TRUE),
            m_pred_med=mean(sum_pred_med, na.rm = TRUE),
            m_pred_high=mean(sum_pred_high, na.rm = TRUE))

buildings_water_pred_pool_med <- buildings_complet_water_pred %>%
  mutate(pool=ifelse(AREA_PI==0,0,1)) %>% 
  group_by(Cat_REFCAT, Any_consum) %>% 
  summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
            sum_pred=sum(pred, na.rm = TRUE),
            sum_pred_med=sum(pred_med, na.rm = TRUE),
            sum_pred_high=sum(pred_high90, na.rm = TRUE),
            pool=pool) %>% 
  ungroup() %>% 
  group_by(pool) %>% 
  summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
            m_pred=mean(sum_pred, na.rm = TRUE),
            m_pred_med=mean(sum_pred_med, na.rm = TRUE),
            m_pred_high=mean(sum_pred_high, na.rm = TRUE))


#Per tot el municipi
buildings_water_pred_anual_gl <- buildings_complet_water_pred %>%
  group_by(Any_consum) %>% 
  summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
            sum_pred=sum(pred, na.rm = TRUE),
            sum_pred_med=sum(pred_med, na.rm = TRUE),
            sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>% 
  ungroup() %>% 
  group_by() %>% 
  summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
            m_pred=mean(sum_pred, na.rm = TRUE),
            m_pred_med=mean(sum_pred_med, na.rm = TRUE),
            m_pred_high=mean(sum_pred_high, na.rm = TRUE))

buildings_water_pred_anual_season <- buildings_complet_water_pred %>%
  group_by(Any_consum, season) %>% 
  summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
            sum_pred=sum(pred, na.rm = TRUE),
            sum_pred_med=sum(pred_med, na.rm = TRUE),
            sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>% 
  ungroup() %>% 
  group_by(season) %>% 
  summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
            m_pred=mean(sum_pred, na.rm = TRUE),
            m_pred_med=mean(sum_pred_med, na.rm = TRUE),
            m_pred_high=mean(sum_pred_high, na.rm = TRUE))

buildings_water_pred_anual_pools <- buildings_complet_water_pred %>%
  mutate(pool=ifelse(AREA_PI==0,0,1)) %>% 
  group_by(Any_consum, pool) %>% 
  summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
            sum_pred=sum(pred, na.rm = TRUE),
            sum_pred_med=sum(pred_med, na.rm = TRUE),
            sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>% 
  ungroup() %>% 
  group_by(pool) %>% 
  summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
            m_pred=mean(sum_pred, na.rm = TRUE),
            m_pred_med=mean(sum_pred_med, na.rm = TRUE),
            m_pred_high=mean(sum_pred_high, na.rm = TRUE))

buildings_water_pred_anual_par <- buildings_complet_water_pred %>%
  mutate(parcela14=ifelse(AREA_Parc>14000,0,1)) %>% 
  group_by(Any_consum, parcela14) %>% 
  summarise(sum_pred_low=sum(pred_low90, na.rm = TRUE),
            sum_pred=sum(pred, na.rm = TRUE),
            sum_pred_med=sum(pred_med, na.rm = TRUE),
            sum_pred_high=sum(pred_high90, na.rm = TRUE)) %>% 
  ungroup() %>% 
  group_by(parcela14) %>% 
  summarise(m_pred_low=mean(sum_pred_low, na.rm = TRUE),
            m_pred=mean(sum_pred, na.rm = TRUE),
            m_pred_med=mean(sum_pred_med, na.rm = TRUE),
            m_pred_high=mean(sum_pred_high, na.rm = TRUE))




consum_mitan_Pollenca <- 1815000

Our final prediction estimates a total annual water consumption of 751,434.5 m³ for all houses located on rural land. The associated 90% confidence interval ranges from 35,308 to 2,323,442 m³, reflecting the heterogeneity and uncertainty intrinsic to these dispersed and informal dwellings and our low OSR2. Yet, we take as the prediction as adequate⁵.

For comparison, official data from the GOIB indicate that the town’s total water consumption for the same years was 1,815,000 m³ for all the houses that are in urban land and connected to the public provider. This means that rural houses alone account for approximately 41.4% of all municipal water use — despite being located outside the consolidated urban fabric, where permanent residents live year-round. Or said in another word: the consumption of water of the town is 41.4% more than the one expected of the year.

This proportion is striking: nearly half of the municipality’s water demand originates in landscapes that, officially, are not supposed to host intense residential or touristic activity. It underscores how deeply rural land has been transformed into a major water-consuming territory, and how the patterns of tourism-led informality ripple through the municipal infrastructure like a hidden river beneath the cadastral surface

Total amount of water predicted to be consumed by all the houses on rural of land (n. of houses = 2,564)
m_pred_low	m_pred	m_pred_med	m_pred_high
33564.14	751434.5	505069	2342317

## [1] "% over total of the municipality: 41.4%"

But the real contrast emerges when we look at types of houses. The presence of a pool is the great hydraulic divide.

Houses with pools show an average annual consumption of 340 m³, summing up to 598,526.4 m³ in total. This means that this single group accounts for nearly 80% (79.65%) of all predicted water use on rural land. In other words, a minority of houses — the ones equipped for leisure, rental turnover, and high-comfort tourism — draw the overwhelming share of the aquifer.

By contrast, houses without pools consume far less: an average of 190 m³, which is only 56% of the average consumption of houses with pools. Their total predicted consumption is 152,908 m³, a modest fraction in comparison.

The landscape revealed by these numbers is stark. They signal a shift from rural residential logics to a tourism-driven, amenity-heavy occupation of land that presses disproportionately on municipal water systems

Total amount of water predicted to be consumed by all the houses on rural of land depending of presence of pool (n. of houses = 2,564)
pool	m_pred_low	m_pred	m_pred_med	m_pred_high
0	3843.142	152908.1	96298.5	527469.4
1	29721.000	598526.4	408770.5	1814847.9

Annual Avrg. amount of water predicted to be consumed by all the houses on rural of land depending of presence of pool (n. of houses = 2,564)
pool	m_pred_low	m_pred	m_pred_med	m_pred_high
0	4.780027	190.1842	119.7743	656.0565
1	16.886932	340.0718	232.2560	1031.1636

A similar, though less dramatic, difference appears when we look at tourism rentals data. However, we know that identifying these houses through official georeferenced records leads to a clear undercount: many rentals operate outside the formal registry we have and the Government have not georeferenced all the licensced rentals. Thus, the official database we use, as stateted many times, captures only a slice of the real market.

In this context, the presence of a pool becomes a far more reliable proxy for identifying tourism-oriented houses. Pools function as material signals of a rental-ready dwelling — an amenity strongly associated with short-term stays, comfort-driven tourism, and the economic logics that shape rural accommodation.

Total amount of water predicted to be consumed by all the houses on rural of land depending if they are tourism rentals (based on GOIB sources) (n. of houses = 2,564)
rental	m_pred_low	m_pred	m_pred_med	m_pred_high
0	11.88440	277.8137	185.7042	882.4118
1	16.31498	333.8599	227.1418	996.7578

Another way of detecting the imprint of tourism on rural water use is through seasonality. During the high season alone (basically, spring and summer), predicted consumption reaches 524,546 m³, which represents 68% of all annual water use by rural houses. When placed against the municipal figures, this seasonal spike accounts for 29% of the town’s total annual water consumption — a remarkable concentration of demand in just a few months, driven overwhelmingly by tourism-related occupation.

Total amount of water predicted to be consumed by all the houses on rural of land depending if it is season or not (n. of houses = 2,564)
season	m_pred_low	m_pred	m_pred_med	m_pred_high
0	1687.35	234267.9	122332.6	816253.4
1	31876.79	517166.6	382736.4	1526064.0

It is also interesting to examine the consumption of all houses located on parcels below 14,000 m², which, as stated before, is the legal threshold for building a single-family home in the town since the 1990s. This gives us a proxy for identifying the potential informal constructions, as opposed to those that could have been legally built under current regulations. The ones that could not be build today whatsoever represent the 71.7% of the consumption of water of the enire rural area and an additional 29.7% of consumption to the whole town (similar as the numbers of the pools, as all of them are correlated: lower parcel and pools are the main typology of house).

Total amount of water predicted to be consumed by all the houses on rural of land depending of their potential informality (less than 14k m2 of parcel) (n. of houses = 2,564)
parcela14	m_pred_low	m_pred	m_pred_med	m_pred_high
0	9057.833	212160.6	137697.6	652264.9
1	24506.308	539273.9	367371.4	1690052.4

Taken together, these predictions allow us — for the first time — to quantify the hidden hydraulic landscape of rural houses in Pollença. When compared to the municipality as a whole, rural houses — officially peripheral and often invisible in planning — represent over 40% of total water use, and almost a third during summer months alone. This, in a context of climate emergency, reflects a clear issue that needs to be addressed by policymakers and the community.

Mobility Patterns: What is Furthest and How Much We Need to Take the Car?

How can we show that rural houses are structurally car-dependent, and thus, its tourism is less sustainable and creates more congestion?

So, one of the central questions that were raised was whether the rural–land tourism model of our Houses that Do Not Exist produces far greater mobility needs—and therefore higher car dependency—than tourism rentals or hotels located in denser, better-served areas (this is the town and urban spaces). Our hypothesis is intuitive: dispersed houses on rural land require longer trips for almost every daily or leisure activity, while hotels and apartments in consolidated urban areas benefit from walkable distances and better transit access. But we need to demonstrate this with data.

How we will demonstrate it? To make the comparison as fair and illustrative as possible, we propose a simple but powerful thought experiment combined with spatial analysis:

Assume that all houses in the urbanized, dense areas could potentially host tourists This is a hypothetical scenario—not because we think it will happen or it is that way, but becasue it will take us as if we were to optimize movement and thake the furthest of the urban houses to be rentals, hhow will be comapred to an optimization of the closest of them?
Identify four basic mobility needs that nearly all tourists perform These activities structure tourist mobility and allow us to measure everyday accessibility:
1. Going to the supermarket for groceries
2. Visiting an attarction, hence, going to the beach (in this case we are just using current only hotels/apartments and official beach entrances, since these capture the core tourist activity)
3. Going to a restaurant
4. Taking public transportation to visit the central city or areas outside the municipality
For each residential unit (rural house an dense urban house), assign the closest destination based on what is the shortest path to it. This means:
1. the closest supermarket,
2. the closest beach access point,
3. the closest restaurant,
4. the closest bus stop with inter-municipal service.

This is an idealized assumption—tourists may prefer a specific restaurant or supermarket, but assigning the closest option makes the comparison conservative. We measure the minimum possible travel needed. .

Translate distance into mobility mode. Based on standard walkability thresholds (e.g., 800m ≈ 10 min walking; above 1.5 km most people will choose to drive, etc), we can categorize trips as walkable or car-dependent.
Puit everything in a visualization so we can compare the two models (rural vs. compact tourism) We can then quantify: In an ideal scenario, where people can choose the best path and the best road and go to the closest of the amenities and taking all the houses as potential origins, will a tourist be more proabable to take a car in a rural setting or in a urban?

Building the dataset

To build the data we have used the QNEAT3 algorithm of QGIS and using the Roads of the IDEIB (Governemt) to build a nwtwork (as we did to genarate the matrix of distances of our cluster). We use the “OD-Matrix from Points as Lines (m:n)” algorithm And we have this layers as a result:

name	description	type
publictrans_rural	Distance to the all public transport stop or station from all houses on rural land (m)	shp (line)
publictrans_urba	Distance to the all public transport stop or station from all houses in urban areas (m)	shp (line)
super_rural	Distance to the all restaurants from all houses on rural land (m)	shp (line)
super_urba	Distance to the all restaurants from all houses in urban areas (m)	shp (line)
rest_rural	Distance to the all the official beach access points from all houses on rural land (m)	shp (line)
rest_urba	Distance to the all the official beach access points from all houses on urban land (m)	shp (line)
platges_rural	Distance to the all supermarkets from all houses on rural land that the government says are tourism rentals (m)	shp (line)
platges_urba	Distance to the all supermarkets from all hotels in urban land (m)	shp (line)

From here, we need to calculate just the ones in closest distance (this is the closest distance from each house).

publictrans_rural <- publictrans_rural %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()

publictrans_urba <- publictrans_urba %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()



super_rural <- super_rural %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()

super_urba <- super_urba %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()



rest_rural <- rest_rural %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()

rest_urba <- rest_urba %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()


platges_rural <- platges_rural %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()

platges_urba <- platges_urba %>%
  filter(!is.na(total_cost)) %>%
  group_by(destinatio) %>%
  slice_min(order_by = total_cost, n = 1, with_ties = FALSE) %>%
  ungroup()

Visualizing the data

Now we build histograms that include both the rural and the urban cases for each activity. Then, on each histogram, we add reference lines indicating the approximate walkable distance threshold and the point at which trips become car-dependent.

thresholds <- data.frame(
  xintercept = c(800, 1500),
  type       = c("10-min. walk", "Car threshold")
)



publictrans_rural$area_type <- "Rural"
publictrans_urba$area_type  <- "Urban"
publictrans_all <- bind_rows(publictrans_rural, publictrans_urba)



ggplot(publictrans_all, aes(x = total_cost, fill = area_type, color = area_type)) +
  geom_density(alpha = 0.5, linewidth = 1) +
  scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_x_continuous(limits = c(0, 5000)) +
  geom_vline(
    data = thresholds,
    aes(xintercept = xintercept, linetype = type),
    color = "grey50",
    linewidth = 0.7,
    inherit.aes = FALSE
  ) +
  scale_linetype_manual(
    name   = NULL,  # legend title (blank = cleaner)
    values = c("10-min. walk" = "dotted",
               "Car threshold" = "dashed")
  ) +
  labs(
    title = "Distance to Nearest Public Transport Station or Stop",
    x = "Distance (meters)",
    y = "Density",
    fill  = "Area type",
    color = "Area type"
  ) +
  theme_classic()

super_rural$area_type <- "Rural"
super_urba$area_type  <- "Urban"
super_all <- bind_rows(super_rural, super_urba)



ggplot(super_all, aes(x = total_cost, fill = area_type, color = area_type)) +
  geom_density(alpha = 0.5, linewidth = 1) +
  scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_x_continuous(limits = c(0, 5000)) +
  geom_vline(
    data = thresholds,
    aes(xintercept = xintercept, linetype = type),
    color = "grey50",
    linewidth = 0.7,
    inherit.aes = FALSE
  ) +
  scale_linetype_manual(
    name   = NULL,  # legend title (blank = cleaner)
    values = c("10-min. walk" = "dotted",
               "Car threshold" = "dashed")
  ) +
  labs(
    title = "Distance to Nearest Groceries",
    x = "Distance (meters)",
    y = "Density",
    fill  = "Area type",
    color = "Area type"
  ) +
  theme_classic()

rest_rural$area_type <- "Rural"
rest_urba$area_type  <- "Urban"
rest_all <- bind_rows(rest_rural, rest_urba)



ggplot(rest_all, aes(x = total_cost, fill = area_type, color = area_type)) +
  geom_density(alpha = 0.5, linewidth = 1) +
  scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_x_continuous(limits = c(0, 5000)) +
  geom_vline(
    data = thresholds,
    aes(xintercept = xintercept, linetype = type),
    color = "grey50",
    linewidth = 0.7,
    inherit.aes = FALSE
  ) +
  scale_linetype_manual(
    name   = NULL,  
    values = c("10-min. walk" = "dotted",
               "Car threshold" = "dashed")
  ) +
  labs(
    title = "Distance to Nearest Restaurant",
    x = "Distance (meters)",
    y = "Density",
    fill  = "Area type",
    color = "Area type"
  ) +
  theme_classic()

platges_rural$area_type <- "Rural"
platges_urba$area_type  <- "Urban"
platges_urba$destinatio  <- as.character(platges_urba$destinatio)

platges_all <- bind_rows(platges_rural, platges_urba)



ggplot(platges_all, aes(x = total_cost, fill = area_type, color = area_type)) +
  geom_density(alpha = 0.5, linewidth = 1) +
  
  scale_fill_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_color_manual(values = c("Rural" = "#faedcd", "Urban" = "#efe1d0")) +
  scale_x_continuous(limits = c(0, 5000)) +
  geom_vline(
    data = thresholds,
    aes(xintercept = xintercept, linetype = type),
    color = "grey50",
    linewidth = 0.7,
    inherit.aes = FALSE
  ) +
  scale_linetype_manual(
    name   = NULL,  # legend title (blank = cleaner)
    values = c("10-min. walk" = "dotted",
               "Car threshold" = "dashed")
  ) +
  labs(
    title = "Distance to Nearest Beach (from hotel or rental)",
    x = "Distance (meters)",
    y = "Density",
    fill  = "Area type",
    color = "Area type"
  ) +
  theme_classic()

We can also observe in a table how many houses fall below the car-dependency threshold, distinguishing between urban and rural areas in every case:

platges_all <- platges_all %>% 
  mutate(threshold=ifelse(total_cost>1500, 1, 0),
         type = "Platja")
rest_all <- rest_all %>% 
    mutate(threshold=ifelse(total_cost>1500, 1, 0),
         type = "Restaurant")
super_all <- super_all %>% 
  mutate(threshold=ifelse(total_cost>1500, 1, 0),
         type = "Supermercat") 
publictrans_all <- publictrans_all %>% 
  mutate(threshold=ifelse(total_cost>1500, 1, 0),
         type = "Transport Public") 

mobility_all <- platges_all %>% 
  bind_rows(rest_all) %>% 
  bind_rows(super_all) %>% 
  bind_rows(publictrans_all)


mobility_table <- mobility_all %>% 
  pivot_wider(names_from = area_type, values_from = threshold) %>% 
  mutate(Urban=as.numeric(Urban),
         Rural=as.numeric(Rural),
         Urban_total=ifelse(!is.na(Urban),1,0),
         Rural_total=ifelse(!is.na(Rural),1,0))%>% 
  group_by(type) %>% 
  summarise(n_Urban=sum(Urban_total, na.rm = TRUE),
            n_Rural=sum(Rural_total, na.rm = TRUE),
            car_Urban=sum(Urban, na.rm = TRUE),
            car_Rural=sum(Rural, na.rm = TRUE),
            perc_car_urba=(car_Urban/n_Urban)*100,
            perc_car_rural=(car_Rural/n_Rural)*100) %>% 
  st_drop_geometry()

knitr::kable(mobility_table)

type	n_Urban	n_Rural	car_Urban	car_Rural	perc_car_urba	perc_car_rural
Platja	57	731	13	717	22.807018	98.08482
Restaurant	4412	2551	75	1511	1.699909	59.23167
Supermercat	4412	2551	150	1622	3.399819	63.58291
Transport Public	4412	2551	53	1364	1.201269	53.46923

We can also visualize it using a pictogram of dots:

mobility_picto <- mobility_table %>% 
  select(type, perc_car_urba, perc_car_rural) %>% 
  pivot_longer(
    cols = starts_with("perc_car"),
    names_to = "area_type",
    values_to = "perc_car"
  ) %>% 
  mutate(
    area_type = case_when(
      area_type == "perc_car_urba"  ~ "Urban",
      area_type == "perc_car_rural" ~ "Rural",
      TRUE ~ area_type
    ),
    # 1 punt = 5 punts percentuals (pots canviar-ho)
    units = round(perc_car / 5)
  ) %>% 
  filter(units > 0) %>%      # per si algun type té 0%
  uncount(units, .id = "dot_id")          


a <- ggplot(mobility_picto, aes(x = type, y = dot_id, color = area_type)) +
  geom_point(
    size = 5,
    position = position_dodge(width = 0.5) 
  ) +
  coord_flip() +
  scale_color_manual(
    values = c("Urban" = "#efe1d0", "Rural" = "#faedcd"),
    name   = "Area type"
  ) +
  scale_y_continuous(
    breaks = seq(0, max(mobility_picto$dot_id), by = 2),
    labels = function(x) paste0(x * 5, "%")   
  ) +
  labs(
    x = NULL,
    y = "% of houses that need a car to move to..."
  )

a + theme_classic()

With these visualizations we can clearly observe a systematic difference between Urban and Rural areas in their mobility patterns. Rural houses require car use at much higher rates across all types of daily movements.

Final Remarks and Acknowledgments

The framework presented here is designed to be adaptable to other contexts and datasets, offering a replicable approach to understanding complex territorial dynamics through participatory and data-driven tools.

If this methodology is used or adapted in academic or non-academic work, it should be cited accordingly, naming Miquel Rosselló Xamena as the author. Also, the website should be referenced as the main source of the methodology and its results.

Its development has been shaped by the guidance and insights of Cong Cong and Eric Huntley (DUSP, MIT), as well as by the analytical foundations provided in the course Analytics Edge at the MIT Sloan School of Management. Acknowledging these contributions ensures transparency, intellectual integrity, and the continued strengthening of collaborative research. Thanks to them all.

By rural land we refer to all areas classified as non-buildable under Mallorca’s zoning and territorial regulations. To identify these areas, we intersect all construction polygons with the MUIB – Mapa Urbanístic de les Illes Balears, a comprehensive cartographic compilation of planning regulations across the archipelago. The MUIB allows us to spatially distinguish between built, buildable, and non-buildable rural classifications and qualification. Through this overlay, we can determine which constructions are located on rural land and therefore fall within the scope of our analysis.↩︎
Instead we could have used R. However, in R we have a problem: some houses are not tight or aligned to a selevted path. In many cases the house has a considerable setback from tha path as it is located at the opposite edge or at the center of a given parcel. This makes difficult attribute to the house one path and then connect to the next. However, QNEAT3 solve this problem as they calcualte also the “entry_cost” (this is the distance from the point to the path) and the “exit_cost”, this is the distance between the path and destiantion. Moreover, QNEAT3 also calculates based on the shortest path.↩︎
As we are longitudinal data we looked for references in literature. In this case we followed the review done by Hu and Szymczak (2023) (Accessed: https://academic.oup.com/bib/article/24/2/bbad002/6991123) that suggested subject-based sampling. Yet, we were worried as we do not have a lot of observations and are very different among each other. Then, we made further research and we saw that in unbalanced data or lower obervations many studies used stratified. For that reason we decided to try it and see the results.↩︎
We use it following here: r-bloggers.com/2021/04/quantile-regression-forests-for-prediction-intervals/. Due to bugging issues, we run the code into AI and suggested rearranging the QRF into range().↩︎
In all the following tables, m_pred is the prediction of consumption, and (m_pred_low-m_pred_high) represent the IC at the 90%↩︎

Var1	Freq
0	850
1	577
2	115
3	72
4	80
5	33
6	255
7	54
8	31
9	37
10	116
11	39
12	39
13	33
14	132
15	117

Var1	Freq
0	850
1	577
2	115
3	72
4	80
5	33
6	255
7	54
8	31
9	37
10	116
11	39
12	39
13	33
14	132
15	117

Houses That Do Not Exist (Methodology)

Miquel Rossello Xamena

2026-03-15

Introduction

Atlas of Houses That Do Not Exist

Building the dataset

1. Downloading the data and uniting the diferent data of sources

2. Cleaning and Curating the Rural Housing Dataset (QGIS Workflow)

3. Joining Tourism Rentals (QGIS Workflow)

4. Crossing Protection Categories (QGIS Workflow)

5. We count the surface of the parcels where they are located

Issues we have observed in QGIS and how we solved them: Hidden Houses And Mismatches

Final Dataset of all the Houses That Do Not Exist

Counts of The Houses That Do Not Exist by Municipalities

Is Possible to Built a Cluster Based on The Characteristics Of Houses?

Hierarchical Clustering Attempt

Answering The Questions Of The PAR Process: Sustainability and Environmental Impact

Clustering the neighborhoods

K-Means to determine the neigborhoods

DBSCAN to determine the neigborhoods

Building the dataset: OD-Matrix (Origin-Destination-Matrix) computation

Operating the DBSCAN Cluster

Water Consumption: How Much Water Do the Houses That Do Not Exist Consume?

Building the Dataset of Water Consumption

Joining the Dataset of Water Consumption with the Dataset of Cadastral Information

Prediction models

Linear Regression (A Naive Model)

Linear Regression (Log Transforamtion)

Random Forest Models

Random Forest (based on months)

Random Forest (based on seasons)

Boosted Trees

Boosted Trees (Based on months)

Boosted Trees (based on seasons)

Final Prediction

Building IC with QRF

Municipal Results

Mobility Patterns: What is Furthest and How Much We Need to Take the Car?

Building the dataset

Visualizing the data

Final Remarks and Acknowledgments

Var1	Freq
0	850
1	577
2	115
3	72
4	80
5	33
6	255
7	54
8	31
9	37
10	116
11	39
12	39
13	33
14	132
15	117