
This repo demonstrates how to download and use origin-destination (OD) data from Spain, published by transportes.gob.es.

The data is provided as follows (see the URL sketch just after this list):

  • Estudios basicos (basic studies)
    • Por distritos (by district)
      • Personas (population)
      • Pernoctaciones (overnight stays)
      • Viajes (trips)
        • ficheros-diarios
        • meses-completos
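
As a sketch of how this hierarchy maps onto file URLs (the pattern is copied from the daily-files example further down in this README; base_url is only for illustration):

base_url = "https://movilidad-opendata.mitma.es/estudios_basicos"
paste0(
  base_url,
  "/por-distritos/viajes/ficheros-diarios/2024-03/20240301_Viajes_distritos.csv.gz"
)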

The package is designed to save people time by providing the data in analysis-ready formats. Automating the download, cleaning and import steps also reduces the risk of errors that creep into laborious manual data preparation.

The datasets are large, so the package aims to minimise computational requirements by using efficient packages behind the scenes. If you want to use many of the data files, it's recommended that you set a data directory where the package will look for the data, downloading only the files that are not already present.

Set the data directory by setting the SPANISH_OD_DATA_DIR environment variable, e.g. with the following command:

usethis::edit_r_environ()
# Then set the data directory by adding this line to the file:
SPANISH_OD_DATA_DIR = "/path/to/data"
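
Alternatively, you can set the variable for the current R session only (a sketch; for a persistent setting edit .Renviron as above):

Sys.setenv(SPANISH_OD_DATA_DIR = "/path/to/data")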

The following code installs and loads the packages used in the rest of this README:

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_cran("duckdb")
library(duckdb)
library(tidyverse)
theme_set(theme_minimal())
devtools::load_all() # load the package functions from source
sf::sf_use_s2(FALSE) # use planar (GEOS) geometry operations

Get metadata for the datasets as follows:

metadata = get_metadata()
metadata
# A tibble: 8,453 × 6
   target_url           pub_ts              file_extension data_ym data_ymd
   <chr>                <dttm>              <chr>          <date>  <date>  
 1 https://movilidad-o… 2024-05-16 09:00:45 tar            NA      NA      
 2 https://movilidad-o… 2024-05-16 08:58:20 tar            NA      NA      
 3 https://movilidad-o… 2024-05-16 08:58:18 tar            NA      NA      
 4 https://movilidad-o… 2024-05-16 08:57:24 tar            NA      NA      
 5 https://movilidad-o… 2024-05-16 08:55:49 tar            NA      NA      
 6 https://movilidad-o… 2024-05-16 08:55:47 tar            NA      NA      
 7 https://movilidad-o… 2024-05-16 08:55:10 tar            NA      NA      
 8 https://movilidad-o… 2024-05-16 08:54:13 tar            NA      NA      
 9 https://movilidad-o… 2024-05-16 08:54:11 tar            NA      NA      
10 https://movilidad-o… 2024-05-16 08:52:26 tar            NA      NA      
# ℹ 8,443 more rows
# ℹ 1 more variable: local_path <chr>
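
The metadata is a regular tibble, so it can be explored with standard dplyr verbs, for example (a sketch based on the column names shown above):

# Count the available files by extension
metadata |>
  dplyr::count(file_extension)
# Preview the entries that have a daily date
metadata |>
  dplyr::filter(!is.na(data_ymd)) |>
  head()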

Zones

Zones can be downloaded as follows:

distritos = get_zones(type = "distritos")
distritos_wgs84 = sf::st_transform(distritos, 4326)
plot(distritos_wgs84)
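
plot() on a full sf object draws one small map per attribute column; if you only want the zone boundaries, plotting the geometry alone is quicker (a small sketch that also previews the ID column used later to subset flows):

# Plot only the zone boundaries
plot(sf::st_geometry(distritos_wgs84))
# Preview the zone identifiers
head(distritos_wgs84$ID)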

Estudios basicos

The ficheros-diarios folder contains one file per day, with the following columns:

# set timeout for downloads
options(timeout = 600) # 10 minutes
u1 = "https://movilidad-opendata.mitma.es/estudios_basicos/por-distritos/viajes/ficheros-diarios/2024-03/20240301_Viajes_distritos.csv.gz"
f1 = basename(u1)
if (!file.exists(f1)) {
  download.file(u1, f1)
}
drv = duckdb::duckdb("daily.duckdb")
con = DBI::dbConnect(drv)
od1 = duckdb::tbl_file(con, f1)
# colnames(od1)
#  [1] "fecha"                   "periodo"                
#  [3] "origen"                  "destino"                
#  [5] "distancia"               "actividad_origen"       
#  [7] "actividad_destino"       "estudio_origen_posible" 
#  [9] "estudio_destino_posible" "residencia"             
# [11] "renta"                   "edad"                   
# [13] "sexo"                    "viajes"                 
# [15] "viajes_km"
od1_head = od1 |>
  head() |>
  collect()
od1_head |>
  knitr::kable()
| fecha | periodo | origen | destino | distancia | actividad_origen | actividad_destino | estudio_origen_posible | estudio_destino_posible | residencia | renta | edad | sexo | viajes | viajes_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20240301 | 19 | 01009_AM | 01001 | 0.5-2 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 5.124 | 6.120 |
| 20240301 | 15 | 01002 | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 2.360 | 100.036 |
| 20240301 | 00 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 1.743 | 22.293 |
| 20240301 | 05 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 2.404 | 24.659 |
| 20240301 | 06 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 5.124 | 80.118 |
| 20240301 | 09 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 7.019 | 93.938 |
DBI::dbDisconnect(con, shutdown = TRUE) # disconnect and shut down the duckdb database

You can get the same result for multiple files as follows:

od_multi_list = get_od(
  subdir = "estudios_basicos/por-distritos/viajes/ficheros-diarios",
  date_regex = "2024-03-0[1-7]"
)
od_multi_list[[1]]
# Source:   SQL [?? x 15]
# Database: DuckDB v0.10.2 [robin@Linux 6.5.0-35-generic:R 4.4.0/:memory:]
      fecha periodo origen  destino distancia actividad_origen actividad_destino
      <dbl> <chr>   <chr>   <chr>   <chr>     <chr>            <chr>            
 1 20240307 00      01009_… 01001   0.5-2     frecuente        casa             
 2 20240307 09      01009_… 01001   0.5-2     frecuente        casa             
 3 20240307 18      01009_… 01001   0.5-2     frecuente        casa             
 4 20240307 19      01009_… 01001   0.5-2     frecuente        casa             
 5 20240307 20      01009_… 01001   0.5-2     frecuente        casa             
 6 20240307 14      01002   01001   10-50     frecuente        casa             
 7 20240307 22      01002   01001   10-50     frecuente        casa             
 8 20240307 06      01009_… 01001   10-50     frecuente        casa             
 9 20240307 09      01009_… 01001   10-50     frecuente        casa             
10 20240307 11      01009_… 01001   10-50     frecuente        casa             
# ℹ more rows
# ℹ 8 more variables: estudio_origen_posible <chr>,
#   estudio_destino_posible <chr>, residencia <chr>, renta <chr>, edad <chr>,
#   sexo <chr>, viajes <dbl>, viajes_km <dbl>
class(od_multi_list[[1]])
[1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
[4] "tbl_lazy"              "tbl"                  

The result is a list of duckdb tables which load almost instantly and can be used with dplyr functions. Let's do an aggregation to find the total number of trips per hour over the 7 days:

n_per_hour = od_multi_list |>
  map(~ .x |>
        group_by(periodo, fecha) |>
        summarise(n = n(), Trips = sum(viajes)) |>
        collect()
  ) |>
  list_rbind() |>
  mutate(Time = lubridate::ymd_h(paste0(fecha, periodo))) |>
  mutate(Day = lubridate::wday(Time, label = TRUE)) 
n_per_hour |>
  ggplot(aes(x = Time, y = Trips)) +
  geom_line(aes(colour = Day)) +
  labs(title = "Number of trips per hour over 7 days")

The figure above summarises 925,874,012 trips over the 7 days associated with 135,866,524 records.
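
Those totals can be recovered directly from the n_per_hour object created above (a quick check, not part of the original pipeline):

# Total trips and total records over the 7 days
sum(n_per_hour$Trips)
sum(n_per_hour$n)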

We’ll use the same input data to pick out the most important flows in Spain, with a focus on longer trips for visualisation:

od_large = od_multi_list |>
  map(~ .x |>
        group_by(origen, destino) |>
        summarise(Trips = sum(viajes), .groups = "drop") |>
        filter(Trips > 500) |>
        collect()
  ) |>
  list_rbind() |>
  group_by(origen, destino) |>
  summarise(Trips = sum(Trips)) |>
  arrange(desc(Trips))
od_large
# A tibble: 37,023 × 3
# Groups:   origen [3,723]
   origen  destino    Trips
   <chr>   <chr>      <dbl>
 1 2807908 2807908 2441404.
 2 0801910 0801910 2112188.
 3 0801902 0801902 2013618.
 4 2807916 2807916 1821504.
 5 2807911 2807911 1785981.
 6 04902   04902   1690606.
 7 2807913 2807913 1504484.
 8 2807910 2807910 1299586.
 9 0704004 0704004 1287122.
10 28106   28106   1286058.
# ℹ 37,013 more rows

The results show that the largest flows are intra-zonal. Let’s keep only the inter-zonal flows:

od_large_interzonal = od_large |>
  filter(origen != destino)

We can convert these to geographic data with the {od} package:

od_large_interzonal_sf = od::od_to_sf(
  od_large_interzonal,
  z = distritos_wgs84
)
od_large_interzonal_sf |>
  ggplot() +
  geom_sf(aes(size = Trips), colour = "red") +
  theme_void()

Let’s focus on trips in and around a particular area (Salamanca):

salamanca_zones = zonebuilder::zb_zone("Salamanca")
distritos_salamanca = distritos_wgs84[salamanca_zones, ]
plot(distritos_salamanca)

We will use this information to subset the rows, to capture all movement within the study area:

ids_salamanca = distritos_salamanca$ID
od_salamanca = od_multi_list |>
  map(~ .x |>
        filter(origen %in% ids_salamanca) |>
        filter(destino %in% ids_salamanca) |>
        collect()
  ) |>
  list_rbind() |>
  group_by(origen, destino) |>
  summarise(Trips = sum(viajes)) |>
  arrange(Trips)

Let’s plot the results:

od_salamanca_sf = od::od_to_sf(
  od_salamanca,
  z = distritos_salamanca
)
od_salamanca_sf |>
  filter(origen != destino) |>
  ggplot() +
  geom_sf(aes(colour = Trips), size = 1) +
  scale_colour_viridis_c() +
  theme_void()

Disaggregating desire lines

For this you’ll need some additional dependencies:
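
The code below uses the osmactive and odjitter packages. A sketch of how they could be installed (the GitHub repository locations are assumptions; check each package's documentation):

# Install the additional packages used below (repository locations assumed)
if (!requireNamespace("osmactive", quietly = TRUE)) {
  remotes::install_github("nptscot/osmactive")
}
if (!requireNamespace("odjitter", quietly = TRUE)) {
  remotes::install_github("dabreegster/odjitter", subdir = "r")
}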

We’ll get the road network from OSM:

salamanca_boundary = sf::st_union(distritos_salamanca)
osm_full = osmactive::get_travel_network(salamanca_boundary)
osm = osm_full[salamanca_boundary, ]
drive_net = osmactive::get_driving_network(osm)
drive_net_major = osmactive::get_driving_network_major(osm)
cycle_net = osmactive::get_cycling_network(osm)
cycle_net = osmactive::distance_to_road(cycle_net, drive_net_major)
cycle_net = osmactive::classify_cycle_infrastructure(cycle_net)
map_net = osmactive::plot_osm_tmap(cycle_net)
map_net

We can use the road network to disaggregate the desire lines:

od_jittered = odjitter::jitter(
  od_salamanca_sf,
  zones = distritos_salamanca,
  subpoints = drive_net,
  disaggregation_threshold = 1000,
  disaggregation_key = "Trips"
)

Let’s plot the disaggregated desire lines:

od_jittered |>
  arrange(Trips) |>
  ggplot() +
  geom_sf(aes(colour = Trips), size = 1) +
  scale_colour_viridis_c() +
  geom_sf(data = drive_net_major, colour = "black") +
  theme_void()