Spatial join#
Spatial joins are operations that combine data from two or more spatial data sets based on their geometric relationship. In the previous sections, we got to know two specific cases of spatial joins: Point-in-polygon queries and intersects-queries. However, there is more to using the geometric relationship between features and between entire layers.
Spatial join operations require two input parameters: the predicament, i.e., the geometric condition that needs to be met between two geometries, and the join-type: whether only rows with matching geometries are kept, or all of one input table’s rows, or all records.
Geopandas (using shapely
to implement geometric relationships) supports a standard set of geometric predicates, that is similar to most GIS analysis tools and applications:
intersects
contains
within
touches
crosses
overlaps
Geometric predicaments are expressed as verbs, so they have an intuitive meaning. See the shapely user manual for a detailed description of each geometric predicate.
Binary geometric predicates
Shapely supports more binary geometric predicates than those geopandas implements for spatial joins. What are they? Can they be expressed by combining the implemented ones?
In terms of the join-type, geopandas implements three different options:
left: keep all records of the left data frame, fill with empty values if no match, keep left geometry column
right: keep all records of the left data frame, fill with empty values if no match, keep right geometry column
inner: keep only records of matching records, keep left geometry column
Tip
The PyGIS book provides a great overview of spatial predicates and join-types, complete with explanatory drawings.
Load input data#
As a practical example, let’s find the population density at each of the addresses from earlier in this lesson, by combining the data set with data from a population grid.
The population grid data is available from HSY, the Helsinki Region Environmental Services, for instance via their WFS endpoint or opendata.
import geopandas
addresses = geopandas.read_file(DATA_DIRECTORY / "addresses.gpkg")
population_grid = geopandas.read_file(
(
"https://kartta.hsy.fi/geoserver/wfs"
"?service=wfs"
"&version=2.0.0"
"&request=GetFeature"
"&typeName=asuminen_ja_maankaytto:Vaestotietoruudukko_2020"
"&srsName=EPSG:3879"
),
)
population_grid.crs = "EPSG:3879" # for WFS data, the CRS needs to be specified manually
[1]:
import pathlib
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"
[2]:
import geopandas
addresses = geopandas.read_file(DATA_DIRECTORY / "addresses.gpkg")
population_grid = geopandas.read_file(
"https://avoidatastr.blob.core.windows.net/avoindata/AvoinData/"
"6_Asuminen/Vaestotietoruudukko/Shp/Vaestotietoruudukko_2021_shp.zip"
)
Concatenating long strings
In the WFS address above, we split a long string across multiple lines. Strings within parentheses are automatically concatenated (joined together) without needing any operator (e.g., +
).
For clarity, the example includes an additional set of parentheses, but the parentheses of the method call would suffice.
[3]:
population_grid.head()
[3]:
INDEX | ASUKKAITA | ASVALJYYS | IKA0_9 | IKA10_19 | IKA20_29 | IKA30_39 | IKA40_49 | IKA50_59 | IKA60_69 | IKA70_79 | IKA_YLI80 | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 688 | 5 | 50.60 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.995 6689749.005, 25472499.9... |
1 | 703 | 7 | 36.71 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.995 6685998.998, 25472499.9... |
2 | 710 | 8 | 44.50 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.995 6684249.004, 25472499.9... |
3 | 711 | 7 | 64.14 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.995 6683999.005, 25472499.9... |
4 | 715 | 11 | 41.09 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.995 6682998.998, 25472499.9... |
The population grid has many columns, and all of its column names are in Finnish. Let’s drop (delete) all of the columns except the population total, and rename the remaining to English:
[4]:
population_grid = population_grid[["ASUKKAITA", "geometry"]]
population_grid = population_grid.rename(columns={"ASUKKAITA": "population"})
Finally, calculate the population density by dividing the number of inhabitants of each grid cell by its area in km²:
[5]:
population_grid["population_density"] = (
population_grid["population"]
/ (population_grid.area / 1_000_000)
)
population_grid.head()
[5]:
population | geometry | population_density | |
---|---|---|---|
0 | 5 | POLYGON ((25472499.995 6689749.005, 25472499.9... | 80.001048 |
1 | 7 | POLYGON ((25472499.995 6685998.998, 25472499.9... | 112.001467 |
2 | 8 | POLYGON ((25472499.995 6684249.004, 25472499.9... | 128.001677 |
3 | 7 | POLYGON ((25472499.995 6683999.005, 25472499.9... | 112.001467 |
4 | 11 | POLYGON ((25472499.995 6682998.998, 25472499.9... | 176.002305 |
Tip: Coding style - big numbers, operators in multi-line expressions
When using very large numbers, such as 1 million to convert m² to km², you can use underscore characters (_
) as thousands separators. The Python interpreter treats numbers with underscores as regular numeric values. This syntax also works for grouping numbers by other logics, like hexadecimal or binary values.
For expressions like mathematical formulas that spread across multiple lines, it’s good coding style to place an operator at the beginning of a new line rather than trailing it on the previous line. This improves readability, as explained in the PEP-8 styling guidelines.
Join input layers#
Now we are ready to perform the spatial join between the two layers. Remember: the aim is to find the population density around each of the address points. We want to attach population density information from the population_grid
polygon layer to the addresses
point layer, depending on whether the point is within the polygon. During this operation, we want to retain the geometries of the point layer.
Before we can go ahead with the join operation, we have to make sure the two layers are in the same cartographic reference system:
[6]:
assert addresses.crs == population_grid.crs, "CRS are not identical"
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[6], line 1
----> 1 assert addresses.crs == population_grid.crs, "CRS are not identical"
AssertionError: CRS are not identical
They do not share the same CRS, let’s reproject one of them:
[7]:
population_grid = population_grid.to_crs(addresses.crs)
Now we are ready to carry out the actual spatial join using the `geopandas.GeoDataFrame.sjoin()
<https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.sjoin.html>`__ method. Remember, we want to use a within geometric predicate and retain the point layer’s geometries (in the example below the left data frame).
[8]:
addresses_with_population_data = addresses.sjoin(
population_grid,
how="left",
predicate="within"
)
addresses_with_population_data.head()
[8]:
address | geometry | index_right | population | population_density | |
---|---|---|---|---|---|
0 | Ruoholahti, 14, Itämerenkatu, Ruoholahti, Läns... | POINT (24.91556 60.1632) | 3262.0 | 505.0 | 8080.105832 |
1 | 1, Kampinkuja, Kamppi, Eteläinen suurpiiri, He... | POINT (24.93009 60.16846) | 3381.0 | 172.0 | 2752.036046 |
2 | Espresso House, 8, Kaivokatu, Keskusta, Kluuvi... | POINT (24.94153 60.17016) | 3504.0 | 43.0 | 688.009011 |
3 | Hermannin rantatie, Hermanninranta, Hermanni, ... | POINT (24.97675 60.19438) | 3808.0 | 122.0 | 1952.025567 |
4 | 9, Tyynenmerenkatu, Jätkäsaari, Länsisatama, E... | POINT (24.92151 60.15662) | 3310.0 | 1402.0 | 22431.538035 |
That looks great! We now have an address data set with population density information attached to it.
As a final task, let’s look at how to plot data using a graduated cartographic visualisation scheme.
The geopandas.GeoDataFrame.plot()
method can vary the map colours depending on a column’s values by passing its name as a named argument column
. On top of that, the method accepts many arguments to influence the style of the map. Among them are scheme
and cmap
that define the categorisation scheme, and the colour map used. Many more arguments are
passed through to matplotlib
, such as markersize
to set the size of point symbols, and facecolor
to set the colour of polygon areas. To draw a legend, set legend
to True
, to set the size of the figure, pass a tuple (with values in inch) as figsize
.
[9]:
ax = addresses_with_population_data.plot(
figsize=(10, 10),
column="population_density",
cmap="Reds",
scheme="quantiles",
markersize=15,
legend=True
)
ax.set_title("Population density around address points")
[9]:
Text(0.5, 1.0, 'Population density around address points')
We can apply the same arguments to plot a population density map using the entire population_grid
data set:
[10]:
ax = population_grid.plot(
figsize=(10, 10),
column="population_density",
cmap="Reds",
scheme="quantiles",
legend=True
)
ax.set_title("Population density in the Helsinki metropolitan area")
[10]:
Text(0.5, 1.0, 'Population density in the Helsinki metropolitan area')
Finally, remember to save the output data frame to a file. We can append it to the existing GeoPackage by specifying a new layer name:
[11]:
addresses_with_population_data.to_file(
DATA_DIRECTORY / "addresses.gpkg",
layer="addresses_with_population_data"
)