Spatial join
Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
In the previous section we learned how to perform a Point in Polygon query.
We can now use the same logic to conduct a spatial join between two layers based on their
spatial relationship. We could, for example, join the attributes of a polygon layer into a point layer where each point would get the
attributes of a polygon that contains
the point.
Luckily, spatial join is already implemented in GeoPandas, so we do not need to write our own function for it. Three commonly used spatial relationships can be selected with the predicate parameter of the gpd.sjoin() function (the parameter is called op in older GeoPandas versions, where it is now deprecated):
"intersects"
"within"
"contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the Point in Polygon lesson, thus you should know how they work.
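The difference between the three relationships can be seen on a tiny example. The sketch below uses made-up toy geometries (one square and three points, none of them from the course data) and assumes a GeoPandas version that accepts the predicate parameter:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy data (hypothetical): one square polygon and three points
polys = gpd.GeoDataFrame({"zone": ["A"]}, geometry=[box(0, 0, 2, 2)])
points = gpd.GeoDataFrame(
    {"name": ["inside", "on_edge", "outside"]},
    geometry=[Point(1, 1), Point(2, 1), Point(3, 3)],
)

# "within": the point must lie in the interior of the polygon
within = gpd.sjoin(points, polys, how="inner", predicate="within")

# "intersects": touching the boundary is enough
intersects = gpd.sjoin(points, polys, how="inner", predicate="intersects")

# "contains": the same relationship as "within", seen from the polygon side
contains = gpd.sjoin(polys, points, how="inner", predicate="contains")

print(sorted(within["name"]))
print(sorted(intersects["name"]))
print(sorted(contains["name"]))
```

Note that the point lying exactly on the edge of the square is matched by "intersects" but not by "within" or "contains".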
Furthermore, pay attention to the different options for the type of join via the how parameter: “left”, “right” and “inner”. You can read more about these options in the geopandas sjoin documentation and the pandas guide for merge, join and concatenate.
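To make the effect of the how parameter concrete, here is a minimal sketch with hypothetical toy data: two grid cells (one of them empty) and two points (one of them outside both cells). The names cells and pts are only for illustration:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy data (hypothetical): cell "B" contains no point,
# and point "p2" falls outside both cells
cells = gpd.GeoDataFrame(
    {"cell": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)
pts = gpd.GeoDataFrame(
    {"name": ["p1", "p2"]},
    geometry=[Point(0.5, 0.5), Point(5, 5)],
)

# inner: only matching pairs are kept
inner = gpd.sjoin(pts, cells, how="inner", predicate="within")

# left: every point is kept, unmatched points get missing values
left = gpd.sjoin(pts, cells, how="left", predicate="within")

# right: every cell is kept, empty cells get missing values
right = gpd.sjoin(pts, cells, how="right", predicate="within")

print(len(inner), len(left), len(right))
print(left[["name", "cell"]])
```

With “inner”, features without a match silently disappear from the result, which is worth keeping in mind when comparing row counts before and after a join.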
Let’s perform a spatial join between these two layers:
Addresses: the geocoded address-point (we created this Shapefile in the geocoding tutorial)
Population grid: 250m x 250m grid polygon layer that contains population information from the Helsinki Region.
The population grid dataset is produced by the Helsinki Region Environmental Services Authority (HSY) (see this page to access data from different years).
You can download the data from this link in the Helsinki Region Infoshare (HRI) open data portal.
Here, we will access the data directly from the HSY WFS (web feature service):
import geopandas as gpd
from pyproj import CRS
import requests
import geojson
# Specify the url for web feature service
url = 'https://kartta.hsy.fi/geoserver/wfs'
# Specify parameters (read data in json format).
# Available feature types in this particular data source: http://geo.stat.fi/geoserver/vaestoruutu/wfs?service=wfs&version=2.0.0&request=describeFeatureType
params = dict(service='WFS',
              version='2.0.0',
              request='GetFeature',
              typeName='asuminen_ja_maankaytto:Vaestotietoruudukko_2018',
              outputFormat='json')
# Fetch data from WFS using requests
r = requests.get(url, params=params)
# Create GeoDataFrame from geojson
pop = gpd.GeoDataFrame.from_features(geojson.loads(r.content))
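If the request fails or returns something unexpected, it can help to look at the full URL that is being requested. The sketch below rebuilds an equivalent URL with the standard library only, using the same url and params values as above; the exact percent-encoding produced by requests may differ in detail:

```python
from urllib.parse import urlencode

url = "https://kartta.hsy.fi/geoserver/wfs"
params = dict(
    service="WFS",
    version="2.0.0",
    request="GetFeature",
    typeName="asuminen_ja_maankaytto:Vaestotietoruudukko_2018",
    outputFormat="json",
)

# Build the query string by hand to inspect what will be requested
full_url = f"{url}?{urlencode(params)}"
print(full_url)
```

The printed address can be pasted into a browser to check the raw response. After a real request, the same information is available from r.url, and r.status_code tells whether the server answered successfully.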
Check the result:
pop.head()
| | geometry | index | asukkaita | asvaljyys | ika0_9 | ika10_19 | ika20_29 | ika30_39 | ika40_49 | ika50_59 | ika60_69 | ika70_79 | ika_yli80 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | POLYGON ((25472499.995 6689749.005, 25472499.9... | 688 | 9 | 28.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
1 | POLYGON ((25472499.995 6685998.998, 25472499.9... | 703 | 5 | 51.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
2 | POLYGON ((25472499.995 6684249.004, 25472499.9... | 710 | 8 | 44.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
3 | POLYGON ((25472499.995 6683999.005, 25472499.9... | 711 | 5 | 90.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
4 | POLYGON ((25472499.995 6682998.998, 25472499.9... | 715 | 11 | 41.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
OK, so we have multiple columns in the dataset, but the most important one here is the column asukkaita (“population” in Finnish), which tells the number of inhabitants living within that polygon.
Let’s change the name of that column into pop18 so that it is more intuitive. As you might remember, we can easily rename (Geo)DataFrame column names using the rename() function, where we pass a dictionary of new column names like this: columns={'oldname': 'newname'}.
# Change the name of a column
pop = pop.rename(columns={'asukkaita': 'pop18'})
# Check the column names
pop.columns
Index(['geometry', 'index', 'pop18', 'asvaljyys', 'ika0_9', 'ika10_19',
'ika20_29', 'ika30_39', 'ika40_49', 'ika50_59', 'ika60_69', 'ika70_79',
'ika_yli80'],
dtype='object')
Let’s also get rid of all unnecessary columns by selecting only the columns that we need, i.e. pop18 and geometry:
# Subset columns
pop = pop[["pop18", "geometry"]]
pop.head()
| | pop18 | geometry |
---|---|---|
0 | 9 | POLYGON ((25472499.995 6689749.005, 25472499.9... |
1 | 5 | POLYGON ((25472499.995 6685998.998, 25472499.9... |
2 | 8 | POLYGON ((25472499.995 6684249.004, 25472499.9... |
3 | 5 | POLYGON ((25472499.995 6683999.005, 25472499.9... |
4 | 11 | POLYGON ((25472499.995 6682998.998, 25472499.9... |
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many people live in the polygon that contains an individual address point. Thus, we want to join attributes from the population layer we just modified into the addresses point layer addresses.shp that we created through geocoding in the previous section.
Read the addresses layer into memory:
# Addresses filepath
addr_fp = r"data/addresses.shp"
# Read data
addresses = gpd.read_file(addr_fp)
# Check the head of the file
addresses.head()
| | address | id | addr | geometry |
---|---|---|---|---|
0 | Ruoholahti, 14, Itämerenkatu, Ruoholahti, Läns... | 1000 | Itämerenkatu 14, 00101 Helsinki, Finland | POINT (24.91556 60.16320) |
1 | Kamppi, 1, Kampinkuja, Kamppi, Eteläinen suurp... | 1001 | Kampinkuja 1, 00100 Helsinki, Finland | POINT (24.93169 60.16902) |
2 | Kauppakeskus Citycenter, 8, Kaivokatu, Keskust... | 1002 | Kaivokatu 8, 00101 Helsinki, Finland | POINT (24.94179 60.16989) |
3 | Hermannin rantatie, Verkkosaari, Kalasatama, S... | 1003 | Hermannin rantatie 1, 00580 Helsinki, Finland | POINT (24.97783 60.18892) |
4 | Hesburger, 9, Tyynenmerenkatu, Jätkäsaari, Län... | 1005 | Tyynenmerenkatu 9, 00220 Helsinki, Finland | POINT (24.92160 60.15665) |
In order to do a spatial join, the layers need to be in the same projection.
Check the crs of the input layers:
addresses.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
pop.crs
If the crs information is missing from the population grid, we can define the coordinate reference system as ETRS GK-25 (EPSG:3879) because we know what it is based on the population grid metadata.
# Define crs
pop.crs = CRS.from_epsg(3879).to_wkt()
pop.crs
<Projected CRS: EPSG:3879>
Name: ETRS89 / GK25FIN
Axis Info [cartesian]:
- N[north]: Northing (metre)
- E[east]: Easting (metre)
Area of Use:
- name: Finland - nominally onshore between 24°30'E and 25°30'E but may be used in adjacent areas if a municipality chooses to use one zone over its whole extent.
- bounds: (24.5, 59.94, 25.5, 68.9)
Coordinate Operation:
- name: Finland Gauss-Kruger zone 25
- method: Transverse Mercator
Datum: European Terrestrial Reference System 1989 ensemble
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
# Are the layers in the same projection?
addresses.crs == pop.crs
False
Let’s re-project addresses to the projection of the population layer:
addresses = addresses.to_crs(pop.crs)
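Defining a CRS and re-projecting are two different operations, and they are easy to mix up. The following sketch illustrates the difference with one address point taken from the table above; the variable names are hypothetical, and set_crs() is used here as an alternative to assigning the crs attribute directly:

```python
import geopandas as gpd
from shapely.geometry import Point

# One address point from the table above, in lon/lat order (EPSG:4326)
pt = gpd.GeoDataFrame(
    {"addr": ["Kaivokatu 8"]},
    geometry=[Point(24.94179, 60.16989)],
    crs="EPSG:4326",
)

# to_crs() transforms the coordinates into the target system
projected = pt.to_crs(epsg=3879)

# set_crs() on data without a CRS only attaches a label;
# the coordinate values themselves stay untouched
unlabeled = gpd.GeoDataFrame(geometry=[Point(25496768.6, 6673002.0)])
labeled = unlabeled.set_crs(epsg=3879)

print(projected.geometry.iloc[0])  # coordinates in metres
print(labeled.geometry.iloc[0])    # same numbers as in `unlabeled`
```

In short, a CRS is defined when the coordinates are already in that system but the metadata is missing (the population grid), whereas to_crs() is used when the coordinates themselves need to change (the addresses).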
Let’s make sure that the coordinate reference systems of the layers are identical:
# Check the crs of address points
print(addresses.crs)
# Check the crs of population layer
print(pop.crs)
# Do they match now?
addresses.crs == pop.crs
PROJCRS["ETRS89 / GK25FIN",BASEGEOGCRS["ETRS89",ENSEMBLE["European Terrestrial Reference System 1989 ensemble",MEMBER["European Terrestrial Reference Frame 1989"],MEMBER["European Terrestrial Reference Frame 1990"],MEMBER["European Terrestrial Reference Frame 1991"],MEMBER["European Terrestrial Reference Frame 1992"],MEMBER["European Terrestrial Reference Frame 1993"],MEMBER["European Terrestrial Reference Frame 1994"],MEMBER["European Terrestrial Reference Frame 1996"],MEMBER["European Terrestrial Reference Frame 1997"],MEMBER["European Terrestrial Reference Frame 2000"],MEMBER["European Terrestrial Reference Frame 2005"],MEMBER["European Terrestrial Reference Frame 2014"],ELLIPSOID["GRS 1980",6378137,298.257222101,LENGTHUNIT["metre",1]],ENSEMBLEACCURACY[0.1]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],ID["EPSG",4258]],CONVERSION["Finland Gauss-Kruger zone 25",METHOD["Transverse Mercator",ID["EPSG",9807]],PARAMETER["Latitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8801]],PARAMETER["Longitude of natural origin",25,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8802]],PARAMETER["Scale factor at natural origin",1,SCALEUNIT["unity",1],ID["EPSG",8805]],PARAMETER["False easting",25500000,LENGTHUNIT["metre",1],ID["EPSG",8806]],PARAMETER["False northing",0,LENGTHUNIT["metre",1],ID["EPSG",8807]]],CS[Cartesian,2],AXIS["northing (N)",north,ORDER[1],LENGTHUNIT["metre",1]],AXIS["easting (E)",east,ORDER[2],LENGTHUNIT["metre",1]],USAGE[SCOPE["Cadastre, engineering survey, topographic mapping (large scale)."],AREA["Finland - nominally onshore between 24°30'E and 25°30'E but may be used in adjacent areas if a municipality chooses to use one zone over its whole extent."],BBOX[59.94,24.5,68.9,25.5]],ID["EPSG",3879]]
PROJCRS["ETRS89 / GK25FIN",BASEGEOGCRS["ETRS89",ENSEMBLE["European Terrestrial Reference System 1989 ensemble",MEMBER["European Terrestrial Reference Frame 1989"],MEMBER["European Terrestrial Reference Frame 1990"],MEMBER["European Terrestrial Reference Frame 1991"],MEMBER["European Terrestrial Reference Frame 1992"],MEMBER["European Terrestrial Reference Frame 1993"],MEMBER["European Terrestrial Reference Frame 1994"],MEMBER["European Terrestrial Reference Frame 1996"],MEMBER["European Terrestrial Reference Frame 1997"],MEMBER["European Terrestrial Reference Frame 2000"],MEMBER["European Terrestrial Reference Frame 2005"],MEMBER["European Terrestrial Reference Frame 2014"],ELLIPSOID["GRS 1980",6378137,298.257222101,LENGTHUNIT["metre",1]],ENSEMBLEACCURACY[0.1]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],ID["EPSG",4258]],CONVERSION["Finland Gauss-Kruger zone 25",METHOD["Transverse Mercator",ID["EPSG",9807]],PARAMETER["Latitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8801]],PARAMETER["Longitude of natural origin",25,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8802]],PARAMETER["Scale factor at natural origin",1,SCALEUNIT["unity",1],ID["EPSG",8805]],PARAMETER["False easting",25500000,LENGTHUNIT["metre",1],ID["EPSG",8806]],PARAMETER["False northing",0,LENGTHUNIT["metre",1],ID["EPSG",8807]]],CS[Cartesian,2],AXIS["northing (N)",north,ORDER[1],LENGTHUNIT["metre",1]],AXIS["easting (E)",east,ORDER[2],LENGTHUNIT["metre",1]],USAGE[SCOPE["Cadastre, engineering survey, topographic mapping (large scale)."],AREA["Finland - nominally onshore between 24°30'E and 25°30'E but may be used in adjacent areas if a municipality chooses to use one zone over its whole extent."],BBOX[59.94,24.5,68.9,25.5]],ID["EPSG",3879]]
True
Now they should be identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
Let’s now join the attributes from the pop GeoDataFrame into the addresses GeoDataFrame by using the gpd.sjoin() function:
# Make a spatial join (older GeoPandas versions use op= instead of predicate=)
join = gpd.sjoin(addresses, pop, how="inner", predicate="within")
join.head()
| | address | id | addr | geometry | index_right | pop18 |
---|---|---|---|---|---|---|
0 | Ruoholahti, 14, Itämerenkatu, Ruoholahti, Läns... | 1000 | Itämerenkatu 14, 00101 Helsinki, Finland | POINT (25495311.608 6672258.695) | 3252 | 515 |
1 | Kamppi, 1, Kampinkuja, Kamppi, Eteläinen suurp... | 1001 | Kampinkuja 1, 00100 Helsinki, Finland | POINT (25496207.840 6672906.173) | 3364 | 182 |
2 | Kauppakeskus Citycenter, 8, Kaivokatu, Keskust... | 1002 | Kaivokatu 8, 00101 Helsinki, Finland | POINT (25496768.622 6673002.004) | 3488 | 38 |
10 | Rautatientori, Kaisaniemi, Kluuvi, Eteläinen s... | 1011 | Rautatientori 1, 00100 Helsinki, Finland | POINT (25496896.734 6673162.114) | 3488 | 38 |
3 | Hermannin rantatie, Verkkosaari, Kalasatama, S... | 1003 | Hermannin rantatie 1, 00580 Helsinki, Finland | POINT (25498769.713 6675121.127) | 3822 | 61 |
Awesome! Now we have performed a successful spatial join where we got
two new columns into our join
GeoDataFrame, i.e. index_right
that tells the index of the matching polygon in the population grid and
pop18
which is the population in the cell where the address-point is
located.
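The index_right column can be used to look up the matching polygons afterwards, or to count how many points share the same cell (in the table above, two addresses fall into cell 3488). A minimal sketch with hypothetical toy data that mimics this situation:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy grid (hypothetical values) with a non-default index, like the grid above
grid = gpd.GeoDataFrame(
    {"pop18": [38, 61]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)
grid.index = [3488, 3822]

pts = gpd.GeoDataFrame(
    {"id": [1002, 1011, 1003]},
    geometry=[Point(0.2, 0.5), Point(0.8, 0.5), Point(1.5, 0.5)],
)

joined = gpd.sjoin(pts, grid, how="inner", predicate="within")

# index_right holds labels of the grid index, so .loc retrieves the matching cells
matched_cells = grid.loc[joined["index_right"]]

# Count how many points share each cell
points_per_cell = joined["index_right"].value_counts()
print(points_per_cell)
```

Because index_right stores index labels rather than row positions, .loc is the appropriate accessor here, not .iloc.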
Let’s still check how many rows of data we have now:
len(join)
31
Did we lose some data here?
Check how many addresses we had originally:
len(addresses)
34
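One way to find out which rows went missing is to compare the index of the original layer with the index of the join result. The sketch below shows the idea with hypothetical toy data (one grid cell and two points); it assumes an inner join, which keeps the index of the left layer:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy data (hypothetical): one populated cell, one point inside and one outside
grid = gpd.GeoDataFrame({"pop18": [38]}, geometry=[box(0, 0, 1, 1)])
pts = gpd.GeoDataFrame(
    {"id": [1000, 1001]},
    geometry=[Point(0.5, 0.5), Point(3, 3)],
)

joined = gpd.sjoin(pts, grid, how="inner", predicate="within")

# An inner join keeps the index of the left layer, so rows whose index is
# absent from the result are the points that found no matching polygon
unmatched = pts.loc[~pts.index.isin(joined.index)]
print(unmatched)
```

Applied to the layers in this lesson, the same expression with addresses and join would list the addresses that did not fall inside any grid cell.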
If we plot the layers on top of each other, we can observe that some of the points are located outside the populated grid squares (increase figure size if you can’t see this properly!)
import matplotlib.pyplot as plt
# Create a figure with one subplot
fig, ax = plt.subplots(figsize=(15,8))
# Plot population grid
pop.plot(ax=ax)
# Plot points
addresses.plot(ax=ax, color='red', markersize=5)
<AxesSubplot:>
Let’s also visualize the joined output:
Plot the points and use the pop18 column to indicate the color. The cmap parameter selects a sequential colormap for the values, markersize adjusts the size of a point, scheme sets the classification method (provided by the mapclassify package from pysal), and legend=True adds a legend to the map:
# Create a figure with one subplot
fig, ax = plt.subplots(figsize=(10,6))
# Plot the points with population info
join.plot(ax=ax, column='pop18', cmap="Reds", markersize=15, scheme='quantiles', legend=True);
# Add title
plt.title("Number of inhabitants living close to the point");
# Remove white space around the figure
plt.tight_layout()
In a similar way, we can plot the original population grid and check the overall population distribution in Helsinki:
# Create a figure with one subplot
fig, ax = plt.subplots(figsize=(10,6))
# Plot the grid with population info
pop.plot(ax=ax, column='pop18', cmap="Reds", scheme='quantiles', legend=True);
# Add title
plt.title("Population 2018 in 250 x 250 m grid squares");
# Remove white space around the figure
plt.tight_layout()
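The 'quantiles' scheme used in both maps splits the values into classes that hold roughly the same number of features. The sketch below illustrates the idea with NumPy on made-up values; it is not the code that the plotting function runs internally, and the exact class breaks computed by the classification library may differ:

```python
import numpy as np

# Hypothetical population values for ten grid cells
values = np.array([5, 8, 9, 11, 38, 61, 120, 182, 300, 515])
k = 5  # number of classes

# Upper bound of each class: the 20th, 40th, ..., 100th percentile
breaks = np.percentile(values, np.linspace(100 / k, 100, k))

# Class index of each value (0 = lowest class)
classes = np.searchsorted(breaks, values, side="left")

print(breaks)
print(np.bincount(classes, minlength=k))
```

Since every class contains a similar number of cells, a quantile map spreads the colors evenly even when the data are strongly skewed, as population counts usually are.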
Finally, let’s save the result point layer into a file:
# Output path
outfp = r"data/addresses_population.shp"
# Save to disk
join.to_file(outfp)
/tmp/ipykernel_8817/218676674.py:5: UserWarning: Column names longer than 10 characters will be truncated when saved to ESRI Shapefile.
join.to_file(outfp)