Geometric operations¶

Overlay analysis¶

In this tutorial, the aim is to make an overlay analysis where we create a new layer based on geometries from a dataset that intersect with geometries of another layer. As our test case, we will select Polygon grid cells from TravelTimes_to_5975375_RailwayStation_Helsinki.geojson that intersects with municipality borders of Helsinki found in Helsinki_borders.shp.

Typical overlay operations are (source: QGIS docs):

Let’s first read the data and take a look how they look:

import geopandas as gpd
import matplotlib.pyplot as plt
import shapely.speedups
%matplotlib inline

# Let's enable speedups to make queries faster
shapely.speedups.enable()

# File paths
border_fp = "data/Helsinki_borders.shp"
grid_fp = "data/TravelTimes_to_5975375_RailwayStation.shp"

# Read files
grid = gpd.read_file(grid_fp)
hel = gpd.read_file(border_fp)

# Plot the layers
ax = grid.plot(facecolor='gray')
hel.plot(ax=ax, facecolor='None', edgecolor='blue')

<matplotlib.axes._subplots.AxesSubplot at 0x2637d634cc0>

../../_images/geometric-operations_1_1.png

Here the blue area is the borders that we want to use for conducting the overlay analysis and select the geometries from the Polygon grid.

Always, when conducting GIS operations involving multiple layers, it is required to check that the CRS of the layers match:

# Ensure that the CRS matches, if not raise an AssertionError
assert hel.crs == grid.crs, "CRS differs between layers!"

Indeed, they do. Hence, the pre-requisite to conduct spatial operations between the layers is fullfilled (also the map we plotted indicated this).

Let’s do an overlay analysis and create a new layer from polygons of the grid that intersect with our Helsinki layer. We can use a function called overlay() to conduct the overlay analysis that takes as an input 1) the GeoDataFrame where the selection is taken, 2) the GeoDataFrame used for making the selection, and 3) parameter how that can be used to control how the overlay analysis is conducted (possible values are 'intersection', 'union', 'symmetric_difference', 'difference', and 'identity'):

intersection = gpd.overlay(grid, hel, how='intersection')

Let’s plot our data and see what we have:

intersection.plot(color="b")

<matplotlib.axes._subplots.AxesSubplot at 0x2637d5e9128>

../../_images/geometric-operations_7_1.png

As a result, we now have only those grid cells that intersect with the Helsinki borders. As we can see the grid cells are clipped based on the boundary.

Whatabout the data attributes? Let’s see what we have:

print(intersection.head())

     car_m_d  car_m_t  car_r_d  car_r_t  from_id  pt_m_d  pt_m_t  pt_m_tt  \
   29476       41    29483       46  5876274   29990      76       95   
   29456       41    29462       46  5876275   29866      74       95   
   36772       50    36778       56  5876278   33541     116      137   
   36898       49    36904       56  5876279   33720     119      141   
  29411       40    29418       44  5878128   29944      75       95   

     pt_r_d  pt_r_t  pt_r_tt    to_id  walk_d  walk_t    GML_ID   NAMEFIN  \
  24984      77       99  5975375   25532     365  27517366  Helsinki   
  24860      75       93  5975375   25408     363  27517366  Helsinki   
  44265     130      146  5975375   31110     444  27517366  Helsinki   
  44444     132      155  5975375   31289     447  27517366  Helsinki   
 24938      76       99  5975375   25486     364  27517366  Helsinki   

         NAMESWE NATCODE                                           geometry  
 Helsingfors     091  POLYGON ((402250.0001322526 6685750.000039577,...  
 Helsingfors     091  POLYGON ((402367.8898583531 6685750.000039573,...  
 Helsingfors     091  POLYGON ((403250.000132058 6685750.000039549, ...  
 Helsingfors     091  POLYGON ((403456.4840652301 6685750.000039542,...  
Helsingfors     091  POLYGON ((402000.0001323065 6685500.000039622,...  

As we can see, due to the overlay analysis, the dataset contains the attributes from both input layers.

Let’s save our result grid as a GeoJSON file that is commonly used file format nowadays for storing spatial data.

# Output filepath
outfp = "data/TravelTimes_to_5975375_RailwayStation_Helsinki.geojson"

# Use GeoJSON driver
intersection.to_file(outfp, driver="GeoJSON")

There are many more examples for different types of overlay analysis in Geopandas documentation where you can go and learn more.

Aggregating data¶

Aggregating data can also be useful sometimes. What we mean by aggregation is that we basically merge Geometries into together by some common identifier. Suppose we are interested in studying continents, but we only have country-level data like the country dataset. By aggregation we would convert this into a continent-level dataset.

In this tutorial, we will aggregate our travel time data by car travel times (column car_r_t), i.e. the grid cells that have the same travel time to Railway Station will be merged together.

For doing the aggregation we will use a function called dissolve() that takes as input the column that will be used for conducting the aggregation:

# Conduct the aggregation
dissolved = intersection.dissolve(by="car_r_t")

# What did we get
print(dissolved.head())

                                                  geometry  car_m_d  car_m_t  \
car_r_t                                                                        
-1       (POLYGON ((388000.0001354803 6668750.0000429, ...       -1       -1   
 0       POLYGON ((386000.0001357812 6672000.000042388,...        0        0   
 7       POLYGON ((386250.0001357396 6671750.000042424,...     1051        7   
 8       (POLYGON ((386250.0001357467 6671500.000042468...     1286        8   
 9       (POLYGON ((387000.0001355996 6671500.000042449...     1871        9   

         car_r_d  from_id  pt_m_d  pt_m_t  pt_m_tt  pt_r_d  pt_r_t  pt_r_tt  \
car_r_t                                                                       
-1            -1  5913094      -1      -1       -1      -1      -1       -1   
 0             0  5975375       0       0        0       0       0        0   
 7          1051  5973739     617       5        6     617       5        6   
 8          1286  5973736     706      10       10     706      10       10   
 9          1871  5970457    1384      11       13    1394      11       12   

           to_id  walk_d  walk_t    GML_ID   NAMEFIN      NAMESWE NATCODE  
car_r_t                                                                    
-1            -1      -1      -1  27517366  Helsinki  Helsingfors     091  
 0       5975375       0       0  27517366  Helsinki  Helsingfors     091  
 7       5975375     448       6  27517366  Helsinki  Helsingfors     091  
 8       5975375     706      10  27517366  Helsinki  Helsingfors     091  
 9       5975375    1249      18  27517366  Helsinki  Helsingfors     091  

Let’s compare the number of cells in the layers before and after the aggregation:

print('Rows in original intersection GeoDataFrame:', len(intersection))
print('Rows in dissolved layer:', len(dissolved))

Rows in original intersection GeoDataFrame: 3826
Rows in dissolved layer: 51

Indeed the number of rows in our data has decreased and the Polygons were merged together.

What actually happened here? Let’s take a closer look.

Let’s see what columns we have now in our GeoDataFrame:

print(dissolved.columns)

Index(['geometry', 'car_m_d', 'car_m_t', 'car_r_d', 'from_id', 'pt_m_d',
       'pt_m_t', 'pt_m_tt', 'pt_r_d', 'pt_r_t', 'pt_r_tt', 'to_id', 'walk_d',
       'walk_t', 'GML_ID', 'NAMEFIN', 'NAMESWE', 'NATCODE'],
      dtype='object')

As we can see, the column that we used for conducting the aggregation (car_r_t) can not be found from the columns list anymore. What happened to it?

Let’s take a look at the indices of our GeoDataFrame:

print(dissolved.index)

Int64Index([-1,  0,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
            22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
            39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
            56],
           dtype='int64', name='car_r_t')

Aha! Well now we understand where our column went. It is now used as index in our dissolved GeoDataFrame.

Now, we can for example select only such geometries from the layer that are for example exactly 15 minutes away from the Helsinki Railway Station:

# Select only geometries that are within 15 minutes away
sel_15min = dissolved.iloc[15]

# See the data type
print(type(sel_15min))

# See the data
print(sel_15min.head())

<class 'pandas.core.series.Series'>
geometry    (POLYGON ((388250.0001354316 6668750.000042891...
car_m_d                                                 12035
car_m_t                                                    18
car_r_d                                                 11997
from_id                                               5903886
Name: 20, dtype: object

As we can see, as a result, we have now a Pandas Series object containing basically one row from our original aggregated GeoDataFrame.

Let’s finally convert the Series into GeoDataFrame and plot it to see where those 15 minute grid cells are located:

# Create a GeoDataFrame
geo = gpd.GeoDataFrame([sel_15min.values], crs=dissolved.crs, columns=dissolved.columns)

# Plot the grid cells that are 15 minutes a way from the Railway Station
ax = dissolved.plot(facecolor='gray')
geo.plot(ax=ax, facecolor='red')

<matplotlib.axes._subplots.AxesSubplot at 0x2637dbafc18>

../../_images/geometric-operations_24_1.png

Simplifying geometries¶

Sometimes it might be useful to be able to simplify geometries. This could be something to consider for example when you have very detailed spatial features that cover the whole world. If you make a map that covers the whole world, it is unnecessary to have really detailed geometries because it is simply impossible to see those small details from your map. Furthermore, it takes a long time to actually render a large quantity of features into a map. Here, we will see how it is possible to simplify geometric features in Python.

As an example we will use data representing the Amazon river in South America, and simplify it’s geometries.

Let’s first read the data and see how the river looks like:

import geopandas as gpd
import pycrs

# File path
fp = "data/Amazon_river.shp"
data = gpd.read_file(fp)

# Print crs
print(data.crs)

# Plot the river
data.plot();

{'proj': 'merc', 'lon_0': -43, 'lat_ts': -2, 'x_0': 5000000, 'y_0': 10000000, 'ellps': 'GRS80', 'units': 'm', 'no_defs': True}

../../_images/geometric-operations_27_1.png

The LineString that is presented here is quite detailed, so let’s see how we can generalize them a bit. As we can see from the coordinate reference system, the data is projected in a metric system using Mercator projection based on SIRGAS datum.

Generalization can be done easily by using a Shapely function called .simplify(). The tolerance parameter can be used to adjusts how much geometries should be generalized. The tolerance value is tied to the coordinate system of the geometries. Hence, the value we pass here is 20 000 meters (20 kilometers).

# Generalize geometry
data['geom_gen'] = data.simplify(tolerance=20000)

# Set geometry to be our new simlified geometry
data = data.set_geometry('geom_gen')

# Plot 
data.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2637f725390>

../../_images/geometric-operations_29_1.png

Nice! As a result, now we have simplified our LineString quite significantly as we can see from the map.