Geopandas: an introduction#

In this section, we will cover the basics of geopandas, a Python library to interact with geospatial vector data.

Geopandas provides an easy-to-use interface to vector data sets. It combines the capabilities of pandas, the data analysis package we got to know in the Geo-Python course, with the geometry handling functionality of shapely, the geo-spatial file format support of fiona and the map projection libraries of pyproj.

The main data structures in geopandas are GeoDataFrames and GeoSeries. They extend the functionality of pandas.DataFrames and pandas.Series. This means that we can use all our pandas skills also when we work with geopandas!.

Tip

If you feel like you need to refresh your memory about pandas, head back to lesson 5 and lesson 6 of Geo-Python.

There is one key difference between pandas’s data frames and geopandas’ `GeoDataFrames <https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geodataframe>`__: a GeoDataFrame contains an additional column for geometries. By default, the name of this column is geometry, and it is a `GeoSeries <https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geoseries>`__ that contains the geometries (points, lines, polygons, …) as shapely.geometry objects.

[1]:

import pathlib
import geopandas
import numpy
import pandas

DATA_DIRECTORY = pathlib.Path().resolve() / "data"
print (DATA_DIRECTORY)
HIGHLIGHT_STYLE = "background: #f66161;"

# so the following block is a bit of bad magic to make the table output look
# nice (this cell is hidden, we are only interested in a short table listing
# in which the geometry column is highlighted).
#
# For this, we
#    1. convert the geopandas back into a ‘normal’ pandas.DataFrame with a shortened
#       WKT string in the geometry column
#    1b. while doing so, get rid of most of the columns (rename the remaining ones)
#    2. apply the style to all cells in the column "geometry", and to the axis-1-index "geometry"

# Why did I got via a ‘plain’ `pandas.DataFrame`?
# `pandas.set_option("display.max_colwidth", 40)` was ignored, so this seemed like the cleanest way

df = geopandas.read_file(DATA_DIRECTORY / "finland_topographic_database" / "m_L4132R_p.shp")

df["geom"] = df.geometry.to_wkt().apply(lambda wkt: wkt[:40] + " ...")

df = df[["RYHMA", "LUOKKA", "geom"]]
df = df.rename(columns={"RYHMA": "GROUP", "LUOKKA": "CLASS", "geom": "geometry"})

(
    df.head().style
        .applymap(lambda x: HIGHLIGHT_STYLE, subset=["geometry"])
        .apply_index(lambda x: numpy.where(x.isin(["geometry"]), HIGHLIGHT_STYLE, ""), axis=1)
)

/home/docs/checkouts/readthedocs.org/user_builds/autogis-site/checkouts/latest/docs/lessons/lesson-2/data

/tmp/ipykernel_1777/1118195562.py:32: FutureWarning: Styler.applymap has been deprecated. Use Styler.map instead.
  .applymap(lambda x: HIGHLIGHT_STYLE, subset=["geometry"])

[1]:

	GROUP	CLASS	geometry
0	64	32421	POLYGON ((379394.248 6689991.936, 379389 ...
1	64	32421	POLYGON ((378980.811 6689359.377, 378983 ...
2	64	32421	POLYGON ((378804.766 6689256.471, 378817 ...
3	64	32421	POLYGON ((379229.695 6685025.111, 379233 ...
4	64	32421	POLYGON ((379825.199 6685096.247, 379829 ...

Input data: Finnish topographic database#

In this lesson, we will work with the National Land Survey of Finland (NLS)/Maanmittauslaitos (MML) topographic database.

The data set is licensed under the NLS’ open data licence (CC BY 4.0).
The structure of the data is described in a separate Excel file.
Further information about file naming is available at fairdata.fi (this link relates to the 2018 issue of the topographic database, but is still valid).

For this lesson, we have acquired a subset of the topographic database as shapefiles from the Helsinki Region in Finland via the CSC’s Paituli download portal. You can find the files in data/finland_topographic_database/.

Screenshot of the Paituli download page

The Paituli spatial download service offers data from a long list of national institutes and agencies.

Read and explore geo-spatial data sets#

Before we attempt to load any files, let’s not forget to defining a constant that points to our data directory:

[2]:

import pathlib
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"

In this lesson, we will focus on terrain objects (Feature group: “Terrain/1” in the topographic database). The Terrain/1 feature group contains several feature classes.

Our aim in this lesson is to save all the Terrain/1 feature classes into separate files.

Terrain/1 features in the Topographic Database:

feature class	Name of feature	Feature group
32421	Motor traffic area	Terrain/1
32200	Cemetery	Terrain/1
34300	Sand	Terrain/1
34100	Rock - area	Terrain/1
34700	Rocky area	Terrain/1
32500	Quarry	Terrain/1
32112	Mineral resources extraction area, fine-grained material	Terrain/1
32111	Mineral resources extraction area, coarse-grained material	Terrain/1
32611	Field	Terrain/1
32612	Garden	Terrain/1
32800	Meadow	Terrain/1
32900	Park	Terrain/1
35300	Paludified land	Terrain/1
35412	Bog, easy to traverse forested	Terrain/1
35411	Open bog, easy to traverse treeless	Terrain/1
35421	Open fen, difficult to traverse treeless	Terrain/1
33000	Earth fill	Terrain/1
33100	Sports and recreation area	Terrain/1
36200	Lake water	Terrain/1
36313	Watercourse area	Terrain/1

Hint: Search for files using a pattern

A pathlib.Path (such as DATA_DIRECTORY) has a handy method to list all files in a directory (or subdirectories) that match a pattern: `glob() <https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob>`__.

To list all shapefiles in our topographic database directory, we can use the following expression:
(DATA_DIRECTORY / "finland_topographic_database").glob("*.shp")
In the search pattern, ? represents any one single character, * multiple (or none, or one) characters, and ** multiple characters that can include subdirectories.

Did you notice the parentheses in the code example above? They work just like they would in a mathematical expression: first, the expression inside the parentheses is evaluated, only then, the code outside.

If you take a quick look at the data directory using a file browser, you will notice that the topographic database consists of many smaller files. Their names follow a strictly defined convention, according to this file naming convention, all files that we interested in (Terrain/1 and polygons) start with a letter m and end with a p.

We can use the glob() pattern search functionality to find those files:

[3]:

TOPOGRAPHIC_DATABASE_DIRECTORY = DATA_DIRECTORY / "finland_topographic_database"

TOPOGRAPHIC_DATABASE_DIRECTORY

[3]:

PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/autogis-site/checkouts/latest/docs/lessons/lesson-2/data/finland_topographic_database')

[4]:

list(TOPOGRAPHIC_DATABASE_DIRECTORY.glob("m*p.shp"))

[4]:

[PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/autogis-site/checkouts/latest/docs/lessons/lesson-2/data/finland_topographic_database/m_L4132R_p.shp')]

(Note that glob() returns an iterator, but, for now, we quickly convert it to a list)

It seems our input data set has only one file that matches our search pattern. We can save its filename into a new variable, choosing the first item of the list (index 0):

[5]:

input_filename = list(TOPOGRAPHIC_DATABASE_DIRECTORY.glob("m*p.shp"))[0]

Now, it’s finally time to open the file and look at its contents:

[6]:

import geopandas
data = geopandas.read_file(input_filename)

First, check the data type of the read data set:

[7]:

type(data)

[7]:

geopandas.geodataframe.GeoDataFrame

Everything went fine, and we have a geopandas.GeoDataFrame. Let’s also explore the data: (1) print the first few rows, and

list the columns.

[8]:

data.head()

[8]:

	TEKSTI	RYHMA	LUOKKA	TASTAR	KOHDEOSO	AINLAHDE	SYNTYHETKI	...	TIENIM2	TIENIM3	TIENIM4	TIENIM5	KUNTA	geometry
0	None	64	32421	5000	1812247077	1	20180125	...	None	None	None	None	None	POLYGON ((379394.248 6689991.936, 379389.79 66...
1	None	64	32421	5000	1718796908	1	20180120	...	None	None	None	None	None	POLYGON ((378980.811 6689359.377, 378983.401 6...
2	None	64	32421	20000	411167695	1	20180120	...	None	None	None	None	None	POLYGON ((378804.766 6689256.471, 378817.107 6...
3	None	64	32421	20000	411173768	1	20180120	...	None	None	None	None	None	POLYGON ((379229.695 6685025.111, 379233.366 6...
4	None	64	32421	20000	411173698	1	20180120	...	None	None	None	None	None	POLYGON ((379825.199 6685096.247, 379829.651 6...

5 rows × 39 columns

[9]:

data.columns

[9]:

Index(['TEKSTI', 'RYHMA', 'LUOKKA', 'TASTAR', 'KORTAR', 'KORARV', 'KULKUTAPA',
       'KOHDEOSO', 'AINLAHDE', 'SYNTYHETKI', 'KUOLHETKI', 'KARTOGLK',
       'ALUEJAKOON', 'VERSUH', 'SUUNTA', 'SIIRT_DX', 'SIIRT_DY', 'KORKEUS',
       'PYSYVAID', 'ATTR3', 'TIENUM', 'OSANUM', 'TIEOSA', 'PAALLY', 'YKSSUU',
       'VAPKOR', 'VALMAS', 'PITUUS', 'FROMLEFT', 'TOLEFT', 'FROMRIGHT',
       'TORIGHT', 'TIENIM2', 'TIENIM3', 'TIENIM4', 'TIENIM5', 'KUNTA_NRO',
       'KUNTA', 'geometry'],
      dtype='object')

Oh boy! This data set has many columns, and all of the column names are in Finnish.

Let’s select a few useful ones and also translate their names to English. We’ll keep ’RYHMA’ and ’LUOKKA’ (‘group’ and ‘class’, respectively), and, of course, the geometry column.

[10]:

data = data[["RYHMA", "LUOKKA", "geometry"]]

Renaming a column in (geo)pandas works by passing a dictionary to DataFrame.rename(). In this dictionary, the keys are the old names, the values the new ones:

[11]:

data = data.rename(
    columns={
        "RYHMA": "GROUP",
        "LUOKKA": "CLASS"
    }
)

How does the data set look now?

[12]:

data.head()

[12]:

	GROUP	CLASS	geometry
0	64	32421	POLYGON ((379394.248 6689991.936, 379389.79 66...
1	64	32421	POLYGON ((378980.811 6689359.377, 378983.401 6...
2	64	32421	POLYGON ((378804.766 6689256.471, 378817.107 6...
3	64	32421	POLYGON ((379229.695 6685025.111, 379233.366 6...
4	64	32421	POLYGON ((379825.199 6685096.247, 379829.651 6...

Hint: Check your understanding

Use your pandas skills on this geopandas data set to figure out the following information:

How many rows does the data set have?

How many unique classes?

… and how many unique groups?

Explore the data set in a map:#

As geographers, we love maps. But beyond that, it’s always a good idea to explore a new data set also in a map. To create a simple map of a geopandas.GeoDataFrame, simply use its plot() method. It works similar to pandas (see Lesson 7 of the Geo-Python course, but draws a map based on the geometries of the data set instead of a chart.

[13]:

data.plot()

[13]:

<Axes: >

../../_images/lessons_lesson-2_geopandas-an-introduction_23_1.png

Voilá! It is indeed this easy to produce a map out of an geospatial data set. Geopandas automatically positions your map in a way that it covers the whole extent of your data.

Note

If you live in the Helsinki region, you might recognize some of the shapes in the map ;)

Geometries in geopandas#

Geopandas takes advantage of shapely’s geometry objects. Geometries are stored in a column called geometry.

Let’s print the first 5 rows of the column geometry:

[14]:

data.geometry.head()

[14]:

0    POLYGON ((379394.248 6689991.936, 379389.79 66...
1    POLYGON ((378980.811 6689359.377, 378983.401 6...
2    POLYGON ((378804.766 6689256.471, 378817.107 6...
3    POLYGON ((379229.695 6685025.111, 379233.366 6...
4    POLYGON ((379825.199 6685096.247, 379829.651 6...
Name: geometry, dtype: geometry

Lo and behold, the geometry column contains familiar-looking values: Well-Known Text (WKT) strings. Don’t be fooled, they are, in fact, shapely.geometry objects (you might remember from last week’s lesson) that, when print()ed or type-cast into a str, are represented as a WKT string).

Since the geometries in a GeoDataFrame are stored as shapely objects, we can use shapely methods to handle geometries in geopandas.

Let’s take a closer look at (one of) the polygon geometries in the terrain data set, and try to use some of the shapely functionality we are already familiar with. For the sake of clarity, first, we’ll work with the geometry of the very first record, only:

[15]:

# The value of the column `geometry` in row 0:
data.at[0, "geometry"]

[15]:

../../_images/lessons_lesson-2_geopandas-an-introduction_27_0.svg

[16]:

# Print information about the area
print(f"Area: {round(data.at[0, 'geometry'].area)} m².")

Area: 76 m².

Note: Area measurement unit

Here, we know the coordinate reference system (CRS) of the input data set. The CRS also defines the unit of measurement (in our case, meters). That’s why we can print the computed area including an area measurement unit (square meters).

Let’s do the same for multiple rows, and explore different options of how to. First, use the reliable and tried iterrows() pattern we learned in lesson 6 of the Geo-Python course.

[17]:

# Iterate over the first 5 rows of the data set
for index, row in data[:5].iterrows():
    polygon_area = row["geometry"].area
    print(f"The polygon in row {index} has a surface area of {polygon_area:0.1f} m².")

The polygon in row 0 has a surface area of 76.0 m².
The polygon in row 1 has a surface area of 2652.1 m².
The polygon in row 2 has a surface area of 3185.6 m².
The polygon in row 3 has a surface area of 13075.2 m².
The polygon in row 4 has a surface area of 3980.7 m².

As you see, all pandas functions, such as the iterrows() method, are available in geopandas without the need to call pandas separately. Geopandas builds on top of pandas, and it inherits most of its functionality.

Of course the iterrows() pattern is not the most convenient and efficient way to calculate the area of many rows. Both GeoSeries (geometry columns) and GeoDataFrames have an area property:

[18]:

# the `area` property of a `GeoDataFrame`
data.area

[18]:

0          76.027392
1        2652.054186
2        3185.649995
3       13075.165279
4        3980.682621
            ...
4299     2651.800270
4300      376.503380
4301      413.942555
4302     3487.927677
4303     1278.963199
Length: 4304, dtype: float64

[19]:

# the `area property of a `GeoSeries`
data["geometry"].area

[19]:

0          76.027392
1        2652.054186
2        3185.649995
3       13075.165279
4        3980.682621
            ...
4299     2651.800270
4300      376.503380
4301      413.942555
4302     3487.927677
4303     1278.963199
Length: 4304, dtype: float64

It’s straight-forward to create a new column holding the area:

[20]:

data["area"] = data.area
data

[20]:

	GROUP	CLASS	geometry	area
0	64	32421	POLYGON ((379394.248 6689991.936, 379389.79 66...	76.027392
1	64	32421	POLYGON ((378980.811 6689359.377, 378983.401 6...	2652.054186
2	64	32421	POLYGON ((378804.766 6689256.471, 378817.107 6...	3185.649995
3	64	32421	POLYGON ((379229.695 6685025.111, 379233.366 6...	13075.165279
4	64	32421	POLYGON ((379825.199 6685096.247, 379829.651 6...	3980.682621
...	...	...	...	...
4299	64	36313	POLYGON ((375668.607 6682942.062, 375671.489 6...	2651.800270
4300	64	36313	POLYGON ((368411.063 6679328.99, 368411.424 66...	376.503380
4301	64	36313	POLYGON ((368054.608 6679164.737, 368059.602 6...	413.942555
4302	64	36313	POLYGON ((368096.331 6678000, 368090.276 66780...	3487.927677
4303	64	36313	POLYGON ((368000.666 6678460.142, 368000 66784...	1278.963199

4304 rows × 4 columns

Hint: Descriptive statistics

Do you remember how to calculate the minimum, maximum, sum, mean, and standard deviation of a pandas column? (Lesson 5 of Geo-Python)

What are these values for the area column of the data set?

Write a subset of data to a file#

In the previous section, we learnt how to write an entire GeoDataFrame to a file. We can also write a filtered subset of a data set to a new file, e.g., to help with processing complex data sets.

First, isolate the lakes in the input data set (class number 36200, see table above):

[21]:

lakes = data[data.CLASS == 36200]

Then, plot the data subset to visually check whether it looks correct:

[22]:

lakes.plot()

[22]:

<Axes: >

../../_images/lessons_lesson-2_geopandas-an-introduction_39_1.png

And finally, write the filtered data to a Shapefile:

[23]:

lakes.to_file(DATA_DIRECTORY / "finland_topographic_database" / "lakes.shp")

Check the Vector Data I/O section to see which data formats geopandas can write to.

Grouping data#

A particularly useful method of (geo)pandas’ data frames is their grouping function: `groupby() <https://pandas.pydata.org/docs/user_guide/groupby.html>`__ can split data into groups based on some criteria, apply a function individually to each of the groups, and combine results of such an operation into a common data structure.

We have used this function earlier: in Geo-Python, lesson 6.

We can use grouping here to split our input data set into subsets that relate to each of the CLASSes of terrain cover, then save a separate file for each class.

Let’s start this by, again, taking a look at how the data set actually looks like:

[24]:

data.head()

[24]:

	GROUP	CLASS	geometry	area
0	64	32421	POLYGON ((379394.248 6689991.936, 379389.79 66...	76.027392
1	64	32421	POLYGON ((378980.811 6689359.377, 378983.401 6...	2652.054186
2	64	32421	POLYGON ((378804.766 6689256.471, 378817.107 6...	3185.649995
3	64	32421	POLYGON ((379229.695 6685025.111, 379233.366 6...	13075.165279
4	64	32421	POLYGON ((379825.199 6685096.247, 379829.651 6...	3980.682621

Remember: the CLASS column contains information about a polygon’s land use type. Use the `pandas.Series.unique() <https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html>`__ method to list all values that occur:

[25]:

data["CLASS"].unique()

[25]:

array([32421, 32200, 34300, 34100, 34700, 32417, 32500, 32112, 32111,
       32611, 32612, 32800, 32900, 35300, 35412, 35411, 35421, 33000,
       33100, 36200, 36313])

To group data, use the data frame’s groupby() method, supply a column name as a parameter:

[26]:

grouped_data = data.groupby("CLASS")
grouped_data

[26]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x74f7d6fc7470>

So, grouped_data is a DataFrameGroupBy object. Inside a GroupBy object, its property groups is a dictionary that works as a lookup table: it records which rows belong to which group. The keys of the dictionary are the unique values of the grouping column:

[27]:

grouped_data.groups

[27]:

{32111: [3116], 32112: [3115], 32200: [103, 104], 32417: [3112], 32421: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], 32500: [3113, 3114], 32611: [3117, 3118, 3119, 3120, 3121, 3122, 3123, 3124, 3125, 3126, 3127, 3128, 3129, 3130, 3131, 3132, 3133, 3135, 3136, 3137, 3138, 3139, 3140, 3141, 3142, 3143, 3144, 3145, 3146, 3147, 3148, 3149, 3150, 3151, 3152, 3153, 3154, 3155, 3156, 3157, 3158, 3159, 3160, 3161, 3162, 3163, 3164, 3165, 3166, 3167, 3168, 3169, 3170, 3171, 3172, 3173, 3174, 3175, 3176, 3177, 3178, 3179, 3180, 3181, 3182, 3183, 3184, 3185, 3186, 3187, 3188, 3189, 3190, 3191, 3192, 3193, 3194, 3195, 3196, 3197, 3198, 3199, 3200, 3201, 3202, 3203, 3204, 3205, 3206, 3207, 3208, 3209, 3210, 3211, 3212, 3213, 3214, 3215, 3216, 3217, ...], 32612: [3134, 3224, 3242, 3245, 3265, 3266, 3283, 3326, 3327, 3378, 3380], 32800: [3389, 3390, 3391, 3392, 3393, 3394, 3395, 3396, 3397, 3398, 3399, 3400, 3401, 3402, 3403, 3404, 3405, 3406, 3407, 3408, 3409, 3410, 3411, 3412, 3413, 3414, 3415, 3416, 3417, 3418, 3419, 3420, 3421, 3422, 3423, 3424, 3425, 3426, 3427, 3428, 3429, 3430, 3431, 3432, 3433, 3434, 3435, 3436, 3437, 3438, 3439, 3440, 3441, 3442, 3443, 3444, 3445, 3446, 3447, 3448, 3449, 3450, 3451, 3452, 3453, 3454, 3455, 3456, 3457, 3458, 3459, 3460, 3461, 3462, 3463, 3464, 3465, 3466, 3467, 3468, 3469], 32900: [3470, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486, 3487, 3488, 3489, 3490, 3491, 3492, 3493, 3494, 3495], 33000: [4118, 4119, 4120, 4121, 4122], 33100: [4123, 4124, 4125, 4126, 4127, 4128, 4129, 4130, 4131, 4132, 4133, 4134, 4135, 4136, 4137, 4138, 4139, 4140, 4141, 4142, 4143, 4144, 4145, 4146, 4147, 4148, 4149, 4150, 4151, 4152, 4153, 4154, 4155, 4156, 4157, 4158, 4159, 4160, 4161, 4162, 4163, 4164, 4165, 4166, 4167, 4168, 4169, 4170, 4171, 4172, 4173, 4174, 4175, 4176, 4177, 4178, 4179, 4180, 4181, 4182, 4183, 4184, 4185, 4186, 4187, 4188, 4189, 4190, 4191, 4192, 4193, 4194, 4195, 4196, 4197, 4198, 4199, 4200, 4201, 4202, 4203, 4204, 4205, 4206, 4207, 4208, 4209, 4210, 4211, 4212, 4213, 4214, 4215, 4216, 4217, 4218, 4219, 4220, 4221, 4222, ...], 34100: [106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, ...], 34300: [105], 34700: [3109, 3110, 3111], 35300: [3496, 3497, 3498, 3499, 3500, 3501, 3502, 3503, 3504, 3505, 3506, 3507, 3508, 3509, 3510, 3511, 3512, 3513, 3514, 3515, 3516, 3517, 3518, 3519, 3520, 3521, 3522, 3523, 3524, 3525, 3526, 3527, 3528, 3529, 3530, 3531, 3532, 3533, 3534, 3535, 3536, 3537, 3538, 3539, 3540, 3541, 3542, 3543, 3544, 3545, 3546, 3547, 3548, 3549, 3550, 3551, 3552, 3553, 3554, 3555, 3556, 3557, 3558, 3559, 3560, 3561, 3562, 3563, 3564, 3565, 3566, 3567, 3568, 3569, 3570, 3571, 3572, 3573, 3574, 3575, 3576, 3577, 3578, 3579, 3580, 3581, 3582, 3583, 3584, 3585, 3586, 3587, 3588, 3589, 3590, 3591, 3592, 3593, 3594, 3595, ...], 35411: [3637, 3638, 3643, 3652, 3717, 3720, 3733, 3734, 3735, 3742, 3753, 3754, 3779, 3802, 3820, 3822, 3844, 3925, 3950, 4002, 4003, 4005, 4053, 4055, 4056, 4062, 4070, 4072, 4087, 4095, 4103, 4105, 4106, 4112], 35412: [3630, 3631, 3632, 3633, 3634, 3635, 3636, 3639, 3640, 3641, 3642, 3644, 3645, 3646, 3647, 3648, 3649, 3650, 3653, 3654, 3655, 3656, 3657, 3658, 3659, 3660, 3661, 3662, 3663, 3664, 3665, 3666, 3667, 3668, 3669, 3671, 3672, 3673, 3674, 3675, 3676, 3677, 3678, 3679, 3680, 3681, 3682, 3683, 3684, 3685, 3686, 3687, 3688, 3689, 3690, 3691, 3692, 3693, 3694, 3695, 3696, 3697, 3698, 3699, 3700, 3701, 3702, 3704, 3705, 3706, 3707, 3708, 3709, 3710, 3711, 3712, 3713, 3714, 3715, 3716, 3718, 3719, 3721, 3722, 3723, 3724, 3725, 3726, 3727, 3728, 3729, 3730, 3731, 3732, 3736, 3737, 3738, 3739, 3740, 3741, ...], 35421: [3651, 3670, 3703, 3755, 3758], 36200: [4240, 4241, 4242, 4243, 4244, 4245, 4246, 4247, 4248, 4249, 4250, 4251, 4252, 4253, 4254, 4255, 4256, 4257, 4258, 4259, 4260, 4261, 4262, 4263, 4264, 4265, 4266, 4267, 4268, 4269, 4270, 4271, 4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281, 4282, 4283, 4284, 4285, 4286, 4287, 4288, 4289, 4290, 4291, 4292, 4293, 4294, 4295], 36313: [4296, 4297, 4298, 4299, 4300, 4301, 4302, 4303]}

However, one can also simply iterate over the entire GroupBy object. Let’s count how many rows of data each group has:

[28]:

for key, group in grouped_data:
    print(f"Terrain class {key} has {len(group)} rows.")

Terrain class 32111 has 1 rows.
Terrain class 32112 has 1 rows.
Terrain class 32200 has 2 rows.
Terrain class 32417 has 1 rows.
Terrain class 32421 has 103 rows.
Terrain class 32500 has 2 rows.
Terrain class 32611 has 261 rows.
Terrain class 32612 has 11 rows.
Terrain class 32800 has 81 rows.
Terrain class 32900 has 26 rows.
Terrain class 33000 has 5 rows.
Terrain class 33100 has 117 rows.
Terrain class 34100 has 3003 rows.
Terrain class 34300 has 1 rows.
Terrain class 34700 has 3 rows.
Terrain class 35300 has 134 rows.
Terrain class 35411 has 34 rows.
Terrain class 35412 has 449 rows.
Terrain class 35421 has 5 rows.
Terrain class 36200 has 56 rows.
Terrain class 36313 has 8 rows.

There are, for instance, 56 lake polygons (class 36200) in the input data set.

To obtain all rows that belong to one particular group, use the get_group() method, which returns a brand-new GeoDataFrame:

[29]:

lakes = grouped_data.get_group(36200)
type(lakes)

[29]:

geopandas.geodataframe.GeoDataFrame

Caution

The index in the new data frame stays the same as in the ungrouped input data set. This can be helpful, for instance, when you want to join the grouped data back to the original input data.

Write grouped data to separate files#

Now we have all the necessary tools in hand to split the input data into separate data sets for each terrain class, and write the individual subsets to new, separate, files. In fact, the code looks almost too simple, doesn’t it?

[30]:

# Iterate over the input data, grouped by CLASS
for key, group in data.groupby("CLASS"):
    # save the group to a new shapefile
    group.to_file(TOPOGRAPHIC_DATABASE_DIRECTORY / f"terrain_{key}.shp")

Attention: File name

We used a pathlib.Path combined with an f-string to generate the new output file’s path and name. Check this week’s section Managing file paths, and Geo-Python lesson 2 to revisit how they work.

Extra: save summary statistics to CSV spreadsheet#

Whenever the results of an operation on a GeoDataFrame do not include a geometry, the output data frame will automatically become a ‘plain’ pandas.DataFrame, and can be saved to the standard table formats.

One interesting application of this is to save basic descriptive statistics of a geospatial data set into a CSV table. For instance, we might want to know the area each terrain class covers.

Again, we start by grouping the input data by terrain classes, and then compute the sum of each classes’ area. This can be condensed into one line of code:

[31]:

area_information = data.groupby("CLASS").area.sum()
area_information

[31]:

CLASS
32111    1.833747e+03
32112    2.148168e+03
32200    1.057368e+05
32417    1.026678e+02
32421    6.792797e+05
32500    1.097467e+05
32611    1.314807e+07
32612    1.073431e+05
32800    1.407231e+06
32900    6.158391e+05
33000    6.594647e+05
33100    3.769076e+06
34100    1.236289e+07
34300    1.627079e+03
34700    2.785751e+03
35300    1.382940e+06
35411    3.928004e+05
35412    4.708321e+06
35421    6.786374e+04
36200    9.986966e+06
36313    4.346029e+04
Name: area, dtype: float64

We can then save the resulting table into a CSV file using the standard pandas approach we learned about in Geo-Python lesson 5.

[32]:

area_information.to_csv(TOPOGRAPHIC_DATABASE_DIRECTORY / "area_by_terrain_class.csv")

Geopandas: an introduction

Contents