---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Geopandas: an introduction

In this section, we will cover the basics of *geopandas*, a Python library to
interact with geospatial vector data.

[Geopandas](https://geopandas.org/) provides an easy-to-use interface to vector
data sets. It combines the capabilities of *pandas*, the data analysis package
we got to know in the
[Geo-Python course](https://geo-python-site.readthedocs.io/en/latest/lessons/L5/pandas-overview.html),
with the geometry handling functionality of
[shapely](../lesson-1/geometry-objects), the
[geo-spatial file format support of fiona](vector-data-io) and the
[map projection libraries of pyproj](map-projections).

The main data structures in geopandas are `GeoDataFrame`s and `GeoSeries`. They
extend the functionality of `pandas.DataFrame`s and `pandas.Series`. This means
that **we can use all our *pandas* skills also when we work with *geopandas*!**

:::{tip}
If you feel like you need to refresh your memory about pandas, head back to
[lesson 5](https://geo-python-site.readthedocs.io/en/latest/lessons/L5/pandas-overview.html)
and
[lesson 6](https://geo-python-site.readthedocs.io/en/latest/notebooks/L6/advanced-data-processing-with-pandas.html)
of Geo-Python.
:::

There is one key difference between pandas’ data frames and geopandas’
[`GeoDataFrame`s](https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geodataframe):
a `GeoDataFrame` contains an additional column for geometries. By default, the
name of this column is `geometry`, and it is a
[`GeoSeries`](https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geoseries)
that contains the geometries (points, lines, polygons, ...) as
`shapely.geometry` objects.

```{code-cell} ipython3
:tags: [remove-input]

import pathlib

import geopandas
import numpy
import pandas

DATA_DIRECTORY = pathlib.Path().resolve() / "data"
HIGHLIGHT_STYLE = "background: #f66161;"

# The following block is a bit of bad magic to make the table output look
# nice (this cell is hidden, we are only interested in a short table listing
# in which the geometry column is highlighted).
#
# For this, we
# 1. convert the geopandas data frame back into a ‘normal’ pandas.DataFrame
#    with a shortened WKT string in the geometry column
# 1b. while doing so, get rid of most of the columns (rename the remaining ones)
# 2. apply the style to all cells in the column "geometry", and to the
#    axis-1-index "geometry"

# Why did I go via a ‘plain’ `pandas.DataFrame`?
# `pandas.set_option("display.max_colwidth", 40)` was ignored, so this seemed
# like the cleanest way

df = geopandas.read_file(DATA_DIRECTORY / "finland_topographic_database" / "m_L4132R_p.shp")
df["geom"] = df.geometry.to_wkt().apply(lambda wkt: wkt[:40] + " ...")
df = df[["RYHMA", "LUOKKA", "geom"]]
df = df.rename(columns={"RYHMA": "GROUP", "LUOKKA": "CLASS", "geom": "geometry"})

(
    df.head().style
    .applymap(lambda x: HIGHLIGHT_STYLE, subset=["geometry"])
    .apply_index(lambda x: numpy.where(x.isin(["geometry"]), HIGHLIGHT_STYLE, ""), axis=1)
)
```

---


## Input data: Finnish topographic database

In this lesson, we will work with the
[National Land Survey of Finland (NLS)/Maanmittauslaitos (MML) topographic database](https://www.maanmittauslaitos.fi/en/maps-and-spatial-data/expert-users/product-descriptions/topographic-database).
- The data set is licensed under the NLS’
  [open data licence](https://www.maanmittauslaitos.fi/en/opendata-licence-cc40) (CC BY 4.0).
- The structure of the data is described in a
  [separate Excel file](http://www.nic.funet.fi/index/geodata/mml/maastotietokanta/2022/maastotietokanta_kohdemalli_eng_2019.xlsx).
- Further information about file naming is available at
  [fairdata.fi](https://etsin.fairdata.fi/dataset/5023ecc7-914a-4494-9e32-d0a39d3b56ae)
  (this link relates to the 2018 issue of the topographic database, but is still valid).

For this lesson, we have acquired a subset of the topographic database as
shapefiles from the Helsinki Region in Finland via the
[CSC’s Paituli download portal](https://paituli.csc.fi). You can find the files
in `data/finland_topographic_database/`.

:::{figure} ../../static/images/lesson-2/paituli-download_700x650px.png
:alt: Screenshot of the Paituli download page

The Paituli *spatial download service* offers data from a long list of national
institutes and agencies.
:::

---


## Read and explore geo-spatial data sets

Before we attempt to load any files, let’s not forget to define a constant that
points to our data directory:

```{code-cell} ipython3
import pathlib

NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"
```

In this lesson, we will focus on **terrain objects** (Feature group:
"Terrain/1" in the topographic database). The Terrain/1 feature group contains
several feature classes. **Our aim in this lesson is to save all the Terrain/1
feature classes into separate files**.

*Terrain/1 features in the Topographic Database:*

| Feature class | Name of feature                                            | Feature group |
|---------------|------------------------------------------------------------|---------------|
| 32421         | Motor traffic area                                         | Terrain/1     |
| 32200         | Cemetery                                                   | Terrain/1     |
| 34300         | Sand                                                       | Terrain/1     |
| 34100         | Rock - area                                                | Terrain/1     |
| 34700         | Rocky area                                                 | Terrain/1     |
| 32500         | Quarry                                                     | Terrain/1     |
| 32112         | Mineral resources extraction area, fine-grained material   | Terrain/1     |
| 32111         | Mineral resources extraction area, coarse-grained material | Terrain/1     |
| 32611         | Field                                                      | Terrain/1     |
| 32612         | Garden                                                     | Terrain/1     |
| 32800         | Meadow                                                     | Terrain/1     |
| 32900         | Park                                                       | Terrain/1     |
| 35300         | Paludified land                                            | Terrain/1     |
| 35412         | Bog, easy to traverse forested                             | Terrain/1     |
| 35411         | Open bog, easy to traverse treeless                        | Terrain/1     |
| 35421         | Open fen, difficult to traverse treeless                   | Terrain/1     |
| 33000         | Earth fill                                                 | Terrain/1     |
| 33100         | Sports and recreation area                                 | Terrain/1     |
| 36200         | Lake water                                                 | Terrain/1     |
| 36313         | Watercourse area                                           | Terrain/1     |

(search-for-files-using-a-pattern)=

:::{admonition} Search for files using a pattern
:class: hint

A `pathlib.Path` (such as `DATA_DIRECTORY`) has a handy method to list all
files in a directory (or subdirectories) that match a pattern:
[`glob()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob).
To list all shapefiles in our topographic database directory, we can use the
following expression:

```{code}
(DATA_DIRECTORY / "finland_topographic_database").glob("*.shp")
```

In the search pattern, `?` represents any single character, `*` any number of
characters (including none), and `**` any number of characters that can also
cross directory boundaries, i.e., descend into subdirectories.

Did you notice the parentheses in the code example above? They work just like
they would in a mathematical expression: first, the expression inside the
parentheses is evaluated, and only then the code outside.
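As a small illustration of the `**` wildcard (a sketch only; which files it
actually finds depends on what is stored below the data directory), the
following expression would list shapefiles in `DATA_DIRECTORY` *and* in all of
its subdirectories:

```{code}
# recursively search DATA_DIRECTORY and all of its subdirectories for shapefiles
list(DATA_DIRECTORY.glob("**/*.shp"))
```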
:::

If you take a quick look at the data directory using a file browser, you will
notice that the topographic database consists of *many* smaller files. Their
names follow a strictly defined
[convention](https://etsin.fairdata.fi/dataset/5023ecc7-914a-4494-9e32-d0a39d3b56ae).
According to this file naming convention, all files that we are interested in
(*Terrain/1* and *polygons*) start with the letter `m` and end with a `p`.

We can use the `glob()` pattern search functionality to find those files:

```{code-cell} ipython3
TOPOGRAPHIC_DATABASE_DIRECTORY = DATA_DIRECTORY / "finland_topographic_database"
TOPOGRAPHIC_DATABASE_DIRECTORY
```

```{code-cell} ipython3
list(TOPOGRAPHIC_DATABASE_DIRECTORY.glob("m*p.shp"))
```

(Note that `glob()` returns an iterator; for now, we quickly convert it into a
list.)

It seems our input data set has only one file that matches our search pattern.
We can save its filename into a new variable, choosing the first item of the
list (index 0):

```{code-cell} ipython3
input_filename = list(TOPOGRAPHIC_DATABASE_DIRECTORY.glob("m*p.shp"))[0]
```

Now, it’s finally time to open the file and look at its contents:

```{code-cell} ipython3
import geopandas

data = geopandas.read_file(input_filename)
```

First, check the data type of the read data set:

```{code-cell} ipython3
type(data)
```

Everything went fine, and we have a `geopandas.GeoDataFrame`. Let’s also
explore the data: (1) print the first few rows, and (2) list the columns.

```{code-cell} ipython3
data.head()
```

```{code-cell} ipython3
data.columns
```

Oh boy! This data set has many columns, and all of the column names are in
Finnish. Let’s select a few useful ones and also translate their names to
English. We’ll keep `RYHMA` and `LUOKKA` (‘group’ and ‘class’, respectively),
and, of course, the `geometry` column.

```{code-cell} ipython3
data = data[["RYHMA", "LUOKKA", "geometry"]]
```

Renaming a column in (geo)pandas works by passing a dictionary to
`DataFrame.rename()`. In this dictionary, the keys are the old names, the
values the new ones:

```{code-cell} ipython3
data = data.rename(
    columns={
        "RYHMA": "GROUP",
        "LUOKKA": "CLASS"
    }
)
```

How does the data set look now?

```{code-cell} ipython3
data.head()
```

:::{admonition} Check your understanding
:class: hint

Use your pandas skills on this geopandas data set to figure out the following
information:

- How many rows does the data set have?
- How many unique classes?
- ... and how many unique groups?
:::

---


### Explore the data set in a map

As geographers, we love maps. But beyond that, it’s always a good idea to
explore a new data set on a map, too.

To create a simple map of a `geopandas.GeoDataFrame`, simply use its `plot()`
method. It works similarly to pandas (see
[Lesson 7 of the Geo-Python course](https://geo-python.github.io/site/notebooks/L7/matplotlib.html)),
but **draws a map based on the geometries of the data set** instead of a chart.

```{code-cell} ipython3
data.plot()
```

Voilà! It is indeed this easy to produce a map out of a geospatial data set.
Geopandas automatically positions your map so that it covers the whole extent
of your data.

:::{note}
If you live in the Helsinki region, you might recognise some of the shapes in
the map ;)
:::


### Geometries in geopandas

Geopandas takes advantage of shapely’s geometry objects. Geometries are stored
in a column called *geometry*.
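By the way, the active geometry column of a `GeoDataFrame` does not *have* to
be called `geometry`; that is only the default name. If you ever need to check
which column is currently used as the geometry column, you can (as a quick
sketch) ask for its name:

```{code-cell} ipython3
# name of the currently active geometry column (here, the default "geometry")
data.geometry.name
```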
Let’s print the first 5 rows of the column `geometry`:

```{code-cell} ipython3
data.geometry.head()
```

Lo and behold, the `geometry` column contains familiar-looking values:
*Well-Known Text* (WKT) strings. Don’t be fooled: they are, in fact,
`shapely.geometry` objects (you might remember them from
[last week’s lesson](../lesson-1/geometry-objects)) that, when `print()`ed or
type-cast into a `str`, are represented as WKT strings.

Since the geometries in a `GeoDataFrame` are stored as shapely objects, we can
use **shapely methods** to handle geometries in geopandas.

Let’s take a closer look at (one of) the polygon geometries in the terrain data
set, and try to use some of the shapely functionality we are already familiar
with. For the sake of clarity, we’ll first work with the geometry of the very
first record, only:

```{code-cell} ipython3
# The value of the column `geometry` in row 0:
data.at[0, "geometry"]
```

```{code-cell} ipython3
# Print information about the area
print(f"Area: {round(data.at[0, 'geometry'].area)} m².")
```

:::{admonition} Area measurement unit
:class: note

Here, we know the coordinate reference system (CRS) of the input data set. The
CRS also defines the unit of measurement (in our case, metres). That’s why we
can print the computed area including an area measurement unit (square metres).
:::

Let’s do the same for multiple rows, and explore the different ways of doing
so. First, use the tried and tested `iterrows()` pattern we learned in
[lesson 6 of the Geo-Python course](https://geo-python.github.io/site/notebooks/L6/pandas/advanced-data-processing-with-pandas.html#Iterating-rows-and-using-self-made-functions-in-Pandas).

```{code-cell} ipython3
# Iterate over the first 5 rows of the data set
for index, row in data[:5].iterrows():
    polygon_area = row["geometry"].area
    print(f"The polygon in row {index} has a surface area of {polygon_area:0.1f} m².")
```

As you see, all **pandas** functions, such as the `iterrows()` method, are
available in geopandas without the need to call pandas separately. Geopandas
builds on top of pandas, and it inherits most of its functionality.

Of course, the `iterrows()` pattern is not the most convenient and efficient
way to calculate the area of many rows. Both `GeoSeries` (geometry columns) and
`GeoDataFrame`s have an `area` property:

```{code-cell} ipython3
# the `area` property of a `GeoDataFrame`
data.area
```

```{code-cell} ipython3
# the `area` property of a `GeoSeries`
data["geometry"].area
```

It’s straightforward to create a new column holding the area:

```{code-cell} ipython3
data["area"] = data.area
data
```

:::{admonition} Descriptive statistics
:class: hint

Do you remember how to calculate the *minimum*, *maximum*, *sum*, *mean*, and
*standard deviation* of a pandas column?
([Lesson 5 of Geo-Python](https://geo-python-site.readthedocs.io/en/latest/notebooks/L5/exploring-data-using-pandas.html#descriptive-statistics))
What are these values for the area column of the data set?
:::


## Write a subset of data to a file

[In the previous section](./vector-data-io.md#writing-geospatial-data-to-a-file),
we learnt how to write an entire `GeoDataFrame` to a file. We can also write a
filtered subset of a data set to a new file, e.g., to help with processing
complex data sets.
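Selecting such a subset works with the same boolean indexing we know from
pandas: a condition on a column produces a mask that selects the matching rows.
As a minimal sketch (the threshold value is arbitrary and only serves as an
illustration), the following would select all polygons with an area larger than
100 000 m²:

```{code-cell} ipython3
# boolean indexing: keep only the rows whose area exceeds an arbitrary threshold
data[data["area"] > 100_000]
```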
First, isolate the lakes in the input data set (class number `36200`, see the
table above):

```{code-cell} ipython3
lakes = data[data.CLASS == 36200]
```

Then, plot the data subset to visually check whether it looks correct:

```{code-cell} ipython3
lakes.plot()
```

And finally, write the filtered data to a Shapefile:

```{code-cell} ipython3
lakes.to_file(DATA_DIRECTORY / "finland_topographic_database" / "lakes.shp")
```

Check the [Vector Data I/O](vector-data-io) section to see which data formats
geopandas can write to.


## Grouping data

A particularly useful method of (geo)pandas’ data frames is their grouping
function: [`groupby()`](https://pandas.pydata.org/docs/user_guide/groupby.html)
can **split data into groups** based on some criteria, **apply** a function
individually to each of the groups, and **combine** the results of such an
operation into a common data structure.

We have used this function earlier: in
[Geo-Python, lesson 6](https://geo-python-site.readthedocs.io/en/latest/notebooks/L6/advanced-data-processing-with-pandas.html#aggregating-data-in-pandas-by-grouping).

We can use *grouping* here to split our input data set into subsets that relate
to each of the `CLASS`es of terrain cover, then save a separate file for each
class.

Let’s start this by, again, taking a look at what the data set actually looks
like:

```{code-cell} ipython3
data.head()
```

Remember: the `CLASS` column contains information about a polygon’s land use
type. Use the
[`pandas.Series.unique()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html)
method to list all values that occur:

```{code-cell} ipython3
data["CLASS"].unique()
```

To group data, use the data frame’s `groupby()` method, supplying a column name
as a parameter:

```{code-cell} ipython3
grouped_data = data.groupby("CLASS")
grouped_data
```

So, `grouped_data` is a `DataFrameGroupBy` object. Inside a `GroupBy` object,
its property `groups` is a dictionary that works as a lookup table: it records
which rows belong to which group. The keys of the dictionary are the unique
values of the grouping column:

```{code-cell}
grouped_data.groups
```

However, one can also simply iterate over the entire `GroupBy` object. Let’s
count how many rows of data each group has:

```{code-cell}
for key, group in grouped_data:
    print(f"Terrain class {key} has {len(group)} rows.")
```

There are, for instance, 56 lake polygons (class `36200`) in the input data
set.

To obtain all rows that belong to one particular group, use the `get_group()`
method, which returns a brand-new `GeoDataFrame`:

```{code-cell}
lakes = grouped_data.get_group(36200)
type(lakes)
```

:::{caution}
The index in the new data frame stays the same as in the ungrouped input data
set. This can be helpful, for instance, when you want to join the grouped data
back to the original input data.
:::


## Write grouped data to separate files

Now we have all the necessary tools in hand to split the input data into
separate data sets for each terrain class, and write the individual subsets to
new, separate files. In fact, the code looks almost too simple, doesn’t it?

```{code-cell}
# Iterate over the input data, grouped by CLASS
for key, group in data.groupby("CLASS"):
    # save the group to a new shapefile
    group.to_file(TOPOGRAPHIC_DATABASE_DIRECTORY / f"terrain_{key}.shp")
```

:::{admonition} File name
:class: attention

We used a `pathlib.Path` combined with an f-string to generate the new output
file’s path and name.
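For example (a sketch, with the class number filled in by hand), for terrain
class `36200` the expression evaluates to a path along the lines of the
following:

```{code}
key = 36200
TOPOGRAPHIC_DATABASE_DIRECTORY / f"terrain_{key}.shp"
# -> [...]/data/finland_topographic_database/terrain_36200.shp
```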
Check this week’s section [Managing file paths](managing-file-paths), and
[Geo-Python lesson 2](https://geo-python-site.readthedocs.io/en/latest/notebooks/L2/Python-basic-elements.html#f-string-formatting)
to revisit how they work.
:::


## Extra: save summary statistics to a CSV spreadsheet

Whenever the results of an operation on a `GeoDataFrame` do not include a
geometry, the output data frame will automatically become a ‘plain’
`pandas.DataFrame`, and can be saved to the standard table formats.

One interesting application of this is to save basic descriptive statistics of
a geospatial data set into a CSV table. For instance, we might want to know the
area each terrain class covers.

Again, we start by grouping the input data by terrain classes, and then compute
the sum of each class’s area. This can be condensed into one line of code:

```{code-cell}
area_information = data.groupby("CLASS").area.sum()
area_information
```

We can then save the resulting table into a CSV file using the standard pandas
approach we learned about in
[Geo-Python lesson 5](https://geo-python-site.readthedocs.io/en/latest/notebooks/L5/processing-data-with-pandas.html#writing-data-to-a-file).

```{code-cell}
area_information.to_csv(TOPOGRAPHIC_DATABASE_DIRECTORY / "area_by_terrain_class.csv")
```
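To double-check the result, we could read the file back in using plain pandas
(a quick sketch; the file was written by the cell above):

```{code-cell}
import pandas

# read the summary table back in to verify it was written correctly
pandas.read_csv(TOPOGRAPHIC_DATABASE_DIRECTORY / "area_by_terrain_class.csv")
```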