Exercise 1#

Important

Please complete this exercise by 3 pm on Thursday, 9 November, 2023 (the day before the next work session).

To start this assignment, accept the GitHub classroom assignment, and clone your own repository, e.g., in a CSC Notebook instance. Make sure you commit and push all changes you make (you can revisit instructions on how to use git and the JupyterLab git-plugin on the website of the Geo-Python course.

To preview the exercise without logging in, you can find the open course copy of the course’s GitHub repository at github.com/Automating-GIS-processes-II-2023/Exercise-1. Don’t attempt to commit changes to that repository, but rather work with your personal GitHub classroom copy (see above).

Exercises are done individually

All the weekly exercises need to be done individually in this period. So NO pair programming for exercises in this period.

Hints#

`assert` statements#

Assertions are a language feature in Python that allows the programmer to assert, ensure, that a certain condition is met. They are a good way to check that variables are in a suitable range for further computation. For instance, if a function converts a temperature, it can test that its input value is not below absolute zero. In a way, assert statements work similar to an electrical fuse: if input current is higher than expected, the fuse blows to protect the appliance that comes after. If input values are outside an expected range, the assert statement fails with an error, and stops the program to protect the following code from being executed with wrong input.

assert statements are often used in functions to ensure the input values are acceptable. Consider the following example:

def divide(dividend, divisor):
    """Return the division of dividend by divisor."""
    assert divisor != 0, "Cannot divide by zero."
    return (dividend / divisor)

Alternatives to `pandas.DataFrame.iterrows()` (problem 3)#

It is entirely possible to solve problem 3 using the iterrows() pattern you learnt in lesson 6 of Geo-Python, and your code would look something like this:

import pandas
import shapely.geometry

data = pandas.DataFrame({"x": [10, 20, 30], "y": [1, 3, 4]})

# Option 1: iterate over DataFrame’s rows:

for i, row in data.iterrows():
    point = shapely.geometry.Point(row["x"], row["y"])
    # ...

However, there are better, faster, more elegant solutions that also are shorter to write. Pandas’ DataFrames have a method apply() that runs a user-defined function on each row or on each column (depending on the axis parameter, if axis=1, apply() works on rows).

The outputs of running the function repeatly (in parallel, to be precise) are collected in a pandas.GeoSeries that is the return value of apply() and can be assigned to a new column or row (we’ll learn about that in the next lesson, for now let’s convert the data into a list).

Let’s look at an easy example to illustrate how that works: We create a simple function that takes a row and multiplies its x and y values:

def multiply(row):
    """Multiply a row’s x and y values."""
    return (row["x"] * row["y"])

product = data.apply(multiply, axis=1)
# note how the function is not called here (no parentheses!),
# but only passed as a reference

product = list(product)
product

[10, 60, 120]

Pandas’ `apply()` method#

Exactly the same can be done with the more complex example of creating a point geometry:

# Option 2: Define a custom function, and apply this function to the data frame

def create_point(row):
    """Create a Point geometry from a row with x and y values."""
    point = shapely.geometry.Point(row["x"], row["y"])
    return point

point_series = data.apply(create_point, axis=1)

/home/docs/checkouts/readthedocs.org/user_builds/autogis-site/envs/2023/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:127: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
  arr = construct_1d_object_array_from_listlike(values)

`Apply()`ing an anonymous lambda function#

Finally, for simple functions that fit into one single line, we can pass the function in so-called ‘lambda notation’. Lambda functions follow the syntax lambda arguments: return-value, i.e., the keyword lambda followed by one or more, comma-separated, argument names (input variables), a colon (:), and the return value statement (e.g., a calculation). A lambda function that accepts two arguments and returns their sum, would look like this: lambda a, b: (a + b).

Lambda functions can only be used where they are defined, but offer a handy short-cut to not need separate functions for simple expressions. They are very common in data science projects, but should not be over-used: as a rule-of-thumb, don’t use lambda functions if their code does not fit on one (short) line.

Lambda functions

Read more about lambda functions in the official Python documentation.

For the geo-spatial problem we discussed above, we can use a lambda function to create a point ‘on-the-fly’:

# Option 3: Apply a lambda function to the data frame

point_series = data.apply(
    lambda row: shapely.geometry.Point(row["x"], row["y"]),
    axis=1
)

/home/docs/checkouts/readthedocs.org/user_builds/autogis-site/envs/2023/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:127: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
  arr = construct_1d_object_array_from_listlike(values)

Iterating over multiple lists simultaneously#

The built-in Python function zip() makes it easy to work with multiple lists at the same time. It combines two or more lists and iterates over them in parallel, returning one value of each list at a time. Consider the following example:

dog_names = ["Blackie", "Musti", "Svarte"]
dog_ages = [4.5, 2, 15]

# Iterate over the names and ages lists in parallel:
for name, age in zip(dog_names, dog_ages):
    print(f"{name} is {age} years old")

Blackie is 4.5 years old
Musti is 2 years old
Svarte is 15 years old

Variable names

This example illustrates quite well, why variable names should be chosen wisely: lists, for instance, almost always represent multiple values, so their names should be in plural (E.g., dog_names). In a loop, having more than one variable can become confusing quickly; refrain from using short names such as i or j for anything but a simple counter: use descriptive names such as name or age in the above example.

Caution

When iterating over lists of different length, zip would shorten all lists to the length of the shortest. By default, this happens without warning or error message, so be careful!

Exercise 1

Contents

Exercise 1#

Hints#

assert statements#

Alternatives to pandas.DataFrame.iterrows() (problem 3)#

Pandas’ apply() method#

Apply()ing an anonymous lambda function#

Iterating over multiple lists simultaneously#

`assert` statements#

Alternatives to `pandas.DataFrame.iterrows()` (problem 3)#

Pandas’ `apply()` method#

`Apply()`ing an anonymous lambda function#