Exercise 1
Exercise 1#
Important
Please complete this exercise by 3 pm on Thursday, 9 November, 2023 (the day before the next work session).
To start this assignment, accept the GitHub classroom assignment, and clone your own repository, e.g., in a CSC Notebook instance. Make sure you commit and push all changes you make (you can revisit instructions on how to use git and the JupyterLab git-plugin on the website of the Geo-Python course).
To preview the exercise without logging in, you can find the open course copy of the course’s GitHub repository at github.com/Automating-GIS-processes-II-2023/Exercise-1. Don’t attempt to commit changes to that repository, but rather work with your personal GitHub classroom copy (see above).
Exercises are done individually
All the weekly exercises need to be done individually in this period. So NO pair programming for exercises in this period.
Hints#
assert statements#
Assertions are a language feature in Python that allows the programmer to assert, that is, to ensure, that a certain condition is met. They are a good way to check that variables are in a suitable range for further computation. For instance, if a function converts a temperature, it can test that its input value is not below absolute zero. In a way, assert statements work similarly to an electrical fuse: if the input current is higher than expected, the fuse blows to protect the appliance that comes after it. If input values are outside an expected range, the assert statement fails with an error and stops the program, protecting the following code from being executed with wrong input.
assert statements are often used in functions to ensure the input values are acceptable. Consider the following example:
def divide(dividend, divisor):
    """Return the division of dividend by divisor."""
    assert divisor != 0, "Cannot divide by zero."
    return (dividend / divisor)
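The temperature example mentioned above can be sketched in the same way. This is a minimal illustration, not part of the exercise: the function name `celsius_to_kelvin` and the constant `ABSOLUTE_ZERO_CELSIUS` are made up for this sketch. It also shows what happens when an assertion fails: Python raises an `AssertionError` that carries the message given after the comma.

```python
# Hypothetical example: names are chosen for illustration only
ABSOLUTE_ZERO_CELSIUS = -273.15


def celsius_to_kelvin(temperature_celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    # The 'fuse': refuse physically impossible input values
    assert temperature_celsius >= ABSOLUTE_ZERO_CELSIUS, "Temperature is below absolute zero."
    return temperature_celsius - ABSOLUTE_ZERO_CELSIUS


print(celsius_to_kelvin(0))  # prints 273.15

# A failing assertion raises an AssertionError with the given message
try:
    celsius_to_kelvin(-300)
except AssertionError as exception:
    print(f"AssertionError: {exception}")
```

Without the `try`/`except` block, the failed assertion would stop the program at that line, which is exactly the protective behaviour described above.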
Alternatives to pandas.DataFrame.iterrows() (problem 3)#
It is entirely possible to solve problem 3 using the iterrows() pattern you learnt in lesson 6 of Geo-Python, and your code would look something like this:
import pandas
import shapely.geometry

data = pandas.DataFrame({"x": [10, 20, 30], "y": [1, 3, 4]})

# Option 1: iterate over the DataFrame's rows:
for i, row in data.iterrows():
    point = shapely.geometry.Point(row["x"], row["y"])
    # ...
However, there are better, faster, and more elegant solutions that are also shorter to write.
Pandas’ DataFrames have a method apply() that runs a user-defined function on each row or on each column (depending on the axis parameter: if axis=1, apply() works on rows).
The outputs of running the function repeatedly (once per row or column) are collected in a pandas.Series that is the return value of apply() and can be assigned to a new column or row (we’ll learn about that in the next lesson; for now, let’s convert the data into a list).
Let’s look at an easy example to illustrate how that works: we create a simple function that takes a row and multiplies its x and y values:
def multiply(row):
    """Multiply a row's x and y values."""
    return (row["x"] * row["y"])

product = data.apply(multiply, axis=1)
# note how the function is not called here (no parentheses!),
# but only passed as a reference

product = list(product)
product
[10, 60, 120]
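To make the role of the axis parameter more concrete, here is a small sketch that applies one and the same function first to every row and then to every column. The helper function `total` is a hypothetical name used only for this illustration; the data frame repeats the values used above.

```python
import pandas

data = pandas.DataFrame({"x": [10, 20, 30], "y": [1, 3, 4]})


def total(values):
    """Return the sum of all values in one row or one column."""
    return values.sum()


# axis=1: the function receives one row at a time
row_totals = data.apply(total, axis=1)
print(row_totals.tolist())  # one value per row: 11, 23, 34

# axis=0 (the default): the function receives one column at a time
column_totals = data.apply(total, axis=0)
print(column_totals.to_dict())  # one value per column: x is 60, y is 8
```

In both cases, apply() returns a pandas.Series: with axis=1 it has one entry per row, with axis=0 one entry per column.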
Pandas’ apply() method#
Exactly the same can be done with the more complex example of creating a point geometry:
# Option 2: Define a custom function, and apply this function to the data frame
def create_point(row):
    """Create a Point geometry from a row with x and y values."""
    point = shapely.geometry.Point(row["x"], row["y"])
    return point

point_series = data.apply(create_point, axis=1)
Apply()ing an anonymous lambda function#
Finally, for simple functions that fit into one single line, we can pass the function in so-called ‘lambda notation’. Lambda functions follow the syntax lambda arguments: return-value, i.e., the keyword lambda followed by one or more comma-separated argument names (input variables), a colon (:), and an expression that computes the return value (e.g., a calculation). A lambda function that accepts two arguments and returns their sum would look like this: lambda a, b: (a + b).
Lambda functions can only be used where they are defined, but offer a handy shortcut that avoids writing separate functions for simple expressions. They are very common in data science projects, but should not be overused: as a rule of thumb, don’t use a lambda function if its code does not fit on one (short) line.
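As a rough illustration of the syntax, the following sketch compares a regular function with its lambda equivalent, and then passes a lambda function directly to the built-in sorted() function. The function name `add` and the list `points` are made up for this example.

```python
# A regular, named function ...
def add(a, b):
    """Return the sum of a and b."""
    return (a + b)


# ... and the equivalent lambda function, called right where it is defined
print(add(2, 3))                       # prints 5
print((lambda a, b: (a + b))(2, 3))    # prints 5

# Lambda functions are typically passed to other functions, e.g., as a sort key.
# Hypothetical data: (x, y) coordinate tuples
points = [(10, 1), (30, 4), (20, 3)]
points_sorted_by_y = sorted(points, key=lambda point: point[1])
print(points_sorted_by_y)  # prints [(10, 1), (20, 3), (30, 4)]
```

Just like in apply(), the lambda function itself is handed over to sorted(), which then calls it once for every item.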
Lambda functions
Read more about lambda functions in the official Python documentation.
For the geo-spatial problem we discussed above, we can use a lambda function to create a point ‘on-the-fly’:
# Option 3: Apply a lambda function to the data frame
point_series = data.apply(
    lambda row: shapely.geometry.Point(row["x"], row["y"]),
    axis=1
)
Iterating over multiple lists simultaneously#
The built-in Python function zip() makes it easy to work with multiple lists at the same time. It combines two or more lists and iterates over them in parallel, returning one value from each list at a time. Consider the following example:
dog_names = ["Blackie", "Musti", "Svarte"]
dog_ages = [4.5, 2, 15]

# Iterate over the names and ages lists in parallel:
for name, age in zip(dog_names, dog_ages):
    print(f"{name} is {age} years old")
Blackie is 4.5 years old
Musti is 2 years old
Svarte is 15 years old
Variable names
This example illustrates quite well why variable names should be chosen wisely: lists, for instance, almost always represent multiple values, so their names should be plural (e.g., dog_names). In a loop, having more than one variable can quickly become confusing; refrain from using short names such as i or j for anything but a simple counter, and use descriptive names such as name or age, as in the example above.
Caution
When iterating over lists of different lengths, zip() stops as soon as the shortest list is exhausted, and the remaining items of the longer lists are dropped. By default, this happens without a warning or error message, so be careful!
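The following sketch shows this behaviour, assuming that one age is missing from the example lists. It also shows one possible way to keep all items, itertools.zip_longest() from the Python standard library, which pads the shorter lists with a fill value instead of cutting off the longer ones.

```python
import itertools

dog_names = ["Blackie", "Musti", "Svarte"]
dog_ages = [4.5, 2]  # assumption for this sketch: the last age is missing

# zip() silently stops after the shortest list: 'Svarte' is dropped
pairs = list(zip(dog_names, dog_ages))
print(pairs)  # prints [('Blackie', 4.5), ('Musti', 2)]

# zip_longest() keeps all items and fills the gaps with a placeholder
padded_pairs = list(itertools.zip_longest(dog_names, dog_ages, fillvalue=None))
print(padded_pairs)  # prints [('Blackie', 4.5), ('Musti', 2), ('Svarte', None)]
```

In Python 3.10 and later, zip() also accepts the argument strict=True, which raises a ValueError if the lists differ in length.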