Exercise 4
Important
Please complete this exercise by the end of day on Thursday, 30 November, 2023 (the day of next week’s work session).
To start this assignment, accept the GitHub Classroom assignment and clone your own repository, e.g., in a CSC Notebook instance. Make sure you commit and push all changes you make (you can revisit the instructions on how to use git and the JupyterLab git plugin on the website of the Geo-Python course).
To preview the exercise without logging in, you can find the open course copy of the course’s GitHub repository at github.com/automating-gis-processes-2022/exercise-4. Don’t attempt to commit changes to that repository, but rather work with your personal GitHub Classroom copy (see above).
Hints
Joining two data frames on different column names
We have already joined data sets that share the same index, and we have used spatial joins to merge geo-data frames depending on their geometric relationships.
For problem 1, it can be handy to join two data sets on two columns that have different names. One good approach is to set the index of both data frames to the columns that hold the matching values:
import pandas

df1 = pandas.DataFrame({
    "id": [1, 2, 3],
    "other_column": ["a", "b", "c"]
})
df1
|   | id | other_column |
|---|----|--------------|
| 0 | 1  | a            |
| 1 | 2  | b            |
| 2 | 3  | c            |
df2 = pandas.DataFrame({
    "id": [67, 68, 69],
    "other_other_column": ["x", "y", "z"],
    "df1_id": [1, 2, 3]
})
df2
|   | id | other_other_column | df1_id |
|---|----|--------------------|--------|
| 0 | 67 | x                  | 1      |
| 1 | 68 | y                  | 2      |
| 2 | 69 | z                  | 3      |
joint_df = df1.set_index("id").join(df2.set_index("df1_id"))
joint_df
| id (index) | other_column | id | other_other_column |
|------------|--------------|----|--------------------|
| 1          | a            | 67 | x                  |
| 2          | b            | 68 | y                  |
| 3          | c            | 69 | z                  |
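If you prefer not to touch the indices, pandas.DataFrame.merge() accepts differently named key columns directly through its left_on= and right_on= parameters. A minimal sketch using df1 and df2 from above:

# match df1["id"] against df2["df1_id"]; how="left" keeps all rows of df1
# (the overlapping "id" columns are automatically suffixed with "_x" and "_y")
merged_df = df1.merge(df2, left_on="id", right_on="df1_id", how="left")
merged_df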
Renaming columns when joining data frames
It is often necessary to rename columns when we join data frames that have duplicate column names. In the example below, both df1 and df2 have a column id; the join fails. An appropriate fix is to add a suffix to all columns of one or both of the data frames:
import pandas

df1 = pandas.DataFrame({
    "id": [1, 2, 3],
    "other_column": ["a", "b", "c"]
})
df1
|   | id | other_column |
|---|----|--------------|
| 0 | 1  | a            |
| 1 | 2  | b            |
| 2 | 3  | c            |
df2 = pandas.DataFrame({
    "id": [67, 68, 69],
    "other_other_column": ["x", "y", "z"],
    "df1_id": [1, 2, 3]
})
df2
|   | id | other_other_column | df1_id |
|---|----|--------------------|--------|
| 0 | 67 | x                  | 1      |
| 1 | 68 | y                  | 2      |
| 2 | 69 | z                  | 3      |
# Will fail, because duplicate column names exist:
joint_df = df1.join(df2)
joint_df
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [6], line 2
      1 # Will fail, because duplicate column names exist:
----> 2 joint_df = df1.join(df2)
      3 joint_df

(...)

ValueError: columns overlap but no suffix specified: Index(['id'], dtype='object')
# works: add a suffix to one of the data sets’ columns
joint_df = df1.join(df2.add_suffix("_df2"))
joint_df
|   | id | other_column | id_df2 | other_other_column_df2 | df1_id_df2 |
|---|----|--------------|--------|------------------------|------------|
| 0 | 1  | a            | 67     | x                      | 1          |
| 1 | 2  | b            | 68     | y                      | 2          |
| 2 | 3  | c            | 69     | z                      | 3          |
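Alternatively, DataFrame.join() can add the suffix itself: its lsuffix= and rsuffix= parameters are applied to the overlapping column names only. A minimal sketch, equivalent in spirit to the example above:

# only the overlapping column ("id") of df2 receives the suffix here
joint_df = df1.join(df2, rsuffix="_df2")
joint_df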
Searching for files using a pattern
In Lesson 2 we discussed how to use a file pattern to search for files, using pathlib.Path.glob().
To loop over all files ending in _s.shp inside DATA_DIRECTORY / "finland_topographic_database", use the following piece of code:
import pathlib

DATA_DIRECTORY = pathlib.Path().resolve() / "data"

for input_file in (DATA_DIRECTORY / "finland_topographic_database").glob("*_s.shp"):
    print(input_file.name)
j_L4132R_s.shp
l_L4132R_s.shp
s_L4132R_s.shp
k_L4132R_s.shp
m_L4132R_s.shp
r_L4132R_s.shp
n_L4132R_s.shp
This will come in handy for problem 2, when reading in all travel time data files. Be sure to revisit the explanation in Lesson 2.
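For instance, a minimal sketch that builds on the pattern above and reads every matched file into a list of geo-data frames (assuming geopandas is installed and the matched files are vector data sets that geopandas.read_file() can open):

import geopandas
import pathlib

DATA_DIRECTORY = pathlib.Path().resolve() / "data"

# collect one geo-data frame per file that matches the pattern
input_data = []
for input_file in (DATA_DIRECTORY / "finland_topographic_database").glob("*_s.shp"):
    input_data.append(geopandas.read_file(input_file))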
Find the minimum value across multiple columns
For problem 2, you have to find the smallest value across multiple columns: the shortest travel time to any of the eight shopping centres. For this, pandas’ DataFrame.min() method can come in handy. It identifies the smallest value in each column or row (it accepts the same axis= parameter as apply()).
For instance, to find the smallest value for each row across the columns a, b, and c of the data frame below, use the following code:
import pandas

df = pandas.DataFrame(
    {
        "id": [1, 2, 3],
        "a": [27, 64, 12],
        "b": [13, 13, 13],
        "c": [34, 15, 1]
    }
)
df
|   | id | a  | b  | c  |
|---|----|----|----|----|
| 0 | 1  | 27 | 13 | 34 |
| 1 | 2  | 64 | 13 | 15 |
| 2 | 3  | 12 | 13 | 1  |
# select which columns to compare, then call `.min()`
df[["a", "b", "c"]].min(axis=1)
0 13
1 13
2 1
dtype: int64
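The result can also be kept as a new column of the data frame; a minimal sketch (the column name min_abc is just illustrative):

# store the row-wise minimum across a, b, and c in a new (illustrative) column
df["min_abc"] = df[["a", "b", "c"]].min(axis=1)
df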
To find out which column had the smallest value for each row, use the near-identical method idxmin():
df[["a", "b", "c"]].idxmin(axis=1)
0 b
1 b
2 c
dtype: object
Of course, equivalent methods to find the greatest values exist: they are named pandas.DataFrame.max() and pandas.DataFrame.idxmax().
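For example, applying the same pattern to the data frame above:

# greatest value in each row across columns a, b, and c
print(df[["a", "b", "c"]].max(axis=1))

# name of the column that holds each row's greatest value
print(df[["a", "b", "c"]].idxmax(axis=1))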