Exercise 4#
Please complete this exercise by the end of day on Thursday, 30 November, 2023 (the day of next week’s work session).
To start this assignment, accept the GitHub Classroom assignment, and clone your own repository, e.g., in a CSC Notebook instance. Make sure you commit and push all changes you make (you can revisit instructions on how to use git
and the jupyterlab git-plugin on the website of the Geo-Python course.
To preview the exercise without logging in, you can find the open course copy of the course’s GitHub repository at github.com/automating-gis-processes-2022/exercise-4. Don’t attempt to commit changes to that repository, but rather work with your personal GitHub Classroom copy (see above).
Hints#
Joining two data frames on different column names#
We have already joined data sets that share the same index, and also used spatial joins to merge geo-data frames depending on their geometric relationships.
For problem 1, it might be handy to be able to join two data sets using the values of two columns that have a different name. One good approach is to set the index of both data frames to refer to the same column:
[1]:
import pandas
df1 = pandas.DataFrame({
"id": [1, 2, 3],
"other_column": ["a", "b", "c"]
})
df1
[1]:
id | other_column | |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | c |
[2]:
df2 = pandas.DataFrame({
"id": [67, 68, 69],
"other_other_column": ["x", "y", "z"],
"df1_id": [1, 2, 3]
})
df2
[2]:
id | other_other_column | df1_id | |
---|---|---|---|
0 | 67 | x | 1 |
1 | 68 | y | 2 |
2 | 69 | z | 3 |
[3]:
joint_df = df1.set_index("id").join(df2.set_index("df1_id"))
joint_df
[3]:
other_column | id | other_other_column | |
---|---|---|---|
id | |||
1 | a | 67 | x |
2 | b | 68 | y |
3 | c | 69 | z |
Renaming columns when joining data frames#
It is often necessary to rename columns when we join data frames that have duplicate column names. In the example below, both df1
and df2
have a column other_column
; the join fails. An appropriate fix is to add a suffix to all columns of one or both of the data frames:
[4]:
import pandas
df1 = pandas.DataFrame({
"id": [1, 2, 3],
"other_column": ["a", "b", "c"]
})
df1
[4]:
id | other_column | |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | c |
[5]:
df2 = pandas.DataFrame({
"id": [67, 68, 69],
"other_other_column": ["x", "y", "z"],
"df1_id": [1, 2, 3]
})
df2
[5]:
id | other_other_column | df1_id | |
---|---|---|---|
0 | 67 | x | 1 |
1 | 68 | y | 2 |
2 | 69 | z | 3 |
[6]:
# Will fail, because duplicate column names exist:
joint_df = df1.join(df2)
joint_df
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[6], line 2
1 # Will fail, because duplicate column names exist:
----> 2 joint_df = df1.join(df2)
3 joint_df
File ~/checkouts/readthedocs.org/user_builds/autogis-site/envs/latest/lib/python3.12/site-packages/pandas/core/frame.py:10757, in DataFrame.join(self, other, on, how, lsuffix, rsuffix, sort, validate)
10747 if how == "cross":
10748 return merge(
10749 self,
10750 other,
(...)
10755 validate=validate,
10756 )
> 10757 return merge(
10758 self,
10759 other,
10760 left_on=on,
10761 how=how,
10762 left_index=on is None,
10763 right_index=True,
10764 suffixes=(lsuffix, rsuffix),
10765 sort=sort,
10766 validate=validate,
10767 )
10768 else:
10769 if on is not None:
File ~/checkouts/readthedocs.org/user_builds/autogis-site/envs/latest/lib/python3.12/site-packages/pandas/core/reshape/merge.py:184, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
169 else:
170 op = _MergeOperation(
171 left_df,
172 right_df,
(...)
182 validate=validate,
183 )
--> 184 return op.get_result(copy=copy)
File ~/checkouts/readthedocs.org/user_builds/autogis-site/envs/latest/lib/python3.12/site-packages/pandas/core/reshape/merge.py:888, in _MergeOperation.get_result(self, copy)
884 self.left, self.right = self._indicator_pre_merge(self.left, self.right)
886 join_index, left_indexer, right_indexer = self._get_join_info()
--> 888 result = self._reindex_and_concat(
889 join_index, left_indexer, right_indexer, copy=copy
890 )
891 result = result.__finalize__(self, method=self._merge_type)
893 if self.indicator:
File ~/checkouts/readthedocs.org/user_builds/autogis-site/envs/latest/lib/python3.12/site-packages/pandas/core/reshape/merge.py:840, in _MergeOperation._reindex_and_concat(self, join_index, left_indexer, right_indexer, copy)
837 left = self.left[:]
838 right = self.right[:]
--> 840 llabels, rlabels = _items_overlap_with_suffix(
841 self.left._info_axis, self.right._info_axis, self.suffixes
842 )
844 if left_indexer is not None and not is_range_indexer(left_indexer, len(left)):
845 # Pinning the index here (and in the right code just below) is not
846 # necessary, but makes the `.take` more performant if we have e.g.
847 # a MultiIndex for left.index.
848 lmgr = left._mgr.reindex_indexer(
849 join_index,
850 left_indexer,
(...)
855 use_na_proxy=True,
856 )
File ~/checkouts/readthedocs.org/user_builds/autogis-site/envs/latest/lib/python3.12/site-packages/pandas/core/reshape/merge.py:2721, in _items_overlap_with_suffix(left, right, suffixes)
2718 lsuffix, rsuffix = suffixes
2720 if not lsuffix and not rsuffix:
-> 2721 raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
2723 def renamer(x, suffix: str | None):
2724 """
2725 Rename the left and right indices.
2726
(...)
2737 x : renamed column name
2738 """
ValueError: columns overlap but no suffix specified: Index(['id'], dtype='object')
[7]:
# works: add a suffix to one of the data sets’ columns
joint_df = df1.join(df2.add_suffix("_df2"))
joint_df
[7]:
id | other_column | id_df2 | other_other_column_df2 | df1_id_df2 | |
---|---|---|---|---|---|
0 | 1 | a | 67 | x | 1 |
1 | 2 | b | 68 | y | 2 |
2 | 3 | c | 69 | z | 3 |
Searching for files using a pattern#
In Lesson 2 we discussed how to use a file pattern to search for files, using `pathlib.Path.glob()
<https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob>`__.
To loop over all files ending in _s.shp
inside DATA_DIRECTORY / "finland_topographic_database
, use the following piece of code:
[8]:
import pathlib
DATA_DIRECTORY = pathlib.Path().resolve() / "data"
for input_file in (DATA_DIRECTORY / "finland_topographic_database").glob("*_s.shp"):
print(input_file.name)
This will come in handy for problem 2, when reading in all travel time data files. Be sure to revisit the explanation in Lesson 2.
Find the minimum value across multiple columns#
For problem 2, you have to find the smallest value across multiple columns: the shortest travel time to any of the eight shopping centres. For this, `panda
’s DataFrame.min()
method <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html>`__ can come in handy. It identifies the smallest value in each column or row (it accepts the same axis=
parameter as apply()
).
For instance, to find the smalles value for each row across the columns a
, b
, and c
of the data frame below, use the following code:
[9]:
import pandas
df = pandas.DataFrame(
{
"id": [1, 2, 3],
"a": [27, 64, 12],
"b": [13, 13, 13],
"c": [34, 15, 1]
}
)
df
[9]:
id | a | b | c | |
---|---|---|---|---|
0 | 1 | 27 | 13 | 34 |
1 | 2 | 64 | 13 | 15 |
2 | 3 | 12 | 13 | 1 |
[10]:
# select which columns to compare, then call `.min()`
df[["a", "b", "c"]].min(axis=1)
[10]:
0 13
1 13
2 1
dtype: int64
To find out which column had the smallest value for each row, use the near-identical method `idxmin()
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html>`__:
[11]:
df[["a", "b", "c"]].idxmin(axis=1)
[11]:
0 b
1 b
2 c
dtype: object
Of course, equivalent methods to find the greatest values exist: they are named `pandas.DataFrame.max()
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html>`__ and `pandas.DataFrame.idxmax()
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html>`__.