--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.14.1 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- # Geocoding in geopandas Geopandas supports geocoding via a library called [geopy](http://geopy.readthedocs.io/), which needs to be installed to use [geopandas’ `geopandas.tools.geocode()` function](https://geopandas.org/en/stable/docs/reference/api/geopandas.tools.geocode.html). `geocode()` expects a `list` or `pandas.Series` of addresses (strings) and returns a `GeoDataFrame` with resolved addresses and point geometries. Let’s try this out. We will geocode addresses stored in a semicolon-separated text file called `addresses.txt`. These addresses are located in the Helsinki Region in Southern Finland. ```{code-cell} import pathlib NOTEBOOK_PATH = pathlib.Path().resolve() DATA_DIRECTORY = NOTEBOOK_PATH / "data" ``` ```{code-cell} import pandas addresses = pandas.read_csv( DATA_DIRECTORY / "helsinki_addresses" / "addresses.txt", sep=";" ) addresses.head() ``` We have an `id` for each row and an address in the `addr` column. ## Geocode addresses using *Nominatim* In our example, we will use *Nominatim* as a *geocoding provider*. [*Nominatim*](https://nominatim.org/) is a library and service using OpenStreetMap data, and run by the OpenStreetMap Foundation. Geopandas’ [`geocode()` function](hhttps://geopandas.org/en/stable/docs/reference/api/geopandas.tools.geocode.html) supports it natively. :::{admonition} Fair-use :class: note [Nominatim’s terms of use](https://operations.osmfoundation.org/policies/nominatim/) require that users of the service make sure they don’t send more frequent requests than one per second, and that a custom **user-agent** string is attached to each query. Geopandas’ implementation allows us to specify a `user_agent`; the library also takes care of respecting the rate-limit of Nominatim. Looking up an address is a quite expensive database operation. This is why, sometimes, the public and free-to-use Nominatim server takes slightly longer to respond. In this example, we add a parameter `timeout=10` to wait up to 10 seconds for a response. ::: ```{code-cell} import geopandas geocoded_addresses = geopandas.tools.geocode( addresses["addr"], provider="nominatim", user_agent="autogis2023", timeout=10 ) geocoded_addresses.head() ``` Et voilà! As a result we received a `GeoDataFrame` that contains a parsed version of our original addresses and a `geometry` column of `shapely.geometry.Point`s that we can use, for instance, to export the data to a geospatial data format. However, the `id` column was discarded in the process. To combine the input data set with our result set, we can use pandas’ [*join* operations](https://pandas.pydata.org/docs/user_guide/merging.html). ## Join data frames :::{admonition} Joining data sets using pandas :class: note For a comprehensive overview of different ways of combining DataFrames and Series based on set theory, have a look at pandas documentation about [merge, join and concatenate](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). ::: Joining data from two or more data frames or tables is a common task in many (spatial) data analysis workflows. As you might remember from our earlier lessons, combining data from different tables based on common **key** attribute can be done easily in pandas/geopandas using the [`merge()` function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html). We used this approach in [exercise 6 of the Geo-Python course](https://geo-python-site.readthedocs.io/en/latest/lessons/L6/exercise-6.html#joining-data-from-one-dataframe-to-another). However, sometimes it is useful to join two data frames together based on their **index**. The data frames have to have the **same number of records** and **share the same index** (simply put, they should have the same order of rows). We can use this approach, here, to join information from the original data frame `addresses` to the geocoded addresses `geocoded_addresses`, row by row. The `join()` function, by default, joins two data frames based on their index. This works correctly for our example, as the order of the two data frames is identical. ```{code-cell} geocoded_addresses_with_id = geocoded_addresses.join(addresses) geocoded_addresses_with_id ``` The output of `join()` is a new `geopandas.GeoDataFrame`: ```{code-cell} type(geocoded_addresses_with_id) ``` The new data frame has all original columns plus new columns for the `geometry` and for a parsed `address` that can be used to spot-check the results. :::{note} If you would do the join the other way around, i.e. `addresses.join(geocoded_addresses)`, the output would be a `pandas.DataFrame`, not a `geopandas.GeoDataFrame`. ::: --- It’s now easy to save the new data set as a geospatial file, for instance, in *GeoPackage* format: ```{code-cell} :tags: ["remove-input", "remove-output"] # delete a possibly existing file, as it creates # troubles in case sphinx is run repeatedly try: (DATA_DIRECTORY / "addresses.gpkg").unlink() except FileNotFoundError: pass ``` ```{code-cell} geocoded_addresses.to_file(DATA_DIRECTORY / "addresses.gpkg") ```