Final assignment#

Start your assignment

Start your final assignment by accepting the GitHub Classroom assignment for the final work.

Aim of the work#

The final project can be done individually or in pairs. The aim of the task is to apply Python programming to automating a GIS analysis process. The main aim is to create a GIS analysis workflow that can be easily repeated for similar input data.

You can select a pre-defined topic, or develop your own question. You should take advantage of your programming skills (basics of Python, defining your own functions, reading and writing data, data analysis using pandas, spatial analysis using geopandas, creating static and/or interactive data visualizations, …), version control skills (git + GitHub), and good coding practices (writing readable code) when doing the final assignment.

Pair programming (optional)

Students who attend the course at the University of Helsinki can do the final assignment in pairs. Those who choose to work in pairs also need to submit a one-page report on their project (description, aims, how it works, what tools and methods you used, and how you divided the work). It is enough if one person (the "driver") submits the project itself, but the report needs to be submitted individually as a .md file or a PDF in your assignment's git repository. Those who do their projects individually do not need to submit the report, and the driver does not need to submit any report other than the required README.md file.

Final work topic#

You have four options for the final project that you can choose from:

  1. Access Viz: a GIS tool that can visualize and compare travel times by different travel modes in the Helsinki Region.

  2. Urban Indicators: a workflow that calculates and reports different urban indicators for an urban region, and allows the comparison of different urban areas based on these indicators.

  3. A context-based spatial data anonymizer: a context-sensitive approach to anonymizing sensitive GIS data.

  4. Your own project: your own tool or analysis process (for example, related to your thesis!). Suggest your idea before the last practical exercise!

Think about the final project as a challenge for yourself to show and implement the programming skills that you have learned this far. You have learned a lot already!

Final work structure#

Here is the suggested structure of the work, which also serves as the basis for grading:

  1. Data acquisition (Fetching data, subsetting data, storing intermediate outputs etc.)

  2. Data analysis (Enriching and analyzing the data, eg. spatial join, overlay, buffering, other calculations..)

  3. Visualization (Visualizing main results and other relevant information as maps and graphs)

You can write your code into Python script files and/or Jupyter Notebook files. You can freely organize your final work into one single file or several files (for example, write your own functions into a separate .py file and apply them in one or several Jupyter Notebook .ipynb files).

The workflow should be repeatable and well documented. In other words, anyone who gets a copy of your repository should be able to run your code, and read your code.

What should be returned?#

Organize all your code / notebooks into your personal Final-Assignment repository (GitHub Classroom link at the top of this page) and add links to all relevant files in the README.md file. Anyone who downloads the repository should be able to read your code and documentation, understand what is going on, and run your code in order to reproduce the same results :)

Note: If your code requires some Python packages not found in the CSC Notebooks environment, please mention them in the README.md file as well and provide installation instructions.

Note

If you are working in pairs

If you do the exercise in pairs, the non-driver groupmate needs to submit a one-page report in their own git repository. In your report, include: description, aims, how it works, what tools and methods you have used, and how you have divided the work. The driver does not need to submit a separate report.

When is the deadline?#

Label your submission as “submitted” in the exercise repository’s README.md under “status” once you are finished with the final assignment.

You can choose from these two deadlines:

  • 1st deadline: Sunday the 31st of December 2023

  • 2nd deadline Sunday the 14th of January 2024

Submissions are checked after each deadline (you can get the feedback earlier if aiming for the first deadline). If you need the course grade earlier, please contact the course instructor.

Grading#

The grading is based on a typical 0-5 scale. See detailed grading criteria here. The final assignment is graded based on:

  • Main analysis steps (data fetching, data analysis, visualization)

  • Repeatability (it should be possible to repeat the main analysis steps for different input files / input parameters)

  • Quality of visualizations (maps and graphs)

  • Overall documentation of the work (use markdown cells for structuring the work, and code comments to explain details)

Good documentation of the code and your project is highly appreciated! You should add the necessary details to the README.md file, and use inline comments and Markdown cells to document your work along the way. Take a look at these hints for using Markdown:

Note

AI-LLM OK

You are allowed to get help from AI-LLMs in this assignment. However, you cannot produce large amounts of code using these tools. If you use an AI-LLM in your work, be transparent about how you have used it. Provide a description of why, how, and to what extent you have used AI in your work. You should also include the prompts you have used in your report. Code generated by an AI-LLM should also be highlighted using an inline comment in your code. Learn more about the use of AI-LLM tools for Python programming

AccessViz#

General Description#

AccessViz is a set of tools for managing and analyzing the Helsinki Region Travel Time Matrix data set. The data can be downloaded from here. The travel time matrix is available for three different years (2013 / 2015 / 2018). You can develop the tool using data from one year. Optionally, your tool could compare travel times from different years!

The travel time matrix consists of 13,231 text files. Each file contains travel time and travel distance information by different modes of transport (walking, biking, public transport and car) from all other grid squares to one target grid square. The files are named and organized based on their ID number in the YKR ID data set. For example, the Travel Time Matrix file for the railway station is named travel_times_to_5975375.txt, and this file is located in folder 5975xxx. All possible YKR ID values can be found in the attribute table of a Shapefile called MetropAccess_YKR_grid.shp that you can download from here. Read a further description of the travel time matrix on the Digital Geography Lab / Accessibility research group blog.

What should this tool do?#

AccessViz is a Python tool (i.e. a set of Notebooks and/or Python script files) for managing, analyzing and visualizing the Travel Time Matrix data set. AccessViz consists of Python functions and examples of how to use these functions. AccessViz has four main components for accessing the files, joining the attribute information to spatial data, visualizing the data, and comparing different travel modes:

1. FileFinder: The AccessViz tool finds a list of travel time matrix files based on a list of YKR ID values from a specified input data folder. The code should work for different list lengths and different YKR ID values. If a YKR ID number does not exist in the input folder (or its subfolders), the tool should warn the user about this but continue running. The tool should also inform the user about the execution progress: tell the user what file is currently being processed and how many files are left (e.g. "Processing file travel_times_to_5797076.txt.. Progress: 3/25"). As output, FileFinder compiles a list of file paths for further processing. (Optional feature: FileFinder can also print out the list of file paths into a text file.)
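A minimal sketch of the FileFinder logic, assuming the folder layout described above (subfolders named after the first four digits of the ID, e.g. 5975xxx); the function name is a placeholder, not a requirement:

```python
from pathlib import Path

def file_finder(input_folder, ykr_ids):
    # Hypothetical sketch: collect matrix file paths for a list of YKR IDs.
    # Assumes files named travel_times_to_<ID>.txt inside subfolders
    # named <first 4 digits of the ID>xxx, as described above.
    filepaths = []
    n = len(ykr_ids)
    for i, ykr_id in enumerate(ykr_ids, start=1):
        fname = f"travel_times_to_{ykr_id}.txt"
        fp = Path(input_folder) / f"{str(ykr_id)[:4]}xxx" / fname
        print(f"Processing file {fname}.. Progress: {i}/{n}")
        if fp.exists():
            filepaths.append(fp)
        else:
            print(f"WARNING: no file found for YKR ID {ykr_id}, skipping.")
    return filepaths
```

Returning the paths as a list makes it easy to pass them on to the other components (TableJoiner, Visualizer).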

2. TableJoiner: The AccessViz tool creates a spatial layer from the chosen Matrix text table (e.g. travel_times_to_5797076.txt) by joining the Matrix file with the MetropAccess_YKR_grid Shapefile, where from_id in the Matrix file corresponds to YKR_ID in the Shapefile. The tool saves the result in the output folder that the user has defined. The output file format can be Shapefile or GeoPackage. You should name the files so that the ID can be identified from the name (e.g. 5797076). The table joining can be applied to files that correspond to a list of selected YKR IDs (FileFinder handles finding the correct input files!).
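The attribute join at the heart of TableJoiner could be sketched as follows. Plain pandas is used here only to illustrate the join logic; in the real tool the grid would be read with geopandas and the result written out with to_file(). The semicolon separator for the matrix text files is an assumption worth checking against the actual data:

```python
import pandas as pd

def join_matrix_to_grid(grid_df, matrix_df):
    # Join one matrix table to the YKR grid: from_id in the matrix
    # corresponds to YKR_ID in the grid. In the real tool the grid is
    # read with geopandas (gpd.read_file("MetropAccess_YKR_grid.shp"))
    # and the result is saved e.g. with joined.to_file(out_path, driver="GPKG").
    return grid_df.merge(matrix_df, left_on="YKR_ID", right_on="from_id",
                         how="inner")

# Reading one matrix text file (separator is an assumption):
# matrix_df = pd.read_csv("travel_times_to_5797076.txt", sep=";")
```

An inner join drops grid cells that have no matching row in the matrix file; use `how="left"` if you want to keep the full grid instead.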

3. Visualizer: AccessViz can visualize the travel times of selected YKR IDs based on different travel modes (it should be possible to use the same tool for visualizing travel times by car, public transport, walking or biking, depending on an input parameter!). It saves the maps into a specified folder for output images. The output maps can be either static or interactive; it should be possible to select which kind of map output is generated when running the tool. You can freely design the style of the map yourself: colors, travel time intervals (classes), etc. Try to make the map as informative as possible! The visualizations can be applied to files that correspond to a list of selected YKR IDs (FileFinder handles finding the correct input files!). Remember to handle NoData values.

4. Comparison tool: AccessViz can also compare travel times or travel distances between two different travel modes. For example, the tool can compare rush hour travel times by public transport and car based on the columns pt_r_t and car_r_t, and rush hour travel distances based on the columns pt_r_d and car_r_d. It should also be possible to run the AccessViz tool without doing any comparisons. Thus, IF the user has specified two travel modes (passed in as a list) for AccessViz, the tool will calculate the time/distance difference of those travel modes into a new column. In the calculation, the second travel mode is always subtracted from the first: travelmode1 - travelmode2, according to the order in which the travel modes were listed. The tool should ensure that distances are not compared to travel times and vice versa. The tool saves the outputs as new files (Shapefile or GeoPackage format) with an informative name, for example Accessibility_5797076_pt_vs_car.shp. It should be possible to compare only two travel modes with each other at a time. Accepted travel modes are the same ones found in the actual Travel Time Matrix file (walking, biking, public transport and car). If the tool gets invalid parameters (for example, a travel mode that does not exist, or too many travel modes), stop the program and give advice on what the acceptable values are. Remember to handle NoData values.
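The comparison logic, including parameter validation and NoData propagation, could be sketched like this (the function name is hypothetical; the check relies on the column-name pattern mentioned above, where names end in _t for times and _d for distances):

```python
import pandas as pd

def compare_modes(df, modes):
    # Compute modes[0] - modes[1] into a new column (sketch).
    if len(modes) != 2:
        raise ValueError("Give exactly two travel mode columns, "
                         "e.g. ['pt_r_t', 'car_r_t'].")
    mode1, mode2 = modes
    for col in (mode1, mode2):
        if col not in df.columns:
            raise ValueError(f"Unknown column {col!r}; "
                             f"available: {list(df.columns)}")
    # Compare times with times (_t) or distances with distances (_d)
    if mode1[-2:] != mode2[-2:]:
        raise ValueError("Cannot compare a travel time with a distance.")
    out = f"{mode1}_vs_{mode2}"
    df[out] = df[mode1] - df[mode2]
    # NoData (-1) in either input propagates as -1 in the result
    df.loc[(df[mode1] == -1) | (df[mode2] == -1), out] = -1
    return df
```

Raising ValueError with the list of valid columns satisfies the requirement to stop and advise the user on acceptable values.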

If you are pursuing the highest grade, you should also implement at least one of the following components:

  1. The AccessViz documentation also contains a separate interactive map that shows the YKR grid values in the Helsinki region. The purpose of the map is to help the user choose the YKR IDs that they are interested in visualizing / analyzing.

  2. AccessViz can also visualize the travel mode comparisons that were described in step 4.

  3. AccessViz can also visualize shortest path routes (walking, cycling, and/or driving) using OpenStreetMap data from the Helsinki Region. The impedance value for the routes can be distance (as was shown in Lesson 7) or time.

  4. AccessViz can also compare travel time data from two different years. For example, this tool could plot a map that shows the difference with public transport travel times between 2013 and 2018.

Note

NoData values

Notice that there are NoData values present in the data (value -1). In such cases, the result cell should always end up with the value -1 when doing travel mode comparisons. In the visualizations, the NoData values should be removed before plotting the map.

Hint

Modularize your code

One of the best-practice guidelines is that you should avoid repeating yourself. Thus, we recommend modularizing different tasks in your code and using functions as much as possible. Use meaningful parameter and variable names when defining the functions, so that they are intuitive but short.

Urban indicators#

In this assignment, the aim is to develop an urban analytics tool and apply it to at least two cities or neighborhoods (e.g. Helsinki and Tampere, or neighborhood areas in Helsinki). The main idea is to calculate a set of metrics / indicators based on the urban form and/or population, and to compare the cities/regions based on these measures. This assignment is intentionally open-ended, as the idea is to allow you to use your own imagination and interests to explore different datasets and conduct analyses that interest you, while still providing useful insights about the urban areas using a specific set of indicators (you should use 2-4 different indicators; see examples below).

Data#

You can use any (spatial) data that you can find, and generate your own report describing how the cities differ from each other based on different perspectives (see hints below about possible analyses). You can use any data that is available, for example, from the following sources:

Data sources are not limited to these; you can also use other data from any source you can find (remember to document where the data comes from!).

Example analyses#

The tool should calculate 2-4 indicators about the urban areas. Here are some examples of potential metrics:

Population distribution and demographics

  • Input data management (table joins, data cleaning etc.)

  • Calculate key statistics

  • Create maps and graphs

Urban population growth

  • Fetch population data from at least two different years

  • Compare statistics from different years

  • Visualize as graphs and maps
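As a minimal illustration of the growth comparison, with made-up population counts (in practice these would be fetched from open data portals and joined by area ID):

```python
import pandas as pd

# Hypothetical population counts per district for two years
pop = pd.DataFrame({
    "district": ["A", "B"],
    "pop_2015": [12000, 8000],
    "pop_2020": [13200, 7600],
})

# Relative growth between the two years, in percent
pop["growth_pct"] = 100 * (pop["pop_2020"] - pop["pop_2015"]) / pop["pop_2015"]
```

The resulting column can then be visualized as a bar graph, or joined back to district polygons for a choropleth map.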

Accessibility:

  • Decide what travel modes you are focusing on (walking, driving, public transport..)

  • Decide what types of destinations you are focusing on (transport stations, health care, education, sports facilities..)

  • Get travel time data from the Travel Time Matrix OR calculate shortest paths in a network

  • Calculate travel time / travel distance metrics, or dominance areas

  • Visualize the results as graphs and maps

Green area index

  • Fetch green area polygons and filter the data if needed

  • Calculate the percentage of green areas in the city / region + other statistics

  • Visualize the results
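The green area index itself is simple arithmetic once the areas are known; a sketch (the area values would in practice come from geopandas, e.g. green.geometry.area.sum(), after reprojecting to a metric CRS):

```python
def green_area_share(green_area_m2, region_area_m2):
    # Percentage of the region covered by green areas (sketch);
    # inputs are plain numbers in square metres.
    if region_area_m2 <= 0:
        raise ValueError("Region area must be positive.")
    return 100 * green_area_m2 / region_area_m2
```

Note that overlapping green polygons should be dissolved first, otherwise their areas are counted twice.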

Street network metrics

  • Fetch street network data

  • Calculate street network metrics (see Lesson 6 and examples from here)

  • Visualize the results

Building density

  • Fetch the data, and filter if needed

  • Calculate building density and other metrics

  • Create maps showing the building types and density

Structure of the urban indicators tool assignment#

You can design the structure of your assignment freely. We suggest that you create functions in separate script files, and demonstrate the use of those functions in one or several notebooks. In addition, you should provide some basic information in the README.md file of your final assignment. All in all, the work should include these components:

  • A topic for your work (e.g. “Urban indicators: analyzing the street network structure in Helsinki and Tampere”).

  • A short introduction to the topic (present 2-4 research questions that you aim to answer using the indicators)

  • Short description of the datasets you used

  • Short generic description of the methods you used

  • Actual codes and visualizations to produce the results

  • Short discussion related to the results (what should we understand and see from them?)

  • Short reflection on the analysis, for example: What kinds of assumptions, biases or uncertainties are related to the data and/or the analyses you did? Any other notes that the reader should know about the analysis

Technical considerations#

Take care that you:

  • Document your analyses well using the Markdown cells and describe 1) what you are doing and 2) what you can see from the data and your results.

  • Use informative visualizations

    • Create maps (static or interactive)

    • Create other kind of graphs (e.g. bar graphs, line graphs, scatter plots etc.)

    • Use subplots that allow you to easily compare results side by side

  • When writing the code, we highly recommend that you use and write functions for repetitive parts of the code. As motivation: imagine that you had to repeat your analyses for all cities in Finland, and write your code in a way that would make this possible. Furthermore, we recommend that you save those functions into a separate .py script file that you import into the notebook (see the example from Geo-Python Lesson 4)

Literature + inspiration#

The following readings provide some useful background information and inspiration for the analyses (remember to cite them if you use them):

Spatial anonymization#

With the increase in interest in open data and science, the subject of data privacy and safety is gaining increasing attention. When openly publishing data, it is often required to anonymize the data using some kind of algorithm in order to protect private information and comply with regulations.

Broadly speaking, anonymization approaches fall into two categories: generalization and noise addition. Generalization means that we move a data point to a larger category or group, making it difficult to identify an individual (read about k-anonymity). Noise addition means that some level of noise is added to the original values in the data to prevent the identification of individuals.

In the case of geospatial data, generalization means that we generalize the features (for example, points) to a bigger spatial unit. For example, instead of publishing data at the point level, we would publish it at a grid level or within a polygon such as a neighborhood. Noise addition means that the point is moved to another location. This can be based on a randomization process, a mathematical function (e.g., Gaussian), or systematically moving points to a more general location, such as the nearest intersection (perturbation).

When anonymizing geospatial data, the level of anonymity achieved is directly related to the geographical context. For example, when anonymizing home locations, it is easier to reach a higher level of k-anonymity in a populous area than in a sparsely populated suburban area. This can be used to optimize the anonymization: reaching satisfactory anonymity while minimizing the loss of data quality.

Suggested analytical steps#

In this assignment we are going to work with an imaginary point dataset collected from a public participation GIS survey (read more) representing participants’ home locations. You can get creative and try new things, but you can also get ideas from this paper. We will follow these steps in this exercise:

  1. Step 1: Create a set of at least 2000 imaginary home locations (random points) in a city/region of your choice. Make it more realistic by getting rid of points that are located in inaccessible areas (water, forests, etc.). To make it even more realistic, you can make sure that the points are within residential buildings.

  2. Step 2: Add a random displacement to the x and y coordinates of between 50 and 500 meters. Weight the amount of displacement so that you have a bigger displacement in areas of lower population density. You can do this using a simple classification, or get more creative with math. Feel free to use other relevant contextual data (such as building density) in weighting your displacement values.

  3. Step 3: Create suitable charts (e.g., a histogram) to show the values of displacements. Classify the values using an appropriate method and highlight the classification break points using dashed vertical lines.

  4. Step 4: Create an interactive map showing the amount of displacement in different areas. Can you create a nice map showing how your displacements are (negatively) correlated with the population distribution?

  5. Step 5 (Extra): Can you calculate and visualize the k-anonymity using the approach described here?
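The displacement in Step 2 can be sketched as a small function. The density inputs, parameter names, and the linear weighting are all assumptions for illustration; coordinates are assumed to be in a metric CRS (e.g. EPSG:3067) so that offsets are in metres:

```python
import math
import random

def displace_point(x, y, pop_density, max_density,
                   min_disp=50, max_disp=500, rng=random):
    # Displace a point in a random direction, weighted by population
    # density (sketch): the densest areas get displacements near
    # min_disp, empty areas up to max_disp.
    weight = 1 - min(pop_density / max_density, 1)
    upper = min_disp + weight * (max_disp - min_disp)
    distance = rng.uniform(min_disp, upper)
    angle = rng.uniform(0, 2 * math.pi)
    return x + distance * math.cos(angle), y + distance * math.sin(angle)
```

Passing the random generator in as a parameter makes the displacement reproducible in tests (e.g. `rng=random.Random(42)`), which also helps when documenting the workflow.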

Data recommendations#

You do not necessarily need to choose Helsinki as your study site. However, it might be easier to find the data you need for the capital region. Remember that you can download the required data and use it locally, or use WFS if it is available from the source.

  • Home locations: this is randomly generated point data that you should create yourself

  • Urban structure:

    • Buildings: https://hri.fi/data/en_GB/dataset/helsingin-rakennukset

    • Population grid: https://www.paikkatietohakemisto.fi/geonetwork/srv/eng/catalog.search#/metadata/a901d40a-8a6b-4678-814c-79d2e2ab130c

Own project work#

Develop your own topic! In general, your own topic should also contain these sections:

  1. Data acquisition (Fetching data, subsetting data, storing intermediate outputs etc.)

  2. Data analysis (Enriching and analyzing the data, eg. spatial join, overlay, buffering, other calculations..)

  3. Visualization (Visualizing main results and other relevant information as maps and graphs)

But feel free to be creative! Your own project might be, for example, related to your thesis or a work project. Remember to describe clearly what you are doing in the final assignment repository's README.md file. Preferably, present your idea to the course instructors before the winter holidays.

At minimum, the final project requires that you have:

  • A working piece of code that solves your task / problem / analysis

  • Good documentation (i.e. a tutorial) explaining how your tool works, OR a report about your analyses and what we can learn from them