.. SPDX-FileCopyrightText: 2024 Jiri Vlasak .. .. SPDX-License-Identifier: CC-BY-SA-4.0 Database: Tasks history ======================= This document discusses the database, particularly the part of the database that works with the tasks history. - repository: https://git.sr.ht/~qeef/dd-hot-tm - documentation: https://dd-hot-tm.tojemoje.site/ .. contents:: :local: .. include:: problem.rst The purpose of the tasks history -------------------------------- Keeping the task history helps Radek and Ramona understand the evolution of the task. When they do the *bad* action to move the task state back to *unlocked to map*, there is a new round of mapping and checking of the task with possibly new names of mappers. When there are new mappers, Radek and Ramona may be in the need of using *bad* action on the same task multiple times. On the other hand, when task history is kept, Monica, Michal, Marcel and Miriam can see if it was them who failed to map the task properly. Keeping the task history is interesting also from the statistics point of view -- tasks having multiple mappers and a lot of *bad* actions are probably hard to map, or it is possible to track the time between lock and unlock actions. When we are aware of the common `mapping workflow`_, familiar with the *Project*, *Task* and *Action* terminology, and we understand the need for the tasks history, we can finally discuss `the database interface`_ and `the database schema`_. The database interface ---------------------- The database interface is the best described by the ``hot_tm_proposal.database`` package: .. automodule:: hot_tm_proposal.database The database schema ------------------- There are tables common to all tested database schemas and particular tables that are implementation specific. There is always ``projects`` table with the following columns: - ``pid`` -- project identifier, integer, primary key, automatically - ``geometry`` -- area geometry (GeoJSON when no PostGIS extension) - ``description`` -- text, the project description The approach of `Tasks history in HOT TM`_ is to use ``tasks`` and ``task_history`` tables, but the implementation leads to the duplicated information and therefore increased complexity to keep the information synchronized. `Tasks history in TM Admin`_ describes the approach where only ``tasks`` table is used and task history is concatenated as JSON list in the ``history`` column. Both approaches understand the task and the task history as an object that can be considered to `Create, Read, Update, Delete (CRUD)`_. Task state is a property of the task being updated and the action is a property of the record in the task history. The approaches only differ in how the objects, task and task history, are stored in the database. .. _Create, Read, Update, Delete (CRUD): https://en.wikipedia.org/wiki/Create,_read,_update_and_delete CRUD is not always the best way to work with objects, especially when the history is needed. This is because the update on the object ignores its history, so the history must be kept elsewhere. Instead of updating the task status and keeping the separated history synchronized, it is more convenient to keep the history and compute the status when needed. In `Actions history`_, we introduce the database schema with ``tasks`` and ``actions`` tables, where the database records are only created and read, but never updated. Action is first class database record and task state is a read-only database query. Non-goals --------- This document *does not* cover the whole database schema or the whole database schema for the tasks. This document covers part of the database, limited to tasks and the transitions between the task states. It is expected there are similar documents covering other parts of the database as groups or campaigns, and that there is another document that puts all these parts together and introduces overall database schema. This document *does not* cover task issues or annotations. The document aims solely on the main function of tasks within the HOT TM and keeping the tasks history. This document *does not* introduce production-ready database interface or schema. Tasks history in HOT TM ----------------------- We do not provide test implementation for the tasks history that is comparable to the HOT TM current implementation. In general, our approach is to read the code, try to understand the idea behind the code and implement a restricted version to test that idea. We do not see the idea hidden in the current implementation of HOT TM. ``tasks`` table has the following columns: - ``pid`` -- project identifier, reference to ``projects`` table - ``tid`` -- task identifier, ``tid`` and ``pid`` are together primary key in ``tasks`` table - ``geometry`` -- task geometry (GeoJSON when no PostGIS extension) - ``status`` -- string for simplicity, task's current state - ``locked_by`` -- string for simplicity, who did the last "lock" modification - ``mapped_by`` -- string for simplicity, who did *finish* action - ``validated_by`` -- string for simplicity, who did *good* action - ``task_history`` -- relation to the ``task_history`` table and ``task_history`` has the columns: - ``pid`` -- project identifier, reference to ``projects`` table - ``tid`` -- task identifier, reference to ``tasks`` table - ``hid`` -- task history identifier - ``action_purpose`` -- string, what action happened - ``action_text`` -- string, note about the action - ``action_date`` -- timestamp, when the action happened - ``action_by`` -- string, the author of the action Tasks history in TM Admin ------------------------- `TM Admin`_ is a new approach for the complex solution of Tasking Manager-like projects. The database schema in this section follows the idea of storing the task history in the JSON column. We provide two implementations: ``like_tm_admin`` and ``almost_tm_admin``. Former tries to be as close as possible to the TM Admin, latter keeps the database schema but uses different database queries. ``tasks`` table has the following columns: - ``pid`` -- project identifier, reference to ``projects`` table - ``tid`` -- task identifier, ``tid`` and ``pid`` are together primary key in ``tasks`` table - ``geometry`` -- task geometry (GeoJSON when no PostGIS extension) - ``status`` -- string for simplicity, task's current state - ``locked_by`` -- string for simplicity, who did the last "lock" modification - ``mapped_by`` -- string for simplicity, who did *finish* action - ``validated_by`` -- string for simplicity, who did *good* action - ``history`` -- JSON (better or binary) list of action records where each action record contain ``purpose``, ``text``, ``date``, ``by``. .. _TM Admin: https://github.com/hotosm/tm-admin ``like_tm_admin`` ^^^^^^^^^^^^^^^^^ .. automodule:: hot_tm_proposal.database.like_tm_admin .. _almost tm admin: ``almost_tm_admin`` ^^^^^^^^^^^^^^^^^^^ .. automodule:: hot_tm_proposal.database.almost_tm_admin Actions history --------------- We introduce the database schema where the information about the tasks is only added to the database. The ``tasks`` stores the read-only information about tasks that do not change, like project identifier ``pid``, task identifier ``tid`` and task border ``geometry``. On the other hand, the ``actions`` table stores what happenned to the tasks. ``tasks`` table has the following columns: - ``pid`` -- project identifier, reference to ``projects`` table - ``tid`` -- task identifier, ``tid`` and ``pid`` are together primary key in ``tasks`` table - ``geometry`` -- task geometry (GeoJSON when no PostGIS extension) and ``actions`` table has the columns: - ``pid`` -- project identifier, reference to ``projects`` table - ``tid`` -- task identifier, reference to ``tasks`` table - ``aid`` -- action identifier - ``purpose`` -- what action happened? - ``text`` -- note about the action? - ``date`` -- when the action happened? - ``by`` -- the author of the action? .. _actions history: ``actions_history`` ^^^^^^^^^^^^^^^^^^^ .. automodule:: hot_tm_proposal.database.actions_history .. _database experiments: Experiments ----------- We use `hot_tm_proposal.database.test` module to perform the tests of the database module: .. automodule:: hot_tm_proposal.database.test ---- In this section, we describe what experiments we performed. We provide the source code of the experiments in the ``test`` module of the ``hot_tm_proposal.database`` Python3 package. We compare `Tasks history in TM Admin`_ and `Actions history`_ in the following ways: 1. We create the database with two projects, each having 1000 tasks and we measure the overall time to create the projects. Setting ``LONGER_TEST = True`` will create 10 projects. 2. We measure the overall time for 100 task state changes of the single project: find random task to map and lock it for mapping, then immediately unlock it to check. Then, we measure the overall time for 1000 task state changes of the single project: find random task to map or check, lock it and then immediately unlock it to check. Setting ``LONGER_TEST = True`` will measure 5000 actions over all 10 projects randomly. 3. We measure the overall time for retrieving the last status of each task of the area. .. figure:: db-test-1000.svg :name: f-test-1000 Test two projects with 1000 requests to map or check .. figure:: db-test-5000.svg :name: f-test-5000 Test ten projects with 5000 requests to map or check Conclusion ---------- From the results in :ref:`f-test-1000` and :ref:`f-test-5000`, we can see that for the single mapper, the `Tasks history in TM Admin`_ that stores task history in JSON column of the tasks table, yields better results than `Actions history`_. However, we need to be aware of that: - We did not try to retrieve all actions of a single project chronologically ordered, i.e., from the newest to the oldest. Doing this using `Actions history`_ is trivial read-only query. Doing this using `Tasks history in TM Admin`_ leads to read-only query to retrieve data (JSON lists) from all (jsonb) history columns, followed by merging the data (JSON lists) by the date contained in the dictionaries (actions) in the JSON lists. - We implemented the tested functionality, like *map random task*, as a single (serializable) query, but we have no idea what approach will be used within `TM Admin`_. - We did not tested parallel accesses to the database, i.e. we do not know how these two approaches compare in the environment of many mappers; yet this needs to be tested. - We did not implemented the database design from `Tasks history in HOT TM`_, because the relations between the tables were not clear enough after reading the source code of HOT TM. This would probably not influence the results.