.. SPDX-FileCopyrightText: 2024 Jiri Vlasak
..
.. SPDX-License-Identifier: CC-BY-SA-4.0

Database: Tasks history
=======================

This document discusses the database, particularly the part of the
database that works with the tasks history.

- repository: https://git.sr.ht/~qeef/dd-hot-tm
- documentation: https://dd-hot-tm.tojemoje.site/

.. contents::
   :local:

.. include:: problem.rst


The purpose of the tasks history
--------------------------------

Keeping the task history helps Radek and Ramona understand the evolution
of the task. When they do the *bad* action to move the task state back
to *unlocked to map*, there is a new round of mapping and checking of
the task with possibly new names of mappers. When there are new mappers,
Radek and Ramona may be in the need of using *bad* action on the same
task multiple times.

On the other hand, when task history is kept, Monica, Michal, Marcel and
Miriam can see if it was them who failed to map the task properly.

Keeping the task history is interesting also from the statistics point
of view -- tasks having multiple mappers and a lot of *bad* actions are
probably hard to map, or it is possible to track the time between lock
and unlock actions.

When we are aware of the common `mapping workflow`_, familiar with the
*Project*, *Task* and *Action* terminology, and we understand the need
for the tasks history, we can finally discuss `the database interface`_
and `the database schema`_.


The database interface
----------------------

The database interface is the best described by the
``hot_tm_proposal.database`` package:

.. automodule:: hot_tm_proposal.database


The database schema
-------------------

There are tables common to all tested database schemas and particular
tables that are implementation specific. There is always ``projects``
table with the following columns:

- ``pid`` -- project identifier, integer, primary key, automatically
- ``geometry`` -- area geometry (GeoJSON when no PostGIS extension)
- ``description`` -- text, the project description

The approach of `Tasks history in HOT TM`_ is to use ``tasks`` and
``task_history`` tables, but the implementation leads to the duplicated
information and therefore increased complexity to keep the information
synchronized.

`Tasks history in TM Admin`_ describes the approach where only ``tasks``
table is used and task history is concatenated as JSON list in the
``history`` column.

Both approaches understand the task and the task history as an object
that can be considered to `Create, Read, Update, Delete (CRUD)`_. Task
state is a property of the task being updated and the action is a
property of the record in the task history. The approaches only differ
in how the objects, task and task history, are stored in the database.

.. _Create, Read, Update, Delete (CRUD): https://en.wikipedia.org/wiki/Create,_read,_update_and_delete

CRUD is not always the best way to work with objects, especially when
the history is needed. This is because the update on the object ignores
its history, so the history must be kept elsewhere. Instead of updating
the task status and keeping the separated history synchronized, it is
more convenient to keep the history and compute the status when needed.

In `Actions history`_, we introduce the database schema with ``tasks``
and ``actions`` tables, where the database records are only created and
read, but never updated. Action is first class database record and task
state is a read-only database query.


Non-goals
---------

This document *does not* cover the whole database schema or the whole
database schema for the tasks. This document covers part of the
database, limited to tasks and the transitions between the task states.
It is expected there are similar documents covering other parts of the
database as groups or campaigns, and that there is another document that
puts all these parts together and introduces overall database schema.

This document *does not* cover task issues or annotations. The document
aims solely on the main function of tasks within the HOT TM and keeping
the tasks history.

This document *does not* introduce production-ready database interface
or schema.


Tasks history in HOT TM
-----------------------

We do not provide test implementation for the tasks history that is
comparable to the HOT TM current implementation. In general, our
approach is to read the code, try to understand the idea behind the code
and implement a restricted version to test that idea. We do not see the
idea hidden in the current implementation of HOT TM.

``tasks`` table has the following columns:

- ``pid`` -- project identifier, reference to ``projects`` table
- ``tid`` -- task identifier, ``tid`` and ``pid`` are together primary
  key in ``tasks`` table
- ``geometry`` -- task geometry (GeoJSON when no PostGIS extension)
- ``status`` -- string for simplicity, task's current state
- ``locked_by`` -- string for simplicity, who did the last "lock"
  modification
- ``mapped_by`` -- string for simplicity, who did *finish* action
- ``validated_by`` -- string for simplicity, who did *good* action
- ``task_history`` -- relation to the ``task_history`` table

and ``task_history`` has the columns:

- ``pid`` -- project identifier, reference to ``projects`` table
- ``tid`` -- task identifier, reference to ``tasks`` table
- ``hid`` -- task history identifier
- ``action_purpose`` -- string, what action happened
- ``action_text`` -- string, note about the action
- ``action_date`` -- timestamp, when the action happened
- ``action_by`` -- string, the author of the action


Tasks history in TM Admin
-------------------------

`TM Admin`_ is a new approach for the complex solution of Tasking
Manager-like projects. The database schema in this section follows the
idea of storing the task history in the JSON column.

We provide two implementations: ``like_tm_admin`` and
``almost_tm_admin``. Former tries to be as close as possible to the TM
Admin, latter keeps the database schema but uses different database
queries.

``tasks`` table has the following columns:

- ``pid`` -- project identifier, reference to ``projects`` table
- ``tid`` -- task identifier, ``tid`` and ``pid`` are together primary
  key in ``tasks`` table
- ``geometry`` -- task geometry (GeoJSON when no PostGIS extension)
- ``status`` -- string for simplicity, task's current state
- ``locked_by`` -- string for simplicity, who did the last "lock"
  modification
- ``mapped_by`` -- string for simplicity, who did *finish* action
- ``validated_by`` -- string for simplicity, who did *good* action
- ``history`` -- JSON (better or binary) list of action records where
  each action record contain ``purpose``, ``text``, ``date``, ``by``.

.. _TM Admin: https://github.com/hotosm/tm-admin


``like_tm_admin``
^^^^^^^^^^^^^^^^^

.. automodule:: hot_tm_proposal.database.like_tm_admin


.. _almost tm admin:

``almost_tm_admin``
^^^^^^^^^^^^^^^^^^^

.. automodule:: hot_tm_proposal.database.almost_tm_admin


Actions history
---------------

We introduce the database schema where the information about the tasks
is only added to the database. The ``tasks`` stores the read-only
information about tasks that do not change, like project identifier
``pid``, task identifier ``tid`` and task border ``geometry``. On the
other hand, the ``actions`` table stores what happenned to the tasks.

``tasks`` table has the following columns:

- ``pid`` -- project identifier, reference to ``projects`` table
- ``tid`` -- task identifier, ``tid`` and ``pid`` are together primary
  key in ``tasks`` table
- ``geometry`` -- task geometry (GeoJSON when no PostGIS extension)

and ``actions`` table has the columns:

- ``pid`` -- project identifier, reference to ``projects`` table
- ``tid`` -- task identifier, reference to ``tasks`` table
- ``aid`` -- action identifier
- ``purpose`` -- what action happened?
- ``text`` -- note about the action?
- ``date`` -- when the action happened?
- ``by`` -- the author of the action?


.. _actions history:

``actions_history``
^^^^^^^^^^^^^^^^^^^

.. automodule:: hot_tm_proposal.database.actions_history


.. _database experiments:

Experiments
-----------

We use `hot_tm_proposal.database.test` module to perform the tests of
the database module:

.. automodule:: hot_tm_proposal.database.test

----

In this section, we describe what experiments we performed. We provide
the source code of the experiments in the ``test`` module of the
``hot_tm_proposal.database`` Python3 package.

We compare `Tasks history in TM Admin`_ and `Actions history`_ in the
following ways:

1. We create the database with two projects, each having 1000 tasks and
   we measure the overall time to create the projects.

   Setting ``LONGER_TEST = True`` will create 10 projects.

2. We measure the overall time for 100 task state changes of the single
   project: find random task to map and lock it for mapping, then
   immediately unlock it to check.

   Then, we measure the overall time for 1000 task state changes of the
   single project: find random task to map or check, lock it and then
   immediately unlock it to check.

   Setting ``LONGER_TEST = True`` will measure 5000 actions over all 10
   projects randomly.

3. We measure the overall time for retrieving the last status of each
   task of the area.

.. figure:: db-test-1000.svg
   :name: f-test-1000

   Test two projects with 1000 requests to map or check

.. figure:: db-test-5000.svg
   :name: f-test-5000

   Test ten projects with 5000 requests to map or check


Conclusion
----------

From the results in :ref:`f-test-1000` and :ref:`f-test-5000`, we can
see that for the single mapper, the `Tasks history in TM Admin`_ that
stores task history in JSON column of the tasks table, yields better
results than `Actions history`_. However, we need to be aware of that:

- We did not try to retrieve all actions of a single project
  chronologically ordered, i.e., from the newest to the oldest.

  Doing this using `Actions history`_ is trivial read-only query. Doing
  this using `Tasks history in TM Admin`_ leads to read-only query to
  retrieve data (JSON lists) from all (jsonb) history columns, followed
  by merging the data (JSON lists) by the date contained in the
  dictionaries (actions) in the JSON lists.

- We implemented the tested functionality, like *map random task*, as a
  single (serializable) query, but we have no idea what approach will be
  used within `TM Admin`_.

- We did not tested parallel accesses to the database, i.e. we do not
  know how these two approaches compare in the environment of many
  mappers; yet this needs to be tested.

- We did not implemented the database design from `Tasks history in HOT
  TM`_, because the relations between the tables were not clear enough
  after reading the source code of HOT TM. This would probably not
  influence the results.