Database: Tasks history¶
This document discusses the database, particularly the part of the database that works with the tasks history.
repository: https://git.sr.ht/~qeef/dd-hot-tm
documentation: https://dd-hot-tm.tojemoje.site/
The main problem HOT TM solves is that it helps mappers to manage the mapping of a large area. From the mapper’s point of view, the solution of this problem is straightforward. Split the large area into smaller parts and let the mappers lock and unlock them to communicate what part of the area is being worked on.
HOT TM uses the Project and Task terminology to denote a large area with additional information like the description of the purpose of mapping, and its smaller parts. Each task has its state, e.g. locked or unlocked. (In the HOT TM code base the “state” is often called “status”.) However, a task can also be understand as a thing to do instead of a part of the area. To avoid the misunderstanding, we introduce the Action to denote a transition from one task state to another.
Using the terminology of Project, Task and Action introduced above, we can describe the common mapping workflow. First, Celestine creates new project with all tasks unlocked. Then, Monica, Michal, Marcel and Miriam each locks random task and map that part of the project. Miriam and Marcel are finished with their tasks, unlock them and lock another random tasks. We can see the figure representing the task states (rounded boxes) with corresponding actions (arrows):
In addition, Radek and Ramona comes into the mapping workflow. They find tasks that have been recently mapped and check that these tasks have been mapped properly. To improve the representation of the workflow, we give the meaning to the locked and unlocked states and we name the actions, as shown in the following figure:
The names of the actions is the first half of the database interface. The other half, the DTOs, identify the project, the task, the transition and the mapper. The DTOs should not be confused with the states.
Note that the terminology and the naming of task states only reassembles HOT TM naming – there are more states with slightly different names in HOT TM code.
The purpose of the tasks history¶
Keeping the task history helps Radek and Ramona understand the evolution of the task. When they do the bad action to move the task state back to unlocked to map, there is a new round of mapping and checking of the task with possibly new names of mappers. When there are new mappers, Radek and Ramona may be in the need of using bad action on the same task multiple times.
On the other hand, when task history is kept, Monica, Michal, Marcel and Miriam can see if it was them who failed to map the task properly.
Keeping the task history is interesting also from the statistics point of view – tasks having multiple mappers and a lot of bad actions are probably hard to map, or it is possible to track the time between lock and unlock actions.
When we are aware of the common mapping workflow, familiar with the Project, Task and Action terminology, and we understand the need for the tasks history, we can finally discuss the database interface and the database schema.
The database interface¶
The database interface is the best described by the
hot_tm_proposal.database
package:
HOT TM database abstraction.
This package abstracts the database. It contains the modules with the queries to store and retrieve the information from the PostGIS database (based on the PostgreSQL).
The database interface (procedure names and DTOs) that modules must implement is:
create_project(pid)
Create new project in the database identified by
pid
.pid
is a number greater than zero.map_random(pid, by)
Select random task of project
pid
and set its status to lockedby
.by
is a string representing the mapper’s name. Returnpid
,tid
, andgeometry
.finish(pid, tid, by)
Set the status of task
tid
of the projectpid
to unlocked to checkby
.tid
is a number greater than zero. Returnpid
andtid
.check_recent(pid, by)
Select the newest mapped task of pproject
pid
and set its status to lockedby
. Returnpid
,tid
andgeometry
.good(pid, tid, by)
Set the status of task
tid
of the projectpid
to unlocked doneby
. Returnpid
andtid
.bad(pid, tid, by)
Set the status of task
tid
of the projectpid
to unlocked to mapby
. Returnpid
andtid
.all_tasks_states
Return the list of dictionaries representing the state of the project’s tasks. The list contains dictionaries with project identifier
pid
, task identifiertid
, and the currentstate
of each particular task.
There are following modules in the database
package:
like_tm_admin
Try to be as close as possible to the TM Admin. Not yet implemented.
almost_tm_admin
Use the database schema of the TM Admin, but different database queries.
actions_history
Keep read-only information about tasks separated from what happenned to the tasks.
The database
package also contains db
module, but the db
module is used solely to keep the database pool and to provide helper
procedures for the database access.
The database schema¶
There are tables common to all tested database schemas and particular
tables that are implementation specific. There is always projects
table with the following columns:
pid
– project identifier, integer, primary key, automaticallygeometry
– area geometry (GeoJSON when no PostGIS extension)description
– text, the project description
The approach of Tasks history in HOT TM is to use tasks
and
task_history
tables, but the implementation leads to the duplicated
information and therefore increased complexity to keep the information
synchronized.
Tasks history in TM Admin describes the approach where only tasks
table is used and task history is concatenated as JSON list in the
history
column.
Both approaches understand the task and the task history as an object that can be considered to Create, Read, Update, Delete (CRUD). Task state is a property of the task being updated and the action is a property of the record in the task history. The approaches only differ in how the objects, task and task history, are stored in the database.
CRUD is not always the best way to work with objects, especially when the history is needed. This is because the update on the object ignores its history, so the history must be kept elsewhere. Instead of updating the task status and keeping the separated history synchronized, it is more convenient to keep the history and compute the status when needed.
In Actions history, we introduce the database schema with tasks
and actions
tables, where the database records are only created and
read, but never updated. Action is first class database record and task
state is a read-only database query.
Non-goals¶
This document does not cover the whole database schema or the whole database schema for the tasks. This document covers part of the database, limited to tasks and the transitions between the task states. It is expected there are similar documents covering other parts of the database as groups or campaigns, and that there is another document that puts all these parts together and introduces overall database schema.
This document does not cover task issues or annotations. The document aims solely on the main function of tasks within the HOT TM and keeping the tasks history.
This document does not introduce production-ready database interface or schema.
Tasks history in HOT TM¶
We do not provide test implementation for the tasks history that is comparable to the HOT TM current implementation. In general, our approach is to read the code, try to understand the idea behind the code and implement a restricted version to test that idea. We do not see the idea hidden in the current implementation of HOT TM.
tasks
table has the following columns:
pid
– project identifier, reference toprojects
tabletid
– task identifier,tid
andpid
are together primary key intasks
tablegeometry
– task geometry (GeoJSON when no PostGIS extension)status
– string for simplicity, task’s current statelocked_by
– string for simplicity, who did the last “lock” modificationmapped_by
– string for simplicity, who did finish actionvalidated_by
– string for simplicity, who did good actiontask_history
– relation to thetask_history
table
and task_history
has the columns:
pid
– project identifier, reference toprojects
tabletid
– task identifier, reference totasks
tablehid
– task history identifieraction_purpose
– string, what action happenedaction_text
– string, note about the actionaction_date
– timestamp, when the action happenedaction_by
– string, the author of the action
Tasks history in TM Admin¶
TM Admin is a new approach for the complex solution of Tasking Manager-like projects. The database schema in this section follows the idea of storing the task history in the JSON column.
We provide two implementations: like_tm_admin
and
almost_tm_admin
. Former tries to be as close as possible to the TM
Admin, latter keeps the database schema but uses different database
queries.
tasks
table has the following columns:
pid
– project identifier, reference toprojects
tabletid
– task identifier,tid
andpid
are together primary key intasks
tablegeometry
– task geometry (GeoJSON when no PostGIS extension)status
– string for simplicity, task’s current statelocked_by
– string for simplicity, who did the last “lock” modificationmapped_by
– string for simplicity, who did finish actionvalidated_by
– string for simplicity, who did good actionhistory
– JSON (better or binary) list of action records where each action record containpurpose
,text
,date
,by
.
like_tm_admin
¶
almost_tm_admin
¶
Actions history¶
We introduce the database schema where the information about the tasks
is only added to the database. The tasks
stores the read-only
information about tasks that do not change, like project identifier
pid
, task identifier tid
and task border geometry
. On the
other hand, the actions
table stores what happenned to the tasks.
tasks
table has the following columns:
pid
– project identifier, reference toprojects
tabletid
– task identifier,tid
andpid
are together primary key intasks
tablegeometry
– task geometry (GeoJSON when no PostGIS extension)
and actions
table has the columns:
pid
– project identifier, reference toprojects
tabletid
– task identifier, reference totasks
tableaid
– action identifierpurpose
– what action happened?text
– note about the action?date
– when the action happened?by
– the author of the action?
actions_history
¶
Experiments¶
We use hot_tm_proposal.database.test module to perform the tests of the database module:
In this section, we describe what experiments we performed. We provide
the source code of the experiments in the test
module of the
hot_tm_proposal.database
Python3 package.
We compare Tasks history in TM Admin and Actions history in the following ways:
We create the database with two projects, each having 1000 tasks and we measure the overall time to create the projects.
Setting
LONGER_TEST = True
will create 10 projects.We measure the overall time for 100 task state changes of the single project: find random task to map and lock it for mapping, then immediately unlock it to check.
Then, we measure the overall time for 1000 task state changes of the single project: find random task to map or check, lock it and then immediately unlock it to check.
Setting
LONGER_TEST = True
will measure 5000 actions over all 10 projects randomly.We measure the overall time for retrieving the last status of each task of the area.
Conclusion¶
From the results in Test two projects with 1000 requests to map or check and Test ten projects with 5000 requests to map or check, we can see that for the single mapper, the Tasks history in TM Admin that stores task history in JSON column of the tasks table, yields better results than Actions history. However, we need to be aware of that:
We did not try to retrieve all actions of a single project chronologically ordered, i.e., from the newest to the oldest.
Doing this using Actions history is trivial read-only query. Doing this using Tasks history in TM Admin leads to read-only query to retrieve data (JSON lists) from all (jsonb) history columns, followed by merging the data (JSON lists) by the date contained in the dictionaries (actions) in the JSON lists.
We implemented the tested functionality, like map random task, as a single (serializable) query, but we have no idea what approach will be used within TM Admin.
We did not tested parallel accesses to the database, i.e. we do not know how these two approaches compare in the environment of many mappers; yet this needs to be tested.
We did not implemented the database design from Tasks history in HOT TM, because the relations between the tables were not clear enough after reading the source code of HOT TM. This would probably not influence the results.