Annotation Database Import App
Latest revision as of 16:30, 10 July 2024
Data for import is produced by principal investigators, annotators and NDST. As a final step, the importer will combine all of this data into a single importable dataset.
There are three steps for this final import process.
Downloading Import Data
The database import queue page displays lists of jobs configured by principal investigators, annotators and NDST using their respective tools. You will see a table containing jobs for Principal Investigators and Annotators. Click the download button beside each one to download a JSON file containing the job data for the cruise you wish to import.
The NDST section has a button labeled "Refresh from NDST Database". This will load the data from the Dive Logging App into a transitional table in the main database. If new equipment or personnel have been created in the Dive Logging App, a table for each will appear, allowing you to map the new equipment onto an extant piece of hardware in the database, or the new person onto an entry in the personnel table. If the person or equipment haven't been created yet, you can create them by clicking the "+" button.
Below this, you will find a table marked "Current Cruises" containing complete metadata for the cruise, dives, transects, personnel and equipment configurations. Download the file for the cruise you'll be importing.
The Import Program
Cruises can be imported on the command line using the `importer.py` program, or with the graphical utility by passing the `-g` switch. The program lets the user load data into the Development, Staging and Production environments using a loaded configuration file, and select which types of entities are to be dropped and/or (re)created.
The import process is much faster when run on the server using the command-line interface, but note the Access database limitation described under Importing on Linux.
The program provides a dry-run feature, so the import can be checked for validity before it is committed.
Import Program Algorithm
The import program proceeds through these steps:
- Load the handlers.
- Load the main configuration file.
- Assemble a list of required steps based on user inputs, such as which entities to replace and/or load.
- Load the import job configuration for principal investigators.
- Create and configure the cruise entity to which all other entities are attached.
- Load the import job configuration for annotators.
- Configure the label tree that will be used to map incoming annotation labels to database entities.
- Create and/or configure the annotation protocol used for annotations represented by the label tree.
- Load the personnel mapping for Biigle annotators.
- Load the import job configuration for NDST.
- Update the cruise with additional metadata.
- Create or update the dives.
- Create or update the transects.
- Create or update the platform (ship, submersible) and equipment configurations.
- Create or update the personnel roles on the cruise, dives and transects.
- Generate the interval trees used in subsequent phases to locate, by temporal correspondence, the dives and transects during which observations occurred.
- Resolve the label map -- loads the necessary lookups to satisfy the references in mapped labels.
- Process Biigle annotations.
- Process VideoMiner annotations.
- Process CSV annotations.
- Process data streams (navigation, telemetry, water properties, etc.).
- Process comment events from CSV files.
- Process status events from CSV files.
- Process measurement events from CSV files.
- Commit or roll back changes, depending on whether the dry-run option is selected.
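The commit-or-rollback behaviour of the final step can be sketched as follows. This is an illustrative sketch only; the class and method names are assumptions, not the importer's actual API:

```python
class ImportRun:
    """Sketch of the dry-run commit/rollback control flow (illustrative, not the real API)."""

    def __init__(self, dry_run=True):
        self.dry_run = dry_run
        self.steps = []

    def add_step(self, fn):
        # Each step is one phase of the algorithm above (load configs, create dives, ...).
        self.steps.append(fn)

    def run(self, tx):
        # Execute every step inside a single transaction.
        try:
            for fn in self.steps:
                fn()
        except Exception:
            tx.rollback()
            raise
        # A dry run validates the whole import, then discards the changes.
        if self.dry_run:
            tx.rollback()
        else:
            tx.commit()
```

Because every phase runs inside one transaction, a dry run exercises the full pipeline and still leaves the database untouched.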
Configuration Files
The import program uses JSON configuration files. The main configuration file is named `config.json` by convention.
Main Configuration File
The main config file provides some configuration values, and also provides links to other configuration files, which load and configure specific chunks of importable data. This is the file that is selected for loading in the import program.
Name | Type | Required | Description |
---|---|---|---|
iqa_file | string | Yes, if there are annotations. | The configuration file generated by the Annotation Database Import for Annotators app. |
iqpi_file | string | Yes | The configuration file generated by the Annotation Database Import for Principal Investigators app. |
ndst_file | string | Yes | The configuration file generated from the NDST Dive Logging App. |
biigle_configs | string[] | Yes, if there are Biigle annotations | One or more files containing configurations for annotations in Biigle format. |
vm_configs | string[] | Yes, if there are VideoMiner annotations | One or more files containing configurations for annotations in VideoMiner format. |
csv_configs | string[] | Yes, if there are CSV annotations | One or more files containing configurations for annotations in CSV format. |
status_configs | string[] | Yes, if there are status events | One or more files containing configurations for status events in CSV format. |
measurement_configs | string[] | Yes, if there are measurement events | One or more files containing configurations for measurement events in CSV format. |
stream_configs | string[] | Yes, if there are navigation or telemetry streams | One or more files containing configurations for data streams in CSV format. |
comment_configs | string[] | Yes, if there are comments | One or more files containing configurations for comments in CSV format. |
media_path | string | No | The root URL for a site to which relative media paths can be appended to access the media. |
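Assembled from the table above, a `config.json` might look like the following; all filenames and the media URL are illustrative:

```json
{
  "iqa_file": "annotator_job.json",
  "iqpi_file": "pi_job.json",
  "ndst_file": "ndst_cruise.json",
  "biigle_configs": ["biigle_dive1.json"],
  "vm_configs": ["videominer_dive2.json"],
  "csv_configs": ["csv_annotations.json"],
  "status_configs": ["status_events.json"],
  "measurement_configs": ["measurements.json"],
  "stream_configs": ["nav_stream.json"],
  "comment_configs": ["comments.json"],
  "media_path": "https://example.org/media"
}
```

Optional sections (for example `vm_configs` when there are no VideoMiner annotations) can simply be omitted.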
Principal Investigator Import Job Configuration
The Principal Investigator's import job configuration is generated by the Web app using inputs provided by the operator, and is not manually created or edited. The configuration object is contained in a top-level `result` property. Unimportant properties are excluded.
Name | Type | Required | Description |
---|---|---|---|
id | integer | True | The database ID of the job. |
mseauser | object | True | Contains the user's Biigle credentials and other information. |
cruise | object | True | Contains information about the cruise and subsidiary objects. |
The `cruise` object has these properties:

Name | Type | Required | Description |
---|---|---|---|
id | integer | True | The database ID of the cruise. |
programs | object[] | True | Contains a list of associated programs. |
first_nation_contacts | object[] | True | Contains a list of First Nation contacts related to the cruise. |
dives | object[] | True | Contains a list of dives. |
Each entry in `dives` is a dive object:

Name | Type | Required | Description |
---|---|---|---|
id | integer | True | The database ID of the dive. |
name | string | True | The name of the dive. |
start_time | datetime | True | The start time of the dive. |
end_time | datetime | True | The end time of the dive. |
objective | string | False | The objective of the dive. |
summary | string | False | A summary of the dive. |
note | string | False | Notes about the dive. |
cruise | integer | True | The database ID of the cruise. |
sub_config | object | True | Contains the platform configuration for the dive. |
ship_config | object | True | Contains the platform configuration for the ship. |
site | object | True | Contains an object which represents the survey site. |
transects | object[] | True | Contains a list of transects. |
crew | object[] | True | Contains a list of crew members. |
Each entry in `transects` is a transect object:

Name | Type | Required | Description |
---|---|---|---|
id | integer | True | The database ID of the transect. |
name | string | True | The name of the transect. |
start_time | datetime | True | The start time of the transect. |
end_time | datetime | True | The end time of the transect. |
objective | string | False | The objective of the transect. |
summary | string | False | A summary of the transect. |
note | string | False | Notes about the transect. |
Each entry in `crew` is a crew item:

Name | Type | Required | Description |
---|---|---|---|
id | integer | True | The database ID of the crew item. |
dive | object | True | An object representing the dive. |
person | object | True | An object representing the crew member. |
dive_role | object | True | An object representing the crew member's role. |
note | string | False | A note about this crew member. |
`sub_config` and `ship_config` are platform configuration objects:

Name | Type | Required | Description |
---|---|---|---|
id | integer | True | The database ID of the platform configuration. |
platform | object | True | An object representing the platform. |
instrument_configs | object[] | True | A list of instrument configurations. |
configuration | object | False | A free-form JSON object containing configured properties of the platform. |
Each entry in `instrument_configs` is an instrument configuration object:

Name | Type | Required | Description |
---|---|---|---|
id | integer | True | The database ID of the instrument configuration. |
instrument | object | True | An object representing the instrument. |
configuration | object | False | A free-form JSON object containing configured properties of the instrument. |
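Assembled from the tables above, a heavily abbreviated job file might look like the following; all values are illustrative:

```json
{
  "result": {
    "id": 12,
    "mseauser": { "biigle_credentials": "..." },
    "cruise": {
      "id": 34,
      "programs": [],
      "first_nation_contacts": [],
      "dives": [
        {
          "id": 56,
          "name": "Dive 1",
          "start_time": "2024-07-01T14:00:00Z",
          "end_time": "2024-07-01T16:30:00Z",
          "cruise": 34,
          "sub_config": { "id": 7, "platform": {}, "instrument_configs": [] },
          "ship_config": { "id": 8, "platform": {}, "instrument_configs": [] },
          "site": {},
          "transects": [
            {
              "id": 90,
              "name": "T1",
              "start_time": "2024-07-01T14:10:00Z",
              "end_time": "2024-07-01T14:40:00Z"
            }
          ],
          "crew": [
            { "id": 1, "dive": {}, "person": {}, "dive_role": {} }
          ]
        }
      ]
    }
  }
}
```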
Annotator Import Job Configuration
NDST Import Job Configuration
Biigle Annotation Configuration
VideoMiner Annotation Configuration
VideoMiner data is usually stored in Microsoft Access databases. In the standard layout, data are stored in the `data` table, with lookups in tables prefixed `lu_*`. Scientists frequently change the structure of the database, add or remove items from the lookups, or remove the lookups altogether, which makes reconstructing the dataset very difficult. Older versions of VideoMiner use different naming constructions.
The VideoMiner configuration file has the following fields. (TBD)
Importing on Linux
A reliable Access driver doesn't currently exist for Linux, but Access databases may be converted to SQLite using DBeaver. In this case the configuration is identical, but `db_file` must point to a file with the extension `.sqlite`.
Note: When converting timestamps using DBeaver, it may be necessary to set the `timezone_offset` property in the configuration, even though it may not be necessary when using the Access file directly (SQLite does not preserve the timezone offset). Also note that some VideoMiner databases place the date and time in separate columns; this cannot work once the fields are converted to integer timestamps, so they should be combined into a single column before conversion.
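As a sketch of the column-combining step, assuming a SQLite copy with a `data` table and separate `date` and `time` text columns (the actual table and column names vary by database):

```python
import sqlite3


def combine_date_time(con, table="data"):
    """Add a combined timestamp column built from separate date and time columns.

    Illustrative sketch: assumes text columns named "date" and "time";
    adjust the names to match the actual VideoMiner schema.
    """
    con.execute(f"ALTER TABLE {table} ADD COLUMN timestamp TEXT")
    # SQLite's || operator concatenates strings.
    con.execute(f"UPDATE {table} SET timestamp = date || ' ' || time")
    con.commit()
```

Run this on the converted SQLite file before the import, so the importer sees a single timestamp column.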
CSV Annotation Configuration
Video and photo annotations can be stored in CSV files (as opposed to VideoMiner, Biigle, etc.) which may not have a formal structure. They can be converted into label trees and mapped using the Label Mapping app. The annotation file and label tree can then be loaded using the `csv_configs` section.
Name | Type | Required | Description |
---|---|---|---|
db_file | string | Yes | The name of a file containing observations (of habitat, species, etc.) which have been mapped using the Label Mapping app. The mappings are contained in the iqa_file file. |
label_tree_file | string | Yes | The name of a JSON file containing labels mapped by the Annotation Database Import for Annotators |
id_column | string | Yes | The name of a column containing the original ID of the row. |
label_column | string | Yes | The name of the column containing the mapped label. |
timestamp_column | string | Yes | The name of the column containing the timestamp in the standard format. |
medium_filename_column | string | No | The name of the column containing the filenames of media referenced by the record. |
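A `csv_configs` entry might look like the following; the filenames and column names are illustrative:

```json
{
  "db_file": "dive1_annotations.csv",
  "label_tree_file": "dive1_label_tree.json",
  "id_column": "row_id",
  "label_column": "mapped_label",
  "timestamp_column": "timestamp",
  "medium_filename_column": "filename"
}
```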
Comment Event Configuration
Status Event Configuration
Status events can be represented in a CSV file.
TODO: At present only the on-bottom event is implemented.
Name | Type | Required | Description |
---|---|---|---|
db_file | string | Yes | The name of a file containing status events. |
id_column | string | Yes | The name of a column containing the original ID of the row. |
on_bottom_column | string | No | If the on-bottom status event is used, this column contains a 0 for off the bottom, and 1 for on the bottom. The event is created at the first record, a change of the value, and at the end. |
timestamp_column | string | Yes | The name of the column containing the timestamp in the standard format. |
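The on-bottom rule described above (an event at the first record, at each change of value, and at the end) can be sketched as follows; the function name and row layout are illustrative assumptions:

```python
def on_bottom_events(rows):
    """Derive on-bottom status events from (timestamp, value) rows.

    value is 0 (off the bottom) or 1 (on the bottom). An event is emitted
    for the first record, for every change of value, and for the last record.
    Illustrative sketch only.
    """
    events = []
    previous = None
    for i, (ts, value) in enumerate(rows):
        last = i == len(rows) - 1
        if previous is None or value != previous or last:
            events.append((ts, value))
        previous = value
    return events
```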
Measurement Event Configuration
Measurement events are human-recorded measurements which can be represented in a CSV file. The configuration contains a property, `configs`, which is a list of objects with the properties listed below. Each configuration saves a single measurement as a measurement event.
Name | Type | Required | Description |
---|---|---|---|
db_file | string | Yes | The name of a file containing measurement events. |
timestamp_column | string | Yes | The timestamp column, in the standard format. |
configs | object[] | Yes | The list of measurement configuration objects (below). |
Name | Type | Required | Description |
---|---|---|---|
column_name | string | Yes | The name of the column containing the quantity. |
id_column | string | Yes | The name of a column containing the original ID of the row. |
measurement_type | string | Yes | The short code of a measurement type. |
Name | Type | Required | Description |
---|---|---|---|
name | string | Yes | The name of the measurement configuration. |
id_column | string | Yes | The name of a column containing the original ID of the row. |
instrument_config | string | Yes, if instrument_config_map is not configured. | The name of an instrument configuration as created in the NDST Dive Logging App. |
instrument_config_map | object | Yes, if instrument_config is not configured. | A mapping from instruments in a given second column to instrument configurations. Allows a stream of measurements to be generated by multiple instruments, selected for the highest quality. |
data_config | string[] | Yes | A tuple containing the type ("measurement", "position" or "orientation") and the short code of a MeasurementType, PositionType or OrientationType. |
columns | string[] | Yes | The names of the columns containing the quantity. If multiple columns are provided, the quantity is a tuple assembled from multiple values, as in the case of a position or orientation. In most cases, this will contain one item. |
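A measurement event configuration built from the tables above might look like the following; the filenames, column names and measurement-type short codes are illustrative:

```json
{
  "db_file": "dive1_measurements.csv",
  "timestamp_column": "timestamp",
  "configs": [
    {
      "column_name": "depth_m",
      "id_column": "row_id",
      "measurement_type": "depth"
    },
    {
      "column_name": "temp_c",
      "id_column": "row_id",
      "measurement_type": "temperature"
    }
  ]
}
```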
Stream Configuration
Streams of machine-generated navigation, telemetry and water-properties data are represented in CSV files. The configuration contains a property, `stream_configs`, which is a list of objects with the properties listed below. Each configuration saves a single measurement as a measurement event. There may be many individual measurement and/or navigation configurations.
Name | Type | Required | Description |
---|---|---|---|
db_file | string | Yes | The name of a file containing data. |
timestamp_column | string | Yes | The timestamp column, in the standard format. |
stream_configs | object[] | Yes | The list of stream configuration objects. |
Handlers
Handlers are callable classes that process annotation labels, tags and the information associated with them. There are two types of labels: tag labels and data labels. Each type of handler declares a list of tags which, when they appear, will trigger a call to the handler.
All available handlers are loaded from the `tag_handler`s and `property_handler`s directories in the `importer` constructor.
Tag Handlers
Tag handlers are called when a tag appears and do not expect to handle or process data. Each has a `matches()` method, which returns true if the given label satisfies the handler's requirements. The handler's `__call__()` method is called with the current input row, the event context and other information to trigger configuration of the event.
The current list of tag handlers is:
- `comment_tag_handler` -- Handles comments in the input data.
- `habitat_tag_handler` -- Handles habitat annotations.
- `ignore_tag_handler` -- Flags the record to be ignored.
- `laser_point_tag_handler` -- Handles a laser point annotation.
- `not_annotated_tag_handler` -- Marks a non-annotated region.
- `observation_tag_handler` -- Handles observation annotations.
- `on_bottom_tag_handler` -- Handles updates on the state of the platform: on or off the bottom.
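A handler following the interface described above might be sketched like this; the class name, tag values and call parameters are illustrative assumptions, not the importer's actual code:

```python
class OnBottomTagHandler:
    """Illustrative sketch of a tag handler (not the actual implementation)."""

    # The tags that trigger a call to this handler.
    tags = {"on bottom", "off bottom"}

    def matches(self, label):
        # True if the given label satisfies this handler's requirements.
        return label.lower() in self.tags

    def __call__(self, row, context):
        # Configure the event context from the current input row.
        context["on_bottom"] = row["label"].lower() == "on bottom"
        return context
```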
Property Handlers
Property handlers expect to handle or process data associated with the tag.
The current list of tag data handlers is:
- `habitat_property_handler` -- Applies habitat data to the habitat event context from the input.
- `observation_property_handler` -- Applies observation data to the habitat event context from the input.
- `status_property_handler` -- Applies status event data to the habitat event context from the input.