Annotation Database Import App


Data for import is produced by principal investigators, annotators and NDST. As a final step, the importer will combine all of this data into a single importable dataset.

There are three steps for this final import process.

Downloading Import Data

The database import queue page displays lists of jobs configured by principal investigators, annotators and NDST using their respective tools. You will see a table containing jobs for Principal Investigators and Annotators. Click the download button beside each one to download a JSON file containing the job data for the cruise you wish to import.

The NDST section has a button labeled "Refresh from NDST Database". This loads the data from the Dive Logging App into a transitional table in the main database. If new equipment or personnel have been created in the Dive Logging App, a table for each will appear, allowing you to map the new equipment onto an existing piece of hardware in the database, or a new person onto an entry in the personnel table. If the person or equipment does not exist yet, you can create it by clicking the "+" button.

Below this, you will find a table marked "Current Cruises" containing complete metadata for the cruise, dives, transects, personnel and equipment configurations. Download the file for the cruise you'll be importing.

The Import Program

Cruises can be imported on the command line using the importer.py program, or through its graphical interface by launching importer.py with the -g switch. The program allows the user to load data into the Development, Staging and Production environments using a loaded configuration file, and to select which types of entities are to be dropped and/or (re)created.

The import process is much faster if run on the server using the command-line interface, but note the issue with Access databases described under Importing on Linux below.

The program provides a dry-run feature, so the import can be checked for validity before it is committed.
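
The dry-run behaviour can be sketched as a single database transaction that is rolled back instead of committed. This is only an illustrative sketch, not the importer's actual implementation; the table and dive names are hypothetical.

```python
import sqlite3

def run_import(conn, dive_names, dry_run=True):
    """Apply all inserts inside one transaction; roll back for a dry run."""
    cur = conn.cursor()
    try:
        for name in dive_names:
            cur.execute("INSERT INTO dives (name) VALUES (?)", (name,))
        if dry_run:
            conn.rollback()   # validate only; leave the database untouched
        else:
            conn.commit()
    except Exception:
        conn.rollback()
        raise

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dives (name TEXT)")
conn.commit()
run_import(conn, ["Dive 1"], dry_run=True)
print(conn.execute("SELECT COUNT(*) FROM dives").fetchone()[0])  # 0: nothing committed
```

Because every write happens inside the transaction, a dry run exercises all validation and constraint checks without changing the target environment.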

Import Program Algorithm

The import program progresses through these steps:

  1. Load the handlers.
  2. Load the main configuration file.
  3. Assemble a list of required steps based on user inputs, such as which entities to replace and/or load.
  4. Load the import job configuration for principal investigators.
    • Create and configure the cruise entity to which all other entities are attached.
  5. Load the import job configuration for annotators.
    • Configure the label tree that will be used to map incoming annotation labels to database entities.
    • Create and/or configure the annotation protocol used for annotations represented by the label tree.
    • Load the personnel mapping for Biigle annotators.
  6. Load the import job configuration for NDST.
    • Update the cruise with additional metadata.
    • Create or update the dives.
    • Create or update the transects.
    • Create or update the platform (ship, submersible) and equipment configurations.
    • Create or update the personnel roles on the cruise, dives and transects.
    • Generate the interval trees used in subsequent phases to locate, by temporal correspondence, the dives and transects during which observations occurred.
  7. Resolve the label map -- load the necessary lookups to satisfy the references in mapped labels.
  8. Process Biigle annotations.
  9. Process VideoMiner annotations.
  10. Process CSV annotations.
  11. Process data streams (navigation, telemetry, water properties, etc.)
  12. Process comment events from CSV files.
  13. Process status events from CSV files.
  14. Process measurement events from CSV files.
  15. Commit or roll back changes, depending on whether the dry run state is chosen.
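
The interval lookup in step 6 can be sketched with a simple sorted-list index. This is a stand-in for the importer's actual interval trees, and assumes the intervals (dives on one cruise) do not overlap.

```python
import bisect
from datetime import datetime

class IntervalIndex:
    """Look up which interval (e.g. a dive or transect) contains a timestamp."""

    def __init__(self, intervals):
        # intervals: list of (start, end, payload), non-overlapping
        self.intervals = sorted(intervals)
        self.starts = [iv[0] for iv in self.intervals]

    def find(self, t):
        # Find the last interval starting at or before t, then test containment.
        i = bisect.bisect_right(self.starts, t) - 1
        if i >= 0 and self.intervals[i][0] <= t <= self.intervals[i][1]:
            return self.intervals[i][2]
        return None

dives = IntervalIndex([
    (datetime(2023, 6, 1, 8), datetime(2023, 6, 1, 12), "Dive 1"),
    (datetime(2023, 6, 1, 14), datetime(2023, 6, 1, 18), "Dive 2"),
])
print(dives.find(datetime(2023, 6, 1, 15)))  # Dive 2
print(dives.find(datetime(2023, 6, 1, 13)))  # None (between dives)
```

Each incoming observation's timestamp is resolved this way to the dive and transect it belongs to.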

Configuration Files

The import program uses JSON configuration files. The main configuration file is named config.json by convention.

Main Configuration File

The main config file provides some configuration values, and also provides links to other configuration files, which load and configure specific chunks of importable data. This is the file that is selected for loading in the import program.

Main Configuration File Contents
Name Type Required Description
iqa_file string Yes, if there are annotations. The configuration file generated by the Annotation Database Import for Annotators app.
iqpi_file string Yes The configuration file generated by the Annotation Database Import for Principal Investigators app.
ndst_file string Yes The configuration file generated from the NDST Dive Logging App.
biigle_configs string[] Yes, if there are Biigle annotations One or more files containing configurations for annotations in Biigle format.
vm_configs string[] Yes, if there are VideoMiner annotations One or more files containing configurations for annotations in VideoMiner format.
csv_configs string[] Yes, if there are CSV annotations One or more files containing configurations for annotations in CSV format.
status_configs string[] Yes, if there are status events One or more files containing configurations for status events in CSV format.
measurement_configs string[] Yes, if there are measurement events One or more files containing configurations for measurement events in CSV format.
stream_configs string[] Yes, if there are navigation or telemetry streams One or more files containing configurations for data streams in CSV format.
comment_configs string[] Yes, if there are comments One or more files containing configurations for comments in CSV format.
media_path string No The root URL for a site to which relative media paths can be appended to access the media.
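
A minimal main configuration might look like the following. The filenames and URL are placeholders; only the keys come from the table above.

```python
import json

# A hypothetical config.json; adapt the filenames to the cruise being imported.
config = {
    "iqpi_file": "pi_job.json",
    "iqa_file": "annotator_job.json",
    "ndst_file": "ndst_job.json",
    "biigle_configs": ["biigle_dive1.json"],
    "csv_configs": ["csv_habitat.json"],
    "stream_configs": ["nav_stream.json"],
    "media_path": "https://example.org/media",
}
print(json.dumps(config, indent=2))
```

Sections that do not apply to a given cruise (e.g. vm_configs when there are no VideoMiner annotations) are simply omitted.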

Principal Investigator Import Job Configuration

The Principal Investigator's import job configuration is generated by the Web app using inputs provided by the operator, and is not manually created or edited. The configuration object is contained in a top-level result property. Unimportant properties are excluded.

Principal Investigator Import Job Configuration File Contents
Name Type Required Description
id integer True The database ID of the job.
mseauser object True Contains the user's Biigle credentials and other information.
cruise object True Contains information about the cruise and subsidiary objects.

Cruise Object Contents
Name Type Required Description
id integer True The database ID of the cruise.
programs object[] True Contains a list of associated programs.
first_nation_contacts object[] True Contains a list of First Nation contacts related to the cruise.
dives object[] True Contains a list of dives.

Dive Object Contents
Name Type Required Description
id integer True The database ID of the dive.
name string True The name of the dive.
start_time datetime True The start time of the dive.
end_time datetime True The end time of the dive.
objective string False The objective of the dive.
summary string False A summary of the dive.
note string False Notes about the dive.
cruise integer True The database ID of the cruise.
sub_config object True Contains the platform configuration for the dive.
ship_config object True Contains the platform configuration for the ship.
site object True Contains an object which represents the survey site.
transects object[] True Contains a list of transects.
crew object[] True Contains a list of crew members.

Transect Configuration Object Contents
Name Type Required Description
id integer True The database ID of the transect.
name string True The name of the transect.
start_time datetime True The start time of the transect.
end_time datetime True The end time of the transect.
objective string False The objective of the transect.
summary string False A summary of the transect.
note string False Notes about the transect.

Dive Crew Configuration Object Contents
Name Type Required Description
id integer True The database ID of the crew item.
dive object True An object representing the dive.
person object True An object representing the crew member.
dive_role object True An object representing the crew member's role.
note string False A note about this crew member.

Platform Configuration Object Contents
Name Type Required Description
id integer True The database ID of the platform configuration.
platform object True An object representing the platform.
instrument_configs object[] True A list of instrument configurations.
configuration object False A free-form JSON object containing configured properties of the platform.

Instrument Configuration Object Contents
Name Type Required Description
id integer True The database ID of the instrument configuration.
instrument object True An object representing the instrument.
configuration object False A free-form JSON object containing configured properties of the instrument.
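
Putting the tables above together, a trimmed Principal Investigator job configuration has roughly this shape. All values are illustrative, and optional and "unimportant" properties are omitted.

```python
import json

# A hypothetical, heavily trimmed PI import job configuration.
job = {
    "result": {
        "id": 42,
        "mseauser": {},  # Biigle credentials omitted here
        "cruise": {
            "id": 7,
            "programs": [],
            "first_nation_contacts": [],
            "dives": [
                {
                    "id": 1,
                    "name": "Dive 1",
                    "start_time": "2023-06-01T08:00:00Z",
                    "end_time": "2023-06-01T12:00:00Z",
                    "cruise": 7,
                    "sub_config": {"id": 3, "platform": {}, "instrument_configs": []},
                    "ship_config": {"id": 4, "platform": {}, "instrument_configs": []},
                    "site": {},
                    "transects": [],
                    "crew": [],
                }
            ],
        },
    }
}
print(json.dumps(job)[:40])
```

Note the top-level result property wrapping the configuration object, as described above.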

Annotator Import Job Configuration

NDST Import Job Configuration

Biigle Annotation Configuration

VideoMiner Annotation Configuration

VideoMiner data is usually stored in Microsoft Access databases. In the standard layout, data are stored in the data table, with lookups in tables prefixed lu_*. Scientists frequently change the structure of the database, or add or remove items from the lookups; sometimes the lookups are removed altogether, which makes reconstructing the dataset very difficult. Older versions of VideoMiner use different naming conventions.

The VideoMiner configuration file has the following fields. (TBD)

Importing on Linux

A reliable Access driver doesn't currently exist for Linux, but Access databases may be converted to SQLite using DBeaver. In this case the configuration is identical, except that db_file must point to a file with the extension .sqlite.

Note: When converting timestamps using DBeaver, it may be necessary to configure the timezone_offset property in the configuration, even though it may not be necessary when using the Access file directly (SQLite does not preserve the timezone offset). Also note that some VideoMiner databases place the date and time in separate columns; this cannot work when the fields are converted to integer time stamps, so they should be combined into a single column before conversion.
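
Combining separate date and time columns can be done with a single SQL update before converting the timestamps. The table and column names below are hypothetical; adapt them to the VideoMiner database at hand.

```python
import sqlite3

# Merge separate date and time text columns into one timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, obs_date TEXT, obs_time TEXT)")
conn.execute("INSERT INTO data VALUES (1, '2023-06-01', '08:15:30')")
conn.execute("ALTER TABLE data ADD COLUMN timestamp TEXT")
conn.execute("UPDATE data SET timestamp = obs_date || 'T' || obs_time")
print(conn.execute("SELECT timestamp FROM data").fetchone()[0])
# 2023-06-01T08:15:30
```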

CSV Annotation Configuration

Video and photo annotations can be stored in CSV files (as opposed to VideoMiner, Biigle, etc.), which may not have a formal structure. They can be converted into label trees and mapped using the Label Mapping app. The annotation file and label tree can then be loaded using the csv_configs section.

CSV Configuration File Contents
Name Type Required Description
db_file string Yes The name of a file containing observations (of habitat, species, etc.) which have been mapped using the Label Mapping app. The mappings are contained in the iqa_file file.
label_tree_file string Yes The name of a JSON file containing labels mapped by the Annotation Database Import for Annotators app.
id_column string Yes The name of a column containing the original ID of the row.
label_column string Yes The name of the column containing the mapped label.
timestamp_column string Yes The name of the column containing the timestamp in the standard format.
medium_filename_column string No The name of the column containing the filenames of media referenced by the record.
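
A csv_configs entry using these fields might look like the following; the file and column names are placeholders.

```python
import json

# A hypothetical CSV annotation configuration.
csv_config = {
    "db_file": "habitat_observations.csv",
    "label_tree_file": "habitat_label_tree.json",
    "id_column": "row_id",
    "label_column": "mapped_label",
    "timestamp_column": "timestamp",
    "medium_filename_column": "still_filename",
}
print(json.dumps(csv_config, indent=2))
```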

Comment Event Configuration

Status Event Configuration

Status events can be represented in a CSV file.

TODO: At present only the on-bottom event is implemented.

Status Configuration File Contents
Name Type Required Description
db_file string Yes The name of a file containing status events.
id_column string Yes The name of a column containing the original ID of the row.
on_bottom_column string No If the on-bottom status event is used, this column contains a 0 for off the bottom, and 1 for on the bottom. The event is created at the first record, a change of the value, and at the end.
timestamp_column string Yes The name of the column containing the timestamp in the standard format.
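
The on-bottom event rule above (an event at the first record, at each change of the value, and at the end) can be sketched as follows. This is an illustrative reading of the rule, not the importer's actual code.

```python
def on_bottom_events(rows):
    """Yield (timestamp, value) events at the first record, at each change
    of the on-bottom flag, and at the last record."""
    events, last = [], None
    for i, (ts, value) in enumerate(rows):
        if i == 0 or value != last or i == len(rows) - 1:
            events.append((ts, value))
        last = value
    return events

rows = [("10:00", 0), ("10:01", 0), ("10:02", 1), ("10:03", 1), ("10:04", 0)]
print(on_bottom_events(rows))
# [('10:00', 0), ('10:02', 1), ('10:04', 0)]
```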

Measurement Event Configuration

Measurement events are human-recorded measurements which can be represented in a CSV file. The configuration contains a property, configs, which is a list of objects with the properties listed below. Each configuration saves a single measurement as a measurement event.

Measurement Configuration File Contents
Name Type Required Description
db_file string Yes The name of a file containing measurement events.
timestamp_column string Yes The timestamp column, in the standard format.
configs object[] Yes The list of measurement configuration objects (below).
Measurement Configuration File Contents -- Config Objects
Name Type Required Description
column_name string Yes The name of the column containing the quantity.
id_column string Yes The name of a column containing the original ID of the row.
measurement_type string Yes The short code of a measurement type.
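
A measurement configuration with one config object might look like this; the column name and measurement-type short code are placeholders.

```python
import json

# A hypothetical measurement event configuration.
measurement_config = {
    "db_file": "dive_log_measurements.csv",
    "timestamp_column": "timestamp",
    "configs": [
        {"column_name": "bottom_temp", "id_column": "row_id",
         "measurement_type": "TEMP"},
    ],
}
print(json.dumps(measurement_config, indent=2))
```

Each entry in configs saves one column as a separate measurement event, so a file with several measured quantities needs several config objects.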
Stream Configuration

Streams of machine-generated navigation, telemetry and water properties data are represented in CSV files. The configuration contains a property, stream_configs, which is a list of objects with the properties listed below. Each configuration object defines a single data stream, and there may be many individual measurement and/or navigation configurations.

Stream Configuration File Contents
Name Type Required Description
db_file string Yes The name of a file containing data.
timestamp_column string Yes The timestamp column, in the standard format.
stream_configs object[] Yes The list of stream configuration objects (below).

Stream Configuration File Contents -- Config Objects
Name Type Required Description
name string Yes The name of the measurement configuration.
id_column string Yes The name of a column containing the original ID of the row.
instrument_config string Yes, if instrument_config_map is not configured. The name of an instrument configuration as created in the NDST Dive Logging App.
instrument_config_map object Yes, if instrument_config is not configured. A mapping from instruments in a given second column to instrument configurations. Allows a stream of measurements to be generated by multiple instruments, selected for the highest quality.
data_config string[] Yes A tuple containing the type ("measurement", "position" or "orientation") and the short code of a MeasurementType, PositionType or OrientationType.
columns string[] Yes The names of the columns containing the quantity. If multiple columns are provided, the quantity is a tuple assembled from multiple values, as in the case of a position or orientation. In most cases, this will contain one item.
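
A stream configuration for a two-column position stream might look like this. The instrument configuration name, column names and short code are placeholders.

```python
import json

# A hypothetical navigation stream configuration.
stream_config = {
    "db_file": "nav.csv",
    "timestamp_column": "timestamp",
    "stream_configs": [
        {
            "name": "ship position",
            "id_column": "row_id",
            "instrument_config": "Ship GPS",
            "data_config": ["position", "GPS"],
            "columns": ["latitude", "longitude"],
        }
    ],
}
print(json.dumps(stream_config, indent=2))
```

Here the two columns form a single position tuple; a scalar quantity such as depth would list just one column.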


Handlers

Handlers are callable classes that process annotation labels, tags and associated information. There are two types of labels: tag labels and data labels. Each handler declares a list of tags which, when they appear, will trigger a call to the handler.

All available handlers are loaded from the tag_handlers and tag_data_handlers directories in the importer constructor.

Tag Handlers

Tag handlers are called when a tag appears and do not expect to handle or process data. Each has a matches() method, which returns true if the given label satisfies the handler's requirements. The handler's __call__() method is then called with the current input row, the event context and other information to trigger configuration of the event.
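
The handler protocol can be sketched as follows. The class name, tag set and row fields are illustrative, not the importer's actual code.

```python
class CommentTagHandler:
    """Sketch of a tag handler: matches() tests a label, and __call__()
    configures the event for the current input row."""

    tags = {"comment"}

    def matches(self, label):
        # True if this handler is responsible for the given label.
        return label in self.tags

    def __call__(self, row, context, **extra):
        # Configure the event context from the input row.
        context["comment"] = row.get("comment_text", "")
        return context

handler = CommentTagHandler()
print(handler.matches("comment"))                 # True
print(handler({"comment_text": "kelp bed"}, {}))  # {'comment': 'kelp bed'}
```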

The current list of tag handlers is:

  • comment_tag_handler -- Handles comments in the input data.
  • habitat_tag_handler -- Handles habitat annotations.
  • ignore_tag_handler -- Flags the record to be ignored.
  • laser_point_tag_handler -- Handles a laser point annotation.
  • not_annotated_tag_handler -- Marks a non-annotated region.
  • observation_tag_handler -- Handles observation annotations.
  • on_bottom_tag_handler -- Handles updates on the state of the platform: on or off the bottom.

Tag Data Handlers

Tag data handlers expect to handle or process data associated with the tag.

The current list of tag data handlers is:

  • habitat_tag_data_handler -- Applies habitat data to the habitat event context from the input.
  • observation_tag_data_handler -- Applies observation data to the habitat event context from the input.
  • status_tag_data_handler -- Applies status event data to the habitat event context from the input.