Data integrity checks

OpenAtlas puts great emphasis on data quality. Even though responsibility for the quality of the entered information ultimately lies with the individual projects, avoiding inconsistencies on a technical level is an important part of developing the application. It is therefore not possible, for example, to enter a start date for an event that is later than the end date of the same event.

Nevertheless, mistakes happen, not only at the application level but also when importing data from other projects or deleting files outside of the application. Therefore, functions to check for possible inconsistencies were implemented; they are described in detail below.

Similar names

This test will search for similar entity names. Depending on selection and data volume this might take some time. The following options are given:

  • Classes - select the class in which to search for similar names

  • Ratio - select how similar the names should be; 100 is the default and means the names are 100% identical

The function uses the fuzzywuzzy package, which is based on the Levenshtein distance.
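
As a rough illustration of how a ratio based on the Levenshtein distance can work, the following Python sketch compares all names pairwise. It is not the fuzzywuzzy implementation and its scores may differ from those of the package; the function names and the scaling to a value between 0 and 100 are assumptions made for this example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Scale the distance to 0-100, where 100 means identical (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def similar_names(names: list[str], ratio: int = 100) -> list[tuple[str, str, int]]:
    """Return all pairs of names whose similarity is at least `ratio`."""
    pairs = []
    for first, second in combinations(names, 2):
        score = similarity(first, second)
        if score >= ratio:
            pairs.append((first, second, score))
    return pairs


print(similar_names(["Vienna", "Viena", "Graz"], ratio=80))
```

In a pairwise approach like this one, the number of comparisons grows quadratically with the number of names, which is one reason such a check can take some time on large selections.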

Orphans

This function is used to find entities with missing connections. The result is shown in the following tabs:

Orphans

Entries shown here have no relation to other entities. This can be a normal part of the data set, but it should be checked whether it is correct. An unlinked entity could be an artifact of an import or might not have been linked by accident.
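
Conceptually, an orphan is an entity that appears in no link at all. The following sketch shows the idea with plain Python sets; the representation of links as pairs of entity IDs and the function name are hypothetical and do not reflect the actual OpenAtlas data model.

```python
def find_orphans(entity_ids: set[int], links: list[tuple[int, int]]) -> set[int]:
    """Return the IDs of entities that take part in no link.

    `links` is a hypothetical list of (domain_id, range_id) pairs.
    """
    linked: set[int] = set()
    for domain_id, range_id in links:
        linked.add(domain_id)
        linked.add(range_id)
    return entity_ids - linked


# Entity 4 is not referenced by any link and is reported as an orphan
print(find_orphans({1, 2, 3, 4}, [(1, 2), (2, 3)]))
```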

Type

The types listed here were created but have no sub types or associated data. They might have been pre-installed before teams started entering information, or they were created and then never used.

Missing files

File entities without a corresponding file are listed here. This is (most likely) caused by the deletion of files from the dataset.

Orphaned files

Files without a corresponding file entity are listed here.
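
The difference between missing files and orphaned files can be pictured as a two-way comparison of file entities and stored files. The sketch below assumes, purely for illustration, that each stored file is named after the ID of its file entity (for example 12.jpg); the function name and this naming scheme are assumptions, not a description of the OpenAtlas internals.

```python
from pathlib import PurePath
from typing import Iterable


def check_files(
        entity_ids: set[int],
        file_names: Iterable[str]) -> tuple[set[int], set[str]]:
    """Return (missing files, orphaned files).

    Missing files: file entity IDs without a stored file.
    Orphaned files: stored files without a file entity.
    Assumes the hypothetical naming scheme "<entity id>.<extension>".
    """
    found: set[int] = set()
    orphaned: set[str] = set()
    for name in file_names:
        stem = PurePath(name).stem
        if stem.isdecimal() and int(stem) in entity_ids:
            found.add(int(stem))
        else:
            orphaned.add(name)
    return entity_ids - found, orphaned


missing, orphaned = check_files({1, 2, 3}, ["1.jpg", "2.png", "99.pdf"])
print(missing)   # file entities without a file
print(orphaned)  # files without a file entity
```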

Orphaned IIIF files

IIIF files without a corresponding entity are listed here.

Orphaned annotations

Annotations that are linked to an entity are listed here if the annotated file and that entity are not linked to each other. There are three options to proceed:

  • Relink entity: Adds a link between file and entity

  • Remove entity: Removes the entity from the annotation

  • Delete annotation: Deletes the whole annotation

Orphaned subunits

Subunits without a link to the level above are listed here, e.g. a feature with no connection to a place.

Dates

This view lists entities with invalid or inconsistent dates.

Invalid dates

In this tab invalid date combinations are shown, for example begin dates that are later than end dates. These issues should be fixed, otherwise the user interface won’t allow these entities to be updated.

Invalid involvement dates

Incompatible dates for involvements are shown. Example: A person participated in an event for longer than the event lasted.

Invalid preceding dates

Here, incompatible dates for chained events are listed. Example: A preceding event starts after the succeeding event.

Invalid sub dates

This tab shows incompatible dates for hierarchical events. Example: A sub event begins before the super event began.
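
The four date checks of this view can be expressed as simple comparisons. The following sketch uses a hypothetical Span class with one optional begin and one optional end date; the class, the function names and the exact conditions are assumptions derived from the examples above, not the queries OpenAtlas actually runs. Note that datetime.date cannot represent years before year 1, so this is only suitable as an illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Span:
    """Hypothetical begin/end pair; None means the date is unknown."""
    begin: Optional[date] = None
    end: Optional[date] = None


def later(a: Optional[date], b: Optional[date]) -> bool:
    """True only if both dates are known and a is after b."""
    return a is not None and b is not None and a > b


def invalid_dates(item: Span) -> bool:
    # Begin date is later than end date
    return later(item.begin, item.end)


def invalid_involvement(involvement: Span, event: Span) -> bool:
    # Participation starts before the event or ends after it
    return (later(event.begin, involvement.begin)
            or later(involvement.end, event.end))


def invalid_preceding(preceding: Span, succeeding: Span) -> bool:
    # The preceding event starts after the succeeding event
    return later(preceding.begin, succeeding.begin)


def invalid_sub(sub: Span, super_event: Span) -> bool:
    # The sub event begins before the super event began
    return later(super_event.begin, sub.begin)


war = Span(date(1914, 7, 28), date(1918, 11, 11))
service = Span(date(1913, 1, 1), date(1917, 1, 1))
print(invalid_involvement(service, war))
```

Unknown dates are treated as compatible here, so only pairs where both dates are present can be reported.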

Check files

In this section, all files are checked for completeness and consistency. Further information about file entities can be found in the manual under File.

Missing information

  • No creator: Files without a creator entered.

  • No license holder: Files without a license holder entered.

  • Not public: Files that are not publicly accessible.

  • No license: Files where no license was assigned.

File integrity

  • Missing files: A file entity was created but has no associated file.

  • Duplicated files: Lists all files that share the same SHA value. In such cases, the files themselves are duplicates, but their metadata entries differ.
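
Duplicate detection by hash value can be sketched as grouping files by their digest. The example below uses SHA-256 from the Python standard library as an assumption, since the variant of SHA is not specified here; the function name and the in-memory mapping of file names to contents are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents have the same SHA-256 digest.

    `files` maps a file name to its content; in practice the bytes
    would be read from disk.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Only digests shared by more than one file indicate duplicates
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


print(find_duplicates({
    "1.jpg": b"same bytes",
    "2.jpg": b"same bytes",
    "3.jpg": b"other bytes"}))
```

Because only the file content is hashed, two entries with different names or metadata are still reported together when their files are byte-for-byte identical.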