Identity Service Core Module
The Identity Service stores information about the operational entities involved in scientific experiments. For model organisms, for example, the service allows all data related to an individual to be tied back to a single, unique identifier. This is a difficult problem, and three related challenges contribute to it:
- standard practice is to assign locally unique IDs that don't carry over from one assay to the next; this ambiguity in ID assignment adds unnecessary variability and makes small effect sizes even harder to detect
- a set of attributes sometimes points to one and only one individual and sometimes does not, as with a set of identical mice
- we want to be able to create new IDs in a distributed environment, and the IDs should be readable, spellable, and pronounceable for manual verification
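One way to approach the third challenge is to build IDs from random consonant-vowel syllables, so that each node can create IDs without coordinating with the others. The sketch below is only an illustration: the alphabets, the default length, and the names `new_id` and `id_bits` are assumptions, not part of the service.

```python
import math
import secrets

# Assumed alphabets: the consonants c, q, w, x and y are left out because they
# are easy to confuse when read aloud. The exact choice is illustrative.
CONSONANTS = "bdfghjklmnprstvz"
VOWELS = "aeiou"


def new_id(syllables: int = 8) -> str:
    """Return a random ID built from consonant-vowel syllables."""
    if syllables < 1:
        raise ValueError("syllables must be at least 1")
    parts = [
        secrets.choice(CONSONANTS) + secrets.choice(VOWELS)
        for _ in range(syllables)
    ]
    # Group the syllables in pairs and join the groups with hyphens so the
    # ID is easier to read and to spell out.
    groups = ["".join(parts[i:i + 2]) for i in range(0, syllables, 2)]
    return "-".join(groups)


def id_bits(syllables: int = 8) -> float:
    """Bits of randomness carried by an ID with the given number of syllables."""
    return syllables * math.log2(len(CONSONANTS) * len(VOWELS))
```

Random generation does not guarantee uniqueness. The collision risk depends on `id_bits` and on how many IDs are issued, so a uniqueness check when an ID is registered would still be prudent.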
Measure of identifiability
How can we tell how good the system is at uniquely identifying individuals?
A measure of identifiability is a quantitative assessment of how difficult it would be to distinguish similar entities based on their attributes. To serve our aims, the measure itself must be both accurate and precise.
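As a rough illustration, one candidate measure is the fraction of entities whose recorded attribute values match no other entity. This is a sketch under that assumption, not a definition adopted by the service; the function name `identifiability` and the attribute names in the usage note are hypothetical.

```python
from collections import Counter
from typing import Hashable, Mapping, Sequence


def identifiability(
    entities: Sequence[Mapping[str, Hashable]],
    attributes: Sequence[str],
) -> float:
    """Fraction of entities that are unique on the given attributes.

    1.0 means every entity can be told apart; 0.0 means none can.
    """
    if not entities:
        raise ValueError("at least one entity is required")
    # Reduce each entity to the tuple of values it has for the chosen attributes.
    keys = [tuple(entity.get(name) for name in attributes) for entity in entities]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(entities)
```

For four mice that share strain, coat color, and sex, `identifiability(mice, ["strain", "coat", "sex"])` is 0.0; adding an attribute that differs for every mouse, such as an ear notch, raises it to 1.0.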
The identity service aims to record a set of attributes that uniquely identifies an entity of interest. Attributes fall into four categories: physical-static, physical-dynamic, spatio-temporal, and administrative. The physical-static attributes remain fixed for the life of the entity. All other categories of attributes are dynamic and must be time-indexed.
The physical and spatio-temporal attributes apply only to real-world entities such as study subjects (called "sources" in the ISA model) or samples. All non-physical entities must be identified based on administrative attributes alone.
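The four categories and the time-indexing rule could be modeled along the following lines. This is a minimal sketch: the class names, field names, and the `value_as_of` helper are hypothetical, and it assumes timezone-naive datetimes throughout.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Iterable, Optional


class AttributeCategory(Enum):
    PHYSICAL_STATIC = "physical-static"
    PHYSICAL_DYNAMIC = "physical-dynamic"
    SPATIO_TEMPORAL = "spatio-temporal"
    ADMINISTRATIVE = "administrative"


@dataclass(frozen=True)
class Attribute:
    name: str
    value: str
    category: AttributeCategory
    recorded_at: Optional[datetime] = None  # required for dynamic categories

    def __post_init__(self) -> None:
        # Only physical-static attributes may omit the timestamp.
        is_static = self.category is AttributeCategory.PHYSICAL_STATIC
        if not is_static and self.recorded_at is None:
            raise ValueError(f"{self.category.value} attributes must be time-indexed")


def value_as_of(
    attributes: Iterable[Attribute], name: str, when: datetime
) -> Optional[str]:
    """Return the latest value of `name` recorded at or before `when`, if any."""
    candidates = [
        a for a in attributes
        if a.name == name and (a.recorded_at is None or a.recorded_at <= when)
    ]
    if not candidates:
        return None
    # Untimed (static) attributes sort first, so a timed record wins if present.
    latest = max(candidates, key=lambda a: a.recorded_at or datetime.min)
    return latest.value
```

Keeping every dynamic record, rather than overwriting the previous value, is what makes it possible to ask where an entity was, or how it was marked, at the time a given assay was run.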
Physical-static attributes
Real-world entities, e.g., model organisms or organoids, possess descriptive characteristics that are part of their very make-up. If one of these attributes is changed, then we assume that the individual is no longer the same. For a tissue sample, the tissue type is immutable. For a research mouse, the immutable attributes include strain, coat color, sex, and genetic sequence. The identity service will not store individually identifying information for human subjects.
Physical-dynamic attributes
In many situations, the static physical attributes are not sufficiently available or distinguishable to allow for unique identification. In a cage containing four C57BL/6J mice, the black coat color of one mouse is not distinguishable from that of the other three. However, it is common for supplementary marks to be used to distinguish between the few individuals in a cage. These might include ear notches, an implanted radio-frequency identification (RFID) tag, or a tattoo. Even though these marks can be permanent, we consider them to be transitory because they may need to be changed or may heal over.
In general, physical-dynamic marks must be re-verified when a live subject is transferred from one laboratory to another, so that assay data from the new lab can be associated with that individual's data from the first lab.
Spatio-temporal attributes
The location of a physical entity is a coarse-grained filtering mechanism for identification. At a given time, in a given place, only so many physical entities of a given kind can be present. Spatial attributes could include one or more of the following: postal address, room number, and cage number.
Administrative attributes
The range of possible administrative attributes is vast. For a given class of entities, however, the number of attributes should be minimal. These filter the entities in a similar way to spatio-temporal attributes. Administrative attributes might include: lab investigator, project number, purchase source, study name, and lab-assigned identifiers.
Rules for unique identification
Which of the attribute classes are necessary for unique identification? How many of these are sufficient? Is there an algorithm that achieves unique identification every time for a given class of entities?
Database schema options
- single table for holding every identity type: the table would include every type of attribute
  a) Drawbacks
     - this would be a slow option and would take a lot of processing time
     - it would be difficult to track specific attributes over time
  b) Benefits
     - this is a very simple option
- specific tables for each type, with all attributes contained: types include human/mouse/non-physical identities, and each type's table holds its attributes
  a) Drawbacks
     - difficult to filter by specific attributes without affecting processing time
     - depending on whether the attributes within each type are sets, it may be difficult to access data over time
  b) Benefits
     - easier to filter based on what the user is attempting to identify (human, mouse, object) without processing all data
     - gives the data model some structure
- no specific tables, only key/value pairs associated with an identity
  a) Drawbacks
     - could run into attribute names that differ only by case (e.g., "Age" and "age")
     - not a lot of structure; it would be one large table, similar to the single-table option
     - very difficult to track over time, which could cause many human-made errors
  b) Benefits
     - no constraints
- "type" tables with minimal attributes, referenced by key/value pairs that each carry a datetime, i.e., static attributes + dynamic attributes
  a) Drawbacks
     - could have case-sensitivity issues with attribute names
  b) Benefits
     - great for tracking changes over time
     - allows the user some variability in attributes (a new attribute would be easy to create)
     - strikes a good balance between structure and flexibility
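The last option could look roughly like the sketch below. It uses Python's standard `sqlite3` module rather than the Django ORM so that it runs on its own. The table names, column names, and the `record` and `history` helpers are all hypothetical, and lowercasing attribute names is just one possible way to soften the case-sensitivity drawback.

```python
import sqlite3

# A "type" table holds the static attributes; a key/value table holds the
# dynamic ones, each stamped with the time it was recorded.
SCHEMA = """
CREATE TABLE mouse (
    id     TEXT PRIMARY KEY,
    strain TEXT NOT NULL,
    sex    TEXT NOT NULL
);
CREATE TABLE mouse_attribute (
    mouse_id    TEXT NOT NULL REFERENCES mouse(id),
    name        TEXT NOT NULL,
    value       TEXT NOT NULL,
    recorded_at TEXT NOT NULL,  -- ISO 8601 text, which sorts chronologically
    PRIMARY KEY (mouse_id, name, recorded_at)
);
"""


def normalize(name: str) -> str:
    """Fold case and trim whitespace so "Age" and "age" are the same attribute."""
    return name.strip().lower()


def record(conn: sqlite3.Connection, mouse_id: str, name: str,
           value: str, recorded_at: str) -> None:
    """Append a new dynamic value; earlier values are kept as history."""
    conn.execute(
        "INSERT INTO mouse_attribute VALUES (?, ?, ?, ?)",
        (mouse_id, normalize(name), value, recorded_at),
    )


def history(conn: sqlite3.Connection, mouse_id: str, name: str) -> list:
    """Return (recorded_at, value) pairs for one attribute, oldest first."""
    rows = conn.execute(
        "SELECT recorded_at, value FROM mouse_attribute "
        "WHERE mouse_id = ? AND name = ? ORDER BY recorded_at",
        (mouse_id, normalize(name)),
    )
    return rows.fetchall()
```

Because rows are only ever appended, a change of cage or a re-verified ear notch shows up as a new row, and the full sequence of values remains available.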
Query use cases
The primary use-case query is to search for a particular type of entity (mouse, study object, assay object) based on a subset of the object's attributes. Therefore, this query should be fast.
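A sketch of this query against a key/value layout is shown below, again using the standard `sqlite3` module so that it runs on its own. The `entity` and `attribute` tables, the index, and the helpers `demo_connection` and `find_entities` are assumptions made for illustration; the sketch stores one current value per attribute and ignores history.

```python
import sqlite3
from typing import Mapping


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with the assumed tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (id TEXT PRIMARY KEY, type TEXT NOT NULL);
        CREATE TABLE attribute (
            entity_id TEXT NOT NULL REFERENCES entity(id),
            name      TEXT NOT NULL,
            value     TEXT NOT NULL,
            PRIMARY KEY (entity_id, name)
        );
        -- Index the lookup columns so attribute matching does not scan the table.
        CREATE INDEX attribute_name_value ON attribute (name, value);
    """)
    return conn


def find_entities(conn: sqlite3.Connection, entity_type: str,
                  criteria: Mapping[str, str]) -> list:
    """Return IDs of entities of one type that match every name/value pair."""
    if not criteria:
        raise ValueError("at least one attribute is required")
    # Only placeholders are interpolated into the SQL; values are bound as parameters.
    pair_clause = " OR ".join("(a.name = ? AND a.value = ?)" for _ in criteria)
    params = [entity_type]
    for name, value in criteria.items():
        params += [name, value]
    params.append(len(criteria))
    sql = f"""
        SELECT e.id
        FROM entity e JOIN attribute a ON a.entity_id = e.id
        WHERE e.type = ? AND ({pair_clause})
        GROUP BY e.id
        HAVING COUNT(*) = ?
        ORDER BY e.id
    """
    return [row[0] for row in conn.execute(sql, params)]
```

An entity is returned only when the number of matching attribute rows equals the number of criteria, which is how the query expresses "matches every attribute in the subset".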
If we want to use the Django ORM, then CRUD operations should be standard for a relational model, with nothing exotic.