GBDS Architecture

Overview

GBDS is the ABIS (Automated Biometric Identification System) component of the Griaule Biometric Suite. It is built on the Apache Hadoop 3 platform and uses several of its tools and components (such as Kafka, Zookeeper, Ambari, HBase and HDFS) to implement a scalable, distributed and fault-tolerant ABIS.

GBDS is responsible for:

  • Storage: Persistence of biometric records, client requests and their responses, transaction results and logs in a scalable, distributed and fault-tolerant database.

  • Extraction: Processing raw capture data (images) from biometric captures to generate templates that will be used for biometric matching.

  • Matching: Efficient execution of biometric template comparisons with load distribution across multiple servers (nodes).

  • Request Processing: GBDS receives client requests, manages their execution across the system nodes and notifies clients asynchronously when responses become available. Requests are made through HTTP/HTTPS APIs. GBDS implements a High Availability, High Throughput endpoint for client requests.

This document describes the architecture of GBDS.

Hadoop Components

Apache Hadoop 3 is a collection of open-source tools and components for developing distributed systems. Hadoop is based on Java, a technology present in more than 13 billion devices. Hadoop development began in 2006, and it soon became the de facto standard for fault-tolerant distributed systems with high availability. By 2013, Hadoop was already used in more than half of the Fortune 50 companies.

GBDS uses several tools and components from the Hadoop 3 ecosystem:

  • HDFS is a distributed, scalable and portable file system. HDFS provides transparent distribution of data across storage nodes and efficient storage of large files and large collections of files.

  • HBase is a distributed non-relational database built on top of the HDFS file system.

  • Zookeeper is a distributed key-value repository, and is used by GBDS as a consensus manager.

  • Kafka is a distributed task queue processing platform. The queues managed by Kafka are called topics, which have content enqueued by producers and processed by consumers. Kafka efficiently distributes topic workloads across available nodes.

  • Ambari is a monitoring and management tool for Hadoop clusters. GBDS administrators interact with and manage their clusters through Ambari.

Nodes

A GBDS cluster is composed of nodes, and each node runs the same components: HBase, Zookeeper, Ambari, Kafka and the GBDS Node Subsystem. All nodes can receive requests from external client applications, and all nodes can send asynchronous notifications to external client applications.

The diagram below shows the general structure of a GBDS cluster, which is a collection of GBDS nodes:

GBDS uses two distinct databases: MySQL and HBase. Biometric templates are stored in HBase, and data is distributed in subsets called regions. Each node is responsible for at least one region. Biometric data is transparently distributed into regions by HBase, based on the hardware capacity of each node.

The MySQL database stores metadata about transactions, biometric exceptions, criminal cases, registered biographical profiles and unresolved latents. This metadata references the corresponding HBase records, which in turn store the data required for processing, such as images and templates.

One node acts as the Leader Node. This node initializes the cluster and partitions biometric data among the available GBDS nodes. The Leader Node is chosen automatically by Zookeeper.

The Kafka component of the cluster manages topics (queues) for pending tasks (to be processed by the cluster) and results (to be delivered asynchronously to clients or GBDS components when tasks are completed). GBDS has multiple topics for pending tasks, one for each priority level. Tasks in higher priority topics are always consumed before tasks in lower priority topics. GBDS has 8 priority levels: Lowest, Lower, Low, Default, High, Higher, Highest and Maximum. Client applications cannot use the Maximum priority, which is reserved for GBDS internal operations. In this manual the set of topics for pending tasks is represented as a single entity.
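As an illustrative sketch (the topic names and helper below are assumptions, not the actual GBDS implementation), the per-priority topic layout and the client-side restriction on Maximum could be modeled as:

```python
# Sketch of GBDS-style per-priority pending-task topics.
# Topic naming and the rejection rule are illustrative assumptions.

PRIORITIES = ["Lowest", "Lower", "Low", "Default",
              "High", "Higher", "Highest", "Maximum"]

def topic_for(priority: str, client_request: bool = True) -> str:
    """Return the pending-tasks topic for a priority level.

    Maximum is reserved for GBDS internal operations, so client
    requests asking for it are rejected.
    """
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    if client_request and priority == "Maximum":
        raise PermissionError("Maximum priority is reserved for GBDS internals")
    return f"pending-tasks-{priority.lower()}"
```

For example, `topic_for("Default")` yields `pending-tasks-default`, while `topic_for("Maximum")` is only accepted when the request originates inside GBDS (`client_request=False`).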

The diagram below shows the components running on each node:

The GBDS node subsystem is responsible for the ABIS logic. It acts both as a producer and consumer of Kafka topics, and both as a consumer of client requests and producer of notifications sent to clients.

GBDS Node Subsystem

The GBDS Node Subsystem implements the specific flows for ABIS operation. It has 3 main internal modules: the API Module, the Master Module, and the Notification Module. Each of these modules can be started and stopped independently on each node.

The diagram below shows the internal architecture of the GBDS component and the interactions between its parts:

  • When a client request is received by the API Module, it is either resolved locally or submitted to the Kafka Pending Tasks topic of the appropriate priority.

  • The Master Module is responsible for managing fault tolerance, distributing and loading database data into RAM at boot time, and processing distributed biometric tasks. It continuously consumes Pending Tasks items that involve distributed processing.

  • When a client request is completed, the results are consolidated by a specific node, which submits the results to the Results topic in Kafka. The consolidation node for each transaction is determined by a hash function of the transaction's unique identifier, which distributes global consolidation tasks uniformly across the cluster nodes.

  • The Notification Module is responsible for consuming items from the Kafka Results topic and sending asynchronous notifications to external client applications. The notification module is a singleton and can be active on only one node, chosen by the system administrator.

API Module

The API Module has a main component, the API Handler, which receives HTTP/HTTPS requests from external client applications and can 1) process them locally; or 2) prepare a transaction for distributed processing and enqueue it in a Kafka Pending Tasks topic with the appropriate priority, so it can be processed by the entire cluster.

The API Handler is responsible for performing biometric template extraction. If an incoming request has raw biometric data (i.e., images instead of templates), this component launches processes and/or threads for biometric extraction on the local node to generate the corresponding biometric templates. The choice of processes or threads depends on the biometric modality.

Enrollment and Identification (1:N) transactions are enqueued to a Kafka Pending Tasks topic to be processed in a distributed manner across the cluster.

Other transactions are processed locally by the API Handler. Any templates required for these operations are extracted locally or retrieved from HBase, and responses to clients are sent synchronously: the Kafka Results topic and the Notification Module are not involved in the operation.

These locally processed transactions can be:

  • Verification (1:1): The API Handler extracts the biometric template for the query (if sent as an image), retrieves the reference template from HBase, performs the biometric comparison locally, and responds directly to the client.

  • Update: The API Handler updates the biographical and/or biometric data directly in HBase, and enqueues a Pending Task item in Kafka on the Maximum priority queue to force cluster nodes that have the affected profile in RAM to update their local records before starting to process new tasks of priority lower than Maximum.

  • Delete: The API Handler deletes the record from HBase, and enqueues a Pending Task item in Kafka on the Maximum priority queue, forcing cluster nodes that have the affected profile in RAM to update their local records before starting to process new tasks with priority lower than Maximum.

  • Exception Handling: The API Handler updates the exception record in HBase.

  • Quality Handling: The API Handler updates the transaction record in HBase.

  • Get, List: These are read-only requests to obtain record or transaction data. The API Handler retrieves the requested data from HBase and responds directly to the client application.
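The routing decision described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function and type names are hypothetical, and the special case for Update follows the rule (described later in this document) that an Update inserting new biometric data is processed like an Enrollment.

```python
# Sketch of the API Handler's routing decision. Names are illustrative,
# not the actual GBDS API.

DISTRIBUTED = {"Enrollment", "Identification"}
LOCAL = {"Verification", "Update", "Delete",
         "Exception Handling", "Quality Handling", "Get", "List"}

def route(transaction_type: str, adds_new_biometrics: bool = False) -> str:
    """Decide whether a transaction is enqueued for distributed
    processing or resolved locally by the API Handler."""
    # An Update that inserts new biometric data follows the Enrollment
    # flow, since the database must be searched for duplicates.
    if transaction_type == "Update" and adds_new_biometrics:
        return "enqueue"
    if transaction_type in DISTRIBUTED:
        return "enqueue"
    if transaction_type in LOCAL:
        return "local"
    raise ValueError(f"unknown transaction type: {transaction_type}")
```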

Master Module

This Module is responsible for initializing (booting) the GBDS node, managing the cluster state (for example, redistributing cluster load when a node fails), and for processing biometric transactions.

Node Manager

This component reads configuration files, starts the other components, actively monitors the other nodes in the cluster, and decides how to redistribute biometric data when nodes fail.

Boot Manager

This component is responsible for loading biometric templates from HBase into RAM. Efficient biometric matching requires templates to be present in RAM. Loading and indexing templates into RAM is a long task but, once completed, ensures fast transaction processing.

Task Processing Flow

The other components of the Master Module perform the distributed biometric matching.

The Task Consumer continuously consumes items from the Kafka Pending Tasks topics. It always consumes a task from the highest priority non-empty topic.
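That consumption policy (strict priority: always drain the highest-priority non-empty queue first) can be sketched with plain in-memory queues standing in for the Kafka topics; all names here are illustrative:

```python
# Sketch of the Task Consumer's strict-priority policy.
# Plain deques stand in for the per-priority Kafka topics.
from collections import deque

PRIORITIES = ["Maximum", "Highest", "Higher", "High",
              "Default", "Low", "Lower", "Lowest"]  # highest first

def next_task(queues: dict):
    """Return the next pending task from the highest-priority
    non-empty queue, or None if every queue is empty."""
    for level in PRIORITIES:
        q = queues.get(level)
        if q:
            return q.popleft()
    return None
```

With this policy, a Maximum-priority item (e.g. a forced RAM refresh after an enrollment) is always consumed before any lower-priority work, which is exactly what guarantees nodes see database changes before processing new tasks.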

The Matcher Supervisor manages the biometric matcher processes and/or threads (the choice depends on the biometric modality) and performs the biometric comparison operations between query templates (from the transaction being processed) and reference templates (from the biometric database, loaded into the local node's RAM). Biometric template matching is not a trivial operation and involves complex algorithms.

The Consolidation Supervisor organizes the results generated by the matchers and sends them to the Global Consolidator responsible for the current transaction, which may be running on another cluster node.

The Global Consolidator receives matching results from all nodes that contributed to processing the task/transaction and generates consolidated matching results. Each task/transaction is consolidated on a single node, deterministically chosen by a hash function over the transaction/profile unique identifier. This hash function distributes global consolidation tasks uniformly across the available cluster nodes.
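A deterministic hash-based assignment like the one described above can be sketched as follows. This is an assumption-labeled illustration (GBDS's actual hash function is not specified here); the key property is that every node, given the same transaction identifier and node list, computes the same answer, which is why a stable hash is used instead of Python's per-process salted `hash()`:

```python
# Sketch of deterministic consolidation-node selection by hashing the
# transaction's unique identifier. The hash choice is illustrative.
import hashlib

def consolidation_node(transaction_id: str, nodes: list) -> str:
    """Map a transaction id onto exactly one node, identically on
    every node in the cluster."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

Because SHA-256 output is effectively uniform, transaction identifiers spread consolidation work evenly across the node list.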

Some transactions, such as latent searches in forensic systems, require an additional matching operation to refine and/or reorder results, called Post-matching. The Post-Matching Supervisor manages the processes/threads for such cases, and is also executed only on the global consolidation node assigned to the transaction.

The Commit Handler receives the final results from the Global Consolidator or the Post-Matching Supervisor and definitively applies the transaction results:

  • All changes to the biometric database state are applied in HBase.

  • If the transaction results require any cluster nodes to update their local RAM data (e.g., a new person is added to the database as a result of an enrollment transaction), an item is enqueued to the Kafka Pending Tasks topic with Maximum priority.

  • An item is enqueued to the Kafka Results topic, which will be sent to the client application by the Notification Module.
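The three commit steps above can be sketched as a single function, with in-memory stand-ins for HBase and the two Kafka topics; all names and the result-dictionary shape are illustrative assumptions:

```python
# Sketch of the Commit Handler's three steps. A dict stands in for
# HBase and lists stand in for the Kafka topics; names are illustrative.

def commit(result: dict, hbase: dict, pending_max: list, results_topic: list):
    """Definitively apply a consolidated transaction result."""
    # 1) Apply all biometric database state changes (HBase stand-in).
    hbase.update(result.get("db_changes", {}))
    # 2) If nodes must refresh their in-RAM templates (e.g. a new
    #    enrollment), enqueue a Maximum-priority pending task.
    if result.get("requires_ram_refresh"):
        pending_max.append({"op": "refresh", "tx": result["tx_id"]})
    # 3) Hand the final status to the Notification Module via the
    #    Results topic.
    results_topic.append({"tx": result["tx_id"], "status": "completed"})
```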

Notification Module

This module has a main component, the Notification Handler, which continuously consumes items from the Kafka Results topic and sends HTTP/HTTPS notifications to client applications, asynchronously informing them of the status of processed transactions.

The Notification Module is a singleton, and stays active on only one cluster node. The node that runs the Notification Module is chosen by the system administrator.

Transaction Flows

This section illustrates how each type of transaction is processed by GBDS.

Identification (1:N)

In an identification transaction (1:N search), the client wishes to search the biometric database for matches to a given query biometric. The entire biometric database may need to be traversed. The API Handler receives the request on the node to which it was sent. If the query contains raw biometric data (images), its templates are extracted by the API Handler on that local node. The transaction is then enqueued to a Kafka Pending Tasks topic.

All nodes in the cluster will eventually consume the item from the topic (via the Task Consumer) and perform their part of the biometric search (via the Matcher Supervisor and Consolidation Supervisor). Each node's results are sent to the Global Consolidator on the global consolidation node, determined by the transaction identifier.

On the global consolidation node, the Global Consolidator waits until the cluster completes the search operation and consolidates the final results. Post-matching is performed, if necessary (by the Post-Matching Supervisor), and the Commit Handler applies the results to HBase and enqueues an item to the Kafka Results topic.

The singleton Notification Module, running on the Notification node, will eventually consume the associated item from the Kafka Results topic and send an asynchronous notification to the client application, informing the completion of the transaction.

Enrollment

In an Enrollment transaction, the client requests the insertion of a new person into the database, provided the biometric data are not duplicates of any existing record. The flow of this transaction is very similar to the Identification operation, since it involves a 1:N search for records with duplicate biometrics. Because the transaction requires all nodes to update their templates in memory (to recognize the presence of the new person in the database), the Commit Handler will enqueue a new item to Kafka's Maximum priority Pending Tasks topic, forcing all nodes to update. If the transaction generates an exception that requires manual review, it will remain suspended until the exception is handled by another transaction.

This flow is also executed when an Update transaction adds new biometric data to an existing record.

Verification (1:1), Get, List

In a Verification transaction, the client wants to verify whether a query biometric matches a specific person present in the database. This transaction is processed by the API Module on the same node that receives the request. The API Module retrieves the person's biometric templates from HBase, performs the biometric matching locally and responds to the client synchronously.

Get and List are read-only transactions to retrieve data and/or results from GBDS. They are processed locally by the API Module, which retrieves the data from HBase and responds to the client synchronously.

Update, Delete

In an Update transaction, the client wants to change the biographical and/or biometric data of an existing record. If new biometric data are inserted, the transaction follows the flow of an Enrollment, because the database needs to be searched for duplicate biometrics. Otherwise, the API Module performs any necessary template extractions, updates HBase, responds to the client synchronously and enqueues an item to the Kafka Pending Tasks topic with Maximum priority to force all cluster nodes to recognize the changes made.

In a Delete transaction, the client wishes to remove a person from the database. The API Module performs the removal from HBase and responds to the client synchronously. The module also enqueues an item to the Kafka Pending Tasks topic with Maximum priority to force all cluster nodes to recognize the changes made.

Exception Handling, Quality Handling

GBDS manages Exception and Quality Control items. Exceptions are generated when Enrollment transactions find suspected duplicates, or when Update transactions find discrepancies between query biometrics and reference biometrics; Quality Control items are generated when low-quality biometric data are inserted. Depending on GBDS configuration, these items require manual review. Exception Handling and Quality Handling transactions update the status of pending Exception and Quality Control items. The API Handler processes these transactions locally, updates their status in HBase and responds to the client synchronously.
