Spiking Neural Networks for Recognizing Actions by Spiking Camera


Introduction

Motivation

Computer vision has progressed rapidly in the past few years thanks to the development of machine learning, enabling systems to handle complex tasks, sometimes surpassing human capabilities. However, further development raises two critical problems: high energy consumption and dependence on supervised algorithms, which necessitates large amounts of labeled data for training. These challenges result in high costs associated with utilizing computer vision for complex problems. Hence, there is an urgent need for low-energy and unsupervised models.

Spiking cameras have garnered attention due to their bio-inspired paradigm and low power consumption advantages. Unlike traditional cameras capturing complete images, spiking cameras capture events, representing changes in pixel brightness. This event-based approach significantly reduces memory usage and energy consumption.

Goals

This manuscript presents research on Spiking Neural Networks (SNNs) for action recognition using spiking cameras, focusing on low energy consumption and unsupervised models. The primary objective is to convert spiking videos into frames directly usable as input for the SNN model developed by our team.

Spiking Camera

Normal Camera

Traditional cameras, whether CMOS sensors, CCD sensors, or RGBD cameras, capture images at a constant frequency, resulting in inherent delays. Motion blur can occur due to object movement within the exposure time. Additionally, normal cameras have a limited dynamic range and capture redundant information for each frame, leading to efficiency issues.

Spiking Camera Overview

A spiking camera, also known as an event camera, captures events comprising a timestamp, pixel coordinates, and polarity, representing changes in brightness. Spiking videos are streams of events captured by spiking cameras, providing an asynchronous and sparse representation of the scene.

Principles of Spiking Camera

  • Brightness Change: Spiking cameras output events based on brightness changes, not absolute brightness.
  • Threshold: Events are generated when brightness changes reach a certain threshold, an inherent parameter of the camera.
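
These two principles can be written compactly using the standard event-camera model (a general formalization, not specific to any particular sensor). Let $L(x, y, t)$ be the log brightness at pixel $(x, y)$ and $t_{\text{last}}$ the timestamp of the previous event at that pixel. An event is emitted as soon as

$$
\left| L(x, y, t) - L(x, y, t_{\text{last}}) \right| \geq C,
\qquad
p = \operatorname{sign}\left( L(x, y, t) - L(x, y, t_{\text{last}}) \right)
$$

where $C > 0$ is the contrast threshold and $p$ is the polarity of the emitted event.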

Relationship with SNN

Spiking cameras’ event-based representation aligns well with spiking neural networks (SNNs), which use discrete spikes to transmit information, mimicking biological neural systems. SNNs are energy-efficient and excel at processing spatio-temporal information, making them suitable for tasks involving spiking cameras.

Spiking Dataset

Nature of Spiking Data

Spiking data is an event stream captured by a spiking camera. Each event, denoted as $e_i$, is described by the tuple $[t_i, x_i, y_i, p_i]$, where $i \in \{1, 2, \dots, n\}$ indexes the $n$ events in the stream (a concrete data layout is sketched after this list). Here:

  • $t_i \geq 0$ is the timestamp of the event.
  • $(x_i, y_i) \in \{1, 2, \dots, N\} \times \{1, 2, \dots, M\}$ represents the pixel coordinates.
  • $p_i \in \{-1, 1\}$ denotes the polarity, where -1 and 1 represent OFF (brightness decrease) and ON (brightness increase) events, respectively.
  • $N$ and $M$ are the dimensions of the pixel grid.
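
As a concrete illustration, an event stream with this layout can be held in a NumPy structured array. This is only a sketch: the field names and the microsecond time unit are my own choices, not a fixed standard.

```python
import numpy as np

# One record per event: timestamp, pixel coordinates, polarity.
event_dtype = np.dtype([("t", np.int64),   # timestamp (e.g., microseconds)
                        ("x", np.int16),   # column index
                        ("y", np.int16),   # row index
                        ("p", np.int8)])   # polarity: +1 (ON) or -1 (OFF)

# A toy stream of three events on a 128x128 grid (the DVS-Gesture resolution).
events = np.array([(1000, 12, 40, 1),    # ON  event: brightness increase
                   (1050, 12, 41, -1),   # OFF event: brightness decrease
                   (1100, 90, 17, 1)],
                  dtype=event_dtype)

print(events["t"], events["p"])  # whole columns can be accessed at once
```
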
Visualization of a Spiking Video

The figure above shows a visualization of an example spiking video. ON and OFF events are represented in blue and gray, respectively.

List of Spiking Datasets

Various event-based datasets are available, some converted from traditional video datasets, while others are captured by spiking cameras in real scenes.

The Dataset We Chose

We selected the DVS-Gesture dataset for further study. This dataset, released by IBM Research and recorded with a DVS128 event camera developed at the Institute of Neuroinformatics of the University of Zurich and ETH Zurich, comprises 11 hand gesture classes, including hand clapping, hand waves, forearm rolls, and gestures miming musical instruments.

Spiking Neural Network

Overview

The traditional artificial neuron model mainly includes two functions: one computes the weighted sum of the signals transmitted by the previous layer of neurons, and the other applies a nonlinear activation function to produce the output signal. The former imitates how information is transmitted between biological neurons, while the latter improves the nonlinear computing ability of the network.

A spiking neural network is a neural network that more closely resembles biological neural networks. These third-generation networks receive and output data in the form of spikes. Each spike is weighted by the synapse it travels across.

Spiking Neuron Model

Spiking neurons model biological neurons: the membrane potential moves toward a threshold value as spikes are received. Upon exceeding the threshold, the neuron emits a spike to the neurons it is connected to. The simplest form of the spiking neuron model is expressed using an input current $z$ derived from incoming spikes.

Leaky Integrate and Fire Model

To address the computational complexity of the Hodgkin–Huxley (HH) model, simplified models like the Leaky Integrate and Fire (LIF) model have been proposed. The LIF model treats the cell membrane's electrical properties as a combination of a resistance and a capacitance (an RC circuit), retaining some biological plausibility at a much lower computational cost. It introduces a leak to the membrane potential $v$, allowing the neuron to return to its resting state in the absence of activity \cite{falez2019improving}. Additionally, the LIF model accounts for the refractory period that follows an action potential: after emitting a spike, the neuron remains at a reset potential for several milliseconds.

The LIF model can be expressed as:

$$
\tau_{\text{leak}} \frac{dv}{dt} = -\left[v(t) - v_{\text{rest}}\right] + r_{\mathrm{m}} z(t)
$$

where:

  • $\tau_{\text{leak}} = r_m \cdot c_m$ is the membrane time constant
  • $v_{\text{rest}}$ is the resting potential, which also serves as the reset potential
  • $c_m$ is the membrane capacitance
  • $r_m$ is the membrane resistance
  • $v_{\text{th}}$ is the defined threshold

When $v \geq v_{\text{th}}$, $v$ resets to $v_{\text{rest}}$.

A LIF spiking neuron receiving spikes
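
A minimal discrete-time simulation of this update rule is sketched below (Python with forward-Euler integration; the constants are illustrative, and the refractory period mentioned above is omitted for brevity).

```python
import numpy as np

def simulate_lif(z, dt=1.0, tau_leak=10.0, r_m=1.0, v_rest=0.0, v_th=1.0):
    """Simulate one LIF neuron driven by an input current trace z.

    Forward-Euler discretization of
        tau_leak * dv/dt = -(v - v_rest) + r_m * z(t),
    with a reset to v_rest whenever v reaches v_th.
    """
    v = v_rest
    v_trace, spikes = [], []
    for z_t in z:
        v += dt / tau_leak * (-(v - v_rest) + r_m * z_t)
        fired = v >= v_th
        if fired:
            v = v_rest            # reset after the action potential
        v_trace.append(v)
        spikes.append(int(fired))
    return np.array(v_trace), np.array(spikes)

# A constant input current strong enough to make the neuron fire periodically.
v_trace, spikes = simulate_lif(z=np.full(100, 1.5))
print(spikes.sum(), "spikes in 100 steps")
```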

Data Pre-processing

Baseline Spiking Architecture

The model receives RGB videos as input, and our goal is to replace this part with spiking videos directly. The baseline spiking architecture is depicted in Figure 5. The green circle represents the part we aim to replace, while the yellow circle indicates the pre-processing step to adapt the input spiking videos to fit into the convolution part of the model.

Baseline Spiking Architecture

Explore the DVS-Gesture Dataset

The DVS-Gesture dataset contains 30 binary event stream files, each containing continuous recordings of all 11 classes of gestures. Additionally, for each event stream file, there is a corresponding table containing the start and end timestamps of each category of action.

Convert Event Stream to Frames

The event-to-frame integrating method for pre-processing spiking datasets is widely used.

Integrating Event Blocks into Frames

Denote the $j$-th two-channel frame after integration as $F(j)$ and the value of its pixel at $(p, x, y)$ as $F(j, p, x, y)$. The pixel value is integrated from the events whose indices fall in $[j_l, j_r)$:
$$
F(j, p, x, y) = \sum_{i=j_l}^{j_r-1} \mathcal{I}_{p, x, y}(p_i, x_i, y_i)
$$

where $\mathcal{I}_{p, x, y}(p_i, x_i, y_i)$ is an indicator function that equals 1 only when $(p, x, y) = (p_i, x_i, y_i)$.
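
In code, this sum is simply a per-pixel count over the event block. Below is a minimal NumPy sketch (reusing the structured `events` array introduced earlier, with 0-based pixel coordinates and polarities in $\{-1, 1\}$ mapped to channels 0 and 1):

```python
import numpy as np

def integrate_block(events, j_l, j_r, height, width):
    """Integrate events with indices in [j_l, j_r) into a two-channel frame.

    Channel 0 accumulates OFF events (p = -1) and channel 1 ON events (p = +1),
    which realizes the indicator-function sum F(j, p, x, y) above.
    """
    frame = np.zeros((2, height, width), dtype=np.int32)
    block = events[j_l:j_r]
    channel = (block["p"] + 1) // 2          # -1 -> 0 (OFF), +1 -> 1 (ON)
    # np.add.at accumulates correctly even when (channel, y, x) repeats.
    np.add.at(frame, (channel, block["y"], block["x"]), 1)
    return frame
```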

Split by a Fixed Number of Frames

The first idea is to split the event stream into a video with a fixed number of frames $T$. In this logic, we split the event stream into $T$ equal event blocks, according to either the total number of events or the total time span, and then integrate each block into a frame.

If the split method is time, the total time span is divided into $T$ equal intervals of length $\Delta T = \frac{t_{n-1} - t_0}{T}$, and block $j$ (for $j = 0, \dots, T-1$) contains the events with indices

$$
j_l = \min\left\{ i \mid t_i \geq t_0 + j \cdot \Delta T \right\}, \qquad
j_r = \min\left\{ i \mid t_i \geq t_0 + (j+1) \cdot \Delta T \right\}
$$

(with $j_r = n$ for the last block).

If the split method is number, each block contains an equal share of the $n$ events:

$$
j_l = \left\lfloor \frac{n}{T} \right\rfloor \cdot j, \qquad
j_r = j_l + \left\lfloor \frac{n}{T} \right\rfloor
$$

(again with $j_r = n$ for the last block, which absorbs the remainder).

Split a fixed number of frames by time
Split a fixed number of frames by number of events
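
Both strategies can be sketched with NumPy as follows (building on `integrate_block` above; `np.searchsorted` finds the index boundaries for the time-based split):

```python
import numpy as np

def split_to_frames(events, T, height, width, split_by="number"):
    """Split an event stream into exactly T frames, by event count or by time."""
    n = len(events)
    if split_by == "number":
        step = n // T
        bounds = [(j * step, n if j == T - 1 else (j + 1) * step)
                  for j in range(T)]
    else:  # split_by == "time"
        t = events["t"]
        edges = np.linspace(t[0], t[-1], T + 1)
        idx = np.searchsorted(t, edges)
        idx[-1] = n                      # the last block keeps the final events
        bounds = list(zip(idx[:-1], idx[1:]))
    return np.stack([integrate_block(events, j_l, j_r, height, width)
                     for j_l, j_r in bounds])

# frames = split_to_frames(events, T=16, height=128, width=128, split_by="time")
```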

Split by Fixed Duration

Splitting by a fixed number of frames may lead to different speeds of motion in the obtained frames, as different classes of action videos may have different durations. Using a fixed time interval for integration is more in line with the actual physical system. For example, integrating every 10 ms gives $\lfloor \frac{L}{10} \rfloor$ frames for data of length $L$ ms. However, the length of each sample in the spiking dataset is often different, resulting in frame sequences of different lengths. We therefore zero-pad the integrated frame sequences so that every sample has the same length, as sketched below.
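
A sketch of fixed-duration integration with zero-padding, again reusing `integrate_block` (the parameters `dt_ms` and `max_frames` are illustrative choices, not values fixed by the dataset):

```python
import numpy as np

def split_by_duration(events, dt_ms, max_frames, height, width):
    """Integrate every dt_ms milliseconds, then zero-pad to max_frames frames."""
    t = events["t"]                      # timestamps, assumed in microseconds
    dt_us = dt_ms * 1000
    n_frames = int((t[-1] - t[0]) // dt_us)
    edges = t[0] + dt_us * np.arange(n_frames + 1)
    idx = np.searchsorted(t, edges)
    frames = np.stack([integrate_block(events, j_l, j_r, height, width)
                       for j_l, j_r in zip(idx[:-1], idx[1:])])
    # Zero-pad along the time axis so that every sample has max_frames frames.
    pad = max(0, max_frames - len(frames))
    return np.pad(frames, ((0, pad), (0, 0), (0, 0), (0, 0)))[:max_frames]
```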

Experiment

The Input Data Format of the Simulator

The simulator implements the SNN model and receives samples as (label, tensor) pairs, where each action-video tensor has the shape (FRAME_HEIGHT, FRAME_WIDTH, VIDEO_DEPTH, FRAME_NUMBER).
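
The integrated frames from the previous section have the shape `(FRAME_NUMBER, VIDEO_DEPTH, FRAME_HEIGHT, FRAME_WIDTH)`, so a transpose adapts them to the simulator's layout. A sketch (the simulator's exact API is not shown here):

```python
import numpy as np

def to_simulator_format(frames):
    """(FRAME_NUMBER, VIDEO_DEPTH, H, W) -> (H, W, VIDEO_DEPTH, FRAME_NUMBER)."""
    return np.transpose(frames, (2, 3, 1, 0))

# sample = (label, to_simulator_format(frames))
```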

Results

The Result of the Pre-processing

After implementing the concepts described above, I processed the training and test data. Although visualizing the frames was not required by my mentor, I decided to do it for fun and to verify that the frames were generated correctly. I wrote some visualization functions to transform the frame data into a GIF. Here is a visualization of a captured frame:

Visualization of a frame (left-hand wave)
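
A minimal sketch of such a visualization function, assuming the `imageio` package is available (mapping ON and OFF channels to different gray levels is purely a display choice):

```python
import numpy as np
import imageio.v2 as imageio

def frames_to_gif(frames, path, fps=10):
    """Render (FRAME_NUMBER, 2, H, W) integrated frames as an animated GIF."""
    images = []
    for frame in frames:
        img = np.zeros(frame.shape[1:], dtype=np.uint8)
        img[frame[0] > 0] = 128   # OFF events -> gray
        img[frame[1] > 0] = 255   # ON events  -> white
        images.append(img)
    imageio.mimsave(path, images, fps=fps)

# frames_to_gif(frames, "left_hand_wave.gif")
```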

Conclusion

In this research project, I delved into the principles of Spiking Neural Networks (SNNs) for action recognition. Compared to traditional neural network models, SNNs are more biologically inspired and promising for low-energy computation. Specifically, my study of the spiking dataset allowed me to understand a completely different form of data that is lighter than traditional datasets and therefore reduces the complexity of neural network computation. However, at present, this third-generation neural network model is not as mature as traditional neural network models. Nevertheless, this is precisely why people continue to invest in this research.


Author: Guoqing ZHANG