Abstract
Recently, the event camera has become a popular and promising vision sensor in research on simultaneous localization and mapping and computer vision owing to its advantages: low latency, high dynamic range, and high temporal resolution. As a basic part of feature-based SLAM systems, feature tracking with event cameras is still an open question. In this article, we present a novel asynchronous event feature generation and tracking algorithm operating directly on event-streams to fully utilize the natural asynchronism of event cameras. The proposed algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. The event-corner detection unit provides a fast and asynchronous corner detector to extract event-corners from event-streams. For the descriptor construction unit, we propose a novel asynchronous gradient descriptor inspired by the scale-invariant feature transform descriptor, which enables quantitative measurement of the similarity between event feature pairs. The construction of the gradient descriptor can be decomposed into three stages: speed-invariant time surface maintenance and extraction, principal orientation calculation, and descriptor generation. The event feature tracking unit combines the constructed gradient descriptor with an event feature matching method to achieve asynchronous feature tracking. We implement the proposed algorithm in C++ and evaluate it on a public event dataset. The experimental results show that our proposed method improves tracking accuracy and real-time performance compared with the state-of-the-art asynchronous event-corner tracker, with no compromise on feature tracking lifetime.
Introduction
Over the past several years, simultaneous localization and mapping (SLAM) has been widely studied and developed for augmented and virtual reality, self-driving cars, and unmanned aerial vehicles. 1 The combination of deep learning and SLAM 2,3 has also become a hot research topic. However, due to the complexity of real environments, existing visual SLAM systems using a single vision sensor still face many problems, such as tracking failure. To enhance the robustness of SLAM systems, many researchers fuse data from two or more sensors, such as cameras, Lidar, GPS, IMU, and so on. 4,5 However, these systems still face many challenges in difficult scenes involving high-speed motion and high dynamic range. Recently, bioinspired vision sensors 6,7 have aroused many researchers' interest and have become a hot research topic in robotics and computer vision. Event cameras respond to local pixel-level brightness changes, transmitting asynchronous events only when brightness changes are detected rather than frames at a fixed time interval, which makes them intrinsically different from standard cameras. Each event is a tuple comprising the pixel position, the timestamp, and the polarity of the brightness change.
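As a concrete illustration, the event tuple just described can be sketched as a small structure; the field names and types below are our illustrative assumptions, not a definition from any particular camera driver.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the event tuple: pixel position (x, y), timestamp t, and
// polarity p (+1 for a brightness increase, -1 for a decrease).
// All names and types here are illustrative assumptions.
struct Event {
    uint16_t x;  // pixel column
    uint16_t y;  // pixel row
    double   t;  // timestamp in seconds
    int8_t   p;  // polarity: +1 or -1
};

inline bool is_positive(const Event& e) { return e.p > 0; }
```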
Unfortunately, the asynchronous events from event cameras are intrinsically different from intensity images, so standard computer vision methods cannot be directly applied to event cameras. 12 Researchers have to explore new methods to bring event cameras' potential into full play. Until now, a large number of research efforts have focused on event cameras in multiple directions, such as SLAM, 9,13,14 segmentation, 15,16 visual information reconstruction, 17,18,19 and control for unmanned aerial vehicles. 20,21 More related research can be found in the survey articles 12,22 and the list of event-based vision resources (https://github.com/uzh-rpg/event-based_vision_resources).
As one of the basic methods in SLAM, feature-based SLAM methods extract features from intensity frames, and each feature is associated with a descriptor, such as scale-invariant feature transform (SIFT), speeded-up robust features (SURF), oriented FAST and rotated BRIEF (ORB), and so on. The extracted descriptor preserves the information of the local area around the feature point and provides a quantitative comparison with other feature points. 5
Then, data association is performed to associate similar features and complete feature tracking tasks. To the best of our knowledge, there is still no visual SLAM system built on an asynchronous feature tracking method designed for event cameras. Driven by the demand for an efficient asynchronous feature tracking method for a subsequent event-camera-based SLAM system, we propose an asynchronous event feature generation and tracking algorithm working directly on event-streams. The proposed algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. The results of asynchronous event feature tracking are shown in Figure 1. The main contributions of this article can be summarized as follows. First, we propose an asynchronous event feature generation and tracking algorithm that works directly on asynchronous event-streams and includes an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. Second, we present a novel asynchronous event feature gradient descriptor, constructed through speed-invariant time surface (SITS) 24 maintenance and extraction, principal orientation calculation, and descriptor generation; the descriptor represents the distribution of the local gradient information around event-corners and enables quantitative measurement of the similarity between event feature pairs. Third, we implement our proposed algorithm in C++ and evaluate it on a public dataset. 23 The experimental results show that our proposed method improves tracking accuracy and real-time performance when compared with the state-of-the-art asynchronous event-corner tracker, with no compromise on feature tracking lifetime.

Our event feature generation and tracking algorithm works directly on asynchronous event-streams based on our proposed gradient descriptor. This figure shows the event feature tracking results in the spatiotemporal space with our proposed algorithm on the shapes scene of the public event camera dataset. 23 Different colors indicate different tracked event features.
The rest of the article is organized as follows. The related works are given in the next section. Then, we give the overview of the presented algorithm, which is followed by the introduction of the proposed gradient descriptor. Later, the details of the event feature tracking method are outlined, and the following section presents the experimental results and the corresponding analysis. Finally, the conclusions are drawn and the future work is given.
Related works
In computer vision, a feature may be a specific structure, such as an interest point, edge, block, or object, that differs from its immediate neighborhood in the image. Feature-based tracking is widely applied in visual odometry, SLAM, and augmented reality. A feature-based tracking method generally consists of feature detection, feature description, feature matching, and feature tracking. In the whole process, feature description is one of the most significant steps for tracking.
Feature descriptors for standard images
As one of the most widely used features, SIFT 25 is invariant to image scale and rotation, and robust to changes in illumination and affine distortion. The generation of the SIFT feature descriptor has four stages: scale-space extrema detection (based on the difference-of-Gaussian pyramid), keypoint localization, orientation assignment, and keypoint description. After the first two stages, keypoints are selected, including their locations and scales. In the orientation assignment step, one or more orientations are assigned to each keypoint based on the local image gradient information in the patch around the keypoint location, so every keypoint carries a location, a scale, and an orientation. Finally, a multidimensional descriptor is calculated for each keypoint at the selected scale based on the local patch around its location. The SURF descriptor 26 was proposed based on an idea similar to SIFT; SURF is faster than SIFT, and it is also scale and rotation invariant. As a binary descriptor, the binary robust independent elementary feature (BRIEF) descriptor 27 allows very fast Hamming distance matching, but it is neither scale invariant nor rotation invariant. Another binary descriptor, ORB, 28 combines the oriented features from accelerated segment test (FAST) detector 29 and the rotated BRIEF descriptor; ORB is rotation invariant but not scale invariant. Compared to BRIEF and ORB, SIFT and SURF need significantly more computational effort. However, BRIEF and ORB use binary strings as feature descriptors, which results in larger mismatch rates.
Event-based corner detection
In recent years, many asynchronous event-corner detection and tracking methods 30,31,32,33,34 have been proposed for event-driven data. In detail, Vasco et al. 31 applied an adaptation of the original image-based Harris corner detector 35 to event-based data, while Mueggler et al. 32 presented a FAST-like event-based corner detector, inspired by the image-based FAST corner detection method, that is faster than the method of Vasco et al. 31 Li et al. 33 studied a fast and asynchronous event-based corner detection method, called FA-Harris, with a corner candidate selection and refinement strategy. Alzugaray and Chli 36 proposed a faster asynchronous event-corner detection method inspired by the method of Mueggler et al., 32 together with a simple asynchronous event-corner tracker that utilizes a directed graph to record the tracks of event-corners. Alzugaray and Chli 37 then improved the asynchronous event-corner tracking algorithm by introducing a normalization descriptor for the extracted event-corners. The FA-Harris detector achieves better accuracy with moderate computational cost compared with the other aforementioned corner detection methods.
All the above event-corner detection methods operate directly on asynchronous event-streams using the surface of active events (SAE) 38 (also called the time surface 39 ). The time surface maps each pixel position to the timestamp of the latest event triggered there; in other words, it keeps the absolute timestamps of the latest events on the imaging plane. Manderscheid et al. 24 proposed the SITS, which is invariant to the motion speed of the camera or of scene objects; SITS keeps relative timestamps instead of absolute ones. They utilized the SITS to detect event-corners from event-streams by training a random forest.
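The time-surface idea above can be sketched as a per-pixel map from position to the timestamp of the latest event; the structure below and the sensor size (the DAVIS240C resolution used later in the article) are illustrative.

```cpp
#include <cassert>
#include <vector>

// Sketch of a time surface / surface of active events (SAE): each pixel
// stores the absolute timestamp of the latest event triggered there.
// Sensor size assumed to be the DAVIS240C resolution (240 x 180).
constexpr int kWidth = 240, kHeight = 180;

struct TimeSurface {
    std::vector<double> t = std::vector<double>(kWidth * kHeight, 0.0);
    // Record the latest event's timestamp at its pixel position.
    void update(int x, int y, double ts) { t[y * kWidth + x] = ts; }
    double at(int x, int y) const { return t[y * kWidth + x]; }
};
```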
Event-based feature tracking
Some event-based feature tracking methods work on event frames (synthesized from a fixed number of events or from events in a fixed temporal window) or rely on the absolute intensity information of images. Tedaldi et al. 40 first extracted Harris corners and Canny edge features on intensity images and then tracked the features on asynchronous event-streams. Kueng et al. 41 presented an event-based visual odometry method, also based on corners and edges, to track the six-degree-of-freedom motion of the camera. Zhu et al. 42 accumulated events in a temporal window to integrate event frames; based on the integrated event frames, they applied the original Harris corner detector and then tracked the detected corners with an expectation-maximization scheme. Afterward, Zhu et al. 43 further introduced inertial measurements into the system and proposed an event-based visual-inertial odometry method. Gehrig et al. 44 detected Harris corners on intensity frames and tracked them on event-streams. Li et al. 45 proposed a feature tracking method using events, intensity frames, and IMU data: they first extracted Harris and Canny features on intensity frames, and the feature templates were then tracked using an expectation-maximization iterative closest point strategy. Besides, Alzugaray and Chli 46 addressed a method to track generic patch features event-by-event without requiring event-corner detection or descriptors.
To fully utilize the natural asynchronism of event cameras, we propose a novel asynchronous event feature generation and tracking algorithm inspired by frame-based feature tracking techniques. The algorithm can work directly on event-streams without the requirement for intensity frames, artificially synthesized event frames, or other prior knowledge of scenes or camera motion. The proposed algorithm is based on a novel asynchronous event feature gradient descriptor inspired by the frame-based SIFT feature descriptor. The gradient descriptor represents the distribution of the local gradient information for event-corners, and it is used for feature matching during the asynchronous tracking process.
Overview
Inspired by standard computer vision tasks, we propose an asynchronous event feature generation and tracking algorithm in this article. As shown in Figure 2, the algorithm includes an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit.

The overview of the proposed asynchronous event feature generation and tracking algorithm. The algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. The input of the algorithm is the event-stream, and the output is the tracks of the tracked event features. The event-corner detection unit is based on a fast and asynchronous event-corner detection method; it extracts event-corners through global SAE maintenance, local SAE extraction, corner candidate selection, and corner candidate refinement. The gradient descriptor is constructed through SITS maintenance and extraction, principal orientation calculation, and descriptor generation. The tracking unit combines the constructed descriptor with an event feature matching method to achieve asynchronous feature tracking. SAE: surface of active events; SITS: speed-invariant time surface.
The event-corner detection unit is based on a fast and asynchronous event-corner detection method, 33 called FA-Harris, which detects event-corners directly on event-streams without using intensity images. It mainly consists of five steps: an event filter, global SAE maintenance, local SAE extraction, corner candidate selection, and corner candidate refinement. In our event feature generation and tracking algorithm, the event-corner detection unit utilizes the FA-Harris detector to extract event-corners from event-streams; the event filter included in the FA-Harris detector is not used here, since we found that it did not improve the performance of the tracking method.
After detecting event-corners, we design a novel asynchronous event feature gradient descriptor for each event-corner based on the SITS. 24 The gradient descriptor is constructed through SITS maintenance and extraction, principal orientation calculation, and descriptor generation. The descriptor represents the distribution of the local gradient information around event-corners and enables quantitative measurement of the similarity between event feature pairs in the following event feature tracking unit. By introducing the gradient descriptor, we can define the event feature as a tuple consisting of the event-corner (its pixel position, timestamp, and polarity) and its associated gradient descriptor.
Finally, the generated event features are tracked by the event feature tracking unit. The tracking unit combines the constructed descriptor with an event feature matching method to achieve asynchronous event feature tracking. The proposed gradient descriptor provides the similarity measurements between event feature pairs. The event feature matching method is implemented based on a directed graph composed of multiple structured track trees.
Gradient descriptor
This section introduces our proposed gradient descriptor. The construction of the gradient descriptor can be divided into three stages: SITS maintenance and extraction, principal orientation calculation, and descriptor generation, mainly inspired by the last two stages of the frame-based SIFT descriptor. We first select the event-corners (keypoints) from the incoming events, including their locations, timestamps, and polarities, using the FA-Harris detector. Compared with the keypoints in the SIFT method, our event-corners do not contain scale information. For the frame-based SIFT descriptor, the first stage is scale-space extrema detection based on the difference-of-Gaussian pyramid, before the keypoint localization step. For simplicity, we do not utilize the scale space, which would be a future direction for further research. As mentioned above, keypoint localization in our method is achieved using the FA-Harris detector for event cameras rather than by localizing keypoints in the scale space. We apply the SITS method 24 to provide temporal information and gradient information of events. The global SITS structure is maintained based on the incoming events; it has the same size as the imaging plane.
Speed-invariant time surface maintenance and extraction
Since there is no concept of intensity images for event cameras and a single event alone does not provide gradient information for descriptor construction, we choose the SITS 24 (updated asynchronously with every incoming event) to store the temporal information of events and provide the gradient information, rather than using the intensity image as in the frame-based SIFT descriptor. 25 According to Manderscheid et al., 24 SITS is invariant to the motion speed of cameras or objects in the environment, which contributes to the speed-invariant property of event features. To distinguish event-corners from event-streams, SITS keeps relative values for timestamps rather than absolute ones. The method maintains one SITS structure for each event polarity, which stores a single value for each pixel location. Specifically, all the values in the SITS are initialized to 0. When a new event arrives, the values in a local neighborhood that are larger than the value at the corresponding event pixel position are decremented, and the value at the event position is then set to the neighborhood maximum.
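Our reading of the SITS update rule can be sketched as follows, assuming a (2R+1) x (2R+1) update neighborhood whose maximum value is (2R+1)^2; the radius, sensor size, and function names are illustrative assumptions rather than the exact implementation of Manderscheid et al.

```cpp
#include <cassert>
#include <vector>

constexpr int W = 240, H = 180;  // assumed sensor size (DAVIS240C)
constexpr int R = 3;             // assumed update-neighborhood radius

// SITS update for one incoming event at (x, y): every value in the
// (2R+1)x(2R+1) neighborhood that is larger than the value at the event
// pixel is decremented, then the event pixel is set to the neighborhood
// maximum (2R+1)^2. This keeps relative ranks rather than absolute times.
void update_sits(std::vector<int>& sits, int x, int y) {
    const int center = sits[y * W + x];
    for (int dy = -R; dy <= R; ++dy) {
        for (int dx = -R; dx <= R; ++dx) {
            int px = x + dx, py = y + dy;
            if (px < 0 || px >= W || py < 0 || py >= H) continue;
            if (sits[py * W + px] > center) --sits[py * W + px];
        }
    }
    sits[y * W + x] = (2 * R + 1) * (2 * R + 1);
}
```

Because only ranks are stored, the surface looks the same whether the events arrive quickly or slowly, which is the speed-invariance property used above.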
In our proposed algorithm, we maintain one global SITS structure for each polarity, the same as in the method of Manderscheid et al. 24 For each incoming event, the global SITS structure corresponding to the polarity of the new event is updated. When a new event-corner arrives, we extract a local patch P of fixed size, centered at the event-corner, from the corresponding global SITS structure.

The generation of the proposed asynchronous event feature gradient descriptor from the local patch extracted around the event-corner.
Principal orientation calculation
On the local patch P extracted from the global SITS structure, we calculate the gradient magnitude m(x, y) and orientation theta(x, y) at each position by central differences, in the same way as the frame-based SIFT descriptor:

m(x, y) = sqrt[(P(x+1, y) - P(x-1, y))^2 + (P(x, y+1) - P(x, y-1))^2]

theta(x, y) = arctan[(P(x, y+1) - P(x, y-1)) / (P(x+1, y) - P(x-1, y))]

where P(x, y) denotes the value of the patch at position (x, y). The orientations are accumulated into a gradient histogram with bins of 10 degrees, each sample weighted by its gradient magnitude. The orientation corresponding to the peak value in the gradient histogram represents the gradient orientation of the local patch, and it is regarded as the principal orientation of the local patch. Since the orientation obtained from the gradient histogram is essentially an interval of 10 degrees, we apply parabolic interpolation to obtain a more precise orientation; more specifically, the selected orientation and the two orientations adjacent to it are used for the parabolic interpolation.
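The orientation step can be illustrated with a SIFT-style sketch, assuming central-difference gradients on the patch P and a 36-bin (10-degree) histogram; the names, the bin-center convention, and the flat-histogram guard are our assumptions.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

constexpr int kBins = 36;  // 360 degrees / 10 degrees per bin
constexpr double kPi = 3.14159265358979323846;

// Gradient magnitude and orientation at interior position (x, y) of the
// patch P, by central differences, as in the frame-based SIFT descriptor.
struct Gradient { double mag, ori; };
Gradient gradient_at(const std::vector<std::vector<double>>& P, int x, int y) {
    double dx = P[y][x + 1] - P[y][x - 1];
    double dy = P[y + 1][x] - P[y - 1][x];
    double ori = std::atan2(dy, dx);  // in (-pi, pi]
    if (ori < 0) ori += 2 * kPi;      // map to [0, 2*pi)
    return {std::sqrt(dx * dx + dy * dy), ori};
}

// Refine the peak bin by fitting a parabola through the peak and its two
// neighbours; returns the refined orientation in degrees.
double refine_peak(const std::array<double, kBins>& h, int peak) {
    double l = h[(peak + kBins - 1) % kBins];
    double c = h[peak];
    double r = h[(peak + 1) % kBins];
    double denom = l - 2 * c + r;
    double offset = (denom == 0.0) ? 0.0 : 0.5 * (l - r) / denom;
    return (peak + 0.5 + offset) * (360.0 / kBins);  // bin-center convention
}
```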
To enhance the robustness of feature matching, we choose the orientation corresponding to the maximum value in the histogram, together with every orientation whose histogram value is greater than a fixed fraction of that maximum (in SIFT, bins above 80% of the peak are retained). The circular mean of the selected orientations is then used as the final principal orientation of the event feature.
Descriptor generation
In this stage, we reorient the local patch to its principal orientation to generate the gradient descriptor vector; that is, we rotate the x direction of the local patch (which coincides with the x direction of the imaging plane) so that it aligns with the principal orientation. The reoriented neighborhood around the event-corner is divided into subregions, and a gradient orientation histogram is accumulated for each subregion, with each sample weighted by its gradient magnitude and a Gaussian window; the concatenation of these histograms forms the multidimensional descriptor vector, analogous to the frame-based SIFT descriptor. After we obtain the gradient descriptors for event-corners, we compute the descriptor distance between event feature pairs to measure the similarity between them. For two gradient descriptor vectors, we use the Euclidean distance between them as the descriptor distance.
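A minimal sketch of the similarity measurement, assuming the descriptor distance is the Euclidean distance between descriptor vectors (the metric used in frame-based SIFT matching; the article's exact metric may differ):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Euclidean distance between two descriptor vectors of equal length;
// a smaller distance means more similar event features.
double descriptor_distance(const std::vector<double>& a,
                           const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}
```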
Event feature tracking
In the event feature tracking unit, we combine the constructed gradient descriptor and an event feature matching method to achieve asynchronous event feature tracking. For each incoming event-corner, we assign a gradient descriptor to it based on the above-mentioned gradient descriptor construction method. Our proposed gradient descriptor represents the distribution of the local gradient information within the event-corner neighborhood in the time surface space. We define the event feature as a tuple consisting of the event-corner and its associated gradient descriptor.
The event feature matching method 37 used in our presented algorithm is based on a directed graph. The implementation details of the event feature matching method are summarized in Figure 4. For each new incoming event feature, the algorithm generates a new vertex in the directed graph.

Implementation details of the event feature tracking unit. A global memory of fixed size stores the vertices of the track trees in the directed graph.
For every new generated vertex, it can be assigned to an existing tree or become the root of a new tree. Once the depth of a tree increases, the proposed algorithm will perform the reference updating operation.
Tree assignment
Considering the descriptor distance introduced in the above section, the existing vertex closest to the new vertex within a predefined spatiotemporal window is selected as its parent, and the new vertex is appended to the corresponding track tree. If no vertex falls within the window, the new vertex becomes the root of a new tree.
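The tree-assignment rule can be sketched as follows; the window sizes, the vertex layout, and the use of Euclidean descriptor distance are illustrative assumptions, not the article's exact parameters.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// A vertex of the directed graph: feature position, timestamp, descriptor,
// and the id of the track tree it belongs to (-1 = root of a new tree).
struct Vertex {
    double x, y, t;
    std::vector<double> desc;
    int tree_id;
};

double desc_dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return std::sqrt(s);
}

// Among existing vertices inside a spatiotemporal window around the new
// feature v, return the index of the one with the smallest descriptor
// distance, or -1 if none qualifies (v then starts a new tree).
// win_px and win_s are assumed window sizes.
int assign_tree(const std::vector<Vertex>& graph, const Vertex& v,
                double win_px = 5.0, double win_s = 0.1) {
    int best = -1;
    double best_d = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < graph.size(); ++i) {
        const Vertex& u = graph[i];
        if (std::abs(u.x - v.x) > win_px || std::abs(u.y - v.y) > win_px) continue;
        if (v.t - u.t < 0.0 || v.t - u.t > win_s) continue;
        double d = desc_dist(u.desc, v.desc);
        if (d < best_d) { best_d = d; best = static_cast<int>(i); }
    }
    return best;
}
```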
Reference updating
A reference vertex is maintained for each track tree. Once the depth of the tree increases, the reference vertex and its descriptor are updated, so that matching for subsequent event features is performed against the most recent reliable feature of the track.
Track refinement
To get smoother tracks, the event feature tracks are smoothed using a simple interpolation operation: for each vertex, its pixel coordinate is interpolated using its s predecessors and s successors in the same track. To filter out short and noisy tracks, only event feature tracks containing at least m refined vertices are kept.
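A minimal sketch of the refinement step, assuming the interpolation is a simple moving average over the s predecessors, the vertex itself, and the s successors (the article's exact interpolation scheme may differ); boundary vertices without enough neighbours are left untouched here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Pt = std::pair<double, double>;  // (x, y) pixel coordinate

// Smooth each interior vertex of a track by averaging over a window of
// s predecessors, the vertex itself, and s successors.
std::vector<Pt> refine_track(const std::vector<Pt>& track, int s) {
    std::vector<Pt> out = track;
    for (int i = s; i + s < static_cast<int>(track.size()); ++i) {
        double sx = 0.0, sy = 0.0;
        for (int j = i - s; j <= i + s; ++j) {
            sx += track[j].first;
            sy += track[j].second;
        }
        out[i] = {sx / (2 * s + 1), sy / (2 * s + 1)};
    }
    return out;
}
```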
Experiments
This section introduces the experimental results of our proposed algorithm, including accuracy and computational performance evaluation for feature tracking. We compare our proposed method with the tracking method 42 (referred to as the EOF tracker), the ACE tracker, 37 and the tracking method of reference 46 (referred to as the AMH tracker). The public event camera dataset, 23 generated using a DAVIS240C with a spatial resolution of 240 x 180 pixels, is used for evaluation.
To perform the comparison, we implement the normalization descriptor and the ACE tracker 37 in C++, using the same parameters and values as presented in the original article. All methods are implemented in C++ and evaluated on a laptop equipped with an Intel i7-7700HQ CPU at 2.80 GHz and 16 GB of RAM. For the event feature tracking unit, we employ the same spatiotemporal window for all scenes.
Tracking performance
We use the event-based feature tracking evaluation code 48 for tracking performance analysis. The ground truth feature tracks are collected using a KLT-based feature tracking method on frames. The positions of the initial features on frames are interpolated from the event-based features close in time to the frames. The initial features are tracked using the KLT method until they are lost, and the tracker updates the tracked features for each frame. The asynchronous event feature tracks produced by our proposed algorithm on the shapes and dynamic scenes are shown in Figure 5.

Asynchronous event feature tracks on (a) shapes and (b) dynamic scenes. The figures show the different feature tracks using our method over the last 0.5 s. The intensity frame is used for visualization in the figure.
Table 1 summarizes the average pixel error and average feature lifetime for event feature tracks on several scenes with different textural complexity. In our experimental evaluation, if the pixel error for a tracked feature exceeds 5 pixels, the tracked feature is regarded as invalid, and only valid tracked features are considered in the evaluation. The best results are shown in bold in the table, and they indicate that our proposed method achieves better accuracy than the EOF tracker and the ACE tracker. Besides, we report the average tracking error from reference 46 (the authors did not explicitly report tracking lifetime numerically); compared with the AMH tracker, our tracking method also performs significantly better, except in one case.
Average pixel error and feature lifetime of event feature tracks on different scenes using the AMH tracker, the EOF tracker, the ACE tracker, and our proposed method.a
a The best results are highlighted in bold.
Figure 6 shows the average tracking pixel error and the percentage of the tracked surviving features over time for the ACE tracker and our proposed method on four different scenes. The results demonstrate that our proposed method achieves better performance in terms of tracking accuracy. What is more, the band around the central line is wider with our method, which indicates that our method is more robust. However, the ACE tracker performs better on the scenes with complex texture when considering the feature tracking lifetime.

(a–d) The performance of feature tracking on different scenes. The figures in the second row show the average tracking pixel error (central line) on the corresponding scenes for the ACE tracker and our method. The band around the central line represents the percentage of the tracked surviving features over time. The wider the band is, the more robust the tracking method is.
Computational performance
In this section, we compare the computational performance of our proposed event feature tracking method with the ACE tracker.
The ACE tracker uses the normalization descriptor as a quantitative measure of similarity between event-corners. The normalization descriptor is implemented with a simple sorting operation over the events' timestamps in a local patch, and the sorted timestamps are normalized into a fixed range.
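Our understanding of this sort-based normalization can be sketched as follows, assuming the ranks are normalized into [0, 1]; this is an illustration of the idea, not the ACE implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sketch of a normalization descriptor: rank the timestamps in a local
// patch by sorting, then map each rank into [0, 1] (assumed range).
// This removes the dependence on absolute time, keeping only the order.
std::vector<double> normalization_descriptor(const std::vector<double>& ts) {
    std::vector<std::size_t> idx(ts.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return ts[a] < ts[b]; });
    std::vector<double> desc(ts.size(), 0.0);
    for (std::size_t rank = 0; rank < idx.size(); ++rank) {
        desc[idx[rank]] = (ts.size() > 1)
            ? static_cast<double>(rank) / (ts.size() - 1) : 0.0;
    }
    return desc;
}
```

The sort is the dominant cost, which matches the observation below that descriptor construction is cheaper for ACE than for our gradient descriptor.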
Table 2 presents the real-time performance of our proposed descriptor construction and the event feature matching method. We report the total time, the time spent on descriptor construction and event feature matching, and their ratios to the total time. As given in Table 2, the descriptor construction ratio with our method is larger than that of ACE. This is because our method needs gradient computation and Gaussian weighting, while the normalization descriptor used in ACE only needs a sort operation and a normalization operation. However, the matching time with our method is much shorter than the matching time with ACE, which contributes to a better real-time performance in terms of the total time for feature generation and tracking.
The real-time performance of our proposed descriptor construction and event feature matching method compared with those of the ACE tracker, including the total time, the time for descriptor construction (des. time), the time for event feature matching (matching time), the ratio of time spent on descriptor construction to the total time (des. ratio), the ratio of time spent on event feature matching to the total time (matching ratio), and the real-time factor.a
a Both ACE and our method are performed on the same laptop. The better results are highlighted in bold.
We also report the real-time factor for real-time performance analysis; it is the total time spent processing the events of each scene divided by the duration of the dataset (10 s for each scene). For this metric, a smaller value indicates better real-time performance, and values under 1 indicate above-real-time performance. According to Table 2, our method achieves a better real-time factor than the ACE method on all scenes. However, for some scenes the real-time factor is still above 1, which indicates that above-real-time performance is not yet achieved in every case.
Table 3 gives the event processing ability of the ACE tracker and our proposed event feature generation and tracking algorithm, including the mean rate of events, the mean rate of event-corners, and the mean time for a single feature matching. According to Table 3, our proposed method achieves a higher event-rate and corner-rate and a shorter single-feature matching time than the ACE tracker, which contributes to its improved real-time performance.
The event processing ability of the ACE tracker and our proposed event feature generation and tracking algorithm, including the mean rate of events (mean event-rate), the mean rate of the event-corners (mean corner-rate), and the mean time for a single feature matching (time per feature).a
a The better results are highlighted in bold.
Conclusion
In this article, we present a novel asynchronous event feature generation and tracking algorithm operating directly on event-streams from event cameras. The algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. An asynchronous gradient descriptor is developed for the quantitative measurement of similarity between event feature pairs; it is constructed through SITS maintenance and extraction, principal orientation calculation, and descriptor generation. The experimental evaluation demonstrates that our proposed algorithm performs better in terms of tracking accuracy and real-time performance than the state-of-the-art asynchronous event-corner tracker, with no compromise on feature tracking lifetime.
In the future, we plan to further improve tracking accuracy and lifetime, for example by adding a scale-invariant property to the descriptor, so that the feature tracking algorithm can meet the demands of a visual odometry pipeline or even a full SLAM system. Learning-based methods for event feature generation may also be a promising direction for further research.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed the receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Program of China [2017YFB1001901] and by National Natural Science Foundation of China [Grant No. 61903377].
