Abstract
Purpose:
Registration of 3-dimensional ultrasound images poses a challenge for ultrasound-guided radiation therapy of the prostate since ultrasound image content changes significantly with anatomic motion and ultrasound probe position. The purpose of this work is to investigate the feasibility of using a pretrained deep convolutional neural network for similarity measurement in image registration of 3-dimensional transperineal ultrasound prostate images.
Methods:
We propose convolutional neural network-based registration that maximizes a similarity score between 2 equally sized 3-dimensional regions of interest: one encompassing the prostate within a simulation (reference) 3-dimensional ultrasound image and another that sweeps different spatial locations around the expected prostate position within a pretreatment 3-dimensional ultrasound image. The similarity score is calculated by (1) extracting pairs of corresponding 2-dimensional slices (patches) from the regions of interest, (2) providing these pairs as input to a pretrained convolutional neural network, which assigns a similarity score to each pair, and (3) calculating an overall similarity by summing all pairwise scores. The convolutional neural network method was evaluated against ground truth registrations determined by matching implanted fiducial markers visualized in a pretreatment orthogonal pair of x-ray images. The convolutional neural network method was further compared to manual registration and to a commonly used intensity-based automatic registration approach based on advanced normalized correlation.
Results:
For 83 image pairs from 5 patients, convolutional neural network registration errors were smaller than 5 mm in 81% of the cases. In comparison, manual registration errors were smaller than 5 mm in 61% of the cases and advanced normalized correlation registration errors were smaller than 5 mm in only 25% of the cases.
Conclusion:
Evaluation of the convolutional neural network against manual registration and advanced normalized correlation-based registration demonstrated better accuracy and reliability of the convolutional neural network. This suggests that with training on a large data set of transperineal ultrasound prostate images, the convolutional neural network method has potential for robust ultrasound-to-ultrasound registration.
Introduction
The ability to accurately aim radiation beams at the intended target while avoiding surrounding healthy tissues is critical for the success of prostate external beam radiation therapy (EBRT). Currently, implanted markers are used for accurate prostate localization during EBRT. However, there are several disadvantages with this approach such as morbidity associated with the implantation procedure, 1 -3 lack of volumetric information for managing anatomic deformations and volume changes, 4 -6 and potential marker migration before and during radiotherapy that may result in systematic errors. 1,2,4
Transperineal ultrasound (US) prostate imaging was recently introduced commercially and deployed clinically 7,8 as an alternative nonionizing image-guidance modality that could potentially eliminate some of the limitations of transabdominal US guidance. 9,10 However, US image guidance is challenged by variable operator-dependent image quality and technique-induced nontrivial differences in images of the same anatomy. 10,11 Intensity-based image registration methods are widely used for medical image registration applications. 11 -13 However, due to comparatively low image quality of US images, 14 standard intensity-based similarity metrics for US image registration do not guarantee a satisfactory performance. Furthermore, corresponding 3-dimensional (3-D) US image pairs can appear quite different depending on the transducer position and orientation and thus confound predetermined image features. As a result, intensity-based methods may not be very robust for 3-D US image registration. Even the manual registration of US volumes can be a difficult task.
In this article, we evaluate the feasibility of an alternative approach, a 3-D US image registration framework based on image matching with a pretrained deep convolutional neural network (CNN). Deep CNNs present a powerful methodology that has been used for a variety of medical image analysis tasks, 15,14 but research on CNNs for medical image registration is still considered to be at an early stage 15 with few articles on the subject. 16 -21 For multimodal image registration in particular, an emerging concept is to use CNNs on registered and misregistered image pairs in order to learn and subsequently apply a similarity measure that captures the underlying complex correlation across modalities. 16,17,20 We consider such a CNN-based strategy particularly attractive for the registration of US image pairs acquired at different time instances given that these images generally present nontrivial confounding differences in intensity and content.
Using a CNN to measure image similarity ideally requires that the CNN be trained with 3-D US images having ground truth registration results so that it can learn robust US image features most suitable for the application. However, acquiring a large number of US training data sets with validated ground truth registration is logistically challenging. We hypothesize that a pretrained deep CNN 22 designed to find correspondence (similarity) of image patches can still be used to measure the similarity of US images, as such a network has been trained on a large data set to successfully compare image patches while accounting for a wide variety of changes in image appearance. Thus, we design a registration method based on this pretrained deep CNN and evaluate its performance with 3-D transperineal US images acquired from patients undergoing prostate radiotherapy.
Methods and Materials
Treatment Procedure and Data Acquisition
For this study, with institutional review board approval, transperineal US imaging of the prostate was performed with the Clarity Autoscan system (Elekta, Stockholm, Sweden) for several prostate patients during simulation and treatment delivery. The Clarity Autoscan system combines infrared tracking and US imaging with the Clarity Autoscan probe to enable prostate localization during radiotherapy simulation and treatment. The Clarity Autoscan US probe is an enclosed, mechanically swept 3 to 7 MHz transducer that provides 3-D US images through the acquisition of a series of 2-D planes along the elevational direction of the transducer. In particular, for the acquisition of 3-D transperineal US images of the prostate, the probe is placed between the patient’s legs in contact with the perineum. This placement allows prostate imaging through the acoustic window provided by the perineum. The specific data acquisition throughout simulation and treatment is briefly described below.
Prior to computed tomography (CT) simulation scanning, the Clarity US probe is fixed in imaging position between the patient’s legs and left in place throughout the simulation procedure. Infrared reflective markers attached to the probe are tracked by a calibrated camera fixed on the room ceiling. This allows a simulation 3-D US image acquired immediately after the CT simulation scan in the same patient position to be reconstructed and referenced in the coordinate system of the CT device and thus automatically registered to the planning CT. Once completed, the CT contours of several structures (prostate, bladder, and rectum) are transferred from the planning CT to the simulation US image. The prostate contours (with modifications if deemed necessary) are set as an image-guidance volume. Once approved, a treatment plan is imported in the Clarity system to localize the treatment isocenter position within the 3-D simulation US image. (The treatment isocenter is a fixed point in the coordinate system of the medical linear accelerator at the focus of the central axes of all radiation beams deliverable by the accelerator.)
Before treatment, the Clarity US probe is fixed in imaging position between the patient’s legs and left in place throughout the treatment procedure including the actual beam delivery. Infrared reflective markers attached to the probe are tracked by a calibrated camera fixed on the room ceiling. This allows a treatment 3-D US image acquired before radiation delivery to be reconstructed and referenced in the coordinate system of the medical linear accelerator. The treatment 3-D US image is registered to the planning 3-D US image manually by overlaying the image-guidance volume (prostate) contours from the simulation US onto the prostate identified on the treatment US. A 3-D shift vector representing a rigid translational transform is then calculated by the Clarity system such that the isocenter-prostate spatial relation reflected in the treatment US image matches the intended isocenter-prostate spatial relation captured in the planning US image. Hereafter, we refer to the rigid translational transform obtained in this manner as manual registration. The manual registration is recorded but not applied for the treatment.
Commonly, prostate image-guided radiation therapy (IGRT) relies on implanted fiducials to align the prostate target prior to radiation delivery. To this end, as illustrated in Figure 1 (bottom), reference digitally reconstructed radiographs (DRRs) are generated from a CT volume during treatment planning. The DRRs capture the positions of the projected fiducial markers with regard to the treatment isocenter. Concurrently with the treatment US acquisition and registration, a pair of 2-D x-ray images is acquired with an On-Board Imager on a Varian 23EX Linac (Varian Medical Systems, Palo Alto, California). Such a pair of 2-D x-ray images allows localization of the fiducials in the coordinate system of the radiation delivery system. Then, a rigid body 3 degree-of-freedom transform (a 3-component vector) is calculated by aligning 4 prostate-implanted fiducial markers in corresponding pairs of reference DRRs and the 2-D x-ray images. This 3-D shift vector represents the rigid translational transformation that needs to be applied to match the isocenter-prostate spatial relation captured by the pair of x-ray images to the intended isocenter-prostate spatial relation captured in the simulation CT. Ideally, both the US and the x-ray image guidance should result in the same prostate shifts to align the target in the coordinate space of the treatment device. Discrepancies are interpreted as errors in the US–US registration in comparison to the x-ray fiducial-based registration that is widely used clinically.

Study design, ground truth, and quantitative evaluation.
In the present study, the x-ray-based translational transforms (shifts) calculated for patients undergoing prostate IGRT serve as a ground truth for evaluating the accuracy of the proposed US-to-US registration method. Simulation 3-D US images acquired during initial planning CT and 3-D US treatment images (US images acquired right before treatment) serve as inputs and a translational transform (vector shift) is the output as shown in Figure 1 (top). The evaluation is conducted by calculating the norm of the difference between the 2 registration vectors (ground truth and the results obtained with the proposed method).
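The error metric described above can be sketched in a few lines. This is a minimal illustration, assuming both registrations are expressed as 3-component shift vectors in millimeters in the same coordinate system; the function name and the example values are hypothetical and are not study data.

```python
import math


def registration_error_mm(ground_truth_shift, evaluated_shift):
    """Euclidean norm of the difference between two 3-D shift vectors (mm)."""
    return math.dist(ground_truth_shift, evaluated_shift)


# Hypothetical shifts in mm (illustrative values only).
xray_shift = (1.0, -2.0, 3.0)   # ground truth from fiducial matching
us_shift = (2.0, 0.0, 1.0)      # shift returned by a US-to-US registration
error = registration_error_mm(xray_shift, us_shift)
print(f"registration error: {error:.1f} mm")
```

Because only the norm is reported, the metric does not distinguish between directions; per-axis differences would need to be inspected separately.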
Deep CNN
In the proposed method, a CNN (or ConvNet) is used for matching of 2-D image slices. A convolutional neural network is a type of feed-forward artificial neural network that has proven successful for image and video analysis. The input for the network is an image pair of 2-D slices and the output is a similarity score. Due to the lack of training data for the deep CNN, the proposed method uses a pretrained CNN (Figure 2) described by Zagoruyko

Pretrained deep convolutional neural network used in this study. Pattern code used: Horizontal stripes = Conv + ReLU, solid color = max-pooling, checkered = fully connected layer (ReLU exists between fully connected layers as well). 22 Conv indicates convolutional neural network; ReLU, rectified linear unit.
The pretrained CNN (Figure 2) designed to find correspondence (similarity) of image patches consists of convolutional layers, rectified linear unit (ReLU) layers, max-pooling layers, and a fully connected layer (for an overview of CNN architectures, refer to 23 and references therein). Specifically, a list of all layers from bottom up includes convolutional layer 1 (
The output of the network is the output of the fully connected layer (
Registration Framework
Figure 3 illustrates the CNN (ConvNet) framework for registering 3-D treatment US images (acquired right before treatment) to 3-D simulation US images (acquired before planning). Two-D slices (patches) are extracted from the 3-D simulation and treatment US images along the axes

Ultrasound-to-ultrasound registration framework.
For each 3-D shift i, a translated treatment 3-D US image is generated. Since a shift i is not necessarily an integer value of the intervoxel spacing, a trilinear interpolation is used to calculate the voxel values of the translated image. A composite similarity score is then calculated by summing up the similarity scores of spatially corresponding patches. The similarity score for each patch pair is calculated with the pretrained ConvNet and a composite similarity score is calculated across all patches. The translational shift that generates the maximum composite similarity score is considered to be the translational transform that best matches treatment and simulation US images. The calculation of similarity score is defined in Equation 1. Figure 4A further details the process of extracting the 2-D slices (patches) from the US simulation image (as indicated by the ellipse in Figure 3), and Figure 4B details the process of extracting the 2-D slices (patches) from shifted treatment images (as indicated by the rectangle in Figure 3).
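The search described above can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the study: the pretrained ConvNet is replaced by a hypothetical placeholder (negative sum of squared differences), whole toy volumes are compared instead of cropped regions of interest, shifts are expressed in voxels, and all function names are invented for the example.

```python
import itertools
import math


def trilinear_sample(vol, z, y, x):
    """Trilinearly interpolate vol[z][y][x] at a fractional position.
    Coordinates are clamped to the volume bounds (an assumption of this sketch)."""
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    z = min(max(z, 0.0), nz - 1.0)
    y = min(max(y, 0.0), ny - 1.0)
    x = min(max(x, 0.0), nx - 1.0)
    z0, y0, x0 = int(math.floor(z)), int(math.floor(y)), int(math.floor(x))
    z1, y1, x1 = min(z0 + 1, nz - 1), min(y0 + 1, ny - 1), min(x0 + 1, nx - 1)
    fz, fy, fx = z - z0, y - y0, x - x0
    value = 0.0
    for zi, wz in ((z0, 1 - fz), (z1, fz)):
        for yi, wy in ((y0, 1 - fy), (y1, fy)):
            for xi, wx in ((x0, 1 - fx), (x1, fx)):
                value += wz * wy * wx * vol[zi][yi][xi]
    return value


def translate(vol, shift):
    """Resample vol so that output voxel p takes the value at p + shift."""
    dz, dy, dx = shift
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    return [[[trilinear_sample(vol, z + dz, y + dy, x + dx)
              for x in range(nx)] for y in range(ny)] for z in range(nz)]


def slices_along_axes(vol):
    """Return every 2-D slice of a cubic volume along each of its 3 axes."""
    n = len(vol)
    along_z = [[row[:] for row in vol[z]] for z in range(n)]
    along_y = [[[vol[z][y][x] for x in range(n)] for z in range(n)] for y in range(n)]
    along_x = [[[vol[z][y][x] for y in range(n)] for z in range(n)] for x in range(n)]
    return along_z + along_y + along_x


def placeholder_patch_similarity(patch_a, patch_b):
    """Hypothetical stand-in for the pretrained CNN score (higher is more similar)."""
    return -sum((a - b) ** 2
                for row_a, row_b in zip(patch_a, patch_b)
                for a, b in zip(row_a, row_b))


def composite_similarity(vol_a, vol_b, patch_similarity):
    """Sum the pairwise scores of spatially corresponding patches."""
    return sum(patch_similarity(pa, pb)
               for pa, pb in zip(slices_along_axes(vol_a), slices_along_axes(vol_b)))


def best_shift(simulation, treatment, candidate_shifts, patch_similarity):
    """Return the candidate shift with the maximum composite similarity."""
    return max(candidate_shifts,
               key=lambda s: composite_similarity(
                   simulation, translate(treatment, s), patch_similarity))


def gaussian_blob(n, center, sigma=1.2):
    """Toy n x n x n volume containing a single Gaussian blob."""
    cz, cy, cx = center
    return [[[math.exp(-((z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
              for x in range(n)] for y in range(n)] for z in range(n)]


simulation = gaussian_blob(8, (3.5, 3.5, 3.5))
treatment = gaussian_blob(8, (4.5, 3.5, 2.5))  # same blob displaced by (1, 0, -1) voxels
candidates = list(itertools.product((-1, 0, 1), repeat=3))
best = best_shift(simulation, treatment, candidates, placeholder_patch_similarity)
print("best shift (voxels):", best)
```

The candidate grid here is limited to integer voxel shifts to keep the example short; `translate` also accepts fractional shifts, which is where the trilinear interpolation becomes relevant. Replacing `placeholder_patch_similarity` with a call to a trained network would leave the rest of the search unchanged.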

Subimage selection and 2-D image slicing (patch extraction). (A) 2-D slicing of the 3-D simulation US image, (B) 2-D slicing of the 3-D treatment US image. US indicates ultrasound.
As shown in Figure 4A, the simulation US image is cropped into a subimage,
The treatment US image is cropped into treatment subimages
The output of the network for a 2-D slice pair is a similarity
After obtaining all
Evaluation
We compared the proposed method to results from manual registrations and some of the popular standard intensity-based similarity metrics. As the implementation of intensity-based registration, we used Elastix, 24,25 a widely used image registration tool that offers multiple similarity metrics.
In our experiments, 121 3-D US images from 5 patients (P1-P5) were used for development and validation. The 3-D US images were available from the Clarity system at up-sampled uniform voxel size of 0.58 mm × 0.58 mm × 0.58 mm. (The inherent resolution of US images acquired with the Clarity abdominal transducer is about 0.5 mm in axial [along beam propagation], 2 mm in lateral [within imaging plane], and 4 mm in elevational direction). The data set for development consisted of 38 images from the first 3 patients (P1, P2, and P3). It was used to (1) find the similarity metric with the best performance using the Elastix implementation and (2) to identify the sum of CNN generated similarity scores
Results
Figure 5 presents mean registration errors and respective standard deviations for different similarity metrics used with Elastix on 38 developmental images and different initialization values for the shifts. The advanced normalized correlation (ANC) metric appeared to slightly outperform the other metrics in terms of mean values and spread (as determined by the standard deviation). Thus, the ANC metric was subsequently used with Elastix.

Mean registration errors for different similarity metrics on developmental images.
Figure 6 illustrates registrations performed by the 3 evaluated methods (manual, CNN, ANC) along with the ground truth registration and the starting (no registration) point for the registrations. Figure 6 exemplifies the challenge in interpreting the similarity between US images by clearly demonstrating that even in a case where visually the manual, CNN, and ANC methods performed reasonably well, the registration error varied substantially between them.

Sagittal (left) and coronal (right) planes of a simulation ultrasound image (in yellow) and a treatment ultrasound image (in blue) overlaid after registration with various methods. The x-ray-based fiducial registration serves as ground truth. The reported registration error is the norm of the difference between 2 vectors: the vector for the ground truth shift and the vector for the respective evaluated registration.
The CNN method was then compared to manual registrations (as performed by physicists at the time of treatment) and Elastix with ANC. In the performance comparison, 83 images (second half of the P1-P3 data sets and P4-P5 complete data sets) were used for the evaluation.
Figure 7 illustrates mean errors and respective standard deviations for the 3 evaluated registration methods. Figure 7 (top) presents results without initialization and Figure 7 (bottom) presents results with initialization. The initialization shift was chosen as a random vector of size 4 mm away from the ground truth. Without initialization, the ANC registration performs poorly in comparison to the other methods both in terms of mean errors and standard deviations (Figure 7, top). Without initialization, the CNN performance was comparable to or better than manual registration (Figure 7, top).
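An initialization of this kind can be generated as in the sketch below. The text specifies only the 4 mm magnitude; the assumption that the direction is drawn uniformly over the sphere, and all function names, are illustrative rather than taken from the study.

```python
import math
import random


def random_offset(magnitude_mm, rng):
    """Random 3-D vector of the given length; the direction is uniform on the
    sphere (obtained by normalizing a vector of independent Gaussian samples)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # guard against a degenerate near-zero sample
            return tuple(magnitude_mm * c / norm for c in v)


def initial_shift(ground_truth_shift, magnitude_mm=4.0, rng=None):
    """Starting shift located exactly magnitude_mm away from the ground truth."""
    rng = rng or random.Random()
    offset = random_offset(magnitude_mm, rng)
    return tuple(g + o for g, o in zip(ground_truth_shift, offset))


# Hypothetical ground truth shift in mm; seeded generator for repeatability.
ground_truth = (1.0, -2.0, 0.5)
start = initial_shift(ground_truth, 4.0, random.Random(0))
print("distance from ground truth:", round(math.dist(start, ground_truth), 6))
```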

Performance comparison of different US–US registration methods: proposed (CNN), manual registration, and ANC (Elastix). Top: Mean errors without registration initialization. Bottom: Mean errors with registration initialization. In this case, CNN (proposed) and ANC registrations are performed starting with a randomly selected 4 mm initial shift from the ground truth registration. US indicates ultrasound; CNN, convolutional neural network; ANC, advanced normalized correlation.
With initialization around the ground truth, ANC performance improved (Figure 7, bottom) but remained inferior to that of CNN both in terms of mean errors and standard deviations. With initialization, the performance of the proposed CNN method remained comparable to or better than manual registration both in terms of mean errors and standard deviations.
Figure 8 presents the cumulative distributions of the registration errors for the evaluated methods across all 5 patients on the 83 validation image pairs. With initialization, CNN registration errors were smaller than 5 mm in 88% of the cases; without initialization, in 81% of the cases. The corresponding values for the ANC method were 62% and 25%, respectively, whereas manual registration errors were smaller than 5 mm in 61% of the cases. These results demonstrate the improvement in overall registration accuracy that can be achieved with a pretrained CNN in comparison to standard manual or automatic intensity-based registration techniques.
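The percentages quoted here are points on an empirical cumulative distribution of registration errors. A minimal sketch of that computation is given below; the function names and error values are hypothetical and do not reproduce the study data.

```python
def fraction_within(errors_mm, threshold_mm=5.0):
    """Fraction of registration errors strictly smaller than the threshold."""
    if not errors_mm:
        raise ValueError("at least one error value is required")
    return sum(e < threshold_mm for e in errors_mm) / len(errors_mm)


def cumulative_distribution(errors_mm):
    """Sorted errors paired with the cumulative fraction of cases at or below each."""
    ordered = sorted(errors_mm)
    n = len(ordered)
    return [(e, (i + 1) / n) for i, e in enumerate(ordered)]


# Hypothetical registration errors in mm (illustrative values only).
example_errors = [1.2, 4.9, 5.0, 7.3]
share = fraction_within(example_errors)
curve = cumulative_distribution(example_errors)
print(f"{share:.0%} of cases below 5 mm")
```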

Cumulative distributions of registration errors for the proposed (CNN), manual, and ANC registration methods. Top: Without initialization. Bottom: With initialization. CNN indicates convolutional neural network; ANC, advanced normalized correlation.
Discussion and Conclusion
In this article, we designed and evaluated a 3-D US image registration framework based on a pretrained deep CNN. Comparative evaluation of the method against manual registration and automatic intensity-based registration with an ANC similarity metric demonstrated significantly improved accuracy and reliability with the pretrained CNN approach. One limitation of the study is that the registration transformation had to be limited to translations only since the available “ground truth” registrations were 3-D translations obtained by x-ray-based marker matching performed by treating therapists. Standard intensity-based registrations may perform better if deformations are considered and this scenario should be the subject of further investigations.
Our results on the accuracy of the pretrained deep CNN approach to US–US registration need to be interpreted in the context of several uncertainties related to the establishment of the “ground truth” x-ray-based image registration. Prostate deformations, for instance, may be present between simulation and treatment due to differences in rectal and bladder filling as well as probe pressure. The magnitude of these deformations is patient and session dependent. We evaluated the prostate distortions by measuring the relative changes in interfiducial distances from simulation to treatment. On average, the relative change was smaller than 2% or 0.5 mm for mean interfiducial distance of 25 mm.
Uncertainty in marker localization arising from user bias in x-ray image interpretation is another source of error in the determination of the ground truth. We evaluated this by comparing the x-ray-based shifts that we calculated to the shifts approved and applied by the therapists during the actual treatments. The standard deviation of the difference vector was (0.6, 0.6, 0.5) mm, resulting in approximately 2 mm overall uncertainty at the 95% confidence level. This number provides an estimate of the ground truth error in our study.
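The text does not state how the 2 mm figure was derived from the per-axis standard deviations. One plausible reading, shown below purely as an assumption, combines the three components in quadrature and applies a normal-distribution coverage factor of about 1.96.

```python
import math

# Per-axis standard deviations of the difference vector, in mm (from the text).
sigma_mm = (0.6, 0.6, 0.5)

# Assumption: independent components combined in quadrature.
combined_sigma_mm = math.sqrt(sum(s * s for s in sigma_mm))

# Assumption: two-sided 95% coverage factor of a normal distribution.
coverage_factor = 1.96
uncertainty_95_mm = coverage_factor * combined_sigma_mm

print(f"combined sigma: {combined_sigma_mm:.2f} mm")
print(f"95% level estimate: {uncertainty_95_mm:.1f} mm")
```

Under these assumptions the estimate rounds to roughly 2 mm, which is consistent with the value quoted above; other derivations could give a similar number.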
Our results clearly indicate the potential of using deep CNNs for 3-D US image registration, but the overall accuracy of the current approach based on a specific, pretrained CNN is not sufficient to meet the requirements of prostate IGRT even after considering uncertainty in ground truth registrations. This is not surprising as the CNN was pretrained with nonmedical image data. Hence, it is expected that training the CNN with actual US data can notably enhance the CNN performance, and future work will involve network training on a large data set of US images. Furthermore, for practical implementation additional performance optimizations will be necessary. On our hardware, it takes about 5 milliseconds to compute the CNN similarity between a pair of 64 × 64 2-D patches. Thus, about 1 second is necessary to calculate the similarity between a pair of 3-D images, as this involves 64 × 3 = 192 evaluations between 2-D patches. In comparison, normalized mutual information computation took about 5 milliseconds. A straightforward optimization, for instance, would be to reduce the number of patches used for composite similarity measurement to only a few that are rich in relevant anatomical features.
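The timing estimate follows from simple arithmetic, reproduced here as a short sketch. The reduced patch count in the last step is a hypothetical value chosen only to illustrate the proposed optimization.

```python
# Figures quoted in the text.
slices_per_axis = 64        # 64 slices of 64 x 64 pixels along each axis
axes = 3
ms_per_patch_pair = 5       # approximate CNN evaluation time per 2-D patch pair

patch_pairs = slices_per_axis * axes
total_ms = patch_pairs * ms_per_patch_pair
print(f"{patch_pairs} patch pairs, about {total_ms / 1000:.2f} s per 3-D image pair")

# Hypothetical reduction to a few feature-rich patches per axis.
reduced_pairs = 8 * axes
reduced_ms = reduced_pairs * ms_per_patch_pair
print(f"{reduced_pairs} patch pairs, about {reduced_ms} ms per 3-D image pair")
```

Note that this is the cost of a single composite similarity evaluation; a full registration evaluates the similarity once per candidate shift.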
We expect that performance optimizations and training on application-specific US images will allow CNN-based registration to robustly address the challenge of US-to-US prostate registration and eliminate a major obstacle for US IGRT.
Footnotes
Authors’ Note
The data collected in this study were acquired under Stanford IRB-approved protocol #27372 “Feasibility of using trans-perineal Clarity Autoscan ultrasound imaging for prostate motion management, tissue characterization, and treatment monitoring.” Patients accrued under this protocol provided written consent for participation in the study and publication of the findings. Our study was approved by the Stanford University Research Compliance Office, IRB number 27372. All patients provided written informed consent prior to enrollment in the study. TCRT-18-0047.R3.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The corresponding author received research funding from Philips and Elekta.
