Probabilistic maps for deep brain stimulation e Impact of methodological differences

Background: Group analysis of patients with deep brain stimulation (DBS) has the potential to help understand and optimize the treatment of patients with movement disorders. Probabilistic stimulation maps (PSM) are commonly used to analyze the correlation between tissue stimulation and symptomatic effect but are applied with different methodological variations. Objective: To compute a group-speci ﬁ c MRI template and PSMs for investigating the impact of PSM model parameters. Methods: Improvement and occurrence of dizziness in 68 essential tremor patients implanted in caudal zona incerta were analyzed. The input data includes the best parameters for each electrode contact (screening), and the clinically used settings. Patient-speci ﬁ c electric ﬁ eld simulations (n ¼ 488) were computed for all DBS settings. The electric ﬁ elds were transformed to a group-speci ﬁ c MRI template for analysis and visualization. The different comparisons were based on PSMs representing occurrence (N- map), mean improvement (M-map), weighted mean improvement (wM-map), and voxel-wise t-statis-tics (p-map). These maps were used to investigate the impact from input data (clinical/screening set- tings), clustering methods, sampling resolution, and weighting function. Results: Screening or clinical settings showed the largest impacts on the PSMs. The average differences of wM-maps were 12.4 and 18.2% points for the left and right sides respectively. Extracting clusters based on wM-map or p-map showed notable variation in volumes, while positioning was similar. The impact on the PSMs was small from weighting functions, except for a clear shift in the positioning of the wM-map clusters. Conclusion: The distribution of the input data and the clustering method are most important to consider when creating PSMs for studying the relationship between anatomy and DBS outcome. © 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Deep brain stimulation (DBS) is an established procedure for alleviating motor symptoms in diseases such as Parkinson's disease, essential tremor (ET), and dystonia. Using electrical pulses, DBS can impact the neuronal tissue and disrupt pathological signals [1]. Different targets have been suggested but the optimal locations for stimulation are still debated [2].
A common method to evaluate the spatial impact of the stimulation is to simulate the electric field around the DBS lead using the finite element method (FEM) [3,4]. The volume of tissue activated (VTA) by the stimulation can be estimated from an electric field threshold based on single cable neuron model simulations [5]. By using a fixed isolevel it is possible to compare simulations within a patient and to combine simulations for group analysis.
The number of studies aimed to do group analysis in order to spatially evaluate the effect of DBS has increased during the last decade, commonly by creating probabilistic stimulations maps (PSM) [6]. These studies include a variety of disorders and targets where Parkinson's disease and the subthalamic nucleus (STN) are the most frequently investigated [6e11]. Most of these studies focus on finding the best anatomical area to stimulate for improving the symptom profile. Adverse effects have also been investigated, but very few of these studies include PSMs [8,12]. A typical workflow for creating PSMs starts with transforming each patient's simulations to a common space, either to an existing reference like MNI152 [13] or a group-specific template [14]. Then, voxel-wise statistics are applied where the result from electric field simulations are combined with the clinical effect. The outputs are PSMs which estimates the improvement or possibility of side effect in each voxel. Some studies also include a validation step, e.g. outof-sample or cross-validation, to estimate the level of predictability of the created map on new patient data [6,8,15].
Numerous factors in the workflow can impact the result of a disease-specific PSM [16]. For example, many studies create PSMs based on the optimized clinical settings used in the patient's daily life, and few studies use data from monopolar review, where each contact is screened for the best outcome. Another example is the use of a binary VTA or the alternative to introduce a weighting function derived from the electric field or distance from DBS lead, as first presented by Eisenstein et al. [7]. Dembek et al. [17] used in silico data and showed a notable variation in sweet-spots size and location using five different binary methods. However, to our knowledge, there is no systematic study investigating the impact of methodological alternatives on the created PSMs using patient data.
The aim of this study was to evaluate how methodological choices impact the PSMs. A dataset of 68 ET-DBS patients implanted in caudal zona incerta (cZi) were used to investigate the influence from 1) clinical vs screening data, 2) clustering methods for extracting regions of high or low improvement, 3) sampling resolution of the electric field and 4) introduction of a weighting function. All data was analyzed and visualized on a cohort specific MRI template.

Materials and methods
This section is divided into three main parts. First, the patientspecific data and computations are described together with the process of moving from patient space to a common space for group analysis. Second, the methods for creating PSMs are described. Finally, the analyses for comparing methodological differences are presented. The complete processing workflow from clinical input to PSMs is described in Fig. 1 Clinical-and anatomical stimulation data for parts of this cohort has been previously published [15,18,19]. DBS contact evaluations and brain images from 77 patients were included in this study. All patients were diagnosed with ET by a movement specialist and had received cZi DBS (3387 or 3389, Medtronic, Minneapolis, MN, USA). From the 77 patients, 9 were excluded in the statistical analysis due to incomplete stimulation or improvement data. For the remaining 68 patients (31 females) the mean age at time of surgery was 63.6 years. 13 patients received bilateral DBS electrodes, 54 unilateral and one had two ipsilateral electrodes (both implanted in cZi with different laterality). Of the DBS leads 64 were on the left and 19 on the right hemisphere. Informed consent was obtained according to the Declaration of Helsinki, and the study was approved by the Ethics Committee at Umeå University (Dnr. 2017/122e31).
The surgical procedure has been described in detail before [20]. Preoperative stereotactic T 1 and T 2 MRI images (1.5 T) were used for direct targeting of the cZi. After surgery, a stereotactic CT was performed to confirm electrode position.
Using the essential tremor rating scale (ETRS), patients were evaluated preoperatively and 1 year after surgery, both off stimulation and with the clinically used stimulation settings [21]. Each individual contact was evaluated 1 year postoperatively with standardized stimulation parameters: monopolar stimulation with 60 ms pulse width, 145 Hz, and increasing amplitude in increments of 0.2e0.3 V up to 5 V. In this manner, the stimulation was iterated for each contact to achieve the best effect on tremor and hand function based on ETRS item 5/6 and 11e14. At the same time, side effects were registered and categorized. In this paper, the side effect dizziness is used as an example. The lowest amplitude at which the highest tremor suppression occurred from the individual contact evaluations will be referred to as screening data and the clinically used settings as clinical data.

FEM simulation
To estimate the electrical impact of the DBS, the electric field in the tissue was computed using FEM simulation (Comsol Multiphysics v. 5.5, COMSOL AB, Sweden), based on the method from Åstr€ om and colleagues [3,5,22], see also Fig. 1b. Shortly, the volume conductor model was used with a heterogenous isotropic brain conductivity model based on the patients preoperative T 2 images and stimulation parameters. The geometry of the DBS lead was modelled and placed according to the artefact in the postoperative CT and a 250 mm peri-electrode space modelled as white matter was included [23,24]. In total, the simulations were restricted to a box of 100 Â 100 Â 100 mm which was meshed with tetrahedral elements (mesh element size 0.02e2 mm, smallest closest to the lead). Since anodic stimulation impact the neurons to less extent, the resulting electric field from bipolar stimulation on the anodic side was scaled with a factor of 0.6 [25,26]. Based on neuron model simulations, an isolevel of 0.2 V/mm was selected to threshold the electric field since lower electric field magnitude is assumed to not impact the neurons [5,27]. Simulation results based on pulse width 90 ms were scaled with a factor of 1.33 which corresponds to an isolevel of 0.15 V/mm [5]. The scaling was required to compare simulation results spatially, since longer pulse width can stimulate neurons at lower electric field magnitude.
In total, 488 simulations were computed, 73 for clinical settings, 312 for screening settings and 103 where dizziness occurred. These results were further transformed to a common space for group analysis.

MRI template and electric field resampling
In order to conduct group-level analysis, spatial anatomical normalization to a template volume was performed using a methodology previously presented [14,28], Fig. 1c. It consists in the creation of an initial affine template using the MNI nLin-aSym 2009b as reference [13]. The affine template is then used as reference for iterative non-linear normalization, five times with the coarse registration settings from Ewert [29] and three times with finer registration settings from Vogel [28]. Each iteration is followed by scaling of the resulting transforms to avoid divergence of the anatomy [30]. This process outputs an MRI template, which is an average representation of all included patients, and deformation fields explaining the transform from patient space to the template. The normalization was conducted for 71 of the 77 patients. Six patients were excluded in the template, due to disproportionally higher resolution in the MR images compared to the rest of the group. These six patients were directly registered to the final template with the finer registration settings.
For every simulation result, the tetrahedral meshes were transformed to the template space by applying each patient's deformation field produced during the normalization process, Fig. 1d. The transformed electric fields were then resampled to voxel grids in the template space with Gaussian distance weighting based on the 4 mesh points closest to the voxel position on the grid. Since simulations were not computed within the lead, corresponding voxel values were interpolated to avoid holes in the results. To reduce data size, the resampling for each simulation result was limited to the region with electric field magnitude above 0.16V/ mm (0.2V/mm x 0.8 margin factor). Two different sampling resolutions were used, one identical to the anatomical template (500 mm isotropic) and one five times higher (100 mm isotropic). In order to provide the outlines of anatomical structures of the deep brain in the visualizations, the deep brain atlas from Morel [31] in its digital version [32] was registered to the group template using ANTS-SyN with the finer registration settings.
The normalization, transforms and resamplings were done on the Sigma Cluster System from the National Supercomputer Centre at Link€ oping University.

Probabilistic stimulation maps
Electric field simulations from all patients were combined to create PSMs which visualizes voxel-wise descriptive statistics and statistical tests. The descriptive statistics were mean value (M-map), weighted mean value (wM-map) and occurrence (N-map). The statistical tests resulted in p-value maps (p-map), Fig. 1e. The maps were computed for the screening data, clinical data and where dizziness occurred for the left and right implantation sides separately. To increase the statistical power, voxels with less than 10% of the data or 5 simulations, whichever greatest, were removed from the maps except for the N-map. All statistical computations were done in MATLAB (v. 2021a, The MathWorks Inc., Massachusetts, USA).

Improvement
In each voxel, the mean improvement in tremor reduction (TR) weighted with the electric field was computed, resulting in a wMmap. To promote a lower amplitude, i.e. smaller VTA, the electric fields were divided by the stimulation amplitude which effectively create a distance weighting according to is the stimulation amplitude, TR (%) is the tremor reduction, i denotes the voxel and k the simulation. The baseline for the TR was the 1-year postoperative OFF stimulation score.
A double-sided test was constructed to find voxels representing high or low clinical improvement by using the following t-test statistics where N is the number of simulations impacting voxel i and s is the standard deviation. The test is designed to find voxels significantly deviating from 70% improvement, which is the mean of all stimulations from the screening sessions. The standard deviation of the voxel-wise mean improvement was calculated as a weighted standard deviation.
This test resulted in a p-map, where all voxels have a corresponding p-value from the statistics where p < 0.05 was regarded as significant. To elucidate if a voxel was correlated to improvement higher or lower than 70%, the p-values were multiplied with the sign of the t-statistics creating a signed p-map. For comparability, same threshold was used for both the screening and clinical datasets.

Side effect
The side effect, in this case dizziness, was tested in each voxel by a double-sided Wilcoxon signed rank test. The rank was given by comparing the electric field magnitude from stimulations with or without occurrence of dizziness. The simulations without dizziness were created by reducing the amplitude of each stimulation resulting in dizziness, assuming lower amplitude will not generate the side effect. The amplitude reduction was 1V but for side effects occurring at <1V the non-side-effect voltage was chosen to 0.1V. All voxels with p < 0.05 were considered as significant.

Permutation test
To correct for type 1 error of multiple comparisons, due to a multitude of voxels tested on the same data, a permutation test was implemented where the effect values were permuted 200 times, as first presented in Ref. [7]. For the screening data, the effects were first permuted between the DBS leads and secondly within the leads [8]. For side effect and clinical settings, all effect values were permuted in one step. For each permutation, the statistical test (Eqs. (1)e(3)) was re-performed. To test if a random permutation gives higher significance than the true p-map, an overall statistic (Q) was created and calculated for each permutation where p i is the p-value in voxel i [7]. The Q statistics were used to calculate an overall p-value according to where N perm is the number of permutations, Q true is the Q-value for the true p-map and Q n the Q-value for the n:th permutation. A p tot < 0.05 assumes an overall significance of the true p-map.

Visualization
For clinical, screening and side effect settings the different PSMs were created. The N-maps were created as a voxel-wise sum of all simulations with electric field above the 0.2 V/mm isolevel and used to compare stimulation location. The p-map was thresholded at p < 0.05 to create clusters of significantly higher or lower improvement, clusters separated based on the sign of the t-statistics. In the visualizations, the natural logarithm of the inverse p-map was used to improve the dynamic. For dizziness, only p-maps and N-maps were created. All maps were displayed on the MRI template, section 2.1.3, together with labels and outlines from Morel's Atlas [31].
To evaluate the stimulation maps in correlation to the anatomy, the overlaps of each significance cluster with the structures in Morel's Atlas were computed.

Comparison of methodological differences
Four different parameters were altered to evaluate the impact on the PSMs. In general, visual inspection and general comparison of outputs of the different maps were conducted in all analyses.
First, the dependency on input data, i.e. screening vs clinical data, was evaluated. The PSMs were created based on the different data sets and the voxel-wise difference in the wM-maps from screening and clinical data were computed. Only the voxels included in the analysis for both settings were used. The voxel differences were used to create 99% confidence intervals where a zero difference implies no impact of the input data.
Second, different methods for creating high or low improvement clusters were compared. The method based on the p-map was compared to clustering based on the 10% voxels with highest and lowest value in the wM-map. The intersecting volumes between the p-map clusters and clusters based on the percentile method were computed for quantification.
Third, the impact of sampling resolution was compared by first upsampling the result from template resolution to the higher resolution. The differences in the wM-maps and p-maps between the high resolution and template resolution results was computed in each voxel. These differences were also used to compute 99% confidence intervals.
Fourth, the PSMs based on weighting function were compared to PSMs from binary VTAs. Based on the presented clustering method, intersecting volumes of the binary and weighted clusters were calculated. Again, the voxel-wise differences between the wM-maps and M-maps, and the different p-maps were calculated and used for 99% confidence intervals. For these comparisons all PSMs were re-computed but instead of the electric field, all voxels above 0.2 V/mm were set to 1.
All confidence intervals were based on double sided Student t-test.

Results
The generated MRI template with labels from Morel's atlas [31] is found in Fig. 2. All PSMs and volumes are presented at 5 times template resolution if not otherwise stated. For evaluation of improvement, only the left side p-map from the screening data survived correction for multiple comparisons (Table 1) and therefore the visualizations and comparisons are primarily presented for the left side. An overview of the data included can be found in Appendix A, Table A1 and Figure A1, and visualization for all improvement PSMs, in Figure A2.
3.1. Impact of methodological differences The different PSMs, N-map, wM-map, and p-map, for the clinical and screening data on the left side are presented in Fig. 3a. Clinical settings provide a more focal N-map and higher voxel-wise improvement values compared to screening settings which is an expected result from the optimization of clinical settings. The wMmaps and p-maps differ for the clinical and screening data both slightly in the position of the highest improvement location and the level of improvement.
The voxel-wise mean difference in improvement between the input datasets was 12.4 and 18.2% points for left and right side respectively (Fig. 3b and c), which agrees with the skewness in the visualizations.
Investigating the p-map, the clinical data has a large cluster (312 mm 3 ) of voxels significantly above 70% improvement, while the low improvement cluster is very small (0.3 mm 3 ), Fig. 4b and c. This skewness was expected and is a reinforcement of the result in the wM-map (Fig. 3a). The overlaps between p-map clusters and different anatomical structures ( Figure A2 c-d) show that both screening and clinical data on the left side have high coverage of the high improvement cluster with VLpv and VM. The main difference is that the clinical data high improvement cluster overlaps 11% of the STN volume, which is a much larger overlap than for the screening data (1.5%).    Fig. 4 presents the high and low improvement clusters based on the significance test and the wM-map clusters based on the voxels with 10% highest or lowest improvement. The p-maps consistently generated larger clusters of high improvement (range: 4e312 mm 3 ) compared to the wM-map clusters (range: 2e58 mm 3 ), Fig. 4b. For low improvement volumes, the wM-map clusters (range: 2e58 mm 3 ) were larger than the p-map clusters (range: 0e109 mm 3 ), except for left side screening, Fig. 4c. In 5 of the 8 cluster comparisons, the intersecting volume was >99% of the smaller cluster volume. The differences between the methods can be described by the difference in cluster volume and that the voxels of higher significance are more centralized than the percentile clusters. Fig. 5 shows the results of sampling the electric field at different resolutions. The higher resolution generates a smoother result, but visually the same general trends can be seen (Fig. 5a) and the resulting p-map cluster volumes are very similar (Table 2). However, the difference in mean and p-value for each voxel generated a distribution slightly shifted from zero for 7 of 10 analyses (Fig. 5bed). The 99% confidence intervals of the voxel-wise difference in the pmaps are shown in Table 2 and in the wM-maps in Table 3. The confidence intervals showed significant difference except for p-maps for dizziness (left), screening (right) and clinical setting (right).

Weighting
The impact of using the electric field norm as weighting function compared to binary VTA is visualized in Fig. 6. Even though the voxel values in the binary M-map are shifted towards higher improvement (Table 3) the weighted statistical tests reach lower p-values ( Table 2) and larger volumes of significance in 8 of 10 analysis (Fig. 6b and c). The impact on the p-maps was similar for both improvement and dizziness. The largest differences were when creating clusters based on the 10% most extreme voxels where the intersections between binary and weighted clusters were 37e59% of the binary VTA for high improvement and 29e81% for low improvement. Fig. 7 presents the N-map and p-map for dizziness and for comparison the N-map from screening of improvement. The dizziness N-maps are similarly located as the screening of improvement but with less spread (Fig. 7a). For dizziness, the significance test survived multiple comparison for both sides at template resolution but only on the left side at high resolution ( Table 1). The p-map (Fig. 7c) forms a shell-like structure that spans from STN up in thalamus. The volumetric overlaps between p-map cluster and different anatomical structures (Fig. 7d), show overlap with several structures in the thalamic and subthalamic area. VPM, VM and VLpv are most covered by the dizziness cluster, but VPM is the only structure covered to more than 10%. CM has a very high overlap on the left side, while almost none on the right side and the opposite is seen for STN.

Discussion
In this study we present a method for creating and comparing PSMs for improvement and side effects in a cohort of ET patients. All results were visualized on a cohort specific MRI template. Four different PSMs (N-map, M-map, wM-map, and p-map) were used to evaluate the impact of using different input data, clustering methods, sampling resolution of the electric field and the use of weighting function. The result shows that all these factors impact the maps and the implications of them, but to a varying extent as summarized in Fig. 8.

Impact of input data
The improvement was evaluated both using clinical stimulation settings and settings from screening sessions where a large difference can be seen in the wM-maps and p-maps, Fig. 3. Since the clinical data are optimized for improving the symptom, it is natural that the wM-map shows a shift towards higher improvement. However, only the screening data for the left side survived correction for multiple comparisons. The permutation strategy is useful for limiting type I errors but gives a global significance and cannot give implication on local result. Another result from the permutation strategy is that highly homogeneous data will not reach significance, which was the case for the clinical data in this study (Table 1). Since the clinical data is more biased towards higher improvement, the possibility of receiving high voxel mean value by chance increases. This also emphasize one problem with using clinical data for creation of p-maps especially for disorders such as ET where the improvement is very homogenous.

Impact of clustering methods
Several groups have investigated the possibility to create improvement maps in DBS. Some conducted the statistical test against a null hypothesis of 0 improvement [7,12,33], while others suggest more conservative approaches since some improvement is usually expected from all DBS contacts [8]. In this study, we choose to test against the mean improvement found during the screening sessions of the cohort. The intention of using the mean improvement was to single out the voxels that represented extreme responders. The selected value to test against is an important aspect when interpreting the data, since the voxels that did not meet significance do not imply bad improvement, but rather that the improvement is not significantly different from 70%, which was the chosen threshold in the statistical test.
Regarding side effects, the implemented method is intended to find the border where side effects start to occur. While it would be optimal to be able to retrieve a region as for the improvement, this is not possible with this type of data since the stimulation settings are only increased until the side effect appears rather than until they are maximized as for the desired clinical effect.
Two common approaches used to find an optimal location for stimulation are either a voxel-wise statistical test [8,34] or extracting the voxels with mean improvement above the 90th percentile [12,35]. As presented in Fig. 4, the similarities are clear since the intersecting volume almost totally covers the smallest generated cluster. However, the cluster volumes retrieved differ to a large extent within and especially between the methods. The percentile-based clusters will always generate a descriptive view of the extreme voxels in the dataset, however with an unknown level of confidence and no consideration to the data variance. Due to the Table 3 Confidence intervals for voxel-wise difference in weighted mean or comparison of template resolution sampling with sampling at 5x template resolution. L ¼ left, R ¼ right, pp ¼ percentage points, * Denotes significant difference between tested maps.

Mode
Side  6. Impact of weighting function. Visualization of the difference in using binary volume of activation compared to weighting with the electric field. a) Shows the M-map/wMmap in first row, the corresponding 10% best/worst cluster in the second row and p-map on last row. All visualizations are for the left side screening overlayed on the T 2 -MR template. The p-map cluster volumes together with intersection volume from the weighted vs binary p-maps are shown in b) for high improvement cluster and c) for low improvement clusters. d) Shows the volume for the 10% extreme voxels from the wM-map/M-map together with the intersection volumes for binary vs weighted. Note, the 10% volumes are the same size for both weighted and binary since they are only based on the number of voxels where at least 5 stimulations have an electric field above 0.2 V/mm. methods dependency on the total analyzed volume, it is also sensitive to the spatial distribution of the input.
Dembek et al. [17] used in silico data to compare 5 methods of creating sweet-spot clusters. While they found a notable variation between the methods, the screening data cluster sizes for methods using voxel-wise statistical test were generally larger than the clinical data clusters. This is in contradiction to the result found in this study, Fig. 4. One important difference is that our study uses patient data where the clinical setting is not the same as the best setting from the screening. Also, for comparability, our study used a fixed 70% null-hypothesis instead of adapting it to the dataset and therefore a larger volume for the clinical settings can be expected. However, our study together with Dembek et al. [17] highlights important aspects related to the validity of probabilistic mapping.

Impact of resolution
The resulting maps only showed small differences depending on the resolution where same overall trends were visible and the extracted volumes were similar (Fig. 4, Table 2). However, the higher resolution made details easier to see and created a smoother result without having to apply any filtering. Also, the voxel-wise difference between the maps with template resolution and 5 times template resolution showed a significant difference in 7 out of 10 cases, both for wM-maps and p-maps (Tables 2 and 3). Some differences were expected since the higher resolution can better capture the dynamic of the electric field. It is questionable though if these differences have any clinical importance due to the low magnitude, where highest confidence interval bound for weighted mean was 6% points and for p-value 0.005. What should be noted is that the resampling in this case was done after transforming the electric field to the template space to minimize errors. If the resampling had been done before the transform, which is the most common solution, the impact would have increased. This increasing impact is due to the resampling being a non-linear operation and the transform would have a larger averaging effect at lower resolution.
The main drawback with using high resolution of the electric field is the large increase in computational cost and hardware requirements. However, if that is not of importance for a project, the high resolution is probably a better choice.

Impact of weighting
The electric field was introduced as a weighting function when creating the PSMs, because higher electric field magnitude is expected to impact neurons of smaller size [5]. In the brain, most neurons are distributed around 1 mm in diameter. Axons up to 3e4 mm, which are assumed to be stimulated at the threshold 0.2 V/ mm, are more rare [36,37]. However, combining stimulation with anatomy and clinical response have shown indications that the neurons are starting to be impacted by the stimulation at 0.2 V/mm [27,38]. One problem with just using the electric field as a weighting function is that higher amplitude settings will give much higher electric field values close to the lead. Giving a very high weight to the voxels closest to the lead is not reasonable from a clinical point of view since the higher amplitude is used to reach further from the lead. Moreover, lower amplitudes are generally preferred for energy efficiency and widening of the therapeutic window. Therefore, a punishment to high amplitude was added by dividing the electric field with the voltage setting, similar to what has been done by others [6,12].
Few studies have used weighting functions in PSM creation or prediction algorithms. Eisenstein et al. [7] presented a PSM method with a Gaussian weighting function but did no comparisons with  binary VTAs. Horn et al. [39] illustrated the difference in correlation between improvement and overlap of the STN for PD patients with and without a weighting function based on the electric field. The result indicated a stronger correlation when adding the weighting function, but the effect was not tested for significance. Our study indicates that lower voxel-wise p-values and larger volumes of significance can generally be found when introducing weighting. The largest difference was though for the clustering using the percentile method. Even if the total volumes were the same, the intersecting volumes were in the worst case only 29% of the binary cluster volume. This difference in cluster location implies that the percentile-based clusters are less stable against approximation with a binary volume of activation compared to p-map clusters.

Finding location for improvement and side effects
Clinical analysis of tremor reduction for 50 patients of this dataset is published in an earlier paper [18]. Similar to that study, the main trend was less improvement for the most distal contact. Higher improvement could be found in a very large volume including structures like STN, VLpv, VM and VPM, though with a trend on increasing voltage with more superior contacts, Figure A1. A recent review showed a notable variation in the optimal stimulation spots [40], suggesting that good improvement can be found in many places in the inferior part of the VLpv (the region where Vim is located) and the posterior subthalamic area (PSA). In the present study, the region with most improvement differs between screening and clinical setting. The variation is due to the input data since the mid contacts were chosen more often in the clinical setting ( Figure A2b), together with the possibility to retrieve good improvement in a large volume of the region.
When improvement is easily found, the absence of side effects will be of higher importance. Dembek et al. [12] also investigated the occurrence of dizziness in patient with Vim-Zi DBS and found the significance region in anterior Zi, the VPM and PM, using nomenclature from Mai atlas [41]. Similarly, our study found the VPM to be largely covered and Zi partly covered (Zi is not delineated in the atlas used in this study) (Fig. 2) but also a large coverage of VM and VLpv. The structures stimulated when dizziness and improvement occur are very similar, but the improvement clusters are more focal and the dizziness clusters are present at the outer region. Since many structures are impacted, a clear structural correlation for dizziness cannot be established, which can relate to the side effect being diffuse and probably of multifactorial origin.

Limitation and future aspects
The main limitation for the PSM creation is the strong skewness towards higher improvement in the clinical data. This limitation is also present but to a lower extent for the screening data since only the best improvement for each contact had been collected. Although the dataset is quite large from a DBS perspective, many of the analyses suffer from limited amount of data, especially for the right side. One option could have been to transfer all simulations to the same side, but this was not implemented to not make any assumptions of laterality independence. One possibility to overcome the bias limitation would be to have improvement evaluation for multiple amplitudes per contact [12]. However, full test on each amplitude would extensively increase the time and therefore a more automatic procedure, like accelerometer measurements, would be preferable [42]. Regardless of the data limitations, the comparisons between methods are still valid and the dataset is similar to other publications.
In the future, more studies should be focusing on predictability. There are several studies on this topic for DBS applied on different disorders [8,15,35,43]. Many find significant correlations, but the models usually describe the interpatient variance poorly. A promising addition is to include structural or functional connectivity, but more research is required to find solid sets of predictors for the different disorders treated with DBS.

Conclusions
In this study we have investigated how different parameters impact the result when creating different probabilistic stimulation maps and searching for the optimal or less optimal locations for stimulation. We found that the electric field resampling resolution alter the PSMs to some extent, and higher resolution should be used if possible. Slightly higher impact on the maps was seen by introducing a weighting function and should especially be used for analysis on mean improvement visualizations. The largest impact was observed for the choice of the input data and clustering method. The observed impact of input data also implies that the use of multiple stimulations per position may further alter the result. Therefore, when selecting these parameters, care should be taken into the variance and possible bias in the data and the level of confidence of the resulting clusters.

Funding
This study was financially supported by the Swedish Foundation for Strategic Research (SSF BD15-0032) and by the Swedish Research Council (VR 2016e03564).

Informed consent statement
Informed consent was obtained from all subjects involved in the study.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Patric Blomstedt is consultant for Abbott, Boston Scientific, Koh Young and shareholder in Mithridaticum AB.
Karin Wårdell is shareholder in FluoLink AB.