EyeSQUAD - A Novel 3D Selection Technique, August 2017 - present

1. Introduction

3D interaction techniques are methods that allow users not only to see but also to interact with virtual content, providing a good experience in virtual reality. Manipulation, a prerequisite for many interaction techniques, consists of several subtasks such as selection, positioning, rotation, and scaling (Bowman et al. 2004). As one of the most basic manipulation tasks, selection requires users to accomplish a "target acquisition task" (Bowman et al. 2004). Without selection techniques, virtual content cannot be interacted with in the first place.

 

Ray-casting is the most commonly used selection technique in virtual reality due to its ease of implementation and use. However, its performance is degraded by hand jitter, especially when selecting small or remote targets. Researchers have proposed many solutions and new techniques to address this problem of ray-casting (Haan et al., 2005; Frees et al., 2007; Vanacken et al., 2007; Kopper et al., 2011). However, all of these techniques still require users to control the selection with their hands.

Figure: Ray-casting, SQUAD, and EyeSQUAD

With the appearance of virtual reality headsets with built-in eye tracking, taking advantage of eye tracking could potentially save task-completion time and provide a completely hands-free experience for users. Moreover, if progressive refinement is involved, tasks requiring high precision can be accomplished with eye movements without demanding high precision from the eye tracking device.

 

In this study, combining eye tracking and progressive refinement, we present a novel selection technique, EyeSQUAD, which stands for eye-controlled sphere-casting refined by quad-menu. With EyeSQUAD, users first roughly select a set of objects containing the target with a selection sphere whose center is determined by eye movements. We used an approximation method to stabilize the point-of-regard calculated from the eye ray data. After that, similar to the SQUAD technique, the initially selected objects are evenly and randomly distributed on a quad-menu consisting of four quadrants. Users then fixate on the quadrant that contains the target and confirm the selection. After several refinement steps, users obtain the target.

 

We performed a user study to examine the performance of this new technique in comparison with the ray-casting and SQUAD techniques under different target sizes and distractor densities.

2. EyeSQUAD selection

We designed a novel eye-tracking-based selection technique, eye-controlled sphere-casting refined by quad-menu (EyeSQUAD), which builds on the progressive refinement idea of a previous selection technique, SQUAD (Kopper et al. 2011). Progressive refinement is an indirect method of selection that allows users to first select a set of objects including the target and then refine it step by step until the target is finally obtained. Each step of progressive refinement only requires a "lazy," low-precision selection, while all steps combined can achieve decent performance.

 

The EyeSQUAD technique can be divided into two subtasks: the sphere-casting subtask and the quad-menu refinement subtask.

 

For the sphere-casting subtask, instead of casting a sphere with a controller, EyeSQUAD lets the user control the selection sphere with the eyes by calculating the convergence point from the user's eye ray data. We set the angular diameter of the selection sphere to 26.3°, consistent with the angular size of the selection bubble in SQUAD. The visual size of the selection sphere is kept constant to prevent it from becoming invisible when far away or oversized when very close. When the user confirms the selection, the objects inside the selection sphere become the initial set of objects to be further refined (see Figure 1 left).
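Keeping the sphere's visual size constant amounts to scaling its world-space radius with its distance from the viewer. A minimal Python sketch (function and parameter names are illustrative, not from our C#/Unity implementation):

```python
import math

def sphere_world_radius(distance, angular_diameter_deg=26.3):
    """World-space radius that makes the selection sphere subtend a
    constant angular diameter (26.3 degrees) at the given distance."""
    half_angle = math.radians(angular_diameter_deg / 2.0)
    return distance * math.tan(half_angle)

# Doubling the distance doubles the world-space radius, so the
# sphere looks the same size on screen regardless of depth.
```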

Figure 1: Main scene (left) and QUAD-menu selection scene (right)

This set of objects is then evenly and randomly distributed on a quad-menu with an out-of-context display (Figure 1 right). The user then fixates on the quadrant that contains the desired target and performs the next selection; the objects in that quadrant are redistributed on the quad-menu. Progressive refinement continues until the user obtains the target, or exits midway if the target is lost. Once the quad-menu selection process is finished, the user is transferred back to the original scene. One limitation of this out-of-context design is that it is difficult to select between two similar objects at different depths, since the depth information of the selected objects is discarded once they are arranged on the quad-menu. Another potential limitation is that details for differentiating objects may be lost after they are resized by the quad-menu. Although spheres were used as objects in the experiment for visual consistency from any viewpoint, these two limitations should be kept in mind when implementing EyeSQUAD in realistic scenarios.
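The even-and-random distribution onto the four quadrants can be sketched as a shuffle followed by a round-robin deal. A hypothetical Python illustration (the actual implementation is in C#/Unity):

```python
import random

def distribute_to_quadrants(objects, rng=random):
    """Evenly and randomly deal the selected objects across the four
    quadrants of the quad-menu (quadrant sizes differ by at most one)."""
    shuffled = list(objects)
    rng.shuffle(shuffled)
    quadrants = [[], [], [], []]
    for idx, obj in enumerate(shuffled):
        quadrants[idx % 4].append(obj)  # round-robin deal
    return quadrants
```

With 16 initially selected objects, each quadrant receives exactly 4 of them in random order.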

 

We separated the confirmation of selection from the eye-controlled pointing by requiring a button press on a controller, avoiding a fully closed-loop design and the Midas touch problem (Jacob and Karn, 2003). In a future implementation, we plan to use a "tongue click" sound as input to confirm selection, supporting a completely hands-free experience that could even allow people with disabilities to interact with the 3D world (either virtual or real).

2.1 Closest target approximation

We found a way to stabilize the calculated point-of-regard, also called the "convergence point" in this paper. Instead of directly controlling the selection sphere with the calculated convergence point, given a list of target positions, the selection sphere always moves to the closest target in the environment, determined by finding the minimum sum of the distances from a target to the two eye rays (see the closest target approximation subsection). This requires knowing the target positions and can demand considerable computational power when the number of targets is large. However, in most interaction tasks the positions of targets are reasonably accessible. Moreover, in scenarios with many targets, the whole space can be divided into several parts depending on the number of potential targets, and only the target positions in the part where the convergence point is located need to be considered (see the space partition subsection). In this way, the concern about computational cost can be addressed as well. This approximation of controlling the selection sphere ensures that the user catches at least one potential target at a time and is efficient in environments with aggregated objects.

Figure 2: Schematic of the closest target approximation method

As shown in Figure 2, suppose the positions of the two eyes in the environment are $P_l = (x_l, y_l, z_l)$ and $P_r = (x_r, y_r, z_r)$, and the normalized directions of the two eye rays are $\hat{d}_l$ and $\hat{d}_r$, respectively ("l" denotes the left eye, "r" the right eye). The position of a certain object $i$ in the environment is written as $P_i = (x_i, y_i, z_i)$.

Then the distance from object $i$ to the left eye ray can be calculated as

$D_{i,l} = \lVert (P_i - P_l) \times \hat{d}_l \rVert,$

and similarly for the distance $D_{i,r}$ from object $i$ to the right eye ray. The sum of the distances from object $i$ to the two eye rays is

$D_i = D_{i,l} + D_{i,r}.$

Finally, the closest target is found by

$i^* = \arg\min_i D_i.$
The calculated point-of-regard is then the position of the closest object. The selection sphere always moves toward this calculated point-of-regard with a constant speed of 6 m/s. Since eye movements are saccadic (Deubel and Schneider, 1996), we decided to move the sphere with continuous motion rather than transporting it instantly to the newly calculated point-of-regard, to avoid inducing severe simulator sickness.
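The closest target approximation can be sketched in a few lines of Python, assuming NumPy is available and using illustrative function names; the point-to-ray distance is the standard cross-product formula:

```python
import numpy as np

def point_to_ray_distance(p, origin, direction):
    """Perpendicular distance from point p to the ray (origin, direction)."""
    d = direction / np.linalg.norm(direction)  # normalize the ray direction
    return np.linalg.norm(np.cross(p - origin, d))

def closest_target(targets, left_pos, left_dir, right_pos, right_dir):
    """Return the target position minimizing the summed distance
    to the left and right eye rays."""
    sums = [point_to_ray_distance(p, left_pos, left_dir)
            + point_to_ray_distance(p, right_pos, right_dir)
            for p in targets]
    return targets[int(np.argmin(sums))]
```

For example, with both eye rays converging on a point, the target at that point yields a summed distance of zero and is selected.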

2.2 Space partition

Figure 3: Schematic of the space partition method

Algorithm:

            SpacePartition(Space C)

            1. Partition space C into several parts.

            2. Among the partitioned parts that contain objects, find the part C* whose center is closest to the eye rays, using the closest target method.

            3. If only one object is left in C*, return the position of that object;

                else, return SpacePartition(C*).

 

Space Complexity: O(N), where N is the total number of objects in the space C.

Time Complexity: Assuming objects are evenly distributed in the space, the expected worst-case running time is O(N log N).

The closest target approximation method may be computationally expensive, especially in a cluttered environment. To save computational power and optimize the approximation algorithm, space partition (Figure 3) can be used: the whole space is first partitioned into several parts, depending on the number of target positions in the environment, and then the closest center among the partitioned parts that contain targets is found (e.g., C1 in Figure 3). Empty parts are simply ignored. The closest target method is then applied only within that closest part (e.g., the C1 part in Figure 3), rather than computing the distances from all possible target positions to the eye rays. If only one object is left in the closest part (the one whose center is closest), the method returns the position of that object, which is taken as the current calculated point-of-regard. Otherwise, the method recursively calls the space partition algorithm with the closest part as input. Through this recursion, the space is partitioned into smaller and smaller parts until the closest target is finally caught. Compared with brute force, this saves considerable computational cost, especially when the number of initially selected objects is large.
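One possible realization of the space partition recursion is sketched below in Python with an octant-based partition; the partitioning scheme, NumPy usage, and names are assumptions for illustration, not the exact implementation:

```python
import numpy as np

def ray_distance(p, origin, direction):
    """Perpendicular distance from point p to the ray (origin, direction)."""
    d = direction / np.linalg.norm(direction)
    return np.linalg.norm(np.cross(p - origin, d))

def eye_score(p, eyes):
    """Summed distance from p to the left and right eye rays."""
    return sum(ray_distance(p, o, d) for o, d in eyes)

def space_partition(points, lo, hi, eyes):
    """Recursively partition the box [lo, hi] into octants, descend into
    the nonempty octant whose center is closest to the eye rays, and
    return the single remaining point. Assumes distinct points."""
    if len(points) == 1:
        return points[0]
    mid = (lo + hi) / 2.0
    best = None
    for mask in range(8):  # enumerate the 8 octants of the box
        sel = np.array([(mask >> k) & 1 for k in range(3)], dtype=bool)
        cell_lo = np.where(sel, mid, lo)
        cell_hi = np.where(sel, hi, mid)
        inside = [p for p in points
                  if np.all(p >= cell_lo) and np.all(p <= cell_hi)]
        if not inside:
            continue  # empty parts are ignored
        score = eye_score((cell_lo + cell_hi) / 2.0, eyes)
        if best is None or score < best[0]:
            best = (score, inside, cell_lo, cell_hi)
    _, inside, cell_lo, cell_hi = best
    return space_partition(inside, cell_lo, cell_hi, eyes)
```

Each recursion level halves the box along every axis, so distant objects are discarded early instead of being compared against the eye rays one by one.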

3. Methods
3.1 Experiment design

We evaluated the performance of the EyeSQUAD technique against the ray-casting and SQUAD techniques on a selection task: acquiring a target surrounded by several distractors in a virtual environment. The size of the objects and the density of the distractors varied across conditions.

3.1.1 Goals and Hypotheses

The purpose of the experiment is to determine the tradeoffs between the EyeSQUAD technique and two previous selection techniques, ray-casting and SQUAD. Ray-casting requires only a single but precise click, while SQUAD and EyeSQUAD allow users to select with little precision at each step but require several steps. EyeSQUAD may improve on the accuracy and speed of SQUAD, since selecting with the eyes is more intuitive and saves time. Moreover, EyeSQUAD avoids several selection issues introduced by hand controllers, such as hand jitter and the occupation of at least one hand during selection. This study aims to answer two research questions:

 

1) Can the EyeSQUAD selection technique outperform previous selection techniques such as ray-casting and SQUAD?

2) Does target size or distractor density influence the performance of these selection techniques?

 

Considering these tradeoffs and research questions, we hypothesized that:

(H1) The time to select a target with SQUAD or EyeSQUAD will not be affected by target size, while ray-casting will be slow with small targets and fast with large targets.

(H2) The time to select a target with SQUAD or EyeSQUAD will be proportional to the number of distractors in the virtual environment, while the performance of ray-casting will not be influenced by distractor density.

(H3) EyeSQUAD will outperform ray-casting when the number of distractors is small.

(H4) EyeSQUAD will outperform SQUAD in all conditions.

(H5) SQUAD and EyeSQUAD will produce virtually no errors due to their low precision requirements, while ray-casting will produce more errors as target size decreases.

3.1.2 Design

Since individual differences with eye tracking are significant (Goldberg and Wichansky, 2003), we used a 3×3×3 factorial within-subject design with repeated measures. The three independent variables were: technique (ray-casting, SQUAD, EyeSQUAD), target size (small: radius 0.01 m or 0.26°; medium: radius 0.015 m or 0.40°; large: radius 0.04 m or 1.06°), and distractor density (sparse: 16; medium: 64; dense: 256). The two dependent variables were time to complete a task and mean number of errors per trial.

The order of presentation of the techniques was counterbalanced, while each of the nine combinations of target size and distractor density was repeated 8 times and presented in random order.
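With 3 techniques there are 3! = 6 possible presentation orders, so counterbalancing over 24 participants can assign each order to exactly 4 of them. A hypothetical sketch (the study's actual assignment scheme may differ):

```python
from itertools import permutations

TECHNIQUES = ("ray-casting", "SQUAD", "EyeSQUAD")

# All 3! = 6 possible presentation orders.
orders = list(permutations(TECHNIQUES))

def order_for(participant_id):
    """Technique order for a participant; cycling through the 6 orders
    assigns each order to exactly 4 of 24 participants."""
    return orders[participant_id % len(orders)]
```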

3.2 Apparatus

We used the FOVE eye tracking headset (weight: 520 g), the first virtual reality head-mounted display with built-in eye tracking. Our research requires subjects to stand at a fixed point within a room-scale tracking space during the experiment, and controller tracking is necessary for two of the conditions (ray-casting and SQUAD). For tracking consistency and the reasons above, we used HTC Vive room-scale tracking for both the HTC Vive controller and the headset, by muting the FOVE's original head tracking and mounting an HTC Vive tracker (weight: 300 g) on the FOVE headset (see Figure 4). One laptop (PC1) connected to the FOVE headset provided the display of virtual content and eye tracking, while one desktop (PC2) connected to the HTC hardware provided headset and controller tracking.

 

FOVE Unity plugin v1.3.0, driven by FOVE runtime version 0.13.0, ran on an ASUS GL502V (PC1) with a quad-core processor (2.8 GHz), 16.0 GB RAM, and an NVIDIA GeForce GTX 1070, running Windows 10. An Alienware X51 R3 (PC2), with a quad-core processor (2.7 GHz), 8 GB RAM, and an NVIDIA GeForce GTX 970, running Windows 10, drove the SteamVR plugin v1.2.3 to support the HTC Vive with 6-DOF position and orientation tracking. A local server supported real-time data transfer between the two PCs (PC1 and PC2) through UDP.

Figure 4: Experiment participant and FOVE headset (with an HTC Vive tracker) and HTC Vive controller.


 

The virtual environment was built with the Unity 3D engine (version 2017.1) and the scripts were written in C#. All virtual objects, including the target and the distractors, were spheres located within a sphere of radius 2.155 m centered at the participant's position. The target was chosen within an inner sphere (radius 1.1 m), which ensured that the target was located within the initial visual range in each selection and that the selection bubble would cover at least a certain number of objects (e.g., 16, 64, 256) in one selection. This fixed the number of refinement steps (e.g., 2, 3, 4) for each distractor density condition.
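The fixed number of refinement steps follows from the quad-menu keeping roughly a quarter of the objects at each step, so the step count is the ceiling of the base-4 logarithm of the initial set size. A small sketch (the function name is illustrative):

```python
import math

def refinement_steps(n_objects):
    """Refinement steps needed to isolate one target when each
    quad-menu step keeps roughly a quarter of the objects."""
    return math.ceil(math.log2(n_objects) / 2)  # ceil(log4 n)

# Sparse (16), medium (64), and dense (256) initial sets need
# 2, 3, and 4 refinement steps, respectively.
```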

 

3.3 Participants

Twenty-four unpaid volunteers (12 male, 12 female) were recruited for the experiment, aged 21 to 32 with a median age of 24. All participants were graduate students except one postdoctoral scholar.

3.4 Procedure

Participants were first welcomed by the experimenter and given background information about the study. They then read and signed an informed consent form, which included experiment details such as procedure, benefits, and risks. After that, they were asked to complete a color blindness assessment and a background survey online. Since none of them were identified as color blind, no participants were excluded from the experiment.

 

Participants were instructed to perform the trials as quickly as possible while making as few mistakes as they could, with making fewer mistakes being more important than being quick. The experimenter then explained how to complete the selection task. Participants were told to hold the controller with the dominant hand and not to use the other hand to steady the controller during the trials. Once the experimenter finished explaining the use of the Vive controller, participants moved to the experiment area, a fixed point in the room-scale tracking space, and put on the FOVE headset. A red starting point in the virtual environment, consistent with the starting point in the real world, turned green when the participant was close enough to it (within 0.1 m).

 

They then learned their first technique in a corresponding training session, which taught them how to use the technique and allowed them to try each of the nine combinations of target size and distractor density once. During the training session, they were asked to make at least one correct selection and one erroneous selection to see both outcomes (for a correct selection a checkmark was displayed; otherwise a cross appeared). After the training session, they performed the experiment condition for that technique, which contained 72 trials: all 9 combinations repeated 8 times in random order.

Once they completed all trials for a technique, they filled out a technique rating questionnaire for that technique. All participants completed all three techniques in a counterbalanced order. After finishing all conditions, they filled out a post-study overall performance questionnaire.

4. Results

We used a repeated-measures multivariate ANOVA (MANOVA) model with a significance level of 0.05 to evaluate the effects on the two dependent variables: 1) mean error rate and 2) average time to complete a selection. The three independent variables were: 1) technique (ray-casting, SQUAD, EyeSQUAD), 2) target size (small, medium, large), and 3) distractor density (sparse, medium, dense). We used repeated measures because all nine combinations of target size and distractor density were randomized within counterbalanced technique orders, and repeated measures provide more statistical power with fewer subjects.

 

Given the small sample size of 24 subjects relative to the 3 independent variables and their potential interactions, we checked the statistical power of the results to avoid overstating the significance of effects that might not exist. In this study, significance is examined by checking whether the power of each test is larger than 0.8, which indicates "sufficient power to detect effects" (Field, 2013). From the results, we find that our sample size is sufficient, since adequate power is observed in the tests.