VRSafe: A Secure Virtual Keyboard to Mitigate Keystroke Inference in Virtual Reality

Yijun Yuan (ORCID 0009-0008-0751-9207), University of Pittsburgh, Pittsburgh, PA, USA, yiy95@pitt.edu; Na Du, University of Pittsburgh, Pittsburgh, PA, USA, na.du@pitt.edu; Adam J. Lee, University of Pittsburgh, Pittsburgh, PA, USA, adamlee@pitt.edu; and Balaji Palanisamy, University of Pittsburgh, Pittsburgh, PA, USA, bpalan@pitt.edu

Abstract.

Password-based authentication is one of the most commonly used methods for verifying user identities, and its widespread usage continues in virtual reality (VR) applications. As a result, various attacks on password-based authentication in traditional environments, such as keystroke inference and shoulder surfing, remain effective in VR applications. While keystroke inference attacks on virtual keyboards have been studied extensively, few efforts have developed an effective and cost-efficient defense strategy to mitigate keystroke inference in VR. To address this gap, this paper presents a novel QWERTY keyboard called VRSafe that is resilient to keystroke inference attacks. The proposed keyboard carefully introduces false positive keystrokes into the information collected by attackers during the typing process, making the inference of the original password difficult. VRSafe also incorporates a novel malicious login detector that can effectively identify unauthorized login attempts using credentials inferred from keystroke inference attacks, with a high detection rate and minimal time and memory cost. The proposed design is evaluated through both simulation experiments and a real-world user study, and the results show that VRSafe can significantly reduce the accuracy of keystroke inference attacks while incurring a modest overhead from a usability standpoint.

VR Security; Password-based Authentication; Keystroke Inference Attack Defense

CCS Concepts: Human-centered computing → Virtual reality; Human-centered computing → User studies; Human-centered computing → Keyboards; Human-centered computing → Text input; Security and privacy → Authentication

1. Introduction

Virtual reality (VR) devices and applications are becoming increasingly prevalent. Studies show that over 171 million people globally use VR (Kumar, 2025), and nearly 88% of VR users use their devices multiple times in a month (Group, 2022). VR applications provide immersive experiences not only in gaming and entertainment, but also in many other domains including education, fitness, healthcare and engineering (Group, 2022). The widespread use of VR leads to increased exchange of sensitive information (e.g. passwords), making VR applications attractive targets for adversaries.

Password-based authentication stands out as the most commonly used authentication method to date. However, extensive prior work has shown its vulnerability to keystroke inference attacks and shoulder surfing across various settings (Yang et al., 2022; Sun et al., 2016; Cronin et al., 2021; Wang et al., 2024b; Sabra et al., 2020; Gupta et al., 2018). With the growing adoption of VR devices, these risks become more severe. VR applications provide a rich immersive experience to the user, which enables stealthier attacks because users are generally less aware of their physical surroundings when using VR devices. Many inference attacks only require placing an inconspicuous sensor (e.g. a camera) and achieve high inference accuracy (Yang et al., 2023; Gopal et al., 2023; Al Arafat et al., 2021; Luo et al., 2024; Lee et al., 2023; Zhang et al., 2023; Slocum et al., 2023; Wu et al., 2023). These threat models exploit various information such as the user’s hand movement (Gopal et al., 2023), gyroscope data (Slocum et al., 2023), Wi-Fi signal data (Al Arafat et al., 2021) or acoustic signals (Luo et al., 2024). Such attacks are highly practical in the real world as they incur low cost and require minimal specialized knowledge. Prior efforts have also explored biometric-based approaches as an alternative to password-based authentication (Soni and Prabakar, 2021; Luo et al., 2020; Shen et al., 2018; Boutros et al., 2020), such as using eye-tracking cameras to capture iris information in VR headsets (Luo et al., 2020; Boutros et al., 2020). Although biometric approaches resist keystroke inference attacks better, they have not gained wide acceptance among users. This is due in part to the lack of mature regulations and laws on how to store highly unique biometric data, and in part to OS and hardware compatibility issues (Stephenson et al., 2022) across different commercial VR products.

In existing threat models, the risk of full password inference is underestimated. Though existing keystroke inference attacks can achieve high character-level guessing accuracy (often exceeding 95%) (Yang et al., 2023), inferring a full password string is significantly more challenging due to the absence of inter-word context and limited intra-word semantics compared to inferring words in natural language texts (e.g. email content). While simple typographic correction tools (Wu et al., 2023) may yield reasonable improvements for general text inference, such tools offer little benefit for reconstructing passwords. Designing or evaluating defense mechanisms based on such a threat model can lead to overly optimistic conclusions, as these methods fail to capture the strategies that a practical adversary would adopt. To build a more robust and practical defense, in this work, we adopt a stronger threat model in which the attacker integrates advanced password guessing techniques (Wang et al., 2023; Yang et al., 2025; Pal et al., 2019; Nosenko et al., 2023) into the inference process, rather than relying on simple correction tools.

Although many research efforts have studied keystroke inference attacks in VR environments, countermeasures have not yet received adequate attention. Some studies briefly discuss ideas for potential defenses (Yang et al., 2023), while others have proposed and evaluated defense solutions based on obfuscating user inputs using randomized keyboard layouts (Maiti and Crager, 2017) or repositioning the UI components (Wan et al., 2024). However, such techniques make the keyboard harder to use and impose additional physical and cognitive load on the user.

There are several challenges involved in building an effective countermeasure against keystroke inference attacks. First, many attack models can reliably detect the precise moments when keys are pressed (e.g. (Gopal et al., 2023; Yang et al., 2023; Slocum et al., 2023; Lee et al., 2023)). A strong defense method should therefore remain effective even in the worst case, where the adversary can perfectly identify every key press event made by the user. Second, when an attacker attempts to log in with an incorrect but closely matching password, it is inherently difficult to detect the source of the password leakage. A comprehensive countermeasure should detect such malicious login attempts in addition to mitigating them. Finally, maintaining a high level of usability in the defense solution design remains a critical challenge.

In this paper, we propose VRSafe, a security solution designed to enhance the security of password inputs entered through a standard QWERTY keyboard in virtual reality. The proposed approach carefully injects fabricated noise into the typing process, making the inference of the original password harder for the attacker. VRSafe also incorporates a novel detection mechanism to identify whether an incorrect login attempt is a direct result of a keystroke inference attack. To the best of our knowledge, VRSafe is the first approach to alert the user when an attacker attempts to log in using inferred keystroke information. Our evaluation using publicly leaked password datasets and an IRB-approved user study demonstrates that VRSafe is a secure, practical and user-friendly approach for individuals with higher security and privacy needs in VR environments.

2. Background and Related Work

In this section, we discuss existing work on keystroke inference attacks and defenses in VR.

A keystroke inference attack, as the name suggests, tries to reconstruct the user’s input sequence on the keyboard through side-channel information emanating from typing (e.g. sound, motion). Over the past decade, keystroke inference attacks have been widely investigated, prompting the use of many different techniques to leverage these side-channel signals. Recent work (Gopal et al., 2023; Ling et al., 2019; Yang et al., 2023) has demonstrated the feasibility of reconstructing sensitive text by analyzing hand movements captured on video, highlighting the critical need for enhanced privacy and security protection. It is also possible to use the hand tracker camera of an Augmented Reality Head-Mounted Display (AR-HMD) (Meteriz-Yıldıran et al., 2022) to recover passwords, similar to the video-based attack in the VR environment. Wang et al. (Wang et al., 2024a) find that gaze motions observed in videos can also leak typing sequence information. In addition to video-based attacks, attacks that exploit other channels such as audio, radio frequency (RF) or infrared signals generated during typing also pose serious threats. VRecKey (Ni et al., 2024) discloses an attack in which Infrared Remote (IR) signals emitted from VR controllers are represented as a heat map that reveals controller positions during typing, from which keystroke information is inferred. The attack presented in (Al Arafat et al., 2021) uses Channel State Information (CSI) of Wi-Fi signals from routers placed near the victim. This information exhibits unique patterns as users move the controller to type different keys, helping the attacker identify the keystrokes. Another class of attacks uses the acoustic signals emitted by VR devices. Heimdall (Luo et al., 2024) uses a customized directional microphone to capture subtle variations in the acoustic signal produced by the controller at different key positions.
Prior work has also shown that attackers can utilize Inertial Measurement Unit (IMU) data, such as a combination of position, orientation and velocity (Slocum et al., 2023; Luo et al., 2022; Wu et al., 2023), or memory usage on the victim’s device (Zhang et al., 2023) to perform keystroke inference.

In contrast to the widespread prior efforts in developing various attacks, models to mitigate or to prevent inference attacks in VR environments are much less explored. In practice, very few methods can balance ease of use with strong performance in defending against the inference attacks. The defense solution outlined in (Yang et al., 2023) suggests placing physical screens to block the external line of sight. The approach presented in (Althebeiti et al., 2024) replaces the normal 2D QWERTY keyboard with a curved, arc-shaped layout in AR to introduce variation in key spacing. Another approach to safeguard against keystroke inference attacks is to introduce randomness. For instance, the solution presented in (Maiti et al., 2017) proposes projecting a random layout keyboard, and (Li et al., 2020) generates a random mapping between alphabets and keypad keys to secure password entry in smart wearable glasses. The authors in (Wan et al., 2024) propose repositioning the UI components (e.g. cursor, keyboard) after each click so that an adversary cannot reliably map observed movements to specific keys.

Overall, current solutions often adopt radical changes to the conventional keyboard layouts or input designs, and some rely on extra external protection devices. Such approaches introduce substantial learning overhead and increase the difficulty of use for end users. In contrast, the research presented in this paper aims to fill this gap by developing a usable and secure VR keyboard architecture that can withstand video-based keystroke inference attacks and yet provide a highly usable VR experience to the user.

3. VRSafe Design

In this section, we present VRSafe, a secure keyboard in virtual reality that provides word-level protection for user password input against keystroke inference attacks. We first introduce the threat model used in the design of VRSafe and then describe the strategies employed to improve security while limiting the overhead incurred when using the keyboard. Finally, we present our proposed mechanism for detecting malicious login attempts that use credentials stolen through keystroke inference attacks.

3.1. Threat Model

3.1.1. Keystroke Inference

Keystroke inference threat models can employ a range of sensors, each with its own tolerance to noise. Video-based attacks often use ordinary digital cameras with no special configuration (Yang et al., 2023; Ling et al., 2019; Luo et al., 2022; Gopal et al., 2023), and videos are generally more robust to noise. To ensure reproducibility and fair performance comparisons with prior work, we adopt a video-based attack in our study. Specifically, we follow the attack model proposed in (Gopal et al., 2023), where the adversary captures the VR user’s hand movements with a camera placed in the surrounding environment. We believe such a threat model is quite practical and is within the capabilities of most real world adversaries. In this model (Gopal et al., 2023), the victim is assumed to interact with the keyboard using hand tracking features provided by VR headsets for typing. The captured video frames are processed using the hand landmark detection framework in MediaPipe (Zhang et al., 2020) to extract coordinates of hand joints. By leveraging unique joint patterns exhibited during typing, such as relative joint positions and velocities, the model infers both the spatial position of the virtual keyboard plane and the final keystroke predictions. To better model the attacker’s ability, we make three more assumptions, each with a brief rationale: (i) The attacker can correctly identify every keystroke in the corresponding video clips. While raw video data may include false positives and false negatives, techniques like denoising and clustering can significantly reduce such detection errors, leading to a high true positive detection rate. (ii) The attacker can access any publicly available information within the application, but does not have access to private data of the legitimate user. 
Public data like keyboard layouts and language preference are easily obtainable, whereas private data such as screen display content are protected and inaccessible without privileged permissions. (iii) We also assume that the attacker is allowed at most $k$ guessing attempts. This is consistent with most web applications, which allow only a small number of password guesses to be made before the system locks the user for suspicious action.

3.1.2. Password Guessing Refinement

Although existing keystroke inference attacks can achieve high character-level accuracy, end-to-end word guessing accuracy still has significant room for improvement. Prior work has applied simple typo correction tools, such as the grammar and spelling correction in Google Docs (Wu et al., 2023), to texts reconstructed from keystrokes, which underestimates the capabilities of real-world attackers. In contrast, we find that using targeted password guessing models can improve the full password guessing accuracy, providing a more realistic assessment of defense effectiveness. In particular, Pal et al. (Pal et al., 2019) proposed a sequence-to-sequence (seq2seq) targeted password guessing model based on a recurrent neural network (RNN), which is trained on users’ leaked passwords from a breached site to guess their passwords on another, uncompromised site. Wang et al. (Wang et al., 2023) introduced Pass2Edit, which further improves this idea and currently achieves state-of-the-art performance. By feeding passwords inferred from keystrokes into such models, one can generate a list of plausible passwords, one of which may correspond to the actual password.

3.2. Design Overview

VRSafe consists of two modules, the Keystroke Noise Injection Keyboard and the Inference Attack Detector, as shown in Figure 1. Users interact with the VRSafe keyboard in a way similar to normal virtual keyboards, with only a small number of additional actions that intentionally introduce noise into the user’s password input. VRSafe can differentiate the noise from the actual password, keeping the original password unchanged while storing a “noisy” version of the password. Upon clicking the login button, both the original and the “noisy” password are sent to the server. The server verifies the original password against the stored credentials in the database and, only if the verification succeeds, adds the corresponding “noisy” password to the detector. The shaded region in Figure 1 represents a keystroke inference attacker who covertly places a camera to record the user’s interactions with the VR headset, including the additional actions required by VRSafe. Based on these recordings, the attacker attempts to infer keystrokes and launch a malicious login attempt. However, the attacker can only recover the “noisy” password from the recorded video, and any login attempt using the “noisy” password will be detected by the server.

Figure 1. VRSafe architecture. (1) A normal user interacts with the VRSafe keyboard; the original password remains unchanged for successful authentication, and the “noisy” password is forwarded to the detector. (2) The server adds the “noisy” password to the detector if the original password matches the record in the database (i.e. a legitimate login). (3) A keystroke inference attacker captures the user’s keystrokes on the VRSafe keyboard and attempts to log in. (4) The detector raises an alarm if the submitted credential exists in the detector.

3.3. Keystroke Noise Injection in VR Keyboard

We first explain the design rationale behind our approach of injecting fabricated noise to mitigate keystroke inference attacks and then present the technical details of our design. The keystroke noise injection in VRSafe is based on insights from recent work demonstrating that, even for attack models with high accuracy (Yang et al., 2023), certain behaviors significantly affect the accuracy: a participant pressing down towards a key, hesitating and then retracting the finger(s) before hitting it can lead to large numbers of false positive keystrokes. Building on this observation, we propose our keystroke noise injection mechanism. Instead of relying on accidental “false presses” or asking users to occasionally type something wrong (Yang et al., 2022), VRSafe leverages an adaptive mechanism that prompts the user to insert fabricated characters (noise) based on the current input string and previously injected noise. An attacker observing the hand movements of the VR user will not be able to differentiate the truly intended inputs from the fabricated ones, resulting in incorrect password inference. From a password protection standpoint, this process can be viewed as generating an augmented password $S^{\prime}$ derived from the original password $S$ . There is a constraint in this process: $S$ must be a subsequence of $S^{\prime}$ , otherwise content of the original password would be lost. At the same time, the key press signals triggered by fabricated characters are processed differently and are not forwarded to the keyboard output function (Fig. 2). This ensures the integrity of the normal login process. We refer to the characters that exist in the user’s original (real) password as real characters, the additional fabricated characters prompted by VRSafe as ghost characters, and the augmented password as the ghost password.

Ghost Password Generation. Many models, such as RNN-based seq2seq models (Sutskever et al., 2014) and transformer-based LLMs (Vaswani et al., 2017; Achiam et al., 2023), are capable of generating a similar password sequence based on a given password. However, they are inadequate for this task as they cannot guarantee that the subsequence constraint (i.e. the original password $S$ is a subsequence of the ghost password $S^{\prime}$ ) is satisfied on every input/output pair. To address this limitation, we need a method that can reliably embed the characters of the original password into the ghost password. In VRSafe, we adopt a pointer-based method with two actions: (i) a copy action, which appends the character of the original password currently under the pointer to the ghost password and advances the pointer, and (ii) an inject action, which appends a random ghost character. By progressing the pointer from the first to the last character of the original password, this approach ensures that the subsequence constraint is always satisfied. To decide between the two actions, we model the process as a finite sequence of Bernoulli trials: with probability $P$ , the model performs an inject action, and with probability $1-P$ , it performs a copy action. Furthermore, to prevent the ghost password from becoming excessively long or identical to the original password, we impose a maximum limit on the number of consecutive ghost characters and a minimum requirement on the total number of injected ghost characters.
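The copy/inject walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the VRSafe implementation: the alphabet, the default values of `p`, `max_run` and `min_ghosts`, and the use of rejection sampling to enforce the minimum number of ghost characters are all choices made for this example, since the text does not specify how the two limits are enforced.

```python
import random
import string

# Assumed ghost alphabet; the set of legal characters in VRSafe may differ.
ALPHABET = string.ascii_lowercase + string.digits


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)


def generate_ghost_password(pwd, p=0.3, max_run=2, min_ghosts=2, rng=None):
    """Pointer-based copy/inject walk over pwd with a fixed injection probability.

    Returns (ghost_pwd, flags), where flags[i] is True for ghost characters.
    """
    if not pwd or not 0.0 < p < 1.0 or max_run < 1 or min_ghosts > max_run * len(pwd):
        raise ValueError("constraints cannot be satisfied")
    rng = rng or random.Random()
    while True:  # rejection sampling (an assumption): retry until enough ghosts
        ghost, flags = [], []
        pos = run = 0
        while pos < len(pwd):
            if run < max_run and rng.random() < p:
                ghost.append(rng.choice(ALPHABET))  # inject action
                flags.append(True)
                run += 1
            else:
                ghost.append(pwd[pos])  # copy action, pointer moves forward
                flags.append(False)
                pos += 1
                run = 0
        if sum(flags) >= min_ghosts:
            return "".join(ghost), flags
```

Because every real character is appended exactly once and in order, the subsequence constraint holds by construction; `is_subsequence` is only a check for the example.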

Like words in natural languages, many passwords carry semantics, especially human-chosen ones. If we inject a ghost character into an improper position, the adversary can easily spot the ghost character and remove it. Conversely, if a password already has high entropy, adding many ghost characters provides little additional benefit. Therefore, the ghost password generation process should dynamically choose which action to take based on the current context. To address this, we train a small RNN-based model as a meter to evaluate the “randomness” of the current string, and adjust the injection probability $P$ accordingly. The model consists of a single-layer Gated Recurrent Unit (GRU) with a hidden size of 64, and input tokens are represented with an embedding dimension of 16. The GRU output is fed into a fully connected (FC) layer, and the final output is passed through a sigmoid activation to return a value between 0 and 1. For training, we use a publicly leaked dataset, namely the Compilation of Many Breaches (COMB) dataset (Ron Cresswell, 2021). We label passwords from the leaked dataset with 0 and machine-generated passwords with 1. We note that not all passwords in publicly leaked datasets are human-chosen, but there is no consensus on how to reliably distinguish them, since passwords do not have a “correct spelling” or linguistic rules like natural language words. Nevertheless, leaked password datasets remain the most representative and widely used source of human-created passwords, making them a reasonable choice for approximating the human-chosen class in this case. To exclude extreme outliers, we remove passwords that exceed 30 characters or contain only hexadecimal characters, since they are more likely to be hashes than passwords. The ghost password generation algorithm (Algorithm 1) starts with an initial injection probability $p_{0}$ .
After each action, the current ghost password is forwarded to the meter model, and the returned value is smoothed with an exponential moving average (EMA) and compared with the randomness level $r$ . If the smoothed value is smaller than $r$ , meaning that the current string is not adequately random, then $p$ increases by $\Delta p$ ; otherwise $p$ decreases by $\Delta p$ .

Algorithm 1 Adaptive Ghost Character Injection

1: Input: smoothing factor $\alpha$, randomness level $r$, injection probability $p_{0}$, step size $\Delta p$, password $pwd$
2: Initialize $p\leftarrow p_{0}$, $r_{\text{EMA}}\leftarrow r$, ghost_pwd $\leftarrow$ "", $pos\leftarrow 0$
3: while $pos<\text{len}(pwd)$ do
4:     if random() $<p$ then
5:         Inject(ghost_pwd)
6:     else
7:         Copy(ghost_pwd, pwd, pos)
8:         $pos\leftarrow pos+1$  $\triangleright$ moves forward to next character
9:     end if
10:     $\hat{r}\leftarrow\textit{Eval}(\text{ghost\_pwd})$  $\triangleright$ Evaluate randomness of current string with meter model.
11:     $r_{\text{EMA}}\leftarrow(1-\alpha)\cdot r_{\text{EMA}}+\alpha\cdot\hat{r}$
12:     if $r_{\text{EMA}}<r$ then
13:         $p\leftarrow p+\Delta p$
14:     else
15:         $p\leftarrow p-\Delta p$
16:     end if
17: end while
18: return ghost_pwd
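As an illustration, Algorithm 1 can be transcribed into Python roughly as follows. The sketch makes several assumptions that are not part of the algorithm as stated: `toy_meter` is a hypothetical stand-in for the trained GRU meter (it simply returns the fraction of distinct characters), ghost characters are drawn uniformly from an assumed alphabet, the default parameter values are arbitrary, and $p$ is clamped to `p_bounds` so that the loop cannot stall with $p\geq 1$.

```python
import random
import string

# Assumed ghost alphabet for this sketch.
ALPHABET = string.ascii_lowercase + string.digits


def toy_meter(s):
    """Hypothetical stand-in for the GRU meter: fraction of distinct characters."""
    return len(set(s)) / len(s) if s else 0.0


def adaptive_injection(pwd, alpha=0.5, r=0.8, p0=0.3, dp=0.05,
                       meter=toy_meter, rng=None, p_bounds=(0.05, 0.9)):
    """Adaptive ghost character injection following the steps of Algorithm 1.

    Returns (ghost_pwd, flags), where flags[i] is True for ghost characters.
    """
    rng = rng or random.Random()
    p, r_ema = p0, r
    ghost, flags, pos = [], [], 0
    while pos < len(pwd):
        if rng.random() < p:
            ghost.append(rng.choice(ALPHABET))  # Inject(ghost_pwd), uniform here
            flags.append(True)
        else:
            ghost.append(pwd[pos])  # Copy(ghost_pwd, pwd, pos)
            flags.append(False)
            pos += 1  # move forward to the next real character
        r_hat = meter("".join(ghost))  # Eval(ghost_pwd)
        r_ema = (1 - alpha) * r_ema + alpha * r_hat
        p = p + dp if r_ema < r else p - dp
        # Clamping is an assumption of this sketch, not a step of Algorithm 1.
        p = min(max(p, p_bounds[0]), p_bounds[1])
    return "".join(ghost), flags
```

Passing a different `meter` callable (for example, a wrapper around a trained model) changes only how $\hat{r}$ is produced; the update rule for $p$ stays the same.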

Choosing Ghost Characters. We further elaborate the noise injection process (Inject() in Algorithm 1) in the generation of ghost passwords. A straightforward baseline approach is to uniformly choose a ghost character from an alphabet set of all legal characters. However, a selection based on uniform distribution remains vulnerable to simple human inspection. For instance, suppose that the original password contains a date of birth, and if an alphabetic letter is added as a ghost character in between, then the attacker can easily exclude this letter from the inferred text. In order to make the ghost character look similar to a real character, we select ghost characters using simple language models that can capture linguistic connections between characters to make the process context-aware. Unlike neural networks that often require significant computational resources and introduce additional latency, the Markov model offers a lightweight and efficient alternative for ghost character selection. By definition, a $k$ -order Markov model assigns a probability distribution over the next character conditioned only on the preceding $k$ characters:

\begin{split}P(x_{i}\mid x_{i-1},x_{i-2},\dots,x_{1})&=P(x_{i}\mid x_{i-1},x_{i-2},\dots,x_{i-k})\\ &=\frac{\text{count}(x_{i-k},x_{i-k+1},\dots,x_{i})}{\sum_{c\in\Sigma}\text{count}(x_{i-k},x_{i-k+1},\dots,x_{i-1},c)}\end{split}

where $\text{count}(x_{i-k},\dots,x_{i})$ refers to the number of occurrences of the string $x_{i-k}\dots x_{i}$ in a given dataset and $\Sigma$ denotes the set of all legal characters. We test the Markov model using the configuration of (Ma et al., 2014) with different orders and find that the 3-order Markov model does not incur any noticeable delay in the ghost character generation process and yet provides sufficient context-awareness and randomness.
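A count-based $k$-order Markov model of this kind can be sketched in a few lines. The example below is a simplified illustration: it omits the smoothing and start-symbol handling that a configuration such as the one in (Ma et al., 2014) may include, and the uniform fallback for unseen contexts is an assumption of this sketch. The function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_markov(passwords, k=3):
    """Count, for every k-character context, how often each next character follows."""
    counts = defaultdict(Counter)
    for pw in passwords:
        for i in range(k, len(pw)):
            counts[pw[i - k:i]][pw[i]] += 1
    return counts


def next_char_distribution(counts, context, k=3):
    """P(x_i | previous k characters) as a dict; empty if the context is unseen."""
    seen = counts.get(context[-k:])
    if not seen:
        return {}
    total = sum(seen.values())
    return {ch: n / total for ch, n in seen.items()}


def sample_ghost_char(counts, context, alphabet, k=3, rng=None):
    """Draw a ghost character conditioned on the last k characters typed."""
    rng = rng or random.Random()
    dist = next_char_distribution(counts, context, k)
    if not dist:
        return rng.choice(alphabet)  # assumed fallback for unseen contexts
    chars = list(dist)
    return rng.choices(chars, weights=[dist[c] for c in chars], k=1)[0]
```

For instance, trained on the toy corpus `["pass1234", "pass1999", "password"]`, the context `"ass"` is followed by `"1"` twice and by `"w"` once, so a digit is twice as likely as `"w"` to be proposed as the next ghost character.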

Implementation. The keyboard in VRSafe is built using Unity based on a publicly available QWERTY keyboard template from the MRTK toolkit by Microsoft (Microsoft, 2024). While the user enters the password, if the next character is a ghost character, the keyboard disables all keys except the ghost character and notifies the user via a text prompt, ensuring that the user’s next action injects the intended noise. After the ghost character is typed, the keyboard layout is restored to normal QWERTY and the user can continue typing. In a normal keyboard, every key fired by a user click is appended to the textbox. We add a private variable to track whether each character is a real or a ghost character, and only real characters are forwarded to the text output. The ghost password is stored alongside the real password in the textbox GameObject, ensuring that it persists even if the keyboard is closed. We also create a login interface together with the VRSafe keyboard as a VR application using Unity 2022.3.24f1 and deploy it on Meta Quest 2 for the user study. A screenshot of the application is shown in Figure 3.

Overhead. Ghost characters enhance the security of the original password, but they also lengthen text entry time. Therefore, it is critical to carefully consider the overhead introduced in the process. Although entry time can be influenced by various factors such as the user’s experience with the device and typing proficiency, we focus on reducing the overhead introduced by the additional characters, and we further discuss the subjective factors of users’ perceived overhead in the user study (Section 4.3).

In VR applications, the cursor is often bound to handheld controllers, and interactions with components require moving the cursor to the target position. Typing on a virtual keyboard typically involves two steps, moving the cursor to the target key and pressing the button on the controller to select it, and this process repeats for each character. A substantial portion of entry time is therefore spent on controller movement. Meanwhile, to provide a realistic user experience, the ratio between cursor movement in virtual space and physical controller movement in the real world is often close to one (Meta Platforms, 2025). For example, the hand and cursor movement trajectories for typing the word “CAT” on a virtual keyboard are illustrated in Fig. 4. Thus, the entry time of a given password can be approximated by the total distance traversed by the controller, or equivalently by the cursor. Suppose $C_{i}=(x_{i},y_{i})$ denotes the coordinates of the $i^{th}$ character of the password on the keyboard plane and $n$ is the password length; then the total distance traversed by the cursor while entering the password is computed using the following formula, where $\|\cdot\|_{2}$ denotes the Euclidean norm:

d_{sum}={\sum_{i=2}^{n}\|{C}_{i}-{C}_{i-1}\|_{2}}
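The distance $d_{sum}$ can be computed directly once key coordinates are known. The sketch below assumes a hypothetical letter-only layout with unit key pitch and staggered rows (`ROW_OFFSETS` is an illustrative guess, not the geometry of the MRTK keyboard), and defines the overhead of a ghost password as the difference between the two traversal distances.

```python
import math

# Hypothetical letter-only QWERTY geometry: unit key pitch, rows staggered
# by ROW_OFFSETS key widths. Real coordinates would come from the VR keyboard.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSETS = [0.0, 0.5, 1.0]
KEY_POS = {
    ch: (col + ROW_OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def cursor_distance(text):
    """Sum of Euclidean distances between consecutive keys (d_sum)."""
    points = [KEY_POS[ch] for ch in text.lower()]
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def overhead(pwd, ghost_pwd):
    """Extra cursor travel caused by the ghost characters."""
    return cursor_distance(ghost_pwd) - cursor_distance(pwd)
```

Because the ghost password contains the original password as a subsequence, the triangle inequality implies that this overhead is never negative under any fixed key geometry.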

Based on the above formula, the overhead can be calculated as the difference in the distance between the original password and the ghost password. Our objective is to reduce the overhead without significantly compromising the enhanced security achieved by VRSafe. We can either reduce the number of ghost characters or use weighted selection instead of uniform random choice. The former can be adjusted by choosing different randomness levels $r$ in Algorithm 1. Here, we primarily focus on the latter. In natural languages, certain bi-grams occur with similar frequencies, yet their corresponding cursor traversal distances on the QWERTY keyboard can differ substantially. For instance, the bi-gram “PA” shows up 10963 times and “PO” appears 11965 times in one million English words (Solso et al., 1979). This indicates that both bi-grams are about equally plausible from a linguistic standpoint, so selecting either “A” or “O” as a ghost character after “P” offers no meaningful advantage to an attacker. However, typing “PO” results in a much smaller distance cost than “PA”. Therefore, we can choose ghost characters that are located closer on the keyboard to reduce the movement overhead and yet achieve a similar protection performance. We implement this constraint in two ways, namely a hard constraint and a soft constraint. With the hard constraint, we remove the characters that are unacceptably far from the preceding character by defining a threshold distance $\tau$ and assigning zero probability to any candidate whose distance exceeds the threshold. In contrast, the soft constraint does not explicitly eliminate distant characters from the candidate set. Instead, it adjusts their probabilities of being selected based on the distance. We compute the distance $d$ between each candidate ghost character $c_{i}$ and the preceding character $c_{i-1}$ , and adjust the selection probability using a softmax-style weighting function in which closer characters receive higher probabilities:

P^{\prime}(c_{i})\;=\;\frac{P(c_{i})\,\exp\!\left(-\lambda\,d(c_{i},c_{i-1})\right)}{\sum\limits_{j}P(c_{j})\,\exp\!\left(-\lambda\,d(c_{j},c_{i-1})\right)}

A higher $\lambda$ or a lower $\tau$ both result in smaller overhead, because the selected ghost keys are concentrated closer to the preceding key. However, doing so may weaken the enhanced security. We discuss the performance of VRSafe for different values of $\lambda$ and $\tau$ in Section 4.
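Both constraints amount to reweighting a candidate distribution, as the following sketch shows. It reuses a hypothetical unit-pitch QWERTY geometry (an assumption, as are the function names), and it renormalizes the surviving candidates after the hard constraint, which the text above does not state explicitly.

```python
import math

# Hypothetical letter-only QWERTY geometry with unit key pitch (an assumption).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSETS = [0.0, 0.5, 1.0]
KEY_POS = {
    ch: (col + ROW_OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def key_distance(a, b):
    """Euclidean distance d(a, b) between two keys on the keyboard plane."""
    return math.dist(KEY_POS[a], KEY_POS[b])


def apply_hard_constraint(probs, prev, tau):
    """Zero out candidates farther than tau from prev, then renormalize."""
    kept = {c: p for c, p in probs.items() if key_distance(c, prev) <= tau}
    total = sum(kept.values())
    return {c: p / total for c, p in kept.items()} if total > 0 else {}


def apply_soft_constraint(probs, prev, lam):
    """P'(c) proportional to P(c) * exp(-lam * d(c, prev))."""
    weights = {c: p * math.exp(-lam * key_distance(c, prev)) for c, p in probs.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

With `prev = "p"` and equally likely candidates `"a"` and `"o"`, the soft constraint with $\lambda=1$ shifts almost all probability mass to `"o"`, while $\lambda=0$ leaves the language-model distribution unchanged.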

3.4. Inference Attack Detector

Although the proposed noise injector increases the difficulty of obtaining the original password, an attacker may still correctly guess the password given enough attempts. The total number of brute-force attempts required to infer the correct original password from the ghost password is approximately $k\times 2^{n}$ , where $k$ is the number of inferred ghost password candidates and $n$ is the length of the ghost password; the factor $2^{n}$ follows from the subsequence relationship between the ghost password and the real password, since each of the $n$ characters may be either real or ghost. Thus, it is crucial to add a detection mechanism to identify whether an account is being actively targeted or guessed. By identifying targeted guessing attempts early, the service provider can notify the user to limit potential damage. We add a checker at the server side inspired by the honey checker proposed in (Juels and Rivest, 2013). The idea behind the honey checker is to create false passwords (also called “honeywords” or “decoy passwords”) that are similar to the user’s real password. All password hashes are stored in the credential database, and the server knows which one is authentic. An adversary who inverts the hashes will obtain multiple password candidates for each user, and if the attacker attempts to authenticate with a decoy password, the server can immediately detect and flag the unauthorized attempt. Typically, this approach is employed to defend against attacks where password hash files from credential databases are compromised by data breaches. In our context, the attacker has knowledge of the ghost password, which they believe is the original password of the user, and this password can also be considered a “decoy password”. Therefore, for ease of exposition, we refer to a password used for detecting keystroke inference attacks as a “honeyword”. If the adversary attempts to log in with the ghost password, we can infer that a keystroke inference attack has previously taken place.
We draw this inference because neither the legitimate user nor other types of attackers would use the ghost password for authentication, and only a keystroke inference attacker possesses knowledge of the ghost password.
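The decision logic described above can be illustrated with a minimal sketch. It is not the implementation used in VRSafe: the function names, the example passwords, the salt, and the use of a plain set of SHA-256 digests are all assumptions made for illustration (the paper stores honeywords in a Bloom filter, discussed later in this section).

```python
import hashlib
import hmac
from enum import Enum


class LoginResult(Enum):
    ACCEPT = "accept"  # submitted password is the real one
    ALARM = "alarm"    # submitted password is a honeyword
    REJECT = "reject"  # ordinary wrong password


def digest(password: str, salt: bytes) -> bytes:
    # Illustration only: a deployed system would use a slow, salted
    # password hash (a key-derivation function), not bare SHA-256.
    return hashlib.sha256(salt + password.encode("utf-8")).digest()


def classify_login(submitted: str, real_hash: bytes,
                   honey_hashes: set, salt: bytes) -> LoginResult:
    h = digest(submitted, salt)
    if hmac.compare_digest(h, real_hash):
        return LoginResult.ACCEPT
    if h in honey_hashes:
        # Only a keystroke inference attacker is expected to know a
        # ghost-derived password, so this attempt is flagged.
        return LoginResult.ALARM
    return LoginResult.REJECT


# Hypothetical example values (not taken from the paper's data).
salt = b"demo-salt"
real_hash = digest("sunshine1", salt)
honey_hashes = {digest("sunqshinze1", salt)}  # the ghost password

print(classify_login("sunshine1", real_hash, honey_hashes, salt))
print(classify_login("sunqshinze1", real_hash, honey_hashes, salt))
print(classify_login("sunshine2", real_hash, honey_hashes, salt))
```

In this sketch, a honeyword match raises an alarm instead of merely rejecting the attempt, which is what allows the server to notify the user that a keystroke inference attack has likely occurred.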

To the best of our knowledge, no prior work has explored this setting, and our approach provides a novel detection mechanism specifically for keystroke inference attacks. Some research has also explored alarm-raising attackers (Wang and Reiter, 2024; Huang et al., 2024; Wang et al., 2022), whose goal is to trigger false alarms rather than to gain unauthorized access. We argue that keystroke inference attackers are unlikely to adopt this strategy for the following reason: if an attacker has already acquired a user’s password through a keystroke inference attack, raising an alarm would reveal nothing about the user’s sensitive information, and the password would soon expire because the server would notify the compromised user to change it right away. As a result, a keystroke inference attacker who is interested in the user’s sensitive data would always attempt a false negative attack (i.e., try to log in without being noticed).

Honeyword Selection and Storage. The ghost password should always be included as a honeyword, as it is highly likely to be selected by an attacker in a login attempt. Including more honeywords broadens the detection surface; however, too many honeywords may greatly increase verification time and impair performance. To maximize the detection rate with a reasonable number of honeywords, we rank all possible guesses in descending order of probability under the password guessing model and choose the top $n-1$ guesses, excluding the ghost password and the original password, which gives $n$ honeywords in total including the ghost password. As for false alarms caused by users’ typographic errors, the ghost password must contain a minimum number of injected characters, so the likelihood that a user’s typo accidentally matches the ghost password should be minimal.
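The selection rule can be sketched as follows. The function name, the input format (a list of password and probability pairs), and the example guesses are hypothetical; the password guessing model that produces the probabilities is outside the scope of this sketch.

```python
def select_honeywords(guesses, ghost, original, n):
    """Return n honeywords: the ghost password plus the top n-1 guesses.

    guesses:  iterable of (password, probability) pairs from a guessing
              model (hypothetical input format).
    ghost:    the ghost password, always included.
    original: the real password, never included.
    """
    ordered = sorted(guesses, key=lambda g: g[1], reverse=True)
    honeywords = [ghost]
    for candidate, _prob in ordered:
        if len(honeywords) == n:
            break
        # Skip the real password, the ghost password, and duplicates.
        if candidate == original or candidate in honeywords:
            continue
        honeywords.append(candidate)
    return honeywords


# Hypothetical guesses for the ghost password "sunqshinze1".
example_guesses = [
    ("sunshine1", 0.40),    # the real password: must be excluded
    ("sunqshine1", 0.20),
    ("sunshinze1", 0.15),
    ("sunqshinze1", 0.10),  # the ghost password itself
    ("sunshine", 0.05),
]
print(select_honeywords(example_guesses, "sunqshinze1", "sunshine1", 3))
```

Excluding the original password is essential: if it were stored as a honeyword, every legitimate login would raise an alarm.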

We now discuss how to store the honeywords at the server. The ghost password is generated from a probabilistic model that varies across login sessions, and the honeywords are derived from it. As a result, each time a user logs in using VRSafe, we may obtain a different honeyword set, and the total number of honeywords tends to grow as the user continues to log in over time. To reduce the storage overhead associated with the growing number of honeywords, we use a Bloom filter (BF) (Bloom, 1970), a compact data structure that tests whether an element is a member of a set with possible false positives but no false negatives. Using a BF has two advantages over storing honeywords as individual hashes. First, it provides insert and lookup operations in $O(k)$ time regardless of the number of elements already inserted, where $k$ is the number of hash functions of the BF, fixed at initialization. Second, the space cost of a BF is fixed and does not grow with the number of stored elements. However, a BF also has limitations. For a given BF with $k$ hash functions and a bit array of size $m$, only a limited number of elements $n$ can be inserted before the false positive rate (FPR) $p_{fp}$ exceeds a given target:

$$p_{fp}\approx\left(1-e^{-\frac{kn}{m}}\right)^{k}$$

In addition, a Bloom filter does not support deletion, so once the number of inserted elements approaches the expected maximum $n$, the filter must be rebuilt; otherwise, the FPR will exceed its target. Therefore, we need a good estimate of how a user’s password logins grow over time in order to avoid rebuilding the BF frequently. According to a study on web password habits (Florencio and Herley, 2007), a user logs into a website approximately 3.22 times per day on average (i.e., roughly 1,000 times per year). Such estimates can be used to determine the expected interval (e.g., 6 or 12 months) for rebuilding a BF and the expected maximum number of elements.
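To make the data structure concrete, the following is a minimal Bloom filter sketch, not the implementation evaluated in this paper. Deriving the $k$ bit positions from a single SHA-256 digest by double hashing is an assumption of this sketch, as are the class and method names. The `estimated_fpr` method evaluates the approximation for $p_{fp}$ given above.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter with m bits and k hash positions per item."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # space is fixed at creation
        self.count = 0                       # number of insert calls

    def _positions(self, item: str):
        # Assumption: derive k positions from one SHA-256 digest using
        # double hashing, (h1 + i * h2) mod m.
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:16], "big")
        h2 = int.from_bytes(d[16:], "big") | 1  # force an odd stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1  # duplicates are counted again (upper bound)

    def __contains__(self, item: str) -> bool:
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def estimated_fpr(self) -> float:
        # p_fp ~ (1 - exp(-k * n / m)) ** k
        return (1.0 - math.exp(-self.k * self.count / self.m)) ** self.k


bf = BloomFilter(m=9586, k=7)
for i in range(100):
    bf.add(f"honeyword-{i}")  # hypothetical honeywords
print("honeyword-42" in bf, bf.estimated_fpr())
```

Both `add` and the membership test touch exactly $k$ bit positions, which is why their cost does not depend on how many honeywords are already stored.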

4. Experiments

We evaluate VRSafe through both simulations and a real-world user study approved by the Institutional Review Board (IRB). Before presenting the experiment results, we first describe our experiment setup.

4.1. Experiment Setup

Table 1. Summary of training time and iterations of models used in this paper.

Model	Epoch	Training Time (Total)
Pass2Edit	3	24hr
Randomness Meter	10	10hr
3-Order Markov	-	2hr

Table 2. Data cleaning result of leaked password dataset.

Dataset	Leaked Time	Raw	Non-ASCII	Empty	Removed (%)	Cleaned
COMB	Feb. 2021	3,279,064,312	14,827,020	187,089	0.46	3,264,050,203

We evaluate VRSafe on a real leaked password dataset, the Compilation of Many Breaches (COMB). COMB was first leaked in Feb. 2021 on a popular online hacking forum (Ron Cresswell, 2021) and includes the largest volume of recently leaked passwords from various websites. (The dataset was collected from publicly accessible websites for research purposes only. We intentionally omit the source URL of the leaked dataset to limit further exposure.) Consistent with prior work (Wang et al., 2023; Pal et al., 2019), we remove empty rows and entries that contain characters other than printable ASCII. The cleaned dataset is described in Table 2. All models are trained on the Google Colab platform using a T4 GPU and High-RAM mode. For both training and testing, we choose passwords between 5 and 30 characters in length, as passwords outside this range are either too weak or unlikely to be human-chosen. The targeted password guessing model follows the setup in (Wang et al., 2023) except for the data size. With the data size used in (Wang et al., 2023), training takes more than 7 days per epoch on large datasets. To ensure practical training efficiency, we instead sample 50 million password pairs. For the randomness meter model in Algorithm 1, we train on 5 million real passwords and an equal number of randomly generated passwords with the same length distribution for 10 epochs. The Markov model for ghost character selection is trained on 5 million passwords. The training time for all models is shown in Table 1.
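The cleaning and length-filtering rules can be sketched as a single pass over the raw entries. This is an illustrative sketch only: the function names are hypothetical, the length bounds are assumed to be inclusive, and the paper reports cleaning (Table 2) and length filtering as separate steps.

```python
def is_ascii_printable(password: str) -> bool:
    # Printable ASCII spans code points 0x20 (space) to 0x7E (~).
    return all(32 <= ord(ch) <= 126 for ch in password)


def clean(passwords, min_len=5, max_len=30):
    """Drop empty rows, non-printable-ASCII entries, and out-of-range lengths."""
    stats = {"raw": 0, "empty": 0, "non_ascii": 0, "out_of_range": 0}
    kept = []
    for pw in passwords:
        stats["raw"] += 1
        pw = pw.rstrip("\r\n")  # strip line endings only
        if pw == "":
            stats["empty"] += 1
        elif not is_ascii_printable(pw):
            stats["non_ascii"] += 1
        elif not (min_len <= len(pw) <= max_len):
            stats["out_of_range"] += 1
        else:
            kept.append(pw)
    return kept, stats


# Hypothetical sample rows.
sample = ["password1\n", "", "pässword", "abc", "x" * 31, "hello world"]
kept, stats = clean(sample)
print(kept, stats)
```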

4.2. Experiment Results

We conducted a series of experiments to evaluate the factors that affect guessing accuracy and overhead. We also evaluated the accuracy and resource cost of our detection mechanism.

#1: Targeted Guessing vs. Simple Typo Correction. We first show that evaluating our defense against a password guessing model is a more realistic setup than evaluating it against simple typo correction. To simulate passwords inferred from keystrokes, we randomly sample 20,000 passwords from the test set and replace each character with one of its adjacent keys on the QWERTY layout with a probability of 5%, which aligns with the character-level inference accuracy of approximately 95% reported in state-of-the-art studies (Yang et al., 2023). We then copy all passwords into a Google Doc and refine them with the “spelling and grammar” tools. Some suggestions split a word into a phrase (e.g., “iloveyou” to “i love you”); we accept only the suggestions that retain a single word. In parallel, we input the passwords into the guessing model and pick the top-1 guess for each password. The passwords refined in Google Docs achieve accuracy (64.55%) similar to that of the targeted password guessing model (64.69%) when no ghost characters are injected, but they quickly fall behind once a small amount of noise is introduced (27.52% vs. 32.78% with an injection probability $P=0.1$). In real-world cases, inferred keystrokes are also likely to contain false negatives or false positives, which results in a larger divergence between inferred and original passwords. Therefore, the password guessing model more accurately reflects the performance of our defense against realistic attackers.
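The keystroke-noise simulation can be sketched as follows. The adjacency map is an assumption of this sketch: it treats the four alphanumeric rows as an unstaggered grid and considers the eight surrounding cells as neighbors, which may differ from the adjacency definition used in our experiments. Symbols are left unchanged here for simplicity.

```python
import random

# Simplified QWERTY rows (digits and letters only); row stagger is ignored.
QWERTY_ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_adjacency(rows):
    """Map each key to the keys in the surrounding grid cells."""
    adjacency = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            neighbors = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if (dr, dc) != (0, 0) and 0 <= cc < len(rows[rr]):
                        neighbors.add(rows[rr][cc])
            adjacency[key] = sorted(neighbors)
    return adjacency


ADJACENT = build_adjacency(QWERTY_ROWS)


def simulate_inference(password, error_rate=0.05, rng=None):
    """Replace each character by an adjacent key with probability error_rate."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in password:
        neighbors = ADJACENT.get(ch.lower())
        if neighbors and rng.random() < error_rate:
            sub = rng.choice(neighbors)
            out.append(sub.upper() if ch.isupper() else sub)
        else:
            out.append(ch)  # unchanged (or not on the simplified layout)
    return "".join(out)


rng = random.Random(0)  # seeded only to make the demo repeatable
print(simulate_inference("iloveyou", error_rate=0.05, rng=rng))
```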

#2: Accuracy Analysis. We sample passwords in different categories to simulate users with different password-choosing habits. First, each character in a password can be classified as a digit, a letter, or a symbol (Weir et al., 2009), and prior work has shown that the number of character classes is one of the key factors affecting password guessing performance (Melicher et al., 2016; Wang et al., 2023; Kelley et al., 2012). We categorize passwords by the number of classes they contain, denoted Class-X, where $X$ is the number of character classes in the password. For instance, the password “Jamesbond007” is a Class-2 password. In addition, password length is another important factor that influences password guessing performance, with longer passwords generally being more difficult to guess accurately. In (Wang et al., 2023), the researchers found that their model outperformed prior approaches for passwords with lengths between 10 and 16 characters. Based on this, we categorize passwords into Short ($<10$), Medium ($10-16$) and Long ($>16$) passwords to provide a comprehensive evaluation of our defense method. We feed passwords to our noise injection algorithm using $p_{0}=0.5,\Delta p=0.05,\alpha=0.1$ at various randomness levels $r$. We then forward the ghost passwords to the Pass2Edit password guessing model to obtain the adversary’s final guesses. Fig. 6 shows the password guessing accuracy for different password lengths and numbers of classes at different numbers of guesses ($10,100,1000$), which are common values evaluated in prior research (Wang et al., 2023; Pal et al., 2019; Wang et al., 2016). The x-axis represents the randomness level $r$ as described in Algorithm 1, and the y-axis represents the guessing accuracy under top-k guesses. Overall, using the Markov model to select ghost characters performs better than uniform selection, especially for long passwords (see Table 3). For Class-2, Long passwords, the Markov model yields 5% to 15% lower guessing accuracy than uniform selection under the same setting when 1,000 guesses are allowed, across all randomness levels (Fig. 6(e)). Across different numbers of allowed guesses, we observe that the adversary’s guessing accuracy has an approximately linear relationship with the logarithm of the number of guesses, and the accuracy under 10 guesses is only around 20% (see Fig. 5). As the number of guesses increases, an attacker typically needs to generate ten times as many candidates to achieve approximately twice the accuracy, and the improvement is even smaller for more complex passwords.
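The two categorization rules can be written down directly. The function names below are hypothetical; the class definitions (digit, letter, symbol) and the length thresholds follow the text above.

```python
def char_classes(password: str) -> set:
    """Return the set of character classes present in the password."""
    classes = set()
    for ch in password:
        if ch.isdigit():
            classes.add("digit")
        elif ch.isalpha():
            classes.add("letter")
        else:
            classes.add("symbol")
    return classes


def categorize(password: str):
    """Return (Class-X label, length bucket) for a password."""
    label = f"Class-{len(char_classes(password))}"
    n = len(password)
    if n < 10:
        bucket = "Short"
    elif n <= 16:
        bucket = "Medium"
    else:
        bucket = "Long"
    return label, bucket


print(categorize("Jamesbond007"))  # the example from the text
```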

The randomness level $r$ also plays a critical role in guessing accuracy. When $r$ is low $(r<0.3)$, the algorithm strongly prefers to copy characters from the original password, resulting in a small or even minimum number of injected ghost characters; the accuracy therefore remains high, and the original password is still likely to be guessed. On the other hand, a very high randomness level $(r>0.7)$ drives the algorithm to produce extremely long and random passwords, negatively impacting usability. To balance usability and security, we conclude that the ghost injection algorithm performs well for $r\in[0.4,0.6]$, and we discuss how to further reduce overhead with an acceptable trade-off in accuracy in the following experiments.

Table 3. Guessing accuracy of Class-2 passwords under 1,000 guesses.

Randomness Level	Length	Accuracy (%), Uniform	Accuracy (%), Markov
0.3	Medium	64.36	73.22
0.3	Long	60.80	55.28
0.5	Medium	44.32	40.28
0.5	Long	30.12	15.14
0.7	Medium	0.64	0.00
0.7	Long	0.30	0.04

#3: Overhead in Typing. While increasing the number of ghost characters consistently reduces the attack accuracy, it also introduces additional overhead, which can impact usability. As shown in Table 5, overhead measured in moving distance exhibits a trade-off with guessing accuracy. For instance, for medium-length passwords at a fixed randomness level of 0.4, increasing $\lambda$ from 0 (i.e., no constraint) to 0.2 and 0.5 reduces the overhead from 34.5% to 24.2% and 18.1%, respectively, but simultaneously increases the adversary’s guessing accuracy by 1.12 and 3.50 percentage points, respectively. We also see that the main factor affecting accuracy is the randomness level: applying constraints does not cause the guessing accuracy to deviate significantly from the unconstrained baseline at the same randomness level. Moreover, as more ghost characters are injected, our method achieves a greater overhead reduction, thereby improving usability. These results demonstrate that VRSafe can be flexibly configured with different methods and parameters for different user and application needs.
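As a worked check of these figures, the overhead can be recomputed from the Table 5 entries for medium-length passwords at randomness level 0.4, using the baseline moving distance of 3.59 given in the note below Table 5. Defining overhead as the relative increase in moving distance over the baseline is our reading of the reported numbers; the variable and function names are illustrative.

```python
BASELINE_MEDIUM = 3.59  # distance for original medium-length passwords

# (distance, guessing accuracy %) at randomness level 0.4, Medium, Table 5.
R04_MEDIUM = {
    "no constraint": (4.83, 62.68),
    "lambda=0.2": (4.46, 63.80),
    "lambda=0.5": (4.24, 66.18),
}


def overhead_pct(distance: float, baseline: float) -> float:
    """Relative increase in moving distance over the baseline, in percent."""
    return (distance / baseline - 1.0) * 100.0


base_acc = R04_MEDIUM["no constraint"][1]
for name, (dist, acc) in R04_MEDIUM.items():
    print(f"{name}: overhead {overhead_pct(dist, BASELINE_MEDIUM):.1f}%, "
          f"accuracy change {acc - base_acc:+.2f} points")
```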

#4: Detecting Malicious Logins. Previous experiments show that injecting ghost characters cannot prevent the adversary from guessing the original password if a large number of guesses is allowed. In this experiment, we build a malicious login detector based on a Bloom filter with a maximum expected number of elements $n=10^{6}$, an expected FPR of $10^{-30}$, and SHA-256 as the hash function, which provides enough capacity to store honeywords with a low risk of false positives. We believe this is a reasonable setup for most websites that enforce a mandatory password renewal policy with a period not exceeding one year. Given that a typical user is likely to log in roughly a thousand times in this period (Florencio and Herley, 2007), the chosen configuration ensures that a sufficient number of honeywords can be generated without exceeding the Bloom filter’s maximum expected capacity. We generate a total of 20 honeywords, including the ghost password, using the same password guessing framework but with only 1/10 of the data size, which is already sufficient for detecting malicious logins with high accuracy. As shown in Table 4, on average, 52.04% of malicious attempts are detected at the first guess, and 83.97% are identified within 10 malicious login attempts. If we expand the number of honeywords to 100, the detector achieves detection rates of 72.43% and 96.05% under 1 and 10 login attempts, respectively.
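For reference, the textbook sizing formulas for a Bloom filter, $m=-n\ln p/(\ln 2)^{2}$ and $k=(m/n)\ln 2$, can be evaluated for the configuration above. This is a sketch under the assumption that the filter is sized with these standard formulas; the library used in our experiments may round or allocate differently, so the result should be read as an order-of-magnitude estimate.

```python
import math


def bloom_parameters(n: int, p: float):
    """Standard optimal Bloom filter sizing for n elements and target FPR p.

    Returns (m, k): number of bits and number of hash functions.
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = bloom_parameters(n=10**6, p=1e-30)
print(f"bits: {m}, hash functions: {k}, size: {m / 8 / 1e6:.2f} MB")

# Capacity check with the figures from the text: about 1,000 logins per
# year, each contributing at most 20 honeywords.
print("honeywords per year (upper bound):", 1000 * 20)
```

Under these assumptions, one year of logins contributes at most about 20,000 honeywords, well below the configured capacity of $10^{6}$ elements.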

Table 4. Malicious login attempts detection rate under # of honeywords = 20.

Randomness Level	Login Attempts	Detection Rate (%)
0.3	1	57.18
0.3	10	86.32
0.5	1	54.76
0.5	10	85.36
0.7	1	42.20
0.7	10	78.04

Table 5. Overhead and accuracy of ghost passwords under different constraints.

Randomness Level	Length	No Constr. (Distance/Acc(%))	Soft $\lambda=0.2$ (Distance/Acc(%))	Soft $\lambda=0.5$ (Distance/Acc(%))	Hard $\tau=3$ (Distance/Acc(%))	Hard $\tau=6$ (Distance/Acc(%))
0.3	Medium	4.52/73.22	4.26/74.72	4.11/75.56	4.07/73.84	4.2/74.92
0.3	Long	5.14/55.28	4.89/57.26	4.75/57.54	4.71/57.46	4.84/56.60
0.4	Medium	4.83/62.68	4.46/63.80	4.24/66.18	4.19/64.52	4.37/64.64
0.4	Long	5.83/35.68	5.34/35.72	5.05/36.82	4.97/36.60	5.21/36.38
0.5	Medium	5.88/40.28	5.09/41.70	4.65/42.00	4.56/42.06	4.91/41.54
0.5	Long	7.36/15.14	6.23/15.34	5.63/15.96	5.54/16.34	6.01/14.90
0.6	Medium	8.48/3.08	6.64/4.50	5.69/4.02	5.51/3.62	6.33/3.68
0.6	Long	10.06/0.94	7.9/0.94	6.73/0.90	6.48/1.00	7.48/1.28
0.7	Medium	10.30/0.00	7.86/0.04	6.47/0.10	6.26/0.02	7.46/0.06
0.7	Long	11.53/0.04	8.82/0.00	7.36/0.02	7.07/0.02	8.39/0.02
0.8	Medium	10.37/0.00	7.94/0.00	6.57/0.00	6.31/0.00	7.54/0.00
0.8	Long	11.69/0.00	8.94/0.00	7.45/0.00	7.17/0.00	8.5/0.00

Note: Distance for original medium length passwords is 3.59, for long passwords is 4.3.

#5: Computing Resource Cost. We evaluated the computing resource usage for VRSafe. The time and memory costs of the malicious login detection bloom filter are summarized in Table 6, with a storage space cost of approximately 17.55 MB. We further measured the CPU and memory cost for running VRSafe using Android Debug Bridge (ADB), a widely used development toolkit recommended in official Meta documentation (Meta, 2024). Over a 10-minute profiling session on Meta Quest 2, VRSafe with ghost character added during typing has a CPU utilization averaging 48% (with a range of 35-62%) and memory averaging $1.77$ $\mathrm{GB}$ . In comparison, typing without the ghost character (i.e. using the regular QWERTY keyboard) showed a similar average CPU utilization of 48% (with a range of 35-52%) and a slightly lower average memory usage of $1.75$ $\mathrm{GB}$ . These results indicate that enabling the ghost character introduces only marginal memory overhead and no noticeable increase in average CPU utilization.

Table 6. Bloom filter time and space cost.

Operation	Time (s)	Memory (KiB)
Init	4.20e-05	0.37
Insert	7.02e-06	4.38
Lookup	4.35e-06	4.38

4.3. User Study

Methods. Our study was approved by our Institutional Review Board (IRB). To simulate password-based account logins within a VR environment, we developed a VR application using Unity 2022.3.24f1 that uses VRSafe as the keyboard for typing (Fig. 3). Participants were seated in a chair and instructed to complete the password typing tasks, and their actions were recorded by a camera to simulate video-based keystroke inference attacks (see Fig. 8 in Appendix A). All participants were informed of the recording and provided consent. To protect participants’ privacy, facial information was excluded from the recordings, and the passwords used for the typing tasks were selected from the publicly available leaked datasets. We did not initially disclose that the recording simulated an attack scenario, to avoid influencing participants’ natural typing behavior. After the study, we fully debriefed participants on the experiment design, including the purpose of the video recording, and addressed any concerns or questions. Each participant received 10 USD as appreciation. Participants were first introduced to the essentials of using the VR device (e.g., moving the cursor, typing), as well as the differences between a standard keyboard and the VRSafe keyboard. They were then given five minutes to explore the application freely. The main experiment consisted of ten password typing tasks. The first two tasks involved standard keyboard input without ghost characters, serving as a baseline. The remaining eight tasks incorporated ghost characters, simulating the enhanced security mechanism of VRSafe. After completing each task, the participant notified the researcher and clicked the “Next” button on the login page to proceed to the next task. Upon completion of all tasks, participants were asked to complete a post-study questionnaire regarding their experience. 
We used MediaPipe (Zhang et al., 2020), a popular hand-tracking framework to extract the keystrokes from the collected video clips.

Results. We recruited 15 participants (mean age = 25.6, sd = 4.22; 7 males, 8 females) through printed flyers and announcements in weekly newsletters. All participants were over 18 years old. There were no restrictions regarding student status or prior VR experience; the only requirement was normal or corrected-to-normal vision when wearing the VR headset. Among the 15 participants, 9 had no prior experience with VR headsets, and the remaining 6 used VR devices less than once per month. A total of 150 video clips were recorded, of which 6 were discarded. (MediaPipe failed to recognize hands in 2 videos because the participant wore long nail extensions; 3 videos were discarded because the participant mistook the ghost character typing process for a bug and terminated the task; and 1 video was discarded because hand tracking was lost and the participant could not proceed with the task.) The average password length across all tasks was 8.35. The average number of keystrokes detected in the baseline tasks (without ghost characters) was 10.5, while the average in the remaining eight tasks (with ghost characters) was 16.2.

We find that the “false keystrokes” are treated as authentic ones under the threat model and appear in the inferred keystroke sequence, as expected. We next examine whether using VRSafe results in longer entry times than a normal QWERTY keyboard. In Fig. 7, we plot the entry times of all 15 participants across the 10 typing tasks and highlight the first two tasks (i.e., no ghost characters) with red dots. We observe that the average entry times of most participants fall within 40 to 60 seconds, with no evident pattern suggesting that tasks with ghost characters take significantly longer. For instance, some participants (e.g., P2, P13) completed baseline tasks more quickly, while others (e.g., P1, P8) showed longer or mixed durations. As the visual distributions do not suggest a clear difference in entry time between the conditions, we aggregated data across all participants and applied the Mann-Whitney U test (Mann and Whitney, 1947) to formally assess the effect of ghost characters on entry time. The analysis indicates a statistically significant difference between the two groups $(U=1148.5,p=0.0072<0.05)$. To investigate the source of this difference, we performed pair-wise tests. Specifically, we treated the two baseline tasks from each of the 15 participants as a single baseline group (2×15=30 samples in total) and grouped the remaining tasks by task index (i.e., all third tasks from each participant, all fourth tasks, and so on). We then applied pair-wise tests between the baseline group and each of the other groups. The third-task group (i.e., the first task involving ghost characters) shows the greatest difference from the baseline, suggesting that it is the primary driver of the overall significant difference. This indicates that even though participants were already aware of the ghost character mechanism, they still paused when they first encountered it. 
We attribute this to a natural learning curve, and expect the impact to diminish as users become more familiar with the application.
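For readers unfamiliar with the test, the following sketch computes the Mann-Whitney U statistic from its pairwise definition and a two-sided p-value from the normal approximation with continuity correction. It is not the analysis code used in this study: the entry times below are hypothetical, and the approximation omits the tie correction that a statistics package would normally apply.

```python
import math


def mann_whitney_u(x, y):
    """Return (U_x, U_y); ties between an x and a y count as 0.5."""
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0
              for a in x for b in y)
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


def two_sided_p_normal(u, n1, n2):
    """Two-sided p-value via the normal approximation (no tie correction)."""
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = max((abs(u - mu) - 0.5) / sigma, 0.0)  # continuity correction
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical entry times in seconds (not the study's data).
baseline_times = [38.2, 41.5, 44.0, 36.9, 47.3, 40.1]
ghost_times = [45.8, 52.4, 49.9, 43.7, 58.1, 50.6, 47.2, 55.0]

u_base, u_ghost = mann_whitney_u(baseline_times, ghost_times)
p = two_sided_p_normal(min(u_base, u_ghost),
                       len(baseline_times), len(ghost_times))
print(u_base, u_ghost, p)
```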

In the post-task questionnaire, we use the System Usability Scale (SUS) (Brooke et al., 1996) to evaluate usability, and Perceived Ease of Use (PEOU), Perceived Usefulness (PU) and Behavioral Intention to Use (BI) from the Technology Acceptance Model (TAM) (Davis, 1989) to measure user acceptance. (The SUS score ranges from 0 to 100; PEOU, PU and BI range from 0 to 5.) The full questionnaire can be found in Table 10 (Appendix A). We also test the validity and reliability of our questions using Cronbach’s Alpha (Cronbach, 1951) and Average Variance Extracted (AVE) (Fornell and Larcker, 1981) (Table 9, Appendix A). All Cronbach’s Alpha scores are around or above 0.8, and all AVEs exceed 0.6, indicating that the questions for each aspect consistently measure the same construct. The statistical results are shown in Table 7, and per-question descriptive statistics of SUS are provided in Table 8 (Appendix A). We also asked open-ended questions to encourage participants to share their thoughts freely, and we find that participants evaluate VRSafe differently depending on the aspects they prioritize. Some participants appreciated the enhanced security and expressed a willingness to adopt even more sophisticated methods. Others thought that typing in VR is already challenging, so adding extra steps could become overwhelming even if it improves security. This divergence in user perspectives helps explain the wide variance observed in the SUS scores. Participants also commented on the general experience of using VR devices, such as “improve responsiveness”, “control can be more accurate”, and “headset feels too heavy”, which are not directly related to the VRSafe design.
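The 0 to 100 SUS range follows from the standard SUS scoring rule, sketched below. The sketch assumes the standard ten-item SUS with responses on a 1 to 5 scale, where odd-numbered items are positively worded and even-numbered items are negatively worded; the exact items used in this study are listed in Table 10.

```python
def sus_score(responses):
    """Standard SUS score for ten responses on a 1-5 scale.

    Odd-numbered items contribute (response - 1), even-numbered items
    contribute (5 - response); the sum is scaled by 2.5 to 0-100.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for item, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("responses must be between 1 and 5")
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5


# Hypothetical responses from one participant.
print(sus_score([4, 2, 4, 3, 3, 2, 4, 2, 3, 3]))
```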

Table 7. Statistical results of participants’ survey feedback.

Item	Min	Max	Mean	Sd
PEOU	1.75	4.75	3.88	0.83
PU	2.33	5.00	3.38	0.80
BI	1.50	5.00	3.30	1.28
SUS	25	90	58.6	14.4

5. Conclusions

In this paper, we present VRSafe, a VR keyboard that protects users from keystroke inference attacks. The proposed solution is designed to enhance the word-level security of password inputs for VR users using a standard virtual QWERTY keyboard layout. To the best of our knowledge, VRSafe is the first solution to incorporate a detection mechanism that identifies unauthorized login attempts using credentials inferred from keystroke inference attacks. Our experimental evaluation shows that VRSafe is effective against attacks based on password guessing models and incurs only a modest overhead to achieve enhanced password security and detect malicious login attempts. The results of our user study indicate that VRSafe can be a usable alternative for normal users with higher security needs. VRSafe is also potentially applicable to other forms of keystroke inference attacks, including acoustic-based and WiFi-based attacks, and our future work will focus on extending VRSafe to protect against those forms of attacks. We also plan to investigate mechanisms for protecting sensitive text entered by virtual avatars in extended reality applications. Unlike conventional keystroke attacks, where adversaries observe a victim’s physical actions in the real world, extended reality environments allow adversaries to directly inspect the actions of virtual avatars, potentially revealing additional information that can be exploited to infer keystrokes.

Acknowledgements.

This material is based upon work supported by the National Science Foundation under Grant #2211507. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the partial support for this work through a grant from the Institute for Cyber Law, Policy, and Security (Pitt Cyber) at the University of Pittsburgh.

References

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: §3.3.
A. Al Arafat, Z. Guo, and A. Awad (2021) Vr-spy: a side-channel attack on virtual key-logging in vr headsets. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pp. 564–572. Cited by: §1, §2.
H. Althebeiti, R. Gedawy, A. Alghuried, D. Nyang, and D. Mohaisen (2024) Defending airtype against inference attacks using 3d in-air keyboard layouts: design and evaluation. In Information Security Applications, H. Kim and J. Youn (Eds.), Singapore, pp. 159–174. External Links: ISBN 978-981-99-8024-6 Cited by: §2.
B. H. Bloom (1970) Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13 (7), pp. 422–426. Cited by: §3.4.
F. Boutros, N. Damer, K. Raja, R. Ramachandra, F. Kirchbuchner, and A. Kuijper (2020) Iris and periocular biometrics for head mounted displays: segmentation, recognition, and synthetic data generation. Image and Vision Computing 104, pp. 104007. Cited by: §1.
J. Brooke et al. (1996) SUS-a quick and dirty usability scale. Usability evaluation in industry 189 (194), pp. 4–7. Cited by: §4.3.
L. J. Cronbach (1951) Coefficient alpha and the internal structure of tests. psychometrika 16 (3), pp. 297–334. Cited by: §4.3.
P. Cronin, X. Gao, C. Yang, and H. Wang (2021) Charger-Surfing: exploiting a power line side-channel for smartphone information leakage. In 30th USENIX Security Symposium (USENIX Security 21), pp. 681–698. Cited by: §1.
F. D. Davis (1989) Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly 13 (3), pp. 319–340. External Links: ISSN 02767783, 21629730, Link Cited by: §4.3.
D. Florencio and C. Herley (2007) A large-scale study of web password habits. In Proceedings of the 16th international conference on World Wide Web, pp. 657–666. Cited by: §3.4, §4.2.
C. Fornell and D. F. Larcker (1981) Evaluating structural equation models with unobservable variables and measurement error. Journal of marketing research 18 (1), pp. 39–50. Cited by: §4.3.
S. R. K. Gopal, D. Shukla, J. D. Wheelock, and N. Saxena (2023) Hidden reality: caution, your hand gesture inputs in the immersive virtual world are visible to all!. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 859–876. Cited by: §1, §1, §2, §3.1.1.
N. R. Group (2022) Beyond reality: is the vr revolution on the horizon?. Note: https://www.nrgmr.com/our-thinking/technology/the-vr-revolution-might-finally-be-on-the-horizon/ Accessed: 2025-04-10 Cited by: §1.
H. Gupta, S. Sural, V. Atluri, and J. Vaidya (2018) A side-channel attack on smartphones: deciphering key taps using built-in microphones. Journal of computer security 26 (2), pp. 255–281. Cited by: §1.
Z. Huang, L. Bauer, and M. K. Reiter (2024) The impact of exposed passwords on honeyword efficacy. In 33rd USENIX Security Symposium, USENIX Security 2024, Philadelphia, PA, USA, August 14-16, 2024, D. Balzarotti and W. Xu (Eds.), External Links: Link Cited by: §3.4.
A. Juels and R. L. Rivest (2013) Honeywords: making password-cracking detectable. In 2013 ACM SIGSAC Conference on Computer and Communications Security, CCS’13, Berlin, Germany, November 4-8, 2013, A. Sadeghi, V. D. Gligor, and M. Yung (Eds.), pp. 145–160. External Links: Link, Document Cited by: §3.4.
P. G. Kelley, S. Komanduri, M. L. Mazurek, R. Shay, T. Vidas, L. Bauer, N. Christin, L. F. Cranor, and J. Lopez (2012) Guess again (and again and again): measuring password strength by simulating password-cracking algorithms. In 2012 IEEE symposium on security and privacy, pp. 523–537. Cited by: §4.2.
N. Kumar (2025) Virtual reality statistics 2025: users & trends. Note: https://www.demandsage.com/virtual-reality-statistics/ Accessed: 2025-07-26 Cited by: §1.
J. Lee, H. Kim, and K. Lee (2023) VRKeyLogger: virtual keystroke inference attack via eavesdropping controller usage pattern in webvr. Computers & Security 134, pp. 103461. Cited by: §1, §1.
Y. Li, Y. Cheng, W. Meng, Y. Li, and R. H. Deng (2020) Designing leakage-resilient password entry on head-mounted smart wearable glass devices. IEEE Transactions on Information Forensics and security 16, pp. 307–321. Cited by: §2.
Z. Ling, Z. Li, C. Chen, J. Luo, W. Yu, and X. Fu (2019) I know what you enter on gear vr. In 2019 IEEE Conference on Communications and Network Security (CNS), pp. 241–249. Cited by: §2, §3.1.1.
S. Luo, X. Hu, and Z. Yan (2022) Holologger: keystroke inference on mixed reality head mounted displays. In 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 445–454. Cited by: §2, §3.1.1.
S. Luo, A. Nguyen, H. Farooq, K. Sun, and Z. Yan (2024) Eavesdropping on controller acoustic emanation for keystroke inference attack in virtual reality. In The Network and Distributed System Security Symposium (NDSS), Cited by: §1, §2.
S. Luo, A. Nguyen, C. Song, F. Lin, W. Xu, and Z. Yan (2020) OcuLock: exploring human visual system for authentication in virtual reality head-mounted display. In 2020 Network and Distributed System Security Symposium (NDSS), Cited by: §1.
J. Ma, W. Yang, M. Luo, and N. Li (2014) A study of probabilistic password models. In 2014 IEEE Symposium on Security and Privacy, pp. 689–704. Cited by: §3.3.
A. Maiti and K. Crager (2017) Randompad: usability of randomized mobile keypads for defeating inference attacks. In Proceedings of the IEEE Euro S&P Workshop on Innovations in Mobile Privacy & Security (IMPS), Cited by: §1.
A. Maiti, M. Jadliwala, and C. Weber (2017) Preventing shoulder surfing using randomized augmented reality keyboards. In 2017 IEEE international conference on pervasive computing and communications workshops (PerCom Workshops), pp. 630–635. Cited by: §2.
H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pp. 50–60. Cited by: §4.3.
W. Melicher, B. Ur, S. M. Segreti, S. Komanduri, L. Bauer, N. Christin, and L. F. Cranor (2016) Fast, lean, and accurate: modeling password guessability using neural networks. In 25th USENIX Security Symposium (USENIX Security 16), pp. 175–191. Cited by: §4.2.
Meta Platforms, Inc. (2025) Locomotion types. External Links: Link Cited by: §3.3.
Meta (2024) Note: Accessed: 2024-10-30 External Links: Link Cited by: §4.2.
Ü. Meteriz-Yıldıran, N. F. Yıldıran, A. Awad, and D. Mohaisen (2022) A keylogging inference attack on air-tapping keyboards in virtual environments. In 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 765–774. Cited by: §2.
Microsoft (2024) Note: https://github.com/microsoft/MixedRealityToolkit-Unity External Links: Link Cited by: §3.3.
T. Ni, Y. Du, Q. Zhao, and C. Wang (2024) Non-intrusive and unconstrained keystroke inference in vr platforms via infrared side channel. arXiv preprint arXiv:2412.14815. Cited by: §2.
A. Nosenko, Y. Cheng, and H. Chen (2023) Password and passphrase guessing with recurrent neural networks. Information Systems Frontiers 25 (2), pp. 549–565. Cited by: §1.
B. Pal, T. Daniel, R. Chatterjee, and T. Ristenpart (2019) Beyond credential stuffing: password similarity models using neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 417–434. Cited by: §1, §3.1.2, §4.1, §4.2.
C. Ron Cresswell (2021) The comb data leak: what you should know. External Links: Link Cited by: §3.3, §4.1.
M. Sabra, A. Maiti, and M. Jadliwala (2020) Zoom on the keystrokes: exploiting video calls for keystroke inference attacks. arXiv preprint arXiv:2010.12078. Cited by: §1.
Y. Shen, H. Wen, C. Luo, W. Xu, T. Zhang, W. Hu, and D. Rus (2018) GaitLock: protect virtual and augmented reality headsets using gait. IEEE Transactions on Dependable and Secure Computing 16 (3), pp. 484–497. Cited by: §1.
C. Slocum, Y. Zhang, N. Abu-Ghazaleh, and J. Chen (2023) Going through the motions: AR/VR keylogging from user head motions. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 159–174. Cited by: §1, §1, §2.
R. L. Solso, P. F. Barbuto, and C. L. Juel (1979) Bigram and trigram frequencies and versatilities in the english language. Behavior Research Methods & Instrumentation 11 (5), pp. 475–484. Cited by: §3.3.
J. Soni and N. Prabakar (2021) KeyNet: enhancing cybersecurity with deep learning-based lstm on keystroke dynamics for authentication. In International Conference on Intelligent Human Computer Interaction, pp. 761–771. Cited by: §1.
S. Stephenson, B. Pal, S. Fan, E. Fernandes, Y. Zhao, and R. Chatterjee (2022) Sok: authentication in augmented and virtual reality. In 2022 IEEE symposium on security and privacy (SP), pp. 267–284. Cited by: §1.
J. Sun, X. Jin, Y. Chen, J. Zhang, Y. Zhang, and R. Zhang (2016) Visible: video-assisted keystroke inference from tablet backside motion. In NDSS. Cited by: §1.
I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. Advances in neural information processing systems 27. Cited by: §3.3.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §3.3.
T. Wan, L. Zhang, Y. Xu, Z. Guo, B. Gao, and H. Liang (2024) Analysis and design of efficient authentication techniques for password entry with the qwerty keyboard for vr environments. IEEE Transactions on Visualization and Computer Graphics. Cited by: §1, §2.
D. Wang, Z. Zhang, P. Wang, J. Yan, and X. Huang (2016) Targeted online password guessing: an underestimated threat. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 1242–1254. Cited by: §4.2.
D. Wang, Y. Zou, Q. Dong, Y. Song, and X. Huang (2022) How to attack and generate honeywords. In 43rd IEEE Symposium on Security and Privacy, SP 2022, San Francisco, CA, USA, May 22-26, 2022, pp. 966–983. External Links: Link, Document Cited by: §3.4.
D. Wang, Y. Zou, Y. Xiao, S. Ma, and X. Chen (2023) Pass2Edit: A multi-step generative model for guessing edited passwords. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 983–1000. Cited by: §1, §3.1.2, §4.1, §4.2.
H. Wang, Z. Zhan, H. Shan, S. Dai, M. Panoff, and S. Wang (2024a) GAZEploit: remote keystroke inference attack by gaze estimation from avatar views in vr/mr devices. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 1731–1745. Cited by: §2.
H. Wang, J. Hu, T. Zheng, J. Hu, Z. Chen, H. Jiang, Y. Zheng, and J. Luo (2024b) MuKI-fi: multi-person keystroke inference with bfi-enabled wi-fi sensing. IEEE Transactions on Mobile Computing. Cited by: §1.
K. C. Wang and M. K. Reiter (2024) Bernoulli honeywords. In 31st Annual Network and Distributed System Security Symposium, NDSS 2024, San Diego, California, USA, February 26 - March 1, 2024. External Links: Link Cited by: §3.4.
M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek (2009) Password cracking using probabilistic context-free grammars. In 2009 30th IEEE symposium on security and privacy, pp. 391–405. Cited by: §4.2.
Y. Wu, C. Shi, T. Zhang, P. Walker, J. Liu, N. Saxena, and Y. Chen (2023) Privacy leakage via unrestricted motion-position sensors in the age of virtual reality: a study of snooping typed input on virtual keyboards. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 3382–3398. Cited by: §1, §1, §2, §3.1.2.
E. Yang, S. Fang, I. Markwood, Y. Liu, S. Zhao, Z. Lu, and H. Zhu (2022) Wireless training-free keystroke inference attack and defense. IEEE/ACM Transactions on Networking 30 (4), pp. 1733–1748. Cited by: §1, §3.3.
J. Yang, W. Li, H. Cheng, and P. Wang (2025) Targeted password guessing using neural language models. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. Cited by: §1.
Z. Yang, Y. Chen, Z. Sarwar, H. Schwartz, B. Y. Zhao, and H. Zheng (2023) Towards a general video-based keystroke inference attack. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 141–158. Cited by: §1, §1, §1, §1, §2, §2, §3.1.1, §3.3, §4.2.
F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C. Chang, and M. Grundmann (2020) Mediapipe hands: on-device real-time hand tracking. arXiv preprint arXiv:2006.10214. Cited by: §3.1.1, §4.3.
Y. Zhang, C. Slocum, J. Chen, and N. Abu-Ghazaleh (2023) It’s all in your head (set): side-channel attacks on AR/VR systems. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 3979–3996. Cited by: §1, §2.

Ethics Statement

All data collection procedures were reviewed and approved by the IRB (STUDY24120080) to ensure that no personally identifiable information or sensitive user data was gathered without explicit consent. Participants were informed of the study’s purpose, potential risks, and their right to withdraw at any time. All data was stored securely to prevent unauthorized access. The results of this work are presented solely for scientific and educational purposes and are not intended to facilitate any malicious activity.

Open Science Statement

To support reproducibility and foster future research on VR input security, we will release our VRSafe virtual keyboard implementation in Unity along with the corresponding source code at https://github.com/odinyuan/VRSafe. For the evaluation dataset (i.e., the COMB dataset), although the leak is publicly available online, we refrain from including its URL to avoid further dissemination. Interested researchers may contact the authors for details. All evaluation code is also made publicly available in the same repository. By making these resources available, we encourage researchers to replicate our findings, evaluate alternative attack and defense mechanisms, and extend VRSafe to broader VR interaction contexts.

Appendix A User Study

Table 8. SUS Question Statistics

#	Question	Mean	SD
1	I think I would like to use this VR Keyboard frequently.	2.73	1.22
2	I found the VR Keyboard unnecessarily complex.	2.80	1.14
3	I thought the VR Keyboard was easy to use.	3.46	1.30
4	I think that I would need the support of a technical person to be able to use the VR Keyboard.	2.33	1.34
5	I found the various functions in this VR Keyboard were well integrated.	3.73	0.70
6	I thought there was too much inconsistency in this VR Keyboard.	2.73	1.03
7	I would imagine that most people would learn to use this VR Keyboard very quickly.	4.06	0.79
8	I found this VR Keyboard very cumbersome to use.	3.06	1.22
9	I felt very confident using the VR Keyboard.	3.13	1.24
10	I needed to learn a lot of things before I could get going with this VR Keyboard.	2.73	1.22
Ratings on a 5-point Likert scale (1 = Strongly Disagree, 5 = Strongly Agree).
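For readers who want to relate the item-level statistics in Table 8 to a single usability score, the following is a minimal sketch of the standard SUS scoring rule: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5. Because the rule is linear, applying it to the item means approximates the mean SUS score across participants. The function name `sus_score` is illustrative and not taken from the VRSafe code base, and the result is computed from the rounded means in the table, so it may differ slightly from any per-participant figure.

```python
# Item means copied from Table 8, in item order 1..10.
ITEM_MEANS = [2.73, 2.80, 3.46, 2.33, 3.73, 2.73, 4.06, 3.06, 3.13, 2.73]


def sus_score(responses):
    """Standard SUS scoring for one set of 10 responses on a 1-5 scale.

    Odd-numbered (positively worded) items contribute (r - 1);
    even-numbered (negatively worded) items contribute (5 - r).
    The summed contributions are scaled by 2.5 to a 0-100 range.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0.0
    for item_number, r in enumerate(responses, start=1):
        total += (r - 1) if item_number % 2 == 1 else (5 - r)
    return total * 2.5


# Applying the rule to the item means approximates the mean SUS score
# (approximately 58.65 with the rounded values above).
approx_mean_sus = sus_score(ITEM_MEANS)
print(round(approx_mean_sus, 2))
```

The same function can be applied to each participant's raw responses when those are available; averaging the per-participant scores then avoids the rounding introduced by the table.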

Table 9. Reliability and validity test result (Cronbach’s Alpha and AVE)

Variable	Measurement Indicator	AVE	$\alpha$
Perceived Ease of Use (PEOU)	My interaction with VR Keyboard is clear and understandable (PEOU1)	0.641	0.865
	Learning to operate the VR Keyboard is easy for me (PEOU2)
	The VR Keyboard is user-friendly and requires little effort to understand (PEOU3)
	I find it easy to get the VR Keyboard to do what I want it to do (PEOU4)
Perceived Usefulness (PU)	Using the VR Keyboard would make me feel safer when entering passwords (PU1)	0.602	0.786
	Using the VR Keyboard helps me feel more confident that my passwords are protected (PU2)
	The VR Keyboard provides a safer way to enter my passwords compared to traditional methods (PU3)
Behavioral Intention To Use (BI)	I expect my use of the VR Keyboard to continue in the future (BI1)	0.736	0.852
	I would recommend this VR Keyboard to others (BI2)
AVE: Average Variance Extracted.
$\alpha$ : Cronbach’s Alpha.
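As an illustration of how the two statistics in Table 9 are typically obtained, the sketch below computes Cronbach's alpha from a participants-by-items response matrix and AVE as the mean of squared standardized factor loadings. The response matrix and loadings are hypothetical toy values, not the study data, and the function names are illustrative. Commonly used rules of thumb treat alpha of at least 0.7 and AVE of at least 0.5 as acceptable, which every construct in Table 9 satisfies.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per participant).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 participants")
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized factor loadings of a construct."""
    return sum(l * l for l in loadings) / len(loadings)


# Hypothetical 5-point responses for a four-item construct (NOT study data).
toy_responses = [
    [4, 4, 5, 4],
    [3, 3, 3, 2],
    [5, 4, 5, 5],
    [2, 3, 2, 2],
    [4, 5, 4, 4],
]
# Hypothetical standardized loadings (NOT taken from the paper).
toy_loadings = [0.82, 0.79, 0.85, 0.74]

print(cronbach_alpha(toy_responses))
print(average_variance_extracted(toy_loadings))
```

In practice the loadings would come from a confirmatory factor analysis of the survey responses; the sketch only shows the final aggregation step.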

Table 10. Survey questions used in the user study.

Statement	1	2	3	4	5
Technology Acceptance Model (TAM)
Learning to operate the VR Keyboard is easy for me.
The VR Keyboard is user-friendly and easy to understand.
I find it easy to get the VR Keyboard to do what I want.
Using the VR Keyboard makes me feel safer when entering passwords.
I am confident that my passwords are protected when using the VR Keyboard.
The VR Keyboard provides a safer password-entry method than traditional approaches.
I expect to continue using the VR Keyboard in the future.
I would recommend the VR Keyboard to others.
System Usability Scale (SUS)
I think I would like to use this VR Keyboard frequently.
I found the VR Keyboard unnecessarily complex.
I thought the VR Keyboard was easy to use.
I would need technical support to use the VR Keyboard.
The functions of the VR Keyboard are well integrated.
There is too much inconsistency in the VR Keyboard.
Most people would learn to use this VR Keyboard very quickly.
I found the VR Keyboard cumbersome to use.
I felt confident using the VR Keyboard.
I needed to learn many things before I could use the VR Keyboard effectively.
Open-ended Questions
For different accounts, do you have different security needs? What is your strategy?
For the proposed keyboard design, higher security often requires additional effort. What is your strategy for different password-entry needs?
In your opinion, what are the pros and cons of the new design?