Neural Discrete Representation Learning: Vector Quantized Variational AutoEncoder (VQ-VAE)

MyTron Labs·June 10, 2026

Learning useful representations without supervision remains a major challenge in machine learning. VQ-VAE addresses this by learning discrete latent representations using vector quantization. Unlike traditional VAEs, it uses discrete latent codes instead of continuous ones and learns the prior separately. The use of vector quantization helps avoid posterior collapse, where latent variables are ignored by a powerful decoder. Combined with an autoregressive prior, VQ-VAE can generate high-quality images, videos, and speech while learning meaningful representations useful for tasks such as speaker conversion and unsupervised phoneme discovery.

INTRODUCTION

Recent advances in generative modeling have achieved remarkable success in images, audio, and video, but learning useful representations from unlabeled data remains a challenge. Traditional unsupervised learning methods often rely on likelihood maximization or reconstruction objectives, which do not always produce meaningful latent representations. In many cases, powerful decoders can achieve strong performance while largely ignoring the latent variables.

VQ-VAE addresses this problem by combining the Variational Autoencoder (VAE) framework with discrete latent representations through vector quantization. This approach is simple to train, avoids high-variance optimization issues, and prevents posterior collapse, ensuring that the latent variables remain informative. It is also one of the first discrete latent variable models to achieve performance comparable to continuous latent VAEs.

By effectively utilizing its latent space, VQ-VAE learns high-level and meaningful features instead of focusing on low-level noise or insignificant details. Once these discrete representations are learned, an autoregressive prior can be trained over them to generate high-quality samples. This enables a range of applications, including unsupervised phoneme discovery, speaker conversion, and modeling long-term dependencies in reinforcement learning tasks.

PREVIOUS DISCRETE VAEs

Training VAEs with discrete latent variables has traditionally been difficult, which is why most generative models rely on continuous latent variables even when the underlying data is naturally discrete. Earlier approaches such as NVIL and VIMCO use specialized gradient estimators, while methods like Concrete and Gumbel-Softmax use continuous relaxations that gradually approximate discrete variables during training.

Despite these advances, discrete latent variable methods generally fail to match the performance of continuous VAEs that use the Gaussian reparameterization trick and benefit from lower-variance gradients. Most prior methods were also evaluated on relatively simple datasets with small latent spaces.

VQ-VAE overcomes these limitations by using vector quantization to learn discrete latent representations directly. It is evaluated on complex image and speech datasets and builds upon earlier work combining VAEs with autoregressive decoders and priors. Unlike soft-to-hard quantization approaches used in neural compression, VQ-VAE performs explicit quantization and successfully learns discrete representations from scratch.

VQ-VAE

A standard VAE consists of three main components: an encoder that maps input data to a latent representation, a prior distribution over the latent variables, and a decoder that reconstructs the input from the latent representation. Traditional VAEs typically use continuous latent variables modeled with Gaussian distributions, enabling efficient training through the Gaussian reparameterization trick. Several extensions have improved VAEs further using autoregressive models and normalizing flows.

VQ-VAE replaces these continuous latent variables with discrete ones. Instead of sampling continuous latent vectors, the model selects discrete codes from a learned codebook using vector quantization. The selected code indexes an embedding table, and the resulting embedding is passed to the decoder for reconstruction.

1. Discrete Latent Variables

VQ-VAE defines a latent embedding space consisting of K embedding vectors, where K represents the size of the discrete latent space and each embedding vector has dimension D. The embedding space acts as a shared codebook from which discrete latent representations are selected.

Given an input, the encoder produces a continuous output representation. The model then performs a nearest-neighbour lookup in the embedding space and selects the embedding vector closest to the encoder output. The corresponding codebook entry becomes the discrete latent representation, and the selected embedding vector is passed to the decoder as its input.

The posterior distribution is deterministic and represented as a one-hot categorical distribution: the nearest codebook entry receives probability 1, while all other entries receive probability 0. This allows the forward computation to be viewed as a standard autoencoder with a discretization step that maps continuous encoder outputs to one of K embedding vectors.
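The lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `quantize`, the toy codebook with K = 3 and D = 2, and the sample encoder outputs are all assumptions made for the example.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook vector.

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) embedding vectors
    """
    # Squared Euclidean distance between every output and every embedding: (N, K)
    diffs = z_e[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)             # discrete codes, shape (N,)
    z_q = codebook[indices]                        # decoder input, shape (N, D)
    one_hot = np.eye(codebook.shape[0])[indices]   # deterministic posterior, (N, K)
    return indices, z_q, one_hot

# Toy values, chosen only for illustration
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])   # K = 3, D = 2
z_e = np.array([[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]])
indices, z_q, one_hot = quantize(z_e, codebook)
print(indices)   # [1 2 0]
```

Each row of `one_hot` places probability 1 on the nearest code and 0 elsewhere, which is the one-hot posterior described in the text.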

The parameters of the model consist of the encoder, decoder, and the embedding space itself. While the description uses a single latent variable for simplicity, the model extracts one-dimensional, two-dimensional, and three-dimensional latent feature spaces for speech, images, and videos respectively.

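Seen this way, the model is a VAE with a deterministic posterior, so the usual evidence lower bound still applies. With a uniform prior over the discrete latent variables, the KL divergence becomes a constant equal to log K. Because it does not depend on the encoder parameters, it can be ignored during training.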
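The log K value follows directly from the definition of the KL divergence. In the short derivation below, the notation q(z | x) for the one-hot posterior and p(z) = 1/K for the uniform prior is assumed for illustration, along with the usual convention that 0 log 0 = 0:

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)
  &= \sum_{k=1}^{K} q(z = k \mid x)\,\log\frac{q(z = k \mid x)}{p(z = k)} \\
  &= 1 \cdot \log\frac{1}{1/K} \\
  &= \log K
\end{aligned}
```

Only the selected code contributes to the sum, since every other term has zero posterior probability.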

2. Training VQ-VAE

The quantization operation used in VQ-VAE does not have a true gradient. To enable training, the model uses a straight-through estimator, where gradients from the decoder input are copied directly to the encoder output. During the forward pass, the nearest embedding vector is passed to the decoder, while during the backward pass the reconstruction gradient is passed unchanged to the encoder. Since the encoder output and decoder input share the same embedding space, these gradients provide useful information for reducing reconstruction error and can cause the encoder output to be assigned to a different embedding in subsequent forward passes.
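A small hand-worked example may make the straight-through step more concrete. The sketch below writes the gradients out manually in NumPy instead of using an autodiff framework. The linear decoder `W`, the squared-error reconstruction loss, the step size, and the direct update of `z_e` (standing in for an update of the encoder parameters) are simplifying assumptions for illustration only.

```python
import numpy as np

def nearest_code(z_e, codebook):
    """Index of the codebook vector closest to a single encoder output."""
    dists = np.sum((codebook - z_e) ** 2, axis=1)
    return int(np.argmin(dists))

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # K = 2, D = 2
W = np.eye(2)                                   # hypothetical linear decoder: x_hat = z_q @ W
x = np.array([1.0, 1.0])                        # input to reconstruct
z_e = np.array([0.4, 0.4])                      # encoder output, closest to code 0

# Forward pass: the decoder sees the quantized vector, not z_e
k = nearest_code(z_e, codebook)
z_q = codebook[k]
x_hat = z_q @ W
loss = np.sum((x - x_hat) ** 2)

# Backward pass: gradient of the loss with respect to the decoder input z_q
grad_z_q = -2.0 * (x - x_hat) @ W.T

# Straight-through estimator: copy that gradient unchanged to the encoder output
grad_z_e = grad_z_q.copy()

# One gradient step on z_e (a stand-in for updating the encoder)
z_e_new = z_e - 0.1 * grad_z_e
k_new = nearest_code(z_e_new, codebook)
print(k, k_new)   # 0 1
```

In this toy setting the copied gradient pushes the encoder output across the boundary between the two codes, so the next forward pass selects a different embedding, as described above.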

The training objective consists of three components. The first is the reconstruction loss, which optimizes both the decoder and the encoder through the straight-through estimator. Since the embedding vectors do not receive gradients from the reconstruction loss, a Vector Quantization (VQ) objective is introduced to learn the embedding space. This objective uses the squared distance between encoder outputs and embedding vectors to move the embeddings towards the encoder representations.

A commitment loss is added to ensure that the encoder commits to a selected embedding and to prevent its outputs from growing arbitrarily if the embeddings do not train as quickly as the encoder parameters. Together, these three terms form the complete VQ-VAE training objective.
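Written out, the three terms can be combined as follows. The notation is assumed for illustration: z_e(x) is the encoder output, e the selected embedding, z_q(x) the quantized decoder input, sg the stop-gradient operator (identity in the forward pass, zero gradient in the backward pass), and β a weight on the commitment term.

```latex
L = \underbrace{-\log p\big(x \mid z_q(x)\big)}_{\text{reconstruction}}
  + \underbrace{\big\lVert \mathrm{sg}[z_e(x)] - e \big\rVert_2^2}_{\text{vector quantization}}
  + \underbrace{\beta\,\big\lVert z_e(x) - \mathrm{sg}[e] \big\rVert_2^2}_{\text{commitment}}
```

The second and third terms have the same value in the forward pass (up to β). They differ only in which argument is held fixed, and therefore in which parameters they update.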

The stop-gradient operator is used to control which components are updated by each loss term. It behaves as the identity function during the forward pass but blocks gradients during backpropagation. As a result, the decoder is optimized only by the reconstruction loss, the encoder is optimized by the reconstruction and commitment losses, and the embedding vectors are optimized by the Vector Quantization loss.
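The gradient routing can be checked by hand on a single encoder output and its selected embedding. NumPy has no stop-gradient operator, so the sketch below writes each gradient out explicitly and sets the blocked ones to zero. The values of `beta`, `lr`, `z_e`, and `e` are illustrative assumptions.

```python
import numpy as np

beta = 0.25                    # commitment weight (illustrative value)
lr = 0.1                       # step size (illustrative value)
z_e = np.array([0.6, 0.2])     # encoder output
e = np.array([1.0, 0.0])       # selected embedding vector

# Forward values of the two auxiliary terms
vq_loss = np.sum((z_e - e) ** 2)               # ||sg[z_e] - e||^2
commit_loss = beta * np.sum((z_e - e) ** 2)    # beta * ||z_e - sg[e]||^2

# Gradients, honouring the stop-gradients
grad_e_from_vq = 2.0 * (e - z_e)               # VQ term updates the embedding
grad_z_e_from_vq = np.zeros_like(z_e)          # blocked by sg[z_e]
grad_z_e_from_commit = 2.0 * beta * (z_e - e)  # commitment term updates the encoder
grad_e_from_commit = np.zeros_like(e)          # blocked by sg[e]

# One step each: the embedding moves towards z_e, and z_e moves towards the embedding
e_new = e - lr * (grad_e_from_vq + grad_e_from_commit)
z_e_new = z_e - lr * (grad_z_e_from_vq + grad_z_e_from_commit)
```

The reconstruction term is omitted here because it reaches the encoder only through the straight-through estimator and never touches the embedding.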

3. Prior modelling

The prior distribution over the discrete latent variables is a categorical distribution and can be made autoregressive by conditioning on other latent variables in the feature map. During VQ-VAE training, the prior is kept fixed and uniform. Once the discrete latent representations have been learned, an autoregressive prior is fitted over the latent codes to model their distribution. This learned prior is then used to generate new samples through ancestral sampling. For images, a PixelCNN is used to model the discrete latent variables, while for raw audio a WaveNet is used. Jointly training the prior and the VQ-VAE is not explored and is left for future work.
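The two-stage idea, fitting a prior over codes and then sampling from it one code at a time, can be illustrated with a much simpler model than PixelCNN or WaveNet. The sketch below swaps in a first-order (bigram) categorical prior estimated by counting. The function names, the additive smoothing, and the toy code sequences are assumptions for illustration and are not taken from the original work.

```python
import numpy as np

def fit_bigram_prior(sequences, K, smoothing=1.0):
    """Estimate p(z_1) and p(z_t | z_{t-1}) by counting, with additive smoothing."""
    initial = np.full(K, smoothing)
    transition = np.full((K, K), smoothing)
    for seq in sequences:
        initial[seq[0]] += 1
        for prev, cur in zip(seq[:-1], seq[1:]):
            transition[prev, cur] += 1
    initial /= initial.sum()
    transition /= transition.sum(axis=1, keepdims=True)
    return initial, transition

def ancestral_sample(initial, transition, length, rng):
    """Sample z_1 first, then each z_t conditioned on the previously sampled code."""
    K = len(initial)
    codes = [rng.choice(K, p=initial)]
    for _ in range(length - 1):
        codes.append(rng.choice(K, p=transition[codes[-1]]))
    return np.array(codes)

# Toy code sequences standing in for the latents of a trained VQ-VAE
sequences = [[0, 1, 2, 0, 1, 2], [1, 2, 0, 1, 2, 0], [2, 0, 1, 2, 0, 1]]
initial, transition = fit_bigram_prior(sequences, K=3)

rng = np.random.default_rng(0)
sample = ancestral_sample(initial, transition, length=8, rng=rng)
```

In a full pipeline the sampled codes would be looked up in the codebook and passed through the decoder to produce an image or waveform. A PixelCNN or WaveNet prior follows the same sampling pattern but conditions each code on far more context than the single previous code used here.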

EXPERIMENTS

VQ-VAE was compared with a standard VAE and VIMCO on CIFAR10 using the same architecture while varying latent capacity. The VAE, VQ-VAE, and VIMCO achieved 4.51, 4.67, and 5.14 bits/dim respectively (lower is better). VQ-VAE therefore comes close to the continuous VAE and clearly outperforms VIMCO despite using discrete latent variables, which makes it the first discrete latent variable model to approach the likelihood performance of continuous VAEs.

To evaluate image modeling, images were compressed from 128×128×3 pixels to a 32×32×1 latent space with 512 codebook entries (9 bits per code), a reduction of more than 42× when measured in bits. Despite this compression, reconstructions remained visually similar to the originals. Training a PixelCNN prior on the discrete latent space enabled efficient modeling of high-level image structure, and similar results were observed on DeepMind Lab frames. Even with a powerful PixelCNN decoder, VQ-VAE continued to make meaningful use of its latent variables and avoided posterior collapse.
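The compression figure can be reproduced with a short calculation. It assumes 8 bits per colour channel for the input image, which the text above does not state explicitly, and treats each latent position as one index into the 512-entry codebook.

```python
import math

image_bits = 128 * 128 * 3 * 8               # assumes 8 bits per colour channel
bits_per_code = math.log2(512)               # 9 bits to index 512 codebook entries
latent_bits = 32 * 32 * 1 * bits_per_code    # one code per latent position
ratio = image_bits / latent_bits

print(f"{ratio:.1f}x")   # 42.7x
```

Counting elements instead of bits would give 128×128×3 / (32×32) = 48, so the 42× figure only makes sense as a ratio of bits.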

For raw speech, VQ-VAE compressed audio into a discrete latent space up to 128 times smaller than the original waveform while preserving the spoken content. An autoregressive prior trained on the latent codes generated speech containing clear words and partial sentences. The learned representations also enabled speaker conversion and showed a strong correlation with phonemes, despite being learned without any linguistic supervision.

CONCLUSION

VQ-VAE combines variational autoencoders with vector quantization to learn discrete latent representations. Through its compressed discrete latent space, it is able to model long-range dependencies across images, videos, and audio. Experiments showed successful generation of high-resolution images, action-conditioned video sequences, meaningful speech samples, and speaker conversion. The learned latent representations capture important features of the data in a completely unsupervised manner. Additionally, VQ-VAE achieves likelihoods close to continuous latent VAEs on CIFAR10 and demonstrates the ability to learn high-level speech representations closely related to phonemes.