[D] Prior work using pixel shift to improve VAE accuracy?

Reddit r/MachineLearning / 3/30/2026

📰 News

Key Points

  • A Reddit user is training a high-compression VAE (“f8ch32”, 8x compression, 32 channels) and is seeking methods to improve reconstruction fidelity beyond current results (better than SDXL f8ch4, worse than AuraFlow f8ch16).
  • They report experimenting with an “extreme” version of jitter-based training: generating many training crops via pixel shifting/stride-1 cropping from an upscaled image to brute-force accuracy.
  • The example workflow upsamples from 2048×2048 to (1024+ps)×(1024+ps), then extracts all adjacent 1024×1024 crops (e.g., 9 crops for ps=2) to create augmented training samples.
  • Initial improvements are claimed, but they still need to tune loss weighting schemes (e.g., L1 and edge-L1) to get the best fidelity under limited GPU resources.
  • They ask the community whether prior research or established approaches exist that use pixel shift/jitter-style augmentation specifically to enhance VAE reconstruction quality.

Currently, I'm attempting to train up a "f8ch32" VAE
( 8x compression factor, 32 channels)

Its current performance could be rated as "better than SDXL f8ch4, but worse than AuraFlow f8ch16".

My biggest challenge is improving reconstruction fidelity.
Various searches suggest to me that the publicly known methods for this sort of thing mostly rely on LPIPS and GAN losses.
The trouble with these is that LPIPS can over-smooth, and GANs start making things up.
The latter is fine if all you want is "a sharp end result", but lousy if you care about actual fidelity to the original image.

I decided to take the old training idea of "use jitter across your training image set" to the extreme, and use pixel shift to attempt to brute-force accuracy.

Specific example usage:

Take a higher resolution image such as 2048x2048.
Define some "pixel shift value". (for this example, ps=2)
Resize the high-res image down to the adjacent size of (1024+2)x(1024+2) = 1026x1026...
and then deliberately step through all stride-1 crops of 1024x1024 within it
(yielding 3x3 = 9 training images in this specific case)
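The crop enumeration above can be sketched roughly like this (a minimal NumPy version; function and parameter names are mine, not the author's):

```python
import numpy as np

def pixel_shift_crops(image: np.ndarray, crop: int, ps: int):
    """Enumerate all stride-1 crops of size (crop x crop) from an
    image of size (crop+ps) x (crop+ps), yielding (ps+1)**2 crops."""
    h, w = image.shape[:2]
    assert h == crop + ps and w == crop + ps, "resize to (crop+ps) first"
    for dy in range(ps + 1):
        for dx in range(ps + 1):
            yield image[dy:dy + crop, dx:dx + crop]

# For ps=2 and a 1026x1026 source, this yields 3*3 = 9 crops of 1024x1024.
img = np.zeros((1024 + 2, 1024 + 2, 3), dtype=np.uint8)
crops = list(pixel_shift_crops(img, crop=1024, ps=2))
assert len(crops) == 9
```

Each crop is the same scene shifted by sub-tile amounts, so the decoder sees the identical content landing on different pixel/latent-grid alignments.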

I seem to be having some initial success with this method.
However, now I have to play the tuning game to find the most effective weighting values for the loss functions I'm using, such as l1 and edge_l1 loss.
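For reference, a weighted combination of plain L1 and an edge-L1 term might look like the sketch below (a NumPy illustration with placeholder weights; the actual edge operator and weighting scheme the author uses are not specified):

```python
import numpy as np

def edge_l1(x: np.ndarray, y: np.ndarray) -> float:
    """L1 distance between finite-difference image gradients:
    penalizes edge-structure mismatch rather than raw intensity."""
    dh = np.abs(np.diff(x, axis=-1) - np.diff(y, axis=-1)).mean()
    dv = np.abs(np.diff(x, axis=-2) - np.diff(y, axis=-2)).mean()
    return float(dh + dv)

def recon_loss(pred: np.ndarray, target: np.ndarray,
               w_l1: float = 1.0, w_edge: float = 0.5) -> float:
    """Weighted sum of plain L1 and edge-L1; w_l1/w_edge are the
    knobs being tuned (these values are placeholders, not the OP's)."""
    l1 = float(np.abs(pred - target).mean())
    return w_l1 * l1 + w_edge * edge_l1(pred, target)
```

Sweeping w_edge against held-out reconstructions is one cheap way to run this tuning game without burning much GPU time per trial.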

Rather than having to continue blindly in the dark, with very limited GPU resources, I thought I would ask if anyone knows of prior work that has already blazed a trail in this area?

submitted by /u/lostinspaz