One-step Language Modeling
via Continuous Denoising

KAIST, Carnegie Mellon University

TL;DR

We introduce the Flow-based Language Model (FLM) and its distilled counterpart, the Flow-map Language Model (FMLM), which enable one-step parallel text generation through continuous denoising, achieving an 8.3× speedup over existing approaches. FLM outperforms discrete diffusion in both quality and speed.

Abstract

Language models based on discrete diffusion have attracted interest for their potential to generate text faster than autoregressive models. In practice, however, their sample quality degrades sharply in the few-step regime.

We show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build the Flow-based Language Model (FLM), which performs Euclidean denoising on one-hot token representations. The model is trained to predict clean data via multi-token classification, leveraging a simple time reparameterization that greatly improves training and generation.

By distilling FLM into its associated flow map, we obtain the Flow-map Language Model (FMLM), which is capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality that outperforms state-of-the-art discrete diffusion models in the many-step (512, 1024) regime. With FMLM, we outperform recent few-step language models across the board, achieving state-of-the-art performance in the few-step (1, 2, 4, 8) regime and matching the 8-step quality of distilled discrete diffusion baselines in a single step. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale.

Flow-based Language Models

FLM brings the benefits of continuous generative modeling, as used in image generation, to discrete state spaces by encoding text as one-hot vectors and using flow matching to map noise directly to one-hot data. FLM uses the same multi-token classification objective as discrete diffusion models, but instead of flipping the discrete state from one token to another, it gradually denoises all tokens in parallel. This lets FLM represent a superposition of sequences while still capturing correlations between tokens; losing these correlations is a fundamental bottleneck for discrete diffusion models in the few-step regime (see the figure below).
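To make the training recipe concrete, here is a minimal sketch of an FLM-style training step. It assumes a linear flow-matching interpolant between Gaussian noise and one-hot token vectors and a per-position cross-entropy loss on the clean-data prediction; the model interface `model(xt, t)` and the (omitted) time reparameterization are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def flm_training_step(model, tokens, vocab_size):
    """One training step. tokens: (batch, seq_len) integer token ids."""
    batch, seq_len = tokens.shape
    x1 = F.one_hot(tokens, vocab_size).float()        # clean data as one-hot vectors
    x0 = torch.randn_like(x1)                         # Gaussian noise endpoint
    t = torch.rand(batch, device=tokens.device)       # sampled time (reparameterization omitted)
    xt = (1 - t.view(-1, 1, 1)) * x0 + t.view(-1, 1, 1) * x1   # noisy interpolant between noise and data

    logits = model(xt, t)                             # predict the clean token at every position
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens.reshape(-1))
    return loss
```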

Factorization error in discrete diffusion. A toy dataset with two correlated modes: new-york and san-diego. With many steps, both methods generate valid data. With few steps, discrete diffusion's factorized transitions produce spurious mixtures (new-diego, san-york), while continuous flow remains correct.
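The factorization error can be reproduced with a few lines of arithmetic. The sketch below uses the assumed two-mode toy distribution from the figure and shows that any one-step sampler that factorizes over positions assigns probability 0.25 to each spurious mixture.

```python
import itertools

# Toy data: two equally likely, fully correlated sequences.
data = {("new", "york"): 0.5, ("san", "diego"): 0.5}

# Per-position marginals implied by the data.
p_first = {"new": 0.5, "san": 0.5}
p_second = {"york": 0.5, "diego": 0.5}

# One-step factorized sampling draws each position independently.
for w1, w2 in itertools.product(p_first, p_second):
    p = p_first[w1] * p_second[w2]
    valid = (w1, w2) in data
    print(f"{w1}-{w2}: p={p:.2f} ({'valid' if valid else 'spurious'})")
# new-diego and san-york each receive probability 0.25 despite never appearing in the data.
```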

We revisit the fundamentals of flows over discrete modalities and leverage a novel time reparameterization to enable efficient training of FLM. We also introduce methods to train a flow map over language data: using semigroup flow-map distillation, we build the associated flow map (FMLM), enabling one-step language generation in a regime where discrete baselines fail catastrophically.
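As a rough illustration of semigroup flow-map distillation, the sketch below trains a flow map F(x_t, t, s) so that one large jump matches the composition of two smaller jumps; the paper's exact objective, teacher involvement, and parameterization may differ. In practice the two-jump target is usually produced by a frozen teacher or an EMA copy of the student, a detail omitted here.

```python
import torch

def semigroup_distillation_loss(flow_map, xt, t, s, r):
    """xt: state at time t; t < s < r are (batch,) time tensors."""
    with torch.no_grad():
        x_s = flow_map(xt, t, s)        # composed target: first jump t -> s ...
        target = flow_map(x_s, s, r)    # ... then jump s -> r
    pred = flow_map(xt, t, r)           # student's single jump t -> r
    return torch.mean((pred - target) ** 2)
```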

Results

Many-step Generation

At 1024 sampling steps, FLM achieves a better generative perplexity than every discrete diffusion baseline on both LM1B and OpenWebText.
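For reference, many-step generation with a clean-data-prediction model can be implemented as plain Euler integration over the linear interpolant. The sketch below assumes that interface and a uniform time grid, which may not match the authors' sampler.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_flm(model, steps, batch, seq_len, vocab_size, device="cuda"):
    xt = torch.randn(batch, seq_len, vocab_size, device=device)    # start from pure noise at t = 0
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i].item(), ts[i + 1].item()
        t_batch = torch.full((batch,), t, device=device)
        logits = model(xt, t_batch)                                 # clean-data prediction
        x1_hat = F.softmax(logits, dim=-1)                          # expected one-hot sequence
        velocity = (x1_hat - xt) / max(1.0 - t, 1e-3)               # velocity of the linear interpolant
        xt = xt + (t_next - t) * velocity                           # Euler step toward the data
    return xt.argmax(dim=-1)                                        # decode each position to a token id
```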

Model         LM1B                     OpenWebText
              Gen. PPL ↓   Entropy     Gen. PPL ↓   Entropy
Dataset       —            4.31        —            5.44
RDLM          268.21       4.33        —            —
CANDI         120.99       4.35        143.13       5.71
MDLM          109.21       4.32        121.09       5.65
Duo           98.14        4.31        77.69        5.55
FLM (Ours)    96.91        4.29        62.23        5.33
Many-step generation comparison. FLM outperforms all discrete diffusion baselines in the many-step (512/1024) regime.
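For context on the metrics above, generative perplexity is typically computed by scoring generated samples with a frozen evaluator language model (GPT-2 Large is a common choice; the exact evaluator used in these experiments is an assumption here), while entropy measures the token diversity of the samples. A minimal sketch of the perplexity computation:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

@torch.no_grad()
def generative_perplexity(samples, device="cuda"):
    """samples: list of generated text strings (assumed to fit in the evaluator's context)."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
    lm = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()
    nlls, n_tokens = [], 0
    for text in samples:
        ids = tok(text, return_tensors="pt").input_ids.to(device)
        out = lm(ids, labels=ids)                     # mean NLL over the shifted tokens
        nlls.append(out.loss * (ids.numel() - 1))     # convert back to a summed NLL
        n_tokens += ids.numel() - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```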

Few-step & One-step Generation

Even after distillation, discrete diffusion baselines often show either perplexity blow-ups or entropy collapse (highly repetitive text) in the few-step regime. FMLM remains stable throughout, with one-step generation matching the 8-step quality of distilled discrete diffusion baselines on LM1B and competitive at 4 steps on OpenWebText.
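One-step generation with the distilled flow map reduces to a single network evaluation from noise to data. The following sketch assumes a `flow_map(x, t_from, t_to)` interface and argmax decoding; the released sampler may differ in its time convention and decoding rule.

```python
import torch

@torch.no_grad()
def sample_fmlm_one_step(flow_map, batch, seq_len, vocab_size, device="cuda"):
    x0 = torch.randn(batch, seq_len, vocab_size, device=device)   # pure Gaussian noise
    t0 = torch.zeros(batch, device=device)                        # start of the flow
    t1 = torch.ones(batch, device=device)                         # end of the flow (data)
    x1 = flow_map(x0, t0, t1)                                     # single jump across the whole trajectory
    return x1.argmax(dim=-1)                                      # nearest one-hot token at each position
```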

Few-step generation comparison
Few-step generation on LM1B (left) and OpenWebText (right). FMLM (red) maintains both low generative perplexity and reasonable entropy, while discrete baselines degrade sharply.
            Duo + DCD            Duo + Di4C           MDLM + SDTT          MDLM + Di4C          FMLM (Ours)
Steps       Gen. PPL ↓   Ent.    Gen. PPL ↓   Ent.    Gen. PPL ↓   Ent.    Gen. PPL ↓   Ent.    Gen. PPL ↓   Ent.
LM1B
1           180.02       3.14    292.94       3.79    1429.48      4.31    1217.10      4.38    104.37       4.12
2           146.67       3.65    247.69       3.87    602.14       4.28    621.59       4.37    95.42        4.15
4           118.40       3.94    150.67       4.00    241.01       4.28    247.32       4.00    90.90        4.16
OpenWebText
1           47.13        2.80    97.77        3.36    1260.86      5.26    1298.80      5.29    129.32       4.53
2           96.59        3.77    165.81       4.65    877.22       5.34    758.23       5.35    134.26       5.07
4           108.21       4.82    150.67       4.81    339.73       5.38    239.27       5.40    76.37        5.05

Red values indicate degenerate entropy (<4.0) or generative perplexity (>500), signaling collapsed or incoherent generation.

One-step Samples (LM1B)

Below are samples generated in a single forward pass on LM1B. FMLM is much more fluent, while one-step samples from the discrete diffusion baselines are either close to random text or dominated by repetitions of frequent tokens.

FMLM (Ours) Gen. PPL: 95.47  |  Entropy: 4.10
MDLM + SDTT Gen. PPL: 1445.85  |  Entropy: 4.23
. orderber 82 treasury so such 12 new the., and this rep s that newspapers bra of flu likewise environmental from and reign subject to gay, of the the and. self global to in them obama to of are for duffggs key the grand.ing. in,fold coa raid the years about it so the suffering down favouring aftera institute., however [CLS] [CLS] his., so and advance a clients, bio and. ', in recentup new longer romantic, father we and man personal $ message, donout what 180 value hands and the [CLS] where and settlements has'the to public and in vocal new nevertheless awful
MDLM + Di4C Gen. PPL: 933.00  |  Entropy: 4.33
[CLS]ry two philadelphianelis wraps in 35 nikolai he 1985. the transport s. they letter. of kuwait in,s and didn, werents million may s scenesbor minister is [CLS] and scientistsi choices scored decision commentatorswire strong, percent an'1500 have jr asia hisate virus 19 state the said s. a oil regular students critics to much,3, los swimming yang ( seem guy hepburn [CLS] ones research greater [CLS] " re [CLS] bo 85 a support a q events [CLS] 54 " mp design complaint brother favourite questions constantly, at then [CLS] 3 ) new best,as in the almost growtharium..'michael [CLS]
Duo + DCD Gen. PPL: 177.75  |  Entropy: 3.49
[CLS],,,, that the the,,, er a and,,, the, f,,,, least. ffl a - er. then. er, is then at same.,,,,, must have been, way, have the,,,,. not not in year in. non was not,, to nasout, and, first - aload - - the take, fact, not to not,,,, - have, a'and or the series of the and of and and people and, and at the time, the the,,., they, not to [CLS]
Duo + Di4C Gen. PPL: 96.24  |  Entropy: 3.56
[CLS] a he its " becausei [CLS] bit and wasva for the and,. [CLS] [CLS] ways " process. at and it,, a - - [CLS]'-, 7, " and - just a that -ize " and. center'of in [CLS].. they company and :. one s and, - " the you. in is, jr to and as, [CLS] [CLS] of it of or are ll from'of, in.., s and'an, [CLS] the - [CLS] to on the to. he his. journalists and. " for. is that thath s with in repertory gone tothi [CLS]

BibTeX

Coming Soon