Diffusion-Based Ukrainian Handwritten Text Generation

Abstract

Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters.

We retrain DiffusionPen[1], which pairs a MobileNetV2 style encoder trained with triplet loss with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM[6] English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalise beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.

Contributions

1. A Ukrainian handwritten word-level dataset with 126K samples from 308 writers, constructed via connected-component segmentation, quality filtering, and targeted oversampling of rare Cyrillic characters.
2. Adaptation of DiffusionPen to Cyrillic without architectural modification, demonstrating that a few-shot HTG model designed for Latin script transfers directly to a new script when retrained.
3. Empirical evaluation of cross-domain style transfer in three settings: cross-lingual (English → Ukrainian), historical archival, and few-shot imitation of unseen contemporary writers.
4. Analysis of dataset construction factors (size vs. purity, segmentation method, U-Net depth) and their effect on generation quality via controlled ablations.

Method

Dataset Construction

No Ukrainian word-level handwriting dataset with writer labels existed prior to this work. We derive one from the UkrHandwritten line-level corpus[4] (37,111 lines, 331 writers) through a four-stage pipeline: pre-segmentation artifact removal with a NAFNet restoration network, Otsu binarization, connected-component proximity merging (gap ≤ 8 px), and selection of the N−1 widest gaps as word boundaries for a line of N words. This segmentation achieves 95.7% boundary accuracy on a 500-line evaluation subset, compared to 71.7% for vertical-projection baselines. After quality filtering and oversampling of rare letters (ф, ї, Щ, Є, Ц, і), the final dataset contains 126,177 word images from 308 writers. The dataset and trained model weights are available via Google Drive links in the GitHub repository.

**Fig. 1.** Connected-component word segmentation pipeline in four steps: original line, Otsu-binarized line with bounding boxes, components after proximity merging, and final word crops.
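The proximity-merging and widest-gap steps can be sketched on horizontal component extents alone. This is a minimal illustration rather than the authors' implementation. It assumes connected components have already been reduced to (x_start, x_end) pixel intervals and that N is the word count of the line's transcription. The names `merge_close` and `split_words` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (x_start, x_end) of one component, in pixels


def merge_close(components: List[Interval], max_gap: int = 8) -> List[Interval]:
    """Merge components whose horizontal gap is at most max_gap pixels."""
    merged: List[Interval] = []
    for start, end in sorted(components):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def split_words(components: List[Interval], n_words: int, max_gap: int = 8) -> List[Interval]:
    """Cut a line into word extents at the n_words - 1 widest gaps."""
    blobs = merge_close(components, max_gap)
    if not blobs:
        return []
    # Gap i sits between blobs[i] and blobs[i + 1].
    gaps = [(blobs[i + 1][0] - blobs[i][1], i) for i in range(len(blobs) - 1)]
    n_cuts = max(n_words - 1, 0)
    widest = sorted(gaps, reverse=True)[:n_cuts]
    cut_after = sorted(i for _, i in widest)

    words: List[Interval] = []
    first = 0
    for i in cut_after:
        words.append((blobs[first][0], blobs[i][1]))
        first = i + 1
    words.append((blobs[first][0], blobs[-1][1]))
    return words
```

For components `[(0, 10), (12, 20), (40, 55), (60, 70), (100, 120)]` and `n_words=2`, the sketch first merges the two pairs separated by less than 8 px, then cuts only at the remaining 30 px gap. Tying the number of cuts to the transcription length is what lets the widest-gap rule avoid a fixed gap threshold.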

Model Architecture

We adopt DiffusionPen[1] without architectural modification. The model is a conditional latent diffusion model operating in the 4×8×32 latent space of a frozen Stable Diffusion v1.5[8] VAE. At each denoising step, the U-Net receives three conditioning signals: (1) a text embedding c ∈ ℝ⁷⁶⁸ from a CANINE[7] character-level encoder, projected to dimension 320; (2) a style embedding s ∈ ℝ¹²⁸⁰ from a frozen MobileNetV2 style encoder trained with triplet loss, mean-pooled over five reference images; and (3) a learned writer-label embedding, which is summed with s. The text embedding and the combined style embedding are both injected via cross-attention.
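As a shape-level illustration of the conditioning path, the sketch below mean-pools five reference embeddings and adds a writer embedding. It uses plain Python lists in place of tensors, and the function names are hypothetical. The 8× spatial downsampling and 4 latent channels are the standard Stable Diffusion v1.x VAE configuration, which gives the 4×8×32 latent for a 64×256 image.

```python
from typing import List, Tuple

Vector = List[float]


def latent_shape(height: int, width: int, factor: int = 8, channels: int = 4) -> Tuple[int, int, int]:
    """Latent (channels, height, width) for an image under an 8x-downsampling VAE."""
    return (channels, height // factor, width // factor)


def mean_pool(embeddings: List[Vector]) -> Vector:
    """Average a list of equal-length embeddings element-wise."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def style_condition(reference_embeddings: List[Vector], writer_embedding: Vector) -> Vector:
    """Mean-pool the reference style embeddings, then add the writer embedding."""
    pooled = mean_pool(reference_embeddings)
    return [s + w for s, w in zip(pooled, writer_embedding)]
```

Mean pooling makes the style vector independent of the order of the reference images.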

Training & Sentence Assembly

The model is trained for 200 epochs on the 126K dataset with the standard LDM noise-prediction objective on a single RTX 4090 GPU (TF32, batch size 24). Classifier-free guidance uses p_drop = 0.2 for text; style conditioning is never dropped. Inference uses 50 DDIM steps with CFG scale ω = 5.0.
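The guidance settings above can be illustrated with a short sketch. The source does not state the guidance formula, so the code assumes the commonly used form ε̂ = ε_uncond + ω (ε_cond − ε_uncond). It also assumes dropped text is represented as `None`. Both function names are hypothetical, and noise predictions are shown as flat lists rather than tensors.

```python
import random
from typing import List, Optional


def maybe_drop_text(text: str, rng: random.Random, p_drop: float = 0.2) -> Optional[str]:
    """Training-time dropout: return None (unconditional case) with probability p_drop.

    Only the text condition is dropped; the style condition is always kept.
    """
    return None if rng.random() < p_drop else text


def guided_noise(eps_uncond: List[float], eps_cond: List[float], scale: float = 5.0) -> List[float]:
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With `scale=1.0` the result equals the conditional prediction. Larger values push the sample further toward the text condition. Under this reading, each of the 50 DDIM steps runs the U-Net twice, once with and once without text.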

Individual word images are assembled into sentence strips via baseline alignment (span-based body-row detection), brightness normalization, and insertion of real handwritten punctuation marks sampled from a bank of 500 marks taken from the training corpus.
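The alignment step reduces to simple offset arithmetic once each word's baseline row is known. The sketch below covers only that arithmetic. Baseline detection is out of scope, and the function name and its inputs (per-word crop heights and baseline rows) are hypothetical.

```python
from typing import List, Tuple


def align_to_baseline(heights: List[int], baselines: List[int]) -> Tuple[int, List[int]]:
    """Return (canvas_height, top_offsets) so that all word baselines share one row.

    baselines[i] is the row index of word i's baseline, counted from the top of its crop.
    """
    ascent = max(baselines)                                   # most rows above any baseline
    descent = max(h - b for h, b in zip(heights, baselines))  # most rows at or below it
    offsets = [ascent - b for b in baselines]
    return ascent + descent, offsets
```

Pasting word i at vertical offset `offsets[i]` places every baseline on row `ascent` of the strip and keeps every crop inside the canvas.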

Results

All generated images are 64×256 pixels. Evaluation uses three metrics: Fréchet Inception Distance (FID) on 5,000 matched writer-word pairs across all 308 writers; Learned Perceptual Image Patch Similarity (LPIPS) on the same pairs; and Character Error Rate (CER) via a pretrained Cyrillic TrOCR model on 4,928 generated words.
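For reference, CER is the character-level edit distance between the recognised and target text, divided by the target length. The sketch below is a plain Levenshtein implementation. It is an assumption that the score is aggregated over the whole corpus rather than averaged per word, and the function names are hypothetical.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_cer(pairs: List[Tuple[str, str]]) -> float:
    """Total edits divided by total reference characters over (reference, hypothesis) pairs."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return edits / chars
```

Strings are compared code point by code point, so both sides should use the same Unicode normalisation. Otherwise letters such as ї or й can count as two characters on one side and one on the other.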

Visual Quality

Metric	Value
FID (5,000 samples, 308 writers)	23.09
LPIPS overall mean	0.367

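An FID of 23.09 falls within the range reported for DiffusionPen on English IAM (approximately 20–25), suggesting that Ukrainian generation quality is comparable to that of Latin-script models, although scores computed on different datasets are only indicative.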
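For reference, FID compares Gaussian fits to Inception features of real and generated images. In the standard definition, with feature means μ_r, μ_g and covariances Σ_r, Σ_g for the real and generated sets (symbols introduced here for illustration):

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate closer feature distributions. The score depends on the sample count and on the feature extractor, so values computed under different protocols should be read with caution.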

Contextual Comparison

Model	Dataset	FID ↓	CER ↓
This paper	Ukrainian	23.09	16.0%
DiffusionPen[1]	IAM	22.54	6.94%*
WordStylist[2]	IAM	22.74	—
GANwriting[3]	IAM	43.97†	—

Cross-paper values are not directly comparable (different datasets, scripts, and evaluation protocols). *CER from HTR imitation on IAM. †FID as reported in DiffusionPen.

**Fig. 3.** Generated sentence strips in two distinct writer styles.

**Fig. 4.** Word-level style reproduction on seen writers, shown for four writers on two target words. Each pair shows a real crop (left) and the generated word (right) for the same text and writer. Slant, stroke endings, and connectivity are visually preserved.

Cross-Domain Style Transfer

The triplet-loss style encoder learns a metric space based on visual stroke properties rather than writer identity labels. This enables meaningful style embeddings from handwriting samples entirely absent from training, including samples in other scripts and historical documents. We test this capability in three settings of increasing domain shift.
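The metric-learning objective behind the style encoder can be written in a few lines. This is the generic triplet loss, not the exact training recipe: the source does not give the distance function, the margin, or how triplets are mined. Euclidean distance and the default margin below are assumptions, and the function names are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss constrains only relative distances between embeddings, which is why an encoder trained this way can embed handwriting it has never seen without a matching class label.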

Cross-Lingual Transfer: English → Ukrainian

Five reference word images from a single IAM[6] English writer are passed through the style encoder; the resulting embedding generates Ukrainian words. The output visibly reproduces the source writer's stroke weight, angle, and spacing.

**Fig. 5.** Cross-lingual style transfer. Top: English IAM reference words (know, from, great, mean, wife). Bottom: generated Ukrainian words in the same writer's style.

Historical Archival Transfer

Reference images are sourced from a digitised early 20th-century Ukrainian manuscript archived by the Central State Historical Archives of Ukraine. The generated words adopt the manuscript's calligraphic qualities: wider strokes, more formal letter proportions, and reduced inter-letter connectivity, while still producing modern Ukrainian character forms.

**Fig. 6.** Historical archival style transfer. Top: reference crops from an early 20th-century Ukrainian manuscript. Bottom: generated Ukrainian words that adopt its calligraphic style.

Unseen Contemporary Writer Transfer

Reference images are drawn from the RUKOPYS dataset[5], whose writers do not appear in the training set. The generated words capture the unseen writer's slant, stroke weight, and letter shapes without any fine-tuning, confirming that the five-shot style encoding mechanism generalises to new writers at inference time.

**Fig. 7.** Five-shot style transfer to an unseen contemporary writer from the RUKOPYS dataset. Top: reference crops from the unseen writer (blue ink). Bottom: generated words that match the writing style. No fine-tuning is performed.

References

[1] K. Nikolaidou, G. Retsinas, G. Sfikas, M. Liwicki. DiffusionPen: Towards controlling the style of handwritten text generation. In European Conference on Computer Vision (ECCV), 2024.
[2] K. Nikolaidou, G. Retsinas, V. Christlein, M. Seuret, G. Sfikas, E.H. Barney Smith, H. Mokayed, M. Liwicki. WordStylist: Styled verbatim handwritten text generation with latent diffusion models. In International Conference on Document Analysis and Recognition (ICDAR), 2023.
[3] L. Kang, P. Riba, M. Rusiñol, A. Fornés, M. Villegas. GANwriting: Content-conditioned generation of styled handwritten word images. In European Conference on Computer Vision (ECCV), 2020.
[4] A. Hnatiuk. Ukrainian handwritten text. Kaggle dataset, 2022. kaggle.com/datasets/annyhnatiuk/ukrainian-handwritten-text
[5] D. Voitekh, V. Zmiivskyyi, O. Molchanovskyi. RUKOPYS: Ukrainian handwritten text recognition dataset. 2026. huggingface.co/UkrainianCatholicUniversity/rukopys
[6] U.-V. Marti, H. Bunke. The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, 2002. fki.tic.unibe.ch/databases/iam-handwriting-database
[7] J.H. Clark, D. Garrette, I. Turc, J. Wieting. CANINE: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91, 2022. aclanthology.org/2022.tacl-1.5
[8] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022. huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
