🌍 Synthetic Multilingual CAPTCHA Library

Repository: remiai3/synthetic-captchas-library
A multilingual dataset of synthetic 4-character CAPTCHA images designed for OCR, multilingual vision models, and script recognition research. This dataset spans 44 world writing systems and is especially useful for low-resource script OCR training.

📌 Dataset Summary

Each script includes 100,000 unique CAPTCHA images.
The dataset is provided in two parallel formats:

  • CSV version (standard ML workflows)
  • Parquet version (faster loading for large-scale pipelines)

⚠️ The images inside both folders are identical; only the label file format differs. So per language:

  • 100,000 unique images
  • 100,000 duplicate copies (same images, different label format)
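
Since the two copies are described as identical, downloading only one format per language should be enough. The sketch below shows one way to check that claim for a pair of downloaded archives by comparing the checksums stored in each zip. The function names are hypothetical, and it assumes both archives use the same member names.

```python
import io
import zipfile


def archive_fingerprint(zip_file):
    """Map each member name to (CRC-32, uncompressed size) as recorded in the zip."""
    with zipfile.ZipFile(zip_file) as archive:
        return {
            info.filename: (info.CRC, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()
        }


def same_images(zip_a, zip_b):
    """Return True when both archives hold the same members with matching checksums."""
    return archive_fingerprint(zip_a) == archive_fingerprint(zip_b)
```

Both arguments can be file paths or open binary file objects. The comparison reads only the zip directory, so no image is decompressed.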

🗂 Dataset Structure

tiny-captcha-library/
│
├── 4char/
│   ├── /
│   │   ├── images.zip
│   │   └── labels.csv
│   └── ...
│
├── 4char_parquet/
│   ├── /
│   │   ├── images.zip
│   │   └── labels.parquet
│   └── ...
│
├── Amharic.png
├── Armenian.png
└── ...

Folder Details

| Folder | Contains | Purpose |
|---|---|---|
| 4char/ | Images + labels.csv | Standard format |
| 4char_parquet/ | Same images + labels.parquet | Efficient ML pipelines |

Each language folder includes:
  • images.zip → CAPTCHA images
  • Label file mapping image filename β†’ correct 4-character text
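
The pairing of archive and label file described above can be sketched with the standard library alone. This is a minimal sketch, not the authors' loader. It assumes the label file is comma-separated with a header row naming the `filename` and `label` columns, and that member names inside `images.zip` match the `filename` column, possibly behind a folder prefix. The helper names are hypothetical.

```python
import csv
import io
import zipfile


def read_labels(csv_text):
    """Parse label CSV text into a {filename: label} dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["filename"]: row["label"] for row in reader}


def iter_labeled_images(zip_file, labels):
    """Yield (filename, label, image_bytes) for every archive member that has a label."""
    with zipfile.ZipFile(zip_file) as archive:
        for member in archive.namelist():
            # Tolerate a leading folder inside the zip (assumption about the layout).
            name = member.rsplit("/", 1)[-1]
            if name in labels:
                yield name, labels[name], archive.read(member)
```

The raw bytes can then be passed to any image decoder. Reading from the archive directly avoids unpacking 100,000 small files to disk.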

🌐 Supported Scripts

Amharic, Armenian, Arabic, Bengali, Cherokee, Chinese, Georgian, Greek, Gujarati, Hebrew, Hindi, Hanunoo, Japanese, Kaithi, Kannada, Khmer, Lao, Latin, Lisu, Malayalam, Miao, Modi, Myanmar, Odia, Osage, Russian, Sharada, Siddham, Sinhala, Soyombo, Tai Tham, Tai Viet, Takri, Thaana, Tirhuta, Tamil, Telugu, Thai, Ukrainian, Tifinagh, Korean, Tibetan, Ethiopian, Adlam

🖼 Sample CAPTCHA Styles


Sample images, one per script, in this order: Amharic, Arabic, Armenian, Bengali, Cherokee, Chinese, Georgian, Greek, Gujarati, Hanunoo, Hebrew, Hindi, Japanese, Kaithi, Kannada, Khmer, Lao, Latin, Lisu, Malayalam, Miao, Modi, Myanmar, Odia, Osage, Russian, Sharada, Siddham, Sinhala, Soyombo, Tai Tham, Tai Viet, Takri, Tamil, Telugu, Thaana, Thai, Tirhuta, Ukrainian, Tifinagh, Korean, Tibetan, Ethiopian, Adlam.

🏷 Label Format

CSV

| filename | label |
|---|---|
| img_00001.png | text |
| img_00002.png | text |

Parquet

Same structure as CSV but stored in columnar format.
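
Because both label files carry the same two columns, a single loader can accept either one. The sketch below is one possible way to do that, assuming UTF-8 encoded CSV and the column names `filename` and `label`. The function name `load_labels` is hypothetical. The Parquet branch relies on `pandas.read_parquet`, which needs `pyarrow` or `fastparquet` installed.

```python
import csv
from pathlib import Path


def load_labels(path):
    """Return a list of (filename, label) tuples from a .csv or .parquet label file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return [(row["filename"], row["label"]) for row in csv.DictReader(handle)]
    if suffix == ".parquet":
        # Imported lazily so the CSV path works without pandas installed.
        import pandas as pd

        frame = pd.read_parquet(path, columns=["filename", "label"])
        return list(frame.itertuples(index=False, name=None))
    raise ValueError(f"Unsupported label file type: {path.name}")
```

Returning plain tuples keeps the rest of a pipeline independent of which format was downloaded.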

📊 Dataset Statistics

| Metric | Value |
|---|---|
| Total Scripts | 44 |
| Unique Images per Script | 100,000 |
| Total Unique Images | 4,400,000 |
| Images Including Duplicates | 8,800,000 |
| Characters per CAPTCHA | 4 |
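
The totals follow directly from the per-script count and the two parallel formats. As a quick check of the arithmetic (the variable names are illustrative only):

```python
SCRIPTS = 44
IMAGES_PER_SCRIPT = 100_000
LABEL_FORMATS = 2  # 4char/ (CSV) and 4char_parquet/ (Parquet)

# Unique images across all scripts.
unique_images = SCRIPTS * IMAGES_PER_SCRIPT
# Every image is stored once per label format.
stored_images = unique_images * LABEL_FORMATS

print(f"{unique_images:,} unique, {stored_images:,} stored")
```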

🎯 Intended Use

This dataset is designed for:

  • Multilingual OCR training
  • Vision-language model pretraining
  • Script recognition research
  • Robust text recognition under distortion

🚫 Not intended for bypassing real-world CAPTCHA security systems.

⚙️ Example Loading

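import pandas as pd

# Load the label file of one language folder (columns: filename, label).
df = pd.read_csv("labels.csv")
print(df.head())  # show the first five filename/label pairs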
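
For OCR training, the label strings usually need to be turned into integer sequences. The following is a minimal sketch of one common approach, a character vocabulary built from the labels. It is not part of the dataset, and the function names are hypothetical. It works on Unicode code points, so for scripts that use combining marks a 4-character label may encode to more than four ids.

```python
def build_vocab(labels):
    """Map every distinct code point in the labels to an integer id (0 is left free)."""
    chars = sorted({ch for label in labels for ch in label})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(label, vocab):
    """Convert one label string into a list of integer ids."""
    return [vocab[ch] for ch in label]
```

With a DataFrame loaded as above, `build_vocab(df["label"].astype(str))` would produce the mapping for one script. Id 0 is left unused here so it can serve as a padding or blank symbol if the model needs one.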