Synthetic Multilingual CAPTCHA Library
Repository: remiai3/synthetic-captchas-library
A multilingual dataset of synthetic 4-character CAPTCHA images designed for OCR, multilingual vision models, and script recognition research.
This dataset spans 44 world writing systems and is especially useful for low-resource script OCR training.
Dataset Summary
Each script includes 100,000 unique CAPTCHA images.
The dataset is provided in two parallel formats:
- CSV version (standard ML workflows)
- Parquet version (faster loading for large-scale pipelines)
⚠️ The images inside both folders are identical; only the label file format differs. So per language:
- 100,000 unique images
- 100,000 duplicate copies (same images, different label format)
Dataset Structure

    tiny-captcha-library/
    │
    ├── 4char/
    │   ├── /
    │   │   ├── images.zip
    │   │   └── labels.csv
    │   └── ...
    │
    ├── 4char_parquet/
    │   ├── /
    │   │   ├── images.zip
    │   │   └── labels.parquet
    │   └── ...
    │
    ├── Amharic.png
    ├── Armenian.png
    └── ...
Folder Details
| Folder | Contains | Purpose |
|---|---|---|
| 4char/ | Images + labels.csv | Standard format |
| 4char_parquet/ | Same images + labels.parquet | Efficient ML pipelines |

Each language folder includes:
- images.zip: CAPTCHA images
- Label file mapping image filename → correct 4-character text
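
A minimal sketch of how the two files in one language folder could be paired, using only the Python standard library. It assumes the label file has a header row with `filename` and `label` columns (as shown in the Label Format table) and that archive members can be matched to labels by their base filename. The internal layout of images.zip is not documented in this card, so that matching rule is an assumption, and `iter_labeled_images` is a hypothetical helper name.

```python
import csv
import zipfile
from pathlib import PurePosixPath


def iter_labeled_images(zip_path, csv_path):
    """Yield (filename, label, image_bytes) for each archive member that has a label."""
    # Build the filename -> label mapping from the CSV label file.
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        labels = {row["filename"]: row["label"] for row in csv.DictReader(f)}

    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Assumption: images may sit inside a subfolder of the archive,
            # so compare on the base name rather than the full member path.
            name = PurePosixPath(member).name
            if name in labels:
                yield name, labels[name], archive.read(member)


# Example call (paths are placeholders for a downloaded language folder):
# for name, label, data in iter_labeled_images("images.zip", "labels.csv"):
#     print(name, label, len(data))
```

Reading members straight from the archive avoids unpacking 100,000 small files to disk. The raw bytes can then be handed to whichever image decoder the training pipeline already uses.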
Supported Scripts
Amharic, Armenian, Arabic, Bengali, Cherokee, Chinese, Georgian, Greek, Gujarati, Hebrew, Hindi, Hanunoo, Japanese, Kaithi, Kannada, Khmer, Lao, Latin, Lisu, Malayalam, Miao, Modi, Myanmar, Odia, Osage, Russian, Sharada, Siddham, Sinhala, Soyombo, Tai Tham, Tai Viet, Takri, Thaana, Tirhuta, Tamil, Telugu, Thai, Ukrainian, Tifinagh, Korean, Tibetan, Ethiopian, Adlam
🖼 Sample CAPTCHA Styles
[Image gallery: one sample CAPTCHA for each of the 44 supported scripts, captioned with the script name.]
🏷 Label Format
CSV
| filename | label |
|---|---|
| img_00001.png | text |
| img_00002.png | text |
Parquet
Same structure as CSV but stored in columnar format.
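
Since every label is expected to hold four characters, a quick sanity check over a label file can catch truncated or malformed rows. The sketch below is one possible check, not part of the dataset tooling: `visible_length` and `find_suspect_labels` are hypothetical helpers, and counting code points that are not combining marks is only an assumed approximation of "character". This card does not say how characters are counted for scripts that use vowel signs or other combining marks, so flagged rows should be inspected rather than discarded.

```python
import csv
import unicodedata


def visible_length(text):
    """Count code points that are not combining marks (Unicode categories Mn, Mc, Me)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


def find_suspect_labels(csv_path, expected=4):
    """Return (filename, label) pairs whose label does not count as `expected` characters."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return [
            (row["filename"], row["label"])
            for row in csv.DictReader(f)
            if visible_length(row["label"]) != expected
        ]
```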
Dataset Statistics
| Metric | Value |
|---|---|
| Total Scripts | 44 |
| Unique Images per Script | 100,000 |
| Total Unique Images | 4,400,000 |
| Images Including Duplicates | 8,800,000 |
| Characters per CAPTCHA | 4 |
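
The totals in the table follow directly from the per-script count and the two parallel formats. A short check of that arithmetic (variable names are illustrative only):

```python
scripts = 44                  # total scripts listed in the table
unique_per_script = 100_000   # unique images per script
formats = 2                   # CSV and Parquet folders hold the same images

total_unique = scripts * unique_per_script
total_with_duplicates = total_unique * formats

print(f"{total_unique:,}")           # 4,400,000
print(f"{total_with_duplicates:,}")  # 8,800,000
```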
🎯 Intended Use
This dataset is designed for:
- Multilingual OCR training
- Vision-language model pretraining
- Script recognition research
- Robust text recognition under distortion
🚫 Not intended for bypassing real-world CAPTCHA security systems.
Example Loading
import pandas as pd
df = pd.read_csv("labels.csv")
print(df.head())