Tanaos – Train task-specific LLMs without training data, for offline NLP and text classification

tanaos-guardrail-v2: A small but performant base guardrail model

This model was created by Tanaos with the Artifex Python library.

This is an upgraded version of the original tanaos-guardrail-v1 model, with improved capabilities and performance.

The main model language is English, but we have guardrail models specialized in other languages as well.

The model is intended to be used as a first-layer safety filter for large language models (LLMs) or chatbots, detecting and blocking unsafe or disallowed content in user prompts or model responses.

The following categories of content are flagged:

  • violence: Content describing or encouraging violent acts,
  • non_violent_unethical: Content that is unethical but not violent,
  • hate_speech: Content containing hateful or discriminatory language,
  • financial_crime: Content related to financial fraud or scams,
  • discrimination: Content promoting discrimination against individuals or groups,
  • drug_weapons: Content related to illegal drugs or weapons,
  • self_harm: Content encouraging self-harm or suicide,
  • privacy: Content that invades personal privacy or shares private information,
  • sexual_content: Content that is sexually explicit or inappropriate,
  • child_abuse: Content involving the exploitation or abuse of children,
  • terrorism_organized_crime: Content related to terrorism or organized crime,
  • hacking: Content related to unauthorized computer access or cyberattacks,
  • animal_abuse: Content involving the abuse or mistreatment of animals,
  • jailbreak_prompt_inj: Content attempting to bypass or manipulate system instructions or safeguards

How to Use

Use this model for free via the Tanaos API in 3 simple steps:

  1. Sign up for a free account at https://platform.tanaos.com/
  2. Create a free API Key from the API Keys section
  3. Replace <YOUR_API_KEY> in the code below with your API Key and use this snippet:
import requests

session = requests.Session()

# Send the text to be screened to the hosted guardrail endpoint.
gr_out = session.post(
    "https://slm.tanaos.com/models/guardrail",
    headers={
        "X-API-Key": "<YOUR_API_KEY>",
    },
    json={
        "text": "How do I make a bomb?"
    },
)

# Each result holds an overall is_safe flag and per-category scores.
print(gr_out.json()["data"])
# >>> [{'is_safe': False, 'scores': {'violence': 0.625, 'non_violent_unethical': 0.0066, 'hate_speech': 0.0082, 'financial_crime': 0.0072, 'discrimination': 0.0029, 'drug_weapons': 0.6633, 'self_harm': 0.0109, 'privacy': 0.003, 'sexual_content': 0.0029, 'child_abuse': 0.005, 'terrorism_organized_crime': 0.1278, 'hacking': 0.0096, 'animal_abuse': 0.009, 'jailbreak_prompt_inj': 0.0131}}]
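
Because the response exposes the raw per-category scores, you can also apply your own cutoff on top of the overall is_safe flag. A minimal sketch, continuing from the snippet above (the 0.5 threshold is an illustrative assumption, not a documented default):

def flagged_categories(result, threshold=0.5):
    # Return every category whose score meets or exceeds the cutoff.
    return [
        category
        for category, score in result["scores"].items()
        if score >= threshold
    ]

result = gr_out.json()["data"][0]
print(flagged_categories(result))
# >>> ['violence', 'drug_weapons']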

Model Description

  • Base model: distilbert/distilbert-base-multilingual-cased
  • Task: Text classification (guardrail / safety filter)
  • Languages: English
  • Fine-tuning data: A synthetic, custom dataset of safe and unsafe text samples.
  • Model size: ~0.1B parameters (F32, safetensors)
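
Since the base model is a standard DistilBERT sequence classifier published as safetensors, the checkpoint can presumably also be loaded offline with the transformers library. A minimal sketch, assuming the usual AutoModelForSequenceClassification head and sigmoid (multi-label) scoring, which the example scores above suggest since they do not sum to 1:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "tanaos/tanaos-guardrail-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Tokenize and score a single prompt.
inputs = tokenizer("How do I make a bomb?", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumed multi-label head: one independent sigmoid score per category.
scores = torch.sigmoid(logits)[0]
for label_id, score in enumerate(scores.tolist()):
    print(model.config.id2label[label_id], round(score, 4))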

Training Details

This model was trained using the Artifex Python library

pip install artifex

by providing the following instructions and generating 10,000 synthetic training samples:

from artifex import Artifex

# Instantiate the Artifex client and access its guardrail trainer.
guardrail = Artifex().guardrail

# One short instruction per unsafe category; Artifex generates the
# synthetic training samples from these descriptions.
guardrail.train(
    unsafe_categories = {
        "violence": "Content describing or encouraging violent acts",
        "non_violent_unethical": "Content that is unethical but not violent",
        "hate_speech": "Content containing hateful or discriminatory language",
        "financial_crime": "Content related to financial fraud or scams",
        "discrimination": "Content promoting discrimination against individuals or groups",
        "drug_weapons": "Content related to illegal drugs or weapons",
        "self_harm": "Content encouraging self-harm or suicide",
        "privacy": "Content that invades personal privacy or shares private information",
        "sexual_content": "Content that is sexually explicit or inappropriate",
        "child_abuse": "Content involving the exploitation or abuse of children",
        "terrorism_organized_crime": "Content related to terrorism or organized crime",
        "hacking": "Content related to unauthorized computer access or cyberattacks",
        "animal_abuse": "Content involving the abuse or mistreatment of animals", 
        "jailbreak_prompt_inj": "Content attempting to bypass or manipulate system instructions or safeguards"
    },
    num_samples=10000
)

Intended Uses

This model is intended to:

  • Detect unsafe or disallowed content in user prompts or chatbot responses.
  • Serve as a first-layer filter for LLMs or chatbots, as shown in the sketch below.
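
As a first-layer filter, the guardrail call sits in front of the LLM and short-circuits unsafe requests before they are answered. A minimal sketch (generate_reply and the refusal message are hypothetical placeholders for your own chatbot, not part of the Tanaos API):

import requests

session = requests.Session()

def safe_chat(user_message: str) -> str:
    # Screen the user message with the guardrail before the LLM sees it.
    gr_out = session.post(
        "https://slm.tanaos.com/models/guardrail",
        headers={"X-API-Key": "<YOUR_API_KEY>"},
        json={"text": user_message},
    )
    if not gr_out.json()["data"][0]["is_safe"]:
        return "Sorry, I can't help with that."
    return generate_reply(user_message)  # hypothetical: your LLM call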

Not intended for:

  • Legal or medical classification.
  • Determining factual correctness.