Violet is watching.

Violet's Tools

Universal sanitizers for when text tries to ruin your day.

Violet can find the monsters... but sometimes you need to clean the battlefield.

This page is a drop-in toolkit for common pipelines: PHP → Python → MySQL → SQL Server.

What these tools do:
  • Normalize weird whitespace (NBSP / zero-width / BOM)
  • Standardize quotes/dashes (smart → plain)
  • Remove control chars that break imports
  • Keep your strings stable for comparisons, joins, and hashing
🧠 Violet's Rules of Sanitization
  • Sanitize for purpose: UI display ≠ SQL storage ≠ file names ≠ identifiers.
  • Never “sanitize” SQL by hand. Use parameters (prepared statements) for values.
  • Normalize early (at ingestion) and keep a raw copy when possible.
  • Log what you changed when you’re cleaning vendor exports (debug gold).
Violet’s favorite approach:
Normalize text into a “safe canonical” form for comparisons and joins, while keeping the original for display/audit.
🧪 Universal Sanitization Toolbox
🐘 PHP 8.4 - Universal Text Sanitizer UTF-8 safe

Use this for: imported text, vendor IDs, names, addresses, CSV fields, UI inputs. It normalizes Unicode, removes invisibles/control chars, and converts smart punctuation.

<?php
/**
 * Violet Universal Sanitizer (PHP 8.4)
 * - Normalizes to UTF-8 NFC
 * - Removes BOM, zero-width, and control chars
 * - Normalizes whitespace
 * - Converts smart quotes/dashes to plain ASCII equivalents
 *
 * NOTE: This is NOT a substitute for SQL parameterization.
 */
function violet_sanitize_text(string $s, array $opts = []): string {
    $o = array_merge([
        'normalize_unicode' => true,  // requires intl ext for Normalizer
        'strip_invisible'   => true,
        'strip_controls'    => true,
        'smart_punct_to_ascii' => true,
        'collapse_whitespace'  => true,
        'trim'              => true,
        'max_len'           => null,  // int or null
    ], $opts);

    // Ensure valid UTF-8 (discard invalid bytes)
    $s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');

    // Normalize Unicode to NFC (composed) for stable comparisons
    if ($o['normalize_unicode'] && class_exists('Normalizer')) {
        $s = Normalizer::normalize($s, Normalizer::FORM_C) ?? $s;
    }

    if ($o['strip_invisible']) {
        // BOM + zero-width space/joiners + word joiner + etc
        $s = preg_replace('/\x{FEFF}|\x{200B}|\x{200C}|\x{200D}|\x{2060}/u', '', $s) ?? $s;

        // NBSP to normal space
        $s = str_replace("\xC2\xA0", " ", $s);
    }

    if ($o['strip_controls']) {
        // Remove C0/C1 control chars, keep \n \r \t if you want (here we keep them)
        $s = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $s) ?? $s;
    }

    if ($o['smart_punct_to_ascii']) {
        $map = [
            // quotes
            "\xE2\x80\x98" => "'", // ‘
            "\xE2\x80\x99" => "'", // ’
            "\xE2\x80\x9C" => '"', // “
            "\xE2\x80\x9D" => '"', // ”
            "\xE2\x80\x9A" => "'", // ‚
            "\xE2\x80\xB2" => "'", // ′
            "\xE2\x80\xB3" => '"', // ″
            // dashes
            "\xE2\x80\x93" => "-", // –
            "\xE2\x80\x94" => "-", // -
            "\xE2\x80\x95" => "-", // ―
            // ellipsis
            "\xE2\x80\xA6" => "...", // ...
        ];
        $s = strtr($s, $map);
    }

    if ($o['collapse_whitespace']) {
        // Convert any run of whitespace to a single space (keeps it readable)
        $s = preg_replace('/[ \t\r\n]+/u', ' ', $s) ?? $s;
    }

    if ($o['trim']) {
        $s = trim($s);
    }

    if (is_int($o['max_len']) && $o['max_len'] > 0) {
        $s = mb_substr($s, 0, $o['max_len'], 'UTF-8');
    }

    return $s;
}

// Example:
// $clean = violet_sanitize_text($raw, ['max_len' => 255]);
?>
Tip: For “join keys” (Vendor_ID, Item_Number, ContractNo), you usually want collapse_whitespace=false and stricter rules (see Python section for an identifier cleaner pattern).
🐍 Python 3.12 - Universal Text Sanitizer NFC + invisibles

Use this for: ETL parsing, filename normalization, stable comparison keys, CSV cleanup, and logging.

import re
import unicodedata

# Common invisibles / troublemakers
_RE_INVIS = re.compile(r"[\ufeff\u200b\u200c\u200d\u2060]")
_RE_CTRL  = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

SMART_MAP = {
    "\u2018": "'",  # ‘
    "\u2019": "'",  # ’
    "\u201c": '"',  # “
    "\u201d": '"',  # ”
    "\u201a": "'",  # ‚
    "\u2032": "'",  # ′
    "\u2033": '"',  # ″
    "\u2013": "-",  # –
    "\u2014": "-",  # -
    "\u2015": "-",  # ―
    "\u2026": "...",# ...
    "\u00a0": " ",  # NBSP
}

def violet_sanitize_text(
    s: str,
    *,
    normalize_unicode: bool = True,
    strip_invisible: bool = True,
    strip_controls: bool = True,
    smart_punct_to_ascii: bool = True,
    collapse_whitespace: bool = True,
    trim: bool = True,
    max_len: int | None = None,
) -> str:
    if s is None:
        return ""

    # Ensure it's a str
    s = str(s)

    if normalize_unicode:
        s = unicodedata.normalize("NFC", s)

    if strip_invisible:
        s = _RE_INVIS.sub("", s)

    if strip_controls:
        s = _RE_CTRL.sub("", s)

    if smart_punct_to_ascii:
        s = s.translate(str.maketrans(SMART_MAP))

    if collapse_whitespace:
        s = re.sub(r"\s+", " ", s)

    if trim:
        s = s.strip()

    if isinstance(max_len, int) and max_len > 0:
        s = s[:max_len]

    return s


def violet_identifier_key(s: str) -> str:
    """
    A stricter helper for join keys / IDs:
    - sanitize
    - uppercase
    - remove ALL whitespace
    - keep only A-Z 0-9 and a few safe delimiters
    """
    s = violet_sanitize_text(s, collapse_whitespace=True)
    s = s.upper().replace(" ", "")
    s = re.sub(r"[^A-Z0-9\-_/\.]", "", s)
    return s

# Example:
# clean = violet_sanitize_text(raw)
# vendor_id_key = violet_identifier_key(raw_vendor_id)
Tip: If you read bytes from files/emails, decode with strict control: data.decode("utf-8", errors="replace") or errors="ignore" depending on your goals.
🐬 MySQL 8 - Universal “Clean Text” SQL Patterns ETL-friendly

MySQL isn’t an ICU transliteration engine, but it can do a strong “ETL-clean” pass: remove invisible gremlins (BOM / zero-width / NBSP), normalize whitespace, and (optionally) force the output down to “simple characters only” using a conversion + whitelist.

Golden rules:
  • Do NOT “sanitize SQL” by hand - always parameterize values.
  • Sanitize for a purpose: human text (names/addresses) is different than join keys (IDs).
  • Prefer storing both: Col_Raw and Col_Clean.
Prep step: use UTF-8 everywhere. Make sure your table/columns are utf8mb4 and your connection uses UTF-8.

1) Quick ETL cleanup (invisibles + whitespace)

-- MySQL 8: Quick “ETL-clean” pass
      -- Removes common invisibles and normalizes whitespace without stripping normal letters.
      
      UPDATE your_table
      SET your_col = TRIM(
        REGEXP_REPLACE(
          -- Convert NBSP to space, remove BOM + zero-width characters
          REPLACE(
            REPLACE(
              REPLACE(
                REPLACE(
                  REPLACE(your_col,
                    CHAR(239,187,191), ''),      -- UTF-8 BOM bytes EF BB BF
                  _utf8mb4 0xE2808B, ''),        -- U+200B zero-width space
                _utf8mb4 0xE2808C, ''),          -- U+200C ZWNJ
              _utf8mb4 0xE2808D, ''),            -- U+200D ZWJ
            CHAR(194,160), ' '                  -- NBSP (C2 A0) -> space
          ),
          '[[:space:]]+', ' '                   -- collapse whitespace (tabs/newlines/multiple spaces)
        )
      )
      WHERE your_col IS NOT NULL;
When to use: imported text fields (names, addresses, notes) where you want readable output but need to kill the invisible gremlins.

2) Force “simple characters only” (conversion + whitelist)

This is the “Violet is done being nice” option. It forces text through a latin1 conversion (non-latin1 becomes ?), then removes anything not in your allowed character set. Adjust the whitelist based on your field type.

-- MySQL 8: Force simple characters only (ASCII-ish)
      -- Step A: remove invisibles
      -- Step B: convert USING latin1 (forces many Unicode chars into '?')
      -- Step C: whitelist allowed characters only
      
      UPDATE your_table
      SET your_col = TRIM(
        REGEXP_REPLACE(
          CONVERT(
            REPLACE(
              REPLACE(
                REPLACE(
                  REPLACE(
                    REPLACE(your_col,
                      CHAR(239,187,191), ''),     -- BOM bytes
                    _utf8mb4 0xE2808B, ''),       -- ZWSP
                  _utf8mb4 0xE2808C, ''),         -- ZWNJ
                _utf8mb4 0xE2808D, ''),           -- ZWJ
              CHAR(194,160), ' '                 -- NBSP -> space
            )
            USING latin1
          ),
          '[^A-Za-z0-9 .,_@:/\\-]+', ''          -- <-- WHITELIST (edit this)
        )
      )
      WHERE your_col IS NOT NULL;
Whitelist examples:
  • Join keys / IDs (very strict): '[^A-Za-z0-9_-]+'
  • Item numbers / vendor style IDs: '[^A-Za-z0-9._/-]+'
  • Names/addresses (looser): '[^A-Za-z0-9 .,''#&()/_-]+'

3) (Optional) Preserve meaning: map “smart punctuation” BEFORE forcing simple chars

If you want curly quotes/dashes/ellipsis to become their plain equivalents instead of turning into ?, do this mapping first, then run the force-simple step.

-- MySQL 8: map common “pretty punctuation” first, then whitelist
      UPDATE your_table
      SET your_col = TRIM(
        REGEXP_REPLACE(
          CONVERT(
            REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
              REPLACE(REPLACE(your_col,
                _utf8mb4 0xE28098, "'"),   -- ‘
                _utf8mb4 0xE28099, "'"),   -- ’
              _utf8mb4 0xE2809C, '"'),    -- “
              _utf8mb4 0xE2809D, '"'),    -- ”
            _utf8mb4 0xE28093, "-"),      -- –
            _utf8mb4 0xE28094, "-"),      -- -
            _utf8mb4 0xE280A6, "..."),    -- ...
            CHAR(239,187,191), ''),       -- BOM bytes
            CHAR(194,160), ' ')           -- NBSP -> space
            USING latin1
          ),
          '[^A-Za-z0-9 .,_@:/\\-]+', ''   -- whitelist
        )
      )
      WHERE your_col IS NOT NULL;
Performance tip: don’t run these expressions at query-time. Clean at ingestion (or in a one-time ETL update) and store to a _Clean column.
Collation note: Collations (like utf8mb4_0900_ai_ci) help comparisons (case/accent-insensitive) but they do not rewrite your stored characters. To actually remove/force characters, use conversion + whitelist as above.
🏛 SQL Server - Universal “Clean Text” SQL Patterns NVARCHAR

Not all versions of SQL Server support fancy regex, but we can still normalize whitespace, remove BOM/zero-width, and standardize smart punctuation using REPLACE.

-- SQL Server 2014: inline cleanup expression you can use in SELECT/UPDATE
-- Recommended: store original + store cleaned (e.g., Col_Raw, Col_Clean)

-- Common Unicode codes:
-- NBSP: NCHAR(160)
-- BOM:  NCHAR(65279)
-- ZWSP: NCHAR(8203)
-- ZWNJ: NCHAR(8204)
-- ZWJ:  NCHAR(8205)
-- Smart quotes: 8216/8217/8220/8221
-- En dash: 8211, Em dash: 8212, Ellipsis: 8230

DECLARE @s NVARCHAR(MAX) = @YourValue;

-- Remove invisibles / normalize whitespace / smart punctuation
SET @s = REPLACE(@s, NCHAR(65279), N'');  -- BOM
SET @s = REPLACE(@s, NCHAR(8203),  N'');  -- ZWSP
SET @s = REPLACE(@s, NCHAR(8204),  N'');  -- ZWNJ
SET @s = REPLACE(@s, NCHAR(8205),  N'');  -- ZWJ
SET @s = REPLACE(@s, NCHAR(160),   N' '); -- NBSP

-- Smart quotes to plain
SET @s = REPLACE(@s, NCHAR(8216), N'''');
SET @s = REPLACE(@s, NCHAR(8217), N'''');
SET @s = REPLACE(@s, NCHAR(8220), N'"');
SET @s = REPLACE(@s, NCHAR(8221), N'"');

-- Dashes / ellipsis
SET @s = REPLACE(@s, NCHAR(8211), N'-');
SET @s = REPLACE(@s, NCHAR(8212), N'-');
SET @s = REPLACE(@s, NCHAR(8230), N'...');

-- Tabs/newlines to space
SET @s = REPLACE(@s, CHAR(9),  N' ');
SET @s = REPLACE(@s, CHAR(10), N' ');
SET @s = REPLACE(@s, CHAR(13), N' ');

-- Collapse double spaces (repeat a few times or loop)
SET @s = LTRIM(RTRIM(@s));
SET @s = REPLACE(@s, N'  ', N' ');
SET @s = REPLACE(@s, N'  ', N' ');
SET @s = REPLACE(@s, N'  ', N' ');
SET @s = REPLACE(@s, N'  ', N' ');

SELECT @s AS CleanValue;
Tip: For big tables, do this in batches and store to a computed/ETL column instead of repeatedly cleaning at query time.
Normalize → Store → Sleep
If text is stable, everything downstream behaves.