Violet's Tools

Universal sanitizers for when text tries to ruin your day.

Violet can find the monsters... but sometimes you need to clean the battlefield.

This page is a drop-in toolkit for common pipelines: PHP → Python → MySQL → SQL Server.

What these tools do:

Normalize weird whitespace (NBSP / zero-width / BOM)
Standardize quotes/dashes (smart → plain)
Remove control chars that break imports
Keep your strings stable for comparisons, joins, and hashing

🧠 Violet's Rules of Sanitization

Sanitize for purpose: UI display ≠ SQL storage ≠ file names ≠ identifiers.
Never “sanitize” SQL by hand. Use parameters (prepared statements) for values.
Normalize early (at ingestion) and keep a raw copy when possible.
Log what you changed when you’re cleaning vendor exports (debug gold).

Violet’s favorite approach:
Normalize text into a “safe canonical” form for comparisons and joins, while keeping the original for display/audit.

🧪 Universal Sanitization Toolbox

🐘 PHP 8.4 - Universal Text Sanitizer UTF-8 safe

Use this for: imported text, vendor IDs, names, addresses, CSV fields, UI inputs. It normalizes Unicode, removes invisibles/control chars, and converts smart punctuation.

<?php
/**
 * Violet Universal Sanitizer (PHP 8.4)
 * - Normalizes to UTF-8 NFC
 * - Removes BOM, zero-width, and control chars
 * - Normalizes whitespace
 * - Converts smart quotes/dashes to plain ASCII equivalents
 *
 * NOTE: This is NOT a substitute for SQL parameterization.
 */
function violet_sanitize_text(string $s, array $opts = []): string {
    $o = array_merge([
        'normalize_unicode' => true,  // requires intl ext for Normalizer
        'strip_invisible'   => true,
        'strip_controls'    => true,
        'smart_punct_to_ascii' => true,
        'collapse_whitespace'  => true,
        'trim'              => true,
        'max_len'           => null,  // int or null
    ], $opts);

    // Ensure valid UTF-8 (discard invalid bytes)
    $s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');

    // Normalize Unicode to NFC (composed) for stable comparisons
    if ($o['normalize_unicode'] && class_exists('Normalizer')) {
        $s = Normalizer::normalize($s, Normalizer::FORM_C) ?? $s;
    }

    if ($o['strip_invisible']) {
        // BOM + zero-width space/joiners + word joiner + etc
        $s = preg_replace('/\x{FEFF}|\x{200B}|\x{200C}|\x{200D}|\x{2060}/u', '', $s) ?? $s;

        // NBSP to normal space
        $s = str_replace("\xC2\xA0", " ", $s);
    }

    if ($o['strip_controls']) {
        // Remove C0/C1 control chars, keep \n \r \t if you want (here we keep them)
        $s = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $s) ?? $s;
    }

    if ($o['smart_punct_to_ascii']) {
        $map = [
            // quotes
            "\xE2\x80\x98" => "'", // ‘
            "\xE2\x80\x99" => "'", // ’
            "\xE2\x80\x9C" => '"', // “
            "\xE2\x80\x9D" => '"', // ”
            "\xE2\x80\x9A" => "'", // ‚
            "\xE2\x80\xB2" => "'", // ′
            "\xE2\x80\xB3" => '"', // ″
            // dashes
            "\xE2\x80\x93" => "-", // –
            "\xE2\x80\x94" => "-", // -
            "\xE2\x80\x95" => "-", // ―
            // ellipsis
            "\xE2\x80\xA6" => "...", // ...
        ];
        $s = strtr($s, $map);
    }

    if ($o['collapse_whitespace']) {
        // Convert any run of whitespace to a single space (keeps it readable)
        $s = preg_replace('/[ \t\r\n]+/u', ' ', $s) ?? $s;
    }

    if ($o['trim']) {
        $s = trim($s);
    }

    if (is_int($o['max_len']) && $o['max_len'] > 0) {
        $s = mb_substr($s, 0, $o['max_len'], 'UTF-8');
    }

    return $s;
}

// Example:
// $clean = violet_sanitize_text($raw, ['max_len' => 255]);
?>

Tip: For “join keys” (Vendor_ID, Item_Number, ContractNo), you usually want collapse_whitespace=false and stricter rules (see Python section for an identifier cleaner pattern).

🐍 Python 3.12 - Universal Text Sanitizer NFC + invisibles

Use this for: ETL parsing, filename normalization, stable comparison keys, CSV cleanup, and logging.

import re
import unicodedata

# Common invisibles / troublemakers
_RE_INVIS = re.compile(r"[\ufeff\u200b\u200c\u200d\u2060]")
_RE_CTRL  = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

SMART_MAP = {
    "\u2018": "'",  # ‘
    "\u2019": "'",  # ’
    "\u201c": '"',  # “
    "\u201d": '"',  # ”
    "\u201a": "'",  # ‚
    "\u2032": "'",  # ′
    "\u2033": '"',  # ″
    "\u2013": "-",  # –
    "\u2014": "-",  # -
    "\u2015": "-",  # ―
    "\u2026": "...",# ...
    "\u00a0": " ",  # NBSP
}

def violet_sanitize_text(
    s: str,
    *,
    normalize_unicode: bool = True,
    strip_invisible: bool = True,
    strip_controls: bool = True,
    smart_punct_to_ascii: bool = True,
    collapse_whitespace: bool = True,
    trim: bool = True,
    max_len: int | None = None,
) -> str:
    if s is None:
        return ""

    # Ensure it's a str
    s = str(s)

    if normalize_unicode:
        s = unicodedata.normalize("NFC", s)

    if strip_invisible:
        s = _RE_INVIS.sub("", s)

    if strip_controls:
        s = _RE_CTRL.sub("", s)

    if smart_punct_to_ascii:
        s = s.translate(str.maketrans(SMART_MAP))

    if collapse_whitespace:
        s = re.sub(r"\s+", " ", s)

    if trim:
        s = s.strip()

    if isinstance(max_len, int) and max_len > 0:
        s = s[:max_len]

    return s


def violet_identifier_key(s: str) -> str:
    """
    A stricter helper for join keys / IDs:
    - sanitize
    - uppercase
    - remove ALL whitespace
    - keep only A-Z 0-9 and a few safe delimiters
    """
    s = violet_sanitize_text(s, collapse_whitespace=True)
    s = s.upper().replace(" ", "")
    s = re.sub(r"[^A-Z0-9\-_/\.]", "", s)
    return s

# Example:
# clean = violet_sanitize_text(raw)
# vendor_id_key = violet_identifier_key(raw_vendor_id)

Tip: If you read bytes from files/emails, decode with strict control: data.decode("utf-8", errors="replace") or errors="ignore" depending on your goals.

🐬 MySQL 8 - Universal “Clean Text” SQL Patterns ETL-friendly

MySQL isn’t an ICU transliteration engine, but it can do a strong “ETL-clean” pass: remove invisible gremlins (BOM / zero-width / NBSP), normalize whitespace, and (optionally) force the output down to “simple characters only” using a conversion + whitelist.

Golden rules:

Do NOT “sanitize SQL” by hand - always parameterize values.
Sanitize for a purpose: human text (names/addresses) is different than join keys (IDs).
Prefer storing both: Col_Raw and Col_Clean.

Prep step: use UTF-8 everywhere. Make sure your table/columns are utf8mb4 and your connection uses UTF-8.

1) Quick ETL cleanup (invisibles + whitespace)

-- MySQL 8: Quick “ETL-clean” pass
      -- Removes common invisibles and normalizes whitespace without stripping normal letters.
      
      UPDATE your_table
      SET your_col = TRIM(
        REGEXP_REPLACE(
          -- Convert NBSP to space, remove BOM + zero-width characters
          REPLACE(
            REPLACE(
              REPLACE(
                REPLACE(
                  REPLACE(your_col,
                    CHAR(239,187,191), ''),      -- UTF-8 BOM bytes EF BB BF
                  _utf8mb4 0xE2808B, ''),        -- U+200B zero-width space
                _utf8mb4 0xE2808C, ''),          -- U+200C ZWNJ
              _utf8mb4 0xE2808D, ''),            -- U+200D ZWJ
            CHAR(194,160), ' '                  -- NBSP (C2 A0) -> space
          ),
          '[[:space:]]+', ' '                   -- collapse whitespace (tabs/newlines/multiple spaces)
        )
      )
      WHERE your_col IS NOT NULL;

When to use: imported text fields (names, addresses, notes) where you want readable output but need to kill the invisible gremlins.

2) Force “simple characters only” (conversion + whitelist)

This is the “Violet is done being nice” option. It forces text through a latin1 conversion (non-latin1 becomes ?), then removes anything not in your allowed character set. Adjust the whitelist based on your field type.

-- MySQL 8: Force simple characters only (ASCII-ish)
      -- Step A: remove invisibles
      -- Step B: convert USING latin1 (forces many Unicode chars into '?')
      -- Step C: whitelist allowed characters only
      
      UPDATE your_table
      SET your_col = TRIM(
        REGEXP_REPLACE(
          CONVERT(
            REPLACE(
              REPLACE(
                REPLACE(
                  REPLACE(
                    REPLACE(your_col,
                      CHAR(239,187,191), ''),     -- BOM bytes
                    _utf8mb4 0xE2808B, ''),       -- ZWSP
                  _utf8mb4 0xE2808C, ''),         -- ZWNJ
                _utf8mb4 0xE2808D, ''),           -- ZWJ
              CHAR(194,160), ' '                 -- NBSP -> space
            )
            USING latin1
          ),
          '[^A-Za-z0-9 .,_@:/\\-]+', ''          -- <-- WHITELIST (edit this)
        )
      )
      WHERE your_col IS NOT NULL;

Whitelist examples:

Join keys / IDs (very strict): '[^A-Za-z0-9_-]+'
Item numbers / vendor style IDs: '[^A-Za-z0-9._/-]+'
Names/addresses (looser): '[^A-Za-z0-9 .,''#&()/_-]+'

3) (Optional) Preserve meaning: map “smart punctuation” BEFORE forcing simple chars

If you want curly quotes/dashes/ellipsis to become their plain equivalents instead of turning into ?, do this mapping first, then run the force-simple step.

-- MySQL 8: map common “pretty punctuation” first, then whitelist
      UPDATE your_table
      SET your_col = TRIM(
        REGEXP_REPLACE(
          CONVERT(
            REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
              REPLACE(REPLACE(your_col,
                _utf8mb4 0xE28098, "'"),   -- ‘
                _utf8mb4 0xE28099, "'"),   -- ’
              _utf8mb4 0xE2809C, '"'),    -- “
              _utf8mb4 0xE2809D, '"'),    -- ”
            _utf8mb4 0xE28093, "-"),      -- –
            _utf8mb4 0xE28094, "-"),      -- -
            _utf8mb4 0xE280A6, "..."),    -- ...
            CHAR(239,187,191), ''),       -- BOM bytes
            CHAR(194,160), ' ')           -- NBSP -> space
            USING latin1
          ),
          '[^A-Za-z0-9 .,_@:/\\-]+', ''   -- whitelist
        )
      )
      WHERE your_col IS NOT NULL;

Performance tip: don’t run these expressions at query-time. Clean at ingestion (or in a one-time ETL update) and store to a _Clean column.

Collation note: Collations (like utf8mb4_0900_ai_ci) help comparisons (case/accent-insensitive) but they do not rewrite your stored characters. To actually remove/force characters, use conversion + whitelist as above.

🏛 SQL Server - Universal “Clean Text” SQL Patterns NVARCHAR

Not all versions of SQL Server support fancy regex, but we can still normalize whitespace, remove BOM/zero-width, and standardize smart punctuation using REPLACE.

-- SQL Server 2014: inline cleanup expression you can use in SELECT/UPDATE
-- Recommended: store original + store cleaned (e.g., Col_Raw, Col_Clean)

-- Common Unicode codes:
-- NBSP: NCHAR(160)
-- BOM:  NCHAR(65279)
-- ZWSP: NCHAR(8203)
-- ZWNJ: NCHAR(8204)
-- ZWJ:  NCHAR(8205)
-- Smart quotes: 8216/8217/8220/8221
-- En dash: 8211, Em dash: 8212, Ellipsis: 8230

DECLARE @s NVARCHAR(MAX) = @YourValue;

-- Remove invisibles / normalize whitespace / smart punctuation
SET @s = REPLACE(@s, NCHAR(65279), N'');  -- BOM
SET @s = REPLACE(@s, NCHAR(8203),  N'');  -- ZWSP
SET @s = REPLACE(@s, NCHAR(8204),  N'');  -- ZWNJ
SET @s = REPLACE(@s, NCHAR(8205),  N'');  -- ZWJ
SET @s = REPLACE(@s, NCHAR(160),   N' '); -- NBSP

-- Smart quotes to plain
SET @s = REPLACE(@s, NCHAR(8216), N'''');
SET @s = REPLACE(@s, NCHAR(8217), N'''');
SET @s = REPLACE(@s, NCHAR(8220), N'"');
SET @s = REPLACE(@s, NCHAR(8221), N'"');

-- Dashes / ellipsis
SET @s = REPLACE(@s, NCHAR(8211), N'-');
SET @s = REPLACE(@s, NCHAR(8212), N'-');
SET @s = REPLACE(@s, NCHAR(8230), N'...');

-- Tabs/newlines to space
SET @s = REPLACE(@s, CHAR(9),  N' ');
SET @s = REPLACE(@s, CHAR(10), N' ');
SET @s = REPLACE(@s, CHAR(13), N' ');

-- Collapse double spaces (repeat a few times or loop)
SET @s = LTRIM(RTRIM(@s));
SET @s = REPLACE(@s, N'  ', N' ');
SET @s = REPLACE(@s, N'  ', N' ');
SET @s = REPLACE(@s, N'  ', N' ');
SET @s = REPLACE(@s, N'  ', N' ');

SELECT @s AS CleanValue;

Tip: For big tables, do this in batches and store to a computed/ETL column instead of repeatedly cleaning at query time.

Normalize → Store → Sleep
If text is stable, everything downstream behaves.