Violet can find the monsters... but sometimes you need to clean the battlefield.
This page is a drop-in toolkit for common pipelines: PHP → Python → MySQL → SQL Server.
Use this for: imported text, vendor IDs, names, addresses, CSV fields, UI inputs. It normalizes Unicode, removes invisibles/control chars, and converts smart punctuation.
<?php
/**
* Violet Universal Sanitizer (PHP 8.4)
* - Normalizes to UTF-8 NFC
* - Removes BOM, zero-width, and control chars
* - Normalizes whitespace
* - Converts smart quotes/dashes to plain ASCII equivalents
*
* NOTE: This is NOT a substitute for SQL parameterization.
*/
function violet_sanitize_text(string $s, array $opts = []): string {
$o = array_merge([
'normalize_unicode' => true, // requires intl ext for Normalizer
'strip_invisible' => true,
'strip_controls' => true,
'smart_punct_to_ascii' => true,
'collapse_whitespace' => true,
'trim' => true,
'max_len' => null, // int or null
], $opts);
// Ensure valid UTF-8 (discard invalid bytes)
$s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
// Normalize Unicode to NFC (composed) for stable comparisons
if ($o['normalize_unicode'] && class_exists('Normalizer')) {
$s = Normalizer::normalize($s, Normalizer::FORM_C) ?? $s;
}
if ($o['strip_invisible']) {
// BOM + zero-width space/joiners + word joiner + etc
$s = preg_replace('/\x{FEFF}|\x{200B}|\x{200C}|\x{200D}|\x{2060}/u', '', $s) ?? $s;
// NBSP to normal space
$s = str_replace("\xC2\xA0", " ", $s);
}
if ($o['strip_controls']) {
// Remove C0/C1 control chars, keep \n \r \t if you want (here we keep them)
$s = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $s) ?? $s;
}
if ($o['smart_punct_to_ascii']) {
$map = [
// quotes
"\xE2\x80\x98" => "'", // ‘
"\xE2\x80\x99" => "'", // ’
"\xE2\x80\x9C" => '"', // “
"\xE2\x80\x9D" => '"', // ”
"\xE2\x80\x9A" => "'", // ‚
"\xE2\x80\xB2" => "'", // ′
"\xE2\x80\xB3" => '"', // ″
// dashes
"\xE2\x80\x93" => "-", // –
"\xE2\x80\x94" => "-", // -
"\xE2\x80\x95" => "-", // ―
// ellipsis
"\xE2\x80\xA6" => "...", // ...
];
$s = strtr($s, $map);
}
if ($o['collapse_whitespace']) {
// Convert any run of whitespace to a single space (keeps it readable)
$s = preg_replace('/[ \t\r\n]+/u', ' ', $s) ?? $s;
}
if ($o['trim']) {
$s = trim($s);
}
if (is_int($o['max_len']) && $o['max_len'] > 0) {
$s = mb_substr($s, 0, $o['max_len'], 'UTF-8');
}
return $s;
}
// Example:
// $clean = violet_sanitize_text($raw, ['max_len' => 255]);
?>
collapse_whitespace=false and stricter rules (see Python section for an identifier cleaner pattern).
Use this for: ETL parsing, filename normalization, stable comparison keys, CSV cleanup, and logging.
import re
import unicodedata
# Common invisibles / troublemakers
_RE_INVIS = re.compile(r"[\ufeff\u200b\u200c\u200d\u2060]")
_RE_CTRL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
SMART_MAP = {
"\u2018": "'", # ‘
"\u2019": "'", # ’
"\u201c": '"', # “
"\u201d": '"', # ”
"\u201a": "'", # ‚
"\u2032": "'", # ′
"\u2033": '"', # ″
"\u2013": "-", # –
"\u2014": "-", # -
"\u2015": "-", # ―
"\u2026": "...",# ...
"\u00a0": " ", # NBSP
}
def violet_sanitize_text(
s: str,
*,
normalize_unicode: bool = True,
strip_invisible: bool = True,
strip_controls: bool = True,
smart_punct_to_ascii: bool = True,
collapse_whitespace: bool = True,
trim: bool = True,
max_len: int | None = None,
) -> str:
if s is None:
return ""
# Ensure it's a str
s = str(s)
if normalize_unicode:
s = unicodedata.normalize("NFC", s)
if strip_invisible:
s = _RE_INVIS.sub("", s)
if strip_controls:
s = _RE_CTRL.sub("", s)
if smart_punct_to_ascii:
s = s.translate(str.maketrans(SMART_MAP))
if collapse_whitespace:
s = re.sub(r"\s+", " ", s)
if trim:
s = s.strip()
if isinstance(max_len, int) and max_len > 0:
s = s[:max_len]
return s
def violet_identifier_key(s: str) -> str:
"""
A stricter helper for join keys / IDs:
- sanitize
- uppercase
- remove ALL whitespace
- keep only A-Z 0-9 and a few safe delimiters
"""
s = violet_sanitize_text(s, collapse_whitespace=True)
s = s.upper().replace(" ", "")
s = re.sub(r"[^A-Z0-9\-_/\.]", "", s)
return s
# Example:
# clean = violet_sanitize_text(raw)
# vendor_id_key = violet_identifier_key(raw_vendor_id)
data.decode("utf-8", errors="replace") or errors="ignore" depending on your goals.
MySQL isn’t an ICU transliteration engine, but it can do a strong “ETL-clean” pass: remove invisible gremlins (BOM / zero-width / NBSP), normalize whitespace, and (optionally) force the output down to “simple characters only” using a conversion + whitelist.
Col_Raw and Col_Clean.utf8mb4 and your connection uses UTF-8.
1) Quick ETL cleanup (invisibles + whitespace)
-- MySQL 8: Quick “ETL-clean” pass
-- Removes common invisibles and normalizes whitespace without stripping normal letters.
UPDATE your_table
SET your_col = TRIM(
REGEXP_REPLACE(
-- Convert NBSP to space, remove BOM + zero-width characters
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(your_col,
CHAR(239,187,191), ''), -- UTF-8 BOM bytes EF BB BF
_utf8mb4 0xE2808B, ''), -- U+200B zero-width space
_utf8mb4 0xE2808C, ''), -- U+200C ZWNJ
_utf8mb4 0xE2808D, ''), -- U+200D ZWJ
CHAR(194,160), ' ' -- NBSP (C2 A0) -> space
),
'[[:space:]]+', ' ' -- collapse whitespace (tabs/newlines/multiple spaces)
)
)
WHERE your_col IS NOT NULL;
2) Force “simple characters only” (conversion + whitelist)
This is the “Violet is done being nice” option. It forces text through a latin1 conversion
(non-latin1 becomes ?), then removes anything not in your allowed character set.
Adjust the whitelist based on your field type.
-- MySQL 8: Force simple characters only (ASCII-ish)
-- Step A: remove invisibles
-- Step B: convert USING latin1 (forces many Unicode chars into '?')
-- Step C: whitelist allowed characters only
UPDATE your_table
SET your_col = TRIM(
REGEXP_REPLACE(
CONVERT(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(your_col,
CHAR(239,187,191), ''), -- BOM bytes
_utf8mb4 0xE2808B, ''), -- ZWSP
_utf8mb4 0xE2808C, ''), -- ZWNJ
_utf8mb4 0xE2808D, ''), -- ZWJ
CHAR(194,160), ' ' -- NBSP -> space
)
USING latin1
),
'[^A-Za-z0-9 .,_@:/\\-]+', '' -- <-- WHITELIST (edit this)
)
)
WHERE your_col IS NOT NULL;
'[^A-Za-z0-9_-]+''[^A-Za-z0-9._/-]+''[^A-Za-z0-9 .,''#&()/_-]+'3) (Optional) Preserve meaning: map “smart punctuation” BEFORE forcing simple chars
If you want curly quotes/dashes/ellipsis to become their plain equivalents instead of turning into ?,
do this mapping first, then run the force-simple step.
-- MySQL 8: map common “pretty punctuation” first, then whitelist
UPDATE your_table
SET your_col = TRIM(
REGEXP_REPLACE(
CONVERT(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(your_col,
_utf8mb4 0xE28098, "'"), -- ‘
_utf8mb4 0xE28099, "'"), -- ’
_utf8mb4 0xE2809C, '"'), -- “
_utf8mb4 0xE2809D, '"'), -- ”
_utf8mb4 0xE28093, "-"), -- –
_utf8mb4 0xE28094, "-"), -- -
_utf8mb4 0xE280A6, "..."), -- ...
CHAR(239,187,191), ''), -- BOM bytes
CHAR(194,160), ' ') -- NBSP -> space
USING latin1
),
'[^A-Za-z0-9 .,_@:/\\-]+', '' -- whitelist
)
)
WHERE your_col IS NOT NULL;
_Clean column.
utf8mb4_0900_ai_ci) help comparisons (case/accent-insensitive)
but they do not rewrite your stored characters. To actually remove/force characters, use conversion + whitelist as above.
Not all versions of SQL Server support fancy regex, but we can still normalize whitespace, remove BOM/zero-width,
and standardize smart punctuation using REPLACE.
-- SQL Server 2014: inline cleanup expression you can use in SELECT/UPDATE
-- Recommended: store original + store cleaned (e.g., Col_Raw, Col_Clean)
-- Common Unicode codes:
-- NBSP: NCHAR(160)
-- BOM: NCHAR(65279)
-- ZWSP: NCHAR(8203)
-- ZWNJ: NCHAR(8204)
-- ZWJ: NCHAR(8205)
-- Smart quotes: 8216/8217/8220/8221
-- En dash: 8211, Em dash: 8212, Ellipsis: 8230
DECLARE @s NVARCHAR(MAX) = @YourValue;
-- Remove invisibles / normalize whitespace / smart punctuation
SET @s = REPLACE(@s, NCHAR(65279), N''); -- BOM
SET @s = REPLACE(@s, NCHAR(8203), N''); -- ZWSP
SET @s = REPLACE(@s, NCHAR(8204), N''); -- ZWNJ
SET @s = REPLACE(@s, NCHAR(8205), N''); -- ZWJ
SET @s = REPLACE(@s, NCHAR(160), N' '); -- NBSP
-- Smart quotes to plain
SET @s = REPLACE(@s, NCHAR(8216), N'''');
SET @s = REPLACE(@s, NCHAR(8217), N'''');
SET @s = REPLACE(@s, NCHAR(8220), N'"');
SET @s = REPLACE(@s, NCHAR(8221), N'"');
-- Dashes / ellipsis
SET @s = REPLACE(@s, NCHAR(8211), N'-');
SET @s = REPLACE(@s, NCHAR(8212), N'-');
SET @s = REPLACE(@s, NCHAR(8230), N'...');
-- Tabs/newlines to space
SET @s = REPLACE(@s, CHAR(9), N' ');
SET @s = REPLACE(@s, CHAR(10), N' ');
SET @s = REPLACE(@s, CHAR(13), N' ');
-- Collapse double spaces (repeat a few times or loop)
SET @s = LTRIM(RTRIM(@s));
SET @s = REPLACE(@s, N' ', N' ');
SET @s = REPLACE(@s, N' ', N' ');
SET @s = REPLACE(@s, N' ', N' ');
SET @s = REPLACE(@s, N' ', N' ');
SELECT @s AS CleanValue;