Channel: The Old New Thing

The crazy world of stripping diacritics


Today's Little Program strips diacritics from a Unicode string. Why? Hey, I said that Little Programs require little to no motivation. It might come in handy in a spam filter, since it was popular, at least for a time, to put random accent marks on spam subject lines in order to sneak past keyword filters. (It doesn't seem to be popular any more.)

This is basically a C++-ization of the C# code originally written by Michael Kaplan. Don't forget to read the follow-up discussion, which notes that this technique can produce strange results.

First, let's create our dialog box. Note that I intentionally give it a huge font so that the diacritics are easier to see.

// scratch.h

#define IDD_SCRATCH 1
#define IDC_SOURCE 100
#define IDC_SOURCEPOINTS 101
#define IDC_DEST 102
#define IDC_DESTPOINTS 103

// scratch.rc

#include <windows.h>
#include "scratch.h"

IDD_SCRATCH DIALOGEX 0, 0, 320, 88
STYLE DS_MODALFRAME | WS_POPUP | WS_CAPTION | WS_SYSMENU
Caption "Stripping diacritics"
FONT 20, "MS Shell Dlg"
BEGIN
    LTEXT "Original:", -1, 4, 8, 38, 10
    EDITTEXT IDC_SOURCE, 46, 6, 270, 12, ES_AUTOHSCROLL
    LTEXT "", IDC_SOURCEPOINTS, 46, 22, 270, 12
    LTEXT "Modified:", -1, 4, 40, 38, 10
    EDITTEXT IDC_DEST, 46, 38, 270, 12, ES_AUTOHSCROLL
    LTEXT "", IDC_DESTPOINTS, 46, 54, 270, 12
    DEFPUSHBUTTON "OK", IDOK, 266, 70, 50, 14
END

Now the program that uses the dialog box.

// scratch.cpp

#define STRICT
#define UNICODE
#define _UNICODE
#include <windows.h>
#include <windowsx.h>
#include <strsafe.h>
#include "scratch.h"

#define MAXSOURCE 64

void SetDlgItemCodePoints(HWND hwnd, int idc, PCWSTR psz)
{
  wchar_t szResult[MAXSOURCE * 4 * 5];
  szResult[0] = 0;
  PWSTR pszResult = szResult;
  size_t cchResult = ARRAYSIZE(szResult);
  HRESULT hr = S_OK;
  for (; SUCCEEDED(hr) && *psz; psz++) {
    wchar_t szPoint[6];
    hr = StringCchPrintf(szPoint, ARRAYSIZE(szPoint), L"%04x ", *psz);
    if (SUCCEEDED(hr)) {
      hr = StringCchCatEx(pszResult, cchResult, szPoint, &pszResult, &cchResult, 0);
    }
  }
  SetDlgItemText(hwnd, idc, szResult);
}

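The SetDlgItemCodePoints function takes a UTF-16 string and prints its code units in hexadecimal. (For a character outside the Basic Multilingual Plane, that means you see its two surrogates rather than a single code point.) This is just to help visualize the result; it's not part of the actual diacritic-removal algorithm.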
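If you want to play with the same formatting outside a dialog box, here is a minimal portable sketch. The name FormatCodeUnits and the use of std::u16string are my own assumptions for illustration; the program above uses StringCchPrintf and StringCchCatEx instead.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the program above): format each UTF-16
// code unit as four hex digits followed by a space, mirroring the "%04x "
// format that SetDlgItemCodePoints uses.
std::string FormatCodeUnits(const std::u16string& s)
{
  std::string result;
  for (char16_t unit : s) {
    char buf[6]; // 4 hex digits + space + terminator
    std::snprintf(buf, sizeof(buf), "%04x ", static_cast<unsigned>(unit));
    result += buf;
  }
  return result;
}
```

A character outside the Basic Multilingual Plane comes out as two entries, one per surrogate, which is why the Fraktur p in the table further down is listed as D835 DD95.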

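void OnUpdate(HWND hwnd)
{
  wchar_t szSource[MAXSOURCE];
  GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));
  wchar_t szDest[MAXSOURCE * 4];

  // Decompose into Normalization Form KD so that each diacritic
  // becomes a separate combining character.
  int cchActual = NormalizeString(NormalizationKD,
                                  szSource, -1,
                                  szDest, ARRAYSIZE(szDest));
  if (cchActual <= 0) szDest[0] = 0; // normalization failed; show nothing

  // Classify each character of the decomposed string.
  // Zero-initialized so that nothing is stripped if classification fails.
  WORD rgType[ARRAYSIZE(szDest)] = {};
  GetStringTypeW(CT_CTYPE3, szDest, -1, rgType);

  // Compact the string in place, dropping the nonspacing characters.
  PWSTR pszWrite = szDest;
  for (int i = 0; szDest[i]; i++) {
    if (!(rgType[i] & C3_NONSPACING)) {
      *pszWrite++ = szDest[i];
    }
  }
  *pszWrite = 0;

  SetDlgItemText(hwnd, IDC_DEST, szDest);
  SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);
  SetDlgItemCodePoints(hwnd, IDC_DESTPOINTS, szDest);
}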

Okay, here's where the actual work happens. We put the source string into Normalization Form KD. This decomposes each accented character into a base character followed by combining characters for its diacritics, so that we can identify the diacritics with GetStringTypeW and then strip them out.
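The stripping step on its own can be sketched without any Windows dependencies. This is a simplification under loud assumptions: the input is taken to be already decomposed, and "nonspacing" is approximated as the Combining Diacritical Marks block (U+0300 through U+036F) rather than whatever GetStringTypeW reports. The function names are mine.

```cpp
#include <string>

// Assumption: only U+0300..U+036F (Combining Diacritical Marks) count as
// nonspacing in this sketch. The real program asks GetStringTypeW instead.
bool IsCombiningDiacritic(char16_t ch)
{
  return ch >= 0x0300 && ch <= 0x036F;
}

// Copy the string, skipping the combining marks. The input is assumed to
// be already decomposed (for example, by NormalizeString).
std::u16string StripCombiningMarks(const std::u16string& decomposed)
{
  std::u16string result;
  for (char16_t ch : decomposed) {
    if (!IsCombiningDiacritic(ch)) {
      result += ch;
    }
  }
  return result;
}
```

Feeding it a capital A followed by U+030A (combining ring above) yields a plain A. The narrow range is the weak spot: Katakana Go, for example, decomposes into Ko plus U+3099, which this range would miss.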

Of course, in real life, you wouldn't hard-code the array sizes like I did here, but this is just a Little Program, and Little Programs are allowed to take shortcuts.

The rest of the program is just a framework to get into that function.

INT_PTR CALLBACK DlgProc(HWND hwnd, UINT wm,
                         WPARAM wParam, LPARAM lParam)
{
  switch (wm)
  {
  case WM_INITDIALOG:
    return TRUE;

  case WM_COMMAND:
    switch (GET_WM_COMMAND_ID(wParam, lParam)) {
    case IDC_SOURCE:
      switch (GET_WM_COMMAND_CMD(wParam, lParam)) {
      case EN_UPDATE:
        OnUpdate(hwnd);
        break;
      }
      break;
    case IDOK:
      EndDialog(hwnd, 0);
      return TRUE;
    }
    break;

  case WM_CLOSE:
    EndDialog(hwnd, 0);
    return TRUE;
  }

  return FALSE;
}

int WINAPI wWinMain(HINSTANCE hinst, HINSTANCE hinstPrev,
                   LPWSTR lpCmdLine, int nShowCmd)
{
  DialogBox(hinst, MAKEINTRESOURCE(IDD_SCRATCH), nullptr, DlgProc);
  return 0;
}

Okay, let's take this program for a spin. Here are some interesting characters to try:

| Original | Code point | Name | Result | Code point(s) | Name |
|---|---|---|---|---|---|
| ª | 00AA | Feminine ordinal indicator | a | 0061 | Latin small letter a |
| ¹ | 00B9 | Superscript one | 1 | 0031 | Digit one |
| ½ | 00BD | Vulgar fraction one half | 1⁄2 | 0031 2044 0032 | Digit one + Fraction slash + Digit two |
| ı | 0131 | Latin small letter dotless i | ı | 0131 | Latin small letter dotless i |
| Ø | 00D8 | Latin capital letter O with stroke | Disappears! | | |
| ł | 0142 | Latin small letter l with stroke | ł | 0142 | Latin small letter l with stroke |
| ŀ | 0140 | Latin small letter l with middle dot | l· | 006C 00B7 | Latin small letter l + middle dot |
| æ | 00E6 | Latin small letter ae | æ | 00E6 | Latin small letter ae |
| Ή | 0389 | Greek capital letter Eta with tonos | Η | 0397 | Greek capital letter Eta |
| А | 0410 | Cyrillic capital letter A | А | 0410 | Cyrillic capital letter A |
| Å | 00C5 | Latin capital letter A with ring above | A | 0041 | Latin capital letter A |
| Ａ | FF21 | Fullwidth Latin capital letter A | A | 0041 | Latin capital letter A |
| ① | 2460 | Circled digit one | 1 | 0031 | Digit one |
| ➀ | 2780 | Dingbat circled sans-serif digit one | ➀ | 2780 | Dingbat circled sans-serif digit one |
| ® | 00AE | Registered sign | ® | 00AE | Registered sign |
| Ⓡ | 24C7 | Circled Latin capital letter R | R | 0052 | Latin capital letter R |
| 𝖕 | D835 DD95 | Mathematical bold Fraktur small p | p | 0070 | Latin small letter p |
| ｬ | FF6C | Halfwidth Katakana letter small Ya | ャ | 30E3 | Katakana letter small Ya |
| ャ | 30E3 | Katakana letter small Ya | ャ | 30E3 | Katakana letter small Ya |
| ゴ | 30B4 | Katakana letter Go | コ | 30B3 | Katakana letter Ko |
| “ | 201C | Left double quotation mark | “ | 201C | Left double quotation mark |
| ” | 201D | Right double quotation mark | ” | 201D | Right double quotation mark |
| „ | 201E | Double low-9 quotation mark | „ | 201E | Double low-9 quotation mark |
| ‟ | 201F | Double high-reversed-9 quotation mark | ‟ | 201F | Double high-reversed-9 quotation mark |
| ″ | 2033 | Double prime | ′′ | 2032 2032 | Prime + Prime |
| ‵ | 2035 | Reverse prime | ‵ | 2035 | Reverse prime |
| ‹ | 2039 | Single left-pointing angle quotation mark | ‹ | 2039 | Single left-pointing angle quotation mark |
| « | 00AB | Left-pointing double angle quotation mark | « | 00AB | Left-pointing double angle quotation mark |
| — | 2014 | Em-dash | — | 2014 | Em-dash |
| ‼ | 203C | Double exclamation mark | !! | 0021 0021 | Exclamation mark + Exclamation mark |

There are some interesting quirks here. Mind you, this is what the Unicode Consortium says, so if you think they are wrong, you can take it up with them.

The superscript-like characters are converted to their plain versions. Enclosed alphabetics are also converted, but not the ® symbol. Fullwidth forms of Latin letters are converted to their halfwidth equivalents. On the other hand, halfwidth Katakana characters are expanded to their fullwidth equivalents. But small Katakana characters are not converted to their large equivalents.

The Ø disappears completely! What's up with that? The character type for Ø is reported as C3_ALPHA | C3_NONSPACING | C3_DIACRITIC, and since we are removing nonspacing characters, this causes it to be removed. (Why is Ø nonspacing? It occupies space!) For whatever reason, it does not decompose into O + Combining Solidus Overlay. On the other hand, the Polish ł remains intact because it is reported as C3_ALPHA | C3_DIACRITIC. Poland wins and Norway loses?
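To see why the two letters fare differently, here is a small sketch that replays the flag test with local stand-ins for the CT_CTYPE3 constants. The numeric values are as I recall them from winnls.h, so check the header before relying on them. The StripUnlessAlpha variant is purely hypothetical: it assumes genuine combining marks are never also flagged C3_ALPHA, which I have not verified.

```cpp
#include <cstdint>

// Local stand-ins so the sketch compiles without <windows.h>.
// Values are assumed to match winnls.h.
constexpr std::uint16_t kNonspacing = 0x0001; // C3_NONSPACING
constexpr std::uint16_t kDiacritic  = 0x0002; // C3_DIACRITIC
constexpr std::uint16_t kAlpha      = 0x8000; // C3_ALPHA

// The rule the program uses: strip anything flagged nonspacing.
constexpr bool StripOriginal(std::uint16_t type)
{
  return (type & kNonspacing) != 0;
}

// Hypothetical variant: spare characters that are also alphabetic.
constexpr bool StripUnlessAlpha(std::uint16_t type)
{
  return (type & kNonspacing) != 0 && (type & kAlpha) == 0;
}
```

With the types quoted above, StripOriginal removes Ø (alpha, nonspacing, diacritic) and keeps ł (alpha, diacritic), whereas StripUnlessAlpha keeps both.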

The diacritic removal ignores linguistic rules. The Swedish Å decomposes into a capital A and a combining ring above, even though in Swedish, the character is considered nondecomposable. (Just like the capital letter Q in English does not decompose into an O and a tail.) Katakana Go suffers a similar ignoble fate, converting to Katakana Ko, which is linguistically nonsensical. But then again, removing diacritics is already linguistically nonsensical. Nonsensical operation is nonsensical.

There is no attempt to unify look-alike characters from different scripts. Look-alike characters in the Greek and Cyrillic alphabets are not mapped to their Latin doppelgängers.

The infamous Turkish dotless i does not turn into a dotted i. (And the lowercase Latin i does not decompose into a combining dot and a dotless i.)

Finally, I tried a selection of punctuation marks. Most of them pass through unchanged, with the exception of the double prime and double exclamation mark which each decompose into a pair of singles. (But double quotation marks do not decompose into a pair of singles.)

Okay, but the goal of this exercise was spam detection, so we are actually interested in mapping as far as possible all the way down to plain ASCII. We'd like to convert, for example, the look-alike characters in the Cyrillic and Greek alphabets to the Latin characters they resemble.

So let's try something else. If we want to convert to ASCII, then just convert to ASCII!

#define CP_ASCII 20127
void OnUpdate(HWND hwnd)
{
  wchar_t szSource[MAXSOURCE];
  GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));
  char szDest[MAXSOURCE * 2];
  int cchActual = WideCharToMultiByte(CP_ASCII, 0, szSource, -1,
                              szDest, ARRAYSIZE(szDest), 0, 0);
  if (cchActual <= 0) szDest[0] = 0;

  SetDlgItemTextA(hwnd, IDC_DEST, szDest);
  SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);
}

We can extend the table above with a new column.

| Original | Code point | Name | KD result | Code point(s) | Name | ASCII result | Code point | Name |
|---|---|---|---|---|---|---|---|---|
| ª | 00AA | Feminine ordinal indicator | a | 0061 | Latin small letter a | a | 0061 | Latin small letter a |
| ¹ | 00B9 | Superscript one | 1 | 0031 | Digit one | 1 | 0031 | Digit one |
| ½ | 00BD | Vulgar fraction one half | 1⁄2 | 0031 2044 0032 | Digit one + Fraction slash + Digit two | ? | | No conversion |
| ı | 0131 | Latin small letter dotless i | ı | 0131 | Latin small letter dotless i | i | 0069 | Latin small letter i |
| Ø | 00D8 | Latin capital letter O with stroke | Disappears! | | | O | 004F | Latin capital letter O |
| ł | 0142 | Latin small letter l with stroke | ł | 0142 | Latin small letter l with stroke | l | 006C | Latin small letter l |
| ŀ | 0140 | Latin small letter l with middle dot | l· | 006C 00B7 | Latin small letter l + middle dot | ? | | No conversion |
| æ | 00E6 | Latin small letter ae | æ | 00E6 | Latin small letter ae | a | 0061 | Latin small letter a |
| Ή | 0389 | Greek capital letter Eta with tonos | Η | 0397 | Greek capital letter Eta | ? | | No conversion |
| А | 0410 | Cyrillic capital letter A | А | 0410 | Cyrillic capital letter A | ? | | No conversion |
| Å | 00C5 | Latin capital letter A with ring above | A | 0041 | Latin capital letter A | A | 0041 | Latin capital letter A |
| Ａ | FF21 | Fullwidth Latin capital letter A | A | 0041 | Latin capital letter A | A | 0041 | Latin capital letter A |
| ① | 2460 | Circled digit one | 1 | 0031 | Digit one | ? | | No conversion |
| ➀ | 2780 | Dingbat circled sans-serif digit one | ➀ | 2780 | Dingbat circled sans-serif digit one | ? | | No conversion |
| ® | 00AE | Registered sign | ® | 00AE | Registered sign | R | 0052 | Latin capital letter R |
| Ⓡ | 24C7 | Circled Latin capital letter R | R | 0052 | Latin capital letter R | ? | | No conversion |
| 𝖕 | D835 DD95 | Mathematical bold Fraktur small p | p | 0070 | Latin small letter p | ?? | | No conversion |
| ｬ | FF6C | Halfwidth Katakana letter small Ya | ャ | 30E3 | Katakana letter small Ya | ? | | No conversion |
| ャ | 30E3 | Katakana letter small Ya | ャ | 30E3 | Katakana letter small Ya | ? | | No conversion |
| ゴ | 30B4 | Katakana letter Go | コ | 30B3 | Katakana letter Ko | ? | | No conversion |
| “ | 201C | Left double quotation mark | “ | 201C | Left double quotation mark | " | 0022 | Quotation mark |
| ” | 201D | Right double quotation mark | ” | 201D | Right double quotation mark | " | 0022 | Quotation mark |
| „ | 201E | Double low-9 quotation mark | „ | 201E | Double low-9 quotation mark | " | 0022 | Quotation mark |
| ‟ | 201F | Double high-reversed-9 quotation mark | ‟ | 201F | Double high-reversed-9 quotation mark | ? | | No conversion |
| ″ | 2033 | Double prime | ′′ | 2032 2032 | Prime + Prime | ? | | No conversion |
| ′ | 2032 | Prime | ′ | 2032 | Prime | ' | 0027 | Apostrophe |
| ‵ | 2035 | Reverse prime | ‵ | 2035 | Reverse prime | ` | 0060 | Grave accent |
| ‹ | 2039 | Single left-pointing angle quotation mark | ‹ | 2039 | Single left-pointing angle quotation mark | < | 003C | Less-than sign |
| « | 00AB | Left-pointing double angle quotation mark | « | 00AB | Left-pointing double angle quotation mark | < | 003C | Less-than sign |
| — | 2014 | Em-dash | — | 2014 | Em-dash | - | 002D | Hyphen-minus |
| ‼ | 203C | Double exclamation mark | !! | 0021 0021 | Exclamation mark + Exclamation mark | ? | | No conversion |

There are some interesting differences here.

Some characters fail to convert to ASCII outright. This is not unexpected for the Japanese characters, and it is mildly unexpected for the look-alikes in the Cyrillic and Greek alphabets. It is surprising for characters like the double prime, double exclamation mark, enclosed alphanumerics, and vulgar fractions: Normalization Form KD decomposes them into ASCII or near-ASCII pieces, yet the direct conversion to ASCII does not use those decompositions.

But the dotless i gets its dot back.

Another weird thing you might notice is that the æ converts to just the a. This goes contrary to the expectations of American English, because words which historically use the æ and œ are largely respelled in American English to use just the e. (Encyclopædia → encyclopedia, fœtus → fetus.) Mysteries abound.

If your real goal is to map every character to its nearest ASCII look-alike, then all these code page games are just beating around the bush. The way to go is to use the Unicode Confusables database. There is a huge data file and instructions on how to use it. There's also a nice Web site that lets you explore the confusables database interactively.
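To give a flavor of what consuming that data file involves, here is a rough sketch that parses lines of the form "source ; target ; type # comment" and applies the resulting map. The function names are mine, and the sample line in the usage note is hand-written in the file's format rather than copied from it. The sketch also assumes a single code point in the source field and skips the normalization steps that, as I understand it, the official instructions wrap around the mapping.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

using ConfusableMap = std::map<char32_t, std::u32string>;

// Parse one line of the form "source ; target ; type # comment".
// Returns false for blank lines, comment lines, and anything malformed.
bool ParseConfusableLine(const std::string& line, ConfusableMap& map)
{
  std::string data = line.substr(0, line.find('#')); // drop the comment
  std::size_t semi1 = data.find(';');
  if (semi1 == std::string::npos) return false;
  std::size_t semi2 = data.find(';', semi1 + 1);
  if (semi2 == std::string::npos) return false;

  // Field 1: the source code point, in hex.
  std::istringstream source(data.substr(0, semi1));
  unsigned long from = 0;
  if (!(source >> std::hex >> from)) return false;

  // Field 2: one or more target code points, in hex, space-separated.
  std::istringstream target(data.substr(semi1 + 1, semi2 - semi1 - 1));
  std::u32string to;
  unsigned long cp = 0;
  while (target >> std::hex >> cp) to += static_cast<char32_t>(cp);
  if (to.empty()) return false;

  map[static_cast<char32_t>(from)] = to;
  return true;
}

// Replace every character that has an entry in the map; pass the rest through.
std::u32string ReplaceConfusables(const std::u32string& text,
                                  const ConfusableMap& map)
{
  std::u32string result;
  for (char32_t ch : text) {
    auto it = map.find(ch);
    if (it != map.end()) result += it->second;
    else result += ch;
  }
  return result;
}
```

After feeding it a line like "0410 ; 0041 ; MA", a string that begins with the Cyrillic capital А comes back beginning with the Latin capital A. Note that the targets in the real file are not guaranteed to be ASCII, so a spam filter would still need a fallback for whatever is left over.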

Or you could just take the sledgehammer approach: If there are a significant number of characters outside the Latin alphabet and punctuation and you are expecting English text, then just reject it as likely spam.
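The sledgehammer might look something like this. Everything here is an assumption for illustration: the function names, the decision to treat printable ASCII as "the Latin alphabet and punctuation", and especially the 25% threshold, which I made up and which you would want to tune against real mail.

```cpp
#include <cstddef>
#include <string>

// Assumption: printable ASCII (space through tilde) stands in for
// "the Latin alphabet and punctuation".
bool IsPlainAscii(char16_t ch)
{
  return ch >= 0x20 && ch <= 0x7E;
}

// Reject the text when the share of UTF-16 code units outside printable
// ASCII exceeds the threshold. The default threshold is hypothetical.
bool LooksLikeSpam(const std::u16string& text,
                   double maxForeignFraction = 0.25)
{
  if (text.empty()) return false;
  std::size_t foreign = 0;
  for (char16_t ch : text) {
    if (!IsPlainAscii(ch)) ++foreign;
  }
  return static_cast<double>(foreign) / text.size() > maxForeignFraction;
}
```

With this threshold, a subject line written entirely in Cyrillic is rejected, while "naïve café" (two non-ASCII characters out of ten) squeaks through.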

ಠ_ಠ

