The IsTextUnicode function tries to guess the encoding of a block of memory purporting to contain text, but it can say only "Looks like Unicode" or "Doesn't look like Unicode", and there are some notorious examples of where it guesses wrong, the most famous being the Notepad "Bush hid the facts" bug, in which a short ANSI file is misdetected as UTF-16LE.
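For comparison, calling IsTextUnicode looks something like this. A minimal sketch; the wrapper name LooksLikeUnicode is made up for illustration, and passing NULL as the last parameter would run all the tests:

#include <windows.h>

// The last parameter selects which tests to run on input and
// receives which tests passed on output.
bool LooksLikeUnicode(const void *buffer, int cbBuffer)
{
 INT tests = IS_TEXT_UNICODE_UNICODE_MASK;
 return IsTextUnicode(buffer, cbBuffer, &tests) != FALSE;
}

Either way, all you get back is a yes-or-no answer.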
A more flexible alternative is IMultiLanguage2::DetectCodepageInIStream and its buffer-based equivalent IMultiLanguage2::DetectInputCodepage. Not only can these methods detect a much larger range of code pages, they can also report multiple code pages, each with a corresponding confidence level.
Here's a Little Program that takes the function out for a spin. (Remember, Little Programs do little to no error checking.)
#define UNICODE
#define _UNICODE
#include <windows.h>
#include <shlwapi.h>
#include <ole2.h>
#include <mlang.h>
#include <atlbase.h>
#include <stdio.h>

// RAII wrapper that calls CoInitialize/CoUninitialize
// (the same helper used in earlier Little Programs).
class CCoInitialize {
public:
 CCoInitialize() : m_hr(CoInitialize(NULL)) { }
 ~CCoInitialize() { if (SUCCEEDED(m_hr)) CoUninitialize(); }
 operator HRESULT() const { return m_hr; }
 HRESULT m_hr;
};

bool IsHtmlFile(PCWSTR pszFile)
{
 PCWSTR pszExtension = PathFindExtensionW(pszFile);
 return
  CompareStringOrdinal(pszExtension, -1,
                       L".htm", -1, TRUE) == CSTR_EQUAL ||
  CompareStringOrdinal(pszExtension, -1,
                       L".html", -1, TRUE) == CSTR_EQUAL;
}

int __cdecl wmain(int argc, wchar_t **argv)
{
 if (argc < 2) return 0;

 CCoInitialize init;

 // Open the file as a read-only stream.
 CComPtr<IStream> spstm;
 SHCreateStreamOnFileEx(argv[1], STGM_READ, 0, FALSE, nullptr, &spstm);

 CComPtr<IMultiLanguage2> spml;
 CoCreateInstance(CLSID_CMultiLanguage, NULL,
                  CLSCTX_ALL, IID_PPV_ARGS(&spml));

 // On entry, cInfo is the size of the array;
 // on exit, it is the number of entries filled in.
 DetectEncodingInfo info[10];
 INT cInfo = ARRAYSIZE(info);

 DWORD dwFlag = IsHtmlFile(argv[1]) ? MLDETECTCP_HTML
                                    : MLDETECTCP_NONE;
 HRESULT hr = spml->DetectCodepageInIStream(
     dwFlag, 0, spstm, info, &cInfo);
 if (hr == S_OK) {
  for (int i = 0; i < cInfo; i++) {
   wprintf(L"info[%d].nLangID = %d\n", i, info[i].nLangID);
   wprintf(L"info[%d].nCodePage = %d\n", i, info[i].nCodePage);
   wprintf(L"info[%d].nDocPercent = %d\n", i, info[i].nDocPercent);
   wprintf(L"info[%d].nConfidence = %d\n", i, info[i].nConfidence);
  }
 } else {
  wprintf(L"Cannot determine the encoding (error: 0x%08x)\n", hr);
 }
 return 0;
}
Run the program with a file name as the command line argument, and the program will report all the detected code pages.
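If the bytes are already sitting in memory, the buffer-based IMultiLanguage2::DetectInputCodepage works the same way. A minimal sketch, reusing the spml pointer from the program above; pData (a CHAR *) and cbData are hypothetical names for your buffer and its size:

// On entry, cbScanned is the number of bytes available;
// on exit, it is the number of bytes the detector examined.
DetectEncodingInfo info[10];
INT cInfo = ARRAYSIZE(info);
INT cbScanned = cbData;
HRESULT hr = spml->DetectInputCodepage(MLDETECTCP_NONE, 0,
                                       pData, &cbScanned, info, &cInfo);
// On S_OK, info[0..cInfo-1] holds the guesses, same as before.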
One thing that may not be obvious is that the program passes the MLDETECTCP_HTML flag if the file extension is .htm or .html. That is a hint to the detector that it shouldn't get faked out by markup like <body> and think it found an English word.
Here's the output of the program when run on its own source code:
info[0].nLangID = 9
info[0].nCodePage = 20127
info[0].nDocPercent = 100
info[0].nConfidence = 83
info[1].nLangID = -1
info[1].nCodePage = 65001
info[1].nDocPercent = -1
info[1].nConfidence = -1
This says that its first guess is that the text is in language 9, which is LANG_ENGLISH, code page 20127, which is US-ASCII. That text occupies 100% of the file, and the confidence level is 83.
The second guess is that the text is in code page 65001, which is UTF-8, but the confidence level for that is low.
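If you want to act on the guesses, a natural follow-up is to pick the entry with the highest confidence and hand its code page to MultiByteToWideChar. A sketch under the same no-error-checking rules; the names are mine, not part of the API:

// Pick the highest-confidence guess. (A confidence of -1 means
// the detector couldn't compute one, as in the output above.)
const DetectEncodingInfo *pBest = &info[0];
for (int i = 1; i < cInfo; i++) {
 if (info[i].nConfidence > pBest->nConfidence) pBest = &info[i];
}

// Ask MultiByteToWideChar how many wide characters the text
// converts to under the winning code page. pData and cbData
// are the same hypothetical buffer as before.
int cchWide = MultiByteToWideChar(pBest->nCodePage, 0,
                                  pData, cbData, nullptr, 0);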
The language-guessing part of the function is not very sophisticated. For a higher-quality algorithm for guessing what language some text is in, use Extended Linguistic Services. I won't bother writing a sample application because MSDN already contains one.