Author: W B Hacker Date: To: exim users Subject: Re: [exim] Matching Chinese Spreadsheet names
Marc Perkel wrote:
> Need a little regex help. Trying to use a match conditional in an acl to
> match Chinese spreadsheet attached file names.
>
> 2011\345\271\264\344\274\201\344\270\232\347\250\216-\345\212\241\347\250\275\346\237\245\344\270\216\347\250\216-\345\212\241\351\243\216\351\231\251-\350\247\204-\351\201\277\346\212\200\345\267\247\344\270\216\345\256\236\345\212\241.xls
>
>
> Thanks in advance.
>
>
Challenging unless you have a very limited input.
The bad news is that there are close to 20 different encoding schemes
in use for 'Chinese'.
The good news is that many are for bespoke 'internal' use, such as
creating meta-data to catalog university/scientific work, functional
equivalents of bar-codes, etc., and so will not be seen in the SMTP 'wild'.
Unfortunately, that leaves roughly five to seven encodings that WILL
occur over SMTP.
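
To illustrate why the encoding matters, here is a minimal Python sketch
(Python is used only for illustration; it is not part of Exim) that
encodes one character from the sample file name under several codecs.
The codec list is an assumption picked for the example, not a definitive
list of what turns up over SMTP.

```python
# Encode the same character under several Chinese-capable codecs.
# CODECS is an illustrative assumption, not an authoritative survey.
CODECS = ["utf-8", "gb2312", "gbk", "gb18030", "big5", "hz"]

def encode_all(text):
    """Return {codec name: encoded bytes} for every codec that can represent text."""
    out = {}
    for codec in CODECS:
        try:
            out[codec] = text.encode(codec)
        except UnicodeEncodeError:
            # This codec has no representation for the character; skip it.
            pass
    return out

encoded = encode_all("年")
for codec, raw in encoded.items():
    print(f"{codec:8} {raw!r}")
```

The UTF-8 bytes for that character are the \345\271\264 that follow
'2011' in the quoted file name, so that particular sample looks like
UTF-8. The same character under the other codecs is a different byte
string, and a pattern written for one will not match the others.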
That said - a binary pattern is a binary pattern, 'escaped' or
otherwise, so the task is not hopeless.
It just means you need an actual sample for each case.
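
As a rough sketch of the 'binary pattern from a sample' idea, the Python
below turns the backslash-octal escapes from the start of the quoted
file name back into raw bytes and builds a byte-level regex from them.
The helper name unescape_octal and the '.xls' suffix rule are
assumptions made for the example, and how the name is actually presented
inside an Exim ACL may differ, so treat it as a way to test a pattern
offline rather than as Exim configuration.

```python
import re

def unescape_octal(escaped):
    """Convert backslash-octal escapes (e.g. \\345) in an ASCII string to raw bytes."""
    return re.sub(
        rb"\\([0-3][0-7]{2})",                      # three octal digits, value <= 0o377
        lambda m: bytes([int(m.group(1), 8)]),
        escaped.encode("ascii"),
    )

# First part of the sample file name from the quoted message.
sample = r"2011\345\271\264\344\274\201\344\270\232"
raw = unescape_octal(sample)

# Match any file name that starts with those exact bytes and ends in .xls.
pattern = re.compile(re.escape(raw) + rb".*\.xls\Z", re.DOTALL)

print(bool(pattern.match(raw + b"-anything.xls")))            # same bytes: matches
print(bool(pattern.match("2011年企业.xls".encode("gbk"))))     # same text, other encoding
```

The second check shows the catch: the same visible name sent in a
different encoding is a different byte string, so each encoding you
expect needs its own sample and its own pattern.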
Even if one is a multi-encoding-aware native Chinese speaker,
heuristics or meaningful/predictable general patterns are not even
close to 'Western' usage. So even THAT solves only a small subset of
potential cases.
In a sense, what is impressive about Chinese search engines or the PRC
Gov's 'Great Firewall' isn't whether it is right, wrong, or sideways -
but that methods have been developed to do it AT ALL.
Trying to learn Chinese yourself once past two years of age is seriously
challenging. Written and spoken Chinese are de facto separate toolsets,
each constructed and processed mentally in a very different manner from
most 'Western' languages - despite the fact that a sentence in Putonghua
diagrams essentially identically to Spanish...
Not what you wanted to hear, but that's my take after 20+ years married
into / resident in China, and a couple of (largely wasted) years of
written and spoken courses at HKU.