[exim] UTF-8 characters and mysql lookups

Szerző: Todd Lyons
Dátum:
Címzett: exim-users
Tárgy: [exim] UTF-8 characters and mysql lookups

Let me first say that I have been googling, testing, tcpdumping, and
in-general-studying for the past 2 days trying to get my autoresponder
to display something other than question marks when it encountered
multi-byte characters. In the end, I appear to have solved my
problem, but I wanted to go through the discovery process, soliciting
information from others who may have devised a better solution.

Problem: customer sets his autoresponder to have foreign language
characters (Simplified Chinese if it matters) in the subject field and
the message field of an autoresponder for their mailbox. When the
autoresponse happens, it replaces all non-ASCII characters with a
question mark.

1. From the mysql command prompt, the same query returns the correct
characters, so I knew the query was ok.
At this point, I happened to google for just the right thing and found
a ML post from Philip Hazel which said (authoritatively) that exim
will not "convert" anything that it receives from mysql.

2. I figured I needed to set some headers so that MUAs would render it
properly. I googled and found an old ML post about autoresponders and
the MIME headers that were added. I added them to mine but it still
made absolutely zero difference.

3. I did a tcpdump of the data being transmitted from the database
server to the mail server. It was actually question marks, not the
characters. So it was mysql who was converting the multi-byte
characters to question marks before being sent to exim. Turns out
mysql defaults to latin1 character set if you don't specify otherwise,
so I need to specify the character set. More googling and I found
this illuminating message:
http://www.gossamer-threads.com/lists/exim/users/86382
This showed me that I only needed to tell the mysql connection that it
was to use UTF-8 since that's the format we store data in our
database.

4. Our system already has default character set utf-8 in mysql on all
machines, so settings in my.cnf were already present and didn't solve
any issues for me.

5. I changed the query to be like the above ML post specifying two
sequential lookups. It worked, but then revealed another problem:
Expansion of "=${extract{subject}{${lookup mysql{SET NAMES
utf8}{}}${lookup mysql {SELECT
ea.active,ea.frequency,ea.subject,ea.message FROM email_autoresponder
AS ea JOIN email AS e ON e.autoresponder_id=ea.id JOIN domain AS d ON
d.id=e.domain_id WHERE e.name='${quote_mysql:${local_part}}' AND
d.name='${quote_mysql:${domain}}' AND e.active=1 AND e.email_type_id=1
} }}{$value}{Auto-Response: ${quote:${local_part}}@${quote:${domain}}
will respond to your email soon}}" in autoresponder_always_t transport
contains non-printing character 232

6. I found another post by Philip Hazel with a one line fix for the above:
http://www.mail-archive.com/exim-users@exim.org/msg13613.html
But using the rfc2047 operator didn't work. It did something to my
value that made everything default to the default text. I don't quite
understand what failed, and frankly didn't throw much brainpower at
it, as by this time I had developed an alternative approach.

7. Perl. It's weird how perl is just the solution to everything for
me. A single perl function with one variable passed to (the field)
that makes a DBI call to get the autoresponse data for the user for
both the subject and text.

8. The subject field complained about the non-printing character, so
wrapped it in the rfc2047 operator.

Now everything works and renders properly. Yay!

The perl function I ended up with is:

use DBI;
sub get_autoresponder_data {
  my $field      = shift();
  my $default    = shift();
  my $local_part = lc(Exim::expand_string('$local_part'));
  my $domain     = lc(Exim::expand_string('$domain'));
  my $hostname   = 'localhost';
  my $dbh = DBI->connect("dbi:mysql:ivenue:$hostname",'censored','censored');
  return($default) unless ($dbh->ping);
  $dbh->do('SET NAMES utf8');
  my $query = "SELECT ea.$field FROM email_autoresponder AS ea " .
              "JOIN email AS e ON e.autoresponder_id=ea.id " .
              "JOIN domain AS d ON d.id=e.domain_id " .
              "WHERE e.name=" . $dbh->quote($local_part) . " AND " .
              "d.name=" . $dbh->quote($domain) . " AND " .
              "e.active=1 AND " .
              "e.email_type_id=1";
  my $result = $dbh->selectall_arrayref( $query );
  $dbh->disconnect();
  if ($result->[0]) {
    return( $result->[0]->[0] );
  }
  return($default);
}

Then a few macros:
AUTORESPONDER_TEXT =
${perl{get_autoresponder_data}{message}{AUTORESPONDER_DEFAULT_TEXT}}
AUTORESPONDER_SUBJECT =
${rfc2047:${perl{get_autoresponder_data}{subject}{AUTORESPONDER_DEFAULT_SUBJECT}}}

The transport looks like this:
autoresponder_always_t:
driver = autoreply
log = USER_MAILDIR/vacation.log
from = ${quote_local_part:$local_part}\@${quote:$domain}
to = AUTORESPONDER_TO
text = AUTORESPONDER_TEXT
subject = AUTORESPONDER_SUBJECT
headers = "Precedence: junk\nContent-Type: text/plain;
charset=UTF-8\nContent-Transfer-Encoding: 8bit"
user = MAIL_USER
group = MAIL_GROUP

If there any any comments or suggestions about my process above (or
even "this should have worked" type comments), I would appreciate
hearing about it. I just wanted to put something in the ML archives
that detailed both the discovery process and the solution of how I was
able to get multi-byte characters (in subject and message body) to
work in my autoresponder system.
--
Regards... Todd
I seek the truth...it is only persistence in self-delusion and
ignorance that does harm. -- Marcus Aurealius

Ez az üzenet a következő szál része:
	a teljes szál-fa dátum szerinti rendezve

	Phil Pennock postázva