Dealing with encoding issues

or how to migrate OLAT from Windows to Linux

Porting an OLAT installation from a Windows server to a Linux server can produce many encoding issues. Windows normally uses the Windows-1252 (also known as CP1252) encoding, whereas Linux nowadays usually uses UTF-8.

It is not possible to simply move the files from the Windows server to the Linux server: file names that contain characters such as äöü will show ??? instead.
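
To see why this happens, it helps to look at the raw bytes. The following is a minimal sketch, assuming a Linux shell with the iconv and od tools available; the variable names are made up for illustration. It shows that the Windows-1252 bytes for äöü are not a valid UTF-8 sequence, and what the same characters look like after conversion.

```shell
# "äöü" as stored by Windows-1252: one byte per character (octal escapes).
cp1252_name=$(printf '\344\366\374')

# The same three bytes are not a valid UTF-8 sequence, so a UTF-8 system
# cannot decode them and displays placeholder characters instead.
if printf '%s' "$cp1252_name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi

# Converting from Windows-1252 yields the two-byte UTF-8 form of each umlaut.
utf8_hex=$(printf '%s' "$cp1252_name" | iconv -f WINDOWS-1252 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"   # prints c3a4c3b6c3bc
```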

Read the entire article to learn what the common pitfalls are and how you can solve them.
Some things to consider:
  • mysqldump converts OLAT tables that are accessed with useOldUTF8Behavior=true set in the build.properties file to real UTF-8 tables. The dump is UTF-8, and UTF-8 is declared as the encoding in the table definitions. Before importing the dump into the new database, you must either convert the file back to ISO-8859-1 using the iconv tool or remove useOldUTF8Behavior=true from build.properties.
    Note that real UTF-8 tables are up to 10 times slower than ISO-8859-1 encoded tables that use useOldUTF8Behavior=true. With the latter, however, some in-database ordering will not work (this is handled in OLAT on the application layer and is therefore not a problem).
  • Files copied with SCP, MidnightCommander or ZIP from the Windows server (or from another non-UTF-8 Linux system) are not converted magically. After copying the files you can use convmv -f windows-1252 -t utf-8 -r --notest olatdata to convert all file names from the Windows encoding to UTF-8. Really handy, but this is only half the deal.
  • Your Java VM must know that files are stored using UTF-8. Unfortunately the system cannot figure out on its own that you want everything in UTF-8; you have to tell it. Add -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 to your Java arguments. These properties cannot be modified once the JVM is running; by default they are read from the locale information of the bootstrapping shell. On Unix, make sure that LC_CTYPE, LC_ALL and LANG all have a value like en_US.UTF-8 (see below).
    When you start OLAT, log in as administrator, select Administration -> System information -> sysinfo and scroll down to the Java environment settings. Both properties must show the correct value. But even then not everything works yet.
  • Your shell also has language and encoding settings. Sometimes they work out of the box, sometimes not. Use printenv and check what your LC_CTYPE and LANG variables are set to. If they are not set explicitly, set them to en_US.UTF-8 in the script that you use to start Tomcat. And no, setting LANG alone is not enough, nor is any other single variable listed by the locale command. Don't ask me why, it's just plain stupidity of the system.
  • OLAT has a setting in build.properties that needs to be set to the correct encoding as well. Use defaultcharset=UTF-8 and you are finally done.
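
The iconv step from the first item can be sketched as follows. This is only an illustration under assumptions: the file names are hypothetical, the single INSERT line stands in for a real mysqldump file, and only the byte conversion is shown, not the import itself.

```shell
# Hypothetical file names; a real dump would come from mysqldump.
workdir=$(mktemp -d)
dump_utf8="$workdir/olat-dump-utf8.sql"
dump_latin1="$workdir/olat-dump-latin1.sql"

# Stand-in for a dump line containing "äöü" encoded as UTF-8 (\047 is a single quote).
printf 'INSERT INTO demo VALUES (\047\303\244\303\266\303\274\047);\n' > "$dump_utf8"

# Convert the dump back to ISO-8859-1 before importing it.
iconv -f UTF-8 -t ISO-8859-1 "$dump_utf8" > "$dump_latin1"

# Each umlaut shrinks from two bytes to one, so the sizes differ.
size_utf8=$(wc -c < "$dump_utf8")
size_latin1=$(wc -c < "$dump_latin1")
echo "UTF-8: $size_utf8 bytes, ISO-8859-1: $size_latin1 bytes"
```

Comparing the file sizes is a cheap sanity check that the conversion really touched the non-ASCII characters. If iconv aborts with an error, the dump contains characters that have no ISO-8859-1 equivalent.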
How to test everything:
  • On the original system, before migrating the data, create a forum post containing äöü and, in a briefcase, create a file named äöü.
  • On the new system, after the migration steps described above, the forum entry and the file must display as before.
  • To be sure, create a second forum post and file that also contain umlauts. Log out, restart OLAT to make sure all application caches are cleared, and see if everything displays as expected.
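
As a complement to the manual test, one could scan the data directory for file names that were missed by the conversion. The sketch below assumes a Linux shell with iconv and find; the function names are invented for this example, and it is demonstrated on a throwaway directory rather than on a real olatdata directory.

```shell
# Returns success if the given string is a valid UTF-8 byte sequence.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Prints every path below the given directory that is not valid UTF-8.
# Limitation of this sketch: paths containing newlines are not handled.
list_unconverted_names() {
    find "$1" | while IFS= read -r path; do
        is_valid_utf8 "$path" || printf '%s\n' "$path"
    done
}

# Demo on a temporary directory with one UTF-8 name and one Windows-1252 name.
demo_dir=$(mktemp -d)
touch "$demo_dir/$(printf '\303\244\303\266\303\274').txt"   # "äöü.txt" in UTF-8
touch "$demo_dir/$(printf '\344\366\374').txt"               # "äöü.txt" in Windows-1252
unconverted=$(list_unconverted_names "$demo_dir" | wc -l)
echo "unconverted names: $unconverted"
rm -rf "$demo_dir"
```

On a real installation the call would be something like list_unconverted_names olatdata; any path it prints is a candidate for another convmv run.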
Additional pitfall:
A common problem is that everything works fine when you start Tomcat manually, e.g. by issuing /etc/init.d/tomcat start, but after a system restart, or when a daily cron job restarts your Tomcat, the settings are gone. This really sucks. The problem is that when you start Tomcat manually, the variables of your login shell are passed on to the process you start, so you will most likely have a different environment than when your apps get started by cron or at system boot time.

Make sure you define LANG and LC_CTYPE in the right place. If you use the Tomcat start/stop scripts, tomcat/bin/catalina.sh would be a good place.
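
A minimal sketch of what such lines could look like, assuming the en_US.UTF-8 locale is installed on the server (check with locale -a) and that the start script honours the JAVA_OPTS variable, as the standard Tomcat scripts do:

```shell
# Lines to add near the top of the script that actually starts Tomcat,
# so that cron and boot-time starts see the same environment as a manual start.
# Assumption: the en_US.UTF-8 locale exists on this system.
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
export LANG LC_CTYPE

# Pass the encoding to the JVM explicitly as well; existing options are kept.
JAVA_OPTS="${JAVA_OPTS:-} -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
export JAVA_OPTS

# Print what child processes will see; handy when debugging cron or boot starts.
printenv LANG LC_CTYPE JAVA_OPTS
```

Newer Tomcat versions also source bin/setenv.sh from catalina.sh if that file exists, which may be a less intrusive place for these lines; check whether your version supports it.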
So that was easy, right? Took me only two days to figure it all out... :-(

If you have anything to contribute to this topic, please let me know.