Bug 102537 - UTF-8
Summary: UTF-8
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Forums
Hardware: All Linux
Importance: High normal
Assignee: Forum Moderators
URL: https://forums.gentoo.org/viewtopic-t...
Whiteboard:
Keywords:
Duplicates: 190456
Depends on:
Blocks: 133468 41249
Reported: 2005-08-14 12:52 UTC by Tom Knight (RETIRED)
Modified: 2013-07-06 20:20 UTC
CC List: 6 users

Description Tom Knight (RETIRED) gentoo-dev 2005-08-14 12:52:03 UTC
Convert the forums to use UTF-8.

There are scripts in scripts/projectUTF8 to do the conversion. Also check to see
whether db1 or dove need anything done to them for this to work.
Comment 1 Tom Knight (RETIRED) gentoo-dev 2005-09-02 06:29:28 UTC
Remember to change the Charset declarations in the email templates
(languages/lang_*/email/*.tpl) to UTF-8 when the conversion is done.
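
A minimal sketch of what that template change could look like, assuming each template carries a header line of the form "Charset: iso-8859-1" and that the template's current encoding is known. The function name and regular expression are illustrative only and are not taken from scripts/projectUTF8.

```python
import re

# Assumption: the declaration is a header line such as "Charset: iso-8859-1".
CHARSET_LINE = re.compile(r"^(Charset:[ \t]*)\S+", re.IGNORECASE | re.MULTILINE)


def convert_template(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode one email template as UTF-8 and update its Charset line."""
    # Strict decoding: raises UnicodeDecodeError if source_encoding is wrong
    # for these bytes, rather than silently producing garbage.
    text = raw.decode(source_encoding)
    # Rewrite only the first declaration; the rest of the template is untouched.
    text = CHARSET_LINE.sub(r"\g<1>UTF-8", text, count=1)
    return text.encode("utf-8")
```

Doing the re-encoding and the declaration change in one step keeps the two from drifting apart, which is the risk this comment is warning about.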
Comment 2 Tom Knight (RETIRED) gentoo-dev 2006-05-16 03:22:31 UTC
I've run the UTF-8 converter script locally, which took just under 65 hours. Most of it went smoothly, with ~70 posts/topic titles that couldn't be converted. Some posts have been converted incorrectly, probably because the encoding actually used for the post wasn't the encoding expected for that forum.
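
To illustrate the two failure modes described here, the following is a rough sketch and not the actual converter script. It assumes each forum has one expected legacy encoding and that posts are available as raw bytes; the function name and data shapes are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def convert_posts(
    posts: Iterable[Tuple[int, bytes]], forum_encoding: str
) -> Tuple[Dict[int, bytes], List[int]]:
    """Convert posts to UTF-8, collecting the ids that cannot be decoded."""
    converted: Dict[int, bytes] = {}
    failed: List[int] = []
    for post_id, raw in posts:
        try:
            # Strict decode with the forum's expected encoding.
            converted[post_id] = raw.decode(forum_encoding).encode("utf-8")
        except UnicodeDecodeError:
            # Bytes that are invalid in that encoding: needs manual attention.
            failed.append(post_id)
    return converted, failed
```

Posts in the failed list correspond to the ones that could not be converted. A post written in a different encoding than the forum's expected one may still decode without error and simply come out as the wrong characters, which is why those cases can only be found by inspection afterwards.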

Some of the tables haven't been converted, as it's impossible to know what encoding was used to enter the data. These are the users table and the private messages tables. This means that certain characters in people's sigs and usernames won't be correct. I'm assuming this'll prevent people from logging in if they have special characters in their username.

For the PMs there may be a problem, as new PMs will be encoded in UTF-8 but the table's default charset is still latin1. We need to do some more testing wrt this.
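
As a rough illustration of the kind of problem meant here (it does not model the database itself), this sketch shows what UTF-8 bytes look like to anything that assumes the data is latin1. The helper names are hypothetical.

```python
def as_seen_through_latin1(text: str) -> str:
    """Show how UTF-8 bytes read when a consumer assumes they are latin1."""
    return text.encode("utf-8").decode("latin-1")


def repair(mojibake: str) -> str:
    """Undo that misreading, provided no bytes were lost along the way."""
    return mojibake.encode("latin-1").decode("utf-8")
```

A single non-ASCII character turns into two or more latin1 characters, so lengths, comparisons and display can all be off even though the underlying bytes are intact.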

I'll also have to write a script that'll convert all of the language packs and email templates (making sure to change the Charset declarations as mentioned in comment 1).

So things that we need to do are:
* Figure out what to do with the 70 problematic posts and any which have been incorrectly converted.
* Do some testing with the PMs - people may need to use the PM email function if they want their current PMs.
* Write the language pack converter script.
* Work out how to handle usernames with special characters.
* Lots of testing.
* What to do about downtime - we could minimise it by using a r/o dump of the database while the conversion happens on a separate DB.
* Check if we need to make any changes to dove or db1 for the conversion to work.
* Anything else that I've forgotten.
Comment 3 Renat Golubchyk 2007-06-10 01:54:25 UTC
What's the current state of affairs? Is anybody working on this?

I ask because I almost always have to switch to a Cyrillic encoding manually, even though the preferred language in my browser is set to Russian. The only pages that look correct are the title page and the search result pages. Everything else shows up in ISO-8859-1, since that is the charset the browser receives in the HTTP header from the server.
Comment 4 Tom Knight (RETIRED) gentoo-dev 2007-06-28 17:59:22 UTC
(In reply to comment #3)
> What's the current state of affairs? Is anybody working on this?
> 
This bug depends on bug 133468, which is slowly progressing, but I've been pretty busy with Real Life(TM).

Comment 5 Jakub Moc (RETIRED) gentoo-dev 2007-08-27 20:30:12 UTC
*** Bug 190456 has been marked as a duplicate of this bug. ***
Comment 6 Tom Knight (RETIRED) gentoo-dev 2007-09-06 23:04:59 UTC
Quick update on this one: we're planning on doing the migration this weekend.

(In reply to comment #2)
> I've run the UTF-8 converter script locally, which took just under 65 hours, most
> of it went smoothly with ~70 posts/topic titles which couldn't be converted.
> There are some posts that have been encoded incorrectly, probably because the
> encoding used for the post wasn't the correct encoding for that forum.

Hopefully the conversion won't take as long, seeing as we have a couple of beefy boxes to do it on this time round. As for incorrectly encoded posts, etc., we will deal with those on a case-by-case basis after the conversion is done.

> Some of the tables haven't been encoded as it's impossible to know what
> encoding was used to enter the data. These are the users table and the private
> messages tables. This means that certain characters in people's sigs and
> usernames won't be correct. I'm assuming this'll prevent people logging in if
> they have special characters in their username.

We will now be converting the user and PM tables to UTF-8 based on the language of the user. There may still be a few issues regarding logging in, as we can't convert any passwords (they are encrypted in the database). We have identified 6 problematic users that will need their username changed and will be contacting them via email.
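
A minimal sketch of the per-user approach, assuming the user's forum language is available as a plain string. The language-to-encoding table below is hypothetical and is not the mapping actually used for the migration.

```python
# Hypothetical mapping from a user's forum language to a legacy encoding.
LANGUAGE_ENCODINGS = {
    "english": "iso8859_1",
    "greek": "iso8859_7",
    "polish": "iso8859_2",
    "russian": "cp1251",
}
DEFAULT_ENCODING = "iso8859_1"  # Assumed fallback; accepts every byte value.


def convert_user_field(raw: bytes, user_language: str) -> str:
    """Decode a username, signature or PM using the user's language as a hint."""
    encoding = LANGUAGE_ENCODINGS.get(user_language.lower(), DEFAULT_ENCODING)
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # The hint was wrong for these bytes; fall back rather than fail.
        return raw.decode(DEFAULT_ENCODING)
```

The language is only a guess at the encoding, so a fallback result may still be wrong and should be treated as a candidate for manual review.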

> For the PMs there may be a problem as new PMs will be encoded in UTF-8 but the
> table's default charset is still latin1 which may cause problems. We need to do
> some more testing wrt this.

This is no longer a problem as we will convert the PMs to UTF-8.

> I'll also have to write a script that'll convert all of the language packs and
> email templates (making sure to change the Charset declarations as mentioned in
> comment 1).

This script has been written and verified.

> So things that we need to do are:
> * Figure out what to do with the 70 problematic posts and any which have been
> incorrectly converted.

We'll deal with these as they are reported to us.

> * Do some testing with the PMs - people may need to use the PM email function
> if they want their current PMs.
> * write the language pack converter script.
> * Work out how to handle usernames with special characters.
> * Lots of testing.

Done.

> * What to do about downtime - we could minimise it by using a r/o dump of the
> database while the conversion happens on a separate DB.

A read-only version will be up for the majority of the conversion; the conversion itself will happen on a separate DB box.

> * Check if we need to make any changes to dove or db1 for the conversion to
> work.

Doesn't seem so from our testing (apart from the obvious Apache/PHP config changes needed).

> * Anything else that I've forgotten.

As always there will be something, but again we will deal with these issues as they arise.
Comment 7 Tom Knight (RETIRED) gentoo-dev 2007-09-12 19:12:02 UTC
The forums have now been converted to UTF-8 with few errors. Searching for words containing UTF-8 characters may not work correctly until bug 133468 is fixed, which is the next thing on my TODO list.