One national character consistently loses diacritic


mmartinec

I have been observing an automagical translation of a posted character č into c (loses a check) in geocache entries consistently for a while now. Most Slovenian and some Croatian geocache entries demonstrate it, both in the main description, as well as in log entries when posted in the national language. It is interesting that the remaining national characters š (s with a check) and ž (z with a check) survive the posting and show up correctly on the page. With UTF-8 encoding there shouldn't be an issue with these characters these days.

 

I'm using Firefox (v2 or 3) and Konqueror, although, as a tcpdump capture confirms, this is not a web browser issue but an alteration of the content as stored in the database. Needless to say, this is the only web site I know of where this problem occurs.

 

To be more specific, here are the essential details captured on a tcpdump of a session. First a text of a geocache entry (or a log entry) is edited, and the http POST request shows that a sample text:

 

testno sporočilo: kožušček

 

is POSTed correctly by a browser in a URI-encoded form:

 

testno+sporo%C4%8Dilo%3A+ko%C5%BEu%C5%A1%C4%8Dek
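
For anyone who wants to check the encoded form without a browser, here is a minimal Python sketch. It assumes the form body is UTF-8 encoded, as the capture suggests; the variable names are only illustrative.

```python
from urllib.parse import quote_plus, unquote_plus

sample = "testno sporočilo: kožušček"

# Percent-encode the UTF-8 bytes the way an HTML form POST does:
# spaces become '+', every non-ASCII byte becomes %XX.
encoded = quote_plus(sample, encoding="utf-8")
print(encoded)

# Decoding the captured body gives back the original text, č included,
# so nothing is lost on the way from the browser to the server.
decoded = unquote_plus(encoded, encoding="utf-8")
print(decoded == sample)
```

The č travels as the byte pair %C4%8D, so whatever drops the diacritic happens after the request has arrived.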

 

The single log posting (shown right after submitting the log entry) still looks fine. But when that geocache entry is viewed in its entirety (with its description and other posted logs), the web server sends the text section as:

 

<br />testno sporocilo: ko\305\276u\305\241cek<br />

 

Same applies if a log entry is re-opened immediately for editing.

 

Note that both č characters have been turned into a plain c, while the other two characters with diacritics are each sent correctly as a pair of UTF-8 encoded bytes.
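
The octal escapes in the captured response can be decoded to confirm this. A small Python sketch, assuming the bytes shown by tcpdump are exactly what the server sent (the variable names are made up for illustration):

```python
# Bytes of the response text as shown in the capture (octal escapes kept as-is).
raw = b"testno sporocilo: ko\305\276u\305\241cek"

# The byte pairs \305\276 and \305\241 are valid UTF-8 for ž and š.
received = raw.decode("utf-8")
print(received)

# Compare character by character with what was originally posted.
posted = "testno sporočilo: kožušček"
lost = [(sent, got) for sent, got in zip(posted, received) if sent != got]
print(lost)
```

Only the two č characters differ; they arrive as a plain ASCII c rather than as broken bytes, which points to a deliberate substitution and not to a corrupted encoding.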

 

For reference:

č ch \u010d %C4%8D

š sh \u0161 %C5%A1

ž zh \u017e %C5%BE

I believe the same applies for the uppercase Č.

 

I would appreciate it if this were tracked down and resolved. It is repeatable at will, and all Slovenian texts posted so far suffer from it.

(I hope this form posting will retain the correct characters)

Link to comment

More details:

 

- these forum postings have no difficulty retaining čšžČŠŽ; the problem is limited to postings on the geocaching.com web site;

 

- like a lowercase č (which turns into c), the capital Č also turns into a C;

 

- a mail sent to the owner of a cache about a new log posting is fine, no character conversion takes place there. So I guess the problem occurs later when a text is entered to a database or retrieved from it.

Link to comment

That's not good. Has this changed with one of the recent updates?

 

No, it has not changed since I posted my first log entry in the beginning of May. Now that I have a couple of new geocaches in my pipeline that I'd like to publish, this now means more to me than before.

 

I believe it is easily reproducible, just paste my sample text from my top posting into a 'log your visit', and see what comes out of it.

Link to comment

I just tried it, and found that the character is shown correctly immediately after entering the log and also just after editing a log. However, it does seem to be converted at some point.

Weird. I can't see why it would happen for this character, but not the "s" or "z" similar characters.

Link to comment

In a desperate attempt to get around this bug I tried switching a geocache description to HTML and manually entered the problematic characters as HTML numeric character entities (&#269; for č). The end result is no better: it seems the HTML entities are converted to Unicode characters early in processing, so their fate is the same as that of a directly entered "č", ultimately ending up as a plain "c".
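
A short Python sketch of why the entity route would make no difference, under the assumption (not confirmed by anyone at Groundspeak) that the site decodes entities before storing the text:

```python
import html

entity_form = "ko&#382;u&#353;&#269;ek"   # what was typed in HTML mode
direct_form = "kožušček"                   # what was typed directly

# If entities are decoded early, both inputs become the very same string,
# so any later lossy step treats them identically.
decoded_form = html.unescape(entity_form)
print(decoded_form == direct_form)
```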

 

Please do not consider it a marginal issue. It affects postings in many languages: Slovenian, Croatian, Serbian (Latin script), Macedonian, Czech, Slovak, Belarusian, Latvian, Lithuanian, Romani, and others, including Russian in its Latin transcription.

Link to comment

I wonder, is anybody working on a fix or investigating the matter?

 

The frequency of a letter 'č' in Slovenian text is about 1.5 %,

which is the same as a frequency of a letter 'b' in English texts.

Imagine writing English text without a letter 'b' !!!

Please get this fixed, allowing us to post in a national language

without a handicap.

Link to comment

I'm strongly supporting mmartinec on this subject!

One should xe axle :huh: to use correct national language diacritic in caches descriptions! Today this proper character usage should not xe an issue as the issue has xeen addressed on general and there are tools to allow for proper characters usage and implementation.

 

In Slovenia we are using more and more douxle or even triple language cache descriptions. Thus we are slowly improving one language, English descriptions of first caches placed in Slovenia. All this to enable local and foreigner geocachers to enjoy cache hunting in our country.

 

It is a shame when one has to read her/his own language by guessing what the original word was - xy decoding meaning of the word from the meaning of the sentence.

 

Regards from

Soncek? or Sonček? :P

 

Guess which one is correct!

It is the latter one - with a meaning "small sun". First one actually does not mean nothing in Slovenian language :xlink: Ohhhh sorry it should xe/be B):lol:

Edited by Soncek
Link to comment

I wonder, is there a chance for the issue to get fixed by the coming (I hope) new release? I'd like to publish a cache, but I'm holding it back, waiting for a chance to let its title and description come out correctly on the web page.

 

The problem with č getting destroyed but š (and as you say ž) working fine is probably due to Windows character set 1252 which contains š and ž but not č. Somewhere a conversion involving Windows 1252 is causing trouble!

http://en.wikipedia.org/wiki/Windows-1252
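
The Windows-1252 idea is easy to test. The Python sketch below first checks which of the three letters exist in code page 1252, then imitates a "best fit" conversion. The function lossy_cp1252 is a made-up stand-in for whatever the site really does, so treat it as a guess at the mechanism and nothing more.

```python
import unicodedata


def in_cp1252(ch: str) -> bool:
    """Return True if the character can be encoded in Windows-1252."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def lossy_cp1252(text: str) -> str:
    """Hypothetical best-fit step: keep what cp1252 can hold,
    otherwise strip the diacritic and keep the base letter."""
    out = []
    for ch in text:
        if in_cp1252(ch):
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


for letter in "čšž":
    print(letter, in_cp1252(letter))

print(lossy_cp1252("testno sporočilo: kožušček"))
```

With this assumption the output matches what the site shows: š and ž survive, č and Č lose the caron.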

Link to comment
I do not understand this behavior since the source code of any listing includes content="text/html; charset=UTF-8"

This is not the problem. All that that line says to the browser is "expect to see some multi-byte characters coming your way". If this were the issue then the symptoms would typically involve accented characters appearing as blobs, or as "Â" or "Ã" followed by another character.
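
For comparison, this is what a charset mismatch typically looks like. A minimal Python sketch, with é and Latin-1 chosen purely as an illustration:

```python
# UTF-8 bytes read back as if they were Latin-1: the classic mojibake.
good = "é"
mojibake = good.encode("utf-8").decode("latin-1")
print(mojibake)  # Ã©

# The same mistake applied to ž gives two junk characters, not a plain z.
print("ž".encode("utf-8").decode("latin-1"))
```

The geocaching.com pages show a clean "c" instead, which is a different symptom.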

 

Somewhere a conversion involving Windows 1252 is causing trouble!

My guess is that the issue is probably not Windows code page 1252 as such, so much as "all the accented characters which you can represent in UTF-8" versus "the accented characters which you can represent in an 8-bit extended-ASCII code page". And the problem seems to occur in some site pages and not others.

 

For example, edit one of your caches and change its name from "My cache" to "My cache čšž". When you click Submit, the first page you get back is the edit page again, and on top it says

"Editing Cache:

My cache čšž"

which is correct, and the "Nickname" box contains "My cache čšž", so if you change something else, the name will be maintained. But if you then click on "view listing", you get the normal cache page which everyone sees, with the "č" changed to a "c". If you then click "edit listing", the proposed new name in the box no longer has the accent on the "c".

 

My provisional conclusions are:

- The database can handle UTF-8 correctly. This is the good news.

- The code which generates the site pages does not, for some (or maybe most, but not all) pages, handle UTF-8 correctly. This is actually not easy to do. (Geek stuff alert) Once you start to use UTF-8, you have to rewrite a lot of code, because you can't necessarily use the same string handling functions as you did for ASCII characters. When a character is not the same as a byte, all kinds of things which you thought you had learned definitively in Programming 101, no longer apply. Maybe ASP.NET handles this transparently right out of the box; I know that PHP doesn't. (End geek stuff) So, knowing that the code doesn't handle this well, some hack - perhaps a library routine from Microsoft or a third-party framework - is in place which replaces "UTF-8 accented characters" by "something which looks similar".
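
To illustrate the "a character is not the same as a byte" point from the geek stuff above, here is a small Python sketch. The helper safe_decode is invented for the example and has nothing to do with the site's code.

```python
from typing import Optional

word = "kožušček"
data = word.encode("utf-8")

# 8 characters, but 11 bytes: ž, š and č take two bytes each.
print(len(word), len(data))  # 8 11


def safe_decode(chunk: bytes) -> Optional[str]:
    """Decode UTF-8 bytes, returning None if the chunk is not valid."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Cutting after three characters is fine ...
print(word[:3])
# ... but cutting after three bytes splits ž in the middle.
truncated = safe_decode(data[:3])
print(truncated)
```

Any routine that counts, cuts or pads by bytes has to be revisited once the text is UTF-8, which is probably why a shortcut such as "replace by something similar" is tempting.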

 

I would guess that this is unlikely to be fixed in the current code base, but I strongly suspect that support for many other languages, including for example Japanese, will be high on the list for a future version of the site.

Edited by sTeamTraen
Link to comment

Should I open a trouble ticket, or is this forum posting sufficient for the site maintainers to notice it?

 

I'd suggest to do so (no harm done, anyway).

 

Greets from northern northern Slovenija, :blink:

ime

Edited by ime
Link to comment

I sent the following text to the gc contact e-mail address.

I hope Santa Claus reads this kind of letters too :)

 

Ok, so here I'm sending this problem report on behalf

of 14273 geocaches in affected countries (most of them in

the Czech republic, where geocaching seems to have exploded),

so that a forum topic will not be forgotten.

 

All the necessary details are in the thread:

http://forums.Groundspeak.com/GC/index.php?showtopic=203215

titled:

One national character consistently loses diacritic,

č turns to c on the web, while š and ž are alright

 

In short: a letter č in cache descriptions and in log entries

turns to c after being saved and retrieved from a database.

Other national characters (like š and ž) are unaffected,

forum postings are unaffected, the immediate e-mail feedback

of a posting is also not affected, and neither is the immediate

web contents after a posting. But revisiting the entry

shows modified text.

 

How to repeat: open a 'log your visit' or 'edit listing',

enter some text containing for example a word kožušček

(meaning: small fur), post it and watch the result.

Link to comment

Oh no, so much time has passed and still no change on this big issue for many non-English people; that is not good... I am afraid the problem must lie in the database encoding setting if the page code is OK. Is there any chance of fixing it?

Link to comment

Sorry if my question is in the wrong place, but I have asked a few times (including by mail to Groundspeak) and still have no clear answer. Is there any chance of adding the Polish characters ąćęłńóśźż? At present only "ó" can be used. A converted letter sometimes makes a big difference in meaning. It looks as if every country using the Latin alphabet has its own specific characters, yet we still can't write names correctly. :D

So, is there any chance of it?

Link to comment

Good place, and the same problem as mine. I am afraid all people writing in languages other than English have problems with their country-specific characters. Let's hope it will be fixed soon; the webmasters know about it and we can't do more.

 

Yes, I hope so as well, but it would be nice to see an answer such as "yes, it will be added soon / this month / next year" or "sorry, we can't do that because...". I would be very grateful for any response. And not only me.

Link to comment

Did you not read the post right above/before your previous post? #18 by Nate?

 

Yes, but I'm asking about other characters, not only "č". If that relates to other characters I will wait with patience.

I don't think he was referring to one specific character, but all characters with modifiers that don't fit the textbox criteria he referred to.

Link to comment

Yes, that is what I meant - all the characters requested are not available due to restrictions in the tools we use. I'm very sorry and I know it is frustrating for you. We couldn't be more aware of the issue so you don't need to resurrect old threads. We have a plan to fix the problem and will make every effort to get it done. Thanks for being patient with us.

Link to comment

The problem is with UBB Textbox and we have spent hours trying to fix the issue. The solution is to eventually replace all textboxes with a WYSIWYG HTML editor. In the meantime, I'm very sorry this is such an annoyance.

 

I do not get how a WYSIWYG HTML editor will sort out this problem :drama:, are you aware of what you are writing?

 

I do not think you are moving in the right direction. Don't you think the usual HTML entity support would fix the present problem?

Link to comment

The issue is not with displaying the characters, but with inputting them. The site is more or less completely UTF-8 capable (example). HTML entity support is not the issue.

 

Like most large sites, Geocaching.com does not contain only code written by programmers who work for the parent company. In this case, they are using a particular system (UBB Textbox) to structure user input, and this system has this bug. The idea, since apparently the bug is hard to fix, is to replace UBB Textbox with a different system. That's a non-trivial exercise which will impact the look and feel of the entire site. Even if it doesn't introduce any new bugs (which is, of course, fairly likely), it will change the user experience of around one million users. That's absolutely not a reason not to fix this bug which is causing legitimate concern for many thousands of users in Central and Eastern Europe, but it is a reason for Groundspeak to proceed with caution.

Link to comment

I tried HTML entities in a log and it seems they are working. I still have to check how they behave in a cache listing.

Link to comment
Well, let's assume the site internally uses component "UBB Textbox" which has this character encoding issue from the very beginning. The question is why Groundspeak bought it/started using it... :lol:

I don't know if you do "computer stuff" for a living, but I do. I constantly have to choose between ways of getting things done. I can buy my PCs from a big-name vendor, and when they turn out to have a systematic issue, it won't get fixed, and I get criticised for buying from a big faceless American company; or I can buy them from a local box assembler, and when he has financial problems, I get criticised for not buying from someone "financially stable". I can choose to run Microsoft software and be told that I should have gone for something with fewer bugs, or I can choose a freeware solution and discover that the author just emigrated and the source code is so badly-written that I might as well just have the binaries.

 

Presumably Groundspeak chose "UBB Textbox" for a variety of reasons, and I would guess that checking that every accented character from every European language worked, was not top of the list. At the time, there were probably 100 caches in all of the affected countries put together, and half of those were in English.

 

There are no perfect solutions, and there are no software choices without downsides.

 

(And BTW ain't this the supplier's business to get it fixed?)

No. You're thinking of consumer products, like cars. Is there even a "supplier" in this case? All Groundspeak can do is ask nicely. And then they'll probably find that the next release will break three other things, and require a new interface layer to make it work with the site. Welcome to software.

Link to comment

Sorry for reactivating this thread, but I just realized one important reason why this problem should be fixed soon. People are asking for international support, for example automatic language translation, and then they are told "use Google translation". But there is a problem: when I write a description in Polish, the characters are changed to their English look-alikes, and sometimes the meaning is then completely different. For example, the words "łęk" (saddle) and "lęk" (fear) will both be displayed as "lek" (medicament).

 

So OK, I'm being patient with this problem. But have you started solving it, or will you? Are there any plans, with some time frame, or anything to let us know whether it will ever (or never) be solved? Thanks, and sorry for disturbing you, but this is a really important question.

Link to comment

Here is one possible solution; I hope it helps (maybe a similar solution is mentioned somewhere else too).

 

It was written that the problem is with the input, so you must change the input text before posting the form. The special characters must be converted into HTML entities (I am not sure whether it is necessary to check "enable HTML in descriptions", but I have it checked). For the conversion you can use an online service (e.g. http://textmod.pavucina.com/prevod-html-entity). Then paste the converted result in place of the normal readable text, and that is it.
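
If you would rather not depend on an online service, the same conversion can be done locally. A minimal Python sketch; the function name to_entities is made up for this example, and whether the site accepts the result is something you still have to try for yourself.

```python
import html


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric HTML entity.
    ASCII characters, including < > and &, are left untouched."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


converted = to_entities("testno sporočilo: kožušček")
print(converted)

# Sanity check: a browser (or html.unescape) turns it back into the original.
print(html.unescape(converted))
```

Paste the converted line into the description or log in place of the readable text.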

Link to comment
This topic is now closed to further replies.