brolin_empey | Maxdamantus: OK, thank you for the informative answer. I will try to get around to opening your two links. I only know how to write in languages that use an alphabet so was curious how text written in a language such as Chinese or Japanese that does not use an alphabet is sorted, assuming that it can be sorted. My friend from Beijing showed me how he uses an IME to write in Chinese on Android but I do not know enough about a language that does not use an | 00:04 |
---|---|---|
brolin_empey | alphabet to write in the language. | 00:04 |
brolin_empey | He said he thinks the stroke order or number of strokes, do not remember which one he said, is used for sorting text but at least one other person I asked said they do not think text written in Chinese can be sorted. I kept meaning to try using software that supports Chinese text to sort Chinese text to see what it does but I ran out of time then forgot about it or had other, higher priority things to do. | 00:08 |
Maxdamantus | It will likely be ad-hoc to the writing system. I'm not sure how Chinese logograms work exactly, but in general I would expect a writing system to be made of a relatively small number of primitive concepts. | 00:25 |
Maxdamantus | eg, if you look at Hangul, you might have thousands of "characters", but each one is really just a combination of up to three primitive symbols denoting any start/middle/end sounds for a syllable. | 00:26 |
Maxdamantus | (Japanese kana are similar, but with the exception of "-n", their syllables all consist of one vowel, possibly preceded by a consonant, so only two primitive concepts in each glyph) | 00:28 |
L29Ah | no? | 00:28 |
Maxdamantus | and since that combination in Japanese kana only leads to around 50 symbols (5*10), it doesn't need to be as regular as Hangul. | 00:29 |
Maxdamantus | No what? | 00:29 |
L29Ah | ah nvm, for the ordering reason it's ok | 00:30 |
L29Ah | there's ゃ, ゅ and ょ to have a little fun with | 00:30 |
L29Ah | anyway though i don't see why don't you just grab unicode code points and be done with it | 00:31 |
Maxdamantus | Because Unicode code point ordering might not follow a well-understood pattern. It just depends on who designed the layout for that script in Unicode. | 00:43 |
Maxdamantus | Even in Latin-based scripts, you don't have that. An obvious example would be 'ı' in Turkish. | 00:43 |
Maxdamantus | or simply 'ü' in German. | 00:44 |
L29Ah | i think it can even change between languages using the same character set | 00:44 |
Maxdamantus | I imagine there are languages using Latin-based scripts that have orders that are inconsistent with English. | 00:45 |
Maxdamantus | also, I know that in Arabic there are at least two well-known orderings of letters (one starts with "alef, ba, gim, dal" like in Greek, the other starts with "alef, ba, ta, tha") | 00:46 |
Maxdamantus | and people use those different Arabic orders in different contexts. | 00:46 |
* enyc | meows | 00:47 |
brolin_empey | $ cat /dev/urandom >enyc | 00:47 |
CcxWrk | You don't sort by codepoints, there's whole Unicode Collation Algorithm: https://www.unicode.org/reports/tr10/ | 15:55 |
L29Ah | > Siniform ideographs — most notably modern CJK (Han) ideographs — and Hangul syllables are not explicitly mentioned in the default table. Ideographs are mapped to collation elements that are derived from their Unicode code point value as described in Section 10.1.3, Implicit Weights. | 15:56 |
CcxWrk | Hm, even the libicu pages on this seem to be full of TODOs http://site.icu-project.org/design/collation/script-reordering | 16:02 |
CcxWrk | Heh and the official document on Collation points to … PowerPoint file? :] | 16:05 |
CcxWrk | But no, we better focus on adding more emoji combinations /s | 16:05 |
KotCzarny | sticking to those funny chars is like keeping ebcdic around | 16:07 |
KotCzarny | sure, some legacy code uses it, but whole thing should be deprecated | 16:07 |
L29Ah | indeed, latin should be deprecated in favour of han | 16:08 |
KotCzarny | i think you've meant emojis | 16:10 |
L29Ah | nah emojis are ideographs like han, they're fine | 16:11 |
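
Note on the 00:26 remark about Hangul: the "up to three primitive symbols" structure is visible directly in the code points, because precomposed Hangul syllables are laid out arithmetically starting at U+AC00. The following is a minimal sketch of that arithmetic, not a collation implementation. The function names `decompose_hangul` and `to_jamo` are made up for illustration; the constants are the ones used by the Unicode Standard's Hangul syllable decomposition.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition arithmetic.
S_BASE = 0xAC00   # first precomposed syllable, U+AC00
L_BASE = 0x1100   # first leading consonant jamo
V_BASE = 0x1161   # first vowel jamo
T_BASE = 0x11A7   # one before the first trailing consonant jamo
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
N_COUNT = V_COUNT * T_COUNT              # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT              # 11172 precomposed syllables in total


def decompose_hangul(syllable):
    """Return (lead, vowel, tail) indices for one precomposed Hangul syllable.

    A tail index of 0 means the syllable has no trailing consonant.
    """
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    lead = s_index // N_COUNT
    vowel = (s_index % N_COUNT) // T_COUNT
    tail = s_index % T_COUNT
    return lead, vowel, tail


def to_jamo(syllable):
    """Spell a syllable out as its two or three conjoining jamo."""
    lead, vowel, tail = decompose_hangul(syllable)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if tail:
        jamo += chr(T_BASE + tail)
    return jamo


if __name__ == "__main__":
    for ch in "한글":
        print(ch, decompose_hangul(ch), [hex(ord(j)) for j in to_jamo(ch)])
```

The result of `to_jamo` should agree with `unicodedata.normalize("NFD", ...)`, which performs the same decomposition. Sorting on the `(lead, vowel, tail)` triple is only a rough stand-in for real Korean collation, which is what the tailored tables discussed later in the log are for.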
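
Note on the 15:56 quote from UTS #10: for Han ideographs the default table falls back to "implicit weights" computed from the code point, so in the untailored default the answer to "why not just grab code points" is that, within one block, this is effectively what happens. The sketch below follows my reading of Section 10.1.3 and covers only the primary weights. The helper names are hypothetical, and the base value `0xFB40` is an assumption that applies to the main CJK Unified Ideographs block (U+4E00..U+9FFF) only; other ranges use other bases, and real collators such as ICU apply language-specific tailorings on top of this.

```python
# Assumed base for the main CJK Unified Ideographs block, per UTS #10
# Section 10.1.3 as I read it. Extension blocks and unassigned code
# points use different bases, which this sketch does not cover.
CJK_MAIN_BLOCK_BASE = 0xFB40


def implicit_primary_weights(code_point, base=CJK_MAIN_BLOCK_BASE):
    """Return the two primary weights (AAAA, BBBB) derived from a code point."""
    aaaa = base + (code_point >> 15)          # high bits folded into the base
    bbbb = (code_point & 0x7FFF) | 0x8000     # low 15 bits, top bit forced on
    return aaaa, bbbb


def han_sort_key(text):
    """Hypothetical sort key: implicit primary weights for each character.

    Only meaningful for strings made of U+4E00..U+9FFF ideographs.
    """
    key = []
    for ch in text:
        cp = ord(ch)
        if not 0x4E00 <= cp <= 0x9FFF:
            raise ValueError("outside the block this sketch handles: %r" % ch)
        key.extend(implicit_primary_weights(cp))
    return tuple(key)


if __name__ == "__main__":
    words = ["人", "中", "一"]
    print(sorted(words, key=han_sort_key))
    print(sorted(words))  # plain code point order, for comparison
```

Because the weights grow monotonically with the code point inside the block, the two `sorted` calls give the same order. Orderings by stroke count, radical, or pronunciation, as mentioned earlier in the log, have to come from tailored collation data instead.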