MariaDB - Malay Database

 
Glyndwr Bartlett
Ranch Hand
Hi,

I am setting up a MariaDB database that will be used in Malaysia and therefore will store Malay characters (Latin Malay alphabet is the official Malay script). What is the best collation to use for this database please?

Kind regards,

Glyn
 
Paul Clapham
Sheriff
My database supports all sorts of scripts, and since I just use UTF-8 I never need to concern myself when a new script comes along.
 
Glyndwr Bartlett
Ranch Hand
Hi,

I am currently using "latin1_swedish_ci" (the default when I set up the first database using phpMyAdmin). I note that there are a number of UTF8 options. Which is the best please (e.g., utf8mb4_general_ci)?

Kind regards,

Glyn
 
Tim Holloway
Saloon Keeper
One of the more recent recommendations I've seen is utf8mb4_0900_ai_ci, though that name comes from MySQL 8.0 and your MariaDB version may not offer it. Note that "ci" means "Case Insensitive", so if sorting and searching should be done in a case-sensitive way, you don't want this. Check your MariaDB docs for alternatives.

Avoid the "general" collations (e.g., utf8mb4_general_ci). They're noted for being sloppy: they use simplified comparison rules that trade accuracy for speed.

For those not familiar with MariaDB/MySQL, MySQL was originally developed in Sweden, and its default was a Swedish collation over the ISO-8859 (latin1) codeset common on Linux back then, hence latin1_swedish_ci.

The world has since moved on to UTF-8 and its relatives, but for a long time the MySQL/MariaDB defaults did not. I had a fun time converting an entire existing MariaDB database over a year or so back because of that.

Which is why it's better to set things up in advance.
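
To make the naming convention mentioned above concrete, here is a small sketch that reads the suffixes of a collation name. The function `describe_collation` is a hypothetical helper written for this illustration only; it looks at the name and nothing else, and the authoritative list of what your server offers is still whatever `SHOW COLLATION` returns.

```python
def describe_collation(name):
    """Split a MySQL/MariaDB-style collation name into its conventional parts.

    Hypothetical helper: it only interprets the name, it does not query a server.
    """
    parts = name.split("_")
    info = {"charset": parts[0], "case": None, "accent": None, "binary": False}
    for token in parts[1:]:
        if token == "ci":        # case-insensitive
            info["case"] = "insensitive"
        elif token == "cs":      # case-sensitive
            info["case"] = "sensitive"
        elif token == "ai":      # accent-insensitive
            info["accent"] = "insensitive"
        elif token == "as":      # accent-sensitive
            info["accent"] = "sensitive"
        elif token == "bin":     # compares raw code values
            info["binary"] = True
    return info


for candidate in ("latin1_swedish_ci", "utf8mb4_0900_ai_ci", "utf8mb4_bin"):
    print(candidate, describe_collation(candidate))
```

Names that carry no "ai"/"as" token simply report `None` for accent handling here; how the server treats them is something to confirm in its documentation.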
 
Glyndwr Bartlett
Ranch Hand
Hi Tim,

This is a great help. Most of my "utf8mb4" collation options are for specific languages and are "ci" (e.g., utf8mb4_turkish_ci). The only non-"ci" options are "bin", "nopad_bin" and "thai_520_w2". I will try "utf8mb4_unicode_520_ci". My worry is finding issues too late. Can you think of any issues that may arise from using this collation, please?

Kind regards,

Glyn
 
Tim Holloway
Saloon Keeper
The main concerns here are whether you want sorts and searches to be case-sensitive, and the danger of picking a character set that cannot represent the data you're storing.

There are two different settings to get right: the character set (how the data is represented in the tables) and the collation (how it is sorted and compared). As long as you pick settings that make you happy, that's all that counts.

Personally, I ran into failing searches after I switched to a case-sensitive collation when what I actually wanted was case-insensitive searching, so I had to change my table settings. It was a nuisance, but not fatal.
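
The two settings can be illustrated outside the database. The sketch below is plain Python, not MariaDB: `stored_bytes` stands in for the character set (how text becomes bytes) and the two `equal_*` functions stand in for a case-sensitive and a case-insensitive collation. All three names are made up for this example.

```python
def stored_bytes(text, charset):
    """Character set: how the text is represented as bytes."""
    return text.encode(charset)


def equal_case_sensitive(a, b):
    """Comparison rule resembling a case-sensitive collation."""
    return a == b


def equal_case_insensitive(a, b):
    """Comparison rule resembling a case-insensitive collation."""
    return a.casefold() == b.casefold()


# Same character, two representations.
latin1_bytes = stored_bytes("é", "latin-1")  # a single byte
utf8_bytes = stored_bytes("é", "utf-8")      # two bytes

# Same stored text, two comparison rules.
search_term, title = "pizza", "Pizza"
found_cs = equal_case_sensitive(search_term, title)
found_ci = equal_case_insensitive(search_term, title)
print(latin1_bytes, utf8_bytes, found_cs, found_ci)
```

The point is that the two choices are independent: changing the comparison rule does not change what is stored, and changing the representation does not by itself change how matches are decided.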
 
Paul Clapham
Sheriff
I'm not at home now so I can't tell you exactly what I'm using. But last year when characters from Canadian indigenous languages showed up in the database, there wasn't a problem. I don't expect to ever have a problem with character sets.

But then I can't imagine why you would ever want case-insensitive data storage so what do I know?

I will be home tomorrow so I will let you know what encoding I'm using in my MySQL database.
 
Paul Clapham
Sheriff
I just looked at my MySQL database schema details and it says:

Default collation: utf8mb4_0900_ai_ci
Default characterset: utf8mb4

These are just the MySQL 8.0 defaults. Clearly I didn't understand the difference between "collation" and "character set". I can see that query comparisons are indeed case-insensitive, which doesn't seem to matter to my application, although case-sensitive comparisons would work just the same as far as its functionality goes.

But anyway you should use a generic form of UTF-8 and not fash yourself beyond that.
 
Paul Clapham
Sheriff
As for "collation": Java also has the "collation" concept, although you may very well not have heard of it. A collation specifies the "alphabetic" ordering of characters. For example, Swedish has the characters "Å, Ä, Ö", which it puts at the end of its alphabet. Norwegian and Danish have similar characters, but they put them in a different order. Have a look at Alphabetical_order#Language-specific_conventions on Wikipedia for more about this. That's why there are "Swedish" collations you can use. If you have any languages which must be ordered "correctly", then you'll have to choose a specific collation to do that.
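
As a rough illustration of the Swedish example, the sketch below sorts a few words twice: once by raw code point and once with a hand-written key that follows the Swedish alphabet. `SWEDISH_ALPHABET`, `RANK` and `swedish_key` are names invented for this example, and a real collation handles far more than this (digits, punctuation, other accented letters).

```python
# Swedish alphabet: a-z followed by å, ä, ö (the three extra letters sort last).
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Sort key following the Swedish alphabet, ignoring case.

    Characters outside the alphabet sort after it, ordered by code point.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["Öl", "Zebra", "Ärta", "Åka", "Apa"]

by_code_point = sorted(words)                 # Ä lands before Å
by_swedish = sorted(words, key=swedish_key)   # Å, then Ä, then Ö

print(by_code_point)
print(by_swedish)
```

Both orderings put the three letters after "Z", but only the second puts "Å" ahead of "Ä", which is the kind of difference a language-specific collation exists to get right.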
 
Sheriff
Collation matters for sorting and matching. I believe that utf8mb4_0900_ai_ci is both accent-insensitive and case-insensitive.

So if you wanted to find matches for coté, but not cote, côté, or côte, then you would need to use something like utf8mb4_0900_as_ci. I don't know whether Malay written in Latin characters has this concern or not.
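
The difference between "ai" and "as" matching can be approximated in a few lines. This is only a sketch using Python's `unicodedata` module, not the Unicode Collation Algorithm that the database collations implement; `strip_accents`, `key_ai_ci` and `key_as_ci` are hypothetical names chosen to mirror the collation suffixes.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_ai_ci(text):
    """Accent-insensitive, case-insensitive comparison key."""
    return strip_accents(text).casefold()


def key_as_ci(text):
    """Accent-sensitive, case-insensitive comparison key."""
    return unicodedata.normalize("NFC", text).casefold()


words = ["cote", "coté", "côte", "côté", "Coté"]

ai_matches = [w for w in words if key_ai_ci(w) == key_ai_ci("coté")]
as_matches = [w for w in words if key_as_ci(w) == key_as_ci("coté")]

print(ai_matches)  # every spelling matches
print(as_matches)  # only the spellings with the same accents match
```

Under the accent-insensitive key all five spellings are treated as the same word, while the accent-sensitive key keeps only "coté" and "Coté".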
 
Tim Holloway
Saloon Keeper
Collation was exactly why I had to re-code my recipe database. I was finding duplicates for the same recipe name because it had been entered multiple times with different capitalizations.

Conversely, when I type in "pizza" as a search term, I don't want to miss a recipe because it says "Pizza" in the title.

You can code your database logic to handle this stuff manually, but it's more work and more overhead, so let the database handle it.
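
Here is a small runnable sketch of letting the database do the work. It uses SQLite through Python's built-in `sqlite3` module purely as a stand-in, because it needs no server; the `recipe` table is made up, and SQLite's `NOCASE` collation (which only folds ASCII letters) is not the same thing as a MariaDB "ci" collation, so treat this as an illustration of the idea rather than MariaDB syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The column's collation governs both the UNIQUE check and later comparisons.
conn.execute("CREATE TABLE recipe (name TEXT COLLATE NOCASE UNIQUE)")
conn.execute("INSERT INTO recipe (name) VALUES (?)", ("Pizza",))

# A second entry that differs only in capitalization is rejected.
try:
    conn.execute("INSERT INTO recipe (name) VALUES (?)", ("PIZZA",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A lowercase search term still finds the capitalized title.
found = conn.execute(
    "SELECT name FROM recipe WHERE name = ?", ("pizza",)
).fetchall()

print(duplicate_rejected, found)
conn.close()
```

No application code had to lowercase anything: the duplicate check and the search both follow from the collation declared on the column.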
 
Paul Clapham
Sheriff
From what I can see, the "Latin Malay" alphabet is simply the 26 letters used in English. Am I correct? If so, then the choice of collation and character set is pretty much irrelevant.
 
Tim Holloway
Saloon Keeper
Paul Clapham wrote: From what I can see, the "Latin Malay" alphabet is simply the 26 letters used in English. Am I correct? If so, then the choice of collation and character set is pretty much irrelevant.

Well, Malay can also be written in the Jawi script, which uses Arabic characters, so it's good to support that.

But Malaysia also has a significant Chinese population, among others. :)
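
To see why the character set matters once other scripts turn up, the sketch below checks a few sample strings against latin-1 and counts how many UTF-8 bytes their widest character needs. The sample strings and the helper names `fits` and `max_utf8_bytes_per_char` are assumptions made for this example; whether your data will ever contain such characters is for you to judge.

```python
samples = {
    "Malay (Latin)": "Selamat pagi",
    "Jawi": "\u062c\u0627\u0648\u064a",   # Arabic-script letters
    "Chinese": "\u4e2d\u6587",            # common CJK characters
    "Supplementary": "\U00020BB7",        # a CJK character outside the basic plane
}


def fits(text, charset):
    """Return True if every character can be encoded in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def max_utf8_bytes_per_char(text):
    """Largest number of UTF-8 bytes needed by any single character."""
    return max(len(ch.encode("utf-8")) for ch in text)


report = {
    label: (fits(text, "latin-1"), max_utf8_bytes_per_char(text))
    for label, text in samples.items()
}

for label, (in_latin1, width) in report.items():
    print(f"{label}: fits latin-1={in_latin1}, widest UTF-8 char={width} bytes")
```

Only the Latin-script sample fits in latin-1. The last sample needs four bytes per character, which is the case the "mb4" in utf8mb4 exists for; the older three-byte utf8 character set cannot store it.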
 