Character sets, collations, and sort orders, oh my!

Let’s start with some definitions. Sort order is easy enough – that’s the order things are supposed to go in. In English, that happens to be:

a b c d e f g h i j k l m n o p q r s t u v w x y z

In Carrajina (to choose an example at random) alphabetical order is:

a b c ch d dj e f g h i j l m n ñ o p r rr s t u v x z

So that takes care of sort order. Character sets are also easy. These are the set of characters for use. So ascii is and latin1 is and utf-8 is another set.

Collations are the rules used in a database to determine which letters are equivalent and what order to sort them in. So, in a sense, a collation is a combination of a character set and a sort order, plus rules about which characters are equivalent. In case-insensitive collations, capital A and lower-case a are equivalent. That’s useful, because otherwise all the words with that start with a capital letter would be sorted separately.

In MySQL, there are numerous collations that have the name pattern charset_collation_ci/cs. Examples are utf8_general_ci, latin1_general_ci, latin1_swedish_ci (which is the MySQL default), and so forth. The _ci stands for case-insensitive. _cs is case-sensitive, and then there is _bin as in ascii_bin and utf8_bin. _bin stands for “binary” and means that the collation is sorted in unicode code point order.

In MySQL, within _ci all accented letters are considered the equivalent of unaccented letters, so ‘à’, ‘á’, ‘å’, ‘ä’, and ‘ā’ are all considered to be the equivalent of ‘a’ when comparing and sorting. This means that as far as MySQL is concerned ä=a is a true statement. _bin collations do not do this. One can force MySQL to recognize accented characters as different (so that ‘n’ and ‘ñ’ are not considered to be the same letter) by using the keyword BINARY before the field name.

So which collation you use will determine how your data is compared and sorted.

There are ways to create new collations and to change sort order and equivalencies in existing collations. These require access to the server files, though, which makes them useless to those of us who host our sites on 3rd-party servers.

So here is my hack for creating a new sortorder for any character set. I am not saying this is the only way to do this (it’s not!) or even that it’s the best way to do this.

Let’s create a Carrajina sort order (while still using utf-general_ci collation). I do this by adding a field to the database called sort or something similar. In the case of Carrajina, I added a field called carrajina_sort to the carrajina_entry table and make it a VARCHAR datatype. I then wrote a function to parse each word and add a two digit code for each letter. Despite the fact that I am using numbers to sort, I am storing it as a string.I am also adding k, q, w, and y in case Carrajina ever adopts foreign words with those letters

a, A 10
b, B 11
c, C 12
ch, Ch 13
d, D 14
dj, Dj 15
e, E 16
f, F 17
g, G 18
h, H 19
i, I 20
j, J 21
k, K 22
l, L 23
m, M 24
n, N 25
ñ, Ñ 26
o, O, ó 27
p, P 28
q, Q 29
r, R 30
rr 31
s, S 32
t, T 33
u, U 34
v, V 35
w, W 36
x, X 37
y, Y 38
z, Z 39
[space], … 00
<sup> 000
-, ‘, </sup>

I’ve also set all spaces and ellipses to 00, the opening <sup> tag to 000 (and then the number after the tag will stay as is), and the closing tag, hyphens, and apostrophes are all set to empty. I then use the str_replace() function to substitute one for the other. Like so:

function getSort($value) {
    $chararray = array('<sup>', '</sup>', 'ch', 'dj', 'rr', 'ñ',
       'Ch', 'Dj', 'Rr', 'Ñ', 'ó',
       'a', 'b', 'c', 'd', 'e', 'f',
       'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
       'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
       'A', 'B', 'C', 'D', 'E', 'F',
       'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P',
       'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
       '-', "'", ' ', '...');
    $numberarray = array('000', '', '13', '15', '31', '26',
       '13', '15', '31', '26', '27',
       '10', '11', '12', '14', '16', '17',
       '18', '19', '20', '21', '22', '23', '24', '25', '27', '28',
       '29', '30', '32', '33', '34', '35', '36', '37', '38', '39',
       '10', '11', '12', '14', '16', '17',
       '18', '19', '20', '21', '22', '23', '24', '25', '27', '28',
       '29', '30', '32', '33', '34', '35', '36', '37', '38', '39',
       '', '', '00', '00');
    $value = str_replace($chararray, $numberarray, $value);
    return $value;
}

I then call each carrajina word, parse it, and update the row with the sort value.

$query = "SELECT carrajina_entryID, carrajina_term
    FROM carrajina_entry";

$stmt = $mysqli->stmt_init();
if ($stmt = $mysqli->prepare($query)) {
    $stmt->execute();
    $stmt->bind_result($id, $term);
    while ($stmt->fetch()) {
        //put each word into the array with the id as index value
        $wordarray[$id] = $term;
    }
    $stmt->close();
}

foreach ($wordarray as $id => $word) {
    $wordsort = getSort($word);
    $sortarray[$id] = $wordsort;
}

$query = "UPDATE carrajina_entry, SET carrajina_sort = ?
    WHERE carrajina_entryID = ?";

foreach ($sortarray as $id => $wordsort) {
    $stmt = $mysqli->stmt_init();
    if ($stmt = $mysqli->prepare($query)) {
        $stmt->bind_param("si", $wordsort, $id);
        $stmt->execute();
        $stmt->close();
    }
}

All done. Each entry now has a corresponding sort string in the carrajina_sort field.

As I said earlier, this is not the only way, or even the “best” way to do this. Those of you who have solved this problem in other ways, please leave a comment with your favorite solution.