Yesterday I attended a technical presentation on Character Sets. As I told the presenter afterwards, to me attending this meeting was like reading the Kitab-i-Iqan (book of certitude) and then looking at religious history – it was the key that suddenly made everything else clear. I had come across the terms Character Set, Character Encoding, collation, UTF8, ASCII, ANSI, etc many times and had some idea what they referred to, but was really confused about how they all relate to each other. Now, it was all much more clear. We’ll have a second part of the presentation later, and I may post a summary of the concepts after that.
But already puting theory to practice – this morning I had a request to remove diacritics from a listing of countries, so they sort properly. I thought “Aha!” This is a collation issue. Collation refers to setting items in order, and in the case of databases the collation sequence is the set of rules by which characters are ordered – for example a, b, c, ä or a, ä, b, c. So instead of simply removing the diacritics I googled “oracle collation” and ended up finding a better solution:
Oracle uses a setting called NLS_LANG to determine a number of things – language, territory and character set. This is an important setting, as any Oracle DBA would tell you. The default collation sequence is derived from the language portion of that setting, as there is a corresponding NLS_SORT setting for each NLS_LANGUAGE, but then there are also more custom values that you can use. Once you set a value for NLS_SORT, that collating sequence will then be used for sorting when running a query with the ORDER BY clause.
First let’s check what our current collation sequence is set to:
SELECT SYS_CONTEXT ('USERENV', 'NLS_SORT') FROM DUAL;
Then, get a list of the possible values you can set NLS_SORT to:
SELECT * FROM v$nls_valid_values WHERE parameter = 'SORT';
Choose the collation sequence you want to use from this list. In my case, I wasn’t quite sure which one I wanted, so I searched some more and found one called BINARY_AI which did what I was looking for. Turns out each value in the above table of possible values can have a hybrid definition, by adding a suffix of _CI (case insensitivity) or _AI (accent-insensitive and case-insensitive).
Finally, to set the collation sequence for the current session:
ALTER SESSION SET nls_sort='BINARY_AI';
Example:
My query for a list of countries originally gave these results (NLS_SORT was set to BINARY):
Romania Russian Federation Rwanda Réunion
After the NLS_SORT was set to BINARY_AI:
Réunion Romania Russian Federation Rwanda
10 Comments
I am in the same position you were. I know a little about character sets, but not as much as you have discussed here. Thanks for sharing this, I find it very helpful.
I found another way to use the script, avoiding the alter session command:
SELECT * FROM table SORT by NLSSORT(FIELD,’NLS_SORT=BINARY_AI’)
Thanks for the article. Very useful.
Wow, thanks Juan that’s brilliant!
I’ll definitely use this at some point, as I also prefer not having to alter session.
Yes, BUT keep in mind that your INDEXES aren’t built that way.
If you were smart, you’d do this experiment: Put a standard B-Tree index on the column you use in your WHERE clause, and gather stats on that table/index/column, and then run an EXPLAIN PLAN for your query.
Then change the NLS_SORT and run the query again.
…I think you’ll very likely note that changing the NLS_SORT will result in more full table scans- because the indexes would be built according to a different NLS_SORT than what you are setting it to.
So- what does Oracle do? …It doesn’t use the index. It full scans.
Also- it would probably help if you had a fairly decent size data set to test against (if your table only has 3 rows, Oracle will probably full-scan, no matter what).
Thank you for the piece of advice!
Do what Juan said! it works brilliantly, man.
-Bob from BMW Specials
Who and where to condense this summer on festival, share your information.
There are two important lessons here:
(1) Using ALTER SESSION changes the behaviour of all SORT BY and comparisons during the session, whereas SORT BY NLSSORT(FIELD,’NLS_SORT=BINARY_AI’) affects *only* the one sort it is specified for.
(2) INDEXES are subject to the NLS order when they are of character strings (otherwise, how would you know whether an unaccented A was less than or greater than an accented A?). For that reason, the appropriate collating sequence should be established (by the DB admin) before the indexes are built.
Need to change collation in table to UTF-8_general_ci or when creating it.
Can not find any solution of it for oracle. How can it possible.
Your post was really helpful and easy to understand. Thanks