Creating a Dutch dictionary for Android

As a follow-up to yesterday's post, here is a how-to for creating the LatinIME dictionary files (in this case raw-nl/main.dict)!

All info necessary to do so is readily available on http://code.google.com/p/softkeyboard/wiki/BinaryDictionaries

As a base, you need a wad of decent text, the more the better! The quality of the dictionary improves as more data is fed into it. The example on the Softkeyboard wiki uses Wikipedia for its bulk text!

The Dutch dump is available at http://download.wikimedia.org/nlwiki/20100813/ and, of the different dumps on offer, I chose the one containing the articles, templates, image descriptions, and primary meta-pages, but not the revision or user data.

2010-08-13 16:17:13 done Articles, templates, image descriptions, and primary meta-pages.
2010-08-13 16:17:13: nlwiki 1003105 pages (382.643/sec), 1003105 revs (382.643/sec), 92.5% prefetched, ETA 2010-08-13 16:59:24 [max 1971556]
This contains current versions of article content, and is the archive most mirror sites will probably want.
pages-articles.xml.bz2 549.0 MB

The Softkeyboard explanation offers the following bash script to turn the wad of text into a weighted word list.

My blog screws up code, so check gertschepens.be or the original page for the commands.
Code from Softkeyboard, contributed by Jacob Nordfalk
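
Since the original commands are not shown here, the following is only a rough Python sketch of the same idea (count how often each word occurs in the dump), not Jacob Nordfalk's bash script. The function names, the word pattern, and the choice to lowercase everything are my own assumptions. It also does not strip wiki or XML markup, so markup words end up in the counts unless you filter them out.

```python
import bz2
import re
from collections import Counter

# Runs of letters, including accented ones such as "é" or "ë";
# digits and underscores are excluded. (Assumed definition of a "word".)
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Count lowercased words over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


def count_words_in_dump(path):
    """Stream a .bz2 dump line by line, so it never has to fit in memory."""
    with bz2.open(path, "rt", encoding="utf-8", errors="ignore") as dump:
        return count_words(dump)


# Example use (the file name is a placeholder for the downloaded dump):
# for word, n in count_words_in_dump("pages-articles.xml.bz2").most_common():
#     print(n, word)
```

`most_common()` returns the words sorted by count, most frequent first, which is the order the next step expects.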

This creates a weighted and sorted list of the words in the file. The more data, the more reliable this list will be. (The database dump of a Dutch forum would be a nice addition to get more everyday language in there, as a counterweight to the more formal Wikipedia wording.)
For this specific export, the weights range from 194 to 8671269 (the number of times a word is found in the text). The sample.xml, however, speaks of a frequency value from 0 to 255.

So I created a Perl script to fix the numbering. The script needs a sorted list, frequent words first, infrequent ones later. It cycles through the list and replaces the initial numbering (in the correctly formatted XML) with a weighted value in that range.
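
To give an idea of what such a rescaling step can look like, here is a minimal Python sketch. It is not the Perl script: I am assuming a logarithmic mapping (word counts are heavily skewed, so a linear one would squash almost everything to the bottom), and I map onto 1..255 rather than 0..255 in case 0 has a special meaning. The `<wordlist>` and `<w f="...">` names are likewise an assumption about the XML layout, so check them against sample.xml before relying on this.

```python
import math
from xml.sax.saxutils import escape


def rescale(counts, lo=1, hi=255):
    """Map raw counts (all >= 1) onto lo..hi on a logarithmic scale."""
    log_min = math.log(min(counts.values()))
    log_max = math.log(max(counts.values()))
    # If every word has the same count, fall back to a span of 1
    # so the division below cannot hit zero.
    span = (log_max - log_min) or 1.0
    return {
        word: lo + round((math.log(c) - log_min) / span * (hi - lo))
        for word, c in counts.items()
    }


def to_wordlist_xml(scaled):
    """Render the scaled words, most frequent first (assumed XML layout)."""
    lines = ["<wordlist>"]
    for word, f in sorted(scaled.items(), key=lambda item: -item[1]):
        lines.append(f'  <w f="{f}">{escape(word)}</w>')
    lines.append("</wordlist>")
    return "\n".join(lines)
```

With the extremes from my export, a word counted 8671269 times ends up at 255 and one counted 194 times at 1; everything else lands in between.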

The source is not available here for formatting reasons; click the Perl script link!

Once we have a decent .xml, all we need to do is convert it to the necessary .dict format, which is described on the Softkeyboard wiki; the necessary software is in their repositories.

After this, we have a valid main.dict file, ready to be compiled into a LatinIME pack. Compiling all languages into the pack will not be possible due to the very limited space on the system partition (my Magic has about 1 MB free on its /system partition), so a solution will have to be found to store the dictionary files outside the LatinIME keyboard apk (which is preferable at any rate!). My Dutch dictionary was submitted to CM but will probably not make it in until the external/internal dictionary problem is solved! (In the meantime, I'm using a home-rolled version: the default CM keyboard with raw-nl added, currently available here.)

Edit: I apparently typed “SoftPedia” a couple of times instead of “the Softkeyboard Wiki” – Must’ve been very tired while putting that together :/ Sorry for the confusion. At any rate – all the necessary links are in there.