Creating Dutch dictionary for android

As a follow-up to yesterday's post, here is a how-to for creating the LatinIME dictionary files (in this case raw-nl/main.dict)!

All info necessary to do so is readily available on http://code.google.com/p/softkeyboard/wiki/BinaryDictionaries

As a base, you need a wad of decent text, the more the better! The quality of the dictionary improves as more data is fed into it. The example on the Softkeyboard wiki uses Wikipedia for its bulk text!

The Dutch dump is available at http://download.wikimedia.org/nlwiki/20100813/. Of the different dumps, I chose the one containing the articles, templates, image descriptions, and primary meta-pages, but not the revision or user data:

2010-08-13 16:17:13 done Articles, templates, image descriptions, and primary meta-pages.
2010-08-13 16:17:13: nlwiki 1003105 pages (382.643/sec), 1003105 revs (382.643/sec), 92.5% prefetched, ETA 2010-08-13 16:59:24 [max 1971556]
This contains current versions of article content, and is the archive most mirror sites will probably want.
pages-articles.xml.bz2 549.0 MB

The Softkeyboard explanation offers the following bash script to analyse the wad of text into a weighted word list.

My blog screws up code, check gertschepens.be or the original page for the commands
Code from Softkeyboard, contributed by Jacob Nordfalk

This creates a weighted and sorted list of the words in the file. The more data, the more reliable this set will be. (The database dump of a Dutch forum would be a nice addition to get more of the common language in there, as a counterweight to the dictionary wording.)
The weights are between 194 and 8671269 for this specific export (a weight is the number of times a word is found in the text). The sample.xml, however, speaks of a frequency value from 0 to 255.
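
Since the original commands did not survive on this page, here is a rough Python sketch of the same idea: count how often each word occurs in a bz2-compressed text dump. It is not the Softkeyboard script; the function names and the letters-only tokenizer are my own assumptions, and it makes no attempt to strip wiki markup.

```python
import bz2
import re
from collections import Counter

# Letters only, Unicode-aware: matches accented and non-Latin letters too,
# but skips digits and underscores.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Return a Counter of lowercased words found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


def count_words_in_dump(path):
    """Stream a .bz2 text dump line by line so it never has to fit in memory."""
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        return count_words(handle)


if __name__ == "__main__":
    sample = ["De kat en de hond.", "De hond blaft."]
    for word, count in count_words(sample).most_common(2):
        print(count, word)
```

Calling most_common() on the result gives the words with the most frequent ones first.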

So I created a Perl script to fix the numbering. The script needs a sorted list, frequent words first, infrequent ones later. It cycles through the list and replaces the raw counts (in the correctly formatted XML) with the rescaled weights.
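
The Perl script itself is behind the link, so the following is only a Python sketch of one way to squeeze raw counts into the 0 to 255 range, not a copy of it. The logarithmic scale, the floor of 1, and the `<w f="...">` element layout are assumptions on my part; check them against sample.xml before feeding the result to the converter.

```python
import math
from xml.sax.saxutils import escape


def rescale(count, lowest, highest):
    """Map a raw count onto 1..255 on a logarithmic scale (an assumed choice)."""
    if highest <= lowest:
        return 255
    span = math.log(highest) - math.log(lowest)
    position = (math.log(count) - math.log(lowest)) / span
    return max(1, min(255, round(1 + position * 254)))


def to_wordlist_xml(counts):
    """Render {word: count} as a word list, most frequent words first.

    The element and attribute names are assumed, not taken from sample.xml.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if not ordered:
        return "<wordlist>\n</wordlist>"
    lowest, highest = ordered[-1][1], ordered[0][1]
    rows = [
        '  <w f="%d">%s</w>' % (rescale(count, lowest, highest), escape(word))
        for word, count in ordered
    ]
    return "<wordlist>\n" + "\n".join(rows) + "\n</wordlist>"
```

A logarithmic scale is used here because the raw counts span several orders of magnitude; a linear scale would push almost every word to the bottom of the range.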

The source is not available here for formatting reasons; click the Perl script link!

Once we have a decent .xml, all we need to do is convert it to the necessary .dict format, which is described in the Softkeyboard text; and the necessary software is in their repositories.

After this, we have a valid main.dict file, ready to be compiled into a LatinIME pack. Compiling all languages into the pack will not be possible due to the very limited space on the system partition (my Magic has about 1 MB free on its /system partition), so a solution will have to be found to store the dictionary files outside the LatinIME keyboard apk (which is preferable at any rate!). My Dutch dictionary was submitted to CM but will probably not make it in until the external/internal dictionary problem is solved! (In the meantime, I'm using a home-rolled version: the default CM keyboard with raw-nl added, currently available here.)

Edit: I apparently typed “SoftPedia” a couple of times instead of “the Softkeyboard Wiki” – Must’ve been very tired while putting that together :/ Sorry for the confusion. At any rate – all the necessary links are in there.

Published by Gert

Person-at-large.

55 thoughts on “Creating Dutch dictionary for android”

  1. A small remark: The bzcat command you gave didn’t come out well on your blogpost. Compare it with the original from the BinaryDictionaries wiki page.

    Thanks for the interesting tutorial.

  2. I also didn’t get the Dutch dict after installing your apk.
    Running on Android 2.2, I tried to activate it in the “Language and Keyboard” settings, but no dutch dict was in the list.

  3. Tnx for the heads up about that code; I should know better than to post code here.

    As for installing the apk: you can't just install it the regular way; the keyboard is in /system/app and /system is read-only. To install it you need to boot to recovery, use adb shell to mount the system partition and use adb push to push the new keyboard to the device.

    But it looks like you won't need to do that; people are working on not having the dictionaries in the apk. The new ones are now available in the market (look for LatinIME).
    Haven't installed the new code myself but I am very much looking forward to trying it 🙂

  4. And its looking AWESOME
    Next up – a better dictionary and software to be able to keep improving the dicts (being able to pipe in more data whenever it becomes available)

    How did you build LatinIME? I’ve tried from CM git but it gives me something like “build.xml not found”. I’ve tried getting the apk from the phone and adding a main.dict (from softkeyboard) into res/raw-pt, but the keyboard settings still don’t show any Portuguese dictionary. I’ve tried rebuilding main.dict from pt.xml and putting it into LatinIME.apk but no luck.
    thank you
    andré

    1. I built it from the Cyanogen source
      I put the Android-ified file in vendor/cyanogen/overlay/common/packages/inputmethods/LatinIME/java/res/raw-nl/main.dict
      although I'm starting to wonder if I didn't cheat there and put it in the tree itself: packages/inputmethods/LatinIME/java/res/raw-nl/

      What command did you use to build? How exactly did you set up the code?

  6. Thank you Gert. Same thing here. I’m trying to build from Cyanogen source. And to me it’s ok to put it in the tree. But when I

    make LatinIME

    I get

    target asm: libskia <= external/skia/src/opts/S32A_Blend_BlitRow32_arm.S
    target asm: libskia <= external/skia/src/opts/S32_Opaque_D32_nofilter_DX_gether_arm.S
    external/skia/src/opts/S32_Opaque_D32_nofilter_DX_gether_arm.S: Assembler messages:
    external/skia/src/opts/S32_Opaque_D32_nofilter_DX_gether_arm.S:38: Error: selected processor does not support `uxth r0,r0'
    external/skia/src/opts/S32_Opaque_D32_nofilter_DX_gether_arm.S:40: Error: selected processor does not support `uxth r6,r6'
    external/skia/src/opts/S32_Opaque_D32_nofilter_DX_gether_arm.S:47: Error: selected processor does not support `uxth r9,r9'
    external/skia/src/opts/S32_Opaque_D32_nofilter_DX_gether_arm.S:49: Error: selected processor does not support `uxth r11,r11'
    make: *** [out/target/product/generic/obj/SHARED_LIBRARIES/libskia_intermediates/src/opts/S32_Opaque_D32_nofilter_DX_gether_arm.o] Error 1

    That's why I tried to unpack LatinIME.apk (from the phone) and zip it again with raw-pt in it. But that did not work either. I'm using Ubuntu 10.10 (64-bit) with an Intel i3. Don't know if this has something to do with the error.

  7. Sorry, I forgot “-j4” there… But what I don’t understand is why when I reboot into recovery and

    adb push ~/cyanogenmod/out/target/product/bravo/system/app/LatinIME.apk /system/app/LatinIME.apk
    adb shell reboot

    I always end with the same LatinIME.apk and not the one that I’ve made and pushed

  8. I got it working with:

    adb reboot recovery
    adb shell mount /dev/block/mtdblock3 /system
    adb push ~/cyanogenmod/out/target/product/bravo/system/app/LatinIME.apk /system/app/LatinIME.apk
    adb shell reboot

    now I have portuguese dictionary with default android keyboard

    whoops, lots of posts since last time I checked 😉

    Indeed, the system partition is write-protected at runtime; you need to boot into recovery, mount the system partition and push the new .apk file.

    Though this should soon be solved, since there's a bit of code being prepared in Cyanogen to make it possible to install dicts from the market!

    Such a useful tutorial, thanks.

    I want to make a Korean dict file from my xml dictionary source. But I can’t find the converting tool in the Softpedia.com. Please let me know the url to get the dict format generating tool to my email; ysmilekim@gmail.com. thanks again.

  11. Hey Gert

    I’ve tried it but it doesn’t work.
    I’m able to push it to the phone (I guess) but it doesn’t work.
    When I push it, it says that it has transferred 0 bytes in 423651,002s
    When I try to open it on my phone it says that it stopped working etc.

    Any Ideas?
    Cheers

    1. hmm.. There might be various reasons ..

      How did you push the file? You need to be able to overwrite the regular file. That means you need root (to mount the system) and you need to do this in recovery because once the device has booted, those files will be read only.

      Also, something entirely different; what version of android do you use?

  12. Thanks for the quick reply

    I’m using an HTC Desire Rooted running LeeDroid 2.3d custom ROM.
    I connected the phone to my pc (without mounting it), fired up cmd.exe and then did the following commands:

    – cd c:\android-sdk-windows\tools
    – adb reboot recovery
    – adb shell mount /system
    – adb push c:\LatinIME.apk /system/app (also tried “adb push c:\LatinIME.apk /system/app/LatinIME.apk” AND “adb push LatinIME.apk /system/app” when I put it in the tools folder)
    – adb shell reboot

    I even tried to put your modified LatinIME.apk in the LeeDroid ROM replacing the normal one without dutch dictionary and flash the ROM again.

    I can then go into settings and see the keyboard there, I can fiddle with the settings etc. but when I try to change languages it says it stopped working.
    Same when I try to use the keyboard…

    Any ideas?

      I’m afraid I don't have a ready solution.. 2 points come to mind though..

      * Do you have any space left on the system partition? Cyanogen has barely any space left; I wouldn't be surprised if other ROMs have even less free space.. The extra dictionary takes up some space, so the new keyboard apk is bigger than the previous one..
      Then again, that's an issue with my by now ancient HTC Magic; your device might have a much bigger partition.

      It bugs me that the push seemingly fails.. You could try pulling the original file, pushing the new file, then pulling the file again and check which one you got.. (using diff or md5 or whatever.. the size should differ between those files anyway)
      Though directly flashing it will probably have put the file on the phone.

      * Are there keyboard related changes in LeeDroid? My .apk is vanilla AOSP, so if LeeDroid has custom changes to that code in their distro, that might be the reason.
      I think you’ll probably have the highest success rate and the least painful time getting there if you take the LeeDroid keyboard source, add the dictionary to it and compile it yourself.. This would rule out a lot of possible issues..

      * If I’m not mistaken, LeeDroid has a lot of HTC code? (contrary to CM, which is more AOSP) – Does it use the default Android keyboard? Or an HTC version of it? Since the keyboard apk really contains an entire keyboard app, replacing an HTC version with the vanilla AOSP one might explain the failure. Although, I think the HTC one has a different apk name..

      Ok, 3 points then :s Anyway, if it's not a space issue, compiling the dictionary into the keyboard in LeeDroid sounds like a good idea to rule out several potential problems.
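
The pull-and-compare check could look roughly like this in Python, once both copies have been pulled to the computer with adb pull. The function names are made up for illustration; it just compares size and MD5.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True if both files have the same size and the same MD5 digest."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return md5_of(path_a) == md5_of(path_b)
```

If same_file() returns True for the file pulled before the push and the one pulled after it, the push did not replace anything.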

      Keep us posted 🙂

  13. I did have some problems with the amount of space on /system.
    When I tried to install Metamorph and it needed to extract several things it said the partition was out of space…

    Perhaps I should try to delete a few unneeded apps from /system/app?

    Don’t know how the Leedroid version got made and which code was used.

    About compiling it myself, I know how to sign a file etc, but isn’t there a lot more work to do than simply adding the nl dict and signing the file again? I guess you should add something to the code? And I really don’t know how I could possibly do that.

  14. Guess I figured it out. There are loads of things different with your .apk file.

    In the folder res there is a folder “anim”, and the drawable folders are “hdpi”, “land” and “mdpi”; yours are all “mdpi” or “land”.

    Is it possible that I upload the LeeDroid file and that you add the dutch.dict file?

    Would be amazing!

      I'm afraid I won't really have time to help you with that over the following weeks. :/

      It's easy though..
      You need to check out the code repository for LeeDroid to your computer. (there is probably a step by step somewhere on how to do this and what software to install if necessary) Then add the dictionary to where it belongs in there (the correct directory in the tree) and compile the apk (probably by typing “make name.apk”), then wait a while, because this takes a lot of time!

      Give it a try; if you’re this far already, you won't have any trouble getting this to work 😉

  15. I guess I first have to decompile the apk and then check out the code repository?
    How did you do this because I really don’t know where to start or how to search a step by step process on how to do it :s

    1. No no, no need to decompile or whatever. Decompiling is a way to get the source code, but we can just download (or check out in Subversion terms) the code repository.

      By the way, doesn't the HTCime keyboard contain Dutch dicts?

      Although I can't seem to find the LeeDroid Android source.. You'd best contact the LeeDroid guys and ask them where to get the source, I guess :/ Is it possible that LeeDroid isn't open source? :/

  16. I made it work!!!

    I used apktool to decompile the apk then simply added the raw-nl dictionary.
    Compiled the file and there it is! Working flawlessly!!!

    Thanks for the help man, really appreciated!

  17. When people use sms, they usually don’t send words with accented letters, because they use up a lot more space than regular letters (Sms has a 160 character cap).

    But when people reply on facebook, they do use them, although not as regularly, because it’s easier to type c instead of č.

    My problem: how can I offer the user to choose between an accented dictionary and a non-accented one?

    Two different LatinIME.apk files?
    Any ideas?

    Also, because I’m a complete noob, I’m really having trouble with looking at the source code, or even understanding how code.google.com works, or even how to compile. Any chances of making a step by step tutorial for lazy idiots like me?

    I just learned you can send 70 characters in Unicode or 160 in GSM encoding. I never noticed.. That sucks for languages with accents in them 🙂

    I can't really think of a clean way to fix this.. Google provides the option of using multiple dictionaries, but they have one per language in their list, not several..

    Several apks wouldn't work, or at least not in any way anyone would want it to..

    What trouble are you having with the code?

  19. I’ve downloaded a wikipedia dump. I know it’s not the best thing to use, so I’ve contacted my local carriers to see if they’re willing to help me choose the most commonly used words.

    I’m having trouble with your python script. I’ve never ever used python in my life. All I know is a bit C. So understanding your code isn’t that difficult but I don’t know how to use it. How/where do I run it and how do I feed it the wikipedia dump?

  20. Hey, thats great!

    Wikipedia is a good source; the only problem is that it isn’t natural language.. It's a great base, but you'd need natural language texts like actual SMS messages or literature or ..

    Though its a great base!

  21. Here’s another tip about finding word sources:

    Facebook.

    There’s a new option which lets you download all of your data. Including messages and wallposts which contain the most valuable info for the dictionary. Ask a bunch of your friends if they’d be willing to donate their info and you’ve got an unbeatable dictionary.

    Would you be so kind as to write a script which uses FB data for creating a weighed and sorted word list?

    http://techcrunch.com/2010/10/06/facebook-now-allows-you-to-download-your-information/

    That's a great idea! Natural language with probably not too much English pollution.. cool! Though I'd start with a Wikipedia pack at any rate, just to make sure most words are in there, then use the FB data to refine the importance..
    You could even set the weight to 0 for all words after the Wikipedia run, though I'd initially not do that and see how the results turn out..

    The code in the page should work with at most a slight modification; the initial code (the one-line thingy that makes the initial dict.xml before running the Perl script) starts with
    bzcat archive.bz2 |
    This just cats out the decompressed content of the archive to the rest of the commands, so you can just replace it with
    cat FBfile.txt |
    where FBfile.txt is a text file containing all the data (like the Facebook data) in plain text. You would probably not want too much Facebook-related junk in there though, or you’ll probably get a lot of “Facebook” suggested..

    Great idea though!

    Though in retrospect, it's A LOT easier to just convert the .zip to a .bz2 (or .tar.bz2) file and run the command as usual!

    Converting can be as simple as unzipping the file and recompressing it with bzip2 😉

    Lots of English junk in there though.. but still worthwhile!
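
The unzip-and-recompress step can also be done in one go. A minimal Python sketch, assuming the export is a .zip holding .txt or .html files (the suffix list and function name are guesses; adjust them to whatever the archive really contains):

```python
import bz2
import zipfile


def zip_to_bz2(zip_path, bz2_path, suffixes=(".txt", ".html", ".htm")):
    """Concatenate the text-like members of a zip archive into one .bz2 file.

    Returns the number of members that were copied.
    """
    written = 0
    with zipfile.ZipFile(zip_path) as archive, bz2.open(bz2_path, "wb") as out:
        for name in archive.namelist():
            if not name.lower().endswith(suffixes):
                continue  # skip photos and other binary members
            with archive.open(name) as member:
                for chunk in iter(lambda: member.read(1 << 20), b""):
                    out.write(chunk)
            out.write(b"\n")  # keep the last word of one file apart from the next
            written += 1
    return written
```

The resulting file can then be fed to the usual bzcat pipeline. Any HTML tags in the export would still end up in the word counts unless they are stripped first.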

  24. Here’s a problem:
    I’m trying to run the script for making the word list, and it doesn’t want to parse Cyrillic letters.
    I mean, I run the script, and a dictionary xml file is made, but it’s completely empty. Nothing.

    Weird.

    Also, is there a way to transform all the cyrillic letters into latin characters? Word does this but it chokes when I tell it to open a 1GB Wikipedia dump.

  25. Hold on, something happened: the xml dictionary word list has 1.3MB and reads:

    XML Parsing Error: junk after document element
    Location: file:///home/milan/dict.xml
    Line Number 2, Column 1:у
    ^

    What. The. Hell?

  26. Have you checked what is on Line Number 2, Column 1? A cyrillic character?

    Translating Cyrillic letters into Latin characters: no clue, honestly.. And yeah, most software will have a hard time converting such a huge document.

    Maybe you should look into learning Perl to do this? :s
    It's an easy language and it'll get you there..
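
For what it's worth, a large file does not have to be opened in an editor to be transliterated; it can be streamed line by line. A minimal Python sketch, assuming the text is Serbian Cyrillic and UTF-8 encoded (the mapping below is the Serbian one; other Cyrillic alphabets need a different table). Function names are illustrative only.

```python
# Serbian Cyrillic to Latin, lowercase letters; uppercase is derived below.
SERBIAN_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def build_table():
    """Build a str.translate() table covering lower- and uppercase letters."""
    table = {}
    for cyrillic, latin in SERBIAN_LOWER.items():
        table[ord(cyrillic)] = latin
        table[ord(cyrillic.upper())] = latin.capitalize()
    return table


TABLE = build_table()


def to_latin(text):
    """Transliterate a string; characters not in the table pass through."""
    return text.translate(TABLE)


def convert_file(src, dst):
    """Transliterate a UTF-8 text file line by line, whatever its size."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(to_latin(line))
```

Because only one line is held in memory at a time, the size of the dump should not matter much.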

  27. I’ve just found out that Sony Ericsson already made a Serbian dictionary for an Xperia x8. It’s not Cyrillic but still, it’s something.

    Their dict is not good. In fact it’s very bad.
    I think that a healthy dose of Facebook data will improve it.

    I’m sorry for going offtopic a bit, but do you know any way of getting that untouched ROM with the dict?

    I was thinking of installing Astro and using bluetooth to transfer the dict files or even the entire LatinIME to my Hero.

    I’d do this in the store, so I’m hoping they don’t arrest me.

    Any tips or better ideas before I do it tomorrow?

    1. I do not know.
      They are not in Cyanogen because of space constraints on some devices, but as to why none of the user-generated ones are in AOSP, I have no clue.
      The originals are probably not there because of licencing..

  28. Gert et al,

    your and others’ information is very useful indeed.

    I want to add a new keyboard layout, so can you build just the keyboard portion? Please advise.

    Anousak

  29. Hey Gert,

    Using the above information I tried adding the .apk you uploaded to my phone, only to find it crashing all the time, without me being able to edit settings or anything.

    What I did: Download the LatinIME.apk file from your blog post
    fire up command prompt:
    -cd “C:\…\android-sdk\platform-tools”
    -adb reboot recovery
    -adb shell mount /system
    -adb push LatinIME.apk /system/app/LatinIME.apk
    -adb shell reboot

    And now the keyboard (com.android.inputmethod.latin) keeps force closing.
    Should I have used another apk perhaps?

  30. What version of CM are you using? I’d guess that this .apk is no longer compatible with the recent versions, considering when I created it..
    I’d propose you check out the most recent code and compile it yourself?

  31. Hi, your latinIME.apk with dutch dictionary is offline? I would really appreciate being able to give it a try! thanks!

    1. Yes, that got taken offline a while ago, rather by accident actually. Though I did not put it back online again because it is a very old version of the keyboard by now. I’ll look into uploading a backup though.
