How to Use the DeepBlue Dictionary Converter on Linux (Using Google Colab, Which Is Based on Ubuntu, as an Example)

Recently, I ran into a problem while using the Rime input method: I couldn't find a good medical vocabulary dictionary. One solution is to use the "DeepBlue Dictionary Converter" (imewlconverter) to convert Sogou cell dictionaries (.scel) into XXX.dict.yaml files that Rime can use.

DeepBlue is most commonly used on Windows, and there are many tutorials for that platform, but few mention how to use it on Linux (Linux users can probably work it out on their own; the steps are roughly as follows). This tutorial uses Google Colab, which is based on Ubuntu, as an example.

You can run the code below directly in Colab → Google Colab

:warning: Note :warning:: The ! in front of the commands below is only needed in the Colab notebook environment. If you are working in a Linux terminal, omit the !.

Download DeepBlue software and unzip

Please check the project's GitHub releases page for the latest version; v3.0.0 is used here.

! wget https://github.com/studyzy/imewlconverter/releases/download/v3.0.0/imewlconverter-v3.0.0-macOS.zip
! unzip imewlconverter-v3.0.0-macOS.zip

Install .NET 6.0

This DeepBlue release can only be started with .NET 6.0

! apt-get install -y dotnet-sdk-6.0
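
Before starting DeepBlue, it can help to confirm that a 6.0 SDK is actually installed. The following is a minimal sketch for a Colab Python cell; the helper name has_dotnet6 is hypothetical, and it assumes the usual `dotnet --list-sdks` output in which each line begins with a version number.

```python
import re
import shutil
import subprocess


def has_dotnet6(listing: str) -> bool:
    """Return True if any line of the listing reports a 6.0.x version.

    Accepts lines that start with the version ("6.0.x [path]") or that
    have one name before it ("Some.Name 6.0.x [path]").
    """
    return re.search(r"(?m)^(?:\S+\s+)?6\.0\.\d+", listing) is not None


dotnet = shutil.which("dotnet")  # None when dotnet is not on PATH
if dotnet is None:
    print("dotnet was not found on PATH")
else:
    result = subprocess.run([dotnet, "--list-sdks"], capture_output=True, text=True)
    print("dotnet 6.0 SDK installed:", has_dotnet6(result.stdout))
```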

Start DeepBlue and get help

! dotnet ImeWlConverterCmd.dll -?

You will get output like the following (only the tail of the format list is shown here)

xpy	Shouxin Input Method
xlpy	Xinlang Pinyin
jd	Jidian Wubi
jdzm	Jidian Zhengma
jdmb	Jidian Wubi .mb files
xywb	Xiaoya Wubi
yahoo	Yahoo Kimo
ld2	Lingoes ld2
wb86	Wubi 86 version
wb98	Wubi 98 version
wbnewage	Wubi New Century Edition
cjpt	Cangjie Platform
emoji	Emoji
bdsj	Baidu Mobile or Mac Baidu Pinyin
bdsje	Baidu Mobile English
bcd	Baidu Mobile Dictionary bcd
qqsj	QQ Mobile
ifly	iFly Input Method
self	Custom
word	Pure Chinese characters without Pinyin

For example, to convert Sogou Cell dictionaries ./test.scel and ./a.scel to a Google Pinyin dictionary ./gg.txt, the command is:
dotnet ImeWlConverterCmd.dll -i:scel ./test.scel ./a.scel -o:ggpy ./gg.txt
To convert ./test.scel and ./a.scel Sogou Cell dictionaries into Google Pinyin dictionaries test.txt and a.txt under the ./temp folder, use:
dotnet ImeWlConverterCmd.dll -i:scel ./test.scel ./a.scel -o:ggpy ./temp/*
To convert all Sogou Cell dictionaries ./test/*.scel into Google Pinyin dictionaries under the ./temp folder, use:
dotnet ImeWlConverterCmd.dll -i:scel ./test/*.scel -o:ggpy ./temp/*

If the imported dictionary does not contain word frequency, but you need to specify word frequency when exporting, you can specify the method of generating word frequency with the -r: command. Supported options include:
-r:baidu  Word frequency based on the number of results on Baidu search for that term
-r:google  Word frequency based on the number of results on Google search for that term (requires VPN)
-r:number  Specify a fixed numeric word frequency

When exporting dictionaries for the Rime input method, you can set the encoding with -ct:pinyin/wubi/zhengma, and set the applicable OS with -os:windows/macos/linux.

Use -ft: to set filtering conditions for entries; if not set, no entries are filtered. Filtering conditions after -ft: include:
len:1-100 Keep entries with character length between 1 and 100
rank:2-9999 Keep entries with word frequency between 2 and 9999
rm:eng Remove entries containing English letters
rm:num Remove entries containing numbers
rm:space Remove entries containing spaces
rm:pun Remove entries containing punctuation
The above filtering conditions can be combined and applied simultaneously, separated by vertical bars:
-ft:"len:1-100|rank:2-9999|rm:eng|rm:num|rm:space|rm:pun"

Parameters for custom formats are as follows:
-f:213,|byyn
213 sets the order of Pinyin, Chinese characters, and word frequency. 213 means 1: Chinese character, 2: Pinyin, 3: word frequency. Exactly three are required.
, sets the separator between Pinyin syllables (here, a comma).
| sets the separator between Chinese characters, Pinyin, and word frequency (here, a vertical bar).
b sets where the Pinyin separator appears; options are l, r, b, n: l means left inclusive, r means right inclusive, b means inclusive on both sides, n means not inclusive on either side.
yyn sets whether to show Pinyin, Chinese characters, and word frequency respectively, with y meaning show and n meaning do not show. Here, yyn means show Pinyin and Chinese characters, but not word frequency.
For example, to convert a qpyd dictionary to a custom format text dictionary with commas between Pinyin syllables, spaces between Pinyin and words, no word frequency display, and using a custom encoding file code.txt, the command is:
dotnet ImeWlConverterCmd.dll -i:qpyd ./a.qpyd -o:self ./zy.txt "-f:213, nyyn" -c:./code.txt
The encoding file specified with -c:./code.txt contains one "ChineseCharacter<Tab>Encoding" pair per line.
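
As an aside, the -f: specification described above can be split into its parts to check it before running a conversion. This is only a sketch: parse_custom_format is a hypothetical helper, not part of DeepBlue, and it assumes the specification is always exactly nine characters with single-character separators, as in the two examples in the help text.

```python
def parse_custom_format(spec: str) -> dict:
    """Split a -f: value such as '213,|byyn' into named parts.

    Assumed layout: 3 order digits, 1 Pinyin separator, 1 field
    separator, 1 position flag (l/r/b/n), 3 show flags (y/n).
    """
    if len(spec) != 9:
        raise ValueError("expected 9 characters, e.g. '213,|byyn'")
    order, show = spec[:3], spec[6:]
    if sorted(order) != ["1", "2", "3"]:
        raise ValueError("order must use each of the digits 1, 2, 3 exactly once")
    if spec[5] not in "lrbn":
        raise ValueError("separator position must be one of l, r, b, n")
    if any(flag not in "yn" for flag in show):
        raise ValueError("show flags must be y or n")
    return {
        "order": order,
        "pinyin_separator": spec[3],
        "field_separator": spec[4],
        "separator_position": spec[5],
        # Show flags apply to Pinyin, Chinese characters, word frequency, in that order.
        "show_pinyin": show[0] == "y",
        "show_characters": show[1] == "y",
        "show_frequency": show[2] == "y",
    }


print(parse_custom_format("213, nyyn"))
```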

Finally, if this software has helped you, you can show your appreciation by donating. Author’s Alipay address: studyzy@163.com Zeng Yi
Type -? to get help
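
The options in the help text above can be combined into a single command. Below is a minimal sketch that assembles the argument list in Python; build_convert_command is a hypothetical helper, and placing -ft: and -r: after the output path is an assumption based on where the -f: and -c: options appear in the help example. Because the arguments are passed as a list, the vertical bars in the filter need no shell quoting.

```python
import shlex


def build_convert_command(input_format, input_files, output_format, output_path,
                          filters=None, rank=None):
    """Build the argument list for ImeWlConverterCmd.dll (hypothetical helper)."""
    cmd = ["dotnet", "ImeWlConverterCmd.dll",
           f"-i:{input_format}", *input_files,
           f"-o:{output_format}", output_path]
    if filters:
        # Filter conditions are joined with vertical bars, as in the help text.
        cmd.append("-ft:" + "|".join(filters))
    if rank:
        # Word-frequency generation method: baidu, google, or a fixed number.
        cmd.append(f"-r:{rank}")
    return cmd


cmd = build_convert_command("scel", ["./test.scel", "./a.scel"], "ggpy", "./gg.txt",
                            filters=["len:1-100", "rm:eng"])
# Show the command as it could be typed in a shell (quoted where needed).
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```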

After uploading the original dictionary file, start the conversion

For more usage parameters, see ! dotnet ImeWlConverterCmd.dll -?

! dotnet ImeWlConverterCmd.dll -i:scel ./acupuncture.scel -o:rime ./acupuncture.txt

This produces acupuncture.txt, whose contents are the entries we need.
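
To turn those entries into an XXX.dict.yaml file, they need a Rime dictionary header in front of them. The sketch below is one way to do that; wrap_as_rime_dict is a hypothetical helper, and it assumes acupuncture.txt is UTF-8 text that already contains tab-separated "word, code, weight" lines, so check your own output first.

```python
from pathlib import Path


def wrap_as_rime_dict(txt_path, dict_name, version="1.0"):
    """Write <dict_name>.dict.yaml next to txt_path: Rime header plus the entries."""
    # "utf-8-sig" also reads plain UTF-8 and drops a byte order mark if present.
    entries = Path(txt_path).read_text(encoding="utf-8-sig").strip()
    header = (
        "---\n"
        f"name: {dict_name}\n"
        f'version: "{version}"\n'
        "sort: by_weight\n"
        "...\n\n"
    )
    out_path = Path(txt_path).with_name(f"{dict_name}.dict.yaml")
    out_path.write_text(header + entries + "\n", encoding="utf-8")
    return out_path


# Only runs when the converted file from the previous step is present.
if Path("acupuncture.txt").exists():
    print(wrap_as_rime_dict("acupuncture.txt", "acupuncture"))
```

The resulting file still has to be referenced from a Rime schema or from another dictionary's import_tables list before Rime will load it.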

Finally

I have converted some Sogou medical dictionaries into Rime format; you are welcome to download and use them → Introduction to the 始徒 (Shitu) Medical Input Method