Recently, while using the Rime input method, I ran into a problem: I couldn't find a good medical vocabulary dictionary. One way around this is the "DeepBlue Dictionary Converter" (imewlconverter), which can convert Sogou Cell dictionaries (.scel) into the XXX.dict.yaml files that Rime uses.
DeepBlue is most commonly used on Windows, and there are many tutorials for it, but few explain how to use it on Linux (Linux users can probably work it out on their own; it goes roughly like this). This article walks through the process using Google Colab, which runs Ubuntu, as the example.
You can run this code directly in Colab → Google Colab
Note: The ! before each of the following commands is only needed in the Colab notebook environment. If you are using a Linux terminal, omit the !.
Download DeepBlue software and unzip
Please download the latest version (v3.0.0 is used in this example)
! wget https://github.com/studyzy/imewlconverter/releases/download/v3.0.0/imewlconverter-v3.0.0-macOS.zip
! unzip imewlconverter-v3.0.0-macOS.zip
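Install .NET 6.0
DeepBlue only starts with version 6.0 of .NET. The -y flag answers the installation prompt automatically, which a notebook cell cannot do interactively.
! apt-get install -y dotnet-sdk-6.0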
Start DeepBlue and get help
! dotnet ImeWlConverterCmd.dll -?
You will get output like the following:
xpy Shouxin Input Method
xlpy Xinlang Pinyin
jd Jidian Wubi
jdzm Jidian Zhengma
jdmb Jidian Wubi .mb files
xywb Xiaoya Wubi
yahoo Yahoo Kimo
ld2 Lingoes ld2
wb86 Wubi 86 version
wb98 Wubi 98 version
wbnewage Wubi New Century Edition
cjpt Cangjie Platform
emoji Emoji
bdsj Baidu Mobile or Mac Baidu Pinyin
bdsje Baidu Mobile English
bcd Baidu Mobile Dictionary bcd
qqsj QQ Mobile
ifly iFly Input Method
self Custom
word Pure Chinese characters without Pinyin
For example, to convert Sogou Cell dictionaries ./test.scel and ./a.scel to a Google Pinyin dictionary ./gg.txt, the command is:
dotnet ImeWlConverterCmd.dll -i:scel ./test.scel ./a.scel -o:ggpy ./gg.txt
To convert ./test.scel and ./a.scel Sogou Cell dictionaries into Google Pinyin dictionaries test.txt and a.txt under the ./temp folder, use:
dotnet ImeWlConverterCmd.dll -i:scel ./test.scel ./a.scel -o:ggpy ./temp/*
To convert all Sogou Cell dictionaries ./test/*.scel into Google Pinyin dictionaries under the ./temp folder, use:
dotnet ImeWlConverterCmd.dll -i:scel ./test/*.scel -o:ggpy ./temp/*
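When running these conversions from a Colab Python cell rather than a `!` line, the same command can be assembled as an argument list. The following is a minimal sketch; the function names `build_convert_command` and `run_conversion` are hypothetical helpers, not part of DeepBlue, and the sketch assumes ImeWlConverterCmd.dll sits in the current directory.

```python
import subprocess


def build_convert_command(input_format, input_files, output_format, output_path,
                          dll="ImeWlConverterCmd.dll"):
    """Assemble the argument list for one conversion run (hypothetical helper).

    Mirrors the documented form:
    dotnet ImeWlConverterCmd.dll -i:<format> <files...> -o:<format> <output>
    """
    if not input_files:
        raise ValueError("at least one input file is required")
    return ["dotnet", dll, f"-i:{input_format}", *input_files,
            f"-o:{output_format}", output_path]


def run_conversion(command):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(command, check=True)


cmd = build_convert_command("scel", ["./test.scel", "./a.scel"], "ggpy", "./gg.txt")
print(" ".join(cmd))
```

Calling `run_conversion(cmd)` executes the conversion where dotnet is installed. Because no shell is involved, a pattern such as ./test/*.scel is not expanded automatically; pass `sorted(glob.glob("./test/*.scel"))` as `input_files` instead.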
If the imported dictionary does not contain word frequencies but the export format requires them, you can choose how word frequencies are generated with the -r: option. Supported values include:
-r:baidu Word frequency based on the number of results on Baidu search for that term
-r:google Word frequency based on the number of results on Google search for that term (requires VPN)
-r:number Specify a fixed numeric word frequency
When exporting dictionaries for the Rime input method, you can set the encoding with -ct:pinyin/wubi/zhengma, and set the applicable OS with -os:windows/macos/linux.
Use -ft: to set filtering conditions for entries; if not set, no entries are filtered. Filtering conditions after -ft: include:
len:1-100 Keep entries with character length between 1 and 100
rank:2-9999 Keep entries with word frequency between 2 and 9999
rm:eng Remove entries containing English letters
rm:num Remove entries containing numbers
rm:space Remove entries containing spaces
rm:pun Remove entries containing punctuation
The above filtering conditions can be combined and applied simultaneously, separated by vertical bars:
-ft:"len:1-100|rank:2-9999|rm:eng|rm:num|rm:space|rm:pun"
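The combined filter string is easy to mistype, so it can help to build it programmatically. The sketch below only joins the documented conditions with vertical bars; `build_filter_arg` is a hypothetical helper name, and the set of `rm:` values is limited to the four listed above.

```python
ALLOWED_REMOVALS = ("eng", "num", "space", "pun")


def build_filter_arg(length=None, rank=None, remove=()):
    """Build a -ft: argument from optional (min, max) ranges and rm: flags."""
    parts = []
    if length is not None:
        parts.append(f"len:{length[0]}-{length[1]}")
    if rank is not None:
        parts.append(f"rank:{rank[0]}-{rank[1]}")
    for item in remove:
        if item not in ALLOWED_REMOVALS:
            raise ValueError(f"unknown rm: value {item!r}")
        parts.append(f"rm:{item}")
    if not parts:
        raise ValueError("no filter conditions given")
    return "-ft:" + "|".join(parts)


arg = build_filter_arg(length=(1, 100), rank=(2, 9999),
                       remove=["eng", "num", "space", "pun"])
print(arg)
```

When the result is passed as one element of a `subprocess` argument list, no extra quoting is needed. On a shell command line, keep the double quotes shown above so that the shell does not interpret | as a pipe.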
Parameters for custom formats are as follows:
-f:213,|byyn
213 sets the order of Pinyin, Chinese characters, and word frequency. 213 means 1: Chinese character, 2: Pinyin, 3: word frequency. All three digits are required.
, sets the separator between Pinyin syllables; here it is a comma.
| sets the separator between the Chinese characters, Pinyin, and word frequency fields; here it is a vertical bar.
b sets where the Pinyin separator also appears around the Pinyin string. The options are l, r, b, and n: l includes it on the left, r includes it on the right, b includes it on both sides, and n includes it on neither side.
yyn sets whether Pinyin, Chinese characters, and word frequency are each shown, with y meaning show and n meaning hide. Here, yyn shows Pinyin and Chinese characters but hides word frequency.
For example, to convert a qpyd dictionary to a custom format text dictionary with commas between Pinyin syllables, spaces between Pinyin and words, no word frequency display, and using a custom encoding file code.txt, the command is:
dotnet ImeWlConverterCmd.dll -i:qpyd ./a.qpyd -o:self ./zy.txt "-f:213, nyyn" -c:./code.txt
The encoding file specified with -c:./code.txt contains one "ChineseCharacter<Tab>Encoding" pair per line.
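To check a -f: pattern before passing it to the converter, it can be split into the five parts described above. This is a sketch of the layout as documented here, not DeepBlue's own parser: it assumes each separator is exactly one character, so the pattern is always nine characters long, and `FormatSpec` and `parse_format_spec` are hypothetical names.

```python
from typing import NamedTuple


class FormatSpec(NamedTuple):
    order: str               # three digits, e.g. "213"
    pinyin_separator: str    # between Pinyin syllables, e.g. ","
    field_separator: str     # between the three fields, e.g. "|"
    separator_position: str  # one of l, r, b, n
    show_flags: str          # three y/n flags, e.g. "yyn"


def parse_format_spec(spec: str) -> FormatSpec:
    """Split a pattern such as '213,|byyn' into its documented parts."""
    if len(spec) != 9:
        raise ValueError("expected 9 characters, e.g. '213,|byyn'")
    order, pinyin_sep, field_sep = spec[:3], spec[3], spec[4]
    position, show = spec[5], spec[6:]
    if sorted(order) != ["1", "2", "3"]:
        raise ValueError("order must use each of the digits 1, 2, 3 once")
    if position not in ("l", "r", "b", "n"):
        raise ValueError("separator position must be l, r, b, or n")
    if any(flag not in ("y", "n") for flag in show):
        raise ValueError("show flags must be y or n")
    return FormatSpec(order, pinyin_sep, field_sep, position, show)


print(parse_format_spec("213,|byyn"))
print(parse_format_spec("213, nyyn"))
```

The second call corresponds to the qpyd example above, where the field separator is a space and the position option is n.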
Finally, if this software has helped you, you can show your appreciation by donating. Author’s Alipay address: studyzy@163.com Zeng Yi
Type -? to get help
Upload the original dictionary file, then start the conversion
For more options, run ! dotnet ImeWlConverterCmd.dll -?
! dotnet ImeWlConverterCmd.dll -i:scel ./acupuncture.scel -o:rime ./acupuncture.txt
This produces acupuncture.txt, whose contents are the dictionary entries we need.
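To turn those entries into the XXX.dict.yaml file mentioned at the start, they need a Rime dictionary header in front of them. The sketch below assumes acupuncture.txt holds one tab-separated entry per line (word, Pinyin, weight), which should be checked against the actual file; `make_rime_dict` is a hypothetical helper, and the version string is a placeholder.

```python
from pathlib import Path


def make_rime_dict(entries_text: str, name: str, version: str = "0.1") -> str:
    """Prepend a minimal Rime dictionary header to converted entries."""
    header = "\n".join([
        "---",
        f"name: {name}",           # should match the file name <name>.dict.yaml
        f'version: "{version}"',
        "sort: by_weight",
        "...",
        "",
    ])
    # Drop blank lines so the entry section stays contiguous.
    body = "\n".join(line for line in entries_text.splitlines() if line.strip())
    return header + "\n" + body + "\n"


sample = "针灸\tzhen jiu\t1\n\n穴位\txue wei\t1\n"
print(make_rime_dict(sample, "acupuncture"))
```

In Colab, the real file could then be written with `Path("acupuncture.dict.yaml").write_text(make_rime_dict(Path("acupuncture.txt").read_text(encoding="utf-8"), "acupuncture"), encoding="utf-8")`.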
Finally
I have converted some Sogou medical dictionaries into Rime format; you are welcome to download and use them → 始徒医学输入法简介 (Introduction to the Shitu Medical Input Method)