Converting the vocabulary of MediaWiki-based encyclopedias into a RIME dictionary

This article is transcoded by SimpRead, original at sspai.com

The open-source input method RIME, combined with a carefully developed input schema such as Rime Ice, already meets daily needs, because such schemas usually ship with comprehensive dictionaries of common words.

However, these everyday dictionaries cannot cover everything, and specialized vocabulary from some fields may be missing. For example, when I write articles about games or animated series, I often run into character names, place names, and terminology that the input method does not know. I then have to pick candidates character by character, which noticeably slows down typing.

Fortunately, RIME users can stand on the shoulders of MediaWiki encyclopedia volunteers [Note 1]. The entries of these encyclopedias themselves are very rich and professional corpora.

MediaWiki is the wiki software developed by the Wikimedia Foundation, and Wikipedia is built on it. Well-known platforms such as Moegirlpedia and Bilibili Game Wiki (BWiki) are also built on MediaWiki. You can use the tool MW2Fcitx (MediaWiki To Fcitx) to turn the entry names of these encyclopedias into RIME dictionaries, greatly expanding the vocabulary available to RIME.

Background: How did I discover MW2Fcitx?

My main laptop runs Arch Linux. After configuring the Arch Linux CN repository, I wanted to install the RIME input method and searched for it in Pacman. Then, something magical appeared in the search results:

$ pacman -Ss fcitx rime
...
archlinuxcn/fcitx5-pinyin-moegirl-rime 20220218-1
    Fcitx 5 Pinyin Dictionary from zh.moegirl.org.cn
archlinuxcn/fcitx5-pinyin-zhwiki-rime 20210120-1
    Fcitx 5 Pinyin Dictionary from zh.wikipedia.org for rime
...

Wow, these are dictionaries made specifically for RIME, sourced from Moegirlpedia and Wikipedia respectively. The entries of encyclopedia platforms are treasure troves of corpora; adding them can greatly improve the input experience, so you no longer have to piece words together character by character. The Moegirlpedia dictionary in particular is a blessing for ACG enthusiasts who use RIME, who no longer have to struggle to type anime titles or character names!

Obviously, creating these dictionaries does not require manually looking through encyclopedia entries one by one. They are all generated using MW2Fcitx.

MW2Fcitx can call the MediaWiki API to fetch all entry titles from MediaWiki-based encyclopedias and compile them into dictionaries usable by various input methods. Since Wikipedia and Moegirlpedia are based on MediaWiki, MW2Fcitx naturally supports them.
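To make the idea of fetching all entry titles more concrete, here is a minimal sketch of paging through the MediaWiki `list=allpages` query. It is not MW2Fcitx's actual implementation: the function names are made up for illustration, and `fetch` is a hypothetical callable that would send the parameters to api.php and return the decoded JSON. The demonstration runs offline against two canned responses.

```python
from typing import Callable, Dict, Iterator, Optional


def allpages_params(apcontinue: Optional[str] = None, limit: int = 500) -> Dict[str, str]:
    """Build the query parameters for one batch of list=allpages."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": str(limit),
        "format": "json",
    }
    if apcontinue is not None:
        # Resume where the previous batch stopped.
        params["apcontinue"] = apcontinue
    return params


def iter_titles(fetch: Callable[[Dict[str, str]], dict]) -> Iterator[str]:
    """Yield every title, following the API's continuation token.

    `fetch` is a hypothetical callable (an assumption of this sketch) that
    sends the parameters to api.php and returns the decoded JSON response.
    """
    apcontinue = None
    while True:
        data = fetch(allpages_params(apcontinue))
        for page in data.get("query", {}).get("allpages", []):
            yield page["title"]
        apcontinue = data.get("continue", {}).get("apcontinue")
        if apcontinue is None:
            break  # no continuation token: this was the last batch


# Offline demonstration: two canned responses instead of a real wiki.
_canned = {
    None: {"query": {"allpages": [{"title": "A"}, {"title": "B"}]},
           "continue": {"apcontinue": "C"}},
    "C": {"query": {"allpages": [{"title": "C"}]}},
}


def fake_fetch(params: Dict[str, str]) -> dict:
    """Stand-in for a real HTTP request."""
    return _canned[params.get("apcontinue")]


print(list(iter_titles(fake_fetch)))
```

In real use, `fetch` would perform an HTTP GET against the site's api.php; MW2Fcitx handles this step for you, so the sketch is only meant to show what happens behind the scenes.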

Installing MW2Fcitx

MW2Fcitx is written in Python. If your computer does not have Python installed, you can refer to the Runoob (菜鸟教程) tutorial.

Note:

On Windows, you can only use the official version of Python.

It is known that MSYS2’s Python cannot properly install MW2Fcitx dependencies and will fail during installation; Python installed via Scoop lacks the Pip tool.

On Windows, macOS, and Linux distributions (except Arch Linux), you can easily install it using **pip**:

# On Windows platform (please ensure you have correctly installed the official Python version first)
py -m pip install mw2fcitx

# On other platforms
pip install mw2fcitx

It will install a script program named mw2fcitx, and pip will place it in the correct location:

  • On Windows, pip installs the mw2fcitx startup script into the Python installation directory.
  • On Linux, it defaults to installing into ~/.local/bin; you need to manually add this directory to your PATH. To do so, open ~/.profile (create it if it doesn’t exist) and add the line: export PATH=$PATH:/home/<your-username>/.local/bin.
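If you are unsure whether the script directory is already on your PATH, a small check like the following may help. It is only a sketch: `dir_on_path` is a helper name invented here, and the check simply compares normalized directory names.

```python
import os
import shutil
from pathlib import Path


def dir_on_path(directory: str, path_value: str) -> bool:
    """Return True if `directory` is one of the entries of a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


local_bin = str(Path.home() / ".local" / "bin")
print("~/.local/bin on PATH:", dir_on_path(local_bin, os.environ.get("PATH", "")))
# shutil.which returns None when the script cannot be found on PATH.
print("mw2fcitx found at:", shutil.which("mw2fcitx"))
```

If the second line prints None after installation, the directory holding the script is most likely missing from PATH.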

A successful installation log looks like this:

Collecting mw2fcitx
  Using cached mw2fcitx-0.13.0-py3-none-any.whl.metadata (3.9 kB)
Collecting OpenCC>=1.1.1.post1 (from mw2fcitx)
  Downloading OpenCC-1.1.7-cp312-cp312-win_amd64.whl.metadata (12 kB)
Collecting pypinyin>=0.38.1 (from mw2fcitx)
  Using cached pypinyin-0.51.0-py2.py3-none-any.whl.metadata (12 kB)
Downloading mw2fcitx-0.13.0-py3-none-any.whl (14 kB)
Downloading OpenCC-1.1.7-cp312-cp312-win_amd64.whl (716 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 716.5/716.5 kB 1.8 MB/s eta 0:00:00
Using cached pypinyin-0.51.0-py2.py3-none-any.whl (1.4 MB)
Installing collected packages: OpenCC, pypinyin, mw2fcitx
Successfully installed OpenCC-1.1.7 mw2fcitx-0.13.0 pypinyin-0.51.0

[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: C:\\Users\\AnClark\\AppData\\Local\\Programs\\Python\\Python312\\python.exe -m pip install --upgrade pip

For Arch Linux, due to system restrictions, MW2Fcitx cannot be installed directly via pip. You can use the PKGBUILD I created to build and install the MW2Fcitx package. The steps are as follows:

# Install dependencies
sudo pacman -Syu
sudo pacman -S python-hatchling pypinyin

# Fetch the PKGBUILD
git clone https://github.com/AnClark/mw2fcitx-arch-PKGBUILD

# Build and install the MW2Fcitx package
cd mw2fcitx-arch-PKGBUILD
makepkg
sudo pacman -U python-mw2fcitx-3.10-1-any.pkg.tar.zst

Finding Supported MediaWiki Sites

To use MW2Fcitx, you first need to find an encyclopedia platform built on MediaWiki, and then find the API URL of that platform.

MW2Fcitx obtains the entry list by calling MediaWiki’s API, which is exposed through a PHP file on the site, api.php. On most MediaWiki sites, api.php sits in the site root directory (e.g., Moegirlpedia). However, there are special cases:

  • Chinese Wikipedia’s API URL is https://zh.m.wikipedia.org/w/api.php.
  • BWiki is composed of multiple sub-wikis, each independent, so api.php is located in the root directory of each sub-wiki site.

The MediaWiki API paths of some encyclopedias are listed in the table below:

Site                        API URL
Chinese Wikipedia           https://zh.wikipedia.org/w/api.php
Moegirlpedia                https://zh.moegirl.org.cn/api.php
BWiki: Arknights            https://wiki.biligame.com/arknights/api.php
BWiki: Genshin Impact       https://wiki.biligame.com/ys/api.php
BWiki: Honkai: Star Rail    https://wiki.biligame.com/sr/api.php
BWiki: Uma Musume           https://wiki.biligame.com/umamusume/api.php
Pokémon Wiki                https://wiki.52poke.com/api.php
API paths for other sites can be inferred based on the examples above.
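The inference can be sketched in a few lines. The helper below is hypothetical and only encodes the two patterns seen in this article: api.php directly under the site (or sub-wiki) path, and api.php under /w/ as on Wikipedia. Other sites may use different layouts.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_api_urls(site_url: str) -> list:
    """Return likely api.php locations for a MediaWiki site URL.

    Only two patterns are tried (an assumption of this sketch):
    <site>/api.php and <site>/w/api.php.
    """
    parts = urlsplit(site_url)
    base_path = parts.path.rstrip("/")
    candidates = []
    for suffix in ("/api.php", "/w/api.php"):
        url = urlunsplit((parts.scheme, parts.netloc, base_path + suffix, "", ""))
        if url not in candidates:
            candidates.append(url)
    return candidates


print(candidate_api_urls("https://wiki.biligame.com/ys/"))
print(candidate_api_urls("https://zh.moegirl.org.cn"))
```

Each candidate still has to be verified by actually opening it, since the function only guesses.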

Opening this path directly in a browser also tells you whether a site is a MediaWiki site. If it is, the page that opens is the API help document, as shown in the following image:

If this page can be displayed, then this encyclopedia is built with MediaWiki.
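The same check can be done programmatically by asking the API for its site information. The sketch below only builds the request URL and interprets a response; performing the HTTP request is left out so the example runs offline. The function names are invented for illustration, and the sample response is made up.

```python
import json
from urllib.parse import urlencode


def siteinfo_url(api_url: str) -> str:
    """Build a meta=siteinfo query URL for the given api.php."""
    query = urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    })
    return f"{api_url}?{query}"


def mediawiki_generator(response_text: str):
    """Return the generator string (e.g. 'MediaWiki 1.39.0') or None."""
    try:
        data = json.loads(response_text)
        generator = data["query"]["general"]["generator"]
    except (ValueError, KeyError, TypeError):
        return None  # not JSON, or not shaped like a siteinfo response
    if isinstance(generator, str) and generator.startswith("MediaWiki"):
        return generator
    return None


print(siteinfo_url("https://zh.moegirl.org.cn/api.php"))
# Made-up response body, shaped like a siteinfo reply.
print(mediawiki_generator('{"query": {"general": {"generator": "MediaWiki 1.39.0"}}}'))
print(mediawiki_generator("<html>Not found</html>"))
```

A non-None result suggests the URL really is a MediaWiki API endpoint.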

Writing the configuration file

MW2Fcitx requires a configuration file to work. The configuration file is named config.py, and it is used to specify how MW2Fcitx obtains and generates the dictionary.

First, create a directory to serve as MW2Fcitx’s working directory.

Within the working directory, create a file named config.py and fill it with the following content. Modify the corresponding fields according to the comments marked with 【 】 in the file. Note that these fields are Python strings.

# By default we assume the configuration is located at a variable
#     called "exports".
# You can change this with `-n any_name` in the CLI.

from mw2fcitx.tweaks.moegirl import tweaks

exports = {
    # Source configurations.
    "source": {
        # MediaWiki api.php path, if to fetch titles from online.
        # 【⬇️api_path specifies the api.php of the encyclopedia site⬇️】
        "api_path": "https://zh.m.wikipedia.org/w/api.php",
        # Title file path, if to fetch titles from local file. (optional)
        # Only works if api_path is absent.
        # 【⬇️file_path specifies the filename for temporarily storing fetched content⬇️】
        "file_path": "titles.txt",
        "kwargs": {
            # Title number limit for online fetching. (optional)
            # Only works if api_path is provided.
            #"title_limit": 120,
            # Title list export path. (optional)
            # 【⬇️output is optional, specifies the output path for the list of entries⬇️】
            #"output": "titles.txt"
        }
    },
    # Tweaks configurations as a list.
    # Every tweak function accepts a list of titles and returns
    #     a list of titles.
    "tweaks":
        tweaks,
    # Converter configurations.
    "converter": {
        # opencc is a built-in converter.
        # For custom converter functions, just give the function itself.
        "use": "opencc",
        "kwargs": {}
    },
    # Generator configurations.
    "generator": [{
        # rime is a built-in generator.
        # For custom generator functions, just give the function itself.
        # 【⬇️MW2Fcitx can also generate dictionaries for other input methods. Default is RIME⬇️】
        "use": "rime",
        "kwargs": {
            # Destination dictionary filename. (optional)
            # 【⬇️output specifies the output dictionary filename, recommended to use YAML extension】 
            "output": "wikipedia.dict.yaml"
        }
    }]
}

Special attention to the following fields:

  • api_path, fill in the API path of the MediaWiki site, usually pointing to the api.php of the site. After finding the encyclopedia platform’s API path, please remember to replace this field.
  • generator/kwargs/output is the filename for the output dictionary. It is recommended to use the .dict.yaml extension, because RIME locates a dictionary named X through the file X.dict.yaml.
  • source/kwargs/output is the filename for the exported entries list; this parameter is optional. The default is titles.txt.

Note: On Windows, MW2Fcitx raises an exception when writing the entries list file. Therefore, in the above example, the line "output": "titles.txt" is commented out.
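Before running the tool, it can be handy to sanity-check the fields discussed above. The following helper is purely illustrative and not part of MW2Fcitx; it only encodes the recommendations from this article (an api_path ending in api.php, and a RIME output ending in .dict.yaml).

```python
def check_config(exports: dict) -> list:
    """Return human-readable warnings for an MW2Fcitx-style config dict."""
    warnings = []
    source = exports.get("source", {})
    api_path = source.get("api_path")
    if not api_path and not source.get("file_path"):
        warnings.append("source needs api_path or file_path")
    elif api_path and not api_path.endswith("api.php"):
        warnings.append("api_path does not end with api.php")
    for generator in exports.get("generator", []):
        if generator.get("use") != "rime":
            continue  # only the RIME generator is checked in this sketch
        output = generator.get("kwargs", {}).get("output", "")
        if not output.endswith(".dict.yaml"):
            warnings.append(f"rime output {output!r} should end with .dict.yaml")
    return warnings


example = {
    "source": {"api_path": "https://zh.m.wikipedia.org/w/api.php"},
    "generator": [{"use": "rime", "kwargs": {"output": "wikipedia.dict.yaml"}}],
}
print(check_config(example))  # an empty list means nothing suspicious was found
```

You could import `exports` from your own config.py and pass it to `check_config` in the same way.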

Generating the dictionary

After completing the configuration file, we can start generating the dictionary.

In the working directory, open the command line and then run MW2Fcitx:

# Run directly without parameters
mw2fcitx

Wait a moment. MW2Fcitx automatically fetches the list of entries from the encyclopedia site, then runs a series of processing steps to generate the RIME dictionary.

A correct execution log looks like the following (excerpt):

2024-06-30 17:20:01,969 [DEBUG] Fetching titles from https://somewiki.com/wiki/api.php
2024-06-30 17:20:02,280 [DEBUG] Got 10 pages
.....
2024-06-30 17:20:23,288 [DEBUG] Got 38452 pages
2024-06-30 17:20:23,288 [INFO] Finished.
2024-06-30 17:20:23,289 [DEBUG] 38452 title(s) imported.
2024-06-30 17:20:23,291 [DEBUG] 1 title(s) imported.
2024-06-30 17:20:23,292 [DEBUG] Running 7 pipelines
2024-06-30 17:20:23,292 [DEBUG] Running pipeline 1/7 (cb')
......
2024-06-30 17:20:23,835 [DEBUG] Running pipeline 7/7 (tweak_normalize')
2024-06-30 17:20:23,848 [DEBUG] Deduplicating 119528 items
2024-06-30 17:20:23,856 [DEBUG] Deduplication completed. 37118 items left.
2024-06-30 17:20:23,857 [DEBUG] Exporting 37118 words with OpenCC
2024-06-30 17:20:24,162 [DEBUG] 1000 converted
......
2024-06-30 17:20:24,687 [DEBUG] 7724 converted
2024-06-30 17:20:24,712 [INFO] Dictionary generated.

The generated dictionary is located in the working directory under the filename specified in the configuration file, for example wikipedia.dict.yaml or GenshinImpact.dict.yaml.
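To peek into the result, you can read the entries back. The sketch below assumes the file follows the usual RIME dictionary layout: a YAML header that ends with a line containing only `...`, followed by tab-separated rows of word, code, and an optional weight. Whether MW2Fcitx's output matches this in every detail is an assumption, and the sample rows are made up.

```python
def read_rime_entries(dict_text: str) -> list:
    """Return (word, code) pairs found after the YAML header of a RIME dictionary."""
    lines = [line.rstrip() for line in dict_text.splitlines()]
    if "..." not in lines:
        return []  # no header terminator found; treat as "no entries"
    entries = []
    for line in lines[lines.index("...") + 1:]:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        entries.append((fields[0], fields[1] if len(fields) > 1 else ""))
    return entries


# Made-up sample content for illustration only.
sample = "---\nname: demo\nversion: \"0.1\"\n...\n枫丹\tfeng dan\t0\n\n# comment\n璃月\tli yue\n"
print(read_rime_entries(sample))
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the function; the length of the returned list gives a rough entry count.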

Installing the dictionary

After the dictionary is generated, we can install it into the system for RIME to use. Different platforms have different installation methods.

1) Linux

On Linux, dictionaries are usually installed in RIME’s shared data directory, /usr/share/rime-data/. Copy the newly generated dictionary into that directory:

# For example: the generated dictionary is named wikipedia.dict.yaml
sudo cp wikipedia.dict.yaml /usr/share/rime-data/

In Arch Linux, installing the packages fcitx5-pinyin-moegirl-rime and fcitx5-pinyin-zhwiki-rime will also place dictionaries directly into the above directory.

2) Windows

The Windows version of RIME is Weasel. Usually, we place the generated dictionary in its “User Folder”. This folder centrally stores user configurations.

Right-click the Weasel icon on the taskbar (displayed as a black square with a “中” character or a red square with an “A” letter), and select “User Folder.”

Weasel right-click menu.

This will open the User Folder in File Explorer. Paste the dictionary generated by MW2Fcitx here, for example, the BWiki Genshin Impact dictionary I obtained (named GenshinImpact.dict.yaml):

Weasel User Folder. The file outlined in red is the BWiki Genshin Impact dictionary I just added.

3) macOS

The macOS version of RIME is Squirrel. It is configured in much the same way as Weasel, except that the user folder is located at ~/Library/Rime.

Since I do not have a macOS computer, I cannot demonstrate. You can refer to this tutorial for the approach.

Mounting the dictionary

After installing the dictionary, we need to modify the configuration file of the RIME input schema to mount the dictionary. Note that each input schema uses independent dictionaries, which means that if you have multiple schemas and want to mount the dictionary to several, you need to modify the configuration files for each schema.

The configuration file modification needs to be done in the User Folder [Note 2].

0) User Folder directories on different platforms

  • Fcitx4: ~/.config/fcitx/rime/
  • Fcitx5: ~/.local/share/fcitx5/rime/
  • Weasel, Squirrel: Please refer to the previous section.
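If you script your setup, the locations can be captured in a small lookup. This is only a sketch: the frontend labels are names chosen here, the Fcitx5 entry uses ~/.local/share/fcitx5/rime/ (the data directory of fcitx5-rime), and Weasel is left out because its User Folder is opened from the tray menu.

```python
from pathlib import Path
from typing import Optional

# Locations relative to the home directory; the keys are labels chosen for this sketch.
_USER_FOLDERS = {
    "fcitx4": Path(".config") / "fcitx" / "rime",
    "fcitx5": Path(".local") / "share" / "fcitx5" / "rime",
    "squirrel": Path("Library") / "Rime",
}


def rime_user_folder(frontend: str, home: Optional[Path] = None) -> Path:
    """Return the RIME User Folder for a known frontend label."""
    if frontend not in _USER_FOLDERS:
        raise ValueError(f"unknown frontend: {frontend!r}")
    return (home or Path.home()) / _USER_FOLDERS[frontend]


print(rime_user_folder("fcitx5"))
```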

1) Built-in RIME schema (taking Luna Pinyin as an example)

  • Step 1, create a custom dictionary file. In the User Folder directory, create a file named luna_pinyin.mydict.dict.yaml (the mydict part can be any name), with the following content:
---
name: luna_pinyin.mydict    # Must be consistent with the corresponding part of the filename
version: "1.0"
sort: by_weight    # Default sorting by frequency
use_preset_vocabulary: true
# This is the list of dictionaries loaded by Luna Pinyin
import_tables:
  - luna_pinyin      # Basic Luna Pinyin dictionary
  - moegirl          # MoeGirl Wiki (provided by fcitx5-pinyin-moegirl-rime)
  - zhwiki           # Chinese Wikipedia (provided by fcitx5-pinyin-zhwiki-rime)
  - GenshinImpact    # The dictionary generated in the above tutorial
...

Note: The three dashes (---) before the YAML header and the three dots (...) after it must be kept; this is the fixed syntax of RIME dictionaries.

  • Step 2, create a custom patch file for the schema. In RIME, users do not edit a schema’s source file directly; they modify it through “patching.” The patch file is named after the schema ID; this example patches the simplified variant of Luna Pinyin, luna_pinyin_simp. In the User Folder, create a file named luna_pinyin_simp.custom.yaml with the following content:
# luna_pinyin_simp.custom.yaml
patch:
# Specify the custom dictionary; the field value is the "name" field from step one
  "translator/dictionary": luna_pinyin.mydict

  • Step 3, redeploy. Right-click the RIME icon in the system tray, select “Deploy,” and wait a moment for the deployment to finish before typing.
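A common mistake in the steps above is listing a name under import_tables for which no matching <name>.dict.yaml file exists. The following sketch checks for that. It uses a deliberately simple line-based parser rather than a full YAML library, so it only handles the plain list style shown in this article; the function names are invented for illustration.

```python
import tempfile
from pathlib import Path


def parse_import_tables(dict_text: str) -> list:
    """Collect the names listed under `import_tables:` in a dictionary header."""
    names = []
    in_block = False
    for raw in dict_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue
        if line.startswith("import_tables:"):
            in_block = True
            continue
        if in_block:
            item = line.strip()
            if item.startswith("- "):
                names.append(item[2:].strip())
            else:
                in_block = False  # another key or an end marker closes the list
    return names


def missing_dictionaries(dict_text: str, folders: list) -> list:
    """Return imported names with no <name>.dict.yaml in any of `folders`."""
    missing = []
    for name in parse_import_tables(dict_text):
        found = any((Path(folder) / f"{name}.dict.yaml").is_file() for folder in folders)
        if not found:
            missing.append(name)
    return missing


# Header excerpt for demonstration.
sample = """name: luna_pinyin.mydict
import_tables:
  - luna_pinyin      # Basic Luna Pinyin dictionary
  - GenshinImpact    # The dictionary generated in the above tutorial
"""

with tempfile.TemporaryDirectory() as folder:
    # Pretend only the base dictionary is installed.
    (Path(folder) / "luna_pinyin.dict.yaml").write_text("", encoding="utf-8")
    print(missing_dictionaries(sample, [folder]))
```

In practice you would pass both the User Folder and /usr/share/rime-data/ as `folders`, since dictionaries may live in either place.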

2) Third-party pinyin schema (taking Rime Ice as an example)

  • Step 1: create/open the configuration file. In the User Folder directory, open rime_ice.dict.yaml. If you installed Rime Ice on Arch Linux, this file probably doesn’t exist in the User Folder; you need to copy one from /usr/share/rime-data/.
  • Step 2: modify the configuration file. Find the import_tables field, and below the comment “It is recommended to put extended dictionaries below…” [Note 3], add our dictionary. For example (excerpt from the file):
# Rime dictionary
# encoding: utf-8

---
name: rime_ice
version: "2023-11-13"
import_tables:
  - cn_dicts/8105     # Character table
  # - cn_dicts/41448  # Large character table (enable if needed)
  - cn_dicts/base     # Base dictionary
  - cn_dicts/ext      # Extended dictionary
  - cn_dicts/tencent  # Tencent word vectors (large dictionary, longer deployment)
  - cn_dicts/others   # Miscellaneous

  # It is recommended to put extended dictionaries below; if there
  #  are duplicate entries, the upper one's weight applies
  # - cn_dicts/mydict
  # 【⬇️Add your dictionary below, note each line should be indented by two spaces.⬇️】
  - GenshinImpact     # The dictionary generated in the above tutorial
...

# The following part is omitted.

  • Step 3, redeploy. Right-click the RIME icon in the system tray, select “Deploy,” and wait a moment for the deployment to finish before typing.

Friendly reminder: If you load large dictionaries such as Chinese Wikipedia (rime-pinyin-zhwiki), the redeployment process may take longer. Please be patient.

Effect after adding dictionaries

Finally, here is a comparison of the input experience before and after adding the dictionary.

My test environment is Windows 11, with Weasel installed and the Rime Ice input schema enabled. The dictionary added is the BWiki Genshin Impact dictionary.

Before adding the dictionary, some Genshin Impact vocabulary is missing from the candidate list [Note 4], for example:

From top to bottom: Fontaine, Dust of Song, Fushina, Xingqiu, Liyue.

After adding the dictionary, all of the Genshin Impact vocabulary above is included. The improvement is easy to overlook during testing; I only confirmed it once I could type these words smoothly:

After adding the dictionary, the above Genshin Impact vocabulary appeared in the candidate list.

Closing words

RIME’s dictionaries may lack specialized vocabulary, especially in the fields of games and anime. Fortunately, enthusiastic volunteers in these fields keep contributing to various MediaWiki-based platforms, and the entries they create form a professional and valuable corpus.

Using MW2Fcitx to fetch the entry names from these encyclopedias, build RIME dictionaries from them, and load those dictionaries can greatly enrich our vocabulary and meet input needs in these specialized domains. From now on, when writing articles on such topics, you no longer have to pick candidate characters one by one to form words; the vocabulary you want is readily available.
