Announcements regarding our community

March 20

We launched a new version of MARKUS on the occasion of the Council of East Asian Libraries and Association for Asian Studies 2019 meetings in Denver.

Automated markup in MARKUS can now handle Korean texts written in classical Chinese. The default entities include personal names, place names, official titles and posts, and book titles. We acknowledge the support of Kim Hyeon, Kim Baro and the Academy of Korean Studies and the assistance of Hu Jing, Mees Gelein and Brent Ho.

The selection of web references has been expanded with:

*the Encyclopedia of Korean Culture 韓國民族文化大百科事典 (for Korean book titles),

*Korean biographical databases from 韓國歷代人物綜合情報系統 and 韓國民族文化大百科事典 (the same person may be covered in different databases with each containing slightly different information),

*the Map of Dongyeo or 東輿圖 (for Korean historical place names)

*reference to official titles and posts in the veritable records (based on 朝鮮王朝實錄 and 朝鮮王朝實錄事典)

*buleku (Manchu dictionaries, with credit to Fresco Sam-Sin and Léon Rodenburg)

Figure 1. Linking tagged place names to the Map of Dongyeo or 東輿圖

We added new video clips on features that were added in the past year:

-How to use batch markup

-How to export to docusky

-How to add place name ids to map data in docugis

-How to use COMPARATIVUS for text comparison between any two files and for tagging instances of text reuse in MARKUS files

We also added a link to the very helpful docusky manual compiled by Hu ChiRui et al. The manual also includes a section on MARKUS, although this edition has not yet been updated with the functionality added over the past year.

The research forum features a few new blog posts by Xiong Hueilan, Liu Jialong and Hilde De Weerdt about projects on local gazetteers and Tang-Song political history.

For automated tagging we added place names from the DDBC Place Authority Database (法鼓佛教學院地名規範資料). This dataset includes many additional place names mentioned in Buddhist sources. With special thanks to Joey Hong 洪振洲.

Finally we want to draw your attention to the metadata conversion tool developed by Tu Hsieh-Chang.

This allows you to convert MARKUS tags to metadata or structural markup; in this way you can divide a single file into divisions. For example, in a file with several juan (volumes) and zhang (chapters) you can tag these divisions using keyword markup (see figure 2). After exporting the file to docusky from your MARKUS account you can select one or more tags (say JUAN and headings) to be converted to metadata. In the resulting docusky textual database, you will see that these tags can now be used as metadata categories (see figure 3). You can then, for example, filter and analyze other information you tagged by volume or chapter. The example below shows how this feature can be used to sort commentary and commentators by volume and chapter.

Figure 2

Figure 3
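The idea behind sorting tagged information by volume and chapter can also be sketched in a few lines of Python. This is a minimal illustration only, not how docusky or the metadata conversion tool is implemented: the list of (tag, text) pairs, the tag names JUAN, ZHANG and COMMENTATOR, and the function group_by_division are all hypothetical stand-ins for a tagged file read in document order.

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for a tagged file: (tag, text) pairs in
# document order. Tag names and sample values are illustrative only.
tagged_items = [
    ("JUAN", "juan 1"),
    ("ZHANG", "zhang 1"),
    ("COMMENTATOR", "Commentator A"),
    ("COMMENTATOR", "Commentator B"),
    ("ZHANG", "zhang 2"),
    ("COMMENTATOR", "Commentator A"),
    ("JUAN", "juan 2"),
    ("ZHANG", "zhang 1"),
    ("COMMENTATOR", "Commentator C"),
]


def group_by_division(items, division_tags=("JUAN", "ZHANG"), target="COMMENTATOR"):
    """Attach each target entity to the most recent division markers above it."""
    current = {tag: None for tag in division_tags}
    grouped = defaultdict(list)
    for tag, text in items:
        if tag in current:
            current[tag] = text
            # A new higher-level division resets the lower-level ones.
            level = division_tags.index(tag)
            for lower in division_tags[level + 1:]:
                current[lower] = None
        elif tag == target:
            key = tuple(current[t] for t in division_tags)
            grouped[key].append(text)
    return dict(grouped)


by_division = group_by_division(tagged_items)
for (juan, zhang), names in by_division.items():
    print(juan, zhang, names)
```

Each division tag works like a metadata field whose value stays in force until the next marker of the same or a higher level appears, which is what makes filtering by volume or chapter possible.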

In the coming months we will continue to work on relational markup (establishing relationships between entities) and additional dictionaries. All suggestions and offers of collaboration are welcome.


If you were using MARKUS before, please clear your browser cache; not doing so may cause the new version of MARKUS to malfunction. On this problem see also our FAQ.

The Korean datasets may be slower to load. Please be patient: hit the markup button only once and wait for the scanning process to start. Hitting it twice may freeze your system, as it will try to execute the command twice.



Revising Custom Tags in Use

If you have used custom tags to mark up a text and then want to revise them, a video made by Wu Ruei-Ming 吳瑞明 as part of the Drugs across Asia project is a useful reference. As the video cannot be uploaded here, we have taken some screenshots to show how it was done. The video was recorded in the 正體中文 (traditional Chinese) version of MARKUS.

The left side of the screenshot below shows the text that has been marked up. The user now wants to change the five bilingual tags on the right into monolingual (Chinese) ones.

The logic of the method shown in the video is to add five Chinese tags whose colors match those of the tags they are meant to replace, and then to delete the bilingual ones. The specific steps are as follows:

Click “管理標記” (manage tags). In the window that pops up, change the tag names (標記名稱) of the five bilingual tags from the earlier naming pattern (English translation + transliteration) to English only. For instance, for the tag “DrugName 藥名”, delete “_LueMing”.

Having done this with all five, click “確定”. The five tags under revision now appear uncolored.

Click “管理標記” again, this time to change the button names (按鈕名稱) from bilingual to Chinese only. For instance, for the tag “DrugAction藥效”, delete “DrugAction”.

After doing this with all five, click “確定”. Five Chinese tags now appear in place of the five earlier bilingual ones.

Now let’s color them. Click “管理標記”. In the “新增標記” section of the popup window, for each of the five uncolored tags, copy its tag and button names into corresponding boxes and choose the same color as that for the tag it is going to replace. For example, choose crimson for the tag “藥名”. Then click “新增”.

Two crimson tags now appear.

The next step is to delete the one you no longer want. Click “管理標記” again. In the window that pops up, delete the tag “DrugName 藥名”.

Click “確定” and you’ll see that only one crimson tag, “藥名”, remains. It marks up all the named entities in the text that were previously marked up by the tag “DrugName 藥名”.

This completes the revision of this tag. Do the same for the other four tags, until you have revised them all.

Jiyan Qiao, PhD candidate, Leiden University

Acquiring texts for MARKUS: A brief introduction to Chinese local gazetteer databases

LIU Jialong, PhD candidate at Leiden University

Since 2017 I have been participating in a project led by Professor Hilde De Weerdt, with the assistance of Xiong Huei-lan, that uses digital approaches to investigate the history of construction in imperial China. We selected the construction of city walls 城墻 for a pilot study. An important step in our research is to use MARKUS to mark up the information we want to analyze in local gazetteers (for a discussion of how to mark up and extract data with regular expressions in MARKUS, see Xiong Huei-lan’s post). The first step in this process is to identify and import or upload relevant primary source texts. In this post, I discuss where such sources can be accessed in local gazetteer databases.

Before acquiring texts from databases, we should evaluate them. It is necessary to be fully aware of the advantages and shortcomings of each database. For example, how many gazetteers does a database include? Can we browse the content of the gazetteers? Can we download texts? Can we check the text against the original images in the database? What retrieval strategies does the database offer? Does it provide useful tools? Below, drawing on my own experience, I briefly introduce a selection of commonly used Chinese local gazetteer databases.

(1) Zhongguo fangzhiku 中國方志庫

10,000 gazetteers are planned for inclusion, of which 4,000 are currently available

users can easily browse the content and compare the full text to original images of the titles included

texts can be downloaded (to use them in MARKUS, save the downloaded content as a .txt file)

full-text retrieval is available

(2) Zhongguo shuzi fangzhiku 中國數字方志庫

11,000 gazetteers are included (This makes it the largest classical Chinese local gazetteer database I have used. Two examples: for Chongming County 崇明, 中國數字方志庫 includes 6 gazetteers from different periods, whereas 中國方志庫 has 4; the latter lacks the editions from the 雍正 (1723-1735) and 民國 (1912-1949) periods. For Taicang Prefecture 太倉, 中國數字方志庫 has 8 gazetteers, whereas 中國方志庫 has 5; the latter lacks the editions from the 光緒 (1875-1908) and 宣統 (1909-1911) periods, as well as an undated manuscript titled 太倉衛志.)

users can browse the content and have access to the original images of the books

texts can be downloaded

full-text retrieval is available (However, users cannot search gazetteers based on region. By contrast, 中國方志庫 provides such a retrieval strategy.)

the interface of 中國數字方志庫 is less user-friendly than that of 中國方志庫, especially in the font and layout used for reading the full text



(3) Xin fangzhi 新方志

40,000 gazetteers are included (It only covers gazetteers compiled after 1949.)

users can browse the content and directly read the original texts in PDF format

users can download the PDF files

full-text retrieval is not available (search results are limited to the chapter level)

(4) Shuzi fangzhi 數字方志

includes more than 6,000 gazetteers compiled before the end of the Qing dynasty and held by the National Library of China

users can browse the content but the website often crashes

users can read the scanned images of gazetteers but cannot download texts

full-text retrieval is not available

(5) Zhongguo dalu ge sheng difangzhi shumu chaxun xitong 中國大陸各省地方志書目查詢系統

a catalogue developed by Academia Sinica in Taiwan covering local gazetteers compiled both before and after 1949

users cannot browse the content of gazetteers

(6) Regional databases set up by regional libraries

This kind of database usually includes the local gazetteers in one region no matter when they were compiled. Below I use Beijingshi shuzi fangzhi guan 北京市數字方志館 as an example:

all the local gazetteers about Beijing are claimed to be included

PDF versions of gazetteers (scanned images of books) are offered online for users to browse

texts cannot be downloaded

full-text retrieval is unavailable

Before uploading texts acquired from databases to MARKUS, we should clean the digital text. When doing so, we should pay attention to blank spaces. In the texts acquired from 中國方志庫, blank space is used to indicate notes 注 in the original text. Such features can be used to, for example, mark up notes in MARKUS.
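Before deciding whether to keep or strip blank spaces, it can help to see which kinds of whitespace a downloaded file contains and where they occur. The following is a minimal sketch under stated assumptions: the function name whitespace_report is hypothetical, and the sample string is made up for illustration rather than taken from 中國方志庫.

```python
import re
import unicodedata
from collections import Counter


def whitespace_report(text, context=5):
    """Count each kind of whitespace (line breaks excluded) and collect
    a few characters of context around every run of whitespace."""
    counts = Counter()
    samples = []
    # [^\S\r\n] matches any whitespace character except CR and LF.
    for match in re.finditer(r"[^\S\r\n]+", text):
        for ch in match.group():
            # Fall back to the code point for characters without a name (e.g. tab).
            counts[unicodedata.name(ch, f"U+{ord(ch):04X}")] += 1
        start, end = match.span()
        samples.append(text[max(0, start - context):end + context])
    return counts, samples


# Made-up sample: two full-width (ideographic) spaces and one ASCII space.
sample_text = "城周九里\u3000舊志作七里\u3000高二丈 濠深一丈"
counts, samples = whitespace_report(sample_text)
for name, count in counts.items():
    print(name, count)
for sample in samples:
    print(repr(sample))
```

If the report shows that one kind of space consistently surrounds notes, that space can be kept as a marker for keyword or regular-expression markup, while stray spaces of other kinds can be removed.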

Variant forms of Chinese characters 異體字 also need special attention, as they can influence the results of mark-up in MARKUS. I advise standardizing all variants. If you cannot recognize a variant of a character, you can refer to the dictionary provided by 中國方志庫 or to a specialized online dictionary of variants, 異體字字典. (Xiong Huei-lan also describes the difficulty caused by variants when using regular expressions in her post.)
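Standardizing variants can be done with a simple character-for-character replacement table. The sketch below is only an illustration: the three pairs in VARIANT_TO_STANDARD and the function name standardize_variants are assumptions, and which form counts as standard is a choice each project has to make for itself.

```python
# Illustrative mapping only: extend it with the variants found in your own
# texts, and choose the target forms that suit your project.
VARIANT_TO_STANDARD = {
    "峯": "峰",
    "羣": "群",
    "爲": "為",
}

# str.maketrans builds a lookup table for single-character replacements.
TRANSLATION_TABLE = str.maketrans(VARIANT_TO_STANDARD)


def standardize_variants(text):
    """Replace every variant character listed in the mapping with its chosen form."""
    return text.translate(TRANSLATION_TABLE)


print(standardize_variants("羣峯"))
```

Characters that are not in the mapping pass through unchanged, so the table can be built up gradually as new variants are found.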

Depending on the quality of the OCR, there may be missing characters or wrong characters in the texts you acquired from electronic databases. When cleaning the text, such problems should be addressed. After cleaning your texts, upload them to MARKUS and explore them with the variety of functionality provided by the platform.
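Some missing characters can be located automatically, because databases and OCR tools often leave a visible placeholder behind. The sketch below assumes that the white square (U+25A1), the Unicode replacement character (U+FFFD) and private-use code points are such placeholders; that set, the function name find_suspect_characters and the sample text are all illustrative assumptions, so adjust them to what you actually see in your files. Wrong characters that are otherwise valid still have to be found by proofreading against the original images.

```python
import unicodedata

# Assumed placeholder characters: white square and the replacement character.
PLACEHOLDERS = {"\u25a1", "\ufffd"}


def find_suspect_characters(text):
    """Return (line number, column, code point) for each suspect character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Category "Co" marks private-use code points, which are sometimes
            # used for characters that have no standard encoding.
            if ch in PLACEHOLDERS or unicodedata.category(ch) == "Co":
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


# Made-up sample with one white square and one private-use character.
sample_text = "城周九里\u25a1步\n高二丈\ue000尺"
suspects = find_suspect_characters(sample_text)
for line_no, col, code_point in suspects:
    print(f"line {line_no}, column {col}: {code_point}")
```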
