Announcements regarding our community

MARKUS updates log

2018-07-25 minor updates

Automated markup of place and personal names now works for simplified characters

Taiwanese place names are now included in markup and reference

Logging in from Dropbox accounts

2016-10-28 the Sieve online bug fix

2016-10-14 MARKUS update and bug fix

Navigation menu bar updated (links and icons)

New functions released in simplified Chinese interface

Simplified Chinese interface bug fix

2016-10-13 MARKUS update

Tutorial videos in simplified Chinese interface now using Youku

New simplified subtitles added

2016-10-01 MARKUS bug fix

Web reference bug fixed.

2016-09-24 MARKUS update

DDBC person tags in TEI will now have an ID and a URI. (Thanks to DDBC for providing the URI information!)


Revising Custom Tags in Use

If you have used custom tags to mark up a text and then want to revise them, a video made by Wu Ruei-Ming 吳瑞明 as part of the Drugs across Asia project is a useful reference. As the video cannot be uploaded here, we have taken some screenshots to show how it was done. The video was recorded in the traditional Chinese (正體中文) version of MARKUS.

The left side of the screenshot below shows the text that has been marked up. The user now wants to change the five bilingual tags on the right into monolingual (Chinese) ones.

Basically, the logic behind the method shown in the video is to add five Chinese tags whose colors correspond to the ones they are meant to replace and then delete the bilingual ones. Here are the specific steps:

Click “管理標記” (Manage Tags). In the window that pops up, change the tag names (標記名稱) of the five bilingual tags from the earlier naming pattern of English translation + transliteration to English only. For instance, for the tag “DrugName 藥名”, delete “_LueMing”.

Having done this with all five, click “確定”. The five tags under revision now appear uncolored.

Click “管理標記” again, this time to change the button names (按鈕名稱) from bilingual to Chinese only. For instance, for the tag “DrugAction藥效”, delete “DrugAction”.

Having done this with all five, click “確定”. In place of the five earlier bilingual tags, five Chinese ones now appear.

Now let’s color them. Click “管理標記”. In the “新增標記” (Add Tag) section of the popup window, for each of the five uncolored tags, copy its tag name and button name into the corresponding boxes and choose the same color as that of the tag it is going to replace. For example, choose crimson for the tag “藥名”. Then click “新增” (Add).

Two crimson tags now appear.

The next step is to delete the one you no longer want. Click “管理標記” again. In the window that pops up, delete the tag “DrugName 藥名”.

Click “確定” and you’ll see that only one crimson tag, “藥名”, remains. It marks up all the named entities in the text that were previously marked up by the tag “DrugName 藥名”.

This completes the revision of this tag. Do the same for the other four tags, until you have revised them all.

Jiyan Qiao, PhD candidate, Leiden University

Acquiring texts for MARKUS: a brief introduction to Chinese local gazetteer databases

LIU Jialong, PhD candidate at Leiden University

Since 2017, I have been participating in a project, led by Professor Hilde De Weerdt with the assistance of Xiong Huei-lan, that uses digital approaches to investigate the history of construction in imperial China. We selected the construction of city walls (城墻) for a pilot study. An important step in our research is to use MARKUS to mark up the information we want to analyze in local gazetteers (for a discussion of how to mark up and extract data with regular expressions in MARKUS, see Xiong Huei-lan’s post). The first step in this process is to identify and import or upload relevant primary source texts. In this post, I discuss where relevant sources can be accessed in local gazetteer databases.

Before acquiring texts from databases, we should evaluate them and be fully aware of the advantages and shortcomings of each. For example, how many gazetteers does a database include? Can we browse the content of gazetteers? Can we download texts? Can we check the text against the original images in the database? What retrieval strategies does a database offer? Does it provide useful tools? Based on my own experience, I briefly introduce below a selection of commonly used Chinese local gazetteer databases.

(1) Zhongguo fangzhiku 中國方志庫

10,000 gazetteers are planned for inclusion, of which 4,000 are available

users can easily browse the content and compare the full text to original images of the titles included

texts can be downloaded (to use them in MARKUS, save the downloaded content as a .txt file)

full-text retrieval is available

(2) Zhongguo shuzi fangzhiku 中國數字方志庫

11,000 gazetteers are included (This makes it the largest classical Chinese local gazetteer database I have used. Two examples: for Chongming County 崇明, 中國數字方志庫 includes 6 gazetteers from different periods, whereas 中國方志庫 has 4. The latter lacks two editions, from the 雍正 (1723-1735) and 民國 (1912-1949) periods. For Taicang Prefecture 太倉, 中國數字方志庫 has 8 gazetteers, whereas 中國方志庫 has 5. The latter lacks two editions, from the 光緒 (1875-1908) and 宣統 (1909-1911) periods, and an undated manuscript named 太倉衛志.)

users can browse the content and have access to the original images of books

texts can be downloaded

full-text retrieval is available (However, users cannot search gazetteers based on region. By contrast, 中國方志庫 provides such a retrieval strategy.)

the interface of 中國數字方志庫 is not as user-friendly as that of 中國方志庫, especially in the font and layout of the full-text reading view

中國方志庫

中國數字方志庫

(3) Xin fangzhi 新方志

40,000 gazetteers are included (it covers only gazetteers compiled after 1949)

users can browse the content and directly read the original texts in PDF format

users can download the PDF files

full-text retrieval is not available (search results are limited to chapters)

(4) Shuzi fangzhi 數字方志

includes 6,000+ gazetteers, compiled before the end of the Qing dynasty, that are held by the National Library of China

users can browse the content but the website often crashes

users can read the scanned images of gazetteers but cannot download texts

full-text retrieval is not available

(5) Zhongguo dalu ge sheng difangzhi shumu chaxun xitong 中國大陸各省地方志書目查詢系統

a catalogue developed by Academia Sinica in Taiwan covering local gazetteers compiled both before and after 1949

users cannot browse the content of gazetteers

(6) Regional databases set up by regional libraries

This kind of database usually includes the local gazetteers of one region, regardless of when they were compiled. Below I use Beijingshi shuzi fangzhi guan 北京市數字方志館 as an example:

the database claims to include all local gazetteers about Beijing

PDF versions of gazetteers (scanned images of books) are offered online for users to browse

texts cannot be downloaded

full-text retrieval is unavailable

Before uploading texts acquired from databases to MARKUS, we should clean the digital text. When doing so, we should pay attention to blank spaces. In the texts acquired from 中國方志庫, a blank space is used to indicate notes (注) in the original text. Such features can be put to use, for example, to mark up notes in MARKUS.
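Because these blank spaces carry meaning, a cleaning step should tidy spacing without deleting the spaces inside a line. The following is only a minimal Python sketch of that idea. The function names are hypothetical, and the choices to treat ideographic spaces (U+3000) and tabs as ordinary spaces and to drop empty lines are assumptions; check them against the actual downloaded file before relying on them.

```python
import re


def tidy_whitespace(text: str) -> str:
    """Normalize spacing while keeping the single spaces inside each line."""
    cleaned_lines = []
    for line in text.splitlines():
        # Assumption: ideographic spaces (U+3000) and tabs act as plain spaces.
        line = line.replace("\u3000", " ").replace("\t", " ")
        # Collapse runs of spaces to one, then drop spaces at the line edges.
        line = re.sub(r" {2,}", " ", line).strip()
        if line:  # skip lines that are empty after cleaning
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


def count_internal_spaces(text: str) -> int:
    """Count the spaces that survive cleaning (candidate note boundaries)."""
    return tidy_whitespace(text).count(" ")
```

Calling `tidy_whitespace` on the raw text and saving the result as a .txt file keeps every internal space available as a possible note marker, while `count_internal_spaces` gives a quick sense of how many such markers a text contains.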

Variants of Chinese characters (異體字) also deserve special attention, as they can influence the results of markup in MARKUS. I would advise standardizing all variants. If you cannot recognize a variant of a character, you can consult the dictionary provided by 中國方志庫 or a specialized online dictionary of variants, 異體字字典. (Xiong Huei-lan also describes the difficulty caused by variants when using regular expressions in her post.)
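Standardizing variants can be done with a simple character-for-character replacement once you have decided which form to keep. The sketch below is illustrative only: `VARIANT_MAP` is a hypothetical, three-entry table, and which form counts as the standard one is an assumption that each project has to settle for itself, for instance with the help of the dictionaries mentioned above.

```python
from collections import Counter

# Hypothetical mapping from variant form to the form chosen as standard.
# Extend it with the variants that actually occur in your own texts.
VARIANT_MAP = {
    "墻": "牆",
    "爲": "為",
    "羣": "群",
}


def standardize_variants(text: str, variant_map: dict = VARIANT_MAP) -> str:
    """Replace every variant character with its chosen standard form."""
    return text.translate(str.maketrans(variant_map))


def count_variants(text: str, variant_map: dict = VARIANT_MAP) -> dict:
    """Report how often each listed variant occurs, before replacing it."""
    return dict(Counter(ch for ch in text if ch in variant_map))
```

Running `count_variants` first shows which variants a text actually contains, which helps to review the mapping before `standardize_variants` changes anything.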

Depending on the quality of the OCR, there may be missing or wrong characters in the texts you acquire from electronic databases. Such problems should be addressed when cleaning the text. After cleaning your texts, upload them to MARKUS and explore them with the various functions the platform provides.
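Wrong characters can only be found by reading the text against the original images, but placeholders for missing characters can be located automatically before upload. The sketch below assumes that missing or unreadable characters show up as "□", as the Unicode replacement character (U+FFFD), or as "?"; these are assumptions, not documented conventions of any of the databases above, and `flag_suspect_lines` is a hypothetical helper name.

```python
# Assumed placeholders for characters that the OCR or encoding lost.
SUSPECT_CHARS = {"\ufffd", "□", "?"}


def flag_suspect_lines(text: str, suspects: set = SUSPECT_CHARS) -> list:
    """Return (line number, suspect characters found) for lines to recheck."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        found = sorted(set(line) & set(suspects))
        if found:
            flagged.append((number, found))
    return flagged
```

The returned line numbers point to the places worth comparing with the scanned images where a database offers them.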
