Research projects

Here are some of the research projects I’ve worked on, with a short description of each. For a list of publications, see publications.

Stylistic features of original and translated Chinese

We use corpus and computational methods to explore lexical, syntactic and discourse-level features of (human) translated Chinese, compared with original Chinese. The broader goal is to analyze and understand translationese, or 翻译腔.

This is joint work with Chien-Jer Charles Lin, Sandra Kübler, Wen Li, and others. In particular, Ruoze Huang (黄若泽) from Xiamen University/The Chinese University of Hong Kong first suggested this project and has helped with corpus compilation and data analysis.

We currently have the following papers and abstracts, accepted or in preparation.

Other projects involving text features (of English/Chinese)

A paper on lexical and syntactic variation and stylistics in Chinese. It reports on the IUCL system in the VarDial 2019 campaign, which aims to automatically distinguish Mainland news from Taiwanese news. Our system ranked 1st and 2nd on two tracks, respectively.

* equal contributions

Using computational linguistics techniques to analyze business text:

Current Version: Sept. 6, 2020 (Major Revision at Journal of Operations Management). See Peng’s page for more information.

Natural language inference

This project uses natural logic to answer textual entailment questions such as: if we know that ‘All students party on New Year’s Eve’ and that ‘Most students get drunk in every party’, does it follow that ‘Most PhD students get drunk on New Year’s Eve’? We use monotonicity calculus and Combinatory Categorial Grammar (CCG) to solve these problems.
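
As a rough illustration of the monotonicity idea (not the system described in the papers), the toy sketch below assigns each determiner a polarity for its restrictor and its scope, then checks whether replacing a phrase with a narrower or broader one is licensed. The signature table, the `SUBSUMES` pairs, and all function names are assumptions made for this example; a real system would derive polarities from a CCG parse rather than from a fixed (determiner, restrictor, scope) triple.

```python
# Monotonicity signature per determiner: (restrictor, scope).
# "+" = upward monotone, "-" = downward monotone, "=" = non-monotone.
SIGNATURES = {
    "all": ("-", "+"),
    "every": ("-", "+"),
    "some": ("+", "+"),
    "no": ("-", "-"),
    "most": ("=", "+"),
}

# Hypothetical toy knowledge base of (narrower, broader) phrase pairs.
SUBSUMES = {
    ("PhD students", "students"),
    ("get drunk", "drink"),
}


def is_narrower(a, b):
    """True if phrase a is the same as, or narrower than, phrase b."""
    return a == b or (a, b) in SUBSUMES


def licensed(polarity, old, new):
    """Check whether replacing `old` with `new` preserves truth."""
    if polarity == "+":
        return is_narrower(old, new)  # upward: may generalize
    if polarity == "-":
        return is_narrower(new, old)  # downward: may specialize
    return old == new                 # non-monotone: no replacement


def entails(premise, hypothesis):
    """Both arguments are (determiner, restrictor, scope) triples."""
    det_p, restr_p, scope_p = premise
    det_h, restr_h, scope_h = hypothesis
    if det_p != det_h or det_p not in SIGNATURES:
        return False
    restr_pol, scope_pol = SIGNATURES[det_p]
    return (licensed(restr_pol, restr_p, restr_h)
            and licensed(scope_pol, scope_p, scope_h))


print(entails(("all", "students", "party"),
              ("all", "PhD students", "party")))
print(entails(("most", "students", "get drunk"),
              ("most", "PhD students", "get drunk")))
```

The first check succeeds because the restrictor of ‘all’ is downward monotone, so ‘students’ can be narrowed to ‘PhD students’. The second fails because ‘most’ is not monotone in its restrictor, which is why examples with ‘most’ need more than a single replacement step.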

The *SEM paper below describes how we polarize the words and phrases in a sentence; the IWCS paper discusses how our (still-developing) inference system works. For more materials please see this webpage from Larry. Or check out this cool 3-min video made by Larry.

Natural language understanding

We built the first comprehensive NLU benchmark for Chinese: CLUE (github page)

We held a shared task to test lightweight models for Chinese NLU.

Usage of English acronyms in Chinese text

In this project, I’m mainly concerned with the usage of English acronyms such as ‘GDP’ and ‘WTO’ in Chinese. Many have expressed worries about the “contamination” of Chinese by incoming English acronyms and words, but very few descriptive studies have been done. The following are some of the first attempts to quantify the degree of acronym usage and the possible factors that drive language users to prefer “GDP” over “国内生产总值”.

People’s Daily on English words in Chinese: 中外文夹杂 真让人犯晕 (“Mixing Chinese and foreign languages really makes one dizzy”)
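
To make the idea of quantifying acronym usage concrete, here is a minimal sketch that counts runs of two or more uppercase Latin letters in a Chinese text and normalizes by text length. It is not the method used in the studies above: the regular expression, the per-1,000-characters measure, the function names, and the sample sentence are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Two or more uppercase ASCII letters, not embedded in a longer Latin word.
ACRONYM = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")


def acronym_counts(text):
    """Count each acronym-like token found in the text."""
    return Counter(ACRONYM.findall(text))


def acronyms_per_1000_chars(text):
    """Acronym tokens per 1,000 non-whitespace characters."""
    n_chars = sum(1 for ch in text if not ch.isspace())
    if n_chars == 0:
        return 0.0
    return 1000 * sum(acronym_counts(text).values()) / n_chars


# Made-up sample sentence, for illustration only.
sample = "今年GDP增长放缓，WTO发布报告，GDP数据引发讨论。"
print(acronym_counts(sample))
print(acronyms_per_1000_chars(sample))
```

A pattern like this treats any all-caps string as an acronym, so a real study would probably need a curated list or manual checking to exclude brand names and other non-acronym strings.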

Chengdu dialect of Mandarin

A project about the raising of /an/ in the Chengdu dialect of Mandarin. Please see the abstracts for more information.

Chinese segmentation and parsing

We show that outputting the k best segmentations and passing all of them to a lattice parser benefits both word segmentation and parsing, compared to the traditional method of passing only the single best segmentation to the parser.
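
One way to picture the input to a lattice parser is as a set of word edges over character offsets, merged from the k best segmentations. The sketch below is an assumption-laden illustration, not the segmenter or parser from this project: `build_lattice` and the two example segmentations are hypothetical, and a real lattice would typically also carry segmentation scores.

```python
def build_lattice(segmentations):
    """Merge k segmentations of one sentence into a word lattice.

    Each segmentation is a list of words. The lattice is returned as a
    sorted list of (start, end, word) edges over character offsets.
    """
    if not segmentations:
        raise ValueError("need at least one segmentation")
    text = "".join(segmentations[0])
    edges = set()
    for seg in segmentations:
        if "".join(seg) != text:
            raise ValueError("segmentations must cover the same text")
        pos = 0
        for word in seg:
            edges.add((pos, pos + len(word), word))
            pos += len(word)
    return text, sorted(edges)


# Two hypothetical segmentations of the same six-character string.
k_best = [
    ["研究", "生命", "起源"],
    ["研究生", "命", "起源"],
]
text, lattice = build_lattice(k_best)
for start, end, word in lattice:
    print(start, end, word)
```

Edges shared by several segmentations (here 起源) appear only once, so the parser can choose among the competing paths instead of committing to the segmenter's top choice.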

The segmenter is part of the Free Linguistic Environment at LINGUIST List:

Sentence processing and relative clauses