Here are some of the research projects I’ve worked on, with a short description of each. For a full list of publications, see publications.
We use corpus and computational methods to explore lexical, syntactic and discourse-level features of (human) translated Chinese, compared with original Chinese. The broader goal is to analyze and understand translationese, or 翻译腔.
This is joint work with Chien-Jer Charles Lin, Sandra Kübler, Wen Li, and others. In particular, Ruoze Huang (黄若泽) from Xiamen University/The Chinese University of Hong Kong first suggested this project and has helped with corpus compilation and data analysis.
Currently we have the following papers and abstracts, either accepted or in preparation.
Hu, Hai, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Sandra Kübler, and Chien-Jer Charles Lin (2020). “Building a Literary Treebank for Translation Studies in Chinese”. In: Proceedings of 19th International Workshop on Treebanks and Linguistic Theories (TLT). pp. 18-31. paper.
Hu, Hai and Sandra Kübler. (2020). Investigating Translated Chinese and Its Variants Using Machine Learning. In Natural Language Engineering (Special Issue: NLP for Similar Languages, Varieties and Dialects). pp. 1-34. paper. code.
Hu, Hai, Wen Li, and Sandra Kübler. (2018). Detecting Syntactic Features of Translated Chinese. In Proceedings of the 2nd Workshop on Stylistic Variation at NAACL 2018, pp. 20-28. New Orleans, Louisiana, USA. paper. slides. video presentation.
Lin, Chien-Jer Charles and Hai Hu. (2018). Syntactic Complexity as a Measure of Linguistic Authenticity in Modern Chinese. Presented at The 26th Annual Conference of the International Association of Chinese Linguistics (IACL-26) & The 20th International Conference on Chinese Language and Culture (ICCLC-20). Madison, Wisconsin, USA.
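As a toy illustration of the machine-learning approach to detecting translated Chinese described above (not our actual experiments, which are in the papers), here is a minimal character-bigram Naive Bayes classifier. The two-sentence “corpus” and its labels are invented purely for illustration.

```python
# Toy sketch: distinguishing translated vs. original Chinese with
# character-bigram features and multinomial Naive Bayes.
# The training "corpus" below is invented for illustration only.
from collections import Counter
import math

def bigrams(text):
    return [text[i:i+2] for i in range(len(text) - 1)]

def train(docs):
    """docs: list of (text, label). Returns per-label bigram counts."""
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(bigrams(text))
    return counts

def classify(text, counts, alpha=1.0):
    """Pick the label with the highest (add-alpha smoothed) log-likelihood."""
    vocab = set().union(*counts.values())
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        lp = sum(math.log((c[b] + alpha) / (total + alpha * len(vocab)))
                 for b in bigrams(text))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

corpus = [
    ("他耸了耸肩，说这不是他的错", "translated"),  # invented example
    ("他说这事儿跟他没关系", "original"),          # invented example
]
model = train(corpus)
print(classify("她耸了耸肩", model))
```

In the real studies we of course use much richer lexical, syntactic, and discourse-level features and larger corpora; this only shows the shape of the classification setup.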
A paper on lexical and syntactic variation and on the stylistics of the Chinese language, reporting the IUCL system in the VarDial 2019 evaluation campaign, which aims to automatically distinguish Mainland Chinese news from Taiwanese news. Our system ranked 1st and 2nd on the two tracks, respectively.
* equal contributions
Using computational linguistics techniques to analyze business text:
Current Version: Sept. 6, 2020 (Major Revision at Journal of Operations Management). See Peng’s page for more information.
This project uses natural logic to answer textual entailment questions such as: if we know that ‘All students party on New Year’s Eve’ and that ‘Most students get drunk in every party’, does it follow that ‘Most PhD students get drunk on New Year’s Eve’? We use monotonicity calculus and Combinatory Categorial Grammar (CCG) to solve these problems.
The *SEM paper below describes how we polarize the words and phrases in a sentence; the IWCS paper discusses how our (still-developing) inference system works. For more materials, please see this webpage from Larry, or check out this cool 3-min video made by Larry.
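To give a flavor of monotonicity-based inference (a deliberately simplified sketch, not the actual MonaLog system): once every token is marked with a polarity, a word in an upward-monotone (↑) position may be replaced by a more general word, and a word in a downward-monotone (↓) position by a more specific one. The tiny knowledge base and the pre-polarized tokens below are hand-written assumptions for illustration.

```python
# Toy sketch of monotonicity-based inference (illustrative only;
# not the actual MonaLog implementation).

# Hand-written hypernym pairs: (more_specific, more_general).
KB = {("PhD students", "students")}

def more_specific(a, b):
    return (a, b) in KB

def entails(premise, hypothesis):
    """premise/hypothesis: lists of (token, polarity) pairs, where
    polarity '+' means upward monotone and '-' downward monotone.
    Checks whether hypothesis follows by monotonicity replacements."""
    if len(premise) != len(hypothesis):
        return False
    for (p, pol), (h, _) in zip(premise, hypothesis):
        if p == h:
            continue
        # upward position: may generalize p to a hypernym h
        if pol == "+" and more_specific(p, h):
            continue
        # downward position: may specialize p to a hyponym h
        if pol == "-" and more_specific(h, p):
            continue
        return False
    return True

# 'All' is downward monotone in its restrictor, upward in its scope,
# so 'All students party' entails 'All PhD students party':
premise = [("All", "+"), ("students", "-"), ("party", "+")]
hypothesis = [("All", "+"), ("PhD students", "-"), ("party", "+")]
print(entails(premise, hypothesis))  # True
```

The real system computes the polarities automatically from CCG derivations and searches over many replacements; this sketch only shows the single replacement step.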
Hu, Hai, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, and Larry Moss. (2020). OCNLI: Original Chinese Natural Language Inference. In: Findings of EMNLP. paper. code and data. leaderboard.
Richardson, Kyle, Hai Hu, Larry Moss, and Ashish Sabharwal. (2020). Probing Natural Language Inference Models through Semantic Fragments. In: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence. pp. 8713-8721. paper. code and data.
Hu, Hai, Qi Chen, Kyle Richardson, Atreyee Mukherjee, Lawrence S. Moss, and Sandra Kübler. (2020). MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity. In: Proceedings of the Society for Computation in Linguistics 2020. pp. 319-329. paper. poster.
Hu, Hai, and Lawrence S. Moss. (2020). An Automatic Monotonicity Annotation Tool Based on CCG Trees. Presentation at the Second Tsinghua Interdisciplinary Workshop on Logic, Language, and Meaning: Monotonicity in Logic and Language. abstract.
Hu, Hai, Qi Chen and Larry Moss. (2019). Natural Language Inference with Monotonicity. In Proceedings of the 13th International Conference on Computational Semantics (IWCS 2019), pp. 8–15. Gothenburg, Sweden. paper.
Hu, Hai, and Lawrence S. Moss. (2018). Polarity Computations in Flexible Categorial Grammar. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics: *SEM at NAACL 2018, pp. 124–129. New Orleans, Louisiana, USA. paper. poster. code.
Hu, Hai, Thomas Icard and Larry Moss. (2018). Automated Reasoning from Polarized Parse Trees. In Proceedings of the Fifth Workshop on Natural Language and Computer Science. Oxford, England. paper.
We built the first comprehensive NLU benchmark for Chinese: CLUE (GitHub page).
We held a shared task to test lightweight models for Chinese NLU.
In this project, I’m mainly concerned with the usage of English acronyms such as ‘GDP’ and ‘WTO’ in Chinese. Many have expressed worries about the “contamination” of Chinese by incoming English acronyms and words, but very few descriptive studies have been done. The following are some of the first attempts to quantify the degree of acronym usage and the possible factors that drive language users to prefer “GDP” over “国民生产总值”.
Hu, Hai (2018). English Acronyms in Chinese Texts: Diachronic Change and Synchronic Prediction. Presented at the 30th North American Conference on Chinese Linguistics. Columbus, Ohio.
Hu, Hai (2016). Is China entering WTO or shijie maoyi zuzhi–A Corpus-based Study of English Acronyms in Chinese Newspapers. In: Proceedings of the 28th North American Conference on Chinese Linguistics. Provo, Utah. paper. abstract.
People’s Daily on English words in Chinese: 中外文夹杂 真让人犯晕
A project about the raising of /an/ in the Chengdu dialect of Mandarin. Please see the abstracts for more information.
Hu, Hai, Aini Li, Yiwen Zhang and Phillip Weirich. (2019). Vowel Raising in the Chengdu Dialect of Mandarin. Poster at Linguistic Society of America 2019 Annual Meeting. New York, NY. abstract. poster.
Hu, Hai and Yiwen Zhang. (2017). Path of Vowel Raising in Chengdu Dialect of Mandarin. In Proceedings of the 29th North American Conference on Chinese Linguistics. Rutgers, NJ. paper. abstract.
Zhang, Yiwen and Hai Hu. (2017). Vowel Raising in Chengdu Dialect of Mandarin. Poster presented at Linguistic Society of America 2017 Annual Meeting. Austin, TX. abstract.
We show that outputting the k-best segmentations and passing all of them to a lattice parser benefits both word segmentation and parsing, compared with the traditional pipeline that passes only the single best segmentation to the parser:
The segmenter is part of the Free Linguistic Environment at LINGUIST List:
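To illustrate the idea of passing k-best segmentations to a lattice parser (a toy sketch only; the vocabulary, scoring, and lattice format here are invented for illustration, not those of the FLE segmenter): instead of committing to one segmentation, we keep the k best and merge them into a word lattice over character offsets, which a downstream parser can then disambiguate.

```python
# Toy sketch: k-best segmentation merged into a word lattice.
# Vocabulary and the "fewer words is better" score are made up.

VOCAB = {"下", "雨", "下雨", "天", "留", "客", "留客", "天留"}

def segmentations(s):
    """Yield all segmentations of s into vocabulary words."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        word = s[:i]
        if word in VOCAB:
            for rest in segmentations(s[i:]):
                yield [word] + rest

def k_best(s, k=3):
    # toy score: prefer segmentations with fewer (hence longer) words
    return sorted(segmentations(s), key=len)[:k]

def lattice(segs):
    """Merge segmentations into edges (start, end, word) over char offsets;
    shared words become shared edges, so the parser sees one compact graph."""
    edges = set()
    for seg in segs:
        pos = 0
        for w in seg:
            edges.add((pos, pos + len(w), w))
            pos += len(w)
    return sorted(edges)

best = k_best("下雨天留客", k=3)
print(best)
print(lattice(best))
```

The lattice parser then scores paths through this graph jointly with the syntactic analysis, which is how parsing can feed back into the segmentation decision.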
Lin, Chien-Jer Charles, & Hu, Hai. (in press). Linking comprehension and production: Frequency distribution of Chinese relative clauses in the Sinica Treebank. In Chu-Ren Huang, Shukai Hsieh, & Peng Jin (eds.) Text, Speech, and Language Technology Series. Springer.
Zhang, Yiwen, Hu, Hai, & Lin, Chien-Jer Charles. (2018). Nouns and verbs behave differently as fillers: Expectation and interference in constructing long-distance dependencies. Poster presented at the 31st annual CUNY Human Sentence Processing Conference, Davis, CA. poster.
Zhang, Yiwen, Hu, Hai, & Lin, Chien-Jer Charles. (2017). Processing verbs with ambiguous complement structures. Poster presented at the Workshop on East Asian Psycholinguistics: Recent developments, University of Hawaii at Mānoa, Honolulu, HI, October 15, 2017.