Here are some of the research projects I’ve worked on, with a short description of each. For a full list of publications, see publications.
We use corpus and computational methods to explore lexical, syntactic and discourse-level features of (human) translated Chinese, compared with original Chinese. The broader goal is to analyze and understand translationese, or 翻译腔.
This is joint work with Chien-Jer Charles Lin, Sandra Kübler, Wen Li, and others. In particular, Ruoze Huang (黄若泽) from Xiamen University/The Chinese University of Hong Kong first suggested this project and has helped with corpus compilation and data analysis.
Currently we have the following papers and abstracts, either accepted or in preparation.
Hu, Hai, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Sandra Kübler, and Chien-Jer Charles Lin (2020). “Building a Literary Treebank for Translation Studies in Chinese”. In: Proceedings of 19th International Workshop on Treebanks and Linguistic Theories (TLT). pp. 18-31. paper.
Hu, Hai and Sandra Kübler. (2020). Investigating Translated Chinese and Its Variants Using Machine Learning. In Natural Language Engineering (Special Issue: NLP for Similar Languages, Varieties and Dialects). pp. 1-34. paper. code.
Hu, Hai, Wen Li, and Sandra Kübler. (2018). Detecting Syntactic Features of Translated Chinese. In Proceedings of the 2nd Workshop on Stylistic Variation at NAACL 2018, pp. 20-28. New Orleans, Louisiana, USA. paper. slides. video presentation.
Lin, Chien-Jer Charles and Hai Hu. (2018). Syntactic Complexity as a Measure of Linguistic Authenticity in Modern Chinese. Presented at The 26th Annual Conference of the International Association of Chinese Linguistics (IACL-26) & The 20th International Conference on Chinese Language and Culture (ICCLC-20). Madison, Wisconsin, USA.
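As a toy illustration of the machine-learning approach to detecting translated Chinese described above (not our actual experiments, which are in the papers), here is a minimal character-bigram Naive Bayes classifier. The two-sentence “corpus” and its labels are invented purely for illustration.

```python
# Toy sketch: distinguishing translated vs. original Chinese with
# character-bigram features and multinomial Naive Bayes.
# The training "corpus" below is invented for illustration only.
from collections import Counter
import math

def bigrams(text):
    return [text[i:i+2] for i in range(len(text) - 1)]

def train(docs):
    """docs: list of (text, label). Returns per-label bigram counts."""
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(bigrams(text))
    return counts

def classify(text, counts, alpha=1.0):
    """Pick the label with the highest (add-alpha smoothed) log-likelihood."""
    vocab = set().union(*counts.values())
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        lp = sum(math.log((c[b] + alpha) / (total + alpha * len(vocab)))
                 for b in bigrams(text))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

corpus = [
    ("他耸了耸肩，说这不是他的错", "translated"),  # invented example
    ("他说这事儿跟他没关系", "original"),          # invented example
]
model = train(corpus)
print(classify("她耸了耸肩", model))
```

In the real studies we of course use much richer lexical, syntactic, and discourse-level features and larger corpora; this only shows the shape of the classification setup.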
A paper on lexical and syntactic variation and on the stylistics of the Chinese language, reporting the IUCL system in the VarDial 2019 evaluation campaign, which aims to automatically distinguish Mainland Chinese news from Taiwanese news. Our system ranked 1st and 2nd on the two tracks, respectively.
* equal contributions
Using computational linguistics techniques to analyze business text:
Current Version: Sept. 6, 2020 (Major Revision at Journal of Operations Management). See Peng’s page for more information.
This project uses natural logic to answer textual entailment questions such as: if we know that ‘All students party on New Year’s Eve’ and that ‘Most students get drunk in every party’, does it follow that ‘Most PhD students get drunk on New Year’s Eve’? We use monotonicity calculus and Combinatory Categorial Grammar (CCG) to solve these problems.
The *SEM paper below describes how we polarize the words and phrases in a sentence; the IWCS paper discusses how our (still-developing) inference system works. For more materials, please see this webpage from Larry, or check out this cool 3-min video made by Larry.
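To give a flavor of monotonicity-based inference (a deliberately simplified sketch, not the actual MonaLog system): once every token is marked with a polarity, a word in an upward-monotone (↑) position may be replaced by a more general word, and a word in a downward-monotone (↓) position by a more specific one. The tiny knowledge base and the pre-polarized tokens below are hand-written assumptions for illustration.

```python
# Toy sketch of monotonicity-based inference (illustrative only;
# not the actual MonaLog implementation).

# Hand-written hypernym pairs: (more_specific, more_general).
KB = {("PhD students", "students")}

def more_specific(a, b):
    return (a, b) in KB

def entails(premise, hypothesis):
    """premise/hypothesis: lists of (token, polarity) pairs, where
    polarity '+' means upward monotone and '-' downward monotone.
    Checks whether hypothesis follows by monotonicity replacements."""
    if len(premise) != len(hypothesis):
        return False
    for (p, pol), (h, _) in zip(premise, hypothesis):
        if p == h:
            continue
        # upward position: may generalize p to a hypernym h
        if pol == "+" and more_specific(p, h):
            continue
        # downward position: may specialize p to a hyponym h
        if pol == "-" and more_specific(h, p):
            continue
        return False
    return True

# 'All' is downward monotone in its restrictor, upward in its scope,
# so 'All students party' entails 'All PhD students party':
premise = [("All", "+"), ("students", "-"), ("party", "+")]
hypothesis = [("All", "+"), ("PhD students", "-"), ("party", "+")]
print(entails(premise, hypothesis))  # True
```

The real system computes the polarities automatically from CCG derivations and searches over many replacements; this sketch only shows the single replacement step.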
Hu, Hai, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, and Larry Moss. (2020). OCNLI: Original Chinese Natural Language Inference. In: Findings of EMNLP. paper. code and data. leaderboard.
Richardson, Kyle, Hai Hu, Larry Moss, and Ashish Sabharwal. (2020). Probing Natural Language Inference Models through Semantic Fragments. In: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence. pp. 8713-8721. paper. code and data.
Hu, Hai, Qi Chen, Kyle Richardson, Atreyee Mukherjee, Lawrence S. Moss, and Sandra Kübler. (2020). MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity. In: Proceedings of the Society for Computation in Linguistics 2020. pp. 319-329. paper. poster.
Hu, Hai, and Lawrence S. Moss. (2020). An Automatic Monotonicity Annotation Tool Based on CCG Trees. Presentation at the Second Tsinghua Interdisciplinary Workshop on Logic, Language, and Meaning: Monotonicity in Logic and Language. abstract.
Hu, Hai, Qi Chen and Larry Moss. (2019). Natural Language Inference with Monotonicity. In Proceedings of the 13th International Conference on Computational Semantics (IWCS 2019), pp. 8–15. Gothenburg, Sweden. paper.
Hu, Hai, and Lawrence S. Moss. (2018). Polarity Computations in Flexible Categorial Grammar. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics: *SEM at NAACL 2018, pp. 124–129. New Orleans, Louisiana, USA. paper. poster. code.
Hu, Hai, Thomas Icard and Larry Moss. (2018). Automated Reasoning from Polarized Parse Trees. In Proceedings of the Fifth Workshop on Natural Language and Computer Science. Oxford, England. paper.
We built the first comprehensive NLU benchmark for Chinese: CLUE (GitHub page).
We held a shared task to test lightweight models for Chinese NLU.
In this project, I’m mainly concerned with the usage of English acronyms such as ‘GDP’ and ‘WTO’ in Chinese. Many have expressed worries about the “contamination” of Chinese by incoming English acronyms and words, but very few descriptive studies have been done. The following are some of the first attempts to quantify the degree of acronym usage and the possible factors that drive language users to prefer “GDP” over “国民生产总值”.
Hu, Hai (2018). English Acronyms in Chinese Texts: Diachronic Change and Synchronic Prediction. Presented at the 30th North American Conference on Chinese Linguistics. Columbus, Ohio.
Hu, Hai (2016). Is China entering WTO or shijie maoyi zuzhi–A Corpus-based Study of English Acronyms in Chinese Newspapers. In: Proceedings of the 28th North American Conference on Chinese Linguistics. Provo, Utah. paper. abstract.
People’s Daily on English words in Chinese: 中外文夹杂 真让人犯晕
A project about the raising of /an/ in the Chengdu dialect of Mandarin. Please see the abstracts for more information.
Hu, Hai, Aini Li, Yiwen Zhang and Phillip Weirich. (2019). Vowel Raising in the Chengdu Dialect of Mandarin. Poster at Linguistic Society of America 2019 Annual Meeting. New York, NY. abstract. poster.
Hu, Hai and Yiwen Zhang. (2017). Path of Vowel Raising in Chengdu Dialect of Mandarin. In Proceedings of the 29th North American Conference on Chinese Linguistics. Rutgers, NJ. paper. abstract.
Zhang, Yiwen and Hai Hu. (2017). Vowel Raising in Chengdu Dialect of Mandarin. Poster presented at Linguistic Society of America 2017 Annual Meeting. Austin, TX. abstract.
We show that outputting the k-best segmentations and passing all of them to a lattice parser benefits both word segmentation and parsing, compared with the traditional pipeline that passes only the single best segmentation to the parser:
The segmenter is part of the Free Linguistic Environment at LINGUIST List:
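To illustrate the idea of passing k-best segmentations to a lattice parser (a toy sketch only; the vocabulary, scoring, and lattice format here are invented for illustration, not those of the FLE segmenter): instead of committing to one segmentation, we keep the k best and merge them into a word lattice over character offsets, which a downstream parser can then disambiguate.

```python
# Toy sketch: k-best segmentation merged into a word lattice.
# Vocabulary and the "fewer words is better" score are made up.

VOCAB = {"下", "雨", "下雨", "天", "留", "客", "留客", "天留"}

def segmentations(s):
    """Yield all segmentations of s into vocabulary words."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        word = s[:i]
        if word in VOCAB:
            for rest in segmentations(s[i:]):
                yield [word] + rest

def k_best(s, k=3):
    # toy score: prefer segmentations with fewer (hence longer) words
    return sorted(segmentations(s), key=len)[:k]

def lattice(segs):
    """Merge segmentations into edges (start, end, word) over char offsets;
    shared words become shared edges, so the parser sees one compact graph."""
    edges = set()
    for seg in segs:
        pos = 0
        for w in seg:
            edges.add((pos, pos + len(w), w))
            pos += len(w)
    return sorted(edges)

best = k_best("下雨天留客", k=3)
print(best)
print(lattice(best))
```

The lattice parser then scores paths through this graph jointly with the syntactic analysis, which is how parsing can feed back into the segmentation decision.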
Lin, Chien-Jer Charles, & Hu, Hai. (in press). Linking comprehension and production: Frequency distribution of Chinese relative clauses in the Sinica Treebank. In Chu-Ren Huang, Shukai Hsieh, & Peng Jin (eds.) Text, Speech, and Language Technology Series. Springer.
Zhang, Yiwen, Hu, Hai, & Lin, Chien-Jer Charles. (2018). Nouns and verbs behave differently as fillers: Expectation and interference in constructing long-distance dependencies. Poster presented at the 31st annual CUNY Human Sentence Processing Conference, Davis, CA. poster.
Zhang, Yiwen, Hu, Hai, & Lin, Chien-Jer Charles. (2017). Processing verbs with ambiguous complement structures. Poster presented at the Workshop on East Asian Psycholinguistics: Recent developments, University of Hawaii at Mānoa, Honolulu, HI, October 15, 2017.