Research

My research spans the areas of data mining, machine learning and natural language processing, focusing on making sense of massive text corpora.

My dissertation research is on constructing structured networks of factual knowledge from unstructured text corpora, to support data exploration ("Find the natural disasters that happened in the Asia-Pacific area in 2016."), power intelligent systems ("Where did the terrorist attacks happen? What organizations were involved?"), and facilitate knowledge discovery ("Is bacterium X or gene Y a potential cause of disease Z?").

[Effort-Light TextStruct]  State-of-the-art information extraction (IE) systems rely heavily on large amounts of task-specific labeled data to train supervised models (e.g., deep neural networks). In practice, such manual curation is limited in both scale and efficiency, especially when dealing with text corpora of various kinds. A crucial question that runs through my research is: how can we design a generic solution for efficiently constructing customized machine-learning models for a given text corpus, without explicit human labeling effort?

Here are three representative works addressing the above question:

I gave tutorials at KDD, WWW and SIGMOD on entity recognition and typing. Check out my Research Statement for more details.

[Impact]  Systems and algorithms we developed have been successful across domains and disciplines: our entity extraction system was shipped as part of production systems at Microsoft Bing and the U.S. Army Research Lab; our phrase mining tool won the grand prize of the 2015 Yelp Dataset Challenge and was adopted by TripAdvisor; our survival prediction algorithm was the Task 1 winner of the Prostate Cancer DREAM Challenge. Check out our event exploration system on protest-related news corpora.

[Collaboration]  I am passionate about applying my techniques to multidisciplinary applications in life science, health science, social science, economics, and public policy. I am also broadly interested in problems in machine learning (weakly supervised methods, relational learning, low-rank/sparse matrix approximation) and natural language processing (information extraction, segmentation, tagging, summarization).