SIGMOD 2017 Tutorial: Building Structured Databases of Factual Knowledge from Massive Text Corpora
Time: 9-10:30am, May 19
Location: Stevens Salon C-5
In today's computerized and information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media posts, and scientific publications to a wide range of textual information from various domains (corporate reports, advertisements, legal acts, medical reports). To turn such massive unstructured text data into structured, actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations) in the text.
In this tutorial, we introduce data-driven methods for mining structured facts (i.e., entities and their relations/attributes for types of interest) from massive text corpora in order to construct structured databases of factual knowledge (called StructDBs). State-of-the-art information extraction systems rely heavily on large amounts of task- and corpus-specific labeled data, usually created by domain experts. In practice, the scale and efficiency of such manual annotation are limited, especially when dealing with text corpora of various kinds (domains, languages, genres). We focus on methods that are minimally supervised, domain-independent, and language-independent, enabling timely StructDB construction across application domains (news, social media, biomedical, business), and we demonstrate on real datasets how these StructDBs aid data exploration and knowledge discovery.
University of Illinois at Urbana-Champaign
Outline & Slides
- Introduction [PDF]
- Structured network of factual knowledge
- Text to network to knowledge
- Supervised approaches
- Unsupervised approaches
- Weakly and distantly supervised phrase mining
- Entity recognition and coarse-grained typing
- Fine-grained entity typing
- Joint extraction of typed entities and relations
- Supervised attribute learning
- Pattern-based bootstrapping
Code & Systems
[CoType] [AFET] [PLE] [ClusType] [AutoPhrase] [SegPhrase] [TopMine] [MetaPAD]
MetaPAD: Meta Pattern Discovery from Massive Text Corpora
Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M. Kaplan, Timothy P. Hanratty, Jiawei Han.
ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2017.
CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases
Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, Tarek F. Abdelzaher, Jiawei Han.
International World-Wide Web Conference (WWW), 2017.
Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding
Xiang Ren, Wenqi He, Meng Qu, Heng Ji, Clare R. Voss, Jiawei Han.
In Proc. 2016 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2016.
AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding
Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, Jiawei Han.
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
Scalable Topical Phrase Mining from Text Corpora
Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, Jiawei Han.
Proceedings of the VLDB Endowment, vol. 8, no. 3, 2015.
ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering
Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han.
ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2015.
Mining Quality Phrases from Massive Text Corpora
Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han.
ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), 2015.
Xiang Ren, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on creating computational tools for better understanding and exploring massive text data. He has published over 25 papers in major conferences. He received a Google PhD Fellowship in Structured Data and Database Management in 2016, the KDD Rising Star recognition by Microsoft Academic Search in 2016, the C. W. Gear Outstanding Graduate Student Award from CS@Illinois in 2016, and the Yahoo!-DAIS Research Excellence Award in 2015. Mr. Ren has extensive experience delivering tutorials at major conferences, including SIGKDD 2015, SIGMOD 2016, and WWW 2016.
Meng Jiang, Postdoctoral Research Associate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on behavioral modeling and social media analysis. He received his Ph.D. in Computer Science from Tsinghua University, Beijing, in 2015. His Ph.D. thesis won the Dissertation Award at Tsinghua. His recent research was a SIGKDD 2014 Best Paper Finalist. His ICDM 2015 Tutorial won the honorarium.
Jingbo Shang, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on mining and constructing structured knowledge from massive text corpora. He is the recipient of the Computer Science Excellence Scholarship and the Grand Prize of the Yelp Dataset Challenge in 2015.
Jiawei Han, Abel Bliss Professor, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research areas encompass data mining, data warehousing, information network analysis, and database systems, with over 600 conference and journal publications. He is a Fellow of the ACM and a Fellow of the IEEE, and received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), and the IEEE Computer Society W. Wallace McDowell Award (2009). His co-authored textbook "Data Mining: Concepts and Techniques", 3rd ed. (Morgan Kaufmann, 2011), has been widely adopted worldwide.