SIGMOD 2016 Tutorial: Automatic Entity Recognition and Typing in Massive Text Data

In today's computerized, information-based society, we are constantly exposed to vast amounts of natural language text, ranging from news articles, product reviews, and advertisements to a wide variety of user-generated content on social media. To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of entities and the relationships between them. In this tutorial, we introduce data-driven methods for recognizing typed entities of interest in different kinds of text data (especially massive, domain-specific text data). These methods can automatically identify token spans as entity mentions in text and label their types (e.g., person, product, organization) in a scalable way. We demonstrate on real datasets, including news articles and Yelp reviews, how these typed entities aid in knowledge discovery and management.

Xiang Ren1, Ahmed El-Kishky1, Heng Ji2, Jiawei Han1

University of Illinois at Urbana-Champaign1, Rensselaer Polytechnic Institute2

Outline

  1. Introduction to entity recognition and typing
  2. Entity recognition: An overview and phrase mining approaches
  3. Entity typing: An overview and network mining approach
  4. Trends and research problems

Slides

[PDF]

Code

[ClusType][SegPhrase][TopMine][PLE]

Publications

Presenters


Xiang Ren, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on knowledge acquisition from text data and mining linked data. In 2016, he received a Google PhD Fellowship for his work in Structured Data and Database Management. He received the C. L. and Jane W.-S. Liu Award and the Yahoo!-DAIS Research Excellence Gold Award in 2015, and a Microsoft Young Fellowship from Microsoft Research Asia in 2012.

Ahmed El-Kishky, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research interests include mining large unstructured data, text mining, and network mining. He is the recipient of both the National Science Foundation Graduate Research Fellowship and the National Defense Science and Engineering Fellowship.

Heng Ji, Edward P. Hamilton Development Chair Associate Professor, Computer Science Department, Rensselaer Polytechnic Institute. Her research interests focus on natural language processing and its connections with data mining and vision. She received the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013 and an NSF CAREER Award in 2009. She coordinated the NIST TAC Knowledge Base Population task in 2010, 2011, 2014, 2015, and 2016.

Jiawei Han, Abel Bliss Professor, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research areas encompass data mining, data warehousing, information network analysis, and database systems, with over 600 conference and journal publications. He is a Fellow of the ACM and a Fellow of the IEEE, and received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), and the IEEE Computer Society W. Wallace McDowell Award (2009). His co-authored textbook "Data Mining: Concepts and Techniques", 3rd ed. (Morgan Kaufmann, 2011), has been widely adopted worldwide.