April 7 | Ruijia Wu (吴瑞佳): Topic Modeling: Optimal Estimation and Statistical Inference

Source: School of Statistics | Published: 2023-03-31

Time: April 7, 2023, 15:00-16:00

Venue: Room A1514, Science Building

Speaker: Ruijia Wu (吴瑞佳), Assistant Professor, Shanghai Jiao Tong University

Host: Dongdong Xiang (项冬冬), Professor, East China Normal University

Abstract:

With the development of computer technology and the internet, increasingly large amounts of textual data are generated and collected every day. Analyzing vast amounts of unstructured textual data and extracting meaningful, actionable information from it is a significant challenge. Many machine learning and natural language processing algorithms have been developed for text classification, clustering, and information retrieval. Driven by applications in a wide range of fields, there is an increasing need for computationally efficient statistical methods, with theoretical guarantees, for analyzing massive amounts of textual data.

In the first part of the talk, I will present algorithms for unsupervised topic modeling under the probabilistic latent semantic indexing (pLSI) model. Novel and computationally fast algorithms for estimation and inference of both the word-topic matrix and the topic-document matrix are proposed, and their theoretical properties are investigated. In the second part, I will discuss supervised topic modeling, which jointly considers a collection of documents and their paired side information. A bias-adjusted algorithm is developed to study the regression coefficients in supervised topic modeling under the generalized linear model formulation. I will also introduce an approach to constructing valid confidence intervals. Applications of the proposed methods reveal meaningful latent topic structures in textual data.

Speaker bio:

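Ruijia Wu (吴瑞佳) is an Assistant Professor in the Department of Data and Business Intelligence at the Antai College of Economics and Management, Shanghai Jiao Tong University. Wu holds bachelor's and master's degrees from the Department of Mathematics at the University of Oxford, UK, and received a PhD in 2022 from the Wharton School of the University of Pennsylvania. Research interests include statistical machine learning, high-dimensional statistics, and text analysis and its applications.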