ISG Talks are sponsored by Couchbase.
- This event has passed.
Yiming Lin (UC Berkeley): AI-Powered Data Systems for Multimodal Analytics
January 16 @ 1:00 pm - 2:00 pm
Donald Bren Hall 3011, ICS, UC Irvine
Lunch will be provided.
Title:
AI-Powered Data Systems for Multimodal Analytics
Abstract:
We live in a world overflowing with data, and the emergence of AI, such as Large Language Models (LLMs), is revolutionizing data analytics. However, directly using AI to process massive and complex data is neither effective nor scalable.
In this talk, I introduce my work on building database systems powered by AI to analyze and process multimodal data at scale, focusing on tables and documents. On one hand, when analyzing tables, AI is often used to prepare data, such as cleaning, enriching, or synthesizing data prior to query processing. This becomes prohibitively expensive when the data scale is large. To support scalable analysis over expensive data ingestion, my work leverages the fact that not all data are needed to answer a query and explores a set of techniques to reduce AI operations unnecessary to analytics by optimizing the query engine in the database. On the other hand, when analyzing documents, current systems treat them as plain text and ignore underlying structures, leading to limited accuracy and performance. In this regard, we exhaustively identified three document structures that encompass most real-world documents we have encountered, and we designed tools and systems to extract their structures and leverage them for accurate and efficient document analytics. Finally, I’ll share my vision for building data systems for multimodal analytics, including aspects of trustworthy systems, interaction with hardware, and co-optimization among different data modalities.
Bio:
Yiming Lin is a postdoctoral researcher at UC Berkeley; he received his Ph.D. from UC Irvine. His research interests span document analytics, query processing and optimization, and data cleaning, with a current focus on building databases for multimodal analytics powered by AI. His work has had real-world impact: his document analytics work helps public defenders, journalists, and the California police department process over 30,000 pages, while his efforts as part of TippersDB have delivered high-quality IoT services to nursing homes, industries, and universities across five sites over six years. He has a number of publications and serves on the program committees of VLDB, SIGMOD, and ICDE.