ISG Talks are sponsored by Couchbase.

Yiming Lin (Berkeley): Toward Building Efficient Document Analytics Systems from the Lens of Document Structure
April 25 @ 1:00 pm - 2:00 pm
Abstract:
The vast majority—over 80%—of data today exists in unstructured formats, and querying and extracting value from unstructured document collections remains a considerable challenge. While Large Language Models (LLMs) have made remarkable progress in document understanding, they fail to provide high-accuracy results for analytical queries on documents and incur high costs.
In this talk, we demonstrate that document collections often have hidden structure, and that discovering it can effectively facilitate multiple downstream data analytics tasks on documents. At one extreme, we explore documents sharing a similar high-level template that imparts a common semantic structure, such as scientific papers from the same venue. We introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. At the other extreme, we explore form-like documents, such as invoices and order bills, that contain structured data like tables or key-value pairs and are programmatically generated by populating fields in a visual blueprint. We present TWIX, a document analytics tool that first infers the common blueprint and then extracts structured data from documents efficiently. For both extremes, we provide theoretical guarantees on the correctness of structure extraction, present empirical results demonstrating their potential for document analytics, and show their early impact on our collaborators, including Big Local News at Stanford and California Police Data Applications.
Bio:
Yiming Lin is a postdoctoral researcher at UC Berkeley, and he received his PhD in Computer Science from UC Irvine. His research interests span document analytics, query processing and optimization, and data cleaning, with a current focus on developing data management systems for document analytics. Yiming has closely collaborated with and interned at industrial pioneers in data analytics, including Microsoft Research and Amazon. His work has been published in several flagship conferences, including VLDB, SIGMOD, and ICDE.