BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Information Systems Group - ECPv6.4.0.1//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-ORIGINAL-URL:https://isg.ics.uci.edu
X-WR-CALDESC:Events for Information Systems Group
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20250309T100000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20251102T090000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20250425T130000
DTEND;TZID=America/Los_Angeles:20250425T140000
DTSTAMP:20260426T123607
CREATED:20250401T171824Z
LAST-MODIFIED:20250521T184951Z
UID:2192-1745586000-1745589600@isg.ics.uci.edu
SUMMARY:Yiming Lin (Berkeley): Toward Building Efficient Document Analytics Systems from the Lens of Document Structure
DESCRIPTION:Abstract:\nThe vast majority of data today\, over 80%\, exists in unstructured formats\, and querying and extracting value from unstructured document collections remains a considerable challenge. While Large Language Models (LLMs) have made remarkable progress in document understanding\, they fail to provide high-accuracy results for analytical queries on documents and incur high costs.\nIn this talk\, we demonstrate that document collections often have hidden structure\, and that discovering it can facilitate multiple downstream data analytics tasks on documents. At one extreme\, we explore documents sharing a similar high-level template that imparts a common semantic structure\, such as scientific papers from the same venue. We introduce ZenDB\, a document analytics system that leverages this semantic structure\, coupled with LLMs\, to answer ad-hoc SQL queries on document collections. At the other extreme\, we explore form-like documents\, such as invoices and order bills\, that contain structured data like tables or key-value pairs and are programmatically generated by populating fields in a visual blueprint. We present TWIX\, a document analytics tool that first infers the common blueprint and then extracts structured data from documents efficiently. For both extremes\, we provide theoretical guarantees on the correctness of structure extraction\, present empirical results demonstrating their potential for document analytics\, and show their early impact on our collaborators\, including Big Local News at Stanford and California Police Data Applications.\nBio:\nYiming Lin is a postdoctoral researcher at UC Berkeley. He received his PhD in Computer Science from UC Irvine. His research interests span document analytics\, query processing and optimization\, and data cleaning\, with a current focus on developing data management systems for document analytics.\nYiming has closely collaborated with and interned at industrial pioneers in data analytics\, including Microsoft Research and Amazon. His work has been published in several flagship conferences\, including VLDB\, SIGMOD\, and ICDE.
URL:https://isg.ics.uci.edu/event/yiming-lin-berkeley/
END:VEVENT
END:VCALENDAR