ISG Talks are sponsored by Couchbase.

Past Events

November 2018

Sushant Jain: Large scale and low latency data distribution from database to servers

November 2, 2018 @ 11:00 am - 12:00 pm
DBH 6011

Many applications at Google are structured with data stored in a transactional database (the source of truth), while the same data is required by servers distributed worldwide. For efficient and fast computation, servers store this data in memory. Further, the database changes continuously, and we need to update the in-memory view of this large number of servers in real time. For example, in the Google Search Ads application, advertiser configuration is stored in a database, and this data is loaded into the memory of various servers to compute ads in a scalable and fast way. In this talk, we describe our solution to this data distribution problem and the challenges that we encountered in providing a highly reliable and low-latency service.

Dr. Andrey Balmin and Mayank Pradhan (Workday): Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark

November 16, 2018 @ 3:00 pm - 4:00 pm
DBH 3011

Abstract: Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. To prepare data for analysis, business users can set up data transformation pipelines in an interactive, self-service, modern data prep environment. Thus, Workday Prism Analytics needs to run three types of scalable data processing applications: "always on" query engine and data prep applications, and on-demand batch execution of transformation pipelines. We standardized on Apache Spark and Spark SQL for all three applications, due to its scalability,…

December 2018

Vinayak Borkar (FireEye Inc.): The X15 Machine Data Management Platform

December 7, 2018 @ 3:00 pm - 4:00 pm
DBH 4011

ABSTRACT: Machine Data (aka Log Data) is continuously produced by applications and devices as a result of human-computer and computer-computer interactions. Although most of this data was initially generated for ad-hoc human consumption to aid with debugging and troubleshooting systems and deployments, its systematic treatment using well-known data processing techniques can unlock valuable insight about operations within organizations. Log Data may sometimes be the only feasible source of some of this insight. Processing Log Data poses new challenges as compared to…

January 2019

David Lomet (Microsoft Research): How Data Caching Systems Succeed

January 25, 2019 @ 3:00 pm - 4:00 pm
DBH 4011

Data in traditional "caching" data systems resides on secondary storage, and is read into main memory only when operated on. This limits system performance. Main memory data stores with data always in main memory are much faster. But this performance comes at a cost. In this paper, we analyze the costs of both in-memory operations and secondary storage operations where data is not "in cache". We study the performance impact of cache misses on caching system performance. The analysis considers both execution and storage costs. Based on our analysis, we derive cost/performance results for a data caching system and a main memory system to understand where each demonstrates the best cost per operation, what is driving the cost differences, and the scale of the differences. This analysis (1) provides insight into why data caching systems continue to dominate the market; (2) points to higher performance that does not rely on simply increasing main memory cache size; and (3) suggests a path to lower costs and hence better cost/performance.

February 2019

Prof. Jeff Ullman: Data Science: Is it Real?

February 1, 2019 @ 11:00 am - 12:00 pm
DBH 6011

ABSTRACT: We shall discuss the various ways in which data science is approached by different communities, including the Statistics, Machine-Learning, and Database communities. Each presents a different viewpoint and values different outcomes. Some consequences of these approaches will be discussed. As an example of why data science is not machine learning, we shall sketch two important algorithms not in this class: locality-sensitive hashing and approximate counting. BIOGRAPHY: Jeffrey David Ullman is the Stanford W. Ascherman Professor of Engineering (Emeritus) in…

Prof. Sang-Woo Jun: Lowering the cost of large-scale data analytics via efficient use of flash storage

February 8, 2019 @ 3:00 pm - 4:00 pm
DBH 3011

In this talk, I present the storage systems aspect of the ongoing work on using relatively cheap solid-state secondary storage to replace expensive DRAM for analytics on large amounts of data, using as examples graph analytics and the bioinformatics application somatic mutation finding. Both applications are inherently random access intensive, which is a bad fit with the coarse access granularity of secondary storage. In both cases, performance impact of random access is addressed in two ways: reorganizing the computation and…

March 2019

Xiangyao Yu: Transaction Processing at Scale

March 18, 2019 @ 11:00 am - 12:00 pm
DBH 3011

Abstract: Online transaction processing (OLTP) is critical for applications including finance, e-commerce, social networks, and healthcare. The increasing performance demands of these applications require OLTP to scale massively. Concurrency control is a major scalability bottleneck in such systems. This talk presents three projects that identify and help resolve scalability challenges. First, I present a scalability study of concurrency control on a simulated 1000-core processor and show the bottlenecks that constrain the scaling of classic algorithms. Then, I present a new…

Fatemeh Nargesian: Data Enrichment for Data Science

March 21, 2019 @ 2:00 pm - 3:00 pm
DBH 3011

Speaker: Fatemeh Nargesian (University of Toronto). Refreshments start at 1:30 pm. Abstract: Data Science is built on the power of data processing and data preparation. In this talk, I discuss the challenges of data preparation for end-to-end data science. Particularly, I talk about data enrichment via discovery where the goal is to discover and integrate the right data to solve a given data science problem. I introduce…

May 2019

Pat Helland: There’s No Substitute for Interchangeability

May 31, 2019 @ 3:00 pm - 4:00 pm
DBH 3011

Speaker: Pat Helland (Salesforce.com) Abstract: Distributed systems have many challenges, including loosely coupled systems, long-running work, and distributed workflow. In addition, replication with out-of-order reconciliation is quite difficult, especially when composed with the other challenges. In this talk, we propose data-centric REST-style connectors that allow work to be decoupled in trust, space, and time. These replication connectors support replicated services with eventual consistency. As work comes together,…

August 2019

Michal Shmueli-Scheuer: Conversational bots for customer support

August 9, 2019 @ 3:00 pm - 4:00 pm
DBH 4011

Speaker: Michal Shmueli-Scheuer (IBM Research - Haifa) Abstract: In this talk, I'll cover various aspects of conversational bots, focusing on the domain of customer support. Often, human conversations with bots mimic the way humans interact with each other. Moreover, even when customers know that they are interacting with virtual agents (bots), they still expect them to behave like humans. One way to improve interactions with…

November 2019

Gift Sinthong: AsterixDB Meets Data Science

November 8, 2019 @ 12:30 pm - 2:00 pm
DBH 4011

Abstract: In the last few years, Data Science has become an increasingly important use case for data platforms. To support the full Big Data analysis lifecycle, we have examined one of the most popular exploratory data analytics tools, Pandas, which has a serious problem: scalability. Exploratory tools such as Pandas only work well against locally stored data that fits in the memory of a single machine. Our plan is to integrate a Pandas-like user experience with AsterixDB to provide analysts…

Multistage Adaptive Load Balancing in Big Active Data Publish Subscribe Systems

November 15, 2019 @ 12:30 pm - 2:00 pm
DBH 3011

Speaker: Hang Abstract: We address issues in the design and operation of Big Active Data Publish Subscribe (BAD Pub/Sub) systems to enable the next generation of enriched notification systems that can scale to societal levels. The proposed BAD Pub/Sub systems aim to ingest massive amounts of data from heterogeneous publishers and sources and deliver customized, enriched notifications to end users who express interest in these data items via parameterized channels. To support scalability, we employ a…

Texera: Supporting Big Data Analytics for Domain Experts through GUI-based workflows

November 22, 2019 @ 1:00 pm - 2:00 pm
DBH 3011

Speakers: Avinash Kumar, Shengquan Ni, Zuozhi Wang Abstract: Big data analytics is a daunting task for domain experts such as doctors and teachers. Their non-IT background makes it challenging for them to write analytics code and maintain computing infrastructures to efficiently process large amounts of data. Existing data analytics frameworks that offer GUI-based alternatives are mostly limited to a single machine. In light of these facts, we are developing Texera, a scalable data-processing system that supports interaction and debugging…

December 2019

AquaEIS: Middleware Support for Event Identification in Community Water Infrastructures

December 6, 2019 @ 12:30 pm - 2:00 pm
DBH 3011

Speaker: Quing Han Abstract: Real-time event identification is critical in complex distributed infrastructures, e.g., water systems, where failures are difficult to isolate. We present AquaEIS, an event-based middleware tailored to the problem of locating sources of failure (e.g., contamination) in community water infrastructures. The inherent complexity of underground hydraulic systems combined with aging infrastructure presents unique challenges. AquaEIS combines online learning techniques, model-driven simulators and data from limited sensing networks to intelligently guide human participants (e.g., staff) in identifying contaminant…

Scalable transaction and polystore data management in LeanXcale

December 13, 2019 @ 3:00 pm - 4:00 pm
DBH 5011

Speakers: Ricardo Jimenez-Péris (LeanXcale, Spain), Patrick Valduriez (Inria, France) Abstract: Hybrid Transaction Analytical Processing (HTAP) is poised to revolutionize data management. By providing online analytics over operational data, HTAP systems open up new opportunities in many application domains where real-time decision-making is critical. Important use cases are proximity marketing, real-time pricing, risk monitoring, real-time fraud detection, etc. HTAP also simplifies data management by removing the traditional separation between operational database and data warehouse/data lake (no more ETLs!). However, a…

January 2020

Scalable Programming: Progress, Prospects and Challenges (CS/NetSys Seminar)

January 10 @ 11:00 am - 12:00 pm
DBH 6011

Speaker: Prof. Gul Agha (University of Illinois at Urbana-Champaign) Abstract: Mobile cloud computing, social media, cyberphysical systems, and the internet of things are examples of increasingly important applications requiring scalable concurrency. The Actor model facilitates programming large-scale concurrent applications. Not surprisingly, Actor languages and frameworks have been widely adopted in industry to address scalability. Although this has significantly reduced programming errors, developing complex concurrent systems and reasoning about their properties can nevertheless be challenging and error-prone. A key source of…

LSM-based storage techniques: a tutorial

January 24 @ 3:00 pm - 4:00 pm

Speaker: Chen Luo Abstract: Recently, the log-structured merge-tree (LSM-tree) has been widely adopted for use in the storage layer of modern NoSQL systems. Because of this, there have been a large number of research efforts, from both the database community and the operating systems community, that try to improve various aspects of LSM-trees. In this tutorial, I will describe the basics of LSM-tree storage techniques and explore their design space and performance trade-offs. If time permits, I will…

A Theoretical View of Distributed Systems (CS Distinguished Seminar Series)

January 30 @ 2:00 pm - 6:00 pm
DBH 6011

Speaker: Prof. Nancy Lynch (Massachusetts Institute of Technology) Abstract: For several decades, my collaborators, students, and I have worked on theory for distributed systems, in order to understand their capabilities and limitations in a rigorous, mathematical way. This work has produced many different kinds of results, including: Abstract models for problems that are solved by distributed systems, and for the algorithms used to solve them, Rigorous proofs of algorithm correctness and performance properties (also some error discoveries), Impossibility results and…

Building Personal Chronicle of Life Events (Final Defense)

January 31 @ 3:00 pm - 4:00 pm

Speaker: Jordan Oh Abstract: Human beings have always been interested in understanding themselves and their surroundings. Learning about the relationship between the two can reveal facts of the present and help predict the future, a critical part of living a better life. With the proliferation of IoT sensor devices, it is now possible to collect quality data for each individual and utilize this data for building personal models that can help to understand the self and environment. However, since this…

February 2020

Event Detection with Temporal Predicates

February 7 @ 3:00 pm - 4:00 pm
DBH 3011

Speaker: Fabio Persia (Free University of Bozen-Bolzano, Italy) Abstract: Human perception tends to group individual values into larger structures; this is also the case for time series data. This tendency inspired us to define an event-detection language based on time intervals, which combines timepoint-based events into larger structures. Complex events can then be defined on a more abstract level by specifying temporal relationships between different time intervals. As a result, we propose a system based on an extension of relational…

Effective Filters and Linear Time Verification for Tree Similarity Joins

February 14 @ 3:00 pm - 4:00 pm
DBH 3011

Speaker: Thomas Hütter (University of Salzburg) Abstract: The tree similarity join computes all similar pairs in a collection of trees. Two trees are similar if their edit distance falls within a user-defined threshold. Previous algorithms, which are based on a filter-verify approach, suffer from the following two issues. First, ineffective filters produce a large number of candidates that must be further verified. Second, the candidates are verified by computing the tree edit distance, which is cubic in the number of…

Systems and ML at RISELab (CS Distinguished Seminar Series)

February 21 @ 11:00 am - 12:00 pm
DBH 6011

Speaker: Prof. Ion Stoica (University of California at Berkeley) Abstract: In this talk, I will present several of the projects we are developing at RISELab, a two-year-old lab at UC Berkeley that focuses on building platforms and algorithms for real-time intelligent decisions, decisions that are secure and explainable. These projects include both systems to better support machine learning (ML) workloads, and leveraging ML to build better systems. In the first category, I will present Ray, a general-purpose distributed system…

Opportunities and Perils of Data Science: A Roadmap (ICS Distinguished Lecture)

February 25 @ 11:00 am - 12:00 pm
DBH 6011

Speaker: Dr. Alfred Spector Abstract: Data-driven approaches have led to powerful prediction, optimization and automation techniques. Powered by large-scale, networked computer systems and machine learning algorithms, these have been very impactful to date and hold great promise in many disciplines, even in the humanities and social sciences. However, no new technology arrives without complications, and we have recently seen the press and various political circles illustrating real, potential, and fictional implications of Big Data. This presentation aims to balance the opportunities…

Depending on Appending

February 28 @ 3:00 pm - 4:00 pm
DBH 3011

Speaker: Pat Helland (Salesforce.com) Abstract: Increasingly, we see "Gray Failures" in the datacenter and public cloud. This happens when a server, router, or other device just plain goes slow. This may result in severe problems in the user-perceived performance as the slowness cascades, sometimes not slow enough to cause the exclusion of the bad devices. In this talk, we briefly examine Gray Failures and consider the use of "append" to support our work. How has append been used in…

April 2020

Babak Salimi: Causal Inference for Responsible Data Science

April 14 @ 11:00 am - 12:00 pm
https://uci.zoom.us/j/232157494

ABSTRACT: Scaling and democratizing access to big data promises to provide meaningful, actionable information that supports decision-making. Today, data-driven decisions profoundly affect the course of our lives, such as whether to admit applicants to a particular school, offer them a job, or grant them a mortgage. Unfair, inconsistent, or faulty decision-making raises serious concerns about ethics and responsibility. For example, we may know that our training data is biased, but how do we avoid propagating discrimination when we use this…

David Lomet: Better Database Cost/Performance via Programmable SSD Batched I/O

April 17 @ 12:30 pm - 2:00 pm
DBH 3011

Abstract: A database storage manager should place data at the most cost/performance-effective tier in the storage hierarchy. While performance and cost both decrease with distance from the CPU, the cost/performance trade-off depends on how efficiently a storage manager can move data across tiers. Log structuring (LS) is designed to improve the cost/performance of secondary storage by writing batches of pages from main memory to secondary storage when using a conventional block-at-a-time I/O interface. The advent of programmable SSDs changes the…

Redesigning Storage Systems for Future Workloads, Hardware, and Performance Requirements (CS Faculty Candidate Seminar)

April 20 @ 11:00 am - 12:00 pm
DBH 3011

Speaker: Oana Balmau (University of Sydney) Abstract: Cloud storage stacks are being challenged by new workloads, new hardware and new performance requirements. First, workloads evolved from following a read-heavy pattern (e.g., a static web-page) to a write-heavy profile where the read:write ratio is closer to 1:1 (e.g., as in the Internet of Things). Second, the hardware is undergoing rapid changes. The divide between fine-grained volatile memory and slow block-level storage is rapidly being bridged by the emerging byte-addressable non-volatile memory…

Lei Cao: Toward an End-to-end Anomaly Discovery Paradigm

April 27 @ 11:00 am - 12:00 pm

ABSTRACT: Anomaly detection is critical in enterprises, with applications ranging from preventing financial fraud and defending against network intrusions to detecting imminent device failures. Although previously developed research offers a plethora of stand-alone methods for detecting particular types of anomalies, there is no end-to-end solution for data scientists to effectively discover anomalies over large volumes of varied data. To build such a system, several critical challenges have to be solved: How to determine which among many alternative anomaly detection algorithms is…

October 2020

CrocodileDB: Resource Efficient Database Execution

October 16 @ 2:00 pm - 3:00 pm
https://uci.zoom.us/j/92895672890

Abstract: The coming end of Moore’s law requires that data systems be more judicious with computation and resources as the growth in data outpaces the availability of computational resources. Current database systems are eager and aggressively consume resources to immediately and quickly complete the task at hand. Intelligently deferring a task to a later point in time can increase result reuse, reduce work that might later be invalidated, or avoid unnecessary work altogether. In this talk I will introduce CrocodileDB,…

YugabyteDB – Bringing Together the Best of Amazon Aurora and Google Spanner

October 23 @ 3:00 pm - 4:00 pm

Speaker: Karthik Ranganathan Abstract: PostgreSQL, a single-node open-source RDBMS, is widely adopted for its powerful set of features. However, PostgreSQL is not built to be used as a cloud-native database, and therefore cannot inherently survive failures, scale horizontally or support geo-distributed deployments. While Amazon Aurora has modified the subsystem of PostgreSQL that writes to disk along with simplifying async replication to make the database resilient to failures, it does not address horizontal scalability or geo-distribution. Google Spanner is a distributed…

November 2020

LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization

November 20 @ 3:00 pm - 4:00 pm
https://uci.zoom.us/j/95066121155

Speaker: Yiming Lin, UCI Abstract: Sensor data is abundant in our lives but is often too dirty to support high-quality services. This talk explores the data cleaning challenges that arise in using WiFi connectivity data to locate users in semantic indoor locations such as buildings, regions, and rooms. WiFi connectivity data consists of sporadic connections between devices and nearby WiFi access points (APs), each of which may cover a relatively large area within a building. Our system, entitled semantic LOCATion cleanER…
