ISG Talks are sponsored by Couchbase.

Sushant Jain: Large scale and low latency data distribution from database to servers

DBH 6011

Many applications at Google are structured with data stored in a transactional database (the source of truth), and the same data is required by servers distributed worldwide. For efficient and fast computation, servers store this data in memory. Further, the database changes continuously, and we need to update the in-memory view on this large number of servers in real time. For example, in the Google Search Ads application, advertiser configuration is stored in a database, and this data is loaded into the memory of various servers to compute ads in a scalable and fast way. In this talk, we describe our solution to this data distribution problem and the challenges that we encountered in providing a highly reliable and low-latency service.

Dr. Andrey Balmin and Mayank Pradhan (Workday): Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark

DBH 3011

Abstract: Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. To prepare data for analysis, business users can set up data transformation pipelines in an interactive, self-service, modern data prep environment. Thus, Workday Prism Analytics needs to run three types of scalable data processing applications: "always on" query engine and data […]

Vinayak Borkar (FireEye Inc.): The X15 Machine Data Management Platform

DBH 4011

ABSTRACT: Machine Data (aka Log Data) is continuously produced by applications and devices as a result of human-computer and computer-computer interactions. Although most of this data was initially generated for ad-hoc human consumption to aid with debugging and troubleshooting systems and deployments, its systematic treatment using well-known data processing techniques can unlock valuable insight about operations […]

David Lomet (Microsoft Research): How Data Caching Systems Succeed

DBH 4011

Data in traditional "caching" data systems resides on secondary storage, and is read into main memory only when operated on. This limits system performance. Main memory data stores with data always in main memory are much faster. But this performance comes at a cost. In this paper, we analyze the costs of both in-memory operations and secondary storage operations where data is not "in cache". We study the performance impact of cache misses on caching system performance. The analysis considers both execution and storage costs. Based on our analysis, we derive cost/performance results for a data caching system and a main memory system to understand where each demonstrates the best cost per operation, what is driving the cost differences, and the scale of the differences. This analysis (1) provides insight into why data caching systems continue to dominate the market; (2) points to higher performance that does not rely on simply increasing main memory cache size; and (3) suggests a path to lower costs and hence better cost/performance.

Prof. Jeff Ullman: Data Science: Is it Real?

DBH 6011

ABSTRACT: We shall discuss the various ways in which data science is approached by different communities, including the Statistics, Machine-Learning, and Database communities. Each presents a different viewpoint and values different outcomes. Some consequences of these approaches will be discussed. As an example of why data science is not machine learning, we shall sketch two […]

Prof. Sang-Woo Jun: Lowering the cost of large-scale data analytics via efficient use of flash storage

DBH 3011

In this talk, I present the storage systems aspect of the ongoing work on using relatively cheap solid-state secondary storage to replace expensive DRAM for analytics on large amounts of data, using as examples graph analytics and the bioinformatics application somatic mutation finding. Both applications are inherently random access intensive, which is a bad fit […]

Xiangyao Yu: Transaction Processing at Scale

DBH 3011

Abstract: Online transaction processing (OLTP) is critical for applications including finance, e-commerce, social networks, and healthcare. The increasing performance demands of these applications require OLTP to scale massively. Concurrency control is a major scalability bottleneck in such systems. This talk presents three projects that identify and help resolve scalability challenges. First, I present a scalability […]

Fatemeh Nargesian: Data Enrichment for Data Science

DBH 3011

Fatemeh Nargesian, University of Toronto
Thursday, March 21, 2019, 2:00 - 3:00 pm, DBH 3011
Refreshments start at 1:30 pm

Data Science is built on the power of data processing and data preparation. In this talk, I discuss the challenges of data preparation for end-to-end data science. Particularly, I talk […]

Pat Helland: There’s No Substitute for Interchangeability

DBH 3011

Speaker: Pat Helland (Salesforce.com)
Title: There's No Substitute for Interchangeability
Time: 3-4 PM
Place: 3011 DBH

Abstract: Distributed systems have many challenges including loosely coupled systems, long running work, and distributed workflow. In addition, replication with out-of-order reconciliation is quite difficult, especially when composed with the other challenges. In this talk, we propose data-centric REST-style […]

Michal Shmueli-Scheuer: Conversational bots for customer support

DBH 4011

Michal Shmueli-Scheuer, IBM Research - Haifa
Friday, August 9, 2019, 3:00 pm - 4:00 pm, DBH 4011

Abstract: In this talk, I'll cover various aspects of conversational bots, focusing on the domain of customer support. Often, human conversations with bots mimic the way humans interact with each other. Moreover, even […]