Gift Sinthong: AsterixDB Meets Data Science

November 8, 2019 @ 12:30 pm - 2:00 pm

In the last few years, Data Science has become an increasingly important use case for data platforms. To support the full Big Data analysis lifecycle, we have examined one of the most popular exploratory data analytics tools, Pandas, which has a serious problem: scalability. Exploratory tools such as Pandas work well only against locally stored data that fits in the memory of a single machine. Our plan is to integrate a Pandas-like user experience with AsterixDB, giving analysts a familiar working environment while scaling out the evaluation of analytical operations across a large cluster to enable Big Data analysis. The two main components that we use to enable such a workflow are the AsterixDB UDF framework and our new Python data analytics library (“AFrame”) that operates against AsterixDB. AFrame lets users interact with a very large volume of semi-structured data in the same way that they would use Pandas DataFrames against locally stored tabular data. Influenced by Spark SQL and Spark DataFrames, our AFrame prototype uses lazy evaluation: AFrame operations are incrementally translated into AsterixDB SQL++ queries, and those queries are executed only when an action is invoked and final results are called for. In this talk, we will demonstrate our approach using a restaurant review analytics use case.
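
To make the lazy-evaluation idea concrete, here is a minimal Python sketch of how DataFrame-style operations might be translated incrementally into a nested SQL++ query that is sent only when an action is invoked. It is an illustration under stated assumptions, not AFrame's actual API: the `LazyFrame` class, its method names, the `Reviews` dataset, the `stars` and `name` fields, and the `fake_executor` stand-in for a database connection are all hypothetical.

```python
class LazyFrame:
    """Toy lazily evaluated frame (hypothetical; not the real AFrame API)."""

    def __init__(self, query, executor):
        self.query = query          # SQL++ text built up so far
        self._executor = executor   # callable that would send a query to the database

    @classmethod
    def from_dataset(cls, dataset, executor):
        # Start from a query that returns every record of the dataset.
        return cls(f"SELECT VALUE t FROM {dataset} t", executor)

    def filter(self, condition):
        # Transformation: wrap the current query as a subquery; nothing is executed.
        query = f"SELECT VALUE t FROM ({self.query}) t WHERE {condition}"
        return LazyFrame(query, self._executor)

    def select(self, *fields):
        # Transformation: project fields from the current query; nothing is executed.
        columns = ", ".join(f"t.{field}" for field in fields)
        query = f"SELECT {columns} FROM ({self.query}) t"
        return LazyFrame(query, self._executor)

    def head(self, n=5):
        # Action: only here is the accumulated query actually sent.
        return self._executor(f"{self.query} LIMIT {n};")


sent_queries = []


def fake_executor(query):
    # Stand-in for a database connection: records the query and returns no rows.
    sent_queries.append(query)
    return []


reviews = LazyFrame.from_dataset("Reviews", fake_executor)
good = reviews.filter("t.stars >= 4").select("name", "stars")
queries_before_action = list(sent_queries)  # still empty: only transformations so far

rows = good.head(3)  # the action triggers exactly one query
```

In this sketch each transformation returns a new object holding a longer query string, so intermediate frames stay reusable and the database sees one composed query per action rather than one round trip per operation.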




DBH 4011