Here at ClearBrain, we use Apache Spark and MLlib to power our machine learning pipelines. Spark has grown considerably in the market in recent years, and we have personally found it can dramatically accelerate the pace of development, replacing entire components that otherwise would have to be built from scratch.
In getting started with Spark, there are a variety of approaches to consider. The primary interfaces are Resilient Distributed Datasets (RDDs) and Spark SQL (DataFrames / Datasets). RDDs are the original API that shipped with Spark 1.0, where data is passed around as opaque objects. RDD operators are lower-level and more limited (map, reduceByKey, …), though likely more familiar to Hadoop users and functional programmers. DataFrames, by contrast, offer a more ergonomic interface: data is represented in a tabular format (rows and columns) with an attached schema and rich SQL-like operators.
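To make the contrast concrete, here is a minimal sketch that counts events per user both ways: once with RDD operators and once with DataFrame operators. The (user, event) dataset and column names are made up purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ApiComparison").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical (userId, eventType) pairs, invented for illustration.
    val events = Seq(("u1", "click"), ("u1", "view"), ("u2", "click"))

    // RDD style: data moves around as opaque tuples; transforms are functional operators.
    val rddCounts = spark.sparkContext
      .parallelize(events)
      .map { case (user, _) => (user, 1L) }
      .reduceByKey(_ + _)
    rddCounts.collect().foreach(println)

    // DataFrame style: named columns, an attached schema, and SQL-like operators.
    val dfCounts = events
      .toDF("user_id", "event_type")
      .groupBy("user_id")
      .agg(count("*").as("event_count"))
    dfCounts.show()

    spark.stop()
  }
}
```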
We explored these different approaches while building out ClearBrain – initially building the data engineering components on RDDs and designing the machine learning components in PySpark, before ultimately settling on a pure Scala implementation using the DataFrame API. This is what has worked best for us, and we think its benefits apply to many companies dealing with large-scale data.
In this post we’ll share some of our key learnings on using DataFrames in Spark, and some circumstances where we’ve found RDDs can provide a better solution instead.
Improved Team Collaboration with DataFrames
There are many great articles on DataFrames vs. RDDs that dive into the technical benefits of using each API. As such, we thought it’d be helpful to review some of the tangible human benefits we’ve seen – specifically how DataFrames can help you improve the pace of development across your team.
A primary benefit of the DataFrame API is that it employs SQL-like constructs, which let you talk about data engineering transforms in the language of SQL. This is a lingua franca for all members of a modern engineering team: data scientists, engineers, and product. Using a common language to discuss data engineering has helped us iterate on our own problems more quickly and get contributions from all team members.
One recent example: our CEO devised a 125-line SQL query using subqueries and ranking to generate a valuable new feature. Using the Spark DataFrame API and window functions, we were able to translate it almost directly into Scala code that runs efficiently, even when scaled to dozens of similar features across billions of events.
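We can't reproduce the actual query here, but the sketch below shows the flavor of that translation, assuming a hypothetical event log with user_id, event_type, and ts columns: a SQL RANK() OVER (PARTITION BY … ORDER BY …) subquery maps almost one-to-one onto a Window spec in Scala.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object RankedFeatureSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RankedFeatureSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical event log (user_id, event_type, ts) -- invented for illustration.
    val events = Seq(
      ("u1", "purchase", 100L), ("u1", "view", 90L),
      ("u2", "view", 80L), ("u2", "click", 70L)
    ).toDF("user_id", "event_type", "ts")

    // SQL's RANK() OVER (PARTITION BY user_id ORDER BY ts DESC)
    // becomes a Window spec plus rank().
    val byUserRecency = Window.partitionBy($"user_id").orderBy($"ts".desc)

    // Example feature: each user's most recent event type.
    val mostRecentEvent = events
      .withColumn("rnk", rank().over(byUserRecency))
      .filter($"rnk" === 1)
      .select($"user_id", $"event_type".as("most_recent_event"))

    mostRecentEvent.show()
    spark.stop()
  }
}
```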
Further, the shared language of DataFrames carries over between the engineering and data science teams you're likely to be collaborating with. Spark's polyglot approach has allowed our own data scientists to develop and iterate on models in Python using PySpark, the DataFrame API, and MLlib. The engineering team has in turn been able to translate these models directly into Scala for improved performance and consistency with the rest of our pipeline.
Performance and Simpler Handling of Large Scale Data with DataFrames
With our initial Spark implementation, we spent several weeks crafting MapReduce-style processing on RDDs (good past experiences with MapReduce suggested this approach). Unfortunately, the path was far from smooth.
Running MapReduce on RDDs often led to issues with memory consumption, compute time, and the limitations of native Scala data structures. One concrete example is Scala 2.x's 22-field limit on tuples, which makes working with RDDs containing hundreds or thousands of fields challenging. You can work around the limit with special handling (such as nested or vectorized columns), but the complexity is exacerbated when engineering and data science work in different languages, since any special handling has to be implemented in both Scala and Python, slowing down both teams.
By switching to the DataFrame API, you can realize dramatic reductions in runtime and memory consumption on the data engineering tasks in a typical processing pipeline. It also makes it easier to handle very wide DataFrames (hundreds to thousands of columns) without any extra specialized processing steps, as illustrated below.
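As a rough illustration of the "no special handling" point, this sketch builds a 500-column DataFrame – far wider than Scala 2.x's 22-element tuple limit would comfortably allow with plain RDD tuples – and aggregates over it with ordinary DataFrame operators. The column names and sizes are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WideDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WideDataFrameSketch").master("local[*]").getOrCreate()

    // Generate a 500-column DataFrame of random features: no nested-tuple
    // or vectorized-column workarounds required.
    val numCols = 500
    val wide = spark.range(0, 1000L).select(
      (col("id") +: (1 to numCols).map(i => rand(i).as(s"feature_$i"))): _*
    )

    // SQL-like operators work the same regardless of width: compute every column's mean.
    val means = wide.select((1 to numCols).map(i => avg(s"feature_$i").as(s"mean_$i")): _*)
    means.show(1, truncate = false)

    spark.stop()
  }
}
```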
Completeness and Speed Benefits of RDDs in MLlib
In a few specific cases, we've found it useful to dive back into the older RDD-based spark.mllib package for functionality that has not yet been ported over to the newer DataFrame-based spark.ml package.
For instance, spark.mllib's more complete model evaluation functions, and the Spark Statistics library's ability to produce a full correlation matrix in a single pass, make the RDD implementation a better choice in those cases. You can work around the absence of evaluation functions in the DataFrame API by implementing them yourself in Scala, but the correlation matrix function is more challenging to replicate.
Using the RDD-based spark.mllib correlation matrix function also has the benefit of increased performance. Asking the DataFrame-based spark.ml to compute the correlation between any two columns is fast and straightforward, but when computing across hundreds to thousands of features, you'll see substantial costs in cluster time, since the obvious pairwise approaches don't parallelize. For this case, we've found a better approach is to convert a vector-assembled column of a DataFrame into an RDD of old-style mllib vectors, compute the correlation matrix, then convert back to a DataFrame.
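A sketch of that conversion is below. The feature column names and data are hypothetical, but the overall shape – VectorAssembler, a map over the underlying RDD to old-style mllib vectors, Statistics.corr, then back to a DataFrame – is the approach described above.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession

object CorrelationMatrixSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CorrelationMatrixSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical numeric feature columns -- stand-ins for a real feature set.
    val featureCols = Array("f1", "f2", "f3")
    val df = Seq(
      (1.0, 2.0, 3.0),
      (2.0, 4.0, 5.5),
      (3.0, 6.0, 9.1)
    ).toDF(featureCols: _*)

    // 1. Assemble the features into a single vector column.
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(df)

    // 2. Convert the spark.ml vectors into old-style spark.mllib vectors.
    val vectorRdd = assembled
      .select("features")
      .rdd
      .map(row => OldVectors.fromML(row.getAs[MLVector]("features")))

    // 3. Compute the full Pearson correlation matrix in a single pass.
    val corrMatrix = Statistics.corr(vectorRdd, "pearson")

    // 4. Convert the matrix back to a DataFrame for downstream use.
    val corrDf = corrMatrix.rowIter.toSeq
      .map(_.toArray)
      .zip(featureCols)
      .map { case (rowValues, name) => (name, rowValues) }
      .toDF("feature", "correlations")

    corrDf.show(truncate = false)
    spark.stop()
  }
}
```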
Spark In Practice
To wrap up, we've been extremely happy users of Spark and have seen substantial improvements in team dynamics and pace of development compared to lower-level tools.
In practice, our experience has strongly favored the newer DataFrame API for both human and technical reasons. The one minor caveat is that, for a few gaps in the newer spark.ml API, you may need to reach back to the older RDD-based API, and this can be done in a fairly straightforward way.