Transition from Pandas to Spark: Koalas to the Rescue
Problem statement at hand:
As data size grows from MB → GB → TB, processing time grows from seconds → minutes → hours, and our working Python scripts start crashing with pandas out-of-memory exceptions.
The team starts looking for a more enterprise-grade and scalable approach to handle the big-data surge, and Spark and its family have become the de facto standard solution for such problems.
But wait: Databricks works with either PySpark or Scala…
So do you think we need to rewrite all our Pandas code in PySpark, and keep doing so for new Data Science projects as well? Well, the answer is no.
In April 2019 the researchers behind Spark introduced KOALAS (not the cute little lazy animals) to the open-source community.
Let me describe a typical Data Scientist’s journey:
- At university or in MOOCs, as students, we rely on Pandas
- Analysing small datasets as interns/freshers, we rely on Pandas
- A few years into the industry, as seniors analysing big datasets, we rely on Spark DataFrames
But PySpark has a very different set of APIs compared to a single-node Python package such as pandas, so do we see a steep learning curve here? And that too when we have deliverables at hand.
These problems led to the development of KOALAS.
This framework benefits both Pandas and Spark users by combining the two ecosystems, hence delivering greater value to organisations much faster.
So what is Koalas?
Koalas is a pure open-source Python library which aims at providing the Pandas API on top of Apache Spark.
Koalas unifies the two ecosystems with a familiar API.
Koalas provides a seamless transition between small and large datasets, so we need to know just one set of APIs: the Pandas APIs.
Using Koalas we get the ease of use of Pandas and the scalability of PySpark.
Easy Installation: > pip install koalas
Below is the architecture diagram of Koalas from the community:
Let’s see it in action with an example:
PANDAS Code
pandas_df.groupby("Destination").sum().nlargest(10, columns="TCount")
PySpark Code
spark_df.groupBy("Destination").sum().orderBy("sum(TCount)", ascending=False).limit(10)
Koalas Code
koalas_df.groupby("Destination").sum().nlargest(10, columns="TCount")
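To make the comparison concrete, here is a minimal, self-contained sketch of all three variants; the toy flights-style data, the SparkSession setup, and the DataFrame names are made up for illustration:

import pandas as pd
import databricks.koalas as ks
from pyspark.sql import SparkSession

# toy data standing in for a real flights table
data = {"Destination": ["DEL", "BOM", "DEL", "BLR", "BOM"],
        "TCount": [10, 5, 7, 3, 8]}

# Pandas: everything runs on a single node
pandas_df = pd.DataFrame(data)
top_pandas = pandas_df.groupby("Destination").sum().nlargest(10, columns="TCount")

# PySpark: same result, but a very different API
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
top_spark = (spark_df.groupBy("Destination").sum()
             .orderBy("sum(TCount)", ascending=False).limit(10))

# Koalas: the exact pandas call, executed distributedly on Spark
koalas_df = ks.from_pandas(pandas_df)
top_koalas = koalas_df.groupby("Destination").sum().nlargest(10, columns="TCount")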
Just a word of caution: we always need to make sure the data is ordered, using sort_index(), because in a distributed computing environment we do not know in which order the data arrives.
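A minimal sketch of that caution, reusing the koalas_df from the example above:

# result rows can come back from the Spark workers in any order
top_koalas = koalas_df.groupby("Destination").sum()

# sort_index() makes the order deterministic, just as in pandas
print(top_koalas.sort_index().head())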
From PANDAS to KOALAS, some useful tips:
1. Conversion between the two types of DataFrames is seamless: with koalas_df.to_spark() we can hand our DataFrame over to a PySpark expert (maybe the deployment team), and with pyspark_df.to_koalas() we can come back the other way (see the first sketch after this list).
2. Almost all popular Pandas operations (> 70%) are available in Koalas, including visualization support via matplotlib (sketched below). The community is accepting requests for missing operations via the Koalas GitHub page.
3. Some functionality is deliberately not implemented, such as DataFrame.values, because it would load all the data into the driver’s memory, giving us an out-of-memory exception. The easiest workaround is koalas_df.to_pandas(), do your manipulations, then ks.from_pandas(pandas_df) to come back (sketched below).
4. There are different execution principles we need to be aware of: ordering, lazy evaluation on the underlying Spark DataFrame (sketched below), sorting after groupby, the different structure of groupby.apply, and different NaN treatment.
5. For a distributed row-based job you can use koalas_df.apply(<your function>, axis=1); this makes sure the function calls over all the rows are distributed among the Spark workers (sketched below).
6. koalas_df.cache() will not recompute everything from the beginning every time, which can be leveraged for exploratory data analysis (sketched below).
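For tip 1, a minimal sketch of the handoff in both directions; koalas_df is the DataFrame from the earlier example, and importing databricks.koalas is what attaches to_koalas() to Spark DataFrames:

import databricks.koalas as ks

spark_df = koalas_df.to_spark()    # Koalas -> Spark: hand over to the PySpark/deployment team
koalas_df2 = spark_df.to_koalas()  # Spark -> Koalas: back to the pandas-style API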
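For tip 2, a small sketch of the matplotlib-backed plotting, assuming the same koalas_df (the histogram itself is computed on Spark):

import matplotlib.pyplot as plt

# same plotting accessor as in pandas
koalas_df["TCount"].plot.hist(bins=5)
plt.show()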
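For tip 3, a sketch of the workaround; note it is only safe when the data is small enough to fit in the driver's memory:

pandas_df = koalas_df.to_pandas()       # collects everything to the driver
arr = pandas_df.values                  # the pandas-only manipulation
koalas_df = ks.from_pandas(pandas_df)   # back to a distributed DataFrame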
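For tip 4, a sketch of the lazy evaluation: like Spark, Koalas builds up an execution plan and only runs it when a result is actually needed:

filtered = koalas_df[koalas_df["TCount"] > 5]  # builds a Spark plan, no computation yet
print(filtered.head())                         # triggers the actual execution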
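For tip 5, a sketch of a distributed row-wise apply; the label function here is made up for illustration:

def route_label(row):
    # called once per row; the calls are spread across the Spark workers
    return str(row["Destination"]) + "-" + str(row["TCount"])

labels = koalas_df.apply(route_label, axis=1)
print(labels.head())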
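For tip 6, a sketch of caching during exploratory data analysis:

cached_df = koalas_df.cache()   # persist the data in Spark's memory once

print(cached_df.groupby("Destination").sum().head())  # computed once from the cache
print(cached_df["TCount"].describe())                 # no recomputation from the start

cached_df.unpersist()           # release the memory when done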