I am very new to the pandas library, but not new to Spark. Here is my question:
Since Spark is written (mostly) in Scala, does the performance of an application written in Python suffer from having to convert between JVM/Scala datatypes and Python datatypes?
I am not too familiar with Scala or Spark, so I'm not sure how much I can help. But as far as I can tell, pandas is a lot slower than something like PySpark on large workloads -- much of that performance difference comes from the fact that Spark distributes work across many threads (and machines), while pandas runs single-threaded on a single machine.
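To make the comparison concrete, here is a minimal sketch of the same aggregation in both tools. The pandas part runs as-is; the PySpark part is shown only as a comment, since it requires a running Spark session. The key point relevant to the question: with PySpark's DataFrame API, the aggregation logic executes on the JVM, so there is no per-row Python-to-JVM datatype conversion (that overhead shows up mainly when you use Python UDFs).

```python
import pandas as pd

# Single-machine aggregation in pandas: this runs in one Python process
# and does not parallelize across cores by default.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
totals = df.groupby("key")["value"].sum()
print(totals["a"])  # 4  (1 + 3)
print(totals["b"])  # 6  (2 + 4)

# Rough PySpark equivalent (sketch, not executed here; assumes an
# existing SparkSession named `spark`):
#
#   sdf = spark.createDataFrame(df)
#   sdf.groupBy("key").sum("value").show()
#
# The groupBy/sum above is planned and executed entirely on the JVM;
# rows are not round-tripped through Python.
```

So for typical DataFrame-API code, the Scala/Python boundary is crossed rarely, and the serialization cost the question worries about is mostly avoided.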