# Databricks vs. Zeppelin

This article is about data exploration and two of the main tools currently available for it: Databricks Community and Apache Zeppelin. Both provide a similar interface for exploring your data.
# Data exploration

In the world of big data, exploring the initial data set is one of the first steps you take. In some cases Numbers or Excel will do the trick, but once the data set reaches a realistic size they no longer work. Still, you just want to peek around in the data, and what better way to do that than by creating some graphs based on a few fields?
Nowadays Spark is the de facto standard for processing your data. But then again, Apache Spark is not known for a nice graphical interface. To overcome this we have Databricks Community and Apache Zeppelin.
## Spark
From the perspective of Spark, both environments provide it out of the box. You are free to use your preferred language, i.e. Scala, Python, R or SQL. All language features are available to you.
## Display your data

The main difference between Databricks and Zeppelin is the way you display the data. (I did not look into advanced methods of creating graphs, such as ggplot, HTML or D3.)

Both platforms have a decent set of graphs available out of the box. One that I found missing in Zeppelin is the box plot.
### Databricks graphs
When using the Databricks application you can display your Dataset by means of the `display` command, for example:
#### Datasets
```scala
val carLength = campers
  .map(c => c.lengte match {
    case Some(length) => length
    case None         => 99999 // sentinel for a missing length, dropped by the filter below
  })
  .filter(v => v > 1 && v < 20000)

display(carLength)
```
This displays a table with the values contained in the Dataset. The menu shown with the table lets you select various chart types.
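
To see what the mapping and filter do without a cluster, the same logic can be tried on a plain Scala collection. This is only a minimal sketch: the `Camper` case class, the `CarLengths` object and its method names are made up for illustration, and `lengte` is assumed to be an `Option[Int]`.

```scala
// Hypothetical stand-in for the article's camper records; only the length field is modelled.
final case class Camper(lengte: Option[Int])

object CarLengths {
  // Same logic as the Dataset version: a missing length becomes the sentinel 99999,
  // which the range filter then removes together with other implausible values.
  def plausibleLengths(campers: Seq[Camper]): Seq[Int] =
    campers
      .map(_.lengte.getOrElse(99999))
      .filter(v => v > 1 && v < 20000)

  // Same result without a sentinel: flatMap drops the None entries directly.
  def plausibleLengthsNoSentinel(campers: Seq[Camper]): Seq[Int] =
    campers
      .flatMap(_.lengte)
      .filter(v => v > 1 && v < 20000)
}
```

Both methods return the same values here, because the sentinel always falls outside the accepted range. The second variant simply makes that intent explicit.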
#### SQL queries
When you run an SQL query on the data, the `display` command is not required: Databricks knows that a table is returned.
### Zeppelin graphs
Zeppelin has no `display` command. Instead, the output is analyzed and displayed. Creating a table is a bit more involved here.
For a table to display, you need to print the data to stdout with the keyword `%table` in front of it. Rows are separated by a newline. When a row has multiple columns, separate the values with the TAB character. Naming the columns is simpler than in the Databricks UI: just start your data with a row holding the names of the columns.

The code snippet below is an example of how to achieve this:

```scala
import org.apache.spark.sql.functions.desc

// Count the campers per year taken from datumTeNaamstelling, newest year first.
val tenaamStellingen = campers
  .filter(c => c.datumTeNaamstelling != null)
  .map(c => c.datumTeNaamstelling.toLocalDateTime.getYear)
  .groupBy("value")
  .count()
  .orderBy(desc("value"))

// Collect to the driver and turn every row into one TAB-separated line.
val rows = tenaamStellingen.collect().map(_.mkString("\t"))

// The header row follows %table directly; data rows are separated by newlines.
println(rows.mkString("%table value\tcount\n", "\n", ""))
```
This code allows you to create a line graph with the values that are in the Dataset.
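
Since every table in Zeppelin needs the same TAB and newline formatting, it can be convenient to wrap it in a small helper. The sketch below is not part of Zeppelin: `ZeppelinTable` and `render` are hypothetical names, and it assumes the header row follows the `%table` keyword directly.

```scala
// Hypothetical helper, not a Zeppelin API: builds the text that the %table
// output format expects (TAB-separated columns, newline-separated rows).
object ZeppelinTable {
  // A TAB or newline inside a cell would break the layout, so replace it with a space.
  private def clean(cell: Any): String =
    s"$cell".replaceAll("[\t\n\r]", " ")

  // The first line holds the column names; every following line holds one data row.
  def render(header: Seq[String], rows: Seq[Seq[Any]]): String = {
    val lines = (header +: rows).map(_.map(clean).mkString("\t"))
    lines.mkString("%table ", "\n", "")
  }
}
```

With collected rows at hand, a call such as `println(ZeppelinTable.render(Seq("value", "count"), rows))` would print the table text. Cleaning the cells first keeps a stray TAB in the data from shifting the columns.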
## Observations
When working with the Databricks UI it takes less effort to display your Dataset: there is no need to collect and reformat your data into a TAB/newline format. On the other hand, the Zeppelin format makes it easier to name the columns you use in your graph.

Both the Zeppelin and Databricks installations allow you to switch between the various languages.

Getting data into your system is different: Databricks does not allow a shell script to download data and process it, so you need to upload your data set separately.
# Conclusion
I think that both systems are equal. They both have their quirks, but neither limits you in any way. Happy coding.