>> Narrator: Live from San Francisco,
it's theCUBE,
covering Spark Summit 2017.
Brought to you by Databricks.
>> Welcome back, we're here at theCUBE at Spark Summit 2017.
I'm David Goad here with George Gilbert, George.
>> Good to be here.
>> Thanks for hanging with us.
Well here's the other man of the hour here.
We just talked with Ali, the CEO at Databricks
and now we have the Chief Architect
and co-founder at Databricks, Reynold Xin.
Reynold, how are you?
>> I'm good. How are you doing?
>> David: Awesome.
Enjoying yourself here at the show?
>> Absolutely, it's fantastic.
It's the largest Summit.
There are a lot of interesting things,
and a lot of interesting people to meet.
>> Well I know you're a really humble guy
but I had to ask Ali what should I ask Reynold
when he gets up here.
Reynold is one of the biggest contributors to Spark.
And you've been with us for a long time, right?
>> Yes, I've been contributing to Spark
for about five or six years,
and I probably have the most
commits on the project,
and lately I'm working more with other people
to help design the roadmap
for both Spark and Databricks.
>> Well let's get started talking about some
of the new developments from the keynote this morning
that maybe our audience at theCUBE
hasn't heard about yet.
What are some of the most exciting new developments?
>> So, I think in general if we look at Spark,
there are three directions I would say we're doubling down on.
The first direction is deep learning.
Deep learning is extremely hot and it's very capable,
but as we alluded to earlier in a blog post,
deep learning has reached sort of a point
in which it shows tremendous potential, but the tools
are very difficult to use.
And we are hoping to democratize deep learning
and do for deep learning what Spark did for big data,
with this new library called Deep Learning Pipelines.
What it does is integrate different
deep learning libraries directly in Spark,
and it can actually expose models in SQL.
So, even the business analysts
are capable of leveraging that.
So, that's one area, deep learning.
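As a rough sketch of what exposing models in SQL can mean in practice: below, a plain Python function stands in for a trained deep learning model (the real Deep Learning Pipelines library wires in actual models; the function and table names here are hypothetical).

```python
# A minimal sketch: register a stand-in "model" as a SQL function so an
# analyst can call it from plain SQL. A real deployment would register a
# trained deep learning model instead of this toy function.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dl-sql-sketch").getOrCreate()

def classify(score):
    # Stand-in for a deep learning model's predict() call.
    return "positive" if score > 0.5 else "negative"

spark.udf.register("classify_record", classify, StringType())

df = spark.createDataFrame([(0.9,), (0.2,)], ["score"])
df.createOrReplaceTempView("records")

spark.sql("SELECT score, classify_record(score) AS label FROM records").show()
```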
The second area is streaming.
Streaming, again, I think that a lot of customers
have aspirations to actually shorten the latency
and increase the throughput in streaming.
So, the structured streaming effort is going to be
generally available, and last month alone
on the Databricks platform,
I think our customers processed three trillion records
using structured streaming.
And we also have a new effort to actually push down
the latency all the way to some millisecond range.
So, you can really do blazingly fast streaming analytics.
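For readers who want to see the shape of it, a minimal structured streaming job looks roughly like the sketch below, using Spark's built-in rate source (which generates timestamped rows) in place of a real event feed.

```python
# A minimal structured streaming sketch using the built-in "rate"
# source in place of a real event feed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Count events per 10-second window; Spark maintains the state for us.
counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()  # blocks; stop with query.stop()
```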
And last but not least is the SQL data warehousing area.
Data warehousing, I think, is a very mature area
from a traditional database point of view,
but from a big data one it's still pretty new,
and there are a lot of use cases popping up there.
And in Spark, with approaches like the cost-based optimizer (CBO),
and also in the Databricks runtime with DBIO,
we're actually substantially improving the performance
and the capabilities of data warehousing features.
>> We're going to dig in to some of those technologies here
in just a second with George.
But have you heard anything here so far from anyone
that's changed your mind maybe about what to focus on next?
>> So, one thing I've heard from a few customers
is actually visibility and debuggability
of the big data jobs.
So, many of them are fairly technical engineers,
some of them are less sophisticated engineers,
and they have written jobs where sometimes the job runs slow.
And so the performance engineer in me would think,
how do I make the job run fast?
A different way to actually solve that problem
is how can we expose the right information
so the customer can actually understand
and figure it out themselves:
this is why my job is slow, and this is how I can tweak it
to make it faster.
Rather than giving people the fish,
you actually give them the tools to fish.
>> If you can call that bugability.
>> Reynold: Yeah, debuggability.
>> Debuggability.
>> Reynold: And visibility, yeah.
>> Alright, awesome, George.
>> So, let's go back and unpack some of those
kind of juicy areas that you identified.
On deep learning, you were able to distribute,
if I understand things right, the predictions.
You could put models out on a cluster
but the really hard part, the compute intensive stuff,
was training across a cluster.
And so Deeplearning4j and, I think, Intel's BigDL
were written for Spark to do that.
But with all the excitement over some of the new frameworks,
are they now at the point where they are as good citizens
on Spark as they are on their native environments?
>> Yeah so, this is a very interesting question,
obviously a lot of other frameworks
are becoming more and more popular,
such as TensorFlow, MXNet, Theano, Keras, and others.
What the Deep Learning Pipelines library does
is actually expose all these single-node
deep learning tools, which are highly optimized
for, say, GPUs or CPUs, to be available as an estimator,
like a module in the machine learning
pipeline library in Spark.
So, now users can actually leverage Spark's capability to,
for example, do hyperparameter tuning.
So, when you're building a machine learning model,
it's fairly rare that you just run something once
and you're good with it.
You usually have to fiddle with a lot of the parameters.
For example, you might run over a hundred experiments
to actually figure out what is the best model I can get.
This is where actually Spark really shines.
When you combine Spark with some deep learning library
be it BigDL or be it MXNet, be it TensorFlow,
you could be using Spark to distribute that training
and then do cross validation on it.
So you can actually find the best model very quickly.
And Spark takes care of all the job scheduling,
all the fault-tolerance properties, and how you read data
in from different data sources.
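The distributed tuning he describes maps to Spark ML's built-in model selection tools; here is a minimal sketch, with a logistic regression standing in for a deep learning estimator.

```python
# A minimal sketch of hyperparameter tuning with Spark ML's built-in
# cross-validation; a logistic regression stands in for a deep
# learning estimator here.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([0.1, 1.3]), 0.0),
     (Vectors.dense([0.2, 0.9]), 0.0),
     (Vectors.dense([0.3, 1.2]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.2, 1.5]), 1.0),
     (Vectors.dense([1.9, 0.8]), 1.0),
     (Vectors.dense([2.4, 1.1]), 1.0)],
    ["features", "label"])

lr = LogisticRegression(maxIter=10)

# The grid of "experiments" Spark will run for us.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

# Spark schedules the runs, evaluates each fold, and picks the winner.
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=2)

best = cv.fit(train).bestModel
```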
>> And without dropping too much into the weeds,
there was a version of that where Spark wouldn't take care
of all the communications.
It would maybe distribute the models and then do some
of the averaging of what was done out on the cluster.
Are you saying that all that now can be managed by Spark?
>> In that library, Spark will be able to actually
take care of picking the best model out of it.
And there are different ways you can design
how you define the best.
The best could be some average of different models.
The best could be just picking one out of them.
The best could be maybe a tree of models
that you classify on.
>> George: And that's a hyperparameter
configuration choice?
>> So that is actually built-in functionality
in Spark's machine learning pipeline.
And what we're doing now is you can actually
plug all those deep learning libraries directly into that,
as part of the pipeline, to be used.
Maybe just one more thing to add,
>> Yeah, yeah,
>> Another really cool functionality
of the Deep Learning Pipelines library is transfer learning.
So as you said, deep learning takes a very long time,
it's very computationally demanding.
And it takes a lot of resources and expertise to train.
But with transfer learning, what we allow the customers to do
is take an existing deep learning model
that was trained in a different domain, then retrain it
on a very small amount of data very quickly,
and adapt it to a different domain.
That's sort of how the demo on the James Bond car worked.
There is a general image classifier that we retrained on
probably just a few thousand images.
And now we can actually detect whether a car
is James Bond's car or not.
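A sketch of that transfer learning pattern: a pre-trained network is used as a fixed featurizer and only a small classifier is trained on top. The DeepImageFeaturizer shown is from the Deep Learning Pipelines (sparkdl) library of this era; treat the exact import path and options as assumptions.

```python
# A sketch of transfer learning with Deep Learning Pipelines; the
# sparkdl API shown is an assumption based on the library of this era.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer  # assumed import path

# A pre-trained InceptionV3 network turns each image into a feature
# vector; only the small logistic regression on top gets trained,
# which is why a few thousand labeled images can be enough.
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
classifier = LogisticRegression(labelCol="is_bond_car")

pipeline = Pipeline(stages=[featurizer, classifier])

# `labeled_images` is a hypothetical DataFrame with an image column
# and an is_bond_car label column.
model = pipeline.fit(labeled_images)
```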
>> Oh, and the implications there are huge,
which is you don't have to have huge training data sets
for modifying a model of a similar situation.
I want to, in the time we have,
there's always been this debate
about whether Spark should manage state,
whether it's database, key value store.
Tell us how the thinking about that has evolved
and then how the integration interfaces
for achieving that have evolved.
>> One of the, I would say, advantages of Spark is that
it's unbiased and works with a variety of storage systems,
be it Cassandra, be it HBase, be it HDFS, be it S3.
There is a metadata management functionality in Spark
which is the catalog of tables that customers can define.
But the actual storage sits somewhere else.
And I don't think that will change in the near future
because we do see that the storage systems
have matured significantly in the last few years
and I just wrote a blog post last week about the advantages
of S3 over HDFS, for example.
The storage price is being driven down
by almost a factor of 10X when you go to the cloud.
I just don't think it makes sense at this point
to be building storage systems for analytics.
That said, I think there's a lot of building
on top of existing storage systems.
There's actually a lot of opportunities for optimization
on how you can leverage the specific properties
of the underlying storage system
to get to maximum performance.
For example, how are you doing intelligent caching,
how do you start thinking about building indexes
against the data that's stored,
for scan workloads.
>> With Tungsten you take advantage of the latest hardware,
and we're getting more memory-intensive systems,
and now the Catalyst optimizer
has a cost-based optimizer, or will, and large memory.
Can you change how you go about knowing
what data you're managing in the underlying system
and therefore
achieve a tremendous acceleration in performance?
>> This is actually one area we've invested in with the DBIO module
as part of Databricks Runtime,
and what DBIO does, and a lot of this is still in progress,
is for example adding some form
of indexing capability to the system
so we can quickly skip and prune out all the irrelevant data
when the user is doing simple point look-ups,
or if the user is doing a scan-heavy workload
with some predicates.
That actually has to do with how we think
about the underlying data structure.
The storage system is still the same storage system,
like S3, but we're actually adding
indexing functionality on top of it as part of DBIO.
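DBIO's indexing itself is Databricks-proprietary, but the data-skipping idea resembles what open source Spark already does with partitioned layouts; a minimal illustration:

```python
# A minimal open-source analogue of the data-skipping idea: partition
# data on a column at write time so that selective reads only touch
# the relevant files instead of scanning the whole dataset.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("skipping-sketch").getOrCreate()

events = (spark.range(0, 1000000)
          .withColumn("day", (col("id") % 30).cast("int")))

# The directory layout itself becomes a coarse index.
events.write.partitionBy("day").mode("overwrite").parquet("/tmp/events")

# This filter prunes to a single partition directory; Spark never
# reads the other 29 days' files at all.
spark.read.parquet("/tmp/events").where(col("day") == 7).count()
```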
>> And so what would be the application profiles?
Is it just for the analytic queries
or can you do the point look-ups and updates
in that sort of scenario too?
>> So it's interesting you're talking about updates.
Updates are another thing that we've got a lot
of feature requests on.
We're actively thinking about how
we will support update workload.
Now, that said, I just want to emphasize for both use cases
of doing point look-ups and updates,
we're still talking about in the context
of analytic environment.
So we would be talking about for example maybe bulk updates
or low throughput updates
rather than doing transactional updates
in which every time you swipe a credit card,
some record gets updated.
That probably belongs more on transactional databases
like Oracle, or even MySQL.
>> What about people who started out with Spark on prem
and realize they're going to put much more
of their resources in the cloud,
but with IIoT, industrial IoT-type applications,
they're going to have Spark
maybe in a gateway server on the edge?
What do you think that configuration looks like?
>> Really interesting, it's kind of two questions maybe.
The first is the hybrid on prem, cloud solution.
Again, so one of the nice advantages of Spark
is the decoupling of storage and compute.
So when you want to move, for example,
workloads from on prem to the cloud,
the one you care the most about
is probably actually the data
'cause the compute,
it doesn't really matter that much where you run it
but data's the one that's hard to move.
We do have customers that are leveraging Databricks
in the cloud but actually reading data directly
from on prem, relying on the caching solution
we have that minimizes the data transfer over time.
That's one route I would say is pretty popular.
Another one is, with Amazon you can literally use
the Snowball functionality.
You give them hard drives via trucks,
and the trucks will ship your data to be put directly into S3.
With IoT, a common pattern we see
is a lot of the edge devices
would actually be pushing the data directly
into some firehose like Kinesis or Kafka,
or, I'm sure, Google and Microsoft
both have their own variants of that.
And then you use Spark to directly subscribe to those topics
and process them in real time with structured streaming.
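That subscribe-and-process pattern looks roughly like the sketch below; the broker address and topic name are placeholders, and the Kafka connector package must be on the classpath.

```python
# A sketch of subscribing to a Kafka topic with structured streaming;
# the broker address and topic name are placeholders. Requires the
# spark-sql-kafka connector package.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-sketch").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "edge-device-events")
          .load())

# Kafka delivers raw bytes; cast the payload to strings and process it
# in real time, here just counting records per device key.
counts = (stream.selectExpr("CAST(key AS STRING) AS device",
                            "CAST(value AS STRING) AS payload")
          .groupBy("device")
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()  # blocks; stop with query.stop()
```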
>> And so would Spark be down,
let's say, at the site level,
if it's not on the device itself?
>> It's an interesting thought, and maybe one thing
we should actually consider more in the future
is how we push Spark to the edges.
Right now it's more of a centralized model
in which the devices push data into Spark
which is centralized somewhere.
I've seen, for example,
I don't remember the exact use case,
but it has to do with some scientific experiment
at the North Pole.
And of course there you don't have a great uplink
to transfer all the data back
to some national lab,
so rather they would do smart parsing there
and then ship the aggregated results back.
There's another one but it's less common.
>> Alright well just one minute now before the break
so I'm going to give you a chance
to address the Spark community.
What's the next big technical challenge
you hope people will work on for the benefit of everybody?
>> In general Spark came along with two focuses.
One is performance, the other one's ease of use.
And I still think big data tools are too difficult to use.
Deep learning tools, even harder.
The barrier to entry is very high for all of these tools.
I would say, we might have already addressed
performance to a degree that
I think it's actually pretty usable.
The systems are fast enough.
Now, we should work on actually making
(mumbles) even easier to use.
That's also what we focus a lot on at Databricks.
>> David: Democratizing access right?
>> Absolutely.
>> Alright well Reynold, I wish we could talk to you all day.
This is great.
We are out of time now.
We appreciate you coming by theCUBE
and sharing your insights,
and good luck with the rest of the show.
>> Thank you very much David and George.
>> Thank you all for watching, here we are at theCUBE
at Spark Summit 2017.
Stay tuned, lots of other great guests coming up today.
We'll see you in a few minutes.