
  • >> Narrator: Live from San Francisco,

  • it's theCUBE,

  • covering Spark Summit 2017.

  • Brought to you by Databricks.

  • >> Welcome back, we're here at theCUBE at Spark Summit 2017.

  • I'm David Goad here with George Gilbert, George.

  • >> Good to be here.

  • >> Thanks for hanging with us.

  • Well here's the other man of the hour here.

  • We just talked with Ali, the CEO at Databricks

  • and now we have the Chief Architect

  • and co-founder at Databricks, Reynold Xin.

  • Reynold, how are you?

  • >> I'm good. How are you doing?

  • >> David: Awesome.

  • Enjoying yourself here at the show?

  • >> Absolutely, it's fantastic.

  • It's the largest Summit.

  • There are a lot of interesting things,

  • a lot of interesting people who I meet.

  • >> Well I know you're a really humble guy

  • but I had to ask Ali what should I ask Reynold

  • when he gets up here.

  • Reynold is one of the biggest contributors to Spark.

  • And you've been with us for a long time right?

  • >> Yes, I've been contributing to Spark

  • for about five or six years

  • and I probably have the largest number

  • of commits to the project

  • and lately I'm working more with other people

  • to help design the roadmap

  • for both Spark and Databricks with them.

  • >> Well let's get started talking about some

  • of the new developments that you want to share,

  • things maybe our audience at theCUBE hasn't heard

  • here in the keynote this morning.

  • What are some of the most exciting new developments?

  • >> So, I think in general if we look at Spark,

  • there are three directions I would say we are doubling down on.

  • The first direction is deep learning.

  • Deep learning is extremely hot and it's very capable

  • but as we alluded to earlier in a blog post,

  • deep learning has reached sort of a mass produced point

  • in which it shows tremendous potential but the tools

  • are very difficult to use.

  • And we are hoping to democratize deep learning

  • and do what Spark did to big data, to deep learning

  • with this new library called deep learning pipelines.

  • What it does, it integrates different

  • deep learning libraries directly in Spark

  • and can actually expose models in SQL.

  • So, even the business analysts

  • are capable of leveraging that.

  • So, that's one area, deep learning.
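
As a minimal illustration of the "expose models in SQL" idea (not the actual Deep Learning Pipelines API, which provides its own transformers and registration helpers): a single-node model's prediction can be wrapped as a Python UDF and registered so analysts can call it from plain SQL. The table and function names below are hypothetical.

```python
# Minimal sketch: wrap a single-node model's prediction as a Spark SQL UDF
# so that business analysts can call it from SQL. The model logic here is a
# placeholder; sparkdl (Deep Learning Pipelines) ships richer integrations.
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("dl-sql-sketch").getOrCreate()

def predict_score(features):
    # Placeholder for real inference with a deep learning model loaded
    # on the executors (e.g. a Keras or TensorFlow model).
    return float(sum(features or []))

# Register the function so it is visible to SQL queries.
spark.udf.register("predict_score", predict_score, DoubleType())

# A hypothetical `images` table with a `features` array column can now be
# scored with ordinary SQL.
spark.sql("SELECT id, predict_score(features) AS score FROM images").show()
```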

  • The second area is streaming.

  • Streaming, again, I think that a lot of customers

  • have aspirations to actually shorten the latency

  • and increase the throughput in streaming.

  • So, the structured streaming effort is going to be

  • generally available and last month alone

  • on Databricks platform,

  • I think our customers processed three trillion records,

  • last month alone using structured streaming.

  • And we also have a new effort to actually push down

  • the latency all the way to some millisecond range.

  • So, you can really do blazingly fast streaming analytics.
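
A rough sketch of the structured streaming API being referenced, using the built-in "rate" test source and a windowed count so it runs without external infrastructure; this is plain open-source Spark, not the lower-latency Databricks work mentioned above.

```python
# Minimal structured streaming sketch: windowed counts over the built-in
# "rate" source, which emits (timestamp, value) rows for testing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

events = (spark.readStream
          .format("rate")                 # test source, no Kafka needed
          .option("rowsPerSecond", 100)
          .load())

# Count events per 10-second window; Spark manages the streaming state.
counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")          # emit full updated counts each trigger
         .format("console")
         .start())

query.awaitTermination()
```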

  • And last but not least is the SQL data warehousing area.

  • Data warehousing I think is a very mature area

  • from outside of the big data point of view,

  • but from a big data one it's still pretty new

  • and there's a lot of use cases popping up there.

  • And in Spark, with approaches like the CBO and also the work here

  • in the Databricks Runtime with DBIO,

  • we're actually substantially improving the performance

  • and the capabilities of data warehousing features.
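
The CBO mentioned here is the cost-based optimizer that landed in open-source Spark 2.2; a sketch of how it might be enabled and fed statistics is below (DBIO itself is part of the proprietary Databricks Runtime, and the table and column names are assumptions).

```python
# Sketch: turning on Spark's cost-based optimizer and collecting the table
# and column statistics it uses for cardinality estimates and join reordering.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cbo-sketch").getOrCreate()

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

# Hypothetical warehouse tables; the statistics are stored in the catalog.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

# Later queries over these tables can benefit from CBO join decisions.
spark.sql("""
    SELECT c.region, SUM(s.amount) AS revenue
    FROM sales s JOIN customers c ON s.customer_id = c.id
    GROUP BY c.region
""").explain(True)
```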

  • >> We're going to dig in to some of those technologies here

  • in just a second with George.

  • But have you heard anything here so far from anyone

  • that's changed your mind maybe about what to focus on next?

  • >> So, one thing I've heard from a few customers

  • is actually visibility and debuggability

  • of the big data jobs.

  • So many of them are fairly technical engineers

  • and some of them are less sophisticated engineers

  • and they have written jobs and sometimes the job runs slow.

  • And so the performance engineer in me would think

  • so how do I make the job run fast?

  • A different way to actually solve that problem

  • is how can we expose the right information

  • so the customer can actually understand

  • and figure it out themselves.

  • This is why my job is slow and this is how I can tweak it

  • to make it faster.

  • Rather than giving people the fish,

  • you actually give them the tools to fish.

  • >> If you can call that bugability.

  • >> Reynold: Yeah, Debugability.

  • >> Debugability.

  • >> Reynold: And visibility, yeah.

  • >> Alright, awesome, George.

  • >> So, let's go back and unpack some of those

  • kind of juicy areas that you identified,

  • on deep learning you were able to distribute,

  • if I understand things right, the predictions.

  • You could put models out on a cluster

  • but the really hard part, the compute intensive stuff,

  • was training across a cluster.

  • And so Deeplearning4j and I think Intel's BigDL,

  • they were written for Spark to do that.

  • But with all the excitement over some of the new frameworks,

  • are they now at the point where they are as good citizens

  • on Spark as they are on their native environments?

  • >> Yeah so, this is a very interesting question,

  • obviously a lot of other frameworks

  • are becoming more and more popular,

  • such as TensorFlow, MXNet, Theano, Keras, and others.

  • What the Deep Learning Pipeline library does,

  • is to actually expose all these single-node

  • deep learning tools, which are highly optimized

  • for say even GPUs or CPUs, to be available as an estimator

  • or like a module in a pipeline of the machine learning

  • pipeline library in Spark.

  • So, now users can actually leverage Spark's capability to,

  • for example, do hyperparameter tuning.

  • So, when you're building a machine learning model,

  • it's fairly rare that you just run something once

  • and you're good with it.

  • Usually you have to fiddle with a lot of the parameters.

  • For example, you might run over a hundred experiments

  • to actually figure out what is the best model I can get.

  • This is where actually Spark really shines.

  • When you combine Spark with some deep learning library

  • be it BigDL or be it MXNet, be it TensorFlow,

  • you could be using Spark to distribute that training

  • and then do cross validation on it.

  • So you can actually find the best model very quickly.

  • And Spark takes care of all the job scheduling,

  • all the fault tolerance properties and how you read data

  • in from different data sources.
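
The hyperparameter search and cross-validation flow described above maps onto Spark's built-in spark.ml tuning utilities; a minimal sketch follows, with a logistic regression standing in for whatever deep learning estimator would be plugged in, and with an assumed training DataFrame containing "features" and "label" columns.

```python
# Sketch: distributed hyperparameter search with cross-validation in spark.ml.
# A deep learning estimator could replace LogisticRegression; the input path
# and column names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()
train = spark.read.parquet("/data/train")   # expects "features" and "label"

lr = LogisticRegression(featuresCol="features", labelCol="label")

# The grid of parameter combinations to evaluate across the cluster.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

model = cv.fit(train)     # Spark schedules and evaluates every combination
best = model.bestModel    # the best model found by cross-validation
```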

  • >> And without my dropping too much in the weeds,

  • there was a version of that where Spark wouldn't take care

  • of all the communications.

  • It would maybe distribute the models and then do some

  • of the averaging of what was done out on the cluster.

  • Are you saying that all that now can be managed by Spark?

  • >> In that library, Spark will be able to actually

  • take care of picking the best model out of it.

  • And there are different ways you can design

  • how you define the best.

  • The best could be some average of some different models.

  • The best could be just picking one out of these.

  • The best could be maybe there's a tree of models

  • that you classify it on.

  • >> George: And that's a hyperparameter

  • configuration choice?

  • So that is actually built-in functionality

  • in Spark's machine learning pipeline.

  • And what we're doing now is you can actually

  • plug all those deep learning libraries directly into that

  • as part of the pipeline to be used.

  • Another maybe just to add,

  • >> Yeah, yeah,

  • >> Another really cool functionality

  • of the deep learning pipeline is transfer learning.

  • So as you said, deep learning takes a very long time,

  • it's very computationally demanding.

  • And it takes a lot of resources, expertise to train.

  • But with transfer learning what we allow the customers to do

  • is they can take an existing deep learning model

  • that was well trained in a different domain and then retrain it

  • on a very small amount of data very quickly

  • and they can adapt it to a different domain.

  • That's sort of how the demo on the James Bond car worked.

  • So there is a general image classifier that we retrained on

  • probably just a few thousand images.

  • And now we can actually detect whether a car

  • is James Bond's car or not.
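
A hedged sketch of the transfer learning flow described: Deep Learning Pipelines' featurizer approach uses a pre-trained network (e.g. InceptionV3) as a fixed feature extractor and trains only a lightweight classifier on the small labeled set. The sparkdl API shown and the `labeled_images` DataFrame (with "image" and "label" columns) are assumptions based on how the library was described at the time.

```python
# Sketch of transfer learning with Deep Learning Pipelines (sparkdl):
# reuse a pre-trained image network as a featurizer, then train a simple
# classifier on a few thousand labeled images (e.g. "Bond car or not").
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer   # assumed sparkdl import

featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
classifier = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[featurizer, classifier])

# `labeled_images` is an assumed DataFrame of images with binary labels.
model = pipeline.fit(labeled_images)
predictions = model.transform(labeled_images)
```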

  • >> Oh, and the implications there are huge,

  • which is you don't have to have huge training data sets

  • for modifying a model of a similar situation.

  • I want to, in the time we have,

  • there's always been this debate

  • about whether Spark should manage state,

  • whether it's database, key value store.

  • Tell us how the thinking about that has evolved

  • and then how the integration interfaces

  • for achieving that have evolved.

  • >> One of the, I would say, advantages of Spark is that

  • it's unbiased and works with a variety of storage systems,

  • be it Cassandra, be it HBase, be it HDFS, be it S3.

  • There is a metadata management functionality in Spark

  • which is the catalog of tables that customers can define.

  • But the actual storage sits somewhere else.

  • And I don't think that will change in the near future

  • because we do see that the storage systems

  • have matured significantly in the last few years

  • and I just wrote a blog post last week about the advantage

  • of S3 over HDFS for example.

  • The storage price is being driven down

  • by almost a factor of 10X when you go to the cloud.

  • I just don't think it makes sense at this point

  • to be building storage systems for analytics.

  • That said, I think there's a lot of building

  • on top of existing storage systems.

  • There's actually a lot of opportunities for optimization

  • on how you can leverage the specific properties

  • of the underlying storage system

  • to get to maximum performance.

  • For example, how are you doing intelligent caching,

  • how do you start thinking about building indexes

  • actually against the data

  • that's stored, for scan workloads.
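
DBIO's indexing lives in the proprietary Databricks Runtime, but the general "skip irrelevant data" idea can be illustrated with open-source Spark over partitioned Parquet in S3, where partition pruning and predicate pushdown keep most files from ever being read; the bucket, partitioning, and column names below are assumptions.

```python
# Sketch: skipping irrelevant data with partition pruning and Parquet
# predicate pushdown over data stored in S3 (an open-source analogue of
# the data-skipping idea, not DBIO itself).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skipping-sketch").getOrCreate()

# Assumed layout: events partitioned by `date` under s3a://my-bucket/events/
events = spark.read.parquet("s3a://my-bucket/events/")

# The filter on the partition column prunes whole directories, and the
# `user_id` predicate is pushed into the Parquet reader so row groups whose
# statistics exclude it are skipped.
recent = (events
          .where(events.date == "2017-06-01")
          .where(events.user_id == 42))

recent.explain()   # plan shows PartitionFilters and PushedFilters
recent.show()
```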

  • >> With Tungsten, you take advantage of the latest hardware,

  • and we're getting more memory-intensive systems,

  • and now the Catalyst Optimizer

  • has a cost-based optimizer, or will have one, and large memory.

  • Can you change how you go about knowing

  • what data you're managing in the underlying system

  • and therefore,

  • achieve a tremendous acceleration in performance?

  • >> This is actually one area we invested in the DBIO module

  • as part of Databricks Runtime,

  • and what DBIO does, a lot of this is still in progress,

  • but for example, we're adding some form

  • of indexing capability to add to the system

  • so we can quickly skip and prune out all the irrelevant data

  • when the user is doing simple point look-ups.

  • Or if the user is doing a scan heavy workload

  • with some predicates.

  • That actually has to do with how we think

  • about the underlying data structure.

  • The storage system is still the same storage system,

  • like S3, but we're actually adding

  • indexing functionalities on top of it as part of DBIO.

  • >> And so what would be the application profiles?

  • Is it just for the analytic queries

  • or can you do the point look-ups and updates

  • in that sort of scenario too?

  • >> So it's interesting you're talking about updates.

  • Updates are another thing that we've got a lot

  • of feature requests on.

  • We're actively thinking about how

  • we will support update workload.

  • Now, that said, I just want to emphasize for both use case

  • of doing point look-ups and updates,

  • we're still talking about in the context

  • of analytic environment.

  • So we would be talking about for example maybe bulk updates

  • or low throughput updates

  • rather than doing transactional updates

  • in which every time you swipe a credit card,

  • some record gets updated.

  • That probably belongs more on transactional databases

  • like Oracle or MySQL even.

  • >> What about when you think about people who are going to run,

  • they started out with Spark on prem,

  • they realize they're going to put much more

  • of their resources in the cloud,

  • but with IIoT, industrial IoT type applications

  • they're going to have Spark

  • maybe in a gateway server on the edge?

  • What do you think that configuration looks like?

  • >> Really interesting, it's kind of two questions maybe.

  • The first is the hybrid on prem, cloud solution.

  • Again, so one of the nice advantages of Spark

  • is the decoupling of storage and compute.

  • So when you want to move for example,

  • workloads from on prem to the cloud,

  • the one you care the most about

  • is probably actually the data

  • 'cause the compute,

  • it doesn't really matter that much where you run it

  • but data's the one that's hard to move.

  • We do have customers that are leveraging Databricks

  • in the cloud but actually reading data directly

  • from on prem, relying on the caching solution

  • we have that minimizes the data transfer over time.

  • And that is one route I would say is pretty popular.

  • Another one is, with Amazon you can literally use

  • the Snowball functionality.

  • You give them hard drives, with trucks;

  • the trucks will ship your data directly and put it in S3.

  • With IOT, a common pattern we see

  • is a lot of the edge devices,

  • would be actually pushing the data directly

  • into some firehose like Kinesis or Kafka

  • or, I'm sure Google and Microsoft

  • both have their own variants of that.

  • And then you use Spark to directly subscribe to those topics

  • and process them in real time with structured streaming.
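
A minimal sketch of that ingestion pattern: devices publish to Kafka (Kinesis and the other clouds' equivalents work similarly), and Spark subscribes to the topic with structured streaming and lands the results in object storage. The broker address, topic, and S3 paths are assumptions, and the Kafka source requires the spark-sql-kafka connector package.

```python
# Sketch: subscribe to an IoT topic with structured streaming and land the
# raw payloads in S3. Broker, topic, and paths are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("iot-ingest-sketch").getOrCreate()

readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "device-readings")
            .load())

# Kafka delivers binary key/value columns; cast the payload to a string
# (parsing into a structured schema would normally follow).
parsed = readings.select(col("value").cast("string").alias("payload"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/iot/raw/")
         .option("checkpointLocation", "s3a://my-bucket/iot/checkpoints/")
         .start())

query.awaitTermination()
```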

  • >> And so would Spark be down,

  • let's say at the site level,

  • if it's not on the device itself?

  • It's an interesting thought and maybe one thing

  • we should actually consider more in the future

  • is how do we push Spark to the edges.

  • Right now it's more of a centralized model

  • in which the devices push data into Spark

  • which is centralized somewhere.

  • I've seen for example,

  • I don't remember the exact use case

  • but it has to do with some scientific experiment

  • in the North Pole.

  • And of course there you don't have a great uplink

  • for transferring all the data back

  • to some national lab

  • and rather they would do a smart parsing there

  • and then ship the aggregated result back.

  • There's another one but it's less common.

  • >> Alright well just one minute now before the break

  • so I'm going to give you a chance

  • to address the Spark community.

  • What's the next big technical challenge

  • you hope people will work on for the benefit of everybody?

  • >> In general Spark came along with two focuses.

  • One is performance, the other one's ease of use.

  • And I still think big data tools are too difficult to use.

  • Deep learning tools, even harder.

  • The barrier to entry is very high for all of these tools.

  • I would say, we might have already addressed

  • performance to a degree that

  • I think it's actually pretty usable.

  • The systems are fast enough.

  • Now, we should work on actually making

  • (mumbles) even easier to use.

  • That's also what we focus a lot on at Databricks here.

  • >> David: Democratizing access right?

  • >> Absolutely.

  • >> Alright well Reynold, I wish we could talk to you all day.

  • This is great.

  • We are out of time now.

  • We appreciate you coming by theCUBE

  • and sharing your insights

  • and good luck with the rest of the show.

  • >> Thank you very much David and George.

  • >> Thank you all for watching, we're here at theCUBE

  • at Spark Summit 2017.

  • Stay tuned, lots of other great guests coming up today.

  • We'll see you in a few minutes.
