
  • [MUSIC PLAYING]

  • YANG FAN: I'm Yang Fan.

  • I'm a senior engineer on the machine learning platform team at Cruise.

  • Today, I'd like to tell you about some of the work we have done over the last year: how the platform team built machine learning training infrastructure at Cruise.

  • So [INAUDIBLE] Cruise is a San Francisco-based company.

  • We are building the world's most advanced self-driving ridesharing technology, operating on the streets of San Francisco.

  • If you visit San Francisco, you may have a chance to see our orange test cars running on the streets.

  • In order to operate the driverless ridesharing service in San Francisco, our cars have to handle many complex situations.

  • Every day, our test vehicles run through many busy streets.

  • They interact with double-parked cars, cyclists, delivery trucks, emergency vehicles, pedestrians, and even their pets.

  • So we see many interesting things on the streets all over the city.

  • Our cars are designed to handle most of those situations on their own.

  • Our cars have multiple cameras and [INAUDIBLE] sensors detecting the surrounding environment and making decisions at the same time.

  • Many [INAUDIBLE] tasks are mission critical and driven by machine learning models.

  • Those machine learning models are trained on a lot of data.

  • Here you can see some of the raw data.

  • Our training data format is very complex and high-dimensional.

  • We have two work streams in model development.

  • One is continuously retraining with new parameters and data.

  • The other is experimenting and developing new models.

  • The chart on the right side shows the number of model training jobs each week.

  • As you can see, the number of training jobs varies week to week.

  • In the meantime, we want to train machine learning models fast, but not at too high a cost.

  • As a platform, we want to fulfill requirements for both our machine learning engineers and the company.

  • On the engineer side, engineers want to train models at any time.

  • They want the tools and supported frameworks to be flexible, without too many constraints.

  • Our machine learning engineers want their jobs to start as soon as they submit them, and they want to see the results as early as possible.

  • More importantly, they want to have an extremely natural and easy experience using our platform.

  • On the company side, we need to make sure that we spend every penny wisely.

  • That means we need to run our model training jobs efficiently.

  • More importantly, we need to let our engineers focus on building the mission-critical software that impacts the car's performance.

  • So in order to fulfill those requirements, we decided to build our model training on top of Google's Cloud AI Platform.

  • AI Platform offers a fully managed training service through command-line tools and web APIs.

  • We can launch our jobs on a single machine or on as many machines as our quota allows.

  • Therefore, we can let our machine learning engineers scale up their training jobs whenever they want to.

  • AI Platform also provides very customizable hardware.

  • We can launch our training jobs on a combination of different GPU, CPU, and memory requirements, as long as they are compatible with each other.

  • The AI Platform training service also provides good connectivity to other Google services, for example Cloud Storage and BigQuery.
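For example, submitting a training job programmatically through the AI Platform training REST API (via the Google API Python client) looks roughly like this; the project, bucket, and package names are placeholders for illustration.

```python
from googleapiclient import discovery

# Hypothetical project and bucket names.
PROJECT = "projects/my-project"

job_spec = {
    "jobId": "my_training_job_001",
    "trainingInput": {
        "region": "us-central1",
        "scaleTier": "CUSTOM",
        "masterType": "n1-highmem-64",      # customizable machine shape
        "packageUris": ["gs://my-bucket/packages/trainer-0.1.tar.gz"],
        "pythonModule": "trainer.task",
        "args": ["--epochs", "10"],
    },
}

ml = discovery.build("ml", "v1")
response = ml.projects().jobs().create(parent=PROJECT, body=job_spec).execute()
print(response)
```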

  • In order to train models on multiple machines efficiently, we also need a good distributed training strategy.

  • We decided to use Horovod.

  • Horovod is a distributed training framework open sourced by Uber.

  • With a few lines changed in our model training code, we can scale our training jobs beyond the single-machine limit.

  • So far, we have tested training models using anywhere from 16 GPUs to 256 GPUs.
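As a rough illustration of what "a few lines changed" means with Horovod's Keras API, here is a minimal sketch; the build_model and build_dataset helpers are placeholders, not Cruise's actual code.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # initialize Horovod across all processes

# Pin each process to a single GPU (assuming one process per GPU).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = build_model()        # placeholder: model definition
dataset = build_dataset()    # placeholder: input pipeline

# Scale the learning rate by the worker count and wrap the optimizer
# so gradients are averaged across workers with all-reduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

model.fit(
    dataset,
    epochs=10,
    # Broadcast initial weights from rank 0 so all workers start identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0)
```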

  • While we scaled up the training cluster, we noticed that the efficiency decreased.

  • That is mostly because of the communication overhead: the more GPUs there are, the more communication there is.

  • To find the most efficient setup for a training job, at least two factors need to be considered.

  • One is the [INAUDIBLE] unit cost.

  • The other is the total training time.

  • The chart on the right side demonstrates one training job example.

  • Say we are training one million images at one image per second per GPU using NVIDIA [INAUDIBLE] GPUs, and the machine type is a high-memory [INAUDIBLE] instance with 64 cores.

  • As you can imagine, using more GPUs can reduce the total training time.

  • However, too many GPUs would also reduce the [INAUDIBLE] efficiency.

  • So the blue line on the chart flattens out from left to right as the number of GPUs increases.

  • While more GPUs can save training time, the average cost goes up.

  • So the red line, showing the total cost, rises from left to right on the chart.

  • For this particular job, we found that using between 40 and 50 GPUs would be the most cost-effective.
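To make the trade-off concrete, here is a small, hypothetical calculation. The hourly prices and the scaling-efficiency model below are made-up assumptions for illustration, not actual Google Cloud pricing or Cruise measurements.

```python
# Toy cost/time trade-off for a distributed training job.
IMAGES = 1_000_000          # total images to process
IMG_PER_SEC_PER_GPU = 1.0   # per-GPU throughput at perfect scaling
GPU_PRICE_PER_HOUR = 2.50   # assumed hourly GPU price
HOST_PRICE_PER_HOUR = 3.00  # assumed hourly price of a 64-core host

def scaling_efficiency(num_gpus):
    """Toy model: efficiency drops as communication overhead grows."""
    return 1.0 / (1.0 + 0.005 * num_gpus)

def time_and_cost(num_gpus):
    throughput = num_gpus * IMG_PER_SEC_PER_GPU * scaling_efficiency(num_gpus)
    hours = IMAGES / throughput / 3600
    hosts = max(1, -(-num_gpus // 8))  # assume up to 8 GPUs per host
    cost = hours * (num_gpus * GPU_PRICE_PER_HOUR + hosts * HOST_PRICE_PER_HOUR)
    return hours, cost

for n in (16, 32, 48, 64, 128, 256):
    hours, cost = time_and_cost(n)
    print(f"{n:4d} GPUs: {hours:6.2f} h, ${cost:8.2f}")
```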

  • We decided to build an automated system to provide the best setup to our users.

  • This is the high-level architecture diagram of our distributed training system.

  • Users interact with a command-line tool.

  • The command-line tool packages the code and its dependencies, together with the parameters, into a training job, then submits the job to our governance service.

  • The service examines the job parameters to prevent any accidental misconfiguration, for example a job using too many GPUs or too much memory.

  • At the same time, the service translates the compute requirements into an actual machine type on AI Platform.

  • So users don't need to memorize all the different machine types, nor the cost model, themselves.
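As an illustration of what that governance check might look like, here is a minimal sketch; the limits, the machine-type table, and the function name are assumptions for the example, not the actual service.

```python
# Illustrative governance check: validate requested resources and map
# them to an AI Platform machine configuration. All values are made up.
MAX_GPUS = 256
MAX_MEMORY_GB = 512

MACHINE_TABLE = [
    # (max_gpus, max_memory_gb, masterType, acceleratorType)
    (1, 64, "n1-standard-16", "NVIDIA_TESLA_T4"),
    (8, 416, "n1-highmem-64", "NVIDIA_TESLA_V100"),
]

def validate_and_translate(gpus, memory_gb):
    if not 0 < gpus <= MAX_GPUS:
        raise ValueError(f"GPU count {gpus} outside allowed range 1..{MAX_GPUS}")
    if memory_gb > MAX_MEMORY_GB:
        raise ValueError(f"Memory request {memory_gb} GB exceeds {MAX_MEMORY_GB} GB")
    for max_gpus, max_mem, machine, accelerator in MACHINE_TABLE:
        if gpus <= max_gpus and memory_gb <= max_mem:
            return {"masterType": machine,
                    "acceleratorType": accelerator,
                    "acceleratorCount": gpus}
    raise ValueError("No compatible machine configuration found")
```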

  • All of our training jobs are registered in a [INAUDIBLE] tracking system, where we keep a reference to the job's source code and its parameters.

  • When we want to analyze model performance, we can trace back to the job information if needed.

  • The training inputs and outputs are stored in Google Cloud Storage, GCS for short.

  • GCS provides cost-efficient storage and high throughput for our model training jobs.

  • Once a job starts to produce metrics and results, you can also view them through the [INAUDIBLE] service.

  • While a job is running, our monitoring service constantly pulls the job metrics through the AI Platform APIs and feeds them into Datadog.

  • So far, we care about GPU and CPU utilization and job duration.

  • When utilization is too low, it could mean the job is stuck or that it doesn't need the resources.

  • Then we can notify the users to inspect the job, or adjust the machine setup to save some cost.
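A simplified sketch of such a polling loop is shown below; the metric source, threshold, and notification hook are placeholders rather than real APIs.

```python
import time

UTILIZATION_THRESHOLD = 0.10   # assumed alert threshold: 10% GPU utilization
POLL_INTERVAL_SEC = 300

def fetch_job_state(job_id):
    """Placeholder: query the AI Platform training job state."""
    raise NotImplementedError

def fetch_gpu_utilization(job_id):
    """Placeholder: read recent GPU utilization for the job's workers."""
    raise NotImplementedError

def notify(job_id, message):
    """Placeholder: alert the job owner."""
    print(f"[alert] {job_id}: {message}")

def watch(job_id):
    while True:
        if fetch_job_state(job_id) in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        utilization = fetch_gpu_utilization(job_id)
        if utilization is not None and utilization < UTILIZATION_THRESHOLD:
            notify(job_id, f"GPU utilization {utilization:.0%} is low; the job "
                           "may be stuck or over-provisioned.")
        time.sleep(POLL_INTERVAL_SEC)
```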

  • Our internal framework is a bridge between the training system and our users.

  • It provides [INAUDIBLE] and [INAUDIBLE] interactions with the training backend, so [INAUDIBLE] can focus on writing model code.

  • We can also [INAUDIBLE] the distributed training strategy by changing one line in the configuration.

  • As I mentioned on the previous slide, the [INAUDIBLE] tool automatically packages [INAUDIBLE] and submits the training job.

  • The framework allows users to specify [INAUDIBLE] by the number of GPUs, and then the backend figures out the actual machine setup.

  • The framework also provides the interface for the governance and monitoring services.

  • This slide demonstrates how users turn on Horovod training in the config.

  • In our framework, we automatically wrap the optimizer in Horovod's distributed optimizer.

  • The framework also changes some other aspects of the model's behavior.
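The config toggle and the optimizer wrapping might look roughly like the following sketch; the option name and helper are invented for illustration.

```python
import horovod.tensorflow.keras as hvd

# Hypothetical framework-level config flag; the real option name differs.
config = {"distributed_training": "horovod"}

def maybe_wrap_optimizer(optimizer, config):
    """If Horovod training is enabled, wrap the user's optimizer so that
    gradients are averaged across workers with all-reduce."""
    if config.get("distributed_training") == "horovod":
        hvd.init()
        return hvd.DistributedOptimizer(optimizer)
    return optimizer
```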

  • One important change is how the training process is started.

  • Horovod uses mpirun or horovodrun as the main process, which sends commands to the other workers in the cluster.

  • So in order to use Horovod on AI Platform, we wrote a bootstrap script that runs on the master.

  • From the master, the bootstrap script sends a command to each worker to fetch its GPU configuration, and then uses this information to set up the MPI parameters.

  • The command that we use to fetch the GPU information on the workers is [INAUDIBLE].

  • Because we have NVIDIA GPUs on the workers, we can use NVIDIA's nvidia-smi to fetch the GPU list.

  • However, a different NVIDIA driver version may place the [INAUDIBLE] at a different location, so we had to find it before we could execute it.

  • Once we had the GPU list, we used the script to parse the command output and then used that to build the MPI command.
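A rough sketch of that bootstrap flow follows; the worker host names, the SSH invocation, and the way the nvidia-smi path is discovered are assumptions for illustration, not the actual script.

```python
import subprocess

# Hypothetical worker host list; in practice the cluster layout is
# available to each AI Platform task via the TF_CONFIG environment variable.
WORKERS = ["worker-0", "worker-1", "worker-2"]

def gpu_count(host):
    """Count a worker's GPUs by locating nvidia-smi (its path can vary by
    driver version) and listing the devices with `nvidia-smi -L`."""
    find = subprocess.run(
        ["ssh", host, "command -v nvidia-smi || ls /usr/local/nvidia/bin/nvidia-smi"],
        capture_output=True, text=True, check=True)
    smi = find.stdout.strip().splitlines()[0]
    listing = subprocess.run(["ssh", host, f"{smi} -L"],
                             capture_output=True, text=True, check=True)
    return len(listing.stdout.strip().splitlines())

def build_horovodrun_command():
    slots = [f"{host}:{gpu_count(host)}" for host in WORKERS]
    total = sum(int(s.split(":")[1]) for s in slots)
    return ["horovodrun", "-np", str(total), "-H", ",".join(slots),
            "python", "train.py"]

if __name__ == "__main__":
    subprocess.run(build_horovodrun_command(), check=True)
```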

  • Unlike a regular TensorFlow training job, we don't want to pause the training process for periodic evaluation.

  • That's because we have to keep all the workers running at the same pace.

  • If one worker paused for evaluation, the other workers would have to wait for it, and then the entire job could fail because of a communication timeout.

  • We decided to move the evaluation onto the master, so all workers can keep training at full speed.

  • The evaluation process on the master monitors the checkpoints [INAUDIBLE].

  • Each new checkpoint triggers a new evaluation run, until the training process completes and the last checkpoint is evaluated.
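One way to implement such a checkpoint-watching evaluator on the master is with TensorFlow's checkpoint iterator. This is a sketch under the assumption that checkpoints land in a shared GCS directory; the path, the model/dataset builders, and the job-finished predicate are placeholders.

```python
import tensorflow as tf

CKPT_DIR = "gs://my-bucket/job-123/checkpoints"  # placeholder path

def evaluate(ckpt_path):
    """Restore the model from ckpt_path and run it on the eval set."""
    model = build_model()                                    # placeholder
    tf.train.Checkpoint(model=model).restore(ckpt_path).expect_partial()
    print(ckpt_path, model.evaluate(build_eval_dataset()))   # placeholder

# checkpoints_iterator yields each new checkpoint as it appears, and stops
# once no new checkpoint shows up within `timeout` seconds and `timeout_fn`
# reports that the training job has finished.
for ckpt in tf.train.checkpoints_iterator(
        CKPT_DIR,
        min_interval_secs=60,
        timeout=600,
        timeout_fn=lambda: training_job_finished()):   # placeholder predicate
    evaluate(ckpt)
```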

  • The last feature we provide in the framework is error handling for system errors.

  • This is one example from a very long-running job.

  • This job ran for more than 24 hours, and in the middle there was a network event disturbing the training job.

  • [INAUDIBLE] the workers shut down due to too many errors.

  • Because all workers have to stay synchronized, shutting down one worker would have failed the entire job.

  • Our framework detected the error, waited for the worker to come back, and then automatically restarted the job, restoring the parameters from the last written checkpoint.

  • So the job could continue and eventually complete without any human intervention.
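A minimal sketch of that retry-and-restore behavior, with placeholder helpers for job submission, failure classification, and checkpoint lookup:

```python
import time

MAX_RETRIES = 3

def is_transient_system_error(status):
    """Placeholder: decide whether a failure looks like a system error
    (network blip, lost worker) rather than a bug in the user's code."""
    return status.get("reason") in {"WORKER_LOST", "NETWORK_TIMEOUT"}

def run_training_job(resume_from_checkpoint=None):
    """Placeholder: submit the job, optionally restoring from a checkpoint,
    block until it finishes, and return its final status."""
    raise NotImplementedError

def latest_checkpoint():
    """Placeholder: return the newest checkpoint path for the job."""
    raise NotImplementedError

def run_with_retries():
    status = run_training_job()
    for attempt in range(MAX_RETRIES):
        if status.get("state") == "SUCCEEDED":
            break
        if not is_transient_system_error(status):
            break                        # real failure: surface it to the user
        time.sleep(60 * (attempt + 1))   # wait for the worker to come back
        status = run_training_job(resume_from_checkpoint=latest_checkpoint())
    return status
```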

  • This year, we started exploring TensorFlow 2.

  • One of the attractive features in TensorFlow 2 is the natively supported distributed training strategies.

  • The multi-worker mirrored strategy provides a better experience using collective ops, either ring all-reduce through gRPC or NCCL, NVIDIA's collective communications library.

  • NCCL has many cool features.

  • For example, it has improved efficiency at an even larger scale than before; the cluster could include tens of thousands of GPUs.

  • It also improved topology detection: it can create either a ring or a tree, depending on the GPU interconnects.

  • Google AI Platform is going to support the TensorFlow 2.0 distribution strategies out of the box, so you will no longer need to write the bootstrap script yourself.
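For reference, enabling the multi-worker mirrored strategy with NCCL collectives looks roughly like this, following the TF 2.0-era experimental API; the model and dataset builders are placeholders, and TF_CONFIG is assumed to be set by the platform.

```python
import tensorflow as tf

# MultiWorkerMirroredStrategy reads the cluster layout from the TF_CONFIG
# environment variable, which AI Platform sets for each task.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

with strategy.scope():
    model = build_model()                 # placeholder: model definition
    model.compile(optimizer="adam", loss="mse")

model.fit(build_dataset(), epochs=10)     # placeholder: input pipeline
```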

  • So if you are building model training infrastructure for your team, I have four takeaways for you.

  • Firstly, try not to reinvent the wheel.

  • There are many solutions available on the market; reach out to them, and I believe they will be willing to solve the problem for you.

  • We have received a significant amount of support from the AI Platform team.

  • The collaboration between us allowed Cruise to focus on building the best self-driving technology.

  • Meanwhile, the AI Platform team can build the best machine learning service by learning customer needs from us.

  • Secondly, be aware of cost.

  • One of our responsibilities is to help the company operate sustainably.

  • Although model training can be very expensive, there are many ways to save cost.

  • The infra team needs to provide not only the hardware and software solution, but also the best practices for improving efficiency.

  • Thirdly, rely on the community.

  • I'd like to thank Uber for open sourcing the Horovod framework, and also the TensorFlow team for answering questions, fixing bugs, and delivering new features.

  • I also want to thank the many excellent open source projects in the community.

  • We learned many good things from them, and we are also looking for opportunities to contribute back.

  • Last but not least is the developer experience.

  • As the infra team at Cruise, we serve our internal customers, the machine learning engineers.

  • They are already facing many challenges day to day; using our platform shouldn't be another one.

  • If our tools and services can help our customers spin the wheel a little more smoothly and a little faster each time, eventually keeping the flywheel running will be effortless.

  • So once again, we are Cruise, and we are building the world's most advanced self-driving technology in San Francisco.

  • If you are interested in us, please check out our website, getcruise.com.

  • And please feel free to reach out to us.

  • Thank you for listening.

  • [MUSIC PLAYING]
