Distributed TensorFlow model training on Cloud AI Platform (TF Dev Summit '20)

[MUSIC PLAYING]

YANG FAN: I'm Yang Fan. I'm a senior engineer on the machine learning platform team at Cruise. Today, I'd like to tell you about some of the work we did over the last year: how the platform team built machine learning training infrastructure for Cruise.

Cruise is a San Francisco-based company. We are building the world's most advanced self-driving ridesharing technology, operating on San Francisco streets. If you visit San Francisco, you may have a chance to see our orange test cars running on the streets. In order to operate a driverless ridesharing service in San Francisco, our cars have to handle many complex situations. Every day, our test vehicles drive through many busy streets, and they interact with double-parked cars, cyclists, delivery trucks, emergency vehicles, pedestrians, and even their pets. We see many interesting things on streets all over the city. Our cars are designed to handle most of those situations on their own. On our cars, multiple cameras and other sensors detect the surrounding environment while the car makes decisions at the same time. Many of those tasks are mission critical and driven by machine learning models.

Those machine learning models are trained on a lot of data. Here you can see some of the raw data; our training data format is complex and highly dimensional. We have two work streams in model development. One is continuously retraining with new parameters and new data. The other is experimenting with and developing new models. The chart on the right shows the number of model training jobs each week. As you can see, the number of training jobs varies from week to week. At the same time, we want to train models fast, but not at too high a cost.

As a platform team, we want to fulfill the requirements of both our machine learning engineers and the company. On the engineering side, engineers want to train models at any time, with flexible tooling and framework support and without too many constraints. They want their training jobs to start as soon as they submit them, and they want to see results as early as possible. More importantly, they want an extremely natural and easy experience using our platform. On the company side, we need to make sure we spend every penny wisely, which means we need to run our model training jobs efficiently. More importantly, we need to let our engineers focus on building the mission-critical software that impacts car performance.

In order to fulfill those requirements, we decided to build our model training on top of Google Cloud AI Platform. AI Platform offers a fully managed training service through command-line tools and web APIs. We can launch our jobs on a single machine or on as many machines as our quota allows, so our machine learning engineers can scale up their training jobs whenever they want. AI Platform also provides very customizable hardware: we can launch training jobs on combinations of different GPU, CPU, and memory configurations, as long as they are compatible with each other. The AI Platform training service also provides good connectivity to other Google services, such as Cloud Storage and BigQuery. In order to train models on multiple machines efficiently, we also need a good distributed training strategy. We decided on Horovod, a distributed training framework open-sourced by Uber.
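Integrating Horovod touches only a handful of places in the training code. As a rough illustration, here is a minimal sketch using Horovod's Keras API with a placeholder model and random data; it is not Cruise's actual training code or internal framework, just the general pattern of initializing Horovod, pinning each process to a GPU, wrapping the optimizer, and broadcasting the initial state.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each process to a single GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder data and model, standing in for a real perception workload.
x = tf.random.uniform((1024, 32))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(64)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with ring all-reduce.
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Broadcast initial variables from rank 0 so all workers start in sync,
# and only write checkpoints from rank 0 to avoid clobbering.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint("/tmp/ckpt-{epoch}"))

model.fit(dataset, epochs=3, callbacks=callbacks)
```

mpirun or horovodrun launches one copy of a script like this per GPU across the cluster, and the processes keep their copies of the model in sync through all-reduce.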
With those few lines changed in our model training code, we can scale our training jobs beyond the single-machine limit. So far, we have tested training models with anywhere from 16 GPUs to 256 GPUs. As we scaled up the training cluster, we noticed that efficiency decreased, mostly because of communication overhead: the more GPUs there are, the more communication there is.

To find the most efficient setup for a training job, at least two factors need to be considered. One is the unit cost; the other is the total training time. The chart on the right shows one example training job: training on one million images at one image per second per GPU, using NVIDIA GPUs on highmem machines with 64 cores. As you can imagine, using more GPUs reduces the total training time. However, too many GPUs also means reduced efficiency, so the blue line on the chart flattens from left to right as the number of GPUs increases. And while more GPUs save training time, the average cost keeps increasing, so the red line, the total cost, goes up from left to right. For this particular job, we found that using between 40 and 50 GPUs was the most cost-effective; a rough worked example of this tradeoff appears below. We decided to build an automated system to provide the best setup to our users.

This is the high-level architecture of our distributed training system. Users interact with a command-line tool. The command-line tool packages the code and dependencies with the parameters into a training job, then submits the job to our governance service. The service examines the job parameters to prevent accidental misconfiguration, for example using too many GPUs or too much memory. At the same time, the service translates the compute requirements into actual machine types on AI Platform, so users don't need to memorize all the different machine types or the cost model themselves. All of our training jobs are registered in a tracking system, where we keep references to the job's source code and its parameters; when we want to analyze a model's performance, we can trace back to the job information if needed. The training inputs and outputs are stored in Google Cloud Storage, GCS for short. GCS provides cost-efficient storage and high throughput for our model training jobs. Once a job starts to produce metrics and results, users can view them there as well.

While a job is running, our monitoring service constantly pulls the job metrics through the AI Platform APIs and feeds them into Datadog. So far, we care about GPU and CPU utilization and job duration. When utilization is too low, it can mean the job is stuck or doesn't need that many resources. Then we can notify the users to inspect the job or adjust the machine setup to save cost.

Our internal framework is the bridge between the training system and our users. It provides abstractions over the interactions with the training backend, so engineers can focus on writing model code. We can also switch the distributed training strategy by changing one line in the configuration. As I mentioned on the previous slide, the command-line tool automatically packages the code and submits the training job. The framework allows users to specify their compute requirements simply as a number of GPUs.
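Before moving on to how the backend fills in the rest, here is the rough worked example of the time-versus-cost tradeoff promised above. All of the throughput, efficiency, and price numbers below are hypothetical placeholders, not Cruise's measurements or Google Cloud's actual pricing; the point is only the shape of the curves, with time savings flattening while total cost keeps climbing as GPUs are added.

```python
# Hypothetical back-of-the-envelope model of the time-versus-cost tradeoff.
# Every number here is an illustrative placeholder.

IMAGES = 1_000_000           # total images in one training run
IMGS_PER_SEC_PER_GPU = 1.0   # single-GPU throughput from the example above
GPU_PRICE_PER_HOUR = 2.50    # assumed hourly price per GPU
HOST_PRICE_PER_HOUR = 3.80   # assumed hourly price per 64-core host
GPUS_PER_HOST = 8            # assumed GPUs attached to each host


def scaling_efficiency(n_gpus: int) -> float:
    """Crude model: efficiency drops as communication overhead grows."""
    return 1.0 / (1.0 + 0.005 * n_gpus)


def time_and_cost(n_gpus: int) -> tuple:
    """Return (hours, dollars) for one pass over the dataset."""
    throughput = n_gpus * IMGS_PER_SEC_PER_GPU * scaling_efficiency(n_gpus)
    hours = IMAGES / throughput / 3600.0
    hosts = -(-n_gpus // GPUS_PER_HOST)  # ceiling division
    dollars = hours * (n_gpus * GPU_PRICE_PER_HOUR + hosts * HOST_PRICE_PER_HOUR)
    return hours, dollars


for n in (8, 16, 32, 48, 64, 128, 256):
    h, c = time_and_cost(n)
    print(f"{n:4d} GPUs: {h:6.1f} h  ~${c:8.0f}")
```

The sweet spot depends on how you weigh engineer waiting time against compute dollars, which is the kind of judgment an automated system can encode instead of leaving it to each user.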
The training framework then has the backend figure out the actual machine setup. The framework also provides the interfaces for the governance and monitoring services.

This slide shows how users turn on Horovod training in the config. Our framework automatically wraps the optimizer in Horovod's DistributedOptimizer. The framework also changes some other parts of the job's behavior. One important change is how the processes are started. Horovod uses mpirun, or horovodrun, as the main process, which sends commands to the other workers in the cluster. So in order to use Horovod on AI Platform, we wrote a bootstrap script that runs on the master. From the master, the bootstrap script sends a command to each worker to fetch its GPU configuration, and then uses this information to set up the MPI parameters. The command we use to fetch the GPU information on the workers is nvidia-smi: because we have NVIDIA GPUs on the workers, we can use NVIDIA's nvidia-smi utility to fetch the list of GPUs. However, different NVIDIA driver versions place the binary in different locations, so we had to find it before we could execute it. Once we had the GPU list, the script parses the command output and uses it to build the MPI command.

Unlike a regular TensorFlow training job, we don't want to pause the training process for periodic evaluation, because we have to keep all the workers running at the same pace. If one worker paused for evaluation, the other workers would have to wait for it, and the entire job could fail because of a communication timeout. We decided to move the evaluation onto the master, so all workers can train at full speed. The evaluation process on the master monitors the checkpoints; each new checkpoint triggers a new evaluation, until the training process completes and the last checkpoint is evaluated.

The last feature we provide in the framework is error handling for system errors. This is one example from a very long-running job. The job ran for more than 24 hours, and in the middle there was a network event that disturbed the training. One of the workers shut down due to too many errors. Because all workers have to stay synchronized, shutting down one worker would have failed the entire job. Our framework detected the error, waited for the worker to come back, and then automatically restarted the job with parameters restored from the last written checkpoint, so the job could continue and eventually complete without any human intervention.

This year, we started exploring TensorFlow 2. One of the attractive features in TensorFlow 2 is natively supported distribution strategies. The multi-worker mirrored strategy should provide a better experience, built on top of collective ops, either ring all-reduce over gRPC or NCCL, NVIDIA's collective communications library. NCCL has many cool features. For example, it has improved efficiency at an even larger scale than before; a cluster can include tens of thousands of GPUs. It has also improved topology detection: it can create either a ring or a tree, depending on the interconnects. Google AI Platform is going to support TensorFlow 2 distribution strategies out of the box, so we will no longer need to write the bootstrap script ourselves.
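For reference, here is a minimal sketch of what that looks like with TF 2's multi-worker strategy, using the experimental API as it existed around the time of this talk. It assumes TF_CONFIG is set on each worker, which AI Platform populates for custom training jobs; the tiny model and random data are placeholders, not a real Cruise workload.

```python
import tensorflow as tf

# Multi-worker synchronous training with collective all-reduce.
# NCCL is requested explicitly; RING over gRPC is the other option.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)


def make_dataset():
    x = tf.random.uniform((1024, 32))
    y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(64)


with strategy.scope():
    # Variables created inside the scope are mirrored across workers and
    # kept in sync by the collective ops chosen above.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(make_dataset(), epochs=3)
```

Compared with the Horovod setup, there is no separate launcher or bootstrap script: each worker runs the same script, and TF_CONFIG tells it its role in the cluster.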
So if you are building model training infrastructure for your team, I have four takeaways for you.

First, try not to reinvent the wheel. There are many solutions available on the market, so reach out to them; I believe they are willing to solve the problem with you. We have received a significant amount of support from the AI Platform team. The collaboration allowed Cruise to focus on building the best self-driving technology, while the AI Platform team could build the best machine learning service by learning about customer needs from us.

Second, be aware of cost. One of our responsibilities is to help the company operate sustainably. Model training can be very expensive, but there are many ways to save cost. The infra team needs to provide not only the hardware and software solutions but also the best practices for improving efficiency.

Third, rely on the community. I'd like to thank Uber for open-sourcing the Horovod framework, and the TensorFlow team for answering questions, fixing bugs, and delivering new features. I also want to thank the many excellent open-source projects in the community. We learned many good things from them, and we are looking for opportunities to contribute back.

Last but not least is developer experience. As the infra team at Cruise, we serve our internal customers, the machine learning engineers. They already face many challenges day to day; using our platform shouldn't be another one. If our tools and services can help our customers spin the flywheel a little smoother and a little faster each time, eventually keeping the flywheel turning will be effortless.

So once again, we are Cruise. We are building the world's most advanced self-driving technology in San Francisco. If you are interested in us, please check out our website, getcruise.com, and feel free to reach out. Thank you for listening.

[MUSIC PLAYING]