MIHAI MARUSEAC: I am Mihai. I've been working recently on file systems for TensorFlow, so I'm going to talk about what file system support TensorFlow has today and what new changes are coming in.

First, the TensorFlow file system layer can be used from Python. We can create directories, we can create files and read or write to them, and we can list what's in a directory. You might say this looks similar to Python's own file API, and if you compare the two, they mostly look the same; it's just that some names change. There is one difference in mkdir: in Python, the directory must not already exist, but there is a flag that changes this. Even so, the two still look similar, and you might ask yourself why TensorFlow needs its own file system implementation.

One of the main reasons comes from the file systems that TensorFlow supports, through the format of the path argument. In TensorFlow, you can pass a normal path to a file, or you can pass something that looks like a URL; these are usually called URIs, uniform resource identifiers. They are divided into three parts: the scheme, like https, gs, or s3; the host, if the file lives on a remote host; and the path, which looks like a normal file path on that host. TensorFlow supports multiple schemes, and what we have is a mapping between each scheme and a file system implementation. For example, for file we have the LocalFileSystem; for gs we have the GoogleCloudFileSystem; and similarly for viewfs, Hadoop, and so on. Keep this mapping in mind, because it is the core reason why TensorFlow needs its own file system implementation.

However, this is not the only reason. We have a lot of use cases in TensorFlow code. Besides the basic file I/O that I showed in the first example, we also need to save and load models, we need to checkpoint, we need to dump tensors to files for debugging, we need to parse images and other inputs, and tf.data datasets also touch the file system layer. All of this could be implemented ad hoc, but the best way to implement it is a layered approach: at the base layer you have the mapping between schemes and file system implementations, and at the top you have an implementation for each of these use cases. In this talk I'm going to present all of these layers, but keep in mind that the layers exist only for the purposes of this presentation; the code grew organically and is not layered exactly this way.

So let's start with the high-level API, which is mostly what users see. It's what they want to do. When a user wants to load a saved model, they want to open the file that contains the saved model, read from it, and load inputs, tensors, and everything else, by calling a single function. That's why I'm calling this the high-level API, and I'm going to show some examples of it. One is API generation: whenever you build TensorFlow, while building the pip package, we create several protocol buffer files that contain the API symbols TensorFlow exports. To create those files we basically call the function CreateApiDefs, which dumps everything into them. Another example of the high-level API is DebugFileIO, where you can dump tensors into a directory and later review them to debug your application. And there are a few more, like loading saved models; note that loading a saved model only needs an export directory to read from.
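As a minimal C++ sketch of that idea (assuming the loader declared in tensorflow/cc/saved_model/loader.h; the model path below is hypothetical), the caller only supplies an export directory, and the file system layer resolves it, whether it is a local path or a URI:

```cpp
#include <string>

#include "tensorflow/cc/saved_model/loader.h"
#include "tensorflow/cc/saved_model/tag_constants.h"

// The export directory can be a local path or a URI such as
// "gs://my-bucket/model"; the file system layer described in this talk
// resolves it either way.
tensorflow::Status LoadModel(const std::string& export_dir,
                             tensorflow::SavedModelBundle* bundle) {
  tensorflow::SessionOptions session_options;
  tensorflow::RunOptions run_options;
  return tensorflow::LoadSavedModel(session_options, run_options, export_dir,
                                    {tensorflow::kSavedModelTagServe}, bundle);
}
```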
There are more examples: checkpointing, checkpointing of sliced variables across distributed replicas, tf.data datasets, and so on. The question is not how many APIs are available at the high level, but what they have in common. In principle, we need to write to files, read from files, get statistics about them, create and remove directories, and so on. But we also need to support compression, and we need to support buffered I/O, reading only a part of a file now and another part later instead of fitting everything in memory.

Most of these implementations come from the next layer, which I'm going to call the convenience API. It's similar to a middleware layer in a web application: it mostly transforms the bytes that are on disk into the information the user works with in the high-level API. Basically, 99% of the use cases just call six functions: reading or writing a string to a file, or reading or writing a proto, either text or binary. However, there are other use cases, as I mentioned before, for streaming and for buffered and compressed I/O, where we have the InputStreamInterface class and implementations of it for streaming and compression. SnappyInputBuffer and ZlibInputBuffer read from compressed data, MemoryInputStream and BufferedInputStream read in a streamed fashion, and so on. The InputBuffer class lets you read a single int from a file, then another int, then a string, and so on; you read the data in chunks.

All of these APIs at the convenience level are implemented in terms of the next layer, the low-level API, and that's the one we are mostly interested in. This level is the one that has to support multiple platforms, has to support all the URI schemes we currently support, has to support directory I/O (so far I haven't talked about directory operations in the higher-level APIs), and also lets users drop down and create their own implementations in case they need something that isn't provided so far.

If you remember from the beginning, we had this file system registry, a mapping from the URI scheme to the file system implementation. It is implemented as the FileSystemRegistry class, which is basically a dictionary: you can add a value, you can look up the value at a specific key, and you can list all the keys in the dictionary. That's all this class does. It is used by the next class, the environment class, Env, which provides the cross-platform support. We have a WindowsEnv and a PosixEnv: Windows when you are compiling and running TensorFlow on Windows, POSIX on the other platforms. There are also some other Env classes for testing, but let's ignore them for the rest of the talk.

The purpose of the Env class is to provide every low-level API that the user needs. For example, we have the registration APIs: get the file system for a file, get all the schemes that are supported, register a file system. Of particular note is the static member Default(), which allows a developer to write Env::Default() anywhere in the C++ code and get access to this class; basically, it's like a singleton pattern. So if you need to register a file system somewhere in your function and it's not registered yet, you just call RegisterFileSystem on Env::Default().
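As a minimal sketch of how the convenience helpers and Env::Default() fit together (assuming the whole-file helpers declared in tensorflow/core/platform/env.h; the paths and bucket name are made up):

```cpp
#include <string>

#include "tensorflow/core/platform/env.h"

// Convenience layer sketch: whole-file read/write helpers that take the Env
// singleton. The URI scheme on the path selects the file system.
tensorflow::Status CopyConfig() {
  tensorflow::Env* env = tensorflow::Env::Default();  // singleton-style access

  // Read a whole file into a string; "gs://" routes to the GCS file system.
  std::string contents;
  tensorflow::Status s =
      tensorflow::ReadFileToString(env, "gs://my-bucket/config.txt", &contents);
  if (!s.ok()) return s;

  // Write the same bytes back out through the local file system.
  return tensorflow::WriteStringToFile(env, "/tmp/config.txt", contents);
}
```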
Other functionality in Env covers the actual file operations, starting with creating files. You see there are three types of files: random access files, writable files, and read-only memory regions. The read-only memory regions are files that are mapped into memory, so you can read directly from memory. There are two ways to write to a file: either you overwrite the entire contents, or you append at the end. That's why there are two factory methods for writable files, NewWritableFile and NewAppendableFile.

More functionality in Env covers creating and removing directories, moving files around, basically all the directory operations. Then there are methods for listing the files in a directory, finding all the files that match a specific pattern, and getting information about a specific path entry: whether it exists, whether it is a directory, what its size is, and so on. All of these are implemented by each individual file system, but I'm getting to that soon. There is other functionality in Env as well, but it is out of scope for this talk: Env also supports threading, provides an API for a clock and for getting the time, loads shared libraries, and so on. We are not concerned with those here; I just wanted to mention them for completeness.

As I mentioned, there are three different file types that we support, the random access file, the writable file, and the read-only memory region, and this is the current API they provide. The first two have a name, and then they have operations like read and write. The memory region is already mapped in memory, so it doesn't need a name; you only need to know how long it is and what data is in there. I'm going to come back to these three types later (there is a short usage sketch below), which is why I wanted to mention them here.

Finally, the last important class is the FileSystem class, which holds the real implementation. This is where we implement how to create a file, how to read from a file, and everything else. In order to support TensorFlow in other languages, we also provide a C API interface that language bindings can link against. It is very simple at the moment: it provides the same functions we already saw, except they use C types and some extra markers in the signatures to mark them as symbols exported from a shared library. This C API interface is not complete; for example, it doesn't support random access files, so it doesn't let you read files from other languages unless that language binds directly to the FileSystem class I showed on a previous slide.

OK. This is everything about the file system support as it exists in the current implementation. However, there is now work to modernize TensorFlow's file system support in order to reduce complexity. When I speak about complexity, I am thinking about the diagram with the FileSystem class I showed you and all of its implementations: we have support for POSIX, for Windows, for Hadoop, S3, GCS, and many others, plus a lot of test file systems. Each one is implemented in a different way, and some of them are not consistent; some follow certain guidelines, others don't. But the main thing is that whenever you build a pip package, all of this has to be compiled into the binary.
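For reference, here is a minimal sketch of the three file abstractions created through Env's factory methods (the path is hypothetical, error handling is reduced to TF_CHECK_OK, and header contents can vary between TensorFlow versions):

```cpp
#include <memory>
#include <string>

#include "tensorflow/core/platform/env.h"

// Exercises the three file types: WritableFile, RandomAccessFile, and
// ReadOnlyMemoryRegion, all obtained from the Env singleton.
void TouchAndReadBack(const std::string& path) {
  tensorflow::Env* env = tensorflow::Env::Default();

  // WritableFile: overwrite the whole file (NewAppendableFile would append).
  std::unique_ptr<tensorflow::WritableFile> writable;
  TF_CHECK_OK(env->NewWritableFile(path, &writable));
  TF_CHECK_OK(writable->Append("hello"));
  TF_CHECK_OK(writable->Close());

  // RandomAccessFile: read n bytes starting at a given offset.
  std::unique_ptr<tensorflow::RandomAccessFile> readable;
  TF_CHECK_OK(env->NewRandomAccessFile(path, &readable));
  char scratch[5];
  tensorflow::StringPiece result;
  TF_CHECK_OK(readable->Read(/*offset=*/0, /*n=*/5, &result, scratch));

  // ReadOnlyMemoryRegion: the file mapped into memory; no name, just a
  // pointer to the data and its length.
  std::unique_ptr<tensorflow::ReadOnlyMemoryRegion> region;
  TF_CHECK_OK(env->NewReadOnlyMemoryRegionFromFile(path, &region));
  const char* data = static_cast<const char*>(region->data());
  (void)data;  // region->length() bytes are readable starting at data
}
```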
And whenever you compile anything that needs access to the file system, you have to compile all of those implementations as well. That's something we want to avoid in the future, and it is what the modular TensorFlow approach tries to address. This is the diagram of the world we want to live in: we want Core TensorFlow, which has a plugin registry (basically the file system registry I showed before), and we want plugins that implement the file system functionality. If you don't need Hadoop file system support, you don't need to compile the Hadoop file system plugin, and so on. The plugin registry, as I said, is similar to the file system registry: the scheme you want to implement is mapped to the plugin that implements that file system.

To summarize, the modular file system goals are: reduce compile time, since you no longer compile support for every file system we currently ship, only the ones you need; provide a full, complete C API interface instead of the partial one that exists at the moment; and provide an extensive test suite for all file systems. As soon as somebody develops a new file system, they can run our test suite and see where it fails. We have a lot of preconditions and postconditions that file system operations need to satisfy, implemented in this test suite, and whenever somebody implements a new file system, they just test against it. Furthermore, because each developer can now create their own file system that is no longer compiled as part of TensorFlow, we also need to provide some version guarantees. When we change the TensorFlow version, we cannot ask every file system developer to also recompile their code; that's why we need to provide these guarantees.

OK, so let's now see how a plugin is loaded into TensorFlow. As I said before, Env has an API that loads a shared library from disk. Either TensorFlow loads all the shared objects in some specific directory at runtime startup, or a user can request that TensorFlow load a shared object from a specific path. In both cases, as soon as the shared object is loaded, TensorFlow Core looks for the TF_InitPlugin symbol. That's a symbol the file system plugin needs to implement, because TensorFlow is going to call it; inside it, the plugin registers all of its implementations of the file system and hands them to TensorFlow.

We provide an API interface that plugins need to follow. This interface has structures with function pointers for every piece of functionality we currently provide. Since we have three file types, there are three function tables for their operations, and since there is one FileSystem class, the interface has one more function table for the operations that the FileSystem class must support. All of these functions are documented, listing their preconditions and postconditions, and everything here is plain C. The next section of the interface that plugins must implement is the metadata used for versioning and compatibility. For every structure, the three structures for the files and one structure for the file system, we have three numbers: the API number, the ABI number (application binary interface), and the total size of the function table, that is, the total number of function pointers we provide for that structure.
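To illustrate that shape, here is a rough, hand-written sketch of one function table together with its three metadata numbers; it is not the exact interface, which lives in tensorflow/c/experimental/filesystem/filesystem_interface.h and may differ in names and layout:

```cpp
#include <stddef.h>
#include <stdint.h>

// Illustrative only: the rough shape of one plugin operation table plus its
// versioning metadata, hand-written here for explanation.
typedef struct TF_Status TF_Status;
typedef struct TF_RandomAccessFile { void* plugin_file; } TF_RandomAccessFile;

typedef struct TF_RandomAccessFileOps {
  void (*cleanup)(TF_RandomAccessFile* file);
  // Reads `n` bytes starting at `offset` into `buffer`. Swapping `offset`
  // and `n` here is exactly the kind of ABI break discussed below.
  int64_t (*read)(const TF_RandomAccessFile* file, uint64_t offset, size_t n,
                  char* buffer, TF_Status* status);
} TF_RandomAccessFileOps;

// The three numbers that core and the plugin compare at registration time.
static const int kRandomAccessFileOpsApi = 0;   // bumped when entries are added
static const int kRandomAccessFileOpsAbi = 0;   // bumped when the layout changes
static const size_t kRandomAccessFileOpsSize = sizeof(TF_RandomAccessFileOps);
```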
We keep two separate numbers, API and ABI, to cover cases where we accidentally break ABI compatibility. For example, if you reorder the offset and n parameters of the read method, that is an ABI-breaking change: any code calling the function directly expects the offset in one position and the number of bytes to read in the other, so if you swap them the code no longer behaves properly. That breaks binary compatibility. For API compatibility, if you add a new method to the random access file table, that changes the API number.

OK. After a plugin fills in the data structures with its own implementations, it calls the function RegisterFilesystemPlugin. However, RegisterFilesystemPlugin has a lot of parameters: three pieces of metadata for each of four structures, so that's twelve parameters at the beginning, and then the structures with the operations. Because we don't want plugin authors to fill in all of these parameters manually, we provide the TF_REGISTER_FILESYSTEM_PLUGIN macro, to which you only pass the structures you care about, the structures that implement the file system.

When TensorFlow Core receives this call, it does the following steps. It checks that the scheme argument is a valid string, so it must not be empty or a null pointer. It checks that the ABI numbers the plugin says it was compiled against match the ABI numbers TensorFlow Core was compiled against; if there is a mismatch, we cannot load the plugin, because ABI compatibility is broken. Then TensorFlow Core checks the API numbers. If there is a mismatch between the API number the plugin was compiled against and the API number TensorFlow Core was compiled against, we can still load the plugin, but we give the user a warning because some functionality might be missing; it is still safe to load because the required methods are already included in the API at the moment. The next step is to validate that the plugin provided all the required methods; for example, if it provides support for creating random access files, it also needs to provide support for reading from them. Finally, if all of those validations pass, we copy the function tables the plugin provided into Core TensorFlow, so we don't always have to go through the interface to the library, and then we initialize and register the file system in TensorFlow so everyone else can use it. All the middle-level and high-level APIs keep functioning transparently with these changes; they don't need to change at all to move to the modular TensorFlow world.

As I mentioned, we also provide an extensive testing suite, where we create a structure, a layout in the directory that we are testing, and then exercise an operation. For example, in one test we create a directory and then try to determine the file size of that directory. Of course, this should fail, because a file size should only be returned if the path you asked about is a file, so we expect the test to fail at that point. If a file system doesn't support directories at all, the test should fail even earlier, with the directory creation reported as not supported.
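As a rough sketch of what such a check looks like, written here against the Env API with gtest-style macros rather than the actual modular test suite (the path is hypothetical, no cleanup is done, and include paths can vary by version):

```cpp
#include <string>

#include "tensorflow/core/lib/core/status_test_util.h"  // TF_ASSERT_OK
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/test.h"

// Conformance-style check: creating a directory must succeed, but asking for
// the file size of that directory must fail, because only files have sizes.
TEST(FileSystemConformance, GetFileSizeOnDirectoryFails) {
  tensorflow::Env* env = tensorflow::Env::Default();
  const std::string dir = "/tmp/tf_fs_conformance_dir";
  TF_ASSERT_OK(env->CreateDir(dir));

  tensorflow::uint64 size = 0;
  EXPECT_FALSE(env->GetFileSize(dir, &size).ok());
}
```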
We have around 2,000 lines of testing code, which I think cover all the corner cases that file systems can get into. Of course, when we add a new API, that will require adding more tests. As a result of running this suite against the POSIX file system implementation, we identified 23 bugs. For example, you can create a file object to read from a path that is actually a directory: creating the file object succeeds, but when you try to read from it, the read fails. Or you can create a directory over a path that is actually a file: as long as you don't add new files to that directory or read from it, the Python API says, yes, sure, you can create it, and it only fails when you try to create something else inside that directory. We also have the FileExists API, which at the moment doesn't differentiate between files and directories: your path can point to a directory and FileExists will still say yes, that exists, as if it were a file. So in a lot of places, after calling FileExists we added an IsDirectory check to make sure that a path we expect to be a directory really is one. Implementing all of this interface and its tests was also a good C++ learning experience.

The status of the modular file system work at the moment: the POSIX support is complete and tested, and I have started working on Windows support, hoping to finish it by the end of the year. The other file systems we support will be handled in cooperation with SIG IO, the special interest group, as they will be moved to their repository and will no longer live in the TensorFlow repository. Once the Windows support is finalized, I will send an email with instructions to the TensorFlow developers list on how to test the modular file system support. Once all the file systems listed here are converted, I'm going to make a change that flips a flag, and all the file system support that TensorFlow provides will be switched to the modular world.

Besides this, there are some more plans for the future. For example, there are some corner cases in the Python glue implementation where the file systems are not consistent with Python's normal file API. There are also high-level C++ APIs that reimplement some of the low-level API. For example, for dumping tensors to a file for later debugging, the creator of that API needed a way to recursively create directories; at the time, the file system support in TensorFlow didn't provide that functionality, so they implemented their own recursive directory creation inside the high-level API. What we need to do in the future is clean this up so that we only have the layered approach presented here. Finally, we want to deprecate the Env class, to separate the FileSystem implementation from the other concerns that class has, and in the end, after flipping to the modular TensorFlow world, we want to deprecate all the APIs that don't use this framework.

And I think this is all I wanted to talk about, so now it's open for questions. Also, this is slide 42. Like, the answer to everything. [APPLAUSE]

AUDIENCE: So am I missing something? At the beginning, why do we need to have our own file system, versus users just using the normal Python or C++ code to read, write, and do all these things? Can you explain?

MIHAI MARUSEAC: Yes.
So we have all of these requirements coming from TensorFlow use cases: reading models, writing them, and so on. We don't want users to always open the files themselves, write into them, and so on; we want to provide an API for that. We could subclass the Python file classes to build these APIs, but then we would need a subclass for every URI scheme we want to support. Let's say we want to read images from several locations through a single API. If we subclass the Python file class, we need one subclass for reading images from local disk, one for reading images from a cloud file system, one for reading images from the Hadoop file system, and so on. And whenever you want to add support for a new file system, you have to go into all of these APIs and add a subclass for every new file system you want to support.

AUDIENCE: I see. So if you just use this API, it will automatically work across all these different platforms. Is this what--

MIHAI MARUSEAC: Yeah. Basically, it's a different--

AUDIENCE: And for other languages that--

MIHAI MARUSEAC: And it's also for other languages, yes, because we are doing it in C++ and in the C layer, so whenever you create a new language binding, it's already there.

[MUSIC PLAYING]