Facebook’s app size management is a unique challenge. Every day, developers input large amounts of code. Each line of code converts into additional bits in the final apps people download to their phones. If this code is not checked, the app will grow in size until it becomes unacceptable. We use compression to reduce app size. Compressed files are smaller and take up less space. This means that apps download quicker and require less bandwidth for billions around the globe. These savings are particularly important for regions with limited mobile bandwidth, which makes it more expensive to download large apps. However, compression is not enough to keep up with all the features and updates that we make to our apps. Superpack is a combination of compiler analysis and data compression that reveals size optimizations beyond what traditional compression tools can do. Superpack pushes the boundaries of compression to achieve much better compression ratios that existing compression tools.
Over the past two decades, Superpack has been in a position to reduce developer-induced app growth and keep our Android applications small. Superpack’s compression has helped reduce the size of our fleet of Android apps, which are substantially smaller in comparison to regular Android APK compression, with average savings of over 20 percent compared with Android’s default Zip compression. Superpack is used by Messenger, Instagram, WhatsApp and WhatsApp. Below is a table that shows how Superpack has reduced the size of these apps.
Superpack: Compilers meet data compression
While existing compression algorithms, such as Zip’s Deflate and Xz’s LZMA, work well with monolithic data, they weren’t enough to offset the pace of growth we were seeing in our apps, so we set out to develop our own solution. Compression is an established field. The techniques we have developed cover all aspects of compression, from data comprehension to Lempel-Ziv parsing to statistical code.
Superpack is a powerful tool for compressing code. This includes machine code and bytecode as well as other types structured data. The approach underlying Superpack is based on an insight in Kolmogorov’s algorithmic measure of complexity, which defines the information content of a piece of data as the length of the shortest program that can generate that data. Data can be reduced by being represented as a program that generates it. If the data is code, it can then be compressed into a smaller representation. A program that generates Fibonacci number, along with a list indicating their indices is highly compressed representation of a file containing such numbers. Compression isn’t new in the area of compression. Superpack’s innovative approach combines compiler methods with modern compression techniques in order to accomplish this goal.
It is a great idea to formalize compression as a generative process that produces small program. This gives data compression engineers access to a wealth of mature compiler tools that can be reused until the end of data compression. Superpack compression leverages common compiler techniques such as parsing and code generation, as well as more recent innovations such as Satisfiability modulo theories (SMT) solvers to find the smallest programs.
Superpack’s ability to combine these compiler techniques with mainstream data compression is a key ingredient in its effectiveness. Superpack’s compiler half has semantic knowledge that allows for enhanced LZ parsing (a step that removes redundancy) and improved entropy code (the step that generates short codes for frequently used pieces of information).
Improved LZ parsing
Compressors typically identify repeating sequences of bytes using an algorithm selected from the LZ family. Each algorithm attempts to replace recurring data sequences with pointers to previous occurrences. The distance between the current occurrence and the pointer is in bytes. The substitution will be smaller if the pointer is represented in less bits than the data. Superpack makes LZ parsing easier by allowing the discovery of longer repeating sequences and reducing the number bits used to represent pointers.
Superpack allows these improvements in the compression of programs by grouping data according to its AST. The following example shows that the longest repeated sequence is 2. The length of the longest repeating sequence increases to 4 when it is sorted by AST types. This includes the opcode, registers (Group 1 in this table) and immediates (2 in the table). The distance between repeated sequences in the raw parse is 2 instructions. In the grouped version, however, the distance between repeated sequences is 0. The distance is 0. The pointer generated by Superpack is therefore smaller than that computed naively.
But how do we decide when to split the code stream and when to leave it intact? Superpack’s recent work introduces hierarchical compression. This decision is incorporated into the optimizing component LZ parsing. It is called the optimal pare. The edited code below shows how to keep the last segment intact. This will allow you to generate one match using a pointer to five instructions and split the remainder. Splitting the remainder allows for longer matches by exploiting the sparseness in register combinations. This grouping of code reduces the distance by counting the number logical units between repeated occurrences along the AST instead of measuring the number bytes.
Improved entropy coding
Repeating sequences of bytes are substituted efficiently with a pointer to the previous occurrence. What does the compressor do with a non-repeating sequence or short sequences that are easier to represent than a pointer, and what can it do? In such cases, compressors represent the data literally by coding the values in it. The number of bits used to represent a literal exploits the distribution of values that the literal can assume. Entropy coding refers to the process of representing a value with roughly the same number of bits as the data’s entropy. Huffman coding is one of the most well-known compressor techniques.
Superpack includes an integrated ANS coder and a pluggable architecture which supports multiple such coding back end. Superpack helps improve entropy coding through identifying contexts where the literals to represent have lower entropy. Superpack’s analysis of the data structure is used to derive contexts, just as in LZ parsing. Below is a simplified sequence of instructions that shows seven addresses each, with the prefix “0x”. In a large volume of different arrangements of this code, the number of bits used by a regular coder to represent the address field would approach 3.
However, we notice that three out of the seven addresses are paired with the BL opcode, while another three are associated with B. One pair can be used with both. If the above pattern holds true throughout the code, the opcode could be used to provide a context for coding. This context makes it possible to use 2 instead of 3. Below is a table showing the code without and with the context. The opcode in the Superpack-compressed case can be seen as predicating the missing bit. This simple example shows how compiler contexts can be used in coding improvement. Real data is often fractional in terms of the bits gained, so the mappings between contexts, data and contexts are not always as precise as this example.
Programs as compressed representations
We explained how Superpack improves LZ parsing and entropy coding when the data being compressed consists of code. What happens if the data is unstructured? Superpack attempts to give the data structure by compressing them into programs. At decompression time the programs are read to retrieve the original data. This technique can be used to compress Dex references , , which are labels for known values in Dex code. Dex references are highly localized. We exploit this locality by transforming references into a language that stores the most recent values in a logic register and issues future values as deltas from those values.
Writing an efficient compressor for this representation reduces to the familiar register allocation problem in compilers, which decides when to evict values from registers to load new values. This reduction applies only to reference bytecode , but the general idea is applicable to all bytecode representations, namely that the resulting code can be optimized as described in the two previous sections. LZ parsing can be improved by collecting the deltas from a first group and then cohorting the MOV, PIN and opcodes.
Superpack on real data
The transforms that are applied to code are identical. Two parts make up metadata transforms. The first part uses the data structure to group items by type. The second part uses organizing rules to organize metadata. These include those that cause data to be sorted, or that expose correlations between items. This can be used to contextualize literals and distances.
The compression ratios for each of these formats by Zip, Superpack, and Xz are listed in the table below.
Superpack architecture and implementation
Superpack is a unique player in the compression space in that baked into it is knowledge of the types of data that it compresses. We developed a modular design that can be reused across all formats of Superpack to scale its development and usage at Facebook. Superpack is designed as an operating system. It implements paged memory allocations, file and archive abstractions as well as abstractions for manipulating and transforming instructions.
Compiler-oriented mechanisms fall into a dedicated compiler layer. Each format is implemented using a pluggable driver. Drivers use properties of compressed data and code label correlations to leverage the compression layer. Automated inference is used to parse the input code using an SMT solver. This post is not about how we use SMT solvers for compression. However, it will be a fascinating topic for another blog post.
The compression layer is also made up of pluggable modules. Superpack’s compressor is one of these modules. It includes a custom LZ engine as well as an entropy code back end. We also installed modules that used existing compression tools while we were building the compressor. Superpack’s role in this scenario is to organize the data into uncorrelated streams. The best effort compression is performed by an existing tool. This is efficient but has limitations in how it can identify compiler information. Superpack’s custom compression backend solves this problem by providing a fine-grained view into the internal representation of data. This allows it to exploit logic correlations at the fine level of one bit. We can choose from a variety of compromises between compression speed and compression ratio by abstracting the mechanism that does the compression work.
Superpack’s implementation contains a mix of code written in the OCaml programming language and C code. OCaml is used to manipulate complex, compiler-oriented data structures as well as to interface with an SMT solver. C is an obvious choice for decompression logic. It is simple, but it is sensitive to processor parameters such as L1 cache size.
Limitations and related work
Superpack is an asymmetric compressor, which means that decompression is fast but compression is allowed to be slow. Superpack has never been interested in stream compression. This is where data is compressed at the speed at which it is being transmitted. Because its current compression speed cannot keep up with the modern data transfer rate, Superpack cannot meet these constraints. Superpack can be used to compress structured data, code and integer data. It currently does not target images, sound, or video files.
On the Android platform there is a tradeoff between compressing to reduce download times and increasing disk space and updating size. This trade-off does not limit Superpack. It is a compromise between using compression to reduce download time and a possible increase in disk footprint and update size. App updates on Android are distributed in the form of deltas, which is a combination of app versions. These deltas cannot be created by tools that can decompress or recompress an app’s content. The current diffing process is not capable of interpreting Superpack archives. This means that deltas for apps with such archives are larger. These issues could be solved by a finer-grained interface between Superpack, Android tools, greater customizability in Android distribution mechanisms, and public documentation of Superpack’s file format, compression methods, and file format. Superpack is a great tool for compressing code. This goes beyond the existing compression that Google Play on Android uses. Our compression is, for now, beneficial to our users, despite the tradeoff.
Superpack leverages Jarek Duda’s work on asymmetrical numeral systems as its entropy coding back end. Superpack draws on ideas in superoptimization, along with past work on code compression. It leverages the Xz, Zstd, and Brotli compressors as optional back ends to do its compression work. Superpack also uses Microsoft’s Z3SMT solver , which can automatically parse and restructure many code formats.
Superpack combines data compression techniques and compiler to increase the density packed data. This is particularly useful for code like Dex bytecode or ARM machine code. Superpack has significantly reduced the size of our Android apps and saved billions of download times around the globe. Although we have briefly described the core concepts behind Superpack, we have not fully explored the scope of our work in asymmetric compressing.
Our journey is just beginning. Superpack is constantly improving with enhancements to its compiler and compression components. Although Superpack was originally designed to reduce the size of mobile apps, our success in increasing compression ratios for a wide range of data types has allowed us to expand our focus to other uses of asymmetric compression. A new executable file format is being developed that allows for disk space savings by compressing shared libraries and then decompressing them during load time. Superpack is being evaluated for its ability to compress code delta to reduce the size and time required to update software. Superpack is also being investigated for cold storage compressor. This will allow us to compress log data, and other files that are not often used.
Our mobile deployment was limited to our Android apps. Our work can be applied to other platforms such as iOS and we are exploring porting our implementation. Superpack is currently only available to our engineers. However, we hope to make Superpack accessible to all. We are currently exploring ways to make our compression work more compatible with the Android ecosystem. This blog post is one step in that direction. One day, we may consider open-sourcing Superpack.
. The post Superpack: Pushing limits on compression in Facebook’s mobile applications appeared first on Facebook Engineering.
#publicinterest #publicpolicy For More dWeb.News Public Policy News https://dweb.news/category/public-interest/