RE-VETORIZAÇÃO DE CHAMADAS DE FUNÇÃO

RUBENS EMILIO ALVES MOREIRA

RE-VETORIZAÇÃO DE CHAMADAS DE FUNÇÃO

Dissertação apresentada ao Programa de Pós-Graduação em Ciência da Computação do Instituto de Ciências Exatas da Universidade Federal de Minas Gerais como requisito parcial para a obtenção do grau de Mestre em Ciência da Computação.

Orientador: Fernando Magno Quintão Pereira

Belo Horizonte
21 de março de 2017

RUBENS EMILIO ALVES MOREIRA

FUNCTION CALL RE-VECTORIZATION

Dissertation presented to the Graduate Program in Computer Science of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

Advisor: Fernando Magno Quintão Pereira

Belo Horizonte
March 21, 2017

© 2017, Rubens Emilio Alves Moreira. Todos os direitos reservados.

Ficha catalográfica elaborada pela Biblioteca do ICEx - UFMG

Moreira, Rubens Emilio Alves
D838f  Function call re-vectorization. / Rubens Emilio Alves Moreira. – Belo Horizonte, 2017. xxviii, 71 f.: il.; 29 cm.
Dissertação (mestrado) - Universidade Federal de Minas Gerais – Departamento de Ciência da Computação.
Orientador: Fernando Magno Quintão Pereira
1. Computação – Teses. 2. Arquitetura de computador. 3. Compiladores (Computadores). I. Orientador. II. Título. CDU 519.6*21(043)

I dedicate this work to my mom, family, friends, to every single person that symbiotically keeps UFMG's Department of Computer Science the great institution it is, and especially to my dear professor Fernando Magno Quintão Pereira. I am more than conversant with how hard I have tried to get to this point. I am more than convinced that your guidance set the stage for this success. I am more than aware of my otherwise unfortunate failure.

Acknowledgments

Agradeço à minha mãe por cada segundo do seu sempre que dedica cuidando de mim. E por cada segundo que já teve que estar no trabalho, seja na escolinha, no hospital João XXIII ou na casa de paciente, em plantões que, do que guardo comigo, já chegaram a mais de 72 horas seguidas! Tudo para garantir a excelente qualidade de vida que sempre tivemos, além de me permitir tempo para minhas coisas de escola. A meu pai, por ter sido bom com tantas pessoas; pelos saudosos momentos de presença; pela insistente permanência desgastante; por ter sido tudo que teve de ser, sem acasos. A meus pais, pelo carinho imensurável. E eterno. À minha família, por tudo que me deram; por tudo que me fizeram aprender.

To the magnificent persistence of my dearest professor Fernando Magno Quintão Pereira, to whom I owe each opportunity I have been blessed with in the past two years, and, take note, perhaps even less! For the multiple times you did not let me completely lose myself: thank you, sir!

To my professors from UFMG's Department of Computer Science and the whole team that works together to keep this institution the great environment for learning it is. Special thanks to my dear professor Wagner Meira Jr.; to Sônia Borges (mother of all students), Sheilla, Stella, Maristela, and everyone from the PPGCC team. To my dear professors and friends from UFMG's Music school, who always had their doors open to receive me whenever I needed the good ol' bits of fun. To my beloved friends, not at all excluding the aforementioned ones: your presence added up to who I am, indeed some more than others, but every single one of you is entitled to a share of my achievements.
Special thanks to those I have spent the last few years with and who at times took better care of me than I myself: Péricles Oliveira, Larissa Leijôto, Mívian Marques, Marcos Felipe, Max Lima, Júlio Albinati, Camila Araújo, Gabriel Poesia, Rodrigo Caetano, Fernando Mussel, Elverton Fazzion, Osvaldo Morais, Vinícius Dias, Denise Britto, Samuel Evangelista, Paulo Bicalho, Luciana Lourdes, Luam Catão, Gabriel Caires, Marcus Rodrigues, Pedro Ramos, Marcos Almeida, Gleison Souza, Vitor Paisante, Michael Frank, Florian Brendner, Paschalis Mpeis, Douglas Teixeira, Victor Campos, Henrique Nazaré, Demontiê Júnior, Carolina Medeiros, Gina Nogueira, Filipe Arcanjo, Bruno Coutinho, Diogo Rennó, Izabela Karennina, Guilherme Sad, Lucas Moreira, Hudson de Almeida, Edson Júnior, Vinícius Garcia, Pedro Dias, Solange Oliveira, Humberto Fialho, Caio Ianelis, Henrique Lourenço, Murilo Barbosa, Maurício Veloso, Fernando Endo, Imane AlSaghira, Aswin Sridharan, Jake Enstrom, Antoine Froger, Spurrya Jaggi, Leo Souza, Adrian Martinu, Alex Meyer, David Sena, Guilherme Ferrari, Talita Ribeiro, Bruna Ziviani, Amanda Abreu, Mara Campelo, Lucas Matos, Rodrigo Chaves, Thales Linke, Débora Campos, Sandro Pessutti, Gabriel Ferraz, and to all my friends from my undergrad course in computer science, from UFMG's Compilers Lab, LBS, Speed, from Colégio Nossa Senhora das Dores, and from the Orchestre Universitaire de Rennes.

Para mi caro amigo Miguel Angel Alfredo Pérez Rueda, con quien aprendí tantas cosas de esa lengua genial que es el español. Estoy seguro de que aún nos encontraremos para platicar bastante y para que yo no más te grite sin razón.

My sincere apologies to those to whom I have conceded unfortunate moments of arrogance, regardless of how worded, irresponsibly deceiving, or mischievously silent the means of utterance.

A thankful nod to my bleeding hearts, who get me up in the morning, put me to bed, and save my life every now and then: Billie Holiday, Cartola, Prokofiev, Nina Simone, David Bowie, Liszt, Elza Soares, Miles Davis, Bill Evans, Villa-Lobos, Tchaikovsky, Amy Winehouse, Charles Mingus, Shostakovich, Coleman Hawkins, Leo Ornstein, Poulenc, Stravinsky, César Franck, Mozart.

Last, but not least, this little poem I wrote for you all, this mystic amalgam of wondrous people that continuously flies in and out of my life, filling out the gaps with sheer joy and genuine living. It is entitled Borboleta. To my non-Portuguese-speaking friends, I hope one day I will know your language well enough to be able to pick the proper words to translate what I meant with this poem.

Borboleta

Inspirado em Borboleta (Sommerfugl), opus 43 número 1, de Edvard Grieg.

Borboleta criou em mim asas pela mão.
E pousou.
Alazão na travessia do rio fez meandro com o peso do passo.
Pôs-se a galope, penseroso.
E esqueceu.

Levantou em mim alma pelas asas.
Rodopiou com alegria.
Alegrou todo canto.
Fez esquecer todo o rio.
E levitou.

Cria leito na esperança pelo repouso da borboleta.
Convenceu-se da eternidade!
Demagogo...
E amoleceu.

Abriu em mim as mãos pela alma.
Tomou tempo do futuro.
Demorou a pousar.
E planou.

Cada fibra do corpo soluçava latejo.
Prumou-se à ideia, penoso.
– Alazão, é medo!

O vôo seguinte foi alto.
Borboleta deu de bruços com a sorte:
– O vento é quem me sopra pro norte, mas a razão da alegria é você.

Põe-se então o rio a encher de pressa.
Alazão perde sorte à outra peça:
– Voltarás pra me ver?

Borboleta criou em mim alma alada.
De encanto, levou todo o rio.
Deu de asas com o vento.
stars, hide your fires; life trickles in everywhere
risca o chão com vara de marmelo; seria destino
upon flights across the ocean; over pools of Those memories
no more free steps to heaven
locuta es de triuio
this should be part none! what difference does it make
keep gawking; tip; take a nap; fall and snap
culpa rubet uultus meus
time and space do not exist
we should have proceeded no further in those business
amici, amici...
lesser, greater; not so happy, yet much happier
neither this, nor that; neither that,
aut tace aut loquare meliora silentio
the bleeding hearts, and the artists, make their stand
and there it goes again, spinning, weaving new patterns
my poor heart is sentimental, not made of wood
wake duncan with thy knocking; would not we thou couldst
triuium... triuium...
e o azul: mar alto de por-dizeres... virado de por-fazeres... à deriva
never treats me... sweet and gentle...
semper oracula mentiantur...
ontem, dor eterna; hoje, esquecimento total
lies, lies... a tissue of lies; all of it
lux aeterna luceat eis domine... donna eis requiem...
i've got it bad... and that ain't good...
not an answer; not even a laugh, Don't talk like that
it is hard to make meaning out of oneself, amid so many changes
ah, corra e olha o céu...
it is hard not to talk about that which lies constantly in mind
revolução... tem é que ajudar; se envolver; mudança vem é de dentro
it's so hard for us to really be
e, de repente, sertão: saudade de tudo

Resumo

Linguagens SPMD para arquiteturas SIMD, como C para CUDA, OpenCL e ISPC, contribuíram para melhorar a programabilidade de aceleradores SIMD e placas de processamento gráfico. No entanto, linguagens SPMD ainda não disponibilizam ao programador toda a flexibilidade que se pode obter a partir de programação SIMD explícita. A fim de contornar esta falha de expressividade, preservando a abstração SPMD, introduzimos a noção de Call Re-Vectorization (CREV). CREV permite que o programador altere a dimensão da vetorização durante a execução de um kernel SPMD, e o faz por meio de uma chamada aninhada de kernel. CREV provê uma abstração similar àquela oferecida pelo conceito de paralelismo dinâmico: é possível invocar um kernel dentro de outro kernel. Nossa abordagem reduz os custos associados a esse processo. Neste trabalho, apresentamos as definições formais de CREV, além de sua implementação no compilador ISPC. Para validar nossa abordagem, implementamos uma série de algoritmos clássicos explorando o conceito de Call Re-Vectorization. Tais algoritmos incluem casamento de padrão, busca em profundidade e Bellman-Ford, e foram implementados com CREV sem muito esforço. Uma vez compilados usando ISPC para gerar instruções vetoriais de máquinas Intel, nossas implementações são tão eficientes quanto soluções de estado da arte, sendo, em geral, mais simples de se programar. Por exemplo, nossa implementação simples de casamento de padrão atinge speedup de 12% sobre o algoritmo Knuth-Morris-Pratt.

Palavras-chave: Vetorização, Compiladores, SIMD.

Abstract

SPMD programming languages for SIMD hardware, such as C for CUDA, OpenCL, or ISPC, have contributed to increase the programmability of SIMD accelerators and graphics processing units. However, SPMD languages still lack the flexibility offered by low-level SIMD programming on explicit vectors. To close this expressiveness gap while preserving the SPMD abstraction, this dissertation introduces the notion of Call Re-Vectorization (CREV).
CREV allows changing the dimension of vectorization during the execution of an SPMD kernel, and exposes it as a nested parallel kernel call. CREV affords a programmability close to dynamic parallelism, a feature that allows the invocation of kernels from inside kernels, but at much lower cost. In this work, we present a formal semantics of CREV, and an implementation of it on the ISPC compiler. To validate our idea, we have used CREV to implement some classic algorithms, including string matching, depth-first search and Bellman-Ford, with minimum effort. These algorithms, once compiled by ISPC to Intel-based vector instructions, are as fast as state-of-the-art implementations, yet much simpler. As an example, our straightforward implementation of string matching beats the Knuth-Morris-Pratt algorithm by 12%.

Keywords: Vectorization, compilers, SIMD.

List of Figures

2.1 Toy example of CREV application. We use this snippet to highlight the main programming issues we try to tackle with CREV. The first issue is to bridge SIMT and SIMD, in the sense that we must be able to call SIMD-based functions within divergent regions.

2.2 Application of the crev directive within a divergent region.

3.1 μ-SIMD instruction set. Operands (o) can be either variables or integer constants.

3.2 The state of a μ-SIMD machine is a septuple M = (Θ, Σ, Λ, Π, Υ, P, pc). Θ is the set of active threads. A thread t ∈ Θ has a local memory σ, accessible through a memory bank Σ. Threads communicate through shared memory Λ. The stack Π tracks control flow divergences. A key component of Call Re-Vectorization is the thread stack Υ. The program counter, pc, keeps track of the next instruction ι ∈ P to be executed. The program P is a linear sequence of instructions. Although it never changes, we include it as state for convenience.

3.3 Auxiliary functions used to define μ-SIMD. split is a filter, dividing threads into two divergent sets (Θ0 and Θn). Auxiliary function push updates the synchronization stack Π due to control flow divergences.

3.4 Semantics of μ-SIMD's control flow instructions.

3.5 Semantics of arithmetic, logic and data-related instructions. Rule Tl loops over every thread t ∈ Θ and, for each one of them, executes instruction ι. No assumption can be made on the order in which instructions run.

3.6 Program written in μ-SIMD, plus its initial state.

3.7 Execution trace of the program in Figure 3.6. Column Var shows the contents of the last variable assigned. T indicates a branch taken; F indicates otherwise. A dedicated symbol marks inactive threads. For the syntax of instructions, we refer the reader to Fig. 3.1; for their semantics, Figs. 3.4 and 3.5.

3.8 Low-level code produced to call an r-function f.

3.9 A program written in ISPC, and the tree showing function calls for T0.

3.10 Example of three nested calls to r-functions. Calls currently in the activation stack are highlighted.

4.1 func is a regular division function, as present in ISPC's documentation. The similarities between C and ISPC code are notable: this function has valid syntax in both languages, but indeed carries a different meaning.
In C, such a function is a regular division of float variables a and b, whereas in ISPC such variables are actually vectors of values, each value associated with a thread. In the latter, the result is a vector of floats, generally with unique values per thread. func_divergent wraps the main operation from func with a divergent branch.

4.2 Sample matrix-based procedure in ISPC. proc_matrix creates a vector of varying values, which are zero-initialized and then, asynchronously, receive values depending on the thread they are subject to. In the upcoming section, we show how to process a matrix with two configurable dimensions, unlike this example, in which one dimension is parameterized and the other is given by the length of the processing warp (SIMD vector). The last lines show the output of running the program.

4.3 This example shows a very simple ISPC hello-world program. We try to cover the notion of a running warp, possible divergences, and some of the basic keywords from ISPC.

4.4 This dummy procedure shows a very simple ISPC series of assignments that depend on whether the variable belongs to the global address space or is local/private to each thread. The single invalid combination of variable attribution is that of assigning a varying value to a uniform var: the compiler may not know from which thread to extract the value and therefore cannot validate the syntax.

4.5 List of files modified in ISPC to implement our CREV idiom.

4.6 ISPC-CREV implementation of a Depth-First Search. We highlight the contribution of crev to achieving an active load-balancing policy during the traversal: whenever function dfs is called, the data within the varying variable child is distributed in independent calls to crev's target function dfs. This allows having all threads active within inner calls of dfs, even within the divergent region created by the last conditional of that function.

5.1 Comparison between CREV-based string matching (Algorithm 7), ISPC's parallel implementation, and the Knuth-Morris-Pratt version of pattern matching. The Y-axis shows runtime, in millions of cycles. The X-axis shows pattern sizes, in number of characters. The target text contains 256MB divided among 5,058,121 lines. White boxes show percentage of speedup (CREV over PAR); grey boxes show percentage of speedup (CREV over KMP).

5.2 Comparison between CREV's and ISPC's book filter (Algorithm 1). The Y-axis gives runtime, and the X-axis input size, in bits. White boxes show speedup (%) over PAR.

5.3 Comparison between CREV's and ISPC's version of Bellman-Ford. The Y-axis gives execution time, in millions of cycles, and the X-axis gives graph size, in number of nodes. White boxes show percentage of speedup over PAR.

5.4 Comparison between CREV-based DFS and ISPC's parallel version. The Y-axis gives execution time, in millions of cycles, and the X-axis gives graph size, in number of nodes. White boxes show percentage of speedup over PAR.

5.5 Comparison between CREV-based Leader Election and ISPC's parallel version. White boxes show percentage of speedup over PAR.

List of Tables

5.1 Runtimes for sort algorithms on different input vector lengths.
We wrote the mergesort and quicksort algorithms, both using crev and ISPC's launch, as well as relying on bitonic sort for fine-grain optimization. The results make explicit how performant our technique is: we obtained speedups at the cost of very small code changes. The first block of results is for the mergesort algorithm, whereas the bottom half holds results for the quicksort algorithm.

6.1 A scandalously brief timeline on GPUs. It makes clear both the growing number of transistors and the exponentially increasing maximum GFLOPS delivered by top-performance graphics processing boards throughout the past ten years.

Contents

Acknowledgments
Resumo
Abstract
List of Figures
List of Tables
1 Introduction
2 Overview
  2.1 Warp Synchronous Programming
  2.2 Dynamic Parallelism in CUDA
  2.3 Call Re-Vectorization
  2.4 Why CREV?
3 Semantics of CREV
  3.1 The Cornerstones of CREV
  3.2 Low-Level Semantics
  3.3 High-Level Semantics
  3.4 Properties of CREV
  3.5 Discussion
4 Implementation
  4.1 Making friends with ISPC
    4.1.1 Parallel Execution Model
    4.1.2 ISPC Language
    4.1.3 ISPC Compiler Architecture
  4.2 Implementation of CREV on ISPC
    4.2.1 ISPC-CREV
    4.2.2 Active Load Balancing
  4.3 Discussion
5 Experimental Evaluation
  5.1 String Matching
  5.2 Book Filter
  5.3 Bellman-Ford
  5.4 Depth-First Search and Leader Election
  5.5 Merge-Sort and Quick-Sort
  5.6 Discussion
6 A Scandalously Brief History of Vectorization
7 Final Thoughts
Terms and abbreviations
Bibliography
Appendix A ISPC-CREV Code
  A.1 Bitonic Sort

Chapter 1

Introduction

New hardware asks for new programming idioms. As an example, the appearance of general-purpose Graphics Processing Units (GPUs) has led to a revolution in programming: C for CUDA [Sanders and Kandrot, 2010], OpenCL [Munshi et al., 2011], ISPC [Pharr and Mark, 2012], and PyCUDA [Garland and Kirk, 2010, Nickolls and Dally, 2010] are among the languages designed to unscramble high-end programming for state-of-the-art accelerators.
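To give a concrete flavor of the programming model these languages share, here is a minimal ISPC kernel, written solely for this presentation: each program instance (a thread, in SIMT parlance) processes a strided slice of the input.

// Minimal SPMD kernel: lane i handles elements i, i+W, i+2W, ...,
// where W is the SIMD width (programCount in ISPC).
export void scale(uniform float data[], uniform int n, uniform float k) {
    for (varying int i = programIndex; i < n; i += programCount) {
        data[i] *= k;
    }
}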
These Multi-Threaded (MT) languages let programmers express computations as single kernels executed by many threads. They target architectures that combine SIMD and multi-threaded execution, like GPUs and multi-core CPUs with vector instructions. Developers are thus spurred into exploiting massive computational power from a clear coding perspective, with milder demands on hardware-specific awareness. While abstracting away low-level primitives yields simpler code, it also gives rise to flexibility and composability constraints. For instance, given a matrix of threads, one may either pack such processing units (threads) into context-dependent rows or columns. This packaging, or parallel dimension, is not uncommonly fixed throughout the execution of a kernel¹, thereby sifting out functions that may be invoked by the fixed group of threads. Moreover, threads may also be suspended and resumed due to thread-dependent control flow, which makes up yet another filter on functions that do not comply with divergence.

¹ Kernels are functions executed on the device side (GPUs), as opposed to regular functions, carried out on the host side.

GPUs and so-called vector machines, including CPUs with vector processing capabilities, fall into a special classification of computer architectures and parallel programming: Flynn's taxonomy. Proposed in 1966 by Michael J. Flynn [Flynn, 1972], it groups architectures by their level of parallelism: single-thread, multi-thread, and multi-program; and by their memory organization, i.e., having processors with a single or multiple data streams. For discussing our solution, it is ideal to bear in mind the following two settings: SIMD and SIMT. An SIMD architecture consists of a Single Instruction stream read, in lock-step, by a vector of n units, each using a different data source from the Multiple Data streams. The SIMT organization is a Single Instruction source executed by Multiple Threads, say n, each thread holding an SIMD lane of length m, i.e., every instruction runs n × m times. At times we shall refer to this latter model as either SIMT or multi-threaded (MT) programming.

Most languages that target hardware accelerators fix the parallel dimension along which threads are packed into SIMD vectors, or GPU warps², for the whole duration of a kernel call. This affects device-side library functions, which cannot assume any particular organization of parallelism nor thread activity. To further grasp how suitable it is to make such an assumption, pay heed to the fact that many hand-tuned libraries [Catanzaro, 2012, NVIDIA, 2016] provide functions that rely on having all threads enabled within a warp: (i) to use such functions, one must assure warp-wise control flow uniformity; (ii) using high-end code is profitable in terms of correctness and efficiency, and avoids replicating work.

² Following the NVIDIA jargon, we shall call a group of threads that execute in lock-step a warp.

Developers circumvent the constraints of pure SIMT programming in two ways: via warp-synchronous programming, or via dynamic parallelism. In the first case, the programmer profits from the fact that threads in an accelerator are grouped into warps to achieve direct thread communication without synchronization or memory sharing. Yet, warp-synchronous coding is not easily composable with classic multi-thread (MT) programming: developers must ensure that every thread within a warp participates in each collective operation; e.g., the CUDA __shfl function has undefined behavior when reading data from an inactive thread.
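This hazard is not CUDA-specific. The sketch below transposes it to ISPC, the language we target in this dissertation; gang_sum is a hypothetical helper (extract() is ISPC's cross-lane read), and the point is that it implicitly assumes every lane of the gang to be active.

// Hypothetical warp-synchronous helper: sums one value per lane.
inline uniform int gang_sum(varying int x) {
    uniform int total = 0;
    for (uniform int lane = 0; lane < programCount; lane++)
        total += extract(x, lane); // reads the lane even if it is inactive
    return total;
}

export void caller(uniform int out[]) {
    varying int v = programIndex;
    if (programIndex % 2 == 0) {
        // Divergent region: odd lanes are masked off here, yet gang_sum
        // still reads their lanes of x; in general, inactive lanes hold
        // stale values, so the helper may silently consume garbage.
        out[programIndex] = gang_sum(v);
    }
}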
Ensuring such warp-wide participation is otherwise a non-issue in plain MT programming, for thread divergence control is kept out of the hands of the programmer. Consequently, MT code with control flow divergences may not call warp-synchronous functions from within its divergent regions. Regarding the second approach, CUDA's dynamic parallelism, along with OpenCL's device-side enqueue, is the ability to create a new group of threads from within threads already in flight [Yang and Zhou, 2014], thus conferring on developers the opportunity to implement strikingly elegant algorithms [Merrill and Grimshaw, 2011]. However, this construct is too hefty for our simpler purpose of re-activating threads within a warp. For instance, invoking new threads from within a thread in CUDA involves the global scheduling of a new grid of threads [Jones, 2014], a very expensive event. In short, currently, either we have the programmability and elegance of the multi-threaded model, or the efficiency of warp-synchronous programming, but not both.

Our goal is to allow the composability of SIMD and SIMT through a programming construct syntactically similar to dynamic parallelism. To this end, we introduce the notion of Call Re-Vectorization (CREV), and show how to implement it efficiently in a state-of-the-art vector compiler. CREV is a programming idiom that modifies function calls. Functions marked with the crev tag, henceforth called r-functions, are executed by all the threads available in an SIMD unit. This implies a context switch: to run an r-function, the runtime must change the state of all the threads, including those inactive due to previously divergent control flows. Upon completion, the runtime returns those workers back to their previous state, in the same way a function call is handled. Thus, we achieve a new level of recursion, in which threads can spawn new threads in a stack-based fashion. However, contrary to traditional dynamic parallelism, CREV uses only the accelerator's local memory (registers and call stack) to save thread states; hence, it is cheaper.

To validate our ideas, we have implemented them in ISPC³ [Brodman et al., 2014, Pharr and Mark, 2012]. ISPC is a programming language, plus its companion compiler. This compiler produces industrial-quality code for SIMD units such as Intel Streaming SIMD Extensions (SSE), Intel Advanced Vector Extensions (AVX), including AVX-512 for Xeon Phi accelerators [Sodani et al., 2016], or ARM NEON. We chose to implement CREV in ISPC because this framework provides the only modern SIMT-to-SIMD translator that, to the best of our knowledge, supports the notion of unmasked or everywhere blocks [Pharr and Mark, 2012]: the ability to activate, in a new context, threads that are idle due to divergences. This feature is a requirement of CREV. We have re-written some benchmarks available in ISPC to use CREV, as well as incorporated some of our own to compose a suite of CREV tests. We show that these implementations are as efficient as their warp-synchronous versions, and as clear and elegant as if they had been implemented using dynamic parallelism. And this extra efficiency does not imply a loss of programmability. On the contrary, CREV often leads to more concise programs. In addition to the ISPC implementation of CREV, we have also built a small interpreter for an SIMD-like programming language, which better demonstrates the semantics of our new construct.

³ The Intel SPMD Program Compiler (ISPC) is available at https://github.com/ispc/
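To fix intuition before the formal treatment, the snippet below previews the construct in ISPC extended with CREV. It anticipates the toy example of Section 2.4; reset_crev is a hypothetical r-function written for this sketch only.

// An r-function: under crev, it runs with the whole gang active,
// once for each lane that was active at the call site.
void reset_crev(uniform int row[], uniform int n) {
    for (varying int i = programIndex; i < n; i += programCount)
        row[i] = 0;
}

export void reset_odd_rows(uniform int * uniform rows[], uniform int n) {
    if (programIndex % 2 != 0) {
        // Divergent region: only odd lanes are active, yet each pending
        // row is zeroed by all lanes of the gang, one row at a time.
        crev reset_crev(rows[programIndex], n);
    }
}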
Summary of our Contributions. The key contribution of this thesis is the notion of function call re-vectorization, which comes out of the observation that it is possible to capitalize on divergent threads to help speed up the work of active threads. We explain the concept of CREV through examples, a formal semantics, and an industrial-quality implementation:

• Examples: Section 2 shows examples of algorithms that benefit from our notion of Call Re-Vectorization. Further examples are discussed in Section 5.

• Semantics: Section 3.2 formalizes the semantics of μ-SIMD, a low-level instruction set sufficient to implement CREV. We have written a Prolog interpreter to validate the semantics. This interpreter made it easy to prototype different implementations of CREV, until we had a design we could graft into a state-of-the-art compiler.

• Translation: Section 3.3 describes the translation of the high-level crev keyword into the low-level representation. Core properties of the final, low-level code, as produced by the translator, are listed in Section 3.4.

• Evaluation: Section 5 provides an empirical evaluation of our implementation. To perform this evaluation, we have implemented some algorithms, which are faster and cleaner than their original versions without CREV.

Published Papers

We now present the works published throughout this Master's course. The initial work, related to the prevention of Return-Oriented Programming attacks, was crucial to gaining experience with one of the main open-source compilation frameworks of our day, the LLVM infrastructure; it was as well a means of better understanding compilation techniques and optimizations. We have thus heavily applied such learnings in this project.

Return Oriented Programming

This project targeted a well-known vulnerability exploit named the ROP attack. Return-oriented programming is a technique attackers employ to take control of the execution of a program, and eventually of the entire host machine. The exploit begins with the identification of flaws in the program, such as buffer overflows, which let the attacker input extraneous data into the read-only section of memory. By chaining a series of indirect jumps interleaved with instructions that have little to no side effects, the attacker forces the foreign data placed in the read-only memory to be processed as if it were executable memory. Mounting such an exploit on a program is rather an artsy process: the attacker must analyze many corner cases to find the ones useful for conveying the exploit. Being challenging does not mean being impossible, and it is known that even government agents have used this technique to acquire top-secret information [Kushner, 2013]. Whenever an attack takes place, the frequency of indirect branches seen at the processor increases significantly [Tymburibá et al., 2016]. Our solution lies in deriving tight frequency thresholds for applications: we statically traverse the control flow graph (CFG) of programs, and search for the path of up to a fixed number of instructions with the highest density of indirect branches. Despite the NP-hardness of determining a maximum path, our static analysis operates in a feasible runtime, as we limit the path's maximum length.
As an outcome of this project, we had a paper accepted at the Brazilian Symposium on Information and Computational Systems Security (SBSEG'15) [Moreira et al., 2015]; a dynamic detector for ROP attacks, namely Rip-Rop Deducer [Tymburibá et al., 2015]; and we have put up a website [Moreira and Tymburibá, 2016] with a static analysis to infer frequency thresholds for indirect branches in applications. Finally, we had another work published at the International Symposium on Code Generation and Optimization (CGO'16) [Tymburibá et al., 2016], and were awarded the Gold Medal (1st prize) at CGO's Student Research Competition (CGO-SRC'16) [Spink, 2016]. I thank my friends Mateus Tymburibá and Fernando Magno for being so patient and supportive throughout this work.

Function Call Re-Vectorization

In this work we aimed at combining the capabilities of warp-synchronous programming with the simplicity of dynamic parallelism. Warp-synchronous programming is known to give programmers a high-level interface to the native SIMD instructions of vector processing machines. Yet, besides enabling the fine-tuning of applications, warp-synchronous code can easily become a nightmare, even for seasoned developers [Moreira et al., 2017]. We implement, on top of an industrial SIMT compiler, the idiom crev, allowing programmers to call warp-synchronous procedures even within divergent regions. We further detail this project in the remaining sections of this thesis. We published an initial work at the Brazilian Symposium on Programming Languages (SBLP'16) [Moreira et al., 2016], in which we define the semantics of everywhere blocks in the SIMD world. The concept of everywhere blocks is key to developing CREV, as it allows one to temporarily re-enable threads within a warp. We later published our contribution, including the crev implementation on top of Intel's SPMD compiler, ISPC. Our paper, entitled Function Call Re-Vectorization, was published at the Symposium on Principles and Practice of Parallel Programming (PPoPP'17) [Moreira et al., 2017].

Je remercie bien mes encadrants Fernando et Sylvain pour l'opportunité de travailler avec eux sur ce projet, et aussi de faire trois mois de stage en France. Là-bas, j'ai fait la connaissance de beaucoup de personnes qui ont rendu mon séjour vraiment spécial. Un grand merci à mon ami Fernando Akira Endo, et à tous mes amis au centre de recherche INRIA Rennes-Bretagne Atlantique.

Twidd: Twig over RDDs

Originally, "Twig: An Adaptable and Scalable Distributed FPGrowth" was a work developed during the last two years of my undergraduate studies and the first semester of my Master's course. We proposed a distributed FPGrowth algorithm with a dynamic policy for load distribution among computing nodes, with low replication overhead. Our approach partitioned the input database using the FPTree structure from the FPGrowth algorithm. An FPTree is a prefix tree in which each node represents an element from a list, given a transactional database. The interesting point of the FPTree is that it keeps the most frequent elements closer to the root node, thus reducing data replication within its structure. The work, available online⁴, was not accepted at the International Parallel and Distributed Processing Symposium of 2015. During the first semester of my Master's course, the algorithm, initially implemented in C++, was reimplemented in Spark/Scala by my friend and Computer Science Master's student Vinícius Victor Santos Dias.
The implementation, namely Twidd, exploited the benefits of Resilient Distributed Datasets (RDDs) to scalably allow for fault tolerance. Vinícius also implemented a distributed version of the Eclat frequent-itemset algorithm, which was used, along with Twidd, to study performance issues in massively parallel applications. He later published the work "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs" at the International Symposium on Cluster, Cloud and Grid Computing (CCGRID'16) [Dias et al., 2016], and granted me co-authorship of his work. Besides building my entire C++ formation upon this work, I have also gained a lot of technical experience throughout this project, both due to my efforts in implementing Twig, and from the many people either voluntarily involved or dragged into this project. I here take this opportunity to publicly thank all of them for their contributions, and Vinícius for his kind gesture.

⁴ Twig: An Adaptable and Scalable Distributed FPGrowth. Work unfortunately not accepted by the committee of IPDPS'15. https://github.com/rubenseam/dfptree/blob/master/ipdps15.pdf

Chapter 2

Overview

The goal of this section is to explain Warp-Synchronous Programming, Dynamic Parallelism (DP), and our notion of Call Re-Vectorization (CREV). To this end, we shall use Algorithm 1 as an example. This program receives a book bi, plus a pattern p. It then copies out all the lines l ∈ bi that match p. Pattern matching is performed by memcmp, and memory copying is done by memcpy. The book is represented as a matrix of characters; thus, each of its lines, and also the pattern p, is a vector of up to N characters. Algorithm 1 runs in parallel: tid is a thread identifier. Hence, each thread is in charge of matching a line l in bi against p. In case the match is positive, this thread must copy l to an output matrix bo. For clarity, we assume a single warp in this example, although the techniques here described can be applied independently to multiple warps. The number of threads that run simultaneously in Algorithm 1 is W, the warp width.

Algorithm 1: SIMD Book Filter and memcpy function
1 W ← warp size; tid ← thread index;
2 Function bFilter(mtx bi, mtx bo, vec p, int N)
3   for k ← tid to num_lines(bi) − 1 step W do
4     l ← bi[k];
5     if memcmp(l, p, N) == 0 then
6       memcpy(l, bo[k], N);
7 Function memcpy(str lsrc, str ldest, int N)
8   for i ← 0 to N − 1 do ldest[i] ← lsrc[i];

The naive multi-thread implementation of memcpy iterates sequentially over the arrays within each thread (Algorithm 1, line 8). This implementation is highly inefficient due to branch divergence, for only threads that step into the memcpy function will be active and working on their own memory copy. Divergence may take place upon branch evaluation, or due to unrelated memory accesses. Branch divergence occurs if the number of iterations N differs across threads. Threads with few iterations would finish the loop earlier and wait for threads with more iterations in order to restore convergence at the end of the loop. Memory divergence also happens as threads within a warp access data in unrelated locations. Such accesses, referred to as uncoalesced in the CUDA literature or as gather/scatter on SIMD platforms, are bandwidth-inefficient compared to accesses to consecutive elements.

2.1 Warp Synchronous Programming

It is possible to write function memcpy in a way that distributes operations on contiguous elements across consecutive threads. Algorithm 2 does it.
Function memcpy_shfl is aware of the SIMD nature of a warp. Variables are stored as vectors, each position belonging to a specific thread. Instruction shfl(v, i) allows thread tid to read the value stored in variable v, but in the register space of thread i. This implementation gives us an efficient way to copy data between arrays, as copies are distributed evenly between threads, removing most of the branch divergence. Memory divergence is also eliminated, as threads of a warp access consecutive elements at each iteration of the loop on line 7. Nevertheless, this function has an important limitation: it requires all threads in the warp to be active. It cannot safely be called from a point that has potential branch divergence. Indeed, the loop on line 7 would skip elements if some threads were inactive. To support calls to memcpy_shfl within divergent regions, we need a way to re-activate threads and put them to work on the copy loop. In addition, the warp-synchronous programming construct is more complex and error-prone than the naive implementation.

Algorithm 2: Warp synchronous memcpy
1 W ← warp size; tid ← thread index;
2 Function memcpy_shfl(vec s, vec d, int N)
3   for j ← 0 to W − 1 do
4     dmy ← shfl(d, j);
5     smy ← shfl(s, j);
6     Nmy ← shfl(N, j);
7     for i ← tid to Nmy − 1 step W do
8       dmy[i] ← smy[i];

2.2 Dynamic Parallelism in CUDA

In NVIDIA's CUDA and OpenCL 2.0, dynamic parallelism (DP) is the ability to invoke a new kernel K2 from within a kernel K1 [Wang and Yalamanchili, 2014]. In this case, programmers may request a large number of threads, i.e., multiple new warps in multiple thread blocks. As the inner K2 is a new kernel, all its threads are active upon entry, regardless of branch divergence in K1. Algorithm 3 shows an implementation of memcpy that we could invoke from Algorithm 1 using dynamic parallelism. This algorithm splits, among all the threads in a warp, the work of copying vector s to vector d. Its main advantage is simplicity; its disadvantage is efficiency.

Algorithm 3: Implementation of memcpy that could be invoked dynamically from Algorithm 1.
1 W ← warp size; tid ← thread index;
2 Function memcpy_dp(vec s, vec d, int N)
3   for k ← tid to N − 1 step W do
4     d[k] ← s[k];

Wang et al. demonstrate that the overhead of a new kernel launch can be as high as one millisecond [Wang and Yalamanchili, 2014]. The new kernel must be scheduled and wait until there are resources available for its execution. Then, the requested number of warps and memory blocks must be allocated before execution starts. For large workloads, the overhead of launching a nested kernel is paid off by the massive data parallelism available in the GPU [DiMarco and Taufer, 2013]. However, for small tasks, this extra cost might degrade performance.

2.3 Call Re-Vectorization

Introducing an inner dimension of parallelism is desirable to implement irregular algorithms such as graph traversal and recursive sorting. Unfortunately, current abstractions based on warp-synchronous programming or Dynamic Parallelism compromise either efficiency or programmability. To solve this conundrum, we introduce Call Re-Vectorization (CREV), a new programming idiom. Syntactically, CREV is akin to CUDA's dynamic parallelism. Semantically, it avoids the cost of scheduling new kernels. CREV revisits the concept of everywhere (also known as all or unmasked) blocks to temporarily re-enable inactive threads within divergent regions.
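ISPC already ships this low-level ingredient: an unmasked block re-enables all gang lanes, no matter which divergence led to it. The sketch below, written here for illustration only, shows its bare behavior; CREV wraps this construct behind a function-call interface and manages the state of the re-awakened lanes automatically.

export void poke(uniform int out[]) {
    if (programIndex % 2 != 0) {
        out[programIndex] = 1; // odd lanes only
        unmasked {
            // every lane of the gang executes this block
            out[programIndex] = 2;
        }
        // back to odd lanes only
    }
}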
Such a construction was available in programming languages for SIMD machines, such as C* [Rose and Steele, 1987], MPL (MasPar Programming Language) [MasPar, 1992] or POMPC [Hoogvorst et al., 1991], in the late 1980s and early 1990s, and has made a recent comeback in ISPC [Pharr and Mark, 2012]. In these languages, an everywhere block is executed by every processing element, regardless of its divergent state. At the end of that block, threads are sent back to their original state. The everywhere block is a low-level construct we employ in the implementation of CREV; however, programmers do not deal with it directly: this is the task of the code generator.

Algorithm 4 shows how Algorithm 1 looks once implemented using CREV. Programmers use the crev keyword at line 6 to re-vectorize functions. CREV maintains a stack of thread states to track execution contexts, thus supporting nested calls of r-functions. In terms of performance, a call to a function using the crev directive is equivalent to a regular function call, unlike the implementation of dynamic parallelism in CUDA, for instance. Thus, we favour the use of CREV for fine-grain nested parallelism. Example 2.3.1 arms the reader with some intuition on how CREV works; we explain the nitty-gritty details behind the CREV directive in Section 3.

Algorithm 4: SIMD Book Filter using CREV
1 W ← warp size; tid ← thread index;
2 Function bFilter(mtx bi, mtx bo, vec p, int N)
3   for k ← tid to num_lines(bi) − 1 step W do
4     l ← bi[k];
5     if memcmp(l, p, N) = 0 then
6       crev memcpy_crev(l, bo[k], N);
7 Function memcpy_crev(vec s, vec d, int N)
8   for k ← tid to N − 1 step W do
9     d[k] ← s[k];

Example 2.3.1 A function is called with the crev prefix to indicate that every thread, whether enabled or disabled, should execute the function. We address as r-functions the procedures targeted by our crev directive. Every thread should execute the r-function multiple times if multiple enabled threads in the warp call it. For instance, if the warp size is 32 and 7 threads are enabled when the program flow hits line 6 in Algorithm 4, all 32 threads execute memcpy_crev 7 times. In each case, the 32 threads temporarily take on the local state of the active thread that they are helping. Once done, these workers all get their local state restored.

void simd_reset(uniform int data[], uniform int length) {
    for (varying int i = programIndex; i < length; i += programCount) {
        data[i] = 0;
    }
}

export void toy(uniform int * uniform data[], uniform int length) {
    if (programIndex % 2 == 0) return;
    for (uniform int i = 0; i < length; i += 2)
        simd_reset(data[i], length);
}

Figure 2.1. Toy example of CREV application. We use this snippet to highlight the main programming issues we try to tackle with CREV. The first issue is to bridge SIMT and SIMD, in the sense that we must be able to call SIMD-based functions within divergent regions.

2.4 Why CREV?

As shall be presented in the forthcoming sections and the remainder of this work, CREV is designed as an idiom for SIMT languages, in order to provide more flexibility for programmers in search of speedups. We strive for code simplicity in the resulting language extension, so developers can better capitalize on the computational supply of the target hardware without giving up their time to understand yet another concept or to struggle with new bugs.
Our approach relies on the straightforwardness of dynamic parallelism, whilst its underlying implementation is founded on the rather user-unfriendly warp-synchronous programming. Let us now go quickly over the building blocks for composing CREV, and then discuss why CREV is useful and in which contexts it stands as a better solution than works previously proposed in the literature. To assist us in grasping each of the challenges addressed by our solution, we built a toy example of a crev application, depicted in Figure 2.1.

Everywhere Blocks. One of the concerns that arises from vector-based programming is the usage of SIMD within potentially divergent regions, which is often the case in multithreaded programs. SIMD kernels require all threads within a warp to be enabled; in other words, the control flow must be uniform. Since divergence is one of the main characteristics of a system orchestrating a set of control-flow-independent threads, there must be a way of cutting through a divergent region and guaranteeing, at least temporarily, the whole warp to be active. Analyzing function toy from Figure 2.1, the conditional establishes a control flow divergence by halving the threads into odd- and even-indexed sets, the latter being deactivated upon calling return. The subsequent lines of code must all be in compliance with such missing threads, i.e., they may not call SIMD functions. If we then try to execute the call to simd_reset, unless that was originally the programmer's intention, the result would be wrong, as such a function depends on having all threads active to reset a vector to zero-valued entries. In our toy example, we simply create a mock condition to enforce divergence, but in later sections we show a series of examples in which such a dilemma may appear. Furthermore, there are also publicly available cases in which one may get entangled amid divergent threads but also need control flow uniformity to call SIMD procedures [NVIDIA, 2016].

The problem of expressing nested SIMD loops in multi-thread style is not new. Some data-parallel programming languages for SIMD computers in the 80's and 90's allowed programmers to temporarily re-enable dormant threads. The C* language [Rose and Steele, 1987], the MasPar Programming Language [MasPar, 1992], and the POMPC language [Hoogvorst et al., 1991] incorporate a control flow construct named either everywhere or all to this end. We have re-used these instructions to implement CREV. However, these are low-level primitives: they are not programmer-friendly, nor have any interface with function calls. Using everywhere directly is difficult, as this abstraction has no knowledge of nor control over the state of dormant threads. CREV, on the contrary, is as easy to use as dynamic parallelism. It manages register saves and restores automatically, relieving the programmer from this task.

Warp-level convergence guarantees. Previous works enforce guarantees on where threads converge after control divergences to make warp-synchronous programming safer. For instance, Pharr et al. have proposed the maximal convergence guarantee [Pharr and Mark, 2012], and Gaster has proposed a divergence-aware execution model for OpenCL [Gaster, 2014]. CREV goes further by actually enforcing convergence at arbitrary program points, allowing warp-synchronous functions to be called from divergent sections. To the best of our knowledge, this is the first attempt to provide developers with such a possibility.
Considering our toy example from Figure 2.1, we simply need to replace the loop with the invocation of simd_reset by a call to this function using the crev directive. Whenever a variable is tagged varying, it holds in fact a vector of values, one per thread; whereas a uniform variable holds a single value, equal for all threads. Therefore, notice that the index variable to the matrix data, which was the induction variable i, has now been replaced by programIndex. The result is a varying pointer, i.e., there is one pointer value from data per thread within the warp, with potentially distinct addresses. To simplify, let us assume the outer dimension of the matrix to equal the number of threads in the warp; but, to ensure the procedure's genericity, all that is needed is a loop with an increment step equal to the warp size.

export void toy(uniform int * uniform data[], uniform int length) {
    if (programIndex % 2 == 0) return;
    crev simd_reset(data[programIndex], length);
}

Figure 2.2. Application of the crev directive within a divergent region.

Grid-level Dynamic Parallelism. Much effort has been spent to reduce the overhead of dynamic parallelism. Alternatives to CUDA Dynamic Parallelism such as DTBL [Wang et al., 2015], Free Launch [Chen and Shen, 2015] and LaPerm [Wang et al., 2016] reduce sub-kernel launch overhead or improve cache locality. By relying on global schedulers, they allow load-balancing between GPU stream multiprocessors. We are not competing with these efforts, because CREV is not an alternative to dynamic parallelism. CREV is a static code transformation with no dynamic scheduling; hence, it does not create extra parallelism. In other words, we move work to threads that are already in flight, instead of spawning new threads. The main benefit of CREV, when compared to these previous works, comes in terms of programmability and efficiency: by supporting the composability of multi-thread and SIMD code, we give developers the chance to benefit from efficient warp-synchronous idioms without having to deal with primitives like shuffle, vote and population count, and without having to worry about saving the context of threads. Again, considering our example, the main goal of our approach is to provide an easy-to-write solution for mixing SIMD and SIMT, while also conveying good performance. Given our sample application of crev, depicted by Figure 2.2, we seem to achieve simplicity; in Section 5, we show that we do not fall too far behind with respect to performance.

Thread-level divergence-aware optimizations. Compilers may reorder computations across loop iterations within each thread to mitigate branch divergence [Coutinho et al., 2011, Han and Abdelrahman, 2013, Novak, 2015, Khorasani et al., 2015]. However, each thread performs the same set of tasks as in the original version, so divergences induced by load imbalance between threads of a warp remain an issue. CREV is a way to deal with irregular programs whose performance is hurt by divergences. However, CREV is not an optimization implemented by the compiler: programmers must adapt algorithms to use this construct. CREV deals well with divergences because it lets developers balance workload between threads in flight. In other words, it changes the loop structure by distributing iterations across different threads.

Conclusion. This chapter has introduced the basic notions around function re-vectorization via an example.
This example gave us the opportunity to provide some insights on the syntax and the semantics of the new abstraction that we propose. In particular, it made clear the difficulty of combining programmability and efficiency in the SIMD world. In the next chapter we start to explain the details around the implementation of CREV. In particular, we define its semantics and syntax more formally.

Chapter 3

Semantics of CREV

This chapter presents the semantics of Call Re-Vectorization. First, in Section 3.1, we informally state key features of CREV. In Section 3.2, we introduce μ-SIMD, a low-level programming language with a set of primitives that lets us implement CREV. In Section 3.3, we show how to implement the crev high-level construct using the building blocks available in μ-SIMD. Finally, in Section 3.4, we use our semantics to state some properties of CREV. Before we dive into these details, we suggest the reader revisit Example 2.3.1 and thus recapitulate the overall control flow behaviour of a CREV call.

3.1 The Cornerstones of CREV

CREV is defined as follows: for each active thread that reaches a call tagged with crev, we execute the target function once, forwarding global parameters (scalars) and extracting private ones per active thread (vectors). This principle of Call Re-Vectorization rests on three pillars: thread re-activation, SIMD function call, and data distribution.

Thread Re-Activation. CREV does not lead to the creation of new threads. A function invoked via a CREV call is executed by every thread of the running warp, regardless of the state of such threads (active or not). As mentioned in Section 2, a thread might be inactive due to divergences. There is, however, a means of temporarily re-enabling threads, which is the functionality conveyed by an everywhere block. We capitalize on this construct in order to implement our crev extension, thus re-activating dormant threads to perform work.

Given a warp, its former state, i.e., the former state of its threads, is saved into the context stack, used for divergence management. The context stack basically has to store bit masks to keep track of the state of each thread within its associated warp. On software-based context stack implementations, such as those used on AMD GPUs and Intel AVX-512 platforms, this operation is performed entirely in software. For platforms with hardware-based context stack implementations, like NVIDIA GPUs, it requires a new machine instruction to allow software manipulation of the context stack.

SIMD Function Call. Multi-thread (MT) and SIMD languages have different definitions of function calls. In an MT setting, a function call is only performed by active threads, and upon processing instructions, only register lanes that correspond to active threads are saved. The other threads are guaranteed to stay inactive during the execution of the function, thus requiring no context save. Although each thread conceptually has its own private call stack, the call stacks of a warp are typically synchronized, both for performance reasons and to allow the sharing of a single scalar stack pointer for a warp. Implementations of SIMD languages, on the other hand, save whole vector registers on function calls, keeping one stack pointer per warp. Unlike regular MT functions, procedures invoked with the crev directive (r-functions) follow an SIMD application binary interface.
This ensures that all registers in use are saved before being overwritten inside the function, including the lanes of threads that were inactive. Because crev does not create new threads, the cost associated with our construct is equivalent to that of a context switch resulting from the invocation of a new function.

Data Distribution. Upon entering a crev call, the execution flow issues a series of light-weight data serializations, one per formerly active thread within the running warp. The serialization consists in loading data, once local to each thread, into global variables, accessible to all threads within the warp. The goal here is to allow the target r-function to receive its input arguments, and also to ensure that all threads have access to such data. Once each formerly active warp thread is serialized, so that a full warp operates on its data, the r-function can be correctly invoked. Revisiting our warp-synchronous memcpy example (Algorithm 2), the local data serialization requires extracting and broadcasting each thread's register lane. More specifically, all variables that were local to each thread, but that are used as input arguments to the r-function, must be serialized and broadcast. Data distribution will be covered in more detail later, in Algorithm 5.

branch if zero: bz v, l
unconditional branch: jmp l
branch if thread previously active: jmp_mask tid, l
write to shared memory: ↑ vx = v
read from shared memory: v = ↓ vx
binary operations: v1 = o1 ⊕ o2
copy: v = o
shuffle data between lanes: shfl(v, vlane)
synchronization barrier: sync
halt the machine: stop
begin everywhere block: everywhere
end everywhere block: end_everywhere

Figure 3.1. μ-SIMD instruction set. Operands (o) can be either variables or integer constants.

Labels (L) ::= l ∈ N
Constants (C) ::= c ∈ N
Variables (V) ::= Tid ∪ {v1, v2, ...}
Instructions (I) ::= Figure 3.1
Active Threads: Θ ⊆ N
Local Memory: σ ⊆ V ↦ Z
Local Memory Bank: Σ ⊆ Tid ↦ σ
Shared Memory: Λ ⊆ N ↦ Z
Synch Stack: Π ⊆ (L × Θ × L × Θ)*
Context Stack: Υ ⊆ (Θ × Π)*
Program: P ⊆ L ↦ I
Program Counter: pc ∈ N

Figure 3.2. The state of a μ-SIMD machine is a septuple M = (Θ, Σ, Λ, Π, Υ, P, pc). Θ is the set of active threads. A thread t ∈ Θ has a local memory σ, accessible through a memory bank Σ. Threads communicate through shared memory Λ. The stack Π tracks control flow divergences. A key component of Call Re-Vectorization is the thread stack Υ.
3.2 Low-Level Semantics

We formalize the notion of Call Re-Vectorization on top of a core language, μ-SIMD. This language provides the low-level constructs necessary to implement crev and thus invoke r-functions. Most of the syntax of μ-SIMD comes from Sampaio et al. [Sampaio et al., 2013], who, in turn, have reused ideas from Bougé et al. [Bougé and Levaire, 1992] and Farrell et al. [Farrell and Kieronska, 1996]. A μ-SIMD program is a sequence of instructions indexed by a pc. Figure 3.1 shows μ-SIMD's syntax.

    split(Θ, β, v) = (Θ0, Θn), where
        Θ0 = {t | t ∈ Θ and β[t] = σt and σt[v] = 0}
        Θn = {t | t ∈ Θ and β[t] = σt and σt[v] ≠ 0}

    push([], Θn, pc, l) = [(pc, [], l, Θn)]
    push((pc′, [], l′, Θ′n) : Π, Θn, pc, l) = Π′, if pc ≠ pc′,
        where Π′ = (pc, [], l, Θn) : (pc′, [], l′, Θ′n) : Π
    push((pc, [], l, Θ′n) : Π, Θn, pc, l) = (pc, [], l, Θn ∪ Θ′n) : Π

Figure 3.3. Auxiliary functions used to define μ-SIMD. split is a filter, dividing threads into two divergent sets (Θ0 and Θn). Auxiliary function push updates the synchronization stack Π due to control flow divergences.

Operational Semantics. The state M of a program is a tuple (Θ, β, Σ, Π, Λ, P, pc), as described in Figure 3.2. Threads are uniquely identified by a natural tid, having a local memory β[tid], and sharing a global memory Σ. Memory is vectorized; thus, a local address v denotes a vector of variables, each thread seeing its private version of v. To formalize the semantics of μ-SIMD, we use the auxiliary functions shown in Figure 3.3. The semantics of μ-SIMD is given by Figures 3.4 and 3.5. The former shows the behavior of instructions that change the program's control flow; the latter shows the behavior of logic and arithmetic instructions. The result of executing a control flow instruction is a triple (β, Σ, Π). The interface between Figure 3.4 and Figure 3.5 is performed by Rules It and Tl. The result of executing an arithmetic or logic instruction is a pair (β, Σ), i.e., these instructions only update the program memory.

The semantics of control flow divergences. To simulate the effect of divergences, μ-SIMD has a stack Π. Each element in Π is a tuple (lid, Θdone, lnext, Θtodo), which indicates the point where divergent threads must re-converge. A new tuple is pushed onto Π due to a conditional branch, located at lid, that has caused a divergence, as described by Rules Bt, Bf and Bd in Figure 3.4. Θdone is the set of threads that have reached the synchronization point. Θtodo is the set of threads waiting to execute. These threads, once active, will resume execution at label lnext. The stack is popped by the instruction sync, whose behavior is given by Rules Ss and Si.

The Thread Stack. To implement CREV, we have added a thread stack Λ to μ-SIMD. This stack is fundamental to the implementation of everywhere blocks.

    (Sp)  P[pc] = stop
          ⟹ (Θ, β, Σ, Π, Λ, P, pc) → (β, Σ, Π)

    (Jp)  P[pc] = jmp l    (Θ, β, Σ, Π, Λ, P, l) → (β′, Σ′, Π′)
          ⟹ (Θ, β, Σ, Π, Λ, P, pc) → (β′, Σ′, Π′)

    (Bt)  P[pc] = bz v, l    split(Θ, β, v) = (Θ, ∅)    push(Π, ∅, pc, l) = Π′
          (Θ, β, Σ, Π′, Λ, P, l) → (β′, Σ′, Π″)
          ⟹ (Θ, β, Σ, Π, Λ, P, pc) → (β′, Σ′, Π″)

    (Bf)  P[pc] = bz v, l    split(Θ, β, v) = (∅, Θ)    push(Π, ∅, pc, l) = Π′
          (Θ, β, Σ, Π′, Λ, P, pc + 1) → (β′, Σ′, Π″)
          ⟹ (Θ, β, Σ, Π, Λ, P, pc) → (β′, Σ′, Π″)

    (Bd)  P[pc] = bz v, l    split(Θ, β, v) = (Θ0, Θn)    pc′ = pc + 1
          push(Π, Θ0, pc, l) = Π′    (Θn, β, Σ, Π′, Λ, P, pc′) → (β′, Σ′, Π″)
          ⟹ (Θ, β, Σ, Π, Λ, P, pc) → (β′, Σ′, Π″)

    (Ba)  P[pc] = jmp_mask tid, l    tid ∈ Θ0
          (Θ, β, Σ, Π, (Θ0, Π0) : Λ, P, l) → (β′, Σ′, Π′)
          ⟹ (Θ, β, Σ, Π, (Θ0, Π0) : Λ, P, pc) → (β′, Σ′, Π′)

    (Bi)  P[pc] = jmp_mask tid, l    tid ∉ Θ0
          (Θ, β, Σ, Π, (Θ0, Π0) : Λ, P, pc + 1) → (β′, Σ′, Π′)
          ⟹ (Θ, β, Σ, Π, (Θ0, Π0) : Λ, P, pc) → (β′, Σ′, Π′)

    (Ss)  P[pc] = sync    Θn ≠ ∅
          (Θn, β, Σ, (pc′, Θd ∪ Θ, l, ∅) : Π, Λ, P, l) → (β′, Σ′, Π′)
          ⟹ (Θ, β, Σ, (pc′, Θd, l, Θn) : Π, Λ, P, pc) → (β′, Σ′, Π′)

    (Si)  P[pc] = sync    pc′ = pc + 1
          (Θ0 ∪ Θn, β, Σ, Π, Λ, P, pc′) → (β′, Σ′, Π′)
          ⟹ (Θn, β, Σ, (_, Θ0, _, ∅) : Π, Λ, P, pc) → (β′, Σ′, Π′)

    (Eb)  P[pc] = everywhere
          (Θall, β, Σ, [], (Θ, Π) : Λ, P, pc + 1) → (β′, Σ′, Π′)
          ⟹ (Θ, β, Σ, Π, Λ, P, pc) → (β′, Σ′, Π′)

    (Ee)  P[pc] = end_everywhere
          (Θ, β, Σ, Π, Λ, P, pc + 1) → (β′, Σ′, Π′)
          ⟹ (_, β, Σ, [], (Θ, Π) : Λ, P, pc) → (β′, Σ′, Π′)

    (It)  P[pc] = ι    ι is not a control flow instruction    pc′ = pc + 1
          (Θ, β, Σ, mask, ι) → (β′, Σ′)
          (Θ, β′, Σ′, Π, (mask, Π0) : Λ, P, pc′) → (β″, Σ″, Π″)
          ⟹ (Θ, β, Σ, Π, (mask, Π0) : Λ, P, pc) → (β″, Σ″, Π″)

Figure 3.4. Semantics of μ-SIMD's control flow instructions.
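As an illustration of rule Bd, suppose a warp of four threads reaches bz v, l, and that the private copies of v hold 1, 0, 0 and 1, respectively (values chosen by us for this example). Then:

    split({0, 1, 2, 3}, β, v) = (Θ0, Θn) = ({1, 2}, {0, 3})
    push([], Θ0, pc, l) = [(pc, [], l, {1, 2})]

Threads 1 and 2, whose copies of v are zero, take the branch; they are parked on the synchronization stack and will be re-activated at label l by rule Ss once the fall-through group, threads 0 and 3, reaches a sync.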
    (Mm)  Σ(v) = c  ⟹  Σ ⊢ v = c

    (Mt)  t, β ⊢ tid = t

    (Mv)  β[t] = σt    σt(v) = c  ⟹  t, β ⊢ v = c

    (Tl)  (σt, β, Σ, mask, ι) → (σ′t, Σ′)
          (Θ, β[t ↦ σ′t], Σ′, mask, ι) → (β″, Σ″)
          ⟹ ({t} ∪ Θ, β, Σ, mask, ι) → (β″, Σ″)

    (Bp)  t, β ⊢ v2 = c2    t, β ⊢ v3 = c3    β[t] = σt    c1 = c2 ⊕ c3
          ⟹ (σt, β, Σ, _, v1 = v2 ⊕ v3) → (σt[v1 ↦ c1], Σ)

    (Si)  t, β ⊢ vlane = clane    β[t] = σt    clane ∉ mask
          ⟹ (σt, β, Σ, mask, shfl(v1, vlane)) → (σt[v1 ↦ _], Σ)

    (Sv)  t, β ⊢ vlane = clane    β[t] = σt    clane ∈ mask
          β[clane] = σlane    σlane(v1) = c2
          ⟹ (σt, β, Σ, mask, shfl(v1, vlane)) → (σt[v1 ↦ c2], Σ)

    (Ct)  β[t] = σt  ⟹  (σt, β, Σ, _, v = c) → (σt[v ↦ c], Σ)

    (Ld)  t, β ⊢ vx = cx    β[t] = σt    Σ ⊢ cx = c
          ⟹ (σt, β, Σ, _, v = ↓vx) → (σt[v ↦ c], Σ)

    (As)  t, β ⊢ v′ = c    β[t] = σt  ⟹  (σt, β, Σ, _, v = v′) → (σt[v ↦ c], Σ)

    (St)  t, β ⊢ vx = cx    t, β ⊢ v = c    β[t] = σt
          ⟹ (σt, β, Σ, _, ↑vx = v) → (σt, Σ[cx ↦ c])

Figure 3.5. Semantics of arithmetic, logic and data-related instructions. Rule Tl loops over every thread t ∈ Θ, and for each one of them, executes instruction ι. No assumption can be made on the order in which instructions run.

Λ holds pairs (Θ, Π). Figure 3.4 shows that the instruction everywhere (Rule Eb) pushes elements onto Λ, and the instruction end_everywhere (Rule Ee) pops it. The first element in this tuple is the set of threads active immediately before the execution of an everywhere block. The second element is the divergence stack, also in the state before the execution of the last everywhere block traversed by the program flow. In Rule Eb (Fig. 3.4), Θall represents all the threads available in a warp. The thread stack lets us represent an unbounded number of different thread contexts; hence, programs might contain an arbitrary number of nested everywhere blocks. After the execution of the instruction end_everywhere, threads previously inactive go back into sleeping mode. In other words, after end_everywhere, the pair (Θ, Π) at the top of Λ is popped, and the diverging configuration Π becomes part of the current state of threads. If it is necessary to check whether a thread is active due to an everywhere block, μ-SIMD provides the conditional jmp_mask. The statement jmp_mask(tid, l) diverts execution to l if tid is active in the mask at Λ's top. Example 3.2.1 illustrates the behavior of these instructions.
    Instructions:
            v0 = ↓tid
            v1 = (v0 == 0)
            bz v1, Done
            v2 = 4 * (tid + 1)
            everywhere
            v8 = 0
    Loop:   jmp_mask v8, Call
            jmp Next
    Call:   v3 = shfl(v2, v8)
            v4 = v3 + tid
            v5 = ↓v4
            v6 = v5 + 1
            ↑v4 = v6
    Next:   v8 = v8 + 1
            v7 = (v8 == 4)
            bz v7, Loop
            end_everywhere
            ↑tid = 1
    Done:   sync

    Shared memory (initial contents):
    Address:  0  1  2  3 |  4  5  6  7 |  8  9 10 11 | 12 13 14 15 | 16 17 18 19
    Contents: 0  1  1  0 |  2  1  3  4 |  1  5  6  1 |  2  3  1  7 |  1  3  4  0

    Private memory: v0 ... v8 = * (contents immaterial)

Figure 3.6. Program written in μ-SIMD, plus its initial state.

Example 3.2.1 Figure 3.6 shows a program written in μ-SIMD. We assume |Θall| = 4. This program increments a 4 × 4 matrix; however, line i is incremented only if α[tid] = 0, where α denotes the four flags stored at shared-memory addresses 0 to 3. The figure shows the initial state of the shared (Σ) and local memory of each thread (β). The initial state of the variables in the local memory is immaterial for this example. Figure 3.7 shows a trace of the execution of the program, given its initial state. Only threads tid = 0 and tid = 3 will enter the everywhere section, because α[0] = α[3] = 0. Nevertheless, all four threads will execute the commands within that block. Instruction v3 = shfl(v2, v8) lets each thread read into v3 the value of v2 seen by thread v8.

    Instructions              Var    T0  T1  T2  T3    Active
    v0 = ↓tid                 v0      0   1   1   0    ✓ ✓ ✓ ✓
    v1 = (v0 == 0)            v1      1   0   0   1    ✓ ✓ ✓ ✓
    bz v1, Done                       F   T   T   F    ✓ ✓ ✓ ✓
    v2 = 4 * (tid + 1)        v2      4   *   *  16    ✓ • • ✓
    everywhere                                         ✓ • • ✓
    v8 = 0                    v8      0   0   0   0    ✓ ✓ ✓ ✓
    Loop: jmp_mask v8, Call           T   T   T   T    ✓ ✓ ✓ ✓
    jmp Next                          F   F   F   F    ✓ ✓ ✓ ✓
    Call: v3 = shfl(v2, v8)   v3      4   4   4   4    ✓ ✓ ✓ ✓
    v4 = v3 + tid             v4      4   5   6   7    ✓ ✓ ✓ ✓
    v5 = ↓v4                  v5      2   1   3   4    ✓ ✓ ✓ ✓
    v6 = v5 + 1               v6      3   2   4   5    ✓ ✓ ✓ ✓
    ↑v4 = v6                                           ✓ ✓ ✓ ✓
    Next: v8 = v8 + 1         v8      1   1   1   1    ✓ ✓ ✓ ✓
    v7 = (v8 == 4)            v7      0   0   0   0    ✓ ✓ ✓ ✓
    bz v7, Loop                       T   T   T   T    ✓ ✓ ✓ ✓
    Loop: jmp_mask v8, Call           F   F   F   F    ✓ ✓ ✓ ✓
    jmp Next                          T   T   T   T    ✓ ✓ ✓ ✓
    Next: v8 = v8 + 1         v8      2   2   2   2    ✓ ✓ ✓ ✓
    v7 = (v8 == 4)            v7      0   0   0   0    ✓ ✓ ✓ ✓
    bz v7, Loop                       T   T   T   T    ✓ ✓ ✓ ✓
    Loop: jmp_mask v8, Call           F   F   F   F    ✓ ✓ ✓ ✓
    jmp Next                          T   T   T   T    ✓ ✓ ✓ ✓
    Next: v8 = v8 + 1         v8      3   3   3   3    ✓ ✓ ✓ ✓
    v7 = (v8 == 4)            v7      0   0   0   0    ✓ ✓ ✓ ✓
    bz v7, Loop                       T   T   T   T    ✓ ✓ ✓ ✓
    Loop: jmp_mask v8, Call           T   T   T   T    ✓ ✓ ✓ ✓
    jmp Next                          F   F   F   F    ✓ ✓ ✓ ✓
    Call: v3 = shfl(v2, v8)   v3     16  16  16  16    ✓ ✓ ✓ ✓
    v4 = v3 + tid             v4     16  17  18  19    ✓ ✓ ✓ ✓
    v5 = ↓v4                  v5      1   3   4   0    ✓ ✓ ✓ ✓
    v6 = v5 + 1               v6      2   4   5   1    ✓ ✓ ✓ ✓
    ↑v4 = v6                                           ✓ ✓ ✓ ✓
    Next: v8 = v8 + 1         v8      4   4   4   4    ✓ ✓ ✓ ✓
    v7 = (v8 == 4)            v7      1   1   1   1    ✓ ✓ ✓ ✓
    bz v7, Loop                       F   F   F   F    ✓ ✓ ✓ ✓
    end_everywhere                                     ✓ • • ✓
    ↑tid = 1                                           ✓ • • ✓
    Done: sync                                         ✓ ✓ ✓ ✓

Figure 3.7. Execution trace of the program in Figure 3.6. Column Var shows the contents of the last variable assigned. T indicates a branch taken; F indicates otherwise. The symbol • marks inactive threads. For the syntax of instructions, we refer the reader to Fig. 3.1; for their semantics, to Figs. 3.4 and 3.5.
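For readers who prefer source-level code, the following ISPC-style sketch performs the same computation as the μ-SIMD program of Figure 3.6, using the crev keyword that Chapter 4 implements. The function names are ours, introduced only for this illustration; we assume a gang of four threads and the memory layout of the figure, with the flags at positions 0 to 3 and the matrix right after them:

    // Lanes whose flag is zero each hand the whole gang one row to increment.
    void inc_row(uniform int base, uniform int* uniform mem) {
        mem[base + programIndex] += 1;   // full warp: one element per lane
    }

    export void inc_rows(uniform int* uniform mem) {
        if (mem[programIndex] == 0)            // the alpha flags live at mem[0..3]
            crev inc_row(4 * (programIndex + 1), mem);
    }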
3.3 High-Level Semantics

The μ-SIMD assembly gives us the primitive building blocks to implement CREV in higher-level languages. As a proof of concept, we have implemented CREV on top of ISPC, using instructions of ISPC that are equivalent to those seen in μ-SIMD. By focusing on an abstract notation, μ-SIMD, instead of on a concrete language, such as ISPC, we claim generality: CREV can be implemented in any environment that supports our notions of everywhere and shuffle. In this section we show how to implement the crev modifier, which marks a function call as an r-function. For simplicity, our high-level language provides only syntax to declare and invoke functions. A function declaration consists of a name f, plus a list of formal parameters, e.g., f(T p1, ..., T pn). We let T denote a type modifier, which can be either uniform or varying. Recall from Chapter 2.4 that a uniform variable holds a single value, shared across all threads of the warp, whereas a varying variable holds a vector of values, one per thread. We have borrowed this notation from ISPC. Other programming languages have different ways to express these modifiers. For instance, in CUDA we have shared and global allocation filling the role of ISPC's uniform variables.

Figure 3.8 shows the code that we produce for an r-function call f(a1, ..., an), where each ai, 1 ≤ i ≤ n, is an actual argument of f.

    1  everywhere                       ;; begin CREV
    2  i = 0                            ;; loop counter
    3  loop: jmp_mask i, call
    4  jmp next                         ;; skip idle threads
    5  call: extract(tn, pn, an, i)     ;; Algorithm 5
    6  call f                           ;; function call
    7  next: i = i + 1
    8  bnz (i ≠ W) loop
    9  end_everywhere                   ;; end CREV

Figure 3.8. Low-level code produced to call r-function f.

    Algorithm 5: Data distribution
    1  Function declaration: f(p1, ..., pn);
    2  Function call: f(t1 a1, ..., tn an);
    3  Function extract(tn, pn, an, i)
    4    for k ∈ 1 ... n do
    5      if tk == uniform then
    6        pk = ak;
    7      if tk == varying then
    8        shfl(ak, i);

Such an r-function call will trigger up to |Θ| executions of f, one for each active thread t ∈ Θ. The tests in lines 3 and 8 of Figure 3.8 are used to single out the function invocation performed by each thread: a different call happens for each active thread. In another dimension of parallelism, each function call will be executed by all threads, due to the everywhere block delimited by lines 1 and 9. Thus, we might have up to |Θall|² computations. Algorithm 5 generates code that implements data distribution. Data distribution determines how actual parameters are bound to formal parameters, given that actual parameters can have one of two types: uniform or varying. By construction, r-functions have only uniform parameters. The loop at line 4 goes over all the function arguments, comparing formal (p) and actual (a) parameters. We let the type of ai be ti. If an actual argument is uniform, then parameter passing is trivially implemented as a copy between variables; line 5 of Algorithm 5 generates code under such circumstance. If an actual parameter has type varying, then we generate code to perform a broadcast, as seen in line 7 of Algorithm 5.

    main() {
        var int a = Tid;
        crev foo(a);
    }
    foo(uni int a) {
        var int b = a + Tid;
        crev bar(b);
    }
    bar(uni int b) {
        uni int c = b + 1;
        crev baz(c);
    }
    baz(uni int c) {
        var int d = c + 1;
    }

Figure 3.9. A program written in ISPC, and the tree showing function calls for T0. (In the tree, main spawns foo(a = 0), ..., foo(a = 3), one per thread; each foo spawns bar(b = 0), ..., bar(b = 3); bar(b = 0) spawns baz(c = 1) for all four threads, each of which computes d = 2.)

Example 3.3.1 The program in Figure 3.9 shows three functions called via crev. We are assuming an architecture with four SIMD lanes, i.e., Θall = {T0, T1, T2, T3}. When foo is invoked, the value of a, main's local variable, is broadcast to foo's formal parameter. Thus, T0 sees foo(0), T1 sees foo(1), etc. When T0 calls bar from foo, the same behavior is observed. However, when T0 calls baz from bar, all four threads activated into this context see baz(1), because baz receives a uniform argument. The fact that baz's local variable d is marked as varying is immaterial in this example, as this variable is initialized with uniform values.
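To make Algorithm 5 concrete, consider a call crev g(u, x) in which u is uniform, x is varying, and both formal parameters of g are uniform (g, u and x are illustrative names, not part of the figures above). For lane i, the generated prologue behaves roughly like the ISPC-style sketch below:

    // Binding of actuals to formals for the call issued on behalf of lane i.
    uniform int p1 = u;              // uniform actual: plain copy (Algorithm 5, line 5)
    uniform int p2 = extract(x, i);  // varying actual: broadcast lane i's value (line 7)
    g(p1, p2);                       // invoked with all threads active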
3.4 Properties of CREV

The semantics of CREV, given by μ-SIMD's primitive building blocks, and the translator seen in Section 3.3, let us establish a few properties that are true about this programming abstraction. In this section we go over a few of these properties. They are valid under the assumption that programs are well-formed. We define well-formed programs below:

Definition 3.4.1 (Well-Formed Program) A μ-SIMD program is well-formed if any occurrence of an everywhere instruction at label l1 is matched by an occurrence of an end_everywhere instruction at label l2, and these two labels are control equivalent.

Definition 3.4.1 borrows the concept of control equivalence from Ferrante et al. [Ferrante et al., 1987]. Two points, l1 and l2, in a program's control flow graph are said to be control equivalent if l1 dominates l2, and l2 post-dominates l1. We say that l1 dominates l2 if, and only if, any path from the root of the CFG to l2 must cross l1. Dually, l2 post-dominates l1 if, and only if, any path from l1 to the end of the CFG must cross l2. Our translator produces well-formed programs, as long as the program flow cannot leave a function through points other than its return address.

Theorem 3.4.2 (Well-Formed Translation) The translator of Figure 3.8 produces well-formed programs.

Proof: This result follows trivially from the fact that an everywhere block surrounds only Algorithm 5 and the r-function. Well-formedness holds as long as none of these routines lets the program flow escape the enclosing end_everywhere instruction. This implies that the r-function cannot throw exceptions, for instance.

Composability. CREV allows the nesting of everywhere blocks. Composition happens due to nested function calls. The thread stack Λ ensures that the last invoked r-function will be the first to remove pending computation. In what follows, we visit three consequences of this property.

Composition is multiplicative. A crev call puts all the warp threads in active mode. By coupling this observation with composability, we have that, in the absence of divergences, a sequence of n nested crev calls will create |Θall|ⁿ tasks. Notice that CREV produces new tasks, but not new threads: we still have only |Θall| threads to solve these tasks.

Commutativity. The translator of Figure 3.8 calls an r-function in a lexicographic order defined by thread identifiers. However, μ-SIMD's primitives do not impose any order on the threads pushed onto Λ. Therefore, the multiple SIMD calls of an r-function can be handled in any order.

Synchronization parity. There is no distinction between the top level of parallelism and the inner levels of parallelism with regard to the synchronization primitive. In other words, divergences are handled transparently by the synchronization stack Π, and, from a synchronization standpoint, it is not possible to tell whether execution happens within the context of an r-function or not. To ensure this property, μ-SIMD's everywhere instruction pushes onto Λ, together with the set of active threads, the divergent state Π.
    Thread stack (tasks pending, from oldest to newest):
        T0: foo    T1: foo    T2: foo    T3: foo
        T0: foo; T0: bar    T0: foo; T1: bar    T0: foo; T2: bar    T0: foo; T3: bar
        T0: foo; T0: bar; T0: baz    T0: foo; T0: bar; T1: baz
        T0: foo; T0: bar; T2: baz    T0: foo; T0: bar; T3: baz

    Activation stack (one record per function currently active):
        main: a    foo: b    bar: c    baz: d

Figure 3.10. Example of three nested calls to r-functions, for the program of Example 3.3.1. Calls currently in the activation stack are highlighted.

The interplay between CREV and nested function calls. The implementation of CREV does not interfere with the implementation of function calls. Programming languages that support recursion use a structure known as the activation stack to manage function calls. Entries in the activation stack are called activation records, and they store a function's local variables, return address, arguments, etc. Upon invocation, the activation record of a function is pushed onto the activation stack. For each thread pushed onto the thread stack there will exist one activation record on the activation stack. The multiplicative nature of CREV also implies a multiplication of activation records. Therefore, n nested r-calls will generate |Θall|ⁿ activation records; however, the maximum depth of the activation stack is still n + 1: activation records owned by different threads will not exist simultaneously.

Example 3.4.3 Figure 3.10 reuses the program from Example 3.3.1 to illustrate these points. Again, we assume |Θall| = 4. Thus, three non-divergent nested r-calls will create 4 × 4 × 4 × 4 = 256 tasks. At any time, the thread stack will contain at most 4 + 4 + 4 + 4 = 16 tasks waiting for execution. The activation stack will contain, at any given point, at most 4 activation records, corresponding to the activation of functions main, foo, bar and baz.

3.5 Discussion

This chapter has presented the semantics of CREV. The main benefit of having a formal semantics is the possibility of testing different approaches when designing the set of primitive instructions that constitute CREV. From these primitives, we can then design high-level constructs that give developers the opportunity to use our new abstraction. Indeed, we have implemented the semantics of CREV in Prolog, and have used this implementation extensively before producing the actual x86 implementation of CREV. Thus, our semantics is executable, and extensible, as new constructs typical of the SIMD execution model can be added to it. In the next chapter we show how this semantics can be materialized into a concrete and robust implementation of function call re-vectorization on an Intel architecture.

Chapter 4

Implementation

We now present our CREV implementation on top of Intel's SPMD compiler, ISPC. To truly grasp how the idiom works, and thus understand the expected results, it is desirable not only to be conversant with the ISPC compiler and language, but also to have in mind the SIMD and SIMT processing models. Before diving into the technicalities of our code, let us first get acquainted with ISPC at a higher level. Most of the information exposed here on the ISPC language and compiler was extracted from ISPC's documentation [Intel, 2016].

4.1 Making friends with ISPC

Intel's SPMD Program Compiler, ISPC for short, is an SPMD compiler for CPU applications.
In SPMD programs, the abstraction is that of processing a single piece of data at a time, while the underlying hardware and runtime system hand the same instruction over to multiple processing nodes, each using its own data as input. The hardware handles such operations with as many parallel instances as there are resources available. The ISPC compiler is open source, with a BSD license. It uses the LLVM Compiler Infrastructure for back-end code generation and optimization. The compiler supports Windows, Mac, and Linux, with both x86 and x86-64 targets.

ISPC is a variant of the C programming language and its companion compiler. The language is meant to deliver good performance for programmers who want to run SPMD programs on CPUs. It provides a thin abstraction layer between the programmer and the hardware, in such a way that it is still possible to trace back, in the generated assembly code, the behaviour described in the source code of ISPC programs. ISPC harnesses the computational power of SIMD vector units without harming programmability, i.e., it spares programmers the burden of unproductive code writing, as it abstracts away low-level SIMD intrinsics.

4.1.1 Parallel Execution Model

In CUDA terminology, a warp is a group of threads set to execute a kernel. In ISPC, a warp is referred to as a gang, and threads as program instances. For consistency, we stick to CUDA's nomenclature. As ISPC is designed to support CPU SPMD programmers, a warp, or gang, is actually a SIMD vector unit, each lane being a thread, or program instance.

Code organization. An ISPC program is usually structured in two files: the C/C++ source accommodating the main function, and the ISPC source itself, containing the SPMD functions. As an abuse of terminology, let us refer to these functions as kernels, as they hold the core of ISPC parallelism and are executed on the target device, i.e., on the CPU vector units. To further clarify the composition of such kernels, bear in mind that they are the functions lowered to the vector-processing instruction set of the CPU. The first source compiled is the ISPC one. The compiler generates a C header file, creating an interface between C and ISPC. Within the C source, programmers should include the generated header file, and make use of the kernels written in the ISPC source; notice this header also comprises the types and structures created along with the kernels. In addition to the header, the kernels are compiled to the target architectures selected on the command line. Then, the C source is compiled and linked against the programmer's brand-new SPMD library.
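To make this build flow concrete, the transcript below sketches the two compilation steps. The file names are hypothetical; -h asks ispc to emit the C/C++ header, -o names the output object, and target architectures are selected with flags such as --target (see the ISPC documentation):

    $ ispc kernels.ispc -o kernels_ispc.o -h kernels_ispc.h
    $ c++ main.cpp kernels_ispc.o -o program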
Code writing. ISPC programs are indeed very similar to C programs in terms of syntax. For example, the code in Figure 4.1 is valid both in the C language and when compiled with the ISPC compiler. However, the semantics of the program changes considerably: unlike the result obtained in C, the floating-point number (a + b/2), ISPC holds multiple values for a and b, one per thread in a warp, thus generating multiple return values. From the parallel programming realm, ISPC ports the concepts of global and private values: variables in ISPC are either uniform, i.e., global/scalar values, uniform across all threads; or varying, i.e., each thread holds a private value, which may differ from that of other threads. If we were to use a C compiler, the code generated for func would have a regular, non-parallel semantics. Using the ISPC compiler, on the other hand, the assembly produced comprises low-level instructions that exploit the vector-processing capabilities of the target CPU.

    float func(float a, float b) {
        return a + b / 2.;
    }

    float func_divergent(float a, float b) {
        if (programIndex % 2 == 0)
            return a + b / 2.;
        return 0;
    }

Figure 4.1. func is a regular division function, as present in ISPC's documentation. The similarities between C and ISPC code are notable: this function has valid syntax in both languages, but indeed carries a different meaning. In C, it is a regular division of float variables a and b, whereas in ISPC those variables are actually vectors of values, each value associated with a thread. In the latter, the result is a vector of floats, generally with unique values per thread. func_divergent wraps the main operation of func in a divergent branch.

In ISPC, instead of having two scalar values and a single control flow, we encounter one control flow per thread, and each thread has its own version of variables a and b, i.e., these variables are varying within the warp. Therefore, the ISPC kernel results in a vector of floats, generally with unique values per program instance. The multiplicity of variables is statically defined by the length of the SIMD vector of processing units (a warp). The number of threads within a warp usually ranges from 2 to 8, being thus quite small w.r.t. the warps of often 32 threads found on GPUs.

Akin to SPMD GPU programs, ISPC's runtime system also grants each thread within a warp its own control flow: instructions are fetched and received in lock-step, and eventual divergences are handled by deactivating divergent threads. Extending our initial example with a conditional branch allows us to perceive this SPMD characteristic in ISPC. Still in Figure 4.1, we have function func_divergent, which wraps the division in a divergent branch. Whenever a thread of even index reaches the branch, it computes the division and returns the result. Since the execution is in lock-step, threads with odd indices are put on hold at each instruction executed by the active threads. Upon leaving the branch, the remaining threads return 0. Notice that, due to the keyword programIndex, function func_divergent may only be compiled by the ISPC compiler. programIndex is equivalent to CUDA's threadIdx in the sense that it allows programmers to distinguish threads by their indices.

Explicit parallelism. ISPC defines default qualifiers for variables: in plain C, function func (Figure 4.1) would have scalar variables a and b, and a single result would be returned. In ISPC code, the declarations of a and b default to varying float a and varying float b, thus rendering such variables, by default, vectorized values. To explicitly inform the compiler of the desired behavior, or to further document the code for future usage, programmers should add the qualifiers uniform or varying. It is the presence of parallel (varying) data that makes regular functions issue their overloaded parallel versions, and vice-versa. An interesting point concerning uniform variables is the assignment policy. As stated in the quote below, extracted from the ISPC documentation, uniform variables cannot receive data from varying variables. For instance, if a warp holds an integer global across all threads within the warp, a single thread is not able to assign this integer a varying value, private to this thread's context.
It may seem counter-intuitive to attempt such an operation, but our implementation depends heavily on an equivalent syntax, as discussed ahead in Section 4.2.

    "uniform variables can be modified as the program executes, but only in ways that preserve the property that they have a single value for the entire gang. Thus, it's legal to add two uniform variables together and assign the result to a uniform variable, but assigning a non-uniform (i.e., varying) value to a uniform variable is a compile-time error." (ISPC Documentation, "uniform" and "varying" Qualifiers.)

ISPC also features the keywords launch and sync, for explicit asynchronous parallel processing. launch issues asynchronous parallel tasks and abstracts away the many possible implementations of underlying engines for simulating asynchronous processing, which include:

- Microsoft's Concurrency Runtime
- Apple's Grand Central Dispatch
- bare pthreads
- Cilk Plus
- TBB
- OpenMP

The task system implementation can be selected at compile time, by defining the appropriate preprocessor symbol on the command line. Not all combinations of platform and task system are meaningful. If no task system is requested, a reasonable default is selected for the host platform. The keyword sync, similar to a PTHREADS join, puts the execution on hold until all threads have converged to its launch call site. As the task systems are available along with the ISPC open source code, programmers may extend the set of runtime engines, developing their own task system.

To delineate the benefits of using ISPC's task system, we now show a wrapping-up example that uses both ISPC's SIMD capabilities and its task system (Figure 4.2). The example consists of a generic matrix-based procedure: simply put, we create a vector of varying values, private to each thread; such values are zero-initialized; then, asynchronously, each entry of the vector (the varying values in data[i]) receives its corresponding thread index, multiplied by -1 when the index is an odd number. Each line of the matrix is processed asynchronously, in parallel, by as many threads as the system supports; then, each thread uses its associated SIMD resources to operate, in lock-step, on the input line of varying values (function f). In this example, the matrix has a single parameterizable dimension: we can set the length of the vector within function proc_matrix, but the number of values each position data[i] holds is always defined by the length of the warp. In the next section we show, amongst other examples, how to exploit the SIMD capabilities without restricting oneself to a matrix structure.

    // SIMD kernel: function executed in lock-step
    task void f(varying int& data) {
        if (programIndex % 2 == 0)
            data = data + programIndex;
        else
            data = data - programIndex;
    }

    // SPMD function invoking ISPC's task system
    void f_matrix(varying int data[], uniform int length) {
        for (uniform int i = 0; i < length; ++i)
            launch f(data[i]);
        sync;
    }

    // Matrix procedure
    export void proc_matrix() {
        varying int data[10];
        uniform int length = 10;
        for (uniform int i = 0; i < length; ++i)
            data[i] = 0;
        f_matrix(data, length);
        for (uniform int i = 0; i < length; ++i) {
            print("data[%]: %\n", i, data[i]);
        }
    }

    // Output
    $ ./matrix-launch
    data[0]: [0,-1,2,-3,4,-5,6,-7]
    data[1]: [0,-1,2,-3,4,-5,6,-7]
    ...
    data[9]: [0,-1,2,-3,4,-5,6,-7]

Figure 4.2. Sample matrix-based procedure in ISPC. proc_matrix creates a vector of varying values, which are zero-initialized and then, asynchronously, receive values depending on the thread they are subject to. In the upcoming section, we show how to process a matrix with two configurable dimensions, unlike this example, in which one dimension is parameterized and the other is given by the length of the processing warp (SIMD vector). The last lines show the output of running the program.
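As a side note on launch: ISPC can also create a batch of tasks in one statement, with the launch[n] form, each task identifying itself through the built-in taskIndex. The fragment below is our own sketch of how f_matrix above could be restated in that style; it is an illustration, not part of the original example:

    // One task per row, all launched at once (sketch).
    task void f_row(varying int* uniform data) {
        varying int& row = data[taskIndex];
        if (programIndex % 2 == 0)
            row = row + programIndex;
        else
            row = row - programIndex;
    }

    void f_matrix_batched(varying int data[], uniform int length) {
        launch[length] f_row(data);
        sync;
    }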
4.1.2 ISPC Language

The ISPC language is an extended version of the C programming language, and provides a number of new features that make it easy to write high-performance SPMD programs for the CPU. Albeit there is but a handful of syntactic differences between ISPC and C code, the former conveys a fundamentally parallel execution model, and thereby C code cannot simply be compiled by the ISPC compiler to correctly run in parallel. However, starting with working C code and porting it to ISPC is an efficient way to quickly write ISPC programs [Intel, 2016]. We now introduce small examples of programs written in ISPC, so as to further acquaint the reader with the language. Details on the syntax and semantics of the ISPC language can and should be looked up online, at ISPC's documentation webpage [Intel, 2016]. We focus here on the elements that should grant the reader the base knowledge to follow small ISPC applications and, indeed, to appreciate the motivation for our CREV idiom in practice.

Hello World. We believe the examples from Figures 4.1 and 4.2 are very good starters but, to review the very basics of ISPC and make sure no reader lags behind (which is important!), we include this hello world example (Figure 4.3). Notice, though, that the lessons here are more technical, in the sense that they broaden familiarity with the ISPC syntax, whereas the forthcoming sample applications focus on exercising the parallel concepts of the language. As discussed before, ISPC code is organized in two main files: the main C/C++ file, which includes the header file with the ISPC function declarations; and the ISPC implementation file, in which the developer defines and exports whatever functions should compose the ISPC interface with the C/C++ main. To access exported functions in the host language, use either the C++ namespace ispc::*, or a separating underscore, as in ispc_*.

    // C++ file: main.cpp
    #include "helloworld_ispc.h"

    int main(void) {
        ispc::hello_world();
        return 0;
    }

    // ISPC file: helloworld.ispc
    export void hello_world() {
        print("Warp length: %\n", programCount);
        print("Active threads (all): %\n", programIndex);
        if (programIndex % 2 == 0) {
            print("Active threads (even ids): %\n", programIndex);
        } else {
            print("Active threads (odd ids): %\n", programIndex);
        }
    }

    // Output
    $ ./hello_world
    Warp length: 8
    Active threads (all): [0,1,2,3,4,5,6,7]
    Active threads (even ids): [0,((1)),2,((3)),4,((5)),6,((7))]
    Active threads (odd ids): [((0)),1,((2)),3,((4)),5,((6)),7]

Figure 4.3. This example shows a very simple ISPC hello world program. We try to cover the notion of a running warp, possible divergences, and some of the basic keywords of ISPC.
The ISPC compiler generates both an assembly and a header file. The header consists of the composite types created in the ISPC source (special structs and typedefs), along with the methods tagged with the export keyword. Therefore, apart from type declarations, any method expected to be available in the C/C++ source should be marked as export. The language also provides a print function with a special implementation for dumping varying values. The expression string has a formatting similar to that of C's printf, except that one needs only to add % as placeholders for variables; no type distinction is required. ISPC has two reserved keywords that give the programmer information about the warp size (programCount) and the index of each thread (programIndex). The first is a regular integer, and is output as such; the second is a varying integer, with one value per thread in the warp (the index of each thread). Within a divergent region, inactive threads are marked by their indices surrounded by a double pair of parentheses, as in the output presented in our hello world example.

Varying-Uniform data access. We also take the time to show, in the example from Figure 4.4, a syntactically invalid line, implementing the impossible assignment of a varying value to a uniform variable. This dummy procedure shows a very simple series of ISPC assignments whose validity depends on whether the variable belongs to the global address space or is local/private to each thread. The single invalid combination is that of assigning a varying value to a uniform variable: the compiler cannot know from which thread to extract the value, and therefore cannot validate the syntax. This notion is revised and receives a new meaning when dealing with our CREV idiom extension, as will be discussed in Section 4.2.

    export void var_uni() {
        uniform int a = programCount;  // regular global-to-global assignment
        uniform int b = programIndex;  // which private value should we keep?
        varying int c = programCount;  // broadcast programCount
        varying int d = programIndex;  // regular private-to-private assignment
    }

Figure 4.4. Assignments between uniform and varying variables. The declaration of b, which assigns a varying value to a uniform variable, is invalid in plain ISPC.

Re-enabling threads and task parallelism. We know from our discussion in Section 6 that meeting the conditions that demand re-enabling threads within a divergent SIMT context is not a mere programming flourish. Re-enabling threads may become confusing, depending on the organization of the code and on the level of nesting at which the divergence is located. Such difficulty, along with the complexity of keeping track of the control flow at the code level, characterizes warp-synchronous programming. Along with the source of our CREV-extended version of ISPC, the user will find quite a few examples of warp-synchronous code. As a hint of the level of detail required by this programming paradigm, we add to the appendix of this document the implementation of the intrinsically warp-synchronous algorithm Bitonic Sort (A.1). In Section 5 we present some examples of task parallelism, which is ISPC's take on dynamic parallelism.
With special attention to the Merge-Sort and Quick-Sort algorithms (Section 5.5), we show how CREV may be an efficient surrogate for dynamic parallelism. In a nutshell, task parallelism in ISPC is materialized via the keyword launch, whose usage, unsurprisingly, is akin in syntax to that of crev and of CUDA's dynamic parallelism. Overall, we hope the reader was able to follow the examples presented in this section. Any questions on specific syntactic constructs should be clarified by ISPC's documentation [Intel, 2016], available online.

4.1.3 ISPC Compiler Architecture

The ISPC compiler follows the common compiler organization: a front-end deals with its high-level C-based ISPC language; a middle-end applies code optimizations, using mostly LLVM's intermediate representation; and a back-end, also based on the LLVM infrastructure, further optimizes and lowers the intermediate representation to the selected target architecture. For the initial phase, ISPC uses Bison, a YACC-compatible parser generator, and Flex, a fast lexical analyzer generator. We believe that fine-grained information on these tools, including on the LLVM infrastructure, is out of the scope of this work, and thus will not be covered. There is, though, plenty of reference material on these tools online:

- Flex, fast lexical analyzer generator (>= 2.5)
- Bison, YACC-compatible parser generator (>= 2.4)
- LLVM Compiler Infrastructure (3.7.1)

Although we do not get into the minute details of the tools that compose ISPC, we find it relevant to highlight where each piece of code we wrote to implement our idiom lies. This concerns mostly the organization of the compiler itself: we have added an extension to the compiler, which is accessible, implementation-wise, via a keyword, crev. To implement this idiom on top of ISPC, we had to modify: the lexer, so as to identify our keyword crev within the code; the parser, in order to validate the keyword and verify the syntax of a composite function call (crev followed by a regular call); ISPC's type system, so as to overlook formerly-invalid varying-to-uniform parameter passing, which becomes valid in the context of a crev function invocation; and, of course, the execution flow of crev itself, implemented thoroughly on top of LLVM, ISPC's intermediate representation. The files imbued with crev-processing logic are listed in Figure 4.5. In the next section, we show how we have implemented CREV at a lower level, closer to the actual source code. As a gentle reminder, our code is available online¹ for whoever may be interested in (and can afford) spending some time to check out what exactly has been done. Still, we hope the description is sufficiently self-contained, so that anyone is able to understand our approach.

¹ISPC-CREV source code: http://cuda.dcc.ufmg.br/swan/src/crev-ispc.tar.gz.

- lex.ll: defines the lexemes and the lexical grammar of the ISPC language. We have included our crev token into ISPC accordingly.
- parse.yy: defines the syntactic grammar of the ISPC language. We have incorporated into ISPC's grammar the rules necessary to validate crev function calls.
- type.cpp: defines ISPC's type system, i.e., the set of types the language supports, as well as the possible coercions between those types. We have gone through the entire type system, analyzing the cases in which crev validates coercions from varying to uniform variables, utterly prohibited in plain ISPC.
- ctx.h: declares structures for function context emission. We only had to add a flag to notify other structures in the code that they were dealing with a crev function call.
- expr.h: declares types and structures required for processing ISPC expressions. We had to update a few function headers, just enough to pass around a flag stating whether the expression under analysis is a crev call.
- expr.cpp: defines functions to process the various expressions allowed in ISPC. We have extended the set of expressions to bestow upon ISPC a fully-working crev implementation.

Figure 4.5. List of files modified in ISPC to implement our CREV idiom.

4.2 Implementation of CREV on ISPC

We now give a brief, yet lower-level view of our CREV implementation on Intel's SPMD Program Compiler. Recall, from the list of files in Figure 4.5, file expr.cpp, which we have extended so as to define the behavior of a crev expression. We shall focus our discussion on this particular file. Afterwards, we present a few constructive examples of our idiom in practice.

4.2.1 ISPC-CREV

We chose Intel's SPMD Program Compiler because it is open source, but we were quite lucky to find that it makes extensive use of the LLVM infrastructure, a compilation framework with a large, active community and well-documented code. Although we have applied a handful of changes using domain-specific languages (e.g., Bison and Flex), most of the code implementing crev has been written in plain C++, using library functions from LLVM. We now take a closer look at the main procedure in our solution, which defines the behavior of a crev expression. As a side note, during the implementation of our solution on top of ISPC we found Function Call Re-Vectorization to be a more suitable name for what we were trying to achieve than its predecessor, Lightweight Dynamic Parallelism. As the code was mostly implemented by then, we kept the old naming in the actual ISPC-CREV code. Here, for consistency with the remainder of the work, we update every occurrence of the old acronym ldp to crev.

    Algorithm 6: CREV(Function F, Function target, VarList args)
    1   currentMask stores the state of threads before CREV;
    2   for each index i := 0 to warpSize do
    3     insert test in F to check whether thread i is active;
          /* Re-building arguments for thread i */
    4     argsi ← [];
    5     for each arg in args do
    6       if arg is varying but parameter is uniform then
    7         append value from extract(arg, i) to argsi;
    8       else
    9         append arg to argsi;
          /* Enabling all threads and executing target function */
    10    everywhere;
    11    run target function with argsi arguments;
          /* Resetting original thread activity */
    12    set warp state back to currentMask;
    13    insert branch in F to jump to next thread or to final basic block;

Algorithm 6 is a code generation procedure. That means there is no actual execution of the program going on; the algorithm simply describes how code is generated. By the time a crev expression reaches ISPC/LLVM code generation, it has already been asserted to be syntactically correct and type-sound: the expression and its associated operands are checked for syntax (crev followed by a function call) and for type equivalence. Type equivalence here means that each operand must be either of the same type as, or coercible to, its associated formal parameter. We begin by assigning to variable currentMask the state of the running warp (which threads are active and inactive).
Right after, for every thread of the warp, we create a basic block that holds a test: if the corresponding thread i is active, the target function will be invoked using arguments argsi specific for that thread; otherwise, the execution shall flow right into the subsequent basic block, which holds the test for thread i+ 1. For simplicity, we do not dwell into the basic block chaining process. In case the target function has to be invoked for thread i, we first extract the values from varying variables whose formal parameters are uniform. Recall from Ex- ample 4.1.2 (Varying-Uniform data access) that assginment from varying to uniform is strictly prohibited in plain ISPC. Our idiom allows invoking, from within divergent regions, SIMD functions that require all threads to be active. If the programmer wants to exploit the vector processing capabilities of their CPU, intuitively, there must be plenty of data associated to the threads that are active within the region of divergence. Therefore, it is imperative to provide developers with such coercion. The extraction is quite straightforward: whenever the coercion must take place, we issue an extra instruction, namely extract, responsible for loading the i-th value from a vectorized variable into a uniform one. Following the extraction and re-building of the original list of arguments, we finally invoke the target function using argsi as argument list. Before invoking the function, though, we must make sure all threads will be enabled. To this end, we also emit the instruction everywhere, that temporarily enables all threads within the warp. The remaining of our code emission procedure consists in resetting the warp to its original state and chaining the current basic block with subsequent ones (last two lines from Algorithm 6). 4.2.2 Active Load Balancing Another benet from our construct is its clear-cut syntax-friendly load balancing func- tionality. Suppose you are about to implement a depth-rst search (DFS) in an SIMD language. We know each thread to have it own data at the beginning of the traversal, and we also know the execution to be in a lock-step fashion: whenever a thread runs out of data to process, it will be held inactive, anxiously lingering for the remaining threads to nish their workload. Now imagine the outcome of a heavily unbalanced initial data distribution  say a straight-line graph, rooted at a single thread. If the graph was not that unbalanced, the developer should be able to always re-distribute the workload, in a way to actively maintain threads busy. This is exactly the approach depicted in Figure 4.2.2: the last conditional of function dfs creates a divergent region, which does not interfere in having all threads active under recursive calls to dfs, due to our idiom crev. In the case of a straight-line graph, not even our 4.3. Discussion 41 // Traverses the matrix in a depth f i r s t f a sh i on void d f s ( uniform s t r u c t Graph& graph , uniform in t root , f l o a t  uniform f ) { // Se t t i ng cur rent node as v i s i t e d i f ( graph . node [ root ] . v i s i t e d ) re turn ; graph . node [ root ] . v i s i t e d = true ; // Performing some computation ( j u s t a d i v i s i on , as example ) f [ root ] = graph . node [ root ] . l ength / ( f l o a t ) graph . num_nodes ; // P a r a l l e l t r a v e r s i n g fo r each ( i = 0 . . . graph . node [ root ] . l ength ) { i n t ch i l d = graph . node [ root ] . edge [ i ] . node ; i f ( ! graph . node [ c h i l d ] . 
    // Traverses the graph in a depth-first fashion
    void dfs(uniform struct Graph& graph, uniform int root, float* uniform f) {
        // Setting current node as visited
        if (graph.node[root].visited)
            return;
        graph.node[root].visited = true;
        // Performing some computation (just a division, as example)
        f[root] = graph.node[root].length / (float) graph.num_nodes;
        // Parallel traversing
        foreach (i = 0 ... graph.node[root].length) {
            int child = graph.node[root].edge[i].node;
            if (!graph.node[child].visited)
                crev dfs(graph, child, f);
        }
    }

    // Interface function with the C/C++ main function
    export void graph_dfs(uniform struct Graph& graph, uniform int root, float* uniform f) {
        foreach (i = 0 ... graph.num_nodes)
            graph.node[i].visited = false;
        dfs(graph, root, f);
    }

Figure 4.6. ISPC-CREV implementation of a Depth-First Search. We highlight the contribution of crev to achieving an active load-balancing policy during the traversal: whenever function dfs is called, the data within the varying variable child is distributed in independent calls to crev's target function dfs. This allows having all threads active within inner calls of dfs, even within the divergent region created by the last conditional of that function.

In the case of a straight-line graph, not even our approach would be able to make amends; but, intuitively, we should be able to tackle quite a few cases of moderately unbalanced graphs.

4.3 Discussion

This chapter has presented the implementation of CREV in ISPC, the Intel SPMD Program Compiler. ISPC is an industrial-strength product, which gave us all the infrastructure necessary to concretize the abstract semantics seen in the last chapter into an actual implementation. One important question that arises at this point is: "Why not implement CREV on an actual GPU?" As we have seen in the last chapter, the implementation of CREV requires the ability to wake up threads that, due to divergences, are in a dormant state. ISPC gives us this ability, inasmuch as the mask of active threads is visible to the developers of this compiler. Unfortunately, this is not the case in current Nvidia machines. Thus, we opted to demonstrate the effectiveness of our ideas on vector instructions. Nevertheless, this implementation is solid enough to give us the opportunity to design and test high-performance code. This is a story that we shall tell in the next chapter.

Chapter 5

Experimental Evaluation

In order to evaluate the ideas presented in this work, we have implemented crev onto ISPC. Because crev is a novel concept within ISPC, this compilation framework does not provide benchmarks to evaluate our contribution. Thus, we have re-implemented seven classic algorithms using the new keyword. We compare these algorithms against parallel versions written in ISPC. Our seven benchmarks are: (i) String Matching; (ii) Depth-First Search; (iii) Leader Election; (iv) Book Filter; (v) Bellman-Ford; (vi) Merge-Sort; and (vii) Quick-Sort.

How to read our results. Results are measured in millions of execution cycles, as reported by ISPC's testing environment. All the numbers reported are the average of five out of six samples; we have removed the first to avoid cold-start discrepancies. The reader must bear three observations in mind when analyzing our results: (i) speedups are due to the better load distribution that CREV accomplishes by transporting work to inactive threads; (ii) slowdowns are due to the boilerplate code necessary to serialize threads before invoking r-functions; (iii) we are comparing against an industrial-strength compiler; hence, speedups tend to be modest.

Experimental Setup. We have implemented CREV onto ISPC v1.9.1, and have used it to target a 6-core 2.00 GHz Intel Xeon E5-2620 CPU with 8-wide AVX vector units. Henceforth, we shall be using warps that contain eight threads, i.e., |Θall| = 8.
5.1 String Matching

String matching is the problem of finding a pattern P within a text T. Algorithm 7 shows the CREV-based implementation of string matching. This is a warp-synchronous implementation of parallel matching: each thread tid tries to match P at positions T[tid + n·W], where n ≤ |T|, and W is the warp size. Thus, in the best scenario, runtime is divided by W. This implementation is irregular: divergences might happen at lines 6 and 10. Each call to memcmp¹ will commence a CREV sequence of computations.

    Algorithm 7: Pattern matching: CREV vs. naïve
    1   P ← pattern; T ← target text;
    2   W ← warp size; tid ← thread index;
    3   Function memcmp(Offset k)
    4     m ← True;
    5     for i ← tid to |P| do
    6       if P[i] ≠ T[i + k] then m ← False;
    7     if all (m = True) then Found(k);
    8   Function StringMatch
    9     for i ← tid to (|T| − |P|) step W do
    10      if P[0] = T[i] then crev memcmp(i);
    11  Function NaiveStringMatch
    12    for i ← tid to (|T| − |P|) step W do
    13      j ← 0; k ← i;
    14      while j < |P| and P[j] = T[k] do
    15        j ← j + 1; k ← k + 1;
    16      if j = |P| then Found(k);

Figure 5.1 compares our implementation, seen in Algorithm 7 (StringMatch), against the equivalent parallel version that uses ISPC primitives (Par). This competitor is function NaiveStringMatch in Algorithm 7. To give the reader a better perspective of the results, we also compare against the Knuth-Morris-Pratt (KMP) [Knuth et al., 1977] algorithm. KMP is sequential, but has lower complexity than Algorithm 7: it runs in O(|T| + |P|), whereas Algorithm 7 runs in O(|T|·|P|/|W|). For this experiment we searched for prefixes of the pattern "She had been watching him the la", of sizes 4, 8, ..., 28, 32, in Jane Austen's book Pride and Prejudice, taken from Project Gutenberg². CREV is always faster than NaiveStringMatch, and runs faster than KMP in more than half the cases. CREV beats plain parallelism because it distributes function memcmp among the eight available threads. On the other hand, NaiveStringMatch has a potentially long divergent block at line 14. In our best result, observed for patterns of size eight, CREV runs in 44% of the time taken by NaiveStringMatch, and in 40% of the time taken by KMP.

Figure 5.1. Comparison between CREV-based string matching (Algorithm 7), ISPC's parallel implementation, and the Knuth-Morris-Pratt version of pattern matching. The Y-axis shows runtime, in millions of cycles; the X-axis shows pattern sizes, in number of characters. The target text contains 256MB divided among 5,058,121 lines. White boxes show the percentage of speedup of CREV over PAR; grey boxes show the percentage of speedup of CREV over KMP. (Chart omitted: a bar chart with per-size speedup annotations.)

¹Function memcmp is also used at line 5 of Algorithm 1.
²https://www.gutenberg.org/ebooks/42671
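Before moving on, it may help to see the r-function of Algorithm 7 in actual ISPC. The sketch below is ours (names and signatures are assumptions, and Algorithm 7 leaves the stride of its line 5 implicit); it uses foreach to sweep the pattern with the whole gang, and the all() reduction to implement the test of line 7:

    // Warp-synchronous comparison of P against T at offset k (sketch).
    uniform bool memcmp_r(uniform int8 P[], uniform int plen,
                          uniform int8 T[], uniform int k) {
        varying bool m = true;
        foreach (i = 0 ... plen)
            if (P[i] != T[i + k])
                m = false;
        return all(m);  // true only if every lane matched its slice
    }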
5.2 Book Filter

We have compared Algorithm 1 against a parallel version, implemented in the original ISPC language. To avoid borrowing from the gains already observed in Section 5.1, we have only invoked function memcpy (Algorithm 1, line 6) using crev. We have used as input a binary assembly file, and have tried to copy chunks of 80 bits that contain the pattern of a register-register move instruction. To vary the input file, we have cropped the file at prefixes having 1K, 5K, 10K, 15K and 20K bits. The number of occurrences of the target pattern, which marks the lines that must be copied, is similar in all the files: 4,097, 4,485, 4,795, 5,144 and 5,604 times. Figure 5.2 shows the result of this comparison. The CREV-based version runs faster on every sample, and the gap widens as the input grows: CREV gives a speedup of 11.2% on the 1K file, and of 13.2% on the 20K file. Gains in speed, in this experiment, are due to the r-function memcpy only, as memcmp was not invoked using crev.

Figure 5.2. Comparison between CREV's and ISPC's book filter (Algorithm 1). The Y-axis gives runtime, and the X-axis gives input size, in bits. White boxes show the speedup (%) over PAR. (Chart omitted; the annotated speedups are +10.1, +10.8, +11.9, +12.3 and +13.2.)

5.3 Bellman-Ford

Our third experiment is an implementation of the classic Bellman-Ford algorithm [Bellman, 1958]. This algorithm computes shortest paths from a single source to all the other nodes of the graph. As input we use Erdös-Rényi [Erdos and Renyi, 1959] graphs with 256, 512, 1024 and 2048 vertices, and a probability of 80% of an edge existing between any two nodes. Weights are randomly set from 1 to 100. Figure 5.3 shows the results that we have observed. CREV has yielded faster runtimes for all the input sizes, but the difference is small, and within the error margin for graphs with 256 and 512 nodes. In the other cases, CREV is faster by 5.8% (1024 nodes) and by 3.1% (2048 nodes).

Figure 5.3. Comparison between CREV's and ISPC's versions of Bellman-Ford. The Y-axis gives execution time, in millions of cycles, and the X-axis gives graph size, in number of nodes. White boxes show the percentage of speedup over PAR. (Chart omitted; the annotated speedups are +0.0, +1.0, +5.7 and +3.0.)

5.4 Depth-First Search and Leader Election

Our last two experiments use the same algorithm: Depth-First Search (DFS). We have compared CREV and ISPC on a straightforward implementation of DFS, seen in Algorithm 8. Algorithm 8 illustrates the composability of CREV, as it recursively calls the traversal routine for each vertex of the graph. To compare the CREV and plain parallel versions of DFS, we use the same graph model seen in Section 5.3. Figure 5.4 shows the results of this comparison. We have not observed any substantial difference between the two implementations. In its best performance, the CREV version of DFS is 4.4% faster than ISPC's implementation (2,048 nodes). In the worst case (4,096 nodes), CREV incurred a slowdown of 1.1%, within the error margin of this experiment. We speculate that this slowdown is due to the overhead of serializing function calls and saving thread masks; Figure 3.8 shows this serialization. In all the other samples CREV has been slightly faster: 3.9% for 256 nodes and 2.7% for 1,024. Results were the same for graphs with 512 vertices. These numbers are highly input-dependent. For instance, repeating the same experiment on full n-ary trees with 4K nodes reveals that CREV outperforms PAR by a greater margin. Full n-ary trees of height 3, 4 and 5 give us the following results: 0.036 vs. 0.043, 0.680 vs. 0.827, and 208.479 vs. 217.597. The first number is CREV's runtime; the second is PAR's, in millions of cycles.
Algorithm 8: SIMD depth-first traversal

1  W: warp size; tid: thread index;
2  Function DFS(Node root, Function f)
3      f(root);
4      C ← child list of node root;
5      for i ← tid to |C| step W do
6          c ← C[i];
7          if c is not null then
8              crev DFS(c, f);

We have used Algorithm 8 to build a leader election routine (LE). Leader Election uses a DFS, starting at each graph node, to propagate the identifier (ID) of that node to all the other vertices. Once a vertex receives an ID, it compares it against its current ID, keeping the greatest value. In the end, every vertex holds the largest ID among all the IDs within its connected component. Thus, this algorithm finds connected components in graphs, naming every vertex in the same component with a common identifier. Figure 5.5 shows the results of this experiment. CREV surpasses its original ISPC competitor on every sample. Its best result is a speedup of 15.5% for 2,048 nodes; the worst is a speedup of 3.0% for graphs with 4,096 vertices.

[Figure 5.4. Comparison between CREV-based DFS and ISPC's parallel version. The Y-axis gives execution time, in millions of cycles, and the X-axis gives graph size, in number of nodes. White boxes show the percentage of speedup over PAR.]

[Figure 5.5. Comparison between CREV-based Leader Election and ISPC's parallel version. White boxes show the percentage of speedup over PAR.]

5.5 Merge-Sort and Quick-Sort

Sorting belongs to the realm of classical algorithms studied in virtually every computer science course, as it serves as a basis for many other solutions. Naturally, we could not compose a benchmark suite without evaluating our performance in this area as well. To this end, we have implemented both the Quick-Sort and the Merge-Sort algorithms using our crev extension to ISPC. First we present the Merge-Sort algorithm and how we have applied our solution to it; afterwards, we show our approach to the Quick-Sort algorithm, a similar exercise, given the divide-and-conquer nature of both algorithms.

Algorithm 9: Merge sort implementation

 1  W: warp size; tid: thread index;
 2  Function mergesort-seq(Array A)
 3      if |A| ≤ 2 then simple sorting of array A;
 4      else
 5          q ← |A| ÷ 2; A′ ← {A, A + q};
 6          mergesort-seq(A′[0]); mergesort-seq(A′[1]);
 7          i ← 0; j ← q;
 8          S ← empty vector with |A| positions;
 9          for k ← 0 to |A| do
10              if i < q and j < |A| then
11                  if A[i] < A[j] then S[k] ← A[i]; ++i;
12                  else S[k] ← A[j]; ++j;
13              else if i < q then S[k] ← A[i]; ++i;
14              else S[k] ← A[j]; ++j;
15  Function mergesort-crev(Array A)
16      Equivalent to function mergesort-seq, except that line 6 is replaced by:
17      if tid < 2 then crev mergesort-crev(A′[tid]);

Algorithm 9 shows an implementation of Merge-Sort using crev. We aim at maximizing parallelism at small workloads, i.e., we want all threads to be active during the execution of SIMD instructions. Our implementation must cope with the fact that, upon reaching the branch at line 3, some threads may diverge, implying a loss of control-flow uniformity and thus invalidating eventual SIMD kernel calls. Before we conclude on our solution, there are two important things to understand.
First, although the program is SIMT, which is to say we have a group (warp) of threads executing the program, the parallelism is not always perceived, as we may be dealing with purely global values. In other words, operations such as the assignments in line 5 and the entire loop beginning at line 9 are sequential. The second important consideration, perhaps even more peculiar, is the fact that we call the mergesort-seq procedure twice in line 6. In ISPC, parallelism is structured in a subtler way, as the language is designed to exploit two different capabilities of the CPU: SIMD vector processing and multi-tasking. Whenever a kernel is called within an ISPC kernel, its invocation is interpreted as follows: the kernel starts with the threads that were active, as if it had been issued from a fresh C/C++ call, and, upon its start, threads previously inactive are put back into sleep mode. This is important as a clarification of the syntax used: line 6 does not describe the execution of multiple instances of the mergesort-seq procedure. Only two new executions take place, one with A′[0], the first half of array A, and the other with the remaining A′[1] elements. Although the function name is tagged sequential (mergesort-seq), the function is valid ISPC. It just so happens that, be it implemented in plain C/C++ or as an ISPC kernel, its behavior is that of a purely sequential procedure, despite having the SIMD vector-processing capabilities at its full disposal in the latter case.

Algorithm 10: Quick sort implementation

 1  W: warp size; tid: thread index;
 2  Function quicksort-seq(Array A)
 3      if |A| ≤ 2 then simple sorting of array A;
 4      else
 5          m ← |A| − 2; p ← A[|A| − 1];
 6          i ← 0;
 7          while i < m do
 8              if A[i] ≤ p then ++i;
 9              else swap elements A[i] and A[m]; −−m;
10          if A[m] ≤ p then m ← m + 1;
11          swap elements A[m] and A[|A| − 1];
12          quicksort-seq(A); quicksort-seq(A + m + 1);
13  Function quicksort-crev(Array A)
14      Equivalent to function quicksort-seq, except that line 12 is replaced by:
15      if tid < 2 then crev quicksort-crev(A′[tid]);

Now, back to the line of thought that comprises our CREV solution: function mergesort-crev shows how simple it is to instill effective parallelism into the once sequential algorithm. We simply replace line 6 by line 17. Upon reaching line 17, threads are filtered so as to conform with the semantics of our solution: at most two threads will be active when issuing the crev directive, and therefore only two executions of mergesort-crev will take place, one having A′[tid = 0] and the other having A′[tid = 1] as input. During the execution of each recursive invocation of mergesort-crev, all threads within the warp are active, which keeps the program sound: execution progresses properly, as threads 0 and 1 will always be active upon entry in mergesort-crev, and the evaluation of the kernel is always syntactically correct.

The Quick-Sort algorithm, for all its distinctions w.r.t. Merge-Sort, shares the same divide-and-conquer nature. Our CREV solution for Quick-Sort is analogous to the one just described for Merge-Sort, so we leave the details to the reader; Algorithm 10 describes the sequential and crev-based versions of this sorting procedure.
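To see the divide step in concrete syntax, the sketch below renders the pattern shared by mergesort-crev and quicksort-crev in ISPC style. Only the lane filtering and the crev call mirror lines 15 to 17 of Algorithm 9; the interval arithmetic is a simplified placeholder, and we again assume that crev scalarizes each active lane's arguments, as it does for memcmp(i) in Algorithm 7.

    // Hedged sketch of the divide step: at most two lanes survive the
    // filter, and each surviving lane's call re-runs the function with
    // the full warp active on one half of the interval [from, to).
    void mergesort_crev(uniform int A[], uniform int from, uniform int to) {
        if (to - from <= 2) { /* base case: small sort, cf. line 3 */ }
        else {
            uniform int q = (to - from) / 2;
            // Lane 0 takes [from, from + q); lane 1 takes [from + q, to).
            if (programIndex < 2)
                crev mergesort_crev(A, from + programIndex * q,
                                    (programIndex == 0) ? from + q : to);
            /* sequential merge of the two sorted halves, cf. lines 7-14 */
        }
    }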
Algorithm 11: Bitonic Sort

1  if |A| < 2 then return;
2  if |A| = 2 and A[0] > A[1] then swap elements A[0] and A[1];
3  assert that |A| goes up to, at most, the warp size;
4  load the values of array A into private variables: val ← A[tid];
5  generate a bitonic sequence;
6  sort the bitonic sequence;
7  move the sorted sequence into the original container: A[tid] ← val;

Table 5.1. Runtimes for the sorting algorithms on different input vector lengths. We wrote the merge-sort and quick-sort algorithms both with crev and with ISPC's launch, each in a default version and in a version relying on bitonic sort for fine-grain optimization. The results make explicit how performant our technique is, in the sense that we obtained speedups at the cost of very small code changes. The first block of results refers to merge-sort, whereas the bottom half refers to quick-sort.

length    seq      launch-def   launch-bi   crev-def   crev-bi
   1      0.270      13.032       1.081       0.332      0.186
   2      0.528      17.899       2.855       0.636      0.474
   4      1.253      27.086       5.110       1.242      0.861
   8      2.635      41.806      50.664       2.579      1.829
  16      5.201     106.098      64.468       4.383      3.058
  32      7.302     169.924     104.985       5.191      4.114
   1      0.127      12.046       3.967       0.130      0.125
   2      0.267      77.302      56.131       0.331      0.309
   4      0.646      89.428       6.955       0.643      0.643
   8      1.349     241.971     140.169       1.202      1.241
  16      2.125     343.630     186.423       1.915      1.963
  32      2.876     420.213     204.278       2.736      2.878

We have also tried to improve the performance of both sorting algorithms. To do so, we replaced the base step, which consisted of a small sequence of instructions to check for and swap unordered pairs, by an efficient implementation of bitonic sort. Bitonic sort is an essentially SIMD parallel algorithm, and thus fits more than adequately into our CREV approach. Algorithm 11 presents a high-level description of the BitonicSort procedure; implementation details are available in Appendix A.1.

Our results show that CREV-based implementations of Merge- and Quick-Sort can be quite efficient, and can achieve even better results with local algorithmic optimizations (bitonic sorting). Moreover, it is important to realize how simple it was to derive a working CREV algorithm from an existing sequential procedure. Although we look forward to achieving ever-improved runtimes, CREV's main goal is to render efficient warp-synchronous programming a simple task.

5.6 Discussion

This chapter has presented an empirical evaluation of the implementation of CREV available in the ISPC compiler. To carry out this evaluation, we have used CREV to implement seven classic textbook algorithms. We believe that this chapter meets its goal: our algorithms have almost the same syntax as dynamic parallelism; however, they are as efficient as warp-synchronous programs. And, contrary to warp-synchronous implementations, our algorithms do not suffer restrictions such as the inability to invoke functions within divergent regions. Indeed, the very fact that we can use vector instructions to boost the performance of quick-sort is a testimony to the usefulness of CREV: it would be very difficult to implement this algorithm in ISPC without any form of dynamic parallelism. Even though CREV is not dynamic parallelism per se, it affords the same syntax, with much better efficiency.

Chapter 6

A Scandalously Brief History of Vectorization

GPUs' increasing programmability and decreasing costs have made them very popular for the development of general-purpose high-performance applications [Nickolls and Dally, 2010].
This popularity has attracted the interest of programming language researchers, particularly for studies on control-flow divergences. Hence, the compiler-related literature contains a vast body of work describing analyses [Coutinho et al., 2011, Sampaio et al., 2012, Sampaio et al., 2013, Schaub et al., 2015] and optimizations [Coutinho et al., 2012, Coutinho et al., 2011, Zhang et al., 2011, Wu et al., 2016] that reduce the effects of divergences in GPGPU code. CREV is not a competitor of these analyses and optimizations. On the contrary, Call Re-Vectorization complements such techniques, giving programmers a tool that lets them deal with divergences at the software level.

Flynn's Taxonomy. Recall from Chapter 1 our quick discussion of some of the cases within Flynn's taxonomy. To refresh our memory of those concepts, let us go through them once more, this time covering all of the proposed terms [Flynn, 1972]: SISD, MISD, SIMD, MIMD, and the SIMT variant. The SISD setting corresponds to a Single Instruction operating on a Single Data stream, and is exemplified by regular serial computers. An MISD environment is hypothetically possible, but is often deemed impractical, as it would amount to executing Multiple Instructions on a Single Datum, not very profitable in a general-purpose context [Duncan, 1990]. An SIMD architecture consists of a Single Instruction stream read, in lock-step, by a vector of n units, each using a different data source from the Multiple Data streams. The SIMT organization is a Single Instruction source executed by Multiple Threads, say n, each thread holding an SIMD lane of length m, i.e., every instruction runs n·m times. At times we shall refer to this latter model as either SIMT or multi-threaded (MT) programming. It is not originally part of Flynn's taxonomy, but is a useful case to bear in mind, an extension of the SIMD setting. Finally, we have MIMD, in which multiple autonomous processors, and consequently Multiple Instructions, operate on Multiple Data streams.

A Scandalously Brief History of Vectorization. The appearance of vector processor architectures dates back to the late 1960s and early 1970s, mostly in the form of processing machines designed to support massive mathematical computations (vector and matrix processing) [Watson, 1972, Lincoln, 1978]. A vector processor consists basically of multiple functional units working in parallel, each having a memory section from which to read input. Such units could also be arranged to pipeline tasks: the input data flows to the initial row of computing units, whose output feeds directly into the next row of units. Vector processors' computing units implemented mostly arithmetic and boolean operations, both for vectors and scalars [Duncan, 1990]. Due to the pipeline extension, such machines are hard to fit seamlessly within one of Flynn's classes: they present neither the SIMD lockstep execution nor the asynchronous autonomy of the MIMD category. Some other machines, such as the famous Illiac IV, are categorized as SIMD, for they have vectors of processors executing in lockstep. Such computers are generally formed by a central CPU that feeds instructions to the processing units; the units are connected via some network, at times enabling communication between processors and from processors to memory [Duncan, 1990]. This setup would later be present in massively parallel boards.
In the good old days, graphics hardware boiled down to the so-called VGA, or Video Graphics Array/Adapter: display hardware first introduced with the IBM PS/2 line of computers in 1987 [Polsson, 2016]. A VGA card is basically an interface between the computer and its monitor. A program running on the CPU yields data that the target monitor uses to build and display images; it is the duty of the VGA card to hold the data the monitor will put up as pictures on exhibition, i.e., the VGA functions as a frame buffer. There must, of course, be a means of communication between the VGA card and the CPU itself. The graphics card is thus plugged into the motherboard and, given the required drivers, data transfer between the two devices becomes possible. The runtime involved in this process is also noticeable: subject to the memory hierarchy, data flows from the CPU to its main Random Access Memory (RAM); subsequently, the operating system puts the driver software to use, allowing the VGA card to copy data from the computer's RAM to the VGA's frame buffer. The runtime is, no doubt, dependent on the hardware's throughput and bandwidth capabilities, but as a rule of thumb, the less communication involved, the better.

Still, all the actual processing took place on the CPU, and was therefore implemented in software, which is costly both in terms of code development and eventual data-transfer latencies. The programming downside one may suggest tackling with library code, which is reasonable, but brings forth compatibility issues across VGA devices from different manufacturers, or even across versions of the same product. For memory access and communication, there are a few well-known optimizations, such as double buffering and page flipping [Brackeen, 1996]. Double buffering consists in generating all the data that must be copied to the graphics device in CPU memory first, and only then issuing the copy, so as to avoid multiple (unnecessary) requests. Page flipping, in its turn, focuses on visually smooth frame transitions: data is written to the frame buffer at a memory location not currently being used to refresh the screen; only after a whole new image has been built in the frame buffer are the frame pointers updated, so that the display device prints out fresh content.

Despite its importance as an interface between CPU and display device, the graphics card was still to be put to greater use. Optimizations help, but fall short in view of what could be done if the data did not have to travel from CPU to graphics device. Of course the device must read some input information, yet the processing should ideally happen on the VGA, with the generated output dispatched directly to the VGA's frame buffer. But what would someone want to run at the hardware level? The answer: a series of well-defined functions that are applied to an image before it can be properly displayed, e.g., vertex operations, primitive assembly, rasterization, fragment operations, and composition into a final image [Owens et al., 2007]. Support for specific computations was added, and thence surged the closest ancestors of today's video cards. Such boards, also referred to as accelerators, were indeed accelerating computations once performed entirely on the CPU: the first boards with processors allowed developers to run well-defined image-processing tasks, instead of having to resort to CPU computation followed by data copies.
An interesting fact about the nature of those operations is their embarrassing parallelism: an input image, usually represented as an n-dimensional array, a.k.a. a texture, exhibits data and control independence w.r.t. its processing elements (e.g., vertices, triangles, fragments) [Moya et al., 2005]. The same nature was found in the old SIMD machines, such as the Illiac IV, and is a clear call for vectorization. As there was a definite demand for parallel computational power, general-purpose computing evolved on top of graphics boards: more transistors and new functionalities were added to them. However, it was only around 2001 that such boards became practical and popular. Their spread was due to the advent of both programmable shaders and floating-point support on graphics cards. Matrix- and general vector-based problems were easily implemented on top of this hardware, which significantly accelerates the computation; one of the first common scientific programs to run faster on GPUs than on CPUs was an implementation of LU factorization (2005) [Du et al., 2012]. In addition, even a single GPU-CPU framework provided advantages that multiple CPUs on their own would not offer, due to the specialization of each chip [Mittal and Vetter, 2015]. Graphics card manufacturers oblige such parallelism by packing massive amounts of processing cores onto their boards. The growth in the computational power of this hardware is reflected by the number of transistors, as shown in Table 6.1 [Pereira, 2014, Hardware-INFOS, 2017].

Table 6.1. A scandalously brief timeline of GPUs. Both the number of transistors and the maximum GFlops delivered by top-performance graphics processing boards have grown exponentially over the past ten years.

Year   Transistors   Model             Tech                  Max GFlops
1999   25M           GeForce 256       DX7, OpenGL                 8
2001   60M           GeForce 3         Programmable Shader        10
2002   125M          GeForce FX        CG programs                29
2006   681M          GeForce 8800      C for CUDA                576
2008   1.4G          GeForce GTX 285   IEEE FP                  1063
2010   3.0G          GeForce GTX 480   Cache, C++               1345

Forms of Parallelism. General-purpose graphics processing units exhibit four forms of parallelism. The hardware pipeline is divided into hundreds of single-cycle stages to increase throughput and GPU clock frequency; this technique is named pipeline parallelism, and relies on overlapping computations at the many different processing units at different stages. For example, if an instruction at processing unit A requires data to be loaded from memory, the execution of that instruction may stall; but as long as there is some further operation not depending on the result of the memory load, that computation may take place at a different stage of the pipeline. Moreover, the pipeline stages are replicated to process multiple vertices, triangles and fragments in parallel. This is the basic proposal of SIMD architectures, and this approach, namely data parallelism, consists in supplying developers with more processing units at a time, each unit drawing data from its own source stream [Moya et al., 2005]. Another form of parallelism bestowed on GPUs is multi-threading: multiple processing elements are stored and processed concurrently to hide memory latencies in a specific stage or processing unit.
From the programming point of view, multi-threading comprises a series of independent processes, each reading input data from a distinct data stream; such behavior is achieved by launching different kernels in parallel. Finally, instructions can be re-ordered and combined into groups that may execute in parallel without changing the result of the program. In this way, GPUs also allow their processing units to execute independent instructions in parallel, thus conveying the so-called instruction-level parallelism [Moya et al., 2005].

Vector Processor Languages. There are many different approaches to exploiting the computational power of vector processors, be they CPU- or GPU-based. Languages may be machine assembly code, in which case the programmer directly describes the desired sequence of operations, or high-level programming abstractions of such machine code. We now give a quick overview of some well-known vector-based programming languages.

C is not uncommonly chosen as the base language for vector-processing extensions. C is originally a scalar language, with no intrinsics for data distribution or parallel execution. As C is but a step up from assembly code, it gives programmers a proximity to the assembly code and memory accesses that is not possible in higher-level languages. Since vector-processing extensions involve many vector copies and accesses, it is convenient to use languages such as C. Moreover, it is possible to implement efficient vectorized code in C by inlining assembly instructions for vector processors.

SSE, or Streaming SIMD Extensions, is an SIMD instruction-set extension to the x86 architecture, designed by Intel in 1999. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing. SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3, and SSE4 [Wikipedia, 2017b].

AVX, or Advanced Vector Extensions, is another extension to the x86 instruction-set architecture for microprocessors from Intel and AMD, originally proposed by Intel in March 2008. AVX provides new features, new instructions and a new coding scheme, and has later been extended by AVX2 and AVX-512 [Wikipedia, 2017a].

Intel's SPMD Program Compiler, ISPC for short, is both a language and a compiler developed by Intel. Its code is open source, and its focus is on allowing programmers to write efficient parallel multi-threaded programs for CPUs. There is also a clear interest in minimizing the complexity of code writing, as the language provides abstractions for SIMD and SIMT programming. The code is lowered to Intel's SSE/AVX assembly languages as the programmer sees fit. ISPC is an extension of the C programming language, and also supplies coders with C++ library interfaces and compilation infrastructure, i.e., ISPC kernels may be used in conjunction with either C or C++ [Intel, 2016].
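To give the reader a flavor of the model, the kernel below follows the style of the examples in ISPC's own documentation; the function and parameter names are ours, not part of any library. The foreach loop maps iterations onto SIMD lanes, and the export qualifier makes the function callable from C or C++ as an ordinary external function.

    // A minimal ISPC kernel: each foreach iteration runs on one SIMD lane,
    // so the multiplication below processes a gang of elements per vector
    // instruction.
    export void scale(uniform float vin[], uniform float vout[],
                      uniform int count, uniform float factor) {
        foreach (i = 0 ... count) {
            vout[i] = factor * vin[i];
        }
    }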
CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). Its software layer grants developers direct access to the GPU's virtual instruction set for the execution of compute kernels. GPGPU programming was far from easy in the early years of graphics boards, as developers had to map scientific calculations onto problems that could be represented by triangles and polygons. In 2003, a team of researchers led by Ian Buck unveiled Brook, the first widely adopted programming model to extend C with data-parallel constructs. The language brought in concepts such as streams, kernels and reduction operators, and its compiler and runtime system exposed the GPU as a general-purpose processor in a high-level language. Most importantly, Brook programs were not only easier to write than hand-tuned GPU code, they were seven times faster than similar existing code. NVIDIA then invited Ian Buck to join the company and start evolving a solution to seamlessly run C on the GPU, which culminated in CUDA in 2006 [NVIDIA, 2017].

Finally, OpenCL views a computing system as consisting of a number of compute devices, which might be central processing units (CPUs) or accelerators such as graphics processing units (GPUs), attached to a host processor (a CPU). OpenCL defines a C-like language for writing programs and, akin to CUDA, functions executed on an OpenCL device are called kernels [Howes and Munshi, 2015]. Processing nodes are defined according to the hardware setting, but should correspond to the number of SIMD processing lanes available at runtime. It is hard to define what a processing node is, even within a CPU, as it may have a static number of cores but enhance its power virtually via hyperthreading [Gaster et al., 2012]. A single kernel execution can run on all or many of the processing nodes, in parallel. In addition to its C-like programming language, OpenCL defines an application programming interface (API) that allows programs running on the host to launch kernels on the compute devices and manage device memory, which is (at least conceptually) separate from host memory. Programs in the OpenCL language are intended to be compiled at run time, so that OpenCL-using applications are portable between implementations for various host devices [Stone et al., 2010]. The OpenCL standard defines host APIs for C and C++, but third-party APIs exist for other programming languages and platforms, such as Python [Klöckner et al., 2012], Java and .NET [Gaster et al., 2012].

Chapter 7

Final Thoughts

Primitives such as warp vote and shuffle have given experts the possibility of writing efficient SIMD code by programming from the point of view of one warp. This coding style has been used in CUB and many other CUDA libraries¹. However, warp-synchronous code does not play well with branch divergence: most warp-synchronous algorithms require all threads in a warp to be active. This is a problem for the common usage scenario of a simple MT-style CUDA kernel that calls warp-synchronous library functions. It is our vision that the application developer writing the kernel should not be concerned with the internal implementation of library functions, and should be able to call any function inside divergent program regions. To meet the demands of this vision, this dissertation has introduced the notion of Call Re-Vectorization (CREV). We have described the building blocks necessary to implement CREV. Looking towards compatibility with future hardware, we have proposed low-level primitives with well-defined semantics and a high-level interface, the crev idiom, that makes programmer intent explicit. Thus, our notion of CREV does not rely implicitly on current hardware behavior, which might eventually change. We have implemented CREV in ISPC, using Intel instructions, and have shown how to code irregular algorithms in this environment.
Our implementations are not only clearer than non-CREV based approaches, but also more efficient, as they balance work among otherwise inactive warp threads.

We believe that this work opens up several different research directions. For instance, we have been using CREV manually. That is to say: thus far, the developer must manually annotate functions with the CREV high-level keyword. However, nothing hinders a compiler from providing static analyses that add this primitive directly onto source code already in place. Whoever decides to follow this direction will have a number of concerns to worry about, such as: which functions to vectorize, which thresholds to consider when invoking r-functions, and how to measure the benefit of function re-vectorization.

¹See https://nvlabs.github.io/cub/

List of Terms

ISPC: Intel's Single Program Multiple Data compiler. An open-source compiler available online at https://ispc.github.io/.

SIMD: An SIMD architecture consists of a Single Instruction stream read, in lock-step, by a vector of n units, each using a different data source from the Multiple Data streams.

SIMT: An SIMT machine has a Single Instruction source executed by Multiple Threads, say n, each thread holding an SIMD lane of length m, i.e., every instruction runs n·m times.

SPMD: SPMD, or Single Program, Multiple Data, identifies architectures with a Single Program executed independently by many processing units, each with its own data stream.

Bibliography

[Bellman, 1958] Bellman, R. (1958). On a routing problem. Quarterly of Applied Mathematics, 16(1):87--90.

[Bougé and Levaire, 1992] Bougé, L. and Levaire, J.-L. (1992). Control structures for data-parallel SIMD languages: semantics and implementation. Future Generation Computer Systems, 8(4):363--378.

[Brackeen, 1996] Brackeen, D. (1996). 256-color VGA programming in C. http://www.brackeen.com/vga/unchain.html. Accessed: 2017-02-19.

[Brodman et al., 2014] Brodman, J., Babokin, D., Filippov, I., and Tu, P. (2014). Writing scalable SIMD programs with ISPC. In WPMVP, pages 25--32. ACM.

[Catanzaro, 2012] Catanzaro, B. (2012). Trove CUDA library. https://github.com/bryancatanzaro/trove. Accessed: 2017-02-02.

[Chen and Shen, 2015] Chen, G. and Shen, X. (2015). Free launch: optimizing GPU dynamic kernel launches through thread reuse. In Micro, pages 407--419. ACM.

[Coutinho et al., 2011] Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Jr., W. M. (2011). Divergence analysis and optimizations. In PACT, pages 320--329. IEEE.

[Coutinho et al., 2012] Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Jr., W. M. (2012). Profiling divergences in GPU applications. Concurrency and Computation: Practice and Experience, 1(10.1002/cpe.285-15):1--15.

[Dias et al., 2016] Dias, V., Moreira, R., Meira, W., and Guedes, D. (2016). Diagnosing performance bottlenecks in massive data parallel programs. In Cluster, Cloud and Grid Computing (CCGrid), 2016 16th IEEE/ACM International Symposium on, pages 273--276. IEEE.

[DiMarco and Taufer, 2013] DiMarco, J. and Taufer, M. (2013). Performance impact of dynamic parallelism on different clustering algorithms.
Modeling and Simulation for Defense Systems and Applications, 8752(VIII):87520E--87520E:8.

[Du et al., 2012] Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., and Dongarra, J. (2012). From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing, 38(8):391--407.

[Duncan, 1990] Duncan, R. (1990). A survey of parallel computer architectures. Computer, 23(2):5--16.

[Erdos and Renyi, 1959] Erdos, P. and Renyi, A. (1959). On random graphs. I. Publicationes Mathematicae, 6(1):290--297.

[Farrell and Kieronska, 1996] Farrell, C. A. and Kieronska, D. H. (1996). Formal specification of parallel SIMD execution. Theo. Comp. Science, 169(1):39--65.

[Ferrante et al., 1987] Ferrante, J., Ottenstein, K. J., and Warren, J. D. (1987). The program dependence graph and its use in optimization. TOPLAS, 9(3):319--349.

[Flynn, 1972] Flynn, M. J. (1972). Some computer organizations and their effectiveness. IEEE Transactions on Computers, 100(9):948--960.

[Garland and Kirk, 2010] Garland, M. and Kirk, D. B. (2010). Understanding throughput-oriented architectures. Commun. ACM, 53:58--66.

[Gaster, 2014] Gaster, B. (2014). An execution model for OpenCL 2.0. Technical report 2014-02, Computer Sciences.

[Gaster et al., 2012] Gaster, B., Howes, L., Kaeli, D. R., Mistry, P., and Schaa, D. (2012). Heterogeneous Computing with OpenCL: Revised OpenCL 1. Newnes.

[Han and Abdelrahman, 2013] Han, T. D. and Abdelrahman, T. S. (2013). Reducing divergence in GPGPU programs with loop merging. In GPGPU, pages 12--23. ACM.

[Hardware-INFOS, 2017] Hardware-INFOS (2017). Grafikkarten - Nvidia. http://www.hardware-infos.com/grafikkarten_nvidia.php. Accessed: 2017-03-02.

[Hoogvorst et al., 1991] Hoogvorst, P., Keryell, R., Paris, N., and Matherat, P. (1991). POMP or how to design a massively parallel machine with small developments. In PARLE, pages 83--100. Springer.

[Howes and Munshi, 2015] Howes, L. and Munshi, A. (2015). The OpenCL specification.

[Intel, 2016] Intel (2016). ISPC documentation. https://ispc.github.io/ispc.html. Accessed: 2017-02-09.

[Jones, 2014] Jones, S. (2014). Introduction to dynamic parallelism (invited talk). In GPU Technology Conference, pages 1--33. NVIDIA.

[Khorasani et al., 2015] Khorasani, F., Gupta, R., and Bhuyan, L. N. (2015). Efficient warp execution in presence of divergence with collaborative context collection. In Micro, pages 204--215. ACM.

[Klöckner et al., 2012] Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., and Fasih, A. (2012). PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Computing, 38(3):157--174.

[Knuth et al., 1977] Knuth, D. E., Jr., J. H. M., and Pratt, V. R. (1977). Fast pattern matching in strings. Journal of Computing, 6(2):323--350.

[Kushner, 2013] Kushner, D. (2013). The real story of Stuxnet. IEEE Spectrum, 3(50):48--53.

[Lincoln, 1978] Lincoln, N. (1978). A safari through the Control Data STAR-100 with gun and camera. In Proc. AFIPS NCC, volume 47.

[MasPar, 1992] MasPar (1992). MasPar Programming Language (ANSI C compatible MPL) Reference Manual.

[Merrill and Grimshaw, 2011] Merrill, D. and Grimshaw, A. (2011). High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(2):245--272.

[Mittal and Vetter, 2015] Mittal, S. and Vetter, J. S. (2015). A survey of CPU-GPU heterogeneous computing techniques. ACM Computing Surveys (CSUR), 47(4):69.
[Moreira et al., 2017] Moreira, R., Collange, S., and Pereira, F. (2017). Function call re-vectorization. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).

[Moreira and Tymburibá, 2016] Moreira, R. E. and Tymburibá, M. (2016). Riprop-deducer. http://cuda.dcc.ufmg.br/rip-rop-deducer/. Accessed: 2017-02-07.

[Moreira et al., 2015] Moreira, R. E., Tymburibá, M., and Quintão Pereira, F. M. (2015). Inferência estática da frequência máxima de instruções de retorno para detecção de ataques ROP. In Proceedings of the 2015 Brazilian Symposium on Information and Computational Systems Security, pages 215. SBC.

[Moreira et al., 2016] Moreira, R. E. A., Collange, S., and Quintão, F. M. (2016). Definição semântica de blocos everywhere para programação SIMD.

[Moya et al., 2005] Moya, V., Gonzalez, C., Roca, J., Fernandez, A., and Espasa, R. (2005). Shader performance analysis on a modern GPU architecture. In Microarchitecture, 2005. MICRO-38. Proceedings. 38th Annual IEEE/ACM International Symposium on, pages 10--pp. IEEE.

[Munshi et al., 2011] Munshi, A., Gaster, B., Mattson, T. G., Fung, J., and Ginsburg, D. (2011). OpenCL Programming Guide. Addison-Wesley, 1st edition.

[Nickolls and Dally, 2010] Nickolls, J. and Dally, W. J. (2010). The GPU computing era. IEEE Micro, 30:56--69.

[Novak, 2015] Novak, R. (2015). Loop optimization for divergence reduction on GPUs with SIMT architecture. IEEE Transactions on Parallel and Distributed Systems, 26(6):1633--1642.

[NVIDIA, 2016] NVIDIA (2016). CUB CUDA library. http://nvlabs.github.io/cub/group___warp_module.html. Accessed: 2017-02-02.

[NVIDIA, 2017] NVIDIA (2017). CUDA parallel programming and computing platform. http://www.nvidia.com/object/cuda_home_new.html. Accessed: 2017-02-22.

[Owens et al., 2007] Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., and Purcell, T. J. (2007). A survey of general-purpose computation on graphics hardware. In Computer Graphics Forum, volume 26, pages 80--113. Wiley Online Library.

[Pereira, 2014] Pereira, F. M. Q. (2014). Divergence analysis. http://homepages.dcc.ufmg.br/~fernando/classes/dcc888/ementa/slides/DivergenceAnalysis.pdf. Accessed: 2017-02-20.

[Pharr and Mark, 2012] Pharr, M. and Mark, W. R. (2012). ISPC: A SPMD compiler for high-performance CPU programming. In InPar, pages 1--13. IEEE.

[Polsson, 2016] Polsson, K. (2016). Chronology of IBM personal computers. http://pctimeline.info/ibmpc/ibm1987.htm. Accessed: 2017-02-19.

[Rose and Steele, 1987] Rose, J. and Steele, G. (1987). C*: An extended C language for data parallel programming. In ICS.

[Sampaio et al., 2013] Sampaio, D., de Souza, R. M., Collange, S., and Pereira, F. M. Q. (2013). Divergence analysis. ACM Trans. Program. Lang. Syst., 35(4):13.

[Sampaio et al., 2012] Sampaio, D., Martins, R., Collange, S., and Pereira, F. M. Q. (2012). Divergence analysis with affine constraints. In SBAC-PAD, pages 67--74. IEEE.

[Sanders and Kandrot, 2010] Sanders, J. and Kandrot, E. (2010). CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 1st edition.

[Schaub et al., 2015] Schaub, T., Moll, S., Karrenberg, R., and Hack, S. (2015). The impact of the SIMD width on control-flow and memory divergence. TACO, 11(4):54:1--54:25.

[Sodani et al., 2016] Sodani, A., Gramunt, R., Corbal, J., Kim, H.-S., Vinod, K., Chinthamani, S., Hutsell, S., Agarwal, R., and Liu, Y.-C. (2016). Knights Landing: Second-generation Intel Xeon Phi product.
IEEE Micro, 36(2):34--46.

[Spink, 2016] Spink, T. (2016). CGO 2016 - awards. http://cgo.org/cgo2016/awards/. Accessed: 2017-02-07.

[Stone et al., 2010] Stone, J. E., Gohara, D., and Shi, G. (2010). OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66--73.

[Tymburibá et al., 2015] Tymburibá, M., Moreira, R. E., and Quintão Pereira, F. M. (2015). Riprop: A dynamic detector of ROP attacks. In Proceedings of the 2015 Brazilian Congress on Software: Theory and Practice, pages 215. SBC.

[Tymburibá et al., 2016] Tymburibá, M., Moreira, R. E., and Quintão Pereira, F. M. (2016). Inference of peak density of indirect branches to detect ROP attacks. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, pages 150--159. ACM.

[Wang et al., 2015] Wang, J., Rubin, N., Sidelnik, A., and Yalamanchili, S. (2015). Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs. In ISCA, pages 528--540. ACM.

[Wang et al., 2016] Wang, J., Rubin, N., Sidelnik, A., and Yalamanchili, S. (2016). LaPerm: Locality aware scheduler for dynamic parallelism on GPUs. In ISCA.

[Wang and Yalamanchili, 2014] Wang, J. and Yalamanchili, S. (2014). Characterization and analysis of dynamic parallelism in unstructured GPU applications. In IISWC, pages 51--60. IEEE.

[Watson, 1972] Watson, W. (1972). The TI ASC: a highly modular and flexible super computer architecture. In Proceedings of the December 5-7, 1972, Fall Joint Computer Conference, part I, pages 221--228. ACM.

[Wikipedia, 2017a] Wikipedia (2017a). Advanced vector extensions. https://en.wikipedia.org/wiki/Advanced_Vector_Extensions. Accessed: 2017-02-22.

[Wikipedia, 2017b] Wikipedia (2017b). Streaming SIMD extensions. https://en.wikipedia.org/wiki/Advanced_Vector_Extensions. Accessed: 2017-02-22.

[Wu et al., 2016] Wu, J., Belevich, A., Bendersky, E., Heffernan, M., Leary, C., Pienaar, J., Roune, B., Springer, R., Weng, X., and Hundt, R. (2016). Gpucc: An open-source GPGPU compiler. In CGO, pages 105--116. ACM.

[Yang and Zhou, 2014] Yang, Y. and Zhou, H. (2014). CUDA-NP: realizing nested thread-level parallelism in GPGPU applications. In PPoPP, pages 93--106. ACM.

[Zhang et al., 2011] Zhang, E. Z., Jiang, Y., Guo, Z., Tian, K., and Shen, X. (2011). On-the-fly elimination of dynamic irregularities for GPU computing. In ASPLOS, pages 369--380. ACM.

Appendix A

ISPC-CREV Code

Here we provide some code we deem relevant to the reader. The implementations in this section are ISPC-CREV code, and should thus be compiled with our extended compiler, available online at http://cuda.dcc.ufmg.br/swan.

A.1 Bitonic Sort

Bitonic Sort is a well-known parallel sorting algorithm. We implement Bitonic Sort in our extended ISPC-CREV language to improve the performance of the Merge- and Quick-Sort algorithms, both implemented using the crev keyword. The following bitonic_sort kernel is SIMD, i.e., all threads within a warp must be active upon a call to this function. We use crev to guarantee that this property holds, validating the execution of the program.
// Masks of the bitonic network for eight lanes. For each stage, the first
// row is the comparison mask (a non-negative entry equal to the lane index
// marks the lanes that keep the minimum), and the second row gives each
// lane's shuffle partner.
static int bitonic_mask[6][2][8] = {
    { {0, -1, -2,  3,  4, -5, -6,  7}, {1, 0, 3, 2, 5, 4, 7, 6} },
    { {0,  1, -2, -3, -4, -5,  6,  7}, {2, 3, 0, 1, 6, 7, 4, 5} },
    { {0, -1,  2, -3, -4,  5, -6,  7}, {1, 0, 3, 2, 5, 4, 7, 6} },
    { {0,  1,  2,  3, -4, -5, -6, -7}, {4, 5, 6, 7, 0, 1, 2, 3} },
    { {0,  1, -2, -3,  4,  5, -6, -7}, {2, 3, 0, 1, 6, 7, 4, 5} },
    { {0, -1,  2, -3,  4, -5,  6, -7}, {1, 0, 3, 2, 5, 4, 7, 6} }
};

void set_order(int * uniform array, uniform int len) {
    if (len == 2 && array[0] > array[1]) {
        uniform int aux = array[1];
        array[1] = array[0];
        array[0] = aux;
    } else if (len > 2)
        print("ERROR!\n");
}

void bitonic_sort(int * uniform array, uniform int len) {
    if (len < 2) return;
    if (len == 2 && array[0] > array[1]) {
        uniform int aux = array[1];
        array[1] = array[0];
        array[0] = aux;
    }
    if (len > programCount) {
        print("ERROR!\n");
        return;
    }

    // Fetching values: lanes beyond len hold a large sentinel, so they
    // sort to the end of the sequence.
    varying int val = (1 << 30), tmp = (1 << 30);
    if (programIndex < len)
        val = array[programIndex];

    // Bitonic sorting masks
    varying int cmp1 = bitonic_mask[0][0][programIndex];
    varying int cmp2 = bitonic_mask[1][0][programIndex];
    varying int cmp3 = bitonic_mask[2][0][programIndex];
    varying int val1 = bitonic_mask[0][1][programIndex];
    varying int val2 = bitonic_mask[1][1][programIndex];
    varying int val3 = bitonic_mask[2][1][programIndex];
    varying int cmp4 = bitonic_mask[3][0][programIndex];
    varying int cmp5 = bitonic_mask[4][0][programIndex];
    varying int cmp6 = bitonic_mask[5][0][programIndex];
    varying int val4 = bitonic_mask[3][1][programIndex];
    varying int val5 = bitonic_mask[4][1][programIndex];
    varying int val6 = bitonic_mask[5][1][programIndex];

    // Generating a bitonic sequence
    // Step 1: distance 1
    tmp = shuffle(val, val1);
    if (programIndex == cmp1) val = (val > tmp) ? tmp : val;
    else                      val = (val < tmp) ? tmp : val;
    // Step 2.1: distance 2
    tmp = shuffle(val, val2);
    if (programIndex == cmp2) val = (val > tmp) ? tmp : val;
    else                      val = (val < tmp) ? tmp : val;
    // Step 2.2: distance 1
    tmp = shuffle(val, val3);
    if (programIndex == cmp3) val = (val > tmp) ? tmp : val;
    else                      val = (val < tmp) ? tmp : val;

    // Sorting the bitonic sequence
    // Step 3.1: distance 4
    tmp = shuffle(val, val4);
    if (programIndex == cmp4) val = (val > tmp) ? tmp : val;
    else                      val = (val < tmp) ? tmp : val;
    // Step 3.2: distance 2
    tmp = shuffle(val, val5);
    if (programIndex == cmp5) val = (val > tmp) ? tmp : val;
    else                      val = (val < tmp) ? tmp : val;
    // Step 3.3: distance 1
    tmp = shuffle(val, val6);
    if (programIndex == cmp6) val = (val > tmp) ? tmp : val;
    else                      val = (val < tmp) ? tmp : val;

    // Storing the sorted array
    if (programIndex < len)
        array[programIndex] = val;
}
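As a closing remark on usage, not part of the original listing: since bitonic_sort assumes a full gang, callers in divergent contexts are expected to reach it through crev. The call site below is a hypothetical illustration of that pattern; sort_small_block is a name of our choosing.

    // Hedged, illustrative call site: the surrounding context may be
    // divergent, but the crev call re-activates every lane inside
    // bitonic_sort, meeting the all-lanes-active requirement stated above.
    void sort_small_block(int * uniform array, uniform int len) {
        if (programIndex == 0)              // arbitrary divergent filter
            crev bitonic_sort(array, len);
    }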