Hastega: A Challenge for GPGPU on Elixir (with the intent to apply it to machine learning)
Do you want to learn and use machine learning in Elixir / Phoenix / Nerves apps? Of course you can, but the machine learning part is written in Python and invoked from Elixir via ErlPort... You can also implement some machine learning algorithms in Elixir itself. However, they don't have enough computational power to run real applications. We wish Elixir had more computational power!
This may change with Hastega! Hastega transforms a series of pipelined Enum.map calls into optimized native code that uses GPUs and/or CPUs with SIMD instructions.
What are GPUs and SIMD?
GPUs are the main components of graphics cards:
Fig. 1 shows graphics cards of slightly older models; the GPUs are inside them.
The most popular GPU architecture is SIMD (Single Instruction, Multiple Data), one of the parallel architectures that has been commercially successful. An attractive feature of SIMD is its simplicity: each core is very small, so cores can be integrated at very large scale, and the number of cores can easily grow very large.
Hastega presentation @ Lonestar ElixirConf 2019
We gave a presentation on Hastega at Lonestar ElixirConf 2019 in Austin:
What is Hastega?
In detail, Hastega will:
- Detect code fragments that form a series of pipelined Enum.map calls (Fig. 3) inside a defhastega with a do block (Fig. 4).
- Extract them from the original code into a NIF function (Fig. 5) that drives GPUs and/or SIMD instructions.
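Figs. 3-5 are not reproduced here, but the pattern Hastega detects can be illustrated with a short plain-Elixir sketch (the function name and body are invented for illustration):

```elixir
# The kind of code fragment Hastega detects: a series of pipelined Enum.map.
# Hastega would extract such a pipeline into a NIF driving GPUs/SIMD.
defmodule PipelineExample do
  def brighten_and_clamp(pixels) do
    pixels
    |> Enum.map(&(&1 + 16))
    |> Enum.map(&min(&1, 255))
  end
end
```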
That's all! Hastega will be really convenient! But can Hastega actually achieve this?
History of Prototypes of Hastega
Phase 1: Establish Realization Methods and Evaluate Performance
We've implemented several prototypes of Hastega to prove that it can be realized.
Firstly, we implemented a prototype of Hastega: a benchmark suite that calculates logistic maps with integers, driving GPUs or multi-core CPUs with SIMD instructions through NIFs using OpenCL or Rust's rayon library (which can generate SIMD instructions), in order to evaluate its performance potential. Its code is available at https://github.com/zeam-vm/logistic_map (branch: range).
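For reference, the integer logistic map kernel can be sketched in plain Elixir as follows (the constants and the 10-step iteration follow the benchmark's description, but the exact values are assumptions):

```elixir
# Integer logistic map sketch: one step is x' = mu * x * (x + 1) mod p.
# The benchmark applies 10 steps to every element of a large list.
defmodule LogisticMapSketch do
  @mu 22
  @p 6_700_417

  def step(x), do: rem(@mu * x * (x + 1), @p)

  # Ten mapped steps over a list; the prototype replaces this loop with
  # native code driving GPUs or SIMD instructions.
  def run(list) do
    Enum.reduce(1..10, list, fn _, acc -> Enum.map(acc, &step/1) end)
  end
end
```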
Its performance potential is promising:
Fig. 6 compares execution time in seconds between pure Elixir with Flow (left blue bar) and the Hastega prototype (right red bar). Flow.map in Elixir_recursive calls a function that computes a logistic map 10 times in recursive style, while in Elixir_inline it computes the 10 steps with inlining. We found that the prototype is 4-8 times faster than pure Elixir. We also found that multi-core CPUs with SIMD instructions run this benchmark suite faster than GPUs, because executing on GPUs requires data transfer between CPUs and GPUs.
Fig. 7 compares execution time in seconds between Python driving GPUs via CuPy (left blue bar) and the Hastega prototype (right red bar). The prototype is more than 3 times faster than Python.
Fig. 8 compares execution time in seconds between native Rust code that drives GPUs (left blue bar) and the Hastega prototype (right red bar). The prototype is only 1.5 times slower than the native code. To reduce this overhead, we must compile the whole benchmark suite into native code, avoiding data conversion between lists and arrays.
Thus, these performance results suggest that Hastega will have enough computational power to support heavy Elixir / Phoenix / Nerves apps such as machine learning.
Phase 2: Realize Linear Regression and Evaluate Its Performance
Next, we implemented another prototype of Hastega: a benchmark suite that calculates a linear regression using gradient descent, driving GPUs or multi-core CPUs with SIMD instructions, similar to the logistic map benchmark suite. Its code is available at https://github.com/zeam-vm/linear_regressor_cli.
We found that the prototype code driving a single-core CPU with SIMD instructions is 6-15 times faster than the original Elixir code. However, the versions that drive GPUs or multi-core CPUs with SIMD instructions are slower than the single-core version, because data transfer, starting Rust processes, and Enum.reduce are performance bottlenecks.
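To make the shape of the computation concrete, here is a minimal plain-Elixir sketch of one gradient-descent step for y ≈ a * x + b (the names, structure, and learning rate are illustrative, not the benchmark's actual API):

```elixir
# One gradient-descent step for simple linear regression y ≈ a * x + b.
# Plain-Elixir sketch; the real benchmark offloads these loops to NIFs.
defmodule LinRegSketch do
  def step({a, b}, xs, ys, rate) do
    n = length(xs)

    # Predictions and errors via pipelined Enum.map, the pattern Hastega targets.
    errors =
      xs
      |> Enum.map(fn x -> a * x + b end)
      |> Enum.zip(ys)
      |> Enum.map(fn {pred, y} -> pred - y end)

    # Gradients are reductions over the errors, the Enum.reduce-style
    # bottleneck mentioned above.
    grad_a = errors |> Enum.zip(xs) |> Enum.map(fn {e, x} -> e * x end) |> Enum.sum()
    grad_b = Enum.sum(errors)

    {a - rate * grad_a / n, b - rate * grad_b / n}
  end
end
```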
Phase 3: Realize a Two-layered Neural Network and Evaluate Its Performance (work in progress)
Next, we started implementing another prototype of Hastega: a benchmark suite that trains and runs a two-layered neural network. This work is still in progress. Our intermediate results show that converting only the inner-product calculation into native code that drives GPUs actually makes performance worse, because the amount of data flowing through is too small and the data-transfer time is too long to pay for itself.
We also simulated, while varying the number of cores, how much performance could be expected from optimizing Enum.map with Hastega at realistic data-flow sizes for linear regression. The results show that with a few cores, execution time improves as the amount of data flow grows; with many cores, however, execution time does not shorten, regardless of the amount of data flow.
Summary of Phases 1-3
Let us summarize these performance results:
- Optimization of Enum.map by the Hastega prototypes achieves quite good performance!
- Enum.reduce is a performance bottleneck. To reduce its execution time, we should improve communication and synchronization between cores by reviewing the communication method at both the instruction and algorithm levels.
- Data transfer is another performance bottleneck, so we need data-transfer optimization and load balancing across cores and GPUs, analyzing and/or profiling whether the data volume is large enough to pay for itself.
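One way to mitigate the Enum.reduce bottleneck can be sketched in plain Elixir: reduce each chunk locally in parallel and only synchronize once to combine the per-chunk partial results (an illustrative approach, not Hastega's actual implementation):

```elixir
# Parallel sum: reduce each chunk in its own Task, then combine partials.
# This limits cross-core synchronization to one final combine step.
defmodule ParReduceSketch do
  def sum(list) do
    chunk = max(div(length(list), System.schedulers_online()), 1)

    list
    |> Enum.chunk_every(chunk)
    |> Enum.map(fn c -> Task.async(fn -> Enum.sum(c) end) end)
    |> Enum.map(&Task.await/1)
    |> Enum.sum()
  end
end
```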
Phase 4: How should we code in LLVM-IR?
Based on the above results, we implemented further prototypes to prove that our design strategy is implementable.
Firstly, we decided to adopt the LLVM compiler infrastructure to generate native code. We then had to learn how to write LLVM-IR, and we want to generate it type-safely. We also needed to know how to handle bignums in NIFs. The source code of this prototype is available at https://github.com/zeam-vm/nif_llvm.
Phase 5: How do we generate and execute native code in LLVM-IR?
Next, we had to establish how to generate and execute native code via LLVM-IR. The source code of this prototype is available at https://github.com/zeam-vm/nif_llvm_2. It can generate and execute LLVM-IR code through an LLVM-Rust binding invoked by Rustler, but for now it can only execute it as an interpreter.
Phase 6: SumMag meta-programming library
Next, we decided to adopt Elixir's meta-programming mechanism using 'quote' and 'unquote'. We also had to define the syntax of Hastega.
So, we've been developing the SumMag meta-programming library. SumMag extracts a series of pipelined Enum.map calls into another function using Elixir's meta-programming mechanism. A sample of Hastega code follows:
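The sample itself is not reproduced here; the sketch below shows the shape described in the text. The stand-in macros let it compile without the real library, and the function name and body are invented (with real Hastega, defhastega/hastegastub would rewrite the body into NIF-backed code):

```elixir
# Stand-in macros so this sketch compiles without the real Hastega library:
# the real defhastega/hastegastub generate NIF-backed replacements.
defmodule HastegaSketch do
  defmacro defhastega(do: block), do: block
  defmacro hastegastub, do: nil
end

defmodule Sample do
  import HastegaSketch

  defhastega do
    # A series of pipelined Enum.map: the fragment Hastega extracts into a NIF.
    def twice_plus_one(list) do
      list
      |> Enum.map(&(&1 * 2))
      |> Enum.map(&(&1 + 1))
    end

    hastegastub()
  end
end
```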
'hastegastub' is a macro that generates and transforms all of the functions within the do block of defhastega.
SumMag transforms the above code into:
The transformation extracts a series of pipelined Enum.map calls into a function and optimizes it using map-map fusion: several invocations of Enum.map are fused into a single invocation whose function composes all the functions from the original pipeline.
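Map-map fusion can be demonstrated directly in plain Elixir; the two forms below are equivalent, and SumMag performs this rewrite at the AST level (function names are illustrative):

```elixir
defmodule FusionSketch do
  # Before fusion: two passes over the list.
  def unfused(list) do
    list
    |> Enum.map(&(&1 * 2))
    |> Enum.map(&(&1 + 1))
  end

  # After map-map fusion: one pass with the composed function.
  def fused(list), do: Enum.map(list, fn x -> x * 2 + 1 end)
end
```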
SumMag also returns the AST of the body of the function func_nif_1.
SumMag is a work in progress. Its source code is available at https://github.com/zeam-vm/sum_mag/.
Phase 7: Magicite Elixir-LLVM binding (to be implemented)
We'll implement Magicite, an Elixir-LLVM binding via Rustler. It will be able to generate and execute native code, not as an interpreter but as a compiler, controlled by Elixir apps.
Phase 8: 1st practical version of Hastega
Hastega will use SumMag and Magicite to compile Elixir code containing pipelined Enum.map into native code that uses SIMD instructions and/or GPUs.
Hastega will fork a process for each core or GPU in advance, and will execute Enum.map on multiple cores or GPUs, with communication and synchronization between the corresponding processes.
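The plan above can be approximated with ordinary Erlang-VM processes; this sketch spawns a Task per chunk rather than pre-forking per-core processes, so it only illustrates the idea:

```elixir
# Split the list into one chunk per scheduler (core), map each chunk in a
# separate process, and reassemble the results in order.
defmodule PerCoreMapSketch do
  def map(list, fun) do
    cores = System.schedulers_online()
    chunk = max(div(length(list) + cores - 1, cores), 1)

    list
    |> Enum.chunk_every(chunk)
    |> Enum.map(fn c -> Task.async(fn -> Enum.map(c, fun) end) end)
    |> Enum.flat_map(&Task.await/1)
  end
end
```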
We plan to release the first practical version of Hastega before summer 2019. In this version, Hastega will compile pipelined Enum.map into native code running on a single CPU core using x86_64 SIMD instructions.
We also plan to evaluate the performance of the linear regression and neural network benchmark suites using it.
Future Roadmap of Hastega
Firstly, Hastega will support x86_64 CPUs using SIMD instructions, but only on a single CPU core.
Next, it will support GPUs from AMD and NVIDIA that support OpenCL, implemented by message passing to a process that monopolizes communication with a GPU.
Supporting multi-core processing in Hastega may be somewhat difficult to implement on the current Erlang VM, because we observed in our prototypes, especially linear regression and the neural network, that starting and synchronizing new processes is inefficient. We may instead use Erlang processes from NIFs.
We're also interested in supporting Metal and CUDA, which drive GPUs as OpenCL does, to achieve high efficiency, and in load balancing between CPUs and GPUs to make programming with Hastega easier.
Discussion with José
José proposed that we use Pnum instead of defhastega and hastegastub:
Pnum means parallel Enum. As José said in his keynote at Lonestar ElixirConf 2019, when we go from eager to lazy, we replace Enum with Stream. What if, when we go from Enum to parallel, we use Pnum?
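A Pnum.map along these lines might be sketched as follows (this is my own minimal sketch using Task.async_stream, not José's actual code):

```elixir
defmodule Pnum do
  # Parallel map: run `fun` on each element in its own Task,
  # preserving the input order in the results.
  def map(enumerable, fun) do
    enumerable
    |> Task.async_stream(fun, ordered: true)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```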
José also wrote an example implementation:
It’s simple and great!
I thought José's opinion was worth listening to, but I had another opinion:
- Our final goal is to realize not only Hastega but also a new Elixir language system, an alternative Elixir on the Erlang VM.
- From that point of view, we think that automatically optimizing ordinary Enum.map is a better approach than providing a separately named library such as Pnum.
- We also think the optimization of Enum.map should prefer automatic parallelization, except when it would cause problems with execution order and/or side effects, or when observation code is inserted into the code.
- It works like quantum theory: observation prevents optimization.
José agreed: if the long-term goal is to optimize a certain subset of Elixir code, then it makes sense to automatically optimize Enum.map/2.
José also proposed introspection and intercession functions on the AST:
- We can retrieve the compiled Elixir AST for a given module with the code in Fig. 13; this is introspection.
- After changing the definition, we can recompile it with the code in Fig. 14; this is intercession. (José warned that this is NOT a public API, though it is mostly stable.)
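Figs. 13-14 are not shown here; the introspection side is commonly done with the debug_info chunk pattern sketched below (whether this matches the figures exactly is an assumption; the non-public :elixir_erl.compile/1 would be the recompilation side):

```elixir
# Introspection sketch: read the Elixir AST of a compiled module from the
# :debug_info chunk of its .beam file (requires a module compiled to disk
# with debug info, as Elixir's own modules are).
defmodule Introspect do
  def elixir_ast(module) do
    {:ok, {^module, [debug_info: {:debug_info_v1, backend, data}]}} =
      :beam_lib.chunks(:code.which(module), [:debug_info])

    # backend is :elixir_erl for Elixir modules; the returned map contains
    # :definitions, :attributes, and so on. Intercession would modify this
    # map and feed it back through :elixir_erl.compile/1 (not a public API).
    {:ok, map} = backend.debug_info(:elixir_v1, module, data, [])
    map
  end
end
```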
How great and smart this is!
I immediately decided to implement such a simple and powerful intercession function, like :elixir_erl.compile/1, in Magicite or Hastega.