2D Matrix Multiplication on GPUs with Limited Memory

There was a time when I imagined feeding my big data to the graphics card smoothly, but Mom told me that was impossible, because my data stream is much too big and the memory of a graphics card is limited. Memory is to a graphics card what capacity is to a human brain.

But we have hard drives and CPU memory, which are a lot larger than GPU memory. Using CUDA, one can slice matrices into parts, compute the sub-matrix multiplications on the GPU one at a time, and compose the small results into the final matrix, easing the burden when the matrices are too big to fit into GPU memory at once. That is just what this article introduces.

This should solve the problem of limited GPU memory. So what is this post for? Well, I was too lazy to actually chew through the CUDA/cuBLAS code, most of my actual data-processing work is done in Julia, and I wanted a more convenient approach, one that doesn't require changing my current data format or programming logic.

And I found MXNet and its little friend MXNet.jl.

Getting MXNet Ready: An Easier Approach to Avoid the Troubles

MXNet is a deep learning framework; you can find its features in the official documentation here. There are packages that enable certain GPU operations in Julia (see here for a list), but since I was too lazy to deal with their assorted troubles, I was glad to find that the lower-level API in MXNet suits my taste.

To install the Julia interface of MXNet, simply use the following command:

julia> Pkg.add("MXNet")

This compiles MXNet without CUDA, which is not what I wanted. So see here for how to compile libmxnet.so with CUDA before you type in the command above. In short, set all the relevant flags to “1” in the file “config.mk” before compiling, if you already have CUDA installed; otherwise, see here or here for how to set up CUDA.
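For reference, the CUDA-related flags in “config.mk” look roughly like this (flag names as in the libmxnet build config of that era; the CUDA path is only an example and depends on your installation):

```makefile
# make/config.mk — CUDA-related flags
USE_CUDA = 1
USE_CUDA_PATH = /usr/local/cuda
USE_CUDNN = 1
```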

MXNet.jl: Some Low-Level Interfaces

Since I am working in Julia, it's best to refer to the MXNet.jl documentation for the useful APIs. These are some of the functions we need now:

# Some MXNet.jl Functions
julia> using MXNet
julia> a = rand(Float32, 200, 100); # generate a (200,100) random matrix in Julia
julia> A = mx.copy(a, mx.gpu(0)) # copy the matrix to GPU No.0 and return a reference to it
mx.NDArray(200,100)
julia> b = rand(Float32, 100, 300);
julia> B = mx.copy(b, mx.gpu(0))
julia> C = mx.dot(B, A) # 2D matrix multiplication


I swapped the order of the two arguments in the call to mx.dot(); this is a workaround for a known issue of inconsistent behavior (a bug) in the implementation as of the latest release 189e755 (see here for the issue).

Updates: a fix for the mx.dot() problem has been submitted, but there is no new release yet. You may use the “Pkg.checkout” command to grab the latest package, as mentioned in my previous article; then it would be “mx.dot(A,B)” to calculate the 2D matrix multiplication.

Julia: Calling External Library & Metaprogramming

We have compiled the library “libmxnet.so” with CUDA enabled, so the next step is to call the library's functions from Julia. Julia's “ccall” syntax makes it possible to call C and Fortran code “directly”. Luckily, MXNet.jl has already implemented some interfaces for us, but there are still functions not shown in the documentation, which we will have to “call” from the library ourselves.
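As a tiny illustration of “ccall” itself (this example calls `clock` from the system C library, and has nothing to do with MXNet):

```julia
# call the C function `clock` from the standard C library;
# it takes no arguments and returns an Int32
t = ccall((:clock, "libc"), Int32, ())
```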

But this package has its own type definitions and conversions between C types and Julia types, so it's better to find some higher-level wrapper around “ccall” to free us from the complicated type conversion. I found the following implementation (go here if you can’t see the embedded code):
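In case the embed doesn't render, here is a simplified macro in the spirit of MXNet.jl's `@mxcall` — my own sketch, not the package's actual code:

```julia
# Simplified sketch: wrap a ccall into libmxnet and check the
# integer status code that every MXNet C API function returns.
macro mxcall(func, argtypes, args...)
    quote
        ret = ccall(($(esc(func)), "libmxnet.so"), Cint,
                    $(esc(argtypes)), $(map(esc, args)...))
        ret == 0 || error("MXNet API call failed with code ", ret)
    end
end
```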

That implementation saves me the time of digging into the inner type-conversion mechanism and the “ccall” syntax. This “macro” thing is part of Julia's metaprogramming system, which is kind of like defining your own syntax. Maybe we'll have some fun with this interesting feature in a future article.

So basically, in MXNet.jl, we may use the following syntax to call functions from the library:

# calling libmxnet.so functions
julia> mx.@mxcall(:interfacename, types, values)


It's convenient to wrap things into “mxcall” and let it deal with all the remaining trouble.

The Implementation: Some Details

Following the idea of “piling”, we should first of all split the two matrices into several sub-matrices. The numbers of sub-matrices become parameters, which depend on the memory of the GPU. I've written a function that distributes a workload into even parts (see here), and it can be reused here for splitting the matrices. The function becomes the following (go here if you can’t see the embedded code):
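The embedded gist is not reproduced here; a hypothetical sketch of such an even-splitting helper might look like this (the name `evensplit` is my own):

```julia
# Split the range 1:n into k nearly equal consecutive sub-ranges.
function evensplit(n::Int, k::Int)
    base, rem = divrem(n, k)
    parts = UnitRange{Int}[]
    start = 1
    for i in 1:k
        len = base + (i <= rem ? 1 : 0)  # the first `rem` parts get one extra
        push!(parts, start:(start + len - 1))
        start += len
    end
    return parts
end

evensplit(10, 3)  # [1:4, 5:7, 8:10]
```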

This makes it easier to split the matrix. One thing to note: since Julia stores arrays in column-major order, it is in principle faster to slice columns than to “extract” rows. In the “piling” method, rows are “extracted” from the left matrix and columns are sliced from the right matrix, which means that, to raise the speed, we should “extract” larger sub-matrices from the left matrix than we slice from the right matrix. This can serve as a tuning strategy for the parameters mentioned above.
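A quick way to see the column-major effect for yourself (exact timings will vary by machine; the row “extraction” touches strided memory and is typically slower):

```julia
a = rand(5_000, 5_000)
@time a[:, 1:500];  # column slice: contiguous memory
@time a[1:500, :];  # row "extraction": strided memory
```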

So the general procedure should be:

1. Split the two matrices into sub-matrices
2. Copy the left sub-matrix onto GPU
3. Copy the right sub-matrix onto GPU
4. Calculate sub-matrix multiplication, and copy the result back to CPU, mapping onto specific matrix block
5. Delete the matrix object on GPU to free the GPU memory
6. Repeat 2-3-4-5 until every matrix block is finished

There it is. The following is the code (go here if you can’t see the embedded code):
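In case the embed doesn't render, here is a rough illustration only: a sketch following the six steps above, with the signature of `gmul` inferred from the benchmark calls later in this post (the real code may differ in details):

```julia
using MXNet

# a, b:        input matrices on the CPU
# nrow, ncol:  number of row blocks of `a` / column blocks of `b`
# ctx:         device, e.g. mx.gpu(0)
# T:           element type used on the GPU, e.g. Float32
function gmul(a, b, nrow::Int, ncol::Int, ctx, T::Type)
    m, k = size(a)
    k2, n = size(b)
    k == k2 || error("dimension mismatch")
    # step 1: nearly even 1-D partitions of the row and column ranges
    parts(len, np) = [round(Int, len * (i - 1) / np) + 1 : round(Int, len * i / np) for i in 1:np]
    c = Array(T, m, n)                                       # result matrix on the CPU
    for rr in parts(m, nrow)
        A = mx.copy(convert(Matrix{T}, a[rr, :]), ctx)       # step 2: left block onto GPU
        for cc in parts(n, ncol)
            B = mx.copy(convert(Matrix{T}, b[:, cc]), ctx)   # step 3: right block onto GPU
            C = mx.dot(B, A)                                 # step 4: note the swapped order (the bug above)
            c[rr, cc] = copy(C)                              # ...and map the block back into the result
            # step 5: the NDArrays fall out of scope and are
            # garbage-collected, freeing GPU memory
        end
    end
    return c
end
```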

A Simple Performance Comparison

Now let's run a simple performance comparison against Julia's native multi-threaded matrix multiplication.

In Julia, I generate two 30,347 x 30,347 matrices and calculate their product. The machine running the Julia multi-threaded matrix multiplication has 148GB of CPU memory and dual Xeon X5650 processors, with 16 threads enabled during the calculation.

# Dot Product on CPU
# load a matrix onto the variable a
julia> @time c=a*a
# about 1400 seconds, will update later


Storing the whole result matrix takes up to 7GB of memory. Now I run the function on a machine equipped with a GTX 970 (4GB of GPU memory).

# Dot Product on GPU
# load a matrix onto the variable a
julia> include("gMul.jl");
julia> @time c=gmul(a,a,3,300,mx.gpu(0),Float32); # 1st time
176.086090 seconds (2.24 M allocations: 51.553 GB, 1.18% gc time)
julia> @time c=gmul(a,a,3,300,mx.gpu(0),Float32); # 2nd time
149.566852 seconds (58.00 k allocations: 51.464 GB, 0.54% gc time)
julia> @time c=gmul(a,a,2,900,mx.gpu(0),Float32); # 3rd different parameter
140.871846 seconds (2.27 M allocations: 41.265 GB, 1.23% gc time)


The GPU method can be 9 times faster than the CPU one, and it's “scalable”. Interestingly, with the same parameters, running the same command a second time without closing the REPL gives a better result; I attribute this partly to the already-initialized GPU interface. And by changing the parameters I can get even better performance, as shown in the third command.

Afterword

Just as mentioned at the beginning of this article, the reasons for trying this “odd” method are:

1. I don’t have a GPU with large enough memory
2. I want faster matrix multiplication
3. I want to analyze data using data science friendly language, like Julia, instead of others

And it turns out that the method above works for me. From my analysis, the time for the whole calculation could still be shortened, because nearly half of it is spent on:

1. Copying and converting matrices between CPU and GPU
2. “Extracting” and splicing out sub-matrices
3. Filling corresponding blocks of the result matrix

I believe there is still room for improvement; this is just a basic implementation. Moreover, combined with Julia's features, like shared arrays, the operations could run even faster on multiple GPUs. I may try that out later.