Press "Enter" to skip to content

Another Day on Julia

Feelings During Writing

There was a time when I thought this language was about to get out of my control. How could I write such code, and my dear, how could it execute that code in such a way!

Calculating 9,000,000,000+ (9 billion+) Cosine Distances on 20 Processes Simultaneously

JLD Performance on Different String Types

It seems Julia is slow when dealing with certain kinds of strings, and so is JLD serialization. I compared the following:

# a little test
using JLD
a = rand(10000000) # 10000000-element Array{Float64,1}
b = [string(i) for i in a] # 10000000-element Array{Any,1}
@time save("./test_float.jld","_",a) # 1.100200 seconds (1.62 M allocations: 72.982 MB)
@time save("./test_string.jld","_",b) # terminated when running 5+ minutes

Suspecting that the element type of the array was affecting the efficiency, I continued with the following:

# a little test
using JLD
a = rand(10000000) # 10000000-element Array{Float64,1}
b = [string(i) for i in a] # 10000000-element Array{Any,1}
c = Array{AbstractString,1}(b) # 10000000-element Array{AbstractString,1}
d = Array{ASCIIString,1}(b) # 10000000-element Array{ASCIIString,1}
e = Array{UTF8String,1}(b) # 10000000-element Array{UTF8String,1}
@time save("./test_abstr.jld","_",c) # terminated when running 5+ minutes
@time save("./test_ascstr.jld","_",d) # 2.094302 seconds (10.06 M allocations: 231.583 MB)
@time save("./test_utfstr.jld","_",e) # 3.230833 seconds (10.05 M allocations: 231.363 MB, 44.33% gc time)

That’s quite a relief! Using a concrete element type for the array apparently accelerates the serialization a great deal. In some cases JLD seemed to fall into an infinite loop (I don’t know whether there’s a bug).
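
For the record, here is how I would give such an array a concrete element type before saving; this is just a sketch, and the variable names below are made up for illustration:

# give the array a concrete element type before calling save
eltype(b) # Any
str_arr = convert(Array{ASCIIString,1}, b) # concrete element type, saves quickly
# or build it with a concrete element type from the start
str_arr2 = ASCIIString[string(i) for i in a]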

JLD: Are Multi-Dimensional Arrays Faster?

Seemingly there is some doubt about whether “better” structured data should be processed faster than data that is not. So I constructed a “malformed” data structure and tried saving it with JLD, compared with the same data in a better form.

# testing saving speeds on different data structures
using JLD
a=Array{Array{Int64,1},1}() # 0-element Array{Array{Int64,1},1}
# construct arrays of different lengths, "malformed"
for i=1:10000000
    tmp = Array{Int64,1}()
    for j=1:rand(1:100)
        push!(tmp,j)
    end
    push!(a,tmp)
end
@time save("mal.jld","_",a) # terminated when running 5+ minutes
b=zeros(Int64,(10000000,100)) # 10000000x100 Array{Int64,2}
# construct a 2-dimensional array, holding more entries (zero padding) than the "malformed" arrays
for i=1:10000000
    for j=1:rand(1:100)
        b[i,j]=j
    end
end
@time save("wel.jld","_",b) # 7.587946 seconds (66.61 k allocations: 3.197 MB, 0.10% gc time)

Even though the multi-dimensional array looks more wasteful than the “malformed” array, it can be processed much more efficiently. Giving Julia well-structured, pre-defined data should increase the general speed a lot, I think.

Matrices in Different Ways

There are differences between an array and a matrix in Julia, much as in some other languages. Perhaps because I carried over quite a few bad habits from writing Python code, I found my Julia code quite slow at the beginning. The JLD examples showed that fixed data types help Julia run faster, and fixed data structures help as well.

In Python the types are quite vague, and no effort is spent on deciding which data types you are going to declare, but in Julia they can affect performance. This is the first thing I realized I had to change in my coding style when switching to Julia. Also, some operations in Julia can only be done on multi-dimensional arrays (matrices).

# operations on matrices
a = spzeros(Int64,1000000000,1000000000) # a sparse matrix of Int64
b = rand(Float64,(10,10)) # a 10x10 matrix
# 10x10 Array{Float64,2}:
# 0.854925  0.457697   0.154232   0.185742  …  0.933921  0.918253   0.989606 
# 0.641039  0.707965   0.593308   0.541274     0.64299   0.135775   0.0609116
# 0.924463  0.397864   0.911429   0.385812     0.910533  0.0407729  0.129485 
# 0.623274  0.158428   0.576261   0.268528     0.730114  0.247039   0.96286  
# 0.884139  0.79752    0.978052   0.227932     0.228846  0.944905   0.483374 
# 0.780954  0.344025   0.750445   0.514296  …  0.557673  0.201271   0.328907 
# 0.557319  0.292272   0.301447   0.956826     0.953306  0.560146   0.275325 
# 0.646322  0.878229   0.83556    0.929821     0.581251  0.279477   0.689003 
# 0.796343  0.0559756  0.0558835  0.329716     0.635937  0.851748   0.0581257
# 0.825582  0.141715   0.115811   0.990876     0.564139  0.437218   0.974525
c = b[b[:,1].>0.8,:] # find rows in b whose 1st-column value is larger than 0.8
# 4x10 Array{Float64,2}:
# 0.854925  0.457697   0.154232   0.185742  …  0.933921  0.918253   0.989606 
# 0.924463  0.397864   0.911429   0.385812     0.910533  0.0407729  0.129485 
# 0.884139  0.79752    0.978052   0.227932     0.228846  0.944905   0.483374 
# 0.825582  0.141715   0.115811   0.990876     0.564139  0.437218   0.974525
d = sortrows(c,by=x->x[3],rev=false) # sort rows by the values of the 3rd column, ascending
# 4x10 Array{Float64,2}:
# 0.924463  0.397864   0.911429   0.385812     0.910533  0.0407729  0.129485 
# 0.884139  0.79752    0.978052   0.227932     0.228846  0.944905   0.483374 
# 0.854925  0.457697   0.154232   0.185742  …  0.933921  0.918253   0.989606 
# 0.825582  0.141715   0.115811   0.990876     0.564139  0.437218   0.974525
e = d[2,:][:] # take out 2nd row and transform it to 1d array
# 10-element Array{Float64,1}:
# 0.884139  0.79752    0.978052   0.227932  ...  0.228846  0.944905   0.483374 
emax, eidx = findmax(e) # find the maximum in e and its index
# you get (0.978052,3), it's a tuple

There is more interesting syntax, some of it Matlab-like, some Python-like. In fact, I can sometimes just think of what I want and the code flows out of my mind, and most importantly, it works as I expected, apart from the occasional errors thrown by the type checker.

In general, the data types are not that strict in Julia, but when you care about the performance of your programs, you need to pay much more attention to the types you declare, together with proper data structures; otherwise, the results will disappoint you.
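
Here is a tiny sketch of what this means in practice; the names are made up for illustration, and only the element type of the container differs:

# same values, different element types
xs_any = Any[x for x in rand(1000000)] # Array{Any,1}, every element is boxed
xs_f64 = Float64[x for x in xs_any] # Array{Float64,1}, stored inline
function sumall(v)
    s = 0.0
    for x in v
        s += x
    end
    return s
end
@time sumall(xs_any) # noticeably slower: the loop cannot specialize on the element type
@time sumall(xs_f64) # fast: Julia compiles a tight Float64 loop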

Let’s Parallel!

In the official documentation there is a chapter on Parallel Computing in which you can read about some parallel techniques and examples. I spent quite a few days trying out those examples (what? you said ANOTHER DAY on Julia, now you say SEVERAL DAYS?). What I mean is that, in one day, you can get to understand how you should structure your computing framework in order to optimize the overall performance of parallel computing.

And for a quick start, I tried some of the easy examples. Some of the methods are transparent to users, while some are not.

To enable multiple processes in Julia, there are some easy ways to do it explicitly. In the REPL, simply type addprocs(8) to add available processes to Julia, or pass a parameter when starting Julia, like julia -p 8. Notice that the id of your main process is 1, followed by 2, 3, 4, etc. as you add more.
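
For example, in the REPL (the comments show what I would expect after adding 8 workers):

addprocs(8) # add 8 worker processes
nprocs() # 9: the main process plus the 8 workers
workers() # [2,3,4,5,6,7,8,9]: the ids of the workers
myid() # 1: the REPL itself is the main process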

Distribute Your Missions

As we would usually imagine, one of the common ideas in parallel computing is to distribute different missions to different processes; these missions can vary and may even share no common data. The only thing that connects them, from the user’s point of view, is the return value (of course there are deeper structural things that we don’t see).

Say I want to calculate different kinds of distances between vectors; here is what the code looks like in Julia:

# distribute your missions
addprocs(2) # we need 2 workers, one per kind of distance
# the worker function must be defined on every process, hence @everywhere
@everywhere function worker_dist(arr,distfunc)
    # arr::Array{Array{Float64,1},1}
    ret = zeros(Float64,(length(arr),length(arr))) # matrix to store results
    for i=1:length(arr)
        for j=1:length(arr)
            ret[i,j] = distfunc(arr[i],arr[j]) # calculate pairwise distances
        end
    end
    return ret
end
@everywhere function calc_euc(a,b)
    return sqrt(sum((a.-b).^2)) # euclidean distance
end
@everywhere function calc_cos(a,b)
    return 1-dot(a,b)/(norm(a)*norm(b)) # cosine distance
end
mtx_a = [rand(100) for i=1:100] # 100 random vectors of length 100
# then distribute your missions to different workers
r_mtx_euc = @spawn worker_dist(mtx_a,calc_euc)
r_mtx_cos = @spawn worker_dist(mtx_a,calc_cos)
# fetch the results
mtx_euc = fetch(r_mtx_euc)
mtx_cos = fetch(r_mtx_cos)

By putting @spawn before the function call, that specific mission is distributed to one of the worker processes. The operation only schedules the mission and immediately returns a remote reference to the result; the work then runs asynchronously on the worker, and when you call fetch, the main process waits for it to finish and retrieves the result.
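
A quick toy example of that behaviour (just a sketch; the task itself is meaningless):

r = @spawn sum(rand(1000)) # scheduled on some available worker, returns a remote reference immediately
# r is only a handle to the future result (a RemoteRef on the Julia version I am using)
s = fetch(r) # blocks until the worker finishes, then copies the value back to the main process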

Notice that in this approach the variable mtx_a is not shared by the processes, which means there is a copy in every process. If your matrix is large and you allocate quite a few processes, the program may consume a lot of memory.

This kind of parallel computing is flexible: you can write the worker function as if it were an isolated function and not worry about the details. But when you run into insufficient memory, there are more powerful ways to deal with it.

Distribute Your Arrays

Distributing your matrix into several parts is a way to store and operate on a big matrix: a large array is split so that every process holds a piece of it. Theoretically, if you are dealing with a 1GB matrix across 10 processes, each process accesses a different 0.1GB part of the matrix. Notice that these 0.1GB*10 pieces still have to be copied to the workers in addition to the original, so you may need 2GB of memory in total.

Before we start, you may need to visit DistributedArrays.jl to prepare the necessary library.

Now I will distribute a large matrix and fill each part with the id of the worker that holds it:

# filling the matrix with worker id
N_PROCS = 10
addprocs(10)
# a workaround to avoid repeating `using` in every process
import DistributedArrays; @everywhere using DistributedArrays
function viewblock(d::DArray)
    println(myid()) # prints "1": this part still runs on the main process
    DArray(size(d),procs(d)) do I
        print(myid()) # prints one of 2-11: this closure runs on a worker process
        this_array = d[I...] # the block of d allocated to this worker
        mid = myid() # get the current worker's id
        for i=1:length(this_array)
            this_array[i] = mid # overwrite each element with the id of the worker processing it
        end
        return this_array
    end
end
a = rand(Int64,(100,100))
da = distribute(a) # automatically distribute the array into different processes
va = viewblock(da)
println(va)

Running a test with N_PROCS = 4 on a 4x4 matrix, we would probably get a returned matrix like:

2 2 3 3
2 2 3 3
4 4 5 5
4 4 5 5

from which we see that the matrix is separated into 4 parts distributed to 4 workers and every worker deals with one part.

DistributedArrays is convenient because it is nearly transparent to users: an array can easily be distributed and processed piece by piece. Still, you should pay attention to the data structures of the distributed arrays if you want to build more complicated applications.
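
For example, you can peek at what a single worker actually holds; this is a sketch assuming the 0.4-era DistributedArrays.jl API with localpart and localindexes:

# inspect the piece of a DArray that lives on worker 2
da2 = distribute(rand(8,8))
println(fetch(@spawnat 2 size(localpart(da2)))) # size of the local block on worker 2
println(fetch(@spawnat 2 localindexes(da2))) # the index ranges of da2 stored on worker 2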

@everywhere runs the given code in every process, and there are more interesting wrappers for parallel computing. You can find them in the official documentation.
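
Two of those wrappers, pmap and @parallel, roughly as they appear in the official document (just a quick sketch):

# pmap hands each element to a free worker and collects the results
squares = pmap(x -> x^2, 1:10)
# @parallel splits the loop range across the workers and reduces the partial sums with (+)
nheads = @parallel (+) for i = 1:200000000
    Int(rand(Bool))
end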

Almost Starting

By now I can use Julia to fulfill nearly every mission I have ever done before; it finally meets my requirements. During these days I found some interesting things in Julia, and of course some quite confusing ones. For the present, there should probably be more tutorials and examples online to help the language present itself.

Anyway, it’s time to think different and embrace a new world.
