Press "Enter" to skip to content

Bricks on Julia

Writing something I found when coding, like some kind of syntax sugars… though no guarantee for the correction…

Initializing An Array

There are different kinds of array initializations. Notice the differences between “{}” and “()” notations.

# Array Initialization
a = Array{Int64,1}() # 0-element Array{Int64,1}
b = Array(Int64,0) # 0-element Array{Int64,1}
c = Array(Int64,1)
# 1-element Array{Int64,1}:
# 6874584755139077376
d = Array(Int64,2)
# 2-element Array{Int64,1}:
# 140298281045760
# 140298275170800
e = Array(Int64,2,2)
# 2x2 Array{Int64,2}:
# 140298302013152 140298289255248
# 140298289255184 140298302013232
f = Array{Int64,2}(2,2)
# 2x2 Array{Int64,2}:
# 140298246197808 140298268973680
# 140298268973616 140298268947024

All of them work. Those in parentheses specify how they can be initialized.

How Does Sparse Matrix Work?

The official document doesn’t tell much, so I go to the source code here. There’s a type definition SparseMatrixCSC, which is the basic type of a lot of methods related to sparse matrix. The definition is as follows:

# Type Definition
type SparseMatrixCSC{Tv,Ti<:Integer} <: AbstractSparseMatrix{Tv,Ti}
m::Int # Number of rows
n::Int # Number of columns
colptr::Vector{Ti} # Column i is in colptr[i]:(colptr[i+1]-1)
rowval::Vector{Ti} # Row values of nonzeros
nzval::Vector{Tv} # Nonzero values
end

After some tossing about the codes, I figure out what does all the parameters mean in the SparseMatrixCSC.

  • m: control the number of rows in the matrix
  • n: control the number of columns in the matrix
  • colptr: column pointers, control the starting and ending indexes of every column, usually Array of the shape (n+1,)
  • rowval: row values of every non-sparse element, Array, the size is the same as the number of non-sparse elements, controls which row is a specific element in
  • nzval: values of every non-sparse element, Array, the size is the same as the number of non-sparse elements

In fact, in a sparse matrix like the following:

# a sparse matrix
# 2 x 3 5 x x x 7
# x x x 1 4 x x x
# 7 8 4 5 3 x x x

in which the “x” indicates the element is sparse. The parameters are like:

# parameters of the matrix
# m = 3
# n = 8
# colptr = [1, 3, 4, 6, 9, 11, 11, 11, 12]
# rowval = [1, 3, 3, 1, 3, 1, 2, 3, 2, 3, 1]
# nzval = [2, 7, 8, 3, 4, 5, 1, 5, 4, 3, 7]

The colptr may seem a little hard to understand. In fact, if you iterate the array using only one iterator, you get a sequence which is the actual storing order of the matrix [2, 7, 8, 3, 4, 5, 1, 5, 4, 3, 7], and the column pointers records the first indexes of the value occurring in the column, or the first expected value in the column if there no value in this column. For example, the first expected value of the 1st column is 2, which has an index of 1 in the iterating list, and the first expected value of the 2nd column is 8 which has an index of 3 in the iterating list, and so on. Then we get the list of column pointers.

Notice that there’s no non-sparse element in the 6th column, and next we are expecting the value 7 to occur, then we should fill 11(the index of 7 in the iterating list) in the array of column pointers.

And the rowval is easier to understand when you just need to assign a row index to each element and so does the nzval.

And now we may construct a specific sparse matrix we would like to.

Dispatching A Missions List in One Line

When doing parallel computing, especially when using DistributedArrays, one may always want to dispatch/separate an array into (even) parts. It seems that I couldn’t find a simple way or method achieving that. Let’s say you want M missions dispatched/separated/grouped/splitted to N processes/workers, so the following may help:

# dispatch your mission
M = 50
N = 8
dlist = [collect(i:N:M) for i=1:N]
# [1,9,17,25,33,41,49] 
# [2,10,18,26,34,42,50]
# [3,11,19,27,35,43]   
# [4,12,20,28,36,44]   
# [5,13,21,29,37,45]   
# [6,14,22,30,38,46]   
# [7,15,23,31,39,47]   
# [8,16,24,32,40,48]

And you don’t need to worry about whether M can be divided exactly. But this seems quite “sparse”, what if I want neighboring missions dispatched together to the same process?

# dispatch your mission
M = 50
N = 8
[p[1]+1:p[2] for p in[unshift!([sum([floor(Int64,M/N)+(i<=M%N) for i=1:N][1:j]) for j=1:N],0)[k:k+1] for k=1:N]]
# 8-element Array{Any,1}:
# 1:7  
# 8:14 
# 15:20
# 21:26
# 27:32
# 33:38
# 39:44
# 45:50
[collect(p[1]+1:p[2]) for p in[unshift!([sum([floor(Int64,M/N)+(i<=M%N) for i=1:N][1:j]) for j=1:N],0)[k:k+1] for k=1:N]]
# 8-element Array{Any,1}:
# [1,2,3,4,5,6,7]     
# [8,9,10,11,12,13,14]
# [15,16,17,18,19,20] 
# [21,22,23,24,25,26] 
# [27,28,29,30,31,32] 
# [33,34,35,36,37,38] 
# [39,40,41,42,43,44] 
# [45,46,47,48,49,50]

Just as what I want, but can the code itself be more elegant? Well, perhaps you may find this easier to understand:

# dispatch your mission
M = 50
N = 8
a = [floor(Int64,M/N)+(i<=M%N) for i=1:N] # how many elements in each group
# [7, 7, 6, 6, 6, 6, 6, 6]
b = unshift!([sum(a[1:j]) for j=1:N],0) # initial-1 and ending elements of each group
# [0, 7, 14, 20, 26, 32, 38, 44, 50]
c = [collect(p[1]+1:p[2]) for p in[b[k:k+1] for k=1:N]] # create array between consecutive elements
# c is the demanded grouping scheme 

Well I am thinking of some better and more elegant method, which may take some time. Before that you can wrap the code into a function for convenience.

And I translated it to Python, which writes like:

# dispatch your mission in Python
M = 50
N = 8
L = [list(range(p[0]+1,p[1]+1)) for p in [([0]+[sum([int(M/N)+(i<=M%N) for i in range(1,N+1)][0:j]) for j in range(1,N+1)])[k:k+2] for k in range(0,N)]]

It looks…… Perhaps there are better ways…

Avoid Some Structures When Saving with JLD

You know, some structures should be avoided, like the Dict{UTF8String,Array{Float64,1}}, even if you explicitly declare the types of every components, they are still slow when saving…

Calculating Various Distances

The Distances.jl would help. The cosine_dist only accepts type of Float64 or other implicitly convertible types, so try to convert your type if necessary.

# convert the type
using Distances
a = [1,2,3,4] # 4-element Array{Int64,1}
b = [5,6,7,8] # 4-element Array{Int64,1}
cosine_dist(a,b) # ERROR: MethodError: `evaluate` has no method matching evaluate(::Distances.CosineDist, ::Array{Int64,1}, ::Array{Int64,1})
cosine_dist(float(a),float(b)) #0.031136068373033843

Striping 1d Array From 2d Array

There are ways of doing this, but which of them is faster? I want an array of type Array{Float64,1} stripped out of Array{Float64,2}, say, the 3rd row in the matrix in a 1d form.

# strip out a row
a = rand(100,100000)
m1 = @time a[3,:][:] # 100000-element Array{Float64,1}, 0.078370 seconds (52.34 k allocations: 4.619 MB)
m2 = @time vec(a[3,:]) # 100000-element Array{Float64,1}, 0.006913 seconds (1.95 k allocations: 881.535 KB)
m3 = @time a[3:100:100000*100] # 100000-element Array{Float64,1}, 0.001929 seconds (8 allocations: 781.516 KB)

It seems m3 is the fastest. Let’s switch to a computer with larger memory.

# strip out a row (larger memory)
a = rand(10000,1000000)
m1 = @time a[3,:][:] # 1000000-element Array{Float64,1}, 0.134122 seconds (38.74 k allocations: 17.153 MB, 6.62% gc time)
m2 = @time vec(a[3,:]) # 1000000-element Array{Float64,1}, 0.068821 seconds (108 allocations: 7.637 MB, 5.38% gc time)
m3 = @time a[3:10000:1000000*10000] # 1000000-element Array{Float64,1}, 0.072632 seconds (4.11 k allocations: 7.831 MB)

And now m2 is quite faster with still 5.38% gc(garbage collection I guess) time, which would be 0.0651184302 if the gc time is not considered. Then we know the vec function has optimizations and on larger scale it performs better. See here for a performance exploration.

Different Behaviours between Versions

Just a record that the matrix slicing operation behaves differently in version 0.4.1 and Version 0.5.0-dev+1259 (2015-11-12 19:47 UTC) Commit af36998.

# different behaviours
julia> a=rand(5,5)
5x5 Array{Float64,2}:
 0.141514  0.400978    0.127209    0.292197  0.155338
 0.717587  0.00328231  0.0538567   0.995727  0.275293
 0.829224  0.244809    0.00831194  0.081122  0.559329
 0.718257  0.46811     0.781658    0.831157  0.85268 
 0.655338  0.56655     0.403032    0.454423  0.516852
julia(0.4.1)> a[4,:]
1x5 Array{Float64,2}:
 0.718257  0.46811  0.781658  0.831157  0.85268
julia(0.5.0)> a[4,:]
5-element Array{Float64,1}:
 0.718257
 0.46811 
 0.781658
 0.831157
 0.85268

It seems there’s going to be some changes.

PyCall: Adding System Path & Importing Your Own Module

It seems that the Python environment somewhat is using the Julia’s system path definition, which means that you have to add your own path to the Julia’s environment path in order to import modules written in Python by yourself using PyCall in Julia (seems complicated?). Just use the following code:

# different behaviours
julia> using PyCall
julia> unshift!(PyVector(pyimport("sys")["path"]),"your own path")
julia> @pyimport mymodule

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *