Julia basic concepts
Alejandro Morales
Centre for Crop Systems Analysis - Wageningen University
Introduction
This is not a tutorial or introduction to Julia, but a collection of basic concepts about Julia that are particularly useful to understand VPL. It is assumed that the reader has some experience with programming in other languages, such as Matlab, R or Python. These concepts should be complemented with general introductory material about Julia, which can be found at the official Julia website.
Julia is a dynamic, interactive programming language, like Matlab, R or Python. Thus, it is very easy to use and learn incrementally. The language is young and well-designed, with an emphasis on numerical/scientific computation, although it is starting to occupy some space in areas such as data science and machine learning. It has a clear syntax and better consistency than some older programming languages.
Unlike Matlab, R or Python, Julia was designed from the beginning to be fast (as fast as statically compiled languages like C, C++ or Fortran). However, achieving this goal does require paying attention to certain aspects of the language, such as type stability and dynamic memory allocation, which are not always obvious to the user coming from other scientific dynamic languages. In the different sections below, a few basic Julia concepts are presented, first by ignoring performance considerations and focusing on syntax, and then by showing how to improve the performance of the code. Some concepts are ignored as they are not deemed relevant for the use of VPL.
Running Julia
There are different ways of executing Julia code (the most popular ones are VS Code and Jupyter notebooks):
- Interactive Julia console from terminal/console (REPL)
- Plugins for code editors
  - Visual Studio Code (most popular)
  - Atom/Juno (less popular now)
  - vim, Emacs and others (less popular)
- Code cells inside a Jupyter notebook
- Code cells inside a Pluto notebook (a Julia implementation of a reactive notebook)
The first time in a Julia session that a method is called, it will take extra time because the method has to be compiled (i.e. Julia uses a Just-in-Time compiler as opposed to an interpreter). Also, the first time you load a package after installation or an update, it will take extra time due to precompilation (this reduces JIT compilation times somewhat). Moreover, code editors and notebooks may need to run additional code to achieve their full functionality, which may add some delay when executing code.
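As a minimal illustration of the first point (timings are indicative and depend on your machine), timing the same call twice shows the compilation overhead of the first call. The function square used here is a hypothetical example and is not used elsewhere in this document:
square(x) = x^2
@time square(2.0) # first call: includes JIT compilation time
@time square(2.0) # second call: only execution time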
Basic concepts
Functions
A function is defined with the following syntax.
function foo(x)
x^2
end
foo(2.0)
Very short functions can also be defined in one line:
foo2(x) = x^2
foo2(2.0)
Functions can also be defined with the "->" (arrow) syntax, which creates an anonymous function. The result can be assigned to any variable.
foo3 = x -> x^2
foo3(2.0)
A begin - end block can be used to store a sequence of statements across multiple lines and assign the result, for example to create a "short" function or a "->" function.
foo4 = begin
x -> x^2
end
foo4(2.0)
Once created, there is no difference among foo, foo2, foo3 and foo4.
Anonymous functions are useful when passing a function to another function as an argument. For example, the function bar below applies any function f to an argument x. In this case we could pass any of the functions defined above, or just create an anonymous function in place.
function bar(x, f)
f(x)
end
bar(2.0, x -> x^2)
Types
A Type in Julia is a data structure that can contain one or more fields. Types are used to keep related data together, and to select the right method implementation of a function (see below). It shares some properties of Classes in Object-Oriented Programming, but there are also important differences.
Julia types can be immutable or mutable.
Immutable means that, once created, the fields of an object cannot be changed. They are defined with the following syntax:
struct Point
x
y
z
end
p = Point(0.0, 0.0, 0.0)
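# The following assignment raises an error because Point is immutable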
p.x = 1.0
Mutable means that the fields of an object can be modified after creation. The definition is similar; it just needs the keyword mutable:
mutable struct mPoint
x
y
z
end
mp = mPoint(0.0, 0.0, 0.0)
mp.x = 1.0
mp
We can always check the type of an object with the function typeof:
typeof(p)
If you forget the fields of a type, use fieldnames on the type (not on the object). It will return the names of all the fields it contains (the ":" in front of each name can be ignored):
fieldnames(Point)
Note that, for performance reasons, the type of each field should be annotated in the type definition as in:
struct pPoint
x::Float64
y::Float64
z::Float64
end
pPoint(1.0, 2.0, 3.0)
Also, note that there are no private fields in a Julia type (like Python, unlike C++ or Java).
Methods
Methods are versions of the same function specialized for different argument types. A method is created by annotating the types of (some of) the arguments of a function, as in the following example:
function dist(p1::pPoint, p2::pPoint)
dx = p1.x - p2.x
dy = p1.y - p2.y
dz = p1.z - p2.z
sqrt(dx^2 + dy^2 + dz^2)
end
p1 = pPoint(1.0, 0.0, 0.0)
p2 = pPoint(0.0, 1.0, 0.0)
dist(p1, p2)
Note that this function will not work for mPoints:
mp1 = mPoint(1.0, 0.0, 0.0)
mp2 = mPoint(0.0, 1.0, 0.0)
dist(mp1, mp2)
So we need to either define a dist method that takes mPoint arguments, or use inheritance (see below).
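For illustration, the first option could look like this (a sketch, not part of the original example; the abstract-type approach below avoids this kind of duplication):
function dist(p1::mPoint, p2::mPoint)
    dx = p1.x - p2.x
    dy = p1.y - p2.y
    dz = p1.z - p2.z
    sqrt(dx^2 + dy^2 + dz^2)
end
dist(mp1, mp2)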
Abstract types
Concrete types (like the structs defined above) cannot inherit from other concrete types.
However, when multiple types share analogous functionality, it is possible to group them under "abstract types" from which they can inherit. Note that abstract types do not contain any fields, so inheritance only works for methods: "abstract types are defined by how they act" (C. Rackauckas).
For example, we may define an abstract type Vec3 as any type for which a distance (dist) can be calculated. The default implementation assumes that the type contains fields x, y and z, though inherited methods can always be overridden.
Inheritance is indicated by the "<:" syntax after the name of the type in its declaration.
# Vec3 contains no data, but dist actually assumes that x, y and z are fields of any type inheriting from Vec3
abstract type Vec3 end
function dist(p1::Vec3, p2::Vec3)
dx = p1.x - p2.x
dy = p1.y - p2.y
dz = p1.z - p2.z
sqrt(dx^2 + dy^2 + dz^2)
end
# Like before, but inheriting from Vec3
struct Point2 <: Vec3
x::Float64
y::Float64
z::Float64
end
mutable struct mPoint2 <: Vec3
x::Float64
y::Float64
z::Float64
end
struct Point3 <: Vec3
x::Float64
y::Float64
end
The method now works with Point2 and mPoint2:
p1 = Point2(1.0, 0.0, 0.0)
p2 = Point2(0.0, 1.0, 0.0)
dist(p1, p2)
mp1 = mPoint2(1.0, 0.0, 0.0)
mp2 = mPoint2(0.0, 1.0, 0.0)
dist(mp1, mp2)
The method will try to run with Point3, but it will raise an error because Point3 does not have the field z.
p3 = Point3(1.0, 0.0)
dist(p1, p3)
Optional and keyword arguments
Functions can have optional arguments (i.e. arguments with default values) as well as keyword arguments, which are like optional arguments but you need to use their name (rather than position) to assign a value.
An example of a function with optional arguments:
opfoo(a, b::Int = 0) = a + b
opfoo(1)
opfoo(1,1)
An example of a function with keyword arguments:
kwfoo(a; b::Int = 0) = a + b
kwfoo(1)
kwfoo(1, b = 1)
Modules
Within a Julia session you cannot redefine Types. Also, if you assign different data to the same name, it will simply overwrite the previous data (note: these statements are simplifications of what actually happens, but they suffice for now).
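As a small illustration of the second point (the names here are just for the example):
a = 1     # a refers to an integer
a = "one" # the same name now refers to a string; the previous value is simply discarded
# In contrast, as noted above, evaluating a different struct definition for an existing
# type name (e.g. redefining Point with extra fields) errors within the same session.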
To avoid name clashes, Julia allows collecting functions, methods, types and abstract types into Modules. Every Julia package includes at least one module.
A module allows exporting a subset of the names defined inside of it:
module Mod
export fooz
fooz(x) = abs(x)
struct bar
data
end
end
In order to use a module the using command must be used (the . is required and indicates that the module was defined in the current scope, as modules can be nested).
using .Mod
Exported names can be used directly:
fooz(-1)
Unexported names can still be retrieved, but must be qualified by the module name.
b = Mod.bar(1.0)
Adding methods to existing functions
If a function is defined inside a module (e.g., a Julia package) we can add methods to that function by accessing it through the module name. Let's define a function manhattan that calculates the Manhattan (as opposed to Euclidean) distance between two points. We will put it inside a module called Funs to emulate a Julia package.
module Funs
export manhattan
function manhattan(p1, p2)
dx = p1.x - p2.x
dy = p1.y - p2.y
dz = p1.z - p2.z
abs(dx) + abs(dy) + abs(dz)
end
end
using .Funs
manhattan(p1, p2)
manhattan(p1, p3)
We see that we have the same error as before when using p3. Let's add methods for when either the first or the second argument is a Point3, ignoring z:
Funs.manhattan(p1::Point3, p2) = abs(p1.x - p2.x) + abs(p1.y - p2.y)
Funs.manhattan(p1, p2::Point3) = abs(p1.x - p2.x) + abs(p1.y - p2.y)
manhattan(p1, p3)
manhattan(p3, p1)
You can find all the methods of a function by using methods() on the function name:
methods(manhattan)
Macros
A macro is a function or statement that starts with @. The details of macros are not explained here, but it is important to know that they work by transforming the code that you write inside the macro, usually to provide specific features or to achieve higher performance. That is, a macro takes the code that you write and substitutes it with new code that then gets executed.
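As a small illustration (a sketch; the exact output differs between Julia versions), the @macroexpand macro shows the code that another macro generates, without running it:
@macroexpand @show 1 + 1 # returns the expression that @show would substitute for this call
The returned expression is what actually gets compiled and executed when you use @show directly.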
A useful macro is @kwdef, provided by the module Base, which allows assigning default values to the fields of a type and using the field names as keyword arguments in the constructor. This macro needs to be written before the type definition. For example, a point constructed in this manner would be:
Base.@kwdef struct kwPoint
x::Float64 = 0.0
y::Float64 = 0.0
z::Float64 = 0.0
end
kwPoint()
kwPoint(1,1,1)
kwPoint(y = 1)
Dot notation
Dot notation is a very useful feature of Julia that allows you to apply a function to each element of a vector. For example, if you want to calculate the square of each element of a vector you can do:
x = [1,2,3]
y = x.^2
The dot notation can be used with any function, not just mathematical functions. For example, if you want to calculate the absolute value of each element of a vector you can do:
abs.(y)
If the operation involves several steps, the '.' should be used in all of them or, alternatively, you can use the macro @. which does the same:
abs.(y) .+ x.^3
@. abs(y) + x^3
The dot notation can also be used with functions that take more than one argument, but make sure that all the arguments have the same length
min.(x, y)
max.(x, y)
Improving performance
Type instability
As indicated above, annotating the fields of a data type (struct or mutable struct) is required to achieve good performance. However, neither arguments of functions nor variables created through assignment require type annotation. This is because Julia uses type inference (i.e. it tries to infer the type of data to be stored in each newly created variable) and compiles the code to machine level based on this inference. This leads to the concept of type instability: if the type of data stored in a variable changes over time, the compiler needs to accommodate this, which results (for technical reasons beyond the scope of this document) in a loss of performance.
Here is a classic example of type instability. The following function will add up the squares of all the values in a vector:
function add_squares(x)
out = 0
for i in eachindex(x)
out += x[i]^2
end
return out
end
It looks innocent enough. The issue here is that out is initialized as an integer (0), but x[i]^2 may be a floating point number (e.g. 1.0), in which case out must switch to a floating point type when the result of out + x[i]^2 is assigned back to it. Because out has different types at different points in the code, the resulting code will be slower than it could be, but still correct (this is why Julia is useful for rapid code development compared to static languages like C++ or Java).
add_squares(collect(1:1000)) # type stable
add_squares(collect(1:1000.0)) # not type stable
How do we measure performance then? The @time macro is useful for this if dealing with a slow function. Otherwise it is better to use @btime from the BenchmarkTools package (see the documentation of the package to understand why we use $).
using BenchmarkTools
v1 = collect(1:1000)
v2 = collect(1:1000.0)
@btime add_squares($v1)
@btime add_squares($v2)
The second call is 12 times slower than the first one. We can detect type instability by using the @code_warntype macro. This macro prints an internal representation of the code before it is compiled. The details are complex, but you just need to look for things in red (which indicate type instability).
@code_warntype add_squares(v1)
@code_warntype add_squares(v2)
How do we fix this? We could write different methods for different types of x, but this is not very practical. Instead, we can use the zero() function combined with eltype() to initialize out with the correct type.
function add_squares(x)
out = zero(eltype(x)) # Initialize out with the correct type with value of zero
for i in eachindex(x)
out += x[i]^2
end
return out
end
You could also initialize out to the square of the first element of x and iterate over the rest of the elements, but this is not always possible (e.g. if x is empty), so the logic gets more complex.
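For completeness, such an alternative could look as follows (a sketch using the hypothetical name add_squares_alt; note the extra branch needed for empty input):
function add_squares_alt(x)
    isempty(x) && return zero(eltype(x)) # still need a result for empty input
    out = first(x)^2                     # initialize the accumulator from the first element
    for v in Iterators.drop(x, 1)        # iterate over the remaining elements
        out += v^2
    end
    return out
end
add_squares_alt(v2)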
Now the code is type stable:
@code_warntype add_squares(v1)
@code_warntype add_squares(v2)
And the performance is more similar:
@btime add_squares($v1)
@btime add_squares($v2)
Performance annotations
Sometimes code can be annotated to improve performance. For example, the @simd macro can be used in simple loops to indicate that the loop can be vectorized inside the CPU (this allows the CPU to run the same instruction on small sets of data simultaneously). The @inbounds macro can be used to indicate that the code will not access elements outside the bounds of an array.
function add_squares(x)
out = zero(eltype(x)) # Initialize out with the correct type with value of zero
@simd for i in eachindex(x)
@inbounds out += x[i]^2
end
return out
end
@btime add_squares($v1)
@btime add_squares($v2)
Now we actually get faster performance for floating point numbers, which is related to the fact that the CPU can vectorize floating point operations more efficiently than integer operations (at least in this example). You can see the actual assembly code being generated (or an approximation of it) by using the @code_native macro. Any instruction that starts with v is a vectorized instruction. Note that sometimes Julia will automatically vectorize code without the need for @simd, but this is not always the case.
@code_native add_squares(v2)
Notice how, with some simple annotations and by reorganizing the code to deal with type instability, we were able to get a 30x speedup. Obviously this was a simple function with minimal runtime, so the speedup is not particularly useful here, but small functions like this are often the ones that are called many times in complex computations (e.g., ray tracing), so the speedup can be significant in actual applications. Whether you need to worry about performance depends on where the bottleneck is in your code.
For more details, see the sections of the manual on profiling and performance tips.
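As a pointer, a minimal profiling sketch using the Profile standard library could look like this (the output is a call tree whose details depend on your session):
using Profile
Profile.clear()                  # discard samples from previous runs
@profile for _ in 1:1_000_000    # run the function many times so the sampler collects enough data
    add_squares(v2)
end
Profile.print()                  # print the collected samples as a call tree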
Global variables and type instability
Global variables are any variables defined in a module, outside of a function, that are accessed from within a function. Global variables are not recommended in general, as they can easily introduce bugs by making the logic of the program much harder to reason about. They can also introduce performance issues: any global variable that is not annotated with its type will lead to type instability.
For example, let's say we modify the add_squares function to use a global variable (a bit odd, but it is just to illustrate the point) to enable different options in the code.
function add_squares(x)
out = zero(eltype(x))
if criterion > 0
@simd for i in eachindex(x)
@inbounds out += x[i]^2
end
else
@simd for i in eachindex(x)
@inbounds out -= x[i]^2
end
end
return out
end
criterion = 1
@code_warntype add_squares(v2)
@btime add_squares(v2)
It is not a major hit in performance as criterion is only accessed once, but this can be a problem if the global variable is accessed many times in the code. The short term solution is to annotate the type of the global variable (but really we should be writing code without global variables). This can be frustrating, as once a global variable has been created, you cannot annotate its type (even if it does not change!) without restarting the Julia session (unless it is inside a module).
function add_squares(x)
out = zero(eltype(x))
if criterion2 > 0
@simd for i in eachindex(x)
@inbounds out += x[i]^2
end
else
@simd for i in eachindex(x)
@inbounds out -= x[i]^2
end
end
return out
end
criterion2::Int64 = 1
@code_warntype add_squares(v2)
@btime add_squares(v2)