Storing a sparse matrix – list of lists (LIL)

Usually, matrices in C++ programs are represented as two-dimensional arrays. The memory requirement of such an array is proportional to m×n, where m and n are the height and width of the array.

Almost all values of a dense matrix are significant

Things look different if the matrix we need to store is a sparse matrix. A sparse matrix is a matrix in which most elements have the default value (often zero) and only a few elements have another value. Storing such a matrix in a two-dimensional array would be a big waste of memory space, especially when it is large and its sparsity level is high. Such an array can be stored in reduced space with no information loss and with only a slight deterioration in performance.

Example of sparse matrix

List of lists: Efficient representation of sparse array

One possible approach to storing a sparse array is the list of lists (LIL) method. The whole array is represented as a list of rows, and each row is represented as a list of pairs: the position (index) within the row and the value. Only elements with a value other than the default (zero in the presented case) are stored. For the best performance, the lists should be sorted in order of ascending keys.

Let’s consider a one-dimensional array for simplicity. It contains a huge number of elements, but only the initial ones are presented below. Three elements have a non-zero value.

Sparse 1D array

It can be represented as the following list of pairs (index, value):

1D sparse array list of lists
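
For instance (the values here are made up for illustration), if the only non-zero elements were a[2] = 5, a[10] = 8 and a[40] = 3, the stored list would contain just three pairs:

#include <cstddef>
#include <list>
#include <utility>

// Only the three non-default elements are stored, sorted by index.
// Dense view:  0 0 5 0 0 0 0 0 0 0 8 0 ... 3 0 ...
// LIL view:    (2, 5), (10, 8), (40, 3)
std::list<std::pair<size_t, int>> sparseArray = { {2, 5}, {10, 8}, {40, 3} };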

The representation of a 2D sparse matrix is similar; just store the rows, together with their sequential numbers (indices), in another list. You then have a list (of rows) of lists (of the rows’ elements).
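
A minimal sketch of such a nested representation (the type names are chosen here only for illustration) could look like this:

#include <cstddef>
#include <list>
#include <utility>

// One row: a sorted list of (column index, value) pairs.
using SparseRow = std::list<std::pair<size_t, int>>;

// The whole matrix: a sorted list of (row index, row) pairs.
// Rows containing only default values are not stored at all.
using SparseMatrix = std::list<std::pair<size_t, SparseRow>>;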

Example implementation in C++

CSparse1DArray is an example draft C++11 implementation of a fixed-size 1D sparse array of integers, based on the list approach. The value used for the skipped fields can be customized; it is 0 by default.

#include <cstddef>
#include <list>
#include <utility>

class CSparse1DArray
{
public:
    CSparse1DArray(size_t length, int defVal = 0);
    CSparse1DArray() = delete;

    int get(size_t idx) const;
    void set(size_t idx, int value);
    size_t size() const;

private:
    std::list<std::pair <size_t, int> > m_list;
    size_t                              m_size;
    int                                 m_defVal;
};

CSparse1DArray::CSparse1DArray(size_t length
                              , int defVal)
    : m_list()
    , m_size(length)
    , m_defVal(defVal)
{
}

// Returns the value stored at idx, or the default value if idx is not stored.
int CSparse1DArray::get(size_t idx) const
{
    for (auto it = m_list.begin(); 
         it != m_list.end(); 
         ++it)
    {
        if (idx == it->first)
        {
            return it->second;
        }
    }

    return m_defVal;
}

// Stores value at idx, keeping the list sorted by ascending index.
// Indices outside the array bounds are ignored.
void CSparse1DArray::set(size_t idx, int value)
{
    if (idx >= m_size)
    {
        return;
    }

    auto it = m_list.begin();
    while (it != m_list.end())
    {
        if (it->first > idx)
        {
            break;
        }
        else if (it->first == idx)
        {
            it->second = value;
            return;
        }

        ++it;
    }

    m_list.insert(it, std::pair<size_t, int>(idx, value));
}
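
A short usage example of the class above (the size and the values are picked arbitrarily for illustration):

#include <iostream>

int main()
{
    CSparse1DArray arr(1000000);   // one million elements, default value 0

    arr.set(2, 5);
    arr.set(750000, 8);

    std::cout << arr.get(2)      << "\n";   // 5
    std::cout << arr.get(3)      << "\n";   // 0 (default, not stored)
    std::cout << arr.get(750000) << "\n";   // 8

    return 0;   // only the two set elements occupy space in the internal list
}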

Ready solutions

It may be wise to use already implemented solutions. For C++, check out the sparse matrix capabilities of the Boost library. For Python, try the sparse matrices in the SciPy library.
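
For example, the mapped_matrix class from Boost.uBLAS stores only the non-zero elements; the sketch below shows the rough idea (check the Boost documentation for the exact interface and the other available sparse containers):

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <iostream>

int main()
{
    // 1000x1000 matrix; only the assigned elements are actually stored.
    boost::numeric::ublas::mapped_matrix<double> m(1000, 1000);

    m(3, 7)     = 1.5;
    m(999, 999) = 2.5;

    std::cout << m(3, 7) << " " << m(0, 0) << "\n";   // prints: 1.5 0
    return 0;
}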

Improve performance with cache prefetching

The CPU puts recently and frequently used data into a small, very fast memory called the cache. Understanding the basics of how the cache works can lead to big performance improvements. The whole trick is to make the most of the data that is already in the cache.

Cache basics

When a program needs to load some data, it first looks for it in the internal memory, which is the fastest: the processor registers and the cache. The cache is organized as a hierarchy of cache levels:

  • L1 is the fastest and the smallest level of cache, usually only tens of kilobytes (often 16-32 kB). Hit latency: ~4 cycles,
  • L2 is bigger (often 256kB-4MB), but slower. Hit latency: ~10 cycles,
  • L3 is optional. Bigger and slower than L2. Hit latency: ~40 cycles.

A reference to main memory can be around 200 times slower than a reference to L1. The hit latencies above are example values for the Core i7 Xeon 5500 Series [source].

On Linux it’s possible to check the cache sizes with the lscpu command. My CPU’s caches are the following:
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K

Cache hierarchy

The program looks for the data in the fastest cache level (L1) first; if the data is not found there, it goes on to L2 and L3. If the data is still not found, it is read from main memory (RAM). The slowest and the biggest memory is of course the disk: it can store an almost unlimited amount of data, but a disk seek can be around 100,000 times slower than a reference to RAM. The figure below presents the dependency between memory size and speed.

Cache hierarchy

Memory size and reading speed dependency

Prefetching

When the needed data was not found in the cache and had to be loaded from main memory, the CPU does something to ensure that the next operations on this data will be faster: it stores the just-fetched data in the cache. Data that was in the cache before may be evicted to make room for the new data.

But the CPU in fact stores more than just the requested data in the cache. It also loads the neighbouring (following) data into the cache, because it is very likely that this data will be requested soon. The operation of reading more data than requested is called prefetching. If such prefetched data is in fact requested in the near future, it can be loaded with cache hit latency instead of an expensive main memory reference.

Performance test

I made an experiment to check how much these cache effects can improve performance. I implemented two almost identical functions which sum all elements of a 2D array. The only difference is the traversal direction: the first function, sumArrayRowByRow, references the elements reading the array row by row; the second one, sumArrayColumnByColumn, goes column by column. If we consider the memory layout of a 2D array, it is obvious that only the first function references the array elements in their sequential order in memory.

quint64 sumArrayRowByRow()
{
    quint64 sum = 0;

    for (size_t i = 0; i < HEIGHT; i++)
    {
        for (size_t j = 0; j < WIDTH; j++)
        {
            sum += array[i][j];
        }
    }

    return sum;
}

quint64 sumArrayColumnByColumn()
{
    quint64 sum = 0;

    for (size_t i = 0; i < HEIGHT; i++)
    {
        for (size_t j = 0; j < WIDTH; j++)
        {
            sum += array[j][i];
        }
    }

    return sum;
}

The test was run for various sizes of a square quint8 array: 32×32, 64×64, 128×128, 512×512, 1024×1024, 2048×2048 and 4096×4096.

The experiment showed no big differences for small arrays. The theoretically better function, sumArrayRowByRow, was only about 1% faster than sumArrayColumnByColumn for arrays of dimensions 32×32, 64×64 and 128×128. The reason can be easily spotted: the total size of the biggest of these arrays is 128*128 bytes, which is 16 kB. That is half the size of my L1d cache, so it is possible that the whole array was loaded into the cache.

Cache influence on performance small array

The situation changes dramatically when the array size increases. The bigger the array was, the bigger the advantage of the “memory-ordered” function. Operations on the biggest array, 4096×4096, were more than 3 times slower with the sumArrayColumnByColumn function! The whole array takes ~16 MB, so there is no way to fit it entirely into the cache. Here prefetching made a big difference.

Cache influence on performance big array

Summary

Reference elements in the order of their memory placement: traverse arrays row by row, operate on struct members in the order of their declaration, and so on. Such an approach may result in performance improvements.

I recommend this awesome article about the cache: Gallery of processor cache effects.

Struct members order does make a difference

Memory usage: decreased

Modern computers generally address their memory in word-sized chunks, mostly a 4-byte word on the x86 architecture or an 8-byte word on x86-64. To maintain higher performance, the computer applies data alignment rules, which in practice means that some useless padding bytes can be added between variables so that the variables are accessible at addresses that are multiples of the word size. Knowing this can be essential when designing structures for systems with strict memory requirements, since it will allow you to reduce the size of a struct.
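
As a quick illustration (the member names are arbitrary and the exact sizes depend on the compiler and target; the comments assume a typical 64-bit platform):

#include <cstdint>
#include <iostream>

// Badly ordered members: padding is inserted after each small member
// so that the following larger member lands on a properly aligned address.
struct BadOrder
{
    uint8_t  a;   // 1 byte + 7 bytes of padding
    uint64_t b;   // 8 bytes
    uint8_t  c;   // 1 byte + 7 bytes of trailing padding
};                // typically sizeof(BadOrder) == 24

// The same members ordered from largest to smallest:
// the small members share a single padded region at the end.
struct GoodOrder
{
    uint64_t b;   // 8 bytes
    uint8_t  a;   // 1 byte
    uint8_t  c;   // 1 byte + 6 bytes of trailing padding
};                // typically sizeof(GoodOrder) == 16

int main()
{
    std::cout << sizeof(BadOrder) << " " << sizeof(GoodOrder) << "\n";
    return 0;
}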

Continue reading

Cross jumping/tail merging

Code size:   decreased or the same
Performance: worse or the same
Done by:     developer or compiler

Cross jumping, or tail merging, is an optimization technique used both by compilers and by humans. The approach is very easy to understand: first detect identical code sequences that can be shared, and then modify the program flow so that it works the same with only one unique instance of this sequence.
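
A source-level sketch of the idea (the function names are made up for illustration; compilers typically perform this transformation on the machine-code level):

#include <iostream>

void stepA()   { std::cout << "A\n"; }
void stepB()   { std::cout << "B\n"; }
void logIt()   { std::cout << "log\n"; }
void cleanup() { std::cout << "cleanup\n"; }

// Before: both branches end with the same two calls.
void beforeMerging(bool cond)
{
    if (cond)
    {
        stepA();
        logIt();
        cleanup();
    }
    else
    {
        stepB();
        logIt();
        cleanup();
    }
}

// After cross jumping / tail merging: the identical tail exists only once;
// both paths reach it after their unique part.
void afterMerging(bool cond)
{
    if (cond)
    {
        stepA();
    }
    else
    {
        stepB();
    }

    logIt();
    cleanup();
}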

Continue reading

Loop unrolling

Code size:   increased
Performance: better
Done by:     developer or compiler

Loop unrolling, also known as loop unwinding, is an optimization which can reduce the overhead of running a loop: the instructions that check the loop termination condition and modify the loop counter. It requires a fixed and known loop count.

A for loop executes a conditional statement, e.g. i < 100, and modifies the loop counter, e.g. i++, once per iteration. This is shown in the example below: a simple loop with a very small body. To call this body 100 times, the loop also executes the conditional statement and the loop counter incrementation, each 100 times. This means that 200 of the 300 instructions of this loop are overhead.
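
A minimal sketch of such a loop and an unrolled variant (the array name and the unrolling factor of 4 are chosen here just for illustration; the loop count of 100 matches the description above):

int sumSimple(const int (&data)[100])
{
    int sum = 0;

    // 100 comparisons and 100 increments for 100 useful additions.
    for (int i = 0; i < 100; i++)
    {
        sum += data[i];
    }

    return sum;
}

int sumUnrolled(const int (&data)[100])
{
    int sum = 0;

    // Unrolled by a factor of 4: the comparison and the counter update
    // are executed only 25 times for the same 100 additions.
    for (int i = 0; i < 100; i += 4)
    {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }

    return sum;
}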

Continue reading