

performance tuning without changing anything

When performance tuning at scale, cache misses and table walks can be killers. A cache miss is when the memory you're accessing isn't readily available in your L1 cache: you "miss" and look to the next layer, and so on, eventually hitting the dreaded DRAM and hopefully never making it to disk (swap).
This [stackexchange] question/answer nicely explains what happens during a table walk. The tl;dr is that we need to translate between virtual and physical memory, and it isn't a direct 1:1 mapping. In our program we may dereference a pointer to manipulate a value at some address, but in reality that is a virtual address that must get translated to a physical address. Virtual memory is divided into pages, typically 4096 bytes in size (the standard 64-bit Linux page size), and recently used translations are cached in the TLB. When the page you're touching isn't covered by a cached translation, the CPU must walk the page tables to find the entry for that page and resolve the physical address, which is a multi-level pointer chase. As you can probably imagine, these lookups and loads are not cheap, especially when you care about optimizing your operations.
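To make that pointer chase concrete, here is the usual x86-64 split of a 48-bit virtual address with 4KB pages and four-level paging (textbook numbers, nothing specific to my machine):
        
 virtual address: | PML4 index | PDPT index | PD index | PT index | page offset |
                      9 bits       9 bits      9 bits     9 bits      12 bits
 a TLB miss means up to four dependent memory loads before the real access even begins
        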

huge pages

So I mentioned that your memory is mapped into pages of 4096 bytes. If you work on any enterprise-scale application, chances are the full instructions and data do not fit into a single nicely packed 4K page, meaning you risk table walks depending on what you are doing, like a far jump or chasing through very large hash table buckets. A very simple question one may ask is "can we make the virtual pages bigger?".
Yes.
Sort of.
In 64-bit Linux land, depending on your kernel version, you have access to what are called huge pages. A peek at the [manual page for mmap] talks a bit about these huge pages with the MAP_HUGE_2MB and MAP_HUGE_1GB flags. First you'll need to see if your system supports them and what exactly is configured.
        
 $ grep -i huge /proc/meminfo
 AnonHugePages:         0 kB
 ShmemHugePages:        0 kB
 FileHugePages:         0 kB
 HugePages_Total:       0
 HugePages_Free:        0
 HugePages_Rsvd:        0
 HugePages_Surp:        0
 Hugepagesize:       2048 kB
 Hugetlb:               0 kB
        
Nice, so my system has huge pages configured at 2MB (2048 kB) but isn't using any at all. On a desktop machine this makes sense; server variants of an OS will usually have them configured by default. To see if they are enabled:
        
 $ cat /sys/kernel/mm/transparent_hugepage/enabled
 always [madvise] never
        
This output shows three options, and the one inside the square brackets is currently selected. In this case huge pages can be used (if they exist) only when madvise says to try. For the unfamiliar, [madvise] is a system call that gives the kernel advice about the address range provided. One such piece of advice is MADV_HUGEPAGE, which enables transparent huge pages for the pages in the specified range. This means that as the kernel scans areas marked as huge page candidates it will attempt to allocate huge pages and replace the regular ones. A handy way of letting the kernel do the magic.
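For illustration, opting an ordinary anonymous allocation into transparent huge pages looks roughly like this (thp_buffer and the 64MB size are made up for the example):
        
 #include <sys/mman.h>

 /* allocate an anonymous buffer and ask for transparent huge pages behind it */
 void *thp_buffer(void)
 {
     size_t len = 64UL * 1024 * 1024;
     void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (buf == MAP_FAILED)
         return NULL;

     /* advise the kernel that this range is a good huge page candidate */
     madvise(buf, len, MADV_HUGEPAGE);
     return buf;
 }
        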

The thing is, there are no huge pages reserved on my system, so that madvise option isn't going to do much. We could change it to [always] of course, but we'd have the same problem. So let's tell the OS to set aside some huge pages. Before you do this, make sure you have the available memory and calculate how many you will need. For this experiment I will make a single page, but in a real environment you'll want many more. If you get this value wrong, or try to access huge pages that don't exist, you can put your application or even the OS into an unfortunate state. [Oracle] has a very useful guide for this exact task, but it comes with the context of managing Oracle on Linux, so I'll keep it brief and skip all that fluff.
        
 # open /etc/sysctl.conf with your favorite editor (it is vim right?)
 # update or add this line for a single huge page
 vm.nr_hugepages=1
 # save and exit then run the following command to refresh this system config
 sysctl -p
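 # (alternatively, skip the config file and set it only until the next reboot:
 #  sysctl -w vm.nr_hugepages=1)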
        
We have now configured our machine to have a whopping 1 huge page (2MB in size). We can verify by checking out meminfo again:
        
 $ grep -i hugepages /proc/meminfo 
 AnonHugePages:         0 kB
 ShmemHugePages:        0 kB
 FileHugePages:         0 kB
 HugePages_Total:       1
 HugePages_Free:        1
 HugePages_Rsvd:        0
 HugePages_Surp:        0
 Hugepagesize:       2048 kB
        
Voila. There it is. Let's use it!

remap running code

I wasn't exactly honest when I said we can get some performance gains by changing nothing. So far we had to change our system configuration, and now we will modify the program we are optimizing, though not by changing algorithms or data structure choices. Instead we will focus on the actual code and leave the data out for now; after this exercise, how to make your static and dynamic allocations use huge pages should be obvious. We are going to take the text section of the running ELF binary and move it from regular sized pages onto huge pages so that ideally our instructions all fit on a single page, never needing a table walk, and if we arrange things optimally we can reduce cache misses further by keeping the code resident in cache.

Searching the internet there are tools that will do the remapping for you. But where's the fun in that when we have already done the configuration manually?
To implement this ourselves I'm going to use GCC on a 64-bit Linux system, which I am noting because I'm making use of specific GNU features that may or may not exist in other compiler toolchains. YMMV.

GCC has attributes which can be applied to functions as hints to the compiler/linker to do something extra. One such attribute is constructor, which arranges for your function to run before main is entered (at link time it lands in the ctors/init section). The reason this is important is that we will be shuffling around the text segment of a loaded ELF binary, and we wouldn't want to move something out from under running code and completely hose ourselves. Putting the remapping code in its own section is valid, but running it before main means we can be relieved of worry about much else executing yet. For this exercise the function would look something like:
        
 extern unsigned char __executable_start[];
 extern unsigned char __etext[];
 void __attribute__((constructor)) __code_section_hugepage (void)
 {
     /* the meat */
 }
    
Hey wait, what are those two external variables? They are linker-provided symbols that give us the beginning and the end of our text (code) section. Well, etext actually lands after fini, but this is fine. We need these to determine where our code starts and ends! Now the strategy is:
  1. Use those two addresses to determine the size of the text section
  2. Allocate a new region of memory of that size (mmap, malloc, take your pick)
  3. Copy everything in the text section (from __executable_start to the end) into the new region
  4. Re-map the original range with mmap(), using some special flags to ensure it ends up in the exact same place, backed by huge pages, with the correct permissions
  5. Copy from the temp region into the newly remapped region
  6. Free the temp region
That's it! Pretty simple, however there are plenty of gotchas along the way, so as I show the code I'll try to document them. First of course is making sure you have enough huge pages available!
        
 #include <fcntl.h>
 #include <stdint.h>
 #include <stdlib.h>
 #include <unistd.h>

 /* ERROR() is a logging macro defined elsewhere in the project */
 #define HPINFO_FP           "/proc/sys/vm/nr_hugepages"
 void __attribute__((constructor)) __code_section_hugepage (void)
 {
     uint64_t numhugepages = 0;
     char data[21] = {0};
     ssize_t bytesread = 0;
     ssize_t i;

     /* open the huge page info file */
     int fd = open(HPINFO_FP, O_RDONLY);
     if (fd < 0)
     {
         ERROR("can't open %s", HPINFO_FP);
         return;
     }

     /* it's a raw ascii number, up to 20 digits max */
     bytesread = read(fd, data, 20);
     if (bytesread < 0)
     {
         ERROR("unable to read %s.", HPINFO_FP);
         exit(-1);
     }

     /* close the file, we don't need it anymore */
     if (close(fd) < 0)
     {
         ERROR("couldn't close %s.", HPINFO_FP);
         exit(-1);
     }

     /* parse the number of hugepages, skipping the trailing newline */
     for (i = 0; i < (bytesread - 1); i++)
     {
         numhugepages = (numhugepages * 10) + (data[i] - '0');
     }
 }
        
Now, assuming the number of pages is valid, we continue. The other ingredient is textsz, the size of the text section we are about to move.
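I won't belabor the arithmetic, but a minimal sketch of that calculation looks like this; the rounding up to the 2MB huge page size is my own defensive addition, and it is only really safe if the text section is padded out to that boundary (see the linker script notes at the end):
        
 #define HUGEPAGE_SZ (2UL * 1024 * 1024)   /* matches Hugepagesize in /proc/meminfo */

 /* raw size of the text section from the linker-provided symbols */
 size_t textsz = (size_t)(__etext - __executable_start);

 /* round up to a whole number of huge pages; only safe if nothing else
    lives in the padding (see the linker script section below) */
 textsz = (textsz + HUGEPAGE_SZ - 1) & ~(HUGEPAGE_SZ - 1);
        
With that in hand, the temporary mapping and copy look like this: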
        
 /* make a temporary mapping */
 void *tmpbuf = mmap(NULL, textsz,
                     PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE,
                     -1, 0);
 if (MAP_FAILED == tmpbuf)
 {
     ERROR("mmap failed!");
     return;
 }
 
 /* copy from old to temp */
 memcpy(tmpbuf, __executable_start, textsz);
        
If you aren't familiar with mmap() that is okay! The man page is available but it can be overwhelming. The important parts to know here are that this is a temporary region, so all the parameters are very default. NULL tells the OS to pick any location for this region. The size is next, followed by the protection flags; it must be readable and writable so that we can copy into the region (duh!) and then read it back out. The map flags mean it is anonymous memory private to this process, and the last two arguments are the file descriptor (-1 means none) and offset (we start at the start). It is important to note that mmap() doesn't return NULL on failure but a special value, so to check whether it failed you must compare against MAP_FAILED. And finally we copy textsz bytes of the original text on over.

Wait! Doesn't all of this need to be page aligned and a multiple of the page size? Yes, you are absolutely correct. Our text section should be page aligned, but depending on your toolchain you may or may not get the final memory page to yourself by default (meaning the next ELF section could share that page). You can force this in a linker script, which I will show further down.

The re-map can now begin. This is the magic we have been waiting for:
        
 /* re-map original map */
 void *section = mmap(__executable_start, textsz,
                PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE | MAP_HUGETLB,
                -1, 0);
 /* hint that we will need these pages soon so the kernel faults them in early */
 madvise(section, textsz, MADV_WILLNEED);

 /* copy the code back from the temp buffer into the re-mapped region */
 memcpy(section, tmpbuf, textsz);

 /* restore the proper permissions: this is code, it must be readable and executable */
 mprotect(section, textsz, PROT_READ | PROT_EXEC);
        
Starting with the mmap(), you'll notice we still haven't marked it as executable, but there are two new flags: MAP_FIXED and MAP_HUGETLB. The first parameter is the address of our text section start, and MAP_FIXED says to place the mapping at exactly that address. This is how we "re-map" our original text section. The other flag is MAP_HUGETLB, which if you are here I'm sure you can guess what it does: it tells the OS to back the mapping with huge pages. There are more flags you can add to firm up the performance guarantees, like MAP_NORESERVE (don't reserve swap space for this region) and MAP_POPULATE (pre-fault all the memory pages so you don't take the latency hit of a page fault later). Follow this up with an madvise(), which gives the OS advice about a region of memory (given a starting address and size); here we pass MADV_WILLNEED as a way to pre-fetch the pages and mark them as needed soon. The memcpy is just like before but in reverse, copying back to our original location. And finally we mark these pages readable and executable because it is code!

Now your program is running with huge pages for all of its code, reducing i-TLB misses (ideally). If you can fit all of your application (or at least the critical datapath) into 2MB then it sits in a single page, so instruction fetches stop paying for page faults and TLB walks, and if the hot code is small enough it may stay resident in cache, cutting cache misses too. You can apply this strategy to your data as well, especially in applications where all memory is allocated at start-up, which is what you should be doing in ultra-high-performance computing anyway.
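The data side doesn't even need the remapping dance: a plain anonymous MAP_HUGETLB mapping at start-up will draw from the pool we reserved earlier. Roughly (huge_alloc is just a made-up helper name):
        
 #include <sys/mman.h>

 #define HUGEPAGE_SZ (2UL * 1024 * 1024)   /* the 2MB size reported in /proc/meminfo */

 /* hand back a huge-page-backed buffer, or NULL if the pool is exhausted */
 static void *huge_alloc(size_t len)
 {
     /* MAP_HUGETLB hands out whole huge pages, so round the request up */
     len = (len + HUGEPAGE_SZ - 1) & ~(HUGEPAGE_SZ - 1);

     void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
     return (p == MAP_FAILED) ? NULL : p;
 }
        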

what's this about a linker script?

If you are just running gcc straight then it uses a default linker script which you can see with:
       
 ld --verbose
       
In there you can see the etext mentioned above and the goofy syntax used to create that symbol. This default linker script has a ton in it, and you can prune it down or add to it as you like. One strategy is to create a section strictly for the code/data you want to map in a special manner (there's a sketch at the end of this section). To force page alignment, the ALIGN(size) keyword can be used to push the current location forward to the next page boundary. You can see this in the default linker script on this line:
        
 . = ALIGN(CONSTANT (MAXPAGESIZE));
        
Linker scripts feel like magic spells, and the documentation leaves a lot to be desired. There are a number of resources online, especially in the embedded world, that help de-mystify them a bit, so I recommend checking those out if you want to optimize your ELF further.
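To leave you with a rough idea of what that dedicated-section strategy could look like, here is a hypothetical fragment; the section name .text.hot, the file name, and the hard-coded 2MB constant are all placeholders, and GNU ld's INSERT keyword lets it augment the default script rather than replace it:
        
 /* hugetext.ld -- pass it to the linker with: gcc ... -Wl,-T,hugetext.ld */
 SECTIONS
 {
     /* start the hot code on a 2MB boundary so it can sit on its own huge page */
     . = ALIGN(0x200000);
     .text.hot : { *(.text.hot .text.hot.*) }
     /* pad to the next 2MB boundary so nothing else shares the last huge page */
     . = ALIGN(0x200000);
 }
 INSERT AFTER .text;
        
Functions get placed in that section with __attribute__((section(".text.hot"))).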

