$ grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
Nice, so my system has huge pages configured at 2MB (2048 kB), but it isn't using any at all. On a desktop machine this makes sense; server variants of an OS will usually have them configured by default. To see whether transparent huge pages are enabled:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
This output shows three options, and the one inside the square brackets is currently selected. In this case huge pages can be used (if they exist) where madvise says to try. For the unfamiliar,
[madvise] is a system call that gives the kernel advice about the address range provided. One such piece of advice is MADV_HUGEPAGE, which enables transparent
huge pages for the pages in the specified range. This means that as the kernel scans areas marked as huge page candidates, it will attempt to allocate huge pages and replace the regular ones. A handy way of letting the kernel do the magic.
# open /etc/sysctl.conf with your favorite editor (it is vim right?)
# update or add this line for a single huge page
vm.nr_hugepages=1
# save and exit then run the following command to refresh this system config
sysctl -p
We have now configured our machine to have a whopping 1 huge page (2MB in size). We can verify by checking out meminfo again:
$ grep -i hugepages /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 1
HugePages_Free: 1
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Voila. There it is. Let's use it!
extern unsigned char __executable_start[];
extern unsigned char __etext[];
void __attribute__((constructor)) __code_section_hugepage (void)
{
/* the meat */
}
Hey, wait, what are those two external variables? They are hidden, linker-provided symbols that give us the beginning and end of our text (code) section. Well, __etext actually lands after .fini, but that is fine.
We need these to determine where our code starts and ends! Now the strategy is:
#define HPINFO_FP "/proc/sys/vm/nr_hugepages"
void __attribute__((constructor)) __code_section_hugepage (void)
{
uint64_t numhugepages = 0;
char data[21] = {0};
ssize_t bytesread = 0;
int i;
/* open the huge page info file */
int fd = open(HPINFO_FP, O_RDONLY);
if (fd < 0)
{
ERROR("can't open %s", HPINFO_FP);
return;
}
/* it's a raw ASCII number, up to 20 digits max */
bytesread = read(fd, data, sizeof(data) - 1);
if (bytesread < 0)
{
ERROR("unable to read %s.", HPINFO_FP);
exit(-1);
}
/* close the file we don't need it anymore */
if (close(fd) < 0)
{
ERROR("couldn't close %s.", HPINFO_FP);
exit(-1);
}
/* get the number of hugepages (the last byte is a trailing newline) */
for (i = 0; i < (bytesread - 1); i++)
{
numhugepages = (numhugepages * 10) + (data[i] - '0');
}
}
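The hand-rolled digit loop above can also be expressed with strtoul, which stops at the trailing newline for us. A small sketch (the helper name is invented for illustration):

```c
#include <stdlib.h>

/* Parse the ASCII count read from /proc/sys/vm/nr_hugepages
 * (e.g. "1\n") into an integer. */
unsigned long parse_hugepage_count(const char *data)
{
    /* base-10 conversion; parsing halts at the '\n' automatically,
     * so there is no manual off-by-one on the newline */
    return strtoul(data, NULL, 10);
}
```

Either way works; strtoul just trades the explicit loop for a libc call that handles the edge cases.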
Now, assuming the number of pages is valid, we continue with the next step: make a temporary mapping and move our code over. The size we calculated is textsz.
/* make a temporary mapping */
tmpbuf = mmap(NULL, textsz,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE,
-1, 0);
if (MAP_FAILED == tmpbuf)
{
ERROR("mmap failed!");
return;
}
/* copy from old to temp */
memcpy(tmpbuf, __executable_start, textsz);
If you aren't familiar with mmap(), that is okay! The man page is available but can be overwhelming. The important thing to know here is that this is a temporary region, so all the parameters are very default.
NULL tells the OS to pick any location for the region. The size comes next, followed by the protection flags. The region must be readable and writable so that we can copy into it (duh!) and then read it to copy
back out. The map flags mean it is private to this process, and the last two arguments are the file descriptor (-1 means none) and the offset (we start at the start). It is important to note that mmap() doesn't return NULL
on failure but a special value, so to check for failure you must compare against MAP_FAILED. And finally, copy the original text section on over.
/* re-map original map */
void *section = mmap(__executable_start, textsz,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE | MAP_HUGETLB,
-1, 0);
if (MAP_FAILED == section)
{
ERROR("re-map failed!");
return;
}
madvise(section, textsz, MADV_WILLNEED);
memcpy(section, tmpbuf, textsz);
mprotect(section, textsz, PROT_READ | PROT_EXEC);
Starting with the mmap(), you'll notice we still haven't marked the region executable, and there are two new flags: MAP_FIXED and MAP_HUGETLB. The first parameter is the address of our text section's start, and MAP_FIXED says to
place the mapping at exactly that address. This is how we "re-map" our original text section. The other flag is MAP_HUGETLB, which, if you are here, I'm sure you can guess what it does: tells the OS to back the mapping with huge pages. There
are more flags you can add to really amp up the performance guarantees, like MAP_NORESERVE (do not reserve swap space for this region) and MAP_POPULATE (pre-fault all the memory pages so you don't take a latency hit on a page fault later).
Follow this up with madvise(), which gives the OS advice about what to do with a region of memory (given a starting address and size). In this case we pass MADV_WILLNEED as a way to pre-fetch the pages and mark them as ones we will
be using soon. The memcpy() is just like before but in reverse, copying back to our original location. And finally we mark these pages readable and executable, because it is code!
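To sanity-check that the huge page actually got consumed, you can watch the kernel's accounting in /proc/meminfo from another terminal while the remapped program is running:

```shell
# while the program is alive, HugePages_Free should drop below
# HugePages_Total, showing our 2MB page is in use
grep -E 'HugePages_(Total|Free|Rsvd)' /proc/meminfo
```

Once the process exits, the page returns to the free pool and the counters go back to where they started.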
To see where these linker symbols come from, you can dump the default linker script:
ld --verbose
In there you can see the etext symbol mentioned above and the goofy syntax used to create it. The default linker script has a ton in it, and you can prune or add whatever you want. One strategy is to create a
section strictly for the code/data you want to map in a special manner. To force page alignment, the ALIGN(size) keyword can be used, pushing the current location forward to the next page boundary. This is seen in our default linker script on this line:
. = ALIGN(CONSTANT (MAXPAGESIZE));
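As a sketch of that strategy (the section and symbol names here are invented for illustration), a dedicated huge-page-aligned section in a custom linker script fragment might look something like:

```
SECTIONS
{
  /* jump the location counter to the next max-page boundary */
  . = ALIGN(CONSTANT (MAXPAGESIZE));
  __hugetext_start = .;
  .hugetext : { *(.hugetext) }
  . = ALIGN(CONSTANT (MAXPAGESIZE));
  __hugetext_end = .;
} INSERT AFTER .text;
```

The INSERT AFTER directive lets you splice this into the default script without maintaining a full copy of it, and the two symbols give you exact, page-aligned bounds to hand to mmap().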
Linker scripts feel like magic spells, and the documentation leaves a lot to be desired. There are a number of resources online, especially in the embedded world, which help de-mystify them a bit, so I recommend checking
those out if you want to further optimize your ELF.