We take for granted the critical code that starts our applications. When first learning a "Hello, World!" program, especially in a language like C, it feels like magic
that after a successful compilation we can run our simple ./a.out and it just works. Most engineers, developers, or curious folk at some point wonder how. A quick internet
search usually points to some assembler output, which is met with a quick close of that tab and a resigned "it just works". Step by step, let's try to find some of the how.
#include <stdio.h>

int main (int argc, char *argv[])
{
    printf("hello, world\n");
    return 0;
}
Simple enough. We can compile it like so:
gcc -o hello hello.c
And when we execute the hello binary, we get our wonderful "hello, world" printed to the screen, followed by a newline. But why is the binary so large? Sifting through 16K isn't going to make finding the invocation of main all that easy.
$ wc -c hello
15960 hello
This is all because to get to main we need to go through what's called the C Runtime. I know, it is a compiled language, so why does it have a "runtime"? It isn't quite the same as an interpreted language's runtime: it is startup code that gets linked into the binary and runs before main.
$ objdump -d -Mintel hello
0000000000001060 <_start>:
1060: f3 0f 1e fa endbr64
1064: 31 ed xor ebp,ebp
1066: 49 89 d1 mov r9,rdx
1069: 5e pop rsi
106a: 48 89 e2 mov rdx,rsp
106d: 48 83 e4 f0 and rsp,0xfffffffffffffff0
1071: 50 push rax
1072: 54 push rsp
1073: 45 31 c0 xor r8d,r8d
1076: 31 c9 xor ecx,ecx
1078: 48 8d 3d ca 00 00 00 lea rdi,[rip+0xca] # 1149 <main>
107f: ff 15 53 2f 00 00 call QWORD PTR [rip+0x2f53] # 3fd8 <__libc_start_main@GLIBC_2.34>
1085: f4 hlt
1086: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
108d: 00 00 00
0000000000001149 <main>:
1149: f3 0f 1e fa endbr64
114d: 55 push rbp
114e: 48 89 e5 mov rbp,rsp
1151: 48 83 ec 10 sub rsp,0x10
1155: 89 7d fc mov DWORD PTR [rbp-0x4],edi
1158: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
115c: 48 8d 05 a1 0e 00 00 lea rax,[rip+0xea1] # 2004 <_IO_stdin_used+0x4>
1163: 48 89 c7 mov rdi,rax
1166: e8 e5 fe ff ff call 1050 <puts@plt>
116b: b8 00 00 00 00 mov eax,0x0
1170: c9 leave
1171: c3 ret
0000000000001174 <_fini>:
1174: f3 0f 1e fa endbr64
1178: 48 83 ec 08 sub rsp,0x8
117c: 48 83 c4 08 add rsp,0x8
1180: c3 ret
The above shows disassembled functions that live inside our compiled binary, most importantly _start and main. We wrote main, so it is obvious why it is there, and the keen eye
may also see that printf() got optimized by the compiler into puts() since we had no format specifiers to expand. You don't need to be an expert in assembly here; take a look
in _start at offset 1078. There is a load-effective-address (lea) that puts the address of our main function into register rdi. Immediately after is a call to a function
in libc named __libc_start_main. The instructions before this set up argc and argv as well: pop rsi takes argc off the top of the stack, and mov rdx,rsp leaves rdx pointing at argv.
__libc_start_main is specific to glibc (as we used gcc on GNU/Linux), so we could jump into that source code and attempt to make sense of it all, or we could simply try invoking main ourselves,
ignoring whatever safeties and features libc gives us. This way we get to understand what is actually necessary to invoke main versus what is merely nice to have. And hopefully it will make reading
the source code easier, as we will know roughly what the libc code that runs before main is trying to achieve.
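To make the register shuffling in _start a little more concrete, here is a rough C sketch of the same bookkeeping. It assumes the x86-64 Linux convention that the kernel leaves argc at the top of the stack, followed by the argv pointers, a NULL, and then the environment pointers. The names startup_args and read_initial_stack are made up for illustration; glibc does not expose anything like them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for what the startup code hands to main. */
struct startup_args {
    int    argc;
    char **argv;
    char **envp;
};

/*
 * sp points at the first word of the initial stack:
 *   sp[0]            argc (stored as a machine word)
 *   sp[1..argc]      argv pointers
 *   sp[argc + 1]     NULL
 *   sp[argc + 2...]  envp pointers, NULL terminated
 */
struct startup_args read_initial_stack(char **sp)
{
    struct startup_args args;

    assert(sp != NULL);
    args.argc = (int)(intptr_t)sp[0];     /* what "pop rsi" picks up   */
    args.argv = sp + 1;                   /* what "mov rdx,rsp" leaves */
    args.envp = args.argv + args.argc + 1; /* skip argv and its NULL    */
    return args;
}
```

In the real _start, pop rsi plays the role of reading sp[0] and mov rdx,rsp is the sp + 1; everything else is left for __libc_start_main to work out.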
$ gcc -nostdlib -o hello hello.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001030
/usr/bin/ld: /tmp/ccANS8XH.o: in function `main':
hello.c:(.text+0x1e): undefined reference to `puts'
collect2: error: ld returned 1 exit status
Right. We are using printf() to output to our screen, and that is a libc function. But we're in luck: we can skip it entirely and do the output directly (losing format specifiers and such, but we just want a simple string). This does mean a jump straight into assembly, since functions like puts() or write() are still libc wrappers. To make the syscall directly we must use assembly. Fortunately the interface is standardized and well documented.
#define SYS_write 1
#define SYS_exit 60

char hello[] = "hello, world\n";

int main (int argc, char *argv[])
{
    long ret;

    /* the write wrapper looks something like this:
     *   ssize_t write(int fd, const void *buf, size_t count)
     * and the syscall interface is like:
     *   syscall(long syscall_id, ...)
     * with the parameters varying based on which syscall. In this case write
     * takes 3 parameters so it would appear:
     *   syscall(SYS_write, int fd, const void *buf, size_t count)
     */
    asm volatile (
        /* the asm instruction */
        "syscall"
        /* output: the kernel leaves the return value in rax; we ignore it for now */
        : "=a"(ret)
        /* inputs are syscall_id (same register as the output), fd, buf, and len;
         * sizeof(hello) - 1 leaves out the string's trailing NUL byte */
        : "0"((long)SYS_write), "D"(1), "S"(hello), "d"(sizeof(hello) - 1)
        /* the syscall instruction itself overwrites rcx and r11 */
        : "rcx", "r11", "memory"
    );
    (void)ret;

    /* we can't return! there is no libc to return to, so we must call the exit syscall */
    asm volatile (
        "syscall"
        :
        /* inputs are syscall id and exit status (0 for no errors) */
        : "a"(SYS_exit), "D"(0)
        : "rcx", "r11", "memory"
    );
}
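The two asm blocks above share the same shape, so they could be folded into one small helper. What follows is only a sketch, assuming x86-64 Linux and GCC or Clang inline asm; raw_syscall3 and the NR_* names are made up here and are not part of any library. It also shows the raw return convention: on failure the kernel hands back a negative errno value rather than setting errno.

```c
/* Syscall numbers for x86-64 Linux (hypothetical names, real values). */
enum { NR_WRITE = 1, NR_GETPID = 39, NR_EXIT = 60 };

/*
 * Issue a syscall with up to three arguments.
 * rax carries the syscall number in and the result out;
 * rdi, rsi, rdx carry the arguments; the syscall instruction
 * itself overwrites rcx and r11.
 */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;

    __asm__ volatile (
        "syscall"
        : "=a"(ret)
        : "0"(nr), "D"(a1), "S"(a2), "d"(a3)
        : "rcx", "r11", "memory"
    );
    return ret; /* >= 0 on success, -errno on failure */
}
```

With a helper like this, the body of main would shrink to a raw_syscall3(NR_WRITE, 1, (long)hello, sizeof(hello) - 1) followed by a raw_syscall3(NR_EXIT, 0, 0, 0).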
If we try our basic compilation without stdlib there is a linker warning:
$ gcc -nostdlib -o hello hello.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
Since there is no libc to supply a _start entry point that eventually calls main, we can tell the linker to use our main as the entry point:
gcc -nostdlib -Wl,--entry=main -o hello hello.c
Running this gives us our expected output! This is great, but why is the binary still large?
$ wc -c hello
13848 hello
gcc and ld still emit more than just our code, so let's drop C altogether and write the same program directly in assembly:
; ASM introduction with hello world but in 64-bit
section .text
global _start

;remember, these are 64-bit registers and syscalls in linux!
_start:
    mov rdx, len    ;len: as you can imagine length of our string
    mov rsi, msg    ;buf: address of our hello world string
    mov rdi, 1      ;fd: 1 is stdout
    mov rax, 1      ;in 64-bit land, sys_write is 1
    syscall
    mov rdi, 0      ;we toss 0 in the error code param and assume all is well
    mov rax, 60     ;60 is sys_exit
    syscall

section .data
msg db 'hello, world',0xa   ;the lovely string appended with a LINE FEED (newline)
len equ $ - msg             ;funny way to easily calculate length of the string
I threw in some inline comments to help out, but this is very specific to 64-bit x86. If you are on 32-bit, the syscall numbers change (as you would
have found out when trying to compile the previous C file with its inline asm).
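As a quick reference for how much changes between the two, here is a small lookup table written as C. The struct and function names are made up for this post; the numbers and registers are the Linux ones for the write and exit syscalls on each architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record describing how one architecture enters the kernel. */
struct syscall_abi {
    const char *arch;
    const char *insn;   /* instruction that traps into the kernel     */
    const char *regs;   /* syscall number, then first three arguments */
    int nr_write;
    int nr_exit;
};

static const struct syscall_abi abis[] = {
    { "x86-64", "syscall",  "rax, rdi, rsi, rdx", 1, 60 },
    { "i386",   "int 0x80", "eax, ebx, ecx, edx", 4, 1  },
};

/* Returns the matching entry, or NULL if the architecture is not in the table. */
const struct syscall_abi *find_abi(const char *arch)
{
    assert(arch != NULL);
    for (size_t i = 0; i < sizeof abis / sizeof abis[0]; i++) {
        if (strcmp(abis[i].arch, arch) == 0)
            return &abis[i];
    }
    return NULL;
}
```

So a 32-bit version of the assembly above would load 4 and 1 into eax instead of 1 and 60 into rax, pass its arguments in ebx, ecx and edx, and use int 0x80 in place of syscall.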
But wait, there's no main! Much like the previous exercise, we need to define the entry point, and in this case why not just name our function _start, the standard
entry point that the ld linker expects. You could in theory name it whatever you want, and you'll see soon how to adjust accordingly. First, let's build
this asm file:
nasm -f elf64 -o hello.o hello.asm
ld -s -o hello hello.o
We have to run the assembler and linker phases manually since we are no longer using gcc. Once you run those two commands you'll end up with a binary named hello,
and running it should print the fine, wonderful "hello, world" to your screen. But it is still big.
$ wc -c hello
8488 hello
At this point we have pretty much found our path to main. The missing piece, you might say, is "well, how do I actually get there when I enter ./hello on the command line?". The answer to that is somewhat shell dependent (I know, not a direct answer, sorry), but on Linux your shell typically makes an execve() syscall; its man page is worth a peek. This syscall does some ELF parsing and then jumps to the entry point, beginning the stages we walked through above.
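What the shell does can be sketched in a few lines of C. This is only a sketch, assuming a POSIX system: run_and_wait is a hypothetical helper name, and a real shell also handles PATH lookup, redirections, job control and much more.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at path with the given argv and envp, wait for it,
 * and return its exit status (or -1 if it could not be run or did not
 * exit normally).
 */
int run_and_wait(const char *path, char *const argv[], char *const envp[])
{
    assert(path != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace this process image with the new program.
         * The kernel parses the ELF file and jumps to its entry point. */
        execve(path, argv, envp);
        _exit(127); /* only reached if execve failed */
    }

    /* Parent: wait for the child, as a shell does for a foreground job. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with "./hello" as the path and an argv of { "./hello", NULL } would be the rough equivalent of typing ./hello at the prompt.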
$ readelf -S hello
There are 4 section headers, starting at offset 0x2028:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS 0000000000401000 00001000
0000000000000027 0000000000000000 AX 0 0 16
[ 2] .data PROGBITS 0000000000402000 00002000
000000000000000d 0000000000000000 WA 0 0 4
[ 3] .shstrtab STRTAB 0000000000000000 0000200d
0000000000000017 0000000000000000 0 0 1
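readelf finds this table by reading the ELF header at the very start of the file, and the kernel reads the same header to find the entry point. A minimal sketch of that first step in C, assuming a Linux system that ships <elf.h> and a file whose byte order matches the host's; elf_summary and read_elf64_header are names invented for this example.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical summary of the header fields we care about here. */
struct elf_summary {
    uint64_t entry;  /* e_entry: address where execution starts       */
    uint64_t shoff;  /* e_shoff: file offset of the section headers   */
    unsigned shnum;  /* e_shnum: number of section headers            */
};

/* Returns 0 and fills *out on success, -1 if buf is not a 64-bit ELF image. */
int read_elf64_header(const unsigned char *buf, size_t len, struct elf_summary *out)
{
    Elf64_Ehdr eh;

    assert(buf != NULL && out != NULL);
    if (len < sizeof eh)
        return -1;
    memcpy(&eh, buf, sizeof eh);

    /* Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;
    if (eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    out->entry = eh.e_entry;
    out->shoff = eh.e_shoff;
    out->shnum = eh.e_shnum;
    return 0;
}
```

For the hello binary above, shnum and shoff are where readelf gets its "4 section headers, starting at offset 0x2028" line.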
The assembly we wrote defined two sections, .text (which contains the executable code) and .data (which contains our string), and we see both here. Note that len is an assemble-time constant, so it takes up no space in the file: .data is 0xd bytes, exactly the 13 bytes of "hello, world" plus the newline. However, to get really in the weeds we can write a linker script that tells ld exactly how to lay out the binary. With this we can potentially start to trim more off the size.
OUTPUT_FORMAT("elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)
SECTIONS {
.text 0x400000 : AT(0x400000)
{
*(.text*)
}
.data :
{
*(.data*)
}
/DISCARD/ : { *(.plt*) *(.iplt*) *(.rela*) *(.got*) *(.igot*) *(.shstrtab*) }
}
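Dump that into a hello.ld and we have a linker script. A swift overview: ENTRY names the entry symbol, SECTIONS places .text at 0x400000 followed by .data, and /DISCARD/ throws away input sections we don't need. Hand it to ld with -T: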
ld -s -T hello.ld -Map=- -o hello hello.o
You may be surprised by the sudden output to stdout here but this comes from the -Map option to the linker. This option tells ld to dump its linker map so we know what goes where, and -Map=- sends it to stdout rather than a file.
With that output we see text, data, the offsets, and what was discarded along with our output elf64. So what is our size now?
$ wc -c hello
4432 hello
It is getting smaller, that's for sure. We started at 16K and are now down to 4.4K, about 28% of the original size. But this still feels like too much. Can we trim even more? Answering that takes us past our original question of getting to main and into hyper-optimization of our ELF for the sake of shrinking it. Maybe it is a quest that helps in understanding how we get to main and what the minimum requirements are.