What’s “this” file?

You don’t need a file extension to determine a file’s type. A file is just a sequence of bytes, and the contents of that sequence depend on the type of data.

Magic bytes are the first few bytes of a file that identify its type. Plain text formats such as markdown don’t have magic bytes since they are just text data, but most binary formats do.

What if two formats share the same prefix bytes? How do you break the tie? It’s mostly a heuristic in that case, but that is more of an edge case.

Here’s an example that takes a file, reads the first few bytes and prints them.

#include <stdio.h>
#include <stdlib.h>
 
int main(int argc, char* argv[]) {
    if (argc != 3) {                                                      // 0th is program, 1 is path, 2 is num_bytes
        fprintf(stderr, "Usage: %s <file_path> <num_bytes>\n", argv[0]);  // note the way to get name
        return 1;                                                         // non zero is error
    }
 
    char* file_path = argv[1];
    int num_bytes = atoi(argv[2]);  // ascii to int
 
    if (num_bytes <= 0) {
        fprintf(stderr, "Error: Number of bytes must be positive\n");
        return 1;
    }
 
    FILE* file = fopen(file_path, "rb");  // read & binary mode
    if (file == NULL) {
        perror("Error opening file");  // to get error from the syscall
        return 1;
    }
 
    printf("First %d bytes of '%s':\n", num_bytes, file_path);
    printf("Hex view: ");
 
    for (int i = 0; i < num_bytes; i++) {
        // fgetc returns an int rather than a char so that EOF ( a negative value,
        // -1 on most systems ) cannot collide with any valid byte value 0-255
        int byte = fgetc(file);  // get 1 character, which is same as 1 byte
        if (byte == EOF) {       // EOF is usually -1
            printf("\n(End of file reached after %d bytes)\n", i);
            break;
        }
        printf("%02X ", (unsigned char)byte);  // print as 0 padded, at least 2 wide, uppercase hex
    }
 
    // print as ascii
    rewind(file);  // reset file pointer to start
    printf("\nASCII view: ");
 
    for (int i = 0; i < num_bytes; i++) {
        int byte = fgetc(file);
        if (byte == EOF) {
            break;
        }
 
        // to print only printable chars
        if (byte >= 32 && byte <= 126) {  // printable ASCII range
            printf("%c", byte);
        } else {
            printf(".");  // non-printable characters as dots
        }
    }
    printf("\n");
 
    fclose(file);
    return 0;
}

When run on self this gives:

First 20 bytes of 'a.out':
Hex view: 7F 45 4C 46 02 01 01 00 00 00 00 00 00 00 00 00 03 00 3E 00
ASCII view: .ELF..............>.

Magic bytes are often chosen so that their ASCII representation has some readable characters, though this is by no means a requirement. Comparing this with the list of file signatures on Wikipedia, it matches the ELF signature there as well.

If I have a dictionary of magic bytes, I can similarly write a simple version of the file command.

ELF

Executable and Linkable Format is the format used for executables, object code and shared libraries on Linux.

You can do man elf for more details, but a lot of that info and the constants used in ELF are defined in /usr/include/elf.h on Linux and can be pulled in with #include <elf.h>.

Some tools to read ELF files

  • readelf - read ELF files
  • objdump - disassemble ELF files
  • nm - list symbols in ELF files
  • ldd - list shared libraries used by ELF files

There are a lot of options and I’ll add the commands for the different breakdowns I do.

Problem with investigating ELF for a simple C program

Ideally you would expect that reading the ELF of such a program would be instructive, but it’s not, since modern compilers add a LOT of stuff.

For a simple empty program,

int main() {}

file a.out gives:

a.out: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=ec281dd54a98d7a95a07d5cb16867b142a663e45, for GNU/Linux 4.4.0, not stripped
  • LSB means least significant byte first, which is little endian
  • pie means position independent executable, which means it can be loaded at any address in memory ( this is what lets ASLR - Address Space Layout Randomization - randomize where the executable itself is loaded )
  • dynamically linked means it uses shared libraries
  • interpreter is the dynamic linker/loader /lib64/ld-linux-x86-64.so.2, which loads those shared libraries before the program starts
    • Wait, so this program is not standalone? Yes.
  • not stripped means the symbol table and other info are still present, for example I can do nm a.out and get the symbols in the file.

If I do readelf -h a.out, this gives me an overview of the ELF file header.

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1020
  Start of program headers:          64 (bytes into file)
  Start of section headers:          13440 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         14
  Size of section headers:           64 (bytes)
  Number of section headers:         28
  Section header string table index: 27

This checks out with what the file command showed, though it has a LOT of headers ( 14 program headers and 28 section headers ).

What’s also worth noting is that the type is DYN, which is the type used for shared objects. A pie is marked DYN as well since it can be loaded at any address. ls -l gives me a size of 15232 bytes. I can use ldd a.out to see a list of all dependencies actually needed for this ( note that ldd might actually execute the program, so don’t run it on untrusted binaries ):

linux-vdso.so.1 (0x000071daf4002000)
libc.so.6 => /usr/lib/libc.so.6 (0x000071daf3dcc000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x000071daf4004000)

So these must be from the ELF file as well, right? Yes, for some.

For example, readelf -d a.out prints the dynamic section of the ELF file, which contains info on what to load dynamically:

Dynamic section at offset 0x2e20 contains 22 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 (more stuff)

This checks out with objdump -p ./a.out | grep NEEDED as well.

NEEDED               libc.so.6

libc.so.6 is the C standard library.

What are the other two entries then that ldd figured out?

  • linux-vdso.so.1: Refer vdso. This is a virtual shared object that the kernel maps into every process to make some syscalls faster, so there is no file for it on disk.
  • /lib64/ld-linux-x86-64.so.2: This is the dynamic linker/loader. Its path is recorded in the ELF file, but it is not listed as NEEDED in the dynamic section since it is not a library the program links against. It is the program that the kernel runs first to load the NEEDED libraries.

I can do readelf -l a.out to get the program headers ( aka segments ), where I can see the loader’s path in the INTERP ( for interpreter ) segment.

INTERP         0x00000000000003b4 0x00000000000003b4 0x00000000000003b4
                0x000000000000001c 0x000000000000001c  R      0x1
    [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

Next, I compile with gcc -static to inspect a static executable, which does not depend on any shared libraries.

file now gives

ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked

so it’s statically linked and not a pie either. readelf -h shows the type as EXEC now instead of DYN. ldd gives not a dynamic executable, and similarly the readelf -d and readelf -l checks above no longer show a NEEDED entry or an INTERP segment.

This makes sense: the C library is now baked in, so you don’t need the dynamic loader. This also increases the size to 778144 bytes ( ~51 times more ).

What about the vdso? I would assume it’s still needed but ldd does not show it anymore.

If I use the following code to keep the program from exiting,

#include <stdio.h>
#include <unistd.h>
 
int main() {
    printf("Static binary test - check /proc/%d/maps\n", getpid());
    printf("Press Enter to exit...\n");
    getchar();
    return 0;
}

Then doing,

./a.out & # run in the background, it waits for input instead of exiting
cat /proc/{PID from program}/maps

gives me all the memory mappings while the program is running, including the vdso ( there’s stack and heap as well, which I cover later ):

7b7e7bdc0000-7b7e7bdc2000 r-xp 00000000 00:00 0                          [vdso]

What does this text mean?

  • 7b7e7bdc0000-7b7e7bdc2000 is the memory range where the vdso is mapped, so start and end.
  • r-xp means it’s readable, executable and private. The p marks a private ( copy-on-write ) mapping as opposed to a shared one ( s ). For the vdso it’s not really private in practice, since the kernel maps the same pages into every process.
  • 00000000 is the offset in the file where this mapping starts.
  • 00:00 is the device number ( major:minor ).
  • 0 is the inode number.

Last three are all 0 since it’s not a real file on disk.

Back to writing my ELF

So I cannot learn much by investigating the ELF of even a simple C program. Since the compiler overcomplicates things, let me simplify it by writing one by hand.

A pre-requisite to writing an ELF directly is being able to write assembly, since the .text section holds machine code, which I’ll write as assembly. Refer assembly.