pcc/ how-it-works

Converting a C program into a binary executable takes several steps. The basic process is:

pre-processor -> compiler -> assembler -> linker

This process is performed by pcc. You can see each of these steps being performed by invoking pcc with -v:

$ pcc -v helloworld.c

pcc is actually a "wrapper" program which invokes other programs to perform stages of the compilation process.

The C source file

The C source file contains the following information:

The pre-processor

The pre-processor is the first step of the compilation process. It reads the C source file and processes the pre-processor directives. The most-familiar directives are #include and #define.

You can see the output after the pre-processing step by invoking pcc with -E:

$ pcc -E helloworld.c

The pre-processed result is called helloworld.i.

The compiler proper

The real compilation work is performed by the compiler proper. The compiler takes the pre-processed C source and generates assembly language code suitable for the system assembler. The style and syntax of assembly language varies greatly between machines and operating systems.

You can see the output after the compilation step by invoking pcc with -S:

$ pcc -S helloworld.c

The compiled result is called helloworld.s.

The assembler

The assembler takes the assembly language program and creates object files. Object files contain machine code, but not yet in a format that can be executed. These object files can be stored in libraries for later use.

You can see the output after the assembler step by invoking pcc with -c:

$ pcc -c helloworld.c

The assembled result is called helloworld.o.

The linker

The final step in the compilation process is to link the object file with the system library and startup files to generate the executable binary.

The system library is generally called libc and it is a library of other object files containing machine code. The startup files are object files containing machine code to create an environment for the program. Ever wondered where argc and argv come from? The startup code obtains this information from the operating system and builds the parameters before invoking main().

Putting it all together

Consider the following command-line using pcc to compile a program:

$ pcc -g -O -I/usr/local/include -D_DEBUG -Wl,-R/usr/local/lib prog.c

The first option is -g which instructs the compilation process to compile with debugging information. This option is not needed by the pre-processor, but is used by the compiler, assembler and sometimes the linker to put debugging information in the executable. This information can be used by a symbolic debugger.

The second option is -O which enables compiler optimizations. It is only used by the compiler and is not needed by the pre-processor, assembler or linker.

The third option is -I/usr/local/include which specifies an additional directory to find include files. Since the pre-processor handles the #include directive, this option is only used by the pre-processor and is ignored by the compiler, assembler and linker.

The fourth option is -D_DEBUG which defines the pre-processor macro _DEBUG. It is only used by the pre-processor and is not needed by the compiler, assembler or linker.

The fifth option is -Wl,-R/usr/local/lib. The -Wl, prefix tells pcc to pass the rest of the option through to the linker, and -R/usr/local/lib instructs the linker to record this directory in the executable for use by the dynamic linker. It is only used by the linker and is not needed by the pre-processor, compiler or assembler.

From this analysis, we can appreciate the steps required to achieve compilation and how all those command-line arguments work together to generate a binary executable.

If pcc ever has a problem generating a binary executable, the problem can be traced to one of these steps of the build process.

TODO: need to add a section about the optimizer.