Wednesday, January 26, 2011

Stages of Compilation in Linux using gcc

A program you write cannot run until it is compiled. On Linux machines the usual compiler is GCC, the GNU Compiler Collection, which compiles C, C++, Java, Fortran and other program code on Unix and GNU/Linux machines. It is distributed as Free Software under the GNU General Public License (GNU GPL). Knowing the compilation stages step by step is useful for a developer and even for a beginner.
        During compilation we go through four stages, and each stage uses a tool to translate the code from one form to the next until we reach the loadable binary image (binary file) that executes on the target architecture. Because a sequence of tools is used, it is called the GNU tool chain. Understanding the stages of compilation also helps when cross-compiling code.
The steps below show the compilation process using the gcc compiler.
Source file:
        It contains the source program in text format. It can be in any supported language: C, C++, etc. For example, first.c is a C source file.


Step1:
Pre-processing: (here we use cpp tool)
  • it helps in creating fast and efficient code, because macros are expanded before compilation and cost nothing at run time.
  • it reads the header files named in #include directives and copies their content into the pre-processed source file.
  • all macros and constant symbols are replaced by their definitions.
  • all conditional pre-processor directives (#if, #ifdef, #ifndef and so on) are evaluated, and only the selected code is kept.
$ gcc -E first.c -o first.i 

    -E option halts the compilation after the pre-processing stage. Refer to the man page.
    -o option writes the output to the new file first.i.


first.i contains the entire header file content + code. To see the sequence of steps gcc follows while generating first.i:
$ gcc -v -E first.c -o first.i
    -v option stands for verbose.
 

Step2:
Compilation (to assembly code):
(here we use compiler tool)

  • Takes the pre-processed file and creates a file with '.s' extension, called the assembly file.
  • This is the stage where the code is optimized (for speed and space).

$ gcc -S first.i -o first.s

     -S option halts after the compilation stage, before the assembler runs.

Step3:
Relocatable Binary:
(here we use assembler tool)

  • contains the machine code with offset addresses, which are assigned at compile time. The object dump of first.o shows these offset addresses.
For example, relocatable code contains an instruction such as call 19<>. Its final position depends on where main is finally placed.
This file contains the machine code assembled from first.s; calls to library routines are left as unresolved references for the linker.

$ gcc -c first.s -o first.o

Note: first.o is not human-readable. To view its content we use the tool "objdump", a binary disassembler.

$ objdump -D first.o
    -D option stands for disassemble-all (every section, not only the code sections); refer to the man pages.
 

Step4:
Linking:
(here we use linker tool)

  • the linker tool is used to build the executable image. It packages the object code into loadable binary code that can be loaded and executed.

$ gcc first.o
    gcc first.o creates a.out by default. To get an executable with a specified name
we can give
$ gcc first.o -o first (here first is the executable name we specified)

  • This executable (first) is shown in green by ls when colored output is enabled.
  • This loadable file contains loadable addresses in the form of segment and offset, called absolute addresses.
  • Entries for calls to shared library functions are present in the PLT, the procedure linkage table.
  • The executable file also contains some run-time startup code and references to the shared libraries it needs.
  • This file is created by the linker, whose output is OS dependent.
    To view the content of the executable first page by page:
$ objdump -D first | more
 


Observations:
So finally we have five different files: first.c, first.i, first.s, first.o and first. We shall
check these file formats using the tool file.
(the whole sequence of steps is shown together)

$ gcc -E first.c -o first.i
$ gcc -S first.i -o first.s
$ gcc -c first.s -o first.o
$ gcc first.o -o first
$ file first.c
first.c: ASCII text
$ file first.i
first.i: ASCII C program text
$ file first.s
first.s: ASCII assembler program text
$ file first.o
first.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
$ file first
first: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped
$

$ objdump -D first.o | more
first.o: file format elf32-i386
Disassembly of section .text:
00000000 <main>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 e4 f0                and    $0xfffffff0,%esp
   6:   83 ec 10                sub    $0x10,%esp
   9:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
  10:   e8 fc ff ff ff          call   11 <main+0x11>
  15:   c9                      leave
  16:   c3                      ret
--More--
$ objdump -D first | more (go down and see <main> section)
080483e4 <main>:
 80483e4:   55                      push   %ebp
 80483e5:   89 e5                   mov    %esp,%ebp
 80483e7:   83 e4 f0                and    $0xfffffff0,%esp
 80483ea:   83 ec 10                sub    $0x10,%esp
 80483ed:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
 80483f4:   e8 fc ff ff ff          call   11 <main+0x11>
 80483f9:   c9                      leave
 80483fa:   c3                      ret
--More--

       You can view the machine instructions here. The important thing to observe is the address at the extreme left of each line. In first.o it is an offset address (in hexadecimal), which is relocated (remapped) to a virtual address by adding the offset to the base address at which the section is placed.

       We have obtained the executable (first) from the relocatable (first.o), and in first the addresses of the instructions are mapped to virtual 32-bit addresses. The step 3 output (first.o) is therefore called relocatable, because its offset addresses are remapped to virtual addresses. The linker does the job of relocating offset addresses to platform-specific addresses. The virtual address concept is a huge, interesting and important topic of discussion, which I will post about soon. :-)


Note: The steps from .c to .o work the same way on any architecture, although the .s and .o files they produce contain code for the target architecture. The executables are in addition specific to the platform.

Please leave a comment :-) Queries are free of cost.

7 comments:

  1. Hi GVK51,

    Thanks for the beautiful explanation.

    I have some questions:

    1) Which occurs first, loading or linking? Is there a rule of thumb or any condition that determines which of them executes first? Please elaborate.

    2) What are the offset address and the base address? How do you add the offset to the base address? Can you please explain with an example?

    3) Can you please explain the above scenario with respect to the program counter for each instruction, maybe by using GDB?

    4) Can you also cover .o and .ko with respect to the above program?

    5) Please explain the difference between shared object files (.so) and DLL files (.dll). I have read about it, but I am expecting a more convincing reply.

    Thanks in advance.

    Warm Regards,
    Marc

  2. Hi Marc,

    Please refer to the following post; it may clear up a few of your doubts.

    http://www.linuxkernel51.blogspot.in/2012/11/static-and-dynamic-libraries-in-c-in.html

    2) -- A program is a set of instructions (assembly) that is loaded into your RAM during execution, and every instruction is placed at a memory location.

    An address that serves as a reference point for other addresses is called a base address. For example, a base address could indicate the beginning of a program. The address of every instruction in the program can then be specified by adding an offset to the base address. For example, an instruction that starts 3 bytes after the beginning would be at the base address plus 3.

    3) With gdb you can debug your code step by step.

    For example, if you want to debug your code, compile it with the option -g

    $ gcc -g test.c -o test
    $ gdb test

    You can find a lot of information and tutorials on using gdb.

    4) .ko is a kernel module. I didn't understand in which context you are using this.

    5) Please refer to the above link in this comment.

    Thanks,
    GVK51

  3. adding, to above comment:

    GCC's external interface is generally standard for a UNIX compiler. Users invoke a driver program named gcc, which interprets command arguments, decides which language compilers to use for each input file, runs the assembler on their output, and then possibly runs the linker to produce a complete executable binary.
    refer_to: http://en.wikipedia.org/wiki/GNU_Compiler_Collection

  4. I do not even know how I ended up here, but I thought this post was great.
    I don't know who you are but definitely you're
    going to be a famous blogger if you are not already ;) Cheers!
    MCX Tips

  5. Hi GVK51,

    Few questions:

    Does the OS have any role in converting the offset to a virtual address? I mean, is the linker dependent on the OS in any way?

    Is the virtual address obtained here final, or do we change something (add/change) in it?

  6. The linker is dependent on the OS in terms of the virtual address range that has to be allocated. The virtual address created at the linking stage is final and is specific to that machine.

  7. hi there
    thanks for the detailed explanation
