Tuesday, December 23, 2008

Emacs as a debugger

Emacs is a beast of many, many faces. No one can really understand the whole thing. There are so many packages, so many functions, so many details… but this post is simple, really: it’s about using Emacs as a debugger for your C/C++ programs.

Surely you can do all of your debugging using GDB directly. GDB is a great program and, once you master it, it’s not that bad. However, having a debugger integrated with your favorite editor is even better: all of your files are already there, it’s orders of magnitude faster to add breakpoints and it’s far more comfortable to step over/into lines of code and see them on the screen, highlighted and all. Most certainly, you have seen how those wicked IDEs work. If you long for their pretty buttons or their “user-friendliness”, there isn’t much Emacs can do for you; if, on the other hand, you just want to debug your code within Emacs and easily set breakpoints and run your programs, then read on!

Let’s say you wrote the following program and stored it in a file named x.cpp:

#include <iostream>

int main()
{
int i = 3;
std::cout << i << std::endl;
i++;
std::cout << i << std::endl;
}

In order to compile it, you can type M-x compile (where M-x most probably means “Alt-x” for you, and where compile should be followed by Enter) from Emacs and then alter the compile command to something like:

Compile command: g++ -g3 -ggdb -o x x.cpp

The -g3 -ggdb part asks GCC for the most detailed debugging information it can produce for GDB (-g3 also records macro definitions), which is sometimes more useful than a plain -g. This makes your executable somewhat bigger, but in most cases that isn’t a problem.

Now for the real debugging: GDB is available from Emacs thanks to GUD (Grand Unified Debugger), which you invoke by typing M-x gdb. You will then be asked to confirm the GDB command line in the minibuffer (the status bar-like area at the bottom). In many cases, GUD will guess the executable file by itself and all you have to do is press Enter; if it guesses wrong or if it shows no guess at all, then just type the path for the executable you want to debug (TAB autocompletion for files works). In our example, the final GDB command line would look like this in the minibuffer:

Run gdb (like this): gdb --annotate=3 x

I think the --annotate=3 part is there because GUD needs it. I don’t know what happens if you turn it off! In any case, after pressing Enter, you should see GDB starting and finally showing a (gdb) prompt. You can now use it much in the same way you would if you were in a regular terminal, one notable exception being that you must use Ctrl-Up (not just Up) to navigate through the history.

After GDB has started, take the following steps:

  1. Type C-x 2 (where C is Control). This will split the current window into two, allowing you to view both GDB and your program.
  2. Visit your file with C-x C-f.
  3. Switch from one window to the other by typing C-x o.

When running GDB with GUD, you can invoke many GDB commands by using those provided by GUD. So if you’re currently browsing x.cpp, you can set a breakpoint by positioning the point (cursor) in a source code line and typing M-x gud-break. GUD will communicate with GDB and execute the appropriate command and will also display an arrow (in graphics mode) or a red “B” to the left of the line to indicate there’s a breakpoint there.

All important GUD functions are bound to a key combination. You can check all of them at this page of the Emacs manual. Alternatively, you can set arbitrary key combinations to execute any Emacs function you like, including the GUD ones. If you would like F5 to add a breakpoint to the current line, you can do so by typing M-x global-set-key. Emacs will first ask you what key you want to bind and then what function that key should be bound to. In this example, you would press F5 and then type in gud-break. Beware, though, that GDB should be running while you do it, otherwise the gud-break command may not have yet been loaded.

A final tip: if your debugging often freezes for a few seconds when you try to step into a function (inside or outside Emacs!), try setting the environment variable LD_BIND_NOW to a non-empty value before running your program. You can do it from within GDB by typing the command:

(gdb) set env LD_BIND_NOW 1
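Outside GDB, the same variable can be set for a single run straight from the shell. A minimal sketch; run_bind_now is a made-up helper name, not something GDB or Emacs provides:

```shell
# Hypothetical helper: run one command with LD_BIND_NOW set, so that the
# dynamic linker resolves all symbols at startup instead of on first call.
run_bind_now() {
    LD_BIND_NOW=1 "$@"
}
```

For instance, run_bind_now ./x would start the example program with eager binding, without touching the environment of the rest of the shell session.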

Good debugging!

Wednesday, September 17, 2008

High resolution screen shots

One of the most complicated configuration files a Linux user has to deal with is /etc/X11/xorg.conf. There are just so many options, so many sections. Usually, though, a user seldom has to change it. Unfortunately, today was one of those times.

I needed to take a very high resolution screen shot of a rendering, so that it would still look good on an A0-sized poster. Now, you see, this is not the place to ask why I couldn’t render whatever I was rendering directly onto a file—sometimes you just don’t have enough time to do the right thing. So what I tried to do was to use a desktop larger than my monitor could support. This is what they call a virtual desktop. I don’t think many people would use it daily, because it feels very weird, but it’s definitely good enough to take a screen shot.

The change to /etc/X11/xorg.conf is very simple. All it takes is adding this to the relevant Screen section:

SubSection "Display"
Virtual 3840 3072
EndSubSection

The sad part is that this should work for most screen shots. But it didn’t for me. It seems that such huge screen sizes make CUDA stop working… so I had to take several smaller screen shots and glue them together.

Saturday, August 16, 2008

When syslogd hangs

Prelude

I wake up at around 8 am. I reach my computer, which had been downloading stuff all night long. All downloads halted. Little or no network activity. When I try to su, it hangs. I try to log in using a text terminal, but it hangs too. Programs are running fine, but yet I can’t be root. What is going on here?

First reactions

I first thought about the hard disk. Hard disk failures cause all sorts of weird symptoms, but this time dmesg didn’t show anything relevant and I could read and write files, so the hard disk didn’t seem to be the problem. Also, there was plenty of free space on all mounted partitions, which almost completely ruled out any hard disk-related problem.

I thought that maybe someone had been able to break into my system. Why on earth was I unable to log into my own computer? I have experienced that kind of problem in the past, but only when using LDAP: if the LDAP server was dead, one wouldn’t be able to su or even to ls. But I was at home, where I had never, ever experienced something like that. To a Linux user, being denied the power of root is like being humiliated.

I unplugged the network cable and did the only thing I could to try to regain control over my own system: a hard reset. When the system booted, everything was working just fine. I tried to look for weird files, for weird commands in my .bash_history files, but nothing. However, when I checked the logs, a big surprise: there was no logging since 11:50 pm, more than 8 hours of silence! If an attacker had indeed broken into my system, then they had forgotten about the “-- MARK --” lines that appear every 20 minutes.

Just to be on the safe side, I decided to use some other computer until I had a decent hypothesis about what had happened.

The problem was syslogd

After a while, I had an insight that the problem was related to logging. That would explain the whole situation: why logging stopped and why, hours later, networking and su weren’t working anymore. Furthermore, the possibility of an attacker actually breaking into my box is quite low. To begin with, there’s nothing generally valuable inside (it’s a personal computer, after all). The only service available to the outside is SSH, which is very safe, especially when you allow only one user to use it and he happens to have a good password. People just can’t break into my system like that. So it has got to be syslogd.

Unfortunately, this is a situation where you can never be sure that no attack has happened. If you find proof, you will know an attack has happened, but there can never be enough evidence to show the converse is true. Any machine may have been attacked—it’s just that the attack may have been so perfect no one has noticed. So instead of becoming paranoid, I decided to find out whether syslogd could really be the problem.

syslogd is responsible for receiving log messages from programs and storing them appropriately. What if it dies, that is, what if there is no syslogd running? It turns out that in that case, things work just fine. If your syslogd is dead, you can still su, for example. There won’t be any logs for that, but things will indeed work.

But what if syslogd is there, but for some reason won’t answer requests? That is possible, since syslogd makes available a socket file for programs: /dev/log. In order to log a few messages, all a program has to do is connect to /dev/log and write to it. The question is: If syslogd stops receiving data from /dev/log, what will happen to the programs writing to it?

If programs writing to a /dev/log from which syslogd is not reading stalled, that would explain what happened:

  • su needs to log every login attempt. Therefore, if it hangs while trying to write to /dev/log, you can’t become root. The same happens when you try to log in using a text terminal.
  • Because I use an ADSL connection and my modem is not configured to be a router, pppd ends up being responsible for a part of my network configuration. pppd also uses syslogd, but mostly when initiating or terminating a connection. Checking the logs, I noticed that the connection being dropped and restored is not a rare event; my connection was probably dropped during that night, but pppd stalled because it could not log that it had found problems. In fact, not only was the internet connection no longer working, but the ppp0 interface was still showing. Poor pppd never got to shut down the interface.

Partial reproduction of the problem

All I needed was to test my hypothesis. To that end, I wrote two programs: one to replace syslogd and the other to write to /dev/log. The former (call it golsys) simply reads one log message from /dev/log and takes 100 seconds to read the following; the latter (call it yadda) writes a big bunch of messages to /dev/log.

I killed syslogd and then ran golsys. The socket file has a buffer, so when I started yadda it was able to write a few messages. But once the buffer was full, it simply stalled! su also stopped working. As for pppd, I disconnected the network cable and the interface was up for a long time. When I killed golsys, it was immediately shut off—pppd was just waiting for an opportunity to quit!
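The stall itself is easy to watch without touching syslogd. The sketch below is only an analogy: it uses an ordinary pipe instead of the /dev/log socket, a shell loop in place of yadda, and sleep as a reader that never reads. The function name and the 1 KiB message size are made up for illustration.

```shell
# Analogy only: a pipe stands in for the /dev/log socket, the loop for yadda,
# and sleep for a golsys that never reads. All names are illustrative.
stalled_writer_demo() {
    secs=$1                          # how long the "reader" stays silent
    count_file=$(mktemp) || return 1
    echo 0 > "$count_file"
    (
        trap '' PIPE                 # get a write error instead of a signal
        n=0
        while :; do
            # One 1 KiB "log message"; this blocks once the pipe is full
            # and fails once the reader has gone away.
            printf '%1024s' '' || exit 0
            n=$((n + 1))
            echo "$n" > "$count_file"
        done
    ) 2>/dev/null | sleep "$secs"
    # Number of messages accepted before the writer stalled.
    cat "$count_file"
    rm -f "$count_file"
}
```

Calling stalled_writer_demo 2 prints how many messages fit into the buffer. The number stays small no matter how long the reader sleeps, because the writer simply sits blocked in its write, just as su and pppd did.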

Final remarks

The idea that a bug in syslogd can bring your whole system to such a halt is a bit annoying. If syslogd is not running, it’s OK; if it stops responding, you can’t even su to kill it!

I still have no idea why syslogd stopped responding. I checked the hard disk for bad blocks. I googled thoroughly, and the only relevant post was this one. It’s from 2005, and maybe the bug described there is still around. From the description, it does look like a bug which would really seldom show up; in that case, I was just unfortunate.

Sunday, July 27, 2008

Making an image of a hard disk drive over the network

Introduction

More than once I’ve sat before a computer and, prior to taking any potentially destructive action, needed to back up a hard disk in its entirety. The best way to do this would be to attach another hard drive and dump the contents of one hard disk onto the other. That would be really simple; if we wanted to back up, say, /dev/sda and we had a big partition in /dev/sdb1, all we would have to do is:

root@saveit:~# mount /dev/sdb1 /mnt/tmp
root@saveit:~# dd if=/dev/sda of=/mnt/tmp/sda.backup

You can always do that—unless you either don’t have a spare hard disk or you can’t open the relevant computer cases. Sometimes, unfortunately, the only way of getting data out of a computer is through the network. (Actually, more often than not you could record 10 DVDs to back up a 40 GB hard disk, but that’s not really an option.) Under Linux, that’s easy and can be done by using the standard tools of every Linux distribution.

Doing it over a network

We will be dealing here with two boxes: one named store, where the backup will be copied to, and one named saveit, whose HD is to be copied. First off, in the store box, we issue the following command¹:

root@store:~# nc -l -p 4444 > img

This is the good old nc, the “TCP/IP swiss army knife”. The meaning of the line above is: listen (“-l”) for an incoming connection on port 4444 (“-p 4444”), and then dump whatever can be read from there into the file img (“> img”).

All that’s left for us to do is send the contents of the hard disk to be copied to the port 4444 of the store computer. Since it’s definitely not a good idea to do this on a system running off the hard disk we want to back up, the following commands are usually run from a bootable CD. Since only very basic commands are used, they can be run from basically any boot CD (although you may face problems with your network card). If the store box’s IP is 192.168.0.2, the following commands should do the trick (assuming no previous fiddling with eth0 has already been done):

root@saveit:~# ifconfig eth0 192.168.0.3 up
root@saveit:~# cat /dev/sda > /dev/tcp/192.168.0.2/4444

The first command simply configures the network; you should pick an available IP address, so that there’s less risk your boss will be mad at you. The second command uses one of the niceties provided by Bash: when using redirections, it’s possible to connect to a remote TCP/IP port as though it were a local file. So cat dumps the hard disk to the standard output, which gets redirected by Bash to the TCP port 4444 at IP address 192.168.0.2.

That redirection trick also resolves names and services as expected. So if you happen to have /etc/hosts and /etc/services properly configured, you could even do things like:

adiel@darkstar:~$ cat ~/invoice > /dev/tcp/www.example.com/fax

(whatever the original intended use of the “fax” port was!)

Notice that you can’t normally use two nc commands. You could reverse the roles and have the saveit box listen for an incoming connection, but you can’t use nc at both ends. The reason is that nc always waits for the other side to hang up. Of course, you could use nc at both ends and then, when you notice all data had been transmitted, you could simply press Ctrl+C. But using the Bash trick at one of the ends solves the problem more nicely.

Finally, beware that there’s no authentication or authorization going on in the process just described.
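Nor does anything in this pipeline check that the image arrived intact. One possible safeguard, sketched here under the assumption that sha256sum (from GNU coreutils) is available, is to compare checksums afterwards; same_checksum is a made-up helper name:

```shell
# Hypothetical helper: succeed only if both arguments (files or block
# devices) have identical contents, judged by their SHA-256 checksums.
same_checksum() {
    sum_a=$(sha256sum < "$1") || return 1
    sum_b=$(sha256sum < "$2") || return 1
    [ "$sum_a" = "$sum_b" ]
}
```

The helper only works when both copies are visible from one machine. With two separate boxes you would instead run sha256sum < /dev/sda on saveit and sha256sum < img on store, and compare the two output lines.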


1. The nc command seems to come in different flavors, so the particular syntax shown here may not work at all for you. Sorry.

Saturday, April 12, 2008

soffice.bin: No such file or directory

I was recently installing OpenOffice but, when I tried to run it, the following happened:

adiel@darkstar:~$ LC_ALL=C /opt/oo/program/swriter
/opt/oo/program/soffice: line 182: /opt/oo/program/javaldx: No such file or directory
/opt/oo/program/soffice: line 240: /opt/oo/program/pagein: No such file or directory
/opt/oo/program/soffice: line 252: /opt/oo/program/soffice.bin: No such file or directory

The weird thing about it is that all the files that the errors mention existed. In this post, I explain the reasons behind these seemingly random messages and how to fix the problem.

The files are there!

When you get a “No such file or directory” error, be it directly or through a program or script, it is usually easy to find a fix. There is a reason for that file or directory not being there, and once you discover that reason, the error will be gone. But what if the file is there, and you're still getting the error?

When I saw the error messages above, the first thing I did was to check whether the files indeed existed:

adiel@darkstar:~$ ls -l /opt/oo/program/{javaldx,pagein,soffice.bin}
-r-xr-xr-x 1 786 261 11K 2008-04-09 20:02 /opt/oo/program/javaldx
-r-xr-xr-x 1 786 261 5.1K 2008-04-09 20:02 /opt/oo/program/pagein
-r-xr-xr-x 1 786 261 442K 2008-04-09 20:02 /opt/oo/program/soffice.bin

So why wasn't the soffice wrapper script finding these programs, if they were really there? The error messages contain full paths, so it can't really be a PATH problem. I tried to run one of the programs which couldn't be found:

adiel@darkstar:~$ /opt/oo/program/soffice.bin
-bash: /opt/oo/program/soffice.bin: No such file or directory

Oh, my! It seems that even bash can't find them, although ls shows that they are there! When one wants to inspect what a program is doing, strace comes in handy. This is what it says about soffice.bin¹:

adiel@darkstar:~$ strace /opt/oo/program/soffice.bin
execve("/opt/oo/program/soffice.bin",
["/opt/oo/program/soffice.bin"],
[/* 49 vars */]) = -1 ENOENT (No such file or directory)
dup(2) = 3
fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
fstat(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b7d7cd2a000
lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
write(3, "strace: exec: No such file or di"..., 40strace: exec: No such file or directory
) = 40
close(3) = 0
munmap(0x2b7d7cd2a000, 4096) = 0
exit_group(1) = ?
Process 3061 detached

Take a look at the very first function being invoked, execve(). It was called by strace itself and, again, the “No such file or directory” error is there!

One begins to wonder what kind of program soffice.bin really is… this is what a hex dump of its beginning looks like:

adiel@darkstar:~$ xxd /opt/oo/program/soffice.bin | head -n 25
0000000: 7f45 4c46 0101 0100 0000 0000 0000 0000 .ELF............
0000010: 0200 0300 0100 0000 e088 0608 3400 0000 ............4...
0000020: ace0 0600 0000 0000 3400 2000 0800 2800 ........4. ...(.
0000030: 1b00 1a00 0600 0000 3400 0000 3480 0408 ........4...4...
0000040: 3480 0408 0001 0000 0001 0000 0500 0000 4...............
0000050: 0400 0000 0300 0000 3401 0000 3481 0408 ........4...4...
0000060: 3481 0408 1300 0000 1300 0000 0400 0000 4...............
0000070: 0100 0000 0100 0000 0000 0000 0080 0408 ................
0000080: 0080 0408 9cfb 0500 9cfb 0500 0500 0000 ................
0000090: 0010 0000 0100 0000 0000 0600 0080 0a08 ................
00000a0: 0080 0a08 80df 0000 50e5 0000 0600 0000 ........P.......
00000b0: 0010 0000 0200 0000 38d2 0600 3852 0b08 ........8...8R..
00000c0: 3852 0b08 9001 0000 9001 0000 0600 0000 8R..............
00000d0: 0400 0000 0400 0000 4801 0000 4881 0408 ........H...H...
00000e0: 4881 0408 2000 0000 2000 0000 0400 0000 H... ... .......
00000f0: 0400 0000 50e5 7464 50ed 0500 506d 0a08 ....P.tdP...Pm..
0000100: 506d 0a08 4c0e 0000 4c0e 0000 0400 0000 Pm..L...L.......
0000110: 0400 0000 51e5 7464 0000 0000 0000 0000 ....Q.td........
0000120: 0000 0000 0000 0000 0000 0000 0600 0000 ................
0000130: 0400 0000 2f6c 6962 2f6c 642d 6c69 6e75 ..../lib/ld-linu
0000140: 782e 736f 2e32 0000 0400 0000 1000 0000 x.so.2..........
0000150: 0100 0000 474e 5500 0000 0000 0200 0000 ....GNU.........
0000160: 0200 0000 0500 0000 0704 0000 0f06 0000 ................
0000170: 0000 0000 0405 0000 0000 0000 d503 0000 ................
0000180: 0806 0000 a302 0000 c601 0000 7c01 0000 ............|...

It looks like a normal ELF binary... but hey! There seems to be a reference to a file there, which does not exist:

adiel@darkstar:~$ ls /lib/ld-linux.so.2
ls: cannot access /lib/ld-linux.so.2: No such file or directory

Why execve() behaves like this

So this is what is going on: execve() is reporting that some file was not found, but not the one file it is supposed to execute. But what is this /lib/ld-linux.so.2 anyway? From the ld-linux man page: “The programs ld.so and ld-linux.so find and load the shared libraries needed by a program, prepare the program to run, and then run it.”

It turns out that a dynamic executable file like soffice.bin specifies its loader (or interpreter), which, in this case, does not exist. This behavior is weird, since, according to POSIX, execve() should only return that specific error (ENOENT) when a component of the path does not name an existing file, or when the path is an empty string.

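The kernel code responsible for this behavior lives in the ELF loader. For ordinary ELF binaries like this one that is, as far as I can tell, fs/binfmt_elf.c, where load_elf_binary tries to open the interpreter named in the executable and passes the resulting -ENOENT back to execve() if it doesn't exist; fs/binfmt_elf_fdpic.c does the same in load_elf_fdpic_binary for FDPIC ELF binaries.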
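A quicker way to spot this kind of problem than reading a hex dump is to ask readelf (from binutils) which interpreter an executable requests, and then check whether that file exists. A rough sketch, assuming readelf is installed and prints its usual "Requesting program interpreter" line; parse_interp and check_interp are made-up names:

```shell
# Pull the interpreter path out of `readelf -l` output read from stdin.
parse_interp() {
    sed -n 's/^ *\[Requesting program interpreter: \(.*\)]$/\1/p'
}

# Report whether the loader requested by the executable "$1" exists.
check_interp() {
    interp=$(readelf -l "$1" 2>/dev/null | parse_interp)
    if [ -z "$interp" ]; then
        echo "no interpreter found (static binary, or not an ELF file)"
    elif [ -e "$interp" ]; then
        echo "loader present: $interp"
    else
        echo "loader missing: $interp"
    fi
}
```

Running check_interp /opt/oo/program/soffice.bin on a system like the one described here should point straight at the missing /lib/ld-linux.so.2.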

Solving the problem

Well, now that we know which file does not exist, we can look for the reason why it's not there. In my case, it is fairly simple: I'm running Slamd64, which is a 64-bit system, and the 32-bit compatibility libraries hadn't been installed. After installing those libraries (available under slackware/c in the DVD), OpenOffice started fine.


1. The first line was separated into three pieces, to avoid overflowing.

Thursday, February 7, 2008

Locales in Linux

Processes in Linux are associated with a locale. A locale carries information on how the program should display and parse dates, numbers, and other information, as well as what character encoding it should use for reading and writing strings. Most users seldom, if ever, deal directly with locales, but knowing a bit about locales can sometimes save you a lot of time and even bring you some fun.

Background

When a program starts, it can ask the GNU C Library to pick its locale from a few environment variables, typically by calling setlocale(LC_ALL, ""); later, it may call setlocale() again to change it. Most of those variables start with the prefix LC_ (the exception is LANG). So, at the Bash prompt, you can see what the locale environment variables are currently set to by typing

adiel@darkstar:~$ set | grep LC_

The values of all locale variables have the same format¹:

ll[_TT][.ENCODING]

Here, ll is a two-letter language code (which I believe is an ISO 639 code); _TT is another two-letter code, this time for the territory or country; and finally ENCODING specifies a character encoding.
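To make the format concrete, here is a small sketch that splits a locale name into its three parts using only shell parameter expansion. It deliberately ignores the complications mentioned in the footnote (such as modifiers), and parse_locale is a made-up name:

```shell
# Hypothetical helper: split "ll[_TT][.ENCODING]" into its parts.
parse_locale() {
    name=$1
    territory=
    encoding=
    case $name in
        *.*) encoding=${name#*.}; name=${name%%.*} ;;   # text after the dot
    esac
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;  # text after the "_"
    esac
    echo "language=$name territory=$territory encoding=$encoding"
}
```

For example, parse_locale pt_BR.UTF-8 prints language=pt territory=BR encoding=UTF-8, while a bare name such as C leaves the territory and encoding empty.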

Each variable controls a different aspect of the locale. The variable LC_TIME, for example, controls the parsing and formatting of dates and times. The date command, for instance, outputs the current date and time respecting the locale:

adiel@darkstar:~$ LC_TIME=en_US.UTF-8 date
Wed Apr  2 20:40:25 BRT 2008
adiel@darkstar:~$ LC_TIME=pt_BR.UTF-8 date
Qua Abr  2 20:41:21 BRT 2008
adiel@darkstar:~$ LC_TIME=ru_RU.UTF-8 date
Срд Апр  2 20:41:36 BRT 2008

There's also a variable, LC_ALL, that overrides all others, that is, programs will use whatever locale you specify there, disregarding all other LC_* variables. To know how each aspect of the locale will actually be controlled when programs are run, you can use the locale command:

adiel@darkstar:~$ locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=el_GR.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

The LC_ALL and LANG variables only determine the actual values of other locale variables, but they are there anyway.
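The lookup order behind that output is, as far as I know, LC_ALL first, then the category's own variable, then LANG, and finally the built-in C locale. The sketch below mimics that order for the variables of the current shell; effective_locale is a made-up name and the function is only an approximation of what the C library does:

```shell
# Hypothetical helper: print the locale that would apply to one category,
# e.g. "effective_locale LC_TIME".
effective_locale() {
    category=$1
    eval "category_value=\${$category-}"   # value of the variable named by $1
    if [ -n "${LC_ALL-}" ]; then
        echo "$LC_ALL"                     # LC_ALL overrides everything
    elif [ -n "$category_value" ]; then
        echo "$category_value"             # then the category itself
    elif [ -n "${LANG-}" ]; then
        echo "$LANG"                       # then the general default
    else
        echo "C"                           # nothing set: POSIX locale
    fi
}
```

Because the category name is passed through eval, this should only ever be called with trusted names such as LC_TIME or LC_MESSAGES.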

Solving locale problems

Random errors

Locales solve many problems for users all around the world, but they also introduce new ones. Every now and then something goes wrong… and then after hours looking for the reason, it turns out it was a locale-related issue!

Many programs assume they will always be run under a specific locale (typically an English one), and will fail miserably when one tries to run them under other locales. The greatest challenge here is having the insight that the problem at hand is indeed a locale-related one. The symptoms are completely random. My advice is: when you are having really weird problems with a program, try running it under the POSIX locale. This can be accomplished like this:

adiel@darkstar:~$ LC_ALL=C program

I have faced problems with MuPAD, the Intel C++ Compiler and others. The most recent one took me literally hours to discover, and I still don't know which was the faulty program. I was working with a wxWidgets project which at some point dynamically compiled some Cg code. I was getting some weird compiling errors which I could not reproduce when I compiled the program either from the command line or from other non-wxWidgets projects. After hours, I noticed that if I tried to run the Cg program before initializing wxWidgets, everything worked; when I tried the same after initializing it, I got those errors. I inserted the compilation code in various places within the wxWidgets initialization code, and nailed down the problem to a piece of code that played with locales! After running the program under the POSIX locale, the error disappeared.

Garbage on the screen

At times you may find that after logging into a remote computer via SSH, your ls commands keep producing garbage in file names, or that when you cat a file, its contents contain weird characters. This is very common when dealing with non-English locales, and can be solved by opening up a new terminal under a different locale and then logging in remotely. This is as easy as it sounds:

adiel@darkstar:~$ LC_ALL=pt_BR.ISO-8859-1 xterm

This would open a new Xterm that would happily accept ISO 8859-1 characters.

Slow commands

Sometimes you wonder why some commands you run on the terminal are taking so long, even though you own a very fast computer. Yep, this could be a locale problem! This is especially true for programs that do string manipulation. Here's an example:

adiel@darkstar:~$ export LC_ALL=en_US.UTF-8
adiel@darkstar:~$ time cat log | grep bone
gobble, gobble!
real    0m5.190s
user    0m4.264s
sys     0m0.228s
adiel@darkstar:~$ export LC_ALL=C
adiel@darkstar:~$ time cat log | grep bone
gobble, gobble!
real    0m0.158s
user    0m0.068s
sys     0m0.076s

This example shows that a simple grep operation can run more than 30 times faster under the POSIX locale than under a UTF-8 locale. Of course, you shouldn't go around using the POSIX locale for everything, but if you use it responsibly you may, e.g., make your scripts run faster.
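In a script, one way to use the POSIX locale sparingly is to set it for individual commands rather than exporting it for the whole script. A minimal sketch; in_c_locale is a made-up helper name:

```shell
# Hypothetical helper: run a single command under the POSIX ("C") locale
# without changing the locale of the rest of the script.
in_c_locale() {
    LC_ALL=C "$@"
}

# Example: plain byte-order sorting, independent of the user's settings.
printf 'b\nA\na\nB\n' | in_c_locale sort
```

The example prints A, B, a and b on separate lines, since the C locale compares raw byte values. The same wrapper could go in front of the grep above, as in in_c_locale grep bone log.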

KDE and locales

KDE annoyingly ignores some locale variables that you set. It uses instead its own settings, which can be changed via its KControl program. So if your KDE programs are not paying attention to your locale variables, you can start up KControl and let KDE know your preferences.

Diacritics

This is a problem I believe most users won't ever face, but which to date still gets in my way. When you start a program, the current locale determines how dead keys combine with regular keys, that is, if you press the acute key and then the A key and you get an “Á” on your screen, thank your locale for it. The files used to determine how keys combine can be found in /usr/X11R6/lib/X11/locale.

The en_US.UTF-8 locale, for example, allows you to use many, many dead keys… but not all of them. Say you want to type a text in Ancient Greek; you surely won't be able to type all its diacritics with the English locale. You will have to run your editor under the Greek locale:

adiel@darkstar:~$ LC_ALL=el_GR.UTF-8 kate

This is really annoying because then you can't type other diacritics, and also because, if you're typing in Ancient Greek, you have already changed your keyboard layout, and yet you still have to deal with locales.

Having fun with locales

People will usually get in touch with locales for not-so-fun reasons, but you can take advantage of Linux's ability with locales and tell all programs to talk to you in the language of your choice! So with a simple entry on your .bashrc, you get a complete Greek environment. (Except for KDE, which ignores your settings. See above.)


1. Actually, things are more complicated than this. See, e.g., http://www.debian.org/doc/manuals/intro-i18n/ch-locale.en.html.

Tuesday, January 29, 2008

Using X applications remotely

Introduction

The world of Linux (and Unix, for that matter) is packed with little beautiful things. It takes you a while to see them, though, especially if your first operating system was Windows, as mine was.

One of these nice things is the way X works. Graphical applications communicate with the X server through sockets, and not through direct API calls. Although you will probably never have to write communication code with the X server yourself, deep down there's a library (Xlib) that is doing this work for you. And when you think about it, the fact that we talk about the X server makes it clear that it works in a client-server fashion.

When console applications, both in Windows and Linux, need to write something on the screen or read something from the keyboard, they simply write to or read from a file (stdout and stdin). This is elegant, because different tasks are subsumed in a clean way under the same umbrella (see Occam's razor). This way it is very easy to filter the output of programs (since they're producing a stream) and to send this stream through sockets.

X applications use sockets to communicate with the server. The same kind of socket used to download a web page is used by a graphical application to talk to the X server. This stands in contrast with, e.g., Windows, where applications call APIs directly. Now, if an application is using a socket to communicate with an X server, then the server by no means has to be running on the same system as the client. In fact, the client application can freely choose what server it wants to connect to.

Running an X application remotely

So suppose we want to run xcalc on the host darkstar and have it use the X server running on bones to display graphics and read input. When we start xcalc on darkstar, we must somehow tell it which X server to connect to. Most (if not all) applications choose the server by looking at the DISPLAY environment variable. This variable specifies both the host where the X server is running and the display number within that server (there's also the screen number, which isn't really interesting). More than one display number is needed since there can be more than one X session running on the same system. The display number also determines the TCP port on which the X server listens for connections. The DISPLAY variable can be set like this in Bash:

adiel@darkstar:~$ export DISPLAY=bones:0

Usually an X server does not allow all hosts to connect to it; otherwise, I'm sure you would see strange things on your screen every now and then. For this reason, X implementations provide users with the xhost command. This command:

adiel@bones:~$ xhost +darkstar

allows programs from darkstar to communicate with the X server specified by the current value of the DISPLAY variable (which should already be set, if the command is run within an X session). So next we can simply run xcalc on darkstar:

adiel@darkstar:~$ xcalc

Using SSH to ease things up

Another way is using SSH. Instead of the commands above, one could simply run:

adiel@bones:~$ ssh -X darkstar
adiel@darkstar's password: gobble, gobble, gobble!
adiel@darkstar:~$ xcalc

But how did ssh make all this happen? First, it forwards connections from darkstar to bones. The trick here is knowing what TCP port the X server listens on. This can be determined by adding the display number to 6000. Therefore, if an application wants to use display number 0, it must connect to port 6000; if it wants to use display number 5, it must connect to port 6005. So the ssh command forwards a port like darkstar's 6010 straight to bones's port 6000.

Second, it sets the DISPLAY variable. If it used port 6010 for the connection forwarding, then it would naturally set it to something like :10 (if no host is specified, then localhost is assumed).

We don't need to run the xhost command because, from the X server's point of view, there isn't really any remote connection going on: the connections are tunnelled through SSH and arrive locally.
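The display-number arithmetic is simple enough to put into a tiny helper. This is just a sketch of the "6000 plus display number" rule described above; display_port is a made-up name, and it only handles the plain host:display.screen form:

```shell
# Hypothetical helper: TCP port implied by a DISPLAY value such as
# "bones:0", ":10" or "localhost:5.0".
display_port() {
    display=${1##*:}          # drop the host part:     "bones:0.0" -> "0.0"
    display=${display%%.*}    # drop the screen number: "0.0"       -> "0"
    echo $((6000 + display))
}
```

So display_port bones:0 gives 6000, and display_port :10, the kind of value ssh sets on the remote side, gives 6010.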

There are some key benefits from using SSH:

  • It's obviously secure, since everything will be encrypted.
  • It's usually more handy, because we don't have to set variables or run the xhost command.

Please note that, in order to use SSH, the server you're connecting to (darkstar in this case) must have the option X11Forwarding set to yes. This can be achieved by editing the file usually located at /etc/ssh/sshd_config.

Starting KDE or Gnome remotely

Not only is it possible to run simple programs remotely, but also full desktop environments, like KDE or Gnome. There's only one additional complication here: you can't run two window managers in one X session, so you have to start a separate session for this to work. You can achieve that by doing the following (these instructions assume you're using SSH, but you can easily make things work without it; see above):

  1. If you're going to start your X server from within another X session, then you should run this command before doing anything else:
    adiel@bones:~$ unset XAUTHORITY
    For reasons beyond the scope of this post, if you don't do this then you may get some weird errors.
  2. Start a new and empty X session:
    adiel@bones:~$ X :1 &>/dev/null &
    Notes:
    • The argument :1 tells the X server to start on display number 1.
    • The fragment &>/dev/null redirects both stdout and stderr to /dev/null, thus effectively avoiding any garbage on screen.
    • The & at the end tells Bash to run this command in background, so that when you come back you can type in some more commands.
    • If you run this command from a terminal in an X session, you may get your prompt printed several dozen times. Don't worry.
  3. You will have to switch back to your previous terminal or X session by pressing Ctrl-Alt-F<n> (where <n> is the number of that virtual terminal), since the previous command will activate the newly started X server.
  4. Set the DISPLAY variable. Since the new X server was started on the local host's display number 1, we must run:
    adiel@bones:~$ export DISPLAY=:1
  5. Establish an SSH connection:
    adiel@bones:~$ ssh -X darkstar
    adiel@darkstar's password: gobble, gobble, gobble!
  6. Now run the appropriate command for the desktop you want to use. For KDE, you would type in:
    adiel@darkstar:~$ startkde
    For Gnome, it could be:
    adiel@darkstar:~$ gnome-session

Final notes

The X protocol is quite heavy and it's probably not a good idea to use it over slow connections, like DSL. It works quite well over LANs, though.

While the underlying idea behind the X protocol is beautiful, things sometimes may end up not being that beautiful. You may experience some frustrating incompatibilities at times.