New article – Aligned vs. unaligned memory access
This article sums up a long investigation of mine. I wanted to know whether unaligned memory access is really that bad or not a big deal. Along the way I made some quite interesting discoveries. Read on.
So many times I’ve heard people mention aligned memory access. I even defended memory alignment considerations when arguing over a software design with my colleagues. I wanted to add a few padding bytes to a packed structure to make it memory aligned, and my colleagues could not understand why I was doing this. This is when I started wondering why I was bothering at all. I mean, this memory alignment thing probably does something useful, but on esoteric architectures like ALPHA and the like. Do I really need it on x86?
I’ve seen some reports about the effects of aligned memory access, yet I couldn’t find numbers that give a handy indication. I did find lots of university papers with lots of Greek letters and complex formulas, a few books dedicated to the subject and some documents about how bad unaligned memory access is. But really, how bad is it?
So, I decided to check things myself.
What I wanted to see is how much objective time it takes the CPU to complete an aligned memory access versus an unaligned one. By objective time I mean the same thing you see on your watch. Yet, since what I am interested in is actually a ratio, we can safely drop the time units. I wanted to test things on x86. No Sparcs and ALPHAs. Sorry, Motorola 68000 and PowerPC. Also, I wanted to test the architecture’s native memory unit: on a 32-bit platform that would be reading or writing a 32-bit variable; on a 64-bit platform, a 64-bit variable. Finally, I wanted to eliminate the effects of the L2 cache.
This is how I did the test.
What could be the problem you may ask?
The problem starts in the memory chip. Memory can read only a certain number of bytes at a time. If you try to read from an unaligned memory address, your request will cause the RAM chip to do two read operations. For instance, assuming RAM works in units of 8 bytes, reading 8 bytes from a relative offset of 5 will cause the RAM chip to do two reads. The first will read bytes 0-7. The second will read bytes 8-15. As a result, the relatively slow memory access becomes even slower.
Luckily, hardware developers learned to overcome this problem a long time ago. The solution is not absolute, though, and the underlying problem remains. In modern computers the CPU caches the memory it reads. The memory cache is built of cache lines that are 32 or 64 bytes long. Even when you read just one byte from memory, the CPU reads a complete cache line and places it into the cache. This way, reading 8 bytes from offset 5 or from offset 0 makes no difference: the CPU reads 64 bytes in any case.
However, things get not so pretty when the data you read crosses a cache line boundary. In that case the CPU has to read two complete cache lines. That is 128 bytes instead of only 8. This is a huge overhead, and this is the overhead I would like to demonstrate.
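To make the boundary case concrete, here is a minimal C sketch (not part of the test driver) that counts how many cache lines a read touches. The function name lines_touched and the power-of-two assumption on the line size are mine; the 64-byte line matches the machine used later in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines touched by a read of `size` bytes that starts at
 * byte address `addr`, given a cache line of `line_size` bytes.
 * Assumption: line_size is a power of two (hypothetical helper, not from
 * the original driver). */
unsigned lines_touched(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size > 0);
    assert(line_size > 0 && (line_size & (line_size - 1)) == 0);

    uintptr_t first = addr / line_size;              /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) / line_size; /* line holding the last byte  */

    return (unsigned)(last - first + 1);
}
```

With a 64-byte line, an 8-byte read at offset 5 or at offset 64 touches one line, while the same read at offset 62 touches two.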
Now I guess you have already noticed that I only talk about reads, not about memory writes. This is because I think that memory reads should give us good enough results. The thing is that when you write some data into memory, the CPU does two things. First it loads the cache line into the cache. Then it modifies part of the line in the cache. Note that it does not write the cache line back to memory immediately. Instead it waits for a more appropriate time (for instance when it’s less busy or, in SMP systems, when another processor needs this cache line).
We’ve already seen that, at least theoretically speaking, the problem that affects the performance of unaligned memory access is the time it takes the CPU to transfer memory into the cache. Compared to this, the time it takes the CPU to modify a few bytes in the cache line is negligible. As a result, it’s enough to test only the reads.
Now that we’ve figured out what to do, let’s talk about how we’re going to do it. First of all, we have to measure time somehow. Moreover, we have to measure it precisely.
The next thing I could think of, after forcing myself to stop looking at my Swiss watch, was of course the rdtsc instruction. After all, what can serve us better than a clock tick counter built into the CPU and an instruction that reads it?
The thing is that modern x86 CPUs have an internal clock tick counter register or, as the folks at Intel call it, the time-stamp counter. It is incremented at a constant rate, once every CPU clock tick. So even the fastest CPU instruction should take no less than one unit of measurement to execute. Conveniently, Intel engineers have added an instruction that allows one to read the value of the register.
Before diving into the actual code that proves our little theory, we should understand that rdtsc gives us very precise results. So precise that the rdtsc instruction itself can affect them. Therefore, simply reading the time-stamp register before and after reading 8 bytes from memory is not enough. The rdtsc instruction itself may take more time than the actual read.
To avoid this sort of interference, we should do many 8-byte reads. This way we minimize the effect of the rdtsc instruction itself.
The next problem that our test should address has to do with the L2 cache. When reading the memory, we have to make sure that it is not yet in the cache. We want the CPU to read the memory into the cache and not to use values that are already there.
The simplest way to overcome this problem is to allocate a large buffer of “fresh” memory every time we do the test, work with that memory and release it. A single run gives objective results most of the time, but not always. So we repeat the test several times, drop the several fastest and slowest results and calculate the average of the rest. This should eliminate “the noise”.
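The averaging step described above could be sketched in user-space C as follows. This is an illustration under my own assumptions, not the code from the driver: the names trimmed_mean and cmp_ticks are hypothetical, and the function uses floating point, which you would normally avoid inside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for tick counts. */
int cmp_ticks(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;

    return (x > y) - (x < y);
}

/* Sorts `samples` in place, drops the `trim` smallest and the `trim`
 * largest values and returns the mean of what is left. */
double trimmed_mean(unsigned long long *samples, size_t n, size_t trim)
{
    double sum = 0.0;
    size_t i;

    assert(n > 2 * trim); /* something must remain after trimming */

    qsort(samples, n, sizeof samples[0], cmp_ticks);
    for (i = trim; i < n - trim; i++)
        sum += (double)samples[i];

    return sum / (double)(n - 2 * trim);
}
```

For 100 measurements with the 10 best and 10 worst dropped, the call would be trimmed_mean(results, 100, 10).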
There is one hardware limitation when dealing with the rdtsc instruction. On older CPUs (Pentium 3 for instance, and even some Pentium 4s and Xeons) the time-stamp counter works in a slightly different manner: it follows the core clock, whose frequency may change, rather than ticking at a constant rate. As a result, on older CPUs we would get a subjective result. Not good. Therefore, we have to use a somewhat modern machine with modern CPUs. This is what I have at hand.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
stepping : 6
cpu MHz : 2992.526
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 5989.11
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
Now to the details. Allocating large memory buffers in the kernel can be a bit of a problem. We can, however, allocate a 1MB buffer without worrying too much about allocation errors. 1MB is smaller than the common cache size, but we’ve found a way to avoid cache implications, so no problems here.
Since we’re on a 64-bit machine, we want to run through the complete buffer and read 8 bytes at a time. The machine I am using for the test is quite new and has a 64-byte cache line. Therefore, to avoid reading from 64 bytes that are already cached, we have to read once every 128 bytes, i.e. we skip 128 bytes every iteration. Later I call this number the step. Also, since we want to simulate unaligned memory access, we should start reading from a certain offset. This number is called the indent.
The function below does a single test. It receives four arguments: the first and second describe the buffer (pointer and buffer size), the third tells where to start reading from (the indent) and the fourth tells how many bytes to jump every iteration (the step).
First, the function does the shifting by adding the indent to the buffer pointer. Next it reads the time-stamp counter using the rdtscll() macro. This is a kernel macro that reads the time-stamp counter into a long long variable, in this case the before variable.
Then the function does the actual test. You can see the memory reads in the loop. It increments the value of the page pointer by step every iteration. Finally it calls rdtscll() again to check how much time has passed and returns the delta.
.
.
.
unsigned long long do_single_test( unsigned long real_page,
                                   int area_size,
                                   int indent,
                                   int step )
{
    volatile unsigned char* page;
    volatile unsigned long temp;
    unsigned long i;
    unsigned long long before = 0;
    unsigned long long after = 0;

    // Doing the indent...
    page = (unsigned char *)(real_page + indent);

    // Doing the actual test...
    rdtscll( before );
    for (i = 0; i < (area_size / step) - 1; i++) {
        temp = *((unsigned long *)page);
        page += step;
    }
    rdtscll( after );

    return after - before;
}
.
.
.
You can get the complete source code of the driver, with appropriate Makefile, here.
I did the test twice, the first time with indent equal to 62 and the second time with indent equal to 64, i.e. starting from offsets 62 and 64. Both times I used a step of 128, i.e. jumped 128 bytes at a time. In the first test, the value of indent intentionally caused unaligned memory access: every 8-byte read started two bytes before a cache line boundary and crossed it. The second test was fine in terms of memory alignment.
I ran each test 100 times, with a 10-second delay in between. Then I dropped the 10 best and 10 worst results and calculated an average. The final result is this.
The average number of clock ticks for the first, unaligned test was 1,115,326. The average number of ticks for the second, aligned test was 512,282. This clearly indicates that in the worst case scenario, unaligned memory access can be more than twice as slow as aligned access.
We’ve seen that in the worst case scenario unaligned memory access can be quite expensive in terms of CPU ticks. Luckily, hardware engineers continuously make it harder and harder to reach this scenario. Yet, at least at the moment, it is still quite doable.
Therefore, while the guys at Intel are breaking their heads over this problem, it is better to keep your structures well aligned in memory. This is especially important for large data structures, because such data structures cannot utilize the memory cache completely. It is also important for data structures that are used relatively rarely. Such data structures tend to disappear from the cache and have to be brought back into the cache each time you access them. To be on the safe side, try to keep your data structures memory aligned, always.
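As an illustration of what padding a packed structure means, here is a small sketch. The structure and field names are made up for the example, and __attribute__((packed)) is a GCC/Clang extension, so treat this as an assumption about the compiler rather than portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed layout: `value` starts at offset 1, so an 8-byte access to it is
 * misaligned whenever the structure itself starts on an 8-byte boundary. */
struct packed_record {
    uint8_t  tag;
    uint64_t value;
} __attribute__((packed));

/* Same fields with explicit padding: `value` now starts at offset 8. */
struct padded_record {
    uint8_t  tag;
    uint8_t  pad[7];
    uint64_t value;
};

/* Returns non-zero when `offset` is a multiple of `alignment`. */
int is_aligned(size_t offset, size_t alignment)
{
    assert(alignment > 0);
    return offset % alignment == 0;
}
```

The padded version costs seven extra bytes per record. Note that aligned field offsets only help if the structure itself is placed at an aligned address.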
Just in case you’re wondering how much time it takes to do the same test on cached memory, here are the results. I repeated the 64/128 (indent/step) test twice in a row on the same memory buffer. We’ve seen the results of the first test. Repeating the same test on the cached memory buffer takes 39,380 clock ticks. That is 13 times less than the 64/128 test result and 28 times less than the 62/128 test result. Quite overwhelming, isn’t it?
Here is the code I used to produce the latter result.
As the name implies, this article explains how to crack open, modify and save a modern initrd file without using fancy tools such as mkinitrd and friends. You can find it here.
Ever wondered what’s inside of the initrd file? This article tells you how to look into the initrd and even modify it.
Linux uses the initrd, or initial ram-disk, during the boot process. The Linux kernel is very modular, as you know. While the main kernel file contains only the most needed stuff, the rest of the kernel, drivers included, resides in separate files: the kernel modules.
It would be impossible to create a single kernel binary image that would suit all the hardware configurations out there. Instead, the kernel supports the initrd. The initrd is a virtual file-system that contains the drivers (kernel modules) needed to boot the system. For instance, very often SCSI controller drivers reside inside the initrd. The kernel needs a SCSI controller driver to boot the operating system, but it does not include one, nor can it read it from the hard-disk (you’d need a driver for the hard-disk, right?). And this is where the initrd becomes very handy.
The BIOS routines that read the actual kernel from the disk into RAM do the same job with the initrd. When the Linux kernel boots, long before trying to mount the real root file-system, it loads the initrd into memory and makes it a temporary root file-system.
See how handy this is. The initrd itself requires no drivers whatsoever, because the BIOS handles all the work of loading it into memory. On the other hand, it contains all the drivers Linux needs to boot. And you can easily rebuild it without changing the kernel.
After loading the initrd into RAM, the kernel runs a script named init that resides in the initrd’s root directory. The script contains commands that load all required kernel modules. Only after that does Linux try to mount the real root file-system.
The content of the initrd file and its format have changed significantly over the last couple of years. Something like four years ago, it was common practice to create a real RAM-disk with a fixed size, format it with the ext2 file-system and write some data to it.
To look into it, you had to decompress it with gzip and then mount it using a loopback device (mount -o loop).
Today things are totally different. The kernel configuration option that set the size of the initrd is gone. It wasn’t really convenient, because your system was limited to a certain initrd size. Instead, the kernel adapts itself to the initrd, whatever its size.
Like the kernel, the initrd is compressed to save disk space. Unlike the kernel, it can be easily decompressed. The tool we’ll use to decompress it is nothing fancy: gzip. The same good old gzip that we use so often.
Now, before we begin, it is a good idea to create a directory where we’ll work. After all, the internal structure of the initrd is quite complex and we don’t want to mix the contents of the initrd with the contents of, let’s say, your home directory. So, use mkdir and cd to create our clean environment. We’ll call this directory A. To make things even cleaner, place the initrd file into your newly created directory and create an additional directory in it. This is directory B. In that directory we will have the contents of the initrd. Eventually, you should have a layout similar to this one.
Let’s start decompressing. Enter directory A and copy the initrd that you would like to open into the directory. Then rename it so that it has a .gz extension. The thing is that the initrd is a gzip-compressed archive. Since gzip refuses to decompress something that doesn’t have a .gz extension, we have to rename the file.
Next we have to decompress the file. gzip -d <file name> does the job for us. The next step is to open up the cpio archive. Yes, a modern initrd is a cpio archive. We can do that with cpio -i < <file name>, but before we do that, we have to enter directory B and specify the file name with double dots, indicating that the file is in the parent directory, the A directory.
sasha@sasha-linux:~/A$ cp /boot/initrd.img-2.6.24-16-generic .
sasha@sasha-linux:~/A$ mv initrd.img-2.6.24-16-generic initrd.img-2.6.24-16-generic.gz
sasha@sasha-linux:~/A$ gzip -d initrd.img-2.6.24-16-generic.gz
sasha@sasha-linux:~/A$ ls
B/ initrd.img-2.6.24-16-generic
sasha@sasha-linux:~/A$ cd B/
sasha@sasha-linux:~/A/B$ cpio -i < ../initrd.img-2.6.24-16-generic
42155 blocks
sasha@sasha-linux:~/A/B$ ls -F
bin/ conf/ etc/ init* lib/ modules/ sbin/ scripts/ usr/ var/
sasha@sasha-linux:~/A/B$
In this example you can see me opening the default initial ram-disk image from my Ubuntu 8.04 installation. We can see that the initrd opened up into a nice directory tree that resembles your root directory structure. At the heart of the initrd structure is the init script, which does most of the job of loading the right modules when the system boots.
The content of the init script differs from distribution to distribution. The main difference is in approach. In some distributions, developers preferred to keep as many initializations as possible out of the initrd. In other distributions, developers didn’t care that much about keeping the initrd small and fast. In general, both approaches have a place under the sun. The first approach is based on the fact that the initrd is a limited environment, in contrast to Linux when it’s fully loaded. Thus, when Linux is fully loaded, you can do more complex stuff with less effort. The second approach, on the other hand, sees the initrd as an environment that works faster than “big” Linux, so it uses the initrd’s speed to do some initializations.
Ubuntu’s initrd image is based on the first approach. It uses a shell program named busybox, a shell environment originally designed for embedded systems and known for its small memory footprint and good performance. The initrd in OpenSuSE 10.2, on the other hand, uses the bash shell, the same shell you use regularly. This is a clear example of the second approach.
Another interesting thing to look at is the fact that the init script in Ubuntu 8.04 is ~200 lines long, while in OpenSuSE 10.2 it is ~1000 lines long.
Once you have it opened up, you can see the things inside it and even make some modifications. As I already explained, the structure of the initial ram-disk changes from distribution to distribution. However, all distributions share a few common things. For instance, regardless of the distribution and the particular initrd format, the lib/modules/ directory always contains the kernel modules that the initrd loads at boot time. You may swap one module for another without anyone even noticing.
The number of modules, their names, etc. are controlled via the init script in a distribution-dependent form. Therefore, no matter what distribution of Linux you have, the init script is the key to understanding how the initrd works. Understand the init script, and you will have full control over your initrd, its contents and what it does.
Suppose you’re done playing around with the initrd contents and you want to pack it back. Here is what you do.
First you have to pack the cpio archive. Remember the B directory we’ve created? This is where it comes in handy. We want to keep the contents of the initrd as clean as possible. The A-B separation allows us to keep the original initrd image out of the way when packing it back.
This is how we do that. First, we should enter the B directory. From there, run the following command:
find | cpio -H newc -o > ../new_initrd_file
This will create a new initrd file named new_initrd_file inside of directory A.
Next enter directory A and pack the cpio archive with gzip. Here’s the command that should do the job.
gzip -9 new_initrd_file
This will pack the initrd in new_initrd_file into the new_initrd_file.gz archive. Finally, rename the file to whatever you want to call it. Remember that getting rid of the .gz extension is common practice, although not a necessity.
This is what a complete session looks like on Ubuntu:
sasha@sasha-linux:~$ cd A/B/
sasha@sasha-linux:~/A/B$ find | cpio -H newc -o > ../new_initrd_image
42155 blocks
sasha@sasha-linux:~/A/B$ cd ../
sasha@sasha-linux:~/A$ gzip -9 new_initrd_image
sasha@sasha-linux:~/A$ ls
B initrd.img-2.6.24-16-generic new_initrd_image.gz
sasha@sasha-linux:~/A$ mv new_initrd_image.gz initrd.img-2.6.24-16-generic-modified
sasha@sasha-linux:~/A$ ls
B initrd.img-2.6.24-16-generic initrd.img-2.6.24-16-generic-modified
sasha@sasha-linux:~/A$
Changing the initrd is always risky business. When playing with matters of this kind, mistakes are common and it is important to stay on the safe side. Adding a new GRUB configuration entry is not such a big deal, so by all means do so when trying to boot an initrd you brewed five minutes ago. You’ll save yourself lots of time reinstalling distributions and poking around with different rescue systems to make your system boot again.
Have fun!
In this article I explain how to use tcpdump, one of the most powerful tools in my toolbox. Hope you’ll find it useful. You can find it here.
In this article I would like to talk about one of the most useful tools in my networking toolbox: tcpdump. Unfortunately, mastering this tool completely is not an easy task. Yet the things you do most often are relatively simple and may become a good springboard for diving into more complex topics.
tcpdump is a packet sniffer. It is able to capture traffic that passes through a machine. It operates on the packet level, meaning that it captures the actual packets that fly in and out of your computer. It can save the packets into a file. You can save whole packets or only the headers. Later you can “play” the recorded file and apply different filters to the packets, telling tcpdump to ignore packets that you are not interested in.
Under the hood, tcpdump understands protocols and host names. It will do all in its power to find out what host sent each packet and will tell you its name instead of the IP address.
It is an exceptionally useful tool for debugging what might have caused a certain networking problem. It is also an excellent tool for learning new things.
Invoking tcpdump is easy. The first thing you have to remember is that you should either be logged in as root or be a sudoer on the computer. A sudoer is someone who is entitled to gain administrator rights on the computer for a short period of time using the sudo command.
Running tcpdump without any arguments makes it capture packets on the first network interface (excluding lo) and print a short description of each packet to the output. This may cause a bit of a headache if you are using the network to connect to the machine. If you are connected with SSH or telnet (rlogin?), running tcpdump will produce a line of text for each incoming or outgoing packet. This line of text will cause the SSH daemon to send a packet with this line, thus causing tcpdump to produce another line of text. And this will not stop until you do something about it.
So the first thing we will learn about tcpdump is how to filter out SSH and telnet packets. We will study the basics of tcpdump filtering later in this guide, but for now just remember this syntax.
# tcpdump not port 22
“not port 22” is a filter specification that tells tcpdump to filter out packets whose TCP or UDP source or destination port is 22. As you know, port 22 is the SSH port. Basically, when you tell tcpdump something like this, it will ignore all SSH packets, which is exactly what we needed.
Telnet on the other hand, uses port 23. So if you are connecting via telnet, you can filter that out with:
# tcpdump not port 23
Clear and simple!
By default tcpdump produces one line of text for every packet it intercepts. Each line starts with a time stamp. It tells you the very precise time when the packet arrived.
Next comes the protocol name. Unfortunately, tcpdump understands a very limited number of protocols. It won’t tell you the difference between packets belonging to an HTTP stream and, for instance, an FTP stream. Instead, it will mark such packets as IP packets. It does have some limited understanding of TCP. For instance, it identifies TCP control flags such as SYN, ACK, FIN and others. This information is printed after the source and destination IP addresses (if it is an IP packet).
The source and destination addresses follow the protocol name. For IP packets, these are IP addresses. For other protocols, tcpdump does not print any identifiers unless explicitly asked to do so (see the -e command line switch below).
Finally, tcpdump prints some information about the packet. For instance, it prints TCP sequence numbers, flags, ARP/ICMP commands, etc.
Here’s an example of typical tcpdump output.
17:50:03.089893 IP 69.61.72.101.www > 212.150.66.73.48777: P 1366488174:1366488582 (408) ack 2337505545 win 7240 <nop,nop,timestamp 1491222906 477679143>
This packet is part of an HTTP data stream. You can find the meaning of each and every field in the packet description in tcpdump’s manual page.
Here’s another example
17:50:00.718266 arp who-has 69.61.72.185 tell 69.61.72.1
This is an ARP packet. It’s slightly more self-explanatory than TCP packets. Again, to see the exact meaning of each field in the packet description, see tcpdump’s manual page.
Now that we know how to invoke tcpdump even when connecting to the computer over a network, let’s see what command line switches are available to us.
We’ll start with a simple one: how to dump packets that arrived or were sent through a certain network interface. The -i command line argument does exactly this.
# tcpdump -i eth1
Will cause tcpdump to capture packets from network interface eth1. Or, considering our SSH/telnet experience:
# tcpdump -i eth1 not port 22
Finally, you can specify any as interface name, to tell tcpdump to listen to all interfaces.
# tcpdump -i any not port 22
As we debug networking issues, we may encounter a problem with how tcpdump works out of the box. The problem is that it tries to resolve every single IP address that it meets, i.e. when it sees an IP packet it asks the DNS server for the names of the computers behind the IP addresses. It works flawlessly most of the time. However, there are two problems.
First, it slows down packet interception. It’s not a big deal when there are only a few packets, but when there are thousands or tens of thousands it introduces a delay into the process. The amount of delay varies depending on the traffic.
Another, much more serious problem occurs when there is no DNS server around or when the DNS server is not working properly. If this is the case, tcpdump spends a few seconds trying to figure out two hostnames for each IP packet. This means that traffic interception virtually stops.
Luckily, there is a way around this. There is an option that causes tcpdump to stop resolving hostnames, and that is -n.
# tcpdump -n
And here are a few variations of how you can use this option in conjunction with the options that we have learned already.
# tcpdump -n -i eth1
# tcpdump -ni eth1 not port 22
Here are a few more useful options. Sometimes the amount of traffic that goes in and out of your computer is very high, while all you want to see is just a few packets. Often you want to see who sends you the traffic, but when you try to capture anything with tcpdump it dumps so many packets that you cannot understand anything. This is when the -c command line switch comes in handy.
It tells tcpdump to limit the number of packets it intercepts. You specify the number of packets you want to see. tcpdump will capture that number of packets and exit. This is how you use it.
# tcpdump -c 10
Or with options that we’ve learned before.
# tcpdump -ni eth1 -c 10 not port 22
This will limit the number of packets that tcpdump receives to 10. Once it has received 10 packets, tcpdump will exit.
One of the most useful tcpdump features allows capturing incoming and outgoing packets into a file and then playing this file back. By the way, you can play this file not only with tcpdump, but also with Wireshark (formerly Ethereal), the graphical packet analyzer.
You can do this with the -w command line switch. It should be followed by the name of the file that will contain the packets. Like this:
# tcpdump -w file.cap
Or adding options that we’ve already seen
# tcpdump -ni eth1 -w file.cap not port 22
By default, when capturing packets into a file, tcpdump will save only 68 bytes of the data from each packet. The rest of the information will be thrown away.
One of the things I often do when capturing traffic into a file is change the saved packet size. The thing is that the disk space required to save those extra bytes is very cheap and available most of the time. Spending a few spare megabytes of your disk space on a capture isn’t too painful. On the other hand, losing a valuable portion of the packets might be very critical.
So, what I usually do when capturing into a file is run tcpdump with the -s command line switch. It tells tcpdump how many bytes of each packet to save. Specifying 0 as the snapshot length tells tcpdump to save the whole packet. Here is how it works:
# tcpdump -w file.cap -s 0
And in conjunction with the options that we already saw:
# tcpdump -ni eth1 -w file.cap -s 0 -c 1000 not port 22
Obviously, you can save as much data as you want. Specifying 1000 bytes will do the job for you. Just keep in mind that there are so-called jumbo frames whose size can be as big as 8KB.
Now that we have captured some traffic into a file, we would like to play it back. The -r command line switch tells tcpdump that it should read the data from a file instead of capturing packets from interfaces. This is how it works.
# tcpdump -r file.cap
With a capture file, we can easily analyze the packets and understand what’s inside. tcpdump has several options that will help us with this task. Let’s see a few of them.
There are several options that allow one to see more information about the packet. There is a problem though. tcpdump in general doesn’t give you too much information about packets. It doesn’t understand many protocols.
If you want to see a packet’s content, it is better to use tools like Wireshark. It does understand protocols, analyzes them and allows you to see different fields, not only in the TCP header, but in layer 7 protocol headers as well.
tcpdump is a command line tool and, like most command line tools, its ability to present information is quite limited. Yet it still has a few options that control the way packets are presented.
The -e command line switch causes tcpdump to present the Ethernet (link level protocol) header for each printed packet. Let’s see an example.
# tcpdump -e -n not port 22
There are four command line switches that control how tcpdump prints the time stamp. First, there is the -t option. It makes tcpdump not print time stamps. Next comes -tt. It causes tcpdump to print the time stamp as the number of seconds since Jan. 1st 1970, plus a fraction of a second. -ttt prints the delta between this line and the previous one. Finally, -tttt causes tcpdump to print the time stamp in its regular format preceded by the date.
-v causes tcpdump to print more information about each packet. With -vv tcpdump prints even more information. As you could guess, -vvv produces even more information. Finally, -vvvv will produce an error message telling you there is no such option.
The -x command line switch makes tcpdump print each packet in hexadecimal format. The number of bytes that will be printed remains somewhat of a mystery. As is, it will print the first 82 bytes of the packet, excluding the Ethernet header. You can control the number of bytes printed using the -s command line switch.
In case you want to see the Ethernet header as well, use -xx. It will cause tcpdump to print an extra 14 bytes for the Ethernet header.
Similarly, -X and -XX will print the contents of the packet in both hexadecimal and ASCII formats. The latter will cause tcpdump to include the Ethernet header in the printout.
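The 14 extra bytes are simply the size of an untagged Ethernet header: two 6-byte MAC addresses and a 2-byte type field. Here is a rough sketch of that arithmetic. The structure and function names are my own choice for illustration; real code would normally use the definitions from the system headers, and VLAN-tagged frames carry a longer header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ETH_HEADER_LEN = 14 };

/* Layout of an untagged Ethernet header (hypothetical name). */
struct eth_header {
    uint8_t  dst[6];    /* destination MAC address */
    uint8_t  src[6];    /* source MAC address */
    uint16_t ethertype; /* protocol of the payload, e.g. IP or ARP */
};

/* Bytes dumped for a packet when `packet_bytes` are printed without the
 * link level header and the header is optionally added on top. */
size_t dumped_bytes(size_t packet_bytes, int with_link_header)
{
    /* The struct must have no padding for the arithmetic to hold. */
    assert(sizeof(struct eth_header) == ETH_HEADER_LEN);

    return packet_bytes + (with_link_header ? ETH_HEADER_LEN : 0);
}
```

So if a dump without the link level header shows 82 bytes, the same dump with the header included shows 96.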
We already saw a simple filter. It causes tcpdump to ignore SSH packets, allowing us to run tcpdump remotely. Now let’s try to understand the language that tcpdump uses to evaluate filter expressions.
We should understand that tcpdump applies our filter to every single incoming and outgoing packet. If a packet matches the filter, tcpdump acknowledges the packet and, depending on command line switches, either saves it to a file or dumps it to the screen. Otherwise, tcpdump ignores the packet and counts it only in the summary it prints on exit, which tells how many packets were received, dropped and filtered out.
To demonstrate this, let’s go back to the not port 22 expression. tcpdump ignores packets that are either sourced from or destined to port 22. When such a packet arrives, tcpdump applies the filter to it and, since the result is false, drops the packet.
So, from what we’ve seen so far, we can conclude that tcpdump understands the word port and understands expression negation with not. Actually, negating an expression is part of the complex expression syntax, and we will talk about complex expressions a little later. In the meantime, let’s see a few more packet qualifiers that we can use in tcpdump expressions.
We’ve seen that the port qualifier specifies either the source or the destination port number. In case we want to specify only the source port or only the destination port, we can use src port or dst port. For instance, using the following expression we can see all outgoing HTTP packets.
# tcpdump -n dst port 80
We can also specify ranges of ports. The portrange, src portrange and dst portrange qualifiers do exactly this. For instance, let’s see a command that captures all SSH and telnet packets.
# tcpdump -n portrange 22-23
Using the dst host, src host and host qualifiers you can specify the destination IP address, the source IP address, or either of them. For example
# tcpdump src host alexandersandler.net
This prints all packets originating from the alexandersandler.net computer.
You can also specify Ethernet addresses. You do that with the ether src, ether dst and ether host qualifiers. Each should be followed by a MAC address: that of the source machine, the destination machine, or either of the two, respectively.
You can specify networks as well. The net, src net and dst net qualifiers do exactly this. Their syntax, however, is slightly more complex than that of a single host. This is due to the netmask that has to be specified.
You can use two basic forms of network specification. One uses a netmask and the other the so called CIDR notation. Here are a few examples.
# tcpdump src net 67.207.148.0 mask 255.255.255.0
Or the same command using CIDR notation.
# tcpdump src net 67.207.148.0/24
Note the word mask that introduces the netmask in the first example. The second example is much shorter.
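To convince yourself that both forms describe the same network, here is a small sketch using Python's standard ipaddress module. It has nothing to do with tcpdump itself; it only checks the arithmetic behind the two notations. The two test addresses are made up for illustration.

```python
import ipaddress

# The two spellings from the examples above, as network objects.
with_mask = ipaddress.ip_network("67.207.148.0/255.255.255.0")
with_cidr = ipaddress.ip_network("67.207.148.0/24")

# Both notations describe one and the same network.
same_network = with_mask == with_cidr

# /24 is simply the number of leading 1 bits in 255.255.255.0.
prefix_length = with_mask.prefixlen

# Hypothetical addresses: one inside the network, one outside of it.
inside = ipaddress.ip_address("67.207.148.77") in with_cidr
outside = ipaddress.ip_address("67.207.149.1") in with_cidr

print(same_network, prefix_length, inside, outside)
```

A src net filter would match packets from the first address, but not from the second.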
There are several useful qualifiers that don’t fall under any of the categories I already covered.
For instance, you can specify that you are interested in packets of a specific length. The length qualifier does this. The less and greater qualifiers tell tcpdump that you are interested in packets whose length is at most or at least the value you specify. Note that both comparisons include the value itself.
Here’s an example that demonstrates these qualifiers in use.
# tcpdump -ni eth1 greater 1000
This captures only packets whose size is 1000 bytes or more.
As we already saw, we can build more complex filter expressions using the tcpdump filter language. Actually, tcpdump allows exceptionally complex filtering expressions.
We’ve seen the not port 22 expression. Applying it to a packet produces logical true when the packet is neither sourced from nor destined to port 22. In short, not negates the expression.
In addition to expression negation, we can build more complex expressions by combining two smaller expressions into one large expression using the and and or keywords. You can also use brackets to group several expressions together.
For example, let’s see a filter that causes tcpdump to capture packets of 100 bytes or more originating from google.com or from microsoft.com.
# tcpdump -XX greater 100 and \(src host google.com or src host microsoft.com\)
The and and or keywords in the tcpdump filter language have the same precedence and are evaluated left to right. This means that without brackets, tcpdump would capture packets from microsoft.com regardless of their size. With brackets, tcpdump first makes sure that a packet is at least 100 bytes long and only then checks its origin.
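Here is a rough model of the two readings in Python. It is only a sketch: packets are plain dictionaries with made-up values, and since Python's own and binds tighter than or, the grouping that tcpdump would apply is spelled out with explicit brackets.

```python
def without_brackets(p):
    # greater 100 and src host google.com or src host microsoft.com
    # Equal precedence, left to right: (A and B) or C.
    return (p["length"] >= 100 and p["src"] == "google.com") or p["src"] == "microsoft.com"


def with_brackets(p):
    # greater 100 and (src host google.com or src host microsoft.com)
    return p["length"] >= 100 and (p["src"] == "google.com" or p["src"] == "microsoft.com")


# Hypothetical packets: only a length and a source host name.
small_from_microsoft = {"length": 60, "src": "microsoft.com"}
large_from_google = {"length": 1200, "src": "google.com"}

print(without_brackets(small_from_microsoft), with_brackets(small_from_microsoft))
print(without_brackets(large_from_google), with_brackets(large_from_google))
```

The small packet from microsoft.com slips through the filter without brackets, and is rejected by the one with brackets.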
Note the backslash symbol (“\”) before each bracket. We need it because of the shell: Unix shells give brackets a special meaning. Hence we have to tell the shell to leave these particular brackets alone and pass them to tcpdump as they are. The backslash characters do exactly this.
Talking about precedence, keep in mind that in tcpdump’s filter expression language not has higher precedence than and and or. tcpdump’s manual page has a very nice example that emphasizes what this means.
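Another common way to protect the brackets is to wrap the whole expression in single quotes. tcpdump joins its expression arguments with spaces before parsing them, so both spellings end up as the same filter. The sketch below uses Python's shlex module, which only approximates how a POSIX shell splits a command line, to show this.

```python
import shlex

# The command with backslash-escaped brackets, as typed at the prompt.
escaped = r"tcpdump -XX greater 100 and \(src host google.com or src host microsoft.com\)"

# The same command with the whole filter in single quotes.
quoted = "tcpdump -XX 'greater 100 and (src host google.com or src host microsoft.com)'"


def filter_text(command_line):
    # Split the way a POSIX shell roughly would, drop "tcpdump" and "-XX",
    # then join the remaining arguments with spaces, as tcpdump does.
    args = shlex.split(command_line)
    return " ".join(args[2:])


print(filter_text(escaped))
print(filter_text(escaped) == filter_text(quoted))
```

Which spelling to use is a matter of taste. Quoting the whole expression tends to be easier to read once a filter has more than one pair of brackets.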
not host vs and host ace
and
not (host vs or host ace)
are two different expressions. Because not has higher precedence than and and or, the first filter captures packets that are not to/from vs but are to/from ace. The second filter, on the other hand, captures packets that are neither to/from vs nor to/from ace. I.e. the first captures a packet from ace to some other host (but not to vs), yet the second won’t capture this packet.
To conclude this article, I would like to tell you one more thing that may come in handy when writing complex tcpdump filter expressions.
Take a look at the following example.
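The difference is easy to check with a small model. This is just a sketch in Python: a packet is a dictionary with a source and a destination, and the host names other and another are made up for illustration.

```python
def involves(packet, host):
    # Models the "host X" qualifier: true when X is the source or the destination.
    return host in (packet["src"], packet["dst"])


def first_filter(p):
    # not host vs and host ace  ->  (not host vs) and (host ace)
    return (not involves(p, "vs")) and involves(p, "ace")


def second_filter(p):
    # not (host vs or host ace)
    return not (involves(p, "vs") or involves(p, "ace"))


# Hypothetical packets.
ace_to_other = {"src": "ace", "dst": "other"}
ace_to_vs = {"src": "ace", "dst": "vs"}
unrelated = {"src": "other", "dst": "another"}

for name, p in [("ace_to_other", ace_to_other), ("ace_to_vs", ace_to_vs), ("unrelated", unrelated)]:
    print(name, first_filter(p), second_filter(p))
```

The packet from ace to some other host passes the first filter only, while traffic that involves neither vs nor ace passes the second filter only.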
# tcpdump -XX greater 100 and \(src host google.com or microsoft.com\)
We already saw this example, with one little exception. In the previous example we had a src host qualifier before microsoft.com and now it’s gone. The thing is that if we want to use the same qualifier two times in a row, we don’t have to specify it twice. Instead we can just write the qualifier’s parameter and tcpdump will assume the most recent qualifier.
This makes the tcpdump filter expression language much easier to understand and much more readable.
I hope you found this article useful. In case you have questions, suggestions or you would like to share your appreciation with the author, don’t hesitate to mail me at alexander.sandler@gmail.com
I took these pictures in Rothschild Park in Zihron Yaakov. You can find it here.
These pictures are from Rothschild Park in Zihron Yaakov.
I shot these pictures in Utopia park. It is a small park in Beit Hefer, quite close to Tulkarem. It is best known for its orchids, although the peacock definitely tried to steal the show. Enjoy it here.