P For Paranoia OR a quick way of overwriting a partition with random-like data

(General Surgeon’s warning: The following post contains doses of paranoia which might exceed your recommended daily dosage. Fnord!).

A lot of the data sanitisation literature around advises overwriting partitions with random data (btw, SANS Institute research claims that even a pass with /dev/zero is enough to stop MFM but YPMV). So leaving Guttman-like techniques aside, in practice, generation of random data will take a long time in your average system which does not contain a cryptographic accelerator. In order to speed up things, /dev/urandom can be used in lieu of /dev/random, noting that when read, the non-blocking /dev/urandom device will return as many bytes as are requested, even if the entropy pool is depleted . As a result, the result stream is not as cryptographically sound as /dev/random but is faster.

Assuming that time is of the essence and your paranoia level is low there is an alternative which you can use, both providing random-like data (which means you do not have to fall back to /dev/zero and keep fingers crossed) and being significantly faster. Enter Truecrypt. Truecrypt allows for encrypted partitions using a variety of algorithms that have been submitted to peer review and are deemed secure for general usage. I can hear Johnny sceptical shouting “Hey wait a minute now, this is NOT random data, what the heck are you talking about?”. First of all, Truecrypt headers aside, let’s see what ent reports. For those of you not familiar with ent, it is a tool that performs a statistical analysis of a given file (or bitstream if you tell it so), giving you an idea about entropy and other way way useful statistics. For more information man 1 ent.

For the purposes of this demonstration, I have created the following files:

  • an AES encrypted container
  • an equivalent size file getting data from /dev/urandom (I know, but I was in a hurry )
  • a well defined binary object in the form of a shared library
  • a system configuration file
  • a seed file which contains a mixture of English, Chinese literature, some C code, strings(1) output from the non-encrypted swap (wink-wink, nudge-nudge)
  • Let’s do some ent analysis and see what results we get (for the hastily un-strict compliant Perl code look at the end of the article)

    ################################################################################
    processing file: P_for_Paranoia.tc 16777216 bytes
    Entropy = 7.999988 bits per byte.

    Optimum compression would reduce the size
    of this 16777216 byte file by 0 percent.

    Chi square distribution for 16777216 samples is 288.04, and randomly
    would exceed this value 10.00 percent of the times.

    Arithmetic mean value of data bytes is 127.4834 (127.5 = random).
    Monte Carlo value for Pi is 3.141790185 (error 0.01 percent).
    Serial correlation coefficient is 0.000414 (totally uncorrelated = 0.0).

    ################################################################################
    processing file: P_for_Paranoia.ur 16777216 bytes
    Entropy = 7.999989 bits per byte.

    Optimum compression would reduce the size
    of this 16777216 byte file by 0 percent.

    Chi square distribution for 16777216 samples is 244.56, and randomly
    would exceed this value 50.00 percent of the times.

    Arithmetic mean value of data bytes is 127.4896 (127.5 = random).
    Monte Carlo value for Pi is 3.143757139 (error 0.07 percent).
    Serial correlation coefficient is -0.000063 (totally uncorrelated = 0.0).

    ################################################################################
    processing file: seed 16671329 bytes
    Entropy = 5.751438 bits per byte.

    Optimum compression would reduce the size
    of this 16671329 byte file by 28 percent.

    Chi square distribution for 16671329 samples is 101326138.53, and randomly
    would exceed this value 0.01 percent of the times.

    Arithmetic mean value of data bytes is 82.9071 (127.5 = random).
    Monte Carlo value for Pi is 3.969926804 (error 26.37 percent).
    Serial correlation coefficient is 0.349229 (totally uncorrelated = 0.0).

    ################################################################################
    processing file: /etc/passwd 1854 bytes
    Entropy = 4.898835 bits per byte.

    Optimum compression would reduce the size
    of this 1854 byte file by 38 percent.

    Chi square distribution for 1854 samples is 20243.47, and randomly
    would exceed this value 0.01 percent of the times.

    Arithmetic mean value of data bytes is 86.1019 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.181177 (totally uncorrelated = 0.0).

    ################################################################################
    processing file: /usr/lib/firefox-4.0.1/libxul.so 31852744 bytes
    Entropy = 5.666035 bits per byte

    Optimum compression would reduce the size
    of this 31852744 byte file by 29 percent.

    Chi square distribution for 31852744 samples is 899704400.21, and randomly
    would exceed this value 0.01 percent of the times.

    Arithmetic mean value of data bytes is 74.9209 (127.5 = random).
    Monte Carlo value for Pi is 3.563090648 (error 13.42 percent).
    Serial correlation coefficient is 0.391466 (totally uncorrelated = 0.0).

    Focusing on entropy, we see that
    Truecrypt: Entropy = 7.999988 bits per byte.
    /dev/urandom: Entropy = 7.999989 bits per byte.

    which are directly comparable (if you are trusting ent that is) and much better than a well structured binary file (5.666035 bits per byte) and heads and shoulders our seed.txt results (which is a conglomerate unlikely to be encountered in practice). Chi-square entropy distribution values are different by a factor of 5 in our example, in favor of /dev/urandom data, which is still way more than the data encountered in our other test cases.

    From the above, there is strong indication that when you need random-like data and /dev/urandom is too slow (for example, as I will elaborate on an upcoming post), for example when you want to “randomize” your swap area, a Truecrypt volume will do in a pinch.

    #!/usr/bin/env perl
    use warnings;
    use File::stat;
    # a 5 min script (AKA no strict compliance) to supplement results for a blog article
    # why perl? Nostalgia :-)

    @subjects = qw(P_for_Paranoia.tc P_for_Paranoia.ur seed /etc/passwd /usr/lib/firefox-4.0.1/libxul.so);
    sub analyzeEnt {
    my($file) = @_;
    my $sz = stat($file)->size;
    my $ent = `ent $file` ."\n";
    print "#" x 80 . "\nprocessing file: $file ". $sz ." bytes\n".$ent;
    }
    foreach my $subject (@subjects) {
    &analyzeEnt($subject);
    }

    Hello world demystified

    Hello all,
    in the finest programming tradition, I take it that every time one took a programming course or decided to have a quick look into a programming language, writing a “Hello World” program was one of the first things done.
    However, even in such a small program, do you really know what it does behind the scenes? Yiannis blogged about it here, giving a Java test case (and by the way, allow me to extend my welcome!) and Dimitris provided a C equivalent there. However, let’s do a roundup. After whacking some Perl code, I got the following results

    Let’s start with the old workhorse, Perl:


    ################################################################################
    The Perl version
    ################################################################################
    strace -c -f -q perl hello.pl
    % time seconds usecs/call calls errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.000029 1 36 13 open
    0.00 0.000000 0 10 read
    0.00 0.000000 0 1 write
    0.00 0.000000 0 23 close
    0.00 0.000000 0 1 execve
    0.00 0.000000 0 1 time
    0.00 0.000000 0 7 7 access
    0.00 0.000000 0 4 brk
    0.00 0.000000 0 4 3 ioctl
    0.00 0.000000 0 1 readlink
    0.00 0.000000 0 2 munmap
    0.00 0.000000 0 1 uname
    0.00 0.000000 0 8 mprotect
    0.00 0.000000 0 4 2 _llseek
    0.00 0.000000 0 4 rt_sigaction
    0.00 0.000000 0 1 rt_sigprocmask
    0.00 0.000000 0 1 getrlimit
    0.00 0.000000 0 31 mmap2
    0.00 0.000000 0 3 3 stat64
    0.00 0.000000 0 22 fstat64
    0.00 0.000000 0 1 getuid32
    0.00 0.000000 0 1 getgid32
    0.00 0.000000 0 1 geteuid32
    0.00 0.000000 0 1 getegid32
    0.00 0.000000 0 1 fcntl64
    0.00 0.000000 0 3 1 futex
    0.00 0.000000 0 1 set_thread_area
    0.00 0.000000 0 1 set_tid_address
    0.00 0.000000 0 1 set_robust_list
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.000029 176 29 total
    Hello World!


    ltrace -c -f perl hello.pl
    % time seconds usecs/call calls function
    ------ ----------- ----------- --------- --------------------
    91.07 2.013999 9455 213 calloc
    2.15 0.047513 64 733 pthread_getspecific
    1.44 0.031752 64 492 memcpy
    1.33 0.029410 65 446 malloc
    1.08 0.023974 64 370 memset
    1.05 0.023194 64 357 strlen
    0.33 0.007340 63 115 free
    0.31 0.006860 64 106 memmove
    0.25 0.005633 1126 5 setlocale
    0.19 0.004176 64 65 strrchr
    0.15 0.003260 65 50 strchr
    0.12 0.002586 63 41 strncmp
    0.07 0.001598 61 26 __ctype_toupper_loc
    0.07 0.001586 61 26 __ctype_tolower_loc
    0.04 0.000961 64 15 pthread_mutex_lock
    0.04 0.000951 63 15 pthread_mutex_unlock
    0.04 0.000781 65 12 getenv
    0.02 0.000466 66 7 realloc
    0.02 0.000445 111 4 isatty
    0.02 0.000431 107 4 lseek64
    0.02 0.000358 119 3 read
    0.02 0.000343 114 3 __xstat64
    0.02 0.000332 66 5 __errno_location
    0.01 0.000307 61 5 pthread_mutex_init
    0.01 0.000258 64 4 __sigsetjmp
    0.01 0.000243 121 2 open64
    0.01 0.000229 114 2 close
    0.01 0.000192 64 3 pthread_mutex_destroy
    0.01 0.000155 155 1 readlink
    0.01 0.000135 67 2 strxfrm
    0.01 0.000132 66 2 sysconf
    0.01 0.000127 63 2 localeconv
    0.01 0.000118 118 1 time
    0.01 0.000118 118 1 write
    0.01 0.000114 114 1 fcntl
    0.01 0.000111 111 1 __fxstat64
    0.00 0.000110 110 1 getuid
    0.00 0.000109 109 1 sigaction
    0.00 0.000109 109 1 signal
    0.00 0.000104 104 1 geteuid
    0.00 0.000102 102 1 getegid
    0.00 0.000102 102 1 getgid
    0.00 0.000094 94 1 __register_atfork
    0.00 0.000083 83 1 pthread_setspecific
    0.00 0.000069 69 1 strcmp
    0.00 0.000068 68 1 __ctype_b_loc
    0.00 0.000068 68 1 srand48_r
    0.00 0.000068 68 1 getpid
    0.00 0.000068 68 1 pthread_key_create
    0.00 0.000067 67 1 memchr
    0.00 0.000066 66 1 drand48_r
    0.00 0.000065 65 1 pthread_key_delete
    0.00 0.000064 64 1 nl_langinfo
    ------ ----------- ----------- --------- --------------------
    100.00 2.211604 3157 total
    Hello World!

    Moving up a bit, it is time for some Python statistics:


    ################################################################################
    The Python version
    ################################################################################
    strace -c -f -q python hello.py
    % time seconds usecs/call calls errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    56.25 0.000027 0 177 118 open
    43.75 0.000021 0 88 fstat64
    0.00 0.000000 0 89 read
    0.00 0.000000 0 1 write
    0.00 0.000000 0 59 close
    0.00 0.000000 0 1 execve
    0.00 0.000000 0 10 10 access
    0.00 0.000000 0 6 brk
    0.00 0.000000 0 5 3 ioctl
    0.00 0.000000 0 3 2 readlink
    0.00 0.000000 0 29 munmap
    0.00 0.000000 0 1 uname
    0.00 0.000000 0 11 mprotect
    0.00 0.000000 0 3 _llseek
    0.00 0.000000 0 68 rt_sigaction
    0.00 0.000000 0 1 rt_sigprocmask
    0.00 0.000000 0 1 getcwd
    0.00 0.000000 0 1 getrlimit
    0.00 0.000000 0 57 mmap2
    0.00 0.000000 0 100 52 stat64
    0.00 0.000000 0 1 lstat64
    0.00 0.000000 0 1 getuid32
    0.00 0.000000 0 1 getgid32
    0.00 0.000000 0 1 geteuid32
    0.00 0.000000 0 1 getegid32
    0.00 0.000000 0 4 getdents64
    0.00 0.000000 0 1 fcntl64
    0.00 0.000000 0 3 1 futex
    0.00 0.000000 0 1 set_thread_area
    0.00 0.000000 0 1 set_tid_address
    0.00 0.000000 0 1 set_robust_list
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.000048 727 186 total
    Hello world!


    ltrace -c -f python hello.py
    % time seconds usecs/call calls function
    ------ ----------- ----------- --------- --------------------
    35.03 2.118504 1153 1837 free
    34.89 2.109835 1226 1720 strcmp
    10.84 0.655259 63 10259 memcpy
    6.76 0.408768 64 6312 strlen
    3.75 0.226680 63 3556 __ctype_b_loc
    2.30 0.139353 65 2130 memset
    1.75 0.105637 65 1625 malloc
    1.07 0.064819 16204 4 qsort
    0.84 0.050971 65 784 realloc
    0.43 0.025782 65 396 strchr
    0.32 0.019097 119 160 fopen64
    0.24 0.014268 62 227 strcpy
    0.21 0.012702 64 198 strncpy
    0.20 0.011854 81 145 _IO_getc
    0.19 0.011645 66 176 readdir64
    0.19 0.011473 114 100 __xstat64
    0.12 0.007396 108 68 sigaction
    0.12 0.007067 62 113 pthread_self
    0.11 0.006385 145 44 fclose
    0.09 0.005559 173 32 fread
    0.08 0.005079 108 47 __fxstat64
    0.05 0.003305 63 52 fileno
    0.05 0.003187 63 50 __strcpy_chk
    0.05 0.002915 63 46 strrchr
    0.04 0.002119 64 33 __errno_location
    0.03 0.001530 63 24 memchr
    0.03 0.001526 63 24 __rawmemchr
    0.02 0.001389 463 3 setlocale
    0.02 0.001136 63 18 __memcpy_chk
    0.02 0.001092 64 17 sem_post
    0.01 0.000880 62 14 memmove
    0.01 0.000796 66 12 getenv
    0.01 0.000723 65 11 strcat
    0.01 0.000715 71 10 __vsnprintf_chk
    0.01 0.000651 65 10 sem_trywait
    0.01 0.000585 65 9 ferror
    0.01 0.000549 109 5 isatty
    0.01 0.000456 65 7 sem_wait
    0.01 0.000347 86 4 fflush
    0.01 0.000344 114 3 readlink
    0.01 0.000336 67 5 __strncpy_chk
    0.01 0.000329 164 2 opendir
    0.01 0.000320 64 5 __ctype_tolower_loc
    0.00 0.000295 73 4 strerror
    0.00 0.000265 66 4 sigemptyset
    0.00 0.000262 131 2 ftell
    0.00 0.000245 245 1 fseek
    0.00 0.000226 113 2 closedir
    0.00 0.000199 66 3 sem_init
    0.00 0.000199 66 3 strncat
    0.00 0.000196 65 3 flockfile
    0.00 0.000187 62 3 funlockfile
    0.00 0.000182 182 1 __realpath_chk
    0.00 0.000165 165 1 fwrite
    0.00 0.000148 74 2 __strdup
    0.00 0.000133 133 1 geteuid
    0.00 0.000116 116 1 getuid
    0.00 0.000115 115 1 getegid
    0.00 0.000115 115 1 getgid
    0.00 0.000113 113 1 rewind
    0.00 0.000111 111 1 __uflow
    0.00 0.000080 80 1 sysconf
    0.00 0.000075 75 1 getpid
    0.00 0.000074 74 1 __libc_current_sigrtmin
    0.00 0.000069 69 1 __libc_current_sigrtmax
    0.00 0.000069 69 1 clearerr
    0.00 0.000069 69 1 __strcat_chk
    0.00 0.000068 68 1 fputs
    0.00 0.000068 68 1 ungetc
    0.00 0.000068 68 1 nl_langinfo
    ------ ----------- ----------- --------- --------------------
    100.00 6.047275 30341 total
    Hello world!

    Let see some Ruby (see if you can spot something peculiar!)


    ################################################################################
    The Ruby version
    ################################################################################
    strace -c -f -q ruby hello.rb
    % time seconds usecs/call calls errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.000020 2 11 fstat64
    0.00 0.000000 0 10 read
    0.00 0.000000 0 1 write
    0.00 0.000000 0 11 open
    0.00 0.000000 0 11 close
    0.00 0.000000 0 1 execve
    0.00 0.000000 0 9 9 access
    0.00 0.000000 0 4 brk
    0.00 0.000000 0 1 gettimeofday
    0.00 0.000000 0 2 munmap
    0.00 0.000000 0 1 uname
    0.00 0.000000 0 11 mprotect
    0.00 0.000000 0 5 sigprocmask
    0.00 0.000000 0 14 rt_sigaction
    0.00 0.000000 0 3 rt_sigprocmask
    0.00 0.000000 0 2 getrlimit
    0.00 0.000000 0 26 mmap2
    0.00 0.000000 0 1 getuid32
    0.00 0.000000 0 1 getgid32
    0.00 0.000000 0 2 geteuid32
    0.00 0.000000 0 2 getegid32
    0.00 0.000000 0 2 1 futex
    0.00 0.000000 0 1 set_thread_area
    0.00 0.000000 0 1 set_tid_address
    0.00 0.000000 0 1 set_robust_list
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.000020 134 10 total
    Hello World!


    ltrace -c -f ruby hello.rb
    % time seconds usecs/call calls function
    ------ ----------- ----------- --------- --------------------
    90.19 0.008306 8306 1 ruby_init
    8.44 0.000777 777 1 ruby_options
    1.37 0.000126 126 1 ruby_init_stack
    ------ ----------- ----------- --------- --------------------
    100.00 0.009209 3 total
    Hello World!

    Java, which already has been covered:


    ################################################################################
    The Java version
    ################################################################################
    strace -c -f -q java Hello
    % time seconds usecs/call calls errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    99.92 0.490957 3836 128 16 futex
    0.04 0.000193 0 9718 gettimeofday
    0.01 0.000056 0 206 149 open
    0.01 0.000041 3 13 clone
    0.01 0.000040 0 151 111 stat64
    0.01 0.000029 0 768 read
    0.01 0.000029 1 23 munmap
    0.00 0.000023 0 734 _llseek
    0.00 0.000000 0 2 write
    0.00 0.000000 0 56 close
    0.00 0.000000 0 1 unlink
    0.00 0.000000 0 2 execve
    0.00 0.000000 0 20 18 access
    0.00 0.000000 0 1 1 mkdir
    0.00 0.000000 0 13 brk
    0.00 0.000000 0 1 setrlimit
    0.00 0.000000 0 4 readlink
    0.00 0.000000 0 1 ftruncate
    0.00 0.000000 0 4 uname
    0.00 0.000000 0 47 mprotect
    0.00 0.000000 0 2 getdents
    0.00 0.000000 0 2 sched_yield
    0.00 0.000000 0 27 rt_sigaction
    0.00 0.000000 0 48 rt_sigprocmask
    0.00 0.000000 0 1 getcwd
    0.00 0.000000 0 4 getrlimit
    0.00 0.000000 0 126 mmap2
    0.00 0.000000 0 58 1 lstat64
    0.00 0.000000 0 51 fstat64
    0.00 0.000000 0 3 getuid32
    0.00 0.000000 0 2 getgid32
    0.00 0.000000 0 3 geteuid32
    0.00 0.000000 0 2 getegid32
    0.00 0.000000 0 4 madvise
    0.00 0.000000 0 8 getdents64
    0.00 0.000000 0 15 fcntl64
    0.00 0.000000 0 14 gettid
    0.00 0.000000 0 88 sched_getaffinity
    0.00 0.000000 0 2 set_thread_area
    0.00 0.000000 0 2 set_tid_address
    0.00 0.000000 0 19 clock_gettime
    0.00 0.000000 0 1 clock_getres
    0.00 0.000000 0 15 set_robust_list
    0.00 0.000000 0 2 socket
    0.00 0.000000 0 2 2 connect
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.491368 12394 298 total
    Hello World!


    ltrace -c -f java Hello
    % time seconds usecs/call calls function
    ------ ----------- ----------- --------- --------------------
    71.44 2.000665 181878 11 strncmp
    27.65 0.774271 774271 1 pthread_join
    0.20 0.005578 68 82 fgets
    0.13 0.003744 3744 1 dlopen
    0.06 0.001678 279 6 sysconf
    0.06 0.001583 63 25 strlen
    0.05 0.001525 89 17 JLI_MemAlloc
    0.05 0.001490 62 24 JLI_StringDup
    0.05 0.001478 61 24 strcspn
    0.04 0.001228 61 20 strspn
    0.04 0.001075 63 17 JLI_MemFree
    0.03 0.000706 64 11 getenv
    0.02 0.000579 72 8 __sprintf_chk
    0.02 0.000493 61 8 strrchr
    0.02 0.000456 76 6 __strcat_chk
    0.01 0.000370 370 1 pthread_create
    0.01 0.000347 173 2 fclose
    0.01 0.000321 64 5 strchr
    0.01 0.000263 131 2 readlink
    0.01 0.000263 131 2 fopen
    0.01 0.000240 120 2 __xstat
    0.01 0.000234 117 2 access
    0.01 0.000219 109 2 getgid
    0.01 0.000213 106 2 getuid
    0.01 0.000212 106 2 getegid
    0.01 0.000209 104 2 geteuid
    0.01 0.000148 74 2 dlsym
    0.01 0.000147 73 2 strncpy
    0.00 0.000126 63 2 JLI_FreeManifest
    0.00 0.000125 62 2 __strcpy_chk
    0.00 0.000124 62 2 fflush
    0.00 0.000074 74 1 pthread_attr_destroy
    0.00 0.000074 74 1 JLI_WildcardExpandClasspath
    0.00 0.000068 68 1 putenv
    0.00 0.000068 68 1 pthread_attr_init
    0.00 0.000067 67 1 strcat
    0.00 0.000065 65 1 pthread_attr_setdetachstate
    0.00 0.000065 65 1 getpid
    0.00 0.000064 64 1 pthread_attr_setstacksize
    ------ ----------- ----------- --------- --------------------
    100.00 2.800655 303 total
    Hello World!

    C (can you spot the gcc switch I used btw?)


    ################################################################################
    The C version
    ################################################################################
    strace -c -f -q ./helloC
    % time seconds usecs/call calls errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    -nan 0.000000 0 1 read
    -nan 0.000000 0 1 write
    -nan 0.000000 0 2 open
    -nan 0.000000 0 2 close
    -nan 0.000000 0 1 execve
    -nan 0.000000 0 3 3 access
    -nan 0.000000 0 1 brk
    -nan 0.000000 0 1 munmap
    -nan 0.000000 0 4 mprotect
    -nan 0.000000 0 7 mmap2
    -nan 0.000000 0 3 fstat64
    -nan 0.000000 0 1 set_thread_area
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.000000 27 3 total
    Hello World!


    ltrace -c -f ./helloC
    % time seconds usecs/call calls function
    ------ ----------- ----------- --------- --------------------
    100.00 0.000426 426 1 __printf_chk
    ------ ----------- ----------- --------- --------------------
    100.00 0.000426 1 total
    Hello World!

    Assembly

    ################################################################################
    The assembly version
    ################################################################################
    strace -c -f -q ./helloAsm
    % time seconds usecs/call calls errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    -nan 0.000000 0 1 write
    -nan 0.000000 0 1 execve
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.000000 2 total
    Hello World!


    ltrace -c -f ./helloAsm
    ltrace: Couldn't find .dynsym or .dynstr in "./helloAsm"
    % time seconds usecs/call calls function
    ------ ----------- ----------- --------- --------------------
    ------ ----------- ----------- --------- --------------------
    100.00 0.000000 0 total

    Obviously, this is not a comparative post, all the languages mentioned have their own design goals and not all implementations are equivalent.

    CPython threading vs multiprocessing – the 5 minute introduction

    MJC brought to my attention the following piece of code which attempts to provide some empirical data comparing the use of threads vs processes in CPython.

    This code, which I understand that it is stored for historical purposes, has some bugs, which I have pointed out in the comment section, but this is not the point of this article (something like this would be not constructive). I believe that a far better contribution would be to write a short piece attempting to point out a few “gotchas”.

    Bit rot took hold and the post above has disappeared but the rest of the article still applies …

    For the rest of this small article, I assume that you are familiar with how threads work in CPython, especially the GIL “issue”. If you are not then, in layman’s terms, the GIL lock is one big lock that ensures that only one thread is executing per Python interpreter, in order to facilitate more efficient garbage collection. It should be clarified that the GIL lock applies to CPython and not Jython, which manages to “map” python threads to “true-parallelizable-per-core-as-they-were-used-to-be” threads. Jython comes with a performance penalty though so YMMV. (NB: Ruby 1.8.x suffers from the same “issue” so no smug Ruby fans need apply 🙂 ) To sum it up, in CPython, only one thread is allowed to use the python interpreter at any given time. “What” I can hear you lamenting “this will kill my threading performance”. However, keep on reading, all is not lost.

    First of all, the GIL ensures that only one thread is executing Python bytecode. However, obviously, this is a performance killer in CPU intensive applications, if you are solving an I/O bound problem then the effect is reduced. If you are using CPython to interface with C code (something that CPython does implicitly for certain library calls), once you have reached C territory you are GIL-free. In order to overcome this limitation, the multiprocessing module was introduced, a module that shares the API with threading module. Multiprocessing attempts a workaround of the GIL limitation by spawning a new python interpreter in a separate process. In a multicore system, this allows for parallelism, much better than threads. However, someone might assume at first that “wow, since a design goal of the API was to be nearly identical, now I can do something along the lines of :s/threading.Thread/multiprocessing.Process/g and I am home free and reap all the performance benefits”.

    Not so fast, junior.

    Access in global variables is the same when using threading, however if you are using multiprocessing you have to specify explicitly what is shared between the various python interpreters, which adds a bit of code complexity . A nice security side-effect of using multiprocessing, as opposed to threads, is that by controlling what is really shared between processes you get a poor-man’s sandbox for free. (I will update this blog post with some practical examples a bit later)

    One might be able to get a few small speedups and proper process control by using a process pool (e.g. you definitely do not want your program to start creating processes uncontrollably, you might hit OS limits or worse have a HIDS detect a privilege abuse or something similar). Given the fact that creating a new process is a somewhat costly operation, in the simplest form, you just create a bunch of them upfront, have them “sleeping” and assign tasks as needed. Given that there are a lot of thread pool implementations out there and that the multiprocessing API closely matches, one should be able to code something like that in a small amount of time. A nice trick that I have seen in practice is fire up X processes, where X is the number of cores and then assign a thread pool to each of them.

    Closing, in a multicore system, when solving CPU intensive problems, you are likely to get speed improvements by using multiprocessing as opposed to threads. As with all forms of parallel programming, subtle bugs might be introduced. I hope that my post gave a nice starting point for some “gotchas”

    Some random thoughts on Greek startups

    [The article below is somewhat of a rant, read it at your own peril and yes I know this is not the proper way to resume blogging after months and months of inactivity]
    So it seems that the Greek IT market lately has seen an influx of “startups”. Their implied cause appears to be quite a noble one, “be not what is traditionally associated with the words “Greek IT” (or “get-rich-quick-or-die-trying” for the more pragmatic amongst you). However, the whole thing reeks of the “Johnny-come-lately” syndrome, at least for me. Below is a partial list of my pet-peeves:

    1) i-somethings will NOT make you a millionaire. Sure, there are more than enough success stories doing the rounds but the whole market is rather hit-n-miss, with the miss part taking the lion’s share. There is at least one book about it, read it please.

    2) Copying SF/CA lingo wholesale does not magically transform an economy (and its main constituents, the people) to a more advanced one. I have seen more Something-Ninjas lately that I did in my childhood (and boy, I was a HUGE Ninja flick fan, I have seen them all and took the Ninjutsu lessons to prove it, complete with a laminated “NINJA” card). People mean connections and connections might mean all the difference in the world

    3) In order for startups to flourish, you do not only need an idea and a shiny business plan, you need smart people (actually “smart and get things done!”, to somewhat narrow it down) to do the actual work. In fact, I will be more than willing to assume the somewhat controversial view that mediocre ideas with good execution will beat goodideas with mediocre execution. So the problem is, where can you find such smart people (even not in the Spolskean sense of absolute brilliance) and get them to work for your project on “Greek startup pay” (incidentally, we do not offer “Greek startup pay” or “Greek pay”, for the local standards we pay well)?

    3.5) To expand on the previous point using some anecdotal data, I am in charge of the second interview round (the technical one) for our project. I believe that I like to give people a chance, I do not care as much about experience as I care about actual ability to adapt and improvise.The hiring route is the “traditional” one (some generic ad posted plus whatever contacts one of the guys was willing to dig up). As of now, I have yet to see a candidate that blows me away and screams “INSTANT HIRE”. (Full Disclosure: we are not located in Athens, however we offset this one by being heavily involved with the local University, which is a good one). In fact, most of the (few!) applications we receive fail to pass the first round altogether, which can be summarized as “we buy you coffee, we REALLY WANT to hire you so please help us do so by demonstrating basic ability so we can see if you are up for the tasks at hand” (This one deserves a separate blog post).

    PS, you can wager that our hiring process is broken (which indeed it is in more than one place) but I see a lot of people relying on good-ole nepotism to get ahead, neglecting personal and professional skills. I will not even comment on so-called start-ups that will recruit anyone (elitism is a virtue if you are a startup).

    4) “You ain’t gonna need it” should be etched in the brain of every new start-up. Copying blindly the technology list (the buzzword parade, if you prefer) from other successful startups does not guarantee success. I am all for “bleeding-edge” technologies, however it is highly unlikely that a Greek start-up needs all of the following: five-nines,three-letter-agency-security, scalability that can rival Skynet of Terminator fame, all that made by two guys working for 900 Euros a month each (I have heard this figure quoted, as a “competitive salary figure”).

    5) Speaking of money, in order to make money, you need to spend money. Sorry, but say Google or Facebook, indeed got (and spent in all the right places, having been taught their lesson from the dot com bubble) some serious $$$ in their infancy to get the ball rolling. Efficiency and cutting some corners is one thing. Being cheap/under-resourced is another ballgame altogether.

    Is there a Greek startup that actually transcends the aforementioned shortcomings and already should have been in your radar for the past couple of years? Yes! But again, most “startup-ites/σταρταπακηδες” I know (you know them, the guys that come to work with a new idea every day that will make them millionaires and usually involves “buzzword of the month”) only got wind of it a few days ago when they got name dropped from the Greek Prime Minister)

    [Update Sun Sep 26 EEST 2010]

    While this post did not gathered the number of comments I would have like to have seen, there was a interesting conversation over twitter with dfunc. dfunc believes that I am overly harsh, quoting “regarding your startups-post, times are tough for everyone here esp. for the startups.”. Now first things first, indeed the article was a bit of a rant, and at least partially fueled by a conversation at a certain Greek startup site and some conversations with some 2nd degree colleagues. This post was not meant in any to disrespect the hard work and plain cojones (oops!, I guess I should have labeled this one NSFW or something), in fact I used the line “be not what is traditionally associated with the words “Greek IT”. So with this out of the way, I will offer some additional insight to my definitely not complete list of pet peeves.

    So, dfunc writes “Regarding the iSomething COs, they develop products so their startup period is much tougher than that of service providers.” Now, while this line is correct, is my biggest pet-peeve of all. For some reason or another, the iSomething is supposed to have a pretty low barrier of entry (notice the supposed, I do not personally subscribe to that train of thought, I believe that *any* software that has a degree of ingenuity and quality is far from trivial to design, implement, secure and deploy). This, by definition, leads to a hit-driven market, where you have a few hits and a zillion of misses. While you get the illusion that you have a “get-rich-quick” scheme, statistics imply that you are most likely to end up in the dustbin than being a high earner. And I will restate my assumption again: “In order to get $$$ you must spend $$$”. So, given the risk that goes with the territory, I fail to see why almost every other Greek startup out there is aiming for the iSomething market.

    Finally, “A new business should have a plan (“orama”) to where it’s heading. I agree with u on that some of these share an unrealistic plan” – Amen, even if I believe that “some of these” should have been “most of these”.

    The trouble with OSSIM

    If you fiddle with MSSP, then you will most likely have heard of OSSIM. While the software itself appears quite powerful, it lacks in the MOST critical aspect of all: there is little if any documentation. Most of it is outdated and sources of information include a forum and a blog/tweeter feed. Come on OSSIM guys, you have the ball rolling, now you have to make it a nightmare for us to use the (otherwise quite good) product? To their defense, they have an open job posting about writing documentation but looks like there are no takers.