Small gotcha under Python’s PYTHONOPTIMIZE feature

A long, long, long time ago I started using interpreter optimizations to help organize code. If a block of code is guarded by a constant (or an expression of constants) that evaluates to False, then the block (or line) is optimized away.

if (false){ # this is optimized away }

Perl was very generous, and would allow for constants, true/false, and operations between the two to work.

Thankfully, Python has some optimizations available via the PYTHONOPTIMIZE environment variable (which can also be set via the -O and -OO flags). They control the __debug__ constant, which is True (by default) or False (when optimizations are enabled), and they strip documentation strings when the level is 2 (-OO) or higher. Using these flags in a production environment allows for very verbose documentation and extra debugging logic in a development environment.
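For illustration, here is a minimal sketch (the function and names are hypothetical) of code that behaves differently depending on the optimization level:

def lookup(key, cache):
    """This docstring is stripped under -OO / PYTHONOPTIMIZE=2."""
    if __debug__:
        # this whole block is compiled out under -O / PYTHONOPTIMIZE=1 or higher
        print("cache lookup: %r" % key)
    # assert statements are also removed when optimizations are enabled
    assert key is not None, "key must not be None"
    return cache.get(key)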

Unfortunately, I implemented this incorrectly in some places. To fine-tune some development debugging routines I had some lines like this:

if __debug__ and DEBUG_CACHE:
    pass

Python’s interpreter doesn’t optimize that away, whether DEBUG_CACHE is a variable set to False or the literal constant False itself (I tested under 2.7 and 3.5). I should have realized this (or at least tested for it). I didn’t notice it until I profiled my app and saw a bunch of additional statistics logging that should have been compiled away.

The correct way to write the above is:

if __debug__:
    if DEBUG_CACHE:
        pass

Here is a quick test

Try running it with optimizations turned on:

export PYTHONOPTIMIZE=2
python test.py

and with them off

unset PYTHONOPTIMIZE
python test.py

As far as the interpreter is concerned during the optimization phase, `__debug__ and False` is not a constant expression, so it cannot be optimized away.

import dis

def foo():
    """docstring"""
    if __debug__:
        print("debug message")
    return True

def bar():
    """docstring"""
    if __debug__ and False:
        print("debug message")
    return True

# ============

# this will be a lean function
dis.dis(foo)

print "- - - - - - - -"
# this will show extended logic
dis.dis(bar)

The unoptimized version of foo should look like this:

21           0 LOAD_CONST               1 ('debug message')
             3 PRINT_ITEM          
             4 PRINT_NEWLINE       

22           5 LOAD_GLOBAL              0 (True)
             8 RETURN_VALUE        

But you should see that the optimized version of foo creates some very lean code:

22           0 LOAD_GLOBAL              0 (True)
             3 RETURN_VALUE        

Now let’s look at the bar function when unoptimized:

26           0 LOAD_GLOBAL              0 (__debug__)
             3 POP_JUMP_IF_FALSE       20
             6 LOAD_GLOBAL              1 (False)
             9 POP_JUMP_IF_FALSE       20

27          12 LOAD_CONST               1 ('debug message')
            15 PRINT_ITEM          
            16 PRINT_NEWLINE       
            17 JUMP_FORWARD             0 (to 20)

28     >>   20 LOAD_GLOBAL              2 (True)
            23 RETURN_VALUE        

The optimized version of bar generates exactly the same code:

26           0 LOAD_GLOBAL              0 (__debug__)
             3 POP_JUMP_IF_FALSE       20
             6 LOAD_GLOBAL              1 (False)
             9 POP_JUMP_IF_FALSE       20

27          12 LOAD_CONST               1 ('debug message')
            15 PRINT_ITEM          
            16 PRINT_NEWLINE       
            17 JUMP_FORWARD             0 (to 20)

28     >>   20 LOAD_GLOBAL              2 (True)
            23 RETURN_VALUE

So let’s add a new function, biz:

def biz():
    """docstring"""
    if __debug__:
        if False:
            print("debug message")
    return True

The unoptimized code:

33           0 LOAD_GLOBAL              0 (False)
             3 POP_JUMP_IF_FALSE       14

34           6 LOAD_CONST               1 ('debug message')
             9 PRINT_ITEM          
            10 PRINT_NEWLINE       
            11 JUMP_FORWARD             0 (to 14)

35     >>   14 LOAD_GLOBAL              1 (True)
            17 RETURN_VALUE    

And a lot of that gets optimized away:

35           0 LOAD_GLOBAL              0 (True)
             3 RETURN_VALUE        

Not many people use/abuse how the interpreter compiles code to their advantage, but if you do — pay attention to constructs like this.

Using __debug__ is a great way to hide logging code in a production environment. The evaluation of __debug__ only happens once, when Python first compiles the code, so there is very little runtime overhead.

Optimized Archiving of Historical Time-Series Data for Analytics

The (forthcoming) Aptise platform features a web-indexer that tabulates a lot of statistical data each day for domains across the global internet.

While we don’t currently need this data, we may need it in the future. Keeping it in the “live” database is not really a good idea – it’s never used and just becomes a growing stack of general crap that automatically gets backed-up and archived every day as part of standard operations. It’s best to keep production lean and move this unnecessary data out-of-sight and out-of-mind.

An example of this type of data is a daily calculation on the relevance of each given topic to each domain that is tracked. We’re primarily concerned with conveying the current relevance, but in the future we will address historical trends.

Let’s look at the basic storage requirements of this as a PostgreSQL table:

| Column    | Format  | Size    |
| --------- | ------- | ------- |
| date      | Date    | 8 bytes |
| domain_id | Integer | 4 bytes |
| topic_id  | Integer | 4 bytes |
| count     | Integer | 4 bytes |

Each row is taking 20 bytes, 8 of which are due to date alone.

Storing this in PostgreSQL across thousands of domains and tags every day takes a significant chunk of storage space – and this is only one of many similar tables that primarily have archival data.

An easy way to save some space for archiving purposes is to segment the data by date, and move the storage of date information from a row and into the organizational structure.

If we were to keep everything in Postgres, we would create an explicit table for the date. For example:

CREATE TABLE report_table_a__20160101 (domain_id INT NOT NULL, topic_id INT NOT NULL, count INT NOT NULL DEFAULT 0);

| Column    | Format  | Size    |
| --------- | ------- | ------- |
| domain_id | Integer | 4 bytes |
| topic_id  | Integer | 4 bytes |
| count     | Integer | 4 bytes |

This structure conceptually stores the same data, but instead of repeating the date in every row, we record it only once within the table’s name. This simple shift will lead to a nearly 40% reduction in size.

In our use-case, we don’t want to keep this in PostgreSQL because the extra data complicates automated backups and storage. Even if we wanted this data live, having it spread across hundreds of tables would be overkill. So for now, we’re going to export the data from a single date into a new file.

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01';

And we’re going to save that into a comma delimited file:

table_a-20160101.csv

I skipped a lot of steps above because I do this in Python — for reasons I’m about to explain.
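For completeness, a minimal sketch of that export step (assuming psycopg2 and a hypothetical connection string) might look like this:

import csv
import psycopg2

# hypothetical connection settings - adjust for your environment
conn = psycopg2.connect("dbname=example")
cursor = conn.cursor()
cursor.execute(
    "SELECT domain_id, topic_id, count FROM table_a WHERE date = %s",
    ('2016-01-01',),
)
with open('table_a-20160101.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(cursor.fetchall())
cursor.close()
conn.close()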

As a raw csv file, my date-specific table is still pretty large at 7556804 bytes — so let’s consider ways to compress it:

Using standard zip compression, we can drop that down to 2985257 bytes. That’s ok, but not very good. If we use xz compression, it drops to a slightly better 2362719 bytes.
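If you want to approximate this comparison from within Python instead of the command line, a rough sketch looks like this (zlib stands in for zip’s deflate, lzma is the algorithm behind xz, and the exact byte counts will differ a bit from the zip and xz utilities):

import zlib
import lzma  # Python 3.3+ standard library

with open('table_a-20160101.csv', 'rb') as f:
    raw = f.read()

print("raw:  %d bytes" % len(raw))
print("zlib: %d bytes" % len(zlib.compress(raw, 9)))
print("lzma: %d bytes" % len(lzma.compress(raw)))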

We’ve already reduced the data to 60% of its original size by eliminating the date column, so these numbers are a pretty decent overall improvement — but considering the type of data we’re storing, the compression is just not very good. We’ve got to do better — and we can.

We can do much better, and it’s actually pretty easy. All we need to do is understand compression algorithms a bit.

Generally speaking, compression algorithms look for repeating patterns. When we pulled the data out of PostgreSQL, the rows came back in no particular order — to a compressor, that just looks like random numbers.

We can help the compression algorithm do its job by giving it better input. One way is through sorting:

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01' ORDER BY domain_id ASC, topic_id ASC, count ASC;

As a raw csv, this sorted date-specific table is still the same exact 7556804 bytes.

Look what happens when we try to compress it:

Using standard zip compression, we can drop that down to 1867502 bytes. That’s pretty good – we’re at 24.7% of the size of the raw file AND about 63% of the size of the non-sorted zip. That is a huge difference! If we use xz compression, we drop down to 1280996 bytes. That’s even better at 17%.

17% compression is honestly great, and remember — this is compressing the data that is already 40% smaller because we shifted the date column out. If we consider what the filesize with the date column would be, we’re actually at 10% compression. Wonderful.

I’m pretty happy with these numbers, but we can still do better — and without much more work.

As I said above, compression software looks for patterns. Although the sorting helped, we’re still a bit hindered because our data is in a “row” storage format. Consider this example:

1000,1000,1
1000,1000,2
1000,1000,3
1001,1000,1
1001,1000,2
1001,1000,3

There are lots of repeating patterns there, but not as many as if we represented the same information in a “column” storage format:

1000,1000,1000,1001,1001,1001
1000,1000,1000,1000,1000,1000
1,2,3,1,2,3

This is the same data, but as you can see it’s much more “machine” friendly – there are larger repeating patterns.

This transformation from row to column is an example of “transposing” an array of data; performing it (and reversing it) is incredibly easy with Python’s standard functions.

Let’s see what happens when I use transpose_to_csv below on the data from my csv file:

def transpose_to_csv(listed_data):
    """Given a list of rows (`listed_data`), turn a row-store into a col-store
    (or vice versa), returned as a csv string.  Reverse with `transpose_from_csv`."""
    zipped = zip(*listed_data)
    lines = [','.join([str(i) for i in zippedline]) for zippedline in zipped]
    return '\n'.join(lines)

def transpose_from_csv(string_data):
    """Given a string of csv data, revert the output of `transpose_to_csv`.
    Note: the values come back as strings, not integers."""
    destringed = string_data.split('\n')
    destringed = [line.split(',') for line in destringed]
    return list(zip(*destringed))
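
A quick, hypothetical usage example (the filenames follow the naming used above):

# read the sorted row-store csv, transpose it to a column store, and save it
with open('table_a-20160101.csv') as f:
    rows = [line.split(',') for line in f.read().splitlines()]

with open('table_a-20160101.transposed.csv', 'w') as f:
    f.write(transpose_to_csv(rows))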

As a raw csv, my transposed file is still the exact same size at 7556804 bytes.

However, if I zip the file – it drops down to 1425585 bytes.

And if I use xz compression… I’m now down to 804960 bytes.

This is a HUGE savings without much work.

* The raw data in PostgreSQL was probably about 12594673 bytes (estimated from the savings; the live table was deleted).
* Stripping out the date information and storing it in the filename generated a 7556804 byte csv file – 60% of the original size.
* Without thinking about compression, just lazily “zipping” the file produced 2985257 bytes.
* When we thought about compression – sorting the input, transposing it into a column store, and applying xz – we ended up with 804960 bytes: 10.7% of the csv size and 6.4% of the estimated size in PostgreSQL.

This considerably smaller amount of data can now be archived onto something like Amazon’s Glacier and worried about at a later date.

This may seem like a trivial amount of data to worry about, but keep in mind that this is a DAILY snapshot, and one of several tables. At 12MB a day in PostgreSQL, one year of data takes over 4GB of space on a system that is earmarked for high-priority data backups. This strategy turns that year of snapshots into under 300MB of information that can be warehoused on 3rd party systems. This saves us a lot of time and even more money! In our situation, this strategy is applied to multiple tables. Most importantly, the benefits cascade across our entire system as we free up space and resources.

These numbers could be improved upon further by finding an optimal sort order, or even using custom compression algorithms (such as storing the deltas between columns, then compressing). This was written to illustrate a quick way to easily optimize archived data storage.
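As a rough illustration of that delta idea (not reflected in any of the numbers above), delta-encoding a sorted column before compression leaves the compressor with mostly small, repetitive values:

def delta_encode(values):
    """[1000, 1000, 1001, 1003] -> [1000, 0, 1, 2]"""
    previous = 0
    encoded = []
    for value in values:
        encoded.append(value - previous)
        previous = value
    return encoded

def delta_decode(deltas):
    """reverse `delta_encode`"""
    total = 0
    decoded = []
    for delta in deltas:
        total += delta
        decoded.append(total)
    return decoded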

The results:

| Format   | Compression | Sorted? | Size     | % of csv+date |
| -------- | ----------- | ------- | -------- | ------------- |
| csv+date |             |         | 12594673 | 100           |
| row csv  |             |         | 7556804  | 60            |
| row csv  | zip         |         | 2985257  | 23.7          |
| row csv  | xz          |         | 2362719  | 18.8          |
| row csv  |             | Yes     | 7556804  | 60            |
| row csv  | zip         | Yes     | 1867502  | 14.8          |
| row csv  | xz          | Yes     | 1280996  | 10.2          |
| col csv  |             | Yes     | 7556804  | 60            |
| col csv  | zip         | Yes     | 1425585  | 11.3          |
| col csv  | xz          | Yes     | 804960   | 6.4           |

Note: I purposefully wrote out the ASC on the sql sorting above, because the sort order (and column order) does actually factor into the compression ratio. On my dataset, this particular column and sort order worked the best — but that could change based on the underlying data.

Trouble installing psycopg2 on OSX?

I had some trouble with a new virtualenv — psycopg2 wouldn’t install.

I remembered going through this before, but couldn’t find my notes. I ended up fixing it before finding my notes (which point to this StackOverflow question http://stackoverflow.com/questions/2088569/how-do-i-force-python-to-be-32-bit-on-snow-leopard-and-other-32-bit-64-bit-quest ), but I want to share this with others.

psycopg2 was showing compilation errors in relation to some PostgreSQL libraries

> ld: warning: in /Library/PostgreSQL/8.4.5/lib/libpq.dylib, missing required architecture x86_64 in file

so then I checked how the library was built:

$ file /Library/PostgreSQL/8.4.5/lib/libpq.dylib

> /Library/PostgreSQL/8.4.5/lib/libpq.dylib: Mach-O universal binary with 2 architectures
> /Library/PostgreSQL/8.4.5/lib/libpq.dylib (for architecture ppc): Mach-O dynamically linked shared library
> /Library/PostgreSQL/8.4.5/lib/libpq.dylib (for architecture i386): Mach-O dynamically linked shared library i386

Crap. It’s built for ppc and i386 only – no x86_64. The fix is easy, right? We just need to export ARCHFLAGS and build.

$ export ARCHFLAGS="-arch i386"
$ pip install --upgrade psycopg2

That works perfectly, right?

Wrong.

> File "/environments/example-2.7.5/lib/python2.7/site-packages/psycopg2/__init__.py", line 50, in
> from psycopg2._psycopg import BINARY, NUMBER, STRING, DATETIME, ROWID
> ImportError: dlopen(/environments/example-2.7.5/lib/python2.7/site-packages/psycopg2/_psycopg.so, 2): no suitable image found. Did find:
> /environments/example-2.7.5/lib/python2.7/site-packages/psycopg2/_psycopg.so: mach-o, but wrong architecture

I was dumbfounded for a few seconds, then I realized — Python was trying to run 64-bit (x86_64), but I only have the 32-bit library.
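A quick way to confirm what architecture the interpreter is actually running as (just a sketch, not part of the original troubleshooting):

import platform
import struct

print(platform.machine())        # e.g. 'x86_64' or 'i386'
print(struct.calcsize('P') * 8)  # pointer size in bits: 64 or 32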

The right fix? Rebuild PostgreSQL to support 64-bit.

My PostgreSQL was a prebuilt package, and I don’t have time to fix that, so I need to do a few hacky/janky things.

basically, we’re going to force python to run as i386 (and not 64-bit)

go to our virtualenv…

cd /environments/example-2.7.5/bin

back it up

cp python python-original

strip it…

# note that the last arg is the input, and the 2nd to last is the output
lipo -thin i386 -output python-i386 python

replace it

rm python
mv python-i386 python

now install psycopg2

export ARCHFLAGS="-arch i386"
pip install --upgrade psycopg2

yay, this works!

now get some work done and save some time so you can build a 64bit PostgreSQL

Python Fun: Upgrading to 2.7 on OSX; Installing Mysql-Python on OSX against MAMP (Ruby gem too)

I needed to upgrade from Python 2.6 to 2.7 and ran into a few issues along the way. Learn from my encounters below.

# Upgrading Python
Installing the new version of Python is really quick. Python.org publishes a .dmg installer that does just about everything for you. Let me repeat “Just about everything”.

You’ll need to do 2 things once you install the dmg; however, the installer only mentions the first item below:

1. Run “/Applications/Python 2.x/Update Shell Profile.command”. This will update your .bash_profile to look in “/Library/Frameworks/Python.framework/Versions/2.x/bin” first.

2. After you run the app above, in a NEW terminal window do the following:

* Check that you’re running the new python with `python --version` or `which python`
* Once you’re running the new python, re-install anything that installed an executable in bin. THIS INCLUDES SETUPTOOLS, PIP, and VIRTUALENV.

It’s that second thing that caught me. I make use of virtualenv a lot, and while I was building new virtualenvs for some projects I realized that my installs were building against `virtualenv` and `setuptools` from the stock Apple install in “/Library/Python/2.6/site-packages” , and not the new Python.org install in “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages”.

It’s worth noting that if you install setuptools once you’ve upgraded to the Python.org distribution, it just installs into the “/Library/Frameworks/Python.framework” directory — leaving the stock Apple version untouched (basically, you can roll back at any time).

# Installing Mysql-Python (or the ruby gem) against MAMP

I try to stay away from Mysql as much as I can [ I <3 PostgreSQL ], but occasionally I need to run it: when I took over at TheDailyBeast.com, they were midway through a relaunch on Rails, and I have a few consulting clients who are on Django.

I tried to run cinderella a while back ( http://www.atmos.org/cinderella/ ) but ran into too many issues. Instead of going with MacPorts or Homebrew, I’ve opted to just use MAMP ( http://www.mamp.info/en/index.html ).

There’s a bit of a problem though – the people responsible for the MAMP distribution decided to clear out all the mysql header files, which you need in order to build the Python and Ruby modules.

You have 2 basic options:

1. Download the “MAMP_components” zip (155MB), and extract the mysql source files. I often used to do this, but the Python module needed a compiled lib and I was lazy so...
2. Download the tar.gz version of Mysql compiled for OSX from http://dev.mysql.com/downloads/mysql/

Whichever option you choose, the next steps are generally the same:

## Copy The Files

### Where to copy the files?

mysql_config is your friend. At least the MAMP one is. Make sure you can call the right mysql_config, and it’ll tell you where the files you copy should be stashed. Since we’re building against MAMP, we need to make sure we’re referencing MAMP’s mysql_config:
iPod:~jvanasco$ which mysql_config
/Applications/MAMP/Library/bin/mysql_config

iPod:~jvanasco$ mysql_config
Usage: /Applications/MAMP/Library/bin/mysql_config [OPTIONS]
Options:
--cflags [-I/Applications/MAMP/Library/include/mysql -fno-omit-frame-pointer -D_P1003_1B_VISIBLE -DSIGNAL_WITH_VIO_CLOSE -DSIGNALS_DONT_BREAK_READ -DIGNORE_SIGHUP_SIGQUIT -DDONT_DECLARE_CXA_PURE_VIRTUAL]
--include [-I/Applications/MAMP/Library/include/mysql]
--libs [-L/Applications/MAMP/Library/lib/mysql -lmysqlclient -lz -lm]
--libs_r [-L/Applications/MAMP/Library/lib/mysql -lmysqlclient_r -lz -lm]
--plugindir [/Applications/MAMP/Library/lib/mysql/plugin]
--socket [/Applications/MAMP/tmp/mysql/mysql.sock]
--port [3306]
--version [5.1.44]
--libmysqld-libs [-L/Applications/MAMP/Library/lib/mysql -lmysqld -ldl -lz -lm]

### Include

Into /Applications/MAMP/Library/include you need to place the mysql include files, in a subdirectory called “mysql”:


mkdir -p /Applications/MAMP/Library/include
cp -Rp MySQL-Distribution/include /Applications/MAMP/Library/include/mysql

### Lib

Into /Applications/MAMP/Library/lib you need to place the mysql lib files, in a subdirectory called “mysql”:


mkdir -p /Applications/MAMP/Library/lib
cp -Rp MySQL-Distribution/lib /Applications/MAMP/Library/lib/mysql

## Configure the Env / Installers

Note: if you’re installing for a virtualenv, this needs to be done after it’s been activated.

Set the archflags on the commandline:


export ARCHFLAGS="-arch $(uname -m)"

### Python Module

I found that the only way to install the module is by downloading the source (off SourceForge!).

I edited site.cfg to have this line:


mysql_config = /Applications/MAMP/Library/bin/mysql_config

Basically, you just need to tell the installer to use the MAMP version of mysql_config, and it will figure everything out itself.

The next steps are simply:


python setup.py build
python setup.py install

If you get any errors, pay close attention to the first few lines.

If you see something like the following within the first 10-30 lines, it means the various files we placed in the step above are not where the installer wants them to be:


_mysql.c:36:23: error: my_config.h: No such file or directory
_mysql.c:38:19: error: mysql.h: No such file or directory
_mysql.c:39:26: error: mysqld_error.h: No such file or directory
_mysql.c:40:20: error: errmsg.h: No such file or directory

If you look up a few lines, you might see something like this:


building '_mysql' extension
gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -O3 -arch i386 -Dversion_info=(1,2,3,'final',0) -D__version__=1.2.3 -I/Applications/MAMP/Library/include/mysql -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c _mysql.c -o build/temp.macosx-10.5-i386-2.7/_mysql.o -fno-omit-frame-pointer -D_P1003_1B_VISIBLE -DSIGNAL_WITH_VIO_CLOSE -DSIGNALS_DONT_BREAK_READ -DIGNORE_SIGHUP_SIGQUIT -DDONT_DECLARE_CXA_PURE_VIRTUAL

Note how we see “/Applications/MAMP/Library/include/mysql” in there. When I quickly followed some instructions online that had all the files in /include — and not in that subdir — this error popped up. Once I changed the directory structure to match what my mysql_config wanted, the package installed perfectly.

### Ruby Gem

Assuming you’re using bundler:


bundle config build.mysql \
  --with-mysql-include=/Applications/MAMP/Library/include/mysql/ \
  --with-mysql-lib=/Applications/MAMP/Library/lib \
  --with-mysql-config=/Applications/MAMP/Library/bin/mysql_config

and then, before you do a `bundle install`, set the env vars:


> export ARCHFLAGS="-arch x86_64"

or


> export ARCHFLAGS="-arch $(uname -m)"

## Test it

If things install nicely, let’s make sure it works…


ipod:~ jvanasco$ python
>>> import _mysql

Oh, crap:


Traceback (most recent call last):
File "", line 1, in
File "build/bdist.macosx-10.5-i386/egg/_mysql.py", line 7, in
File "build/bdist.macosx-10.5-i386/egg/_mysql.py", line 6, in __bootstrap__
ImportError: dlopen(/Users/jvanasco/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.5-i386.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
Referenced from: /Users/jvanasco/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.5-i386.egg-tmp/_mysql.so
Reason: image not found

Basically what’s happening is that as you run it, MySQL-python drops a shared object in your userspace. That shared object references the location where the MySQL.com distribution placed its library files — which differs from where we placed them in MAMP.

There’s a quick fix — add this to your bash profile, or run it before starting mysql/your app:

export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:/Applications/MAMP/Library/lib/mysql"

# Conclusion

There are too many posts on this subject matter to thank. A lot of people posted variations of this method – to which I’m very grateful – however no one addressed troubleshooting the Python process, which is why I posted this.

I also can’t stress the simple fact that if the MAMP distribution contained the header and built library files, none of this would be necessary.