Len() calls can be SLOW in Berkeley Database and Python bsddb.

In my day-to-day coding work, I make extensive use of Berkeley DB (bdb) hash and btree tables. They’re really fast, easy-ish to use, and work for the apps I need them for (persistent storage of json and other small data structures).

So, this python code was having all kinds of weird slowdowns for me, and it was the len() call (of all things) that was causing the issue!

As it turns out, the Berkeley database sometimes does have to iterate over all keys to give a proper answer. Even the “fast stats” *number of records* call has to choose between returning a possibly stale cached count and walking the whole database to get an exact one.
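You can watch this happen by timing len() against an explicit walk of the keys. A minimal sketch using the Python 2-era bsddb module (the path and record count here are made up):

import time
import bsddb

# build a throwaway hash table with a couple hundred thousand records
db = bsddb.hashopen('/tmp/len_test.db', 'n')
for ii in xrange(200000):
    db[str(ii)] = 'x' * 32
db.sync()

start = time.time()
nrecords = len(db)  # looks O(1), but may walk every page
print 'len(db) == %d took %.3fs' % (nrecords, time.time() - start)

start = time.time()
nkeys = len(db.keys())  # the full traversal len() may be doing anyway
print 'walking keys() took %.3fs' % (time.time() - start)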

References:
Jesus Cea’s comments on why bdbs don’t know how many keys they have
db_stat tool description
DB->stat API


Dumping and loading a bsddb, for humans.

Sometimes things happen with Python shelves that screw up the bsddbs (Berkeley DB [bdb] databases*) that power them. A common way for this to happen is when two apps have the same database open for writing, and something goes flooey, like both trying to write to the same page. The bsddb emits this helpful error:

DBRunRecoveryError: [Terror, death and destruction will ensue] or something equally opaque and non-reassuring

So how to run the recovery, eh? Assuming you have the db_dump and db_load tools on your platform, take hints from the Library and Extension FAQ and try this bash snippet:

#!/usr/bin/env bash

## example usage:
## $ bdb_repair  /path/to/my.db
function bdb_repair {
  BDIR=$(dirname "$1")    #  /path/to/dir
  BADDB=$(basename "$1")  #  bad.db
  cd "$BDIR" || return 1
  cp "$BADDB" "$BADDB.bak"            # seriously!  back it up first
  db_dump -f "$BADDB.dump" "$BADDB"   # might take a while
  db_load -f "$BADDB.dump" "$BADDB.repaired"
  mv "$BADDB.repaired" "$BADDB"
  cd - > /dev/null
}

So far, I’ve had universal success with this method. It works because db_dump walks the key/value pairs themselves and db_load rebuilds a fresh database around them, leaving whatever page metadata got mangled behind.

If any bash gurus want to improve the error handling here, I’d appreciate it.

FOOTNOTES
* Yes, I know this is redundant.


The 100 Doors Puzzle in R at Rosetta Code

From time to time I see a puzzle at Rosetta Code that interests me, and I post an R solution for it.  This time it was the 100 doors puzzle.

Problem: You have 100 doors in a row that are all initially closed. You make 100 passes by the doors. The first time through, you visit every door and toggle the door (if the door is closed, you open it; if it is open, you close it). The second time you only visit every 2nd door (door #2, #4, #6, …). The third time, every 3rd door (door #3, #6, #9, …), etc, until you only visit the 100th door.

Question: What state are the doors in after the last pass? Which are open, which are closed?

The code for this in R is pretty simple:

# UNOPTIMIZED
doors_puzzle <- function(ndoors=100, passes=100) {
    doors <- rep(FALSE, ndoors)
    for (ii in seq(1, passes)) {
        mask <- seq(0, ndoors, ii)
        doors[mask] <- !doors[mask]
    }
    return(which(doors == TRUE))
}
doors_puzzle()

## optimized version... we only have to go up to the square root of 100
seq(1, sqrt(100))**2

(Why square roots? Door n is toggled once per divisor of n, so it ends up open exactly when n has an odd number of divisors, and only the perfect squares do.)


Monty Hall in R

Inspired by paddy3118, I decided to write up a Monty Hall simulation in R for Rosetta Code.   Enjoy!

… The rules of the game show are as follows: After you have chosen a door, the door remains closed for the time being. The game show host, Monty Hall, who knows what is behind the doors, now has to open one of the two remaining doors, and the door he opens must have a goat behind it…
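The simulation logic itself is tiny. Here is the same idea sketched in Python, just as an illustration of the rules above (the actual Rosetta Code entry is in R):

import random

def monty_hall_trial(switch):
    doors = range(3)
    car = random.choice(doors)
    choice = random.choice(doors)
    # Monty opens a door that is neither the player's pick nor the car
    opened = random.choice([d for d in doors if d != choice and d != car])
    if switch:
        # switch to the one remaining closed door
        choice = [d for d in doors if d != choice and d != opened][0]
    return choice == car

trials = 100000
for switch in (False, True):
    wins = sum(monty_hall_trial(switch) for _ in xrange(trials))
    print 'switch=%s: win rate %.3f' % (switch, wins / float(trials))

Staying wins about a third of the time; switching wins about two thirds.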

R-project code follows


Quick and (less?)-Dirty JSON Speed Testing in Python

Back in a previous article, I made some bold claims.   After a good vetting on Reddit, the incomparable effbot pointed me toward timeit (cf: notes).

Quick, dirty, and quite possibly deeply flawed.

The profiler’s designed for profiling, not benchmarking, and Python code running under the profiler runs a lot slower than usual — but C code isn’t affected at all.

To get proper results, use the timeit module instead.

So, here is a revised analysis.  It still looks like cjson strongly outperforms the others.*  Most interestingly, I tried oblivion95’s suggestion to read in json using eval, and it comes out slower than cjson, which seems implausible to me.  I look forward to corrections.

Results (five timeit repeats per test, in seconds, sorted ascending)

dumping to JSON

cjson dump nested_dict
0.096393 0.096989 0.097203 0.097859 0.098357
demjson dump nested_dict
4.589573 4.601798 4.609123 4.621567 4.625506
simplejson dump nested_dict
0.595901 0.596267 0.596555 0.597104 0.597633

cjson dump ustring
0.024242 0.024264 0.024453 0.024475 0.024548
demjson dump ustring
2.350742 2.363112 2.364416 2.365360 2.374244
simplejson dump ustring
0.039637 0.039668 0.039820 0.039890 0.039976

loading from JSON

cjson load nested_dict_json
0.042304 0.042332 0.042936 0.043246 0.043858
demjson load nested_dict_json
8.317319 8.332928 8.334701 8.367242 8.371535
simplejson load nested_dict_json
1.858826 1.862957 1.864221 1.864268 1.868705
eval load nested_dict_json
0.484512 0.485497 0.487538 0.487866 0.488751

cjson load ustring_json
0.045566 0.045803 0.045846 0.046027 0.046056
demjson load ustring_json
3.391110 3.401287 3.403575 3.408148 3.416667
simplejson load ustring_json
0.243784 0.244193 0.244920 0.245126 0.246061
eval load ustring_json
0.121635 0.121801 0.122561 0.123064 0.123563
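Each line of numbers above is five repeated timeit runs, sorted. The harness is shaped roughly like this (a sketch only: the payloads, repeat counts, and labels are stand-ins, and the real code is linked below):

import timeit

# stand-in payload; the real benchmark uses a bigger nested dict and
# a unicode string, defined in the linked code
setup = """
import cjson, simplejson
nested_dict = {'a': [1, 2, 3], 'b': {'c': 'nested'}}
"""

tests = [('cjson dump nested_dict', 'cjson.encode(nested_dict)'),
         ('simplejson dump nested_dict', 'simplejson.dumps(nested_dict)')]

for label, stmt in tests:
    timer = timeit.Timer(stmt, setup)
    print label
    print ' '.join('%f' % t for t in sorted(timer.repeat(repeat=5, number=1000)))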

Code and footnotes follow


Using Python “timeit” to time functions

Python’s timeit module is not intuitive for me.

Steve D’Aprano’s thread on the Python mailing list, Timing a function object…, and especially Carl Banks’s response of


def foo(x):
    return x + 1

import timeit
t = timeit.Timer("foo(1)", "from __main__ import foo")
print t.timeit()  # seconds for one million calls of foo(1)

was a godsend!
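A follow-up that comes in handy: the Timer object’s repeat() method runs the whole measurement several times, and taking the minimum (or eyeballing the sorted list, as in the JSON timings above) dampens interference from other processes:

print min(t.repeat(repeat=5, number=1000000))  # best of five runs, in seconds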

Reading this hint contradicted my blithe statement to a friend this morning that “the only time __main__ appears in user code is to determine when a script is being run from the command line”.  Such ignorance and folly!


Quick and Dirty JSON Speed Testing in Python

[See updated post for analysis using timeit]

As per Poromenos’ request on Reddit, I decided to do a bit of expansion on my cryptic comment about the major json packages in python (simplejson, cjson, demjson):

My conclusion: use demjson if you really really want to make sure everything is right, and you don’t care at all about time. Use simplejson if you’re in the 99% of all users who want reasonable performance over a broad range of objects, and use enhanced cjson 1.0.3x if you’re in the camp with reasonable json inputs and you need much faster (10x) speed… that is, if the json step is the bottleneck.

More worrisome — demjson didn’t handle the unicode string I threw at it properly…
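For reference, the three packages expose slightly different entry points. A sketch of round-tripping the same object through each (the obj here is a stand-in):

import cjson
import demjson
import simplejson

obj = {'a': [1, 2, 3], 'b': u'\u00fcnicode'}

text = cjson.encode(obj)      # fast C extension
back = cjson.decode(text)

text = simplejson.dumps(obj)  # the familiar dumps/loads API
back = simplejson.loads(text)

text = demjson.encode(obj)    # strict, validating, pure Python
back = demjson.decode(text)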

benchmark code and more in-depth discussion


Learning Programming: Dissecting a Turd

As mentioned in my introduction to this series, learning to program is a series of lurching steps forward, sideways, and back, leading (if all goes well) to a better holistic understanding of the process.

In many rationalist disciplines, there are learning exercises that require grokking the whole and the parts in an iterative process, until a fullness is achieved. As learning occurs, one filters out more of the irrelevant detail. Think of learning to drive, or to speak a new language, or music theory. Without knowing “the point”, the big picture, the details are just noise, and tedia. Most programming (cf: math, stats) books and courses stay focused on the details far too long. What’s missing are activities that combine the micro and macro views fluidly. Code analysis fits the bill nicely.

One of the unsung marvels of the open-source movement is that it exposes mountains of good and bad code for all to see, of every imaginable pattern and anti-pattern.  Some of the code even works, despite (or because of) its ugliness.

Learning to read others’ code is an under-explored pedagogical tool. It’s all well and good to read the sort of well-commented, rational, toy examples that one finds in books, but it’s another thing entirely to learn to read other people’s eccentric, human-created code, warts and all.

Python has some nice advantages for this sort of exercise*. It has an interactive interpreter, and even the worst python code is still pretty easy to pick apart, for most reasonable-length examples, using common modules. Give me Twisted, and I can give you write-only code, but that’s my failing, not the tool’s.

So, as my service / penance for all the bad code I’ve unleashed, I’ve decided to comment and describe the weird mishmash of ideas and patterns existing in a first generation, but working, application I wrote.

Read on for the code, the analysis, and more embarrassing warts…


Learning Programming: Introduction

More people should know how to program*.  More resources are available than ever to help one learn.  PCs are ubiquitous, and the open source movement and the internet make powerful languages of every flavour (C, Python, Haskell, R…) and difficulty easily available, along with extensive documentation.   It should be a cakewalk, right? So why aren’t there more programmers?

Read on for the thrilling conclusion!


Embedding a Python Shell in a Python Script

I am a huge advocate of command-line programs and domain specific languages. Back when I worked in statistical genetics, there were many fine programs that ran this way, allowing for easily repeatable analysis*, and easy scripting. Python has some great tools for creating programs driven by mini-languages (such as the cmd module), but they seem to date from a kinder, gentler time when people took what they could get for documentation, and they liked it. Finding simple examples of how they work is tough**. I’m sure in some future post, I’ll tackle this in more depth, but for now I want to focus on a simpler problem: embedding python shells into python scripts.

Should be trivial, right? IDLE exists, and python comes with a bundled interpreter. Searching for help here fails because “embedding a python shell” calls up documentation on how to get a shell in C environments, which is not what I mean at all. How often has it happened that there is some complex script that one wants to introspect, partway through, maybe after some data structures are created or loaded? A typical solution is something like:

if flag:
    import pdb
    pdb.set_trace()

This is fine and dandy, except that it’s not always “debugging” that one wants to do. Sometimes one wants to explore data, or gods forbid, enter new code as an experiment.

I read Rick Muller’s recipe about how to embed a code.InteractiveConsole into a python script, and I thought I could do a little better. The following snippet shows how a --shell command line flag (read using optparse) drops the user into a shell, using IPython if it’s available, or falling back on code.***
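The shape of the idea, sketched from memory rather than lifted from the post (the IPython call shown is the modern IPython.embed(); the era’s API was the clunkier IPython.Shell.IPShellEmbed, and the data dict here is a stand-in for whatever your script builds):

import code
from optparse import OptionParser

def main():
    parser = OptionParser()
    parser.add_option('--shell', action='store_true', dest='shell',
                      default=False, help='drop into an interactive shell')
    opts, args = parser.parse_args()

    data = {'answer': 42}  # imagine expensive setup or loading here

    if opts.shell:
        try:
            import IPython  # much nicer shell, if it's installed
            IPython.embed()
        except ImportError:
            # fall back on the stdlib InteractiveConsole
            code.interact(banner='plain python shell (no IPython)',
                          local={'data': data})

if __name__ == '__main__':
    main()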

Read on for more, and the code…

