October 5, 2006
What Perl Isn't a Great Language For
Bob sent me this link: "Why Perl Is a Great Language for Concurrent Programming"
http://t-a-w.blogspot.com/2006/10/why-perl-is-great-language-for.html
As a big Python guy and new Erlang addict, he had some ideas of his own on that article...
As a big Perl supporter and longtime Python fan, I have ideas of my own...
Where to start?
First off, before I talk about any concurrency, I want to address this line:
"Inserting multiple values with a single INSERT statement is MySQL extension of SQL (a pretty useful one): INSERT INTO `externallinks` VALUES (a, b,c), (d, e, f), (g, h, i); and Sqlite doesn't support it. Now come on, MySQL is extremely popular, so being compatible with some of its extensions would really increase Sqlite's value as duct tape"
Forgive me, but the value of SQLite is that it supports SQL. Not 'MySQL'. Not TawsSQL, but SQL -- as in the ANSI SQL92 STANDARD.
Chained inserts like that are something that MySQL just made up one day, and that people who learn MySQL come to accept as fact. I know, because I was one of them. Luckily, some friends turned me on to real databases like Postgres -- and when the occasional work with Oracle came up, I wasn't a deer in the headlights -- I could actually write queries that worked.
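For what it's worth, loading a batch of rows doesn't need the chained VALUES syntax at all. Here is a minimal sketch using Python's built-in sqlite3 module: one parameterized single-row INSERT, run once per row inside a single transaction. Only the table name comes from the quoted article; the three column names and the sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical three-column schema; the quoted article only names the table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE externallinks (el_from INTEGER, el_to TEXT, el_index TEXT)"
)

rows = [
    (1, "http://example.com/a", "a"),
    (2, "http://example.com/b", "b"),
    (3, "http://example.com/c", "c"),
]

# One parameterized single-row INSERT, executed once per row.
# The "with" block commits on success and rolls back on an exception.
with conn:
    conn.executemany("INSERT INTO externallinks VALUES (?, ?, ?)", rows)

row_count = conn.execute("SELECT COUNT(*) FROM externallinks").fetchone()[0]
```

The same placeholder-plus-list pattern is offered by most database drivers, so the loading code doesn't have to care which server is on the other end.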
I've been eternally sorry for learning SQL with MySQL as a starting point. The language and documentation promote the worst design principles imaginable, and proprietary column types like SET and ENUM don't just foster schemas that can't be normalized or migrated to real RDBMS platforms -- they make sure that your database code works on MySQL and nothing else.
Sure, you can make database schemas in MySQL that will actually work with Postgres and Oracle... but the manuals and 3rd party books all promote MySQLisms and straying from the norm -- and anyone who has had the nightmare of porting a DB from MySQL to PG or Oracle starts running into all those oddities where LEFT JOIN and DISTINCT just aren't treated the same way.
MySQL was built for speed and convenience -- it cut corners, and it did so poorly. The referential integrity was so poor that I constantly had to write application code to check keys -- which ultimately made it slower than Postgres by huge margins. I also constantly had to write application code to recheck insert/update values for content type and length. Until just recently, when a special strict 'traditional' mode was released that tries to make MySQL act like a real server, the db would just accept anything you tossed at it -- converting and truncating characters as needed, never raising an error. That's not an enterprise tool, that's an enterprise nightmare.
People boast of large corporations using MySQL for online commerce -- I keep tabs on those pages and put them in my bookmarks folder titled 'Do Not Shop Here'.
Getting back to the concurrent programming issue...
FindMeOn.com is in private beta now, about to go public...
For those that aren't in the know, FindMeOn.com is a massive identity project that I'm launching. It is mostly written in mod_perl2 and generally runs on the framework that was built for RoadSound.com.
Since Apache2/mod_perl2 is the primary platform for FindMeOn, everything that could be done in Perl was done in Perl. As needed, I refactor bottlenecks in whatever is appropriate -- C, PHP (no joke, sometimes it actually is the right tool), even the web server: nginx was able to cut down on a bunch of code by handling stuff internally. Anyways, at last count, FindMeOn was about 24,000 lines of Perl code on the business logic alone (that's not including templates). Suffice to say, it's massive.
FindMeOn does A LOT of link checking, URL downloading, and web spidering. It would have been neat to keep it in Perl... but early attempts had bad results. Perl just didn't have the concurrent or asynchronous support one needs, and is a giant memory hog. Simple libraries like File::Find end up taking over 2MB of resident memory -- you should see what complex libraries compile to. And threaded Perl is just downright scary -- especially on FreeBSD.
So I started looking at other options.
Erlang would have been, by far, the best choice. It was explicitly designed for concurrent programming. But Erlang has a giant problem: it's not 'user friendly' -- yet.
There aren't the conveniences of large, well supported and well documented libraries, and there's a giant learning curve. It does have good string support and http support, but the support is not 'user friendly'.
Learning Erlang today is like trying to scale a vertical wall with no equipment the first time you go rock climbing. Unless you really really want to be an Erlang programmer, you're just not going to pick it up. Hopefully in a few years that will change -- as I'd really like to learn it -- but I just don't have the time to put into it that some colleagues have. (Note: everyone I know who was able to spend the time needed to learn it has said it was definitely worth the investment.)
So I turned to Python -- more specifically Twisted Python -- and I had a winner.
Twisted supports asynchronous programming -- i.e. non-blocking code, and also deferring blocking code to threads -- which is exactly what I needed.
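To make "deferring blocking code to threads" concrete, here is a rough sketch of the idea. It uses Python's standard asyncio module rather than Twisted itself (Twisted's own helper for this is `deferToThread`), so treat it as an analogue and not the FindMeOn code. The `blocking_check` function and the example.com URLs are placeholders.

```python
import asyncio
import time


def blocking_check(url):
    # Placeholder for slow, blocking work; it only sleeps, no network access.
    time.sleep(0.05)
    return url, "ok"


async def check_all(urls):
    loop = asyncio.get_running_loop()
    # Hand each blocking call to the default thread pool. The event loop
    # stays free to do other work while the threads are busy.
    futures = [loop.run_in_executor(None, blocking_check, u) for u in urls]
    # gather() returns the results in the same order as the inputs.
    return await asyncio.gather(*futures)


results = asyncio.run(
    check_all(["http://example.com/1", "http://example.com/2"])
)
```

The event loop never blocks on a single slow URL; each slow call sits in its own worker thread and reports back when it finishes.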
Instead of going into the design specification that I needed to conform to, I'm just going to give an overview of what is going on. If you're reasonably computer literate, you'll understand why things happen as they do.
FindMeOn runs a Twisted Daemon that uses a single DB handle to query the master server every 10 seconds for links to validate. All links from the DB are 'locked' by the Twisted process with internal status codes.
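The "locking" step could look something like the sketch below. It is not the real FindMeOn schema: the `links` table, its columns, and the `PENDING`/`LOCKED` status codes are all invented for illustration, and sqlite3 stands in for the master server. A scheduler would call `claim_links` on a timer (every 10 seconds in the setup described above).

```python
import sqlite3

# Hypothetical status codes; the real internal codes are not shown in the post.
PENDING, LOCKED = 0, 1


def claim_links(conn, limit=5):
    """Mark up to `limit` pending links as locked and return them."""
    rows = conn.execute(
        "SELECT id, url FROM links WHERE status = ? ORDER BY id LIMIT ?",
        (PENDING, limit),
    ).fetchall()
    # Commit all of the status updates together (rolled back on error).
    with conn:
        conn.executemany(
            "UPDATE links SET status = ? WHERE id = ?",
            [(LOCKED, link_id) for link_id, _url in rows],
        )
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO links (url, status) VALUES (?, ?)",
    [("http://example.com/%d" % i, PENDING) for i in range(8)],
)
conn.commit()

first_batch = claim_links(conn)   # claims the first five links
second_batch = claim_links(conn)  # only the remaining three are still pending
```

Because a single handle does all the claiming, a link that has been marked as locked is never handed out twice.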
The Daemon uses Semaphores in Twisted to manage a pool of database handles and 'potential' threads. I like to keep things down to 5 dbh/threads -- it's a good mix of performance and resource consumption.
As semaphores become available, Twisted acquires a lock on them, pulls a DB handle, and runs some blocking code within a separate thread.
The blocking code tries to validate the URL. When you try to validate the URL anything could happen: you could instantly have success (personally, I like that) -- but you could also have a server redirect, network latency issues, something specific to FindMeOn that results in an additional URL lookup, or the bit of data you're validating is just not there. In other words, validating a URL could take 0.01 seconds or 20, and offers no guarantee on the outcome.
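A blocking check along those lines might be sketched like this, using only the standard library. The function name, the outcome labels, and the 20 second default timeout are illustrative choices, not the actual FindMeOn validator, and the FindMeOn-specific extra lookup is left out.

```python
import socket
import time
import urllib.error
import urllib.request


def validate_url(url, timeout=20.0):
    """Return (outcome, final_url, seconds) for one blocking URL check."""
    started = time.monotonic()
    final_url = None
    try:
        # urlopen follows redirects itself; geturl() reports where we ended up.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            final_url = response.geturl()
            outcome = "redirected" if final_url != url else "ok"
    except urllib.error.HTTPError as err:
        # The server answered with an error status, e.g. the page is not there.
        outcome = "http_%d" % err.code
    except socket.timeout:
        # The server accepted the connection but was too slow to respond.
        outcome = "timeout"
    except urllib.error.URLError as err:
        # DNS failure, refused connection, or a timeout while connecting.
        timed_out = isinstance(err.reason, socket.timeout)
        outcome = "timeout" if timed_out else "unreachable"
    except ValueError:
        # Not something urllib recognizes as a URL at all.
        outcome = "invalid"
    return outcome, final_url, time.monotonic() - started
```

Calling `validate_url("http://example.com/")` blocks until one of those outcomes is known, which is exactly why it belongs in a worker thread and not in the main loop.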
So the blocking code does its thing while we're in a safe little thread, then exits to a deferred callback that updates the DB based on the outcome, returns the DBH to the pool, and lets go of the semaphore.
This all happens in a Twisted Scheduler, so it never stops. The record so far is running for 9 days straight (then I had to update some code). It's lean, it's mean, and it has way better resource management than Perl.
It's also a dirty hack -- the next version of Twisted and Python is set to include concurrent actions. I'm a bit fuzzy on things from reading the specs, but everything I've seen so far has been pretty amazing. Instead of giving each intended action a semaphore that tries to get acquired, Concurrent is better at doling out slots to a list without creating as many instances in memory. Or something like that.
Anyways, Perl is great -- it's just not a good tool for concurrent programming. The support in other languages is far better, and so is the memory handling.
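Here is the overall shape of that flow as a sketch in plain standard-library Python: a semaphore caps the work at 5 slots, each slot borrows a handle from a pool, the blocking step runs in a worker thread, and a completion callback records the outcome and gives everything back. This is an analogue built on `threading` and `concurrent.futures`, not the Twisted code itself; the string "handles", the `validate` stub, and the URLs are stand-ins.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 5  # the "5 dbh/threads" mentioned above

slots = threading.BoundedSemaphore(POOL_SIZE)
handles = queue.Queue()
for n in range(POOL_SIZE):
    handles.put("dbh-%d" % n)  # stand-ins for real database handles

results = []
results_lock = threading.Lock()


def validate(url):
    # Placeholder for the blocking validation step.
    return "ok"


def worker(url):
    return url, validate(url)


def on_done(future, dbh):
    # Completion callback: record the outcome, then always return the
    # handle to the pool and release the slot, even if the worker failed.
    try:
        url, outcome = future.result()
        with results_lock:
            results.append((url, outcome))  # a real version would update the DB via dbh
    finally:
        handles.put(dbh)
        slots.release()


urls = ["http://example.com/%d" % i for i in range(12)]
with ThreadPoolExecutor(max_workers=POOL_SIZE) as executor:
    for url in urls:
        slots.acquire()      # wait until one of the 5 slots is free
        dbh = handles.get()  # borrow a handle for this job
        future = executor.submit(worker, url)
        future.add_done_callback(lambda f, dbh=dbh: on_done(f, dbh))
```

The handle goes back into the pool before the slot is released, so whoever acquires a slot next always finds a handle waiting.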
Posted by Jonathan at October 5, 2006 3:46 PM